Running Bulk Transcription and Load Tests at Scale
This guide applies to two closely related workloads:
- Bulk transcription. Submitting thousands of files in a batch — for example, a nightly backfill or a one-time migration.
- Load testing. Measuring turnaround time (TaT) and throughput before a production cutover.
The guidance is shared. If you’re running a load test, also read the load test subsection of Measure and verify.
Key recommendations
- Ramp. Submit in 15-second windows. Start at 25 requests/window and grow ~8–9% per window until you reach your target sustained rate.
- Measure. Use webhooks instead of polling. Record `submit_ts`, `complete_ts`, `audio_duration`, `model`, `features`, and `status` per request.
- Coordinate. Recommended for runs above 200 requests/minute, required for large bulk uploads (tens of thousands of files), and for any EU workload.
Default concurrency for paid accounts is 200 concurrent jobs. If you need a higher limit, reach out to support@assemblyai.com — AssemblyAI offers custom concurrency limits at no additional cost.
Before you begin
Prerequisites
- Account balance. If your balance hits zero mid-run, AssemblyAI drops your concurrency limit to 1, which stalls the run and invalidates any load-test results. Bulk runs and load tests both incur standard transcription usage charges, so fund your account for the full expected volume before you submit any requests.
- Concurrency limit. Check your current limit on the Rate Limits page of your dashboard and size your target submission rate against it — see Size your target rate.
- Audio source. Prefer pre-signed URLs (for example, from S3 or GCS) as your `audio_url`. Each `/v2/upload` call counts against your HTTP rate limit and adds latency proportional to file size; at bulk scale, uploads alone can exhaust your rate-limit budget. If local files are your only option, upload them to `/v2/upload` first to get a hosted URL, and factor those uploads into your rate-limit planning.
- Pre-signed URL expiration. Set URL TTLs long enough to outlast your expected queue time plus turnaround time. One hour is a safe default for small-to-moderate runs; use two hours or more for runs that approach your concurrency limit or use less-common languages. URLs that expire while a job is queued or processing surface as `4xx` errors when AssemblyAI tries to retrieve the audio.
- Completion tracking. Configure webhooks (with a polling fallback) or a polling-only strategy before you start; you’ll need a way to detect completion and record per-request timestamps.
- Static-IP egress. If your audio URLs are behind a strict S3 bucket policy, contact support to enable static-IP egress for retrieval. Webhook deliveries already come from fixed IPs documented on the whitelisting FAQ.
- Pricing reference. See the pricing page for per-hour rates to plug into your cost estimate.
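The TTL guidance above can be sized numerically. A minimal sketch, where the 3× safety factor and one-hour floor are illustrative assumptions rather than API requirements:

```python
import math

def url_ttl_seconds(expected_queue_s, expected_tat_s, safety=3.0, floor_s=3600):
    """Pick a pre-signed URL TTL that outlasts queue time plus turnaround
    time with margin. safety and floor_s are illustrative defaults, not
    values prescribed by the API."""
    return max(floor_s, math.ceil(safety * (expected_queue_s + expected_tat_s)))
```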
Workload configuration
Match these to your expected production traffic:
- Region: US or EU. The EU region handles less traffic than US and is more sensitive to load spikes. Coordinate with our team for any EU workload, regardless of size.
- Traffic pattern. Requests per minute at peak and whether traffic is steady or bursty.
- Models and features. Universal-3 Pro, Universal-2, speaker labels, PII redaction, summarization, and so on — each has different processing characteristics, and every feature you enable adds to TaT. Audit your request body and turn off what you don’t need; a faster run is also a cheaper one.
- Channels and speakers. For call-center audio where agent and customer are on separate stereo channels, add `multichannel=true` to get per-channel utterances. For single-channel recordings with multiple speakers, use `speaker_labels=true` instead. See Should I use Speaker Labels or Multi-channel? for guidance.
- Language. Some languages, such as Hindi, Swedish, and Hebrew, scale differently and may show longer TaT. Coordinate with our team if your workload is primarily in a less common language.
- Audio format. Format conversion and preprocessing run before transcription and contribute to overall TaT.
- Cost estimate. Sanity-check the expected spend before you submit: total audio hours × your per-hour rate. Bulk runs are billed the same as any other transcription, so a large backfill can produce a surprisingly large invoice if you haven’t projected it.
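The cost-estimate bullet is simple arithmetic. A sketch with made-up volumes and a hypothetical per-hour rate (check the pricing page for real rates):

```python
def estimated_cost(num_files, avg_minutes_per_file, per_hour_rate):
    """Total audio hours x per-hour rate."""
    total_hours = num_files * avg_minutes_per_file / 60
    return total_hours * per_hour_rate

# Example: 20,000 files averaging 6 minutes each is 2,000 audio hours;
# at a hypothetical $0.40/hour that is $800.
```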
Pilot first
Before a full bulk run or load test, submit a pilot batch of 50–200 files using the exact configuration you plan to use at scale — same model, same features, same language, same webhook receiver, same error-handling logic. A pilot verifies that:
- The transcripts look right, and the model and feature set match what the downstream consumer expects.
- Your webhook receiver is reachable, verifying signatures, and writing results durably — or your polling loop is keeping up without hitting rate limits.
- Your retry logic handles `5xx` correctly and your dead-letter path captures `4xx` without silently dropping files.
- Your ramp and concurrency controls behave as intended.
The most expensive bulk-run failures almost always come from discovering a configuration mistake — a wrong model, a feature flag left off, a webhook handler that drops results silently — after the whole batch has been billed. A pilot catches these while they’re cheap to fix, and gives you a realistic mean TaT to plug into Size your target rate.
When to run
- Small tests and moderate bulk runs (well within your concurrency limit): US business hours (roughly 14:00–21:00 UTC) produce the most representative baseline latency, since throughput is highest during these periods.
- Large runs (200+ requests/minute): coordinate with our team before starting. Our team will pick a window and pre-scale for you.
- EU region: coordinate regardless of size.
Ramp up gradually
The most common mistake — for both bulk uploads and load tests — is submitting all requests at once. A gradual ramp gives the pipeline time to scale ahead of your traffic, which is what produces the lowest and most consistent turnaround times. Submitting a large spike upfront typically results in higher TaT as capacity catches up.
Recommended schedule (validated for Universal-3 Pro with speaker labels):
- Divide your ramp into 15-second windows.
- Start at 25 requests per window.
- Grow by ~8–9% per window until you reach your target sustained rate.
- Don’t pause mid-ramp. Stopping and restarting means ramping from the starting rate again, and you’ll see higher latency when traffic resumes.
- Do not exceed your account’s concurrency limit during the ramp.
If you’re using different models or features, contact support for a tailored ramp plan — some components take longer to initialize and may need a slower ramp.
To compute a ramp for any target without a lookup table, use rate_n = ceil(25 × 1.085ⁿ) (capped at your target rate), where n is the window index starting at 0. For a fully worked 400 requests/minute schedule, see Example ramp schedule at the bottom of this page.
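That formula can be turned into a small schedule generator; a sketch:

```python
import math

def ramp_schedule(target_rate, start=25, growth=1.085):
    """Requests per 15-second window: rate_n = ceil(start * growth**n),
    capped at target_rate; stops once the target is reached."""
    rates, n = [], 0
    while not rates or rates[-1] < target_rate:
        rates.append(min(target_rate, math.ceil(start * growth ** n)))
        n += 1
    return rates
```

At 15 seconds per window, ramping to 100 requests/window takes 18 windows, about four and a half minutes.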
After reaching your target rate, sustain it for at least 5–10 minutes. Your measurements are representative once p50 and p95 stay consistent over 2–3 consecutive minutes. If p50 is still falling, extend the sustain phase.
Small runs (under 500 files)
If your total run is under 500 files, you don’t need the ramp. Submit in parallel from a bounded thread pool sized to your concurrency limit. The ramp matters for sustained high rates over several minutes.
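A sketch of that parallel-submit pattern using the standard library; `submit_fn` is a placeholder for your actual submission call (for example, a wrapper around the SDK’s `Transcriber.submit`):

```python
from concurrent.futures import ThreadPoolExecutor

def submit_small_batch(audio_urls, submit_fn, concurrency_limit=200):
    """Submit every file in parallel from a pool bounded by the account's
    concurrency limit; returns results in input order."""
    with ThreadPoolExecutor(max_workers=concurrency_limit) as pool:
        return list(pool.map(submit_fn, audio_urls))
```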
Size your target rate
Your concurrency limit caps how many jobs can be in progress at once, so your sustained submission rate needs to fit inside it. Pick a target rate that keeps the typical number of in-flight jobs comfortably below your limit:
- Estimate mean turnaround time from your pilot run or the published benchmarks.
- Multiply your target submission rate (requests per second) by that mean TaT to approximate the number of jobs that will be in flight at steady state.
- Keep that number under ~80% of your concurrency limit. If it’s higher, lower the target rate, shorten mean TaT (fewer features, shorter audio, a faster model), or request a higher concurrency limit.
The 20% headroom absorbs normal variation in audio duration, warm-up effects, and webhook-receive latency. Runs sized exactly to the limit will see TaT climb as the in-flight count bumps against the cap.
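The sizing steps above amount to Little’s law (jobs in flight ≈ arrival rate × mean time in system). A sketch:

```python
import math

def max_safe_rate(concurrency_limit, mean_tat_s, headroom=0.8):
    """Highest sustained submission rate (requests/second) that keeps
    expected in-flight jobs under headroom x concurrency limit."""
    return headroom * concurrency_limit / mean_tat_s

def expected_in_flight(rate_rps, mean_tat_s):
    """Approximate steady-state in-flight jobs: rate x mean TaT."""
    return math.ceil(rate_rps * mean_tat_s)
```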
For planning, Universal-3 Pro English async typically completes a 5-minute file in about 9 seconds at p50 and 60 seconds at p95. Use that as your starting turnaround estimate before the pilot.
Measure and verify
For every run — bulk or load test — track completion and record per-request metadata.
- Use webhooks when possible. You get a clean completion signal without polling overhead. See the Webhooks documentation for setup, retry behavior, and authentication.
- Run a polling fallback alongside webhooks. Webhooks can drop for many reasons — receiver downtime, signature mismatches, transient network failures. For every submitted job, record an expected completion deadline (around 2× mean TaT from your pilot) and `GET /v2/transcript/{id}` for any job whose webhook hasn’t arrived by then. The fallback protects you from silent data loss without materially increasing your rate-limit usage.
- Polling-only every 1–2 seconds measures closer to actual completion but adds to your rate-limit budget. Use it when webhooks aren’t available or when you need precise TaT during a load test.
- Retry failures with exponential backoff for `5xx` responses — see Implement retry server error logic. Investigate `4xx` responses; they indicate a client-side issue that retrying won’t fix.
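One way to drive the polling fallback described above, assuming jobs are tracked in a dict with illustrative field names (the 2× deadline factor follows the guidance above):

```python
import time

def overdue_transcripts(jobs, mean_tat_s, deadline_factor=2.0, now=None):
    """jobs: {transcript_id: {"submit_ts": float, "done": bool}}.
    Return IDs whose webhook hasn't arrived within ~2x mean TaT;
    poll GET /v2/transcript/{id} for each of these."""
    now = time.time() if now is None else now
    deadline = deadline_factor * mean_tat_s
    return [tid for tid, j in jobs.items()
            if not j["done"] and now - j["submit_ts"] > deadline]
```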
Record these fields per request:
- `submit_ts` — timestamp when `POST /v2/transcript` was sent
- `complete_ts` — timestamp when completion was detected
- `audio_duration` — length of the audio file, in seconds
- `model` — speech model used
- `features` — features enabled (e.g. `speaker_labels`, `auto_highlights`, `sentiment_analysis`)
- `status` — `completed` or `error`
- `id` — transcript ID, for debugging with support
Turnaround time = `complete_ts` − `submit_ts`.
Normalize TaT by audio duration to get the real-time factor: RTF = turnaround_time ÷ audio_duration. An RTF of 0.5 means the API processed the file in half its audio duration. RTF is the headline metric for comparing runs across regions, models, and audio-duration buckets — raw TaT varies too much with audio length to mean anything on its own.
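TaT, RTF, and percentiles from the recorded fields, sketched with the standard library:

```python
from statistics import quantiles

def rtf(submit_ts, complete_ts, audio_duration):
    """Real-time factor: turnaround time divided by audio duration."""
    return (complete_ts - submit_ts) / audio_duration

def p50_p95(values):
    """Approximate p50/p95 from a list of TaT or RTF samples."""
    qs = quantiles(values, n=100)  # 99 cut points
    return qs[49], qs[94]
```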
Polling without exceeding the rate limit
HTTP rate limits cap total API requests at 20,000 per 5 minutes across all endpoints — submissions and polling combined. Exceeding this returns a 403 error.
If webhooks aren’t an option, stay within that budget. As an illustration: at a sustained 33 requests/second submission rate with ~330 jobs in flight, submissions alone consume roughly 10,000 of your budget — so poll no more often than every 15 seconds to leave headroom. Scale these numbers to your own submission rate and in-flight job count.
When many jobs share the same polling interval they tend to cluster at the same second boundaries, spiking your rate-limit usage and occasionally returning 403 errors. Stagger each job’s polling by ±25% of the interval (for example, interval × random.uniform(0.75, 1.25)) so GETs spread evenly across the window.
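The budget arithmetic and the jitter suggestion, sketched:

```python
import random

RATE_LIMIT = 20_000  # requests per rolling 5-minute window

def five_minute_usage(submit_rps, in_flight_jobs, poll_interval_s):
    """Approximate requests per 5-minute window: submissions plus polls."""
    return submit_rps * 300 + in_flight_jobs * (300 / poll_interval_s)

def jittered_interval(poll_interval_s):
    """Stagger each job's polling by +/-25% so GETs don't cluster."""
    return poll_interval_s * random.uniform(0.75, 1.25)
```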
If you’re running a bulk job
Monitor for these signals during the run:
- Healthy run: TaT stays within ~20% of your first 5 minutes of sustained submissions, your completion queue drains steadily, and no errors arrive.
- Diagnose, don’t panic: if something goes wrong, match the signal you see to the row in Diagnosing problems and respond accordingly — each signal has a different cause and fix.
If you’re running a load test
- Separate ramp-phase from sustain-phase metrics. Expect higher latency during ramp; use sustain-phase numbers as your benchmark.
- Report percentiles, not averages. Track p50, p75, p90, p95, p99, and max.
- Normalize by audio duration. Group results into duration buckets (0–5 min, 5–15 min, 15–30 min, 30–60 min) for meaningful comparison.
- Pre-scaling caveat. If our team pre-scaled for your test, your results reflect steady-state capacity — not cold-start or scale-up behavior.
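The duration buckets above can be assigned with a small helper (the 60+ bucket is an assumption for files outside the listed ranges):

```python
def duration_bucket(audio_duration_s):
    """Map an audio duration in seconds to the reporting buckets above."""
    for upper_s, label in [(300, "0-5 min"), (900, "5-15 min"),
                           (1800, "15-30 min"), (3600, "30-60 min")]:
        if audio_duration_s <= upper_s:
            return label
    return "60+ min"  # assumption: catch-all for longer files
```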
Diagnosing problems
Reference implementation
The AssemblyAI Python SDK’s non-blocking Transcriber.submit() returns as soon as a transcript is queued, so you can drive the ramp yourself while using the SDK’s TranscriptionConfig and exception classes. If you’d rather have the SDK handle both submission and polling for a smaller batch, see Transcribe multiple files simultaneously.
The following script ramps submissions to approximate the recommended schedule, retries transient errors, writes unrecoverable failures to a dead-letter log, and persists submitted file → transcript_id pairs so the run is resumable after a crash. The recommended schedule is hand-tuned to observed pipeline behavior, so the numbers this script produces may differ by one or two requests per window. Adjust `max_rate` to your target sustained rate.
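A minimal sketch of such a submitter, under assumptions: `submit_fn` stands in for your actual submission call (for example, wrapping the SDK’s `Transcriber.submit`), and exceptions carrying a `status_code` attribute below 500 are treated as unrecoverable:

```python
import json
import math
import time

def run_ramp(files, submit_fn, max_rate, start=25, growth=1.085,
             window_s=15, max_retries=3, state_path="run_state.json",
             dead_letter_path="dead_letter.log", sleep_fn=time.sleep):
    """Ramped bulk submitter (sketch). submit_fn(url) returns a transcript
    ID or raises; 5xx-style errors are retried with exponential backoff,
    everything else is dead-lettered. State is checkpointed per window so
    a crashed run can resume without re-submitting completed files."""
    try:
        with open(state_path) as f:
            state = json.load(f)          # resume: {file -> transcript_id}
    except FileNotFoundError:
        state = {}
    pending = [u for u in files if u not in state]
    window = 0
    while pending:
        # rate_n = ceil(start * growth**n), capped at the target rate
        rate = min(max_rate, math.ceil(start * growth ** window))
        batch, pending = pending[:rate], pending[rate:]
        for url in batch:
            for attempt in range(max_retries + 1):
                try:
                    state[url] = submit_fn(url)
                    break
                except Exception as exc:
                    status = getattr(exc, "status_code", None)
                    retryable = status is None or status >= 500
                    if retryable and attempt < max_retries:
                        sleep_fn(2 ** attempt)           # exponential backoff
                    else:
                        with open(dead_letter_path, "a") as dl:
                            dl.write(f"{url}\t{exc}\n")  # unrecoverable
                        break
        with open(state_path, "w") as f:
            json.dump(state, f)           # checkpoint after each window
        window += 1
        if pending:
            sleep_fn(window_s)
    return state
```

Passing `sleep_fn` makes the pacing injectable, which keeps the sketch testable and lets you swap in a more precise scheduler.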
Example ramp schedule
A fully worked schedule for ramping to 100 requests/window (400 requests/minute). Use it as a reference when building your own ramp, or as a starting point you can scale to a different target rate.
Coordinate with our team
Reach out to support@assemblyai.com or your account manager before you submit any requests if any of these apply:
- You plan to exceed 200 requests per minute.
- You’re running a large one-time upload (tens of thousands of files or more).
- You want support available during the run — for example, if you’re running outside US business hours.
- You’re using the EU region, regardless of size.
When you reach out, include:
- Expected request volume and ramp schedule, broken into 15-second windows
- Audio file durations and language breakdown
- Speech models and features you’ll enable
- Whether audio is single-channel or multichannel
- Preferred run window (see When to run)
AssemblyAI can pre-scale pipeline components for your traffic, raise your concurrency limit, and monitor the run in real time. For recurring bulk workloads (for example, nightly batch jobs), we can set up persistent scaling.