Running Bulk Transcription and Load Tests at Scale
This guide applies to two closely related workloads:
- Bulk transcription. Submitting thousands of files in a batch — for example, a nightly backfill or a one-time migration.
- Load testing. Measuring turnaround time (TaT) and throughput before a production cutover.
The guidance is shared. If you’re running a load test, also read the load test subsection of Measure and verify.
Key recommendations
- Ramp. Submit in 15-second windows. Start at 25 requests/window and grow ~8–9% per window until you reach your target sustained rate.
- Measure. Use webhooks instead of polling. Record `submit_ts`, `complete_ts`, `audio_duration`, `model`, `features`, and `status` per request.
- Coordinate. Recommended for runs above 200 requests/minute, required for large bulk uploads (tens of thousands of files), and for any EU workload.
Default concurrency for paid accounts is 200 concurrent jobs. If you need a higher limit, reach out to support@assemblyai.com — AssemblyAI offers custom concurrency limits at no additional cost.
Before you begin
Prerequisites
- Account balance. If your balance hits zero mid-run, AssemblyAI drops your concurrency limit to 1, which stalls the run and invalidates any load-test results. Bulk runs and load tests both incur standard transcription usage charges, so fund your account for the full expected volume before you submit any requests.
- Concurrency limit. Check your current limit on the Rate Limits page of your dashboard and size your target submission rate against it — see Size your target rate.
- Audio source. Prefer pre-signed URLs (for example, from S3 or GCS) as your `audio_url`. Each `/v2/upload` call counts against your HTTP rate limit and adds latency proportional to file size; at bulk scale, uploads alone can exhaust your rate-limit budget. If local files are your only option, upload them to `/v2/upload` first to get a hosted URL, and factor those uploads into your rate-limit planning.
- Pre-signed URL expiration. Set URL TTLs long enough to outlast your expected queue time plus turnaround time. One hour is a safe default for small-to-moderate runs; use two hours or more for runs that approach your concurrency limit or use less-common languages. URLs that expire while a job is queued or processing surface as `4xx` errors when AssemblyAI tries to retrieve the audio.
- Completion tracking. Configure webhooks (with a polling fallback) or a polling-only strategy before you start; you’ll need a way to detect completion and record per-request timestamps.
- Static-IP egress. If your audio URLs are behind a strict S3 bucket policy, contact support to enable static-IP egress for retrieval. Webhook deliveries already come from fixed IPs documented on the whitelisting FAQ.
- Pricing reference. See the pricing page for per-hour rates to plug into your cost estimate.
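The TTL guidance above can be sized numerically. A minimal sketch, where the 3× safety factor and one-hour floor are illustrative assumptions rather than API requirements:

```python
import math

def url_ttl_seconds(expected_queue_s, expected_tat_s, safety=3.0, floor_s=3600):
    """Pick a pre-signed URL TTL that outlasts queue time plus turnaround
    time with margin. safety and floor_s are illustrative defaults, not
    values prescribed by the API."""
    return max(floor_s, math.ceil(safety * (expected_queue_s + expected_tat_s)))
```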
Workload configuration
Match these to your expected production traffic:
- Region: US or EU. The EU region handles less traffic than US and is more sensitive to load spikes. Coordinate with our team for any EU workload, regardless of size.
- Traffic pattern. Requests per minute at peak and whether traffic is steady or bursty.
- Models and features. Universal-3 Pro, Universal-2, speaker labels, PII redaction, summarization, and so on — each has different processing characteristics, and every feature you enable adds to TaT. Audit your request body and turn off what you don’t need; a faster run is also a cheaper one.
- Channels and speakers. For call-center audio where agent and customer are on separate stereo channels, add `multichannel=true` to get per-channel utterances. For single-channel recordings with multiple speakers, use `speaker_labels=true` instead. See Should I use Speaker Labels or Multi-channel? for guidance.
- Language. Some languages, such as Hindi, Swedish, and Hebrew, scale differently and may show longer TaT. Coordinate with our team if your workload is primarily in a less common language.
- Audio format. Format conversion and preprocessing run before transcription and contribute to overall TaT.
- Cost estimate. Sanity-check the expected spend before you submit: total audio hours × your per-hour rate. Bulk runs are billed the same as any other transcription, so a large backfill can produce a surprisingly large invoice if you haven’t projected it.
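The cost-estimate bullet is simple arithmetic. A sketch with made-up volumes and a hypothetical per-hour rate (check the pricing page for real rates):

```python
def estimated_cost(num_files, avg_minutes_per_file, per_hour_rate):
    """Total audio hours x per-hour rate."""
    total_hours = num_files * avg_minutes_per_file / 60
    return total_hours * per_hour_rate

# Example: 20,000 files averaging 6 minutes each is 2,000 audio hours;
# at a hypothetical $0.40/hour that is $800.
```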
Pilot first
Before a full bulk run or load test, submit a pilot batch of 50–200 files using the exact configuration you plan to use at scale — same model, same features, same language, same webhook receiver, same error-handling logic. A pilot verifies that:
- The transcripts look right, and the model and feature set match what the downstream consumer expects.
- Your webhook receiver is reachable, verifying signatures, and writing results durably — or your polling loop is keeping up without hitting rate limits.
- Your retry logic handles `5xx` correctly and your dead-letter path captures `4xx` without silently dropping files.
- Your ramp and concurrency controls behave as intended.
The most expensive bulk-run failures almost always come from discovering a configuration mistake — a wrong model, a feature flag left off, a webhook handler that drops results silently — after the whole batch has been billed. A pilot catches these while they’re cheap to fix, and gives you a realistic mean TaT to plug into Size your target rate.
When to run
- Small tests and moderate bulk runs (well within your concurrency limit): US business hours (roughly 14:00–21:00 UTC) produce the most representative baseline latency, since throughput is highest during these periods.
- Large runs (200+ requests/minute): coordinate with our team before starting. Our team will pick a window and pre-scale for you.
- EU region: coordinate regardless of size.
Ramp up gradually
The most common mistake — for both bulk uploads and load tests — is submitting all requests at once. A gradual ramp gives the pipeline time to scale ahead of your traffic, which is what produces the lowest and most consistent turnaround times. Submitting a large spike upfront typically results in higher TaT as capacity catches up.
Recommended schedule (validated for Universal-3 Pro with speaker labels):
- Divide your ramp into 15-second windows.
- Start at 25 requests per window.
- Grow by ~8–9% per window until you reach your target sustained rate.
- Don’t pause mid-ramp. Stopping and restarting means ramping from the starting rate again, and you’ll see higher latency when traffic resumes.
- Do not exceed your account’s concurrency limit during the ramp.
If you’re using different models or features, contact support for a tailored ramp plan — some components take longer to initialize and may need a slower ramp.
To compute a ramp for any target without a lookup table, use rate_n = ceil(25 × 1.085ⁿ) (capped at your target rate), where n is the window index starting at 0. For a fully worked 400 requests/minute schedule, see Example ramp schedule at the bottom of this page.
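That formula can be turned into a small schedule generator; a sketch:

```python
import math

def ramp_schedule(target_rate, start=25, growth=1.085):
    """Requests per 15-second window: rate_n = ceil(start * growth**n),
    capped at target_rate; stops once the target is reached."""
    rates, n = [], 0
    while not rates or rates[-1] < target_rate:
        rates.append(min(target_rate, math.ceil(start * growth ** n)))
        n += 1
    return rates
```

At 15 seconds per window, ramping to 100 requests/window takes 18 windows, about four and a half minutes.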
After reaching your target rate, sustain it for at least 5–10 minutes. Your measurements are representative once p50 and p95 stay consistent over 2–3 consecutive minutes. If p50 is still falling, extend the sustain phase.
Small runs (under 500 files)
If your total run is under 500 files, you don’t need the ramp. Submit in parallel from a bounded thread pool sized to your concurrency limit. The ramp matters for sustained high rates over several minutes.
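A sketch of that parallel-submit pattern using the standard library; `submit_fn` is a placeholder for your actual submission call (for example, a wrapper around the SDK’s `Transcriber.submit`):

```python
from concurrent.futures import ThreadPoolExecutor

def submit_small_batch(audio_urls, submit_fn, concurrency_limit=200):
    """Submit every file in parallel from a pool bounded by the account's
    concurrency limit; returns results in input order."""
    with ThreadPoolExecutor(max_workers=concurrency_limit) as pool:
        return list(pool.map(submit_fn, audio_urls))
```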
Size your target rate
Your concurrency limit caps how many jobs can be in progress at once, so your sustained submission rate needs to fit inside it. Pick a target rate that keeps the typical number of in-flight jobs comfortably below your limit:
- Estimate mean turnaround time from your pilot run or the published benchmarks.
- Multiply your target submission rate (requests per second) by that mean TaT to approximate the number of jobs that will be in flight at steady state.
- Keep that number under ~80% of your concurrency limit. If it’s higher, lower the target rate, shorten mean TaT (fewer features, shorter audio, a faster model), or request a higher concurrency limit.
The 20% headroom absorbs normal variation in audio duration, warm-up effects, and webhook-receive latency. Runs sized exactly to the limit will see TaT climb as the in-flight count bumps against the cap.
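The sizing steps above amount to Little’s law (jobs in flight ≈ arrival rate × mean time in system). A sketch:

```python
import math

def max_safe_rate(concurrency_limit, mean_tat_s, headroom=0.8):
    """Highest sustained submission rate (requests/second) that keeps
    expected in-flight jobs under headroom x concurrency limit."""
    return headroom * concurrency_limit / mean_tat_s

def expected_in_flight(rate_rps, mean_tat_s):
    """Approximate steady-state in-flight jobs: rate x mean TaT."""
    return math.ceil(rate_rps * mean_tat_s)
```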
For planning, Universal-3 Pro English async typically completes a 5-minute file in about 9 seconds at p50 and 60 seconds at p95. Use that as your starting turnaround estimate before the pilot.
Measure and verify
For every run — bulk or load test — track completion and record per-request metadata.
- Use webhooks when possible. You get a clean completion signal without polling overhead. See the Webhooks documentation for setup, retry behavior, and authentication.
- Run a polling fallback alongside webhooks. Webhooks can drop for many reasons — receiver downtime, signature mismatches, transient network failures. For every submitted job, record an expected completion deadline (around 2× mean TaT from your pilot) and `GET /v2/transcript/{id}` for any job whose webhook hasn’t arrived by then. The fallback protects you from silent data loss without materially increasing your rate-limit usage.
- Polling-only every 1–2 seconds measures closer to actual completion but adds to your rate-limit budget. Use it when webhooks aren’t available or when you need precise TaT during a load test.
- Retry failures with exponential backoff for `5xx` responses — see Implement retry server error logic. Investigate `4xx` responses; they indicate a client-side issue that retrying won’t fix.
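One way to drive the polling fallback described above, assuming jobs are tracked in a dict with illustrative field names (the 2× deadline factor follows the guidance above):

```python
import time

def overdue_transcripts(jobs, mean_tat_s, deadline_factor=2.0, now=None):
    """jobs: {transcript_id: {"submit_ts": float, "done": bool}}.
    Return IDs whose webhook hasn't arrived within ~2x mean TaT;
    poll GET /v2/transcript/{id} for each of these."""
    now = time.time() if now is None else now
    deadline = deadline_factor * mean_tat_s
    return [tid for tid, j in jobs.items()
            if not j["done"] and now - j["submit_ts"] > deadline]
```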
Record these fields per request:
- `submit_ts` — timestamp when `POST /v2/transcript` was sent
- `complete_ts` — timestamp when completion was detected
- `audio_duration` — length of the audio file, in seconds
- `model` — speech model used
- `features` — features enabled (e.g. `speaker_labels`, `auto_highlights`, `sentiment_analysis`)
- `status` — `completed` or `error`
- `id` — transcript ID, for debugging with support
Turnaround time = `complete_ts` − `submit_ts`.
Normalize TaT by audio duration to get the real-time factor: RTF = turnaround_time ÷ audio_duration. An RTF of 0.5 means the API processed the file in half its audio duration. RTF is the headline metric for comparing runs across regions, models, and audio-duration buckets — raw TaT varies too much with audio length to mean anything on its own.
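TaT, RTF, and percentiles from the recorded fields, sketched with the standard library:

```python
from statistics import quantiles

def rtf(submit_ts, complete_ts, audio_duration):
    """Real-time factor: turnaround time divided by audio duration."""
    return (complete_ts - submit_ts) / audio_duration

def p50_p95(values):
    """Approximate p50/p95 from a list of TaT or RTF samples."""
    qs = quantiles(values, n=100)  # 99 cut points
    return qs[49], qs[94]
```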
Polling without exceeding the rate limit
HTTP rate limits cap total API requests at 20,000 per 5 minutes across all endpoints — submissions and polling combined. Exceeding this returns a 403 error.
If webhooks aren’t an option, stay within that budget. As an illustration: at a sustained 33 requests/second submission rate with ~330 jobs in flight, submissions alone consume roughly 10,000 of your budget — so poll no more often than every 15 seconds to leave headroom. Scale these numbers to your own submission rate and in-flight job count.
When many jobs share the same polling interval they tend to cluster at the same second boundaries, spiking your rate-limit usage and occasionally returning 403 errors. Stagger each job’s polling by ±25% of the interval (for example, interval × random.uniform(0.75, 1.25)) so GETs spread evenly across the window.
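The budget arithmetic and the jitter suggestion, sketched:

```python
import random

RATE_LIMIT = 20_000  # requests per rolling 5-minute window

def five_minute_usage(submit_rps, in_flight_jobs, poll_interval_s):
    """Approximate requests per 5-minute window: submissions plus polls."""
    return submit_rps * 300 + in_flight_jobs * (300 / poll_interval_s)

def jittered_interval(poll_interval_s):
    """Stagger each job's polling by +/-25% so GETs don't cluster."""
    return poll_interval_s * random.uniform(0.75, 1.25)
```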
If you’re running a bulk job
Monitor for these signals during the run:
- Healthy run: TaT stays within ~20% of your first 5 minutes of sustained submissions, your completion queue drains steadily, and no errors arrive.
- Diagnose, don’t panic: if something goes wrong, match the signal you see to the row in Diagnosing problems and respond accordingly — each signal has a different cause and fix.
If you’re running a load test
- Separate ramp-phase from sustain-phase metrics. Expect higher latency during ramp; use sustain-phase numbers as your benchmark.
- Report percentiles, not averages. Track p50, p75, p90, p95, p99, and max.
- Normalize by audio duration. Group results into duration buckets (0–5 min, 5–15 min, 15–30 min, 30–60 min) for meaningful comparison.
- Pre-scaling caveat. If our team pre-scaled for your test, your results reflect steady-state capacity — not cold-start or scale-up behavior.
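The duration buckets above can be assigned with a small helper (the 60+ bucket is an assumption for files outside the listed ranges):

```python
def duration_bucket(audio_duration_s):
    """Map an audio duration in seconds to the reporting buckets above."""
    for upper_s, label in [(300, "0-5 min"), (900, "5-15 min"),
                           (1800, "15-30 min"), (3600, "30-60 min")]:
        if audio_duration_s <= upper_s:
            return label
    return "60+ min"  # assumption: catch-all for longer files
```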
Diagnosing problems
Reference implementation
The AssemblyAI Python SDK’s non-blocking Transcriber.submit() returns as soon as a transcript is queued, so you can drive the ramp yourself while using the SDK’s TranscriptionConfig and exception classes. If you’d rather have the SDK handle both submission and polling for a smaller batch, see Transcribe multiple files simultaneously.
The following script ramps submissions to approximate the recommended schedule, retries transient errors, writes unrecoverable failures to a dead-letter log, and persists submitted file → transcript_id pairs so the run is resumable after a crash. The recommended schedule is hand-tuned to observed pipeline behavior, so the numbers this script produces may differ by one or two requests per window. Adjust `max_rate` to your target sustained rate.
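A minimal sketch of such a submitter, under assumptions: `submit_fn` stands in for your actual submission call (for example, wrapping the SDK’s `Transcriber.submit`), and exceptions carrying a `status_code` attribute below 500 are treated as unrecoverable:

```python
import json
import math
import time

def run_ramp(files, submit_fn, max_rate, start=25, growth=1.085,
             window_s=15, max_retries=3, state_path="run_state.json",
             dead_letter_path="dead_letter.log", sleep_fn=time.sleep):
    """Ramped bulk submitter (sketch). submit_fn(url) returns a transcript
    ID or raises; 5xx-style errors are retried with exponential backoff,
    everything else is dead-lettered. State is checkpointed per window so
    a crashed run can resume without re-submitting completed files."""
    try:
        with open(state_path) as f:
            state = json.load(f)          # resume: {file -> transcript_id}
    except FileNotFoundError:
        state = {}
    pending = [u for u in files if u not in state]
    window = 0
    while pending:
        # rate_n = ceil(start * growth**n), capped at the target rate
        rate = min(max_rate, math.ceil(start * growth ** window))
        batch, pending = pending[:rate], pending[rate:]
        for url in batch:
            for attempt in range(max_retries + 1):
                try:
                    state[url] = submit_fn(url)
                    break
                except Exception as exc:
                    status = getattr(exc, "status_code", None)
                    retryable = status is None or status >= 500
                    if retryable and attempt < max_retries:
                        sleep_fn(2 ** attempt)           # exponential backoff
                    else:
                        with open(dead_letter_path, "a") as dl:
                            dl.write(f"{url}\t{exc}\n")  # unrecoverable
                        break
        with open(state_path, "w") as f:
            json.dump(state, f)           # checkpoint after each window
        window += 1
        if pending:
            sleep_fn(window_s)
    return state
```

Passing `sleep_fn` makes the pacing injectable, which keeps the sketch testable and lets you swap in a more precise scheduler.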
Example ramp schedule
A fully worked schedule for ramping to 100 requests/window (400 requests/minute). Use it as a reference when building your own ramp, or as a starting point you can scale to a different target rate.
Coordinate with our team
Reach out to support@assemblyai.com or your account manager before you submit any requests if any of these apply:
- You plan to exceed 200 requests per minute.
- You’re running a large one-time upload (tens of thousands of files or more).
- You want support available during the run — for example, if you’re running outside US business hours.
- You’re using the EU region, regardless of size.
When you reach out, include:
- Expected request volume and ramp schedule, broken into 15-second windows
- Audio file durations and language breakdown
- Speech models and features you’ll enable
- Whether audio is single-channel or multichannel
- Preferred run window (see When to run)
AssemblyAI can pre-scale pipeline components for your traffic, raise your concurrency limit, and monitor the run in real time. For recurring bulk workloads (for example, nightly batch jobs), we can set up persistent scaling.