Why real-time pacing matters
The Streaming API is designed for live audio. It expects audio to arrive at roughly the rate at which it was spoken. When you stream a pre-recorded file without any pacing, your code reads and sends the entire file in seconds, even if the recording is minutes long. This causes problems:
- Unexpected session behavior: Sending audio faster than real time can overwhelm the connection and cause the server to close the session or return errors.
- Inaccurate results: The speech model is optimized for real-time input. Audio that arrives too quickly may not be processed the same way as live speech, potentially affecting transcription quality.
- Unreliable benchmarks: If you’re evaluating transcription quality, faster-than-real-time streaming produces results that don’t reflect production conditions, where audio arrives at normal speed.
If you only need a transcript and don’t need real-time results, use the pre-recorded transcription API instead. It processes audio as fast as possible and is optimized for batch workloads.
Before you begin
To complete this guide, you need:
- An AssemblyAI API key. Sign up and get your key from the dashboard.
- Python 3.8+ or Node.js 18+.
- A WAV audio file (mono, 16-bit PCM). If your file is in a different format, see Prepare your audio file.
Quickstart
Step-by-step guide
Install dependencies
Prepare your audio file
The Streaming API accepts raw audio samples. WAV is the simplest format to work with because it contains uncompressed PCM data that you can read directly. Your audio file must be:
- Mono (single channel)
- 16-bit PCM encoding
- A sample rate that matches the sample_rate connection parameter
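You can confirm that a file meets these requirements before streaming by reading its header with Python's built-in wave module. This is a minimal sketch. check_wav is a hypothetical helper name. Note that wave.open itself raises wave.Error for WAV variants it can't decode, such as compressed or floating-point files.

```python
import wave


def check_wav(wav_source, expected_sample_rate=None):
    """Validate that a WAV file is mono 16-bit PCM and return its sample rate.

    wav_source may be a path or a binary file object. wave.open raises
    wave.Error for WAV variants it can't decode (e.g. compressed or float).
    """
    with wave.open(wav_source, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample; 2 means 16-bit
        sample_rate = wav.getframerate()

    problems = []
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, got {8 * sample_width}-bit")
    if expected_sample_rate is not None and sample_rate != expected_sample_rate:
        problems.append(
            f"file is {sample_rate} Hz but sample_rate={expected_sample_rate}"
        )
    if problems:
        raise ValueError("; ".join(problems))
    return sample_rate
```

Calling check_wav with no expected rate returns the file's rate, which you can then pass as the sample_rate connection parameter.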
Configure the connection
Set your API key and match the sample_rate parameter to your audio file.
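Here is a minimal Python sketch of this step. It reads sample_rate from the WAV header instead of hard-coding it. Several details are assumptions for illustration: the endpoint URL, sending the key in an Authorization header, and the ASSEMBLYAI_API_KEY environment variable name. Confirm the endpoint and authentication scheme against the Streaming API reference.

```python
import os
import wave
from urllib.parse import urlencode

# ASSUMPTION: v3 streaming endpoint; confirm against the Streaming API reference.
STREAMING_URL = "wss://streaming.assemblyai.com/v3/ws"


def connection_params(wav_source, api_key=None):
    """Return (url, headers) with sample_rate taken from the WAV file itself."""
    # ASSUMPTION: hypothetical environment variable name for the API key.
    api_key = api_key or os.environ["ASSEMBLYAI_API_KEY"]
    with wave.open(wav_source, "rb") as wav:
        sample_rate = wav.getframerate()
    url = f"{STREAMING_URL}?{urlencode({'sample_rate': sample_rate})}"
    # ASSUMPTION: the API key is sent in the Authorization header.
    headers = {"Authorization": api_key}
    return url, headers
```

Deriving the rate from the file avoids the "Wrong sample rate" mistake listed under Common mistakes.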
Implement wall-clock pacing
The key to simulating real-time audio is wall-clock pacing. Instead of calling sleep for a fixed duration after each chunk (which accumulates drift from processing time), track elapsed time from the start and sleep only until the next chunk is due.
Here’s the difference:
Naive approach (not recommended) — Fixed sleep after each send. Processing time adds up, so audio arrives progressively later than real time:
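A Python sketch of the naive loop makes the problem visible. send is a hypothetical callable that stands in for your WebSocket client's binary-send method. sleep is a parameter only so the drift can be demonstrated with a fake clock instead of real waiting.

```python
import time
import wave


def stream_naive(wav_source, send, chunk_duration=0.1, sleep=time.sleep):
    """NOT RECOMMENDED: fixed sleep after each send, so drift accumulates."""
    with wave.open(wav_source, "rb") as wav:
        frames_per_chunk = round(wav.getframerate() * chunk_duration)
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            send(chunk)            # time spent here is never accounted for...
            sleep(chunk_duration)  # ...so every chunk goes out later than the last
```

Suppose each send takes 5 ms. The tenth 100ms chunk then goes out 45 ms late, and a five-minute file (3,000 chunks) ends roughly 15 seconds behind real time.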
Wall-clock pacing (recommended) uses time.monotonic() (Python) or Date.now() (JavaScript) to track elapsed time from the start of streaming. Each chunk is scheduled based on its position in the file, not relative to the previous chunk. If one iteration takes longer than expected, the next chunk is sent sooner to catch up, which keeps the overall pace at real time.
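A Python sketch of wall-clock pacing follows. send is a hypothetical callable that stands in for your WebSocket client's binary-send method. clock and sleep default to time.monotonic and time.sleep, and you can replace them with fakes in tests. Each chunk's due time is computed from the number of frames already sent, so rounding in the chunk size can't accumulate either.

```python
import time
import wave


def stream_paced(wav_source, send, chunk_duration=0.1,
                 clock=time.monotonic, sleep=time.sleep):
    """Send WAV audio at real-time speed using wall-clock pacing."""
    with wave.open(wav_source, "rb") as wav:
        sample_rate = wav.getframerate()
        bytes_per_frame = wav.getsampwidth() * wav.getnchannels()
        frames_per_chunk = round(sample_rate * chunk_duration)
        start = clock()
        frames_sent = 0
        while True:
            chunk = wav.readframes(frames_per_chunk)
            if not chunk:
                break
            # A chunk is due once the audio before it would have finished playing.
            due = start + frames_sent / sample_rate
            delay = due - clock()
            if delay > 0:
                sleep(delay)
            # If delay <= 0 we're behind schedule, so send immediately to catch up.
            send(chunk)
            frames_sent += len(chunk) // bytes_per_frame
```

The schedule depends only on start and frames_sent. A single slow send therefore delays at most the next few chunks. Once the loop is back on schedule, it resumes sleeping until each chunk is due.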
End the session
After you send all audio, send a Terminate message so the server can flush its buffers and return any remaining transcripts.
The server then responds with a Termination message that includes the total audio duration processed. Wait for this message before closing the WebSocket connection so you don’t miss any final transcripts.
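Both halves of the shutdown can be sketched in Python. terminate_message builds the Terminate payload described above. drain_until_termination takes a hypothetical recv callable that returns one JSON text message per call, and it keeps reading until Termination arrives. The Turn fields it reads (end_of_turn, transcript, turn_order) are assumptions about the server's message schema. Check them against the Streaming API reference.

```python
import json


def terminate_message():
    """Return the Terminate message to send after the last audio chunk."""
    return json.dumps({"type": "Terminate"})


def drain_until_termination(recv):
    """Read server messages until Termination; return (final transcripts, termination).

    recv is a hypothetical callable returning one JSON text message per call.
    """
    turns = {}
    while True:
        message = json.loads(recv())
        msg_type = message.get("type")
        if msg_type == "Turn" and message.get("end_of_turn"):
            # ASSUMED schema: keying by turn_order keeps only the latest
            # end-of-turn transcript for each turn.
            turns[message.get("turn_order", len(turns))] = message.get("transcript", "")
        elif msg_type == "Termination":
            return [turns[key] for key in sorted(turns)], message
```

In a real client, send terminate_message() as a text frame after the last audio chunk. Then call drain_until_termination with your WebSocket client's receive method before closing the connection.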
Choosing a chunk duration
The CHUNK_DURATION value controls how much audio you send in each message. Common values:
- 100ms (0.1): Good default. Balances network overhead with smooth pacing.
- 50ms (0.05): More closely simulates microphone input. Use this if you want behavior closest to a live mic stream.
- 200ms (0.2): Fewer network calls, slightly less real-time feel. Acceptable for most benchmarks.
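The size of each message follows directly from the duration. The frame count is sample_rate × CHUNK_DURATION, and each frame of mono 16-bit PCM is 2 bytes. A small Python helper makes the arithmetic concrete. chunk_size is a hypothetical name.

```python
def chunk_size(sample_rate, chunk_duration, sample_width=2, channels=1):
    """Return (frames, bytes) per message; defaults assume mono 16-bit PCM."""
    frames = round(sample_rate * chunk_duration)
    return frames, frames * sample_width * channels


for duration in (0.05, 0.1, 0.2):
    frames, nbytes = chunk_size(16000, duration)
    print(f"{duration * 1000:.0f}ms -> {frames} frames, {nbytes} bytes")
```

At 16 kHz, this prints 800 frames (1,600 bytes) for 50ms, 1,600 frames (3,200 bytes) for 100ms, and 3,200 frames (6,400 bytes) for 200ms.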
Common mistakes
| Mistake | Impact | Fix |
|---|---|---|
| No pacing at all | Audio arrives in seconds; session may close or return errors | Add wall-clock pacing as shown above |
| Naive fixed sleep | Drift accumulates over a long file; audio arrives late | Use wall-clock pacing with time.monotonic() or Date.now() |
| Wrong sample rate | Server interprets audio at the wrong speed | Match sample_rate to your file. Check with ffprobe |
| Sending stereo audio | Only the first channel is used, or the session errors | Convert to mono: ffmpeg -i input.wav -ac 1 output.wav |
| Not sending Terminate | Server waits for more audio until the session times out, so you miss final transcripts | Always send {"type": "Terminate"} after the last audio chunk |
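If ffmpeg isn't available, you can also do the stereo-to-mono conversion with Python's standard library. This sketch swaps in a different technique from the ffmpeg command in the table. downmix_to_mono is a hypothetical helper that assumes 16-bit stereo input and averages the left and right samples of each frame.

```python
import array
import sys
import wave


def downmix_to_mono(src, dst):
    """Average the two channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(src, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo input")
        rate = wav.getframerate()
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Samples are interleaved L, R, L, R...; floor-average each pair.
    mono = array.array(
        "h", ((left + right) // 2 for left, right in zip(samples[0::2], samples[1::2]))
    )
    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(dst, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(mono.tobytes())
```

Both src and dst can be paths or binary file objects, because wave.open accepts either.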
Next steps
- Transcribe audio files with Streaming — Full example with audio playback and transcript saving.
- Evaluate Streaming transcription accuracy with WER — Benchmark your streaming transcription quality.
- Common session errors and closures — Troubleshoot session disconnects.