Stream a pre-recorded file in real time

When you stream a pre-recorded audio file to the Streaming API, you need to send audio at the same pace it was recorded. If you send audio faster than real time, the server receives more data than it can process in sequence, which can cause degraded transcription accuracy, unexpected session closures, or other errors.

This guide shows you how to pace audio correctly so that the server processes it as if a person were speaking into a live microphone.

Why real-time pacing matters

The Streaming API is designed for live audio. It expects audio to arrive at roughly the same rate it was originally spoken. When you stream a pre-recorded file without any pacing, your code reads and sends the entire file in seconds, even if the recording is minutes long. This causes problems:

  • Unexpected session behavior — Sending audio faster than real time can overwhelm the connection and cause the server to close the session or return errors.
  • Inaccurate results — The speech model is optimized for real-time input. Audio that arrives too quickly may not be processed the same way as live speech, potentially affecting transcription quality.
  • Unreliable benchmarks — If you’re evaluating transcription quality, faster-than-real-time streaming produces results that don’t reflect production conditions where audio arrives at normal speed.

If you only need a transcript and don’t need real-time results, use the pre-recorded transcription API instead. It processes audio as fast as possible and is optimized for batch workloads.
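For reference, here is a minimal sketch of that flow using the requests library: upload a local file, submit it for transcription, and poll until the transcript is ready. The paths below are the v2 pre-recorded endpoints; check the pre-recorded API docs for the full set of options.

import os
import time
import requests

headers = {"Authorization": os.environ["ASSEMBLYAI_API_KEY"]}
base_url = "https://api.assemblyai.com/v2"

# Upload the local file, then submit it for transcription
with open("audio.wav", "rb") as f:
    upload_url = requests.post(f"{base_url}/upload", headers=headers, data=f).json()["upload_url"]
job = requests.post(f"{base_url}/transcript", headers=headers, json={"audio_url": upload_url}).json()

# Poll until the transcript is ready
while True:
    result = requests.get(f"{base_url}/transcript/{job['id']}", headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result["text"] if result["status"] == "completed" else result["error"])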

Before you begin

To complete this guide, you need:

  • An AssemblyAI API key (the code reads it from the ASSEMBLYAI_API_KEY environment variable)
  • Python 3 with the websocket-client package installed
  • A pre-recorded audio file in WAV format (the examples assume a 16 kHz, mono, 16-bit PCM file named audio.wav)

Quickstart

import websocket
import json
import threading
import time
import wave
import os
from urllib.parse import urlencode

# --- Configuration ---
ASSEMBLYAI_API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
AUDIO_FILE = "audio.wav"
CHUNK_DURATION = 0.1  # Send 100ms of audio per chunk
SAMPLE_RATE = 16000  # Must match your audio file's sample rate

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "sample_rate": SAMPLE_RATE,
}
API_ENDPOINT = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(CONNECTION_PARAMS)}"

ws_app = None
audio_thread = None
stop_event = threading.Event()


def on_open(ws):
    print("Connected. Streaming audio at real-time speed...")

    def stream_file():
        with wave.open(AUDIO_FILE, "rb") as wf:
            frames_per_chunk = int(wf.getframerate() * CHUNK_DURATION)
            start_time = time.monotonic()
            chunks_sent = 0

            while not stop_event.is_set():
                frames = wf.readframes(frames_per_chunk)
                if not frames:
                    break

                ws.send(frames, websocket.ABNF.OPCODE_BINARY)
                chunks_sent += 1

                # Wall-clock pacing: sleep until the next chunk is due
                next_chunk_time = start_time + (chunks_sent * CHUNK_DURATION)
                sleep_duration = next_chunk_time - time.monotonic()
                if sleep_duration > 0:
                    time.sleep(sleep_duration)

        print("Finished sending audio. Waiting for final transcripts...")
        try:
            ws.send(json.dumps({"type": "Terminate"}))
        except Exception:
            pass

    global audio_thread
    audio_thread = threading.Thread(target=stream_file, daemon=True)
    audio_thread.start()


def on_message(ws, message):
    data = json.loads(message)

    if data["type"] == "Begin":
        print(f"Session ID: {data['id']}")
    elif data["type"] == "Turn":
        transcript = data.get("transcript", "")
        if not transcript:
            return
        if data.get("end_of_turn"):
            print(f"[Final]: {transcript}")
        else:
            print(f"[Partial]: {transcript}")
    elif data["type"] == "Termination":
        print(f"Done. Processed {data.get('audio_duration_seconds', 0)}s of audio.")


def on_error(ws, error):
    print(f"Error: {error}")
    stop_event.set()


def on_close(ws, status_code, msg):
    print(f"Disconnected (status={status_code})")
    stop_event.set()


ws_app = websocket.WebSocketApp(
    API_ENDPOINT,
    header={"Authorization": ASSEMBLYAI_API_KEY},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)

ws_thread = threading.Thread(target=ws_app.run_forever, daemon=True)
ws_thread.start()

try:
    while ws_thread.is_alive():
        time.sleep(0.1)
except KeyboardInterrupt:
    print("\nStopping...")
    stop_event.set()
    if ws_app and ws_app.sock and ws_app.sock.connected:
        try:
            ws_app.send(json.dumps({"type": "Terminate"}))
            time.sleep(2)
        except Exception:
            pass
    if ws_app:
        ws_app.close()
    ws_thread.join(timeout=2.0)

Step-by-step guide

Install dependencies

$ pip install websocket-client

Prepare your audio file

The Streaming API accepts raw audio samples. WAV is the simplest format to work with because it contains uncompressed PCM data that you can read directly.

Your audio file must be:

  • Mono (single channel)
  • 16-bit PCM encoding
  • A sample rate that matches the sample_rate connection parameter

If your file doesn’t meet these requirements, convert it with FFmpeg:

$ ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 output.wav

To check your file’s properties:

$ ffprobe -v quiet -print_format json -show_streams audio.wav
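
You can also verify these properties in Python before streaming. A small sketch using the standard-library wave module; the checks mirror the requirements above:

import wave

EXPECTED_SAMPLE_RATE = 16000  # must match the sample_rate connection parameter

with wave.open("audio.wav", "rb") as wf:
    assert wf.getnchannels() == 1, "audio must be mono"
    assert wf.getsampwidth() == 2, "audio must be 16-bit PCM (2 bytes per sample)"
    assert wf.getframerate() == EXPECTED_SAMPLE_RATE, (
        f"sample rate is {wf.getframerate()}, expected {EXPECTED_SAMPLE_RATE}"
    )
    duration = wf.getnframes() / wf.getframerate()
    print(f"OK: {duration:.1f}s of mono 16-bit audio at {wf.getframerate()} Hz")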

Configure the connection

Set your API key and match the sample_rate parameter to your audio file:

import websocket
import json
import threading
import time
import wave
import os
from urllib.parse import urlencode

ASSEMBLYAI_API_KEY = os.environ["ASSEMBLYAI_API_KEY"]
AUDIO_FILE = "audio.wav"
CHUNK_DURATION = 0.1  # 100ms per chunk
SAMPLE_RATE = 16000

CONNECTION_PARAMS = {
    "speech_model": "u3-rt-pro",
    "sample_rate": SAMPLE_RATE,
}
API_ENDPOINT = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(CONNECTION_PARAMS)}"

Implement wall-clock pacing

The key to simulating real-time audio is wall-clock pacing. Instead of calling sleep for a fixed duration after each chunk (which accumulates drift from processing time), track elapsed time from the start and sleep only until the next chunk is due.

Here’s the difference:

Naive approach (not recommended) — Fixed sleep after each send. Processing time adds up, so audio arrives progressively later than real time:

# Don't do this for benchmarking
while True:
    frames = wav_file.readframes(frames_per_chunk)
    ws.send(frames)
    time.sleep(chunk_duration)  # Drift accumulates over time

Wall-clock approach (recommended) — Calculate when each chunk should be sent based on the start time. This self-corrects any drift:

start_time = time.monotonic()
chunks_sent = 0

while not stop_event.is_set():
    frames = wav_file.readframes(frames_per_chunk)
    if not frames:
        break

    ws.send(frames, websocket.ABNF.OPCODE_BINARY)
    chunks_sent += 1

    # Sleep until the next chunk is due
    next_chunk_time = start_time + (chunks_sent * CHUNK_DURATION)
    sleep_duration = next_chunk_time - time.monotonic()
    if sleep_duration > 0:
        time.sleep(sleep_duration)

This approach uses time.monotonic() (Python) or Date.now() (JavaScript) to track elapsed time from the start of streaming. Each chunk is scheduled based on its position in the file, not relative to the previous chunk. If one iteration takes longer than expected, the next chunk is sent sooner to catch up, keeping the overall pace at real time. For example, if one send stalls for 30ms, the computed sleep_duration for the next chunk shrinks by 30ms, and the stream is back on schedule immediately.

End the session

After you send all audio, send a Terminate message so the server can flush its buffers and return any remaining transcripts:

ws.send(json.dumps({"type": "Terminate"}))

The server responds with a Termination message that includes the total audio duration processed. Wait for this message before closing the WebSocket connection so you don’t miss any final transcripts.
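
One way to do that is to signal from the on_message handler when the Termination message arrives. A minimal sketch building on the quickstart above (it assumes the ws_app object from the quickstart; the terminated_event variable and the 5-second timeout are our additions, not part of the API):

import json
import threading

terminated_event = threading.Event()

def on_message(ws, message):
    data = json.loads(message)
    # ... handle Begin and Turn messages as in the quickstart ...
    if data["type"] == "Termination":
        print(f"Done. Processed {data.get('audio_duration_seconds', 0)}s of audio.")
        terminated_event.set()  # tell the main thread it's safe to close

# After the last audio chunk has been sent:
ws_app.send(json.dumps({"type": "Terminate"}))
if not terminated_event.wait(timeout=5.0):  # give the server a moment to flush
    print("Timed out waiting for the Termination message")
ws_app.close()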

Choosing a chunk duration

The CHUNK_DURATION value controls how much audio you send in each message. Common values:

  • 100ms (0.1) — Good default. Balances network overhead with smooth pacing.
  • 50ms (0.05) — More closely simulates microphone input. Use this if you want behavior closest to a live mic stream.
  • 200ms (0.2) — Fewer network calls, slightly less real-time feel. Acceptable for most benchmarks.

Smaller chunks send more WebSocket messages but more closely approximate continuous microphone input. For benchmarking, 100ms is a good starting point.
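
For reference, the arithmetic that connects chunk duration to payload size, assuming 16 kHz, 16-bit (2 bytes per sample) mono PCM as configured above:

SAMPLE_RATE = 16000   # samples per second
CHUNK_DURATION = 0.1  # seconds of audio per message
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM

frames_per_chunk = int(SAMPLE_RATE * CHUNK_DURATION)   # 1600 frames
bytes_per_chunk = frames_per_chunk * BYTES_PER_SAMPLE  # 3200 bytes per message
messages_per_minute = int(60 / CHUNK_DURATION)         # 600 messages per minute of audio
print(frames_per_chunk, bytes_per_chunk, messages_per_minute)

Halving CHUNK_DURATION halves the payload of each message and doubles the message rate; the total bytes per second stay the same.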

Common mistakes

Mistake | Impact | Fix
No pacing at all | Audio arrives in seconds; session may close or return errors | Add wall-clock pacing as shown above
Naive fixed sleep | Drift accumulates over a long file; audio arrives late | Use wall-clock pacing with time.monotonic() or Date.now()
Wrong sample rate | Server interprets audio at the wrong speed | Match sample_rate to your file; check with ffprobe
Sending stereo audio | Only the first channel is used, or the session errors | Convert to mono: ffmpeg -i input.wav -ac 1 output.wav
Not sending Terminate | Server waits for more audio until the session times out, so you miss final transcripts | Always send {"type": "Terminate"} after the last audio chunk

Next steps