Voice Agent API

Audio format

Supported encodings, sample rates, and how to stream and play Voice Agent API audio.

All audio exchanged over the Voice Agent API is base64-encoded and mono. The encoding you choose determines the sample rate and bit depth. Input and output encodings are configured independently.

Supported encodings

Set the encoding for input (microphone) and output (agent speech) via session.update:

{
  "type": "session.update",
  "session": {
    "input": {
      "format": { "encoding": "audio/pcm" }
    },
    "output": {
      "format": { "encoding": "audio/pcmu" }
    }
  }
}
Encoding   | Sample rate | Bit depth                             | Best for
audio/pcm  | 24,000 Hz   | 16-bit signed integer (little-endian) | Default; highest quality, ideal for most apps
audio/pcmu | 8,000 Hz    | 8-bit μ-law                           | Telephony (G.711 μ-law)
audio/pcma | 8,000 Hz    | 8-bit A-law                           | Telephony (G.711 A-law)

If you omit the format blocks, both input and output default to audio/pcm (24,000 Hz).

Choose the encoding that matches your audio pipeline. For browser and desktop apps, use audio/pcm (24 kHz). For telephony integrations where audio is already in G.711 format, use audio/pcmu or audio/pcma to avoid resampling. Input and output can use different encodings. For example, receive telephony audio at 8 kHz and send high-quality agent speech at 24 kHz.
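
A minimal sketch of that mixed setup, reusing the session.update shape above. It assumes ws is an already-open WebSocket to the API and that this runs inside an async function, as in the later snippets:

import json

# Telephony bridge example: μ-law caller audio in, 24 kHz PCM agent speech out.
await ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "input": {"format": {"encoding": "audio/pcmu"}},   # 8 kHz G.711 μ-law from the caller
        "output": {"format": {"encoding": "audio/pcm"}}    # 24 kHz 16-bit PCM agent speech
    }
}))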

Sending audio

Stream microphone audio to the server as input.audio events. Each event contains a base64-encoded audio chunk in the configured input encoding. Send chunks continuously; the server buffers them, so the exact chunk size doesn’t matter. Chunks of ~50 ms work well.

import base64
import json

# In your mic callback (runs on the audio driver's thread):
def mic_callback(indata, *_):
    if session_ready.is_set():
        # Hand the chunk to the asyncio loop without blocking the audio thread.
        loop.call_soon_threadsafe(mic_queue.put_nowait, bytes(indata))

# In your send loop:
async def send_audio():
    while True:
        chunk = await mic_queue.get()
        await ws.send(json.dumps({
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode()
        }))
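
How big is a ~50 ms chunk? It depends on the encoding; a quick back-of-the-envelope helper (illustrative only):

def chunk_bytes(ms: int, sample_rate: int, bytes_per_sample: int) -> int:
    """Raw size of `ms` milliseconds of mono audio, before base64 encoding."""
    return sample_rate * bytes_per_sample * ms // 1000

chunk_bytes(50, 24_000, 2)  # audio/pcm  -> 2400 bytes
chunk_bytes(50, 8_000, 1)   # audio/pcmu -> 400 bytes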

Only start sending input.audio after you receive session.ready.
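
One way to wire that up is to flip the session_ready flag from your receive loop. A minimal sketch, assuming the ws connection and mic_callback from the snippets above:

import json
import threading

session_ready = threading.Event()  # safe to check from the audio thread

async def receive_events():
    async for message in ws:
        event = json.loads(message)
        if event["type"] == "session.ready":
            session_ready.set()  # mic_callback starts queueing audio from here on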

Noise cancellation and voice focus

Voice focus (server-side noise cancellation) is enabled by default on every Voice Agent API session and tuned for real-world conditions like noisy environments and background speakers. The server isolates the primary speaker before transcription and turn detection, so you don’t need a separate noise-suppression stage in your audio pipeline. Send the raw mic audio as-is; adding client-side denoising on top usually introduces artifacts that hurt accuracy more than the original noise did.
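
On desktop, “as-is” can be as simple as opening a raw input stream with no filtering stage. A sketch using sounddevice, assuming the API’s 24 kHz PCM default and the mic_callback from Sending audio:

import sounddevice as sd

# Raw mono capture at the default 24 kHz PCM rate; no local denoising stage.
mic = sd.RawInputStream(
    samplerate=24_000,
    channels=1,
    dtype="int16",
    blocksize=1_200,        # 50 ms of audio at 24 kHz
    callback=mic_callback,  # the callback from "Sending audio" above
)
mic.start()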

Disable browser-level noiseSuppression (set noiseSuppression: false in your getUserMedia constraints) and skip heavier preprocessing like RNNoise, Krisp, or BVC; voice focus already handles background speech and ambient noise.

Echo cancellation

When the agent’s speech plays through speakers, the microphone can pick it up and send it back to the server, causing the agent to interrupt itself. To prevent this:

  • Browser apps: use getUserMedia with echoCancellation enabled. The browser’s built-in acoustic echo cancellation (AEC) removes speaker output from the mic signal automatically:

    const stream = await navigator.mediaDevices.getUserMedia({
      audio: { echoCancellation: true, noiseSuppression: false }
    });
  • Terminal / desktop apps: use headphones. Native audio APIs (PortAudio, sounddevice, etc.) don’t include echo cancellation, so the raw mic captures speaker output. Headphones eliminate the feedback path entirely.

Without echo cancellation or headphones, the agent’s own speech loops back through the microphone and triggers barge-in. Every response will be cut short with status: "interrupted". See Troubleshooting for more details.

Playing output audio

The server streams reply.audio events containing small audio chunks in the configured output encoding. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:

import base64

import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # 24 kHz for audio/pcm; 8 kHz for audio/pcmu or audio/pcma

with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as speaker:
    # In your event loop:
    if event["type"] == "reply.audio":
        # Decode the base64 chunk into 16-bit samples and queue it for playback.
        pcm = np.frombuffer(base64.b64decode(event["data"]), dtype=np.int16)
        speaker.write(pcm)

speaker.write() copies samples into the OS audio buffer and returns immediately. The hardware drains the buffer at the configured rate, producing smooth playback. Network jitter is absorbed by the buffer: even if a WebSocket message arrives late, there’s still buffered audio playing.

Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps.

# ❌ Don't do this
while True:
    chunk = get_next_chunk()
    play(chunk)
    await asyncio.sleep(0.020)  # drift accumulates → audio artifacts
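
Instead, write each chunk the moment it arrives and let the OS buffer absorb the timing. A minimal sketch, reusing the hypothetical get_next_chunk() helper and the speaker stream from above:

# ✅ Do this: hand chunks straight to the OS buffer as they arrive.
# The hardware drains the buffer at the exact sample rate, so playback
# timing comes from the audio clock, not from sleep() accuracy.
while True:
    chunk = get_next_chunk()  # hypothetical helper, as in the anti-pattern above
    speaker.write(chunk)      # queue into the OS buffer; hardware paces playback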

Handling interruptions

When the user speaks while the agent is responding (barge-in), the server stops generating audio and signals the interruption. Your client should:

  1. Flush the audio buffer: discard any queued audio so the user doesn’t hear stale speech.
  2. Restart the output stream so it’s ready for the next response.
if t == "reply.done":  # t is the event's "type" field
    if event.get("status") == "interrupted":
        speaker.abort()  # discard buffered audio
        speaker.start()  # restart stream for next response

The server emits two events on interruption; the one your playback loop acts on is reply.done with status: "interrupted", as shown above.

Barge-in is semantic. Back-channels like “uh-huh” don’t trigger an interruption, but phrases like “wait, stop” do. See Turn detection and interruptions for how the model decides, and Session configuration for tuning barge-in sensitivity.

Platform-specific audio flush

The speaker.abort() call above is specific to Python’s sounddevice. Each platform has its own way to flush a playback buffer:

Platform             | Flush approach
Python (sounddevice) | speaker.abort() then speaker.start()
Web (AudioContext)   | Disconnect the source node, create a new AudioBufferSourceNode, and reconnect
iOS (AVAudioEngine)  | Call playerNode.stop() then playerNode.play() to clear the scheduled buffer
Android (AudioTrack) | Call audioTrack.pause(), audioTrack.flush(), then audioTrack.play()