Audio format
All audio exchanged over the Voice Agent API is base64-encoded, mono. The encoding you choose determines the sample rate and bit depth. Input and output encodings are configured independently.
Supported encodings
Set the encoding for input (microphone) and output (agent speech) via session.update:
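For example, a session.update payload requesting telephony-rate μ-law input and 24 kHz PCM output might look like the following sketch. The nesting and field names of the format blocks are assumptions, not a confirmed schema; only the event name and encoding identifiers come from this page:

```python
import json

# Hypothetical session.update payload. The "input"/"output" format
# block field names are assumptions, not a confirmed schema.
session_update = {
    "type": "session.update",
    "input": {"format": {"encoding": "audio/pcmu", "sample_rate": 8000}},
    "output": {"format": {"encoding": "audio/pcm", "sample_rate": 24000}},
}

message = json.dumps(session_update)  # send this over the WebSocket
```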
If you omit the format blocks, both input and output default to audio/pcm (24,000 Hz).
Choose the encoding that matches your audio pipeline. For browser and desktop apps, use audio/pcm (24 kHz). For telephony integrations where audio is already in G.711 format, use audio/pcmu or audio/pcma to avoid resampling. Input and output can use different encodings. For example, receive telephony audio at 8 kHz and send high-quality agent speech at 24 kHz.
Sending audio
Stream microphone audio as input.audio events. Each event contains a base64-encoded audio chunk in the configured encoding. Send chunks continuously. The server buffers them, so exact chunk size doesn’t matter. ~50 ms chunks work well.
Only start sending input.audio after you receive session.ready.
Noise cancellation and voice focus
Voice focus (server-side noise cancellation) is enabled by default on every Voice Agent API session and tuned for real-world conditions like noisy environments and background speakers. The server isolates the primary speaker before transcription and turn detection, so you don't need a separate noise-suppression stage in your audio pipeline. Send the raw mic audio as-is: adding client-side denoising on top usually introduces artifacts that hurt accuracy more than the original noise did.
Disable browser-level noise suppression (set noiseSuppression: false in your getUserMedia constraints) and skip heavier preprocessing such as RNNoise, Krisp, or BVC; voice focus already handles background speech and ambient noise.
Echo cancellation
When the agent’s speech plays through speakers, the microphone can pick it up and send it back to the server, causing the agent to interrupt itself. To prevent this:
- Browser apps: use getUserMedia with echoCancellation enabled. The browser's built-in acoustic echo cancellation (AEC) removes speaker output from the mic signal automatically.
- Terminal / desktop apps: use headphones. Native audio APIs (PortAudio, sounddevice, etc.) don't include echo cancellation, so the raw mic captures speaker output. Headphones eliminate the feedback path entirely.
Without echo cancellation or headphones, the agent’s own speech loops back through the microphone and triggers barge-in. Every response will be cut short with status: "interrupted". See Troubleshooting for more details.
Playing output audio
The server streams reply.audio events containing small audio chunks in the configured output encoding. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:
speaker.write() copies samples into the OS audio buffer and returns immediately. The hardware drains the buffer at the configured rate, producing smooth playback. The buffer also absorbs network jitter: even if a WebSocket message arrives late, there's still queued audio playing.
Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps.
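A minimal sketch of the playback path, assuming a reply.audio event carrying a base64 audio field (field name is an assumption) and a speaker object with a sounddevice-style blocking write():

```python
import base64
import json

def handle_reply_audio(event_json: str, speaker) -> None:
    """Decode one reply.audio event and write it to the output buffer.

    `speaker` is assumed to be anything with a write(bytes) method,
    e.g. a sounddevice.RawOutputStream opened at the configured
    output sample rate. The event field names are assumptions.
    """
    event = json.loads(event_json)
    if event.get("type") != "reply.audio":
        return
    pcm = base64.b64decode(event["audio"])
    speaker.write(pcm)  # copies into the OS buffer; hardware drains it
```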
Handling interruptions
When the user speaks while the agent is responding (barge-in), the server stops generating audio and signals the interruption. Your client should:
- Flush the audio buffer: discard any queued audio so the user doesn’t hear stale speech.
- Restart the output stream: so it’s ready for the next response.
The server emits two events on interruption:
- reply.done with status: "interrupted"
- transcript.agent with interrupted: true and text trimmed to what was spoken
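Putting the two steps together, a sketch assuming a sounddevice-style stream with abort() and start() (the event shape is an assumption, not a confirmed schema):

```python
import json

def handle_server_event(event_json: str, speaker) -> None:
    """Flush and restart playback when a reply is interrupted.

    `speaker` is assumed to expose sounddevice-style abort() and
    start(); the event field names are a sketch.
    """
    event = json.loads(event_json)
    if event.get("type") == "reply.done" and event.get("status") == "interrupted":
        speaker.abort()  # drop queued samples so stale speech isn't heard
        speaker.start()  # stream is ready for the next response
```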
Barge-in is semantic. Back-channels like “uh-huh” don’t trigger an interruption, but phrases like “wait, stop” do. See Turn detection and interruptions for how the model decides, and Session configuration for tuning barge-in sensitivity.
Platform-specific audio flush
The speaker.abort() call above is specific to Python’s sounddevice. Each platform has its own way to flush a playback buffer: