All audio exchanged over the Voice Agent API is base64-encoded, mono. The encoding you choose determines the sample rate and bit depth. Input and output encodings are configured independently.
Supported encodings
Set the encoding for input (microphone) and output (agent speech) via session.update. The following encodings are supported:
| Encoding | Sample rate | Bit depth | Best for |
|---|---|---|---|
| audio/pcm | 24,000 Hz | 16-bit signed integer (little-endian) | Default, highest quality, ideal for most apps |
| audio/pcmu | 8,000 Hz | 8-bit μ-law | Telephony (G.711 μ-law) |
| audio/pcma | 8,000 Hz | 8-bit A-law | Telephony (G.711 A-law) |
If you omit the format blocks, both input and output default to audio/pcm (24,000 Hz).
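To make the table concrete, the following sketch computes how many raw bytes (before base64 encoding) a given duration of mono audio occupies in each encoding. The sample rates and bit depths are taken from the table; the names `Encoding`, `ENCODINGS`, and `chunk_size_bytes` are illustrative and not part of the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Encoding:
    sample_rate_hz: int
    bytes_per_sample: int


# Values copied from the supported-encodings table; all audio is mono.
ENCODINGS = {
    "audio/pcm": Encoding(sample_rate_hz=24_000, bytes_per_sample=2),  # 16-bit signed LE
    "audio/pcmu": Encoding(sample_rate_hz=8_000, bytes_per_sample=1),  # 8-bit mu-law
    "audio/pcma": Encoding(sample_rate_hz=8_000, bytes_per_sample=1),  # 8-bit A-law
}


def chunk_size_bytes(encoding: str, duration_ms: int) -> int:
    """Return the raw (pre-base64) byte count for `duration_ms` of mono audio."""
    enc = ENCODINGS[encoding]
    samples = enc.sample_rate_hz * duration_ms // 1000
    return samples * enc.bytes_per_sample
```

For example, `chunk_size_bytes("audio/pcm", 50)` evaluates to 2,400 bytes, while the same 50 ms in either telephony encoding is 400 bytes.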
Sending audio
Stream microphone audio as input.audio events. Each event contains a base64-encoded audio chunk in the configured encoding. Send chunks continuously. The server buffers them, so the exact chunk size doesn’t matter; chunks of about 50 ms work well.
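A minimal sketch of the chunking and base64 step, assuming the default audio/pcm encoding and JSON messages. The message shape is an assumption: the `type` and `audio` field names are hypothetical placeholders, so check the event reference for the actual schema. `input_audio_events` is an illustrative helper, not an SDK function.

```python
import base64
import json
from typing import Iterator

# 50 ms of 24 kHz, 16-bit mono PCM: 1,200 samples * 2 bytes each.
CHUNK_BYTES = 2400


def input_audio_events(pcm: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[str]:
    """Yield one JSON-encoded input.audio message per slice of `pcm`."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            # Both field names below are assumed, not confirmed by the docs.
            "type": "input.audio",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Each yielded string would be sent as one WebSocket message. The final slice may be shorter than `chunk_bytes`, which is fine because the server buffers incoming audio.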
Only start sending input.audio after you receive session.ready.
Noise cancellation
Server-side noise cancellation is on by default and cannot be disabled. Send raw mic audio, and do not stack a second denoising layer on top in your client.
Echo cancellation
When the agent’s speech plays through speakers, the microphone can pick it up and send it back to the server, causing the agent to interrupt itself. To prevent this:
- Browser apps: use getUserMedia with echoCancellation enabled. The browser’s built-in acoustic echo cancellation (AEC) removes speaker output from the mic signal automatically.
- Terminal / desktop apps: use headphones. Native audio APIs (PortAudio, sounddevice, etc.) don’t include echo cancellation, so the raw mic captures speaker output. Headphones eliminate the feedback path entirely.
Playing output audio
The server streams reply.audio events containing small audio chunks in the configured output encoding. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate.
With Python’s sounddevice, for example, speaker.write() copies samples into the OS audio buffer and returns immediately. The hardware drains the buffer at the configured rate, producing smooth playback. The buffer also absorbs network jitter: if a WebSocket message arrives late, previously queued audio keeps playing.
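The playback path described above can be sketched as follows, assuming the default audio/pcm output encoding and Python’s sounddevice package. The `audio` field name on the reply.audio event is an assumption, and `open_speaker` and `play_reply_audio` are illustrative helpers rather than SDK functions.

```python
import base64
from typing import Protocol


class Speaker(Protocol):
    """Anything with a write() method, such as sounddevice.RawOutputStream."""

    def write(self, data: bytes) -> object: ...


def open_speaker():
    """Open and start a 24 kHz, 16-bit mono output stream (needs sounddevice)."""
    import sounddevice as sd  # imported lazily so the decoder below has no dependency

    speaker = sd.RawOutputStream(samplerate=24_000, channels=1, dtype="int16")
    speaker.start()
    return speaker


def play_reply_audio(event: dict, speaker: Speaker) -> int:
    """Decode one reply.audio event and queue its samples; return bytes written."""
    pcm = base64.b64decode(event["audio"])  # "audio" is an assumed field name
    speaker.write(pcm)
    return len(pcm)
```

In a receive loop, you would call `play_reply_audio(event, speaker)` for every reply.audio event as it arrives, without adding your own pacing or sleep calls.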
Handling interruptions
When the user speaks while the agent is responding (barge-in), the server stops generating audio and signals the interruption. Your client should:
- Flush the audio buffer: discard any queued audio so the user doesn’t hear stale speech.
- Restart the output stream so it’s ready for the next response.
The server signals the interruption with two events:
- reply.done with status: "interrupted"
- transcript.agent with interrupted: true and text trimmed to what was spoken
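A minimal sketch of the flush-and-restart step, assuming a sounddevice-style output stream with abort() and start() methods. The JSON shape of the event (`type` and `status` keys) is an assumption based on the event names above, and `handle_server_event` is an illustrative helper.

```python
from typing import Protocol


class Speaker(Protocol):
    """Output stream with sounddevice-style abort() and start() methods."""

    def abort(self) -> None: ...
    def start(self) -> None: ...


def handle_server_event(event: dict, speaker: Speaker) -> bool:
    """Flush and restart playback if `event` reports an interrupted reply.

    Returns True when a flush happened, False otherwise.
    """
    # Assumed JSON shape: {"type": "reply.done", "status": "interrupted"}.
    if event.get("type") == "reply.done" and event.get("status") == "interrupted":
        speaker.abort()  # drop queued audio without waiting for it to play
        speaker.start()  # reopen the stream so the next response can play
        return True
    return False
```

Replies that finish normally leave the stream untouched, so any audio still queued plays out to the end.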
Platform-specific audio flush
The speaker.abort() call is specific to Python’s sounddevice. Each platform has its own way to flush a playback buffer:
| Platform | Flush approach |
|---|---|
| Python (sounddevice) | speaker.abort() then speaker.start() |
| Web (AudioContext) | Disconnect the source node, create a new AudioBufferSourceNode, and reconnect |
| iOS (AVAudioEngine) | Call playerNode.stop() then playerNode.play() to clear the scheduled buffer |
| Android (AudioTrack) | Call audioTrack.pause(), audioTrack.flush(), then audioTrack.play() |