Voice Agent API
AssemblyAI’s Voice Agent API is a native real-time voice conversation endpoint. Unlike a traditional STT → LLM → TTS pipeline, the Voice Agent API handles everything in a single WebSocket connection: it listens to the user, understands what they said, generates a response, and speaks it back — all with low latency.
You need a credit card on file to access the Voice Agent API. Add one in your AssemblyAI dashboard.
Key capabilities:
- Built-in VAD — server-side voice activity detection with configurable sensitivity
- Customizable turn detection — tune silence thresholds, barge-in behavior, and interruption sensitivity
- Native audio output — streams PCM16 audio back directly, no separate TTS step
- Tool calling — register tools and handle tool.call / tool.result events
- Barge-in / interruption — users can interrupt the agent mid-response
Quickstart
The quickstart is a complete working example: it connects to the Voice Agent API endpoint, streams microphone audio, plays back the agent’s voice, and handles two tools (get_weather and get_time).
Audio format
Both input and output audio use the same format:
We recommend sending chunks of around 50ms (2,400 bytes at 24kHz). The server buffers and processes continuously, so exact chunk size isn’t critical.
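The chunk-size arithmetic can be sketched as follows, assuming 24kHz mono PCM16 (2 bytes per sample), which is what the 2,400-bytes-per-50ms figure implies:

```javascript
// 24kHz mono PCM16: 2 bytes per sample.
const SAMPLE_RATE = 24000;
const BYTES_PER_SAMPLE = 2;

// Bytes in an audio chunk of the given duration in milliseconds.
function chunkSizeBytes(ms) {
  return (SAMPLE_RATE * ms * BYTES_PER_SAMPLE) / 1000;
}

chunkSizeBytes(50); // 2400 bytes, the recommended ~50ms chunk
```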
Playing output audio
The server streams reply.audio events containing small PCM16 chunks. Write each chunk directly into an audio output buffer and let the OS drain it at the correct sample rate:
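The handler can be sketched as below; the event field names (type, audio) are assumptions based on the event list in this doc, and sink is any writable PCM sink:

```javascript
// Sketch: decode a reply.audio event and hand the PCM16 bytes to a sink.
// Field names (type, audio) are assumptions; `sink` is any object with a
// write() method.
function playReplyAudio(event, sink) {
  if (event.type !== 'reply.audio') return;
  // The audio payload is base64-encoded PCM16; decode and enqueue it.
  sink.write(Buffer.from(event.audio, 'base64'));
}
```

With the npm speaker package, sink would be new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 }).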
speaker.write() copies samples into the OS audio buffer and returns immediately. The hardware drains the buffer at exactly 24kHz, producing smooth playback. Network jitter is absorbed by the buffer — even if a WebSocket message arrives late, there’s still audio playing.
Don’t use sleep-based timing to schedule playback. The OS doesn’t guarantee exact sleep durations, so the playback clock drifts from the hardware clock over time, causing pops and gaps in the audio.
Stopping playback on interruption
When the user interrupts the agent, the server stops generating audio and sends reply.done with status: "interrupted". Your output buffer may still have queued audio from before the interruption. Flush it so the user doesn’t hear stale speech:
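One way to flush, sketched under the assumption that tearing down and recreating the sink discards its queued samples; makeSink is a hypothetical factory:

```javascript
// Sketch: wrap the audio sink so queued audio can be dropped on interruption.
// makeSink is a hypothetical factory, e.g.
// () => new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 })
// with the npm "speaker" package.
function makeFlushablePlayer(makeSink) {
  let sink = makeSink();
  return {
    play(event) {
      sink.write(Buffer.from(event.audio, 'base64'));
    },
    // Drop the old sink (and its queued samples) and start a fresh one.
    flush() {
      sink.end();
      sink = makeSink();
    },
  };
}
```

In the reply.done handler, call player.flush() when event.status === "interrupted".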
Connection
Endpoint
Authentication
Pass your API key as a Bearer token in the HTTP upgrade request:
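A minimal sketch of building that header; the resulting object can be passed to a WebSocket client such as the npm ws package (new WebSocket(url, { headers })):

```javascript
// Build the Authorization header for the HTTP upgrade request.
function authHeaders(apiKey) {
  return { Authorization: `Bearer ${apiKey}` };
}
```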
Browsers cannot set custom headers on WebSocket connections. For browser-based apps, use a server-side proxy. See Browser integration.
Events reference
Client → Server
input.audio
Stream PCM16 audio to the agent.
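A hedged sketch of the message shape; the exact field names (type, audio) are assumptions:

```javascript
// Sketch: wrap a PCM16 chunk as an input.audio message.
// Field names are assumptions; audio is base64-encoded PCM16.
function inputAudioEvent(pcmChunk) {
  return JSON.stringify({
    type: 'input.audio',
    audio: pcmChunk.toString('base64'),
  });
}
```

Send one of these per ~50ms chunk of microphone audio.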
session.update
Configure the session. Send immediately on WebSocket connect — before session.ready. Can also be sent mid-conversation to update any field.
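A hedged sketch of a session.update payload; apart from type and turn_detection, the field names here (system_prompt, greeting) are assumptions for illustration:

```javascript
// Illustrative session.update payload; field names other than type and
// turn_detection are assumptions.
const sessionUpdate = {
  type: 'session.update',
  system_prompt: 'You are a concise voice assistant. Max 2 sentences per turn.',
  greeting: 'Hi! How can I help?',
  turn_detection: { min_end_of_turn_silence_ms: 400 },
};
```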
session.resume
Reconnect to an existing session using the session_id from a previous session.ready. Preserves conversation context across dropped connections.
Sessions are preserved for 30 seconds after every disconnection before expiring. If the session has expired, the server returns a session.error with code session_not_found or session_forbidden. Start a fresh connection without session.resume.
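A minimal sketch of the resume message; the payload shape beyond session_id is an assumption:

```javascript
// Sketch: resume after a dropped connection. sessionId comes from the
// session.ready event of the previous connection.
function sessionResumeEvent(sessionId) {
  return JSON.stringify({ type: 'session.resume', session_id: sessionId });
}
```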
tool.result
Send a tool result back to the agent. Send this in the reply.done handler — not immediately in tool.call. See Tool calling.
Server → Client
session.ready
Session is established and ready to receive audio. Save session_id for reconnection. Start sending input.audio only after this event.
session.updated
Sent after session.update is applied successfully.
input.speech.started
VAD detected the user has started speaking.
input.speech.stopped
VAD detected the user has stopped speaking.
transcript.user.delta
Partial transcript of what the user is saying, updating in real-time.
transcript.user
Final transcript of the user’s utterance.
reply.started
Agent has begun generating a response.
reply.audio
A chunk of the agent’s spoken response as base64 PCM16. Decode and play immediately.
transcript.agent
Full text of the agent’s response, sent after all audio for the response has been delivered. If the agent was interrupted, interrupted is true and text contains only what was actually spoken before the interruption.
reply.done
Agent has finished speaking. The optional status field indicates why the reply ended.
tool.call
Agent wants to call a registered tool. args is a dict — ready to use directly.
session.error
Session or protocol error.
Also handle "error" (without the session. prefix) for connection-level errors.
Error codes:
Session configuration
System prompt
Set the agent’s personality and behaviour. Can be updated mid-session with another session.update.
Tips for voice-first prompts:
- Keep instructions concise — the model reads this, not the user
- Ban specific phrases: "Never say 'Certainly' or 'Absolutely'"
- Enforce brevity: "Max 2 sentences per turn"
- Tell the agent when to use each tool
Greeting
What the agent says at the start of the conversation, spoken aloud. If omitted, the agent waits silently for the user to speak first.
Turn detection
Customize VAD sensitivity, end-of-turn detection, and barge-in behavior. All fields are optional — only include the ones you want to change. Settings can be updated mid-session.
Use min_end_of_turn_silence_ms and max_turn_silence_ms together to control responsiveness. A lower min_end_of_turn_silence_ms makes the agent respond faster after the user pauses, while max_turn_silence_ms sets the hard cutoff.
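For example (the field names appear in this section; the values are illustrative):

```javascript
// Illustrative turn_detection settings inside a session.update.
const update = {
  type: 'session.update',
  turn_detection: {
    min_end_of_turn_silence_ms: 300, // respond quickly after a pause
    max_turn_silence_ms: 1500,       // hard cutoff for end of turn
    interrupt_response: true,        // allow barge-in
    min_interrupt_duration_ms: 600,
    min_interrupt_words: 0,
  },
};
```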
Voices
Set the agent’s voice in session.update. You can change the voice mid-session.
English voices
Multilingual voices
All multilingual voices also speak English and support code-switching between their language(s) and English.
Tool calling
Register tools to let the agent take real-world actions.
Tool schema
Tools use a flat format — type, name, description, and parameters at the top level:
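A sketch of a tool definition matching the quickstart’s get_weather; the "function" type value and the JSON-Schema parameters shape are assumptions:

```javascript
// Flat tool schema: type, name, description, and parameters at the top level.
const getWeather = {
  type: 'function',
  name: 'get_weather',
  description: 'Get the current weather for a city.',
  parameters: {
    type: 'object',
    properties: { city: { type: 'string' } },
    required: ['city'],
  },
};
```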
Handling tool calls
The key pattern: accumulate tool results, then send them all in reply.done — not immediately in tool.call. The agent speaks a transition phrase while waiting; sending results too early can cause timing issues.
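The pattern can be sketched as a small event handler; the field names id, name, args, and status are assumptions, and runTool stands in for your own tool dispatcher:

```javascript
// Sketch of the accumulate-then-send pattern. send posts a JSON string
// over the WebSocket; runTool(name, args) is your own dispatcher.
function makeToolHandler(runTool, send) {
  let pending = [];
  return (event) => {
    if (event.type === 'tool.call') {
      // args is already a dict; run the tool and queue the result.
      pending.push({ type: 'tool.result', id: event.id, result: runTool(event.name, event.args) });
    } else if (event.type === 'reply.done') {
      if (event.status === 'interrupted') {
        pending = []; // user barged in: discard stale results
      } else {
        for (const r of pending) send(JSON.stringify(r));
        pending = [];
      }
    }
  };
}
```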
Interruptions
When the user speaks mid-response, the server stops the agent and sends reply.done with status: "interrupted". The transcript.agent event will also fire with interrupted: true and text trimmed to what was actually spoken before the interruption. Discard any pending tool results — the agent is ready to listen again.
You can customize interruption behavior via turn_detection in session.update:
- interrupt_response — set to false to disable barge-in entirely
- min_interrupt_duration_ms — how long the user must speak before triggering an interruption (default: 600ms)
- min_interrupt_words — minimum words the user must say before interrupting (default: 0)
Browser integration
Browsers cannot set the Authorization header on WebSocket connections. Use a server-side proxy:
Minimal Node.js proxy:
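A hedged sketch using the npm ws package; the upstream URL is a placeholder for the Voice Agent endpoint. The proxy injects the Authorization header server-side so the browser never sees the API key:

```javascript
// Sketch of a pass-through WebSocket proxy using the npm "ws" package.
const WebSocket = require('ws');
const { WebSocketServer } = require('ws');

const UPSTREAM_URL = 'wss://<voice-agent-endpoint>'; // placeholder
const server = new WebSocketServer({ port: 8080 });

server.on('connection', (browser) => {
  // Open the upstream connection with the API key attached server-side.
  const upstream = new WebSocket(UPSTREAM_URL, {
    headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
  });
  // Pipe frames in both directions once the upstream socket is open.
  upstream.on('open', () => {
    browser.on('message', (data) => upstream.send(data));
    upstream.on('message', (data) => browser.send(data));
  });
  // Tear the pair down together.
  browser.on('close', () => upstream.close());
  upstream.on('close', () => browser.close());
});
```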
Framework integrations
Pipecat
Copy pipecat_assemblyai_realtime.py into your project, then use AssemblyAIRealtimeLLMService as a drop-in LLMService.
LiveKit
Copy assemblyai_realtime.py into your project, then pass RealtimeModel to AgentSession.