Voice Agent API

Session configuration

Set the system prompt, greeting, and turn detection behavior for your voice agent.

Send a session.update as your first WebSocket message, and any time after, to control how the agent speaks, listens, and responds. Some fields can only be set in the first update; see Mutability after session.ready below.

Here’s a full configuration showing every available field:

{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a friendly support agent. Keep responses under 2 sentences.",
    "greeting": "Hi! How can I help you today?",
    "tools": [],
    "input": {
      "format": { "encoding": "audio/pcm" },
      "keyterms": ["AssemblyAI", "Universal"],
      "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 1000,
        "max_silence": 3000,
        "interrupt_response": true
      }
    },
    "output": {
      "voice": "ivy",
      "format": { "encoding": "audio/pcm" }
    }
  }
}

Every field is optional. Include only what you want to set or change. Jump to any section below for details.
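In client code, the first session.update is just a JSON text frame. Here is a minimal Python sketch of assembling that frame; `build_session_update` is a hypothetical helper (not part of the API), and actually sending the payload over your WebSocket connection is left as a comment.

```python
import json

def build_session_update(**session_fields):
    # Hypothetical helper: wrap only the fields you want to set
    # in a session.update envelope.
    return {"type": "session.update", "session": session_fields}

# First message: initializes the session. greeting and output are
# locked once this update is applied (see Mutability below).
first_update = build_session_update(
    system_prompt="You are a friendly support agent. Keep responses under 2 sentences.",
    greeting="Hi! How can I help you today?",
    output={"voice": "ivy", "format": {"encoding": "audio/pcm"}},
)

payload = json.dumps(first_update)
# await ws.send(payload)  # send as a text frame on your open WebSocket
```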

Mutability after session.ready

The first session.update you send before session.ready initializes the session. After session.ready, only a subset of fields can be changed — changing one of the immutable fields raises a session.error with code immutable_field and the rejected change is ignored.

| Field | Mutable after session.ready? |
| --- | --- |
| session.system_prompt | Yes. Send a new prompt at any time to change the agent's behavior on the next turn. |
| session.input.turn_detection | Yes. Adjust VAD thresholds, silence windows, and barge-in on the fly. |
| session.greeting | No. Raises immutable_field. The greeting is spoken once at session start. |
| session.output | No. Raises immutable_field. The voice and output audio format are fixed for the session; pick the right voice in the first update. |

Fields not listed here (session.tools, session.input.keyterms, session.input.format) are accepted in subsequent session.update messages and don’t currently raise immutable_field.
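Because a rejected change is silently ignored server-side, it is worth surfacing immutable_field errors in your client. A sketch of one way to do that; the exact error-payload shape (a top-level `code` field on the `session.error` event) is an assumption, so adapt the field access to the real event schema.

```python
import json

def handle_server_event(raw_frame):
    # Surface immutable_field rejections instead of failing silently.
    # NOTE: the payload shape used here (top-level "code") is an
    # assumption, not confirmed by the API reference.
    event = json.loads(raw_frame)
    if event.get("type") == "session.error" and event.get("code") == "immutable_field":
        # The server ignored the rejected change; the session continues.
        return "rejected: " + event.get("message", "immutable field changed after session.ready")
    return None  # not an error we handle here
```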

System prompt

Set the agent’s personality and behavior. Can be updated mid-session with another session.update.

{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a friendly support agent. Keep responses under 2 sentences. Never make up information."
  }
}

Tips for voice-first prompts:

  • Ban specific phrases: "Never say 'Certainly' or 'Absolutely'"
  • Enforce brevity: "Max 2 sentences per turn"
  • Tell the agent when to use each tool

Prompt engineering for voice agents is iterative. Test your prompt in a live conversation, listen to how the agent responds, and refine it until the tone, length, and behavior match your use case. See the Prompting guide for patterns that improve instruction following, conversationality, and voice output quality.
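The tips above can be composed programmatically as you iterate. A sketch using a hypothetical `build_voice_prompt` helper (the helper and its parameters are illustrative, not part of the API):

```python
def build_voice_prompt(role, banned_phrases=(), max_sentences=2):
    # Hypothetical helper: assemble a voice-first system prompt
    # from a role description, a brevity rule, and banned phrases.
    parts = [role, f"Keep responses under {max_sentences} sentences per turn."]
    if banned_phrases:
        quoted = " or ".join(f"'{p}'" for p in banned_phrases)
        parts.append(f"Never say {quoted}.")
    return " ".join(parts)

prompt = build_voice_prompt(
    "You are a friendly support agent.",
    banned_phrases=("Certainly", "Absolutely"),
)
# Send the result as session.system_prompt in a session.update.
```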

Greeting

What the agent says at the start of the conversation, spoken aloud. If omitted, the agent waits silently for the user to speak first.

{
  "type": "session.update",
  "session": {
    "system_prompt": "You are a helpful assistant.",
    "greeting": "Hi there! How can I help you today?"
  }
}

Voice and audio format

Choose a voice and configure the input/output audio encoding under session.output and session.input. The encoding determines the sample rate. Input and output encodings can differ. Both default to audio/pcm (24 kHz) if omitted.

session.output (including voice) and greeting are locked after the first session.update is applied. Later attempts to change them return an immutable_field error. Set the voice and output format on your first session.update.

{
  "type": "session.update",
  "session": {
    "input": {
      "format": { "encoding": "audio/pcm" }
    },
    "output": {
      "voice": "ivy",
      "format": { "encoding": "audio/pcm" }
    }
  }
}

See Voices for the voice catalog and Audio format for supported encodings and playback details.

Key terms

If your conversation involves rare or domain-specific words, like a person’s name, company name, or product, add them to session.input.keyterms to improve transcription accuracy. This works like a word boost, biasing the speech recognition model toward these terms.

{
  "type": "session.update",
  "session": {
    "input": {
      "keyterms": ["AssemblyAI", "Universal", "Ozempic"]
    }
  }
}
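Since keyterms remain mutable after session.ready, you can add terms mid-session as new names come up in the conversation. A sketch with a hypothetical `keyterms_update` helper that deduplicates while preserving order:

```python
import json

def keyterms_update(terms):
    # Hypothetical helper: build a session.update that replaces the
    # keyterm list. dict.fromkeys deduplicates while keeping order.
    unique = list(dict.fromkeys(terms))
    return json.dumps({
        "type": "session.update",
        "session": {"input": {"keyterms": unique}},
    })

msg = keyterms_update(["AssemblyAI", "Universal", "AssemblyAI", "Ozempic"])
```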

Turn detection

Turn detection and interruption handling are intelligent and semantic out of the box: back-channels like “uh-huh” don’t interrupt, but “wait, stop” does. This works with no configuration. See Turn detection and interruptions for the full explanation.

If you do want to customize sensitivity or disable barge-in, override the underlying VAD knobs under session.input.turn_detection. All fields are optional. Only include the ones you want to change. Settings can be updated mid-session.

{
  "type": "session.update",
  "session": {
    "input": {
      "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 1000,
        "max_silence": 3000,
        "interrupt_response": true
      }
    }
  }
}
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| vad_threshold | float | 0.5 | Speech detection sensitivity (0.0–1.0). Lower = more sensitive to speech. |
| min_silence | integer | 1000 | Minimum silence to consider a confident end-of-turn, in milliseconds. |
| max_silence | integer | 3000 | Maximum silence before forcing end-of-turn, in milliseconds. |
| interrupt_response | boolean | true | Whether user speech interrupts the agent. Set false to disable barge-in. |
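Since all fields are optional, a mid-session update should carry only the knobs you actually override. A sketch with a hypothetical `turn_detection_update` builder that does this, plus a client-side range check on `vad_threshold` (the check is an illustrative convenience, not server behavior):

```python
import json

def turn_detection_update(vad_threshold=None, min_silence=None,
                          max_silence=None, interrupt_response=None):
    # Hypothetical helper: include only the overrides that were given,
    # so unspecified knobs keep their current values.
    knobs = {}
    if vad_threshold is not None:
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be in [0.0, 1.0]")
        knobs["vad_threshold"] = vad_threshold
    if min_silence is not None:
        knobs["min_silence"] = min_silence
    if max_silence is not None:
        knobs["max_silence"] = max_silence
    if interrupt_response is not None:
        knobs["interrupt_response"] = interrupt_response
    return json.dumps({
        "type": "session.update",
        "session": {"input": {"turn_detection": knobs}},
    })

# Example: disable barge-in and make end-of-turn detection more patient.
msg = turn_detection_update(min_silence=1500, interrupt_response=False)
```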

If the agent keeps interrupting itself, the microphone is picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.

Not sure which turn detection settings to use? Check out the quick start configurations for turn detection to find the best preset for your use case.