Session configuration
Send a session.update as your first WebSocket message, and again at any point afterward, to control how the agent speaks, listens, and responds. Some fields can only be set in that first update; see Mutability after session.ready below.
Here’s a full configuration showing every available field:
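As a rough sketch, a session.update message might look like the following. The top-level "type" key, the prompt key (shown here as "instructions"), and the voice ID are assumptions for illustration; the session.output, session.input, greeting, keyterms, turn_detection, and tools fields follow the sections below.

```python
import json

# Hypothetical full session.update sketch. Keys not documented in the
# sections below (e.g. "type", "instructions", the voice ID) are assumptions,
# not a definitive schema.
session_update = {
    "type": "session.update",  # message type (assumed name)
    "session": {
        "instructions": "You are a concise, friendly support agent.",  # system prompt (key name assumed)
        "greeting": "Hi! How can I help you today?",
        "output": {
            "voice": "example-voice",  # placeholder ID; see the Voices catalog
            "format": "audio/pcm",     # 24 kHz PCM, the default
        },
        "input": {
            "format": "audio/pcm",     # may differ from the output encoding
            "keyterms": ["Acme Corp", "Kubernetes"],
            "turn_detection": {},      # see Turn detection below
        },
        "tools": [],                   # tool definitions, if any
    },
}

# The message goes over the WebSocket as JSON text.
payload = json.dumps(session_update)
```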
Every field is optional. Include only what you want to set or change. Jump to any section below for details.
Mutability after session.ready
The first session.update you send before session.ready initializes the session. After session.ready, only a subset of fields can be changed; attempting to change an immutable field (such as session.output, including voice, or greeting) raises a session.error with code immutable_field, and the rejected change is ignored.
All other fields (session.tools, session.input.keyterms, session.input.format) are accepted in subsequent session.update messages and don't currently raise immutable_field.
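Since a rejected change is ignored rather than closing the connection, your client should watch for the error. A minimal handler sketch, assuming the error payload carries "type" and "code" keys (the exact shape is an assumption):

```python
import json

def is_immutable_field_error(raw: str) -> bool:
    """Return True if a raw WebSocket message is a session.error for an
    immutable-field change. The payload shape here is assumed, not documented."""
    msg = json.loads(raw)
    return msg.get("type") == "session.error" and msg.get("code") == "immutable_field"
```

When this returns True, the session continues with the previous value still in effect; log the rejection so mid-session configuration bugs are visible.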
System prompt
Set the agent’s personality and behavior. Can be updated mid-session with another session.update.
Tips for voice-first prompts:
- Ban specific phrases: "Never say 'Certainly' or 'Absolutely'"
- Enforce brevity: "Max 2 sentences per turn"
- Tell the agent when to use each tool
Prompt engineering for voice agents is iterative. Test your prompt in a live conversation, listen to how the agent responds, and refine it until the tone, length, and behavior match your use case. See the Prompting guide for patterns that improve instruction following, conversationality, and voice output quality.
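Because the prompt is mutable, each refinement can be pushed into the live session with another session.update. A sketch, assuming the prompt key is named "instructions" (the real key name isn't confirmed here):

```python
import json

# Mid-session prompt update sketch; "type" and "instructions" are assumed
# key names. Only the field being changed needs to be included.
update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "You are a support agent for Acme Corp. "
            "Never say 'Certainly' or 'Absolutely'. "
            "Max 2 sentences per turn. "
            "Use the lookup_order tool whenever the user mentions an order number."
        )
    },
}
payload = json.dumps(update)  # send over the open WebSocket
```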
Greeting
What the agent says at the start of the conversation, spoken aloud. If omitted, the agent waits silently for the user to speak first.
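Since the greeting is locked after the first update (see Mutability above), set it in your initial session.update. A sketch, with the "type" key as an assumed name:

```python
import json

# Greeting sketch: must go in the first session.update, since greeting is
# immutable after session.ready. "type" is an assumed key name.
first_update = {
    "type": "session.update",
    "session": {"greeting": "Hi, thanks for calling! What can I do for you?"},
}
payload = json.dumps(first_update)
```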
Voice and audio format
Choose a voice and configure the input/output audio encoding under session.output and session.input. The encoding determines the sample rate. Input and output encodings can differ. Both default to audio/pcm (24 kHz) if omitted.
session.output (including voice) and greeting are locked after the first session.update is applied. Later attempts to change them return an immutable_field error. Set the voice and output format on your first session.update.
See Voices for the voice catalog and Audio format for supported encodings and playback details.
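Putting the locked fields together, a first session.update might pin the voice and encodings like this (the voice ID is a placeholder and the "type" key an assumed name; audio/pcm at 24 kHz is the documented default):

```python
import json

# Voice and encoding sketch for the FIRST session.update; session.output is
# immutable afterward. The voice ID is a placeholder, not a real catalog entry.
first_update = {
    "type": "session.update",
    "session": {
        "output": {
            "voice": "example-voice",  # pick from the Voices catalog
            "format": "audio/pcm",     # 24 kHz; locked after this update
        },
        # Input encoding may differ from output and defaults to audio/pcm.
        "input": {"format": "audio/pcm"},
    },
}
payload = json.dumps(first_update)
```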
Key terms
If your conversation involves rare or domain-specific words, like a person’s name, company name, or product, add them to session.input.keyterms to improve transcription accuracy. This works like a word boost, biasing the speech recognition model toward these terms.
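Key terms can be sent in any session.update, since session.input.keyterms stays mutable (see Mutability above). A sketch, with illustrative placeholder terms and an assumed "type" key:

```python
import json

# Keyterms sketch: bias transcription toward rare or domain-specific words.
# The terms below are placeholders; "type" is an assumed key name.
update = {
    "type": "session.update",
    "session": {
        "input": {
            "keyterms": ["Anaphora Labs", "Priya Natarajan", "QuickLedger"],
        }
    },
}
payload = json.dumps(update)
```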
Turn detection
Turn detection and interruption handling are intelligent and semantic out of the box: back-channels like “uh-huh” don’t interrupt, but “wait, stop” does. This works with no configuration. See Turn detection and interruptions for the full explanation.
If you do want to customize sensitivity or disable barge-in, override the underlying VAD knobs under session.input.turn_detection. All fields are optional. Only include the ones you want to change. Settings can be updated mid-session.
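As a sketch, an override might look like the following. The knob names inside turn_detection below are hypothetical placeholders, not documented parameters; consult the turn detection reference for the real field names.

```python
import json

# Turn-detection override sketch. Both knobs below are HYPOTHETICAL names
# used for illustration only; "type" is also an assumed key name.
update = {
    "type": "session.update",
    "session": {
        "input": {
            "turn_detection": {
                # e.g. silence required before end-of-turn, in ms (hypothetical knob)
                "silence_duration_ms": 800,
                # e.g. whether the user can barge in over the agent (hypothetical knob)
                "allow_interruptions": True,
            }
        }
    },
}
payload = json.dumps(update)  # turn detection settings can change mid-session
```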
If the agent keeps interrupting itself, the microphone is picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.
Not sure which turn detection settings to use? Check out the quick start configurations for turn detection to find the best preset for your use case.