Turn detection and interruptions
The Voice Agent API handles turn detection and interruptions automatically using a multimodal model that looks at both the live audio stream and the transcript text returned by Universal-3 Pro. Decisions are semantic, based on the meaning of what the user actually said, not just on silence or volume.
You don’t need to wire anything up or configure it on your end; turn-taking and barge-in both work well out of the box.
How it works
Traditional voice agents rely on voice activity detection (VAD) and silence thresholds to decide when a user is done talking or trying to interrupt. That works for clean speech, but it falls apart in real conversations: the agent gets cut off by a quick “uh-huh”, or it fails to react when the user clearly wants it to stop.
The Voice Agent API uses a multimodal turn detection and interruption model that takes two inputs:
- Audio: the live PCM stream from the user.
- Transcript text: the partial and final transcripts produced by Universal-3 Pro.
Combining these signals lets the model make semantic decisions about what the user is actually doing, not just whether they’re making noise.
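In practice, the only thing your client sends is audio; the transcript half of the pairing is produced server-side by Universal-3 Pro. A minimal sketch of the client’s side, assuming a WebSocket transport and a placeholder endpoint (both illustrative, not documented connection details):

```ts
// Hedged sketch of the client's half of the contract: stream raw PCM up;
// Universal-3 Pro transcription happens server-side. The URL and audio
// format are assumptions, not documented connection details.
const ws = new WebSocket("wss://api.example.com/v1/voice-agent"); // placeholder URL
ws.binaryType = "arraybuffer";

function sendAudioChunk(pcm: ArrayBuffer) {
  // Raw PCM frames from your capture pipeline; the client only ever
  // sends audio, never transcript text.
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(pcm);
  }
}
```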
Semantic interruptions
While the agent is speaking, the model classifies user speech as either a back-channel or a true interruption.
Back-channeling
Short verbal acknowledgements that show the user is engaged but not trying to take the floor. The agent keeps speaking and the API does not emit an interruption.
Examples:
- “Uh-huh”
- “Okay”
- “Awesome”
- “Yeah, makes sense”
- “Mm-hmm”
True interruptions
Phrases that signal the user wants the agent to stop. The API immediately interrupts the agent.
Examples:
- “Wait, stop”
- “Sorry, that’s not right”
- “Okay, wait a minute”
- “Hold on”
When a true interruption is detected, the server emits:
- `reply.done` with `status: "interrupted"`
- `transcript.agent` with `interrupted: true` and `text` trimmed to what the user actually heard before being cut off
See Handling interruptions for the client-side audio flush pattern.
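As a rough illustration of that pattern, here’s a hedged sketch. The event names match the ones above; the endpoint URL, player, and transcript helper are hypothetical stand-ins for your own playback pipeline:

```ts
// Hedged sketch of the client-side flush on a true interruption.
// The event names match the ones above; the URL, `player`, and
// `renderAgentTranscript` are hypothetical stand-ins.
const ws = new WebSocket("wss://api.example.com/v1/voice-agent"); // placeholder URL

const audioQueue: ArrayBuffer[] = []; // agent audio buffered but not yet played
const player = {
  stop() {
    /* halt whatever is currently playing (implementation-specific) */
  },
};

function renderAgentTranscript(text: string) {
  console.log("agent:", text);
}

ws.addEventListener("message", (e) => {
  const event = JSON.parse(e.data);

  if (event.type === "reply.done" && event.status === "interrupted") {
    player.stop();         // cut playback immediately
    audioQueue.length = 0; // drop any buffered-but-unplayed agent audio
  }

  if (event.type === "transcript.agent" && event.interrupted) {
    // `text` is already trimmed to what the user actually heard,
    // so it can be rendered or logged as-is.
    renderAgentTranscript(event.text);
  }
});
```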
Semantic turn detection
The same multimodal model decides when the user has finished a turn. Instead of waiting for a fixed silence window, it uses the meaning of what the user said to decide whether they’re done, so the agent neither cuts the user off mid-thought nor sits through long pauses after they’ve clearly finished.
A typical user turn produces:
- `input.speech.started` when the user begins speaking.
- `transcript.user.delta` events with partial transcripts as the user keeps talking.
- `input.speech.stopped` when the model decides the turn has ended.
- `transcript.user` with the final transcript.
- `reply.started` as the agent begins generating a response.
You don’t need to send any signal to end a turn. The model handles it for you.
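To make the lifecycle concrete, here’s a hedged sketch of a client-side handler for those events. The event names come from this page; the payload fields (`delta`, `text`) and the caption helpers are illustrative assumptions:

```ts
// Hedged sketch of a handler for the turn lifecycle above. Event names
// come from this page; the payload fields (`delta`, `text`) are assumptions.
const ws = new WebSocket("wss://api.example.com/v1/voice-agent"); // placeholder URL

function updateLiveCaption(partial: string) {
  console.log("listening:", partial);
}

function commitCaption(finalText: string) {
  console.log("user said:", finalText);
}

ws.addEventListener("message", (e) => {
  const event = JSON.parse(e.data);
  switch (event.type) {
    case "input.speech.started":
      // User began speaking: e.g. show a "listening" indicator.
      break;
    case "transcript.user.delta":
      updateLiveCaption(event.delta); // partial transcript so far (assumed field)
      break;
    case "input.speech.stopped":
      // The model decided the turn has ended; nothing to send from the client.
      break;
    case "transcript.user":
      commitCaption(event.text); // final transcript for the turn (assumed field)
      break;
    case "reply.started":
      // Agent response is on its way: get the playback pipeline ready.
      break;
  }
});
```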
Configuration
Semantic turn detection and interruption handling are on by default and tuned for typical conversational use cases. For most agents, the right move is to leave them alone.
If you do need to adjust sensitivity, for example to be more patient in a noisy environment or to disable barge-in entirely, you can override the underlying VAD knobs via `session.input.turn_detection`.
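As a hedged illustration of what such an override might look like (the wrapper event and every field name below are assumptions, not the documented schema):

```ts
// Hedged sketch of a sensitivity override. The wrapper event
// ("session.update") and field names (threshold, silence_duration_ms,
// interruptions) are illustrative assumptions; see the reference below
// for the real schema and defaults.
const ws = new WebSocket("wss://api.example.com/v1/voice-agent"); // placeholder URL

ws.addEventListener("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        input: {
          turn_detection: {
            threshold: 0.7,           // e.g. less trigger-happy in a noisy room
            silence_duration_ms: 800, // be more patient before ending a turn
            interruptions: false,     // disable barge-in entirely
          },
        },
      },
    }),
  );
});
```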
See Session configuration → Turn detection for the full reference, default values, and example payloads.
If the agent keeps interrupting itself, the microphone is most likely picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.
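If you’re capturing audio in the browser yourself, you can request echo cancellation at the capture stage; these are the standard getUserMedia audio constraints:

```ts
// Capture the microphone with echo cancellation enabled so it doesn't feed
// the agent's own TTS output back into the session. These are the standard
// getUserMedia audio constraints.
async function openMicrophone(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      autoGainControl: true,
    },
  });
}
```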