Turn detection and barge-in are on by default. Decisions are semantic, based on what the user actually said, not just silence or volume. You don’t need to wire anything in or configure it on your end.
This page covers the behaviors you can rely on and the events that go with them.
While the agent is speaking, the API classifies user speech as either a back-channel or a true interruption.
Back-channels are short verbal acknowledgements that show the user is engaged but not trying to take the floor. The agent keeps speaking and the API does not emit an interruption.
Examples:
True interruptions are phrases that signal the user wants the agent to stop. The API immediately interrupts the agent.
Examples:
When a true interruption is detected, the server emits:
reply.done with status: "interrupted"
transcript.agent with interrupted: true and text trimmed to what the user actually heard before being cut off.
See Handling interruptions for the client-side audio flush pattern.
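The two events above can be handled on the client roughly as follows. This is a minimal sketch, not the documented client pattern: it assumes each server event arrives as a dict with a "type" key, and PlaybackQueue and handle_server_event are hypothetical names introduced only for illustration. Only the event names and the status, interrupted, and text fields come from this page.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybackQueue:
    """Hypothetical stand-in for the client's buffer of queued agent audio."""
    chunks: list = field(default_factory=list)

    def flush(self) -> int:
        """Drop all queued audio and return how many chunks were discarded."""
        dropped = len(self.chunks)
        self.chunks.clear()
        return dropped


def handle_server_event(event: dict, playback: PlaybackQueue, history: list) -> None:
    """React to the interruption-related events.

    Assumption: events are dicts with a "type" key naming the event.
    """
    kind = event.get("type")
    if kind == "reply.done" and event.get("status") == "interrupted":
        # The user barged in: discard audio that has not been played yet.
        playback.flush()
    elif kind == "transcript.agent":
        # When interrupted is true, text is already trimmed by the server
        # to what the user heard, so it can be stored as-is.
        history.append({
            "role": "agent",
            "text": event.get("text", ""),
            "interrupted": bool(event.get("interrupted", False)),
        })
```

Storing the trimmed text rather than the full planned reply keeps the client-side conversation history consistent with what the user actually heard.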
The API also decides when the user has finished a turn from what they said, not from a fixed silence window. Because it uses the meaning of the user’s speech, the agent doesn’t cut the user off mid-thought and doesn’t wait through a long pause after the user has clearly finished.
A typical user turn produces:
input.speech.started when the user begins speaking.
transcript.user.delta events with partial transcripts as the user keeps talking.
input.speech.stopped when the turn is detected as ended.
transcript.user with the final transcript.
reply.started as the agent begins generating a response.
You don’t need to send any signal to end a turn. The API handles it for you.
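The event sequence above maps naturally onto a small client-side tracker, for example to drive a "listening" indicator or a live caption. The sketch below makes several assumptions: events are dicts with a "type" key, transcripts arrive in a "text" field, and each transcript.user.delta carries the latest partial transcript rather than an increment. UserTurnTracker is a hypothetical name, not part of the API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserTurnTracker:
    """Tracks one user turn at a time from the server's turn events."""
    speaking: bool = False
    partial: str = ""
    final: Optional[str] = None
    agent_replying: bool = False
    completed_turns: list = field(default_factory=list)

    def on_event(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "input.speech.started":
            # A new user turn begins: reset per-turn state.
            self.speaking = True
            self.partial = ""
            self.final = None
            self.agent_replying = False
        elif kind == "transcript.user.delta":
            # Assumption: "text" holds the newest partial transcript.
            self.partial = event.get("text", "")
        elif kind == "input.speech.stopped":
            # The server decided the turn ended; the client sent no signal.
            self.speaking = False
        elif kind == "transcript.user":
            self.final = event.get("text", "")
            self.completed_turns.append(self.final)
        elif kind == "reply.started":
            self.agent_replying = True
```

The tracker only reacts to events and never sends an end-of-turn signal, which matches the behavior described above.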
When the agent has just asked for something specific (a phone number, an email, a date, a name, a yes/no, a digit sequence, a choice from a list), turn detection adapts to the kind of answer expected. The agent doesn’t cut users off mid-answer when they pause inside a long string of digits, and it doesn’t sit waiting for more once a clean answer has clearly landed.
You’ll notice this most on:
This is on by default and adapts to the agent’s wording. There’s nothing to configure on the client. Free-form open questions (“how are you feeling today?”) fall back to the standard semantic turn detection.
End-of-turn timing adapts to the user during a conversation. When a user tends to pause mid-thought, the agent learns to give them more room before responding. When the user speaks more crisply, the agent responds more quickly. You don’t have to tune anything; the timing improves as the conversation goes on.
This is on by default. The moment you set min_silence or max_silence explicitly in session.input.turn_detection, the server respects your values and stops adapting for the rest of the session.
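Because setting either value turns adaptation off for the rest of the session, it helps to send these keys only when you mean to. The sketch below builds the nested session.input.turn_detection fragment and includes only the keys you pass. The helper name turn_detection_override is hypothetical, the min_silence <= max_silence check is an assumption about how the two values relate, and this page does not state the units or the message that carries the fragment, so check the turn detection reference for both.

```python
from typing import Optional


def turn_detection_override(
    min_silence: Optional[float] = None,
    max_silence: Optional[float] = None,
) -> dict:
    """Build a session.input.turn_detection fragment.

    Returns an empty dict when nothing is set, so the caller can skip
    sending an override and leave adaptive timing on.
    """
    if (
        min_silence is not None
        and max_silence is not None
        and min_silence > max_silence
    ):
        # Assumed sanity check: the minimum should not exceed the maximum.
        raise ValueError("min_silence must not exceed max_silence")

    turn_detection = {}
    if min_silence is not None:
        turn_detection["min_silence"] = min_silence
    if max_silence is not None:
        turn_detection["max_silence"] = max_silence

    if not turn_detection:
        return {}
    return {"session": {"input": {"turn_detection": turn_detection}}}
```

For example, turn_detection_override(max_silence=2.0) produces a fragment with only max_silence set; the value 2.0 is a placeholder, not a recommended setting.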
Semantic turn detection and interruption handling are on by default and tuned for typical conversational use cases. For most agents, the right move is to leave them alone.
If you do need to adjust sensitivity, for example to be more patient in a noisy environment or to disable barge-in entirely, you can override the underlying VAD knobs via session.input.turn_detection:
See Session configuration → Turn detection for the full reference, default values, and example payloads.
If the agent keeps interrupting itself, the microphone is probably picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.