Voice Agent API

Turn detection and interruptions

How the Voice Agent API decides when the user has finished speaking, and when they're trying to interrupt.

The Voice Agent API handles turn detection and interruptions automatically using a multimodal model that looks at both the live audio stream and the transcript text returned by Universal-3 Pro. Decisions are semantic, based on the meaning of what the user actually said, not just on silence or volume.

You don’t need to wire anything in or configure it on your end. You get great out-of-the-box performance for both turn-taking and barge-in.


How it works

Traditional voice agents rely on voice activity detection (VAD) and silence thresholds to decide when a user is done talking or trying to interrupt. That works for clean speech, but it falls apart in real conversations: the agent gets cut off by a quick “uh-huh”, or it fails to react when the user clearly wants it to stop.

The Voice Agent API uses a multimodal turn detection and interruption model that takes two inputs:

  • Audio: the live PCM stream from the user.
  • Transcript text: the partial and final transcripts produced by Universal-3 Pro.

Combining these signals lets the model make semantic decisions about what the user is actually doing, not just whether they’re making noise.


Semantic interruptions

While the agent is speaking, the model classifies user speech as either a back-channel or a true interruption.

Back-channeling

Short verbal acknowledgements that show the user is engaged but not trying to take the floor. The agent keeps speaking and the API does not emit an interruption.

Examples:

  • “Uh-huh”
  • “Okay”
  • “Awesome”
  • “Yeah, makes sense”
  • “Mm-hmm”

True interruptions

Phrases that signal the user wants the agent to stop. The API immediately interrupts the agent.

Examples:

  • “Wait, stop”
  • “Sorry, that’s not right”
  • “Okay, wait a minute”
  • “Hold on”

When a true interruption is detected, the server emits:

  • reply.done with status: "interrupted"
  • transcript.agent with interrupted: true and text trimmed to what the user actually heard before being cut off.

See Handling interruptions for the client-side audio flush pattern.
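As a rough illustration, the client-side reaction can be folded into a single event handler. The event names (reply.done, transcript.agent) and fields (status, interrupted, text) come from this page; the state shape and the flush logic are illustrative assumptions, not a real SDK.

```python
# Hypothetical sketch: reacting to a true interruption on the client.
# The event names and fields are from this page; the state dict and
# flush behavior are assumptions for illustration only.

def handle_event(event, state):
    """Update client playback state for one server event."""
    if event["type"] == "reply.done" and event.get("status") == "interrupted":
        # Stop playback and flush any agent audio not yet played.
        state["playing"] = False
        state["audio_queue"].clear()
    elif event["type"] == "transcript.agent" and event.get("interrupted"):
        # The server already trimmed the text to what the user heard.
        state["history"].append(event["text"])
    return state

state = {"playing": True, "audio_queue": [b"chunk1", b"chunk2"], "history": []}
for ev in [
    {"type": "reply.done", "status": "interrupted"},
    {"type": "transcript.agent", "interrupted": True, "text": "The capital of France is"},
]:
    state = handle_event(ev, state)
```

The key design point is that the trimmed transcript.agent text, not the full generated reply, is what belongs in conversation history, so the agent's context matches what the user actually heard.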


Semantic turn detection

The same multimodal model decides when the user has finished a turn. Instead of waiting for a fixed silence window, it uses the meaning of what the user said to judge whether they're done, so the agent neither cuts the user off mid-thought nor sits through long pauses after they've clearly finished.

A typical user turn produces:

  1. input.speech.started when the user begins speaking.
  2. transcript.user.delta events with partial transcripts as the user keeps talking.
  3. input.speech.stopped when the model decides the turn has ended.
  4. transcript.user with the final transcript.
  5. reply.started as the agent begins generating a response.

You don’t need to send any signal to end a turn. The model handles it for you.
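The lifecycle above can be sketched as a small fold over the event stream. The event names match this page; the stream shape (a list of dicts) and the handler itself are assumptions for illustration, not a real SDK.

```python
# Illustrative sketch: consuming one user turn from the event stream.
# Event names are from this page; everything else is assumed.

def consume_turn(events):
    """Collect partial transcripts and the final user transcript."""
    partials, final = [], None
    for ev in events:
        if ev["type"] == "transcript.user.delta":
            partials.append(ev["text"])   # e.g. drive live captions
        elif ev["type"] == "transcript.user":
            final = ev["text"]            # authoritative final transcript
        elif ev["type"] == "reply.started":
            break                         # the agent has taken the turn
    return partials, final

events = [
    {"type": "input.speech.started"},
    {"type": "transcript.user.delta", "text": "what's the"},
    {"type": "transcript.user.delta", "text": "what's the weather"},
    {"type": "input.speech.stopped"},
    {"type": "transcript.user", "text": "What's the weather today?"},
    {"type": "reply.started"},
]
partials, final = consume_turn(events)
```

Note that the client only observes the turn boundary (input.speech.stopped); it never has to signal one.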


Configuration

Semantic turn detection and interruption handling are on by default and tuned for typical conversational use cases. For most agents, the right move is to leave them alone.

If you do need to adjust sensitivity, for example to be more patient in a noisy environment or to disable barge-in entirely, you can override the underlying VAD knobs via session.input.turn_detection:

  • vad_threshold: Speech detection sensitivity (0.0–1.0). Lower values are more sensitive to speech.
  • min_silence: Minimum silence, in milliseconds, to consider a confident end-of-turn.
  • max_silence: Maximum silence, in milliseconds, before forcing end-of-turn.
  • interrupt_response: Whether user speech can interrupt the agent. Set to false to disable barge-in.
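For orientation only, an override payload might look like the following. The field names and the session.input.turn_detection path are from this page; the wrapper shape and the specific values are illustrative, not defaults (see the Session configuration reference for those).

```json
{
  "session": {
    "input": {
      "turn_detection": {
        "vad_threshold": 0.6,
        "min_silence": 800,
        "max_silence": 2000,
        "interrupt_response": true
      }
    }
  }
}
```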

See Session configuration → Turn detection for the full reference, default values, and example payloads.

If the agent keeps interrupting itself, the microphone is picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.