Voice Agent API

Turn detection and interruptions

How the Voice Agent API decides when the user has finished speaking, and when they're trying to interrupt.

The Voice Agent API handles turn detection and interruptions automatically using a multimodal model that looks at both the live audio stream and the transcript text returned by Universal-3 Pro. Decisions are semantic, based on the meaning of what the user actually said, not just on silence or volume.

You don’t need to wire anything in or configure it on your end. You get great out-of-the-box performance for both turn-taking and barge-in.


How it works

Traditional voice agents rely on voice activity detection (VAD) and silence thresholds to decide when a user is done talking or trying to interrupt. That works for clean speech, but it falls apart in real conversations: the agent gets cut off by a quick “uh-huh”, or it fails to react when the user clearly wants it to stop.

The Voice Agent API uses a multimodal turn detection and interruption model that takes two inputs:

  • Audio: the live PCM stream from the user.
  • Transcript text: the partial and final transcripts produced by Universal-3 Pro.

Combining these signals lets the model make semantic decisions about what the user is actually doing, not just whether they’re making noise.


Semantic interruptions

While the agent is speaking, the model classifies user speech as either a back-channel or a true interruption.

Back-channeling

Short verbal acknowledgements that show the user is engaged but not trying to take the floor. The agent keeps speaking and the API does not emit an interruption.

Examples:

  • “Uh-huh”
  • “Okay”
  • “Awesome”
  • “Yeah, makes sense”
  • “Mm-hmm”

True interruptions

Phrases that signal the user wants the agent to stop. The API immediately interrupts the agent.

Examples:

  • “Wait, stop”
  • “Sorry, that’s not right”
  • “Okay, wait a minute”
  • “Hold on”

When a true interruption is detected, the server emits:

  • reply.done with status: "interrupted"
  • transcript.agent with interrupted: true and text trimmed to what the user actually heard before being cut off.

See Handling interruptions for the client-side audio flush pattern.
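As a rough illustration, the client-side reaction can be folded into a single event handler. The event names (reply.done, transcript.agent) and fields (status, interrupted, text) come from this page; the state shape and the flush logic are illustrative assumptions, not a real SDK.

```python
# Hypothetical sketch: reacting to a true interruption on the client.
# The event names and fields are from this page; the state dict and
# flush behavior are assumptions for illustration only.

def handle_event(event, state):
    """Update client playback state for one server event."""
    if event["type"] == "reply.done" and event.get("status") == "interrupted":
        # Stop playback and flush any agent audio not yet played.
        state["playing"] = False
        state["audio_queue"].clear()
    elif event["type"] == "transcript.agent" and event.get("interrupted"):
        # The server already trimmed the text to what the user heard.
        state["history"].append(event["text"])
    return state

state = {"playing": True, "audio_queue": [b"chunk1", b"chunk2"], "history": []}
for ev in [
    {"type": "reply.done", "status": "interrupted"},
    {"type": "transcript.agent", "interrupted": True, "text": "The capital of France is"},
]:
    state = handle_event(ev, state)
```

The key design point is that the trimmed transcript.agent text, not the full generated reply, is what belongs in conversation history, so the agent's context matches what the user actually heard.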


Semantic turn detection

The same multimodal model decides when the user has finished a turn. Instead of waiting for a fixed silence window, it uses the meaning of what the user said to judge whether they're done, so the agent neither cuts the user off mid-thought nor sits through long pauses after they've clearly finished.

A typical user turn produces:

  1. input.speech.started when the user begins speaking.
  2. transcript.user.delta events with partial transcripts as the user keeps talking.
  3. input.speech.stopped when the model decides the turn has ended.
  4. transcript.user with the final transcript.
  5. reply.started as the agent begins generating a response.

You don’t need to send any signal to end a turn. The model handles it for you.
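The lifecycle above can be sketched as a small fold over the event stream. The event names match this page; the stream shape (a list of dicts) and the handler itself are assumptions for illustration, not a real SDK.

```python
# Illustrative sketch: consuming one user turn from the event stream.
# Event names are from this page; everything else is assumed.

def consume_turn(events):
    """Collect partial transcripts and the final user transcript."""
    partials, final = [], None
    for ev in events:
        if ev["type"] == "transcript.user.delta":
            partials.append(ev["text"])   # e.g. drive live captions
        elif ev["type"] == "transcript.user":
            final = ev["text"]            # authoritative final transcript
        elif ev["type"] == "reply.started":
            break                         # the agent has taken the turn
    return partials, final

events = [
    {"type": "input.speech.started"},
    {"type": "transcript.user.delta", "text": "what's the"},
    {"type": "transcript.user.delta", "text": "what's the weather"},
    {"type": "input.speech.stopped"},
    {"type": "transcript.user", "text": "What's the weather today?"},
    {"type": "reply.started"},
]
partials, final = consume_turn(events)
```

Note that the client only observes the turn boundary (input.speech.stopped); it never has to signal one.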


Configuration

Semantic turn detection and interruption handling are on by default and tuned for typical conversational use cases. For most agents, the right move is to leave them alone.

If you do need to adjust sensitivity, for example to be more patient in a noisy environment or to disable barge-in entirely, you can override the underlying VAD knobs via session.input.turn_detection:

  • vad_threshold: Speech detection sensitivity (0.0–1.0). Lower values are more sensitive to speech.
  • min_silence: Minimum silence, in milliseconds, to consider a confident end-of-turn.
  • max_silence: Maximum silence, in milliseconds, before forcing end-of-turn.
  • interrupt_response: Whether user speech can interrupt the agent. Set to false to disable barge-in.
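For orientation only, an override payload might look like the following. The field names and the session.input.turn_detection path are from this page; the wrapper shape and the specific values are illustrative, not defaults (see the Session configuration reference for those).

```json
{
  "session": {
    "input": {
      "turn_detection": {
        "vad_threshold": 0.6,
        "min_silence": 800,
        "max_silence": 2000,
        "interrupt_response": true
      }
    }
  }
}
```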

See Session configuration → Turn detection for the full reference, default values, and example payloads.

If the agent keeps interrupting itself, the microphone is picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.