
Turn detection and interruptions

How the Voice Agent API decides when the user has finished speaking, and when they're trying to interrupt.

Turn detection and barge-in are on by default. Decisions are semantic, based on what the user actually said, not just silence or volume. You don’t need to wire anything in or configure it on your end.

This page covers the behaviors you can rely on and the events that go with them.


Semantic interruptions

While the agent is speaking, the API classifies user speech as either a back-channel or a true interruption.

Back-channeling

Short verbal acknowledgements that show the user is engaged but not trying to take the floor. The agent keeps speaking and the API does not emit an interruption.

Examples:

  • “Uh-huh”
  • “Okay”
  • “Awesome”
  • “Yeah, makes sense”
  • “Mm-hmm”

True interruptions

Phrases that signal the user wants the agent to stop. The API immediately interrupts the agent.

Examples:

  • “Wait, stop”
  • “Sorry, that’s not right”
  • “Okay, wait a minute”
  • “Hold on”

When a true interruption is detected, the server emits:

  • reply.done with status: "interrupted"
  • transcript.agent with interrupted: true and text trimmed to what the user actually heard before the agent was cut off.

See Handling interruptions for the client-side audio flush pattern.
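
As a minimal sketch, a client might react to these two events as shown here. It assumes each server message is a JSON object that carries the event name under a "type" key; that key is an assumption, since this page only names the events and the status, interrupted, and text fields. The flush_playback callback is a hypothetical stand-in for whatever clears your local audio buffer.

```python
import json
from typing import Callable, Optional


def handle_interruption(
    raw_message: str, flush_playback: Callable[[], None]
) -> Optional[str]:
    """Return the text the user heard if this message is an interrupted transcript.

    Assumes the event name is carried in a "type" field (not confirmed here).
    """
    event = json.loads(raw_message)
    event_type = event.get("type")

    if event_type == "reply.done" and event.get("status") == "interrupted":
        # The reply was cut short: drop any agent audio still queued locally.
        flush_playback()
        return None

    if event_type == "transcript.agent" and event.get("interrupted") is True:
        # The server has already trimmed "text" to what the user heard.
        return event.get("text", "")

    # Any other event is not an interruption signal.
    return None
```

Storing the trimmed text, rather than the full reply the agent intended to say, keeps any conversation log you maintain consistent with what the user actually heard.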


Semantic turn detection

The API also decides when the user has finished a turn based on what they said, not just on silence. Instead of waiting for a fixed silence window, it uses the meaning of the user’s speech to decide whether they’re done. The agent doesn’t cut users off mid-thought, and it doesn’t wait through long pauses after they have clearly finished.

A typical user turn produces:

  1. input.speech.started when the user begins speaking.
  2. transcript.user.delta events with partial transcripts as the user keeps talking.
  3. input.speech.stopped when the turn is detected as ended.
  4. transcript.user with the final transcript.
  5. reply.started as the agent begins generating a response.

You don’t need to send any signal to end a turn. The API handles it for you.
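
The event sequence above can be followed on the client with a small state tracker. This is a sketch under assumptions: events are treated as dicts with a "type" key, and the transcript events are assumed to carry their content in a "text" key. Neither field name is confirmed on this page, and UserTurnTracker is a hypothetical helper, not part of the API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UserTurnTracker:
    """Tracks one user turn from the five events listed above.

    The "type" and "text" keys are illustrative assumptions.
    """

    speaking: bool = False
    partials: List[str] = field(default_factory=list)
    final_text: str = ""
    agent_replying: bool = False

    def on_event(self, event: Dict[str, Any]) -> None:
        kind = event.get("type")
        if kind == "input.speech.started":
            # A new turn begins: reset state left over from the previous one.
            self.speaking = True
            self.partials.clear()
            self.final_text = ""
            self.agent_replying = False
        elif kind == "transcript.user.delta":
            # Partial transcript while the user keeps talking.
            self.partials.append(event.get("text", ""))
        elif kind == "input.speech.stopped":
            # The server decided the turn ended; the client sent no signal.
            self.speaking = False
        elif kind == "transcript.user":
            self.final_text = event.get("text", "")
        elif kind == "reply.started":
            self.agent_replying = True
```

The tracker only records state. It never tells the server that a turn has ended, which matches the point that end-of-turn is decided for you.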


Smart end-of-turn for structured answers

When the agent has just asked for something specific (a phone number, an email, a date, a name, a yes/no, a digit sequence, a choice from a list), turn detection adapts to the kind of answer expected. The agent doesn’t cut users off mid-answer when they pause inside a long string of digits, and it doesn’t sit waiting for more once a clean answer has clearly landed.

You’ll notice this most on:

  • Phone numbers, account numbers, and other digit sequences
  • Email addresses
  • Dates
  • Yes/no questions and choices from a short list
  • Names, places, companies, and other named entities

This is on by default and adapts to the agent’s wording. There’s nothing to configure on the client. Free-form open questions (“how are you feeling today?”) fall back to the standard silence-based turn detection.


Adaptive endpointing

End-of-turn timing adapts to the user during a conversation. When a user tends to pause mid-thought, the agent gives them more room before responding. When the user speaks more crisply, the agent responds more quickly. You don’t have to tune anything; the timing adjusts as the conversation goes on.

This is on by default. The moment you set min_silence or max_silence explicitly in session.input.turn_detection, the server respects your values and stops adapting for the rest of the session.
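
Because this switch is implicit, it can help to check a session config before sending it. The sketch below assumes the config is a plain dict shaped like the session.input.turn_detection path, and that a key counts as "set explicitly" whenever it is present. disables_adaptive_endpointing is a hypothetical helper, not an API call.

```python
from typing import Any, Mapping


def disables_adaptive_endpointing(session: Mapping[str, Any]) -> bool:
    """Return True if this session config would stop adaptive endpointing.

    Per the rule above, only min_silence and max_silence are checked.
    A key counts as set whenever it is present (an assumption).
    """
    turn_detection = session.get("input", {}).get("turn_detection", {})
    return "min_silence" in turn_detection or "max_silence" in turn_detection
```

For example, a config that only changes vad_threshold returns False here, while one that sets min_silence returns True.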


Configuration

Semantic turn detection and interruption handling are on by default and tuned for typical conversational use cases. For most agents, the right move is to leave them alone.

If you do need to adjust sensitivity, for example to be more patient in a noisy environment or to disable barge-in entirely, you can override the underlying VAD knobs via session.input.turn_detection:

Field               Description
vad_threshold       Speech detection sensitivity (0.0–1.0). Lower = more sensitive to speech.
min_silence         Minimum silence to consider a confident end-of-turn, in milliseconds.
max_silence         Maximum silence before forcing end-of-turn, in milliseconds.
interrupt_response  Whether user speech can interrupt the agent. Set false to disable barge-in.

See Session configuration → Turn detection for the full reference, default values, and example payloads.
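
As a rough illustration, the four fields above could be assembled into a nested object that mirrors the session.input.turn_detection path. build_turn_detection is a hypothetical helper. The min/max ordering check is a sanity check added for this sketch, not a documented rule, and the message envelope used to send the object to the server is not shown on this page, so it is left out. Any values you pass are your own; this sketch does not encode the documented defaults.

```python
from typing import Any, Dict, Optional


def build_turn_detection(
    vad_threshold: Optional[float] = None,
    min_silence: Optional[int] = None,
    max_silence: Optional[int] = None,
    interrupt_response: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build a session object containing only the overrides that were given."""
    overrides: Dict[str, Any] = {}

    if vad_threshold is not None:
        # The documented range for vad_threshold is 0.0 to 1.0.
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vad_threshold must be between 0.0 and 1.0")
        overrides["vad_threshold"] = vad_threshold

    # Assumed sanity check: a minimum above the maximum makes no sense.
    if min_silence is not None and max_silence is not None:
        if min_silence > max_silence:
            raise ValueError("min_silence must not exceed max_silence")

    if min_silence is not None:
        overrides["min_silence"] = min_silence  # milliseconds
    if max_silence is not None:
        overrides["max_silence"] = max_silence  # milliseconds
    if interrupt_response is not None:
        overrides["interrupt_response"] = interrupt_response

    return {"session": {"input": {"turn_detection": overrides}}}
```

Calling build_turn_detection(interrupt_response=False) yields an object that disables barge-in and leaves min_silence and max_silence unset.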

If the agent keeps interrupting itself, the microphone is most likely picking up the agent’s own TTS output. Use headphones or switch to a browser-based client (which provides echo cancellation). See Troubleshooting for more detail.