AssemblyAI’s turn detection model uses a neural network to detect when someone has finished speaking. Unlike traditional voice activity detection that only listens for silence, our model understands the meaning and flow of speech to make better decisions about when a turn has ended.
The model has two ways to detect end-of-turn: a model-based (semantic) check and a silence-based fallback.
When either method detects an end-of-turn, the model returns end_of_turn=True in the response.
This approach addresses common voice agent problems, such as treating a speaker's mid-sentence pause as a completed turn.
Ends turns very quickly, optimized for short responses and rapid back-and-forth.
Recommended use cases: Agent Assist, IVR replacements, Retail/E-commerce (order confirmations, delivery status), Telecom (outage reporting, yes/no checks)
Provides a natural middle ground, allowing enough pause for most conversational turns without feeling sluggish or overly eager.
Recommended use cases: Customer Support, Tech Support/SaaS, Financial Services (account inquiries, balance checks), Travel & Hospitality, Education, Government Services
Holds the floor longer, optimized for reflective or complex speech where users may pause to think before finishing.
Recommended use cases: Healthcare, Mental Health Support, Sales & Consulting, Legal & Insurance, Language Learning
These configurations are just starting points and can be fine-tuned based on your specific use case. Reach out to support@assemblyai.com for help.
The turn detection model uses a neural network to detect when someone has finished speaking. It has two ways to detect end-of-turn:
Model-based (semantic) detection triggers when all conditions are met:
- End-of-turn confidence exceeds end_of_turn_confidence_threshold (default: 0.4, user configurable)
- min_turn_silence milliseconds of silence must pass (default: 400 ms, user configurable)
- 80 ms since the last end-of-turn, which ensures at least one word (default: 80 ms, internal)
- turn.words has been finalized
Silence-based detection triggers when all conditions are met:
- End-of-turn confidence is below end_of_turn_confidence_threshold (default: 0.4, user configurable)
- max_turn_silence milliseconds of silence must pass (default: 1280 ms, user configurable)
- 80 ms since the last end-of-turn, which ensures at least one word (default: 80 ms, internal)
- turn.words has been finalized
To disable model-based turn detection, you have 2 options:
Set end_of_turn_confidence_threshold to 1. This will cause the model to end a turn after a pre-determined amount of silence (based on max_turn_silence).
Set end_of_turn_confidence_threshold to 0. This will cause the model to force end-of-turn as soon as silence is detected (based on min_turn_silence).
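The two detection paths, and the effect of the two threshold extremes, can be restated as a small sketch. This is an illustration only, not the service's implementation: the function name, its arguments, the TurnConfig class, and the use of >= for the comparisons are all assumptions. The silence-only path is simplified to ignore confidence, since the model-based path already covers the high-confidence case whenever min_turn_silence is below max_turn_silence.

```python
from dataclasses import dataclass


@dataclass
class TurnConfig:
    # Defaults taken from the conditions listed in this section.
    end_of_turn_confidence_threshold: float = 0.4
    min_turn_silence: int = 400   # milliseconds
    max_turn_silence: int = 1280  # milliseconds


# Internal guard from the text: 80 ms since the last end-of-turn.
MIN_SPEECH_MS = 80


def is_end_of_turn(confidence, silence_ms, speech_ms, words_finalized, cfg=None):
    """Hypothetical restatement of the two detection paths."""
    cfg = cfg or TurnConfig()
    # Conditions shared by both paths.
    if speech_ms < MIN_SPEECH_MS or not words_finalized:
        return False
    # Model-based path: confident prediction plus a short silence.
    semantic = (
        confidence >= cfg.end_of_turn_confidence_threshold
        and silence_ms >= cfg.min_turn_silence
    )
    # Silence-based path: a long silence ends the turn on its own.
    silence_only = silence_ms >= cfg.max_turn_silence
    return semantic or silence_only
```

Under these assumptions, a threshold of 1 leaves only the silence-based path (any confidence below 1 never qualifies), while a threshold of 0 makes every silence of at least min_turn_silence end the turn.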
Setting end_of_turn_confidence_threshold to 0 completely disables all semantic turn detection. The model will force a turn boundary at every silence that exceeds min_turn_silence, regardless of whether the speaker has actually finished their thought.
This is especially problematic for medical dictation, conversational audio, and other use cases where speakers naturally pause mid-sentence to think. With this setting, those pauses are incorrectly treated as completed turns, fragmenting the transcript.
If you want longer, more patient turn detection without disabling the semantic model, increase end_of_turn_confidence_threshold toward 1 instead, and raise min_turn_silence and max_turn_silence. See the Conservative quick start configuration for recommended values.
If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a ForceEndpoint event to the server to force the end of a turn and receive the final turn transcript.
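As a rough sketch of what this might look like with a custom VAD, the helper below builds a ForceEndpoint message once your own silence limit is reached. The payload shape (a JSON object with a "type" field), the function names, and the send callable are assumptions for illustration; check the streaming API reference for the exact message format.

```python
import json


def force_endpoint_message() -> str:
    # Assumed payload shape: a JSON object whose "type" is the event name.
    return json.dumps({"type": "ForceEndpoint"})


def maybe_force_endpoint(silence_ms: int, limit_ms: int, send) -> bool:
    """Send ForceEndpoint when a custom VAD has seen enough silence.

    `send` is any callable that writes a text frame to the open
    connection (hypothetical; supplied by your WebSocket client).
    """
    if silence_ms >= limit_ms:
        send(force_endpoint_message())
        return True
    return False
```

Passing the sender in as a callable keeps the sketch independent of any particular WebSocket library.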
Use end_of_turn to detect turn completion. Do not use turn_is_formatted for end-of-turn detection. The only reliable way to detect turn completion is end_of_turn: true.
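A minimal sketch of this rule, assuming each turn message has already been parsed into a dict: the end_of_turn and turn_is_formatted fields come from the text above, while the transcript field and the function name are assumptions for illustration.

```python
def completed_transcripts(messages):
    """Return transcripts of completed turns only.

    A turn counts as complete only when end_of_turn is True;
    turn_is_formatted is deliberately ignored.
    """
    done = []
    for msg in messages:
        if msg.get("end_of_turn") is True:
            # "transcript" is an assumed field name.
            done.append(msg.get("transcript", ""))
    return done
```

For example, a message with turn_is_formatted set but end_of_turn false is skipped, so partial turns never reach downstream logic.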