Turn detection

Overview

AssemblyAI’s turn detection model uses a neural network to detect when someone has finished speaking. Unlike traditional voice activity detection that only listens for silence, our model understands the meaning and flow of speech to make better decisions about when a turn has ended.

The model has two ways to detect end-of-turn:

Semantic detection - The model predicts when speech naturally ends based on meaning and context
Acoustic detection - Traditional silence-based detection as a backup using VAD

When either method detects an end-of-turn, the response has end_of_turn set to true.

This approach solves common voice agent problems:

No more awkward long pauses waiting for silence thresholds
No more cutting people off mid-sentence during natural pauses
Better handling of “um” and thinking pauses
Easy fine-tuning for your use case

Quick start configurations

Aggressive

Ends turns very quickly, optimized for short responses and rapid back-and-forth.

const streamingConfig = {
  end_of_turn_confidence_threshold: 0.4,
  min_turn_silence: 160,
  max_turn_silence: 400,
};

Recommended use cases: Agent Assist, IVR replacements, Retail/E-commerce (order confirmations, delivery status), Telecom (outage reporting, yes/no checks)

Balanced

Provides a natural middle ground, allowing enough pause for most conversational turns without feeling sluggish or overly eager.

const streamingConfig = {
  end_of_turn_confidence_threshold: 0.4,
  min_turn_silence: 400,
  max_turn_silence: 1280,
};

Recommended use cases: Customer Support, Tech Support/SaaS, Financial Services (account inquiries, balance checks), Travel & Hospitality, Education, Government Services

Conservative

Holds the floor longer, optimized for reflective or complex speech where users may pause to think before finishing.

const streamingConfig = {
  end_of_turn_confidence_threshold: 0.7,
  min_turn_silence: 800,
  max_turn_silence: 3600,
};

Recommended use cases: Healthcare, Mental Health Support, Sales & Consulting, Legal & Insurance, Language Learning

These configurations are just starting points and can be fine-tuned based on your specific use case. Reach out to support@assemblyai.com for help.
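As a rough illustration, the three presets above can be kept in one table and serialized when a session is opened. The Python sketch below assumes the parameters are passed as URL query parameters; the `PRESETS` table and the `preset_query` helper are hypothetical names for this example, so check the API reference for how your SDK or connection actually accepts these settings.

```python
from urllib.parse import urlencode

# Values copied from the quick start configurations above.
PRESETS = {
    "aggressive": {
        "end_of_turn_confidence_threshold": 0.4,
        "min_turn_silence": 160,
        "max_turn_silence": 400,
    },
    "balanced": {
        "end_of_turn_confidence_threshold": 0.4,
        "min_turn_silence": 400,
        "max_turn_silence": 1280,
    },
    "conservative": {
        "end_of_turn_confidence_threshold": 0.7,
        "min_turn_silence": 800,
        "max_turn_silence": 3600,
    },
}


def preset_query(name, **overrides):
    """Return a preset as a URL-encoded query string (hypothetical helper).

    Keyword overrides replace individual preset values, which is one way
    to fine-tune a starting point for a specific use case.
    """
    params = {**PRESETS[name], **overrides}
    return urlencode(params)
```

For example, `preset_query("balanced", max_turn_silence=2000)` keeps the Balanced threshold and minimum silence but waits longer before the silence-based fallback ends the turn.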

How it works

End-of-turn can be triggered by either of two detection paths, each with its own set of conditions:

Semantic Detection

Triggers when all conditions are met:

Model confidence threshold

Model predicts semantic end-of-turn confidence greater than end_of_turn_confidence_threshold
Default: 0.4 (user configurable)

Minimum silence duration

After the end of speech detected by VAD, min_turn_silence milliseconds must pass
Default: 400 ms (user configurable)

Minimum speech duration

The user must speak for at least 80 ms since the last end-of-turn (ensures at least one word)
Set to 80 ms (internal)

Word finalized

Last word in turn.words has been finalized
Internal configuration
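The four semantic conditions above can be read as a single boolean check. The following Python sketch only restates the documented rules; it is not the service's implementation, and the function name and argument names are made up for illustration.

```python
MIN_SPEECH_MS = 80  # internal value stated above; not user configurable


def semantic_end_of_turn(confidence, silence_ms, speech_ms, last_word_final,
                         threshold=0.4, min_turn_silence=400):
    """Return True when all four semantic conditions hold.

    confidence      -- model's end-of-turn confidence (0 to 1)
    silence_ms      -- silence elapsed since VAD detected the end of speech
    speech_ms       -- speech duration since the last end-of-turn
    last_word_final -- whether the last word in turn.words is finalized
    """
    return bool(
        confidence > threshold               # model confidence threshold
        and silence_ms >= min_turn_silence   # minimum silence duration
        and speech_ms >= MIN_SPEECH_MS       # minimum speech duration
        and last_word_final                  # word finalized
    )
```

With the defaults, a confident prediction still has to wait for 400 ms of silence, and no amount of silence helps until the last word is finalized.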

Acoustic Detection

Triggers when all conditions are met:

Model confidence threshold

Model predicts semantic end-of-turn confidence less than end_of_turn_confidence_threshold
Default: 0.4 (user configurable)

Maximum silence duration

After the end of speech detected by VAD, max_turn_silence milliseconds must pass
Default: 1280 ms (user configurable)

Minimum speech duration

The user must speak for at least 80 ms since the last end-of-turn (ensures at least one word)
Set to 80 ms (internal)

Word finalized

Last word in turn.words has been finalized
Internal configuration
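Taken together, the two paths mean that the model's confidence decides which silence timer applies. The Python sketch below illustrates that reading under two simplifying assumptions that are not from the documentation: the confidence stays constant during the silence, and a confidence exactly equal to the threshold falls through to the acoustic path. The function name is hypothetical.

```python
def silence_needed_ms(confidence, threshold=0.4,
                      min_turn_silence=400, max_turn_silence=1280):
    """Return how much silence (ms) is needed before the turn can end.

    The minimum speech duration and word finalization conditions apply
    to both paths and are left out here for brevity.
    """
    if confidence > threshold:
        # Semantic path: confident prediction, short wait.
        return min_turn_silence
    # Acoustic path: low confidence, fall back to the longer silence timer.
    # Treating confidence == threshold as acoustic is an assumption.
    return max_turn_silence
```

With the defaults, a confident prediction ends the turn after 400 ms of silence, while an uncertain one waits the full 1280 ms.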

Disable turn detection

To disable model-based turn detection, you have two options:

Set a VAD-based silence latency for each turn: Set end_of_turn_confidence_threshold to 1. This will cause the model to end a turn after a pre-determined amount of silence (based on max_turn_silence).
- Most useful when you are using the model as your VAD for silence-based turns.
Return turns as fast as possible on silence: Set end_of_turn_confidence_threshold to 0. This will cause the model to force end-of-turn as soon as silence is detected (based on min_turn_silence).
- Most useful when you are using a custom turn detection model on top of the transcript results.

Do not set end_of_turn_confidence_threshold to 0 unless you have a custom turn detection model

Setting end_of_turn_confidence_threshold to 0 completely disables all semantic turn detection. The model will force a turn boundary at every silence that exceeds min_turn_silence, regardless of whether the speaker has actually finished their thought.

This is especially problematic for medical dictation, conversational audio, and other use cases where speakers naturally pause mid-sentence to think. With this setting, those pauses are incorrectly treated as completed turns, fragmenting the transcript.

If you want longer, more patient turn detection without disabling the semantic model, increase end_of_turn_confidence_threshold toward 1 instead, and raise min_turn_silence and max_turn_silence. See the Conservative quick start configuration for recommended values.
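The two options and the warning above can be summarized in a small configuration check. This is a Python sketch only: the silence values are illustrative, and `check_config` is a hypothetical helper, not part of any SDK.

```python
# Option 1: silence-only turns. A threshold of 1 means the semantic path
# never fires, so turns end after max_turn_silence.
SILENCE_ONLY = {"end_of_turn_confidence_threshold": 1, "max_turn_silence": 1280}

# Option 2: return turns as fast as possible. A threshold of 0 forces a
# turn boundary at every silence longer than min_turn_silence.
FASTEST = {"end_of_turn_confidence_threshold": 0, "min_turn_silence": 160}


def check_config(config, has_custom_turn_model=False):
    """Return a list of warnings for a config (hypothetical helper)."""
    warnings = []
    threshold = config.get("end_of_turn_confidence_threshold")
    if threshold == 0 and not has_custom_turn_model:
        warnings.append(
            "end_of_turn_confidence_threshold is 0: semantic turn detection "
            "is disabled and natural pauses will split turns"
        )
    return warnings
```

Running a check like this at startup is one way to catch a threshold of 0 that was set by mistake rather than because a custom turn detection model sits on top of the transcripts.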

If you are using your own form of turn detection (such as VAD or a custom turn detection model), you can send a ForceEndpoint event to the server to force the end of a turn and receive the final turn transcript.

ws.send(json.dumps({"type": "ForceEndpoint"}))
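As a fuller illustration, the sketch below sends ForceEndpoint after a fixed amount of silence reported by your own VAD. The `LocalEndpointer` class, its method names, and the 500 ms default are hypothetical; only the ForceEndpoint message itself comes from the documentation. The `send` argument can be any callable that accepts a string, such as the `send` method of an open WebSocket connection.

```python
import json


class LocalEndpointer:
    """Force an end-of-turn after `silence_ms` of silence from a local VAD."""

    def __init__(self, send, silence_ms=500):
        self.send = send              # e.g. ws.send on an open connection
        self.silence_ms = silence_ms  # illustrative value, tune as needed
        self.silence_started = None   # time (ms) the current silence began
        self.heard_speech = False     # avoid forcing an empty turn

    def on_vad(self, is_speech, now_ms):
        """Feed one VAD decision; return True if ForceEndpoint was sent."""
        if is_speech:
            self.heard_speech = True
            self.silence_started = None
            return False
        if not self.heard_speech:
            return False
        if self.silence_started is None:
            self.silence_started = now_ms
        if now_ms - self.silence_started >= self.silence_ms:
            self.send(json.dumps({"type": "ForceEndpoint"}))
            # Reset so the same silence does not trigger a second message.
            self.heard_speech = False
            self.silence_started = None
            return True
        return False
```

Waiting for speech before arming the timer keeps the sketch from forcing endpoints during leading silence, when there is no turn to finalize.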

Important notes

Silence-based detection can override model-based detection even with a high end_of_turn_confidence_threshold: when the model's confidence stays below the threshold, the turn still ends once max_turn_silence has passed.
Word finalization always takes precedence: endpointing won’t occur until the last word is finalized.
We define end-of-turn detection as the process of detecting the end of sustained speech activity, often called endpointing in the voice agent context.
Use end_of_turn to detect turn completion, not turn_is_formatted. The only reliable signal that a turn is complete is end_of_turn: true.
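A minimal Python sketch of the last note, assuming each server message arrives as a JSON object that carries the end_of_turn and turn_is_formatted fields. The function name is hypothetical, and the rest of the message shape is not specified here.

```python
import json


def is_turn_complete(raw_message):
    """Return True only when the message reports end_of_turn: true.

    turn_is_formatted is deliberately ignored: it is not a reliable
    signal that the turn has finished.
    """
    message = json.loads(raw_message)
    return message.get("end_of_turn") is True
```

Messages without an end_of_turn field are treated as not complete, so the check is safe to run on every message received.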
