Streaming v2 (Legacy) | AssemblyAI

This documentation is for a previous, legacy version, of our Streaming STT feature. If you are looking for Streaming STT please use our latest Universal Streaming model, which you can learn more about here

Need to migrate to Universal-Streaming from this old model? Check out this migration guide!

AssemblyAI’s Streaming Speech-to-Text (STT) allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds.

Supported languages

Streaming Speech-to-Text is only available for English.

Getting started

Check out our getting started guides:

Getting Started Guides

If your programming language isn’t supported yet, see the WebSocket API:

Streaming API reference

Audio requirements

The audio format must conform to the following requirements:

PCM16 or Mu-law encoding (See Specify the encoding)
A sample rate that matches the value of the supplied sample_rate parameter
Single-channel
100 to 2000 milliseconds of audio per message

Audio segments with a duration between 100 ms and 450 ms produce the best results in transcription accuracy.

Specify the encoding

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

By default, transcriptions expect PCM16 encoding. If you want to use Mu-law encoding, you must set the encoding parameter to aai.AudioEncoding.pcm_mulaw:

1 transcriber = aai.RealtimeTranscriber(
2     ...,
3     encoding=aai.AudioEncoding.pcm_mulaw
4 )

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

Encoding	SDK Parameter	Description
PCM16 (default)	`aai.AudioEncoding.pcm_s16le`	PCM signed 16-bit little-endian.
Mu-law	`aai.AudioEncoding.pcm_mulaw`	PCM Mu-law.

Add custom vocabulary

You can add up to 2500 characters of custom vocabulary to boost their transcription probability.

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

For this, create a list of strings and set the word_boost parameter:

1 transcriber = aai.RealtimeTranscriber(
2     ...,
3     word_boost=["aws", "azure", "google cloud"]
4 )

If you’re not using one of the SDKs, you must ensure that the word_boost parameter is a JSON array that is URL encoded. See this code example.

Authenticate with a temporary token

If you need to authenticate on the client, you can avoid exposing your API key by using temporary authentication tokens. You should generate this token on your server and pass it to the client.

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

cURL

To generate a temporary token, call aai.RealtimeTranscriber.create_temporary_token().

Use the expires_in parameter to specify the session duration in seconds for which the token will remain valid.

1 token = aai.RealtimeTranscriber.create_temporary_token(
2     expires_in=60
3 )

The expiration time must be a value between 60 and 360000 seconds.

The client should retrieve the token from the server and use the token to authenticate the transcriber.

Each token has a one-time use restriction and can only be used for a single session. Any usage associated with a temporary token will be attributed to the API key that generated it.

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

To use it, specify the token parameter when initializing the streaming transcriber.

1 transcriber = aai.RealtimeTranscriber(
2     ...,
3     token=token
4 )

Manually end current utterance

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

To manually end an utterance, call force_end_utterance():

1 transcriber.force_end_utterance()

Manually ending an utterance immediately produces a final transcript.

Configure the threshold for automatic utterance detection

You can configure the threshold for how long to wait before ending an utterance.

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

To change the threshold, you can specify the end_utterance_silence_threshold parameter when initializing the streaming transcriber.

After the session has started, you can change the threshold by calling configure_end_utterance_silence_threshold().

1 transcriber = aai.RealtimeTranscriber(
2     ...,
3     end_utterance_silence_threshold=500
4 )
5 
6 # after connecting
7 transcriber.configure_end_utterance_silence_threshold(300)

By default, Streaming Speech-to-Text ends an utterance after 700 milliseconds of silence. You can configure the duration threshold any number of times during a session after the session has started. The valid range is between 0 and 20000.

Disable partial transcripts

If you’re only using the final transcript, you can disable partial transcripts to reduce network traffic.

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

To disable partial transcripts, set the disable_partial_transcripts parameter to True.

1 transcriber = aai.RealtimeTranscriber(
2     ...,
3     disable_partial_transcripts=True
4 )

Enable extra session information

Python SDK

Python

TypeScript SDK

TypeScript

C#

Ruby

If you enable extra session information, the client receives a RealtimeSessionInformation message right before receiving the session termination message.

To enable it, define a callback function to handle the event and configure the on_extra_session_information parameter.

1 # Define a callback to handle the session information message
2 def on_extra_session_information(data: aai.RealtimeSessionInformation):
3     print(data.audio_duration_seconds)
4 
5 # Configure the RealtimeTranscriber
6 transcriber = aai.RealtimeTranscriber(
7     ...,
8     on_extra_session_information=on_extra_session_information,
9 )

Best practices

Here are some best practices to get the best results from Streaming Speech-to-Text:

Audio quality

Use a sample rate of at least 16000 Hz for better transcription accuracy. Higher quality audio input generally leads to better results.
For noisy environments, see our Cookbook example for implementing noise reduction in your streaming pipeline.

Connection management

Keep the WebSocket connection open for the entire duration of a session instead of frequently reconnecting. Reconnecting adds latency overhead and can impact real-time performance.
For client-side implementations of our Streaming API, we recommend first creating a temporary authentication token using a temporary token to avoid exposing your API key.

Optimizing for latency

You can optimize for lower latency in several ways:

Use partial transcripts instead of waiting for final transcripts. Partial transcripts provide faster feedback but may be less accurate. See our Cookbook example for more information.
Configure the threshold for silence between speaker turns using end_utterance_silence_threshold to control when utterances are finalized.
If you don’t need real-time feedback, you can disable partial transcripts to reduce network traffic.

Learn more

To learn about using Streaming Speech-to-Text, see the following resources: