Streaming
AssemblyAI’s Streaming Speech-to-Text (STT) allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds.
Supported languages
Streaming Speech-to-Text is only available for English.
Getting started
Get started with any of our official SDKs:
Getting Started Guides
If your programming language isn’t supported yet, see the WebSocket API.
Audio requirements
The audio format must conform to the following requirements:
- PCM16 or Mu-law encoding (See Specify the encoding)
- A sample rate that matches the value of the supplied sample_rate parameter
- Single-channel
- 100 to 2000 milliseconds of audio per message
Audio segments with a duration between 100 ms and 450 ms produce the best results in transcription accuracy.
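For illustration, here is a minimal sketch of how you might split raw PCM16 audio into messages that satisfy these constraints (the helper name and 200 ms chunk size are illustrative, not part of the API):

```python
# Sketch: split raw single-channel PCM16 audio into ~200 ms messages.
# PCM16 uses 2 bytes per sample, so bytes per message =
# sample_rate * 2 * (chunk_ms / 1000).

def chunk_pcm16(audio_bytes: bytes, sample_rate: int = 16_000, chunk_ms: int = 200):
    bytes_per_chunk = int(sample_rate * 2 * chunk_ms / 1000)
    for offset in range(0, len(audio_bytes), bytes_per_chunk):
        yield audio_bytes[offset:offset + bytes_per_chunk]
```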
Specify the encoding
By default, transcriptions expect PCM16 encoding. If you want to use Mu-law encoding, you must set the encoding parameter to aai.AudioEncoding.pcm_mulaw.
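A minimal sketch with the Python SDK (the RealtimeTranscriber constructor, its on_data/on_error callbacks, and connect() are assumptions about the SDK surface; only the encoding parameter comes from this page):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Sketch: request Mu-law decoding instead of the default PCM16.
transcriber = aai.RealtimeTranscriber(
    sample_rate=8_000,
    encoding=aai.AudioEncoding.pcm_mulaw,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("Error:", error),
)
transcriber.connect()
```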
Add custom vocabulary
You can add up to 2,500 characters of custom vocabulary to boost the transcription probability of those words and phrases.
For this, create a list of strings and set the word_boost parameter.
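A minimal sketch with the Python SDK (the constructor and callbacks are assumptions; word_boost is the documented parameter):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Sketch: boost domain-specific terms (up to 2,500 characters total).
word_boost = ["aws", "azure", "google cloud"]

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    word_boost=word_boost,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("Error:", error),
)
transcriber.connect()
```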
If you’re not using one of the SDKs, you must ensure that the word_boost parameter is a JSON array that is URL encoded. See this code example.
Authenticate with a temporary token
If you need to authenticate on the client, you can avoid exposing your API key by using temporary authentication tokens. You should generate this token on your server and pass it to the client.
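A minimal sketch of this flow (the /v2/realtime/token endpoint, the expires_in field, and the transcriber's token argument are assumptions about the current API):

```python
import assemblyai as aai
import requests

# Server side (sketch): exchange your API key for a short-lived token.
response = requests.post(
    "https://api.assemblyai.com/v2/realtime/token",
    headers={"authorization": "YOUR_API_KEY"},
    json={"expires_in": 3600},  # token lifetime in seconds
)
token = response.json()["token"]

# Client side (sketch): authenticate with the temporary token instead of the API key.
transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    token=token,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("Error:", error),
)
transcriber.connect()
```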
Manually end current utterance
To manually end an utterance, call force_end_utterance().
Manually ending an utterance immediately produces a final transcript.
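A minimal sketch, assuming transcriber is an already connected Python SDK streaming transcriber:

```python
# Sketch: force the current utterance to end and emit a final transcript now,
# instead of waiting for the silence threshold to elapse.
transcriber.force_end_utterance()
```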
Configure the threshold for automatic utterance detection
You can configure the threshold for how long to wait before ending an utterance.
To change the threshold, you can specify the end_utterance_silence_threshold parameter when initializing the streaming transcriber. After the session has started, you can change the threshold by calling configure_end_utterance_silence_threshold().
By default, Streaming Speech-to-Text ends an utterance after 700 milliseconds of silence. You can configure the duration threshold any number of times during a session after the session has started. The valid range is between 0 and 20,000 milliseconds.
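A minimal sketch with the Python SDK (values are in milliseconds; the constructor and callbacks are assumptions, while the parameter and method names come from this page):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Sketch: wait 500 ms of silence before ending an utterance.
transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    end_utterance_silence_threshold=500,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("Error:", error),
)
transcriber.connect()

# Sketch: raise the threshold to 1,000 ms mid-session.
transcriber.configure_end_utterance_silence_threshold(1000)
```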
Disable partial transcripts
If you’re only using the final transcript, you can disable partial transcripts to reduce network traffic.
To disable partial transcripts, set the disable_partial_transcripts parameter to True.
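A minimal sketch with the Python SDK (the constructor and callbacks are assumptions; disable_partial_transcripts is the documented parameter):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Sketch: only receive final transcripts for each utterance.
transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    disable_partial_transcripts=True,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("Error:", error),
)
transcriber.connect()
```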
Enable extra session information
If you enable extra session information, the client receives a RealtimeSessionInformation message right before receiving the session termination message.
To enable it, define a callback function to handle the event and configure the on_extra_session_information parameter.
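A minimal sketch with the Python SDK (the audio_duration_seconds field and the constructor shape are assumptions; the callback parameter name comes from this page):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

# Sketch: log the extra session information received just before termination.
def on_extra_session_information(info: aai.RealtimeSessionInformation):
    print("Audio duration:", info.audio_duration_seconds)

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=lambda transcript: print(transcript.text),
    on_error=lambda error: print("Error:", error),
    on_extra_session_information=on_extra_session_information,
)
transcriber.connect()
```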
Learn more
To learn about using Streaming Speech-to-Text, see the following resources: