Streaming Speech-to-Text

AssemblyAI’s Streaming Speech-to-Text (STT) allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds.
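
For example, a session might look like the following sketch. The apiKey(), sampleRate(), onFinalTranscript(), connect(), sendAudio(), and close() calls are assumptions about the Java SDK used throughout this page; see the Streaming guide for a complete, verified example.

// Minimal sketch of a streaming session; method names other than builder()/build()
// are assumptions about the Java SDK and may differ between SDK versions.
var realtimeTranscriber = RealtimeTranscriber.builder()
    .apiKey("<YOUR_API_KEY>")
    .sampleRate(16_000)
    .onFinalTranscript(transcript -> System.out.println(transcript.getText()))
    .build();

realtimeTranscriber.connect();
// realtimeTranscriber.sendAudio(audioChunk); // send 100-2000 ms of audio per call
realtimeTranscriber.close();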

Supported languages

Streaming Speech-to-Text is only available for English.

Audio requirements

The audio format must conform to the following requirements:

  • PCM16 or Mu-law encoding (See Specify the encoding)
  • A sample rate that matches the value of the supplied sample_rate parameter
  • Single-channel
  • 100 to 2000 milliseconds of audio per message

Audio segments between 100 ms and 450 ms long produce the best transcription accuracy.
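
As an illustration of capturing audio that satisfies these requirements, the sketch below uses the standard javax.sound.sampled API (not part of the AssemblyAI SDK) to record 16 kHz, 16-bit, little-endian, single-channel PCM in roughly 200 ms chunks.

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.TargetDataLine;

// 16 kHz, 16-bit signed PCM, little-endian, mono; matches the requirements above
// when the transcriber's sample_rate is set to 16000.
AudioFormat format = new AudioFormat(16_000f, 16, 1, true, false);
TargetDataLine line = AudioSystem.getTargetDataLine(format); // throws LineUnavailableException
line.open(format);
line.start();

// ~200 ms per read: 16,000 samples/s * 2 bytes * 0.2 s = 6,400 bytes,
// which falls inside the recommended 100-450 ms window.
byte[] chunk = new byte[6_400];
int bytesRead = line.read(chunk, 0, chunk.length);
// realtimeTranscriber.sendAudio(chunk); // sendAudio() is assumed from the SDK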

Specify the encoding

By default, the transcriber expects PCM16-encoded audio. If you want to use Mu-law encoding, set the encoding parameter to AudioEncoding.PCM_MULAW:

var realtimeTranscriber = RealtimeTranscriber.builder()
    ...
    .encoding(AudioEncoding.PCM_MULAW)
    .build();
Encoding           SDK Parameter              Description
PCM16 (default)    AudioEncoding.PCM_S16LE    PCM signed 16-bit little-endian.
Mu-law             AudioEncoding.PCM_MULAW    PCM Mu-law.

Add custom vocabulary

You can add up to 2500 characters of custom vocabulary to boost the likelihood that those words and phrases are transcribed.

To do so, pass a list of strings to the wordBoost() method when building the real-time transcriber.

var realtimeTranscriber = RealtimeTranscriber.builder()
    ...
    .wordBoost(List.of("aws", "azure", "google cloud"))
    .build();

If you’re not using one of the SDKs, you must ensure that the word_boost parameter is a URL-encoded JSON array. See this code example.
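
As a rough sketch of that encoding in plain Java (the WebSocket URL below is illustrative; take the exact endpoint from the API reference):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// word_boost must be a JSON array, URL-encoded into a single query parameter.
String wordBoostJson = "[\"aws\", \"azure\", \"google cloud\"]";
String encoded = URLEncoder.encode(wordBoostJson, StandardCharsets.UTF_8);
// Illustrative endpoint; confirm the exact URL and parameters in the API reference.
String url = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&word_boost=" + encoded;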

Authenticate with a temporary token

If you need to authenticate on the client, you can avoid exposing your API key by using temporary authentication tokens. You should generate this token on your server and pass it to the client.

1. Use the CreateRealtimeTemporaryTokenParams.builder() to configure the parameters for generating the token. Set the expiresIn() parameter to specify how long the token should be valid, in seconds.

var tokenResponse = client.realtime().createTemporaryToken(CreateRealtimeTemporaryTokenParams.builder()
    .expiresIn(60)
    .build()
);
The expiration time must be a value between 60 and 360000 seconds.
2. The client should retrieve the token from the server and use it to authenticate the transcriber.

Each token is single-use and valid for only one session.

To use it, specify the token parameter when initializing the streaming transcriber.

var realtimeTranscriber = RealtimeTranscriber.builder()
    ...
    .token(tokenResponse.getToken())
    .build();
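
On the server side, one possible way to hand the token from step 1 to the client is a small HTTP endpoint. The sketch below uses the JDK's built-in com.sun.net.httpserver; the /token path and plain-text response are assumptions, not part of the AssemblyAI SDK, and client refers to the SDK client from step 1.

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;

// Illustrative only: expose a freshly generated temporary token over HTTP
// so a browser or mobile client can fetch it before connecting.
var server = HttpServer.create(new InetSocketAddress(8080), 0); // throws IOException
server.createContext("/token", exchange -> {
    var response = client.realtime().createTemporaryToken(
        CreateRealtimeTemporaryTokenParams.builder().expiresIn(60).build());
    byte[] body = response.getToken().getBytes();
    exchange.sendResponseHeaders(200, body.length);
    try (var os = exchange.getResponseBody()) {
        os.write(body);
    }
});
server.start();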

Manually end current utterance

To manually end an utterance, call forceEndUtterance():

realtimeTranscriber.forceEndUtterance();

Manually ending an utterance immediately produces a final transcript.

Configure the threshold for automatic utterance detection

You can configure how long the transcriber waits during silence before ending an utterance.

To change the threshold, you can call the endUtteranceSilenceThreshold() method when building the real-time transcriber.

After the session has started, you can change the threshold by calling configureEndUtteranceSilenceThreshold().

var realtimeTranscriber = RealtimeTranscriber.builder()
    ...
    .endUtteranceSilenceThreshold(500)
    .build();

// after connecting
realtimeTranscriber.configureEndUtteranceSilenceThreshold(300);

By default, Streaming Speech-to-Text ends an utterance after 700 milliseconds of silence. You can change the threshold any number of times after the session has started. The valid range is 0 to 20000 milliseconds.

Disable partial transcripts

If you’re only using the final transcript, you can disable partial transcripts to reduce network traffic.

To disable partial transcripts, call the disablePartialTranscripts() builder method.

var realtimeTranscriber = RealtimeTranscriber.builder()
    ...
    .disablePartialTranscripts()
    .build();

Enable extra session information

The client receives a SessionInformation message right before receiving the session termination message. Configure the onSessionInformation() callback when you build the transcriber to receive the message.

var realtimeTranscriber = RealtimeTranscriber.builder()
    ...
    .onSessionInformation((info) -> System.out.println(info.getAudioDurationSeconds()))
    .build();

For best practices, see the Best Practices section in the Streaming guide.