Using real-time streaming — AssemblyAI

AssemblyAI’s Streaming Speech-to-Text (STT) service allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds, and our system continues to revise these transcripts with greater accuracy over time as more context arrives.

In this guide, you’ll learn how to establish a WebSocket connection, send audio data, and receive partial and final transcription results. For more information about the expected audio format, see Audio Requirements.

Get started

Before we begin, make sure you have an AssemblyAI account and an API key. You can sign up for a free account and get your API key from your dashboard. Please note that this feature is available for paid accounts only. If you’re on the free plan, you’ll need to upgrade.

The entire source code of this guide can be viewed here.

Step-by-step instructions

Python SDK

Install the AssemblyAI Python SDK. To use the microphone stream you need to install the extras for this SDK. Mac and Linux users also need to install portaudio before installing the extras.

# (Mac)
brew install portaudio

# (Debian/Ubuntu)
apt install portaudio19-dev

pip install "assemblyai[extras]"

Import the assemblyai package and set the API key.

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

Create functions to handle different events during transcription.

def on_open(session_opened: aai.RealtimeSessionOpened):
  "This function is called when the connection has been established."

  print("Session ID:", session_opened.session_id)

def on_data(transcript: aai.RealtimeTranscript):
  "This function is called when a new transcript has been received."

  # Skip empty transcripts (e.g. silence)
  if not transcript.text:
    return

  # Final transcripts end the line; partial transcripts overwrite it in place
  if isinstance(transcript, aai.RealtimeFinalTranscript):
    print(transcript.text, end="\r\n")
  else:
    print(transcript.text, end="\r")

def on_error(error: aai.RealtimeError):
  "This function is called when an error occurs."

  print("An error occurred:", error)

def on_close():
  "This function is called when the connection has been closed."

  print("Closing Session")

Create a RealtimeTranscriber to set up the Streaming Speech-to-Text configuration.

transcriber = aai.RealtimeTranscriber(
  on_data=on_data,
  on_error=on_error,
  sample_rate=44_100,
  on_open=on_open, # optional
  on_close=on_close, # optional
)

Optional: Add up to 2,500 characters of custom vocabulary to your streaming session by passing the optional word_boost parameter to the RealtimeTranscriber.

Begin the streaming process.

# Start the connection
transcriber.connect()

# Open a microphone stream
microphone_stream = aai.extras.MicrophoneStream()

# Press CTRL+C to abort
transcriber.stream(microphone_stream)

transcriber.close()

Audio Requirements

The raw audio data must comply with a strict encoding format. To reduce latency, we don’t transcode your data; we send it directly to the model for transcription. Your audio must meet the following requirements:

16-bit signed integer PCM or mu-law encoding
A sample rate that matches the value of the sample_rate query param you supply
Single-channel
100 to 2000 milliseconds of audio per message

Audio segments with a duration between 100 ms and 450 ms produce the best results in transcription accuracy.
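
As a rough illustration of the message-size requirement above, the following sketch computes how many bytes of single-channel audio correspond to a given duration. The helper name `chunk_size_bytes` is hypothetical and not part of any AssemblyAI SDK, and the mu-law case assumes the usual 8 bits per sample.

```python
def chunk_size_bytes(sample_rate: int, duration_ms: int, bytes_per_sample: int = 2) -> int:
    """Return the number of bytes in `duration_ms` of single-channel audio.

    bytes_per_sample is 2 for 16-bit PCM; 1 is assumed for mu-law.
    """
    # The guide allows 100 to 2000 ms of audio per message.
    if not 100 <= duration_ms <= 2000:
        raise ValueError("each message must carry 100 to 2000 ms of audio")
    return sample_rate * duration_ms * bytes_per_sample // 1000


# 100 ms of 16 kHz, 16-bit PCM
print(chunk_size_bytes(16_000, 100))  # prints 3200
```

A value like this can serve as the read size when pulling audio from a file or microphone buffer.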

Specifying the encoding

By default, transcriptions expect PCM16 encoding. If you want to use mu-law encoding, you must set the encoding parameter to pcm_mulaw:

wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&encoding=pcm_mulaw

Encoding	Description
`pcm_s16le` (Default)	PCM signed 16-bit little-endian.
`pcm_mulaw`	PCM Mu-law.

Request Types

These are the types of requests that can be sent to the WebSocket API.

Opening a Session

When opening a Session you can pass the following query attributes to the WebSocket URL:

`sample_rate`

The sample rate of the streamed audio.

Example: wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000

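`word_boost`

A JSON-encoded list of custom vocabulary strings, up to 2,500 characters. See Adding Custom Vocabulary.

`encoding`

The encoding of the streamed audio: `pcm_s16le` (default) or `pcm_mulaw`. See Specifying the encoding.

`token`

A temporary authentication token, used instead of your API key when authenticating on the client. See Creating Temporary Authentication Tokens.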
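
The query attributes above can be combined into a single session URL. The sketch below is one way to do that with the standard library; the function name `build_session_url` and its defaults are illustrative assumptions, not part of the AssemblyAI SDK.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode

BASE_URL = "wss://api.assemblyai.com/v2/realtime/ws"


def build_session_url(
    sample_rate: int,
    encoding: Optional[str] = None,
    word_boost: Optional[List[str]] = None,
    token: Optional[str] = None,
) -> str:
    """Assemble the WebSocket URL, adding only the attributes that were given."""
    params = {"sample_rate": sample_rate}
    if encoding is not None:
        params["encoding"] = encoding
    if word_boost:
        # word_boost is passed as a JSON-encoded list of strings
        params["word_boost"] = json.dumps(word_boost)
    if token is not None:
        params["token"] = token
    # urlencode joins parameters with "&" and percent-encodes the values
    return f"{BASE_URL}?{urlencode(params)}"


print(build_session_url(16_000, encoding="pcm_mulaw"))
# prints wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&encoding=pcm_mulaw
```

Building the query string with `urlencode` avoids hand-written separators and keeps JSON values safely escaped.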

Sending Audio

When sending audio over the WebSocket connection, you can use the websocket’s binary mode to send raw audio data. This can be the raw data recorded directly from a microphone or read from an audio file.

# read from the microphone
data = stream.read(FRAMES_PER_BUFFER)

# binary data can be sent directly
ws.send(data)

# Note: Some WebSocket clients require that you specify the type:
# ws.send(data, opcode=websocket.ABNF.OPCODE_BINARY)

Sending audio_data via JSON is also supported but will be deprecated in the future. Use the binary mode instead.

Field	Example	Description
`audio_data`	`"UklGRtjIAABXQVZFZ"`	Raw audio data, base64 encoded.
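
For clients that still use the JSON form, a message with the `audio_data` field could be built roughly as follows. The helper name `make_audio_message` is hypothetical; binary mode remains the recommended approach.

```python
import base64
import json


def make_audio_message(chunk: bytes) -> str:
    """Wrap a raw audio chunk in the JSON form, base64-encoding the bytes."""
    encoded = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"audio_data": encoded})


# Four placeholder bytes stand in for real audio here
print(make_audio_message(b"\x00\x01\x02\x03"))  # prints {"audio_data": "AAECAw=="}
```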

Terminating a Session

When you’ve completed your session, your client should send a JSON message with the following field:

Field	Example	Description
`terminate_session`	`true`	A boolean value to communicate that you wish to end your streaming session forever.

After requesting session termination, the server will send the remaining transcript messages, followed by a SessionTerminated message.
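
A minimal sketch of the termination message, assuming the client serializes it with the standard `json` module. The helper name `make_terminate_message` is illustrative only.

```python
import json


def make_terminate_message() -> str:
    """Return the JSON text that asks the server to end the session."""
    return json.dumps({"terminate_session": True})


print(make_terminate_message())  # prints {"terminate_session": true}
```

After sending this message, the client would keep reading from the socket until the SessionTerminated message arrives, so that no remaining transcripts are lost.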

Response Types

These are the types of responses that can be received from the WebSocket API.

Session Start

Once your request is authorized and connection established, your client receives a SessionBegins message with the following JSON data:

Field	Example	Description
`message_type`	`"SessionBegins"`	Describes the type of the message.
`session_id`	`"d3e8c537-2f11-494b-b497-e59a434588bd"`	Unique identifier for the established session.
`expires_at`	`"2023-05-24T08:09:10.161850"`	Timestamp when this session will expire.

Transcripts

Our Streaming Speech-to-Text pipeline uses a two-phase transcription strategy, broken into partial and final results.

Partial Transcripts

As you send audio data to the API, the API immediately starts responding with Partial Results. The following keys are returned from the WebSocket API.

Field	Example	Description
`message_type`	`"PartialTranscript"`	Describes the type of message.
`audio_start`	`0`	Start time of audio sample relative to session start, in milliseconds.
`audio_end`	`1500`	End time of audio sample relative to session start, in milliseconds.
`confidence`	`0.987190506414702`	The confidence score of the entire transcription, between 0 and 1.
`text`	`"there is a house in new orleans"`	The partial transcript for your audio.
`words`	`[{"start": 0, "end": 440, "confidence": 1.0, "text": "there"}, ...]`	An array of objects, with the information for each word in the transcription text. Includes the `start`/`end` time (in milliseconds) of the word, the `confidence` score of the word, and the `text` (i.e. the word itself).
`created`	`"2023-05-24T08:09:10.161850"`	The timestamp for the partial transcript.

Final Transcripts

After you’ve received your partial results, our model continues to analyze incoming audio and, when it detects the end of an “utterance” (usually a pause in speech), it’ll finalize the results sent to you so far with higher accuracy, as well as add punctuation and casing to the transcription text.

The following keys are returned from the WebSocket API when Final Results are sent:

Field	Example	Description
`message_type`	`"FinalTranscript"`	Describes the type of message.
`audio_start`	`0`	Start time of audio sample relative to session start, in milliseconds.
`audio_end`	`1500`	End time of audio sample relative to session start, in milliseconds.
`confidence`	`0.997190506414702`	The confidence score of the entire transcription, between 0 and 1.
`text`	`"There is a house in New Orleans"`	The final transcript for your audio.
`words`	`[{"start": 0, "end": 440, "confidence": 1.0, "text": "There"}, ...]`	An array of objects, with the information for each word in the transcription text. Includes the `start`/`end` time (in milliseconds) of the word, the `confidence` score of the word, and the `text` (i.e. the word itself).
`created`	`"2023-05-24T08:09:10.161850"`	The timestamp for the final transcript.
`punctuated`	`true`	Whether the text has been punctuated and cased.
`text_formatted`	`true`	Whether the text has been formatted (e.g. Dollar -> $)

Session Terminated

Once the server has sent the remaining transcript messages following your termination request, your client receives a SessionTerminated message with the following JSON data:

Field	Example	Description
`message_type`	`"SessionTerminated"`	Describes the type of the message.
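
One way to handle the response types described above is to branch on `message_type`. The sketch below only uses fields listed in the tables; the function name `summarize_message` and the returned strings are illustrative assumptions.

```python
import json


def summarize_message(raw: str) -> str:
    """Parse one server message and return a short human-readable summary."""
    message = json.loads(raw)
    kind = message.get("message_type")
    if kind == "SessionBegins":
        return f"session {message['session_id']} started"
    if kind == "PartialTranscript":
        return f"partial: {message['text']}"
    if kind == "FinalTranscript":
        return f"final: {message['text']}"
    if kind == "SessionTerminated":
        return "session terminated"
    # Unknown types are reported rather than raising, so the loop can continue
    return f"unhandled message type: {kind}"


sample = '{"message_type": "FinalTranscript", "text": "There is a house in New Orleans"}'
print(summarize_message(sample))  # prints final: There is a house in New Orleans
```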

Closing and Status Codes

The WebSocket specification provides standard errors.

Our API provides application-level WebSocket errors for well-known scenarios:

| Error Condition | Status Code | Message |
| --- | --- | --- |
| bad sample rate | 4000 | “Sample rate must be a positive integer” |
| auth failed | 4001 | “Not Authorized” |
| insufficient funds | 4002 | “Insufficient Funds” |
| free tier user | 4003 | “This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account” |
| attempt to connect to nonexistent session id | 4004 | “Session not found” |
| session expired | 4008 | “Session Expired” |
| attempt to connect to closed session | 4010 | “Session previously closed” |
| rate limited | 4029 | “Client sent audio too fast” |
| unique session violation | 4030 | “Session is handled by another WebSocket” |
| session times out | 4031 | “Session idle for too long” |
| audio too short | 4032 | “Audio duration is too short” |
| audio too long | 4033 | “Audio duration is too long” |
| audio too small to transcode | 4034 | “Audio too small to transcode” |
| bad schema | 4101 | “Endpoint received a message with an invalid schema” |
| too many streams | 4102 | “This account has exceeded the number of allowed streams” |
| reconnected | 4103 | “This session has been reconnected. This WebSocket is no longer valid” |
| word boost parameter parsing failed | 4104 | “Could not parse word boost parameter” |
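
A client may want to translate these close codes into log messages. The lookup below copies the codes and messages from the table; the names `CLOSE_CODES` and `describe_close_code` are hypothetical helpers, not part of any SDK.

```python
# Application-level close codes, copied from the table above
CLOSE_CODES = {
    4000: "Sample rate must be a positive integer",
    4001: "Not Authorized",
    4002: "Insufficient Funds",
    4003: "This feature is paid-only and requires you to add a credit card",
    4004: "Session not found",
    4008: "Session Expired",
    4010: "Session previously closed",
    4029: "Client sent audio too fast",
    4030: "Session is handled by another WebSocket",
    4031: "Session idle for too long",
    4032: "Audio duration is too short",
    4033: "Audio duration is too long",
    4034: "Audio too small to transcode",
    4101: "Endpoint received a message with an invalid schema",
    4102: "This account has exceeded the number of allowed streams",
    4103: "This session has been reconnected. This WebSocket is no longer valid",
    4104: "Could not parse word boost parameter",
}


def describe_close_code(code: int) -> str:
    """Return the documented message for a close code, or a generic fallback."""
    return CLOSE_CODES.get(code, f"Unrecognized close code {code}")


print(describe_close_code(4031))  # prints Session idle for too long
```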

Quotas and Limits

The following limits are imposed to ensure performance and service quality.

Idle Sessions - Sessions that don’t receive audio within 1 minute will be terminated.
Session Limit - 100 sessions at a time for paid users. Please contact us if you need to increase this limit. Free-tier users must upgrade their account to use real-time streaming.
Session Uniqueness - Only one WebSocket per session.
Audio Sampling Rate Limit - Customers must send data in near real-time. If a client sends data faster than 1 second of audio per second for longer than 1 minute, we’ll terminate the session.
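
When streaming from a file rather than a live microphone, the real-time limit above means the client has to pace itself. The sketch below waits for each chunk's playback duration after sending it; `send_in_real_time` and its parameters are illustrative assumptions, and `send` stands in for whatever WebSocket send function the client uses.

```python
import time
from typing import Callable, Iterable


def send_in_real_time(
    chunks: Iterable[bytes],
    send: Callable[[bytes], None],
    sample_rate: int,
    bytes_per_sample: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send each chunk, then wait for as long as that chunk takes to play."""
    bytes_per_second = sample_rate * bytes_per_sample
    for chunk in chunks:
        send(chunk)
        # Duration of this chunk in seconds for single-channel audio
        sleep(len(chunk) / bytes_per_second)


# Demo with stand-ins for the socket and the clock, so nothing blocks
sent, delays = [], []
send_in_real_time([b"\x00" * 3200] * 2, sent.append, 16_000, sleep=delays.append)
print(len(sent), delays)  # prints 2 [0.1, 0.1]
```

Passing `sleep` as a parameter keeps the pacing logic easy to test without real delays.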

Adding Custom Vocabulary

Developers can also add up to 2500 characters of custom vocabulary to their real-time session by adding the optional query parameter word_boost in the URL. The parameter should map to a JSON encoded list of strings as shown in this Python example:

import json
from urllib.parse import urlencode

sample_rate = 16000
word_boost = ["foo", "bar"]
# word_boost must be a JSON-encoded list of strings
params = {"sample_rate": sample_rate, "word_boost": json.dumps(word_boost)}

url = f"wss://api.assemblyai.com/v2/realtime/ws?{urlencode(params)}"

Creating Temporary Authentication Tokens

If you need to authenticate on the client, you can avoid exposing your API key by using temporary authentication tokens. Temporary tokens have a one-time use restriction. To generate a temporary token, send a POST request to https://api.assemblyai.com/v2/realtime/token. Use the expires_in parameter to specify how long the token should be valid for, in seconds.

The expires_in parameter must have a value between 60 and 360000 seconds.

curl --request POST \
  --url https://api.assemblyai.com/v2/realtime/token \
  --header 'authorization: <YOUR_API_KEY>' \
  --header 'Content-Type: application/json' \
  --data '{"expires_in": 60}'

In response you’ll receive the following JSON output:

{
  "token": "b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd"
}
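
The same token request can be prepared from Python on your server. This is a sketch using only the standard library; the function name `build_token_request` is an assumption for illustration, and the actual network call is left commented out because it needs a valid API key.

```python
import json
import urllib.request

TOKEN_URL = "https://api.assemblyai.com/v2/realtime/token"


def build_token_request(api_key: str, expires_in: int = 60) -> urllib.request.Request:
    """Prepare the POST request for a temporary token without sending it."""
    # expires_in must be between 60 and 360000 seconds
    if not 60 <= expires_in <= 360_000:
        raise ValueError("expires_in must be between 60 and 360000 seconds")
    body = json.dumps({"expires_in": expires_in}).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_token_request("<YOUR_API_KEY>", expires_in=60)
print(request.get_method(), request.full_url)

# To send it (requires network access and a real key):
# with urllib.request.urlopen(request) as response:
#     token = json.load(response)["token"]
```

Keeping this call on the server means the API key never reaches the browser; only the returned token does.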

You can now use this temporary token in the browser to authenticate a new WebSocket session at the endpoint wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={New Temp Token}. For example:

let socket
const token = 'b2e3c6c71d450589b2f4f0bb1ac4efd2d5e55b1f926e552e02fc0cc070eaedbd'

socket = new WebSocket(
  `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token=${token}`
)

Conclusion

Streaming Speech-to-Text is a powerful feature with many possibilities for integration. On the AssemblyAI blog, you can learn more about using Streaming Speech-to-Text.

You can also find an example of using Express.js for Streaming Speech-to-Text on GitHub.
