Multilingual streaming

Supported languages

English, Spanish, French, German, Italian, and Portuguese

Multilingual streaming allows you to transcribe audio streams in multiple languages.

Streaming is billed per session

Universal-Streaming Multilingual is billed on the total duration that your WebSocket connection stays open, not on the amount of audio you send. Always send a Terminate message when you’re done with a stream — sessions that aren’t closed auto-close after 3 hours and are billed for the full duration. See Billing and pricing for details.

Need more than 6 languages?

If you need support beyond the 6 languages listed here, consider using the Whisper Streaming model (speech_model: "whisper-rt"), which supports 99 languages with automatic language detection. See the Whisper Streaming section below for details.

Configuration

To utilize multilingual streaming, you need to include "speech_model":"universal-streaming-multilingual" as a query parameter in the WebSocket URL.

Supported languages

Multilingual currently supports: English, Spanish, French, German, Italian, and Portuguese.

Quickstart

Python

Python SDK

Javascript

JavaScript SDK

$ pip install websockets pyaudio

The Python example uses the websockets library. If you’re using websockets version 13.0 or later, use additional_headers parameter. For older versions (< 13.0), use extra_headers instead.

Python

Python SDK

Javascript

JavaScript SDK

1 import websockets
2 import asyncio
3 import json
4 from urllib.parse import urlencode
5 
6 import pyaudio
7 
8 FRAMES_PER_BUFFER = 3200
9 FORMAT = pyaudio.paInt16
10 CHANNELS = 1
11 RATE = 48000
12 p = pyaudio.PyAudio()
13 
14 stream = p.open(
15     format=FORMAT,
16     channels=CHANNELS,
17     rate=RATE,
18     input=True,
19     frames_per_buffer=FRAMES_PER_BUFFER
20 )
21 
22 BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
23 CONNECTION_PARAMS = {
24     "sample_rate": RATE,
25     "speech_model": "universal-streaming-multilingual",
26     "language_detection": True,
27 }
28 URL = f"{BASE_URL}?{urlencode(CONNECTION_PARAMS)}"
29 
30 async def send_receive():
31 
32     print(f'Connecting websocket to url ${URL}')
33 
34     async with websockets.connect(
35         URL,
36         additional_headers={"Authorization": "YOUR-API-KEY"},
37         ping_interval=5,
38         ping_timeout=20
39     ) as _ws:
40         await asyncio.sleep(0.1)
41         print("Receiving SessionBegins ...")
42 
43         session_begins = await _ws.recv()
44         print(session_begins)
45         print("Sending messages ...")
46 
47         async def send():
48             while True:
49                 try:
50                     data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
51                     await _ws.send(data)
52                 except websockets.exceptions.ConnectionClosedError as e:
53                     print(e)
54                 except Exception as e:
55                     print(e)
56                 await asyncio.sleep(0.01)
57 
58         async def receive():
59             while True:
60                 try:
61                     result_str = await _ws.recv()
62                     data = json.loads(result_str)
63                     transcript = data['transcript']
64                     utterance = data['utterance']
65 
66                     if data['type'] == 'Turn':
67                         if not data.get('end_of_turn') and transcript:
68                             print(f"[PARTIAL TURN TRANSCRIPT]: {transcript}")
69                         if data.get('utterance'):
70                             print(f"[PARTIAL TURN UTTERANCE]: {utterance}")
71                             # Display language detection info if available
72                             if 'language_code' in data:
73                                 print(f"[UTTERANCE LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
74                         if data.get('end_of_turn'):
75                             print(f"[FULL TURN TRANSCRIPT]: {transcript}")
76                             # Display language detection info if available
77                             if 'language_code' in data:
78                                 print(f"[END OF TURN LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
79                     else:
80                         pass
81 
82                 except websockets.exceptions.ConnectionClosed:
83                     break
84                 except Exception as e:
85                     print(f"\nError receiving data: {e}")
86                     break
87 
88         try:
89             await asyncio.gather(send(), receive())
90         except KeyboardInterrupt:
91             await _ws.send({"type": "Terminate"})
92             # Wait for the server to close the connection after receiving the message
93             await _ws.wait_closed()
94             print("Session terminated and connection closed.")
95 
96 if __name__ == "__main__":
97     try:
98         asyncio.run(send_receive())
99     finally:
100         stream.stop_stream()
101         stream.close()
102         p.terminate()

Language detection

The multilingual streaming model supports automatic language detection, allowing you to identify which language is being spoken in real-time. When enabled, the model returns the detected language code and confidence score with each complete utterance and final turn.

Configuration

To enable language detection, include language_detection=true as a query parameter in the WebSocket URL:

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=universal-streaming-multilingual&language_detection=true

Output format

When language detection is enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:

language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection

The language_code and language_confidence fields only appear when either:

The utterance field is non-empty and contains a complete utterance - The end_of_turn field is true

Example response

Here’s an example Turn message with language detection enabled, showing Spanish being detected:

1 {
2   "turn_order": 1,
3   "turn_is_formatted": false,
4   "end_of_turn": false,
5   "transcript": "Buenos",
6   "end_of_turn_confidence": 0.991195,
7   "words": [
8     {
9       "start": 29920,
10       "end": 30080,
11       "text": "Buenos",
12       "confidence": 0.979445,
13       "word_is_final": true
14     },
15     {
16       "start": 30320,
17       "end": 30400,
18       "text": "días",
19       "confidence": 0.774696,
20       "word_is_final": false
21     }
22   ],
23   "utterance": "Buenos días.",
24   "language_code": "es",
25   "language_confidence": 0.999997,
26   "type": "Turn"
27 }

In this example, the model detected Spanish ("es") with a confidence of 0.999997.

Understanding formatting

The multilingual model produces transcripts with punctuation and capitalization already built into the model outputs. This means you’ll receive properly formatted text without requiring any additional post-processing.

While the API still returns the turn_is_formatted parameter to maintain interface consistency with other streaming models, the multilingual model doesn’t perform additional formatting operations. All transcripts from the multilingual model are already formatted as they’re generated.

Whisper Streaming

Whisper streaming allows you to transcribe audio streams in 99 languages using the WhisperLiveKit model. To use Whisper streaming, set speech_model to "whisper-rt" in the WebSocket URL.

The whisper-rt model does not support the language parameter. The model automatically detects the language being spoken. Do not include a language parameter when using this model.

Supported languages (99)

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Bashkir, Basque, Belarusian, Bengali, Bosnian, Breton, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Haitian Creole, Hausa, Hawaiian, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Lao, Latin, Latvian, Lingala, Lithuanian, Luxembourgish, Macedonian, Malagasy, Malay, Malayalam, Maltese, Maori, Marathi, Mongolian, Myanmar, Nepali, Norwegian, Nynorsk, Occitan, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Shona, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Tibetan, Turkish, Turkmen, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yiddish, Yoruba

Language detection

The Whisper streaming model supports automatic language detection, allowing you to identify which language is being spoken in real-time. To enable it, include language_detection=true as a query parameter in the WebSocket URL:

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=whisper-rt&language_detection=true

When enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:

language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection

The language_code and language_confidence fields only appear when either:

The utterance field is non-empty and contains a complete utterance - The end_of_turn field is true

Example response

1 {
2   "turn_order": 0,
3   "turn_is_formatted": false,
4   "end_of_turn": true,
5   "transcript": "buenos días",
6   "end_of_turn_confidence": 1.0,
7   "words": [
8     {
9       "start": 1200,
10       "end": 2596,
11       "text": "buenos",
12       "confidence": 0.0,
13       "word_is_final": true
14     },
15     {
16       "start": 2828,
17       "end": 3760,
18       "text": "días",
19       "confidence": 0.0,
20       "word_is_final": true
21     }
22   ],
23   "utterance": "Buenos días.",
24   "language_code": "es",
25   "language_confidence": 0.846999,
26   "type": "Turn"
27 }

Non-speech tags

The Whisper streaming model can detect and transcribe non-speech audio events. These are returned as bracketed tags in the utterance field. Common non-speech tags include:

[Silence] - Periods of silence or no speech
[Música] / [Music] - Background music detected
Other audio events may appear in similar bracketed format

Non-speech tags appear in the utterance field with brackets. The transcript field contains the raw text without formatting. You can filter out non-speech turns by checking if the utterance contains bracketed tags like [Silence] or [Music].

Understanding formatting

By default, the Whisper streaming model returns unformatted transcripts. To receive formatted transcripts with proper punctuation and capitalization, set format_turns=true as a query parameter.

For voice agent pipelines, formatting is not required since LLMs process unformatted text directly. For notetaking and closed captioning applications, enable format_turns to make output human-readable.

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=whisper-rt&format_turns=true

Supported languages

English, Spanish, French, German, Italian, and Portuguese

Multilingual streaming allows you to transcribe audio streams in multiple languages.

Streaming is billed per session

Need more than 6 languages?

Configuration

To utilize multilingual streaming, you need to include "speech_model":"universal-streaming-multilingual" as a query parameter in the WebSocket URL.

Supported languages

Multilingual currently supports: English, Spanish, French, German, Italian, and Portuguese.

Quickstart

Python

Python SDK

Javascript

JavaScript SDK

$ pip install websockets pyaudio

The Python example uses the websockets library. If you’re using websockets version 13.0 or later, use additional_headers parameter. For older versions (< 13.0), use extra_headers instead.

Python

Python SDK

Javascript

JavaScript SDK

1 import websockets
2 import asyncio
3 import json
4 from urllib.parse import urlencode
5 
6 import pyaudio
7 
8 FRAMES_PER_BUFFER = 3200
9 FORMAT = pyaudio.paInt16
10 CHANNELS = 1
11 RATE = 48000
12 p = pyaudio.PyAudio()
13 
14 stream = p.open(
15     format=FORMAT,
16     channels=CHANNELS,
17     rate=RATE,
18     input=True,
19     frames_per_buffer=FRAMES_PER_BUFFER
20 )
21 
22 BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
23 CONNECTION_PARAMS = {
24     "sample_rate": RATE,
25     "speech_model": "universal-streaming-multilingual",
26     "language_detection": True,
27 }
28 URL = f"{BASE_URL}?{urlencode(CONNECTION_PARAMS)}"
29 
30 async def send_receive():
31 
32     print(f'Connecting websocket to url ${URL}')
33 
34     async with websockets.connect(
35         URL,
36         additional_headers={"Authorization": "YOUR-API-KEY"},
37         ping_interval=5,
38         ping_timeout=20
39     ) as _ws:
40         await asyncio.sleep(0.1)
41         print("Receiving SessionBegins ...")
42 
43         session_begins = await _ws.recv()
44         print(session_begins)
45         print("Sending messages ...")
46 
47         async def send():
48             while True:
49                 try:
50                     data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
51                     await _ws.send(data)
52                 except websockets.exceptions.ConnectionClosedError as e:
53                     print(e)
54                 except Exception as e:
55                     print(e)
56                 await asyncio.sleep(0.01)
57 
58         async def receive():
59             while True:
60                 try:
61                     result_str = await _ws.recv()
62                     data = json.loads(result_str)
63                     transcript = data['transcript']
64                     utterance = data['utterance']
65 
66                     if data['type'] == 'Turn':
67                         if not data.get('end_of_turn') and transcript:
68                             print(f"[PARTIAL TURN TRANSCRIPT]: {transcript}")
69                         if data.get('utterance'):
70                             print(f"[PARTIAL TURN UTTERANCE]: {utterance}")
71                             # Display language detection info if available
72                             if 'language_code' in data:
73                                 print(f"[UTTERANCE LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
74                         if data.get('end_of_turn'):
75                             print(f"[FULL TURN TRANSCRIPT]: {transcript}")
76                             # Display language detection info if available
77                             if 'language_code' in data:
78                                 print(f"[END OF TURN LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
79                     else:
80                         pass
81 
82                 except websockets.exceptions.ConnectionClosed:
83                     break
84                 except Exception as e:
85                     print(f"\nError receiving data: {e}")
86                     break
87 
88         try:
89             await asyncio.gather(send(), receive())
90         except KeyboardInterrupt:
91             await _ws.send({"type": "Terminate"})
92             # Wait for the server to close the connection after receiving the message
93             await _ws.wait_closed()
94             print("Session terminated and connection closed.")
95 
96 if __name__ == "__main__":
97     try:
98         asyncio.run(send_receive())
99     finally:
100         stream.stop_stream()
101         stream.close()
102         p.terminate()

Language detection

Configuration

To enable language detection, include language_detection=true as a query parameter in the WebSocket URL:

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=universal-streaming-multilingual&language_detection=true

Output format

When language detection is enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:

language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection

The language_code and language_confidence fields only appear when either:

The utterance field is non-empty and contains a complete utterance - The end_of_turn field is true

Example response

Here’s an example Turn message with language detection enabled, showing Spanish being detected:

1 {
2   "turn_order": 1,
3   "turn_is_formatted": false,
4   "end_of_turn": false,
5   "transcript": "Buenos",
6   "end_of_turn_confidence": 0.991195,
7   "words": [
8     {
9       "start": 29920,
10       "end": 30080,
11       "text": "Buenos",
12       "confidence": 0.979445,
13       "word_is_final": true
14     },
15     {
16       "start": 30320,
17       "end": 30400,
18       "text": "días",
19       "confidence": 0.774696,
20       "word_is_final": false
21     }
22   ],
23   "utterance": "Buenos días.",
24   "language_code": "es",
25   "language_confidence": 0.999997,
26   "type": "Turn"
27 }

In this example, the model detected Spanish ("es") with a confidence of 0.999997.

Understanding formatting

Whisper Streaming

Whisper streaming allows you to transcribe audio streams in 99 languages using the WhisperLiveKit model. To use Whisper streaming, set speech_model to "whisper-rt" in the WebSocket URL.

The whisper-rt model does not support the language parameter. The model automatically detects the language being spoken. Do not include a language parameter when using this model.

Supported languages (99)

Language detection

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=whisper-rt&language_detection=true

When enabled, each Turn message (with either a complete utterance or end_of_turn: true) will include two additional fields:

language_code: The language code of the detected language (e.g., "es" for Spanish, "fr" for French)
language_confidence: A confidence score between 0 and 1 indicating how confident the model is in the language detection

The language_code and language_confidence fields only appear when either:

The utterance field is non-empty and contains a complete utterance - The end_of_turn field is true

Example response

1 {
2   "turn_order": 0,
3   "turn_is_formatted": false,
4   "end_of_turn": true,
5   "transcript": "buenos días",
6   "end_of_turn_confidence": 1.0,
7   "words": [
8     {
9       "start": 1200,
10       "end": 2596,
11       "text": "buenos",
12       "confidence": 0.0,
13       "word_is_final": true
14     },
15     {
16       "start": 2828,
17       "end": 3760,
18       "text": "días",
19       "confidence": 0.0,
20       "word_is_final": true
21     }
22   ],
23   "utterance": "Buenos días.",
24   "language_code": "es",
25   "language_confidence": 0.846999,
26   "type": "Turn"
27 }

Non-speech tags

The Whisper streaming model can detect and transcribe non-speech audio events. These are returned as bracketed tags in the utterance field. Common non-speech tags include:

[Silence] - Periods of silence or no speech
[Música] / [Music] - Background music detected
Other audio events may appear in similar bracketed format

Understanding formatting

By default, the Whisper streaming model returns unformatted transcripts. To receive formatted transcripts with proper punctuation and capitalization, set format_turns=true as a query parameter.

wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&speech_model=whisper-rt&format_turns=true

1	import websockets
2	import asyncio
3	import json
4	from urllib.parse import urlencode
5
6	import pyaudio
7
8	FRAMES_PER_BUFFER = 3200
9	FORMAT = pyaudio.paInt16
10	CHANNELS = 1
11	RATE = 48000
12	p = pyaudio.PyAudio()
13
14	stream = p.open(
15	format=FORMAT,
16	channels=CHANNELS,
17	rate=RATE,
18	input=True,
19	frames_per_buffer=FRAMES_PER_BUFFER
20	)
21
22	BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
23	CONNECTION_PARAMS = {
24	"sample_rate": RATE,
25	"speech_model": "universal-streaming-multilingual",
26	"language_detection": True,
27	}
28	URL = f"{BASE_URL}?{urlencode(CONNECTION_PARAMS)}"
29
30	async def send_receive():
31
32	print(f'Connecting websocket to url ${URL}')
33
34	async with websockets.connect(
35	URL,
36	additional_headers={"Authorization": "YOUR-API-KEY"},
37	ping_interval=5,
38	ping_timeout=20
39	) as _ws:
40	await asyncio.sleep(0.1)
41	print("Receiving SessionBegins ...")
42
43	session_begins = await _ws.recv()
44	print(session_begins)
45	print("Sending messages ...")
46
47	async def send():
48	while True:
49	try:
50	data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
51	await _ws.send(data)
52	except websockets.exceptions.ConnectionClosedError as e:
53	print(e)
54	except Exception as e:
55	print(e)
56	await asyncio.sleep(0.01)
57
58	async def receive():
59	while True:
60	try:
61	result_str = await _ws.recv()
62	data = json.loads(result_str)
63	transcript = data['transcript']
64	utterance = data['utterance']
65
66	if data['type'] == 'Turn':
67	if not data.get('end_of_turn') and transcript:
68	print(f"[PARTIAL TURN TRANSCRIPT]: {transcript}")
69	if data.get('utterance'):
70	print(f"[PARTIAL TURN UTTERANCE]: {utterance}")
71	# Display language detection info if available
72	if 'language_code' in data:
73	print(f"[UTTERANCE LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
74	if data.get('end_of_turn'):
75	print(f"[FULL TURN TRANSCRIPT]: {transcript}")
76	# Display language detection info if available
77	if 'language_code' in data:
78	print(f"[END OF TURN LANGUAGE DETECTION]: {data['language_code']} - {data['language_confidence']:.2%}")
79	else:
80	pass
81
82	except websockets.exceptions.ConnectionClosed:
83	break
84	except Exception as e:
85	print(f"\nError receiving data: {e}")
86	break
87
88	try:
89	await asyncio.gather(send(), receive())
90	except KeyboardInterrupt:
91	await _ws.send({"type": "Terminate"})
92	# Wait for the server to close the connection after receiving the message
93	await _ws.wait_closed()
94	print("Session terminated and connection closed.")
95
96	if __name__ == "__main__":
97	try:
98	asyncio.run(send_receive())
99	finally:
100	stream.stop_stream()
101	stream.close()
102	p.terminate()

1	{
2	"turn_order": 1,
3	"turn_is_formatted": false,
4	"end_of_turn": false,
5	"transcript": "Buenos",
6	"end_of_turn_confidence": 0.991195,
7	"words": [
8	{
9	"start": 29920,
10	"end": 30080,
11	"text": "Buenos",
12	"confidence": 0.979445,
13	"word_is_final": true
14	},
15	{
16	"start": 30320,
17	"end": 30400,
18	"text": "días",
19	"confidence": 0.774696,
20	"word_is_final": false
21	}
22	],
23	"utterance": "Buenos días.",
24	"language_code": "es",
25	"language_confidence": 0.999997,
26	"type": "Turn"
27	}

1	{
2	"turn_order": 0,
3	"turn_is_formatted": false,
4	"end_of_turn": true,
5	"transcript": "buenos días",
6	"end_of_turn_confidence": 1.0,
7	"words": [
8	{
9	"start": 1200,
10	"end": 2596,
11	"text": "buenos",
12	"confidence": 0.0,
13	"word_is_final": true
14	},
15	{
16	"start": 2828,
17	"end": 3760,
18	"text": "días",
19	"confidence": 0.0,
20	"word_is_final": true
21	}
22	],
23	"utterance": "Buenos días.",
24	"language_code": "es",
25	"language_confidence": 0.846999,
26	"type": "Turn"
27	}