Models

AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.

Pre-recorded

Universal-3 Pro

Universal-3 Pro is our most advanced transcription model, delivering state-of-the-art accuracy across 6 languages with powerful prompting capabilities. It supports prompting in plain language for tasks like context-specific transcription, verbatim output, audio tagging, and speaker diarization, giving you fine-grained control to guide transcription results. With keyterms prompting supporting up to 1,000 words, built-in code switching, and multichannel support, Universal-3 Pro is ideal for complex audio scenarios requiring the highest accuracy.

English (en)
Spanish (es)
German (de)
French (fr)
Portuguese (pt)
Italian (it)

Try Universal-3 Pro here

Universal-2

Universal-2 offers the broadest language coverage of any of our models, supporting high-accuracy transcription across 99 languages with low latency. It supports customization through keyterms prompting (up to 200 words) and includes features like multichannel support, automatic language detection, code switching, and speaker diarization. Universal-2 is the go-to choice when you need reliable transcription across diverse languages.

Global English (en)
Australian English (en_au)
British English (en_uk)
US English (en_us)
Spanish (es)
French (fr)
German (de)
Italian (it)
Portuguese (pt)
Dutch (nl)
Hindi (hi)
Japanese (ja)
Chinese (zh)
Finnish (fi)
Korean (ko)
Polish (pl)
Russian (ru)
Turkish (tr)
Ukrainian (uk)
Vietnamese (vi)
Afrikaans (af)
Albanian (sq)
Amharic (am)
Arabic (ar)
Armenian (hy)
Assamese (as)
Azerbaijani (az)
Bashkir (ba)
Basque (eu)
Belarusian (be)
Bengali (bn)
Bosnian (bs)
Breton (br)
Bulgarian (bg)
Burmese (my)
Catalan (ca)
Croatian (hr)
Czech (cs)
Danish (da)
Estonian (et)
Faroese (fo)
Galician (gl)
Georgian (ka)
Greek (el)
Gujarati (gu)
Haitian (ht)
Hausa (ha)
Hawaiian (haw)
Hebrew (he)
Hungarian (hu)
Icelandic (is)
Indonesian (id)
Javanese (jw)
Kannada (kn)
Kazakh (kk)
Khmer (km)
Lao (lo)
Latin (la)
Latvian (lv)
Lingala (ln)
Lithuanian (lt)
Luxembourgish (lb)
Macedonian (mk)
Malagasy (mg)
Malay (ms)
Malayalam (ml)
Maltese (mt)
Maori (mi)
Marathi (mr)
Mongolian (mn)
Nepali (ne)
Norwegian (no)
Norwegian Nynorsk (nn)
Occitan (oc)
Panjabi (pa)
Pashto (ps)
Persian (fa)
Romanian (ro)
Sanskrit (sa)
Serbian (sr)
Shona (sn)
Sindhi (sd)
Sinhala (si)
Slovak (sk)
Slovenian (sl)
Somali (so)
Sundanese (su)
Swahili (sw)
Swedish (sv)
Swiss German (de_ch)
Tagalog (tl)
Tajik (tg)
Tamil (ta)
Tatar (tt)
Telugu (te)
Thai (th)
Tibetan (bo)
Turkmen (tk)
Urdu (ur)
Uzbek (uz)
Welsh (cy)
Yiddish (yi)
Yoruba (yo)

Try Universal-2 here
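As a rough illustration of the trade-off between the two pre-recorded models, the sketch below picks Universal-3 Pro when the language code is one of its six listed languages and falls back to Universal-2 otherwise. The function and constant names are hypothetical and are not part of any AssemblyAI SDK; the return values are display names, not API identifiers.

```python
# Hypothetical helper for illustration only; not part of the AssemblyAI SDK.
# Language codes are copied from the Universal-3 Pro list above.
UNIVERSAL_3_PRO_LANGUAGES = frozenset({"en", "es", "de", "fr", "pt", "it"})


def pick_prerecorded_model(language_code: str) -> str:
    """Return the display name of the pre-recorded model to try first."""
    # Normalize so that "EN" and " en " match the listed codes.
    code = language_code.strip().lower()
    if code in UNIVERSAL_3_PRO_LANGUAGES:
        return "Universal-3 Pro"
    # Everything else falls back to the model with the broadest coverage.
    # The caller should still confirm the code appears in the Universal-2 list.
    return "Universal-2"


print(pick_prerecorded_model("pt"))
print(pick_prerecorded_model("ja"))
```

Regional codes such as en_au appear only in the Universal-2 list, so this sketch routes them to Universal-2.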

Streaming

Universal-3 Pro Streaming

Universal-3 Pro Streaming is our most accurate streaming model, built for voice agents that demand the highest quality. It offers advanced prompting capabilities, including both keyterms prompting and native prompting, and supports native multilingual code switching, entity accuracy, and disfluency detection.

English (en)
Spanish (es)
German (de)
French (fr)
Portuguese (pt)
Italian (it)

Learn more about Universal-3 Pro Streaming

Universal-Streaming Multilingual

Universal-Streaming Multilingual provides multilingual transcription at the same speed and price as Universal-Streaming English, with expanded language coverage. It features intelligent endpointing and keyterms prompting for up to 100 words.

English (en)
Spanish (es)
German (de)
French (fr)
Portuguese (pt)
Italian (it)

Learn more about Universal-Streaming Multilingual

Universal-Streaming English

Universal-Streaming English is our fastest model for real-time English transcription, optimized for speed and cost-effectiveness in English-only applications. It features ~300 ms word-by-word immutable transcripts, intelligent endpointing, and keyterms prompting for up to 100 words.

English (en)

Learn more about Universal-Streaming English

Whisper Streaming

Whisper Streaming runs the open-source Whisper model on AssemblyAI’s reliable infrastructure at unlimited scale. It supports 99+ languages at an accessible price point, with automatic language detection and non-speech tags.

Global English (en)
Spanish (es)
French (fr)
German (de)
Italian (it)
Portuguese (pt)
Dutch (nl)
Hindi (hi)
Japanese (ja)
Chinese (zh)
Finnish (fi)
Korean (ko)
Polish (pl)
Russian (ru)
Turkish (tr)
Ukrainian (uk)
Vietnamese (vi)
Afrikaans (af)
Albanian (sq)
Amharic (am)
Arabic (ar)
Armenian (hy)
Assamese (as)
Azerbaijani (az)
Bashkir (ba)
Basque (eu)
Belarusian (be)
Bengali (bn)
Bosnian (bs)
Breton (br)
Bulgarian (bg)
Burmese (my)
Cantonese (yue)
Catalan (ca)
Croatian (hr)
Czech (cs)
Danish (da)
Estonian (et)
Faroese (fo)
Galician (gl)
Georgian (ka)
Greek (el)
Gujarati (gu)
Haitian (ht)
Hausa (ha)
Hawaiian (haw)
Hebrew (he)
Hungarian (hu)
Icelandic (is)
Indonesian (id)
Javanese (jw)
Kannada (kn)
Kazakh (kk)
Khmer (km)
Lao (lo)
Latin (la)
Latvian (lv)
Lingala (ln)
Lithuanian (lt)
Luxembourgish (lb)
Macedonian (mk)
Malagasy (mg)
Malay (ms)
Malayalam (ml)
Maltese (mt)
Maori (mi)
Marathi (mr)
Mongolian (mn)
Nepali (ne)
Norwegian (no)
Norwegian Nynorsk (nn)
Occitan (oc)
Panjabi (pa)
Pashto (ps)
Persian (fa)
Romanian (ro)
Sanskrit (sa)
Serbian (sr)
Shona (sn)
Sindhi (sd)
Sinhala (si)
Slovak (sk)
Slovenian (sl)
Somali (so)
Sundanese (su)
Swahili (sw)
Swedish (sv)
Tagalog (tl)
Tajik (tg)
Tamil (ta)
Tatar (tt)
Telugu (te)
Thai (th)
Tibetan (bo)
Turkmen (tk)
Urdu (ur)
Uzbek (uz)
Welsh (cy)
Yiddish (yi)
Yoruba (yo)

Learn more about Whisper Streaming

To learn how to specify a model, see the guides for pre-recorded audio and for streaming audio.
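The keyterms prompting limits quoted in the model descriptions differ per model, so it can help to check a keyterm list before sending it. The sketch below is a minimal local check, not an AssemblyAI API: the helper names are hypothetical, and it assumes that a multi-word phrase counts once per whitespace-separated word, which the descriptions above do not confirm. Models with no stated word limit are left out of the table.

```python
# Hypothetical local check; these names are not part of the AssemblyAI SDK.
# Word limits are the ones stated in the model descriptions above.
KEYTERM_WORD_LIMITS = {
    "Universal-3 Pro": 1000,
    "Universal-2": 200,
    "Universal-Streaming Multilingual": 100,
    "Universal-Streaming English": 100,
}


def count_keyterm_words(keyterms: list[str]) -> int:
    # Assumption: each whitespace-separated word in a phrase counts toward the limit.
    return sum(len(term.split()) for term in keyterms)


def within_keyterm_limit(model: str, keyterms: list[str]) -> bool:
    # Raises KeyError for models that have no stated limit in this table.
    limit = KEYTERM_WORD_LIMITS[model]
    return count_keyterm_words(keyterms) <= limit


terms = ["AssemblyAI", "speaker diarization"]
print(count_keyterm_words(terms))
print(within_keyterm_limit("Universal-Streaming English", terms))
```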

Pricing

For detailed pricing information, visit our pricing page.

Pre-recorded

Model            Price per hour  Volume discounts
Universal-3 Pro  $0.21/hr        Available
Universal-2      $0.15/hr        Available

Streaming

Model                             Price per hour  Volume discounts
Universal-3 Pro Streaming         $0.45/hr        Available
Universal-Streaming Multilingual  $0.15/hr        Available
Universal-Streaming English       $0.15/hr        Available
Whisper Streaming                 $0.30/hr        Available

The rates shown above are offered subject to participation in our model improvement program to help us continue to provide best-in-class speech-to-text. Rates may be different for accounts that opt out of this program.

For volume discounts, please reach out to sales@assemblyai.com.
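To compare models on cost, the listed hourly rates can be turned into a rough estimate for a given amount of audio. The sketch below is illustrative only: the function name is hypothetical, and it assumes billing is strictly proportional to audio duration with no minimums, volume discounts, or opt-out adjustments, none of which this page specifies. It rounds to the nearest cent.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-hour list prices in USD, copied from the pricing tables above.
PRICE_PER_HOUR = {
    "Universal-3 Pro": Decimal("0.21"),
    "Universal-2": Decimal("0.15"),
    "Universal-3 Pro Streaming": Decimal("0.45"),
    "Universal-Streaming Multilingual": Decimal("0.15"),
    "Universal-Streaming English": Decimal("0.15"),
    "Whisper Streaming": Decimal("0.30"),
}


def estimate_cost(model: str, audio_seconds: int) -> Decimal:
    """Estimate the list-price cost in USD for the given audio duration."""
    # Assumption: cost scales linearly with duration at the listed hourly rate.
    hours = Decimal(audio_seconds) / Decimal(3600)
    cost = PRICE_PER_HOUR[model] * hours
    # Round to whole cents, with halves rounded up.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Ten hours of pre-recorded audio on Universal-2.
print(estimate_cost("Universal-2", 10 * 3600))
```

Using `Decimal` rather than floats keeps the arithmetic exact for currency values.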

Next steps