Models

AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.

Add-on models

Add-on models enhance transcription accuracy for specialized domains. They work alongside your chosen speech model and are billed separately.

Medical Mode

Medical Mode (domain: "medical-v1") is an add-on that enhances transcription accuracy for medical terminology — including medication names, procedures, conditions, and dosages. It is optimized for medical entity recognition to correct terms that other models frequently get wrong.

Supported models:

  • Pre-recorded: Universal-3 Pro, Universal-2
  • Streaming: Universal-3 Pro Streaming, Universal-Streaming English, Universal-Streaming Multilingual

Supported languages: English, Spanish, German, French

Medical Mode is billed as a separate add-on. See the pricing page for details.

Learn more: Medical Mode for pre-recorded audio | Medical Mode for streaming
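The compatibility rules above can be captured in a small lookup, which is handy for validating a request before sending it. This is a minimal sketch, not part of any AssemblyAI SDK: the function name is hypothetical, and the model strings are the display names used on this page rather than confirmed API identifiers. Only the model names, the four languages, and the "medical-v1" domain value come from this page. It also assumes Universal-Streaming English can only use Medical Mode for English.

```python
# Illustrative only: display names from this page, not confirmed API identifiers.
MEDICAL_MODE_DOMAIN = "medical-v1"

# Languages listed as supported by Medical Mode.
_MEDICAL_LANGUAGES = frozenset({"en", "es", "de", "fr"})

# Per-model Medical Mode languages (assumption: an English-only model
# can only use Medical Mode for English).
MEDICAL_MODE_SUPPORT = {
    "Universal-3 Pro": _MEDICAL_LANGUAGES,
    "Universal-2": _MEDICAL_LANGUAGES,
    "Universal-3 Pro Streaming": _MEDICAL_LANGUAGES,
    "Universal-Streaming Multilingual": _MEDICAL_LANGUAGES,
    "Universal-Streaming English": frozenset({"en"}),
}


def medical_mode_supported(model: str, language_code: str) -> bool:
    """Return True if this page lists Medical Mode for the model and language."""
    return language_code in MEDICAL_MODE_SUPPORT.get(model, frozenset())
```

For example, `medical_mode_supported("Universal-3 Pro", "pt")` is false because Portuguese is not among the Medical Mode languages, even though Universal-3 Pro itself transcribes Portuguese.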

Choosing the right model

Pre-recorded

Universal-3 Pro

Universal-3 Pro is our most advanced transcription model, delivering state-of-the-art accuracy across 6 languages with powerful prompting capabilities. It supports prompting in plain language for tasks like context-specific transcription, verbatim output, audio tagging, and speaker diarization, giving you fine-grained control to guide transcription results. With keyterms prompting supporting up to 1,000 words, built-in code switching, and multichannel support, Universal-3 Pro is ideal for complex audio scenarios requiring the highest accuracy.

Best for:

  • Highest-accuracy transcription where quality matters more than speed
  • Post-call analytics and conversation intelligence
  • Meeting notetakers
  • Medical transcription
  • Recruiting and interviews, with high-quality diarization and entity accuracy
  • Domain-specific accuracy via keyterms prompting (up to 1,000 words) for entities, proper nouns, and rare terms
  • Code-switching across EN/ES/DE/FR/PT/IT

Language | Code
English | en
Spanish | es
German | de
French | fr
Portuguese | pt
Italian | it

Regional dialects

Universal-3 Pro also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.

Try Universal-3 Pro here

Universal-2

Universal-2 offers the broadest language coverage of any of our models, supporting high-accuracy transcription across 99 languages with low latency. It supports customization through keyterms prompting (up to 200 words) and includes features like multichannel support, automatic language detection, code switching, speaker diarization, and more. Universal-2 is the go-to choice when you need reliable transcription across diverse languages.

Best for:

  • High accuracy at lower cost with broad language support
  • High-volume, price-sensitive batch transcription
  • Support for 99 languages
  • Recommended fallback when a requested language isn’t supported by Universal-3 Pro

Language | Code
Global English | en
Australian English | en_au
British English | en_uk
US English | en_us
Spanish | es
French | fr
German | de
Italian | it
Portuguese | pt
Dutch | nl
Hindi | hi
Japanese | ja
Chinese | zh
Finnish | fi
Korean | ko
Polish | pl
Russian | ru
Turkish | tr
Ukrainian | uk
Vietnamese | vi
Afrikaans | af
Albanian | sq
Amharic | am
Arabic | ar
Armenian | hy
Assamese | as
Azerbaijani | az
Bashkir | ba
Basque | eu
Belarusian | be
Bengali | bn
Bosnian | bs
Breton | br
Bulgarian | bg
Burmese | my
Catalan | ca
Croatian | hr
Czech | cs
Danish | da
Estonian | et
Faroese | fo
Galician | gl
Georgian | ka
Greek | el
Gujarati | gu
Haitian | ht
Hausa | ha
Hawaiian | haw
Hebrew | he
Hungarian | hu
Icelandic | is
Indonesian | id
Javanese | jw
Kannada | kn
Kazakh | kk
Khmer | km
Lao | lo
Latin | la
Latvian | lv
Lingala | ln
Lithuanian | lt
Luxembourgish | lb
Macedonian | mk
Malagasy | mg
Malay | ms
Malayalam | ml
Maltese | mt
Maori | mi
Marathi | mr
Mongolian | mn
Nepali | ne
Norwegian | no
Norwegian Nynorsk | nn
Occitan | oc
Panjabi | pa
Pashto | ps
Persian | fa
Romanian | ro
Sanskrit | sa
Serbian | sr
Shona | sn
Sindhi | sd
Sinhala | si
Slovak | sk
Slovenian | sl
Somali | so
Sundanese | su
Swahili | sw
Swedish | sv
Swiss German | de_ch
Tagalog | tl
Tajik | tg
Tamil | ta
Tatar | tt
Telugu | te
Thai | th
Tibetan | bo
Turkmen | tk
Urdu | ur
Uzbek | uz
Welsh | cy
Yiddish | yi
Yoruba | yo

Try Universal-2 here
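The fallback recommendation above (prefer Universal-3 Pro, and use Universal-2 when the requested language is outside its six languages) can be sketched as a simple selection function. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the returned strings are the display names from this page rather than API identifiers, and the function does not verify that Universal-2 actually covers the requested code.

```python
# Language codes listed for Universal-3 Pro on this page.
UNIVERSAL_3_PRO_LANGUAGES = frozenset({"en", "es", "de", "fr", "pt", "it"})


def choose_prerecorded_model(language_code: str) -> str:
    """Pick a pre-recorded model name for a language code.

    Returns display names from this page (hypothetical as API values).
    Falls back to Universal-2 without checking its language list.
    """
    if language_code.strip().lower() in UNIVERSAL_3_PRO_LANGUAGES:
        return "Universal-3 Pro"
    return "Universal-2"
```

A caller that also cares about cost rather than accuracy alone would need an extra input; this sketch encodes only the language-based fallback described in the list above.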

Streaming

Universal-3 Pro Streaming

The most accurate model with the fastest word emissions for voice agents that demand the highest quality. Best-in-class accuracy with advanced prompting capabilities, including both keyterms prompting and native prompting. Supports English, Spanish, German, French, Portuguese, and Italian.

Best for:

  • Real-time voice agents
  • Applications requiring premium accuracy
  • Customer service voice agents needing elite entity accuracy
  • IVR replacement / binary response detection in short utterances
  • Agent assist and sales intelligence needing real-time speaker diarization and mid-session dynamic prompting
  • Multilingual voice agents — native EN/ES/DE/FR/PT/IT code-switching
  • Compliance and verbatim recording — disfluency control via prompting

Language | Code
English | en
Spanish | es
German | de
French | fr
Portuguese | pt
Italian | it

Regional dialects

Universal-3 Pro Streaming also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.

Learn more about Universal-3 Pro Streaming

Universal-Streaming Multilingual

A multilingual transcription model offering a good balance of speed and cost-effectiveness. Supports English, Spanish, German, French, Portuguese, and Italian. Features intelligent endpointing and keyterms prompting support for up to 100 words.

Best for:

  • Cost-effective real-time transcription across languages
  • Cost-sensitive multilingual streaming across EN/ES/DE/FR/PT/IT

Language | Code
English | en
Spanish | es
German | de
French | fr
Portuguese | pt
Italian | it

Learn more about Universal-Streaming Multilingual

Universal-Streaming English

An English transcription model offering a good balance of speed and cost-effectiveness. Features ~300ms word-by-word immutable transcripts, intelligent endpointing, and keyterms prompting support for up to 100 words.

Best for:

  • Cost-effective real-time transcription for English
  • English-only real-time apps — fastest and cheapest streaming option for English

Language | Code
English | en

Learn more about Universal-Streaming English
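The keyterms prompting limits quoted on this page differ by model (1,000 words for Universal-3 Pro, 200 for Universal-2, and 100 for the two Universal-Streaming models), so a client-side check before sending a request may be useful. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the model strings are display names rather than API identifiers, and it assumes the limit counts whitespace-separated words summed across all keyterms, which this page does not spell out.

```python
from typing import Iterable

# Word limits as stated on this page; no limit is stated here for
# Universal-3 Pro Streaming, so it is deliberately omitted.
KEYTERM_WORD_LIMITS = {
    "Universal-3 Pro": 1000,
    "Universal-2": 200,
    "Universal-Streaming Multilingual": 100,
    "Universal-Streaming English": 100,
}


def count_keyterm_words(keyterms: Iterable[str]) -> int:
    """Count whitespace-separated words across all keyterms (assumed metric)."""
    return sum(len(term.split()) for term in keyterms)


def within_keyterm_limit(model: str, keyterms: Iterable[str]) -> bool:
    """Return True if the keyterms fit the limit listed for the model."""
    limit = KEYTERM_WORD_LIMITS.get(model)
    if limit is None:
        raise ValueError(f"No keyterms word limit listed for {model!r}")
    return count_keyterm_words(keyterms) <= limit
```

Under this counting assumption, a multi-word keyterm such as "atrial fibrillation" uses two words of the budget.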

Whisper Streaming

An open-source Whisper model enhanced with AssemblyAI’s reliable infrastructure and unlimited scale. Supports 99+ languages at an accessible price point with automatic language detection and non-speech tags.

Best for:

  • Multilingual applications and open-source flexibility
  • Customers who prefer open-source models
  • Cost-sensitive multilingual transcription

Language | Code
Global English | en
Spanish | es
French | fr
German | de
Italian | it
Portuguese | pt
Dutch | nl
Hindi | hi
Japanese | ja
Chinese | zh
Finnish | fi
Korean | ko
Polish | pl
Russian | ru
Turkish | tr
Ukrainian | uk
Vietnamese | vi
Afrikaans | af
Albanian | sq
Amharic | am
Arabic | ar
Armenian | hy
Assamese | as
Azerbaijani | az
Bashkir | ba
Basque | eu
Belarusian | be
Bengali | bn
Bosnian | bs
Breton | br
Bulgarian | bg
Burmese | my
Cantonese | yue
Catalan | ca
Croatian | hr
Czech | cs
Danish | da
Estonian | et
Faroese | fo
Galician | gl
Georgian | ka
Greek | el
Gujarati | gu
Haitian | ht
Hausa | ha
Hawaiian | haw
Hebrew | he
Hungarian | hu
Icelandic | is
Indonesian | id
Javanese | jw
Kannada | kn
Kazakh | kk
Khmer | km
Lao | lo
Latin | la
Latvian | lv
Lingala | ln
Lithuanian | lt
Luxembourgish | lb
Macedonian | mk
Malagasy | mg
Malay | ms
Malayalam | ml
Maltese | mt
Maori | mi
Marathi | mr
Mongolian | mn
Nepali | ne
Norwegian | no
Norwegian Nynorsk | nn
Occitan | oc
Panjabi | pa
Pashto | ps
Persian | fa
Romanian | ro
Sanskrit | sa
Serbian | sr
Shona | sn
Sindhi | sd
Sinhala | si
Slovak | sk
Slovenian | sl
Somali | so
Sundanese | su
Swahili | sw
Swedish | sv
Tagalog | tl
Tajik | tg
Tamil | ta
Tatar | tt
Telugu | te
Thai | th
Tibetan | bo
Turkmen | tk
Urdu | ur
Uzbek | uz
Welsh | cy
Yiddish | yi
Yoruba | yo

Learn more about Whisper Streaming
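The streaming guidance above can be condensed into a rough decision rule. The sketch below is one simplified reading of the "Best for" lists, not an official recommendation engine: the function name and the `premium_accuracy` flag are hypothetical, and the returned strings are display names from this page rather than API identifiers.

```python
# The six languages shared by Universal-3 Pro Streaming and
# Universal-Streaming Multilingual, per this page.
SIX_STREAMING_LANGUAGES = frozenset({"en", "es", "de", "fr", "pt", "it"})


def choose_streaming_model(language_code: str, premium_accuracy: bool = False) -> str:
    """Pick a streaming model name from language and accuracy needs.

    Simplified heuristic: premium accuracy in one of the six languages
    selects Universal-3 Pro Streaming; otherwise English and the other five
    languages use the Universal-Streaming models; anything else falls back
    to Whisper Streaming.
    """
    code = language_code.strip().lower()
    if code in SIX_STREAMING_LANGUAGES and premium_accuracy:
        return "Universal-3 Pro Streaming"
    if code == "en":
        return "Universal-Streaming English"
    if code in SIX_STREAMING_LANGUAGES:
        return "Universal-Streaming Multilingual"
    return "Whisper Streaming"
```

Real selection would likely weigh more factors (diarization, prompting needs, latency targets), so treat this as a starting point rather than a complete policy.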

To learn how to specify a model, click here for pre-recorded audio and here for streaming audio.

Pricing

For detailed pricing information, visit our pricing page.

Pre-recorded

Model | Price per Hour | Volume discounts
Universal-3 Pro | $0.21/hr | Available
Universal-2 | $0.15/hr | Available

Streaming

Model | Price per Hour | Volume discounts
Universal-3 Pro Streaming | $0.45/hr | Available
Universal-Streaming Multilingual | $0.15/hr | Available
Universal-Streaming English | $0.15/hr | Available
Whisper Streaming | $0.30/hr | Available

The rates shown above are offered subject to participation in our model improvement program to help us continue to provide best-in-class speech-to-text. Rates may be different for accounts that opt out of this program.

For volume discounts, please reach out to sales@assemblyai.com.
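For a rough budget, the list prices above can be turned into a simple estimate. This is a sketch under stated assumptions: the function name is hypothetical, cost is assumed to scale linearly with audio duration with no minimums or rounding rules, and it ignores volume discounts, add-ons such as Medical Mode, and the different rates that may apply to accounts that opt out of the model improvement program.

```python
from decimal import Decimal, ROUND_HALF_UP

# List prices per hour as shown in the tables above (USD).
PRICE_PER_HOUR = {
    "Universal-3 Pro": Decimal("0.21"),
    "Universal-2": Decimal("0.15"),
    "Universal-3 Pro Streaming": Decimal("0.45"),
    "Universal-Streaming Multilingual": Decimal("0.15"),
    "Universal-Streaming English": Decimal("0.15"),
    "Whisper Streaming": Decimal("0.30"),
}


def estimate_cost(model: str, audio_seconds: int) -> Decimal:
    """Estimate cost in USD, rounded to cents (assumes linear billing)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(audio_seconds) / Decimal(3600)
    cost = PRICE_PER_HOUR[model] * hours
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For instance, ten hours of pre-recorded audio on Universal-2 works out to `estimate_cost("Universal-2", 36000)`, which is 10 × $0.15 = $1.50 at list price.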

Next steps