Models
AssemblyAI offers several state-of-the-art speech recognition models, each optimized for different use cases. Choose the model that best fits your needs based on accuracy, latency, cost, and language requirements.
Pre-recorded models
Universal-3 Pro
- Highest accuracy across 6 languages
- Regional dialect and local variant recognition
- Advanced prompting capabilities
- Keyterms prompting up to 1,000 words
- Native code switching
Universal-2
- High accuracy, low latency
- Support across 99 languages
- Keyterms prompting up to 200 words
- Code switching
Streaming models
Universal-3 Pro Streaming
- Highest accuracy for voice agents
- Fastest word emissions
- Advanced prompting capabilities
- Keyterms prompting up to 100 words
- 6 languages: en, es, pt, de, fr, it
Universal-Streaming Multilingual
- Good balance of speed and cost-effectiveness
- Multilingual real-time transcription
- Keyterms prompting up to 100 words
- 6 languages: en, es, pt, de, fr, it
Universal-Streaming English
- Good balance of speed and cost-effectiveness
- English transcription
- Keyterms prompting up to 100 words
- Intelligent endpointing
Whisper Streaming
- Open-source Whisper with AssemblyAI infrastructure
- 99+ languages at an accessible price point
- Automatic language detection
- Unlimited scale
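The keyterms limits on this page differ by model, so it can help to check a keyterm list on the client side before sending a request. The following is a minimal sketch, not part of any AssemblyAI SDK: the function names are hypothetical, the models are keyed by their display names rather than API identifiers, and it assumes the limit counts whitespace-separated words across all keyterms, which this page does not spell out.

```python
# Keyterm word limits as stated on this page, keyed by display name.
# These keys are display names, NOT confirmed API identifiers.
KEYTERM_WORD_LIMITS = {
    "Universal-3 Pro": 1000,
    "Universal-2": 200,
    "Universal-3 Pro Streaming": 100,
    "Universal-Streaming Multilingual": 100,
    "Universal-Streaming English": 100,
}


def count_keyterm_words(keyterms):
    """Count whitespace-separated words across all keyterms.

    Assumption: the limit applies to the total word count, so a
    two-word phrase such as "atrial fibrillation" counts as two.
    """
    return sum(len(term.split()) for term in keyterms)


def check_keyterms(model, keyterms):
    """Return (within_limit, words_used, limit) for the given model.

    Raises ValueError when this page lists no keyterm limit for the model
    (for example, Whisper Streaming).
    """
    if model not in KEYTERM_WORD_LIMITS:
        raise ValueError(f"No keyterm limit listed for {model!r}")
    limit = KEYTERM_WORD_LIMITS[model]
    used = count_keyterm_words(keyterms)
    return used <= limit, used, limit
```

For example, `check_keyterms("Universal-2", ["aspirin", "atrial fibrillation"])` reports three words used against a limit of 200.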
Add-on models
Add-on models enhance transcription accuracy for specialized domains. They work alongside your chosen speech model and are billed separately.
Medical Mode
Medical Mode (domain: "medical-v1") is an add-on that enhances transcription accuracy for medical terminology — including medication names, procedures, conditions, and dosages. It is optimized for medical entity recognition to correct terms that other models frequently get wrong.
Supported models:
- Pre-recorded: Universal-3 Pro, Universal-2
- Streaming: Universal-3 Pro Streaming, Universal-Streaming English, Universal-Streaming Multilingual
Supported languages: English, Spanish, German, French
Medical Mode is billed as a separate add-on. See the pricing page for details.
Learn more: Medical Mode for pre-recorded audio | Medical Mode for streaming
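As a rough illustration of the support matrix above, the sketch below decides whether to attach the `domain` value `"medical-v1"` to a request. It is a hedged example under stated assumptions: the helper name is hypothetical, models are identified by display name rather than API identifier, languages are matched on two-letter codes, and how the `domain` field is actually sent is described in the Medical Mode guides, not here.

```python
MEDICAL_DOMAIN = "medical-v1"  # domain value given on this page

# Models and languages listed as supported for Medical Mode on this page.
# Display names are used here; real API identifiers may differ.
MEDICAL_MODE_MODELS = frozenset({
    "Universal-3 Pro",
    "Universal-2",
    "Universal-3 Pro Streaming",
    "Universal-Streaming English",
    "Universal-Streaming Multilingual",
})
MEDICAL_MODE_LANGUAGES = frozenset({"en", "es", "de", "fr"})


def medical_mode_options(model, language_code):
    """Return {"domain": "medical-v1"} when the model and language are both
    listed as supported, otherwise an empty dict.

    Hypothetical helper: the returned dict is meant to be merged into
    whatever request options your client already builds.
    """
    supported = (
        model in MEDICAL_MODE_MODELS
        and language_code.lower() in MEDICAL_MODE_LANGUAGES
    )
    return {"domain": MEDICAL_DOMAIN} if supported else {}
```

Returning an empty dict for unsupported combinations lets the caller merge the result unconditionally instead of branching at every call site.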
Choosing the right model
Pre-recorded
Universal-3 Pro
Universal-3 Pro is our most advanced transcription model, delivering state-of-the-art accuracy across 6 languages with powerful prompting capabilities. It supports prompting in plain language for tasks like context-specific transcription, verbatim output, audio tagging, and speaker diarization, giving you fine-grained control to guide transcription results. With keyterms prompting supporting up to 1,000 words, built-in code switching, and multichannel support, Universal-3 Pro is ideal for complex audio scenarios requiring the highest accuracy.
Best for:
- Highest-accuracy transcription where quality matters more than speed
- Post-call analytics and conversation intelligence
- Meeting notetakers
- Medical transcription
- Recruiting and interviews — high-quality diarization + entity accuracy
- Domain-specific accuracy via keyterm prompting (up to 1,000 words) — entities, proper nouns, rare terms
- Code-switching across EN/ES/DE/FR/PT/IT
Supported languages
en, es, de, fr, pt, it
Regional dialects
Universal-3 Pro also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.
Universal-2
Universal-2 offers the broadest language coverage of any of our models, supporting high-accuracy transcription across 99 languages with low latency. It supports customization through keyterms prompting (up to 200 words) and includes features like multichannel support, automatic language detection, code switching, speaker diarization, and more. Universal-2 is the go-to choice when you need reliable transcription across diverse languages.
Best for:
- High accuracy at lower cost with broad language support
- High-volume, price-sensitive batch transcription
- Support for 99 languages
- Recommended fallback when a requested language isn’t supported by Universal-3 Pro
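The fallback recommendation above can be sketched as a small routing function. This is an illustrative sketch only: the function name is hypothetical, the return values are display names rather than API identifiers, and matching is done on the exact lowercase code, so regional variants such as `en_au` are conservatively routed to Universal-2 even though Universal-3 Pro may handle them.

```python
# The six languages listed for Universal-3 Pro on this page.
UNIVERSAL_3_PRO_LANGUAGES = frozenset({"en", "es", "de", "fr", "pt", "it"})


def choose_prerecorded_model(language_code):
    """Pick Universal-3 Pro when the language is one of its six listed
    languages, otherwise fall back to Universal-2.

    This does not verify that Universal-2 supports the code; check it
    against the Universal-2 supported languages list separately.
    """
    if language_code.lower() in UNIVERSAL_3_PRO_LANGUAGES:
        return "Universal-3 Pro"
    return "Universal-2"
```

For instance, `choose_prerecorded_model("es")` selects Universal-3 Pro, while `choose_prerecorded_model("ja")` falls back to Universal-2.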
Supported languages
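en, en_au, en_uk, en_us, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, tr, uk, vi, af, sq, am, ar, hy, as, az, ba, eu, be, bn, bs, br, bg, my, ca, hr, cs, da, et, fo, gl, ka, el, gu, ht, ha, haw, he, hu, is, id, jw, kn, kk, km, lo, la, lv, ln, lt, lb, mk, mg, ms, ml, mt, mi, mr, mn, ne, no, nn, oc, pa, ps, fa, ro, sa, sr, sn, sd, si, sk, sl, so, su, sw, sv, de_ch, tl, tg, ta, tt, te, th, bo, tk, ur, uz, cy, yi, yo
Streaming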
Universal-3 Pro Streaming
Universal-3 Pro Streaming is our most accurate model for voice agents that demand the highest quality, with the fastest word emissions. It offers advanced prompting capabilities, including both keyterms prompting and native prompting, and supports English, Spanish, German, French, Portuguese, and Italian.
Best for:
- Real-time voice agents
- Applications requiring premium accuracy
- Customer service voice agents needing elite entity accuracy
- IVR replacement / binary response detection in short utterances
- Agent assist and sales intelligence needing real-time speaker diarization and mid-session dynamic prompting
- Multilingual voice agents — native EN/ES/DE/FR/PT/IT code-switching
- Compliance and verbatim recording — disfluency control via prompting
Supported languages
en, es, de, fr, pt, it
Regional dialects
Universal-3 Pro Streaming also supports regional dialects and local speech variants out of the box — no special configuration needed. See the full list of supported dialects.
Learn more about Universal-3 Pro Streaming
Universal-Streaming Multilingual
A multilingual transcription model offering a good balance of speed and cost-effectiveness. Supports English, Spanish, German, French, Portuguese, and Italian. Features intelligent endpointing and keyterms prompting support for up to 100 words.
Best for:
- Cost-effective real-time transcription across languages
- Cost-sensitive multilingual streaming across EN/ES/DE/FR/PT/IT
Supported languages
en, es, de, fr, pt, it
Learn more about Universal-Streaming Multilingual
Universal-Streaming English
An English transcription model offering a good balance of speed and cost-effectiveness. Features ~300ms word-by-word immutable transcripts, intelligent endpointing, and keyterms prompting support for up to 100 words.
Best for:
- Cost-effective real-time transcription for English
- English-only real-time apps — fastest and cheapest streaming option for English
Supported languages
en
Learn more about Universal-Streaming English
Whisper Streaming
An open-source Whisper model enhanced with AssemblyAI’s reliable infrastructure and unlimited scale. Supports 99+ languages at an accessible price point with automatic language detection and non-speech tags.
Best for:
- Multilingual applications and open-source flexibility
- Customers who prefer open-source models
- Cost-sensitive multilingual transcription
Supported languages
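en, es, fr, de, it, pt, nl, hi, ja, zh, fi, ko, pl, ru, tr, uk, vi, af, sq, am, ar, hy, as, az, ba, eu, be, bn, bs, br, bg, my, yue, ca, hr, cs, da, et, fo, gl, ka, el, gu, ht, ha, haw, he, hu, is, id, jw, kn, kk, km, lo, la, lv, ln, lt, lb, mk, mg, ms, ml, mt, mi, mr, mn, ne, no, nn, oc, pa, ps, fa, ro, sa, sr, sn, sd, si, sk, sl, so, su, sw, sv, tl, tg, ta, tt, te, th, bo, tk, ur, uz, cy, yi, yo
Learn more about Whisper Streaming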
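One possible way to turn the streaming guidance above into code is a simple decision function. Treat this as a sketch under assumptions rather than official routing logic: the function and parameter names are hypothetical, the return values are display names rather than API identifiers, and the only inputs considered are the language code and whether accuracy is prioritized over cost.

```python
# The six languages listed for Universal-3 Pro Streaming and
# Universal-Streaming Multilingual on this page.
SIX_STREAMING_LANGUAGES = frozenset({"en", "es", "de", "fr", "pt", "it"})


def choose_streaming_model(language_code, prioritize_accuracy=False):
    """Suggest a streaming model from the descriptions on this page.

    - Accuracy first, in one of the six languages: Universal-3 Pro Streaming
    - English, cost-sensitive: Universal-Streaming English
    - Other five languages, cost-sensitive: Universal-Streaming Multilingual
    - Anything else: Whisper Streaming (99+ languages)
    """
    code = language_code.lower()
    if code in SIX_STREAMING_LANGUAGES and prioritize_accuracy:
        return "Universal-3 Pro Streaming"
    if code == "en":
        return "Universal-Streaming English"
    if code in SIX_STREAMING_LANGUAGES:
        return "Universal-Streaming Multilingual"
    return "Whisper Streaming"
```

A real selection would likely also weigh features such as intelligent endpointing, speaker diarization, or prompting needs, which this sketch ignores.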
Pricing
For detailed pricing information, visit our pricing page.
The rates listed on the pricing page are offered subject to participation in our model improvement program, which helps us continue to provide best-in-class speech-to-text. Rates may differ for accounts that opt out of this program.
For volume discounts, please reach out to sales@assemblyai.com.
Next steps
- Explore Speech Understanding features like summarization, sentiment analysis, and more
- Learn about prompting: Universal-3 Pro prompting guide | Universal-3 Pro Streaming prompting guide