Simple, transparent pricing

Free

Start building with $50 of free credits

For developers looking to prototype with Speech AI
  • Access to industry-leading Speech-to-Text and Audio Intelligence models
    • Speech recognition
    • Speaker diarization
    • Custom spelling and vocabulary
    • Profanity filtering, auto punctuation and casing
  • Transcribe up to 416 hours of audio for free
  • Get tips and support as you build from developer docs and Community resources
Start building for free

Pay as you go

Start as low as $0.12/hr for Speech-to-Text

For teams ready to integrate Speech AI into their products
  • Unlimited access to Speech-to-Text, Audio Intelligence, and LeMUR
  • Streaming Speech-to-Text
  • Concurrency starting at 200 files and 100 streams
  • Technical support via live chat and email
Get started

Custom

Start building with $50 of free credit

For teams and organizations building AI products at scale
  • Flexible, zero-obligation pricing that scales to millions of hours
  • Dedicated technical support with response time under one hour
  • Customize rate limits - scale to any workload
  • Customized SLAs and SLOs
  • Compliance with EU Data Residency standards
  • Self-hosted deployments (On-prem, VPC) (Coming soon!)
  • Early access to new models and model improvements
  • Available through AWS Marketplace
Contact us

Speech-to-Text

Build on top of the most accurate Speech-to-Text model on the market with >93% accuracy

Tiers
Pay as you go
Nano

Fast, lightweight Speech AI at an accessible price point

Free up to $50
$0.12 /hr
Lower rates based on volume
Best

Highest accuracy, and most advanced capabilities

Free up to $50
$0.37 /hr
Lower rates based on volume
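
As a rough check on how far the $50 of free credits stretches, the listed hourly rates can be divided into the credit amount. The sketch below assumes the credit is spent only on Speech-to-Text at the list price, with no add-on features and no volume discount; the tier keys and the function name are illustrative, not API identifiers.

```python
from decimal import Decimal

# Listed values from this page; the dictionary keys are illustrative labels.
FREE_CREDIT_USD = Decimal("50")
RATE_PER_HOUR_USD = {
    "nano": Decimal("0.12"),
    "best": Decimal("0.37"),
}

def free_hours(tier: str) -> int:
    """Whole hours of audio the free credit covers at the listed rate."""
    # int() truncates toward zero, so a partial hour is not counted.
    return int(FREE_CREDIT_USD / RATE_PER_HOUR_USD[tier])

nano_hours = free_hours("nano")  # matches the "up to 416 hours" figure
best_hours = free_hours("best")
```

Under these assumptions the Nano rate reproduces the 416-hour figure quoted for the free offer, while the same credit covers fewer hours at the Best rate.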
Features
Speaker Diarization

Automatically detect the number of speakers in your audio file and associate each word in the transcription text with its speaker.

Automatic Language Detection

Automatically detect if the dominant language of the spoken audio is supported by our API and route it to the appropriate model for transcription.

Profanity Filtering

Automatically detect and replace profanity in the transcription text.

Custom Vocabulary

Only available in Best tier. Boost accuracy for vocabulary that is unique or custom to your specific use case or product.

Multichannel

Transcribe audio files with multiple speakers separately.

Filler Word Filtering

Optionally include disfluencies in the transcripts of your audio files.

Custom Spelling

Specify how you would like certain words to be spelled or formatted in the transcription text.

Word Timestamps

Word-by-word timestamps across the entire transcript text.

Auto Punctuation and Casing

Automatically add punctuation and proper-noun casing to the transcription text.

ITN/Formatting

Automatically convert spoken form text into its proper written format to increase transcript readability.

Confidence Scores

Get a confidence score for each word in the transcript.

Word Search

Search through a completed transcript for a specific set of keywords, which is useful for quickly finding relevant information.

Export SRT/VTT Captions

Export completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos.

Export Paragraphs/Sentences

Retrieve transcripts that are automatically segmented into paragraphs or sentences, for a more reader-friendly experience.

Streaming Speech-to-Text

Transcribe live audio and video files synchronously at low latency and high quality

Tiers
Pay as you go
Best

Highest accuracy, and most advanced capabilities

$0.47 /hr
Lower rates based on volume
Features
Auto Punctuation and Casing

Automatically add punctuation and proper-noun casing to the transcription text.

Custom Vocabulary

Only available in Best tier. Boost accuracy for vocabulary that is unique or custom to your specific use case or product.

End of Utterance Detection

Customize End of Utterance Detection to more accurately detect when one speaker finishes an utterance in Streaming Speech-to-Text.

ITN/Formatting

Automatically convert spoken form text into its proper written format to increase transcript readability.

Speech Understanding

LeMUR Models
Pay as you go
Claude 3.5 Sonnet

Claude 3.5 Sonnet is the most intelligent model to date, outperforming Claude 3 Opus on a wide range of evaluations, with the speed and cost of Claude 3 Sonnet.

$0.003 / 1k tokens (Input)
$0.015 / 1k tokens (Output)
Claude 3 Opus

Claude 3 Opus is good at handling complex analysis, longer tasks with many steps, and higher-order math and coding tasks.

$0.015 / 1k tokens (Input)
$0.075 / 1k tokens (Output)
Claude 3 Haiku

Claude 3 Haiku is the fastest model that can execute lightweight actions.

$0.00025 / 1k tokens (Input)
$0.00125 / 1k tokens (Output)
Claude 3 Sonnet

Claude 3 Sonnet is a legacy model with a balanced combination of performance and speed for efficient, high-throughput tasks.

$0.003 / 1k tokens (Input)
$0.015 / 1k tokens (Output)
Audio Intelligence Features
Entity Detection

Identify a wide range of entities that are spoken in your audio files, such as person and company names, email addresses, dates, and locations.

Included in free credits
$0.08 /hr
Lower rates based on volume
Topic Detection

Label the topics that are spoken in your audio and video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for contextual targeting.

Included in free credits
$0.15 /hr
Lower rates based on volume
Key Phrases

Accurately identify significant words and phrases in your transcription, enabling you to extract the most pertinent concepts or highlights from your audio/video file.

Included in free credits
$0.01 /hr
Lower rates based on volume
PII Audio Redaction
Included in free credits
$0.05 /hr
Lower rates based on volume
PII Redaction

Identify and remove Personally Identifiable Information, such as phone numbers and social security numbers, from the transcription text before it is returned to you.

Included in free credits
$0.08 /hr
Lower rates based on volume
Sentiment Analysis

With Sentiment Analysis, AssemblyAI can detect the sentiment of each sentence of speech spoken in your audio files.

Included in free credits
$0.02 /hr
Lower rates based on volume
Content Moderation

Detect sensitive content in your audio and video files - such as hate speech, violence, sensitive social issues, alcohol, drugs, and more.

Included in free credits
$0.15 /hr
Lower rates based on volume
Auto Chapters

Automatically generate a summary over time for audio and video files.

Included in free credits
$0.08 /hr
Lower rates based on volume
Summarization

Leverage our AI-powered Summarization models to automatically summarize audio/video data in your products at scale. Customize the summary types to best fit your use case.

Included in free credits
$0.03 /hr
Lower rates based on volume

Rate Limits

                            Free               Pay as you go              Custom
Hours of audio              Up to 416 hours    Unlimited                  Unlimited
Pre-recorded concurrency    5 files            Starting at 200 files      Talk to us
Streaming concurrency       -                  Starting at 100 streams    Talk to us

Security and Privacy

Pay as you go
GDPR
PCI-DSS
SOC 2 Type 1/Type 2
EU Data Residency
ISO 27001

Frequently Asked Questions

What are the differences between Speech-to-Text tiers?

AssemblyAI’s Best tier is our most robust and accurate offering, houses our most powerful models, and has the broadest range of capabilities. The Best tier is suited for use cases where accuracy and power are paramount. AssemblyAI’s Nano tier is a fast, lightweight offering that gives product and development teams access to Speech AI at an attainable price point across 99 languages. It is best for teams with extensive language needs, and those who are looking for a low-cost Speech AI option.

Can I sign up for free?

Yes! With the free offer, you get $50 in credits to use towards AssemblyAI’s Speech-to-Text APIs. To add more credits and gain access to Streaming and LeMUR, simply add a credit card to your account.

Do you offer volume discounts?

Absolutely! If you plan to send large volumes of audio and video content through our API, please reach out to us to see if you qualify for a volume discount.

How long does it take to process audio and video files?

Most audio files sent to AssemblyAI's API can be processed in less than 60 seconds.

How does billing work?

Great question. Once you add a credit card and deposit funds into your account, the balance is drawn down as you use the API.

How is multichannel billed?

When multichannel is enabled, each channel is transcribed and billed separately. To calculate your total cost, multiply your recording's duration by the hourly transcription rate (billed per second), then multiply that result by the number of channels.

For example, if you sent a 5-minute recording with three channels, you would be billed for the 5 minutes of audio multiplied by the standard rate, with that total multiplied by three channels. This is equivalent to being billed for 15 minutes of audio.
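
The multichannel calculation can be sketched in a few lines. This is only an illustration of the arithmetic described in this answer: the function names are hypothetical, and the $0.37/hr Best rate is used purely as a sample value for the hourly rate.

```python
from decimal import Decimal

SECONDS_PER_HOUR = Decimal(3600)

def billed_seconds(duration_seconds: int, channels: int) -> int:
    """Each channel is billed separately, so billed time scales with channels."""
    return duration_seconds * channels

def multichannel_cost(duration_seconds: int,
                      hourly_rate_usd: Decimal,
                      channels: int) -> Decimal:
    """Duration x hourly rate x channels, with duration measured in seconds."""
    total_seconds = billed_seconds(duration_seconds, channels)
    return hourly_rate_usd * total_seconds / SECONDS_PER_HOUR

# A 5-minute recording with three channels bills like 15 minutes of audio.
seconds = billed_seconds(5 * 60, 3)                      # 900 seconds
cost = multichannel_cost(5 * 60, Decimal("0.37"), 3)     # sample rate only
```

Using `Decimal` rather than floats keeps the per-second amounts exact, which matters when many small charges are summed.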

Can I purchase or use AssemblyAI through the AWS Marketplace?

Yes. You can get started with AssemblyAI on the AWS Marketplace, or ask your AWS account team how to leverage AssemblyAI to revolutionize the way your company understands its customers.

How can I talk to someone?

Feel free to email us at support@assemblyai.com, or click the chat button in the bottom right corner of your browser to chat live with our API Support team!

What languages do you support?

We support over 99 languages and counting, including Global English (English and all of its accents).

What is a token?

In the context of a Large Language Model (LLM), a “token” is the smallest unit of text processed by the model. 100 tokens map to roughly 75 words.
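
To get a feel for what the per-token LeMUR prices mean in practice, the 100-tokens-per-75-words rule of thumb can be combined with the listed rates. The sketch below is an estimate only: real token counts depend on the model's tokenizer, and the model keys and function names are illustrative labels rather than API identifiers.

```python
from decimal import Decimal

# (input, output) price in USD per 1k tokens, as listed on this page.
# The keys are illustrative labels, not official model identifiers.
PRICE_PER_1K_TOKENS_USD = {
    "claude-3.5-sonnet": (Decimal("0.003"), Decimal("0.015")),
    "claude-3-opus": (Decimal("0.015"), Decimal("0.075")),
    "claude-3-haiku": (Decimal("0.00025"), Decimal("0.00125")),
    "claude-3-sonnet": (Decimal("0.003"), Decimal("0.015")),
}

def estimate_tokens(word_count: int) -> int:
    """Approximate tokens from words using 100 tokens ~ 75 words, rounded up."""
    return -((-word_count * 100) // 75)

def lemur_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated cost in USD for one request at the listed rates."""
    input_price, output_price = PRICE_PER_1K_TOKENS_USD[model]
    return (input_price * input_tokens + output_price * output_tokens) / 1000

# Example: a 7,500-word transcript as input and a 500-token response.
input_tokens = estimate_tokens(7500)
cost = lemur_cost("claude-3.5-sonnet", input_tokens, 500)
```

Because the word-to-token ratio is approximate, treat the result as a budgeting aid rather than an exact invoice amount.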

Turn voice data into unparalleled product experiences

Partner with the leader in Speech AI to build powerful products with breakthrough industry impact.