July 21, 2026

AssemblyAI vs Deepgram for medical transcription

AssemblyAI vs Deepgram for medical transcription: compare accuracy, speed, speaker diarization, PII redaction, and pricing to choose the right API.

Kelsey Foster

Growth

Medical

Speech-to-Text

If you're building anything that listens to clinical audio—an ambient scribe, a telehealth summarizer, a nurse triage tool—the number that should keep you up at night isn't overall word error rate. It's how often the model drops the one word that matters.

A visit note that transcribes 99% of a conversation perfectly and mangles the drug name is not 99% correct. It's wrong in the only place a clinician will actually read.

So when people ask us how AssemblyAI stacks up against Deepgram Nova-3 Medical, I don't start with generic accuracy. I start with missed entities: how often each system fails to catch a medication, a condition, a procedure, or a clinical term. On our internal benchmark, AssemblyAI Medical Mode misses roughly 4.9% of medical entities versus roughly 7.3% for Deepgram Nova-3 Medical. That's about a 32% lower miss rate on exactly the words that carry clinical risk.

Here's the full comparison—accuracy, the failure that scares me most, contextual prompting, speaker labeling, pricing, and compliance—so you can decide which medical speech-to-text engine belongs in your stack.

The number that matters: missed entity rate

Overall word error rate is a fine headline metric. Our async English Universal-3.5 Pro model runs a mean WER of 5.6% (median 4.9%), and you can see the full breakdown on our benchmarks page.

But WER treats every word the same. "The" and "epinephrine" count identically. In a clinical note, they don't.

That's why the metric we optimize Medical Mode against is Missed Entity Rate—the percentage of medical entities (drugs, conditions, procedures, clinical terms) the model fails to transcribe correctly. Turn on Medical Mode and that rate drops about 20% relative to our general model on medical audio.

Against Deepgram, the gap looks like this:

Metric	AssemblyAI Medical Mode	Deepgram Nova-3 Medical
Missed entity rate (drugs, conditions, procedures, clinical terms)	~4.9%	~7.3%
Relative error reduction	~32% lower	baseline
Contextual prompting (prime with prior-visit note)	Yes	No
Native code-switching	18 languages	Limited
Medical Mode pricing	+$0.15/hr add-on	Bundled, annual commit
Billing model	Pay-as-you-go, per-second, no minimums	Typically annual commit (~$40–50K)

Roughly 4.9% versus 7.3% doesn't sound dramatic until you scale it. Across ten thousand patient encounters, that 2.4-point gap adds up to thousands of additional correctly captured medical terms, each one a place where a clinician doesn't have to stop, rewind, and second-guess the note.
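To put a number on "thousands," here's a back-of-the-envelope sketch in Python. The 40 medical entities per encounter is an assumption picked for illustration, not a figure from our benchmark, so swap in whatever your own notes average. With the rounded rates, the relative reduction works out to about 33%, in line with the roughly 32% quoted above.

```python
# Back-of-the-envelope scaling of the missed-entity gap.
# ENTITIES_PER_ENCOUNTER is an illustrative assumption, not benchmark data.
ENCOUNTERS = 10_000
ENTITIES_PER_ENCOUNTER = 40

MISS_RATE_ASSEMBLYAI = 0.049  # ~4.9%, from the comparison table
MISS_RATE_DEEPGRAM = 0.073    # ~7.3%, from the comparison table


def additional_entities_captured(encounters, entities_per_encounter,
                                 lower_miss_rate, higher_miss_rate):
    """Entities the lower-miss-rate system captures that the other would miss."""
    total_entities = encounters * entities_per_encounter
    return round(total_entities * (higher_miss_rate - lower_miss_rate))


extra = additional_entities_captured(
    ENCOUNTERS, ENTITIES_PER_ENCOUNTER, MISS_RATE_ASSEMBLYAI, MISS_RATE_DEEPGRAM
)
relative_reduction = 1 - MISS_RATE_ASSEMBLYAI / MISS_RATE_DEEPGRAM

print(f"Additional entities captured: {extra:,}")
print(f"Relative reduction in missed entities: {relative_reduction:.1%}")
```

Under that assumption, the gap comes to about 9,600 additional medical terms captured across the ten thousand encounters.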

The error that should scare you

Let me give you the concrete version, because abstractions don't communicate clinical risk.

A clinician orders "epinephrine, IM" (intramuscular). A transcription model that isn't tuned for clinical audio hears "epinephrine, I'm" and moves on. Now the route of administration has vanished from the note, replaced by a stray contraction.

That's not a typo. That's a missing clinical instruction in a document someone downstream will trust.

This is the whole reason Missed Entity Rate matters more than WER. The "I'm" is spelled correctly. A spellchecker waves it through. WER barely flinches. And the note is still wrong in a way that can hurt a patient.
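To make that concrete, here's a minimal Python sketch that scores the "IM" slip both ways. The sentence, the entity list, and the exact-token matching rule are all assumptions for illustration. This is not how our benchmark is computed, just the shape of the idea.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens, keeping apostrophes (so "I'm" != "IM")."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def missed_entity_rate(entities, hypothesis):
    """Share of reference entities whose tokens don't appear contiguously in the hypothesis."""
    hyp = tokenize(hypothesis)

    def found(entity):
        words = tokenize(entity)
        return any(hyp[i:i + len(words)] == words
                   for i in range(len(hyp) - len(words) + 1))

    missed = [e for e in entities if not found(e)]
    return len(missed) / len(entities)


# Invented example utterance and entity list.
reference = ("Patient is in anaphylaxis, so give epinephrine IM now, "
             "then recheck blood pressure and airway in five minutes.")
hypothesis = reference.replace("IM", "I'm")
entities = ["anaphylaxis", "epinephrine", "IM"]

wer = word_error_rate(reference, hypothesis)
mer = missed_entity_rate(entities, hypothesis)
print(f"WER: {wer:.1%}  Missed entity rate: {mer:.1%}")
```

One substituted word out of 18 gives a WER of about 5.6%, while one missed entity out of three gives a missed entity rate of about 33%. Same transcript, very different picture of the risk.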

Medical Mode is trained to hold onto exactly these entities—drug names, routes, dosages, procedures—instead of smoothing them into whatever ordinary English word sounds closest.

Test Medical Mode on Your Own Clinical Audio

See how Medical Mode holds onto drug names, dosages, and routes instead of smoothing them into everyday words. Run your own recordings through the Playground in minutes.

Contextual prompting: the differentiator Deepgram doesn't have

Here's where it gets interesting.

Most transcription is context-blind. The model hears the audio and nothing else. But clinical audio almost never arrives with zero context—there's a patient, and that patient has a history.

With AssemblyAI you can prime the model with that context before it transcribes. Feed it the patient's prior-visit note—their medication list, their standing conditions, the procedures already on their chart—and the model goes into the audio already expecting those terms.

In an internal healthcare test, priming the model with a patient's prior-visit note cut missed medical terms by 31%. Same audio. Same model. The only difference was giving it the context a human scribe would already have in front of them.

Think about why that works. If the chart already says a patient is on "metoprolol," the model doesn't have to guess between that and a dozen phonetic neighbors when the physician mutters it mid-sentence. The prior note tips the odds toward the right answer.
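In code, priming might look something like the sketch below. Treat it as an illustration, not API reference: `domain="medical-v1"` is the Medical Mode setting described in this post, but the `prompt` field name, the `MAX_CONTEXT_CHARS` cap, and the `build_primed_request` helper are assumptions made for this example. Check the docs for the actual parameter that carries context.

```python
import json

# Hypothetical cap on how much prior-note text to send as context.
MAX_CONTEXT_CHARS = 1500


def build_primed_request(audio_url, prior_visit_note):
    """Build a transcription request body primed with a prior-visit note.

    The "prompt" key is an assumed name for the context field.
    """
    # Collapse whitespace so line breaks in the note don't eat into the cap.
    context = " ".join(prior_visit_note.split())[:MAX_CONTEXT_CHARS]
    return {
        "audio_url": audio_url,
        "domain": "medical-v1",  # Medical Mode
        "prompt": context,       # assumption: field name for contextual prompting
    }


# Invented prior-visit note for illustration.
prior_note = """
Medications: metoprolol 25 mg BID, atorvastatin 40 mg nightly.
Conditions: hypertension, hyperlipidemia.
"""
payload = build_primed_request("https://example.com/visit-audio.mp3", prior_note)
print(json.dumps(payload, indent=2))
```

The sketch only builds the request body. Sending it, and handling PHI in the note appropriately, is left to your existing client code.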

Speaker labeling for real consultations

Clinical audio is rarely one voice. It's a physician and a patient, sometimes a caregiver, sometimes an interpreter, sometimes a resident chiming in from across the room.

If your note can't tell who said what, the summary downstream inherits the confusion—patient-reported symptoms get attributed to the clinician, or instructions get logged as complaints.

Universal-3.5 Pro ships the most accurate speaker diarization we've released, optimized specifically for concatenated minimum-permutation word error rate (cpWER), which scores diarization and transcription together rather than pretending they're separate problems. We've written more about why speaker identification is its own hard problem if you want to go deeper.
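If cpWER is new to you, here's a minimal Python sketch of the idea: concatenate each speaker's words, try every mapping between reference and hypothesis speakers, and keep the mapping with the fewest word errors. The two-line exchange is invented for illustration, and the brute-force search over permutations is only practical for a handful of speakers. It's a teaching version, not our evaluation code.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ref_word != hyp_word)))
        prev = cur
    return prev[-1]


def cpwer(reference, hypothesis):
    """Both arguments map a speaker label to that speaker's concatenated words."""
    ref_streams = list(reference.values())
    hyp_streams = list(hypothesis.values())
    # Pad with empty streams so every speaker is paired with something.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[] for _ in range(n - len(ref_streams))]
    hyp_streams += [[] for _ in range(n - len(hyp_streams))]
    total_ref_words = sum(len(stream) for stream in ref_streams)
    # Speaker labels are arbitrary, so score the best possible label mapping.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_ref_words


reference = {
    "clinician": "any chest pain today".split(),
    "patient": "yes since this morning".split(),
}
# Same words, same attribution, different label names.
relabeled = {
    "A": "yes since this morning".split(),
    "B": "any chest pain today".split(),
}
# Every word is right, but the patient's "yes" lands on the clinician.
misattributed = {
    "A": "any chest pain today yes".split(),
    "B": "since this morning".split(),
}

print(cpwer(reference, relabeled))
print(cpwer(reference, misattributed))
```

Both hypotheses contain exactly the right words, so a plain WER over the pooled transcript would be zero for each. cpWER still charges the second one 25% (two errors over eight reference words) because a patient-reported answer ended up in the clinician's mouth.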

And because native code-switching covers 18 languages, a consultation that slides between English and Spanish mid-sentence—extremely common in real clinics—doesn't fall apart at the language boundary. Medical Mode itself supports English, Spanish, German, and French.

What it costs, and how you pay

This is where a lot of medical transcription decisions actually get made, so let's be specific.

AssemblyAI Medical Mode is a $0.15/hr add-on. You set domain="medical-v1"—no model switch, no separate endpoint. Layered on flagship async Universal-3.5 Pro at $0.21/hr, your all-in rate is $0.36/hr. If you need broader language coverage over peak accuracy, Universal-2 runs $0.15/hr across 99+ languages, and Universal-3 Pro remains available as a pinnable snapshot if you've standardized on it. Full breakdown lives on the pricing page.

Now the comparison that usually gets buried.

Deepgram typically requires an annual commit—we routinely see contracts in the ~$40–50K range before you've processed a single production hour. AssemblyAI is pay-as-you-go, billed per second, with no minimums. You can prototype an ambient scribe this afternoon, run a hundred hours through it, and pay for a hundred hours. No procurement cycle to find out whether the accuracy holds up on your audio.

And if you're eyeing AWS Transcribe Medical as the "enterprise-safe" option: it runs around $4.15/hr. Our medical add-on is $0.15/hr on top of a $0.21/hr base, or $0.36/hr all-in. That's less than a tenth of the AWS rate, so "the AWS medical tier" and "the AssemblyAI medical add-on" are not in the same pricing universe.
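If you'd rather see the math than run it in your head, here's a small Python sketch using only the rates quoted in this post. The 100-hour pilot and the 450-second visit are made-up workloads, and the $40,000 figure is the low end of the commit range mentioned above, used here only as a yardstick.

```python
from decimal import Decimal

BASE_RATE = Decimal("0.21")         # Universal-3.5 Pro async, $/hr
MEDICAL_ADDON = Decimal("0.15")     # Medical Mode add-on, $/hr
AWS_MEDICAL_RATE = Decimal("4.15")  # approximate AWS Transcribe Medical, $/hr
ALL_IN_RATE = BASE_RATE + MEDICAL_ADDON
CENT = Decimal("0.01")


def cost(seconds, hourly_rate):
    """Per-second billing: pay for exactly the audio processed."""
    return Decimal(seconds) / 3600 * hourly_rate


pilot_seconds = 100 * 3600  # hypothetical 100-hour pilot
pilot_assemblyai = cost(pilot_seconds, ALL_IN_RATE).quantize(CENT)
pilot_aws = cost(pilot_seconds, AWS_MEDICAL_RATE).quantize(CENT)

single_visit = cost(450, ALL_IN_RATE)  # hypothetical 7.5-minute visit

# How many hours of Medical Mode audio a $40,000 budget would cover.
hours_for_40k = int(Decimal(40_000) / ALL_IN_RATE)

print(f"All-in rate: ${ALL_IN_RATE}/hr")
print(f"100-hour pilot: ${pilot_assemblyai} vs ${pilot_aws} at the AWS rate")
print(f"One 7.5-minute visit: ${single_visit}")
print(f"Hours covered by $40,000: {hours_for_40k:,}")
```

A 100-hour pilot comes to $36 at the all-in rate versus $415 at the AWS rate, and $40,000 would cover roughly 111,000 hours of Medical Mode audio.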

Prototype an Ambient Scribe This Afternoon

Pay-as-you-go, billed per second, no minimums and no annual commit. Get a free API key, set domain="medical-v1", and validate accuracy on your own audio.

Compliance: what we can actually say

I'm careful with this language, and you should be too, because the industry throws around claims it isn't entitled to make.

AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a Business Associate Addendum (BAA) that is required under HIPAA to ensure that AssemblyAI appropriately safeguards PHI.

That's the framing that matters for a healthcare build. If a vendor tells you their API is simply "HIPAA-compliant" with no BAA in the conversation, that's a flag, not a feature.

On the data-handling side, you can also automatically redact PII from transcripts—useful when you want to strip identifiers before text moves into a less-controlled part of your pipeline. The configuration lives in the docs.
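As a rough sketch, a redaction-enabled request body might be assembled like this in Python. The field names (`redact_pii`, `redact_pii_policies`, `redact_pii_sub`) and the policy values reflect my understanding of the API and should be treated as assumptions. Confirm them against the docs before relying on them. `build_redacted_request` is a hypothetical helper.

```python
def build_redacted_request(audio_url, policies=None):
    """Build a transcription request body with PII redaction switched on.

    Field names and policy values are assumptions to verify against the docs.
    """
    if policies is None:
        policies = ["person_name", "date_of_birth", "phone_number"]
    return {
        "audio_url": audio_url,
        "domain": "medical-v1",              # Medical Mode
        "redact_pii": True,                  # assumed flag name
        "redact_pii_policies": list(policies),
        "redact_pii_sub": "entity_name",     # assumed: replace with the entity type
    }


payload = build_redacted_request("https://example.com/visit-audio.mp3")
print(payload)
```

Which identifiers to redact is a policy decision for your compliance team. The default list here is only a placeholder.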

Who's using it

We built Medical Mode with design partners in the room, and the early signal has been strong.

"We've integrated the newest models from AssemblyAI for pre-recorded audio ASR in our ambient product, and it's been excellent. We're now exploring Universal-3.5 Pro for async and realtime speech-to-text capabilities for new use cases. What's been just as important is the reliability of the platform itself—both technically and in terms of partnership." — Gautam Pradeep, Tech Lead, Commure

So which one should you use?

If you already have a Deepgram Nova-3 Medical contract and your audio is clean, single-speaker, English-only dictation, you may not feel the difference every day.

But the moment your audio looks like real clinical work—multiple speakers, code-switching, patients with histories, dense medication and procedure vocabulary—the gap opens up. About 4.9% versus 7.3% missed entities. Contextual prompting that Deepgram can't match. Diarization tuned for cpWER. And a billing model that lets you validate all of it on your own audio before you sign anything.

The reason this comparison isn't really close, though, isn't any single number. It's that AssemblyAI treats clinical context as an input to transcription, not an afterthought. Deepgram transcribes the audio. We transcribe the audio in light of who the patient is. In a domain where the whole job is getting the rare, high-stakes word right, that's the difference between a tool clinicians tolerate and one they trust.

Talk to Us About BAA + Volume Pricing

Building healthcare workflows that handle PHI? Get a Business Associate Addendum and volume pricing guidance tailored to your clinical use case.

Frequently asked questions

How accurate is AssemblyAI on pharmaceutical and drug names?

Medical Mode is tuned to hold onto medical entities (drug names, dosages, routes, procedures, and conditions) rather than smoothing them into similar-sounding everyday words. On our internal benchmark it misses roughly 4.9% of medical entities versus roughly 7.3% for Deepgram Nova-3 Medical, about a 32% lower miss rate on exactly those terms. You can push accuracy higher still by priming the model with a patient's prior medication list as context.

Can it tell speakers apart in a multi-speaker consultation?

Yes. Universal-3.5 Pro includes the most accurate speaker diarization we've shipped, optimized for cpWER, which scores diarization and transcription together. That matters in real consultations with a physician, a patient, and sometimes a caregiver or interpreter, where mislabeling who said what corrupts the downstream note.

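Is AssemblyAI able to handle PHI, and do you offer a BAA?

Yes. AssemblyAI enables covered entities and their business associates subject to HIPAA to use the AssemblyAI services to process protected health information (PHI). AssemblyAI is considered a business associate under HIPAA, and we offer a Business Associate Addendum (BAA) that is required under HIPAA to ensure that AssemblyAI appropriately safeguards PHI.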

How does the cost compare to Deepgram and AWS Transcribe Medical?

Medical Mode is a $0.15/hr add-on on top of flagship async Universal-3.5 Pro at $0.21/hr, for $0.36/hr all-in, billed per second with no minimums. Deepgram typically requires an annual commit in the ~$40–50K range. AWS Transcribe Medical runs around $4.15/hr. AssemblyAI is pay-as-you-go, so you can validate accuracy on your own audio before committing to anything.

What's involved in migrating from Deepgram Nova-3 Medical?

You get a free API key, point your audio at our speech-to-text API, and set domain="medical-v1" to turn on Medical Mode—no separate model or endpoint. Because billing is per-second with no minimums, you can run a side-by-side test on your own recordings before moving production traffic, and add contextual prompting with prior-visit notes for another accuracy lift.
