
Prompting Guide (Async)

Start with no prompt

The default prompt outperforms most custom prompts. Omit the prompt parameter first — Universal-3 Pro automatically applies a built-in default that is already optimized for accuracy across a wide range of audio.

If the default isn’t a fit, start from one of the recommended prompts, then test against a representative set of your own audio (we suggest at least 25 files — see Evaluating your prompts). Layer in one additional instruction at a time. Do not start from scratch.
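A minimal sketch of the "omit the prompt first" advice, assuming the request body is a plain JSON object. Only the `prompt` parameter comes from this guide; the `audio_url` field name and the `build_request` helper are illustrative assumptions, so check the API reference for the exact request shape.

```python
from typing import Optional


def build_request(audio_url: str, prompt: Optional[str] = None) -> dict:
    """Build a transcription request body (hypothetical shape).

    The `prompt` key is left out entirely when no prompt is given, so the
    built-in default applies instead of an empty custom prompt.
    """
    body = {"audio_url": audio_url}  # assumed field name
    if prompt:  # None and "" both mean "use the built-in default"
        body["prompt"] = prompt
    return body


# Baseline run: no prompt key at all.
baseline = build_request("https://example.com/call.mp3")

# Later run: one capability instruction taken from this guide.
custom = build_request(
    "https://example.com/call.mp3",
    prompt=(
        "Preserve all disfluencies exactly as spoken including verbal "
        "hesitations, restarts, and self-corrections."
    ),
)
```

Keeping the baseline request free of a `prompt` key makes it easy to compare later prompted runs against the default behavior.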

How prompting works

Universal-3 Pro is a Speech-augmented Large Language Model (SpeechLLM): a multi-modal LLM with an audio encoder and LLM decoder that processes speech, audio, and text inputs in the same workflow.

Think of SpeechLLM prompting as selecting modes and knobs, not open-ended instruction following. The model is trained primarily to transcribe, then fine-tuned to respond to common transcription instructions for style, speakers, and speech events. It responds best to explicit formatting rules and behavioral instructions (e.g., “include all filler words”, “use periods only for complete sentences”). Domain context like “this is a cardiology appointment” only helps when paired with specific instructions on how to transcribe.

If you know your terms, use keyterms — not the prompt

If you already know the specific names, brands, drug names, acronyms, or jargon that will appear in your audio, use keyterms prompting instead of a free-form prompt. The keyterms_prompt parameter is optimized for term boosting and produces more reliable results than describing the same terms in plain language. Reach for free-form prompts when you want to control style or behavior — not when you want to boost specific words.
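The routing rule above could be sketched as follows. The parameter names `prompt` and `keyterms_prompt` come from this guide; treating `keyterms_prompt` as a list of strings and the helper name `route_options` are assumptions. This guide does not say whether both parameters may be sent in one request, so the usage below keeps them in separate requests.

```python
from typing import Optional, Sequence


def route_options(
    known_terms: Sequence[str] = (),
    style_instruction: Optional[str] = None,
) -> dict:
    """Send known vocabulary to keyterms_prompt and style rules to prompt."""
    options: dict = {}
    if known_terms:
        # De-duplicate while keeping the original order.
        options["keyterms_prompt"] = list(dict.fromkeys(known_terms))
    if style_instruction:
        options["prompt"] = style_instruction
    return options


# Known vocabulary: boost the terms instead of describing them in prose.
term_options = route_options(
    known_terms=["Anktiva", "Ramipril", "Metformin", "Ramipril"]
)

# Style or behavior: use a free-form prompt.
style_options = route_options(
    style_instruction="Use expressive punctuation to reflect emotion and prosody."
)
```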

What prompts can do

| Capability | Description | Reliability |
| --- | --- | --- |
| Verbatim transcription and disfluencies | Include filler words, false starts, repetitions, stutters | High |
| Native code switching | Handle multilingual audio in the same transcript | High |
| Output style and formatting | Control punctuation, capitalization, number formatting | High |
| Context aware clues | Help with jargon, names, and domain expectations | Medium |
| Entity accuracy and spelling | Improve accuracy for proper nouns, brands, technical terms | Medium |

Recommended prompts

The three prompts below are battle-tested and the strongest starting points. Use one as your base and adjust from there rather than starting from scratch.

Best all around (default)

This is also the current built-in default prompt — when you omit the prompt parameter, this is what Universal-3 Pro uses. You don’t need to set it explicitly; it’s shown here so you can build off it.

Transcribe with context and proper nouns preserved, where speech is
present in the audio. Each language as spoken. English as English.
Non-native speakers.

Verbatim with multilingual support

This prompt maximizes speech pattern capture, preserves code-switching, and tells the model to always attempt transcription even on difficult audio. The trade-off is that the model may occasionally hallucinate disfluencies or language switches that don’t exist in the audio.

Required: Preserve the original language(s) and script as spoken,
including code-switching and mixed-language phrases.
Mandatory: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.
Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

Handling unclear audio with [unclear]

This prompt flags uncertain segments rather than forcing the model to guess. It is one of the strongest tools for avoiding hallucinations on unclear audio.

Always: Transcribe speech exactly as heard. If uncertain or audio is
unclear, mark as [unclear]. After the first output, review the transcript
again. Pay close attention to hallucinations, misspellings, or errors,
and revise them like a computer performing spell and grammar checks.
Ensure words and phrases make grammatical sense in sentences.

Result:

  • Hallucinations are materially reduced — the model doesn’t force incorrect guesses on uncertain audio.
  • Uncertain sections are explicitly flagged as [unclear], surfacing exactly where audio quality is insufficient.
  • Clearly audible speech is still preserved.
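Because uncertain segments are flagged inline, the tags can be counted to find files that need manual review. A minimal sketch, assuming the transcript is available as plain text; the helper name and the sample sentence are made up for illustration.

```python
import re

UNCLEAR_TAG = re.compile(r"\[unclear\]", re.IGNORECASE)


def unclear_report(transcript: str) -> dict:
    """Count [unclear] tags and report their share of all tokens."""
    tokens = transcript.split()
    flagged = len(UNCLEAR_TAG.findall(transcript))
    share = flagged / len(tokens) if tokens else 0.0
    return {"flagged": flagged, "tokens": len(tokens), "share": share}


# Invented sample text, not real model output.
sample = "I take Ramipril and [unclear] for the diabetes, [unclear] twice a day."
report = unclear_report(sample)
```

Files with a high share of flagged tokens are reasonable candidates for a listening pass or a better recording.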

Capabilities reference

Each capability is a “knob” you can turn. Each section below shows one audio demo with before/after output and one recommended prompt. Layer capabilities in one at a time so you can measure the impact of each — conflicting instructions degrade output, so keep your prompt focused.
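One way to layer capabilities one at a time is to generate cumulative prompt variants and evaluate each in turn. This is a sketch only: joining instructions with a newline is an assumption, and `layered_prompts` is a hypothetical helper.

```python
from typing import List, Sequence


def layered_prompts(base: str, instructions: Sequence[str]) -> List[str]:
    """Return cumulative variants: base, base + 1 instruction, and so on."""
    variants = [base]
    for instruction in instructions:
        variants.append(f"{variants[-1]}\n{instruction}")
    return variants


variants = layered_prompts(
    "Transcribe with context and proper nouns preserved, where speech is "
    "present in the audio. Each language as spoken. English as English. "
    "Non-native speakers.",
    [
        "Preserve all disfluencies exactly as spoken including verbal "
        "hesitations, restarts, and self-corrections.",
        "Use expressive punctuation to reflect emotion and prosody.",
    ],
)
```

Evaluating `variants[0]`, then `variants[1]`, then `variants[2]` isolates the effect of each added instruction.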

Verbatim transcription and disfluencies

Preserves natural speech patterns including filler words, false starts, repetitions, and self-corrections. Reliability: High.

Without prompt:

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With prompt, the model captures filler words like “uh” and false starts like “we, we, we’re friends”:

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Prompt:

Preserve all disfluencies exactly as spoken including verbal hesitations,
restarts, and self-corrections.
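To measure the effect of a verbatim instruction, a rough count of fillers and immediate repetitions can be compared between the two outputs. A sketch under simple assumptions: the filler list is illustrative, and adjacent identical words are treated as repetitions even when they are intentional, so only the difference between runs is meaningful.

```python
import re

FILLERS = {"uh", "um", "mhm"}  # illustrative list, extend for your audio


def count_disfluencies(text: str) -> dict:
    """Count filler words and immediately repeated words in a transcript."""
    words = re.findall(r"[a-z']+", text.lower())
    fillers = sum(1 for word in words if word in FILLERS)
    repeats = sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)
    return {"fillers": fillers, "repeats": repeats}


without_prompt = (
    "Do you and Quentin still socialize when you come to Los Angeles, or is "
    "it like he's so used to having you here? No, no, no, we're friends. "
    "What do you do with him?"
)
with_prompt = (
    "Do you and Quentin still socialize, uh, when you come to Los Angeles, "
    "or is it like he's so used to having you here? No, no, no, we, we, "
    "we're friends. What do you do with him?"
)

before = count_disfluencies(without_prompt)
after = count_disfluencies(with_prompt)
```

On the two sample outputs, the prompted transcript gains one filler and one repetition relative to the unprompted one.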

Native code switching

Handles audio where speakers switch between languages. Reliability: High.

Transcribe in the original language mix (code-switching), preserving
words in the language they are spoken.

Universal-3 Pro is natively multilingual for English, Spanish, French, German, Italian, and Portuguese. For audio in other languages, set language_detection: true so files are routed to the right model. Without this, unsupported languages may be marked [FOREIGN LANGUAGE].
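The routing advice above could be expressed as a small helper. The `language_detection` parameter and the `[FOREIGN LANGUAGE]` marker come from this guide; the helper names are hypothetical, and enabling detection when the languages are unknown is a design choice of this sketch rather than documented behavior.

```python
from typing import Iterable, Optional

# Natively supported languages listed in this guide, as ISO 639-1 codes.
NATIVE_LANGUAGES = frozenset({"en", "es", "fr", "de", "it", "pt"})
FOREIGN_MARKER = "[FOREIGN LANGUAGE]"


def language_options(expected: Optional[Iterable[str]]) -> dict:
    """Enable language_detection unless every expected language is native."""
    codes = {code.lower() for code in expected or ()}
    if not codes or codes - NATIVE_LANGUAGES:
        return {"language_detection": True}
    return {}


def has_foreign_marker(transcript: str) -> bool:
    """True when the transcript contains the unsupported-language marker."""
    return FOREIGN_MARKER in transcript
```

A transcript for which `has_foreign_marker` returns true is a signal to retry with `language_detection` enabled.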


Output style and formatting

Controls punctuation, capitalization, and readability without changing words. Reliability: High.

Without prompt:

You got called because you were being loud and screaming. No, that's literally what my dispatch said. I don't give a fuck what your dispatch said. They lied. Okay, well, you need to calm down. I don't. Okay, yeah, calm down please. No, I don't. Yes, I'm Jesus Christ's daughter. I'm not doing this tonight with you. I'm not. I'm not. So you need to calm down.

With prompt, the model uses punctuation to reflect the speaker’s emotional state:

You got called because you were being loud and screaming. No, I wasn't. That's literally what my dispatch said. I don't give a fuck what your dispatch said! They lied! Okay, well, you need to calm down. I don't! Okay, yeah, calm down, please. No, I don't! I'm Jesus Christ's daughter! I'm not doing this tonight with you. I'm not. I'm not. So you need to calm down.

Prompt:

Use expressive punctuation to reflect emotion and prosody.

Context aware clues

Helps with jargon, names, and domain expectations from the audio file. Reliability: Medium.

Without prompt:

I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside. Excellent.

With prompt, adding clinical history evaluation as a context clue corrects spelling of “Glicoside” to “Glycoside”:

I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi— glycosi— glycoside. Excellent.

Prompt:

This is a doctor-patient visit. Prioritize accurately transcribing
medications and diseases wherever possible.

Context alone does not tell the model how to transcribe. Pair domain context with a specific instruction. This is a doctor-patient visit is context; prioritize accurately transcribing medications and diseases is the actionable instruction.


Entity accuracy and spelling

Improves accuracy for proper nouns, brands, technical terms, and domain vocabulary. Reliability: Medium. If you already know the exact terms you want boosted, use keyterms prompting instead of describing them in your prompt.

Without prompt:

Watch again closely. This is the potential game changer. The first responder NK cell killing cancer right before your eyes. If you give yourself Entiva, even in healthy volunteers, it dries up your first responders. It dries up your protectors. And that's why I said the power is within us.

With prompt, the model corrects the misrecognition of “Anktiva” (transcribed as “Entiva” without context):

Watch again closely. This is the potential game changer. The first responder NK cell killing cancer right before your eyes. If you give yourself Anktiva, even in healthy volunteers, it dries up your first responders. It dries up your protectors. And that's why I said the power is within us.

Prompt:

Use standard spelling and the most contextually correct spelling of all
words including names, brands, drug names, medical terms, and proper nouns.

Describe the pattern of entities you want corrected, not the specific errors — listing specific spellings often causes the model to hallucinate them. See What to avoid.


What works / what to avoid

What works

| Practice | Why it helps | Example | Impact |
| --- | --- | --- | --- |
| Start with Transcribe… | The model has transcription prompts in its training data, so leading with this focuses it on the task. | “Transcribe this audio” or “Transcribe verbatim” | Massive |
| Use authoritative language | Strong directive keywords get higher compliance than soft language. | Mandatory:, Non-negotiable:, Required:, Always: | Massive |
| Start with fewer instructions, add one at a time | Every added instruction risks conflicting with another. The previous “3–6 instructions” guidance is an upper bound, not a target. Test each addition against your own audio before adding the next. | Add a single capability instruction, evaluate, then add the next. | High |
| Describe the desired output format | Telling the model the pattern to watch for is more reliable than listing specifics. | Pharmaceutical accuracy required across all medications and drug names | High |
| Spell out disfluency behavior explicitly | Enumerated behavior produces more consistent output than a bare directive. | Preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms | High |

What to avoid

| Anti-pattern | Why it hurts | Example | Impact |
| --- | --- | --- | --- |
| Listing explicit errors from your audio | Makes the model over-eager to insert those exact phrases, including in places they don’t belong. Describe the pattern, not the corrections. Use keyterms prompting if you know specific terms. | Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman) | Hallucinations |
| Using negative language | Don't, Avoid, Never, Not are not reliably processed by the model. Phrase instructions positively. | “Don't include filler words” → use “Output complete sentences without disfluencies” | Severe |
| Conflicting instructions | Forces the model to pick one; the outcome becomes non-deterministic. | Include disfluencies. Maximum readability. | Severe |
| Being short or vague | Gives the model no actionable pattern. | Be accurate, Best transcript ever, Superhero human transcriptionist | High |
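Some of the practices and anti-patterns above lend themselves to a quick automated check before a prompt reaches your evaluation set. The following is a heuristic sketch, not an official validator: the word lists are taken from the tables above, while the length threshold and the function name are arbitrary assumptions. A capability instruction meant to be appended to a base prompt will trigger the opener warning, so treat the output as hints.

```python
import re

NEGATIVE_WORDS = re.compile(r"\b(don['’]t|do not|avoid|never|not)\b", re.IGNORECASE)
DIRECTIVES = ("Mandatory:", "Non-negotiable:", "Required:", "Always:")


def lint_prompt(prompt: str) -> list:
    """Return human-readable warnings for common prompt anti-patterns."""
    warnings = []
    negatives = sorted({m.group(0).lower() for m in NEGATIVE_WORDS.finditer(prompt)})
    if negatives:
        warnings.append(f"negative language: {', '.join(negatives)}")
    has_opener = prompt.lstrip().startswith("Transcribe")
    has_directive = any(keyword in prompt for keyword in DIRECTIVES)
    if not (has_opener or has_directive):
        warnings.append("no 'Transcribe' opener or directive keyword")
    if len(prompt.split()) < 5:  # arbitrary threshold for "short or vague"
        warnings.append("very short prompt, likely too vague")
    return warnings


DEFAULT_PROMPT = (
    "Transcribe with context and proper nouns preserved, where speech is "
    "present in the audio. Each language as spoken. English as English. "
    "Non-native speakers."
)

clean = lint_prompt(DEFAULT_PROMPT)
flagged = lint_prompt("Don't include filler words")
```

Here `clean` is empty, while `flagged` reports the negative phrasing, the missing opener, and the short length.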

Evaluating your prompts

A prompt’s effect is specific to your audio: practices that work for one use case don’t transfer reliably to another. Before settling on a prompt, run it against a representative dataset.

The workflow:

  1. Build an evaluation set of at least 25 audio files that reflect the speakers, accents, audio quality, and vocabulary you expect in production. See Evaluate model accuracy for the full methodology.
  2. Transcribe each file with no prompt to establish a baseline.
  3. Try the Best all around and [unclear] recommended prompts and compare.
  4. Layer in one capability instruction at a time and re-measure.
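The workflow above can be sketched as a small comparison loop. Everything here is a placeholder under stated assumptions: `transcribe` stands in for whatever client call you use, `score` for your error metric, and the stub functions exist only so the example runs without network access.

```python
from statistics import mean
from typing import Callable, Dict, Optional, Sequence, Tuple

# (audio reference, human reference transcript) pairs
Dataset = Sequence[Tuple[str, str]]


def compare_prompts(
    dataset: Dataset,
    prompts: Dict[str, Optional[str]],
    transcribe: Callable[[str, Optional[str]], str],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean score per prompt label over the whole dataset."""
    results = {}
    for label, prompt in prompts.items():
        scores = [score(ref, transcribe(audio, prompt)) for audio, ref in dataset]
        results[label] = mean(scores)
    return results


def fake_transcribe(audio: str, prompt: Optional[str]) -> str:
    """Stand-in for a real API call; returns canned text."""
    return "hello world" if prompt is None else "hello [unclear]"


def exact_match_error(reference: str, hypothesis: str) -> float:
    """Toy metric: 0.0 for an exact match, 1.0 otherwise."""
    return 0.0 if reference == hypothesis else 1.0


results = compare_prompts(
    dataset=[("a.wav", "hello world")],
    prompts={"no prompt": None, "unclear": "Always: Transcribe speech exactly as heard."},
    transcribe=fake_transcribe,
    score=exact_match_error,
)
```

Passing `None` as the first prompt keeps the no-prompt baseline in the same table as every prompted run.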

Watch out for misleading WER

Universal-3 Pro frequently outperforms human transcribers. If your word error rate (WER) shows unexpected insertions, listen to the audio at those timestamps before assuming the model is wrong — many “errors” are the model catching audio a human missed. Similarly, substitutions like “offsite” vs. “off site” or “alright” vs. “all right” inflate WER without representing real errors.

Tips:

  • Use the [unclear] tag in your evaluation prompt so the model doesn’t guess where a human transcriber would also miss. This improves WER alignment.
  • Review insertions manually by listening at flagged timestamps.
  • Consider Semantic WER over normalized WER — it won’t penalize formatting-level differences that aren’t real errors.
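To see how formatting-level differences inflate WER, the sketch below computes a word-level WER with and without a simple normalization step. This is not the Semantic WER metric mentioned above, only a rough illustration; the variant list and sample sentences are invented for the example.

```python
import re

# Spelling variants treated as equivalent (illustrative, extend as needed).
VARIANTS = [(r"\boff site\b", "offsite"), (r"\ball right\b", "alright")]


def edit_distance(ref: list, hyp: list) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def normalize(text: str) -> list:
    """Lowercase, merge known variants, and drop punctuation."""
    text = text.lower()
    for pattern, canonical in VARIANTS:
        text = re.sub(pattern, canonical, text)
    return re.findall(r"[a-z0-9']+", text)


def wer(reference: str, hypothesis: str, normalized: bool = True) -> float:
    """Word error rate relative to the reference length."""
    ref = normalize(reference) if normalized else reference.split()
    hyp = normalize(hypothesis) if normalized else hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


reference = "Alright, the meeting is off site."
hypothesis = "All right, the meeting is offsite."
raw = wer(reference, hypothesis, normalized=False)
clean = wer(reference, hypothesis, normalized=True)
```

On this pair the raw score is well above zero even though a reader would call the transcripts identical, while the normalized score is zero.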

Generate a starting prompt with AI

If the recommended prompts above aren’t a fit for your audio, use the generator below to produce a starting prompt. It opens your preferred AI assistant with a pre-loaded brief built from this guide — the capability knobs, the keyterms-vs-prompt routing, the positive-language rule, and the “start with fewer instructions, add one at a time” framing. The output is a starting point, not a final prompt. Test it against your evaluation set using the workflow above before settling on it.



System prompt history

The current default prompt is shown above under Best all around (default). Prior defaults are kept here for changelog transparency.

Always: Transcribe code-switching speech with your best guess based on
context in all possible scenarios where speech is present in the audio.
Languages: English, Spanish, German, French, Portuguese, Italian.
Language codes: en, es, de, fr, pt, it.

Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

Required: Preserve the original language(s) and script as spoken,
including code-switching and mixed-language phrases.
Mandatory: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.
Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

Transcribe this audio

Need help?

Prompting Universal-3 Pro is instructional, not open-ended — use the knobs above and test against your own data. If you’d like help building or optimizing a prompt for your audio, our team can help: open a live chat or email us via the widget in the bottom-right corner (contact info).