Prompting Guide (Async)

Looking for streaming prompting?

Prompting behavior differs between async (pre-recorded) and streaming use cases. This guide covers prompting for async (pre-recorded audio). If you’re working with real-time audio, see the Prompting Guide (Streaming).

Use prompt engineering to control transcription style and improve accuracy for domain-specific terminology. This guide documents best practices for crafting effective prompts for Universal-3 Pro async speech transcription.

Start with no prompt

We strongly recommend testing with no prompt first. When you omit the prompt parameter, Universal-3 Pro automatically applies a built-in default prompt that is already optimized for accuracy across a wide range of audio types.

If you do build a prompt, start from one of the recommended prompts below and tweak it for your use case rather than writing one from scratch.

Remember, prompts are primarily instructional, so adding a large amount of context may not make a significant impact on accuracy and could reduce instruction-following coherence. Feel free to layer in additional instructions from this guide.

How prompting works

Universal-3 Pro is a Speech-augmented Large Language Model (SpeechLLM). The architecture is a multi-modal LLM with an audio encoder and LLM decoder designed to understand and process speech, audio, and text inputs in the same workflow.

SpeechLLM prompting works more like selecting modes and knobs than open-ended instruction following. The model is trained primarily to transcribe, then fine-tuned to respond to common transcription instructions for style, speakers, and speech events.

Prompting is more instructional than contextual — the model responds best to explicit formatting rules and behavioral instructions (e.g., “include all filler words” or “use periods only for complete sentences”). Providing domain context like “this is a cardiology appointment” is most effective when paired with specific instructions telling the model how to transcribe. We are actively working to make the model more contextual in the future. For boosting specific domain terms today, use keyterms prompting.

What prompts can do

Capability | Description | Reliability
Verbatim transcription and disfluencies | Include filler words, false starts, repetitions, stutters | High
Audio event tags | Mark laughter, music, applause, background sounds | Experimental (YMMV)
Labeling crosstalk | Label overlapping speech, interruptions, and crosstalk segments | Experimental (YMMV)
Output style and formatting | Control punctuation, capitalization, number formatting | High
Numbers and measurements | Control how numbers, percentages, and measurements are formatted | Medium
Context aware clues | Help with jargon, names, and domain expectations | Medium
Entity accuracy and spelling | Improve accuracy for proper nouns, brands, technical terms | Medium
Speaker attribution | Mark speaker turns and add labels | Experimental (YMMV)
Native code switching | Handle multilingual audio in same transcript | Medium
Difficult audio handling | Maximize guesses or flag uncertainty on unclear audio | Experimental (YMMV)
PII redaction | Tag and redact personal information like names, addresses, contact info | Experimental (YMMV)

Recommended prompts

The following prompts are our top recommendations for different use cases. Start here before exploring the detailed prompt capabilities below.

Best all around (default)

This is the current default prompt, providing strong accuracy with minimal instructions:

Always: Transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio.

This gives the model clear guidance to always attempt transcription while keeping instructions minimal. It’s a great starting point for most use cases.
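In practice you pass this text via the prompt parameter on the transcript request. A minimal sketch of the payload (field names taken from the full request example later in this guide; build_request is a hypothetical helper):

```python
# Build the async transcription request payload with a custom prompt.
# Field names match the full request example later in this guide.
DEFAULT_PROMPT = (
    "Always: Transcribe speech with your best guess based on context "
    "in all possible scenarios where speech is present in the audio."
)

def build_request(audio_url: str, prompt: str = DEFAULT_PROMPT) -> dict:
    """Payload for POST /v2/transcript."""
    return {
        "audio_url": audio_url,
        "speech_models": ["universal-3-pro"],
        "prompt": prompt,
    }

payload = build_request("https://assembly.ai/sports_injuries.mp3")
```

Note that omitting the prompt key entirely is what triggers the built-in default described above; the sketch simply makes that default explicit.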

Verbatim with multilingual support

If you need maximum verbatim capture and multilingual code-switching support, use this prompt:

Required: Preserve the original language(s) and script as spoken,
including code-switching and mixed-language phrases.
Mandatory: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.
Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

This prompt maximizes speech pattern capture, preserves code-switching, and tells the model to always attempt transcription even on difficult audio. The trade-off is that the model may occasionally hallucinate disfluencies or language switches that don’t exist in the audio.
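Because verbatim capture is the goal, one quick sanity check is to compare filler-word counts between the default and verbatim-prompted outputs on the same audio. A rough sketch with a hypothetical helper (the filler set is illustrative, not exhaustive):

```python
import re

# Single-token fillers; extend this set for your domain.
FILLERS = {"um", "uh"}

def filler_count(transcript: str) -> int:
    """Rough count of filler words captured in a transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return sum(w in FILLERS for w in words)

# The verbatim prompt should yield a noticeably higher count than the
# default output for the same audio.
```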

Handling unclear audio with [masked]

Recommended for reducing hallucinations

This prompt is one of the most effective strategies for avoiding hallucinations on unclear or difficult audio. Instead of forcing the model to guess, it explicitly flags uncertain segments, giving you visibility into areas of uncertainty in the transcript.

Always: Transcribe speech exactly as heard. If uncertain or audio is
unclear, mark as [masked]. After the first output, review the transcript
again. Pay close attention to hallucinations, misspellings, or errors,
and revise them like a computer performing spell and grammar checks.
Ensure words and phrases make grammatical sense in sentences.

You can also use [unclear] instead of [masked]:

Always: Transcribe speech exactly as heard. If uncertain or audio is
unclear, mark as [unclear]. After the first output, review the transcript
again. Pay close attention to hallucinations, misspellings, or errors,
and revise them like a computer performing spell and grammar checks.
Ensure words and phrases make grammatical sense in sentences.

The [masked] tag may also be applied to profanity in the audio. If preserving profanity is important for your use case, use [unclear] instead to avoid profanity being tagged.

This prompt tells the model to never guess on unclear or difficult audio and instead label it explicitly. The result is a transcript where:

  • Hallucinations are materially reduced — the model doesn’t force potentially incorrect guesses on uncertain audio segments.
  • Uncertain sections are explicitly flagged as [masked] or [unclear], giving you transparency into exactly where audio quality was insufficient for confident transcription.
  • Genuine but difficult speech is still preserved — the model transcribes what it can hear clearly while honestly marking what it cannot.

This is especially useful for quality-sensitive workflows where incorrect guesses are worse than gaps, and for building review pipelines where human reviewers can focus on flagged segments.
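For example, a review pipeline can locate the flagged segments with a simple scan. A sketch assuming the literal [masked]/[unclear] tag text from the prompts above:

```python
import re

def flagged_spans(transcript: str, tag: str = "masked"):
    """Return (start, end) character offsets of each [masked]/[unclear] tag."""
    return [m.span() for m in re.finditer(rf"\[{tag}\]", transcript)]

text = "Take it every [masked] with food, then [masked] as needed."
spans = flagged_spans(text)
# Each span points a human reviewer at audio the model could not hear clearly.
```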


System prompts

Current system prompt

The current built-in system prompt used by Universal-3 Pro when no prompt parameter is provided:

Always: Transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio.

This prompt provides the model with clear guidance to always attempt transcription while keeping instructions minimal.

The previous built-in system prompt was:

Required: Preserve the original language(s) and script as spoken,
including code-switching and mixed-language phrases.
Mandatory: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.
Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

Before that, the original built-in system prompt was simply:

Transcribe this audio

Evaluating transcription accuracy

When evaluating Universal-3 Pro output against human-labeled ground truth files, be aware that the model frequently outperforms human transcribers. If your word error rate (WER) evaluation shows unexpected insertions from Universal-3 Pro, listen back to the original audio before assuming the model is wrong. In many cases, the model is correctly transcribing audio that a human transcriber missed or normalized. Similarly, some substitutions are purely semantic or formatting differences (e.g., “offsite” vs. “off site,” “alright” vs. “all right”) that inflate WER without representing meaningful errors.

Tips for accurate evaluation:

  • Use the [unclear] tag in your evaluation prompt to prevent the model from guessing on audio that a human transcriber would also miss. This improves WER alignment.
  • Review insertions manually by listening to the audio at the flagged timestamps. Many apparent errors are actually the model being more accurate.
  • Watch for semantic substitutions — formatting-level differences inflate WER without representing meaningful errors.
  • Consider Semantic WER over traditional normalized WER for a more accurate evaluation. Semantic WER won’t penalize formatting-level substitutions or insertions that are actually correct transcription, giving you a more realistic measure of true transcription quality.
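For reference, traditional WER is the word-level edit distance divided by the reference word count, which is exactly why formatting-level substitutions inflate the score. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / len(ref)

# A pure formatting difference ("off site" vs. "offsite") costs one
# substitution plus one deletion against four reference words:
score = wer("the off site meeting", "the offsite meeting")  # 0.5
```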

In addition to testing with no prompt, we recommend trying the Best all around (default) prompt and the Handling unclear audio with [masked] prompt to find the best fit for your evaluation and use case.


Prompt capabilities

Each capability below acts as a “knob” you can turn. Combine at most 3-6 capabilities for best results. Each section includes a before/after example showing the effect of prompting.

1. Verbatim transcription and disfluencies

What it does: Preserves natural speech patterns including filler words, false starts, repetitions, and self-corrections.

Reliability: High

Without prompt:

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With prompt, the model better captures filler words like “uh” and false starts like “we, we, we’re friends”.

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Example prompts:

Include spoken filler words like "um," "uh," "you know," "like," plus repetitions
and false starts when clearly spoken.
Preserve all disfluencies exactly as spoken including verbal hesitations,
restarts, and self-corrections.
Transcribe verbatim:
- Filler words: yes
- Repetitions: yes
- Stutters: yes
- False starts: yes
- Colloquial: yes

2. Audio event tags

What it does: Marks non-speech sounds like music, laughter, applause, and background noise.

Reliability: Experimental (YMMV)

Without prompt:

Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options.

With prompting, non-speech events like beeps are called out in the transcript.

Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]

Here are some examples of audio tags you can prompt for: [music], [laughter], [applause], [noise], [pause], [inaudible], [sigh], [gasp], [cheering], [sound], [screaming], [bell], [beep], [sound effect], [buzzer], and more.

Example prompts:

Preserve non-speech audio in tags to indicate when the audio occurred.
Tag sounds: [laughter], [silence], [noise], [cough], [sigh].
Include audio event markers for music, laughter, and applause.
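Downstream, the bracketed tags are easy to separate from the spoken words. A sketch assuming the [tag] format shown above (the tag list is illustrative):

```python
import re

# Tag list is illustrative; extend it with the tags you prompt for.
TAG = re.compile(r"\[(laughter|music|applause|noise|beep|sigh)\]")

def split_events(transcript: str):
    """Separate bracketed audio event tags from the spoken words."""
    events = TAG.findall(transcript)
    clean = " ".join(TAG.sub("", transcript).split())
    return clean, events

clean, events = split_events(
    "you may hang up or press 1 for more options. [beep]"
)
```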

3. Labeling crosstalk

What it does: Labels overlapping speech, interruptions, and crosstalk segments in the transcript.

Reliability: Experimental (YMMV)

Without prompt:

I hope you got our card. Okay, nobody talk. We'll just wait for her to talk. Well, we just wanted to— damn it!

With prompt:

I hope you got our card. [CROSSTALK] Okay, nobody talk. We'll just wait for her to talk. Well, we just wanted to— [CROSSTALK] Damn it!

Example prompts:

When multiple speakers talk simultaneously, mark crosstalk segments.
Mark inaudible segments. Preserve overlapping speech and crosstalk.
If unintelligible, write (unclear).

4. Output style and formatting

What it does: Controls punctuation, capitalization, and readability without changing words.

Reliability: High

Without prompt:

You got called because you were being loud and screaming. No, that's literally what my dispatch said. I don't give a fuck what your dispatch said. They lied. Okay, well, you need to calm down. I don't. Okay, yeah, calm down please. No, I don't. Yes, I'm Jesus Christ's daughter. I'm not doing this tonight with you. I'm not. I'm not. So you need to calm down.

With prompt, the model accurately captures the speaker’s emotional state through punctuation, adding exclamation marks during moments of yelling and emphasis.

You got called because you were being loud and screaming. No, I wasn't. That's literally what my dispatch said. I don't give a fuck what your dispatch said! They lied! Okay, well, you need to calm down. I don't! Okay, yeah, calm down, please. No, I don't! I'm Jesus Christ's daughter! I'm not doing this tonight with you. I'm not. I'm not. So you need to calm down.

Example prompts:

Transcribe this audio with beautiful punctuation and formatting.
Use expressive punctuation to reflect emotion and prosody.
Use standard punctuation and sentence breaks for readability.

5. Numbers and measurements

What it does: Controls how numbers, percentages, and measurements are formatted.

Reliability: Medium

Without prompt:

Commission has presented their communication, a hydrogen strategy for climate-neutral Europe, two weeks ago, which includes investments of between €180 billion and €400 billion.

With prompt:

Commission has presented their communication, a hydrogen strategy for climate-neutral Europe, 2 weeks ago, which includes investments of between €180,000,000,000 and €400,000,000,000.

Example prompts:

Convert spoken numbers to digits.
Use digits for numbers, percentages, and measurements.
Format financial figures with standard notation and format numbers for maximum readability.

6. Context aware clues

What it does: Helps with jargon, names, and domain expectations that are known from the audio file.

Reliability: Medium

Without prompt:

I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside. Excellent.

With prompt, adding ‘clinical history evaluation’ as a context clue corrects spelling of ‘Glicoside’ to ‘Glycoside’.

I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi— glycosi— glycoside. Excellent.

Example prompts:

Transcribe this audio. This is a doctor-patient visit, prioritize accurately transcribing medications and diseases wherever possible.
Prioritize transcribing this virtual meeting like a human transcriptionist, paying close attention to names, company names, jargon, acronyms, and other entities.
Transcribe this audio. Context: a technical lecture about GPUs, CUDA, and inference.

Context alone does not tell the model how to transcribe. Providing domain context is most effective when paired with specific instructions.

For example: This is a doctor-patient visit gives the model domain context but no actionable guidance on how to improve the transcript. A more effective prompt would be: This is a doctor-patient visit, prioritize accurately transcribing medications and diseases wherever possible.

The instruction tells the model what to pay attention to when transcribing, while the context tells it what domain to expect.


7. Entity accuracy and spelling

What it does: Improves accuracy for proper nouns, brands, technical terms, and domain vocabulary.

Reliability: Medium

Without prompt:

Watch again closely. This is the potential game changer. The first responder NK cell killing cancer right before your eyes. If you give yourself Entiva, even in healthy volunteers, it dries up your first responders. It dries up your protectors. And that's why I said the power is within us.

With prompt, the model corrects the misrecognition of “Anktiva,” which would otherwise be transcribed as “Entiva”.

Watch again closely. This is the potential game changer. The first responder NK cell killing cancer right before your eyes. If you give yourself Anktiva, even in healthy volunteers, it dries up your first responders. It dries up your protectors. And that's why I said the power is within us.

Example prompts:

Use standard spelling and the most contextually correct spelling of all words
including names, brands, drug names, medical terms, and proper nouns.
Non-negotiable: Pharmaceutical accuracy required across all medications and drug names
Preserve acronyms and capitalization of company names and legal entities

The model works best here when you tell it the pattern of entities to identify and how you want it to handle those entities as it transcribes speech.

Over-instructing the model with specific examples of errors that occur in a file can cause hallucinations when those examples are encountered. We recommend stating the pattern rather than the specific error (e.g., Pharmaceutical accuracy required across all medications and drug names rather than Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman)).


8. Speaker attribution

What it does: Marks speaker turns and adds identifying labels.

Reliability: Experimental (YMMV)

Without prompt:

Speaker A: Five milligrams. And you take it regularly? Good. Every evening. And no side effects with it.

With prompt:

[Speaker:NURSE] 5mg. And do you take it regularly?
[Speaker:PATIENT] Oh yeah, yeah.
[Speaker:NURSE] Good.
[Speaker:PATIENT] I take it every evening.
[Speaker:NURSE] And no side effects with it?

Without prompting, it may appear that one speaker said everything. But with prompting, the model correctly identifies this as 5 separate speaker turns, capturing utterances as short as a single word, like “good”.

Example prompts:

Transcribe the audio verbatim, include speaker change markers.
Tag speaker changes with context like name, role, or gender based on speech content.
Label speakers by role when identifiable (doctor, patient).

Speaker labels can be tagged with names, roles, genders, and more from the audio file. Simply add the desired category for the labels into your prompt.

Speaker attribution generated by the model is separate from the speaker diarization and speaker identification features. We recommend using one or the other, not both.

Speaker diarization and speaker identification are stable, consistent models, whereas speaker attribution via prompting is experimental and may produce inconsistent results, especially across longer files where the model processes audio in chunks. Note that using the word speaker anywhere in your prompt will trigger speaker labels, so avoid it if you do not want this capability activated.

For production use cases requiring consistent speaker labels, use the speaker diarization and speaker identification features. In the future we plan to build the model’s capabilities here natively into these features.
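If you do use prompted attribution, the [Speaker:ROLE] format shown earlier parses into structured turns with a short scan. A sketch assuming that exact tag format:

```python
import re

TURN = re.compile(r"\[Speaker:(\w+)\]\s*(.*)")

def parse_turns(transcript: str):
    """Parse '[Speaker:ROLE] text' lines into (role, utterance) pairs."""
    turns = []
    for line in transcript.splitlines():
        m = TURN.match(line.strip())
        if m:
            turns.append((m.group(1), m.group(2).strip()))
    return turns

sample = """[Speaker:NURSE] 5mg. And do you take it regularly?
[Speaker:PATIENT] Oh yeah, yeah.
[Speaker:NURSE] Good."""
turns = parse_turns(sample)
```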


9. Native code switching

What it does: Handles audio where speakers switch between languages.

Reliability: Medium

Example prompts:

Transcribe in the original language mix (code-switching), preserving words
in the language they are spoken.
Preserve natural code-switching between English and Spanish. Retain spoken language as-is with mixed language words.

If you expect languages beyond those supported by Universal-3 Pro, we recommend setting language_detection: true on your request.

Universal-3 Pro is natively multilingual for English, Spanish, French, German, Italian, and Portuguese. If the model encounters a language outside of these, it may fail to transcribe it accurately and/or mark it as [FOREIGN LANGUAGE].

Using language_detection: true ensures files in other languages are routed to different models which can more reliably transcribe the audio.


10. Difficult audio handling

What it does: Controls how the model handles uncertain or unclear audio segments. You can choose between two opposite strategies: maximizing guesses or flagging uncertainty.

Reliability: Experimental (YMMV)

Strategy 1: Maximize guesses

Tell the model to always attempt a transcription, even when confidence is low:

Always transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio.

This is useful when you want the most complete transcript possible and plan to verify accuracy downstream.

Strategy 2: Flag uncertainty

Tell the model to mark segments it is unsure about instead of guessing:

Transcribe speech with your best guess when speech is heard, mark [unclear] when audio segments are unknown.
If unintelligible, write [unclear]. Mark inaudible segments.

This is useful for quality-sensitive workflows where incorrect guesses are worse than gaps.

Combining both strategies:

You can run the same audio file with both strategies to create a powerful review workflow. The “best guess” transcript gives you the most complete output, while the “flag uncertainty” transcript highlights exactly which segments need human review.

For a more robust approach to handling unclear audio, see the Handling unclear audio with [masked] section in Recommended prompts above. The [masked] strategy provides explicit flagging of uncertain segments and materially reduces hallucinations.


11. PII redaction

What it does: Tags personal identifiable information such as names, addresses, and contact details within the transcript.

Reliability: Experimental (YMMV)

Example prompts:

Tag all personal information as [private] including names, addresses, phone numbers, email addresses, and account numbers.
Mandatory: Identify and tag all personally identifiable information as [private]. This includes full names, physical addresses, contact information, social security numbers, and financial account numbers.

Be specific about which types of PII you want tagged. A vague prompt like redact PII may not give the model enough guidance. Enumerate the categories you care about.

For production PII redaction, we recommend using our dedicated PII Redaction feature, which provides stable and consistent results. PII tagging via prompting is experimental and best suited for exploration or supplementary workflows.
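If you do experiment with prompted tagging, a quick downstream check can count the [private] tags and surface obvious misses. A sketch (the digit-run heuristic is illustrative, not a compliance check):

```python
import re

def redaction_report(transcript: str) -> dict:
    """Count [private] tags and surface digit runs the model may have missed."""
    return {
        "redacted": len(re.findall(r"\[private\]", transcript)),
        # Heuristic: 4+ consecutive digits often indicate unredacted numbers.
        "suspect_digit_runs": re.findall(r"\d{4,}", transcript),
    }

report = redaction_report(
    "My name is [private] and my account number is [private]."
)
```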


Best practices

What helps

Practice | Impact | Example | Why it helps
Start with transcription | Massive | Transcribe this audio, Transcribe verbatim | Model has transcribe prompts in its training data, which focuses the prompt
Authoritative language | Massive | Mandatory:, Non-negotiable:, Required: | Tells the model to pay extra attention to the flagged instruction
3-6 instructions maximum | Massive | Transcribe verbatim. Include all disfluencies. Pay attention to rare words and entities. Preserve natural speech patterns. | Prevents conflicting instructions
Desired output format | High | Pharmaceutical accuracy required across all medications and drug names | Gives the model the domain context and entity types to focus on while transcribing
Explicit disfluency instructions | High | Preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms in the spoken language. | Names the speech patterns and linguistic cues the model should pay extra attention to

What hurts

Anti-pattern | Impact | Example | Why it hurts
Explicit examples of errors from the file | Potential hallucinations | Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman) | Model is over-eager to correct exact phrases in the transcript
Negative language | Severe | Don't, Avoid, Never, Not | Model does not process negative instructions well and gets confused
Conflicting instructions | Severe | Include disfluencies. Maximum readability | Model has to choose which instruction to follow, leading to less deterministic results
Short, vague instructions | High | Be accurate, Best transcript ever, Superhero human transcriptionist | Model doesn't get a concrete pattern to identify, pay attention to, and correct
Missing disfluency instructions | Medium | Transcribe verbatim, Transcribe this audio | Not necessarily a failure, but by default the model will not be expressive with disfluencies unless instructed

Prompting vs. keyterms prompting

Universal-3 Pro supports two methods for improving transcription accuracy: open-ended prompting (the prompt parameter) and keyterms prompting (the keyterms_prompt parameter).

 | Prompting | Keyterms prompting
Parameter | prompt | keyterms_prompt
What it does | Natural language instructions that control transcription style, behavior, and accuracy | A list of specific words or phrases to boost recognition accuracy
Best when | You don’t know the exact terms that will appear, or you want to control transcription style (disfluencies, formatting, code-switching, etc.) | You know specific terms, names, or jargon that will appear in the audio
Example | Prioritize accurately transcribing medications and diseases | ["omeprazole", "metformin", "hypertension"]

The prompt and keyterms_prompt parameters are mutually exclusive at the API level. However, you can include key terms directly within your open prompt as a workaround using the Context: prefix.

We recommend using either prompt OR keyterms_prompt individually, not both together. Combining both can result in overprompting, leading to unpredictable or degraded results. If you do combine them, keep your prompt concise and limit the number of keyterms.

To combine both in a single request, append your keyterms as context within the prompt parameter:

import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "<YOUR_API_KEY>"}

base_prompt = "This is a YouTube video describing common sports injuries."
keyterms = ["Sprained ankle", "ACL tear", "Hamstring strain", "Rotator cuff injury", "Tennis elbow"]

# Append keyterms as context within the prompt
prompt_with_context = f"{base_prompt}\n\nContext: {', '.join(keyterms)}"

data = {
    "audio_url": "https://assembly.ai/sports_injuries.mp3",
    "speech_models": ["universal-3-pro"],
    "language_detection": True,
    "prompt": prompt_with_context
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        print(transcript["text"])
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

Prompt generator

This prompt generator helps you create a starting prompt based on your selected transcription style. Paste a sample of your transcript and select your preferred style to get a customized prompt recommendation.

Click a button to open your preferred AI assistant with your transcript sample and instructions pre-loaded. The AI will generate an optimized prompt based on our prompt engineering best practices.

Prompt library

Browse community-submitted prompts, vote on the ones that work best, and share your own.


Domain-specific sample prompts

Legal transcription

Best for: Court proceedings, depositions, legal hearings

Mandatory: Transcribe legal proceedings with precise terminology intact.
Required: Preserve linguistic speech patterns including disfluencies, filler
words, hesitations, repetitions, stutters, false starts, and colloquialisms in
the spoken language.
Non-negotiable: Distinguish between speakers through clear role-based attribution
(Judge:, Witness:, Counsel: over generic labels).
Label participants by role when identifiable (judge, counsel, witness).

Why it works: Combines authoritative language (Mandatory, Required, Non-negotiable), clear disfluency instructions, speaker attribution guidance, and domain terminology.


Medical transcription

Best for: Clinical documentation, medical dictation, patient-provider conversations

Mandatory: Preserve all clinical terminology exactly as spoken including
drug names, dosages, and diagnostic terms.
Required: Preserve linguistic speech patterns including disfluencies, filler
words, hesitations, repetitions, stutters, false starts, and colloquialisms in
the spoken language.
Label physician and patient speech clearly when identifiable.

Why it works: Combines authoritative language (Mandatory, Required) with clear disfluency instructions, while ensuring clinical terminology accuracy and clear speaker attribution for medical documentation.


Financial/Earnings calls

Best for: Quarterly earnings calls, investor presentations, financial meetings

Mandatory: Transcribe this corporate earnings call with precise financial
terminology.
Required: Preserve linguistic speech patterns including disfluencies, filler
words, hesitations, repetitions, stutters, false starts, and colloquialisms in
the spoken language.
Non-negotiable: Financial term accuracy across all financial terminology,
acronyms, and industry-standard phrases.
Format numerical data with standard notation. Label executives and speakers
by role when identifiable (CEO, CFO, Analyst).

Why it works: Balances financial terminology precision with verbatim capture of speech patterns, listing specific financial terms for domain accuracy.


Software/Technical meetings

Best for: Engineering standups, code reviews, technical discussions

Mandatory: Transcribe this technical meeting with multiple participants.
Required: Preserve linguistic speech patterns including disfluencies, filler
words, hesitations, repetitions, stutters, false starts, and colloquialisms in
the spoken language.
Non-negotiable: Technical terminology accuracy across all software names,
frameworks, and industry acronyms.
Mark transitions between participants explicitly. Capture self-corrections
and restarts from speech.

Why it works: Preserves natural developer speech patterns while listing specific technical terms for domain accuracy.


Code-switching (Bilingual)

Best for: Multilingual conversations, Spanglish, language mixing

Mandatory: Transcribe verbatim, preserving natural code-switching between
English and Spanish.
Required: Retain spoken language as-is without translation. Preserve words
in the language they are spoken.
Non-negotiable: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.
Resolve sound-alike errors using bilingual context for maximum accuracy.

Why it works: Explicitly instructs preservation over translation, handles cross-language disfluencies, and uses pattern-based accuracy guidance for bilingual context.


Customer support call

Best for: Contact center calls, customer service interactions, agent-customer conversations

Context: a customer support call. Prioritize accurately transcribing names,
account details, and balance amounts.
Mandatory: Transcribe any overlapping speech across channels including crosstalk.
Required: Pay attention to proper nouns like names, balance amounts, and bank name
being correct.
Non-negotiable: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.

Why it works: Combines domain context with actionable instructions for entity accuracy, multichannel awareness for overlapping speech, and verbatim speech preservation for quality assurance and compliance review.


How to build your prompt

Step 1: Start with your base need

Choose your primary transcription goal:

Goal | Base instruction
Verbatim/disfluencies | Preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms in the spoken language.
Output style/formatting | Transcribe this audio with beautiful punctuation and formatting.
Context aware clues | Transcribe this audio. Context: [describe the audio content and domain].
Entity accuracy | Use standard spelling and contextually correct spelling of all proper nouns.
Speaker attribution | Mark speaker turns clearly. Tag speaker changes with context like name, role, or gender.
Audio event tags | Preserve non-speech audio in tags to indicate when the audio occurred.
Code-switching | Transcribe in the original language mix, preserving words in the language spoken.
Numbers/measurements | Use digits for numbers, percentages, and measurements.
Difficult audio | If unintelligible, write (unclear). Mark inaudible segments.
PII redaction | Tag all personal information as [private] including names, addresses, phone numbers, email addresses, and account numbers.

Step 2: Add authoritative language

Prefix each instruction with:

  • Non-negotiable:
  • Mandatory:
  • Required:
  • Strict requirement:

Step 3: Add instructions one by one

Layer on instructions one at a time and observe the impact on the transcription output. Since conflicting instructions can cause outputs to degrade, adding instructions incrementally lets you test exactly which ones improve or hurt your transcript.
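One way to manage this is to keep instructions in a list and join them into the final prompt, re-testing after each addition. A hypothetical helper, not part of the API:

```python
AUTHORITATIVE = ("Always", "Mandatory", "Required", "Non-negotiable",
                 "Strict requirement")

def compose_prompt(instructions: list[str], prefix: str = "Required:") -> str:
    """Join instructions one per line, prefixing any that lack an
    authoritative marker. (Lines starting with other colons also get
    prefixed; adjust AUTHORITATIVE for your own markers.)"""
    lines = []
    for line in instructions:
        head = line.split(":")[0]
        lines.append(line if head in AUTHORITATIVE else f"{prefix} {line}")
    return "\n".join(lines)

# Start with one instruction, evaluate the transcript, then add the next.
prompt = compose_prompt([
    "Always: Transcribe speech with your best guess based on context.",
    "Preserve filler words and false starts when clearly spoken.",
])
```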

Step 4: Iterate and test

  1. Identify target terms - What words/phrases are being transcribed incorrectly?
  2. Find the error pattern - Vowel substitution? Sound-alike? Phonetic spelling?
  3. Choose example terms - Pick 2-3 common terms with the SAME error pattern
  4. Test and verify - Listen to the audio to confirm correctness
  5. Measure success rate - Test variations on sample files

Prompt Repair Wizard: If you need help iterating on your prompts, try the Prompt Repair Wizard on the dashboard. Paste your current prompt, describe the issues you’re seeing in the output, and it will suggest improvements based on prompting best practices.


Need help?

Prompt engineering is a new and evolving practice with SpeechLLM models. If you need help crafting a prompt, our engineering team is happy to help: open a live chat or send an email via the widget in the bottom right-hand corner.