Prompting and Keyterms

Start with no prompt
The default prompt outperforms most custom prompts. Omit the prompt parameter first: Universal-3 Pro automatically applies a built-in default that is already optimized for accuracy across a wide range of audio.

If the default isn’t a fit, start from one of the recommended prompts, then test against a representative set of your own audio (we suggest at least 25 files; see Evaluating your prompts). Layer in one additional instruction at a time. Do not start from scratch.

How prompting works

Universal-3 Pro is a Speech-augmented Large Language Model (SpeechLLM): a multi-modal LLM with an audio encoder and LLM decoder that processes speech, audio, and text inputs in the same workflow. Think of SpeechLLM prompting as selecting modes and knobs, not open-ended instruction following. The model is trained primarily to transcribe, then fine-tuned to respond to common transcription instructions for style, speakers, and speech events. It responds best to explicit formatting rules and behavioral instructions (e.g., “include all filler words”, “use periods only for complete sentences”). Domain context like “this is a cardiology appointment” only helps when paired with specific instructions on how to transcribe.

If you know your terms, use keyterms — not the prompt
If you already know the specific names, brands, drug names, acronyms, or jargon that will appear in your audio, use keyterms prompting instead of a free-form prompt. The keyterms_prompt parameter is optimized for term boosting and produces more reliable results than describing the same terms in plain language. Reach for free-form prompts when you want to control style or behavior — not when you want to boost specific words.

What prompts can do

Capability	Description	Reliability
Verbatim transcription and disfluencies	Include filler words, false starts, repetitions, stutters	High
Native code switching	Handle multilingual audio in the same transcript	High
Output style and formatting	Control punctuation, capitalization, number formatting	High
Context aware clues	Help with jargon, names, and domain expectations	Medium
Entity accuracy and spelling	Improve accuracy for proper nouns, brands, technical terms	Medium

Keyterms prompting

Keyterms prompting allows you to provide up to 1,000 words or phrases (maximum 6 words per phrase) using the keyterms_prompt parameter to improve transcription accuracy for those terms and related variations or contextually similar phrases.

Start with no keyterms
We strongly recommend starting with no keyterms_prompt and then adding terms as needed based on important words for your use case that you are consistently seeing the model struggle with.

Including a large number of terms or common terms that are well represented in the training data could lead to overcorrections and hallucinations.

Keyterms vs. the prompt parameter
keyterms_prompt and prompt cannot be used in the same request. Use keyterms_prompt when you already know the specific terms to boost; use a free-form prompt (above) to control style or behavior.
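One way to keep this routing explicit in client code is a small request builder. The sketch below is illustrative only: build_request is a hypothetical helper, the field names are taken from the examples in this guide, and sending the free-form prompt as a top-level prompt field is an assumption based on the parameter name.

```python
from typing import List, Optional


def build_request(audio_url: str,
                  prompt: Optional[str] = None,
                  keyterms: Optional[List[str]] = None) -> dict:
    """Build a /v2/transcript request body (hypothetical helper)."""
    if prompt is not None and keyterms:
        # The guide states these two parameters cannot share a request.
        raise ValueError("Use either prompt or keyterms_prompt, not both")

    data = {
        "audio_url": audio_url,
        "speech_models": ["universal-3-pro", "universal-2"],
        "language_detection": True,
    }
    if keyterms:
        data["keyterms_prompt"] = list(keyterms)
    elif prompt is not None:
        # Assumption: the free-form prompt is sent as a top-level "prompt" field.
        data["prompt"] = prompt
    # With neither argument set, no prompt field is sent and the built-in default applies.
    return data
```

Calling build_request(audio_url) with no other arguments gives the recommended starting point: a request with neither prompt nor keyterms_prompt.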

Universal-3 Pro (Recommended)

Here is an example showing how you can use keyterms prompting to improve transcription accuracy for a name with distinctive spelling and formatting.

Without keyterms prompting:

Hi, this is Kelly Byrne Donahue

With keyterms prompting:

Hi, this is Kelly Byrne-Donoghue

Python

import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "<YOUR_API_KEY>"}

data = {
    "audio_url": "https://assemblyaiassets.com/audios/keyterms_prompting.wav",
    "language_detection": True,
    "speech_models": ["universal-3-pro", "universal-2"],
    "keyterms_prompt": ["Kelly Byrne-Donoghue"]
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

if response.status_code != 200:
    print(f"Error: {response.status_code}, Response: {response.text}")
    response.raise_for_status()

transcript_response = response.json()
transcript_id = transcript_response["id"]
polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        print(transcript["text"])
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

JavaScript

const baseUrl = "https://api.assemblyai.com";
const headers = {
  authorization: "<YOUR_API_KEY>",
};

const data = {
  audio_url: "https://assemblyaiassets.com/audios/keyterms_prompting.wav",
  language_detection: true,
  speech_models: ["universal-3-pro", "universal-2"],
  keyterms_prompt: ["Kelly Byrne-Donoghue"],
};

const url = `${baseUrl}/v2/transcript`;
let res = await fetch(url, {
  method: "POST",
  headers: { ...headers, "Content-Type": "application/json" },
  body: JSON.stringify(data),
});
if (!res.ok) throw new Error(`Error: ${res.status}`);
const response = await res.json();

const transcriptId = response.id;
const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

while (true) {
  res = await fetch(pollingEndpoint, { headers });
  if (!res.ok) throw new Error(`Error: ${res.status}`);
  const transcriptionResult = await res.json();

  if (transcriptionResult.status === "completed") {
    console.log(transcriptionResult.text);
    break;
  } else if (transcriptionResult.status === "error") {
    throw new Error(`Transcription failed: ${transcriptionResult.error}`);
  } else {
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}

Python SDK

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

audio_file = "https://assemblyaiassets.com/audios/keyterms_prompting.wav"

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    keyterms_prompt=["Kelly Byrne-Donoghue"]
)

transcript = aai.Transcriber(config=config).transcribe(audio_file)

print(transcript.text)

JavaScript SDK

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({
  apiKey: "<YOUR_API_KEY>",
});

const audioFile = "https://assemblyaiassets.com/audios/keyterms_prompting.wav";

const params = {
  audio: audioFile,
  speech_models: ["universal-3-pro", "universal-2"],
  language_detection: true,
  keyterms_prompt: ["Kelly Byrne-Donoghue"],
};

const transcript = await client.transcripts.transcribe(params);

console.log(transcript.text);

Keyword count limits
While we support up to 1000 key words and phrases, actual capacity may be lower due to internal tokenization and implementation constraints. Key points to remember:

Each word in a multi-word phrase counts towards the 1000 keyword limit
Capitalization affects capacity (uppercase tokens consume more than lowercase)
Longer words consume more capacity than shorter words

For optimal results, use shorter phrases when possible and be mindful of your total token count when approaching the keyword limit.
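A client-side check can catch the documented limits before a request is sent. This is a rough sketch, assuming whitespace-separated words are a reasonable proxy for the word counts described above; check_keyterms is a hypothetical helper, and it cannot account for the tokenization effects that may lower real capacity.

```python
from typing import List

MAX_TERMS = 1000          # documented upper bound; real capacity may be lower
MAX_WORDS_PER_PHRASE = 6  # documented per-phrase limit


def check_keyterms(terms: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    total_words = 0
    for term in terms:
        words = term.split()
        total_words += len(words)
        if len(words) > MAX_WORDS_PER_PHRASE:
            problems.append(
                f"{term!r} has {len(words)} words (max {MAX_WORDS_PER_PHRASE})"
            )
    # Each word of a multi-word phrase counts toward the overall limit.
    if total_words > MAX_TERMS:
        problems.append(f"{total_words} words in total (max {MAX_TERMS})")
    return problems
```

Passing this check does not guarantee the request is within capacity; it only rules out the limits that can be counted locally.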

Recommended prompts

These three prompts are battle-tested and the strongest starting points. Use one as your base and tweak from there — don’t start from scratch.

Best all around (default)

This is also the current built-in default prompt — when you omit the prompt parameter, this is what Universal-3 Pro uses. You don’t need to set it explicitly; it’s shown here so you can build off it.

Transcribe with context and proper nouns preserved, where speech is
present in the audio. Each language as spoken. English as English.
Non-native speakers.

Verbatim with multilingual support

This prompt maximizes speech pattern capture, preserves code-switching, and tells the model to always attempt transcription even on difficult audio. The trade-off is that the model may occasionally hallucinate disfluencies or language switches that don’t exist in the audio.

Required: Preserve the original language(s) and script as spoken,
including code-switching and mixed-language phrases.

Mandatory: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.

Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

Handling unclear audio with `[unclear]`

This prompt flags uncertain segments rather than forcing the model to guess. It is one of the strongest tools for avoiding hallucinations on unclear audio.

Always: Transcribe speech exactly as heard. If uncertain or audio is
unclear, mark as [unclear]. After the first output, review the transcript
again. Pay close attention to hallucinations, misspellings, or errors,
and revise them like a computer performing spell and grammar checks.
Ensure words and phrases make grammatical sense in sentences.

Result:

Hallucinations are materially reduced — the model doesn’t force incorrect guesses on uncertain audio.
Uncertain sections are explicitly flagged as [unclear], surfacing exactly where audio quality is insufficient.
Clearly audible speech is still preserved.
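Because uncertain segments are flagged with a fixed marker, they are easy to count after the fact. The sketch below assumes the marker appears in the transcript text exactly as [unclear]; unclear_stats is a hypothetical helper, not part of any SDK.

```python
import re

UNCLEAR = re.compile(r"\[unclear\]", re.IGNORECASE)


def unclear_stats(text: str) -> dict:
    """Count [unclear] markers and their share of whitespace-separated tokens."""
    tokens = text.split()
    count = len(UNCLEAR.findall(text))
    ratio = count / len(tokens) if tokens else 0.0
    return {"unclear": count, "tokens": len(tokens), "ratio": ratio}
```

A file with an unusually high ratio is a candidate for a manual listen, since that is where audio quality is most likely insufficient.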

Capabilities reference

Each capability is a “knob” you can turn. Each section below shows one audio demo with before/after output and one recommended prompt. Layer capabilities in one at a time so you can measure the impact of each — conflicting instructions degrade output, so keep your prompt focused.
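Layering one instruction at a time can be made mechanical by generating the cumulative prompt variants up front. This is a minimal sketch; layered_prompts is a hypothetical helper, and joining instructions with a single space is an assumption about formatting rather than a documented requirement.

```python
from typing import List


def layered_prompts(base: str, instructions: List[str]) -> List[str]:
    """Return cumulative prompts: base, base + 1st instruction, base + 1st + 2nd, ..."""
    variants = [base]
    for instruction in instructions:
        # Each variant adds exactly one instruction to the previous one.
        variants.append(f"{variants[-1]} {instruction}")
    return variants


variants = layered_prompts(
    "Transcribe verbatim.",
    [
        "Use expressive punctuation to reflect emotion and prosody.",
        "Preserve all disfluencies exactly as spoken.",
    ],
)
```

Evaluating the variants in order shows which added instruction changed the output, and makes it easier to drop one that hurts.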

Verbatim transcription and disfluencies

Preserves natural speech patterns including filler words, false starts, repetitions, and self-corrections. Reliability: High. Without prompt:

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With prompt, the model captures filler words like “uh” and false starts like “we, we, we’re friends”:

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Preserve all disfluencies exactly as spoken including verbal hesitations,
restarts, and self-corrections.

Native code switching

Handles audio where speakers switch between languages. Reliability: High.

Transcribe in the original language mix (code-switching), preserving
words in the language they are spoken.

Universal-3 Pro is natively multilingual for English, Spanish, French, German, Italian, and Portuguese. For audio in other languages, set language_detection: true so files are routed to the right model. Without this, unsupported languages may be marked [FOREIGN LANGUAGE].

Output style and formatting

Controls punctuation, capitalization, and readability without changing words. Reliability: High. Without prompt:

You got called because you were being loud and screaming. No, that's literally what my dispatch said. I don't give a fuck what your dispatch said. They lied. Okay, well, you need to calm down. I don't. Okay, yeah, calm down please. No, I don't. Yes, I'm Jesus Christ's daughter. I'm not doing this tonight with you. I'm not. I'm not. So you need to calm down.

With prompt, the model uses punctuation to reflect the speaker’s emotional state:

You got called because you were being loud and screaming. No, I wasn't. That's literally what my dispatch said. I don't give a fuck what your dispatch said! They lied! Okay, well, you need to calm down. I don't! Okay, yeah, calm down, please. No, I don't! I'm Jesus Christ's daughter! I'm not doing this tonight with you. I'm not. I'm not. So you need to calm down.

Use expressive punctuation to reflect emotion and prosody.

Context aware clues

Helps with jargon, names, and domain expectations from the audio file. Reliability: Medium. Without prompt:

I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes. Glicoside. Excellent.

With prompt, adding clinical history evaluation as a context clue corrects spelling of “Glicoside” to “Glycoside”:

I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi— glycosi— glycoside. Excellent.

This is a doctor-patient visit. Prioritize accurately transcribing
medications and diseases wherever possible.

Context alone does not tell the model how to transcribe. Pair domain context with a specific instruction. “This is a doctor-patient visit” is context; “Prioritize accurately transcribing medications and diseases” is the actionable instruction.

Entity accuracy and spelling

Improves accuracy for proper nouns, brands, technical terms, and domain vocabulary. Reliability: Medium. If you already know the exact terms you want boosted, use keyterms prompting instead of describing them in your prompt. Without prompt:

Watch again closely. This is the potential game changer. The first responder NK cell killing cancer right before your eyes. If you give yourself Entiva, even in healthy volunteers, it dries up your first responders. It dries up your protectors. And that's why I said the power is within us.

With prompt, the model corrects the misrecognition of “Anktiva” (transcribed as “Entiva” without context):

Watch again closely. This is the potential game changer. The first responder NK cell killing cancer right before your eyes. If you give yourself Anktiva, even in healthy volunteers, it dries up your first responders. It dries up your protectors. And that's why I said the power is within us.

Use standard spelling and the most contextually correct spelling of all
words including names, brands, drug names, medical terms, and proper nouns.

Describe the pattern of entities you want corrected, not the specific errors — listing specific spellings often causes the model to hallucinate them. See What to avoid.

What works / what to avoid

What works

Practice	Why it helps	Example	Impact
Start with `Transcribe…`	The model has transcription prompts in its training data, so leading with this focuses it on the task.	`Transcribe this audio` or `Transcribe verbatim`	Massive
Use authoritative language	Strong directive keywords get higher compliance than soft language.	`Mandatory:`, `Non-negotiable:`, `Required:`, `Always:`	Massive
Start with fewer instructions, add one at a time	Every added instruction risks conflicting with another. The previous “3–6 instructions” guidance is an upper bound, not a target — test each addition against your own audio before adding the next.	Add a single capability instruction, evaluate, then add the next.	High
Describe the desired output format	Telling the model the pattern to watch for is more reliable than listing specifics.	`Pharmaceutical accuracy required across all medications and drug names`	High
Spell out disfluency behavior explicitly	Enumerated behavior produces more consistent output than a bare directive.	`Preserve linguistic speech patterns including disfluencies, filler words, hesitations, repetitions, stutters, false starts, and colloquialisms`	High

What to avoid

Anti-pattern	Why it hurts	Example	Impact
Listing explicit errors from your audio	Makes the model over-eager to insert those exact phrases, including in places they don’t belong. Describe the pattern, not the corrections. Use keyterms prompting if you know specific terms.	`Pharmaceutical accuracy required (omeprazole over omeprizole, metformin over metforman)`	Hallucinations
Using negative language	`Don't`, `Avoid`, `Never`, `Not` are not reliably processed by the model. Phrase instructions positively.	`Don't include filler words` → use `Output complete sentences without disfluencies`	Severe
Conflicting instructions	Forces the model to pick one; the outcome becomes non-deterministic.	`Include disfluencies. Maximum readability.`	Severe
Being short or vague	Gives the model no actionable pattern.	`Be accurate`, `Best transcript ever`, `Superhero human transcriptionist`	High
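The negative-language anti-pattern is simple to screen for before a prompt is sent. The sketch below is a keyword heuristic only, assuming the four words listed in the table plus "do not"; find_negative_wording is a hypothetical helper, and it will also flag harmless uses of "not".

```python
import re
from typing import List

# Words the table above lists as unreliable, plus the spelled-out "do not".
NEGATIVE = re.compile(r"\b(?:don['’]t|do not|avoid|never|not)\b", re.IGNORECASE)


def find_negative_wording(prompt: str) -> List[str]:
    """Return each negative word or phrase found in the prompt, in order."""
    return [match.group(0) for match in NEGATIVE.finditer(prompt)]
```

An empty result does not mean the prompt is good; it only means this one anti-pattern was not detected. Any hit is a cue to rephrase the instruction positively.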

Evaluating your prompts

Prompt performance is specific to your audio: universal best practices don’t transfer reliably across use cases. Before settling on a prompt, run it against a representative dataset. The workflow:

1. Build an evaluation set of at least 25 audio files that reflect the speakers, accents, audio quality, and vocabulary you expect in production. See Evaluate model accuracy for the full methodology.
2. Transcribe each file with no prompt to establish a baseline.
3. Try the Best all around and [unclear] recommended prompts and compare.
4. Layer in one capability instruction at a time and re-measure.
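Once each run has a per-file error score, the comparison against the baseline can be sketched as follows. This assumes you have already computed one score per file (WER or any other error rate, lower is better); change_vs_baseline and the run labels are hypothetical, and the numbers are placeholders rather than measured results.

```python
from statistics import mean
from typing import Dict, List


def change_vs_baseline(scores: Dict[str, List[float]],
                       baseline: str = "no prompt") -> Dict[str, float]:
    """Mean error-rate change of each run relative to the baseline run."""
    base = mean(scores[baseline])
    return {
        name: mean(values) - base
        for name, values in scores.items()
        if name != baseline
    }


# Placeholder scores for two files, not real measurements.
example = change_vs_baseline({
    "no prompt": [0.10, 0.20],
    "[unclear] prompt": [0.05, 0.15],
})
```

A negative value means the run had a lower mean error than the baseline. A mean can hide per-file regressions, so it is worth reading the individual scores as well.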

Watch out for misleading WER
Universal-3 Pro frequently outperforms human transcribers. If your word error rate (WER) shows unexpected insertions, listen to the audio at those timestamps before assuming the model is wrong; many “errors” are the model catching audio a human missed. Similarly, substitutions like “offsite” vs. “off site” or “alright” vs. “all right” inflate WER without representing real errors.

Tips:

Use the [unclear] tag in your evaluation prompt so the model doesn’t guess where a human transcriber would also miss. This improves WER alignment.
Review insertions manually by listening at flagged timestamps.
Consider Semantic WER over normalized WER — it won’t penalize formatting-level differences that aren’t real errors.
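To see how formatting-level differences inflate the score, here is a minimal word-level WER sketch based on plain edit distance. It is not the Semantic WER mentioned above and does no normalization beyond lowercasing; wer is a hypothetical helper, not part of any SDK.

```python
from typing import List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, wer("the meeting is off site", "the meeting is offsite") is 0.4: one substitution and one deletion over five reference words, even though a reader would call the two transcripts equivalent.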

Generate a starting prompt with AI

If the recommended prompts above aren’t a fit for your audio, use the generator below to produce a starting prompt. It opens your preferred AI assistant with a pre-loaded brief built from this guide — the capability knobs, the keyterms-vs-prompt routing, the positive-language rule, and the “start with fewer instructions, add one at a time” framing. The output is a starting point, not a final prompt. Test it against your evaluation set using the workflow above before settling on it.

System prompt history

The current default prompt is shown above under Best all around (default). Prior defaults are kept here for changelog transparency.

Prior system prompt (April 15, 2026 – April 21, 2026)

Always: Transcribe code-switching speech with your best guess based on
context in all possible scenarios where speech is present in the audio.
Languages: English, Spanish, German, French, Portuguese, Italian.
Language codes: en, es, de, fr, pt, it.

Prior system prompt (February 25, 2026 – April 15, 2026)

Always: Transcribe speech with your best guess based on context in all possible scenarios where speech is present in the audio.

Prior system prompt (February 20, 2026 – February 25, 2026)

Required: Preserve the original language(s) and script as spoken,
including code-switching and mixed-language phrases.

Mandatory: Preserve linguistic speech patterns including disfluencies,
filler words, hesitations, repetitions, stutters, false starts, and
colloquialisms in the spoken language.

Always: Transcribe speech with your best guess based on context in all
possible scenarios where speech is present in the audio.

Prior system prompt (before February 20, 2026)

Transcribe this audio

Need help?

Prompting Universal-3 Pro is instructional, not open-ended — use the knobs above and test against your own data. If you’d like help building or optimizing a prompt for your audio, our team can help: open a live chat or email us via the widget in the bottom-right corner (contact info).