Voice Agent API

Prompting guide

Patterns for writing system prompts that improve instruction following, conversationality, and voice output quality.

Set your agent’s system_prompt via session.update. The patterns below are tested against real voice agent conversations and consistently improve quality.
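The exact wire format depends on your integration; as a sketch, assuming a JSON message with a type of session.update carrying a session.system_prompt field (check the API reference for your endpoint's exact field names), setting the prompt might look like:

```python
import json

# Hypothetical message shape for session.update; the field names here
# are assumptions, not the authoritative schema.
def build_session_update(system_prompt: str) -> str:
    """Serialize a session.update message carrying the system prompt."""
    return json.dumps({
        "type": "session.update",
        "session": {"system_prompt": system_prompt},
    })

payload = build_session_update(
    "BE SHORT. Keep every response under two sentences."
)
```

You would send this payload over your existing session connection whenever the prompt changes.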

Copy this page into Claude, ChatGPT, or your preferred LLM and use it as a reference while iterating on your prompt. Having the LLM apply these patterns to your specific use case is the fastest way to get a good system prompt.


Make instructions stick

Front-load your most important rule

Put your most critical instruction at the top and reinforce it. Long prompts dilute attention. If you bury a key rule in the middle, the model deprioritizes it.

BE SHORT. This is the most important rule. Keep every response under two sentences.
You are a customer support agent for Acme Corp...

Use negative instructions with exact phrasings

Listing the exact phrases you don’t want works better than vague positive instructions like “be casual”. The model pattern-matches against concrete strings.

Never say "certainly", "absolutely", "happy to help", or "great question".

Pair bad examples with good ones

Show what you don’t want next to what you do want. The contrast teaches the rule.

When the user describes their project, don't give a feature tour:
Bad: "You could build A, B, or C — what problem are you trying to solve?"
Good: "Yeah, like a receptionist."

Add self-check heuristics

Give the model a check it can run before responding. Abstract rules like “be brief” don’t land. A concrete heuristic does.

If your reply has a comma, ask yourself if it could just stop at the comma.
If your reply is more than 15 words, shorten it.

Match example length to desired output length

If your prompt examples are paragraphs, the model outputs paragraphs. Keep example outputs as terse as the real responses you want.

Iterate against real transcripts

Fix specific failures from real conversations. Quote the failing output and show the correction:

When the user asks "what does this do?", don't say:
"This is a powerful tool that enables you to streamline your workflow
by providing real-time insights and actionable data across multiple
dimensions of your business operations."
Say:
"It transcribes your calls and pulls out the key points."

Speculative rules (“what if the user asks about X”) add noise without improving quality. Iterate against logs, not imagination.


Sound human

Give the agent an identity

Identity statements shape tone better than behavioral lists. Tell the model who it is, not just what to do.

You're a person on a call, not a feature tour. You're not auditioning,
you're having a conversation.

Use permission language

Safety training makes models default to formal, cautious responses. Explicit permissions unlock natural behavior that “be friendly” never will.

Have opinions. You can crack jokes. You can be a little dry when someone
wastes your time. You don't need to hedge everything.

Mirror the user’s length and energy

The model’s default is to talk more than the user. Instruct it to match.

Match the user's length. When they talk in clipped phrases, you do the same.
If they give you one word, reply with a few words. If they go deep, go deep.

Define engagement modes

A single behavioral playbook produces the same response shape regardless of context. Define separate modes so the agent reads the room.

When the user is engaged and asking detailed questions, give thorough answers.
When the user gives short or distracted responses, keep it brief and check in.
When someone is clearly just messing around, you can be playful or cut it short.

Ban bot tells

List the specific phrases that make agents sound like chatbots and ban them.

Never say:
- "Great question!"
- "That's an interesting point."
- "Want me to walk you through that?"
- "What are you building today?"
- "I'd be happy to help with that."

Ground temporal and situational context

Inject session-specific information into the prompt at runtime. When a user asks “what time is it?”, the model should answer from real context rather than hallucinate one.

Current date and time: {datetime_utc}
Your voice: {voice_name}
User's name: {user_name}
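One way to do this is to render the template above server-side before each session starts. A minimal sketch, reusing the placeholder names from the snippet (the function and template names are illustrative):

```python
from datetime import datetime, timezone

# Placeholder names follow the template above.
CONTEXT_TEMPLATE = (
    "Current date and time: {datetime_utc}\n"
    "Your voice: {voice_name}\n"
    "User's name: {user_name}\n"
)

def ground_prompt(base_prompt: str, voice_name: str, user_name: str) -> str:
    """Prepend session-specific context so the model never has to guess it."""
    context = CONTEXT_TEMPLATE.format(
        datetime_utc=datetime.now(timezone.utc).isoformat(timespec="seconds"),
        voice_name=voice_name,
        user_name=user_name,
    )
    return context + "\n" + base_prompt
```

Rebuild the prompt per session, not per deploy, so the timestamp stays accurate.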

Define capabilities explicitly

Without clear boundaries, the model invents capabilities to please users or denies things it can actually do.

List what the agent can and cannot do

Things you CAN do:
- Look up order status (use the get_order tool)
- Answer questions about our pricing and plans
- Switch languages mid-conversation
Things you CANNOT do:
- Process refunds (transfer to a human agent)
- Access the user's account settings
- Look anything up on the internet

Pin verified facts

List every factual claim the model is allowed to make. For anything else, direct users to documentation or a human.

Facts you can state:
- Our API supports 6 languages
- Latency is under 300ms
- Plans start at $0 with pay-as-you-go
Do NOT make up statistics, pricing, or technical specs beyond what's listed above.
If you're unsure, say "I'd want to double-check that, let me point you to our docs."

Optimize for voice output

TTS engines read formatting characters literally. Formatting that works in chat sounds broken when spoken aloud.

Tell the model why formatting rules exist

Give a concrete example of the failure mode so it understands the constraint.

No markdown formatting. If you write **ivy**, the user hears
"asterisk asterisk ivy asterisk asterisk". No bullets, no bold,
no headers. Plain conversational sentences only.
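As a defensive complement to the prompt rule, you can also strip common markdown from model output before it reaches TTS, in case the model slips. A best-effort sketch (not part of the API):

```python
import re

def strip_markdown(text: str) -> str:
    """Best-effort removal of formatting a TTS engine would read aloud."""
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)      # bold
    text = re.sub(r"\*(.+?)\*", r"\1", text)          # italics
    text = re.sub(r"^#+\s*", "", text, flags=re.M)    # headers
    text = re.sub(r"^[-*]\s+", "", text, flags=re.M)  # bullets
    return text
```

Treat this as a safety net, not a substitute for the prompt instruction: stripped bullets still produce list-shaped sentences that sound unnatural spoken aloud.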

Spell out how to read literals

URLs, field names, and code identifiers need explicit substitution rules.

When reading URLs:
- Say "dot" for periods, "slash" for slashes
- Drop the protocol ("assemblyai dot com", not "https colon slash slash")
- Spell out short paths ("slash docs slash getting started")
When reading code or field names:
- Say "underscore" for underscores
- Spell out abbreviations ("API" as "A-P-I")
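If you pre-process tool results or documents before handing them to the agent, the same substitution rules can be applied deterministically in code. A minimal sketch of the URL rules (the function name is illustrative):

```python
import re

def spoken_url(url: str) -> str:
    """Rewrite a URL into the spoken form the rules above describe."""
    # Drop the protocol entirely.
    url = re.sub(r"^https?://", "", url)
    # Read separators out loud.
    return url.replace("/", " slash ").replace(".", " dot ").strip()

spoken_url("https://assemblyai.com/docs")
# "assemblyai dot com slash docs"
```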

Round numbers and times

Voice users don’t need precision. Use natural approximations.

Round times: say "around 2 in the afternoon UTC", not "2:34:17 PM".
Round large numbers: say "about 10 thousand", not "9,847".
Dates: say "April 30th", not "2026-04-30".
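When numbers come from tools rather than the model, you can round them before they reach the prompt. A sketch of the large-number rule above (the helper name is illustrative):

```python
def round_spoken(n: int) -> str:
    """Approximate a large number the way the rounding rules suggest."""
    if n >= 1000:
        thousands = round(n / 1000)
        return f"about {thousands} thousand"
    return str(n)

round_spoken(9847)  # "about 10 thousand"
```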

General prompt hygiene

Write policies, not decision trees

A policy is a general rule the model applies across situations. A decision tree is a set of fragile conditionals the model will misinterpret.

Policy (good):
"If the user shares an email address, read it back to confirm."
Decision tree (bad):
"If the user shares an email, check if it's a gmail address, then
ask if it's their work email, then confirm the spelling of..."

Trim aggressively when adding new sections

Long prompts drown their best rules. Every time you add a new instruction, look for something to remove.

Test by reading aloud

Read the agent’s outputs out loud. Visual scanning misses rhythm problems and unnatural phrasing that users notice immediately on a phone call.


Putting it together

A well-structured voice agent prompt typically follows this order:

  1. Identity and most important rule
  2. Tone and conversational style (permissions, mirroring, bot-tell bans)
  3. Capabilities and facts (what it can and cannot do, pinned facts)
  4. Tool usage instructions (when to call each tool, what to say while waiting)
  5. Voice formatting rules (no markdown, reading literals, rounding)
  6. Engagement modes (how to behave in different conversational contexts)
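If you assemble the prompt programmatically, keeping the sections as ordered data makes the structure above easy to maintain. A sketch with placeholder content (the section strings are stand-ins for your own material):

```python
# Section order follows the recommended structure above; the content
# strings are placeholders, not recommended prompt text.
SECTIONS = [
    ("identity", "BE SHORT. You're a person on a call, not a feature tour."),
    ("tone", "Match the user's length. Never say 'great question'."),
    ("capabilities", "You CAN look up order status. You CANNOT process refunds."),
    ("tools", "Use the order lookup tool; say you're checking while it runs."),
    ("voice", "No markdown. Say 'dot' for periods in URLs. Round large numbers."),
    ("modes", "If the user is distracted, keep it brief and check in."),
]

def assemble_prompt(sections) -> str:
    """Join ordered sections into one system prompt, blank-line separated."""
    return "\n\n".join(body for _, body in sections)
```

Naming each section also makes it easier to trim or A/B test one section at a time while iterating against transcripts.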

Your prompt works together with other session configuration. Use keyterms to boost recognition of domain-specific words, and configure turn detection thresholds to match the conversational pace your prompt encourages.