May 19, 2026

How to create an AI cold-calling agent with the Voice Agent API

Build an AI cold-calling agent that dials prospects, qualifies leads in natural conversation, and books meetings — using the AssemblyAI Voice Agent API for the conversation layer and Twilio for outbound dialing. Includes a TCPA/DNC compliance gate, tool dispatcher, sales system prompt, and forkable Python repo.

Kelsey Foster

Growth

Voice Agent API

AI voice agents

An AI cold-calling agent placed correctly does 500 lead-qualification calls in parallel for the cost of a single SDR. Placed poorly, it sounds like a robocall and gets hung up on in five seconds. The difference between the two isn't the LLM or the TTS — it's the speech accuracy on phone audio, the turn-taking that decides whether the agent interrupts a hesitant prospect, and the compliance layer that keeps you out of TCPA trouble.

This tutorial walks through building an AI cold-calling agent on the AssemblyAI Voice Agent API for the conversation layer, with Twilio for outbound dialing. The Voice Agent API gives you one WebSocket for STT, LLM, TTS, turn detection, and tool calling — you don't wire three providers together. You write the outbound dialer, the compliance gate, and the function dispatcher. The companion repository is linked at the end.

If you're looking for the chained STT + LLM + TTS architecture instead, our original AI cold-calling agent guide covers that path with Universal-3 Pro Streaming directly.

What an AI cold-calling agent does

An AI cold-calling agent is an outbound voice AI system that dials a prospect, delivers a pitch in natural conversation, adapts in real time based on what the prospect says, and books qualified meetings or gathers disposition data. Unlike a robocall (one-way recorded message) or a power dialer with a human rep, it conducts a two-way conversation autonomously.

The use cases where AI cold-calling agents work well today share three traits — high volume, structured pitch, and concrete success criteria (see our outbound calls walkthrough for the simpler "agent dials a single number" pattern):

Outbound SDR prospecting: open with a relevant hook, qualify BANT, book a demo
Appointment setting for field sales, financial advisors, home services
Re-engagement of lapsed leads in a CRM
Survey and research calls at scale
Event follow-up and RSVP confirmation
Renewal and upsell motions for existing customers

The common thread: one script, thousands of conversations, a measurable booking rate or disposition. That's where the Voice Agent API's combination of speech accuracy, tool calling, and flat-rate pricing pays for itself.

Architecture

  CRM / lead list (Salesforce, HubSpot, CSV)
       │
       ▼
  dialer.py
       │  compliance_gate()  ← TCPA, DNC, state laws, time windows
       ▼
  Twilio outbound dial
       │  TwiML → open Media Stream
       ▼
  bridge_server.py
       │  Twilio Media Stream ↔ Voice Agent API WebSocket
       ▼
  AssemblyAI Voice Agent API
   ┌──────────────────────────────────┐
   │  STT + Turn detection             │
   │      ↓                            │
   │  LLM with sales prompt + tools    │
   │      ↓                            │
   │  TTS                              │
   └──────────────────────────────────┘
       │
       │  tool calls
       ▼
  - book_meeting    (calendar API)
  - log_disposition (CRM update)
  - honor_dnc       (suppression list)
  - mark_callback   (scheduling)

The Voice Agent API handles the conversation. Your code handles three things outside the conversation: the dialer (who to call, when, at what concurrency), the compliance gate (TCPA, DNC, state consent), and the tool dispatcher (book a meeting, update the CRM, honor a do-not-call request).

Why use the Voice Agent API for cold-calling

Three things make the Voice Agent API a strong fit for outbound voice agents:

Speech accuracy on phone audio. Cold calls capture emails, phone numbers, company names, and job titles — "five one five, nine eight two, four zero zero zero," "J at acme dot io," "director of rev ops." Universal-3 Pro Streaming (the STT layer under the Voice Agent API) delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That's the difference between a booked meeting in your calendar and a typo you never catch.
Tool calling that maps to the booking moment. When a prospect says "yes, Tuesday at 2pm works," the agent has to fire book_meeting right then, not a turn later. The Voice Agent API returns tool calls as reliable structured output (a tool name plus arguments), which matters when a single missed booking defeats the purpose of the call.
Flat $4.50/hour pricing. Outbound is bursty by nature. You don't want per-token surprises when the dialer fires 500 simultaneous calls. The Voice Agent API's flat hourly rate covers STT, LLM, TTS, and tool calls all-in.

Before you start

You'll need:

An AssemblyAI account with Voice Agent API access
A Twilio account with an outbound-capable phone number (and a verified caller ID if your trial requires it)
A list of leads with consent to be contacted (CSV is fine for testing — production should integrate your real CRM)
Python 3.11+

Install:

pip install fastapi uvicorn "websockets>=14" python-dotenv twilio

Step 1: Build the compliance gate first

Compliance is where AI cold-calling teams burn the most money — TCPA fines run $500–$1,500 per violating call. Build the gate before you write a line of dialer code.

# compliance.py
from datetime import datetime
from zoneinfo import ZoneInfo

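# Internal DNC list: one phone number per line, in the same format as lead["phone"].
with open("suppression.txt") as f:
    DNC_LIST = set(f.read().split())

def on_federal_dnc(phone):
    # Placeholder: wire this to your DNC scrubbing provider. It fails closed
    # until you do, so no call can go out unscrubbed.
    raise NotImplementedError("integrate a federal DNC lookup before dialing")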

def compliance_gate(lead):
    # 1. Internal suppression (previous DNC requests, unsubscribes)
    if lead["phone"] in DNC_LIST:
        return False, "internal DNC"

    # 2. Federal DNC registry — integrate a real provider in production
    if on_federal_dnc(lead["phone"]):
        return False, "federal DNC"

    # 3. Time window — TCPA bans calls before 8am or after 9pm local
    tz_name = lead.get("timezone")
    if not tz_name:
        return False, "missing timezone"  # fail closed rather than guess a zone
    local_tz = ZoneInfo(tz_name)
    local_hour = datetime.now(local_tz).hour
    if local_hour < 8 or local_hour >= 21:
        return False, f"outside TCPA window ({local_hour}:00 local)"

    # 4. Recording consent: these states require all-party consent to record
    #    (partial list; confirm the current set with counsel)
    if lead.get("state") in {"CA", "FL", "PA", "WA", "IL", "MD", "MT", "NH"}:
        # Agent must disclose recording at the top of the call.
        lead["needs_recording_disclosure"] = True

    return True, "ok"

Build this as a hard gate. No call goes out if any check fails.
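One way to make the time-window rule testable is to pass the clock in instead of calling datetime.now() inside the gate. The sketch below assumes a hypothetical helper, within_calling_window, which is not part of the code above; the 8am and 9pm bounds are the ones the gate already uses.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def within_calling_window(tz_name: str, now_utc: datetime,
                          start_hour: int = 8, end_hour: int = 21) -> bool:
    """Return True if now_utc falls in [start_hour, end_hour) in the lead's zone."""
    local = now_utc.astimezone(ZoneInfo(tz_name))
    return start_hour <= local.hour < end_hour

# 14:30 UTC in mid-January is 09:30 in New York and 06:30 in Los Angeles.
when = datetime(2026, 1, 15, 14, 30, tzinfo=timezone.utc)
print(within_calling_window("America/New_York", when))
print(within_calling_window("America/Los_Angeles", when))
```

With the clock injected, you can unit-test the boundary hours for each time zone in your lead list without waiting for them to come around.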

Step 2: Define the agent's tools

Four tools the agent can call mid-conversation. In production, replace the stubs with real CRM, calendar, and DNC API calls. Each tool needs "type": "function" at the top level — the Voice Agent API validates this on session.update.

# tools.py
TOOLS = [
    {
        "type": "function",
        "name": "book_meeting",
        "description": "Book a meeting on the rep's calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "preferred_time": {"type": "string"},
                "email": {"type": "string"},
            },
            "required": ["lead_id", "preferred_time", "email"],
        },
    },
    {
        "type": "function",
        "name": "log_disposition",
        "description": "Record the call outcome in the CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "disposition": {
                    "type": "string",
                    "enum": ["booked", "not_now", "not_interested",
                             "wrong_person", "left_voicemail", "dnc"],
                },
                "notes": {"type": "string"},
            },
            "required": ["lead_id", "disposition"],
        },
    },
    {
        "type": "function",
        "name": "honor_dnc",
        "description": "Add the prospect to the do-not-call list immediately.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "phone": {"type": "string"},
            },
            "required": ["lead_id", "phone"],
        },
    },
    {
        "type": "function",
        "name": "mark_callback",
        "description": "Schedule a callback at the prospect's preferred time.",
        "parameters": {
            "type": "object",
            "properties": {
                "lead_id": {"type": "string"},
                "preferred_time": {"type": "string"},
            },
            "required": ["lead_id", "preferred_time"],
        },
    },
]

The honor_dnc tool is the most important one. If the prospect says anything that sounds like a do-not-call request — "take me off your list," "don't call me again," "remove me" — the agent must call this tool immediately, acknowledge, and end the call politely. No upselling, no "can I just ask one question." TCPA violations on DNC requests are the most expensive mistake a cold-calling agent can make.
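The bridge server later imports dispatch_tool from tools.py, which the listing above does not define. A minimal sketch of what it might look like follows. The handler bodies, the HANDLERS table, and the in-memory SUPPRESSED set are illustrative stand-ins, and accepting arguments as either a dict or a JSON string is an assumption, not documented behavior.

```python
import json

SUPPRESSED = set()  # stand-in for a persistent suppression list

def book_meeting(lead_id, preferred_time, email):
    # Stub: replace with a real calendar API call.
    return {"status": "booked", "lead_id": lead_id, "time": preferred_time}

def honor_dnc(lead_id, phone):
    # Stub: replace with a write to your real DNC store.
    SUPPRESSED.add(phone)
    return {"status": "suppressed", "phone": phone}

HANDLERS = {"book_meeting": book_meeting, "honor_dnc": honor_dnc}

def dispatch_tool(name, arguments):
    """Route a tool call to its handler and always return a JSON-serializable result."""
    if isinstance(arguments, str):
        arguments = json.loads(arguments or "{}")
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**arguments)
    except TypeError as exc:  # missing or unexpected arguments
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error dict instead of raising keeps the call alive: the agent receives the failure as a tool result and can recover in conversation.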

Step 3: Write the system prompt

The system prompt is where the script lives. The example below has eight sections: disclosure, opener, discovery, pitch, CTA, objection map, DNC handling, and style:

# prompts.py
SYSTEM_PROMPT = """You are an AI sales development representative for Datafold.
You are calling {prospect_name}, {prospect_title} at {prospect_company}.

DISCLOSURE (required):
- Open every call by stating: "Hi {first_name}, this is an AI assistant calling
  on behalf of Datafold."
- This is non-negotiable and legally required in CA, FL, TX, and several other states.

OPENER (15 seconds):
- "I'm reaching out because we help data teams catch breaking changes before
  they hit production. Do you have 30 seconds for me to explain why I'm calling?"
- If yes, continue. If no, ask when's better and call mark_callback.

DISCOVERY (ask only 2 questions, max):
1. "How is your team handling data quality today — manual review, dbt tests,
   or something else?"
2. "How often does a broken model make it to production?"

PITCH (one sentence):
- "Datafold gives data teams CI for their pipelines. Customers like Patreon
  and Faire catch 90% of regressions before they ship."

CTA:
- Offer two specific times in the prospect's time zone.
- Call book_meeting with their email when they accept.

OBJECTION MAP:
- "How did you get my number?" → "You opted in on our website last month."
- "Send me an email" → "Happy to. What's the best address?" (call mark_callback)
- "Not the right person" → "Who handles data quality on your team?"
- "We already use [X]" → "Got it. Most of our customers use [X] alongside Datafold."
- "Not interested" → "No problem. Mind if I ask why?" (then call log_disposition)

DNC HANDLING (highest priority):
- If the prospect says ANYTHING like "take me off your list," "don't call me
  again," "remove me," "stop calling": call honor_dnc IMMEDIATELY, say "Of
  course, you're removed from our list. Sorry to bother you. Have a good day,"
  and end the call. Do NOT try to recover the conversation.

STYLE:
- One or two sentences per turn. Conversational, not formal.
- Listen for tone. If they sound annoyed, wrap up gracefully.
- Never claim to be human. If asked, confirm you're AI.
"""

That prompt is the entire sales playbook. The Voice Agent API will follow it turn by turn, calling tools when the conversation hits the right moments.
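The prompt has four placeholders (prospect_name, prospect_title, prospect_company, first_name), and str.format raises KeyError if a lead record lacks any of them. A small validation step before the call goes out can catch that early. This is a sketch: prompt_fields is a hypothetical helper, and deriving first_name from prospect_name is an assumption about your lead data.

```python
REQUIRED_FIELDS = ("prospect_name", "prospect_title", "prospect_company")

def prompt_fields(lead: dict) -> dict:
    """Validate a lead record and return only the fields the prompt needs."""
    missing = [k for k in REQUIRED_FIELDS if not str(lead.get(k) or "").strip()]
    if missing:
        raise ValueError(f"lead is missing prompt fields: {missing}")
    fields = {k: lead[k].strip() for k in REQUIRED_FIELDS}
    # Assumption: fall back to the first word of the full name.
    fields["first_name"] = lead.get("first_name") or fields["prospect_name"].split()[0]
    return fields

template = "Calling {prospect_name}, {prospect_title} at {prospect_company}. Hi {first_name}."
lead = {"prospect_name": "Dana Lee", "prospect_title": "Head of Data",
        "prospect_company": "Acme", "phone": "+15155550100"}
print(template.format(**prompt_fields(lead)))
```

Running this check in the dialer, before the compliance gate passes a lead through, means a bad CRM row is skipped instead of failing once the prospect has already picked up.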

Step 4: Wire up the dialer

The dialer pulls leads from your list, runs each through the compliance gate, and places Twilio calls. It controls concurrency and respects time-of-day rules.

# dialer.py
import asyncio
import csv
import os
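from twilio.rest import Client

from compliance import compliance_gate
from tools import log_disposition  # assumed: your CRM-logging helper lives in tools.py

twilio = Client(os.environ["TWILIO_SID"], os.environ["TWILIO_TOKEN"])

async def dial_lead(lead, callback_url):
    ok, reason = compliance_gate(lead)
    if not ok:
        # "skipped" is a dialer-side outcome, separate from the agent's disposition enum.
        log_disposition(lead["lead_id"], "skipped", notes=reason)
        return

    # The Twilio client is synchronous; run it in a thread so it does not
    # block the event loop while other dials are in flight.
    call = await asyncio.to_thread(
        twilio.calls.create,
        to=lead["phone"],
        from_=os.environ["TWILIO_FROM"],
        url=f"{callback_url}/twilio/voice?lead_id={lead['lead_id']}",
        machine_detection="Enable",  # Twilio adds AnsweredBy to the webhook request
        record=True,                 # Recorded for QA; pair with the recording disclosure
    )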
    print(f"Dialing {lead['lead_id']}: {call.sid}")

async def run_dialer(leads_csv, max_concurrent=10):
    sem = asyncio.Semaphore(max_concurrent)
    with open(leads_csv) as f:
        leads = list(csv.DictReader(f))

    async def with_limit(lead):
        async with sem:
            await dial_lead(lead, os.environ["PUBLIC_URL"])
            await asyncio.sleep(2)  # pace
    await asyncio.gather(*(with_limit(l) for l in leads))

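The machine_detection="Enable" flag turns on Twilio's answering machine detection. Twilio does not hang up on its own: it adds an AnsweredBy parameter to the webhook request, and your /twilio/voice handler should return a hangup for machine results rather than spend a Voice Agent API session on a voicemail greeting. Important: never leave a recorded message, which is a TCPA violation in most contexts.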

Step 5: Bridge Twilio Media Streams to the Voice Agent API

The bridge server is what connects Twilio's outbound call audio to the Voice Agent API WebSocket. Twilio sends G.711 μ-law at 8 kHz; the Voice Agent API accepts it natively when you set the encoding to audio/pcmu.

A few details that are easy to get wrong on this endpoint specifically:

The auth header is Authorization: Bearer YOUR_KEY — note the Bearer prefix. This is unique to the Voice Agent API; the rest of AssemblyAI accepts the raw key.
The first WebSocket message is a session.update event with all config nested under a session object. There is no session.start.
The agent's voice is a named voice from the Voice Agent API catalog (ivy, james, sophie, etc.) — not an ElevenLabs voice ID.
The telephony audio encoding is audio/pcmu (G.711 μ-law). Sample rate is implicit (8 kHz). Don't pass pcm_mulaw or a sample_rate field — the API ignores them.

You must wait for session.ready before sending any input.audio frames.

# bridge_server.py
import asyncio, json, os
import websockets
from fastapi import FastAPI, Query, Request, WebSocket
from fastapi.responses import Response

from prompts import SYSTEM_PROMPT
from tools import TOOLS, dispatch_tool

VOICE_AGENT_WS = "wss://agents.assemblyai.com/v1/ws"
ASSEMBLYAI_KEY = os.environ["ASSEMBLYAI_API_KEY"]

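app = FastAPI()

# lead_id -> lead record (prospect_name, first_name, ...). Assumed to be
# populated from your CRM or leads CSV before the dialer places any calls.
LEAD_CACHE: dict[str, dict] = {}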

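@app.post("/twilio/voice")
async def twilio_voice(request: Request, lead_id: str = Query(...)):
    host = request.url.hostname
    # Twilio's <Stream> url does not support query-string parameters, so the
    # lead ID travels as a custom <Parameter> and arrives in the "start" event.
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://{host}/media-stream">
      <Parameter name="lead_id" value="{lead_id}" />
    </Stream>
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.websocket("/media-stream")
async def media_stream(twilio_ws: WebSocket):
    await twilio_ws.accept()
    stream_sid = {"value": None}
    lead_id = None

    # Wait for Twilio's "start" event, which carries the stream SID and the
    # custom parameters set in the TwiML above.
    async for raw in twilio_ws.iter_text():
        event = json.loads(raw)
        if event.get("event") == "start":
            stream_sid["value"] = event["start"]["streamSid"]
            lead_id = event["start"]["customParameters"]["lead_id"]
            break
    if lead_id is None:
        return  # Twilio disconnected before the stream started
    lead = LEAD_CACHE[lead_id]

    session_config = {
        "type": "session.update",
        "session": {
            "system_prompt": SYSTEM_PROMPT.format(**lead),
            "tools": TOOLS,
            "input": {"format": {"encoding": "audio/pcmu"}},
            "output": {
                "voice": "ivy",
                "format": {"encoding": "audio/pcmu"},
            },
        },
    }

    async with websockets.connect(
        VOICE_AGENT_WS,
        additional_headers={"Authorization": f"Bearer {ASSEMBLYAI_KEY}"},
    ) as va_ws:
        await va_ws.send(json.dumps(session_config))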

        ready = asyncio.Event()
        pending_tools = []

        async def pump_twilio_to_va():
            async for raw in twilio_ws.iter_text():
                event = json.loads(raw)
                kind = event.get("event")
                if kind == "start":
                    stream_sid["value"] = event["start"]["streamSid"]
                elif kind == "media":
                    if not ready.is_set():
                        continue
                    # Twilio sends base64 mulaw; AAI accepts it directly.
                    await va_ws.send(json.dumps({
                        "type": "input.audio",
                        "audio": event["media"]["payload"],
                    }))
                elif kind == "stop":
                    return

        async def pump_va_to_twilio():
            async for raw in va_ws:
                event = json.loads(raw)
                t = event.get("type")

                if t == "session.ready":
                    ready.set()

                elif t == "reply.audio" and stream_sid["value"]:
                    await twilio_ws.send_text(json.dumps({
                        "event": "media",
                        "streamSid": stream_sid["value"],
                        "media": {"payload": event["data"]},
                    }))

                elif t == "tool.call":
                    result = dispatch_tool(event["name"], event.get("arguments", {}))
                    pending_tools.append({"call_id": event["call_id"], "result": result})

                elif t == "reply.done":
                    if event.get("status") == "interrupted":
                        pending_tools.clear()
                    else:
                        for tool in pending_tools:
                            value = tool["result"]
                            if not isinstance(value, str):
                                value = json.dumps(value)
                            await va_ws.send(json.dumps({
                                "type": "tool.result",
                                "call_id": tool["call_id"],
                                "result": value,
                            }))
                        pending_tools.clear()

                elif t == "transcript.user":
                    print(f"[{lead_id}] User: {event['text']}")
                elif t == "transcript.agent":
                    print(f"[{lead_id}] Agent: {event['text']}")

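        # Run both pumps and stop as soon as either side closes, so a hung-up
        # call does not leave the Voice Agent API session open.
        tasks = [
            asyncio.create_task(pump_twilio_to_va()),
            asyncio.create_task(pump_va_to_twilio()),
        ]
        _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()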

Two subtleties worth understanding:

Tool result timing. Per the tool calling docs, accumulate tool results when tool.call fires and send them inside reply.done — not immediately. The agent speaks a transition phrase ("let me check") while the tools run; sending too early causes timing issues.
Audio pass-through. Twilio's media.payload and AssemblyAI's input.audio.audio (and reply.audio.data) are all base64-encoded μ-law strings, so the bridge moves bytes through without any decode/re-encode step.
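The tool-result timing rule is easier to test when the buffering lives in its own small object rather than inside the WebSocket loop. The sketch below is one possible factoring. PendingToolResults is a hypothetical class, and the message fields (type, call_id, result) mirror the ones used in bridge_server.py.

```python
import json

class PendingToolResults:
    """Hold tool results until reply.done, then release or discard them."""

    def __init__(self):
        self._items = []

    def add(self, call_id, result):
        # Results are sent as strings, so serialize anything else to JSON.
        if not isinstance(result, str):
            result = json.dumps(result)
        self._items.append(
            {"type": "tool.result", "call_id": call_id, "result": result}
        )

    def flush(self, status):
        """Return the messages to send now; drop them if the reply was interrupted."""
        items, self._items = self._items, []
        return [] if status == "interrupted" else items

buffer = PendingToolResults()
buffer.add("call_1", {"status": "booked"})
for message in buffer.flush(status=None):
    print(json.dumps(message))
```

In the bridge, the tool.call branch would call add and the reply.done branch would send each message returned by flush(event.get("status")).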

Compliance: the part most teams underweight

Three things separate a working AI cold-calling agent from a $50,000 TCPA settlement:

Scrub against the federal DNC registry before every call. Integrate a real provider, or subscribe to the FTC's National Do Not Call Registry data directly: telemarketing.donotcall.gov offers paid access for telemarketers.
Honor state DNC lists. Several states maintain their own — California, Pennsylvania, Indiana, Tennessee. Your scrub vendor should cover these.
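Two-party consent disclosure. In CA, FL, PA, WA, and several other states, you must disclose at the top of the call that the call is being recorded and that the caller is AI. The system prompt's DISCLOSURE section handles the AI disclosure, so never remove it. It does not mention recording, so add a recording notice to the opener whenever compliance_gate sets needs_recording_disclosure.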

Build all three as hard gates. If any check fails, the call doesn't go out. Log every disposition with a timestamp so you can prove compliance during an audit.
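For the audit trail, an append-only log with one JSON record per compliance decision is usually enough to start. The sketch below assumes a local JSONL file and hypothetical field names (ts, lead_id, check, passed, reason); a production system would more likely write to a database or a log pipeline.

```python
import json
from datetime import datetime, timezone

def audit_record(lead_id, check, passed, reason, now=None):
    """Build one timestamped, JSON-encoded audit line for a compliance decision."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "ts": now.isoformat(),  # UTC timestamp, ISO 8601
        "lead_id": lead_id,
        "check": check,
        "passed": passed,
        "reason": reason,
    }, sort_keys=True)

def append_audit(path, line):
    """Append a record; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")

fixed = datetime(2026, 5, 19, 14, 0, tzinfo=timezone.utc)
print(audit_record("L1", "time_window", False, "outside TCPA window", now=fixed))
```

Logging the skipped calls as well as the placed ones is what lets you show, per lead, that the gate ran before the dial.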

Measuring success

Three numbers tell you whether your AI cold-calling agent is working (see our broader AI voice agents guide for context on conversion metrics across use cases):

Connection rate: percentage of calls that reach a live human. Healthy: 30–50% with a local-presence dialer.
Conversation rate: percentage of connected calls that last more than 30 seconds. Healthy: 25–40%.
Book rate: percentage of conversations that end in a booked meeting. Healthy: 5–15% for warm/intent leads, 1–3% for cold lists.
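The three rates above can be computed from a list of call records. This is a sketch under assumptions: the record fields (answered_by, duration_s, disposition) are hypothetical names, and a "connected" call is taken to mean one Twilio reported as answered by a human.

```python
def funnel_metrics(calls: list[dict]) -> dict:
    """Compute connection, conversation, and book rates from call records."""
    def rate(part: int, whole: int) -> float:
        return part / whole if whole else 0.0

    connected = [c for c in calls if c["answered_by"] == "human"]
    conversations = [c for c in connected if c["duration_s"] > 30]
    booked = [c for c in conversations if c["disposition"] == "booked"]
    return {
        "connection_rate": rate(len(connected), len(calls)),
        "conversation_rate": rate(len(conversations), len(connected)),
        "book_rate": rate(len(booked), len(conversations)),
    }

sample = [
    {"answered_by": "human", "duration_s": 45, "disposition": "booked"},
    {"answered_by": "human", "duration_s": 60, "disposition": "not_now"},
    {"answered_by": "human", "duration_s": 10, "disposition": "not_interested"},
    {"answered_by": "human", "duration_s": 20, "disposition": "wrong_person"},
] + [{"answered_by": "machine_start", "duration_s": 0, "disposition": None}] * 6

print(funnel_metrics(sample))
```

Each rate uses the previous stage as its denominator, matching the definitions above, so a low book rate is not masked by a low connection rate.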

Read every transcript for the first 500 calls. You'll catch prompt failures, silently wrong transcriptions on company names, and tool-call timing issues that you'd never notice listening to the audio.

The complete repository

Fork the runnable repo at github.com/kelsey-aai/cold-calling-voice-agent-api. It includes the dialer, the compliance gate, the bridge server, the tool dispatcher, the system prompt, and a sample leads.csv. Around 400 lines of Python total.

Frequently asked questions

How do I create an AI cold-calling agent with the Voice Agent API?

To create an AI cold-calling agent with the AssemblyAI Voice Agent API, build four pieces: a dialer that pulls leads from your CRM and places outbound Twilio calls, a compliance gate that scrubs against DNC registries and TCPA time windows, a bridge server that connects Twilio Media Streams to the Voice Agent API WebSocket at wss://agents.assemblyai.com/v1/ws, and a tool dispatcher with book_meeting, log_disposition, honor_dnc, and mark_callback. Define a sales-specific system prompt with disclosure, opener, discovery, pitch, CTA, objection map, and DNC handling rules. The Voice Agent API handles the conversation — your code handles dialing, compliance, and integrations.

Is AI cold-calling legal?

AI cold-calling is legal in most U.S. jurisdictions if you comply with TCPA (federal), state-level consent laws, and disclose that the caller is AI. Specifically: scrub against the federal DNC registry before every call, respect TCPA calling windows (no calls before 8am or after 9pm in the recipient's local time), get two-party consent for recording in states that require it (CA, FL, PA, WA, and others), and disclose AI identity at the top of the call. The cost of getting this wrong is steep — $500–$1,500 per violating call. Build the compliance gate as a hard barrier and consult legal counsel before scaling.

How much does it cost to run an AI cold-calling agent?

On the AssemblyAI Voice Agent API, you pay $4.50/hour of session time — STT, LLM, TTS, turn detection, and tool calls included. Twilio outbound voice adds a few cents per minute. A typical 90-second qualification call costs roughly $0.12–$0.18 all-in. At the typical 30–50% connection rate, the cost per actual conversation is closer to $0.30. Compare against a human SDR at fully-loaded $70–100/hour and the unit economics generally favor the agent for high-volume top-of-funnel motions.
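The arithmetic behind those figures can be sketched as follows. The $4.50/hour rate comes from the text above. The $0.02/minute telephony rate is an illustrative assumption (check your Twilio pricing), and cost_per_conversation is a deliberately pessimistic upper bound that bills every dial as a full-length call.

```python
def call_cost(seconds: float, agent_per_hour: float = 4.50,
              telephony_per_min: float = 0.02) -> float:
    """Cost of one call: Voice Agent API session time plus telephony minutes."""
    minutes = seconds / 60
    return minutes * (agent_per_hour / 60) + minutes * telephony_per_min

def cost_per_conversation(seconds: float, connection_rate: float) -> float:
    """Upper bound: spread the cost of all dials over the share that connect."""
    return call_cost(seconds) / connection_rate

print(f"90-second call: ${call_cost(90):.4f}")
print(f"per conversation at 50% connection: ${cost_per_conversation(90, 0.5):.4f}")
```

Under these assumptions a 90-second call costs about $0.14, and the upper bound at a 50% connection rate is a little under $0.29, in the same neighborhood as the figure quoted above. Unanswered dials are much shorter than 90 seconds in practice, so the real number should be lower than the bound.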

What speech-to-text accuracy do I need for cold-calling?

The accuracy that matters for cold-calling is alphanumeric accuracy on phone audio — capturing emails, phone numbers, company names, and job titles correctly the first time. Universal-3 Pro Streaming, which is the STT layer under the Voice Agent API, delivers 21% fewer alphanumeric errors and 28% better accuracy on consecutive numbers than the previous generation. That accuracy is the difference between booking a meeting in the rep's calendar (alex@acme.io) and a typo your CRM never catches (alec@akme.io).

Can the Voice Agent API place outbound calls directly?

Today, you use Twilio (or another telephony provider) for the outbound dial, and bridge the resulting Media Stream into the Voice Agent API WebSocket. The Voice Agent API handles the conversation; Twilio handles the PSTN connection and the audio transport. Native outbound dialing through the Voice Agent API is on the roadmap — the bridge pattern in this tutorial is the standard path today, and the code in the companion repo handles it cleanly in about 100 lines.