Build an AI Voice Agent with DeepSeek R1, AssemblyAI, and ElevenLabs
Learn how to build a real-time AI voice agent using AssemblyAI, DeepSeek R1 via Ollama, and ElevenLabs. With AI voice agents handling an increasing share of customer interactions, this guide walks you through setting up transcription, AI-powered responses, and voice synthesis for seamless automation.



Introduction
AI voice agents are revolutionizing how businesses and developers interact with users, enabling real-time conversations with intelligent and human-like responses. According to a16z, financial services alone make up 25% of total global contact center spending and account for over $100 billion annually in business process outsourcing (BPO). With AI voice agents now surpassing human performance in certain tasks, businesses are rapidly adopting automated voice interactions to improve efficiency and customer experience.
In this tutorial, you'll learn how to build an AI voice agent that transcribes speech in real time using AssemblyAI, generates intelligent responses using DeepSeek R1 (7B model) via Ollama, and converts those responses into speech with ElevenLabs. The entire process happens in real time, allowing for smooth and interactive voice-based applications.
By the end of this tutorial, you'll have a fully functional AI voice agent that can transcribe, process, and respond to spoken queries instantly. If you prefer watching the video, check it out below:
Key Components of the AI Voice Agent
The AI voice agent consists of three major components:
- Real-time speech-to-text: AssemblyAI’s Streaming Speech-to-Text API converts spoken words into text with high accuracy.
- AI-powered responses: DeepSeek R1 (7B model) via Ollama generates context-aware responses in real time.
- Text-to-speech synthesis: ElevenLabs converts the AI-generated text response into natural-sounding audio.
In this step-by-step guide, we’ll install the necessary dependencies, configure the AI agent, and run the complete system.
Step 1: Install Required Dependencies
Before you begin, ensure you have all the required dependencies installed. The GitHub repository contains all the assets and code required for this project.
1.1 Get API Keys
To use AssemblyAI and ElevenLabs, sign up for their API keys:
- AssemblyAI (Speech-to-Text): Sign up for a free API key
- ElevenLabs (Text-to-Speech): Sign up for an account
1.2 Install Ollama
DeepSeek R1 is accessed via Ollama. Install Ollama by following the instructions here.
1.3 Install PortAudio (Required for real-time transcription)
For Debian/Ubuntu:
apt install portaudio19-dev
For macOS:
brew install portaudio
1.4 Install Python Libraries
Run the following command to install the required Python dependencies:
pip install "assemblyai[extras]" ollama elevenlabs
1.5 (macOS Only) Install MPV for Audio Streaming
brew install mpv
Step 2: Download the DeepSeek R1 Model
Since this script uses DeepSeek R1 via Ollama, download the model locally by running:
ollama pull deepseek-r1:7b
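The 7B model is a download of several gigabytes, so this step can take a while. Once it finishes, you can confirm the model is available locally with:
ollama list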
Step 3: Implement the AI Voice Agent
Create a new Python file called AIVoiceAgent.py and add the following code.
3.1 Import Required Libraries
import assemblyai as aai
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import ollama
The first step is to import the necessary libraries. AssemblyAI is used for real-time speech-to-text transcription, Ollama allows us to interact with DeepSeek R1, and ElevenLabs is responsible for converting AI-generated text into speech. These tools together form the core of our AI voice agent.
3.2 Define the AI Voice Agent Class
Next, we define the AIVoiceAgent class, which initializes our API keys and sets up the conversation state.
class AIVoiceAgent:
    def __init__(self):
        aai.settings.api_key = "ASSEMBLYAI_API_KEY"
        self.client = ElevenLabs(api_key="ELEVENLABS_API_KEY")

        self.transcriber = None

        self.full_transcript = [
            {"role": "system", "content": "You are a language model called R1 created by DeepSeek. Answer in less than 300 characters."}
        ]
This class stores the API keys for AssemblyAI and ElevenLabs (replace the placeholder strings with your own keys) and initializes full_transcript, which keeps track of the conversation history. The system message defines the behavior of the AI model, ensuring that responses stay concise and relevant.
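Hard-coding the keys is fine for a quick experiment, but a slightly safer variant (assuming you export ASSEMBLYAI_API_KEY and ELEVENLABS_API_KEY as environment variables in your shell) reads them at runtime instead:
import os  # place this alongside the other imports

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
self.client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])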
3.3 Set Up Real-Time Transcription with AssemblyAI
We then define the function that handles streaming real-time transcription.
def start_transcription(self):
    print("\nReal-time transcription:", end="\r\n")
    self.transcriber = aai.RealtimeTranscriber(
        sample_rate=16000,
        on_data=self.on_data,
        on_error=self.on_error,
        on_open=self.on_open,
        on_close=self.on_close,
    )
    self.transcriber.connect()

    microphone_stream = aai.extras.MicrophoneStream(sample_rate=16000)
    self.transcriber.stream(microphone_stream)
This method starts the real-time transcription process, capturing audio from the microphone and streaming it to AssemblyAI for conversion into text. When new transcriptions are available, the on_data method is triggered.
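The transcriber also references on_open, on_error, and on_close callbacks, plus a stop_transcription helper used later when generating a response, none of which appear in the snippet above. A minimal sketch of these methods (the callback types follow AssemblyAI's streaming SDK; the bodies are only illustrative) could look like this:
def on_open(self, session_opened: aai.RealtimeSessionOpened):
    # Called once the streaming session is established
    print("Session ID:", session_opened.session_id)

def on_error(self, error: aai.RealtimeError):
    # Called if the streaming connection reports an error
    print("An error occurred:", error)

def on_close(self):
    # Called when the streaming session ends
    pass

def stop_transcription(self):
    # Close the current streaming session before generating a response
    if self.transcriber:
        self.transcriber.close()
        self.transcriber = None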
3.4 Handling Real-Time Transcription Events
Once the transcription starts, we need to process the incoming text. The on_data function ensures that text is handled properly, either displaying partial transcripts or sending final transcripts for processing.
def on_data(self, transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return

    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(transcript.text)
        self.generate_ai_response(transcript)
    else:
        print(transcript.text, end="\r")
If the transcript is a partial result, it is displayed immediately. If the transcript is final (indicating a user has finished speaking), it is sent to the AI model for response generation.
3.5 Generate AI Responses with DeepSeek R1
Once transcription is complete, the system sends the conversation history to DeepSeek R1 via Ollama to generate a response.
def generate_ai_response(self, transcript):
    self.stop_transcription()

    self.full_transcript.append({"role": "user", "content": transcript.text})
    print(f"\nUser: {transcript.text}", end="\r\n")

    ollama_stream = ollama.chat(
        model="deepseek-r1:7b",
        messages=self.full_transcript,
        stream=True,
    )

    print("DeepSeek R1:", end="\r\n")
    response_text = "".join(chunk['message']['content'] for chunk in ollama_stream)

    self.speak_response(response_text)

    self.full_transcript.append({"role": "assistant", "content": response_text})
    self.start_transcription()
This method pauses transcription, sends the full conversation history to DeepSeek R1, and collects the streamed response. The complete response is then converted to speech with ElevenLabs via the speak_response helper, and transcription resumes so the user can ask a follow-up question.
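The speak_response helper is referenced above but not yet defined. Below is a minimal sketch using the ElevenLabs client and the stream function imported earlier. Note the assumptions: it relies on an elevenlabs SDK version whose client exposes a generate method, the eleven_turbo_v2 model name is a placeholder for whichever model your account has access to, and the <think>-tag stripping reflects the fact that DeepSeek R1 usually emits its chain of thought inside <think>...</think> tags, which you probably don't want read aloud.
import re  # place this alongside the other imports at the top of the file

def speak_response(self, text):
    # Remove DeepSeek R1's <think>...</think> reasoning block so only the
    # final answer is spoken (assumption about the model's output format)
    spoken_text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    # Generate audio with ElevenLabs and play it as it streams in
    # (the stream() helper plays audio through mpv, hence the earlier install step)
    audio_stream = self.client.generate(
        text=spoken_text,
        model="eleven_turbo_v2",  # assumed model name; swap for one on your plan
        stream=True,
    )
    stream(audio_stream)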
Step 4: Running the AI Voice Agent
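For the command below to work, AIVoiceAgent.py needs an entry point at the bottom of the file that creates the agent and starts listening, for example:
if __name__ == "__main__":
    ai_voice_agent = AIVoiceAgent()
    ai_voice_agent.start_transcription()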
Once all dependencies are installed and the model is downloaded, simply run:
python AIVoiceAgent.py
Conclusion
By integrating AssemblyAI, DeepSeek R1, and ElevenLabs, this tutorial demonstrates how to build a powerful AI voice agent capable of handling real-time conversations with intelligent, human-like responses.
🔗 Learn more about AssemblyAI's real-time transcription API: Check the Docs