Build a Real-Time AI Voice Bot Using Python, AssemblyAI, and ElevenLabs

Learn how to build a real-time AI voice bot using Python, AssemblyAI, OpenAI, and ElevenLabs for transcription, intelligent responses, and voice synthesis.

AI voice bots are rapidly transforming how businesses handle customer interactions, with recent studies estimating that by 2025, 95% of customer service interactions will be managed by AI agents. For developers, this presents a significant opportunity to build intelligent, scalable solutions that improve the efficiency and user experience of customer interactions. Voice-based AI technology is growing at a rate of 19.8% annually, driven by the demand for real-time, automated support across industries.

This written tutorial will guide you through the process of building an AI-powered dental assistant in Python, using AssemblyAI for speech-to-text, OpenAI for generating responses, and ElevenLabs for voice synthesis. If you prefer watching the video, check it out below:

Key Components of the AI Voice Bot

There are three major components of an AI voice bot:

  1. Streaming Transcription: AssemblyAI's Speech-to-Text API enables real-time transcription with high accuracy.
  2. Natural Language Processing (NLP): OpenAI's language models generate intelligent, context-aware responses.
  3. Voice Synthesis: ElevenLabs synthesizes text responses into natural-sounding audio, completing the conversational loop.

The steps below demonstrate how to build the AI voice bot, including code snippets and an overview of how each component interacts to form a cohesive voice bot.

Step 1: Install Required Python Libraries

To begin, run the following commands in your terminal (the first uses Homebrew on macOS) to install the necessary libraries:

brew install portaudio mpv
pip install "assemblyai[extras]" elevenlabs openai

These libraries will power the core functionalities of the AI voice bot: streaming transcription, response generation, and speech synthesis.
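
If you want to confirm the installation before writing any code, a quick version check like the one below can help. This is just a convenience sketch and isn't required for the rest of the tutorial:

from importlib.metadata import version

# Print the installed version of each SDK used in this tutorial.
for package in ("assemblyai", "elevenlabs", "openai"):
    print(package, version(package))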

Step 2: Import Libraries & Set Up Credentials

In this step, start by importing the libraries needed for this project and setting up your API credentials for AssemblyAI, OpenAI, and ElevenLabs. Sign up for AssemblyAI's API and set up billing to access real-time (streaming) transcription. As a first-time user, you'll also get $50 in free credits for asynchronous transcription and Audio Intelligence.

Create a new file in the project directory called main.py and add the following code, making sure to replace the string placeholders for the AssemblyAI, OpenAI, and ElevenLabs API keys with your personal key values.

import assemblyai as aai
from elevenlabs import generate, stream
from openai import OpenAI

class AI_Assistant:
    def __init__(self):
        aai.settings.api_key = "ASSEMBLYAI-API-KEY"
        self.openai_client = OpenAI(api_key = "OPENAI-API-KEY")
        self.elevenlabs_api_key = "ELEVENLABS-API-KEY"

        self.transcriber = None

        # Prompt
        self.full_transcript = [
            {"role":"system", "content":"You are a receptionist at a dental clinic. Be resourceful and efficient."},
        ]
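
Hardcoding keys is fine for a quick prototype, but if you'd rather keep them out of the source file, one option is to read them from environment variables instead. Below is a minimal sketch, assuming you've exported ASSEMBLYAI_API_KEY, OPENAI_API_KEY, and ELEVENLABS_API_KEY in your shell; the variable names are just a convention for this example, not something the SDKs require:

import os

import assemblyai as aai
from openai import OpenAI

# Read the keys from the environment instead of hardcoding them.
aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
openai_client = OpenAI(api_key = os.environ["OPENAI_API_KEY"])
elevenlabs_api_key = os.environ["ELEVENLABS_API_KEY"]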

Step 3: Streaming Transcription with AssemblyAI

At the core of the AI voice bot is streaming (also known as real-time) transcription. AssemblyAI’s RealtimeTranscriber handles audio streams and converts speech to text in real time. In this step, you'll set up a real-time transcriber object and the callback functions that handle its events.

In the main.py file, add the following methods to the AI_Assistant class. The first stops and cleans up the transcriber; the others handle events from the real-time transcriber:

    def stop_transcription(self):
        if self.transcriber:
            self.transcriber.close()
            self.transcriber = None

    def on_open(self, session_opened: aai.RealtimeSessionOpened):
        print("Session ID:", session_opened.session_id)
        return

    def on_error(self, error: aai.RealtimeError):
        print("An error occurred:", error)
        return

    def on_close(self):
        print("Closing Session")
        return

Create another method to handle incoming transcripts. The real-time transcriber returns two types of transcripts: RealtimeFinalTranscript and RealtimePartialTranscript.

  • Partial transcripts are returned as the audio is being streamed to AssemblyAI.
  • Final transcripts are returned after a moment of silence. The app sends each final transcript to the generate_ai_response method, which will be defined in a later step.

Add the following method to the AI_Assistant class:

    def on_data(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return

        if isinstance(transcript, aai.RealtimeFinalTranscript):
            self.generate_ai_response(transcript)
        else:
            print(transcript.text, end="\r")

Define a new RealtimeTranscriber using the function below:

    def start_transcription(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate = 16000,
            on_data = self.on_data,
            on_error = self.on_error,
            on_open = self.on_open,
            on_close = self.on_close,
            end_utterance_silence_threshold = 1000
        )

        self.transcriber.connect()
        microphone_stream = aai.extras.MicrophoneStream(sample_rate = 16000)
        self.transcriber.stream(microphone_stream)

This function handles real-time transcription by opening a connection to the microphone and streaming audio to AssemblyAI’s API. The on_open callback you created earlier fires when the connection is established, and on_data is called each time AssemblyAI sends a transcript back.
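
If you'd like to test the transcription piece on its own before wiring in OpenAI and ElevenLabs, a small standalone script along these lines can help. It is a sketch that reuses the same RealtimeTranscriber and MicrophoneStream calls shown above, but simply prints final transcripts instead of generating a response:

import assemblyai as aai

aai.settings.api_key = "ASSEMBLYAI-API-KEY"

def on_data(transcript: aai.RealtimeTranscript):
    # Print only completed utterances; ignore partial transcripts for this test.
    if isinstance(transcript, aai.RealtimeFinalTranscript) and transcript.text:
        print("Final:", transcript.text)

def on_error(error: aai.RealtimeError):
    print("An error occurred:", error)

transcriber = aai.RealtimeTranscriber(
    sample_rate = 16000,
    on_data = on_data,
    on_error = on_error
)

transcriber.connect()
microphone_stream = aai.extras.MicrophoneStream(sample_rate = 16000)
transcriber.stream(microphone_stream)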

Step 4: Generate Responses with OpenAI

Once the code for real-time transcription has been written, use OpenAI's API to generate context-aware responses based on the conversation. Add the following method to the same AI_Assistant class:

    def generate_ai_response(self, transcript):
        self.stop_transcription()

        self.full_transcript.append({"role":"user", "content": transcript.text})
        print(f"\nPatient: {transcript.text}", end="\r\n")

        response = self.openai_client.chat.completions.create(
            model = "gpt-3.5-turbo",
            messages = self.full_transcript
        )

        ai_response = response.choices[0].message.content

        self.generate_audio(ai_response)

        self.start_transcription()
        print(f"\nReal-time transcription: ", end="\r\n")


This method is called every time a final transcript is generated by AssemblyAI’s RealtimeTranscriber. It appends the final transcript to a list containing the full conversation so far. That list, which begins with the system prompt, is then sent to OpenAI to generate a context-aware response. Note that the method stops the transcriber before generating and playing the response, then restarts it afterwards, so the microphone isn't streaming while the reply is being generated and played back.
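
For illustration, after a single exchange the full_transcript list might look something like this. The user and assistant messages here are made up; the actual content depends on the conversation:

full_transcript = [
    {"role": "system", "content": "You are a receptionist at a dental clinic. Be resourceful and efficient."},
    {"role": "user", "content": "Hi, I'd like to book a cleaning for next week."},
    {"role": "assistant", "content": "Of course! What day next week works best for you?"},
]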

Step 5: Voice Synthesis with ElevenLabs

Once OpenAI generates a response based on the user's input, it is passed to the ElevenLabs API for voice synthesis. Here, the text is converted into a natural-sounding audio stream that can be played back to the user. In the same main.py file, add the following method to the AI_Assistant class:

    def generate_audio(self, text):
        self.full_transcript.append({"role":"assistant", "content": text})
        print(f"\nAI Receptionist: {text}")

        audio_stream = generate(
            api_key = self.elevenlabs_api_key,
            text = text,
            voice = "Rachel",
            stream = True
        )

        stream(audio_stream)

This method first appends the response from OpenAI to the full_transcript list, which keeps track of the full conversation. It then sends the response to ElevenLabs for text-to-speech using the generate method, and finally plays the synthesized audio with the stream method.

Step 6: Finalizing the AI Voice Bot

In this step, add the following code to the end of the main.py file:

greeting = "Thank you for calling Vancouver dental clinic. My name is Sandy, how may I assist you?"
ai_assistant = AI_Assistant()
ai_assistant.generate_audio(greeting)
ai_assistant.start_transcription()

This completes the code for the AI voice bot, enabling it to handle real-time conversations and to transcribe and respond to users in a natural, human-like voice. To start using it, run the file in your terminal with the following command:

python main.py


Once the application is running, you will hear a greeting from the AI voice bot; respond by speaking to start the conversation. To watch the demo in action, check out the video at 18:48. Here is the GitHub repository with the full code to build this AI voice bot.

By integrating AssemblyAI, OpenAI, and ElevenLabs, this tutorial has shown how to build a powerful AI voice bot capable of managing real-time conversations. This kind of application is ideal for call centers, customer support, and virtual receptionists, where human-like interaction is a key part of the user experience.

AssemblyAI’s streaming API is a key part of building AI voice bots; check out the docs to learn more.

To build more of these AI voice bots, check out the following videos: