Tutorial

How to build a LiveKit app with real-time Speech-to-Text

LiveKit allows you to build real-time audio and video applications - learn how to add real-time Speech-to-Text to your LiveKit application in this tutorial.

LiveKit is a powerful platform for building real-time audio and video applications. It builds on top of WebRTC to abstract away the complicated details of building real-time applications, allowing developers to rapidly build and deploy applications for video conferencing, livestreaming, interactive virtual events, and more.

Beyond the core real-time capabilities, LiveKit also provides a flexible agents system, which allows developers to incorporate programmatic agents into their applications for additional functionality. For example, you can incorporate AI agents to add Speech-to-Text or LLM capabilities to build multimodal real-time AI applications.

In this guide, we'll show you how to add real-time Speech-to-Text to your LiveKit application using AssemblyAI's new Python LiveKit integration. This allows you to transcribe audio streams in real-time so that you can do backend processing, or so you can display the transcriptions in your application's UI. Here's what we'll build today:

(Video: a 32-second demo of the finished real-time Speech-to-Text application.)

We'll start with an overview of LiveKit and its constructs, but if you're already familiar with LiveKit you can jump straight to the code here. You can find the code for this tutorial in this repository. Let's get started!

LiveKit basics

LiveKit is an extremely flexible platform. It is open-source, allowing you to self-host your own infrastructure, and it offers a wide range of SDKs for building real-time applications on top of clean, idiomatic interfaces.

At the core of a LiveKit application is a LiveKit Server. Users connect to the server and can publish streams of data to it. These streams are commonly audio or video streams, but any arbitrary data stream can be used. Additionally, users can subscribe to streams published by other users.

Users can publish their own audio/video feeds and subscribe to other participants' feeds in order to build, for example, a video conferencing application.

The LiveKit Server acts as a Selective Forwarding Unit, which is a fancy way of saying that it accepts all of these incoming streams and sends (forwards) them to the appropriate users (i.e. selectively). In this way, the LiveKit Server is a central orchestrator which prevents the need for peer-to-peer connections between all users, which would drastically drive up bandwidth and compute requirements for applications with many users. Additionally, the LiveKit server can send lower bitrate/resolution videos for e.g. thumbnail views, further lowering bandwidth requirements. This approach is what allows LiveKit applications to seamlessly scale to large numbers of users.

The user (red) sends his audio and video streams to the LiveKit server, which forwards them to the other participants (left). This is in contrast to a peer-to-peer system (right) where the user would have to send his streams to every other participant.

Additionally, because LiveKit is unopinionated - it simply provides a mechanism for exchanging real-time data via the publication and subscription of streams - it is flexible enough to support a wide range of real-time applications.

LiveKit constructs

With this general context in mind, we can now restate this information using LiveKit's own terminology. LiveKit has three fundamental constructs - participants, tracks, and rooms.

Participants are members of a LiveKit application, which means they are participating in a real-time session. Participants can be end-users connecting to your application, processes that are ingesting media into or exporting it from your application, or AI agents that can process media in real-time.

These participants publish tracks, which are the streams of information mentioned above. For end-users, these tracks will generally be audio and video, but they can also carry other data such as text: in this tutorial, our AI agent that performs Streaming Speech-to-Text will publish a stream of transcripts.

Participants are members of rooms, which are logical groupings of participants that can publish and subscribe to each other's tracks (subject, of course, to your application's permissions and logic). Participants in the same room receive notifications when other participants add, remove, or modify their tracks.
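
To make these constructs concrete, here is a minimal sketch of how a participant might join a room and react to other participants' tracks using the LiveKit Python SDK. This isn't code you need for this tutorial: the room name and participant identity are placeholders, and it assumes the livekit and livekit-api packages (installed alongside livekit-agents in Step 3) and the credentials we set up in the next section are available as environment variables.

import asyncio
import os

from livekit import api, rtc


async def join_example_room():
    # Mint an access token for a hypothetical participant and room
    token = (
        api.AccessToken(os.environ["LIVEKIT_API_KEY"], os.environ["LIVEKIT_API_SECRET"])
        .with_identity("example-user")
        .with_grants(api.VideoGrants(room_join=True, room="example-room"))
        .to_jwt()
    )

    room = rtc.Room()

    # Fires whenever we subscribe to a track published by another participant
    @room.on("track_subscribed")
    def on_track_subscribed(track, publication, participant):
        print(f"Subscribed to track {track.sid} from {participant.identity}")

    await room.connect(os.environ["LIVEKIT_URL"], token)
    await asyncio.sleep(30)  # stay in the room briefly before exiting


asyncio.run(join_example_room())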

For additional information, including the fields/attributes of the relevant objects, check out LiveKit's Docs. Now that we have the overarching basics of LiveKit down, let's see what it actually takes to build a LiveKit application.

Getting started with LiveKit

In order to build a LiveKit application with real-time Speech-to-Text, you'll need three essential components:

  1. A LiveKit Server, to which the frontend will connect
  2. A frontend application, which end-users will interact with
  3. An AI agent that will transcribe the audio streams in real-time

Let's start by setting up the LiveKit Server.

Step 1 - Set up a LiveKit server

LiveKit is open-source, which means you can self-host your own LiveKit server. This is a great option if you want to have full control over your infrastructure, or if you want to customize the LiveKit server to your specific needs. In this tutorial, we'll use the LiveKit Cloud service, which is a hosted version of LiveKit that is managed by the LiveKit team. This will make it easy for us to get up and running quickly and is free for small applications.

Go to livekit.io and sign up for a LiveKit account. You will be met with a page that prompts you to create your first app. Name your app streaming-stt (streaming Speech-to-Text), and click "Continue". After answering a few questions about your use-case, you will be taken to the dashboard for your new app:

Your dashboard shows information about your LiveKit project, which is essentially a management layer for your LiveKit server. You can find usage information, active sessions, as well as what we're interested in - the server URL and the API keys. Go to Settings > Keys and you will see the default API key that was created when you initialized your project:

In a terminal, create a project directory and navigate into it:

mkdir livekit-stt
cd livekit-stt

Inside your project directory, create a .env file to store the credentials for your application and add the following:

LIVEKIT_URL=
LIVEKIT_API_KEY=
LIVEKIT_API_SECRET=

Back on the Keys page in your LiveKit dashboard, click the default API key for your app. This will display a popup modal where you can copy over each of the values and paste them into your .env file (you will have to click to reveal your secret key):

Your .env file should now look something like this:

LIVEKIT_URL=wss://streaming-stt-SOME_NUMBER.livekit.cloud
LIVEKIT_API_KEY=SHORT_ALPHANUMERIC_STRING
LIVEKIT_API_SECRET=REALLY_LONG_ALPHANUMERIC_STRING

Note: Your .env file contains your credentials - make sure to keep it secure and never commit it to source control.
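
If you'd like a quick sanity check that these values load correctly, a small throwaway script using python-dotenv (which we install in Step 3) could look like the sketch below. It only reports whether each variable is set, without printing the secrets themselves.

import os

from dotenv import load_dotenv

# Load LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET from .env
load_dotenv()

for name in ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET"):
    print(f"{name}: {'set' if os.getenv(name) else 'MISSING'}")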

Step 2 - Set up the LiveKit Agents Playground

Now that our server is set up, we can move on to building the frontend application. LiveKit has a range of SDKs that make it easy to build in any environment. In our case, we'll use the LiveKit Agents Playground, which is a web application that allows you to test out the LiveKit agents system. Using this playground will allow us to quickly test out the Speech-to-Text agent that we'll build in the next section. The Agents Playground is open-source, so feel free to read through the code for inspiration when you're building your own project.

Additionally, we don't even have to set up the Agents Playground ourselves - LiveKit has a hosted version that we can use. Go to agents-playground.livekit.io and you will either be automatically signed in or met with a prompt to connect to LiveKit Cloud:

Sign in if prompted, and select the streaming-stt project to connect to it:

You will be taken to the Agents Playground which is connected to a LiveKit server for your streaming-stt project. On the right, you will see the ID of the room you are connected to, as well as your own participant ID.

You can disconnect for now by clicking the button in the top right - it's time to build our Speech-to-Text agent!

Step 3 - Build a real-time Speech-to-Text agent

Before we start writing code, we need to get an AssemblyAI API key - you can get one here. Currently, the free offering includes over 400 hours of asynchronous Speech-to-Text as well as access to Audio Intelligence models, but it does not include Streaming Speech-to-Text. You will need to add a payment method to your account to access the Streaming Speech-to-Text API. Once you have done so, you can find your API key on the front page of your dashboard:

Copy it, and paste it into your .env file:

ASSEMBLYAI_API_KEY=YOUR-KEY-HERE

Now we're ready to start coding. Back in your project directory, create a virtual environment:

# Mac/Linux
python3 -m venv venv
. venv/bin/activate

# Windows
python -m venv venv
.\venv\Scripts\activate.bat

Next, install the required packages:

pip install livekit-agents livekit-plugins-assemblyai python-dotenv

This command installs the LiveKit Agents framework, the AssemblyAI plugin for LiveKit, and the python-dotenv package, which you'll use to load your environment variables from your .env file.
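
If you prefer to track dependencies in a file, an equivalent (unpinned) requirements.txt would simply list the same three packages; add version pins if you want reproducible installs:

livekit-agents
livekit-plugins-assemblyai
python-dotenv

You can then install them with pip install -r requirements.txt.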

Now it's time to build the agent, which will be based on an example from LiveKit's examples repository. Create a new Python file in your project directory called stt_agent.py and add the following:

import asyncio
import logging

from dotenv import load_dotenv
from livekit import rtc
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    stt,
    transcription,
)
from livekit.plugins import assemblyai

load_dotenv()

logger = logging.getLogger("transcriber")

We start with our imports, load our environment variables, and then instantiate a logger for our agent. Now we can move on to writing the main agent code.

Define the entrypoint function

We start by defining an entrypoint function which executes when the agent is connected to the room.

async def entrypoint(ctx: JobContext):
    logger.info(f"starting transcriber (speech to text) example, room: {ctx.room.name}")
    stt_impl = assemblyai.STT()

The entrypoint function is an asynchronous function that accepts a JobContext. We then log a message and instantiate an assemblyai.STT() object. This object is responsible for handling the Speech-to-Text and satisfies the LiveKit Agents stt.STT interface.

Next, still within the entrypoint function, we define an inner function that tells the agent what to do when it subscribes to a new track:

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.TrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(transcribe_track(participant, track))

The decorator indicates which event this function should be bound to, in this case track subscription. The function creates a new asynchronous task that transcribes the audio track using the transcribe_track function we'll add next.

Add the following inner function to your entrypoint function:

    async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
        audio_stream = rtc.AudioStream(track)
        stt_stream = stt_impl.stream()
        stt_forwarder = transcription.STTSegmentsForwarder(
            room=ctx.room, participant=participant, track=track
        )

        # Run tasks for audio input and transcription output in parallel
        await asyncio.gather(
            _handle_audio_input(audio_stream, stt_stream),
            _handle_transcription_output(stt_stream, stt_forwarder),
        )

This function first creates an AudioStream object from the track, and then creates an AssemblyAI SpeechStream object using the .stream() method of our assemblyai.STT() object. The SpeechStream object represents the bidirectional communication stream between your LiveKit agent and AssemblyAI - audio segments are forwarded to AssemblyAI, and transcripts are received in return. Next, the function creates an STTSegmentsForwarder object, which is responsible for forwarding the transcripts to the room so that they can be displayed on the frontend.

To transcribe the track we need to do two things in parallel - receive the audio track from the LiveKit server and send it to AssemblyAI for transcription, and then receive the response transcript from AssemblyAI and forward it back to the LiveKit server. We do this using the asyncio.gather function, which runs these two tasks in parallel. We will define these tasks next.

First, we define _handle_audio_input. Add the following inner function to entrypoint:

    async def _handle_audio_input(
        audio_stream: rtc.AudioStream, stt_stream: stt.SpeechStream
    ):
        """Pushes audio frames to the speech-to-text stream."""
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

This function listens for audio frames from the AudioStream object and pushes them to the SpeechStream object. The AudioStream object is an asynchronous generator that yields audio frames from the subscribed track, which we forward to AssemblyAI using the push_frame method of the stt_stream. Now add this inner function to entrypoint:

    async def _handle_transcription_output(
        stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
    ):
        """Receives transcription events from the speech-to-text service."""
        async for ev in stt_stream:
            if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                print(" -> ", ev.alternatives[0].text)

            stt_forwarder.update(ev)

This function does the converse of _handle_audio_input - it listens for speech events from the SpeechStream object and forwards them to the STTSegmentsForwarder object, which in turn forwards them to the LiveKit server. When it receives a FINAL_TRANSCRIPT event, it prints the transcript to the console. You can also add logic to handle INTERIM_TRANSCRIPTs, for example by printing them as well, as sketched below - you can learn about the difference between interim (or partial) transcripts and final transcripts in this section of our blog on transcribing Twilio calls in real-time, and you can see all of the speech event types here.
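
As a sketch of what that might look like (this variation is not part of the final code below), here is a version of the handler that also prints interim transcripts, using the INTERIM_TRANSCRIPT event type from livekit.agents.stt:

    async def _handle_transcription_output(
        stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
    ):
        """Receives transcription events and prints both interim and final results."""
        async for ev in stt_stream:
            if ev.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
                # Interim (partial) transcripts arrive quickly but may still change
                print(" ~ ", ev.alternatives[0].text)
            elif ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                # Final transcripts are punctuated and formatted
                print(" -> ", ev.alternatives[0].text)

            stt_forwarder.update(ev)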

Finally, add the following line to the entrypoint function (at its root level) to connect to the LiveKit room and automatically subscribe to any published audio tracks:

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

To summarize:

  1. The entrypoint function is executed when the agent connects to the LiveKit room
  2. The agent automatically subscribes to every audio track published to the room
  3. For each of these tracks, the agent creates an asynchronous task which simultaneously:
    1. Pushes audio frames to the AssemblyAI Speech-to-Text stream
    2. Receives transcription events from the AssemblyAI Speech-to-Text stream, prints FINAL_TRANSCRIPTs to the agent server console, and forwards all events to the LiveKit room so that they can be sent to participants - in our case, to power the "Chat" feature on the frontend.

So, your entrypoint function should now look like this:

async def entrypoint(ctx: JobContext):
    logger.info(f"Starting transcriber (speech to text) example, room: {ctx.room.name}")
    stt_impl = assemblyai.STT()

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.TrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(transcribe_track(participant, track))

    async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
        """
        Handles the parallel tasks of sending audio to the STT service and 
        forwarding transcriptions back to the app.
        """
        audio_stream = rtc.AudioStream(track)
        stt_forwarder = transcription.STTSegmentsForwarder(
            room=ctx.room, participant=participant, track=track
        )

        stt_stream = stt_impl.stream()

        # Run tasks for audio input and transcription output in parallel
        await asyncio.gather(
            _handle_audio_input(audio_stream, stt_stream),
            _handle_transcription_output(stt_stream, stt_forwarder),
        )

    async def _handle_audio_input(
        audio_stream: rtc.AudioStream, stt_stream: stt.SpeechStream
    ):
        """Pushes audio frames to the speech-to-text stream."""
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    async def _handle_transcription_output(
        stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
    ):
        """Receives transcription events from the speech-to-text service."""
        async for ev in stt_stream:
            if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                print(" -> ", ev.alternatives[0].text)

            stt_forwarder.update(ev)

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)

Define the main loop

Finally, we define the main loop of our agent, which is responsible for connecting to the LiveKit room and running the entrypoint function. Add the following code to your stt_agent.py file:

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

When the script is run, we use LiveKit's cli.run_app method to run the agent, specifying the entrypoint function as the entrypoint for the agent.

Run the application

Go back to the Agents Playground in your browser, and click Connect. Remember, the Playground is connected to your LiveKit Project. Now, go into your terminal and start the agent with the below command, ensuring that the virtual environment you created earlier is active:

python stt_agent.py dev

The agent connects to your LiveKit project by using the credentials in your .env file. In the Playground, you will see the Agent connected status change from FALSE to TRUE after starting your agent.

Begin speaking, and you will see your speech transcribed in real time. After you complete a sentence, it will be punctuated and formatted, and then a new line will be started for the next sentence in the chat box on the Playground.

In your terminal where the agent is running, you will see only the final punctuated/formatted utterances printed, because this is the behavior we defined in our stt_agent.py file.

You can see this process in action here:

(Video: a 32-second demo of the real-time transcription in action.)

That's it! You've successfully built a real-time Speech-to-Text agent for your LiveKit application. You can now use this agent to transcribe audio streams in real-time, and display the transcripts in your application's UI.

Remember, you can self-host any part of this application, including the LiveKit server and the frontend application. Check out the LiveKit docs for more information on building LiveKit applications and working with AI agents.

Final words

In this tutorial, we showed you how to add real-time Speech-to-Text to your LiveKit application using AssemblyAI's new Python LiveKit integration. We walked through the basics of LiveKit, how to set up a LiveKit server, how to build a real-time Speech-to-Text agent, and how to connect the agent to your LiveKit application.

Check out AssemblyAI's docs to learn more about other models we offer beyond Streaming Speech-to-Text. Otherwise, feel free to check out our YouTube channel or blog to learn more about building with AI and AI theory, like this video on building a Chatbot in Python with Claude 3.5 Sonnet: