LiveKit is a powerful platform for building real-time audio and video applications. It builds on top of WebRTC to abstract away the complicated details of building real-time applications, allowing developers to rapidly build and deploy applications for video conferencing, livestreaming, interactive virtual events, and more.
Beyond the core real-time capabilities, LiveKit also provides a flexible agents system, which allows developers to incorporate programmatic agents into their applications for additional functionality. For example, you can incorporate AI agents to add Speech-to-Text or LLM capabilities to build multimodal real-time AI applications.
In this guide, we'll show you how to add real-time Speech-to-Text to your LiveKit application using AssemblyAI's new Python LiveKit integration. This allows you to transcribe audio streams in real-time so that you can do backend processing, or so you can display the transcriptions in your application's UI. Here's what we'll build today:
We'll get started with an overview of LiveKit and its constructs, but if you're already familiar with LiveKit you can jump straight to the code here. You can find the code for this tutorial in this repository. Let's get started!
LiveKit basics
LiveKit is an extremely flexible platform. It is open-source, allowing you to self-host your own infrastructure, and it offers a wide range of SDKs for building real-time applications on top of clean, idiomatic interfaces.
At the core of a LiveKit application is a LiveKit Server. Users connect to the server and can publish streams of data to it. These streams are commonly audio or video, but any arbitrary data stream can be used. Additionally, users can subscribe to streams published by other users.
The LiveKit Server acts as a Selective Forwarding Unit, which is a fancy way of saying that it accepts all of these incoming streams and sends (forwards) them to the appropriate users (i.e. selectively). In this way, the LiveKit Server is a central orchestrator which prevents the need for peer-to-peer connections between all users, which would drastically drive up bandwidth and compute requirements for applications with many users. Additionally, the LiveKit server can send lower bitrate/resolution videos for e.g. thumbnail views, further lowering bandwidth requirements. This approach is what allows LiveKit applications to seamlessly scale to large numbers of users.
Additionally, LiveKit is unopinionated: it provides only a simple mechanism for exchanging real-time data (the publication and subscription of streams), which makes it flexible enough to support a wide variety of real-time applications.
LiveKit constructs
With this general context in mind, we can now restate this information in LiveKit's own terminology. LiveKit has three fundamental constructs - participants, tracks, and rooms.
Participants are members of a LiveKit application, which means they are participating in a real-time session. Participants can be end-users connecting to your application, processes that are ingesting media into or exporting it from your application, or AI agents that can process media in real-time.
These participants publish tracks, which are the streams of information mentioned above. For end-users these will generally be audio and video streams, but they can carry other data as well. In this tutorial, for example, our AI agent that performs Streaming Speech-to-Text will publish a stream of transcripts.
The participants are members of rooms, which are logical groupings of participants that can publish and subscribe to each other's tracks (subject, of course, to your application's permissions and logic). Participants in the same room receive notifications when other participants make changes to their tracks, like adding, removing, or modifying them.
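To make these constructs concrete, here is a minimal sketch using LiveKit's Python SDK (the livekit package). It is not part of the application we build below, and the server URL and access token are placeholders you would replace with values from your own project:
import asyncio

from livekit import rtc

async def main():
    room = rtc.Room()

    # React whenever we are subscribed to a track published by another participant
    @room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.RemoteTrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        print(f"subscribed to a {track.kind} track from {participant.identity}")

    # Placeholder URL and token - in a real app these come from your LiveKit
    # project and a token server
    await room.connect("wss://your-project.livekit.cloud", "ACCESS_TOKEN")
    await asyncio.sleep(60)  # stay in the room for a minute
    await room.disconnect()

asyncio.run(main())
A participant connects to a room, and the event handler fires for each track it subscribes to - the same publish/subscribe pattern our agent will use below.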
For additional information, including the fields/attributes of the relevant objects, check out LiveKit's Docs. Now that we have the overarching basics of LiveKit down, let's see what it actually takes to build a LiveKit application.
Getting started with LiveKit
In order to build a LiveKit application with real-time Speech-to-Text, you'll need three essential components:
- A LiveKit Server, to which the frontend will connect
- A frontend application, which end-users will interact with
- An AI Agent that will transcribe the audio streams in real-time
Let's start by setting up the LiveKit Server.
Step 1 - Set up a LiveKit server
LiveKit is open-source, which means you can self-host your own LiveKit server. This is a great option if you want to have full control over your infrastructure, or if you want to customize the LiveKit server to your specific needs. In this tutorial, we'll use the LiveKit Cloud service, which is a hosted version of LiveKit that is managed by the LiveKit team. This will make it easy for us to get up and running quickly and is free for small applications.
Go to livekit.io and sign up for a LiveKit account. You will be met with a page that prompts you to create your first app. Name your app streaming-stt
(streaming Speech-to-Text), and click "Continue". After answering a few questions about your use-case, you will be taken to the dashboard for your new app:
Your dashboard shows information about your LiveKit project, which is essentially a management layer for your LiveKit server. You can find usage information, active sessions, as well as what we're interested in - the server URL and the API keys. Go to Settings > Keys
and you will see the default API key that was created when you initialized your project:
In a terminal, create a project directory and navigate into it:
mkdir livekit-stt
cd livekit-stt
Inside your project directory, create a .env
file to store the credentials for your application and add the following:
LIVEKIT_URL=
LIVEKIT_API_KEY=
LIVEKIT_API_SECRET=
Back on the Keys page in your LiveKit dashboard, click the default API key for your app. This will display a popup modal where you can copy over each of the values and paste them into your .env
file (you will have to click to reveal your secret key):
Your .env
file should now look something like this:
LIVEKIT_URL=wss://streaming-stt-SOME_NUMBER.livekit.cloud
LIVEKIT_API_KEY=SHORT_ALPHANUMERIC_STRING
LIVEKIT_API_SECRET=REALLY_LONG_ALPHANUMERIC_STRING
Note: Your .env
file contains your credentials - make sure to keep it secure and never commit it to source control.
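If you're using git for this project, one simple way to keep these files out of source control is a .gitignore at the root of your project directory, for example (the venv/ entry covers the virtual environment we'll create in a later step):
# .gitignore
.env
venv/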
Step 2 - Set up the LiveKit Agents Playground
Now that our server is set up, we can move on to building the frontend application. LiveKit has a range of SDKs that make it easy to build in any environment. In our case, we'll use the LiveKit Agents Playground, which is a web application that allows you to test out the LiveKit agents system. Using this playground will allow us to quickly test out the Speech-to-Text agent that we'll build in the next section. The Agents Playground is open-source, so feel free to read through the code for inspiration when you're building your own project.
Additionally, we don't even have to set up the Agents Playground ourselves - LiveKit has a hosted version that we can use. Go to agents-playground.livekit.io and you will be either automatically signed in, or met with a prompt to connect to LiveKit cloud:
Sign in if prompted, and select the streaming-stt
project to connect to it:
You will be taken to the Agents Playground which is connected to a LiveKit server for your streaming-stt
project. On the right, you will see the ID of the room you are connected to, as well as your own participant ID.
You can disconnect for now by clicking the button in the top right - it's time to build our Speech-to-Text agent!
Step 3 - Build a real-time Speech-to-Text agent
Before we start writing code, we need to get an AssemblyAI API key - you can get one here. Currently, the free offering includes over 400 hours of asynchronous Speech-to-Text as well as access to Audio Intelligence models, but it does not include Streaming Speech-to-Text. You will need to add a payment method to your account to access the Streaming Speech-to-Text API. Once you have done so, you can find your API key on the front page of your dashboard:
Copy it, and paste it into your .env
file:
ASSEMBLYAI_API_KEY=YOUR-KEY-HERE
Now we're ready to start coding. Back in your project directory, create a virtual environment:
# Mac/Linux
python3 -m venv venv
. venv/bin/activate
# Windows
python -m venv venv
.\venv\Scripts\activate.bat
Next, install the required packages:
pip install livekit-agents livekit-plugins-assemblyai python-dotenv
This command installs the LiveKit Python SDK, the AssemblyAI plugin for LiveKit, and the python-dotenv
package which you'll use to load your environment variables from your .env
file.
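If you'd like to confirm that your credentials load correctly before writing the agent, here's a quick optional sanity check (a hypothetical check_env.py helper, not part of the agent itself):
# check_env.py - optional sanity check for the .env file
import os

from dotenv import load_dotenv

load_dotenv()

for var in ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET", "ASSEMBLYAI_API_KEY"):
    # Print whether each credential was found, without printing its value
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")
Run it with python check_env.py and make sure nothing is reported as MISSING.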
Now it's time to build the agent, which will be based on an example from LiveKit's examples repository. Create a new Python file in your project directory called stt_agent.py
and add the following:
import asyncio
import logging

from dotenv import load_dotenv
from livekit import rtc
from livekit.agents import (
    AutoSubscribe,
    JobContext,
    WorkerOptions,
    cli,
    stt,
    transcription,
)
from livekit.plugins import assemblyai

load_dotenv()

logger = logging.getLogger("transcriber")
We start with our imports, load our environment variables, and then instantiate a logger for our agent. Now we can move on to writing the main agent code.
Define the entrypoint
function
We start by defining an entrypoint
function which executes when the agent is connected to the room.
async def entrypoint(ctx: JobContext):
    logger.info(f"starting transcriber (speech to text) example, room: {ctx.room.name}")
    stt_impl = assemblyai.STT()
The entrypoint
function is an asynchronous function that accepts a JobContext
. We then log a message and instantiate an assemblyai.STT()
object. This object is responsible for handling the Speech-to-Text and satisfies the LiveKit Agents stt.STT
interface.
Next, still within the entrypoint
function, we define an inner function that tells the agent what to do when it subscribes to a new track:
@ctx.room.on("track_subscribed")
def on_track_subscribed(
    track: rtc.Track,
    publication: rtc.TrackPublication,
    participant: rtc.RemoteParticipant,
):
    if track.kind == rtc.TrackKind.KIND_AUDIO:
        asyncio.create_task(transcribe_track(participant, track))
The decorator indicates which event this function should be bound to, in this case track subscription. The function creates a new asynchronous task that transcribes audio tracks using the transcribe_track
function we'll add next.
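As an aside, the same decorator pattern can be used to react to other room events. For example, a hypothetical handler that logs when a participant leaves the room might look like the sketch below (assuming the participant_disconnected event name from the LiveKit SDK):
@ctx.room.on("participant_disconnected")
def on_participant_disconnected(participant: rtc.RemoteParticipant):
    # Purely illustrative - log when a participant leaves the room
    logger.info(f"participant left: {participant.identity}")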
Add the following inner function to your entrypoint
function:
async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
    audio_stream = rtc.AudioStream(track)
    stt_stream = stt_impl.stream()
    stt_forwarder = transcription.STTSegmentsForwarder(
        room=ctx.room, participant=participant, track=track
    )

    # Run tasks for audio input and transcription output in parallel
    await asyncio.gather(
        _handle_audio_input(audio_stream, stt_stream),
        _handle_transcription_output(stt_stream, stt_forwarder),
    )
This function first creates an AudioStream
object from the track, and then creates an AssemblyAI SpeechStream
object using the .stream()
method of our assemblyai.STT()
object. The SpeechStream
object represents the bidirectional communication stream between your LiveKit agent and AssemblyAI: audio segments are forwarded to AssemblyAI, and transcripts are received in return. Next, the function creates an STTSegmentsForwarder
object, which is responsible for forwarding the transcripts to the room so that they can be displayed on the frontend.
To transcribe the track we need to do two things in parallel - receive the audio track from the LiveKit server and send it to AssemblyAI for transcription, and then receive the response transcript from AssemblyAI and forward it back to the LiveKit server. We do this using the asyncio.gather
function, which runs these two tasks in parallel. We will define these tasks next.
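If you haven't used asyncio.gather before, here is a small standalone illustration (unrelated to LiveKit) of how it runs two coroutines concurrently:
import asyncio

async def send_audio():
    # Stand-in for pushing audio frames upstream
    await asyncio.sleep(1)
    print("finished sending audio")

async def receive_transcripts():
    # Stand-in for consuming transcription events
    await asyncio.sleep(1)
    print("finished receiving transcripts")

async def main():
    # Both coroutines make progress at the same time; gather returns
    # once both have completed
    await asyncio.gather(send_audio(), receive_transcripts())

asyncio.run(main())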
First, we define _handle_audio_input
. Add the following inner function to entrypoint
:
async def _handle_audio_input(
    audio_stream: rtc.AudioStream, stt_stream: stt.SpeechStream
):
    """Pushes audio frames to the speech-to-text stream."""
    async for ev in audio_stream:
        stt_stream.push_frame(ev.frame)
This function listens for audio frames from the AudioStream
object and pushes them to the SpeechStream
object. The AudioStream
object is an asynchronous generator that yields audio frames from the subscribed track which we forward to AssemblyAI using the push_frame
method of the stt_stream
. Now add this inner function to entrypoint
:
async def _handle_transcription_output(
    stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
):
    """Receives transcription events from the speech-to-text service."""
    async for ev in stt_stream:
        if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            print(" -> ", ev.alternatives[0].text)
        stt_forwarder.update(ev)
This function does the converse of _handle_audio_input
- it listens for Speech events from the SpeechStream
object and forwards them to the STTSegmentsForwarder
object, which in turn forwards them to the LiveKit server. When it receives a FINAL_TRANSCRIPT
event, it prints the transcript to the console. You can also add logic to, for example, print out INTERIM_TRANSCRIPT
s - you can learn about the difference between interim (or partial) transcripts and final transcripts in this section of our blog on transcribing Twilio calls in real-time. You can see all of the speech event types here.
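For instance, a variant of the handler that also prints interim transcripts might look like the sketch below (we'll stick with the final-transcript-only version for the rest of this tutorial):
async def _handle_transcription_output(
    stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
):
    """Variant that prints interim transcripts as well as final ones."""
    async for ev in stt_stream:
        if ev.type == stt.SpeechEventType.INTERIM_TRANSCRIPT:
            # Interim transcripts are unformatted and may change as more audio arrives
            print(" ~  ", ev.alternatives[0].text)
        elif ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
            print(" -> ", ev.alternatives[0].text)
        stt_forwarder.update(ev)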
Finally, add the following line to the entrypoint
function (at its root level) to connect to the LiveKit room and automatically subscribe to any published audio tracks:
await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
To summarize:
- The entrypoint function is executed when the agent connects to the LiveKit room
- The agent automatically subscribes to every audio track published to the room
- For each of these tracks, the agent creates an asynchronous task which simultaneously:
  - Pushes audio frames to the AssemblyAI Speech-to-Text stream
  - Receives transcription events from the AssemblyAI Speech-to-Text stream, prints them to the agent server console if they are FINAL_TRANSCRIPTs, and forwards them to the LiveKit room so that they can be sent to participants, in our case to power the "Chat" feature on the frontend.
So, your entrypoint
function should now look like this:
async def entrypoint(ctx: JobContext):
    logger.info(f"Starting transcriber (speech to text) example, room: {ctx.room.name}")
    stt_impl = assemblyai.STT()

    @ctx.room.on("track_subscribed")
    def on_track_subscribed(
        track: rtc.Track,
        publication: rtc.TrackPublication,
        participant: rtc.RemoteParticipant,
    ):
        if track.kind == rtc.TrackKind.KIND_AUDIO:
            asyncio.create_task(transcribe_track(participant, track))

    async def transcribe_track(participant: rtc.RemoteParticipant, track: rtc.Track):
        """
        Handles the parallel tasks of sending audio to the STT service and
        forwarding transcriptions back to the app.
        """
        audio_stream = rtc.AudioStream(track)
        stt_forwarder = transcription.STTSegmentsForwarder(
            room=ctx.room, participant=participant, track=track
        )
        stt_stream = stt_impl.stream()

        # Run tasks for audio input and transcription output in parallel
        await asyncio.gather(
            _handle_audio_input(audio_stream, stt_stream),
            _handle_transcription_output(stt_stream, stt_forwarder),
        )

    async def _handle_audio_input(
        audio_stream: rtc.AudioStream, stt_stream: stt.SpeechStream
    ):
        """Pushes audio frames to the speech-to-text stream."""
        async for ev in audio_stream:
            stt_stream.push_frame(ev.frame)

    async def _handle_transcription_output(
        stt_stream: stt.SpeechStream, stt_forwarder: transcription.STTSegmentsForwarder
    ):
        """Receives transcription events from the speech-to-text service."""
        async for ev in stt_stream:
            if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
                print(" -> ", ev.alternatives[0].text)
            stt_forwarder.update(ev)

    await ctx.connect(auto_subscribe=AutoSubscribe.AUDIO_ONLY)
Define the main loop
Finally, we define the main loop of our agent, which is responsible for connecting to the LiveKit room and running the entrypoint
function. Add the following code to your stt_agent.py
file:
if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
When the script is run, we use LiveKit's cli.run_app
method to run the agent, specifying the entrypoint
function as the entrypoint for the agent.
Run the application
Go back to the Agents Playground in your browser, and click Connect
. Remember, the Playground is connected to your LiveKit Project. Now, go into your terminal and start the agent with the below command, ensuring that the virtual environment you created earlier is active:
python stt_agent.py dev
The agent connects to your LiveKit project by using the credentials in your .env
file. In the Playground, you will see the Agent connected
status change from FALSE
to TRUE
after starting your agent.
Begin speaking, and you will see your speech transcribed in real time. After you complete a sentence, it will be punctuated and formatted, and then a new line will be started for the next sentence in the chat box on the Playground.
In your terminal where the agent is running, you will see only the final punctuated/formatted utterances printed, because this is the behavior we defined in our stt_agent.py
file.
You can see this process in action here:
That's it! You've successfully built a real-time Speech-to-Text agent for your LiveKit application. You can now use this agent to transcribe audio streams in real-time, and display the transcripts in your application's UI.
Remember, you can self-host any part of this application, including the LiveKit server and the frontend application. Check out the LiveKit docs for more information on building LiveKit applications and working with AI agents.
Final words
In this tutorial, we showed you how to add real-time Speech-to-Text to your LiveKit application using AssemblyAI's new Python LiveKit integration. We walked through the basics of LiveKit, how to set up a LiveKit server, how to build a real-time Speech-to-Text agent, and how to connect the agent to your LiveKit application.
Check out AssemblyAI's docs to learn more about other models we offer beyond Streaming Speech-to-Text. Otherwise, feel free to check out our YouTube channel or blog to learn more about building with AI and AI theory, like this video on building a Chatbot in Python with Claude 3.5 Sonnet: