LlamaIndex is a flexible data framework for connecting custom data sources to Large Language Models (LLMs). With LlamaIndex, you can easily store and index your data and then apply LLMs.
LLMs only work with textual data, so to process audio files with LLMs we first need to transcribe them into text.
Luckily, LlamaIndex provides an AssemblyAI integration through Llama Hub that lets you load audio data with just a few lines of code:
from llama_hub.assemblyai.base import AssemblyAIAudioTranscriptReader
reader = AssemblyAIAudioTranscriptReader(file_path="./my_file.mp3")
docs = reader.load_data()
Let's learn how to use this data reader step-by-step. For this, we create a small demo application with an LLM-powered query engine that lets you load audio data and ask questions about your data.
Getting Started
Create a new virtual environment:
# Mac/Linux:
python3 -m venv venv
. venv/bin/activate
# Windows:
python -m venv venv
.\venv\Scripts\activate.bat
Install LlamaIndex, Llama Hub, and the AssemblyAI Python package:
pip install llama-index llama-hub assemblyai
Set your AssemblyAI API key as an environment variable named ASSEMBLYAI_API_KEY. You can get a free API key here.
# Mac/Linux:
export ASSEMBLYAI_API_KEY=<YOUR_KEY>
# Windows:
set ASSEMBLYAI_API_KEY=<YOUR_KEY>
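If you prefer to configure the key inside your script instead of the shell, here is a minimal sketch using Python's os module (the placeholder value is hypothetical, use your actual key):
import os

# Set the key programmatically instead of exporting it in the shell.
# Replace the placeholder with your actual AssemblyAI API key.
os.environ["ASSEMBLYAI_API_KEY"] = "<YOUR_KEY>"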
Use the AssemblyAIAudioTranscriptReader
To load and transcribe audio data into documents, import the AssemblyAIAudioTranscriptReader. It needs at least the file_path argument with an audio file specified as a URL or a local file path. You can read more about the integration in the official Llama Hub docs.
from llama_hub.assemblyai.base import AssemblyAIAudioTranscriptReader
audio_file = "https://storage.googleapis.com/aai-docs-samples/sports_injuries.mp3"
# or a local file path: audio_file = "./sports_injuries.mp3"
reader = AssemblyAIAudioTranscriptReader(file_path=audio_file)
docs = reader.load_data()
After loading the data, the transcribed text is stored in the text attribute.
print(docs[0].text)
# Runner's knee. Runner's knee is a condition ...
The metadata attribute contains the full JSON response of our API with more meta information:
print(docs[0].metadata)
# {'language_code': <LanguageCode.en_us: 'en_us'>,
# 'punctuate': True,
# 'format_text': True,
# …
# }
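Since the metadata is a regular Python dictionary, you can pull out individual fields directly. A small sketch, assuming the API response includes fields such as audio_duration and audio_url (the exact keys depend on the response for your transcript):
# Access individual fields of the metadata dictionary.
# audio_duration and audio_url are example keys from the API response.
print(docs[0].metadata.get("audio_duration"))
print(docs[0].metadata.get("audio_url"))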
Tip: The default configuration of the document loader returns a list with only one document, which is why we access the first document in the list with docs[0]. But you can use a different TranscriptFormat that splits the text, for example by sentences or paragraphs, and returns multiple documents. You can read more about the TranscriptFormat options here.
from llama_hub.assemblyai.base import TranscriptFormat
reader = AssemblyAIAudioTranscriptReader(
file_path="./your_file.mp3",
transcript_format=TranscriptFormat.SENTENCES,
)
docs = reader.load_data()
# Now it returns a list with multiple documents
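With TranscriptFormat.SENTENCES, each document in the returned list holds one sentence of the transcript, so you can iterate over the list directly. A quick sketch:
# Each document now contains a single sentence of the transcript
for doc in docs[:5]:
    print(doc.text)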
Apply a Vector Store Index and a Query Engine
Now that you have loaded the transcribed text into LlamaIndex documents, you can easily ask questions about the spoken data. For example, you can use an OpenAI model together with a query engine.
For this, you also need to set your OpenAI API key as an environment variable:
# Mac/Linux:
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>
# Windows:
set OPENAI_API_KEY=<YOUR_OPENAI_KEY>
Now, you can create a VectorStoreIndex and a query engine from the documents you loaded in the first step.
The metadata needs to be smaller than the text chunk size, and since it contains the full JSON response with extra information, it is quite large. For simplicity, we just remove it here:
from llama_index import VectorStoreIndex
# Metadata needs to be smaller than chunk size
# For simplicity we just get rid of it
docs[0].metadata = {}
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is a runner's knee?")
print(response)
# Runner's knee is a condition characterized by ...
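If you want to reuse the index later without re-transcribing and re-embedding the audio, you can persist it to disk and load it again. A minimal sketch, assuming a local ./storage directory (the directory name is arbitrary):
from llama_index import StorageContext, load_index_from_storage

# Persist the index to a local directory
index.storage_context.persist(persist_dir="./storage")

# Later, rebuild the index from the persisted files and query it again
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
print(index.as_query_engine().query("Which injuries are mentioned?"))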
Conclusion
This tutorial explained how to use the AssemblyAI data reader for LlamaIndex. You learned how to transcribe audio files and load the transcribed text into LlamaIndex documents, and how to create a Query Engine to ask questions about your spoken data.
Below is the complete code:
from llama_index import VectorStoreIndex
from llama_hub.assemblyai.base import AssemblyAIAudioTranscriptReader
audio_file = "https://storage.googleapis.com/aai-docs-samples/sports_injuries.mp3"
reader = AssemblyAIAudioTranscriptReader(file_path=audio_file)
docs = reader.load_data()
docs[0].metadata = {}
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is a runner's knee?")
print(response)
If you enjoyed this article, feel free to check out some others on our blog, like
- How to use audio data in LangChain with Python
- Convert Speech to Text in Python in 5 Minutes
- How to get Zoom Transcripts with the Zoom API
Alternatively, check out our YouTube channel for learning resources on AI, like our Machine Learning from Scratch series.