Skip to main content
This page explains the different ways you can export and format your transcript data, including SRT/VTT caption files, paragraphs and sentences, and word-level timestamps. The plain-text transcript returned in the text field is raw transcript text only — it does not include timestamps or speaker labels. To build a custom text export in a format such as [Speaker] timecode -> text, enable speaker labels and construct the output from the utterances array (which contains speaker, start, end, and text for each utterance). See the Timestamped transcripts guide and the Create subtitles with speaker labels cookbook for examples.

Export SRT or VTT caption files

You can export completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos. You can also customize the maximum number of characters per caption by specifying the chars_per_caption parameter.
import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# audio_file = "./local_file.mp3"
audio_file = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
  speech_models=["universal-3-pro", "universal-2"],
  language_detection=True
)

transcript = aai.Transcriber(config=config).transcribe(audio_file)

if transcript.status == "error":
  raise RuntimeError(f"Transcription failed: {transcript.error}")

srt = transcript.export_subtitles_srt(
  # Optional: Customize the maximum number of characters per caption
  chars_per_caption=32
  )

with open(f"transcript_{transcript.id}.srt", "w") as srt_file:
  srt_file.write(srt)

# vtt = transcript.export_subtitles_vtt()

# with open(f"transcript_{transcript_id}.vtt", "w") as vtt_file:
#   vtt_file.write(vtt)

Export paragraphs

You can retrieve transcripts that are automatically segmented into paragraphs. The text of the transcript is broken down by paragraphs, along with additional metadata.
import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# audio_file = "./local_file.mp3"
audio_file = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
  speech_models=["universal-3-pro", "universal-2"],
  language_detection=True
)

transcript = aai.Transcriber(config=config).transcribe(audio_file)

if transcript.status == "error":
  raise RuntimeError(f"Transcription failed: {transcript.error}")

paragraphs = transcript.get_paragraphs()
for paragraph in paragraphs:
  print(paragraph.text)
  print()

Export sentences

You can retrieve transcripts that are automatically segmented into sentences, for a more reader-friendly experience. The text of the transcript is broken down by sentences, along with additional metadata.
import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# audio_file = "./local_file.mp3"
audio_file = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
  speech_models=["universal-3-pro", "universal-2"],
  language_detection=True
)

transcript = aai.Transcriber(config=config).transcribe(audio_file)

if transcript.status == "error":
  raise RuntimeError(f"Transcription failed: {transcript.error}")

sentences = transcript.get_sentences()
for sentence in sentences:
  print(sentence.text)
  print()
The response is an array of objects, each representing a sentence or a paragraph in the transcript. See the API reference for more info.

Word-level timestamps

The response also includes an array with information about each word:
import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# audio_file = "./local_file.mp3"
audio_file = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
  speech_models=["universal-3-pro", "universal-2"],
  language_detection=True
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for word in transcript.words:
  print(f"Word: {word.text}, Start: {word.start}, End: {word.end}, Confidence: {word.confidence}")

API Reference

{
    words: [
      {
        text: "Smoke",
        start: 240,
        end: 640,
        confidence: 0.70473,
        speaker: null,
      },
      {
        text: "from",
        start: 680,
        end: 968,
        confidence: 0.99967,
        speaker: null,
      },
      {
        text: "hundreds",
        start: 1024,
        end: 1416,
        confidence: 0.99795,
        speaker: null,
      },
      {
        text: "of",
        start: 1448,
        end: 1592,
        confidence: 0.99926,
        speaker: null,
      },
      {
        text: "wildfires",
        start: 1616,
        end: 2248,
        confidence: 0.99838,
        speaker: null,
      },
      {
        text: "in",
        start: 2264,
        end: 2440,
        confidence: 0.99782,
        speaker: null,
      },
      {
        text: "Canada",
        start: 2480,
        end: 2968,
        confidence: 0.99977,
        speaker: null,
      },
    ],
  
}

Additional resources

Is there a way to generate SRT or VTT captions with speaker labels?

Learn how to create caption files that include speaker identification.