Extract Quotes with Timestamps Using LeMUR + Semantic Search

This Colab demonstrates how to use AssemblyAI's LeMUR (Leveraging Large Language Models to Understand Recognized Speech) framework to process an audio file and surface its best quotes through Semantic Search.

Quickstart

import datetime
import numpy as np
import assemblyai as aai
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer

aai.settings.api_key = "YOUR_API_KEY"
transcriber = aai.Transcriber()

transcript = transcriber.transcribe("URL_OR_FILE_PATH_HERE")

embedder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

embeddings = {}
sentences = transcript.get_sentences()

def sliding_window(elements, distance, stride):
    idx = 0
    results = []
    # Advance by (distance - stride) so consecutive windows overlap by `stride` elements.
    while idx < len(elements):
        results.append(elements[idx:idx + distance])
        if idx + distance >= len(elements):
            break  # Keep the final, possibly shorter window so the transcript tail isn't dropped.
        idx += (distance - stride)
    return results

# Sliding window to determine length of sentence groups. Tune based on desired quote length and duration.
sentence_groups = sliding_window(sentences, 5, 2)

for sentence_group in sentence_groups:
    sentence = {
        "text": " ".join([sentence.text for sentence in sentence_group]),
        "start": sentence_group[0].start,
        "end": sentence_group[-1].end,
    }
    embeddings[(sentence["start"], sentence["end"], transcript.id, sentence["text"])] = embedder.encode(sentence["text"])

questions = [
    aai.LemurQuestion(
        question="What are the 3 best quotes from this video?",
        context="Please provide exactly 3 quotes.",
    )
]

qa_results = transcript.lemur.question(questions, final_model=aai.LemurModel.claude3_5_sonnet).response

# Embed the output from LeMUR for use in our comparison.
lemur_embedding = embedder.encode(qa_results[0].answer)

# Vectorize our initial transcript embeddings.
np_embeddings = np.array(list(embeddings.values()))
metadata = list(embeddings.keys())

# Find the top 3 most similar quotes to what LeMUR surfaced.
knn = NearestNeighbors(n_neighbors=3, metric="cosine")
knn.fit(np_embeddings)
distances, indices = knn.kneighbors([lemur_embedding])

matches = []
for distance, index in zip(distances[0], indices[0]):
    result_metadata = metadata[index]
    matches.append(
        {
            "start_timestamp": result_metadata[0],
            "end_timestamp": result_metadata[1],
            "transcript_id": result_metadata[2],
            "text": result_metadata[3],
            "confidence": 1 - distance,
        }
    )

for index, m in enumerate(matches):
    print('QUOTE #{}: "{}"'.format(index + 1, m['text']))
    print('START TIMESTAMP:', str(datetime.timedelta(seconds=m['start_timestamp']/1000)))
    print('END TIMESTAMP:', str(datetime.timedelta(seconds=m['end_timestamp']/1000)))
    print('CONFIDENCE:', m['confidence'])
    print()

Getting Started

Before we begin, make sure you have an AssemblyAI account and an API key. You can sign up for an AssemblyAI account and get your API key from your dashboard.

You’ll also need to install a few libraries that this code depends on:

pip install -U assemblyai numpy scikit-learn sentence-transformers

Step-by-Step Instructions

Then we’ll import all of these libraries and set our AssemblyAI API key.

import datetime
import numpy as np
import assemblyai as aai
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer

aai.settings.api_key = "API_KEY_HERE"

Next, we’ll use AssemblyAI to transcribe a file and save our transcript for later use.

transcriber = aai.Transcriber()

transcript = transcriber.transcribe("URL_OR_FILE_PATH_HERE")
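Transcription can fail (for example, if the file URL isn't reachable), so it's worth checking the transcript's status before continuing. Below is a minimal guard, assuming the Python SDK's TranscriptStatus enum and error field:

# Optional: stop early if the transcription didn't succeed.
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(f"Transcription failed: {transcript.error}")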

Now we can iterate over the sentences in our transcript and create embeddings for them to use as part of our Semantic Search later.

We'll be relying on Sentence Transformers' multi-qa-mpnet-base-dot-v1 model, which has been fine-tuned specifically for Semantic Search and is their highest-performing model for this task.
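If you want to confirm the model loads correctly, a one-off encode makes a quick smoke test; this mpnet-base model should produce 768-dimensional vectors:

# Quick smoke test: encode a string and check the embedding shape.
vec = SentenceTransformer("multi-qa-mpnet-base-dot-v1").encode("smoke test")
print(vec.shape)  # Expected: (768,)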

We'll also implement a sliding window, which groups sentences into overlapping combinations so each group retains semantic meaning and context while letting us control the length (and thus duration) of the quotes. By default, we'll group 5 sentences per window, with 2 sentences of overlap each time the window moves; the helper also keeps a final, shorter window so the end of the transcript isn't dropped (a toy example after the code shows the resulting windows). This should cap quotes at roughly 30 seconds.

embedder = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

embeddings = {}
sentences = transcript.get_sentences()

def sliding_window(elements, distance, stride):
    idx = 0
    results = []
    # Advance by (distance - stride) so consecutive windows overlap by `stride` elements.
    while idx < len(elements):
        results.append(elements[idx:idx + distance])
        if idx + distance >= len(elements):
            break  # Keep the final, possibly shorter window so the transcript tail isn't dropped.
        idx += (distance - stride)
    return results

# Sliding window to determine length of sentence groups. Tune based on desired quote length and duration.
sentence_groups = sliding_window(sentences, 5, 2)

for sentence_group in sentence_groups:
    sentence = {
        "text": " ".join([sentence.text for sentence in sentence_group]),
        "start": sentence_group[0].start,
        "end": sentence_group[-1].end,
    }
    # Key on (start, end, transcript ID, text) so each window's metadata travels with its embedding.
    embeddings[(sentence["start"], sentence["end"], transcript.id, sentence["text"])] = embedder.encode(sentence["text"])
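To make the windowing concrete, here is what sliding_window returns for a toy list of 12 numbered "sentences" using the default window of 5 and overlap of 2 (purely illustrative input):

# Windows of 5 with an overlap of 2; the last window is shorter but keeps the tail.
print(sliding_window(list(range(12)), 5, 2))
# [[0, 1, 2, 3, 4], [3, 4, 5, 6, 7], [6, 7, 8, 9, 10], [9, 10, 11]]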

Now we can query LeMUR for the kind of quotes we want. In this case, let's prompt LeMUR to find the 3 best quotes from the video we transcribed.

questions = [
    aai.LemurQuestion(
        question="What are the 3 best quotes from this video?",
        context="Please provide exactly 3 quotes.",
    )
]

qa_results = transcript.lemur.question(questions, final_model=aai.LemurModel.claude3_5_sonnet).response
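LeMUR returns its answer as free-form text, so it can help to print the raw response once to see how the quotes come back; the exact formatting may vary from run to run:

# Inspect LeMUR's raw answer before embedding it.
print(qa_results[0].answer)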

Now we can take the embeddings of the transcript windows, along with the embedding of LeMUR's output, and run a k-nearest neighbors search to measure their similarity. The windows most similar to what LeMUR identified are surfaced as our 3 best quotes, along with their timestamps and confidence scores.

We'll be relying on the cosine metric rather than the default Euclidean distance since it compares only the direction of the vectors and ignores their magnitude, which makes it better suited to comparing embeddings of texts with differing lengths.
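Since scikit-learn's cosine metric reports distance, the code below converts it back to a similarity-style confidence with 1 - distance. Here is a quick self-contained check of that relationship on toy vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 7.0]])
print(cosine_similarity(a, b)[0][0])      # ~0.997
print(1 - cosine_distances(a, b)[0][0])   # Same value: distance = 1 - similarity.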

# Embed the output from LeMUR for use in our comparison.
lemur_embedding = embedder.encode(qa_results[0].answer)

# Vectorize our initial transcript embeddings.
np_embeddings = np.array(list(embeddings.values()))
metadata = list(embeddings.keys())

# Find the top 3 most similar quotes to what LeMUR surfaced.
knn = NearestNeighbors(n_neighbors=3, metric="cosine")
knn.fit(np_embeddings)
distances, indices = knn.kneighbors([lemur_embedding])

matches = []
for distance, index in zip(distances[0], indices[0]):
    result_metadata = metadata[index]
    matches.append(
        {
            "start_timestamp": result_metadata[0],
            "end_timestamp": result_metadata[1],
            "transcript_id": result_metadata[2],
            "text": result_metadata[3],
            "confidence": 1 - distance,
        }
    )

for index, m in enumerate(matches):
    print('QUOTE #{}: "{}"'.format(index + 1, m['text']))
    print('START TIMESTAMP:', str(datetime.timedelta(seconds=m['start_timestamp']/1000)))
    print('END TIMESTAMP:', str(datetime.timedelta(seconds=m['end_timestamp']/1000)))
    print('CONFIDENCE:', m['confidence'])
    print()