Generate Transcript Citations using LeMUR

This guide will walk through the process of generating transcript citations using OpenAI embeddings and the LeMUR API.

Overview

Extracting exact quotes from transcripts can be a difficult task for Large Language Models, which makes it challenging to cite sources or identify timestamps for the text they generate.

Embeddings are powerful representations of text that capture its semantic and contextual meaning. By leveraging embeddings, we can transform raw text data, such as transcripts, into dense numerical vectors that encode the underlying information. These embeddings enable us to perform sophisticated tasks such as similarity comparison and contextual searching.
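
As a small illustration of similarity comparison between embedding vectors, here is a sketch using made-up 3-dimensional vectors in place of real 1536-dimensional OpenAI embeddings:

import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of two pieces of text
print(cosine_similarity([0.1, 0.9, 0.2], [0.2, 0.8, 0.3]))  # similar texts -> high score
print(cosine_similarity([0.1, 0.9, 0.2], [0.9, 0.1, 0.0]))  # unrelated texts -> lower score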

In this guide, we demonstrate how to use OpenAI embeddings to retrieve transcript citations that corroborate results from the LeMUR API. LeMUR is proficient at providing the ‘what’ and ‘why’; embeddings provide the ‘where’ and ‘when’.

We’ll walk through three use cases: verifying sources for specific answers, timestamping action items, and generating customer quotes.

Get Started

Before we begin, make sure you have an AssemblyAI account and an API key. You can sign up for an account and get your API key from your dashboard. You will also need an OpenAI API token.
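
If you prefer not to hard-code keys the way the snippets below do, one option is to read them from environment variables; the variable names here are only examples:

import os
import assemblyai as aai
import openai

# Example environment variable names; use whatever secret management you already have
aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
openai.api_key = os.environ["OPENAI_API_KEY"]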

LeMUR features are currently only available to paid users. See pricing for more details.

Instructions

Install the libraries required for transcription and embedding creation.

$ pip install numpy scikit-learn openai assemblyai tiktoken

Submit Files for Transcription

import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"
transcriber = aai.Transcriber()

def transcribe(urls):
    # Transcribe one or more audio files and return them as a TranscriptGroup
    return transcriber.transcribe_group(urls)
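
As a hypothetical usage example (the URL below is a placeholder), you might want to confirm that each transcript completed before embedding it; this sketch assumes the SDK's TranscriptStatus enum and error attribute:

# Hypothetical audio URL; replace with your own files
transcript_group = transcribe(["https://example.com/meeting_recording.mp3"])

for transcript in transcript_group:
    if transcript.status == aai.TranscriptStatus.error:
        print(f"Transcript {transcript.id} failed: {transcript.error}")
    else:
        print(f"Transcript {transcript.id} completed")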

Create Transcript Embeddings

We are using the text-embedding-ada-002 model to generate our embeddings. Note that the snippets below call openai.Embedding.create, which is part of the pre-1.0 OpenAI Python SDK; if you have openai 1.0 or later installed, the embedding call will need to be adapted.

The pricing for this model is $0.0001 / 1k tokens, which equates to roughly $0.0015 to embed one hour of audio.
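
As a rough sanity check on that figure, here is the arithmetic, assuming about 15,000 tokens per hour of spoken audio (an approximation, not a measured value):

# Back-of-the-envelope estimate; tokens_per_hour is an assumption, not a measured value
tokens_per_hour = 15_000
price_per_1k_tokens = 0.0001  # USD for text-embedding-ada-002
print(f"~${tokens_per_hour / 1000 * price_per_1k_tokens:.4f} to embed one hour of audio")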

import numpy as np
from sklearn.neighbors import NearestNeighbors
import openai

# Set up OpenAI API key
openai.api_key = "YOUR_OPENAI_TOKEN"

def embed_block(block_text):
    # Embed the block of text using OpenAI embeddings
    embedding = openai.Embedding.create(
        input=block_text,
        model='text-embedding-ada-002',
    ).to_dict()['data'][0]['embedding']

    # Return the embedding vector for this block
    return embedding

def find_relevant_matches(embedded_blocks, new_block_text, k=3):
    matches = []
    # Embed the new block of text using OpenAI embeddings
    new_embedding = embed_block(new_block_text)

    # Prepare the embeddings for the KNN search
    embeddings = np.array(list(embedded_blocks.values()))
    metadata = list(embedded_blocks.keys())

    # Perform KNN search to find the most relevant matches
    knn = NearestNeighbors(n_neighbors=k)
    knn.fit(embeddings)
    distances, indices = knn.kneighbors([new_embedding])

    # Collect the relevant matches along with their metadata
    for distance, index in zip(distances[0], indices[0]):
        result_metadata = metadata[index]
        matches.append({
            'timestamp': result_metadata[0],
            'transcript_id': result_metadata[1],
            'text': result_metadata[2],
            'confidence': 1 - distance
        })
    return matches
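
To see what the nearest-neighbor lookup is doing, here is a self-contained toy run that skips the OpenAI call and uses made-up 3-dimensional vectors in place of real 1536-dimensional ada-002 embeddings (every value below is fabricated for illustration):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for (start_ms, transcript_id, text) -> embedding vector
toy_blocks = {
    (1200, "t1", "We discussed the quarterly budget."): [0.1, 0.9, 0.2],
    (5400, "t1", "The team agreed to ship next week."): [0.8, 0.2, 0.1],
}
query_vector = [0.7, 0.3, 0.1]  # pretend this is the embedded query text

knn = NearestNeighbors(n_neighbors=1)
knn.fit(np.array(list(toy_blocks.values())))
distances, indices = knn.kneighbors([query_vector])

# Recover the metadata tuple for the closest block
start_ms, transcript_id, text = list(toy_blocks.keys())[indices[0][0]]
print(start_ms, transcript_id, text)  # -> 5400 t1 The team agreed to ship next week.

Note that the confidence score returned by find_relevant_matches is simply 1 minus the Euclidean distance reported by NearestNeighbors, so treat it as a relative ranking signal rather than a calibrated probability.
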
def create_transcripts_embeddings(transcripts, granularity='paragraph'):
    # Dictionary mapping (start timestamp, transcript ID, text) to its embedding
    embeddings = {}
    total_tokens_embedded = 0

    for transcript in transcripts:
        if granularity == 'sentence':
            sentences = transcript.get_sentences()
            for sentence in sentences:
                total_tokens_embedded += num_tokens_from_string(sentence.text, 'r50k_base')
                embeddings[(sentence.start, transcript.id, sentence.text)] = embed_block(sentence.text)
        else:
            paragraphs = transcript.get_paragraphs()
            for paragraph in paragraphs:
                total_tokens_embedded += num_tokens_from_string(paragraph.text, 'r50k_base')
                embeddings[(paragraph.start, transcript.id, paragraph.text)] = embed_block(paragraph.text)

    print(total_tokens_embedded, 'TOKENS EMBEDDED')
    print('COST OF EMBEDDINGS: $', (total_tokens_embedded / 1000) * 0.0001)
    print()
    return embeddings

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
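
One thing to note: the snippets above count tokens with the r50k_base encoding, while text-embedding-ada-002 is actually tokenized with cl100k_base. For plain English text the two counts are usually close, so the cost estimate stays in the right ballpark, but you can compare them directly:

# Compare the approximate count used above with ada-002's actual encoding
sample = "Embeddings turn transcripts into dense numerical vectors."
print(num_tokens_from_string(sample, "r50k_base"))
print(num_tokens_from_string(sample, "cl100k_base"))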

Examples

Cite Answers to Specific Questions

Cite the sources for specific answers returned from the LeMUR Q&A API.

questions = [
    aai.LemurQuestion(question="how does calcium relate to adheren junctions?", context='', answer_format="")
]

import json, datetime

def get_citations(lemur_output):
    matches = find_relevant_matches(embeddings, lemur_output)

    print('CITATIONS:')
    for index, m in enumerate(matches):
        print('#{}'.format(index + 1))
        print('QUOTE: "{}"'.format(m['text']))
        print('TRANSCRIPT ID:', m['transcript_id'])
        print('START TIMESTAMP:', str(datetime.timedelta(seconds=m['timestamp'] / 1000)))
        print('CONFIDENCE SCORE:', m['confidence'])
        print()

transcripts = transcribe([
    '',  # TODO ADD URLS
])

embeddings = create_transcripts_embeddings(transcripts)

qa_results = transcripts.lemur.question(questions).response

print(f"Question: {qa_results[0].question}")
print(f"Answer: {qa_results[0].answer}")
print()
get_citations(qa_results[0].question + ' ' + qa_results[0].answer)

Example output:

Question: how does calcium relate to adheren junctions?
Answer: Adheren junctions are calcium dependent, meaning that if calcium is removed, the cells will fall apart as the junctions disassemble.
CITATIONS:
#1
QUOTE: "If you were to put in some kind of calcium chelator like EDTA that removed the calcium from the media or the extracellular fluid in these cells, these cells would actually fall apart and these junctions would fall apart. On the cytoplasmic side, you have a number of linker proteins. Again, they've been named in this diagram, catinin Vinculin and Alpha Actinin are involved, and again, they link up to the actin filaments."
TRANSCRIPT ID: 6yb0ijyfl0-14c4-4bc1-96f2-ff029bb7e630
START TIMESTAMP: 0:16:57.786000
CONFIDENCE SCORE: 0.5054862317894155
#2
QUOTE: "If you have two cells that are attached to one another and they're undergoing physical forces, what keeps them from coming apart is these adhering junctions. There are forces you can imagine in your intestine that would rub against the epithelia as material passes through. And these adhering junctions keep those epithelia from coming apart and exposing the connected tissue below. Gap junctions are found in most cells and their real function is actually as a pore or channel that lies between two adjacent cells. And these allow for small molecules to pass ions to pass through and they're controlled pores."
TRANSCRIPT ID: 6yb0ijyfl0-14c4-4bc1-96f2-ff029bb7e630
START TIMESTAMP: 0:08:28.270000
CONFIDENCE SCORE: 0.4346457600791962
#3
QUOTE: "And we take a look at a schematic showing the intracellular surface of one cell and the intracellular surface of another. Here are the plasma membranes of one and cell two, and then the space in between. The space is about 20. Unlike the cludence junction, there is a space, and that space contains transmembrane proteins called caherons, which match up with their homolog on the adjacent cell, and they hold the two cells together. These adherence junctions are calcium dependent."
TRANSCRIPT ID: 6yb0ijyfl0-14c4-4bc1-96f2-ff029bb7e630
START TIMESTAMP: 0:16:23.290000
CONFIDENCE SCORE: 0.4293010412511604

Provide References to Multiple Transcripts

When analyzing multiple transcripts, it can be helpful to have references that show which transcript each answer came from.

questions = [
    aai.LemurQuestion(question="Identify pain points discussed across all user interviews", context='', answer_format="""
    [
        "pain point 1",
        "pain point 2",
        "pain point 3"
    ]
    """)
]

import json, datetime

def get_examples(lemur_output):
    matches = find_relevant_matches(embeddings, lemur_output, k=5)

    print('EXAMPLES:')
    for index, m in enumerate(matches):
        print('#{}'.format(index + 1))
        print('QUOTE: "{}"'.format(m['text']))
        print('TRANSCRIPT ID:', m['transcript_id'])
        print('START TIMESTAMP:', str(datetime.timedelta(seconds=m['timestamp'] / 1000)))
        print('CONFIDENCE SCORE:', m['confidence'])
        print()
    return matches

transcripts = transcribe([
    '',  # TODO ADD URLS
])

embeddings = create_transcripts_embeddings(transcripts, granularity='sentence')

qa_results = transcripts.lemur.question(questions).response

print(f"Question: {qa_results[0].question}")
print(f"Answer: {qa_results[0].answer}")
print()

pain_point_array = json.loads(qa_results[0].answer.strip())
for pp in pain_point_array:
    print('Pain Point:', pp)
    get_examples(pp)

Example output:

Question: Identify pain points discussed across all user interviews
Answer: [
"Communication challenges due to the remote nature of the team",
"Lack of alignment across different tools and preferences",
"Losing key documents and data points when employees leave the company"
]
Pain Point: Communication challenges due to the remote nature of the team
EXAMPLES:
#1
QUOTE: "And so sometimes that comes with its communication challenges and people working in different time zones."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:03:03.930000
CONFIDENCE SCORE: 0.5171660164618275
#2
QUOTE: "We are remote, so virtual whiteboard."
TRANSCRIPT ID: 6ybb50tnns-785d-4361-ad15-b0ef5db075e6
START TIMESTAMP: 0:04:11.792000
CONFIDENCE SCORE: 0.44395501749808775
#3
QUOTE: "So I think that has been some of the larger challenges with our team."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:04:19.364000
CONFIDENCE SCORE: 0.43930383535763384
#4
QUOTE: "No, I think that I would sum it up in a way that I would say just essentially that I had already mentioned that communication and one single spot for us to collaborate and communicate in is already a challenge."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:25:35.388000
CONFIDENCE SCORE: 0.43763248883808603
#5
QUOTE: "How would you collaborate with your team?"
TRANSCRIPT ID: 6ybb50cj8c-877e-4314-9547-3e1450cf08f7
START TIMESTAMP: 0:20:20.196000
CONFIDENCE SCORE: 0.4267756277366459
Pain Point: Lack of alignment across different tools and preferences
EXAMPLES:
#1
QUOTE: "What creates a challenge is that there are people that are more proficient or more comfortable working within certain apps than others are."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:03:39.838000
CONFIDENCE SCORE: 0.42358740588333854
#2
QUOTE: "Again, I think it's between tools that we use all the way to just the different time zones in everyone's schedule that sometimes it causes a delay in projects getting done or slow up or bottlenecks."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:03:11.950000
CONFIDENCE SCORE: 0.4154978400007534
#3
QUOTE: "And so if we're working all from one tool, I think it's more clear because I don't think people also revert back to their Google Drive often."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:24:56.418000
CONFIDENCE SCORE: 0.3742980144030581
#4
QUOTE: "Some people are communicating with a preference of slack first, while others are email first."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:04:14.762000
CONFIDENCE SCORE: 0.3704920121143226
#5
QUOTE: "And so sometimes that comes with its communication challenges and people working in different time zones."
TRANSCRIPT ID: 6ybb50o8da-9dda-429b-b576-3b61c98a4fc3
START TIMESTAMP: 0:03:03.930000
CONFIDENCE SCORE: 0.3631327779909689

Identify Timestamps For Action Items

Quickly jump to the part of the meeting where the action item was discussed.

action_item_answer_format = """[{
    "action_item": <action item>,
    "assignee": <assignee>,
    "quote": "<leave blank>",
    "timestamp": "<leave blank>"
}]
"""

action_item_context = ''

import json, datetime

def timestamp_action_items(action_items_array):
    # Attach the closest matching quote and its timestamp to each action item
    for action_item in action_items_array:
        matches = find_relevant_matches(embeddings, action_item['action_item'], k=1)
        for index, m in enumerate(matches):
            action_item['quote'] = m['text']
            action_item['timestamp'] = m['timestamp']
    return action_items_array

transcripts = transcribe([
    '',  # TODO add file URLs here
])

# TODO: choose granularity, either 'sentence' or 'paragraph'
embeddings = create_transcripts_embeddings(transcripts, 'paragraph')

action_item_results = transcripts.lemur.action_items(
    context=action_item_context,
    answer_format=action_item_answer_format).response

# Remove the preamble from the LeMUR response
action_item_results = action_item_results.replace('Here are action items based on the transcript:', '')

action_item_json_array = json.loads(action_item_results.strip())
action_item_json_array = timestamp_action_items(action_item_json_array)
print(json.dumps(action_item_json_array, indent=4))

Example output:

552 TOKENS EMBEDDED
COST OF EMBEDDINGS: $ 5.520000000000001e-05
[
    {
        "action_item": "Schedule a follow up call with Daniel to continue the conversation.",
        "assignee": "Rich",
        "quote": "I tell you what. I'll let you jump on that call. No sweat at all. I understand. I'll drop you a mail and we'll find a time to talk next week.",
        "timestamp": 237070
    },
    {
        "action_item": "Send an email to Daniel with availability for a call next week.",
        "assignee": "Rich",
        "quote": "I tell you what. I'll let you jump on that call. No sweat at all. I understand. I'll drop you a mail and we'll find a time to talk next week.",
        "timestamp": 237070
    },
    {
        "action_item": "Review financials and metrics for Avail to prepare for the follow up call.",
        "assignee": "Rich",
        "quote": "We are in the health and wellness space, so our space heated up extremely fast, and I found myself working 40 hours in one job and the other. So I had to make a decision. And I think there's a little bit more potential and upside with avail. Great. And so you've seen just an increased demand for your product then, over the past six weeks?",
        "timestamp": 138816
    }
]
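
If you prefer human-readable timestamps in the final JSON, one option is to convert the millisecond values the same way get_citations does above; this is just a sketch that post-processes the array from the previous step:

import json, datetime

def humanize_timestamps(action_items):
    # Convert each millisecond timestamp into an H:MM:SS string for display
    for action_item in action_items:
        action_item['timestamp'] = str(datetime.timedelta(seconds=action_item['timestamp'] / 1000))
    return action_items

print(json.dumps(humanize_timestamps(action_item_json_array), indent=4))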