Redact PII Entities in a Transcript with Entity Detection

This guide will walk you through using AssemblyAI’s Entity Detection model to redact specific entities from an audio transcription.

While AssemblyAI offers a PII Redaction model for automatic redaction, this method is ideal for scenarios where you need both a redacted and a non-redacted version of the transcript.

We’ll use the AssemblyAI Python SDK to demonstrate this. By the end of this guide, you’ll be able to effectively redact sensitive information from your transcriptions while preserving the original text.

Quickstart

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()

audio_url = (
    "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"
)

# enable Entity Detection for this transcription
config = aai.TranscriptionConfig(entity_detection=True)

transcript = transcriber.transcribe(audio_url, config)

# work on a copy of the text so the original stays available
redacted_transcript = transcript.text

# redact ALL entities
for entity in transcript.entities:
    redacted_transcript = redacted_transcript.replace(entity.text, f"[{entity.entity_type.upper()}]")

# show the first 500 characters of the redacted text
print(redacted_transcript[:500])

Before you begin

To complete this tutorial, you need Python installed and an AssemblyAI API key.

Step-by-Step Guide

Install the AssemblyAI SDK:

pip install assemblyai

Import the assemblyai package and set your API key:

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

Define a Transcriber and a TranscriptionConfig with entity_detection set to True, and then create a transcript.

transcriber = aai.Transcriber()

audio_url = (
    "https://github.com/AssemblyAI-Examples/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3"
)

config = aai.TranscriptionConfig(entity_detection=True)

transcript = transcriber.transcribe(audio_url, config)

To redact all detected entities, iterate through the entities in the transcript and replace their text with their entity type:

redacted_transcript = transcript.text

# redact ALL entities
for entity in transcript.entities:
    redacted_transcript = redacted_transcript.replace(entity.text, f"[{entity.entity_type.upper()}]")

print(redacted_transcript[:500])

Output
Smoke from hundreds of wildfires in [LOCATION] is triggering air quality alerts throughout [LOCATION]. Skylines from [LOCATION] to [LOCATION] to [LOCATION] are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why. So he called [PERSON_NAME], an [OCCUPATION] in the [ORGANIZATION] at [ORGANIZATION]...
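
Because str.replace swaps every occurrence of the entity text, it can also hit longer words that merely contain that text, and a short entity can clip part of a longer one. If that matters for your transcripts, one possible refinement is sketched below. It is not part of the SDK: redact_entities is a hypothetical helper, the Entity namedtuple is a stand-in for the SDK's entity objects (only the text and entity_type attributes used in this guide are assumed), and the sample sentence is made up for illustration.

```python
import re
from collections import namedtuple

# Stand-in for the SDK's entity objects (assumption: only .text and
# .entity_type are needed, as in the loop above).
Entity = namedtuple("Entity", ["text", "entity_type"])


def redact_entities(text, entities):
    """Replace each entity's text with [ENTITY_TYPE], matching whole words only."""
    # Longest entities first, so "New York City" is handled before "New York".
    ordered = sorted(entities, key=lambda e: len(e.text), reverse=True)
    for entity in ordered:
        label = f"[{entity.entity_type.upper()}]"
        # Lookarounds keep the match from starting or ending inside a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(entity.text) + r"(?!\w)")
        # A function replacement inserts the label literally.
        text = pattern.sub(lambda match, label=label: label, text)
    return text


# Made-up sample data, not taken from the example audio file.
sample_text = "Smoke reached Maine. Mainers stayed inside, and so did New York City."
sample_entities = [
    Entity("Maine", "location"),
    Entity("New York", "location"),
    Entity("New York City", "location"),
]

print(redact_entities(sample_text, sample_entities))
# prints: Smoke reached [LOCATION]. Mainers stayed inside, and so did [LOCATION].
```

With the plain replace loop, the same sample would turn "Mainers" into "[LOCATION]rs". Whether whole-word matching is the right trade-off depends on your data, so treat this as a starting point rather than a drop-in replacement.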

If you want to redact only certain types of entities (e.g., locations), filter them using a list of entity types:

# filter for some entities
redacted_transcript2 = transcript.text

pii_policies = ["location"]

for entity in transcript.entities:
    if entity.entity_type in pii_policies:
        redacted_transcript2 = redacted_transcript2.replace(entity.text, f"[{entity.entity_type.upper()}]")

print(redacted_transcript2[:500])

Output
Smoke from hundreds of wildfires in [LOCATION] is triggering air quality alerts throughout [LOCATION]. Skylines from [LOCATION] to [LOCATION] to [LOCATION] are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why. So he called Peter DiCarlo, an associate professor in the department of Environmental Health and Engineering at Johns Hopkins University. Good morning. Professor. Good morning. So
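
Before choosing which types to put in pii_policies, it can help to see which entity types were detected and how often. The sketch below counts them with collections.Counter. It runs on made-up stand-in data so that it works without an API call: the Entity namedtuple and the sample values are assumptions, and plain strings stand in for the SDK's entity types. With a real transcript you would iterate over transcript.entities instead.

```python
from collections import Counter, namedtuple

# Stand-in for the SDK's entity objects (assumption: .text and .entity_type only).
Entity = namedtuple("Entity", ["text", "entity_type"])

# Made-up sample standing in for transcript.entities.
entities = [
    Entity("Maine", "location"),
    Entity("New York", "location"),
    Entity("Peter DiCarlo", "person_name"),
    Entity("associate professor", "occupation"),
]

# Count how many entities of each type were detected.
type_counts = Counter(entity.entity_type for entity in entities)

# List the types from most to least frequent.
for entity_type, count in type_counts.most_common():
    print(f"{entity_type}: {count}")

# Build the filter list from the types you decide to redact.
pii_policies = [t for t in type_counts if t in {"location", "person_name"}]
```

The resulting pii_policies list can then be used in the filtering loop shown above.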

Conclusion

Disclaimer: This method only creates a local redacted copy of the text. If you make a GET request for the transcript again, the text field will remain unredacted.
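
Since the redacted text only exists locally, you may want to store it next to the original so that both versions are available later. Here is a minimal sketch of one way to do that. The save_versions helper and the file names are hypothetical, and the two strings are placeholders for transcript.text and your redacted copy.

```python
import tempfile
from pathlib import Path


def save_versions(original_text, redacted_text, out_dir):
    """Write the original and redacted text to two separate files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file names; pick whatever fits your project.
    original_path = out_dir / "transcript_original.txt"
    redacted_path = out_dir / "transcript_redacted.txt"

    original_path.write_text(original_text, encoding="utf-8")
    redacted_path.write_text(redacted_text, encoding="utf-8")
    return original_path, redacted_path


# Placeholder strings standing in for transcript.text and redacted_transcript.
original = "So he called Peter DiCarlo."
redacted = "So he called [PERSON_NAME]."

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    original_path, redacted_path = save_versions(original, redacted, tmp_dir)
    print(original_path.name, redacted_path.name)
```

In a real project you would pass a permanent directory instead of a temporary one, and restrict access to the unredacted file as your privacy requirements dictate.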

This tutorial demonstrated how to use the AssemblyAI Python SDK to redact sensitive information from your transcriptions using our Entity Detection model. If you have any further questions or need additional assistance, feel free to reach out to the AssemblyAI Support team at support@assemblyai.com!