PII Redaction | AssemblyAI

Supported languages

Global Englishen

Australian Englishen_au

British Englishen_uk

US Englishen_us

Spanishes

Frenchfr

Germande

Italianit

Portuguesept

Dutchnl

Hindihi

Japaneseja

Chinesezh

Finnishfi

Koreanko

Polishpl

Russianru

Turkishtr

Ukrainianuk

Vietnamesevi

Afrikaansaf

Arabicar

Belarusianbe

Bulgarianbg

Burmesemy

Catalanca

Croatianhr

Czechcs

Danishda

Estonianet

Georgianka

Greekel

Hebrewhe

Hungarianhu

Icelandicis

Indonesianid

Khmerkm

Latvianlv

Lithuanianlt

Luxembourgishlb

Malayms

Norwegianno

Persianfa

Romanianro

Slovaksk

Sloveniansl

Swahilisw

Swedishsv

Tagalogtl

Tamilta

The PII Redaction model lets you minimize sensitive information about individuals by automatically identifying and removing it from your transcript.

Personal Identifiable Information (PII) is any information that can be used to identify a person, such as a name, email address, or phone number.

When you enable the PII Redaction model, your transcript will look like this:

With hash substitution: Hi, my name is ####!
With entity_name substitution: Hi, my name is [PERSON_NAME]!

You can also Create redacted audio files to replace sensitive information with a beeping sound.

Redacted Properties

PII only redacts words in the text property. Properties from other features may still include PII, such as entities from Entity Detection or summary from Summarization.

Quickstart

Python SDK

Python

JavaScript SDK

JavaScript

C#

Ruby

PHP

Enable PII Redaction on the TranscriptionConfig using the set_redact_pii() method.

Set policies to specify the information you want to redact. For the full list of policies, see PII policies.

1 import assemblyai as aai
2 
3 aai.settings.api_key = "<YOUR_API_KEY>"
4 
5 # audio_file = "./local_file.mp3"
6 audio_file = "https://assembly.ai/wildfires.mp3"
7 
8 config = aai.TranscriptionConfig().set_redact_pii(
9     policies=[
10         aai.PIIRedactionPolicy.person_name,
11         aai.PIIRedactionPolicy.organization,
12         aai.PIIRedactionPolicy.occupation,
13     ],
14     substitution=aai.PIISubstitutionPolicy.hash,
15 )
16 
17 transcript = aai.Transcriber().transcribe(audio_file, config)
18 print(f"Transcript ID:", transcript.id)
19 
20 print(transcript.text)

Example output

1 Smoke from hundreds of wildfires in Canada is triggering air quality alerts
2 throughout the US. Skylines from Maine to Maryland to Minnesota are gray and
3 smoggy. And in some places, the air quality warnings include the warning to stay
4 inside. We wanted to better understand what's happening here and why, so we
5 called ##### #######, an ######### ######### in the ########## ## #############
6 ###### ### ########### at ##### ####### ##########. Good morning, #########.
7 Good morning. So what is it about the conditions right now that have caused this
8 round of wildfires to affect so many people so far away? Well, there's a couple
9 of things. The season has been pretty dry already, and then the fact that we're
10 getting hit in the US. Is because there's a couple of weather systems that ...

PII Redaction Using LeMUR

If you would like the option to use LeMUR for custom PII redaction, check out this guide Redact PII from Text Using LeMUR.

Create redacted audio files

In addition to redacting sensitive information from the transcription text, you can also generate a version of the original audio file with the PII “beeped” out.

Redacted Audio Endpoint

Retrieve the redacted audio file using the transcript_id for a transcript where redact_pii_audio was enabled during submission:

GET

/v2/transcript/:transcript_id/redacted-audio

1 curl https://api.assemblyai.com/v2/transcript/transcript_id/redacted-audio \
2      -H "Authorization: <apiKey>"

Note: Redacted audio creation is not supported for the EU region. For more information refer to our API Reference

Webhooks and PII Audio Redaction

If using webhooks with PII audio redaction enabled, you’ll receive two webhook calls: the first when the redacted audio file is ready and the second when the request for transcription is completed. For more information about using webhooks, refer to our webhook documentation.

Python SDK

Python

JavaScript SDK

JavaScript

C#

Ruby

PHP

To create a redacted version of the audio file, use the set_redact_pii() method on the TranscriptionConfig with redact_audio to True.

Use get_redacted_audio_url() on the transcript to get the URL to the redacted audio file.

1 config = aai.TranscriptionConfig().set_redact_pii(
2     policies=[
3         aai.PIIRedactionPolicy.person_name,
4         aai.PIIRedactionPolicy.organization,
5         aai.PIIRedactionPolicy.occupation,
6     ],
7     redact_audio=True
8 )
9 
10 transcript = aai.Transcriber().transcribe(audio_url, config)
11 
12 print(transcript.get_redacted_audio_url())

Supported Languages

You can only create redacted audio files for transcriptions in English and Spanish.

Maximum Audio File Size

You can only create redacted versions of audio files if the original file is smaller than 1 GB.

Redacted Audio for Silent Files

By default, audio redaction provides redacted audio URLs only when speech is detected. However, if your use-case specifically requires redacted audio files even for silent audio files without any dialogue, you can now opt to receive these URLs. Enable this by setting the optional parameter "return_redacted_no_speech_audio": true within redact_pii_audio_options in your POST request body.

Example request body

1 {
2     "audio_url": "YOUR_AUDIO_URL",
3     "redact_pii": true,
4     "redact_pii_audio": true,
5     "redact_pii_audio_options": {
6           "return_redacted_no_speech_audio": true
7      },
8      "redact_pii_policies": ["credit_card_number"]
9 }

Example output

1 https://s3.us-west-2.amazonaws.com/api.assembly.ai.usw2/redacted-audio/ac06721c-d1ea-41a7-95f7-a9463421e6b1.mp3?AWSAccessKeyId=...

API reference

Request

$ curl https://api.assemblyai.com/v2/transcript \
> --header "Authorization: <YOUR_API_KEY>" \
> --header "Content-Type: application/json" \
> --data '{
>   "audio_url": "YOUR_AUDIO_URL",
>   "redact_pii": true,
>   "redact_pii_policies": ["us_social_security_number", "credit_card_number"],
>   "redact_pii_sub": "hash",
>   "redact_pii_audio": true,
>   "redact_pii_audio_quality": "mp3"
> }'

Key	Type	Description
`redact_pii`	boolean	Enable PII Redaction.
`redact_pii_policies`	array	PII policies for what information to redact.
`redact_pii_sub`	string	Method used to substitute PII in the transcript. Can be `entity_name` or `hash`.
`redact_pii_audio`	boolean	Create a redacted version of the audio file.
`redact_pii_audio_quality`	string	Quality of the redacted PII audio file. Can be `mp3` or `wav`.

Response

Key	Type	Description
`text`	string	Transcript with redacted PII.

The response also includes the request parameters used to generate the transcript.

Request for Redacted Audio

In the request URL, replace transcript_id with the ID of the transcript where redact_pii_audio is set to true.

$ curl https://api.assemblyai.com/v2/transcript/transcript_id/redacted-audio \
> --header "Authorization: <YOUR_API_KEY>"

Response for Redacted Audio

Key	Type	Description
`status`	string	The status of the redacted audio.
`redacted_audio_url`	string	The URL of the redacted audio file.

PII policies

Policy name	Description	Example
`account_number`	Customer account or membership identification number	`Policy No. 10042992; Member ID: HZ-5235-001`
`banking_information`	Banking information, including account and routing numbers
`blood_type`	Blood type	O-, AB positive
`credit_card_cvv`	Credit card verification code	CVV: 080
`credit_card_expiration`	Expiration date of a credit card
`credit_card_number`	Credit card number
`date`	Specific calendar date	December 18
`date_of_birth`	Date of birth	Date of Birth: March 7, 1961
`drivers_license`	Driver’s license number	DL# 356933-540
`drug`	Medications, vitamins, or supplements	Advil, Acetaminophen, Panadol
`email_address`	Email address	support@assemblyai.com
`event`	Name of an event or holiday	Olympics, Yom Kippur
`gender_sexuality`	Terms indicating gender identity or sexual orientation, including slang terms	female, bisexual, trans
`healthcare_number`	Healthcare numbers and health plan beneficiary numbers	Policy No.: 5584-486-674-YM
`injury`	Bodily injury	I broke my arm, I have a sprained wrist
`ip_address`	Internet IP address, including IPv4 and IPv6 formats	192.168.0.1
`language`	Name of a natural language	Spanish, French
`location`	Any Location reference including mailing address, postal code, city, state, province, country, or coordinates.	Lake Victoria, 145 Windsor St., 90210
`medical_condition`	Name of a medical condition, disease, syndrome, deficit, or disorder	chronic fatigue syndrome, arrhythmia, depression
`medical_process`	Medical process, including treatments, procedures, and tests	heart surgery, CT scan
`money_amount`	Name and/or amount of currency	15 pesos, $94.50
`nationality`	Terms indicating nationality, ethnicity, or race	American, Asian, Caucasian
`number_sequence`	Numerical PII (including alphanumeric strings) that doesn’t fall under other categories
`occupation`	Job title or profession	professor, actors, engineer, CPA
`organization`	Name of an organization	CNN, McDonalds, University of Alaska, Northwest General Hospital
`passport_number`	Passport numbers, issued by any country	PA4568332, NU3C6L86S12
`password`	Account passwords, PINs, access keys, or verification answers	27%alfalfa, temp1234, My mother’s maiden name is Smith
`person_age`	Number associated with an age	27, 75
`person_name`	Name of a person	Bob, Doug Jones, Dr. Kay Martinez, MD
`phone_number`	Telephone or fax number
`political_affiliation`	Terms referring to a political party, movement, or ideology	Republican, Liberal
`religion`	Terms indicating religious affiliation	Hindu, Catholic
`url`	Internet addresses	https://www.assemblyai.com/
`us_social_security_number`	Social Security Number or equivalent
`username`	Usernames, login names, or handles	@AssemblyAI
`vehicle_id`	Vehicle identification numbers (VINs), vehicle serial numbers, and license plate numbers	5FNRL38918B111818, BIF7547

Troubleshooting

Why is the PII not redacted in my transcription?

Make sure that at least one PII policy has been specified in your request, using the redact_pii_policies parameter. If you’re still experiencing issues, please reach out to our support team for assistance.

Why is my webhook not being sent?

There could be several reasons why your webhook isn’t being sent, such as a misconfigured URL, an unreachable endpoint, or an issue with the authentication headers. Double-check your request and ensure that the webhook_url parameter is included with a valid URL that can be reached by AssemblyAI’s API. If you’re using custom authentication headers, ensure that the webhook_auth_header_name and webhook_auth_header_value parameters are included and are correct. If you’re still having issues, please contact our support team for assistance.

Why does my redacted audio file sound worse than the original?

By default, the API returns redacted audio files in MP3 format, a lossy format. Lossy formats remove audio information to reduce file size, which may cause a reduction in quality. The difference may be particularly noticeable if the submitted audio is in a lossless file format. To retain as much quality as possible, you can instead return your redacted audio files in a lossless format, by setting redact_pii_audio_quality to wav.