Our new PII Redaction Policies feature is here!
Introduction
Personally Identifiable Information (PII) is any data that can be used to identify an individual, including any details that reveal who someone is. Examples include:
- Email addresses
- Social Security Numbers
- Credit card numbers
- Account numbers
- Phone numbers
- Birthdays
PII creates security and privacy challenges, especially when regulations like the European Union’s (EU’s) General Data Protection Regulation (GDPR) spell out specific and stringent safeguards for it.
The loss of PII can also cause substantial harm to both businesses and individuals. IBM’s 2020 Cost of a Data Breach Report found that customer data was the most commonly compromised type of record, with 80% of breached organizations saying that customer PII was affected.
With PII now more accessible and shareable through multiple channels, companies have to bolster their security practices to ensure proper handling of their customers' data.
Securing PII Through Redaction
PII redaction is one of the most effective solutions to secure data, as it provides another layer of protection to make sure customer information is hidden. This is especially important when using AI, automated speech recognition, and Speech-to-Text APIs, as there is no human review.
With PII redaction, a phone number like 412-412-4124 would become ###-###-#### in the text, and the corresponding words in the audio would be replaced with a blank sound. We've included some additional examples below.
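To make the text side of this concrete, here is a minimal sketch of masking a phone number so that each digit becomes a # character. It is only an illustration of the output format: the regular expression and the function name are assumptions made for this example, not how the API detects PII.

```python
import re

# Hypothetical pattern for numbers written like 412-412-4124.
# Real PII detection is far more involved than a single regex.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def mask_phone_numbers(text: str) -> str:
    """Replace each digit of a matched phone number with '#', keeping the dashes."""
    return PHONE_PATTERN.sub(lambda m: re.sub(r"\d", "#", m.group()), text)


print(mask_phone_numbers("You can reach me at 412-412-4124 after 5."))
# prints: You can reach me at ###-###-#### after 5.
```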
Conference Call Platforms
Customers often call into Conference Call Platforms and share their email address, credit card number, phone number, and other highly sensitive PII. With PII Redaction, this sensitive data can be automatically detected and redacted, so you can be confident you're not storing or processing any PII from call recordings.
Call Tracking Platforms
Call Tracking Platforms primarily record agent-customer calls for marketing, sales, and support. In many cases, companies using these platforms require verification information from customers including account numbers, email addresses, and phone numbers while making a purchase or getting support. With PII redaction, all of this personal data will be removed from the call recording and the automated transcription.
Telemedicine
When patients visit their doctor, they are likely to share personal details like health insurance policy numbers, group numbers, or account numbers. With automated recording and transcription now being used for notes both in person and over virtual calls, there is a high likelihood that patient medical details could be compromised. PII Redaction can be used to protect patients' medical information by removing it from the audio (or video) recording and the transcript notes.
Hiring Platforms
Common hiring platforms like Applicant Tracking Systems, Video Hiring Software, and even Human Resources Information Systems allow recruitment, HR, and management teams to efficiently manage their candidate pipeline and employee onboarding. These platforms often use call and video recording to make the process more effective. However, this often surfaces candidate and new hire information like emails, phone numbers, and compensation amounts. To help protect this information, PII Redaction will automatically detect and remove candidate and new hire information from the recordings and transcriptions.
Enabling PII Redaction Policies
AssemblyAI enables you to automatically detect and redact Personally Identifiable Information (PII) from the automated transcription produced by our API.
Below is a code sample that shows how easy it is to enable PII Redaction when submitting audio or video files for transcription. You can view code samples in more programming languages in our API Docs.
import requests

# Transcription endpoint
endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: the file to transcribe, with PII redaction enabled
json = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

response = requests.post(endpoint, json=json, headers=headers)
print(response.json())
Specifying Which Types of Data to Redact
To tailor redaction to your application, you can select from a set of redaction policies when PII Redaction is enabled. You can include any or all of these policy names in the redact_pii_policies parameter when making your POST request as shown above.
For the full list of PII policies, see our API docs.
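As a rough sketch, the request body with redact_pii_policies might be built like this. It assumes the parameter takes a list of policy-name strings. The helper name build_transcript_payload is hypothetical, and the two policy names are illustrative placeholders that should be checked against the list in the API docs.

```python
def build_transcript_payload(audio_url, policies=None):
    """Build the JSON body for a transcription request with PII redaction enabled."""
    payload = {"audio_url": audio_url, "redact_pii": True}
    if policies:
        # Assumption: the API expects a list of policy-name strings.
        # dict.fromkeys drops duplicates while preserving order.
        payload["redact_pii_policies"] = list(dict.fromkeys(policies))
    return payload


payload = build_transcript_payload(
    "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    # Illustrative policy names; confirm the exact spelling in the API docs.
    policies=["phone_number", "email_address"],
)
print(payload)
```

The resulting dictionary can be passed as the json argument of requests.post, in the same way as the request body in the sample above.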
Redact PII from Audio
When you request a transcription that has PII redacted, you also have an option to request audio redaction. In that case, we will mute the parts of your audio where PII is spoken, and will make a downloadable URL available for the redacted audio file.
Important Considerations
- The muted portions of the audio will correspond to the timestamps where the PII was detected and replaced with # characters in the transcription text.
- We will store the redacted audio file for 24 hours after your transcription has completed. After this time it will expire, so you'll need to download the file and store it on your own server, in an S3 bucket, or in other storage you control.
The code sample below shows how you can submit an audio or video file for transcription and enable PII Audio Redaction. You can view code samples in more programming languages in our API Docs.
import requests

# Transcription endpoint
endpoint = "https://api.assemblyai.com/v2/transcript"

# Request body: redact PII in both the transcript text and the audio
json = {
    "audio_url": "https://app.assemblyai.com/static/media/phone_demo_clip_1.wav",
    "redact_pii": True,
    "redact_pii_audio": True,
    # optional; receive a webhook when redacted audio is ready
    "webhook_url": "http://myserver.com/receive"
}

# Replace YOUR-API-TOKEN with your own API token
headers = {
    "authorization": "YOUR-API-TOKEN",
    "content-type": "application/json"
}

response = requests.post(endpoint, json=json, headers=headers)
print(response.json())
Get the redacted audio URL
If a webhook_url was provided in your API request, we will send a POST to your webhook_url when the redacted audio is ready. The POST request to your webhook will look like this:
headers
---
content-length: 79
accept-encoding: gzip, deflate
accept: */*
user-agent: python-requests/2.21.0
content-type: application/json
params
---
status: 'redacted_audio_ready'
redacted_audio_url: 'https://link-to-redacted-audio'
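A receiver for this webhook could look roughly like the sketch below, which uses only the Python standard library. It assumes the POST body is a JSON object carrying the status and redacted_audio_url fields shown above. The names extract_redacted_audio_url, RedactedAudioWebhook, and serve are hypothetical, as is the port.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_redacted_audio_url(body: bytes):
    """Return the redacted audio URL from a webhook body, or None if absent."""
    try:
        data = json.loads(body)
    except ValueError:
        # Body was not valid JSON
        return None
    if isinstance(data, dict) and data.get("status") == "redacted_audio_ready":
        return data.get("redacted_audio_url")
    return None


class RedactedAudioWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly as many bytes as the content-length header announces
        length = int(self.headers.get("Content-Length", 0))
        url = extract_redacted_audio_url(self.rfile.read(length))
        if url:
            print("Redacted audio ready:", url)
        # Acknowledge receipt so the sender sees a successful delivery
        self.send_response(200)
        self.end_headers()


def serve(port=8000):
    """Listen for webhook POSTs until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), RedactedAudioWebhook).serve_forever()
```

Calling serve() starts the listener. In practice the handler would download the file from the URL rather than just print it, since the redacted audio expires after 24 hours.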
If you can't receive a webhook, you can also make a GET request to the following endpoint to retrieve a URL for your redacted audio file:
https://api.assemblyai.com/v2/transcript/<your transcript id>/redacted-audio
This will return the following response:
{
    "status": "redacted_audio_ready",
    "redacted_audio_url": "https://link-to-redacted-audio"
}
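If you poll this endpoint instead of using a webhook, a sketch along these lines could wait for the URL and then save the file before it expires. The endpoint, the authorization header, and the response fields come from the examples in this post. What the endpoint returns before the audio is ready is not described here, so the sketch assumes a JSON body and simply retries until the status is redacted_audio_ready. The function names and the attempts, delay, and fetch parameters are hypothetical.

```python
import json
import shutil
import time
import urllib.request

API_BASE = "https://api.assemblyai.com/v2/transcript"


def fetch_json(url, api_token):
    """GET a URL with the authorization header and decode the JSON body."""
    request = urllib.request.Request(url, headers={"authorization": api_token})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_redacted_audio(transcript_id, api_token, fetch=fetch_json,
                            attempts=10, delay=5.0):
    """Poll the redacted-audio endpoint; return the URL, or None if never ready."""
    url = f"{API_BASE}/{transcript_id}/redacted-audio"
    for attempt in range(attempts):
        result = fetch(url, api_token)
        # Assumption: any other status means the audio is not ready yet
        if result.get("status") == "redacted_audio_ready":
            return result["redacted_audio_url"]
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


def download(url, destination):
    """Stream the file at url to a local path and return that path."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination
```

A typical call would be wait_for_redacted_audio("<your transcript id>", "YOUR-API-TOKEN") followed by download(url, "redacted.wav"). HTTP errors are left to propagate so that problems such as a bad token are not silently retried.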
Sources
- https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A32016R0679
- https://www.csoonline.com/article/3215864/how-to-protect-personally-identifiable-information-pii-under-gdpr.html
- https://www.welivesecurity.com/2020/08/12/what-is-cost-data-breach/
- https://www.ibm.com/security/digital-assets/cost-data-breach-report/#/