July 18, 2024

Announcing New Language Support for PII Text Redaction and Expanding Entity Detection

We're announcing support for 47 new languages and 16 new entities to make our Audio Intelligence more powerful and globally accessible.

PII Redaction

JD Prater

Head of Product Marketing

JD Prater

Head of Product Marketing

Table of contents

[Visible on live site]

Get $50 in credits

At Assembly, we're focused on ensuring that you can extract maximum value and insights from voice data, while also keeping privacy and security at the forefront to keep you and your end users safe. Today, we're announcing updates to our PII Text Redaction and Entity Detection features to give you more power and control in protecting sensitive information.

What's New

PII Text Redaction now available in 47 additional languages
16 new entity types added to Entity Detection for a total of 44 types available

PII Redaction: Safeguarding Sensitive Information Across Languages

Our latest update brings expanded language support to our PII Text Redaction feature, now available in 47 additional languages.

This enhancement ensures that your Personally Identifiable Information (PII) — any information that can be used to identify a person — is safeguarded regardless of location or language, making robust privacy measures more accessible.

With this feature, you can:

Securely handle customer service calls containing personal information
Safely share and analyze user-generated content in media applications
Protect participant privacy in market research studies and surveys

With PII Redaction, you can identify and remove personal data such as addresses, phone numbers, and credit card details from your transcripts. You can accomplish this in two ways:

Text Redaction: Generates a transcript with PII removed. Example: "You can reach me at [phone_number]" or "You can reach me at ###.”
Audio Redaction: "Beeps out" sensitive information in your audio file.

With a few extra lines of code, you can quickly redact PII from your transcripts.

import assemblyai as aai aai.settings.api_key = "YOUR API KEY" audio_url = "https://github.com/AssemblyAI-Community/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3" config = aai.TranscriptionConfig(speaker_labels=True).set_redact_pii( policies=[ aai.PIIRedactionPolicy.person_name, aai.PIIRedactionPolicy.organization, aai.PIIRedactionPolicy.occupation, ], substitution=aai.PIISubstitutionPolicy.hash, ) transcript = aai.Transcriber().transcribe(audio_url, config) for utterance in transcript.utterances: print(f"Speaker {utterance.speaker}: {utterance.text}") print(transcript.text)

Example output

Speaker A: Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US. Skylines from Maine to Maryland to Minnesota are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why. So we called ##### #######, an ######### ######### in the ########## ## ############# ###### ### ########### at ##### ####### ##########. Good morning, #########. Speaker B: Good morning. Speaker A: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away? Speaker B: Well, there's a couple of things. The season has been pretty dry already, and then the fact that we're getting hit in the US is because there's a couple weather systems that...

Plus, the PII models achieve 99%+ precision, accuracy, and recall in major languages, including English, French, German, Italian, Portuguese, Spanish, Korean, Hindi, Russian, Tagalog, and Ukrainian.

This ensures reliable protection of sensitive information. For EU-based operations, we support PII Text Redaction in 13 languages, meeting regional data residency requirements.

Expanding Entity Detection

Extracting meaningful insights from large volumes of audio data can be time-consuming and resource-intensive. We enhanced our Entity Detection with 16 new entity types for a total of 44 different entities so you can extract more value from your audio data.

You can automatically identify and categorize key information in your transcripts, providing detailed entity lists and timestamps.

Here are a few examples of what you can detect:

Names of people
Organizations
Addresses
Phone numbers
Medical data
Social security numbers

Here's how you can enable Entity Detection within your app:

import assemblyai as aai aai.settings.api_key = "YOUR API KEY" audio_url = "https://github.com/AssemblyAI-Community/audio-examples/raw/main/20230607_me_canadian_wildfires.mp3" config = aai.TranscriptionConfig(entity_detection=True) transcript = aai.Transcriber().transcribe(audio_url, config) for entity in transcript.entities: print(entity.text) print(entity.entity_type) print(f"Timestamp: {entity.start} - {entity.end}\n")

You'll receive precise entity data, such as:

Canada location Timestamp: 2548 - 3130 the US location Timestamp: 5498 - 6350 ...

By eliminating manual review, you can efficiently search and categorize audio content across multiple languages and markets. This opens up new possibilities for in-depth analysis and data-driven decision making.

With Entity Detection, you'll be able to run key use cases like:

Rapidly analyze call center interactions to improve customer service
Categorize and search media content more effectively
Extract key trends and patterns from market research data

Entity Detection delivers reliable results with 99% accuracy in major languages. It also supports EU data residency for 13 languages, helping you maintain regional compliance requirements. Quickly unlock the full potential of your audio data, gaining deeper insights into customer interactions and market trends across a broad range of contexts.

Frequently Asked Questions

I want to store and process my data in the EU. Will the expanded PII Text Redaction and Entity Detection languages be supported by EU Data Residency?

Yes, 13 languages in our "Best ASR" offering will be supported by EU Data Residency: English, Finnish, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.

What is the quality of PII Text Redaction and Entity Detection across languages?

The highest quality PII Text Redaction and Entity Detection is found in English, French, German, Italian, Portuguese, Spanish, Korean, Hindi, Dutch, Japanese, Mandarin, Russian, Tagalog, and Ukrainian. These languages have fully trained corpora with verified 99%+ precision, accuracy, and recall results.

PII Redaction and Entity Detection in other supported languages perform well, but the training corpus is actively being improved to bring it up to the same level as the other languages.

How secure is my data when using AssemblyAI's PII Redaction and Entity Detection?

AssemblyAI prioritizes data security with enterprise-grade encryption both in transit and at rest. We adhere to stringent data security practices to ensure your sensitive information is protected. Additionally, users can request the deletion of their data at any time, and these requests are handled promptly.

Start building with new security features today.