What if you received a raw transcript that looked like this?
if you picture a sound meter with a needle that bounces up and down
every time there's a sound the tone is supposed to put the needle
perfectly at this one spot on the meter with a black numbers end and
the red part of the meter begins there's like a zero at that spot
marking this is where you want to be and the tone is just supposed to
rest there rock solid but this particular day with this particular
recording we put it on and keith and i watched the meter as the needle
first dipped below the zero then climbed above the zero and then
floated sort of tentatively to the spot that it was supposed to be at
the zero and rested there
It's legible, but it takes real effort to read: your mind naturally wants to add punctuation, casing, and line breaks to make sense of the long string of text.
Compare the transcript above to this:
If you picture a sound meter with a needle that bounces up and down
every time there's a sound, the tone is supposed to put the needle
perfectly at this one spot on the meter with a black numbers end, and
the red part of the meter begins there's like a zero at that spot
marking, this is where you want to be. And the tone is just supposed to
rest there rock solid. But this particular day, with this particular
recording, we put it on, and Keith and I watched the meter as the
needle first dipped below the zero, then climbed above the zero, and
then floated sort of tentatively to the spot that it was supposed to be
at the zero and rested there.
See how much easier it is to read? This is because common punctuation and casing have been automatically applied to the transcription text.
Speech-to-text automatic punctuation and casing at AssemblyAI
When you transcribe an audio or video file with the AssemblyAI Speech-to-Text API, your transcript is automatically passed through our Automatic Punctuation and Casing Model.
Instead of a long chunk of text, your transcript has appropriately placed punctuation, such as commas, periods, and question marks, and correctly capitalized proper nouns, acronyms, and more. This improves readability and makes your transcript more useful overall, especially for customer-facing use cases.
What is automatic punctuation and casing for speech-to-text?
Punctuation refers to any commas, periods, question marks, exclamation marks, etc. that must be added to a transcription text.
Casing refers to two different categories:
- Proper Nouns
- Special Scenarios, e.g., acronyms like NASA or abbreviated names like NY Times.
What is Inverse Text Normalization (ITN)?
Inverse Text Normalization, or ITN, is a rule-based system (based on an FST, or Finite State Transducer) that also increases the readability of a transcript.
Essentially, ITN translates the spoken form of text (which is the output of the speech-to-text model) into its written form. For example, the raw transcript might output:
february fourth twenty twenty two
(spoken form)
The ITN model converts this to:
february 4th 2022
(written form)
ITN helps ensure the proper written format of text such as email addresses, credit card numbers, social security numbers, dates, and more.
If downstream tasks depend on these inputs, it becomes essential that all dates, numbers, emails, phone numbers, etc. are accurately formatted, or you risk an entire workflow failing to initiate correctly.
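To make the spoken-to-written conversion concrete, here is a minimal sketch of rule-based date normalization. It is not AssemblyAI's ITN system and it is not an FST: it uses plain Python lookups as a stand-in, and every name in it (normalize_dates, parse_year, the word tables) is hypothetical. It only handles a month, a day ordinal from first to tenth, and a year spoken as two pairs of digits.

```python
# Toy word tables for illustration only; a real ITN grammar covers far more.
UNITS = {w: n for n, w in enumerate(
    "one two three four five six seven eight nine".split(), start=1)}
TEENS = {w: n for n, w in enumerate(
    "ten eleven twelve thirteen fourteen fifteen sixteen "
    "seventeen eighteen nineteen".split(), start=10)}
TENS = {w: n for n, w in zip(
    range(20, 100, 10),
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
ORDINALS = {"first": "1st", "second": "2nd", "third": "3rd", "fourth": "4th",
            "fifth": "5th", "sixth": "6th", "seventh": "7th",
            "eighth": "8th", "ninth": "9th", "tenth": "10th"}
MONTHS = set("january february march april may june july august "
             "september october november december".split())


def parse_two_digits(tokens, i):
    """Read a number below 100 at tokens[i]; return (value, next_index) or None."""
    if i >= len(tokens):
        return None
    word = tokens[i]
    if word in TENS:
        value, i = TENS[word], i + 1
        # An optional unit may follow a tens word, e.g. "twenty two".
        if i < len(tokens) and tokens[i] in UNITS:
            value, i = value + UNITS[tokens[i]], i + 1
        return value, i
    if word in TEENS:
        return TEENS[word], i + 1
    if word in UNITS:
        return UNITS[word], i + 1
    return None


def parse_year(tokens, i):
    """Read a year spoken as two pairs, e.g. 'twenty twenty two' -> 2022."""
    first = parse_two_digits(tokens, i)
    if first is None:
        return None
    second = parse_two_digits(tokens, first[1])
    if second is None:
        return None
    return first[0] * 100 + second[0], second[1]


def normalize_dates(text):
    """Rewrite '<month> <ordinal> [<year>]' from spoken to written form."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        is_date = (tokens[i] in MONTHS and i + 1 < len(tokens)
                   and tokens[i + 1] in ORDINALS)
        if is_date:
            out += [tokens[i], ORDINALS[tokens[i + 1]]]
            i += 2
            year = parse_year(tokens, i)
            if year is not None:
                out.append(str(year[0]))
                i = year[1]
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(normalize_dates("february fourth twenty twenty two"))  # february 4th 2022
```

Tokens that do not match a date pattern pass through unchanged, so the function can be run over a whole raw transcript. A production system would compile rules like these into a transducer and cover many more patterns (phone numbers, currencies, email addresses).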
Speech-to-text automatic punctuation and casing — improvements in Universal-2
Our latest speech-to-text model, Universal-2, is even better at correctly applying text formatting rules like automatic punctuation and casing.
For example, benchmark tests revealed a 15% improvement in transcript structure and a 24% improvement in proper noun recognition, leading to more natural-sounding, accurate transcripts for customer-facing products.
Using automatic punctuation and casing with the AssemblyAI Speech-to-Text API
As stated above, the AssemblyAI Speech-to-Text API automatically punctuates the transcription text and capitalizes proper nouns. Numbers are also automatically converted to their written form.
Automatic punctuation and text formatting are enabled by default for optimal speech-to-text results, but you can disable them by setting the punctuate and format_text parameters to false in the transcription config. More details can be found in the AssemblyAI docs.
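As a rough illustration of turning both features off, the sketch below builds a transcription request with punctuate and format_text set to false, using only the Python standard library. The parameter names come from this article. The endpoint URL, the audio_url field, the authorization header, and the helper name build_transcript_request are assumptions made for the example, so confirm them against the AssemblyAI docs before use.

```python
import json
import urllib.request

# Assumed endpoint; verify against the AssemblyAI docs.
API_URL = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(audio_url, api_key, punctuate=True, format_text=True):
    """Build (but do not send) a POST request for a new transcript.

    Hypothetical helper: the field names audio_url and authorization
    are assumptions for this sketch.
    """
    body = {
        "audio_url": audio_url,
        "punctuate": punctuate,      # False: no automatic punctuation
        "format_text": format_text,  # False: no casing or number formatting
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )


request = build_transcript_request(
    "https://example.com/audio.mp3",  # placeholder audio file
    "YOUR_API_KEY",                   # placeholder credential
    punctuate=False,
    format_text=False,
)
print(request.data.decode("utf-8"))

# Sending is left commented out so the sketch runs offline:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

With both flags set to false, the returned transcript text should resemble the raw, unpunctuated example at the top of this article.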