Pre-recorded audio
Our Speech-to-Text model enables you to transcribe pre-recorded audio into written text.
On top of the transcription, you can enable other features and models, such as Speaker Diarization, by adding additional parameters to the same transcription request.
Choose model class
Choose between Best and Nano based on the cost and performance tradeoffs best suited for your application.
Quickstart
The following example transcribes an audio file from a URL.
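For example, here's a minimal sketch using the AssemblyAI Python SDK (the `assemblyai` package); the API key and audio URL are placeholders:

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Any publicly accessible audio URL (or a local file path) works here
audio_url = "https://example.com/audio.mp3"

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_url)

print(transcript.text)
```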
Word-level timestamps
The response also includes an array with information about each word:
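Continuing from the quickstart sketch above, each entry in `transcript.words` carries the word text, start and end times in milliseconds, and a confidence score (attribute names per the Python SDK):

```python
for word in transcript.words:
    print(f"{word.text} [{word.start}-{word.end} ms] confidence={word.confidence}")
```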
Transcript status
After you’ve submitted a file for transcription, your transcript has one of the following statuses:
- `queued`: The audio file is waiting to be processed.
- `processing`: The audio file is being transcribed.
- `completed`: The transcription has finished successfully.
- `error`: An error occurred while processing the audio file.
Handling errors
If the transcription fails, the status of the transcript is `error`, and the transcript includes an `error` property explaining what went wrong.
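A sketch of the check with the Python SDK (the audio URL is a placeholder):

```python
import assemblyai as aai

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("https://example.com/audio.mp3")

if transcript.status == aai.TranscriptStatus.error:
    print(f"Transcription failed: {transcript.error}")
```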
A transcription may fail for various reasons:
- Unsupported file format
- Missing audio in file
- Unreachable audio URL
If a transcription fails due to a server error, we recommend that you resubmit the file for transcription to allow another server to process the audio.
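A minimal retry sketch along those lines; `transcribe_with_retry` is a hypothetical helper, not part of the SDK:

```python
import time

import assemblyai as aai

# Hypothetical helper: resubmit the file if a transcription attempt errors
def transcribe_with_retry(transcriber: aai.Transcriber, audio_url: str, retries: int = 2):
    for attempt in range(retries):
        transcript = transcriber.transcribe(audio_url)
        if transcript.status != aai.TranscriptStatus.error:
            return transcript
        time.sleep(2 ** attempt)  # brief backoff before resubmitting
    return transcript
```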
Select the speech model with Best and Nano
We use a combination of models to produce your results. You can select the class of models to use to make the cost-performance tradeoff best suited to your application. Visit our pricing page for more information on our model tiers.
You can change the model by setting `speech_model` in the transcription config:
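For example, with the Python SDK (a sketch; enum and parameter names per the SDK's `TranscriptionConfig`):

```python
import assemblyai as aai

# Select the Nano tier instead of the default Best tier
config = aai.TranscriptionConfig(speech_model=aai.SpeechModel.nano)

transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```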
For a list of the supported languages for each model, see Supported languages.
Select the region
The default region is US, with base URL `api.assemblyai.com`. For EU data residency requirements, you can use our base URL for EU at `api.eu.assemblyai.com`.
The base URL for EU is currently only available for Async transcription.
To use the EU endpoint, set the `base_url` in the client settings like this:
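A sketch with the Python SDK (assuming the client reads `aai.settings.base_url`):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder
# Point the client at the EU endpoint (async transcription only)
aai.settings.base_url = "https://api.eu.assemblyai.com"
```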
Automatic punctuation and casing
By default, the API automatically punctuates the transcription text and formats proper nouns, as well as converts numbers to their numerical form.
To disable punctuation and text formatting, set `punctuate` and `format_text` to `False` in the transcription config.
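For example (Python SDK sketch; the audio URL is a placeholder):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(punctuate=False, format_text=False)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```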
Automatic language detection
Identify the dominant language spoken in an audio file and use it during transcription. Enable it to detect any of the supported languages.
To reliably identify the dominant language, the file must contain at least 50 seconds of spoken audio.
To enable it, set `language_detection` to `True` in the transcription config.
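For example (Python SDK sketch):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(language_detection=True)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```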
Select model class based on detected language
By performing automatic language detection on a small chunk of audio first, you can then select between the Best or Nano model depending on the detected language. To learn more, see Separating automatic language detection from transcription.
Confidence score
If language detection is enabled, the API returns a confidence score for the detected language. The score ranges from 0.0 (low confidence) to 1.0 (high confidence).
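A sketch of reading both values; the field names below follow the transcript API response, accessed here through the SDK's raw `json_response` since a dedicated accessor may not exist:

```python
import assemblyai as aai

config = aai.TranscriptionConfig(language_detection=True)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")

# Detected language and the model's confidence in that detection
print(transcript.json_response["language_code"])
print(transcript.json_response["language_confidence"])
```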
Set a language confidence threshold
You can set the confidence threshold that must be reached if language detection is enabled. An error will be returned if the language confidence is below this threshold. Valid values are in the range [0,1] inclusive.
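For example (a sketch; `language_confidence_threshold` is the parameter name used by the Python SDK's `TranscriptionConfig`):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(
    language_detection=True,
    language_confidence_threshold=0.4,  # return an error below 0.4 confidence
)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```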
Fallback to a default language
For a workflow that resubmits a transcription request using a default language if the threshold is not reached, see this cookbook.
Set language manually
If you already know the dominant language, you can use the `language_code` key to specify the language of the speech in your audio file.
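For example (Python SDK sketch; `"es"` stands in for Spanish):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(language_code="es")
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```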
To see all supported languages and their codes, see Supported languages.
Custom spelling
Custom Spelling lets you customize how words are spelled or formatted in the transcript.
To use Custom Spelling, pass a dictionary to `set_custom_spelling()` on the transcription config. Each key-value pair specifies a mapping from a word or phrase to a new spelling or format. The key specifies the new spelling or format, and the corresponding value is the word or phrase you want to replace.
The key is case-sensitive, but the value isn’t. Additionally, the value can contain multiple words.
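A sketch following that mapping shape (the entries are illustrative):

```python
import assemblyai as aai

config = aai.TranscriptionConfig()
# Keys are the desired output; values list the spellings to replace
config.set_custom_spelling(
    {
        "Gettysburg": ["getis burg"],
        "SQL": ["sequel"],
    }
)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```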
Custom vocabulary
To improve the transcription accuracy, you can boost certain words or phrases that appear frequently in your audio file.
To boost words or phrases, include the `word_boost` parameter in the transcription config. You can also control how much weight to apply to each keyword or phrase by including `boost_param` in the transcription config with a value of `low`, `default`, or `high`.
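For example (Python SDK sketch; the boosted terms are illustrative):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(
    word_boost=["aws", "azure", "google cloud"],
    boost_param="high",  # low, default, or high
)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```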
Follow formatting guidelines for custom vocabulary to ensure the best results:
- Remove all punctuation except apostrophes.
- Make sure each word is in its spoken form. For example, `iphone seven` instead of `iphone 7`.
- Remove spaces between letters in acronyms.
Additionally, the model still accepts words with unique characters such as é, but converts them to their ASCII equivalent.
You can boost a maximum of 1,000 unique keywords and phrases, where each of them can contain up to 6 words.
Multichannel transcription
If you have a multichannel audio file with multiple speakers, you can transcribe each of them separately.
To enable it, set `multichannel` to `True` in your transcription config.
Multichannel audio increases the transcription time by approximately 25%.
The response includes an `audio_channels` property with the number of different channels, and an additional `utterances` property containing a list of turn-by-turn utterances. Each utterance contains channel information, starting at 1. Additionally, each word in the `words` array contains the channel identifier.
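A sketch of enabling multichannel and reading the per-channel utterances (attribute names per the response shape described above):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(multichannel=True)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/stereo.mp3")

# Each utterance is attributed to a channel, numbered from 1
for utterance in transcript.utterances:
    print(f"Channel {utterance.channel}: {utterance.text}")
```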
Dual-channel transcription
Use Multichannel instead.
Export SRT or VTT caption files
You can export completed transcripts in SRT or VTT format, which can be used for subtitles and closed captions in videos.
You can also customize the maximum number of characters per caption by specifying the `chars_per_caption` parameter.
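For example, with the Python SDK on a completed transcript (method names per the SDK's subtitle exporters):

```python
import assemblyai as aai

transcript = aai.Transcriber().transcribe("https://example.com/audio.mp3")

# Cap each caption at 32 characters
srt = transcript.export_subtitles_srt(chars_per_caption=32)
vtt = transcript.export_subtitles_vtt(chars_per_caption=32)

with open("subtitles.srt", "w") as f:
    f.write(srt)
```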
Export paragraphs and sentences
You can retrieve transcripts that are automatically segmented into paragraphs or sentences, for a more reader-friendly experience.
The text of the transcript is broken down by either paragraphs or sentences, along with additional metadata.
The response is an array of objects, each representing a sentence or a paragraph in the transcript. See the API reference for more info.
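For example (Python SDK sketch on a completed transcript):

```python
import assemblyai as aai

transcript = aai.Transcriber().transcribe("https://example.com/audio.mp3")

# Each paragraph and sentence object carries its text and timing metadata
for paragraph in transcript.get_paragraphs():
    print(paragraph.text)

for sentence in transcript.get_sentences():
    print(sentence.text)
```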
Filler words
The following filler words are removed by default:
- "um"
- "uh"
- "hmm"
- "mhm"
- "uh-huh"
- "ah"
- "huh"
- "hm"
- "m"
If you want to keep filler words in the transcript, you can set `disfluencies` to `true` in the transcription config.
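For example (Python SDK sketch):

```python
import assemblyai as aai

# Keep filler words such as "um" and "uh" in the transcript
config = aai.TranscriptionConfig(disfluencies=True)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```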
Profanity filtering
You can automatically filter out profanity from the transcripts by setting `filter_profanity` to `true` in your transcription config. Any profanity in the returned `text` will be replaced with asterisks.
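For example (Python SDK sketch):

```python
import assemblyai as aai

config = aai.TranscriptionConfig(filter_profanity=True)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```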
The profanity filter isn’t perfect. Certain words may still be missed or improperly filtered.
Set the start and end of the transcript
If you only want to transcribe a portion of your file, you can set the `audio_start_from` and the `audio_end_at` parameters in your transcription config.
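For example (Python SDK sketch; both values are in milliseconds):

```python
import assemblyai as aai

# Transcribe only the portion from 0:05 to 0:15
config = aai.TranscriptionConfig(audio_start_from=5000, audio_end_at=15000)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")
```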
Speech threshold
To only transcribe files that contain at least a specified percentage of spoken audio, you can set the `speech_threshold` parameter. You can pass any value between 0 and 1. If the percentage of speech in the audio file is below the provided threshold, the value of `text` is `None` and the response contains an `error` message.
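For example (Python SDK sketch):

```python
import assemblyai as aai

# Require at least 10% spoken audio
config = aai.TranscriptionConfig(speech_threshold=0.1)
transcript = aai.Transcriber(config=config).transcribe("https://example.com/audio.mp3")

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)
```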
Word search
You can search through a completed transcript for a specific set of keywords, which is useful for quickly finding relevant information.
The parameter can be a list of words, numbers, or phrases up to five words.
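For example (Python SDK sketch on a completed transcript; the keywords are illustrative):

```python
import assemblyai as aai

transcript = aai.Transcriber().transcribe("https://example.com/audio.mp3")

# Search the completed transcript for keywords and phrases
matches = transcript.word_search(["price", "product launch"])

for match in matches:
    print(f"Found '{match.text}' {match.count} time(s)")
```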
Delete transcripts
You can remove the data from the transcript and mark it as deleted.
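A sketch with the Python SDK (assuming the `Transcript.delete_by_id` class method; the transcript ID is a placeholder):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

# Deleting requires the ID of a previously created transcript
aai.Transcript.delete_by_id("YOUR_TRANSCRIPT_ID")
```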
Account-level TTL value
Starting on November 26, 2024, the platform will assign an account-level Time to Live (TTL) for customers who have executed a Business Associate Agreement (BAA) with AssemblyAI. For those customers, all transcripts generated via the async transcription endpoint will be deleted after the TTL period.
As of the feature launch date:
- The TTL is set to 3 days (subject to change).
- Customers can still manually delete transcripts before the TTL period by using the deletion endpoint. However, they cannot keep transcripts on the platform after the TTL period has expired.
BAAs are limited to customers who process PHI, subject to HIPAA. If you are processing PHI and require a BAA, please reach out to sales@assemblyai.com.
API reference
You can find the API reference here.
Troubleshooting
How can I make certain words more likely to be transcribed?
You can include words, phrases, or both in the `word_boost` parameter. Any term included has its likelihood of being transcribed boosted.
Can I customize how words are spelled by the model?
Yes. The Custom Spelling feature gives you the ability to specify how words are spelled or formatted in the transcript text. For example, Custom Spelling could be used to change the spelling of all instances of the word “Ariana” to “Arianna”. It could also be used to change the formatting of “CS 50” to “CS50”.
Why am I receiving a “400 Bad Request” error when making an API request?
A “400 Bad Request” error typically indicates that there’s a problem with the formatting or content of the API request. Double-check the syntax of your request and ensure that all required parameters are included as described in the API reference. If the issue persists, contact our support team for assistance.