Transform Chinese transcripts into Simplified or Traditional Text

When transcribing Chinese audio, our models produce output that mixes Simplified and Traditional Chinese characters. This happens because the models are typically trained on diverse datasets that contain both writing systems.

This guide demonstrates a practical workaround using OpenCC, an open-source Chinese conversion tool. We’ll show you how to add a post-processing step that normalizes your transcription output to consistent Simplified Chinese or consistent Traditional Chinese, depending on your needs.

While this guide uses Python, OpenCC is available across multiple programming languages.

Quickstart

import assemblyai as aai
import opencc

aai.settings.api_key = "<YOUR-API-KEY>"

audio_file = "https://assembly.ai/chinese-interview.mp4"

config = aai.TranscriptionConfig(language_code="zh", speech_models=["universal-3-pro", "universal-2"])

transcript = aai.Transcriber(config=config).transcribe(audio_file)

if transcript.status == "error":
    raise RuntimeError(f"Transcription failed: {transcript.error}")

# t2s.json converts traditional characters to simplified
# use s2t.json to convert from simplified to traditional
converter = opencc.OpenCC('t2s.json')

simplified_transcript = converter.convert(transcript.text)

print(simplified_transcript)

Step-by-step instructions

First, install the required packages:

  1. AssemblyAI SDK
  2. OpenCC
$ pip install -U assemblyai opencc

Import the necessary libraries and configure your API credentials:

import assemblyai as aai
import opencc

aai.settings.api_key = "YOUR_API_KEY"

Specify your audio source and create a configuration for Chinese language transcription. Then submit your transcription request.

audio_file = "https://assembly.ai/chinese-interview.mp4"

config = aai.TranscriptionConfig(language_code="zh", speech_models=["universal-3-pro", "universal-2"])

transcript = aai.Transcriber(config=config).transcribe(audio_file)

Implement error handling to catch any transcription failures:

if transcript.status == "error":
    raise RuntimeError(f"Transcription failed: {transcript.error}")

Apply script conversion using OpenCC with the appropriate configuration:

# Script conversion options:
# - 't2s.json': Traditional to Simplified
# - 's2t.json': Simplified to Traditional

# Create converter object with desired direction
converter = opencc.OpenCC('t2s.json')  # For Traditional to Simplified

# Convert the transcript text
simplified_transcript = converter.convert(transcript.text)
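
If your application needs to switch between the two directions, one option is to pick the OpenCC configuration from a target-script name instead of hard-coding it. The sketch below is only an illustration: `OPENCC_CONFIGS`, `config_for`, and `normalize_script` are hypothetical helper names, not part of OpenCC or the AssemblyAI SDK, and only the two configurations mentioned in this guide are covered.

```python
# Hypothetical helpers (not part of OpenCC or the AssemblyAI SDK).
# Maps a target script to one of the two OpenCC configs used in this guide.
OPENCC_CONFIGS = {
    "simplified": "t2s.json",   # Traditional to Simplified
    "traditional": "s2t.json",  # Simplified to Traditional
}


def config_for(target: str) -> str:
    """Return the OpenCC config file name for the requested target script."""
    try:
        return OPENCC_CONFIGS[target.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown target script {target!r}; "
            f"expected one of {sorted(OPENCC_CONFIGS)}"
        ) from None


def normalize_script(text: str, target: str = "simplified") -> str:
    """Convert text to the target script using OpenCC."""
    # Imported lazily so config_for() can be used without OpenCC installed.
    import opencc

    return opencc.OpenCC(config_for(target)).convert(text)
```

With these helpers, `normalize_script(transcript.text, "traditional")` would run the same conversion step as above with `s2t.json`, and an unsupported target name fails early with a `ValueError` rather than inside OpenCC.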

Output or save your converted transcript:

print(simplified_transcript)

# Optionally save to file
with open("converted_transcript.txt", "w", encoding="utf-8") as f:
    f.write(simplified_transcript)
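
Before saving, it can be useful to check how much the conversion actually changed. The following is a rough sketch, not an OpenCC feature: `count_changed_characters` is a hypothetical helper, and the two sample strings are hand-written stand-ins for `transcript.text` and the converter output rather than real transcription results. It assumes the conversion maps characters one-to-one and returns `None` when the lengths differ.

```python
from typing import Optional


def count_changed_characters(original: str, converted: str) -> Optional[int]:
    """Count positions where the converted text differs from the original.

    Returns None when the lengths differ, because a position-by-position
    comparison is not meaningful in that case.
    """
    if len(original) != len(converted):
        return None
    return sum(1 for a, b in zip(original, converted) if a != b)


# Hand-written sample: a mixed-script sentence and its Simplified form.
original = "我們學習汉语"
converted = "我们学习汉语"

print(count_changed_characters(original, converted))  # prints 3
```

A result of `0` suggests the text was already in the target script, while a non-zero count shows how many characters the converter rewrote.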

Conclusion

This guide demonstrates how to solve the common challenge of mixed Chinese script systems in transcription outputs. By combining AssemblyAI’s powerful speech recognition capabilities with OpenCC’s script conversion tools, you can create a reliable pipeline for producing consistently formatted Chinese text from audio sources.