Voices
Pick any voice ID from the tables below and set it on session.output.voice in a session.update before session.ready. session.output is immutable once the session is established, so the voice can't be changed mid-conversation.
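As a minimal sketch of the step above, the snippet below builds a session.update message that sets session.output.voice. The JSON envelope (a top-level "type" field and a nested "session" object) is an assumption made for illustration, not something this page specifies, and "example-voice-id" is a hypothetical placeholder rather than a real voice ID. Sending the message over the connection is left out.

```python
import json


def build_voice_update(voice_id: str) -> str:
    """Serialize a session.update message that sets session.output.voice.

    The message shape is an assumption for illustration; check the session
    reference for the exact envelope your connection expects.
    """
    if not voice_id:
        raise ValueError("voice_id must be a non-empty string")
    message = {
        "type": "session.update",  # assumed envelope field
        "session": {"output": {"voice": voice_id}},
    }
    return json.dumps(message)


# Hypothetical voice ID; replace it with an ID from the voice tables.
payload = build_voice_update("example-voice-id")
```

Because session.output is immutable once the session is established, a message like this only makes sense before session.ready; switching voices later would presumably mean starting a new session.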
Language support
The voice agent's input (speech recognition) and output (speech synthesis) cover different sets of languages:
- Input (understood): 🇺🇸 English, 🇫🇷 French, 🇩🇪 German, 🇮🇹 Italian, 🇵🇹 Portuguese, and 🇪🇸 Spanish.
- Output (spoken): those six, plus 🇮🇳 Hindi, 🇯🇵 Japanese, 🇰🇷 Korean, 🇨🇳 Mandarin, and 🇷🇺 Russian.
The agent can speak a language it can't transcribe from user audio. This is useful for translation-style flows where the user speaks one of the recognized languages and the agent replies in another.
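The input/output asymmetry can be captured in a small lookup, sketched here under the assumption that languages are tracked by their English names. The function name and the set constants are illustrative helpers, not part of any API.

```python
# Languages the agent can understand (speech recognition), per the list above.
INPUT_LANGUAGES = frozenset(
    {"English", "French", "German", "Italian", "Portuguese", "Spanish"}
)

# Languages the agent can speak (speech synthesis): the six above plus five more.
OUTPUT_LANGUAGES = INPUT_LANGUAGES | {
    "Hindi", "Japanese", "Korean", "Mandarin", "Russian"
}


def is_supported_flow(user_language: str, reply_language: str) -> bool:
    """Return True if the agent can understand user_language and speak reply_language."""
    return user_language in INPUT_LANGUAGES and reply_language in OUTPUT_LANGUAGES


# A translation-style flow: the user speaks Spanish, the agent replies in Japanese.
spanish_to_japanese = is_supported_flow("Spanish", "Japanese")
# The reverse does not work, because Japanese is output-only.
japanese_to_spanish = is_supported_flow("Japanese", "Spanish")
```

A check like this could run before a session is configured, so that an unsupported input language is caught early rather than surfacing as poor transcription.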
Choose a voice by language
Every voice supports every output language. The difference between the two tables is the voice's primary accent:
- For an English accent (American or British) carried into other languages, pick from Voices.
- For a native accent in a specific non-English language, pick the matching language-specific voice.
Voices
These voices have an American or British English accent. They speak 🇺🇸 English, 🇫🇷 French, 🇩🇪 German, 🇮🇹 Italian, 🇵🇹 Portuguese, 🇪🇸 Spanish, 🇮🇳 Hindi, 🇨🇳 Mandarin, 🇷🇺 Russian, 🇰🇷 Korean, and 🇯🇵 Japanese. Their English accent carries over into the other languages.
Language-specific voices
These voices have a native accent in a specific non-English language. They also speak 🇺🇸 English, 🇫🇷 French, 🇩🇪 German, 🇮🇹 Italian, 🇵🇹 Portuguese, 🇪🇸 Spanish, 🇮🇳 Hindi, 🇨🇳 Mandarin, 🇷🇺 Russian, 🇰🇷 Korean, and 🇯🇵 Japanese, and they code-switch naturally between their primary language and English.