Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis


Detai Xin1,2, Xu Tan1, Kai Shen1,3, Zeqian Ju1,4, Dongchao Yang5, Yuancheng Wang6,

Shinnosuke Takamichi2, Hiroshi Saruwatari2, Shujie Liu1, Jinyu Li1, Sheng Zhao1

1Microsoft, 2The University of Tokyo,

3Zhejiang University, 4University of Science and Technology of China,

5The Chinese University of Hong Kong, 6The Chinese University of Hong Kong, Shenzhen

Abstract. We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. Previous work based on large language models (LLMs) shows impressive zero-shot TTS performance, but such methods often suffer from poor robustness, such as unstable prosody (unnatural pitch and rhythm/duration) and a high word error rate (WER), because of the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations show that, compared with the strong baseline VALL-E, RALL-E reduces the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.8% and 1.0%, respectively. Furthermore, RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from 68% to 4%.
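The duration-guided attention mentioned in the abstract can be illustrated with a small sketch. This is not the RALL-E implementation: the function name `duration_window_mask`, the `window` parameter, and the example durations are assumptions made for illustration only. The sketch shows one plausible reading of the idea: given the number of speech tokens assigned to each phoneme, every speech token may attend only to its own phoneme and a few neighboring phonemes.

```python
import numpy as np


def duration_window_mask(durations, window=1):
    """Return a boolean mask of shape (num_speech_tokens, num_phonemes).

    mask[t, p] is True when speech token t is allowed to attend to phoneme p.
    durations[p] is the number of speech tokens aligned to phoneme p.
    `window` (hypothetical parameter) is how many neighboring phonemes on
    each side remain visible.
    """
    durations = np.asarray(durations, dtype=int)
    num_phonemes = len(durations)
    # Phoneme index that "owns" each speech token, e.g. [2, 1] -> [0, 0, 1].
    owner = np.repeat(np.arange(num_phonemes), durations)
    phoneme_idx = np.arange(num_phonemes)
    # Keep only phonemes within `window` positions of the owning phoneme.
    return np.abs(owner[:, None] - phoneme_idx[None, :]) <= window


def masked_softmax(scores, mask):
    """Softmax over the last axis with masked-out positions set to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Three phonemes lasting 2, 1, and 3 speech tokens (made-up values).
mask = duration_window_mask([2, 1, 3], window=1)
weights = masked_softmax(np.zeros(mask.shape), mask)
```

With these made-up durations, the first two speech tokens can see phonemes 0 and 1 but not phoneme 2, so attention cannot drift far from the phoneme currently being spoken.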

This page is for research demonstration purposes only.

Zero-Shot TTS Samples

We show audio samples of GT (ground truth), RALL-E, ELLA-V, and VALL-E selected from the LibriSpeech test-clean set. Samples of VALL-T are not shown because it is tested on LibriTTS. We also provide the WER% and the transcriptions recognized by the ASR model used in the paper; mispronunciation, repetition, and hallucination are marked in UPPER CASE, and each omitted word is marked with *.
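For readers who want to check the WER% values, the following is a minimal sketch of the conventional WER definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The exact scoring tool and text normalization used in the paper are not stated on this page, so the helper names `normalize`, `word_errors`, and `wer_percent`, and the choice to lowercase and drop the `*` omission markers, are assumptions.

```python
def normalize(text):
    """Lowercase and drop tokens made only of '*' (omission markers)."""
    return [w.lower() for w in text.split() if w.strip("*")]


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer_percent(reference, hypothesis):
    """WER in percent, relative to the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    return 100.0 * word_errors(ref, hyp) / len(ref)


reference = ("but young sharp tongue now that we have caught you we will put "
             "you into a trap that you cannot get out of")
hypothesis = ("*** ***** sharp tongue now that we have caught you we will put "
              "you into a trap that you cannot get out of")
print(round(wer_percent(reference, hypothesis), 1))  # 8.7 (2 errors / 23 words)
```

The example uses the last transcription of the first sample on this page: two omitted words out of 23 reference words give 8.7%.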

Prompt text: I thought few men would follow me for I was young

Prompt audio:

Synthesis text: But young sharp tongue, now that we have caught you, we will put you into a trap that you cannot get out of.

ASR result and WER% per system:
GT (WER 0%): but young sharp tongue now that we have caught you we will put you into a trap that you cannot get out of
RALL-E (WER 0%): but young sharp tongue now that we have caught you we will put you into a trap that you cannot get out of
ELLA-V (WER 21.7%): *** young ***** SHARPTON now that we have caught you we will put you into a trap **** THEY cannot get out of
VALL-E (WER 8.7%): *** ***** sharp tongue now that we have caught you we will put you into a trap that you cannot get out of

Prompt text: using every possible contraction a quarter of an hour, not less.

Prompt audio:

Synthesis text: When we were out in the darkness of the quadrangle, we again looked up at the windows.

ASR result and WER% per system:
GT (WER 0%): when we were out in the darkness of the quadrangle we again looked up at the windows
RALL-E (WER 0%): when we were out in the darkness of the quadrangle we again looked up at the windows
ELLA-V (WER 41.2%): **** ** were out in the darkness of A QUIET DRANGLE WEEK AND looked up at the windows
VALL-E (WER 11.8%): when ** WE'RE out in the darkness of the quadrangle we again looked up at the windows

Prompt text: of this long separation will wear away.

Prompt audio:

Synthesis text: She, a Tory and clergyman's daughter, was always in a minority of one in our house of violent dissent and radicalism.

ASR result and WER% per system:
GT (WER 0%): she a tory and clergyman's daughter was always in a minority of one in our house of violent dissent and radicalism
RALL-E (WER 0%): she a tory and clergyman's daughter was always in a minority of one in our house of violent dissent and radicalism
ELLA-V (WER 23.8%): *** * **** and clergyman's daughter was always in a minority of one in our house of violent ******* DISSENTING radicalism
VALL-E (WER 9.5%): she a tory and clergyman's daughter was always in a minority ** one in our house of VIOLET dissent and radicalism

Prompt text: Far from its sire, your majesty haven't given

Prompt audio:

Synthesis text: Quick, quick then, among the high-reed grass said Montalais. Stoop, Athenais, you are so tall.

ASR result and WER% per system:
GT (WER 6.25%): quick quick then among the high reed grass said montalais stoop ATHENAE you are so tall
RALL-E (WER 0%): quick quick then among the high reed grass said montalais stoop athenais you are so tall
ELLA-V (WER 31.25%): quick WICK THAN among the high READ grass said montalais TO PATHANAY you are so tall
VALL-E (WER 62.5%): QUICK quick quick then among the **** HYRID grass said MONTEL EYES BE AN AST TOO stoop ATHANIZE you are so tall

Prompt text: drew back from John Yago as he approached the empty chair next

Prompt audio:

Synthesis text: Soft heart, he said gently to her. Then to Thorkel. Well, let him go, Thorkel.

ASR result and WER% per system:
GT (WER 13.3%): soft heart he said gently to her then to TORKELL well let him go TORKELL
RALL-E (WER 0%): soft heart he said gently to her then to thorkel well let him go thorkel
ELLA-V (WER 26.7%): soft heart he said JEALOUSY to her THAN to FORKEL well let him go FORKEL
VALL-E (WER 53.3%): JANKS AND LOW HARK he said gently to her THEN TO THORKEL OH then to thorkel well let him go thorkel

Samples of hard sentences

We show samples of the hard sentences mentioned in Section 4.4 of the paper. These sentences have no GT utterance. For each sample, we describe the errors it contains (mispronunciation, omission, repetition, hallucination).

Synthesis text: b

Errors: "b" is repeated twice.

Synthesis text: 22222222 hello 22222222

Errors: "2" is synthesized 9 times before "hello" and 7 times after it, whereas the input text repeats "2" 8 times on each side of "hello".

Synthesis text: c five eight zero three three nine a zero bf eight FALSE zero zero zero bba3add2 - c229 - 4cdb -

Errors: omission and hallucination.

Synthesis text: You can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information.

Errors: "four" is omitted once in the cell phone number.

Synthesis text: Thanks J RGR Are you using the LDDM driver for this system or the in the build XDDM driver ? 12

Errors: "LDDM" is mispronounced, and there is a hallucination at the end.