An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Sin-Horng Chen*, Shaw-Hwa Hwang, Yih-Ru Wang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

136 Scopus citations


A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, as well as intersyllable pause durations. The input layer and first hidden layer operate with a word-synchronized clock to represent current-word phonologic states within the prosodic structure of text to be synthesized. The second hidden layer and output layer operate on a syllable-synchronized clock and use outputs from the preceding layers, along with additional syllable-level inputs fed directly to the second hidden layer, to generate desired prosodic parameters. The RNN was trained on a large set of actual utterances accompanied by associated texts, and can automatically learn many human-prosody phonologic rules, including the well-known Sandhi Tone 3 F0-change rule. Experimental results show that all synthesized prosodic parameter sequences matched quite well with their original counterparts, and a pitch-synchronous-overlap-add-based (PSOLA-based) Mandarin TTS system was also used for testing of our approach. While subjective tests are difficult to perform and remain to be done in the future, we have carried out informal listening tests by a significant number of native Chinese speakers and the results confirmed that all synthesized speech sounded quite natural.
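The two-clock layering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The class name `TwoClockRNN`, all layer sizes, the `tanh` activation, the random untrained weights, the linear output layer, and the recurrence on the second hidden layer are assumptions made only for illustration. The sketch shows one thing: the first hidden layer is updated once per word, while the second hidden layer is updated once per syllable and combines the current word state with syllable-level inputs.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Small random weights; a real system would learn these from speech data."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TwoClockRNN:
    """Toy two-rate recurrent network (hypothetical sizes, untrained weights)."""

    def __init__(self, word_in, word_hid, syl_in, syl_hid, n_out, seed=0):
        rng = random.Random(seed)
        self.W_xw = rand_matrix(word_hid, word_in, rng)   # word input -> hidden 1
        self.W_ww = rand_matrix(word_hid, word_hid, rng)  # hidden 1 recurrence
        self.W_ws = rand_matrix(syl_hid, word_hid, rng)   # hidden 1 -> hidden 2
        self.W_xs = rand_matrix(syl_hid, syl_in, rng)     # syllable input -> hidden 2
        self.W_ss = rand_matrix(syl_hid, syl_hid, rng)    # hidden 2 recurrence (assumed)
        self.W_out = rand_matrix(n_out, syl_hid, rng)     # hidden 2 -> outputs
        self.word_hid = word_hid
        self.syl_hid = syl_hid

    def forward(self, words):
        """words: list of (word_features, [syllable_features, ...]) pairs.

        Returns one output vector per syllable, in utterance order.
        """
        h_word = [0.0] * self.word_hid
        h_syl = [0.0] * self.syl_hid
        outputs = []
        for word_feats, syllables in words:
            # Word-synchronized clock: one state update per word.
            a = matvec(self.W_xw, word_feats)
            b = matvec(self.W_ww, h_word)
            h_word = [math.tanh(p + q) for p, q in zip(a, b)]
            from_word = matvec(self.W_ws, h_word)
            for syl_feats in syllables:
                # Syllable-synchronized clock: one update and one output per syllable.
                c = matvec(self.W_xs, syl_feats)
                d = matvec(self.W_ss, h_syl)
                h_syl = [math.tanh(p + q + r)
                         for p, q, r in zip(from_word, c, d)]
                outputs.append(matvec(self.W_out, h_syl))
        return outputs
```

Calling `TwoClockRNN(3, 4, 2, 5, 8).forward(words)` on an utterance of two words with two and three syllables returns five output vectors, one per syllable. In a trained system each vector would hold the prosodic parameters listed in the abstract (pitch contour, energy, initial and final durations, pause duration); the output size of 8 here is an illustrative choice.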

Original language: English
Article number: 668817
Pages (from-to): 226-239
Number of pages: 14
Journal: IEEE Transactions on Speech and Audio Processing
Issue number: 3
State: Published - May 1998


  • Mandarin
  • Pitch contour
  • Prosodic information synthesizer
  • Recurrent neural network
  • Text-to-speech
