THESIS
2020
xii, 57 pages : illustrations ; 30 cm
Abstract
Recent studies of multi-lingual multi-speaker text-to-speech (TTS) systems have proposed models that can synthesize high-quality speech. However, some of these models are trained on proprietary corpora consisting of hours of speech recorded by performing artists, and they require additional fine-tuning to enroll new voices. To reduce the cost of training corpora and to support online enrollment of new voices, we investigate a novel multi-lingual multi-speaker neural TTS synthesis approach for generating high-quality native or accented speech for native/foreign, seen/unseen speakers in English, Mandarin and Cantonese. The unique features of the proposed model make it possible to synthesize accented or fluent speech for a speaker in a language that is not his/her mother tongue.

Our proposed model extends a single-speaker Tacotron-based TTS model via a transfer learning technique that conditions the model on pretrained speaker embeddings (x-vectors) extracted by a speaker verification system.
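As a rough illustration of this conditioning step (a minimal PyTorch sketch; the module name, projection layer and dimensions below are assumptions, not the thesis's actual implementation), the x-vector can be broadcast across time and concatenated with the encoder outputs:

    import torch
    import torch.nn as nn

    class SpeakerConditioner(nn.Module):
        """Hypothetical sketch: condition Tacotron encoder outputs on a
        pretrained x-vector (dimensions are assumed, not the thesis's)."""

        def __init__(self, enc_dim=512, xvector_dim=512):
            super().__init__()
            # Project the x-vector into the encoder feature space.
            self.proj = nn.Linear(xvector_dim, enc_dim)

        def forward(self, encoder_outputs, xvector):
            # encoder_outputs: (batch, time, enc_dim)
            # xvector: (batch, xvector_dim), produced by a separately
            # trained speaker verification system and kept fixed here.
            spk = self.proj(xvector).unsqueeze(1)             # (batch, 1, enc_dim)
            spk = spk.expand(-1, encoder_outputs.size(1), -1)
            # Concatenate per time step so the decoder's attention sees
            # speaker identity at every encoder position.
            return torch.cat([encoder_outputs, spk], dim=-1)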
We also replace the input character embedding with a concatenation of a phoneme embedding and a tone/stress embedding to produce more natural speech. The additional tone/stress embedding acts as an extension of the language embedding and provides extra control over accents across the languages. By manipulating the tone/stress input, our model can synthesize native or accented speech for foreign speakers.
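A minimal sketch of such an input layer (vocabulary sizes and embedding dimensions are placeholder assumptions):

    import torch
    import torch.nn as nn

    class PhoneToneEmbedding(nn.Module):
        """Hypothetical sketch: TTS input as a concatenation of a phoneme
        embedding and a tone/stress embedding (sizes assumed)."""

        def __init__(self, n_phonemes=100, n_tones=16,
                     phone_dim=448, tone_dim=64):
            super().__init__()
            self.phone_emb = nn.Embedding(n_phonemes, phone_dim)
            self.tone_emb = nn.Embedding(n_tones, tone_dim)

        def forward(self, phoneme_ids, tone_ids):
            # phoneme_ids, tone_ids: (batch, time) integer sequences.
            # Keeping tone/stress as a separate stream lets it be swapped
            # at synthesis time, e.g. pairing English phonemes with
            # Mandarin tone labels to produce accented speech.
            return torch.cat(
                [self.phone_emb(phoneme_ids), self.tone_emb(tone_ids)],
                dim=-1,
            )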
The WaveNet vocoder in the TTS model is trained only on Cantonese speech, and yet it can synthesize English and Mandarin speech very well. This demonstrates that conditioning the WaveNet on mel-spectrograms is sufficient for it to perform well in multi-lingual speech synthesis.
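Because the vocoder sees only mel-spectrograms, its conditioning input is language-agnostic. A minimal sketch of extracting such features (the parameter values below are typical Tacotron-style settings, assumed rather than taken from the thesis):

    import librosa
    import numpy as np

    # Load an utterance and compute an 80-band mel-spectrogram.
    y, sr = librosa.load("utterance.wav", sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))  # log compression
    # log_mel (80 x frames) is the language-agnostic conditioning input
    # that the vocoder upsamples to the waveform sample rate.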
The mean opinion score (MOS) results show that the synthesized native speech of both unseen foreign and native speakers is intelligible and natural, and its speaker similarity is also good. The lower scores of foreign-accented speech suggest that it is distinguishable from native speech; the foreign accents we introduced can obscure the meaning of the synthesized speech as perceived by human raters.