THESIS
2022
1 online resource (x, 58 pages) : illustrations (some color)
Abstract
Recent deep learning text-to-speech (TTS) systems synthesize natural speech. Applying
speaker adaptation can make a TTS system speak like the adapting speaker, but the speaking
style of the synthesized utterances still closely follows that of the speaker's training
utterances. In some applications, it is desirable to synthesize speech in a manner
that suits the scenario. A straightforward solution is to record speech data from
a speaker under different role-playing scenarios. However, apart from professional voice
talents, most people are not experienced in speaking in different expressive styles. Likewise,
without being exposed to a multilingual environment from an early age, most people
cannot speak a second language with a native accent. In this thesis, we propose
a novel data augmentation method to create a stylish TTS model for a speaker. Specifically,
augmented data are created by ‘forcing’ a speaker to imitate the stylish speech of
other speakers. Our proposed method consists of two steps. Firstly, all the data are used
to train a basic multi-style multi-speaker TTS model. Secondly, augmented utterances
are created on-the-fly from the latest TTS model during its training and are used to further train the TTS model. We select two applications to demonstrate the effectiveness of
our proposed method: (1) synthesizing speech in three scenarios — newscasting, public
speaking, and storytelling — for a speaker who provides only neutral speech data; (2)
synthesizing “beautified” speech of a language spoken by a non-native speaker by reducing
his/her accent, in terms of better pronunciation and more native-like prosody. Our
experiments show that for scenario-based TTS, the scenario speech synthesized by our
proposed method is overwhelmingly preferred over that from a speaker-adapted TTS
model. For accent-beautified TTS, our model reduces the foreign accent of non-native
speech while retaining higher voice similarity than a state-of-the-art accent conversion
model.
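The two-step procedure described in the abstract — pre-train a multi-style multi-speaker TTS model, then continue training while mixing in utterances synthesized on-the-fly from the latest model, with the target speaker "forced" into other speakers' styles — can be sketched as follows. All names here (`SimpleTTS`, `train_with_augmentation`, the style labels) are illustrative stand-ins, not identifiers or details from the thesis:

```python
# Hypothetical sketch of the two-step augmentation scheme; a toy model
# stands in for a real multi-style multi-speaker TTS network.

class SimpleTTS:
    """Toy stand-in for a multi-style multi-speaker TTS model."""
    def __init__(self):
        self.steps = 0  # count of training updates, for illustration

    def synthesize(self, text, speaker_id, style_id):
        # A real model would return a waveform or mel-spectrogram;
        # here we return a tagged record so the loop is runnable.
        return {"text": text, "speaker": speaker_id, "style": style_id}

    def train_step(self, batch):
        # Placeholder for one gradient update on a batch.
        self.steps += 1


def train_with_augmentation(model, corpus, target_speaker, styles, epochs=2):
    """Step 1 is assumed done: `model` is already trained on all data.
    Step 2: during further training, 'force' target_speaker to imitate
    each stylish speaker's style, synthesizing augmented utterances from
    the *latest* model state and mixing them into the training batches."""
    for _ in range(epochs):
        for utt in corpus:
            model.train_step([utt])            # original recorded data
            for style in styles:               # augmented data, created
                aug = model.synthesize(        # on-the-fly from the
                    utt["text"], target_speaker, style)  # current model
                model.train_step([aug])
    return model
```

Usage might look like `train_with_augmentation(model, neutral_corpus, "spk_A", ["news", "story"])`, after which the model has seen `spk_A`'s voice paired with styles that speaker never recorded.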