THESIS
2022
1 online resource (x, 58 pages) : illustrations (some color)
Abstract
Recent deep learning text-to-speech (TTS) systems synthesize natural speech. Applying
speaker adaptation can make a TTS system speak like the adapting speaker, but the speaking
style of the synthesized utterances still closely follows that of the speaker's training
utterances. In some applications, it is desirable to synthesize speech in a manner
that suits the scenario. A straightforward solution is to record speech data from
a speaker under different role-playing scenarios. However, apart from professional voice
talents, most people are not experienced in speaking in different expressive styles. Likewise,
without being exposed to a multilingual environment from an early age, most people
cannot speak a second language with a native accent. In this thesis, we propose
a novel data augmentation method to create a stylish TTS model for a speaker. Specifically,
augmented data are created by ‘forcing’ a speaker to imitate the stylish speech of
other speakers. Our proposed method consists of two steps. Firstly, all the data are used
to train a basic multi-style multi-speaker TTS model. Secondly, augmented utterances
are created on-the-fly from the latest TTS model during its training and are used to further train the TTS model. We select two applications to demonstrate the effectiveness of
our proposed method: (1) synthesizing speech in three scenarios — newscasting, public
speaking, and storytelling — for a speaker who provides only neutral speech data; (2)
synthesizing “beautified” speech of a language spoken by a non-native speaker by reducing
his/her accent, in terms of better pronunciation and more native-like prosody. Our
experiments show that for scenario-based TTS, the scenario speech synthesized by our
proposed method is overwhelmingly preferred over that from a speaker-adapted TTS
model. For accent-beautified TTS, our model reduces the foreign accent of non-native
speech while retaining higher voice similarity than a state-of-the-art accent conversion
model.
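The two-step procedure described in the abstract — pre-train a multi-style multi-speaker TTS model, then continue training while mixing in utterances synthesized on-the-fly from the latest model, with the target speaker "forced" into other speakers' styles — can be sketched as follows. All names here (`SimpleTTS`, `train_with_augmentation`, the style labels) are illustrative stand-ins, not identifiers or details from the thesis:

```python
# Hypothetical sketch of the two-step augmentation scheme; a toy model
# stands in for a real multi-style multi-speaker TTS network.

class SimpleTTS:
    """Toy stand-in for a multi-style multi-speaker TTS model."""
    def __init__(self):
        self.steps = 0  # count of training updates, for illustration

    def synthesize(self, text, speaker_id, style_id):
        # A real model would return a waveform or mel-spectrogram;
        # here we return a tagged record so the loop is runnable.
        return {"text": text, "speaker": speaker_id, "style": style_id}

    def train_step(self, batch):
        # Placeholder for one gradient update on a batch.
        self.steps += 1


def train_with_augmentation(model, corpus, target_speaker, styles, epochs=2):
    """Step 1 is assumed done: `model` is already trained on all data.
    Step 2: during further training, 'force' target_speaker to imitate
    each stylish speaker's style, synthesizing augmented utterances from
    the *latest* model state and mixing them into the training batches."""
    for _ in range(epochs):
        for utt in corpus:
            model.train_step([utt])            # original recorded data
            for style in styles:               # augmented data, created
                aug = model.synthesize(        # on-the-fly from the
                    utt["text"], target_speaker, style)  # current model
                model.train_step([aug])
    return model
```

Usage might look like `train_with_augmentation(model, neutral_corpus, "spk_A", ["news", "story"])`, after which the model has seen `spk_A`'s voice paired with styles that speaker never recorded.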