THESIS
2020
Abstract
Voice conversion (VC) is the task of converting a source speaker’s speech such that the
output speech sounds like it is uttered by a different target speaker. Earlier approaches
focus on finding a direct mapping function between a pair of source and target speakers,
which requires pairs of utterances with the same content to be available in the training
set. However, collecting pairs of utterances is often costly and time-consuming. Thus,
training VC models with unconstrained speech data is more desirable; this is sometimes
known as non-parallel VC. Recently, various deep learning methods, such as autoencoders,
variational autoencoders, and generative adversarial networks, have been proposed for
non-parallel VC. However, few of these methods are both easy to train and able to perform well.
In this thesis, we present a simple yet novel encoder-decoder framework for training a
non-parallel many-to-many VC model that can convert speech between any pair of (seen or
unseen) speakers in a non-parallel speech corpus. We propose to transfer knowledge from
the state-of-the-art multi-speaker text-to-speech (TTS) model, Mellotron, to the VC model
by adopting Mellotron’s decoder as the VC decoder. The model is trained on the LibriTTS
dataset with simple loss terms. Subjective evaluation shows that our proposed model
generates natural-sounding speech and outperforms the state-of-the-art non-parallel VC
model, AUTO-VC.
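
To make the proposed architecture concrete, the following is a minimal PyTorch-style
sketch of the encoder-decoder setup described above. It is illustrative only: the module
names (ContentEncoder, Decoder, VoiceConverter) are hypothetical, the small Decoder class
merely stands in for the pretrained Mellotron decoder transferred in the thesis, and the
L1 reconstruction objective is an assumption about what the "simple loss terms" might
look like.

import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Maps source mel frames to a (nominally) speaker-independent code."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        code, _ = self.rnn(mel)
        return code                              # (batch, frames, 2 * hidden)

class Decoder(nn.Module):
    """Stand-in for the transferred TTS decoder (Mellotron's in the thesis)."""
    def __init__(self, in_dim, n_mels=80):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 256), nn.Tanh(),
                                  nn.Linear(256, n_mels))

    def forward(self, x):
        return self.proj(x)

class VoiceConverter(nn.Module):
    """Encoder-decoder VC: content from the source, identity from the target."""
    def __init__(self, n_mels=80, hidden=128, spk_dim=64):
        super().__init__()
        self.encoder = ContentEncoder(n_mels, hidden)
        self.decoder = Decoder(2 * hidden + spk_dim, n_mels)

    def forward(self, src_mel, spk_emb):
        content = self.encoder(src_mel)
        # Broadcast the target speaker embedding across all frames.
        spk = spk_emb.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, spk], dim=-1))

model = VoiceConverter()
mel = torch.randn(4, 120, 80)            # a batch of source mel-spectrograms
spk = torch.randn(4, 64)                 # target speaker embeddings
converted = model(mel, spk)              # converted mels, same shape as input
loss = nn.functional.l1_loss(converted, mel)   # assumed reconstruction loss

Conditioning the decoder on a target speaker embedding at every frame is what lets a
single trained model convert between arbitrary speaker pairs, including speakers unseen
during training, as claimed in the abstract.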