THESIS
2021
1 online resource (xx, 142 pages) : illustrations (some color)
Abstract
Human beings can distinguish voices from different speakers when listening with only one ear.
Fundamental frequency (F0) and spectral envelope have been identified as the two most
important monaural cues utilized by listeners. In recent years, speech separation/recognition
models based on deep learning (DL) have been developed to simulate the human ability to
distinguish voices. However, the effects of F0 and spectral envelope on the performance of
DL-based speech separation models have received little attention.
Two experiments were conducted to evaluate how DL-based speech separation/recognition
models are affected by F0 and spectral envelope differences between concurrent speech signals.
Results indicated that, as with human listeners, the performance of the models is significantly
affected by F0 and spectral envelope differences. To the best of our knowledge, this is the
first comparison of its kind.
The effects of F0 modulation and spectral envelope were further studied with DL speech
separation and recognition models on concurrent vowels/syllables. Results of Experiments 3
and 4 indicated that as the spectral envelope differences between the concurrent
vowels/syllables increased, the model performance in separating them increased linearly.
This differs from the logarithmic trend observed in human listeners. Another interesting
finding is that vowel/syllable pairs with different F0 modulation patterns were easier to
separate but harder to recognize.
Two data augmentation methods, based on manipulation of F0 and spectral envelope, were
developed and studied for reducing overfitting in DL-based speech separation models.
Results of Experiments 5.1 to 5.3 indicated that, without data augmentation, overfitting
occurred in a typical end-to-end speech separation model when the training set contained
speech from fewer than 40 speakers. With either of the data augmentation methods,
overfitting was significantly reduced. The benefits of data augmentation were further shown
to generalize regardless of whether the mixed speech signals have similar or different F0
and spectral envelopes.