Roles of fundamental frequency and spectral envelope in deep learning-based speech separation and recognition

Hui Jun

THESIS 2021

Ph.D. Industrial Engineering and Decision Analytics

1 online resource (xx, 142 pages) : illustrations (some color)

Abstract

Human beings can distinguish the voices of different speakers even when listening with only one ear. Fundamental frequency (F0) and spectral envelope have been identified as the two most important monaural cues used by listeners. In recent years, speech separation/recognition models based on deep learning (DL) have been developed to simulate the human ability to distinguish voices. However, the effects of F0 and spectral envelope on the performance of DL-based speech separation models have received little attention.

Two experiments were conducted to evaluate how DL-based speech separation/recognition models are affected by F0 and spectral envelope differences between concurrent speech signals. Results indicated that, as with human listeners, the performance of the models is significantly affected by F0 and spectral envelope differences. To the best of our knowledge, this is the first comparison of its kind.

The effects of F0 modulation and spectral envelope were further studied with DL speech separation and recognition models on concurrent vowels/syllables. Results of Experiments 3 and 4 indicated that, as the spectral envelope differences between the concurrent vowels/syllables increased, model performance in separating them increased linearly. This differs from the logarithmic trend observed in human listeners. Another interesting finding is that vowel/syllable pairs with different F0 modulation patterns were easier to separate but harder to recognize.

Two data augmentation methods, based on manipulation of F0 and spectral envelope, were developed and studied for reducing overfitting in DL-based speech separation models. Results of Experiments 5.1 to 5.3 indicated that, without data augmentation, a typical end-to-end speech separation model overfitted when the training set contained speech from fewer than 40 speakers. With either of the data augmentation methods, overfitting was significantly reduced. The benefits of data augmentation were further shown to generalize regardless of whether the mixed speech signals had similar or different F0 and spectral envelopes.

Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: Ph.D.
Department: Industrial Engineering and Decision Analytics
Supervisors: So, Richard
Authors: Hui, Jun
Subjects: Automatic speech recognition; Data processing; Speech perception; Machine learning; Intonation (Phonetics); Frequency spectra
Language: English
Call number: Thesis IELM 2021 Hui
DOI: 10.14711/thesis-991012936366603412
