Speech recognition is a powerful and widely used technology. However, its performance is not sufficiently robust to variations in speech introduced by the operating environment, noise (its type and energy), and inter-speaker differences.
Speaker adaptation is an important technique for fine-tuning either features or speech models to compensate for the mismatch caused by inter-speaker variation. In the last decade, eigenvoice (EV) speaker adaptation has been developed. It uses prior knowledge of the training speakers to provide a fast adaptation algorithm; in other words, only a small amount of adaptation data is needed. Inspired by the kernel eigenface idea in face recognition, kernel eigenvoice (KEV) adaptation is proposed here. KEV is a non-linear generalization of EV. It incorporates Kernel Principal Component Analysis (KPCA), a non-linear version of Principal Component Analysis (PCA), to capture higher-order correlations in order to further explore the speaker space and enhance recognition performance. The major difficulty is that, with KEV adaptation, the adapted speaker model is estimated in the kernel feature space and may not have an exact pre-image in the input speaker-supervector space, yet observation likelihoods must be computed in the acoustic observation space for both adaptation and recognition. A composite kernel is proposed to solve this problem.
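To make the KPCA step concrete, the following is a minimal sketch of kernel PCA applied to a matrix of speaker supervectors. It assumes a plain Gaussian (RBF) kernel and randomly generated toy supervectors; it does not implement the composite kernel or the likelihood-based adaptation described above, and the names `rbf_kernel`, `fit_kpca`, `project` and the parameter `gamma` are hypothetical choices made for illustration only.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kpca(X, gamma, n_components):
    """Kernel PCA on training supervectors X (one row per speaker)."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    # Center the kernel matrix, i.e. center the data in feature space.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)          # ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]  # keep the largest
    lam, alpha = eigvals[idx], eigvecs[:, idx]
    # Scale so the implicit feature-space eigenvectors have unit norm.
    alpha = alpha / np.sqrt(lam)
    return {"X": X, "K": K, "alpha": alpha, "gamma": gamma}

def project(model, x_new):
    """Coordinates of one supervector on the leading kernel eigenvectors."""
    X, K, alpha = model["X"], model["K"], model["alpha"]
    n = X.shape[0]
    k = rbf_kernel(x_new[None, :], X, model["gamma"])  # shape (1, n)
    one_n = np.full((n, n), 1.0 / n)
    one_1 = np.full((1, n), 1.0 / n)
    # Center the new kernel row consistently with the training data.
    kc = k - one_1 @ K - k @ one_n + one_1 @ K @ one_n
    return (kc @ alpha).ravel()

# Toy data: 10 "speakers", 6-dimensional supervectors (purely synthetic).
rng = np.random.default_rng(0)
supervectors = rng.normal(size=(10, 6))
model = fit_kpca(supervectors, gamma=0.1, n_components=3)
coords = project(model, supervectors[0])
```

Note that `project` returns coordinates in the kernel feature space only; nothing in the sketch maps them back to a supervector, which is exactly the pre-image difficulty the abstract refers to.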
Experimental investigation on the TIDIGITS corpus, an English continuous digit recognition task, using 4 seconds of adaptation data shows that KEV adaptation gives a 21% relative improvement (RI) over the speaker-independent (SI) model, a 25% RI over MLLR adaptation, a 32% RI over MAP adaptation and a 32% RI over EV adaptation. When the speaker-adapted models from KEV are interpolated with the SI model, the RI increases to 32% over the SI model, 35% over MLLR adaptation, 41% over MAP adaptation and 32% over similarly interpolated EV adaptation.
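The abstract does not state which metric the RI figures are computed on. Assuming they are relative reductions in an error rate such as word error rate, the RI of an adapted system with error rate $E_{\text{adapted}}$ over a baseline with error rate $E_{\text{base}}$ would be:

```latex
\mathrm{RI} = \frac{E_{\text{base}} - E_{\text{adapted}}}{E_{\text{base}}} \times 100\%
```

As a purely hypothetical illustration, a baseline error rate of 1.00% reduced to 0.79% corresponds to an RI of 21%; these error rates are invented for the example and are not results from this work.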