THESIS
2013
xv, 125 pages : illustrations ; 30 cm
Abstract
The performance of modern speech recognition systems depends heavily on the
availability of sufficient training data. Although the recognition accuracy of a
speech recognition system trained on a large amount of data can exceed 90%,
accuracy is much lower when training data is inadequate. Acquiring large
amounts of manually transcribed speech is the major cost in deploying a speech
recognition system for any new language. There is therefore a strong demand for
techniques that make it practical to build speech recognition systems for
languages with limited training data, at an optimal tradeoff between
computational cost and recognition accuracy.
Previously, GMM-HMM based acoustic models with diagonal or full covariance
matrices were chosen heuristically according to the size of the training data.
Full covariance models are seldom used when training data is limited, since
they tend to overfit; diagonal covariance models, on the other hand, simply
assume that features are independent, which is an oversimplification. In this
dissertation, we propose regularized and sparse models to address the two
problems that conventional diagonal and full covariance models face: incorrect
model assumptions, and overfitting when training data is insufficient.
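As a rough illustration of this tradeoff (a minimal Python sketch, not taken
from the thesis; the dimensionality, sample sizes, and covariance below are
invented for demonstration), the snippet fits diagonal and full covariance
Gaussians to a small correlated sample and compares their held-out
log-likelihoods. The full covariance model must estimate d(d+1)/2 covariance
parameters from few samples, while the diagonal model estimates only d but
ignores all feature correlations.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    d = 10
    A = rng.standard_normal((d, d))
    true_cov = A @ A.T / d + 0.5 * np.eye(d)  # random, correlated, well-conditioned

    train = rng.multivariate_normal(np.zeros(d), true_cov, size=15)    # scarce data
    test = rng.multivariate_normal(np.zeros(d), true_cov, size=5000)   # held-out data

    mu = train.mean(axis=0)
    full_cov = np.cov(train, rowvar=False)  # full covariance: d*(d+1)/2 parameters
    diag_cov = np.diag(train.var(axis=0))   # diagonal: d parameters, independence assumed

    ll_full = multivariate_normal(mu, full_cov, allow_singular=True).logpdf(test).mean()
    ll_diag = multivariate_normal(mu, diag_cov).logpdf(test).mean()
    print(f"held-out avg log-likelihood  full: {ll_full:.2f}  diag: {ll_diag:.2f}")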
Three widely used regularization methods, namely ridge, lasso, and elastic net
regularization, are investigated in this thesis. Lasso and elastic net
regularization lead to sparse models, meaning that many entries of the
precision matrices are shrunk to zero. We also propose weighted lasso
regularization to train acoustic models with sparse banded precision matrices;
the resulting sparse banded models subsume traditional acoustic models with
diagonal or full covariance matrices as special cases. Regularization terms are
added to the traditional objective functions to penalize complex models, so
that the resulting models do not suffer from severe overfitting. We derive the
training procedure under an HMM training framework by maximizing the new
objective functions, and discuss other implementation issues. Both maximum
likelihood training and discriminative training are investigated.
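The penalized objective has the general form F = L - lambda * sum_{i,j} w_ij *
|P_ij|, where L is the usual (maximum likelihood or discriminative) objective,
P is a precision matrix, and w_ij are penalty weights (all ones for plain
lasso). As a standalone, hedged sketch of the idea, not the thesis's HMM
training procedure: scikit-learn's GraphicalLasso estimates a single
lasso-penalized precision matrix, and a distance-from-diagonal weight mask
(assumed here purely for illustration) conveys how a weighted lasso penalty can
favor banded structure.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(1)
    d = 8
    A = rng.standard_normal((d, d))
    X = rng.multivariate_normal(np.zeros(d), A @ A.T / d + np.eye(d), size=200)

    # Plain lasso: GraphicalLasso maximizes the log-likelihood minus an
    # alpha-scaled L1 penalty on the precision matrix, shrinking many
    # off-diagonal entries exactly to zero.
    P = GraphicalLasso(alpha=0.2).fit(X).precision_
    print("nonzero off-diagonal entries:", np.count_nonzero(np.triu(P, k=1)))

    # Weighted lasso toward a banded precision (illustrative assumption: the
    # weight grows with the distance |i - j| from the diagonal, so far-off-band
    # entries are penalized more heavily; GraphicalLasso itself accepts no
    # per-entry weights).
    i, j = np.indices((d, d))
    band_weights = np.abs(i - j).astype(float)

    # One soft-thresholding step on the empirical precision, mimicking the
    # shrinkage a weighted-lasso penalty induces. lam -> 0 keeps the full
    # covariance model intact; a large lam zeroes all off-diagonal entries,
    # recovering the diagonal model -- the two special cases mentioned above.
    emp_prec = np.linalg.inv(np.cov(X, rowvar=False))
    lam = 0.05
    P_banded = np.sign(emp_prec) * np.maximum(np.abs(emp_prec) - lam * band_weights, 0.0)
    print("widest band kept:", int(np.abs(i - j)[P_banded != 0].max()))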
Experimental results on three limited-size corpora, namely the Wall Street
Journal, Cantonese, and Mandarin data sets, show that our proposed models
significantly outperform conventional diagonal or full covariance models in
recognition accuracy. Based on our experimental results, lasso regularization
is recommended over the other regularization schemes. We also found that sparse
banded models require less computation.