THESIS
2004
x, 84 leaves : ill. ; 30 cm
Abstract
Although hidden Markov models (HMMs) are the most commonly used models in continuous speech recognition, their weaknesses, such as poor modeling of duration and of the correlation between speech frames, can limit recognition performance. One alternative model proposed to address these shortcomings is the polynomial segment model (PSM). The PSM represents the mean trajectory of an acoustic speech segment as a polynomial function of time. Because all the frames within a segment are evaluated jointly, segmental models can better describe the correlations between frames. Previous results showed that PSMs outperformed HMMs in phone classification tasks.
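As a rough illustration of the polynomial mean trajectory, the sketch below scores one segment under a PSM. It assumes normalized time on [0, 1], a diagonal covariance shared by all frames, and NumPy arrays for the features; the function names `design_matrix` and `psm_log_likelihood` are hypothetical and are not taken from the thesis.

```python
import numpy as np

def design_matrix(num_frames, order):
    """Rows are [1, t, t^2, ...] with t spaced uniformly on [0, 1]."""
    t = np.linspace(0.0, 1.0, num_frames)
    return np.vander(t, order + 1, increasing=True)

def psm_log_likelihood(frames, coeffs, var):
    """Log-likelihood of one segment under a PSM (illustrative assumptions).

    frames: (N, D) observations of the segment
    coeffs: (order + 1, D) polynomial trajectory coefficients
    var:    (D,) per-dimension variance, shared by every frame (assumption)
    """
    num_frames = frames.shape[0]
    order = coeffs.shape[0] - 1
    mean = design_matrix(num_frames, order) @ coeffs   # (N, D) mean trajectory
    resid = frames - mean
    # Sum of independent Gaussian log-densities over frames and dimensions.
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + resid**2 / var))
```

Under these assumptions, the trajectory coefficients for a single segment could be fitted by least squares, for example with `np.linalg.lstsq(design_matrix(N, order), frames, rcond=None)[0]`.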
While segmental models are better at capturing inter-frame correlations, they are inferior to the HMM in two respects. First, the joint estimation of observation likelihoods within a segment implies that any change in a segment boundary requires recomputing the likelihoods of all the frames within that segment. Thus, the computational requirements of PSMs in both training and recognition are so much higher that exact solutions for PSMs are not feasible. Second, typical segment models use a uniform mapping to align the observations within a segment to the model, which may fail to capture the warping effects within a segment.
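The recomputation cost can be seen in a small sketch. It assumes that the uniform mapping is a linear normalization of frame positions onto [0, 1], which the abstract does not spell out; `uniform_times` is a hypothetical helper.

```python
import numpy as np

def uniform_times(num_frames):
    """Normalized time stamps under an assumed uniform mapping onto [0, 1]."""
    return np.linspace(0.0, 1.0, num_frames)

short = uniform_times(5)    # time stamps of a 5-frame segment
longer = uniform_times(6)   # same start, boundary moved one frame later

# Every frame except the first lands at a different normalized time, so the
# model mean (and hence the likelihood term) of each such frame changes.
moved = ~np.isclose(longer[:5], short)
```

Under this mapping, moving the boundary by a single frame changes the time stamp of almost every earlier frame, which is why none of the previously computed frame likelihoods can be reused directly.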
In this thesis we propose an innovative and efficient likelihood evaluation approach in which the likelihoods of speech segments with the same starting time can be computed incrementally. Specifically, the likelihood of a new segment can be computed from the likelihood of an existing segment of shorter duration plus the observation probability of the newly added frames. Based on this incremental approach, expectation-maximization (EM) based training and dynamic programming-based recognition algorithms are proposed. These algorithms reduce the computational cost to a level similar to that of the HMM.
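One way to picture the incremental computation is sketched below. It assumes that frame i of a segment sits at a fixed time i * step regardless of the segment's final length, so that extending the segment leaves earlier terms untouched. This is an assumption made for illustration and not necessarily the formulation used in the thesis; the function names and the `step` parameter are hypothetical.

```python
import numpy as np

def frame_log_prob(frame, t, coeffs, var):
    """Log N(frame; mean(t), diag(var)) with mean(t) = sum_k coeffs[k] * t**k."""
    powers = t ** np.arange(coeffs.shape[0])     # [1, t, t^2, ...]
    resid = frame - powers @ coeffs
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + resid**2 / var))

def incremental_log_likelihoods(frames, start, max_len, coeffs, var, step=0.1):
    """Log-likelihoods of all segments beginning at `start`, lengths 1..max_len.

    Each longer segment reuses the score of the shorter one and adds only the
    log-probability of the newly appended frame (assumed fixed time axis).
    """
    scores = []
    running = 0.0
    for i in range(max_len):
        running += frame_log_prob(frames[start + i], i * step, coeffs, var)
        scores.append(running)
    return scores
```

With this arrangement, scoring all `max_len` candidate end points for one start time costs one frame evaluation per end point, rather than a full rescoring of every frame for each candidate length.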
With the proposed efficient training and search algorithms for the PSM, the concept of the sub-segments PSM is proposed as an improvement to the PSM. It captures the warping effect within a segment and greatly increases the flexibility of the PSM in modeling acoustic variations. The improved PSMs, together with the recognition and training algorithms, perform better than HMMs in both recognition and classification tasks on the TIMIT corpus.