THESIS
2002
xiii, 96 leaves : ill. ; 30 cm
Abstract
Most speech recognition systems use acoustic features (ACF) as representations of the acoustic signal for phone modeling. Recently, articulatory features (AF) have been proposed as an alternative representation. Articulatory features characterize important properties of speech production, such as lip rounding and tongue position, and are thus more phonologically meaningful. Previous research on AF systems has shown that their recognition performance is comparable to that of ACF systems, and that combining the two outperforms either system alone.
Conventional AF systems use a single model to represent one phone, whether operating alone or combined with ACF models. In our approach, multiple ways to improve the combination of AF models and ACF models were investigated. First, we increased the resolution of the AF models and allowed asynchronous combination between the states of the two models during recognition. With the asynchronous combination, the alignments of the AF model states and ACF model states were made more flexible. This resulted in an absolute improvement of 3.37% in phone recognition accuracy. Second, the parameters of the ACF models and AF models were jointly estimated. The joint estimation allowed the ACF (HMM) models to take into account the contributions from the AF models and vice versa. The gain from the joint estimation was about 1%. It also simplified the combination by reducing the gap between synchronous and asynchronous state alignments. Third, besides phone recognition, the AF model was used to compute confidence measures on the recognition outputs. Our work on confidence measures again demonstrated that AF models and ACF models make different errors and that their combination is superior to either one by itself.
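A common way to combine two feature streams of the kind described above is a weighted log-linear interpolation of their per-frame scores. The sketch below is only an illustration of that general scheme, with made-up log-likelihood values and a hypothetical weight; it is not the thesis's actual combination or joint-estimation procedure.

```python
# Hypothetical per-frame log-likelihoods for one candidate phone,
# one value per frame, from an acoustic-feature (ACF) model and an
# articulatory-feature (AF) model. The numbers are illustrative only.
acf_loglik = [-2.1, -1.8, -2.5, -2.0]
af_loglik = [-2.4, -1.9, -2.2, -2.3]

def combine_streams(acf, af, acf_weight=0.5):
    """Log-linear combination of two synchronous feature streams.

    Each frame's combined score is acf_weight * acf + (1 - acf_weight) * af.
    This assumes frame-synchronous alignment; asynchronous combination,
    as in the thesis, would additionally relax which states align.
    """
    return [acf_weight * a + (1.0 - acf_weight) * b for a, b in zip(acf, af)]

combined = combine_streams(acf_loglik, af_loglik, acf_weight=0.6)
total_score = sum(combined)  # overall score for this phone hypothesis
```

In practice the stream weight would be tuned on held-out data, and the combined score compared across competing phone hypotheses during decoding.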
Besides being used for phone recognition and confidence measure, AF models can also be applied to other speech related tasks. The knowledge learned from combining AF and ACF models during recognition and estimation can also be generalized to the combination of other models.