THESIS
2011
xi, 74 p. : ill. ; 30 cm
Abstract
In this thesis, we design a common questionnaire of stress-inducing and non-stress-inducing questions in English, Mandarin and Cantonese, and collect the first multilingual corpus of natural stress emotion. Most emotional speech databases contain simulated emotions [4], collected from professional actors. The majority of the remaining natural emotion databases cover the universal emotions (e.g. happiness, anger, sadness). In the few databases of natural stress emotion, the stress is elicited under extreme conditions, which is unsuitable for our purpose: building everyday applications requires daily-life emotions. We interviewed 37 Mandarin, 36 English and 36 Cantonese native speakers of both genders, recording both stressed and comparatively neutral speech from each, giving a total of around 4 hours of speech per database.
We present a systematic study of emotional stress in university students using only acoustic features, and compare the results with linguistic features. Avoiding linguistic features is desirable because they are expensive: they must be obtained from accurate transcriptions produced either by an ASR system or by humans. We found that, given the same amount of data, acoustic features yield higher performance than linguistic features. Previous studies on stressed speech recognition used TEO-based features [1] [5], whereas emotion recognition systems use standard low-level descriptors to detect the universal emotions. We extracted 560 acoustic features, including low-level descriptors and Teager Energy Operator ("TEO") based features, and integrated them in our system. Using both groups together performs better than using either group alone. Our acoustic feature-based classifier recognizes stress in the subjects' speech with 83.96% accuracy on average within the same gender and language group, largely outperforming human perception tests, which achieved only 39.27% without knowledge of the linguistic content.
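The Teager Energy Operator mentioned above has a standard discrete-time definition, ψ[x(n)] = x(n)² − x(n−1)·x(n+1). As a minimal illustration (not the thesis' actual feature-extraction pipeline, which builds 560 features on top of such primitives), it can be computed over a signal like this:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator:
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1).
    Returns len(x) - 2 values (the two edge samples are dropped)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*sin(omega*n), the TEO output is the constant
# A^2 * sin(omega)^2, which tracks both amplitude and frequency.
n = np.arange(200)
tone = 0.5 * np.sin(0.3 * n)
te = teager_energy(tone)
```

For a clean sinusoid the operator's output is exactly constant, which is why TEO-based features are a compact way to capture amplitude/frequency modulation effects associated with vocal stress.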
We study cross-gender differences in the stress emotion in speech and show that classification accuracy decreases when the model of one gender is used to detect stress in the other gender, even within the same language, with an average of 27.13% across the three languages. This implies that the acoustic features that model emotional stress differ between genders.
We investigate cross-lingual differences in the acoustic features for stress emotion recognition in speech. Our system maintains approximately the same performance across the three languages, with average classification accuracies of 76.91% for male subjects and 80.01% for female subjects. Feature-ranking experiments show that the most important stress features are TEO-based features and MFCCs, rather than pitch. This explains the relative language-independence of our model, even though Mandarin is a tonal language.
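One common way to rank features against a binary stressed/neutral label, sketched here as a hypothetical stand-in for the thesis' unspecified ranking method, is to score each feature column by its absolute correlation with the label:

```python
import numpy as np

def rank_features(X, y):
    """Rank the columns of feature matrix X (utterances x features)
    by absolute Pearson correlation with the binary stress label y
    (1 = stressed, 0 = neutral). Returns column indices, most
    discriminative first. Illustrative only; the thesis does not
    specify its ranking criterion."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    r = num / den
    return np.argsort(-np.abs(r))

# Synthetic demo: 3 noise features plus one (column 2) tied to the label.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 50)
X = rng.normal(size=(100, 4))
X[:, 2] = y + 0.1 * rng.normal(size=100)
order = rank_features(X, y)  # column 2 should rank first
```

With such a ranking in hand, comparing the top-ranked columns across language-specific datasets is one way to check whether the same feature groups (here, TEO and MFCC features rather than pitch) dominate in each language.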