THESIS
2023
1 online resource (ix, 87 pages) : illustrations (chiefly color)
Abstract
Difficulties in eliciting substantial spoken data from speaker populations of interest and producing
the accompanying transcripts result in low-resource scenarios in which the development
of robust automatic speech recognition (ASR) systems may be hindered. With the aid of
a large volume of unlabeled audio data, self-supervised speech representation learning may
address this limitation by learning a model-based feature extractor via a proxy task in advance,
thus offering pre-trained representations transferable to the ASR task for fine-tuning.
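As a concrete illustration of this pre-training-and-fine-tuning paradigm, the following is a minimal sketch of the fine-tuning stage using the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name and the placeholder audio and text are illustrative assumptions, not the setup used in this dissertation.

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative public checkpoint (an assumption; not this work's model).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# The convolutional feature encoder is conventionally frozen when
# fine-tuning wav2vec 2.0 for ASR.
model.freeze_feature_encoder()

# One illustrative CTC training step on placeholder 16 kHz audio.
waveform = np.random.randn(16000).astype(np.float32)  # 1 s of fake speech
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

loss = model(inputs.input_values, labels=labels).loss  # CTC loss
loss.backward()  # an optimizer step would follow in a real training loop
```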
This dissertation reviews current self-supervised speech representation learning methodologies
and investigates the application of wav2vec 2.0 ASR to CU-MARVEL, a corpus under development, in order to provide automatic transcripts that streamline its human transcription work. This corpus comprises spontaneous responses from Cantonese-speaking older adults in Hong Kong, a unique setting in which both the language and the population are low-resource.
We contribute a Cantonese wav2vec 2.0 model that is pre-trained on audio data obtained from the web and segmented using end-to-end neural diarization methods.
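The segmentation step might look like the sketch below, which uses pyannote.audio's diarization pipeline as an illustrative stand-in; the dissertation's actual end-to-end neural diarization toolchain is not reproduced here, and the checkpoint and file names are assumptions.

```python
from pyannote.audio import Pipeline

# Illustrative stand-in for diarization-based segmentation; a Hugging Face
# access token may be required for this gated public checkpoint.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

diarization = pipeline("web_audio.wav")  # hypothetical downloaded audio

# Collect (start, end, speaker) triples, from which utterance-level clips
# can be cut to form unlabeled pre-training data.
segments = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]
```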
We evaluate the usefulness of further pre-training on in-domain data and of semi-supervised learning by pseudo-labeling for ASR under the pre-training-and-fine-tuning paradigm. Given the availability of cross-lingual wav2vec 2.0 models, we also compare the downstream performance of the monolingual pre-trained model to that of the cross-lingual 300M XLS-R model, and assess whether a monolingual pre-trained model is necessary. We benchmark our results against those obtained from parallel experiments on the English LibriSpeech corpus.
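The pseudo-labeling step mentioned above can be sketched as follows: a model already fine-tuned on the labeled subset transcribes unlabeled audio, and the automatic transcripts serve as targets for a further fine-tuning round. The checkpoint name is again an illustrative assumption.

```python
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# A model assumed to be already fine-tuned on the small labeled set.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def pseudo_label(waveform: np.ndarray) -> str:
    """Greedily CTC-decode unlabeled 16 kHz audio into a pseudo-transcript."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# The resulting (audio, pseudo-transcript) pairs are then mixed with the
# gold-labeled data for another round of supervised fine-tuning.
```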
Our best-performing model for CU-MARVEL is the 300M XLS-R further pre-trained in two stages: first adapting to the target language and then specializing to the target domain. On participants’ speech it achieves a 23.1% relative reduction in character error rate (CER) over the vanilla XLS-R baseline.
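For clarity, the reported figure is a relative reduction, i.e. the fraction of the baseline's error removed; the sketch below uses hypothetical CER values, not the numbers reported in this dissertation.

```python
def relative_reduction(baseline_cer: float, system_cer: float) -> float:
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_cer - system_cer) / baseline_cer

# Hypothetical illustration: a baseline CER of 30.0% reduced by 23.1%
# relatively corresponds to an absolute CER of about 23.1%.
baseline = 0.300
system = baseline * (1 - 0.231)  # ~0.231
print(f"{relative_reduction(baseline, system):.1%}")  # 23.1%
```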
This dissertation concludes by suggesting directions for future research.