THESIS
2022
1 online resource (xi, 92 pages) : illustrations (some color)
Abstract
Speaker verification (SV) is the process of verifying whether an utterance belongs to the
claimed speaker, based on some reference utterances.
Learning effective and discriminative speaker embeddings is a central theme in
speaker verification. In this thesis, we focus on speaker embedding learning for
text-independent SV tasks and present three methods for learning better speaker
embeddings.
The first one is the self-attentive speaker embedding learning method. Usually, speaker
embeddings are extracted from a speaker-classification neural network that averages the
hidden vectors over all the spoken frames of a speaker; the hidden vectors produced
from all the frames are assumed to be equally important. We relax this assumption and
compute the speaker embedding as a weighted average of a speaker's frame-level hidden
vectors, with the weights determined automatically by a self-attention mechanism.
The effect of multiple attention heads is also investigated to capture different aspects of
a speaker’s input speech.
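As a rough illustration of this pooling idea (not the thesis's exact architecture or notation), a self-attentive layer can score each frame, softmax the scores into weights, and take the weighted average; multiple heads simply repeat this with separate parameters and concatenate the results. The parameter names `W` and `v` below are illustrative assumptions.

```python
import numpy as np

def self_attentive_pooling(H, W, v):
    """Weighted average of frame-level hidden vectors H (T x D).

    W (D x A) and v (A,) play the role of learnable attention
    parameters; in a real system they are trained jointly with
    the speaker-classification network.
    """
    scores = np.tanh(H @ W) @ v            # (T,): one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over frames
    return weights @ H                     # (D,): utterance-level embedding

def multi_head_pooling(H, Ws, vs):
    """One weighted average per head, concatenated into a longer embedding."""
    return np.concatenate(
        [self_attentive_pooling(H, W, v) for W, v in zip(Ws, vs)]
    )
```

Mean pooling is the special case where every frame receives the same weight 1/T.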
The second method generalizes the multi-head attention in the Bayesian attention
framework, where the standard deterministic multi-head attention can be viewed as a
special case. In the Bayesian attention framework, the parameters of the attention heads
share a common distribution, so their updates are coupled rather than independent. This
coupling helps alleviate the attention
redundancy problem. It also provides a theoretical understanding of the benefits of applying
multi-head attention. Based on the Bayesian attention framework, we propose a
Bayesian self-attentive speaker embedding learning algorithm.
The third method introduces channel attention to the embedding learning framework,
and analyzes the channel attention from the perspective of frequency analysis. Frequency-domain
pooling methods are then proposed to enhance the channel attention and produce
better speaker embeddings.
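The frequency-analysis view can be sketched as follows: global average pooling, the usual "squeeze" step of channel attention, is (up to scale) just the 0-th DCT component of a channel's temporal signal, so pooling against several DCT basis vectors retains more frequency information before the gating step. This is an illustrative squeeze-and-excitation-style sketch, not the thesis's exact architecture; the function names and the choice of frequencies are assumptions.

```python
import numpy as np

def dct_basis(T, k):
    """k-th DCT-II basis vector over T frames (k = 0 is a constant vector)."""
    n = np.arange(T)
    return np.cos(np.pi * k * (2 * n + 1) / (2 * T))

def frequency_channel_attention(X, freqs=(0, 1, 2)):
    """Channel attention where the pooled channel descriptor uses several
    DCT frequency components instead of the mean alone. X is (T, C):
    T frames, C channels.
    """
    T, C = X.shape
    # project each channel onto a few DCT basis vectors and average
    desc = sum(dct_basis(T, k) @ X for k in freqs) / len(freqs)  # (C,)
    gates = 1.0 / (1.0 + np.exp(-desc))                          # sigmoid gates in (0, 1)
    return X * gates                                             # channel-reweighted features
```

With `freqs=(0,)` this reduces to ordinary mean-based channel attention, which makes the frequency-domain pooling a strict generalisation of the standard squeeze step.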
The proposed embedding learning methods are evaluated systematically on
several evaluation sets, achieving significant and consistent improvements over
state-of-the-art systems on all of them.