THESIS
2022
1 online resource (xi, 92 pages) : illustrations (some color)
Abstract
Speaker verification (SV) is the process of verifying whether an utterance belongs to the
claimed speaker, based on some reference utterances.
Learning effective and discriminative speaker embeddings is a central theme in
speaker verification. In this thesis, we focus on speaker embedding learning for
text-independent SV tasks and present three methods for learning better speaker
embeddings.
The first one is the self-attentive speaker embedding learning method. Usually, speaker
embeddings are extracted from a speaker-classification neural network that averages the
hidden vectors over all the spoken frames of a speaker; the hidden vectors produced
from all the frames are assumed to be equally important. We relax this assumption and
compute the speaker embedding as a weighted average of a speaker's frame-level hidden
vectors, with the weights determined automatically by a self-attention mechanism.
The effect of multiple attention heads is also investigated to capture different aspects of
a speaker’s input speech.
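As a rough illustration of this pooling idea (not the thesis's exact architecture or notation), a self-attentive layer can score each frame, softmax the scores into weights, and take the weighted average; multiple heads simply repeat this with separate parameters and concatenate the results. The parameter names `W` and `v` below are illustrative assumptions.

```python
import numpy as np

def self_attentive_pooling(H, W, v):
    """Weighted average of frame-level hidden vectors H (T x D).

    W (D x A) and v (A,) play the role of learnable attention
    parameters; in a real system they are trained jointly with
    the speaker-classification network.
    """
    scores = np.tanh(H @ W) @ v            # (T,): one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over frames
    return weights @ H                     # (D,): utterance-level embedding

def multi_head_pooling(H, Ws, vs):
    """One weighted average per head, concatenated into a longer embedding."""
    return np.concatenate(
        [self_attentive_pooling(H, W, v) for W, v in zip(Ws, vs)]
    )
```

Mean pooling is the special case where every frame receives the same weight 1/T.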
The second method generalizes the multi-head attention in the Bayesian attention
framework, where the standard deterministic multi-head attention can be viewed as a
special case. In the Bayesian attention framework, the parameters of the attention heads
share a common distribution, so their updates are coupled rather than independent. This
coupling helps alleviate the attention
redundancy problem. It also provides a theoretical understanding of the benefits of applying
multi-head attention. Based on the Bayesian attention framework, we propose a
Bayesian self-attentive speaker embedding learning algorithm.
The third method introduces channel attention to the embedding learning framework,
and analyzes the channel attention from the perspective of frequency analysis. Frequency-domain
pooling methods are then proposed to enhance the channel attention and produce
better speaker embeddings.
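The frequency-analysis view can be sketched as follows: global average pooling, the usual "squeeze" step of channel attention, is (up to scale) just the 0-th DCT component of a channel's temporal signal, so pooling against several DCT basis vectors retains more frequency information before the gating step. This is an illustrative squeeze-and-excitation-style sketch, not the thesis's exact architecture; the function names and the choice of frequencies are assumptions.

```python
import numpy as np

def dct_basis(T, k):
    """k-th DCT-II basis vector over T frames (k = 0 is a constant vector)."""
    n = np.arange(T)
    return np.cos(np.pi * k * (2 * n + 1) / (2 * T))

def frequency_channel_attention(X, freqs=(0, 1, 2)):
    """Channel attention where the pooled channel descriptor uses several
    DCT frequency components instead of the mean alone. X is (T, C):
    T frames, C channels.
    """
    T, C = X.shape
    # project each channel onto a few DCT basis vectors and average
    desc = sum(dct_basis(T, k) @ X for k in freqs) / len(freqs)  # (C,)
    gates = 1.0 / (1.0 + np.exp(-desc))                          # sigmoid gates in (0, 1)
    return X * gates                                             # channel-reweighted features
```

With `freqs=(0,)` this reduces to ordinary mean-based channel attention, which makes the frequency-domain pooling a strict generalisation of the standard squeeze step.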
The proposed embedding learning methods are evaluated systematically on
several evaluation sets, achieving significant and consistent improvements over
state-of-the-art systems on all of them.