THESIS
2018
xi, 56 pages : illustrations ; 30 cm
Abstract
Over the past few years, there has been a resurgence of interest in using recurrent neural
network-hidden Markov models (RNN-HMM) for automatic speech recognition (ASR).
Some modern recurrent network models, such as long short-term memory (LSTM) and
simple recurrent unit (SRU), have demonstrated promising results on this task. Recently,
several scientific perspectives in the fields of neuroethology and speech production suggest
that human speech signals may be represented in discrete point patterns involving
acoustic events in the speech signal. If this hypothesis holds, it poses several challenges
for RNN-HMM acoustic modeling: firstly, the model arbitrarily discretizes the continuous
input into interval features at a fixed frame rate, which may introduce discretization
errors; secondly, the occurrences of such acoustic events are unknown. Furthermore, the
training targets of RNN-HMM are obtained from other (inferior) models, giving rise to
misalignments.
On the other hand, the temporal point process is a powerful mathematical tool to describe
the latent mechanisms governing the occurrences of observed random events. It is
a random process whose realization consists of a sequence of isolated events with their
time-stamps. Due to their generality, point processes have been widely used for modeling
phenomena such as earthquakes, human activities, financial data, context-aware recommendations,
etc. Major research in this area focuses on exploring the observed event data
to model the underlying dynamics of the system, while our work attempts to deal with
the situation where acoustic events are not available/observed even during training.
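As a point of reference for the definition above, the simplest temporal point process is a homogeneous Poisson process, whose realization is a sorted list of event time-stamps. The sketch below is only an illustration of that definition, not part of the proposed model; the function name `sample_poisson_process` and the values of `rate` and `horizon` are assumptions chosen for the example.

```python
import random


def sample_poisson_process(rate, horizon, seed=0):
    """Draw one realization of a homogeneous Poisson process on [0, horizon).

    rate    -- constant intensity (expected events per unit time); assumed value
    horizon -- length of the observation window; assumed value
    Returns the event time-stamps in increasing order.
    """
    rng = random.Random(seed)
    t = 0.0
    events = []
    while True:
        # Inter-arrival times of a Poisson process are exponentially distributed.
        t += rng.expovariate(rate)
        if t >= horizon:
            return events
        events.append(t)


# One realization: a sequence of isolated events with their time-stamps.
events = sample_poisson_process(rate=5.0, horizon=2.0)
```

In the setting described above, such time-stamps are what would normally be observed; the thesis instead addresses the case where they are latent.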
In this paper, we propose a recurrent Poisson process (RPP) which can be seen as a
collection of Poisson processes at a series of time intervals in which the intensity evolves
according to the RNN hidden states that encode the history of the acoustic signal. It aims
at allocating the latent acoustic events in continuous time. Such events are efficiently
drawn from the RPP using a sampling-free solution in an analytic form. The speech signal
containing latent acoustic events is reconstructed/sampled dynamically from the discretized
acoustic features using linear interpolation, in which the weight parameters are
estimated from the onset of these events. The above processes are further integrated into
an SRU, forming our final model, called recurrent Poisson process unit (RPPU). Experimental
evaluations on ASR tasks including CHiME-2, WSJ0 and WSJ0&1 demonstrate the
effectiveness and benefits of the RPPU. For example, it achieves a relative WER reduction
of 10.7% over state-of-the-art models on WSJ0.
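To make the idea of a sampling-free, analytic event time followed by linear interpolation more concrete, here is a minimal sketch under stated assumptions. It is not the thesis's derivation: it assumes a constant intensity within one frame interval, takes the expected first-arrival time conditioned on an event falling inside the interval (the mean of a truncated exponential distribution), and uses that time as the interpolation weight between two neighboring feature frames. The names `expected_onset` and `interpolate_frame`, the 10 ms frame shift, and the toy feature values are all hypothetical.

```python
import math


def expected_onset(lam, T):
    """Closed-form expected time of the first event in [0, T), given that one occurs.

    Assumes a Poisson process with constant intensity lam > 0 on the interval,
    so the first-arrival time is exponential and is truncated to [0, T).
    """
    x = lam * T
    # Mean of a truncated exponential: 1/lam - T * exp(-x) / (1 - exp(-x)).
    return 1.0 / lam - T * math.exp(-x) / (-math.expm1(-x))


def interpolate_frame(x_left, x_right, w):
    """Linearly interpolate two feature vectors with weight w in [0, 1]."""
    return [(1.0 - w) * a + w * b for a, b in zip(x_left, x_right)]


frame_shift = 0.01                      # assumed 10 ms frame shift
onset = expected_onset(lam=200.0, T=frame_shift)
w = onset / frame_shift                 # relative position of the event in the interval
frame = interpolate_frame([0.0, 1.0], [1.0, 3.0], w)
```

Because the onset is a smooth function of the intensity, a weight obtained this way remains differentiable, which is one plausible reason to prefer an analytic form over drawing samples.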