THESIS
2018
xiii, 66 pages : illustrations ; 30 cm
Abstract
Visual speech recognition (a.k.a. lip-reading) is the task of recognizing speech solely
from the visual movement of the mouth. In this work, we propose multiple feasible
and practical strategies and demonstrate significant improvements over established
competitive baselines in both low-resource and resource-rich scenarios.
On one hand, a main challenge in practical automatic lip-reading is dealing with
the diverse facial viewpoints present in the available video data. The recently proposed
spatial transformer enhances the spatial invariance of convolutional neural networks
to their input, and it has shown varying degrees of success across a broad spectrum of
areas, including face recognition, facial alignment and gesture recognition, by virtue of
the increased model robustness to viewpoint variations in the data. We study the
effectiveness of the learned spatial transformation in our model through quantitative
and qualitative analysis with visualizations, and by incorporating a spatial transformer
we attain an absolute accuracy gain of 0.92% over our data-augmented baseline on the
resource-rich Lip Reading in the Wild (LRW) continuous word recognition task.
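For illustration only, a minimal PyTorch-style sketch of a spatial transformer front-end that warps each mouth crop before it reaches the recognition network is given below. The 64x64 single-channel crop size, the localization-network layer widths, and the identity initialization are assumptions for this example, not the thesis's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    # Learns an affine warp that normalizes mouth-region viewpoint
    # before the frames reach the recognition CNN (illustrative sketch).
    def __init__(self):
        super().__init__()
        # Localization network: predicts 6 affine parameters per frame.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(True), nn.Linear(32, 6)
        )
        # Start from the identity transform so training is stable.
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):  # x: (batch, 1, 64, 64) mouth crops (assumed size)
        theta = self.fc_loc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

In this arrangement the transformer's output frames are fed to the word-recognition CNN unchanged, so the module can be trained end-to-end with the rest of the model.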
On the other hand, we explore the effectiveness of the convolutional neural network
(CNN) and the long short-term memory (LSTM) recurrent neural network for lip-reading
in a low-resource scenario that has not been explored before. We propose an end-to-end
deep learning model that fuses a conventional CNN with a bidirectional LSTM (BLSTM),
together with maxout activation units (maxout-CNN-BLSTM) and dropout. It attains a
word accuracy of 87.6% on the low-resource OuluVS2 corpus, an absolute improvement
of 3.1% over the previous state-of-the-art autoencoder-BLSTM model at that time.
Notably, our lip-reading system requires neither a separate feature-extraction stage nor
a pre-training phase with external data resources.
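For illustration, a minimal PyTorch-style sketch of a maxout-CNN-BLSTM word classifier follows. The layer widths, dropout rate, pooling sizes, and number of output classes are assumptions made for the example and are not the thesis's reported configuration.

import torch
import torch.nn as nn

class MaxoutConv(nn.Module):
    # Convolution followed by a maxout unit: k parallel feature maps
    # are produced and the element-wise maximum is kept.
    def __init__(self, in_ch, out_ch, k=2, **kw):
        super().__init__()
        self.k = k
        self.conv = nn.Conv2d(in_ch, out_ch * k, **kw)

    def forward(self, x):
        y = self.conv(x)
        b, _, h, w = y.shape
        return y.view(b, -1, self.k, h, w).max(dim=2).values

class MaxoutCNNBLSTM(nn.Module):
    # End-to-end sketch: per-frame maxout-CNN features -> BLSTM over the
    # frame sequence -> dropout -> word classifier.
    def __init__(self, num_words=10, hidden=256):  # num_words is an assumption
        super().__init__()
        self.frontend = nn.Sequential(
            MaxoutConv(1, 32, kernel_size=3, padding=1), nn.MaxPool2d(2),
            MaxoutConv(32, 64, kernel_size=3, padding=1), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.blstm = nn.LSTM(64 * 4 * 4, hidden,
                             batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.5)
        self.classifier = nn.Linear(2 * hidden, num_words)

    def forward(self, frames):  # frames: (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.frontend(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        out, _ = self.blstm(feats)
        # Classify from the final BLSTM state over the frame sequence.
        return self.classifier(self.drop(out[:, -1]))

Because the CNN frontend and the BLSTM are trained jointly from raw mouth-region frames, no separate feature-extraction stage or external pre-training data is needed, which is the property emphasized above.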