Unlike in image recognition, human actions in video sequences are three-dimensional
(3D) spatio-temporal signals characterizing both the visual appearance
and motion dynamics of the involved humans and objects. Previous research
has mainly focused on using hand-designed local features, such as SIFT, HOG and
SURF, to solve the video-based recognition problem. However, these approaches
are complex to implement and difficult to extend to real-world data.
Inspired by the success of deeply learned features for image classification, recent
attempts have been made to learn deep features for video analysis. However,
unlike in image analysis, few deep learning models have been proposed for video
analysis, and only limited success has been reported. In particular, most such
models either deal with simple datasets or rely on low-level local
spatio-temporal features for the final prediction.
Transferring the success of two-dimensional (2D) Convolutional Neural Networks
(CNNs) to videos by implementing 3D CNNs is a direct approach for action
recognition. However, partially due to the high complexity of training 3D convolution
kernels and the need for large quantities of training videos, only limited
success has been reported. Therefore, we investigate a new deep architecture
which can handle 3D signals more effectively. We propose a factorized spatio-temporal convolutional network (FSTCN) structure which factorizes the original
3D convolution kernel learning as a sequential process of learning 2D spatial kernels
in the lower layers (called spatial convolutional layers), followed by learning
1D temporal kernels in the upper layers (called temporal convolutional layers).
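As a minimal sketch of this factorization (assuming PyTorch; the module name, layer sizes, and clip dimensions below are illustrative, not the thesis implementation), a 3D convolution can be split into a 2D spatial convolution with no temporal extent followed by a 1D temporal convolution with no spatial extent:

```python
# Illustrative sketch: factorizing 3D convolution into 2D spatial + 1D temporal
# convolutions over a clip of shape (batch, channels, time, height, width).
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels,
                 spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        # 2D spatial kernels (lower layers): applied per frame, no temporal extent.
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, spatial_kernel, spatial_kernel),
                                 padding=(0, spatial_kernel // 2, spatial_kernel // 2))
        # 1D temporal kernels (upper layers): applied across frames only.
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1, 1),
                                  padding=(temporal_kernel // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, clip):
        # clip: (B, C, T, H, W)
        x = self.relu(self.spatial(clip))   # learn spatial appearance per frame
        x = self.relu(self.temporal(x))     # learn temporal dynamics per location
        return x

clip = torch.randn(2, 3, 16, 112, 112)      # toy clip: 16 RGB frames
out = FactorizedSpatioTemporalConv(3, 32, 64)(clip)
print(out.shape)  # torch.Size([2, 64, 16, 112, 112])
```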
In order to enhance the spatio-temporal representations for videos without losing
the advantage of speed, we propose to add another modality, the difference
between neighboring RGB frames, into the spatio-temporal modeling.
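For illustration, such a frame-difference modality can be obtained directly from the RGB clip tensor itself, which is far cheaper than computing optical flow; the helper below is a hypothetical example, not the thesis code:

```python
# Illustrative sketch: temporal differences between neighboring RGB frames.
import torch

def frame_differences(clip):
    """clip: (B, C, T, H, W) -> (B, C, T-1, H, W) differences between frames."""
    return clip[:, :, 1:] - clip[:, :, :-1]
```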
CNN-based methods are effective in learning spatial appearances, but are
limited in modeling long-term motion dynamics. On the other hand, Recurrent
Neural Networks (RNNs) are able to learn temporal motion dynamics by iteratively
feeding back the previous hidden features. In this thesis, we present RNNs as
an alternative approach to CNNs. We establish that a feedback-based approach
such as RNNs has several fundamental advantages over feedforward approaches,
in addition to achieving comparable performance.
We further apply RNNs, particularly the long short-term memory (LSTM), to
human action recognition problems. In our experiments, we find that compared
with CNNs, RNNs can better model the temporal relations in videos. However,
naively applying RNNs to video sequences in a convolutional manner implicitly
assumes that motions in videos are stationary across different spatial locations.
This assumption is valid for short-term motions but invalid when the duration of
the motion is long.
To address this invalid assumption, we propose the Lattice-LSTM (L²STM),
which extends the LSTM by learning independent hidden state transitions of
memory cells at different spatial locations. This method effectively enhances the
LSTM's ability to model dynamics across time and addresses the non-stationary
issue of long-term motion dynamics without significantly increasing the model
complexity. Additionally, we introduce a novel multi-modal training procedure
for training our network. Unlike traditional two-stream architectures, which use
RGB and optical flow information as input, our two-stream model leverages both
modalities to jointly train both input gates and both forget gates in the network
rather than treating the two streams separately.
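A heavily simplified, hypothetical sketch of the core idea (assuming PyTorch; the cell below collapses the full LSTM gating for brevity and is not the L²STM implementation) contrasts a shared input transform with a recurrent transform whose weights differ at every spatial location:

```python
# Simplified sketch: instead of one recurrent transform shared across all spatial
# positions (as in a convolutional LSTM), the hidden-to-hidden transition gets its
# own parameters at every location, so non-stationary long-term motions can be
# modeled per position. Gating is omitted here for brevity.
import torch
import torch.nn as nn

class PerLocationRecurrentCell(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # Shared input-to-hidden convolution (appearance cues).
        self.input_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Location-dependent hidden-to-hidden weights: one weight per (c, h, w).
        self.recurrent_weight = nn.Parameter(torch.randn(1, channels, height, width) * 0.01)

    def forward(self, x, h):
        # x, h: (B, C, H, W); the recurrent term varies with spatial location.
        return torch.tanh(self.input_conv(x) + self.recurrent_weight * h)
```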
To benefit from such heterogeneous data/features, existing RNN/CNN models
mostly adopt a two-stream style: individual models are deployed separately
to learn from individual data sources, and the results are either fused or post-processed
to achieve the final objective. While relatively effective, this direct
use of CNNs/RNNs for learning from heterogeneous data/features neither fully
exploits the reciprocal information contained in the multiple sources nor
exploits this reciprocity in a recurrent manner. Therefore, we propose a novel
recurrent architecture, the Coupled Recurrent Network (CRN), to deal with multiple
sources efficiently and effectively. In particular, we study human action
recognition by coupling the information from two streams: RGB and optical flow
inputs. The same architecture can be applied to other computer vision tasks, such
as human pose estimation, where the combination of heat map prediction and
part affinity field estimation can work together to boost the final performance.
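The following hypothetical sketch (PyTorch, with GRU cells standing in for the actual recurrent units and an arbitrary feature dimension) illustrates the coupling idea only: at each step, each stream's update also reads the other stream's previous hidden state, instead of the two streams being fused only at the end. It is not the CRN architecture itself.

```python
# Illustrative sketch: two recurrent streams (e.g., RGB and optical flow) whose
# hidden states are exchanged at every step so each stream informs the other.
import torch
import torch.nn as nn

class CoupledRecurrentStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rgb_cell = nn.GRUCell(dim + dim, dim)   # also sees the flow hidden state
        self.flow_cell = nn.GRUCell(dim + dim, dim)  # also sees the RGB hidden state

    def forward(self, rgb_feat, flow_feat, h_rgb, h_flow):
        h_rgb_next = self.rgb_cell(torch.cat([rgb_feat, h_flow], dim=1), h_rgb)
        h_flow_next = self.flow_cell(torch.cat([flow_feat, h_rgb], dim=1), h_flow)
        return h_rgb_next, h_flow_next
```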