Unlike in image recognition, human actions in video sequences are three-dimensional
(3D) spatio-temporal signals characterizing both the visual appearance
and motion dynamics of the involved humans and objects. Previous research
has mainly focused on using hand-designed local features, such as SIFT, HOG and
SURF, to solve the video-based recognition problem. However, these approaches
are complex to implement and difficult to extend to real-world data.
Inspired by the success of deeply learned features for image classification, recent
attempts have been made to learn deep features for video analysis. However,
unlike in image analysis, few deep learning models have been proposed for video
analysis, and only limited success has been reported. In particular, most such
models either deal with simple datasets or rely on low-level local
spatio-temporal features for the final prediction.
Transferring the success of two-dimensional (2D) Convolutional Neural Networks
(CNNs) to videos by implementing 3D CNNs is a direct approach for action
recognition. However, partially due to the high complexity of training 3D convolution
kernels and the need for large quantities of training videos, only limited
success has been reported. Therefore, we investigate a new deep architecture
which can handle 3D signals more effectively. We propose a factorized spatio-temporal convolutional network (FSTCN) structure which factorizes the original
3D convolution kernel learning as a sequential process of learning 2D spatial kernels
in the lower layers (called spatial convolutional layers), followed by learning
1D temporal kernels in the upper layers (called temporal convolutional layers).
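As a minimal sketch of this factorization (assuming PyTorch; the module name, layer sizes, and clip dimensions below are illustrative, not the thesis implementation), a 3D convolution can be split into a 2D spatial convolution with no temporal extent followed by a 1D temporal convolution with no spatial extent:

```python
# Illustrative sketch: factorizing 3D convolution into 2D spatial + 1D temporal
# convolutions over a clip of shape (batch, channels, time, height, width).
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, in_channels, mid_channels, out_channels,
                 spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        # 2D spatial kernels (lower layers): applied per frame, no temporal extent.
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, spatial_kernel, spatial_kernel),
                                 padding=(0, spatial_kernel // 2, spatial_kernel // 2))
        # 1D temporal kernels (upper layers): applied across frames only.
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1, 1),
                                  padding=(temporal_kernel // 2, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, clip):
        # clip: (B, C, T, H, W)
        x = self.relu(self.spatial(clip))   # learn spatial appearance per frame
        x = self.relu(self.temporal(x))     # learn temporal dynamics per location
        return x

clip = torch.randn(2, 3, 16, 112, 112)      # toy clip: 16 RGB frames
out = FactorizedSpatioTemporalConv(3, 32, 64)(clip)
print(out.shape)  # torch.Size([2, 64, 16, 112, 112])
```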
In order to enhance the spatio-temporal representations for videos without losing
the advantage of speed, we propose to add another modality, the difference
between neighboring RGB frames, into the spatio-temporal modeling.
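For illustration, such a frame-difference modality can be obtained directly from the RGB clip tensor itself, which is far cheaper than computing optical flow; the helper below is a hypothetical example, not the thesis code:

```python
# Illustrative sketch: temporal differences between neighboring RGB frames.
import torch

def frame_differences(clip):
    """clip: (B, C, T, H, W) -> (B, C, T-1, H, W) differences between frames."""
    return clip[:, :, 1:] - clip[:, :, :-1]
```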
CNN-based methods are effective in learning spatial appearances, but are
limited in modeling long-term motion dynamics. On the other hand, Recurrent
Neural Networks (RNNs) are able to learn temporal motion dynamics by iteratively
feeding back the previous hidden features. In this thesis, we present RNNs as
an alternative approach to CNNs. We establish that a feedback-based approach
such as RNNs has several fundamental advantages over feedforward approaches,
in addition to achieving comparable performance.
We further apply RNNs, particularly the long short-term memory (LSTM), to
human action recognition problems. In our experiments, we find that compared
with CNNs, RNNs can better model the temporal relations in videos. However,
naively applying RNNs to video sequences in a convolutional manner implicitly
assumes that motions in videos are stationary across different spatial locations.
This assumption is valid for short-term motions but invalid when the duration of
the motion is long.
To address this invalid assumption, we propose the Lattice-LSTM (L²STM),
which extends the LSTM by learning independent hidden state transitions of
memory cells at different spatial locations. This method effectively enhances the
LSTM's ability to model dynamics across time and addresses the non-stationary
issue of long-term motion dynamics without significantly increasing the model
complexity. Additionally, we introduce a novel multi-modal training procedure
for training our network. Unlike traditional two-stream architectures, which use
RGB and optical flow information as input, our two-stream model leverages both
modalities to jointly train both input gates and both forget gates in the network
rather than treating the two streams separately.
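A heavily simplified, hypothetical sketch of the core idea (assuming PyTorch; the cell below collapses the full LSTM gating for brevity and is not the L²STM implementation) contrasts a shared input transform with a recurrent transform whose weights differ at every spatial location:

```python
# Simplified sketch: instead of one recurrent transform shared across all spatial
# positions (as in a convolutional LSTM), the hidden-to-hidden transition gets its
# own parameters at every location, so non-stationary long-term motions can be
# modeled per position. Gating is omitted here for brevity.
import torch
import torch.nn as nn

class PerLocationRecurrentCell(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # Shared input-to-hidden convolution (appearance cues).
        self.input_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Location-dependent hidden-to-hidden weights: one weight per (c, h, w).
        self.recurrent_weight = nn.Parameter(torch.randn(1, channels, height, width) * 0.01)

    def forward(self, x, h):
        # x, h: (B, C, H, W); the recurrent term varies with spatial location.
        return torch.tanh(self.input_conv(x) + self.recurrent_weight * h)
```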
To benefit from such heterogeneous data/features, existing RNN/CNN models
mostly adopt a two-stream style: individual models are deployed separately
to learn from individual data sources, and the results are either fused or post-processed
to achieve the final objective. While relatively effective, this direct
use of CNNs/RNNs for learning from heterogeneous data/features neither fully
exploits the reciprocal information contained in the multiple sources nor
exploits this reciprocity in a recurrent manner. Therefore, we propose a novel
recurrent architecture, the Coupled Recurrent Network (CRN), to deal with multiple
sources efficiently and effectively. In particular, we study human action
recognition by coupling the information from two streams: RGB and optical flow
inputs. The same architecture can be applied to other computer vision tasks, such
as human pose estimation, where the combination of heat map prediction and
part affinity field estimation can work together to boost the final performance.
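The following hypothetical sketch (PyTorch, with GRU cells standing in for the actual recurrent units and an arbitrary feature dimension) illustrates the coupling idea only: at each step, each stream's update also reads the other stream's previous hidden state, instead of the two streams being fused only at the end. It is not the CRN architecture itself.

```python
# Illustrative sketch: two recurrent streams (e.g., RGB and optical flow) whose
# hidden states are exchanged at every step so each stream informs the other.
import torch
import torch.nn as nn

class CoupledRecurrentStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.rgb_cell = nn.GRUCell(dim + dim, dim)   # also sees the flow hidden state
        self.flow_cell = nn.GRUCell(dim + dim, dim)  # also sees the RGB hidden state

    def forward(self, rgb_feat, flow_feat, h_rgb, h_flow):
        h_rgb_next = self.rgb_cell(torch.cat([rgb_feat, h_flow], dim=1), h_rgb)
        h_flow_next = self.flow_cell(torch.cat([flow_feat, h_rgb], dim=1), h_flow)
        return h_rgb_next, h_flow_next
```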