THESIS
2018
xxi, 130 pages : color illustrations ; 30 cm
Abstract
Learning visual representations of video for detection has emerged as one of
the fundamental problems in general video understanding. It requires rich
knowledge for spatial and/or temporal localization; however, manually collecting
fully-supervised annotations is extremely expensive and not scalable. This thesis
has made progress toward effectively employing weakly-labeled data in learning
video representations for detection tasks. Specifically, we focus on video object
detection with human action descriptions, and temporal action detection with
video-level action categories.
For the weakly-supervised video object detection task, we propose a temporal
dynamic graph long short-term memory network. This enables global temporal
reasoning by constructing a dynamic graph that is based on temporal correlations
of object proposals and spans the entire video. It significantly alleviates the missing
label issue for each individual frame by transferring knowledge across correlated
object proposals in the whole video. Extensive evaluations on a large-scale
daily-life action dataset demonstrate the superiority of our proposed method.
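The graph construction can be sketched as follows. This is an illustrative toy version, not the thesis implementation: the function name, the use of cosine similarity as the temporal-correlation measure, and the top-k linking rule are all assumptions made for clarity.

```python
import numpy as np

def build_dynamic_graph(features, k=2):
    """Link each object proposal to its k most correlated proposals
    from the whole video, so knowledge can be transferred across frames.

    features: (num_proposals, dim) array of proposal features pooled
              from all frames (hypothetical appearance descriptors).
    Returns a dict: proposal index -> list of its k nearest neighbors.
    """
    # Cosine similarity between every pair of proposals.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-edges
    # Keep the k strongest temporal correlations per proposal.
    return {i: list(np.argsort(-sim[i])[:k]) for i in range(len(features))}
```

Because every proposal is connected to correlated proposals anywhere in the video rather than only in adjacent frames, a label observed on one node can inform all of its graph neighbors, which is the intuition behind alleviating the per-frame missing-label issue.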
For the weakly-supervised temporal action detection task, we propose three
different approaches. (1) We propose an end-to-end framework to simultaneously
update feature representation for classification and generate temporal proposals
with a gated recurrent unit for detection. (2) We propose a novel structure-and-relation network, which includes a local structure module to leverage the
context information for improving localization, and a global relation module to
process all instances simultaneously through exploiting their interactions. These
modules are integrated from a probabilistic perspective and can be learned in
an end-to-end fashion. (3) We propose a marginalized dropout attention (MDA)
mechanism for video feature aggregation in order to learn a more accurate action
probability for each frame from classification networks. The MDA module acts as
a structural regularizer, alleviating the common problem of attending only to
the most salient frames. Our proposed methods outperform the
previous weakly-supervised approaches on several challenging video benchmarks.
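The marginalized dropout attention idea can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis code: the function name, the choice of averaging softmax weights over sampled dropout masks, and all hyperparameter values are hypothetical.

```python
import numpy as np

def mda_pool(frame_feats, logits, drop_p=0.5, n_samples=100, seed=0):
    """Aggregate per-frame features with dropout-marginalized attention.

    frame_feats: (T, dim) array of per-frame features.
    logits:      (T,) per-frame attention logits.
    Randomly drops frames from the attention distribution and averages
    the resulting softmax weights, so no single salient frame dominates.
    """
    rng = np.random.default_rng(seed)
    T = len(logits)
    weights = np.zeros(T)
    for _ in range(n_samples):
        mask = rng.random(T) >= drop_p            # keep each frame w.p. 1 - drop_p
        if not mask.any():
            continue                              # skip degenerate all-dropped mask
        masked = np.where(mask, logits, -np.inf)  # dropped frames get zero weight
        e = np.exp(masked - masked[mask].max())   # numerically stable softmax
        weights += e / e.sum()
    weights /= weights.sum()                      # marginalize over sampled masks
    return weights @ frame_feats, weights
```

Compared with a plain softmax, which concentrates nearly all mass on the single most salient frame, the marginalized weights are forced to spread over other informative frames, which is the regularization effect described above.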