THESIS
2022
1 online resource (xii, 102 pages) : illustrations (chiefly color)
Abstract
Analyzing human actions in videos and augmenting human action videos with visual
effects are common tasks in video understanding and editing. However, they are
challenging in three respects. First, automatically analyzing human actions in videos and
augmenting videos with visual effects require programming or professional tools, and
are thus often tedious and unfriendly to novice users. Second, the intrinsic perspective
foreshortening in videos makes both the observation and the computation of human action
attributes dependent on the viewpoint. Third, the data attributes of human actions in question
are often application-specific: they are either pre-defined or require programming
to generalize to new instances, which limits support for customized analysis,
especially for novices. This thesis aims to address these limitations in both the
analysis and augmentation of human action videos.
We first present PoseTween, a tool that lets users easily augment human action videos
with visual effects (animated virtual objects). We model the visual effects as tween
animations of virtual objects driven by the subject's movements in the video.
By exploiting these movements, PoseTween achieves natural interactions between the augmented virtual objects and the subjects while greatly simplifying the editing
process.
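The core idea of a pose-driven tween can be sketched in a few lines. This is an illustrative simplification, not PoseTween's actual implementation: a virtual object's position is interpolated between two user-set keyframes, each anchored to a tracked body joint, so the effect follows the subject's movement. The joint coordinates and frame numbers below are hypothetical.

```python
def lerp(a, b, t):
    """Linear interpolation between 2D points a and b."""
    return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)

def tween_position(start_anchor, end_anchor, frame, start_frame, end_frame):
    """Position of the virtual object at `frame`, moving from the anchor joint's
    position at the start keyframe to its position at the end keyframe."""
    t = (frame - start_frame) / (end_frame - start_frame)
    t = max(0.0, min(1.0, t))  # clamp outside the keyframe interval
    return lerp(start_anchor, end_anchor, t)

# Hypothetical example: the subject's wrist moves from (100, 200) to (300, 240)
# between frames 10 and 20; the attached object follows.
pos = tween_position((100, 200), (300, 240), 15, 10, 20)
print(pos)  # (200.0, 220.0), the tween midpoint
```

In practice the anchors would come from per-frame pose estimation, and easing curves could replace the linear interpolation, but the keyframe-plus-pose-anchor structure is the essence of a movement-driven tween.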
We then study the problem of temporally aligning human
action videos, which enables the automatic transfer of visual effects from a template
video to a target video based on action proximity, reducing user intervention. To address
the perspective foreshortening problem, we propose a deep learning-based method that
normalizes the human poses in videos and extracts features from the normalized poses for
matching. Temporal alignment computed by matching the normalized pose features of two
videos is thus invariant to variations such as camera
viewpoint and subject anthropometry. In the third part of the thesis, we study the analysis
and visualization of differences in local human poses. We design a tool, PoseCoach,
for video-based running coaching, which compares the running poses of an amateur
runner and a professional runner. Our tool supports interactive annotation of
biomechanical pose attributes, so that novice users (e.g., amateur runners) can perform
customized analysis of human action videos without explicit programming. Existing
visualization methods that show differences in local human poses with side-by-side
or overlaid placements are subject to viewpoint variation and rely on users' perception to
interpret the differences. We therefore also propose a visualization method that intuitively
shows pose differences through 3D animations of a body model.
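One biomechanical pose attribute of the kind such a tool might annotate is a joint angle. The sketch below, which is not PoseCoach's actual pipeline, computes the angle at a joint from three 2D keypoints and compares it between two poses; all keypoint coordinates are hypothetical.

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) between segments b->a and b->c,
    e.g. the knee angle from hip (a), knee (b), and ankle (c) keypoints."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    return math.degrees(math.acos(dot / (n1 * n2)))

# Hypothetical hip/knee/ankle keypoints for two temporally aligned frames.
amateur = joint_angle((0, 0), (0, 1), (0.5, 1.8))
pro = joint_angle((0, 0), (0, 1), (0.2, 1.9))
diff = pro - amateur  # positive: the professional's leg is closer to straight
```

A real comparison would first temporally align the two videos and compute such attributes on viewpoint-normalized (e.g., 3D) poses, since 2D angles are themselves affected by foreshortening.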
We conduct extensive quantitative evaluations and user studies to assess the effectiveness
of our proposed methods. The results show that PoseTween and PoseCoach are
friendly to novice users for augmenting human action videos with animated virtual
objects and for analyzing actions in videos, respectively. The normalized pose features
achieve promising accuracy in tasks that require measuring pose similarity, such as video
temporal alignment and action recognition.