THESIS
2021
1 online resource (x, 45 pages) : color illustrations
Abstract
Video action recognition is an important computer vision task whose goal is to classify and label the action in a video. Although machine learning has greatly improved video action recognition, its performance remains limited compared with the simpler task of image recognition.
By identifying inherent labeling noise in existing video action datasets, which misdirects machine learning models and prevents them from effectively classifying the action in a video, we contribute HAA500, a manually annotated video action dataset with 10,000 video clips in 500 classes. HAA500 consists of 500 fine-grained atomic action classes, with video clips of consistent actions collected for each class. HAA500 enables deep learning models to improve their predictions while avoiding unwanted bias and focusing on the human figure and pose.
We further study the importance of human pose and the use of skeleton data in action recognition. We introduce a novel temporal alignment method that uses 3D skeleton data extracted from a video. Compared with the RGB video frames used by existing methods, a sequence of 3D skeleton data is a compact representation of the pertinent human action that is highly robust to unwanted bias, making it suitable for few-shot learning tasks in which data for the novel classes are limited. We introduce a skeleton embedding generated by a three-stream embedding network that uses multi-order representations of a 3D skeleton sequence, together with a generative model that reconstructs the skeleton coordinates from the embedding. We evaluate our model on a few-shot action recognition task and show that it outperforms the state-of-the-art method on multiple benchmarks.
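For readers unfamiliar with temporal alignment of skeleton sequences, the following is a minimal illustrative sketch only. It assumes classic dynamic time warping (DTW) over raw 3D joint coordinates, which is a generic stand-in and not the alignment method or learned embedding proposed in the thesis. The function names `frame_distance` and `dtw_align` and the data layout (a sequence is a list of frames, a frame is a list of `(x, y, z)` joints) are hypothetical choices made for this example.

```python
import math


def frame_distance(frame_a, frame_b):
    """Mean Euclidean distance between corresponding 3D joints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b)) / len(frame_a)


def dtw_align(seq_a, seq_b):
    """Align two skeleton sequences with classic DTW (assumed stand-in method).

    Returns the accumulated alignment cost and the warping path as a list of
    (index_in_a, index_in_b) pairs.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both sequences
            (cost[i - 1][j], i - 1, j),          # advance seq_a only
            (cost[i][j - 1], i, j - 1),          # advance seq_b only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path
```

Calling `dtw_align(query_sequence, support_sequence)` on two clips of different lengths yields a frame-to-frame correspondence. In a few-shot setting, the accumulated cost could serve as a simple distance for nearest-neighbor classification, although the thesis itself reports results using its learned skeleton embedding rather than raw coordinates.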