THESIS
2015
xii, 86 pages : illustrations ; 30 cm
Abstract
Visual tracking is a fundamental task in computer vision. Numerous subsequent applications such as video semantic analysis call for reliable tracking results. In this thesis, we first delve into a tracker to analyze the most influential components for good performance. In particular, we propose a novel framework for tracker evaluation which decomposes a tracker into five parts: motion model, feature extractor, observation model, model updater, and ensemble post-processor. We find that the feature extractor plays the most important role in a tracker. Although the observation model is the focus of many studies, it often brings no significant improvement. Moreover, the motion model and model updater contain many details that could affect the result. The ensemble post-processor can also imp...[
Read more ]
Visual tracking is a fundamental task in computer vision. Numerous subsequent applications such as video semantic analysis call for reliable tracking results. In this thesis, we first delve into a tracker to analyze the most influential components for good performance. In particular, we propose a novel framework for tracker evaluation which decomposes a tracker into five parts: motion model, feature extractor, observation model, model updater, and ensemble post-processor. We find that the feature extractor plays the most important role in a tracker. Although the observation model is the focus of many studies, it often brings no significant improvement. Moreover, the motion model and model updater contain many details that could affect the result. The ensemble post-processor can also improve the result substantially when the constituent trackers have high diversity. Though we cannot cast all the existing trackers into the evaluation framework, we cover most of the popular ones. We believe the analyses are universal enough to enlighten some future research directions.
Based on these analyses, we decide to apply advanced machine learning techniques for visual tracking. Particularly, we focus on the feature learning and ensemble post-processing parts. In feature learning, we try three different approaches which are based on online non-negative robust dictionary learning, stacked denoising autoencoder (SDAE) and convolutional neural network (CNN). In the first method, we present an online robust non-negative dictionary learning algorithm for updating the object templates, and devise a novel online projected gradient descent method to solve the dictionary learning problem. Our algorithm blends the past information and the current tracking result in a principled way. It can automatically detect and reject the occlusion and cluttered background, yielding robust object templates. For the next two methods, we build our methods on the idea of transfer learning. We propose to transfer generic image and object features which are learned offline to online tracking. Our first attempt is to train an SDAE using one million auxiliary natural images, and then transfer the encoder part of the SDAE as a feature extractor to online tracking. An additional classifier layer is appended to the top of the encoder in online tracking, then both the feature extractor and the classifier can be fine-tuned to adapt to appearance changes of the moving object. Our second approach is based on CNN. To tackle the disadvantages of the previous approach, we utilize CNN which is a more powerful deep learning model as demonstrated by many researchers in recent years. We further improve the vanilla CNN with objectness pretraining and structured output. Experiments on a recent benchmark [81] show the superiority of the proposed methods.
For the ensemble post-processor, inspired by some recent studies on crowdsourcing, we propose a novel factorial hidden Markov model (FHMM) for aggregating structured output time series data. For efficient online inference of the FHMM, we devise a conditional particle filter algorithm by exploiting the structure of the joint posterior distribution of the hidden variables. The proposed method could significantly improve the results compared with those of individual trackers.
Post a Comment