THESIS
2018
xvi, 79 pages : illustrations ; 30 cm
Abstract
Context plays a critical role in perceptual inference, as it provides useful guidance for solving numerous tasks in both the spatial and temporal domains (Divvala et al., 2009; Galleguillos et al., 2008; Mottaghi et al., 2014). In this dissertation, we study several fundamental computer vision problems, namely object detection, image generation, and high-level image understanding, by exploiting different forms of spatio-temporal context to boost their performance.
Driven by recent developments in deep neural networks, we propose deep contextual modeling in the spatial and temporal domains. Context here refers to one of the following application scenarios: (1) temporal coherence and consistency for object detection in video frames; (2) spatial constraints for conditional image synthesis, i.e., generating an image from a sketch; (3) domain-specific knowledge such as facial attributes for natural face image generation.
We first study the problem of exploiting temporal context for object detection in video, where applying a single-frame object detector directly to a video sequence tends to produce high temporal variation in the frame-level output. Building on recent advances in sequential modeling, we exploit long-range visual context for temporal coherence and consistency by proposing a novel association LSTM framework, which solves the regression and association tasks in video simultaneously.
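To make the joint formulation concrete, below is a minimal, hypothetical PyTorch sketch (not the dissertation's implementation): an LSTM consumes per-object appearance features and detector boxes over time, and jointly emits box-regression offsets for temporal smoothing and a normalized association embedding for linking detections across frames. All module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociationLSTM(nn.Module):
    """Toy association-LSTM-style module: joint regression + association."""
    def __init__(self, feat_dim=512, hidden_dim=256, assoc_dim=128):
        super().__init__()
        # Input per time step: appearance feature concatenated with a 4-d box.
        self.lstm = nn.LSTM(feat_dim + 4, hidden_dim, batch_first=True)
        self.regress = nn.Linear(hidden_dim, 4)            # box offsets (dx, dy, dw, dh)
        self.associate = nn.Linear(hidden_dim, assoc_dim)  # cross-frame matching embedding

    def forward(self, feats, boxes, state=None):
        # feats: (B, T, feat_dim) per-object appearance features over T frames
        # boxes: (B, T, 4) detector boxes for the same object across frames
        x = torch.cat([feats, boxes], dim=-1)
        h, state = self.lstm(x, state)
        deltas = self.regress(h)                           # temporally smoothed refinement
        emb = F.normalize(self.associate(h), dim=-1)
        return deltas, emb, state
```

Association between consecutive frames can then be scored by the dot product of the embeddings, so the regression and association tasks are trained jointly from the same recurrent state.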
Next, we investigate image generation guided by a hand-drawn sketch in the spatial domain. We design a joint image representation for learning the joint distribution and correspondence of sketch-image pairs. A contextual GAN framework is proposed to pose image generation as a constrained image completion problem, where the sketch serves as a weak spatial context. As a result, the output images remain realistic even when they do not strictly follow a poorly drawn sketch.
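As an illustration of the completion view, the following hedged sketch assumes a generator G trained on joint canvases with the sketch and image concatenated side by side (an assumed layout): synthesis is cast as searching the latent space so that the sketch half of the generated canvas resembles the input sketch, after which the image half is read out as the result. The function name and hyperparameters are hypothetical.

```python
import torch

def complete_from_sketch(G, sketch, z_dim=100, steps=200, lr=0.05):
    # sketch: (1, C, H, W); G(z) produces a joint canvas of shape (1, C, H, 2*W)
    # whose left half is a sketch and whose right half is the paired image.
    z = torch.randn(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    W = sketch.shape[-1]
    for _ in range(steps):
        joint = G(z)
        # Weak contextual loss: the generated sketch half should resemble
        # the input sketch, but it is not forced to match it exactly.
        loss = torch.mean(torch.abs(joint[..., :W] - sketch))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(z)[..., W:].detach()  # image half of the completed joint canvas
```

Because the sketch term acts only as a soft penalty, the optimum can deviate from a badly drawn sketch while staying on the generator's natural-image manifold.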
Finally, we explore domain-specific context, i.e., facial attributes, for attribute-guided face generation: we condition CycleGAN and propose conditional CycleGAN, which is designed to allow easy control of the appearance of the generated face via the facial attribute or identity context. We demonstrate three applications of identity-guided face generation.
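One simple way to realize such conditioning, shown in the hedged sketch below, is to broadcast the attribute (or identity) vector to a spatial map and concatenate it with the input image channels before the generator's first convolution. This is an illustrative assumption, not necessarily the dissertation's exact architecture, and the interior blocks are elided.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy attribute-conditioned generator in the CycleGAN style."""
    def __init__(self, img_ch=3, attr_dim=40, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_ch + attr_dim, base, 7, padding=3),
            nn.ReLU(inplace=True),
            # ... downsampling / residual / upsampling blocks would go here ...
            nn.Conv2d(base, img_ch, 7, padding=3),
            nn.Tanh(),
        )

    def forward(self, img, attrs):
        # img: (B, 3, H, W); attrs: (B, attr_dim) facial attribute vector,
        # or an identity feature of the same width for identity guidance.
        B, _, H, W = img.shape
        attr_map = attrs.view(B, -1, 1, 1).expand(B, attrs.shape[1], H, W)
        return self.net(torch.cat([img, attr_map], dim=1))
```

Feeding the same image with different attribute vectors then steers the generated face's appearance (e.g., hair color or gender) without retraining the generator.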
As future research directions, we will study deep networks for jointly learning spatial and temporal context and explore the possibility of solving all of these applications with one single model.