THESIS
2022
1 online resource (xviii, 109 pages) : illustrations (chiefly color)
Abstract
How to recognize 3D properties from 2D RGB images is a fundamental problem in computer vision, one that enables numerous applications such as human-computer interaction, traffic surveillance, autonomous perception, and augmented reality. The problem is challenging due to the loss of depth information during image formation and the large variation in depth, object geometry, and scene illumination. In his early studies, Marr pointed out that the choice of representation and algorithm constitutes an important level in understanding visual information processing systems. Specifically, he proposed a representational framework that maps images to 3D model
representations with a low-level primal sketch and an intermediate 2.5D sketch. Recent deep
learning approaches instantiate this framework by learning representations from data in an end-to-end manner. How to design appropriate representations and learn them effectively are key
to these instantiations. This thesis studies such instantiations for four important and representative 3D perception tasks. After introducing the background and the positioning of this thesis, the studies are presented in order of an increasing number of 3D attributes, camera views, and system capabilities.
We first study the problem of recognizing the 3D orientation of a class of rigid objects from
a single RGB image. In contrast to prior art that directly regresses the angular values with a
deep neural network, we propose to learn part-based 2D-3D correspondences as intermediate
representations to achieve improved performance and robustness to partially occluded objects.
We show that the prior knowledge of a projective invariant can be utilized in the training process
to further improve the representation learning with extra unlabeled images.
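The core idea of recovering orientation from part-based 2D-3D correspondences can be illustrated, outside the thesis itself, with the classical Kabsch algorithm, which solves for the rotation best aligning two corresponding point sets. The part keypoints below are hypothetical placeholders, not the thesis's actual parts or solver:

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R minimizing ||R @ P - Q||_F (columns of P and Q correspond)."""
    H = P @ Q.T
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Hypothetical object-frame part keypoints (3 x N, rank 3).
parts = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])
# Ground-truth yaw (rotation about the vertical axis).
t = np.deg2rad(30.0)
R_gt = np.array([[np.cos(t), 0.0, np.sin(t)],
                 [0.0,       1.0, 0.0],
                 [-np.sin(t), 0.0, np.cos(t)]])
observed = R_gt @ parts        # the parts as "seen" after the object rotates
R_est = kabsch(parts, observed)  # orientation recovered from correspondences
```

With noisy, partially occluded correspondences the least-squares nature of this alignment is what lends the representation its robustness: a few missing parts still leave enough constraints to solve for the pose.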
Second, we study inferring the posture of a class of articulated objects from single-view images. We identify a dataset bias problem that limits the generalization power of the model that lifts the 2D representations to the 3D targets. We address this problem by proposing the first method to incorporate synthetic data into the training phase, where the
choice of a hierarchical pose representation motivates the design of genetic operators to enrich
a population of tree-structured data. Experiments validate that this approach improves model
generalization of 2D-to-3D networks to unseen inputs.
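As an illustration of genetic operators applied to tree-structured pose data, the sketch below uses a hypothetical two-limb skeleton (not the thesis's actual representation) and shows subtree crossover plus bounded angle mutation, the two classic operators for enriching a population:

```python
import copy
import random

# Hypothetical hierarchical pose: a root with two limb subtrees.
pose_a = {"root": {"angles": [0.0], "children": {
    "l_arm": {"angles": [0.1, 0.2], "children": {}},
    "r_arm": {"angles": [0.3, 0.4], "children": {}},
}}}
pose_b = copy.deepcopy(pose_a)
pose_b["root"]["children"]["l_arm"]["angles"] = [0.9, 0.8]

def crossover(parent_a, parent_b, limb):
    """Swap one limb subtree from parent_b into a copy of parent_a."""
    child = copy.deepcopy(parent_a)
    child["root"]["children"][limb] = copy.deepcopy(
        parent_b["root"]["children"][limb])
    return child

def mutate(pose, limb, scale=0.05, seed=0):
    """Perturb a limb's joint angles with bounded random noise."""
    out = copy.deepcopy(pose)
    rng = random.Random(seed)
    node = out["root"]["children"][limb]
    node["angles"] = [a + rng.uniform(-scale, scale) for a in node["angles"]]
    return out

# One new tree-structured sample built from two parents.
new_pose = mutate(crossover(pose_a, pose_b, "l_arm"), "r_arm")
```

Operating on whole subtrees keeps every offspring kinematically plausible per limb, which is the appeal of a hierarchical representation for this kind of data augmentation.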
We then extend our study of 3D perception to two views for stereo 3D object detection,
where the goal is to precisely estimate the instance location, orientation, and size. Previous
studies learn a point-based representation and suffer from depth estimation errors and a lack of semantic cues. We instead propose to learn a high-resolution voxel-based representation
for accurate and robust estimation of the object's rigid pose. We design a part-based confidence map representation and demonstrate its superiority over a center-based design for partially occluded and truncated instances. This approach achieves high precision and
model-agnostic pose refinement. It also leads to a new multi-resolution design that makes high-resolution
modeling of important object regions computationally tractable.
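The voxel-based route can be sketched, under assumed camera parameters that are purely illustrative and not taken from the thesis, as converting stereo disparity to metric depth (z = f·b/d) and binning the depths into a coarse voxel grid along the viewing axis:

```python
import numpy as np

# Assumed stereo parameters (illustrative only):
f, b = 700.0, 0.54                    # focal length (px), baseline (m)
disparity = np.array([[70.0, 35.0],
                      [14.0, 7.0]])   # toy 2x2 disparity map (px)
depth = f * b / disparity             # z = f * b / d  (m)

# Bin depths into a coarse voxel grid along the viewing axis: 12 cells of 5 m.
edges = np.linspace(0.0, 60.0, 13)
occupancy = np.histogram(depth, bins=edges)[0] > 0
```

The 1/d relationship also shows why distant objects are hard for point-based methods: a one-pixel disparity error maps to a much larger depth error at long range, which motivates reasoning in a volumetric grid where resolution can be allocated explicitly.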
Finally, we push the capability of our perception system to go beyond the pose estimation
task and achieve fine-grained shape inference, narrowing the gap between a computer vision
system and the binocular human vision system. We design the first model for joint stereo 3D
object detection and implicit shape estimation that can not only recover scene properties up to
3D bounding boxes but also detailed surface geometry inside the boxes. This model instantiates
the visible surface in Marr’s framework with point-based representations and solves an unseen
surface hallucination problem to achieve a complete and resolution-agnostic shape description
for the detected objects. This extended instance-level model also generalizes the orientation refinement studies to non-rigid object classes such as pedestrians.
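Implicit shape representations of this kind describe a surface as the zero level set of a function that can be evaluated at any 3D point, which is what makes the description resolution-agnostic. A minimal sketch with a sphere signed distance function (purely illustrative, not the learned model from the thesis):

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

# Any 3D point can be queried, so the surface has no fixed resolution.
queries = np.array([[0.0, 0.0, 0.0],   # center of the unit sphere
                    [2.0, 0.0, 0.0]])  # one unit outside the surface
values = sphere_sdf(queries, np.zeros(3), 1.0)
```

A learned implicit model replaces this analytic function with a network conditioned on the detected instance, so meshes at any desired resolution can be extracted afterwards, e.g. by marching cubes over the queried field.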