THESIS
2022
1 online resource (xviii, 109 pages) : illustrations (chiefly color)
Abstract
How to recognize 3D properties from 2D RGB images is a fundamental problem in computer vision, one that enables numerous applications such as human-computer interaction, traffic surveillance, autonomous perception, and augmented reality. The problem is challenging due to the loss of depth information during image formation and the large variation in depth, object geometry, and scene illumination. In his early studies, Marr pointed out that the choice of representation and algorithm constitutes an important level in understanding visual information processing systems. Specifically, he proposed a representational framework that maps images to 3D model
representations with a low-level primal sketch and an intermediate 2.5D sketch. Recent deep
learning approaches instantiate this framework by learning representations from data in an end-to-end manner. How to design appropriate representations and learn them effectively are key
to these instantiations. This thesis studies such instantiations for four important and representative 3D perception tasks. After introducing the background and the positioning of this thesis, the studies are presented in order of an increasing number of 3D attributes, camera views, and system capabilities.
We first study the problem of recognizing the 3D orientation of a class of rigid objects from
a single RGB image. In contrast to prior art that directly regresses the angular values with a
deep neural network, we propose to learn part-based 2D-3D correspondences as intermediate
representations to achieve improved performance and robustness to partially occluded objects.
We show that the prior knowledge of a projective invariant can be utilized in the training process
to further improve the representation learning with extra unlabeled images.
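The core idea of recovering orientation from part-based 2D-3D correspondences can be illustrated, outside the thesis itself, with the classical Kabsch algorithm, which solves for the rotation best aligning two corresponding point sets. The part keypoints below are hypothetical placeholders, not the thesis's actual parts or solver:

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R minimizing ||R @ P - Q||_F (columns of P and Q correspond)."""
    H = P @ Q.T
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Hypothetical object-frame part keypoints (3 x N, rank 3).
parts = np.array([[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])
# Ground-truth yaw (rotation about the vertical axis).
t = np.deg2rad(30.0)
R_gt = np.array([[np.cos(t), 0.0, np.sin(t)],
                 [0.0,       1.0, 0.0],
                 [-np.sin(t), 0.0, np.cos(t)]])
observed = R_gt @ parts        # the parts as "seen" after the object rotates
R_est = kabsch(parts, observed)  # orientation recovered from correspondences
```

With noisy, partially occluded correspondences the least-squares nature of this alignment is what lends the representation its robustness: a few missing parts still leave enough constraints to solve for the pose.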
Second, we study inferring the posture of a class of articulated objects from single-view images. We identify a dataset bias problem that limits the generalization power of the model that lifts the 2D representations to the 3D targets. We address this problem by proposing the first method to incorporate synthetic data into the training phase, where the
choice of a hierarchical pose representation motivates the design of genetic operators to enrich
a population of tree-structured data. Experiments validate that this approach improves model
generalization of 2D-to-3D networks to unseen inputs.
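As an illustration of genetic operators applied to tree-structured pose data, the sketch below uses a hypothetical two-limb skeleton (not the thesis's actual representation) and shows subtree crossover plus bounded angle mutation, the two classic operators for enriching a population:

```python
import copy
import random

# Hypothetical hierarchical pose: a root with two limb subtrees.
pose_a = {"root": {"angles": [0.0], "children": {
    "l_arm": {"angles": [0.1, 0.2], "children": {}},
    "r_arm": {"angles": [0.3, 0.4], "children": {}},
}}}
pose_b = copy.deepcopy(pose_a)
pose_b["root"]["children"]["l_arm"]["angles"] = [0.9, 0.8]

def crossover(parent_a, parent_b, limb):
    """Swap one limb subtree from parent_b into a copy of parent_a."""
    child = copy.deepcopy(parent_a)
    child["root"]["children"][limb] = copy.deepcopy(
        parent_b["root"]["children"][limb])
    return child

def mutate(pose, limb, scale=0.05, seed=0):
    """Perturb a limb's joint angles with bounded random noise."""
    out = copy.deepcopy(pose)
    rng = random.Random(seed)
    node = out["root"]["children"][limb]
    node["angles"] = [a + rng.uniform(-scale, scale) for a in node["angles"]]
    return out

# One new tree-structured sample built from two parents.
new_pose = mutate(crossover(pose_a, pose_b, "l_arm"), "r_arm")
```

Operating on whole subtrees keeps every offspring kinematically plausible per limb, which is the appeal of a hierarchical representation for this kind of data augmentation.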
We then extend our study of 3D perception to two views for stereo 3D object detection,
where the goal is to precisely estimate the instance location, orientation, and size. Previous
studies learn a point-based representation and suffer from depth estimation errors and a lack of semantic cues. We instead propose to learn a high-resolution voxel-based representation
for accurate and robust estimation of the object's rigid pose. We design a part-based confidence map representation and demonstrate its superiority over a center-based design for partially occluded and truncated instances. This approach achieves high precision and
model-agnostic pose refinement. It also leads to a new multi-resolution design that makes high-resolution
modeling of important object regions computationally tractable.
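The voxel-based route can be sketched, under assumed camera parameters that are purely illustrative and not taken from the thesis, as converting stereo disparity to metric depth (z = f·b/d) and binning the depths into a coarse voxel grid along the viewing axis:

```python
import numpy as np

# Assumed stereo parameters (illustrative only):
f, b = 700.0, 0.54                    # focal length (px), baseline (m)
disparity = np.array([[70.0, 35.0],
                      [14.0, 7.0]])   # toy 2x2 disparity map (px)
depth = f * b / disparity             # z = f * b / d  (m)

# Bin depths into a coarse voxel grid along the viewing axis: 12 cells of 5 m.
edges = np.linspace(0.0, 60.0, 13)
occupancy = np.histogram(depth, bins=edges)[0] > 0
```

The 1/d relationship also shows why distant objects are hard for point-based methods: a one-pixel disparity error maps to a much larger depth error at long range, which motivates reasoning in a volumetric grid where resolution can be allocated explicitly.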
Finally, we push the capability of our perception system to go beyond the pose estimation
task and achieve fine-grained shape inference, narrowing the gap between a computer vision
system and the binocular human vision system. We design the first model for joint stereo 3D
object detection and implicit shape estimation that can not only recover scene properties up to
3D bounding boxes but also detailed surface geometry inside the boxes. This model instantiates
the visible surface in Marr’s framework with point-based representations and solves an unseen
surface hallucination problem to achieve a complete and resolution-agnostic shape description
for the detected objects. This extended instance-level model also generalizes the orientation refinement studies to non-rigid object classes such as pedestrians.
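Implicit shape representations of this kind describe a surface as the zero level set of a function that can be evaluated at any 3D point, which is what makes the description resolution-agnostic. A minimal sketch with a sphere signed distance function (purely illustrative, not the learned model from the thesis):

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

# Any 3D point can be queried, so the surface has no fixed resolution.
queries = np.array([[0.0, 0.0, 0.0],   # center of the unit sphere
                    [2.0, 0.0, 0.0]])  # one unit outside the surface
values = sphere_sdf(queries, np.zeros(3), 1.0)
```

A learned implicit model replaces this analytic function with a network conditioned on the detected instance, so meshes at any desired resolution can be extracted afterwards, e.g. by marching cubes over the queried field.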