THESIS
2020
xv, 161 pages : illustrations ; 30 cm
Abstract
3D scene perception, with a variety of applications in robotics and augmented reality,
is in great demand in both academia and industry, yet it is still in its early stages and
suffers from limited robustness and scalability. This Ph.D. thesis focuses on the fundamental
problems in 3D scene perception, i.e., scalable geometric modeling and semantic understanding
of real-world environments.
Firstly, we identify that globally consistent camera pose estimation is critical for
robust 3D modeling. While an exhaustive search of all previous observations is infeasible
for large-scale datasets, Loop Closure Detection (LCD) has proven extremely
useful for achieving global consistency of visual observations. State-of-the-art
methods are popular for their efficiency yet suffer from low recall, because
high-dimensional binary feature descriptors lack well-defined centroids.
We therefore propose a real-time LCD approach called MILD (Multi-Index Hashing for
Loop closure Detection), in which image similarity is measured directly by feature matching.
With the aid of the Multi-Index Hashing (MIH) technique, this achieves high recall
without introducing extra computational complexity. We further introduce GCSLAM, a robust
globally consistent pose estimation approach that minimizes the registration error of the
visual observations collected from MILD. It achieves state-of-the-art accuracy while maintaining
high efficiency, based on our proposed FastGO technique for Fast Globally Consistent
Point Cloud Registration. Building on this globally consistent camera pose estimation, we present
FlashFusion, which performs real-time dense 3D reconstruction on portable devices for AR/VR applications.
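
As an illustration of the multi-index hashing idea mentioned above, the following is a minimal sketch and not the MILD implementation. It assumes binary descriptors stored as Python integers; the class name `MultiIndexHash`, its parameters, and the 16-bit toy descriptors are hypothetical. Each descriptor is split into substrings, and each substring is indexed in its own hash table. By the pigeonhole principle, two descriptors whose Hamming distance is smaller than the number of tables must agree exactly on at least one substring, so candidates can be found by exact lookups and then verified with the full Hamming distance.

```python
from collections import defaultdict


class MultiIndexHash:
    """Toy multi-index hash over fixed-length binary descriptors (ints)."""

    def __init__(self, n_bits=64, n_tables=4):
        if n_bits % n_tables != 0:
            raise ValueError("n_bits must be divisible by n_tables")
        self.n_tables = n_tables
        self.chunk_bits = n_bits // n_tables
        self.mask = (1 << self.chunk_bits) - 1
        # One hash table per substring position.
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.descriptors = []

    def _chunks(self, descriptor):
        # Split the descriptor into n_tables disjoint substrings.
        return [(descriptor >> (i * self.chunk_bits)) & self.mask
                for i in range(self.n_tables)]

    def add(self, descriptor):
        """Index a descriptor and return its integer id."""
        idx = len(self.descriptors)
        self.descriptors.append(descriptor)
        for table, key in zip(self.tables, self._chunks(descriptor)):
            table[key].append(idx)
        return idx

    def query(self, descriptor, max_dist):
        """Return sorted (distance, id) pairs within max_dist.

        The result is exhaustive only when max_dist < n_tables, because
        only exact substring matches are probed.
        """
        candidates = set()
        for table, key in zip(self.tables, self._chunks(descriptor)):
            candidates.update(table.get(key, ()))
        hits = []
        for idx in candidates:
            # Verify each candidate with the full Hamming distance.
            dist = bin(self.descriptors[idx] ^ descriptor).count("1")
            if dist <= max_dist:
                hits.append((dist, idx))
        return sorted(hits)
```

A query touches only the buckets that share a substring with the query descriptor, so no cluster centroids are needed. This is one way to read the abstract's point about binary descriptors lacking well-defined centroids.
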
Moreover, we demonstrate that semantic understanding of the environment serves
as the key component for scalable representation of 3D environments by simplifying independent
point clouds into meaningful objects. Unlike images, which are represented by
densely organized pixels in 2D space, 3D scenes are normally represented by unordered point
clouds, which makes it difficult to apply convolutional neural networks to
3D scene understanding. For efficient and robust semantic understanding of 3D environments,
we propose a chunk-based, spatially sparse convolution scheme that is 4x faster than the
previous state of the art. It builds on the insight that points form continuous 2D surfaces
in 3D space. A novel occupancy signal is introduced for robust instance-level semantic
understanding of 3D environments, achieving state-of-the-art accuracy on public
datasets. Finally, we present a building-scale dense 3D reconstruction system with a room-level
loop closure detector that relies on the proposed semantic understanding approaches
of the reconstructed 3D model.
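
To make the notion of spatially sparse convolution over chunks more concrete, here is a small sketch under stated assumptions. It is a generic illustration, not the thesis's scheme. The chunk side length `CHUNK`, the function names, and the scalar features are all hypothetical. Points are voxelized, occupied voxels are grouped into chunks so that empty space is never allocated, and the convolution is evaluated only at occupied voxels.

```python
from collections import defaultdict

CHUNK = 8  # hypothetical chunk side length, in voxels


def voxelize(points, voxel_size):
    """Map 3D points to the set of integer voxel coordinates they occupy."""
    return {tuple(int(c // voxel_size) for c in p) for p in points}


def group_into_chunks(voxels):
    """Group occupied voxels by the chunk that contains them."""
    chunks = defaultdict(set)
    for v in voxels:
        chunks[tuple(c // CHUNK for c in v)].add(v)
    return chunks


def sparse_conv(features, weights):
    """Convolve scalar features, producing outputs at occupied voxels only.

    features: dict mapping voxel (x, y, z) to a float
    weights:  dict mapping kernel offset (dx, dy, dz) to a float
    """
    out = {}
    # Only chunks that contain data are visited; empty chunks never exist.
    for chunk in group_into_chunks(features.keys()).values():
        for v in chunk:
            acc = 0.0
            for (dx, dy, dz), w in weights.items():
                n = (v[0] + dx, v[1] + dy, v[2] + dz)
                if n in features:  # unoccupied neighbours contribute nothing
                    acc += w * features[n]
            out[v] = acc
    return out
```

For a point cloud sampled from a surface, the number of occupied voxels grows roughly with the surface area rather than with the volume of the bounding box, which is why skipping empty space pays off.
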