THESIS
2024
1 online resource (xix, 132 pages) : illustrations (chiefly color)
Abstract
The majority of contemporary perception models leverage Transformer-based architectures, such as DETR for object detection and Mask2Former for image segmentation. Central to these frameworks is the concept of extracting objects from image features through the formulation of queries, underscoring the significance of query design.
In this dissertation, we explore integrating locality priors into the global attention mechanism through novel query designs in DN-DETR and DINO. These designs include: (1) conceptualizing queries as anchor boxes; (2) predicting relative object locations at each decoder layer; (3) an auxiliary denoising task that refines queries to lie close to object bounding boxes; and (4) strategic query initialization coupled with a selection process. These advancements have yielded substantial improvements in both performance and training efficiency; as a result, DINO has become the strongest detection head, adopted by many top-performing detection models.
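The query-denoising idea above can be sketched as follows. This is a simplified NumPy illustration, not the DN-DETR implementation: the (cx, cy, w, h) box format, the noise scale, and the L1 reconstruction loss are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_box_noise(gt_boxes, scale=0.1):
    """Jitter ground-truth (cx, cy, w, h) boxes to create noised queries.
    Centers are shifted by up to `scale` times the box size; sizes are
    scaled multiplicatively within the same bound."""
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    noise = rng.uniform(-scale, scale, size=gt_boxes.shape)
    noised = gt_boxes.copy()
    noised[:, :2] += noise[:, :2] * gt_boxes[:, 2:]   # perturb centers
    noised[:, 2:] *= 1.0 + noise[:, 2:]               # perturb sizes
    return noised

def denoising_loss(pred_boxes, gt_boxes):
    """L1 reconstruction loss: the decoder is trained to map the
    noised queries back to the original ground-truth boxes."""
    return np.abs(np.asarray(pred_boxes) - np.asarray(gt_boxes)).mean()

gt = [[0.5, 0.5, 0.2, 0.3]]           # one ground-truth box
queries = add_box_noise(gt)           # noised anchor-box queries
loss = denoising_loss(queries, gt)    # a perfect decoder drives this to 0
```

Because the noised queries start near real objects, the denoising branch gives the decoder an easier, well-localized regression target at every layer, which is what accelerates training.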
In the domain of open-world perception, defining objects presents a fundamental challenge. In computer vision, visual prompts are often used to identify objects in open-world settings, serving a function similar to that of queries in closed-set perception. To address this, we introduced Semantic-SAM, a novel model that integrates visual prompts into the positional component of queries. Semantic-SAM, trained on the extensive SA-1B visual prompt dataset, achieves performance comparable to that of SAM. However, directly using visual prompts as queries restricts their format and precludes multi-round interactions that require memory prompts. To overcome this, we developed SEEM, which incorporates visual prompts through a cross-attention mechanism with queries. SEEM achieved state-of-the-art results in interactive segmentation at the time of its introduction.
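The cross-attention fusion described above can be sketched as a single-head attention step in which object queries attend to prompt embeddings. This is a minimal NumPy sketch of the general mechanism, not SEEM's actual architecture; the embedding dimension, single head, and residual update are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, prompts):
    """Fuse prompt information into queries via single-head
    cross-attention: each query attends over all prompt embeddings
    and receives a residual update from the attended mixture."""
    d = queries.shape[-1]
    scores = queries @ prompts.T / np.sqrt(d)   # (num_queries, num_prompts)
    attn = softmax(scores, axis=-1)             # rows sum to 1
    return queries + attn @ prompts             # residual update

rng = np.random.default_rng(0)
q = rng.normal(size=(100, 256))   # object queries
p = rng.normal(size=(3, 256))     # visual-prompt embeddings (e.g., clicks)
fused = cross_attend(q, p)
```

Because the prompts enter through attention rather than replacing the queries themselves, the prompt format stays flexible and additional prompts (such as memory from earlier interaction rounds) can be appended to `p` without changing the query set.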
As language models progress, the importance of language prompts in computer vision has become more widely recognized. We introduced OpenSEED, a method that uses contrastive learning to align language prompts with queries, achieving top performance in zero-shot segmentation. Employing a similar contrastive approach, LLaVA-Grounding excelled in referring expression comprehension (REC) and referring expression segmentation (RES), outperforming other multi-modal LLMs of the same model size. Additionally, SEEM fuses queries with both language and visual prompts via cross-attention. Our proposed techniques, including contrastive learning for matching queries with prompts and their fusion via cross-attention, have become widely adopted in open-world perception.
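The contrastive alignment objective can be sketched as a symmetric InfoNCE-style loss in which row i of the query embeddings and row i of the text embeddings form a positive pair. This is an illustrative NumPy sketch of the standard objective, not the exact OpenSEED loss; the temperature value and pairing scheme are assumptions.

```python
import numpy as np

def contrastive_align_loss(query_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss matching each object query to its
    paired language-prompt embedding (row i of each matrix is a
    positive pair; all other rows are negatives)."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = q @ t.T / temperature          # cosine similarities, scaled
    labels = np.arange(len(q))

    def ce(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the query-to-text and text-to-query directions.
    return 0.5 * (ce(logits) + ce(logits.T))
```

Minimizing this loss pulls each query toward the embedding of its matching language prompt and pushes it away from all others, which is what enables zero-shot matching of novel category names at inference time.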
Merely locating objects is insufficient for contemporary applications such as autonomous driving, where knowing a vehicle's speed and intentions is also crucial. With the rise of Large Language Models (LLMs), object understanding can be cast as an open-ended question-answering task, requiring the perception model to answer questions about any object. To facilitate this deeper understanding, we propose LLaVA-Grounding, which integrates the perception model with multi-modal prompts and LLMs, allowing it to interpret user prompts and reason about objects.
In summary, this dissertation advances open-world perception by introducing effective query designs that enhance object localization through the integration of local priors. It also presents innovative strategies for matching and fusing prompt information with queries, significantly enriching perception research. Finally, it builds a multimodal LLM on top of the perception model to enable deeper object understanding.