THESIS
2023
1 online resource (xiv, 101 pages) : color illustrations
Abstract
Building systems capable of seamlessly learning from multiple modalities, such as vision
and language, has been a longstanding aspiration in Artificial Intelligence (AI). As
humans, we acquire new knowledge and skills through various sensory inputs, including
visual signals and textual information. Models that emulate this behavior can potentially
learn more effectively as information from different modalities often complements and
supplements each other. More importantly, their capabilities can be greatly expanded
to perform tasks that unimodal models cannot achieve. In this thesis, we investigate and
present novel approaches for constructing robust and versatile vision-language (VL) models.
In particular, we focus on how to efficiently teach language models (LMs) to comprehend
visual data, as this is more resource-efficient than starting with vision models.
Despite the significant progress made by modern deep learning approaches, most previous
work in VL learning focuses on task-specific fine-tuning, which generalizes poorly.
While preliminary studies have explored VL pre-training in order to build generalized
backbones, several essential problems exist. First, pre-training VL models from scratch
is extremely computationally costly due to the added visual modality. Second, the data
used for pre-training are mainly image-text pairs, where the text components are short,
succinct descriptions of the images. The short texts lead to insufficient language abilities
and unsatisfactory downstream performance, especially for generative tasks. However,
simply adding long-form text-only data does not help much due to the discrepancy between
the unimodal and multimodal training losses. Furthermore, even with robust VL
backbones, the methods to improve their generalization (i.e., zero-shot performance on
unseen datasets) and versatility, both critical factors in their applicability, remain largely
uninvestigated.
To address these challenges, this thesis focuses on two research problems: 1) efficiently
constructing robust VL models with strong language abilities; and 2) improving the
generalization of pre-trained VL models. To this end, we propose three novel approaches.
First, we present a task-specific vision guidance method that uses visual information
to tame pre-trained LMs, enabling them to generate text from VL inputs. This method
adapts text-only LMs to the VL domain without compromising their original language
abilities. Next, we take a step further by introducing task-agnostic vision-language
knowledge distillation (VLKD). VLKD bridges powerful pre-trained vision models and
pre-trained LMs to utilize the capabilities of both. Specifically, we adopt self-supervised
learning on only a small amount of image-text data to align the two, which is considerably
more data- and time-efficient than pre-training from scratch. Last but not least, we
introduce InstructBLIP, a simple yet novel VL instruction tuning framework that enables
VL models to accurately follow instructions. This approach dramatically improves
the model’s generalization on unseen datasets and tasks, achieving state-of-the-art
performance on a wide range of tasks, such as image captioning, visual reasoning, and
both image- and video-based question answering. Qualitative studies further showcase its
robustness and versatility in multi-turn dialogues.
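
A pattern shared by the approaches summarized above is to connect a frozen pre-trained vision encoder to a pre-trained LM through a small trainable bridge that is trained on image-text pairs. The sketch below illustrates that general pattern only; it is not the exact architecture of any chapter. The placeholder encoders, dimensions, and the contrastive alignment loss are illustrative assumptions standing in for real pre-trained checkpoints and the thesis's actual objectives.

```python
# Conceptual sketch (assumptions, not the thesis's exact method): bridge a frozen
# pre-trained vision encoder and a frozen pre-trained LM with a small trainable
# projection, trained on paired image-text data. The tiny modules below are
# placeholders for real pre-trained checkpoints.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVisionEncoder(nn.Module):
    """Placeholder for a pre-trained image encoder (e.g., a CLIP-style ViT)."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, out_dim))

    def forward(self, images):              # images: (B, 3, 32, 32)
        return self.backbone(images)        # (B, out_dim) global image feature

class FrozenTextEncoder(nn.Module):
    """Placeholder for the text side of a pre-trained LM."""
    def __init__(self, vocab_size=1000, dim=768):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)   # mean-pooled token embeddings

    def forward(self, token_ids):           # token_ids: (B, T)
        return self.embed(token_ids)        # (B, dim) pooled text feature

vision, text = FrozenVisionEncoder(), FrozenTextEncoder()
for p in list(vision.parameters()) + list(text.parameters()):
    p.requires_grad = False                 # keep both pre-trained sides frozen

# The only trainable part: a projection that maps image features into the
# LM's representation space so the two modalities can be aligned.
proj = nn.Linear(512, 768)
opt = torch.optim.AdamW(proj.parameters(), lr=1e-4)

# Dummy image-text batch; in practice this comes from paired captioning data.
images = torch.randn(8, 3, 32, 32)
token_ids = torch.randint(0, 1000, (8, 16))

img_emb = F.normalize(proj(vision(images)), dim=-1)
txt_emb = F.normalize(text(token_ids), dim=-1)

# Symmetric contrastive (InfoNCE-style) alignment loss over the batch:
# matching image-text pairs sit on the diagonal of the similarity matrix.
logits = img_emb @ txt_emb.t() / 0.07
targets = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
loss.backward()
opt.step()
print(f"alignment loss: {loss.item():.3f}")
```

Because only the small bridge receives gradients while both pre-trained backbones stay frozen, training cost scales with the bridge rather than the full models, which reflects the efficiency argument the abstract makes for building VL models on top of pre-trained unimodal ones.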