THESIS
2023
1 online resource (xiii, 92 pages) : illustrations (chiefly color)
Abstract
Multimodal conversational AI combines visual perception and linguistic understanding
to enable more contextually rich conversations. By integrating vision and language,
these systems can comprehend user utterances, interpret open-domain images, and generate
appropriate responses related to objects and their relationships within the visual
context. However, several challenges hinder the effectiveness and applicability of these
systems. This thesis addresses two primary challenges in building a robust vision-language
understanding of objects.
The first challenge involves understanding objects in multimodal contexts, requiring
visual reasoning and grounding to comprehend object semantics. This skill enables
multimodal conversational AI systems to appropriately interpret user queries that
refer to or involve objects. The second challenge pertains to assessing the faithfulness
of vision-language understanding of objects, a problem that repeatedly arises in the form
of object hallucination, where vision-language models generate answers that are nonsensical
or unfaithful to the provided multimodal inputs.
To enhance the multimodal alignment needed for the vision-language understanding of
objects, we introduce three methods for multimodal object identification in multimodal
dialogue: dialogue-contextualized object detection, object-dialogue alignment,
and scene-dialogue alignment. Unlike previous works, which can only identify specific and
unambiguous objects, our solutions enable the multimodal identification of all objects that
plausibly fit the constraints conveyed in the dialogue context. Our methods thus align
better with real-world scenarios, where multiple objects can fit a given context and induce
ambiguity in the conversation.
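
For illustration, the following is a minimal sketch of the all-plausible-objects
formulation, assuming precomputed dialogue and object embeddings and a similarity
threshold; the scorer, object names, and threshold value are illustrative assumptions,
not the thesis's actual models, which instead rely on dialogue-contextualized object
detection and object/scene-dialogue alignment.

    # Minimal sketch: return every object compatible with the dialogue
    # context, not only the single best match. All names and the threshold
    # below are hypothetical placeholders.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity between two embedding vectors."""
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def identify_objects(dialogue_emb: np.ndarray,
                         object_embs: dict[str, np.ndarray],
                         threshold: float = 0.5) -> list[str]:
        """Return all objects whose embeddings plausibly fit the dialogue
        context, instead of only the argmax candidate."""
        return [name for name, emb in object_embs.items()
                if cosine(dialogue_emb, emb) >= threshold]

    # Toy usage: two chairs both satisfy the same referring expression.
    rng = np.random.default_rng(0)
    dialogue_emb = rng.normal(size=128)
    object_embs = {
        "chair_1": dialogue_emb + 0.3 * rng.normal(size=128),  # plausible match
        "chair_2": dialogue_emb + 0.4 * rng.normal(size=128),  # plausible match
        "table_1": rng.normal(size=128),                       # unrelated object
    }
    print(identify_objects(dialogue_emb, object_embs))  # ['chair_1', 'chair_2']

The key design choice is thresholding rather than taking a single argmax, which mirrors
conversations in which several objects are equally consistent with the user's constraints.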
Furthermore, to overcome the absence of object hallucination assessment in multimodal
conversational AI, we propose the Negative Object Presence Evaluation (NOPE)
to quantitatively measure object hallucination in the context of visual question answering
(VQA). The construction of NOPE enables the assessment of object hallucination and
paves the way for future research to investigate and mitigate object hallucination,
which hinders faithful vision-language understanding of objects.
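
For illustration, the following is a minimal sketch of how NOPE-style hallucination
scoring can be computed, assuming each evaluation item pairs a question about an object
absent from the image with a model answer, and that a faithful answer is a negative
indefinite pronoun; the pronoun list and answer normalization are illustrative assumptions.

    # Minimal sketch of NOPE-style scoring: a model hallucinates when it
    # answers a question about an absent object with anything other than a
    # negative indefinite pronoun. The list and normalization are illustrative.
    NEGATIVE_PRONOUNS = {"none", "nothing", "nobody", "no one", "nowhere", "neither"}

    def is_negative(answer: str) -> bool:
        """True if the answer correctly denies the object's presence."""
        return answer.strip().lower().rstrip(".") in NEGATIVE_PRONOUNS

    def hallucination_rate(model_answers: list[str]) -> float:
        """Fraction of questions about absent objects that the model answers
        as if the object were present (i.e., with a non-negative answer)."""
        hallucinated = sum(1 for a in model_answers if not is_negative(a))
        return hallucinated / len(model_answers)

    # Toy usage: two faithful denials, one hallucinated object.
    answers = ["None", "nothing", "A red umbrella"]
    print(hallucination_rate(answers))  # 0.333...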
In summary, the contributions of this thesis advance the field of multimodal conversational
AI by improving its vision-language object understanding through more realistic
and reliable multimodal object identification in multimodal dialogue. Additionally, we
address a critical gap in the faithfulness of vision-language object understanding by introducing
NOPE, a VQA-based assessment for object hallucination that the field previously
lacked.