THESIS
2024
1 online resource (xv, 143 pages) : illustrations (some color)
Abstract
Recent progress in natural language processing (NLP) has significantly advanced the capabilities of language models, attracting heightened attention from researchers in academia and industry. Trained on extensive text datasets, these models excel at a variety of linguistic tasks, such as translation, summarization, question answering, and dialogue generation. Underpinning these developments is the essential role of data, the lifeblood of NLP, particularly now that large language models require vast datasets to learn effectively and generate precise outputs. This thesis focuses on data-centric methodologies for optimizing language model performance across a range of NLP applications. It introduces novel methods for improving the way models ingest and process data, thereby making notable advances in their practical deployment in real-world scenarios.
The research unfolds through a deep dive into the data-driven facets of NLP, encompassing both data quantity and quality. Adopting a top-down approach, it traverses the full data lifecycle, addressing data utilization, enhancement, and construction. In terms of data utilization, the study begins by adapting models with limited data and then taps the potential of unlabeled data to enhance model performance through continual learning. Turning to data enhancement, the study improves the quality of synthetically generated data for previously learned tasks, consolidating model knowledge during continual learning. It then designs a method to control the complexity of instruction data and investigates its influence on the performance of large language models. Targeting data construction, the study first develops a large-scale, causal-complete pretraining corpus tailored for document-grounded dialogue tasks. It also creates instruction datasets for diverse tools by harnessing the abilities of large language models, equipping them with the capacity for tool use.
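To make the data-enhancement step more concrete, the general pseudo-replay idea behind consolidating knowledge with synthetic data can be sketched as a small pipeline: generate synthetic samples for earlier tasks, keep only those passing a quality filter, and mix them into the new task's training data at a fixed ratio. This is a minimal illustration of that generic pattern, not the thesis's actual method; every name and threshold below (generate_pseudo_samples, quality_score, replay_ratio) is a hypothetical placeholder.

```python
import random

# Hypothetical pseudo-replay sketch: consolidate knowledge of earlier tasks
# by mixing quality-filtered synthetic samples into new-task training data.
# All functions and thresholds here are illustrative assumptions.

def generate_pseudo_samples(task_name, n):
    """Stand-in for sampling a language model conditioned on a task prefix."""
    return [f"[{task_name}] synthetic example {i}" for i in range(n)]

def quality_score(sample):
    """Toy heuristic; a real system might use perplexity or a learned scorer."""
    return min(len(sample.split()) / 10.0, 1.0)  # favor longer, complete text

def build_replay_mixture(old_tasks, new_task_data,
                         per_task=100, threshold=0.3, replay_ratio=0.2):
    # Generate and filter synthetic samples for each previously learned task.
    replay = []
    for task in old_tasks:
        candidates = generate_pseudo_samples(task, per_task)
        replay.extend(s for s in candidates if quality_score(s) >= threshold)
    # Cap replay data so it remains a fixed fraction of the training mixture.
    budget = int(replay_ratio * len(new_task_data))
    random.shuffle(replay)
    mixture = new_task_data + replay[:budget]
    random.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    new_data = [f"new-task example {i}" for i in range(500)]
    mixture = build_replay_mixture(["summarization", "qa"], new_data)
    print(f"training mixture size: {len(mixture)}")
```

The filtering step is where synthetic-data quality matters: discarding low-scoring pseudo-samples before mixing is one simple way to keep replay data from degrading, rather than consolidating, previously acquired knowledge.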
In summary, this thesis contributes to data-driven NLP research by systematically covering the full cycle of data handling. The methods it introduces are designed to substantially advance the capabilities of language models and improve their practical deployment in real-world scenarios.