THESIS
2024
1 online resource (xv, 143 pages) : illustrations (some color)
Abstract
Recent progress in natural language processing (NLP) has significantly advanced the capabilities of language models, attracting heightened attention from researchers in academia and industry. Trained on extensive text datasets, these models excel at a variety of linguistic tasks, such as translation, summarization, question answering, and dialogue generation. Underpinning these developments is the essential role of data, the lifeblood of NLP, particularly now that large language models require vast datasets to learn effectively and generate precise outputs. This thesis focuses on data-centric methodologies for optimizing language model performance across a range of NLP applications. It introduces novel methods for improving the way models ingest and process data, thereby making notable advances in their practical deployment in real-world scenarios.
The research unfolds through a deep dive into the data-driven facets of NLP, encompassing both data quantity and quality. Adopting a top-down approach, it traverses the full data lifecycle, addressing data utilization, enhancement, and construction. In terms of data utilization, the study begins by adapting models with limited data and then taps the potential of unlabeled data to enhance model performance through continual learning. Turning to data enhancement, the study improves the quality of synthetically generated data for previously learned tasks, consolidating model knowledge during continual learning. It then designs a method to control the complexity of instruction data and investigates its influence on the performance of large language models. Targeting data construction, the study first develops a large-scale, causal-complete pretraining corpus tailored for document-grounded dialogue tasks. It also creates instruction datasets for diverse tools by harnessing the abilities of large language models, equipping them with the capacity for tool use.
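To make the data-enhancement step more concrete, the general pseudo-replay idea behind consolidating knowledge with synthetic data can be sketched as a small pipeline: generate synthetic samples for earlier tasks, keep only those passing a quality filter, and mix them into the new task's training data at a fixed ratio. This is a minimal illustration of that generic pattern, not the thesis's actual method; every name and threshold below (generate_pseudo_samples, quality_score, replay_ratio) is a hypothetical placeholder.

```python
import random

# Hypothetical pseudo-replay sketch: consolidate knowledge of earlier tasks
# by mixing quality-filtered synthetic samples into new-task training data.
# All functions and thresholds here are illustrative assumptions.

def generate_pseudo_samples(task_name, n):
    """Stand-in for sampling a language model conditioned on a task prefix."""
    return [f"[{task_name}] synthetic example {i}" for i in range(n)]

def quality_score(sample):
    """Toy heuristic; a real system might use perplexity or a learned scorer."""
    return min(len(sample.split()) / 10.0, 1.0)  # favor longer, complete text

def build_replay_mixture(old_tasks, new_task_data,
                         per_task=100, threshold=0.3, replay_ratio=0.2):
    # Generate and filter synthetic samples for each previously learned task.
    replay = []
    for task in old_tasks:
        candidates = generate_pseudo_samples(task, per_task)
        replay.extend(s for s in candidates if quality_score(s) >= threshold)
    # Cap replay data so it remains a fixed fraction of the training mixture.
    budget = int(replay_ratio * len(new_task_data))
    random.shuffle(replay)
    mixture = new_task_data + replay[:budget]
    random.shuffle(mixture)
    return mixture

if __name__ == "__main__":
    new_data = [f"new-task example {i}" for i in range(500)]
    mixture = build_replay_mixture(["summarization", "qa"], new_data)
    print(f"training mixture size: {len(mixture)}")
```

The filtering step is where synthetic-data quality matters: discarding low-scoring pseudo-samples before mixing is one simple way to keep replay data from degrading, rather than consolidating, previously acquired knowledge.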
In summary, this thesis contributes to data-driven NLP research by systematically covering the full cycle of data handling. The methods it introduces are designed to substantially advance the capabilities of language models and improve their practical deployment in real-world scenarios.