THESIS
2022
1 online resource (xvi, 148 pages) : illustrations (some color)
Abstract
Natural language understanding (NLU) is the task of semantically decoding human
languages by machines. NLU allows users to interact with machines using natural sentences
and is a fundamental component of any natural language processing (NLP)
system. Despite the significant achievements of machine learning approaches, in particular
deep learning, on NLU tasks, these approaches still rely heavily on large amounts of training
data to ensure good performance and fail to generalize well to languages and domains
with little training data.
with little training data. Obtaining or collecting massive data samples is relatively easy for
high-resource languages (e.g., English, Chinese) with significant amounts of textual data
on the Internet. However, many other languages have only a small online footprint (e.g.,
less than 0.1% of data resources on the Internet are in Tamil or Urdu). This makes collecting
datasets for these low-resource languages much more difficult. Similarly, datasets for
low-resource domains (e.g., rare diseases), which have very few data resources and domain
experts, are also much more challenging to collect than for high-resource domains
(e.g., news). To enable machines to better comprehend natural sentences in low-resource
languages and domains, it is necessary to overcome the data scarcity challenge, in which very few or even zero training samples are available.
Cross-lingual and cross-domain transfer learning methods have been proposed to learn
task knowledge from large training samples of high-resource languages and domains and
transfer it to low-resource languages and domains. However, previous methods failed to
effectively tackle the two main challenges in developing cross-lingual and cross-domain
systems, namely, 1) that it is difficult to learn good representations from low-resource target
languages (domains); and 2) that it is difficult to transfer the task knowledge from
high-resource source languages (domains) to low-resource target languages (domains)
due to the discrepancies between languages (domains). Meeting these challenges
in a deep learning framework calls for new investigations.
In this thesis, we focus on addressing the aforementioned challenges in a deep learning
framework. First, we propose to further refine the representations of task-related keywords
across languages. We find that the representations for low-resource languages can
be easily and substantially improved by focusing on just the keywords. Second, we present an
Order-Reduced Transformer for cross-lingual adaptation, and find that modeling partial
word orders instead of the whole sequence improves the model's robustness to word order
differences between languages and improves task knowledge transfer to low-resource
languages. Third, we propose to leverage different levels of domain-related corpora
and additional masking of data during pre-training for cross-domain adaptation (a sketch
of the masking idea is given below), and discover that more challenging pre-training can
better address the domain discrepancy issue in the task knowledge transfer. Finally, we
introduce a coarse-to-fine framework, Coach, and a cross-lingual and cross-domain parsing
framework, X2Parser. Coach decomposes the representation learning process into coarse-grained
and fine-grained feature learning steps, and X2Parser simplifies hierarchical task structures
into flattened ones. We observe that simplifying task structures makes representation
learning more effective for low-resource languages and domains.
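The third contribution above mentions "additional masking of data" during pre-training for cross-domain adaptation. As a hedged illustration only, the minimal Python sketch below shows one way such "more challenging" pre-training could look: masking a larger fraction of tokens in domain-related text than the standard masked-language-model ratio. The 15% and 30% ratios, the [MASK] placeholder, the mask_tokens helper, and the example sentence are illustrative assumptions, not details taken from the thesis.

    import random

    # Illustrative assumption: dynamic token masking with a configurable ratio.
    # The abstract only says that "additional masking" makes pre-training more
    # challenging; the ratios and mask token below are not from the thesis.
    MASK_TOKEN = "[MASK]"

    def mask_tokens(tokens, mask_ratio, seed=None):
        """Randomly replace a fraction of tokens with MASK_TOKEN."""
        rng = random.Random(seed)
        n_to_mask = max(1, int(round(len(tokens) * mask_ratio)))
        positions = set(rng.sample(range(len(tokens)), n_to_mask))
        inputs = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
        labels = [t if i in positions else None for i, t in enumerate(tokens)]
        return inputs, labels  # a model would be trained to predict `labels`

    # A domain-related sentence (e.g., from a low-resource target-domain corpus).
    sentence = "patients with the rare disease reported severe joint pain".split()

    easy_inputs, _ = mask_tokens(sentence, mask_ratio=0.15, seed=0)  # BERT-style ratio
    hard_inputs, _ = mask_tokens(sentence, mask_ratio=0.30, seed=0)  # more aggressive

    print("15% masking:", " ".join(easy_inputs))
    print("30% masking:", " ".join(hard_inputs))

Raising the masking ratio forces the model to reconstruct more of the domain text from less context, which is one plausible reading of the "more challenging pre-training" described above.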
In all, we tackle the data scarcity issue in NLU by improving low-resource representation
learning and enhancing model robustness on typologically distant languages
and domains in the task knowledge transfer. Experiments show that our models can effectively
adapt to low-resource target languages and domains, and significantly outperform
previous state-of-the-art models.