THESIS
2022
1 online resource (xvi, 148 pages) : illustrations (some color)
Abstract
Natural language understanding (NLU) is the task of semantically decoding human
languages by machines. NLU allows users to interact with machines using natural sentences
and is a fundamental component of any natural language processing (NLP)
system. Despite the significant achievements of machine learning approaches, in particular
deep learning, on NLU tasks, these approaches still rely heavily on large amounts of training
data to ensure good performance and fail to generalize well to languages and domains
with little training data.
with little training data. Obtaining or collecting massive data samples is relatively easy for
high-resource languages (e.g., English, Chinese) with significant amounts of textual data
on the Internet. However, many other languages have only a small online footprint (e.g.,
less than 0.1% of data resources on the Internet are in Tamil or Urdu). This makes collecting
datasets for these low-resource languages much more difficult. Similarly, datasets for
low-resource domains (e.g., rare diseases), which have very few data resources and domain
experts, are also much more challenging to collect than for high-resource domains
(e.g., news). To enable machines to better comprehend natural sentences in low-resource
languages and domains, it is necessary to overcome the data scarcity challenge, in which very few or even zero training samples are available.
Cross-lingual and cross-domain transfer learning methods have been proposed to learn
task knowledge from large training samples of high-resource languages and domains and
transfer it to low-resource languages and domains. However, previous methods failed to
effectively tackle the two main challenges in developing cross-lingual and cross-domain
systems, namely, 1) that it is difficult to learn good representations from low-resource target
languages (domains); and 2) that it is difficult to transfer the task knowledge from
high-resource source languages (domains) to low-resource target languages (domains)
due to the discrepancies between languages (domains). Meeting these challenges
in a deep learning framework calls for new investigations.
In this thesis, we focus on addressing the aforementioned challenges in a deep learning
framework. First, we propose to further refine the representations of task-related keywords
across languages. We find that the representations for low-resource languages can
be easily and substantially improved by focusing on just the keywords. Second, we present an
Order-Reduced Transformer for cross-lingual adaptation, and find that modeling partial
word orders instead of the whole sequence improves the model's robustness to word order
differences between languages and improves task knowledge transfer to low-resource
languages. Third, we propose to leverage different levels of domain-related corpora
and additional masking of data during pre-training for cross-domain adaptation (a sketch
of the masking idea is given below), and discover that more challenging pre-training can
better address the domain discrepancy issue in the task knowledge transfer. Finally, we
introduce a coarse-to-fine framework, Coach, and a cross-lingual and cross-domain parsing
framework, X2Parser. Coach decomposes the representation learning process into coarse-grained
and fine-grained feature learning steps, and X2Parser simplifies hierarchical task structures
into flattened ones. We observe that simplifying task structures makes representation
learning more effective for low-resource languages and domains.
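The third contribution above mentions "additional masking of data" during pre-training for cross-domain adaptation. As a hedged illustration only, the minimal Python sketch below shows one way such "more challenging" pre-training could look: masking a larger fraction of tokens in domain-related text than the standard masked-language-model ratio. The 15% and 30% ratios, the [MASK] placeholder, the mask_tokens helper, and the example sentence are illustrative assumptions, not details taken from the thesis.

    import random

    # Illustrative assumption: dynamic token masking with a configurable ratio.
    # The abstract only says that "additional masking" makes pre-training more
    # challenging; the ratios and mask token below are not from the thesis.
    MASK_TOKEN = "[MASK]"

    def mask_tokens(tokens, mask_ratio, seed=None):
        """Randomly replace a fraction of tokens with MASK_TOKEN."""
        rng = random.Random(seed)
        n_to_mask = max(1, int(round(len(tokens) * mask_ratio)))
        positions = set(rng.sample(range(len(tokens)), n_to_mask))
        inputs = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
        labels = [t if i in positions else None for i, t in enumerate(tokens)]
        return inputs, labels  # a model would be trained to predict `labels`

    # A domain-related sentence (e.g., from a low-resource target-domain corpus).
    sentence = "patients with the rare disease reported severe joint pain".split()

    easy_inputs, _ = mask_tokens(sentence, mask_ratio=0.15, seed=0)  # BERT-style ratio
    hard_inputs, _ = mask_tokens(sentence, mask_ratio=0.30, seed=0)  # more aggressive

    print("15% masking:", " ".join(easy_inputs))
    print("30% masking:", " ".join(hard_inputs))

Raising the masking ratio forces the model to reconstruct more of the domain text from less context, which is one plausible reading of the "more challenging pre-training" described above.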
In all, we tackle the data scarcity issue in NLU by improving low-resource representation
learning and enhancing model robustness on typologically distant languages
and domains in the task knowledge transfer. Experiments show that our models can effectively
adapt to low-resource target languages and domains, and significantly outperform
previous state-of-the-art models.