THESIS
2021
1 online resource (x, 59 pages) : illustrations (some color)
Abstract
Pre-trained language models play an essential role in modern natural language processing. Unsupervised learning from massive corpora provides good language representations for various natural language understanding tasks. However, not every language benefits from pre-trained language models. Most research on pre-trained language models concerns widely used languages such as English, Chinese, or other Indo-European languages. Moreover, such schemes often require heavy computational resources alongside large amounts of data, which is usually infeasible for less-studied languages.
To address this research niche, we aim to construct a language model that understands the linguistic phenomena of a target language in a low-resource setting. In this regard, this thesis focuses on language modeling dedicated to the Korean language, especially concerning language representation and pre-training methods. Though Korean is one of the widely used languages in East Asia, it is less studied in natural language processing; therefore, developing novel approaches for Korean language modeling is necessary. Our Korean-specific language representation is expected to help build more powerful language models for Korean understanding, even with fewer resources.
Based on the widely used transformer architecture and bidirectional language representation, we propose chunk-wise reconstruction for chunk-level understanding of the Korean language. In addition, we inject morphological features such as Part-of-Speech (PoS) tags into the language model by leveraging this information during pre-training. The proposed methods show adequate performance on scrambled sentence recognition and text classification.
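As a rough illustration of the PoS-injection idea, the sketch below adds a PoS-tag embedding to the usual token and position embeddings of a transformer encoder, analogous to how BERT adds segment embeddings. The class name, dimensions, and the additive combination are illustrative assumptions, not the exact mechanism used in the thesis.

```python
import torch
import torch.nn as nn

class PosAwareEmbedding(nn.Module):
    """Hypothetical input layer: token + position + PoS-tag embeddings."""
    def __init__(self, vocab_size=32000, num_pos_tags=64,
                 hidden_size=768, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.position_emb = nn.Embedding(max_len, hidden_size)
        self.pos_tag_emb = nn.Embedding(num_pos_tags, hidden_size)  # morphological signal
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, pos_tag_ids):
        # token_ids and pos_tag_ids: (batch, seq_len), aligned per token
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.token_emb(token_ids)
             + self.position_emb(positions)    # broadcasts over the batch dim
             + self.pos_tag_emb(pos_tag_ids))  # inject PoS information
        return self.norm(x)

# Example usage with random ids (shapes only, no pretrained weights):
emb = PosAwareEmbedding()
out = emb(torch.randint(0, 32000, (2, 10)), torch.randint(0, 64, (2, 10)))
```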
Our experimental results show that the proposed methods improve model performance on the investigated Korean language understanding tasks. The chunk-wise reconstruction results also show that data augmentation matters. A performance comparison with larger language models demonstrates the effectiveness of our approach.
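As a rough illustration of how chunk-wise scrambling could act as a data-augmentation signal for scrambled sentence recognition, the sketch below shuffles pre-segmented chunks and keeps the original order as a reconstruction target. The chunking scheme and target format are illustrative assumptions, not the thesis's actual procedure.

```python
import random

def make_reconstruction_example(chunks, rng=random):
    """Shuffle chunks; return (scrambled_chunks, original positions)."""
    order = list(range(len(chunks)))
    rng.shuffle(order)
    scrambled = [chunks[i] for i in order]
    # order[j] = original position of the j-th scrambled chunk,
    # which the model would be trained to recover
    return scrambled, order

# Example with illustrative Korean chunks ("I / to school / go"):
chunks = ["나는", "학교에", "간다"]
scrambled, target = make_reconstruction_example(chunks, random.Random(0))
print(scrambled, target)
```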