THESIS
2019
xii, 107 pages : illustrations ; 30 cm
Abstract
Word representations obtained from large textual corpora have gained popularity in
natural language processing, as they can help to improve performance on supervised
tasks for which only comparatively little labeled training data can be obtained (Turian,
Ratinov, and Bengio 2010). Recently, a series of scalable methods beginning with
Word2Vec (Tomas Mikolov, Chen, et al. 2013) has enabled learning from very large
unlabeled corpora, yielding better representations, and representations for more
words. However, the long-tail nature of human language, which implies that most
words are infrequent (Zipf 1949; Mandelbrot 1954), prevents these methods from
representing infrequent words well (Lowe 2001; Luong, Socher, and Christopher D.
Manning 2013).
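For concreteness, Zipf's law and Mandelbrot's generalization (stated here in a
standard textbook form; the exponent α and offset β are corpus-dependent constants,
not values taken from the thesis) relate a word's frequency f to its frequency rank r:

    f(r) ∝ 1 / r^α              (Zipf 1949, with α ≈ 1)
    f(r) ∝ 1 / (r + β)^α        (Mandelbrot 1954)

With α ≈ 1, the r-th most frequent word occurs roughly 1/r as often as the most
frequent one, so a handful of words dominate the corpus while most words remain rare.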
Considering that words are typically formed of meaningful parts, taking their
structure into account was proposed as a remedy (Harris 1954; Luong, Socher, and
Christopher D. Manning 2013). Recently, Bojanowski et al. (2017) proposed fastText,
a scalable model incorporating such information. fastText allocates separate
parameters for words and their parts, with part-specific parameters shared among
all words containing the respective part.
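As an illustration, the following minimal sketch (our own, not the thesis code;
function names and the embedding dimensionality are illustrative, while the 3-to-6
character n-gram range follows the fastText paper) shows how a word vector is
composed from the word's own parameters plus shared character n-gram parameters:

    import numpy as np

    DIM = 100              # embedding dimensionality (illustrative)
    N_MIN, N_MAX = 3, 6    # character n-gram lengths, as in fastText

    def char_ngrams(word):
        """Character n-grams of a word padded with boundary markers."""
        padded = "<" + word + ">"
        return [padded[i:i + n]
                for n in range(N_MIN, N_MAX + 1)
                for i in range(len(padded) - n + 1)]

    def word_vector(word, word_emb, ngram_emb):
        """Word representation: the word-specific vector plus the shared
        vectors of all its character n-grams. Because n-gram parameters
        are shared, a rare word still receives signal from frequent
        words that contain the same parts."""
        vec = word_emb.get(word, np.zeros(DIM)).copy()
        for g in char_ngrams(word):
            if g in ngram_emb:
                vec += ngram_emb[g]
        return vec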
However, parameters specific to rare words and rare word-parts are still estimated
from little data and can be unreliable and prone to overfitting, negatively impacting
the resulting word representations. This thesis therefore introduces a group lasso
regularization (Yuan and Y. Lin 2006) that enables the joint selection of words and
word-parts during training. The parameters of deselected groups are pushed to 0,
preventing a negative impact on the resulting representations. For optimization,
a scalable proximal asynchronous stochastic gradient descent (ProxASGD) optimizer
is introduced.
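The mechanism that pushes deselected parameters to 0 can be sketched as follows.
This is a minimal illustration under standard assumptions, not the thesis
implementation: one group per embedding row, a plain synchronous SGD step rather
than the asynchronous variant, and illustrative hyperparameter values. The proximal
operator of the group lasso penalty is block soft-thresholding, which zeroes out an
entire group once its norm falls below a threshold:

    import numpy as np

    def prox_group_lasso(v, threshold):
        """Block soft-thresholding, the proximal operator of the
        group-lasso penalty: shrink the group toward 0 and set it to
        exactly 0 (deselection) once its norm drops below the threshold."""
        norm = np.linalg.norm(v)
        if norm <= threshold:
            return np.zeros_like(v)
        return (1.0 - threshold / norm) * v

    def proximal_sgd_step(params, grads, lr=0.05, lam=1e-4):
        """One proximal SGD step: a gradient step on the loss followed
        by the proximal step on the regularizer, applied per group
        (here, per embedding row, i.e. per word or word-part)."""
        params = params - lr * grads
        for i in range(params.shape[0]):
            params[i] = prox_group_lasso(params[i], lr * lam)
        return params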
The proposed method is evaluated on a variety of tasks, and our results show that
the regularization enables better representations for rare words and for
morphologically complex languages such as German. Separate regularization
hyperparameters for words and word-parts make it possible to trade off the
inclusion of semantic and syntactic information.