THESIS
2019
xii, 107 pages : illustrations ; 30 cm
Abstract
Word representations obtained from large textual corpora have gained popularity in
natural language processing, as they can help to improve performance on supervised
tasks for which only comparatively little labeled training data can be obtained (Turian,
Ratinov, and Bengio 2010). Recently, a series of scalable methods beginning with
Word2Vec (Tomas Mikolov, Chen, et al. 2013) has enabled learning from very large
unlabeled corpora, yielding better representations, and representations for more
words. However, the long-tail nature of human language, which implies that most
words are infrequent (Zipf 1949; Mandelbrot 1954), prevents these methods from
representing infrequent words well (Lowe 2001; Luong, Socher, and Christopher D.
Manning 2013).
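For concreteness, Zipf's law and Mandelbrot's generalization (stated here in a
standard textbook form; the exponent α and offset β are corpus-dependent constants,
not values taken from the thesis) relate a word's frequency f to its frequency rank r:

    f(r) ∝ 1 / r^α              (Zipf 1949, with α ≈ 1)
    f(r) ∝ 1 / (r + β)^α        (Mandelbrot 1954)

With α ≈ 1, the r-th most frequent word occurs roughly 1/r as often as the most
frequent one, so a handful of words dominate the corpus while most words remain rare.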
Considering that words are typically formed of meaningful parts, taking their
structure into account was proposed as a remedy (Harris 1954; Luong, Socher, and
Christopher D. Manning 2013). Recently, Bojanowski et al. (2017) proposed fastText,
a scalable model incorporating such information. fastText allocates separate
parameters for words and their parts, with part-specific parameters shared among
all words containing the respective part.
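As an illustration, the following minimal sketch (our own, not the thesis code;
function names and the embedding dimensionality are illustrative, while the 3-to-6
character n-gram range follows the fastText paper) shows how a word vector is
composed from the word's own parameters plus shared character n-gram parameters:

    import numpy as np

    DIM = 100              # embedding dimensionality (illustrative)
    N_MIN, N_MAX = 3, 6    # character n-gram lengths, as in fastText

    def char_ngrams(word):
        """Character n-grams of a word padded with boundary markers."""
        padded = "<" + word + ">"
        return [padded[i:i + n]
                for n in range(N_MIN, N_MAX + 1)
                for i in range(len(padded) - n + 1)]

    def word_vector(word, word_emb, ngram_emb):
        """Word representation: the word-specific vector plus the shared
        vectors of all its character n-grams. Because n-gram parameters
        are shared, a rare word still receives signal from frequent
        words that contain the same parts."""
        vec = word_emb.get(word, np.zeros(DIM)).copy()
        for g in char_ngrams(word):
            if g in ngram_emb:
                vec += ngram_emb[g]
        return vec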
However, parameters specific to rare words and rare word-parts are still estimated
from little data and can be unreliable and prone to overfitting, negatively impacting
the resulting word representations. This thesis therefore introduces a group lasso
regularization (Yuan and Y. Lin 2006) that enables the joint selection of words and
word-parts during training. The parameters of deselected groups are pushed to 0,
preventing a negative impact on the resulting representations. For optimization,
a scalable proximal asynchronous stochastic gradient descent (ProxASGD) optimizer
is introduced.
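The mechanism that pushes deselected parameters to 0 can be sketched as follows.
This is a minimal illustration under standard assumptions, not the thesis
implementation: one group per embedding row, a plain synchronous SGD step rather
than the asynchronous variant, and illustrative hyperparameter values. The proximal
operator of the group lasso penalty is block soft-thresholding, which zeroes out an
entire group once its norm falls below a threshold:

    import numpy as np

    def prox_group_lasso(v, threshold):
        """Block soft-thresholding, the proximal operator of the
        group-lasso penalty: shrink the group toward 0 and set it to
        exactly 0 (deselection) once its norm drops below the threshold."""
        norm = np.linalg.norm(v)
        if norm <= threshold:
            return np.zeros_like(v)
        return (1.0 - threshold / norm) * v

    def proximal_sgd_step(params, grads, lr=0.05, lam=1e-4):
        """One proximal SGD step: a gradient step on the loss followed
        by the proximal step on the regularizer, applied per group
        (here, per embedding row, i.e. per word or word-part)."""
        params = params - lr * grads
        for i in range(params.shape[0]):
            params[i] = prox_group_lasso(params[i], lr * lam)
        return params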
The proposed method is evaluated on a variety of tasks, and our results show that
the regularization enables better representations for rare words and for
morphologically complex languages such as German. Separate regularization
hyperparameters for words and word-parts make it possible to trade off the
inclusion of semantic and syntactic information.