THESIS
2018
xix, 142 pages : illustrations ; 30 cm
Abstract
We have consistently improved translation quality for challenging low-resource languages by injecting semantics-based objective functions into the training pipeline at an early (training) stage, rather than at the late (tuning) stage used in previous attempts. The approaches proposed in this thesis are motivated by the fact that incorporating semantics into the late-stage tuning of machine translation models has already been shown to increase translation quality.
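The abstract does not spell out the formulation, but the core idea, adding a semantics-based term to the training loss itself rather than reserving it for late-stage tuning, can be sketched roughly as follows. The names here (combined_loss, semantic_score, lam) and the weighting scheme are illustrative assumptions, not the thesis's actual notation.

```python
# Minimal sketch (assumed names, not the thesis's actual code): blend the
# usual cross-entropy translation loss with a semantics-based penalty so
# the semantic signal shapes training itself, not just late-stage tuning.

def combined_loss(ce_loss, hyps, refs, semantic_score, lam=0.5):
    """ce_loss: standard translation loss for the batch (a float here).
    semantic_score: callable (hyp, ref) -> [0, 1]; higher means the two
    sentences' shallow semantic parses agree more closely.
    lam: weight on the semantic term; the value 0.5 is illustrative."""
    sem_penalty = sum(1.0 - semantic_score(h, r) for h, r in zip(hyps, refs))
    return ce_loss + lam * sem_penalty / max(len(refs), 1)
```

Folding the semantic score into the per-batch loss in this way is what would distinguish early-stage injection from the late-stage tuning of earlier work, where the same kind of score only adjusts a handful of weights after training is done.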
A shortage of parallel data is a serious obstacle for conventional machine translation training techniques because of their heavy dependence on memorization from large amounts of data. For low-resource languages, whose parallel corpora are scarce, it becomes imperative to make learning from small data far more efficient by adding constraints that create stronger inductive biases, especially linguistically well-motivated constraints such as the shallow semantic parses of the training sentences. However, while automatic semantic parsing is readily available to produce shallow semantic parses for a high-resource output language (typically English), no semantic parsers exist for low-resource languages such as Oromo, Uyghur and Uzbek.
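For readers unfamiliar with the term, a shallow semantic parse is typically a set of predicate-argument frames (semantic role labels). A minimal illustrative representation, using PropBank-style role labels as an assumption, might look like:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """One predicate with its labeled arguments (illustrative schema)."""
    predicate: str
    args: dict = field(default_factory=dict)  # role label -> argument span

# A shallow semantic parse of "The committee approved the budget yesterday":
parse = [Frame(predicate="approved",
               args={"ARG0": "The committee",    # who
                     "ARG1": "the budget",       # what
                     "ARGM-TMP": "yesterday"})]  # when
```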
We propose the first methods that inject a crosslingual, semantics-based objective function into the training of translation models for tasks such as Chinese–English, where semantic parsers exist for both languages. We report promising results showing that training the machine translation model in this way biases learning towards semantically more correct bilingual constituents. Semantic statistical machine translation for low-resource languages has remained a difficult challenge, since semantic parses are usually available only for high-resource output languages such as English, not for low-resource input languages. We extend our bilingual approaches to the low-resource setup via new training approaches that require only the output-language semantic parse.
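The abstract leaves the scoring function unspecified. One crude stand-in that needs only English-side parses, which is what makes the low-resource extension possible, is an F1 over (predicate, role, argument) triples. Everything below, the triple representation and the F1 scoring rule, is an illustrative assumption, not the thesis's actual semantic objective.

```python
def semantic_f1(hyp_frames, ref_frames):
    """F1 over (predicate, role, argument) triples from shallow semantic
    parses of the *output* (English) side only, so no input-language
    parser is needed. Each frame is a (predicate, {role: span}) pair."""
    def triples(frames):
        return {(pred, role, span)
                for pred, args in frames for role, span in args.items()}
    hyp, ref = triples(hyp_frames), triples(ref_frames)
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(hyp), overlap / len(ref)
    return 2 * p * r / (p + r)

ref = [("approved", {"ARG0": "the committee", "ARG1": "the budget"})]
hyp = [("approved", {"ARG0": "the committee", "ARG1": "a budget"})]
print(semantic_f1(hyp, ref))  # 0.5: the agent matches, the patient does not
```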
We then thoroughly analyze the reasons behind the promising results we achieved on multiple challenging low-resource translation tasks, covering Hausa, Uzbek, Uyghur, Swahili, Oromo and Amharic, always translating into English. Our methods rely heavily on the quality of the semantic parser. We have noted that commonly used semantic parsers completely fail to parse any sentence containing any form of the verb "to be"; ignoring such sentences means throwing away a substantial portion of valuable data. Finally, we propose a novel approach that attempts to semantically parse sentences containing the verb "to be" and re-run all previous models on the newly parsed data; we also compare against a newer semantic parser that handles the verb "to be". All our results show that building efficient MT systems for low-resource languages could be more feasible than generally assumed.
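As a concrete illustration of the "to be" problem noted above, a simple filter that flags the affected sentences, so they can be routed to the proposed special handling rather than discarded, could be as small as the following. The regex, including its rough treatment of contractions, is an assumption for illustration only.

```python
import re

# Finite forms of "to be", plus contractions ('m, 're, 's) as a rough
# heuristic; note that "'s" also matches possessives, so it over-triggers.
BE_FORMS = re.compile(r"\b(am|is|are|was|were|be|been|being)\b|'(m|re|s)\b",
                      re.IGNORECASE)

def contains_be(sentence):
    """True if the sentence contains any form of the verb "to be",
    which commonly used semantic parsers fail on (per the abstract)."""
    return BE_FORMS.search(sentence) is not None

print(contains_be("The budget was approved."))   # True
print(contains_be("They approved the budget."))  # False
```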