THESIS
2009
xii, 84 p. : ill. ; 30 cm
Abstract
In this thesis, we show that reordering Statistical Machine Translation (SMT) output to match its semantic roles with those of the input improves the translation quality.
Current SMT systems attempt to tackle adequacy primarily by memorizing, in a bi-lexicon, all word (or phrase) translation pairs that co-occur frequently in a training corpus, using various statistics in the hope of improving the accuracy of lexical translation choices. The word order in the translation output is modeled as a statistical dependency problem, relying heavily on monolingual n-gram language models of the output language in an attempt to compensate for weak bilingual models of word (or phrase) alignment and permutation. Since no semantic features are considered during training or translation, it is not surprising that serious semantic role confusion errors appear in the SMT output. One approach to this problem is to integrate semantic information into SMT. Semantic Role Labeling (SRL) is one of the most cost-effective ways to annotate semantic information in a sentence. We therefore propose using SRL to reorder SMT output in order to improve translation quality.
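The monolingual n-gram language modeling mentioned above can be illustrated with a minimal sketch. The toy corpus, maximum-likelihood estimation, and lack of smoothing here are illustrative assumptions only, not the models actually used in the thesis:

```python
from collections import Counter

# Toy training corpus; real SMT systems train on millions of sentences.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def bigram_prob(a, b):
    """Maximum-likelihood estimate of P(b | a); 0.0 for unseen contexts."""
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0

def sentence_score(sent):
    """Score a candidate output sentence as a product of bigram probabilities."""
    p = 1.0
    for a, b in zip(sent, sent[1:]):
        p *= bigram_prob(a, b)
    return p

# "the cat sat" scores 0.5: P(cat|the) = 1/2, every other bigram = 1.
print(sentence_score(["<s>", "the", "cat", "sat", "</s>"]))  # → 0.5
```

Such a model prefers fluent word orders it has seen before, but, as the abstract notes, it carries no bilingual or semantic information, which is why it cannot prevent semantic role confusion in the output.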
Firstly, we study in detail a state-of-the-art Chinese shallow semantic parser, C-ASSERT, which consists of a Chinese word segmenter and a Chinese shallow syntactic parser. A set of controlled experiments is carried out using different Chinese word segmenters and Chinese shallow syntactic parsers. We find that the best performance is obtained when the Chinese word segmenter and the Chinese shallow syntactic parser are both the maximum entropy models built by our research center at HKUST.
Then, to provide solid groundwork for our claim that using SRL to reorder SMT output improves translation quality, a series of SMT systems is set up with language models of different n-gram sizes, phrase tables of different maximum phrase lengths, and decoders with different maximum numbers of translation hypotheses per input phrase and different distortion limits. The best system obtained is further optimized with Minimum Error Rate Training (MERT) on the development data.
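The system-selection step described above amounts to a grid search over decoder configurations, scored on development data. The parameter ranges and the scoring stub below are hypothetical placeholders, not the thesis's actual values; in practice each configuration would be trained, used to decode the development set, and scored with an automatic metric such as BLEU:

```python
from itertools import product

# Hypothetical configuration grid; the thesis's actual ranges differ.
ngram_sizes = [3, 4, 5]
max_phrase_lengths = [5, 7]
distortion_limits = [4, 6]

def dev_score(ngram, phrase_len, distortion):
    """Placeholder for decoding the dev set and scoring it (e.g. BLEU).
    This stub just peaks at an arbitrary configuration for illustration."""
    return -abs(ngram - 4) - 0.1 * abs(phrase_len - 7) - 0.1 * abs(distortion - 6)

# Pick the configuration that maximizes the development-set score.
best = max(product(ngram_sizes, max_phrase_lengths, distortion_limits),
           key=lambda cfg: dev_score(*cfg))
print(best)  # → (4, 7, 6)
```

MERT then goes one step further: rather than choosing among discrete configurations, it tunes the continuous weights of the decoder's feature functions to directly minimize translation error on the same development data.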
Three objective scoring functions (F-ident, F-class, F-combined) are then devised to quantify three different definitions of SRL matching between the Chinese source input and the SMT output. In addition, an algorithm is built to reorder the SMT output using SRL. Finally, three reordering experiments are conducted using this algorithm with the three corresponding objective scoring functions. The results show that the algorithm successfully returns a better translation with fewer semantic role confusion errors. The results also indicate that using F-combined with the algorithm gives the best improvement in translation quality.
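One plausible form for such a matching score is an F-measure over role-labeled spans shared between the source-side and output-side SRL annotations. The function and the role/span representation below are assumptions for illustration; the exact definitions of F-ident, F-class, and F-combined are given in the thesis itself:

```python
def srl_f_score(src_roles, out_roles):
    """F-measure between two sets of (label, span) role annotations.
    A hypothetical stand-in for the thesis's SRL matching scores."""
    if not src_roles or not out_roles:
        return 0.0
    matched = len(src_roles & out_roles)
    precision = matched / len(out_roles)
    recall = matched / len(src_roles)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Roles as (PropBank-style label, token span) pairs; spans are illustrative.
src = {("ARG0", (0, 1)), ("V", (2, 2)), ("ARG1", (3, 5))}
hyp = {("ARG0", (0, 1)), ("V", (2, 2)), ("ARG1", (4, 5))}
print(round(srl_f_score(src, hyp), 3))  # → 0.667
```

Under a scoring function of this shape, a reordering algorithm can enumerate candidate permutations of the SMT output and keep the one whose role annotations best match those of the source, which is the effect the abstract reports: fewer semantic role confusion errors in the final translation.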