THESIS
1999
xi, 82 leaves : ill. ; 30 cm
Abstract
We describe a method for automatically learning a parser from labeled, bracketed corpora that results in a fast, robust, lightweight parser suitable for real-time natural language systems and similar applications. Unlike ordinary parsers, all grammatical knowledge is captured in the learned decision trees, so no explicit phrase-structure grammar is needed. Another characteristic of the architecture is robustness, since the input need not fit pre-specified productions. The runtime architecture is very slim and references two learned decision trees that allow the parser to operate in a "strictly deterministic" manner in Marcus' (1977) sense. The basis is a shift-reduce parser (Aho, Sethi, & Ullman 1986) consisting of a stack, an input stream, and a decision control mechanism. The core part of our work is to learn the decision control mechanism, for which we employ a novel Shift/Reduce decision algorithm and a novel Constituent Labeling decision algorithm. The features used for both the Shift/Reduce and Constituent Labeling decision tasks are restricted to the constituent labels in the stack and the part-of-speech tags of the words in the input. Even without using specific lexical features, we have achieved respectable labeled bracket accuracies of about 81% precision and 82% recall on the Penn Treebank corpus. Processing speed on a Sparc Ultra I machine is more than 500 words per CPU second. The high processing speed makes our parser suitable for applications such as online language understanding and machine translation. Without any optimization, the decision trees consume only 6 MB of memory, making it possible to run on platforms with limited memory. Since the only resource needed to train our parser is a labeled and bracketed corpus, we believe the learning method is readily applicable to other languages. Preliminary experiments on a Chinese corpus (which contains about 3000 sentences from Chinese primary school text) have yielded results comparable to those for English.
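To make the runtime architecture concrete, the sketch below shows one way the deterministic shift-reduce loop described above could be organized, with the two learned decision trees represented as opaque callables. This is an illustrative sketch, not the thesis implementation: the names `shift_reduce_tree` and `label_tree`, and the exact feature encoding (top-of-stack constituent labels plus look-ahead part-of-speech tags), are assumptions consistent with the abstract rather than details drawn from it.

```python
# Minimal sketch of a deterministic shift-reduce parser driven by two
# learned classifiers, per the architecture described in the abstract.
# `shift_reduce_tree` and `label_tree` are hypothetical stand-ins for the
# learned decision trees; the feature set (stack labels + look-ahead POS
# tags, no lexical features) follows the abstract, but the encoding and
# signatures here are assumptions for illustration only.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    label: str                          # constituent label or POS tag
    children: List["Node"] = field(default_factory=list)


def features(stack: List[Node], buffer: List[Tuple[str, str]], k: int = 2):
    """Encode the parser state as the labels of the top-k stack items
    and the POS tags of the next-k input words (no word forms)."""
    return (tuple(n.label for n in stack[-k:]),
            tuple(tag for _, tag in buffer[:k]))


def parse(tagged_words: List[Tuple[str, str]],
          shift_reduce_tree: Callable,
          label_tree: Callable) -> Node:
    """Strictly deterministic loop: the Shift/Reduce tree picks the next
    action; on a reduce, the Constituent Labeling tree names the new
    constituent spanning the reduced children."""
    stack: List[Node] = []
    buffer = list(tagged_words)         # list of (word, POS-tag) pairs

    while buffer or len(stack) > 1:
        action = shift_reduce_tree(features(stack, buffer))

        if action == "SHIFT" and buffer:
            word, tag = buffer.pop(0)
            stack.append(Node(tag, [Node(word)]))
        else:                           # ("REDUCE", n): reduce top n items
            _, n = action
            children, stack[-n:] = stack[-n:], []
            label = label_tree(features(stack, buffer),
                               [c.label for c in children])
            stack.append(Node(label, children))

    return stack[0] if stack else None
```

Under this organization, parsing cost is linear in sentence length with a small constant per decision, which is consistent with the reported throughput of several hundred words per CPU second.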