Mining parallel documents using low bandwidth and high precision CLIR from the heterogeneous Web

HKUST Electronic Theses

by Shi Yue

THESIS 2011

M.Phil. Electronic and Computer Engineering

x, 57 p. : ill. (some col.) ; 30 cm

Abstract

In this thesis, we propose a content-based method of mining bilingual parallel documents from websites that are not necessarily related to each other.

Parallel corpora are a key resource as training data for statistical machine translation (SMT) and for building or extending bilingual lexicons and terminologies. There are two existing approaches to automatically mining parallel documents from the web: the structure-based approach and the content-based approach. Structure-based methods work only for parallel websites, and most content-based methods either require large-scale computational facilities and network bandwidth or are not applicable to the heterogeneous web.

We suggest that parallel documents can be mined with high precision from websites that are not necessarily parallel to each other. We propose a novel content-based method that uses cross-lingual information retrieval (CLIR) with query feedback and verification, supplemented with structural information, to mine parallel resources from the entire web using search engine APIs. The method goes beyond URL matching and hyperlink tracking to find parallel documents on non-parallel websites.

We introduce a Search Query Relevance Score (SQRS) to measure the translation quality and select keywords to generate queries for further mining of target documents. Our approach requires neither crawling all web documents in the target language nor machine translation of the full text. It therefore requires less bandwidth and fewer computing resources. We obtained a very high mining precision (88%) on the parallel documents with the pure content-based approach and improved the quantity by mining parallel websites using structure-based methods. After extracting parallel sentences from the mined documents and using them to train an SMT system, we found that the SMT performance, with a higher BLEU score, is comparable to that obtained with high-quality manually translated parallel sentences, illustrating the excellent quality of the mined parallel material.
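The keyword-to-query step described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not define SQRS, so `overlap_score` is only a simple stand-in for a relevance check, and the toy English-to-Italian `LEXICON`, the stopword list, the language pair, and all function names are assumptions made for illustration. No search engine API is called; a real pipeline would send the generated query to one.

```python
from collections import Counter

# Hypothetical toy bilingual lexicon (English -> Italian), for illustration only.
LEXICON = {
    "parallel": "parallelo",
    "corpus": "corpus",
    "translation": "traduzione",
    "mining": "estrazione",
}

# Small assumed stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "for", "and", "is", "to", "in"}


def select_keywords(text, k=3):
    """Return up to k most frequent tokens that are translatable and not stopwords."""
    tokens = [t for t in text.lower().split()
              if t not in STOPWORDS and t in LEXICON]
    return [word for word, _ in Counter(tokens).most_common(k)]


def build_query(keywords):
    """Translate the keywords with the lexicon and join them into a query string."""
    return " ".join(LEXICON[w] for w in keywords)


def overlap_score(query, candidate_text):
    """Fraction of query terms present in the candidate document.

    A stand-in relevance measure, NOT the SQRS defined in the thesis.
    """
    terms = query.split()
    if not terms:
        return 0.0
    candidate_tokens = set(candidate_text.lower().split())
    return sum(1 for t in terms if t in candidate_tokens) / len(terms)


if __name__ == "__main__":
    source = "mining parallel corpus for translation and mining of the parallel corpus mining"
    query = build_query(select_keywords(source))
    print(query)
    print(overlap_score(query, "estrazione di un corpus parallelo"))
```

Because only a short query is sent per source document, a design along these lines avoids both crawling the target-language web and translating full texts, which is the bandwidth argument the abstract makes.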

Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: M.Phil.
Department: Electronic and Computer Engineering
Authors: Shi, Yue
Subjects: Internet searching; Data mining; Information retrieval
Language: English
Call number: Thesis ECED 2011 Shi
DOI: 10.14711/thesis-b1155574
