Mining parallel documents using low bandwidth and high precision CLIR from the heterogeneous Web
by Shi Yue
M.Phil. Electronic and Computer Engineering
x, 57 p. : ill. (some col.) ; 30 cm
In this thesis, we propose a content-based method of mining bilingual parallel documents from websites that are not necessarily related to each other.
Parallel corpora are a key resource as training data for statistical machine translation (SMT) and for building or extending bilingual lexicons and terminologies. There are two existing approaches for automatically mining parallel documents from the web: the structure-based approach and the content-based approach. Structure-based methods work only for parallel websites, and most content-based methods require large-scale computational facilities or network bandwidth, or are not applicable to the heterogeneous web.
We suggest that parallel documents can be mined with high precision from websites that are not necessarily parallel to each other. We propose a novel content-based method that uses cross-lingual information retrieval (CLIR) with query feedback and verification, supplemented with structural information, to mine parallel resources from the entire web using search engine APIs. The method goes beyond URL matching and hyperlink tracking to find parallel documents on non-parallel websites.
We introduce a Search Query Relevance Score (SQRS) to measure translation quality and to select keywords for generating queries that drive further mining of target documents. Our approach requires neither crawling all web documents in the target language nor machine translation of the full text. It therefore requires less bandwidth and fewer computing resources. We obtained a very high mining precision (88%) on parallel documents with the purely content-based approach, and improved the quantity by mining parallel websites using structure-based methods. After extracting parallel sentences from the mined documents and using them to train an SMT system, we found that the SMT performance, with a higher BLEU score, is comparable to that obtained with high-quality manually translated parallel sentences, illustrating the excellent quality of the mined parallel material.