Automatic bitext alignment for Southeast Asian languages | |
Author | Lwin Moe |
Call Number | AIT Thesis no.CS-09-07 |
Subject(s) | Thai language |
Note | A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, School of Engineering and Technology |
Publisher | Asian Institute of Technology |
Series Statement | Thesis ; no. CS-09-07 |
Abstract | Bitext alignment is the task of aligning words, phrases or sentences in one language with the equivalent translation in another. Aligned bitexts help lay the groundwork for statistical machine translation, are useful for language teaching, provide data for cross-language information retrieval, and have a variety of other applications. This thesis investigates the problem of bitext alignment for English and Southeast Asian languages. Although bitext alignment in general has been well studied, most algorithms, implementations, and even performance metrics depend on the assumption that both texts have been regularly divided into words and sentences. Bitext alignment of Southeast Asian languages has not benefited from previous work because texts in these languages are not normally divided this way. There is no completely reliable machine method for dividing such texts into words and sentences. We will use Thai as our example and test language because experimental data are readily available. However, our goal is to develop insights into the best methods of automatically aligning "low resource" Southeast Asian languages like Burmese, Khmer, and Lao. This thesis will explore dictionary-based alignment methods to improve on the basic length-based method. We will begin by introducing existing European and Asian bitext corpora, and then discuss current approaches to bitext alignment problems. First, we discuss the basic length-based approach that we use as our baseline method. We then look at the use of lexical features and semantic analysis, for example dictionary-based similarity and WordNet relatedness measures, to enhance the baseline method. Finally, we test different approaches to adapting a Southeast Asian language, Thai, to work with these methods. Before aligning with dictionary-based methods, we pre-segment the Thai input using various techniques and prepare the English and Thai input using stemming, stopword removal, or normalization of derived forms in English. 
This thesis will make the following contributions: 1. It will establish the baseline performance of the naive basic method. 2. It will introduce metrics for evaluating the performance of bitext alignment, taking both sentence boundary detection and alignment of individual Thai segments into account. 3. It will test and measure different approaches to Southeast Asian word segmentation in the input text preparation before determining similarity between Thai sentence segments and English sentences. 4. It will compare the effectiveness of English-to-English comparison (that is, translating the Thai segments to English first) versus Thai-to-Thai comparison (that is, translating the English sentences to Thai first). 5. It will test and measure the effects of using different types of dictionaries for translation and alignment. 6. It will test and measure the effects of stopword removal, stemming, and simplification of derived forms on dictionary-based realignment. 7. It will test WordNet relatedness analysis to realign the output of the length-based method. 8. It will provide data that will be useful for ongoing research into such problems as detection and correction of misordered or missing alignment pairs. 9. It will make Southeast Asian language-specific recommendations on performance measurement, segmentation algorithm, segmentation dictionary, translation type, translation dictionary and different approaches to improve the segment and sentence similarity test. |
Year | 2009 |
Corresponding Series Added Entry | Asian Institute of Technology. Thesis ; no. CS-09-07 |
Type | Thesis |
School | School of Engineering and Technology (SET) |
Department | Department of Information and Communications Technologies (DICT) |
Academic Program/FoS | Computer Science (CS) |
Chairperson(s) | Janecek, Paul |
Examination Committee(s) | Dailey, Matthew; Cooper, Doug |
Scholarship Donor(s) | Asian Institute of Technology Fellowship |
Degree | Thesis (M.Sc.) - Asian Institute of Technology, 2009 |
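
The abstract describes comparing pre-segmented Thai segments with English sentences through a translation dictionary ("English-to-English comparison") after stopword removal. The sketch below is only an illustration of that general idea, not the thesis's actual method: the function names, the toy dictionary, the stopword list, and the choice of Jaccard overlap as the similarity score are all assumptions made for this example.

```python
def translate_tokens(tokens, dictionary):
    """Collect every candidate English gloss for a list of source tokens."""
    glosses = set()
    for tok in tokens:
        glosses.update(dictionary.get(tok, ()))
    return glosses


def dictionary_similarity(thai_tokens, english_sentence, dictionary,
                          stopwords=frozenset()):
    """Jaccard overlap between the English glosses of a segmented Thai
    segment and the words of an English sentence (stopwords removed)."""
    english = {w for w in english_sentence.lower().split()
               if w not in stopwords}
    glosses = translate_tokens(thai_tokens, dictionary) - stopwords
    if not english or not glosses:
        return 0.0
    return len(english & glosses) / len(english | glosses)


# Hypothetical toy data: a three-entry Thai-English dictionary and a
# Thai segment that is assumed to have been word-segmented already.
toy_dictionary = {
    "แมว": ["cat"],
    "กิน": ["eat", "eats"],
    "ปลา": ["fish"],
}
toy_stopwords = frozenset({"the", "a"})
thai_segment = ["แมว", "กิน", "ปลา"]

score_match = dictionary_similarity(
    thai_segment, "The cat eats fish", toy_dictionary, toy_stopwords)
score_other = dictionary_similarity(
    thai_segment, "The dog sleeps", toy_dictionary, toy_stopwords)
```

In a realignment pass of the kind the abstract mentions, a score like this could be used to prefer the English sentence that shares the most dictionary glosses with a Thai segment; stemming or normalization of derived forms would be applied to both sides before the comparison.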