IMPROVING CHINESE-ENGLISH MACHINE TRANSLATION THROUGH BETTER SOURCE-SIDE LINGUISTIC PROCESSING

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE

AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Pi-Chuan Chang August 2009

© Copyright by Pi-Chuan Chang 2009. All Rights Reserved.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Christopher D. Manning) Principal Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Daniel Jurafsky)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

(Andrew Y. Ng)

Approved for the University Committee on Graduate Studies.

Abstract

Machine Translation (MT) is a task with multiple components, each of which can be very challenging. This thesis focuses on a difficult language pair, Chinese to English, and works on several language-specific aspects that make translation more difficult.

The first challenge this thesis addresses is the difference between the writing systems. In Chinese there are no explicit boundaries between words, and even the definition of a "word" is unclear. We build a general-purpose Chinese word segmenter with linguistically inspired features that performs very well on the SIGHAN 2005 bakeoff data. We then study how Chinese word segmenter performance relates to MT performance, and provide a way to tune the "word" unit in Chinese so that it better matches English word granularity and therefore improves MT performance.

The second challenge we address is the difference in word order between Chinese and English. We first perform error analysis on three state-of-the-art MT systems to identify the most prominent problems, especially how different word orders cause translation errors. Based on our findings, we propose two solutions to improve Chinese-to-English MT systems.

First, word reordering, especially over longer distances, caused many errors. Even though Chinese and English are both Subject-Verb-Object (SVO) languages, they usually use different word orders in noun phrases, prepositional phrases, etc. Many of these differences involve long-distance reorderings and cause difficulty for MT systems. There have been many previous studies on this problem. In this thesis, we introduce a richer set of Chinese grammatical relations that describes more semantically abstract relations between words. We integrate these Chinese grammatical relations into the most widely used state-of-the-art phrase-based MT system and improve its performance.

Second, we study the behavior of the most common Chinese word "的" (DE), which does not have a direct mapping to English. DE serves different functions in Chinese, and therefore can be ambiguous when translating to English. It might also cause longer-distance reordering when translating to English. We propose a classifier to disambiguate DEs in Chinese text. Using this classifier, we improve the English translation quality because we can make the Chinese word order much more similar to English, and we can also determine which English construction (e.g., relative clause, prepositional phrase, etc.) a given DE should be translated to.

Acknowledgments

First, I would like to thank my advisor, Chris Manning, for being a great advisor in every way. On research, Chris has always provided very constructive comments during our weekly meetings. Also, Chris always offers good advice on writing and presenting my research work. He helps me organize the content of my papers and slides, and even fixes grammatical errors. If there are still any errors in this thesis, the original draft probably had 100 times more! I have enjoyed meeting with Chris, for he is very knowledgeable in various research topics, whether they are computer science or linguistics related.

I would like to thank Dan Jurafsky for his insightful ideas and suggestions during our collaboration on the DE classification and the grammatical relations work. In particular I want to thank him for his useful feedback on my research at MT meetings. I would also like to thank Andrew Ng, Peng Xu, and Yinyu Ye for being on my thesis committee and for their refreshing interest in my topic.

My dissertation topic is related to machine translation. Here I want to give thanks to William Morgan, who sparked the initial interest in MT within the Stanford NLP group. I want to thank Kristina Toutanova for working with me on an MT project at Microsoft Research, which made me decide to work on MT for my dissertation. I would also like to thank Michel Galley and Dan Cer in the MT group (who were also my officemates) for useful discussions and collaborations on my research projects and the MT framework here at Stanford. I found that a good code base and great people to work with are especially important when working on MT research. There is only so much that I can do by myself. I would not have been able to finish my dissertation without the help of the whole MT group.

Finally, I want to give thanks to the whole Stanford NLP group. It was always fun to talk to people at the weekly NLP lunch about what they are working on and what is going on in life. During my years at Stanford, every member of the Stanford NLP group has been friendly and willing to help each other whenever anyone has questions. I greatly appreciate it.

Outside research and outside the department, I want to thank Hsing-Chen Tsai for being a good friend and supporting me through many difficult times during these years. And I want to thank Andrew Carroll for his support and love, and for sharing my happiness and sadness all the time. I would also like to thank my family in Taiwan (my dad, my mom, and my sister) for supporting my decision to study abroad, and always praying for me.

Contents

Abstract
Acknowledgments

1 Introduction
    1.1 Key issues in Chinese to English translation
    1.2 Contributions
    1.3 Background: Phrase-based MT Systems
        1.3.1 Phrase extraction
        1.3.2 Basic feature functions in MERT

2 Chinese Word Segmentation and MT
    2.1 Chinese Word Segmentation
        2.1.1 Lexicon-based Segmenter
    2.2 Feature-based Chinese Word Segmenter
        2.2.1 Conditional Random Field
        2.2.2 Feature Engineering
        2.2.3 Experimental Results
        2.2.4 Error Analysis
    2.3 Word Segmentation for Machine Translation
        2.3.1 Experimental Setting
        2.3.2 Understanding Chinese Word Segmentation for Phrase-based MT
        2.3.3 Consistency Analysis of Different Segmenters
        2.3.4 Optimal Average Token Length for MT
