CHAPTER 4 PREPROCESSING FOR ENGLISH SENTENCE


Current phrase-based Statistical Machine Translation (SMT) systems do not use linguistic information; they operate only on surface word forms. Adding linguistic information has been shown to improve the translation process, and it can be introduced through preprocessing steps. Moreover, a machine translation system for a language pair with disparate morphological structures needs careful preprocessing or modeling before translation. This chapter explains how preprocessing is applied to the raw source-language sentence to make it more suitable for translation.

4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE

The grammar of a language is divided into syntax and morphology. Syntax describes how words are combined to form a sentence, and morphology deals with the formation of words. Morphology is also defined as the study of how meaningful units can be combined to form words. One reason to process morphology and syntax together in language processing is that a single word in one language may be equivalent to a combination of words in another. The term "morpho-syntax" is a hybrid of morphology and syntax. It plays a major role in processing different types of languages and is closely related to machine translation, because the fundamental units of machine translation are words and phrases. Retrieving syntactic information is a primary step in preprocessing English sentences. The tool used to retrieve the syntactic structure of a given sentence is called a parser, and the tool used to retrieve the morphological features of a word is called a morphological analyzer. Syntactic information includes dependency relations, syntactic structure, and POS tags; morphological information consists of the lemma and morphological features.

Klein and Manning (2003) [150] from Stanford University proposed a statistical technique for retrieving the syntactic structure of English sentences. The Stanford Parser tool was developed based on this technique. This parser provides dependency relations as well as phrase structure trees for a given sentence. The Stanford parser package is a Java implementation of probabilistic natural language parsers, such as highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The parser has also been developed for other languages such as Chinese, Italian, Bulgarian, and Portuguese. It uses knowledge gained from hand-parsed sentences to produce the most likely analysis of new sentences. In this preprocessing, the Stanford parser is used to retrieve the morpho-syntactic information of English sentences.

4.1.1 POS and Lemma Information

Part-of-Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate part of speech, such as noun, verb, or adjective. The process takes an untagged sentence as input, assigns a POS tag to each word, and produces a tagged sentence as output. The most widely used POS tagset for English is the Penn Treebank tagset, which is given in Appendix-A. In this thesis, English sentences are tagged using this tagset. The POS tags and lemmas of some word forms are shown in Table 4.1. The example below shows POS tagging for an English sentence.

English Sentence

The boy is going to the school.

Part-of-Speech Tagging

The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./.
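
The tagged output above can be read back into (word, tag) pairs with a few lines of Python. This is only a minimal sketch: the helper name `parse_tagged` is hypothetical, and it assumes the `word/TAG` format shown in the example, not any particular tagger's API.

```python
from typing import List, Tuple


def parse_tagged(tagged: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        # rpartition splits on the LAST '/', so the tag is always the final field
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = "The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./."
pairs = parse_tagged(tagged)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
```

Here `nouns` collects the words tagged as nouns in the example sentence; the same filter with `"VB"` would select the verb forms.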

Table 4.1 POS and Lemma of Words

Word        POS    Lemma
Playing     NN     Playing
Playing     VBG    Play
Walked      VBD    Walk
Pens        NNS    Pen
Training    VBG    Train
Training    NN     Training
Trains      NNS    Train
Trains      VBZ    Train


A morphological analyzer or lemmatizer is used to find the lemma of a word. Lemmas have special importance in highly inflected languages. A lemma is the dictionary form, or root form, of a word. For example, the word "play" is listed in a dictionary, but its other word forms, such as playing, played, and plays, are not. The word "play" is therefore called the lemma, or dictionary word, for these word forms.
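
Table 4.1 shows that the lemma depends on the POS tag as well as the word form ("Playing" maps to "Playing" as a noun but to "Play" as a verb). The following is a minimal sketch of that idea as a lookup keyed on (word, POS). The names `LEMMA_TABLE` and `lemma_of` are hypothetical, the entries are copied from Table 4.1 in lowercase, and the fallback of returning the lowercased word is an assumption; a real morphological analyzer uses rules and a lexicon rather than a fixed table.

```python
# Entries copied from Table 4.1, lowercased: (word, POS tag) -> lemma
LEMMA_TABLE = {
    ("playing", "NN"): "playing",
    ("playing", "VBG"): "play",
    ("walked", "VBD"): "walk",
    ("pens", "NNS"): "pen",
    ("training", "VBG"): "train",
    ("training", "NN"): "training",
    ("trains", "NNS"): "train",
    ("trains", "VBZ"): "train",
}


def lemma_of(word: str, pos: str) -> str:
    """Look up the lemma for a (word, POS) pair.

    Assumption: unknown pairs fall back to the lowercased word itself.
    """
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


verb_lemma = lemma_of("Playing", "VBG")   # verb reading
noun_lemma = lemma_of("Playing", "NN")    # noun reading
```

Keying on the pair rather than on the word alone is what lets the two readings of the same surface form receive different lemmas.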

4.1.2 Syntactic Information

Syntactic information of a language is used in NLP tasks such as Machine Translation, Question Answering, Information Extraction, and Language Generation. Syntactic information can be extracted by parsing. Parsing extracts information such as part-of-speech tags, phrases, and relationships between the words in a sentence. In addition, noun phrases, verb phrases, and prepositional phrases are identified from the parse tree of a sentence. Figure 4.1 shows an example of an English syntactic tree. The parser output is a tree structure with a sentence label as the root. The example below shows the syntactic information of an English sentence.

Figure 4.1 Example of English Syntactic Tree


English Sentence

The boy is going to the school.

Parts of speech for each word

(NN = Noun, VBZ = Verb, DT = Determiner, VBG = Verbal Gerund)

S
  NP
    DT the
    NN boy
  VP
    VBZ is
    VP
      VBG going
      PP
        TO to
        NP
          DT the
          NN school

Parsing information

(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) (PP (TO to) (NP (DT the) (NN school))))) (. .)))

Phrases

Noun Phrases (NP): "the boy", "the school"
Verb Phrases (VP): "is", "going"
Sentences (S): "the boy is going to the school"
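
The bracketed parse shown above can be turned into a nested structure and searched for phrases by label. The sketch below is illustrative only: `parse_tree`, `words`, and `phrases` are hypothetical helper names, and the code assumes a well-formed bracketed string like the one in the example rather than the Stanford parser's own Java API.

```python
import re


def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples.

    Leaves are plain word strings. Assumes balanced brackets.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def words(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    covered = []
    for child in node[1]:
        covered.extend(words(child))
    return covered


def phrases(node, label):
    """Collect the word span of every constituent with the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(words(node))] if node[0] == label else []
    for child in node[1]:
        found.extend(phrases(child, label))
    return found


PARSE = ("(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) "
         "(PP (TO to) (NP (DT the) (NN school))))) (. .)))")
tree = parse_tree(PARSE)
noun_phrases = phrases(tree, "NP")
```

Note that this sketch returns the full span of each constituent, so calling `phrases(tree, "VP")` yields the complete verb phrases (including their prepositional phrase) rather than only the verb words.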

4.1.3 Dependency Information

Dependency information represents relations between individual words. A typed dependency parser additionally labels dependencies with grammatical relations, such as subject, direct object, and indirect object. Typed dependencies are used in several NLP applications, which benefit particularly from having access to dependencies between words typed with grammatical relations, since these relations also provide information about predicate-argument structure that is not readily available from phrase structure parse trees. The Stanford typed dependency representation was designed to provide a simple description of the grammatical relationships in a sentence. It can be easily understood and used even by people without linguistic knowledge, and it is also used to extract textual relations. An example of the typed dependency relations for an English sentence is given below.


English Sentence

The boy is going to the school.

Subject Verb Object

Subject: The boy
Verb: is going
Object: to the school

Typed dependencies

det(boy-2, The-1)
nsubj(going-4, boy-2)
aux(going-4, is-3)
root(ROOT-0, going-4)
prep(going-4, to-5)
det(school-7, the-6)
pobj(to-5, school-7)
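
Each typed dependency above has the form relation(head-index, dependent-index), which makes it straightforward to load into a small record type. The following is a minimal sketch under that assumption; the names `Dependency`, `DEP_PATTERN`, and `parse_dependencies` are hypothetical and not part of the Stanford tools.

```python
import re
from collections import namedtuple

Dependency = namedtuple(
    "Dependency", "relation head head_index dependent dependent_index"
)

# relation(head-index, dependent-index), e.g. nsubj(going-4, boy-2)
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text):
    """Extract every typed dependency found in the text, in order."""
    return [
        Dependency(rel, head, int(head_i), dep, int(dep_i))
        for rel, head, head_i, dep, dep_i in DEP_PATTERN.findall(text)
    ]


TEXT = (
    "det(boy-2, The-1) nsubj(going-4, boy-2) aux(going-4, is-3) "
    "root(ROOT-0, going-4) prep(going-4, to-5) det(school-7, the-6) "
    "pobj(to-5, school-7)"
)
deps = parse_dependencies(TEXT)

# The root relation points at the main predicate of the sentence
root = next(d.dependent for d in deps if d.relation == "root")
# The nominal subject is the nsubj dependent of that predicate
subject = next(
    d.dependent for d in deps if d.relation == "nsubj" and d.head == root
)
```

Looking up the `root` and `nsubj` relations in this way illustrates how the predicate-argument structure can be read directly from the dependencies without walking a phrase structure tree.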

4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES

Recently, linguistic information has been introduced into SMT systems to address the problems of word order and morphological variance between language pairs. This preprocessing of the source language is applied consistently to the training and testing corpora. Additional source-side preprocessing steps bring the source-language sentence closer to the target-language sentence.

This section explains the preprocessing methods applied to English sentences to improve the quality of the English to Tamil Statistical Machine Translation system. The preprocessing module for English sentences includes three stages: reordering, factorization, and compounding. Figure 4.2 shows the preprocessing stages for an English sentence. The first step in preprocessing an English sentence is to retrieve linguistic features such as the lemma, POS tag, and syntactic relations using the Stanford parser. These linguistic features, along with the sentence, are then subjected to the reordering and factorization stages. Reordering applies the reordering rules to the
