CHAPTER 4 PREPROCESSING FOR ENGLISH SENTENCE
Current phrase-based Statistical Machine Translation (SMT) systems do not use linguistic information; they operate only on surface word forms. Adding linguistic information has been shown to improve translation, and it can be introduced through preprocessing steps. Moreover, a machine translation system for a language pair with disparate morphological structures needs careful preprocessing or modeling before translation. This chapter explains how preprocessing is applied to the raw source-language sentence to make it more suitable for translation.
4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE
The grammar of a language is divided into syntax and morphology. Syntax describes how words are combined to form a sentence, and morphology deals with the formation of words; it is also defined as the study of how meaningful units combine to form words. One reason to process morphology and syntax together is that a single word in one language may be equivalent to a combination of words in another. The term "morpho-syntax" is a blend of morphology and syntax. It plays a major role in processing different types of languages and is relevant to machine translation because the fundamental units of machine translation are words and phrases. Retrieving syntactic information is the primary step in preprocessing English sentences. The tool that retrieves the syntactic structure of a given sentence is called a parser, and the tool that retrieves morphological features from a word is called a morphological analyzer. Syntactic information includes dependency relations, syntactic structure, and POS tags; morphological information consists of the lemma and morphological features.
Klein and Manning (2003) [150] of Stanford University proposed a statistical technique for retrieving the syntactic structure of English sentences, and the Stanford Parser was developed on the basis of this technique. The parser provides dependency relations as well as phrase structure trees for a given sentence. The Stanford Parser package is a Java implementation of probabilistic natural language parsers: highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The parser was also developed for other languages such as Chinese, Italian, Bulgarian, and Portuguese. It uses knowledge gained from hand-parsed sentences to produce the most likely analysis of new sentences. In this work, the Stanford Parser is used to retrieve the morpho-syntactic information of English sentences.
4.1.1 POS and Lemma Information
Part-of-Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate part of speech, such as noun, verb, or adjective. The process takes an untagged sentence as input, assigns a POS tag to each word, and produces a tagged sentence as output. The most widely used POS tagset for English is the Penn Treebank tagset, which is given in Appendix-A. In this thesis, English sentences are tagged using this tagset. The POS tags and lemmas of some word forms are shown in Table 4.1. The example below shows POS tagging for an English sentence.
English Sentence
The boy is going to the school.
Part-of-Speech Tagging
The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./.
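The tagged output above can be read back into (word, tag) pairs with a few lines of code. The following is a minimal sketch, assuming the tagger writes its output in the word/TAG format shown above; the function name parse_tagged is hypothetical and is not part of the Stanford tools.

```python
def parse_tagged(sentence):
    """Split a 'word/TAG' string into a list of (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # Split on the last slash so that a word containing '/' stays intact.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


tagged = "The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./."
pairs = parse_tagged(tagged)

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in pairs if tag.startswith("NN")]

print(pairs)
print(nouns)
```

Keeping the tags as plain strings makes it easy to filter by tag prefix, as done for the nouns here.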
Table 4.1 POS and Lemma of Words
Word        POS    Lemma
Playing     NN     Playing
Playing     VBG    Play
Walked      VBD    Walk
Pens        NNS    Pen
Training    VBG    Train
Training    NN     Training
Trains      NNS    Train
Trains      VBZ    Train
A morphological analyzer or lemmatizer is used to find the lemma of a word. Lemmas have special importance in highly inflected languages. A lemma is the dictionary form, or root form, of a word. For example, the word "play" appears in a dictionary, but its other forms, such as playing, played, and plays, do not. The word "play" is therefore the lemma, or dictionary word, for these forms.
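Table 4.1 shows that the lemma depends on the POS tag as well as on the word form. The sketch below illustrates this with a toy lookup table copied from Table 4.1 (lowercased); it is only an illustration, not a morphological analyzer, and the names LEMMA_TABLE and lemmatize are hypothetical.

```python
# Toy lexicon taken from Table 4.1: (lowercased word, POS tag) -> lemma.
LEMMA_TABLE = {
    ("playing", "NN"): "playing",
    ("playing", "VBG"): "play",
    ("walked", "VBD"): "walk",
    ("pens", "NNS"): "pen",
    ("training", "VBG"): "train",
    ("training", "NN"): "training",
    ("trains", "NNS"): "train",
    ("trains", "VBZ"): "train",
}


def lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


# The same surface form maps to different lemmas under different tags.
print(lemmatize("Playing", "NN"))
print(lemmatize("Playing", "VBG"))
```

A real lemmatizer would replace the fixed table with morphological rules or a full lexicon; the point here is only that the tag is part of the lookup key.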
4.1.2 Syntactic Information
Syntactic information is used in NLP tasks such as machine translation, question answering, information extraction, and language generation. It is obtained by parsing, which extracts information such as part-of-speech tags, phrases, and the relationships between the words in a sentence. In addition, noun phrases, verb phrases, and prepositional phrases can be identified from the parse tree of a sentence. Figure 4.1 shows an example of an English syntactic tree. The parser output is a tree structure with a sentence label as the root. The example below shows the syntactic information of an English sentence.
Figure 4.1 Example of English Syntactic Tree
English Sentence
The boy is going to the school.
Parts of speech for each word
(DT = Determiner, NN = Noun, VBZ = Verb (3rd person singular present), VBG = Verb (gerund or present participle), TO = to)
The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN
Parsing information
(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) (PP (TO to) (NP (DT the) (NN school))))) (. .)))
Phrases
Noun Phrases (NP): "the boy", "the school"
Verb Phrases (VP): "is going to the school", "going to the school"
Prepositional Phrases (PP): "to the school"
Sentences (S): "the boy is going to the school"
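Phrases can be read directly off the bracketed parse given above. The following is a minimal sketch, assuming the parser output is a well-formed bracketed string like the one shown; the function names parse_tree, leaves, and phrases are hypothetical helpers and are not part of the Stanford Parser API.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip the opening "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1                      # skip the closing ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words covered by a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def phrases(node, label):
    """Collect the word span of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    found = []
    if node[0] == label:
        found.append(" ".join(leaves(node)))
    for child in node[1]:
        found.extend(phrases(child, label))
    return found


parse = ("(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) "
         "(PP (TO to) (NP (DT the) (NN school))))) (. .)))")
tree = parse_tree(parse)
print(phrases(tree, "NP"))
print(phrases(tree, "PP"))
```

Each phrase is reported as the full span of words under its node, so nested constituents of the same label are listed separately.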
4.1.3 Dependency Information
Dependency information represents relations between individual words. A typed dependency parser additionally labels each dependency with a grammatical relation, such as subject, direct object, or indirect object. Typed dependencies are used in several NLP applications, which benefit particularly from access to dependencies between words typed with grammatical relations, since these relations also provide information about predicate-argument structure that is not readily available from phrase structure parse trees. The Stanford typed dependency representation was designed to provide a simple description of the grammatical relationships in a sentence. It can be easily understood and used even by people without linguistic expertise, and it is also used to extract textual relations. An example of the typed dependencies for an English sentence is given below.
English Sentence
The boy is going to the school.
Subject    Verb        Object
The boy    is going    to the school
Typed dependencies
det(boy-2, The-1)
nsubj(going-4, boy-2)
aux(going-4, is-3)
root(ROOT-0, going-4)
prep(going-4, to-5)
det(school-7, the-6)
pobj(to-5, school-7)
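The typed dependencies above follow the pattern relation(governor-index, dependent-index), so they can be loaded into a simple data structure for later processing. The following is a minimal sketch, assuming the dependencies are available as text in exactly this format; DEP_PATTERN and parse_dependencies are hypothetical names, not part of the Stanford Parser.

```python
import re

# relation(governor-index, dependent-index), e.g. nsubj(going-4, boy-2)
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def parse_dependencies(text):
    """Return a list of (relation, (governor, index), (dependent, index))."""
    triples = []
    for rel, gov, gov_i, dep, dep_i in DEP_PATTERN.findall(text):
        triples.append((rel, (gov, int(gov_i)), (dep, int(dep_i))))
    return triples


deps = ("det(boy-2, The-1) nsubj(going-4, boy-2) aux(going-4, is-3) "
        "root(ROOT-0, going-4) prep(going-4, to-5) det(school-7, the-6) "
        "pobj(to-5, school-7)")
triples = parse_dependencies(deps)

# The root relation points at the main verb; nsubj of that verb is the subject.
root = next(dep for rel, gov, dep in triples if rel == "root")
subject = next(dep for rel, gov, dep in triples
               if rel == "nsubj" and gov == root)

print(root)
print(subject)
```

Word indices are kept alongside the words so that repeated words in a sentence (such as the two determiners here) remain distinguishable.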
4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES
Recently, linguistic information has been introduced into SMT systems to address the problems of word order and morphological variance between language pairs. This preprocessing of the source language is applied consistently to both the training and the testing corpora. Additional source-side preprocessing steps bring the source-language sentence closer to the target-language sentence.
This section explains the preprocessing methods applied to English sentences to improve the quality of the English-to-Tamil Statistical Machine Translation system. The preprocessing module for English sentences includes three stages: reordering, factorization, and compounding. Figure 4.2 shows these stages. The first step in preprocessing an English sentence is to retrieve linguistic features such as the lemma, POS tag, and syntactic relations using the Stanford Parser. These linguistic features, along with the sentence, are then subjected to the reordering and factorization stages. Reordering applies the reordering rules to the