
Digital Scholarship in the Humanities, Advance Access published December 2, 2014. doi:10.1093/llc/fqu043

Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language


Heshaam Faili
School of Electrical and Computer Engineering, College of Engineering, University of Tehran and School of Computer Science, Institute for Research in Fundamental Sciences (IPM)

Nava Ehsan, Mortaza Montazery and Mohammad Taher Pilehvar
School of Electrical and Computer Engineering, College of Engineering, University of Tehran


Abstract

With advancements in industry and information technology, large volumes of electronic documents such as newspapers, emails, weblogs, and theses are produced daily. Producing electronic documents has considerable benefits, such as easier organization and data management, and automatic systems such as spell- and grammar-checkers/correctors can help improve their quality. In this article, the development of an automatic spelling, grammatical, and real-word error checker for the Persian (Farsi) language, named Vafa Spell-Checker, is described. The different kinds of errors in a text can be categorized into spelling, grammatical, and real-word errors. Vafa Spell-Checker is a hybrid system in which both rule-based and statistical approaches are used to detect and correct all of these error types: the detection and correction phases for spelling and real-word errors are fully statistical, while for the grammar-checker a rule-based approach is proposed. Vafa Spell-Checker processes these error types in an integrated system for the Persian language. Results on a test set collected from real-world texts indicate that further work on the grammar-checker will require statistical approaches. Evaluation results with respect to the F0.5 measure for the spell-checker, grammar-checker, and real-word error checker are about 0.908, 0.452, and 0.187, respectively. Moreover, several freely usable language resources for Persian that were generated during this project are presented in this article. These resources can be used in further research on the Persian language.

Correspondence: Heshaam Faili, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran. E-mail: hfaili@ut.ac.ir


1 Introduction

Proofreading tools such as spell- and grammar-checkers are among the most widely used natural language applications. These editorial assistance tools are useful for second-language learners not only in writing but also in learning a language by providing valuable feedback (Leacock et al., 2010). Spell- and grammar-checkers for English have been included in most common word processors for some years now. In this article we describe the development of an automatic writing assistance tool, including spelling, grammar, and real-word error checkers, for the Persian (Farsi) language, named Vafa Spell-Checker. The result of the project is a freely available error-checking program that can be used as an add-in for the Microsoft Word processor. In addition, several useful language resources prepared during development are freely available. These resources are described in detail later in this article.

Kukich categorized the errors in a text into five groups: (1) isolated errors, (2) non-isolated or syntactic errors, (3) real-word errors, (4) discourse-structure errors, and (5) pragmatic errors (Kukich, 1992). The last two types cannot be considered spelling or grammatical errors. Thus, all errors considered in this article fall into three groups: spelling, grammatical, and real-word errors. Spelling errors are words that the spell-checker cannot find in its lexicon. Grammatical errors are sentences or expressions that violate the predefined rules of the language. Real-word errors are misspelled words that have wrongly been turned into other valid words. Detecting these errors requires semantic analysis of the context, so conventional spell-checkers that only check the existence of words in a lexicon cannot detect them. Some real-word errors that cause syntactic errors are recognized by the grammar-checker; the others are handled in the third phase of this application. This tool attempts to detect and correct these three types of errors in Persian texts and can be used for proofreading and standardizing Persian text. Vafa Spell-Checker standardizes Persian texts according to the rules of APLL1 and checks the mentioned types of errors together.
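To make the distinction concrete, here is a minimal sketch in Python (the toy lexicon and example sentence are hypothetical and not taken from the article) showing why a plain lexicon lookup catches nonword spelling errors but cannot catch real-word errors, which require contextual analysis.

```python
# Toy illustration only: a real spell-checker would use a full lexicon
# (Vafa's has over 1,200,000 word types) plus contextual statistics.
LEXICON = {"the", "letter", "arrived", "from", "form", "north"}

def is_nonword_error(token: str) -> bool:
    # Conventional spell checking: flag a token only if it is not in the lexicon.
    return token.lower() not in LEXICON

sentence = ["the", "letter", "arrived", "form", "the", "north"]

# "form" passes the lexicon check even though the intended word was "from";
# detecting that real-word error needs context (e.g. n-gram or mutual
# information statistics), not just a dictionary lookup.
for tok in sentence:
    print(tok, "-> nonword error" if is_nonword_error(tok) else "-> in lexicon")
```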

In short, the contributions of this work can be summarized as follows:

(1) As mentioned, this work develops a freely available integrated error checker/corrector system for Persian that deals with all types of errors in a text.

(2) Experimental results on the spell-checker/corrector part of the system show its superiority in detection/correction accuracy over the Microsoft Word spell-checker/corrector.

(3) Promising accuracy is achieved for the Persian real-word error checker/corrector part of the system.

(4) Several resources for Persian were developed and freely published during this project: a Persian lexicon containing word frequencies and the most frequent part-of-speech (POS) tags, preprocessing rules, heuristic-based confusion tables for different kinds of typing errors, grammar-checking rules, mutual-information and n-gram statistics, a confusion set for real-word error checking, and a real-world test set for evaluating the different parts of the system.

This article outlines the development of Vafa Spell-Checker. First, an overview of the tool is given in Section 2. A survey of the literature is reported in Section 3. Spelling, grammatical, and real-word error checking are then described in separate sections; each covers the relevant linguistic features, principles, and implementation issues for Persian, and the evaluation of each part is reported at the end of its section.

As mentioned, several other resources for Persian were developed and are maintained during this project; they can be freely downloaded and are expected to be useful to the research community.2

2 Vafa Spell-Checker System

Vafa Spell-Checker uses a lexicon containing more than 1,200,000 Persian word types. It also uses about twenty rules for text preprocessing. To achieve acceptable error-correction accuracy, the system combines the probabilities of typing mistakes, calculated with the Damerau-Levenshtein algorithm (Damerau, 1964), with the frequency of word occurrence in the language. Grammatical error patterns are defined for the system to detect a number of Persian grammatical errors and to suggest corrections where possible. Mutual information between pairs of Persian words and n-gram probabilities are calculated and stored, to be used in detecting and correcting real-word errors.

The general architecture of Vafa Spell-Checker is shown in Figure 1. First, preprocessing rules are applied to the text in order to turn the tokens into a standard style and to simplify tokenization. Then the existence of each token in the lexicon is checked. If there is a nonword error, the correction phase of the spell-checker starts, followed by ranking of the suggestions: a spacing error is corrected in the space-handling phase, otherwise the word is corrected in the isolated-word error-correction phase. After the correctness of the words has been checked and an appropriate POS tag has been assigned to each word, grammar checking starts. The grammar-checking rules, which are described in Section 4.2.2, are applied to each sentence. If an ungrammatical pattern matches, it is flagged by the system as a grammatical error and an additional description of the error is attached. If a suggestion is available for the pattern, it is reported to the user. Finally, the real-word error-checking phase attempts to correct context-sensitive spelling mistakes that were not detected in the previous phases.

Fig. 1 An overview of the Vafa spell-checker system
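The following Python sketch summarizes the order of these phases; the function and rule interfaces are hypothetical and are not taken from the Vafa implementation, so treat it as a rough outline rather than the system's actual code.

```python
def check_text(text, lexicon, preprocessing_rules, grammar_rules,
               rank_suggestions, realword_checker):
    """Return a list of (error_type, position, suggestion) flags."""
    # 1. Preprocessing: normalize tokens to a standard style before tokenization.
    for rule in preprocessing_rules:
        text = rule(text)
    tokens = text.split()  # simplified; real tokenization is language-specific

    flags = []
    # 2. Nonword (spelling) errors: lexicon lookup, then ranked suggestions
    #    (space handling vs. isolated-word correction is folded into the ranker here).
    for i, tok in enumerate(tokens):
        if tok not in lexicon:
            flags.append(("spelling", i, rank_suggestions(tok)))

    # 3. Grammar checking: each rule scans the (POS-tagged) sentence for an
    #    ungrammatical pattern and may propose a correction.
    for rule in grammar_rules:
        flags.extend(("grammar", pos, fix) for pos, fix in rule(tokens))

    # 4. Real-word errors: context-sensitive mistakes not caught above.
    flags.extend(("real-word", pos, fix) for pos, fix in realword_checker(tokens))
    return flags
```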

As mentioned before, several resources for the Persian language were developed in this project: (1) a Persian lexicon containing word frequencies and the most frequent POS tags, (2) preprocessing rules, (3) heuristic-based confusion tables for different kinds of typing errors, (4) grammar-checking rules, (5) mutual-information and n-gram statistics, and (6) a real-world test set for evaluating the different parts of the system.

The Persian lexicon with POS tags, together with the mutual-information and n-gram statistics, can be used in several natural language processing (NLP) and information retrieval tasks, including proofreading tools, machine translation (MT), WordNet construction, search engines, and classification. Word frequencies can be used in statistical NLP and information retrieval applications such as classification and summarization. The preprocessing rules are a valuable resource for cleaning and standardizing texts for further use. The confusion tables can support other research on Persian spell checking and error analysis, and the grammar-checking rules can support other research on Persian grammar checking. The test set is a valuable resource for comparing different spell- and grammar-checkers and approaches on the same data set of real-world errors.
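As one hedged example of how a mutual-information resource of this kind could be built, the sketch below estimates pointwise mutual information for co-occurring word pairs from a tokenized corpus; the window size, smoothing, and data used for the actual Vafa resource are not specified in the article, so all of these choices are illustrative.

```python
import math
from collections import Counter

def pmi_table(sentences, window=5):
    """Pointwise mutual information for word pairs co-occurring within a window.
    Illustrative only; not the procedure used to build the Vafa resources."""
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for sent in sentences:
        total += len(sent)
        word_counts.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1:i + window]:
                pair_counts[tuple(sorted((w, v)))] += 1
    n_pairs = sum(pair_counts.values()) or 1
    pmi = {}
    for (w, v), c in pair_counts.items():
        p_pair = c / n_pairs
        p_w, p_v = word_counts[w] / total, word_counts[v] / total
        pmi[(w, v)] = math.log2(p_pair / (p_w * p_v))
    return pmi

corpus = [["این", "کتاب", "خوب", "است"], ["آن", "کتاب", "جالب", "است"]]
print(pmi_table(corpus))
```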

3 Survey of the Literature

Related work on spell checking, grammar checking, and real-word error checking is reported separately in the following sections.

3.1 Spell checking

The main tasks of a spell-checker module are tokenization, error detection, error correction, and ranking of the suggestions. Tokenization is a language-specific task that splits a text into meaningful elements called tokens.
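As a small illustration of why tokenization is language-specific, the following Python sketch (not the article's tokenizer) keeps the Persian zero-width non-joiner inside tokens, since treating it as a boundary would split compound word forms.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, a 'pseudo-space' inside Persian words
TOKEN_RE = re.compile(rf"[\w{ZWNJ}]+")

def tokenize(text: str):
    """Split text into word tokens, keeping ZWNJ-joined forms such as 'می‌رود' intact."""
    return TOKEN_RE.findall(text)

print(tokenize("او می\u200cرود."))  # ['او', 'می\u200cرود'] -- the verb stays one token
```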

Most methods use a dictionary directly, although there are methods that work without one (Comeau and Wilbur, 2004). Methods that use a dictionary directly differ in how they store it: they can use either minimal-redundancy or full-listing approaches (Jurafsky et al., 2010). There are also other ways to store the word list, such as a dictionary bitmap (Mitton, 2010) or a ternary search tree (Barari and QasemiZadeh, 2005). The patterns of errors can be categorized into four groups: (1) multi-word token and split errors, (2) typographical errors, (3) cognitive errors, and (4) phonetic errors (Bhagat, 2007). Multi-word token errors happen because of a missing space between two distinct words, as in 'ofthe', while split errors refer to an extra space between the letters of a word, as in 'sp ent'. Typographical errors are regular forms of mistyping, such as pressing a key on the keyboard twice or hitting an adjacent key by mistake. Cognitive errors happen because of a misconception or lack of knowledge on the part of the user, such as typing 'recieve' instead of 'receive'. Phonetic errors happen because of pronunciation similarities between letters, such as typing the word 'naturally' as 'nacherly'. There are many algorithms for correcting errors, such as Soundex and SPEEDCOP, described in Mitton (2010), and Metaphone (Philips, 2000), which deal only with phonetic errors and do not rank the list of suggestions. Other works use n-gram models and neural networks for error correction (Hodge and Austin, 2003). Some methods for statistical spelling correction are presented in Brill and Moore (2000) and Kolak and Resnik (2002). Another approach to spelling correction is finding the minimum edit distance. The analysis of typographical errors in Damerau (1964) states that about 80-95% of the errors in English texts are single errors caused by the wrong insertion, deletion, or substitution of one letter, or by the transposition of two adjacent letters. The Damerau-Levenshtein distance is the minimum number of insertions, deletions, substitutions, or transpositions needed to convert one word into another. In this model (Damerau, 1964), after detecting the erroneous word, all the words that could be
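For reference, a compact implementation of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance is sketched below; it is a standard textbook formulation, not code from the Vafa system.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: the minimum number of insertions,
    deletions, substitutions, and adjacent-letter transpositions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

# Single-letter errors (the 80-95% case noted by Damerau) all have distance 1:
print(damerau_levenshtein("receive", "recieve"))  # 1 (transposition)
print(damerau_levenshtein("spent", "sp ent"))     # 1 (split error: inserted space)
```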
