


Comparing Naïve Bayesian and k-NN algorithms for automatic email classification

Louis Eisenberg

Stanford University M.S. student

PO Box 18199

Stanford, CA 94309

650-269-9444

louis@

ABSTRACT

The problem of automatic email classification has numerous possible solutions; a wide variety of natural language processing algorithms are potentially appropriate for this text classification task. Naïve Bayes implementations are popular because they are relatively easy to understand and implement, they offer reasonable computational efficiency, and they can achieve decent accuracy even with a small amount of training data. This paper seeks to compare the performance of an existing Naïve Bayesian system, POPFile [1], to a hand-tuned k-nearest neighbors system. Previous research has generally shown that k-NN should outperform Naïve Bayes in text classification. My results fail to support that trend, as POPFile significantly outperforms the k-NN system. The likely explanation is that POPFile is a system specifically tuned to the email classification task that has been refined by numerous people over a period of years, whereas my k-NN system is a crude attempt at the problem that fails to exploit the full potential of the general k-NN algorithm.

INTRODUCTION

Using machine learning to classify email messages is an increasingly relevant problem as the rate at which Internet users receive emails continues to grow. Though classification of desired messages by content is still quite rare, many users are the beneficiaries of machine learning algorithms that attempt to distinguish spam from non-spam (e.g. SpamAssassin [2]). In contrast to the relative simplicity of spam filtering – a binary decision – filing messages into many folders can be fairly challenging. The most prominent non-commercial email classifier, POPFile, is an open-source project that wraps a user-friendly interface around the training and classification of a Naïve Bayesian system. My personal experience with POPFile suggests that it can achieve respectable results but it leaves considerable room for improvement. In light of the conventional wisdom in NLP research that k-NN classifiers (and many other types of algorithms) should be able to outperform a Naïve Bayes system in text classification, I adapted TiMBL [3], a freely available k-NN package, to the email filing problem and sought to surpass the accuracy obtained by POPFile.

DATA

I created the experimental dataset from my own inbox, considering the more than 2000 non-spam messages that I received in the first quarter of 2004 as candidates. Within that group, I selected approximately 1600 messages that I felt confident classifying into one of the twelve “buckets” that I arbitrarily enumerated (see Table 1). I then split each bucket and allocated half of the messages to the training set and half to the test set.
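The per-bucket half split described above can be sketched as follows (bucket names and message IDs here are illustrative, not the actual dataset):

```python
from collections import defaultdict

def split_half_per_bucket(messages):
    """messages: list of (bucket, message) pairs. Allocate half of each
    bucket's messages to the training set and half to the test set."""
    by_bucket = defaultdict(list)
    for bucket, msg in messages:
        by_bucket[bucket].append(msg)
    train, test = [], []
    for bucket, msgs in by_bucket.items():
        mid = len(msgs) // 2
        train += [(bucket, m) for m in msgs[:mid]]
        test += [(bucket, m) for m in msgs[mid:]]
    return train, test
```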

As input to POPFile, I kept the messages in Eudora mailbox format. For TiMBL, I had to convert each message to a feature vector, as described in the k-NN section below.

|Code |Size* |Description |
|ae |86 |academic events, talks, seminars, etc. |
|bslf |63 |buy, sell, lost, found |
|c |145 |courses, course announcements, etc. |
|hf |43 |humorous forwards |
|na |37 |newsletters, articles |
|p |415 |personal |
|pa |53 |politics, advocacy |
|se |134 |social events, parties |
|s |426 |sports, intramurals, team-related |
|ua |13 |University administrative |
|w |164 |websites, accounts, e-commerce, support |
|wb |36 |work, business |

* training and test combined

Table 1. Classification buckets

POPFILE

POPFile implements a Naïve Bayesian algorithm. Naïve Bayesian classification depends on two crucial assumptions (both of which follow from the single Naïve Bayes assumption of conditional independence among features, as described in Manning and Schütze [4]): (1) each document can be represented as a bag of words, i.e. the order and syntax of words are completely ignored; and (2) in a given document, the presence or absence of a given word is independent of the presence or absence of any other word. Naïve Bayes is thus incapable of capturing conditional dependencies between words, which guarantees a certain level of imprecision; in many cases, however, this flaw is relatively minor and does not prevent the classifier from performing well.
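Under these assumptions, classification reduces to choosing the category that maximizes the log prior plus a sum of per-word log likelihoods. The following is a minimal sketch with add-one smoothing, not POPFile's actual implementation:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, word_list). Count labels and words per label."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify_nb(words, label_counts, word_counts, vocab):
    """argmax over labels of log P(label) + sum_w log P(w | label);
    word order is ignored (bag of words), per assumption (1)."""
    n_docs = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for w in words:
            # add-one smoothing avoids log(0) for unseen words
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```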

To train and test POPFile, I installed the software on a Windows system and then used a combination of Java and Perl to perform the necessary operations. To train the classifier, I fed the mbx files (separated by category) directly to the provided utility script insert.pl. For testing, I split each test-set mbx file into its individual messages, then used a simple Perl script to feed the messages one at a time to the provided script pipe.pl, which reads in a message and outputs the same message with POPFile's classification decision prepended to the Subject header and/or added in a new header called X-Test-Classification. After classifying all of the messages, I ran a Java program, popfilescore, to tabulate the results and generate a confusion matrix.
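The tabulation step amounts to counting (true, predicted) pairs. A generic sketch of that bookkeeping (not the actual popfilescore code):

```python
from collections import Counter

def confusion_matrix(pairs):
    """pairs: iterable of (true_bucket, predicted_bucket).
    Returns counts keyed by (true, predicted) plus per-bucket accuracy."""
    counts = Counter(pairs)
    totals, correct = Counter(), Counter()
    for (true, pred), n in counts.items():
        totals[true] += n
        if true == pred:
            correct[true] += n
    accuracy = {b: correct[b] / totals[b] for b in totals}
    return counts, accuracy
```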

k-NN

To implement my k-NN system I used the Tilburg Memory-Based Learner, a.k.a. TiMBL. I installed and ran the software on various Unix-based systems. TiMBL is an optimized version of the basic k-NN algorithm, which attempts to classify new instances by seeking “votes” from the k existing instances that are closest/most similar to the new instance. The TiMBL reference guide [5] explains:

Memory-Based Learning (MBL) is based on the idea that intelligent behavior can be obtained by analogical reasoning, rather than by the application of abstract mental rules as in rule induction and rule-based processing. In particular, MBL is founded in the hypothesis that the extrapolation of behavior from stored representations of earlier experience to new situations, based on the similarity of the old and the new situation, is of key importance.

Preparing the messages to serve as input to the k-NN algorithm was considerably more difficult than in the Naïve Bayes case. A major challenge in using this algorithm is deciding how to represent a text document as a vector of features. I chose to consider five separate sections of each email: the attachments; the from, to, and subject headers; and the body. For attachments, each feature was a different file type, e.g. jpg or doc. For the other four sections, each feature was an email address, hyperlink URL, or stemmed and lowercased word or number. I discarded all other headers. I also ignored any words shorter than 3 letters or longer than 20 letters, as well as any words appearing on POPFile's brief stopwords list. Altogether, this resulted in each document in the dataset being represented as a vector of 15,981 features. For attachments, subject, and body, I used tf.idf weighting according to the equation:

weight(i,j) = (1 + log(tf_i,j)) · log(N/df_i) if tf_i,j ≥ 1, and 0 otherwise,

where i is the term index, j is the document index, tf_i,j is the number of occurrences of term i in document j, df_i is the number of documents containing term i, and N is the total number of documents. For the to and from fields, each feature was instead a binary value indicating the presence or absence of a word or email address.
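The weighting above can be computed directly from the term and document frequencies. A one-function sketch (natural log used here; the equation does not fix a base):

```python
import math

def tfidf_weight(tf, df, n_docs):
    """weight(i,j) = (1 + log(tf)) * log(N/df) if tf >= 1, else 0.
    tf: count of term i in document j; df: documents containing i;
    n_docs: N, the collection size."""
    if tf < 1:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)
```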

The Java program mbx2featurevectors parses the training or test set and generates a file containing all of the feature vectors, represented in TiMBL’s Sparse format.

TiMBL processes the training and test data in response to a single command. It has a number of command-line options with which I experimented in an attempt to extract better accuracy. Among them:

• k, the number of neighbors to consider when classifying a test point: the literature suggests that anywhere between one and a handful of neighbors may be optimal for this type of task

• w, the feature weighting scheme: the classifier attempts to learn which features have more relative importance in determining the classification of an instance; this can be absent (all features get equal weight) or based on information gain or other slight variations such as gain ratio and shared variance

• m, the distance metric: how to calculate the nearness of two points based on their features; options that I tried included overlap (basic equals or not equals for each feature), modified value difference metric (MVDM), and Jeffrey divergence

• d, the class vote weighting scheme for neighbors; this can be simple majority (all have equal weight) or various alternatives, such as Inverse Linear and Inverse Distance, that assign higher weight to those neighbors that are closer to the instance

For distance metrics, MVDM and Jeffrey divergence are similar and, on this task with its numeric feature vectors, both clearly preferable to basic overlap, which draws no distinction between two values that are almost but not quite equivalent and two values that are very far apart.

The other options have no clearly superior setting a priori, so I relied on the advice of the TiMBL reference guide and on the results of my various trial runs.
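The interaction of k, the distance metric, and the class vote weighting can be illustrated with a toy k-NN classifier. This sketch uses Euclidean distance and inverse-distance votes as one illustrative combination; TiMBL's optimized internals and its metrics (overlap, MVDM, Jeffrey divergence) differ:

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3, weighted=True):
    """train: list of (label, feature_vector) pairs.
    Classify query by votes from its k nearest training points;
    with weighted=True, closer neighbors get larger votes (1/distance),
    otherwise all k neighbors vote equally (simple majority)."""
    dists = sorted((math.dist(vec, query), label) for label, vec in train)
    votes = defaultdict(float)
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)
```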

RESULTS/CONCLUSIONS

The confusion matrices for POPFile and for the most successful TiMBL run are reproduced in Tables 2 and 3. Figure 4 compares the accuracy scores of the two algorithms on each category. Table 5 lists accuracy scores for various combinations of TiMBL options. The number of TiMBL runs possible was limited considerably by the length of time that each run takes – up to several hours even on a fast machine, depending greatly on the exact options specified.


|Distance metric |Feature weighting |k |Vote weighting |Accuracy |
|MVDM |gain ratio |9 |inv. dist. |51.0% |
|overlap |none |1 |majority |54.9% |
|overlap |inf. gain |15 |inv. dist. |53.7% |
|MVDM |shared var |3 |inv. linear |61.1% |
|Jeffrey |shared var |5 |inv. linear |60.2% |
|overlap |shared var |9 |inv. linear |58.9% |
|MVDM |gain ratio |21 |inv. dist. |49.4% |
|MVDM |inf. gain |7 |inv. linear |57.4% |
|MVDM |shared var |1 |inv. dist. |61.0% |
|MVDM |shared var |5 |majority |54.6% |

Table 5. Sample of TiMBL trials

OTHER RESEARCH

A vast amount of research already exists on this and similar topics. Some researchers, e.g. Rennie et al. [6], have investigated ways to overcome the faulty Naïve Bayesian assumption of conditional independence. Kiritchenko and Matwin [7] found that support vector machines are superior to Naïve Bayesian systems when much of the training data is unlabeled. Other researchers have attempted to use semantic information to improve accuracy [8].

In addition to the two models discussed in this paper, there exist many other options for text classification: support vector machines, maximum entropy and logistic models, decision trees and neural networks, for example.

REFERENCES

[1] POPFile:

[2] SpamAssassin:

[3] TiMBL:

[4] Manning, Christopher and Hinrich Schütze. Foundations of Statistical Natural Language Processing. 2000.

[5] TiMBL reference guide:

[6] Rennie, Jason D. M., Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning. 2003.

[7] Kiritchenko, Svetlana and Stan Matwin. Email classification with co-training. Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research. 2001.

[8] Turenne, Nicolas. Learning Semantic Classes for Improving Email Classification. Biométrie et Intelligence. 2003.
