


Text Analysis Fundamentals

Objective

To extract and examine metadata from collections of documents that contain text. If you want to develop intelligent agents that classify or make predictions based on text, this is an important prerequisite! Most machine learning algorithms require quantitative input, so text analysis helps you reduce and transform your textual content into numbers that describe it. This example uses the tm and wordcloud packages in R, plus data and a little bit of the code from Machine Learning for Hackers by Drew Conway (@drewconway) and John Myles White (@johnmyleswhite). You should follow both of these guys on Twitter; they are very cool and they love data analysis. Their code is downloadable from .

Background

Text analysis is a powerful mechanism for knowledge discovery and extraction, providing methods for identifying associations between terms and for establishing categorization schemes. Typically, data mining means working with structured data, where you know what types of information are contained within various locations in your data repository (such as your database tables). Text mining, in contrast, involves working with unstructured data and can be much more complicated. It can be costly and time consuming to convert a dataset consisting of unstructured text into a more structured form suitable for mining with other techniques. There are three data structures typically associated with text analysis and text mining: the basic container for the unstructured text, and two matrices that contain information about the frequency of appearance of words and terms. They are described below (a short toy example follows the list):

• Corpus - A list containing a collection of documents. The tm package in R allows you to build a corpus from several different file types, but we will only be working with text files here. I've only used collections containing up to 10k documents, but R is reported to perform well with collections of up to 50k documents, and sometimes even up to 100k, depending upon how much memory you have available.

• Document-Term Matrix (dtm) - Organized with the documents as rows and the terms as columns, this matrix contains counts of the appearance of each term within each document of a corpus.

• Term-Document Matrix (tdm) - Organized with the terms as rows and the documents as columns, this matrix contains counts of the appearance of each term within each document of a corpus.
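
To make these three structures concrete, here is a tiny, self-contained sketch in R. The two "documents" are just made-up character strings, so the resulting matrices are small enough to read at a glance; this is only an illustration of how the tm objects relate to one another, not part of the email analysis below.

library(tm)

# Two toy "documents"
docs <- c("free money free offer", "meeting notes and project list")

# Corpus: the container that holds the collection of documents
toy.corpus <- VCorpus(VectorSource(docs))

# Document-term matrix: documents as rows, terms as columns
toy.dtm <- DocumentTermMatrix(toy.corpus)
inspect(toy.dtm)

# Term-document matrix: terms as rows, documents as columns (the transpose)
toy.tdm <- TermDocumentMatrix(toy.corpus)
inspect(toy.tdm)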

In this chapter, we step through the process of building each of these objects in R, and using them to analyze the content of a collection of emails, comparing the composition of spam emails to the composition of legitimate non-spam emails ("ham").

Data Format

For this exercise we will use the "spam and ham" dataset of emails from Chapter 3 of Machine Learning for Hackers. (I've zipped up this data set and posted it to Blackboard under the Lab Exercises menu option. It's called spam-and-ham.zip. I unzipped this data into the directory C:/ISAT 344/MLH/03-Classification/data on my computer. Be sure you note what directory you unzip the data into, because you'll need to use that path.)
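
Before we can count anything, the emails have to be read into corpora and converted into document-term matrices. The full code for this step is in the Machine Learning for Hackers repository; what follows is a minimal sketch of one reasonable way to do it with the tm package. The folder names under the data directory and the specific cleaning steps are illustrative assumptions, so adjust them to match the unzipped data on your machine.

library(tm)

# Paths to the unzipped data (adjust to your own directory and folder names)
spam.path <- "C:/ISAT 344/MLH/03-Classification/data/spam"
ham.path  <- "C:/ISAT 344/MLH/03-Classification/data/easy_ham"

# Helper: read every file in a directory into a cleaned corpus
build.corpus <- function(path) {
  corpus <- VCorpus(DirSource(path))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

spam.corpus <- build.corpus(spam.path)
ham.corpus  <- build.corpus(ham.path)

# Document-term matrices: one row per email, one column per term
spam.dtm <- DocumentTermMatrix(spam.corpus)
ham.dtm  <- DocumentTermMatrix(ham.corpus)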

With the document-term matrices in hand, we can start asking questions about term frequency. Let's see how many terms appear at least 100 times in the spam:

> length(findFreqTerms(spam.dtm,100))

[1] 120

That's a lot of terms, so let's check and see how many have been mentioned even more frequently - at least 300 times:

> length(findFreqTerms(spam.dtm,300))

[1] 29

Since 29 is a manageable number, we can look at the terms themselves:

> findFreqTerms(spam.dtm,300)
 [1] "arial"      "body"            "border"     "borderd"    "business"
 [6] "center"     "div"             "email"      "facedarial" "faceverdanafont"
[11] "font"       "free"            "height"     "heightd"    "helvetica"
[16] "html"       "input"           "list"       "money"      "option"
[21] "people"     "please"          "receive"    "sansserif"  "size"
[26] "sized"      "table"           "width"      "widthd"

We can also check to see which terms appear with high frequency in BOTH the ham and the spam; because these terms are common to both classes, their mere presence is not very useful for distinguishing one type of email from the other:

> intersect(findFreqTerms(spam.dtm,100),findFreqTerms(ham.dtm,100))
 [1] "access"      "address"     "business"    "call"        "company"     "computer"
 [7] "contenttype" "day"         "email"       "free"        "government"  "help"
[13] "here"        "home"        "html"        "information" "internet"    "life"
[19] "link"        "list"        "mail"        "mailing"     "message"     "million"
[25] "money"       "name"        "people"      "please"      "report"      "send"
[31] "sent"        "service"     "size"        "software"    "time"        "web"
[37] "you"

Working with these reduced matrices, let's take a look first at all the terms that appear in each matrix, and the number of times each of those terms appears in the first document of our collection:

> inspect(new.ham.tdm.2[,1])
A term-document matrix (6 terms, 1 documents)

Non-/sparse entries: 3/3
Sparsity           : 50%
Maximal term length: 7
Weighting          : term frequency (tf)

         Docs
Terms     00001.7c53336b37003a9286aba55d2945844c
  date    1
  email   0
  list    4
  mailing 1
  url     0
  wrote   0

> inspect(new.spam.tdm.2[,1])
A term-document matrix (22 terms, 1 documents)

Non-/sparse entries: 12/10
Sparsity           : 45%
Maximal term length: 11
Weighting          : term frequency (tf)

              Docs
Terms          00001.7848dde101aa985090474a91ec93fcf0
  address      0
  body         1
  click        1
  company      0
  contenttype  0
  email        2
  font         3
  form         0
  free         3
  head         0
  html         1
  information  0
  list         1
  message      0
  meta         2
  please       1
  receive      0
  removed      1
  table        4
  time         0
  wish         1
  you          0

Since the word "list" appears 4 times in the first document, let's see what that term is most commonly associated with throughout the corpus. (The last argument to findAssocs is the lowest correlation to report, so a value of 0 asks for every associated term.)

> findAssocs(new.ham.tdm.2,"list",0)
   list mailing   wrote   email
   1.00    0.68    0.26    0.24

Similarly, let's see what's most commonly associated with the word "free" in the spam:

> findAssocs(new.spam.tdm.2,"free",0)
       free information     receive        time     address        list         you       email
       1.00        0.32        0.31        0.27        0.17        0.17        0.15        0.14
     please       click        font        head        html        wish
       0.14        0.12        0.09        0.09        0.04        0.03

Looks like the spammers are most frequently offering free information.

We can also examine word clouds, which are a lot of fun to make. Be advised, though: many data analysts and statisticians dislike word clouds, because the same information (and usually better information) can be found in frequency tables and matrices. That said, word clouds can still be revealing. So let's make a few.

If you have not done it already, install and then load the wordcloud package:

install.packages("wordcloud")   # only needed the first time
library(wordcloud)

In this example, our goal is to create word clouds that compare and contrast the content of the ham corpus with that of the spam corpus. To do that, we need to squish all the documents in all.spam and all.ham into a data frame with two columns, one for each email type; a sketch of one way to do this appears below.
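
The sketch below is one way that step might look, using the wordcloud package's comparison.cloud function. It assumes all.spam and all.ham are character vectors holding the text of each email (one element per message), matching how they are described above; the object names and the max.words setting are illustrative choices rather than the book's exact code.

library(tm)
library(wordcloud)

# Collapse every email of each type into one long "document"
spam.text <- paste(all.spam, collapse = " ")
ham.text  <- paste(all.ham,  collapse = " ")

# Two-column structure: the text of each email type, plus its label
both <- data.frame(text = c(spam.text, ham.text),
                   type = c("spam", "ham"),
                   stringsAsFactors = FALSE)

# Term-document matrix with one column per email type
cloud.corpus <- VCorpus(VectorSource(both$text))
cloud.tdm    <- TermDocumentMatrix(cloud.corpus)
cloud.matrix <- as.matrix(cloud.tdm)
colnames(cloud.matrix) <- both$type

# Draw the comparison cloud: each word is sized by how much more frequent
# it is in one class than in the other
comparison.cloud(cloud.matrix, max.words = 100)

Because comparison.cloud scales each word by how far its frequency in a column deviates from its average across columns, the largest words in the plot are the ones most characteristic of spam or of ham, which makes the two vocabularies easy to contrast at a glance.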

