Classes.ischool.syr.edu



Exploring Text with Corpus Statistics

Suppose that you have collected a number of news articles on a technology or product, either by searching the Web or by using a search product like Factiva. Factiva can produce a number of interesting comparisons and lists of companies, but it doesn’t reveal what is actually being said in the articles.

One way to explore the contents of text documents is through what is sometimes called corpus statistics, where corpus refers to a collection of documents or text, and statistics refers to frequencies of words and phrases occurring in the text, and occasionally other statistical measures. It is quite typical to use frequencies of words and bigrams, which are pairs of words.
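As a rough illustration of what word and bigram frequencies look like, here is a minimal Python sketch. It is not the ExploreText code; the function names and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def word_and_bigram_counts(text):
    """Return (word frequencies, bigram frequencies) for one text."""
    tokens = tokenize(text)
    words = Counter(tokens)
    # A bigram is each pair of adjacent tokens. This simple version
    # also pairs words across sentence boundaries.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return words, bigrams

words, bigrams = word_and_bigram_counts("The cloud is new. The cloud is growing.")
print(words["cloud"])             # frequency of a single word
print(bigrams[("the", "cloud")])  # frequency of a pair of words
```

A real tool would usually split sentences first so that bigrams do not span a sentence break.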

In articles about technology, you might also be interested in what companies, or perhaps people, are mentioned in the articles. For this, we can also use named entity extraction, which identifies names and their types.

For this task, we have provided a user interface to the Stanford Named Entity Recognizer. The program uses the Recognizer to find the words in the text, filters out punctuation and special characters, and removes what are called “stop words”, words with little content such as “the” and “about”. From the remaining words, it displays the frequencies of single words and of pairs of words. It also counts the frequencies of the three types of names provided by the Named Entity Recognizer: people, organizations, and locations. Finally, it lets you search for other (longer) phrases and find their frequencies of occurrence.
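The filtering and name-counting steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it does not call the Stanford recognizer, the tagged tokens are hand-made sample data, the stop-word list is a tiny made-up one, and the function names are hypothetical. The tags PERSON, ORGANIZATION, LOCATION, and O (for "not a name") follow the labels used by the recognizer's three-class model.

```python
from collections import Counter
from itertools import groupby

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"the", "about", "a", "an", "of", "in", "and", "is"}

def content_words(tokens):
    """Lower-case tokens, dropping stop words and punctuation-only tokens."""
    kept = []
    for tok in tokens:
        low = tok.lower()
        if any(ch.isalnum() for ch in low) and low not in STOP_WORDS:
            kept.append(low)
    return kept

def name_counts(tagged, name_type):
    """Count names of one type, joining adjacent tokens that share the tag."""
    counts = Counter()
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag == name_type:
            counts[" ".join(tok for tok, _ in group)] += 1
    return counts

# Hand-made (token, tag) pairs standing in for recognizer output.
tagged = [("Jane", "PERSON"), ("Doe", "PERSON"), ("spoke", "O"),
          ("about", "O"), ("Acme", "ORGANIZATION"), ("in", "O"),
          ("Boston", "LOCATION"), (".", "O")]

print(content_words([tok for tok, _ in tagged]))
print(name_counts(tagged, "PERSON"))
```

Joining adjacent tokens with the same tag is a simplification: two different names that happen to sit next to each other would be merged into one.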

Installing and Running the ExploreText program

In order to run the ExploreText program, you will need to have Java 1.6 or later installed on your system.

Download the file ExploreText.zip, noting that it is about 90 Mbytes. Unzip this file and extract the ExploreText directory.

On a Mac, open a terminal window and navigate to the ExploreText directory, then go into its subdirectory called “dist”. Similarly, on a PC, open a command prompt window (in the run window, type cmd) and navigate to ExploreText –> dist.

Now that you are in the dist directory, you want to execute the program by typing at the prompt:

% java -jar TextViewer.jar

This should launch the viewer window. If, instead, it says that it doesn’t recognize “java”, you should give the directory path to where the java command is. This will depend on the directory structure where java is installed on your PC. For example, in the lab, java is installed on the C drive, under Program Files (x86)\Java\jre7, and you need to give this path to the java command in the bin subdirectory:

% "C:\Program Files (x86)\Java\jre7\bin\java" -jar TextViewer.jar

The double quotes are required because the path has spaces in it.

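If you get an error message that Java has run out of memory, try running the command with an option that raises the maximum memory. The option must come before -jar, because anything after the jar file name is passed to the program rather than to Java. Try something like:

% java -Xmx1024m -jar TextViewer.jar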

Text Data

To use this program, you should copy and paste text from articles into a text file. I used a TextPad document, or you can use something like Notepad++. The text can be collected into a single file, or if you prefer, you can have a directory of text files. The text file(s) should have the file extension .txt.

ExploreText viewer interface

Below is a picture of the viewer interface when you start.

[pic]

You first click on the button “Select File or Directory of Files to Process”. It will give you a file chooser box where you can navigate through your file system to find either a single text file or a directory of all text files to process.
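If you want to check your text files outside the viewer, the "single file or directory of files" choice can be mimicked with a short Python sketch. The function name load_text and the assumption that the files are UTF-8 are mine, not part of ExploreText.

```python
from pathlib import Path

def load_text(path):
    """Return the text of one .txt file, or of every .txt file in a directory."""
    p = Path(path)
    if p.is_dir():
        # Only files ending in .txt are read; sorting keeps the order stable.
        files = sorted(p.glob("*.txt"))
    else:
        files = [p]
    # errors="replace" keeps going if a file contains bytes that are not UTF-8.
    return "\n".join(f.read_text(encoding="utf-8", errors="replace") for f in files)
```

For example, load_text("articles") would read every .txt file in a (hypothetical) directory named articles, while load_text("articles/one.txt") would read just that file.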

After you make a selection, the program will take a little while to run the Stanford Named Entity Recognizer and build all the lists of words. When it is done, the Status label will read something like:

“Status: File(s) loaded with 14017 words”, with a count of the number of words that it found.

Now you can view words, bigrams, or the three types of names, in any order. If you just click one of the buttons “View Words”, “View Bigrams”, or “View Names”, it will show all the words, bigrams, or names of that type that it found, with a frequency for each one. But if you put a limit number in the textbox over the viewing area, it will show only that many of the highest-frequency items.
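The effect of the limit number can be pictured with a small Python sketch. The counts here are made-up sample data and top_items is a hypothetical helper, not a function of the viewer.

```python
from collections import Counter

def top_items(counts, limit=None):
    """Return (item, frequency) pairs, highest frequency first.

    With limit=None every item is returned; otherwise only the top `limit`.
    """
    return counts.most_common(limit)

counts = Counter({"cloud": 5, "data": 3, "privacy": 1})  # sample data
for item, freq in top_items(counts, limit=2):
    print(f"{item}\t{freq}")
```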

Before clicking the “View Names” button, you should click one of the name types, either People, Organizations (which will include companies) or Locations.

When the items are put into the view pane on the right, each line shows the word(s) and how often they occur among all the words. When it first displays words and phrases, it leaves the scroller at the bottom of the list, so you should scroll up to see the higher-frequency items.

Finally, if there is a particular phrase that you are looking for, you can put the words for that phrase in the search text box at the bottom left. You should put the words in lower case, with one space between them. The frequency of that phrase is shown in the Search Result label, also on the bottom left.
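To make the idea of a phrase frequency concrete, here is a minimal Python sketch that counts how often a lower-case, space-separated phrase occurs as a run of consecutive words. The function name and the tokenizer are assumptions for illustration; the viewer's own matching rules may differ.

```python
import re

def phrase_frequency(text, phrase):
    """Count occurrences of `phrase` as consecutive word tokens in `text`."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of n tokens across the text and compare to the phrase.
    return sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == target)

sample = "Cloud computing is here. Many firms adopt cloud computing."
print(phrase_frequency(sample, "cloud computing"))
```

Matching on word tokens rather than raw characters means that the phrase "cloud computing" is not counted inside a longer word, and that differences in capitalization or spacing in the articles do not matter.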

Results

We hope that this will be useful in finding particular keywords or phrases that are indicative of your task, that is, identifying where a technology is in the Gartner Hype Cycle. One difficulty is that it is not known exactly what words and phrases will indicate the various parts of the cycle.

It is our hope that this program will help you explore the words and phrases in the set of articles that you have chosen in order to do your analysis. For example, you may want to report particular phrases that occur with their frequency that you think indicate a particular part of the cycle.

Please let me know if this was helpful to you, or if you have improvements to suggest.

Nancy McCracken
