INTRODUCTION TO TEXT ANALYSIS USING QUANTEDA IN R
by Simon Moss

Introduction

Often, researchers want to uncover interesting patterns in texts, such as blogs, letters, and tweets. One package in R, called quanteda, is especially helpful in these circumstances (for more information, visit the quanteda website). This package can apply a range of procedures to collate and to analyse texts. For example, you can
- identify the most frequent words or combinations of words in texts—such as the most frequent hashtags
- construct word clouds
- ascertain whether the frequency of specific words differs between two or more texts
- calculate the diversity of words or phrases that an author or speaker used—sometimes a measure of language ability or complexity
- determine whether one word tends to follow or precede another word
- ascertain the degree to which texts utilise a specific cluster of words, such as positive words
- classify documents, based on specific features.

This document outlines some of the basics of quanteda. In particular, this document will first present information on how to utilise R. Next, this document will impart knowledge about relevant concepts, such as vectors, dataframes, corpuses, tokens, and document feature matrices—all distinct formats that can be used to store text. Finally, this document will outline how researchers can utilise this package to display and analyse patterns in the texts. More advanced techniques, however, are discussed in other documents.

Install and use R

Download and install R

If you have not used R before, you can download and install this software at no cost. To achieve this goal
- proceed to the "Download R" option that is relevant to your computer—such as the Linux, Mac, or Windows version
- click the option that corresponds to the latest version, such as R 3.6.2.pkg
- follow the instructions to install and execute R on your computer—as you would install and execute any other program.

Download and install R Studio

If you are unfamiliar with the software, R can be hard to navigate. To help you use R, most researchers utilise an interface called R Studio as well. To download and install R Studio
- proceed to Download R Studio
- under the heading "Installers for Supported Platforms", click the RStudio option that corresponds to your computer, such as Windows or Mac
- follow the instructions to install and to execute R Studio on your computer—as you would install and execute any other program
- the app might appear in your start menu, applications folder, or other locations, depending on your computer.

Familiarise yourself with R

You do not need to be a specialist in R to use quanteda. Nevertheless, you might choose to become familiar with the basics—partly because expertise in R is becoming an increasingly valued skill in modern society. To achieve this goal, you could read the document called "How to use R", available on the CDU webpage about "Choosing your research methodology and methods". Regardless, the remainder of this document will help you learn the basics of R as well.

Vectors, dataframes, and matrices

To analyse text, and indeed to use R in general, you should understand the difference between vectors, dataframes, and matrices. In essence
- a vector is a single set of characters—such as a column or row of numbers
- a dataframe is a set of vectors of equivalent length—like a typical data file
- a matrix is like a dataframe, except all the values are the same type, such as all numbers or all letters.

Vectors: c

To learn about vectors, enter the code that appears in the left column of the following table.
These examples illustrate how you can develop vectors and extract specific items, called elements, from vectors. To enter this code, you could enter one command at a time in the Console. Or, if you want to enter code more efficiently, in R Studio
- choose the File menu and then "New File" as well as "R script"
- in the file that opens, paste the code that appears in the left column of the following table
- to execute this code, highlight all the instructions and press the "Run" button—a button that appears at the top of this file.

numericalVector <- c(1, 5, 6, 3)
print(numericalVector)
- The function c generates a vector—in this instance, a set of numbers. This vector is like a container of elements.
- The vector, in this example, is labelled numericalVector.
- When printed, R produces the display [1] 1 5 6 3. The [1] indicates the first row or column.

characterVector <- c('apple', 'banana', 'mandarin', 'melon')
print(characterVector)
- This example is the same, except the items or elements are surrounded by quotation marks.
- Consequently, R realises these elements are characters, including letters, rather than only numbers.

print(characterVector[c(1, 3)])
- This example shows how you can extract or print only a subset of elements or items.
- In this example, R displays only the first and third elements of this vector: apple and mandarin.

numericalVector2 <- numericalVector * 2
print(numericalVector2)
- You can also perform mathematical operations on vectors, but only if the elements are numerical.
- In this example, R will display a vector that is twice the original vector: 2, 10, 12, 6.

characterVector2 <- paste(c('red', 'yellow', 'orange', 'green'), characterVector)
print(characterVector2)
- You can also combine multiple vectors to generate a longer vector.
- In this example, two vectors are combined, called concatenation: 'red', 'yellow', 'orange', 'green' and 'apple', 'banana', 'mandarin', 'melon'.
- To be precise, the first elements of each vector are combined, the second elements of each vector are combined, and so forth.
- Thus, R will display the output "red apple" "yellow banana" "orange mandarin" "green melon".
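As a brief extension, the following sketch shows two other common vector operations: naming elements and extracting elements that satisfy a condition. It reuses numericalVector and characterVector from the table above; the threshold of 3 is an arbitrary example.

# Assign the fruit names as labels for the numeric values
names(numericalVector) <- characterVector
print(numericalVector)                        # apple banana mandarin melon, with counts beneath

# Extract only the elements that exceed an arbitrary threshold
print(numericalVector[numericalVector > 3])   # banana and mandarin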
Data frames: data.frame

Similarly, to learn about dataframes—a format that resembles a typical data file—enter the code that appears in the left column of the following table. These examples illustrate how you can generate dataframes and extract specific records.

fruitData <- data.frame(name = characterVector, count = numericalVector)
print(fruitData)
- The function data.frame generates a dataframe, like a data file, called "fruitData".
- In particular, the first column of this data file contains the four fruits, such as apple and banana. The second column contains the four numbers, such as 1 and 5.
- These two columns are labelled name and count respectively.
- These two columns can be combined only because they comprise the same number of elements each: 4.
- The print command will generate the following display:

      name     count
  1   apple        1
  2   banana       5
  3   mandarin     6
  4   melon        3

- You can use a variety of other methods to construct dataframes—such as uploading and converting csv files. That is, you do not have to derive these dataframes from vectors.

print(nrow(fruitData))
- nrow(fruitData) computes the number of rows in this data file: 4.

fruitData_sub1 <- subset(fruitData, count < 3)
print(fruitData_sub1)
- Using the function subset, you can extract, and then print, a subset of rows.
- This code prints all rows in which the count, a variable you defined earlier, is less than 3.

Matrices

Matrices are like dataframes, except all the elements are the same type, such as numbers. Consequently, R can perform certain functions on matrices that cannot be performed on dataframes. To learn about matrices, enter the code that appears in the left column of the following table.

sampleMatrix <- matrix(c(1, 3, 6, 8, 3, 5, 2, 7), nrow = 2)
print(sampleMatrix)
- The code c(1, 3, 6, 8, 3, 5, 2, 7) specifies all the elements or items in this matrix.
- The code "nrow = 2" indicates the elements should be divided into two rows.
- Thus, print(sampleMatrix) will generate the following display:

       [,1] [,2] [,3] [,4]
  [1,]    1    6    3    2
  [2,]    3    8    5    7

colnames(sampleMatrix) <- c("first", "second", "third", "fourth")
print(sampleMatrix)
- The code colnames(sampleMatrix) adds column labels to the matrix.
- The code c("first", "second", "third", "fourth") represents these labels.
- Thus, print(sampleMatrix) will generate the following display:

       first second third fourth
  [1,]     1      6     3      2
  [2,]     3      8     5      7

- You can also label the rows using rownames instead of colnames.

Installing and loading the relevant packages

The package quanteda is most effective when combined with some other packages. The following table specifies the code you should use to install and load these packages.

install.packages("quanteda")
install.packages("readtext")
install.packages("devtools")
devtools::install_github("quanteda/quanteda.corpora")
install.packages("quanteda.textmodels")
install.packages("spacyr")
install.packages("newsmap")
- Installs the relevant packages.

require(quanteda)
require(readtext)
require(quanteda.corpora)
require(quanteda.textmodels)
require(spacyr)
require(newsmap)
require(ggplot2)
- Loads these packages.
- Instead of "require", you can use the command "library"; these two commands are almost identical, but respond differently to errors.

If you close the program, you will need to load these packages again; otherwise, your code might not work. Therefore, if you need to terminate and then initiate R again, you should copy, paste, and run all the lines beginning with require again before you analyse text.
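If you prefer not to reinstall packages that are already on your computer, the short sketch below installs a package only when it is missing and then loads it. This is a common convenience pattern rather than a step required by quanteda; the vector of package names simply mirrors the CRAN packages listed above (quanteda.corpora is omitted because it is installed from GitHub).

# Install each package only if it is not already installed, then load it
needed <- c("quanteda", "readtext", "quanteda.textmodels", "spacyr", "newsmap", "ggplot2")
for (pkg in needed) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}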
Importing data

To analyse blogs, letters, tweets, and other documents, you first need to import this text into a format that R understands. To illustrate, researchers might construct a spreadsheet that comprises rows of text. In particular
- each cell in the first column is the text of one inauguration speech from a US president
- the other columns specify information about each speech, such as the year and president.

Next, the researchers will tend to save this file into another format—called a csv file. That is, they merely need to choose "Save as" from the "File" menu and then choose the file format called something like csv. Finally, an R function, such as readtext, can then be utilised to import this csv file into R. The rest of this section clarifies these procedures.

Import one file of texts: readtext

To practice this procedure, the readtext package—a package you installed earlier—includes a series of sample csv files. To import one of these files, enter the code that appears in the left column of the following table. If you do not enter the code, the information might be hard to follow. You should also complete this document in order; otherwise, you might experience some errors as you proceed.

path_data <- system.file("extdata/", package = "readtext")
path_data
- These sample files are hard to locate. Rather than search your computer, you can enter this code into R.
- Specifically, this code will uncover the directory in which the package "readtext" is stored. The code also labels this directory "path_data".
- If you now enter "path_data" into R, the output will specify the directory in which these files are stored, such as "/Library/Frameworks/R.framework/Versions/3.6/ /library/readtext/extdata/".

dat_inaug <- readtext(paste0(path_data, "/csv/inaugCorpus.csv"), text_field = "texts")
- This code can be used whenever one column stores the text and the other columns store information about each text, such as the year.
- In the directory is a subdirectory called csv. Within this subdirectory is a csv file called inaugCorpus.csv—resembling the previous spreadsheet.
- The function readtext converts this csv file into a dataframe—a format that R can utilise.
- This imported file is labelled "dat_inaug" because the speeches were presented during the inauguration of these presidents.
- The code text_field = "texts" indicates the text is in the column called texts. Without this code, R cannot ascertain which column is the text and which columns store other information about the documents.
- Note, if using Windows instead of Mac, you may need to replace the / with \.

Import more than one file of texts

Sometimes, rather than one spreadsheet or csv file, you might want to import a series of text files. To illustrate
- you might locate a series of speeches on the web
- you could then copy and paste each speech into a separate Microsoft Word file
- however, you could save each file in a txt rather than doc format.

To practice this procedure, the readtext package also includes a series of text files in a subdirectory called txt/UDHR. To import these files simultaneously, enter the code that appears in the left column of the following table.

path_data <- system.file("extdata/", package = "readtext")
- This code was discussed in the previous table.

dat_udhr <- readtext(paste0(path_data, "/txt/UDHR/*"))
- This code imports all the files that appear in the txt/UDHR subdirectory.
- These files are assigned the label dat_udhr.
- dat_udhr is a dataframe, like a spreadsheet.

View(dat_udhr)
- This code will display the data.
- As this display shows, dat_udhr comprises two columns.
- The first column is the title of each text file. The second column is the contents of each text file.

Importing pdf files or Word files

Instead of importing a series of text files, you might instead want to import a series of pdf files. To illustrate
- you might locate a series of speeches on the web
- you might be able to download these speeches as a series of pdf files—each file corresponding to one speech.

To practice this procedure, the readtext package also includes a series of pdf files in a subdirectory called pdf/UDHR. To import these files simultaneously, enter the code that appears in the left column of the following table.

path_data <- system.file("extdata/", package = "readtext")
- This code was discussed in a previous table.

dat_udhr <- readtext(paste0(path_data, "/pdf/UDHR/*.pdf"))
- This code imports all the pdf files that appear in the pdf/UDHR subdirectory.
- These files are assigned the label dat_udhr. dat_udhr is a dataframe, like a spreadsheet.
- As an aside, note that UDHR is an abbreviation of the Universal Declaration of Human Rights, because all the speeches relate to this topic.

View(dat_udhr)
- This code will display the data—and the data will again include two columns: the title of each document and the text.

You can use similar code to import Microsoft Word documents. In particular, you would merely replace *.pdf with *.doc or *.docx.

Deriving information from the titles of each document

Sometimes, the title of each document imparts vital information that can be included in the dataframe. To illustrate, in this example
- the documents are labelled UDHR_chinese, UDHR_danish, and so forth
- the first part, UDHR, indicates the type of document—documents that relate to the Universal Declaration of Human Rights
- the second part, such as chinese, represents the language.

You can use this information to generate new columns in the dataframe or spreadsheet—columns that represent the type and language of each document. To achieve this goal, enter the code that appears in the left column of the following table.

dat_udhr <- readtext(paste0(path_data, "/pdf/UDHR/*.pdf"), docvarsfrom = "filenames", docvarnames = c("document", "language"), sep = "_")
- The code docvarsfrom = "filenames" instructs R to derive the names of additional variables from the names of each file.
- The code docvarnames = c("document", "language") instructs R to label these variables document and language respectively.
- The code sep = "_" tells R the two variables are separated by the _. That is, the letters before this symbol are assigned to the first variable; the letters after this symbol are assigned to the second variable.

View(dat_udhr)
- This code will display the data—and the data will now include two additional columns, labelled document and language.

In short, text that is stored in csv, pdf, text, or Word files can be readily imported into R. You might want to read other documents, such as the document in Learnline or on the web about web scraping, to learn how to distil these files from the web. For example, one document will show you how you can convert tweets into csv files.
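To apply the same idea to your own material, the sketch below imports a hypothetical folder of plain-text speeches and derives document variables from the file names. The folder path, file-name pattern, and variable names are illustrative assumptions; replace them with whatever matches your files.

library(readtext)

# Hypothetical folder containing files such as 2019_Smith.txt and 2020_Jones.txt
my_speeches <- readtext("~/speeches/*.txt",
                        docvarsfrom = "filenames",
                        docvarnames = c("year", "speaker"),
                        sep = "_")
View(my_speeches)   # one row per file, plus year and speaker columns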
Types of text objects

In the previous examples, the text was stored in a dataframe, something like a spreadsheet. In particular, one column comprised the text of each document. The other columns specified other variables associated with the documents, such as the language, similar to the spreadsheet described earlier. Nevertheless, to be analysed in R, the text then needs to be converted to one of three formats. These three formats are called
- a corpus
- a token
- a document feature matrix or dfm.

Corpuses

A corpus is almost identical to the previous dataset. That is, like this dataset, a corpus comprises one column or variable that stores the text of each document. The other columns or variables represent specific features of each document, such as the year or author. However
- the package quanteda cannot analyse the text until the data file is converted to a corpus
- hence, a corpus is like a data frame, but slightly adapted to a format that R can analyse.

Tokens

In R, you might need to translate this corpus into a format that demands less memory. One of these formats is called a token. You can imagine this format as a container that stores all the text, but without the other variables. To illustrate, the following display illustrates the information that might be stored in a token. Note the token preserves the order in which these words appear.

  Fellow Citizens of the Senate and the House of Representatives Fellow citizens. I am again called upon by the voice of my country. When it was first perceived in early times that no middle course…

Occasionally, rather than a sequence of words, tokens can include other units, such as a sequence of sentences. For example, if you asked R to identify the second element, the answer could be "Citizens" or "I am again called upon by the voice of my country", depending on whether the tokens were designated as words or sentences.

Document feature matrix

Tokens, in essence, comprise all the text, in order, but no other information. Document feature matrices, in contrast, are like containers that store information only on the frequency of each word in the text. The following display illustrates the information that might be stored in a document feature matrix. This format is sometimes referred to as dfm or even, colloquially, as a bag of words.

  Fellow x 6
  Citizens x 4
  of x 23
  …

How to construct and modify corpuses

Constructing a corpus

So, how can you construct a corpus—a format that R can analyse effectively? One of the simplest approaches is to use the following code. This code can be used to convert a dataframe, in this instance called dat_inaug, to a corpus, in this instance called corp_inaug.

  corp_inaug <- corpus(dat_inaug)

Other methods can be used to develop a corpus, such as character vectors. These methods, however, are not discussed in this document.
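To see how the three formats fit together, the sketch below runs the whole chain on the dat_inaug dataframe imported earlier: dataframe to corpus, corpus to tokens, and tokens to document feature matrix. Apart from dat_inaug and corp_inaug, the object names are arbitrary labels chosen for this illustration.

# Dataframe -> corpus -> tokens -> document feature matrix
corp_inaug <- corpus(dat_inaug)                            # corpus: text plus document variables
toks_inaug_demo <- tokens(corp_inaug, remove_punct = TRUE) # tokens: every word, in order
dfmat_inaug_demo <- dfm(toks_inaug_demo)                   # dfm: word frequencies per document

ndoc(corp_inaug)               # number of documents in the corpus
topfeatures(dfmat_inaug_demo)  # most frequent words across these documents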
Document variables in a corpus

Usually, the most informative part of the corpus is the column of text. However, sometimes, you want to explore the document variables as well—such as the year or author of each text. To learn how to access these document variables, enter the code that appears in the left column of the following table.

corp <- data_corpus_inaugural
- This code merely assigns a shorter label, corp, to a corpus that is already included in the quanteda package.
- Thus, after this code is entered, corp is a corpus that comprises inauguration speeches of US presidents.
- Like every corpus, corp includes the text of each document in one column and other information about each document in other columns.

docvars(corp)
- This code will display the corpus—but only the document variables instead of the text—as illustrated by the following extract. Each row corresponds to a separate document in this corpus.

    Year  President   Name    Party
  1 1789  Washington  George  none
  2 1793  Washington  George  none
  3 1797  Adams       John    Federalist
  4 1801  Jefferson   Thomas  Democratic-Republican

docvars(corp, field = "Year")
- This code displays only one of the document variables: Year.
- You could store this variable in a vector—using code like newVector <- docvars(corp, field = "Year").

corp$Year
- Generates the same outcome as the previous code.

docvars(corp, field = "Century") <- floor(docvars(corp, field = "Year") / 100) + 1
- This code is utilised to create a new document variable. In this instance, the new variable is called Century.
- This new variable equals Year divided by 100, plus 1. The function floor rounds the division down to the nearest integer.

docvars(corp)
- If you now display these document variables, you will notice an additional variable: Century.

ndoc(corp)
- Specifies the number of documents in the corpus called corp. In this instance, the number of documents is 58.

Extracting a subset of a corpus

Sometimes, you want to examine merely a subset of a corpus, such as all the documents that were published after a particular year. To learn how to achieve this objective, enter the code that appears in the left column of the following table.

corp_recent <- corpus_subset(corp, Year >= 1990)
- The function corpus_subset distils only a subset of the texts or documents.
- In this instance, the code distils all texts in corp in which the year is 1990 or later. This subset of texts is labelled corp_recent.

corp_dem <- corpus_subset(corp, President %in% c('Obama', 'Clinton', 'Carter'))
- In this instance, the code extracts all texts in which the President is either Obama, Clinton, or Carter.
- %in% means "contained within"—that is, texts in which the President is contained within a vector that comprises Obama, Clinton, and Carter.

Change the unit of text in a corpus from documents to sentences

Thus far, each unit or row in the corpus represents one document, such as one speech. You may, however, occasionally want to modify this configuration so that each unit or row corresponds to one sentence instead. To learn about this possibility, enter the code that appears in the left column of the following table.

corp <- corpus(data_char_ukimmig2010)
- This code merely converts a vector that is already included in the quanteda package to a corpus called corp.
- As this code shows, you can convert vectors into a corpus.

ndoc(corp)
- This code will reveal the corpus comprises 9 rows or units—each corresponding to one document.

corp_sent <- corpus_reshape(corp, to = 'sentences')
- The function corpus_reshape can be used to change the unit from documents to sentences or from sentences to documents.
- In this instance, the argument to = 'sentences' converts the unit from documents to sentences.

ndoc(corp_sent)
- This code will reveal the reshaped corpus comprises 207 rows or units, each corresponding to one sentence.

Change the unit of text in a corpus to anything you prefer

In the previous section, you learned about code that enables you to represent each sentence, rather than each document, in a separate row of the data file. As this section illustrates, each row can also correspond to other segments of the text. In particular
- in your text, you can include a symbol, such as ##, to divide the various segments of text
- you can then enter code that allocates one row to each of these segments, as the left column of the following table illustrates.
corp_tagged <- corpus(c("##INTRO This is the introduction. ##DOC1 This is the first document. Second sentence in Doc 1. ##DOC3 Third document starts here. End of third document.", "## INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
- This code is merely designed to create a corpus in which sections are separated with the symbol ##.
- In particular, the corpus is derived from a vector. As the quotation marks show, the vector actually comprises two elements, but both elements comprise text.
- The corpus is labelled corp_tagged.

corp_sect <- corpus_segment(corp_tagged, pattern = "##*")
- This code divides the corpus of text into the six segments that are separated by the symbol ##.
- Hence, if you now enter corp_sect, R will indicate this corpus entails 6 distinct texts.

cbind(texts(corp_sect), docvars(corp_sect))
- The function cbind merges files together.
- In this instance, cbind merges the vector that comprises all 6 texts with the vector that comprises the patterns—that is, the information that begins with ##.

How to construct and modify tokens

Many of the analyses that are conducted to analyse texts can be applied to corpuses. However, some analyses need to be applied to tokens. This section discusses how you can construct and modify tokens. For example, the left column in the following table delineates how you can construct tokens.

options(width = 110)
- This code simply limits the display of results to 110 columns.

toks <- tokens(data_char_ukimmig2010)
- The function tokens merely converts text—in this instance, text in a vector or corpus called data_char_ukimmig2010—to a token object called toks.
- Note that data_char_ukimmig2010 is included with the package and, therefore, should be stored on your computer.
- If you now enter print(toks), R will display each word, including punctuation and other symbols, in quotation marks separately. This display shows that toks merely comprises all the words and symbols, individually, in order.

toks_nopunct <- tokens(data_char_ukimmig2010, remove_punct = TRUE)
- If you add the argument remove_punct = TRUE, R will remove the punctuation from the tokens.
- You can also remove numbers, with the argument remove_numbers = TRUE.
- You can also remove symbols, with the argument remove_symbols = TRUE.
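These cleaning arguments can be combined in a single call, and the words can also be converted to lower case with tokens_tolower. The sketch below is one plausible way to chain these steps; the object name toks_clean is simply a label chosen for this example.

# Remove punctuation, numbers, and symbols in one step, then lower-case the words
toks_clean <- tokens(data_char_ukimmig2010,
                     remove_punct   = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE)
toks_clean <- tokens_tolower(toks_clean)
print(toks_clean[1])   # inspect the tokens of the first document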
Locating keywords

One of the benefits of tokens is that you can search keywords and then identify the words that immediately precede and follow each keyword. The left column in the following table demonstrates how you can identify keywords.

kw_immig <- kwic(toks, pattern = 'immig*')
- The function kwic is designed to identify keywords and the surrounding words. Note that kwic means key words in context.
- Specifically, in this example, the keyword is any word that begins with immig, such as immigration or immigrants.

head(kw_immig, 10)
- Displays 10 instances of words that begin with immig—as well as several words before and after this term. Other numbers could have been used instead.
- The function head is used whenever you want to display only the first few rows of a container, such as a data file.

kw_immig <- kwic(toks, pattern = c('immig*', 'deport*'))
head(kw_immig, 10)
- Same as above, except displays words that surround two keywords—words that begin with immig and words that begin with deport.

kw_immig <- kwic(toks, pattern = c('immig*', 'deport*'), window = 8)
- The argument window indicates the number of words that precede and follow each keyword in the display.
- In this instance, the display will present 8 words before and 8 words after each keyword.

kw_asylum <- kwic(toks, pattern = phrase('asylum seeker*'))
- If the keyword comprises more than one word, you need to insert the word "phrase" before the keyword.

toks_comp <- tokens_compound(toks, pattern = phrase(c('asylum seeker*', 'british citizen*')))
- This code generates a set of tokens in which particular phrases, such as asylum seeker, are represented as single words. The importance of this code will be clarified later.
- In essence, the function tokens_compound instructs R to conceptualise the phrases asylum seeker and british citizen as words.
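If you want to inspect or share these keyword-in-context results outside R, one option is to convert them to a data frame and save them as a csv file, as sketched below. The file name keywords.csv is an arbitrary example.

# Convert the kwic results to a data frame and export them
kw_df <- as.data.frame(kw_immig)
head(kw_df)                                        # context before, keyword, context after
write.csv(kw_df, "keywords.csv", row.names = FALSE)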
Retaining only a subset of tokens

After you construct the tokens—that is, after you distil the words from a set of texts—you might want to retain only a subset of these words. For example
- you might want to delete functional words—words such as it, and, the, to, and during—that affect the grammar, but not the meaning, of sentences
- or you might want to retain only the subset of words that relate to your interest, such as the names of mammals.

The left column in the following table presents the code you can use to delete functional words or retain a particular subset of words. In both instances, you need to utilise the function tokens_select.

toks_nostop <- tokens_select(toks, pattern = stopwords('en'), selection = 'remove')
- This code instructs R to select only stop words. Stop words are functional terms that affect the grammar, but not the meaning, of sentences.
- This code then instructs R to remove or delete these selected stop words.
- If you enter print(toks_nostop), you will notice the remaining words are primarily nouns, verbs, adjectives, and adverbs.

toks_nostop2 <- tokens_remove(toks, pattern = stopwords('en'))
- This code is equivalent to the previous code but simpler. That is, the function tokens_remove immediately deletes the selected stop words.

toks_nostop_pad <- tokens_remove(toks, pattern = stopwords('en'), padding = TRUE)
- If you include the argument padding = TRUE, the length of your text will not change after the stop words are removed. That is, R will retain empty spaces.
- This code is important if you want to compare two or more texts of equal length—and is thus vital when you conduct position analyses and similar techniques.

toks_immig <- tokens_select(toks, pattern = c('immig*', 'migra*'), padding = TRUE)
print(toks_immig)
- This code retains only a subset of words—words that begin with immig or migra. All other words are deleted.

toks_immig_window <- tokens_select(toks, pattern = c('immig*', 'migra*'), padding = TRUE, window = 5)
print(toks_immig_window)
- This code is identical to the previous code, besides the argument window = 5.
- This argument retains not only words that begin with immig or migra but also the five words before and after these retained terms.

Comparing tokens to a dictionary or set of words

Sometimes, you might want to assess the number of times various sets of words appear in a text, such as words that are synonymous with refugee. To achieve this goal, you first need to
- define these sets of words, called a dictionary—using a function called dictionary
- instruct R to search for these words in a text—using a function called tokens_lookup.

The left column in the following table presents the code you can use to construct a dictionary, or sets of words, and then to search these sets of words in some text, represented as tokens.

toks <- tokens(data_char_ukimmig2010)
- Note that data_char_ukimmig2010 is a set of 9 texts about immigration.

dict <- dictionary(list(refugee = c('refugee*', 'asylum*'), worker = c('worker*', 'employee*')))
print(dict)
- The function dictionary is designed to construct sets of words, called a dictionary.
- In this instance, the dictionary comprises two sets of words. The first set, called refugee, includes words that begin with refugee or asylum. The second set, called worker, includes words that begin with worker or employee.
- These sets of words are collectively assigned the label dict.
- This dictionary is hierarchical, comprising two categories or sets and a variety of words within these categories or sets; not all dictionaries are hierarchical, however.

dict_toks <- tokens_lookup(toks, dictionary = dict)
print(dict_toks)
- This code distils each of the dictionary words that can be located in the text called toks.
- print(dict_toks) will then display the results—and, in this instance, indicate in which document these dictionary words appear.

dfm(dict_toks)
- This code will specify the number of times the dictionary words appear in the various documents.
- The reason is that dfm refers to the document feature matrix—a format, discussed later, in which only the frequency of each word is recorded.

Note that you can also use existing dictionaries of words that were constructed by other people—such as a dictionary of all cities. You would use code that resembles newSet <- dictionary(file = "../../dictionary/newsmap.yml") to import these dictionaries.
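If you would like these dictionary counts in an ordinary data frame, for instance to merge them with other variables or analyse them statistically, one approach is the convert function, sketched below. The object names dict_dfm and dict_counts are arbitrary labels for this example.

# Count the dictionary categories per document, then convert to a data frame
dict_dfm <- dfm(tokens_lookup(toks, dictionary = dict))
dict_counts <- convert(dict_dfm, to = "data.frame")
print(dict_counts)   # one row per document, one column per dictionary category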
Generating n-grams

Many analyses of texts examine individual words. For example, one analysis might determine which words are used most frequently. But instead
- some analyses of texts examine sets of two, three, or more words
- for instance, one analysis might ascertain which pairs of words are used most frequently.

Sets of words are called n-grams. For example, pairs of words are called 2-grams. The left column in the following table presents the code you can use to convert a token of individual words to n-grams.

toks_ngram <- tokens_ngrams(toks, n = 2:4)
- The function tokens_ngrams converts the individual words to n-grams. In this instance, the individual words are stored in a container called toks.
- n = 2:4 instructs R to compile all n-grams of 2, 3, or 4 words.
- For example, if toks contained the words "The cat sat on the mat" in this order, the n-grams would include
  - The cat, cat sat, sat on, on the, the mat
  - The cat sat, cat sat on, sat on the, on the mat
  - The cat sat on, cat sat on the, sat on the mat

toks_skip <- tokens_ngrams(toks, n = 2, skip = 1:2)
- This code generates n-grams between words that are not consecutive—that is, after skipping one or two words.
- For example, if toks contained the words "The cat sat on the mat" in this order, the n-grams would include
  - The sat, cat on, sat the, on mat (skipping one word)
  - The on, cat the, sat mat (skipping two words)

In the previous examples, every possible combination of n-grams was generated. Sometimes, however, you might want to restrict the n-grams to sets of words that fulfil particular criteria. For example, you might want to construct only n-grams that include the word "not". The left column in the following table presents the code you can use to restrict your n-grams.

toks_neg_bigram <- tokens_compound(toks, pattern = phrase('not *'))
- This code is designed to convert all phrases of two words that begin with the word not into compound words.
- For example, the phrase "not happy" will be converted to one compound word—and thus treated as one word.
- Consequently, the container called toks_neg_bigram is the same as toks, but the pairs of words that begin with not are counted as one word.

toks_neg_bigram_select <- tokens_select(toks_neg_bigram, pattern = phrase('not_*'))
- This code then selects only the subset of toks_neg_bigram that comprises the compound words beginning with not.
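To see exactly what these n-grams look like, you can run tokens_ngrams on a single short sentence, as in the sketch below; the example sentence simply mirrors the one used in the explanation above.

# A tiny, self-contained demonstration of bigrams and skip-grams
toks_demo <- tokens("The cat sat on the mat")
tokens_ngrams(toks_demo, n = 2)             # consecutive pairs, e.g. The_cat, cat_sat
tokens_ngrams(toks_demo, n = 2, skip = 1:2) # pairs that skip one or two words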
How to construct and modify document feature matrices

The previous section demonstrated how you can construct and modify tokens—a container of text that presents the words in order but excludes other information, such as the year in which the documents were published. Some analyses, however, are more effective when applied to another format called document feature matrices. Document feature matrices are containers of text that specify only the frequency, but not the order, of each word in the text.

Construct and refine a document feature matrix

You can apply several methods to construct document feature matrices. One method is to convert a token object to a document feature matrix. To construct, and then to explore, these matrices, enter the code that appears in the left column of the following table.

toks_inaug <- tokens(data_corpus_inaugural, remove_punct = TRUE)
- This code translates a corpus—called data_corpus_inaugural—to tokens, after removing punctuation.
- These tokens are then stored in a container called toks_inaug.

dfmat_inaug <- dfm(toks_inaug)
- This code then converts these tokens, stored in toks_inaug, to a document feature matrix.

print(dfmat_inaug)
- If you print this document feature matrix, the output looks complex.
- First, the output indicates the matrix still differentiates all the documents in the original text. That is, the matrix comprises 58 documents.
- Second, for each document, the output specifies the number of times each word appeared. For example, the output indicates that "fellow-citizens" was mentioned once and "of" 71 times in the first document.

acrossDocuments <- colSums(dfmat_inaug)
- This code is designed to combine the documents—generating the frequency of each word across the entire set of texts.
- Specifically, in the document feature matrix, each document is represented as a separate row, and each column corresponds to a separate word. So, if you calculate the sum of each column, you can determine the frequency of each word across the documents.
- Indeed, if you now enter "acrossDocuments", R will present the frequency of each word.

topfeatures(dfmat_inaug, 10)
- This code will generate the 10 most frequent words.
- This output would have been more interesting if stop words—that is, functional words like "the" and "of"—had been deleted first.

     the     of    and     to     in      a    our
   10082   7103   5310   4526   2785   2246   2181

dfmat_inaug_prop <- dfm_weight(dfmat_inaug, scheme = "prop")
print(dfmat_inaug_prop)
- This code merely converts the frequencies to proportions. For example, if 1% of the words in a document are "hello", hello will be assigned a .01.
- That is, the function dfm_weight is designed to transform the frequencies. The argument scheme = "prop" indicates the transformation should convert frequencies to proportions.

Select and remove subsections of a document feature matrix

You can also select and remove information about specific words from the document feature matrix. To illustrate, you can remove stop words—functional words, such as "it" or "the"—as the following table shows.

dfmat_inaug_nostop <- dfm_select(dfmat_inaug, pattern = stopwords('en'), selection = 'remove')
- The function dfm_select can be used to select and remove particular subsets of words, such as functional words.
- In this example, stop words—that is, functional words—are selected and then removed.

dfmat_inaug_nostop <- dfm_remove(dfmat_inaug, pattern = stopwords('en'))
- This code is equivalent to the previous code. That is, the function dfm_remove both selects and removes particular subsets of words, such as stop words.

dfmat_inaug_long <- dfm_select(dfmat_inaug, min_nchar = 5)
- This code selects words that comprise 5 or more letters. Obviously, the number 5 can be changed to other integers as well.

dfmat_inaug_freq <- dfm_trim(dfmat_inaug, min_termfreq = 10)
- The function dfm_trim can be used to remove words whose frequencies are more or less than a specific number.
- In this example, all words that appear fewer than 10 times are trimmed or removed.

dfmat_inaug_docfreq <- dfm_trim(dfmat_inaug, max_docfreq = 0.1, docfreq_type = "prop")
- This code is similar to the previous code, besides two differences.
- First, this code explores proportions rather than frequencies, as indicated by the argument docfreq_type = "prop".
- Second, this code trims or removes words that appear in more than a specified proportion of documents—in this instance, 0.1. Therefore, all the very common words are removed.
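Document feature matrices can also be collapsed so that each row represents a group of documents rather than a single document, which is often a convenient precursor to the comparisons described later. The sketch below groups the inaugural dfm by the Party document variable; this is one possible approach, assuming the Party variable has been carried through from data_corpus_inaugural, rather than a step the tutorial itself takes.

# Collapse the dfm so that each row is a political party rather than a speech
dfmat_by_party <- dfm_group(dfmat_inaug, groups = docvars(dfmat_inaug, "Party"))
docnames(dfmat_by_party)           # one "document" per party
topfeatures(dfmat_by_party, 10)    # most frequent words across the grouped matrix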
Calculating the frequency of specific words in document feature matrices

You can also ascertain the frequencies of particular words—such as positive words—in a document feature matrix. That is, similar to the procedures that can be applied to tokens, you need to
- construct a dictionary, comprising particular sets of words
- ascertain the frequency of these words in a document feature matrix, as shown in the following table.

dict_lg <- dictionary(list(budget = c('budget*', 'forecast*'), presented = c('present*', 'report*')))
- This code constructs a dictionary of words, called dict_lg.
- You can also construct dictionaries from existing dictionaries, with code like dict_lg <- dictionary(file = ' ').

toks_irish <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)
dfmat_irish <- dfm(toks_irish)
- This code merely generates a document feature matrix from a set of texts on your computer—about the Irish budget.

dfmat_irish_lg <- dfm_lookup(dfmat_irish, dictionary = dict_lg, levels = 1:2)
print(dfmat_irish_lg)
- The function dfm_lookup determines the frequency of words in the dictionary dict_lg that appear in the document feature matrix dfmat_irish.
- Often, as in this example, the dictionary is hierarchical. At the highest level are broad categories, such as budget and presented. At a lower level are more specific words, such as budget and forecast.
- The levels argument is utilised if you want to explore the frequencies of words and categories at more than one level.

How to construct feature co-occurrence matrices

A feature co-occurrence matrix is similar to a document feature matrix. However, the feature co-occurrence matrix determines the number of times two words appear in the same section—such as the same document. Enter the code in the left column of the following table to construct a feature co-occurrence matrix. As this example shows
- first construct a document feature matrix
- then use the function fcm to convert this document feature matrix into a feature co-occurrence matrix.

corp_news <- download('data_corpus_guardian')
- This code downloads a corpus of Guardian news articles and labels this corpus corp_news.
- Usually, in the brackets you would need to specify an entire url.

dfmat_news <- dfm(corp_news, remove = stopwords('en'), remove_punct = TRUE)
dfmat_news <- dfm_remove(dfmat_news, pattern = c('*-time', 'updated-*', 'gmt', 'bst'))
dfmat_news <- dfm_trim(dfmat_news, min_termfreq = 100)
- The first line of code converts the corpus corp_news to a document feature matrix, using the function dfm.
- Furthermore, these lines of code reduce the size of this matrix—removing stop words, punctuation, words that end in time, words that begin with updated, as well as words that are used fewer than 100 times.

fcmat_news <- fcm(dfmat_news)
- This code then converts the document feature matrix into a feature co-occurrence matrix.
dim(fcmat_news)
- This code merely calculates the number of rows and columns in this matrix, called dimensions.
- In this example, the numbers 4210 and 4210 appear. Therefore, this matrix comprises 4210 x 4210 cells.
- The number in each cell represents the number of times two corresponding words, such as "wealthy" and "refugee", appear in the same document.

head(fcmat_news)
- This code generates the following output—a subset of the matrix. For example, as this output shows, the words london and climate both appear in 755 documents.

  features   london  climate  change       |   want
  london       5405      755    1793     108   2375
  climate         0    10318   10284   74775   1438
  change          0        0    3885  112500   2544

How to conduct statistical analyses

Thus far, this document has merely demonstrated how to import text and convert this text to objects or containers that can be used in subsequent analyses. This section presents some basic statistical analyses that can be conducted to explore these containers of text.

Conduct a frequency analysis

To start their analysis, many researchers first calculate the frequency of specific words and then display these findings. For example, they might want to identify the most common hashtags in tweets. To conduct these analyses and displays, enter the code that appears in the left column of the following table.

corp_tweets <- download(url = '')
- This code merely downloads a corpus from a specific url—and then labels this corpus corp_tweets.
- This corpus comprises a series of tweets and information about these tweets.

toks_tweets <- tokens(corp_tweets, remove_punct = TRUE)
- This code then converts this corpus into tokens after removing punctuation. These tokens are stored in a container or object called toks_tweets.

dfmat_tweets <- dfm(toks_tweets, select = "#*")
- This code converts these tokens into a document feature matrix—but includes only the words that begin with #.
- Hence, this document feature matrix summarises the frequency of each hashtag.

tstat_freq <- textstat_frequency(dfmat_tweets, n = 5, groups = "lang")
- The function textstat_frequency calculates the frequency of a word both in the entire text as well as in each document within this text.
- The output will also specify the language; this variable, called lang, was included in the original corpus and is maintained in the document feature matrix.
- Because of the argument n = 5, only the top five most frequent words in each language appear.
- For example, if you entered View(tstat_freq), you would receive the display, usually in the top left quadrant of the screen.

dfmat_tweets %>%
  textstat_frequency(n = 15) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()
- This code will generate a plot of the most frequent hashtags.
- Most of this code can be utilised without changes. You can change the 15 to another number, depending on how many of the most frequent words you want to display.
- dfmat_tweets is the name you used to label the document feature matrix in which the text is stored.

set.seed(132)
textplot_wordcloud(dfmat_tweets, max_words = 100)
- This code generates a word cloud.
- You can change the 132 to any number. You can change the 100 depending on the number of words you want to display in the cloud.
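Because the tweet corpus above has to be downloaded from a url, it can be convenient to practise the same frequency analysis on the inaugural dfm that already exists in your session. The sketch below assumes dfmat_inaug from the earlier section and a version of quanteda in which textstat_frequency is available (in more recent releases this function sits in the separate quanteda.textstats package).

# The same style of frequency plot, applied to the inaugural speeches instead of tweets
dfmat_inaug_nostop <- dfm_remove(dfmat_inaug, pattern = stopwords('en'))
tstat_inaug <- textstat_frequency(dfmat_inaug_nostop, n = 15)

ggplot(tstat_inaug, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Frequency") +
  theme_minimal()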
Comparing word clouds

A helpful analysis is to compare groups—such as languages or regions—in a single word cloud. For example, the words corresponding to one group might appear in dark blue, and the words corresponding to another group might appear in light blue. To generate this display, you will need to
- utilise an existing grouping variable or generate a grouping variable, such as a variable that specifies whether the text is in English or not
- integrate this variable with the document feature matrix
- create the word cloud.

These procedures are clarified in the following table. Specifically, this table presents the code you need to use.

corp_tweets$dummy_english <- factor(ifelse(corp_tweets$lang == "English", "English", "Not English"))
- Background to this code: in the corpus called corp_tweets, one of the variables or columns is called lang. In this variable, the options include English, German, French, and so forth.
- Details about this code: this code generates a variable in corp_tweets called dummy_english.
- According to this code, whenever the language is equivalent to English, this variable will be assigned the value English. Whenever the language is not equivalent to English, this variable will be assigned the value Not English.

dfmat_corp_language <- dfm(corp_tweets, select = "#*", groups = "dummy_english")
- This code distils a document feature matrix from the corpus called corp_tweets.
- Furthermore, this code includes only the words that begin with #—and thus retains hashtags only.
- Finally, all the tweets are divided into two groups: English and Not English.

set.seed(132)
textplot_wordcloud(dfmat_corp_language, comparison = TRUE, max_words = 200)
- This code simply constructs the word cloud.
- The argument comparison = TRUE compares the two groups.

Assessing the variety of words that people use

Sometimes, researchers want to explore the variety of words that people use, called lexical diversity. That is, lexical diversity refers to the number of distinct words that individuals use. If people use many distinct words in a single document, they are assumed to demonstrate greater language or thinking ability.

tstat_lexdiv <- textstat_lexdiv(dfmat_inaug)
- This code is designed to calculate the lexical diversity of each document in the document feature matrix.
- In this instance, the document feature matrix is called dfmat_inaug and was constructed before.
- To illustrate, if you entered the code tstat_lexdiv to explore the contents of this container, R will produce the following output:

  1 1789-Washington 0.7806748
  2 1793-Washington 0.9354839
  3 1797-Adams      0.6542056
  4 1801-Jefferson  0.7293973
  5 1805-Jefferson  0.6726014
  6 1809-Madison    0.8326996

- Each row corresponds to one document within this text.
- Each number, such as .780, is called the lexical diversity and ranges from 0 to 1. The number equals the number of distinct words divided by the total number of words.
- Therefore, if the number is low, the writer or speaker is using the same words repeatedly.

plot(tstat_lexdiv$TTR, type = 'l', xaxt = 'n', xlab = NULL, ylab = "TTR")
grid()
axis(1, at = seq_len(nrow(tstat_lexdiv)), labels = dfmat_inaug$President)
- This code will generate a plot in which the Y axis represents the lexical diversity and the X axis represents each document.
- The argument tstat_lexdiv$TTR specifies the plot is designed to display the variable TTR—the measure of lexical diversity—in the object or container called tstat_lexdiv.
- The x axis is not labelled, as indicated by the code NULL. The y axis is labelled TTR.
- type = 'l' requests a line graph.
Assessing the similarity of documents or features

Sometimes, you might want to assess the extent to which documents are similar to one another. For example
- if two documents are very similar, you might conclude that one author derived most of their insights from another author
- if two documents, supposedly written by the same person, are very different, you might conclude that a ghost writer actually constructed one of these documents
- you might show that a set of documents can be divided into two sets, demonstrating two main perspectives.

R can be utilised to assess the degree to which sets of documents are similar to one another. Enter the code in the left column of the following table to learn about this procedure.

tstat_dist <- as.dist(textstat_dist(dfmat_inaug))
- The function textstat_dist is designed to ascertain the level of similarity between all documents in the document feature matrix called dfmat_inaug.
- The function as.dist records these distances in a format that can be subjected to other analyses, such as cluster analysis.
- These results are stored in a container called tstat_dist. If you simply enter tstat_dist into R now, you will receive a series of matrices that resemble the following:

              89-Wash    93-Wash   97-Adams
  93-Wash    76.13803
  97-Adams  141.40721  206.69543

- According to this matrix, the distance between the 1789 Washington speech and the 1793 Washington speech is 76.14. The distance between the 1789 Washington speech and the 1797 Adams speech is 141.40.
- A higher number indicates greater differences in the words of these speeches. Hence, the 1789 Washington speech and the 1793 Washington speech are more similar to each other than the 1789 Washington speech is to the 1797 Adams speech.

clust <- hclust(tstat_dist)
plot(clust, xlab = "Distance", ylab = NULL)
- This code then subjects these distances to a hierarchical cluster analysis.
- You will not understand this display unless you are familiar with hierarchical cluster analysis. Even if you are familiar with hierarchical cluster analysis, the display is hard to interpret.

Ascertaining whether keywords differ between two groups of texts

Sometimes, you want to examine whether the frequency of specific keywords, such as refugee, differs between two sets of texts, such as speeches from conservative leaders and speeches from progressive leaders. You can use and adapt the following code to achieve this goal.

tstat_key <- textstat_keyness(dfmat_news, target = "text136751")
- The code target = "text136751" generates two groups: this document versus all the other documents.
- The function textstat_keyness instructs R to compare the two groups of documents on the frequency of each word.

textplot_keyness(tstat_key)
- This code generates a plot.
- The plot is hard to decipher but, in essence, the longest bars represent words that are more common in one set of documents compared to the other set of documents.

Other benefits

This document summarised the first half of a web tutorial about the package quanteda. If you want more information, you might read Sections 5, 6, and 7 of this tutorial. These other sections will impart knowledge about
- how you can generate machine learning models that can classify future texts
- how you can derive measures that characterise features of texts—such as the extent to which a text is conservative or progressive, and
- many other functions.