Identifying Gender of Authors
An application of Markov chains to textual analysis
Curtis Miller
Spring 2015

Introduction

Authors frequently use pen names in place of their own for their work. Reasons range from preserving anonymity to marketing purposes. Female authors often used pen names in order to hide their gender. This was very common in the 18th century but still occurs today. Joanne Rowling, for example, used the names J. K. Rowling and Robert Galbraith to hide her gender [4].

In this paper, I show that Markov chains can be used to identify author gender. I explain the method for doing so and provide an example using 120 texts provided by Project Gutenberg. I then finish with a discussion and suggestions of further applications.

Method and Mathematical Background

Markov originally conceived of Markov chains with the textual analysis application in mind [3]. Even though human languages are obviously not Markov chains, Dmitri Khmelev et al. showed that Markov chains can be used to predict authorship with surprisingly good accuracy [1;2]. The method in this paper is based on Khmelev's method, described in [1] and [2].

Khmelev provided a formalization of his procedure in [1], and I describe it below. Begin with an alphabet set $A$, which usually contains the lower-case letters and a single whitespace character, the space character. $A^\kappa$ is the set of all words of length $\kappa > 0$, and $A^* = \bigcup_{\kappa > 0} A^\kappa$; in other words, $A^*$ is the set of all strings over the alphabet $A$. $f \in A^*$ is one such string, and $|f|$ is its length. There are $n$ sets $C_i$, and $f_{i,j} \in A^*$, with $1 \le j \le m_i$, is one of the $m_i$ strings in $C_i$, so $f_{i,j} \in C_i$.

Every string $f_{i,j}$ is thought to be generated by a Markov chain with transition matrix $\Pi_i$. We do not know $\Pi_i$, but we can estimate it with a transition matrix $P_i$. Suppose $k, l \in A$ are letters in alphabet $A$. Denote by $Q^{i,j}_{kl}$ the number of letter transitions $k \to l$ in $f_{i,j}$.
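The count just defined can be sketched in a few lines of Python. This is only an illustration, not the R script used for this paper; the function name `count_transitions` and the sample string are assumptions chosen for the example.

```python
from collections import Counter

def count_transitions(text):
    """Count adjacent character pairs (k, l) in text, i.e. transitions k -> l.

    Hypothetical helper: the returned Counter plays the role of the counts
    Q_kl for a single string.
    """
    return Counter(zip(text, text[1:]))

# Toy string, already lower-case and padded with a space on each side.
counts = count_transitions(" the cat ")
print(counts[("t", "h")])    # how often "t" is followed by "h"
print(sum(counts.values()))  # total number of transitions in the string
```

A `Counter` returns zero for pairs that never occur, which matches the convention that an unseen transition has count zero.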
In addition, $Q^i_{kl} = \sum_{j=1}^{m_i} Q^{i,j}_{kl}$ (the frequency of the letter transition $k \to l$ for "author" $i$) and $Q^i_k = \sum_{l \in A} Q^i_{kl}$ (the frequency with which the letter $k$ is used). Then the entry of $P_i$ corresponding to a transition $k \to l$ (which I may refer to as $P_i(k,l)$ or $P^i_{kl}$) will be $P_i(k,l) = Q^i_{kl} / Q^i_k$. This is the empirical transition matrix.

Suppose we knew that a string $x \in A^*$ was generated by some $\Pi_\theta$, where $1 \le \theta \le n$ but $\theta$ is unknown. Let $\nu_{kl}$ denote the number of transitions $k \to l$ in $x$. For every $i$ with $1 \le i \le n$, we would use $P_i$ as the estimate for $\Pi_i$. We would then represent the probability of seeing the string $x$ if $\theta = i$ as

$$\prod_{k,l \in A} \left(P^i_{kl}\right)^{\nu_{kl}}.$$

Notice that problems would arise where $P^i_{kl} = 0$: if $P^i_{kl} = 0$ while $\nu_{kl} > 0$, the whole probability is zero, and if both $P^i_{kl} = 0$ and $\nu_{kl} = 0$, the factor is the undefined expression $0^0$. Rather than let these zero transitions spoil our estimators, we will omit them and consider the number

$$\prod_{k,l:\, P_i(k,l) > 0} \left(P^i_{kl}\right)^{\nu_{kl}}.$$

If we let $\hat{\theta}$ be the $i$ that maximizes this number, $\hat{\theta}$ would be a maximum likelihood estimator for $\theta$.

Rather than use the probability directly, though, we will use the natural logarithm of this number. Let

$$\Lambda_{x,i} = -\sum_{k,l:\, P_i(k,l) > 0} \nu_{kl} \log P^i_{kl}$$

($\log$ is the natural logarithm). Because $\Lambda_{x,i}$ is the negative of the logarithm of the product above, maximizing the product is the same as minimizing $\Lambda_{x,i}$, so we can write the maximum likelihood estimator $\hat{\theta}$ as

$$\hat{\theta} = \arg\min_i \Lambda_{x,i}.$$

Khmelev applied this notion in [1] and [2] to guessing the author of a text when that author is unknown, and $\hat{\theta}$ would correspond to the "best guess" of the author of a text $x$. His framework need not be restricted to that application, though. The sets $C_i$ mentioned above could consist of texts with any common theme, not just common authorship. In this paper, I use gender as the distinguishing factor, and while this technique is not as effective when applied to gender, it is better than flipping a coin (at least when applied to my sample).

Sample and Application

I downloaded 120 texts provided by Project Gutenberg and divided the texts into three groups. One group consists of texts from female authors who did not use a pen name.
Another group consists of texts from male authors who did not use a pen name. Finally, the test group consists of texts whose authors (who are either male or female) either used a pen name or are anonymous and still unknown. One can infer the texts and authors used in this study from the provided R script.

Prior to analyzing the texts, some preprocessing takes place. All punctuation is removed and every run of white space is replaced with a single space character. Khmelev found in [1] that the method is more accurate if words that contain a capital letter are removed from the texts completely, so I do so. I also add single space characters at the beginning and end of the string. The resulting string is then used in the analysis.

When applying the above method in practice to a particular text $x$, the "male author" and the "female author" are assigned a rank according to their respective $\Lambda_{x,i}$. A smaller $\Lambda_{x,i}$ corresponds to a lower rank, and the "author" with the lowest rank is the one deemed to have written the text. I created a matrix $R$ with rows corresponding to texts and columns corresponding to "authors," with entry $R_{ij}$ being the rank of "author" $j$ regarding possible authorship of text $i$. In this application, $R$ has only two columns, male and female. In addition to making the results easy to find, the matrix $R$ allows for other useful insights that expand the possible applications of the Markov chain method.

Results

I list the results below in a table containing the title of the text, the author (pen name first, with the real name in parentheses where a pen name was used), the gender predicted by the Markov chain method, and the author's true gender. For the authors of known gender, the method does a decent job, correctly guessing gender about 70% of the time. I do not believe the method is as effective here as when it is applied to authorship identification; Khmelev was able to achieve about the same accuracy but with a wider array of categories (think authors) in [1;2], which should allow for more error.
With that said, the method does a better job than basing a guess on a coin flip.

Text | Author | Predicted Gender | True Gender
Pride and Prejudice | Jane Austen | Female | Female
Little Women | Louisa May Alcott | Female | Female
The Adventures of Huckleberry Finn | Mark Twain (Samuel Langhorne Clemens) | Female | Male
Micromegas | Voltaire (François-Marie Arouet) | Male | Male
Heart of Darkness | Joseph Conrad (Józef Teodor Konrad Korzeniowski) | Male | Male
Wuthering Heights | Ellis Bell (Emily Brontë) | Female | Female
Agnes Grey | Acton Bell (Anne Brontë) | Male | Female
1984 | George Orwell (Eric Blair) | Male | Male
Middlemarch | George Eliot (Mary Ann Evans) | Male | Female
Jane Eyre | Currer Bell (Charlotte Brontë) | Female | Female
The Romance of Lust, or Early Experiences | Anonymous | Female | Unknown (believed to be male)
Forbidden Fruit; Luscious and exciting story and More forbidden fruit or Master Percy's progress in and beyond the domestic circle | Anonymous | Female | Unknown
Laura Middleton | Anonymous | Male | Unknown
Beauty and the Beast | Anonymous | Female | Unknown

As for the anonymous texts, the method guessed that all authors were female, with the exception of Laura Middleton. No one knows the gender of these authors for certain. However, The Romance of Lust is believed to have been written by a male author (two individuals are considered possible authors, William Simpson Potter and Edward Sellon [5]).

Discussion and Conclusion

Determination of author gender could be used in practical applications. Websites, for example, try to determine as much as they can about an individual so they can better tailor their services to the individual's expected preferences. Often a service can best predict an individual's gender from their name, but if an individual is anonymous, it may be possible to determine the individual's gender from the text they generate. The Markov chain method described here would be one way to do so.

The framework for this method is general enough to be used in numerous applications. Attribution of authorship is only one.
Here I applied it to author gender attribution, but there are possibilities beyond that.

Another obvious potential application is determining the genre of a text. This method could be used to determine if a text fits into a particular literary genre, such as science fiction, fantasy, mystery, etc. It could also be used to determine if a text is a work of fiction (like a novel), an essay, a report, and so on. This would be helpful for services that process numerous documents and would like to classify them when no classification has otherwise been given.

Khmelev said in [1] that the matrix of ranks $R$ is useful not only for determining authorship of a text but also for determining which authors are similar. He noticed that authors correlate in their rankings. We could deem authors that correlate in rank to be "similar" in some way. This idea could be used by social media websites, where users create lots of text, to suggest other interests or tailor advertising to users based on what "similar" users prefer.

One should keep in mind, though, that while the Markov method is very useful for determining authorship of a text, or even possibly for saying that authors are "similar," it does not say why, beyond one author transitioning from one letter to another more frequently than some other author (and this is true even when texts are not segregated based on authorship). This leaves open a large realm of possible reasons for the method to work. For example, certain letter transitions may be more likely in one genre of writing than another, and if women tend to write more in one genre and men tend to write more in another, a man who writes in a female-dominated genre may be predicted by the method to be female. Then the method described above is not actually predicting gender but rather whether an author writes in a genre dominated by a particular sex.
This may partly explain the patterns we see in gender identification.

This potential problem holds for determining particular authors as well. The Time Machine might be attributed to H.G. Wells rather than Leo Tolstoy because the former is a major figure in the science fiction genre, unlike the latter. If we unearthed some unknown work of science fiction by Tolstoy, would the method attribute the text to Tolstoy or to H.G. Wells? This is an important issue that one must bear in mind when seeking to expand the Markov chain method.

With that said, the Markov chain method is a surprisingly good method for determining traits of a text. I used identification of author gender as one expansion of the method, but this is hardly the only direction the method could take. There could be numerous uses for the method beyond what Markov or others have envisioned, with worthwhile real-world applications. They are worth investigating.

References

[1] Khmelev, D.V. Disputed Authorship Resolution through Using Relative Empirical Entropy for Markov Chains of Letters in Human Language Texts. 2000. Journal of Quantitative Linguistics, v. 7, no. 3, pp. 201-207.
[2] Khmelev, D.V. & Tweedie, F.J. Using Markov Chains for Identification of Writers. 2001. Literary and Linguistic Computing, v. 16, no. 3, pp. 299-307.
[3] Markov, A.A. On some applications of statistical method. 1916. Izvestia Akademii Nauk, v. 10, no. 4, p. 239.
[4] Wikipedia. J. K. Rowling. Retrieved April 18th, 2015.
[5] Wikipedia. The Romance of Lust. Retrieved April 18th, 2015.
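As a rough illustration of the procedure described in the Method and Mathematical Background and Sample and Application sections, the following Python sketch strings the steps together: preprocessing, estimating an empirical transition matrix for each group, computing $\Lambda_{x,i}$, and choosing the group with the smallest value. It is not the R script used for this paper. All function names, the labels `class_a` and `class_b`, the toy training strings, and the choice to drop digits and other non-lower-case characters along with punctuation are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def preprocess(raw):
    """Drop words containing a capital letter, strip everything except
    lower-case letters and spaces (assumption: digits go too), collapse
    whitespace, and pad with one space on each side."""
    words = [w for w in raw.split() if not any(c.isupper() for c in w)]
    cleaned = re.sub(r"[^a-z ]", "", " ".join(words))
    return " " + " ".join(cleaned.split()) + " "

def transition_counts(text):
    """Count transitions k -> l between adjacent characters."""
    return Counter(zip(text, text[1:]))

def estimate_matrix(texts):
    """Empirical transition probabilities P[(k, l)] = Q_kl / Q_k,
    with counts pooled over all texts in one group."""
    q_kl = Counter()
    for t in texts:
        q_kl.update(transition_counts(t))
    q_k = Counter()
    for (k, _), n in q_kl.items():
        q_k[k] += n
    return {(k, l): n / q_k[k] for (k, l), n in q_kl.items()}

def neg_log_likelihood(text, p):
    """Lambda: minus the sum of nu_kl * log P_kl, skipping transitions
    whose estimated probability is zero (absent from p)."""
    nu = transition_counts(text)
    return -sum(n * math.log(p[kl]) for kl, n in nu.items() if p.get(kl, 0) > 0)

def classify(text, matrices):
    """Return the label whose matrix gives the smallest Lambda."""
    return min(matrices, key=lambda label: neg_log_likelihood(text, matrices[label]))

# Toy groups (made-up strings, not texts from the study).
matrices = {
    "class_a": estimate_matrix(["abababab"]),
    "class_b": estimate_matrix(["aabbaabb"]),
}
print(preprocess("Hello, the  cat's hat!"))
print(classify("ababab", matrices))
```

Storing each matrix as a dictionary keyed by letter pairs keeps only the transitions that were actually observed, so the restriction to pairs with positive estimated probability reduces to a lookup.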