Using Synonyms for Author Recognition

Abstract. An approach for identifying authors using synonym sets is presented. Drawing on modern psycholinguistic research, we justify the basis of our theory. Having formally defined the operations needed to algorithmically determine authorship, we present the results of applying our method to a corpus of classic literature. We argue that this technique of author recognition is both accurate as an author identification tool and applicable to other domains in computer science such as speaker recognition.

1 Introduction and Motivation

Current research in the area of Stylometry focuses on detecting idiosyncrasies in written literature in order to identify the author. We present a novel approach for identifying authors in written text that is also applicable to identifying speakers from automatically transcribed discourse.

In this paper, we argue that the lexical choices an author or speaker makes when many synonyms are available are idiosyncratic to the point of providing identification. Modern psycholinguistic evidence indicates that this is a well-founded approach. Though previous methods of author attribution have focused on linguistic elements peculiar to written text, the use of synonyms does not require such elements as punctuation to be effective, and so it may be applied to automatically transcribed speech as well.

Such a flexible approach is ideal for situations such as a Smart Home environment, where requests made in natural language could originate from sources ranging from a personal computer to a mounted microphone. Traditional methods might require the use of both a vocal analyzer and a text-specific analyzer to identify the source. When using synonyms, no such dependence on disparate technologies is incurred.

The ability to discern the source of a document or statement also has value in the area of knowledge acquisition. One goal of knowledge acquisition is to gain the most accurate knowledge possible. However, not all sources of information are equally reliable, and so such a system could be designed not to trust them equally. The first step toward this type of learning is the ability to correctly identify the source of a piece of knowledge (i.e. a document). The presented method of author recognition aims to be adaptable to all of these applications.

2 Related Work

Author attribution is a well-studied area of artificial intelligence. Formal methods for determining authorship have even older roots. The field of Stylometry was thriving well before the turn of the twentieth century, with several documented methods of analyzing texts to settle disputed authorship. Linguistic idiosyncrasies that have been identified as characteristic of an author include everything from keyword counts to punctuation patterns.

2.1 Related Efforts in Stylometry

The classification of documents by content (as opposed to by author) has been one area in which similar techniques have been employed. Keywords have been shown by many studies to be very effective in this area. This idea has been taken further by those such as Paek [10], who used keywords in image descriptions to classify the content of the images.

In the field of Stylometry, several algorithmic approaches have been applied to author attribution. For instance, Brinegar [3] and Glover and Hirst [5] used distributions of word length statistics to determine authorship, while others including Morton [9] used sentence length for identification. The number of distinct words in a set of documents was studied by Holmes [7] and Thisted [14].

2.2 Psycholinguistic Foundations

In the development of this method, consideration was given to not only the empirical correctness of its results, but also to the theoretical foundations on which it is built. Current psycholinguistic research suggests that synonyms are a part of language that is affected by one's environment.

Developed by the Cognitive Science Laboratory at Princeton University, the WordNet lexical database is itself rooted in modern psycholinguistic theory. [13] It is organized based largely upon how humans are believed to understand words. It is no coincidence that synonym sets are at the heart of WordNet.

In Holland's 1991 study, when subjects were asked to think of as many synonyms as possible for a set of words, a small but significant priming effect was found. [6] That is to say, subjects were more likely to produce as synonyms words that they had previously seen. This suggests that the synonyms produced by an individual are a product of experience, something that is unique to each individual.

Let us digress a moment and further pursue the concept of priming. Associative priming, the process by which one concept leads to another (e.g. a dog might cause one to think about a bone), is a result of spreading activation. [1] The way in which activation potentials spread through a neural network is dependent upon what connections have been made and how often they have been asserted. According to Hebbian brain theory, connections that fire together are more likely to fire together in the future. Thus, from the perspective of cognitive psychology, it makes sense that though many synonyms may be activated by a concept, the word at the forefront of one's mind will be based on experience. The uniqueness of this experience and the corresponding uniqueness of synonym choice may be exploited to determine authorship.

3 Theory

3.1 Definitions

As a simple formalization of this theory, let us begin by defining a set of authors A that have been encountered by our system before any identification is processed. We then define Λ_i as the lexicon corresponding to author i, so that |Λ_i| is that author's vocabulary richness. Next, consider two functions which may be applied to a word w in Λ_i: occ(w) and syn(w), where occ(w) is the number of times word w was encountered and syn(w) is the number of synonyms of w.

Consider a threshold τ, which is the minimum number of synonyms that a word must have in order to be considered an idiosyncratic identifier of an author. Note that it is the use of a sufficiently large value of τ that provides a reasonable running time for our algorithm. Now we define the filtered lexicon Φ_i as author i's lexicon restricted to words with sufficiently large synonym sets, such that Φ_i ⊆ Λ_i where each word w_ij in Φ_i has syn(w_ij) ≥ τ.
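To make the filtering step concrete, here is a minimal Python sketch, assuming a `syn` lookup function (a hypothetical stand-in for a WordNet query) and a token stream for the author; the toy synonym counts below are invented for the example.

```python
from collections import Counter

def filtered_lexicon(tokens, syn, tau):
    """Build the filtered lexicon: words with syn(w) >= tau, mapped to occ(w)."""
    occ = Counter(tokens)  # occ(w): number of times word w was encountered
    return {w: n for w, n in occ.items() if syn(w) >= tau}

# Toy example: "the" has too few synonyms to be idiosyncratic, so it is cut.
syn_counts = {"big": 12, "the": 0, "happy": 20, "run": 30}
tokens = ["big", "the", "big", "happy", "run", "the", "the"]
phi = filtered_lexicon(tokens, lambda w: syn_counts.get(w, 0), tau=10)
# phi == {"big": 2, "happy": 1, "run": 1}
```

Note that high-frequency function words tend to have few synonyms and so fall out of Φ automatically, which is precisely what the threshold is meant to achieve.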

Having considered the task of learning each author's style, we can examine the case of identifying the work of an unknown author u, where u ∉ A. The heart of the algorithm is the intersection of the filtered lexicon of the unknown author with that of each known author:

    Ψ_i = Φ_i ∩ Φ_u.    (1)

Here, there are some special considerations to be made as to exactly how the intersection of these two lexicons is to be calculated. Since we have also associated a number of occurrences occ(w) with each element of Φ, we need to determine how to evaluate this function over an intersection. For our purposes, let the number of occurrences of a word w used by both authors i and j be

    occ_{i∩j}(w) = min( occ_i(w), occ_j(w) ).    (2)

Finally, we calculate the match factor for each author:

    μ(i,u) = Σ_{j=1..|Ψ_i|} occ(w_ij) · syn(w_ij)    (3)

The hypothesized author i of the text corresponds to the maximum value of the match factor, where n = |A|, such that

    μ(i,u) = max( μ(1,u), μ(2,u), ..., μ(n,u) )    (4)
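Equations (1) through (4) can be sketched in a few lines of Python; the author names and counts below are invented toy data, and `syn` is again assumed to be a synonym-count lookup.

```python
def match_factor(phi_i, phi_u, syn):
    """Eq. (3): sum of occ * syn over the intersection of the two filtered
    lexicons, with occ taken as the minimum of the two counts (Eq. 2)."""
    return sum(min(phi_i[w], phi_u[w]) * syn(w)
               for w in phi_i.keys() & phi_u.keys())  # Eq. (1): the intersection

def identify(phis, phi_u, syn):
    """Eq. (4): the hypothesized author maximizes the match factor."""
    return max(phis, key=lambda a: match_factor(phis[a], phi_u, syn))

# Toy data: author "A" shares more heavily-synonymed words with u than "B".
syn_counts = {"mirth": 14, "gloom": 11, "visage": 18}
phis = {"A": {"mirth": 5, "gloom": 2}, "B": {"visage": 4, "mirth": 1}}
phi_u = {"mirth": 3, "gloom": 1}
best = identify(phis, phi_u, syn_counts.get)
# best == "A"  (match factor 53 vs. 14)
```

The weighting by syn(w) rewards matches on words where the author had many alternatives, which is where, by the theory above, the choice is most idiosyncratic.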

3.2 Tractability

Though upon first glance calculating Φ_i ∩ Φ_u for every author in A may spark complexity concerns, we have already addressed the problem of tractability by means of τ. Recall that Φ is a subset of Λ and, as we will see, can be a very small subset indeed if the threshold τ is set high enough. Even though it seems we are cutting many possible matches, our theory holds that for a reasonably small τ we are cutting only the less important words, since the author was constrained by the lexicon of the language when choosing a word with syn(w) < τ, due to a lack of synonyms.

3.3 Parallelization

By means of the independence of the various stages of the training and, later, the discernment process, this algorithm is an excellent candidate for parallelization (see Fig. 1). Each author and, in fact, each document that is added to the training set can be evaluated separately, leaving the merging of the sets to a central dispatcher. Since the string operations necessary to calculate the statistics are far more processor-intensive than calculating the union of said sets, a very high rate of speedup is possible.

However, true to Amdahl's law, the stage of identifying the work of an unknown author is dependent on having all training data prepared. Still, each intersection Φ_i ∩ Φ_u and the subsequent calculation of μ(i,u) are independent of one another, giving yet another opportunity for parallelization. In the end, it is reasonable to distribute the calculations associated with each individual author to a separate parallel process.

Fig. 1. An illustration of the major parts of the algorithm that may be run in parallel
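The fan-out just described can be sketched with Python's `concurrent.futures`. A thread pool is used here only to show the structure of the computation; CPU-bound scoring would use a process pool to obtain real speedup. All names and data below are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def match_factor(phi_i, phi_u, syn):
    # Score one author: sum of min-occurrence * synonym count over the
    # intersection of filtered lexicons.
    return sum(min(phi_i[w], phi_u[w]) * syn(w)
               for w in phi_i.keys() & phi_u.keys())

def identify_parallel(phis, phi_u, syn):
    """Score each known author in a separate worker; only the final
    arg-max reduction is serial, per Amdahl's law."""
    authors = list(phis)
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(
            lambda a: match_factor(phis[a], phi_u, syn), authors))
    return authors[scores.index(max(scores))]

syn_table = {"mirth": 14, "visage": 18}
phis = {"A": {"mirth": 5}, "B": {"visage": 4}}
winner = identify_parallel(phis, {"mirth": 3}, syn_table.get)
# winner == "A"
```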

4 Implementation

4.1 WordNet

As our method requires the ability to determine the number of synonyms for a word, we chose to use WordNet1 to accomplish this task. In development since 1985, WordNet is now one of the foremost lexical resources in computational linguistics. With over 118,000 word forms, it encompasses a substantial portion of the English language. [13]

In WordNet, each word is linked to one or more senses. These senses, in turn, can reside in synonym sets. Thus, we can find if a word shares at least one of its senses in common with another word making it, by definition, a synonym. Furthermore, this gives us the number of synonyms for a word by summing the number of unique members of a word's synonym sets.
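The counting rule just described can be illustrated directly. The synonym sets for "bright" below are invented for the example, not taken from WordNet.

```python
def syn_count(word, synsets):
    """syn(w): the number of unique members across all of a word's
    synonym sets, excluding the word itself."""
    members = set().union(*synsets) if synsets else set()
    members.discard(word)
    return len(members)

bright_synsets = [
    {"bright", "brilliant", "vivid"},  # one sense: giving off light
    {"bright", "smart", "brilliant"},  # another sense: intelligent
]
# syn_count("bright", bright_synsets) == 3  ("brilliant", "vivid", "smart")
```

Taking the union rather than summing set sizes avoids double-counting a synonym ("brilliant" here) that appears under more than one sense.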

Though unimportant from a theoretical standpoint, it should be noted that the actual version of WordNet used in this research was not the traditional `C' library, but rather a normalized database format for PostgreSQL.2 This allowed the number of synonyms for a word to be determined in the execution of a single SQL statement.

4.2 Corpus

The texts used to train the system consisted of works of classic literature by Charles Dickens and William Shakespeare. As these texts are in the public domain, they are freely available for review by the curious reader.3

Table 1. Texts included in our experimental corpus. The corpus comprised 286,898 words in total

Set                 Total Words   Included Texts
Dickens Train       65,157        Battle of Life, Chimes, To be Read at Dusk
Dickens Test        80,832        Cricket on the Hearth, Three Ghost Stories, A Christmas Carol
Shakespeare Train   73,863        Comedy of Errors, Hamlet, Romeo and Juliet
Shakespeare Test    67,046        Julius Caesar, Henry V, Macbeth

1 WordNet 2.1 is available for download from the Princeton Cognitive Science Laboratory at .

2 WordNet SQL Builder 1.2.0 for WordNet 2.1, the application used to generate a PostgreSQL database of WordNet, is available for download at .

3 The texts listed here are available from Project Gutenberg at .

The selected works of Dickens include the so-called "Christmas Stories" (A Christmas Carol, The Chimes, The Cricket on the Hearth, and The Battle of Life), all written within the same decade. The other Dickens texts are the short stories "To be Read at Dusk" and "Three Ghost Stories." The writings of Shakespeare that were analyzed are some of his more famous comedies and tragedies: The Comedy of Errors, Hamlet, Romeo and Juliet, Julius Caesar, Henry V, and Macbeth. These works of Shakespeare span a 16-year period of his writing career.

In total, the corpus contained 286,898 words. The works of Shakespeare contributed 140,909 of the words while Dickens' text constituted 145,989 words of the total. Each author used a relatively large vocabulary, though still a very small portion of the English language as a whole. Shakespeare's lexicon had over 13,000 words while Dickens' had in excess of 12,000.

The corpus was divided into four sections. First, the texts were grouped by author. Second, each author's texts were divided roughly in half, resulting in groups of about 72,000 ± 9% words. These sets were designated "train" and "test," where the "train" set was used in acquiring Φ, the characteristic lexicon of an author; a match factor μ(i,u) was then calculated for each train-test pair.

5 Results

5.1 Author Identification Results

For our test data, the results showed that there is a well-defined difference between the match factor of a correct pair and that of an incorrect pair (see Fig. 2 and Fig. 3). This trend is consistent through all tested values of the threshold τ. In the cases of both Dickens and Shakespeare, each training set was matched with its corresponding test set.

Moving beyond the simple fact that the system managed to produce the correct answers, let us take note of the margin by which the correct answer was ranked above the others. The strongest case in the set was that of the Shakespeare training set versus the two test sets. With the difference between the matching and non-matching set being roughly 17%, there is little question that this test run was a success.

In the case of the Dickens training set run against the test sets, the answer is less confident with a 10% difference between the matching and non-matching values. However, upon further consideration, it makes sense that the authors should have a good deal of their vocabulary in common. Though the unique qualities of an author's style may be subtle, we still have the ability to detect these subtleties and exploit them to determine authorship.

[Chart: match factor (0 to 250,000) vs. threshold value τ (15 to 30) for each of the four train-test pairs]

Fig. 2. The match factor μ(i,u) correctly correlates each test set with the training set written by the same author. Thus, given a set known to be by a certain author, the system can discern one author from another

[Chart: percent difference in match factors (0% to 25%) vs. threshold value τ (15 to 30), for the Shakespeare training set vs. the two test sets and the Dickens training set vs. the two test sets]

Fig. 3. By applying sufficiently large values of the synonym threshold τ, the number of synonym matches, and therefore the computational overhead, is greatly reduced

[Chart: number of synonym matches (0 to 2,000) vs. threshold value τ (15 to 30) for each of the four train-test pairs]

Fig. 4. By applying sufficiently large values of the synonym threshold τ, the number of synonym matches, and therefore the computational overhead, is greatly reduced. Due to our particular interest in words with large numbers of synonyms, cutting words with small synonym sets does not negatively affect the algorithm's ability to distinguish authors from one another

[Chart: reduction in synonym matches (75% to 100%) vs. threshold value τ (15 to 30) for each of the four train-test pairs]

Fig. 5. Even moderate values of τ yield in excess of a 90% reduction in the synonym matches between authors
