Significance Testing of Word Frequencies in Corpora

Author: Jefrey Lijffijt
Affiliation: Aalto University
Current affiliation: University of Bristol
Mail: University of Bristol, Department of Engineering Mathematics, MVB Woodland Road, Bristol, BS8 1UB, United Kingdom
E-mail: jefrey.lijffijt@bristol.ac.uk

Author: Terttu Nevalainen
Affiliation: University of Helsinki

Author: Tanja Säily
Affiliation: University of Helsinki

Author: Panagiotis Papapetrou
Primary affiliation for this manuscript: Aalto University
Current affiliation: Stockholm University

Author: Kai Puolamäki
Primary affiliation for this manuscript: Aalto University
Current affiliation: Finnish Institute of Occupational Health

Author: Heikki Mannila
Affiliation: Aalto University

Abstract

Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (2005), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (2001) and Paquot & Bestgen (2009), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.

1. Introduction

Comparison of word frequencies is among the core methods in corpus linguistics and is frequently employed as a tool for different tasks, including generating hypotheses and identifying a basis for further analysis. In this study, we focus on assessing the statistical significance of differences in word frequencies between corpora. Our goal is to answer questions such as `Is word X more frequent in male conversation than in female conversation?' or `Has word X become more frequent over time?'.

Statistical significance testing is based on computing a p-value, which indicates the probability of observing a test statistic that is equal to or greater than the test statistic of the observed data, under the assumption that the data follow the null hypothesis. If the p-value is small (i.e. below a given threshold α), then we reject the null hypothesis. In the case of comparing the frequencies of a given word in two corpora, the test statistic is the difference between these frequencies and, put simply, the null hypothesis is that the frequencies are equal.

However, to employ a test, the data have to be represented in a certain format, and by choosing a representation we make additional assumptions. For example, to employ the χ² test, we represent the data in a 2×2 table, as illustrated in Table 1. We refer to this representation as the bag-of-words model. This representation does not include any information on the distribution of the word X in the corpora. When using this representation and the χ² test, we implicitly assume that all words in a corpus are statistically independent samples. The reliance on this assumption when computing the statistical significance of differences in word frequencies has been challenged previously; see, for example, Evert (2005) and Kilgarriff (2005).

Table 1 The 2×2 table that is used when employing the χ² test

              Corpus S   Corpus T
Word X        A          B
Not word X    C          D
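To make the bag-of-words representation concrete, the following sketch computes the χ² test directly from the four cells of Table 1, using only the Python standard library. The counts are hypothetical, chosen purely for illustration; for a 2×2 table with one degree of freedom, the p-value reduces to a complementary error function.

```python
from math import erfc, sqrt

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) on the 2x2
    table of Table 1: a, b are the occurrences of word X in corpora S
    and T; c, d are the counts of all other tokens."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: word X occurs 120 times in 100,000 tokens of
# corpus S and 90 times in 100,000 tokens of corpus T.
chi2, p = chi_squared_2x2(120, 90, 99880, 99910)  # chi2 ≈ 4.29, p ≈ 0.04
```

Note that the test sees only the four aggregate counts: two corpora in which word X is evenly dispersed and two in which it is concentrated in a handful of texts yield exactly the same p-value, which is the limitation discussed above.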

Hypothesis testing as a research framework in corpus linguistics has been debated but remains, in our view, a valuable tool for linguists. A general account of how to employ hypothesis testing or keyword analysis for comparing corpora can be found in Rayson (2008). We observe that the discussion regarding the usefulness of hypothesis testing in the field of linguistics has often been conflated with discussions pertaining to the assumptions made when employing a certain representation and statistical test. Kilgarriff (2005) asserts that the `null hypothesis will never be true' for word frequencies. As a response, Gries (2005) argues that the problems posed by Kilgarriff can be alleviated by looking at (measures of) effect sizes and confidence intervals, and by using methods from exploratory data analysis. Our main point is different from that of Gries (2005). While we endorse Kilgarriff's conclusion that the assumption that all words are statistically independent is inappropriate, the lack of validity of one assumption does not imply that there are no comparable representations and tests based on credible assumptions.

As pointed out in Kilgarriff (2001) and Paquot & Bestgen (2009), it is possible to represent the data differently and employ other tests, such as the t-test or the Wilcoxon rank-sum test, such that we assume independence at the level of texts rather than individual words. An alternative approach to the 2×2 table presented above is to count the number of occurrences of a word per text, and then compare a list of (normalized) counts from one corpus against a list of counts from another corpus. An illustration of this representation is given in Table 2. This approach has the advantage that we can account for the distribution of the word within the corpus.

Table 2 The frequency lists that are used when employing the t-test. The lists do not have to be of equal length, as the corpora may contain an unequal number of texts.

Corpus S                          Text S1   Text S2   ...   Text S|S|
Normalized frequency of word X    S1        S2        ...   S|S|

Corpus T                          Text T1   Text T2   ...   Text T|T|
Normalized frequency of word X    T1        T2        ...   T|T|

We emphasize that the utility of hypothesis testing critically depends on the credibility of the assumptions that underlie the statistics. We share Kilgarriff's (2005) concern that application of the χ² test leads to finding spurious results, and we agree with Kilgarriff (2001) and Paquot and Bestgen (2009) that there are more appropriate alternatives, which, however, have not been implemented in current corpus-linguistic tools. We re-examine the alternatives and provide new insights by analysing the differences between six statistical tests in a controlled resampling setting, as well as in a practical setting.
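One of the alternatives compared here, the bootstrap test, also operates on the per-text frequency lists of Table 2. The following is a generic sketch of such a test, resampling texts (not words) with replacement; it is a minimal illustration of the idea, not necessarily the exact procedure evaluated in this article, and the example frequencies are hypothetical.

```python
import random
from statistics import mean

def bootstrap_test(freqs_s, freqs_t, iterations=9999, seed=0):
    """One-sided bootstrap test on per-text normalized frequencies.
    H1: word X is more frequent in corpus S than in corpus T.
    Resampling whole texts preserves the dispersion of the word
    across texts, unlike word-level resampling."""
    rng = random.Random(seed)
    at_or_below_zero = 0
    for _ in range(iterations):
        # Resample each corpus with replacement, keeping its size.
        s = [rng.choice(freqs_s) for _ in freqs_s]
        t = [rng.choice(freqs_t) for _ in freqs_t]
        if mean(s) - mean(t) <= 0:
            at_or_below_zero += 1
    # +1 smoothing so the estimated p-value is never exactly zero.
    return (at_or_below_zero + 1) / (iterations + 1)

# Hypothetical per-text frequencies (per 1,000 words):
p = bootstrap_test([1.2, 0.8, 1.5, 0.0, 2.1], [0.4, 0.6, 0.0, 0.3])
```

The returned p-value is the smoothed fraction of resamples in which corpus S does not exceed corpus T, so a word concentrated in a single text of S produces highly variable resampled means and, accordingly, a more conservative result.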

The question of which method is most appropriate for assessing the significance of word frequencies or other statistics is not new. Dunning (1993) and Rayson and Garside (2000) suggest that a log-likelihood ratio test is preferable to a χ² test because the latter is inaccurate when the expected values are small (< 5). Rayson et al. (2004) propose using the χ² test with a modified version of Cochran's rule. Kilgarriff (2001) concludes
