CHAPTER 12 MULTIPLE COMPARISONS AMONG TREATMENT MEANS

OBJECTIVES

To extend the analysis of variance by examining ways of making comparisons within a set of means.

CONTENTS

12.1 ERROR RATES
12.2 MULTIPLE COMPARISONS IN A SIMPLE EXPERIMENT ON MORPHINE TOLERANCE
12.3 A PRIORI COMPARISONS
12.4 POST HOC COMPARISONS
12.5 TUKEY'S TEST
12.6 THE RYAN PROCEDURE (REGWQ)
12.7 THE SCHEFFÉ TEST
12.8 DUNNETT'S TEST FOR COMPARING ALL TREATMENTS WITH A CONTROL
12.9 COMPARISON OF DUNNETT'S TEST AND THE BONFERRONI t
12.10 COMPARISON OF THE ALTERNATIVE PROCEDURES
12.11 WHICH TEST?
12.12 COMPUTER SOLUTIONS
12.13 TREND ANALYSIS

A significant $F$ in an analysis of variance is simply an indication that not all the population means are equal. It does not tell us which means are different from which other means. As a result, the overall analysis of variance often raises more questions than it answers. We now face the problem of examining differences among individual means, or sets of means, for the purpose of isolating significant differences or testing specific hypotheses. We want to be able to make statements of the form $\mu_1 = \mu_2 = \mu_3$ and $\mu_4 = \mu_5$, but the first three means are different from the last two, and all of them are different from $\mu_6$.

Many different techniques for making comparisons among means are available; here we will consider the most common and useful ones. A thorough discussion of this topic can be found in Miller (1981), Hochberg and Tamhane (1987), and Toothaker (1991). The papers by Games (1978a, 1978b) are also helpful, as is the paper by Games and Howell (1976) on the treatment of unequal sample sizes.

12.1 ERROR RATES

The major issue in any discussion of multiple-comparison procedures is the question of the probability of Type I errors. Most differences among alternative techniques result from different approaches to the question of how to control these errors.1 The problem is in part technical, but it is really much more a subjective question of how you want to define the error rate and how large you are willing to let the maximum possible error rate be.

We will distinguish two basic ways of specifying error rates, or the probability of Type I errors.2 In doing so, we shall use the terminology that has become more or less standard since an extremely important unpublished paper by Tukey in 1953. (See also Ryan, 1959; O'Neill and Wetherill, 1971.)

1 Some authors choose among tests on the basis of power and are concerned with the probability of finding any or all significant differences among pairs of means (any-pairs power and all-pairs power). In this chapter, however, we will focus on the probability of Type I errors and the way in which different test procedures deal with these error rates.

2 There is a third error rate called the error rate per experiment ($PE$), which is the expected number of Type I errors in a set of comparisons. The error rate per experiment is not a probability, and we typically do not attempt to control it directly. We can easily calculate it, however, as $PE = c\alpha$, where $c$ is the number of comparisons and $\alpha$ is the per comparison error rate.

Error rate per comparison (PC)

We have used the error rate per comparison ($PC$) in the past, and it requires little elaboration. It is the probability of making a Type I error on any given comparison. If, for example, we make a comparison by running a $t$ test between two groups and we reject the null hypothesis because our $t$ exceeds $t_{.05}$, then we are working at a per comparison error rate of .05.
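As a quick illustration (my sketch, not the text's; the group sizes, population values, and random seed are arbitrary choices), the following Python simulation draws many pairs of samples from a single population, so that the null hypothesis is true, and shows that a lone $t$ test at $\alpha = .05$ rejects about 5% of the time:

```python
# A minimal sketch of the per comparison error rate: when H0 is true,
# a single two-sample t test at alpha = .05 rejects about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_per_group, n_experiments = .05, 20, 10_000

rejections = 0
for _ in range(n_experiments):
    # Both samples come from the same population, so H0 is true.
    g1 = rng.normal(loc=50, scale=10, size=n_per_group)
    g2 = rng.normal(loc=50, scale=10, size=n_per_group)
    _, p = stats.ttest_ind(g1, g2)
    rejections += p < alpha

print(f"Empirical per comparison error rate: {rejections / n_experiments:.3f}")
```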

Familywise error rate (FW)

When we have completed running a set of comparisons among our group means, we will arrive at a set (often called a family) of conclusions. For example, the family might consist of the statements

$\mu_1 < \mu_2 \qquad \mu_3 < \mu_4$

$\mu_1 < \dfrac{\mu_3 + \mu_4}{2}$

The probability that this family of conclusions will contain at least one Type I error is called the familywise error rate (FW).3 Many of the procedures we will examine are specifically directed at controlling the FW error rate, and even those procedures that are not intended to control FW are still evaluated with respect to what the level of FW is likely to be.


In an experiment in which only one comparison is made, both error rates will be the same. As the number of comparisons increases, however, the two rates diverge. If we let $\alpha$ represent the error rate for any one comparison and $c$ represent the number of comparisons, then

Error rate per comparison ($PC$): $\alpha$

Familywise error rate ($FW$): $1 - (1 - \alpha)^c$ (if comparisons are independent)

If the comparisons are not independent, the per comparison error rate remains unchanged, but the familywise rate is affected. In most situations, however, $1 - (1 - \alpha)^c$ still represents a reasonable approximation to $FW$. It is worth noting that the limits on $FW$ are $\alpha \le FW \le c\alpha$, and in most reasonable cases $FW$ is in the general vicinity of $c\alpha$. This fact becomes important when we consider the Bonferroni tests.
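To put numbers to these formulas (my example, not the text's), the short Python snippet below evaluates $1 - (1 - \alpha)^c$ and $c\alpha$ at $\alpha = .05$ for several values of $c$:

```python
# Familywise (FW) and per experiment (PE) error rates as the number of
# independent comparisons c grows, with a per comparison rate alpha = .05.
alpha = .05
for c in (1, 2, 5, 10):
    fw = 1 - (1 - alpha) ** c   # FW = 1 - (1 - alpha)^c
    pe = c * alpha              # PE = c * alpha (an expected count, not a probability)
    print(f"c = {c:2d}   FW = {fw:.3f}   PE = {pe:.2f}")
```

At $c = 10$, for example, $FW = .401$ while $c\alpha = .50$: the product $c\alpha$ overstates $FW$, but it is in the right general vicinity.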

The null hypothesis and error rates

We have been speaking as if the null hypothesis in question were what is usually called the complete null hypothesis ($\mu_1 = \mu_2 = \mu_3 = \cdots = \mu_k$). In fact, this is the null hypothesis tested by the overall analysis of variance. In many experiments, however, nobody is seriously interested in the complete null hypothesis; rather, people are concerned about a few more restricted null hypotheses, such as ($\mu_1 = \mu_2 = \mu_3$, $\mu_4 = \mu_5$, $\mu_6 = \mu_7$), with differences among the various subsets. If this is the case, the problem becomes more complex, and it is not always possible to specify $FW$ without knowing the pattern of population means. We will need to take this into account in designating the error rates for the different tests we shall discuss.

3 This error rate is frequently referred to, especially in older sources, as the experimentwise error rate. However, Tukey's term familywise has become more common. In more complex analyses of variance, the experiment may often be thought of as comprising several different families of comparisons.

A priori versus post hoc comparisons

It is often helpful to distinguish between a priori comparisons, which are chosen before the data are collected, and post hoc comparisons, which are planned after the experimenter has collected the data, looked at the means, and noted which of the latter are far apart and which are close together. To take a simple example, consider a situation in which you have five means. In this case, there are 10 possible comparisons involving pairs of means (e.g., $\bar{X}_1$ versus $\bar{X}_2$, $\bar{X}_1$ versus $\bar{X}_3$, and so on). Assume that the complete null hypothesis is true but that by chance two of the means are far enough apart to lead us erroneously to reject $H_0\colon \mu_i = \mu_j$. In other words, the data contain one Type I error. If you have to plan your single comparison in advance, you have a probability of .10 of hitting on the 1 comparison out of 10 that will involve a Type I error. If you look at the data first, however, you are certain to make a Type I error, assuming that you are not so dim that you test anything other than the largest difference. In this case, you are implicitly making all 10 comparisons in your head, even though you perform the arithmetic for only the largest one. In fact, for some post hoc tests, we will adjust the error rate as if you literally made all 10 comparisons.
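This inflation is easy to verify by simulation. In the sketch below (my illustration, not the text's; the sample sizes, population values, and seed are arbitrary), all five population means are equal, yet testing only the pair of sample means that happens to lie farthest apart rejects far more than 5% of the time, because that choice implicitly searches all 10 pairs:

```python
# A minimal sketch of the post hoc problem: with five equal population
# means, testing only the largest of the 10 pairwise differences at
# alpha = .05 inflates the Type I error rate well above .05.
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(2)
k, n, n_experiments, alpha = 5, 20, 5_000, .05

rejections = 0
for _ in range(n_experiments):
    # All k groups come from the same population, so the complete H0 is true.
    groups = [rng.normal(50, 10, n) for _ in range(k)]
    # "Looking at the data first": pick the pair of means farthest apart.
    means = [g.mean() for g in groups]
    i, j = max(combinations(range(k), 2),
               key=lambda pair: abs(means[pair[0]] - means[pair[1]]))
    _, p = stats.ttest_ind(groups[i], groups[j])
    rejections += p < alpha

print(f"Type I error rate when testing the largest difference: "
      f"{rejections / n_experiments:.3f}")  # far above the nominal .05
```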
