CONFIDENCE INTERVALS AND THE NEW STATISTICS

Report effect sizes and confidence intervals, then interpret these in the research context. That's what researchers need to do, according to the sixth edition of the APA Publication Manual.

By Geoff Cumming

Suppose a newspaper reports that Support for the President is 57% in a poll with an error margin of 2%. It might also report the information graphically, as in Figure 1.

Figure 1. Support for the President is 57% in a poll with an error margin of 2%.

Most readers understand that 57% is our best estimate of true support for the President, in the population from which the pollster sampled, and that the 2% margin of error (MOE) tells us that the estimate is, most likely, not more than 2% from the true value. In other words, values inside the interval [55, 59], as marked by the error bars in Figure 1, are the most plausible for the true value. Suppose a footnote tells us that the interval is a 95% confidence interval (CI), which is an interval calculated from the data that's likely to include the true population value. The MOE is simply the length of one arm of the CI. My aim is to describe an estimation approach to research, based on CIs. Estimation is an excellent approach to designing research and analyzing data because it's highly informative, and usually gives the best answers to our research questions.

The 57% is a point estimate of the true level of support, and the CI is an interval estimate, which tells us the precision of the point estimate. The pollster addressed the question: "How large is the President's support?" and the point and interval estimates give the most informative answer possible, given the pollster's data.
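As a rough check on the poll arithmetic, the 95% MOE for a proportion is approximately 1.96 standard errors. The poll's sample size is not reported, so the value of n in this sketch is an assumption, chosen to give roughly a 2% MOE:

```python
import math

def moe_for_proportion(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# Assumed sample size: about 2,350 respondents gives roughly a 2% MOE at p = .57.
p, n = 0.57, 2350
moe = moe_for_proportion(p, n)
print(f"Point estimate {p:.0%}, MOE {moe:.1%}, 95% CI [{p - moe:.1%}, {p + moe:.1%}]")
```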

The sixth edition of the APA Publication Manual (APA, 2010) states: "Wherever possible, base discussion and interpretation of results on point and interval estimates" (p. 34). In other words, psychologists should use estimation to present and interpret their research results. It's an extremely important statement, which if widely adopted would lead to substantial improvement in how psychological research is carried out, reported, and understood.

The first step is to note the question asked: "How large is the President's support?" I refer to that as estimation language, because it asks for an amount, or a size. In psychology, research questions expressed in this way might include "To what extent did the new therapy lead to improvement?" or "How strong is the relationship between parental income and the word knowledge of 5-year-olds?"


To any such question the most informative answer will be a point estimate--our best bet for the true value in the population, given the data--and a CI to tell us the precision of that point estimate. To the word knowledge question, for example, the answer may be that the correlation, based on our sample data, is .32, 95% CI [.07, .53]. (That's the format specified by the Publication Manual for reporting a CI.) In other words, our point estimate is a correlation of .32, and the error bars depicting the CI, our interval estimate, would extend from a correlation of .07 to a correlation of .53.

Effect Sizes

I'll say more about CIs in a moment, but first I need to say a little about effect sizes. An effect size (ES) is simply the amount of something we might be interested in. It can be as familiar as a mean, a difference between means, a percentage, a frequency count, or a correlation. We calculate the sample ES (e.g., correlation = .32) from our sample data, and use that as the point estimate of what we'd really like to know: the population ES, which is the correlation in the population. The sample ES of .32 is our best estimate of the population ES, and the CI tells us the population ES is most likely within [.07, .53].

The term 'effect size' is used very broadly. The average blood pressure in a group of seniors, the percentage of students still on the diet after three months, and the regression slope of heart rate against running speed are all perfectly good ESs. The Publication Manual states that:

It is almost always necessary to include some measure of effect size.... Effect sizes may be expressed in the original units (e.g., the mean number of questions answered correctly; kg/month for a regression slope) and are often most easily understood when reported in original units. It can often be valuable to report an effect size ... also in some standardized or units-free unit (e.g., as a Cohen's d value). (APA, 2010, p. 34)

Cohen's d is a number of standard deviation (SD) units, and is a standardized ES that can be very useful for comparing or combining different measures of the same concept. For example, different researchers might measure anxiety using different scales. One researcher might find an average increase in anxiety, in some particular situation, of 5 points on a scale that has an SD of 10 in some relevant population. The increase is just half of one SD, and so Cohen's d is 0.5. Another researcher, using a different scale, might find an increase of 24 points in a similar situation, on a scale that has SD = 40 in a comparable population. Cohen's d would be 24/40 = 0.60. The results are similar, despite the ESs in original units (5 points and 24 points) being very different. Care is needed, especially in choosing a suitable SD unit to use as a standardizer, but Cohen's d and other standardized ESs can be extremely useful. Now, back to CIs.
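A minimal sketch of that arithmetic, using the two hypothetical anxiety studies just described:

```python
def cohens_d(mean_difference, sd):
    """Cohen's d: a mean difference expressed in standard deviation units."""
    return mean_difference / sd

print(cohens_d(5, 10))   # 0.5: a 5-point increase on a scale with SD = 10
print(cohens_d(24, 40))  # 0.6: a 24-point increase on a scale with SD = 40
```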

Confidence Intervals

Suppose you wish to estimate the average level of depression of unemployed adults in a county that suffered a natural disaster a year ago. You administer the Beck Depression Inventory (BDI-II; Beck, Steer, & Brown, 1996) to a random sample of 40 such adults, and calculate the mean score to be 21.3, 95% CI [15.7, 26.9], as shown in Figure 2. Figure 2 also marks ranges of BDI-II scores with labels that help interpretation.

Figure 2. Mean and 95% CI for depression scores in a group of 40 unemployed adults. The horizontal axis shows BDI-II scores from 0 to 60, with ranges labeled Minimal, Mild, Moderate, and Severe.
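The article reports only the sample mean and its CI, not the sample SD; an SD of about 17.5, back-calculated so that the sketch reproduces the reported interval, is used below purely for illustration. The 95% CI for a mean is the mean plus or minus the critical t multiplied by the standard error, with t taken from the t distribution on n - 1 degrees of freedom:

```python
import math
from scipy import stats

def ci_for_mean(mean, sd, n, confidence=0.95):
    """Two-sided CI for a population mean, based on the t distribution."""
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    moe = t_crit * sd / math.sqrt(n)  # margin of error: one arm of the CI
    return mean - moe, mean + moe

# Assumed sample SD of 17.5 (not reported in the article) roughly reproduces [15.7, 26.9].
print(ci_for_mean(21.3, 17.5, 40))
```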


We can interpret that CI as indicating the range of plausible values for the mean BDI-II score in the population of unemployed adults in the county. Values outside the interval are relatively implausible. In other words, the CI tells us where the true population value is most likely to be; it does not, for example, simply reflect the range of values in the sample.

To illustrate a second approach to understanding a CI, imagine carrying out the experiment many times, each time with exactly the same procedure, but with a new random sample of 40 adults. Figure 3 shows the results of a computer simulation of 25 replications of the experiment. Our initial result, as in Figure 2, appears at the bottom, then above are 24 sample means, each with its 95% CI. The simulation assumed that the BDI-II scores in the population are normally distributed, and have a mean of 24 and an SD of 16. The population mean is marked by the vertical line at 24. When the simulation runs, the CIs fall down the screen as new simulated experiments are added at the top. As they fall, they dance from side to side, and vary somewhat in length. Figure 3 is a snapshot of what I call the dance of the CIs.

The '95%' in '95% CI' is the level of confidence, and is defined to be the percentage of CIs that, in an indefinitely long sequence of replications, will include the population mean. Figure 3 shows 23 CIs that include the vertical line that marks the population mean, and two that don't--whose means are marked with large black dots. I've mentioned only 95% CIs, and these should be our choice unless there are strong reasons to use some other level of confidence in a particular situation. A 99% CI, for example, would in the long run include the population mean for 99% of samples. To achieve this higher capture rate, 99% CIs need to be longer than 95% CIs, in fact about 30% longer. On the other hand, 90% CIs are about 15% shorter than 95% CIs.
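The article does not give the code behind Figure 3, so the following is only an illustrative reconstruction of that kind of simulation, under the same assumptions (normal population with mean 24 and SD 16, samples of n = 40). With many replications, the proportion of 95% CIs that capture the population mean approaches .95, the level of confidence:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pop_mean, pop_sd, n, reps = 24, 16, 40, 10_000
t_crit = stats.t.ppf(0.975, df=n - 1)

captures = 0
for _ in range(reps):
    sample = rng.normal(pop_mean, pop_sd, n)
    m, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
    if m - t_crit * se <= pop_mean <= m + t_crit * se:
        captures += 1

print(f"Proportion of 95% CIs capturing the population mean: {captures / reps:.3f}")
```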

Figure 3. Mean and 95% CI for the initial experiment, as in Figure 2, shown at the bottom, and for 24 simulated replications of the experiment. The assumed population mean is marked by the vertical line and the label 'Pop. mean.' Two CIs do not include the population mean; their means are marked with large black dots.

Our first interpretation of a CI was as a range of plausible values for the population mean we're estimating. Our second interpretation is based on Figure 3 and the dance of the CIs. Our CI, which we calculated from our sample data, is a randomly chosen interval from the CIs given by an infinite number of replications, 95% of which include the population mean. Most likely our interval includes the population mean we are estimating, but there is no guarantee. Our CI may be like those marked with the large black dots in Figure 3, which miss the population mean. We'll never know, because we don't know the population mean. Note that a researcher runs only a single experiment, not a large number of replications, and also doesn't know the population mean and SD, which the simulation needs to assume.

A randomized controlled trial. Suppose you ran a randomized controlled trial (RCT) of a particular type of psychotherapy for the depression you found in the troubled county. You randomly allocated adults who presented to your depression clinic, and agreed to participate, to either the Treatment group or a waiting-list Control group. Figure 4 shows the results of your pretest, posttest, and 2-month follow-up scores on the BDI-II. The figure shows an encouraging pattern, but we need to identify the ES, or ESs, that most closely correspond to your research questions. You first wish to ask, using good estimation language: "How great an improvement did the Treatment group show?", then you'll compare that with the Control group. The ES we need is the (Posttest - Pretest) difference. Figure 5 shows that difference, with its 95% CI, for each group. We could use that figure as the basis for interpretation, or could go a further step and find the difference between those two differences, and calculate the CI on that. More complex analyses are possible, and we need to consider also the follow-up data, but Figures 4 and 5 show important aspects of what the RCT can tell us.

Figure 4. Means and 95% CIs for depression scores of Treatment and Control groups, at three testing times (Pretest, Posttest, Follow Up). A reference value, labeled 'Level of clinical concern,' is marked by the horizontal grey line.


Figure 5. Means and 95% CIs for the (Posttest - Pretest) changes in scores, for the data depicted in Figure 4. Horizontal lines mark three reference values, labeled Small, Medium, and Large.
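The data behind Figures 4 and 5 are not tabulated in the article, so the change scores in this sketch are hypothetical, included only to show the calculation: the mean (Posttest - Pretest) change and its CI for each group, then the difference between the two changes with a 95% CI for that difference of differences (here via a pooled two-sample t interval on the change scores):

```python
import numpy as np
from scipy import stats

# Hypothetical (Posttest - Pretest) change scores; negative values mean depression decreased.
treatment_change = np.array([-12, -9, -7, -11, -4, -8, -10, -6, -13, -5])
control_change = np.array([-2, 1, -3, 0, -1, -4, 2, -2, -1, 0])

def mean_ci(x, confidence=0.95):
    """Mean change and its t-based CI for one group."""
    m, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))
    t_crit = stats.t.ppf((1 + confidence) / 2, df=len(x) - 1)
    return m, (m - t_crit * se, m + t_crit * se)

print("Treatment change:", mean_ci(treatment_change))
print("Control change:  ", mean_ci(control_change))

# Difference between the two mean changes, with a pooled-variance two-sample t CI.
n1, n2 = len(treatment_change), len(control_change)
diff = treatment_change.mean() - control_change.mean()
sp2 = ((n1 - 1) * treatment_change.var(ddof=1)
       + (n2 - 1) * control_change.var(ddof=1)) / (n1 + n2 - 2)
se_diff = np.sqrt(sp2 * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(f"Difference of differences: {diff:.1f}, "
      f"95% CI [{diff - t_crit * se_diff:.1f}, {diff + t_crit * se_diff:.1f}]")
```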

Interpreting ESs and CIs

After reporting ESs and CIs, the crucial step is to interpret them, as the Publication Manual's statement requires. This calls for knowledgeable judgment in the research context. Give reasons for your interpretation, and readers can disagree if they wish. Interpretation may include judgments about the theoretical, clinical, or practical importance of the ESs. What are the implications of finding the sizes of effects estimated in the study? Is each small or large, trivial or important? How do they compare with theoretical expectations, and with the ESs found in previous research?

Figures can include reference values, if that helps. Figures 2 and 3, for example, mark ranges of BDI-II scores with the labels assigned in the test manual. Figure 4 marks a single reference value, perhaps chosen by the researcher as appropriate in the particular situation. Figure 5 marks three reference values, which may be the choice of the researcher, or may reflect established clinical custom. The CIs and reference values in Figure 5, for example, may justify interpreting the average change in the Control group's scores as around zero, at most small, and that for the Treatment group as medium-to-large. The wording reflects both the ES estimate and the extent of the CI.

My first example CI was the correlation .32, 95% CI [.07, .53]. Did that CI seem to you very wide? In fact, that's the CI we'd get if our sample size was 60. Unfortunately, CIs given by typical experiments are often disappointingly wide, which may help explain why psychologists have been reluctant to report them. Appreciating that our experiments often give wide CIs should prompt us to try harder to design better experiments. It can also prompt us to reduce uncertainty by combining evidence over studies, which we may be able to achieve by using meta-analysis.
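A sketch of the usual Fisher z method for the CI on a correlation; with r = .32 and n = 60 it reproduces the interval [.07, .53] quoted above, and the second call uses a larger, assumed sample size simply to show how the interval narrows:

```python
import math
from scipy import stats

def correlation_ci(r, n, confidence=0.95):
    """CI for a Pearson correlation via the Fisher z transformation."""
    z = math.atanh(r)                    # transform r to the z scale
    se = 1 / math.sqrt(n - 3)
    z_crit = stats.norm.ppf((1 + confidence) / 2)
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

print(correlation_ci(0.32, 60))   # approximately (.07, .53), as quoted above
print(correlation_ci(0.32, 240))  # a larger (assumed) sample gives a much narrower CI
```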

