
AP® Statistics

The Satterthwaite Formula for Degrees of Freedom in the Two-Sample t-Test

Michael Allwood, Brunswick School, Greenwich, Connecticut


© 2008 The College Board. All rights reserved. College Board, AP Central, APCD, Advanced Placement Program, AP, AP Vertical Teams, CollegeEd, Pre-AP, SAT, and the acorn logo are registered trademarks of the College Board. Admitted Class Evaluation Service, Connect to college success, MyRoad, SAT Professional Development, SAT Readiness Program, Setting the Cornerstones, SpringBoard, and The Official SAT Teacher's Guide are trademarks owned by the College Board. PSAT/NMSQT is a registered trademark of the College Board and National Merit Scholarship Corporation. All other products and services may be trademarks of their respective owners. Permission to use copyrighted College Board materials may be requested online at: inquiry/cbpermit.html.

I. Introduction

What's the most complicated formula we encounter in AP Statistics? To me it's undoubtedly the formula for degrees of freedom in the two-sample t-test (the version of the test where we do not assume equal population variances):

\[
df = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{1}{n_1 - 1} \left( \dfrac{s_1^2}{n_1} \right)^2 + \dfrac{1}{n_2 - 1} \left( \dfrac{s_2^2}{n_2} \right)^2}.
\]

Admittedly, we don't have to tell our students this formula. We can tell them to use the number of degrees of freedom given by the calculator (which is in fact the result of this formula), or we can tell them to resort to the "conservative" method of using the smaller of $n_1 - 1$ and $n_2 - 1$.
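For readers who like to see the formula in action, here is a short illustrative sketch in Python (my own addition, not anything from the article or a calculator's code); the sample statistics in it are invented purely for the example.

```python
# Illustrative sketch: evaluating Satterthwaite's degrees-of-freedom formula
# for two hypothetical samples and comparing with the "conservative" choice.
def satterthwaite_df(s1, n1, s2, n2):
    """Degrees of freedom for the two-sample t-test (unequal variances assumed)."""
    v1 = s1**2 / n1                      # s1^2 / n1
    v2 = s2**2 / n2                      # s2^2 / n2
    numerator = (v1 + v2)**2
    denominator = v1**2 / (n1 - 1) + v2**2 / (n2 - 1)
    return numerator / denominator

# Hypothetical sample statistics, chosen only for illustration.
s1, n1 = 4.3, 12
s2, n2 = 6.1, 15

print("Satterthwaite df:", round(satterthwaite_df(s1, n1, s2, n2), 2))
print("Conservative df:", min(n1 - 1, n2 - 1))
```

The Satterthwaite value always lies between the conservative choice and $n_1 + n_2 - 2$, which is one way to see why the conservative method earns its name.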

Nonetheless, I've been intrigued over the years by this array of symbols and have been eager to know where it comes from.

The formula was developed by the statistician Franklin E. Satterthwaite and a derivation of the result is given in Satterthwaite's article in Psychometrika (vol. 6, no. 5, October 1941). My aim here is to translate Satterthwaite's work into terms that are easily understood by AP Statistics teachers. The mathematics involved might seem a little daunting at first, but apart perhaps from one or two steps in section V, no stage in the argument is beyond the concepts in AP Statistics. (Section V concerns two standard results connected with the chi-square distributions. These results can easily be accepted and their proofs omitted on the first reading.) It is also worth noting that section IV, concerning the test statistic in the one-sample t-test, is only included by way of an introduction to the work on Satterthwaite's formula. So this section, too, can be omitted by the reader who wants the quickest route to Satterthwaite's result.

II. A Definition of the Chi-Square Distributions

Let $Z_1, Z_2, \ldots, Z_n$ be independent random variables, each with distribution $N(0,1)$.

The $\chi^2$ (chi-square) distribution with $n$ degrees of freedom can be defined by

\[
\chi^2_n = Z_1^2 + Z_2^2 + \cdots + Z_n^2. \qquad (1)
\]
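To make definition (1) concrete, the following simulation sketch (an illustration of my own, using numpy) sums the squares of $n$ independent $N(0,1)$ values many times; the simulated mean and variance come out close to $n$ and $2n$, the values whose derivation is taken up in section V.

```python
# Illustration of definition (1): a chi-square value with n degrees of freedom
# is a sum of n squared independent standard normal values.
import numpy as np

rng = np.random.default_rng(1)
n = 5                                    # degrees of freedom
reps = 100_000                           # number of simulated chi-square values

z = rng.standard_normal((reps, n))       # reps rows of n independent N(0,1) values
chi2_values = (z**2).sum(axis=1)         # each row sum is one chi-square(n) value

print("simulated mean:", chi2_values.mean())       # close to n
print("simulated variance:", chi2_values.var())    # close to 2n
```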

III. A Definition of the t-Distributions

Let's suppose that $X$ has distribution $N(\mu, \sigma)$ and that $X_1, \ldots, X_n$ is a random sample of values of $X$. As usual, we denote the mean and the standard deviation of the sample by $\bar{X}$ and $s$, respectively. In 1908, W. S. Gosset, a statistician working for Guinness in Dublin, Ireland, set about determining the distribution of

\[
\frac{\bar{X} - \mu}{s/\sqrt{n}},
\]


and it is this distribution that we refer to as the "t-distribution." Actually, we should refer to the "t-distributions" (plural), since the distribution of that statistic varies according to the value of n.

However, we define the t-distributions in the following way: Suppose that $Z$ is a random variable whose distribution is $N(0,1)$, that $V$ is a random variable whose distribution is $\chi^2$ with $n$ degrees of freedom, and that $Z$ and $V$ are independent. Then the t-distribution with $n$ degrees of freedom is given by

\[
t_n = \frac{Z}{\sqrt{V/n}}. \qquad (2)
\]

Our task in the next section is to confirm that Gosset's t-statistic, $t = (\bar{X} - \mu)\big/(s/\sqrt{n})$, does, in fact, have a t-distribution.
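As an illustration before we begin (my own sketch, not part of the article's argument), one can simulate both routes to the same distribution: Gosset's statistic computed from normal samples of size $n$, and the ratio in definition (2) built from an $N(0,1)$ variable and an independent chi-square variable with $n - 1$ degrees of freedom. The parameter values below are invented for the example.

```python
# Two routes to the same t-distribution.
# Route 1: Gosset's statistic (Xbar - mu)/(s/sqrt(n)) from normal samples of size n.
# Route 2: Z / sqrt(V/(n-1)) with Z ~ N(0,1) and V ~ chi-square(n-1), per definition (2).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 10.0, 3.0, 8, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)              # sample standard deviations
t_gosset = (xbar - mu) / (s / np.sqrt(n))

z = rng.standard_normal(reps)
v = rng.chisquare(n - 1, reps)               # chi-square with n - 1 degrees of freedom
t_def = z / np.sqrt(v / (n - 1))

# The two simulated distributions should have very similar percentiles.
for p in (5, 25, 50, 75, 95):
    print(p, round(np.percentile(t_gosset, p), 3), round(np.percentile(t_def, p), 3))
```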

IV. A Demonstration That $(\bar{X} - \mu)\big/(s/\sqrt{n})$ Has Distribution $t_{n-1}$

First,

\[
\frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{(\bar{X} - \mu)\big/(\sigma/\sqrt{n})}{\sqrt{s^2/\sigma^2}} = \frac{(\bar{X} - \mu)\big/(\sigma/\sqrt{n})}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}}.
\]

Now we know that the distribution of $(\bar{X} - \mu)\big/(\sigma/\sqrt{n})$ is $N(0,1)$,

so according to the definition (2) of the t-distribution, we now need to show that $(n-1)s^2/\sigma^2$ is $\chi^2$ distributed with $n-1$ degrees of freedom and that $(\bar{X} - \mu)\big/(\sigma/\sqrt{n})$ and $(n-1)s^2/\sigma^2$ are independent. This second fact is equivalent to the independence of $\bar{X}$ and $s$ when sampling from a normal distribution, and its proof is too complex for us to attempt here.1

To show that $(n-1)s^2/\sigma^2$ is $\chi^2_{n-1}$, we start by observing that

\[
\frac{(n-1)s^2}{\sigma^2} = \frac{n-1}{\sigma^2} \cdot \frac{\sum (X_i - \bar{X})^2}{n-1} = \frac{\sum (X_i - \bar{X})^2}{\sigma^2}.
\]

We first replace the sample mean $\bar{X}$ with the population mean $\mu$ and turn our attention to

\[
\frac{\sum (X_i - \mu)^2}{\sigma^2} = \sum \left( \frac{X_i - \mu}{\sigma} \right)^2.
\]


Since each $X_i$ is independently $N(\mu, \sigma)$, each $(X_i - \mu)/\sigma$ is independently $N(0,1)$. So $\sum \bigl( (X_i - \mu)/\sigma \bigr)^2$ is the sum of the squares of $n$ independent $N(0,1)$ random variables, and therefore, according to the definition (1) of the $\chi^2$ distributions, it is $\chi^2$ distributed with $n$ degrees of freedom.

Now,

\[
\begin{aligned}
\sum (X_i - \mu)^2 &= \sum \bigl[ (X_i - \bar{X}) + (\bar{X} - \mu) \bigr]^2 \\
&= \sum \bigl[ (X_i - \bar{X})^2 + 2 (X_i - \bar{X})(\bar{X} - \mu) + (\bar{X} - \mu)^2 \bigr] \\
&= \sum (X_i - \bar{X})^2 + 2 (\bar{X} - \mu) \sum (X_i - \bar{X}) + n (\bar{X} - \mu)^2.
\end{aligned}
\]

But

\[
\sum (X_i - \bar{X}) = \sum X_i - n\bar{X} = \sum X_i - n \cdot \frac{\sum X_i}{n} = 0,
\]

so

\[
\sum (X_i - \mu)^2 = \sum (X_i - \bar{X})^2 + n (\bar{X} - \mu)^2. \qquad (3)
\]
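Since identity (3) is purely algebraic, it can be checked on any list of numbers; the short sketch below (my own illustration, with arbitrary made-up data) does exactly that before we continue with the derivation.

```python
# Numerical check of identity (3): the sum of (Xi - mu)^2 equals
# the sum of (Xi - Xbar)^2 plus n*(Xbar - mu)^2, for any data and any mu.
import numpy as np

x = np.array([2.0, 5.0, 7.0, 11.0])    # arbitrary illustrative data
mu = 4.0                                # any value of mu works
xbar = x.mean()
n = len(x)

left = ((x - mu)**2).sum()
right = ((x - xbar)**2).sum() + n * (xbar - mu)**2
print(left, right)                      # the two values agree
```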

Therefore, dividing by $\sigma^2$,

\[
\frac{\sum (X_i - \mu)^2}{\sigma^2} = \frac{\sum (X_i - \bar{X})^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \right)^2. \qquad (4)
\]

The fact that we have just established, (4), gives us the key to our argument: $(\bar{X} - \mu)\big/(\sigma/\sqrt{n})$ is $N(0,1)$, and so $\bigl[ (\bar{X} - \mu)\big/(\sigma/\sqrt{n}) \bigr]^2$ is $\chi^2_1$. Also, we established that $\sum (X_i - \mu)^2 \big/ \sigma^2$ is $\chi^2_n$.

Now we mentioned above that $(\bar{X} - \mu)\big/(\sigma/\sqrt{n})$ and $(n-1)s^2/\sigma^2$ (i.e., $\sum (X_i - \bar{X})^2 \big/ \sigma^2$) are independent when sampling from a normal distribution. So according to (4), $\sum (X_i - \bar{X})^2 \big/ \sigma^2$ has the distribution that must be independently added to $\chi^2_1$ to give $\chi^2_n$. Looking at the definition (1) of the $\chi^2$ distributions, we see that this distribution must be the sum of the squares of $n - 1$ independent $N(0,1)$ random variables, that is, $\chi^2_{n-1}$.

So we have shown that

\[
\frac{\sum (X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2} \quad \text{is} \quad \chi^2_{n-1}.
\]

Thus we have completed our demonstration that $\dfrac{\bar{X} - \mu}{s/\sqrt{n}}$ is t distributed with $n - 1$ degrees of freedom.
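A quick simulation (another illustrative sketch of my own) makes this conclusion concrete: compute $(n-1)s^2/\sigma^2$ for many normal samples and compare its mean and variance with $n - 1$ and $2(n - 1)$, the mean and variance of the chi-square distribution with $n - 1$ degrees of freedom.

```python
# Illustration: (n-1)s^2/sigma^2 from normal samples behaves like chi-square(n-1).
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 50.0, 4.0, 10, 100_000   # invented parameter values

samples = rng.normal(mu, sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)              # sample variances
scaled = (n - 1) * s2 / sigma**2

print("simulated mean:", scaled.mean())        # close to n - 1
print("simulated variance:", scaled.var())     # close to 2(n - 1)
```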

V. The Mean and Variance of the Chi-Square Distribution with n Degrees of Freedom


In section II we defined the chi-square distribution with n degrees of freedom by

\[
\chi^2_n = Z_1^2 + Z_2^2 + \cdots + Z_n^2,
\]

where $Z_1, Z_2, \ldots, Z_n$ are independent random variables, each with distribution $N(0,1)$.

Taking the expected value and the variance of both sides, we see that

\[
E(\chi^2_n) = E(Z_1^2) + \cdots + E(Z_n^2), \quad \text{and} \quad \operatorname{Var}(\chi^2_n) = \operatorname{Var}(Z_1^2) + \cdots + \operatorname{Var}(Z_n^2).
\]

But all the instances of $Z_i$ have identical distributions, so

\[
E(\chi^2_n) = n\,E(Z^2), \quad \text{and} \quad \operatorname{Var}(\chi^2_n) = n \operatorname{Var}(Z^2),
\]

where $Z$ is a random variable with distribution $N(0,1)$.

Now,

\[
E(Z^2) = E\bigl[(Z - 0)^2\bigr] = E\bigl[(Z - \mu_Z)^2\bigr] = \operatorname{Var}(Z) = 1,
\]

telling us that

\[
E(\chi^2_n) = n \cdot 1 = n.
\]
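As a quick numerical check (my own illustration, not part of the derivation), averaging a large number of simulated chi-square values for a few choices of $n$ gives results close to $n$.

```python
# Numerical check that the mean of a chi-square distribution with n df is n.
import numpy as np

rng = np.random.default_rng(4)
for n in (1, 3, 10):
    values = rng.chisquare(n, 200_000)
    print(n, round(values.mean(), 3))   # each simulated mean is close to n
```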

So we are left now with the task of finding $\operatorname{Var}(Z^2)$.

Now,

\[
\operatorname{Var}(Z^2) = E\bigl[(Z^2 - \mu_{Z^2})^2\bigr] = E\bigl[(Z^2 - 1)^2\bigr] = E(Z^4 - 2Z^2 + 1) = E(Z^4) - 2E(Z^2) + 1 = E(Z^4) - 2 \cdot 1 + 1,
\]

so

\[
\operatorname{Var}(Z^2) = E(Z^4) - 1. \qquad (5)
\]
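Equation (5) can also be checked numerically; in the sketch below (my own illustration), $\operatorname{Var}(Z^2)$ computed directly from simulated values agrees closely with $E(Z^4) - 1$.

```python
# Numerical check of equation (5): Var(Z^2) = E(Z^4) - 1 for Z ~ N(0,1).
import numpy as np

rng = np.random.default_rng(5)
z = rng.standard_normal(1_000_000)

var_direct = (z**2).var()          # Var(Z^2) computed directly
via_moment = (z**4).mean() - 1     # E(Z^4) - 1
print(round(var_direct, 3), round(via_moment, 3))   # the two agree closely
```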

To find $E(Z^4)$, we'll use the fact that for any continuous random variable $X$ with probability density function $f$, and any exponent $k$,

\[
E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx.
\]
