


STAT 101, Module 7:

The Root-N Law, the Central Limit Theorem,

Standard Errors, and Confidence Intervals

(Book: chapter 7)

Independent and Uncorrelated Random Variables

• Definitions: Two random variables X and Y are called …

o … independent if the events Ai = (X=xi) and Bj = (Y=yj) are independent for all possible values xi of X and yj of Y.

o … uncorrelated if C(X,Y) = 0.

We say “uncorrelated” even though we use the covariance in the definition. Maybe that’s because we can’t say “uncovarianced”, or maybe because if σ(X) and σ(Y) are both > 0, then C(X,Y) = 0 ⇔ cor(X,Y) = 0.

• Theorem: If X and Y are independent, they are uncorrelated.

The theorem is important because it is often easier to recognize that two random variables are independent than that they are uncorrelated, even though independence is the more stringent condition. For example, one recognizes coin flips immediately as independent.

o “Proof”: C(X,Y) = E((X – µX) (Y – µY)) = E(X – µX) E(Y – µY) = 0· 0 = 0

The second equality is some grinding algebra, but nothing deep.

o The converse is not true! Here is a counter example of a pair of random variables that are uncorrelated but not independent:

P(X=1 and Y=2) = P(X=2 and Y=1) =

P(X=2 and Y=3) = P(X=3 and Y=2) = ¼

This can be realized by a game where two dice are thrown repeatedly till they show a 1-2 pair or a 2-3 pair in any order. The outcome will be one of (1,2), (2,1), (2,3), (3,2), with equal probability.


To see that the two random variables are not independent, check the marginal (plain) probabilities:

P(X=1) = P(Y=1) = P(X=3) = P(Y=3) = ¼

P(X=2) = P(Y=2) = ½

=> P(X=1 and Y=2) = ¼ ≠ P(X=1)·P(Y=2) = ¼ · ½ = ⅛

Intuitively, the two random variables cannot be independent because if X=1 we know Y=2, for example.

To see that the two random variables are uncorrelated, calculate their covariance. Note that E(X) = E(Y) = 2 and that every pair with positive probability contains the value 2, so each summand (x – 2)(y – 2) · P(X=x and Y=y) in C(X,Y) has a factor of 0. Therefore C(X,Y) = 0.
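Both claims can also be verified mechanically. The following short Python sketch (not part of the course software) simply redoes the arithmetic from the joint probabilities above:

# Joint distribution of the counterexample: uncorrelated but not independent.
joint = {(1, 2): 0.25, (2, 1): 0.25, (2, 3): 0.25, (3, 2): 0.25}

# Marginal probabilities P(X=x) and P(Y=y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

# Expected values E(X), E(Y) and the covariance C(X,Y).
mx = sum(x * p for x, p in px.items())
my = sum(y * p for y, p in py.items())
cov = sum((x - mx) * (y - my) * p for (x, y), p in joint.items())
print("E(X) =", mx, " E(Y) =", my, " C(X,Y) =", cov)   # covariance comes out 0

# Independence would require P(X=x and Y=y) = P(X=x) * P(Y=y) for all pairs.
print("P(X=1 and Y=2) =", joint[(1, 2)])                # 0.25
print("P(X=1) * P(Y=2) =", px[1] * py[2])               # 0.25 * 0.5 = 0.125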

o The following is an example of two independent variables:

P(X=1) = P(X=2) = P(X=3) = ⅓

P(Y=1) = P(Y=2) = P(Y=3) = ⅓

P(X=x and Y=y) = P(X=x) · P(Y=y)

Note that for independent variables we only need to specify the marginal probabilities, and the joint probabilities are obtained by multiplication. In this example the important thing is not that the probabilities of 1,2,3 are equal, but that they can be multiplied to obtain the joint probability of all pairs of values.


This can be realized by a game where two dice are thrown till they both show a value of 3 or less.

o Both of the above examples are constructed by “conditioning”. This is often a useful way to go from a known situation to a slightly different one: simply single out the cases that you like and condition on them. In this case the instructor didn’t want to deal with 6 · 6 = 36 outcomes, which is why he scaled things down to outcomes of 3 or less.
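The conditioning construction can also be imitated by simulation: throw two dice over and over and simply discard all throws that are not of the kind we want. Here is a Python sketch (the function name throw_until is an invention of this write-up, not course material):

import random

def throw_until(accept):
    """Throw two dice repeatedly until the pair satisfies the condition 'accept'."""
    while True:
        x, y = random.randint(1, 6), random.randint(1, 6)
        if accept(x, y):
            return x, y

# First example: condition on a 1-2 pair or a 2-3 pair, in any order.
pairs = [throw_until(lambda x, y: {x, y} in ({1, 2}, {2, 3})) for _ in range(10000)]
for p in [(1, 2), (2, 1), (2, 3), (3, 2)]:
    print(p, pairs.count(p) / len(pairs))        # each relative frequency is near 1/4

# Second example: condition on both dice showing 3 or less.
small = [throw_until(lambda x, y: x <= 3 and y <= 3) for _ in range(10000)]
print(small.count((1, 2)) / len(small))          # near 1/9, as independence predicts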

The Root-N Law and the Standard Error

• We are finally able to determine the rate at which relative frequencies and means grow more precise. It will be a disappointing result, because the precision gets better only very slooooowwwwwwly…

We consider a possibly long series of random variables with identical possible outcomes and identical probabilities for these outcomes (“identically distributed”), and we also assume the variables are uncorrelated:

X1 , X2 , X3 , X4 , X5 , …, XN

As examples, keep in mind flipping a coin, rolling a die, but also daily stock market returns (which are surprisingly uncorrelated day to day), or the monthly credit card bills of a randomly sampled series of households, measurements of blood glucose in a given patient (the measurements are slightly different even from the same blood sample due to measurement error), survival times of cancer patients treated with a new therapy,…

Note that X1 stands for the value of the first case across datasets, X2 for the value of the second case across datasets, and so on. It therefore makes sense to talk about the probability distribution of each of the variables X1, X2, …

The assumption of identical distribution has the consequence that not only are all the probabilities P(X1=x) = P(X2=x) = P(X3=x) =… the same for all possible values x, but so are the expected values and variances and SDs:

E(X1) = E(X2) = E(X3) = … = E(XN) = µ

V(X1) = V(X2) = V(X3) = … = V(XN) = σ²

We think of the whole series as repeatable: Over and over, we could

o flip another N coins,

o roll another N dice,

o look at another series of N daily stock returns,

o another sample of N households and their monthly credit card bills,

o another set of N blood glucose measurements from the same blood sample,

o another clinical trial with N treated patients and their survival times,…

We are now interested in the mean value of these outcomes:

X̄ = (X1 + X2 + X3 + X4 + X5 + …+ XN ) / N

Because of the assumed repeatability, X̄ is a random variable in its own right: every repetition would produce a slightly different mean. Its expected value is obviously E(X̄) = μ, but what is its SD?

• Theorem: If X1 , X2 , …, XN are uncorrelated and identically distributed with same variance σ², then

V(X̄) = σ² / N

o Proof: V(X̄) = V(X1 + X2 + …+ XN ) / N²

= C(X1 + X2 + …+ XN , X1 + X2 + …+ XN ) / N²

= ( V(X1) + V(X2) + …+ V(XN) + … + C(Xi, Xj) + … ) / N²

= ( N σ² ) / N²

= σ² / N

The steps of the proof are as follows: 1) pull out the factor 1/N as 1/N²; 2) expand the variance of the sum into N variances and N(N–1) covariances; 3) use the fact that all covariances disappear; 4) use the fact that all variances are the same, σ².

(For those who enjoy math: This is really a giant application of a version of the theorem of Pythagoras. It is like taking a giant N-dimensional triangle, or N-angle, really, and doing something like this: hypotenuse² = (side 1)² + (side 2)² + (side 3)² + … + (side N)², where all sides are of equal length, so that hypotenuse² = N · (any side)². The quantity we are examining, though, is hypotenuse²/N² = (any side)²/N, which is σ²/N.)
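For those who prefer to see the theorem in action, here is a small simulation sketch in Python (not part of the course software; the die example and the particular sample sizes are arbitrary choices of this write-up). It rolls N dice many times over, records the mean of each “dataset”, and compares the SD of those means with σ/N½:

import random
import statistics

# A fair die has mu = 3.5 and sigma^2 = 35/12, so sigma is about 1.708.
sigma = (35 / 12) ** 0.5

for N in (4, 16, 64, 256):
    # Collect the mean of N die rolls, repeated for many "datasets".
    means = [statistics.mean(random.randint(1, 6) for _ in range(N))
             for _ in range(20000)]
    print(f"N = {N:4d}:  SD of means = {statistics.stdev(means):.4f},"
          f"  sigma/N^(1/2) = {sigma / N ** 0.5:.4f}")

Each four-fold increase in N cuts the dataset-to-dataset SD of the mean roughly in half, exactly as the theorem predicts.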

o What is disappointing about this result? It becomes clear once we reformulate it in terms of standard deviations, which are the real measure of dispersion:

σ(X̄) = σ / N½

• Definition: σ(X̄) is called the standard error of the mean.

The standard error is a standard deviation, but the term is used only in a special case: when describing the variability of an estimate, such as a mean, across datasets.

• Interpretation: The standard error of the mean is a measure of dispersion of the mean from dataset to dataset, assuming one could obtain datasets like the observed one over and over and over…

This mental exercise should give you something to think about. In any given data analysis, you are looking at one single dataset. You are calculating one number from a column, its mean (mean household income, say, and this could be something like $53,128.358). How come we are going to think of this number as “variable”? It’s one number, right? There are many households in the sample, but there is only one mean. And how are we going to pretend we knew something about the “variability” of this one number?

Well, the mental exercise starts from the realization that the data collection is repeatable. We could collect other datasets just like the one we have, at least in principle, and each time the mean would be slightly different. The miracle of the root-N law for the standard error of the mean rests on the assumption that the cases/rows/records are uncorrelated, which is usually the case when the cases are obtained by independent sampling or can otherwise be thought of as arising independently of each other. This is where the math proof gives insight: the “Pythagorean miracle” happens only because we assume that the individual observations are uncorrelated with (“orthogonal” to) each other.

• Examples:

o Assume we are looking at household surveys of various sample sizes. To make things concrete, assume the observations are the household incomes, which may average around $50,000 with a SD of $30,000. Then:

N = 1: σ(X̄) = σ / 1 (= dispersion of the raw observations)

N = 100: σ(X̄) = σ / 10

N = 10000: σ(X̄) = σ / 100

N = 1000000: σ(X̄) = σ / 1000

Thus the uncertainty in the mean household incomes drops to ±$3,000 (N=100), to ±$300 (N=10,000), to ±$30 (N=1,000,000).

We have a diminishing returns effect! Gaining 10-fold precision requires 100-fold increases in sample size.

o Your employer conducted a survey of households on a shoestring budget, and the sample size was just N = 200. The manager is naturally dissatisfied with the precision of the estimates of product take-rates, average household income, average household spending, household preferences,… So he/she presses upper management for more money. He/she happily reports back to the group that conducted the survey, saying “I got sufficient funds to double the sample size, so we can slash the errors by a factor of two.”

What should your response be?

“Apologies, but we’ll be able to reduce the errors only by about 30%, not 50%.”

Why is this the correct response?

The sample size grows from 200 to 400. The standard error decreases from σ/200½ to σ/400½. The ratio is

( σ/400½ ) / ( σ/200½ ) = (200/400)½ = 1/2½ = 0.7071068 ≈ 70%

Thus the reduction is not even quite 30%.

To slash the standard error by half, one needs to quadruple the sample size!!!

Standard Error Estimate of the Mean

• The root-N law and the standard error are theoretical so far because they rest on an unknown population quantity σ. While it is nice to have insights into how precision depends on the sample size N, it would be even nicer if the standard error could be estimated. This is indeed done and part of standard statistical practice:

• Although we don’t know σ, we can estimate it! The obvious estimate is the empirical standard deviation s of the observations:

s = ( ( (X1 – X̄)² + (X2 – X̄)² + … + (XN – X̄)² ) / (N – 1) )½

which in the limit N → ∞ goes to

σ = ( (x1 – μ)² · P(X=x1) + (x2 – μ)² · P(X=x2) + … )½

In words:

o σ is the “true” or population SD “calculated” from infinitely many observations Xi.

o s is the estimated or sample SD calculated from the N observations X1, X2, …, XN of a single dataset.

Therefore, the natural estimate of the standard error is:

stderr(X̄) = s / N½
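To make the estimation step concrete, here is how X̄, s, and s/N½ would be computed from a single small column of data. This is a Python sketch with invented numbers, purely for illustration:

import statistics

incomes = [41000, 52000, 38500, 67000, 49500, 55000, 45000, 60000]   # invented numbers
N = len(incomes)

xbar = statistics.mean(incomes)      # the sample mean
s = statistics.stdev(incomes)        # the sample SD s (divisor N - 1)
stderr = s / N ** 0.5                # the standard error estimate of the mean

print(f"mean = {xbar:.1f},  s = {s:.1f},  stderr = {stderr:.1f}")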

With this estimation step, we have achieved something remarkable:

Based on one single dataset (the one we have in hand),

we estimate how much the mean of a variable varies across datasets!

Isn’t this stranger than strange? How is this possible? It is possible due to the math that goes into the root-N law. This math draws on the assumption that the cases/rows of the dataset are sampled independently. Such independence makes the N values of a variable uncorrelated if we could repeat data collections. Having zero covariances between all N observations wipes out most terms in V(X̄), and the root-N miracle happens, leaving us with a population standard deviation that can be estimated from any single dataset…

The full ramifications will become clear as we develop the notion of a confidence interval constructed from standard errors.

When polls around election time report a margin of error, it is based on the standard error of a proportion of voters (typically about two standard errors). Recall that a proportion is just a mean of 0s and 1s, where 1=‘in favor of the incumbent’, 0=‘in favor of the challenger’.

As for terminology: the technically correct term “standard error estimate of the mean” is usually replaced with the shorter “standard error of the mean” or the even shorter “standard error”. This is technically not correct because the standard error is a theoretical population quantity, but the precise term is too much of a mouthful to bother with.

• Standard Errors in JMP: Take any dataset with quantitative variables and apply Distribution to them. For example, go to the dataset PennStudents.JMP  and run the variables Height and Weight through Distribution. We focus on the bottom list labeled ‘Moments’:

HEIGHT:

| Mean           | 67.754103  |
| Std Dev        | 3.9749694  |
| Std Err Mean   | 0.2012804  |
| upper 95% Mean | 68.149836  |
| lower 95% Mean | 67.358369  |
| N              | 390        |

WEIGHT:

| Mean           | 150.07821  |
| Std Dev        | 30.051343  |
| Std Err Mean   | 1.5217089  |
| upper 95% Mean | 153.07001  |
| lower 95% Mean | 147.0864   |
| N              | 390        |

From Module 6 we know how to interpret the ‘Mean’ and the ‘Std Dev’ in conjunction with the bellcurve.

What is new is that we can make sense of the next three numbers, labeled ‘Std Err Mean’, ‘upper 95% Mean’ and ‘lower 95% Mean’:

o The ‘Std Err Mean’ is of course the standard error estimate of the mean. We can confirm that it is obtained from the standard deviation (of the observations) by dividing by the square root of N:

3.9749694 / 390½ = 0.2012804

So: The mean, which is 67.75 for this dataset, would be different for other datasets, but it would vary around the population mean of Height (which we don’t know) with a standard deviation of about 0.2.

o The next two numbers, ‘upper 95% Mean’ and ‘lower 95% Mean’, are roughly the mean ± two standard errors. So why aren’t these two numbers exactly

67.754103 ± 2 · 0.2012804 = (67.35154, 68.15666) ?

The reason is that the empirical rule as we formulated it, with a nice factor 2, is not exact. JMP and all software packages calculate more exact numbers to achieve 95% coverage, but you see that JMP’s numbers are reasonably close to the rough-and-ready ± 2 stderr rule. When available, use JMP’s numbers; when not, use the empirical rule.
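For those curious where JMP’s exact bounds come from: they can be reproduced (up to rounding) with a 97.5% t quantile in place of the factor 2. The sketch below uses Python’s scipy and assumes the usual software convention of N – 1 degrees of freedom:

from scipy import stats

mean, sd, N = 67.754103, 3.9749694, 390        # JMP's Moments for HEIGHT
stderr = sd / N ** 0.5                         # 0.2012804, as reported

t = stats.t.ppf(0.975, N - 1)                  # about 1.966, slightly below 2
print("lower 95% Mean:", mean - t * stderr)    # about 67.358
print("upper 95% Mean:", mean + t * stderr)    # about 68.150

# The rough-and-ready factor 2 gives nearly the same interval:
print("+- 2 stderr   :", mean - 2 * stderr, mean + 2 * stderr)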

Wait a minute! How could JMP assume that the distribution of the means across datasets is approximately normal? That is what JMP seems to assume when it labels these bounds as the upper and lower limits of a 95% coverage interval.

Something is missing: the Central Limit Theorem.

The Central Limit Theorem

• Theorem: If X1 , X2 , …, XN are mutually independent and identically distributed with the same population mean μ and the same population variance σ², then, as N → ∞, the variation of the sample means

X̄ = (1/N) (X1 + X2 + …+ XN )

from dataset to dataset resembles ever more closely a normal distribution with population mean μ and population variance σ²/N.

We knew the last part already: whatever the distribution is, it must have population mean (expected value) μ and variance σ²/N, the latter due to the root-N law. The powerful part is that this distribution looks ever more like a bellcurve.

Unfortunately, we can’t indulge the intellectually curious with a proof or even a proof idea. The best we can do is to illustrate with a simulation, and this is what you are doing in Homework 5. In class we will do another simulation using Sim 300x Uniform.JMP

The powerful and counter-intuitive part of the central limit theorem (CLT) is that it does not matter what the distribution of the observations Xi of a variable is: means across datasets will look ever more normally distributed. In other words, your Distribution analysis of the variable/column with values (X1 , X2 , …, XN) may look skewed or discrete, but the means of the same variable across datasets would still look approximately normally distributed, and this approximation gets better as N → ∞.

A rule of thumb is that for sample sizes as low as N = 50, the normal distribution is a good approximation to the distribution of means across datasets.
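The class simulation can be mimicked outside of JMP as well. The following Python sketch (the exponential distribution and the bin settings are arbitrary choices of this write-up) draws many datasets from a strongly skewed distribution, computes each dataset’s mean, and prints a crude histogram of those means; the histogram looks bell-shaped even though the raw data are anything but:

import random
import statistics

# A strongly skewed variable: exponential with mean 1 (so mu = 1, sigma = 1).
N, reps = 50, 10000
means = [statistics.mean(random.expovariate(1.0) for _ in range(N))
         for _ in range(reps)]

# Crude text histogram of the 10000 means: they pile up symmetrically
# around mu = 1, bell-curve style, despite the skewness of the raw data.
lo, hi, bins = 0.6, 1.4, 16
counts = [0] * bins
for m in means:
    if lo <= m < hi:
        counts[int((m - lo) / (hi - lo) * bins)] += 1
for i, c in enumerate(counts):
    print(f"{lo + i * (hi - lo) / bins:.2f}  {'*' * (c // 25)}")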

• Reminder: We have been careful spelling out that the object of study is the distribution of the mean of a variable/column across datasets with N cases/rows. “Across datasets” means “across dataset collections”. Keep in mind: We are playing a mind game by examining hypothetically what means would look like if we could collect datasets over and over and over…

So we said this is a hypothetical mind game. In reality, if we are ever in the situation of collecting more than one dataset with the same variables, we will most likely not analyze the datasets separately. Instead, we will merge them into one larger dataset with many more cases and the same variables. If the two datasets were both of size N, the merged dataset will be of size 2N. (By how much can we hope to slash the standard error of the means of the variables?)

(A note on “meta-analysis” for the intellectually curious: There exists a situation in which one analyzes results from multiple datasets, namely, when one surveys research that has been going on for years and has produced multiple studies of roughly the same problem resulting in datasets that all contain some of the same variables of interest. This is the case typically in the medical field where a disease is investigated over and over from various angles. Such studies will have some of the same variables and also some that are specific to them. When surveying such studies, one can use techniques from a statistical specialty called “meta-analysis”. Typically one has only access to the summary statistics such as means, standard deviations, correlations of the variables as reported in papers published in scientific journals, but one does not have access to the multiple datasets themselves. By combining the estimates of multiple studies, meta-analytic techniques will then provide more accurate estimates than any of the individual studies.)

The Empirical Rule Based on the Central Limit Theorem

• The upshot of the central limit theorem is that for moderate and large sample sizes (N ≥ 30), we can make approximate probability statements such as those of the empirical rule:

P( | X̄ – μ | ≤ 2σ/N½ ) ≈ 19/20

P( | X̄ – μ | ≤ σ/N½ ) ≈ 2/3

This is of course not useful, although true. It becomes potentially useful once we try to estimate the unknown population standard deviation σ with a sample standard deviation s:

P( | X̄ – μ | ≤ 2s/N½ ) ≈ 19/20

P( | X̄ – μ | ≤ s/N½ ) ≈ 2/3

Are these still acceptable approximations? It turns out the answer is yes! Here is why this is a non-trivial answer: By estimating σ with s, we incur dataset-to-dataset variability in s, just as in the sample mean X̄. Wouldn’t one expect this variation in s to destroy the nice empirical rule? Think about it: s undershoots σ as often as it overshoots, and when it overshoots, it makes the interval wider than necessary, so maybe the problem is not so bad. In fact, it isn’t.
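One way to convince yourself is a small coverage simulation: generate many datasets, form the interval X̄ ± 2s/N½ for each, and count how often it captures μ. Here is a Python sketch (the normal data and the choice N = 30 are arbitrary choices of this write-up):

import random
import statistics

N, reps, mu, sigma = 30, 20000, 0.0, 1.0
hits = 0
for _ in range(reps):
    data = [random.gauss(mu, sigma) for _ in range(N)]
    xbar = statistics.mean(data)
    s = statistics.stdev(data)                  # s varies from dataset to dataset
    if abs(xbar - mu) <= 2 * s / N ** 0.5:      # does the +-2 stderr interval cover mu?
        hits += 1
print("coverage:", hits / reps)                 # close to 0.95 (a touch below at N = 30)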

Here is what mathematical statistics found out: For small sample sizes we need to lift the factor 2 just a little bit, but for large sample sizes we can actually use a factor slightly below 2. The following table lists factors for various sample sizes as suggested by the theory:

N:      10    15    20    30    40    50    60    75    100

Factor: 2.23  2.13  2.09  2.04  2.02  2.01  2.00  1.99  1.98

These factors used to be tabulated but are now computed by software such as JMP as needed. If we denote the factors by tN , the following probability statement is made exact, assuming the data themselves are normally and independently distributed:

P( | X̄ – μ | ≤ tN · s/N½ ) = 0.95

Note the equal sign! For all practical purposes, the factor 2 will be just fine if we only remember that it is a little too small for N less than about 50. For N ≥ 100, the factor may actually be conservative in many cases, which is not a problem. It only means the probability may be a tad greater than 0.95, such as 0.952 for N = 100. Now, these probabilities are computed assuming normal data. If the data are non-normal, such as skewed or discrete, the probability may be a touch below 0.95. In all,

P( | X̄ – μ | ≤ 2 s/N½ ) ≈ 0.95

is a pretty good rule: definitely for N ≥ 100, and even for N ≥ 50 unless the data are crazy.

Insight into the problem discussed here developed in the early 1900s. Someone named Gosset did a mathematical investigation into the probability distribution of the quantity t = (X̄ – μ) / (s/N½), the so-called t-statistic, assuming that the observations X1, X2,… are all normally and independently distributed. He actually derived the density function for this statistic. What we denote as tN is the 97.5% quantile of this t-distribution. The sample size N determines the so-called “degrees of freedom” of the t-distribution, and you may encounter references to “a t-distribution with N degrees of freedom”.

Here are some trivia surrounding these discoveries, quoted from the Wikipedia (search “student’s t”): “The derivation of the t-distribution was first published in 1908 by William Sealy Gosset, while he worked at a Guinness Brewery in Dublin. He was not allowed to publish under his own name, so the paper was written under the pseudonym Student. The t-test and the associated theory became well-known through the work of R.A. Fisher, who called the distribution "Student's distribution".”
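The factors in the table above can be recomputed with any statistical software; here is a Python sketch using scipy. As printed, the table matches the 97.5% quantile of the t-distribution with N degrees of freedom; the more common textbook convention, N – 1 degrees of freedom, gives factors that are a shade larger for small N:

from scipy import stats

for N in (10, 15, 20, 30, 40, 50, 60, 75, 100):
    factor = stats.t.ppf(0.975, N)      # 97.5% t quantile with N degrees of freedom
    print(f"N = {N:3d}:  factor = {factor:.2f}")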

Confidence Intervals

• Above we looked at the probability of the statement | X̄ – μ | ≤ 2s/N½ and came away with the message that it is close to 0.95 for N ≥ 50.

• Preliminary observation for the next step:

o In words, the inequality | X̄ – μ | ≤ 2 s/N½ expresses the idea that the distance between X̄ and μ is no more than 2s/N½.

o There are two ways to express the same idea asymmetrically:

▪ μ is no further away from X̄ than 2s/N½ :

X̄ – 2 s/N½ ≤ μ ≤ X̄ + 2 s/N½