


Stat 450
Chapter 7: Sampling distributions
Fall 2019

We turn now in full force to studying sampling distributions. Let's define some terms:

Population: The collection of all elements of interest.

Parameter: A quantity (or quantities) that, for a given population, is fixed and that is used as the value of a variable in some general distribution or frequency function to make it descriptive of that population. (E.g., $\mu$ is the mean of a normal distribution; $(\alpha, \beta)$ govern the Gamma distribution and are used to define its mean and variance.)

Sample: A collection of elements drawn from the population and observed. In these notes, we will be considering univariate realizations $Y_1, Y_2, \ldots, Y_n$ drawn independently and identically from the population.

Statistic: A function of the observable random variables in a sample.

These terms are important because they bring us to the definition of a sampling distribution: the distribution of a statistic across repeated samples taken from the population. What are sampling distributions used for?

- Finding the mean and variance of a statistic; this is how bias and mean-squared error are defined (more later)
- Hypothesis testing
- Finding confidence intervals

General (algebraic) facts about the sample mean and residuals

Consider a sample $Y_1, \ldots, Y_n$ drawn i.i.d. with mean $\mu$. Let

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i, \qquad s^2 = \frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}.$$

1. The sum of the sample errors is 0: $\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0$.
2. Another way to represent the sum of squared errors: $(n-1)s^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$.
3. $\sum_{i=1}^{n} Y_i^2 = (n-1)s^2 + n\bar{Y}^2$.
4. $\sum_{i=1}^{n}(Y_i - \mu)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 + n(\bar{Y} - \mu)^2$.
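These identities are easy to confirm numerically. Below is a quick R check (an added sketch, not part of the original notes; the sample size, seed, and reference value mu = 3 are arbitrary choices). Each printed value should be 0 up to floating-point error:

# Numerical check of the four algebraic identities on one random sample
set.seed(1)
n  <- 12
mu <- 3                     # reference value for identity 4
y  <- rnorm(n, mean = mu, sd = 2)
ybar <- mean(y)
s2   <- var(y)              # var() uses the n-1 denominator

sum(y - ybar)                                              # identity 1
(n - 1) * s2 - sum((y - ybar)^2)                           # identity 2
sum(y^2) - ((n - 1) * s2 + n * ybar^2)                     # identity 3
sum((y - mu)^2) - (sum((y - ybar)^2) + n * (ybar - mu)^2)  # identity 4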
Facts when the sample is drawn from a $N(\mu, \sigma^2)$ distribution:

1. No matter what size n is, $\bar{Y} \sim N(\mu, \sigma^2/n)$, where $\mu$ is the mean of the normal distribution and $\sigma^2$ is its variance.
2. The sample mean $\bar{Y}$ is independent of the sample variance $S^2$.
3. The following, useful for obtaining confidence intervals and doing hypothesis tests for $\sigma^2$: $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$.
4. The following, useful for obtaining confidence intervals and doing hypothesis tests for $\mu$: $\frac{\bar{Y}-\mu}{s/\sqrt{n}} \sim t_{n-1}$.
5. If $Y_1, Y_2, \ldots, Y_n$ is an i.i.d. sample from a $N(\mu_Y, \sigma_Y^2)$ distribution, and $X_1, X_2, \ldots, X_m$ is an i.i.d. sample from a $N(\mu_X, \sigma_X^2)$ distribution, then the ratio of the sample variances, scaled as $\frac{s_X^2/\sigma_X^2}{s_Y^2/\sigma_Y^2}$, follows an $F_{m-1,\,n-1}$ distribution.

We will prove these when the sample (or samples, in the case of #5) is drawn from a normal population.

Proof of 1:

Proof of 2: Not trivial; take this one as fact. (See pg. 358 of the text for a proof when n = 2.)

Preliminaries for the proof of 3:

- If $Y \sim N(\mu, \sigma^2)$, then $Z = \frac{Y-\mu}{\sigma} \sim N(0,1)$. (We have proved this previously.)
- If $Z \sim N(0,1)$, then $U = Z^2 \sim \chi^2_1 \equiv \mathrm{Gamma}(\alpha = 1/2, \beta = 2)$. (Proved using the pdf method in Chapter 6.)
- If $Z_1, Z_2, \ldots, Z_n$ are i.i.d. $\chi^2_1$, then $\sum_{i=1}^{n} Z_i \sim \chi^2_n$. (You were sort of asked to prove this on Assignment 6.)

(Proof of #3, continued)

Simulation to verify #3

Write code to simulate a random sample from a normal population with mean $\mu$ and variance $\sigma^2$. For each sample, compute the value of the statistic $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$. We can then examine a histogram of the results and superimpose a chi-square density function with $n-1$ degrees of freedom:

$$f_X(x) = \frac{1}{\Gamma\left(\frac{n-1}{2}\right) 2^{(n-1)/2}} \, x^{\frac{n-1}{2}-1} e^{-x/2}, \quad x > 0.$$

Function for generating random values of the statistic $(n-1)s^2/\sigma^2$:

varSD = function(sim = 1000, n = 20, mu = 10, sd = 2) {
  stat = rep(0, sim)
  for (i in 1:sim) {
    sam = rnorm(n, mu, sd)
    stat[i] = (n - 1) * var(sam) / sd^2
  }
  return(stat)
}

> results = varSD(sim = 1000, n = 20, mu = 10, sd = 2)
> hist(results, freq = F, col = "blue", main = "Sample Variance Statistic")
> x = seq(0, 50, .01)
> fx = dchisq(x, df = 19)
> lines(x, fx, lwd = 3, col = "red")

Applications of #3: CIs for the population variance/standard deviation, and hypothesis testing

When sampling from a normal population, i.e., when $Y_1, Y_2, \ldots, Y_n$ are i.i.d. $N(\mu, \sigma^2)$, the random statistic

$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}.$$

CIs for $\sigma^2$ and $\sigma$

Example: Suppose we sample n = 65 adult women, measure their body temperatures, and obtain the following results. Find a 95% CI for the population variance ($\sigma^2$) and the population standard deviation ($\sigma$).

Simulating confidence intervals for $\sigma^2$

Below is some R code to simulate a sample of size n from a $N(2, \sigma^2 = 4)$ population. The function calculates and returns a single 95% confidence interval. We then replicate this function many times to obtain many confidence intervals, 95% of which should cover the true $\sigma^2$:

#Write code to get a sample and calculate a 95% confidence interval
get.one.ci <- function(n){
  one.sample <- rnorm(n, mean = 2, sd = sqrt(4))
  s2 <- var(one.sample)
  lower <- (n - 1) * s2 / qchisq(0.975, n - 1)
  upper <- (n - 1) * s2 / qchisq(0.025, n - 1)
  ci <- c(lower, upper)
  return(ci)
}

#Given a 95% confidence interval and the true sigma^2, does the interval cover sigma^2?
covers.sigma2 <- function(ci, sigma2) {
  cover <- ifelse(ci[1] < sigma2 & sigma2 < ci[2], 'Yes', 'No')
  return(cover)
}

#Gather 200 samples and corresponding confidence intervals, and calculate coverage:
set.seed(111)
many.ci <- replicate(200, get.one.ci(n = 10), simplify = 'matrix')
df <- data.frame(t(many.ci))
df$Coverage <- apply(df, 1, covers.sigma2, sigma2 = 4)
df$Sample <- 1:nrow(df)
table(df$Coverage)/200

## 
##    No   Yes 
## 0.055 0.945 

#Plot the results
library(ggplot2)
ggplot(data = df) +
  geom_segment(aes(x = X1, xend = X2, y = Sample, yend = Sample, color = Coverage)) +
  geom_vline(xintercept = 4) +
  xlab('95% confidence interval') +
  ylab('Sample number')

Using #3 for hypothesis testing

EXAMPLE: Quality control. On a production line, consistency of performance is very important. For example, suppose a machine is calibrated to fill 16-ounce Coke bottles very precisely. The machine is supposed to fill each bottle with 16 ounces, but there may be slight variation from bottle to bottle. Specifically, suppose the distribution of actual bottle fills is intended to follow a normal distribution with mean $\mu = 16$ and variance $\sigma^2 = 0.01$. If there is evidence that $\sigma^2 > 0.01$, the machine will need to be recalibrated. This then becomes a problem of testing:

$$H_0: \sigma^2 = 0.01 \quad \text{vs.} \quad H_a: \sigma^2 > 0.01$$

Suppose a sample of n = 20 bottles is taken from the production line; how large will $s^2$ need to be to convincingly suggest that the machine needs recalibration? This involves finding the sampling distribution of $s^2$ (or some appropriately scaled version thereof) to determine which values of $s^2$ would be very unusual if $H_0$ were true.
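Fact #3 answers this directly: under $H_0$, $(n-1)s^2/0.01 \sim \chi^2_{19}$, which pins down how large $s^2$ must be. Below is a short R sketch of the calculation (an added illustration, not part of the original notes; the observed value s2 = 0.015 in the last step is made up):

# Under H0: sigma^2 = 0.01, (n-1)*s^2/0.01 follows a chi-square with 19 df
n      <- 20
sigma0 <- 0.01                      # null value of sigma^2
crit   <- qchisq(0.95, df = n - 1)  # upper 5% point of chi-square_19, ~30.14

# Reject H0 at the 5% level when (n-1)*s2/sigma0 > crit, i.e., when s2 exceeds:
sigma0 * crit / (n - 1)             # ~0.0159

# p-value for a hypothetical observed sample variance s2 = 0.015:
s2 <- 0.015
pchisq((n - 1) * s2 / sigma0, df = n - 1, lower.tail = FALSE)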
Proof of #4

Here, we want to prove that if $Z \sim N(0,1)$ and $W \sim \chi^2_\nu$, with $Z \perp W$, then

$$T = \frac{Z}{\sqrt{W/\nu}} \quad \left(\text{i.e., } \frac{N(0,1)}{\sqrt{\chi^2_\nu/\nu}}\right)$$

follows a t distribution with $\nu$ degrees of freedom. The pdf of the t distribution:

$$f_T(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad -\infty < t < \infty.$$

We will proceed as follows:

A. Show that T follows a $t_\nu$ distribution. (HW 7)
B. Let $Y_1, \ldots, Y_n$ be an i.i.d. sample from a $N(\mu, \sigma^2)$ distribution, and let

$$T = \frac{\bar{Y}-\mu}{s/\sqrt{n}}.$$

Show that this can be written as $Z/\sqrt{W/(n-1)}$, where $Z \sim N(0,1)$, $W \sim \chi^2_{n-1}$, and $Z \perp W$, and hence that $T \sim t_{n-1}$.

Proof of B:

Deriving 95% confidence intervals for $\mu$

Suppose $Y_1, Y_2, \ldots, Y_n$ is an i.i.d. sample drawn from a $N(\mu, \sigma^2)$ population. Derive a 95% confidence interval for $\mu$.

Simulation study

Write the code necessary to get the output below, which shows the simulated coverage of 95% confidence intervals of the true $\mu = 2$, when $Y_1, \ldots, Y_{10} \sim$ i.i.d. $N(2, \sigma^2 = 4)$. (One possible version of the helper functions this code assumes appears after the output.)

#Gather 200 samples and corresponding confidence intervals, and calculate coverage:
set.seed(24111)
many.ci <- replicate(200, get.one.ci(n = 10), simplify = 'matrix')
df <- data.frame(t(many.ci))
df$Coverage <- apply(df, 1, covers.mu, mu = 2)
df$Sample <- 1:nrow(df)
table(df$Coverage)/200

## 
##    No   Yes 
## 0.055 0.945 

#Plot the results
library(ggplot2)
ggplot(data = df) +
  geom_segment(aes(x = X1, xend = X2, y = Sample, yend = Sample, color = Coverage)) +
  geom_vline(xintercept = 2) +
  xlab('95% confidence interval') +
  ylab('Sample number')
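The exercise asks you to write get.one.ci and covers.mu yourself. One possible version, using the t-based interval derived above and mirroring the structure of the $\sigma^2$ simulation, is sketched here (not the notes' official solution):

#A t-based 95% CI for mu from a sample of size n drawn from N(2, 4)
get.one.ci <- function(n){
  one.sample <- rnorm(n, mean = 2, sd = sqrt(4))
  ybar  <- mean(one.sample)
  se    <- sd(one.sample) / sqrt(n)   # s/sqrt(n)
  tcrit <- qt(0.975, df = n - 1)      # 0.975 quantile of t_{n-1}
  ci <- c(ybar - tcrit * se, ybar + tcrit * se)
  return(ci)
}

#Given a 95% confidence interval and the true mu, does the interval cover mu?
covers.mu <- function(ci, mu) {
  cover <- ifelse(ci[1] < mu & mu < ci[2], 'Yes', 'No')
  return(cover)
}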
Showing #5

If $Y_1, Y_2, \ldots, Y_n$ is an i.i.d. sample from a $N(\mu_Y, \sigma_Y^2)$ distribution, and $X_1, X_2, \ldots, X_m$ is an i.i.d. sample from a $N(\mu_X, \sigma_X^2)$ distribution, then the ratio of the sample variances, scaled as $\frac{s_X^2/\sigma_X^2}{s_Y^2/\sigma_Y^2}$, follows an $F_{m-1,\,n-1}$ distribution.

A. First, we need to prove that, in general, if $U \sim \chi^2_p$ and $V \sim \chi^2_q$ are independent, then $W = \frac{U/p}{V/q} \sim F_{p,q}$, where

$$f_W(w) = \frac{\Gamma\left(\frac{p+q}{2}\right)}{\Gamma\left(\frac{p}{2}\right)\Gamma\left(\frac{q}{2}\right)} \left(\frac{p}{q}\right)^{p/2} w^{p/2 - 1} \left(1 + \frac{p}{q}\,w\right)^{-\frac{p+q}{2}}, \quad w > 0.$$

B. Then, we need to prove that

$$\frac{s_X^2/\sigma_X^2}{s_Y^2/\sigma_Y^2} \sim \frac{\chi^2_{m-1}/(m-1)}{\chi^2_{n-1}/(n-1)},$$

which by A implies $\frac{s_X^2/\sigma_X^2}{s_Y^2/\sigma_Y^2} \sim F_{m-1,\,n-1}$.

Proof of A:

Proof of A (cont'd):

Proof of B:

Usage: hypothesis testing for the equality of two population variances

Suppose we have $X_1, X_2, \ldots, X_m$ drawn i.i.d. $N(\mu_X, \sigma_X^2)$ and $Y_1, Y_2, \ldots, Y_n$ drawn i.i.d. $N(\mu_Y, \sigma_Y^2)$. We are interested in testing whether the two population variances are equal, e.g.:

$$H_0: \sigma_X^2 = \sigma_Y^2 \quad \text{vs.} \quad H_a: \sigma_X^2 \neq \sigma_Y^2$$

How can we derive a test for these hypotheses?

Example: Testing the equality of variances

Pre-eclampsia is a disorder of pregnancy characterized by the onset of high blood pressure and often a significant amount of protein in the urine. When it arises, the condition begins after 20 weeks of pregnancy. It poses a significant health risk to the expecting mother and her unborn baby. In this study, researchers were interested in comparing the gestational age (weeks) of infants born to mothers with pre-eclampsia (P) to that of infants born to mothers without it (N):

$$H_0: \sigma_N^2 = \sigma_P^2 \quad \text{vs.} \quad H_a: \sigma_N^2 \neq \sigma_P^2$$

Usage: One-way ANOVA

In one-way ANOVA, we test for the equality of the means of k independent samples by deriving the F statistic. Specifically, let $Y_{i1}, Y_{i2}, \ldots, Y_{ir_i} \sim N(\mu_i, \sigma^2)$ with $Y_{ij} \perp Y_{ij'}$ for $j \neq j'$. That is:

- We have k groups, and the observations are independent within and across groups.
- In group i we observe $r_i$ observations, with $\sum_i r_i = N$ observations in total.

With

$$F = \frac{\sum_{i=1}^{k} r_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 / (k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{r_i} (Y_{ij} - \bar{Y}_{i\cdot})^2 / (N-k)} = \frac{SS_{\text{Treatment}}/(k-1)}{SS_{\text{Error}}/(N-k)} = \frac{(SS_{\text{Total}} - SS_{\text{Error}})/(k-1)}{SS_{\text{Error}}/(N-k)},$$

show that, if $\mu_1 = \mu_2 = \ldots = \mu_k \equiv \mu$ (i.e., the null hypothesis is true), then $F \sim F_{k-1,\,N-k}$.

For simplicity we will consider the case of a "balanced design," where $r_1 = r_2 = \ldots = r_k \equiv r$, implying $N = rk$. We will tackle the proof as follows:

A. Prove that if $W = U + V$, where $W \sim \chi^2_p$ and $V \sim \chi^2_q$ with $p > q$ and $U \perp V$, then $U = W - V \sim \chi^2_{p-q}$.
B. Prove that $SS_{\text{Error}}/\sigma^2 \sim \chi^2_{N-k}$.
C. Prove that $SS_{\text{Total}}/\sigma^2 \sim \chi^2_{N-1}$.
D. Prove that $SS_{\text{Treatment}} \perp SS_{\text{Error}}$.
E. Put it all together.

(continued)

The central limit theorem

One of the most important theorems in statistics, the central limit theorem (CLT) guarantees normality of $\bar{Y}$ for large n, no matter what distribution the individual $Y_i$ themselves came from. Here is the theorem in all its glory:

Let $Y_1, Y_2, \ldots, Y_n$ be i.i.d. random variables with $E(Y_i) = \mu$ and $\mathrm{Var}(Y_i) = \sigma^2 < \infty$. Note that no assumptions are made about normality of the individual $Y_i$! Let

$$U_n = \frac{\bar{Y}-\mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{Y}-\mu)}{\sigma}.$$

Then, as $n \to \infty$, $U_n \to_d N(0,1)$.

The statement $\to_d$ means "converges in distribution." Essentially, this means that, as $n \to \infty$,

$$P(U_n \leq u) \to \int_{-\infty}^{u} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\,dt,$$

i.e., the CDF of a standard normal.

A couple of points of clarification. No matter what distribution the $Y_i$ come from:

- No matter what size n is, $E(\bar{Y}) = \mu$ and $\mathrm{Var}(\bar{Y}) = \sigma^2/n$. This is Chapter 4 stuff.
- What the CLT gives us is normality of $\bar{Y}$ for large n.

Before proving this, let's investigate the CLT via simulation. We'll take repeated samples of EXP($\beta = 5$) random variables, of various sizes. Note that here:

- $\mu = E(Y_i) = \beta = 5$
- $\sigma^2 = \mathrm{Var}(Y_i) = \beta^2 = 25$
- $E(\bar{Y}) = \mu = 5$ for all n
- $\mathrm{Var}(\bar{Y}) = \sigma^2/n = 25/n$ for all n
- $\bar{Y}$ is approximately normal for large n only

get.one.ybar <- function(n){
  one.sample <- rexp(n, rate = 1/5)
  ybar <- mean(one.sample)
  return(ybar)
}

set.seed(12345)
many.ybar.n2  <- replicate(1000, get.one.ybar(n = 2))
many.ybar.n5  <- replicate(1000, get.one.ybar(n = 5))
many.ybar.n20 <- replicate(1000, get.one.ybar(n = 20))
many.ybar.n50 <- replicate(1000, get.one.ybar(n = 50))
df <- data.frame(many.ybar.n2, many.ybar.n5, many.ybar.n20, many.ybar.n50)

apply(df, 2, mean) #Should all be ~5:

##  many.ybar.n2  many.ybar.n5 many.ybar.n20 many.ybar.n50 
##      4.928742      5.009634      5.001159      4.974126

apply(df, 2, var) #Should be decreasing:

##  many.ybar.n2  many.ybar.n5 many.ybar.n20 many.ybar.n50 
##    11.8196198     4.6614922     1.1739804     0.4983923

library(tidyr)
df2 <- gather(df, key = 'SampleSize', value = 'ybar')
xseq <- seq(-5, 15, l = 1000)
df2$xseq <- rep(xseq, 4)
df2$yseq <- c(dnorm(xseq, mean = 5, sd = sqrt(25/2)),
              dnorm(xseq, mean = 5, sd = sqrt(25/5)),
              dnorm(xseq, mean = 5, sd = sqrt(25/20)),
              dnorm(xseq, mean = 5, sd = sqrt(25/50)))
ggplot(data = df2) +
  geom_histogram(aes(x = ybar, y = ..density..), binwidth = .5) +
  geom_line(aes(x = xseq, y = yseq), color = 'red', size = 2) +
  facet_wrap(~SampleSize) +
  xlim(c(-5, 15))

Proof of the CLT: preliminaries

To prove the CLT, we will use the method of MGFs. Before we embark, recall a couple of important definitions and facts from calculus:

Definition: A function $f(n)$ is $o(n)$ ("little oh of n") if it goes to 0 faster than $1/n$ does; more precisely, if $\lim_{n\to\infty} n f(n) = 0$.

Examples: $f(n) = \frac{1}{n^2}$ is $o(n)$; $f(n) = \frac{1}{n}$ is not $o(n)$.

We also need the following facts:

Fact #1, from calculus: For any t, $\left(1 + \frac{t}{n} + o(n)\right)^n \to e^t$.

Fact #2, from earlier this semester: Let $M_Y(t)$ be the MGF of Y; then $M_{aY+b}(t) = e^{bt} M_Y(at)$.

Fact #3, from earlier this semester: If $Y_1, Y_2, \ldots, Y_n$ are i.i.d. and $S_n = \sum_{i=1}^{n} Y_i$, then $M_{S_n}(t) = \left[M_Y(t)\right]^n$.

Given these facts, here is what we want to prove:

The CLT, technically stated: Let $Y_1, Y_2, \ldots, Y_n$ be an i.i.d. sample with $|E(Y)| = |\mu| < \infty$ and $0 < E(Y^2) < \infty$. Let

$$U_n = \frac{\bar{Y}-\mu}{\sigma/\sqrt{n}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(Y_i-\mu)}{\sigma/\sqrt{n}} = \frac{\sum_{i=1}^{n}(Y_i-\mu)}{\sqrt{n}\,\sigma} = \frac{\sum_{i=1}^{n} X_i}{\sqrt{n}\,\sigma},$$

where $X_i = Y_i - \mu$. Show that $M_{U_n}(t) \to e^{t^2/2}$ as $n \to \infty$, where $e^{t^2/2}$ is the MGF of a $N(0,1)$ distribution, hence showing that $U_n \to_d N(0,1)$.

PROOF:

Proof of the CLT, continued
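The continuation of the proof is cut off in this copy of the notes. As a sketch of how the MGF argument typically proceeds from here, using Facts 1 through 3 above (a reconstruction, not the notes' own continuation):

By Fact #3 applied to the i.i.d. variables $X_i = Y_i - \mu$, together with Fact #2 (taking $a = \frac{1}{\sqrt{n}\,\sigma}$ and $b = 0$):

$$M_{U_n}(t) = \left[M_X\left(\frac{t}{\sqrt{n}\,\sigma}\right)\right]^n.$$

Expanding $M_X$ in a Taylor series about 0, with $M_X(0) = 1$, $M_X'(0) = E(X) = 0$, and $M_X''(0) = E(X^2) = \sigma^2$:

$$M_X\left(\frac{t}{\sqrt{n}\,\sigma}\right) = 1 + \frac{t^2}{2n} + r_n, \quad \text{where } n\,r_n \to 0, \text{ i.e., } r_n \text{ is } o(n) \text{ in the sense defined above}.$$

Then, by Fact #1 with $t^2/2$ playing the role of t:

$$M_{U_n}(t) = \left(1 + \frac{t^2/2}{n} + o(n)\right)^n \to e^{t^2/2},$$

which is the MGF of a $N(0,1)$ distribution, so $U_n \to_d N(0,1)$.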