Stat 421, Fall 2007



Stat 421, Fall 2008

Fritz Scholz

Homework 4 Solutions

Problem 1: Write a function that does the tasks below (data import, plotting and computing, with appropriate readline(“hit return\n”) interspersed when you want to save plots.

a) Import the csv-file FLUX.csv (see class web page under Data Files) into R. Make sure that FLUX.csv is in the directory from which R is launched.

Separate out the two data sets of SIR under FLUX=X and FLUX=Y, call them SIR.X and SIR.Y,

respectively. Then tighten the spread of SIR.X by a factor .5, i.e., replace each SIR.X by

SIR.Xp = mean(SIR.X)+(.5)*(SIR.X-mean(SIR.X)), and leave the SIR.Y values unchanged. Round the SIR.Xp to one decimal, i.e., SIR.Xp=round(SIR.Xp,1). Treat SIR.Xp and SIR.Y as what was given to you originally for the following analysis.

b) Compare SIR.Xp and SIR.Y via boxplots.

c) Find the vector ref.dist representing the full randomization reference distribution for the test statistic that takes the difference of averages of the two samples resulting from all possible splits of the 18 SIR.Xp and SIR.Y values into 9 and 9 values. Round that vector to 6 decimals and sort it (call it again ref.dist). Using the diff function get the vector of gaps in the ref.dist vector and find the smallest positive gap delta between adjacent values of ref.dist (note min(x[x>0]) gives you the smallest positive value in the vector x). Illustrate this reference distribution as a histogram using the option probability=T and with breaks vector set to breaks=br,

where br is computed as follows:

m=min(ref.dist)

M=max(ref.dist)

br=unique(c(seq(m-delta/2,M+delta/2,delta),M+delta/2))

This intends to center the histogram bar intervals on the distinct possible values of ref.dist.

Note that br=seq(m-delta/2,M+delta/2,delta) will not work since M+delta/2 is not in the vector br because of computing inaccuracies, as illustrated in the following two examples

> seq(1,2.499,1/6)

[1] 1.000000 1.166667 1.333333 1.500000 1.666667 1.833333 2.000000 2.166667 2.333333 where 2.499 is left out, in contrast to

> seq(1,2.5,1/6)

[1] 1.000000 1.166667 1.333333 1.500000 1.666667 1.833333 2.000000 2.166667 2.333333 2.500000

Give the resulting histogram with superimposed appropriate normal density curve and with the annotation requested in d).

d) Find the two-sided p-value for the observed difference of means, DIFF.obs, of the originally given SIR.Xp and SIR.Y samples. Try to deal with the issue of > versus >=.

Indicate the value of +/-DIFF.obs on the above histogram by vertical lines (using abline) and annotate these lines by the two-sided p-value of DIFF.obs.

e) Discuss the result in view of lecture slides 52-61. Does the result fit in with what was shown there?

f) Also, compute the two-sided p-value based on a simulated (with replacement) subsample of size 10000 generated from the full reference distribution. Prior to the start of the simulation you should execute set.seed(35), so that everyone hopefully gets the same result. ( Note: You do not sample directly from the full reference distribution (since that would defeat the practical idea of sampling), but you get such samples by random splits of c(SIR.Xp,SIR.Y) and by computing the relevant statistic for each such split and by looping over this process 10000 times.)

Give a QQ-plot of these simulated values against the full reference distribution

g) Provide the function code that you used.

Solution:

a) see function code below in g)

b)

[pic]

c) and d)

[pic]

e) the p-value of .01148 is larger than the .0004525, that was obtained when both samples were scaled down by a factor of .5.

.01148 is smaller than .02587 that was obtained for the original data. Given that we reduced the spread in one sample by a factor of .5 the separation between the two samples is more pronounced (see boxplots), hence the smaller p-value. Thus the result fits in quite well.

f)

[pic]

g) The code to produce the above results is

SIR.analysis=function (rho=.5,Nsim=10000)

{

# import data

FLUX=read.csv("FLUX.csv",header=T)

# break out SIR.X and SIR.Y and create SIR.Xp

SIR.X=FLUX$SIR[FLUX$FLUX=="X"]

SIR.Y=FLUX$SIR[FLUX$FLUX=="Y"]

SIR.Xp=mean(SIR.X)+rho*(SIR.X-mean(SIR.X))

SIR.Xp=round(SIR.Xp,1)

# make the boxplot of SIRXp and SIR.Y.

boxplot(SIR.Xp,SIR.Y,names=c("SIR.Xp","SIR.Y"),ylab="SIR")

readline("hit return\n")

# compute the observed difference of means

Diff.obs=mean(SIR.Y)-mean(SIR.Xp)

# compute the refrence distribution vector

ref.dist=combn(1:18,9,FUN=function(ind,y){mean(y[ind])-mean(y[-ind])},

y=c(SIR.Xp,SIR.Y))

ref.dist=round(ref.dist,6)

# computation for the proper breaks

xx=diff(sort(ref.dist))

delta=min(xx[xx>0])

m=min(ref.dist)

M=max(ref.dist)

br=unique(c(seq(m-delta/2,M+delta/2,delta),M+delta/2))

# histogram of the reference distribution.

out=hist(ref.dist,breaks=br,col=c("blue","orange"),

main="randomization reference distribution",

xlab=expression(bar(SIR)[Xp]-bar(SIR)[Y]))

# calculation of the two-sided p-value.

pval=mean(abs(ref.dist)>=.999999*abs(Diff.obs))

#

# pval=mean(abs(ref.dist)>=abs(Diff.obs))

#

#

# using .999999 instead of 1 and getting the same result

# makes sure that we don't have rounding issues.

#

#plot annotations

abline(v=c(-abs(Diff.obs),abs(Diff.obs)))

if(Diff.obs=.999999*abs(Diff.obs))

#

#pval.est=mean(abs(sim.ref.dist)>=abs(Diff.obs))

#

# using .999999 instead of 1 and getting the same result

# makes sure that we don't have rounding issues.

#

text(-2,1.5,paste("estimated two-sided p-value =",format(signif(pval.est,5))),

adj=0,cex=1.3)

abline(0,1)

pval.est

}

Problem 2: Using SIR.X (not SIR.Xp) and SIR.Y from the previous problem, do the randomization analysis using the difference in sample medians as test statistic, see ?median. Give the histogram (use breaks=204) of the full reference distribution annotated with the two-sided p-value for the actually observed difference in medians obtained from the original split SIR.X and SIR.Y. Can you think of reasons why the reference distribution looks so different compared to the situation where we took the difference of averages as test statistic?

Repeat this analysis with a simulated reference distribution using 10000 simulated splits and compute the p-value from this simulated distribution. Use set.seed(147)just prior to the simulation. Give the QQ-plot of the simulated and full reference distributions. Provide the function code for your analyses and plots. Note that the analysis for the full reference distribution takes more time with medians than it takes for means.

Solution:

[pic]

The spectrum of possible values for the difference of medians is a lot smaller than for the difference of averages. If values below the medians of both samples are exchanged between samples, nothing changes in the difference of the medians. The same holds for exchanging values above both sample medians. Thus for any split there are many additional configurations that give the same difference in medians. Such changes would produce typically different values when using differences of averages, hence the richer spectrum of values for the latter. For example, when we look at all 18 sorted values:

7.5, 7.7, 8.2, 8.6, 8.8, 9.0, 9.3, 9.7, 9.9, 10.0, 10.1, 10.3, 10.6, 10.7, 11.5, 11.5, 11.6 12.6 we only see one tie at 11.5, but since they are among the 4 largest observations we can never have a median of 9 values which is 11.5. The closest we can make the two sample medians is by choosing one median to be 9.9 and the other to be 10.0, then the difference is either .1 or -.1. The 8 observations below 9.9 can be split in choose(8,4)=70 ways and the 8 observations above 10.0 can also be chosen in choose(8,4)=70 ways. Thus we can make 70 x 70 = 4900 splits where the sample median difference is .1 and also 4900 splits with sample median difference = -.1. This is reflected in the middle two bars. The next closest median difference is .2 or -.2, and one can make similar counts. The differences of such medians is a multiple of .1 and is always a difference of two numbers taken from 8.8, 9.0, 9.3, 9.7, 9.9, 10.0, 10.1, 10.3, 10.6, 10.7. There are choose(10,2)=45 such positive differences and 45 corresponding negative differences. However, some of these difference give the same value, e.g., 9.3-9.0=10.6-10.3. Recall that the gaps between difference of averages were multiples of .022222, a much finer grid.

[pic]

The code is:

SIR.analysis.med=function (rho=.5,Nsim=10000)

{

# import data

FLUX=read.csv("FLUX.csv",header=T)

# break out SIR.X and SIR.Y

SIR.X=FLUX$SIR[FLUX$FLUX=="X"]

SIR.Y=FLUX$SIR[FLUX$FLUX=="Y"]

# observed difference of medians

Diff.obs=median(SIR.Y)-median(SIR.X)

# reference distribution vector

ref.dist=combn(1:18,9,FUN=function(ind,y){median(y[ind])-median(y[-ind])},

y=c(SIR.X,SIR.Y))

# histogram

hist(ref.dist,nclass=204,col=c("blue","orange"),

main="randomization reference distribution",

xlab=expression(median(SIR[X])-median(SIR[Y])))

# calculation of p-value

pval=mean(abs(ref.dist)>=.999999*abs(Diff.obs))

#

# pval=mean(abs(ref.dist)>=abs(Diff.obs))

#

#

# using .999999 instead of 1 and getting the same result

# makes sure that we don't have rounding issues.

#

# annotations

abline(v=c(-abs(Diff.obs),abs(Diff.obs)))

text(1.02*abs(Diff.obs),4500,"two-sided",adj=0,cex=1.3)

text(1.02*abs(Diff.obs),4200,"p-value",adj=0,cex=1.3)

text(1.02*abs(Diff.obs),3900,paste("p =",format(signif(pval,5))),adj=0,cex=1.3)

text(1.02*abs(Diff.obs),1500,"observed",adj=0,cex=1.3)

text(1.02*abs(Diff.obs),1250,"absolute",adj=0,cex=1.3)

text(1.02*abs(Diff.obs),1000,"difference",adj=0,cex=1.3)

text(1.02*abs(Diff.obs),750,format(signif(abs(Diff.obs),6)),adj=0,cex=1.3)

readline("hit return\n")

# initializing simulation

set.seed(147)

sim.ref.dist=rep(0,10000)

SIR=c(SIR.X,SIR.Y)

for(i in 1:Nsim){

ind=sample(1:18,9,replace=F)

sim.ref.dist[i]=median(SIR[ind])-median(SIR[-ind])

}

# QQ-plot

qqplot(ref.dist,sim.ref.dist,

ylab="simulated reference distribution",

xlab="full reference distribution",

pch=16,cex=.5,col="red")

# estimated p-value

pval.est=mean(abs(sim.ref.dist)>=.999999*abs(Diff.obs))

#

# pval.est=mean(abs(sim.ref.dist)>=abs(Diff.obs))

#

# using .999999 instead of 1 and getting the same result

# makes sure that we don't have rounding issues.

#

# annotation

text(-2,1.5,paste("estimated two-sided p-value =",format(signif(pval.est,5))),

adj=0,cex=1.3)

abline(0,1)

pval.est

}

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download