East Tennessee State University



R for MATH1530- All the sections and the special sectionR is a very powerful programming language to perform statistical analysis. Here we restrict ourselves to commands that are useful for the topics we learn in any section of MATH1530. R is free and students can download it to their own computers at home. R is also available in all the computers of the university and through citrix. Usually there is more than one way to do something with statistical software, we will focus on the most simple way in each case.Data files used (they are ANSI text files) : pulserate.txt, CHCH2014.txt, drugsurv.txt available from right click on the name of the files and save them in your computer or Z: or Q: drive.STARTINGWhere can I get R from ? there are many free books and manuals available there too. Updated versions are frequently available.How to make my life easier?With File>Change directory indicate where your data files (if any ) are. In that way you don’t need to specify the drive later when you read filesIf you want to keep the commands in a document, save them in a text file (you could use the extension .txt or .R ) instead of a Word .doc file because sometimes Word changes the font of the quotes and R does not recognize them. Remember R is case sensitiveREADING DATAHow to read data? You create an object (any name you want) and assign there the data but you can input the data in several waysTyping data in the session window. This is practical if we have few observationsnbschips<-c( 27,15,15,16,16,24, 27,23,26,22,22,18,22,22,20,20,20,24,24,25,30, 27, 20)Reading the data from a file with a single column of numbers. As example we will use the pulse rate of 210 students stored in, save the txt file in the directory that you indicated to R that your data were going to be in and read it from Rpulse<-scan('pulserate.txt')Reading the data from a file with several variables (you can use simple or double quotes)cookies<-read.table("chch2014.dat",header=TRUE)attach(cookies) ## to attach the names of variables to the datanames(cookies) ## to look at the name of the variablesTyping the data in a worksheet. First create an empty data framemydata<-data.frame() then use Editor>data editor and type the name to see the worksheet appear, type in the dataWhen done, close the data window. To save the data in a file type write.table(mydata,'nameoffile')PLOTTING AND CALCULATING BASIC STATISTICSHistograms, boxplots and stem and leaf displayshist(pulse)boxplot(pulse)stem(pulse)you can even make them prettier by adding colorhist(pulse,col='salmon')Descriptive statisticsmean(pulse)median(pulse)sd(pulse)var(pulse)summary(pulse)Comparing groups (cookies data)boxplot(chips~brand)by(chips,brand,summary) ## gets means + five number summaryby(chips,brand,sd) ## calculates standard deviations per groupby(chips,brand,mean) ## calculates the mean per groupScatter plot , correlation & regression (example altitude of residence & red blood cells)x<-c(0,1840,2200,2200,5000,5200,5750,7400,8650,10740,12000,12200,12300,14200,14800,14900,17500) y<-c(4.93,4.75,5.40,4.65,5.42,6.55,5.99,5.39,5.44,5.82,7.50,5.67, 6.31,7.05,6.46,6.66,7.37) plot(x,y) cor(x,y)lm(y~x)abline(lm(y~x))of course we can make the plot prettier controlling the size, type and color of the icons used, the title of the plot etc. When you want to know all the options you have type ‘help(plot)’ Tables and plots for one categorical variabletable(brand)pie(table(brand))barplot(table(brand))Tables and plots for two categorical variablesmydrugs<-read.table('drugsurv.dat',header=TRUE) attach(mydrugs) table(GENDER,Marijuana)barplot(table(GENDER,Marijuana))barplot(table(GENDER,Marijuana),beside=TRUE)PROBABILITY DISTRIBUTIONSThere are 4 magic letters with regard to probability distributions in R : d, p, q ,rd calculates f(x) (or p(x) in the case of discrete distributions)p calculates the cumulative probabilityq calculates the quantile for a given probability r generates random numbers from a distributionNormal Distributiondnorm(x,u,s) calculates f(x) for a normal with mean u and standard deviation spnorm(x,u,s) calculates P(X≤x) qnorm(p,u,s) calculates the value of x such that P(X≤x)=prnorm(n,u,s) generates n values from the N(u,s) distributionBinomial Distributiondbinom(x,n,p) calculates p(x) for a binomial with parameters n and ppbinom(x,n,p) calculates P(X≤x) qbinom(p,n,p) calculates the value of x such that P(X≤x)=prbinom(m,n,p) generates m values from the B(n,p) distributionIf you want to create a binomial table, for example, for n=10 and p=0.37 , you just typex<-0:10 ## to create the sequence of values from 0 to 10px<-dbinom(x,10,0.37) ## to calculate the probabilities for each value of xmytable<-cbind(x,px) ## to form a table by binding the 2 columsmytable ## to see the tableTESTING HYPOTHESES AND CONFIDENCE INTERVALSSelecting a random samplek<-1:1000 ## creating a sampling frame for 1000 individualsmysample<-sample(k,20) ## selecting a random sample of size 20T-test for mean. Example : Ho: u=25 Ha: u≠25nbschips<-c( 27,15,15,16,16,24, 27,23,26,22,22,18,22,22,20,20,20,24,24,25,30, 27, 20)t.test(nbschips,alternative=c("two.sided"),mu=25)Paired T-test (example: pressure tolerated before and after treatment with cherry juice)before<-c(2.3,2.6,2.5,2,2.4,2.4,2.1,2.5,2,2.2)after<-c(4.3,4.6,4.9,3.8,4.3,4.2,4.1,4.0,3.9,4.3)t.test(before,y=after,alternative=c("less"),paired=TRUE)Two sample t-test. Ho: u1=u2 Ha: u1 ≠u2cookies<-read.table("chch2014.dat",header=TRUE)attach(cookies) ## to attach the names of variables to the data t.test(chips~brand, alternative=c("two.sided"),mu=0,paired=FALSE) Test for one proportion Assume that you want to test Ho:p=0.25 vs Ha: p<0.25, and that in a sample of 2466 you find 574 successesbinom.test(574,2466,p=0.25,alternative= c("less"))Test for two proportions Ho: p1=p2 Ha:p1>p2, data are from the polio vaccine exampleprop.test(x=c(142,52),n=c(200000,200000), alternative= c("greater"))Goodness of fit testobs<-c(315,108,101,32) # read observed frequenciesprob<-c(9/16,3/16,3/16,1/16) ## model probabilitieschisq.test(obs,p=prob)Chisquare test of independence or homogeneity (for raw and tabulated data) chisq.test(table(GENDER,Marijuana)) ## for raw data thetable<-matrix(c(315,108,101,32),nc=2) ## if given a table enter counts by columnschisq.test(thetable)Test of normality shapiro.test(nbschips)qqnorm(nbschips)qqline(nbschips)In the special sections of MATH 1530, we use R for other topics as well, in particular for bootstrapping and randomization test. Note: R code to apply bootstrapping and randomization test available from Here is a short review of the material of bootstrap and randomization test that includes the code in RRandomization testWe introduce the randomization test to compare two populations using a hands on activity and then we apply it by using R-code that mimics the hands-on activity. To apply it to other data sets they simply have to change the data. Randomization tests are used in introductory statistics courses in several parts of the USA and there is even a user friendly software statkey developed by the authors of ‘Unlocking the power of Statistics’ (by 5 members of the Lock family) that they generously put it in the internet to be used by anybody.The phytate story The data The activityWe can not base our conclusions in only 20 regroupings, we can do many more using the computer.R-code available at HYPERLINK "" code:## RANDOMIZATION TEST to compare groups A and B## when the alternative is Ha:u(A)>u(B)A<-c(6.07,7.20,6.61,9.69, 9.45,8.95,8.72,6.07)B<-c(2.57,2.39,2.51,2.57,1.80,2.37,6.28,2.91)nrep<-10000 # number of regroupings## NO MORE INPUT IS NECESSARYn1<-length(A) ## number of observations in An2<-length(B) ## number of observations in Bn<-n1+n2 ## total number of observationsmeanA<-mean(A) ; meanB<-mean(B) truedif<-meanA-meanB ## difference of the two means## performs the randomization testalldata<-append(A,B) ## merges the two groups chips<-1:n ## creates n chipsdifmeans<-double(nrep) ## creates storage space for (i in 1:nrep){ ## commands will be executed 10000 timeschipsgroup1<-sample(chips,n1,replace=FALSE) ## selects group 1chipsgroup2<-chips[-chipsgroup1] ## rest goes to group 2group1<-(alldata[chipsgroup1]) ## values in group 1group2<-(alldata[chipsgroup2]) ## values in group 2difmeans[i]<-mean(group1)-mean(group2) ## difference in means }## assigns value 1 to the reg-roupings in which the ## difference of means is equal to or greater than the difference ## the A and B groupsnumgreater<-difmeans>= truedif ## for two-sided alternative hypothesis## use abs(difmeans)>= abs(truedif) in previous line## calculates the proportion or approximated p-value pvalue<-mean(numgreater) ; pvaluestripchart(difmeans, method='stack',pch=1,at=0,main=" ",ylim=c(-0.3,2)) arrows(truedif,-0.2,truedif,0,col="red",lwd=2) text(truedif,-0.25,c(truedif)) BootstrappingThe rhododendron storyThe sampleSampling with replacement (from the sample, using the same n) and ‘bootstrap replicates’ of the statistic The idea of the percentile confidence interval In this table it is very easy for the students to understand why when you increase confidence the interval becomes wider.There are packages in R to do boostrapping, you can also use statkey, but we use our code on beauty of bootstrap, you can build confidence intervals for things other than the mean without need of formulas, just change the statistic in the code. You can also change the data to use the code for another example)The code## PERCENTILE BOOTSTRAP CONFIDENCE INTERVAL FOR THE MEAN## INPUT :x<-c(9.70,10.40,8.90,10.70,9.85,10.30,9.00 ,10.00,10.45,8.65)nboot<-2000 ## number of bootstrap replicatesconfidence<-0.95 ## NO MORE INPUT IS NEEDEDn<-length(x)sampm<-double(nboot) ## prepares storage space for (i in 1:nboot){ ## repeat 2000 timessubsam<-sample(x,n,replace=TRUE) ## bootstrap samplesampm[i]<-mean(subsam) ## stores the sample mean }## calculates the endpoints of the confidence intervallow<-quantile(sampm,(1-confidence)/2) hi<-quantile(sampm,confidence+(1-confidence)/2) hist(sampm, xlab= "bootstrap replicates ", main=" ") ## marks the endpoints of the intervalarrows(low, 200,low,0,col="red") arrows(hi, 200,hi,0,col="red") ## prints the intervallow; hi## prints mean and sd for sample and bootstrap replicatesmean(x); sd(x)mean(sampm); sd(sampm)Using confidence intervals to test hypothesesIf we use bootstrap to calculate confidence intervals for correlations , the bootstrap samples we need to sample pairs (not the variables y and x separately)## BOOTSTRAPPING THE CORRELATION COEFFICIENT## INPUTx<-c(0,1840,2200,2200,5000,5200,5750,7400,8650,10740,12000,12200,12300,14200,14800,14900,17500)y<-c(4.93,4.75,5.40,4.65,5.42,6.55,5.99,5.39,5.44,5.82,7.50,5.67, 6.31,7.05,6.46,6.66,7.37) nboot<-2000 ## number of bootstrap replicatesconfidence<-0.95## NO MORE INPUT IS NEEDEDn<-length(x) # counts number of observationswho<-seq(1,n,by=1) #creates a list labels 1:nrboot<-double(nboot) # creates storage space## now we select random samples from the labels## and identify the corresponding values of x and yfor (i in 1:nboot){subsam<-sample(who,n,replace=TRUE) xwho<-(x[subsam])ywho<-(y[subsam])rboot[i]<-cor(xwho,ywho) }low<-quantile(rboot,(1-confidence)/2) ## lower end of intervalhigh<-quantile(rboot,confidence+(1-confidence)/2) ## upper end par(mfcol=c(1,2)) ## 2 plots in figureplot(x,y,pch=19) ## scatterplothist(rboot, main= ' ',xlab= 'bootstrap replicates of r',xlim=c(0,1))arrows(low,-100,low ,0,col="red") arrows(high,-100, high ,0,col="red") low;high ## prints the intervalRemember that hypothesis about the value of one parameter can be tested using a confidence interval for the parameter. In the example below Ho:ud=0 and a confidence interval for the mean difference is calculated that obviously does not include the value 0 and thus we reject the null hypothesis. Edith Seier – September 18-2015 ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download