


Statistical Analysis with R

Ann Arbor Chapter,

American Statistical Association

Instructor:

Brady T. West

(bwest@umich.edu)

September 21, 2009

Table of Contents

1. Introduction: What is R?

2. How to Obtain R

• The R Project Web Page

• Adding Packages to R

• Starting R / Loading Contributed Packages

3. Help Tools

4. Importing / Exporting Data Sets

• Objects in R

• Importing Data

• read.table()

• Other Useful Importing Functions

• Exporting Data

5. Data Manipulation / Data Management

• Using R as a Calculator

• Vectors

• Matrices

• Data Frames

6. Graphical Techniques

7. Descriptive Statistical Analyses

• Obtaining Univariate Statistics

• Bivariate Associations

8. Multivariate Statistical Analyses / Modeling

• Multiple Regression / Linear Models

• Using the lm() Function

• Other Useful Modeling Functions

9. Creating Functions / Programming

• Example: Bootstrapping

10. Useful References

1. Introduction: What is R?

The software package known as R is an interactive computing language and environment for statistical analysis, computing, and graphics. R is an open source software package: the source code behind the software is free for all to examine, modify, and extend, and R in fact grows by leaps and bounds as people from all fields develop new functions for use within R’s computing environment. This is a large part of what makes R so useful! Many complex statistical routines not available in other software packages have been programmed in R, and these routines are freely available for use by anyone.

R is completely interactive; users type commands and program functions as they go. The software is similar in many ways to the commercial package S-Plus, and offers many of the same features; R, however, can be downloaded for free. S-Plus may be slightly easier to use than R, but after this workshop, you should be familiar enough with R and how it functions to do pretty much anything that you would like to do without a hassle!

The software provides users with a wide array of powerful and enlightening graphical techniques, and this is why many researchers love using R; the graphical capabilities are tremendous, and easy to implement. Once you are able to grasp how to work with R’s graphical facilities, you will have a limitless supply of graphical tools at your fingertips that will enhance the appearance of your research presentations in many ways.

We strongly encourage you to visit the central web site behind the R project, which I will frequently refer to throughout this workshop:

http://www.r-project.org/

Here you will find links for downloading R, downloading additional packages for R, and everything else that you would like to know about the software or the people behind it.

2. How to Obtain R

The R Project Web Page

At the R Project Web Page, you will find a variety of information about the R Project, which you can peruse at your leisure. The most important link appears on the left-hand side of the screen, under the “Download” heading. Click on the CRAN link (Comprehensive R Archive Network), and after you choose one of the U.S. mirrors, you will be taken to the page that you will use to download everything R-related.

Once you find the CRAN web page, take the following steps to obtain R:

1. Click on the “R Binaries” link on the left-hand side of the page under the “Software” heading.

2. Click on the folder that best describes your operating system.

3. When using Windows, click on the “base” subdirectory. This will allow you to download the base R package.

4. Click on the r-X.X.X-win32.exe link. R is updated quite frequently, and the version number is always changing (at the time of this printing, Version 2.9.2 is available). Save the .exe file somewhere on your computer.

5. Double-click on the .exe file once it has been downloaded. A wizard will appear that will guide you through the setup of the R software on your machine.

6. Once you are finished, you should have an R icon on your desktop that gives you a shortcut to the R system. Double-click on this icon, and you are ready to go!

Adding Packages to R

At step 3 above, you also have the option of clicking on the “contrib” subdirectory. Doing this will allow you to download additional contributed packages for R. So what exactly are “additional contributed packages”? As mentioned in the introduction, R is an open source software package, meaning users of R are free to explore the code behind the software and write their own new code. Several statisticians and researchers have written additional packages for R that perform complex, less common analyses, and in order to use these packages and the functions within them, you need to first download them. The base R installation comes with several additional packages, but odds are that you will discover an uncommon analysis technique in your research that requires you to install an additional package that is not available with the base installation. There are many additional contributed packages, so don’t hesitate to explore them to see if someone has developed a package that will allow you to implement a technique that you are interested in!

To download contributed packages, follow steps 1 and 2 above, and then click on the “contrib” link. Then, follow these steps:

1. Select the version of R that you are using (the newest version for Windows at the time of this workshop is Version 2.9.2).

2. Scroll through the list of contributed packages (in .zip format), and click on the package that you would like to download. You can find descriptions of all of these contributed packages and the techniques implemented within them by clicking on the “Packages” link under the “Software” heading on the CRAN web page. This page will also have links to help manuals for the packages.

3. Save the .zip file in a directory on your machine that you can remember.

4. When using R, select Install package(s) from local zip files… from the Packages menu. Locate the .zip file for the package that you downloaded onto your machine, click on Open, and R will install that package so that it is ready for use.

5. The package will now be ready to use when you start R!

Exercise

Let’s try to download the “quantreg” package and add the package to R. What exactly will this contributed package allow you to do? Check the website provided, which contains a description of the package.

FAQs on the CRAN Web Page

Under the “Documentation” heading on the left-hand side of the CRAN web page, click on the “FAQs” link. This will allow you to see an FAQ page that will answer many of the most commonly asked questions about R. You will find that this section will provide answers to many of your questions, whether they are simple or difficult.

Searching on the CRAN Web Page

Under the “CRAN” heading on the left-hand side of the CRAN web page, you can click on the “Search” link. Although there is no formal search engine on the CRAN web page, this will take you to a set of links allowing you to search the R archives (manuals, mail, help files, etc.) for anything that you would like. This is often useful if you are faced with a tough analysis question, and you want to see if another R user has addressed the question before.

Starting R / Loading Contributed Packages

At this point (if you haven’t already), you should be able to start R! If you asked for a shortcut to R to be created on your desktop, simply double click on the R icon to start R. This will open the RGui (Graphical User Interface). You should see a window inside the RGui containing the R Console. This is where you will specify all of your commands and programs interactively, at the red command prompt.

For an example command, we’ll load a contributed package into R for use. Let’s download the “quantreg” package from the CRAN mirror and save it to the desktop, and then install the package as described above. After the package has been installed, simply type library(quantreg) at the command prompt:

> library(quantreg)

Press enter after you type this command to submit the command to R. If you don’t see anything aside from another command prompt, the library was loaded successfully, and you can use all of the functions associated with it! If you see the error message

Error in library(quantreg) : There is no package called 'quantreg'

you did not extract the quantreg package correctly (see the “Adding Packages to R” section above). A contributed package must be downloaded and extracted into the R library folder correctly in order for you to use it.

In this case, we see the following error message:

Loading required package: SparseM

Error: package 'SparseM' could not be loaded

In addition: Warning message:

there is no package called 'SparseM' in: library(pkg, character.only = TRUE, logical = TRUE, lib.loc = lib.loc)

Many packages in R require other contributed packages in order to work. This error message indicates that we also need to download and install the “SparseM” package from the CRAN website. Let’s do that now, and then try to re-load the “quantreg” package. We no longer see an error message!

An even quicker way to install packages is to simply select “Install package(s)…” from the Packages menu. You can pick a CRAN mirror, and then directly install a package and all of its related components. This is probably the quickest way to install a package. You would still need to load the package in order to use its functions.

This is how you load contributed packages into R for your personal use. When you submit a command to R, you will either see nothing but another command prompt (good), a result (good), or an error message (bad).

You are now ready to use R!

3. Help Tools

In most well-written statistical software packages, help is never far away. This holds true for R. Although the help is somewhat technical in nature and requires a good understanding of the R language, it is very easy to access.

Once you’ve gained some experience in working with the R language, 90% of your help questions will be directed at how particular functions in R work, what arguments they take, etc. For help on ANY function in R that is a part of a package that has already been loaded, simply type and submit

> help(function.name)

in the R console, where function.name is the name of the function that you would like to see a help window for. Try typing help(lm)!

If you have typed the correct name of a function that belongs to a package that has already been loaded into R, you will see a help window pop up that describes the function, its arguments, and what the function returns, and also presents some examples of using the function. Oftentimes, there will be contact information for the person or people who wrote the function.

If you would like to see a list of all of the functions that come with the base R package, including brief descriptions of each, you can simply type

> library(help = "base")

to generate the list.

Hint: Don’t forget, R is an open source language! If you want to see exactly how a given function has been written, simply type

> fix(function.name)

to see the code in a program editor. You can copy it, update it, and do whatever else you would like with it. Just make sure not to save any changes to the code behind a function unless you know that they will work!!!

Another easy way of obtaining help via the Internet is to type and submit

> help.start()

Doing so will open up a web-based help system that is very easy to navigate.

A third and obvious way to obtain help is via the Help menu when you are working in the R Console. Here you will find FAQs, help with navigating the console, and most importantly the official R manuals from the authors of R themselves. Again, these are somewhat technical in nature, but very useful once you have been working with R a lot. I would recommend the “Introduction to R” manual very highly.

Finally, don’t hesitate to contact CSCAR (734.764.7828) if you need further assistance with performing your analyses in R!

4. Importing / Exporting Data Sets

The “bank2” Data Set

All of the following examples of using R will be demonstrated using a data set that appears in a variety of formats on the flash drive that came with this workbook. The “bank2” data set contains a variety of information on each of the 474 employees who work for a large bank. The most important first step in using R for statistical analysis is of course to import a data set!

Objects in R

Before you can successfully import a data set, you need to know about objects in R. The entire R computing environment is based on objects. What exactly is an object? Objects take numerous forms:

• Numbers

• Vectors of numbers

• Matrices of numbers

• Results of analyses

• Data sets

• Many others!

The <- operator is used to assign values to objects. For example, we can create an object called “nine” and then submit its name to view it:

> nine <- 9

> nine

[1] 9

R, in this case, returns the value of the object (a number). Many objects (such as results of analyses) are much more complex, and there are ways to look at specific aspects of objects. Fields within objects (e.g. variables within data set objects, or parts of result objects, such as the estimated regression coefficients that come from a regression analysis) can be accessed using the object$field command. Suppose we run a regression analysis, and then want to investigate the resulting coefficients. Submitting the command

> fit <- lm(mo.fail ~ lc, data = data)

> fit

Call:

lm(formula = mo.fail ~ lc, data = data)

Coefficients:

(Intercept) lc

64.139 -9.195

This is not very exciting on the surface. But there is more to this object than meets the eye. For example, suppose we wanted to see the regression coefficients that are a part of this “results” object:

> fit$coef

(Intercept) lc

64.139079 -9.195036

Notice how the object$field command was submitted. The “coef” result is a field located in the “fit” object. If you want to see a list of all possible fields located within an object (for example, object.name), simply type

> names(object.name)

If we want to see a summary of the regression results, we can type

> summary(fit)

Call:

lm(formula = mo.fail ~ lc, data = data)

Residuals:

Min 1Q Median 3Q Max

-42.9440 -2.1639 0.6411 3.8361 23.0311

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 64.1391 1.1357 56.48 <2e-16 ***

A histogram of a continuous variable can be drawn with the hist() function. For example:

> hist(bank2$SALARY)

And you will see the nice histogram below:

[Figure: histogram of SALARY]

We see that the distribution of SALARY is definitely right-skewed (the long tail is to the right), with the majority of the people having a salary less than $40,000. This tells us that SALARY should definitely be transformed (and probably using a natural log transformation) when considering statistical procedures assuming normality. In fact, how would a new variable be created in the bank2 data set object containing the natural log transformation of the SALARY variable for each case?
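As a sketch of an answer to that question, a natural-log version of SALARY can be added to the data set object with a single assignment. The data frame and variable name LNSALARY below are made up for illustration; with the real data you would apply the same command to the bank2 object:

```r
# Sketch: add a natural-log salary variable to a data frame
# (synthetic stand-in for the bank2 data set)
bank2 <- data.frame(SALARY = c(21450, 45000, 30750, 103750))
bank2$LNSALARY <- log(bank2$SALARY)   # log() is the natural log in R
bank2$LNSALARY
```

Assigning to a new field name on the left-hand side of <- is all it takes to add a variable to a data frame.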

Don’t forget that additional options are available in these functions for customizing your plots…check the references and help tools!

Another useful graphical summary of continuous data is the boxplot, which displays the numerical summaries obtained using the summary() function in a graphical format. Single boxplots can be easily drawn in R using the boxplot() function:

> boxplot(bank2$SALARY,main="Boxplot of SALARY")

[Figure: boxplot of SALARY]

Notice that several cases are considered to be outliers in terms of salary. This is more evidence of the need for a natural log transformation when working with these types of data.

Bivariate Associations

Checking for bivariate associations between variables is also useful when performing descriptive statistical analyses of data. When checking for associations between categorical and continuous variables (where categorical variables have been defined as factors using factor() in R), the general-purpose plot() function can be used in the following manner to produce side-by-side boxplots for different groups in the data set:

> plot(cont.variable ~ factor.variable)

So, if we desired to see boxplots of SALARY for each GENDER in the bank2 data set, we would use the following commands:

> bank2$GENDER <- factor(bank2$GENDER)

> plot(bank2$SALARY ~ bank2$GENDER, main="Boxplots of SALARY by GENDER")

These commands produce the nice plot below:

[Figure: side-by-side boxplots of SALARY by GENDER]

Note the marked discrepancy between males and females!!!

If you desire to calculate descriptive statistics within certain groups in the data set, the tapply() function can be used. This function has three primary arguments: first, the vector (or variable) of data that you want to apply the summary function to; second, the vectors (or variables) that combine to define the groups that you want to divide the first vector of data into; and third, the summary function you want to use. So, for an example using the bank2 data set:

> tapply(bank2$SALARY,bank2$GENDER,mean)

f m

26031.92 41441.78

R outputs the resulting SALARY means for each level of the GENDER factor. You could also store these results in a vector object if you so desired. Again, note the large difference between males and females.

Given these descriptive results, one could compare the means in the two groups using the t.test() function. For example:

> t.test(SALARY ~ GENDER, data = bank2)

Welch Two Sample t-test

data: SALARY by GENDER

t = -11.6883, df = 344.262, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-18003.00 -12816.73

sample estimates:

mean in group f mean in group m

26031.92 41441.78

The result of the t-test suggests that the two groups have significantly different means. This version of the t-test automatically incorporates an adjustment to the standard two-sample t-test allowing for unequal variances in the two groups.
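If you prefer the classical pooled-variance version of the two-sample t-test, t.test() accepts a var.equal argument. A small sketch on synthetic vectors (the data here are made up, not from bank2):

```r
# Sketch: pooled two-sample t-test on synthetic data
g1 <- c(10, 12, 9, 11, 13)
g2 <- c(14, 15, 13, 16, 17)
result <- t.test(g1, g2, var.equal = TRUE)  # classical pooled-variance test
result$p.value
```

With var.equal = TRUE, R reports the standard two-sample t-test rather than the Welch-adjusted version.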

One can also perform cross-tabulation analyses (and Chi-square tests, or other measures of association) to check for associations between two categorical variables using the table() function in conjunction with the summary() function. For example, to test for an association between gender and job classification, we could use the following functions:

> ctab <- table(bank2$GENDER, bank2$JOBCAT)

> ctab

clerical custodial manager

f 206 0 10

m 157 27 74

> summary(ctab)

Number of cases in table: 474

Number of factors: 2

Test for independence of all factors:

Chisq = 79.28, df = 2, p-value = 6.098e-18

This test of association clearly suggests that the genders vary in terms of their distribution on employment classification, with a higher percentage of males being managers.

A useful statistic for assessing the strength of a linear relationship between two continuous numerical variables is the correlation coefficient. When conducting preliminary descriptive analyses, a useful display of correlation coefficients is a correlation matrix, indicating the correlations between all continuous variables in a data set. The following commands can be used to create a correlation matrix for a subset of the variables in the bank2 data set:

> bank3 <- bank2[, c("EDUC", "SALARY", "SALBEGIN", "JOBTIME", "PREVEXP")]

> cor(bank3)

EDUC SALARY SALBEGIN JOBTIME PREVEXP

EDUC 1.00000000 0.66055891 0.63319565 0.047378777 -0.252352521

SALARY 0.66055891 1.00000000 0.88011747 0.084092267 -0.097466926

SALBEGIN 0.63319565 0.88011747 1.00000000 -0.019753475 0.045135627

JOBTIME 0.04737878 0.08409227 -0.01975347 1.000000000 0.002978134

PREVEXP -0.25235252 -0.09746693 0.04513563 0.002978134 1.000000000

The cor() function automatically computes the desired correlation matrix, with nothing more than the data set object as an argument. Optionally, you can obtain a covariance matrix (a non-standardized correlation matrix) using the cov() function, or you can obtain a single correlation coefficient summarizing the strength of a linear association between two variables using two vectors as separate arguments to the function:

> cov(bank3)

> cor(bank3$EDUC,bank3$SALARY)

[1] 0.6605589

Users of SEM packages will find this function helpful for outputting objects containing correlation / covariance matrices.

To graphically examine the strength of a linear association between two continuous variables, you can once again use the general-purpose plot() function to produce a scatterplot:

> plot(bank2$SALBEGIN,bank2$SALARY,main="Scatterplot of SALARY vs. SALBEGIN")

[Figure: scatterplot of SALARY vs. SALBEGIN]

To make a simple scatterplot using the plot function, enter the variable that you want on the horizontal axis as the first argument, and the variable that you want on the vertical axis as the second argument.

In this scatterplot, there is clearly a positive linear association between beginning salary and current salary. Keep in mind that if you don’t find a linear association between two continuous variables, this does not mean that the variables are not associated; the relationship may be non-linear, and scatterplots are useful for detecting such relationships!

You can take one step further and make a scatterplot matrix to graphically match the correlation matrix that was computed just a minute ago. To do this, you can use the pairs() function:

> pairs(bank3,main="Scatterplot Matrix")

[Figure: scatterplot matrix of the bank3 variables]

Scatterplot matrices are extremely useful when you are conducting exploratory data analysis, and visually checking for any strong linear associations between your continuous variables. When building regression models, using predictors that are highly correlated will wreak havoc on the standard errors of your regression coefficients, and scatterplot matrices are useful for visually checking for these correlations very quickly. Based on this scatterplot matrix, we see that the strongest linear associations in this data set are between the education and salary variables.

Always make sure to perform these types of descriptive analyses initially!

8. Multivariate Statistical Analyses / Modeling

Before you dive into a more complex analysis of your data, make sure to formulate the problem that you are investigating into statistical terms! Understanding the primary objective of the research that you are engaged in is extremely important, and this understanding often leads to simpler analyses than you were initially considering. Don’t try complex models or analyses when they aren’t necessary! A response/predictor mindset is usually helpful, as many statistical analyses address uncovering relationships in data.

Multiple Regression / Linear Models

Multiple linear regression is a popular statistical technique that aims to assess the relationships of several predictor variables with a primary dependent or response variable. Analysts aim to build linear regression models, where the parameters (or regression coefficients) associated with the predictors enter the model in a linear fashion to predict the response. The predictors do NOT have to be linear themselves; their parameters just need to enter the model in a linear fashion. Most linear models feature a linear combination of the predictors and parameters that looks something like this:

y = β0 + β1x1 + β2x2 + … + βpxp + e

In this notation, y represents the response variable of interest, and x1 ,…, xp correspond to predictor variables, or variables that are thought to have a relationship with the response variable. When building a linear regression model, you aim to estimate the beta parameters, and determine which of the predictor variables have a significant relationship with the response variable. The resulting models can be used for several purposes, the two most common of which are to predict future responses given new values of the predictors and simply to assess relationships.

Fitting a linear model of this form is very simple in R. Many statistical analyses can be re-formulated into a linear model, and the multi-purpose linear modeling function in R is capable of handling several of the most common statistical analyses.
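As a quick self-contained illustration of this idea (using noise-free synthetic data rather than the bank2 data set), lm() estimates the beta parameters directly from the formula interface:

```r
# Sketch: fit y = b0 + b1*x on noise-free synthetic data,
# so the estimated coefficients recover the true values exactly
x <- 1:10
y <- 2 + 3 * x
fit <- lm(y ~ x)
coef(fit)   # intercept should be 2, slope should be 3
```

Because the synthetic data lie exactly on a line, the fitted coefficients match the generating values; with real data, the estimates would carry standard errors, as seen in the summary() output below.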

Using the lm() Function

The general purpose linear modeling function lm() is quite handy for several standard multivariate statistical analyses.

1. Basic Multiple Regression

We’ll use the bank2 data set for a simple illustration of how to fit a multiple regression model in R. Suppose we desire to build a linear regression model assessing the relationship of education and minority status with current salary. We use lm() to fit the model, and store the results in an object named fit1:

> fit1 <- lm(SALARY ~ MINORITY + EDUC, data = bank2)

> summary(fit1)

Call:

lm(formula = SALARY ~ MINORITY + EDUC, data = bank2)

Residuals:

Min 1Q Median 3Q Max

-22284 -8182 -2215 5472 78613

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -16539.2 2885.9 -5.731 1.78e-08 ***

MINORITY -3757.6 1428.2 -2.631 0.00879 **

EDUC 3838.2 205.1 18.714 < 2e-16 ***

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 12750 on 471 degrees of freedom

Multiple R-Squared: 0.4445, Adjusted R-squared: 0.4421

F-statistic: 188.4 on 2 and 471 DF, p-value: < 2.2e-16

We can see that both of the predictors are strongly significant, and collectively explain about 45% of the total variation in SALARY.

The usual regression diagnostics should ALWAYS be checked after fitting a multiple regression model. You can use the graphical techniques available in R for a visual investigation of the standard diagnostic plots. A common diagnostic plot is the residual-fitted plot, where you can check for non-constant variance in the residuals. The fitted values on the SALARY variable and associated residuals from the model fit are stored as vectors in the fit1 object:

> plot(fit1$fit,fit1$res,main="Residual-Fitted Plot",xlab="Fitted Values",ylab="Residuals")

[Figure: residual-fitted plot for the fit1 model]

There is clearly a problem with non-constant variance in the residuals across the fitted values, and constant variance in the residuals is one of the assumptions underlying the validity of the results from fitting a linear regression model. A transformation of the SALARY response would probably be recommended here, in an effort to stabilize the variance in the residuals.

An Analysis of Variance (ANOVA) table comparing two nested multiple regression models (essentially testing the null hypothesis that one or more of the coefficients for the predictors, or parameters in the model, are equal to zero) can be investigated using the anova() function on the results objects from the two fits. For example, if we want to test the null hypothesis that the minority status predictor does not need to be in the model (or that the coefficient is equal to zero), we can submit the following commands:

> fit1 <- lm(SALARY ~ MINORITY + EDUC, data = bank2)

> fit2 <- lm(SALARY ~ EDUC, data = bank2)

> anova(fit2, fit1)

Analysis of Variance Table

Model 1: SALARY ~ EDUC

Model 2: SALARY ~ MINORITY + EDUC

Res.Df RSS Df Sum of Sq F Pr(>F)

1 472 7.7738e+10

2 471 7.6612e+10 1 1.1260e+09 6.9226 0.00879 **

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

We see from the strongly significant F-statistic comparing the two models that the null hypothesis of the coefficient for minority status being equal to zero is strongly rejected.

The anova() function can also be used on a single model fit object to produce a standard ANOVA table for the variables in the model:

> anova(fit1)

Analysis of Variance Table

Response: SALARY

Df Sum Sq Mean Sq F value Pr(>F)

MINORITY 1 4.3373e+09 4.3373e+09 26.665 3.577e-07 ***

EDUC 1 5.6967e+10 5.6967e+10 350.224 < 2.2e-16 ***

Residuals 471 7.6612e+10 1.6266e+08

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

2. ANCOVA

The lm() function can also be used in tandem with the factor() function to perform Analysis of Covariance, or ANCOVA, in R. ANCOVA refers to the case of a multiple regression model with a mix of quantitative and qualitative predictor variables, and the potential interactions between them. This technique is most commonly used to see if different groups have different relationships between certain quantitative variables. The multiple regression model that was fitted above was actually an example of an ANCOVA.

Suppose we want to investigate whether the relationship between previous experience and beginning salary differs by gender in the bank2 data set. Because gender is already a character variable in the data set, R recognizes it as a factor. However, when you have categorical predictors in a numeric format (like MINORITY), you need to use the factor() function in order for R to recognize the predictor variable as a qualitative factor and create the necessary dummy variables for the analysis.
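A minimal sketch of what factor() does with a numeric code (the values below are made up for illustration):

```r
# Sketch: turn a 0/1 numeric code into a qualitative factor
minority <- c(0, 1, 0, 1, 1)
f <- factor(minority)
levels(f)   # the distinct categories, now stored as labels "0" and "1"
```

Once a variable is a factor, lm() automatically creates the dummy variables for its non-reference levels when the factor appears in a model formula.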

We use the lm() function to perform the desired ANCOVA, specifying gender as a factor (which is redundant) and including the interaction between gender and previous experience in the model:

> ancova <- lm(SALBEGIN ~ PREVEXP + factor(GENDER) + PREVEXP*factor(GENDER), data = bank2)

> summary(ancova)

Call:

lm(formula = SALBEGIN ~ PREVEXP + factor(GENDER) + PREVEXP *

factor(GENDER), data = bank2)

Residuals:

Min 1Q Median 3Q Max

-10900 -4122 -1744 1133 60081

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 12992.749 614.960 21.128 <2e-16 ***

If the interaction between previous experience and gender does not appear to be significant, we can drop it and refit the additive model:

> ancova2 <- lm(SALBEGIN ~ PREVEXP + factor(GENDER), data = bank2)

> summary(ancova2)

Call:

lm(formula = SALBEGIN ~ PREVEXP + factor(GENDER), data = bank2)

Residuals:

Min 1Q Median 3Q Max

-10829 -4439 -2006 1218 59882

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 13271.646 534.473 24.831 <2e-16 ***

A one-way Analysis of Variance can be carried out with lm() in the same way, using a single factor as the predictor. For example, to model beginning salary as a function of job category:

> fit3 <- lm(SALBEGIN ~ factor(JOBCAT), data = bank2)

> anova(fit3)

Analysis of Variance Table

Response: SALBEGIN

Df Sum Sq Mean Sq F value Pr(>F)

factor(JOBCAT) 2 1.7926e+10 8.9628e+09 371.11 < 2.2e-16 ***

Residuals 471 1.1375e+10 2.4152e+07

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Note once again how R makes the lowest category of a factor the reference category. We see that category 3 has a significantly higher mean than category 1. However, the previously identified concerns prevent us from being confident about making any inferences. Standard diagnostic plots investigating the fit of this model will reveal several problems.

A nice example of a two-way ANOVA (two factors) including plots of interactions between the factors can be found in Julian Faraway’s book (see references). The function interaction.plot() is very useful for making nice plots of two-way interactions between factors.

Other Useful Modeling Functions

Once you understand the basic ideas behind using a multivariate statistical analysis function in R and looking up help on the functions, you have unlimited access to all of the powerful statistical functions that R offers. Nearly all of these functions work in a manner similar to the lm() function. Provided below are tables of the more common functions (and corresponding packages) used for multivariate statistical analyses in R.

LINEAR MODELING

Analysis                              Function               Package
WLS Regression                        lm(..., weights = )    BASE
Regression Splines                    splineDesign()         splines
Ridge Regression                      lm.ridge()             MASS
Stepwise Regression                   step(lm())             BASE
Robust Regression                     rlm()                  MASS
LTS Regression                        ltsreg()               MASS
Critical Values from the              qtukey()               BASE
Studentized Range Distribution

GENERALIZED LINEAR MODELING

Analysis                      Function                             Package
Logistic Regression           glm(,fam=binomial)                   BASE
Poisson Regression            glm(,fam=poisson)                    BASE
Poisson Rate Models           same as above, with offset()         BASE
                              on the RHS of the model
Log-linear Models             same as Poisson, with a count        BASE
                              variable on the LHS of the model
Ordinal Regression            **consult with CSCAR
Gamma Regression              glm(,fam=Gamma)                      BASE
Negative Binomial Regression  glm(,fam=negative.binomial(k))       MASS

NOTE: When comparing nested GLMs, make sure to use anova(, test = "Chi") in order to construct an analysis of deviance table. Use anova(, test = "F") if there is a free dispersion parameter.
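A self-contained sketch of such an analysis of deviance comparison (synthetic 0/1 data; the chi-square test is appropriate here because the binomial dispersion is fixed):

```r
# Sketch: compare nested logistic regressions with a chi-square
# analysis of deviance (synthetic response and predictors)
y <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
x <- 1:10
z <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
m1 <- glm(y ~ x, family = binomial)       # reduced model
m2 <- glm(y ~ x + z, family = binomial)   # full model
anova(m1, m2, test = "Chisq")             # analysis of deviance table
```

The resulting table tests whether the additional predictor significantly reduces the residual deviance.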

LINEAR MIXED MODELING

**see help(lme) for specification of random effects.

Analysis                     Function                          Package
Mixed Effects, ML            lme(,method="ML")                 nlme / lme4
Mixed Effects, REML          lme()                             nlme / lme4
Predicting Random            random.effects(lme())             nlme
Effects (BLUPs)
Nested Effects               lme(,random=~1|lev1/lev2/…/)      nlme / lme4
Repeated Measures            lme(,corr=)                       nlme / lme4
                             **consult with CSCAR

NOTE: When comparing nested Mixed Models with different fixed effects using anova(), make sure to fit both models using method = "ML".

NOTE: The new lmer() function in the lme4 package is considered to be an improved update of the original lme() function, and allows users to fit models with crossed random effects. Consult with CSCAR for more details!
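A small sketch of that comparison using the Orthodont data that ships with the nlme package: both models are fit with method = "ML" before the fixed effects are compared with anova():

```r
library(nlme)   # provides lme() and the Orthodont data

# Random intercept for each subject; compare fixed effects under ML
m1 <- lme(distance ~ age, random = ~ 1 | Subject,
          data = Orthodont, method = "ML")
m2 <- lme(distance ~ age + Sex, random = ~ 1 | Subject,
          data = Orthodont, method = "ML")

anova(m1, m2)   # likelihood ratio test of the added fixed effect (Sex)
```

Had the models been fit by REML (the default), this likelihood ratio comparison of fixed effects would not be valid.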

MULTIVARIATE ANALYSIS / CLUSTERING

Analysis                              Function            Package
Principal Components Analysis         princomp()          mva
Linear Discriminant Analysis          lda()               MASS
Classification Trees                  rpart()             rpart
K-Nearest Neighbors                   knn()               class
Hierarchical Cluster Analysis         hclust()            mva
K-Means Clustering                    kmeans()            mva
MDS                                   cmdscale()          mva
Factor Analysis                       factanal()          mva
MCA / HA                              mca()               MASS

NOTE: In current versions of R, the mva functions are part of the standard stats package, which is loaded automatically.
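A quick principal components sketch on the built-in USArrests data (four numeric variables measured on different scales, so the correlation matrix is used):

```r
# PCA on the correlation matrix of the USArrests data
pc <- princomp(USArrests, cor = TRUE)

summary(pc)     # proportion of variance explained by each component
loadings(pc)    # variable loadings on each component
```

The summary() output shows how much of the total variation the first one or two components capture, which guides how many components to retain.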

NON-PARAMETRIC REGRESSION

Analysis                              Function            Package
Kernel Estimation                     ksmooth()           modreg
Spline Smoothing                      smooth.spline()     stats
Lowess Smoothing                      lowess()            BASE

NOTE: The modreg functions are likewise part of the standard stats package in current versions of R.
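Lowess smoothing is typically overlaid on a scatterplot; a minimal sketch on the built-in cars data:

```r
# Compute a lowess smooth of stopping distance against speed
fit <- lowess(cars$speed, cars$dist)

# Scatterplot with the smooth curve drawn on top
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping Distance")
lines(fit)
```

The f argument to lowess() (default 2/3) controls the span of the smoother; smaller values produce a wigglier curve.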

SURVIVAL ANALYSIS

Analysis                              Function                    Package
Create Survival Object                Surv(time, event)           survival
Cox Proportional Hazards Modeling     coxph()                     survival
Plot Kaplan-Meier Survival            plot(survfit(Surv()))       survival
  Estimates
Plot Predicted Survival Function(s)   plot(survfit(coxph()))      survival
  Based on coxph()
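A minimal sketch using the lung cancer data that ships with the survival package:

```r
library(survival)   # provides Surv(), survfit(), coxph(), and the lung data

# Kaplan-Meier survival estimates, one curve per sex
km <- survfit(Surv(time, status) ~ sex, data = lung)
plot(km)

# Cox proportional hazards model with age and sex as predictors
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)
```

The Surv() object on the left-hand side of each formula bundles the follow-up time with the event/censoring indicator.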

The primary advantage of R when it comes to any kind of statistical analysis is the tremendous range of powerful functions at your disposal. There are several special functions that are simply not offered in other statistical software packages! Click on the Packages link on the CRAN web page to get an idea of the functions that are available.

Please note that new functions and updated versions of existing functions are added to R daily. I recommend that you visit the CRAN web site frequently for updates.

9. Creating Functions / Programming

Another very nice interactive feature of R is the ability to easily write your own programs and functions, and update existing functions. R is not only a comprehensive statistical computing language, but also a programming language that can be used to write new statistical software. Researchers write new software for R every day, and are always updating existing functions. Writing functions in R takes a while to get used to, just like any other programming language.

New functions can be written in the R console, but it is much easier to submit the following command:

> fix(new.function.name)

where new.function.name represents a name that you choose to assign to your function. Submitting this command will open a Notepad-based text editor from within R, and you won’t be able to do anything else in R until you close this window.

After opening this window, you will see a skeleton for the basic form of a function in R:

function ()

{

}

In this window, you will enter the names of the function's arguments inside the parentheses, and build the body of the function between the braces.

Try creating the following simple addition machine to get the basic idea of how a function in R can be programmed, saved, and run:

> fix(add.machine)

After submitting that command, enter the following text into the function editor, save the file, and then close the editor window:

function (number1, number2)

{

result <- number1 + number2

return(result)

}

Back at the command prompt, try running the new function:

> add.machine(3,5)

[1] 8

R returns the result of running the function! More commonly, the result of a function is stored in a new object:

> sum <- add.machine(3,5)

> sum

[1] 8

(Note that the name sum masks R's built-in sum() function, so in practice a different object name would be safer.)

To see other examples of how functions are written in R, you can type fix(function.name) for any of the functions that you are working with.

Programming in R is basically the same as programming in any other language; the core program is controlled by a series of if-then statements, loops, print statements, calls to other programs, and return statements. The only differences are minor syntax conventions that just take a little while to get used to.

To begin with, let’s look at an example of how the if-then statements work in R. Try submitting this command at the prompt to see how it works:

> if (1 > 0) print("I Like Binary")

Note how R prints "I Like Binary", because the condition in the "if" statement is true. The print() function is very useful for programming purposes, in that it prints simple strings in the R Console. No "then" statements are needed after an "if" statement; you simply type what you would like R to do if the condition is true. In general, if you want to do more than one thing if the "if" condition is true, you use this bracketed structure:

if (logical condition)

{

do this

and this

and this

}

The statements in the brackets usually refer to function calls and object assignments, and simply need to be on separate lines (no punctuation necessary!).

To add else-if conditions, simply use this structure:

if (logical condition)

{

do this

and this

and this

}

else if (logical condition)

{

do this

and this

and this

}

...

else

{

do this

and this

and this

}

Similar to many other programming languages, && is the "and" operator for logical conditions, and || is the "or" operator. (These operate on single logical values; the single-character & and | operators instead work element-wise on vectors.) Remember to make appropriate use of parentheses when working with logical and/or conditions!
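A small worked example (with a hypothetical value of x) combining the two operators. Note that each else is placed on the same line as the preceding closing brace, so that the whole statement parses as one unit at the command prompt:

```r
x <- 7   # a hypothetical test value

if (x > 0 && x %% 2 == 0) {          # positive AND even
    msg <- "positive and even"
} else if (x > 0 || x %% 2 == 0) {   # positive OR even (but not both)
    msg <- "positive or even"
} else {                             # neither
    msg <- "negative and odd"
}

print(msg)   # for x = 7, prints "positive or even"
```

Since 7 is positive but odd, the first condition fails and the else-if branch runs.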

Now let’s take a look at how a “for” loop works in R. Try submitting the following syntax at the command prompt:

> for(i in 1:5) print(i)

R prints 1, 2, 3, 4, and 5. A for loop follows the same bracketing rules as an if statement: if you want R to do more than one thing inside the loop, use braces around the commands:

for(i in a:b)

{

do this

and this

and this

}

Keep in mind that the i argument in the for loop is a numerical object for each repetition of the for loop (it takes the value of 1 for the first repetition, 2 for the second, and so forth), and can be used as an index variable if you so desire. We’ll see an example of this soon.
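A small illustration of using the loop counter as an index variable, filling in a vector one element at a time:

```r
# Use the loop index i to fill in a vector of squares
squares <- numeric(5)         # an empty numeric vector of length 5

for (i in 1:5) {
    squares[i] <- i^2         # i indexes both the loop and the vector
}

print(squares)                # 1 4 9 16 25
```

Here i takes the values 1 through 5 in turn, and squares[i] stores each result in the matching position of the vector.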

Now we’ll take a look at how “while” loops work in R. Try creating a function called while.loop() by first using the fix() function:

> fix(while.loop)

Then, in the function editor window, type the following lines, and save the function before closing the editor window (note that there are no arguments):

function ()

{

x <- 1

while (x <= 5)

{

print(x)

cat("Counter is", x, "\n")

x <- x + 1

}

}

After saving the function and closing the editor window, run it at the command prompt:

> while.loop()

[1] 1

Counter is 1

[1] 2

Counter is 2

[1] 3

Counter is 3

[1] 4

Counter is 4

[1] 5

Counter is 5

This function proves to be quite handy for more complicated programming.

NOTE: If you would like to run a simple loop at the command prompt and not create a new function in the function editor window, you can define the loop one line at a time in the format introduced above by pressing enter after each line. Give it a try!

The purpose of most R functions is to return an object, whether it be an object containing the results of a model fit, a single number, a vector of numbers, etc. Therefore, several functions will end with a

return(object.name)

line, which tells R to either print the resulting object in the R console (if you do not assign the result of the function to a new object in the console), or to store the resulting object in the object that you have defined.
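For instance, a tiny function (circle.area is a made-up name for illustration) that builds its result and then returns it on the last line:

```r
# A small function whose final line returns the finished object
circle.area <- function(radius) {
    area <- pi * radius^2   # build the result object
    return(area)            # hand the result back to the caller
}

a <- circle.area(2)   # result stored in a new object
print(a)
```

Calling circle.area(2) alone would print the result in the console; assigning it to a stores it instead.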

This is the basic ammunition necessary for programming new functions in R! Let’s take a look at a more complicated example of programming a function in R for statistical analysis purposes.

Example: Bootstrapping

R is often used to perform simulations. Suppose we have a random sample of size 50 from a standard normal distribution, and we want to use the resampling technique known as bootstrapping to check whether or not the actual coverage of a 95% confidence interval is near 95%. In other words, what percentage of the time does a 95% confidence interval (as it is commonly calculated) cover the true sample mean when using bootstrapping?

We want to create a function that will perform bootstrapping to simulate repeated samples from this original sample, and see how many of the simulated confidence intervals contain the true sample mean. In this example, I will use the R convention of commenting a program using # signs.

> fix(boot.cicheck) # call the function editor window

function(n, rep)   # n = sample size, rep = no. of bootstrap samples

{

samp <- rnorm(n)   # generate the original sample
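The remainder of the function body is cut off in this copy of the notes. Under the assumptions stated above (an original sample from a standard normal distribution, t-based 95% confidence intervals, coverage judged against the original sample mean), a minimal sketch of such a coverage check might look like:

```r
# Sketch only: one way to program the bootstrap coverage check
# described in the text. n = sample size, rep = no. of bootstrap samples.
boot.cicheck <- function(n, rep) {
    samp <- rnorm(n)             # original sample from N(0,1)
    samp.mean <- mean(samp)      # treated as the "true" mean
    cover <- 0                   # counter for intervals that cover it
    for (i in 1:rep) {
        bsamp <- sample(samp, n, replace = TRUE)         # bootstrap resample
        half <- qt(0.975, n - 1) * sd(bsamp) / sqrt(n)   # t-based half-width
        ci <- mean(bsamp) + c(-half, half)               # 95% CI
        if (ci[1] <= samp.mean && samp.mean <= ci[2]) {
            cover <- cover + 1
        }
    }
    return(cover / rep)          # observed coverage proportion
}

boot.cicheck(50, 1000)
```

Running boot.cicheck(50, 1000) returns the proportion of bootstrap intervals that covered the original sample mean, which should be close to (often slightly below) the nominal 0.95.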