A Short Introduction to the ‘R’ Programming Language for Biologists

R is free (GNU open source) software for statistical computing, created by Ross Ihaka and Robert Gentleman. It has several important features that make it especially useful for biologists interested in genomics. R contains many pre-built general statistical functions as well as a large collection of modules specifically designed to analyze genomic data (Bioconductor). Also, R can handle extremely large tables of numbers (arrays) such as the tens of thousands of probes on a gene expression microarray (or genome tiling array) and the millions of SNPs on the latest genotyping platforms.

R contains operators that can manipulate numbers or matrices as well as a programming language equipped with conditional expressions, loops and input/output abilities, which allows users to create their own scripts and functions. Using this language, the open source community has created many useful statistical tools for a wide variety of purposes, including biology. Often, the most sophisticated new data analysis methods are first publicly released as R modules.

R is typically used (by statisticians) as a tool for direct data analysis. It is controlled by typing text commands (a “command line interface”), but it is capable of producing very elaborate publication-quality graphical output. R is available for Windows, Macintosh, and Linux computers. It can be used on powerful servers and clusters when a lot of CPU power is required, but most people will find it more convenient to install it on their own personal computer or laptop.

So, take a moment and download and install R on your computer now (pick a local site):



That was easy.

Now start up the R application (it may have created a desktop icon during install, or might be located in your “Programs” or “Applications” directory). You should get an “R Console” window on the desktop with some “Welcome to R” text followed by the nice friendly > sign, indicating that R is ready to execute your every command. Type each command exactly as shown below (the text after the > prompt; do not type the > itself), followed by a carriage return (or “Enter” key). Anything after a # is a comment and can be left out. Results returned by R are shown on the line below each command.

> 1+1 # calculate directly

[1] 2

> x=2+2 # or put the value into a variable

> x

[1] 4

> y=((3 / 2)^2 + 2) * pi

> y

[1] 13.35177

> Z <- y + 5 # store a new value in Z

> z # case sensitive

Error: object 'z' not found

> Z

[1] 18.35177

> dog <- "Lassie" # variables can also hold text

> dog

[1] "Lassie"

So, from this little exercise we learn that R can be used as a calculator for both simple and complex math; it can store values in a variable, which are returned when you enter the variable name; variables can be used in mathematical operations; and variable names are case sensitive (even on a Windows computer). The assignment operator “<-” does the same job as “=” in these commands.

A variable can also hold a whole series of values, called a vector. Try the following commands:

> x = c(1,2,4,66,8,4) # concatenate these elements into x

> x[3] = 11 # put 11 into the third element of x

> x[3:5] # show elements 3-5 from x

[1] 11 66 8

> length (x) # how many elements in x?

[1] 6

> sum(x) # add up all elements in x

[1] 92

> mean(x) # average of all elements in x

[1] 15.33333

> y = c(8,8,3,4,1,3)

> x+y # add corresponding elements in vectors

[1] 9 10 14 70 9 7

Got all that? The letter ‘c’ in the first command is the name of a function that concatenates (combines) the elements separated by commas into the vector x. You can assign (or change) an element in a vector by using its index number; the first element has index 1. The third to fifth elements of x are now 11, 66, and 8. The length( ) of x is 6 because it has 6 elements. sum( ) and mean( ) operate on all the elements of vector x. Note that functions such as length( ), sum( ), mean( ), etc. are always followed by a target expression in parentheses. When you add vectors x and y, the individual elements in corresponding positions are added.
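As a further sketch of the same idea, the commands below (written as a script, so without the > prompt) apply arithmetic and a few other built-in functions to a whole vector at once. The name lengths_bp and its values are made up for illustration.

```r
# Hypothetical fragment lengths in base pairs (illustration only)
lengths_bp <- c(1200, 850, 2300, 1750)

lengths_kb <- lengths_bp / 1000   # a single number is applied to every element
longest <- max(lengths_bp)        # largest element
sorted <- sort(lengths_bp)        # elements in increasing order

print(lengths_kb)
print(longest)
print(sorted)
```

Because the operation is applied element by element, no loop is needed to convert every value.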

A matrix is a table that holds both rows and columns of values: y[i,j]. An array holds values in as many dimensions as you wish: z[i,j,k,l,m]. Try the following commands:

> mat1= matrix(1:12, nrow=2, ncol=6, byrow= TRUE)

> mat1

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    7    8    9   10   11   12

> length(mat1) # how many elements in mat1?

[1] 12

> dim(mat1) # dimensions of mat1

[1] 2 6

> mat2= rbind(x,y) # put x and y into mat2 by rows

> mat2

  [,1] [,2] [,3] [,4] [,5] [,6]
x    1    2   11   66    8    4
y    8    8    3    4    1    3

> mat1 * mat2 # multiply corresponding matrix elements

  [,1] [,2] [,3] [,4] [,5] [,6]
x    1    4   33  264   40   24
y   56   64   27   40   11   36

mat1 contains the numbers from 1 to 12 in two rows and six columns, entered by row. The length of a matrix is equal to the total number of elements it contains; the dimensions are the number of rows and columns. mat2 contains vectors x and y, entered by row. The * operator multiplies the individual elements in corresponding positions, so the two matrices must have the same dimensions; true matrix (linear algebra) multiplication uses the %*% operator instead. R can do math on very large matrices very quickly.
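The difference between multiplying matching elements and multiplying matrices in the linear-algebra sense can be seen in a minimal sketch. The matrices A and B are small made-up examples, not part of the tutorial data.

```r
# Two small matrices, filled by row (values chosen only for illustration)
A <- matrix(1:4, nrow = 2, byrow = TRUE)   # rows: 1 2 / 3 4
B <- matrix(5:8, nrow = 2, byrow = TRUE)   # rows: 5 6 / 7 8

elementwise <- A * B   # multiplies values in matching positions
matprod <- A %*% B     # linear-algebra (row-by-column) product
At <- t(A)             # transpose: rows become columns

print(elementwise)
print(matprod)
print(At)
```

For the element-wise form both matrices need the same dimensions; for %*% the number of columns of the first must equal the number of rows of the second.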

A data frame is a table whose columns may hold values of different types (like a spreadsheet with some columns that contain text and other columns with numbers). Each column of the data frame is a vector, and all columns must be the same length. Data frames are often used to hold genomic information. To create a data frame, first create three vectors, using some genome size data from :

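> organism <- c("Human","Mouse","Fruit Fly","Roundworm","Yeast")

> genomeSizeBP <- c(3e9, 3e9, 1.356e8, 9.7e7, 1.21e7)

> GeneCount <- c(30000, 30000, 13061, 19099, 6034)

> compGenomes <- data.frame(organism, genomeSizeBP, GeneCount)

> compGenomes

   organism genomeSizeBP GeneCount
1     Human    3.000e+09     30000
2     Mouse    3.000e+09     30000
3 Fruit Fly    1.356e+08     13061
4 Roundworm    9.700e+07     19099
5     Yeast    1.210e+07      6034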

> GeneDensity = compGenomes$genomeSizeBP/compGenomes$GeneCount

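> compGenomes <- data.frame(compGenomes, GeneDensity) # add GeneDensity as a new column

> compGenomes

   organism genomeSizeBP GeneCount GeneDensity
1     Human    3.000e+09     30000  100000.000
2     Mouse    3.000e+09     30000  100000.000
3 Fruit Fly    1.356e+08     13061   10382.053
4 Roundworm    9.700e+07     19099    5078.800
5     Yeast    1.210e+07      6034    2005.303

> MouseGeneDensity <- compGenomes$GeneDensity[2] # second value of the GeneDensity column

> MouseGeneDensity

[1] 1e+05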
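Rows and columns of a data frame can also be selected or re-ordered. The sketch below rebuilds the compGenomes table from the values shown above so that it runs on its own; the names small, counts and byCount are introduced here only for illustration.

```r
organism <- c("Human", "Mouse", "Fruit Fly", "Roundworm", "Yeast")
genomeSizeBP <- c(3e9, 3e9, 1.356e8, 9.7e7, 1.21e7)
GeneCount <- c(30000, 30000, 13061, 19099, 6034)
compGenomes <- data.frame(organism, genomeSizeBP, GeneCount)

# Rows are selected before the comma, columns after it
small <- compGenomes[compGenomes$genomeSizeBP < 1e9, ]    # genomes under 1 billion bp
counts <- compGenomes$GeneCount                           # one column, as a vector
byCount <- compGenomes[order(compGenomes$GeneCount), ]    # rows sorted by gene count

print(small)
print(counts)
print(byCount)
```

Leaving the position after the comma empty, as in the lines for small and byCount, keeps all columns.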

In order to do useful bioinformatics work with R, you will need to be able to load data from a file. R works best with tab-delimited plain text files. It is a simple process to open a file in Excel and "Save As..." in Text (Tab delimited) format.

The function read.table( ) reads data from a plain text file (space or tab delimited columns) directly into a data frame. read.csv( ) reads a comma delimited file (an option used by many database programs). Before R can read a file, you have to guide it to the file — use the “Change Working Directory…” command from the “Misc” menu and navigate to the folder that holds the data file. The read.table( ) function takes many options, but the simplest ones are the filename and “header=TRUE” to read the first line of the file as a header which contains the column names. It is necessary to assign a variable name to the data frame created by read.table( ). Copy the compGenomes data above into a text file, then the command to import the file would look like this:

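> compGenomes2 <- read.table("compGenomes.txt", header=TRUE, sep="\t") # file name is an example; use the name of the file you saved

Results can be saved to a tab-delimited text file with the write( ) function:

> write (myResults, file = "myresults.txt", sep = "\t")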
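A complete round trip, saving a table and reading it back, might look like the sketch below. It makes two assumptions that are not in the text above: it uses write.table( ), the companion function of read.table( ), and it writes to a scratch file created by tempfile( ) instead of a real data file.

```r
# A small data frame to save (values taken from the compGenomes table above)
genomes <- data.frame(organism = c("Fruit Fly", "Yeast"),
                      GeneCount = c(13061, 6034))

# tempfile() gives a scratch file name; replace it with your own file name
f <- tempfile(fileext = ".txt")

# Write tab-delimited text with a header line, no row names and no quotes
write.table(genomes, file = f, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back; sep = "\t" keeps "Fruit Fly" together as one value
genomes2 <- read.table(f, header = TRUE, sep = "\t")
print(genomes2)
```

The working directory can also be checked and changed from the command line with getwd( ) and setwd( ).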

In order to be a functional computer language, R must be able to make IF decisions based on mathematical or logical criteria. Any expression that uses one of the comparison operators (>, <, >=, <=, ==, !=) returns TRUE or FALSE, and tests can be combined with & (AND) and | (OR):

> a = 3

> a > 1 # is a greater than 1?

[1] TRUE

> a == 7 | a !=4 # is a equal to 7? OR is a not equal to 4?

[1] TRUE

> a < 9 & a ==2 # is a less than 9? AND is a equal to 2?

[1] FALSE

> plus5 <- mat1[mat1 > 5] # put into plus5 all values from mat1 that are greater than 5

> plus5 # note that plus5 is a vector, not a matrix

[1] 7 8 9 10 11 6 12

> x = c(1,2,11,66,8,4)

> listx <- which(x >= 5) # put into listx indexes for values in x >= 5

> listx

[1] 3 4 5

> x[listx] # show values from x for indexes in listx

[1] 11 66 8
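The same TRUE/FALSE tests can drive an explicit decision. The following is a minimal sketch using R's if/else statement and the ifelse( ) function; the names msg and size and the labels "big" and "small" are made up for illustration.

```r
a <- 3

# if/else runs one branch depending on a single TRUE or FALSE;
# && is the single-value form of &
if (a > 1 && a < 9) {
  msg <- "a is between 1 and 9"
} else {
  msg <- "a is outside the range"
}
print(msg)

# ifelse() applies the same test to every element of a vector
x <- c(1, 2, 11, 66, 8, 4)
size <- ifelse(x >= 5, "big", "small")
print(size)
```

if works on one value at a time, while ifelse( ) returns a result for each element, which suits the vector-based style used throughout this tutorial.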

R includes very sophisticated graphical output functions. Try the following commands:

> plot(x,y)

> barplot(x, col="blue") # col sets the colour of the bars

> pie(x)
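Plots can be labelled and saved to a file as well as shown on screen. The sketch below assumes the x and y vectors from earlier in the tutorial, and writes to a scratch PDF file created with tempfile( ); the title and axis labels are placeholders.

```r
x <- c(1, 2, 11, 66, 8, 4)
y <- c(8, 8, 3, 4, 1, 3)

out <- tempfile(fileext = ".pdf")   # scratch file; replace with your own file name
pdf(out)                            # send the following plots to a PDF file

# Scatter plot with a title, axis labels, and filled blue points
plot(x, y, main = "y versus x", xlab = "x values", ylab = "y values",
     col = "blue", pch = 19)

invisible(dev.off())                # close the file so it is written to disk
```

Leaving out the pdf( ) and dev.off( ) lines draws the same plot in a window on screen.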

Scripts and Programming in R

Up to this point, this tutorial has used R only as an interactive tool, by typing commands directly into the R Console and executing them immediately with the Return/Enter key. As with Unix shell scripts and Perl, R can be used as a scripting language by simply writing a series of commands in a text file, then executing the file. This allows more complex programs to be built, modified, and re-used. A series of R commands can be written in any text editor (Notepad, TextEdit, or Word, as long as the file is saved as plain text) or using the built-in text editor from the R application. In any case, the script is saved as a text file with the “.R” extension, then loaded into R using the source( ) command or as a menu item (File > Source File …) from the R Console.
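As a minimal sketch of this workflow, the commands below create a two-line script and run it with source( ). The script is written to a scratch file with writeLines( ) only so that the example is self-contained; normally it would be typed in a text editor and saved under a name of your choice.

```r
# Write a two-line script to a scratch file (stands in for a file saved from an editor)
script <- tempfile(fileext = ".R")
writeLines(c("total <- sum(1:10)",
             "print(total)"), script)

# Run every command in the file, as if each had been typed at the console
source(script)
```

After source( ) finishes, variables created by the script (here total) remain available in the session.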

The simplest programming task in R is to define a function, which can be any combination of existing R operators and functions. For example, the function std ( ) (standard deviation) can be defined as the square root of the variance (two existing functions):

> std = function (x) sqrt(var(x))

> data <- c(1,2,4,5) # example values (assumed)

> std(data)

[1] 1.825742

Once this function has been defined in a session (or in a script) then you can apply the function to any appropriate object (in this case a vector or matrix of numerical values).
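A function can also contain several steps enclosed in braces. The sketch below defines a hypothetical helper, cv( ), for the coefficient of variation (standard deviation divided by the mean); the function name and the test values are made up for illustration.

```r
# Hypothetical helper: coefficient of variation (standard deviation / mean)
cv <- function(x) {
  s <- sqrt(var(x))   # standard deviation, computed as in std() above
  m <- mean(x)
  s / m               # the last expression is the value returned
}

values <- c(2, 4, 4, 6)   # made-up numbers
print(cv(values))
print(sd(values))         # sd() is R's built-in standard deviation
```

Note that R already provides sd( ), so a home-made std( ) is useful mainly as an exercise.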
