R: Introduction



To begin, download R from the R-Project web site (r-). R is different from most statistical packages in that it contains a very primitive interface (though this is continually improving) and, as a result, has a more hands-on or programming feel than other statistical software packages.

The standard installation of R utilizes a default user interface. In this course, we will instead use RStudio, which provides a richer interface. Once you have installed the R base package, download RStudio from the following website:

R is an open source package. This has both advantages and disadvantages. Because it is open source, there is no "software support" you can directly access; however, there are literally thousands of documents on the web that can help you use R efficiently (some are much better than others). The following links give some of the most popular webpages for R support.

The R base package contains many basic functions used in statistics. In addition to the base package, many individuals have created other packages that can be downloaded to aid in various analyses.

The following provides a snapshot of the RStudio interface that is commonly used when using R.

Organizational Structure

The organizational structure in R is best managed through what are called Projects. To create a new Project, select File > New Project...

First, specify New Directory.
Next, to begin, we will create an Empty Project.
Specify the name and location for the new directory that will contain this new project.
Next, specify the name and location for this new directory.
Verify that the new directory and project have been created.

The frame in the upper-left is your script window and the frame on the lower-left is the R console window. You can enter commands directly into the R console; however, I'd encourage you to get accustomed to using the script window.
The following can be used to obtain an R script window.

Reading Data Files into RStudio

Citibike is a bike rental company that operates in New York City. Citibike bike rental data is publicly available.

Citibike:
Data:
Data download:

Save a copy of the most recent dataset onto your local machine. Unzip the file that you've downloaded from this website. A comma-delimited text file is produced. The Citibike dataset is highly structured. The first row contains the variable or field names. Each remaining row represents a single instance of a bike rental.

To open a text file in RStudio, select Import Dataset in the window shown in the upper-right. Choose to import data from a Text File.

Select the text file to be read in. The Citibike bike rental dataset is being read in here. Give a name to the dataset in the Name box. Options may need to be specified; the default settings suffice for this dataset.

Understanding Data Objects

Click Import, and the data set will be added to your workspace. If you click on the data set name in your workspace, the data set will appear in the upper-left window. R is an object-oriented language and there are a few basic objects used to store data. In an effort to keep things simple, consider the following explanations.

vector: the contents of a single column
data.frame: a collection of vectors, each restricted to having the same length
list: a collection of vectors

R stores the imported data in an object known as a data frame.
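The three object types described above can be illustrated with a few made-up values. This is only a sketch: the names duration, usertype, trips, and mixed are hypothetical and the values are not read from the Citibike file.

```r
# Made-up values for illustration only; not taken from the Citibike file.
duration <- c(509, 513, 516)                        # a vector: one column of values
usertype <- c("Subscriber", "Subscriber", "Customer")

# A data.frame combines vectors of the same length as its columns.
trips <- data.frame(duration = duration, usertype = usertype)

# A list can also hold vectors, and they need not share a length.
mixed <- list(duration = duration, station = "W 55 St & 6 Ave")

class(duration)   # "numeric"
class(trips)      # "data.frame"
class(mixed)      # "list"
```

Trying to build a data.frame from vectors of lengths 3 and 2 would stop with an error, which is the "same length" restriction in practice.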
The type of object can be identified using the class() function.

library(readr)
BikeData <- read_csv("D:/Teaching/DSCI325/Data/201802_citibikenyc_tripdata.csv/201802_citibikenyc_tripdata.csv")
## Parsed with column specification:
## cols(
##   tripduration = col_integer(),
##   starttime = col_datetime(format = ""),
##   stoptime = col_datetime(format = ""),
##   `start station id` = col_integer(),
##   `start station name` = col_character(),
##   `start station latitude` = col_double(),
##   `start station longitude` = col_double(),
##   `end station id` = col_integer(),
##   `end station name` = col_character(),
##   `end station latitude` = col_double(),
##   `end station longitude` = col_double(),
##   bikeid = col_integer(),
##   name_localizedValue0 = col_character(),
##   usertype = col_character(),
##   `birth year` = col_character(),
##   gender = col_integer()
## )

class(BikeData)
## [1] "tbl_df"     "tbl"        "data.frame"

The str() function can be used to provide additional details regarding an object.

str(BikeData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    843104 obs. of  16 variables:
##  $ tripduration           : int  509 513 516 196 491 508 830 294 728 1597 ...
##  $ starttime              : POSIXct, format: "2018-02-01 00:00:04" "2018-02-01 00:00:15" ...
##  $ stoptime               : POSIXct, format: "2018-02-01 00:08:34" "2018-02-01 00:08:49" ...
##  $ start station id       : int  3458 459 459 3318 499 479 251 3538 445 3375 ...
##  $ start station name     : chr  "W 55 St & 6 Ave" "W 20 St & 11 Ave" "W 20 St & 11 Ave" "2 Ave & E 96 St" ...
##  $ start station latitude : num  40.8 40.7 40.7 40.8 40.8 ...
##  $ start station longitude: num  -74 -74 -74 -73.9 -74 ...
##  $ end station id         : int  422 380 380 3351 3167 458 465 3543 332 490 ...
##  $ end station name       : chr  "W 59 St & 10 Ave" "W 4 St & 7 Ave S" "W 4 St & 7 Ave S" "E 102 St & 1 Ave" ...
##  $ end station latitude   : num  40.8 40.7 40.7 40.8 40.8 ...
##  $ end station longitude  : num  -74 -74 -74 -73.9 -74 ...
##  $ bikeid                 : int  30139 30046 31062 32204 15907 30239 27500 21081 25618 18721 ...
##  $ name_localizedValue0   : chr  "Annual Membership" "Annual Membership" "Annual Membership" "Annual Membership" ...
##  $ usertype               : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ birth year             : chr  "1989" "1989" "1990" "1958" ...
##  $ gender                 : int  1 1 1 2 0 1 1 2 1 1 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 16
##   .. ..$ tripduration           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ starttime              :List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_datetime" "collector"
##   .. ..$ stoptime               :List of 1
##   .. .. ..$ format: chr ""
##   .. .. ..- attr(*, "class")= chr  "collector_datetime" "collector"
##   .. ..$ start station id       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ start station name     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ start station latitude : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ start station longitude: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ end station id         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ end station name       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ end station latitude   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ end station longitude  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ bikeid                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ name_localizedValue0   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ usertype               : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ birth year             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ gender                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Questions:
How many observations, i.e. rows, does BikeData contain?
How many variables, i.e. columns, does BikeData contain?

A data.frame may contain a mix of data types. From the above output, we can see that tripduration is an integer (int), `start station latitude` is a number (num), and strings such as usertype are stored as character (chr) vectors; read_csv() does not convert strings to factors. A factor variable has levels inherently defined, which is beneficial when summarizing data.

The summary() function is a generic function that produces summaries that are relevant to the object being passed into the function.

summary(BikeData)
##   tripduration         starttime                  
##  Min.   :     61.0   Min.   :2018-02-01 00:00:04  
##  1st Qu.:    319.0   1st Qu.:2018-02-09 08:50:38  
##  Median :    519.0   Median :2018-02-16 09:22:42  
##  Mean   :    779.3   Mean   :2018-02-16 06:56:02  
##  3rd Qu.:    872.0   3rd Qu.:2018-02-22 21:36:22  
##  Max.   :2564621.0   Max.   :2018-02-28 23:59:52  
##     stoptime                   start station id start station name
##  Min.   :2018-02-01 00:03:32   Min.   :  72     Length:843104     
##  1st Qu.:2018-02-09 09:02:04   1st Qu.: 376     Class :character  
##  Median :2018-02-16 09:33:55   Median : 497     Mode  :character  
##  Mean   :2018-02-16 07:09:02   Mean   :1479                       
##  3rd Qu.:2018-02-22 21:49:34   3rd Qu.:3172                       
##  Max.   :2018-03-25 10:28:11   Max.   :3664                       
##  start station latitude start station longitude end station id
##  Min.   :40.65          Min.   :-74.02          Min.   :  72  
##  1st Qu.:40.72          1st Qu.:-73.99          1st Qu.: 369  
##  Median :40.74          Median :-73.99          Median : 497  
##  Mean   :40.74          Mean   :-73.98          Mean   :1466  
##  3rd Qu.:40.76          3rd Qu.:-73.97          3rd Qu.:3171  
##  Max.   :45.51          Max.   :-73.57          Max.   :3668  
##  end station name   end station latitude end station longitude
##  Length:843104      Min.   :40.65        Min.   :-74.04       
##  Class :character   1st Qu.:40.72        1st Qu.:-74.00       
##  Mode  :character   Median :40.74        Median :-73.99       
##                     Mean   :40.74        Mean   :-73.98       
##                     3rd Qu.:40.76        3rd Qu.:-73.98       
##                     Max.   :45.51        Max.   :-73.57       
##      bikeid      name_localizedValue0   usertype        
##  Min.   :14529   Length:843104          Length:843104     
##  1st Qu.:19432   Class :character       Class :character  
##  Median :28384   Mode  :character       Mode  :character  
##  Mean   :26003                                            
##  3rd Qu.:31657                                            
##  Max.   :33541                                            
##   birth year            gender     
##  Length:843104      Min.   :0.000  
##  Class :character   1st Qu.:1.000  
##  Mode  :character   Median :1.000  
##                     Mean   :1.167  
##                     3rd Qu.:1.000  
##                     Max.   :2.000  

Questions:
Provide a brief description of the summary statistics for tripduration. What information about bike rentals in NYC do these summaries provide?
Provide a brief description of the summary statistics for usertype. What information about bike rentals in NYC do these summaries provide?

Creating New Objects

R allows one to easily work with individual variables within a data.frame. For example, the str() and summary() functions can be applied to only the tripduration variable.

str(BikeData$tripduration)
##  int [1:843104] 509 513 516 196 491 508 830 294 728 1597 ...

summary(BikeData$tripduration)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      61.0     319.0     519.0     779.3     872.0 2565000.0

R allows us to easily create a new variable. The <- is the assignment operator and is used to assign output to an object. R attempts to identify the most appropriate object type when assignments are made. The following will convert the trip duration from seconds to minutes. The outcome will be placed into an object called trip_minutes.

trip_minutes <- BikeData$tripduration / 60

Notice that a new (vector) object is created. Once again, the structure of this object can be identified using str(). A summary of this newly created vector is provided here as well.

str(trip_minutes)
##  num [1:843104] 8.48 8.55 8.6 3.27 8.18 ...

summary(trip_minutes)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     5.32     8.65    12.99    14.53 42740.00

A new object can be assigned directly to an existing data.frame as follows.

BikeData$Trip.Minutes <- BikeData$tripduration / 60
head(BikeData)
## # A tibble: 6 × 17
##   tripduration           starttime            stoptime `start station id`
##          <int>              <dttm>              <dttm>              <int>
## 1          509 2018-02-01 00:00:04 2018-02-01 00:08:34               3458
## 2          513 2018-02-01 00:00:15 2018-02-01 00:08:49                459
## 3          516 2018-02-01 00:00:16 2018-02-01 00:08:52                459
## 4          196 2018-02-01 00:00:16 2018-02-01 00:03:32               3318
## 5          491 2018-02-01 00:00:20 2018-02-01 00:08:31                499
## 6          508 2018-02-01 00:00:27 2018-02-01 00:08:55                479
## # ... with 13 more variables: `start station name` <chr>, `start station
## #   latitude` <dbl>, `start station longitude` <dbl>, `end station
## #   id` <int>, `end station name` <chr>, `end station latitude` <dbl>,
## #   `end station longitude` <dbl>, bikeid <int>,
## #   name_localizedValue0 <chr>, usertype <chr>, `birth year` <chr>,
## #   gender <int>, Trip.Minutes <dbl>

The number of variables in the BikeData data.frame before creating this new variable.
The number of variables in the BikeData data.frame after creating this new variable.

Referring to Elements of Objects - Vectors

As seen above, R allows us to create a new variable from an existing column within a data.frame.
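The assignment steps described above can be tried on a small made-up data frame before working with the full dataset. In this sketch the names toy and toy_minutes are hypothetical, and the three durations are illustrative values only.

```r
# Hypothetical toy data; the real BikeData has many more rows and columns.
toy <- data.frame(tripduration = c(509L, 513L, 196L))

ncol(toy)                                   # 1 column before the assignment

# Store the converted values in a standalone vector ...
toy_minutes <- toy$tripduration / 60

# ... or attach them to the data frame as a new column.
toy$Trip.Minutes <- toy$tripduration / 60

ncol(toy)                                   # 2 columns after the assignment
class(toy_minutes)                          # "numeric": dividing integers by 60 yields doubles
```

Comparing ncol() before and after the assignment is a quick way to confirm that the new column was added.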
R also allows us to obtain certain segments (or subsets) of data objects. The bracket syntax is used to refer to particular segments of data objects.

BikeData$Trip.Minutes[1]
## [1] 8.483333

Getting the first two values in this vector

BikeData$Trip.Minutes[1:2]
## [1] 8.483333 8.550000

The first five elements can be obtained using 1:5 as is shown here.

BikeData$Trip.Minutes[1:5]
## [1] 8.483333 8.550000 8.600000 3.266667 8.183333

If the first five elements are to be saved into a new object, simply make the necessary assignment.

trip.minutes.first5 <- BikeData$Trip.Minutes[1:5]

A new vector of length five is created in your environment. The length() function can be used to identify the number of elements in a vector.

length(trip.minutes.first5)
## [1] 5

Note: The analogous function for a data.frame is nrow(), or the function dim() can be used as well.

Referring to Elements of Objects - data.frames

For a data.frame, names can be specified for the rows and columns. Generally speaking, column names are more important as these are used to identify fields or variables within the dataset. The colnames() and rownames() functions can be used to identify column names and row names, respectively. For a data.frame, names() returns the same column names as colnames().

names(BikeData)
##  [1] "tripduration"            "starttime"              
##  [3] "stoptime"                "start station id"       
##  [5] "start station name"      "start station latitude" 
##  [7] "start station longitude" "end station id"         
##  [9] "end station name"        "end station latitude"   
## [11] "end station longitude"   "bikeid"                 
## [13] "name_localizedValue0"    "usertype"               
## [15] "birth year"              "gender"                 
## [17] "Trip.Minutes"

Getting rownames instead of colnames.

rownames(BikeData)[1:20]
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20"

Akin to vectors, R allows one to refer to rows and columns by number as well. Once again, the square bracket syntax is used to refer to particular segments of data objects.
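The bracket syntax can be practiced on made-up values before applying it to BikeData. The following is a minimal sketch; the names minutes, first5, and toy are hypothetical and the numbers are illustrative only.

```r
# Made-up values for illustration; not taken from the Citibike data.
minutes <- c(8.48, 8.55, 8.60, 3.27, 8.18, 8.47)

minutes[1]            # first element
minutes[1:2]          # first two elements
first5 <- minutes[1:5]
length(first5)        # 5

# The same ideas applied to a small data frame.
toy <- data.frame(id = 1:6, minutes = minutes)
nrow(toy)             # 6 rows
dim(toy)              # 6 rows, 2 columns
names(toy)            # "id" "minutes"
toy[1:3, ]            # rows 1 to 3, all columns
toy[1:3, c(1, 2)]     # rows 1 to 3, columns picked with c()
```

Leaving the column position empty, as in toy[1:3, ], keeps every column, which is why the comma is still required.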
The data.frame is a two-dimensional object; thus, a row and column identifier can be specified.

Getting the 1st row, 1st column of BikeData

BikeData[1,1]
## # A tibble: 1 × 1
##   tripduration
##          <int>
## 1          509

Getting the 1st three rows and only the 1st column of BikeData

BikeData[1:3,1]
## # A tibble: 3 × 1
##   tripduration
##          <int>
## 1          509
## 2          513
## 3          516

Getting the 1st three rows and 1st three columns of BikeData

BikeData[1:3,1:3]
## # A tibble: 3 × 3
##   tripduration           starttime            stoptime
##          <int>              <dttm>              <dttm>
## 1          509 2018-02-01 00:00:04 2018-02-01 00:08:34
## 2          513 2018-02-01 00:00:15 2018-02-01 00:08:49
## 3          516 2018-02-01 00:00:16 2018-02-01 00:08:52

Getting the 1st three rows and all columns of BikeData. If the column argument is not specified, then all columns are provided.

BikeData[1:3, ]
## # A tibble: 3 × 17
##   tripduration           starttime            stoptime `start station id`
##          <int>              <dttm>              <dttm>              <int>
## 1          509 2018-02-01 00:00:04 2018-02-01 00:08:34               3458
## 2          513 2018-02-01 00:00:15 2018-02-01 00:08:49                459
## 3          516 2018-02-01 00:00:16 2018-02-01 00:08:52                459
## # ... with 13 more variables: `start station name` <chr>, `start station
## #   latitude` <dbl>, `start station longitude` <dbl>, `end station
## #   id` <int>, `end station name` <chr>, `end station latitude` <dbl>,
## #   `end station longitude` <dbl>, bikeid <int>,
## #   name_localizedValue0 <chr>, usertype <chr>, `birth year` <chr>,
## #   gender <int>, Trip.Minutes <dbl>

Getting the 1st three rows and only columns 1, 4, and 5. The c( ) syntax is used to create a new vector object. The vector simply specifies which columns to retain in this subset of BikeData.

BikeData[1:3, c(1,4,5)]
## # A tibble: 3 × 3
##   tripduration `start station id` `start station name`
##          <int>              <int>                <chr>
## 1          509               3458      W 55 St & 6 Ave
## 2          513                459     W 20 St & 11 Ave
## 3          516                459     W 20 St & 11 Ave

Getting Help with R

Within R, you can find help on any command (or find commands) using the following.

If you know the name of the R function, e.g. summary, use help("summary") or ?summary.

> help("summary")
> ?summary

If you don't know the function and want to do a keyword search for it, use help.search().

> help.search("summary")

The help.start() function will launch the full help system that includes all manuals, references, etc.

> help.start()
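The help commands above open interactive pages. As a complement, the short sketch below uses two other base R functions, apropos() and args(), which print their results directly in the console; they are not part of the original handout and are shown here only as an assumption about what a reader might find useful.

```r
# apropos() returns the names of objects on the search path matching a pattern.
matches <- apropos("^summary")
head(matches)

# args() shows the arguments of a function without opening its help page.
args(summary)

# example() runs the examples from a help page (uncomment to try it).
# example(summary)
```

The pattern "^summary" is a regular expression that matches names beginning with "summary".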