Course1.winona.edu



DSCI 325: R Handout 4 - Introduction to the dplyr package in R

In this handout, we will introduce the dplyr package, which can be used to manipulate data in R. Much of the code below, along with a more thorough discussion, is taken from the R documentation for dplyr. As stated in this documentation, the dplyr package provides “simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.”

Data Source: Tidyverse – A collection of R packages for Data Science

Select Import Dataset to read in this file. Select the CSV file to be read in. A snippet of the import code is displayed; copy and paste this into your script window.

Start by installing and loading this package, and read in the data set that will be used for illustration purposes.
> install.packages("dplyr")
> library(dplyr)

Basic commands for understanding your dataset

> names(FlightDelays)
> head(FlightDelays)

Filter rows with filter()

With the dplyr package, the filter() function allows you to select a subset of the rows of a data frame.
> dplyr::filter(FlightDelays, Month == 1 & DayofMonth == 1)

You can give filter() any number of filtering conditions, joined together with “&” or other Boolean operators. For example, consider the following commands using this function.
> dplyr::filter(FlightDelays, Month == 1 & DayofMonth == 1)
> dplyr::filter(FlightDelays, Month == 1 & DayofMonth == 1 & Carrier == "UA")
> dplyr::filter(FlightDelays, Origin == "RST" | Origin == "LSE")
> dplyr::filter(FlightDelays, (Origin == "RST" | Origin == "LSE") & Dest == "ORD")

Slice rows with slice()

To extract rows 1 through 4 from the data frame, we can use the slice() function:
> dplyr::slice(FlightDelays, 1:4)

You can also select nonconsecutive rows:
> dplyr::slice(FlightDelays, c(1:2,4))

Arrange rows with arrange()

This function from the dplyr package can be used to reorder the rows of a data set.
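
The row verbs described so far (filter(), slice(), and arrange()) can also be tried without the FlightDelays file. The following is only a minimal sketch under stated assumptions: the data frame toy and all of its values are made up for illustration, and only the column names mirror the handout.

```r
library(dplyr)

# Hypothetical toy data: column names follow the handout, values are invented.
toy <- data.frame(
  Month      = c(1, 1, 2, 1),
  DayofMonth = c(1, 2, 1, 1),
  Carrier    = c("UA", "DL", "UA", "DL"),
  Origin     = c("RST", "LSE", "MSP", "RST"),
  stringsAsFactors = FALSE
)

# Keep only the rows that satisfy both conditions (January 1).
jan1 <- dplyr::filter(toy, Month == 1 & DayofMonth == 1)

# Keep rows 1, 2 and 4 by position.
picked <- dplyr::slice(toy, c(1:2, 4))

# Reorder rows by Month, breaking ties with DayofMonth.
ordered <- dplyr::arrange(toy, Month, DayofMonth)

print(jan1)
print(picked)
print(ordered)
```

Each call returns a new data frame and leaves toy unchanged, which is why the results are stored under new names here.
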
> dplyr::arrange(FlightDelays, Month, DayofMonth)
> dplyr::arrange(FlightDelays, AirlineID)

To order a column in descending order, use desc():
> dplyr::arrange(FlightDelays, desc(AirlineID))

Select columns with select()

If you’re working with a large data set and only a few variables are of actual interest to you, you can select that subset of variables easily with dplyr. For example, consider the following:
> dplyr::select(FlightDelays, Carrier, Origin, DepDelay)
> dplyr::select(FlightDelays, Year:DayOfWeek)

You can also use various “helper functions” within select(), as shown below.
- starts_with()
- ends_with()
- matches()
- contains()
> dplyr::select(FlightDelays, starts_with("Dep"))
> dplyr::select(FlightDelays, contains("Delay"))

Also, a common use of the select() function, combined with distinct(), is to determine the unique (or distinct) values a variable (or a set of variables) takes on.
> dplyr::distinct(dplyr::select(FlightDelays, Carrier))
> dplyr::distinct(dplyr::select(FlightDelays, Carrier, Dest))

Add new columns with mutate()

First, let us create a new data.frame with only the following columns.
> FlightDelays2 <- dplyr::select(FlightDelays, DayOfWeek, Carrier, Dest, DepDelay, ArrDelay, AirTime, Distance)

In addition to selecting from existing columns, you can add new columns that are functions of existing columns.
> dplyr::mutate(FlightDelays2, Gain = ArrDelay - DepDelay, Speed = Distance / AirTime*60)

Note that the newly created columns are *not* automatically put into the existing data.frame. To keep them, assign the result back to the data.frame:
> FlightDelays2 <- dplyr::mutate(FlightDelays2, Gain = ArrDelay - DepDelay, Speed = Distance / AirTime*60)
> head(FlightDelays2)

The transmute() function also allows you to create new columns that are functions of existing columns. The difference is that it keeps only the new columns that you create.
> dplyr::transmute(FlightDelays2, Gain = ArrDelay - DepDelay, Speed = Distance / AirTime*60)

Summarize values with summarize()

This lets you create summaries that collapse a data frame to a single row.
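
As a minimal sketch of that collapse, the block below uses a small made-up data frame (the name toy and its values are hypothetical, not taken from FlightDelays). It also shows that a single missing value turns the mean into NA unless na.rm = TRUE is set.

```r
library(dplyr)

# Hypothetical toy data with one missing departure delay.
toy <- data.frame(
  DepDelay = c(10, NA, -5, 25)
)

# Without na.rm = TRUE, the NA propagates and the summary is NA.
with_na <- dplyr::summarise(toy, AvgDepDelay = mean(DepDelay))

# With na.rm = TRUE, the mean is taken over the three observed values.
without_na <- dplyr::summarise(toy, AvgDepDelay = mean(DepDelay, na.rm = TRUE))

print(with_na)
print(without_na)
```

Both results are one-row data frames, regardless of how many rows toy has.
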
For example, consider the following:
> dplyr::summarise(FlightDelays2, AvgDepDelay = mean(DepDelay))

If DepDelay contains missing values, this mean is NA. Remove the NAs from the mean calculation with na.rm=TRUE:
> dplyr::summarise(FlightDelays2, AvgDepDelay = mean(DepDelay, na.rm=TRUE))

Commonalities of functions in the dplyr package

Note that all of these functions are similar in the following ways:
- The first argument is a data frame
- Subsequent arguments tell R what to do with that data frame
- The result is a new data frame

As stated in the aforementioned R documentation, these five functions together “provide the basis of a language of data manipulation.” At the most basic level, we alter data sets in the following ways:
- Reorder rows (arrange())
- Select observations (rows) of interest (filter() or slice())
- Select variables (columns) of interest (select())
- Add new variables (columns) that are functions of existing variables (mutate())
- Aggregate many rows into a single row (summarize())

Grouped operations with group_by()

Finally, note that you can also use all of the above functions to process a data set “by group.”
> group.carrier <- dplyr::group_by(FlightDelays2, Carrier)
> dplyr::summarise(group.carrier, mean(DepDelay, na.rm=TRUE))
> dplyr::summarise(group.carrier, AvgDepDelay = mean(DepDelay, na.rm=TRUE), AvgArrDelay = mean(ArrDelay, na.rm=TRUE))
> dplyr::summarise(group.carrier, Count=n())

Other Common Summaries
- Standard Deviation: sd()
- Minimum: min()
- Maximum: max()
- Count: n()

Chaining operations

Suppose you want to do many operations at once. The pipe operator %>% passes the result of each step to the next function as its first argument:
> FlightDelays %>%
    dplyr::filter(Dest=="RST" | Dest=="LSE") %>%
    dplyr::group_by(Origin, Dest) %>%
    dplyr::summarise(AvgDelay = mean(ArrDelayMinutes, na.rm=TRUE)) %>%
    dplyr::arrange(Dest)
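
The grouped summary and the chained pipeline above can be reproduced on a small scale. This is only a sketch under stated assumptions: the data frame toy and its values are invented for illustration, and the grouping here is by Carrier rather than by Origin and Dest.

```r
library(dplyr)

# Hypothetical toy data: column names follow the handout, values are invented.
toy <- data.frame(
  Carrier  = c("UA", "UA", "DL", "DL", "DL"),
  Dest     = c("RST", "LSE", "RST", "ORD", "RST"),
  ArrDelay = c(10, NA, 20, 5, 40),
  stringsAsFactors = FALSE
)

# Filter, group, summarise, and sort in one chain.
by_carrier <- toy %>%
  dplyr::filter(Dest == "RST" | Dest == "LSE") %>%   # drops the ORD row
  dplyr::group_by(Carrier) %>%                       # one group per carrier
  dplyr::summarise(
    AvgArrDelay = mean(ArrDelay, na.rm = TRUE),      # mean of observed delays
    Count       = dplyr::n()                         # rows per group, NAs included
  ) %>%
  dplyr::arrange(Carrier)

print(by_carrier)
```

With a grouped data frame, summarise() returns one row per group instead of a single row, so by_carrier has one row for each carrier that survives the filter.
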