course1.winona.edu

DSCI 325: R Handout 4 - Introduction to the dplyr package in R

In this handout, we will introduce the dplyr package, which can be used to manipulate data in R. A more thorough discussion of much of the code below can be found in the R documentation for the package. As stated in this documentation, the dplyr package provides “simple functions that correspond to the most common data manipulation verbs, so that you can easily translate your thoughts into code.”

Data Source: Tidyverse – A collection of R packages for Data Science

Select Import Dataset to read in this file.
Select the CSV file to be read in.
A snippet of the import code is shown; copy and paste this into your script window.

Start by installing and loading this package. The data set imported above (FlightDelays) will be used for illustration purposes.
> install.packages("dplyr")
> library(dplyr)

Basic commands for understanding your dataset

> names(FlightDelays)
> head(FlightDelays)

Filter rows with filter()

With the dplyr package, the filter() function allows you to select a subset of the rows of a data frame.
> dplyr::filter(FlightDelays, Month == 1 & DayofMonth == 1)

With the filter() function you can give any number of filtering conditions, which are joined together with “&” or other Boolean operators. For example, consider the following commands using this function.
> dplyr::filter(FlightDelays, Month == 1 & DayofMonth == 1)
> dplyr::filter(FlightDelays, Month == 1 & DayofMonth == 1 & Carrier == "UA")
> dplyr::filter(FlightDelays, Origin == "RST" | Origin == "LSE")
> dplyr::filter(FlightDelays, (Origin == "RST" | Origin == "LSE") & Dest == "ORD")

Slice rows with slice()

To extract rows 1 through 4 from the data frame, we can use the slice() function:
> dplyr::slice(FlightDelays, 1:4)

You can also select nonconsecutive rows:
> dplyr::slice(FlightDelays, c(1:2, 4))

Arrange rows with arrange()

This function from the dplyr package can be used to reorder the rows of a data set.
> dplyr::arrange(FlightDelays, Month, DayofMonth)
> dplyr::arrange(FlightDelays, AirlineID)

To order a column in descending order, use desc():
> dplyr::arrange(FlightDelays, desc(AirlineID))

Select columns with select()

If you’re working with a large data set and only a few variables are of actual interest to you, you can select that subset of variables easily with dplyr. For example, consider the following:
> dplyr::select(FlightDelays, Carrier, Origin, DepDelay)
> dplyr::select(FlightDelays, Year:DayOfWeek)

You can also use various “helper functions” within select(), such as the following:
- starts_with()
- ends_with()
- matches()
- contains()
> dplyr::select(FlightDelays, starts_with("Dep"))
> dplyr::select(FlightDelays, contains("Delay"))

Also, a common use of select() together with distinct() is to determine which unique (or distinct) values a variable (or a set of variables) takes on; the number of rows in the result is the number of distinct values.
> dplyr::distinct(dplyr::select(FlightDelays, Carrier))
> dplyr::distinct(dplyr::select(FlightDelays, Carrier, Dest))

Add new columns with mutate()

First, let us create a new data.frame with only the following columns.
> FlightDelays2 <- dplyr::select(FlightDelays, DayOfWeek, Carrier, Dest, DepDelay, ArrDelay, AirTime, Distance)

In addition to selecting from existing columns, you can add new columns that are functions of existing columns.
> dplyr::mutate(FlightDelays2, Gain = ArrDelay - DepDelay, Speed = Distance / AirTime * 60)

Note that the newly created columns are *not* automatically put into the existing data.frame. To keep them, assign the result back to FlightDelays2:
> FlightDelays2 <- dplyr::mutate(FlightDelays2, Gain = ArrDelay - DepDelay, Speed = Distance / AirTime * 60)
> head(FlightDelays2)

The transmute() function also allows you to create new columns that are functions of existing columns. The difference is that transmute() keeps only the new columns that you create.
> dplyr::transmute(FlightDelays2, Gain = ArrDelay - DepDelay, Speed = Distance / AirTime * 60)

Summarize values with summarize()

This lets you create summaries that collapse a data frame to a single row.
For example, consider the following (summarise() and summarize() are synonyms):
> dplyr::summarise(FlightDelays2, AvgDepDelay = mean(DepDelay))

If DepDelay contains missing values (NA), the mean above is returned as NA. To remove the missing values from the mean calculation, use na.rm = TRUE:
> dplyr::summarise(FlightDelays2, AvgDepDelay = mean(DepDelay, na.rm = TRUE))

Commonalities of functions in the dplyr package

Note that all of these functions are similar in the following ways:
- The first argument is a data frame
- Subsequent arguments tell R what to do with that data frame
- The result is a new data frame

As stated in the aforementioned R documentation, these five functions together “provide the basis of a language of data manipulation.” At the most basic level, we alter data sets in the following ways:
- Reorder rows (arrange())
- Select observations (rows) of interest (filter() or slice())
- Select variables (columns) of interest (select())
- Add new variables (columns) that are functions of existing variables (mutate())
- Aggregate many rows into a single row (summarize())

Grouped operations with group_by()

Finally, note that you can also use all of the above functions to process a data set “by group.”
> group.carrier <- dplyr::group_by(FlightDelays2, Carrier)
> dplyr::summarise(group.carrier, mean(DepDelay, na.rm = TRUE))
> dplyr::summarise(group.carrier, AvgDepDelay = mean(DepDelay, na.rm = TRUE), AvgArrDelay = mean(ArrDelay, na.rm = TRUE))
> dplyr::summarise(group.carrier, Count = n())

Other Common Summaries
- Standard Deviation: sd()
- Minimum: min()
- Maximum: max()
- Count: n()

Chaining operations

Suppose you want to do many operations at once. The pipe operator %>% passes the result on its left as the first argument of the function on its right, so the steps can be chained:
> FlightDelays %>%
    dplyr::filter(Dest == "RST" | Dest == "LSE") %>%
    dplyr::group_by(Origin, Dest) %>%
    dplyr::summarise(AvgDelay = mean(ArrDelayMinutes, na.rm = TRUE)) %>%
    dplyr::arrange(Dest)
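As a rough, self-contained sketch of the chaining idea, the block below builds a small made-up data frame and runs the same four steps twice: once as nested function calls and once with %>%. Here toy_flights is a hypothetical stand-in for FlightDelays (its values are invented for illustration), and the sketch assumes the dplyr package is already installed.

```r
library(dplyr)

# Hypothetical toy data standing in for FlightDelays (values are made up).
toy_flights <- data.frame(
  Origin          = c("MSP", "MSP", "ORD", "ORD", "MSP"),
  Dest            = c("RST", "RST", "LSE", "LSE", "ORD"),
  ArrDelayMinutes = c(10, NA, 0, 30, 5),
  stringsAsFactors = FALSE
)

# Nested form: the steps must be read from the inside out.
nested <- dplyr::arrange(
  dplyr::summarise(
    dplyr::group_by(
      dplyr::filter(toy_flights, Dest == "RST" | Dest == "LSE"),
      Origin, Dest
    ),
    AvgDelay = mean(ArrDelayMinutes, na.rm = TRUE)
  ),
  Dest
)

# Piped form: the same steps, written in the order they are applied.
piped <- toy_flights %>%
  dplyr::filter(Dest == "RST" | Dest == "LSE") %>%
  dplyr::group_by(Origin, Dest) %>%
  dplyr::summarise(AvgDelay = mean(ArrDelayMinutes, na.rm = TRUE)) %>%
  dplyr::arrange(Dest)

# ORD -> LSE averages 15; MSP -> RST averages 10 (the NA is dropped).
print(piped)
```

Both forms return the same two rows; the piped version simply reads top to bottom in the order the steps are carried out.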