STAT 1261/2260: Principles of Data Science



STAT 1261/2260: Principles of Data ScienceLecture 8 - Data Wrangling: one table (2/2)Where are we?Data Wrangling: Single TableIntroduction to tibblesFive verbsfilter(): select rows (observations) based on some conditionselect(): select columns (variables) by their namesarrange(): reorder the rowsmutate(): creating new variables with functions of existing variablessummarise(): collapse many values to a single summaryCreate new variables with mutate()Frequently, in the process of conducting our analysis, we will create, re-define, and rename some of our variables. The functions mutate() and rename() provide these capabilities.While we have the data on when these presidents took and relinquished office, we don’t actually have a variable giving the length of each president’s term.library(dplyr);library(lubridate);library(ggplot2) ## Warning: As of rlang 0.4.0, dplyr must be at least version 0.8.0.## x dplyr 0.7.6 is too old for rlang 0.4.2.## b9 Please update dplyr to the latest version.## b9 Updating packages on Windows requires precautions:## < <- presidential %>% mutate(term_length = (end - start)/dyears(1)) ## Warning: The `printer` argument is deprecated as of rlang 0.3.0.## This warning is displayed once per session.head(mypresidents,3)## # A tibble: 3 x 5## name start end party term_length## <chr> <date> <date> <chr> <dbl>## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01## 2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84## 3 Johnson 1963-11-22 1969-01-20 Democratic 5.17In this situation, it is generally considered good style to create a new object rather than clobbering the one that comes from an external source. To preserve the existing presidential data frame, we save the result of mutate() as a new object called mypresidents.Note that dyears(x) function gives duration of time in x years. It can be measured in seconds, days, weeks, etc.dyears(1)## [1] "31536000s (~52.14 weeks)"as.numeric(dyears(1))/(24*3600)## [1] 365Load the package lubridate and type ?dyears in R console for more information about duration functions.Useful creation functions (1)The key property is that the function must be vectorized: it must take a vector of values as input, and return a vector with the same number of values as output.Arithmetic operators +,-,*,/,^Logs. log(),log2(),log10() y <- c(2, 4, 8) log2(y)## [1] 1 2 3Exponential functions: exp()x <- seq(0,1,0.3)z <- exp(x); z## [1] 1.000000 1.349859 1.822119 2.459603Modular arithmetic operators: %/% and %%%/% is integer division%% yield the remainder of the division x <- c(10, 5, 3) x %/% 3## [1] 3 1 1 x %% 3## [1] 1 2 0Useful creation functions (2)Cumulative and rolling aggregates z <- seq(1,5);z## [1] 1 2 3 4 5 cumsum(z) # cumulative sum## [1] 1 3 6 10 15 cummean(z) # cumulative mean## [1] 1.0 1.5 2.0 2.5 3.0 cumprod(z) # cumulative product## [1] 1 2 6 24 120 cummin(z) # cumulative min## [1] 1 1 1 1 1 cummax(z) # cumulative max## [1] 1 2 3 4 5Arrange rows (1)The function sort() will sort a vector, but not a data frame.The function that will sort a data frame is called arrange().To sort the presidential data frame by the term length, we specify that we want the column term_length in descending order.mypresidents %>% arrange(desc(term_length))## # A tibble: 11 x 5## name start end party term_length## <chr> <date> <date> <chr> <dbl>## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01## 2 Reagan 1981-01-20 1989-01-20 Republican 8.01## 3 Clinton 1993-01-20 2001-01-20 Democratic 8.01## 4 Bush 2001-01-20 2009-01-20 Republican 8.01## 5 Obama 2009-01-20 2017-01-20 Democratic 8.01## 6 Nixon 1969-01-20 1974-08-09 Republican 5.55## 7 Johnson 1963-11-22 1969-01-20 Democratic 5.17## 8 Carter 1977-01-20 1981-01-20 Democratic 4.00## 9 Bush 1989-01-20 1993-01-20 Republican 4.00## 10 Kennedy 1961-01-20 1963-11-22 Democratic 2.84## 11 Ford 1974-08-09 1977-01-20 Republican 2.45# arrange(mypresidents, desc(term_length))Arrange rows (2): Save the sorted dataNote that the original data frame is not changed, even after the sorting.mypresidents## # A tibble: 11 x 5## name start end party term_length## <chr> <date> <date> <chr> <dbl>## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01## 2 Kennedy 1961-01-20 1963-11-22 Democratic 2.84## 3 Johnson 1963-11-22 1969-01-20 Democratic 5.17## 4 Nixon 1969-01-20 1974-08-09 Republican 5.55## 5 Ford 1974-08-09 1977-01-20 Republican 2.45## 6 Carter 1977-01-20 1981-01-20 Democratic 4.00## 7 Reagan 1981-01-20 1989-01-20 Republican 8.01## 8 Bush 1989-01-20 1993-01-20 Republican 4.00## 9 Clinton 1993-01-20 2001-01-20 Democratic 8.01## 10 Bush 2001-01-20 2009-01-20 Republican 8.01## 11 Obama 2009-01-20 2017-01-20 Democratic 8.01To save the sorted data frame, you need to define it as a new object.mypresidents_1 <- mypresidents %>% arrange(desc(term_length)); mypresidents_1## # A tibble: 11 x 5## name start end party term_length## <chr> <date> <date> <chr> <dbl>## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01## 2 Reagan 1981-01-20 1989-01-20 Republican 8.01## 3 Clinton 1993-01-20 2001-01-20 Democratic 8.01## 4 Bush 2001-01-20 2009-01-20 Republican 8.01## 5 Obama 2009-01-20 2017-01-20 Democratic 8.01## 6 Nixon 1969-01-20 1974-08-09 Republican 5.55## 7 Johnson 1963-11-22 1969-01-20 Democratic 5.17## 8 Carter 1977-01-20 1981-01-20 Democratic 4.00## 9 Bush 1989-01-20 1993-01-20 Republican 4.00## 10 Kennedy 1961-01-20 1963-11-22 Democratic 2.84## 11 Ford 1974-08-09 1977-01-20 Republican 2.45Arrange rows (3): Sort by two variablesMost presidents completed either one or two full terms, and thus have the exact same term length. To break these ties, we can further sort by start.mypresidents %>% arrange(desc(term_length), start)## # A tibble: 11 x 5## name start end party term_length## <chr> <date> <date> <chr> <dbl>## 1 Eisenhower 1953-01-20 1961-01-20 Republican 8.01## 2 Reagan 1981-01-20 1989-01-20 Republican 8.01## 3 Clinton 1993-01-20 2001-01-20 Democratic 8.01## 4 Bush 2001-01-20 2009-01-20 Republican 8.01## 5 Obama 2009-01-20 2017-01-20 Democratic 8.01## 6 Nixon 1969-01-20 1974-08-09 Republican 5.55## 7 Johnson 1963-11-22 1969-01-20 Democratic 5.17## 8 Carter 1977-01-20 1981-01-20 Democratic 4.00## 9 Bush 1989-01-20 1993-01-20 Republican 4.00## 10 Kennedy 1961-01-20 1963-11-22 Democratic 2.84## 11 Ford 1974-08-09 1977-01-20 Republican 2.45See In-Class Exercise 1.Summarize data using summarize()When used alone, summarize() collapses a data frame into a single row. We have to specify how we want to reduce an entire column of data into a single value.mypresidents %>% summarize( N = n(), first_year = min(year(start)), last_year = max(year(end)), num_dems = sum(party == "Democratic"), years = sum(term_length), avg_term_length = mean(term_length) )## # A tibble: 1 x 6## N first_year last_year num_dems years avg_term_length## <int> <dbl> <dbl> <int> <dbl> <dbl>## 1 11 1953 2017 5 64.0 5.82Numerical summaries that calculatedN: ?The function n() simply counts the number of rows (observations)first_year:? the smallest start year of the presidentslast_year:? the largest end year of the presidentsnum_dems:? the number of Democratic presidentsyears:? total term length of the presidentsavg_term_length:? average term length of the presidentsUse summarize() with group_by()From the numerical summary of the presidential dataset, we found that the average length of term of the presidents is 5.82 years. This begs the question of whether Democratic or Republican presidents served a longer average term during this time period.summarize() with group_by() enables us to make this kind of comparisons.mypresidents %>% group_by(party) %>% summarize( N = n(), avg_term_length = mean(term_length), std_term_length = sd(term_length) )## # A tibble: 2 x 4## party N avg_term_length std_term_length## <chr> <int> <dbl> <dbl>## 1 Democratic 5 5.60 2.34## 2 Republican 6 6.00 2.40The average length of term of Republican presidents is slightly higher than that of the Democratic. The standard deviations are quite close too.See In-Class Exercise 2.Useful summary functionsMeasures of location: median(), quantile, min(), max()Countsn() returns the size of the current groupn_distinct(x) counts the number of distinct values of a variable xsum(!is.na(x)) counts the number of non-missing values in xCounts and proportions of logical values x <- seq(1:10); x## [1] 1 2 3 4 5 6 7 8 9 10 (x<5)## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE sum(x<5) # number of values in x that is less than 5## [1] 4 mean(x<5) # proportion of values in x that is less than 5## [1] 0.4Handling missing valuesIn R, missing values are usually denoted by NA (not available), which represents an unknown value. Almost any operation involving an unknown value will also be unknown. That’s why missing values are “contagious”. For example,x<-c(1,2,NA,4,5)x + 10 ## [1] 11 12 NA 14 15mean(x)## [1] NATo check whether elements of x is NA , use is.na(x).To remove the missing values prior to computation, use na.rm=TRUE argument in the function.mean(x, na.rm=TRUE)## [1] 3To filter out all observations with missing values:<DATA_FRAME> %>% filter(!is.na(<VARIABLE>)An extended exampleThe nycflights13 package has information of all flights departed from NYC in 2013.? Suppose we would like to know which carrier has the longest average departure delay on Jan. 1, 2013.library(nycflights13) delay<- flights %>% # Create a new dataset select(year,month,day,dep_delay,carrier) %>% # select variables of interest filter(month == 1 & day == 1 & dep_delay>0) %>% # pick flights on 1/1/2013 group_by(carrier) %>% # group the data by carrier summarize(AvgDelay = mean(dep_delay,na.rm=TRUE)) %>% # ignore missing values arrange(desc(AvgDelay)) # order the summary dataset by average departure delay## Warning: `lang()` is deprecated as of rlang 0.2.0.## Please use `call2()` instead.## This warning is displayed once per session.## Warning: `new_overscope()` is deprecated as of rlang 0.2.0.## Please use `new_data_mask()` instead.## This warning is displayed once per session.## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.## Please use `eval_tidy()` with a data mask instead.## This warning is displayed once per session.delay## # A tibble: 11 x 2## carrier AvgDelay## <chr> <dbl>## 1 MQ 82.3 ## 2 EV 58.1 ## 3 9E 56.6 ## 4 AA 35 ## 5 B6 24.5 ## 6 DL 19.7 ## 7 UA 14.4 ## 8 WN 10.2 ## 9 US 6 ## 10 FL 4 ## 11 VX 2.25An extended example (cont.)Suppose some people would like to know which destinations have the longest average arrival delay on Dec. 25, 2013.library(nycflights13) ArrDelay<- flights %>% # Create a new dataset select(year,month,day,arr_delay,dest) %>% # select variables of interest filter(month == 12 & day == 25 & arr_delay>0) %>% # pick flights on 1/1/2013 group_by(dest) %>% # group the data by carrier summarize(AvgArrDelay = mean(arr_delay,na.rm=TRUE)) %>% # ignore missing values arrange(desc(AvgArrDelay)) # order the summary dataset by average departure delayprint(ArrDelay,n=10)## # A tibble: 57 x 2## dest AvgArrDelay## <chr> <dbl>## 1 JAC 175 ## 2 CLE 148. ## 3 CHS 98.5## 4 SAT 79 ## 5 BNA 70.6## 6 DTW 60 ## 7 MSY 59 ## 8 BUF 51.8## 9 BOS 48 ## 10 HOU 46.3## # ... with 47 more rowsAn extended example (cont.)Which carrier has the highest average speed to San Francisco (SFO)?library(nycflights13) Speed<- flights %>% select(air_time, distance, dest,carrier) %>% filter(dest=="SFO") %>% # pick flights to SFO mutate(speed=distance/air_time*60,na.rm=TRUE) %>% # calculate speed group_by(carrier) %>% # group the data by carrier summarize(AvgSpeed = mean(speed,na.rm=TRUE)) %>% # ignore missing values arrange(desc(AvgSpeed)) # order the summary dataset by average departure delaySpeed## # A tibble: 5 x 2## carrier AvgSpeed## <chr> <dbl>## 1 UA 450.## 2 DL 448.## 3 B6 448.## 4 AA 446.## 5 VX 446.It turns out that United Airline has the highest speed in the air! 450 miles/hour.See In-Class Exercise 3.A bit of side storyWickham’s approach is inspired by his desire to blur the boundaries between R and the ubiquitous relational database querying syntax SQL. In fact, the five verbs, when combined, exhibit a close similarity to SQL query statements (at least for data analysis purpose).Thus, mastering dplyr data wrangling verbs have become a gateway to analyzing big data, through relational database management system and beyond. The real power of dplyr is that it abstracts the data source, i.e., whether it is a data frame, a database, or Spark.In fact, the statistical package “SAS” have always had a powerful “data step” that does about the same thing, since 1970s.dplyr also includes the powerful workflow operator called “pipes”, found in e.g.?Unix shell script. We’ve seen the pipe in several examples. ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download