STAT 1261/2260: Principles of Data Science



Lecture 7 - Data Wrangling: One Table (1/2)

Where are we?

A taxonomy for data graphics
- Visual cues
- Coordinate systems
- Scale
- Context

The grammar of graphics: ggplot2
- One numeric variable: histogram, density plot, QQ-plot
- One categorical variable: bar graph, pie chart
- One discrete numeric variable or ordinal variable: bar graph
- Two numeric variables: scatterplot with 2D-density (contour plot)
- Two categorical variables: tile plot, stacked bar graph
- Comparing two or more univariate distributions: side-by-side boxplot

Data wrangling

Data manipulation, also called data wrangling, includes three main parts:
- import data
- tidy data
- transform data

The goal of data wrangling is to get the data ready for further analysis, such as data visualization and modeling.

The dplyr package presents a grammar of data wrangling. It was designed to:
- provide commonly used data manipulation tools;
- have fast performance for in-memory operations;
- abstract the interface between the data manipulation operations and the data source.

The GitHub repo for dplyr houses not only the R code but also vignettes for various use cases. The introductory vignette is a good place to start and can be viewed by typing the following at the R console:

    vignette("dplyr", package = "dplyr")

or by opening the dplyr file in the vignettes directory of the dplyr repo.
The material for this section is extracted from Hadley Wickham’s Introduction to dplyr vignette, R for Data Science, and MDSR.

Introduction to tibbles (1)

Let’s take a look at the presidential dataset in the ggplot2 package.

    library(dplyr)
    library(lubridate)
    library(ggplot2)
    presidential
    ## # A tibble: 11 x 4
    ##    name       start      end        party
    ##    <chr>      <date>     <date>     <chr>
    ##  1 Eisenhower 1953-01-20 1961-01-20 Republican
    ##  2 Kennedy    1961-01-20 1963-11-22 Democratic
    ##  3 Johnson    1963-11-22 1969-01-20 Democratic
    ##  4 Nixon      1969-01-20 1974-08-09 Republican
    ##  5 Ford       1974-08-09 1977-01-20 Republican
    ##  6 Carter     1977-01-20 1981-01-20 Democratic
    ##  7 Reagan     1981-01-20 1989-01-20 Republican
    ##  8 Bush       1989-01-20 1993-01-20 Republican
    ##  9 Clinton    1993-01-20 2001-01-20 Democratic
    ## 10 Bush       2001-01-20 2009-01-20 Republican
    ## 11 Obama      2009-01-20 2017-01-20 Democratic

The variable names in presidential are self-explanatory, but note that presidential does not print like a regular data frame. This is because it is a tibble!

Tibbles are opinionated data frames that make working with big data a little easier. The tibble package is part of the tidyverse.

Introduction to tibbles (2)

Two main differences in the usage of a tibble versus a classic data frame:
- Printing. A tibble prints only the first 10 rows and all the columns that fit on screen.
- Subsetting. You can pull out a single variable with $ or [[ ]], as with a data frame, but a tibble is stricter: it never does partial matching on names. [[ ]] can extract by name or position, while $ can only extract by name.

    presidential[["name"]]
    ##  [1] "Eisenhower" "Kennedy"    "Johnson"    "Nixon"      "Ford"
    ##  [6] "Carter"     "Reagan"     "Bush"       "Clinton"    "Bush"
    ## [11] "Obama"
    presidential[[1]] # pull out the first column
    ##  [1] "Eisenhower" "Kennedy"    "Johnson"    "Nixon"      "Ford"
    ##  [6] "Carter"     "Reagan"     "Bush"       "Clinton"    "Bush"
    ## [11] "Obama"
    presidential$name
    ##  [1] "Eisenhower" "Kennedy"    "Johnson"    "Nixon"      "Ford"
    ##  [6] "Carter"     "Reagan"     "Bush"       "Clinton"    "Bush"
    ## [11] "Obama"

What about presidential["name"] and presidential[1]? Try it. Check the class.
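As a related illustration of how tibble subsetting differs from a classic data frame, here is a minimal sketch. The small table (df, tb, and their columns) is made up for this example and is not part of the lecture data; it uses the row-and-column form of [ and partial name matching with $, which are not the cases asked about in the question above.

```r
library(tibble)

# A small hypothetical table, stored both as a data frame and as a tibble.
df <- data.frame(name = c("a", "b"), value = c(1, 2),
                 stringsAsFactors = FALSE)
tb <- as_tibble(df)

# Selecting one column with [ , ]: a data frame drops to a plain vector,
# while a tibble stays a tibble.
class(df[, "name"])   # "character"
class(tb[, "name"])   # "tbl_df" "tbl" "data.frame"

# $ does partial name matching on a data frame but not on a tibble.
df$val                # partial match: returns the value column
tb$val                # NULL, with a warning about an unknown column
```

The stricter behaviour means a mistyped or abbreviated column name surfaces as a warning rather than silently returning some other column.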
See In-class Exercise 1.

Introduction to tibbles (3)

You may change the default print behavior using the print() function:

    print(presidential, n = 2, width = Inf)
    ## # A tibble: 11 x 4
    ##   name       start      end        party
    ##   <chr>      <date>     <date>     <chr>
    ## 1 Eisenhower 1953-01-20 1961-01-20 Republican
    ## 2 Kennedy    1961-01-20 1963-11-22 Democratic
    ## # ... with 9 more rows
    # presidential %>% print(n = 2, width = Inf)

You can also control the default print behavior by setting options:

    options(tibble.width = Inf)
    print(mpg, n = 2)
    ## # A tibble: 234 x 11
    ##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl
    ##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
    ## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p
    ## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p
    ##   class
    ##   <chr>
    ## 1 compact
    ## 2 compact
    ## # ... with 232 more rows

Use RStudio’s built-in data viewer to view the complete dataset.

    View(mpg)

Introduction to tibbles (4)

You can coerce a data frame to a tibble with as_tibble():

    class(iris)
    ## [1] "data.frame"
    iris_tb <- as_tibble(iris)
    class(iris_tb)
    ## [1] "tbl_df"     "tbl"        "data.frame"

Sometimes you might need to turn a tibble back into a data frame:

    class(presidential)
    ## [1] "tbl_df"     "tbl"        "data.frame"
    presidential_df <- as.data.frame(presidential)
    class(presidential_df)
    ## [1] "data.frame"

See In-Class Exercise 2.

Introduction to tibbles (5)

Some other good properties of tibbles:
- A tibble never changes the type of the inputs (e.g., it never converts strings to factors).
- It never creates row names.
- It reports the type of each column while printing.

Data Wrangling: Single Table

dplyr provides a suite of verbs for data manipulation of one dataset:
- select(): select columns (variables) by their names
- filter(): select rows (observations) based on some condition
- arrange(): reorder the rows
- mutate(): create new variables with functions of existing variables
- summarise(): collapse many values to a single summary

See MDSR Figures 4.1–4.5 for a graphical illustration of these operations.

These five verbs can all be used in conjunction with
group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group by group. These six functions provide the verbs for a language of data manipulation.

All verbs work similarly:
- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
- The result is a new data frame.

Select Columns (Variables) (1)

Let us revisit the presidential dataset.

    ## # A tibble: 11 x 4
    ##    name       start      end        party
    ##    <chr>      <date>     <date>     <chr>
    ##  1 Eisenhower 1953-01-20 1961-01-20 Republican
    ##  2 Kennedy    1961-01-20 1963-11-22 Democratic
    ##  3 Johnson    1963-11-22 1969-01-20 Democratic
    ##  4 Nixon      1969-01-20 1974-08-09 Republican
    ##  5 Ford       1974-08-09 1977-01-20 Republican
    ##  6 Carter     1977-01-20 1981-01-20 Democratic
    ##  7 Reagan     1981-01-20 1989-01-20 Republican
    ##  8 Bush       1989-01-20 1993-01-20 Republican
    ##  9 Clinton    1993-01-20 2001-01-20 Democratic
    ## 10 Bush       2001-01-20 2009-01-20 Republican
    ## 11 Obama      2009-01-20 2017-01-20 Democratic

To retrieve only the names and party affiliations of these presidents, we would use select(). The first argument to the select() function is the data frame, followed by an arbitrarily long list of column names, separated by commas:

    select(data, var1, var2, ...)

Select Columns (Variables) (2)

For example, suppose we only need name and party:

    select(presidential, name, party)
    ## # A tibble: 11 x 2
    ##    name       party
    ##    <chr>      <chr>
    ##  1 Eisenhower Republican
    ##  2 Kennedy    Democratic
    ##  3 Johnson    Democratic
    ##  4 Nixon      Republican
    ##  5 Ford       Republican
    ##  6 Carter     Democratic
    ##  7 Reagan     Republican
    ##  8 Bush       Republican
    ##  9 Clinton    Democratic
    ## 10 Bush       Republican
    ## 11 Obama      Democratic

See In-Class Exercise 3.

Select Columns (Variables) (3)

If your dataset has a large number of variables, there are a few handy options to select multiple variables. To demonstrate, load nycflights13::flights.

    library(nycflights13)
    names(flights)
    ##  [1] "year"           "month"          "day"            "dep_time"
    ##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
    ##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"
    ## [13] "origin"         "dest"           "air_time"       "distance"
    ## [17] "hour"           "minute"         "time_hour"

Select all variables from variable year to variable arr_time:

    flights1 <- select(flights, year:arr_time)
    print(flights1, n = 1)
    ## # A tibble: 336,776 x 7
    ##    year month   day dep_time sched_dep_time dep_delay arr_time
    ##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ## 1  2013     1     1      517            515         2      830
    ## # ... with 3.368e+05 more rows

To select all variables except year, month, day:

    flights2 <- select(flights, -(year:day))
    print(flights2, n = 1)
    ## # A tibble: 336,776 x 16
    ##   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
    ##      <int>          <int>     <dbl>    <int>          <int>     <dbl>
    ## 1      517            515         2      830            819        11
    ##   carrier flight tailnum origin dest  air_time distance  hour minute
    ##   <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
    ## 1 UA        1545 N14228  EWR    IAH        227     1400     5     15
    ##   time_hour
    ##   <dttm>
    ## 1 2013-01-01 05:00:00
    ## # ... with 3.368e+05 more rows

Select Columns (Variables) (4)

There are a number of helper functions you can use within select():
- starts_with("abc"): matches names that begin with “abc”.
- ends_with("xyz"): matches names that end with “xyz”.
- contains("ijk"): matches names that contain “ijk”.
- num_range("x", 1:3): matches x1, x2 and x3.

For example, select all columns that end with “time”:

    flight_time1 <- select(flights, ends_with("time"))
    print(flight_time1, n = 2)
    ## # A tibble: 336,776 x 5
    ##   dep_time sched_dep_time arr_time sched_arr_time air_time
    ##      <int>          <int>    <int>          <int>    <dbl>
    ## 1      517            515      830            819      227
    ## 2      533            529      850            830      227
    ## # ... with 3.368e+05 more rows

See In-Class Exercise 4.

Filter Rows (Observations)

To retrieve only the Republican presidents, we use filter(). The first argument to filter() is a data frame, and subsequent arguments are logical conditions that are evaluated on any involved variables.

    filter(presidential, party == "Republican")
    ## # A tibble: 6 x 4
    ##   name       start      end        party
    ##   <chr>      <date>     <date>     <chr>
    ## 1 Eisenhower 1953-01-20 1961-01-20 Republican
    ## 2 Nixon      1969-01-20 1974-08-09 Republican
    ## 3 Ford       1974-08-09 1977-01-20 Republican
    ## 4 Reagan     1981-01-20 1989-01-20 Republican
    ## 5 Bush       1989-01-20 1993-01-20 Republican
    ## 6 Bush       2001-01-20 2009-01-20 Republican

Similarly, we can retrieve the presidents whose terms started in 2001 or later.

    filter(presidential, year(start) >= 2001)
    ## # A tibble: 2 x 4
    ##   name  start      end        party
    ##   <chr> <date>     <date>     <chr>
    ## 1 Bush  2001-01-20 2009-01-20 Republican
    ## 2 Obama 2009-01-20 2017-01-20 Democratic

Logical Conditions (1)

Logical conditions in the filter() function may be a single comparison or a combination of multiple comparisons.
- Basic comparison operators: >, >=, <, <=, != (not equal), and == (equal).
- When you combine two or more comparisons, use Boolean operators: & (and), | (or), ! (not).
- For set comparison, use x %in% Y, which is true if x is an element of the set Y.

Example: Suppose that x is a variable with four observations.
What is the resulting logical vector?

    x <- c(2, 1, 3, 0)
    x == 0
    ## [1] FALSE FALSE FALSE  TRUE
    !(x == 0)
    ## [1]  TRUE  TRUE  TRUE FALSE
    x == 0 | x == 1
    ## [1] FALSE  TRUE FALSE  TRUE
    x %in% c(0, 1)
    ## [1] FALSE  TRUE FALSE  TRUE

Logical Conditions (2)

For example, suppose we want to retrieve Democratic presidents who started their term before 1977.

    filter(presidential, year(start) < 1977 & party == "Democratic")
    ## # A tibble: 2 x 4
    ##   name    start      end        party
    ##   <chr>   <date>     <date>     <chr>
    ## 1 Kennedy 1961-01-20 1963-11-22 Democratic
    ## 2 Johnson 1963-11-22 1969-01-20 Democratic

Suppose now we want to retrieve presidents whose party is “Democratic” OR who started their term before 1977.

    filter(presidential, year(start) < 1977 | party == "Democratic")
    ## # A tibble: 8 x 4
    ##   name       start      end        party
    ##   <chr>      <date>     <date>     <chr>
    ## 1 Eisenhower 1953-01-20 1961-01-20 Republican
    ## 2 Kennedy    1961-01-20 1963-11-22 Democratic
    ## 3 Johnson    1963-11-22 1969-01-20 Democratic
    ## 4 Nixon      1969-01-20 1974-08-09 Republican
    ## 5 Ford       1974-08-09 1977-01-20 Republican
    ## 6 Carter     1977-01-20 1981-01-20 Democratic
    ## 7 Clinton    1993-01-20 2001-01-20 Democratic
    ## 8 Obama      2009-01-20 2017-01-20 Democratic

See In-Class Exercise 5.

Combine filter() and select()

Naturally, combining the filter() and select() commands enables one to drill down to very specific pieces of information. For example, we can find which Democratic presidents served since Watergate. Because start is a Date, we wrap it in year() before comparing it with 1973.

    select(filter(presidential, year(start) > 1973 & party == "Democratic"), name)
    ## # A tibble: 3 x 1
    ##   name
    ##   <chr>
    ## 1 Carter
    ## 2 Clinton
    ## 3 Obama

In the syntax demonstrated above, the filter() operation is nested inside the select() operation. Each of the five verbs takes and returns a data frame, which makes this type of nesting possible.

These long expressions become very difficult to read. Instead, we recommend the use of the pipe operator: %>%.

Pipes are a powerful tool for clearly expressing a sequence of multiple operations.
You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.

The pipe operator %>%

The pipe, %>%, comes from the magrittr package. Packages in the tidyverse load %>% automatically.

Behind the scenes, data %>% f(x) turns into f(data, x), and data %>% f(x) %>% g(y) turns into g(f(data, x), y), and so on. For example:

    presidential %>% filter(year(start) > 1973 & party == "Democratic") %>% select(name)
    ## # A tibble: 3 x 1
    ##   name
    ##   <chr>
    ## 1 Carter
    ## 2 Clinton
    ## 3 Obama
    # select(filter(presidential, year(start) > 1973 & party == "Democratic"), name)

Notice how the expression dataframe %>% filter(condition) is equivalent to filter(dataframe, condition).

The pipe operator %>% (cont.)

    presidential %>% filter(year(start) > 1973 & party == "Democratic") %>% select(name)

The above pipeline reads: Take the presidential data frame, then filter for the Democratic presidents whose start year is later than 1973, then select the variable name.

Keyboard shortcut for %>% in RStudio: Cmd + Shift + M (Mac) and Ctrl + Shift + M (Windows).

See In-Class Exercise 6.
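The rewriting rule described above, data %>% f(x) becoming f(data, x), can be checked directly. The sketch below is only an illustration: the object names nested and piped are made up here, and start is wrapped in lubridate's year() so that the comparison with 1973 is between years rather than between a Date and a number.

```r
library(dplyr)
library(ggplot2)    # provides the presidential dataset
library(lubridate)  # provides year()

# Nested form: filter() is evaluated first, then select().
nested <- select(
  filter(presidential, year(start) > 1973 & party == "Democratic"),
  name
)

# Piped form: the same two steps, written in the order they happen.
piped <- presidential %>%
  filter(year(start) > 1973 & party == "Democratic") %>%
  select(name)

# Both forms return the same names.
identical(nested$name, piped$name)
```

The two forms are interchangeable; the piped version is usually preferred once a pipeline grows beyond two steps, because each line can be read as one "then".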