STAT 1261/2260: Principles of Data Science



STAT 1261/2260: Principles of Data ScienceLecture 10 - Data Wrangling: Two Tables (2/2)Where are we?A grammar of data manipulation: two tablesmutating join (inner, left, right, full)filtering joinset operationData import with readrFiltering joinsFiltering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:semi_join(x, y) keeps all observations in x that have a match in y.anti_join(x, y) drops all observations in x that have a match in y.Semi-joins (1)Semi_joins are useful for matching filtered summary tables back to the original rows.For example, imagine you have found the top-5 most popular destinations:library(nycflights13)library(tidyverse)## Warning: As of rlang 0.4.0, dplyr must be at least version 0.8.0.## x dplyr 0.7.6 is too old for rlang 0.4.2.## b9 Please update dplyr to the latest version.## b9 Updating packages on Windows requires precautions:## < Warning: package 'purrr' was built under R version 3.5.3top5_dest <- flights %>% count(dest, sort = TRUE) %>% head(5)## Warning: The `printer` argument is deprecated as of rlang 0.3.0.## This warning is displayed once per 5_dest## # A tibble: 5 x 2## dest n## <chr> <int>## 1 ORD 17283## 2 ATL 17215## 3 LAX 16174## 4 BOS 15508## 5 MCO 14082Semi-joins (2)Now you want to find each flight that went to one of those destinations. You could construct a filter:flights %>% select(month,day,dep_time,arr_time,dep_delay, arr_delay, dest) %>% filter(dest %in% top5_dest$dest)## Warning: `lang()` is deprecated as of rlang 0.2.0.## Please use `call2()` instead.## This warning is displayed once per session.## Warning: `new_overscope()` is deprecated as of rlang 0.2.0.## Please use `new_data_mask()` instead.## This warning is displayed once per session.## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.## Please use `eval_tidy()` with a data mask instead.## This warning is displayed once per session.## # A tibble: 80,262 x 7## month day dep_time arr_time dep_delay arr_delay dest ## <int> <int> <int> <int> <dbl> <dbl> <chr>## 1 1 1 554 812 -6 -25 ATL ## 2 1 1 554 740 -4 12 ORD ## 3 1 1 557 838 -3 -8 MCO ## 4 1 1 558 753 -2 8 ORD ## 5 1 1 558 924 -2 7 LAX ## 6 1 1 559 702 0 -4 BOS ## 7 1 1 600 837 0 12 ATL ## 8 1 1 606 837 -4 -8 ATL ## 9 1 1 608 807 8 32 ORD ## 10 1 1 615 833 0 -9 ATL ## # ... with 80,252 more rowsYou can also use a semi-join:flights %>% select(month,day,dep_time,arr_time,dep_delay, arr_delay, dest) %>% semi_join(top5_dest,"dest")## # A tibble: 80,262 x 7## month day dep_time arr_time dep_delay arr_delay dest ## <int> <int> <int> <int> <dbl> <dbl> <chr>## 1 1 1 554 812 -6 -25 ATL ## 2 1 1 554 740 -4 12 ORD ## 3 1 1 557 838 -3 -8 MCO ## 4 1 1 558 753 -2 8 ORD ## 5 1 1 558 924 -2 7 LAX ## 6 1 1 559 702 0 -4 BOS ## 7 1 1 600 837 0 12 ATL ## 8 1 1 606 837 -4 -8 ATL ## 9 1 1 608 807 8 32 ORD ## 10 1 1 615 833 0 -9 ATL ## # ... with 80,252 more rowsAnti-joinsThe inverse of a semi-join is an anti-join. An anti-join keeps rows that don’t have a match.These are most useful for diagnosing join mismatches. For example, there are many flights in the nycflights13 dataset that don’t have a matching tail number in the planes table:library(tidyverse)library(nycflights13)flights %>% anti_join(planes, by = "tailnum") %>% count(tailnum, sort = TRUE)## # A tibble: 722 x 2## tailnum n## <chr> <int>## 1 <NA> 2512## 2 N725MQ 575## 3 N722MQ 513## 4 N723MQ 507## 5 N713MQ 483## 6 N735MQ 396## 7 N0EGMQ 371## 8 N534MQ 364## 9 N542MQ 363## 10 N531MQ 349## # ... with 712 more rowsSemi-joins and anti-joinsIf you’re worried about what observations your joins will match, start with a semi_join() or anti_join(). semi_join() and anti_join() never duplicate; they only remove observations.library(tidyverse)df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)## Warning: `list_len()` is deprecated as of rlang 0.2.0.## Please use `new_list()` instead.## This warning is displayed once per session.df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))df1;df2## # A tibble: 4 x 2## x y## <dbl> <int>## 1 1 1## 2 1 2## 3 3 3## 4 4 4## # A tibble: 3 x 2## x z ## <dbl> <chr>## 1 1 a ## 2 1 b ## 3 2 adf1 %>% semi_join(df2, by = "x")## # A tibble: 2 x 2## x y## <dbl> <int>## 1 1 1## 2 1 2df1 %>% anti_join(df2, by = "x")## # A tibble: 2 x 2## x y## <dbl> <int>## 1 3 3## 2 4 4Class Exercise 1: semi_joinDo you think the following two will give you the same result?df1 %>% semi_join(df2, by = "x")df2 %>% semi_join(df1, by = "x")Verify your claim by coding… starting with defining the data frames:library(tidyverse)df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))library(tidyverse)df1 <- data_frame(x = c(1, 1, 3, 4), y = 1:4)df2 <- data_frame(x = c(1, 1, 2), z = c("a", "b", "a"))df1 %>% semi_join(df2, by = "x")## # A tibble: 2 x 2## x y## <dbl> <int>## 1 1 1## 2 1 2df2 %>% semi_join(df1, by = "x")## # A tibble: 2 x 2## x z ## <dbl> <chr>## 1 1 a ## 2 1 bClass Exercise 2: anti_joinRecall that in our last lecture, we compared the results of left join and right join of flights and airports.flights7 <- flights %>% right_join(airports, c("dest"="faa"))## Warning: `chr_along()` is deprecated as of rlang 0.2.0.## This warning is displayed once per session.dim(flights);dim(flights7)## [1] 336776 19## [1] 330531 26Please find the flights whose destination airports are not included in airports.airports0 <- airports %>% rename("dest"="faa")flights0 <- flights %>% select(year:day,dep_time,arr_time,origin,dest) %>% anti_join(airports0, by="dest")flights0## # A tibble: 7,602 x 7## year month day dep_time arr_time origin dest ## <int> <int> <int> <int> <int> <chr> <chr>## 1 2013 1 1 544 1004 JFK BQN ## 2 2013 1 1 615 1039 JFK SJU ## 3 2013 1 1 628 1137 JFK SJU ## 4 2013 1 1 701 1123 EWR SJU ## 5 2013 1 1 711 1151 JFK SJU ## 6 2013 1 1 820 1254 JFK SJU ## 7 2013 1 1 820 1249 JFK SJU ## 8 2013 1 1 840 1311 JFK SJU ## 9 2013 1 1 909 1331 JFK STT ## 10 2013 1 1 913 1346 EWR STT ## # ... with 7,592 more rowsCount the number of flights to each of the airports that are not in airports.flights0 %>% count(dest,sort=TRUE)## # A tibble: 4 x 2## dest n## <chr> <int>## 1 SJU 5819## 2 BQN 896## 3 STT 522## 4 PSE 365Alternatively,flights0 %>% group_by(dest) %>% summarize(N=n()) %>% arrange(desc(N))## # A tibble: 4 x 2## dest N## <chr> <int>## 1 SJU 5819## 2 BQN 896## 3 STT 522## 4 PSE 365Set operationsThe final type of two-table verb is set operations. These expect the x and y inputs to have the same variables, and treat the observations like sets:intersect(x, y): ?return only observations in both x and yunion(x, y): ?return unique observations either x or y or both x and ysetdiff(x, y): ?return observations in x, but not in y.Set operations (1): Union and Intersectiondf1 <- data_frame(x = c(1,2), y = rep(1,2));df1## # A tibble: 2 x 2## x y## <dbl> <dbl>## 1 1 1## 2 2 1df2 <- data_frame(x = rep(1,2), y = c(1,2));df2## # A tibble: 2 x 2## x y## <dbl> <dbl>## 1 1 1## 2 1 2intersect(df1, df2)## # A tibble: 1 x 2## x y## <dbl> <dbl>## 1 1 1union(df1, df2)## # A tibble: 3 x 2## x y## <dbl> <dbl>## 1 1 2## 2 2 1## 3 1 1Note that we get 3 rows, not 4 rows. The common row (1,1) only shows once in the union.Set operations (2): Set differencesReturn observations in df1, but not in df2setdiff(df1, df2)## # A tibble: 1 x 2## x y## <dbl> <dbl>## 1 2 1Return observations in df2, but not in df1setdiff(df2, df1)## # A tibble: 1 x 2## x y## <dbl> <dbl>## 1 1 2All of these operations work with a complete row, comparing the values of every variable.Data import with readrNow we will learn how to read plain-text rectangular files in to R using the package readr.readr is part of tidyverse. So let us first load tidyverse.library(tidyverse)readr functions:read_csv() reads comma-delimited filesread_csv2() reads semicolon-separated filesread_tsv() reads tab-delimited filesread_delim() reads in files with any delimiterThese functions all have similar syntax. We will focus on read_csv because CSV files are one of the most common forms of data storage.read_csv (1): BasicThe first argument is the part to the file to read. ?For example:survey<-read_csv("C:/Users/jub69/Dropbox/A-Pitt/STAT 1261-PDS/CourseMaterial/Lec7-12_wrangle/survey.csv")## Parsed with column specification:## cols(## Student = col_integer(),## Year = col_character(),## Age = col_integer(),## Gender = col_character(),## Major = col_character(),## GPA = col_double()## )survey## # A tibble: 5 x 6## Student Year Age Gender Major GPA## <int> <chr> <int> <chr> <chr> <dbl>## 1 1 Junior 20 F Statistics 3.45## 2 2 Senior 21 F Mathematics 3.88## 3 3 Freshman 18 M Computer Science 3.67## 4 4 Junior 19 M Mathematics 3.25## 5 5 Sophomore 19 F Statistics 3.98When you run read_csv it prints out a column specification that gives the name and type of each column. That’s an important part of readr.One thing you may need to pay attention is that R uses / in the path. Sometimes you might need to change \ to / in your path since R does not recognize /.read_csv (2): Data with metadata at the topBy default, read_csv() uses the first line of the data for column names. There are two cases where you might want to tweak this behavior:Datasets with metadata at the top of the file.For example, in the survey data, I added three lines of description:STAT 1261/2260FALL 2019Student SurveyIn this case, we can use skip = n argument to skip the first n lines:survey2<-read_csv("C:/Users/jub69/Dropbox/A-Pitt/STAT 1261-PDS/CourseMaterial/Lec7-12_wrangle/survey2.csv",skip=3)## Parsed with column specification:## cols(## Student = col_integer(),## Year = col_character(),## Age = col_integer(),## Gender = col_character(),## Major = col_character(),## GPA = col_double()## )survey2## # A tibble: 5 x 6## Student Year Age Gender Major GPA## <int> <chr> <int> <chr> <chr> <dbl>## 1 1 Junior 20 F Statistics 3.45## 2 2 Senior 21 F Mathematics 3.88## 3 3 Freshman 18 M Computer Science 3.67## 4 4 Junior 19 M Mathematics 3.25## 5 5 Sophomore 19 F Statistics 3.98read_csv (3): Data without column namesThe data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn.Suppose my survey data does not have variable names in it.survey3<-read_csv("C:/Users/jub69/Dropbox/A-Pitt/STAT 1261-PDS/CourseMaterial/Lec7-12_wrangle/survey3.csv", col_names = FALSE)## Parsed with column specification:## cols(## X1 = col_integer(),## X2 = col_character(),## X3 = col_integer(),## X4 = col_character(),## X5 = col_character(),## X6 = col_double()## )survey3## # A tibble: 5 x 6## X1 X2 X3 X4 X5 X6## <int> <chr> <int> <chr> <chr> <dbl>## 1 1 Junior 20 F Statistics 3.45## 2 2 Senior 21 F Mathematics 3.88## 3 3 Freshman 18 M Computer Science 3.67## 4 4 Junior 19 M Mathematics 3.25## 5 5 Sophomore 19 F Statistics 3.98read_csv (4): Give column namesAlternatively, you can pass col_names a character vector, which will be used as the column names:survey3<-read_csv("C:/Users/jub69/Dropbox/A-Pitt/STAT 1261-PDS/CourseMaterial/Lec7-12_wrangle/survey3.csv", col_names = c("Student","Year","Age","Gender","Major","GPA"))## Parsed with column specification:## cols(## Student = col_integer(),## Year = col_character(),## Age = col_integer(),## Gender = col_character(),## Major = col_character(),## GPA = col_double()## )survey3## # A tibble: 5 x 6## Student Year Age Gender Major GPA## <int> <chr> <int> <chr> <chr> <dbl>## 1 1 Junior 20 F Statistics 3.45## 2 2 Senior 21 F Mathematics 3.88## 3 3 Freshman 18 M Computer Science 3.67## 4 4 Junior 19 M Mathematics 3.25## 5 5 Sophomore 19 F Statistics 3.98read_csv (5): Create a csv fileLastly, you can also supply an inline CSV file using read_csv().read_csv("a,b,c 1,2,3 4,5,6")## # A tibble: 2 x 3## a b c## <int> <int> <int>## 1 1 2 3## 2 4 5 6This is useful for experimenting with readr and for creating reproducible examples to share with others.Note that in this case read_csv() also uses the first line for column pared to Base RYou might wonder why we are not using read.csv(). There are a few good reasons to favor readr functions over the base equivalents:They are typically much faster (~10X) than their base equivalents.They produce tibbles.They are more reproducible. Base R functions inherit some behavior from your operating system and environment variables, so import code that works on your computer might not work on someone else’s.DatabasesEach two-table verb has a straightforward SQL equivalent. The correspondences between R and SQL are:inner_join(): SELECT * FROM x JOIN y ON x.a = y.aleft_join(): SELECT * FROM x LEFT JOIN y ON x.a = y.aright_join(): SELECT * FROM x RIGHT JOIN y ON x.a = y.afull_join(): SELECT * FROM x FULL JOIN y ON x.a = y.asemi_join(): SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a)anti_join(): SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a)intersect(x, y): SELECT * FROM x INTERSECT SELECT * FROM yunion(x, y): SELECT * FROM x UNION SELECT * FROM ysetdiff(x, y): SELECT * FROM x EXCEPT SELECT * FROM yx and y don’t have to be tables in the same database. If you specify copy = TRUE, dplyr will copy the y table into the same location as the x variable. This is useful if you’ve downloaded a summarized dataset and determined a subset for which you now want the full data.You should review the coercion rules, e.g., factors are preserved only if the levels match exactly and if their levels are different the factors are coerced to character.At this time, dplyr does not provide any functions for working with three or more tables. ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download