Course1.winona.edu



DSCI 325: Final – Takehome
Spring 2018
Points: 75
Name(s): ______________________________________
         ______________________________________
         ______________________________________

I recently heard about the work of Matthew Desmond, the author of the book Evicted: Poverty and Profit in the American City. Desmond has studied housing, poverty, and evictions since 2008.

Website:

Desmond recently made his data available, which he claims is the most comprehensive data on evictions currently available. I used the following BASH script to download data from his website.

Your Task: Create an R script (instead of a BASH script) that can be used to download each counties.csv file onto your local computer. Your script will then read in each file and concatenate them into one large data.frame that contains information for all states.

Note: To my knowledge, an API does not exist to get this data; thus, a brute force approach to downloading data will be used here.

Getting Access to Data: The data provided on the Eviction Lab website has the following structure:
- Each state has its own directory. Note: some directories will not be used, e.g. court-reported-stats/ and non-imputed/.
- Within each state directory there are several files. The only file to be downloaded here is the counties.csv file.

Directory Structure
    AK/
    AL/
    AR/
    AZ/
    CA/
    CO/
    court-reported-stats/
    CT/
    DC/
    ::
    NM/
    non-imputed/
    NV/
    ::
    WY/
    DATA_DICTIONARY.txt
    robots.txt

Contents of each state directory
    AK/
        all.csv
        block-groups.csv
        block-groups.geojson
        cities.csv
        cities.geojson
        counties.csv
        counties.geojson
        states.csv
        states.geojson
        tracts.csv
        tracts.geojson
    AL/
        all.csv
        block-groups.csv
        block-groups.geojson
        cities.csv
        cities.geojson
        counties.csv

Download the State_Abbreviations.csv file from the course website and read in this file. This file contains each state (including the District of Columbia) and their respective abbreviations, which will be used to create the necessary directories and links.

To begin, consider the following sequence of R commands.

Note: You will need to change the string in Line 16 to reflect the directory into which you'd like to download the data.

Run Line 11. What is the purpose of Line 11? (1 pt)

Change Line 11 so that you do not have to deal with the stupid dialog box that pops up in order to read in the file. (2 pts)
<Paste new Line 11 here>

What is the purpose of Line 12? Briefly discuss. (1 pt)

How could one check whether the code contained in Lines 18-20 ran successfully? [Note: Your check need not be done inside of R.] Explain. (2 pts)

Provide a screen-shot (or some other output/proof) showing that Lines 18-20 were successfully run. (2 pts)

What is the full url address that must be used to download the AK/counties.csv file from the Eviction Lab website? How does this full url change for a different state? Discuss briefly. (2 pts)

Create a vector, say Links, that contains the full url address so that the counties.csv file can be downloaded for each state and DC. (2 pts)
Note: I used the paste() function to accomplish this task.
<Paste code used to create the Links vector here>

Next, use a for() loop and the download.file() function in R to automatically download the counties.csv files for each state and DC from the Eviction Lab website. (4 pts)
Notes: 1) The for() loop should have a similar structure to the one provided above. 2) An example use of the download.file() function is provided on page 3 of Handout #12.
<Paste code used to automatically download counties.csv from eviction lab website>

Consider the following sequence of R commands.

What is the purpose of Line 35? Briefly discuss. (1 pt)

What is the purpose of Line 39? Briefly discuss. (2 pts)

What is the purpose of Line 41? Briefly discuss. (2 pts)

Consider the following map, which shows the eviction rates by county. Re-run the code provided to create your own map. Remove my map and paste your new map in its place. (2 pts)

    library(ggplot2)  # provides labs() and scale_fill_continuous()

    # Preparing data.frame for the usmap::plot_usmap function
    Rate2016 <- dplyr::filter(evictiondata, eviction.rate > 0 & year > 2015)
    Rate2016 <- dplyr::mutate(Rate2016, fips = GEOID)
    usmap::plot_usmap(regions = "counties", data = Rate2016,
                      values = "eviction.rate", lines = "black") +
      labs(title = "Eviction Rates by County") +
      scale_fill_continuous(low = "lightgreen", high = "darkred",
                            guide = "colorbar", na.value = "white",
                            name = "Evictions")

What is the purpose of the dplyr::filter() function in the first line of this code? What values for eviction rates and what years are being excluded by this filter? Briefly discuss. (2 pts)

When looking at the map created above, it is difficult to see variations in the eviction rates across counties because of the extreme skewness in this data. That is, most eviction rates are fairly small, with a long right tail reaching up to almost 25%. One possible solution would be to truncate the top 10% of the distribution down to a single value, which reduces the range of the scale used by the scale_fill_continuous() function. The 90th percentile for the eviction rates is about 4; thus, we will need to reassign any eviction rate value that is greater than 4 to be 4.

Use the dplyr::mutate() function along with the ifelse() function to reassign any eviction.rate value that is greater than 4 to be 4. (3 pts)
<Paste code used to make reassignment>

Use the usmap::plot_usmap() function above to recreate the map. What areas are more easily seen on this map that were hidden on the first map? Discuss briefly. (3 pts)

For this part of your take-home exam we will consider commute times, i.e. how long it takes to get to work.
The American Community Survey is a large survey conducted by the Census Bureau, and it contains outcomes that pertain to commute times for regions across the United States.

A project completed by WNYC provided a United States map of commute times. The user can specify a zipcode and the app will provide a dialog box with the average commute time for the city. The map will also zoom in on this area. For example, the following screen-shots are for the commute time for the City of Minneapolis with zipcode = 55401.

Link:

Observations
- The average commute time for people living in zipcode = 55401 is 20.5 minutes.
- The communities north of downtown Minneapolis and St. Paul appear to have longer commute times compared to people on the south side.
- Also, as expected, the commute times for people living further away from downtown Minneapolis and St. Paul appear to be longer than for those who live closer to downtown.

Unfortunately, the data used by this app is somewhat dated, going back to ACS 2011.

Your Task: Using the most updated data available, i.e. the American Community Survey 5-year estimates data for 2011-2016, compute the average commute time for the 12 regions specified on the map, which is centered around downtown Minneapolis. To accomplish this task, I have provided two datasets: Census_CommuteTimes.csv (Dataset #1) and GPS_USCities.csv (Dataset #2).

Data Set #1
Filename: Census_CommuteTimes.csv
Source: , S0801_C01_046E,S0802_C04_090E&for=place:*

Data Set #2
Filename: GPS_USCities.csv
Source:

Dataset #1 contains the following information on about 30,000 cities, towns, and CDPs (census-designated places) across the United States:
- Location: City, State of residents
- Population: Population of Location
- CommuteTime: commute time (one-way) of workers who live in this location
- CommuteTimePT: commute time (one-way) for those using public transportation. Note: Many NAs, as public transportation is not available in many locations, e.g. rural areas
- State ID & Location ID: ID information provided by the Census Bureau's API for this particular location

Dataset #2 contains the following relevant information (variables not mentioned here can be ignored) on about 36,000 cities across the United States:
- City: City of residents
- StateName: State of residents
- Latitude: Latitude measurement for this city
- Longitude: Longitude measurement for this city

Use the following definitions in completing this task:
- Use the approximate GPS location of Latitude = 44.96 and Longitude = -93.27 for downtown Minneapolis.
- Quadrants are to be defined as
    Quadrant I: NE of downtown Minneapolis
    Quadrant II: NW of downtown Minneapolis
    Quadrant III: SW of downtown Minneapolis
    Quadrant IV: SE of downtown Minneapolis
- Distance bands are to be defined as
    Inner circle: less than 10 miles
    Middle circle: between 10 and 25 miles
    Outer circle: between 25 and 50 miles

To begin, read in Dataset #1 and Dataset #2 using the Import Dataset feature of RStudio. Run the following command on the Census_CommuteTimes dataset.

    Census_CommuteTimes2 <- dplyr::filter(Census_CommuteTimes, !grepl('CDP', Location))

What is this command doing? Discuss. (2 pts)

How many locations were removed from the Census_CommuteTimes data.frame by this command? (2 pts)

Run the following command on the GPS_USCities dataset.

    GPS_USCities2 <- dplyr::mutate(GPS_USCities, Location = paste(City, StateName, sep = ", "))

What is this command doing? (2 pts)

Why must this command be run before a join can be done between the two datasets? (2 pts)

Use the dplyr::left_join function to put the Latitude and Longitude columns from GPS_USCities2 into the Census_CommuteTimes2 data.frame. I called the data.frame after the join FinalData.

What type of check can be done to ensure the left_join was successful? Explain. (2 pts)

Provide the output used to verify that the left_join was successful. (2 pts)

Next, install the geosphere package in R. Load this library so that you can use its functions.
Run the following command in R.

    FinalData2 <- FinalData %>%
      rowwise() %>%
      mutate(Distance = distHaversine(c(Longitude, Latitude), c(-93.27, 44.96)))

The distHaversine() function is part of the geosphere library. What is the purpose of this function? (2 pts)

Why is c(-93.27, 44.96) being used here? (2 pts)

A distance is being calculated by the distHaversine() function. What are the units (e.g. feet, miles, m, km, etc.) of the value returned by this function? (2 pts)

Use the dplyr::filter function to get only locations whose distance is less than 50 miles from downtown Minneapolis. I called the resulting data.frame after the filter FinalData3.

How many locations are within 50 miles of downtown Minneapolis? (2 pts)

What type of check can be done to ensure the filter was successful? Explain. (2 pts)

Provide the output used to verify that the filter was successful. (1 pt)

Consider the following use of the case_when function from the dplyr suite of functions.

    FinalData4 <- dplyr::mutate(FinalData3,
      DistanceBand = case_when(
        Distance_Miles < 10 ~ "Inner_Circle",
        Distance_Miles > 10 & Distance_Miles < 25 ~ "Middle_Circle",
        Distance_Miles > 25 ~ "Outer_Circle"))

What type of check can be done to ensure the mutate was successful? Explain. (2 pts)

Provide the output used to verify that the mutate was successful. (2 pts)

Use a similar case_when function to create a new variable, say Quadrant, to specify the quadrant in which each location resides. (2 pts)
<Paste code for creating Quadrant variable here>

What type of check can be done to ensure the mutate was successful? Explain. (2 pts)

Provide the output used to verify that the mutate was successful. (1 pt)

After the variables DistanceBand and Quadrant have been successfully added to the data.frame, use the group_by() function along with the summarise() function to compute the summaries needed for the following table. (4 pts)

    Region | Total Population | Average Commute Time | Average Commute Time via Public Transportation
    1      |                  |                      |
    2      |                  |                      |
    3      |                  |                      |
    4      |                  |                      |
    5      |                  |                      |
    6      |                  |                      |
    7      |                  |                      |
    8      |                  |                      |
    9      |                  |                      |
    10     |                  |                      |
    11     |                  |                      |
    12     |                  |                      |

Copy and paste all the R code you used to complete this problem.
<paste your R script file in this box>

Provide a recommendation for which region surrounding the Twin Cities is most in need of policies/actions to reduce its commute time. Your recommendation should not be based solely on the region with the maximum commute time. For example, how might population or availability of public transportation affect your recommendation? (3 pts)

Provide a recommendation for which region surrounding the Twin Cities is most in need of policies/actions to reduce its commute time by way of public transportation. Again, your recommendation should consider other relevant factors. (3 pts)

The following map is of the Existing and Planned Transitways for the Metro, the Twin Cities train system. Notice that two train lines are being planned for Quadrant III. Assume that metro trains are meant to serve those in the innermost circle, i.e. less than 10 miles from downtown Minneapolis and St. Paul. Does our data support the notion that Quadrant III has the most need for policies/actions regarding additional train lines compared to other quadrants? Discuss. (2 pts)

Map Source:
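
Note: The group_by() and summarise() pattern asked for in the 12-region table can be sketched on made-up data. This is only an illustration under stated assumptions. The data.frame toy, its values, and the output names region_summary, TotalPopulation, AvgCommute and AvgCommutePT are all hypothetical; only the column names Population, CommuteTime, CommuteTimePT, DistanceBand and Quadrant come from the exam. The sketch also assumes a simple unweighted mean across locations, which may not be the intended summary.

```r
library(dplyr)

# Hypothetical toy rows standing in for FinalData4 (one row per location).
toy <- data.frame(
  Location      = c("A", "B", "C", "D"),
  Population    = c(1000, 3000, 500, 1500),
  CommuteTime   = c(20, 30, 25, 35),
  CommuteTimePT = c(40, NA, NA, 50),
  DistanceBand  = c("Inner_Circle", "Inner_Circle",
                    "Middle_Circle", "Middle_Circle"),
  Quadrant      = c("I", "I", "II", "II")
)

# One summary row per Quadrant x DistanceBand combination.
region_summary <- toy %>%
  group_by(Quadrant, DistanceBand) %>%
  summarise(
    TotalPopulation = sum(Population),
    AvgCommute      = mean(CommuteTime),
    # na.rm = TRUE skips locations with no public-transportation estimate
    AvgCommutePT    = mean(CommuteTimePT, na.rm = TRUE),
    .groups = "drop"
  )

print(region_summary)
```

An unweighted mean treats a town of 500 the same as a city of 3,000. If a per-resident average is wanted instead, weighted.mean(CommuteTime, Population) inside summarise() is one alternative.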