DSCI 325: R Handout 5 -- Web Scraping with R Using APIs

In this handout, we will explore scraping data from the web (in R) using APIs. The information in this handout is largely based on a tutorial prepared by Bradley Boehmke.

What is an API?

Many websites and companies allow programmatic access to their data via web APIs (API stands for application programming interface). Essentially, an API gives a web-based database a way to communicate with another program (such as R).

Each organization's API is unique, but you typically need to know the following for your specific data set of interest:

- The URL for the organization and data you are pulling
- The data set you are pulling from
- The data content (so that you can specify the variables you want the API to retrieve, you need to be familiar with the data library)

In addition, you may also need the following:

- An API key (i.e., an API token). This is typically obtained by supplying your name/email to the organization.
- OAuth. This is an authorization framework that provides credentials as proof for certain information.

The remainder of this handout introduces these topics through examples.

Example: Pulling U.S. Bureau of Labor Statistics Data

No API key or OAuth is required, but a key is recommended; you can register for one on the BLS developers website, which also lists all available BLS data sets and their series code information. In this example, we will consider pulling Local Area Unemployment Statistics. The BLS provides this breakdown of the series ID LAUCN271690000000003:

Prefix:                   LA
Seasonal Adjustment Code: U
Area Code:                CN2716900000000
Measure Code:             03

You can view the data being pulled via the BLS dataViewer website, and the BLS developers site answers some frequently asked questions.

To read this data into R from the web, you should make use of the blsAPI package, whose documentation is summarized below.

blsAPI {blsAPI}    R Documentation

Request Data from the U.S. Bureau of Labor Statistics API

Description:
Allows users to request data for one or multiple series through the U.S. Bureau of Labor Statistics API. Users provide parameters as specified in the BLS API documentation, and the function returns a JSON string or a data frame.

Usage:
blsAPI(payload = NA, api.version = 1, return.data.frame = FALSE)

Arguments:
payload            a string or a list containing data to be sent to the API.
api.version        an integer giving the API version you want to use (i.e., 1 for v1, 2 for v2).
return.data.frame  a boolean indicating whether you want the function to return JSON (the default) or a data frame. If the data frame option is used, the series ID will be added as a column; this is helpful if multiple series are selected.

R code:
install.packages("blsAPI")
library(blsAPI)

# Supply the series identifier (and your key) to pull the data
payload <- list('seriesid' = c('LAUCN271690000000003'),
                'registrationKey' = 'your registration key here')
unemployment_winona <- blsAPI(payload, api.version = 2, return.data.frame = TRUE)

Task: Use the blsAPI package to pull data from the National Employment, Hours, and Earnings data set. Pull the average weekly hours of all employees in the construction sector.
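One possible approach to this task is sketched below. It assumes the CES series ID CES2000000002 identifies average weekly hours of all employees in construction; verify this against the BLS series code documentation before relying on it.

R code:
library(blsAPI)

# Assumed series ID: CE = Current Employment Statistics, S = seasonally
# adjusted, 20000000 = construction, 02 = average weekly hours of all employees
payload <- list('seriesid' = c('CES2000000002'),
                'registrationKey' = 'your registration key here')
construction_hours <- blsAPI(payload, api.version = 2, return.data.frame = TRUE)
head(construction_hours)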
Data for a three-year period is returned by default. To pull data from a specified range of years, you can supply additional arguments:

R code:
payload <- list('seriesid' = c('LAUCN271690000000003'),
                'registrationKey' = 'your registration key here',
                'startyear' = 2010, 'endyear' = 2017)
unemployment_winona <- blsAPI(payload, 2, return.data.frame = TRUE)

Finally, note that you can also pull multiple series at a time:

R code:
payload <- list('seriesid' = c('LAUCN271690000000003', 'LAUCN271690000000004'),
                'registrationKey' = 'your registration key here',
                'startyear' = 2010, 'endyear' = 2017)
unemployment_winona_rateandnumber <- blsAPI(payload, 2, return.data.frame = TRUE)

Example: Pulling NOAA Data

The rnoaa package can be used to request data from the National Climatic Data Center (now known as the National Centers for Environmental Information) API. This package requires you to have an API key; to request one, supply your email address on the NCDC token-request page. NOAA data set descriptions are available from NCDC. The package documentation describes rnoaa as follows.

rnoaa is an R interface to NOAA climate data.

Data Sources
Many functions in this package interact with the National Climatic Data Center application programming interface (API); all of these functions start with ncdc_. An access token, or API key, is required to use all the ncdc_ functions. The key is required by NOAA, not by the package authors. More NOAA data sources are being added through time. Data sources and their function prefixes are:

- buoy_* - NOAA buoy data from the National Buoy Data Center
- gefs_* - GEFS forecast ensemble data
- ghcnd_* - GHCND daily data from NOAA
- isd_* - ISD/ISH data from NOAA
- homr_* - Historical Observing Metadata Repository (HOMR)
- ncdc_* - NOAA National Climatic Data Center (NCDC)
- seaice - Sea ice data
- storm_ - Storms (IBTrACS)
- swdi - Severe Weather Data Inventory (SWDI)
- tornadoes - Tornadoes, from the NOAA Storm Prediction Center
- argo_* - Argo buoys
- coops_search - NOAA CO-OPS tides and currents data

For example, let's start by pulling all weather stations in Winona County, MN, using Winona County's FIPS code. We will focus on the GHCND data set, which contains records of daily measurements such as maximum and minimum temperature, total daily precipitation, etc.

R code:
install.packages("rnoaa")
library(rnoaa)
stations <- ncdc_stations(datasetid = 'GHCND', locationid = 'FIPS:27169',
                          token = 'your registration key here')

To pull data from one of these stations, we need the station ID. Suppose we want to pull all available data from the "Winona, MN US" station.
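One way to locate that station's ID is to search the stations object created above. Here is a minimal sketch; it assumes (per the rnoaa documentation) that ncdc_stations() returns a list whose $data element is a data frame with id, name, mindate, and maxdate columns.

R code:
library(dplyr)

# Search the station list for the Winona station and display its ID
stations$data %>%
  filter(grepl("WINONA", name)) %>%
  select(id, name, mindate, maxdate)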
The following commands supply the data set to pull, the start and end dates (you are restricted to a one-year range), the station ID, and your key.

R code:
climate <- ncdc(datasetid = 'GHCND', startdate = '2007-01-01', enddate = '2007-12-31',
                stationid = 'GHCND:USC00219067', token = 'your registration key here')

> climate$data
                  date datatype           station value fl_m fl_q fl_so fl_t
1  2007-01-01T00:00:00     PRCP GHCND:USC00219067    41              0 1830
2  2007-01-01T00:00:00     SNOW GHCND:USC00219067    18              0
3  2007-01-01T00:00:00     SNWD GHCND:USC00219067    25              0
4  2007-01-01T00:00:00     TMAX GHCND:USC00219067    33              0 1830
5  2007-01-01T00:00:00     TMIN GHCND:USC00219067   -22              0 1830
6  2007-01-01T00:00:00     TOBS GHCND:USC00219067   -22              0 1830
7  2007-01-02T00:00:00     PRCP GHCND:USC00219067     0              0 1830
8  2007-01-02T00:00:00     SNOW GHCND:USC00219067     0              0
9  2007-01-02T00:00:00     SNWD GHCND:USC00219067     0              0
10 2007-01-02T00:00:00     TMAX GHCND:USC00219067    56              0 1830
11 2007-01-02T00:00:00     TMIN GHCND:USC00219067   -72              0 1830
12 2007-01-02T00:00:00     TOBS GHCND:USC00219067    17              0 1830
13 2007-01-03T00:00:00     PRCP GHCND:USC00219067     0              0 1830
14 2007-01-03T00:00:00     SNOW GHCND:USC00219067     0              0
15 2007-01-03T00:00:00     SNWD GHCND:USC00219067     0              0
16 2007-01-03T00:00:00     TMAX GHCND:USC00219067    78              0 1830
17 2007-01-03T00:00:00     TMIN GHCND:USC00219067     6              0 1830
18 2007-01-03T00:00:00     TOBS GHCND:USC00219067    72              0 1830
19 2007-01-04T00:00:00     PRCP GHCND:USC00219067     0              0 1830
20 2007-01-04T00:00:00     SNOW GHCND:USC00219067     0              0
21 2007-01-04T00:00:00     SNWD GHCND:USC00219067     0              0
22 2007-01-04T00:00:00     TMAX GHCND:USC00219067    78              0 1830
23 2007-01-04T00:00:00     TMIN GHCND:USC00219067    39              0 1830
24 2007-01-04T00:00:00     TOBS GHCND:USC00219067    72              0 1830
25 2007-01-05T00:00:00     PRCP GHCND:USC00219067     0              0 1830

Next, let's pull data on precipitation for 2007 (note the use of the datatypeid argument). By default, ncdc() limits the results to 25 records, but we can adjust the limit argument as shown below.

R code:
precip <- ncdc(datasetid = 'GHCND', startdate = '2007-01-01', enddate = '2007-12-31',
               limit = 365, stationid = 'GHCND:USC00219067', datatypeid = 'PRCP',
               token = 'your registration key here')

Finally, we can sort the observations to see which days in 2007 experienced the greatest rainfall.

R code:
library(dplyr)
precip.data <- precip$data
precip.data %>% arrange(desc(value))

                 date datatype           station value fl_m fl_q fl_so fl_t
1 2007-08-19T00:00:00     PRCP GHCND:USC00219067  1257              0 1830
2 2007-08-20T00:00:00     PRCP GHCND:USC00219067  1016              0 1830
3 2007-08-11T00:00:00     PRCP GHCND:USC00219067   424              0 1830
4 2007-02-24T00:00:00     PRCP GHCND:USC00219067   404              0 1830
5 2007-08-18T00:00:00     PRCP GHCND:USC00219067   396              0 1830
6 2007-09-07T00:00:00     PRCP GHCND:USC00219067   391              0 1830
7 2007-05-24T00:00:00     PRCP GHCND:USC00219067   381              0 1830
8 2007-08-14T00:00:00     PRCP GHCND:USC00219067   361              0 1830
...
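GHCND reports PRCP in tenths of a millimeter, so the top value of 1257 on August 19 corresponds to 125.7 mm of rain. The following sketch converts the units and plots the daily totals; it assumes the precip object from above and uses ggplot2 for illustration.

R code:
library(dplyr)
library(ggplot2)

# Parse the GHCND date strings and rescale PRCP from tenths of mm to mm
precip.plot <- precip$data %>%
  mutate(date = as.Date(date), prcp_mm = value / 10)

# Plot daily precipitation totals for 2007
ggplot(precip.plot, aes(x = date, y = prcp_mm)) +
  geom_col() +
  labs(x = "Date", y = "Precipitation (mm)")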
Example: Leveraging an Organization's API without an R Package

In some situations, an R package may not exist for communicating with an organization's API. Hadley Wickham has developed the httr package to make working with web APIs straightforward. It offers multiple functions, but in this example we will focus on the GET() function, which accesses an API, supplies request parameters, and retrieves the output.

Suppose we wanted to pull College Scorecard data from the Department of Education. Though an R package does in fact exist to facilitate such a data pull, we will illustrate the use of the httr package instead. Start by requesting an API key from the Department of Education and familiarizing yourself with the College Scorecard data library. Suppose we want to retrieve all information on Winona State University. You can preview the request by pasting its URL into a browser, specifying your registration key with the api_key parameter; the following R code brings the same data into R.

R code:
install.packages("httr")
library(httr)

# College Scorecard schools endpoint
URL <- "https://api.data.gov/ed/collegescorecard/v1/schools"

# Import all available data for Winona State University
wsu_request <- httr::GET(URL, query = list(api_key = 'your registration key here',
                                           school.name = "Winona State University"))

This request provides us with all information collected on Winona State University. The API sends data back to R as JSON. The hierarchical structure of JSON does not work well in R, which is best suited to rectangular data, so we use the content() function to retrieve the contents as an R object (specifically, a list).

> wsu_data <- content(wsu_request)
> names(wsu_data)
[1] "metadata" "results"

Note that the data is segmented into metadata and results; we are interested in the results.

> names(wsu_data$results[[1]])

To see what data are available, we can look at a single year:

> names(wsu_data$results[[1]]$'2015')

With such a large data set containing many embedded lists, we can explore the data by examining names at different levels:

> wsu_data$results[[1]]$'2015'$cost
> names(wsu_data$results[[1]]$'2015'$cost)
> wsu_data$results[[1]]$'2015'$cost$attendance$academic_year

Getting cost over a sequence of years:

> wsu_data$results[[1]]$'2014'$cost$attendance$academic_year
> wsu_data$results[[1]]$'2013'$cost$attendance$academic_year
> wsu_data$results[[1]]$'2012'$cost$attendance$academic_year

Getting the median cumulative ACT score for 2015:

> names(wsu_data$results[[1]]$'2015'$admissions)
[1] "sat_scores"     "admission_rate" "act_scores"
> names(wsu_data$results[[1]]$'2015'$admissions$act_scores)
[1] "midpoint"        "25th_percentile" "75th_percentile"
> names(wsu_data$results[[1]]$'2015'$admissions$act_scores$midpoint)
[1] "math"       "cumulative" "writing"    "english"
> wsu_data$results[[1]]$'2015'$admissions$act_scores$midpoint$cumulative
[1] 23

We can pull data collected on a certain variable over many years as follows:

R code:
install.packages("dplyr")
library(dplyr)

# Subset the list for annual student data only
wsu_yr <- wsu_data$results[[1]][c(as.character(2000:2015))]

# Extract the median cumulative ACT score for each year
wsu_yr %>%
  sapply(function(x) x$admissions$act_scores$midpoint$cumulative) %>%
  unlist()

# Extract the average net price for each year
wsu_yr %>%
  sapply(function(x) x$cost$avg_net_price$overall) %>%
  unlist()

# Extract the median debt of completers for each year
wsu_yr %>%
  sapply(function(x) x$aid$median_debt$completers$overall) %>%
  unlist()
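If you want those yearly values in a single rectangular object rather than a named vector, you can wrap the extraction in a small helper. The helper below (pull_years) is illustrative, not part of the handout; it assumes the wsu_yr list from above, and years with no recorded value simply drop out when the result is unlisted.

R code:
# Illustrative helper: extract one field across all years in a list like wsu_yr
pull_years <- function(lst, f) {
  vals <- unlist(sapply(lst, f))
  data.frame(year = names(vals), value = unname(vals))
}

# Median cumulative ACT scores by year, as a data frame
act_by_year <- pull_years(wsu_yr,
                          function(x) x$admissions$act_scores$midpoint$cumulative)
head(act_by_year)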
CSV Version of the File

If you wanted to obtain a CSV file instead of JSON, you can change the request URL accordingly. Once the CSV file has been downloaded via a browser, you can read it into R using Import Dataset within RStudio.

R code:
library(readr)
wsu_data <- read_csv("D:/Teaching/DSCI325/Data/wsu_data.csv")
View(wsu_data)

Notice that the CSV file is very wide (about 17,000 variables and only 1 observation). To pull off certain columns, we will use the contains() function in conjunction with the dplyr::select() function. Here, the variables associated with the overall median debt of completers are being selected.

> dplyr::select(wsu_data, contains("median_debt.completers.overall"))

The gather() function in the tidyr package can be used to transpose this row of data into a column of data.

> dplyr::select(wsu_data, contains("median_debt.completers.overall")) %>% tidyr::gather()

Finally, here is a function to pull off the cost of attendance for a particular year. Note that it applies to the list version of wsu_data returned by content() earlier, not to the CSV version just read in.

R code:
mypull <- function(year) {
  return(eval(parse(text = paste("wsu_data$results[[1]]$'", year,
                                 "'$cost$attendance$academic_year", sep = ""))))
}
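The eval(parse()) construction works, but direct list indexing does the same job more safely. A minimal sketch of an alternative, again assuming wsu_data is the list returned by content():

R code:
# Index the year's sub-list by name instead of building code as a string
mypull2 <- function(year) {
  wsu_data$results[[1]][[as.character(year)]]$cost$attendance$academic_year
}

# Example: cost of attendance for 2012 through 2015
sapply(2012:2015, mypull2)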
