Www.statsclass.org



DSCI 210: Final Take-homeName(s): ______________________________Spring 2019Points: 75 ______________________________For this problem, we will investigate the effect of weather on bike rentals for New York City. To begin, download CitiBike system data for May, June, July, August, and September 2018 Data: Download each of the requested datasets. Extract the zip files and read each one into JMP. Next, use Tables > Concatenate to join all the tables into a single dataset. Name this dataset BikeData.Tables > ConcatenateSelect all tables to be joined into a single table. Provide a name for this new table in the Output table name.To alleviate potential memory issues on our laptops, select the variables down to the following as only these will be used for this problem.tripdurationstarttimeusertypebirthyeargenderThe tripduration variable is provided in seconds. Create a new column called TripDurationMins which contains the tripduration values in minutes. Provide a screen shot of the first few rows of the table with the new variables added. (4 pts)<Delete my screen-shot and add yours in its place>The weather information, which this data will eventually be joined against, is provided on an hour-by-hour basis. Therefore, in order to more fairly measure the effect of weather on bike rentals, we will restrict bike rentals to be less than 1 hour in length. TripDurationMins ≤60Apply the filter specified above to your data using Row > Row Selection > Select Where. Once the rows have been identified, select Tables > Subset > and specified Selected Rows. Click OK. How many rows remain? (3 pts)# Rows after filter applied: _______________Use Tables > Summary to obtain the average Trip Duration (in Minutes) for the Customers and Subscribers. Customer: One time bike rental (tourist-type)Subscriber: Monthly subscription for bike rentals (regular users of the CitiBike)<Delete my screen-shot and add yours in its place>Provide an interpretation of these averages. Which one is larger and by how much? Briefly discuss. (3 pts)Next, we will consider a summary for age of the individual who is renting the bicycle across the two user types. Create a new variable, called Age, and obtain the following summaries in JMP. (3 pts)<Delete my screen-shot and add yours in its place>It is known that on application forms for subscription services, people may not provide “valid” information. For example, in this data, some people said their birth year was 1885. For this reason, we will consider the following filter. Apply the following filter and recalculate the average age for the Customers and the Subscribers.Age ≤75How many observations remain when this filter was applied to your data? (2 pts)Did the average Age for Customers or Subscribers change much after this filter is applied? Briefly discuss. (3 pts)Next, let us consider the following which will provide a more thorough understanding of the Age distributions between the two user types. Select Tables > Summary and specify the following setup in the Summary window provided by JMP. Some edits will need to be made to this table, so deselect Link to original data table. This will produce a table of counts by Age and UserType.Next, select Graph > Graph Builder. Specify the following for this plot.Y-axis: Count, i.e. N(TripDurationMins)X-axis: Age (for people 75 years and younger) Group Y: usertypeFor graph type, select Area plot, which is about ? way across the top and is specified by the box below.There are two major issues when looking at these distributions. Each issue is discussed here and a proposed fix is provided. Issues #1: Age = 50 has a strange anomaly. Upon further investigation it was determined that 1969 was the default birth year in our database system. Fix for Issue #1: For usertype=Customer, obtain the average number of rentals for Age = 49 and Age =51, use this average to replace number of rentals for Age = 50. Do the same for usertype = Subscriber. Note: You can do this “by-hand” if you’d like – compute the average and replace the appropriate value in your table. You do *not* need to automate this process.Issue #2: There are many rentals by Subscribers than Customers. This makes it difficult to compare the Age distributions. Fix for Issue #2: Instead of plotting N(TripDurationMins), divide the count obtained by Tables > Summary by the total number of users in that category. Note: Again, you can do this in a “brute force” way. After finding the totals for each group, I used and “If-else” statement in JMP to make the appropriate percentages. Make the proposed fixes specified above and recreate the above plot. (4 pts).<Delete my screen-shot and add yours in its place>What differences exist in the Age distributions across the Customers and Subscribers? Briefly discuss. (3 pts)Let us now consider the investigation into the effect of weather on the bike rental data. There are two tables that will need to be joined for this investigation. A summary table based off the Bike Rental data and the Weather data will be joined here.Table: Bike RentalTable: WeatherWeather Data:4 Fields: PrimaryKey_DateHour, Date, Temperature, Precipitation3,672 rows (153 days x 24 hours)Before a join can be done, we must have a matching primary key across the two tables. I have created the primary key for the weather data. The format of the primary key created is Month/Day/Year-HourCreate a matching primary key for the Bike Rental table. That is, use the starttime column to create a matching primary key so that a join can be done with the weather data. Provide a screen-shot of the first few rows of your Bike Rental table with the additional primary key ments: Creating this primary key can be done however you want – using whatever software you want. I used a combination of string function and date/time functions in JMP to build my primary key.String functions used in JMP: substr() is the substring function which can be used to pull off parts of a string, concat is the concatenate function which can be used to build up strings. Note for concat() to work, each individual element must be a stringDate functions used in JMP: Functions such as Month(), Day(), Year(), Hour() could be used as well when building up this primary key.After each element was obtained, I used the following concat() function in JMP to create the primary key. The icon can be used to add elements to the concat() function.After you have built the necessary primary key, provide a screen-shot of the first few rows of the updated Bike Rental data in JMP. (7 pts)<Delete my screen-shot and add yours in its place>Create one additional column in Bike Rental data, called DayofWeek. Use the DayofWeek() function in JMP to create this column. Again, provide a screen-shot of the first few rows of the updated Bike Rental table. (3 pts)<Delete my screen-shot and add yours in its place>Next, we must create a Bike Rental Count table that contains the number of bike rentals for each day-hour combination. In order to make the necessary comparisons, the counts must be separated by usertype. This field should be placed in the Subgroup box in JMP as shown here. Including DayofWeek in the Group box will ensure that this field is kept in the Bike Rental Count table. The resulting table should have (at least) the columns shown here. There should be 3,672 rows in this table so that a simple join can be done with the weather data.Finally, we will complete the necessary JOIN in JMP. From the Bike Rental Count table, select Table> Join. In the upper-left box, select the dataset to be joined, i.e. Weather. From the Source columns, select the two primary keys that will be used to complete the matching for the join. Place these fields in the Match box as shown here. Click OK.After a successful join has been obtained, provide a screen-shot the Columns and Rows boxes from JMP. (4 pts)<Delete my screen-shot and add yours in its place>After the two tables have been successfully joined, create the following new fields. Provided a screen-shot of the first few rows of the resulting table. Note: I have done a SELECT to retain only the variables needed for further analysis. (2 pts)Below60: Yes:Temperature<60No:Temperature ≥60Rain: Yes:Precipitation ≥0No:Precipitation=0Hour_of_Day: The hour of the day, 0, 1, …, 23<Delete my screen-shot and add yours in its place>Consider thee following “heat-map” summary that was done in Excel. The Conditional Formatting feature in Excel that allows us to easily make these types of displays. This summary has the following features:Subscribers / Customers have been divided from top / bottom. Above60 and Below 60 have been divided from left / rightRows in each heat-map indicate the hour of the dayColumns in each heat-map indicate the day of the week (Monday – Friday, Saturday, and Sunday).You do *not* need to make this display. You should make a display similar to this with for Rain = No and Rain = Yes.The following steps were followed in JMP / Excel to obtain this heat-map summary.Step 1: Obtain the summaries using Table > Summary for each group. The SUM() function is being specified in the Statistics box as the total number of rentals is desired summary.Getting summaries for CustomersGetting summaries for CustomersStep 2: Save the tables produced by JMP above into two separate CSV files. Open these files in Excel. Created a table that has an appearance close to the provided here.Step 3: Highlight the cells in one quadrant of this display. Select Conditional Formatting > Color Scales and select a reasonable color scale. I choose yellow / green gradient scale. Repeat this process for the cells in each quadtrant. Step 4: The final display. Notice that my display has some missing information due to the fact that temperatures did not reach below 60 during these times.Create a four quandrant heat-map summary for Subscribers vs Customers and Rain = No vs Rain = Yes. Provide a screen-shot of your final display. (7 points).Answer the following questions using your display. (3 pts each)For what hours are bike rentals the highest for Customers? Is this the same for Subscribers? Briefly discuss. How does the heat-map differ between a weekday vs weekend? Briefly discuss.Does rain appear to have an effect on bike rentals? Briefly discuss. For this problem, we will consider data regarding the delay of flights. The Bureau of Transportation Statistics website provides data on flight delays. This data includes several variables -- we will restrict our investigation to the following. You can select data for any month/year you’d like. Reporting_AirlineCancelledOriginCarrierDelayDest (Destination)WeatherDelay Data Source: In what month is your birthday? __________________________________Download data for the month of your birthday for 2018 using the link provided. Unzip this file and read it into R.Consider the following code. Note: The summarise() function is computing a count of the number of flights and a count of the number of delays due to the carrier (the carrier delay count must ignore NA’s and the sum( !is.na() ) can be used to trick R into providing this count. Run this code for your data. Provide a screen-shot the output from your data. (3 pts)<Delete my screen-shot and add yours in its place>What is the purpose of Line 9 in this code? (2 pts)I have generically labeled a Var A. What is being calculated here? Provide an interpretation of one of the Var A values in your table. (3 pts) Next, consider the following code. Change this code so that instead of getting a summary for Carrier Delay’s, the code obtains a summary for weather delay’s. Provide a copy of your code and the resulting output. (3 pts)Remove the “#” character, i.e. comment line character, from Line 18 and Line 19. Note: You may have to specify slightly different values instead of 10 and 0.45. What is the purpose of Line 18 in this code? Why might it be appropriate to apply a filter of this type? Briefly dicsuss. (3 pts)Does the order of a filter matter? That is, suppose Line 19 and Line 18 were reveresed. Would this produce a different set of output? Briefly discuss. (4 pts) ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download