Winona



DSCI 325: Midterm – TakehomeName(s): ______________________________________Spring 2018Points: 125 ______________________________________ ______________________________________Consider the TASK provided on page 14 of Handout #7. You are to complete this task using the programming strategies presented in Handout #7. In particular, you must use steam processing methods to accomplish this task.TASKThe task for this handout is to compute the total number of services provided BY place of service. In this dataset, there are two distinct places of services: F = Facility and O = Non-facility, i.e. Office type.The variable descriptions for the variables to be used for this task are provided here and are from the data dictionary for this dataset.Place of Service:Number of Services Provided:Complete the following tasks.Modify the RETAIN statement(s) so that total number of services across all providers is computed for each place of service outcome, i.e. a total count for Place of Service = F and a total count for Place of Service = O. (10 pts)Notes: Your computational step will likely require the use of an IF statement. You should begin by reading in only the first few rows to verify your computational process. Turn In: 1) Code used in the DATA STEP to accomplish the desired task, 2) A screen-shot or two that clearly shows that your stream processing steps are working as intended for the first few rowsNext modify your code so that a larger set of rows are read-in to once again verify your streaming process. (5 pts)Turn In: 1) Code used in the DATA STEP to accomplish the desired task and 2) A screen-shot or two that clearly shows that your stream processing is working on this larger set of rowsFinally, modify your code one last time to steam the entire dataset to compute the desired summaries for this problem. What is the total number of services for Facilities and Non-Facilities? (5 pts)Turn In: 1) Code used in the DATA STEP to accomplish the desired task and 2) A screen-shot that clearly shows the desired summaries from this stream processing taskConsider the Walmart locations dataset provide by the website poi- . This data includes locations of Walmart stores and various subsidiaries of Walmart -- i.e. Walmart, Walmart Supercenter, Sam’s Club, Murphy USA, Walmart Neighborhood Markets, etc. For the sake of our investigation here, all these subsidiaries are considered a “Walmart” location. You can obtain a copy of this dataset from the link below.Dataset: Consider the following SAS code.Line 1 specifies that the file to be read-in is located on the webserver and provides the url of the file locationLines 3-7 provides code to read-in the Longitude and Latitude for each Walmart location in this datasetfilename Walmart url '';23 DATA Walmart_All;4 INFILE Walmart delimiter=', ';5 INPUT Longitude :10. Latitude :10.;6 7 RUN;Use the code above to read-in the file directly from the webserver. Run PROC CONTENTS on the file read-in. (2 pts)Turn In: A screen-shot of the top portion of the PROC CONTENTS output.Modify the above code so that only stores whose location is along an interstate are read-in. To operationalize this, you should read-in only records whose address contains an I- somewhere, e.g. records 1, 3, 11, 16, etc. (5 pts)Turn In: 1) A copy of your DATA STEP code, 2) screen-shot of top portion of PROC CONTENTS output.Modify the code provided in Problem 2 to now only read-in Walmart locations along I-90. Having I-90 in the address anywhere suffices for being along I-90 (see note below). (4 pts)Note: The addresses have variations regarding stores along I-90, e.g. record 1 below has an address along I-90; whereas, record 3347 has an address along I-90/94 – the reason being that these two interstates are one-in-the-same between Madison, WI, and Tomah, WI. You should include both of these variations when filtering on I-90.Turn In: 1) A copy of your DATA STEP code, 2) screen-shot of top portion of PROC CONTENTS output.Provide values for the following – you can use the output from PROC CONTENTS to determine these values, you do *not* need to determine these values via programming. (3 pts)How many Walmart locations are there provided in this dataset? ____________How many Walmart locations are there whose locations are along I-90? _____________What percent of the Walmart locations are along I-90? ___________Reconsider the code you provided in Problem 2. Several of the records have issues related to the interstate designation – i.e. some records just have I- with no indication of interstate, one record has I-95Exit instead of I-95, yet others have I-city instead of a number to indicate the interstate. Clean-up these irregularities. In particular, fix as many of these issues as possible If reasonable fixes cannot be made, then exclude the record from consideration. You should use the address and Google Maps (or the Store ID # and the relevant url for that location) when considering how to make fixes. (10 pts)For example, record 686 has a missing number for the interstate. Using the URL provided and Google maps, I determined that this Walmart Neighborhood Market with address 2141 Dave Ave, Roanoke, VA is a considerable distance off of I-585. Thus, I would exclude this observation from consideration when reading in this data.Walmart Store Finder for Store #3614: Turn In: 1) A copy of your DATA STEP code, 2) screen-shot of top portion of PROC CONTENTS output, 3) a brief description for each record that was either fixed or deleted in the table below.Fill out the following table for each record that you either fixed or removed from consideration.Provide a description of the fix or rational for removal for each record addressed in your code aboveStore IDFix / Rational for removal3614This location was determined to be too far off of the interstate (I-581) and thus will not be read-in.Suppose your organization likes to build new stores along interstates and next to Walmart locations whenever possible. Your boss asks for the following: Which 5 interstates have the most number of Walmart locations?The Store IDs of the newest Walmart locations that resides along the nation’s 10 busiest interstates: I-95, I-405, I-5, I-90, I-15, I-110, I-45, I-10, I-80, and I-85. (Newest stores will have the largest Store ID values).On the surface this may seem like a simple aggregation task – simply obtain a count by interstate and then sort this table. These counts be obtained using PROC REPORT, SQL, or what I’d do is export the data and obtain the counts in JMP or Excel. However, there is a catch as some Walmart location have addresses at intersections of two interstates. For example, Store ID #4502 is next to I-64 and I-95. This is represented in the address by I-64/95. When determining a count, this store should be included in the count for I-64 and included in the count for I-95.Use DATA STEP in SAS to convert Dataset A into Dataset B. Dataset A represents the data read-in in Problem 5 and Dataset B represents a copy of Dataset A with a second record anytime a Walmart location address includes two interstates. (8 pts)The key component to making this work in SAS’s DATA STEP is the OUTPUT command. OUTPUT writes a record out to the dataset. To the right is the structure I used to accomplish this task.Check if at an intersection. If not, then recode Interstate to Interstate_Separated and identify Duplicate as N.If address has two interstates, then create one record for 1st interstate with first output statement, and then create a second record for second interstate with second output statement. Turn In: 1) A copy of your DATA STEP code and 2) screen-shot of top portion of PROC CONTENTS output for Dataset B.After obtained a cleaned version of the data from Problem 6, use whatever program (SQL, JMP, Excel, etc.) you’d like to answer the following for your boss. You can use PROC EXPORT to export the file out as a CSV file. (3 pts each).Which 5 interstates have the most number of Walmart locations? Turn In: A list of 5 interstates with most number of Walmart locations. The Store IDs of the newest Walmart locations that resides along the nation’s 10 busiest interstates: I-95, I-405, I-5, I-90, I-15, I-110, I-45, I-10, I-80, and I-85. (Newest stores will have the largest Store ID values).Turn In: A list of the largest Store IDs that reside along the nation’s busiest interstates. This list should be determined after interstates have been separated -- a new Walmart location built at an intersection of two interstates should have the opportunity to be included. For this portion of the take home exam, we will continue working with the NursingHomeCompare database that we used in class. We will begin by considering the ProviderInfo table from this database.Download the PROC IMPORT and SQL code provided on the course website for this problem. You should run this code in progression while answer the following questions. Code: Consider CODE CHUNK #1. Why is this code chunk useful? Discuss. (2 pts) Consider CODE CHUNK #2. What is the purpose of this code? Discuss. (2 pts)Consider CODE CHUNK #3. Why is this code useful? Discuss. (2 pts)Write a brief SQL Statement that would return a table with the PROVNUM, PROVNAME, and ZIP of all nursing homes within a particular zip code. (3 pts)Turn In: A copy of SQL code and a screen-shot of returned table. Write a brief SQL Statement that would return a table with the PROVNUM, PROVNAME, and ZIP of all nursing homes within one of two different zip codes. (3 pts)Turn In: A copy of SQL code and a screen-shot of returned table. Consider CODE CHUNK #4. What is the purpose of this code? Discuss. (3 pts)Let us take a small break from data management and consider the following Table 1 from a publication provided by the Centers for Medicare and Medicaid Services (CMS). This table specifies that the minimum standard for of “Total Nurse Hours per Resident Day” is 4.10. Our table has two fields related to this metric for each nursing home. We will only consider the actual value in our investigation here.TOTHRD: Actual value for Reported Total Nurse Staffing – Hours per Resident Per Dayexp_total: Expected value for Reported Total Nurse Staffing – Hours per Resident Per DayThe corresponding field for this value in our data table is TOTHRD.Source: Consider CODE CHUNK #5. What is the purpose of this code? Discuss. (3 pts)Consider CODE CHUNK #6. What is the purpose of this code? Discuss. (3 pts)Lines 78- 82 in the above code chunk is using the CASE statement. To the best of your ability, describe the purpose of this CASE statement. Discuss. (4 pts)Consider CODE CHUNK #7. Review this code and the table returned by this code. Consider one row in this table. Carefully interpret the meaning of the values provided in this table. (4 pts)Consider CODE CHUNK #8. Review this code. What is this code doing different than CODE CHUNK #7? Discuss. (5 pts)The ORDER BY statement can be added to sort a table that is returned by and SQL statement. For example, consider Line 93 in the CODE CHUNK #7 which sorts the table by Exceeds_Percent in descending order. Purpose a ORDER BY Statement for CODE CHUNK #8 that would make it easier to identify which states have a large proportion of nursing homes whose actual staffing levels exceed their expected staffing levels. (2 pts)Code for Line 105: ORDER BY ________________________________________;Next, we will consider the most difficult data management task for preparing this data.Table 1: Contains information about staffing, e.g. Actual and Expected HRD values, etc.Table 2: Contains information about the relationship between SSA values (which are provided in the Staffing dataset) and FIPS (which are provided in Table 3).Table 3: Information obtained from the US Census Bureau regarding the Elderly Population in the United States. Example API Pull from Census Bureau: :*Table 1: Staffing TablePrimary Key = PROVNUMTable 2Table 3: Census TablePrimary Key = FIPSKey Issues before joining tables:fipscounty in Table 2 is a string; whereas, FIPS in Table 3 is numeric (due to PROC IMPORT). Data types will need to match before joiningTable 2 contains ssacounty; however, Table 1 version of COUNTY_SSA is different. Luckily, the first two digits of PROVNUM correspond to the SSA state value. When joining information from Table 2/3 to Table 1, will need to use substring to grab 1st two digits of PROVNUM and concatenate these against COUNTY_SSA in order to match ssacounty from Table 2.Consider CODE CHUNK #9. What is the purpose of this code? Discuss. (3 pts)Consider CODE CHUNK #10. The code will join together two tables – Table A: CensusElderly and Table B: SSA_FIPS_Crosswalk. What is the purpose of the ON statement in Line 185? Discuss. (3 pts)Consider CODE CHUNK #11. What is the purpose of this code? Discuss. (3 pts)Consider CODE CHUNK #12. What is the purpose of this code? Discuss. (3 pts)Just about done --- we now get to reap the rewards of all our data management of these tables and conduct some clever analyses.Consider the following metrics that were used to create the visualization provided below.StaffingRatio=Actual_HRDExpected_HRD and ElderlyRatio=Population Over 65Total PopulationStaffingRatio is simply the ratio of actual staffing levels to the expected. If this value is 2; then actual staffing levels are 2 times higher than expected. A large StaffingRatio value may indicate a nursing homes ability to take on more residents without additional staff. ElderlyRatio is simply the percentage of people in a county whose age is over 65. Some counties have a much larger proportion of people over 65 than others. Counties with a large ratio value here can expect additional strain on elderly care in the coming years.The analysis goal here is fairly straight forward – identify the nursing homes in MN who are in the targeted region on the graph below. The boundaries of this region are given by Staffing Ratio < 1 (already understaffed) and ElderlyRatio above the 0.20. The 90th percentile for ElderlyRatio is 0.20 which means only 10% of the counties have a ElderlyRatio larger than 0.20. Distribution of ElderlyRatioUse SQL to answer the following questions. Before proceeding, you will need to create the StaffingRatio and ElderlyRatio quantities for each nursing home in your table. When creating these variables make sure and specify them to be numeric and make sure and allocated enough memory space for each variable, e.g. I used the statement ADD StaffingRatio NUM FORMAT = 20.5 when adding StaffingRatio to my table. Using SQL, determine how many nursing homes are within the identified region on the scatterplot above. (3 pts)Turn In: A copy of SQL code and a screen-shot of the returned table with the count. Using SQL, determine the percent of nursing homes that are within the identified region. (4 pts)Turn In: A copy of SQL code and a screen-shot of returned table. I live in Rushford which is in Fillmore County, MN with a zip code of 55971. We have one nursing home in Rushford. Is this nursing home in the identified region? (2 pts)Turn In: A copy of SQL code and a screen-shot of returned table. My parents live in Spring Valley which is also in Fillmore County, MN. Is the Spring Valley Care Center in the identified region of this scatterplot? (2 pts)Turn In: A copy of SQL code and a screen-shot of returned table. Which nursing home, the one in Rushford or the one in Spring Valley, is better off in terms of their future staffing potential? Discuss. (3 pts)Suppose the State of MN is willing to provide substantial funding for building a new nursing home in MN somewhere. Where would you suggest they build it? Provide rationale based on the analysis conducted here. (5 pts) ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download

To fulfill the demand for quickly locating and searching documents.

It is intelligent file search solution for home and business.

Literature Lottery

Related searches