Fort Lewis College



BA 355: Short Assignment 6 – Zillow Big Data Set

In the accompanying data set you will find a file with 269 rows of data with information about houses in Durango. I cleaned up the formatting, but it’s still a bit of a hot mess in a number of ways. In this assignment, I want you to think about how you would clean the data. I was going to make you actually do it, but it’s a tough slog and I want us all to have the same data for Case 3.2, so I have done it for you.

1) Take a look at the data and see if you can spot some potential errors or issues. What are they? The cells I have highlighted for you should point you in the right direction, but I might’ve missed a potential problem too.

2) Note the price in cell B11 (highlighted in yellow) is exactly $750,000. Students were supposed to record the Zillow estimate (the Zestimate), but this is clearly a case where the house is for sale and that is the asking price, not the Zestimate. Sellers almost always ask more than the Zestimate, so this data point is invalid. Describe a way to find (and then eliminate) all of these invalid data points. [Hint: I used Conditional Formatting and then “contains.”]

3) Note the addresses in cells A4 and A15 (highlighted in blue) are repeats. This is no one’s fault, just a byproduct of how we created the data set. Describe a way to find (and then eliminate) all of these repeats. Be specific, and take one step beyond the obvious first step.

4) Note the ages of the houses in cells F11 and F12 (highlighted in orange) are vastly different. Some students reported the age of the house, others reported the year it was built (unless we have some houses in Durango that Jesus Christ himself worked on; he was a carpenter, after all). Describe a way to find and fix all of the data in this column, turning any points with the year built into the age of the house. And don’t expect just to do each one by hand; I want a system that will do them all in a couple of steps. [Hint: I used an =IF function.]

5) Note the address in cell A45 (highlighted in green) is Unit 203. This is clearly an apartment, not a house. While there’s nothing wrong with buying an apartment, our data set was only supposed to include houses. I can’t figure out how to eliminate these data points since I can’t find a consistent label for them: some are “unit,” others “suite,” etc. So, I’ve left them in the final data set that you’ll get next. Still, describe a way to identify (and then eliminate) these data points that are not houses systematically. Sure, you could eyeball each one and then decide, but this would not work if you had 110 million of these like Zillow does. Is there a systematic way to separate the apartments from the houses?
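For anyone who wants to see the same ideas outside a spreadsheet, here is a minimal Python sketch of the four cleaning steps in questions 2 through 5. It is one possible approach under stated assumptions, not the method used to build the class data set. The column names, the sample rows, the "multiple of $1,000" rule for spotting asking prices, the cutoff of 1000 for telling a year from an age, and the list of apartment keywords are all illustrative assumptions.

```python
import re

# Hypothetical sample rows. Names and values are made up for illustration;
# they are not taken from the class data set.
rows = [
    {"address": "12 Main Ave", "price": 750000, "age": 1998},
    {"address": "48 Oak St", "price": 431262, "age": 24},
    {"address": "12 main ave ", "price": 750000, "age": 1998},
    {"address": "301 River Rd Unit 203", "price": 289417, "age": 2005},
    {"address": "77 Pine Ct", "price": 512903, "age": 1963},
]

# Assumed keyword list; real addresses may use labels not covered here.
UNIT_PATTERN = re.compile(r"\b(unit|suite|ste|apt|apartment)\b|#", re.IGNORECASE)


def looks_like_asking_price(price, step=1000):
    """Flag suspiciously round prices (assumed rule: exact multiples of step)."""
    return price % step == 0


def normalize_address(address):
    """Lowercase and collapse whitespace so near-identical repeats match."""
    return " ".join(address.lower().split())


def to_age(value, current_year, cutoff=1000):
    """Treat values above the cutoff as a year built and convert to an age."""
    return current_year - value if value > cutoff else value


def looks_like_apartment(address):
    """Return True when the address contains one of the assumed unit labels."""
    return bool(UNIT_PATTERN.search(address))


def clean(rows, current_year):
    """Drop repeats, round prices, and apartments; convert years to ages."""
    seen = set()
    cleaned = []
    for row in rows:
        key = normalize_address(row["address"])
        if key in seen:
            continue  # repeat of an address already processed
        seen.add(key)
        if looks_like_asking_price(row["price"]):
            continue  # probably an asking price, not a Zestimate
        if looks_like_apartment(row["address"]):
            continue  # unit, suite, etc.
        cleaned.append({**row, "age": to_age(row["age"], current_year)})
    return cleaned
```

Calling `clean(rows, 2024)` on the sample above keeps only the Oak St and Pine Ct rows, with the Pine Ct age converted from 1963 to 61. Each rule is a heuristic: a genuine Zestimate can land on a round number by chance, and a keyword list will miss any label it does not include, so flagged rows are worth a quick look before deleting them.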