DSCI 425 – Supervised Learning
Assignment 1 – Multiple Linear Regression (105 points)

Problem 1 – The Boston Housing Data

The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of a property is determined by structural attributes (such as size, age, and condition) as well as neighborhood attributes (such as crime rate, accessibility, and environmental factors). This type of approach is often used to quantify the effects of environmental factors on the price of a property.

Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below.

Variables Used in the Harrison-Rubinfeld Housing Value Equation

| Variable | Type | Definition | Source |
|---|---|---|---|
| CMEDV | Dependent Variable (Y) | Median value of homes in thousands of dollars | 1970 U.S. Census |
| RM | Structural | Average number of rooms | 1970 U.S. Census |
| AGE | Structural | % of units built prior to 1940 | 1970 U.S. Census |
| B | Neighborhood | % of population that is black | 1970 U.S. Census |
| LSTAT | Neighborhood | % of population that is lower socioeconomic status | 1970 U.S. Census |
| CRIM | Neighborhood | Crime rate measure | FBI (1970) |
| ZN | Neighborhood | % of residential land zoned for lots greater than 25,000 sq. ft. | Metro Area Planning Commission (1972) |
| INDUS | Neighborhood | % of non-retail business acres (proxy for industry) | Mass. Dept. of Commerce & Development (1965) |
| TAX | Neighborhood | Property tax rate | Mass. Taxpayers Foundation (1970) |
| PTRATIO | Neighborhood | Pupil-teacher ratio | Mass. Dept. of Ed ('71-'72) |
| CHAS | Neighborhood | Dummy variable indicating proximity to Charles River (1 = on river) | 1970 U.S. Census Tract maps |
| DIS | Accessibility | Weighted distances to major employment centers in area | Schnare dissertation (Unpublished, 1973) |
| RAD | Accessibility | Index of accessibility to radial highways | MIT Boston Project |
| NOX | Air Pollution | Nitrogen oxide concentrations (pphm) | TASSIM |

Reference
Harrison, D., and Rubinfeld, D. L., "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5 (1978), 81-102.

Develop a regression model for predicting CMEDV using the available predictors in the table above. Note that all variables are numeric with the exception of CHAS, which is an indicator/dummy variable indicating whether or not the census tract is located along the Charles River in Boston. The file Boston.csv on my website can be read into R as shown in the handouts.

    > Boston = read.table(file.choose(),header=T,sep=",")
    > Boston$CHAS = as.factor(Boston$CHAS)   # because CHAS is 0/1 coded, you do not have to do this
    > bos.lm = lm(CMEDV~.,data=Boston)

Your analysis should be thorough! Document the model development process by copying and pasting relevant R commands, output, and graphics into your write-up.

Grading Rubric (50 points)

1. In this part of your analysis you will fit a simple MLR model to these data without trying to address any model deficiencies.
   a. Fit a base model and discuss any deficiencies (but don't try to fix them). (5 pts.)
   b. Stepwise reduction of the base model and discussion of the final model. (5 pts.)
   c. Use cross-validation methods to estimate the prediction error of this model using the split-sample, k-fold, and .632 bootstrap approaches. (10 pts.)

2. In this part of your analysis you will develop an MLR model that addresses any deficiencies you identified in part (1). Things to consider include adding higher-order terms (polynomial terms) and power transformations. In the end I would like you to compare the predictive performance of this model to the one you developed in part (1).
   a. Model development, documentation, and discussion. (15 pts.)
   b. Fitting the final model, critiquing it, and discussing any deficiencies. (5 pts.)
   c. Use cross-validation methods to estimate the prediction error of this model using the split-sample, k-fold, and .632 bootstrap approaches. All prediction measures should be for the response in the ORIGINAL scale, so you will need to back-transform your predictions in the CV process. (10 pts.)

Problem 2 – Listing Price of Homes in the Twin Cities Metro Area

These data are contained in the TC Homes (train).csv file on the website. The variable descriptions are below. TC Homes (test).csv contains homes for which I would like you to use your final model to predict the list price in the ORIGINAL scale. Whatever data torturing you do to the training data will also need to be done to the test cases.

| Variable | Info | Description |
|---|---|---|
| ListPrice | Response (Y) | Current list price ($) |
| BEDS | X1 | # of bedrooms |
| BATHS | X2 | # of bathrooms (can be fractional) |
| SQFT | X3 | Square footage of home (ft²) |
| LotSize | X4 | Square footage of lot (ft²); missing for several of the homes in these data |
| YearBuilt | X6 | Year the home was built; could be used to create a new variable called Age = 2014 - YearBuilt |
| ParkingSpots | X7 | # of parking spots (I assume off-street parking) |
| HasGarage | X8 | Garage or No (Nominal) |
| DOM | X9 | Days on the market: number of days the home has been listed for sale |
| BeenReduced | X10 | Has the price been reduced from the original listing price? Y or N (Nominal) |
| SoldPrev | X12 | Has the home been sold previously? Y or N (Nominal) |
| Latitude | X13 | Latitude (degrees) |
| Longitude | X14 | Longitude (degrees) |
| ShortSale | X15 | Is more money owed on the home than the asking price? Y or N (Nominal) |

Grading Rubric (55 points)

1. Fitting the base model, critiquing it, and discussing any deficiencies. (5 pts.)
2. Model development, documentation, and discussion. (20 pts.)
   - Consideration of assumptions
   - Possible predictor transformations
   - Stepwise procedures
3. Fitting the final model, critiquing it, interpreting it, and discussing any deficiencies. (5 pts.)
4. Cross-validation results and discussion for predicting the response in the original scale. (10 pts.)
5. Give me your predicted list price for the test cases contained in the file TC Homes (test).csv using your model. I will discuss how to do this in class. (10 pts.)
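Both problems ask for cross-validated prediction error on the ORIGINAL scale of the response. The following is a minimal sketch of k-fold cross-validation with back-transformation, assuming the final model uses a log-transformed response (one possible transformation, not a required one). The function name kfold_log_lm and the data frame sim are hypothetical, and the data are simulated so the sketch runs without Boston.csv; it is not the method from the class handouts.

```r
# Hypothetical helper: k-fold CV for an lm() fit to a log-transformed
# response, with prediction error measured on the original scale.
kfold_log_lm <- function(formula, data, response, k = 10, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  folds <- sample(rep(1:k, length.out = n))   # random fold labels 1..k
  sq_err <- numeric(n)
  for (j in 1:k) {
    test <- folds == j
    fit <- lm(formula, data = data[!test, ])          # fit on the other k-1 folds
    pred_log <- predict(fit, newdata = data[test, ])  # predictions on the log scale
    pred <- exp(pred_log)                             # back-transform to original scale
    sq_err[test] <- (data[[response]][test] - pred)^2
  }
  sqrt(mean(sq_err))   # root mean squared prediction error, original scale
}

# Simulated data only (an assumption for illustration, not the Boston data)
set.seed(425)
n <- 200
sim <- data.frame(x1 = runif(n, 4, 9), x2 = runif(n, 2, 35))
sim$y <- exp(2 + 0.15 * sim$x1 - 0.03 * sim$x2 + rnorm(n, sd = 0.2))

rmsep <- kfold_log_lm(log(y) ~ x1 + x2, data = sim, response = "y", k = 10)
rmsep
```

For a different transformation, replace exp() with the matching inverse. The same pattern of fitting on the transformed scale, back-transforming, and then computing the error applies to the split-sample and bootstrap estimates and to predictions for the test cases. Note that exp() of a log-scale prediction estimates the conditional median rather than the mean when the log-scale errors are normal.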