Statistics 215 Winter 2007 Dobrow

Regression worksheet

At the scene of the crime*

Criminal investigators often need to predict unobserved characteristics of individuals from observed characteristics. For example, if a footprint is left at the scene of a crime, how accurately can we estimate that person’s height based on the length of the footprint?

1. Identify the explanatory and response variables in this study.

Following are the self-reported heights (in inches) of students in Statistics 215 along with some summary statistics.

60 61 64 64 64 64 64 64 65 65 66 66 67 67 67 67 68 68 68 68 69

70 70 70 71 71 72 72 73 73 73 73 73 74 75 76

Min.    1st Qu.   Median   3rd Qu.   Max.    Mean    St. Dev.
60.00   65.50     68.00    72.00     76.00   68.51   4.025

Draw a stem plot for these data.

2. If you were trying to predict the height of a random statistics student at Carleton based on these data, what single value would you report?

3. How accurate would you be using such a prediction method? Fill in the blanks:

I would predict a random student’s height to be about _______ inches give or take ______ inches.

4. When you were sleeping last night, my investigators entered your room and recorded your foot length (in centimeters). Here is a scatterplot of height and foot length.

[Figure: scatterplot of height (inches) against foot length (cm)]

5. Describe the association in this scatterplot (remember type, direction, strength, clusters, outliers, etc.) Guess the correlation. Draw (very lightly) on the scatterplot what looks like the line of best fit for the data.

6. Following are the regression commands in R and the output. The regression command lm stands for linear models.

> reg = lm(heights ~ foot)
> summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max
-3.32481 -1.92841  0.02754  1.07159  3.93945

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.29380    2.55731   16.54  < 2e-16 ***
foot         0.95596    0.09243   10.34 6.92e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.984 on 33 degrees of freedom
Multiple R-Squared: 0.7642,     Adjusted R-squared: 0.7571

7. The “Multiple R-Squared” value reported in the R output is just $r^2$, the square of the correlation. What is the correlation r? Write the regression equation in context. Graph the equation carefully on the scatterplot.
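As a check on the arithmetic in question 7, here is a minimal R sketch. It takes the square root of the reported R-squared and gives it the sign of the slope. The names r_squared and slope are chosen here for illustration; the numbers are copied from the output above.

```r
# Values copied from the summary(reg) output
r_squared <- 0.7642
slope     <- 0.95596

# The correlation has the same sign as the slope,
# so take the signed square root of R-squared
r <- sign(slope) * sqrt(r_squared)
print(round(r, 4))   # 0.8742
```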

8. It can be shown mathematically that the point $(\bar{x}, \bar{y})$, the mean foot length paired with the mean height, always lies on the regression line. Find this point and thus determine the approximate mean foot length for a student of average height.
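One way to set up question 8, sketched with the rounded values reported above, is to substitute the mean height into the fitted line and solve for the mean foot length. Here $b_0$ and $b_1$ stand for the intercept and slope from the R output.

```latex
\bar{y} = b_0 + b_1\,\bar{x}
\quad\Longrightarrow\quad
\bar{x} = \frac{\bar{y} - b_0}{b_1}
        = \frac{68.51 - 42.29380}{0.95596}
        \approx 27.42 \text{ cm}
```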

9. In Chapter 8 you will find the formulas for the slope (b1) and the intercept (b0) of the regression equation. They are given in terms of five key summary statistics: r, the two means (of foot length and height), and the two standard deviations.

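$b_1 = r \, \dfrac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$,

where x is foot length, y is height, and $s_x$, $s_y$ are their standard deviations.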

Verify these formulas using the above summary statistics. You will need the mean and standard deviation for foot length, which are 27.43 and 3.68, respectively.
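The verification can be sketched in R as follows. This assumes the Chapter 8 formulas are the usual ones, slope = r times (SD of height / SD of foot length) and intercept = mean height minus slope times mean foot length. The variable names are illustrative; the numbers are the summary statistics quoted in this worksheet.

```r
# Summary statistics quoted in the worksheet
r    <- sqrt(0.7642)           # correlation; positive because the slope is positive
ybar <- 68.51; s_y <- 4.025    # height (inches)
xbar <- 27.43; s_x <- 3.68     # foot length (cm)

b1 <- r * s_y / s_x            # slope
b0 <- ybar - b1 * xbar         # intercept

print(c(intercept = b0, slope = b1))
```

Because the inputs are rounded, the results should agree with the coefficients in the R output only to about two decimal places.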

10. Use the regression equation to predict the height of a person with a 28 cm foot length. Then repeat for a person with a 29 cm foot length. Calculate the difference in these two height predictions. Does this value look familiar? Explain.
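A small sketch of the calculation in question 10. The helper predict_height is a hypothetical name introduced here, not something from the worksheet; the coefficients are copied from the regression output.

```r
# Coefficients copied from the regression output
b0 <- 42.29380
b1 <- 0.95596

# Hypothetical helper: predicted height (inches) for a foot length (cm)
predict_height <- function(foot_cm) b0 + b1 * foot_cm

h28 <- predict_height(28)
h29 <- predict_height(29)
print(c(h28 = h28, h29 = h29, difference = h29 - h28))
```

If the fitted object reg is available, predict(reg, newdata = data.frame(foot = c(28, 29))) should give the same predictions up to rounding.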

11. Interpret the slope coefficient in context.

12. Interpret the intercept term in context. Is such a prediction meaningful for these data? Explain.

13. Predict the height of someone whose foot length is 44 cm. Explain why you would not be as comfortable making this prediction as the one in (10).

The residual of an observation is the difference between the observed (actual) response and the predicted response. That is, Residual = Observed - Predicted.

14. Zach, a student in statistics, has a foot length of 25 cm and is 67 inches tall. What is Zach’s residual? In general, what is the meaning of a positive residual? Of a negative residual?
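The definition can be applied to Zach's measurements as in this sketch. The variable names are illustrative, and the coefficients are the rounded values from the regression output.

```r
# Coefficients copied from the regression output
b0 <- 42.29380
b1 <- 0.95596

observed  <- 67              # Zach's actual height (inches)
predicted <- b0 + b1 * 25    # predicted height for a 25 cm foot
residual  <- observed - predicted

print(round(residual, 2))    # 0.81
```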

15. Here is a residual plot of predicted values versus residuals.

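[Figure: two panels, the scatterplot of height against foot length with the fitted line, and the residuals plot]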

The R commands I used were

> par(mfrow=c(1,2))                  # two plots side by side
> plot(foot, heights)                # scatterplot of the data
> abline(lm(heights ~ foot))         # add the fitted regression line
> predicted = 42.29380 + .95596*foot
> residuals = heights - predicted
> zresiduals = (residuals - mean(residuals))/sd(residuals)   # standardize
> plot(foot, zresiduals, main="Residuals Plot", ylab="Standardized residuals")
> abline(h = 0)                      # reference line at zero
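R can also return the predicted values and residuals directly from the fitted model, which avoids retyping rounded coefficients. The sketch below uses simulated stand-in data, not the class data, purely so that it runs on its own. The simulation settings and the names predicted_manual and residuals_manual are assumptions made for illustration.

```r
# Simulated stand-in data (NOT the class data): 35 hypothetical students
set.seed(1)
foot    <- rnorm(35, mean = 27, sd = 3)
heights <- 42 + 0.95 * foot + rnorm(35, sd = 2)

reg <- lm(heights ~ foot)

# Manual calculation, as in the worksheet, using the stored coefficients
predicted_manual <- coef(reg)[1] + coef(reg)[2] * foot
residuals_manual <- heights - predicted_manual

# The built-in extractors fitted() and resid() give the same numbers
same_fitted <- isTRUE(all.equal(unname(fitted(reg)), unname(predicted_manual)))
same_resid  <- isTRUE(all.equal(unname(resid(reg)),  unname(residuals_manual)))
print(c(same_fitted = same_fitted, same_resid = same_resid))
```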

16. The regression output reports $r^2$ rather than r because $r^2$ has a nice interpretation: it is the proportion of the variability in the response variable that is explained by the linear model, that is, by the explanatory variable. (We’ll talk more about this.) What fraction of the variability in heights is explained by the linear regression model?
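For reference, this interpretation comes from the usual definition below. The symbols are not used elsewhere in this worksheet: $y_i$ is an observed height, $\hat{y}_i$ the height predicted by the line, and $\bar{y}$ the mean height.

```latex
r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

The numerator measures the variation left over around the regression line; the denominator measures the total variation around the mean.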

17. Do you think that the linear model is appropriate for these data? If you answered yes to this question, how precise would you say the model is for making predictions?

18. Guess-timate the standard deviation of the residuals. How much do the data points spread about the regression line? The actual standard deviation of the residuals is sd(residuals) =

19. Consider how height varies about its mean. Compare your answers in (3) and (18). Is the regression model for estimating height superior to just estimating height based on its mean and standard deviation without any foot information?
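The two measures of spread can be related using only the summary numbers reported above, as in this sketch. It assumes n = 35 students, inferred from the 33 residual degrees of freedom (n - 2) in the R output; the variable names are illustrative.

```r
s_y <- 4.025    # standard deviation of heights (inches)
r2  <- 0.7642   # Multiple R-squared from the output
n   <- 35       # inferred: 33 residual degrees of freedom = n - 2

# Spread of the residuals as sd() computes it (divisor n - 1)
sd_resid <- s_y * sqrt(1 - r2)

# Residual standard error as summary(reg) reports it (divisor n - 2)
rse <- sd_resid * sqrt((n - 1) / (n - 2))

print(round(c(sd_height = s_y, sd_resid = sd_resid, rse = rse), 3))
```

The residual spread comes out at roughly half the spread of heights about their mean, which is one way to quantify what the foot-length information adds.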

*This worksheet is based, in part, on Investigation 6.3.3 in Investigating Statistical Concepts, Applications, and Methods by Beth Chance and Allan Rossman.
