Journal of Statistics Education: Vol 29, No sup1



Appendix B: Height and Shoe Size Data Set Instructor’s Manual

This appendix contains the questions and their solutions for the three exercises in the Height and Shoe Size Data Set article. Solutions have been created in Minitab version 15. Square brackets indicate areas that could be customized by local instructors.

Exercise 1: Correlation

Open the Height and Shoe Size Data Excel file. [Local instructors may use a different name for this file.] This file contains information from 408 college students. [Update this total if you add observations from your students.]

1. Create a scatter plot showing men’s shoe sizes on the horizontal axis and the associated height on the vertical axis. Examine the result. Do the points suggest that the relationship between men’s shoe size and height is appropriate for linear analysis? Explain your choice using several sentences.

[pic]

Although the points do not lie on a perfect line, there is a definite linear positive relationship that is almost “strong” as defined by the correlation “ruler” in Introduction to Statistics and Data Analysis by Peck, Olsen & Devore.

2. What is the value of the correlation coefficient for height and shoe size for the men?

The correlation coefficient for the men is 0.768 (Pearson correlation of Size for M and Height for M in the Minitab output).

3. What is the value of the correlation coefficient for height and shoe size for the women?

The correlation coefficient for the women is 0.708 (Pearson correlation of Size for F and Height for F in the Minitab output).

4. Using either the p-value method or the critical value method, conduct this hypothesis test for the significance of the correlation of the women’s heights and shoe sizes at α = 0.05. Enter your results in one of the two boxes.

H0: ρ = 0

Ha: ρ ≠ 0

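The t statistic for the correlation test is $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}$, where r is the sample correlation coefficient and n is the number of observations.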

The value of the test statistic is 13.64. There are 185 degrees of freedom.

|Using the Critical Value Method |Using the p-value Method |
|The critical value of the t statistic is 1.9729 (from Minitab, with 185 DF and α = 0.05) |The p-value is 0.000 |
|What is the conclusion and why? |What is the conclusion and why? |
|We are able to conclude that there is a statistically significant correlation between women’s heights and shoe sizes because the test statistic (13.64) is more extreme than the critical t value. |We are able to conclude that there is a statistically significant correlation between women’s heights and shoe sizes because the p-value of 0.000 is smaller than α = 0.05. |

Helpful Hint: Remind students that the very small p-value corresponds to the area in the tails of the t distribution beyond the test statistic and represents the probability of obtaining a test statistic at least this extreme when the null hypothesis is true. Because the sample did indeed yield this test statistic, the appropriate conclusion is that the data provide strong evidence against the null hypothesis. It never hurts to emphasize that there are two appropriate comparisons: the test statistic against the critical value, or the p-value against alpha.
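
Instructors who want students to verify the test statistic in question 4 outside Minitab could use a short script such as the following. It is only a sketch: it assumes the usual formula for the correlation t test, t = r·sqrt(n − 2)/sqrt(1 − r²), and uses the rounded values reported for the women (r = 0.708, n = 187, so 185 degrees of freedom). The function name correlation_t is illustrative and not part of any package.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r_women = 0.708      # Pearson correlation for the women (rounded Minitab value)
n_women = 187        # 185 degrees of freedom + 2
t_critical = 1.9729  # two-tailed critical value quoted above (185 DF, alpha = 0.05)

t_stat = correlation_t(r_women, n_women)
df = n_women - 2

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print("Reject H0" if abs(t_stat) > t_critical else "Fail to reject H0")
```

Because the inputs are rounded, the result may differ in the second decimal place from the T value that Minitab prints on the regression output for the women.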

Exercise 2: Simple Linear Regression

Open the Height and Shoe Size Data Excel file. [Local instructors may use a different name for this file.] This file contains information from 408 college students. [Update this total if you add observations from your students.]

1. Do you think it is reasonable to use shoe size to predict height, or is it more logical to use height to predict shoe size? Explain your choice using several sentences.

Students will have individual answers for this question. Encourage them to provide a reason or realistic application to support their choices.

Helpful Hint: There is really no correct answer to this question, but students should provide some rationale for their choices. For example, those who think it more important to predict shoe size might say that the distribution of heights is more widely known (many sites post such information) and a shoe store could use that information to plan its inventory. Those who would rather use shoe size to predict height might be imagining a model that is somewhat akin to pediatric growth models that use measurements of a wrist or other body part to predict adult height. Students might be swayed to use shoe size as the x (predictor) variable if they are fans of crime scene television shows. Given a shoe print left at the scene of the crime, what height can be estimated for the suspect?

2. Using all observations, create a scatter plot showing Size on the horizontal axis and the Height on the vertical axis. If your software permits, plot the points from each gender using a different color or marker. Examine the result. Do the points suggest that the relationship between Size and Height is appropriate for linear analysis when both genders are included? Explain your choice using several sentences.

[pic]

The scatter plot of all observations retains the linear form seen in the plot of the men’s data.

Potential Pitfall: Although close examination of the markers shows that Minitab sometimes retains the black circles behind the red squares, it does not indicate where there are multiple identical observations. Students should be aware that the graph does not supply information on the frequency of each size/height combination and that some observations may be hidden. Instructors may find it useful to ask students to create a joint frequency table.

3. Calculate the correlation coefficient for the height and shoe size variables and test its significance. Is the relationship between the two variables sufficiently strong to pursue a regression analysis? Compare the correlation coefficient for all observations to the correlation coefficient you found for a single gender in Exercise 1. To what can you attribute the fact that the correlation for all observations is larger?

The correlation output from Minitab is as follows:

[pic]

The correlation coefficient is highly significant and exceeds the values found when each gender was considered separately. It is reasonable to pursue the creation of a simple linear regression model.

To discover the reason for the higher correlation coefficient it is helpful for students to revisit scatter plots of the observations.

Helpful Hint: In order to make an appropriate comparison, it is extremely useful to have the same scale on both plots. You may have to help your students understand how to scale the axes to achieve this in whichever software they use.

For Women:

[pic]

For Men:

[pic]

Together:

[pic]

When both genders are included, the distribution of shoe sizes and heights is a longer, narrower cloud of points than was present for the genders alone. For beginning students, this visual analysis is intuitively reasonable.

Helpful Hint: Remember that including both genders changes the length, and therefore the relative width, of the cloud. This may not indicate anything inherently important in the relationship between height and shoe size.
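
The effect described in this hint can be demonstrated with simulated data. The sketch below does not use the Height and Shoe Size file; every number in it (group means, slope, noise level, sample sizes, random seed) is made up for illustration. Both simulated groups follow exactly the same line relating size to height and differ only in their average shoe size, so any increase in the pooled correlation comes solely from the longer cloud of points.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_group(rng, n, mean_size):
    """Synthetic (size, height) pairs; all parameters are invented for illustration."""
    sizes = [rng.gauss(mean_size, 1.2) for _ in range(n)]
    heights = [52.5 + 1.6 * s + rng.gauss(0, 2.0) for s in sizes]
    return sizes, heights

rng = random.Random(1)  # fixed seed so the run is repeatable
size_f, height_f = simulate_group(rng, 1000, mean_size=8.0)
size_m, height_m = simulate_group(rng, 1000, mean_size=11.0)

r_f = pearson_r(size_f, height_f)
r_m = pearson_r(size_m, height_m)
r_all = pearson_r(size_f + size_m, height_f + height_m)

print(f"women: {r_f:.3f}  men: {r_m:.3f}  pooled: {r_all:.3f}")
```

Students can change the gap between the two group means and watch the pooled correlation respond while the within-group correlations stay roughly the same.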

4. Using the choices for the predictor (x) and response (y) variables that you indicated in question 1 develop a regression model using all the observations. Write your regression equation here using the variable names Height and Size. Explain the meaning of the values of the intercept and regression coefficient.

The regression output is as follows:

Using Size to predict Height

Height = 50.83 + 1.7753(Size)

The regression coefficient indicates that each additional unit of shoe size adds 1.78 inches to the expected height. The intercept shows where the fitted regression line would cross the vertical axis, but the value of 50.8 has no meaning in this application because there are no observations with a shoe size of 0; the smallest shoe size is a 5. These values illustrate the dangers of extrapolating beyond the data used to create the model.

Using Height to predict Size

Size = -19.33 + 0.4273(Height)

The regression coefficient indicates that each additional inch of height adds 0.427 to the expected shoe size. The intercept shows where the fitted regression line would cross the vertical axis, but the value of -19.3 has no meaning in this application. There are no observations with a height of 0; in fact, the smallest value of Height is 60 inches. These values illustrate the dangers of extrapolating beyond the data used to create the model.

5. One of the students in the dataset is 64 inches tall and wears a size 8 shoe. What is the error associated with your model’s prediction for this student?

Using Size to predict Height

Predicted Height = 65.0324 inches

Error = 64 – 65.0324 = -1.0324 inches

Using Height to predict Size

Predicted Size = 8.0172

Error = 8 – 8.0172 = -0.0172
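
The arithmetic in question 5 can be reproduced with a few lines of code. The sketch below simply encodes the two fitted equations from question 4 and computes each error as the observed value minus the predicted value; the function names are illustrative.

```python
def predict_height(size):
    """Fitted model from question 4: Height = 50.83 + 1.7753(Size)."""
    return 50.83 + 1.7753 * size

def predict_size(height):
    """Fitted model from question 4: Size = -19.33 + 0.4273(Height)."""
    return -19.33 + 0.4273 * height

observed_height, observed_size = 64, 8  # the student described in question 5

# Error (residual) = observed value - predicted value
height_error = observed_height - predict_height(observed_size)
size_error = observed_size - predict_size(observed_height)

print(f"Predicted Height = {predict_height(observed_size):.4f}, error = {height_error:.4f}")
print(f"Predicted Size = {predict_size(observed_height):.4f}, error = {size_error:.4f}")
```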

6. From your analysis, determine whether a statistically significant relationship exists between Height and Size. Provide an explanation that supports your decision.

In both models the F statistic is over 1275 and the t statistic for the predictor (independent) variable is 35.71. Minitab reports the p-value as 0.000, which means it is smaller than 0.0005. The hypothesis that the coefficient is 0 can therefore be rejected.
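
It may help students to see why the F and t statistics lead to the same conclusion here: in a regression with a single predictor, F equals the square of the t statistic for the slope. The sketch below checks this with the rounded values from the Minitab output for Height versus Size, so small rounding differences are expected.

```python
t_slope = 35.71       # t statistic for the predictor (same in both models)
f_reported = 1275.39  # F statistic from the Analysis of Variance table

# F recomputed from the Height-versus-Size ANOVA table: MS Regression / MS Error
ms_regression = 5476.9 / 1   # SS Regression / regression DF
ms_error = 1743.5 / 406      # SS Residual Error / error DF
f_from_anova = ms_regression / ms_error

print(f"t squared    = {t_slope ** 2:.2f}")
print(f"F from ANOVA = {f_from_anova:.2f}")
print(f"F reported   = {f_reported:.2f}")
```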

7. What percentage of the variation in the response (y) is explained by the regression model?

As is typical in business statistics texts, students are asked to report the coefficient of determination as a part of their regression analysis.

|R Square |0.758533 |

Helpful Hint: Instructors may want to illustrate that the R square value that appears on the output can be calculated directly as either the ratio of SSReg/SST or as the square of the correlation coefficient (labeled by Excel as Multiple R on regression output).
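
The two calculations mentioned in this hint can be sketched as follows, using the sums of squares from the Height-versus-Size Analysis of Variance table and the rounded correlation of 0.871. The adjusted R square line uses the standard formula 1 − (1 − R²)(n − 1)/(n − k − 1), which is an addition here and is not stated in the exercise itself.

```python
ss_regression = 5476.9  # SS Regression, Height versus Size
ss_total = 7220.4       # SS Total
r = 0.871               # Pearson correlation of Size and Height (rounded)
n, k = 408, 1           # number of observations and number of predictors

r_sq_from_anova = ss_regression / ss_total
r_sq_from_r = r ** 2
r_sq_adjusted = 1 - (1 - r_sq_from_anova) * (n - 1) / (n - k - 1)

print(f"SSReg/SST         = {r_sq_from_anova:.4f}")
print(f"r squared         = {r_sq_from_r:.4f}")
print(f"adjusted R square = {r_sq_adjusted:.4f}")
```

The two versions of R square agree to about three decimal places; the small gap comes from using the correlation rounded to 0.871.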

Exercise 3: Indicator Variables

Open the Height and Shoe Size Data Excel file. [Local instructors may use a different name for this file.] This file contains information from 408 college students. [Update this total if you add observations from your students.]

1. No distinction was drawn between male and female students in Exercise 2. Rerun your model for male students alone and then for female students alone, using the same choices for dependent and independent variables that you did in Exercise 2. Are these two new models “better” than the model that combines males and females? Explain.

For Men:

For Women:

Although the coefficient in each model is significant, the R square value is higher for the model that includes both genders (Exercise #2, question 4).

Helpful Hint: In each case, the relationship between height and shoe size is significant and is positive, although not identical. As noted in Exercise 2, question 3, when both genders are included, the distribution of shoe sizes and heights is a longer, narrower cloud of points than for the two genders alone. As the correlation increases, so too does the R square value.

2. Create an indicator variable for the Gender column. You will need to decide which gender will be represented by a “1” and which by a “0.” Create a multiple regression model incorporating the indicator variable and using all the observations and examine the results. Is the gender variable statistically significant? What is the expected effect of gender? Is more variation explained in this new model?

Regression results:

Both the gender variable and the other explanatory variable are significant. In the first model, being male increases the expected height by over an inch when compared to a female with the same shoe size. In the second model, being male increases the expected shoe size by almost a full size over that for a female of the same height. The inclusion of the gender variable leads to an increase in the adjusted R square value as compared to the value for the corresponding model that does not include the gender variable.
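
To show students what the indicator variable does, the fitted equations can be written as functions. The sketch below assumes the coding used in the Minitab output (Gender_M = 1 for men, 0 for women) and assumes, hypothetically, that gender is recorded as "M" or "F"; the function names are illustrative. With this coding the model describes two parallel lines, one per gender, separated by the Gender_M coefficient.

```python
def gender_indicator(gender):
    """Code gender as 1 for male and 0 for female (the Gender_M coding)."""
    return 1 if gender == "M" else 0

def predict_height(size, gender):
    """Indicator model: Height = 52.2 + 1.58(Size) + 1.13(Gender_M)."""
    return 52.2 + 1.58 * size + 1.13 * gender_indicator(gender)

def predict_size(height, gender):
    """Indicator model: Size = -14.5 + 0.349(Height) + 0.948(Gender_M)."""
    return -14.5 + 0.349 * height + 0.948 * gender_indicator(gender)

# Compare a man and a woman with the same value of the other predictor
height_gap = predict_height(9, "M") - predict_height(9, "F")
size_gap = predict_size(68, "M") - predict_size(68, "F")

print(f"Expected height difference at the same shoe size: {height_gap:.2f} inches")
print(f"Expected shoe size difference at the same height: {size_gap:.3f}")
```

Whatever shoe size or height is plugged in, the difference between the two genders stays the same, which is exactly what a model with an indicator variable and no interaction term assumes.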

3. Use your original single variable model, your single variable gender-specific model, and your indicator variable model to predict your own measurement. Calculate the error associated with each model. Which one provided the best prediction for you? Which of these models do you believe would be best to apply to this kind of prediction in practice, the pair of gender-specific models, or the single model with gender?

There will be individual student answers for this question. To illustrate, revisit the case of the 64 inch tall student who wears a size 8 shoe. This student is a woman.

Using Size to predict Height

The single variable model is

Height = 50.83 + 1.7753(Size)

Predicted Height = 65.0324 inches

Error = 64 – 65.0324 = -1.0324 inches

The gender-specific model is

Height = 52.9 + 1.49(Size)

Predicted Height = 64.82 inches

Error = 64 – 64.82 = -0.82 inches

The indicator variable model is

Height = 52.2 + 1.58(Size) + 1.13(Gender_M)

Predicted Height = 64.84 inches

Error = 64 – 64.84 = -0.84 inches

For this particular student, the error is smallest from the gender-specific model.

Using Height to predict Size

The single variable model is

Size = -19.33 + 0.4273(Height)

Predicted Size = 8.0172

Error = 8 – 8.0172 = -0.0172

The gender-specific model is

Size = -13.7 + 0.337(Height)

Predicted Size = 7.868

Error = 8 – 7.868 = 0.132

The indicator variable model is

Size = -14.5 + 0.349(Height) + 0.948(Gender_M)

Predicted Size = 7.836

Error = 8 – 7.836 = 0.164

For this particular student, the error is smallest from the single variable model. However, this does not mean that this is the model that should be used for the general population.

Helpful Hint: Remind students that the same model will not be the best for every student. This is also a good time to remind students that a decision, in this case which model to choose, should not be based on one piece of evidence. A model should appeal to the analyst’s common sense. The models should be compared not only on their adjusted R square values; the residuals from each model should also be examined to see how well they satisfy the assumptions of regression analysis. Finally, introduce the concept of parsimony and explain that statisticians prefer smaller, simpler models when there is not a significant degradation in performance.
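
The comparison worked through for the 64-inch student can be repeated for any student with a short script. The sketch below hard-codes the rounded equations used in this solution for a woman (Gender_M = 0) and picks the model with the smallest absolute error; the dictionary layout and the helper name smallest_error are illustrative choices only.

```python
observed_height, observed_size = 64, 8  # the student is a woman, so Gender_M = 0

# Predicted height from each model, using shoe size as the predictor
height_models = {
    "single variable": 50.83 + 1.7753 * observed_size,
    "gender-specific (women)": 52.9 + 1.49 * observed_size,
    "indicator variable": 52.2 + 1.58 * observed_size + 1.13 * 0,
}
# Predicted shoe size from each model, using height as the predictor
size_models = {
    "single variable": -19.33 + 0.4273 * observed_height,
    "gender-specific (women)": -13.7 + 0.337 * observed_height,
    "indicator variable": -14.5 + 0.349 * observed_height + 0.948 * 0,
}

def smallest_error(predictions, observed):
    """Return the name of the model whose prediction is closest to the observed value."""
    return min(predictions, key=lambda name: abs(observed - predictions[name]))

best_for_height = smallest_error(height_models, observed_height)
best_for_size = smallest_error(size_models, observed_size)

for name, predicted in height_models.items():
    print(f"Height, {name}: error = {observed_height - predicted:.4f}")
for name, predicted in size_models.items():
    print(f"Size, {name}: error = {observed_size - predicted:.4f}")
print("Smallest error for Height:", best_for_height)
print("Smallest error for Size:", best_for_size)
```

As the hint stresses, the winner for one student says little about which model is best overall.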

-----------------------

Correlations: Size for M, Height for M

Pearson correlation of Size for M and Height for M = 0.768

P-Value = 0.000

Correlations: Size for F, Height for F

Pearson correlation of Size for F and Height for F = 0.708

P-Value = 0.000

Correlations: Size, Height

Pearson correlation of Size and Height = 0.871

P-Value = 0.000

Regression Analysis: Height versus Size

The regression equation is

Height = 50.8 + 1.78 Size

Predictor     Coef  SE Coef       T      P
Constant   50.8315   0.5031  101.04  0.000
Size       1.77528  0.04971   35.71  0.000

S = 2.07227   R-Sq = 75.9%   R-Sq(adj) = 75.8%

Analysis of Variance

Source           DF      SS      MS        F      P
Regression        1  5476.9  5476.9  1275.39  0.000
Residual Error  406  1743.5     4.3
Total           407  7220.4

Regression Analysis: Size versus Height

The regression equation is

Size = - 19.3 + 0.427 Height

Predictor      Coef  SE Coef       T      P
Constant   -19.3266   0.8202  -23.56  0.000
Height      0.42728  0.01196   35.71  0.000

S = 1.01664   R-Sq = 75.9%   R-Sq(adj) = 75.8%

Analysis of Variance

Source           DF      SS      MS        F      P
Regression        1  1318.2  1318.2  1275.39  0.000
Residual Error  406   419.6     1.0
Total           407  1737.8

Regression Analysis: Height for M versus Size for M

The regression equation is

Height for M = 52.5 + 1.65 Size for M

Predictor      Coef  SE Coef      T      P
Constant     52.546    1.056  49.78  0.000
Size for M  1.64527  0.09280  17.73  0.000

S = 2.02272   R-Sq = 58.9%   R-Sq(adj) = 58.8%

Analysis of Variance

Source           DF      SS      MS       F      P
Regression        1  1286.1  1286.1  314.34  0.000
Residual Error  219   896.0     4.1

Regression Analysis: Size for M versus Height for M

The regression equation is

Size for M = - 14.2 + 0.358 Height for M

Predictor        Coef  SE Coef      T      P
Constant      -14.191    1.438  -9.87  0.000
Height for M  0.35823  0.02021  17.73  0.000

S = 0.943832   R-Sq = 58.9%   R-Sq(adj) = 58.8%

Analysis of Variance

Source           DF      SS      MS       F      P
Regression        1  280.02  280.02  314.34  0.000
Residual Error  219  195.09    0.89
Total           220  475.11

Regression Analysis: Height for F versus Size for F

The regression equation is

Height for F = 52.9 + 1.49 Size for F

Predictor      Coef  SE Coef      T      P
Constant    52.9298   0.9163  57.77  0.000
Size for F   1.4867   0.1091  13.63  0.000

S = 2.05372   R-Sq = 50.1%   R-Sq(adj) = 49.8%

Analysis of Variance

Source           DF       SS      MS       F      P
Regression        1   783.40  783.40  185.74  0.000
Residual Error  185   780.28    4.22
Total           186  1563.69

Regression Analysis: Size for F versus Height for F

The regression equation is

Size for F = - 13.7 + 0.337 Height for F

Predictor        Coef  SE Coef      T      P
Constant      -13.702    1.615  -8.48  0.000
Height for F  0.33699  0.02473  13.63  0.000

S = 0.977775   R-Sq = 50.1%   R-Sq(adj) = 49.8%

Analysis of Variance

Source           DF      SS      MS       F      P
Regression        1  177.58  177.58  185.74  0.000
Residual Error  185  176.87    0.96
Total           186  354.44

Regression Analysis: Height versus Size, Gender_M

The regression equation is

Height = 52.2 + 1.58 Size + 1.13 Gender_M

Predictor     Coef  SE Coef      T      P
Constant   52.1773   0.6048  86.27  0.000
Size       1.57751  0.07074  22.30  0.000
Gender_M    1.1331   0.2930   3.87  0.000

S = 2.03755   R-Sq = 76.7%   R-Sq(adj) = 76.6%

Analysis of Variance

Source           DF      SS      MS       F      P
Regression        2  5539.0  2769.5  667.09  0.000
Residual Error  405  1681.4     4.2
Total           407  7220.4

Regression Analysis: Size versus Height, Gender_M

The regression equation is

Size = - 14.5 + 0.349 Height + 0.948 Gender_M

Predictor     Coef  SE Coef       T      P
Constant   -14.509    1.025  -14.16  0.000
Height     0.34936  0.01567   22.30  0.000
Gender_M    0.9483   0.1323    7.17  0.000

S = 0.958868   R-Sq = 78.6%   R-Sq(adj) = 78.5%

Analysis of Variance

Source           DF       SS      MS       F      P
Regression        2  1365.43  682.72  742.55  0.000
Residual Error  405   372.37    0.92
Total           407  1737.80
