


MAR 5621: Advanced Managerial Statistics

Assignment #2 Solutions

Grading: Unless otherwise stated, all problems are worth 2 points per letter.

1. The following questions deal with the Magazine data we encountered in Assignment 1 and in class.

DV: Page Costs for a 1-page ad

IVs: Audience (measured in thousands)

Male (percentage of audience that is male)

Income (median household income of audience)

(a) Which pair of predictors does the best job of predicting PageCosts? Which is a more useful prediction equation: the best single predictor, or the best pair of predictors? Why? (Use a measure that allows for comparing between equations with different numbers of predictors.)

Here, we look for the best Adjusted R2 or lowest residual SD among the 3 models with two predictors in them. The best model with two predictors is the one with Audience and Income.

The best two-predictor equation (Audience & Income; Adj R2=.7754) does a little bit better job than the best one-predictor equation (Audience only; Adj R2=.7564), even after adjusting for its larger number of predictors.

Another way to address this is by testing whether Income is a significant predictor in the Audience & Income model (which it is, p = .02297 from the detailed output for that model).
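If you want to reproduce this comparison in software, here is a minimal sketch in Python using statsmodels; the file name magazines.csv and the exact column names (PageCosts, Audience, Income) are assumptions about how the data are stored.

```python
# Sketch: compare the one- and two-predictor models by Adjusted R^2.
# Assumes the magazine data are in "magazines.csv" with columns
# PageCosts, Audience, Male, and Income (names are hypothetical).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("magazines.csv")

m1 = smf.ols("PageCosts ~ Audience", data=df).fit()
m2 = smf.ols("PageCosts ~ Audience + Income", data=df).fit()

print("Audience only:      Adj R2 =", round(m1.rsquared_adj, 4))
print("Audience + Income:  Adj R2 =", round(m2.rsquared_adj, 4))
print("p-value for Income: ", round(m2.pvalues["Income"], 5))
```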

(b) (3 pts) If you knew the Audience score for a magazine, would it be helpful for you to know the Income score for that magazine also? Explain why or why not. Test the null hypothesis that, holding Audience constant, Income is unrelated to PageCosts. Report an appropriate p-value, and state your conclusion of the hypothesis test in a simple English sentence.

Part (a) dealt with this same issue. By comparing Adjusted R2 we found that the Audience & Income model is indeed better than the Audience only model. So it is useful to know Income in addition to Audience.

Based on the fact that the Audience & Income model fits better than the Audience-only model, we expect the Income coefficient to not be zero. We can verify this by looking at the output for the Audience & Income model.

Using the regression output for the Audience & Income model: The Income coefficient is .718 with a standard error of .306, producing a t-statistic of 2.34 and a p-value of .023. Using the usual significance level of .05, we reject the null hypothesis that the slope for Income is 0.

Conclusion: Holding Audience constant, Income is positively related to PageCosts.

(c) Compare the coefficient for Income in the Income-only model, and in the Audience & Income model. Why is it different? Why does the sign change? Explain, in as simple English as possible.

Income is a significant predictor in the Income & Audience model, but is not a significant predictor in the Income-only model. Here, the Income coefficient changes from negative and nonsignificant (b= -0.74, p=.22) to positive and significant (b=0.72, p=.023) when Audience is added to the model.

The coefficient changes because Income and Audience are correlated. More specifically: overall, bigger Audiences go along with higher PageCosts. Also, bigger Audiences tend to have lower Income; this can be verified by looking at the correlation between Audience and Income, r= -.353.

The negative overall association between Income and PageCosts (r = -.167), then, is largely due to the confounding variable Audience. In other words, magazines with high Income scores tend to have cheaper PageCosts because their Audiences tend to be smaller; think of these as specialty magazines aimed at small, rich audiences (e.g., Yachting Monthly).

When we control for Audience, though, (removing the contaminating effect of Audience), we find that Income and PageCosts are positively related. This makes sense, given that richer audiences (of the same size) will tend to be more attractive to advertisers than poorer audiences.
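To see the confounding numerically, a short sketch (same assumed file and column names as above) prints the relevant correlations and shows the sign flip on the Income slope:

```python
# Sketch: correlations that explain the sign change on the Income coefficient.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("magazines.csv")  # hypothetical file/column names

# Pairwise correlations among PageCosts, Audience, and Income
print(df[["PageCosts", "Audience", "Income"]].corr().round(3))

income_only = smf.ols("PageCosts ~ Income", data=df).fit()
with_audience = smf.ols("PageCosts ~ Audience + Income", data=df).fit()
print("Income slope, Income-only model:    ", round(income_only.params["Income"], 3))
print("Income slope, controlling Audience: ", round(with_audience.params["Income"], 3))
```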

(d) Is the model with Male & Income at all useful? Test the null hypothesis that both slopes are 0. Report an appropriate p-value for this test.

The overall test of whether a regression equation is at all useful is given in the ANOVA table, with the F test. The model with Male and Income produces a very small F statistic of .7, and a corresponding large p-value of .48. This large p-value means that we cannot reject the null hypothesis that the model is totally useless. Put another way, it could certainly be the case that the slope for Male and the slope for Income are both equal to 0 in this model. Male and Income are not a good “prediction team.”
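The F statistic and its p-value can be read directly off a fitted model; a minimal sketch, again with the assumed file and column names:

```python
# Sketch: overall F-test for the Male & Income model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("magazines.csv")  # hypothetical file/column names
m = smf.ols("PageCosts ~ Male + Income", data=df).fit()
print("F =", round(m.fvalue, 2), " p =", round(m.f_pvalue, 3))
```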

(e) (1 pt) Which is the best overall prediction equation of the 7 possible combinations of the IVs?

Using Adjusted R2, we can pick an overall winner of the 7 models given, and that winner is the Audience & Income model. It performs a tiny bit better (Adj R2=.7754, residual SD = 21,537) than the model with all 3 predictors (Adj R2=.7746, residual SD = 21,578).
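Rather than fitting the 7 models one at a time, a short loop over every non-empty subset of the predictors ranks them by Adjusted R2; the file and column names are assumptions as before:

```python
# Sketch: rank all 7 predictor subsets by Adjusted R^2.
from itertools import combinations

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("magazines.csv")  # hypothetical file/column names
predictors = ["Audience", "Male", "Income"]

results = []
for k in (1, 2, 3):
    for subset in combinations(predictors, k):
        formula = "PageCosts ~ " + " + ".join(subset)
        fit = smf.ols(formula, data=df).fit()
        results.append((fit.rsquared_adj, formula))

# Print models from best to worst Adjusted R^2
for adj_r2, formula in sorted(results, reverse=True):
    print(f"{adj_r2:.4f}  {formula}")
```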

2. An economist is analyzing the incomes of three groups of professionals (physicians, dentists, and lawyers). He takes a random sample of 125 professionals and estimates a multiple regression model predicting annual income (in thousands of dollars) from three predictor variables:

EXP = years of experience

PHYS = 1 if physician, 0 if not

DENT = 1 if dentist, 0 if not

|Source of Variation |df  |SS     |MS      |F     |
|Regression          |3   |98008  |32669.3 |18.01 |
|Error               |121 |219508 |1814.1  |      |
|Total               |124 |317516 |        |      |

|          |Coefficient |Standard Error |t Stat |
|Intercept |71.65       |18.56          |3.860  |
|EXP       |2.07        |0.81           |2.556  |
|PHYS      |10.16       |3.16           |3.215  |
|DENT      |-7.44       |2.85           |-2.611 |

(a) The regression output specifies 3 regression lines relating income to years of experience, one for each group of professionals. Draw a sketch of these three regression lines. Make sure to label the axes and label each regression line with the appropriate profession. Indicate the slope of each regression line also.

Your sketch should show three parallel lines, where the y-axis is income, and the x-axis is years of experience.

Because the coefficient for PHYS is positive, the physicians’ line sits above the lawyers’ line.

Because the coefficient for DENT is negative, the dentists’ line sits below the lawyers’ line.

So the physicians are highest, the lawyers in the middle, and the dentists the lowest.

The slope of each line is 2.07 (the coefficient for EXP).
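If you prefer to generate the sketch in software, here is a minimal matplotlib sketch built from the coefficients in the table above; the 0-to-30-year experience range is an arbitrary choice for display:

```python
# Sketch: three parallel income lines implied by the regression output.
import numpy as np
import matplotlib.pyplot as plt

exp = np.linspace(0, 30, 100)        # years of experience (arbitrary display range)
lawyers = 71.65 + 2.07 * exp         # baseline group (PHYS = DENT = 0)
physicians = lawyers + 10.16         # PHYS coefficient shifts the line up
dentists = lawyers - 7.44            # DENT coefficient shifts the line down

plt.plot(exp, physicians, label="Physicians (slope 2.07)")
plt.plot(exp, lawyers, label="Lawyers (slope 2.07)")
plt.plot(exp, dentists, label="Dentists (slope 2.07)")
plt.xlabel("Years of experience")
plt.ylabel("Income ($ thousands)")
plt.legend()
plt.show()
```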

(b) What proportion of total variability in income is explained by all three predictor variables together? What is the standard deviation of the residuals? Based on these two pieces of information, do you think the regression equation is a good tool for predicting incomes?

R2 = SSRegression / SSTotal = 98008 / 317516 = .309

About 31% of the variability in income is explained by the 3 predictors together. The SD of the residuals is sqrt(MSResidual) = sqrt(1814.1) = 42.59 or about $42,600.

The regression equation is an ok tool for predicting incomes. It explains a decent chunk of the variability in income (31%), but the typical prediction error is quite large ($42,600). There’s quite a lot of variability in incomes among the different subgroups (defined by the dummy variables and the level of experience.)
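Both numbers come straight from the ANOVA table; a quick check of the arithmetic:

```python
# Sketch: R^2 and residual SD computed from the ANOVA table entries.
import math

ss_regression = 98008
ss_total = 317516
ms_error = 1814.1

r_squared = ss_regression / ss_total   # proportion of variance explained
residual_sd = math.sqrt(ms_error)      # in thousands of dollars

print(round(r_squared, 3), round(residual_sd, 2))   # 0.309, 42.59
```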

(c) Consider a physician in the sample with 10 years of experience, whose annual income is $120,000. What is the predicted income and the residual for this physician? Is this a big or a small residual, compared to the standard deviation of the residuals for this regression?

Predicted income = 71.65 + 2.07*10 + 10.16*1 - 7.44*0 = 102.51, or $102,510

residual = $120,000 – $102,510 = $17,490

This is a reasonably small residual – it’s less than the typical size of all the residuals ($42,600). This physician’s income is predicted pretty well, compared to the bulk of the other observations.
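A quick check of the prediction and residual arithmetic (income measured in thousands of dollars):

```python
# Sketch: predicted income and residual for a physician with 10 years of experience.
predicted = 71.65 + 2.07 * 10 + 10.16 * 1 - 7.44 * 0   # about 102.51 (thousands)
residual = 120 - predicted                              # actual minus predicted
print(round(predicted, 2), round(residual, 2))          # 102.51, 17.49
```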

(d) Holding experience constant, on average how much more do lawyers make compared to dentists? Construct a 95% confidence interval for this quantity. Interpret the interval in a sentence.

The coefficient for DENT estimates this difference between the lawyers’ line and the dentists’ line

-7.44 ± 1.96*2.85 → (-13.03, -1.85) in thousands of dollars, i.e., about -$13,026 to -$1,854

The negative sign indicates that the dentists’ line is below the lawyers’ line. So the interval tells us that we’re 95% confident that, on average, lawyers make between $1850 and $13000 more than dentists with comparable years of experience.
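A quick sketch of the interval arithmetic; with 121 error degrees of freedom the exact t critical value is about 1.98 rather than the 1.96 used above, which widens the interval slightly:

```python
# Sketch: 95% CI for the lawyer-vs-dentist difference (the DENT coefficient).
from scipy import stats

b_dent, se_dent, df_error = -7.44, 2.85, 121
t_crit = stats.t.ppf(0.975, df_error)    # about 1.98; the solution above uses 1.96

lower = b_dent - t_crit * se_dent
upper = b_dent + t_crit * se_dent
print(round(lower, 2), round(upper, 2))  # roughly -13.1 to -1.8 (thousands of dollars)
```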

3. Let’s again look at the ETS data that we’ve discussed in class (use the data from ). In the GENDER column is a dummy variable for the gender of the student (men are coded 0 and women are coded 1). We will focus on the relationship between high school grades and first-year college grades for men and for women.

Predicting first-year grades from gender only

(a) First, run a simple regression predicting first-year college grades from gender only. (that is, predict FYGPA from just the dummy variable GENDER). Report the regression equation and describe in simple non-jargony English what the slope and intercept represent.

Predicted FYGPA = 2.40 + .15 * Gender

The intercept tells us that the average FYGPA for the males (the Gender=0 group) is 2.40.

The slope of .15 tells us that the average FYGPA for the females is .15 points higher than the average FYGPA of the males (the male avg is 2.40 and the female avg is 2.55).
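A minimal sketch of this regression in Python (statsmodels); the file name ets.csv is an assumption, while the column names FYGPA and GENDER come from the assignment:

```python
# Sketch: regressing FYGPA on the GENDER dummy recovers the two group means.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ets.csv")                    # hypothetical file name
m = smf.ols("FYGPA ~ GENDER", data=df).fit()

print(m.params)                                # intercept = male mean; slope = female minus male
print(df.groupby("GENDER")["FYGPA"].mean())    # same numbers, computed directly
```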

(b) What can we say about the overall differences in first-year college grades between men and women? Is it plausible that in the larger population, the average FYGPA is exactly the same for men and for women? Report a p-value relevant to this question.

The p-value for testing whether the difference between the mean FYGPA of females and the mean FYGPA of males is zero is .0015. Because this p-value is quite small, we can conclude that the difference of .15 GPA points that we observe is not simply a fluke of our sample; it probably reflects a difference in the larger population as well. We conclude that it is not plausible that the average FYGPA is exactly the same for men and for women.

Predicting first-year grades from both high-school grades and gender

(c) Run a multiple regression predicting FYGPA from both HSGPA and GENDER. Report the regression equation, and (in simple non-jargony English) interpret the meaning of the terms in the regression equation.

Predicted FYGPA = 0.089 + .74 * HSGPA + .036 * Gender

The intercept of .089 is the predicted avg FYGPA for males (Gender=0) with a high school GPA of 0. Because a high-school GPA of 0 is not something we observe in our sample, we wouldn’t place much faith in this estimate of the regression equation (because it entails extrapolating from our data).

The slope of .74 for HSGPA represents the expected increase in first-year college grades for each 1-point increase in HSGPA, for both males and for females. Thus, two male students who differ by 1 point on HSGPA are expected to differ by, on average, .74 on FYGPA; and similarly, two female students who differ by 1 point on HSGPA are expected to differ by, on average, .74 on FYGPA.

The slope of .036 for Gender suggests that, for a given level of high-school performance (HSGPA), women average slightly higher (.036 GPA points) than men in their first-year college grades. Given a male and a female with the same HSGPA, we would predict that the female would have slightly higher first year college grades. (But of course, given the high p-value for the Gender coefficient, we can’t be sure the difference between the two genders, controlling for HSGPA, is not 0.)
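A minimal sketch of fitting this two-predictor model, using the same assumed ets.csv file (the column names FYGPA, HSGPA, and GENDER come from the assignment):

```python
# Sketch: FYGPA regressed on HSGPA and the GENDER dummy.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ets.csv")  # hypothetical file name
m = smf.ols("FYGPA ~ HSGPA + GENDER", data=df).fit()

print(m.params)    # intercept, HSGPA slope, and the gender gap controlling for HSGPA
print(m.pvalues)   # note the large p-value on GENDER discussed above
```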

(d) Compare the regression coefficient for GENDER to the one you found in part (a). In which case does the “effect” of GENDER appear larger? Explain briefly why the coefficient changes -- or stays the same -- as HSGPA was added to the regression equation.

The effect of Gender was much larger when it was the only predictor (.15) than when it was included in a model along with HSGPA (.036). The coefficient changes because Gender and HSGPA are correlated, with females having generally higher HSGPAs than males. Thus, the overall difference in FYGPA between males and females found in the first regression was in part attributable to differences in HSGPA. When HSGPA was taken into account in the second regression, the gender difference shrunk substantially.

Or, you could note that the two Gender coefficients represent fundamentally different concepts. In the first regression the gender coefficient represents the overall difference in avg FYGPA between all the males and all the females; in the second regression, the gender coefficient represents the difference in FYGPA between males and females with the same HSGPA.

(e) By plugging in the appropriate values of GENDER into the regression equation from part (c), determine the equations for predicting first-year college grades from high-school grades, separately for men and for women.

Men: Predicted FYGPA = .089 + .74 * HSGPA

Women: Predicted FYGPA = .125 + .74 * HSGPA
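The two equations are just the part (c) model evaluated at GENDER = 0 and GENDER = 1; a tiny check of the arithmetic:

```python
# Sketch: gender-specific prediction equations from the part (c) coefficients.
intercept, b_hsgpa, b_gender = 0.089, 0.74, 0.036

men_intercept = intercept + b_gender * 0      # 0.089
women_intercept = intercept + b_gender * 1    # 0.125
print(f"Men:   FYGPA = {men_intercept:.3f} + {b_hsgpa} * HSGPA")
print(f"Women: FYGPA = {women_intercept:.3f} + {b_hsgpa} * HSGPA")
```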

(f) What assumption(s) are we making about the relationship between FYGPA, HSGPA, and GENDER in this analysis? Describe them in simple non-jargony English.

We are assuming that the relationship between FYGPA and HSGPA follows a straight line for men, and also for women, and that the change in FYGPA for a given change in HSGPA (that is, the slope) is the same for both genders.

Or, slightly more jargony, we are assuming that the relationship between FYGPA and HSGPA and Gender can be represented adequately by two parallel lines.

Controlling for SAT scores

(g) Finally, regress FYGPA on 3 predictors: HSGPA, GENDER, and SATSUM. Report the regression equation, and interpret the coefficients of the equation in as simple & non-jargony English as you can.

Predicted FYGPA = -0.98 + .545 HSGPA + .0016 SATSUM + .143 GENDER

Controlling for SATSUM and HSGPA, women on average get .143 higher first-year GPA’s than men.

Controlling for SATSUM and GENDER, an increase of 1 point in HSGPA is associated with (on average) a .545 increase in first-year college GPA

Controlling for HSGPA and GENDER, an increase of 100 points in SATSUM is associated with (on average) a .16 increase in first-year college GPA

The intercept doesn’t make much sense, because HSGPA = 0 and SATSUM = 0 aren’t sensible values.
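A minimal sketch of the three-predictor fit, including the rescaling of the SATSUM slope to a 100-point change (same assumed file name; the column name SATSUM comes from the assignment):

```python
# Sketch: FYGPA regressed on HSGPA, SATSUM, and GENDER.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ets.csv")  # hypothetical file name
m = smf.ols("FYGPA ~ HSGPA + SATSUM + GENDER", data=df).fit()

print(m.params.round(4))
print("Expected FYGPA change per 100 SAT points:",
      round(100 * m.params["SATSUM"], 2))   # .0016 * 100 = .16
```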

(h) Does SAT score help (in addition to the other predictors already in the model) in predicting first year college grades? Explain why or why not.

SAT score helps to improve prediction, as indicated by SATSUM being a significant predictor in the model (t = 10.5, p …
