Name: _______________________________

MAR 5621: Advanced Managerial Statistics

Quiz, March 30, 2004

Please make sure that your answers are legible and understandable. Show your work. There are 25 points total; each part is worth 1 point unless otherwise noted. Good luck!

Problem 1: Predicting exam scores (4 pts)

Consider a simple regression equation predicting the score received on an exam (Y) as a function of the number of hours spent studying (X). The regression line is described by Ŷ = 40 + 15X.

(a) For these data, the predictor variable X is unable to explain 64% of the variation in the outcome variable Y. What is the correlation between hours spent studying and the exam grade?

64% of variance unexplained means 100% - 64% = 36% is explained.

R^2 = 0.36, so r = +/- sqrt(0.36) = +/- 0.6

Because the slope of the regression line is positive, the correlation is too, so r=0.6
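
The steps in part (a) can be checked numerically. The following is a minimal Python sketch; the function name `correlation_from_unexplained` is made up for illustration and is not part of the original quiz.

```python
import math

def correlation_from_unexplained(unexplained, slope):
    """Return r given the unexplained share of variance and the slope."""
    r_squared = 1.0 - unexplained            # explained proportion of variance
    magnitude = math.sqrt(r_squared)         # |r|
    return math.copysign(magnitude, slope)   # r takes the sign of the slope

r = correlation_from_unexplained(unexplained=0.64, slope=15)
print(round(r, 2))
```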

(b) (3 pts) Evaluate the following statements as either true or false (in terms of whether they are legitimate conclusions from this regression) and explain your answer in one sentence.

An explanation of your answer is required to get credit for the 2 false parts.

• Each hour of additional study is associated with, on average, 15 more points on the exam.

True, this is the correct interpretation of the slope of the regression line.

• All students who don’t study at all will receive a score of 40 on the exam.

False. 40 is the predicted average score for students who don’t study; individual scores will vary around that average.

• Each hour of additional study causes students to get 15 more points on the exam.

False, the correlation between hours of study and test score does not mean that study time caused the higher grades. Association does not necessarily imply causation. (Also acceptable is a statement that different students will show different benefits from studying, or that the slope represents the average improvement with each additional hour of study.)

Problem 2: Hotel/Casino Revenue (5 pts)

Below is partial output from a regression predicting annual Gaming Revenue (called GamingRev, and measured in millions of dollars) from annual Hotel Revenue (called HotelRev, also measured in millions of dollars) based on data for 10 hotel/casinos in Las Vegas in 1997.

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 425460 | 425460 | 51.7 | 0.00009 |
| Residual | 8 | 65793 | 8224 | | |
| Total | 9 | 491253 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 113.52 | 48.24 | 2.35 | 0.04646 | 2.27 | 224.77 |
| HotelRev | 0.94 | 0.13 | 7.19 | 0.00009 | 0.64 | 1.24 |

(a) (2 pts) The MGM Grand Casino had Hotel Revenue of $373.1 million and Gaming Revenue of $404.7 million. Use the regression equation to determine the predicted value of Gaming Revenue, and the residual, for the MGM Grand.

Regression Equation: Predicted GamingRev = 113.52 + 0.94*HotelRev

For MGM Grand:

Predicted Gaming Revenue = 113.52 + 0.94*373.1 = 464.2

Residual = actual – predicted = 404.7 – 464.2 = –59.5
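
The prediction and residual for the MGM Grand can be reproduced with a few lines of Python. This is only a sketch: the coefficients come from the regression output, while the function name `predict_gaming_rev` is an illustrative choice.

```python
INTERCEPT = 113.52   # from the regression output, in $ millions
SLOPE = 0.94         # change in GamingRev per unit of HotelRev

def predict_gaming_rev(hotel_rev):
    """Predicted Gaming Revenue ($ millions) for a given Hotel Revenue."""
    return INTERCEPT + SLOPE * hotel_rev

hotel_rev, actual = 373.1, 404.7          # MGM Grand, $ millions
predicted = predict_gaming_rev(hotel_rev)
residual = actual - predicted             # residual = actual - predicted
print(f"predicted = {predicted:.1f}, residual = {residual:.1f}")
```

A negative residual means the MGM Grand earned less Gaming Revenue than the line predicts for a hotel with its Hotel Revenue.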

(b) Based on the appropriate 95% confidence interval, describe in a sentence or two what we can conclude about how much additional Gaming Revenue is associated with each additional dollar of Hotel Revenue.

The slope is interpreted as how much Y (GamingRev) changes, on average, as X (HotelRev) increases by 1 unit. The regression output tells us that the 95% confidence interval for the slope goes from $0.64 to $1.24.

We are 95% confident that, on average, each additional dollar of HotelRev is associated with between $0.64 and $1.24 of GamingRev.

(c) Suppose we wanted to test the null hypothesis that each additional dollar of Hotel Revenue is associated with, on average, an additional $1 in Gaming Revenue. Is this null hypothesis plausible, or implausible? Explain in a sentence or two.

It’s plausible, because the confidence interval ($0.64, $1.24) includes the value $1.00. The slope of the regression line could plausibly be 1.00.
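
The same conclusion can be reached with a t statistic, which is equivalent to checking the 95% confidence interval. The sketch below assumes the standard two-sided 95% critical value of t with 8 degrees of freedom (2.306, taken from a t table rather than from the quiz output).

```python
slope, se = 0.94, 0.13   # slope and its standard error, from the output
null_value = 1.0         # hypothesized slope: $1 of gaming per $1 of hotel
t_crit = 2.306           # assumption: two-sided 95% t critical value, 8 df

# How many standard errors the sample slope is from the hypothesized value
t_stat = (slope - null_value) / se
reject = abs(t_stat) > t_crit

# Rebuilding the interval from the slope and SE gives the same answer
lower, upper = slope - t_crit * se, slope + t_crit * se
print(f"t = {t_stat:.2f}, reject = {reject}, CI = ({lower:.2f}, {upper:.2f})")
```

The sample slope is less than half a standard error below 1.00, so the null hypothesis is not rejected.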

(d) What is the typical difference between the predicted value of Gaming Revenue and the actual value of Gaming Revenue, across the 10 observations? (In other words, what’s the typical size of the prediction errors for this regression?)

The residual SD gives us the typical size of the prediction errors.

Residual SD = sqrt(MSResidual) = sqrt(8224) = 90.69, or about $91 million, since GamingRev is measured in millions of dollars (the residuals, or prediction errors, have the same units as the Y variable).
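
As a quick check, the residual SD can be computed directly from the Residual row of the ANOVA table. This is a small sketch using only the numbers shown in the output.

```python
import math

ss_residual = 65793   # Residual sum of squares from the ANOVA table
df_residual = 8       # Residual degrees of freedom (n - 2 = 10 - 2)

ms_residual = ss_residual / df_residual   # mean square, about 8224
residual_sd = math.sqrt(ms_residual)      # typical prediction error, $ millions
print(f"Residual SD = {residual_sd:.2f} million dollars")
```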

Problem 3: Income and Height (9 pts)

In a University of Pittsburgh study (reported in the Wall Street Journal Dec. 30, 1986), 250 male business school graduates, all about 30 years old, were polled and asked to report their height and their annual incomes. The men ranged in height from 62 to 78 inches, with a mean height of 69 inches, and a standard deviation of 3.2 inches. The annual incomes ranged from $40,000 to $81,000, with a mean of $59,600, and a standard deviation of $8,500. A regression was performed to predict annual income (in thousands) from height (in inches); partial regression output is below.

| ANOVA | df | SS | MS | F |
|---|---|---|---|---|
| Regression | 1 | 906 | 906 | 13.2 |
| Residual | 248 | 17011 | 69 | |
| Total | 249 | 17917 | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 17.9 | 11.48 | 1.56 | 0.1194 |
| Height | 0.60 | 0.17 | 3.63 | 0.0003 |

(a) Determine the proportion of variability in Income that is explained by Height.

Use the ANOVA table to get R^2 = SSRegression / SSTotal = 906 / 17917 = 0.05

5% of the variability in Income is explained by Height.

(b) (2 pts) Construct a 95% confidence interval for the average additional income associated with each additional inch of height, and interpret this interval in a sentence.

“the average additional income associated with each additional inch of height” refers to the slope of the regression line

95% CI for the slope is: sample slope +/- t value * (standard error of the sample slope)

The t value from the table for df = 248 and 95% confidence is approximately 1.96.

95% CI: 0.60 +/- 1.96 * 0.17 = (0.27, 0.93)

Converting from thousands of dollars to dollars: ($270, $930)

We’re 95% confident that each additional inch of height is associated with, on average, between $270 and $930 of additional annual income.
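
The interval arithmetic above can be sketched in Python. The helper name `slope_ci` is made up for illustration, and 1.96 is the t value used in the answer.

```python
def slope_ci(slope, se, t_value):
    """Confidence interval for a regression slope: slope +/- t * SE."""
    margin = t_value * se
    return slope - margin, slope + margin

# Income is in $ thousands, so the slope is $ thousands per inch of height
low, high = slope_ci(slope=0.60, se=0.17, t_value=1.96)

# Convert from $ thousands to dollars for the interpretation sentence
low_dollars, high_dollars = low * 1000, high * 1000
print(f"({low:.2f}, {high:.2f}) thousand, i.e. about "
      f"${low_dollars:.0f} to ${high_dollars:.0f} per inch")
```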

(c) (2 pts) Suppose we wanted to determine 95% confidence intervals for the average Income for 3 separate subpopulations: 64-inch-tall men, 68-inch-tall men, and 72-inch-tall men.

Which of these 3 intervals would be widest? Which would be narrowest? (Note: no elaborate calculations are needed here)

Estimating a conditional mean (or predicting a new observation) gets harder and less precise as you move further from the overall mean of X. (This is because of the (X0 – Xbar)^2 portion of the formulas for the standard errors of conditional means.)

So the interval for 64-inch-tall men will be widest because 64 inches is furthest from the average height of 69 inches.

And the interval for 68-inch-tall men will be narrowest because 68 inches is closest to the average height of 69 inches.
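
Although no calculation is required, the ordering can be confirmed with the usual standard error formula for a conditional mean, SE = s_e * sqrt(1/n + (X0 - Xbar)^2 / SSX). The sketch below is an illustration only: it assumes SSX = (n - 1) * s_x^2 from the summary statistics in the problem, and uses the rounded MS Residual of 69.

```python
import math

n, x_bar, s_x = 250, 69.0, 3.2   # sample size, mean and SD of height
s_e = math.sqrt(69)              # residual SD from MS Residual (rounded)
ss_x = (n - 1) * s_x ** 2        # sum of squared deviations of height

def se_conditional_mean(x0):
    """Standard error of the estimated mean income at height x0."""
    return s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ss_x)

ses = {h: se_conditional_mean(h) for h in (64, 68, 72)}
for height, se in ses.items():
    print(f"{height} inches: SE = {se:.2f} thousand dollars")
```

The interval width is proportional to the standard error, so the largest SE (at 64 inches) gives the widest interval and the smallest (at 68 inches) gives the narrowest.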


(d) (2 pts) Suppose a new observation was added to the data set: Frank is 80 inches tall, and his income is $41,000. How would the regression line change if it was re-estimated with Frank included?

Frank is in the lower right quadrant relative to the means of X and Y (high on X, low on Y). Adding Frank will pull down the right-hand side of the regression line, decreasing its slope. Also, since Frank is a big exception to the overall trend, his prediction error will be relatively large (and by moving the regression line a bit, he’ll make the prediction errors larger for the other points too), so the Residual SD will increase.

The slope would: Increase / Decrease / Stay the same (answer: Decrease)

The standard deviation of the residuals would: Increase / Decrease / Stay the same (answer: Increase)
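
The effect of adding Frank can be illustrated on made-up data. The sketch below does NOT use the survey data (which are not given); the ten heights and the alternating noise pattern are synthetic stand-ins built around the reported line, and `fit_line` is a hypothetical helper.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope, intercept, and residual SD for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, math.sqrt(ss_res / (n - 2))

# Synthetic data (assumption): incomes in $ thousands near 17.9 + 0.6 * height
heights = [62, 64, 66, 68, 69, 70, 72, 74, 76, 78]
noise = [3, -3, 3, -3, 3, -3, 3, -3, 3, -3]
incomes = [17.9 + 0.6 * h + e for h, e in zip(heights, noise)]

slope_before, _, sd_before = fit_line(heights, incomes)

# Add Frank: 80 inches tall, income of $41 thousand
slope_after, _, sd_after = fit_line(heights + [80], incomes + [41.0])

print(f"slope: {slope_before:.2f} -> {slope_after:.2f}")
print(f"residual SD: {sd_before:.2f} -> {sd_after:.2f}")
```

With far fewer points than the real sample, one outlier moves the line much more here than it would among 250 observations, but the direction of both changes is the same.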

(e) (2 pts) Imagine that a second “height & income” survey of 250 men was conducted, but only includes men of "medium height" (ranging in height from 68 to 72 inches). Suppose that the regression equation for this second sample is about the same as the one for the first sample (i.e., the sample described in detail on the previous page). What can we reasonably predict about the following quantities: (i) the correlation between income and height, and (ii) the standard error of the regression slope, for this second sample as compared to the first sample? Write greater than, less than, or the same as in each blank.

Restricting the range of the X-scores will decrease the correlation and increase the SE of the slope. Remember: wider is better. A wider spread of X-scores makes the correlation larger and the SE of the regression slope smaller.

The correlation in the second sample will be __less than__ the correlation in the first sample.

The standard error of the regression slope in the second sample will be __greater than__ the standard error in the first sample.
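
A rough numerical sketch of the range-restriction effect follows. It holds the slope and residual SD fixed and changes only the SD of height. The value 1.2 inches for the second sample is a hypothetical figure chosen for illustration, and the relation used for the SD of Y (line plus independent noise) is an approximation.

```python
import math

b = 0.60              # slope, assumed the same in both samples
s_e = math.sqrt(69)   # residual SD, assumed the same in both samples
n = 250               # sample size in both samples

def r_and_se(s_x):
    """Approximate correlation and slope SE for a given SD of height."""
    s_y = math.sqrt(b ** 2 * s_x ** 2 + s_e ** 2)   # implied SD of income
    r = b * s_x / s_y                               # r = b * s_x / s_y
    se_slope = s_e / (s_x * math.sqrt(n - 1))       # SE of the slope
    return r, se_slope

r_full, se_full = r_and_se(3.2)       # first sample: SD of height = 3.2
r_narrow, se_narrow = r_and_se(1.2)   # second sample: hypothetical SD

print(f"full range:   r = {r_full:.3f}, SE(slope) = {se_full:.3f}")
print(f"narrow range: r = {r_narrow:.3f}, SE(slope) = {se_narrow:.3f}")
```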

Problem 4: When SSResidual = SSTotal (3 pts)

If the residual sum of squares (SSResidual) is equal to the total sum of squares (SSTotal) for a regression, then what can we conclude about the following quantities? If it is possible to determine a quantity, do so. If it is not possible to determine a quantity, write “cannot determine.”

a) The correlation coefficient

Correlation r = 0. Because SSResidual = SSTotal, SSRegression = 0 and the regression line explains none of the variance in Y. Therefore R^2 = 0, and thus r = 0.

b) The slope of the regression line

Slope b = 0. The slope equals r * (SD of Y / SD of X), so when the correlation is 0, the slope is also 0.

c) The standard error of the slope of the regression line

Cannot be determined. This quantity depends on the sample size, the residual variance, and the variance of X, which we do not know.
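
A tiny made-up data set shows parts (a) and (b) in action. The five points below are invented so that X and Y are exactly uncorrelated; they are not from any of the quiz problems.

```python
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 7, 3, 1]   # made-up values, symmetric around x = 3

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

slope = s_xy / s_xx                 # least-squares slope
intercept = y_bar - slope * x_bar   # line passes through (x_bar, y_bar)

ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_residual / ss_total

print(slope, ss_residual, ss_total, r_squared)
```

The fitted line is flat at the mean of Y, so every residual equals the deviation from the mean and SSResidual matches SSTotal.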

Problem 5: Mortgage Expenses (4 pts)

When you take out a mortgage, there are several different kinds of costs you typically pay. Usually the two largest costs are the interest rate (a yearly percentage paid on the amount you owe), and the loan fee (sometimes called “points” -- a one-time percentage of the total loan amount paid at the time the loan is made). Financial institutions typically let you “buy down” the interest rate by paying a higher initial loan fee; thus there is expected to be a relationship between the interest rate and the loan fee offered by different lenders. Below we see an analysis of data for n=57 lenders in the Seattle area in 1995. Here’s what a portion of the data looks like:

| Lender | Interest Rate (%) | Loan Fee (%) |
|---|---|---|
| Abacus Mortgage | 7.25 | 1.875 |
| Advocate Mortgage | 7.875 | 1.50 |
| Etc… | … | … |

In the analysis below, the interest rate is predicted from the loan fee. Each variable is measured in percentage points. Edited regression output follows below:

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 1.0419 | 1.0419 | 41.04 | 3.52E-08 |
| Residual | 55 | 1.3962 | 0.0254 | | |
| Total | 56 | 2.4380 | | | |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 7.936 | 0.043 | 184.050 | 1.99E-78 |
| Fee | -0.195 | 0.030 | -6.406 | 3.52E-08 |

(a) Determine the correlation between interest rates and loan fees.

R^2 = 1.0419 / 2.4380 = 0.427 (this is the proportion of variability in interest rates explained by loan fees)

Correlation r = +/- sqrt(0.427) = +/- 0.65

Because the slope is negative, the correlation is too, so r = -.65

Need to report the correlation as negative to get credit.
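
The calculation, including the sign step, can be sketched as follows. The function name `signed_correlation` is an illustrative choice, not something from the regression software.

```python
import math

def signed_correlation(ss_regression, ss_total, slope):
    """Return (R^2, r), giving r the same sign as the regression slope."""
    r_squared = ss_regression / ss_total
    r = math.copysign(math.sqrt(r_squared), slope)
    return r_squared, r

r_squared, r = signed_correlation(1.0419, 2.4380, slope=-0.195)
print(f"R^2 = {r_squared:.3f}, r = {r:.2f}")
```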

(b) From the information given, is it possible to determine the proportion of the variability in loan fees that is explained by interest rates? If so, report that value. If not, explain why not.

Correlations are symmetric, so the correlation between interest rates and loan fees is the same as the correlation between loan fees and interest rates. Proportion of variability explained, because it is equal to the correlation squared, is also symmetric. So the proportion of variability in loan fees explained by interest rates is the same as the proportion of variability in interest rates explained by loan fees, which is .427 or 42.7%

(c) The overall average loan fee in the sample is 1.24%. Use the regression output to determine the overall average interest rate in the sample.

The regression line always goes through the point defined by the overall mean of X and the overall mean of Y. So if we plug the average loan fee into the regression equation, it will tell us the overall average interest rate.

Overall average interest rate = 7.936 - .195 * 1.24 = 7.694
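
Plugging the mean fee into the fitted line is a one-line computation, sketched here with the coefficients from the output.

```python
intercept, slope = 7.936, -0.195   # from the regression output
mean_fee = 1.24                    # overall average loan fee, in percent

# The fitted line passes through (mean fee, mean rate)
mean_rate = intercept + slope * mean_fee
print(f"Average interest rate = {mean_rate:.3f}%")
```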

(d) Based on the typical size of the residuals around the regression line, are you likely to find a lender offering an interest rate a full percentage point less than what the regression line predicts? Are you likely to find a lender offering an interest rate 0.25 percentage points less than what the regression line predicts? Explain why or why not.

The residual standard deviation tells us how big the residuals typically are.

The residual SD = sqrt(.0254) = 0.159

So the residuals tend to be about 0.159 percentage points in size, on average.

You’re not likely to find a lender offering 1.0 points less than the regression line’s predictions, because that would correspond to a residual that is 1.0 / .159 or about 6 residual standard deviations away from the regression line. Not very likely to happen!!

But 0.25 points less than the regression line’s prediction corresponds to 0.25 / 0.159, or about 1.6 residual SDs, which is quite plausible (given that about 95% of the observations should be within 2 residual SDs of the regression line).
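
The comparison of each gap to the residual SD can be sketched in Python. The "within 2 residual SDs" threshold is the rule of thumb used in the answer, not an exact test.

```python
import math

ss_residual, df_residual = 1.3962, 55              # from the ANOVA table
residual_sd = math.sqrt(ss_residual / df_residual)

# Express each gap below the line in units of residual SDs
sds_below = {gap: gap / residual_sd for gap in (1.0, 0.25)}
for gap, z in sds_below.items():
    verdict = "plausible" if z <= 2 else "very unlikely"
    print(f"{gap} points below the line = {z:.1f} residual SDs ({verdict})")
```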

Problem 6: Bonus question! (1 pt) In class, we discussed how a set of data will contain both an overall systematic trend, and also exceptions to that trend. Explain in a sentence or two how the ANOVA table relates to this idea.

The ANOVA table splits the total variability (the Total row) into a part related to the systematic straight-line trend (Regression row) and a part related to exceptions to that trend (the Residual row).
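
This split can be verified on the three ANOVA tables in this quiz: in each, the Regression and Residual sums of squares add up to the Total (up to rounding in the printed output). A short sketch:

```python
# (SS Regression, SS Residual, SS Total) as printed in each problem
anova = {
    "Problem 2": (425460, 65793, 491253),
    "Problem 3": (906, 17011, 17917),
    "Problem 5": (1.0419, 1.3962, 2.4380),
}

gaps = {}
for name, (ss_reg, ss_res, ss_total) in anova.items():
    gaps[name] = ss_reg + ss_res - ss_total   # zero, apart from rounding
    trend_share = ss_reg / ss_total           # share due to the trend (R^2)
    print(f"{name}: trend {trend_share:.1%}, exceptions {1 - trend_share:.1%}")
```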
