


Multiple Regression

Sometimes more than one explanatory (predictor) variable is needed to accurately predict the value of the response variable. Multiple regression uses two or more explanatory variables to predict a response variable. Multiple regression is also used to examine the relationship between the response and a single explanatory variable while controlling for (holding fixed) the values of the other explanatory variables.

Example: The following data set consists of observations on 100 recent home sales in Gainesville, Florida (from Statistics: The Art and Science of Learning from Data by Agresti and Franklin).

|House |Real Estate Tax ($) |# of Bedrooms |# of Baths |Quadrant |NW (1 = NW, 0 = other) |Selling Price ($) |House Size (sq ft) |Lot Size (sq ft) |

|1 |1360 |3 |2.0 |NW |1 |145000 |1240 |18000 |

|2 |1050 |1 |1.0 |NW |1 |68000 |370 |25000 |

|3 |1010 |3 |1.5 |NW |1 |115000 |1130 |25000 |

|4 |830 |3 |2.0 |SW |0 |69000 |1120 |17000 |

|5 |2150 |3 |2.0 |NW |1 |163000 |1710 |14000 |

|6 |1230 |3 |2.0 |NW |1 |69900 |1010 |8000 |

|7 |150 |2 |2.0 |NW |1 |50000 |860 |15300 |

|8 |1470 |3 |2.0 |NW |1 |137000 |1420 |18000 |

|9 |1850 |3 |2.0 |NW |1 |121300 |1270 |16000 |

|11 |630 |3 |2.0 |NE |0 |64500 |1220 |12000 |

|12 |1780 |3 |2.0 |NW |1 |167000 |1690 |30000 |

|13 |1630 |3 |2.0 |NW |1 |114600 |1380 |15500 |

|14 |1530 |3 |2.0 |NW |1 |103000 |1590 |16800 |

|15 |930 |3 |1.0 |NW |1 |101000 |1050 |16000 |

|16 |590 |2 |1.0 |NW |1 |50000 |770 |22100 |

|17 |1050 |3 |2.0 |NE |0 |85000 |1410 |12000 |

|18 |20 |3 |1.0 |SE |0 |22500 |1060 |3500 |

|19 |870 |2 |2.0 |NW |1 |90000 |1300 |17500 |

|20 |1320 |3 |2.0 |NW |1 |133000 |1500 |30000 |

|21 |1350 |2 |1.0 |NW |1 |90500 |820 |25700 |

|22 |2790 |3 |2.5 |NE |0 |260000 |2130 |25000 |

|23 |680 |2 |1.0 |NE |0 |142500 |1170 |22000 |

|24 |1840 |3 |2.0 |NW |1 |160000 |1500 |19000 |

|25 |3680 |4 |2.0 |NW |1 |240000 |2790 |20000 |

|26 |1660 |3 |1.0 |NW |1 |87000 |1030 |17500 |

|27 |1620 |3 |2.0 |NW |1 |118600 |1250 |20000 |

|28 |3100 |3 |2.0 |NW |1 |140000 |1760 |38000 |

|29 |2070 |2 |3.0 |NW |1 |148000 |1550 |14000 |

|30 |650 |3 |1.5 |NE |0 |65000 |1450 |12000 |

|31 |2260 |4 |2.0 |NW |1 |176000 |2000 |18000 |

|32 |1760 |3 |1.0 |NE |0 |86500 |1350 |14000 |

|33 |2750 |3 |2.0 |NW |1 |180000 |1840 |28000 |

|34 |2020 |4 |2.0 |NW |1 |179000 |2510 |25000 |

|35 |4900 |3 |3.0 |NW |1 |338000 |3110 |38000 |

|36 |1180 |4 |2.0 |NW |1 |130000 |1760 |30000 |

|37 |590 |3 |1.5 |NW |1 |77300 |1120 |14500 |

|38 |1600 |2 |1.0 |NW |1 |125000 |1110 |10000 |

|39 |1970 |3 |2.0 |NW |1 |100000 |1360 |25000 |

|40 |2060 |3 |1.0 |SW |0 |100000 |1250 |25000 |

|41 |1980 |3 |1.0 |NW |1 |100000 |1250 |25000 |

|42 |1510 |3 |2.0 |NW |1 |146500 |1480 |18000 |

|43 |1710 |3 |2.0 |NW |1 |144900 |1520 |20100 |

|44 |1590 |3 |2.0 |NW |1 |183000 |2020 |26000 |

|45 |1580 |3 |1.5 |NE |0 |77000 |1220 |12000 |

|46 |1510 |2 |2.0 |NW |1 |60000 |1640 |17500 |

|47 |1450 |2 |2.0 |NW |1 |127000 |940 |22000 |

|48 |970 |3 |2.0 |NE |0 |86000 |1580 |12000 |

|49 |1250 |3 |2.5 |NE |0 |95000 |1270 |12000 |

|50 |4020 |4 |2.5 |NW |1 |270500 |2440 |41000 |

|51 |700 |4 |1.5 |NE |0 |75000 |1520 |12000 |

|52 |820 |2 |1.0 |NW |1 |81000 |980 |13000 |

|53 |2050 |4 |2.0 |SW |0 |188000 |2300 |21600 |

|54 |710 |3 |2.0 |NE |0 |85000 |1430 |12500 |

|55 |1280 |3 |2.0 |NW |1 |137000 |1380 |18000 |

|56 |1800 |3 |1.5 |NW |1 |92900 |1010 |18000 |

|57 |20 |3 |1.5 |NW |1 |93000 |1780 |4000 |

|58 |800 |3 |2.0 |NW |1 |109300 |1120 |15500 |

|59 |1220 |3 |2.0 |NW |1 |131500 |1900 |25000 |

|60 |3360 |4 |3.0 |NW |1 |200000 |2430 |30000 |

|61 |210 |3 |2.0 |NW |1 |81900 |1080 |8000 |

|62 |380 |2 |1.0 |NW |1 |91200 |1350 |13000 |

|63 |1920 |4 |3.0 |NW |1 |124500 |1720 |12000 |

|64 |4350 |3 |3.0 |NE |0 |225000 |4050 |35000 |

|65 |1510 |3 |2.0 |NW |1 |136500 |1500 |18000 |

|66 |4290 |4 |2.5 |NW |1 |268000 |2530 |36000 |

|67 |1160 |3 |1.5 |NE |0 |70700 |1020 |9000 |

|68 |970 |4 |2.5 |NE |0 |70000 |2070 |10800 |

|69 |1400 |3 |2.0 |NW |1 |140000 |1520 |18500 |

|70 |790 |2 |2.0 |NW |1 |89900 |1280 |6000 |

|71 |1210 |3 |2.0 |NW |1 |137000 |1620 |18000 |

|72 |1550 |3 |2.0 |SW |0 |103000 |1520 |12000 |

|73 |2800 |3 |2.0 |NW |1 |183000 |2030 |23000 |

|74 |2560 |3 |2.0 |NW |1 |140000 |1390 |20100 |

|75 |1390 |4 |2.0 |NW |1 |160000 |1880 |22000 |

|76 |2820 |4 |2.5 |NW |1 |192000 |2780 |30000 |

|77 |2850 |2 |1.0 |NW |1 |130000 |1340 |27000 |

|78 |2230 |2 |2.0 |NW |1 |123000 |940 |22000 |

|79 |20 |2 |1.0 |NW |1 |21000 |580 |9000 |

|80 |1510 |4 |2.0 |NE |0 |85000 |1410 |12000 |

|81 |710 |3 |2.0 |SW |0 |69900 |1150 |4500 |

|82 |1540 |3 |2.0 |NW |1 |125000 |1380 |18000 |

|83 |1780 |3 |2.0 |NW |1 |162600 |1470 |20100 |

|84 |2920 |2 |2.0 |NW |1 |156900 |1590 |22000 |

|85 |1710 |3 |2.0 |NW |1 |105900 |1200 |15500 |

|86 |1880 |3 |2.0 |SW |0 |167500 |1920 |22000 |

|87 |1680 |3 |2.0 |NW |1 |151800 |2150 |29000 |

|88 |3690 |5 |3.0 |NW |1 |118300 |2200 |30000 |

|89 |900 |2 |2.0 |NW |1 |94300 |860 |15500 |

|90 |560 |3 |1.0 |NE |0 |93900 |1230 |12000 |

|91 |2040 |4 |2.0 |NW |1 |165000 |1140 |18200 |

|92 |4390 |4 |3.0 |NW |1 |285000 |2650 |36000 |

|93 |690 |3 |1.0 |NW |1 |45000 |1060 |8000 |

|94 |2100 |3 |2.0 |NW |1 |124900 |1770 |16000 |

|95 |2880 |4 |2.0 |NW |1 |147000 |1860 |35000 |

|96 |990 |2 |2.0 |NW |1 |176000 |1060 |27500 |

|97 |3030 |3 |2.0 |SW |0 |196500 |1730 |47400 |

|98 |1580 |3 |2.0 |NW |1 |132200 |1370 |18000 |

|99 |1770 |3 |2.0 |NE |0 |88400 |1560 |12000 |

|100 |1430 |3 |2.0 |NW |1 |127200 |1340 |18000 |


[Scatterplot matrix of selling price, house size, and lot size; selling price appears on the y-axis in its row of plots and on the x-axis in its column.]

Correlations: price, size, lot

        price   size
size    0.761
lot     0.714  0.534

Cell Contents: Pearson correlation
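As a quick sketch, this correlation matrix could be reproduced in Python with pandas; the file name and column names below are assumptions, not part of the handout.

import pandas as pd

homes = pd.read_csv("gainesville_homes.csv")   # hypothetical file name
print(homes[["price", "size", "lot"]].corr())  # Pearson correlations by default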

The multiple regression model relates the mean μy of a quantitative response variable y to a set of explanatory variables x1, x2, … . For example, with two predictor variables, the population regression equation is μy = B0 + B1x1 + B2x2 and the sample regression equation is ŷ = b0 + b1x1 + b2x2. Using the sample data, the sample regression equation is determined by the method of least squares, and this equation estimates the population regression equation.

If we use house size (x1) and lot size (x2) as predictor variables and selling price as the response variable, we can see from the output below that the sample regression equation is ŷ = -10,536 + 53.8x1 + 2.84x2.

Thus, for a fixed lot size, the predicted selling price increases by $53.80 for each additional square foot of house size. Likewise, for a fixed house size, the predicted selling price increases by $2.84 for each additional square foot of lot size.

Using the prediction equation, predict the selling price of a house with house size = 1,500 and lot size = 20,000: ŷ = -10,536 + 53.8(1,500) + 2.84(20,000) = $126,964.
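A minimal sketch of how this fit and prediction could be reproduced in Python with statsmodels; the file name and column names are assumptions, not part of the handout.

import pandas as pd
import statsmodels.api as sm

homes = pd.read_csv("gainesville_homes.csv")    # hypothetical file name
X = sm.add_constant(homes[["size", "lot"]])     # adds the intercept column
fit = sm.OLS(homes["price"], X).fit()
print(fit.params)                               # approximately -10536, 53.78, 2.840

new_house = pd.DataFrame({"const": [1.0], "size": [1500], "lot": [20000]})
print(fit.predict(new_house))                   # about 126,940; the handout's $126,964 uses the rounded coefficients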

Minitab Output

The regression equation is

price = - 10536 + 53.8 size + 2.84 lot

Predictor      Coef  SE Coef      T      P  VIF
Constant     -10536     9436  -1.12  0.267
size         53.779    6.529   8.24  0.000  1.4
lot          2.8404   0.4267   6.66  0.000  1.4

S = 30588.1   R-Sq = 71.1%   R-Sq(adj) = 70.5%

Source           DF           SS           MS       F      P
Regression        2  2.23676E+11  1.11838E+11  119.53  0.000
Residual Error   97  90756293211    935631889
Total            99  3.14433E+11

Note: Ideally, in using multiple regression, the sample size should be at least 10 times the number of predictor variables.

R2 Interpretation

R2 = (Σ(y − ȳ)² − Σ(y − ŷ)²) / Σ(y − ȳ)² = (Total SS − Residual SS) / Total SS

The larger the value of R2, the better the predictor variables collectively predict y.

R2 = .711

Interpretation: An R2 of .711 means that the sum of squared deviations of the y values about their predicted values has been reduced by 71.1% by using the prediction equation ŷ = b0 + b1x1 + b2x2, instead of ȳ, to predict y.

Alternate Interpretation: 71.1% of the total sample variation in y is explained by the fitted model.
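The value of R2 can be verified directly from the sums of squares in the ANOVA table above; a minimal arithmetic check in Python:

# R2 = (Total SS - Residual SS) / Total SS = Regression SS / Total SS
total_ss = 3.14433e11
residual_ss = 90_756_293_211
regression_ss = 2.23676e11

print(round((total_ss - residual_ss) / total_ss, 3))   # 0.711
print(round(regression_ss / total_ss, 3))              # 0.711, the same quantity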

Inferences in Multiple Regression

Assumptions required for inference

1. The regression equation truly holds for the population means. This implies that there is a straight line relationship between the mean of y and each explanatory variable, with the same slope at each value of the other predictors.

2. The response variable y has a normal distribution at each combination of values of the explanatory variables, with the same standard deviation.

3. The observations are independent.

Test for the Utility of the Model

H0: B1=B2=B3=…=Bk=0

(i.e. there is no useful linear relationship between y and any of the predictors)

Ha: At least one among B1, B2, …., Bk is not 0

(i.e. there is a useful linear relationship between y and at least one of the predictors)

In our example, we are interested in a test of

H0: B1=B2=0

Ha: At least one of the two B’s is not 0

From the output, F = 119.53 and the P-value = 0.000, so we reject H0 and conclude that at least one of the two predictors has some power to predict selling price.
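A sketch of how the F statistic and its p-value can be recomputed from the ANOVA table, using scipy for the F distribution:

from scipy import stats

ms_regression = 2.23676e11 / 2          # Regression SS / its df
ms_residual = 90_756_293_211 / 97       # Residual SS / its df

f_stat = ms_regression / ms_residual
p_value = stats.f.sf(f_stat, 2, 97)     # upper-tail area of F(2, 97)
print(round(f_stat, 2), p_value)        # about 119.5; the p-value is essentially 0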

Test for Individual Slope parameters

For example,

H0: B2=0

This null hypothesis asserts that x2 has no additional predictive value over and above that contributed by the other predictor variables in the model. In the context of our example, the null hypothesis states that lot size does not help us better predict selling price if we already know the house size.

Ha: B2 ≠ 0

If H0 is rejected, we can conclude that x2 provides useful information about y, over and above the information contained in the other predictors.

From the output, the t statistic for lot size is 6.66 with p-value = 0.000, so we reject H0.
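The test statistic for an individual slope is t = (estimate)/(standard error), with n − k − 1 = 97 degrees of freedom. A quick check of the lot-size test using the values reported in the output:

from scipy import stats

t_stat = 2.8404 / 0.4267                   # coefficient / SE Coef for lot
p_value = 2 * stats.t.sf(abs(t_stat), 97)  # two-sided p-value
print(round(t_stat, 2), p_value)           # about 6.66; the p-value is essentially 0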

Confidence interval for the population mean selling price for houses having house size=1,500 and lot size=20,000

New Obs      Fit  SE Fit            95% CI
      1   126940    3081  (120824, 133055)

Values of Predictors for New Observations

New Obs  size    lot
      1  1500  20000
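The 95% confidence interval reported above can be rebuilt from the Fit and SE Fit values as Fit ± t*(SE Fit), with 97 degrees of freedom; a quick check:

from scipy import stats

fit_value, se_fit, df = 126940, 3081, 97
t_crit = stats.t.ppf(0.975, df)            # about 1.98
margin = t_crit * se_fit
print(round(fit_value - margin), round(fit_value + margin))
# about (120825, 133055); small differences are rounding in the reported Fit and SE Fit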

Checking Assumptions

Assessing Normality Assumption

[Plots for assessing normality of the residuals]

Assessing Equal Variance Assumption and Appropriateness of the Model (μy = B0 + B1x1 + B2x2)

[Residual plots for assessing equal variance and model form]
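A sketch of the usual residual diagnostics in Python (matplotlib and scipy); the file and column names are assumptions, as in the earlier sketch.

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

homes = pd.read_csv("gainesville_homes.csv")       # hypothetical file name
X = sm.add_constant(homes[["size", "lot"]])
fit = sm.OLS(homes["price"], X).fit()

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].hist(fit.resid, bins=15)                   # normality: roughly bell-shaped?
axes[0].set_title("Histogram of residuals")
stats.probplot(fit.resid, plot=axes[1])            # normality: roughly a straight line?
axes[2].scatter(fit.fittedvalues, fit.resid)       # equal spread, no curved pattern?
axes[2].axhline(0, color="gray")
axes[2].set_title("Residuals vs fitted values")
plt.show()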

Sometimes in multiple regression you can have a high degree of correlation among the set of predictor variables. This condition is known as multicollinearity. Multicollinearity can have serious effects on the estimates of the parameters in the population regression model. In particular, estimates of the B’s will vary considerably from sample to sample.

One of the simplest approaches to assessing the amount of multicollinearity is to examine the correlation matrix. Often a correlation over .9 indicates a serious problem.

You can also determine if multicollinearity exists by calculating a VIF for each predictor variable. As a rule of thumb, if the VIF associated with any predictor variable is > 10, you can conclude that multicollinearity exists.

If multicollinearity is detected, one possibility is to drop one or more of the correlated predictor variables from the model.
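With only two predictors, each VIF equals 1/(1 − r2), where r is the correlation between the two predictors, so the VIF of 1.4 reported in the earlier output can be checked directly (with more predictors, statsmodels provides variance_inflation_factor):

r = 0.534                    # correlation between house size and lot size (from the output)
vif = 1 / (1 - r**2)
print(round(vif, 1))         # 1.4, far below the rule-of-thumb cutoff of 10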

Including a Categorical Predictor

The regression equation is

price = - 15258 + 78.0 size + 30569 NW

Predictor     Coef  SE Coef      T      P
Constant    -15258    11908  -1.28  0.203
size        77.985    6.209  12.56  0.000
NW           30569     7949   3.85  0.000

S = 34390.2   R-Sq = 63.5%   R-Sq(adj) = 62.8%

The p-value for the indicator variable is 0.000, indicating that region has a significant effect on selling price. For any fixed house size, the predicted selling price is $30,569 higher for houses in the NW quadrant.
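A sketch of how the indicator variable can be built and the model refit in Python; the file and column names are assumptions.

import pandas as pd
import statsmodels.api as sm

homes = pd.read_csv("gainesville_homes.csv")              # hypothetical file name
homes["NW"] = (homes["Quadrant"] == "NW").astype(int)     # 1 if NW quadrant, 0 otherwise

X = sm.add_constant(homes[["size", "NW"]])
fit = sm.OLS(homes["price"], X).fit()
print(fit.params)            # approximately -15258, 78.0, 30569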

Variable Selection

Often a researcher has many potential predictor variables for a multiple regression model. Below are some techniques for choosing a set of useful predictors from among these candidates.

Use one or more of the following measures:

Calculate Adjusted R2: Unlike R2, adjusted R2 does not automatically increase when a new variable is added to the model. The higher the adjusted R2, the better.

PRESS statistic = Σ (yi − ŷ(i))², where ŷ(i) is the predicted value of yi from the model fit with the i-th observation deleted. The smaller the value of the PRESS statistic, the better.

Use an Automatic Search procedure such as forward stepwise regression.
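For reference, adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), and PRESS can be computed from the ordinary residuals and leverages without refitting n times; a small sketch (the leverages could come from a fitted statsmodels model via get_influence().hat_matrix_diag):

import numpy as np

n, k, r_sq = 100, 2, 0.711
adj_r_sq = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
print(round(adj_r_sq, 3))    # about 0.705, matching R-Sq(adj) = 70.5%

def press(residuals, leverages):
    # PRESS = sum of squared deleted residuals, e_i / (1 - h_ii)
    residuals = np.asarray(residuals, dtype=float)
    leverages = np.asarray(leverages, dtype=float)
    return np.sum((residuals / (1 - leverages)) ** 2)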

***********************************************************

Minitab Output for Best Subsets regression

Best Subsets Regression: price versus Taxes, Bedrooms, ...

Response is price

The candidate predictors, shown as the last six columns, are (in order) Taxes, Bedrooms, Baths, NW, size, and lot; an X means the predictor is included in that model.

Vars   R-Sq   R-Sq(adj)   Mallows C-p        S   Taxes  Bedrooms  Baths  NW  size  lot

1 67.9 67.5 37.0 32114 X

1 58.0 57.5 77.9 36730 X

2 73.0 72.5 17.6 29571 X X

2 71.1 70.5 25.4 30588 X X

3 75.7 74.9 8.5 28216 X X X

3 74.9 74.1 11.8 28663 X X X

4 76.9 75.9 5.5 27647 X X X X

4 76.2 75.2 8.3 28052 X X X X

5 77.3 76.0 6.1 27581 X X X X X

5 77.1 75.9 6.8 27690 X X X X X

6 77.5 76.1 7.0 27571 X X X X X X

Minitab Output for Stepwise Regression

Stepwise Regression: price versus Taxes, Bedrooms, Baths, NW, size, lot

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is price on 6 predictors, with N = 100

Step            1        2        3        4
Constant    49965    21115     6305    -4854

Taxes        46.0     32.1     22.0     20.7
T-Value     14.38     7.36     4.24     4.04
P-Value     0.000    0.000    0.000    0.000

size                  34.1     34.5     38.4
T-Value               4.31     4.58     5.05
P-Value              0.000    0.000    0.000

lot                            1.59     1.40
T-Value                        3.25     2.86
P-Value                       0.002    0.005

NW                                     15050
T-Value                                 2.23
P-Value                                0.028

S           32114    29571    28216    27647
R-Sq        67.86    73.02    75.69    76.91
R-Sq(adj)   67.53    72.47    74.93    75.93
Mallows C-p  37.0     17.6      8.5      5.5
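Statsmodels does not ship a stepwise procedure, but a simplified forward-selection loop (alpha-to-enter = 0.15, with no removal step, so not exactly Minitab's algorithm) can be sketched as follows; the file and column names are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

homes = pd.read_csv("gainesville_homes.csv")    # hypothetical file name
candidates = ["Taxes", "Bedrooms", "Baths", "NW", "size", "lot"]
selected = []
alpha_enter = 0.15

while candidates:
    best_p, best_var = None, None
    for var in candidates:
        formula = "price ~ " + " + ".join(selected + [var])
        fit = smf.ols(formula, data=homes).fit()
        p = fit.pvalues[var]                    # p-value for the candidate's slope
        if best_p is None or p < best_p:
            best_p, best_var = p, var
    if best_p < alpha_enter:
        selected.append(best_var)
        candidates.remove(best_var)
    else:
        break

print("Variables entered:", selected)           # expected order: Taxes, size, lot, NW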

Sources used for this handout:

Statistics: The Art and Science of Learning from Data by Agresti and Franklin

Introduction to Statistics and Data Analysis by Peck, Olsen, and Devore

An Introduction to Statistical Methods and Data Analysis by Ott and Longnecker

Applied Linear Statistical Models by Neter, Kutner, Nachtsheim, and Wasserman

A Data-Based Approach to Statistics by Iman
