Columbia University in the City of New York



Part VII: Multiple Regression

Modern science, as training the mind to an exact and impartial analysis of facts, is an education specially fitted to promote citizenship.

— Karl Pearson (1857-1936)

The multiple regression model is structured in the same way as the simple regression model, except that it allows more than one independent variable. There are now k independent variables: X1, X2, X3, ..., Xk. We use these to help predict the value of another random variable Y, the dependent variable. For simple regression we had k = 1.

Example: Consider expanding our previous model (from Part VI) by using several additional characteristics of the house to predict the selling price. The data are:

|House Number |Actual Selling Price Y (in $1000) |House Size X1 (100 sq. ft.) |Age X2 (yrs.) |Lot Size X3 (1000 sq. ft.) |
|1 |89.5 |20.0 |5 |4.1 |
|2 |79.9 |14.8 |10 |6.8 |
|3 |83.1 |20.5 |8 |6.3 |
|4 |56.9 |12.5 |7 |5.1 |
|5 |66.6 |18.0 |8 |4.2 |
|6 |82.5 |14.3 |12 |8.6 |
|7 |126.3 |27.5 |1 |4.9 |
|8 |79.3 |16.5 |10 |6.2 |
|9 |119.9 |24.3 |2 |7.5 |
|10 |87.6 |20.2 |8 |5.1 |
|11 |112.6 |22.0 |7 |6.3 |
|12 |120.8 |19.0 |11 |12.9 |
|13 |78.5 |12.3 |16 |9.6 |
|14 |74.3 |14.0 |12 |5.7 |
|15 |74.8 |16.7 |13 |4.8 |
|Average |88.84 |18.17 |8.67 |6.54 |
|Std Dev |21.10 |4.38 |4.03 |2.35 |

Defining variables for each column:

1. Y = actual selling price of house (in $1000s)

2. X1 = house size (in 100s sq. feet)

3. X2 = age (in years)

4. X3 = lot size (in 1,000s sq. feet)

Let’s first build a new model using all three independent variables to predict the dependent variable (selling price). We will see that model design and selecting the “right” variables are crucial, but we’ll get to that later. For now we will use the underlying model:

Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε, (1)

where again ε is assumed to be random and normally distributed with a mean of 0 and unknown standard deviation σε. This is the population model, so these βi values (the ones that would provide the “best” model of type (1)) and σε are unknown. We will therefore estimate them from our sample data (the 15 houses).

After running a multiple regression, the output will be a hyperplane (a “line” in higher dimensions):

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3, (2)

which we use as an estimate of the (unknown) true population regression equation.
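For readers who want to reproduce the fit, here is a minimal sketch in Python (the statsmodels library and the variable names are our own choices; the data are the 15 houses above):

```python
# Minimal sketch: fitting model (1) by least squares with statsmodels.
import numpy as np
import statsmodels.api as sm

price = np.array([89.5, 79.9, 83.1, 56.9, 66.6, 82.5, 126.3, 79.3,
                  119.9, 87.6, 112.6, 120.8, 78.5, 74.3, 74.8])        # Y, $1000s
size = np.array([20.0, 14.8, 20.5, 12.5, 18.0, 14.3, 27.5, 16.5,
                 24.3, 20.2, 22.0, 19.0, 12.3, 14.0, 16.7])            # X1, 100 sq ft
age = np.array([5, 10, 8, 7, 8, 12, 1, 10, 2, 8, 7, 11, 16, 12, 13])   # X2, years
lot = np.array([4.1, 6.8, 6.3, 5.1, 4.2, 8.6, 4.9, 6.2,
                7.5, 5.1, 6.3, 12.9, 9.6, 5.7, 4.8])                   # X3, 1000 sq ft

X = sm.add_constant(np.column_stack([size, age, lot]))  # prepend intercept column
fit = sm.OLS(price, X).fit()
print(fit.params)     # b0, b1, b2, b3: about -16.06, 4.15, -0.24, 4.83
print(fit.rsquared)   # about 0.916
```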

Here is the regression output for model (1):

|Regression Statistics | |
|Multiple R |0.9571 |
|R Square |0.9161 |
|Adjusted R Square |0.8932 |
|Standard Error |6.8940 |
|Observations |15 |

|ANOVA |df |SS |MS |F |Significance F |
|Regression |3 |5707.4385 |1902.4795 |40.0294 |0.0000 |
|Residual |11 |522.7975 |47.5270 | | |
|Total |14 |6230.2360 | | | |

| |Coefficients |Standard Error |t Stat |P-value |
|Intercept |-16.0580 |19.0710 |-0.8420 |0.4177 |
|Size (100 sq ft) |4.1462 |0.7512 |5.5195 |0.0002 |
|Age (Yrs) |-0.2361 |0.8812 |-0.2679 |0.7937 |
|Lot Size (1000 sq ft) |4.8309 |0.9011 |5.3612 |0.0002 |

For these data the regression equation is:

Predicted Selling Price = -16.06 + 4.15 × (House Size) - 0.24 × (Age) + 4.83 × (Lot Size).

As in simple regression, we can do two things:

1. Use and interpret the equation.

2. Try to increase our understanding of the model and try to improve it.

Using the Equation

This again is the easy part. Suppose a 10-year-old house has 2,100 sq. ft. of living space and a 9,000 sq. ft. lot. You can immediately plug in the numbers (be careful with the units):

|Predicted Selling Price |= -16.06 + 4.15 × (21) - 0.24 × (10) + 4.83 × (9) |
| |= 112.16 |
| |= $112,160. |

We would estimate the selling price to be $112,160.
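The same plug-in step as a quick Python sketch (the function name is ours):

```python
# Sketch: plugging a house into the fitted equation. Units matter: size is in
# 100s of sq ft, lot size in 1000s of sq ft; the result is in $1000s.
def predicted_price(size_100sqft, age_yrs, lot_1000sqft):
    return -16.06 + 4.15 * size_100sqft - 0.24 * age_yrs + 4.83 * lot_1000sqft

print(predicted_price(21, 10, 9))   # about 112.16, i.e. $112,160
```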

As in simple regression, we must be careful when using the equation in a range where we have no data. For instance, if we attempt to predict the selling price of a 75-year-old house, we should realize that our model is based on houses that are between 1 and 16 years old, so we cannot necessarily guarantee a good prediction.

Interpreting the Coefficients

The value of b1 = 4.15, for example, can be interpreted as "each one-unit change in X1 (100 sq. ft.) tends to increase the predicted selling price by 4.15 units" (where one unit is $1,000). This means that each additional 100 square feet adds $4,150 to the predicted selling price. Therefore, each square foot adds an average of $41.50.

Note that these estimates apply only if all the other variables are held constant. Each additional year decreases our estimated selling price by 0.24 × ($1,000) = $240 when all other variables are held constant. Likewise, when all other variables are held constant, each square foot of lot adds about 4.83 × ($1,000)/1,000 = $4.83.

Analyzing a Multiple Regression

Method I: Estimating the Standard Error

Let’s tabulate how well our model would have predicted selling prices for the houses in our sample. (We know the actual selling prices; they were the dependent variable when we ran the regression in the first place.)

|House Number |Actual Selling Price Y (in $1000) |Predicted Selling Price Ŷ (in $1000) |Residuals Y - Ŷ |Squared Residuals (Y - Ŷ)² |
|1 |89.5 |85.4920 |4.0080 |16.0641 |
|2 |79.9 |75.7948 |4.1052 |16.8530 |
|3 |83.1 |97.4848 |-14.3848 |206.9220 |
|4 |56.9 |58.7543 |-1.8543 |3.4383 |
|5 |66.6 |76.9745 |-10.3745 |107.6293 |
|6 |82.5 |81.9451 |0.5549 |0.3079 |
|7 |126.3 |121.3975 |4.9025 |24.0348 |
|8 |79.3 |79.9448 |-0.6448 |0.4157 |
|9 |119.9 |120.4539 |-0.5539 |0.3068 |
|10 |87.6 |90.4439 |-2.8439 |8.0876 |
|11 |112.6 |103.9402 |8.6598 |74.9929 |
|12 |120.8 |122.4411 |-1.6411 |2.6931 |
|13 |78.5 |77.5393 |0.9607 |0.9230 |
|14 |74.3 |66.6917 |7.6083 |57.8866 |
|15 |74.8 |73.3025 |1.4975 |2.2424 |
|Sum | | | |SSE = 522.7975 |

As you can see, our SSE is now only about 522.8 instead of 2195.82 as in our first model (with house size only). To estimate σε we use sε, provided in the regression output under the heading "Standard Error." For this example, sε = 6.894. The number is calculated from SSE as follows:

sε = √(SSE / (n - k - 1)) = √(522.7975 / 11) = 6.894.

Note we now have n - k - 1 = 15 - 3 - 1 = 11 degrees of freedom. We are now able to predict the selling price with fairly good accuracy. That is, we estimate that the error term ε is normal with mean 0 and standard deviation 6.894 (as opposed to 12.997 in the regression with house size only).
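A small sketch of this calculation (variable names ours):

```python
# Sketch: the standard error of the estimate from SSE.
import math

SSE, n, k = 522.7975, 15, 3
s_eps = math.sqrt(SSE / (n - k - 1))   # sqrt(522.7975 / 11)
print(round(s_eps, 3))                 # 6.894
```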

For our 10 year-old house of 2,100 square feet with a lot of 9,000 square feet, we predict a selling price of $112,160. What is a 95% confidence interval on the actual selling price?

A 95% confidence interval on the actual selling price (a prediction interval) is:

Ŷ ± t · sε.

Looking in the t-table (with 11 degrees of freedom), we find that the correct number of standard deviations for this type of interval is 2.201, so a 95% confidence interval on the actual selling price is:

|112.16 ± (2.201)(6.894) |= 112.16 ± 15.17 |
| |= (96.99, 127.33) |
| |or ($96,990, $127,330). |

A 95% confidence interval on the average selling price of all houses with these characteristics (10 years old, 2,100 sq. ft. and 9,000 sq. ft. lot) would be constructed in the following way:

|112.16 ± (2.201)(6.894)/√15 |= 112.16 ± 3.92 |
| |= (108.24, 116.08) |
| |or ($108,240, $116,080). |
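Both intervals can be sketched in Python with scipy's t-distribution; the ±t·sε and ±t·sε/√n forms are the approximations used in these notes:

```python
# Sketch: 95% intervals for the new house, using the notes' approximations.
import math
from scipy.stats import t

y_hat, s_eps, n, k = 112.16, 6.894, 15, 3
t_crit = t.ppf(0.975, n - k - 1)             # 2.201 with 11 degrees of freedom

half = t_crit * s_eps                        # single house (prediction interval)
print(y_hat - half, y_hat + half)            # about (96.99, 127.33)

half_mean = t_crit * s_eps / math.sqrt(n)    # mean of all such houses
print(y_hat - half_mean, y_hat + half_mean)  # about (108.24, 116.08)
```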

Method II: Making Inferences about the Coefficients

Another method of assessing the accuracy of the model involves determining whether a particular variable is useful in the model. This was relatively easy in simple regression since we had only to test to see if the variable’s coefficient was sufficiently far from zero. In multiple regression it is not so easy. We will present some of the possible difficulties later on.

We can test each of the independent variables and see whether they are helping us to predict the selling price. For example, we can test each of the βi's against 0 at the 1% significance level. The first test is:

|H0 : β1 = 0 |vs. |HA : β1 ≠ 0 |

We see that at the 1% level we can reject the null hypothesis because the p-value of the first variable is less than 1% (it is 0.0002, or 0.02%). This tells us that β1 is almost certainly not 0, in which case house size is useful in predicting the selling price.

The second test is:

|H0 : β2 = 0 |vs. |HA : β2 ≠ 0 |

We see that at the 1% level we cannot reject the null hypothesis because the p-value (79.37%) is very large (much larger than 1%). This tells us that β2 could very well be 0, in which case the age of the house would not add anything to our model. Note that we need to be precise here: the fact that age does not seem to add anything to our model does not necessarily mean that age is of no value in predicting the selling price. All it means is that with house size and lot size already in the model, age does not add any new information.

The third test is:

|H0 : β3 = 0 |vs. |HA : β3 ≠ 0 |

We see that at the 1% level we reject the null hypothesis because the p-value (0.0002) is less than 1%. This tells us that β3 is almost certainly not 0, in which case the size of the lot is useful in predicting the selling price.

At this point, we see that only two of the independent variables seem to help us predict the selling price, so we might want to consider taking out the age variable. We’ll get to that later.
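These t-tests can be reproduced from the output's coefficient and standard-error columns; a sketch:

```python
# Sketch: recovering each t statistic (coefficient / standard error) and its
# two-sided p-value, with n - k - 1 = 11 degrees of freedom.
from scipy.stats import t

df = 11
for name, coef, se in [("Size", 4.1462, 0.7512),
                       ("Age", -0.2361, 0.8812),
                       ("Lot Size", 4.8309, 0.9011)]:
    t_stat = coef / se
    p_value = 2 * t.sf(abs(t_stat), df)
    print(name, round(t_stat, 4), round(p_value, 4))  # matches the output table
```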

An update to our earlier interval formula table, now generalized for multiple regression:

| |Confidence interval for Y (expected value of the dependent variable, given specific values of the independent variables) |Confidence interval for β (the expected effect on the dependent variable of a one-unit change in an independent variable) |
|Individual observation (prediction interval) |Ŷ ± t · sε |N/A |
|Population mean of all observations (confidence interval) |Ŷ ± t · sε/√n |bi ± t · s(bi) |

Here t is the critical value from the t-table with n - k - 1 degrees of freedom, and s(bi) is the standard error of coefficient bi from the regression output.

Method III: Measuring the Strength of the Linear Relationship

We see that R-square = R2 = 91.6%, which means our model is able to account for 91.6% of the variation in selling price using the three independent variables. Still, 8.4% (100% - 91.6%) remains unexplained. This unexplained variability might be due to other factors that affect selling price (location of the house, number of bedrooms, bathrooms, etc.).

Notice that the computer also provides something called the Adjusted R-square (89.3% in our example). This is the coefficient of determination adjusted for degrees of freedom: a version of R2 that takes into account the sample size and the number of independent variables. The rationale for this statistic is that, if the number of independent variables k is large relative to the sample size n, the unadjusted R2 may be unrealistically high. (Think about doing a regression with one independent variable but only 2 data points: your R2 would be 100% no matter how good the "model" was.) To avoid creating a false impression, the adjusted R2 is often used instead. It is calculated as follows:

Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1).

Note: One way to see the utility of the adjusted R2 is to consider what happens when you add a new independent variable to a regression. The R2 always increases! The adjusted R2 can decrease if the new variable does not explain enough.
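Checking the formula against the output, as a one-line sketch:

```python
# Sketch: adjusted R-square for the three-variable model.
R2, n, k = 0.9161, 15, 3
print(1 - (1 - R2) * (n - 1) / (n - k - 1))   # about 0.8932, as in the output
```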

The computer also lists the coefficient of multiple correlation as Multiple R. This is just the square root of R2. It is the coefficient of correlation between Yi (the actual prices) and Ŷi (the predicted prices).

Method IV: Using the ANOVA Table

The multiple regression output also makes it easy to perform a quick analysis of variance in the ANOVA table. The table looks like:

Analysis of Variance

| |df |SS |MS |F-ratio |
|Regression |k |SSR |SSR/k |(SSR/k)/(SSE/(n-k-1)) |
|Residual |n-k-1 |SSE |SSE/(n-k-1) | |
|Total |n-1 |SST | | |

(Note SSE + SSR = SST.)

The F-ratio is the test statistic for the following test:

|H0: β1 = β2 = β3 = … = βk = 0 |vs. |HA : at least one βi ≠ 0 |

i.e., it tests the validity of the "entire regression." This test statistic should be compared against an F-distribution (see Section 12.6 in Levine for details). Fortunately, we don't have to do this by hand, since the p-value is already provided in the output. In our case the p-value (called "Significance F" in Excel output) is very small indeed (0.0000033). At the 5% level, we can say that our model "passes the F-test."
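A sketch of the F-ratio calculation from the ANOVA sums of squares:

```python
# Sketch: the F-ratio and its p-value ("Significance F") for the 3-variable model.
from scipy.stats import f

SSR, SSE, n, k = 5707.4385, 522.7975, 15, 3
F = (SSR / k) / (SSE / (n - k - 1))
print(round(F, 4))               # about 40.03
print(f.sf(F, k, n - k - 1))     # about 3.3e-06
```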

Improving the Model

In the previous model, we found that the variable Age did not help us predict selling price (see the test above). Let’s take that variable out and rerun the regression. Our variables are:

1. Y = selling price of house (in $1000s)

2. X1 = house size (in 100s sq. feet)

3. X2 = lot size (in 1,000s sq. feet)

Our output is:

|Regression Statistics | |
|Multiple R |0.9568 |
|R Square |0.9155 |
|Adjusted R Square |0.9015 |
|Standard Error |6.6220 |
|Observations |15 |

|ANOVA |df |SS |MS |F |Significance F |
|Regression |2 |5704.0273 |2852.0137 |65.0391 |0.0000 |
|Residual |12 |526.2087 |43.8507 | | |
|Total |14 |6230.2360 | | | |

| |Coefficients |Standard Error |t Stat |P-value |
|Intercept |-20.3722 |9.8139 |-2.0758 |0.0601 |
|Size (100 sq ft) |4.3117 |0.4104 |10.5059 |0.0000 |
|Lot Size (1000 sq ft) |4.7177 |0.7646 |6.1705 |0.0000 |

Our regression equation for the predicted selling price is:

Predicted Selling Price = -20.37 + 4.31 × (House Size) + 4.72 × (Lot Size).

Let's evaluate this model and compare it to the previous one.

Method I: Estimating the Standard Error

An analysis of the residuals gives the following table:

|House Number |Actual Selling Price Y (in $1000) |Predicted Selling Price Ŷ (in $1000) |Residuals Y - Ŷ |Squared Residuals (Y - Ŷ)² |
|1 |89.5 |85.18 |4.32 |18.65 |
|2 |79.9 |75.51 |4.39 |19.24 |
|3 |83.1 |97.72 |-14.62 |213.77 |
|4 |56.9 |57.58 |-0.68 |0.46 |
|5 |66.6 |77.03 |-10.43 |108.87 |
|6 |82.5 |81.86 |0.65 |0.42 |
|7 |126.3 |121.28 |5.02 |25.17 |
|8 |79.3 |80.01 |-0.71 |0.50 |
|9 |119.9 |119.76 |0.14 |0.02 |
|10 |87.6 |90.76 |-3.16 |10.01 |
|11 |112.6 |104.19 |8.41 |70.80 |
|12 |120.8 |122.41 |-1.61 |2.59 |
|13 |78.5 |77.96 |0.55 |0.30 |
|14 |74.3 |66.87 |7.43 |55.15 |
|15 |74.8 |74.26 |0.54 |0.29 |
|Sum | | | |SSE = 526.21 |

Our SSE is now 526.21 (compared to 522.80 before) and our standard error of the estimate is sε = 6.622, or $6,622 (compared to $6,894). We actually reduced the standard error by taking out the age variable! Our house of 2,100 square feet with a lot of 9,000 square feet would now be predicted to sell for:

|Predicted Price |= -20.37 + 4.31 × (21) + 4.72 × (9) |
| |= 112.62 |
| |= $112,620, |

which is close to what we had before ($112,160).

A 95% prediction interval on the actual selling price would be constructed as usual (note that we must now use the t-distribution with n - k - 1 = 15 - 2 - 1 = 12 degrees of freedom, since n = 15 and k = 2):

112.62 ± t · sε.

We look in the t-table and find 2.179, so a 95% confidence interval on the actual selling price is:

|112.62 ± (2.179)(6.622) |= 112.62 ± 14.43 |
| |= (98.19, 127.05) |
| |or ($98,190, $127,050). |

A 95% confidence interval on the average price of all houses with these parameters would be constructed in the following way:

|112.62 ± (2.179)(6.622)/√15 |= 112.62 ± 3.73 |
| |= (108.89, 116.35) |
| |or ($108,890, $116,350). |

Method II: Making Inferences about the Coefficients

We can see immediately from the output that (most likely) both β1 and β2 are different from 0. To see this, just look at the p-values of X1 and X2; they are very small. At the 5% level we reject the null hypothesis that each is zero.

Method III: Measuring the Strength of the Linear Relationship

The R-square (R2) is 91.6%, which means our model is able to account for 91.6% of the variation in selling price using only the size of the house and the lot. So the age variable did not add anything here. The adjusted R-square is 90.1%, an improvement over the previous model.

Method IV: Using the ANOVA Table

The p-value for the F statistic is 0.00000036. Thus at the 5% level, our model passes the F-test.

Was it a good idea to get rid of the age variable? It seems it was, since our R2 barely changed (91.6% in both models) and our adjusted R2 increased. In addition, the standard error of the estimate went from 6.894 to 6.622 (a decrease). So, yes, it was a good idea: we made the model simpler (only two independent variables instead of three) and we have not lost anything in accuracy.

Model Design

Using Qualitative Data: Dummy Variables

If you look at the errors in the previous model, you will notice that it severely under-priced two of the houses: house #11 (predicted price $104,190, actual selling price $112,600) and house #14 (predicted price $66,870, actual selling price $74,300). We notice that these two houses both have swimming pools. The only other house with a pool is house #9; no other house has a pool. We suspect that a house with a swimming pool would be priced higher than one without (all other variables held constant).

We decide to include this variable in the regression. To do this, we add a dummy variable indicating whether or not there is a swimming pool. We do this by adding a column of data (in this case 1’s and 0’s). A house with a pool has a 1 in the “pool” column, a house with no pool has a 0.
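A sketch of building the dummy column (array names ours; houses 9, 11 and 14 are the three pool houses mentioned above):

```python
# Sketch: a 0/1 "pool" dummy for the 15 houses.
import numpy as np

pool = np.zeros(15)
pool[[8, 10, 13]] = 1   # houses 9, 11 and 14, with 0-based indexing
# Then refit, e.g.:
# X = sm.add_constant(np.column_stack([size, lot, pool]))
# fit = sm.OLS(price, X).fit()
```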

Let's run the regression with this new dummy variable (X3) along with house size (X1) and lot size (X2):

|Regression Statistics | |
|Multiple R |0.9661 |
|R Square |0.9334 |
|Adjusted R Square |0.9153 |
|Standard Error |6.1397 |
|Observations |15 |

|ANOVA |df |SS |MS |F |Significance F |
|Regression |3 |5815.5835 |1938.5278 |51.4257 |0.0000 |
|Residual |11 |414.6525 |37.6957 | | |
|Total |14 |6230.2360 | | | |

| |Coefficients |Standard Error |t Stat |P-value |
|Intercept |-18.7146 |9.1500 |-2.0453 |0.0655 |
|Size (100 sq ft) |4.1572 |0.3910 |10.6330 |0.0000 |
|Lot Size (1000 sq ft) |4.6794 |0.7092 |6.5978 |0.0000 |
|Pool? |7.0053 |4.0722 |1.7203 |0.1134 |

How do we interpret the coefficient of a dummy variable? Notice that if two houses have identical sizes and identical lot sizes but one has a pool and the other does not, the model will predict that the house with the pool is 7.01 × ($1,000) = $7,010 more expensive than the house with no pool. In this model, then, a pool seems to add about $7,010 to the selling price of a house. How confident are we in this value? We can build a 95% confidence interval on the coefficient β3 (using the t-table with 11 degrees of freedom):

|7.01 ± (2.201)(4.07) |= 7.01 ± 8.96 |
| |= (-1.95, 15.97) |
| |or (-$1,950, $15,970). |

These data seem to imply that a pool could add as much as $15,970 to the value of the house, or could decrease it by as much as $1,950. A hypothesis test on β3 (against 0) at the 5% level tells us that we cannot reject the null hypothesis that β3 = 0. You get this by looking at the p-value (11.34%).
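The same interval, sketched in Python from the output row:

```python
# Sketch: 95% confidence interval on the pool coefficient.
from scipy.stats import t

b3, se3 = 7.0053, 4.0722
t_crit = t.ppf(0.975, 11)                      # 2.201
print(b3 - t_crit * se3, b3 + t_crit * se3)    # about (-1.96, 15.97)
```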

At the 5% level it is not clear that the pool variable is useful. On this basis we might decide to leave the pool variable out of our model. On the other hand, since it has increased our adjusted R2, we might decide to keep it in. This is a situation in which a manager needs to make a decision as to which model "works" best (i.e., gives the best predictions).

Non-Linear Models

Another option for constructing a regression model is to include non-linear relationships between variables. For example, suppose that we believed that house size was the only important predictor of selling price, but there were diminishing returns. That is, the added value of 1 square foot (or 100 square feet) depended on the house size. If a house is 1000 square feet, then 100 extra square feet represents a 10% increase. If a house has 4,000 square feet, then an extra 100 square feet is only a 2.5% increase. Therefore we would expect that the 100 square feet would add less to the predicted selling price as the house size went up. This is a non-linear relationship. (Note: a linear relationship assumes that each 100 square feet adds the same amount to the selling price no matter what the house size!) How can we create a new model that can use this idea? Consider the following new data:

|House Number |Selling Price Y |House Size X1 |
|1 |100.3 |18.3 |
|2 |72.4 |12.5 |
|3 |139.5 |28.2 |
|4 |150.2 |32.2 |
|5 |150.6 |36.6 |
|6 |87.4 |14.3 |
|7 |155.1 |37.7 |
|8 |156.5 |43.2 |
|9 |99.8 |17.6 |
|10 |112.3 |22.2 |
|Averages |122.4 |26.3 |

Let’s first do a regular linear model (with house size as the only independent variable). We get the following regression output:

|Multiple R |0.967 |
|R Square |0.935 |
|Adjusted R Square |0.927 |
|Standard Error |8.497 |
|Observations |10 |

Analysis of Variance

| |df |SS |MS |F |Significance F |
|Regression |1 |8341.23 |8341.23 |115.54 |4.9E-06 |
|Residual |8 |577.54 |72.19 | | |
|Total |9 |8918.77 | | | |

| |Coefficients |Standard Error |t Stat |P-value |
|Intercept |48.37 |7.39 |6.54 |0.0002 |
|X1 |2.82 |0.26 |10.75 |0.0000 |

This model is good, but we might take a look at a scatter plot to see how the relationship looks. The graph suggests that some kind of curvilinear relationship exists between house size and selling price.

[Scatter plot of selling price against house size, showing a curvilinear pattern]

We decide to try a model in which selling price has a non-linear (diminishing returns) relationship with house size. This can be done by using log(X1) as the house size variable, since the logarithm exhibits decreasing returns to scale. Here we have used the natural logarithm:

|House Number |Selling Price Y |Log of House Size ln(X1) |
|1 |100.3 |2.907 |
|2 |72.4 |2.526 |
|3 |139.5 |3.339 |
|4 |150.2 |3.472 |
|5 |150.6 |3.600 |
|6 |87.4 |2.660 |
|7 |155.1 |3.630 |
|8 |156.5 |3.766 |
|9 |99.8 |2.868 |
|10 |112.3 |3.100 |
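In Python, this transform-and-refit step might look like the following sketch (library choice and names ours), which should reproduce the output below:

```python
# Sketch: regress selling price on ln(house size).
import numpy as np
import statsmodels.api as sm

price = np.array([100.3, 72.4, 139.5, 150.2, 150.6, 87.4, 155.1, 156.5, 99.8, 112.3])
size = np.array([18.3, 12.5, 28.2, 32.2, 36.6, 14.3, 37.7, 43.2, 17.6, 22.2])

X = sm.add_constant(np.log(size))   # ln(X1) with an intercept column
fit = sm.OLS(price, X).fit()
print(fit.params)                   # about (-105.45, 71.50)
```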

We now do another regression using the log of house size column instead of the house size column. The output is now:

|Regression Statistics | |
|Multiple R |0.9897 |
|R Square |0.9794 |
|Adjusted R Square |0.9768 |
|Standard Error |4.7903 |
|Observations |10 |

|ANOVA |df |SS |MS |F |Significance F |
|Regression |1 |8735.1900 |8735.1900 |380.6619 |0.0000 |
|Residual |8 |183.5790 |22.9474 | | |
|Total |9 |8918.7690 | | | |

| |Coefficients |Standard Error |t Stat |P-value |
|Intercept |-105.4501 |11.7766 |-8.9542 |0.0000 |
|ln(House Size) |71.5019 |3.6648 |19.5106 |0.0000 |

As you can see by comparing the two models (the linear and the non-linear one), the non-linear one is better (higher R2, higher adjusted R2, and lower standard error).

Our regression equation is:

Predicted Selling Price = -105.45 + 71.502 × ln(House Size).

[Scatter plot of the data with the fitted logarithmic curve]

Violations of Regression Assumptions

Since it is so easy to run a regression using Excel, this tool is often used without checking to see if the assumed conditions are really true. For example, sometimes the assumptions concerning the random error may be violated. Here are a few examples:

Non-normality: One of our central assumptions was that the random error term is normally distributed. We can do a quick unscientific check to see if this is the case by making a histogram of the residuals. If the histogram is not really bell-shaped then non-normality might be present (if n is small it may be difficult to tell if the histogram is not bell-shaped). In this case, the regression results may be less accurate. Note however that the tests applied in regression analysis are robust, which means that only when the error variable is quite non-normal are the test results called into question.

Heteroscedasticity: We assumed that the standard deviation of the error term, σε, is a fixed number. This assumes that the errors made are roughly of the same magnitude no matter what kind of prediction is being made (e.g. a prediction on a small house is assumed to have the same magnitude error as a prediction on a large house). If this condition is violated, we have something called heteroscedasticity. One method of diagnosing heteroscedasticity is to plot the residuals against the predicted values of Y. We then look for a change in the spread or dispersion of the plotted points. If the spread seems larger on the right side of this graph than on the left side (or smaller), then heteroscedasticity may be present. One option for correcting this is to replace Y with a non-linear function of Y such as log(Y).
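A sketch of this residuals-versus-predicted diagnostic plot, assuming `fit` is a fitted statsmodels result as in the earlier sketches:

```python
# Sketch: plot residuals against predicted values to look for a changing spread.
import matplotlib.pyplot as plt

plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Y")
plt.ylabel("Residual")
plt.show()   # a fan shape (spread growing to one side) suggests heteroscedasticity
```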

Collinearity: Collinearity is a condition that exists when the independent variables are correlated with one another. The adverse effect of collinearity is that the standard deviations of the coefficients may be overestimated. If so, when the coefficients are tested, the t-stat is smaller than it should be, and some independent variables appear not to be linearly related to the dependent variable when in fact they are. A simple example of this is if we decided to add a new variable to our linear model above. Say we added the variable “house size in square meters”. What happens?

Stepwise Regression

In preparing a multiple regression model, one usually prepares a long list of potential independent variables. The task of determining which variables belong in the model can be challenging. We could, of course, include all potential variables and use the t-test on the βi's to determine which of them are linearly related to the dependent variable. Problems such as collinearity, however, might make the t-tests misleading. Also, performing a large number of t-tests increases the likelihood that one or more Type I errors are made (here, a Type I error occurs when we conclude that a variable is useful when in fact it is not). Stepwise regression can help overcome some of these problems. It is a popular algorithm for constructing a regression model.

Consider starting with a dependent variable and m potential independent variables.

Step 1. Include as the first variable the independent variable that is most correlated (positively or negatively) with the dependent variable. Call this variable X1.

Step 2. Perform m - 1 regressions with X1 and each of the m - 1 other variables. Choose as X2 the variable with the largest t-statistic (in absolute value). In the regression with X1 and X2, check whether X1 is still significant (p-value less than the significance level). If it is not, take it out and start over with X2 as the only variable.

Step 3. Continue in this manner until no more significant variables can be added to the model.
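A sketch of this forward-stepwise idea (our own helper, not a library routine; X is assumed to be an n × m NumPy array, and with a single candidate variable, picking the largest |t| is equivalent to picking the highest correlation in Step 1):

```python
# Sketch: greedy forward selection with a backward check, as described above.
import statsmodels.api as sm

def forward_stepwise(y, X, alpha=0.05):
    """Return the column indices of X chosen by the stepwise procedure."""
    included, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Find the candidate whose coefficient has the largest |t| when added.
        best, best_t = None, -1.0
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, included + [j]])).fit()
            if abs(fit.tvalues[-1]) > best_t:
                best, best_t = j, abs(fit.tvalues[-1])
        fit = sm.OLS(y, sm.add_constant(X[:, included + [best]])).fit()
        if fit.pvalues[-1] > alpha:        # best candidate not significant: stop
            break
        included.append(best)
        remaining.remove(best)
        # Re-check the previously included variables; drop any that lost
        # significance, returning them to the candidate pool.
        fit = sm.OLS(y, sm.add_constant(X[:, included])).fit()
        for pos in range(len(included) - 2, -1, -1):
            if fit.pvalues[pos + 1] > alpha:   # +1 skips the intercept
                remaining.append(included.pop(pos))
                fit = sm.OLS(y, sm.add_constant(X[:, included])).fit()
    return included
```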
