SECTION 11



Lecture Notes For Chapter 11, Multiple Regression

Multiple Regression

In multiple linear regression more than one explanatory variable is used to explain or predict a single response variable. Many of the ideas of simple linear regression (one explanatory variable, one response variable) carry over to multiple linear regression.

Multiple Linear Regression Model

The statistical model for multiple linear regression is

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$, for i = 1, 2, ..., n

p is the number of explanatory variables in the model.

The deviations/errors, $\epsilon_i$, are independent and normally distributed with mean 0 and standard deviation σ.

The parameters of the model are $\beta_0$, $\beta_1$, $\beta_2$, ..., $\beta_p$, and σ.

σ is assumed constant for all values of all explanatory (predictor) variables.

PROCEDURE:

1. Look at the variables individually. Graph (stem plot, histogram) each variable, determine means, standard deviations, minimums, and maximums. Are there any outliers?

2. Look at the relationships between the variables using correlations and scatterplots. Make a scatterplot and compute the correlation for each pair of variables.

To compute the correlation for each pair, enter all the variables (the y and all the x’s) into SPSS, then select Analyze>>Correlate>>Bivariate. The higher the correlation between two variables, the lower the Sig. (2-tailed) value. This will help you determine which x’s have the strongest relationships with y.
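For readers working outside SPSS, here is a minimal sketch of the same pairwise-correlation check in Python with pandas; the data frame df, its column names, and its values are made up purely for illustration.

import pandas as pd

# Made-up illustrative data: response y and three potential predictors.
df = pd.DataFrame({
    "y":  [12.3, 20.9, 39.0, 47.9, 5.6, 25.9],
    "x1": [4.54, 5.16, 5.37, 5.76, 4.66, 5.70],
    "x2": [3.14, 5.04, 5.44, 7.50, 3.81, 7.60],
    "x3": [0.86, 1.53, 1.57, 1.81, 0.99, 1.09],
})

# Pearson correlation between every pair of variables
# (the same information as Analyze>>Correlate>>Bivariate).
print(df.corr())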

3. Do a regression to define the relationship of the variables. Start with the full model, all potential explanatory variables and the response variable. The regression results will indicate/confirm which relationships are strong.

For multiple linear regression a least-squares procedure is used to estimate the parameters $\beta_0$, $\beta_1$, $\beta_2$, ..., $\beta_p$, and σ. The sample has n observations. Perform the multiple regression procedure on the data from the n observations.

$b_0$, $b_1$, $b_2$, ..., $b_p$ are statistics which estimate the population parameters

$\beta_0$, $\beta_1$, $\beta_2$, ..., $\beta_p$

Another notation is $b_j$, the jth estimator of $\beta_j$, the jth population parameter, where j = 0, 1, 2, ..., p, and p is the number of explanatory variables in the model.

For the ith observation, the predicted response is:

$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip}$

The ith residual, the difference between the observed and predicted response is:

$e_i$ = observed response – predicted response = $y_i - \hat{y}_i$

The method of least squares minimizes:

$\sum_{i=1}^{n} e_i^2$, or

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The parameter $\sigma^2$ measures the variability of the response about the regression equation. It is estimated by:

$s^2 = \dfrac{\sum e_i^2}{n-p-1}$

The quantity n-p-1 is the degrees of freedom associated with $s^2$.
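As a rough numerical illustration of these formulas (not the SPSS procedure), the sketch below fits the least-squares coefficients with NumPy and computes the residuals and s; the small arrays X and y are made-up values used only for illustration.

import numpy as np

# Made-up illustrative data: n = 6 observations, p = 2 explanatory variables.
X = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 2.5], [5.0, 5.0], [6.0, 4.0]])
y = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])
n, p = X.shape

# Add a column of 1s for the intercept, then solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(X1, y, rcond=None)[0]   # b0, b1, ..., bp

y_hat = X1 @ b                              # predicted responses
e = y - y_hat                               # residuals e_i = y_i - y_hat_i
s2 = (e ** 2).sum() / (n - p - 1)           # estimate of sigma^2, with n - p - 1 df
s = np.sqrt(s2)
print(b, s)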

Confidence Intervals and Significance Tests for $\beta_j$, the regression coefficients:

A level C confidence interval for $\beta_j$ is calculated by SPSS using

$b_j \pm t^* SE_{b_j}$, with df = n-p-1 for looking up the value of t*,

where $SE_{b_j}$ is the standard error of $b_j$ and t* is the value for the desired confidence level. $b_j$ is the jth estimator; j varies from 0 to p.

It is important to see whether the confidence interval includes the value 0, because if the regression coefficient could be zero, that predictor variable may not belong in the prediction equation.
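A minimal sketch of computing such an interval by hand from $b_j$, its standard error, and t*; the values bj, se_bj, n, and p below are made up for illustration.

from scipy.stats import t

# Made-up values for illustration: a coefficient, its standard error, n, and p.
bj, se_bj = 3.91, 1.14
n, p = 30, 3
C = 0.95                                        # desired confidence level

t_star = t.ppf(1 - (1 - C) / 2, df=n - p - 1)   # critical value t*
lower, upper = bj - t_star * se_bj, bj + t_star * se_bj
print(lower, upper)   # if this interval contains 0, the predictor may not belong in the model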

To test the hypothesis $H_0: \beta_j = 0$, compute the t statistic

$t = \dfrac{b_j}{SE_{b_j}}$ (SPSS calculates this)

In terms of a random variable T having the t(n-p-1) distribution, the P-value for a test of $H_0: \beta_j = 0$ against

$H_a: \beta_j > 0$ is $P(T \ge t)$, one-sided right (not calculated by SPSS)

$H_a: \beta_j < 0$ is $P(T \le t)$, one-sided left (not calculated by SPSS)

$H_a: \beta_j \ne 0$ is $2P(T \ge |t|)$, two-sided (calculated by SPSS)
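The three P-values can be obtained from the t statistic as sketched below, again with made-up values used only for illustration.

from scipy.stats import t

# Same made-up values as the confidence-interval sketch above.
bj, se_bj = 3.91, 1.14
n, p = 30, 3
df = n - p - 1

t_stat = bj / se_bj
p_right = t.sf(t_stat, df)            # one-sided right: P(T >= t)
p_left = t.cdf(t_stat, df)            # one-sided left:  P(T <= t)
p_two = 2 * t.sf(abs(t_stat), df)     # two-sided: 2 P(T >= |t|), the value SPSS reports
print(t_stat, p_right, p_left, p_two)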

ANOVA table for multiple regression:

|Source |Degrees of Freedom |Sum of squares |Mean square |F |

|Model |p |SSM = $\sum(\hat{y}_i - \bar{y})^2$ |MSM = SSM/DFM |MSM/MSE |

|Error |n-p-1 |SSE = $\sum(y_i - \hat{y}_i)^2$ |MSE = SSE/DFE | |

|Total |n-1 |SST = $\sum(y_i - \bar{y})^2$ |SST/DFT | |

Analysis of Variance F Test

In the multiple regression model, the hypothesis

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

vs. $H_a$: not all $\beta_j = 0$, i.e., at least one $\beta_j$ is nonzero

is tested by the analysis of variance F statistic

F = Mean Square Model / Mean Square Error = MSM/MSE

The P-value is the probability, assuming $H_0$ is true, that the F statistic is equal to or greater than the value of F obtained.

The Squared Multiple Correlation

The statistic

$R^2 = \dfrac{SSM}{SST}$

is the proportion of variation of the response variable y that is explained by the explanatory variables $x_1$, $x_2$, ..., $x_p$.

Even if the P-value is small, you need to look at $R^2$. If $R^2$ is small, it means the model you are using does not do a good job of explaining the variation in y. We want $R^2$ to be as close to 1.0 as possible.
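As a numerical illustration of the ANOVA F test and $R^2$, the sketch below computes SSM, SSE, SST, F, its P-value, and $R^2$ for the same made-up data used in the earlier least-squares sketch.

import numpy as np
from scipy.stats import f

# Made-up illustrative data: n = 6 observations, p = 2 explanatory variables.
X = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 2.5], [5.0, 5.0], [6.0, 4.0]])
y = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])
n, p = X.shape

X1 = np.column_stack([np.ones(n), X])       # design matrix with intercept column
b = np.linalg.lstsq(X1, y, rcond=None)[0]   # least-squares coefficients
y_hat = X1 @ b

SSM = ((y_hat - y.mean()) ** 2).sum()       # model sum of squares, df = p
SSE = ((y - y_hat) ** 2).sum()              # error sum of squares, df = n - p - 1
SST = ((y - y.mean()) ** 2).sum()           # total sum of squares, df = n - 1

F = (SSM / p) / (SSE / (n - p - 1))         # MSM / MSE
p_value = f.sf(F, p, n - p - 1)             # P(F >= observed value) under H0
R2 = SSM / SST                              # squared multiple correlation
print(F, p_value, R2)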

4. Interpret the results. Check the regression model assumptions. Assumptions:

• Linearity. The regression equation must be of the right form to describe the true underlying relationship among the variables. (To check for linearity, look at the scatterplots of y against each x.)

• Constant variance. The variability of the residuals must be the same for all values of the x variables. (To check for constant variance make a scatterplot of residuals against predicted values of y)

• Independence. Each explanatory variable should be independent of all the other explanatory variables. (To check this, plot residuals vs. each of the explanatory variables.) Also, determine whether any of the explanatory variables has a strong linear relationship with another explanatory variable by looking at the scatterplots of each x vs. the other x’s, the correlations, and the $R^2$ values. A strong linear relationship between two explanatory variables suggests that we may be able to drop one of them from the model.

• Normality. The distribution of the residuals must be Normal for the t-tests on the coefficients to follow the t-distribution exactly. (To check the normality assumption, make a normal probability plot of residuals.)

5. Refine the model, if necessary. You are only interested in keeping the predictor variables which have strong relationships with the dependent variable. So, try deleting the predictor variable with the largest P-value (the weakest relationship) and re-run the regression. You may have to repeat this several times, each time deleting the predictor variable with the largest P-value (weakest relationship with y); a sketch of this backward-elimination loop appears after the list below.

To determine which is the best model, look at:

• $R^2$. It will always drop when a predictor variable is removed, but we do not want it to drop very much.

• P-values for the coefficients (should be as small as possible).

Any x variables left in the equation should have a significant P-value from the t-test of their coefficient, and their confidence intervals should not contain 0.

• s, the estimated standard deviation, may increase or decrease when a predictor variable is removed. We would like s to be as small as possible.

• the F-test statistic from ANOVA should increase when a predictor variable is removed. We would like F to be as high as possible.

• P-value from the ANOVA F-test should decrease when F increases. We would like the P value to be as small as possible.
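Here is a hedged sketch of this drop-the-largest-P-value loop using Python’s statsmodels package (not part of the SPSS procedure described above); the data frame df, its column names, and the cutoff alpha = 0.05 are assumptions for illustration.

import statsmodels.api as sm

def backward_eliminate(df, response, alpha=0.05):
    """Repeatedly drop the predictor with the largest t-test P-value above alpha."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df[response], X).fit()
        pvals = fit.pvalues.drop("const")   # P-values for the slopes only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:           # every remaining predictor is significant
            return fit
        predictors.remove(worst)            # drop the weakest predictor and refit
    return None

# Usage (assumed column names): fit = backward_eliminate(df, response="taste")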

Example:

1. As cheddar cheese matures, a variety of chemical processes take place. The taste of mature cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. Data for one type of cheese-manufacturing process appear in the table below. The variable “Case” is used to number the observations from 1 to 30. “Taste” is the response variable of interest. The taste scores were obtained by combining the scores from several tasters. Three chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide, (natural) log transformations were taken. Thus the explanatory variables are the transformed concentrations of acetic acid (“Acetic”) and hydrogen sulfide (“H2S”) and the untransformed concentration of lactic acid (“Lactic”). These data are based on experiments performed by G. T. Lloyd and E. H. Ramshaw of the CSIRO Division of Food Research, Victoria, Australia.

|Case |Taste |Acetic |H2S |Lactic |

|1 |12.3 |4.543 |3.135 |0.86 |

|2 |20.9 |5.159 |5.043 |1.53 |

|3 |39 |5.366 |5.438 |1.57 |

|4 |47.9 |5.759 |7.496 |1.81 |

|5 |5.6 |4.663 |3.807 |0.99 |

|6 |25.9 |5.697 |7.601 |1.09 |

|7 |37.3 |5.892 |8.726 |1.29 |

|8 |21.9 |6.078 |7.966 |1.78 |

|9 |18.1 |4.898 |3.85 |1.29 |

|10 |21 |5.242 |4.174 |1.58 |

|11 |34.9 |5.74 |6.142 |1.68 |

|12 |57.2 |6.446 |7.908 |1.9 |

|13 |0.7 |4.477 |2.996 |1.06 |

|14 |25.9 |5.236 |4.942 |1.3 |

|15 |54.9 |6.151 |6.752 |1.52 |

|16 |40.9 |6.365 |9.588 |1.74 |

|17 |15.9 |4.787 |3.912 |1.16 |

|18 |6.4 |5.412 |4.7 |1.49 |

|19 |18 |5.247 |6.174 |1.63 |

|20 |38.9 |5.438 |9.064 |1.99 |

|21 |14 |4.564 |4.949 |1.15 |

|22 |15.2 |5.298 |5.22 |1.33 |

|23 |32 |5.455 |9.242 |1.44 |

|24 |56.7 |5.855 |10.199 |2.01 |

|25 |16.8 |5.366 |3.664 |1.31 |

|26 |11.6 |6.043 |3.219 |1.46 |

|27 |26.5 |6.458 |6.962 |1.72 |

|28 |0.7 |5.328 |3.912 |1.25 |

|29 |13.4 |5.802 |6.685 |1.08 |

|30 |5.5 |6.176 |4.787 |1.25 |

a) Look at each variable individually using graphs and descriptive statistics. Any outliers?

Enter the data in SPSS, then select Analyze>>Descriptive Statistics>>Explore. Select the Plots and Statistics options.
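The same exploration can be reproduced outside SPSS; below is a minimal pandas sketch, assuming the table above has been saved to a file named cheese.csv (a hypothetical file name) with columns Case, Taste, Acetic, H2S, and Lactic.

import pandas as pd

# Hypothetical file holding the 30 observations listed above.
cheese = pd.read_csv("cheese.csv")

# Means, standard deviations, minimums, maximums, and quartiles for each variable.
print(cheese[["Taste", "Acetic", "H2S", "Lactic"]].describe())

# Quick outlier screen: one histogram per variable (requires matplotlib).
cheese[["Taste", "Acetic", "H2S", "Lactic"]].hist()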

[Histograms and box plots for Taste, Acetic, H2S, and Lactic — figures not reproduced here]

b) Look at a scatterplot and a residual plot of taste versus acetic, taste versus H2S and taste versus lactic. Do you see any problems?

[Scatterplots and residual plots of Taste vs. Acetic, Taste vs. H2S, and Taste vs. Lactic — figures not reproduced here]

c) Which explanatory variables (x’s, Acetic, H2S, Lactic) are most strongly correlated to the response variable (y, taste)?

Select Analyze>>Correlate>>Bivariate.

Correlations

| | |taste |Acetic |H2S |Lactic |

|taste |Pearson Correlation|1 |.550(**) |.756(**) |.704(**) |

| |Sig. (2-tailed) |. |.002 |.000 |.000 |

| |N |30 |30 |30 |30 |

|Acetic |Pearson Correlation|.550(**) |1 |.618(**) |.604(**) |

| |Sig. (2-tailed) |.002 |. |.000 |.000 |

| |N |30 |30 |30 |30 |

|H2S |Pearson Correlation|.756(**) |.618(**) |1 |.645(**) |

| |Sig. (2-tailed) |.000 |.000 |. |.000 |

| |N |30 |30 |30 |30 |

|Lactic |Pearson Correlation|.704(**) |.604(**) |.645(**) |1 |

| |Sig. (2-tailed) |.000 |.000 |.000 |. |

| |N |30 |30 |30 |30 |

** Correlation is significant at the 0.01 level (2-tailed).

d) Find the LSR equation for predicting taste using the three variables Acetic, H2S, and Lactic, FULL MODEL.

Coefficients(a) — FULL MODEL [coefficient values not reproduced here]

Model Summary — FULL MODEL

|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |

|1 |.807(a) |.652 |.612 |10.13071 |

a Predictors: (Constant), Acetic, H2S, Lactic

b Dependent Variable: Taste

65.2% of the variation in taste is explained by the LSR line.

e) What is the value of s, the estimator for standard deviation?

s = 10.13071
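For comparison, here is a hedged sketch of reproducing the full-model fit with statsmodels, again assuming the hypothetical cheese.csv file described above.

import pandas as pd
import statsmodels.api as sm

cheese = pd.read_csv("cheese.csv")     # hypothetical file with the 30 observations above

X = sm.add_constant(cheese[["Acetic", "H2S", "Lactic"]])
full = sm.OLS(cheese["Taste"], X).fit()

print(full.summary())                  # coefficients, t tests, R squared, ANOVA F, etc.
print(full.rsquared)                   # about .652 for these data
print(full.mse_resid ** 0.5)           # s, the estimate of sigma (about 10.13)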

Following procedure step 5, Acetic (the predictor with the weakest relationship to Taste) was dropped and the regression was re-run with H2S and Lactic only.

Model Summary — REDUCED MODEL

|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |

|1 |.807(a) |.652 |.626 |9.94236 |

a Predictors: (Constant), H2S, Lactic

b Dependent Variable: Taste

R square is .652, or 65.2% of the variation in taste is explained by this equation.

Comparison of the FULL MODEL and the REDUCED MODEL:

|Statistic |Full Model |Reduced Model |Effect of Reducing |

|R squared |.651775 |.651702 |Lower (worse) 0.01% |

|s |10.13071 |9.942 |Lower (better) 1.86% |

|F |16.221 |25.260 |Higher (better) 55.7% |

|P Value |3.81 E-6 |6.55 E-7 |Lower (better) 82.8% |

The value of R squared became slightly lower, but that will always happen when a predictor variable is dropped.

The values of s, F and P value are all better for the Reduced Model compared with the Full Model. R squared did not drop much, so the reduced model is better than the full model.

f) Is the assumption of normality met? Yes, the normal Q-Q plot of the residuals looks acceptable. [Q-Q plot not reproduced here]

g) The F statistic reported for the second model is F = 25.2600. State the null and alternative hypotheses for this statistic. Give the degrees of freedom and the P-value for this test. What do you conclude?

$H_0: \beta_1 = \beta_2 = 0$ (the coefficients of H2S and Lactic are both zero)

$H_a$: not all β = 0, i.e., at least one β is not zero

Degrees of freedom: 2 for the model (numerator) and 27 for the error (denominator)

The P value is very small (6.5 E-7) so we can reject the null hypothesis and conclude at least one of the regression coefficients is not zero.

The P-values for the coefficients are .002 for H2S and .019 for Lactic, so we would regard both of these as nonzero. Unless alpha were .01, we would stop here and not reduce the model further.

If we wanted to try to reduce the model further, we would try dropping the variable Lactic (the larger of the two P-values). Below is the output for the regression with only H2S in the model:

|Variables Entered/Removed(b) |

|Model |Variables Entered |Variables Removed |Method |

|1 |H2S(a) |Acetic, Lactic |Enter |

a. All requested variables entered.

b. Dependent Variable: Taste

|Model Summary |

|Model |R |R Square |Adjusted R Square |Std. Error of the Estimate |

|1 |.756(a) |.571 |.556 |10.833382 |

a. Predictors: (Constant), H2S

R squared was .652 in the model with H2S and Lactic, so it has dropped 12.4%.

s was 9.94 in the model with H2S and Lactic, so it has increased 8.9%.

|ANOVA(b) |

|Model |F |

|1 |37.293 |

b. Dependent Variable: Taste

F was 25.260 in the model with H2S and Lactic, so it has increased 47.64%.

P value increased from 6.55 E-7 to 1.37 E-6, an increase of about 110%.

[Coefficients(a) table for the H2S-only model — values not reproduced here]

Since R squared, s, and P value all went the wrong direction, this model is not as good as the model which contained H2S and Lactic. The first reduced model, with H2S and Lactic as predictor variables, was the optimum model.

Only F went the right direction. This is summarized in the table below.

|Statistic |Full Model |Reduced Model |Effect of Reducing |Third Model |Effect of Reducing |

|R squared |.651775 |.651702 |Lower (worse) 0.01% |.571116 |Down (worse) 12.4% |

|s |10.13071 |9.942 |Lower (better) 1.86% |10.8334 |Up (worse) 8.9% |

|F |16.221 |25.260 |Higher (better) 55.7% |37.293 |Up (better) 47.64% |

|P Value |3.81 E-6 |6.55 E-7 |Lower (better) 82.8% |1.37 E-6 |Up (worse) 110% |
