THINGS TO KNOW ABOUT REGRESSION



In a bivariate regression, the regression coefficient (b) measures the change in the predicted Y for a one unit increase in X. In multiple regression, bi (the subscript i refers to the i-th independent variable) measures the change in the predicted Y for a one unit increase in Xi, holding all other independent variables constant. In writing the results of a regression analysis, one could say

(a) the net effect of Xi on Y is bi;

(b) when the other independent variables are held constant, Y changes by bi for each unit change in Xi;

(c) the effect of Xi on Y is bi.
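
As a concrete sketch (Python with statsmodels; the data and the variable names y, x1, and x2 are invented for illustration), the fitted coefficients below would be read in exactly the ways listed above: each bi is the change in predicted Y for a one unit increase in Xi, with the other predictor held constant.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data; y is the outcome, x1 and x2 are the predictors.
    data = pd.DataFrame({
        "y":  [10, 12, 15, 11, 18, 20, 22, 19],
        "x1": [1, 2, 3, 2, 4, 5, 6, 5],
        "x2": [3, 4, 2, 5, 6, 4, 7, 6],
    })

    X = sm.add_constant(data[["x1", "x2"]])   # adds the intercept (b0) column
    model = sm.OLS(data["y"], X).fit()

    # model.params holds b0 (const), b1 (x1), and b2 (x2); b1 is the change in
    # predicted y for a one unit increase in x1, holding x2 constant.
    print(model.params)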

The intercept has a substantive interpretation only when the value of zero is included in the range of the Xi values. That is, don’t tell a football team that they can win b0 games if they don’t pass, don’t rush, and their opponents don’t rush!

The total sum of squares may be decomposed into two components -- a portion explained by the independent variables and a portion unexplained. That is:

SStotal = SSreg + SSres

The ratio, SSreg/SStotal, is the proportion of the total variation in Y that is explained by the set of independent variables. This value is called the coefficient of determination, commonly symbolized as R2 and referred to as the R-square.
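
A small numerical sketch of this decomposition, assuming a simple bivariate fit and made-up data:

    import numpy as np

    # Illustrative x and y values for a bivariate regression.
    x = np.array([1., 2., 3., 2., 4., 5., 6., 5.])
    y = np.array([10., 12., 15., 11., 18., 20., 22., 19.])

    b1, b0 = np.polyfit(x, y, 1)                 # OLS slope and intercept
    y_hat = b0 + b1 * x                          # predicted values

    ss_total = np.sum((y - y.mean()) ** 2)       # total variation in y
    ss_reg   = np.sum((y_hat - y.mean()) ** 2)   # portion explained by x
    ss_res   = np.sum((y - y_hat) ** 2)          # portion left unexplained

    r_square = ss_reg / ss_total                 # coefficient of determination
    print(round(ss_total, 2), round(ss_reg + ss_res, 2), round(r_square, 3))

Because the predictions come from an OLS fit with an intercept, ss_reg + ss_res reproduces ss_total exactly, and r_square is the proportion of that total explained by the regression.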

The values of the regression coefficients, bi, are estimated in such a way as to minimize the errors of prediction, which is why the set of procedures we are studying is called ordinary least squares (OLS) regression. Minimizing error variance is the same as maximizing explained variance, which is the same as maximizing R2 or the proportion of variance explained in the dependent variable.

The test for the significance of the regression is an F-test. This test can be thought of as testing either of two null hypotheses. The first is that each of the regression coefficients is zero in the population against the alternative that at least one coefficient is non-zero. If the null hypothesis is rejected, we then look at the t-tests on the individual coefficients to determine which are not zero. The second way to conceptualize this F-test is that it tests whether a significant proportion of variance in the dependent variable is explained by the linear combination of the independent variables. If the overall F is significant, we then look at the tests on the individual coefficients to determine which contribute to the explanation of variance.

This F is computed by first computing the mean squares:

MSreg = SSreg/dfreg, dfreg = k (number of independent variables)

MSres = SSres/dfres, dfres = N-k-1

and the F-test is

F = MSreg/MSres
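
Using these definitions, the F and its significance level could be computed as in the sketch below (the sums of squares, k, and N are illustrative values; scipy is assumed for the p-value):

    from scipy import stats

    ss_reg, ss_res = 180.0, 60.0      # illustrative sums of squares
    k, N = 3, 50                      # number of predictors and sample size

    ms_reg = ss_reg / k               # MSreg = SSreg / dfreg, dfreg = k
    ms_res = ss_res / (N - k - 1)     # MSres = SSres / dfres, dfres = N - k - 1
    F = ms_reg / ms_res

    sig_F = stats.f.sf(F, k, N - k - 1)   # the "Sig." reported for the F
    print(round(F, 2), round(sig_F, 4))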

If you are interested in measuring the true effect of an independent variable Xi on the dependent variable Y, it is important to avoid a misspecified regression equation. That is, don’t estimate Y = b0 + b1X1 when the real model is Y = b0 + b1X1 + b2X2. If you do, the b1 in the first equation will be a biased estimate of the effect of X1 on Y. But how do we know that we have included all the important independent variables? Ah, there’s the rub!! We may not simply include EVERYTHING as independent variables because we soon become involved in a nasty little problem called multicollinearity (not to mention running out of degrees of freedom). Yet we have to include all the important independent variables. This requires a thorough knowledge of one’s subject matter, and a fine appreciation of which other independent variables are important. Here statistics becomes more of an art. It is what some people have called the hard part of doing research, as opposed to the easy part, which merely involves running an SPSS statistical analysis, which anyone can master.
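
The bias from omitting a relevant, correlated predictor can be seen in a small simulation sketch (the data-generating values here are invented; statsmodels is assumed):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 1000
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)            # x2 is correlated with x1
    y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)  # true model uses both predictors

    full    = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    reduced = sm.OLS(y, sm.add_constant(x1)).fit()

    print(full.params[1])      # near the true effect of x1 (about 2)
    print(reduced.params[1])   # biased upward (roughly 2 + 3 * 0.6)

In the reduced equation, x1 is credited with the part of x2's effect that travels through their correlation, which is exactly the misspecification problem described above.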

The first thing you look at on your printout is the significance of the F statistic in the ANOVA table (or the R-Square if you want to first decide whether the proportion of variance explained is of any substantive importance before you determine whether it is significantly different from 0). The R-square gives you the proportion of variance in the dependent variable that is explained by the set of independent variables. The significance of the F simply tells you whether that amount of variance explained is different from 0. It is up to you to decide whether it is of any substantive importance. We follow the same rules in reading the significance level of the F as we did in all other statistical tests.

If sig F > α = .05 (.01), the R-square is considered to be essentially 0 and we are not explaining a significant proportion of variance in the dependent variable.

If sig F < α = .05 (.01), the R-square is considered to be greater than 0 and we are explaining a significant proportion of variance in the dependent variable.

If the R-square is significant, we then want to know which of the independent variables are contributing to that explanation of variance and what effect the variables have on the dependent variable. We do this by looking at the information contained in the Coefficients portion of the printout. First, to see which variables are contributing to the explanation of variance we look at the significance level of the t statistics.

If sig t > α = .05 (.01), that variable does not contribute to the explanation of variance and has no effect on the dependent variable in the presence of the other independent variables.

If sig t < α = .05 (.01), that variable does contribute to the explanation of variance and has an effect on the dependent variable in the presence of the other independent variables.
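
The same decision rules can be applied directly to software output. The sketch below (simulated data; statsmodels is assumed) prints the overall sig F and then the sig t for each predictor against α = .05:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = 2 + 1.5 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(size=200)  # x2 has no real effect

    model = sm.OLS(y, sm.add_constant(X)).fit()
    alpha = 0.05

    print("sig F:", round(model.f_pvalue, 4))       # overall test of the regression
    for name, p in zip(["x1", "x2", "x3"], model.pvalues[1:]):
        verdict = "contributes" if p < alpha else "does not contribute"
        print(name, round(p, 4), verdict)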

If the variable does have an effect, we look at the unstandardized regression coefficient, b, to determine the magnitude of that effect. This coefficient represents the amount of change in the dependent variable for a one unit change in the independent variable, holding all other variables constant. Since the variables are usually measured on different scales or metrics, to determine the relative importance of the variables in influencing the dependent variable (the relative magnitude of effect) we examine the standardized regression coefficients, the betas. The betas serve as our effect sizes for regression analyses because they represent the effects the independent variables would have on the dependent variable if all variables had been standardized to z-scores, placing everything on the same scale of measurement. Their numerical value is the number of standard deviations the dependent variable would change if the independent variable changed by one standard deviation, again holding the other variables in the equation constant. We’re usually not so interested in the actual numerical value of either the unstandardized or standardized coefficients - just in which are significant, whether each variable has a positive or negative effect on the dependent variable, and the relative magnitude of influence of the independent variables.
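
As a sketch of the relationship between the b's and the betas (simulated data; statsmodels is assumed), standardizing every variable to z-scores before fitting gives the same values as rescaling each b by sd(Xi)/sd(Y):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 2)) * np.array([1.0, 10.0])   # predictors on very different scales
    y = 4 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=300)

    b = sm.OLS(y, sm.add_constant(X)).fit().params[1:]       # unstandardized b's

    def z(a):                                                # convert to z-scores
        return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

    betas = sm.OLS(z(y), sm.add_constant(z(X))).fit().params[1:]
    print(betas)
    print(b * X.std(axis=0, ddof=1) / y.std(ddof=1))         # same values via rescaling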

However, one instance in which we are more interested in the actual size of the unstandardized coefficient is when the variable is a dichotomous variable representing two groups. In this instance, the unstandardized regression coefficient represents the difference between the two group means on the dependent variable controlling for differences between the groups on the other independent variables.
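
A sketch with a hypothetical 0/1 group indicator (simulated data; statsmodels is assumed) shows the dummy's unstandardized coefficient recovering the adjusted difference between the two group means:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 400
    group = rng.integers(0, 2, size=n)              # dichotomous predictor (0 = group A, 1 = group B)
    x = rng.normal(size=n) + 0.5 * group            # covariate that also differs between the groups
    y = 10 + 3 * group + 1.2 * x + rng.normal(size=n)

    model = sm.OLS(y, sm.add_constant(np.column_stack([group, x]))).fit()
    print(model.params[1])   # about 3: the B-minus-A difference in means on y, controlling for x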

Substantive Importance vs. Statistical Significance

When statistical significance is found, one must then address the issue of the substantive importance of the findings. It is commonly known that large sample sizes contribute to the likelihood of finding statistically significant effects in any type of statistical analysis, and we often use survey data that represent responses by hundreds if not thousands of subjects. It is the researcher’s responsibility to determine what magnitude of effect is substantively meaningful, given the nature of the data gathered and the question being addressed. Some authors seek to give guidelines for criteria of importance, but an effect of a certain magnitude that is important in one setting is not necessarily important in other settings. For example, Cohen (1977) suggests that an R2 of .01 could be viewed as a small, meaningful effect, but few would agree that explaining only 1% of the variance in the dependent variable using a collection of independent variables is of any importance. Comparing the R2 obtained in a study to R2s reported in similar studies and careful consideration of the magnitude of the betas (standardized coefficients) can help place substantive importance on findings. Betas of .05 or less can hardly be argued to be meaningful given that this represents a 5/100 standard deviation change in the dependent variable for a 1 standard deviation change in the independent variable holding other effects constant.

References

Cohen, J. (1977). Statistical power analysis for the behavioral sciences (Rev. ed.). New York: Academic Press.

Ethington, C. A., Thomas, S. L., & Pike, G. R. (2002). Back to the basics: Regression as it should be. In J. C. Smart (Ed.), Higher education: Handbook of theory and research, Vol. 17. New York: Algora Publishing.
