SW 981 - CORRELATION AND SIMPLE REGRESSION




Covariance and Correlation

We have defined the variance of a variable as the average of the squared deviations of individual values about the mean, i.e.,

Vy = Σ(Yi - Ȳ)²/N

Covariance can be similarly defined as the average cross product of two variables expressed as deviations about their means (i.e., centered data).

Covxy = Vxy = Σxy/N, where Σx = Σy = 0

The correlation coefficient is a standardized covariance, i.e.,

rxy = Covxy/(sx sy)
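To make these definitions concrete, here is a minimal sketch in Python (the data values are invented purely for illustration) that computes the variance, covariance, and correlation directly from centered data:

```python
# Illustrative data (invented for this example).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N = len(X)

mean_x = sum(X) / N
mean_y = sum(Y) / N

# Center the data: lower-case x and y are deviations about the means.
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]

var_x = sum(xi**2 for xi in x) / N                   # Vx
var_y = sum(yi**2 for yi in y) / N                   # Vy
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / N    # average cross product

# Correlation: the covariance standardized by both standard deviations.
r_xy = cov_xy / (var_x**0.5 * var_y**0.5)

print(cov_xy, r_xy)   # 1.6 and 0.8 for these data
```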

Simple Linear Regression and Correlation

The Linear Model:

(1) Y = α + βX + ε    population

(2) Y = a + bX + e    sample    Note: We have dropped the subscripts.

(3) Ŷ or Y' = a + bX    predicted value of Y

The most common type of regression is linear regression, in which the objective is to locate the "best-fitting" straight line.

Equation (3) is the equation for a straight line. Note that there is no error term in the equation for the predicted value.

"a" is called the intercept and is equal to the value of Y at the point where the line crosses the Y (vertical) axis (i.e., where X is equal to zero).

"b" is the slope (rise/run) of the line (it denotes how much Y changes for a unit change in X)

Equations (1), (2) and (3) are expressions of a linear relationship between two variables X and Y. The observed relationship is unlikely to fit the expressed linear relationship perfectly, thus the error term in equation (2). The researcher must decide where to draw the line and how well it "fits" the data or "explains" the relationship between the two variables.

See graphic representation on board

Ordinary Least-Squares (OLS) Regression

This method is based on the criterion that the best-fitting line is the one that minimizes the vertical distances of all the points from the line (see graph). The line itself is called the regression line. If a straight line is drawn through the scattergram, any point that does not fall exactly on the line is incompletely accounted for; the amount of "error" is the vertical distance from the point to the line. These distances are squared and then added together, and this sum of squared error distances measures the total error involved when the line is used to predict the location of the data points. The line that minimizes this sum of squared distances serves as a better predictor than any other line.

If variable Y is plotted along the vertical axis and variable X along the horizontal axis, we would call the resulting line the regression of Y on X since it is the vertical distances that are being minimized. If we were to compute the regression of X on Y, we would be minimizing the horizontal distances; our result would usually be a different line.
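As a sketch of this point (same invented data as above), the two fits minimize different sets of squared distances and therefore produce different slopes in the (X, Y) plane:

```python
# Illustrative data (invented for this example).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Regression of Y on X: minimizes vertical (Y-direction) distances.
b_yx = sxy / sum(xi**2 for xi in x)

# Regression of X on Y: minimizes horizontal (X-direction) distances.
b_xy = sxy / sum(yi**2 for yi in y)

# Drawn in the same (X, Y) plane, the second line has slope 1/b_xy,
# which generally differs from b_yx (they agree only when r = 1 or -1).
print(b_yx, 1 / b_xy)   # 0.8 versus 1.25 here
```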

Calculation of Least-Squares Estimators

b = Σxy/Σx²        a = Ȳ - bX̄

Substituting the least-squares estimators into equation (3) above, the following equation can be derived:

Y' = Ȳ + bx

Note: Lower-case x and y are "centered" (i.e., deviations about the means).

The predicted value of Y (Y') is composed of two components: the mean of Y and the product of the regression coefficient (b) and the deviation of X from the mean of X (x). Therefore, when the regression of Y on X is zero (i.e., b = 0), or when X does not affect Y, the regression equation leads to a predicted Y equal to the mean of Y for each value of X.

When b is not zero (that is, when X and Y covary), the application of the regression equation will lead to a reduction in errors of prediction as compared with the errors resulting from predicting the mean of Y for each individual.
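The following sketch (again with invented data) checks this numerically: the total squared prediction error from the regression line is never larger than the error from predicting the mean of Y for every case:

```python
# Illustrative data (invented for this example).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N
x = [xi - mean_x for xi in X]

b = sum(xi * (yi - mean_y) for xi, yi in zip(x, Y)) / sum(xi**2 for xi in x)

# Predicted values in deviation form: Y' = mean(Y) + b*x.
Y_hat = [mean_y + b * xi for xi in x]

sse_mean = sum((yi - mean_y)**2 for yi in Y)              # predict the mean
sse_reg = sum((yi - yh)**2 for yi, yh in zip(Y, Y_hat))   # use the regression

print(sse_mean, sse_reg)   # 10.0 versus 3.6; equal only when b = 0
```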

Partitioning the Sum of Squares

Σ(Yi - Ȳ)² = Σ(Y'i - Ȳ)² + Σ(Yi - Y'i)²

SStotal = SSregression + SSresidual

Dividing both sides by the SStotal yields:

1 = proportion explained variance + proportion error variance
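A numerical check of this partition, reusing the fitted line from the earlier sketches:

```python
# Illustrative data and fit (same invented values as before).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N
x = [xi - mean_x for xi in X]
b = sum(xi * (yi - mean_y) for xi, yi in zip(x, Y)) / sum(xi**2 for xi in x)
Y_hat = [mean_y + b * xi for xi in x]

# The three sums of squares from the partition above.
ss_total = sum((yi - mean_y)**2 for yi in Y)
ss_reg = sum((yh - mean_y)**2 for yh in Y_hat)
ss_res = sum((yi - yh)**2 for yi, yh in zip(Y, Y_hat))

print(ss_total, ss_reg + ss_res)   # 10.0 = 6.4 + 3.6
```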

Correlation and Regression

r²xy = SSreg/SStotal = the proportion of the sum of squares of the dependent variable that is due to the regression; the explained variance

rxy = the Pearson product-moment correlation coefficient

b = rxy (sy/sx)

Thus b and rxy are closely related but provide different interpretations: the correlation rxy measures the strength of the linear association between X and Y, while the regression coefficient b measures the expected change in Y for a one-unit change in X.
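Both identities are easy to confirm numerically; in this sketch (same invented data), note that the N in sy and sx cancels, so either the N- or (N - 1)-denominator versions work as long as both use the same one:

```python
# Illustrative data (invented for this example).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]

sxx = sum(xi**2 for xi in x)   # N * Vx
syy = sum(yi**2 for yi in y)   # N * Vy
sxy = sum(xi * yi for xi, yi in zip(x, y))

b = sxy / sxx
r = sxy / (sxx * syy)**0.5

Y_hat = [mean_y + b * xi for xi in x]
ss_reg = sum((yh - mean_y)**2 for yh in Y_hat)

print(r**2, ss_reg / syy)         # r² = SSreg/SStotal (0.64 here)
print(b, r * (syy / sxx)**0.5)    # b = rxy * (sy/sx)  (0.8 here)
```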

Tests of Significance

Overall Regression (R²): F = (SSreg/k) / (SSres/(N - k - 1))

In simple regression (k = 1), this is equivalent to testing b.

Testing Coefficient (b): t = b/sb (Equation 17.32)

This tests the null hypothesis β = 0; in general, t = (b - β)/sb.

Confidence interval for b: b ± t(α/2, df) sb

Advantages of the t-ratio over the F statistic (illustrated in the sketch below):

It can test any null hypothesis about β (e.g., β equal to some value other than 0).

It can be used to construct confidence intervals.

It allows one-tailed tests.
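Here is a sketch pulling these tests together for simple regression (k = 1), on the same invented data. The critical value is taken from scipy.stats, an added dependency not mentioned in these notes; the standard error sb = sqrt(MSres/Σx²) is the usual formula for a simple-regression slope:

```python
from scipy.stats import t as t_dist  # assumed dependency for the critical value

# Illustrative data (invented for this example).
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N, k = len(X), 1
mean_x, mean_y = sum(X) / N, sum(Y) / N
x = [xi - mean_x for xi in X]

b = sum(xi * (yi - mean_y) for xi, yi in zip(x, Y)) / sum(xi**2 for xi in x)
Y_hat = [mean_y + b * xi for xi in x]

ss_reg = sum((yh - mean_y)**2 for yh in Y_hat)
ss_res = sum((yi - yh)**2 for yi, yh in zip(Y, Y_hat))
df = N - k - 1

# Overall regression: F = (SSreg/k) / (SSres/(N - k - 1)).
F = (ss_reg / k) / (ss_res / df)

# Slope test: t = b/sb, with sb = sqrt(MSres / sum(x^2)).
s_b = (ss_res / df / sum(xi**2 for xi in x))**0.5
t = b / s_b
print(F, t**2)   # in simple regression, t² = F (about 5.33 here)

# 95% confidence interval: b ± t(α/2, df) * sb.
t_crit = t_dist.ppf(0.975, df)
print(b - t_crit * s_b, b + t_crit * s_b)
```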
