Formulas and Relationships from Simple & Multiple Linear Regression

I. Basics for Simple Linear Regression

Let $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ be sample data from a bivariate normal population $(X, Y)$ (technically we have $(x_i, y_i)$ for $i = 1, 2, \ldots, n$, where $n$ is the sample size; we will use the notation $\sum$ for $\sum_{i=1}^{n}$). Then we have the following sample statistics:

$\bar{x} = \frac{\sum x_i}{n}$ (sample mean for $x$)    $\bar{y} = \frac{\sum y_i}{n}$ (sample mean for $y$)

$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$ (sample variance for $x$)    $s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$ (sample variance for $y$)

We will also use the following “sums of squares”:

$SS_{xx} = \sum (x_i - \bar{x})^2$ and $SS_{yy} = \sum (y_i - \bar{y})^2$

Note: Sometimes we refer to $SS_{yy}$ as $SS_{\text{Total}}$.

We have the following relationships between the sample statistics and the sums of squares:

$s_x^2 = \frac{SS_{xx}}{n-1}$ or $SS_{xx} = (n-1)\,s_x^2$

and

$s_y^2 = \frac{SS_{yy}}{n-1}$ or $SS_{yy} = (n-1)\,s_y^2$

The sample covariance between $x$ and $y$ is defined by

$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$

The sample correlation coefficient is defined as

$r = \frac{s_{xy}}{s_x s_y}$

where $s_x = \sqrt{s_x^2}$ and $s_y = \sqrt{s_y^2}$. By plugging the formulas for $s_{xy}$, $s_x$, and $s_y$ into the formula for $r$ we can easily derive that

$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}}$, where $SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = (n-1)\,s_{xy}$.

The least squares regression line for the data has the form

$\hat{y} = b_0 + b_1 x$

where

$b_1 = \frac{SS_{xy}}{SS_{xx}} = r\,\frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1\bar{x}$
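
To make these formulas concrete, here is a minimal Python sketch (not part of the original handout) that computes the sums of squares, the least squares coefficients $b_0$ and $b_1$, and the correlation coefficient $r$; the data values and variable names are purely illustrative.

import numpy as np

# Hypothetical sample data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Sums of squares
SS_xx = np.sum((x - x_bar) ** 2)
SS_yy = np.sum((y - y_bar) ** 2)          # = SS_Total
SS_xy = np.sum((x - x_bar) * (y - y_bar))

# Least squares coefficients and correlation coefficient
b1 = SS_xy / SS_xx
b0 = y_bar - b1 * x_bar
r = SS_xy / np.sqrt(SS_xx * SS_yy)

print(b0, b1, r)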

Associated with the regression we have some additional “sums of squares”:

$SS_{\text{Reg}} = \sum (\hat{y}_i - \bar{y})^2$ and $SS_{\text{Res}} = \sum (y_i - \hat{y}_i)^2$

where for each value of $x_i$ in the sample data, $y_i$ is the corresponding $y$-coordinate and $\hat{y}_i$ is the predicted value from the regression line when $x_i$ is used as the predictor (input) to the line.

It is easy to see both symbolically and graphically that $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$. However, the more interesting result is that:

$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$

i.e.

$SS_{\text{Total}} = SS_{\text{Reg}} + SS_{\text{Res}}$

This describes the total variation in $y$ as the sum of the "explained variation" ($SS_{\text{Reg}}$) and the "unexplained variation" ($SS_{\text{Res}}$). Remember, the least squares regression line is the line that fits the data in a way that minimizes the unexplained variation. Also note that if the data fits a line perfectly then $SS_{\text{Res}} = 0$.

One can derive that the square of the correlation coefficient can be written in terms of these sums of squares:

$r^2 = \frac{SS_{\text{Reg}}}{SS_{\text{Total}}}$

So here again we see that $r^2$ measures how strongly the data fits a line, since if the data is perfectly linear, then $r^2 = 1$. $r^2$ is called the coefficient of determination. Another relationship that can be easily derived from the formula for $r^2$ is that

$SS_{\text{Res}} = (1 - r^2)\,SS_{\text{Total}}$
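
As a numerical check (an illustrative sketch, not part of the original notes), the following Python snippet verifies the decomposition $SS_{\text{Total}} = SS_{\text{Reg}} + SS_{\text{Res}}$ and the identities $r^2 = SS_{\text{Reg}}/SS_{\text{Total}}$ and $SS_{\text{Res}} = (1-r^2)\,SS_{\text{Total}}$ on made-up data; np.polyfit is used only as a convenient way to get the least squares line.

import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)            # least squares slope and intercept
y_hat = b0 + b1 * x

SS_Total = np.sum((y - y.mean()) ** 2)
SS_Reg   = np.sum((y_hat - y.mean()) ** 2)
SS_Res   = np.sum((y - y_hat) ** 2)

r = np.corrcoef(x, y)[0, 1]

print(np.isclose(SS_Total, SS_Reg + SS_Res))        # SS_Total = SS_Reg + SS_Res
print(np.isclose(r ** 2, SS_Reg / SS_Total))        # r^2 = SS_Reg / SS_Total
print(np.isclose(SS_Res, (1 - r ** 2) * SS_Total))  # SS_Res = (1 - r^2) SS_Total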

We will define the standard error of the linear regression to be the following:

$s_e = \sqrt{\frac{SS_{\text{Res}}}{n-2}}$

We will use this standard error to define the standard error for random variables associated with the linear regression and thus to test hypotheses about these random variables. We will also use the standard error of the linear regression to make inferences about the predicted values of $y$ for a fixed value of $x$.

II. Testing the correlation coefficient.

The population correlation coefficient is represented by the symbol $\rho$. We will use the sample correlation coefficient, $r$, obtained from the sample $(x, y)$ data, to test if there is a significant linear relationship between the population variables $X$ and $Y$. If one assumes that $\rho = 0$, then it can be shown that the standard error for the random variable $r$ is given by

$s_r = \sqrt{\frac{1 - r^2}{n-2}}$

and that the test statistic $t = \frac{r}{s_r}$ has a Student's t-distribution with $n - 2$ degrees of freedom. Thus we can test the null hypothesis, $H_0\!: \rho = 0$, against appropriate alternative hypotheses using this test statistic.

The standard error for $r$, $s_r$, can also be written in terms of the standard error of the linear regression and the sum of squares total:

$s_r = \frac{s_e}{\sqrt{SS_{\text{Total}}}}$
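
The sketch below (illustrative only, with hypothetical data) computes $s_e$, $s_r$, the $t$-statistic for $r$, and a two-sided $p$-value, and checks the identity $s_r = s_e/\sqrt{SS_{\text{Total}}}$; scipy is assumed to be available for the $t$-distribution.

import numpy as np
from scipy import stats

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SS_Total = np.sum((y - y.mean()) ** 2)
SS_Res = np.sum((y - y_hat) ** 2)
s_e = np.sqrt(SS_Res / (n - 2))                  # standard error of the regression

r = np.corrcoef(x, y)[0, 1]
s_r = np.sqrt((1 - r ** 2) / (n - 2))            # standard error of r (assuming rho = 0)

t = r / s_r
p_value = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value

print(np.isclose(s_r, s_e / np.sqrt(SS_Total)))  # s_r = s_e / sqrt(SS_Total)
print(t, p_value)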

III. Testing the slope of the regression line.

The population regression line slope is represented by the symbol $\beta_1$. We will use the slope of the regression line that fits the sample $(x, y)$ data, $b_1$, to make inferences about the slope of the regression line that would fit the population variables $X$ and $Y$, i.e. $\beta_1$. If one assumes that $\beta_1 = 0$, then it can be shown that the standard error for the random variable $b_1$ is given by

$s_{b_1} = \frac{s_e}{\sqrt{SS_{xx}}}$

and that the test statistic $t = \frac{b_1}{s_{b_1}}$ has a Student's t-distribution with $n - 2$ degrees of freedom. Thus we can test the null hypothesis, $H_0\!: \beta_1 = 0$, against appropriate alternative hypotheses using this test statistic. When conducting this test we are essentially determining whether the "independent effect" of the variable $x$ on the predicted (or criterion) variable $y$ is significant. The independent effect is measured by the slope: a change in $x$ produces a proportional change in $y$, as measured by $b_1$ or $\beta_1$.

Now here is a very interesting result. You should be able to follow this derivation by seeing how we are using the formulas above:

$t = \frac{b_1}{s_{b_1}} = \frac{b_1\sqrt{SS_{xx}}}{s_e} = \frac{r\,\frac{s_y}{s_x}\,\sqrt{SS_{xx}}}{s_e} = \frac{r\,s_y\sqrt{n-1}}{s_e} = \frac{r\sqrt{SS_{\text{Total}}}}{s_e} = \frac{r}{s_r}$

(using $\sqrt{SS_{xx}} = s_x\sqrt{n-1}$, $SS_{\text{Total}} = (n-1)\,s_y^2$, and $s_r = s_e/\sqrt{SS_{\text{Total}}}$)

Thus when testing $H_0\!: \rho = 0$ or $H_0\!: \beta_1 = 0$ the value of the t-statistic will be the same!
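
The following short Python check (hypothetical data, illustrative only) confirms numerically that the $t$-statistic for the slope equals the $t$-statistic for the correlation coefficient.

import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SS_xx = np.sum((x - x.mean()) ** 2)
SS_Res = np.sum((y - y_hat) ** 2)
s_e = np.sqrt(SS_Res / (n - 2))
s_b1 = s_e / np.sqrt(SS_xx)                  # standard error of the slope

r = np.corrcoef(x, y)[0, 1]
s_r = np.sqrt((1 - r ** 2) / (n - 2))        # standard error of r

print(np.isclose(b1 / s_b1, r / s_r))        # the two t-statistics agree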

IV. Testing the coefficient of determination.

Another test that we will use to test the significance of the linear regression is the $F$-test. The $F$-test is a test about the population coefficient of determination ($\rho^2$). The null hypothesis for the $F$-test is $H_0\!: \rho^2 = 0$ with alternative hypothesis $H_a\!: \rho^2 > 0$. The $F$ distribution is used to test if two variances from independent populations are equal. This is done by testing the ratio of the two variances. The $F$ distribution has two degrees of freedom parameters: the degrees of freedom of the numerator variance and the degrees of freedom of the denominator variance. In the case of linear regression we will use the following ratio for our $F$-test statistic:

$F = \frac{SS_{\text{Reg}}/k}{SS_{\text{Res}}/(n-k-1)} = \frac{MS_{\text{Reg}}}{MS_{\text{Res}}}$

where $k$ is the number of independent predictors (so far $k = 1$, but this will change once we get to multiple regression) and $n$ is the number of data points in the sample. The numerator of the $F$-test statistic is the variance in the regression ($MS_{\text{Reg}} = SS_{\text{Reg}}/k$) and is also called the mean square regression. The denominator of the $F$-test statistic is the variance in the residual ($MS_{\text{Res}} = SS_{\text{Res}}/(n-k-1)$) and is also called the mean square residual. We think of the $F$-test statistic as the ratio of the explained variance to the unexplained variance. Since in regression our goal is to minimize the unexplained variance, to have a significant result we would expect the $F$-test statistic to be greater than 1.

Consider the case when $k = 1$ (i.e. simple (or single variable) linear regression). Then

$F = \frac{SS_{\text{Reg}}/1}{SS_{\text{Res}}/(n-2)} = \frac{SS_{\text{Reg}}}{s_e^2}$.

Another way to think of what the $F$-test statistic measures is to consider a comparison of how well the regression line ($\hat{y} = b_0 + b_1 x$) fits the data versus how well the line $\hat{y} = \bar{y}$ fits the data. The more the regression line explains the variance in $y$ (i.e. the higher the ratio of the variance in the regression to the variance in the residual), the more significant the result. Note that the line $\hat{y} = \bar{y}$ has slope equal to zero, so the $F$-test is essentially a test about the slope of the regression line (i.e. if the slope of the regression line is zero then the $F$-test statistic will be zero, since in that case $\hat{y}_i = \bar{y}$ for every $i$ and so $SS_{\text{Reg}} = 0$). Recall the $t$-test statistic for testing $H_0\!: \beta_1 = 0$ against appropriate alternative hypotheses:

$t = \frac{b_1}{s_{b_1}} = \frac{b_1\sqrt{SS_{xx}}}{s_e}$

Thus we have

$t^2 = \frac{b_1^2\,SS_{xx}}{s_e^2} = \frac{SS_{\text{Reg}}}{s_e^2} = \frac{SS_{\text{Reg}}/1}{SS_{\text{Res}}/(n-2)} = F$

(using $\hat{y}_i - \bar{y} = b_1(x_i - \bar{x})$, so that $SS_{\text{Reg}} = b_1^2\,SS_{xx}$)

So the $F$-test statistic is the square of the $t$-test statistic for the slope of the regression line. Thus it turns out that in single variable regression, using the $F$-test statistic is equivalent to testing if the slope of the regression line is zero! When doing simple linear regression, if you check your computer output for the $p$-value for the $F$-test statistic and for the $t$-test statistic for the slope, they will be the same value! There will be an analogy to this test in multivariable regression!
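
Here is an illustrative Python check (made-up data, not from the handout) that the $F$-statistic with $k = 1$ equals the square of the $t$-statistic for the slope.

import numpy as np

# Hypothetical data (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n, k = len(x), 1

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

SS_Reg = np.sum((y_hat - y.mean()) ** 2)
SS_Res = np.sum((y - y_hat) ** 2)
SS_xx = np.sum((x - x.mean()) ** 2)
s_e = np.sqrt(SS_Res / (n - k - 1))

F = (SS_Reg / k) / (SS_Res / (n - k - 1))    # MS_Reg / MS_Res
t = b1 / (s_e / np.sqrt(SS_xx))              # t-statistic for the slope

print(np.isclose(F, t ** 2))                 # F = t^2 in simple regression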

V. Basics for Multiple Regression

Let $(x_1, x_2, \ldots, x_k, y)$ be sample data from a multivariate normal population $(X_1, X_2, \ldots, X_k, Y)$ (technically we have $(x_{1i}, x_{2i}, \ldots, x_{ki}, y_i)$ for $i = 1, 2, \ldots, n$, where $n$ is the sample size; we will use the notation $\sum$ for $\sum_{i=1}^{n}$).

We will again perform linear regression on the data: i.e. we will use the data to find $b_0$ and $b_1, b_2, \ldots, b_k$ such that

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$

As before we will consider the sum of the squares of the residuals: $SS_{\text{Res}} = \sum (y_i - \hat{y}_i)^2$, where $y_i$ is from the data (i.e., the observed value of $y$) and $\hat{y}_i$ is the predicted value of $y$ using the inputs $x_{1i}, x_{2i}, \ldots, x_{ki}$.

The values that we get for $b_0$ and $b_1, b_2, \ldots, b_k$ are such that $SS_{\text{Res}}$ is minimized (note this is the same criterion we had for single variable (i.e. simple) regression). The formulas for $b_0$ and $b_1, b_2, \ldots, b_k$ are somewhat complicated and for now we will compute these via computer technology.
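
For example, here is a minimal Python sketch of how such coefficients might be computed with standard numerical software; the data, the variable names, and the choice of np.linalg.lstsq are illustrative assumptions, not the handout's prescribed method.

import numpy as np

# Hypothetical data with k = 2 predictors (illustration only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([3.2, 3.9, 7.8, 8.1, 11.9])

# Prepend a column of ones so the first coefficient is the intercept b0
X_design = np.column_stack([np.ones(len(y)), X])

# The least squares solution minimizes SS_Res = sum((y - y_hat)^2)
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs

y_hat = X_design @ coeffs
SS_Res = np.sum((y - y_hat) ** 2)
print(b0, b1, b2, SS_Res)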

Now, just as in simple linear regression, we can determine how well the regression equation explains the variance in $y$ by considering the coefficient of determination:

$R^2 = \frac{SS_{\text{Reg}}}{SS_{\text{Total}}}$

where $SS_{\text{Reg}} = \sum (\hat{y}_i - \bar{y})^2$ is defined as in simple linear regression, and the relationship $SS_{\text{Total}} = SS_{\text{Reg}} + SS_{\text{Res}}$ is still true!

Testing the coefficient of determination in multiple regression. Again we will use the $F$-test. The null hypothesis for the $F$-test is $H_0\!: \rho^2 = 0$ with alternative hypothesis $H_a\!: \rho^2 > 0$. Our $F$-test statistic will again be:

$F = \frac{SS_{\text{Reg}}/k}{SS_{\text{Res}}/(n-k-1)} = \frac{MS_{\text{Reg}}}{MS_{\text{Res}}}$

where $k$ is the number of independent predictors and $n$ is the number of data points in the sample. The numerator of the $F$-test statistic is the variance in the regression ($MS_{\text{Reg}} = SS_{\text{Reg}}/k$) and is also called the mean square regression. The denominator of the $F$-test statistic is the variance in the residual ($MS_{\text{Res}} = SS_{\text{Res}}/(n-k-1)$) and is also called the mean square residual. We think of the $F$-test statistic as the ratio of the explained variance to the unexplained variance. Since in regression our goal is to minimize the unexplained variance and have most of the variance in $y$ explained by the regression equation, to have a significant result we would expect the $F$-test statistic to be greater than 1. Note that an alternative formula for the $F$-test statistic is

$F = \frac{R^2/k}{(1 - R^2)/(n-k-1)}$

Can you show this is true??
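
As a numerical illustration (hypothetical data and variable names), the following sketch checks that the sums-of-squares form and the $R^2$ form of the $F$-statistic agree.

import numpy as np

# Hypothetical data with k = 2 predictors (illustration only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([3.2, 3.9, 7.8, 8.1, 11.9, 12.2])
n, k = len(y), X.shape[1]

X_design = np.column_stack([np.ones(n), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ coeffs

SS_Total = np.sum((y - y.mean()) ** 2)
SS_Reg = np.sum((y_hat - y.mean()) ** 2)
SS_Res = np.sum((y - y_hat) ** 2)
R2 = SS_Reg / SS_Total

F_ss = (SS_Reg / k) / (SS_Res / (n - k - 1))
F_r2 = (R2 / k) / ((1 - R2) / (n - k - 1))
print(np.isclose(F_ss, F_r2))                # the two formulas agree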

In the case of simple linear regression we saw that the above hypothesis test was equivalent to the hypothesis test $H_0\!: \beta_1 = 0$ with $H_a\!: \beta_1 \ne 0$. In the case of multiple regression, the above hypothesis test is equivalent to the following hypothesis test:

$H_0\!: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ with alternative hypothesis

$H_a\!:$ At least one coefficient is not equal to zero.

where $\beta_1, \beta_2, \ldots, \beta_k$ are the slopes of the regression equation that fits the population. So acceptance of the null hypothesis implies that the explanatory (i.e. predictor, i.e. independent) variables do not have any significant impact on explaining the variance in the dependent variable, $y$. So to estimate the average value of $y$ one would not take into account the values of the independent variables and would thus use the sample mean $\bar{y}$ as a point estimate.

Rejection of the null hypothesis does not necessarily mean that all of the coefficients are significantly different from zero, but that at least one of the coefficients is significantly different from zero. Rejection does imply that the regression equation is useful for predicting values of $y$ given values of the independent variables. In particular, given values of $x_1, x_2, \ldots, x_k$, a point estimate for the average value of $y$ is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$ (i.e. we would take into account the deviations of the $x_j$ from their averages to compute the average value of $y$).

Once we have determined that the regression equation significantly explains the variance in $y$, we can run a $t$-test on each of the independent variable coefficients to test if they are significantly different from zero. We will then be interested in seeing if we can get a better model by reducing the full model, dropping independent variables whose coefficients are not significantly different from zero.
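
A rough sketch of how these coefficient $t$-tests can be computed with numpy/scipy is given below; the data and variable names are hypothetical, and the standard errors come from the usual formula $\operatorname{se}(b_j) = \sqrt{MS_{\text{Res}}\,[(X^TX)^{-1}]_{jj}}$ applied to the design matrix with a column of ones.

import numpy as np
from scipy import stats

# Hypothetical data with k = 2 predictors (illustration only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([3.2, 3.9, 7.8, 8.1, 11.9, 12.2])
n, k = len(y), X.shape[1]

X_design = np.column_stack([np.ones(n), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
y_hat = X_design @ coeffs

# Mean square residual and the standard errors of the coefficients
MS_Res = np.sum((y - y_hat) ** 2) / (n - k - 1)
cov_b = MS_Res * np.linalg.inv(X_design.T @ X_design)
se_b = np.sqrt(np.diag(cov_b))

t_stats = coeffs / se_b
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k - 1)
print(t_stats, p_values)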

More statistics associated with multiple regression:

We have the following sample statistics:

$\bar{x}_j = \frac{\sum x_{ji}}{n}$ (sample mean for each predictor variable $x_j$, $j = 1, \ldots, k$)

$\bar{y} = \frac{\sum y_i}{n}$ (sample mean for $y$)

$s_{x_j}^2 = \frac{\sum (x_{ji} - \bar{x}_j)^2}{n-1}$ (sample variance for each predictor variable $x_j$)

$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$ (sample variance for $y$)

We will also use the following “sums of squares”:

$SS_{x_j x_j} = \sum (x_{ji} - \bar{x}_j)^2$ and $SS_{yy} = \sum (y_i - \bar{y})^2$

Note: Sometimes we refer to $SS_{yy}$ as $SS_{\text{Total}}$.

We have the following relationships between the sample statistics and the sums of squares:

$s_{x_j}^2 = \frac{SS_{x_j x_j}}{n-1}$ or $SS_{x_j x_j} = (n-1)\,s_{x_j}^2$ for each predictor variable $x_j$

and

$s_y^2 = \frac{SS_{yy}}{n-1}$ or $SS_{yy} = (n-1)\,s_y^2$

The sample covariance between $x_j$ and $y$ is defined by

$s_{x_j y} = \frac{\sum (x_{ji} - \bar{x}_j)(y_i - \bar{y})}{n-1}$

We will also define the covariance between $x_j$ and $x_l$ for $j \ne l$ as follows:

$s_{x_j x_l} = \frac{\sum (x_{ji} - \bar{x}_j)(x_{li} - \bar{x}_l)}{n-1}$

Note that when $j = l$, $s_{x_j x_l} = s_{x_j}^2$.
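
The snippet below (illustrative, made-up data) computes one of these covariances from the definition and checks it against the full sample covariance matrix produced by np.cov.

import numpy as np

# Hypothetical data with k = 2 predictors (illustration only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([3.2, 3.9, 7.8, 8.1, 11.9])
n = len(y)

# Covariance of x_1 with y, computed from the definition (divisor n - 1)
s_x1_y = np.sum((X[:, 0] - X[:, 0].mean()) * (y - y.mean())) / (n - 1)

# Full sample covariance matrix of (x_1, x_2, y); diagonal entries are the variances
cov_matrix = np.cov(np.column_stack([X, y]), rowvar=False)
print(np.isclose(s_x1_y, cov_matrix[0, 2]))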
