Regression
Suppose we are analyzing a bivariate data set (x_1, y_1), ..., (x_n, y_n) from an explanatory variable x and a response variable y. As we have seen before, sometimes there is a definite linear relationship between x and y, as evidenced by the scatterplot.
A regression line is a line that describes how the response variable y changes as the explanatory variable x changes. The idea is to have the line pass as close as possible to as many of the points on the scatterplot as possible.
Recall that any non-vertical line can be written in the form

  y = a + bx

where b is called the slope of the line and a is called the intercept.

If our data set is (x_1, y_1), ..., (x_n, y_n), then plugging x_i into the line gives a predicted or fitted value of y_i, denoted ŷ_i. Thus

  ŷ_i = a + b·x_i
If there were a perfect linear relationship between x and y (meaning that the line actually passed through all the points), then the actual value y_i and the predicted value ŷ_i would be the same. However, there is seldom a perfect linear relationship, and so the values differ. One defines the value of the difference to be the "error" or "residual":

  e_i = y_i − ŷ_i

Once again: although in the ideal situation of a perfect linear relationship all of the e_i would be zero, this situation rarely arises. However, one can try to do the best possible by making the e_i as small as possible.
Regression using the Least Squares Line
Assume that we have bivariate data (x_1, y_1), ..., (x_n, y_n), where x is the explanatory variable and y is the response variable.

Assume also that the mean and standard deviation of the x_i are x̄ and s_x, and that the mean and standard deviation of the y_i are ȳ and s_y. Let r be the correlation coefficient.

The least squares regression line is the line

  ŷ = a + bx

where b = r·(s_y / s_x) and a = ȳ − b·x̄
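These formulas can be applied directly from the summary statistics. A minimal Python sketch, using the weight/MPG data from the example later in these notes:

```python
# Sketch: least squares slope and intercept from summary statistics.
# Data: vehicle weight (x, in 1000s of lbs) vs. MPG (y), taken from
# the example later in these notes.
import math

x = [1.3, 2.1, 2.8, 1.5, 1.6, 1.5, 1.3, 2.4, 2.7]
y = [38, 19, 14, 27, 31, 34, 33, 17, 17]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
# Sample standard deviations (divide by n - 1).
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
# Correlation coefficient r.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

b = r * s_y / s_x      # slope: roughly -14.07
a = y_bar - b * x_bar  # intercept: roughly 52.44
print(round(b, 4), round(a, 4))
```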
How the Least Squares Line is "Best Possible"

Of all the possible lines that one could fit to the data, the least squares line is the one that makes the residual sum of squares

  e_1² + e_2² + ... + e_n² = Σ (y_i − ŷ_i)²

the smallest it possibly can be.
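This minimizing property can be checked numerically. A sketch in Python, using the weight/MPG data from the example later in these notes: fit the least squares line, then verify that nearby lines all have a larger residual sum of squares.

```python
# Sketch: the least squares line minimizes the residual sum of squares.
# Data: vehicle weight (x, in 1000s of lbs) vs. MPG (y) from the
# example later in these notes.
x = [1.3, 2.1, 2.8, 1.5, 1.6, 1.5, 1.3, 2.4, 2.7]
y = [38, 19, 14, 27, 31, 34, 33, 17, 17]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

def sse(a, b):
    """Residual sum of squares of the line y-hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Least squares coefficients.
b_ls = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a_ls = y_bar - b_ls * x_bar
best = sse(a_ls, b_ls)

# Every perturbed line has a strictly larger residual sum of squares.
for da in (-1, 0, 1):
    for db in (-0.5, 0, 0.5):
        if (da, db) != (0, 0):
            assert sse(a_ls + da, b_ls + db) > best
```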
Other facts about the Least Squares Line:
1. The line always passes through the mean center (x̄, ȳ) of the scatterplot.
2. The line predicts that for every increase of one standard deviation in x, y changes by r standard deviations.
Example:
|x |y |
|1 |2 |
|2 |4 |
|3 |6 |
|4 |3 |
|5 |7 |
|6 |9 |
Excel SUMMARY OUTPUT (partial):

|Multiple R |0.83030436 |
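For simple regression, Excel's "Multiple R" is just the absolute value of the correlation coefficient r, so it can be verified by hand. A quick Python sketch for the small table above:

```python
# Sketch: verifying the "Multiple R" value Excel reports for the
# small example above (for one explanatory variable, Multiple R = |r|).
import math

x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 3, 7, 9]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 8))   # about 0.83030436, matching the output above
```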
A second example: the weight (x, in thousands of pounds) and gas mileage (y, in MPG) of nine vehicles:

|Weight (1000s lbs) |MPG |
|1.3 |38 |
|2.1 |19 |
|2.8 |14 |
|1.5 |27 |
|1.6 |31 |
|1.5 |34 |
|1.3 |33 |
|2.4 |17 |
|2.7 |17 |

Excel SUMMARY OUTPUT (partial):

|Multiple R |0.944587 |

|Observation |Predicted MPG |Residuals |
|1 |34.15182 |3.84818 |
|2 |22.89853 |-3.89853 |
|3 |13.0519 |0.948102 |
|4 |31.3385 |-4.3385 |
|5 |29.93184 |1.068164 |
|6 |31.3385 |2.661503 |
|7 |34.15182 |-1.15182 |
|8 |18.67854 |-1.67854 |
|9 |14.45856 |2.541441 |
According to the regression output, the equation of the least squares line is

  ŷ = 52.44 − 14.07x

where x is the weight in thousands of pounds and ŷ is the predicted MPG.
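The fitted values and residuals in the output table can be reproduced with the least squares formulas. A sketch (weights in thousands of pounds):

```python
# Sketch: re-deriving the Excel fitted values and residuals for the
# weight/MPG data above.
x = [1.3, 2.1, 2.8, 1.5, 1.6, 1.5, 1.3, 2.4, 2.7]
y = [38, 19, 14, 27, 31, 34, 33, 17, 17]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

predicted = [a + b * xi for xi in x]             # fitted values y-hat_i
residuals = [yi - p for yi, p in zip(y, predicted)]  # e_i = y_i - y-hat_i
# First observation matches the output table: 34.15182 and 3.84818.
print(round(predicted[0], 5), round(residuals[0], 5))
```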
Applications:
Predict the MPG if the weight is 2000 lbs:

  ŷ = 52.44 − 14.07(2.0) ≈ 24.3 MPG

Predict the MPG if the weight is 3500 lbs:

  ŷ = 52.44 − 14.07(3.5) ≈ 3.2 MPG

Predict the MPG if the weight is 4000 lbs:

  ŷ = 52.44 − 14.07(4.0) ≈ −3.8 MPG
What is wrong with the last two predictions?
Using the equation to predict for x values outside the range of the x values in the original data set is called extrapolation, and in many cases it is unwise. This is because the regression line was fitted to x values within a specified range (here, weights from 1300 to 2800 lbs) and often "breaks down" beyond this range.
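The breakdown is easy to see numerically. A sketch, using the coefficients rounded from the fitted line above: the data only covered weights of about 1.3 to 2.8 (thousands of pounds), so the last two predictions are extrapolations, and the last one is physically impossible.

```python
# Sketch: predictions from the fitted line y-hat = 52.44 - 14.07x
# (x in 1000s of lbs; coefficients rounded from the regression output).
a, b = 52.44, -14.07

for weight in [2.0, 3.5, 4.0]:
    mpg = a + b * weight
    print(weight, round(mpg, 2))   # the 4.0 case comes out negative
```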
What is r², and what does it tell us?

r² is the square of the correlation coefficient. It has an elegant interpretation for our regression.
Notice the variation in the MPG values: they range from 14 to 38, with many values in between. One would like to know what factors are responsible for affecting a vehicle's MPG. One factor is clearly weight. Are there more? Are they important?

Look at r² = 0.8922 (it is simply the square of r). We can say that 89.22% of the variation in the MPG values is explained by the regression on weight.
In general, r² is the percent of variation in the y values that is explained by the regression on x:

  r² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
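This interpretation can be checked directly: computing r² as one minus (residual variation over total variation) on the weight/MPG data recovers the 0.8922 quoted above. A sketch:

```python
# Sketch: r^2 as the fraction of y-variation explained by the regression,
# computed as 1 - SSE/SST on the weight/MPG data.
x = [1.3, 2.1, 2.8, 1.5, 1.6, 1.5, 1.3, 2.4, 2.7]
y = [38, 19, 14, 27, 31, 34, 33, 17, 17]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual variation
sst = sum((yi - y_bar) ** 2 for yi in y)                     # total variation
r_squared = 1 - sse / sst
print(round(r_squared, 4))   # about 0.8922, i.e. 0.944587 squared
```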
A visual check of r² is accomplished by comparing how much the residuals vary with how much the y values vary. [Plots of the residuals and of the y values omitted.]
-----------------------

Notation summary:

  ŷ_i = predicted or fitted value of y_i
      = a + b·x_i
      = the result of plugging x_i into the line

Residuals:

  e_i = y_i − ŷ_i