Regression: Finding the equation of the line of best fit
Simple Linear Regression: 1. Finding the equation of the line of best fit
Objectives: To find the equation of the least squares regression line of y on x.
Background and general principle
The aim of regression is to find the linear relationship between two variables. This is in turn translated into the mathematical problem of finding the equation of the line that is closest to all the points observed.

[Figure: scatter plot of Weight (kg) against Height (m), showing one possible line of best fit and the vertical deviations of the points from it.]

Consider the scatter plot in the figure. One possible line of best fit has been drawn on the diagram. Some of the points lie above the line and some lie below it. The vertical distance each point is above or below the line has been added to the diagram. These distances are called deviations or errors; they are symbolised as $d_1, d_2, \ldots, d_n$.
When drawing in a regression line, the aim is to make the line fit the points as closely as possible. We do this by making the total of the squares of the deviations as small as possible, i.e. we minimise $\sum d_i^2$.
If a line of best fit is found using this principle, it is called the least-squares regression line.
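The least-squares principle can be illustrated with a short Python sketch. The four data points and the two candidate lines below are made up for illustration only (they do not come from these notes), and the function name is a hypothetical helper. The sketch computes $\sum d_i^2$ for each candidate line; the least-squares line for these points gives the smaller total.

```python
def sum_squared_deviations(xs, ys, intercept, slope):
    """Sum of squared vertical deviations of the points from the line
    y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Illustrative data only (an assumption, not taken from the notes).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

# An arbitrary candidate line, y = 1 + x.
candidate_total = sum_squared_deviations(xs, ys, intercept=1.0, slope=1.0)

# The least-squares line for these four points, y = 1.5 + 0.8x.
least_squares_total = sum_squared_deviations(xs, ys, intercept=1.5, slope=0.8)

print("candidate line:    ", candidate_total)
print("least-squares line:", least_squares_total)
```

Any other choice of intercept and slope would give a total at least as large as that of the least-squares line.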
Example 1: A patient is given a drip feed containing a particular chemical and its concentration in his blood is measured, in suitable units, at one hour intervals. The doctors believe that a linear relationship will exist between the variables.
Time, x (hours)     0     1     2     3     4     5     6
Concentration, y    2.4   4.3   5.0   6.9   9.1   11.4  13.5
We can plot these data on a scatter graph. Time would be plotted on the horizontal axis (as it is the independent variable). Time is here referred to as a controlled variable, since the experimenter fixed the value of this variable in advance (measurements were taken every hour).

Concentration is the dependent variable, as the concentration in the blood is likely to vary according to time.

[Figure: scatter plot of Concentration against Time (hours).]

The doctor may wish to estimate the concentration of the chemical in the blood after 3.5 hours. She could do this by finding the equation of the line of best fit.
There is a formula which gives the equation of the line of best fit.
** The statistical equation of the simple linear regression line, when only the response variable Y is random, is: $Y = \beta_0 + \beta_1 x + \varepsilon$ (or in terms of each point: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$). Here $\beta_0$ is called the intercept, $\beta_1$ the regression slope, $\varepsilon$ is the random error with mean 0, x is the regressor (independent variable), and Y the response variable (dependent variable).
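** The least squares regression line is obtained by finding the values of $\beta_0$ and $\beta_1$ (denoted in the solutions as $\hat\beta_0$ and $\hat\beta_1$) that minimize the sum of the squared vertical distances from all points to the line:
$$\sum d_i^2 = \sum (y_i - \hat y_i)^2 = \sum (y_i - \beta_0 - \beta_1 x_i)^2$$
The solutions are found by solving the equations: $\frac{\partial}{\partial \beta_0} \sum d_i^2 = 0$ and $\frac{\partial}{\partial \beta_1} \sum d_i^2 = 0$.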
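For readers who want the intermediate steps, the two equations can be worked through as follows. This is a sketch of the standard derivation, written out here as a supplement rather than taken from the notes; all sums run over $i = 1, \ldots, n$.

```latex
\begin{align*}
\frac{\partial}{\partial \beta_0} \sum (y_i - \beta_0 - \beta_1 x_i)^2
  &= -2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Rightarrow\quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x, \\
\frac{\partial}{\partial \beta_1} \sum (y_i - \beta_0 - \beta_1 x_i)^2
  &= -2 \sum x_i (y_i - \beta_0 - \beta_1 x_i) = 0
  &&\Rightarrow\quad \sum x_i y_i = \hat\beta_0 \sum x_i + \hat\beta_1 \sum x_i^2 .
\end{align*}
% Substitute \hat\beta_0 = \bar y - \hat\beta_1 \bar x into the second equation
% and use \sum x_i = n \bar x:
\begin{align*}
\sum x_i y_i - n \bar x \bar y
  &= \hat\beta_1 \Bigl( \sum x_i^2 - n \bar x^2 \Bigr)
  \quad\Rightarrow\quad
  \hat\beta_1
  = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}
         {\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}} .
\end{align*}
```

The numerator and denominator of the last fraction are the quantities written $S_{xy}$ and $S_{xx}$ in the notation of these notes.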
** The equation of the fitted least squares regression line is $\hat Y = \hat\beta_0 + \hat\beta_1 x$ (or in terms of each point: $\hat Y_i = \hat\beta_0 + \hat\beta_1 x_i$). For simplicity of notation, many books denote the fitted regression equation as $\hat Y = b_0 + b_1 x$ (for some examples, we will use this simpler notation), where
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x .$$
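Notations:
$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = \sum (x_i - \bar x)(y_i - \bar y); \qquad S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = \sum (x_i - \bar x)^2 ;$$
$\bar x$ and $\bar y$ are the mean values of x and y respectively.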
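Each of $S_{xy}$ and $S_{xx}$ is given above in two forms: a "shortcut" form built from raw sums and a form built from deviations about the means. The Python sketch below checks numerically that the two forms agree. The function names are hypothetical helpers and the data are made up for illustration.

```python
def s_xy_shortcut(xs, ys):
    """S_xy from raw sums: sum(xy) - sum(x) * sum(y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def s_xy_deviation(xs, ys):
    """S_xy from deviations: sum((x - x_bar) * (y - y_bar))."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))


def s_xx_shortcut(xs):
    """S_xx from raw sums: sum(x^2) - sum(x)^2 / n."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)


def s_xx_deviation(xs):
    """S_xx from deviations: sum((x - x_bar)^2)."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs)


# Illustrative data only (an assumption, not taken from the notes).
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 8.0]

print(s_xy_shortcut(xs, ys), s_xy_deviation(xs, ys))
print(s_xx_shortcut(xs), s_xx_deviation(xs))
```

The shortcut forms are convenient for hand calculation from summary totals. In floating-point arithmetic the deviation forms are usually less prone to cancellation error when the values are large relative to their spread.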
Note 1: Please notice that in finding the least squares regression line, we do not need to assume any distribution for the random errors $\varepsilon_i$. However, for statistical inference on the model parameters ($\beta_0$ and $\beta_1$), it is assumed in our class that the errors have the following three properties:
Normally distributed errors
Homoscedasticity (constant error variance $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for Y at all levels of X)
Independent errors (usually checked when data are collected over time or space)
***The above three properties can be summarized as: $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$, $i = 1, \ldots, n$
Note 2: Please notice that least squares regression is only suitable when the random errors exist in the dependent variable Y only. If the regressor X is also random, the problem is referred to as Errors-in-Variables (EIV) regression. One can find a good summary of EIV regression in section 12.2 of the book "Statistical Inference" (2nd edition) by George Casella and Roger Berger.
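We can work out the equation for our example as follows:
$\sum x = 0 + 1 + \cdots + 6 = 21$, so $\bar x = \frac{21}{7} = 3$
$\sum y = 2.4 + 4.3 + \cdots + 13.5 = 52.6$, so $\bar y = \frac{52.6}{7} = 7.514\ldots$
$\sum xy = (0 \times 2.4) + (1 \times 4.3) + \cdots + (6 \times 13.5) = 209.4$
$\sum x^2 = 0^2 + 1^2 + \cdots + 6^2 = 91$
These could all be found on a calculator (if you enter the data into a calculator).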
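$$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 209.4 - \frac{21 \times 52.6}{7} = 51.6$$
$$S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n} = 91 - \frac{21^2}{7} = 28$$
So, $\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{51.6}{28} = 1.843$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 7.514 - 1.843 \times 3 = 1.985$.
So the equation of the regression line is $\hat y = 1.985 + 1.843x$.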
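The hand calculation for Example 1 can be checked with a few lines of Python. This is a minimal sketch using only the data table from the example; the variable names are illustrative choices.

```python
# Data from Example 1.
times = [0, 1, 2, 3, 4, 5, 6]                         # x, hours
concentrations = [2.4, 4.3, 5.0, 6.9, 9.1, 11.4, 13.5]  # y

n = len(times)
sum_x = sum(times)
sum_y = sum(concentrations)
sum_xy = sum(x * y for x, y in zip(times, concentrations))
sum_xx = sum(x * x for x in times)

# Shortcut formulas for S_xy and S_xx.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n

# Least squares estimates of slope and intercept.
slope = s_xy / s_xx
intercept = sum_y / n - slope * sum_x / n

print(f"S_xy = {s_xy:.1f}, S_xx = {s_xx:.0f}")
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Carried at full precision the intercept comes out at about 1.9857. The value 1.985 in the worked solution results from using the rounded intermediate values 7.514 and 1.843.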
To work out the concentration after 3.5 hours: $\hat y = 1.985 + 1.843 \times 3.5 = 8.44$ (3 s.f.)
To find how long it would be before the concentration reaches 8 units, we substitute $\hat y = 8$ into the regression equation: $8 = 1.985 + 1.843x$
Solving this we get: x = 3.26 hours
Note: It would not be sensible to predict the concentration after 8 hours from this equation, because we don't know whether the relationship will continue to be linear. The process of trying to predict a value from outside the range of your data is called extrapolation.
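The two calculations above, together with the warning about extrapolation, can be sketched in Python as follows. The coefficients are the rounded values from the worked solution. The function names and the decision to raise an error outside the observed time range (0 to 6 hours) are illustrative design choices, not something prescribed by the notes.

```python
INTERCEPT = 1.985  # rounded estimate from the worked solution
SLOPE = 1.843      # rounded estimate from the worked solution
X_MIN, X_MAX = 0.0, 6.0  # range of times actually observed in Example 1


def predict_concentration(hours):
    """Predict concentration from the fitted line, refusing to extrapolate."""
    if not X_MIN <= hours <= X_MAX:
        raise ValueError(
            f"{hours} hours is outside the observed range "
            f"[{X_MIN}, {X_MAX}]; this would be extrapolation"
        )
    return INTERCEPT + SLOPE * hours


def time_to_reach(concentration):
    """Solve concentration = INTERCEPT + SLOPE * x for x."""
    return (concentration - INTERCEPT) / SLOPE


print(round(predict_concentration(3.5), 2))  # concentration after 3.5 hours
print(round(time_to_reach(8.0), 2))          # hours to reach 8 units
```

Calling `predict_concentration(8)` raises a `ValueError`, which mirrors the note that a prediction at 8 hours lies outside the range of the data.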
Example 2: The heights and weights of a sample of 11 students are:
Height (m), h     1.36  1.47  1.54  1.56  1.59  1.63  1.66  1.67  1.69  1.74  1.81
Weight (kg), w    52    50    67    62    69    74    59    87    77    73    67
[ $n = 11$, $\sum h = 17.72$, $\sum h^2 = 28.705$, $\sum w = 737$, $\sum w^2 = 50571$, $\sum hw = 1196.1$ ]
a) Calculate the regression line of w on h.
b) Use the regression line to estimate the weight of someone whose height is 1.6 m.
Note: Both height and weight are referred to as random variables, since their values could not have been predicted before the data were collected. If the sampling were repeated, different values would be obtained for the heights and weights.
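The summary statistics quoted in Example 2 can be reproduced from the raw data. The sketch below assumes that the heights and weights are paired in the order in which they are listed in the table.

```python
# Data from Example 2, assumed to be paired in the order listed.
heights = [1.36, 1.47, 1.54, 1.56, 1.59, 1.63, 1.66, 1.67, 1.69, 1.74, 1.81]
weights = [52, 50, 67, 62, 69, 74, 59, 87, 77, 73, 67]

n = len(heights)
sum_h = sum(heights)
sum_hh = sum(h * h for h in heights)
sum_w = sum(weights)
sum_ww = sum(w * w for w in weights)
sum_hw = sum(h * w for h, w in zip(heights, weights))

print(f"n = {n}")
print(f"sum h = {sum_h:.2f}, sum h^2 = {sum_hh:.3f}")
print(f"sum w = {sum_w}, sum w^2 = {sum_ww}")
print(f"sum hw = {sum_hw:.1f}")
```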
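Solution: a) We begin by finding the mean of each variable:
$\bar h = \frac{\sum h}{n} = \frac{17.72}{11} = 1.6109\ldots$
$\bar w = \frac{\sum w}{n} = \frac{737}{11} = 67$
Next we find the sums of squares:
$$S_{hh} = \sum h^2 - \frac{\left(\sum h\right)^2}{n} = 28.705 - \frac{17.72^2}{11} = 0.1597$$
$$S_{ww} = \sum w^2 - \frac{\left(\sum w\right)^2}{n} = 50571 - \frac{737^2}{11} = 1192$$
$$S_{hw} = \sum hw - \frac{\sum h \sum w}{n} = 1196.1 - \frac{17.72 \times 737}{11} = 8.86$$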
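The equation of the least squares regression line is:
$\hat w = \hat\beta_0 + \hat\beta_1 h$
where $\hat\beta_1 = \frac{S_{hw}}{S_{hh}} = \frac{8.86}{0.1597} = 55.5$ and $\hat\beta_0 = \bar w - \hat\beta_1 \bar h = 67 - 55.5 \times 1.6109 = -22.4$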
So the equation of the regression line of w on h is:
$\hat w = -22.4 + 55.5h$
b) To find the weight for someone who is 1.6 m tall: $\hat w = -22.4 + 55.5 \times 1.6 = 66.4$ kg
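When only the summary totals are available, as in the bracketed line of Example 2, the fit can be computed directly from them. The function below is a hypothetical helper written for this illustration; it applies the shortcut formulas for $S_{hw}$ and $S_{hh}$.

```python
def fit_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (intercept, slope) of the least squares line of y on x,
    computed from summary totals only."""
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


# Summary totals quoted in Example 2 (x = height h, y = weight w).
intercept, slope = fit_from_sums(
    n=11, sum_x=17.72, sum_y=737, sum_xx=28.705, sum_xy=1196.1
)

# Estimated weight at a height of 1.6 m.
estimated_weight = intercept + slope * 1.6

print(f"w_hat = {intercept:.1f} + {slope:.1f} h")
print(f"estimated weight at 1.6 m: {estimated_weight:.1f} kg")
```

Rounded to one decimal place these agree with the hand calculation. The unrounded slope is slightly below 55.5, which is why carrying full precision can change later digits.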
Simple Linear Regression: 2. Measures of Variation
Objectives: measures of variation, the goodness-of-fit measure, and the correlation coefficient
Sums of Squares
Total sum of squares = Regression sum of squares + Error sum of squares: $SST = SSR + SSE$ (Total variation = Explained variation + Unexplained variation)
Total sum of squares (Total Variation): $SST = \sum (Y_i - \bar Y)^2$
Regression sum of squares (Explained Variation by the Regression): $SSR = \sum (\hat Y_i - \bar Y)^2$
Error sum of squares (Unexplained Variation): $SSE = \sum (Y_i - \hat Y_i)^2$
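As a numerical illustration, the three sums of squares can be computed for the drip-feed data of Example 1, and the decomposition SST = SSR + SSE checked. This is a sketch; the variable names are illustrative.

```python
# Data from Example 1.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [2.4, 4.3, 5.0, 6.9, 9.1, 11.4, 13.5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares fit via S_xy / S_xx.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)             # explained variation
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # unexplained variation

print(f"SST = {sst:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

The decomposition holds exactly for a least squares line that includes an intercept; in floating-point arithmetic the two sides agree up to rounding error.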
Coefficients of Determination and Correlation
Coefficient of Determination: it is a measure of the regression goodness-of-fit.
It also represents the proportion of variation in Y "explained" by the regression on X:
$$R^2 = \frac{SSR}{SST}; \qquad 0 \le R^2 \le 1$$
Pearson (Product-Moment) Correlation Coefficient -- measure of the direction and strength of the linear association between Y and X
The sample correlation is denoted by r and is closely related to the coefficient of determination as follows:
$$r = \operatorname{sign}(\hat\beta_1)\sqrt{R^2}; \qquad -1 \le r \le 1$$
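The sample correlation is indeed defined by the following formula:
$$r = \frac{\sum (x - \bar x)(y - \bar y)}{\sqrt{\left[\sum (x - \bar x)^2\right]\left[\sum (y - \bar y)^2\right]}} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}$$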
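The relationship between r and $R^2$ can be checked numerically. The sketch below uses the height and weight data of Example 2 (assumed paired in the order listed). It computes r from the last form of the formula above, computes $R^2 = SSR/SST$ separately from the fitted values, and compares the two.

```python
import math

# Data from Example 2, assumed to be paired in the order listed.
xs = [1.36, 1.47, 1.54, 1.56, 1.59, 1.63, 1.66, 1.67, 1.69, 1.74, 1.81]
ys = [52, 50, 67, 62, 69, 74, 59, 87, 77, 73, 67]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
sum_yy = sum(y * y for y in ys)

# Sample correlation from the computational form of the formula.
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
)

# R^2 = SSR / SST, computed separately from the fitted regression line.
x_bar, y_bar = sum_x / n, sum_y / n
slope = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in xs]
r_squared = sum((f - y_bar) ** 2 for f in fitted) / sum(
    (y - y_bar) ** 2 for y in ys
)

# r should equal sign(slope) * sqrt(R^2).
r_from_r_squared = math.copysign(math.sqrt(r_squared), slope)

print(f"r = {r:.4f}")
print(f"R^2 = {r_squared:.4f}")
print(f"sign(slope) * sqrt(R^2) = {r_from_r_squared:.4f}")
```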
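The corresponding population correlation between Y and X is denoted by $\rho$ and defined by:
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$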
Therefore one can see that in the population correlation definition, both X and Y are assumed to be random. When the joint distribution of X and Y is bivariate normal, one can perform the following t-test to test whether the population correlation is zero:
Hypotheses: $H_0: \rho = 0$ (no correlation) versus $H_A: \rho \ne 0$ (correlation exists)
Test statistic:
$$t_0 = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \overset{H_0}{\sim} t_{n-2}$$
Note: One can show that this t-test is indeed the same as the t-test for the regression slope, $H_0: \beta_1 = 0$, shown in the following section.
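The test statistic, and its agreement with the slope t-statistic mentioned in the note, can be illustrated with the summary totals of Example 2. The sketch assumes the usual standard error of the slope, $\sqrt{MSE / S_{xx}}$ with $MSE = SSE/(n-2)$, which is not derived in this part of the notes. It stops at the test statistic: a p-value would additionally need the t distribution with n - 2 degrees of freedom, which is not in the Python standard library.

```python
import math

# Summary totals quoted in Example 2 (x = height h, y = weight w).
n = 11
sum_h, sum_hh = 17.72, 28.705
sum_w, sum_ww = 737.0, 50571.0
sum_hw = 1196.1

s_hh = sum_hh - sum_h ** 2 / n
s_ww = sum_ww - sum_w ** 2 / n
s_hw = sum_hw - sum_h * sum_w / n

# t statistic for H0: rho = 0, based on the sample correlation.
r = s_hw / math.sqrt(s_hh * s_ww)
degrees_of_freedom = n - 2
t_corr = r / math.sqrt((1 - r ** 2) / degrees_of_freedom)

# t statistic for H0: beta_1 = 0, based on the fitted slope.
# Assumes SE(slope) = sqrt(MSE / S_hh), with MSE = SSE / (n - 2).
slope = s_hw / s_hh
sse = s_ww - s_hw ** 2 / s_hh
std_err_slope = math.sqrt(sse / degrees_of_freedom / s_hh)
t_slope = slope / std_err_slope

print(f"r = {r:.4f}, df = {degrees_of_freedom}")
print(f"t (correlation) = {t_corr:.4f}")
print(f"t (slope)       = {t_slope:.4f}")
```

The two statistics coincide up to rounding error, as the note states.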
Note: The sample correlation is not an unbiased estimator of the population correlation. You can study this and other properties from the wiki site:
Example 3: The following example tabulates the relations between trunk diameter and tree height.