2 Linear Least Squares
The linear model is the main technique in regression problems and the primary
tool for it is least squares fitting. We minimize a sum of squared errors, or
equivalently the sample average of squared errors. That is a natural choice
when we're interested in finding the regression function which minimizes the
corresponding expected squared error. This chapter presents the basic theory
of linear least squares estimation, looking at it with calculus, linear algebra
and geometry. It also develops some distribution theory for linear least squares
and computational aspects of linear regression. We will draw repeatedly on the
material here in later chapters that look at specific data analysis problems.
2.1 Least squares estimates
We begin with observed response values $y_1, y_2, \dots, y_n$ and features $z_{ij}$ for $i = 1, \dots, n$ and $j = 1, \dots, p$. We're looking for the parameter values $\beta_1, \dots, \beta_p$ that minimize

    S(\beta_1, \dots, \beta_p) = \sum_{i=1}^n \Bigl( y_i - \sum_{j=1}^p z_{ij}\beta_j \Bigr)^2.    (2.1)

The minimizing values are denoted with hats: $\hat\beta_1, \dots, \hat\beta_p$.
To minimize $S$, we set its partial derivative with respect to each $\beta_j$ to 0. The solution satisfies

    \frac{\partial}{\partial \beta_j} S = -2 \sum_{i=1}^n \Bigl( y_i - \sum_{k=1}^p z_{ik}\hat\beta_k \Bigr) z_{ij} = 0, \qquad j = 1, \dots, p.    (2.2)
We'll show later that this indeed gives the minimum, not the maximum or a saddle point. The $p$ equations in (2.2) are known as the normal equations. This is due to normal being a synonym for perpendicular or orthogonal, and not due to any assumption about the normal distribution. Consider the vector $Z_j = (z_{1j}, \dots, z_{nj})' \in \mathbb{R}^n$ of values for the $j$'th feature. Equation (2.2) says that this feature vector has a dot product of zero with the residual vector having $i$'th element $\hat\varepsilon_i = y_i - \sum_{j=1}^p z_{ij}\hat\beta_j$. Each feature vector is orthogonal (normal) to the vector of residuals.
It is worthwhile writing equations (2.1) and (2.2) in matrix notation. Though the transition from (2.1) to (2.2) is simpler with coordinates, our later manipulations are easier in vector form. Let's pack the responses and feature values into the vector $Y$ and matrix $Z$ via

    Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\text{and}\quad Z = \begin{pmatrix} z_{11} & z_{12} & \dots & z_{1p} \\ z_{21} & z_{22} & \dots & z_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n1} & z_{n2} & \dots & z_{np} \end{pmatrix}.

We also put $\beta = (\beta_1, \dots, \beta_p)'$ and let $\hat\beta$ denote the minimizer of the sum of squares. Finally let $\hat\varepsilon = Y - Z\hat\beta$, the $n$ vector of residuals.
Now equation (2.1) can be written $S(\beta) = (Y - Z\beta)'(Y - Z\beta)$ and the normal equations (2.2) become

    \hat\varepsilon\,' Z = 0    (2.3)

after dividing by $-2$. These normal equations can also be written

    Z'Z \hat\beta = Z'Y.    (2.4)

When $Z'Z$ is invertible then we find that

    \hat\beta = (Z'Z)^{-1} Z'Y.    (2.5)

We'll suppose at first that $Z'Z$ is invertible and then consider the singular case in Section 2.7.
Equation (2.5) underlies another meaning of the word 'linear' in linear regression. The estimated coefficient $\hat\beta$ is a fixed linear combination of $Y$, meaning that we get it by multiplying $Y$ by the matrix $(Z'Z)^{-1}Z'$. The predicted value of $Y$ at any new point $x_0$ with features $z_0 = \phi(x_0)$ is also linear in $Y$; it is $z_0'(Z'Z)^{-1}Z'Y$.
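As a concrete illustration of (2.5), here is a minimal numpy sketch on simulated data; the data, sizes, and variable names are invented for this example and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
Z = rng.normal(size=(n, p))            # feature matrix: n cases, p features
beta_true = np.array([1.0, -2.0, 0.5])
Y = Z @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations Z'Z beta = Z'Y, i.e. beta_hat = (Z'Z)^{-1} Z'Y
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# An equivalent, numerically more stable SVD-based solver
beta_lstsq, *_ = np.linalg.lstsq(Z, Y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))   # True
```

In practice one solves the linear system rather than forming $(Z'Z)^{-1}$ explicitly, since the explicit inverse is both slower and less accurate.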
Now let's prove that $\hat\beta = (Z'Z)^{-1}Z'Y$ is in fact the minimizer, not just a point where the gradient of the sum of squares vanishes. It seems obvious that a sum of squares like (2.1) cannot have a maximizing $\beta$. But we need to rule out saddle points too, and we'll also find that $\hat\beta$ is the unique least squares estimator. Since we already found an expression for $\hat\beta$ we prove it is right by expressing a generic $\tilde\beta \in \mathbb{R}^p$ as $\hat\beta + (\tilde\beta - \hat\beta)$ and then expanding $S(\hat\beta + (\tilde\beta - \hat\beta))$. This adding and subtracting technique is often applicable in problems featuring squared errors.
Theorem 2.1. Let $Z$ be an $n \times p$ matrix with $Z'Z$ invertible, and let $Y$ be an $n$ vector. Define $S(\beta) = (Y - Z\beta)'(Y - Z\beta)$ and set $\hat\beta = (Z'Z)^{-1}Z'Y$. Then $S(\beta) > S(\hat\beta)$ holds whenever $\beta \ne \hat\beta$.

Proof. We know that $Z'(Y - Z\hat\beta) = 0$ and will use it below. Let $\tilde\beta$ be any point in $\mathbb{R}^p$ and let $\gamma = \tilde\beta - \hat\beta$. Then

    S(\tilde\beta) = (Y - Z\tilde\beta)'(Y - Z\tilde\beta)
                   = (Y - Z\hat\beta - Z\gamma)'(Y - Z\hat\beta - Z\gamma)
                   = (Y - Z\hat\beta)'(Y - Z\hat\beta) - \gamma'Z'(Y - Z\hat\beta) - (Y - Z\hat\beta)'Z\gamma + \gamma'Z'Z\gamma
                   = S(\hat\beta) + \gamma'Z'Z\gamma,

because $Z'(Y - Z\hat\beta) = 0$ kills both cross terms. Thus $S(\tilde\beta) = S(\hat\beta) + \|Z\gamma\|^2 \ge S(\hat\beta)$. It follows that $\hat\beta$ is a minimizer of $S$. For uniqueness we need to show that $\gamma \ne 0$ implies $Z\gamma \ne 0$. Suppose to the contrary that $Z\gamma = 0$ for $\gamma \ne 0$. Then we would have $Z'Z\gamma = 0$ for $\gamma \ne 0$, but this contradicts the assumption that $Z'Z$ is invertible. Therefore if $\tilde\beta \ne \hat\beta$ then

    S(\tilde\beta) = S(\hat\beta) + \|Z(\tilde\beta - \hat\beta)\|^2 > S(\hat\beta).
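The identity in the proof is easy to check numerically. The sketch below (again with invented data) confirms that $S(\tilde\beta) = S(\hat\beta) + \|Z(\tilde\beta - \hat\beta)\|^2$ for an arbitrary $\tilde\beta$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
Z = rng.normal(size=(n, p))
Y = rng.normal(size=n)

S = lambda b: float((Y - Z @ b) @ (Y - Z @ b))   # S(beta) = ||Y - Z beta||^2

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
beta_tilde = rng.normal(size=p)                  # any other point in R^p
gamma = beta_tilde - beta_hat

# S(beta_tilde) decomposes exactly as S(beta_hat) + ||Z gamma||^2
lhs = S(beta_tilde)
rhs = S(beta_hat) + float((Z @ gamma) @ (Z @ gamma))
print(np.isclose(lhs, rhs))                      # True
```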
2.2 Geometry of least squares
Figure xxx shows a sketch to illustrate linear least squares. The vector $y = (y_1, \dots, y_n)'$ is represented as a point in $\mathbb{R}^n$. The set

    M = \{ Z\beta \mid \beta \in \mathbb{R}^p \}    (2.6)

is a $p$ dimensional linear subset of $\mathbb{R}^n$. It is fully $p$ dimensional here because we have assumed that $Z'Z$ is invertible and so $Z$ has rank $p$. Under our model, $E(Y) = Z\beta \in M$.

The idea behind least squares is to find $\hat\beta$ so that $\hat y = Z\hat\beta$ is the closest point to $y$ from within $M$. We expect to find this closest point to $y$ by "dropping a perpendicular line" to $M$. That is, the error $\hat\varepsilon = y - \hat y$ should be perpendicular to any line within $M$. From the normal equations (2.3), we have $\hat\varepsilon\,'Z = 0$ so $\hat\varepsilon$ is actually perpendicular to every point $Z\beta \in M$. Therefore $\hat\varepsilon$ is perpendicular to every line in $M$.

We can form a right angle triangle using the three points $y$, $\hat y$ and $Z\tilde\beta$ for any $\tilde\beta \in \mathbb{R}^p$. Taking $\tilde\beta = 0$ and using Pythagoras' theorem we get

    \|y\|^2 = \|\hat\varepsilon\|^2 + \|Z\hat\beta\|^2.
When the first column of $Z$ consists of 1s then $(\bar y, \dots, \bar y)' = Z(\bar y, 0, \dots, 0)' \in M$ and Pythagoras' theorem implies

    \sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2.    (2.7)
The left side of (2.7) is called the centered sum of squares of the $y_i$. It is $n - 1$ times the usual estimate of the common variance of the $Y_i$. The equation decomposes this sum of squares into two parts. The first is the centered sum of squares of the fitted values $\hat y_i$. The second is the sum of squared model errors. When the first column of $Z$ consists of 1s then $(1/n)\sum_{i=1}^n \hat y_i = \bar y$. That is, $\bar{\hat y} = \bar y$. Now the two terms in (2.7) correspond to the sum of squares of the fitted values $\hat y_i$ about their mean and the sum of squared residuals. We write this as $SS_{\rm TOT} = SS_{\rm FIT} + SS_{\rm RES}$. The total sum of squares is the sum of squared fits plus the sum of squared residuals.
Some of the variation in $Y$ is 'explained' by the model and some of it is left unexplained. The fraction explained is denoted by

    R^2 = \frac{SS_{\rm FIT}}{SS_{\rm TOT}} = 1 - \frac{SS_{\rm RES}}{SS_{\rm TOT}}.

The quantity $R^2$ is known as the coefficient of determination. It measures how well $Y$ is predicted or determined by $Z\hat\beta$. Its nonnegative square root $R$ is called the coefficient of multiple correlation. It is one measure of how well the response $Y$ correlates with the $p$ predictors in $Z$ taken collectively. For the special case of simple linear regression with $z_i = (1, x_i)' \in \mathbb{R}^2$, (Exercise xxx) $R^2$ is the square of the usual Pearson correlation of $x$ and $y$.
Equation (2.7) is an example of an ANOVA (short for analysis of variance) decomposition. ANOVA decompositions split a variance (or a sum of squares) into two or more pieces. Not surprisingly, there is typically some orthogonality, via Pythagoras' theorem, behind them.
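A short numerical check of (2.7) and of the two expressions for $R^2$, using a simulated simple linear regression (the data and variable names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
x = rng.uniform(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)
Z = np.column_stack([np.ones(n), x])      # first column of 1s, as (2.7) requires

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
y_hat = Z @ beta_hat

ss_tot = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y_hat - y.mean()) ** 2)  # fitted values have the same mean as y
ss_res = np.sum((y - y_hat) ** 2)

print(np.isclose(ss_tot, ss_fit + ss_res))   # True: SSTOT = SSFIT + SSRES
r_squared = 1.0 - ss_res / ss_tot
# For simple linear regression, R^2 equals the squared Pearson correlation
print(np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2))   # True
```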
2.3 Algebra of least squares
The predicted value for $y_i$, using the least squares estimates, is $\hat y_i = z_i'\hat\beta$, where $z_i'$ is the $i$'th row of $Z$. We can write the whole vector of fitted values as $\hat y = Z\hat\beta = Z(Z'Z)^{-1}Z'Y$. That is, $\hat y = Hy$ where

    H = Z(Z'Z)^{-1}Z'.

Tukey coined the term "hat matrix" for $H$ because it puts the hat on $y$. Some simple properties of the hat matrix are important in interpreting least squares.

By writing $H^2 = HH$ out fully and cancelling we find $H^2 = H$. A matrix $H$ with $H^2 = H$ is called idempotent. In hindsight, it is geometrically obvious that we should have had $H^2 = H$. For any $y \in \mathbb{R}^n$ the closest point to $y$ inside of $M$ is $Hy$. Since $Hy$ is already in $M$, $H(Hy) = Hy$. That is, $H^2y = Hy$ for any $y$ and so $H^2 = H$. Clearly $H^k = H$ for any integer $k \ge 1$.
The matrix $Z'Z$ is symmetric, and so therefore is $(Z'Z)^{-1}$. It follows that the hat matrix $H$ is symmetric too. A symmetric idempotent matrix such as $H$ is called a perpendicular projection matrix.
Theorem 2.2. Let H be a symmetric idempotent real valued matrix. Then the
eigenvalues of H are all either 0 or 1.
Proof. Suppose that $x$ is an eigenvector of $H$ with eigenvalue $\lambda$, so $Hx = \lambda x$. Because $H$ is idempotent, $H^2x = Hx = \lambda x$. But we also have $H^2x = H(Hx) = H(\lambda x) = \lambda^2 x$. Therefore $\lambda x = \lambda^2 x$. The definition of eigenvector does not allow $x = 0$ and so we must have $\lambda^2 = \lambda$. Either $\lambda = 0$ or $\lambda = 1$.
Let $H = P\Lambda P'$ where the columns of $P$ are eigenvectors $p_i$ of $H$ for $i = 1, \dots, n$. Then $H = \sum_{i=1}^n \lambda_i p_i p_i'$, where by Theorem 2.2 each $\lambda_i$ is 0 or 1. With no loss of generality, we can arrange for the ones to precede the zeros. Suppose that there are $r$ ones. Then $H = \sum_{i=1}^r p_i p_i'$. We certainly expect $r$ to equal $p$ here. This indeed holds. The eigenvalues of $H$ sum to $r$, so $\operatorname{tr}(H) = r$. Also $\operatorname{tr}(H) = \operatorname{tr}(Z(Z'Z)^{-1}Z') = \operatorname{tr}(Z'Z(Z'Z)^{-1}) = \operatorname{tr}(I_p) = p$. Therefore $r = p$ and $H = \sum_{i=1}^p p_i p_i'$ where the $p_i$ are mutually orthogonal $n$ vectors.
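These eigenvalue and trace facts are easy to verify numerically. A minimal sketch, with an invented random design:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 4
Z = rng.normal(size=(n, p))
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)     # hat matrix H = Z (Z'Z)^{-1} Z'

eigvals = np.linalg.eigvalsh(H)           # H is symmetric, so eigvalsh applies
# Eigenvalues are n - p zeros and p ones, in ascending order
print(np.allclose(eigvals, np.r_[np.zeros(n - p), np.ones(p)]))  # True
print(np.isclose(np.trace(H), p))         # True: tr(H) = p
print(np.allclose(H @ H, H))              # True: H is idempotent
```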
The prediction for observation $i$ can be written as $\hat y_i = H_i y$ where $H_i$ is the $i$'th row of the hat matrix. The $i$'th row of $H$ is simply $z_i'(Z'Z)^{-1}Z'$ and the $ij$ element of the hat matrix is $H_{ij} = z_i'(Z'Z)^{-1}z_j$. Because $H_{ij} = H_{ji}$, the contribution of $y_i$ to $\hat y_j$ equals that of $y_j$ to $\hat y_i$. The diagonal elements of the hat matrix will prove to be very important. They are

    H_{ii} = z_i'(Z'Z)^{-1}z_i.    (2.8)
We are also interested in the residuals $\hat\varepsilon_i = y_i - \hat y_i$. The entire vector of residuals may be written $\hat\varepsilon = y - \hat y = (I - H)y$. It is easy to see that if $H$ is a perpendicular projection matrix, then so is $I - H$. Symmetry is trivial, and $(I - H)(I - H) = I - H - H + HH = I - H$.
The model space $M = \{Z\beta \mid \beta \in \mathbb{R}^p\}$ is the set of linear combinations of columns of $Z$. A typical element of $M$ is $\sum_{j=1}^p \beta_j Z_j$. Thus $M$ is what is known as the column space of $Z$, denoted $\operatorname{col}(Z)$. The hat matrix $H$ projects vectors onto $\operatorname{col}(Z)$. The set of vectors orthogonal to the columns of $Z$, that is with $Z'v = 0$, is a linear subspace of $\mathbb{R}^n$, known as the null space of $Z'$. We may write it as $M^\perp$ or as $\operatorname{null}(Z')$. The column space and null space are orthogonal complements. Any $v \in \mathbb{R}^n$ can be uniquely written as $v_1 + v_2$ where $v_1 \in M$ and $v_2 \in M^\perp$. In terms of the hat matrix, $v_1 = Hv$ and $v_2 = (I - H)v$.
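The decomposition $v = Hv + (I - H)v$ can likewise be checked directly. A small sketch under the same kind of invented setup:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 15, 3
Z = rng.normal(size=(n, p))
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)      # projector onto col(Z)

v = rng.normal(size=n)
v1 = H @ v                                 # component in the column space M
v2 = (np.eye(n) - H) @ v                   # component in M-perp

print(np.allclose(v, v1 + v2))             # True: v = v1 + v2
print(np.isclose(v1 @ v2, 0.0))            # True: the two pieces are orthogonal
print(np.allclose(Z.T @ v2, 0.0))          # True: v2 satisfies Z'v2 = 0
```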
2.4 Distributional results
We continue to suppose that $X$ and hence $Z$ is a nonrandom matrix and that $Z$ has full rank. The least squares model has $Y = Z\beta + \varepsilon$. Then

    \hat\beta = (Z'Z)^{-1}Z'Y = (Z'Z)^{-1}(Z'Z\beta + Z'\varepsilon) = \beta + (Z'Z)^{-1}Z'\varepsilon.    (2.9)

The only randomness on the right side of (2.9) comes from $\varepsilon$. This makes it easy to work out the mean and variance of $\hat\beta$ in the fixed-$x$ setting.
Lemma 2.1. If $Y = Z\beta + \varepsilon$ where $Z$ is nonrandom and $Z'Z$ is invertible and $E(\varepsilon) = 0$, then

    E(\hat\beta) = \beta.    (2.10)
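Lemma 2.1 can be illustrated by simulation: averaging $\hat\beta$ over many independent noise draws with a fixed design should recover $\beta$. An illustrative sketch (the design, $\beta$, and replication count are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 30, 2, 5000
Z = rng.normal(size=(n, p))                # fixed (nonrandom) design
beta = np.array([1.5, -0.7])
ZtZ_inv_Zt = np.linalg.solve(Z.T @ Z, Z.T) # the fixed matrix (Z'Z)^{-1} Z'

# Average beta_hat over many mean-zero noise draws; by Lemma 2.1 the
# average should be close to the true beta
estimates = np.array([ZtZ_inv_Zt @ (Z @ beta + rng.normal(size=n))
                      for _ in range(reps)])
print(estimates.mean(axis=0))              # approximately [1.5, -0.7]
```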