
2  Linear Least Squares

The linear model is the main technique in regression problems and the primary tool for it is least squares fitting. We minimize a sum of squared errors, or equivalently the sample average of squared errors. That is a natural choice when we're interested in finding the regression function which minimizes the corresponding expected squared error. This chapter presents the basic theory of linear least squares estimation, looking at it with calculus, linear algebra and geometry. It also develops some distribution theory for linear least squares and computational aspects of linear regression. We will draw repeatedly on the material here in later chapters that look at specific data analysis problems.

2.1  Least squares estimates

We begin with observed response values $y_1, y_2, \dots, y_n$ and features $z_{ij}$ for $i = 1, \dots, n$ and $j = 1, \dots, p$. We're looking for the parameter values $\beta_1, \dots, \beta_p$ that minimize
$$
S(\beta_1, \dots, \beta_p) = \sum_{i=1}^n \Bigl( y_i - \sum_{j=1}^p z_{ij}\beta_j \Bigr)^2. \qquad (2.1)
$$
The minimizing values are denoted with hats: $\hat\beta_1, \dots, \hat\beta_p$.

To minimize $S$, we set its partial derivative with respect to each $\beta_j$ to 0. The solution satisfies
$$
\frac{\partial}{\partial\beta_j} S = -2 \sum_{i=1}^n \Bigl( y_i - \sum_{k=1}^p z_{ik}\hat\beta_k \Bigr) z_{ij} = 0, \qquad j = 1, \dots, p. \qquad (2.2)
$$


We'll show later that this indeed gives the minimum, not a maximum or a saddle point. The $p$ equations in (2.2) are known as the normal equations. This is due to normal being a synonym for perpendicular or orthogonal, and not due to any assumption about the normal distribution. Consider the vector $Z_j = (z_{1j}, \dots, z_{nj})' \in \mathbb{R}^n$ of values for the $j$'th feature. Equation (2.2) says that this feature vector has a dot product of zero with the residual vector whose $i$'th element is $\hat\varepsilon_i = y_i - \sum_{j=1}^p z_{ij}\hat\beta_j$. Each feature vector is orthogonal (normal) to the vector of residuals.
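Although matrix notation is only introduced below, a quick numerical check of (2.2) may help. The following sketch (NumPy and synthetic data; the variable names are illustrative, not from the text) fits the $\hat\beta_j$ and verifies that each feature vector has essentially zero dot product with the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
Z = rng.normal(size=(n, p))          # synthetic n x p array of features (illustrative)
y = rng.normal(size=n)               # synthetic responses

# lstsq minimizes the sum of squared errors in (2.1).
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta_hat             # residuals eps_i = y_i - sum_j z_ij * beta_hat_j

# Each feature column is orthogonal (normal) to the residual vector, as in (2.2).
print(Z.T @ resid)                   # all entries ~0 up to rounding error
```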

It is worthwhile writing equations (2.1) and (2.2) in matrix notation. Though the transition from (2.1) to (2.2) is simpler with coordinates, our later manipulations are easier in vector form. Let's pack the responses and feature values into the vector $Y$ and matrix $Z$ via
$$
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad\text{and}\quad
Z = \begin{pmatrix} z_{11} & z_{12} & \dots & z_{1p} \\ z_{21} & z_{22} & \dots & z_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{n1} & z_{n2} & \dots & z_{np} \end{pmatrix}.
$$
We also put $\beta = (\beta_1, \dots, \beta_p)'$ and let $\hat\beta$ denote the minimizer of the sum of squares. Finally let $\hat\varepsilon = Y - Z\hat\beta$, the $n$ vector of residuals.

Now equation (2.1) can be written $S(\beta) = (Y - Z\beta)'(Y - Z\beta)$ and the normal equations (2.2) become
$$
\hat\varepsilon\,'Z = 0 \qquad (2.3)
$$
after dividing by $-2$. These normal equations can also be written
$$
Z'Z\hat\beta = Z'Y. \qquad (2.4)
$$
When $Z'Z$ is invertible we find that
$$
\hat\beta = (Z'Z)^{-1}Z'Y. \qquad (2.5)
$$
We'll suppose at first that $Z'Z$ is invertible and then consider the singular case in Chapter 2.7.

Equation (2.5) underlies another meaning of the word 'linear' in linear regression. The estimated coefficient $\hat\beta$ is a fixed linear combination of $Y$, meaning that we get it by multiplying $Y$ by the matrix $(Z'Z)^{-1}Z'$. The predicted value of $Y$ at any new point $x_0$ with features $z_0 = \phi(x_0)$ is also linear in $Y$; it is $z_0'(Z'Z)^{-1}Z'Y$.
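As a computational sketch (not from the text; data and names are illustrative), $\hat\beta$ in (2.5) can be obtained by solving the normal equations (2.4), and a prediction at a new feature vector $z_0$ is a fixed linear function of $Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
Z = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Solve the normal equations Z'Z beta = Z'Y rather than forming (Z'Z)^{-1} explicitly.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

# np.linalg.lstsq minimizes ||Y - Z beta||^2 directly and should agree.
assert np.allclose(beta_hat, np.linalg.lstsq(Z, Y, rcond=None)[0])

# Prediction at a new feature vector z0 is z0' (Z'Z)^{-1} Z' Y, i.e. linear in Y.
z0 = rng.normal(size=p)
weights = np.linalg.solve(Z.T @ Z, z0) @ Z.T   # the fixed vector that multiplies Y
print(z0 @ beta_hat, weights @ Y)              # identical up to rounding
```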

Now let's prove that $\hat\beta = (Z'Z)^{-1}Z'Y$ is in fact the minimizer, not just a point where the gradient of the sum of squares vanishes. It seems obvious that a sum of squares like (2.1) cannot have a maximizing $\beta$. But we need to rule out saddle points too, and we'll also find that $\hat\beta$ is the unique least squares estimator. Since we already found an expression for $\hat\beta$, we prove it is right by expressing a generic $\tilde\beta \in \mathbb{R}^p$ as $\hat\beta + (\tilde\beta - \hat\beta)$ and then expanding $S(\hat\beta + (\tilde\beta - \hat\beta))$. This adding and subtracting technique is often applicable in problems featuring squared errors.


Theorem 2.1. Let $Z$ be an $n \times p$ matrix with $Z'Z$ invertible, and let $Y$ be an $n$ vector. Define $S(\beta) = (Y - Z\beta)'(Y - Z\beta)$ and set $\hat\beta = (Z'Z)^{-1}Z'Y$. Then $S(\beta) > S(\hat\beta)$ holds whenever $\beta \ne \hat\beta$.

Proof. We know that $Z'(Y - Z\hat\beta) = 0$ and will use it below. Let $\tilde\beta$ be any point in $\mathbb{R}^p$ and let $\gamma = \tilde\beta - \hat\beta$. Then
$$
\begin{aligned}
S(\tilde\beta) &= (Y - Z\tilde\beta)'(Y - Z\tilde\beta) \\
&= (Y - Z\hat\beta - Z\gamma)'(Y - Z\hat\beta - Z\gamma) \\
&= (Y - Z\hat\beta)'(Y - Z\hat\beta) - \gamma'Z'(Y - Z\hat\beta) - (Y - Z\hat\beta)'Z\gamma + \gamma'Z'Z\gamma \\
&= S(\hat\beta) + \gamma'Z'Z\gamma.
\end{aligned}
$$
Thus $S(\tilde\beta) = S(\hat\beta) + \|Z\gamma\|^2 \ge S(\hat\beta)$. It follows that $\hat\beta$ is a minimizer of $S$.

For uniqueness we need to show that $\gamma \ne 0$ implies $Z\gamma \ne 0$. Suppose to the contrary that $Z\gamma = 0$ for some $\gamma \ne 0$. Then we would have $Z'Z\gamma = 0$ for $\gamma \ne 0$, but this contradicts the assumption that $Z'Z$ is invertible. Therefore if $\tilde\beta \ne \hat\beta$ then $S(\tilde\beta) = S(\hat\beta) + \|Z(\tilde\beta - \hat\beta)\|^2 > S(\hat\beta)$. $\square$
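The identity in the proof, $S(\tilde\beta) = S(\hat\beta) + \|Z(\tilde\beta - \hat\beta)\|^2$, is easy to confirm numerically. A small sketch with simulated data (illustrative only, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
Z = rng.normal(size=(n, p))
Y = rng.normal(size=n)

S = lambda b: float((Y - Z @ b) @ (Y - Z @ b))       # S(beta) = (Y - Z beta)'(Y - Z beta)
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)

beta_tilde = beta_hat + rng.normal(size=p)           # an arbitrary competitor
gap = np.linalg.norm(Z @ (beta_tilde - beta_hat))**2
print(S(beta_tilde), S(beta_hat) + gap)              # equal up to rounding, so S(beta_tilde) > S(beta_hat)
```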

2.2  Geometry of least squares

Figure xxx shows a sketch to illustrate linear least squares. The vector $y = (y_1, \dots, y_n)'$ is represented as a point in $\mathbb{R}^n$. The set
$$
\mathcal{M} = \{ Z\beta \mid \beta \in \mathbb{R}^p \} \qquad (2.6)
$$
is a $p$ dimensional linear subset of $\mathbb{R}^n$. It is fully $p$ dimensional here because we have assumed that $Z'Z$ is invertible and so $Z$ has rank $p$. Under our model, $E(Y) = Z\beta \in \mathcal{M}$.

The idea behind least squares is to find $\hat\beta$ so that $\hat y = Z\hat\beta$ is the closest point to $y$ from within $\mathcal{M}$. We expect to find this closest point to $y$ by "dropping a perpendicular line" to $\mathcal{M}$. That is, the error $\hat\varepsilon = y - \hat y$ should be perpendicular to any line within $\mathcal{M}$. From the normal equations (2.3) we have $\hat\varepsilon\,'Z = 0$, so $\hat\varepsilon$ is actually perpendicular to every point $Z\beta \in \mathcal{M}$. Therefore $\hat\varepsilon$ is perpendicular to every line in $\mathcal{M}$.

We can form a right angle triangle using the three points $y$, $\hat y$ and $Z\tilde\beta$ for any $\tilde\beta \in \mathbb{R}^p$. Taking $\tilde\beta = 0$ and using Pythagoras' theorem we get
$$
\|y\|^2 = \|\hat\varepsilon\|^2 + \|Z\hat\beta\|^2.
$$

When the first column of $Z$ consists of 1s then $(\bar y, \dots, \bar y)' = Z(\bar y, 0, \dots, 0)' \in \mathcal{M}$ and Pythagoras' theorem implies
$$
\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2. \qquad (2.7)
$$


The left side of (2.7) is called the centered sum of squares of the $y_i$. It is $n-1$ times the usual estimate of the common variance of the $Y_i$. The equation decomposes this sum of squares into two parts. The first is the centered sum of squares of the fitted values $\hat y_i$. The second is the sum of squared residuals. When the first column of $Z$ consists of 1s then $(1/n)\sum_{i=1}^n \hat y_i = \bar y$; that is, $\bar{\hat y} = \bar y$. Now the two terms in (2.7) correspond to the sum of squares of the fitted values $\hat y_i$ about their mean and the sum of squared residuals. We write this as $\mathrm{SS}_{\mathrm{TOT}} = \mathrm{SS}_{\mathrm{FIT}} + \mathrm{SS}_{\mathrm{RES}}$. The total sum of squares is the sum of squared fits plus the sum of squared residuals.

Some of the variation in $Y$ is 'explained' by the model and some of it is left unexplained. The fraction explained is denoted by
$$
R^2 = \frac{\mathrm{SS}_{\mathrm{FIT}}}{\mathrm{SS}_{\mathrm{TOT}}} = 1 - \frac{\mathrm{SS}_{\mathrm{RES}}}{\mathrm{SS}_{\mathrm{TOT}}}.
$$
The quantity $R^2$ is known as the coefficient of determination. It measures how well $Y$ is predicted or determined by $Z\hat\beta$. Its nonnegative square root $R$ is called the coefficient of multiple correlation. It is one measure of how well the response $Y$ correlates with the $p$ predictors in $Z$ taken collectively. For the special case of simple linear regression with $z_i = (1, x_i)' \in \mathbb{R}^2$, $R^2$ is the square of the usual Pearson correlation of $x$ and $y$ (Exercise xxx).
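Here is a sketch of the decomposition (2.7) and of $R^2$ (synthetic data; the names and noise level are illustrative, not from the text), including the simple regression case where $R^2$ matches the squared Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)               # made-up simple regression data

Z = np.column_stack([np.ones(n), x])                 # first column of 1s
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
y_hat = Z @ beta_hat

ss_tot = np.sum((y - y.mean())**2)
ss_fit = np.sum((y_hat - y.mean())**2)
ss_res = np.sum((y - y_hat)**2)
print(np.isclose(ss_tot, ss_fit + ss_res))           # the decomposition (2.7)

r2 = ss_fit / ss_tot
print(r2, np.corrcoef(x, y)[0, 1]**2)                # R^2 equals the squared Pearson correlation
```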

Equation (2.7) is an example of an ANOVA (short for analysis of variance) decomposition. ANOVA decompositions split a variance (or a sum of squares) into two or more pieces. Not surprisingly there is typically some orthogonality or the Pythagoras theorem behind them.

2.3  Algebra of least squares

The predicted value for $y_i$, using the least squares estimates, is $\hat y_i = z_i'\hat\beta$, where $z_i'$ is the $i$'th row of $Z$. We can write the whole vector of fitted values as $\hat y = Z\hat\beta = Z(Z'Z)^{-1}Z'Y$. That is, $\hat y = Hy$ where
$$
H = Z(Z'Z)^{-1}Z'.
$$
Tukey coined the term "hat matrix" for $H$ because it puts the hat on $y$. Some simple properties of the hat matrix are important in interpreting least squares.

By writing $H^2 = HH$ out fully and cancelling we find $H^2 = H$. A matrix $H$ with $H^2 = H$ is called idempotent. In hindsight, it is geometrically obvious that we should have had $H^2 = H$. For any $y \in \mathbb{R}^n$ the closest point to $y$ inside of $\mathcal{M}$ is $Hy$. Since $Hy$ is already in $\mathcal{M}$, $H(Hy) = Hy$. That is, $H^2 y = Hy$ for any $y$ and so $H^2 = H$. Clearly $H^k = H$ for any integer $k \ge 1$.

The matrix $Z'Z$ is symmetric, and so therefore is $(Z'Z)^{-1}$. It follows that the hat matrix $H$ is symmetric too. A symmetric idempotent matrix such as $H$ is called a perpendicular projection matrix.
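A numerical sketch of these hat matrix properties (assumed synthetic data; $H$ is formed explicitly only for illustration, which one would avoid for large $n$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
Z = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = Z @ np.linalg.solve(Z.T @ Z, Z.T)                # H = Z (Z'Z)^{-1} Z'
print(np.allclose(H, H.T))                           # symmetric
print(np.allclose(H @ H, H))                         # idempotent: H^2 = H
print(np.allclose(H @ y, Z @ np.linalg.lstsq(Z, y, rcond=None)[0]))  # y_hat = H y
```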

Theorem 2.2. Let $H$ be a symmetric idempotent real valued matrix. Then the eigenvalues of $H$ are all either 0 or 1.


Proof. Suppose that $x$ is an eigenvector of $H$ with eigenvalue $\lambda$, so $Hx = \lambda x$. Because $H$ is idempotent, $H^2 x = Hx = \lambda x$. But we also have $H^2 x = H(Hx) = H(\lambda x) = \lambda^2 x$. Therefore $\lambda x = \lambda^2 x$. The definition of eigenvector does not allow $x = 0$ and so we must have $\lambda^2 = \lambda$. Either $\lambda = 0$ or $\lambda = 1$. $\square$

Let $H = P\Lambda P'$ where the columns of $P$ are eigenvectors $p_i$ of $H$ for $i = 1, \dots, n$. Then $H = \sum_{i=1}^n \lambda_i p_i p_i'$, where by Theorem 2.2 each $\lambda_i$ is 0 or 1. With no loss of generality, we can arrange the ones to precede the zeros. Suppose that there are $r$ ones. Then $H = \sum_{i=1}^r p_i p_i'$. We certainly expect $r$ to equal $p$ here. This indeed holds. The eigenvalues of $H$ sum to $r$, so $\mathrm{tr}(H) = r$. Also $\mathrm{tr}(H) = \mathrm{tr}(Z(Z'Z)^{-1}Z') = \mathrm{tr}(Z'Z(Z'Z)^{-1}) = \mathrm{tr}(I_p) = p$. Therefore $r = p$ and $H = \sum_{i=1}^p p_i p_i'$ where the $p_i$ are mutually orthogonal $n$ vectors.
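The eigenvalue and trace facts can be checked directly with the same kind of illustrative setup (synthetic data, not from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
Z = rng.normal(size=(n, p))
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)                # H = Z (Z'Z)^{-1} Z'

eigvals = np.sort(np.linalg.eigvalsh(H))             # eigenvalues of the symmetric matrix H
print(np.allclose(eigvals[-p:], 1.0))                # exactly p eigenvalues equal to 1 ...
print(np.allclose(eigvals[:-p], 0.0))                # ... and the remaining n - p are 0
print(np.isclose(np.trace(H), p))                    # tr(H) = p
```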

The prediction for observation $i$ can be written as $\hat y_i = H_i y$ where $H_i$ is the $i$'th row of the hat matrix. The $i$'th row of $H$ is simply $z_i'(Z'Z)^{-1}Z'$ and the $ij$ element of the hat matrix is $H_{ij} = z_i'(Z'Z)^{-1}z_j$. Because $H_{ij} = H_{ji}$, the contribution of $y_i$ to $\hat y_j$ equals that of $y_j$ to $\hat y_i$. The diagonal elements of the hat matrix will prove to be very important. They are
$$
H_{ii} = z_i'(Z'Z)^{-1}z_i. \qquad (2.8)
$$
We are also interested in the residuals $\hat\varepsilon_i = y_i - \hat y_i$. The entire vector of residuals may be written $\hat\varepsilon = y - \hat y = (I - H)y$. It is easy to see that if $H$ is a perpendicular projection matrix, then so is $I - H$. Symmetry is trivial, and $(I - H)(I - H) = I - H - H + HH = I - H$.

The model space $\mathcal{M} = \{Z\beta \mid \beta \in \mathbb{R}^p\}$ is the set of linear combinations of columns of $Z$. A typical element of $\mathcal{M}$ is $\sum_{j=1}^p \beta_j Z_j$. Thus $\mathcal{M}$ is what is known as the column space of $Z$, denoted $\mathrm{col}(Z)$. The hat matrix $H$ projects vectors onto $\mathrm{col}(Z)$. The set of vectors orthogonal to the columns of $Z$, that is with $Z'v = 0$, is a linear subspace of $\mathbb{R}^n$, known as the null space of $Z'$. We may write it as $\mathcal{M}^\perp$ or as $\mathrm{null}(Z')$. The column space and null space are orthogonal complements. Any $v \in \mathbb{R}^n$ can be uniquely written as $v = v_1 + v_2$ where $v_1 \in \mathcal{M}$ and $v_2 \in \mathcal{M}^\perp$. In terms of the hat matrix, $v_1 = Hv$ and $v_2 = (I - H)v$.
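The decomposition $v = Hv + (I - H)v$ into $\mathrm{col}(Z)$ and its orthogonal complement can also be verified numerically (an illustrative sketch with made-up data):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 25, 3
Z = rng.normal(size=(n, p))
H = Z @ np.linalg.solve(Z.T @ Z, Z.T)                # hat matrix, projects onto col(Z)

v = rng.normal(size=n)
v1, v2 = H @ v, (np.eye(n) - H) @ v                  # v1 in col(Z), v2 in its orthogonal complement
print(np.allclose(v, v1 + v2))                       # v = v1 + v2
print(np.isclose(v1 @ v2, 0.0))                      # the two pieces are orthogonal
print(np.allclose(Z.T @ v2, 0.0))                    # v2 is orthogonal to every column of Z
```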

2.4  Distributional results

We continue to suppose that $X$, and hence $Z$, is a nonrandom matrix and that $Z$ has full rank. The least squares model has $Y = Z\beta + \varepsilon$. Then
$$
\hat\beta = (Z'Z)^{-1}Z'Y = (Z'Z)^{-1}Z'(Z\beta + \varepsilon) = \beta + (Z'Z)^{-1}Z'\varepsilon. \qquad (2.9)
$$
The only randomness on the right side of (2.9) comes from $\varepsilon$. This makes it easy to work out the mean and variance of $\hat\beta$ in the fixed $x$ setting.

Lemma 2.1. If $Y = Z\beta + \varepsilon$ where $Z$ is nonrandom, $Z'Z$ is invertible and $E(\varepsilon) = 0$, then
$$
E(\hat\beta) = \beta. \qquad (2.10)
$$
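A Monte Carlo sketch of (2.10) under assumed Gaussian errors (the design, true $\beta$ and noise level are made up for illustration): averaging $\hat\beta$ over many simulated data sets recovers $\beta$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 3
Z = rng.normal(size=(n, p))                          # fixed (nonrandom) full rank design
beta = np.array([1.0, -2.0, 0.5])                    # illustrative true coefficients

estimates = []
for _ in range(2000):
    eps = rng.normal(scale=0.7, size=n)              # errors with E(eps) = 0
    Y = Z @ beta + eps
    estimates.append(np.linalg.solve(Z.T @ Z, Z.T @ Y))

print(np.mean(estimates, axis=0))                    # close to beta, illustrating E(beta_hat) = beta
```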
