Errors - University of Michigan



3.2 Least Squares and Orthogonal Projections.

In the previous section we looked at the simplest least squares problem, namely projection on a line. In this section we look at more general least squares problems. The term "least squares" arises in a number of contexts. We first look at the "algebraic least squares" problem and its solution via orthogonal projection. We shall see that it provides a substitute for a solution to a system of linear equations that has no solution. After that we see that the algebraic least squares problem is related to the "data fitting least squares" problem.

Algebraic Least Squares Problem. Given vectors a1, a2, …, an and b, each with m components, find the linear combination w = x1a1 + … + xnan of a1, a2, …, an that is closest to b.

If we let S = {x1a1 + … + xnan: x1, …, xn are numbers} be the subspace of all linear combinations of a1, …, an, then we want to find the element w of S that is closest to b. This w is called the orthogonal projection of b onto S.

If b can be written as a linear combination of a1, a2, …, an then b itself would be the linear combination closest to b. However, if there are fewer vectors a1, a2, …, an than the vectors have components (i.e. n < m), then it may not be possible to write b itself as a linear combination of a1, a2, …, an.

If x1, x2, …, xn are scalars then the linear combination x1a1 + x2a2 + … + xnan can be written as

x1a1 + x2a2 + … + xnan = [a1 a2 … an](x1, x2, …, xn)T = Ax

where A is the matrix whose columns are a1, a2, …, an and x is the vector with components x1, x2, …, xn. So the least squares problem can be restated as follows.

Least Squares Problem (Restatement #1). Given an m × n matrix A and a vector b with m components, find x with n components so that w = Ax is closest to b.

If the equation Ax = b has a solution then that x would be the solution of the least squares problem. However, if A has more rows than columns (n < m) then it may not be possible to solve Ax = b, and the solution x of the least squares problem provides a substitute for a solution, called the least squares solution.

With this in mind the least squares problem can be restated as follows.

Least Squares Problem (Restatement #2). Find x so as to minimize || b − Ax ||, i.e. find x so that || b − Ax || ≤ || b − Ay || for all vectors y.

Geometrically, || b − Ax || will be smaller than || b − Ay || if b − Ax is perpendicular to Ay − Ax: in two or three dimensions this is a consequence of the Pythagorean theorem applied to the right triangle with legs b − Ax and Ay − Ax and hypotenuse b − Ay. To translate this into algebraic terms, we use the dot product.

Theorem 1. Given A and b the vector x satisfies

(1) || b − Ax || ≤ || b − Ay || for all vectors y

if and only if x satisfies

(2) (b – Ax) . (Az) = 0 for all vectors z

Proof. Note that (1) holds if and only if

(3) || b − Ax ||² ≤ || b − Ay ||² for all vectors y

Using the distributive property of dot products one has

(4) || b − Ay ||² = (b − Ay) . (b − Ay) = ((b − Ax) − A(y − x)) . ((b − Ax) − A(y − x))

= (b − Ax) . (b − Ax) − 2(b − Ax) . (A(y − x)) + (A(y − x)) . (A(y − x))

= || b − Ax ||² − 2(b − Ax) . (A(y − x)) + || A(y − x) ||²

In particular, if (2) holds then the middle term vanishes (take z = y − x), so

|| b − Ay ||² = || b − Ax ||² + || A(y − x) ||²

so that (3) holds. Conversely, suppose that (3) holds. Then let y = x + tz in (4), where t is an arbitrary scalar and z is an arbitrary vector. Then one obtains the following quadratic function of t.

q(t) = || b − A(x + tz) ||² = || b − Ax ||² − 2t(b − Ax) . (Az) + t²|| Az ||²

If (3) holds then q(t) has a minimum at t = 0. This implies q′(0) = 0. However, q′(t) = −2(b − Ax) . (Az) + 2t|| Az ||², so 0 = q′(0) = −2(b − Ax) . (Az). Thus (b − Ax) . (Az) = 0 for all z, i.e. (3) implies (2). //
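The behavior of the quadratic q(t) in the proof can be checked numerically. The sketch below uses a made-up 3 × 2 matrix A, vector b, and direction z (none of them from the text); for this A and b the vector x = (1, 1) solves the normal equations, so q(t) should be smallest at t = 0 for any z.

```python
# Illustrating the quadratic q(t) = || b - A(x + t z) ||^2 from the proof.
# A, b, z are made-up values; x = (1, 1) is the least squares solution here.
A = [(1, 0), (0, 1), (1, 1)]
b = (0, 0, 3)
x = (1, 1)
z = (2, -1)

def q(t):
    # Residual b - A(x + t z), then its squared length.
    y = (x[0] + t * z[0], x[1] + t * z[1])
    r = [b[j] - (A[j][0] * y[0] + A[j][1] * y[1]) for j in range(3)]
    return sum(rj * rj for rj in r)

print(q(-1), q(0), q(1))  # 9 3 9 -- the minimum is at t = 0
```

Since the cross term (b − Ax) . (Az) vanishes at the least squares x, q(t) here works out to 3 + 6t², a parabola with vertex at t = 0.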

The characterization (2) of the least squares problem leads to the following theorem which shows that the solution of the least squares problem can be obtained simply by solving a linear equation.

Theorem 2. Given A and b the vector x satisfies

|| b − Ax || ≤ || b − Ay || for all vectors y

if and only if x satisfies

(5) ATAx = ATb

If the columns of A are linearly independent then ATA is invertible and

x = (ATA)-1ATb

and

w = Ax = A(ATA)-1ATb = Pb

where

(6) P = A(ATA)-1AT = matrix for the orthogonal projection onto S

Proof. There is a connection between the dot product and the transpose CT of a matrix C, namely (CTu) . v = u . (Cv) for all u and v. Thus (2) is equivalent to (AT(b − Ax)) . z = 0 for all z. However, this holds if and only if AT(b − Ax) = 0, since a vector whose dot product with every z is zero must itself be zero (take z equal to the vector itself). This is equivalent to (5). //
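Theorem 2 can be tried out in a short pure-Python sketch (no libraries), using a made-up overdetermined system: A is 3 × 2, so Ax = b has no exact solution, and the least squares x comes from solving ATAx = ATb. The last line checks the orthogonality condition (2) in the form AT(b − Ax) = 0.

```python
from fractions import Fraction

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def solve2(M, v):
    # Solve the 2x2 system M x = v by Cramer's rule.
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - v[1] * M[0][1]) / det,
            (M[0][0] * v[1] - M[1][0] * v[0]) / det]

# Made-up system: 3 equations, 2 unknowns, no exact solution.
A = [[Fraction(1), Fraction(0)],
     [Fraction(0), Fraction(1)],
     [Fraction(1), Fraction(1)]]
b = [Fraction(0), Fraction(0), Fraction(3)]

At = transpose(A)
AtA = matmul(At, A)                                               # [[2, 1], [1, 2]]
Atb = [sum(At[i][k] * b[k] for k in range(3)) for i in range(2)]  # [3, 3]
x = solve2(AtA, Atb)                                              # least squares solution [1, 1]

# Orthogonality check: AT(b - Ax) should be the zero vector.
r = [b[j] - sum(A[j][i] * x[i] for i in range(2)) for j in range(3)]
print(x, [sum(At[i][k] * r[k] for k in range(3)) for i in range(2)])
```

Exact fractions are used so the orthogonality check compares against zero exactly rather than up to rounding.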

Example 1. Find the orthogonal projection w of a vector b on the plane S of all linear combinations of two vectors a1 and a2. Also find the matrix P for the orthogonal projection onto S.

In this case the entries of ATA are a1 . a1 = 3, a1 . a2 = a2 . a1 = 1 and a2 . a2 = 3, and the entries of ATb are a1 . b = 3 and a2 . b = 3. So the equations (5) are

3x1 + x2 = 3

x1 + 3x2 = 3

Multiply the first equation by three and subtract the second equation to get 8x1 = 6. So x1 = 3/4. Substitute into the first equation and solve for x2 to get x2 = 3/4. So w = x1a1 + x2a2 = (3/4)a1 + (3/4)a2. To get P note that det(ATA) = (3)(3) − (1)(1) = 8, so

(ATA)-1 = (1/8)
|3 −1|
|−1 3|

and P = A(ATA)-1AT is obtained by multiplying out these three matrices.
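To illustrate the projection matrix (6), here is a pure-Python sketch with assumed columns a1 = (1, 1, 1) and a2 = (1, −1, 1). These are not necessarily the vectors of Example 1, but they also give ATA with entries 3, 1, 1, 3, so the same normal equations arise. The checks at the end confirm two standard properties of an orthogonal projection matrix: P² = P and PT = P.

```python
from fractions import Fraction

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

# Assumed columns a1 = (1, 1, 1), a2 = (1, -1, 1); then ATA = [[3, 1], [1, 3]].
A = [[Fraction(1), Fraction(1)],
     [Fraction(1), Fraction(-1)],
     [Fraction(1), Fraction(1)]]
At = [list(row) for row in zip(*A)]

AtA = matmul(At, A)
det = AtA[0][0] * AtA[1][1] - AtA[0][1] * AtA[1][0]   # 9 - 1 = 8
inv = [[AtA[1][1] / det, -AtA[0][1] / det],
       [-AtA[1][0] / det, AtA[0][0] / det]]           # (1/8)[[3, -1], [-1, 3]]

P = matmul(matmul(A, inv), At)   # projection matrix onto span{a1, a2}
print(P)

assert matmul(P, P) == P                    # idempotent
assert P == [list(row) for row in zip(*P)]  # symmetric
```

For these columns P projects onto the plane x1 = x3, and its rows come out as (1/2, 0, 1/2), (0, 1, 0), (1/2, 0, 1/2).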

The Data Fitting Least Squares Problem. Above we looked at the algebraic least squares problem. To most people, least squares refers to the data fitting least squares problem which is related to the algebraic least squares problem.

We are given m pairs of values of s and t.

(s1, t1), …, (sm, tm)

Often we would like to summarize such a set of data values by an equation of the form

(7) t = c1f1(s) + … + cnfn(s)

where f1(s), …, fn(s) are given functions.

An important special case is where we want to summarize the data by a linear equation t = c1s + c2. Then n = 2 and f1(s) = s and f2(s) = 1.

Usually there will be no equation of the form (7) that fits the data exactly. Instead we find the equation that fits the data best in the sense of least squares. This means the following.

For each pair of data values (sj, tj) we measure the vertical distance from (sj, tj) to the curve t = c1f1(s) + … + cnfn(s). This distance is tj − (c1f1(sj) + … + cnfn(sj)). We square this distance: (tj − (c1f1(sj) + … + cnfn(sj)))². This squared value is a measure of how well the equation t = c1f1(s) + … + cnfn(s) fits the pair of data values (sj, tj). We add up all these squared values.

S = (t1 − (c1f1(s1) + … + cnfn(s1)))² + … + (tm − (c1f1(sm) + … + cnfn(sm)))²

S is a measure of how well the equation t = c1f1(s) + … + cnfn(s) fits the set (s1, t1), (s2, t2), …, (sm, tm) of data values; the smaller S the better the fit. We want to find c1, ..., cn to minimize S. This is called the equation that fits the data best in the sense of least squares.

We can transform this into the algebraic least squares problem considered above as follows. We let

b = (t1, t2, …, tm)T, a1 = (f1(s1), f1(s2), …, f1(sm))T, a2 = (f2(s1), f2(s2), …, f2(sm))T, …, an = (fn(s1), fn(s2), …, fn(sm))T

Then

b − (c1a1 + … + cnan) = the vector whose jth component is tj − (c1f1(sj) + … + cnfn(sj))

|| b − (c1a1 + … + cnan) ||² = (t1 − (c1f1(s1) + … + cnfn(s1)))² + … + (tm − (c1f1(sm) + … + cnfn(sm)))² = S

The sum of squares S in the curve fitting problem is just the square of the distance of c1a1 + … + cnan to b. So we want to find the linear combination of a1, a2, …, an closest to b. This is just a special case of the problem considered above. Let A be the matrix whose columns are a1, a2, …, an, i.e.

(8) A =
|f1(s1) f2(s1) … fn(s1)|
|f1(s2) f2(s2) … fn(s2)|
| …                    |
|f1(sm) f2(sm) … fn(sm)|

So A is the m × n matrix with fi(sj) in row j and column i. Then c1, ..., cn are the solution to the equations

(9) ATAc = ATb, where c = (c1, …, cn)T

Example 2. (taken from Linear Algebra with Applications, 3rd edition, by Gareth Williams, p. 380) A study was made of the number of traffic fatalities, t, and the age of the driver, s. The following data were collected on the number t of fatalities and the age s of the driver.

|Age of driver, s |Number of fatalities, t |
|20 |101 |
|25 |115 |
|30 |92 |
|35 |64 |

We want to find the linear equation t = c1s + c2 that best fits the data in the sense of least squares.

In this case

A =
|20 1|
|25 1|
|30 1|
|35 1|

b =
|101|
|115|
|92|
|64|

A computation shows that

ATA =
|3150 110|
|110  4|

ATb =
|9895|
|372|

So the equations (9) become

3150c1 + 110c2 = 9895

110c1 + 4c2 = 372

Solving we get c1 = −67/25 = −2.68 and c2 = 1667/10 = 166.7. So the least squares line approximating the data is t = −2.68s + 166.7.
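The arithmetic of Example 2 can be checked with a short pure-Python computation (exact fractions, no libraries). For the line t = c1s + c2 the normal equations (9) only need the sums Σs², Σs, Σst, Σt, and a 2 × 2 solve by Cramer's rule.

```python
from fractions import Fraction

data = [(20, 101), (25, 115), (30, 92), (35, 64)]   # (age s, fatalities t)

# Entries of ATA and ATb for the model t = c1*s + c2.
Sss = sum(Fraction(s * s) for s, _ in data)   # 3150
Ss  = sum(Fraction(s) for s, _ in data)       # 110
m   = len(data)                               # 4
Sst = sum(Fraction(s * t) for s, t in data)   # 9895
St  = sum(Fraction(t) for _, t in data)       # 372

# Solve the 2x2 normal equations by Cramer's rule.
det = Sss * m - Ss * Ss                       # 3150*4 - 110^2 = 500
c1 = (Sst * m - St * Ss) / det
c2 = (Sss * St - Ss * Sst) / det
print(c1, c2)   # -67/25 and 1667/10, i.e. t = -2.68 s + 166.7
```

Using Fraction keeps the answers exact, so they can be compared directly with the fractions −67/25 and 1667/10 derived above.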

Example 3. Find a and b so that the equation t = as + b/s best fits the following data in the sense of least squares.

|s |t |
|1 |1 |
|2 |2 |
|3 |−3 |

In this case

A =
|1 1|
|2 1/2|
|3 1/3|

b =
|1|
|2|
|−3|

A computation shows that

ATA =
|14 3|
|3  49/36|

ATb =
|−4|
|1|

So the equations (9) become

14a + 3b = - 4

3a + (49/36)b = 1

Solving we get a = −152/181 ≈ −0.84 and b = 468/181 ≈ 2.59. So the least squares equation approximating the data is t = −0.84s + 2.59/s.
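The same check works for Example 3 with the basis functions f1(s) = s and f2(s) = 1/s; only the entries of ATA and ATb change. This is again a sketch in pure Python with exact fractions.

```python
from fractions import Fraction

data = [(1, 1), (2, 2), (3, -3)]

def f1(s):               # basis functions for the model t = a*s + b/s
    return Fraction(s)

def f2(s):
    return Fraction(1, s)

# Entries of ATA and ATb.
a11 = sum(f1(s) ** 2 for s, _ in data)       # 14
a12 = sum(f1(s) * f2(s) for s, _ in data)    # 3
a22 = sum(f2(s) ** 2 for s, _ in data)       # 49/36
b1 = sum(f1(s) * t for s, t in data)         # -4
b2 = sum(f2(s) * t for s, t in data)         # 1

# Solve the 2x2 normal equations by Cramer's rule.
det = a11 * a22 - a12 * a12                  # 181/18
a = (b1 * a22 - b2 * a12) / det              # -152/181
b = (a11 * b2 - a12 * b1) / det              # 468/181
print(a, b, float(a), float(b))
```

Note that the exact (2, 2) entry of ATA is 1 + 1/4 + 1/9 = 49/36, which is the coefficient in the second normal equation above.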


