CSE 5523: Lecture Notes 10 Linear regression

Contents

10.1 Linear regression
10.2 Sample code
10.3 Fitting polynomial lines
10.4 Convexity
10.5 Regularization
10.6 Gradient descent
10.7 Sample code

10.1 Linear regression

Fit weights $w$ for numerical $y$ (e.g. purchases) on continuous variables $x$ (e.g. dimensions of fruits). Use a negative log Gaussian loss function, so that deviation from the prediction $w^\top x$ is Gaussian `noise':

$$L_{\text{NLG},\sigma}(y,\, w^\top x) \;=\; -\ln \mathcal{N}_{w^\top x,\,\sigma}(y) \;=\; -\ln\!\left[ \left(\frac{1}{2\pi\sigma^2}\right)^{\!\frac{1}{2}} \exp\!\left( -\frac{1}{2\sigma^2}\,(y - w^\top x)^2 \right) \right]$$
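
As a quick numerical check (a small sketch that is not part of the notes; the values of $\sigma$, $y$, and the prediction are made up), the negative log Gaussian density equals the squared error scaled by $\frac{1}{2\sigma^2}$ plus a constant that does not depend on $w$:

import numpy

sigma = 1.5            # assumed noise standard deviation (illustrative)
y, pred = 3.0, 2.2     # one observation y and its prediction w'x (illustrative)

# Negative log Gaussian density, written out directly...
nlg = -numpy.log( (1/numpy.sqrt(2*numpy.pi*sigma**2))
                  * numpy.exp( -(y - pred)**2 / (2*sigma**2) ) )

# ...equals a w-independent constant plus the scaled squared error:
print( nlg, 0.5*numpy.log(2*numpy.pi*sigma**2) + (y - pred)**2/(2*sigma**2) )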

Find parameters $w$ that minimize the negative log Gaussian likelihood (squared error) over all the data:

$$
\begin{aligned}
0 &= \frac{\partial}{\partial w} \sum_{\langle y,x\rangle\in D} L_{\text{NLG},\sigma}(y,\, w^\top x) \\
  &= \frac{\partial}{\partial w} \sum_{\langle y,x\rangle\in D} -\ln\!\left[ \left(\frac{1}{2\pi\sigma^2}\right)^{\!\frac{1}{2}} \exp\!\left( -\frac{1}{2\sigma^2}\,(y - w^\top x)^2 \right) \right] && \text{substitution} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{\partial}{\partial w}\, {-\ln}\!\left[ \left(\frac{1}{2\pi\sigma^2}\right)^{\!\frac{1}{2}} \exp\!\left( -\frac{1}{2\sigma^2}\,(y - w^\top x)^2 \right) \right] && \text{sum rule} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{\partial}{\partial w} \left[ -\ln\left(\frac{1}{2\pi\sigma^2}\right)^{\!\frac{1}{2}} + \frac{1}{2\sigma^2}\,(y - w^\top x)^2 \right] && \text{log of product is sum of logs} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{\partial}{\partial w}\, \frac{1}{2\sigma^2}\,(y - w^\top x)^2 && \text{derivative of constant} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{2\sigma^2}\, \frac{\partial}{\partial w}\,(y - w^\top x)^2 && \text{product rule} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{2\sigma^2}\, 2\,(y - w^\top x)\, \frac{\partial}{\partial w}\,(y - w^\top x) && \text{power rule} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{2\sigma^2}\, 2\,(y - w^\top x)\,(-x) && \text{power rule}
\end{aligned}
$$

Multiplying both sides by $-\sigma^2$ (which does not move the root):

$$
\begin{aligned}
0 &= \sum_{\langle y,x\rangle\in D} x\,(y - w^\top x) && \text{multiply by } {-\sigma^2} \\
  &= X^\top (y - Xw) && \text{definition of inner product} \\
  &= X^\top y - X^\top X\, w && \text{distributive axiom}
\end{aligned}
$$

This has a root where $X^\top X w = X^\top y$, i.e. $w = (X^\top X)^{-1}\, X^\top y$. (The matrix $X^\top X$ here plays the role of a concentration/precision matrix.)

10.2 Sample code

Sample code in pandas:

import sys
import numpy
import pandas

X = pandas.read_csv( sys.argv[1] )
y = pandas.read_csv( sys.argv[2] )

w = numpy.linalg.inv(X.T @ X) @ X.T @ y

print( w )

Sample input file `X.csv':

x1,x2
1,1
2,1
3,1
4,1

Sample input file `y.csv':

y
3
5
7
9

Sample output:

     y
0  2.0
1  1.0

In other words: y = 2x + 1.
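
A small aside that is not in the original notes: for larger problems, explicitly inverting $X^\top X$ can be numerically unstable, and `numpy.linalg.solve` solves the same normal equations $X^\top X w = X^\top y$ directly. A minimal sketch, assuming the same `X.csv' and `y.csv' as above:

import sys
import numpy
import pandas

X = pandas.read_csv( sys.argv[1] )
y = pandas.read_csv( sys.argv[2] )

# Solve (X'X) w = X'y without forming the inverse explicitly...
w = numpy.linalg.solve( X.T @ X, X.T @ y )

print( w )    # same weights as before: approximately [[2.], [1.]]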

10.3 Fitting polynomial lines

You can use any basis function $\phi(x)$ of the input $x$ in place of $x$, as long as the weights $w$ remain linear. For example:

$$x' = \phi(x) = [\; x[1] \;\; (x[1])^2 \;\; (x[1])^3 \;\; x[2] \;\; (x[2])^2 \;\; (x[2])^3 \;]$$

In this case the weights will be coefficients for terms in a polynomial.
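
As an illustration (not from the notes), here is a minimal sketch of a one-variable cubic fit: it builds $\phi(x) = [\, x \;\; x^2 \;\; x^3 \;\; 1 \,]$ from a single input column (assumed to be named `x1') and reuses the closed-form solution from above:

import sys
import numpy
import pandas

X = pandas.read_csv( sys.argv[1] )   # assumed to contain a column named 'x1'
y = pandas.read_csv( sys.argv[2] )

# Basis expansion: phi(x) = [ x, x^2, x^3, 1 ]; the weights stay linear...
Phi = pandas.DataFrame( { 'x1'   : X['x1'],
                          'x1^2' : X['x1']**2,
                          'x1^3' : X['x1']**3,
                          'bias' : 1.0 } )

# Same closed-form fit as before, now yielding polynomial coefficients...
w = numpy.linalg.inv(Phi.T @ Phi) @ Phi.T @ y

print( w )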

10.4 Convexity

Note that since the loss function is convex, there is only one minimum. We will see other loss functions that don't have this property, and for those we can get stuck in local optima.

10.5 Regularization

Again, we may add prior information to avoid overfitting (here the symptom is not zero counts, but wiggly lines). A good choice is to add a term to the loss that penalizes large weight parameters:

$$L_{\text{NLG},\sigma+\text{L2},\rho}(y,\, w^\top x) \;=\; -\ln \mathcal{N}_{w^\top x,\,\sigma}(y) \;-\; \ln \mathcal{N}_{0,\,\rho}(w)$$

(where $\rho$ is the standard deviation of the zero-mean Gaussian prior over each weight $w[v]$).

Optimizing this (at slope zero) gives:

$$
\begin{aligned}
0 &= \frac{\partial}{\partial w} \Bigl[\, \sum_{\langle y,x\rangle\in D} -\ln \mathcal{N}_{w^\top x,\,\sigma}(y) \;-\; \sum_{v=1}^{V} \ln \mathcal{N}_{0,\,\rho}(w[v]) \,\Bigr] \\
  &\;\;\vdots \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{2\sigma^2}\, \frac{\partial}{\partial w}\,(y - w^\top x)^2 \;+\; \sum_{v=1}^{V} \frac{1}{2\rho^2}\, \frac{\partial}{\partial w}\,(w[v])^2 && \text{product rule} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{2\sigma^2}\, 2\,(y - w^\top x)\, \frac{\partial}{\partial w}\,(y - w^\top x) \;+\; \sum_{v=1}^{V} \frac{1}{2\rho^2}\, 2\, w[v]\, \frac{\partial}{\partial w}\, w[v] && \text{power rule} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{2\sigma^2}\, 2\,(y - w^\top x)\,(-x) \;+\; \sum_{v=1}^{V} \frac{1}{2\rho^2}\, 2\, w[v] && \text{power rule}
\end{aligned}
$$

Multiplying both sides by $-\sigma^2$ (which does not move the root):

$$
\begin{aligned}
0 &= \sum_{\langle y,x\rangle\in D} x\,(y - w^\top x) \;-\; \sum_{v=1}^{V} \frac{\sigma^2}{\rho^2}\, w[v] && \text{multiply by } {-\sigma^2} \\
  &= X^\top (y - Xw) \;-\; \frac{\sigma^2}{\rho^2}\, I\, w && \text{definition of inner product} \\
  &= X^\top y \;-\; X^\top X\, w \;-\; \frac{\sigma^2}{\rho^2}\, I\, w && \text{distributive axiom} \\
  &= X^\top y \;-\; \Bigl(X^\top X + \frac{\sigma^2}{\rho^2}\, I\Bigr) w && \text{distributive axiom}
\end{aligned}
$$

which has a single root: $w = \bigl(X^\top X + \frac{\sigma^2}{\rho^2}\, I\bigr)^{-1} X^\top y$, so it is still convex.
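
A minimal sketch of this closed form (not in the original notes); the ratio $\sigma^2/\rho^2$ is collapsed into one assumed constant `lam`, and the input files are the same as in the earlier sample code:

import sys
import numpy
import pandas

lam = 1.0   # assumed value for the ratio sigma^2 / rho^2 (illustrative)

X = pandas.read_csv( sys.argv[1] )
y = pandas.read_csv( sys.argv[2] )
I = numpy.eye( len(X.columns) )

# Regularized closed form: w = (X'X + (sigma^2/rho^2) I)^-1 X'y ...
w = numpy.linalg.inv( X.T @ X + lam * I ) @ X.T @ y

print( w )

With lam = 0 this reduces to the unregularized solution above.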

10.6 Gradient descent

Log Gaussian (quadratic) loss can be sensitive to outliers. We could use absolute value loss, but it is indeterminate (there are multiple optimal solutions). People often combine the two into log-cosh loss:

$$L_{\text{LC},\delta}(y,\, w^\top x) \;=\; \ln\cosh\!\Bigl(\frac{y - w^\top x}{\delta}\Bigr)$$

(where $\delta$ is the distance at which you want to start ignoring outliers). It's quadratic in the middle (so it has a single optimum) and linear at the edges (so it resists outliers):

[Figure: $L_{\text{LC},\delta}(y, w^\top x)$ plotted against the residual $y - w^\top x$, for $\delta = 1$ and $\delta = \frac{1}{2}$.]
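
As a quick numerical check of that shape (a sketch, not from the notes): for small residuals $\ln\cosh r \approx r^2/2$, while for large residuals $\ln\cosh r \approx |r| - \ln 2$:

import numpy

for r in (0.1, 0.5, 5.0, 20.0):
    print( r,
           numpy.log(numpy.cosh(r)),   # log-cosh loss at residual r
           r**2 / 2,                   # quadratic approximation (good for small r)
           abs(r) - numpy.log(2) )     # linear approximation (good for large r)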

The gradient (derivative for multivariable functions) is:

$$
\begin{aligned}
\frac{\partial}{\partial w} \sum_{\langle y,x\rangle\in D} L_{\text{LC},\delta}(y,\, w^\top x)
  &= \frac{\partial}{\partial w} \sum_{\langle y,x\rangle\in D} \ln\cosh(y - w^\top x) && \text{substitution (taking } \delta = 1 \text{, as in the sample code)} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{\partial}{\partial w}\, \ln\cosh(y - w^\top x) && \text{sum rule} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{1}{\cosh(y - w^\top x)}\, \frac{\partial}{\partial w}\, \cosh(y - w^\top x) && \text{derivative of log} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{\sinh(y - w^\top x)}{\cosh(y - w^\top x)}\, \frac{\partial}{\partial w}\,(y - w^\top x) && \text{derivative of cosh} \\
  &= \sum_{\langle y,x\rangle\in D} \frac{\sinh(y - w^\top x)}{\cosh(y - w^\top x)}\,(-x) && \text{derivative of inner product} \\
  &= \sum_{\langle y,x\rangle\in D} \tanh(y - w^\top x)\,(-x) && \text{definition of tanh} \\
  &= -X^\top \tanh(y - Xw) && \text{definition of inner product}
\end{aligned}
$$

Bah, we can't solve for a root in closed form here (the gradient is non-linear in $w$), but we can use the gradient as an update! Start with any $w^{(0)}$; then move opposite the gradient, i.e. in the direction of the (squashed) error times the active inputs $x$, to reduce the loss:

$$
\begin{aligned}
w^{(i)} &= w^{(i-1)} - \frac{1}{N}\,\Bigl( -X^\top \tanh\bigl(y - Xw^{(i-1)}\bigr) \Bigr) \\
        &= w^{(i-1)} + \frac{1}{N}\, X^\top \tanh\bigl(y - Xw^{(i-1)}\bigr)
\end{aligned}
$$

This algorithm is called gradient descent.

10.7 Sample code

Sample code in pandas:

import sys
import numpy
import pandas

a = .1   # learning rate

X = pandas.read_csv( sys.argv[1] )
y = pandas.read_csv( sys.argv[2] )
N = len(X)

## Randomly initialize...
w = pandas.DataFrame( numpy.random.rand(len(X.columns),1), X.columns, ['y'] )

## Run 100 epochs...
for i in range(100):
    w = w + 1/N * a * X.T @ numpy.tanh( y - X @ w )
    # Print sum of loss of training data: should decrease at each epoch...
    print( numpy.sum( numpy.log( numpy.cosh( y - X @ w ) ) ) )

print( w )

Sample input file `X.csv':

x1,x2
1,1
2,1
3,1
4,1

Sample input file `y.csv' (now with an outlier at the third point):

y
3
5
3
9

Sample output:

           y
x1  1.801261
x2  1.113004

whereas quadratic loss gives us:

     y
0  1.6
1  1.0

Graphically, we can see that log-cosh was less sensitive to the outlier:

[Figure: the data points plotted as $y$ against $x[1]$, with the fitted line for $w$ from $L_{\text{LC},\delta}(y, w^\top x)$ and the fitted line for $w$ from $L_{\text{NLG},\sigma}(y, w^\top x)$.]
