Multilayer Learning and Backpropagation



Neural Networks

Bibliography

Rumelhart, D. E. and McClelland, J. L., Parallel Distributed Processing, MIT Press, 1986, Chapter 8, pp. 318-362.

Informal but very readable introduction to Backpropagation

Sejnowski, T. J. and Rosenberg, C., "NETtalk: a parallel network that learns to read aloud," The Johns Hopkins University EE and CS Tech. Report JHU/EECS-86/01, 1986.

An example of an in-depth application of Backpropagation.

Werbos, P. J., "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," PhD thesis, Harvard University, Cambridge, MA, 1974.

Backpropagation

(Rumelhart et al., 1986)

Multi-layer supervised learning

Gradient Descent Weight Changer

Uses Sigmoid rather than Threshold

(Squashing Function)

Sigmoid is differentiable (Widrow's Adaline took the derivative of the linear sum before the threshold)

Threshold vs. Sigmoid

[Figure: hard threshold (step) function vs. sigmoid (squashing) function]
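
A minimal sketch of the two activation functions in Python/NumPy (the function names and the NumPy dependency are illustrative, not from these notes):

import numpy as np

def threshold(net):
    # Hard threshold (step) function: output jumps from 0 to 1 at net = 0,
    # so it is not differentiable there.
    return np.where(net > 0, 1.0, 0.0)

def sigmoid(net):
    # Sigmoid "squashing" function: smooth and differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-net))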

How does a multi-layer network do non-linearly separable mappings?

[Figure: a multi-layer solution to a non-linearly separable mapping]

Backpropagation Network

[Figure: feedforward backpropagation network]

Input Layer → Hidden Layer(s) → Output Layer

Backpropagation Derivation

It can be derived from fundamentals by seeking the negative gradient of the error with respect to the weights, -∂E/∂wij. We can take the derivative of the sigmoid.

sigmoid: f(net) = 1 / (1 + e^-net) = output

f'(net) = f(net) (1 - f(net)) = output (1 - output)

f'(net) is largest when the output is in the middle of the sigmoid (near 0.5) - unstable?
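
As a quick check (standard calculus, not specific to these notes), the derivative of the sigmoid can be taken directly:

f'(net) = d/dnet [ 1 / (1 + e^-net) ]
        = e^-net / (1 + e^-net)^2
        = [1 / (1 + e^-net)] × [e^-net / (1 + e^-net)]
        = f(net) (1 - f(net)),   since 1 - f(net) = e^-net / (1 + e^-net)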

Backpropagation Learning Algorithm

Until Convergence do

Present a training pattern

Calculate the error of output units (T-O)

for each hidden layer (working backward from the output)

Calculate its error using the error from the next layer

Update weights

end

The error propagates back through the network
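
A minimal sketch of this loop in Python/NumPy for a 2x2x1 network trained on XOR (the data, learning rate, weight ranges, epoch cap, and explicit bias weights are illustrative assumptions, not values from these notes):

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# Illustrative XOR training set: inputs X, targets T
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
C = 0.5                                 # learning rate
W1 = rng.uniform(-0.5, 0.5, (2, 2))     # input -> hidden weights
b1 = rng.uniform(-0.5, 0.5, 2)          # hidden biases (weights from a constant-1 unit)
W2 = rng.uniform(-0.5, 0.5, (2, 1))     # hidden -> output weights
b2 = rng.uniform(-0.5, 0.5, 1)          # output bias

for epoch in range(10000):              # "until convergence" approximated by a fixed cap
    for x, t in zip(X, T):
        # Present a training pattern: forward pass
        h = sigmoid(x @ W1 + b1)        # hidden outputs
        o = sigmoid(h @ W2 + b2)        # output

        # Error of the output unit: delta_j = (t_j - O_j) f'(net_j), with f' = O(1 - O)
        delta_o = (t - o) * o * (1 - o)

        # Hidden-layer error uses the error from the next layer:
        # delta_j = (sum_k delta_k w_jk) f'(net_j)
        delta_h = (delta_o @ W2.T) * h * (1 - h)

        # Update weights: delta_w_ij = C * O_i * delta_j (a bias acts as O_i = 1)
        W2 += C * np.outer(h, delta_o)
        b2 += C * delta_o
        W1 += C * np.outer(x, delta_h)
        b1 += C * delta_h

Whether this converges, and in how many epochs, depends on the random initial weights and the learning rate; as the XOR example below shows, the network can also get stuck in a local minimum.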

Network Equations

Output: Oj = f(netj) = 1 / (1 + e^-netj),  where netj = Σi wij Oi

f'(netj) = ∂Oj/∂netj = Oj(1 - Oj)

Δwij (general node) = C Oi δj,  where C is the learning rate

Δwij (output node):

δj = (tj - Oj) f'(netj)

Δwij = C Oi δj = C Oi (tj - Oj) f'(netj)

Δwij (hidden node):

δj = (Σk δk wjk) f'(netj),  where k ranges over the nodes in the next layer

Δwij = C Oi δj = C Oi (Σk δk wjk) f'(netj)

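A small worked example with assumed values (not taken from these notes): for an output node, suppose Oi = 0.5, Oj = 0.7, tj = 1, and C = 0.5. Then:

f'(netj) = Oj (1 - Oj) = 0.7 × 0.3 = 0.21

δj = (tj - Oj) f'(netj) = 0.3 × 0.21 = 0.063

Δwij = C Oi δj = 0.5 × 0.5 × 0.063 ≈ 0.016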

Backprop Examples

Epochs = 558

LR = .5

XOR - 2x1x1

Backprop Examples

Epochs = 6587

LR = .25

No Convergence - Local Minima

XOR - 2x2x1

Parity Problem Solution

Epochs = 2825

LR = .5

Parity - 8x8x1

The hidden layer sets itself up to count the number of active inputs

UCSD - Zipser, Elman (Linguists)

Trained to do Phoneme Identity function. Why?

Speeding up Learning

Momentum term α

Δwij(t+1) = C Oi δj + α Δwij(t)

Speeds up learning in flat regions of the error surface

Filters out high-frequency variations?

α is usually set to 0.9 (see the sketch after this list)

Dynamic learning rate and momentum

Overloading and Pruning

Different Activation Functions

Recurrent Nets
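
A minimal sketch of the momentum update, reusing the variables from the earlier XOR sketch (alpha and the stored previous changes V1, V2 are illustrative additions):

alpha = 0.9                     # momentum term
V1 = np.zeros_like(W1)          # previous weight change, input -> hidden
V2 = np.zeros_like(W2)          # previous weight change, hidden -> output

# Inside the training loop, the plain weight updates become:
V2 = C * np.outer(h, delta_o) + alpha * V2   # delta_w(t+1) = C O_i delta_j + alpha delta_w(t)
W2 += V2
V1 = C * np.outer(x, delta_h) + alpha * V1
W1 += V1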

Backpropagation Summary

Empirically impressive multi-layer learning

Most used of current neural networks

Truly multi-layer?

Slow learning - Hardware

No convergence guarantees

Lack of Rigor - AI Trap?

Black magic - Eye of newt, tricks, few guidelines for initial topology
