


EIE6207: Theoretical Fundamental and Engineering Approaches for Intelligent Signal and Information Processing

Tutorial: Neural Networks and Backpropagation (Solutions)

Q1

a) Because we need to minimize the error E, which is a function of the weights, we compute the gradient of E with respect to the weights to find the best downhill direction at the current position of the weights in the weight space.

In the above diagram, the gradient (slope) $\partial E/\partial w$ at the current weight $w(t)$ is negative. If the learning rate $\eta$ is positive, then $\Delta w = -\eta\,\partial E/\partial w > 0$ according to the BP update formula, which is what we want, i.e., going downhill.
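A minimal numerical sketch of this update rule, using a hypothetical one-dimensional error function E(w) = (w - 3)^2 (not part of the tutorial), is given below.

```python
# Minimal sketch of the BP update dw = -eta * dE/dw on a hypothetical
# 1-D error function E(w) = (w - 3)**2, whose minimum is at w = 3.
def dE_dw(w):
    return 2.0 * (w - 3.0)       # gradient of E with respect to w

eta = 0.1                        # learning rate (assumed value)
w = -1.0                         # initial weight, to the left of the minimum
for _ in range(50):
    grad = dE_dw(w)              # negative while w < 3
    w = w - eta * grad           # negative gradient => w increases (downhill)

print(w)                         # approaches 3.0, the bottom of the valley
```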

b) The learning rate should be small so that the change in the weights is not very large in successive iterations. The key point is that we want a small change in the weights at each iteration so that we eventually reach the bottom of a valley. If the learning rate is very large, the change in the weights may be so large that the error E increases from iteration to iteration. This is similar to jumping from one side of the valley to the other without ever reaching the bottom.
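Continuing the hypothetical quadratic error from the previous sketch, the snippet below compares a small and an overly large learning rate; with the large one, the error grows from iteration to iteration.

```python
# Same hypothetical 1-D error E(w) = (w - 3)**2 as in the previous sketch.
def E(w):
    return (w - 3.0) ** 2

def dE_dw(w):
    return 2.0 * (w - 3.0)

def run(eta, w0=-1.0, iters=10):
    w = w0
    errors = []
    for _ in range(iters):
        w = w - eta * dE_dw(w)
        errors.append(round(E(w), 3))
    return errors

print(run(eta=0.1))   # error shrinks steadily towards 0
print(run(eta=1.1))   # error grows: each step overshoots the minimum
```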

c) No. This is a weakness of BP. BP is based on gradient descent, and gradient-descent algorithms are not guaranteed to find the global minimum (unless E(w) is quadratic in w, in which case there is only one minimum). However, in many real-world applications it is not necessary to find the global minimum for the network to be useful.
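A small sketch of this behaviour on a hypothetical non-convex one-dimensional error (not part of the tutorial): the minimum that gradient descent reaches depends on where the weight is initialised.

```python
# Hypothetical non-convex error E(w) = w^4 - 2w^2 - 0.3w with two minima:
# a shallow (local) one near w = -0.96 and a deeper (global) one near w = 1.04.
def dE_dw(w):
    return 4.0 * w**3 - 4.0 * w - 0.3   # derivative of E with respect to w

def descend(w, eta=0.01, iters=2000):
    for _ in range(iters):
        w = w - eta * dE_dw(w)
    return w

print(descend(-2.0))   # ends near the local minimum (about -0.96)
print(descend(+2.0))   # ends near the global minimum (about +1.04)
```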

Q2

a) The advantage of softmax is that the sum over all outputs is equal to 1, which fits nicely with the requirements of a posterior probability. That is, we may regard the k-th output as producing the posterior probability of Class k given an input vector x, i.e., $y_k = P(\text{Class } k \mid \mathbf{x})$.
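A minimal sketch of a softmax output layer (the activations z are assumed to be the linear outputs of the final layer for one input vector):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)         # outputs are positive and sum to 1

z = np.array([2.0, 1.0, -1.0])   # hypothetical activations for 3 classes
y = softmax(z)
print(y, y.sum())                # approx. [0.71, 0.26, 0.04], sum = 1.0
```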

b) The MSE places equal emphasis on all of the K outputs for a K-class problem. This means that the MSE, $E = \frac{1}{2}\sum_{k=1}^{K}(y_k - t_k)^2$, will be dominated by the errors for which $t_k = 0$, because among the K outputs only one of them has target $t_k = 1$. On the other hand, the cross-entropy error emphasizes the output $y_k$ for which $t_k = 1$. In fact, it aims to make the network produce $y_k \to 1$ when $t_k = 1$ and ignores all the other outputs for which $t_k = 0$. This is closer to the classification objective, because to make a classification decision we look for the largest output regardless of how large its rivals are.
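A small numerical comparison with hypothetical values (one-hot target, three classes):

```python
import numpy as np

t = np.array([0.0, 1.0, 0.0])        # one-hot target: class 2
y = np.array([0.3, 0.4, 0.3])        # an unconfident softmax output

mse  = 0.5 * np.sum((y - t) ** 2)    # every output contributes, even those with t_k = 0
xent = -np.sum(t * np.log(y))        # only the output with t_k = 1 contributes

print(mse)    # 0.27
print(xent)   # 0.916... (= -log 0.4, driven entirely by the correct class)
```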

[The following is optional; but as a research student, it is good to understand it.]

For discrete random vectors t and y, the cross-entropy between their distributions p(t) and q(y) is

$$H(p, q) = -\sum_{n}\sum_{k} t_{nk} \log y_{nk},$$

where n indexes the samples drawn from p and q, and k indexes the elements of t and y. In general, the cross-entropy between two distributions p and q is defined as

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q),$$

where the first term is the entropy of p and the second term is the Kullback–Leibler divergence between p and q:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$$

As H(p) is a constant for a specific p, minimizing the cross-entropy is equivalent to minimizing the KL divergence. The KL divergence will be minimum (= 0) when the two distributions (p and q) are the same everywhere. Note that the cross-entropy between p and q is implemented as follows:

$$H(p, q) = -\sum_{x} p(x) \log q(x),$$

which is the same as our cross-entropy error function E(W) if p is the target t and q is the actual output y. Therefore, minimizing the cross-entropy between t and y will cause the distribution of y to be as close as possible to the distribution of t, which is what we want in a classification task.
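A quick numerical check of the decomposition H(p, q) = H(p) + KL(p||q), using two hypothetical discrete distributions over three outcomes:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
H_p  = -np.sum(p * np.log(p))        # entropy H(p)
KL   =  np.sum(p * np.log(p / q))    # KL divergence D_KL(p || q)

print(round(H_pq, 4), round(H_p + KL, 4))   # identical: 0.8869 0.8869
```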

c) Classification error requires counting the misclassified samples, i.e., a 0/1 decision per sample, which is not a differentiable function of the network outputs. Although methods have been developed to convert the classification error into a differentiable function, the derivative can be quite complicated.
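To see why, consider a single output y with decision threshold 0.5 and target class 1 (a hypothetical setup): the 0/1 error is piecewise constant in y, so its gradient is zero almost everywhere and undefined at the threshold.

```python
def class_error(y, threshold=0.5):
    predicted = 1 if y >= threshold else 0
    return 0 if predicted == 1 else 1     # target class is 1

for y in [0.40, 0.49, 0.51, 0.90]:
    print(y, class_error(y))              # jumps from 1 to 0 at y = 0.5,
                                          # flat (zero gradient) elsewhere
```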

Q3

a)

As shown in the above figure, the outputs of the hidden nodes are a function of the linear weighted sum of the inputs plus a bias. If the sigmoid function f() has a steep slope, its output over a range of inputs x1 and x2 will look like the following figure. The left and right figures correspond to the outputs of hidden neurons 1 and 2, respectively. The first neuron maps data above L1 to 0.0 and below L1 to 1.0. Similarly, the second neuron maps data above L2 to 1.0 and below L2 to 0.0. The resulting mappings are shown in the bottom figure. The output neuron separates the data in the 2-D space defined by the hidden node outputs $(a_1, a_2)$. As can be seen, the data in this new space are linearly separable and can easily be classified by L3, which is produced by the output neuron.

[Figures: the outputs of Hidden Neuron 1 and Hidden Neuron 2 over the input plane, with their decision boundaries L1 and L2, and the transformed space of hidden outputs in which the boundary L3 produced by the output neuron separates the two classes.]
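A minimal sketch of the same idea on the classic XOR problem, using hand-picked (hypothetical) weights and a steep sigmoid so that each hidden neuron acts as a hard decision boundary; in the space of hidden outputs the four points become linearly separable. The boundary orientations below are chosen for convenience and need not match the polarities in the figures above.

```python
import numpy as np

def sigmoid(z, steepness=20.0):
    return 1.0 / (1.0 + np.exp(-steepness * z))   # steep slope => nearly 0/1 outputs

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])                   # XOR targets (not linearly separable)

# Hidden neuron 1: boundary "L1" at x1 + x2 = 0.5; hidden neuron 2: "L2" at x1 + x2 = 1.5
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([-0.5, -1.5])
A = sigmoid(X @ W.T + b)                          # hidden outputs (a1, a2)

# Output neuron: boundary "L3" in the (a1, a2) space, a1 - a2 = 0.5
y = sigmoid(A @ np.array([1.0, -1.0]) - 0.5)

for x, a, out, lab in zip(X, A, y, labels):
    print(x, np.round(a, 2), round(float(out), 2), lab)   # output matches the XOR label
```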
