


University of Wisconsin – Madison

Computer Sciences Department

CS 760 - Machine Learning

Spring 2003

Exam

7:15-9:15pm, May 6, 2003

Room 3345 Engineering Hall

CLOSED BOOK

(one sheet of notes and a calculator allowed)

Write your answers on these pages and show your work. If you feel that a question is not fully specified, state any assumptions that you need to make in order to solve the problem. You may use the backs of these sheets for scratch work.

Write your name on this and all other pages of this exam. Make sure your exam contains 5 problems on 10 pages.

Name ________________________________________________________________

Student ID ________________________________________________________________

Problem   Score    Max Score

  1       ______      40

  2       ______      17

  3       ______      10

  4       ______      12

  5       ______      21

TOTAL     ______     100

Problem 1 – Learning from Labeled Examples (40 points)

Imagine that you are given the following set of training examples.

Feature F1 can take on the values a, b, or c; Feature F2 is Boolean-valued; and Feature F3 is always a real-valued number in [0,1].

            F1   F2   F3    Category

Example 1   a    T    0.2      +

Example 2   b    F    0.5      +

Example 3   b    F    0.9      +

Example 4   b    T    0.6      –

Example 5   a    F    0.1      –

Example 6   a    T    0.7      –

a) How might a Naive Bayes system classify the following test example? Be sure to show your work. (Discretize the numeric feature into three equal-width bins.)

F1 = c F2 = T F3 = 0.8
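For reference, a minimal sketch of the computation Part a asks for, in Python. The bin edges and the no-smoothing choice are assumptions; note that F1 = c never appears in training, so both class scores come out zero here, and stating how to handle those zero counts (e.g., a Laplace correction) is part of a full answer.

    train = [('a', 'T', 0.2, '+'), ('b', 'F', 0.5, '+'), ('b', 'F', 0.9, '+'),
             ('b', 'T', 0.6, '-'), ('a', 'F', 0.1, '-'), ('a', 'T', 0.7, '-')]

    def f3_bin(x):
        return min(int(x * 3), 2)       # equal-width bins over [0, 1]

    def score(label, f1, f2, f3):
        rows = [r for r in train if r[3] == label]
        p = len(rows) / len(train)                           # prior P(label)
        p *= sum(r[0] == f1 for r in rows) / len(rows)       # P(F1 | label)
        p *= sum(r[1] == f2 for r in rows) / len(rows)       # P(F2 | label)
        p *= sum(f3_bin(r[2]) == f3_bin(f3) for r in rows) / len(rows)
        return p

    for label in ('+', '-'):
        print(label, score(label, 'c', 'T', 0.8))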

b) Describe how a 2-nearest-neighbor algorithm might classify Part a’s test example.
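A sketch of one plausible 2-NN computation. The mixed distance below, counting 1 per mismatch on the symbolic features plus |difference| on F3, is an assumption, since the question leaves the metric open.

    train = [('a', 'T', 0.2, '+'), ('b', 'F', 0.5, '+'), ('b', 'F', 0.9, '+'),
             ('b', 'T', 0.6, '-'), ('a', 'F', 0.1, '-'), ('a', 'T', 0.7, '-')]

    def dist(r, q):
        # 1 per mismatch on the symbolic features, |difference| on F3
        return (r[0] != q[0]) + (r[1] != q[1]) + abs(r[2] - q[2])

    query = ('c', 'T', 0.8)
    print(sorted(train, key=lambda r: dist(r, query))[:2])
    # With k = 2 the neighbors can split one '+' and one '-', so a
    # tie-breaking rule (e.g., distance weighting) belongs in the answer.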

c) Show the calculations that ID3 would perform to determine the root node of a decision tree using the above training examples.
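For reference, the quantities ID3 compares at the root, as a runnable sketch (the binned version of F3 from Part a is assumed for the numeric feature):

    from math import log2

    labels = ['+', '+', '+', '-', '-', '-']       # Examples 1-6
    f1     = ['a', 'b', 'b', 'b', 'a', 'a']       # F1 values

    def entropy(ys):
        n = len(ys)
        return -sum((ys.count(c) / n) * log2(ys.count(c) / n) for c in set(ys))

    def info_gain(feature, ys):
        g = entropy(ys)                           # 1.0 here: three +, three -
        for v in set(feature):
            subset = [y for f, y in zip(feature, ys) if f == v]
            g -= len(subset) / len(ys) * entropy(subset)
        return g

    print(info_gain(f1, labels))   # repeat for F2 and binned F3; largest wins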

d) Using the “decision stump” learner as boosting’s weak learner, what would the weights be on each of the six training examples after one (1) round of AdaBoost? Show your work.

Weight on Example 1: _________

Weight on Example 2: _________

Weight on Example 3: _________

Weight on Example 4: _________

Weight on Example 5: _________

Weight on Example 6: _________
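For reference, a sketch of one AdaBoost round, starting from uniform weights. The stump predictions below are a placeholder with a single mistake (on Example 5), not the best stump for this data; finding that stump is part of the question.

    from math import exp, log

    y    = [+1, +1, +1, -1, -1, -1]        # true labels for Examples 1-6
    pred = [+1, +1, +1, -1, +1, -1]        # assumed stump predictions
    w    = [1/6] * 6

    eps   = sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi)   # weighted error
    alpha = 0.5 * log((1 - eps) / eps)
    w = [wi * exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
    Z = sum(w)
    w = [wi / Z for wi in w]               # renormalize; mistakes gain weight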

e) Show the feature space for this training set. Use only features F2 and F3 (i.e., for this part, ignore F1). Show what a linear support-vector machine might learn from this training set. Briefly explain your answer.

f) Create a neural network with no hidden units and a sigmoidal output unit. Initialize all the free parameters to 2 and use a learning rate of 0.1. Use squared error as the error function.

i. Draw a picture of this neural network before training.

ii. Calculate ∂Error/∂weight_i for this network. Recall that the derivative of the sigmoid is output × (1 – output).

iii. Show how the free parameters would change after backpropagating on the first training example in the table above. Indicate the weight changes on the neural network you drew above and show your work below.

iv. According to Rumelhart, which error function should one use on classification tasks such as this one? And how should one interpret the neural network’s output?
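A sketch of the single gradient step Part iii asks for, under an assumed input encoding (the question does not fix how F1 is encoded; a 1-of-3 encoding plus a bias input is used here):

    from math import exp

    # One backprop step on Example 1 with squared error E = 0.5 * (t - out)**2.
    # Assumed encoding: F1 = a as the 1-of-3 vector (1, 0, 0), F2 = T as 1,
    # F3 as-is, plus a bias input fixed at 1; the target t is 1 for '+'.
    x = [1, 0, 0, 1, 0.2, 1]            # inputs for Example 1, bias last
    w = [2.0] * 6                       # every free parameter starts at 2
    lr, t = 0.1, 1.0

    net = sum(wi * xi for wi, xi in zip(w, x))
    out = 1 / (1 + exp(-net))           # sigmoid output
    delta = (t - out) * out * (1 - out) # equals -dE/dnet for squared error
    w = [wi + lr * delta * xi for wi, xi in zip(w, x)]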

Problem 2 – Reinforcement Learning (17 points)

Consider the deterministic reinforcement environment drawn below. The numbers on the arcs indicate the immediate rewards. Let the discount rate equal 0.5.

Assume that the current Q table is represented by the Q values on the arcs on the environment's state-action graph below. Also assume that the learner is currently in Node b.

a) The learner chooses to do an exploration step out of Node b, using the “softmax” function. What is the probability that the learner will choose the action that leads to Node a? (Be sure to state any assumptions you need to make in order to compute this probability.)
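For reference, the softmax action-selection rule (the temperature T is the unstated assumption the question alludes to; T = 1 is a common default):

    P(a \mid s) = \frac{e^{Q(s,a)/T}}{\sum_{a'} e^{Q(s,a')/T}}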

b) Assume that the learner is magically transported to Node c and that the Q table is still in the state shown below. For this question, assume the learner chooses to first go to Node b and then to Node a. Be sure to show the calculations that produce your answers to the questions below.

i. Using one-step Q learning, what new value would there be on the c → b arc?

ii. Using one-step SARSA learning, what new value would there be on the c → b arc?
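For reference, the two one-step updates being contrasted, written with a learning rate of 1 (a pure overwrite, as is common in deterministic worlds; that rate is an assumption) and γ = 0.5:

    Q(c,b) \leftarrow r(c,b) + \gamma \max_{a'} Q(b,a')   (one-step Q learning)

    Q(c,b) \leftarrow r(c,b) + \gamma \, Q(b,a')          (one-step SARSA, with a' the action actually taken next out of b, i.e., the move to a)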

c) This time assume the learner is currently in Node start and the Q table contains the values in the graph above (the values before you answered Part b). The learner does three (3) exploitation steps and employs two-step Q learning along the way. Indicate on the graph above what changes in the Q table this time. Show your calculations below.

Problem 3 – Computational Learning Theory (10 points)

a) Assume that we have the predictions below of five experts, as well as the correct answer. Using the weighted majority algorithm (with β=0.5) in order to track the best expert, show how the weight given to each expert changes after each example. Show your work.

Expert          1   2   3   4   5    Correct Answer

Predictions     T   T   T   F   F          F

Weights ________________________________________

Predictions     F   T   F   T   T          T

Weights ________________________________________

Predictions     T   F   F   F   T          F

Weights ________________________________________
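For reference, a minimal sketch of the weighted-majority update used above (each expert that errs has its weight multiplied by β = 0.5):

    beta = 0.5
    weights = [1.0] * 5                     # one weight per expert, all start at 1

    def update(weights, predictions, truth):
        # multiply by beta the weight of every expert that predicted incorrectly
        return [w * (beta if p != truth else 1.0)
                for w, p in zip(weights, predictions)]

    weights = update(weights, ['T', 'T', 'T', 'F', 'F'], 'F')   # first row above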

b) Consider the Boolean concept class of circles in the space of two numeric features (points on or inside the circle are considered positive). What is the VC-dimension for this concept space? Explain.

Problem 4 – Overfitting Avoidance (12 points)

a) Use a Venn diagram of feature space to illustrate how a decision tree might overfit a training set. Briefly explain your answer.

b) Briefly describe how one can choose the number of hidden units to use in a neural network in order to reduce the chances of overfitting.

c) Informally (i.e., verbally or pictorially) explain how using a kernel might lead to overfitting in a support-vector machine.

d) Outline a methodology where an experimenter incorrectly uses cross validation to set parameters and estimate future accuracy.
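One concrete version of the mistake the question is after, as a sketch using scikit-learn (the dataset, estimator, and parameter grid are placeholders):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)

    # WRONG: choose C by 10-fold cross-validation on ALL of the data ...
    search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=10).fit(X, y)
    # ... then report a 10-fold score on the same data with the chosen C.
    # The folds that picked C also score it, so the estimate is biased upward;
    # nested cross-validation (or a held-out test set) avoids the leak.
    biased = cross_val_score(search.best_estimator_, X, y, cv=10).mean()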

Problem 5 – Short Discussion Questions (21 points)

a) Using a Venn diagram, qualitatively illustrate how 1-NN (one nearest-neighbor) partitions feature space. Assume a Boolean classification task. Briefly explain your answer.

b) Consider creating an ensemble of eleven support-vector machines (SVMs) using bagging and a standard linear SVM. Dataset A has 100 support vectors and Dataset B has 1000 support vectors (when run on all the data); both datasets contain 10,000 examples, and on both datasets the 10-fold cross-validation accuracy of a single SVM is 80%. On which dataset would you expect bagging to improve accuracy the most? Why?

c) What does transduction mean in machine learning?

d) Explain the role of slack variables in support vector machines.
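For reference, the standard soft-margin formulation in which the slack variables ξ_i appear (C trades off margin width against training error):

    \min_{w, b, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
    \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0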

e) What do you feel is the most important improvement that PAC learning offers over learning in the limit? Explain your answer.

f) Noah Taul does a proper 10-fold cross-validation of his new algorithm compared to backprop. A paired t-test shows he cannot reject the null hypothesis.

What is the null hypothesis? And how does a t-test allow one to address this hypothesis?
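A sketch of the paired test in question (the per-fold accuracies are made-up numbers for illustration):

    from scipy.stats import ttest_rel

    # Per-fold accuracies from the same 10 folds (illustrative numbers only).
    new_algo = [0.81, 0.79, 0.83, 0.80, 0.78, 0.82, 0.80, 0.79, 0.81, 0.80]
    backprop = [0.80, 0.80, 0.82, 0.79, 0.79, 0.81, 0.80, 0.80, 0.80, 0.79]

    # Null hypothesis: the mean per-fold difference in accuracy is zero.
    t_stat, p_value = ttest_rel(new_algo, backprop)
    # One fails to reject the null when p_value exceeds the significance level.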

Have a good vacation!

[Figure: the deterministic environment for Problem 2, a state-action graph over the nodes start, a, b, c, and end, with the immediate reward on each arc (e.g., one arc is labeled r = -10).]

[Figure: the same graph with the current Q value on each arc (e.g., one arc is labeled Q = -1); this Q-value graph appeared twice in the original.]
