


University of Wisconsin – Madison

Computer Sciences Department

CS 760 - Machine Learning

Fall 2004

Exam

11am-12:45pm, December 10, 2004

Room 107 Psychology

CLOSED BOOK

(one sheet of notes and a calculator allowed)

Write your answers on these pages and show your work. If you feel that a question is not fully specified, state any assumptions you need to make in order to solve the problem. You may use the backs of these sheets for scratch work.

Neatly write your name on this and all other pages of this exam.

Name ________________________________________________________________

Problem    Score     Max Score

1          ______       45
2          ______       13
3          ______       15
4          ______       12
5          ______       15

TOTAL      ______      100

Problem 1 – Learning from Labeled Examples (45 points)

Imagine you want to learn to predict which flu patients will respond well to the new drug Pacix, based on three measurements you take. The table below contains the results on a training set of volunteers.

       Temp    BP     Age   Result
Ex1    100.0   ave    20    helps
Ex2     98.5   low    41    helps
Ex3     98.6   ave    29    no-effect
Ex4    103.0   high   72    no-effect
Ex5     99.9   low    87    no-effect

a) Assume we are considering the feature BP (blood pressure) for the root of a decision tree. After splitting on this feature, how many bits of information, on average, would still be needed in order to classify a future example?
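
A minimal Python sketch of the quantity asked for here, the weighted average entropy of the class labels within each BP branch (an illustration, not required notation):

    # Remaining information after splitting on BP: the weighted average
    # entropy of the class labels in each branch.
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                    for c in set(labels))

    branches = {"ave":  ["helps", "no-effect"],   # Ex1, Ex3
                "low":  ["helps", "no-effect"],   # Ex2, Ex5
                "high": ["no-effect"]}            # Ex4
    total = sum(len(v) for v in branches.values())
    print(sum(len(v) / total * entropy(v) for v in branches.values()))  # 0.8 bits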

b) Now consider using Age as the root of a decision tree. Which “decisions” (i.e., possible tests at an interior node) do we need to score? Simply list them and briefly explain why these are the candidates we need to consider; you do not need to calculate the information gain of each.

c) Imagine a medical expert suggests that the drug Pacix will help patients whose Age is less than half their Temp. Assume we want to create a decision node that tests this hypothesis. What would this test be, and how much information does it gain (when considered as the root)?
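
A sketch of the corresponding calculation, reusing the entropy helper above; the side assignments follow from the table (Ex1–Ex3 satisfy Age < Temp/2, Ex4–Ex5 do not):

    # Information gain of the test "Age < Temp/2" when used at the root.
    data = [(100.0, 20, "helps"), (98.5, 41, "helps"), (98.6, 29, "no-effect"),
            (103.0, 72, "no-effect"), (99.9, 87, "no-effect")]
    yes = [r for t, a, r in data if a < t / 2]    # Ex1, Ex2, Ex3
    no  = [r for t, a, r in data if a >= t / 2]   # Ex4, Ex5
    gain = (entropy([r for _, _, r in data])
            - len(yes) / len(data) * entropy(yes)
            - len(no) / len(data) * entropy(no))
    print(gain)   # ≈ 0.971 − 0.551 − 0 ≈ 0.420 bits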

d) You are boosting “decision stumps” and in the first round choose the decision

If age < 50 then helps else no-effect

Show the weights on the five training examples after this first round.

Show your work here.

Weight(Ex1) = __________

Weight(Ex2) = __________

Weight(Ex3) = __________

Weight(Ex4) = __________

Weight(Ex5) = __________

On the second round of boosting, what would be the information gain of the feature BP?

Show your work.
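
A sketch of the reweighting using the standard AdaBoost update (if your notes use a different variant, such as Mitchell's β = ε/(1−ε) scaling of correctly classified examples, the relative pattern of the weights is the same):

    # First-round AdaBoost reweighting for the stump "if age < 50 then helps".
    from math import log, exp

    ages   = [20, 41, 29, 72, 87]
    labels = ["helps", "helps", "no-effect", "no-effect", "no-effect"]
    preds  = ["helps" if a < 50 else "no-effect" for a in ages]

    w = [1 / 5] * 5                                                 # uniform start
    eps = sum(wi for wi, p, y in zip(w, preds, labels) if p != y)   # 0.2 (only Ex3 is wrong)
    alpha = 0.5 * log((1 - eps) / eps)                              # ≈ 0.693
    w = [wi * exp(alpha if p != y else -alpha)
         for wi, p, y in zip(w, preds, labels)]
    z = sum(w)
    print([wi / z for wi in w])   # [0.125, 0.125, 0.5, 0.125, 0.125]

For the second-round information gain, the entropy calculation is the same as in part a, except that each example is counted by its new weight rather than once.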

e) Show how a one-nearest neighbor algorithm would classify the following new example:

Temp = 100 BP = high Age = 50

The training examples repeated for your convenience:

       Temp    BP     Age   Result
Ex1    100.0   ave    20    helps
Ex2     98.5   low    41    helps
Ex3     98.6   ave    29    no-effect
Ex4    103.0   high   72    no-effect
Ex5     99.9   low    87    no-effect
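
A sketch of the 1-NN computation under one set of assumptions you would need to state explicitly: Euclidean distance, BP encoded ordinally (low=0, ave=1, high=2), and Temp and Age left on their raw scales. A different encoding or scaling can change the answer.

    # 1-NN: return the label of the closest training example.
    bp_code = {"low": 0, "ave": 1, "high": 2}
    train = [(100.0, "ave",  20, "helps"),     (98.5, "low", 41, "helps"),
             (98.6,  "ave",  29, "no-effect"), (103.0, "high", 72, "no-effect"),
             (99.9,  "low",  87, "no-effect")]
    q_temp, q_bp, q_age = 100.0, "high", 50

    def dist(ex):
        t, bp, a, _ = ex
        return ((t - q_temp) ** 2 + (bp_code[bp] - bp_code[q_bp]) ** 2
                + (a - q_age) ** 2) ** 0.5

    print(min(train, key=dist)[-1])   # Ex2 is nearest under these assumptions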

f) Show how a perceptron could be applied to this data set. Draw the initial network, initializing all free parameters to 0. Then calculate, showing your work, how the delta rule would change the network upon processing the first training example above (let η=0.1). Assume the two real-valued features are normalized by dividing them by 100. Represent helps by 1 and no-effect by 0. Also assume that if the weighted sum equals the threshold, the perceptron outputs 0.
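
A sketch of the single update requested, assuming BP is encoded as low=0, ave=0.5, high=1 (the encoding is an assumption, not given in the problem) and the threshold is implemented as a weight on a constant bias input of 1:

    # One delta-rule step on Ex1 (Temp=100, BP=ave, Age=20, label helps -> 1).
    eta = 0.1
    x = [1.0, 100.0 / 100, 0.5, 20 / 100]   # bias input, Temp/100, BP, Age/100
    w = [0.0, 0.0, 0.0, 0.0]                # all free parameters start at 0
    target = 1

    out = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0   # ties output 0
    w = [wi + eta * (target - out) * xi for wi, xi in zip(w, x)]
    print(w)   # [0.1, 0.1, 0.05, 0.02]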

g) Describe how a linear support-vector machine (SVM) could be applied to this problem. Show the expression that would be minimized and the constraints that would need to be satisfied. (Do not assume the data is linearly separable.) Be sure to explain your answer. You do not need to express your solution in terms of the detailed Ax ≥ b matrices and vectors of linear programming, nor do you need to calculate the optimal parameter values that the SVM would find.

minimize: _______________________    explanation: _______________________

subject to: _______________________    explanation: _______________________
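
One standard way to write the soft-margin formulation (your course may instead use the 1-norm, linear-programming version; either is acceptable if explained). With yᵢ = +1 for helps, yᵢ = −1 for no-effect, and slack variables ξᵢ absorbing violations since the data need not be separable:

    minimize    (1/2)‖w‖² + C Σᵢ ξᵢ
    subject to  yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ  and  ξᵢ ≥ 0,   i = 1, …, 5

The ‖w‖² term maximizes the margin, the Σξᵢ term penalizes training errors, and C trades the two off.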

h) Verbally relate your solution to Part g to:

i. minimal description length

ii. weight decay

Problem 2 – Reinforcement Learning (13 points)

Consider the deterministic reinforcement environment drawn below (let γ=0.5). The numbers on the arcs indicate the immediate rewards.

a) Assume we use a Q table for this task and initialize all of its entries to the immediate rewards obtained from each state-action pair.

i. A learner follows the path start → a → b → end. Using one-step, regular Q learning, which entries in the Q table change? Show the calculations that produce the new Q-table entries (the update rules are sketched after part iii).

ii. Which of the learner’s steps in Part i are exploration steps? Explain.

iii. Repeat Part i, but this time use one-step SARSA (start with the original Q table, not the one that results from the learning in Part i). For those Q updates that are the same as in Part i, simply say so and do not repeat the calculations.
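
For reference, a sketch of the two one-step updates in tabular form (with γ = 0.5; in this deterministic tabular setting the learning rate is 1, and the value backed up from the terminal state is 0):

    # One-step Q-learning vs. SARSA updates (gamma = 0.5).
    gamma = 0.5

    def q_learning_update(Q, s, a, r, s_next, actions_in):
        # Off-policy: back up the best action available in the next state
        # (0 if s_next is terminal and has no actions).
        nexts = actions_in.get(s_next, [])
        best = max(Q[(s_next, a2)] for a2 in nexts) if nexts else 0.0
        Q[(s, a)] = r + gamma * best

    def sarsa_update(Q, s, a, r, s_next, a_next):
        # On-policy: back up the action actually taken in the next state.
        Q[(s, a)] = r + gamma * (Q[(s_next, a_next)] if a_next is not None else 0.0)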

b) Why does the convergence proof for Q learning presented in class and in Mitchell not apply when a function approximator is used to represent the Q function?

Problem 3 – Experimental Methodology (15 points)

a) You run your favorite algorithm on a training set of 900 examples, and then find it properly classifies 80 out of 100 test-set examples. In what interval are you 95% confident that the true accuracy on future examples lies?
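
A sketch of the arithmetic using the normal approximation to the binomial, a standard approach for this interval:

    # 95% confidence interval for accuracy 80/100 via the normal approximation.
    p_hat, n, z95 = 0.80, 100, 1.96
    half = z95 * (p_hat * (1 - p_hat) / n) ** 0.5   # ≈ 0.078
    print(p_hat - half, p_hat + half)               # ≈ (0.722, 0.878)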

b) Assume two algorithms obtain the following test-set accuracies in a four-fold cross-validation:

Algo A: 90 92 87 93

Algo B: 91 95 84 97

Is the difference between A and B statistically significant at the 95% confidence level?

(t(0.95,3) = 3.18, t(0.95,4) = 2.78, t(0.95,5) = 2.57 for a two-sided test, and
t(0.95,3) = 2.35, t(0.95,4) = 2.13, t(0.95,5) = 2.02 for a one-sided test)
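
A sketch of the paired t computation over the per-fold differences (n = 4 folds, so 3 degrees of freedom):

    # Paired t test on per-fold accuracy differences.
    a = [90, 92, 87, 93]
    b = [91, 95, 84, 97]
    d = [x - y for x, y in zip(a, b)]                  # [-1, -3, 3, -4]
    mean = sum(d) / len(d)                             # -1.25
    var = sum((x - mean) ** 2 for x in d) / (len(d) - 1)
    t = mean / (var / len(d)) ** 0.5                   # ≈ -0.81
    print(abs(t))   # compare against t(0.95,3) = 3.18 for the two-sided test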

c) Assume your probabilistic learner produces the following test-set results:

Example   Probability   Correct Category
1         0.99          negative
2         0.90          positive
3         0.75          positive
4         0.25          negative
5         0.05          positive

Draw the ROC curve for this learner to the right of the table above. Be sure to label your axes.
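
A sketch that enumerates the ROC points by sweeping a threshold over the scores (assuming the Probability column is the predicted probability of the positive class):

    # ROC points: at each threshold, plot FPR (x-axis) vs. TPR (y-axis).
    preds = [(0.99, "neg"), (0.90, "pos"), (0.75, "pos"),
             (0.25, "neg"), (0.05, "pos")]
    P = sum(1 for _, c in preds if c == "pos")   # 3 positives
    N = len(preds) - P                           # 2 negatives
    for thresh, _ in sorted(preds, reverse=True):
        tp = sum(1 for p, c in preds if p >= thresh and c == "pos")
        fp = sum(1 for p, c in preds if p >= thresh and c == "neg")
        print(f"threshold {thresh}: FPR={fp / N:.2f}, TPR={tp / P:.2f}")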

Problem 4 – Miscellaneous Questions (12 points)

a) Assume that you are considering a noise-free data set that contains N Boolean-valued features and are using the hypothesis space:

All possible Boolean functions (there are 2^(2^N) possibilities)

According to the PAC model, how many examples do you need to collect in order to be 99% confident that you will learn the underlying Boolean concept to 90% accuracy when N = 10?
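
A sketch of the arithmetic using the standard PAC sample-complexity bound m ≥ (1/ε)(ln|H| + ln(1/δ)):

    # PAC bound with |H| = 2^(2^N), N = 10, eps = 0.1, delta = 0.01.
    from math import log, ceil
    N, eps, delta = 10, 0.1, 0.01
    m = (1 / eps) * ((2 ** N) * log(2) + log(1 / delta))
    print(ceil(m))   # 7144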

b) “Hardwire” a neural network that memorizes the following two positive examples, and explain your answer (one possible construction is sketched after the examples):

f1 = true f2 = false f3 = true f4 = false

f1 = true f2 = true f3 = false f4 = false
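
One common construction (a sketch, not the only acceptable answer): dedicate one hidden threshold unit to each memorized example, with weight +1 on its true features, −1 on its false ones, and threshold 1.5 so the unit fires only on an exact match; the output unit ORs the hidden units.

    # Hardwired network that fires exactly on the two memorized examples.
    step = lambda s, theta: 1 if s > theta else 0

    def network(f):                                  # f: four 0/1 features
        h1 = step(f[0] - f[1] + f[2] - f[3], 1.5)    # matches (T, F, T, F)
        h2 = step(f[0] + f[1] - f[2] - f[3], 1.5)    # matches (T, T, F, F)
        return step(h1 + h2, 0.5)                    # OR of the hidden units

    print(network((1, 0, 1, 0)), network((1, 1, 0, 0)))   # 1 1
    print(network((0, 0, 0, 0)), network((1, 1, 1, 1)))   # 0 0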

c) Show how the KBANN algorithm would initialize a neural network given the following domain theory (assume all examples are represented by four Boolean-valued features, f1–f4). Be sure to indicate the initial weights and node thresholds; a sketch of one such initialization follows the rules.

If (f2 and f3) then category = positive

If (f1 and NOT f3) then category = positive

If (f1) then category = positive
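
A sketch of the KBANN-style initialization under typical settings (W = 4 is a common choice; the low-weight links from unmentioned features, which KBANN also adds so the theory can be revised during training, are omitted here):

    # Each rule becomes one hidden unit: weight +W per positive antecedent,
    # -W per negated antecedent, and bias -(n_pos - 0.5) * W, so the unit
    # is active exactly when the rule body is satisfied.
    W = 4.0
    rules = [(["f2", "f3"], []),    # If (f2 and f3) then positive
             (["f1"], ["f3"]),     # If (f1 and NOT f3) then positive
             (["f1"], [])]         # If (f1) then positive
    for pos, neg in rules:
        weights = {**{f: W for f in pos}, **{f: -W for f in neg}}
        bias = -(len(pos) - 0.5) * W
        print(weights, "bias =", bias)
    # Output unit ORs the three rule units: weight W from each, bias -0.5 * W.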

Problem 5 – Miscellaneous Short Answers (15 points)

Briefly discuss the importance in machine learning of each of the following:

tile coding

random forests

recall-precision curves

options

feature-independence assumption

[Figure for Problem 2: a deterministic environment with states start, a, b, c, and end; the arcs carry the immediate rewards -2, 1, -8, 8, -4, -4, 2, and 0.]
