Learning from Observations
Chapter 18, Sections 1–3

Outline

♦ Learning agents

♦ Inductive learning

♦ Decision tree learning

♦ Measuring learning performance

Learning

Learning is essential for unknown environments,
i.e., when designer lacks omniscience

Learning is useful as a system construction method,
i.e., expose the agent to reality rather than trying to write it down

Learning modifies the agent's decision mechanisms to improve performance

Learning agents

[Figure: general learning agent. The critic observes the agent's behaviour through the sensors, compares it against the performance standard, and sends feedback to the learning element. The learning element uses this feedback and its knowledge to make changes to the performance element and to set learning goals for the problem generator, which suggests experiments. The performance element maps percepts from the sensors to actions carried out by the effectors in the environment.]
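Since the diagram above describes an architecture, here is a minimal, hedged sketch of how its components might fit together in code; every class, function, and parameter name below is illustrative and not taken from the slides.

class LearningAgent:
    def __init__(self, perform, learn, criticize, propose_experiment):
        self.perform = perform                        # performance element: percept -> action
        self.learn = learn                            # learning element: feedback -> changes to perform
        self.criticize = criticize                    # critic: percept -> feedback vs. performance standard
        self.propose_experiment = propose_experiment  # problem generator: -> exploratory action or None

    def step(self, percept):
        feedback = self.criticize(percept)            # critic judges behaviour against the standard
        self.learn(feedback)                          # learning element modifies the decision mechanisms
        experiment = self.propose_experiment()        # occasionally try something informative instead
        return experiment if experiment is not None else self.perform(percept)

# Placeholder components, just to show the control flow:
agent = LearningAgent(
    perform=lambda percept: "act-normally",
    learn=lambda feedback: None,
    criticize=lambda percept: 0.0,
    propose_experiment=lambda: None,
)
print(agent.step({"camera": "..."}))   # -> "act-normally"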

Learning element

Design of learning element is dictated by

♦ what type of performance element is used

♦ which functional component is to be learned

♦ how that functional component is represented

♦ what kind of feedback is available

Example scenarios:

Performance element | Component        | Representation            | Feedback
Alpha-beta search   | Eval. fn.        | Weighted linear function  | Win/loss
Logical agent       | Transition model | Successor-state axioms    | Outcome
Utility-based agent | Transition model | Dynamic Bayes net         | Outcome
Simple reflex agent | Percept-action fn| Neural net                | Correct action

Supervised learning: correct answers for each instance
Reinforcement learning: occasional rewards

Inductive learning (a.k.a. Science)

Simplest form: learn a function from examples (tabula rasa)

f is the target function

An example is a pair ⟨x, f(x)⟩, e.g., ⟨[a tic-tac-toe board position], +1⟩

Problem: find a(n) hypothesis h
such that h ≈ f
given a training set of examples

(This is a highly simplified model of real learning:
– Ignores prior knowledge
– Assumes a deterministic, observable environment
– Assumes examples are given
– Assumes that the agent wants to learn f – why?)
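To make the ⟨x, f(x)⟩ example format and the notion of a consistent hypothesis concrete, here is a tiny hedged Python sketch; the example pairs and function names are invented for illustration.

def is_consistent(h, examples):
    # h is consistent if it agrees with f on every training example
    return all(h(x) == f_x for x, f_x in examples)

examples = [(1, 1), (2, 4), (3, 9)]                 # pairs <x, f(x)>; here f happens to be squaring
print(is_consistent(lambda x: x * x, examples))     # True: agrees on all examples
print(is_consistent(lambda x: 2 * x, examples))     # False: disagrees on x = 1 and x = 3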

Inductive learning method

Construct/adjust h to agree with f on training set
(h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure, repeated over several slides: the same training points ⟨x, f(x)⟩ fitted first by a straight line, then by higher-degree curves that pass through more, and eventually all, of the points.]

Ockham's razor: maximize a combination of consistency and simplicity
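The curve-fitting picture can be made concrete with a short, hedged sketch: fit polynomials of increasing degree and score each hypothesis by training error (consistency) plus a complexity penalty (simplicity). The data points and the penalty weight below are invented for illustration.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 0.9, 2.3, 2.8, 4.2, 4.9])               # roughly linear data with a little noise

def score(degree, penalty=0.05):
    coeffs = np.polyfit(x, y, degree)                       # hypothesis h: best-fit polynomial of this degree
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2) # consistency with the training set
    return train_error + penalty * degree                   # Ockham's razor: also reward simplicity

best_degree = min(range(1, 6), key=score)
print("preferred degree:", best_degree)                     # a simple (low-degree) hypothesis wins here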

Attribute-based representations

Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
X1      |  T  |  F  |  F  |  T  | Some | $$$   |  F   |  T  | French  | 0–10  |    T
X2      |  T  |  F  |  F  |  T  | Full | $     |  F   |  F  | Thai    | 30–60 |    F
X3      |  F  |  T  |  F  |  F  | Some | $     |  F   |  F  | Burger  | 0–10  |    T
X4      |  T  |  F  |  T  |  T  | Full | $     |  F   |  F  | Thai    | 10–30 |    T
X5      |  T  |  F  |  T  |  F  | Full | $$$   |  F   |  T  | French  | >60   |    F
X6      |  F  |  T  |  F  |  T  | Some | $$    |  T   |  T  | Italian | 0–10  |    T
X7      |  F  |  T  |  F  |  F  | None | $     |  T   |  F  | Burger  | 0–10  |    F
X8      |  F  |  F  |  F  |  T  | Some | $$    |  T   |  T  | Thai    | 0–10  |    T
X9      |  F  |  T  |  T  |  F  | Full | $     |  T   |  F  | Burger  | >60   |    F
X10     |  T  |  T  |  T  |  T  | Full | $$$   |  F   |  T  | Italian | 10–30 |    F
X11     |  F  |  F  |  F  |  F  | None | $     |  F   |  F  | Thai    | 0–10  |    F
X12     |  T  |  T  |  T  |  T  | Full | $     |  F   |  F  | Burger  | 30–60 |    T

Classification of examples is positive (T) or negative (F)

Decision trees

One possible representation for hypotheses
E.g., here is the "true" tree for deciding whether to wait:

Patrons?
  None → F
  Some → T
  Full → WaitEstimate?
    >60   → F
    30–60 → Alternate?
      No  → Reservation?
        No  → Bar? (No → F, Yes → T)
        Yes → T
      Yes → Fri/Sat? (No → F, Yes → T)
    10–30 → Hungry?
      No  → T
      Yes → Alternate?
        No  → T
        Yes → Raining? (No → F, Yes → T)
    0–10  → T
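For later use, here is a hedged Python transcription of the 12 examples in the table above (attribute and value spellings are my own choices; the data match the table).

ATTRIBUTES = ["Alt", "Bar", "Fri", "Hun", "Pat", "Price", "Rain", "Res", "Type", "Est"]

examples = [
    dict(Alt="T", Bar="F", Fri="F", Hun="T", Pat="Some", Price="$$$", Rain="F", Res="T", Type="French",  Est="0-10",  WillWait="T"),   # X1
    dict(Alt="T", Bar="F", Fri="F", Hun="T", Pat="Full", Price="$",   Rain="F", Res="F", Type="Thai",    Est="30-60", WillWait="F"),   # X2
    dict(Alt="F", Bar="T", Fri="F", Hun="F", Pat="Some", Price="$",   Rain="F", Res="F", Type="Burger",  Est="0-10",  WillWait="T"),   # X3
    dict(Alt="T", Bar="F", Fri="T", Hun="T", Pat="Full", Price="$",   Rain="F", Res="F", Type="Thai",    Est="10-30", WillWait="T"),   # X4
    dict(Alt="T", Bar="F", Fri="T", Hun="F", Pat="Full", Price="$$$", Rain="F", Res="T", Type="French",  Est=">60",   WillWait="F"),   # X5
    dict(Alt="F", Bar="T", Fri="F", Hun="T", Pat="Some", Price="$$",  Rain="T", Res="T", Type="Italian", Est="0-10",  WillWait="T"),   # X6
    dict(Alt="F", Bar="T", Fri="F", Hun="F", Pat="None", Price="$",   Rain="T", Res="F", Type="Burger",  Est="0-10",  WillWait="F"),   # X7
    dict(Alt="F", Bar="F", Fri="F", Hun="T", Pat="Some", Price="$$",  Rain="T", Res="T", Type="Thai",    Est="0-10",  WillWait="T"),   # X8
    dict(Alt="F", Bar="T", Fri="T", Hun="F", Pat="Full", Price="$",   Rain="T", Res="F", Type="Burger",  Est=">60",   WillWait="F"),   # X9
    dict(Alt="T", Bar="T", Fri="T", Hun="T", Pat="Full", Price="$$$", Rain="F", Res="T", Type="Italian", Est="10-30", WillWait="F"),   # X10
    dict(Alt="F", Bar="F", Fri="F", Hun="F", Pat="None", Price="$",   Rain="F", Res="F", Type="Thai",    Est="0-10",  WillWait="F"),   # X11
    dict(Alt="T", Bar="T", Fri="T", Hun="T", Pat="Full", Price="$",   Rain="F", Res="F", Type="Burger",  Est="30-60", WillWait="T"),   # X12
]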

Expressiveness

Decision trees can express any function of the input attributes.
E.g., for Boolean functions, truth table row → path to leaf:

[Figure: the truth table for A xor B (F F → F, F T → T, T F → T, T T → F) alongside the corresponding decision tree, which tests A at the root and B on each branch, with the xor value at each leaf.]

Trivially, there is a consistent decision tree for any training set
w/ one path to leaf for each example (unless f nondeterministic in x)
but it probably won't generalize to new examples

Prefer to find more compact decision trees

Hypothesis spaces

How many distinct decision trees with n Boolean attributes??
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)

E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees

How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)??
Each attribute can be in (positive), in (negative), or out
⇒ 3^n distinct conjunctive hypotheses

More expressive hypothesis space
– increases chance that target function can be expressed
– increases number of hypotheses consistent w/ training set
⇒ may get worse predictions
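A quick check of the counting arguments above (a hedged sketch; the choice of n is from the slides' example):

n = 6
print(2 ** (2 ** n))   # Boolean functions of n attributes = distinct decision trees: 18446744073709551616
print(3 ** n)          # purely conjunctive hypotheses over n attributes: 729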

Decision tree learning

Aim: find a small tree consistent with the training examples
Idea: (recursively) choose "most significant" attribute as root of (sub)tree

function DTL(examples, attributes, default) returns a decision tree
  if examples is empty then return default
  else if all examples have the same classification then return the classification
  else if attributes is empty then return Mode(examples)
  else
    best ← Choose-Attribute(attributes, examples)
    tree ← a new decision tree with root test best
    for each value vi of best do
      examplesi ← {elements of examples with best = vi}
      subtree ← DTL(examplesi, attributes − best, Mode(examples))
      add a branch to tree with label vi and subtree subtree
    return tree

Choosing an attribute

Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"

[Figure: splitting the 12 restaurant examples on Patrons? (None, Some, Full) gives subsets that are nearly all positive or all negative, while splitting on Type? (French, Italian, Thai, Burger) leaves every subset half positive and half negative.]

Patrons? is a better choice: it gives information about the classification
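As a runnable counterpart to the DTL pseudocode above, here is a hedged Python sketch. It operates on the examples list of dicts sketched after the attribute table; the naive attribute chooser below is a placeholder (an information-gain chooser appears after the Information slides).

from collections import Counter

def mode(examples, target="WillWait"):
    # most common classification among the examples
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, target="WillWait", choose=None):
    if not examples:
        return default
    classifications = {ex[target] for ex in examples}
    if len(classifications) == 1:                      # all examples agree
        return classifications.pop()
    if not attributes:
        return mode(examples, target)
    choose = choose or (lambda attrs, exs: attrs[0])   # placeholder Choose-Attribute
    best = choose(attributes, examples)
    tree = {best: {}}
    # NOTE: the pseudocode branches on every possible value of best; this sketch
    # only branches on values actually seen in the current examples.
    for value in sorted({ex[best] for ex in examples}):
        examples_i = [ex for ex in examples if ex[best] == value]
        subtree = dtl(examples_i, [a for a in attributes if a != best],
                      mode(examples, target), target, choose)
        tree[best][value] = subtree
    return tree

# Usage (assuming ATTRIBUTES and examples from the earlier sketch):
# print(dtl(examples, ATTRIBUTES, default="F"))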

Information

Information answers questions

The more clueless I am about the answer initially, the more information is contained in the answer

Scale: 1 bit = answer to Boolean question with prior ⟨0.5, 0.5⟩

Information in an answer when prior is ⟨P1, . . . , Pn⟩ is

  H(⟨P1, . . . , Pn⟩) = Σ (i = 1 to n) −Pi log2 Pi

(also called entropy of the prior)

Information contd.

Suppose we have p positive and n negative examples at the root
⇒ H(⟨p/(p+n), n/(p+n)⟩) bits needed to classify a new example
E.g., for 12 restaurant examples, p = n = 6 so we need 1 bit

An attribute splits the examples E into subsets Ei, each of which (we hope) needs less information to complete the classification

Let Ei have pi positive and ni negative examples
⇒ H(⟨pi/(pi+ni), ni/(pi+ni)⟩) bits needed to classify a new example
⇒ expected number of bits per example over all branches is

  Σi [(pi + ni)/(p + n)] H(⟨pi/(pi+ni), ni/(pi+ni)⟩)

For Patrons?, this is 0.459 bits, for Type this is (still) 1 bit
⇒ choose the attribute that minimizes the remaining information needed
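As a hedged companion to the formulas above, this sketch computes the entropy and the expected remaining information ("remainder") of a split, using the examples dicts from the earlier sketch; the function names are my own.

import math
from collections import Counter, defaultdict

def entropy(probabilities):
    # H(<P1, ..., Pn>) = sum over i of -Pi log2 Pi
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def remainder(examples, attribute, target="WillWait"):
    # expected bits per example still needed after splitting on `attribute`
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[attribute]].append(ex[target])
    total = len(examples)
    rem = 0.0
    for labels in groups.values():
        counts = Counter(labels).values()
        rem += (len(labels) / total) * entropy([c / len(labels) for c in counts])
    return rem

# With the restaurant data: remainder(examples, "Pat") is about 0.459 bits, while
# remainder(examples, "Type") is 1 bit, so Patrons? is chosen first.
# An information-gain Choose-Attribute for the DTL sketch above:
def choose_by_gain(attributes, examples):
    return min(attributes, key=lambda a: remainder(examples, a))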

Example contd.

Decision tree learned from the 12 examples:

Patrons?
  None → F
  Some → T
  Full → Hungry?
    No  → F
    Yes → Type?
      French  → T
      Italian → F
      Thai    → Fri/Sat? (No → F, Yes → T)
      Burger  → T

Substantially simpler than "true" tree: a more complex hypothesis isn't justified by small amount of data

Performance measurement

How do we know that h ≈ f? (Hume's Problem of Induction)

1) Use theorems of computational/statistical learning theory

2) Try h on a new test set of examples
(use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size

[Figure: learning curve for the restaurant data; % correct on the test set (roughly 0.4 up to 1) plotted against training set size (0 to 100 examples).]

Performance measurement contd.

Learning curve depends on
– realizable (can express target function) vs. non-realizable
  non-realizability can be due to missing attributes
  or restricted hypothesis class (e.g., thresholded linear function)
– redundant expressiveness (e.g., loads of irrelevant attributes)

[Figure: % correct vs. # of examples, with curves labelled realizable, redundant, and nonrealizable.]
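A hedged sketch of how such a learning curve could be produced: train on increasingly large prefixes of the training data and measure accuracy on a held-out test set. The toy data and the trivial majority-class learner below are placeholders for illustration.

import random

def accuracy(h, test_set):
    return sum(h(x) == y for x, y in test_set) / len(test_set)

def majority_learner(train_set):
    # trivial learner: always predict the most common label seen so far
    labels = [y for _, y in train_set]
    guess = max(set(labels), key=labels.count) if labels else None
    return lambda x: guess

def learning_curve(train_set, test_set, learner, sizes):
    # % correct on the test set as a function of training set size
    return [(m, accuracy(learner(train_set[:m]), test_set)) for m in sizes]

data = [(i, "T" if i % 3 else "F") for i in range(100)]   # toy labelled examples
random.shuffle(data)
train_set, test_set = data[:70], data[70:]
print(learning_curve(train_set, test_set, majority_learner, sizes=[1, 5, 10, 20, 40, 70]))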

Summary

Learning needed for unknown environments, lazy designers

Learning agent = performance element + learning element

Learning method depends on type of performance element, available feedback, type of component to be improved, and its representation

For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with training examples

Decision tree learning using information gain

Learning performance = prediction accuracy measured on test set
