Data Classification Preprocessing Overfitting in Decision ...

Data Preprocessing

Classification & Regression

Overfitting in Decision Trees

• If a decision tree is fully grown, it may lose some generalization capability.

• This phenomenon is known as overfitting.


Definition of Overfitting

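Consider the error of a hypothesis $h$. Let the error on the training data be $\mathrm{error}_{\mathrm{train}}(h)$ and the error over the entire distribution $D$ of data be $\mathrm{error}_{D}(h)$.

Then a hypothesis $h$ "overfits" the training data if there is an alternative hypothesis, $h'$, such that:

$\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')$ and $\mathrm{error}_{D}(h') < \mathrm{error}_{D}(h)$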
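The definition above can be read as a simple predicate over four error values. The sketch below is illustrative only: the function name `overfits` and the numeric error rates are hypothetical and are not taken from the slides. In practice the error over the whole distribution is unknown and is usually estimated on held-out data.

```python
def overfits(h_train, h_true, alt_train, alt_true):
    """Return True if hypothesis h overfits relative to an alternative h'.

    h_train, h_true     : training error and true (distribution) error of h
    alt_train, alt_true : the same two errors for the alternative h'
    """
    # h looks better on the training data, but h' is better overall.
    return h_train < alt_train and alt_true < h_true


# Hypothetical error rates, chosen only to illustrate the two cases.
print(overfits(0.02, 0.25, 0.10, 0.12))  # h fits training data better, generalizes worse
print(overfits(0.10, 0.12, 0.02, 0.25))  # roles swapped: the condition does not hold
```

The first call prints `True` and the second prints `False`, since the two inequalities must hold together.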


Model Overfitting

Errors committed by classification models are generally divided into two types:

1 Training Errors: The number of misclassification errors committed on training records; also called resubstitution error.

2 Generalization Errors: The expected error of the model on previously unseen records.


Causes of Overfitting

1 Overfitting Due to Presence of Noise: Mislabeled instances may contradict the class labels of other similar records.

2 Overfitting Due to Lack of Representative Instances: A lack of representative instances in the training data can prevent refinement of the learning algorithm.

3 Overfitting and the Multiple Comparison Procedure: Failure to compensate for algorithms that explore a large number of alternatives can result in spurious fitting.
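The third cause can be made concrete with a small probability calculation. The sketch below is an illustration under stated assumptions, not something taken from the slides: each candidate is modeled as a fair coin making 10 independent yes/no predictions, "looks good" means at least 8 correct, and 50 candidates are explored. The function names are hypothetical.

```python
from math import comb


def p_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def p_any_candidate(m, k, n):
    """Probability that at least one of m independent random candidates
    reaches k or more successes out of n."""
    single = p_at_least(k, n)
    return 1 - (1 - single) ** m


# One random guesser: chance of 8 or more correct out of 10 (56/1024).
print(p_at_least(8, 10))

# Fifty random guessers: chance that at least one of them looks that good.
print(p_any_candidate(50, 8, 10))
```

Under these assumptions a single random candidate rarely looks good (about 5.5%), while the best of fifty looks good with probability of roughly 0.94. Picking the best of many alternatives without compensating for the number explored can therefore reward chance agreement with the training data.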


Overfitting Due to Noise: An Example

An example training set for classifying mammals. Asterisks denote mislabeled records.

Name             Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
Porcupine        Warm-blooded       Yes           Yes           Yes          Yes
Cat              Warm-blooded       Yes           Yes           No           Yes
Bat              Warm-blooded       Yes           No            Yes          No*
Whale            Warm-blooded       Yes           No            No           No*
Salamander       Cold-blooded       No            Yes           Yes          No
Komodo dragon    Cold-blooded       No            Yes           No           No
Python           Cold-blooded       No            No            Yes          No
Salmon           Cold-blooded       No            No            No           No
Eagle            Warm-blooded       No            No            No           No
Guppy            Cold-blooded       Yes           No            No           No


Overfitting Due to Noise

An example testing set for classifying mammals.

Name             Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
Human            Warm-blooded       Yes           No            No           Yes
Pigeon           Warm-blooded       No            No            No           No
Elephant         Warm-blooded       Yes           Yes           No           Yes
Leopard shark    Cold-blooded       Yes           No            No           No
Turtle           Cold-blooded       No            Yes           No           No
Penguin          Cold-blooded       No            No            No           No
Eel              Cold-blooded       No            No            No           No
Dolphin          Warm-blooded       Yes           No            No           Yes
Spiny anteater   Warm-blooded       No            Yes           Yes          Yes
Gila monster     Cold-blooded       No            Yes           Yes          No


Overfitting Due to Noise

Model 1

Body Temperature
    Warm-blooded: Gives Birth
        Yes: Four-legged
            Yes: Mammals
            No: Nonmammals
        No: Nonmammals
    Cold-blooded: Nonmammals

Model 2

Body Temperature
    Warm-blooded: Gives Birth
        Yes: Mammals
        No: Nonmammals
    Cold-blooded: Nonmammals

Model 1 misclassifies humans and dolphins as nonmammals. Model 2 has a lower test error rate (10%) even though its training error rate is higher (20%).
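The error rates quoted above can be checked by encoding the two trees and the training and testing records from the preceding slides. This is a minimal sketch: the record layout and the names `TRAIN`, `TEST`, `model_1`, `model_2`, and `error_rate` are assumptions made for illustration. Labels are entered exactly as they appear in the tables, including the two mislabeled training records (bat and whale).

```python
# Each record: (name, warm_blooded, gives_birth, four_legged, hibernates, is_mammal_label)
TRAIN = [
    ("Porcupine",     True,  True,  True,  True,  True),
    ("Cat",           True,  True,  True,  False, True),
    ("Bat",           True,  True,  False, True,  False),  # mislabeled in the slides
    ("Whale",         True,  True,  False, False, False),  # mislabeled in the slides
    ("Salamander",    False, False, True,  True,  False),
    ("Komodo dragon", False, False, True,  False, False),
    ("Python",        False, False, False, True,  False),
    ("Salmon",        False, False, False, False, False),
    ("Eagle",         True,  False, False, False, False),
    ("Guppy",         False, True,  False, False, False),
]

TEST = [
    ("Human",          True,  True,  False, False, True),
    ("Pigeon",         True,  False, False, False, False),
    ("Elephant",       True,  True,  True,  False, True),
    ("Leopard shark",  False, True,  False, False, False),
    ("Turtle",         False, False, True,  False, False),
    ("Penguin",        False, False, False, False, False),
    ("Eel",            False, False, False, False, False),
    ("Dolphin",        True,  True,  False, False, True),
    ("Spiny anteater", True,  False, True,  True,  True),
    ("Gila monster",   False, False, True,  True,  False),
]


def model_1(warm, birth, four_legged):
    # Body Temperature -> Gives Birth -> Four-legged
    return warm and birth and four_legged


def model_2(warm, birth, four_legged):
    # Body Temperature -> Gives Birth (Four-legged is ignored)
    return warm and birth


def error_rate(model, records):
    """Fraction of records whose predicted class differs from the label."""
    wrong = sum(
        model(warm, birth, legs) != label
        for _name, warm, birth, legs, _hib, label in records
    )
    return wrong / len(records)


for name, model in [("Model 1", model_1), ("Model 2", model_2)]:
    print(name, "train:", error_rate(model, TRAIN), "test:", error_rate(model, TEST))
```

Run as written, this prints a training error of 0.0 and a test error of 0.3 for Model 1, and a training error of 0.2 and a test error of 0.1 for Model 2. Model 1 fits the noisy labels exactly and pays for it on the test set, which matches the overfitting pattern the slide describes.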


Overfitting Due to Lack of Samples

An example training set for classifying mammals.

Name             Body Temperature   Gives Birth   Four-legged   Hibernates   Class Label
Salamander       Cold-blooded       No            Yes           Yes          No
Guppy            Cold-blooded       Yes           No            No           No
Eagle            Warm-blooded       No            No            No           No
Poorwill         Warm-blooded       No            No            Yes          No
Platypus         Warm-blooded       No            Yes           Yes          Yes
