Data Classification Preprocessing Overfitting in Decision ...
Data Preprocessing
Classification & Regression
Overfitting in Decision Trees
- If a decision tree is fully grown, it may lose some generalization capability.
- This phenomenon is known as overfitting.
Definition of Overfitting
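Consider the error of a hypothesis $h$. Let the error on the training data be $\mathrm{error}_{\mathrm{train}}(h)$ and the error over the entire distribution $\mathcal{D}$ of data be $\mathrm{error}_{\mathcal{D}}(h)$.
Then the hypothesis $h$ "overfits" the training data if there is an alternative hypothesis $h'$ such that:
$$\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h') \quad\text{and}\quad \mathrm{error}_{\mathcal{D}}(h') < \mathrm{error}_{\mathcal{D}}(h)$$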
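The definition can be read as a simple predicate over four error values. The sketch below is only an illustration: the function name `overfits` and the numeric values are assumptions, and in practice the error over the whole distribution is unknown and has to be estimated.

```python
def overfits(train_err_h: float, true_err_h: float,
             train_err_alt: float, true_err_alt: float) -> bool:
    """Return True if hypothesis h overfits relative to an alternative h'.

    h overfits when it beats h' on the training data but h' beats h
    on the full data distribution.
    """
    return train_err_h < train_err_alt and true_err_alt < true_err_h


# Illustrative (made-up) error values, not taken from any data set.
h_overfits = overfits(train_err_h=0.05, true_err_h=0.25,
                      train_err_alt=0.10, true_err_alt=0.15)
h_does_not = overfits(train_err_h=0.05, true_err_h=0.10,
                      train_err_alt=0.10, true_err_alt=0.15)
print(h_overfits, h_does_not)
```

Both conditions must hold at once: a lower training error alone does not indicate overfitting.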
Model Overfitting
Errors committed by classification models are generally divided into two types:
1. Training Errors: The number of misclassification errors committed on training records; also called resubstitution error.
2. Generalization Errors: The expected error of the model on previously unseen records.
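A minimal sketch of how the two quantities are usually measured, assuming a hypothetical one-feature threshold classifier and made-up toy records. The generalization error is an expectation over unseen data, so a held-out set only gives an estimate of it.

```python
from typing import Callable, List, Tuple

Record = Tuple[float, int]  # (feature value, class label); hypothetical toy format


def misclassification_count(predict: Callable[[float], int],
                            records: List[Record]) -> int:
    """Number of records whose predicted label differs from the true label."""
    return sum(1 for x, y in records if predict(x) != y)


def error_rate(predict: Callable[[float], int], records: List[Record]) -> float:
    """Fraction of misclassified records."""
    return misclassification_count(predict, records) / len(records)


def threshold_rule(x: float) -> int:
    """Hypothetical model: predict class 1 when the feature is at least 5."""
    return 1 if x >= 5 else 0


# Made-up data: the record (3, 1) plays the role of a noisy training label.
train = [(1, 0), (2, 0), (3, 1), (6, 1), (8, 1)]
holdout = [(0, 0), (4, 0), (5, 1), (7, 1), (9, 1)]

training_errors = misclassification_count(threshold_rule, train)  # resubstitution error
holdout_estimate = error_rate(threshold_rule, holdout)  # estimate of generalization error
print(training_errors, holdout_estimate)
```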
Causes of Overfitting
1. Overfitting Due to Presence of Noise: Mislabeled instances may contradict the class labels of other similar records.
2. Overfitting Due to Lack of Representative Instances: Lack of representative instances in the training data can prevent refinement of the learning algorithm.
3. Overfitting and the Multiple Comparison Procedure: Failure to compensate for algorithms that explore a large number of alternatives can result in spurious fitting.
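The multiple comparison effect in the third cause can be illustrated with a short calculation. The scenario is an assumption chosen for illustration and is not taken from the slides: each of 50 alternatives guesses a binary outcome at random on 10 trials, and we ask how likely it is that at least one of them gets 8 or more correct purely by chance.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_any(m: int, single: float) -> float:
    """Probability that at least one of m independent alternatives succeeds."""
    return 1 - (1 - single) ** m


# One random guesser getting at least 8 of 10 right: (45 + 10 + 1) / 1024.
single = prob_at_least(8, 10)
# At least one of 50 random guessers doing so.
any_of_50 = prob_any(50, single)
print(f"{single:.4f} {any_of_50:.4f}")
```

A single random alternative looks good only about 5% of the time, but with 50 alternatives some apparently good one is very likely to exist, which is why a search over many candidate splits needs to be compensated for.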
Overfitting Due to Noise: An Example
An example training set for classifying mammals. Asterisks denote mislabelings.
| Name | Body Temperature | Gives Birth | Four-legged | Hibernates | Class Label |
|---|---|---|---|---|---|
| Porcupine | Warm-blooded | Yes | Yes | Yes | Yes |
| Cat | Warm-blooded | Yes | Yes | No | Yes |
| Bat | Warm-blooded | Yes | No | Yes | No* |
| Whale | Warm-blooded | Yes | No | No | No* |
| Salamander | Cold-blooded | No | Yes | Yes | No |
| Komodo dragon | Cold-blooded | No | Yes | No | No |
| Python | Cold-blooded | No | No | Yes | No |
| Salmon | Cold-blooded | No | No | No | No |
| Eagle | Warm-blooded | No | No | No | No |
| Guppy | Cold-blooded | Yes | No | No | No |
Overfitting Due to Noise
An example testing set for classifying mammals.
| Name | Body Temperature | Gives Birth | Four-legged | Hibernates | Class Label |
|---|---|---|---|---|---|
| Human | Warm-blooded | Yes | No | No | Yes |
| Pigeon | Warm-blooded | No | No | No | No |
| Elephant | Warm-blooded | Yes | Yes | No | Yes |
| Leopard shark | Cold-blooded | Yes | No | No | No |
| Turtle | Cold-blooded | No | Yes | No | No |
| Penguin | Cold-blooded | No | No | No | No |
| Eel | Cold-blooded | No | No | No | No |
| Dolphin | Warm-blooded | Yes | No | No | Yes |
| Spiny anteater | Warm-blooded | No | Yes | Yes | Yes |
| Gila monster | Cold-blooded | No | Yes | Yes | No |
Overfitting Due to Noise
Model 1
Body Temperature
  Warm-blooded: Gives Birth
    Yes: Four-legged
      Yes: Mammals
      No: Nonmammals
    No: Nonmammals
  Cold-blooded: Nonmammals
Model 2
Body Temperature
  Warm-blooded: Gives Birth
    Yes: Mammals
    No: Nonmammals
  Cold-blooded: Nonmammals
Model 1 misclassifies humans and dolphins as nonmammals. Model 2 has a lower test error rate (10%) even though its training error rate is higher (20%).
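The error rates quoted for the two models can be checked against the training and testing tables. The sketch below is one way to do it, assuming Model 1 tests Body Temperature, Gives Birth and Four-legged while Model 2 tests only Body Temperature and Gives Birth, as read from the slide's tree figures. The Hibernates column is omitted because neither tree uses it, and all identifiers are illustrative.

```python
# Each record: (name, warm_blooded, gives_birth, four_legged, is_mammal_label)
TRAIN = [
    ("Porcupine", True, True, True, True),
    ("Cat", True, True, True, True),
    ("Bat", True, True, False, False),    # mislabeled in the training set
    ("Whale", True, True, False, False),  # mislabeled in the training set
    ("Salamander", False, False, True, False),
    ("Komodo dragon", False, False, True, False),
    ("Python", False, False, False, False),
    ("Salmon", False, False, False, False),
    ("Eagle", True, False, False, False),
    ("Guppy", False, True, False, False),
]
TEST = [
    ("Human", True, True, False, True),
    ("Pigeon", True, False, False, False),
    ("Elephant", True, True, True, True),
    ("Leopard shark", False, True, False, False),
    ("Turtle", False, False, True, False),
    ("Penguin", False, False, False, False),
    ("Eel", False, False, False, False),
    ("Dolphin", True, True, False, True),
    ("Spiny anteater", True, False, True, True),
    ("Gila monster", False, False, True, False),
]


def model_1(warm: bool, birth: bool, four_legged: bool) -> bool:
    """Deeper tree: mammal only if warm-blooded, gives birth and four-legged."""
    return warm and birth and four_legged


def model_2(warm: bool, birth: bool, four_legged: bool) -> bool:
    """Shallower tree: mammal if warm-blooded and gives birth."""
    return warm and birth


def error_rate(model, records) -> float:
    """Fraction of records whose predicted class differs from the label."""
    wrong = sum(1 for _, w, b, f, label in records if model(w, b, f) != label)
    return wrong / len(records)


for name, model in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, error_rate(model, TRAIN), error_rate(model, TEST))
```

Under these assumptions Model 1 fits the noisy training labels perfectly (0% training error) but errs on 30% of the test records, while Model 2 gives 20% training error and 10% test error. Both models misclassify the spiny anteater, which is warm-blooded but does not give birth.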
Overfitting Due to Lack of Samples
An example training set for classifying mammals.
| Name | Body Temperature | Gives Birth | Four-legged | Hibernates | Class Label |
|---|---|---|---|---|---|
| Salamander | Cold-blooded | No | Yes | Yes | No |
| Guppy | Cold-blooded | Yes | No | No | No |
| Eagle | Warm-blooded | No | No | No | No |
| Poorwill | Warm-blooded | No | No | Yes | No |
| Platypus | Warm-blooded | No | Yes | Yes | Yes |