Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Lecture Notes for Chapter 4

Introduction to Data Mining

by Tan, Steinbach, Kumar

© Tan, Steinbach, Kumar

Introduction to Data Mining

4/18/2004

Classification: Definition

- Given a collection of records (the training set)

  - Each record contains a set of attributes; one of the attributes is the class.

- Find a model for the class attribute as a function of the values of the other attributes.

- Goal: previously unseen records should be assigned a class as accurately as possible.

  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
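The hold-out procedure described above can be sketched in a few lines of Python. This is only an illustration: the helper names (`split_holdout`, `learn_majority_model`, `accuracy`) are hypothetical, the records are the Attrib1, Attrib2 and Class columns of the slides' ten-record training table, and the "model" is a deliberately trivial majority-class predictor standing in for a real learning algorithm.

```python
import random
from collections import Counter

def split_holdout(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_model(train):
    """Placeholder learner: always predict the most common training class."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test if model(attributes) == label)
    return correct / len(test)

# Each record is ((Attrib1, Attrib2), Class), taken from the slides' table.
records = [
    (("Yes", "Large"), "No"), (("No", "Medium"), "No"),
    (("No", "Small"), "No"), (("Yes", "Medium"), "No"),
    (("No", "Large"), "Yes"), (("No", "Medium"), "No"),
    (("Yes", "Large"), "No"), (("No", "Small"), "Yes"),
    (("No", "Medium"), "No"), (("No", "Small"), "Yes"),
]

train_set, test_set = split_holdout(records)
model = learn_majority_model(train_set)   # build the model on the training set
print(len(train_set), len(test_set), accuracy(model, test_set))  # validate on the test set
```

The point of the split is that accuracy is measured only on records the learner never saw; the exact figure depends on the shuffle seed and on how few records there are.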


Illustrating Classification Task

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: the training set is fed to a learning algorithm, which learns a model (induction); the model is then applied to the test set (deduction).]

Examples of Classification Task

- Predicting tumor cells as benign or malignant

- Classifying credit card transactions as legitimate or fraudulent

- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

- Categorizing news stories as finance, weather, entertainment, sports, etc.


Classification Techniques

- Decision Tree based Methods

- Rule-based Methods

- Memory based reasoning

- Neural Networks

- Naïve Bayes and Bayesian Belief Networks

- Support Vector Machines


Example of a Decision Tree

Training Data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
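The tree above can be written as a small Python function and checked against the ten training records. This is a sketch, not part of the original slides: the function name `classify` is hypothetical, incomes are given in thousands, and sending an income of exactly 80K to the "> 80K" branch is an assumption, since the slide labels the branches only "< 80K" and "> 80K".

```python
def classify(refund, marital_status, taxable_income_k):
    """Follow the slide's tree: test Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if taxable_income_k < 80 else "Yes"

TRAINING_DATA = [
    # (Tid, Refund, Marital Status, Taxable Income in K, Cheat)
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Collect the Tids whose predicted class differs from the recorded one.
mismatches = [tid for tid, refund, status, income, cheat in TRAINING_DATA
              if classify(refund, status, income) != cheat]
print("misclassified Tids:", mismatches)
```

An empty list means the tree reproduces the Cheat column for every training record, which is what "fits the data" means here.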


Another Example of Decision Tree

Training Data: the same ten records as in the previous example (Refund and Marital Status categorical, Taxable Income continuous, Cheat the class).

MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
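To make this concrete, the sketch below encodes both trees (Refund at the root, and MarSt at the root) and compares their answers over a grid of attribute values. The function names and the income grid are illustrative choices, and treating exactly 80K as the "> 80K" branch is an assumption carried into both functions.

```python
from itertools import product

def tree_refund_first(refund, status, income_k):
    """Tree with Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if status == "Married":
        return "No"
    return "No" if income_k < 80 else "Yes"  # assumption: 80K counts as "> 80K"

def tree_marst_first(refund, status, income_k):
    """Tree with MarSt at the root, then Refund, then TaxInc."""
    if status == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return "No" if income_k < 80 else "Yes"  # same assumption as above

# Every combination of the categorical values and a few incomes around 80K.
grid = list(product(["Yes", "No"],
                    ["Single", "Married", "Divorced"],
                    [60, 79, 80, 81, 220]))
disagreements = [r for r in grid
                 if tree_refund_first(*r) != tree_marst_first(*r)]
print("records where the trees disagree:", disagreements)
```

These two trees happen to encode the same rule, so they agree everywhere. In general, two trees that both fit the training data need not agree on unseen records.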


Decision Tree Classification Task

[Figure: the same training set (Tid 1 to 10) and test set (Tid 11 to 15) as in "Illustrating Classification Task". Here the learning algorithm is a tree induction algorithm and the learned model is a decision tree: induction learns the tree from the training set, and deduction applies it to the test set.]
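The induction and deduction steps can be illustrated with a one-level tree (a decision stump). This is a stand-in chosen for brevity, not the tree induction algorithm the chapter develops: the learner simply picks the categorical attribute whose majority-class leaves make the fewest training errors. The names `learn_stump` and `apply_stump` are hypothetical, and Attrib3 is left out to avoid handling a continuous split.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["Attrib1", "Attrib2"]
TRAIN = [  # (Attrib1, Attrib2, Class) for Tid 1 to 10
    ("Yes", "Large", "No"), ("No", "Medium", "No"), ("No", "Small", "No"),
    ("Yes", "Medium", "No"), ("No", "Large", "Yes"), ("No", "Medium", "No"),
    ("Yes", "Large", "No"), ("No", "Small", "Yes"), ("No", "Medium", "No"),
    ("No", "Small", "Yes"),
]
TEST = [  # (Attrib1, Attrib2) for Tid 11 to 15; the class is unknown
    ("No", "Small"), ("Yes", "Medium"), ("Yes", "Large"),
    ("No", "Small"), ("No", "Large"),
]

def learn_stump(train):
    """Induction: choose the attribute with the fewest training errors."""
    best = None
    for index, name in enumerate(ATTRIBUTES):
        counts = defaultdict(Counter)  # attribute value -> class counts
        for record in train:
            counts[record[index]][record[-1]] += 1
        # Each leaf predicts the majority class of the records that reach it.
        leaves = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, index, name, leaves)
    return best

def apply_stump(stump, record, default="No"):
    """Deduction: look up the leaf for the record's attribute value."""
    _, index, _, leaves = stump
    return leaves.get(record[index], default)

stump = learn_stump(TRAIN)
predictions = [apply_stump(stump, record) for record in TEST]
print(stump[2], stump[0], predictions)
```

On this data the stump splits on Attrib2 with two training errors. A full decision tree would keep splitting the impure leaves rather than stopping after one test.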

Apply Model to Test Data

Start from the root of the tree.

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
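The root-to-leaf walk can be sketched as a loop over a nested tree structure. The dictionary layout, the helper names, and the field names are illustrative assumptions rather than anything prescribed by the slides; so is the choice to send an income of exactly 80K down the "> 80K" branch, which does not matter for this record because the walk ends before TaxInc is tested.

```python
def income_branch(income_k):
    # Assumption: exactly 80K goes to the "> 80K" side, since the slide
    # labels the branches only "< 80K" and "> 80K".
    return "< 80K" if income_k < 80 else "> 80K"

def marst_branch(status):
    return "Married" if status == "Married" else "Single, Divorced"

# Internal nodes are dicts; leaves are plain class labels.
TAXINC = {"attribute": "TaxInc", "branch_of": income_branch,
          "children": {"< 80K": "NO", "> 80K": "YES"}}
MARST = {"attribute": "MarSt", "branch_of": marst_branch,
         "children": {"Married": "NO", "Single, Divorced": TAXINC}}
ROOT = {"attribute": "Refund", "branch_of": lambda value: value,
        "children": {"Yes": "NO", "No": MARST}}

def classify_with_path(node, record):
    """Walk from the root to a leaf, recording each test that was taken."""
    path = []
    while isinstance(node, dict):
        branch = node["branch_of"](record[node["attribute"]])
        path.append(f'{node["attribute"]} = {branch}')
        node = node["children"][branch]
    return node, path

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
label, path = classify_with_path(ROOT, record)
print(label, path)
```

For the test record, the walk takes the Refund = No branch, then the MarSt = Married branch, and stops at the leaf NO, so Cheat is predicted as No.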


