


Model Answers Assignment 4 STATS 747, 2nd Semester 2002

1) Your client owns a chain of cafés and wants to understand what is important to their customers. They have carried out a survey using the questionnaire available online at . The resulting data is available online at . A wide range of cross-tabulations have already been carried out, and now the client wants you to investigate whether there are segments of their customers that have different feelings about what is important.

i) Explore the data, and in particular the questions about how important various service attributes are. Describe any modifications to the data that would be appropriate before carrying out the cluster analyses outlined below, and modify the data accordingly.

Answer:

[Details of data exploration omitted.]

All the importance questions (from Q6) are measured on the same five-point scale, so standardisation is not required. However, there is a large proportion of missing data (from code 8 = “Don’t Know”), which would bias the results if list-wise deletion were used (or if the “Don’t Know”s were left coded as 8).

For example, more than 15% of customers said “Don’t Know” about the importance of convenient hours, as shown below. All of the items in question 6 had more than 10% “Don’t Know”s.

|6(a) Convenient Hours |
|q6a_1 |Frequency |Percent |Cumulative Frequency |Cumulative Percent |
|1 |14 |1.72 |14 |1.72 |
|2 |44 |5.41 |58 |7.13 |
|3 |165 |20.27 |223 |27.40 |
|4 |261 |32.06 |484 |59.46 |
|5 |193 |23.71 |677 |83.17 |
|8 |137 |16.83 |814 |100.00 |

Imputation should therefore be carried out before the segmentation analyses. Mean or median imputation is a possibility, but it would distort the distribution of responses and could lead to biased results. Better alternatives are hot-deck imputation, or imputing the mean plus a random error drawn from the observed responses; the latter approach is used here.
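For illustration only, the same mean-plus-random-error idea can be sketched in R (this is not the marked solution, which is the SAS code at the end of this document; the function name and example data below are invented for the sketch):

# Impute each missing value as the item mean plus a residual drawn at
# random from the observed responses to that item.
impute_mean_plus_error <- function(x) {
  obs <- x[!is.na(x)]
  resid <- obs - mean(obs)
  x[is.na(x)] <- mean(obs) + sample(resid, sum(is.na(x)), replace = TRUE)
  x
}

# Example: 8 = "Don't Know" is first recoded to missing, then imputed.
x <- c(4, 5, 8, 3, 8, 4)
x[x == 8] <- NA
set.seed(1394881787)
impute_mean_plus_error(x)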

A factor analysis might also be a useful preparatory step if a small number of factors could be identified that summarised the data well. Three factors have eigenvalues above or near 1, but they account for less than half of the variation amongst the Q6 variables, so much of the information would be lost by using just these factors. If all ten factors were used as input to the cluster analyses, factor 1 (the most important component) would be given the same weight as factor 10 (which probably represents random noise), and this could drown out important patterns in the data. For these reasons, using factors does not seem wise in this situation. [However, I have not taken marks off for doing this.]

In case response effects were present, the row means were subtracted from the imputed data, and the k-means clusters for this row-centred dataset were compared with those based on the original imputed data. However, there was only a small gain in relative R² from doing this, so for ease of interpretation the final segmentations have been based on the original imputed data.
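A minimal R sketch of the row-centring check, assuming the imputed items are held in a data frame cafe4 with columns q6a_1 to q6a_10 (the object name and file path are assumptions for the example):

# Subtract each respondent's own mean rating so that clustering reflects
# relative importance rather than an overall tendency to rate everything high.
cafe4 <- read.csv("cafe4.csv")          # imputed data exported from SAS (path assumed)
q6 <- cafe4[, paste0("q6a_", 1:10)]
q6_centred <- sweep(q6, 1, rowMeans(q6))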

ii) Conduct a k-means cluster analysis of the attribute importance questions. Obtain a 3 cluster solution.

Answer:

Twenty k-means analyses, each producing three clusters, were run with different randomly chosen cluster seeds to avoid local minima. The cluster solution with the lowest criterion value was chosen.
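The same multiple-start idea can be sketched in R rather than PROC FASTCLUS (illustrative only; the data frame q6 is assumed to hold the ten imputed importance ratings, and the file path is an assumption):

# Run k-means from 20 random starts and keep the solution with the smallest
# total within-cluster sum of squares, the analogue of choosing the lowest
# FASTCLUS criterion across seeds.
q6 <- read.csv("cafe4.csv")[, paste0("q6a_", 1:10)]
set.seed(116298123)
best <- NULL
for (i in 1:20) {
  fit <- kmeans(q6, centers = 3, nstart = 1)
  if (is.null(best) || fit$tot.withinss < best$tot.withinss) best <- fit
}
table(best$cluster)   # kmeans(q6, 3, nstart = 20) does the same in one call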

iii) Fit a normal mixture model with 3 components to the same data, using Weka’s EM clustering algorithm.

Answer:

The EM algorithm was run 20 times on the imputed data, looking for three latent classes, and using a different seed each time. The likelihood was maximised when the seed was 1721.

The following command was used to produce cluster membership probabilities:

java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -S 1721

which were then read into SAS, and used to produce demographic profiles.

iv) Describe the market segments resulting from k-means in terms of the input variables and demographic characteristics. Describe the market segments from the EM algorithm in terms of the input variables. How do the results differ?

For extra credit: also describe the EM segments in terms of demographic characteristics, and compare with the k-means results.

Show your work, including relevant program code and output. Be sure to avoid local minima and maxima in your analyses.

Answer:

The three clusters from the k-means algorithm comprise 43%, 33% and 24% of the customers respectively.

People in k-means cluster 1 feel that “Value for Money” is important, along with “Availability of Nutritional Information” and “Convenient Hours”. “Employee Friendliness” and “Appearance of Food” are less important to them.

Members of k-means cluster 2 feel that “Value for Money” is even more important (on average), as well as “Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Speed of Service” as important.

Cluster 3 holds customers who believe “Employee Friendliness” and “Appearance of Food” are important, while “Value for Money” and “Cleanliness of the Facility” are not important to them.

|k-means Cluster Means |
[Table of k-means cluster means (q6a_1-q6a_10 by cluster) omitted.]

|Table of CLUSTER by q8 (Gender): frequency and row percent |
|Cluster |q8 = 1 |q8 = 2 |Total |
|1 |144 (40.91) |208 (59.09) |352 |
|2 |118 (44.36) |148 (55.64) |266 |
|3 |80 (40.82) |116 (59.18) |196 |
|Total |342 |472 |814 |

The three segments from the EM algorithm comprise 34%, 40% and 25% of the customers respectively.

Members of EM segment 0 feel that “Employee Friendliness” and “Convenient Hours” are important. “Appearance of Food” and especially “Cleanliness of the Facility” are less important to them.

People in EM segment 1 feel that “Value for Money” is important, as well as “Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Selection of Food” or “Speed of Service” as important. This is similar to k-means cluster 2, although the EM cluster is noticeably larger.

EM segment 2 holds customers who believe “Value for Money” is very important, and “Cleanliness of the Facility”, “Convenient Hours” and “Availability of Nutritional Information” are also important. “Healthy Choices”, “Appearance of Food”, “Freshness of Food”, and “Employee Friendliness” are not important to them.

|EM Segment Means |
[Table of EM segment means (q6a_1-q6a_10 by segment) omitted; the corresponding attribute means appear in the Weka EM output below.]

|Mean EM segment membership probabilities by Gender |
|Gender (q8) |N Obs |Variable |Mean |
|1 |342 |segprob1 |0.3654961 |
| | |segprob2 |0.3645272 |
| | |segprob3 |0.2699767 |
|2 |472 |segprob1 |0.3290861 |
| | |segprob2 |0.4372480 |
| | |segprob3 |0.2336660 |

|Mean EM segment membership probabilities by Age |
|Age (q7) |N Obs |Variable |Mean |
|1 |58 |segprob1 |0.3620683 |
| | |segprob2 |0.3211545 |
| | |segprob3 |0.3167771 |
|2 |146 |segprob1 |0.3493146 |
| | |segprob2 |0.4212077 |
| | |segprob3 |0.2294777 |
|3 |180 |segprob1 |0.3222212 |
| | |segprob2 |0.4048728 |
| | |segprob3 |0.2729059 |
|4 |196 |segprob1 |0.3418353 |
| | |segprob2 |0.4241338 |
| | |segprob3 |0.2340309 |
|5 |137 |segprob1 |0.3649632 |
| | |segprob2 |0.4110561 |
| | |segprob3 |0.2239807 |
|6 |97 |segprob1 |0.3437480 |
| | |segprob2 |0.3975708 |
| | |segprob3 |0.2586811 |

2) A lending institution has gathered information on loan applicants, including whether or not they are judged to be a good credit risk. They wish to understand the factors that may influence credit risk, so that they can make their approval procedure more efficient. The data is the German credit dataset discussed in lectures (available at ).

i) Conduct an exploratory analysis of the data, including univariate graphical summaries and bivariate measures of association between credit risk and the available predictor variables. Which predictors appear to be most strongly associated with the credit risk?

Answer (restricted to the main points – I will add more detail if time permits):

Graphical summaries were generally well done. Bivariate associations between credit risk and the predictor variables can be measured using the measures of association available in PROC FREQ, correlation coefficients (after dummy variables have been created), or a series of bivariate logistic regression analyses.
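For those working in R rather than SAS, a sketch of the same screening is given below (the file name is assumed; numeric predictors such as duration, credit_amount and age would need a different measure, for example a two-sample comparison or a bivariate logistic regression):

# Measure the association between credit risk (class) and each categorical
# predictor using Cramer's V, computed from the chi-squared statistic.
credit <- read.csv("credit-g.csv", stringsAsFactors = TRUE)

cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

cat_vars <- setdiff(names(credit)[sapply(credit, is.factor)], "class")
assoc <- sapply(cat_vars, function(v) cramers_v(credit[[v]], credit$class))
sort(assoc, decreasing = TRUE)   # predictors most strongly associated with risk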

ii) Carry out an appropriate canonical discriminant analysis of this data. You may need to create dummy or indicator variables for the values of nominal predictor variables.

Answer – main points:

Dummy variables should be created for all values of the nominal variables, and perhaps also for ordinal variables. (See attached SAS code for details.) The important output from the canonical discriminant analysis is the (total-sample) standardised canonical coefficients. These show which variables load strongly on the canonical variables. In particular, the variables that load strongly on the first canonical variable are the ones that best explain credit risk.
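As a rough R analogue (not the SAS solution): with a two-level class variable, the first canonical variate from PROC CANDISC is equivalent to Fisher's linear discriminant, so MASS::lda on a dummy-coded design matrix gives comparable loadings up to an overall scale factor. The file name is assumed, and treatment-contrast dummies (one level dropped per nominal variable) are used here instead of a dummy for every level:

# Dummy-code the nominal predictors and fit a linear discriminant analysis.
library(MASS)

credit <- read.csv("credit-g.csv", stringsAsFactors = TRUE)
X <- model.matrix(~ . - class, data = credit)[, -1]   # drop intercept column
fit <- lda(X, grouping = credit$class)

# Multiply the raw coefficients by the total-sample standard deviations to get
# standardised loadings, then list the variables that load most strongly.
std_coef <- fit$scaling[, 1] * apply(X, 2, sd)
head(sort(abs(std_coef), decreasing = TRUE), 10)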

iii) Produce a pruned classification tree for this dataset using the rpart and prune functions in R.

Answer – main points:

Most answers got the full tree without any problems. This tree should be pruned back to the simplest tree with a cross-validated error lower than the minimum cross-validated error plus one standard error, which corresponds here to a cp value of approximately 0.03.
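A sketch of the expected R workflow (the file name and random seed are assumptions, and the exact cp value will vary slightly with the cross-validation split):

# Grow a full classification tree, then prune it back using the
# one-standard-error rule on the cross-validated error.
library(rpart)

credit <- read.csv("credit-g.csv", stringsAsFactors = TRUE)
set.seed(747)
full <- rpart(class ~ ., data = credit, method = "class", cp = 0.001)

cp_tab <- full$cptable
best <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
# simplest tree whose cross-validated error is below the 1-SE threshold
cp_1se <- cp_tab[min(which(cp_tab[, "xerror"] <= threshold)), "CP"]

pruned <- prune(full, cp = cp_1se)   # roughly cp = 0.03 for this data
printcp(pruned)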

iv) Summarise the main results from the above analyses, highlighting their practical implications. Describe any differences between these results that would be of practical significance, and suggest possible reasons for these differences. Which results do you believe would be most useful (and why)?

Q1 SAS code:

PROC IMPORT OUT= WORK.cafe

DATAFILE= "C:\jr\cafe.csv"

DBMS=CSV REPLACE;

GETNAMES=YES;

DATAROW=2;

RUN;

ods html body="temp.html" style=minimal;

proc freq data=cafe;

table q6a_1-q6a_10;

run;

ods html close;

data cafe2;

set cafe;

label

q6a_1="6(a) Convenient Hours"

q6a_2="6(b) Speed of Service"

q6a_3="6(c) Value for Money"

q6a_4="6(d) Employee Friendliness"

q6a_5="6(e) Cleanliness of the facility"

q6a_6="6(f) Selection of Food"

q6a_7="6(g) Appearance of Food"

q6a_8="6(h) Freshness of Food"

q6a_9="6(i) Healthy Choices"

q6a_10="6(j) Availability of Nutritional Information"

q7 = "Age"

q8 = "Gender"

;

array q6 q6a_1-q6a_10;

do over q6;

if q6=8 then q6=.;

end;

run;

proc summary data=cafe2;

var q6a_1-q6a_10;

output out=cafemeans mean=m6a_1-m6a_10;

run;

proc sql;

create table cafe2a as

select * from cafe2, cafemeans;

quit;

data cafe3;

set cafe2a;

array q6 q6a_1-q6a_10;

array m6 m6a_1-m6a_10;

array e6 e6a_1-e6a_10;

do over q6;

if q6 ne . then e6=q6-m6;

end;

random=ranuni(1394881787);

idn=id+0;

run;

proc sort data=cafe3 out=cafe3a;

by random;

run;
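* Impute each missing Q6 value as the variable mean (m6a) plus a residual (e6a)
  taken from a randomly ordered respondent with an observed value, i.e. the
  "mean plus random error" approach described above.;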

%macro impute;

%do i=1 %to 10;

data temp; set cafe3a cafe3a; if q6a_&i ne .; run;

data cafe4a&i;

merge cafe3 (keep=idn q6a_&i q7 q8 in=c) temp (keep=m6a_&i e6a_&i);

if c; if q6a_&i=. then q6a_&i=m6a_&i+e6a_&i;

run;

%end;

data cafe4;

merge

%do i=1 %to 10; cafe4a&i %end;

;

by idn;

keep idn q6a_1-q6a_10 q7 q8;

run;

%mend;

%impute;

/** Try factor analysis to see whether this provides useful simplification.;*/

/*proc factor data=cafe4 n=6 rotate=varimax round reorder flag=.54 scree out=scores;*/

/* var q6a_1-q6a_10;*/

/*run;*/

/**/

/** Adjust for possible response effect.;*/

/*proc transpose data=cafe4 out=cafe4t;*/

/* var q6a_1-q6a_10;*/

/* id idn;*/

/*run;*/

/**/

/*proc standard data=cafe4t out=cafe5t m=0;*/

/* var _1-_814;*/

/*run;*/

/**/

/*proc transpose data=cafe5t out=cafe5;*/

/* var _1-_814;*/

/* id _name_;*/

/* idlabel _label_;*/

/*run;*/

/**/

/*proc fastclus data=cafe5 maxc=3 replace=random random=109162319;*/

/* var q6a_1-q6a_10;*/

/*run;*/

%macro kmclust(seed);

ods output Criterion=kmcriterion;

ods html body="temp.html" style=minimal;

proc fastclus data=cafe4 maxc=3 replace=random random=&seed out=clusters;

var q6a_1-q6a_10;

run;

ods output close;

ods html close;

data _null_; set kmcriterion; put _all_; run;

%mend;

%kmclust(839209472);

%kmclust(726230843);

%kmclust(173049203);

%kmclust(320205828);

%kmclust(929017829);

%kmclust(109162319);

%kmclust(619283921);

%kmclust(463892012);

%kmclust(561092881);

%kmclust(718923491);

%kmclust(429729127);

%kmclust(379457285);

%kmclust(178916387);

%kmclust(739876219);

%kmclust(627881618);

%kmclust(821098721);

%kmclust(897638196);

%kmclust(862341233);

%kmclust(327632882);

%kmclust(116298123); * This seed gives the lowest criterion of 0.8027.;

* Cluster profiles by demographics.;

ods html body="temp.html" style=minimal;

proc freq data=clusters;

table cluster*(q7 q8) / nocol nopercent;

run;

ods html close;

* Export imputed data for fitting latent class model using EM algorithm.;

PROC EXPORT DATA= WORK.CAFE4

OUTFILE= "C:\cafe4.csv"

DBMS=CSV REPLACE;

RUN;

* Read in segment membership probabilities from EM output.;

data EMsegments;

infile "c:\jr\cafeEM.txt" missover pad;

input idn 8-10 modeseg 18 segprob1 20-26 segprob2 29-35 segprob3 38-44;

run;

data EMsegments2;

merge EMsegments cafe4;

by idn;

run;

* EM segment profiles by demographics.;

ods html body="temp.html" style=minimal;

proc means data=EMsegments2 mean print;

class q7 q8;

var segprob1-segprob3;

ways 1;

run;

ods html close;

EM code and output:

> java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -d C:\jr\cafe.out -S 1721

Seed: 1721

Number of instances: 814

Number of atts: 10

======================================

[Verbose initial parameters: for every cluster (0, 1, 2) and every attribute (0-9), Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0.]

Inst 0 Class 2 0.36 0.11808 0.52192

Inst 1 Class 2 0.44115 0.08651 0.47235

Inst 2 Class 1 0.3127 0.41949 0.26782

Inst 3 Class 0 0.53905 0.23162 0.22934

Inst 4 Class 1 0.3709 0.42352 0.20558

Inst 5 Class 1 0.02279 0.69282 0.2844

Inst 6 Class 0 0.48523 0.41443 0.10033

Inst 7 Class 2 0.11324 0.37974 0.50702

Inst 8 Class 0 0.70994 0.21248 0.07758

Inst 9 Class 2 0.36237 0.13268 0.50495

Inst 10 Class 0 0.59445 0.16679 0.23876

.

.

.

Inst 810 Class 0 0.53874 0.03339 0.42786

Inst 811 Class 2 0.3562 0.10449 0.53931

Inst 812 Class 2 0.3153 0.29604 0.38865

Inst 813 Class 0 0.37679 0.25703 0.36617

Loglikely: -12.947281732790431

Loglikely: -12.945401366274425

Loglikely: -12.938726747977523

Loglikely: -12.901977150615119

Loglikely: -12.75012451077629

Loglikely: -12.493136950893648

Loglikely: -12.365027354312812

Loglikely: -12.304789204789568

Loglikely: -12.244897621583464

Loglikely: -12.193376754473698

Loglikely: -12.167358555788184

Loglikely: -12.157754243677418

Loglikely: -12.15417345138127

Loglikely: -12.152423619369937

Loglikely: -12.151152790252866

Loglikely: -12.149841301667593

Loglikely: -12.148118219983392

Loglikely: -12.145561523116212

Loglikely: -12.141839465588516

Loglikely: -12.137677594019792

Loglikely: -12.134179767406906

Loglikely: -12.131434990910169

Loglikely: -12.129153188213081

Loglikely: -12.12710171492302

Loglikely: -12.125226260553502

Loglikely: -12.12349702882937

Loglikely: -12.121828686952819

Loglikely: -12.12013959946356

Loglikely: -12.118388891732248

Loglikely: -12.116555091413234

Loglikely: -12.1145537246838

Loglikely: -12.112084376404072

Loglikely: -12.10816598293722

Loglikely: -12.098749169963604

Loglikely: -12.055958366015664

Loglikely: -11.741061138423825

Loglikely: -8.27125123227167

Loglikely: -7.964040764111144

Loglikely: -7.962932367664369

Loglikely: -7.962702860111277

Loglikely: -7.962619911759608

Loglikely: -7.962588379101999

Loglikely: -7.962575791799965

Loglikely: -7.962570581434721

Loglikely: -7.962568372746465

Loglikely: -7.962567422597268

======================================

Clust: 0 att: 0

Normal Distribution. Mean = 3.7643 StandardDev = 1.0008 WeightSum = 279.9992

Clust: 0 att: 1

Normal Distribution. Mean = 3.9571 StandardDev = 0.8691 WeightSum = 279.9992

Clust: 0 att: 2

Normal Distribution. Mean = 3.925 StandardDev = 1.0376 WeightSum = 279.9992

Clust: 0 att: 3

Normal Distribution. Mean = 3.6429 StandardDev = 1.0252 WeightSum = 279.9992

Clust: 0 att: 4

Normal Distribution. Mean = 5 StandardDev = 0 WeightSum = 279.9992

Clust: 0 att: 5

Normal Distribution. Mean = 4.4679 StandardDev = 0.5785 WeightSum = 279.9992

Clust: 0 att: 6

Normal Distribution. Mean = 3.9036 StandardDev = 0.9494 WeightSum = 279.9992

Clust: 0 att: 7

Normal Distribution. Mean = 3.9429 StandardDev = 0.8347 WeightSum = 279.9992

Clust: 0 att: 8

Normal Distribution. Mean = 4.0679 StandardDev = 0.8695 WeightSum = 279.9992

Clust: 0 att: 9

Normal Distribution. Mean = 4.1571 StandardDev = 0.8086 WeightSum = 279.9992

Clust: 1 att: 0

Normal Distribution. Mean = 4.1496 StandardDev = 0.8437 WeightSum = 330.5551

Clust: 1 att: 1

Normal Distribution. Mean = 4.2428 StandardDev = 0.7293 WeightSum = 330.5551

Clust: 1 att: 2

Normal Distribution. Mean = 3.235 StandardDev = 1.0697 WeightSum = 330.5551

Clust: 1 att: 3

Normal Distribution. Mean = 3.4858 StandardDev = 1.0646 WeightSum = 330.5551

Clust: 1 att: 4

Normal Distribution. Mean = 3.5058 StandardDev = 0.688 WeightSum = 330.5551

Clust: 1 att: 5

Normal Distribution. Mean = 4.4793 StandardDev = 0.6006 WeightSum = 330.5551

Clust: 1 att: 6

Normal Distribution. Mean = 3.7508 StandardDev = 0.9563 WeightSum = 330.5551

Clust: 1 att: 7

Normal Distribution. Mean = 3.8212 StandardDev = 0.8231 WeightSum = 330.5551

Clust: 1 att: 8

Normal Distribution. Mean = 3.8357 StandardDev = 0.8844 WeightSum = 330.5551

Clust: 1 att: 9

Normal Distribution. Mean = 4.0428 StandardDev = 0.7403 WeightSum = 330.5551

Clust: 2 att: 0

Normal Distribution. Mean = 3.4965 StandardDev = 0.9632 WeightSum = 203.4457

Clust: 2 att: 1

Normal Distribution. Mean = 3.7824 StandardDev = 0.7585 WeightSum = 203.4457

Clust: 2 att: 2

Normal Distribution. Mean = 3.0703 StandardDev = 0.9877 WeightSum = 203.4457

Clust: 2 att: 3

Normal Distribution. Mean = 4.6241 StandardDev = 0.5041 WeightSum = 203.4457

Clust: 2 att: 4

Normal Distribution. Mean = 3.4317 StandardDev = 0.7422 WeightSum = 203.4457

Clust: 2 att: 5

Normal Distribution. Mean = 4.2732 StandardDev = 0.6017 WeightSum = 203.4457

Clust: 2 att: 6

Normal Distribution. Mean = 4.7589 StandardDev = 0.4323 WeightSum = 203.4457

Clust: 2 att: 7

Normal Distribution. Mean = 4.6887 StandardDev = 0.4754 WeightSum = 203.4457

Clust: 2 att: 8

Normal Distribution. Mean = 4.8076 StandardDev = 0.3957 WeightSum = 203.4457

Clust: 2 att: 9

Normal Distribution. Mean = 3.5176 StandardDev = 0.8439 WeightSum = 203.4457

Inst 0 Class 2 0 0.02973 0.97027

Inst 1 Class 2 0 0.01223 0.98777

Inst 2 Class 0 1 0 0

Inst 3 Class 1 0 1 0

Inst 4 Class 1 0 0.91439 0.08561

Inst 5 Class 0 1 0 0

Inst 6 Class 2 0 0.11805 0.88195

Inst 7 Class 2 0 0.14931 0.85069

Inst 8 Class 1 0 1 0

Inst 9 Class 1 0 0.99892 0.00108

Inst 10 Class 0 1 0 0

.

.

.

Inst 810 Class 2 0 0.00531 0.99469

Inst 811 Class 0 0.99991 0 0.00009

Inst 812 Class 1 0 0.99076 0.00924

Inst 813 Class 0 1 0 0

EM

==

Number of clusters: 3

Cluster: 0 Prior probability: 0.344

Attribute: q6a_1

Normal Distribution. Mean = 3.7643 StdDev = 1.0008

Attribute: q6a_2

Normal Distribution. Mean = 3.9571 StdDev = 0.8691

Attribute: q6a_3

Normal Distribution. Mean = 3.925 StdDev = 1.0376

Attribute: q6a_4

Normal Distribution. Mean = 3.6429 StdDev = 1.0252

Attribute: q6a_5

Normal Distribution. Mean = 5 StdDev = 0

Attribute: q6a_6

Normal Distribution. Mean = 4.4679 StdDev = 0.5785

Attribute: q6a_7

Normal Distribution. Mean = 3.9036 StdDev = 0.9494

Attribute: q6a_8

Normal Distribution. Mean = 3.9429 StdDev = 0.8347

Attribute: q6a_9

Normal Distribution. Mean = 4.0679 StdDev = 0.8695

Attribute: q6a_10

Normal Distribution. Mean = 4.1571 StdDev = 0.8086

Cluster: 1 Prior probability: 0.4062

Attribute: q6a_1

Normal Distribution. Mean = 4.1496 StdDev = 0.8437

Attribute: q6a_2

Normal Distribution. Mean = 4.2428 StdDev = 0.7293

Attribute: q6a_3

Normal Distribution. Mean = 3.235 StdDev = 1.0697

Attribute: q6a_4

Normal Distribution. Mean = 3.4858 StdDev = 1.0646

Attribute: q6a_5

Normal Distribution. Mean = 3.5058 StdDev = 0.688

Attribute: q6a_6

Normal Distribution. Mean = 4.4793 StdDev = 0.6006

Attribute: q6a_7

Normal Distribution. Mean = 3.7508 StdDev = 0.9563

Attribute: q6a_8

Normal Distribution. Mean = 3.8212 StdDev = 0.8231

Attribute: q6a_9

Normal Distribution. Mean = 3.8357 StdDev = 0.8844

Attribute: q6a_10

Normal Distribution. Mean = 4.0428 StdDev = 0.7403

Cluster: 2 Prior probability: 0.2498

Attribute: q6a_1

Normal Distribution. Mean = 3.4965 StdDev = 0.9632

Attribute: q6a_2

Normal Distribution. Mean = 3.7824 StdDev = 0.7585

Attribute: q6a_3

Normal Distribution. Mean = 3.0703 StdDev = 0.9877

Attribute: q6a_4

Normal Distribution. Mean = 4.6241 StdDev = 0.5041

Attribute: q6a_5

Normal Distribution. Mean = 3.4317 StdDev = 0.7422

Attribute: q6a_6

Normal Distribution. Mean = 4.2732 StdDev = 0.6017

Attribute: q6a_7

Normal Distribution. Mean = 4.7589 StdDev = 0.4323

Attribute: q6a_8

Normal Distribution. Mean = 4.6887 StdDev = 0.4754

Attribute: q6a_9

Normal Distribution. Mean = 4.8076 StdDev = 0.3957

Attribute: q6a_10

Normal Distribution. Mean = 3.5176 StdDev = 0.8439

=== Clustering stats for training data ===

Clustered Instances

0 280 ( 34%)

1 329 ( 40%)

2 205 ( 25%)

Log likelihood: -7.96257

Q2 SAS code.

libname asst4 'C:\My Documents\747';

* @data portion of credit-g.arff file extracted, and header line containing variable names added,

before reading data into SAS.;

PROC IMPORT DATAFILE = 'C:\My Documents\747\credit-g.csv'

OUT = asst4.credit DBMS = CSV REPLACE;

run;

ods html body="C:\My Documents\747\credit.html" style=minimal;

proc contents data=asst4.credit;

run;

proc freq data=asst4.credit;

tables checking_status duration credit_history purpose credit_amount savings_status employment

installment_commitment personal_status other_parties residence_since property_magnitude

age other_payment_plans housing existing_credits job num_dependents own_telephone

foreign_worker class;

run;

proc freq data=asst4.credit;

tables (checking_status duration credit_history purpose credit_amount savings_status employment

installment_commitment personal_status other_parties residence_since property_magnitude

age other_payment_plans housing existing_credits job num_dependents own_telephone

foreign_worker)*class / measures;

run;

* proc glmmod provides a quick way to create dummy variables.

This could also be done in a data step using if-then statements.;

proc glmmod data=asst4.credit outdesign=asst4.credit2;

* checking_status, savings_status, property_magnitude and job could arguably be regarded as ordinal,

and coded as appropriate numeric values. However here they are treated as nominal variables.;

class checking_status credit_history purpose savings_status employment

personal_status other_parties property_magnitude

other_payment_plans housing job own_telephone

foreign_worker class;

model duration = checking_status duration credit_history purpose credit_amount savings_status employment

installment_commitment personal_status other_parties residence_since property_magnitude

age other_payment_plans housing existing_credits job num_dependents own_telephone

foreign_worker class;

run;

proc contents data=asst4.credit2;

run;

proc candisc data=asst4.credit2;

class col64;

var col1-col62;

run;

ods html close;

Q2 R code.

[R code truncated in the source.]
