STATS 747, 2nd Semester 2002 - Auckland
Model Answers, Assignment 4
1) Your client owns a chain of cafés and wants to understand what is important to their customers. They have carried out a survey using the questionnaire available online at . The resulting data is available online at . A wide range of cross-tabulations have already been carried out, and now the client wants you to investigate whether there are segments of their customers that have different feelings about what is important.
i) Explore the data, and in particular the questions about how important various service attributes are. Describe any modifications to the data that would be appropriate before carrying out the cluster analyses outlined below, and modify the data accordingly.
Answer:
[Details of data exploration omitted.]
All the importance questions (from Q6) are measured on the same five-point scale, so standardisation is not required. However, a large proportion of the data is missing (code 8 = “Don’t Know”), which would bias the results if list-wise deletion were used (or if the “Don’t Know”s were left as code 8).
For example, more than 15% of customers said “Don’t Know” about the importance of convenient hours, as shown below. All of the items in question 6 had more than 10% “Don’t Know”s.
|6(a) Convenient Hours |
|q6a_1 |Frequency |Percent |Cumulative Frequency |Cumulative Percent |
|1 |14 |1.72 |14 |1.72 |
|2 |44 |5.41 |58 |7.13 |
|3 |165 |20.27 |223 |27.40 |
|4 |261 |32.06 |484 |59.46 |
|5 |193 |23.71 |677 |83.17 |
|8 |137 |16.83 |814 |100.00 |
Imputation should therefore be carried out before the segmentation analyses. Mean or median imputation is a possibility, but this would distort the distribution of responses and could lead to biased results. Hot-deck imputation would be a better choice, or imputation of the mean plus a random error. The latter approach is used here.
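The idea of imputing the mean plus a random error can be sketched as follows. This is a minimal Python illustration rather than the SAS code used for the assignment. It assumes the random error is drawn by resampling the observed residuals of the same item; the function name and the toy responses are hypothetical.

```python
import random

DONT_KNOW = 8  # questionnaire code for "Don't Know"

def impute_mean_plus_error(values, rng):
    """Replace DONT_KNOW codes with the item mean plus a randomly drawn observed residual."""
    observed = [v for v in values if v != DONT_KNOW]
    mean = sum(observed) / len(observed)
    residuals = [v - mean for v in observed]
    return [v if v != DONT_KNOW else mean + rng.choice(residuals) for v in values]

# Toy responses for one importance item (hypothetical, not the survey data).
item = [4, 5, 8, 3, 4, 8, 5, 2]
rng = random.Random(2002)  # fixed seed so the imputation is reproducible
imputed = impute_mean_plus_error(item, rng)
print(imputed)
```

Because each imputed value is the item mean plus a residual that was actually observed, the imputed values stay on the original 1 to 5 scale and the spread of the item is preserved, which plain mean imputation would not achieve.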
A factor analysis might also be a useful preparatory step if a small number of factors could be identified that summarised the data well. Three factors have eigenvalues above or near 1, but these factors account for less than half the variation amongst the Q6 variables, so much of the information would be lost by using just these factors. If all ten factors were used as input to the cluster analyses, factor 1 (the most important component) would be given the same weight as factor 10 (which probably represents random noise). This could drown out important patterns in the data. For this reason, using factors does not seem wise in this situation. [However I have not taken marks off for doing this.]
In case response effects were present, row means were subtracted from the imputed data. The k-means clusters for this row-centred dataset were then compared with those based on the original imputed data. However, there was only a small gain in relative R² from doing this, so for ease of interpretation the final segmentations have been based on the original imputed data.
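Subtracting row means can be illustrated with a short Python sketch. The function name and the ratings are hypothetical; the point is that two respondents with the same pattern of relative importance, but different overall rating levels, become identical after centring.

```python
def centre_rows(rows):
    """Subtract each respondent's own mean rating from all of their ratings."""
    centred = []
    for row in rows:
        row_mean = sum(row) / len(row)
        centred.append([value - row_mean for value in row])
    return centred

# Hypothetical ratings: respondent 2 rates every item one point higher than respondent 1.
ratings = [[3, 4, 2, 3], [4, 5, 3, 4]]
centred = centre_rows(ratings)
print(centred)
```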
ii) Conduct a k-means cluster analysis of the attribute importance questions. Obtain a 3 cluster solution.
Answer:
Twenty k-means analyses producing three clusters were run, each with different randomly chosen cluster seeds, to avoid local minima. The solution with the lowest criterion value (0.8027, from seed 116298123) was chosen.
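The restart strategy can be sketched in Python as below. This is an illustration only: it uses a plain Lloyd-style k-means with the within-cluster sum of squares as the criterion, which is an assumption and not necessarily the criterion reported by PROC FASTCLUS. The toy points and seeds are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, iterations=100):
    """Lloyd's algorithm from randomly chosen seed points; returns (criterion, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so the run has converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return wss, labels

# Hypothetical two-dimensional data with three well-separated groups.
points = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
          [4.0, 4.2], [4.1, 3.9], [3.9, 4.0],
          [1.0, 4.1], [0.9, 3.8], [1.2, 4.0]]

# Run 20 restarts with different seeds and keep the solution with the lowest criterion.
runs = [(seed, *kmeans(points, 3, seed)) for seed in range(20)]
best_seed, best_wss, best_labels = min(runs, key=lambda run: run[1])
print(best_seed, round(best_wss, 4))
```

Individual runs can stop at a poor local minimum (for example with two true groups merged), which is why the comparison across seeds matters.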
iii) Fit a normal mixture model with 3 components to the same data, using Weka’s EM clustering algorithm.
Answer:
The EM algorithm was run 20 times on the imputed data, looking for three latent classes, and using a different seed each time. The likelihood was maximised when the seed was 1721.
The following command was used to produce cluster membership probabilities:
java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -S 1721
which were then read into SAS, and used to produce demographic profiles.
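As a rough illustration of how membership probabilities can be summarised, the Python sketch below derives the modal segment for each respondent and the average membership probability per segment. The function name is hypothetical; the five rows of probabilities are those listed for instances 0 to 4 after the final EM iteration in the Weka output at the end of this document.

```python
def summarise_memberships(probabilities):
    """Return the modal segment per respondent and the mean membership probability per segment."""
    n_segments = len(probabilities[0])
    modal = [max(range(n_segments), key=lambda s: p[s]) for p in probabilities]
    n = len(probabilities)
    shares = [sum(p[s] for p in probabilities) / n for s in range(n_segments)]
    return modal, shares

# Membership probabilities for instances 0-4 (final EM iteration in the Weka output).
probs = [
    [0, 0.02973, 0.97027],
    [0, 0.01223, 0.98777],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0.91439, 0.08561],
]
modal, shares = summarise_memberships(probs)
print(modal, [round(s, 3) for s in shares])
```

Averaging the probabilities within a demographic group, rather than counting modal segments, keeps the uncertainty about each respondent's segment in the profile.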
iv) Describe the market segments resulting from k-means in terms of the input variables and demographic characteristics. Describe the market segments from the EM algorithm in terms of the input variables. How do the results differ?
For extra credit: also describe the EM segments in terms of demographic characteristics, and compare with the k-means results.
Show your work, including relevant program code and output. Be sure to avoid local minima and maxima in your analyses.
Answer:
The three clusters from the k-means algorithm comprise 43%, 33% and 24% of the customers respectively.
People in k-means cluster 1 feel that “Value for Money” is important, along with “Availability of Nutritional Information” and “Convenient Hours”. “Employee Friendliness” and “Appearance of Food” are less important to them.
Members of k-means cluster 2 feel that “Value for Money” is even more important (on average), as well as “Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Speed of Service” as important.
Cluster 3 holds customers who believe “Employee Friendliness” and “Appearance of Food” are important, while “Value for Money” and “Cleanliness of the Facility” are not important to them.
|k-means Cluster Means |
|Cluster |q6a_1 |
|Table of CLUSTER by q8 (Frequency, Row Pct) |
|CLUSTER(Cluster) |q8(Gender)=1 Frequency |q8(Gender)=1 Row Pct |q8(Gender)=2 Frequency |q8(Gender)=2 Row Pct |Total |
|1 |144 |40.91 |208 |59.09 |352 |
|2 |118 |44.36 |148 |55.64 |266 |
|3 |80 |40.82 |116 |59.18 |196 |
|Total |342 | |472 | |814 |
The three segments from the EM algorithm comprise 34%, 40% and 25% of the customers respectively.
Members of EM segment 0 feel that “Employee Friendliness” and “Convenient Hours” are important. “Appearance of Food” and especially “Cleanliness of the Facility” are less important to them.
People in EM segment 1 feel that “Value for Money” is important, as well as “Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Selection of Food” or “Speed of Service” as important. This is similar to k-means cluster 2, although the EM cluster is noticeably larger.
EM segment 2 holds customers who believe “Value for Money” is very important, and “Cleanliness of the Facility”, “Convenient Hours” and “Availability of Nutritional Information” are also important. “Healthy Choices”, “Appearance of Food”, “Freshness of Food”, and “Employee Friendliness” are not important to them.
|EM Segment Means |
|Segment |q6a_1 |q6a_2 |q6a_3 |
|Gender |N Obs |Variable |Mean |
|1 |342 |segprob1 |0.3654961 |
| | |segprob2 |0.3645272 |
| | |segprob3 |0.2699767 |
|2 |472 |segprob1 |0.3290861 |
| | |segprob2 |0.4372480 |
| | |segprob3 |0.2336660 |
|Age |N Obs |Variable |Mean |
|1 |58 |segprob1 |0.3620683 |
| | |segprob2 |0.3211545 |
| | |segprob3 |0.3167771 |
|2 |146 |segprob1 |0.3493146 |
| | |segprob2 |0.4212077 |
| | |segprob3 |0.2294777 |
|3 |180 |segprob1 |0.3222212 |
| | |segprob2 |0.4048728 |
| | |segprob3 |0.2729059 |
|4 |196 |segprob1 |0.3418353 |
| | |segprob2 |0.4241338 |
| | |segprob3 |0.2340309 |
|5 |137 |segprob1 |0.3649632 |
| | |segprob2 |0.4110561 |
| | |segprob3 |0.2239807 |
|6 |97 |segprob1 |0.3437480 |
| | |segprob2 |0.3975708 |
| | |segprob3 |0.2586811 |
2) A lending institution has gathered information on loan applicants, including whether or not they are judged to be a good credit risk. They wish to understand the factors that may influence credit risk, so that they can make their approval procedure more efficient. The data is the German credit dataset discussed in lectures (available at ).
i) Conduct an exploratory analysis of the data, including univariate graphical summaries and bivariate measures of association between credit risk and the available predictor variables. Which predictors appear to be most strongly associated with the credit risk?
Answer (restricted to the main points – I will add more detail if time permits):
Graphical summaries were generally well done. Measuring bivariate associations between credit risk and the predictor variables can be based on the measures of association available in PROC FREQ, correlation coefficients (after dummy variables have been created) or a series of bivariate logistic regression analyses.
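One common measure of association between a nominal predictor and credit risk is Cramér's V, computed from the chi-square statistic of the two-way table. The Python sketch below shows the calculation; the function name and the counts are hypothetical and are not taken from the German credit data.

```python
import math

def cramers_v(table):
    """Cramér's V for a two-way table of counts, given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    smaller_dim = min(len(table), len(table[0]))
    return math.sqrt(chi_square / (n * (smaller_dim - 1)))

# Hypothetical counts: rows are two levels of a predictor, columns are good/bad risk.
associated = [[30, 10], [20, 40]]
independent = [[10, 20], [20, 40]]
print(round(cramers_v(associated), 4), round(cramers_v(independent), 4))
```

Values near 0 indicate little association and values near 1 indicate strong association, so predictors can be ranked by this measure.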
ii) Carry out an appropriate canonical discriminant analysis of this data. You may need to create dummy or indicator variables for the values of nominal predictor variables.
Answer – main points:
Dummy variables should be created for all values of the nominal variables, and perhaps also for ordinal variables. (See attached SAS code for details.) The important output from the canonical discriminant analysis is the (total-sample) standardised canonical coefficients. These show which variables load strongly on the canonical variables. In particular, the variables that load strongly on the first canonical variable are the ones that best explain credit risk.
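Creating dummy variables amounts to building one 0/1 indicator column per level of a nominal variable. A minimal Python sketch follows; the function name is hypothetical and the values are illustrative rather than rows from the dataset.

```python
def make_dummies(values):
    """Return the sorted levels and one 0/1 indicator column per level."""
    levels = sorted(set(values))
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Illustrative values for a nominal predictor such as housing.
housing = ["own", "rent", "own", "for free"]
levels, dummies = make_dummies(housing)
print(levels)
print(dummies)
```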
iii) Produce a pruned classification tree for this dataset using the rpart and prune functions in R.
Answer – main points:
Most answers got the full tree without any problems. This tree should be pruned back to the simplest tree with a cross-validated error lower than the minimum cross-validated error plus one standard error, which corresponds here to a cp value of approximately 0.03.
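The pruning rule described above (often called the one-standard-error rule) can be sketched in Python. The table below mimics the cp, nsplit, xerror and xstd columns of an rpart complexity table, but the numbers are hypothetical and are not the output for this assignment; the function name is also an assumption.

```python
def one_se_cp(cp_table):
    """Pick the simplest tree whose cross-validated error is below min(xerror) + its xstd.

    cp_table rows are (cp, nsplit, xerror, xstd), ordered from the smallest tree to the largest.
    """
    best = min(cp_table, key=lambda row: row[2])
    threshold = best[2] + best[3]
    for cp, nsplit, xerror, xstd in cp_table:
        if xerror < threshold:
            return cp, nsplit
    return best[0], best[1]  # fallback if xstd is zero

# Hypothetical complexity table (not the assignment's output).
cp_table = [
    (0.050, 0, 1.000, 0.048),
    (0.030, 3, 0.900, 0.046),
    (0.015, 4, 0.870, 0.046),
    (0.010, 8, 0.880, 0.046),
]
print(one_se_cp(cp_table))
```

In this toy table the minimum cross-validated error is 0.870, so the threshold is 0.916, and the first (simplest) tree below it is the one with 3 splits. The corresponding cp value would then be passed to the prune function.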
iv) Summarise the main results from the above analyses, highlighting their practical implications. Describe any differences between these results that would be of practical significance, and suggest possible reasons for these differences. Which results do you believe would be most useful (and why)?
Q1 SAS code:
PROC IMPORT OUT= WORK.cafe
DATAFILE= "C:\jr\cafe.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
ods html body="temp.html" style=minimal;
proc freq data=cafe;
table q6a_1-q6a_10;
run;
ods html close;
data cafe2;
set cafe;
label
q6a_1="6(a) Convenient Hours"
q6a_2="6(b) Speed of Service"
q6a_3="6(c) Value for Money"
q6a_4="6(d) Employee Friendliness"
q6a_5="6(e) Cleanliness of the facility"
q6a_6="6(f) Selection of Food"
q6a_7="6(g) Appearance of Food"
q6a_8="6(h) Freshness of Food"
q6a_9="6(i) Healthy Choices"
q6a_10="6(j) Availability of Nutritional Information"
q7 = "Age"
q8 = "Gender"
;
array q6 q6a_1-q6a_10;
do over q6;
if q6=8 then q6=.;
end;
run;
proc summary data=cafe2;
var q6a_1-q6a_10;
output out=cafemeans mean=m6a_1-m6a_10;
run;
proc sql;
create table cafe2a as
select * from cafe2, cafemeans;
quit;
data cafe3;
set cafe2a;
array q6 q6a_1-q6a_10;
array m6 m6a_1-m6a_10;
array e6 e6a_1-e6a_10;
do over q6;
if q6 ne . then e6=q6-m6;
end;
random=ranuni(1394881787);
idn=id+0;
run;
proc sort data=cafe3 out=cafe3a;
by random;
run;
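/* Impute each missing q6a_i as the item mean plus a residual taken from a randomly chosen respondent. */
%macro impute;
%do i=1 %to 10;
/* Donor pool: randomly ordered respondents who answered item i, stacked twice so the pool is long enough to pair with every respondent. */
data temp; set cafe3a cafe3a; if q6a_&i ne .; run;
data cafe4a&i;
/* One-to-one merge (no BY statement) pairs each respondent with a random donor row. */
merge cafe3 (keep=idn q6a_&i q7 q8 in=c) temp (keep=m6a_&i e6a_&i);
if c; if q6a_&i=. then q6a_&i=m6a_&i+e6a_&i;
run;
%end;
/* Combine the ten imputed items into a single dataset. */
data cafe4;
merge
%do i=1 %to 10; cafe4a&i %end;
;
by idn;
keep idn q6a_1-q6a_10 q7 q8;
run;
%mend;
%impute;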
/** Try factor analysis to see whether this provides useful simplification.;*/
/*proc factor data=cafe4 n=6 rotate=varimax round reorder flag=.54 scree out=scores;*/
/* var q6a_1-q6a_10;*/
/*run;*/
/**/
/** Adjust for possible response effect.;*/
/*proc transpose data=cafe4 out=cafe4t;*/
/* var q6a_1-q6a_10;*/
/* id idn;*/
/*run;*/
/**/
/*proc standard data=cafe4t out=cafe5t m=0;*/
/* var _1-_814;*/
/*run;*/
/**/
/*proc transpose data=cafe5t out=cafe5;*/
/* var _1-_814;*/
/* id _name_;*/
/* idlabel _label_;*/
/*run;*/
/**/
/*proc fastclus data=cafe5 maxc=3 replace=random random=109162319;*/
/* var q6a_1-q6a_10;*/
/*run;*/
%macro kmclust(seed);
ods output Criterion=kmcriterion;
ods html body="temp.html" style=minimal;
proc fastclus data=cafe4 maxc=3 replace=random random=&seed out=clusters;
var q6a_1-q6a_10;
run;
ods output close;
ods html close;
data _null_; set kmcriterion; put _all_; run;
%mend;
%kmclust(839209472);
%kmclust(726230843);
%kmclust(173049203);
%kmclust(320205828);
%kmclust(929017829);
%kmclust(109162319);
%kmclust(619283921);
%kmclust(463892012);
%kmclust(561092881);
%kmclust(718923491);
%kmclust(429729127);
%kmclust(379457285);
%kmclust(178916387);
%kmclust(739876219);
%kmclust(627881618);
%kmclust(821098721);
%kmclust(897638196);
%kmclust(862341233);
%kmclust(327632882);
%kmclust(116298123); * This seed gives the lowest criterion of 0.8027.;
* Cluster profiles by demographics.;
ods html body="temp.html" style=minimal;
proc freq data=clusters;
table cluster*(q7 q8) / nocol nopercent;
run;
ods html close;
* Export imputed data for fitting latent class model using EM algorithm.;
PROC EXPORT DATA= WORK.CAFE4
OUTFILE= "C:\cafe4.csv"
DBMS=CSV REPLACE;
RUN;
* Read in segment membership probabilities from EM output.;
data EMsegments;
infile "c:\jr\cafeEM.txt" missover pad;
input idn 8-10 modeseg 18 segprob1 20-26 segprob2 29-35 segprob3 38-44;
run;
data EMsegments2;
merge EMsegments cafe4;
by idn;
run;
* EM segment profiles by demographics.;
ods html body="temp.html" style=minimal;
proc means data=EMsegments2 mean print;
class q7 q8;
var segprob1-segprob3;
ways 1;
run;
ods html close;
EM code and output:
> java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -d C:\jr\cafe.out -S 1721
Seed: 1721
Number of instances: 814
Number of atts: 10
======================================
Clust: 0 att: 0
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 1
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 2
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 3
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 4
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 5
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 6
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 7
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 8
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 0 att: 9
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 0
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 1
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 2
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 3
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 4
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 5
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 6
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 7
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 8
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 1 att: 9
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 0
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 1
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 2
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 3
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 4
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 5
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 6
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 7
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 8
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Clust: 2 att: 9
Normal Distribution. Mean = 0 StandardDev = 0 WeightSum = 0
Inst 0 Class 2 0.36 0.11808 0.52192
Inst 1 Class 2 0.44115 0.08651 0.47235
Inst 2 Class 1 0.3127 0.41949 0.26782
Inst 3 Class 0 0.53905 0.23162 0.22934
Inst 4 Class 1 0.3709 0.42352 0.20558
Inst 5 Class 1 0.02279 0.69282 0.2844
Inst 6 Class 0 0.48523 0.41443 0.10033
Inst 7 Class 2 0.11324 0.37974 0.50702
Inst 8 Class 0 0.70994 0.21248 0.07758
Inst 9 Class 2 0.36237 0.13268 0.50495
Inst 10 Class 0 0.59445 0.16679 0.23876
.
.
.
Inst 810 Class 0 0.53874 0.03339 0.42786
Inst 811 Class 2 0.3562 0.10449 0.53931
Inst 812 Class 2 0.3153 0.29604 0.38865
Inst 813 Class 0 0.37679 0.25703 0.36617
Loglikely: -12.947281732790431
Loglikely: -12.945401366274425
Loglikely: -12.938726747977523
Loglikely: -12.901977150615119
Loglikely: -12.75012451077629
Loglikely: -12.493136950893648
Loglikely: -12.365027354312812
Loglikely: -12.304789204789568
Loglikely: -12.244897621583464
Loglikely: -12.193376754473698
Loglikely: -12.167358555788184
Loglikely: -12.157754243677418
Loglikely: -12.15417345138127
Loglikely: -12.152423619369937
Loglikely: -12.151152790252866
Loglikely: -12.149841301667593
Loglikely: -12.148118219983392
Loglikely: -12.145561523116212
Loglikely: -12.141839465588516
Loglikely: -12.137677594019792
Loglikely: -12.134179767406906
Loglikely: -12.131434990910169
Loglikely: -12.129153188213081
Loglikely: -12.12710171492302
Loglikely: -12.125226260553502
Loglikely: -12.12349702882937
Loglikely: -12.121828686952819
Loglikely: -12.12013959946356
Loglikely: -12.118388891732248
Loglikely: -12.116555091413234
Loglikely: -12.1145537246838
Loglikely: -12.112084376404072
Loglikely: -12.10816598293722
Loglikely: -12.098749169963604
Loglikely: -12.055958366015664
Loglikely: -11.741061138423825
Loglikely: -8.27125123227167
Loglikely: -7.964040764111144
Loglikely: -7.962932367664369
Loglikely: -7.962702860111277
Loglikely: -7.962619911759608
Loglikely: -7.962588379101999
Loglikely: -7.962575791799965
Loglikely: -7.962570581434721
Loglikely: -7.962568372746465
Loglikely: -7.962567422597268
======================================
Clust: 0 att: 0
Normal Distribution. Mean = 3.7643 StandardDev = 1.0008 WeightSum = 279.9992
Clust: 0 att: 1
Normal Distribution. Mean = 3.9571 StandardDev = 0.8691 WeightSum = 279.9992
Clust: 0 att: 2
Normal Distribution. Mean = 3.925 StandardDev = 1.0376 WeightSum = 279.9992
Clust: 0 att: 3
Normal Distribution. Mean = 3.6429 StandardDev = 1.0252 WeightSum = 279.9992
Clust: 0 att: 4
Normal Distribution. Mean = 5 StandardDev = 0 WeightSum = 279.9992
Clust: 0 att: 5
Normal Distribution. Mean = 4.4679 StandardDev = 0.5785 WeightSum = 279.9992
Clust: 0 att: 6
Normal Distribution. Mean = 3.9036 StandardDev = 0.9494 WeightSum = 279.9992
Clust: 0 att: 7
Normal Distribution. Mean = 3.9429 StandardDev = 0.8347 WeightSum = 279.9992
Clust: 0 att: 8
Normal Distribution. Mean = 4.0679 StandardDev = 0.8695 WeightSum = 279.9992
Clust: 0 att: 9
Normal Distribution. Mean = 4.1571 StandardDev = 0.8086 WeightSum = 279.9992
Clust: 1 att: 0
Normal Distribution. Mean = 4.1496 StandardDev = 0.8437 WeightSum = 330.5551
Clust: 1 att: 1
Normal Distribution. Mean = 4.2428 StandardDev = 0.7293 WeightSum = 330.5551
Clust: 1 att: 2
Normal Distribution. Mean = 3.235 StandardDev = 1.0697 WeightSum = 330.5551
Clust: 1 att: 3
Normal Distribution. Mean = 3.4858 StandardDev = 1.0646 WeightSum = 330.5551
Clust: 1 att: 4
Normal Distribution. Mean = 3.5058 StandardDev = 0.688 WeightSum = 330.5551
Clust: 1 att: 5
Normal Distribution. Mean = 4.4793 StandardDev = 0.6006 WeightSum = 330.5551
Clust: 1 att: 6
Normal Distribution. Mean = 3.7508 StandardDev = 0.9563 WeightSum = 330.5551
Clust: 1 att: 7
Normal Distribution. Mean = 3.8212 StandardDev = 0.8231 WeightSum = 330.5551
Clust: 1 att: 8
Normal Distribution. Mean = 3.8357 StandardDev = 0.8844 WeightSum = 330.5551
Clust: 1 att: 9
Normal Distribution. Mean = 4.0428 StandardDev = 0.7403 WeightSum = 330.5551
Clust: 2 att: 0
Normal Distribution. Mean = 3.4965 StandardDev = 0.9632 WeightSum = 203.4457
Clust: 2 att: 1
Normal Distribution. Mean = 3.7824 StandardDev = 0.7585 WeightSum = 203.4457
Clust: 2 att: 2
Normal Distribution. Mean = 3.0703 StandardDev = 0.9877 WeightSum = 203.4457
Clust: 2 att: 3
Normal Distribution. Mean = 4.6241 StandardDev = 0.5041 WeightSum = 203.4457
Clust: 2 att: 4
Normal Distribution. Mean = 3.4317 StandardDev = 0.7422 WeightSum = 203.4457
Clust: 2 att: 5
Normal Distribution. Mean = 4.2732 StandardDev = 0.6017 WeightSum = 203.4457
Clust: 2 att: 6
Normal Distribution. Mean = 4.7589 StandardDev = 0.4323 WeightSum = 203.4457
Clust: 2 att: 7
Normal Distribution. Mean = 4.6887 StandardDev = 0.4754 WeightSum = 203.4457
Clust: 2 att: 8
Normal Distribution. Mean = 4.8076 StandardDev = 0.3957 WeightSum = 203.4457
Clust: 2 att: 9
Normal Distribution. Mean = 3.5176 StandardDev = 0.8439 WeightSum = 203.4457
Inst 0 Class 2 0 0.02973 0.97027
Inst 1 Class 2 0 0.01223 0.98777
Inst 2 Class 0 1 0 0
Inst 3 Class 1 0 1 0
Inst 4 Class 1 0 0.91439 0.08561
Inst 5 Class 0 1 0 0
Inst 6 Class 2 0 0.11805 0.88195
Inst 7 Class 2 0 0.14931 0.85069
Inst 8 Class 1 0 1 0
Inst 9 Class 1 0 0.99892 0.00108
Inst 10 Class 0 1 0 0
.
.
.
Inst 810 Class 2 0 0.00531 0.99469
Inst 811 Class 0 0.99991 0 0.00009
Inst 812 Class 1 0 0.99076 0.00924
Inst 813 Class 0 1 0 0
EM
==
Number of clusters: 3
Cluster: 0 Prior probability: 0.344
Attribute: q6a_1
Normal Distribution. Mean = 3.7643 StdDev = 1.0008
Attribute: q6a_2
Normal Distribution. Mean = 3.9571 StdDev = 0.8691
Attribute: q6a_3
Normal Distribution. Mean = 3.925 StdDev = 1.0376
Attribute: q6a_4
Normal Distribution. Mean = 3.6429 StdDev = 1.0252
Attribute: q6a_5
Normal Distribution. Mean = 5 StdDev = 0
Attribute: q6a_6
Normal Distribution. Mean = 4.4679 StdDev = 0.5785
Attribute: q6a_7
Normal Distribution. Mean = 3.9036 StdDev = 0.9494
Attribute: q6a_8
Normal Distribution. Mean = 3.9429 StdDev = 0.8347
Attribute: q6a_9
Normal Distribution. Mean = 4.0679 StdDev = 0.8695
Attribute: q6a_10
Normal Distribution. Mean = 4.1571 StdDev = 0.8086
Cluster: 1 Prior probability: 0.4062
Attribute: q6a_1
Normal Distribution. Mean = 4.1496 StdDev = 0.8437
Attribute: q6a_2
Normal Distribution. Mean = 4.2428 StdDev = 0.7293
Attribute: q6a_3
Normal Distribution. Mean = 3.235 StdDev = 1.0697
Attribute: q6a_4
Normal Distribution. Mean = 3.4858 StdDev = 1.0646
Attribute: q6a_5
Normal Distribution. Mean = 3.5058 StdDev = 0.688
Attribute: q6a_6
Normal Distribution. Mean = 4.4793 StdDev = 0.6006
Attribute: q6a_7
Normal Distribution. Mean = 3.7508 StdDev = 0.9563
Attribute: q6a_8
Normal Distribution. Mean = 3.8212 StdDev = 0.8231
Attribute: q6a_9
Normal Distribution. Mean = 3.8357 StdDev = 0.8844
Attribute: q6a_10
Normal Distribution. Mean = 4.0428 StdDev = 0.7403
Cluster: 2 Prior probability: 0.2498
Attribute: q6a_1
Normal Distribution. Mean = 3.4965 StdDev = 0.9632
Attribute: q6a_2
Normal Distribution. Mean = 3.7824 StdDev = 0.7585
Attribute: q6a_3
Normal Distribution. Mean = 3.0703 StdDev = 0.9877
Attribute: q6a_4
Normal Distribution. Mean = 4.6241 StdDev = 0.5041
Attribute: q6a_5
Normal Distribution. Mean = 3.4317 StdDev = 0.7422
Attribute: q6a_6
Normal Distribution. Mean = 4.2732 StdDev = 0.6017
Attribute: q6a_7
Normal Distribution. Mean = 4.7589 StdDev = 0.4323
Attribute: q6a_8
Normal Distribution. Mean = 4.6887 StdDev = 0.4754
Attribute: q6a_9
Normal Distribution. Mean = 4.8076 StdDev = 0.3957
Attribute: q6a_10
Normal Distribution. Mean = 3.5176 StdDev = 0.8439
=== Clustering stats for training data ===
Clustered Instances
0 280 ( 34%)
1 329 ( 40%)
2 205 ( 25%)
Log likelihood: -7.96257
Q2 SAS code.
libname asst4 'C:\My Documents\747';
* @data portion of credit-g.arff file extracted, and header line containing variable names added,
before reading data into SAS.;
PROC IMPORT DATAFILE = 'C:\My Documents\747\credit-g.csv'
OUT = asst4.credit DBMS = CSV REPLACE;
run;
ods html body="C:\My Documents\747\credit.html" style=minimal;
proc contents data=asst4.credit;
run;
proc freq data=asst4.credit;
tables checking_status duration credit_history purpose credit_amount savings_status employment
installment_commitment personal_status other_parties residence_since property_magnitude
age other_payment_plans housing existing_credits job num_dependents own_telephone
foreign_worker class;
run;
proc freq data=asst4.credit;
tables (checking_status duration credit_history purpose credit_amount savings_status employment
installment_commitment personal_status other_parties residence_since property_magnitude
age other_payment_plans housing existing_credits job num_dependents own_telephone
foreign_worker)*class / measures;
run;
* proc glmmod provides a quick way to create dummy variables.
This could also be done in a data step using if-then statements.;
proc glmmod data=asst4.credit outdesign=asst4.credit2;
* checking_status, savings_status, property_magnitude and job could arguably be regarded as ordinal,
and coded as appropriate numeric values. However here they are treated as nominal variables.;
class checking_status credit_history purpose savings_status employment
personal_status other_parties property_magnitude
other_payment_plans housing job own_telephone
foreign_worker class;
model duration = checking_status duration credit_history purpose credit_amount savings_status employment
installment_commitment personal_status other_parties residence_since property_magnitude
age other_payment_plans housing existing_credits job num_dependents own_telephone
foreign_worker class;
run;
proc contents data=asst4.credit2;
run;
proc candisc data=asst4.credit2;
class col64;
var col1-col62;
run;
ods html close;
Q2 R code.
credit ................
................