How to Use SPSS for Classification

This handout explains how to use SPSS for classification tasks. We will see three classification methods and how to interpret their results: binary logistic regression, multinomial logistic regression, and nearest neighbor.

Binary Logistic Regression

Input data characteristics

The method assumes that the dependent variable is dichotomous (Boolean). The independent variables (predictors) are either dichotomous or numeric. It is recommended to have at least 20 cases per predictor (independent variable).

Steps to run in SPSS

1. Select Analyze > Regression > Binary Logistic.
2. Select the dependent variable and the independent variables, and choose the method (for example Enter, which is the default).
3. Open the Options menu and select 95 as the CI for exp(B), which will provide confidence intervals for the odds ratios of the different predictors. Then click Continue. Click OK (the equivalent syntax is sketched below).
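These dialog choices can also be run from a syntax window. The sketch below is an assumption rather than pasted output: it uses the standard LOGISTIC REGRESSION command with this handout's variable names, declares the categorical predictors with Indicator contrasts as the dialog would, and requests 95% confidence intervals for Exp(B).

  * Sketch only: variable names assumed from the heart-disease dataset used here.
  LOGISTIC REGRESSION VARIABLES num
    /METHOD=ENTER age sex chest_pain trestbps chol fbs restecg thalach
        exang oldpeak slope ca thal
    /CONTRAST (sex)=Indicator
    /CONTRAST (chest_pain)=Indicator
    /CONTRAST (fbs)=Indicator
    /CONTRAST (restecg)=Indicator
    /CONTRAST (exang)=Indicator
    /CONTRAST (slope)=Indicator
    /CONTRAST (ca)=Indicator
    /CONTRAST (thal)=Indicator
    /PRINT=CI(95)
    /CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).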
Results interpretation

No participant has missing data. num is the dependent variable and is coded 0 or 1.

Block 0: Beginning Block

Classification Table (a, b)

  Observed                        Predicted num           Percentage Correct
                                  '<50'     '>50_1'
  Step 0   num   '<50'            165       0             100.0
                 '>50_1'          138       0             0.0
           Overall Percentage                             54.5

  a. Constant is included in the model.
  b. The cut value is .500

Without using the predictors, we could predict that no participant has heart disease with 54.5% accuracy (165 of 303 cases), which is not significantly different from 50-50 (i.e., no better than chance).

Variables not in the Equation

                               Score     df   Sig.
  Step 0   Variables
           age                 15.399    1    .000
           sex(1)              23.914    1    .000
           chest_pain          81.686    3    .000
           chest_pain(1)       80.680    1    .000
           chest_pain(2)       18.318    1    .000
           chest_pain(3)       30.399    1    .000
           trestbps            6.365     1    .012
           chol                2.202     1    .138
           fbs(1)              .238      1    .625
           restecg             10.023    2    .007
           restecg(1)          7.735     1    .005
           restecg(2)          9.314     1    .002
           thalach             53.893    1    .000
           exang(1)            57.799    1    .000
           oldpeak             56.206    1    .000
           slope               47.507    2    .000
           slope(1)            1.224     1    .269
           slope(2)            39.718    1    .000
           ca                  74.367    4    .000
           ca(1)               1.338     1    .247
           ca(2)               65.683    1    .000
           ca(3)               16.367    1    .000
           ca(4)               22.748    1    .000
           thal                85.304    3    .000
           thal(1)             .016      1    .899
           thal(2)             3.442     1    .064
           thal(3)             84.258    1    .000
  Overall Statistics           176.289   22   .000

Age is significantly related to num, and many other variables are separately significantly related to num. All variables with Sig. less than or equal to 0.05 are significant predictors of whether a person has heart disease. There are 11 such significant predictors here: age, sex, chest_pain, trestbps, restecg, thalach, exang, oldpeak, slope, ca, and thal.

Block 1: Method = Enter

The model is significant when all independent variables are entered (Sig. <= 0.05). 72.7% of the variance in num can be predicted from the combination of the independent variables, and 88.4% of the subjects were correctly classified by the model.

Variables in the Equation

                                                                     95% C.I. for EXP(B)
                   B        S.E.    Wald     df   Sig.   Exp(B)      Lower     Upper
  Step 1a
    age            -.028    .025    1.197    1    .274   .973        .925      1.022
    sex(1)         -1.862   .571    10.643   1    .001   .155        .051      .475
    chest_pain                      18.539   3    .000
    chest_pain(1)  2.417    .719    11.294   1    .001   11.213      2.738     45.916
    chest_pain(2)  1.552    .823    3.558    1    .059   4.723       .941      23.698
    chest_pain(3)  .414     .707    .343     1    .558   1.513       .378      6.050
    trestbps       .026     .012    4.799    1    .028   1.027       1.003     1.051
    chol           .004     .004    1.022    1    .312   1.004       .996      1.013
    fbs(1)         .446     .588    .574     1    .448   1.562       .493      4.944
    restecg                         1.443    2    .486
    restecg(1)     -.714    2.769   .067     1    .796   .490        .002      111.371
    restecg(2)     -1.175   2.770   .180     1    .672   .309        .001      70.447
    thalach        -.020    .012    2.860    1    .091   .980        .958      1.003
    exang(1)       -.779    .452    2.973    1    .085   .459        .189      1.112
    oldpeak        .397     .242    2.686    1    .101   1.488       .925      2.392
    slope                           8.861    2    .012
    slope(1)       .690     .948    .530     1    .467   1.994       .311      12.773
    slope(2)       1.465    .498    8.645    1    .003   4.328       1.630     11.492
    ca                              30.907   4    .000
    ca(1)          -3.515   1.926   3.331    1    .068   .030        .001      1.297
    ca(2)          -2.247   .938    5.744    1    .017   .106        .017      .664
    ca(3)          .095     .958    .010     1    .921   1.100       .168      7.196
    ca(4)          1.236    1.141   1.174    1    .279   3.442       .368      32.212
    thal                            13.010   3    .005
    thal(1)        .915     2.600   .124     1    .725   2.497       .015      408.255
    thal(2)        -1.722   .809    4.537    1    .033   .179        .037      .871
    thal(3)        -1.453   .445    10.665   1    .001   .234        .098      .559
    Constant       .956     4.131   .054     1    .817   2.601

  a. Variable(s) entered on step 1: age, sex, chest_pain, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal.

Many variables are significantly related to num when all 13 variables are taken together. All variables with Sig. less than or equal to 0.05 are significant predictors of whether a person has heart disease. There are 6 such significant predictors here: sex, chest_pain, trestbps, slope, ca, and thal. This suggests some correlation among the predictors, since age, restecg, thalach, exang, and oldpeak were significant predictors when used alone. fbs and chol are not predictors of the severity of heart disease. Recall that Exp(B) is the odds ratio e^B; for example, for sex(1), e^-1.862 ≈ 0.155.

Results write-up: Logistic regression was conducted to assess whether the thirteen predictor variables … significantly predicted whether or not a subject has heart disease. When all predictors are considered together, they significantly predict whether or not a subject has heart disease, chi-square = 238, df = 22, N = 303, p < .001. The classification accuracy is 88.4%.

Table 1 presents the odds ratios of the major predictors, which suggests that the odds of estimating correctly who has heart disease improve significantly if one knows sex, chest_pain, trestbps, slope, ca, and thal.

Table 1.

  Variable         Odds ratio   p
  sex              0.16         .001
  chest_pain(1)    11.21        .001
  trestbps         1.03         .028
  slope(2)         4.33         .003
  ca(1)            0.07         .030
  thal(1)          0.73         .015

Binary Logistic Regression with Independent Training and Test Sets

One can repeat the previous experiment, but after splitting the dataset into a training set (75% of the data) and a test set (25% of the data).

Steps to run in SPSS

1. Set the random seed to Random (or select a predefined value): Transform > Random Number Generators > Set Starting Point.
2. Create a split variable which will be set to 1 for 75% of the cases, randomly selected, and to 0 for the remaining 25%: Transform > Compute Variable, with split = uniform(1) <= 0.75.
3. Recall the Logistic Regression dialog from the Dialog Recall icon.
4. Repeat the regression as before, except that the split variable is chosen as the Selection Variable and the value 1 is selected for this variable under Rule.
5. Click OK to run the binary logistic regression (the equivalent syntax is sketched below).
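A syntax sketch of the split-and-rerun procedure (an assumption, not pasted output: the seed value is illustrative, and LOGISTIC REGRESSION's /SELECT subcommand fits the model on the selected cases while still classifying the unselected ones):

  * Sketch only: seed value and variable names are assumed.
  SET SEED=20250101.
  COMPUTE split = (UNIFORM(1) <= 0.75).  /* 1 = training case, 0 = holdout case. */
  EXECUTE.

  * CONTRAST subcommands for the categorical predictors as in the earlier sketch.
  LOGISTIC REGRESSION VARIABLES num
    /SELECT=split EQ 1
    /METHOD=ENTER age sex chest_pain trestbps chol fbs restecg thalach
        exang oldpeak slope ca thal
    /PRINT=CI(95).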
Results interpretation

Block 1: Method = Enter

The model is significant when all independent variables are entered (Sig. < 0.01), and 72.7% of the variance in num can be predicted from the combination of the independent variables. 85.0% of the test subjects were correctly classified by the model. The classification rate in the holdout sample is within 10% of the training sample (87.4% * 0.9 = 78.7%, and 85.0% exceeds this). This is sufficient evidence of the utility of the logistic regression model. The 6 significant predictors are the same: sex, chest_pain, trestbps, slope, ca, thal.

Results write-up: By splitting the dataset into a 75% training set and a 25% test set, the accuracy in the holdout sample changes to 85.0%, which is within 10% of the training sample. The significant predictors remain the same as in the model of the entire dataset. These results reinforce the utility of the logistic regression model.

Nearest Neighbor

Input data characteristics

The method assumes that the dependent variable is dichotomous (Boolean). The independent variables (predictors) are either dichotomous or numeric. It is recommended to have at least 20 cases per predictor (independent variable).

Steps to run in SPSS

1. Select Analyze > Classify > Nearest Neighbor.
2. Select the target variable and the features (or factors).
3. You may change other options, such as the number of neighbors (Neighbors tab), the distance metric, feature selection (Features tab), and the partitions (Partitions tab). We select here to split the data into a training set (75%) and a test set (25%). Another option on this tab is 10-fold cross-validation, which yields a more robust evaluation.
4. Click OK (the equivalent syntax is sketched below).
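A sketch of the corresponding KNN syntax (an assumption: the split of variables into factors and covariates, the Euclidean metric, and the fixed k are illustrative choices matching the steps above):

  * Sketch only: 75/25 partition with a fixed number of neighbors.
  KNN num (MLEVEL=N) BY sex chest_pain fbs restecg exang slope ca thal
      WITH age trestbps chol thalach oldpeak
    /PARTITION TRAINING=75 HOLDOUT=25
    /MODEL METRIC=EUCLID NEIGHBORS=FIXED(K=3)
    /VIEWMODEL DISPLAY=YES.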
Results interpretation

The classification accuracy is found in the model viewer: double-click the chart obtained and look at the classification table, which shows an accuracy of 87.2% for K = 5.

Classification Table

  Partition   Observed           Predicted                Percent Correct
                                 '<50'      '>50_1'
  Training    '<50'              100        16            86.2%
              '>50_1'            23         78            77.2%
              Overall Percent    56.7%      43.3%         82.0%
  Holdout     '<50'              43         6             87.8%
              '>50_1'            5          32            86.5%
              Missing            0          0
              Overall Percent    55.8%      44.2%         87.2%

Results write-up: Nearest neighbor classification with k = 3 (3NN) was conducted to assess whether the thirteen predictor variables correctly predicted whether or not a subject has heart disease. When all predictors are considered together, 87.2% of the subjects were correctly classified.
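The steps above mention 10-fold cross-validation as a more robust alternative to a single 75/25 split. A sketch under the same assumptions (here k is selected automatically, since KNN's cross-validation folds are used to choose the number of neighbors):

  * Sketch only: choose k by 10-fold cross-validation instead of a holdout split.
  KNN num (MLEVEL=N) BY sex chest_pain fbs restecg exang slope ca thal
      WITH age trestbps chol thalach oldpeak
    /MODEL NEIGHBORS=AUTO(KMIN=3, KMAX=7)
    /CROSSVALIDATION FOLDS=10
    /VIEWMODEL DISPLAY=YES.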