Analyses of Categorical Dependent Variables



Analyses Involving Categorical Dependent Variables

When Dependent Variables are Categorical

Examples:
Dependent variable is simply Failure vs. Success
Dependent variable is Lived vs. Died
Dependent variable is Passed vs. Failed
You ignored everything I've said about not categorizing and dichotomized a DV.

Chi-square analysis is frequently used.

Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?
Dependent variable is Death: No (0) vs. Yes (1).
Independent variable is Helmet: No (0) vs. Yes (1).

Crosstabs

[Crosstabs output table omitted.]

So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.

Comments on Chi-square analyses

What's good?
1. The analysis is appropriate. It hasn't been supplanted by something else.
2. The results are usually easy to communicate, especially to lay audiences.
3. A DV with a few more than 2 categories can be easily analyzed.
4. An IV with only a few more than 2 categories can be easily analyzed.

What's bad?
1. Incorporating more than one independent variable is awkward, requiring multiple tables.
2. Certain tests, such as tests of interactions, can't be performed easily when you have more than one IV.
3. Chi-square analyses can't be done when you have continuous IVs unless you categorize the continuous IVs, which goes against recommendations NOT to categorize continuous variables, because you lose power.
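As a concrete illustration of the kind of 2x2 analysis just described, here is a minimal R sketch of a chi-square test on a helmet-by-death table. The counts are hypothetical, not the actual ATV frequencies from the output above.

# Hypothetical 2x2 counts: rows = helmet (no, yes), columns = died (no, yes)
atv <- matrix(c(60, 15,
                70, 12),
              nrow = 2, byrow = TRUE,
              dimnames = list(helmet = c("no", "yes"),
                              died   = c("no", "yes")))
chisq.test(atv, correct = FALSE)   # Pearson chi-square, no Yates correction
prop.table(atv, margin = 1)        # proportion dying within each helmet group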
Alternatives to the Chi-square test

We'll focus on dichotomous (two-valued) DVs.

1. Linear Regression techniques

a. Multiple Linear Regression. Stick your head in the sand, pretend that your DV is continuous, and regress the (dichotomous) DV onto the collection of IVs.

b. Discriminant Analysis (equivalent to MR when the DV is dichotomous)

Problems with regression-based methods when the dependent variable is dichotomous and the independent variable is continuous:
1. The assumption is that the underlying relationship between Y and X is linear. But when Y has only two values, how can that be?
2. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.
3. Residuals will probably not be normally distributed.
4. The regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction, resulting in Y-hats that are impossible values.

2. Logistic Regression

3. Probit analysis

Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We'll focus on it.

The Logistic Regression Equation

Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.

Conceptualizing Y-hat. When Y is GPA, for example, actual GPAs and predicted GPAs are just like each other. They can even be identical. However, when Y is a dichotomy, it can take on only 2 values, 0 and 1, although predicted Ys can be any value. So how do we reconcile that discrepancy?

When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We'll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1. The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we're conceptualizing Y-hat as the probability that Y is 1.

The equation for simple logistic regression (analogous to Predicted Y = B0 + B1*X in linear regression) is

   Y-hat = P(Y=1) = e^(B0 + B1*X) / (1 + e^(B0 + B1*X)) = 1 / (1 + e^-(B0 + B1*X))

The logistic regression equation defines an S-shaped (ogive) curve that rises from 0 to 1 as X ranges from -∞ to +∞. P(Y=1) is never negative and never larger than 1.

The curve of the equation . . .

B0: B0 is analogous to the linear regression "constant", i.e., the intercept parameter. Although B0 defines the "height" of the curve at a given X, it should be noted that the most vertical part of the curve moves to the right as B0 decreases. For the graphs below, B1 = 1 and X ranged from -5 to +5.

[Figure: P(Y=1) vs. X for B0 = +1, 0, and -1.]

For equations for which B1 is the same, changing B0 only changes the location of the curve over the range of X-axis values. The "slope" of the curve remains the same.

B1: B1 is analogous to the slope of the linear regression line. B1 defines the "steepness" of the curve. It is sometimes called a discrimination parameter. The larger the value of B1, the "steeper" the curve, the more quickly it goes from 0 to 1. B0 = 0 for the graph.

[Figure: P(Y=1) vs. X for B1 = 1, 2, and 4.]

Note that there is a MAJOR difference between the linear regression curves we're familiar with and logistic regression curves: the logistic regression lines asymptote at 0 and 1. They're bounded by 0 and 1.
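A small R sketch (not part of the original handout) that draws the kinds of curves just described, so you can see how B0 and B1 change the location and steepness:

# Logistic curves for several B0 and B1 values, X from -5 to +5
x  <- seq(-5, 5, by = 0.1)
py <- function(b0, b1, x) 1 / (1 + exp(-(b0 + b1 * x)))
plot(x, py(0, 1, x), type = "l", ylim = c(0, 1), xlab = "X", ylab = "P(Y=1)")
lines(x, py( 1, 1, x), lty = 2)        # B0 = +1: curve shifts to the left
lines(x, py(-1, 1, x), lty = 3)        # B0 = -1: curve shifts to the right
lines(x, py( 0, 2, x), col = "gray")   # B1 = 2: steeper
lines(x, py( 0, 4, x), col = "gray40") # B1 = 4: steeper still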
But the linear regression lines extend below 0 on the left and above 1 on the right; the predicted Ys range from -∞ to +∞. If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.

Example

P(Y) = .0909.  Odds of Y = .0909/.9091 = .1.  Y is 1/10th as likely to occur as to not occur.
P(Y) = .50.  Odds of Y = .5/.5 = 1.  Y is as likely to occur as to not occur.
P(Y) = .80.  Odds of Y = .8/.2 = 4.  Y is 4 times as likely to occur as to not occur.
P(Y) = .99.  Odds of Y = .99/.01 = 99.  Y is 99 times as likely to occur as to not occur.

So logistic regression is logistic in probability but linear in log odds.

Why we must fit ogival-shaped curves: the curse of categorization

Here's a perfectly nice linear relationship between score values, from a recent study. This relationship is of ACT Comp scores to Wonderlic (WPT) scores. It shows that as intelligence gets higher, ACT scores get larger.

[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav

[Figure: scatterplot of ACT Comp vs. WPT.]

Here's the relationship when ACT Comp (vertical axis) has been dichotomized at 23, into Low vs. High.

When proportions of High scores are plotted vs. WPT value, we get the following.

[Figure: proportion of High ACT scores vs. WPT, showing an ogival pattern.]

So, to fit the above curve relating proportions of persons with High ACT scores to WPT within successive groups, we need a model that is ogival. This is where the logistic regression function comes into play. This means that even if the "underlying" true values are linearly related, proportions based on the dichotomized values within successive groups will not be linearly related to the independent variable.

Crosstabs and Logistic Regression applied to the same 2x2 situation

The FFROSH data

The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables: first semester GPA excluding the seminar course, and whether a student continued into the 2nd semester.

The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and is equal to 1 for students who retained to the immediately following spring semester and 0 for those who did not. Retention is good.

The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.

After examining the distribution of the times students registered prior to the first day of class, we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG, for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150-day value was chosen after inspection of the 1st semester GPA data.)

I know, I know. This is a violation of the "Do not categorize!" rule. There were technical reasons this time.
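For anyone doing the same thing in R rather than SPSS, the EARLIREG computation is a one-liner. This is only a sketch: the data frame name (ffroshnm) matches the import used later, but the days-before-class variable name (regdays) is a made-up placeholder, since the handout doesn't give it.

# Hypothetical sketch: 1 = registered 150+ days before the first day of class
ffroshnm$earlireg <- ifelse(ffroshnm$regdays >= 150, 1, 0)
table(ffroshnm$earlireg)   # check the split between early and late registrants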
So the analysis that follows examines the relation of RETAINED to EARLIREG, the relation of retention to the 2nd semester to early registration. The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.

First, univariate analyses . . .

GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.
fre var=retained earlireg.

retained
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00        552       11.6        11.6              11.6
        1.00      4201       88.4        88.4             100.0
        Total     4753      100.0       100.0

earlireg
               Frequency   Percent   Valid Percent   Cumulative Percent
Valid   .00       2316       48.7        48.7              48.7
        1.00      2437       51.3        51.3             100.0
        Total     4753      100.0       100.0

crosstabs tables = retained by earlireg /cells=cou col /sta=chisq.

Crosstabs

[Crosstab and chi-square tables omitted.]

So, 92.4% of those who registered early sustained, compared to 84.2% of those who registered late. The relation is significant. Students who registered early were more likely to sustain directly into the 2nd semester than those who registered late.

The same analysis using Logistic Regression

Analyze -> Regression -> Binary Logistic

logistic regression retained WITH earlireg.

Logistic Regression

The Dependent Variable Encoding display is a valuable check to make sure that your "1" is the same as the Logistic Regression procedure's "1". Do whatever you can to make Logistic's 1s be the same cases as your 1s. Trust me.

The Logistic Regression procedure applies the logistic regression model to the data. It estimates the parameters of the logistic regression equation. That equation is

   P(Y) = 1 / (1 + e^-(B0 + B1*X))

The LOGISTIC REGRESSION procedure performs the estimation in two stages, which SPSS labels "Blocks".

The first stage estimates only B0, so the model fit to the data in the first stage (Block 0) is simply

   P(Y) = 1 / (1 + e^-(B0))

The second stage estimates both B0 and B1, so the model fit to the data in the second stage is

   P(Y) = 1 / (1 + e^-(B0 + B1*X))

The first stage . . .

Block 0: Beginning Block (estimating only B0)

[Block 0 classification table omitted.]

Explanation of the above table: The program estimated B0 = 2.030. The resulting P(Y=1) = .8839. The program computes Y-hat = .8839 for each case using the logistic regression formula with the estimate of B0. If Y-hat is less than or equal to a predetermined cut value of 0.500, that case is recorded as a predicted 0. If Y-hat is greater than 0.5, the program records that case as a predicted 1. It then creates the above table of the numbers of actual 1's and 0's vs. predicted 1's and 0's. Predicted Ys are all 1 in this particular example. Sometimes this table is more useful than it was in this case; it's typically most useful when the equation includes continuous predictors. Note that all students are predicted to sustain in this example.

The Variables in the Equation table for Block 0

The "Variables in the Equation" box is the Logistic Regression equivalent of the "Coefficients" box in regular regression analysis. The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). Only B0 is in the equation; B1 is not yet in the equation.

The test statistic in the "Variables in the Equation" table is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision.
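A quick arithmetic check of that Wald value, sketched in R (the numbers are just the rounded B and SE from the Block 0 table):

b  <- 2.030
se <- 0.045
wald <- (b / se)^2                        # about 2035 from the rounded values;
                                          # SPSS's 2009.624 uses unrounded B and SE
pchisq(wald, df = 1, lower.tail = FALSE)  # the Wald test is a 1-df chi-square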
Exp(B) is the odds ratio, e^2.030. It is the ratio of the odds of Y=1 when the predictor equals 1 to the odds of Y=1 when the predictor equals 0. It's an indicator of the strength of the relationship to the predictor. It means nothing here, since there is no predictor in the equation.

The Variables not in the Equation table for Block 0

The "Variables not in the Equation" table gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be "significant" if it were added to the equation. In this case, it's telling us that EARLIREG would contribute significantly to the equation if it were added, which is what SPSS does next . . .

The second stage . . .

Block 1: Method = Enter (adding estimation of B1 to the equation)

Whew, three chi-square statistics. Note that the chi-square is identical to the Likelihood Ratio chi-square printed in the Chi-Square Tests box in the CROSSTABS output. For chi-square in this procedure, bigger is better. A significant chi-square means that your ability to predict the dependent variable is significantly different from chance.

"Step": Compares the current step to the previous step in a stepwise regression. In this case the previous step is the equation with just B0.
"Block": Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block, in which just B0 was estimated.
"Model": I believe this is analogous to the ANOVA F in REGRESSION, testing whether the model with all predictors fits better than a model with just B0, an independence model.

All of these tell us that adding estimation of B1 to the equation resulted in a significant improvement in fit.

The Model Summary table is analogous to the Model Summary table in REGRESSION. The value under "-2 Log likelihood" is a measure of how well the model fit the data in an absolute sense; values closer to 0 represent better fit. But goodness of fit is complicated by sample size. The R Square values are measures analogous to "percent of variance accounted for". All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.

The classification table for Block 1 is the version of the table based on Y-hats computed from B0 and B1. Note that since X is a dichotomous variable here, there are only two Y-hat values. They are

   P(Y) = 1 / (1 + e^-(B0 + B1*0)) = .842 (see below)
and
   P(Y) = 1 / (1 + e^-(B0 + B1*1)) = .924 (see below)

In both cases the Y-hat was greater than .5, so predicted Y in the table was 1 for all cases.

The prediction equation is Y-hat = P(Y=1) = 1 / (1 + e^-(1.670 + .830*EARLIREG)). Since EARLIREG has only two values, those students who registered early will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842. That difference is statistically significant.

Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0. Recall that the odds of a 1 are P(Y=1)/(1-P(Y=1)). So

   Odds ratio = (odds when X=1) / (odds when X=0) = [.924/(1-.924)] / [.842/(1-.842)] = 12.158 / 5.329 = 2.29.

So a person who registered early had odds of retaining that were 2.29 times the odds of a person who registered late. An odds ratio of 1 means that the DV is not related to the predictor.
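The two predicted probabilities and the odds ratio can be reproduced directly from the fitted coefficients. A minimal R sketch, using the B0 = 1.670 and B1 = .830 reported above:

b0 <- 1.670; b1 <- 0.830
p_late  <- 1 / (1 + exp(-(b0)))        # EARLIREG = 0: about .842
p_early <- 1 / (1 + exp(-(b0 + b1)))   # EARLIREG = 1: about .924
OR <- (p_early / (1 - p_early)) / (p_late / (1 - p_late))
c(p_late = p_late, p_early = p_early, odds_ratio = OR, exp_B1 = exp(b1))
# The odds ratio equals exp(B1), about 2.29.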
Graphical representation of what we've just found

The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of Y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur). The curve is analogous to the straight-line plot in a regular regression analysis.

[Figure: yhat vs. earlireg, with the fitted logistic curve drawn through the two points.]

The two points are the predicted RETAINED values for the two possible values of EARLIREG.

The bottom line here is that the LOGISTIC REGRESSION results are the same as the CROSSTABS results: those who registered early are significantly more likely to sustain than those who register late.

Crosstabs in Rcmdr (Start here on 10/2/18)

R -> Rcmdr, then import the ffrosh for P5100 data.

Crosstabs in Rcmdr requires that the variables to be crossed be factors. First, convert the variables to factors:

Data -> Manage variables in active data set -> Convert numeric variables to factors

Note that I created a new column in the Rcmdr data editor, so that I can use earlireg in procedures that analyze regular variables and earliregfact in procedures that require factors.

By the way: Rcmdr's import automatically converts any variable whose values have labels into factors. You can remove this tendency by
1) removing value labels from the variable in the SPSS file prior to importing, or
2) unchecking the "Convert value labels to factor levels" box in the Import SPSS Data Set dialog.

Statistics -> Contingency Tables -> Two-way table . . .

Note that this procedure works only for factors.

The output:

Frequency table:
             earliregfact
retainedfact     0     1
           0   367   185
           1  1949  2252

Pearson's Chi-squared test
data:  .Table
X-squared = 78.832, df = 1, p-value < 2.2e-16

The chi-square value is the same as the Pearson chi-square in SPSS, p. 9.

Logistic Regression Analysis in Rcmdr

Statistics -> Fit Models -> Generalized Linear Models . . .

Note that the procedure is Generalized Linear Models, not Linear model and not Linear regression.

library(foreign, pos=14)
> ffroshnm <- read.spss("G:/MDBR/FFROSH/Ffroshnm for P595.sav",
+   use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
> colnames(ffroshnm) <- tolower(colnames(ffroshnm))
> library(abind, pos=15)
> GLM.1 <- glm(retained ~ earlireg, family=binomial(logit), data=ffroshnm)
> summary(GLM.1)

Call:
glm(formula = retained ~ earlireg, family = binomial(logit), data = ffroshnm)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.2708  0.3974  0.3974  0.5874  0.5874

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.66971    0.05690  29.343   <2e-16 ***
earlireg     0.82951    0.09533   8.702   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3414.1 on 4752 degrees of freedom
Residual deviance: 3334.2 on 4751 degrees of freedom
AIC: 3338.2

Number of Fisher Scoring iterations: 5

The z value is the square root of the Wald value printed by SPSS. The p-values are identical.

> exp(coef(GLM.1))  # Exponentiated coefficients ("odds ratios")
(Intercept)    earlireg
   5.310627    2.292191

Discussion

1. When there is only one dichotomous predictor, CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information. BUT, as mentioned above . . .

2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.
3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X's, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.

4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IVs.

5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.

6. So CROSSTABS should be considered for only the very simplest situations involving one categorical predictor. LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.

Logistic Regression Example 1: One Continuous Predictor

The data analyzed here represent the relationship of Pancreatitis diagnosis to measures of Amylase and Lipase. Both Amylase and Lipase levels are tests (blood tests, I believe) that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis.

[Screenshot omitted.]

The objective here is to determine 1) which alone is the better predictor of the condition, and 2) whether both are needed. Note that this analysis could not be done using chi-square.

Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts. This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to Amylase only.
Note that since Amylase is a continuous independent variable, chi-square analysis would not be appropriate.

The name of the dependent variable is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis and 0 otherwise. This forces us to use a technique appropriate for dichotomous dependent variables.

Distributions of logamy and loglip: still somewhat positively skewed even though logarithms were taken.

The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won't significantly increase the fit of the model. We'll test that hypothesis later.

Example 1 - Part 1. Scatterplots of individual cases (not terribly informative)

Relationship of Pancreatitis diagnosis to log(Amylase). This graph is of individual cases. Y values are 0 or 1. X values are continuous.

[Figure: scatterplot of PANCGRP (0/1) vs. log amylase, with the linear line of best fit.]

This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy: it is difficult to see the relationship that may very well be represented by the data. One can see from this graph, however, that when log amylase is low, there are more 0's (no Pancreatitis), and when log amylase is high there are more 1's (presence of Pancreatitis).

The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted. So the line is what we would predict based on linear regression.

But the logistic regression analysis assumes that the relationship of probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive, shown below.

Below are the same data, this time with the line of best fit generated by the logistic regression analysis shown on the scatterplot. While neither line fits the observed individual case points well in the middle, it's easy to see that the logistic line fits better at small and at large values of log amylase.

Example 1 - Part 2. Grouping cases to show a relationship when the DV is a dichotomy

This is not a required part of logistic regression analysis, but a way of presenting the data to help understand what's going on.

The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person's IV value (log amylase value). The problem was that the plot didn't really show the relationship, because the DV could take on only two values, 0 and 1. When the DV is a dichotomy, it will be profitable to form groups of cases with similar X values and plot the proportion of 1's within each group vs. the X value for that group.

To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group midpoints. Each case was assigned to a group based on how close that case's log amylase value was to the group midpoint. So, for example, all cases with log amylase between 1.5 and 1.7 were assigned to the 1.6 group.

SPSS syntax: compute logamygp = rnd(logamy,.2).

Then the proportion of 1's within each group was computed. (When the data are 0s and 1s, the mean of all the scores is equal to the proportion of 1s.)
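The same grouping step can be sketched in R. This assumes the amylase/lipase data have been imported as a data frame (here called amylip) with columns logamy and pancgrp; the names are assumptions, not part of the handout's syntax.

# Round each log amylase value to the nearest .2 (the R analogue of SPSS rnd(logamy,.2)),
# then compute the proportion of 1's within each group.
amylip$logamygp <- round(amylip$logamy / 0.2) * 0.2
probpanc <- aggregate(pancgrp ~ logamygp, data = amylip, FUN = mean)
plot(probpanc$logamygp, probpanc$pancgrp,
     xlab = "Log amylase group midpoint",
     ylab = "Proportion diagnosed with Pancreatitis")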
The figure below is a plot of the proportion of 1's within each group vs. the group midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown on the previous page.

[Figures: the original graph of 0s and 1s vs. LOGAMY, and the graph of the proportion of 1's in each group vs. the group midpoints. Note the ogival relationship.]

The analyses that follow illustrate the application of both linear and logistic regression to the data.

Example 1 - Part 3. Linear Regression analysis of the logamy data, just for old time's sake

Excerpt from the data file [omitted].

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT pancgrp
  /METHOD=ENTER logamy
  /SCATTERPLOT=(*ZRESID ,*ZPRED )
  /RESIDUALS HIST(ZRESID) NORM(ZRESID) .

Regression

The linear relationship of pancdiag to logamy is strong: R-squared = .570. But as we'll see, the logistic relationship is even stronger.

Thus, the predicted linear relationship of probability of Pancreatitis to log amylase is

   Predicted probability of Pancreatitis = -1.043 + 0.635 * logamy.

The following are the usual linear regression diagnostics. Nothing particularly unusual here, or here. The histogram of residuals is not particularly unusual. Although there is a clear bend from the expected linear line in the normal probability plot, this is not particularly diagnostic. The plot of residuals vs. predicted values, however, is an indicator that there is something amiss: it is supposed to form a classic zero-correlation scatterplot with no unusual shape, and this one clearly does not.

Computation of y-hats for the groups

I had SPSS compute the Y-hat for each of the group midpoints discussed on page 3. I then plotted both the observed group proportion of 1's that was shown on the previous page and the Y-hat for each group. Of course, the Y-hats are in a linear relationship with log amylase. Note that the solid points don't really represent the relationship shown by the open symbols. Note also that the solid points extend above 1 and below 0, while the observed proportions are bounded by 0 and 1.

compute mrgpyhat = -1.043 + .635*logamyvalue.
execute.
GRAPH /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc mrgpyhat (PAIR) /MISSING=LISTWISE .

Graph

[Figure: observed proportion of Pancreatitis diagnoses within groups and predicted proportion within groups. Note that the predictions extend below 0 and above 1.]

Example 1 - Part 4. SPSS Logistic Regression analysis of the logamy data

Remember we're here to determine if there is a significant relationship of Pancreatitis diagnosis to log amylase.

logistic regression pancgrp with logamy.

Logistic Regression

SPSS's Logistic Regression procedure always performs the analysis in at least two steps, which it calls Blocks. Recall the logistic prediction formula is

   P(Y) = 1 / (1 + e^-(B0 + B1*X))

In the first block, labeled Block 0, only B0 is entered into the equation. In this B0-only equation, the probability of a 1 is a constant, equal to the overall proportion of 1's for the whole sample. Obviously this model doesn't make sense when your main interest is in whether or not the probability increases as X increases. But SPSS forces us to consider (or delete) the results of a B0-only model. This model does serve as a useful baseline against which to assess subsequent models, all of which do assume that the probability of a 1 increases as the IV increases.

For each block the Logistic Regression procedure automatically prints a 2x2 table of predicted and observed 1's and 0's. For all of these tables, a case is classified as a predicted 1 if its Y-hat (predicted probability) exceeds 0.5. Otherwise it's classified as a predicted 0.
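The 0.5 classification rule is easy to mimic in R if you want to check SPSS's classification tables. A hedged sketch, again assuming a data frame amylip with pancgrp and logamy (the fitted-model name is arbitrary):

fit  <- glm(pancgrp ~ logamy, family = binomial(logit), data = amylip)
pred <- ifelse(fitted(fit) > 0.5, 1, 0)     # predicted 1 if Y-hat exceeds the .5 cut value
table(observed = fit$y, predicted = pred)   # 2x2 table of observed vs. predicted 0's and 1's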
Since only the constant is estimated here, the predicted probability for every case is 1/(1 + exp(-(-1.466))) = .1875. It happens that this is simply the proportion of 1's in the sample, which is 48/256 = 0.1875. Since that's less than 0.5, every case is predicted to be a 0 for this constant-only model.

A case is classified as a predicted 1 if the Y-hat for that case is larger than .5. A case is classified as a predicted 0 if the Y-hat for that case is less than or equal to .5.

Block 0: Beginning Block

[Block 0 classification table omitted.]

The test that is printed by SPSS is the Wald test. The p-value of .000 says that the value of B0 is significantly different from 0. The predicted probability of a 1 here is

   P(1) = 1 / (1 + e^-(-1.466)) = 1 / (1 + 4.332) = 1 / 5.332 = 0.1875, the observed proportion of 1's.

The "Variables not in the Equation" box says that if log amylase were added to the equation, it would be significant.

Block 1: Method = Enter

In this block, log amylase is added to the equation.

Step: The procedure can perform stepwise regression from a set of covariates. The Step chi-square tests the significance of the increase in fit of the current set of covariates vs. those in the previous set.
Block: The significance of the increase in fit of the current model vs. the last block.
Model: Compares the current model to a model in which all Bs are zero. We'll focus on this.

The linear regression R-squared was .570. I don't believe there is a universally agreed-upon equivalent to the linear R-squared for logistic regression.

In the following classification table, for each case the predicted probability of a 1 is evaluated and compared with 0.5. If that probability is > 0.5, the case is a predicted 1; otherwise it's a predicted 0.

Specificity: Percentage of all cases without the disease who were predicted to not have it. (Percentage of correct predictions for those people who don't have the disease.)
Sensitivity (power): Percentage of all cases with the disease who were predicted to have it. (Percentage of correct predictions for those people who do have the disease.)

The "Variables in the Equation" table is analogous to the "Coefficients" box in REGRESSION. These are the coefficients for the equation

   y-hat = 1 / (1 + e^-(-16.0203 + 6.8978*LOGAMY))

Example 1 - Part 5. Rcmdr Logistic Regression analysis of the logamy data

R -> Rcmdr; Data -> Import data -> from SPSS Data set -> amylip.sav
Statistics -> Fit Models -> Generalized linear model . . .

> logamy2 <- read.spss("G:/MDBT/InClassDatasets/amylip.sav",
+   use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)
> colnames(logamy2) <- tolower(colnames(logamy2))
> GLM.6 <- glm(pancgrp ~ logamy, family=binomial(logit), data=logamy2)
> summary(GLM.6)

Call:
glm(formula = pancgrp ~ logamy, family = binomial(logit), data = logamy2)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.98232 -0.27148 -0.14026 -0.07655  2.61284

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -16.020      2.227  -7.193 6.32e-13 ***
logamy         6.898      1.017   6.780 1.20e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 247.080 on 255 degrees of freedom
Residual deviance:  95.436 on 254 degrees of freedom
  (50 observations deleted due to missingness)
AIC: 99.436

Number of Fisher Scoring iterations: 7

> exp(coef(GLM.6))  # Exponentiated coefficients ("odds ratios")
 (Intercept)       logamy
1.102702e-07 9.901343e+02     (= 990.1343)

Note that the container with the results can be displayed in pieces. This is what Rcmdr gives by default. Other output can likely be obtained by searching for R commands related to logistic regression.

Example 1 - Part 6. Post-analysis processing: computing predicted proportions for the groups defined on page 3

To show that the relationship assumed by the logistic regression analysis is a better representation of the relationship than the linear one, I computed the probability of a 1 for each of the group midpoints from page 3. The figure below is a plot of those probabilities and the observed proportion of 1's vs. the group midpoints. Compare this figure with that on page 6 to see how much better the logistic regression relationship fits the data than does the linear.

compute lrgpyhat = 1/(1+exp(-(-16.0203 + 6.8978*logamygp))).
GRAPH /SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc lrgpyhat (PAIR) /MISSING=LISTWISE .

Graph

[Figure: observed proportions and logistic predicted proportions vs. logamy group midpoints. Only one observed point is considerably deviant from its predicted value; could it be that there were coding errors for this group? Most of the predicted proportions coincide precisely with the observed proportions.]

Compare this graph with the one immediately below from the linear regression analysis. Note that the predicted proportions correspond much more closely to the observed proportions here. Note the diverging predictions for all groups with proportions = 0 or 1. The linear regression and the logistic regression analyses yield roughly the same predictions for "interior" points, but they diverge for "extreme" points, points with extremely small or extremely large values of X.

Example 1 - Part 7. Discussion: using residuals to distinguish between logistic and linear regression

I computed residuals for all cases. Recall that a residual is Y minus Y-hat. For these data, Y's were either 1 or 0; Y-hats are probabilities. First, I computed Y-hats for all cases, using both the linear equation and the logistic equation:

compute mryhat = -1.043 + .635*logamy.
compute lryhat = 1/(1+exp(-(-16.0203 + 6.8978*logamy))).

Now the residuals are

compute mrresid = pancdiag - mryhat.
compute lrresid = pancdiag - lryhat.

frequencies variables = mrresid lrresid /histogram /format=notable.

Frequencies

[Histogram: residuals from the linear multiple regression.] This is the distribution of residuals for the linear multiple regression. It's like the plot on page 3, except these are actual residuals, not Z's of residuals. Note that there are many large residuals, large negative and large positive. These residuals are simply distances of the observed points from the best-fitting line, in this case a straight line.

[Histogram: residuals from the logistic regression.] This is the distribution of residuals for the logistic regression. Note that most of them are virtually 0. These residuals are simply distances of the observed points from the best-fitting line, in this case a logistic line. The points which are circled in the scatterplot are those with near-0 residuals.

What these two sets of figures show is that the vast majority of residuals from the logistic regression analysis were virtually 0, while for the linear regression there were many residuals that were substantially different from 0. So the logistic regression analysis has modeled the Y's better than the linear regression.
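The residual comparison in Part 7 can also be sketched in R, under the same assumed data frame and column names as before:

linfit <- lm(pancgrp ~ logamy, data = amylip)                             # linear regression
logfit <- glm(pancgrp ~ logamy, family = binomial(logit), data = amylip)  # logistic regression
mrresid <- residuals(linfit)                      # Y minus the linear Y-hat
lrresid <- residuals(logfit, type = "response")   # Y minus the logistic Y-hat (a probability)
par(mfrow = c(1, 2))
hist(mrresid, main = "Linear regression residuals", xlab = "Residual")
hist(lrresid, main = "Logistic regression residuals", xlab = "Residual")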
Logistic Regression Example 2: One Categorical IV with 3 Categories

The data here are the FFROSH data, freshmen from 1987-1992. The dependent variable is RETAINED, whether a student went directly to the 2nd semester. The independent variable is NRACE, the ethnic group recorded for the student. It has three values:

1: White
2: African American
3: Asian-American

Recall that ALL independent variables are called covariates in LOGISTIC REGRESSION. We know that categorical independent variables with 3 or more categories must be represented by group coding variables. LOGISTIC REGRESSION allows us to do that internally.
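If you want to see exactly what group coding variables a 3-category factor turns into, R will show you. A small sketch (the level names follow the labels used in the Rcmdr output later in this example):

nrace <- factor(c("WHITE", "BLACK", "ORIENTAL"),
                levels = c("WHITE", "BLACK", "ORIENTAL"))
contrasts(nrace)        # two 0/1 indicator columns; WHITE, the first level, is the reference
model.matrix(~ nrace)   # the columns that actually enter the model
# relevel(nrace, ref = "BLACK") would change the reference category if desired.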
"C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET INCLUDEPICTURE "C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET INCLUDEPICTURE "C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET INCLUDEPICTURE "C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET INCLUDEPICTURE "C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET INCLUDEPICTURE "C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET INCLUDEPICTURE "C:\\Users\\Michael\\AppData\\Local\\Temp\\SNAGHTML27bd75.PNG" \* MERGEFORMATINET Indicator coding is dummy coding. Here, Category 1 (White) is used as the reference category.Example 2 – 1. Analysis using SPSS LOGISTIC REGRESSION.LOGISTIC REGRESSION retained /METHOD = ENTER nrace/CONTRAST (nrace)=Indicator(1) /CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .This is the syntax generated by the above menus.The red’d syntax shows how to define a categorical variable.Logistic RegressionSPSS’s coding of the independent variable here is important. Note that Whites are the 0,0 group. The first group coding variable compares Blacks with Whites. The 2nd compares Asian-Americans with Whites.Recall that this is what is cone in dummy coding.Be sure the internal and original coding of the dependent variable are identical.Block 0: Beginning BlockSPSS first prints p-value information for the collection of group coding variables representing the categorical factor. Then it prints p-value information for each GCV separately. But the values are not the same as those obtained when NRACE is actually added to the equation. So we’ll ignore this table.Block 1: Method = EnterOWNote that for a categorical variable, SPSS first prints “overall” information on the variable. Then it prints information for each specific group coding variable.So the bottom line is that0) There are significant differences in likelihood of retention to the 2nd semester between the groups (p=.041).Specifically . . .1) Blacks are not significantly more likely to sustain than Whites, although the difference approaches significance. (p=.098).2) Asian-Americans are significantly more likely to sustain than Whites (p=.050). The odds of an Asian-American sustaining are 2.7 times the odds of a White sustaining. Example 2 – 2. Analysis using Rcmdr.R Rcmdr; Data Import data from SPSS file ffroshnm for P5100.savNote that NRACE is imported to Rcmdr as a factor. That’s great. We don’t have to create GCVs.> ffroshnm <- read.spss("G:/MDBR/FFROSH/Ffroshnm for P5100.sav", + use.value.labels=TRUE, max.value.labels=Inf, to.data.frame=TRUE)> colnames(ffroshnm) <- tolower(colnames(ffroshnm))> GLM.7 <- glm(retained ~ nrace, family=binomial(logit), data=ffroshnm)> summary(GLM.7)Call:glm(formula = retained ~ nrace, family = binomial(logit), data = ffroshnm)Deviance Residuals: Min 1Q Median 3Q Max -2.4676 0.4528 0.5065 0.5065 0.5065 Coefficients: Estimate Std. Error z value Pr(>|z|) Note: Wald = z2..(Intercept) 1.98873 0.04867 40.864 <2e-16 ***nrace[T.BLACK] 0.23722 0.14329 1.656 0.0978 . nrace[T.ORIENTAL] 1.00700 0.51461 1.957 0.0504 . Note: SPSS’s value was .050.---Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
Logistic Regression Example 3: Three Continuous Predictors - FFROSH Data

The data used for this are data on freshmen from 1987-1992. Skipped in Fall 2018.

The dependent variable is RETAINED, whether a student went directly into the 2nd semester or not. The predictors (covariates in logistic regression) are HSGPA, ACT Composite, and overall attempted hours in the first semester, excluding the freshman seminar course.

GET FILE='E:\MdbR\FFROSH\ffrosh.sav'.
logistic regression retained with hsgpa actcomp oatthrs1.

Logistic Regression

Block 0: Beginning Block

Recall that the p-values in the "Variables not in the Equation" table are those that would be obtained if a variable were put BY ITSELF into the equation.

Block 1: Method = Enter

Note that while ACTCOMP would have been significant by itself, without controlling for HSGPA and OATTHRS1, it is not significant when controlling for those two variables.

So, the bottom line is that

1) Among persons equal on ACTCOMP and OATTHRS1, those with larger HSGPAs were more likely to go directly into the 2nd semester.
2) Among persons equal on HSGPA and OATTHRS1, there was no significant relationship of likelihood of sustaining to ACTCOMP: those with higher ACTCOMP were not significantly more likely to sustain than those with lower ACTCOMP. Note that there are other variables that could be controlled for, and this relationship might "become" significant when those variables are controlled. (But it didn't.)
3) Among persons equal on HSGPA and ACTCOMP, those who took more hours in the first semester were more likely to go directly to the 2nd semester. What does this mean???? These were more likely to be full-time students??

Logistic Regression Example 4: The FFROSH Full Analysis

From the report to the faculty: output from SPSS for the Macintosh, Version 6. The output window was copied and pasted into a Word document, where the comments were added.
---------------------- Variables in the Equation -----------------------
Variable        B      S.E.      Wald  df   Sig       R   Exp(B)
AGE        -.0950    .0532    3.1935   1  .0739  -.0180    .9094
NSEX        .2714    .0988    7.5486   1  .0060   .0388   1.3118
   After adjusting for differences associated with the other variables, males were more likely to enroll in the second semester.
NRACE1     -.4738    .1578    9.0088   1  .0027  -.0436    .6227
   After adjusting for differences associated with the other variables, Whites were less likely to enroll in the second semester.
NRACE2      .1168    .1773     .4342   1  .5099   .0000   1.1239
HSGPA       .8802    .1222   51.8438   1  .0000   .1162   2.4114
   After adjusting for differences associated with the other variables, those with higher high school GPA's were more likely to enroll in the second semester.
ACTCOMP    -.0239    .0161    2.1929   1  .1387  -.0072    .9764
OATTHRS1    .1588    .0124  164.4041   1  .0000   .2098   1.1721
   After adjusting for differences associated with the other variables, those with higher attempted hours were more likely to enroll in the second semester.
EARLIREG    .2917    .1011    8.3266   1  .0039   .0414   1.3387
   After adjusting for differences associated with the other variables, those who registered six months or more before the first day of school were more likely to enroll in the second semester.
NADMSTAT   -.2431    .1226    3.9330   1  .0473  -.0229    .7842
POSTSEM    -.1092    .0675    2.6206   1  .1055  -.0130    .8965
PREYEAR2   -.0461    .0853     .2924   1  .5887   .0000    .9549
PREYEAR3    .1918    .0915    4.3952   1  .0360   .0255   1.2114
   After adjusting for differences associated with the other variables, those who enrolled in 1991 were more likely to enroll in the second semester than others enrolled before 1990. What???
POSYEAR2   -.0845    .0977     .7467   1  .3875   .0000    .9190
POSYEAR3   -.1397    .0998    1.9585   1  .1617   .0000    .8696
HAVEF101    .4828    .1543    9.7876   1  .0018   .0459   1.6206
   After adjusting for differences associated with the other variables, those who took the freshman seminar were more likely to enroll in the second semester than those who did not.
Constant   -.1075   1.1949     .0081   1  .9283

The analysis below is from SPSS V15. There are slight differences in the numbers, not due to changes in the program but due to slight differences in the data. I believe some cases were dropped between when the V6 and V15 analyses were performed. NRACE was coded differently in the V15 analysis. The similarity is a tribute to the statisticians who developed logistic regression and the programmers.

The full FFROSH analysis in Version 15 of SPSS

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat postsem
  y1988 y1989 y1991 y1992 havef101 /categorical nrace admstat.

Logistic Regression

[DataSet1] G:\MdbR\FFROSH\ffrosh.sav

Block 0: Beginning Block (output skipped here)

Block 1: Method = Enter

The absence of a relationship to ACTCOMP is very interesting. It could be the foundation for a theory of retention.

Logistic Regression Example 5: A 2 x 3 Factorial

From Pedhazur, p. 762, Problem 4. Messages attributed to either an Economist, a Labor Leader, or a Politician are given to participants. The message is about the effects of NAFTA (North American Free Trade Agreement). The participants rated each message as Biased or Unbiased. Half the participants were told that the source was male; the other half were told the source was female. The data are

                        Economist   Labor Leader   Politician
Male Source
   Rated Biased              7            13            19
   Rated Unbiased           18            12             6
Female Source
   Rated Biased              5            17            20
   Rated Unbiased           20             8             5

The data were entered into SPSS as follows . . .

The above are summary data, not individual data. For example, line 1: 1 1 1 7, represents 7 individual cases.
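In R there is no separate Weight Cases step; summary counts like these can either be expanded into individual rows or passed to glm() as frequency weights. A hedged sketch: the data frame name (nafta) and its columns (gender, source, judgment, freq) are assumptions matching the layout above.

# (1) Expand the 12 summary rows into the 150 individual cases they represent
nafta_long <- nafta[rep(seq_len(nrow(nafta)), nafta$freq),
                    c("gender", "source", "judgment")]
# (2) Or keep the summary rows and weight each one by its frequency
fit <- glm(judgment ~ factor(source) * factor(gender),
           family = binomial(logit), data = nafta, weights = freq)
summary(fit)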
If the data had been entered as individual data, the first lines of the data editor would have been

gender  source  judgment
  1       1        1
  1       1        1
  1       1        1      <- 7 cases like this, one per respondent
  1       1        1
  ...

To get SPSS to "expand" the summary data into all of the individual cases it's summarizing, I did the following:

Data -> Weight Cases . . .

All analyses after the Weight Cases dialog will involve the expanded data of 150 cases. If you're interested, the syntax that will do all of the above is . . .

DATASET ACTIVATE DataSet5.
Data list free / gender source judgment freq.
Begin data.
1 1 1 7
1 1 0 18
1 2 1 13
1 2 0 12
1 3 1 19
1 3 0 6
2 1 1 5
2 1 0 20
2 2 1 17
2 2 0 8
2 3 1 20
2 3 0 5
end data.
weight by freq.
value labels gender 1 "Male" 2 "Female"
  /source 1 "Economist" 2 "Labor Leader" 3 "Politician"
  /judgment 1 "Biased" 0 "Unbiased".

This syntax reads frequency counts and uses them to "create" individual respondent data for the other variables.

The logistic regression dialogs . . .

Analyze -> Regression -> Binary Logistic . . .

The syntax to invoke the Logistic Regression command is

logistic regression judgment
  /categorical = gender source
  /enter source gender
  /enter source by gender
  /save = pred(predict).

Note that we have to tell the LOGISTIC REGRESSION procedure to analyze the interaction between the two factors (the source by gender term, added in the dialog with the interaction button).

The Categorical Variables Codings table tells us about the group coding variables for source and gender: source is dummy coded, with Politician as the reference group. Source(1) compares Economists with Politicians. Source(2) compares Labor Leaders with Politicians. Gender is dummy coded with Female as the reference group. Gender(1) compares Males vs. Females. Thanks, Logistic.

Logistic Regression output

Note that the N is incorrect. We told SPSS to "expand" the summary data into 150 individual cases, but this part of the Logistic Regression output does not acknowledge that expansion except in the footnote.

Case Processing Summary
Unweighted Cases(a)                         N     Percent
Selected Cases   Included in Analysis      12      100.0
                 Missing Cases              0         .0
                 Total                     12      100.0
Unselected Cases                            0         .0
Total                                      12      100.0
a. If weight is in effect, see classification table for the total number of cases.

As always, make sure the internal code is identical to the original code.

Dependent Variable Encoding
Original Value     Internal Value
.00  Unbiased            0
1.00 Biased              1

Categorical Variables Codings(a)
                             Frequency   Parameter coding
                                           (1)      (2)
source   1.00 Economist          4        1.000     .000
         2.00 Labor Leader       4         .000    1.000
         3.00 Politician         4         .000     .000
gender   1.00 Male               6        1.000
         2.00 Female             6         .000
a. This coding results in indicator coefficients.

Note that if you had not learned about group coding variables, the information in the Categorical Variables Codings table would make no sense at all. This is one example of SPSS output whose understanding depends on your understanding group coding schemes.

Block 0: Beginning Block (I generally ignore the Block 0 output. Not much of interest here except to logistic regression aficionados.)

Classification Table(a,b)
                                    Predicted judgment          Percentage Correct
Observed                          .00 Unbiased   1.00 Biased
Step 0   judgment  .00 Unbiased        0             69               .0
                   1.00 Biased         0             81            100.0
         Overall Percentage                                          54.0
a. Constant is included in the model.
b. The cut value is .500
Variables in the Equation
                    B     S.E.    Wald   df   Sig.   Exp(B)
Step 0  Constant  .160    .164    .958    1   .328    1.174

Variables not in the Equation
                            Score   df   Sig.
Step 0  Variables  source   30.435   2   .000
                 source(1)  27.174   1   .000
                 source(2)   1.087   1   .297
                 gender(1)    .242   1   .623
        Overall Statistics  30.676   3   .000

Block 1: Method = Enter. This is the 1st of two blocks: one for main effects, one for the interaction.

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1  Step        32.186     3   .000
        Block       32.186     3   .000
        Model       32.186     3   .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1         174.797(a)             .193                   .258
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)
                                    Predicted judgment          Percentage Correct
Observed                          .00 Unbiased   1.00 Biased
Step 1   judgment  .00 Unbiased       38             31             55.1
                   1.00 Biased        12             69             85.2
         Overall Percentage                                          71.3
a. The cut value is .500

Variables in the Equation
                       B     S.E.     Wald   df   Sig.   Exp(B)
Step 1(a)  source                   26.944    2   .000
           source(1)  -2.424  .476  25.880    1   .000    .089
           source(2)   -.862  .448   3.709    1   .054    .422
           gender(1)   -.202  .368    .303    1   .582    .817
           Constant    1.370  .393  12.143    1   .000   3.934
a. Variable(s) entered on step 1: source, gender.

Source: The overall differences in probability of rating the passage as biased across the 3 source groups.
Source(1): The probability of rating the passage as biased was lower when respondents were told that the message was from an Economist than when it was attributed to a Politician.
Source(2): No officially significant difference in the probability of rating the passage as biased when it was attributed to Labor Leaders vs. when it was attributed to Politicians.
Gender(1): No difference in the probability of rating the passage as biased between males and females.

Block 2: Method = Enter

This block adds the interaction of Source x Gender. No change in results.

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1  Step         1.594     2   .451
        Block        1.594     2   .451
        Model       33.780     5   .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1         173.203(a)             .202                   .269
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a)
                                    Predicted judgment          Percentage Correct
Observed                          .00 Unbiased   1.00 Biased
Step 1   judgment  .00 Unbiased       38             31             55.1
                   1.00 Biased        12             69             85.2
         Overall Percentage                                          71.3
a. The cut value is .500

Variables in the Equation
                                   B     S.E.     Wald   df   Sig.   Exp(B)
Step 1(a)  source                               17.214    2   .000
           source(1)             -2.773  .707   15.374    1   .000    .063
           source(2)              -.633  .659     .922    1   .337    .531
           gender(1)              -.234  .685     .116    1   .733    .792
           source * gender                       1.573    2   .455
           source(1) by gender(1)  .675  .958     .497    1   .481   1.965
           source(2) by gender(1) -.440  .902     .238    1   .626    .644
           Constant               1.386  .500    7.687    1   .006   4.000
a. Variable(s) entered on step 1: source * gender.

Since the interaction was not significant, we don't have to interpret the results of this block. The main conclusion is that respondents rated passages from politicians as more biased than passages from economists.

Logistic Regression Example 6: Amylase vs. Lipase

LOGISTIC REGRESSION VAR=pancgrp
  /METHOD=ENTER logamy loglip
  /CLASSPLOT
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

(Equivalently: logistic regression pancgrp with logamy loglip.)

[DataSet3] G:\MdbT\InClassDatasets\amylip.sav

Logistic Regression

Block 0: Beginning Block

The following assumes a model with only the constant, B0, in the equation. Each p-value tells you whether or not the variable would be significant if entered BY ITSELF.
Logistic Regression Example 6: Amylase vs Lipase

LOGISTIC REGRESSION VAR=pancgrp
 /METHOD=ENTER logamy loglip
 /CLASSPLOT
 /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

[DataSet3] G:\MdbT\InClassDatasets\amylip.sav

Logistic Regression

Block 0: Beginning Block

The Block 0 output assumes a model with only the constant, B0, in the equation. Each p-value in the Variables not in the Equation table tells you whether the variable would be significant if it were entered BY ITSELF. That is, each of those p-values should be interpreted on the assumption that only one of the variables would be entered.

Block 1: Method = Enter

From the classification table: Specificity = 204/208; Sensitivity = 38/48.

Specificity is the ability to identify cases who do NOT have the disease. Among those without the disease, .981 were correctly identified.
Sensitivity is the ability to identify cases who DO have the disease. Among those with the disease, .792 were correctly identified.

Note that LOGAMY does not officially increase predictability over that afforded by LOGLIP.

Interpretation of the coefficients . . .

Bs: Not easily interpretable on a raw probability scale. On a log-odds scale: the expected increase in the log odds for a one-unit increase in the IV. If the p-value is <= .05, we can say that inclusion of the predictor resulted in a significant change in the probability that Y = 1 – an increase if Bi > 0, a decrease if Bi < 0. We just cannot give a simple quantitative statement of the amount of change in the probability that Y = 1.

SEs: Standard error of the estimate of Bi.

Wald: Test statistic.

Sig: p-value associated with the test statistic. Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP.

Exp(B): Odds ratio for a one-unit increase in the IV among persons equal on the other IV. A person one unit higher on the IV will have Exp(B) times greater odds of having Pancreatitis. So a person one unit higher on LOGLIP will have 20.04 times greater odds of having Pancreatitis. The Exp(B) column is most useful for dichotomous predictors – 0 = absent, 1 = present.
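Here is a small Python sketch (mine, not SPSS output) of what "Exp(B) = 20.04" means in practice. The raw coefficient is recovered as the log of the odds ratio, and the implied change in probability depends on the probability you start from:

import math

# Exp(B) is the multiplicative change in the ODDS of pancreatitis for a
# one-unit increase in LOGLIP, holding the other predictor constant.
exp_b = 20.04
b = math.log(exp_b)          # the raw coefficient implied by Exp(B), about 3.0

# Start from several predicted probabilities p; a one-unit increase in LOGLIP
# multiplies the odds p/(1-p) by Exp(B), and the new probability follows.
for p in (0.05, 0.25, 0.50):
    odds = p / (1 - p)
    new_odds = odds * exp_b
    new_p = new_odds / (1 + new_odds)
    print(f"p = {p:.2f}  ->  odds x {exp_b}  ->  new p = {new_p:.2f}")

# The probability changes by different amounts (.05 -> .51, .25 -> .87, .50 -> .95),
# which is why B has no single interpretation on the raw probability scale.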
Classification Plots – a frequency distribution of all cases across Y-hat values, with different symbols representing each case's actual classification.

[SPSS CLASSPLOT output, Step 1: "Observed Groups and Predicted Probabilities." A text histogram of each case's Y-hat (Predicted Probability of membership in the Pancreatitis group, 0 to 1 on the horizontal axis) with frequency on the vertical axis. Symbols: N = No Pancreatitis, P = Pancreatitis; each symbol represents 5 cases; the cut value is .50. The N symbols pile up near Y-hat = 0 and the P symbols sit near Y-hat = 1.]

One aspect of the above plot is misleading: because each symbol represents a group of cases, only those cases that happened to be close enough to other cases to form a group of 5 are represented. So, for example, the relatively few cases whose y-hats were close to .5 are not seen in the plot, because there were not enough of them to make 5 cases.

Classification Plots using dot plots

Here's the same information obtained as dot plots of the Y-hats, with PANCGRP as a Row Panel Variable (one panel of No-Pancreatitis cases, one panel of Pancreatitis cases). For the most part, the patients who did not get Pancreatitis had small predicted probabilities while the patients who did get it had high predicted probabilities, as you would expect. There were, however, a few patients who did get Pancreatitis but had small values of Y-hat. Those patients are dragging down the sensitivity of the test. Note that these patients don't show up on the classification plot produced by the LOGISTIC REGRESSION procedure.

Classification Plots using Histograms in EXPLORE

Here's another, equivalent representation of what the authors of the program were trying to show.

Visualizing the equation with two predictors – skipped to ROC plots in 2016.

(Mike – use this as an opportunity to whine about SPSS's horrible 3-D graphing capability.)

With one predictor, a simple scatterplot of Y-hats vs. X will show the relationship between Y and X implied by the model. For two-predictor models, a 3-D scatterplot is required. Here's how the graph below was produced:

Graphs -> Interactive -> Scatterplot . . .

The graph shows the general ogival relationship of Y-hat (on the vertical axis) to LOGLIP and LOGAMY. But the relationships really aren't apparent until the graph is rotated. Don't ask me to demonstrate rotation: SPSS no longer offers the ability to rotate the graph interactively. It used to offer such a capability, but it's been removed. Shame on SPSS. The same graph is also shown with linear regression Y-hats plotted vs. LOGLIP and LOGAMY.

Representing Relationships with a Table – the PowerPoint slides

LOGAMY and LOGLIP groups were created by rounding the values of LOGAMY and LOGLIP to the nearest .5:

compute logamygp2 = rnd(logamy,.5).   <- Rounds logamy to the nearest .5.
compute loglipgp2 = rnd(loglip,.5).   <- Rounds loglip to the nearest .5.
means pancgrp yhatamylip by logamygp2 by loglipgp2.

The MEANS command produces the top of a very long two-way table of mean Y-hat values for each combination of LOGAMY group and LOGLIP group. Below, that output is "prettified" into a two-way table in Word. The entry in each cell is the expected probability (Y-hat) of contracting Pancreatitis at the combination of LOGAMY and LOGLIP represented by the cell. (The LOGLIP groups run from .5 to 4.5 in steps of .5; within each row the entries run from lower to higher LOGLIP; a cell is empty when that combination of values rarely or never occurred.)

LOGAMY group    Mean Y-hat across LOGLIP groups
   3.5          .99  1.00  1.00  1.00
   3.0          .97   .98   .99  1.00
   2.5          .03   .09   .30   .73   .92   .97  1.00    (LOGLIP 1.5 through 4.5)
   2.0          .01   .04   .14   .47   .85
   1.5          .00   .00   .00   .01   .05

This table shows the joint relationship of predicted Y to LOGAMY and LOGLIP: move from the lower left of the table to the upper right. It also shows the partial relationship of each predictor.

Partial relationship of Y-hat to LOGLIP – move across any row. So, for example, if your log amylase were 2.5, your chance of having Pancreatitis would be only .03 if your log lipase were 1.5. But at the same 2.5 value of log amylase, your chance would be .97 if your log lipase value were 4.0.

Partial relationship of Y-hat to LOGAMY – move up any column.

Empty cells show that certain combinations of LOGAMY and LOGLIP are very unlikely.
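The same round-and-average table can be built outside SPSS. Here is a minimal pandas sketch (mine; it assumes a data frame named df with columns logamy, loglip, and yhat holding the saved predicted probabilities – those names are placeholders, not variables from the SPSS file):

import pandas as pd

def make_table(df):
    # Round each predictor to the nearest .5, as the COMPUTE statements above do.
    grouped = df.assign(
        logamygp2=(df["logamy"] * 2).round() / 2,
        loglipgp2=(df["loglip"] * 2).round() / 2,
    )
    # Two-way table of mean predicted probability. Combinations that never occur
    # in the data show up as empty (NaN) cells, as in the prettified Word table.
    return grouped.pivot_table(index="logamygp2", columns="loglipgp2",
                               values="yhat", aggfunc="mean").round(2)

# print(make_table(df).sort_index(ascending=False))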
ROC (Receiver Operating Characteristic) Curve analysis – skipped in 2017

Situation: A dichotomous state (e.g., illness, termination, death, success) is to be predicted. You have a continuous predictor, such as in the LOGAMY example above, and the relationship between the dichotomous dependent variable and the continuous predictor is significant. The continuous predictor can, in fact, be the y-hats from a combination of multiple predictors; it'll be called y-hat from now on.

Predicted 1s and Predicted 0s can be defined from the y-hats, as is done in the Logistic Regression Classification Table. These predicted 1s and predicted 0s are created by comparing the y-hats with a cutoff or criterion value. That value is 0.500 by default in SPSS's LOGISTIC REGRESSION procedure.

Predicted 1: Every case whose y-hat is above the cutoff.
Predicted 0: Every case whose y-hat is at or below the cutoff.

Once predicted 1s and predicted 0s have been defined, a classification table such as that created by Logistic Regression can be built. In fact – and this is where the "curve" in ROC curve analysis comes in – multiple classification tables can be created, one for each possible cutoff value.

Some issues:
1) Which cutoff value should be used?
2) How can we measure overall predictability from this process?
3) How does predictability relate to the cutoff that is employed?
4) What gain in understanding is achieved by doing this?

ROC Curve

The Receiver Operating Characteristic curve provides an approach to understanding the relationship between a dichotomous dependent variable and a continuous predictor.

The ROC curve is a plot of Sensitivity vs. 1 - Specificity.

Sensitivity: Proportion of actual 1s predicted correctly.
  = (No. of cases that were actual 1s and also predicted to be 1) / (No. of cases that were actual 1s)
Specificity: Proportion of actual 0s predicted correctly.
  = (No. of cases that were actual 0s and also predicted to be 0) / (No. of cases that were actual 0s)
1 - Specificity: Proportion of actual 0s incorrectly predicted to be 1s.

In ROC terminology, Sensitivity is called the Hit rate or true positive rate; 1 - Specificity is called the False Alarm rate or false positive rate. So the ROC curve is a plot of the proportion of successful identifications vs. the proportion of false identifications of some phenomenon.
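As a concrete illustration of these definitions (mine, using the counts that appear in the amylase/lipase classification table below – 204 of 208 actual 0s and 38 of 48 actual 1s correctly classified at the .500 cutoff):

# Sketch: sensitivity, specificity, hit rate, and false alarm rate from one
# classification table (counts taken from the amylase/lipase example below).
actual0_pred0 = 204   # actual No Pancreatitis, predicted No
actual0_pred1 = 4     # actual No Pancreatitis, predicted Yes
actual1_pred0 = 10    # actual Pancreatitis, predicted No
actual1_pred1 = 38    # actual Pancreatitis, predicted Yes

specificity = actual0_pred0 / (actual0_pred0 + actual0_pred1)   # 204/208 = .981
sensitivity = actual1_pred1 / (actual1_pred0 + actual1_pred1)   # 38/48  = .792

hit_rate = sensitivity               # .792
false_alarm_rate = 1 - specificity   # .019

print(f"Sensitivity (hit rate)       = {sensitivity:.3f}")
print(f"Specificity                  = {specificity:.3f}")
print(f"False alarm rate (1 - spec.) = {false_alarm_rate:.3f}")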
Example: Creating an ROC curve from the Logamy, LogLip data

Note that I requested that the y-hats be saved in the Data Editor.

GET FILE='G:\MDBT\InClassDatasets\amylip.sav'.
DATASET NAME DataSet1 WINDOW=FRONT.
LOGISTIC REGRESSION VARIABLES pancgrp
 /METHOD=ENTER logamy loglip
 /SAVE=PRED
 /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

Logistic Regression
[DataSet1] G:\MDBT\InClassDatasets\amylip.sav

Case Processing Summary
Unweighted Cases(a)                        N      Percent
Selected Cases    Included in Analysis    256       83.7
                  Missing Cases            50       16.3
                  Total                   306      100.0
Unselected Cases                            0         .0
Total                                     306      100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value           Internal Value
.00 No Pancreatitis      0
1.00 Pancreatitis        1

Block 1: Method = Enter

Classification Table(a)
                                            Predicted
                                            pancgrp Pancreatitis Diagnosis (DV)      Percentage
Observed                                    .00 No Pancreatitis  1.00 Pancreatitis   Correct
Step 1  pancgrp   .00 No Pancreatitis              204                   4            98.1  (Specificity)
                  1.00 Pancreatitis                 10                  38            79.2  (Sensitivity)
        Overall Percentage                                                            94.5
a. The cut value is .500

Creating the coordinates of one ROC curve point. For this cutoff value of 0.500 . . .

Specificity = proportion of the 208 actual 0s that were predicted to be 0s = 204/208 = .981.
Sensitivity = proportion of the 48 actual 1s that were predicted to be 1s = 38/48 = .792.
False alarm rate = 1 - Specificity = 1 - .981 = .019.
Hit rate = Sensitivity = .792.

Note that these specificity and sensitivity values are for only one cutoff, .500. To create an ROC curve, I would have to do the above for many cutoff values – 0.001, 0.002, 0.003, . . ., 0.498, 0.499, 0.500, 0.501, 0.502, . . ., 0.999 – and then plot the Hit rates vs. the False Alarm rates, like a scatterplot.

Luckily, the ROC procedure in SPSS will do all of that for me. All I have to do is save, for each case, its Y-hat value and whether it was an actual 1 or an actual 0. In this example, the y-hats were saved and renamed LogRegYhatAmyLip.

SPSS ROC Curve procedure

Analyze -> ROC Curve . . .

The name of a continuous predictor is put in the Test Variable field. The name of the variable representing Actual 1 vs. Actual 0 is put in the State Variable field. I recommend that this variable have the value 1 representing the state that is being predicted – sickness, for example. My recommendations for the appearance of the curve are shown in the dialog.

The syntax that SPSS pasted in response to my interaction with the pull-down menu dialog box:

ROC LogRegYhatAmyLip BY pancgrp (1)
 /PLOT=CURVE(REFERENCE)
 /PRINT=COORDINATES
 /CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95)
 /MISSING=EXCLUDE.

ROC Curve
[DataSet1] G:\MDBT\InClassDatasets\amylip.sav

Case Processing Summary
pancgrp Pancreatitis Diagnosis (DV)    Valid N (listwise)
Positive(a)                                  48
Negative                                    208
Missing                                      50
Larger values of the test result variable(s) indicate stronger evidence for a positive actual state.
a. The positive actual state is 1.00 Pancreatitis.

[ROC curve plot: Sensitivity (Hit rate) on the vertical axis vs. 1 - Specificity (False Alarm rate) on the horizontal axis, with a diagonal reference line. The point computed above, (.019, .792), falls on the jagged blue curve.]

I've plotted the point whose coordinates I created above. The ROC procedure created many such points internally in order to draw the blue jagged curve: it employs as many cutoff values as possible (not just the single .500 used in Logistic Regression), computes Sensitivity and 1 - Specificity for each cutoff, and plots Sensitivity vs. 1 - Specificity for each one. That plot is the blue curve.

Area Under the Curve
Test Result Variable(s): Yhat Predicted probability
Area    .980

You can think of the ROC curve as a "running classification table," with Sensitivity / 1 - Specificity values for all possible cutoff values. Each point on the curve tells you what combination of Hit rate and False Alarm rate a particular cutoff will give. The diagonal green line tells you what to expect if your decision process is no better than chance.

Added bonus: the area "under" the blue curve gives you a measure of the overall predictability afforded by the continuous variable – the Y-hats, in this case.
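What the ROC procedure does internally can be sketched in a few lines of Python. This is my illustration of the idea, not SPSS's algorithm: sweep the cutoff across the saved y-hats, collect a (False Alarm rate, Hit rate) point at each cutoff, and approximate the area under the resulting curve with the trapezoidal rule.

import numpy as np

def roc_points(yhat, actual):
    """(False alarm rate, Hit rate) at every distinct cutoff value."""
    yhat = np.asarray(yhat, dtype=float)
    actual = np.asarray(actual, dtype=int)
    n1 = (actual == 1).sum()
    n0 = (actual == 0).sum()
    points = [(1.0, 1.0)]                  # a cutoff below every y-hat: everyone predicted 1
    for cut in np.unique(yhat):
        pred1 = yhat > cut                 # Predicted 1: y-hat above the cutoff
        hit = (pred1 & (actual == 1)).sum() / n1
        fa = (pred1 & (actual == 0)).sum() / n0
        points.append((fa, hit))
    points.append((0.0, 0.0))              # a cutoff above every y-hat: everyone predicted 0
    return sorted(points)

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area

# Tiny made-up y-hats, just to show the mechanics; with real data these would be
# the saved LogRegYhatAmyLip values and pancgrp (SPSS reported an area of .980 there).
pts = roc_points(yhat=[.05, .10, .40, .80, .95], actual=[0, 0, 1, 0, 1])
print(pts)
print(auc(pts))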
Reading the ROC curve . . .

Pick a desired Sensitivity (Hit rate) value, and use the curve to tell you what False Alarm rate you should expect.

Example . . . If your decision process gives you a .792 Hit rate (our example value from the cutoff of 0.500), you'll have to live with a .019 False Alarm rate. If your decision process gives you a .95 Hit rate (see the horizontal line on the curve), you'll have to live with a .13 False Alarm rate (the vertical line on the curve).

The use of ROC curves emphasizes that for many decision processes, increasing sensitivity (hit rate) is accompanied by an increase in false alarms. Realizing this forces decision makers to consider both the good (hits) and the bad (false alarms) when making decisions.

ROC Curve of a perfect predictor

The blue line is the ROC curve of a perfect predictor: it reaches a Hit rate of 1.000 with a False Alarm rate of 0.000.

ROC Curve of a useless predictor

False Alarm rate = Hit rate at every cutoff. Whatever the prediction, the case is as likely to be an actual 0 as an actual 1.

Example 2 – the FFROSH data: predicting sustaining

Notice that here the y-hats are obtained by combining multiple predictors. The basic analyses, from the previous lecture (Block 0 output omitted) . . .

logistic regression retained with age nsex nrace hsgpa actcomp oatthrs1 earlireg admstat postsem y1988 y1989 y1991 y1992 havef101
 /categorical nrace admstat
 /criteria = cut(.5)
 /save=pred.

Note – y-hats saved.

Logistic Regression
[DataSet4] G:\MDBR\FFROSH\Ffroshnm.sav

Case Processing Summary
Unweighted Cases(a)                        N       Percent
Selected Cases    Included in Analysis    4697       98.8
                  Missing Cases             56        1.2
                  Total                   4753      100.0
Unselected Cases                             0         .0
Total                                     4753      100.0
a. If weight is in effect, see classification table for the total number of cases.

Block 1: Method = Enter

Classification Table(a)
                               Predicted
                               retained               Percentage
Observed                       .00        1.00        Correct
Step 1  retained   .00          16         529          2.9   (Specificity)
                   1.00          7        4145         99.8   (Sensitivity)
        Overall Percentage                             88.6
a. The cut value is .500

You might say, "Wow! Sensitivity is very high!" But notice that Specificity is very low, which means that the False Alarm rate will also be pretty high.

Variables in the Equation
                         B        S.E.     Wald      df    Sig.    Exp(B)
Step 1(a)  age          -.113     .053     4.467      1    .035     .893
           nsex          .268     .102     6.891      1    .009    1.307
           nrace                          16.513      2    .000
           nrace(1)    -1.033     .526     3.854      1    .050     .356
           nrace(2)     -.473     .543      .758      1    .384     .623
           hsgpa         .969     .128    57.301      1    .000    2.636
           actcomp      -.009     .017      .283      1    .595     .991
           oatthrs1      .105     .015    46.284      1    .000    1.111
           earlireg      .351     .105    11.144      1    .001    1.421
           admstat(1)    .229     .127     3.255      1    .071    1.257
           postsem      -.140     .071     3.897      1    .048     .870
           y1988        -.081     .088      .833      1    .362     .922
           y1989         .203     .097     4.426      1    .035    1.225
           y1991        -.062     .100      .385      1    .535     .940
           y1992        -.142     .102     1.925      1    .165     .868
           havef101      .851     .159    28.769      1    .000    2.341
           Constant      .364    1.257      .084      1    .772    1.438
a. Variable(s) entered on step 1: age, nsex, nrace, hsgpa, actcomp, oatthrs1, earlireg, admstat, postsem, y1988, y1989, y1991, y1992, havef101.
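The classification table above used the default cutoff of .500. Here is a small Python sketch (mine; yhat and actual stand for the saved predicted probabilities and the observed 0/1 outcome, whatever they are named in your file) of how the table can be recomputed at any other cutoff, which is what the next set of results does:

import numpy as np

def classification_table(yhat, actual, cut):
    """2 x 2 classification table (counts) at a given cutoff, as in the SPSS table."""
    yhat = np.asarray(yhat, dtype=float)
    actual = np.asarray(actual, dtype=int)
    pred1 = yhat > cut
    return {
        "actual0_pred0": int(((actual == 0) & ~pred1).sum()),
        "actual0_pred1": int(((actual == 0) & pred1).sum()),
        "actual1_pred0": int(((actual == 1) & ~pred1).sum()),
        "actual1_pred1": int(((actual == 1) & pred1).sum()),
    }

# Each cutoff gives a different mix of sensitivity and false alarms,
# but the y-hats themselves (and hence the ROC curve and its area) do not change:
# for cut in (0.2, 0.5, 0.8):
#     print(cut, classification_table(yhat, actual, cut))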
For this example, I reran the analysis several times, each time specifying a different cutoff value. The (False Alarm rate, Hit rate) points from the classification tables for the different cutoff values were (.996, 1.000), (.994, 1.000), (.971, .998), (.804, .965), and (.281, .598).

Note that the relative locations of the points on the ROC curve are the inverse of the values of the cutoff: points with low cutoff values are at the top of the curve, and points with high cutoff values are at the bottom of the curve.

Comparing Predictors using ROC analysis

In which analysis was prediction best – predicting pancreatitis diagnosis or predicting sustaining?

Predicting pancreatitis:
Area Under the Curve
Test Result Variable(s): Yhat Predicted probability
Area    .980

Predicting sustaining: (Area Under the Curve output not reproduced here; the area is smaller.)

Using Area Under the Curve as our measure, we would conclude that we're doing a better job of predicting Pancreatitis than we are of predicting who will sustain.

Advantages of ROC curve analysis

1. It forces you to realize that with many prediction systems, an increase in sensitivity is invariably accompanied by a concomitant increase in false alarms. For a given selection system, points differ only in the lower-left to upper-right direction: moving to the upper right increases sensitivity but also increases false alarms; moving to the lower left decreases false alarms but also decreases sensitivity.

2. It shows that to make a prediction system better (I-O types, read: selection system), you must increase sensitivity while at the same time decreasing false alarms – moving points toward the upper left of the ROC space ("better selection").

3. It enables you to graphically disentangle issues of bias (the value of the cutoff) from issues of predictability (the area under the curve).