Analyses of Categorical Dependent Variables



Analyses Involving Categorical Dependent Variables

When Dependent Variables are Categorical

Chi-square analysis is frequently used.

Example Question: Is there a difference in likelihood of death in an ATV accident between persons wearing helmets and those without helmets?

Dependent variable is Death: No (0) vs. Yes (1).

Crosstabs

[DataSet0]

[pic]

So, based on this analysis, there is no significant difference in likelihood of dying between ATV accident victims wearing helmets and those without helmets.

Comments on Chi-square analyses

What’s good?

1. The analysis is appropriate. It hasn’t been supplanted by something else.

2. The results are usually easy to communicate, especially to lay audiences.

3. A DV with more than two categories (as long as there are only a few) can be analyzed easily.

4. An IV with more than two categories (again, only a few) can be analyzed easily.

What’s bad?

1. Incorporating more than one independent variable is awkward, requiring multiple tables.

2. Certain tests, such as tests of interactions, can’t be performed when you have more than one IV.

3. Chi-square analyses can't be done with continuous IVs unless you categorize them, which goes against the recommendation NOT to categorize continuous variables, because categorizing loses power.

Alternatives to the Chi-square test. We’ll focus on Dichotomous (two-valued) DVs.

1. Techniques based on linear regression

a. Multiple Linear Regression. Regress the dichotomous DV onto the mix of IVs.

b. Discriminant Analysis (equivalent to MR when DV is dichotomous)

Problems with regression-based methods, when the dependent variable is dichotomous and the independent variable is continuous.

1. Assumption is that underlying relationship between Y and X is linear.

But when Y has only two values, how can that be?

2. Y-hats when Y is continuous are typically realizable values of Y. But when Y has only two values, most of the Y-hats will be values that are not either of those two values. In that case, what are they?

3. Linear techniques assume that variability about the regression line is homogeneous across possible values of X. But when Y has only two values, residual variability will vary as X varies, a violation of the homogeneity assumption.

4. Residuals will probably not be normally distributed.

5. Regression line will extend beyond the more negative of the two Y values in the negative direction and beyond the more positive value in the positive direction.

2. Logistic Regression

3. Probit analysis

Logistic Regression and Probit analysis are very similar. Almost everyone uses Logistic. We’ll focus on it.

The Logistic Regression Equation

Without restricting the interpretation, assume that the dependent variable, Y, takes on two values, 0 or 1.

When you have a two-valued DV it is convenient to think of Y-hat as the likelihood or probability that one of the values will occur. We’ll use that conceptualization in what follows and view Y-hat as the probability that Y will equal 1.

The equation will be presented as an equation for the probability that Y = 1, written simply as P(Y=1). So we’re conceptualizing Y-hat as the probability that Y is 1.

The equation for simple Logistic Regression (analogous to Predicted Y = B0 + B1*X in linear regression)

P(Y=1) = 1 / (1 + e^-(B0 + B1*X)) = e^(B0 + B1*X) / (e^(B0 + B1*X) + 1)

The logistic regression equation defines an S-shaped (ogival) curve that rises from 0 to 1. P(Y=1) is never negative and never larger than 1.
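To see the shape numerically, here is a minimal Python sketch (mine, not part of the original handout); the coefficients B0 = 0 and B1 = 1 are arbitrary values chosen only to trace the curve.

import math

def logistic(x, b0, b1):
    # P(Y=1) from the simple logistic regression equation
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = 0.0, 1.0    # arbitrary illustrative values, not estimates from any data here
for x in (-6, -3, 0, 3, 6):
    print(x, round(logistic(x, b0, b1), 3))
# prints roughly .002, .047, .5, .953, .998 - the curve climbs from near 0 to near 1
# and never leaves the 0-1 range. Increasing b0 shifts the curve; increasing b1 steepens it.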

The curve of the equation . . .

B0: B0 is analogous to the linear regression "constant", i.e., the intercept parameter. B0 defines the "height" of the curve; it is an elevation parameter, also called a difficulty parameter in some applications.

[pic]

B1: B1 is analogous to the slope of the linear regression line. B1 defines the “steepness” of the curve. It is sometimes called a discrimination parameter.

The larger the value of B1, the “steeper” the curve, the more quickly it goes from 0 to 1.

[pic]

Note that there is a MAJOR difference between the linear regression and logistic regression curves - - -

The logistic regression lines asymptote at 0 and 1. They’re bounded by 0 and 1.

But the linear regression lines extend below 0 on the left and above 1 on the right.

If we interpret P(Y) as a probability, the linear regression curves cannot literally represent P(Y) except for a limited range of X values.

[pic]

[pic]

Why we must fit ogival-shaped curves – the curse of categorization

Here’s a perfectly nice linear relationship between score values, from a recent study.

This relationship is of ACT Comp scores to Wonderlic scores.

[DataSet3] G:\MdbR\0DataFiles\BalancedScale_110706.sav

Here’s the relationship when ACT Comp has been dichotomized at 23, into Low vs. High.

When proportions of High scores are plotted vs. WPT values, we get the following.

So, to fit the above curve, we need a model that is ogival. This is where the logistic regression function comes into play.

This means that even if the “underlying” true values are linearly related, the dichotomized values that we may have to work with may not be linearly related to the independent variable.

Crosstabs and Logistic Regression

Applied to the same 2x2 situation

The FFROSH data.

The data here are from a study of the effect of the Freshman Seminar course on 1st semester GPA and on retention. It involved students from 1987-1992. The data were gathered to investigate the effectiveness of having the freshman seminar course as a requirement for all students. There were two main criteria, i.e., dependent variables – first semester GPA excluding the seminar course and whether a student continued into the 2nd semester.

The dependent variable in this analysis is whether or not a student moved directly into the 2nd semester in the spring following his/her 1st fall semester. It is called RETAINED and equals 1 for students who continued into the immediately following spring semester and 0 for those who did not.

The analysis reported here was a serendipitous finding regarding the time at which students register for school. It has been my experience that those students who wait until the last minute to register for school perform more poorly on the average than do students who register earlier. This analysis looked at whether this informal observation could be extended to the likelihood of retention to the 2nd semester.

After examining the distribution of the times students registered prior to the first day of class we decided to compute a dichotomous variable representing the time prior to the 1st day of class that a student registered for classes. The variable was called EARLIREG – for EARLY REGistration. It had the value 1 for all students who registered 150 or more days prior to the first day of class and the value 0 for students who registered within 150 days of the 1st day. (The 150 day value was chosen after inspection of the 1st semester GPA data.)

So the analysis that follows examines the relationship of RETAINED to EARLIREG, retention to the 2nd semester to early registration.

The analyses will be performed using CROSSTABS and using LOGISTIC REGRESSION.

First, univariate analyses . . .

GET FILE='E:\MdbR\FFROSH\Ffroshnm.sav'.

Fre var=retained earlireg.

retained

|      |      |Frequency |Percent |Valid Percent |Cumulative Percent |
|Valid |.00   |552       |11.6    |11.6          |11.6               |
|      |1.00  |4201      |88.4    |88.4          |100.0              |
|      |Total |4753      |100.0   |100.0         |                   |

earlireg

|      |      |Frequency |Percent |Valid Percent |Cumulative Percent |
|Valid |.00   |2316      |48.7    |48.7          |48.7               |
|      |1.00  |2437      |51.3    |51.3          |100.0              |
|      |Total |4753      |100.0   |100.0         |                   |

crosstabs retained by earlireg /cells=cou col /sta=chisq.

Crosstabs

[pic]

[pic]

[pic]

The same analysis using Logistic Regression: Analyze -> Regression -> Binary Logistic

logistic regression retained WITH earlireg.

Logistic Regression

[pic]

[pic]

The Logistic Regression procedure fits the logistic regression model to the data. It estimates the parameters of the logistic regression equation.

That equation is P(Y) = 1 / (1 + e^-(B0 + B1*X)).

It performs the estimation in two stages. The first stage estimates only B0, so the model fit to the data in the first stage is simply

P(Y) = 1 / (1 + e^-B0).

SPSS labels the various stages of the estimation procedure "Blocks". In Block 0, a model with only B0 is estimated.

Block 0: Beginning Block (estimating only B0)

[pic]

Explanation of the above table:

The program computes Y-hat for each case using the logistic regression formula with the estimate of B0. If Y-hat is greater than 0.5, the program records that case as a predicted 1. It then creates the above table of numbers of actual 1's and 0's vs. predicted 1's and 0's.

The prediction equation for Block 0 is Y-hat = 1/(1 + e^-2.030). (The value 2.030 is shown in the "Variables in the Equation" table below.) Recall that B1 is not yet in the equation. This means that Y-hat is a constant, equal to .8839 for each case. (I got this by entering the prediction equation into a calculator.) Since Y-hat for each case is > 0.5, all predictions are 1, which is why the above table has only predicted 1's. Sometimes this table is more useful than it was in this case.

[pic]

The above “Variables in the Equation” box is the Logistic Regression equivalent of the “Coefficients Box” in regular regression analysis.

The test statistic is not a t statistic, as in regular regression, but the Wald statistic. The Wald statistic is (B/SE)^2. So (2.030/.045)^2 = 2,035, which would be 2009.624 if the two coefficients were represented with greater precision.

Exp(B) is the odds ratio: e^2.030. More on it later.
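Those numbers can be checked by hand. Here is a minimal Python sketch (mine, not SPSS output) using the values reported in the table, B0 = 2.030 and SE = .045.

import math

b0, se = 2.030, 0.045              # constant and its standard error from the Block 0 output

yhat = 1 / (1 + math.exp(-b0))     # constant-only Y-hat, about .8839
wald = (b0 / se) ** 2              # Wald statistic, (B/SE)^2, about 2035
odds = math.exp(b0)                # Exp(B), about 7.61 - the odds of being retained

print(round(yhat, 4), round(wald, 1), round(odds, 2))
# SPSS prints 2009.624 for the Wald statistic because it uses the unrounded B and SE.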

[pic]

The "Variables not in the Equation" box gives information on each independent variable that is not in the equation. Specifically, it tells you whether or not the variable would be "significant" if it were added to the equation. In this case, it's telling us that EARLIREG would contribute significantly to the equation if it were added, which is what SPSS does next . . .

Block 1: Method = Enter (Adding B1*X to the equation)

[pic]

Whew – three chi-square statistics.

“Step”: Compared to previous step in a stepwise regression. Ignore for now.

“Block”: Tests the significance of the improvement in fit of the model evaluated in this block vs. the previous block. Note that the chi-square is identical to the Likelihood ratio chi-square printed in the Chi-square Box in the CROSSTABS output.

“Model”: Ignore for now

[pic]

The value under "-2 Log likelihood" is a measure of how well the model fits the data in an absolute sense. Values closer to 0 represent better fit, but goodness of fit is complicated by sample size. The R Square values are measures analogous to "percent of variance accounted for." All three measures tell us that there is a lot of variability in proportions of persons retained that is not accounted for by this one-predictor model.

[pic]

The above table is the revised version of the table presented in Block 0.

Note that since X is a dichotomous variable here, there are only two Y-hat values. They are

P(Y) = 1 / (1 + e^-(B0 + B1*0)) = .842 (see below)

and

P(Y) = 1 / (1 + e^-(B0 + B1*1)) = .924 (see below)

As we’ll see below, in both cases, the y-hat was > .5, so predicted Y in the table was 1 for all cases.

[pic]

The prediction equation is Y-hat = 1 / (1 + e^-(1.670 + .830*EARLIREG)).

Since EARLIREG has only two values, those students who registered early will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*1)) = .924. Those who registered late will have a predicted RETAINED value of 1/(1 + e^-(1.670 + .830*0)) = 1/(1 + e^-1.670) = .842. Since both predicted values are above .5, all the cases were predicted to be retained in the table on the previous page.

Exp(B) is called the odds ratio. It is the ratio of the odds of Y=1 when X=1 to the odds of Y=1 when X=0.

Recall that the odds of 1 are P(Y=1)/(1-P(Y=1)). The odds ratio is

Odds ratio = (odds when X=1) / (odds when X=0) = [.924/(1-.924)] / [.842/(1-.842)] = 12.158 / 5.329 = 2.29.

So a person who registered early had odds of being retained that were 2.29 times the odds of a person registering late being retained.
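Here is the same arithmetic as a minimal Python sketch, using the coefficients from the output (B0 = 1.670, B1 = .830):

import math

b0, b1 = 1.670, 0.830    # coefficients from the "Variables in the Equation" table for Block 1

p_late  = 1 / (1 + math.exp(-(b0 + b1 * 0)))    # EARLIREG = 0, about .842
p_early = 1 / (1 + math.exp(-(b0 + b1 * 1)))    # EARLIREG = 1, about .924

odds_late  = p_late  / (1 - p_late)             # about 5.3
odds_early = p_early / (1 - p_early)            # about 12.2

print(round(odds_early / odds_late, 2), round(math.exp(b1), 2))
# both print 2.29 - the odds ratio is just Exp(B1)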

Graphical representation of what we’ve just found.

The following is a plot of Y-hat vs. X, that is, the plot of predicted Y vs. X. Since there are only two values of X (0 and 1), the plot has only two points. The curve drawn on the plot is the theoretical relationship of y-hat to other hypothetical values of X over a wide range of X values (ignoring the fact that none of them could occur.) The curve is analogous to the straight line plot in a regular regression analysis.

[pic]

Discussion

1. When there is only one dichotomous predictor, the CROSSTABS and LOGISTIC REGRESSION give the same significance results, although each gives different ancillary information.

BUT as mentioned above . . .

2. CROSSTABS cannot be used to analyze relationships in which the X variable is continuous.

3. CROSSTABS can be used in a rudimentary fashion to analyze relationships between a dichotomous Y and 2 or more categorical X’s, but the analysis IS rudimentary and is laborious. No tests of interactions are possible. The analysis involves inspection and comparison of multiple tables.

4. CROSSTABS, of course, cannot be used when there is a mixture of continuous and categorical IV’s.

5. LOGISTIC REGRESSION can be used to analyze all the situations mentioned in 2-4 above.

6. So CROSSTABS should be considered for the very simplest situations involving one categorical predictor. But LOGISTIC REGRESSION is the analytic technique of choice when there are two or more categorical predictors and when there are one or more continuous predictors.

Logistic Regression with one Continuous Independent Variable

The data analyzed here represent the relationship of Pancreatitis Diagnosis to measures of Amylase and Lipase. Both Amylase and Lipase levels are tests that can predict the occurrence of Pancreatitis. Generally, it is believed that the larger the value of either, the greater the likelihood of Pancreatitis. [pic]

The objective here is to determine which alone is the better predictor of the condition and to determine if both are needed.

Because the distributions of both predictors were positively skewed, logarithms of the actual Amylase and Lipase values were used for this handout and for some of the following handouts.

This handout illustrates the analysis of the relationship of Pancreatitis diagnosis to only Amylase.

The name of the dependent variable is PANCGRP. It is 1 if the person is diagnosed with Pancreatitis. It is 0 otherwise.

Distributions of logamy and loglip – still somewhat positively skewed even though logarithms were taken.

The logamy and loglip scores are highly positively correlated. For that reason, it may be that once either is in the equation, adding the other won’t significantly increase the fit of the model. We’ll test that hypothesis later.

1. Scatterplots with individual cases.

Relationship of Pancreatitis Diagnosis to log(Amylase)

[pic]

This graph represents a primary problem with visualizing results when the dependent variable is a dichotomy. It is difficult to see the relationship that may very well be represented by the data. One can see, however, that when log amylase is low, there are more 0’s (no Pancreatitis) and when log amylase is high there are more 1’s (presence of Pancreatitis).

The line through the scatterplot is the linear line of best fit. It was easy to generate. It represents the relationship of probability of Pancreatitis to log amylase that would be assumed if a linear regression were conducted.

But, the logistic regression analysis assumes that the relationship between probability of Pancreatitis to log amylase is different. The relationship assumed by the logistic regression analysis would be an S-shaped curve, called an ogive.

Below are the same data, this time with the line of best fit generated by the logistic regression analysis through it. While neither line fits the observed individual case points well in the middle, it’s easy to see that the logistic line fits better at small and at large values of log amylase.

[pic]

2. Grouping cases to show a relationship when the DV is a dichotomy.

The plots above were plots of individual cases. Each point represented the DV value of a case (0 or 1) vs. that person’s IV value (log amylase value). The problem was that the plot didn’t really show the relationship because the DV could take on only two values - 0 and 1.

When the DV is a dichotomy, it may be profitable to form groups of cases with similar IV values and plot the proportion of 1's within each group vs. the IV value for that group.

To illustrate this, groups were formed for every .2 increase in log amylase. That is, the values 1.4, 1.6, 1.8, 2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, and 3.8 were used as group mid points. Each case was assigned to a group based on how close that case’s log amylase value was to the group midpoint. So, for example, all cases between 1.5 and 1.7 were assigned to the 1.6 group.

Syntax: compute logamygp = rnd(logamy,.2).

Then the proportion of 1's within each group was computed. The figure below is a plot of the proportion of 1's within each group vs. the group midpoints. Note that the points form a curve, quite a bit like the ogival form from the logistic regression analysis shown on the previous page.
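For readers who want to reproduce the grouping outside SPSS, here is a rough Python/pandas sketch. It assumes a data frame df with the columns logamy and pancgrp used in this handout; the function name is mine.

import pandas as pd

# df is assumed to hold one row per case with the variables named in this handout,
# e.g. df = pd.read_spss("amylip.sav") if the pyreadstat package is installed.

def group_proportions(df: pd.DataFrame) -> pd.Series:
    # Round logamy to the nearest .2, like COMPUTE logamygp = RND(logamy, .2),
    # then return the proportion of 1's (Pancreatitis cases) within each group.
    logamygp = ((df["logamy"] / 0.2).round() * 0.2).round(1)
    return df.groupby(logamygp)["pancgrp"].mean()

# group_proportions(df) gives the proportions that are plotted against the
# group midpoints in the figure below.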

[pic]

[pic]

The plot of proportions above suggests that the S-shaped curve of the logistic regression model may better represent the increase in probability of Pancreatitis than the straight line of the linear regression model.

The analyses that follow illustrate the application of both analyses to the data.

3. Linear Regression analysis of the logamy data, just for old time’s sake.

REGRESSION

/MISSING LISTWISE

/STATISTICS COEFF OUTS R ANOVA

/CRITERIA=PIN(.05) POUT(.10)

/NOORIGIN

/DEPENDENT pancgrp

/METHOD=ENTER logamy

/SCATTERPLOT=(*ZRESID ,*ZPRED )

/RESIDUALS HIST(ZRESID) NORM(ZRESID) .

Regression

[pic]

[pic]

[pic]

[pic]

Thus, the predicted linear relationship of probability of Pancreatitis to log amylase is

Predicted probability of Pancreatitis = -1.043 + 0.635 * logamy.
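A quick check in Python (a sketch using the coefficients above and the group midpoints from page 3) shows why a straight line is awkward as a model for a probability: the predicted "probabilities" fall below 0 at the low end and above 1 at the high end.

def linear_prob(logamy):
    # the linear regression prediction equation from the output above
    return -1.043 + 0.635 * logamy

for midpoint in (1.4, 2.6, 3.8):    # lowest, middle, and highest group midpoints from page 3
    print(midpoint, round(linear_prob(midpoint), 3))
# 1.4 -> -0.154   a "probability" below 0
# 2.6 ->  0.608
# 3.8 ->  1.370   a "probability" above 1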

The following are the usual linear regression diagnostics.

[pic]

[pic]

[pic]

[pic]

[pic]

Computation of y-hats for the groups.

I had SPSS compute the Y-hat for each of the group mid-points discussed on page 3. I then plotted both the observed group proportion of 1’s that was shown on the previous page and the Y-hat for each group. Of course, the Y-hats are in a linear relationship with log amylase. Note that the solid points don’t really represent the relationship shown by the open symbols. Note also that the solid points extend above 1 and below 0. But the observed proportions are bound by 1 and 0.

compute mrgpyhat = -1.043 + .635*logamyvalue.

execute.

GRAPH

/SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc mrgpyhat (PAIR)

/MISSING=LISTWISE .

Graph

[pic]

4. Logistic Regression Analysis of logamy data

logistic regression pancgrp with logamy.

Logistic Regression

[pic]

[pic]

SPSS’s Logistic regression procedure always performs the analysis in at least two steps, which it calls Blocks.

Recall the Logistic prediction formula is

P(Y) = 1 / (1 + e^-(B0 + B1*X))

In the first block, labeled Block 0, only B0 is entered into the equation. In this B0 only equation, it is assumed that the probability of a 1 is a constant, equal to the overall proportion of 1’s for the whole sample.

Obviously this model will generally be incorrect, since typically, we’ll be working with data in which the probability of a 1 increases as the IV increases.

But this model serves as a useful baseline against which to assess subsequent models, all of which do assume that the probability of a 1 increases as the IV increases.

For each block the Logistic Regression procedure automatically prints a 2x2 table of predicted and observed 1's and 0's. For all of these tables, a case is classified as a predicted 1 if its Y-hat (predicted probability) exceeds 0.5; otherwise it's classified as a predicted 0. Since only the constant is estimated here, the predicted probability for every case is simply the proportion of 1's in the sample, which is 48/256 = 0.1875. Since that's less than 0.5, every case is predicted to be a 0 for this constant-only model.
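In code form (a minimal Python sketch using the counts reported above, 48 ones out of 256 cases):

import math

n_cases, n_ones = 256, 48
p_hat = n_ones / n_cases               # 0.1875, the constant-only predicted probability
b0 = math.log(p_hat / (1 - p_hat))     # the constant the model estimates, about -1.466

predicted = 1 if p_hat > 0.5 else 0
print(round(p_hat, 4), round(b0, 3), predicted)   # 0.1875  -1.466  0, so every case is a predicted 0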

Block 0: Beginning Block

[pic]

[pic]

The test that is recommended is the Wald test. The p-value of .000 says that the value of B0 is significantly different from 0.

The predicted probability of 1 here is

P(1) = 1 / (1 + e^-(-1.466)) = 1 / (1 + 4.332) = 1 / 5.332 = 0.1875, the observed proportion of 1's.

[pic]

The “Variables not in the Equation” box says that if log amylase were added to the equation, it would be significant.

Block 1: Method = Enter

In this block, log amylase is added to the equation.

[pic]

Step: The procedure can perform stepwise regression from a set of covariates. The Step chi-square tests the significance of the increase in fit of the current set of covariates vs. the previous set.

Block: The significance of the increase in fit of the current model vs. the last Block. We’ll focus on this.

Model: Tests the significance of the increase in fit of the current model vs. the “B0 only” model.

[pic]

In the following classification table, for each case, the predicted probability of 1 is evaluated and compared with 0.5. If that probability is > 0.5, the case is a predicted 1, otherwise it’s a predicted 0.

[pic]

Specificity: Proportion of Y=0 cases that the test labels as 0. (Percentage of correct predictions for people who don't have the disease.)

Sensitivity: Proportion of Y=1 cases that the test labels as 1. (Percentage of correct predictions for people who do have the disease.)
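Here is a small Python sketch of how those two percentages come out of a 2x2 classification table. The counts are hypothetical placeholders chosen only to show the arithmetic; they are not the values in the output above.

def sensitivity_specificity(tp, fn, tn, fp):
    # sensitivity = proportion of actual 1's predicted as 1
    # specificity = proportion of actual 0's predicted as 0
    return tp / (tp + fn), tn / (tn + fp)

# hypothetical classification-table counts, chosen only to show the arithmetic
sens, spec = sensitivity_specificity(tp=30, fn=18, tn=200, fp=8)
print(round(sens, 3), round(spec, 3))   # 0.625 0.962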

[pic]

5. Computing Predicted proportions for the groups defined on page 3.

To show that the relationship assumed by the logistic regression analysis is a better representation of the relationship than the linear, I computed probability of 1 for each of the group midpoints from page 3. The figure below is a plot of those probabilities and the observed proportion of 1’s vs. the group midpoints. Compare this figure with that on page 6 to see how much better the logistic regression relationship fits the data than does the linear relationship.

compute lrgpyhat = 1/(1+exp(-(-16.0203 + 6.8978*logamygp))).

GRAPH

/SCATTERPLOT(OVERLAY)=logamygp logamygp WITH probpanc lrgpyhat (PAIR)

/MISSING=LISTWISE .

Graph

[pic]


Compare this graph with the one immediately above. Note that the predicted proportions correspond much more closely to the observed proportions here.

6. Using residuals to distinguish between logistic and linear regression.

I computed residuals for all cases. Recall that a residual is Y – Y-hat. For these data, Y’s were either 1 or 0. Y-hats are probabilities.

First, I computed Y-hats for all cases, using both the linear equation and the logistic equation.


compute mryhat = -1.043 + .635*logamy.

compute lryhat = 1/(1+exp(-(-16.0203 + 6.8978*logamy))).

Now residuals are computed


compute mrresid = pancdiag - mryhat.

compute lrresid = pancdiag - lryhat.

frequencies variables = mrresid lrresid /histogram /format=notable.

Frequencies

[pic]

[pic]

[pic]

[pic]

What these two sets of figures show is that the vast majority of residuals from the logistic regression analysis were virtually 0, while for the linear regression, there were many residuals that were substantially different from 0. So the logistic regression analysis has modeled the Y’s better than the linear regression.
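The same comparison can be sketched in Python using the two prediction equations above (the coefficients are the ones reported earlier; the two example cases are made up):

import numpy as np

def residuals(pancdiag, logamy):
    # returns (linear regression residuals, logistic regression residuals)
    mryhat = -1.043 + 0.635 * logamy                            # linear Y-hats
    lryhat = 1 / (1 + np.exp(-(-16.0203 + 6.8978 * logamy)))    # logistic Y-hats
    return pancdiag - mryhat, pancdiag - lryhat

# two made-up cases: a low-amylase non-case and a high-amylase case
mr, lr = residuals(np.array([0.0, 1.0]), np.array([1.4, 3.8]))
print(np.round(mr, 3), np.round(lr, 3))
# linear residuals:   [ 0.154 -0.37 ]   noticeably different from 0
# logistic residuals: [-0.002  0.   ]   essentially 0 for these two extreme cases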

Logistic Regression - Logamy revisited:

Focus on the Logistic Regression Output

logistic regression variables = pancgrp with logamy.

Logistic Regression

[pic]

[pic]

Block 0: Beginning Block

[pic]

Specificity: The ability to identify cases that don’t have the disease.

Sensitivity: The ability to identify cases that do have the disease.

[pic]

The Wald statistic is (B/SE)^2. It tests the null hypothesis that the coefficient (B0 in this case) is 0 in the population. That null is rejected here.

[pic]

Block 1: Method = Enter

[pic]

The Step Chi-Square tests the significance of improvement (or decrement) in fit over the immediately previous model. It is applicable when stepwise entry of independent variables within a block has been specified. It will be printed after each variable is entered or removed. Again, larger is better.

The Block Chi-Square tests the significance of improvement (or decrement) in fit over the model specified in the previous block of independent variables, if there was one. It is only applicable when two or more blocks of independent variables have been specified. Again, larger is better. It's analogous to the F-change statistic in linear regression.

The Model Chi-Square statistic tests the significance of the improvement in fit of the current model over a model containing just the constant, B0. For this chi-square, larger is better. It is analogous to the overall F statistic in linear regression output.
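As a sketch of where the Model chi-square comes from: it is the drop in -2 Log Likelihood from the constant-only model to the current model, referred to a chi-square distribution with df equal to the number of predictors added. The -2LL values in the Python snippet below are placeholders, not the values in this output.

from scipy.stats import chi2

neg2ll_constant_only = 246.0    # placeholder -2LL for the B0-only model
neg2ll_current_model = 190.0    # placeholder -2LL for the model with the predictor added
df = 1                          # one predictor added

model_chisq = neg2ll_constant_only - neg2ll_current_model
p_value = chi2.sf(model_chisq, df)     # upper-tail probability of the chi-square distribution
print(model_chisq, p_value)            # 56.0 and a very small p-value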

[pic]

-2 Log Likelihood is a goodness of fit measure (0 is best) computed using a particular set of assumptions.

The R Square measures are analogous to R2 in regular regression. Each is computed using a different set of assumptions, which accounts for the difference in their values.

[pic]

Specificity: Ability to predict cases without the disease.

Sensitivity: Ability to predict cases with the disease.

In this classification table, since every case potentially had a different value of logamy, a unique Y-hat was generated for each case. If Y-hat was > .5, a prediction of 1 was recorded. Note the increase in % of correct classifications over the "constant only" model above.

[pic]

[pic]

Logistic Regression: Two Continuous predictors

LOGISTIC REGRESSION VAR=pancgrp

/METHOD=ENTER logamy loglip

/CLASSPLOT

/CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

[DataSet3] G:\MdbT\InClassDatasets\amylip.sav

Logistic Regression

[pic]

[pic]

Block 0: Beginning Block

[pic]

[pic]

Based on the equation with only the constant, B0.

[pic]

Each p-value tells you whether or not the variable would be significant if entered BY ITSELF. That is, each of the above p-values should be interpreted on the assumption that only 1 of the variables would be entered.

Block 1: Method = Enter

[pic]

[pic]

[pic]

Recall: Specificity is the ability to predict cases who do NOT have the disease.

Sensitivity is the ability to predict cases who do have the disease.

[pic]

Interpretation of the coefficients . . .

B: Not easily interpretable on a raw probability scale. Expected increase in log odds for a one-unit increase in IV.

SE: Standard error of the estimate of Bi.

Wald: Test statistic.

Sig: p-value associated with test statistic.

Note that LOGAMY does NOT (officially) add significantly to prediction over and above the prediction afforded by LOGLIP.

Exp(B): Odds ratio for a one-unit increase in IV among persons equal on the other IV.

A person one unit higher on an IV will have Exp(B) times the odds of having Pancreatitis.

So a person one unit higher on LOGLIP will have 20.04 times the odds of having Pancreatitis.
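To see what that multiplication means on the probability scale, here is a small Python sketch. The starting probability of .30 is an arbitrary illustration; 20.04 is the Exp(B) for LOGLIP reported above.

exp_b_loglip = 20.04        # Exp(B) for LOGLIP from the output above
p_start = 0.30              # arbitrary starting probability for a hypothetical person

odds_start = p_start / (1 - p_start)    # about 0.43
odds_new = odds_start * exp_b_loglip    # odds after a one-unit increase in LOGLIP,
                                        # holding LOGAMY constant
p_new = odds_new / (1 + odds_new)       # back to a probability, about .90

print(round(odds_start, 2), round(odds_new, 2), round(p_new, 2))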

Classification Plots – a frequency distribution of predicted probabilities categorized by actual classification

             Step number: 1

             Observed Groups and Predicted Probabilities

      80 ┼                                                                                                    ┼

         │N                                                                                                   │

         │N                                                                                                   │

F        │N                                                                                                   │

R     60 ┼N                                                                                                   ┼

E        │N                                                                                                   │

Q        │N                                                                                                   │

U        │NN                                                                                                  │

E     40 ┼NN                                                                                                  ┼

N        │NN                                                                                                  │

C        │NNN                                                                                                 │

Y        │NNN                                                                                                 │

      20 ┼NNN                                                                                                 ┼

         │NNN                                                                                                P│

         │NNN NN                                                                                             P│

         │NNNNNNNNNNNP        N                                                                     P    PP PP│

Predicted ─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼──────────

  Prob:   0       .1        .2        .3        .4        .5        .6        .7        .8        .9         1

  Group:  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP

          Predicted Probability is of Membership for Pancreatitis

          The Cut Value is .50

          Symbols: N - No Pancreatitis

                   P - Pancreatitis

          Each Symbol Represents 5 Cases.

One aspect of the above plot is misleading because many cases are not represented in it. Only those cases that happened to be close enough to other cases for a group of 5 cases to be formed are represented. So, for example, those relatively few cases whose Y-hats were close to .5 are not seen in the above plot, because there were not enough of them to make a group of 5 cases.

Classification Plots using dot plots.

Here’s the same information gotten as dot plots of Y-hats with PANCGRP as a Row Panel Variable.

[pic]

Classification Plots using Histograms in EXPLORE

Here’s another equivalent representation of what the authors of the program were trying to show.

[pic]

[pic]

Visualizing the equation with two predictors

With one predictor, a simple scatterplot of YHATs vs. X will show the relationship between Y and X implied by the model.

For two predictor models, a 3-D scatterplot is required. Here’s how the graph below was produced.

Graphs -> Interactive -> Scatterplot. . .

[pic]

[pic]

The same graph but with Linear Regression Y-hats plotted vs. loglip and logamy.

[pic]

Representing Relationships with a Table – for the Powerpoint slides

compute logamygp2 = rnd(logamy,.5).
Google Online Preview   Download