Logistic Regression Using SPSS

We will use the same breast cancer dataset for this handout as we did for the handout on logistic regression using SAS. Shown below is a table listing the variables in “A study of preventive lifestyles and women’s health,” conducted by a group of students in the School of Public Health at the University of Michigan during the 1997 winter term. There are 370 women in this study, aged 40 to 91 years.

|Description of variables: |

|Variable Name |Description |Column Location |

|IDNUM |Identification number |1-4 |

|STOPMENS |1= Yes, 2= No, 9= Missing |5 |

|AGESTOP1 |88= NA (haven't stopped), 99= Missing |6-7 |

|NUMPREG1 |88= NA (no births), 99= Missing |8-9 |

|AGEBIRTH |88= NA (no births), 99= Missing |10-11 |

|MAMFREQ4 |1= Every 6 months, 2= Every year, 3= Every 2 years, 4= Every 5 years, 5= Never, 6= Other, 9= Missing |12 |

|DOB |01/01/00 to 12/31/57, 99/99/99= Missing |13-20 |

|EDUC |1= No formal school, 2= Grade school, 3= Some high school, 4= High school graduate/Diploma equivalent, 5= Some college education/Associate’s degree, 6= College graduate, 7= Some graduate school, 8= Graduate school or professional degree, 9= Other, 99= Missing |21-22 |

|TOTINCOM |1= Less than $10,000, 2= $10,000 to 24,999, 3= $25,000 to 39,999, 4= $40,000 to 54,999, 5= More than $55,000, 8= Don’t know, 9= Missing |23 |

|SMOKER |1= Yes, 2= No, 9= Missing |24 |

|WEIGHT1 |999= Missing |25-27 |

In SPSS, we use the SET EPOCH command to define the 100-year window SPSS will use to interpret two-digit years. We use SET EPOCH=1900 so that a date of birth of 12/21/05 will be read as Dec 21, 1905, rather than as Dec 21, 2005.

SET epoch=1900.

The DATA LIST command reads in the raw data. We initially read in DOB as a character (string) variable, and then recode its missing value code, 09/09/99, to blank. We compute a new variable called BIRTHDATE, which is the numeric (date) version of DOB, using the SPSS Date and Time Wizard. We also use the Date and Time Wizard to calculate the variables STUDYDATE and AGE.

SET epoch=1900.

data list

file "c:\documents and settings\kwelch\desktop\b510\brca.dat" records=1

/idnum 1-4 stopmens 5 agestop1 6-7 numpreg1 8-9 agebirth 10-11

mamfreq4 12 dob 13-20 (A) educ 21-22

totincom 23 smoker 24 weight1 25-27.

execute.

RECODE

dob ('09/09/99'=' ') .

EXECUTE .

* Date and Time Wizard: birthdate.

COMPUTE birthdate = number(dob, ADATE8).

VARIABLE LABEL birthdate.

VARIABLE LEVEL birthdate (SCALE).

FORMATS birthdate (ADATE10).

VARIABLE WIDTH birthdate(10).

EXECUTE.

missing values stopmens mamfreq4 smoker(9) agestop1 agebirth (88,99)

numpreg1 educ (99) totincom(8,9) weight1 (999).

RECODE

stopmens

(1=1) (2=0) INTO menopause .

EXECUTE .

COMPUTE yearbirth = XDATE.YEAR(birthdate).

FORMATS yearbirth (F8.0).

VARIABLE WIDTH yearbirth(8).

execute.

compute studymonth = 1.

format studymonth (f2.0).

compute studyyear = 1997.

format studyyear (f4.0).

compute studyday = 1.

format studyday (f1.0).

execute.

COMPUTE studydate = DATE.DMY(studyday, studymonth, studyyear).

FORMATS studydate (ADATE10).

EXECUTE.

COMPUTE age = DATEDIF(studydate, birthdate, "years").

FORMATS age (F5.0).

EXECUTE.

RECODE

educ

(MISSING=SYSMIS) (1 thru 4=1) (5 thru 6=2) (7 thru 8=3) INTO

edcat .

formats edcat (f2.0).

EXECUTE .

Recode educ (missing=sysmis) (6 thru 8 = 1) (else=0) into highed.

formats highed (f2.0).

EXECUTE .

RECODE age (missing=sysmis) (lowest thru 49=1) (50 thru 59=2)

(60 thru 69=3) (70 thru highest=4) into agecat.

formats agecat (f2.0).

EXECUTE.

Recode age (missing=sysmis) (50 thru highest = 1) (else=0) into over50.

formats over50 (f2.0).

EXECUTE .

Recode age (missing=sysmis) (50 thru highest = 1) (else=2) into highage.

formats highage(f2.0).

EXECUTE .

SAVE OUTFILE='C:\Documents and Settings\kwelch\Desktop\b510\brca.sav'

/COMPRESSED.

Descriptives and Frequencies

We first get descriptive statistics for all the numeric variables in the dataset. Notice that although there are 370 observations in the dataset, only 191 cases are complete for all variables (Valid N (listwise) = 191).

DESCRIPTIVES

VARIABLES=idnum stopmens agestop1 numpreg1 agebirth mamfreq4 educ totincom

smoker weight1 birthdate yearbirth edcat highed studymonth studyyear

studyday studydate age highage over50 menopause

/STATISTICS=MEAN STDDEV MIN MAX .
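The frequency tables shown further below (for stopmens, menopause, educ, edcat, age, over50, and highage) come from a FREQUENCIES command. The exact command is not included in this copy of the handout, but a minimal sketch, assuming the variable list implied by the output, would be:

FREQUENCIES VARIABLES=stopmens menopause educ edcat age over50 highage

/ORDER=ANALYSIS.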

|Descriptive Statistics |

|(Output tables: descriptive statistics for the variables listed above, followed by frequency tables reporting Frequency, Percent, and Valid Percent for stopmens, menopause, educ, edcat, age, over50, and highage.) |

Crosstabulation

Prior to fitting a logistic regression model, we check a crosstabulation to understand the relationship between menopause and high age. In this 2 by 2 table, both the predictor variable, HIGHAGE, and the outcome variable, STOPMENS, are coded as 1 and 2. For HIGHAGE, the value 1 represents the high-risk group (those aged 50 years or older), and for STOPMENS, the value 1 represents the outcome of interest (those who are in menopause). Notice also that HIGHAGE is considered to be the risk factor, so it is listed first (the row variable) on the TABLES subcommand, and STOPMENS is the outcome of interest, so it is listed second (the column variable). We request the relative risk and the odds ratio, along with the chi-square test of independence, in the Statistics dialog box.

CROSSTABS

/TABLES=highage BY stopmens

/FORMAT= AVALUE TABLES

/STATISTIC=CHISQ RISK

/CELLS= COUNT EXPECTED ROW

/COUNT ROUND CELL .

|Case Processing Summary |

| |Cases |

| |Valid |Missing |Total |

|highage * stopmens Crosstabulation |

| | | |stopmens |Total |

| | | |1 |2 | |

|Total |Expected Count |301.0 |59.0 |360.0 |

| |% within highage |83.6% |16.4% |100.0% |

|Chi-Square Tests |

|b. Computed only for a 2x2 table |

The output below says "For cohort stopmens = 1". This is what we want: the risk of menopause for those in the high age group (highage=1, the first row) divided by the risk of menopause for those not in the high age group (highage=2, the second row). Notice that the odds ratio (24.6) is not a good estimate of the risk ratio (1.90), because the outcome is not rare in this group of older women.

|Risk Estimate |

| |Value |95% Confidence Interval |

| | |Lower |Upper |

|Odds Ratio for highage (1 / 2) |24.598 |11.680 |51.802 |

|For cohort stopmens = 1 |1.904 |1.564 |2.318 |

|For cohort stopmens = 2 |.077 |.041 |.147 |

|N of Valid Cases |360 | | |
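The connection between the two measures can be written out directly. If p1 is the risk of menopause in the high-age group and p2 is the risk in the younger group, then

RR = p1 / p2 and OR = [p1 / (1 − p1)] / [p2 / (1 − p2)] = RR × (1 − p2) / (1 − p1).

The correction factor (1 − p2) / (1 − p1) is close to 1 only when the outcome is rare in both groups. Here menopause is common in both age groups, so the factor is far from 1 and the odds ratio (24.6) is much larger than the risk ratio (1.90).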

Logistic Regression Model with a dummy variable predictor

We now fit a logistic regression model, but using two different variables: OVER50 (coded as 0, 1) is used as the predictor, and MENOPAUSE (also coded as 0, 1) is used as the outcome. Note that SPSS models the probability that MENOPAUSE=1 by default, so we do not need to use any special syntax for this, as we did in SAS.

LOGISTIC REGRESSION VARIABLES menopause

/METHOD = ENTER over50

/PRINT = CI(95)

/CRITERIA = PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

The output below shows us that we have 360 observations in this analysis.

|Case Processing Summary |

|Unweighted Casesa |N |Percent |

|Selected Cases |Included in Analysis |360 |97.3 |

| |Missing Cases |10 |2.7 |

| |Total |370 |100.0 |

|Unselected Cases |0 |.0 |

|Total |370 |100.0 |

|a. If weight is in effect, see classification table for the total number of cases. |

The dependent variable encoding is as we expect, so we proceed to look at the rest of the output.

|Dependent Variable Encoding |

|Original Value |Internal Value |

|.00 |0 |

|1.00 |1 |

Block 0: Beginning Block

This gives us information about the model before any predictors have been added.

We will not be using the classification table for this analysis. In general, it is helpful if you have a diagnostic test or other method and wish to check the proportion of cases that are correctly classified when a given cutpoint is used for the predicted probability of an event; the default cutpoint is .5.

|Classification Tablea,b |

| |Observed |Predicted |

| | |menopause |Percentage Correct |

| | |.00 |1.00 | |

|Step 0 |menopause |.00 |0 |59 |

|a. Constant is included in the model. |

|b. The cut value is .500 |
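As a quick check of how this table works at the default cut value of .5 (the full table is not reproduced above): with no predictors, every case is assigned the more common outcome (menopause = 1), so the 59 women who are not in menopause are all misclassified, and the overall percentage correct is simply the proportion in the modal category, 301/360 = 83.6%.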

At step 0, the only predictor in the equation is the constant.

|Variables in the Equation |

| |

| | | |Score |df |

Block 1: Method = Enter

Now we enter the first block of variables. Because we are entering only OVER50, it is the only variable shown here.

The Omnibus Tests of Model Coefficients test the overall model. Because we have only one predictor in the model, we have a 1 df test, which is significant.

|Omnibus Tests of Model Coefficients |

| | |Chi-square |df |Sig. |

|Step 1 |Step |99.081 |1 |.000 |

| |Block |99.081 |1 |.000 |

| |Model |99.081 |1 |.000 |

The model summary table shows the -2 log likelihood for the model, and the Cox & Snell R-Square (called the pseudo R-square in SAS) and the Nagelkerke R-square (called the maximum rescaled R-square in SAS). We see that this model has explained about 41% of the variation in the outcome.

|Model Summary |

|Step |-2 Log likelihood |Cox & Snell R Square |Nagelkerke R Square |

|1 |222.084a |.241 |.408 |

|a. Estimation terminated at iteration number 6 because parameter estimates |

|changed by less than .001. |

The value of the parameter estimate for OVER50 (3.2) tells us that the log-odds of being in menopause are 3.2 units higher (because the estimate is positive) for women who are over 50 than for women who are not. This result is significant, Wald chi-square (1 df) = 71.036, p < 0.001. The odds ratio (24.6) is easier to interpret. It tells us that the odds of being in menopause are 24.6 times higher for a woman who is over 50 than for one who is not. We can see that the 95% CI for the odds ratio does not include 1, so we can be quite confident that there is a strong relationship between being over 50 and being in menopause.

|Variables in the Equation |

| |
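The odds ratio is just the exponentiated coefficient, so (allowing for rounding of the printed estimate) exp(3.20) ≈ 24.6, and the fitted model can be written as

logit(p) = b0 + 3.20 × OVER50,

where p is the probability of being in menopause and b0 is the constant from the output (not reproduced above).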

Logistic Regression Model with a class variable as predictor

We now fit the same model, but using a class variable, HIGHAGE (coded as 1=Highage and 2=Not Highage), as the predictor. We set up HIGHAGE as a categorical predictor, using the Indicator dummy variable coding. We accept the default setup, so that the last (highest) category of HIGHAGE will be the reference. So, we will be fitting a model in which we are comparing the odds of being in menopause for those women who are over 50 (HIGHAGE=1) to those who are not over 50 (HIGHAGE=2, the reference category).

Note that the results of this model fit are the same as in the previous model, but with some minor modifications in the display.

LOGISTIC REGRESSION VARIABLES menopause

/METHOD=ENTER highage

/CONTRAST (highage)=Indicator

/CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).

|Case Processing Summary |

|Unweighted Casesa |N |Percent |

|Selected Cases |Included in Analysis |360 |97.3 |

| |Missing Cases |10 |2.7 |

| |Total |370 |100.0 |

|Unselected Cases |0 |.0 |

|Total |370 |100.0 |

|a. If weight is in effect, see classification table for the total number of cases. |

|Dependent Variable Encoding |

|Original Value |Internal Value |

|.00 |0 |

|1.00 |1 |

SPSS provides information on the coding of the categorical predictor (HIGHAGE). We can see that the single parameter that will be used for HIGHAGE has a value of 0 for HIGHAGE=2 (the younger group), which means that this will be the reference category.

|Categorical Variables Codings |

| | |Frequency |Parameter coding |

| | | |(1) |

|highage |1 |261 |1.000 |

| |2 |99 |.000 |

The output for the parameter estimate is slightly different than for the previous model. In this case, we see highage(1), to emphasize that this is the first (and only) dummy variable for HIGHAGE. Refer to the table showing the coding of the categorical variables to be sure of the interpretation of this parameter.

|Variables in the Equation |

| |

Logistic Regression Model with a class predictor with more than two categories

We now look at the relationship of education categories to menopause. Again, we begin by checking the cross-tabulation between education and menopause, using the variable EDCAT as the "exposure" and STOPMENS as the "outcome" or event. Because we are interested in the probability of STOPMENS = 1, for each level of EDCAT, we really need only the row percents, so we request the row percents only. We see in the output that the proportion of women in menopause decreases with increasing education level.

CROSSTABS

/TABLES=edcat BY stopmens

/FORMAT=AVALUE TABLES

/STATISTICS=CHISQ

/CELLS=COUNT ROW

/COUNT ROUND CELL.

|edcat * stopmens Crosstabulation |

| | | |stopmens |Total |

| | | |1 |2 |

| |% within edcat |84.0% |16.0% |100.0% |

|Chi-Square Tests |

| |Value |df |Asymp. Sig. (2-sided)|

|Pearson Chi-Square |9.117a |2 |.010 |

|Likelihood Ratio |9.337 |2 |.009 |

|Linear-by-Linear Association |9.071 |1 |.003 |

|N of Valid Cases |363 | | |

|a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.78. |

We now fit a logistic regression model using EDCAT as the predictor. We include EDCAT as a categorical predictor, with EDCAT=1 as the reference category.

LOGISTIC REGRESSION VARIABLES menopause

/METHOD=ENTER edcat

/CONTRAST (edcat)=Indicator(1)

/PRINT=CI(95)

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

We can look at the categorical variable information below to see that EDCAT=1 is the reference category, because it has a value of 0 for both of the design (dummy) variables. The first dummy variable, EDCAT(1), will be for EDCAT 2 vs. 1, and the second dummy variable, EDCAT(2), will be for EDCAT 3 vs. 1.

|Case Processing Summary |

|Unweighted Casesa |N |Percent |

|Selected Cases |Included in Analysis |363 |98.1 |

| |Missing Cases |7 |1.9 |

| |Total |370 |100.0 |

|Unselected Cases |0 |.0 |

|Total |370 |100.0 |

|a. If weight is in effect, see classification table for the total number of cases. |

|Dependent Variable Encoding |

|Original Value |Internal Value |

|.00 |0 |

|1.00 |1 |

|Categorical Variables Codings |

| | |Frequency |Parameter coding |

| | | |(1) |(2) |

|edcat |1 |105 |.000 |.000 |

| |2 |148 |1.000 |.000 |

| |3 |110 |.000 |1.000 |
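In terms of these design variables, the model being fit can be written (a sketch of the standard indicator-coding setup, using b0, b1, and b2 for the estimates in the output) as

logit(p) = b0 + b1 × EDCAT(1) + b2 × EDCAT(2),

so exp(b1) is the odds ratio comparing EDCAT=2 to EDCAT=1, and exp(b2) is the odds ratio comparing EDCAT=3 to EDCAT=1.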

The Omnibus Tests of Model Coefficients table provides an overall test for all parameters in the model. Here it gives a likelihood ratio chi-square test of whether there is any effect of EDCAT, chi-square (2 df) = 9.337, p = .009. In spite of the model being significant, the Nagelkerke R-square is very small (.043).

|Omnibus Tests of Model Coefficients |

| | |Chi-square |df |Sig. |

|Step 1 |Step |9.337 |2 |.009 |

| |Block |9.337 |2 |.009 |

| |Model |9.337 |2 |.009 |

|Model Summary |

|Step |-2 Log likelihood |Cox & Snell R Square |Nagelkerke R Square |

|1 |309.598a |.025 |.043 |

|a. Estimation terminated at iteration number 5 because parameter estimates |

|changed by less than .001. |

The parameter estimate for EDCAT(1) shows that the log-odds of menopause for someone with EDCAT=2 are smaller than for someone with EDCAT=1, but this difference is not significant (p=0.105). The parameter estimate for EDCAT(2) is also negative, indicating that someone with EDCAT=3 has lower log-odds of menopause than a person with EDCAT=1, and this difference is significant (p=0.004). The overall test for EDCAT is a Wald chi-square, which is another test of the overall significance of EDCAT (chi-square (2 df) = 8.632, p=.013).

The odds ratio estimate for EDCAT(2) (Edcat 3 vs 1) is .303, indicating that the odds of being in menopause for a person with EDCAT=3 are only 30% of the odds of being in menopause for a person with EDCAT=1.

|Variables in the Equation |

| |
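As a quick check of the scale of this effect: on the log-odds scale the reported odds ratio corresponds to an estimate of about ln(.303) ≈ −1.19 for EDCAT(2). Equivalently, flipping the comparison, 1/.303 ≈ 3.3, so the odds of being in menopause for a person with EDCAT=1 are roughly 3.3 times the odds for a person with EDCAT=3.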

Logistic Regression Model with a continuous predictor

We now look at a logistic regression model, but this time with a single continuous predictor (AGE). The parameter estimate for AGE is positive (0.283), telling us that the log-odds of being in menopause increase by .28 units for a woman who is one year older compared to her counterpart who is one year younger. The odds ratio (1.33) tells us that the odds of being in menopause for a woman who is one year older are 1.33 times the odds for a woman who is one year younger.

LOGISTIC REGRESSION VARIABLES menopause

/METHOD=ENTER age

/PRINT=CI(95)

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

|Case Processing Summary |

|Unweighted Casesa |N |Percent |

|Selected Cases |Included in Analysis |360 |97.3 |

| |Missing Cases |10 |2.7 |

| |Total |370 |100.0 |

|Unselected Cases |0 |.0 |

|Total |370 |100.0 |

|a. If weight is in effect, see classification table for the total number of cases. |

|Dependent Variable Encoding |

|Original Value |Internal Value |

|.00 |0 |

|1.00 |1 |

|Omnibus Tests of Model Coefficients |

| | |Chi-square |df |Sig. |

|Step 1 |Step |124.146 |1 |.000 |

| |Block |124.146 |1 |.000 |

| |Model |124.146 |1 |.000 |

|Model Summary |

|Step |-2 Log likelihood |Cox & Snell R Square |Nagelkerke R Square |

|1 |197.019a |.292 |.494 |

|a. Estimation terminated at iteration number 7 because parameter estimates |

|changed by less than .001. |

|Variables in the Equation |

| |
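Because the coefficient applies per year of age, the odds ratio for any other age difference is obtained by multiplying before exponentiating: exp(0.283) ≈ 1.33 for a one-year difference, and, for example, exp(10 × 0.283) = exp(2.83) ≈ 17 for a ten-year difference (this is simple arithmetic based on the reported coefficient; the ten-year comparison is not part of the printed output).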

Quasi-Complete Separation in a Logistic Regression Model

One fairly common occurrence in logistic regression is that the model fails to converge. This often happens when a categorical predictor is "too perfect": there is a category with no variability in the response (all subjects in that category of the predictor have the same response). This is called quasi-complete separation. When this happens, SPSS will give a warning message in the output. These warnings should be taken seriously, and the model should be refitted, perhaps by combining some categories of the predictor.

Even if there is not quasi-complete separation, separation may be nearly complete, so the standard error for a parameter estimate can become very large. It is good practice to examine the parameter estimates and their standard errors carefully for any logistic regression output.

We now examine a situation where quasi-complete separation occurs, using the variable AGECAT as a predictor in a logistic regression. First we check the crosstabulation between AGECAT and STOPMENS. Notice that in the highest age category, all 71 women are in menopause (not surprisingly).

CROSSTABS

/TABLES=agecat BY stopmens

/FORMAT=AVALUE TABLES

/STATISTICS=CHISQ

/CELLS=COUNT ROW

/COUNT ROUND CELL.

|agecat * stopmens Crosstabulation |

| | | |stopmens |Total |

| | | |1 |2 |

| |% within agecat |83.6% |16.4% |100.0% |

|Chi-Square Tests |

| |Value |df |Asymp. Sig. (2-sided)|

|Pearson Chi-Square |111.660a |3 |.000 |

|Likelihood Ratio |110.175 |3 |.000 |

|Linear-by-Linear Association |78.698 |1 |.000 |

|N of Valid Cases |360 | | |

|a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.64. |

We now fit the corresponding logistic regression model, using AGECAT as a categorical predictor, with AGECAT=1 as the reference category.

LOGISTIC REGRESSION VARIABLES menopause

/METHOD=ENTER agecat

/CONTRAST (agecat)=Indicator(1)

/PRINT=CI(95)

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

We check the Categorical Variables Codings table to see that the design variables are set up correctly, with AGECAT=1 as the reference category, and notice that the dummy variables for AGECAT will be labeled AGECAT(1), AGECAT(2), and AGECAT(3). These correspond to AGECAT=2, 3, and 4, respectively.

|Categorical Variables Codings |

| | |Frequency |Parameter coding |

| |

|Model Summary |

|Step |-2 Log likelihood |Cox & Snell R Square |Nagelkerke R Square |

|1 |210.990a |.264 |.447 |

|a. Estimation terminated at iteration number 20 because maximum iterations |

|has been reached. Final solution cannot be found. |

|Variables in the Equation |

| |
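The source of the problem is visible in the crosstabulation above: all 71 women in the highest age category are in menopause, so the observed proportion in that category is 71/71 = 1 and the observed log-odds, ln(p / (1 − p)) = ln(1/0), is infinite. The estimate for the corresponding coefficient, AGECAT(3), therefore keeps growing as the iterations proceed, its standard error becomes huge, and SPSS stops at the iteration limit without a final solution.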

Based on the information that we saw in the crosstabulation, we will create a new variable AGECAT3, with 3 age categories, collapsing category 3 and category 4.

RECODE agecat (MISSING=SYSMIS) (1=1) (2=2)(3 thru 4=3) INTO agecat3.

EXECUTE.

We now fit a new logistic regression, with AGECAT3 as a categorical predictor. Note that in the output for this model we no longer have a problem with quasi-complete separation; however, we see a very wide confidence interval for AGECAT3(2), owing to the fact that only one participant in this group had a value of 0 on the dependent variable.

LOGISTIC REGRESSION VARIABLES menopause

/METHOD=ENTER agecat3

/CONTRAST (agecat3)=Indicator(1)

/PRINT=CI(95)

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).

|Model Summary |

|Step |-2 Log likelihood |Cox & Snell R Square |Nagelkerke R Square |

|1 |212.329a |.261 |.442 |

|a. Estimation terminated at iteration number 8 because parameter estimates |

|changed by less than .001. |

|Variables in the Equation |

| |

Logistic Regression Model with Several Predictors

We now fit a logistic regression model with several predictors, both continuous and categorical. Note especially the global test for the model, which has 6 degrees of freedom, due to the 6 parameters that are estimated for the predictors in the model. There are two parameters for EDCAT and one each for AGE, SMOKER, TOTINCOM, and NUMPREG1. The only predictor that is significant in this model is AGE (p ...
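The syntax for this model is cut off in this copy of the handout. Based on the predictors described above and the conventions used for the earlier models, a sketch of what it would look like is shown below (the exact subcommands in the original may differ):

LOGISTIC REGRESSION VARIABLES menopause

/METHOD=ENTER age smoker totincom numpreg1 edcat

/CONTRAST (edcat)=Indicator(1)

/PRINT=CI(95)

/CRITERIA=PIN(0.05) POUT(0.10) ITERATE(20) CUT(0.5).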
