LOGISTIC REGRESSION TUTORIAL - Winona



LOGISTIC REGRESSION (Chapter 20)

Example - High Dieldrin Levels in Western Australian Breast Feeding Mothers

Data File: Pestmilk.JMP

These data come from a study of breast feeding mothers in Western Australia in 1979-80. Earlier research had discovered surprisingly high levels of pesticides in human breast milk. The research conducted in 1979-80 hoped to show that the levels had decreased as a result of stricter government regulations on the use of pesticides on food crops. Decreases were found for several types of pesticides. Levels of the pesticide Dieldrin, however, had substantially increased. These data were collected in the hope of explaining why.

For 45 breast milk donors, we have information on the mother's age in years, whether she lived in a new suburb (0 = no, 1 = yes), whether her house had been treated for termites within the past three years (0 = no, 1 = yes), and whether her breast milk contained above average (> .009 ppm) levels of the pesticide Dieldrin. By law, new houses in Australia are treated for termites.

The variables in the Pestmilk.JMP data file are:

• age - age of mother (yrs.)

• ns - new suburb indicator (1 = yes, 0 = no)

• ht - house treated for termites in the last 3 years (1 = yes, 0 = no)

• hd - high Dieldrin level (1 = yes, 0 = no)

• New Sub - nominal version of ns (New or Old)

• Treated - nominal version of ht (HT = house treated or NT = not treated)

• High Dieldrin - nominal version of hd (High or Low)

Important JMP Note: For interpretation purposes it is best to code the outcome so that the adverse outcome is alphabetically first. The same is true for risk factors: code them so that the level associated with increased risk is alphabetically first.

One way to examine the relationship between the response (High Dieldrin) and the predictors (age, New Sub & Treated) is to construct 2 X 2 contingency tables for the two categorical predictors and compute conditional probabilities, relative risks, and odds ratios. The tables and plots below were obtained in JMP by using Fit Y by X and placing each of the predictors (New Sub & Treated) in the X box and the response (High Dieldrin) in the Y box. The results are shown on the following page.

The plots and the contingency tables with the conditional probabilities added suggest that both living in a new suburb (New Sub) and living in a home treated for termites (HT) are associated with an increased risk of having high dieldrin levels in breast milk.

Contingency Analysis of High Dieldrin By Treated

[pic]

OR = (13*16)/(3*11) = 6.30. Mothers living in a home treated for termites have 6.30 times the odds of having high dieldrin levels in their breast milk when compared to mothers living in homes not treated for termites.

Contingency Analysis of High Dieldrin By New Sub

[pic]

OR = (7*22)/(9*5) = 3.42. Mothers living in a new suburb have 3.42 times the odds of having high dieldrin levels in their breast milk when compared to mothers living in an older suburb.
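The conditional probabilities, relative risks, and odds ratios above can be checked directly from the cell counts. The Python sketch below is only an illustration and is not part of the JMP workflow; `two_by_two_summary` is a hypothetical helper name introduced here, and the counts are the ones from the two contingency tables for this example.

```python
def two_by_two_summary(a, b, c, d):
    """Summarize a 2 x 2 table laid out as

                 High  Low
      exposed      a    b
      unexposed    c    d
    """
    p_exposed = a / (a + b)        # P(High | exposed)
    p_unexposed = c / (c + d)      # P(High | unexposed)
    relative_risk = p_exposed / p_unexposed
    odds_ratio = (a * d) / (b * c)
    return p_exposed, p_unexposed, relative_risk, odds_ratio


treated = two_by_two_summary(13, 11, 3, 16)   # HT row, then NT row
new_sub = two_by_two_summary(7, 5, 9, 22)     # New row, then Old row

for label, (p1, p0, rr, odds_ratio) in [("Treated", treated), ("New Sub", new_sub)]:
    print(f"{label}: P(High|exposed)={p1:.4f}, P(High|unexposed)={p0:.4f}, "
          f"RR={rr:.2f}, OR={odds_ratio:.2f}")
```

The odds ratios agree with the hand calculations (6.30 and 3.42). The relative risks are smaller than the odds ratios because high dieldrin is not a rare outcome in this sample.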

Logistic Regression Model

In logistic regression we model the log of the odds for success as a linear function of the predictors. For example, consider the logistic regression model for the risk factor New Suburb.

[pic]

where,

[pic]

The log odds for a breast feeding mother living in a new suburb is given by

[pic]

and for a mother living in an old suburb is given by

[pic]

The difference in the log odds is equivalent to the log of the odds ratio (OR) because of the following property of logarithms.

ln(a) - ln(b) = ln(a/b)

Applying this property here we have

[pic]

This says that the OR associated with living in a new suburb is given by

[pic]
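The relationship between the slope and the odds ratio can be verified numerically. The sketch below assumes the new suburb indicator is coded 1 = new, 0 = old (an assumption; JMP may parameterize nominal effects differently, so its reported coefficients need not equal these values). The counts come from the New Sub contingency table.

```python
import math

# New Sub counts: New -> 7 High, 5 Low; Old -> 9 High, 22 Low.
log_odds_new = math.log(7 / 5)
log_odds_old = math.log(9 / 22)

# Under 0/1 coding (assumed), beta0 is the log odds for an old suburb and
# beta1 is the difference in log odds between new and old suburbs.
beta0 = log_odds_old
beta1 = log_odds_new - log_odds_old

odds_ratio = math.exp(beta1)
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}, OR = exp(beta1) = {odds_ratio:.2f}")
```

Exponentiating the difference in log odds returns the same odds ratio, (7*22)/(9*5), obtained from the contingency table.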

Fitting the New Suburb Logistic Regression Model in JMP

Select Fit Model and place High Dieldrin in the Y box and New Suburb in the Model Effects box.

[pic]

Resulting output…

[pic]

The estimated OR associated with living in a new suburb is then

We can use JMP to compute the OR’s by selecting Nominal Logistic… > Odds Ratio

[pic]

Similarly for House Treated we have the following logistic regression model.

[pic]

Finding Predicted Probabilities

The logistic regression model can be used to estimate the probability of “success” given a set of predictor values as follows:

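P(success|X) = exp(β0 + β1X) / (1 + exp(β0 + β1X)) for situations where we have a single predictor

and is given by

P(success|X1,...,Xp) = exp(β0 + β1X1 + ... + βpXp) / (1 + exp(β0 + β1X1 + ... + βpXp)) for situations where we have p predictors.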

For the example above we can estimate the probability of high dieldrin levels for women living in homes that were and were not treated for termites as follows:

P(High|House Treated) = [pic] = .5417

P(High|House Not Treated) = [pic] = .1579
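As a rough check on these two probabilities, the sketch below converts log odds back to probabilities. It assumes 0/1 coding for the house treated indicator and derives the coefficients from the Treated contingency table rather than from the JMP output, so the coefficient values are illustrative; `inverse_logit` is a hypothetical helper name.

```python
import math


def inverse_logit(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Treated counts: HT -> 13 High, 11 Low; NT -> 3 High, 16 Low.
beta0 = math.log(3 / 16)             # log odds of High for NT (x = 0)
beta1 = math.log(13 / 11) - beta0    # change in log odds for HT (x = 1)

p_treated = inverse_logit(beta0 + beta1 * 1)
p_not_treated = inverse_logit(beta0 + beta1 * 0)
print(f"P(High|House Treated)     = {p_treated:.4f}")
print(f"P(High|House Not Treated) = {p_not_treated:.4f}")
```

With a single two-level predictor the model reproduces the row percentages of the 2 X 2 table exactly.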

[pic]

We now consider the age effect. Again select Fit Y by X from the Analyze menu and place High Dieldrin in the Y box and age in the X box. The resulting output is given below.

Logistic Fit of High Dieldrin By age

[pic]

Whole Model Test

|Model |-LogLikelihood |DF |ChiSquare |Prob>ChiSq |
|Difference |0.924219 |1 |1.848438 |0.1740 |
|Full |27.458371 | | | |
|Reduced |28.382590 | | | |

|RSquare (U) |0.0326 |
|Observations (or Sum Wgts) |43 |

Converged by Gradient

Parameter Estimates

|Term |Estimate |Std Error |ChiSquare |Prob>ChiSq |
|Intercept |-4.0886156 |2.7245511 |2.25 |0.1334 |
|age |0.12223765 |0.0922011 |1.76 |0.1849 |

For log odds of High/Low
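The entries in the Whole Model Test and Parameter Estimates tables are related by simple arithmetic. The sketch below reproduces them from the reported values, using only the Python standard library. The variable names are illustrative, and the p-value shortcut applies only to a chi-square statistic with 1 degree of freedom.

```python
import math

neg_loglik_reduced = 28.382590   # intercept-only model
neg_loglik_full = 27.458371      # model with age

# Likelihood ratio chi-square: twice the drop in -LogLikelihood.
chi_square = 2 * (neg_loglik_reduced - neg_loglik_full)

# For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

# RSquare (U): share of the reduced -LogLikelihood explained by the model.
r_square_u = (neg_loglik_reduced - neg_loglik_full) / neg_loglik_reduced

# ChiSquare for each term: (estimate / standard error) squared.
wald_intercept = (-4.0886156 / 2.7245511) ** 2
wald_age = (0.12223765 / 0.0922011) ** 2

print(f"ChiSquare = {chi_square:.6f}, Prob>ChiSq = {p_value:.4f}")
print(f"RSquare (U) = {r_square_u:.4f}")
print(f"ChiSquare: intercept = {wald_intercept:.2f}, age = {wald_age:.2f}")
```

Both the whole model test and the test for age give p-values above .10, so age alone is not a statistically significant predictor in this sample.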

The logistic model using age as a predictor is given by

ln(p/(1 - p)) = β0 + β1*Age, estimated as -4.0886156 + .1222*Age

Note: The response in logistic regression is the natural log of the odds for “success”.

The blue curve added to the plot gives P(High|Age) = p. For example, for mothers 25 years of age the predicted probability of finding a high dieldrin level in their breast milk is about .26. For mothers 35 years of age this probability increases to around .55. The distance from the top of the plot to the curve represents P(Low|Age). To attach an odds ratio to mother’s age we need to pick an incremental increase of interest, e.g. suppose we wanted to find the odds ratio associated with a 5-year increase in age. The associated odds ratio is found as follows:

OR for 5-year increase in age = e^(5*.122) = 1.84

Thus for a 5-year increase in age a mother's odds of having high dieldrin are 1.84 times as large, or alternatively there is an 84% increase in her odds of having high dieldrin levels in her breast milk.
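The same calculation works for any increment of interest. The sketch below uses the unrounded age coefficient from the Parameter Estimates table; `odds_ratio_for_increase` is a hypothetical helper name introduced for illustration.

```python
import math

beta_age = 0.12223765   # age coefficient from the Parameter Estimates table


def odds_ratio_for_increase(beta, increment):
    """Odds ratio associated with an `increment`-unit increase in the predictor."""
    return math.exp(beta * increment)


or_1_year = odds_ratio_for_increase(beta_age, 1)
or_5_year = odds_ratio_for_increase(beta_age, 5)
percent_increase = (or_5_year - 1) * 100

print(f"1-year OR = {or_1_year:.3f}")
print(f"5-year OR = {or_5_year:.2f} ({percent_increase:.0f}% increase in odds)")
```

Note that the 5-year odds ratio is the 1-year odds ratio raised to the fifth power, not five times the 1-year odds ratio.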

Predicted Probabilities for Logistic Model Using Age

We can use the logistic regression model to obtain predicted probabilities of high dieldrin levels as a function of age by using

P(High|Age) = [pic]

For example,

P(High|Age=25) = [pic]

P(High|Age=35) = [pic]
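These predicted probabilities can be computed from the fitted coefficients. The sketch below uses the intercept and age estimates reported in the Parameter Estimates table; `prob_high` is a hypothetical function name.

```python
import math

b0, b1 = -4.0886156, 0.12223765   # intercept and age coefficient from the age-only fit


def prob_high(age):
    """Predicted P(High|Age) from the age-only logistic model."""
    log_odds = b0 + b1 * age
    return math.exp(log_odds) / (1 + math.exp(log_odds))


for age in (25, 35):
    print(f"P(High|Age={age}) = {prob_high(age):.4f}")
```

The results, roughly .26 at age 25 and .55 at age 35, correspond to the height of the fitted curve at those ages.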

Multiple Logistic Regression Model

Now we consider a multiple logistic regression model that includes all three predictors.

[pic]

where,

[pic]

Age = mother’s age in years

Select Fit Model from the Analyze menu and put the high dieldrin indicator in the Y box and Age, HT, and New Sub in the Effects in Model box as shown at the top of the following page.

[pic]

The resulting output is shown below.

[pic]

[pic]

Finding OR’s associated with the predictors

For a dichotomous (two-level) categorical predictor, e.g. new suburb and house treated, in order to find the associated OR we do the following:

[pic], i.e.[pic].

Examples:

For New Suburb we have: For House Treated we have:

To find a crude 95% CI for the OR associated with risk factor i we compute

[pic]

which gives lower and upper confidence limits for the true OR associated with that risk factor.

Examples:

For New Suburb we have: For House Treated we have:

[pic] [pic]

These intervals are very wide because the sample size (n = 45) is not very big. Typically these types of studies require a larger sample size to get precise CI’s for OR’s.
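To illustrate how a crude interval of this kind is computed, the sketch below assumes the usual Wald form, exp(estimate ± 1.96 × standard error), applied on the log odds scale. That form is an assumption about the formula shown above, and it presumes the coefficient is the log odds ratio for a one-unit change. Because the coefficients of the multiple model appear only in the output image, the example uses the age-only fit reported earlier; `wald_or_interval` is a hypothetical helper name.

```python
import math


def wald_or_interval(estimate, std_error, increment=1.0, z=1.96):
    """Odds ratio and crude 95% limits for an `increment`-unit increase."""
    center = math.exp(increment * estimate)
    lower = math.exp(increment * (estimate - z * std_error))
    upper = math.exp(increment * (estimate + z * std_error))
    return center, lower, upper


# Age-only fit: Estimate = 0.12223765, Std Error = 0.0922011.
or_5, lower_5, upper_5 = wald_or_interval(0.12223765, 0.0922011, increment=5)
print(f"5-year OR = {or_5:.2f}, crude 95% CI = ({lower_5:.2f}, {upper_5:.2f})")
```

The interval for age contains 1, which agrees with the non-significant p-value (.1849) for age in the age-only model.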

We can obtain both the OR’s and their confidence intervals using JMP as follows.

Select both the options

[pic]

The resulting output is shown on the following page.

Multiple Logistic Regression Model

[pic]

The OR’s associated with living in a home treated for termites and living in a new suburb are considerably larger than those found when examining their effects separately. The differences arise because the factors themselves are potentially related, and as a result their estimated effects differ when they are placed in a model jointly.

The odds ratio reported for age is found by using Max(Age) – Min(Age) as the incremental increase. For these data Max(Age) = 37 and Min(Age) = 21, thus a mother who is 37 has 28.055 times the odds of having high dieldrin levels in her breast milk when compared to a mother who is 21 years of age. It is better to use an increment like 5 years instead, i.e. the OR associated with a 5-year increase in age is calculated as follows: [pic].
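One way to rescale the range odds ratio to a 5-year increment is sketched below. It relies only on the values quoted above (28.055, 37, and 21) and on the fact that the range odds ratio equals exp(coefficient × range). The implied age coefficient is therefore a back-calculation, not a value read from the JMP output.

```python
import math

range_or = 28.055        # odds ratio reported for age over its full range
age_range = 37 - 21      # Max(Age) - Min(Age) = 16 years

# The range OR equals exp(beta_age * age_range), so beta_age can be recovered.
beta_age = math.log(range_or) / age_range

or_5_year = math.exp(5 * beta_age)   # same as range_or ** (5 / age_range)
print(f"implied age coefficient = {beta_age:.4f}, 5-year OR = {or_5_year:.2f}")
```

After adjusting for the other two predictors, a 5-year increase in age corresponds to an odds ratio of a little over 2.8, which is easier to interpret than the odds ratio for the full 16-year range.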

As stated previously, the confidence intervals for all of the OR’s are quite broad in this study because the sample size is small (n = 45).

Predicted Probabilities Using All Available Predictors

The predicted probabilities of high dieldrin can be found as follows.

P(High Dieldrin|House Treated, New Suburb, Age) =

[pic]

For example, the probability that a 30 year old mother living in a home treated for termites in an old suburb has high dieldrin levels is estimated to be:

P(High|Old Suburb, House Treated, Age = 30) =[pic]= .4690

For a 25 year old mother living in a home treated for termites located in a new suburb the probability of high dieldrin is estimated to be:

P(High|New Suburb, House Treated, Age = 25) =[pic]= .7259

Estimates of the P(High Dieldrin|New Suburb, House Treated, Age) Using Professional Version of JMP (FYI)

Selecting Save Probability Formula from the Nominal Logistic Fit pull down menu places the predicted probabilities of high and low dieldrin levels in the spreadsheet along with the predicted status. The predicted status is determined by whichever probability is larger, low dieldrin level or high dieldrin level, given their demographics.

Here is a portion of this output which will appear back in the original data spreadsheet.

P(Low|X) P(High|X)

[pic]

We can compare the predicted dieldrin status to the actual via a contingency table. Select Fit Y by X from the Analyze menu and place Most Likely High Dieldrin in the X box and High Dieldrin in the Y box. The table and mosaic plot are shown below.

Contingency Analysis of High Dieldrin By MostLikely High Dieldrin

[pic]

From the table we see that 26.7% of mothers classified as having high dieldrin levels actually had low dieldrin levels; similarly, 17.9% of those classified as having low dieldrin levels actually had high dieldrin levels. In total 9 out of 43 mothers were misclassified, for an estimated overall error rate of 20.9%.
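The error rates quoted above, along with the sensitivity and specificity of the classification rule, follow directly from the four cell counts of the predicted-by-actual table. The sketch below is illustrative; the variable names are not taken from JMP.

```python
# Rows are predicted status, columns are actual status (counts from the table).
pred_high_actual_high, pred_high_actual_low = 11, 4
pred_low_actual_high, pred_low_actual_low = 5, 23

total = (pred_high_actual_high + pred_high_actual_low
         + pred_low_actual_high + pred_low_actual_low)

# Overall misclassification rate: off-diagonal counts over the total.
error_rate = (pred_high_actual_low + pred_low_actual_high) / total

# Sensitivity: share of actual High mothers classified as High.
sensitivity = pred_high_actual_high / (pred_high_actual_high + pred_low_actual_high)

# Specificity: share of actual Low mothers classified as Low.
specificity = pred_low_actual_low / (pred_low_actual_low + pred_high_actual_low)

print(f"error rate  = {error_rate:.1%}")
print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
```

The percentages in the text (26.7% and 17.9%) condition on the predicted status, whereas sensitivity and specificity condition on the actual status, so the two sets of numbers answer different questions.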

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve plots the true positive probability (sensitivity) vs. the false positive probability (1 - specificity) as the classification cutoff is varied. As the sensitivity increases the false positive rate increases as well, as expected. A good classification rule based upon a logistic model should have an area beneath the ROC curve of .90 or higher. Here, with an area of about .83, we do not quite meet that standard.

Receiver Operating Characteristic

[pic]

Area Under ROC Curve = 0.83449
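To show how an ROC curve and its area are built from predicted probabilities, the sketch below uses a small set of hypothetical toy values. The individual predicted probabilities for the 43 mothers appear only in the output image, so these labels and scores are made up and will not reproduce the area of 0.83449. The function names are also introduced here for illustration only.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the cutoff step by step; a case is classified High when score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: 1 = High, 0 = Low, with made-up predicted P(High).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

toy_auc = area_under_curve(roc_points(toy_labels, toy_scores))
print(f"toy AUC = {toy_auc:.4f}")
```

The area can be read as the proportion of (High, Low) pairs in which the High case receives the larger predicted probability; in the toy data that is 8 of the 9 possible pairs.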

Example 2: Risk Factors for Low Birth Weight

These data come from a case-control study where risk factors for having an infant with low birth weight (< 2500g) were studied. The following information was recorded for each mother in the study: (Data File: LowBirth)

Low Birth Weight – indicator of birth weight status (Low or Normal)

Prev? – previous history of premature labor (History or None)

Hyper – hypertension during pregnancy (HT or Normal)

Smoke – mother smoked during pregnancy (Cig or No Cig)

Uterine – uterine irritability during pregnancy (Irritation or None)

Minority – minority status of mother (Nonwhite or White)

Age – age of mother

Lwt – mother’s weight at last menstrual cycle

Important JMP Note: For interpretation purposes it is best to code the outcome so that the adverse outcome is alphabetically first. The same is true for risk factors: code them so that the level associated with increased risk is alphabetically first.

To fit the multiple logistic regression model select Analyze > Fit Model and set up the dialog box as shown below.

[pic]

After using backward elimination to remove non-significant predictors (uterine irritability and mother’s age in this case), we have the following.

[pic]

The only predictor which represents something a mother could control or change is smoking during pregnancy. This is the primary factor of interest in this study and the other factors, while interesting, are there for control purposes only. In summarizing the effect of smoking we would see a phrase such as: “adjusting for age, pre-pregnancy weight, race, hypertension, uterine irritability, and previous history of premature labor, we find the OR associated with smoking is OR = 2.66.” This says that, after adjusting for these factors, the odds of having a low birth weight infant are 2.66 times as large for mothers who smoked during pregnancy.

-----------------------

The Whole Model Test is testing

[pic]

The p-value = .0013, so here we have evidence to suggest that the model is useful for explaining the presence of high dieldrin levels in a mother’s breast milk.

The Lack of Fit test is testing

[pic]

The p-value = .2220, so there is no evidence of lack of fit.

The Parameter Estimates and Effect Wald Tests both contain the results of tests that are used to test the significance of the predictors in the logistic model. Here we see that both the new suburb and house treated indicators are statistically significant at the .05 level, while mother’s age is significant at the .10 level.

Odds Ratios – calculates the odds ratios for all predictors in the model.

Confidence Intervals – provides CI’s for the Odds Ratios, calculated using a method slightly different from the approach above.

ROC Curve – draws an ROC curve, which is shown and discussed later in the handout. (Professional JMP only!)

|Count (Row %) |High |Low |Total |
|HT |13 (54.17) |11 (45.83) |24 |
|NT |3 (15.79) |16 (84.21) |19 |
|Total |16 |27 |43 |

|Count (Row %) |High |Low |Total |
|New |7 (58.33) |5 (41.67) |12 |
|Old |9 (29.03) |22 (70.97) |31 |
|Total |16 |27 |43 |

Actual Status

|Predicted Status, Count (Row %) |High |Low |Total |
|High |11 (73.33) |4 (26.67) |15 |
|Low |5 (17.86) |23 (82.14) |28 |
|Total |16 |27 |43 |

[pic]

How do these estimated probabilities compare to those we obtain by using a 2 X 2 contingency table?

[pic]
