Logistic Regression and Odds Ratio


Odds Ratio Review

                           Risk Factor (Benzene)
Outcome (Brain Tumor)   Yes (Exposed)   No (Unexposed)   Total
Yes (Case)                   50              100          150
No (Control)                 20              130          150
Total                        70              230          300

                           Outcome
Risk Factor             Yes (Disease)   No (No Disease)   Total
Yes (Exposed)                a                b           a + b
No (Unexposed)               c                d           c + d
Total                      a + c            b + d           n

The risk factor can be arranged as either the column or the row variable; the definition of the odds ratio is the same either way. In the brain tumor example, a = 50, b = 20, c = 100 and d = 130.

Let $p_1$ be the probability of success in row 1 (probability of Brain Tumor in row 1).

$1 - p_1$ is the probability of not success in row 1 (probability of no Brain Tumor in row 1).

Odds of getting disease for the people who were exposed to the risk factor ($\hat{p}_1 = a/(a+b)$ is an estimate of $p_1$):

$$O_+ = \frac{P[\text{disease} \mid \text{exposed}]}{1 - P[\text{disease} \mid \text{exposed}]} = \frac{P[\text{disease} \mid \text{exposed}]}{P[\text{no disease} \mid \text{exposed}]} = \frac{p_1}{1 - p_1}$$

$$\hat{O}_+ = \frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{a/(a+b)}{b/(a+b)} = \frac{a}{b} = \frac{50}{20} = 2.5$$

Let $p_0$ be the probability of success in row 2 (probability of Brain Tumor in row 2).

$1 - p_0$ is the probability of not success in row 2 (probability of no Brain Tumor in row 2).

Odds of getting disease for the people who were not exposed to the risk factor ($\hat{p}_0 = c/(c+d)$ is an estimate of $p_0$):

$$O_- = \frac{P[\text{disease} \mid \text{unexposed}]}{1 - P[\text{disease} \mid \text{unexposed}]} = \frac{P[\text{disease} \mid \text{unexposed}]}{P[\text{no disease} \mid \text{unexposed}]} = \frac{p_0}{1 - p_0}$$

$$\hat{O}_- = \frac{\hat{p}_0}{1 - \hat{p}_0} = \frac{c/(c+d)}{d/(c+d)} = \frac{c}{d} = \frac{100}{130} = .77$$

The Odds Ratio of having brain tumor for people who were exposed to the risk factor versus not exposed:

$$OR = \frac{O_+}{O_-} = \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}$$

$$\widehat{OR} = \frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_0/(1 - \hat{p}_0)} = \frac{[a/(a+b)]/[b/(a+b)]}{[c/(c+d)]/[d/(c+d)]} = \frac{a/b}{c/d} = \frac{ad}{bc} = \frac{50 \times 130}{20 \times 100} = 3.25$$

Interpretation: The odds of having a brain tumor for those who were exposed to benzene are 3.25 times the odds for those who were not exposed to benzene.

If OR > 1, then the odds of success are higher for column 1 (risk factor present) than column 2 (risk factor not present).
If OR < 1, then the odds of success are lower for column 1 (risk factor present) than column 2 (risk factor not present).
If OR = 1, then the odds of success are equal for column 1 (risk factor present) and column 2 (risk factor not present).
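The odds and odds ratio for the brain tumor table can be checked with a few lines of Python. This is a minimal sketch; the function name `odds_and_ratio` is only an illustrative choice, and the arguments follow the a, b, c, d layout with the risk factor as the row variable.

```python
def odds_and_ratio(a, b, c, d):
    """Return (odds for exposed, odds for unexposed, odds ratio) for a 2x2 table.

    Rows are exposed (a, b) and unexposed (c, d);
    columns are disease and no disease.
    """
    odds_exposed = a / b        # O+ = a / b
    odds_unexposed = c / d      # O- = c / d
    return odds_exposed, odds_unexposed, odds_exposed / odds_unexposed


# Benzene / brain tumor counts from the table above
o_plus, o_minus, or_hat = odds_and_ratio(50, 20, 100, 130)
print(round(o_plus, 2), round(o_minus, 2), round(or_hat, 2))
```

Swapping the two rows in the call gives the reciprocal odds ratio, which is the comparison of unexposed versus exposed.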

A. Chang


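Confidence interval for odds ratio:

For large samples, the log of the estimated odds ratio, $\ln(\widehat{OR})$, asymptotically follows a normal distribution. The $(1 - \alpha)100\%$ confidence interval estimate for the Log Odds Ratio is

$$\ln(\widehat{OR}) \pm z_{\alpha/2}\, s^*$$

The $(1 - \alpha)100\%$ confidence interval estimate for the Odds Ratio is

$$\left(e^{\ln(\widehat{OR}) - z_{\alpha/2} s^*},\; e^{\ln(\widehat{OR}) + z_{\alpha/2} s^*}\right)$$

where $\widehat{OR} = \dfrac{ad}{bc}$, the standard error of $\ln(\widehat{OR})$ is

$$s^* = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}},$$

and a, b, c and d should not be zero.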

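Example: (Brain tumor) The 95% confidence interval estimate for the odds ratio is:

$$z_{\alpha/2} = z_{.025} = 1.96, \quad \widehat{OR} = \frac{ad}{bc} = \frac{50 \times 130}{20 \times 100} = 3.25, \quad s^* = \sqrt{\frac{1}{50} + \frac{1}{100} + \frac{1}{20} + \frac{1}{130}} = 0.296$$

$$\ln(\widehat{OR}) = \ln(3.25) = 1.179, \quad z_{\alpha/2}\, s^* = 1.96 \times 0.296 = 0.58$$

$$\left(e^{\ln(\widehat{OR}) - z_{\alpha/2} s^*},\; e^{\ln(\widehat{OR}) + z_{\alpha/2} s^*}\right) = \left(e^{1.179 - 0.58},\; e^{1.179 + 0.58}\right) \approx (1.819,\; 5.807)$$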

This confidence interval does not cover 1. In fact, the whole interval lies above 1. This suggests that people who were exposed to benzene have higher odds of having a brain tumor than those who were not exposed to benzene.
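The interval above can be reproduced numerically. The following is a minimal sketch using only the Python standard library; the function name `odds_ratio_ci` and the default `z=1.96` (for a 95% interval) are illustrative choices rather than part of the handout.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Large-sample confidence interval for the odds ratio ad/(bc).

    All four cell counts must be nonzero.
    """
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # s*
    log_or = math.log(or_hat)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return or_hat, se_log_or, lower, upper


or_hat, se_log_or, lower, upper = odds_ratio_ci(50, 20, 100, 130)
print(round(or_hat, 2), round(se_log_or, 3), round(lower, 3), round(upper, 3))
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the estimated odds ratio.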

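Continuity Correction:

The $(1 - \alpha)100\%$ confidence interval estimate for the Odds Ratio is

$$\left(e^{\ln(\widehat{OR}) - z_{\alpha/2} s^*},\; e^{\ln(\widehat{OR}) + z_{\alpha/2} s^*}\right)$$

where

$$\widehat{OR} = \frac{(a + 0.5)(d + 0.5)}{(b + 0.5)(c + 0.5)},$$

and the standard error of $\ln(\widehat{OR})$ is

$$s^* = \sqrt{\frac{1}{a + 0.5} + \frac{1}{b + 0.5} + \frac{1}{c + 0.5} + \frac{1}{d + 0.5}}.$$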
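As a rough illustration of why the correction is useful, the sketch below adds 0.5 to every cell before computing the estimate and its standard error, so the calculation still works when a cell count is zero. The function name and the second table (with a zero cell) are hypothetical and are not taken from the handout.

```python
import math


def corrected_odds_ratio_ci(a, b, c, d, z=1.96, correction=0.5):
    """Odds ratio and confidence interval after adding `correction` to each cell."""
    a, b, c, d = (count + correction for count in (a, b, c, d))
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(or_hat)
    return or_hat, math.exp(log_or - z * se_log_or), math.exp(log_or + z * se_log_or)


# Brain tumor table from the text
or_cc, lower_cc, upper_cc = corrected_odds_ratio_ci(50, 20, 100, 130)

# Hypothetical table with b = 0: the uncorrected estimate ad/(bc) is undefined here
or_zero, lower_zero, upper_zero = corrected_odds_ratio_ci(10, 0, 5, 8)
print(round(or_cc, 3), round(or_zero, 3))
```

For the brain tumor table the corrected estimate stays close to the uncorrected 3.25, since 0.5 is small relative to the cell counts.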


Example: Risk Factors for Dengue Epidemics

Data: Dantes, Koopman and Addy et al., "Dengue Epidemics on the Pacific Coast of Mexico," International Journal of Epidemiology 17 (1988), p. 178-186.

Variable Name            Description
Identification           1-196
Age                      Age of person (in years)
Socioeconomic Status     1 = upper, 2 = middle, 3 = lower
Sector                   Sector in city: 1 = sector 1, 2 = sector 2
Disease Status           1 = with disease, 0 = no disease
Saving Account Status    1 = has acct., 0 = has no acct.

With 100% of the data:

                    Disease Status
Sector              Yes, Y=1    No, Y=0    Total
X=1, sector 1          22          95       117
X=0, sector 2          35          44        79
Total                  57         139       196

Sector 1: $\hat{p}_1 = 22/117$, $1 - \hat{p}_1 = 1 - 22/117 = 95/117$
Sector 2: $\hat{p}_0 = 35/79$, $1 - \hat{p}_0 = 1 - 35/79 = 44/79$

Odds ratio of having disease for sector 1 v.s. sector 2: (22/95)/(35/44) = .291
(The odds of having disease when living in sector 1 are around 30% of the odds when living in sector 2.)
(It is less likely to have disease living in sector 1 than living in sector 2.)

Odds ratio of having disease for sector 2 v.s. sector 1: (35/44)/(22/95) = 3.43
(The odds of having disease when living in sector 2 are more than 3 times the odds when living in sector 1.)

Estimate and compare proportions of persons who contracted the disease by sector, using the first 50% of the data:

                    Disease Status
Sector              Yes, Y=1    No, Y=0    Total
X=1, sector 1          10          49        59
X=0, sector 2          21          18        39
Total                  31          67        98

Sector 1: $\hat{p}_1 = 10/59$, $1 - \hat{p}_1 = 49/59$
Sector 2: $\hat{p}_0 = 21/39$, $1 - \hat{p}_0 = 18/39$

Odds ratio (sector 1 v.s. sector 2) = (10/49)/(21/18) = .175
(The odds of having disease when living in sector 1 are around 18% of the odds when living in sector 2.)
(It is less likely to have disease living in sector 1 than living in sector 2.)


Use of SPSS for Odds Ratio and Confidence Intervals

Layout of data sheet in SPSS data editor for the 50% data example above, if data is pre-organized.

Figure (a). SPSS Data Editor

Figure (b). Weight Cases

Step 1: (Go to Step 2 if the data are raw data rather than organized frequencies as in Figure (a).) First, enter the data in the SPSS Data Editor as in (a), and then weight the cases entered in the Data Editor by clicking Data and selecting Weight Cases as in (b). In the Weight Cases dialog box, select the freq variable for weighting the cases as in Figure (c). Weighting cases lets users build a contingency table from the joint frequencies entered in (a), each of which is associated with a joint class. For example, 10 is the frequency for "Yes" and "Sector 1". The disease variable has internal values 1 and 2 (1 is labeled Yes and 2 is labeled No). The sector variable has internal values 1 and 2 (1 is labeled Sector 1 and 2 is labeled Sector 2).

Figure (c). Weight Cases

SECTOR * DISEASE Crosstabulation

Count
                     DISEASE
SECTOR               Yes (Y=1)    No (Y=2)    Total
Sector 1 (X=1)          22           95        117
Sector 2 (X=2)          35           44         79
Total                   57          139        196

Figure (d). Contingency Table

Step 2: Follow the procedure Analyze/Descriptive Statistics/Crosstabs to make the contingency table. In SPSS, the row variable is the risk factor and the column variable is the outcome variable. In the Crosstabs dialog box, click Statistics and check the Risk box in the Crosstabs: Statistics dialog window to obtain risk measures such as the odds ratio and relative risk, which are reported in the following Risk Estimate table. The chi-square test is one of the options too.

Risk Estimate

                                                                     95% Confidence Interval
                                                            Value      Lower      Upper
Odds Ratio for SECTOR (Sector 1 (X=1) / Sector 2 (X=2))      .291       .153       .553
For cohort DISEASE = Yes (Y=1)                               .424       .270       .666
For cohort DISEASE = No (Y=2)                               1.458      1.176      1.808
N of Valid Cases                                              196

The Value in the first row (.291) is the odds ratio of having disease for sector 1 v.s. sector 2, and its Lower and Upper entries (.153, .553) give the confidence interval for the odds ratio.
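The two "For cohort" rows of the Risk Estimate table are relative risks (ratios of proportions between the two sectors). The sketch below recomputes them from the crosstabulation counts. The standard error used for the log relative risk is the usual large-sample formula, which is not stated in this handout, so treating it as what SPSS computes is an assumption; the function name is likewise illustrative.

```python
import math


def relative_risk_ci(x1, n1, x2, n2, z=1.96):
    """Relative risk (x1/n1)/(x2/n2) with a large-sample log-scale interval.

    Assumed standard error of ln(RR): sqrt(1/x1 - 1/n1 + 1/x2 - 1/n2).
    """
    rr = (x1 / n1) / (x2 / n2)
    se_log_rr = math.sqrt(1 / x1 - 1 / n1 + 1 / x2 - 1 / n2)
    log_rr = math.log(rr)
    return rr, math.exp(log_rr - z * se_log_rr), math.exp(log_rr + z * se_log_rr)


# Cohort DISEASE = Yes: 22 of 117 in sector 1 versus 35 of 79 in sector 2
rr_yes, lo_yes, hi_yes = relative_risk_ci(22, 117, 35, 79)
# Cohort DISEASE = No: 95 of 117 in sector 1 versus 44 of 79 in sector 2
rr_no, lo_no, hi_no = relative_risk_ci(95, 117, 44, 79)

print(round(rr_yes, 3), round(lo_yes, 3), round(hi_yes, 3))
print(round(rr_no, 3), round(lo_no, 3), round(hi_no, 3))
```

Under that assumption the results agree with the table to about three decimals, which suggests the SPSS rows are computed along these lines.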


Logistic Regression

Logistic regression is a regression method that can model a binary response variable using both quantitative and categorical explanatory variables. This method can also be used to predict the probability of a binary outcome.

Notation:

Y is the response variable; it takes the value 1 if the disease is present and 0 if the disease is absent.
p denotes the probability of success, i.e., the probability that Y = 1 (disease present), or P(Y = 1 | X).

References

- Hosmer, D.W. and Lemeshow, S., Applied Logistic Regression, John Wiley & Sons, Inc., 1989.
- Neter, Kutner, Nachtsheim and Wasserman, Applied Linear Regression Models, 3rd ed., Irwin Pub., 1996.

Regression Models with Binary Outcome Variable

Since the outcome is either 1 or 0, it can be modeled by a Bernoulli distribution.

$$Y = \alpha + \beta X + \varepsilon, \quad Y \sim \text{Bernoulli}$$

Y    Probability
1    P(Y = 1) = p  or  P(Y = 1 | X) = p
0    P(Y = 0) = 1 - p  or  P(Y = 0 | X) = 1 - p

The mean of the Bernoulli distribution is the probability of success:

$$E[Y] = \mu_{y|x} = p = \alpha + \beta X$$

(The expected value of the model is the probability of getting 1.)

[Figure: plot of the binary Response Variable against the Explanatory Variable]

Can we fit a line for p and x?

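Special Problems:

1. Nonnormal error terms: $\varepsilon = Y - (\alpha + \beta X)$
2. Nonconstant error variance: $\sigma^2\{\varepsilon\} = (\alpha + \beta X)[1 - (\alpha + \beta X)]$
3. Constraints on the response function: $0 \le \mu_{y|x} = p \le 1$ (If a line is fitted, p may exceed 1 or be negative, which is not valid for a probability.)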
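The third problem can be seen with a small numerical sketch. The data below are hypothetical (six made-up observations, not the dengue data), and the helper `least_squares_line` is an illustrative name; it fits an ordinary least-squares line to a 0/1 response and shows fitted values falling outside the interval from 0 to 1.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical binary responses observed at six values of x
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 0, 1, 1, 1]

intercept, slope = least_squares_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
print([round(value, 3) for value in fitted])
```

With these numbers the fitted "probability" is negative at the smallest x and greater than 1 at the largest x, which is the motivation for a response function that stays between 0 and 1.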


[Figure: two plots of Response Variable against Explanatory Variable, titled "Fitting a Linear Function" and "Fitting a Logistic Function (Sigmoidal Curve)"]

Simple Logistic Response Function (Deterministic part)

Logistic function

The Sigmoidal model (Logistic Function): The probability of success (probability of Y = 1 given X):

$$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \frac{1}{1 + e^{-\alpha - \beta x}}$$

Properties
- Sigmoidal (S-shaped)
- Monotonic (increasing or decreasing)
- Linearizable (logit transformation)

Other forms: (There are different forms of the logistic function.)

Logit Transformation (Logit link)

The natural logarithm of the odds in favor of success at x: (The log of odds is linearly related to x.)

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

Odds

The odds in favor of success (the odds in favor of Y = 1) at x:

$$\frac{p}{1 - p} = e^{\alpha + \beta x}$$
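The relationship between the logistic function, the logit and the odds can be checked numerically. In this sketch the coefficient values `alpha = -2.0` and `beta = 1.5` are arbitrary illustrative numbers, not estimates from any data in this handout, and the function names are assumptions.

```python
import math


def logistic(alpha, beta, x):
    """Probability of success p = 1 / (1 + exp(-(alpha + beta * x)))."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def logit(p):
    """Log odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


alpha, beta = -2.0, 1.5   # hypothetical coefficients
for x in (0.0, 1.0, 2.0):
    p = logistic(alpha, beta, x)
    odds = p / (1.0 - p)
    # logit(p) recovers the linear predictor, and the odds equal exp(alpha + beta * x)
    print(x, round(p, 4), round(logit(p), 4), round(odds, 4))
```

The printed logit values increase by beta for each unit increase in x, while p itself stays between 0 and 1.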

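Useful Statistics:

Use the maximum likelihood estimates $\hat{\alpha}$, $\hat{\beta}$.

1. Predicted Probability

$$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \;\Rightarrow\; \text{predicted probability of binary outcome: } \hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}}$$

Predicted probability of success for x = 1:

$$\hat{p}_1 = \frac{e^{\hat{\alpha} + \hat{\beta} \cdot 1}}{1 + e^{\hat{\alpha} + \hat{\beta} \cdot 1}}$$

Predicted probability of success for x = 0:

$$\hat{p}_0 = \frac{e^{\hat{\alpha} + \hat{\beta} \cdot 0}}{1 + e^{\hat{\alpha} + \hat{\beta} \cdot 0}}$$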


Example: Studying disease and sectors

Categorical Variables Codings

                                  Frequency    Parameter coding (1)
Sector within city   Sector 1        117             1.000
                     Sector 2         79              .000

Model:

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1            221.596                 .072                    .103

Variables in the Equation

                                                                    95.0% C.I. for EXP(B)
                          B       S.E.     Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1(a)   SECTOR(1)   -1.234    .328    14.193     1    .000     .291     .153     .553
            Constant     -.229    .226     1.021     1    .312     .795

a. Variable(s) entered on step 1: SECTOR.

Maximum likelihood estimates: $\hat{\alpha} = -.229$, $\hat{\beta} = -1.234$ (Obtained from SPSS)
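With a single binary predictor, the fitted logistic model reproduces the two observed proportions exactly, so the maximum likelihood estimates can be written directly in terms of the 2x2 counts. The sketch below is a check of the SPSS output under that fact; it is not how SPSS obtains the estimates (SPSS fits the model iteratively), and the variable names are illustrative.

```python
import math

# Dengue counts from the full-data table: (with disease, no disease)
yes1, no1 = 22, 95    # sector 1, x = 1
yes0, no0 = 35, 44    # sector 2, x = 0

alpha_hat = math.log(yes0 / no0)                      # log odds at x = 0
beta_hat = math.log((yes1 / no1) / (yes0 / no0))      # log odds ratio
se_alpha = math.sqrt(1 / yes0 + 1 / no0)
se_beta = math.sqrt(1 / yes1 + 1 / no1 + 1 / yes0 + 1 / no0)

# -2 log likelihood evaluated at the fitted (= observed) proportions
p1 = yes1 / (yes1 + no1)
p0 = yes0 / (yes0 + no0)
neg2_log_lik = -2 * (yes1 * math.log(p1) + no1 * math.log(1 - p1)
                     + yes0 * math.log(p0) + no0 * math.log(1 - p0))

print(round(alpha_hat, 3), round(beta_hat, 3))
print(round(se_alpha, 3), round(se_beta, 3), round(neg2_log_lik, 3))
```

These values line up with the B, S.E. and -2 Log likelihood entries reported in the SPSS tables above, up to rounding.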

The predicted probability of having disease can be calculated with the prediction equation:

$$\hat{p} = \frac{e^{-.229 - 1.234x}}{1 + e^{-.229 - 1.234x}}$$

Probability of getting disease for people from sector 1 (x = 1): $\hat{p}_1 = e^{-0.229 - 1.234}/(1 + e^{-0.229 - 1.234}) \approx .188$, which agrees with the observed proportion 22/117 = .18803.
Probability of getting disease for people from sector 2 (x = 0): $\hat{p}_0 = e^{-0.229}/(1 + e^{-0.229}) \approx .443$, which agrees with the observed proportion 35/79 = .44304.
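As a quick check, the predicted probabilities can be evaluated from the reported estimates (constant -.229 and sector coefficient -1.234) and compared with the observed proportions 22/117 and 35/79 from the full-data table. This is a minimal sketch; the function name `predicted_probability` is an illustrative choice.

```python
import math

ALPHA_HAT = -0.229   # Constant from the SPSS output
BETA_HAT = -1.234    # SECTOR(1) coefficient from the SPSS output


def predicted_probability(x):
    """Predicted probability of disease for sector indicator x (1 = sector 1, 0 = sector 2)."""
    linear_predictor = ALPHA_HAT + BETA_HAT * x
    return math.exp(linear_predictor) / (1.0 + math.exp(linear_predictor))


p_sector1 = predicted_probability(1)
p_sector2 = predicted_probability(0)
print(round(p_sector1, 3), round(22 / 117, 3))
print(round(p_sector2, 3), round(35 / 79, 3))
```

The small differences from the observed proportions come only from rounding the coefficients to three decimals.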

2. Estimated Odds

The odds in favor of success (the odds in favor of Y = 1) at x is:

$$\frac{\hat{p}}{1 - \hat{p}} = e^{\hat{\alpha} + \hat{\beta} x}$$

$$\ln(\text{estimated odds}) = -0.229 - 1.234x, \quad \text{estimated odds} = \frac{\hat{p}}{1 - \hat{p}} = e^{-0.229 - 1.234x}$$

The odds of people in sector 1 having the disease = $e^{-0.229 - 1.234 \times 1} = .2315$
The odds of people in sector 2 having the disease = $e^{-0.229} = .7953$

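3. Estimated Parameter and Odds Ratio (for binary predictor variable)

At x = 1: $\ln\dfrac{p_1}{1 - p_1} = \alpha + \beta$. At x = 0: $\ln\dfrac{p_0}{1 - p_0} = \alpha$.

=> $\beta$ is the change in the ln of odds for a one-unit increase in X, that is

$$\beta = \ln\frac{p_1}{1 - p_1} - \ln\frac{p_0}{1 - p_0} = \ln\left(\frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}\right)$$

If X is a binary variable, then $e^{\beta}$ is the odds ratio and $e^{\hat{\beta}}$ is the estimated odds ratio for x = 1 v.s. x = 0. This is Exp(B) in SPSS.

Odds ratio of getting disease for people from sector 1 v.s. sector 2: Odds ratio = $e^{\hat{\beta}} = e^{-1.234} = .291$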

The p-value of the test for $\beta = 0$ (OR = 1) is reported as .000, which is less than 0.05, so the sector variable is statistically significant. This suggests that there is an association between disease and sector. The 95% confidence interval of the odds ratio is (.153, .553). The odds ratio of having disease for sector 2 v.s. sector 1 is the reciprocal, $e^{1.234} \approx 3.43$, with 95% C.I. from 1.81 to 6.54.
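The Exp(B) value and its confidence interval can be recovered from the reported coefficient and standard error by exponentiating B plus or minus 1.96 standard errors. The sketch below uses the rounded values B = -1.234 and S.E. = .328 from the SPSS output, so the last digit may differ slightly from what SPSS prints; the variable names are illustrative.

```python
import math

b_sector, se_sector, z = -1.234, 0.328, 1.96

# Sector 1 versus sector 2
or_hat = math.exp(b_sector)
ci_lower = math.exp(b_sector - z * se_sector)
ci_upper = math.exp(b_sector + z * se_sector)

# Sector 2 versus sector 1: take reciprocals and swap the interval endpoints
or_reversed = 1.0 / or_hat
ci_reversed = (1.0 / ci_upper, 1.0 / ci_lower)

print(round(or_hat, 3), round(ci_lower, 3), round(ci_upper, 3))
print(round(or_reversed, 2), tuple(round(v, 2) for v in ci_reversed))
```

Reversing the comparison changes the sign of B on the log scale, which is why the interval endpoints trade places after taking reciprocals.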


Studying disease and sectors, social econ., age, savings

Categorical Variables Codings

                                                 Parameter coding
                                   Frequency      (1)       (2)
Socioeconomic Status   Upper           77        1.000      .000
                       Middle          49         .000     1.000
                       Lower           70         .000      .000
Sector within city     Sector 1       117        1.000
                       Sector 2        79         .000

Odds ratio of contracting disease for Upper v.s. Lower Socio. Status is .757

Odds ratio of contracting disease for Middle v.s. Lower Socio. Status is .803

Variables in the Equation

                                                                      95.0% C.I. for EXP(B)
                            B       S.E.     Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1(a)   SECTOR(1)     -1.234    .357    11.970     1    .001     .291     .145     .586
            SCIOSTAT                          .439     2    .803
            SCIOSTAT(1)    -.278    .434      .409     1    .522     .757     .323    1.775
            SCIOSTAT(2)    -.219    .459      .227     1    .634     .803     .327    1.976
            SAVINGS         .061    .386      .025     1    .874    1.063     .499    2.264
            AGE             .027    .009     8.646     1    .003    1.027    1.009    1.045
            Constant       -.814    .452     3.246     1    .072     .443

a. Variable(s) entered on step 1: SECTOR, SCIOSTAT, SAVINGS, AGE.

(* Stepwise regression can be used for variable reduction.)

Variables in the Equation

                                                                    95.0% C.I. for EXP(B)
                          B       S.E.     Wald     df    Sig.    Exp(B)    Lower    Upper
Step 1(a)   AGE           .027    .009     9.606     1    .002    1.027    1.010    1.045
            SECTOR(1)   -1.181    .337    12.295     1    .000     .307     .159     .594
            Constant     -.978    .336     8.458     1    .004     .376

a. Variable(s) entered on step 1: AGE, SECTOR.

Odds ratio of getting disease for sector 1 versus sector 2 after adjusting for age: 0.307
Odds ratio of getting disease for sector 2 versus sector 1 after adjusting for age: 3.26 (= 1/0.307)
The p-value for testing the significance of the sector variable, controlling for the age variable, is reported as .000. Since the p-value is less than .05, there is a statistically significant association between sector and disease.

Age is also a significant variable when controlling for the sector variable, with a p-value of .002. "Exp(B)=1.027" means the odds of getting disease are multiplied by 1.027 for each additional year of age, so the odds increase as age increases.

For a 20 year old, controlling for the sector variable, the odds of getting disease are $e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 20}$.
For a 25 year old, controlling for the sector variable, the odds of getting disease are $e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 25}$.
The odds ratio is

$$\frac{e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 25}}{e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 20}} = \frac{e^{0.027 \times 25}}{e^{0.027 \times 20}} = \frac{1.964}{1.716} = 1.145$$

This means that the odds of getting disease for a 25 year old are 1.145 times the odds for a 20 year old in the same sector.
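The cancellation of the constant and sector terms can be verified directly. The sketch below uses the rounded coefficients from the age-and-sector model (-.978, -1.181, .027); the function names are illustrative, and the sector indicator is taken to be 1 for sector 1 and 0 for sector 2, as in the SPSS parameter coding.

```python
import math

B_CONSTANT, B_SECTOR, B_AGE = -0.978, -1.181, 0.027


def estimated_odds(sector, age):
    """Estimated odds of disease for a sector indicator (1 or 0) and an age in years."""
    return math.exp(B_CONSTANT + B_SECTOR * sector + B_AGE * age)


def age_odds_ratio(years):
    """Odds ratio for an age difference of `years`, holding sector fixed."""
    return math.exp(B_AGE * years)


# The ratio of odds at ages 25 and 20 is the same in either sector
ratios = [estimated_odds(sector, 25) / estimated_odds(sector, 20) for sector in (0, 1)]
print([round(r, 3) for r in ratios], round(age_odds_ratio(5), 3))
```

The same helper gives the odds ratio for any other age gap, for example `age_odds_ratio(10)` for a ten-year difference.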

The Hosmer-Lemeshow test can be used for testing goodness of fit. (p-value = .210 > .05, so there is no evidence of lack of fit.)

Hosmer and Lemeshow Test

Step    Chi-square    df    Sig.
1         10.853       8    .210
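The reported significance can be checked from the chi-square statistic and its degrees of freedom. The sketch below uses the closed-form upper-tail probability of the chi-square distribution that holds when the degrees of freedom are even, which avoids any external library; the function name is an illustrative choice, and it only reproduces the p-value, not the Hosmer-Lemeshow statistic itself.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with `df` degrees of freedom > x), valid for even positive df.

    For df = 2m the tail equals exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only applies to even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k          # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


p_value = chi2_upper_tail_even_df(10.853, 8)
print(round(p_value, 3))
```

For odd degrees of freedom a general-purpose routine would be needed instead of this shortcut.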

(*One can also test for interactions between all the variables.)
