Logistic Regression and Odds Ratio
Odds Ratio Review
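Example (benzene exposure and brain tumor), with the risk factor as the column variable:

                          Risk Factor (Benzene)
Outcome (Brain Tumor)     Yes (Exposed)   No (Unexposed)   Total
Yes (Case)                      50             100           150
No (Control)                    20             130           150
Total                           70             230           300

General layout, with the risk factor as the row variable:

                          Outcome
Risk Factor               Yes (Disease)   No (No Disease)   Total
Yes (Exposed)                   a               b           a + b
No (Unexposed)                  c               d           c + d
Total                         a + c           b + d           n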
Risk factor can always be arranged as either column or row variable, but the definition of odds ratio is still the same.
Let $p_1$ be the probability of success in row 1 (the probability of a brain tumor among the exposed).
Then $1 - p_1$ is the probability of failure in row 1 (the probability of no brain tumor among the exposed).
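Odds of getting the disease for people who were exposed to the risk factor ($\hat{p}_1$ is an estimate of $p_1$):

$$O_+ = \frac{P[\text{disease} \mid \text{exposed}]}{1 - P[\text{disease} \mid \text{exposed}]} = \frac{P[\text{disease} \mid \text{exposed}]}{P[\text{no disease} \mid \text{exposed}]} = \frac{p_1}{1 - p_1} \approx \frac{\hat{p}_1}{1 - \hat{p}_1} = \frac{a/(a+b)}{b/(a+b)} = \frac{a}{b} = \frac{50}{20} = 2.5$$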
Let $p_0$ be the probability of success in row 2 (the probability of a brain tumor among the unexposed).
Then $1 - p_0$ is the probability of failure in row 2 (the probability of no brain tumor among the unexposed).
Odds of getting the disease for people who were not exposed to the risk factor ($\hat{p}_0$ is an estimate of $p_0$):

$$O_- = \frac{P[\text{disease} \mid \text{unexposed}]}{1 - P[\text{disease} \mid \text{unexposed}]} = \frac{P[\text{disease} \mid \text{unexposed}]}{P[\text{no disease} \mid \text{unexposed}]} = \frac{p_0}{1 - p_0} \approx \frac{\hat{p}_0}{1 - \hat{p}_0} = \frac{c/(c+d)}{d/(c+d)} = \frac{c}{d} = \frac{100}{130} = .77$$
The odds ratio of having a brain tumor for people who were exposed to the risk factor versus those who were not exposed:

$$OR = \frac{O_+}{O_-} = \frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}$$

$$\widehat{OR} = \frac{\hat{p}_1/(1 - \hat{p}_1)}{\hat{p}_0/(1 - \hat{p}_0)} = \frac{[a/(a+b)]/[b/(a+b)]}{[c/(c+d)]/[d/(c+d)]} = \frac{a/b}{c/d} = \frac{ad}{bc} = \frac{50 \times 130}{20 \times 100} = 3.25$$
Interpretation: The odds of having a brain tumor for those who were exposed to benzene are 3.25 times the odds for those who were not exposed to benzene.
If OR > 1, the odds of success are higher for the group with the risk factor present than for the group with the risk factor absent. If OR < 1, the odds of success are lower for the group with the risk factor present. If OR = 1, the odds of success are equal in the two groups.
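The arithmetic of the benzene example can be checked with a few lines of Python. This is only a sketch: the function name `odds_ratio` and its argument order (a, b for the exposed row, c, d for the unexposed row) are choices made here for illustration, not part of the original notes.

```python
def odds_ratio(a, b, c, d):
    """Return (odds for exposed, odds for unexposed, odds ratio).

    a, b: exposed subjects with / without the disease.
    c, d: unexposed subjects with / without the disease.
    """
    odds_exposed = a / b        # O+ = a / b
    odds_unexposed = c / d      # O- = c / d
    return odds_exposed, odds_unexposed, odds_exposed / odds_unexposed


# Counts from the benzene / brain tumor table.
o_plus, o_minus, or_hat = odds_ratio(50, 20, 100, 130)
print(round(o_plus, 2), round(o_minus, 2), round(or_hat, 2))
```

The ratio of the two odds is algebraically the same as the cross-product ad/(bc), which is why the function does not need the row totals.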
A. Chang
Confidence interval for odds ratio:
For large samples, the log of the estimated odds ratio, $\ln(\widehat{OR})$, is asymptotically normally distributed.
The $(1 - \alpha)100\%$ confidence interval estimate for the log odds ratio is

$$\ln(\widehat{OR}) \pm z_{\alpha/2}\, s^*$$

The $(1 - \alpha)100\%$ confidence interval estimate for the odds ratio is

$$\left( e^{\ln(\widehat{OR}) - z_{\alpha/2} s^*},\; e^{\ln(\widehat{OR}) + z_{\alpha/2} s^*} \right)$$

where $\widehat{OR} = \dfrac{ad}{bc}$, the standard error of $\ln(\widehat{OR})$ is

$$s^* = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}},$$

and a, b, c and d should not be zero.
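Example (brain tumor): The 95% confidence interval estimate for the odds ratio is computed as follows.

$$z_{\alpha/2} = z_{.025} = 1.96, \qquad \widehat{OR} = \frac{ad}{bc} = \frac{50 \times 130}{20 \times 100} = 3.25$$

$$s^* = \sqrt{\frac{1}{50} + \frac{1}{100} + \frac{1}{20} + \frac{1}{130}} = 0.296$$

$$\ln(\widehat{OR}) = \ln(3.25) = 1.179, \qquad z_{\alpha/2}\, s^* = 1.96 \times 0.296 = 0.58$$

$$\left( e^{\ln(\widehat{OR}) - z_{\alpha/2} s^*},\; e^{\ln(\widehat{OR}) + z_{\alpha/2} s^*} \right) = \left( e^{1.179 - 0.58},\; e^{1.179 + 0.58} \right) = (1.819,\; 5.807)$$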
This confidence interval does not cover 1; the entire interval lies above 1. This means that people exposed to benzene have higher odds of having a brain tumor than those who were not exposed to benzene.
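The interval in the brain tumor example can be reproduced with the following sketch. The helper name `odds_ratio_ci` is hypothetical, and z = 1.96 is hard-coded as the default for a 95% interval, as in the example.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Large-sample confidence interval for the odds ratio of a 2x2 table.

    Requires all four cell counts to be nonzero.
    """
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # s*, std. error of ln(OR)
    log_or = math.log(or_hat)
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return or_hat, se, lower, upper


or_hat, se, lower, upper = odds_ratio_ci(50, 20, 100, 130)
print(f"OR = {or_hat:.2f}, s* = {se:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around 3.25.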
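Continuity Correction:
The $(1 - \alpha)100\%$ confidence interval estimate for the odds ratio is

$$\left( e^{\ln(\widehat{OR}) - z_{\alpha/2} s^*},\; e^{\ln(\widehat{OR}) + z_{\alpha/2} s^*} \right)$$

where

$$\widehat{OR} = \frac{(a + 0.5)(d + 0.5)}{(b + 0.5)(c + 0.5)},$$

and the standard error of $\ln(\widehat{OR})$ is

$$s^* = \sqrt{\frac{1}{a + 0.5} + \frac{1}{b + 0.5} + \frac{1}{c + 0.5} + \frac{1}{d + 0.5}}.$$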
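A minimal sketch of the continuity-corrected interval follows. The function name is hypothetical, and the second table (with a zero cell) is made-up data used only to show that the corrected estimate stays defined when the uncorrected one does not.

```python
import math


def corrected_odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and confidence interval with 0.5 added to every cell."""
    a, b, c, d = (cell + 0.5 for cell in (a, b, c, d))
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(or_hat)
    return or_hat, (math.exp(log_or - z * se), math.exp(log_or + z * se))


# Benzene table from the earlier example: the correction changes little.
benzene_or, benzene_ci = corrected_odds_ratio_ci(50, 20, 100, 130)

# Hypothetical table with b = 0: ad/(bc) would divide by zero without the correction.
zero_cell_or, zero_cell_ci = corrected_odds_ratio_ci(12, 0, 5, 9)

print(round(benzene_or, 3), tuple(round(v, 3) for v in benzene_ci))
print(round(zero_cell_or, 3), tuple(round(v, 3) for v in zero_cell_ci))
```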
Example: Risk Factors for Dengue Epidemics
Data: Dantes, Koopman, Addy, et al., "Dengue Epidemics on the Pacific Coast of Mexico," International Journal of Epidemiology 17 (1988), pp. 178-186.

Variable Name             Description
Identification            1-196
Age                       Age of person (in years)
Socioeconomic Status      1 = upper, 2 = middle, 3 = lower
Sector                    Sector in city: 1 = sector 1, 2 = sector 2
Disease Status            1 = with disease, 0 = no disease
Savings Account Status    1 = has acct., 0 = has no acct.
With 100% of the data:

                       Disease Status
Sector                 Yes, Y=1   No, Y=0   Total
X=1, sector 1             22         95      117
X=0, sector 2             35         44       79
Total                     57        139      196
Sector 1: $\hat{p}_1 = 22/117$, $1 - \hat{p}_1 = 1 - 22/117 = 95/117$
Sector 2: $\hat{p}_0 = 35/79$, $1 - \hat{p}_0 = 1 - 35/79 = 44/79$
Odds ratio of having the disease for sector 1 vs. sector 2: (22/95)/(35/44) = .291 (The odds of having the disease for people living in sector 1 are around 30% of the odds for people living in sector 2; the disease is less likely in sector 1 than in sector 2.)
Odds ratio of having the disease for sector 2 vs. sector 1: (35/44)/(22/95) = 3.43 (The odds of having the disease for people living in sector 2 are more than 3 times the odds for people living in sector 1.)
Estimate and compare proportions of persons who contracted the disease by sector, using the first 50% of the data:

                       Disease Status
Sector                 Yes, Y=1   No, Y=0   Total
X=1, sector 1             10         49       59
X=0, sector 2             21         18       39
Total                     31         67       98
Sector 1: $\hat{p}_1 = 10/59$, $1 - \hat{p}_1 = 49/59$
Sector 2: $\hat{p}_0 = 21/39$, $1 - \hat{p}_0 = 18/39$
Odds ratio (sector 1 vs. sector 2) = (10/49)/(21/18) = .175 (The odds of having the disease for people living in sector 1 are around 18% of the odds for people living in sector 2; the disease is less likely in sector 1 than in sector 2.)
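The sector odds ratios for the full data and for the first 50% of the data can be compared side by side. This sketch uses a hypothetical helper, `sector_odds_ratio`, that takes the Yes/No counts for sector 1 followed by the Yes/No counts for sector 2.

```python
def sector_odds_ratio(yes_1, no_1, yes_2, no_2):
    """Odds of disease in sector 1 divided by odds of disease in sector 2."""
    return (yes_1 / no_1) / (yes_2 / no_2)


full = sector_odds_ratio(22, 95, 35, 44)   # 100% of the data
half = sector_odds_ratio(10, 49, 21, 18)   # first 50% of the data

print(f"full data: OR = {full:.3f}, reciprocal = {1 / full:.2f}")
print(f"half data: OR = {half:.3f}, reciprocal = {1 / half:.2f}")
```

Swapping which sector is the reference group always gives the reciprocal of the odds ratio, so the two directions carry the same information.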
Use of SPSS for Odds Ratio and Confidence Intervals
Layout of data sheet in SPSS data editor for the 50% data example above, if data is pre-organized.
Figure (a). SPSS Data Editor
Figure (b). Weight Cases
Step 1: (Go to Step 2 if the data are raw data and not organized as frequencies as in Figure (a).) First, enter the data in the SPSS Data Editor as in Figure (a), then weight the cases by clicking Data and selecting Weight Cases, as in Figure (b). In the Weight Cases dialog box, select the freq variable for weighting the cases, as in Figure (c). Weighting the cases lets SPSS build a contingency table from the joint frequencies entered in (a), each of which is associated with one joint class. For example, 10 is the frequency for "Yes" and "Sector 1". The disease variable has internal values 1 and 2 (1 is labeled Yes and 2 is labeled No). The sector variable has internal values 1 and 2 (1 is labeled Sector 1 and 2 is labeled Sector 2).
Figure (c). Weight Cases
SECTOR * DISEASE Crosstabulation (Count)

                     DISEASE
SECTOR               Yes (Y=1)   No (Y=2)   Total
Sector 1 (X=1)          22          95       117
Sector 2 (X=2)          35          44        79
Total                   57         139       196

Figure (d). Contingency Table
Step 2: Follow the menu path Analyze/Descriptive Statistics/Crosstabs to make the contingency table. In SPSS, the row variable is the risk factor and the column variable is the outcome variable. In the Crosstabs dialog box, click Statistics and check the Risk box in the Crosstabs: Statistics dialog window to obtain risk measures such as the odds ratio and relative risk, which are reported in the following Risk Estimate table. A chi-square test is also one of the options.
Risk Estimate

                                                                    95% Confidence Interval
                                                           Value      Lower      Upper
Odds Ratio for SECTOR (Sector 1 (X=1) / Sector 2 (X=2))     .291       .153       .553
For cohort DISEASE = Yes (Y=1)                              .424       .270       .666
For cohort DISEASE = No (Y=2)                              1.458      1.176      1.808
N of Valid Cases                                             196

In the first row, Value is the odds ratio of having the disease for sector 1 vs. sector 2, and Lower and Upper give the confidence interval for the odds ratio.
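The Risk Estimate table can be cross-checked by hand. The sketch below uses the odds ratio interval given earlier in these notes; for the two "cohort" rows (relative risks) it assumes the usual large-sample interval on the log scale, with standard error sqrt(1/a - 1/(a+b) + 1/c - 1/(c+d)). That relative-risk formula is not stated in the notes, and whether SPSS computes the rows exactly this way is an assumption, although the results agree with the table to about three decimals.

```python
import math


def risk_estimates(a, b, c, d, z=1.96):
    """Odds ratio and relative risks with log-scale confidence intervals.

    a, b: row 1 counts (outcome yes, outcome no).
    c, d: row 2 counts (outcome yes, outcome no).
    Each result is a tuple (estimate, lower, upper).
    """
    def with_interval(estimate, se):
        return estimate, estimate * math.exp(-z * se), estimate * math.exp(z * se)

    n1, n2 = a + b, c + d
    odds_ratio = with_interval((a * d) / (b * c),
                               math.sqrt(1 / a + 1 / b + 1 / c + 1 / d))
    # Relative risk of "yes" and of "no", row 1 relative to row 2 (assumed formula).
    rr_yes = with_interval((a / n1) / (c / n2),
                           math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2))
    rr_no = with_interval((b / n1) / (d / n2),
                          math.sqrt(1 / b - 1 / n1 + 1 / d - 1 / n2))
    return {"odds_ratio": odds_ratio, "cohort_yes": rr_yes, "cohort_no": rr_no}


estimates = risk_estimates(22, 95, 35, 44)
for name, (value, lower, upper) in estimates.items():
    print(f"{name:11s} value = {value:.3f}  95% CI = ({lower:.3f}, {upper:.3f})")
```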
Logistic Regression
Logistic regression is a regression method that can model a binary response variable using both quantitative and categorical explanatory variables. This method can also be used to predict the probability of a binary outcome.
Notation:
Y is the response variable; it takes the value 1 if the disease is present and 0 if the disease is absent.
p denotes the probability of success, i.e., the probability that Y = 1 (disease present), or P(Y = 1 | X).
References
- Hosmer, D.W. and Lemeshow, S., Applied Logistic Regression, John Wiley & Sons, Inc., 1989.
- Neter, Kutner, Nachtsheim and Wasserman, Applied Linear Regression Models, 3rd ed., Irwin Pub., 1996.
Regression Models with Binary Outcome Variable
Since the outcome is either 1 or 0, it can be modeled by a Bernoulli distribution.

$$Y = \alpha + \beta X + \varepsilon, \qquad Y \sim \text{Bernoulli}$$

Y    Probability
1    P(Y = 1) = p        or   P(Y = 1 | X) = p
0    P(Y = 0) = 1 - p    or   P(Y = 0 | X) = 1 - p

The mean of the Bernoulli distribution is the probability of success:

$$E[Y] = \mu_{y|x} = p = \alpha + \beta X$$

(The expected value of the model is the probability of getting 1.)
[Figure: scatter plot of the binary Response Variable (vertical axis) against the Explanatory Variable (horizontal axis).]
Can we fit a line for p and x?
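Special Problems:
1. Nonnormal error terms: $\varepsilon = Y - (\alpha + \beta X)$
2. Nonconstant error variance: $\sigma^2\{\varepsilon\} = (\alpha + \beta X)[1 - (\alpha + \beta X)]$
3. Constraints on the response function: $0 \le \mu_{y|x} = p \le 1$ (If a line is fitted, p may exceed 1 or become negative, which is not valid for a probability.)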
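Problems 2 and 3 can be seen numerically. In the sketch below the intercept and slope of the straight line are hypothetical values chosen only for illustration (they are not estimates from any data in these notes); the point is that the fitted "probability" leaves the interval [0, 1] and that the Bernoulli variance p(1 - p) changes with x.

```python
# Hypothetical straight-line model p = ALPHA + BETA * x (illustrative values only).
ALPHA, BETA = -0.4, 0.6


def linear_probability(x):
    """Fitted value of the straight line, interpreted as P(Y = 1 | x)."""
    return ALPHA + BETA * x


def bernoulli_variance(p):
    """Variance of a Bernoulli outcome with success probability p."""
    return p * (1 - p)


for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    p = linear_probability(x)
    if 0 <= p <= 1:
        print(f"x = {x:.1f}: p = {p:.2f}, error variance = {bernoulli_variance(p):.2f}")
    else:
        print(f"x = {x:.1f}: p = {p:.2f} is not a valid probability")
```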
[Figure: two panels plotting the Response Variable against the Explanatory Variable. Left panel: Fitting a Linear Function. Right panel: Fitting a Logistic Function (Sigmoidal Curve).]
Simple Logistic Response Function (Deterministic part)
The Sigmoidal model (Logistic Function): The probability of success (probability of Y = 1 given X):

$$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \frac{1}{1 + e^{-\alpha - \beta x}}$$
Properties:
- Sigmoidal (S-shaped)
- Monotonic (increasing or decreasing)
- Linearizable (logit transformation)
Other forms: (There are different forms of the logistic function.)
Logit Transformation (Logit link)
The natural logarithm of the odds in favor of success at x (the log of the odds is linearly related to x):

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

Odds
The odds in favor of success (the odds in favor of Y = 1) at x:

$$\frac{p}{1 - p} = e^{\alpha + \beta x}$$
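The relationship between the logistic function, the logit and the odds can be checked numerically. In this sketch the intercept and slope passed to the functions are arbitrary illustrative values, and the function names are chosen here rather than taken from the notes.

```python
import math


def logistic(alpha, beta, x):
    """P(Y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


alpha, beta = 0.5, -2.0   # hypothetical parameter values
for x in (0.0, 1.0, 2.0):
    p = logistic(alpha, beta, x)
    odds = p / (1.0 - p)
    # logit(p) should equal alpha + beta * x, and odds should equal exp(alpha + beta * x).
    print(f"x = {x}: p = {p:.4f}, logit = {logit(p):.4f}, odds = {odds:.4f}")
```

The logit undoes the logistic function, which is what makes the model linear on the log-odds scale.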
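Useful Statistics:
Use the maximum likelihood estimates $\hat{\alpha}$, $\hat{\beta}$.
1. Predicted Probability

$$p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} \;\Rightarrow\; \text{predicted probability of binary outcome: } \hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta} x}}{1 + e^{\hat{\alpha} + \hat{\beta} x}}$$

Predicted probability of success for x = 1:

$$\hat{p}_1 = \frac{e^{\hat{\alpha} + \hat{\beta} \cdot 1}}{1 + e^{\hat{\alpha} + \hat{\beta} \cdot 1}}$$

Predicted probability of success for x = 0:

$$\hat{p}_0 = \frac{e^{\hat{\alpha} + \hat{\beta} \cdot 0}}{1 + e^{\hat{\alpha} + \hat{\beta} \cdot 0}}$$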
Example: Studying disease and sectors

Categorical Variables Codings

                                    Frequency   Parameter coding (1)
Sector within city    Sector 1         117            1.000
                      Sector 2          79             .000

Model:

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1           221.596                .072                   .103

Variables in the Equation

                                                                  95.0% C.I. for EXP(B)
                         B      S.E.     Wald    df   Sig.   Exp(B)    Lower    Upper
Step 1(a)  SECTOR(1)   -1.234   .328   14.193     1   .000    .291      .153     .553
           Constant     -.229   .226    1.021     1   .312    .795
a. Variable(s) entered on step 1: SECTOR.
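The B, S.E. and -2 Log likelihood values in the SPSS output above can be reproduced from the 2x2 counts alone. The sketch below fits the model by Newton-Raphson on the grouped data; it is one standard way to compute maximum likelihood estimates and is not a description of SPSS's internal algorithm. The function and variable names are chosen here for illustration.

```python
import math

# Grouped data from the sector table: (x, number with disease, group size).
GROUPS = [(1, 22, 117), (0, 35, 79)]


def score_and_information(alpha, beta, groups):
    """Gradient of the log-likelihood and the 2x2 information matrix entries."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y, n in groups:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        w = n * p * (1.0 - p)
        g0 += y - n * p
        g1 += x * (y - n * p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11


def fit_logistic(groups, iterations=25):
    """Newton-Raphson fit of ln(p/(1-p)) = alpha + beta * x."""
    alpha = beta = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(alpha, beta, groups)
        det = h00 * h11 - h01 * h01
        # Newton step: parameters += inverse(information) * score.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    # Standard errors from the inverse information matrix at the estimates.
    _, _, h00, h01, h11 = score_and_information(alpha, beta, groups)
    det = h00 * h11 - h01 * h01
    se_alpha, se_beta = math.sqrt(h11 / det), math.sqrt(h00 / det)
    # -2 times the maximized log-likelihood.
    neg2ll = 0.0
    for x, y, n in groups:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        neg2ll -= 2.0 * (y * math.log(p) + (n - y) * math.log(1.0 - p))
    return alpha, beta, se_alpha, se_beta, neg2ll


alpha_hat, beta_hat, se_alpha, se_beta, neg2ll = fit_logistic(GROUPS)
print(f"Constant  B = {alpha_hat:.3f}  S.E. = {se_alpha:.3f}")
print(f"SECTOR(1) B = {beta_hat:.3f}  S.E. = {se_beta:.3f}")
print(f"-2 Log likelihood = {neg2ll:.3f}")
```

With a single binary predictor the fitted probabilities equal the observed proportions, so the estimates also have the closed forms ln(35/44) for the constant and ln[(22/95)/(35/44)] for the slope.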
Maximum likelihood estimates: $\hat{\alpha} = -.229$, $\hat{\beta} = -1.234$ (obtained from SPSS)
The predicted probability of having the disease can be calculated with the prediction equation:

$$\hat{p} = \frac{e^{-.229 - 1.234x}}{1 + e^{-.229 - 1.234x}}$$

Probability of getting the disease for people from sector 1 (x = 1): $\hat{p}_1 = e^{-0.229 - 1.234}/(1 + e^{-0.229 - 1.234}) = .188$, which agrees with the observed proportion 22/117.
Probability of getting the disease for people from sector 2 (x = 0): $\hat{p}_0 = e^{-0.229}/(1 + e^{-0.229}) = .443$, which agrees with the observed proportion 35/79.
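The prediction equation is easy to evaluate directly. This sketch plugs in the rounded SPSS estimates (constant -.229, SECTOR(1) coefficient -1.234); the constant and function names are chosen here for illustration.

```python
import math

ALPHA_HAT = -0.229   # Constant from the SPSS output
BETA_HAT = -1.234    # SECTOR(1) coefficient from the SPSS output


def predicted_probability(x, alpha=ALPHA_HAT, beta=BETA_HAT):
    """Predicted P(disease) for sector code x (1 = sector 1, 0 = sector 2)."""
    eta = alpha + beta * x
    return math.exp(eta) / (1.0 + math.exp(eta))


p_sector1 = predicted_probability(1)
p_sector2 = predicted_probability(0)
print(f"sector 1: {p_sector1:.3f}   (observed 22/117 = {22 / 117:.3f})")
print(f"sector 2: {p_sector2:.3f}   (observed 35/79 = {35 / 79:.3f})")
```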
2. Estimated Odds
The odds in favor of success (the odds in favor of Y = 1) at x are:

$$\frac{\hat{p}}{1 - \hat{p}} = e^{\hat{\alpha} + \hat{\beta} x}$$

$$\ln(\text{estimated odds}) = -0.229 - 1.234x, \qquad \text{estimated odds} = \frac{\hat{p}}{1 - \hat{p}} = e^{-0.229 - 1.234x}$$

The odds of people in sector 1 having the disease = $e^{-0.229 - 1.234 \times 1}$ = .2315
The odds of people in sector 2 having the disease = $e^{-0.229}$ = .7953
3. Estimated Parameter and Odds Ratio (for a binary predictor variable, x = 1 or x = 0)
$\beta$ is the change in the ln of the odds for a one-unit increase in X, that is

$$\beta = \ln\left(\frac{p_1}{1 - p_1}\right) - \ln\left(\frac{p_0}{1 - p_0}\right) = \ln\left(\frac{p_1/(1 - p_1)}{p_0/(1 - p_0)}\right)$$

If X is a binary variable, then $e^{\beta}$ is the odds ratio and $e^{\hat{\beta}}$ is the estimated odds ratio for x = 1 vs. x = 0 (this is Exp(B) in SPSS).
Odds ratio of getting the disease for people from sector 1 vs. sector 2: Odds ratio = $e^{\hat{\beta}} = e^{-1.234} = .291$
The p-value of the test for $\beta = 0$ (OR = 1) is reported as .000, which is less than 0.05, so the sector variable is statistically significant. This suggests that there is an association between disease and sector. The 95% confidence interval of the odds ratio is (.153, .553). The odds ratio of having the disease for sector 2 vs. sector 1 is the reciprocal, $e^{1.234} = 3.43$, with a 95% C.I. from 1.81 to 6.54.
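Exp(B), its confidence interval, the Wald statistic and its p-value all follow from B and S.E. The sketch below starts from the rounded values printed by SPSS, so the results differ slightly from the table in the last digit. The p-value uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom can be written with the complementary error function.

```python
import math

b, se = -1.234, 0.328   # B and S.E. for SECTOR(1), as printed by SPSS
z = 1.96

odds_ratio = math.exp(b)
ci = (math.exp(b - z * se), math.exp(b + z * se))

wald = (b / se) ** 2
p_value = math.erfc(math.sqrt(wald / 2))   # P(chi-square with 1 df > wald)

# Reversing the comparison (sector 2 vs. sector 1) inverts the estimate
# and swaps the inverted interval endpoints.
reciprocal = 1 / odds_ratio
reciprocal_ci = (1 / ci[1], 1 / ci[0])

print(f"Exp(B) = {odds_ratio:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"Wald = {wald:.2f}, p-value = {p_value:.5f}")
print(f"sector 2 vs. 1: {reciprocal:.2f}, "
      f"95% CI = ({reciprocal_ci[0]:.2f}, {reciprocal_ci[1]:.2f})")
```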
Studying disease and sectors, socioeconomic status, age, savings

Categorical Variables Codings

                                               Parameter coding
                                  Frequency      (1)       (2)
Socioeconomic Status   Upper          77        1.000      .000
                       Middle         49         .000     1.000
                       Lower          70         .000      .000
Sector within city     Sector 1      117        1.000
                       Sector 2       79         .000

Variables in the Equation

                                                                    95.0% C.I. for EXP(B)
                           B      S.E.     Wald    df   Sig.   Exp(B)    Lower    Upper
Step 1(a)  SECTOR(1)     -1.234   .357   11.970     1   .001    .291      .145     .586
           SCIOSTAT                        .439     2   .803
           SCIOSTAT(1)    -.278   .434     .409     1   .522    .757      .323    1.775
           SCIOSTAT(2)    -.219   .459     .227     1   .634    .803      .327    1.976
           SAVINGS         .061   .386     .025     1   .874   1.063      .499    2.264
           AGE             .027   .009    8.646     1   .003   1.027     1.009    1.045
           Constant       -.814   .452    3.246     1   .072    .443
a. Variable(s) entered on step 1: SECTOR, SCIOSTAT, SAVINGS, AGE.

Odds ratio of contracting the disease for Upper vs. Lower socioeconomic status is .757.
Odds ratio of contracting the disease for Middle vs. Lower socioeconomic status is .803.
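Each Exp(B) and its 95% interval in the multi-variable output is exp(B) and exp(B ± 1.96 × S.E.). The sketch below recomputes them from the rounded B and S.E. values in the table, so small differences in the third decimal are expected. The dictionary layout and function name are choices made here for illustration.

```python
import math

# (B, S.E.) pairs copied from the Variables in the Equation table.
coefficients = {
    "SECTOR(1)": (-1.234, 0.357),
    "SCIOSTAT(1)": (-0.278, 0.434),   # Upper vs. Lower
    "SCIOSTAT(2)": (-0.219, 0.459),   # Middle vs. Lower
    "SAVINGS": (0.061, 0.386),
    "AGE": (0.027, 0.009),
}


def exp_b_with_ci(b, se, z=1.96):
    """Return (Exp(B), lower limit, upper limit)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


results = {name: exp_b_with_ci(b, se) for name, (b, se) in coefficients.items()}
for name, (or_hat, lower, upper) in results.items():
    note = "interval covers 1" if lower < 1 < upper else "interval excludes 1"
    print(f"{name:12s} Exp(B) = {or_hat:.3f}  "
          f"95% CI = ({lower:.3f}, {upper:.3f})  {note}")
```

An interval that covers 1 corresponds to a coefficient that is not significant at the 5% level, which matches the Sig. column for SCIOSTAT and SAVINGS.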
(* Stepwise regression can be used for variable reduction.)
Variables in the Equation

                                                                  95.0% C.I. for EXP(B)
                         B      S.E.     Wald    df   Sig.   Exp(B)    Lower    Upper
Step 1(a)  AGE            .027   .009    9.606     1   .002   1.027     1.010    1.045
           SECTOR(1)    -1.181   .337   12.295     1   .000    .307      .159     .594
           Constant      -.978   .336    8.458     1   .004    .376
a. Variable(s) entered on step 1: AGE, SECTOR.
Odds ratio of getting the disease for sector 1 versus sector 2 after adjusting for age: 0.307
Odds ratio of getting the disease for sector 2 versus sector 1 after adjusting for age: 3.26 (= 1/0.307)
The p-value for testing the significance of the sector variable, controlling for age, is reported as .000. Since the p-value is less than .05, there is a statistically significant association between sector and disease.
Age is also a significant variable, controlling for sector, with a p-value of .002. "Exp(B) = 1.027" means that the odds of getting the disease are multiplied by 1.027 for each additional year of age, so the odds increase as age increases. For a 20-year-old, holding sector fixed, the odds of getting the disease are $e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 20}$. For a 25-year-old, holding sector fixed, the odds are $e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 25}$. The odds ratio is

$$\frac{e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 25}}{e^{-.978 - 1.181 \times \text{Sector} + 0.027 \times 20}} = \frac{e^{0.027 \times 25}}{e^{0.027 \times 20}} = \frac{1.964}{1.716} = 1.145.$$

This means that the odds of getting the disease for a 25-year-old are 1.145 times the odds for a 20-year-old in the same sector.
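The cancellation in the age comparison can be verified numerically: the constant and the sector term drop out, so the odds ratio for a five-year age difference is the same in both sectors. The constant names below are chosen here for illustration; the values are the rounded SPSS estimates.

```python
import math

B_CONSTANT, B_SECTOR, B_AGE = -0.978, -1.181, 0.027   # from the age + sector model


def odds(sector, age):
    """Estimated odds of disease for a sector code (1 or 0) and an age in years."""
    return math.exp(B_CONSTANT + B_SECTOR * sector + B_AGE * age)


ratios = {sector: odds(sector, 25) / odds(sector, 20) for sector in (0, 1)}
for sector, ratio in ratios.items():
    print(f"sector code {sector}: odds(25) / odds(20) = {ratio:.3f}")

# The same number comes directly from the age coefficient alone.
print(f"exp(5 * B_AGE) = {math.exp(5 * B_AGE):.3f}")
```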
The Hosmer-Lemeshow test can be used for testing goodness of fit. (p-value = .210 > .05, so there is no significant evidence of lack of fit.)

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1        10.853      8   .210
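The Sig. value in the Hosmer and Lemeshow table is the upper-tail probability of a chi-square distribution with 8 degrees of freedom at 10.853. For an even number of degrees of freedom this tail has a closed form, which the sketch below uses to avoid any external library; the function name is hypothetical, and it does not compute the Hosmer-Lemeshow statistic itself, only the p-value for a given statistic.

```python
import math


def chi_square_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df.

    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form applies only to even, positive df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k       # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


p_value = chi_square_sf_even_df(10.853, 8)
print(f"Hosmer-Lemeshow p-value = {p_value:.3f}")
```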
(*One can also test for interactions between all the variables.)