


Stat 562 Term Project

Conditional logistic regression for binary matched pairs response

By Xiufang Ye

Contents

I: Motivation and theoretical background

II: Basic Theory

III: Data analysis

IV: Main Reference

V: Acknowledgements

Keywords: marginal model, logit link, conditional logistic regression, sufficient statistics, ML analysis, matched case-control study

1. Motivation and Theoretical Background

1.1 What are binary matched pairs?

When comparing categorical responses for two samples in which each observation in one sample pairs with an observation in the other, the data are called matched-pairs data.

The responses in the two samples are therefore statistically dependent. "Binary" here means that each response takes one of two values.

1.2 Why “logistic regression”?

Logistic regression is the most important model for categorical response data, especially for binary data. It is widely used in biomedical studies, social science research, marketing, business, and even genetics.

1.3 Why “conditional logistic regression”?

In all of the examples so far, the observations have been independent. But what if the observations are matched? One might think it would be possible to include dummy-coded variables to indicate the matching. For example, with 56 matched pairs one could include 55 dummy variables to account for the non-independence, along with whatever covariates belong in the model. However, ordinary logistic regression has problems when the number of parameters is close to the number of observations available. In a situation such as this, the conditional logistic model is recommended. In matched case-control studies, conditional logistic regression can be used to investigate the relationship between an outcome and a set of prognostic factors.

1.4 Exact Inference for Logistic Regression

Maximum likelihood estimators of model parameters work best when the sample size is large compared to the number of parameters in the model. When the sample size is small, or when there are many parameters relative to the sample size, improved inference results using the method of conditional maximum likelihood. The conditional maximum likelihood method bases inference for the primary parameters of interest on a conditional likelihood function that eliminates other parameters. The technique uses a conditional probability distribution defined over data sets in which the values of certain "sufficient statistics" for the other parameters are fixed. This distribution is defined for potential samples that provide the same information about the other parameters that occurs in the observed sample. The distribution and the related conditional likelihood function depend only on the parameters of interest.

For binary data, conditional likelihood methods are especially useful when a logistic regression model contains a large number of “nuisance” parameters. They are also useful for small samples. One can perform exact inference for a parameter by using the conditional likelihood function that eliminates all the other parameters. Since that conditional likelihood does not involve unknown parameters, one can calculate probabilities such as p-values exactly rather than use crude approximations.

2. Basic Theory

2.1 Marginal versus conditional models for binary matched pairs

2.1.1 Two marginal models

Let (Y_1, Y_2) denote the pair of observations for a randomly selected subject, where a "1" outcome denotes category 1 (success) and a "0" outcome denotes category 2. We can fit the model

P(Y_t = 1) = α + δx_t, t = 1, 2, (1)

where x_1 = 0, x_2 = 1.

Since P(Y_1 = 1) = α and P(Y_2 = 1) = α + δ, we have δ = P(Y_2 = 1) − P(Y_1 = 1).

Interpretation of the parameter: δ is the difference between the marginal probabilities of success.

Alternatively, the logit model can be written as

logit[P(Y_t = 1)] = α + βx_t. (2)

Then

logit[P(Y_1 = 1)] = α,

logit[P(Y_2 = 1)] = α + β.

Interpretation of the parameter: β is the log odds ratio for the two marginal distributions.

The two models focus on the marginal distributions of the responses for the two observations. For instance, in terms of the population-averaged table, the ML estimate of β in (2) is the log odds ratio of the marginal proportions.
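The population-averaged versus subject-specific contrast can be made concrete with the matched-pairs counts of Table 10.3 in the textbook (9 pairs with both responses 1, 16 pairs with (1, 0), 37 pairs with (0, 1), 82 pairs with both 0; the same counts reappear in the data analysis of Section 3). A minimal Python sketch, outside the SAS analysis used later:

```python
from math import log

# Counts from Table 10.3: (y_i1, y_i2) with y_i1 = control's diabetes,
# y_i2 = case's diabetes; n12 and n21 are the discordant-pair counts.
n11, n12, n21, n22 = 9, 16, 37, 82
n = n11 + n12 + n21 + n22            # 144 matched pairs

# Marginal (population-averaged) log odds ratio, the ML estimate of beta in (2)
p1 = (n11 + n12) / n                 # P(Y_1 = 1)
p2 = (n11 + n21) / n                 # P(Y_2 = 1)
beta_marginal = log(p2 / (1 - p2)) - log(p1 / (1 - p1))

# Subject-specific (conditional) log odds ratio, log(n21/n12) (see Section 2.3)
beta_conditional = log(n21 / n12)

print(round(beta_marginal, 3))       # ≈ 0.804
print(round(beta_conditional, 3))    # ≈ 0.838
```

The two effects differ (about 0.804 versus 0.838) even though both are computed from the same table, illustrating that population-averaged and subject-specific effects do not coincide for the logit link.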

2.1.2 One conditional model

By contrast, the subject-specific table, with one stratum per subject, implicitly allows the probabilities to vary by subject. Let (Y_i1, Y_i2) denote the ith pair of observations, i = 1, 2, …, n. The model has the form

P(Y_it = 1) = α_i + δx_t,

with x_1 = 0, x_2 = 1. This is called a conditional model, since the effect δ is defined conditional on the subject.

Compared with a marginal model, its estimate describes the conditional association in the three-way table stratified by subject: the effect is subject-specific. For the marginal models (1) and (2), by contrast, the effects are population-averaged, since they refer to an average over the entire population rather than to any particular individual. The two kinds of effect are identical for the identity link, but they differ for nonlinear links.

For example, for the logit link the subject-specific model is

logit[P(Y_it = 1)] = α_i + βx_t. (3)

Taking the average of the corresponding probabilities over the population does not yield a model of the same form.

2.2 A Logit Model with Subject-Specific Probabilities

By permitting subjects to have their own probability distributions, the conditional model (3) for Y_it, observation t for subject i, is

logit[P(Y_it = 1)] = α_i + βx_it,

so that

P(Y_it = 1) = exp(α_i + βx_it) / [1 + exp(α_i + βx_it)],

where x_i1 = 0 and x_i2 = 1. Here we assume a common effect β for all subjects. For subject i,

P(Y_i1 = 1) = e^{α_i} / (1 + e^{α_i}), P(Y_i2 = 1) = e^{α_i + β} / (1 + e^{α_i + β}).

Then the odds of success are e^{α_i} for observation 1 and e^{α_i} e^{β} for observation 2.

Interpretation of the parameter: β compares the two response distributions within a subject. For each subject, the odds of success for observation 2 are exp(β) times the odds for observation 1.

The dependence in matched pairs is accounted for by {α_i} in the conditional logistic regression model. Given the parameters, model (3) assumes independence of responses for different subjects and for the two observations on the same subject. Averaging over all subjects, however, the responses are nonnegatively associated.

To see why, suppose |β| is small compared with the spread of the {α_i}. Both P(Y_i1 = 1) and P(Y_i2 = 1) are increasing functions of α_i, so a subject with a large positive α_i has a high P(Y_it = 1) for each t and is likely to have a success at both observations, while a subject with a large negative α_i has a low P(Y_it = 1) for each t and is likely to have a failure at both. For any given β, the greater the variability in {α_i}, the greater the overall positive association between the responses, with success (failure) for observation 1 tending to occur with success (failure) for observation 2. The positive association reflects the shared value of α_i for the two observations in a pair. In particular, when the α_i are all identical, no association occurs.

Question: when there is a large number of parameters {α_i}, the conditional model (3) causes difficulty with the fitting process and with the properties of ordinary ML estimators.

In fact, the unconditional ML estimator of β is inconsistent. This result was first shown by Andersen (1973).

Outline of proof:

Step 1: Assuming independence of responses for different subjects and for the two observations on the same subject, the likelihood equations are

y_i1 + y_i2 = P(Y_i1 = 1) + P(Y_i2 = 1) for each subject i, and Σ_i y_i2 = Σ_i P(Y_i2 = 1).

Step 2: Substituting P(Y_it = 1) = e^{α_i + βx_t} / (1 + e^{α_i + βx_t}) into the subject-specific equations shows that α̂_i = −∞ for the subjects with y_i1 + y_i2 = 0, α̂_i = +∞ for the subjects with y_i1 + y_i2 = 2, and α̂_i = −β̂/2 for the subjects with y_i1 + y_i2 = 1.

Step 3: Breaking Σ_i y_i2 into components for the sets of subjects having y_i1 + y_i2 = 0, 1, and 2, the remaining likelihood equation reduces to n21 = (n12 + n21) e^{β̂/2} / (1 + e^{β̂/2}), where n12 and n21 count the pairs with (y_i1, y_i2) = (1, 0) and (0, 1), respectively. Solving gives e^{β̂/2} = n21/n12, hence β̂ = 2 log(n21/n12). Since log(n21/n12) is consistent for β (Section 2.3.2), β̂ converges to 2β.
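The fixed point described in Steps 2 and 3 can be checked numerically: plugging the α̂_i values back into the likelihood gives a profile log-likelihood in β alone, and its maximizer is twice the conditional estimate log(n21/n12). A Python sketch, assuming the Table 10.3 discordant counts n12 = 16 and n21 = 37:

```python
from math import log, exp

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + exp(-x))

n12, n21 = 16, 37  # discordant-pair counts from Table 10.3

# Profile log-likelihood: pairs with y_i1 + y_i2 = 0 or 2 push alpha_i to
# -inf / +inf and contribute 0; pairs with total 1 have alpha_i-hat = -beta/2.
def profile_loglik(beta):
    return 2.0 * (n21 * log(sigma(beta / 2)) + n12 * log(sigma(-beta / 2)))

# Crude grid maximization of the profile likelihood over [0, 4)
grid = [i / 10000 for i in range(0, 40000)]
beta_uncond = max(grid, key=profile_loglik)

beta_cond = log(n21 / n12)
print(round(beta_uncond, 3), round(2 * beta_cond, 3))  # both ≈ 1.677
```

The unconditional maximizer lands at 2 log(37/16) ≈ 1.677, double the consistent estimate 0.838, matching Andersen's inconsistency result.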

A remedy, called conditional ML, treats {α_i} as nuisance parameters and maximizes the likelihood function for a conditional distribution that eliminates them.

2.3 Conditional ML inference for binary matched pairs

2.3.1 Estimation of the parameters

For model (3), under the independence assumptions mentioned before, the joint mass function of the n pairs {(Y_i1, Y_i2)} is

∏_{i=1}^{n} ∏_{t=1}^{2} P(Y_it = 1)^{y_it} [1 − P(Y_it = 1)]^{1 − y_it}

= ∏_{i=1}^{n} exp[α_i(y_i1 + y_i2) + β y_i2] / {[1 + e^{α_i}][1 + e^{α_i + β}]}.

So it is proportional to exp(Σ_i α_i S_i + β Σ_i y_i2), where S_i = y_i1 + y_i2.

To eliminate {α_i}, we condition on their sufficient statistics, the pairwise success totals

S_i = y_i1 + y_i2, i = 1, …, n.

Given S_i = 0, necessarily (Y_i1, Y_i2) = (0, 0); given S_i = 2, necessarily (Y_i1, Y_i2) = (1, 1). Such pairs carry no information about β. Given S_i = 1,

P(Y_i1 = y_1, Y_i2 = y_2 | S_i = 1)

= e^{β} / (1 + e^{β}) if (y_1, y_2) = (0, 1)

= 1 / (1 + e^{β}) if (y_1, y_2) = (1, 0).

Let {n_ab} denote the counts for the four possible sequences. Among the subjects having S_i = 1, n12 is the number having a success for observation 1 and a failure for observation 2, that is (y_i1, y_i2) = (1, 0), and n21 is the number having (y_i1, y_i2) = (0, 1). Since n21 is the sum of n* = n12 + n21 independent, identical Bernoulli variates, its conditional distribution is binomial with parameters n* and e^{β}/(1 + e^{β}).

Hence, to make inferences about β, or to test marginal homogeneity (β = 0), we only need the information about the pairs in which either (y_i1, y_i2) = (0, 1) or (y_i1, y_i2) = (1, 0).
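The conditional binomial distribution gives an exact test of marginal homogeneity, since under β = 0 the count n21 is Binomial(n*, 1/2) given n*; its large-sample score version is McNemar's statistic (n21 − n12)²/(n21 + n12). A Python sketch with the Table 10.3 discordant counts (the two-sided exact p-value is computed by doubling the tail, one common convention):

```python
from math import comb

n12, n21 = 16, 37          # discordant pairs from Table 10.3
n_star = n12 + n21

# Under H0: beta = 0, n21 | n* ~ Binomial(n*, 1/2).
# Exact two-sided p-value: double the upper-tail probability at the observed n21.
upper_tail = sum(comb(n_star, k) for k in range(n21, n_star + 1)) / 2 ** n_star
p_exact = min(1.0, 2 * upper_tail)

# Large-sample score statistic (McNemar): (n21 - n12)^2 / (n21 + n12)
mcnemar = (n21 - n12) ** 2 / n_star

print(round(mcnemar, 2))   # ≈ 8.32, matching the SAS score test in Section 3
print(round(p_exact, 4))
```

Because the conditional distribution is free of {α_i}, the exact p-value needs no approximation, which is the point of Section 1.4.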

Alternatively, we can obtain this result through the maximum likelihood method.

Conditional on {S_i}, the joint distribution of the matched pairs is

∏ P(Y_i1 = y_i1, Y_i2 = y_i2 | S_i = 1) = [e^{β}/(1 + e^{β})]^{n21} [1/(1 + e^{β})]^{n12},

where the product refers to all pairs having S_i = 1. The conditional log-likelihood is therefore

ℓ(β) = n21 β − (n12 + n21) log(1 + e^{β}).

Differentiating the log of this conditional likelihood, equating to 0, and solving yields the conditional ML estimator

β̂ = log(n21/n12).

By the delta method, as similarly applied to 2×2 contingency tables, we obtain

SE(β̂) = sqrt(1/n12 + 1/n21).
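The conditional ML estimate log(n21/n12) and its delta-method standard error sqrt(1/n12 + 1/n21) can be evaluated directly for the Table 10.3 counts; the results reproduce the estimate, standard error, and Wald limits that appear in the SAS output of Section 3. A quick Python check:

```python
from math import log, sqrt, exp

n12, n21 = 16, 37                        # discordant pairs, Table 10.3

beta_hat = log(n21 / n12)                # conditional ML estimate
se = sqrt(1 / n12 + 1 / n21)             # delta-method standard error

# 95% Wald interval for beta and for the odds ratio exp(beta)
lo, hi = beta_hat - 1.96 * se, beta_hat + 1.96 * se
print(round(beta_hat, 4), round(se, 4))  # ≈ 0.8383, 0.2992
print(round(exp(lo), 3), round(exp(hi), 3))  # ≈ 1.286, 4.157
```

These match the PROC LOGISTIC estimate 0.8383 (SE 0.2992) and Wald odds-ratio limits (1.286, 4.157) shown later.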

2.3.2 Consistency of the estimator β̂

Referring to Problem 10.23 in the textbook, we can prove that β̂ = log(n21/n12) converges in probability to β.

Outline of the proof: For a random sample of n pairs, by the model and the independence of the two responses within a subject,

P(Y_i1 = 1, Y_i2 = 0) = [e^{α_i}/(1 + e^{α_i})] · [1/(1 + e^{α_i + β})],

P(Y_i1 = 0, Y_i2 = 1) = [1/(1 + e^{α_i})] · [e^{α_i + β}/(1 + e^{α_i + β})],

so for every subject the second probability is e^{β} times the first. Applying the weak law of large numbers (WLLN), n12/n and n21/n converge to the averages of these two probabilities over the subjects.

Therefore, n21/n12 → e^{β}, and β̂ = log(n21/n12) → β.
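The argument can be illustrated by simulation: whatever the distribution of the α_i, the ratio of the two discordant probabilities is e^{β} for every subject, so log(n21/n12) settles near β as n grows. A small Monte Carlo sketch in Python (the normal distribution for the α_i and the values β = 0.8 and n = 200,000 are arbitrary choices for illustration):

```python
import random
from math import exp, log

random.seed(562)
beta = 0.8                  # assumed true subject-specific effect
n = 200_000                 # number of matched pairs

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + exp(-x))

n12 = n21 = 0
for _ in range(n):
    alpha = random.gauss(0.0, 2.0)               # heterogeneous subject effects
    y1 = random.random() < sigma(alpha)          # observation 1
    y2 = random.random() < sigma(alpha + beta)   # observation 2
    if y1 and not y2:
        n12 += 1
    elif y2 and not y1:
        n21 += 1

beta_hat = log(n21 / n12)
print(round(beta_hat, 2))   # close to the true beta = 0.8
```

Note the contrast with the unconditional ML estimator of Section 2.2, which would converge to 2β = 1.6 on the same data.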

2.4 Random Effects in the Binary Matched-Pairs Model

There is an alternative remedy for handling the huge number of nuisance parameters in logit model (3). One can treat {α_i} as random effects, regarding them as an unobserved random sample from a probability distribution, usually assumed to be N(α, σ²) with unknown α and σ. This eliminates {α_i} by averaging with respect to their distribution, yielding a marginal likelihood. For matched pairs with a nonnegative sample log odds ratio, this approach also yields the estimate β̂ = log(n21/n12).

2.5 Conditional ML for Matched Pairs with Multiple Predictors

Generally, we can extend model (3) to a model with multiple predictors:

logit[P(Y_it = 1)] = α_i + β_1 x_it1 + ⋯ + β_k x_itk = α_i + β'x_it, t = 1, 2, (4)

where x_ith denotes the value of predictor h for observation t in pair i. Typically, one predictor is the explanatory variable of interest, while the others are covariates being controlled, in addition to those already controlled by virtue of using them to form the matched pairs. We can again apply conditional ML to eliminate {α_i} and obtain the estimate of β.

Let S_i = y_i1 + y_i2 and x*_i = x_i2 − x_i1. Then the conditional distribution given S_i = 1 is

P(Y_i2 = 1 | S_i = 1) = exp(β'x*_i) / [1 + exp(β'x*_i)],

P(Y_i1 = 1 | S_i = 1) = 1 / [1 + exp(β'x*_i)].

The first equation has the form of a logistic regression with no intercept and with predictor values x*_i, the differences between the two levels of each predictor. In fact, one can obtain the conditional ML estimates for model (4) by fitting an ordinary logistic regression to those pairs alone, using artificial response y*_i = 1 when (y_i1 = 0, y_i2 = 1), y*_i = 0 when (y_i1 = 1, y_i2 = 0), no intercept, and predictor values x*_i. This maximizes the same likelihood as the conditional likelihood (see Breslow et al. 1978; Chamberlain 1980).

Let us illustrate this with Table 10.3 in the textbook.

Let x denote diabetes, and let t = 1 refer to the control and t = 2 to the case, so that y*_i = 1 always. Since x = 1 represents "yes" for diabetes and x = 0 represents "no", x*_i = −1 for 16 pairs, x*_i = 0 for 9 + 82 = 91 pairs, and x*_i = +1 for 37 pairs. The logit model that forces y* = 1 has β̂ = log(37/16) = 0.84. With a single binary predictor, this estimate is identical to log(n21/n12).
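The no-intercept fit just described can be reproduced with a few Newton-Raphson steps on the conditional log-likelihood. A Python sketch using the 37/91/16 differences from Table 10.3 (SAS is used for the real analysis in Section 3; this is only a cross-check):

```python
from math import exp, log

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + exp(-x))

# Predictor differences x*_i = x_i2 - x_i1 for the 144 pairs of Table 10.3,
# with artificial response y*_i = 1 for every pair (case minus control).
diffs = [1] * 37 + [0] * 91 + [-1] * 16

# Newton-Raphson for the no-intercept logistic log-likelihood
# l(beta) = sum_i log sigma(beta * x*_i)
beta = 0.0
for _ in range(25):
    score = sum(d * sigma(-beta * d) for d in diffs)               # dl/dbeta
    info = sum(d * d * sigma(beta * d) * sigma(-beta * d) for d in diffs)
    beta += score / info

print(round(beta, 4))           # ≈ 0.8383, matching log(37/16)
```

The pairs with x*_i = 0 contribute a constant log(1/2) each and drop out of the score, which is why only the 53 discordant pairs drive the estimate.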

2.6 Extensions

The discussion of marginal models in Section 10.1 and of the conditional model in this section generalizes to multinomial responses and to matched sets (clusters). For matched-set clusters, the conditional ML approach is restricted to estimating within-cluster effects β, such as those occurring in case-control and crossover studies. An advantage of the random effects approach over conditional ML with the conditional model is that it has no such restriction.

3. Data Analysis

3.1 Conditional Logistic Regression for Matched Pairs Data

In matched case-control studies, conditional logistic regression is used to investigate the relationship between an outcome of being a case or a control and a set of prognostic factors. When each matched set consists of a single case and a single control, the conditional likelihood is given by

∏_i exp(β'x_i1) / [exp(β'x_i1) + exp(β'x_i0)],

where x_i1 and x_i0 are vectors representing the prognostic factors for the case and control, respectively, of the ith matched set. This likelihood is identical to the likelihood of fitting a logistic regression model to a set of data with constant response, where the model contains no intercept term and has explanatory variables given by d_i = x_i1 - x_i0 (Breslow 1982).

Table 10.3 in the textbook comes from a case-control study of acute myocardial infarction (MI) among Navajo Indians, which matched 144 victims of MI according to age and gender with 144 people free of heart disease. Subjects were asked whether they had ever been diagnosed as having diabetes (x = 0, no; x = 1, yes). For subject t in matched pair i, we consider model (3).

The case and corresponding control have the same ID. The prognostic factor is diabetes (an indicator variable for whether having diagnosed diabetes). The goal of the case-control analysis is to determine the relative risk for diabetes.

Before PROC LOGISTIC is used for the logistic regression analysis, each matched pair is transformed into a single observation, in which the variable diabetes contains the difference between the corresponding values for the case and the control (case - control). The variable outcome, which will be used as the response variable in the logistic regression model, is given a constant value of 1. Note that there are 144 observations in the data set, one for each matched pair.

In the following SAS statements, PROC LOGISTIC is invoked with the NOINT option to obtain the conditional logistic model estimates. The model contains diabetes as the only predictor variable. Because the option CLODDS=PL is specified, PROC LOGISTIC also computes a 95% profile likelihood confidence interval for the odds ratio of the predictor.

SAS code

data Data;

input diabetes outcome @@ ;

output;

datalines;

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 1

-1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1 -1 1

-1 1 -1 1 -1 1 -1 1 -1 1 -1 1;

proc logistic data=Data;

model outcome=diabetes / noint CLODDS=PL;

run;

Results from the conditional logistic analysis are shown as follows. Note that there is only one response level listed in the "Response Profile" table and there is no intercept term in the "Analysis of Maximum Likelihood Estimates" table.

Output of Analysis:

|The SAS System |

|The LOGISTIC Procedure |

|Model Information |
|Data Set |WORK.DATA |
|Response Variable |outcome |
|Number of Response Levels |1 |
|Number of Observations |144 |
|Model |binary logit |
|Optimization Technique |Fisher's scoring |

|Response Profile |
|Ordered Value |outcome |Total Frequency |
|1 |1 |144 |
|Probability modeled is outcome=1. |

|Model Convergence Status |
|Convergence criterion (GCONV=1E-8) satisfied. |

|Model Fit Statistics |
|Criterion |Without Covariates |With Covariates |
|AIC |199.626 |193.073 |
|SC |199.626 |196.043 |
|-2 Log L |199.626 |191.073 |

|Testing Global Null Hypothesis: BETA=0 |
|Test |Chi-Square |DF |Pr > ChiSq |
|Likelihood Ratio |8.5534 |1 |0.0034 |
|Score |8.3208 |1 |0.0039 |
|Wald |7.8501 |1 |0.0051 |

|Analysis of Maximum Likelihood Estimates |
|Parameter |DF |Estimate |Standard Error |Wald Chi-Square |Pr > ChiSq |
|diabetes |1 |0.8383 |0.2992 |7.8501 |0.0051 |

|Odds Ratio Estimates |
|Effect |Point Estimate |95% Wald Confidence Limits |
|diabetes |2.312 |1.286 |4.157 |

|NOTE: Since there is only one response level, measures of association between the observed and predicted values were not calculated. |

|Profile Likelihood Confidence Interval for Adjusted Odds Ratios |
|Effect |Unit |Estimate |95% Confidence Limits |
|diabetes |1.0000 |2.312 |1.310 |4.272 |

In this model, diabetes is the predictor variable. The odds ratio estimate for diabetes is 2.312, which is an estimate of the relative risk for diabetes. Since the 95% profile likelihood confidence interval for the odds ratio, (1.310, 4.272), does not contain unity, the prognostic factor diabetes is statistically significant.
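The fit statistics in this output can be reproduced by hand from the conditional likelihood: each pair with difference 0 contributes log(1/2) regardless of β, while the discordant pairs contribute through σ(β̂) = 37/53 and σ(−β̂) = 16/53. A Python check using the 37/91/16 differences from the SAS data step above:

```python
from math import log, exp

# Differences from the SAS data step: 37 pairs at +1, 91 at 0, 16 at -1
n_pos, n_zero, n_neg, n = 37, 91, 16, 144

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + exp(-x))

beta_hat = log(n_pos / n_neg)        # conditional ML estimate, log(37/16)

# -2 log L at beta = 0: every pair contributes log(1/2)
m2ll_null = -2 * n * log(0.5)

# -2 log L at beta-hat: sigma(beta_hat) = 37/53, sigma(-beta_hat) = 16/53
m2ll_fit = -2 * (n_pos * log(sigma(beta_hat))
                 + n_neg * log(sigma(-beta_hat))
                 + n_zero * log(0.5))

print(round(m2ll_null, 3))                 # ≈ 199.626
print(round(m2ll_fit, 3))                  # ≈ 191.073
print(round(m2ll_null - m2ll_fit, 4))      # ≈ 8.5534, the likelihood ratio stat
```

These agree with the "-2 Log L" column (199.626 without covariates, 191.073 with) and with the likelihood ratio chi-square 8.5534 in the output.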

3.2 Conditional Logistic Regression for m:n Matching

Conditional logistic regression is used to investigate the relationship between an outcome and a set of prognostic factors in matched case-control studies. The outcome is whether the subject is a case or a control. If there is only one case and one control, the matching is 1:1. The m:n matching refers to the situation in which there is a varying number of cases and controls in the matched sets. You can perform conditional logistic regression with the PHREG procedure by using the discrete logistic model and forming a stratum for each matched set. In addition, you need to create dummy survival times so that all the cases in a matched set have the same event time value, and the corresponding controls are censored at later times.

Consider the following set of low infant birth-weight data extracted from Appendix 1 of Hosmer and Lemeshow (1989). These data represent 189 women, of whom 59 had low birth-weight babies and 130 had normal weight babies. Under investigation are the following risk factors: weight in pounds at the last menstrual period (LWT), presence of hypertension (HT), smoking status during pregnancy (Smoke), and presence of uterine irritability (UI). For HT, Smoke, and UI, a value of 1 indicates a "yes" and a value of 0 indicates a "no." The woman's age (Age) is used as the matching variable. The SAS data set LBW contains a subset of the data corresponding to women between the ages of 16 and 32.

data LBW;

input id Age Low LWT Smoke HT UI @@;

Time=2-Low;

datalines;

25 16 1 ……

175 32 0 170 0 0 0 207 32 0 186 0 0 0

;

The variable Low is used to determine whether the subject is a case (Low=1, low birth-weight baby) or a control (Low=0, normal weight baby). The dummy time variable Time takes the value 1 for cases and 2 for controls.

The following SAS statements produce a conditional logistic regression analysis of the data. The variable Time is the response, and Low is the censoring variable. Note that the data set is created so that all the cases have the same event time, and the controls have later censored times. The matching variable Age is used in the STRATA statement so each unique age value defines a stratum. The variables LWT, Smoke, HT, and UI are specified as explanatory variables. The TIES=DISCRETE option requests the discrete logistic model.

proc phreg data=LBW;

model Time*Low(0)= LWT Smoke HT UI / ties=discrete;

strata Age;

run;

The procedure displays a summary of the number of event and censored observations for each stratum; these are the numbers of cases and controls for each matched set, shown in Output 1. Results of the conditional logistic regression analysis are shown in Output 2. Based on the Wald tests for the individual variables, LWT, Smoke, and HT are statistically significant, while UI is marginal.

The hazard ratios, computed by exponentiating the parameter estimates, are useful in interpreting the results of the analysis. If the hazard ratio of a prognostic factor is larger than 1, an increment in the factor increases the hazard rate; if it is less than 1, an increment in the factor decreases the hazard rate. The results indicate that women were more likely to have low birth-weight babies if they were underweight at the last menstrual period, were hypertensive, smoked during pregnancy, or suffered uterine irritability. For matched case-control studies with one case per matched set (1:n matching), the likelihood function for the conditional logistic regression reduces to that of the Cox model for the continuous time scale; for that situation you can use the default TIES=BRESLOW.
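The hazard ratio column in the PHREG output is simply the exponential of the parameter estimate column, which can be verified directly. A Python sketch using the estimates reported below:

```python
from math import exp

# Parameter estimates from the PHREG output (low birth-weight study)
estimates = {"LWT": -0.01498, "Smoke": 0.80805, "HT": 1.75143, "UI": 0.88341}

# Hazard ratio = exp(parameter estimate)
hazard_ratios = {k: round(exp(v), 3) for k, v in estimates.items()}
print(hazard_ratios)  # ≈ {'LWT': 0.985, 'Smoke': 2.244, 'HT': 5.763, 'UI': 2.419}
```

For example, exp(1.75143) ≈ 5.763 says a hypertensive woman's odds of a low birth-weight baby are nearly six times those of her matched non-hypertensive counterpart.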

Output 1: Summary of the Number of Cases and Controls

|The PHREG Procedure |

|Model Information |
|Data Set |WORK.LBW |
|Dependent Variable |Time |
|Censoring Variable |Low |
|Censoring Value(s) |0 |
|Ties Handling |DISCRETE |

|Summary of the Number of Event and Censored Values |
|Stratum |Age |Total |Event |Censored |Percent Censored |
|1 |16 |7 |1 |6 |85.71 |
|2 |17 |12 |5 |7 |58.33 |
|3 |18 |10 |2 |8 |80.00 |
|4 |19 |16 |3 |13 |81.25 |
|5 |20 |18 |8 |10 |55.56 |
|6 |21 |12 |5 |7 |58.33 |
|7 |22 |13 |2 |11 |84.62 |
|8 |23 |13 |5 |8 |61.54 |
|9 |24 |13 |5 |8 |61.54 |
|10 |25 |15 |6 |9 |60.00 |
|11 |26 |8 |4 |4 |50.00 |
|12 |27 |3 |2 |1 |33.33 |
|13 |28 |9 |2 |7 |77.78 |
|14 |29 |7 |1 |6 |85.71 |
|15 |30 |7 |1 |6 |85.71 |
|16 |31 |5 |1 |4 |80.00 |
|17 |32 |6 |1 |5 |83.33 |
|Total |  |174 |54 |120 |68.97 |

Output 2: Conditional Logistic Regression Analysis for the Low Birth-Weight Study

|The PHREG Procedure |

|Convergence Status |
|Convergence criterion (GCONV=1E-8) satisfied. |

|Model Fit Statistics |
|Criterion |Without Covariates |With Covariates |
|-2 LOG L |159.069 |141.108 |
|AIC |159.069 |149.108 |
|SBC |159.069 |157.064 |

|Testing Global Null Hypothesis: BETA=0 |
|Test |Chi-Square |DF |Pr > ChiSq |
|Likelihood Ratio |17.9613 |4 |0.0013 |
|Score |17.3152 |4 |0.0017 |
|Wald |15.5577 |4 |0.0037 |

|Analysis of Maximum Likelihood Estimates |
|Variable |DF |Parameter Estimate |Standard Error |Chi-Square |Pr > ChiSq |Hazard Ratio |
|LWT |1 |-0.01498 |0.00706 |4.5001 |0.0339 |0.985 |
|Smoke |1 |0.80805 |0.36797 |4.8221 |0.0281 |2.244 |
|HT |1 |1.75143 |0.73932 |5.6120 |0.0178 |5.763 |
|UI |1 |0.88341 |0.48032 |3.3827 |0.0659 |2.419 |

4. Main References

Agresti, A. (2002). Categorical Data Analysis, 2nd edition. New York: Wiley.

Hosmer, D. W., and Lemeshow, S. (1989). Applied Logistic Regression. New York: Wiley.

SAS Institute Inc. SAS/STAT User's Guide: the LOGISTIC and PHREG procedures (conditional logistic regression examples).

Thank you for listening.
