Regression modelling with a categorical outcome: logistic regression

Logistic regression is similar to linear regression: the right-hand side of the regression model below works in the same way. It is the outcome on the left-hand side that is different. The standard multiple regression model is:

Y = a + b1x1 + b2x2 + … + e

where Y is a continuous outcome. This model requires a linear relationship between the outcome and the explanatory variables. When Y is a binary variable (one that can take one of two values, e.g. dead/alive, yes/no; also known as dichotomous), such a relationship is not possible, so a mathematical function known as the 'link' function must be applied to the binary outcome. This transformation of Y creates a linear relationship with the explanatory variables. These models are known as generalized linear models (GLMs). Generalized linear models can also be used to model other types of outcome (a short code sketch follows this list):

Binary outcome from an RCT or case-control study: logistic regression

Event rate or count: Poisson regression

Binary outcome from a matched case-control study: conditional logistic regression

Categorical outcome with >2 categories: multinomial regression

Time to event data: exponential or Weibull models
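As an illustration of the common machinery behind these models, here is a minimal Python sketch using the statsmodels library; the data are simulated and the variable names and coefficients are invented purely for illustration. Swapping the family (e.g. Poisson for counts) gives the other GLMs listed above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(30, 80, n)                 # continuous explanatory variable
diabetes = rng.integers(0, 2, n)             # binary explanatory variable
# Simulate a binary outcome from an assumed true model (coefficients invented)
lin_pred = -4.0 + 0.05 * age - 0.8 * diabetes
p = 1 / (1 + np.exp(-lin_pred))              # inverse of the logit link
y = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([age, diabetes]))
# family=Binomial() with its default logit link gives logistic regression
result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(result.summary())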

Model fitting

Given that p is the probability of a person having the event of interest, the function that is modelled as the outcome in a logistic regression is:

Logit(p) = ln(p / (1 - p))

where p / (1 - p), the probability of the event occurring divided by the probability of the event not occurring, is the odds of the event.
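For instance, a probability of p = 0.8 corresponds to odds of 0.8/0.2 = 4 and a logit of ln(4) ≈ 1.386. A minimal check in Python (the numbers are purely illustrative):

import math

p = 0.8
odds = p / (1 - p)      # 4.0: the event is four times as likely to occur as not
logit = math.log(odds)  # ≈ 1.386: the value modelled as the outcome
print(odds, logit)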

The model is therefore

Logit(p) = a + b1x1 + b2x2 + …

where a is the estimated constant and the b's are the estimated regression coefficients. Unlike the linear regression model, there is no separate residual term: the random variation in the outcome is captured by its binomial distribution.
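Going the other way, a fitted value of the linear predictor a + b1x1 + b2x2 + … can be converted back to a probability with the inverse of the logit (the number below is illustrative):

import math

linear_predictor = 0.4                      # a + b1*x1 + b2*x2 for some person
p = 1 / (1 + math.exp(-linear_predictor))   # inverse logit: back to a probability
print(p)                                    # ≈ 0.599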

The parameters of this model are estimated using a different method from linear regression, which uses least squares: logistic regression uses maximum likelihood estimation. This is an iterative procedure that finds the regression coefficients that maximise the likelihood of the observed results, assuming an underlying binomial distribution for the data. For a good explanation of maximum likelihood and regression modelling in general, see 'Statistics at Square Two' by Michael Campbell (2nd edition, BMJ Books, Blackwell). A sketch of the iteration is given below.
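To make the idea concrete, here is a minimal hand-rolled Newton-Raphson iteration for the logistic log-likelihood (the standard iterative scheme; statistical packages use an equivalent algorithm). The simulated data, true coefficients, and starting values are purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
true_beta = np.array([-0.5, 1.2])                      # invented true coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)                                     # starting values
for _ in range(25):                                    # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ beta))                    # fitted probabilities
    gradient = X.T @ (y - p)                           # score: d(log-likelihood)/d(beta)
    W = p * (1 - p)                                    # binomial variance weights
    hessian = -(X * W[:, None]).T @ X                  # matrix of second derivatives
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step                                 # Newton update
    if np.max(np.abs(step)) < 1e-10:                   # stop once converged
        break
print(beta)                                            # maximum likelihood estimates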

The significance of each variable in the model is assessed using the Wald statistic, which is the estimate of the regression coefficient b divided by its standard error. The null hypothesis being tested is that b is zero.
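As a sketch, given a coefficient estimate and its standard error (the numbers below are invented), the Wald statistic and its two-sided p-value can be computed as:

from scipy.stats import norm

b = 0.075          # estimated regression coefficient (illustrative)
se = 0.021         # its standard error (illustrative)
wald_z = b / se    # Wald statistic: coefficient divided by standard error
# Two-sided p-value from comparison with a standard normal distribution
p_value = 2 * (1 - norm.cdf(abs(wald_z)))
print(wald_z, p_value)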

The estimated regression coefficients are used to calculate the odds ratio, which is the result most commonly reported from a logistic regression model and used to interpret the results.

e^b (the exponential of the coefficient) gives an estimate of the odds ratio.

This is the estimated effect of a particular variable in the model on the odds of the outcome, treating all other variables as fixed. For a categorical variable it gives the odds in one group relative to the other (or to a baseline group if there are more than two levels of the variable). For a continuous variable it gives the multiplicative change in odds associated with a one-unit increase in that variable. For example, suppose the estimated regression equation is:

Logit(p) = -21.18 + 0.075 age - 0.77 diabetes

For age, e^0.075 = 1.078, indicating that each one-year increase in age increases the estimated odds of the event by 7.8%.

For diabetes, e^-0.77 = 0.463, indicating that the odds of the event for people with diabetes are just under half the odds for those without. Alternatively, the odds of the event for people without diabetes are 1/0.463 = 2.16 times those for people with diabetes.
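These odds ratios can be checked directly (a minimal sketch using the coefficients from the example equation above):

import math

or_age = math.exp(0.075)       # ≈ 1.078 per one-year increase in age
or_diabetes = math.exp(-0.77)  # ≈ 0.463 for diabetes vs no diabetes
print(or_age, or_diabetes, 1 / or_diabetes)  # 1/0.463 ≈ 2.16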

An odds ratio of 1 corresponds to the null hypothesis: no difference in the odds of the event between the two groups.

An odds ratio > 1 indicates increased odds of having the event.

An odds ratio < 1 indicates decreased odds of having the event.
