CSSS 508: Intro to R - Carnegie Mellon University



CSSS 508: Intro to R

3/03/06

Logistic Regression

Last week, we looked at linear regression, which uses independent variables to predict a continuous response variable.

Very often we want to predict a binary outcome instead: Yes/No (Success/Failure).

For example, we may want to predict whether or not someone will go to college, be obese, or develop a hereditary condition.

We use logistic regression to model this scenario. Our response variable is usually coded as 0 or 1.

The formula is:

P(Y = 1) = 1 / (1 + exp[-(B0 + B1X1 + B2X2 + B3X3 + … + BpXp)])

More simply: f(z) = 1 / (1 + exp(-z)), where z = B0 + B1X1 + … + BpXp is the same linear combination of predictors we used in linear regression.

The behavior of f(z) (or P(response = 1)) looks like:

[pic]

Note that when z = 0, P(response = 1) = 0.5.

When z = 0 the linear predictor favors neither outcome, so the prediction is essentially a coin flip.

A large positive z gives a high chance of a response of 1 (probability near 1). A large negative z gives a low chance of a response of 1 (probability near 0).
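The behavior described above can be checked directly in R. This is a minimal sketch: the function name logistic is our own choice for illustration, and plogis() is base R's built-in version of the same curve.

```r
# The logistic function from the formula above (the name is our own)
logistic <- function(z) 1 / (1 + exp(-z))

# Evaluate at a few values of the linear predictor z
z <- c(-5, -1, 0, 1, 5)
round(logistic(z), 3)   # 0.007 0.269 0.500 0.731 0.993

# Base R's plogis() computes the same curve
all.equal(logistic(z), plogis(z))

# Draw the S-shaped curve of P(response = 1) against z
curve(logistic, from = -6, to = 6, xlab = "z", ylab = "P(response = 1)")
```

The value at z = 0 is exactly 0.5, and the probabilities approach 0 and 1 as z moves away from 0 in either direction.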

In R, we fit logistic regression models with the glm function, which fits generalized linear models.

This function allows for several different types of models, each with its own “family”.

For our purposes, the family specifies the type of response variable we have and the kind of model we would like to fit.

help(glm)

The arguments we’ll look at today are: formula, family, and data.

The formula and data arguments work the same way as in linear regression. If you are working with a data frame, give the formula as y ~ x1 + x2 + … + xp and set data to the name of your data frame. If every variable in the formula is already defined in your workspace, you only need the formula argument.

For logistic regression, family = binomial.

Recall that a binomial distribution models the number of successes in a set of success/failure trials. With one trial per observation, each outcome is a single success or failure, just like our response variable.
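As a rough sketch of how the three arguments fit together, here is glm run on simulated data. Everything in this example (the data frame toy, the variables y, x1, x2, and the coefficient values 0.5, 1.2, and -0.8) is made up for illustration and is not part of the course data.

```r
set.seed(1)  # make the simulated data reproducible

# Simulate two predictors and a 0/1 response (all values are hypothetical)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- 1 / (1 + exp(-(0.5 + 1.2 * x1 - 0.8 * x2)))
y  <- rbinom(n, size = 1, prob = p)  # one success/failure trial per row
toy <- data.frame(y, x1, x2)

# With a data frame: formula, family, and data
fit <- glm(y ~ x1 + x2, family = binomial, data = toy)
summary(fit)

# The variables also exist in the workspace, so data can be left out
fit2 <- glm(y ~ x1 + x2, family = binomial)

# Fitted probabilities P(y = 1) for the first few observations
head(predict(fit, type = "response"))
```

Both calls fit the same model; the data argument only tells glm where to look up the variables named in the formula.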

Let’s try it on the low infant birth weight data set.

library(MASS)

help(birthwt)

Our variables:

low: 0, 1 (Indicator of birth weight less than 2.5 kg)

age: mother’s age in years

lwt: mother’s weight in lbs

race: mother’s race (1 = white, 2 = black, 3 = other)

smoke: smoking status during pregnancy

ptl: no. of previous premature labors

ht: history of hypertension

ui: presence of uterine irritability

ftv: no. of physician visits during first trimester

bwt: birth weight in grams
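Before modeling, it can help to look at the data frame itself. The following is a small sketch using only base R functions and the birthwt data from MASS.

```r
library(MASS)  # provides the birthwt data frame

# Column names, types, and the first few values of each variable
str(birthwt)

# How many births fall in each category of the response
table(birthwt$low)

# Cross-tabulate the indicator against the 2.5 kg (2500 g) cutoff
table(low = birthwt$low, under_2500g = birthwt$bwt < 2500)
```

Note that bwt is the continuous measurement from which low is derived, so it would not normally be used as a predictor of low.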

First, attach the data set so we can access the individual variables by name:

attach(birthwt)

All variables have a natural ordering except race, which has been coded as integers. If we leave it as integers, the model will treat race as a number and estimate a single coefficient for each one-unit step, as if a change from white to black were equivalent to a change from black to other. We want to remove this ordering from the race categories.
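To see the difference, here is a minimal sketch comparing the two codings. The names bw, race_f, fit_num, and fit_fac are assumptions made for this illustration, and the model uses race as its only predictor to keep the comparison simple.

```r
library(MASS)  # provides the birthwt data frame

# Work on a copy so the original data set is left unchanged
bw <- birthwt
bw$race_f <- factor(bw$race, levels = c(1, 2, 3),
                    labels = c("white", "black", "other"))

# Race as an integer: a single slope for each one-unit step in the code
fit_num <- glm(low ~ race, family = binomial, data = bw)

# Race as a factor: a separate coefficient for each non-reference level
fit_fac <- glm(low ~ race_f, family = binomial, data = bw)

coef(fit_num)  # (Intercept) and race
coef(fit_fac)  # (Intercept), race_fblack, and race_fother
```

With the factor coding, white is the reference level, and the black and other coefficients each compare that group with white, so no ordering among the categories is assumed.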

race2 ................