
Estimating the Heckman two-step procedure to control for selection bias with SPSS

Jeroen Smits


September 2003

This paper briefly discusses two main forms of the selection bias problem and a method that can in a number of cases be used to control for this kind of bias: the Heckman two-step procedure. It then gives detailed instructions on how the Heckman procedure can be applied with the statistical package SPSS.

1. Introduction

Many statistical software packages, like SAS, STATA, or LIMDEP, offer the possibility to use the Heckman two-step procedure to control for selection bias (although the possibilities in these packages are sometimes rather limited). However, SPSS, the statistical package widely used by social researchers, contains no ready-made procedure for this method. That does not mean it is impossible to apply the method with SPSS. With some additional computations, the SPSS procedures PROBIT or LOGISTIC REGRESSION can be used to construct a Heckman selection bias control factor. This control factor can then be added to an OLS regression analysis in which selection bias is a problem, to produce unbiased parameter estimates. To also obtain correct standard errors for these parameters, a further step can be taken in which a WLS regression analysis is performed, using weights constructed on the basis of the outcomes of the earlier steps. This paper gives detailed instructions on how this can be done.

1.1 Selection bias

There are basically two versions of the selection bias problem. In the standard case of selection bias, information on the dependent variable is missing for part of the respondents. For example, if we want to estimate the effect of women's education on their income, we meet the problem that many women are not engaged in paid work and hence have no income. If a substantial part of these nonemployed women have no job because their returns to education were relatively low, running a regression with income as dependent variable and education as one of the predictors may lead to biased estimates of the effect of education on income.

In the other version of the selection bias problem, information on the dependent variable is available for all respondents, but the distribution of respondents over categories of the independent variable we are interested in has taken place in a selective way. For example, we may want to study the effect of migration on income, using a random sample of the population for which we know the income and whether or not they migrated to another place in the past. If we simply run a regression with income as dependent variable and a dummy indicating whether or not the respondent migrated in the past as one of the independent variables, we may get a biased estimate of the migration effect because the distribution of respondents over the categories of migrants and nonmigrants was not random. People who choose to migrate may differ in many (measured and unmeasured) characteristics from people who don't. If these characteristics are related to income, the coefficient of the migration dummy may pick up these effects and be biased because of this. Controlling for these differences would solve the problem. However, this is generally not possible, because in any data set the number of control factors is limited, whereas the number of possible differences among individuals is infinite. One can never be sure that all relevant differences are taken into account. This second form of selection bias is sometimes called heterogeneity bias.

Common to both forms of selection bias is that there is a selection process by which individuals are divided into two (or more) groups (employed/nonemployed; migrants/nonmigrants) and that nonrandomness in this process disturbs the estimation of other relationships which are of substantial interest. In other words, there are two processes (which can be described with two equations, called the "selection equation" and the "substantial equation") and these processes are related to each other. This relationship will be reflected in a non-zero correlation between the error terms of the two equations. If such a correlation is present, we cannot estimate the substantial equation without taking the selection process into account.

Most statistical packages which offer the possibility to estimate Heckman models restrict themselves to the standard version of the problem. However, the Heckman two-step procedure can also be used to address the other form of selection bias. In this paper I give the SPSS instructions for both.

As an example of the use of the method, I will show how it can be applied to correct two simple income equations, one in which the income of working women is explained on the basis of their age and educational level and one in which the income of respondents is explained on the basis of their age, educational level and a dummy indicator of whether or not they migrated in the past.

Readers who want to know more about selection bias or the Heckman procedure might read Breen (1996), Winship and Mare (1992) or one of the classical papers of Heckman (1976, 1979). Part of the information on how to estimate the Heckman procedure with SPSS was derived from Ploeg (1993).

1.2 The Heckman procedure

In the first step of the Heckman procedure, the selection process responsible for the selection bias problem is studied with the so-called selection model. The bias is caused by the existence of differences between employed and nonemployed women (or between migrated and nonmigrated persons) which are related to their income. So it is necessary to compare these groups (employed and nonemployed women; migrants and nonmigrants) to find out what the differences are. For this purpose, generally a probit model is estimated (because the error term of this model is normally distributed, one of the assumptions underlying the Heckman model). However, with some "tricks" other techniques, like logit analysis, can be used as well.

In our examples, the dependent variable in the probit analysis is a dummy variable indicating whether the woman is employed or whether the respondent has migrated. Independent variables in the model are the (relevant) characteristics of the respondents available in the data set; in the examples, education, age, and the number of children. In the probit analysis, we estimate the effects of these variables on the employment/migration decision. However, these effects themselves are not really of interest, because these variables are available in the data set and hence we can control for them in the income analysis. What we really want to know is the effect of the unmeasured characteristics of the respondents on the employment/migration decision. Of course, information on the effect of these unmeasured characteristics is not available in the coefficients of the explanatory variables. It is, however, present in the residuals of the probit analysis. After all, the variation that remains in the dependent variable after removing the effect of the known factors can only be caused by the influence of unknown factors.

In the Heckman procedure, the residuals of the selection equation are used to construct a selection bias control factor, called Lambda, which is equivalent to the inverse Mills ratio. This factor is a summarizing measure reflecting the effects of all unmeasured characteristics that are related to employment/migration. The value of Lambda for each of the respondents is saved and added to the data file as an additional variable.

In the second step of the Heckman procedure, the analysis we are primarily interested in is performed, in this case an OLS regression analysis of the effects of education/migration on income. In this substantial analysis, we use the selection bias control factor Lambda as an additional independent variable. Because this factor reflects the effect of all the unmeasured characteristics that are related to the employment/migration decision, its coefficient in the substantial analysis captures the part of the effect of these characteristics that is related to income. Because the analysis now contains a control factor for the effect of the income-related unmeasured characteristics that are also related to the employment/migration decision, the other predictors in the equation are freed from this effect and the regression analysis produces unbiased coefficients for them.
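The logic of the two steps can also be illustrated outside SPSS. The following Python sketch mimics the employment example on simulated data (all variable names and coefficient values are hypothetical, not from the paper): a probit selection model is fitted by maximum likelihood, Lambda is computed from the predicted probit scores, and Lambda is then added as a regressor to the OLS income equation for the selected cases.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Hypothetical simulated data mimicking the paper's example:
# employment (selection) depends on education, age, and children;
# income (substantial equation) depends on education and age.
rng = np.random.default_rng(42)
n = 5000
educ = rng.normal(12, 3, n)
age = rng.uniform(20, 60, n)
child = rng.poisson(1.5, n)
u = rng.standard_normal(n)                 # selection-equation error
e = 0.5 * u + rng.standard_normal(n)       # income error, correlated with u
employed = (0.3 + 0.10 * educ - 0.02 * age - 0.30 * child + u) > 0
income = 1.0 + 0.80 * educ + 0.05 * age + e

# Step 1: probit selection model, estimated by maximum likelihood.
Z = np.column_stack([np.ones(n), educ, age, child])
def neg_loglik(g):
    p = np.clip(norm.cdf(Z @ g), 1e-10, 1 - 1e-10)
    return -np.sum(np.where(employed, np.log(p), np.log(1 - p)))
gamma = minimize(neg_loglik, np.zeros(4), method="BFGS").x
ips = Z @ gamma                            # individual probit scores
lam = norm.pdf(ips) / norm.cdf(ips)        # Lambda = inverse Mills ratio

# Step 2: OLS on the selected sample, with Lambda as extra regressor.
X = np.column_stack([np.ones(n), educ, age, lam])[employed]
beta, *_ = np.linalg.lstsq(X, income[employed], rcond=None)
print(beta)  # intercept, educ, age, and Lambda coefficients
```

With the correlated errors above, plain OLS on the employed subsample gives a distorted education coefficient, while the Lambda-augmented regression recovers a coefficient close to the true value used in the simulation.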

1.3 Limitations

Before turning to the practical estimation of the Heckman model, a word of caution is in order. Although the procedure sounds rather appealing in theory, applying it in practice is not so straightforward. An important condition for its use is that the selection equation contains at least one variable which is not related to the dependent variable of the substantial equation. If such a variable is not present (and sometimes even if it is), severe multicollinearity problems may arise, and adding the correction factor to the substantial equation may lead to estimation difficulties and unreliable coefficients.

2. Estimation of the standard version with SPSS

2.1 The selection model

2.1.1 Computation of LAMBDA with SPSS PROBIT

To compute the Heckman correction factor Lambda with a probit selection model, the SPSS procedure PROBIT can be used. This procedure is a bit laborious because, after estimation of the model, the parameter estimates must be typed by hand into a formula to compute the predicted values of the model (which we need for computing Lambda). As an alternative, a logit model can be estimated with the procedure LOGISTIC REGRESSION, which offers the possibility to save the predicted values automatically. However, in that case a kind of "trick" must be used to translate the predicted values of the logit model into "quasi-probit" scores. This alternative is discussed later.

In our example of the effect of women's education on their income, the selection model contains the age (AGEW) and education (EDUW) of the woman and her number of children (CHILD). The dependent variable PARTW is an indicator variable with value 1 for women participating in the labor force and value 0 for other women. With PROBIT, the procedure goes as follows:

compute SUBJ=1.

PROBIT PARTW of SUBJ with AGEW EDUW CHILD

/log=none /print=none.

In the output of this analysis, we find the estimates of the parameters. On the basis of these parameters, for each respondent the predicted probit score can be computed, by typing the parameter values in the following formula (using an SPSS compute statement):

compute IPS = 0.35020-0.04691*AGEW+0.47745*EDUW+0.46660*CHILD.

With this COMPUTE statement, the individual probit scores (IPS) are computed and added to the temporary data file. These probit scores are used to compute the Heckman control factor LAMBDA.

compute LAMBDA = ((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/cdfnorm(IPS).
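This COMPUTE statement is simply the standard normal density at IPS divided by the standard normal CDF at IPS, i.e. the inverse Mills ratio. A quick Python check (using scipy's normal distribution functions as the reference) confirms that the spelled-out SPSS formula and the ratio of density to CDF agree:

```python
import numpy as np
from scipy.stats import norm

ips = np.linspace(-3, 3, 13)               # a grid of probit scores
# The SPSS formula: N(0,1) density at IPS divided by CDFNORM(IPS)
lam_spss = (1 / np.sqrt(2 * np.pi)) * np.exp(-ips * ips * 0.5) / norm.cdf(ips)
lam_ref = norm.pdf(ips) / norm.cdf(ips)    # inverse Mills ratio
print(np.max(np.abs(lam_spss - lam_ref)))  # agrees to machine precision
```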

For applying the two-step procedure it is important that all respondents with missing values on variables used in the substantial analysis are removed from the active file. This ensures that all subsequent computations are based on the same group of respondents. For example:

select if (INCW>0 and EDUW ne -9 and ....).

Now, the help and control factor DELTA is computed:

compute DELTA = -LAMBDA*IPS-LAMBDA*LAMBDA.

The value of DELTA should be between -1 and 0. This provides a check on whether LAMBDA has been computed correctly.

DESCR DELTA /statistics = min max.
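The claim that DELTA always falls between -1 and 0 can be verified numerically. The following Python check (assuming only the LAMBDA and DELTA formulas given above) evaluates DELTA over a wide range of probit scores:

```python
import numpy as np
from scipy.stats import norm

ips = np.linspace(-8, 8, 100001)           # wide grid of probit scores
lam = norm.pdf(ips) / norm.cdf(ips)        # LAMBDA as computed above
delta = -lam * ips - lam * lam             # DELTA as computed above
print(delta.min(), delta.max())            # stays inside (-1, 0)
```

Values of DELTA outside this interval in your own data therefore signal an error in the computation of IPS or LAMBDA.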

2.1.2 Computation of LAMBDA with SPSS LOGISTIC REGRESSION

A disadvantage of the procedure PROBIT is that it cannot save predicted values; the procedure LOGISTIC REGRESSION can. Because Lee (1983) developed a method to estimate the selection model with logit analysis, LOGISTIC REGRESSION offers a less laborious alternative for computing LAMBDA. Estimating the selection model with LOGISTIC REGRESSION goes as follows:

LOGISTIC REGRESSION PARTW with AGEW EDUW CHILD

/save pred (IKL).

With the instruction "/save pred (IKL)", a new variable named IKL is created and saved, containing the individual probabilities predicted by the model. Using the inverse cumulative distribution function of the standard normal distribution, these individual probabilities are translated into the form they would have had if they had been computed on the basis of a probit model:

compute IPS = probit(IKL).
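The same translation can be illustrated in Python, where scipy's norm.ppf plays the role of SPSS's PROBIT function: probabilities predicted by a logit model are pushed through the inverse normal CDF to obtain scores on the probit scale (the logit indices below are hypothetical stand-ins for a fitted model's predictions):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical linear predictors from a logit model
logit_index = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
ikl = 1 / (1 + np.exp(-logit_index))       # logistic probabilities (IKL)
ips = norm.ppf(ikl)                        # quasi-probit scores (IPS)
print(ips)
```

The transformation preserves the ordering of the cases: a probability of 0.5 maps to a quasi-probit score of 0, and higher probabilities map to higher scores.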

The variable IPS now contains the quasi-probit scores and can be used to compute LAMBDA in the same way as when using a probit selection model:

compute LAMBDA = ((1/sqrt(2*3.141592654))*(exp(-IPS*IPS*0.5)))/cdfnorm(IPS).

Again, cases with missing values on variables involved in the substantial analysis should be removed from the active file:

select if (INCW>0 and EDUW ne -9 and ....).

Computation of the help and control factor DELTA and testing whether the value of DELTA is between -1 and 0:

compute DELTA = -LAMBDA*IPS-LAMBDA*LAMBDA.

DESCR DELTA /statistics = min max.

2.2 The substantial analysis

Now that LAMBDA is known, we can use it as a correction factor to control for selection bias in the substantial analysis, which is an OLS regression analysis estimated with the procedure REGRESSION:

REGRESSION /dep=INCW

/method=enter AGEW EDUW LAMBDA

/save resid (RES).

This analysis produces unbiased parameter estimates for the independent variables. However, the standard errors of these parameters are biased because of heteroskedasticity: the variance of the error term is not the same for each respondent. To get better standard errors, several additional steps have to be taken.

2.2.1 Correcting the error terms

First, a command was added to the substantial regression analysis to save the residuals of the regression model in a new variable (which is called RES). This variable must be squared:

compute RES2 = RES*RES.

Besides RES2, two help variables must be computed. The first is the regression coefficient of LAMBDA in the OLS analysis, called LAMB; the second is the number of cases used in the OLS regression, called N (the values below are those of our example):

compute LAMB=0.002648.

compute N=9024.

The variable RES2, and also DELTA, which was computed in the first part of the analysis, have to be summed over all cases. In SPSS this can be done automatically by first saving the aggregated totals in a separate file and then merging them back in with MATCH FILES:

compute HELP = 1.

AGGREGATE /outfile=A /break=HELP

/RESS=sum(RES2)

/DELTAS=sum(DELTA).

MATCH FILES /table=A /file=* /by HELP.

Now the corrected value of the variance (VARC) and the standard error (SEC) of the error term of the substantial equation can be estimated:

compute VARC = RESS/N-LAMB*LAMB*DELTAS/N.

compute SEC = sqrt(VARC).

Computation of RHO, the correlation between the error terms of the selection and substantial equations:

compute RHO = sqrt(LAMB*LAMB/VARC).
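The arithmetic of these correction formulas can be traced in Python. The sketch below uses hypothetical residuals and probit scores (not the paper's data) together with the example values of LAMB and N, and reproduces the VARC, SEC, and RHO computations; note that taking the square root of the squared ratio yields the absolute value of the correlation:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical per-case quantities standing in for the SPSS variables
rng = np.random.default_rng(1)
n = 9024                                   # number of cases in the OLS step (N)
res = rng.normal(0, 1.2, n)                # OLS residuals (RES)
ips = rng.normal(0.4, 1.0, n)              # probit scores (IPS)
lam = norm.pdf(ips) / norm.cdf(ips)        # LAMBDA
lamb = 0.002648                            # coefficient of LAMBDA (LAMB)

ress = np.sum(res ** 2)                    # summed squared residuals (RESS)
deltas = np.sum(-lam * ips - lam * lam)    # summed DELTA (DELTAS), negative
varc = ress / n - lamb * lamb * deltas / n # corrected variance (VARC)
sec = np.sqrt(varc)                        # corrected standard error (SEC)
rho = np.sqrt(lamb * lamb / varc)          # |RHO|, absolute error correlation
print(varc, sec, rho)
```

Because DELTAS is negative, the second term adds a small positive amount to the residual variance, so VARC is always positive and RHO lies between 0 and 1.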

If (lamb ................
................
