Missing data




1. What is the problem?

2. Missing at random, etc.

3. Intent to treat—Peto

4. Listwise and casewise

a. Problems and advantages

5. Covariance

6. Coding for missing cases

7. Reweighting

8. Single imputation

9. Maximum likelihood estimation

10. Multiple imputation

11. Software

12. References


Treatment of Missing Data

David C. Howell


The problem of missing data runs through most of the research that is done in the social, behavioral, and medical sciences. In some situations the researchers designed a study that was to have complete and balanced data sets, with equal numbers of observations in each group. However, the best of intentions do not always lead to the desired result, and observations are often missing either because participants drop out of a study or because there is human or mechanical error in data collection. In other situations, especially when we are interested in studying the relationships among many variables, participants may supply data on some variables but not on others. They may miss seeing a question, they may decide that the question does not apply to them, they may think that the answer to the question is none of our business, or they may simply drop out of the study after answering only some of our questions. Whatever the reason, missing data cause considerable problems in data analysis because most of our data analytic techniques were not designed with missing data in mind.

As indicated in the preceding paragraph, there are a number of causes of missing data, and the nature of those causes plays an important role in deciding how to handle an analysis. Before looking at this issue from a more technical perspective, it may be helpful to see the kinds of problems that result by looking at two typical situations.

Imagine an experiment in which the experimenter wishes to compare the mean level of optimism across three religious groups. (This example is derived from an actual study by Sethi and Seligman, 1993.) The experimenter originally intended to sample 150 participants from each religious group, but he was unable to find enough religious moderates, and some of the data from the religious conservatives were lost. In the end he had data on 150 liberal, 125 moderate, and 140 conservative participants. For each of these 415 participants he has a score on a scale of optimism.

In this example there probably is no particular problem with missing data. There is no reason to think that “missingness” had anything to do with optimism or religious affiliation, and so the means of the three groups are probably a reasonable and unbiased reflection of the means for all liberals, moderates, and conservatives. Our standard statistical techniques (e.g., descriptive statistics, multiple regression, and the analysis of variance) can handle these data quite well, and we can go ahead and analyze the data without too many qualms.

Let us modify the experiment slightly to compare levels of depression among three different groups—150 people selected at random off the street, 140 people in a job-training program, and 125 people obtained from the local mental health center. Here I think we have a problem. I would guess that the participants from the mental health center are far more likely to be depressed, and, because of that depression, many of them may be unwilling to participate. The same may be true, to a lesser extent, for the people from the job-training program. If the more severely depressed of the mental health patients refuse to participate because of their depression, then the mean level of depression in that group is likely to be underestimated; the same holds, though less so, for the job-training participants. In the situation that I have described here we will have biased results because our missing data are caused, at least in part, by depression. In fact, in this particular case there may be no way to overcome the bias resulting from missingness because it is, in part, a function of depression.

We have just examined two situations that look similar on the surface. In both we had three groups of participants with unequal numbers in each group. In both we wished to compare group means and perhaps other descriptive statistics. But in one of these situations missing data create a serious problem, and in the other they do not. What general feature is there that distinguishes these two situations?

Before I answer that question, we need to consider another underlying issue in the treatment of missing data. The examples that we have seen so far were based on only one dependent variable (either optimism or depression), and our underlying analysis was a comparison of group means. But social scientists also collect multivariate data, where several or many variables are obtained from each participant, and our goal is to predict one of those variables from knowledge of the others. For example, we might wish to predict job satisfaction from data on salary, length of service, amount of responsibility, gender, and flexibility in hours of work. It is quite likely that with many participants, some will be missing data on one or more of these variables. In fact there may be so much missing data that our ability to draw conclusions is put at risk. How may we handle such a situation?

So far we have made two kinds of distinctions. The first involves why the data are missing, and the second involves the number of variables on which we have missing data. In the univariate situation, if someone is missing they are simply missing—we have no data on them. In the multivariate situation most participants will have data on at least some variables, but may not have it on all variables. In what follows I will elaborate mostly on the first distinction (why are the data missing), but will also speak to the distinction between univariate and multivariate data. Most of what follows will focus on the multivariate case.

1.1 The nature of missing data

Missing completely at random

Rubin (1976) made several important distinctions that help us to sort out the nature of missing data. He characterized data as “missing completely at random,” “missing at random,” or “not missing at random.”

There are several reasons why data may be missing that have nothing to do with the values those data would have. They may be missing because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly. Here the data could be missing completely at random (MCAR). When we say that data are missing completely at random, we mean that the probability that an observation (Xi) is missing is unrelated to the value of Xi or to the value of any other variable. Suppose that we were collecting data on depression, marital status, family income, and diet. Data on depression would be considered missing completely at random if the fact that an observation was missing had nothing to do with the person’s potential scores on depression, marital status, family income, or diet. If people who eat an unbalanced diet were more likely to have missing data on depression, the data could not be considered missing completely at random.

Notice that it is the value of the observation, and not its "missingness," that is important. If people who refused to report personal income were also likely to refuse to report their level of depression, the data could still be considered MCAR, so long as neither of these had any relation to the income score, the depression score, or any other variable. This is an important consideration, because when a data set consists of responses to several survey instruments, someone who did not complete the Beck Depression Inventory, for example, would be missing all BDI subscores, but that would not affect whether the data can be classed as MCAR. Notice that for a variable to be classed as missing completely at random, missingness cannot be related either to the level of that variable or to the level of any other variable in the data set. It can, however, be related to the missingness of other variables.

If we are going to have missing data, we prefer that they be missing completely at random. This makes the situation far easier, though not necessarily easy, to deal with.

Missing at random

Often data are not missing completely at random, but they may be classifiable as missing at random (MAR). For data to be missing completely at random, the probability that Xi is missing must be unrelated to the value of Xi or to the value of any other variable. But the data can be considered missing at random if missingness does not depend on the value of Xi, even if it does depend on the value of some other variable. Thus if the probability that an individual is missing a depression score is related to that person’s marital status, family income, or diet, the data can still be considered missing at random. However, the data are not missing at random if the probability that a depression score is missing depends on the value of that depression score. If depression scores are more likely to be missing for people with poor diets, the data can still be considered missing at random (though not missing completely at random). But if depression scores are more likely to be missing for depressed individuals, the data are not missing at random.

For completeness I should point out that data could be missing at random even if there is some relationship between variable Y and missingness, caused by the fact that both Y and missingness are correlated with X, so long as the relationship between Y and missingness disappears after we control for X. For example, let’s use family income as our variable of interest. Family income is certainly related to personal income. If personal income is correlated with missingness, then the correlation between family income and missingness is almost certainly not 0. But if that correlation drops to 0.00 after we control for personal income, then the data on family income are missing at random. This is probably more than many readers want to know, but I needed to say it to be completely accurate.

Missing not at random

If the data are not missing at random or completely at random, they are classed as missing not at random (MNAR). We are going to find that this is the most difficult situation for us to deal with. In fact, there are times when we are unable to deal with it at all realistically.

To summarize in a nontechnical way: if data are missing for reasons that have nothing to do with the values of any of our variables, the data are missing completely at random. If data are missing for reasons that may have to do with the other variables, but not with this variable, then the variable is missing at random. Finally, if data are missing for reasons that have to do with the value of this particular variable, then the data are missing not at random. Notice that in all of this we are taking one variable at a time. We speak about whether depression scores are missing at random, and then we move on to speaking about whether family income is missing at random. We may well have a data set where some variables are missing at random and other variables are missing not at random.
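To make the three mechanisms concrete, here is a minimal simulation sketch in Python. The variable names and the logistic form of the missingness models are my own illustration, not part of Rubin's definitions: depression scores are deleted completely at random, at random as a function of diet, or not at random as a function of the depression score itself.

import numpy as np

rng = np.random.default_rng(1)
n = 10_000
diet = rng.normal(0, 1, n)                        # diet quality (illustrative)
depression = 0.5 * diet + rng.normal(0, 1, n)     # depression is correlated with diet

def with_missing(x, drop):
    """Return a copy of x with NaN wherever drop is True."""
    x = x.copy()
    x[drop] = np.nan
    return x

mcar = with_missing(depression, rng.random(n) < 0.3)                                     # unrelated to anything
mar  = with_missing(depression, rng.random(n) < 1 / (1 + np.exp(-(diet - 0.5))))         # depends only on diet
mnar = with_missing(depression, rng.random(n) < 1 / (1 + np.exp(-(depression - 0.5))))   # depends on depression itself

# The mean of the observed cases is essentially unbiased only under MCAR.
for label, x in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(label, round(float(np.nanmean(x)), 3), "vs. full-sample mean", round(float(depression.mean()), 3))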

Ignorability

If data are MCAR or MAR, we say that missingness is ignorable. By that we mean that we don't have to model the process that produced the missingness. In other words, we don’t have to worry about a statistical model that explains missingness. If, on the other hand, data are not missing at random, a complete treatment of missing data would have to include a model that accounts for the missing data. This chapter does not deal with data where missingness is not ignorable. However, Schafer and Graham (2003) have pointed out that “good performance is often achievable through likelihood or Bayesian methods without specifically modeling the probabilities of missingness, because in many psychological research settings the departures from MAR are probably not serious” (p. 154).

Intent to treat

For a moment let us step outside the issue of randomness of missing cases and look at a controversial approach to the treatment of missing data—the “intention to treat” methodology. This methodology is particularly important in situations where participants drop out of a study or move between treatments for reasons having to do with the effectiveness of the treatment. It was originally proposed by Richard Peto at Oxford University as a way to deal with data in which participants were originally randomly assigned to treatments but, for a variety of reasons, ended up receiving a different treatment.

Missing data are a part of almost all research, and we all have to deal with them from time to time. There are a number of alternative ways of approaching missing data, and this document is an attempt to outline those approaches. For historical reasons, a large section of the document deals with an approach involving dummy variables for identifying missing observations. This approach was popularized by Cohen and Cohen (1983), and has been well received by psychologists. However, it does not produce unbiased parameter estimates (Jones, 1996), and alternative approaches are also discussed. For a very thorough book-length treatment of the issue of missing data, I recommend Little and Rubin (1987). A shorter treatment can be found in Allison (2002).

 

1.2 The simplest approach--listwise deletion.

By far the most common approach is to simply omit those cases with missing data and to run our analyses on what remains. Thus if 5 subjects in Group One don't show up to be tested, that group is 5 observations short.  Or if 5 individuals have missing scores on one or more variables, we simply omit those individuals from the analysis. This approach is usually called listwise deletion, but it is also known as complete case analysis. 

Although listwise deletion often results in a substantial decrease in the sample size available for the analysis, it does have important advantages. In particular, under the assumption that data are missing completely at random, it leads to unbiased parameter estimates. The alternative approaches discussed below should be considered in relation to listwise deletion. However, in some cases we are better off to "bite the bullet" and fall back on listwise deletion.
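Listwise deletion is what most packages do by default when a case has any missing value among the analysis variables. As a minimal pandas sketch (the file and column names here are only illustrative):

import pandas as pd

df = pd.read_csv("survey.csv")                  # illustrative file name
complete = df.dropna()                          # listwise (complete case) deletion: drop any row with a missing value
print(len(df), "cases read;", len(complete), "retained after listwise deletion")

# Restricting the deletion to the variables actually used in a given analysis keeps more cases:
analysis_cases = df.dropna(subset=["optimism", "relinflu"])   # hypothetical column names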

 

1.3  Traditional treatments for missing data

Regression Models versus ANOVA models

I am about to make the distinction between regression and ANOVA models. This may not be the distinction that others would make, but it makes sense to me. I am really trying to distinguish between cases in which group membership is unknown and cases in which scores on a continuous predictor variable are unknown.

  

Missing Identification of Group Membership

I will begin with a discussion of an approach that probably won't seem very unusual. In experimental research we usually know to which group a subject belongs because we specifically assigned them to that group. Unless we somehow bungle our data, group membership is not a problem. But in much applied research we don't always know group assignments. For example, suppose that we wanted to study differences in optimism among different religious denominations. We could do as Sethi and Seligman (1993) did and hand out an optimism scale in churches and synagogues, in which case we have our subjects pretty well classified because we know where we recruited them. However we could also simply hand out the optimism scale to many people on the street and ask them to check off their religious affiliation. Some people might check "None," which is a perfectly appropriate response. But others might think that their religious affiliation is not our business, and refuse to check anything, leaving us completely in the dark. I would be hard pressed to defend the idea that this is a random event across all religious categories, but perhaps it is. Certainly "no response" is not the same as a response of "none," and we wouldn't want to treat it as if it were.

The most obvious thing to do in this situation would be to drop all of those non-responders from the analysis, and to try to convince ourselves that these are data missing at random. But a better approach is to make use of the fact that non-response is itself a bit of data, and to put those subjects into a group of their own. We would then have a specific test on the null hypothesis that nonresponders are no different from other subjects in terms of their optimism score. And once we establish the fact that this null hypothesis is reasonable (if we should) we can then go ahead and compare the rest of the groups with somewhat more confidence. On the other hand, if we find that the non-responders differ systematically from the others on optimism, then we need to take that into account in interpreting differences among the remaining groups.

Example

I will take the data from the study by Sethi and Seligman (1993) on optimism and religious fundamentalism as an example, although I will assume that data collection involved asking respondents to supply religious affiliation. These are data that I created and analyzed elsewhere to match the results that Sethi and Seligman obtained, although for purposes of this example I will alter those data so as to remove "Religious Affiliation" from 30 cases. I won't tell you whether I did this randomly or systematically, because the answer to that will be part of our analysis. The data for this example are contained in a file named FundMiss.dat, which is available for downloading, although it is much too long to show here. (The variables are, in order, ID, Group (string variable), Optimism, Group Number (a numerical coding of Group), Religious Influence, Religious Involvement, Religious Hope, and Miss (to be explained later).) You will note that when respondents are missing any data, the data are missing on Group membership and on all three religiosity variables. (Missing values are designated here with a period (.); see footnote 1.) This is the kind of result you might find if the religiosity variables all come off the same measurement instrument and that instrument also has a place to record religious affiliation. We see cases like this all the time. The dependent variable for these analyses is the respondent's score on the Optimism scale, and the resulting sample sizes, means, and standard deviations are shown in Table 1, as produced by SPSS.


- - Description of Subpopulations - -

Summaries of OPTIMISM by levels of GROUPNUM (Group Membership)

GROUPNUM   Value Label          Mean      Std Dev   Cases

           Entire Population    2.1633    3.2053     600
    1      Fundamentalist       3.0944    2.8573     180
    2      Moderate             1.9418    3.1629     275
    3      Liberal               .8783    3.2985     115
    4      Missing              3.5333    3.1919      30

Total Cases = 600

Table 1 Descriptive Statistics for Optimism as a Function of Group Membership


From this table we see that there are substantial differences among the three groups for whom Religious Affiliation is known. We also see that the mean for the Missing subjects is much closer to the mean of the Fundamentalists than to the other means, which might suggest that Fundamentalists were more likely to refuse to provide a religious affiliation than were members of the other groups.
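For readers who want to reproduce descriptive statistics like those in Table 1 outside SPSS, a small pandas sketch follows. It assumes that FundMiss.dat is whitespace-delimited with the column order described above and that periods mark missing values; the exact file layout is an assumption on my part.

import pandas as pd

cols = ["id", "group", "optimism", "groupnum", "relinflu", "relinvolv", "relhope", "miss"]
df = pd.read_csv("FundMiss.dat", sep=r"\s+", header=None, names=cols, na_values=".")

# Mean, standard deviation, and count of Optimism for each group (4 = Missing)
print(df.groupby("groupnum")["optimism"].agg(["mean", "std", "count"]))
print(df["optimism"].agg(["mean", "std", "count"]))     # entire sample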

The results of an analysis of variance on the Optimism scores of all four groups are presented in Table 2. Here I have asked SPSS to use what are called "Simple Contrasts" with the last (Missing) group as the reference group. This will cause SPSS to print out a comparison of each of the first three groups with the Missing group. I chose to use simple contrasts because I wanted to see how Missing subjects compared to each of the three non-missing groups, and that option happens to do precisely what I wanted.


Table 2 Analysis of Variance with All Four Groups -- Simple Contrasts


At the top of Table 2 you see the means of the four groups, as well as the unweighted mean of all groups (labeled Grand Mean). Next in the table is an analysis of variance summary table, showing that there are significant differences between groups, F(3, 596) = 14.395, p < .001.

A moment's calculation will show you that the difference between the mean of Fundamentalists and the mean of the Missing group is 3.094 - 3.533 = -0.439. Similarly the Moderate group mean differs from the mean of the Missing group by 1.942 - 3.533 = -1.591, and the Liberal and Missing means differ by 0.878 - 3.533 = -2.655. Thus participants who do not give their religious affiliation have Optimism scores that are much closer to those of Fundamentalists than those of the other affiliations.

In the section of the table labeled "Parameter Estimates" we see the coefficients -0.439, -1.591, and -2.655. You should note that these coefficients are equal to the difference between each group's mean and the mean of the last (Missing) group. Moreover, the t values in that section of the table represent significance tests on the deviations from the mean of the Missing group, and we can see that the Missing group deviates significantly from the Moderates and the Liberals, but not from the Fundamentalists. This suggests to me that there is a systematic pattern of non-response which we must keep in mind when we evaluate our data. Subjects are not missing at random. (Notice that the coefficient for Missing is set at 0 and labeled "redundant." It is redundant because if someone is not in the Fundamentalist, Moderate, or Liberal group, we know that they are missing. "Missing," in this case, adds no new information.)

Orthogonal Contrasts

You might be inclined to suggest that the previous analysis doesn't give us exactly what we want because it does not tell us about relationships among the three groups having non-missing membership. In part, that's the point, because we wanted to include all of the data in a way that told us something about those people who failed to respond, as well as those who did supply the necessary information.

However, for those who want to focus on those subjects who provided Religious Affiliation while not totally ignoring those who did not, an alternative analysis would involve the use of orthogonal contrasts not only to compare the non-responders with all responders, but also to make specific comparisons among the three known groups.

You can use SPSS (OneWay) or any other program to perform the contrasts in question. (Or you can easily do it by hand.) Suppose that I am particularly interested in knowing how the non-responders differ from the average of all responders, but that I am also interested in comparing the Fundamentalists and Moderates combined with the Liberals, and then the Fundamentalists with the Moderates. I can run these contrasts by providing SPSS with the following coefficients.

Missing vs. Non-Missing                     1    1    1   -3

(Fundamentalist & Moderate) vs. Liberal     1    1   -2    0

Fundamentalist vs. Moderate                 1   -1    0    0

The first contrast deals with those missing responses that have caused us a problem, and the second and third contrasts deal with differences among the identified groups. The results of this analysis are presented below. (I have run this using SPSS syntax because it produces more useful printout.)


ONEWAY

optimism BY groupnum(1 4)

/CONTRAST= 1 1 1 -3 /CONTRAST= 1 1 -2 0 /CONTRAST= 1 -1 0 0

/HARMONIC NONE

/FORMAT NOLABELS

/MISSING ANALYSIS .


    Table 3 OneWay Analysis of Variance on Optimism with Orthogonal Contrasts


Notice in Table 3 that the contrasts are computed with and without pooling of error terms. In our particular case the variances are sufficiently equal to allow us to pool error, but, in fact, for these data it would not make any important difference to the outcome which analysis we used. In Table 3 you will see that all of the contrasts are significant. This means that non-responders are significantly different from (and more optimistic than) responders, that Fundamentalists and Moderates combined are more optimistic than Liberals, and that Fundamentalists are in turn more optimistic than Moderates.
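If you want to verify the pooled-error version of these contrasts by hand, the arithmetic needs only the group means, standard deviations, and sample sizes reported in Table 1. A sketch of that calculation follows; it assumes roughly equal group variances, which is what the pooled error term requires.

import numpy as np

means = np.array([3.0944, 1.9418, 0.8783, 3.5333])    # Fundamentalist, Moderate, Liberal, Missing (Table 1)
sds   = np.array([2.8573, 3.1629, 3.2985, 3.1919])
ns    = np.array([180, 275, 115, 30])

df_error = ns.sum() - len(ns)                          # 600 - 4 = 596
mse = np.sum((ns - 1) * sds**2) / df_error             # pooled (error) variance

contrasts = [("Missing vs. non-missing",      [1, 1, 1, -3]),
             ("(Fund + Mod) vs. Liberal",     [1, 1, -2, 0]),
             ("Fundamentalist vs. Moderate",  [1, -1, 0, 0])]

for label, c in contrasts:
    c = np.array(c, dtype=float)
    psi = np.dot(c, means)                             # contrast estimate
    se = np.sqrt(mse * np.sum(c**2 / ns))              # standard error with pooled error
    print(f"{label}: psi = {psi:.3f}, t({df_error}) = {psi / se:.2f}")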

I have presented this last analysis to make the point that you have not lost a thing by including the missing subjects in your analysis. The second and third contrasts are exactly the same as you would have run if you had only used the three identified groups. However, this analysis includes the variability of Optimism scores from the Missing group in determining the error term, giving you somewhat more degrees of freedom. In a sense, you can have your cake and eat it too.

Cohen and Cohen (1983, Chapter 7) provide additional comments on the treatment of missing group membership, and you might look there for additional ideas. In particular, you might look at their treatment of the case where there is missing information on more than one independent variable.

This situation, where data on group membership are missing, is handled well by the analysis above. Notice that the analysis does not depend on the nature of the mechanism behind the missingness; indeed, that mechanism is itself addressed by the analysis. This will not necessarily be the case in the following analyses, where the nature of the missingness is important.

Missing Information on a Single Continuous Independent Variable

The case of missing group membership was actually the easier problem to deal with, because the resulting analysis is straightforward. You simply treat the respondents with missing group membership as a group unto themselves, and proceed from there. But suppose that you have a continuous independent variable and want to use it to predict some dependent variable. For example, suppose that you want to predict Optimism on the basis of Religious Influence, but that some of the respondents are missing data on Religious Influence. We could throw out those subjects, but then we are not only losing power (which we can afford to lose with this many observations), but we also risk running our analyses on a partial set of data that are not representative of the population of respondents. The solution to this problem is quite straightforward, but it will involve some steps that may strike you as strange, if not downright dishonest. I promise you, however, that they are neither strange nor dishonest. What I am talking about here is the approach advocated by Cohen and Cohen (1983). Alternative approaches will come later.

I will start with a simple example involving the data on optimism and religiosity that we have been discussing. I'll then move to a more complex example involving different data. In the first example we will try to predict Optimism on the basis of the Religious Influence score (RelInflu), which is a self-reported measure of the influence of religion in the respondent's daily life.

An important principle in our analysis is that information about the presence or absence of data is itself important. We are going to retain all of our subjects, but we will keep track of which subjects do, and do not, have missing data on RelInflu by means of an indicator variable, which I will label Miss. This variable will be coded 1 if the data are missing, and 0 if not. We could have coded the variable in reverse, but this method is in keeping with naming a variable on the basis of its upper end. This variable is included in the data set that you can retrieve as FundMiss.dat. We cannot, however, simply predict Optimism from RelInflu as those variables stand at present, because any statistical package will automatically delete the 30 cases in which RelInflu is missing. We need to fill those missing values with something. It turns out that we can replace the missing values with any constant we want, just so long as all missing values are replaced by the same value. It really doesn't matter what value you choose. It could be 0, 9, 99, 28.376, or whatever. I am going to choose to replace missing values with the mean RelInflu score (4.50), because that will turn out to be a bit more convenient later. But I could use any value I wish, even values that are legitimate values when data are present. Just be sure that you fill in every missing value with the same number.

Rather than alter the existing variable, I will create a new variable and operate on that. Any statistical package will create values for you, and I used SPSS to define

NewInflu = RelInflu (if RelInflu = non-missing)

NewInflu = 4.50 (if RelInflu = missing).
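The same recode is easy to do in other packages. A minimal pandas sketch, assuming the file has been read as in the earlier sketch (the column names are mine):

import pandas as pd

cols = ["id", "group", "optimism", "groupnum", "relinflu", "relinvolv", "relhope", "miss"]
df = pd.read_csv("FundMiss.dat", sep=r"\s+", header=None, names=cols, na_values=".")

df["miss"] = df["relinflu"].isna().astype(int)     # 1 = missing, 0 = present (matches the Miss variable in the file)
df["newinflu"] = df["relinflu"].fillna(4.50)       # any constant would do; the mean (4.50) is convenient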

Now I can't just predict Optimism from NewInflu, because the program would have no way to know which of the NewInflu values are real and which are arbitrary. And certainly the result would differ depending on which arbitrary values I used to replace missing values. But NewInflu and Miss together provide all the information we had in RelInflu.

We are going to regress Optimism on both NewInflu and Miss, but let's start out by looking at those separately. We will do our analyses in a hierarchical fashion, first predicting Optimism from Miss, and then adding NewInflu as a predictor. Remember that Miss is a dichotomous variable, and the regression using Miss asks whether Optimism scores vary with Miss. In other words, that regression will tell us whether Optimism scores are higher (or lower) on average for people with missing data than for people with complete data.

The results of using just Miss as the predictor for Optimism are presented in Table 4.


Table 4. Optimism predicted from Miss


From Table 4 you can see that the regression of Optimism on Miss is significant (F(1, 598) = 5.82, p = .016). The slope is positive (1.442), indicating that Optimism scores are higher for those who did not have a score for Religious Influence than for those who did. The intercept (2.09) is the mean Optimism score for people with complete data, and the slope (1.442) represents the difference between the means of people who had scores of 0 and 1 on Miss. In other words, on average those with missing data on religious influence scored 1.44 units higher on Optimism than those who had a score for religious influence. This is useful information in its own right. (Note: this interpretation of the slope is possible only because the codes for Miss (0 and 1) differ by exactly one unit.)

Note also the R-squared for this relationship shown in Table 4. Here R2 is approximately .01, indicating that 1% of the variability in Optimism scores is associated with the presence or absence of data on RelInflu. While this is not a very large percentage, it is significant, partly because N = 600.

So we have learned one thing already, and that is that the existence of missing data is not random; it is associated with people with higher Optimism scores. The next thing to do is to ask what religious influence (as coded in NewInflu) has to add over and above the effect of missing data. This is a hierarchical regression, because we are looking at what is added as we increase the predictors. The result of the regression predicting Optimism from Miss and NewInflu is shown in Table 5.
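For readers working outside SPSS, here is a sketch of the same two-step (hierarchical) regression using statsmodels, continuing the pandas recode above; the printed R-squared values and coefficients correspond to the quantities discussed in connection with Tables 4 and 5.

import pandas as pd
import statsmodels.formula.api as smf

cols = ["id", "group", "optimism", "groupnum", "relinflu", "relinvolv", "relhope", "miss"]
df = pd.read_csv("FundMiss.dat", sep=r"\s+", header=None, names=cols, na_values=".")
df["newinflu"] = df["relinflu"].fillna(4.50)                 # constant (the mean) substituted for missing values

m1 = smf.ols("optimism ~ miss", data=df).fit()               # step 1: the missingness indicator alone
m2 = smf.ols("optimism ~ miss + newinflu", data=df).fit()    # step 2: add the recoded predictor

print(m1.rsquared, m2.rsquared)              # R-squared at each step
print(m2.rsquared - m1.rsquared)             # increment = squared semipartial for NewInflu
print(m2.summary())                          # slopes, t tests, and standard errors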


Table 5 Regression of Optimism on Miss and NewInflu


The first thing to notice in Table 5 is that R2 has increased from .0096 to .082, which is an increase of .072. This is the squared semi-partial correlation between NewInflu and Optimism, controlling for the presence of missing data. We could compute an F test on this increment, but it would be equivalent to the t test on the regression coefficient for NewInflu, which is 6.838. This increment is statistically significant, leading us to conclude that there is a true relationship, after controlling for Miss, between Optimism and the degree to which an individual says that religion is influential in his/her life. And because the slope for NewInflu is positive, greater religious influence is associated with greater optimism.

But what are we to make of the regression coefficients themselves? This is most easily seen if we step back and regress Optimism on the original RelInflu, which has data on just the 570 cases where participants responded. An abbreviated form of this result is shown in Table 6.


Table 6 Regression of Optimism on RelInflu


Notice in Table 6 that the slope and intercept for RelInflu are exactly the same as they were in Table 5, where we predicted Optimism from Miss and a revised NewInflu which had an arbitrary constant substituted for missing data. (The standard errors for the slopes look exactly the same, but carried to 5 decimal places they are .09796 and .09777, respectively. I point this out because the standard errors are slightly larger when we use Miss as a predictor and replace missing data with a constant. Notice also that the standardized regression coefficients are substantially different.) This means that Table 5 contains the same information as Table 6 in terms of the contribution of religious influence, while at the same time making use of all 600 cases. In this particular case we have so many degrees of freedom that the addition of 30 cases to our analysis is not important, but there are many situations where we would dearly love to involve more cases in our analysis, partly for the added power. In addition, by adding these extra 30 cases that we would otherwise have lost, we are in a position to say something about the meaning of missing data. We know that people who decline to provide data are more likely to have high optimism scores, which is something that we would not otherwise have known had we simply deleted those cases.

Table 7 below shows the results from the complete data set of 600 cases before I systematically converted some observations to missing. Notice that the coefficient for RelInflu is significant, though it is different from the estimates above. This reflects both the influence of the missing observations and the fact that those observations were deliberately made missing in a non-random way.


Table 7  Regression for Original Data with no Missing Cases

Jones (1996) has argued that the approach suggested here can lead to biased estimates of regression coefficients. However, when you have only one independent variable, the regression coefficient, after controlling for missing data, will be the same as the regression coefficient on the original X with casewise deletion. This makes sense, because when we control for missing values, we essentially calculate the relationship between X and Y separately when Miss = 1 and when Miss = 0. But when Miss = 1, X is a constant (the mean of X that we substituted for the missing scores), and the slope is undefined. When Miss = 0, we have the same data that we have with casewise deletion, and the slope will equal the slope with casewise deletion. In the final result, the undefined slope is ignored, and we have only the coefficient for the complete data.

However, when we use two or more predictor variables, with data missing on only one variable, the situation is different. For the cases where Miss = 0, we have the same coefficient we would have with casewise deletion. However, for the cases where Miss = 1, X will be a constant and will be dropped from the model, leaving Y = f(Z). Thus the final result will be a combination of Y = f(X, Z) for the complete data and Y = f(Z) for the missing data, and it will not be the same as for casewise deletion.

 

Missing Information on Multiple Continuous Independent Variables

We have just used an example where we have a single predictor variable and are asking how it predicts Optimism. Furthermore, I have indicated what would happen if we used multiple independent variables, but had missing data on only one of them. But what do we do with those situations in which we have multiple predictor variables, each of which is missing some observations? Well, that depends on how those observations are missing.

I have chosen a deliberately extreme, but very realistic, example where several independent variables had exactly the same pattern of missing data. This kind of situation often occurs when our variables are collected with the same instrument, such as a questionnaire, and a respondent who fails to turn in the instrument is certain to have missing data on all of the scales that are based on that instrument. In this case, the solution is simple. If we wanted to predict Optimism from both Religious Influence and Religious Involvement, we would create a variable (Miss) coding who was and who was not missing data, substitute the mean of the relevant variable in place of missing data, and run the multiple regression just as we did in the simple case. Here the influence of those two variables combined would be the hierarchical increment in R2 between the regression with just Miss as the independent variable and the regression with all three independent variables. If you have any question about this, just run the regressions as described, and again with missing observations omitted, and note the parallels.

A second situation is one in which we have two (or more) predictor variables and missing data on only one of those variables. This was discussed above.

A third situation gets more cumbersome. Suppose that we have two (or more) predictor variables, each of which has its own pattern of missing data. Here we may have a problem. If the pattern of missing data is not systematic (i.e., if the presence or absence of data on one predictor is independent of the presence or absence of data on the other predictor), then we can proceed as above. We substitute a constant for each of the missing observations (it could be a different constant for each variable), create multiple variables (e.g., Miss1, Miss2, and so on) carrying information about which participant is missing data on which variable, and then run our analyses. We would first predict the dependent variable from the two or more Miss variables (Miss1, Miss2, etc.) and then add in our modified independent variables. This arrangement will work nicely, just so long as "missingness" is random. I assume, though I have not worked it out, that we would have some of the same problems we had with one missing and one complete variable. I would suggest that you solve the problem both with dummy variables and with casewise deletion, and compare the results.
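A sketch of that arrangement with two predictors, each carrying its own indicator, appears below. The data are simulated and the names (x1, x2, y) are illustrative, since in the FundMiss data the predictors share a single missing-data pattern and one Miss indicator suffices.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: two predictors, each with its own, independent pattern of missing values
rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 0.5 * df["x1"] + 0.3 * df["x2"] + rng.normal(size=n)
df.loc[rng.random(n) < 0.10, "x1"] = np.nan
df.loc[rng.random(n) < 0.10, "x2"] = np.nan

for v in ["x1", "x2"]:                                 # one indicator and one recoded variable per predictor
    df["miss_" + v] = df[v].isna().astype(int)
    df["new_" + v] = df[v].fillna(df[v].mean())        # any constant; each variable's own mean is convenient

base = smf.ols("y ~ miss_x1 + miss_x2", data=df).fit()
full = smf.ols("y ~ miss_x1 + miss_x2 + new_x1 + new_x2", data=df).fit()
print(full.rsquared - base.rsquared)                   # joint contribution of the two predictors

# The troublesome case discussed next shows up as a high correlation between the indicators:
print(df[["miss_x1", "miss_x2"]].corr())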

Unfortunately, "missingness" is rarely random. People who, for one reason or another, don't respond on one variable, tend not to respond on others. Thus the same people pop up over and over again in our list of missing cases. This was fine above when exactly the same people were missing in each case, because Miss could unambiguously code them. We didn't need a Miss1 and a Miss2, because they would be exactly the same variable. But an equally likely case is one in which respondents who are missing one variable generally, but not always, are missing on another variable. In this situation we need to have Miss1, Miss2, etc to code the different patterns of missing. But, unfortunately, Miss1 and Miss2 are going to be highly intercorrelated because they point to almost the same people. This high intercorrelation is going to create serious problems with our regression solution, making the results unstable. I don't have a solution for this situation. Perhaps you can throw out a few sets of data and end up with a large group missing exactly the same observation, which takes us back to the first paragraph in this section. Baring that, I'm afraid you are on your own.

 

1.4  Alternative Approaches--Maximum Likelihood and Multiple Imputation

I am going to go fairly lightly on what follows, because the solutions are highly technical. As implied by the title of this section, the solutions fall into two categories--those which rely on maximum likelihood solutions, and those which involve multiple imputation.

 

Maximum Likelihood

The principle of maximum likelihood is fairly simple, but the actual solution is computationally complex. I will take an example of estimating a population mean, because that illustrates the principle without complicating the solution.

Suppose that we had the sample data 1, 4, 7, 9 and wanted to estimate the population mean. You probably already know that our best estimate of the population mean is the sample mean, but forget that bit of knowledge for the moment.

Suppose that we were willing to assume that the population was normally distributed, simply because this makes the argument easier. Let the population mean be represented by the symbol μ, although in most discussions of maximum likelihood we use a more generic symbol, θ, because it could stand for any parameter we wish to estimate.

We could calculate the probability of obtaining a 1, 4, 7, and 9 for a specific value of μ. This would be the product p(1)*p(4)*p(7)*p(9). You would probably guess that this probability would be very very small if the true value of μ = 10, but would be considerably higher if the true value of μ were 4 or 5. (In fact, it would be at its maximum for μ = 5.25.) For each different value of μ we could calculate p(1), etc. and thus the product. For some value of μ this product will be larger than for any other value of μ. We call this the maximum likelihood estimate of μ. It turns out that the maximum likelihood estimator of the population mean is the sample mean, because we are more likely to obtain a 1, 4, 7, and 9 if μ = the sample mean than if it equals any other value.
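A small numerical sketch of that idea follows. It assumes, purely for illustration, a normal population with a known standard deviation of 1; the likelihood of the sample is evaluated over a grid of candidate values of the mean, and it peaks at the sample mean.

import numpy as np
from scipy.stats import norm

x = np.array([1, 4, 7, 9])
candidates = np.arange(0.0, 10.01, 0.01)                 # candidate values of the population mean

# Log-likelihood of the whole sample at each candidate mean (sigma fixed at 1 for illustration)
loglik = np.array([norm.logpdf(x, loc=mu, scale=1).sum() for mu in candidates])

best = candidates[np.argmax(loglik)]
print(round(float(best), 2), x.mean())                   # both are 5.25: the ML estimate is the sample mean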

The same principle applies in regression, although it is considerably more complicated. If we assume a multivariate normal distribution, we can calculate maximum likelihood estimates of the means, variances, and covariances given the sample data. These are the values of those parameters that would make the data we obtained maximally likely. Once we have these estimates, we can use them to derive the optimal regression equation.

 

The EM Algorithm

There are a number of ways to obtain maximum likelihood estimators, and one of the most common is called the Expectation-Maximization algorithm, abbreviated as the EM algorithm. The basic idea is simple enough, but the calculation is more work than you would want to do. 

If we wanted to carry out the EM algorithm by hand, we would first take estimates of the variances, covariances, and means, perhaps from listwise deletion. We would then use those estimates to solve for the regression coefficients, and then estimate the missing data based on those regression coefficients. (For example, we would use whatever data we have to estimate Y = bX + a, and then use X to estimate Y wherever Y is missing.) This is the "expectation" step of the algorithm. Having filled in the missing data with these estimates, we would then use the complete data (including the estimated values) to recalculate the regression coefficients. (The new estimates would be adjusted to model sampling error, but that is a technical issue.) This is the "maximization" step. Having new regression coefficients, we would re-estimate the missing data, calculate new coefficients, and so on. We would continue this process until the estimates no longer changed noticeably. At that point we have maximum likelihood estimates of the parameters, and we can use those to obtain the maximum likelihood estimates of the regression coefficients.
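The stripped-down sketch below conveys the iterative idea for two variables, with X predicting Y and some Y values missing. It omits the adjustment for sampling error mentioned above, so it is regression imputation iterated to convergence rather than a full EM implementation; the function name is mine.

import numpy as np

def iterated_regression_imputation(x, y, tol=1e-6, max_iter=100):
    """Fill missing y values by repeatedly fitting y = a + b*x and re-imputing until the estimates settle."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    missing = np.isnan(y)
    if not missing.any():
        b, a = np.polyfit(x, y, 1)
        return a, b, y
    y[missing] = np.nanmean(y)                      # crude starting values for the missing scores
    a = b = 0.0
    for _ in range(max_iter):
        b, a = np.polyfit(x, y, 1)                  # re-estimate slope and intercept ("maximization"-like step)
        new_vals = a + b * x[missing]               # predict the missing values ("expectation"-like step)
        if np.max(np.abs(new_vals - y[missing])) < tol:
            break
        y[missing] = new_vals
    return a, b, y

# Usage: a, b, y_filled = iterated_regression_imputation(x_observed, y_with_nans)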

The solution from the EM algorithm is better than we can do with coding for missing data, but it will still underestimate the standard errors of the coefficients.

There are alternative maximum likelihood estimators that will be better than the ones obtained by the EM algorithm, but they assume that we have an underlying model (usually the multivariate normal distribution) for the distribution of variables with missing data.

 

Multiple Imputation

An alternative to maximum likelihood is called multiple imputation. Each of the solutions that we have discussed involves estimating what the missing values would be and using those "imputed" values in the solution. With dummy variable coding we substituted a constant (often the variable's mean) for the missing data. For the EM algorithm we substituted a predicted value on the basis of the variables that were available. In multiple imputation we will substitute random data.

In multiple imputation we generate imputed values on the basis of existing data, just as we did with the EM algorithm. But suppose that we are estimating Y on the basis of X. For every case with X = 5, for example, we would impute the same value of Y. This leads to an underestimate of the standard errors of our regression coefficients, because we have less variability in our imputed data than we would have had if those values had not been missing. One solution, used with the EM algorithm, was to adjust the calculations to allow for that error. With multiple imputation we are going to take our predicted values of Y and then add, or subtract, an error component drawn randomly from the residual distribution of Y - Yhat. This is known as "random imputation."

This solution will still underestimate the standard errors. We solve this problem by repeating the imputation process several times, generating multiple sets of new data whose coefficients vary from set to set. We then capture this variability and add it back into our estimates. This is a very messy process, and the reader is referred to Allison (2002) or Little and Rubin (1987) for the technical details.
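A bare-bones sketch of that logic for a single predictor appears below: missing Y values are imputed m times with a normally distributed residual added, and the slope and its standard error are pooled across imputations with Rubin-style rules. Proper implementations (e.g., NORM or SAS Proc MI) also draw the regression parameters themselves rather than holding them fixed, which this sketch omits; the function name is mine.

import numpy as np

def multiply_imputed_slope(x, y, m=20, seed=0):
    """Impute missing y values m times with random residuals; return the pooled slope and its standard error."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    xc, yc = x[~missing], y[~missing]
    b, a = np.polyfit(xc, yc, 1)                       # fit on the complete cases
    resid_sd = np.std(yc - (a + b * xc), ddof=2)       # spread of the residuals to draw the noise from

    slopes, variances = [], []
    for _ in range(m):
        y_imp = y.copy()
        y_imp[missing] = a + b * x[missing] + rng.normal(0, resid_sd, missing.sum())
        bi, ai = np.polyfit(x, y_imp, 1)
        slopes.append(bi)
        res = y_imp - (ai + bi * x)                    # slope variance in this completed data set
        variances.append(np.sum(res**2) / (len(x) - 2) / np.sum((x - x.mean())**2))

    slopes, variances = np.array(slopes), np.array(variances)
    within = variances.mean()                          # average within-imputation variance
    between = slopes.var(ddof=1)                       # between-imputation variance
    total = within + (1 + 1 / m) * between             # Rubin's rule for the total variance
    return slopes.mean(), np.sqrt(total)

# Usage: pooled_b, pooled_se = multiply_imputed_slope(x_observed, y_with_nans)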

 

Software

For anything other than the dummy variable coding approach to missing data, the calculations are more than any reasonable person would care to undertake. However there is software available for this purpose, and some of it is free. The two pieces of software highlighted below were chosen because they are free. Additional software is discussed in Allison (2002).

Expectation-Maximization Algorithm

I don't know of free software for this, but many people have access to the SAS package, which implements the EM algorithm through Proc MI.

 

Direct Maximum Likelihood

Mx is a free program available for download.

 

Multiple Imputation

The data augmentation approach to multiple imputation, implemented in the program NORM, is available for download.

This can also be implemented through Proc MI as part of the SAS package.

 

Footnotes

1.  If your software doesn't like periods as missing data, you can take any editor and change periods to asterisks (*) or blanks, or 999s, or whatever it likes.


 

References

Cohen, J. & Cohen, P. (1983) Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.).Hillsdale, NJ: Erlbaum. Return

Little, R.J.A. & Rubin, D.B. (1987) Statistical analysis with missing data. New York, Wiley. Return

Sethi, S. & Seligman, M.E.P. (1993). Optimism and fundamentalism. Psychological Science, 4, 256-259. Return

Allison, P.D. (2002). Missing data. Thousand Oaks, CA: Sage. Return

Jones, M.P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression Journal of the American Statistical Association, 91,222-230. Return

Last revised: December 23, 2002
