


When is a Program Ready for Rigorous Impact Evaluation?

The Role of a Falsifiable Logic Model

Diana Epstein

Senior Education Policy Analyst

Center for American Progress

Jacob Alex Klerman

Principal Associate

Abt Associates Inc.

Abstract:

Without meaning to dispute the crucial role of random assignment in establishing program impact, this paper argues that too rapid movement to rigorous impact evaluation is a partial cause of the low success rate of evaluated programs. This paper proposes a constructive response: process evaluations that compare program intermediate outcomes—in the treatment group, during the operation of the program—against a more falsifiable extension of the conventional logic model. We argue by example that such process evaluations would have allowed funders to deem programs unlikely to show impacts and therefore not ready for random assignment evaluation—without the high cost and long timelines of a rigorous impact evaluation. The paper then discusses the implications of this approach for broader evaluation strategy.

[G]overnment should be seeking out creative, results-oriented programs like the ones here today and helping them replicate their efforts across America.

President Barack Obama (2009)

Suppose that, like President Obama in the quote above, one’s goal were to identify new social programs to be rolled out nationwide to address pressing national problems. What process would you design for identifying those programs?

In response to conventional practice that sometimes goes directly from social problem and plausible program ideas to national implementation, recent initiatives have urged inserting a random assignment “toll gate.” Only programs that pass that toll gate proceed from pilot to national rollout. Proponents of this new toll gate, sometimes called “randomistas,”[1] argue that the only way to know if a program “works” is through rigorous impact evaluation. In contrast, the more conventional evaluation community argues that there is much to be learned from observing the program as implemented—without random assignment or even a control group.

We are strongly sympathetic to both communities, but it is our argument that the very success of the randomistas is contributing to the high failure rate of rigorous impact evaluations. The rush to rigorous impact evaluation leads us to evaluate programs that are not “ready” for rigorous impact evaluation. Given that we are evaluating too early, it should not be surprising that the evaluations find no impact. If we evaluated only those programs that had been deemed ready for rigorous impact evaluation, we might both lower evaluation costs and increase the fraction of programs that “pass” rigorous impact evaluation. In sum, the rush to rigorous impact evaluation wastes evaluation resources, and the resulting null or negative results may lead to the premature abandonment of promising program models.

The challenge, of course, is knowing when a program is ready for rigorous impact evaluation. Determining if a program works is the very purpose of rigorous impact evaluation. The balance of this paper argues that what we call a “falsifiable logic model” combined with process evaluation methods—carefully applied to the problem at hand—can help to determine whether a program is ready for rigorous impact evaluation. Applying these process evaluation methods would lead to a determination that some programs are simply not ready for rigorous impact evaluation. Furthermore, some of those programs not ready for rigorous evaluation would be ready, if only they had a little more time to be developed and refined. Thus, because some programs would never be subject to rigorous impact evaluation and because other programs would only be subject to rigorous evaluation after program refinement, the success rate of programs that were subjected to rigorous impact evaluation would be higher.

1. The Case for Rigorous Impact Evaluation and Our Critique of that Case

Inside and outside government,[2] there is a movement to use rigorous impact evaluation as one or more toll gates through which programs must pass before proceeding to broad-scale program rollout (see Figure 1). As articulated by proponents of this approach (Society for Prevention Research, 2004; Ioannidis et al., 2001),[3] the first toll gate should be an efficacy evaluation to see if the program works—in a single site, under favorable conditions. Then, the second toll gate should be an effectiveness evaluation to see if the program works in multiple sites under conditions that may deviate from the ideal (McDonald, 2009, 2010).

Figure 1: Rigorous Impact Evaluation before Broad Rollout

[Figure omitted]

The case for such toll gates seems overwhelming. When subject to rigorous evaluation, many—perhaps most—plausible and promising programs are found to have at best small impacts not commensurate with their cost, and often no impacts at all.[4] Even programs that initial rigorous impact evaluations show to be effective often fail a second test with an expanded population or at multiple sites (Glasgow et al., 2004; Hallfors et al., 2006). Sometimes the initial finding was a statistical fluke (or the result of creative data mining), sometimes a program was successful with the initial implementing team but is not implemented with fidelity by a new team, and sometimes a program that was effective in its initial site is not effective at a new location (Summerville and Raley, 2009).

Thus, even in the face of serious social problems which the nation has the will to address, it appears that most programs don’t work. Perhaps programs are poorly designed; perhaps it is fundamentally very difficult to change human behavior. In either case, rolling out programs that have not been rigorously evaluated wastes the nation’s resources by funding programs that are unlikely to actually improve people’s lives. Despite our desire to “do something,” this low success rate for rigorously evaluated programs suggests that evaluating prior to broad program rollout is likely to be a more prudent course.

Granting the need for requiring rigorous impact evaluation before proceeding to broad-scale program rollout, we are concerned that the growing emphasis on rigorous impact evaluation will push programs to rigorous impact evaluation before they are ready and, in so doing, further depress the already low fraction of programs that are shown to be effective. Pushing programs prematurely to rigorous impact evaluation has two related costs. First, evaluations are expensive. If we are reasonably sure that a program will not “pass” a rigorous evaluation, why waste limited evaluation resources by subjecting the program to a rigorous impact evaluation? Second, program models can be improved. Sometimes a promising program model needs several iterations before a workable approach can be found. If we evaluate too early, we will deem a failure a program model that after another round (or rounds) of development might be deemed a success.

2. A Logic Model Approach to Evaluability

The hard question is: how can we identify which programs are ready for and worthy of rigorous impact evaluation? After all, the whole premise of the argument for rigorous impact evaluation is that nothing short of rigorous evaluation yields compelling evidence of impact.

The next two sections of this paper assert that a partial solution to this difficult question exists—what we call the “Logic Model Approach to Evaluability.” This approach builds on the work of Joseph Wholey and what he termed “evaluability assessment.” In Wholey’s words, evaluability assessment “is a process for clarifying program designs, exploring program reality, and—if necessary—helping redesign programs to ensure that … program goals, objectives, important side effects, and priority information needs are well defined, program goals and objectives are plausible, relevant performance data can be obtained, and the intended users of the evaluation results have agreed on how they will use the information” (Wholey, 1994). Specifically, our approach expands the role of the logic model from a tool for thinking through causal pathways to a tool for evaluating whether a program is satisfying its own professed approach (Conrad et al., 1999; McLaughlin and Jordan, 1999; Rogers, 2005; Valley of the Sun United Way, 2008; W. K. Kellogg Foundation, 2004). In conventional logic models, program developers specify three key components. First, the program model: how will the program operate? Second, the intermediate outcomes that must be realized by members of the treatment group in order for the program to succeed. Third, the impacts the program is expected to have; i.e., the outcomes the program expects to achieve in the target population, relative to what those outcomes would have been had participants not been offered the chance to participate.

Our innovation is a requirement that the expanded logic model include detailed—and falsifiable—goals for intermediate outcomes. These intermediate outcome goals could be both quantitative and qualitative.[5] Consider the case of a training program. Such a program’s logic model might specify intermediate benchmarks such as: (i) space is secured and a curriculum developed; (ii) instructors are trained; (iii) classes of a given size are recruited; (iv) instructors teach with fidelity to the curriculum; (v) students attend all or most of the classes (with a quantitative standard of what fraction of students will attend what fraction of classes); (vi) students master material, as measured by gain of X percent on a pre/post test; and end impacts such as (vii) students find jobs that use acquired competencies; and (viii) students are retained in jobs for at least X months.
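
To make such a benchmark list concrete, the sketch below shows one way a process evaluator might record falsifiable intermediate goals and check measured values against them. It is only an illustrative sketch under assumed data: the benchmark names, targets, and observed values are hypothetical and are not drawn from any actual program or from the authors’ own work.

```python
# Minimal sketch: encode a program's own falsifiable intermediate benchmarks and
# compare them against values measured by a process evaluation. All names and
# numbers below are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str        # intermediate outcome, e.g., "class size recruited"
    target: float    # the falsifiable goal stated in advance in the logic model
    observed: float  # the value measured during the process evaluation

    def met(self) -> bool:
        return self.observed >= self.target


logic_model = [
    Benchmark("fraction of target class size recruited", target=1.00, observed=0.85),
    Benchmark("fraction of students attending >= 75% of sessions", target=0.67, observed=0.40),
    Benchmark("mean pre/post test gain (percentage points)", target=20.0, observed=6.0),
]

failures = [b for b in logic_model if not b.met()]
if failures:
    print("Program falls short of its own logic model on:")
    for b in failures:
        print(f"  - {b.name}: observed {b.observed} vs. target {b.target}")
else:
    print("All intermediate benchmarks met; the program may be ready for impact evaluation.")
```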

Asking program developers to specify their logic models in detail is valuable for two reasons. The first reason is the obvious one: developing an explicit and detailed logic model helps program developers to refine their program vision. Program models are more than “ideas” or “insights.” In developing programs, it is crucial to think realistically. Programs are rarely experienced in their ideal form. Not everyone will attend all of the sessions; not everyone will pass the final exam. What are realistic expectations? As important, what intermediate outcomes do program developers think are needed in order for the program to have an impact? Can the program be cost-effective if classes are not full? Can the program have the projected impact if only two-thirds (half?) of the students attend all (at least three-quarters?) of the sessions? The process of thinking through the details of program operation should identify issues, lead to improved program design, and thereby result in better program outcomes. This is the conventional argument for urging new programs to develop explicit logic models.

The second reason for asking program developers to specify their logic model in detail is more subtle. When trying to “sell” the program, program proponents have a strong incentive to over-promise. A program that purports to yield large impacts for many people is inherently more attractive than one for which only small impacts are expected.[6] In contrast, at the evaluation stage, program proponents have an incentive to lowball their estimates. If they overestimate or even give their best guess, it is possible that the program will not meet the stated performance goal at evaluation and will perhaps be cancelled.[7] These conflicting pressures give program proponents a stronger incentive to give realistic estimates. The explicit goals for intermediate outcomes in our proposed falsifiable logic model should therefore also help in selecting which programs to fund.

We acknowledge that in current practice, logic models often lack the specificity required by our approach.[8] For example, the IES evaluation report on the Even Start literacy program states that the children had very different amounts of exposure to the program. This is not (necessarily) a result of failed implementation; rather, “Even Start guidelines do not specify an expected level of exposure for children or parents, and the hours of instruction offered by local projects vary widely” (U.S. Department of Education, 2008). Nevertheless, it is our premise that falsifiable goals are a reasonable degree of specificity to expect of a program seeking substantial funds. Furthermore, as will become clear in the balance of this paper, such falsifiable goals are crucial to our proposed approach for determining when a program is ready for rigorous impact evaluation.

We also acknowledge that satisfying the benchmarks for program operation and outputs and outcomes in the treatment group specified in such a falsifiable logic model will often not be sufficient to achieve the desired end impacts. Even if there is progress in the treatment group, there may not be any impact. Control groups often also show improvement—sometimes due to regression to the mean, sometimes due to similar programs in the community. Thus, even a program that satisfies its own logic model for outputs and outcomes for the treatment group (e.g., pre/post progress on a standardized test) may fail a rigorous impact evaluation for impact on long term outcomes (i.e., outcomes for the treatment group relative to outcomes for the control group).[9]

Nevertheless, a program that cannot achieve even the intermediate goals specified by its own logic model will not, according to that logic model, have the desired (usually longer-term) impacts and therefore should not be rigorously evaluated. The caveat “its own logic model” is crucial. Sometimes programs have long-term impacts without satisfying their own logic model; in that case, the program must have had its impact through some unanticipated pathway.

Crucially, note that this determination of whether a program achieves the intermediate outcomes specified by its own logic model can often be made using conventional process evaluation methods; i.e., careful observation of program operation, without random assignment and without a control or comparison group. In addition, in most cases and by design, this determination should be inexpensive. Our approach is attractive exactly because it relies on cheap data—program operating records (e.g., enrollment and attendance) and end-of-program tests. We want falsifiable intermediate outcomes that can be measured without a follow-up survey and its expensive tracking of former program participants, and without a need to locate and survey nonparticipants (a control group or comparison group).

3. Some Examples

If most evaluated programs could pass this additional toll gate—i.e., their own logic models—then this “Logic Model Approach to Evaluability” would yield little operational (for the practice of evaluation) insight. Unfortunately, the opposite is true. Many evaluated programs fall short of (the currently implicit) expectations on these intermediate outcome tests, thus failing their own logic models. Here, we provide some examples of common, but distinct, ways in which programs fail their own logic models and how this could have been detected by a process evaluation. For each of these ways in which programs fail, we call out the details required to make the logic model falsifiable; i.e., to lead a funder to conclude that, according to its own logic model, a program is unlikely to have an impact and therefore should not progress to rigorous impact evaluation.

The first way in which programs fail their own logic model concerns the inputs into the program. Program models often implicitly posit the ability to establish inter-organization partnerships and to recruit and retain certain types of staff; sometimes those partnerships never materialize or the staff cannot be recruited or retained. For example, the Employment Retention and Advancement (ERA) program in Salem, Oregon, struggled with both high turnover among case manager staff and a difference in philosophies between staff recruited from welfare agencies and those from community colleges. These implementation challenges affected service delivery and hence the benefits that participants were able to obtain from the program (Molina, Cheng, and Hendra, 2008). Thus, a falsifiable logic model should specify the partnerships to be established, the qualifications of the staff to be hired, and the projected retention rate of those staff.

The second way in which programs fail is that they do not attract the target number of clients/participants/trainees. For random assignment to be feasible, a program needs to have a surplus of clients (usually double). The ideal situation for rigorous impact evaluation is therefore an existing program with a long waiting list. When such a long waiting list exists, random assignment can often be viewed as the most ethical approach to deciding who will be served.

When there is not currently a waiting list, a program can sometimes attract additional clients through advertising and recruiting. For a program that is already achieving its target enrollment, spending on recruiting and advertising is arguably unnecessary; when the program is called upon to increase applications for an evaluation, recruiting and advertising can plausibly generate the larger required number of applicants. However, a program that cannot even attract the target size of the treatment group (or is expending considerable resources to do so) is not ready to recruit double (or more) the target number of treatment group members in order to implement random assignment.

Many evaluations have trouble recruiting applicants at the target rate. Sometimes the evaluation is cancelled, as was the case in the Portland Career Builders ERA program, which recruited a random assignment sample only a third of the target size (Azurdia and Barnes, 2008). Sometimes the evaluation limps forward with fewer than the target number of applicants. In either case, an earlier process study could have detected recruitment challenges and would have indicated that the program was not ready for a rigorous impact evaluation.

The implications of a waiting list for a falsifiable logic model are slightly more subtle. Most programs begin with a premise that they are trying to solve some important and (in some relative sense) common problem—and that the proposed program will be viewed as attractive by the target population. Inasmuch as we are only testing a pilot, we should expect that demand will massively exceed current program capacity; if not, why expand the program—here or elsewhere? Thus, falsifiable program logic models should include some standard against which to judge that the claimed demand for the program truly exists. Easily filling the current pilot program—with a waiting list—will often be a plausible standard.

Of course, a successful program can usually expect even more demand. An existing program develops a referral network. Claimed impacts and individual success stories help to build demand. Nevertheless, we usually have in mind a fully rolled-out program considerably larger than the pilot program—even in this site. In that case, at least moderate excess demand (i.e., a waiting list) is a reasonable requirement before proceeding to—and a requirement for successful implementation of—a random assignment evaluation.

The third way that programs fail is that sometimes participants initially enroll in the program, but do not complete the expected treatment. We know this is a failure of the logic model because, ex post, reports of rigorous impact studies point to failure to complete the treatment as the reason for null results. For example, the report on the Rural Welfare to Work Strategies Evaluation attributes the lack of substantial impacts to the fact that only about two-fifths of the target clients received substantial Future Steps services (Meckstroth et al., 2006). In the South Carolina Moving Up ERA program, only half of the program group was actually engaged in the program’s services during the year after they entered the study. Of those who were engaged, the intensity of engagement varied such that some were only minimally involved in the program (Scrivener, Azurdia, and Page, 2005). Another example is the Cleveland Achieve ERA program, where participation varied widely such that overall the intensity was less than the program designers had envisioned (Miller, Martin, and Hamilton, 2008).

Similarly, the logic model of Building Strong Families (Wood et al., 2010) asserted that multiple group sessions on relationship skills (combined with other support services to unmarried couples around the time of their child’s birth) could improve relationship skills and thereby increase marriage rates and decrease divorce rates. However, in all but one of the sites, only 9 percent of the couples received at least 80 percent of the curriculum. Only in the one site where 45 percent of the couples received at least 80 percent of the curriculum was there an impact on measures of relationship quality.

Universal attendance at every program session is not realistic. Nevertheless, below some threshold, impacts seem extremely unlikely. Thus, a falsifiable logic model should specify what defines “enrollees” and what fraction of those enrollees need to attend what number of sessions in order for there to be a measurable (perhaps cost-effective) impact. Then, actual attendance can and should be measured against that specification. It is our reading of the literature that there is often a presumption that most (perhaps nearly all) of the sessions must be attended in order for the program to have its full effect. This is clearly true in a program that grants a certificate or a degree. In other programs, such a presumption is signaled by the fact that final sessions often include some special capstone or wrap-up activity. A presumption that most sessions need to be completed is also sometimes implied by the fact that we fund all of the sessions; i.e., if we thought the later sessions were not needed, we would not have funded them.
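
As an illustration of how such a dosage benchmark could be checked from routine program records, the sketch below computes the share of enrollees who attended at least a given fraction of the sessions and compares it to a stated goal. The attendance records and the 75 percent and two-thirds thresholds are hypothetical assumptions, not figures from any evaluated program.

```python
# Hypothetical attendance records: enrollee id -> number of sessions attended.
sessions_offered = 10
sessions_attended = {
    "e01": 10, "e02": 9, "e03": 3, "e04": 0, "e05": 7,
    "e06": 2, "e07": 10, "e08": 5, "e09": 8, "e10": 1,
}

min_session_share = 0.75         # an enrollee "completes" the treatment at >= 75% of sessions
required_completer_share = 0.67  # the logic model's falsifiable dosage goal (assumed)

completers = [e for e, n in sessions_attended.items()
              if n / sessions_offered >= min_session_share]
completer_share = len(completers) / len(sessions_attended)

print(f"{completer_share:.0%} of enrollees completed the treatment "
      f"(goal: {required_completer_share:.0%})")
if completer_share < required_completer_share:
    print("Dosage falls short of the program's own logic model.")
```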

The fourth way in which programs fail is that sometimes clients attend, but the program as implemented falls short of what was envisioned in the logic model. Thus for example, across each of four supplemental reading comprehension programs, a Mathematica random assignment evaluation found no evidence of consistent impact. However, the study also found evidence of far from complete implementation of the curricula themselves. Only about three-quarters of the targeted practices were actually implemented (James-Burdumy et al., 2010).

Similarly, Abt’s random assignment evaluation of a national student mentoring program found no consistent pattern of impact (Bernstein et al., 2009). In explaining that null result, the report noted that many treatment members received no mentoring at all and the average amount of mentoring received was about an hour a week—less than in model community-based mentoring programs.

Thus, a falsifiable logic model should specify what constitutes (sufficient) fidelity of implementation. That specification of implementation with fidelity needs to be specific enough to be falsifiable. What defines how the program would be implemented ideally? What amount of deviation from that ideal implementation would constitute failure?

This step has important implications for program development as well. Once program developers have defined implementation with fidelity, they should go back to their plan for training and supervising staff. Do those training materials make clear how the program is expected to be implemented? Have the instructors been tested to assure that they have mastered the techniques and learned the expectations? Does the program model include sufficient supervisory resources and a supervision scheme that can reasonably be expected to lead to implementation of the program with fidelity? This part of the development of the logic model will lead to additional falsifiable outcomes under the first way that programs fail (i.e., recruiting—and training—staff) and additional falsifiable outcomes under this fourth way that programs fail; e.g., was the supervisory plan implemented (perhaps as verified by checking supervisor reports)?

The fifth way in which programs fail is that sometimes clients show minimal (sometimes no) progress on pre/post measures of the intermediate outcome the program was intended to affect. Thus, in the National Evaluation of Welfare-to-Work Strategies, clients randomized to the Human Capital Development (HCD) component showed no progress on objective achievement tests (Bos et al., 2002). Inasmuch as the logic model for these HCD programs implied that earnings would rise because clients learned basic skills in reading, writing, and mathematics, the program was a failure.

The program did increase the number of people who received a GED, absolutely and relative to the control group. Here and in general, the specifics of the logic model matter. If the program’s logic model had posited that earnings will rise because clients get a GED, even if they do not learn anything, then it might be reasonable to proceed with rigorous impact evaluation. However, the HCD program’s logic model had specified actual learning. In this sense, we propose to hold programs to their own logic models and, conversely, not to let programs ex post (i.e., after they see the results) define achieved outcomes as success (in this case, GEDs but no improvement in test scores).

Thus, a falsifiable logic model should specify pre/post changes in outcomes for the treatment group; e.g., skill attainment, progress on standardized tests, graduation rates, receipt of certificates. To be a useful screening device for programs that should be subjected to rigorous impact evaluation, these measurements must be inexpensive. Not all enrollees will still be around at the end of the program. Thus, we would not want a standard in terms of true outcomes for all enrollees; we cannot easily observe outcomes for initial enrollees who do not complete the program (for whatever reason). We also do not want measures conditional on completion (the people we observe). Rather, we want standards in terms of the incoming class; e.g., half of the entering class will get a certificate through the program. People who complete but do not get a certificate and people who leave the program before completion would count towards the denominator (i.e., program enrollees), but not towards the numerator (i.e., those receiving a certificate). The program might like to claim credit for enrollees who get a certificate outside the program, but those certificates are not easily measured (and are probably not due to the program), so it is probably better to exclude them from the standard.
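
The hedged sketch below illustrates the incoming-class standard described above: every enrollee counts in the denominator, and only certificates earned through the program count in the numerator. The enrollee records and the 50 percent goal are hypothetical, not drawn from any program cited in this paper.

```python
# Hypothetical enrollee records for computing an incoming-class outcome standard.
enrollees = [
    {"id": "e01", "completed": True,  "certificate_via_program": True},
    {"id": "e02", "completed": True,  "certificate_via_program": False},
    {"id": "e03", "completed": False, "certificate_via_program": False},  # dropped out
    {"id": "e04", "completed": False, "certificate_via_program": False},  # credential earned elsewhere; not counted
    {"id": "e05", "completed": True,  "certificate_via_program": True},
]

target_rate = 0.50  # e.g., "half of the entering class will earn a certificate through the program"
observed_rate = sum(e["certificate_via_program"] for e in enrollees) / len(enrollees)

print(f"Certificate rate over the incoming class: {observed_rate:.0%} (goal: {target_rate:.0%})")
if observed_rate < target_rate:
    print("Outcome benchmark not met under the program's own logic model.")
```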

As in all performance management systems, in such falsifiable logic models it will be crucial to carefully define “enrollees.” The Department of Labor’s Workforce Investment Act training programs are notorious for delaying official “enrollment” until program staff are relatively sure that the trainee will complete the program. For most purposes, a more appropriate definition will be people who receive any program services.

Finally, note also that such performance standards give program operators strong incentives for “cream skimming”; i.e., only enrolling trainees who are likely to meet the standard. Program funders need to be aware of that incentive. If those most likely to meet the standard are not the target population, then this implies an additional standard under the second way in which programs fail (i.e., enrollment). Enrollment standards need to specify not only the number of enrollees, but also enough about their characteristics to assure that the target population is being enrolled.

Conversations with researchers and research sponsors suggest that each of these logic model failures is common. These failures are also embarrassing and therefore rarely end up in the formal literature. Sometimes, as a result of these logic model failures, the study is cancelled and therefore no report is produced. More frequently, the program limps through the rigorous impact evaluation: a different partnership is attempted, staff standards are relaxed, the intake period is held open much longer than expected. These changes from the original design are usually mentioned only in passing, if at all. Finally, sometimes the problems—poor attendance at sessions or minimal progress even in the treatment group—are not noticed until the project’s final report, where they are either not mentioned at all, relegated to a footnote, or offered as a reason why the evaluation found no impacts.

Crucially for our argument, note the common pattern of these examples. Each of these programs was subjected to expensive and time-consuming rigorous impact evaluation. Those evaluations found no impact—overall or at most program sites. The random assignment evaluations were, however, unnecessary. Through site visits and analysis of program records a process evaluation could have collected information on partnerships and staffing, initial enrollment, attendance at sessions, and pre/post progress on the target intermediate outcomes. That information could have been compared against a more detailed version of the conventional logic model. Those comparisons would have clearly demonstrated that, according to the program’s own logic model, the program was unlikely to have impacts and therefore was not ready for rigorous impact evaluation.

4. Fitting the Logic Model Step into an Evaluation Process

The previous section has argued that through a process evaluation it is possible to screen out many programs as not ready for rigorous impact evaluation. In this section, we point out three other direct implications of this negative screen.

First, programs that fail their own logic model do not necessarily need to simply be discarded without further consideration. The premise of technical assistance (management consulting in the for-profit world) is that, through observing a program as it currently operates, an experienced outsider can suggest ways to improve the program’s operation. Perhaps with those improvements, the program will meet its own logic model. The possibility of revising programs such that they will meet their own logic models suggests building such a “formative evaluation” or “technical assistance” step into every evaluation—either once the program fails its own logic model, or even before proceeding to the “process evaluation.”[10]

However, a caveat is in order. Among the activities of formative evaluations is helping a program to refine its logic model. However, the falsifiable logic model should be set before the program is implemented. Once the program is implemented, formative evaluations and technical assistance should focus on changing details of program implementation to satisfy the program’s falsifiable logic model as stated initially. After seeing the results, simply lowering the quantitative goals—how many sessions a client needs to attend, how much progress the client must make on an achievement test—is too easy. Instead, a program with a significantly different logic model, or even substantially different quantitative goals, should often be viewed as a totally new program—and be required to recompete for funding with other new programs.

Second, sometimes we can refine a program such that, with the refinements, it will meet its own logic model, but should we try? How many times? Some program models that fall short of expectations in the first (or second, or nth) iteration will fulfill their own logic models with another cycle through formative evaluation and then process evaluation. Other program models should simply be terminated. These programs were initially promising, they have been tried, and they have been deemed ineffective (i.e., they failed to achieve their own logic models’ intermediate outcomes). Limited resources should be transferred from this failed program to one of the many other promising program models.

The challenge, of course, is figuring out which programs to iterate and which programs to simply terminate. Unfortunately, we have no clear guidance on when to continue investing and when to move on. Foundation program officers, social venture capitalists, and government employees currently make these decisions. As of now, our only guidance is to carefully consider the tradeoff (in time and resources) between continuing to invest in a program that can be saved versus scrapping it entirely and instead beginning to explore some other program that offers a different—and perhaps better—approach. Perhaps making the tradeoffs of each choice explicit will improve decision-making. Additional insights into this choice would clearly be useful.

Figure 2 summarizes our proposed revised sequence of evaluation steps. Specifically, it augments the impact evaluation of Figure 1 with an iterative sequence of formative evaluation and process evaluation. In this augmented model, the process evaluation serves as an additional toll gate: only a program that passes its own logic model as measured by the process evaluation proceeds to the impact evaluation stage.

Figure 2: Logic Model Evaluability

[Figure omitted]

Figure 2 suggests the third implication of our negative screen. Current evaluation strategy often includes some formative evaluation/technical assistance and even some process evaluation. However, those steps are usually funded simultaneously with, and as part of the same contract as, the impact evaluation. Formative evaluation/technical assistance is provided immediately before and as part of proceeding to random assignment. The advantage of this approach is that a single contract shrinks the necessarily long interval from program idea to broad-scale rollout.

The disadvantage of this approach is that a single contract approach implicitly assumes that most programs will go from program idea through formative evaluation and process evaluation to impact evaluation. However, the thrust of our argument is that many (perhaps most) programs will fail at the process evaluation stage and therefore should not proceed to impact evaluation. Thus, the assumption implicit in current contracting practice appears to be problematic.[11]

Inasmuch as many (perhaps most) programs should not proceed immediately from process evaluation to impact evaluation, it might be better to contract differently. Rather than a single contract, issue one contract for “program development,” i.e., technical assistance and formative evaluation. If the program and the formative evaluator decide the program is ready, then proceed to competition for a second contract for a process evaluation that would compare intermediate outcomes to the program’s own falsifiable logic model. If the program and the formative evaluator instead decide that the program is not ready for the process evaluation, then they should apply—competitively—for another round of technical assistance and formative evaluation.

For programs that proceed to the process evaluation phase, at the end of that phase the evaluator would prepare a report and the funder would choose between three options: (i) proceed to a third competition for a new contract to conduct the impact evaluation; (ii) proceed to a competition for another round of program development and process evaluation; or (iii) terminate funding for the program.[12]

We grant that the current interval from initial program development to verified program impact is already measured in years (often a decade or more), and that issuing multiple contracts would further extend that interval. If most programs passed their own logic models and rigorous impact evaluation screens, these long timelines would be a problem. Given that many programs fail both screens, the tradeoff of longer evaluation cycles for more successful evaluations seems worth careful consideration. Some things just cannot be rushed.

From this perspective, it appears that the Education Department’s Investing in Innovation Fund (i3), for example, takes the right approach. Rather than funding only program pilots or formative evaluations, i3 awarded the largest amounts of money to promising existing programs and gave more points to programs that are supported by stronger evaluation evidence.

5. Some Broader Implications for Evaluation Strategy

The logic model approach to evaluability and program development described in this paper embodies a strong implicit assumption about program motivation. Our approach implicitly assumes that a program’s ultimate goal is to grow from program idea to broad program rollout. Our approach further assumes that the only way to do so is by passing the various evaluation toll gates.

In practice, neither of those assumptions is universally correct. First, with respect to the desire to move promptly through evaluation to broad program rollout, this might be plausible under two conditions: (i) the primary goal of program developers is to get their programs rolled out nationally as quickly as possible, and (ii) program developers have complete faith in the evaluation process.

Neither of those conditions is likely to be satisfied. Some program operators would be content to run small local programs. In some cases, this is because their vision truly is local. In other cases, there is a fear of rigorous impact evaluation (Campbell, 1969). Such program operators believe that they are achieving their desired results; they try to avoid rigorous impact evaluation because they fear the program might—in their view, incorrectly—be deemed ineffective. Other program operators do not believe that their program can be meaningfully evaluated with a random assignment approach, perhaps because important benefits are not (easily) measurable.[13] Still others feel that limited resources should be used to serve clients and should not be diverted toward evaluation. For each of these groups of program operators, prolonged “program development” will often be the ideal outcome.

Second, with respect to the necessity of program evaluation in order to proceed to broad-scale rollout, some recent initiatives are consistent with this perspective. The evidence-based Nurse-Family Partnership ($1.5 billion over five years) and the evidence-based Teen Pregnancy Prevention Program ($110 million in FY2010) each provided substantial funding for the broad rollout of programs that have passed rigorous impact evaluation at both the efficacy and effectiveness level (Orszag, 2009a). As Orszag explained: “[This approach] will also create the right incentives for the future. Organizations will know that to be considered for funding, they must provide credible evaluation results that show promise, and be ready to subject their models to analysis.”

However, this approach is the exception rather than the rule. As Orszag (2009a) and others acknowledge, many programs without rigorous impact evaluation evidence and even some programs with negative rigorous impact evaluation evidence continue to be funded, often at high levels. Given this reality, avoiding rigorous evaluation may also be a viable strategy for program developers to pursue.

Together, these two factors imply that evaluators will often need to induce programs to participate in random assignment evaluations. Once evaluators need to induce programs to participate in evaluations, it is not clear that they can insist that programs develop falsifiable logic models and participate in the long timeline and onerous sequence of evaluation steps described here.

Our guidance is simple: this sequence of evaluation steps should be a requirement of funding for pilot programs and for proceeding to broad-scale program rollout. We understand that the reality diverges from that simple guidance. That divergence will make it more difficult to implement the sequence of evaluation steps described in this paper. More consideration of these issues is needed.

6. Closing Thoughts and Next Steps

This paper has argued that some programs are being evaluated too early and that more resources should be devoted to determining whether a program is ready for rigorous impact evaluation. A more detailed and falsifiable logic model combined with a careful process evaluation would frequently detect programs that failed to: (i) establish partnerships or hire the desired staff; (ii) recruit sufficient qualified program participants; (iii) induce program participants to engage with the complete program; or (iv) enable program participants to improve their skills relative to the program’s own goals. Having failed to satisfy the quantitative intermediate outcomes of their own logic model, these programs are unlikely to have positive long-term impacts and therefore should not proceed to rigorous impact evaluation.

From this insight—that some programs can be rejected based on the results of benchmarks for the treatment group during (or shortly after) the end of the program—emerges an answer to the question with which we began this paper: What process would you design for identifying programs worthy of national rollout? Our ideal process is depicted in Figure 2: (i) Fund a pilot with a corresponding formative evaluation; (ii) If repetition of the formative evaluation step is needed, decide whether to fund it, or alternatively to abandon the program model; (iii) Proceed to a process evaluation which verifies the satisfaction of the program’s own “falsifiable logic model”; (iv) If repetition of the process evaluation step is indicated, decide whether to fund it, or alternatively to abandon the program model; (v) Proceed to a random assignment efficacy trial; (vi) If the program passes the efficacy trial, proceed to replication; (vii) If the program does not pass the efficacy trial, (perhaps) repeat the formative evaluation and process evaluation steps (i.e., i–iv); (viii) Proceed to a random assignment effectiveness trial; (ix) If the program passes the effectiveness trial, proceed to broad program rollout.

This paper’s advocacy of the use of formative evaluation and process evaluation should not be used as an excuse to delay rigorous impact evaluation. We would urge a bias toward either proceeding to the process evaluation and then rigorous impact evaluation, or terminating the program. Partially to encourage programs to move on to rigorous impact evaluation when appropriate, another round of formative evaluation and process evaluation should be far from automatic. There are other programs that could be funded instead. Forcing such programs to reapply for funding for formative evaluation—alongside other programs that have not had even one round of formative evaluation—is one possible strategy.

We conclude by acknowledging that the approach described here is unlikely to be adopted completely. Our approach would convert a process that already often takes nearly a decade into a process that will often take considerably more than a decade. We have argued that this is consistent with good science, and we acknowledge that it is inconsistent with the political cycle. Policymakers and politicians face strong pressures to be seen as “doing something”—and soon! We also acknowledge that our suggested approach is inconsistent with the difficulty of sustaining research attention—and therefore funding—on any single topic over a long period of time. Nonetheless, we hope that this discussion will help highlight the importance of careful, early-stage investments in program development and testing in order to save money later and also to find more programs that work.

References

Anderson, D. 2010. Proven programs are the exception, not the rule [Blog post]. Retrieved from .

Azurdia, G., and Barnes, Z. 2008. The Employment Retention and Advancement Projects: Impacts for Portland’s Career Builders Program. New York: Manpower Demonstration Research Corporation.

Bernstein, L., Dun Rappaport, C., Olsho, L., Hunt, D., and Levin, M. 2009. Impact evaluation of the U.S. Department of Education’s Student Mentoring Program (NCEE 2009-4047). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Bos, J. M., Scrivener, S., Snipes, J., and Hamilton, G. 2002. Improving basic skills: The effects of adult education in welfare-to-work programs. New York: Manpower Demonstration Research Corporation.

Campbell, D. T. 1969. Reforms as experiments. American Psychologist 24(4): 409–429.

Coalition for Evidence Based Policy. 2009. The Congressionally-established Top Tier evidence standard is based on a well-established concept in the scientific community, and strong evidence regarding the importance of random assignment. Retrieved from .

Conrad, K. J., Randolph, F., Kirby, M., Jr., and Bebout, R. R. 1999. Creating and using logic models—Four perspectives. Alcoholism Treatment Quarterly 17(1): 17–31.

GiveWell. n.d. Social programs that just don’t work. Retrieved from .

Glasgow, R. E., Klesges, L. M., Dzewaltowski, D. A., Bull, S.S., and Estabrooks, P. 2004. The future of health behavior change research: What is needed to improve translation of research into health promotion practice? Annals of Behavioral Medicine 27: 3–12.

Hallfors, D., Cho, H., Sanchez, V., Khatapoush, S., Kim, H. M., and Bauer, D. 2006. Efficacy vs effectiveness trial results of an indicated “model” substance abuse program: Implications for public health. American Journal of Public Health 96(12): 2254–2259.

Imbens, G. 2009. Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009) (NBER Working Paper #14896). Cambridge, MA: National Bureau of Economic Research. Retrieved from .

Ioannidis, J. P. A., et al. 2001. Comparison of evidence of treatment effects in randomized and nonrandomized studies. Journal of the American Medical Association 286(7): 821–830.

James-Burdumy, S., et al. 2010. Effectiveness of Selected Supplemental Reading Comprehension Interventions: Findings From Two Student Cohorts (NCEE 2010-4015). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Klerman, J. A. 2010. Contracting for independent evaluation: Approaches to an inherent tension. Evaluation Review 34(4): 299–333.

McDonald, S.-K. 2009. Scale-up as a framework for intervention, program, and policy evaluation research. In G. Sykes, B. Schneider, and D. N. Plank (Eds.), Handbook of education policy research (pp. 191–228). New York: Routledge.

McDonald, S.-K. 2010. Developmental stages for evaluating scale. The Evaluation Exchange 15(1). Retrieved from .

McLaughlin, J. A., and Jordan, G. B. 1999. Logic models: A tool for telling your program’s performance story. Evaluation and Program Planning 22(1): 65–72.

Meckstroth, A., et al. 2006. Paths to work in rural places: Key findings and lessons from the impact evaluation of the Future Steps Rural Welfare-to-Work Program. Final report (No. 8762-192, 202). Princeton, NJ: Mathematica Policy Research.

Miller, C., Martin, V., and Hamilton, G. 2008. The Employment Retention and Advancement Projects: Findings for the Cleveland Achieve Model: Implementation and early impacts of an employer-based approach to encourage employment retention among low-wage workers. New York: Manpower Demonstration Research Corporation.

Molina, F., Cheng, W.-L., and Hendra, R. 2008. The employment retention and advancement project: Results from the Valuing Individual Success and Increasing Opportunities Now (VISION) Program in Salem, Oregon. New York: Manpower Demonstration Research Corporation.

National Academies of Sciences. 2010. Standards of evidence: Strategic planning initiative. Retrieved from %20Description.html.

Obama, Barack. 2009. Remarks of President Obama on Community Solutions Agenda. Retrieved from ?tbl_pr_id=1828.

Office of Management and Budget. 2004. What constitutes strong evidence of a program’s effectiveness? Retrieved from 2004_program_eval.pdf.

Orszag, P. R. 2009a. Building rigorous evidence to drive policy [Blog post]. Retrieved from BuildingRigorousEvidencetoDrivePolicy.

Orszag, P. R. 2009b. Increased emphasis on program evaluation [Memorandum]. Retrieved from .

Ravallion, M. 2008. Should the randomistas rule? The Economists' Voice 6(2). doi: 10.2202/1553-3832.1368.

Rogers, P. J. 2005. Logic models. In S. Mathison (Ed.), Encyclopedia of evaluation (p. 232). Beverly Hills, CA: Sage Publications.

Rossi, P., Lipsey, M., and Freeman, H. 2004. Evaluation: A systematic approach. Thousand Oaks, CA: Sage Publications.

Sawhill, I. V., and Baron, J. 2010. We need a new start for Head Start. Education Week 29(23): 22–23.

Scrivener, S., Azurdia, G., and Page, J. 2005. The Employment Retention and Advancement Projects: Results from the South Carolina ERA site. New York: Manpower Demonstration Research Corporation.

Shadish, W. R., Cook, T. D., and Campbell, D. T. 2001. Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

Society for Prevention Research. 2004. Standards of evidence: Criteria for efficacy, effectiveness and dissemination. Retrieved from sofetext.php.

Summerville, G., and Raley, B. 2009. Laying a solid foundation: Strategies for effective program replication. Philadelphia: Public/Private Ventures.

U.S. Department of Education, Institute of Education Sciences. 2008. A study of classroom literacy interventions and outcomes in Even Start. Washington, DC: Author.

U.S. Department of Health and Human Services, Administration for Children and Families. 2010. Head Start impact study: Final report. Washington, DC: Author.

USAID. 2011. USAID Evaluation Policy. Bureau for Policy, Planning, and Learning. Retrieved from evaluation/USAID_EVALUATION_POLICY.pdf.

Valley of the Sun United Way. 2008. Logic model handbook 2008. Retrieved from .

W. K. Kellogg Foundation. 2004. Using logic models to bring together planning, evaluation and action: Logic model development guide. Battle Creek, MI: Author.

What Works Clearinghouse. 2008. What Works Clearinghouse evidence standards for reviewing studies, version 1.0.

Wholey, J. 1994. Assessing the feasibility and likely usefulness of evaluation. In H. P. Hatry, J. S. Wholey, and K. E. Newcomer (Eds.), Handbook of practical program evaluation. San Francisco: Jossey-Bass.

Wood, R. G., McConnell, S., Moore, Q., Clarkwest, A., and Hsueh, J. 2010. The Building Strong Families Project: Strengthening unmarried parents’ relationships: The early impacts of Building Strong Families: Executive summary (No. 08935.155). Princeton, NJ: Mathematica Policy Research, Inc.

-----------------------

[1] On the term “randomistas,” see Ravallion (2008) in the developing country evaluation context. Our critique is very different from that of Ravallion. For a direct reply to Ravallion, see Imbens (2009).

[2] For inside government, see Office of Management and Budget (OMB) (2004) and Orszag (2009b). For outside government, see National Academies of Sciences (2010) and What Works Clearinghouse (2008).

[3] See also Anderson (2010), GiveWell (n.d.), and Sawhill and Baron (2010).

[4] See for example the discussion of specific examples and overall patterns at Coalition for Evidence Based Policy (2009).

[5] Quantitative goals for long-term impacts would themselves be useful at the rigorous impact stage. They would help to power (i.e., choose sample size) such studies and they would help in interpreting the results of those studies.

[6] The importance of the social problem and the degree of difficulty of the solution are also important considerations.

[7] On this, see Shadish, Cook, and Campbell (2001, pp. 52–53): “However, specifying such an effect size is a political act, because a reference point is then created against which an innovation can be evaluated. Thus, even if an innovation has a partial effect, it may not be given credit for this if the promised effect size has not been achieved. Hence, managers of education programs learn to assert: ‘We want to increase achievement’, rather than stating, ‘We want to increase achievement by two years for every year of teaching.’”

[8] See the quote from Shadish, Cook, and Campbell, 2001, in the previous footnote.

We note that some programs and government agencies are already following our suggested approach. For example, the recently released USAID Evaluation Policy notes that “Compared to evaluations of projects with weak or vague causal maps and articulation of aims, we can expect to learn much more from evaluations of projects that are designed from the outset with clear development hypotheses, realistic expectations of the value and scale of results, and clear understanding of implementation risks” (USAID, 2011).

[9] We have implicitly ruled out the possibility of “sleeper effects,” programs that fail to show short-term effects nevertheless showing long-term effects.

[10] In addition to helping determine whether or not a program is ready for rigorous impact evaluation, it is also worth noting that formative and process evaluations can produce additional benefits. These methods can yield important information about a program that can be leveraged later when designing the impact evaluation. Namely, these early steps can highlight intermediate outcomes that should be measured and can yield insights about moderator and mediator variables that should be included in the analytic framework.

[11] Relative to the ideal process evaluation, funding a single contract probably raises the likelihood of proceeding through to the impact evaluation. The likelihood of proceeding through is high because relations between the program and the evaluator create a form of capture. Successful technical assistance benefits from a strong rapport between the program and the technical assistance provider. Once that strong rapport and the attendant personal relationships are formed, it becomes much harder for the formative evaluation/technical assistance team—now in the process evaluator role—to state that the program has failed to meet its own logic model.

It is not just such friendly relations that induce a bias towards proceeding to rigorous impact evaluation. Once a single contract is issued for both steps, the evaluator has a strong financial interest in making the program (appear to) work. If the program does not work, then there is no second phase and hence no impact evaluation. Contractors already face more than enough pressure to please the client (Klerman, 2010); more pressure is not needed.

[12] Considerations of conflict of interest on the part of the evaluator suggest that barring contractors at one phase from bidding on the next phase would be an even stronger procedural protection against a bias to proceed.

[13] For example, Comprehensive Community Initiatives.
