Research Design Issues for Evaluating Complex Multicomponent Interventions in Neighborhoods and Communities
Living in high-poverty neighborhoods is a major risk factor for several mental, emotional, and behavioral disorders, as well as other developmental challenges and physical health problems. Despite increasing evidence that significant improvements in the life trajectories of at-risk young people are now possible, actual wellbeing lags far behind, especially in neighborhoods of concentrated poverty. Major advances in public health will not occur unless we translate existing knowledge into effective multicomponent interventions, implement these in high-poverty neighborhoods, and develop rigorous evaluation methods to ensure continual improvement. In this paper, we discuss challenges and offer approaches to evaluation that are likely to result in adoption and maintenance of effective and replicable multicomponent interventions in high-poverty neighborhoods. The major challenges we discuss concern 1) multiple levels of evaluation/research, 2) complexity of packages of interventions, and 3) classes of evaluation/research questions. We suggest multiple alternative research designs that maintain rigor but accommodate these challenges. We conclude that a standardized measurement system is also fundamental to the evaluation of complex multicomponent interventions. To be useful in local neighborhoods and communities, such a system would include assessments of the 1) implementation of each intervention component, 2) reach of each component, 3) effects of each component on immediate outcomes, and 4) effects of the comprehensive intervention package on outcomes.
Recent research suggests that significant improvements in the life trajectories of at-risk young people are now possible. [1-3] Despite this evidence, actual wellbeing lags far behind, especially in neighborhoods of concentrated poverty, where the risk of young people developing multiple behavioral and health problems is much higher than in other neighborhoods. [4-5] Major advances in public health will not occur unless we translate existing knowledge into effective multicomponent interventions, implement them in high-poverty neighborhoods, and develop rigorous evaluation methods to ensure continual improvement. In this paper, we present approaches to evaluation that are likely to result in adoption and maintenance of effective and replicable multicomponent interventions in high-poverty neighborhoods.
The Federal Government is funding multiple efforts over the next few years, particularly in high-poverty neighborhoods and communities. For example:
• The U.S. Department of Education Promise Neighborhoods initiative, modeled after the Harlem Children’s Zone 
• The Prevention Prepared Communities initiative, a collaboration of the Substance Abuse & Mental Health Services Administration, the U.S. Departments of Education and Justice, and the National Institute on Drug Abuse
Given the wealth of evidence-based interventions, it is timely for neighborhoods and communities to begin to implement and evaluate complex multicomponent interventions.
The Promise Neighborhoods Research Consortium (PNRC) is an NIH-funded network of social, behavioral, and health scientists created to provide scientific support to these efforts. The PNRC identified major cognitive, behavioral, social, and health outcomes at each phase of development, from the prenatal period through adolescence, to ensure young people’s success. We then specified the major proximal and distal influences on all of these outcomes. Based on extant scientific evidence, we identified sets of policies, programs, and practices that, implemented together in neighborhoods, would best help achieve the desired social, health and behavioral outcomes. Furthermore, to effectively implement, guide, and evaluate neighborhood change efforts, the PNRC specified a set of standard operational measures of these outcomes and influences.
The Scientific Foundation for Comprehensive Community Interventions
Human wellbeing is supported by the creation of environments that nurture development by (a) minimizing biologically and socially toxic stressors; (b) teaching, modeling, and reinforcing prosocial and healthy behavior; (c) limiting opportunities for unhealthy or antisocial behavior; and (d) promoting pragmatic, values-driven action. [7-8] We recently summarized this wide-ranging set of concepts; Figure 1 from that paper (also available on the PNRC website as the PNRC Model) details each developmental phase, measures appropriate to assess each construct at each phase, and descriptions of evidence-based interventions. The PNRC website supports neighborhood-serving organizations and residents working for change in their neighborhoods by providing them with the best current evidence from scientific research.
Ours is certainly not the only framework that could be used for improving wellbeing in high-poverty neighborhoods; but it illustrates the complexity of the problem, the need for comprehensive interventions, the fact that many evidence-based interventions are available across multiple levels and domains, and the scientific challenges that remain for getting from current knowledge to effective comprehensive interventions in neighborhoods and communities. Here we articulate the major methodological tools available to move from current knowledge to effectively nurturing outcomes for young people in the many high-poverty neighborhoods that exist in the United States.
Challenges for Optimal Evaluation Research
The PNRC model of important child outcomes and the proximal (immediate) and distal (background) influences on them provides one example of a causal model that might inform complex neighborhood interventions. It also provides a helpful structure for discussing evaluation/research designs. In this section, we focus on three major challenges to optimal evaluation: (1) multiple levels of evaluation/research, (2) complexity of packages of interventions, and (3) classes of evaluation/research questions. We then discuss alternative rigorous research designs able to accommodate these challenges.
The evaluation of complex neighborhood-based, multicomponent interventions will need to take place within each neighborhood and across neighborhoods. Each neighborhood will have to collect evaluation data to monitor the adoption and implementation of each intervention component, and track effects of the specific interventions they deliver. Across neighborhoods, evaluation research will be necessary to answer questions about effects of the neighborhood/community initiatives en bloc.
To achieve sizable results, neighborhoods or communities funded to conduct complex community interventions must provide a multicomponent package of interventions across many settings and age groups. No one intervention alone will solve all the problems in the community. Even the best evidence-based interventions produce relatively small effects, meaning they will affect only a portion of the population, or the effects will be small for any one person or institution but nevertheless important and large when considering the population as a whole. Different interventions also target particular outcomes or sub-populations (e.g., developmental stages, age groups, risk status). Therefore, multiple evidence-based interventions will need to take place concurrently to produce larger effects at the neighborhood level. Because the effects of multiple interventions could be due to the additive effects across interventions or to synergistic or mutually reinforcing effects, a viable approach to evaluation must consider:
- Effects of specific interventions separately, and
- Effects of the package of interventions in each community.
Evaluation research questions and viable research designs are comparable for these two levels, although the details might differ substantially.
Complex Multicomponent Interventions
Patton  best sums up the difference between evaluating stand-alone programs and complex multicomponent interventions. See Table 1. Simple programs are relatively easy to replicate (with expected similar effects across replications) and relatively easy to evaluate. Complicated programs are more difficult to reproduce with fidelity and to replicate effects, and, hence, more difficult to evaluate. Complex multicomponent interventions are even more difficult to reproduce and evaluate.
An evidence-based practice  is most like a recipe; an individual program combines practices and can range from simple to complicated; a policy or structural change, including the process of getting a policy passed, implemented, and enforced, is complicated. A package of multiple interventions clearly is complex, with potential synergies and interactions. In some ways, the parts are indivisible from the whole. A successful package of interventions to help a neighborhood achieve large results is likely to contain multiple practices, programs, and policies. We use the term “component interventions” to refer to any one of these practices, programs, or policies, and the term “multicomponent interventions” to refer to a set of multiple interventions implemented together and that may include one or more practices, programs, and/or policies.
The complexity of multicomponent intervention packages requires evaluation designs of matching sophistication: designs that acknowledge differences between neighborhoods and intervention packages, in the extent and quality of implementation, and in how strategies are adapted over time as neighborhoods learn what works and what does not. Traditionally, development and evaluation of individual programs have been “top-down”; that is, researchers or developers determine what is needed and how to achieve it, develop a program, evaluate its efficacy, then its effectiveness, and then offer it to the world. [13-14] In contrast, development and evaluation of a complex package of interventions is more likely to be “bottom-up”: neighborhoods determine what they need, adopt/adapt particular combinations of interventions, and evaluate them for effectiveness (both individually and as a package) in a specific real-world setting, rather than for efficacy in researcher-controlled settings (i.e., pragmatic rather than explanatory evaluation). They are likely to focus as much on implementation issues as on outcomes, because they cannot achieve outcomes if they fail to deliver the interventions well. This approach is similar to Patton’s concept of “developmental evaluation,” which involves changing the intervention, adapting it to changed circumstances, and altering tactics based on emergent conditions. It can be particularly useful for programs that evolve over time as they address emerging issues in changing environments.
Critical Evaluation/Research Questions
Evaluating the effectiveness of interventions in real-world settings requires focus on assessing the multiple phases of the program effects pathway. [12, 17] See Figure 1. If an evidence-based intervention fails to produce effects, it is likely because it was not adopted in the first place, was not implemented with integrity, did not reach and engage the target audience, or did not produce the expected immediate effects. [18-20] In most cases, it should be possible to specify the expected immediate effects of these interventions that would instill confidence that they will result in longer-term improvement in youth outcomes. For example, a parenting intervention should achieve some immediate improvement in parenting practices, followed by long-term outcomes for youth, such as improved skills or health status.
Understanding the mechanism or expected pathway of effects leads to an assessment of whether a package of interventions is performing in the way it is intended along the full range of its implementation, rather than simply an evaluation of its ultimate impact (cf. “theory of change” models of evaluation).  Evaluating the expected mechanism of effect helps us understand which elements of the package are functioning well and which might be improved to achieve larger effects. For neighborhoods that receive support for complex, community-wide, multicomponent interventions, a wide range of factors will affect implementation and outcomes, including cultural, societal, geographical, and political factors, as well as the presence of existing investments and activities addressing some or all of the same outcomes. Given the multiplicity of factors that influence outcomes, the goal of evaluation is to assess a multicomponent intervention’s contribution to changes in outcomes. Research designs incorporating repeated measures, both before and after intervention implementation, are necessary to follow the trends in its implementation, reach, and impact.
Thus, basic questions to determine which efficient research designs will help to evaluate complex multicomponent interventions include:
1. Implementation: Is each intervention implemented with fidelity, so that the effects it had elsewhere are likely to be replicated? Evaluating the effectiveness of evidence-based interventions in real-world settings requires much more focus on assessing adoption patterns, adaptations, and implementation level/intensity and integrity. Key questions include: How many implementers or settings adopt the intervention? How often and how well is it delivered?
2. Reach: Is each intervention reaching the members of its intended target audience or group? An intervention cannot have the expected effects on a target audience it does not reach, so there must be methods for testing strategies to expand reach to a high proportion of the target population. We need to know: What is the recipients’ level of engagement in and satisfaction with the intervention? Is the reach sufficient to create a tipping point in population-level outcomes?
3. Immediate effects: Does an intervention have the expected immediate effects on the mediating or proximal behaviors or processes? Without reliable immediate effects, chances of longer-term change in ultimate outcomes remain slim. Key questions include: Does delivery of the intervention produce the expected immediate effects, especially changes in behaviors or organizational practices rather than simple changes in knowledge? Are results the same for everyone? How well are these changes maintained?
4. Outcomes: Are there changes in longer-term outcomes? If so, is it possible to discern which intervention components have contributed to each effect? Key questions include: Does continuous delivery of the intervention lead to the expected outcomes? How long do they take to occur (e.g., 3-5 years)? How well are they maintained? Do outcomes differ for different subgroups? Can we attribute effects on these outcomes to specific component interventions?
These questions are not only important for research, but also vital for effective policymaking and service delivery, maintenance of service-delivery quality, continuous improvement in interventions, and the public support needed to maintain interventions. In short, we see a human service and prevention system evolving in which distinctions between research and practice diminish as careful measurement and experimental evaluation of intervention processes and their effects become fundamental to program operation and service provision.
We expect the resulting evaluation designs described here to contribute to the understanding of large-scale programmatic strategies to meet the needs of high-poverty neighborhoods and their residents. Such research will provide rigorous, non-partisan, multidisciplinary, and independent assessment of complex multicomponent interventions to inform policymakers, the scientific community, program implementers, neighborhood members, and stakeholders in education, social work, and public health.
Evaluation Research Designs
Designs for Intervention Implementation and Reach Questions
Intervention implementation. We can answer these questions by tracking and continuously monitoring who (or which settings/places) adopts a particular intervention, how many adopt it, and how often and how well they deliver it. Collecting data on the proportion of people/places adopting and implementing the intervention(s) helps decision-makers and implementers ensure that the component interventions, and the package as a whole, are adopted and implemented as planned.
Management information system records, observations, standardized reporting by implementers, or survey measures of implementation quality, together with comparison against established norms or benchmarks, can help determine the quality of implementation of an intervention. This information can help providers improve the level and quality of implementation. Understanding these aspects of implementation is critical to the interpretation of evaluation results and the enhancement of future or ongoing adoption or adaptation.
Tracking and ongoing monitoring can also help answer intervention reach questions: how many members of the target audience each intervention reaches, and how recipients respond (e.g., satisfaction, ongoing engagement). These data help ensure that the interventions reach their target audiences and that those audiences engage with the appropriate interventions at the appropriate times, places, and levels. Such data would also support experimental studies of better ways to reach target audiences.
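Reach and engagement metrics of this kind can be computed directly from routine monitoring records. The following is a minimal illustrative sketch; the field names, identifiers, and figures are hypothetical, not PNRC specifications:

```python
# Minimal illustrative sketch: summarizing intervention reach and engagement
# from monitoring records. All field names and numbers are hypothetical.

def reach_summary(target_population, participants):
    """Proportion of the target audience reached, plus mean engagement
    (sessions attended / sessions offered among those reached)."""
    enrolled = [p for p in participants if p["id"] in target_population]
    reach = len(enrolled) / len(target_population)
    engagement = (sum(p["sessions_attended"] / p["sessions_offered"]
                      for p in enrolled) / len(enrolled)) if enrolled else 0.0
    return {"reach": reach, "mean_engagement": engagement}

target = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}     # intended audience (IDs)
records = [
    {"id": 1, "sessions_attended": 8, "sessions_offered": 10},
    {"id": 2, "sessions_attended": 5, "sessions_offered": 10},
    {"id": 3, "sessions_attended": 10, "sessions_offered": 10},
    {"id": 99, "sessions_attended": 4, "sessions_offered": 10},  # not in target
]
summary = reach_summary(target, records)
print(summary)  # reach = 3/10; mean engagement = (0.8 + 0.5 + 1.0) / 3
```

Tracked over time, such summaries indicate whether reach is approaching the proportion of the population needed for population-level change.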
Designs for Intervention Effectiveness and Outcome Questions
Immediate effects are the direct short-term effects expected of component interventions (whether practices, programs, or policies) related directly to the immediate or proximal influences identified in the PNRC conceptual model. Outcomes are more long-term. Because of the multiple outcomes expected at the neighborhood level (see PNRC model), attributing changes in them to particular interventions will often not be as easy as it is for immediate effects. The main choices of research design are the same for assessing immediate or long-term intervention effectiveness. In this section, we discuss possible designs in order of scientific rigor for establishing a causal relationship. Because funding agencies favor designs that are more rigorous, communities might consider them in this same order. After explication of stand-alone designs, we describe how a hierarchy of nested designs may be useful.
Randomized controlled trials. Most of the substantial progress in prevention and education in recent decades is due to the increasing use of randomized controlled trials (RCTs). The number of RCTs of preventive interventions rose from fewer than 10 in 1992 to more than 40 in 2007. As we move to implement interventions in whole neighborhoods, it is vital that we continue to evaluate them rigorously. We cannot assume that existing evidence, often derived from implementations under optimal conditions, will guarantee the success of these interventions when implemented in new and more challenging settings. [13, 22] Ongoing rigorous evaluations are necessary both for accountability to those who fund the efforts and for improving intervention effectiveness. Within neighborhoods, randomized trials would provide the strongest evidence of effectiveness when circumstances permit randomization. For example, within a neighborhood, a lottery can select children for a charter school, as was done in the Harlem Children’s Zone. Where multiple interventions are delivered in one kind of setting (e.g., families, schools, clinics), and there are insufficient resources to offer them to all instances of the setting, then settings might be randomly assigned to receive or not receive the package of interventions. This approach is likely to be politically unpopular because community members generally prefer that everyone benefit. However, randomization generates the strongest evidence of effectiveness as long as a sufficient number of settings/units are randomly assigned to the conditions (for example, 60 or more families, or 20 or more schools or clinics).
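The need for a "sufficient number" of randomized units can be made concrete with the standard design-effect formula for cluster randomization, which shows how much statistical information is lost when whole settings, rather than individuals, are assigned. The numbers below are illustrative assumptions, not figures from this paper:

```python
# Back-of-the-envelope sketch (illustrative numbers, not from the paper):
# with intraclass correlation rho and average cluster size m, cluster
# randomization inflates required sample size by the "design effect".

def design_effect(m, rho):
    """Variance inflation factor for cluster randomization: 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

# E.g., 20 schools per arm with 250 students each and a modest ICC of 0.02:
deff = design_effect(250, 0.02)          # 1 + 249 * 0.02 = 5.98
n_total = 20 * 250                        # 5000 students observed
effective_n = n_total / deff              # information worth of ~836 students
print(round(deff, 2), round(effective_n))
```

Even a small intraclass correlation, multiplied across large clusters, can shrink 5,000 observed students to the statistical equivalent of a few hundred, which is why trials with only a handful of neighborhoods rarely yield reliable effect estimates.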
Research involving multiple neighborhoods would also benefit from randomized trials if a sufficient number of neighborhoods adopt a similar set of interventions. Indeed, funders could rank proposals in order of reviewers’ scores and then randomly assign half of the best proposals to receive funding for the intervention and the other half to serve as controls. This approach is difficult to achieve, however, and there might still be substantial variation between neighborhoods in the choice of interventions and how well they are implemented.
Limitations of randomized trials. RCTs are not always the optimal method of advancing prevention in neighborhoods. As the size of the unit receiving an intervention increases (e.g., from individuals to whole neighborhoods) and the complexity of the intervention increases (e.g., from one clearly defined program to multiple policies, programs, and practices), it becomes more difficult to conduct randomized controlled trials. The most obvious reason is that it becomes difficult to include a sufficient number of units (communities) to estimate intervention effects reliably: as the size of the unit receiving intervention increases, there are fewer of them to compare and contrast in an RCT. Second, randomized trials require standardization of the intervention across all units receiving a “treatment,” which is very difficult to achieve in a neighborhood or community intervention. A good example of how an RCT of a community intervention can fail is the COMMIT trial. [2, 24-25] It is important to note, however, that many school-based RCTs, some involving whole communities [26-27], and a few community-based RCTs [28-29] have achieved success.
An even more challenging assumption underlying randomized trials in neighborhoods concerns the likely efficacy of the intervention. One fundamental rationale of an RCT is replication of effects across cases. Thus, it would be a mistake to submit an intervention to testing in an RCT before having confidence in the likelihood of its efficacy. Yet, at this time, we do not have much evidence that complex interventions with many interacting components will have replicable effects across neighborhoods or communities.
The most significant problem with employing randomized trials to evaluate complex multicomponent neighborhood or community interventions at this stage of our knowledge is that they preclude further improvements in the intervention. We know relatively little about the best ways to engage entire neighborhoods in intervention activities or about how to increase social cohesion and collective efficacy ; such processes may be foundational for successful introduction of evidence-based interventions. What is necessary in the further development of neighborhood interventions is a method for systematically evaluating the functional effects of intervention components. To accumulate enough knowledge to justify the next round of randomized trials, we need methods to replicate reliably the relationships between intervention components and measured processes in individual neighborhoods. We offer some options below.
Regression-discontinuity designs. Many believe the regression-discontinuity design to be one of the most powerful “quasi-experimental” designs that, when implemented properly, can produce conclusions as clear as those obtained from randomized trials. [30-32] In regression-discontinuity designs, participants receive assignment to conditions based on a continuous quantitative variable, often a measure of need, merit, or risk (e.g., children assigned to receive school lunch programs if their household income falls below a specified threshold). The functional relationship between the known quantitative assignment variable (e.g., household income) and the outcome variable (e.g., health, school achievement), estimated separately for the treated group that falls below the threshold and the control group that falls above the threshold, provides the basis for causal inference. Because treatment assignment is determined fully by the assignment variable, inference of a treatment effect of the school lunch program is warranted if there is a discontinuity at the threshold where the treatment is introduced. Strong causal inferences are possible as long as the rigorous standards outlined by the What Works Clearinghouse are met. 
Vaughn et al.  present an example of the results of a regression-discontinuity design. They provided remedial reading instruction to children whose Oral Reading Fluency scores fell below 27 words a minute, but not to those with higher scores. The regression line relating initial oral reading fluency to post-intervention word identification skills was more positive for children who received the intervention than for those who did not, indicating that intervention recipients performed better than would otherwise have been expected.
Regression-discontinuity designs are statistically less efficient than randomized trials, needing about three times as many cases to reach the same statistical power.  As a result, they are rarely appropriate when the assignment unit is a larger aggregate (e.g., a city or state), but highly effective when individuals are assigned to treatments based on the cut-off variable. The regression-discontinuity design is best used where a large number of cases is available, random assignment is not feasible or desirable, and a continuous variable with a cut-off can be used for assignment to conditions. Such situations might be rare in neighborhoods, but they will exist. For example, instead of using a lottery to assign children to a new form of school, one might use scores on a prior test or an admissions exam. This approach would be politically acceptable because it would allow neighborhoods to provide the new services to those most in need of them.
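The logic of the design can be illustrated with a small simulation, loosely modeled on the 27-words-per-minute example above but using entirely made-up data: assignment is determined by a cutoff, separate regression lines are fitted on each side, and the treatment effect is estimated as the gap between the two fitted lines at the threshold.

```python
import numpy as np

# Illustrative regression-discontinuity simulation (assumed numbers,
# not Vaughn et al.'s data): children below a 27-words/minute cutoff
# receive remediation; the effect is the jump in outcomes at the cutoff.
rng = np.random.default_rng(0)
cutoff = 27.0
fluency = rng.uniform(5, 60, 400)     # continuous assignment variable
treated = fluency < cutoff            # deterministic assignment rule
effect = 8.0                          # simulated treatment effect
outcome = 20 + 0.9 * fluency + effect * treated + rng.normal(0, 3, 400)

# Fit a line on each side of the cutoff; compare predictions at the threshold.
b_t = np.polyfit(fluency[treated], outcome[treated], 1)
b_c = np.polyfit(fluency[~treated], outcome[~treated], 1)
estimate = np.polyval(b_t, cutoff) - np.polyval(b_c, cutoff)
print(round(estimate, 1))  # close to the simulated effect of 8.0
```

Because assignment is fully determined by the observed cutoff variable, the discontinuity at the threshold supports a causal interpretation, provided the functional form on each side is modeled correctly.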
Designs involving repeated measures. Identifying relationships between an intervention and outcomes requires observation of the phenomenon of interest to see if some manipulation of an independent variable affects it. This typically requires repeated or continuous measuring of the phenomenon. Also, at this stage in our knowledge, developing effective multicomponent interventions would require considerable trial and error. The social importance of these interventions means that we cannot simply test them and wait to discover whether they have or have not worked.
Repeated measures can strengthen randomized trials,  but a variety of non-randomized designs also involve the use of repeated measures of a process or outcome of interest. The fundamental feature of these designs is the observation of a change in a series of data points after introduction of an intervention or independent variable. A change in either the slope or the intercept of the repeatedly measured process or outcome is evidence of the effect of the intervention. [31, 35] These designs allow for adjustments (in light of the most recent results) in the program or its implementation to improve effectiveness.
This basic technique has a venerable history in diverse areas of science, engineering, and business management, although each area uses its own language. In engineering, process control theory specifies using a sensor to monitor a desired system output, providing feedback to system inputs in a closed-loop system.  In manufacturing, statistical process control methods assess levels and variability of key measures, with constant monitoring for “out-of-control” signals that suggest corrective action.  In business management, “total quality management,” “total quality improvement,” “six sigma” and similar methods  recommend clear goals, identification of reliable quality and performance indicators that are continuously collected, and understanding and building the systems of interacting components required to achieve and maintain desired outcomes, such as happy customers. The observation of change in a repeatedly measured process also led to seminal contributions to physiology  and behavior modification. [39-40] The movement to assess students’ response to interventions in order to identify what works for individual students  is an example of this technique. The growing success of the Harlem Children’s Zone has been guided by a similar focus on repeated assessment of students’ progress and modification of procedures in light of results.  Repeated measurement designs have also been used to evaluate the effects of public policies around alcohol [42-43] and the effects of interventions in communities. 
The fundamental issue in these designs is whether we can have confidence that any observed change in a repeatedly measured process is, in fact, due to the intervention or independent variable. The two most important issues in this regard are the statistical reliability of any observed change in the intercept or slope of the time series following introduction of the independent variable (e.g., policy change) and replication of the effect across cases.
Interrupted time-series designs. Rigorous interrupted time-series designs require many data points, typically a minimum of 30 and often a hundred or more, both before and after an intervention. [14, 31] This limits the application of this design to variables on which data can be collected frequently (e.g., by the minute, hour, day, or week) or to variables on which data have been collected routinely and consistently on a monthly or annual basis at least 15-20 times before the neighborhood implements new interventions, and will continue to be collected in the same consistent and routine way after the new interventions start. Statistical methods are in place for analyzing effects on the time series, and a wide range of common theoretically expected effect patterns (gradual effects; decaying effects; "S-curve" effects with a slow start, fast middle, and late asymptote) are easily modeled and tested. Time-series data are auto-correlated and often have cyclical components, so statistical modeling methods such as Auto-Regressive Integrated Moving Average (ARIMA) modeling are necessary.  These techniques allow for the estimation of intervention-induced changes in patterns beyond a change in level or a simple change in slope, such as changes in variance, cycles, etc. Many interventions have S-shaped gradual effects or sudden but exponentially decaying effects. All such functional forms of intervention effects can be statistically modeled and evaluated, a very important strength of time-series designs, especially when appropriately high-resolution data are used. This is in contrast to designs not incorporating time series, which usually measure simple before/after changes, ignoring the growing or decaying effects of most interventions.
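A minimal segmented-regression sketch shows how changes in level and slope are estimated from an interrupted series. The data here are simulated monthly values, and for brevity the sketch uses ordinary least squares, omitting the ARIMA modeling of autocorrelation discussed above:

```python
import numpy as np

# Segmented-regression sketch for an interrupted time series
# (simulated data; autocorrelation/ARIMA modeling omitted for brevity).
rng = np.random.default_rng(1)
n_pre, n_post = 36, 24                      # 36 pre, 24 post observations
t = np.arange(n_pre + n_post)
post = (t >= n_pre).astype(float)           # intervention indicator
t_post = np.where(t >= n_pre, t - n_pre, 0) # time since intervention

# Simulated process: level 50, flat pre-trend, then a level drop of 6
# and an additional downward slope of 0.3 after the intervention.
y = 50 - 6 * post - 0.3 * t_post + rng.normal(0, 1.5, t.size)

# Design matrix: intercept, pre-existing trend, level change, slope change.
X = np.column_stack([np.ones_like(t, dtype=float), t, post, t_post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
level_change, slope_change = beta[2], beta[3]
print(round(level_change, 2), round(slope_change, 2))
```

The level-change and slope-change coefficients correspond directly to the intercept and slope shifts described above as evidence of an intervention effect.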
Time resolution is a key issue in interrupted time-series designs. Are the observations once/hour, day, week, month or year? Are they a sampled time point (e.g., temperature once/hour) or a sum over a set time (number of incidents or deaths per year)? Higher time resolution is usually better, but complexities can arise. Higher resolution series may have cycles in them (e.g., hourly, daily, weekly, monthly, seasonal) that create additional variance; but this is variance that time-series analyses can control.
Where sufficient pre-intervention data exist on a regular basis (e.g., daily, weekly, monthly or annual assessments), repeating the same assessments post-intervention can create an interrupted time series. Time-series designs can be used for either individual interventions or packages of interventions, and for single neighborhoods or multiple neighborhoods. Having a sufficiently large number of repeated data points provides greater confidence in the reliability of the claimed relationship between the independent variable and the dependent variable. However, replicating that relationship across cases (e.g., persons, neighborhoods, schools, or cities) is also very important. Confidence in the evidence from those replications improves as a function of (a) whether the data came from multiple cases at the same time (as opposed to doing a series of replications over time); (b) whether the implementation of the intervention was staggered across cases; (c) whether the receipt of the intervention across cases was determined at random; (d) how large the effect is on intercept or slope; (e) how reliable these effects are (as a function of the number of data points and the size of the effects); (f) whether the relationship between the independent and dependent variables can be shown to reverse when or if the intervention is subsequently withdrawn. 
Repeated-measures, non-equivalent control group designs. Where multiple interventions are delivered in one kind of setting (e.g., families, schools, clinics), and there are insufficient resources to offer them to all of those settings but randomization is not possible or acceptable, groups obtaining the package of interventions might be compared with non-equivalent groups not receiving it. This approach is strongest if there is a known assignment variable (e.g., test scores for schools). It is likely to be more politically popular than random assignment; but randomization should still be encouraged in such situations, where a "lottery" among all the fully qualified sites is the fairest way to select an initial subset for implementation. Non-equivalent comparison group designs require a large number of settings (e.g., 60 or more families, 20 or more schools or clinics) and very careful assessment of the assignment variable, covariates, and alternative explanations of any observed effects.
At the component-intervention level nested within neighborhoods, non-equivalent control group designs are appropriate when not all people in the target group can participate in the intervention and randomization is not possible. All of the cautions about such designs related to lack of randomization apply but, with the inclusion and appropriate analysis of covariates, they can be useful in neighborhood contexts, as well as across communities, for determining if an intervention is having its expected effects. The strongest such designs involve matched pairs and repeated measures (but far fewer than time-series designs) before and after starting the intervention.
Research involving multiple neighborhoods could also use the repeated-measures, matched-pair, control-group design. Meta-analyses and secondary analytical studies have found that well-conducted matched-pair, control-group designs can produce estimates of effects similar or equivalent to those from randomized trials. Others have demonstrated that careful analysis of repeated-measures data, using multilevel data analytic techniques such as multilevel trajectory analysis, provides information that standard statistical models (e.g., inclusion of covariates, econometric models) are designed to produce, but with greater precision (i.e., smaller standard errors), and provides better information about the nature of program effects.
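A minimal sketch of the matched-pair logic on hypothetical data: analyzing within-pair differences removes pair-level variation, and a paired t-statistic summarizes the effect. (Real repeated-measures analyses would use the multilevel techniques cited above; the numbers here are invented for illustration.)

```python
import math
import statistics

# Hypothetical outcome changes (post minus pre) for 8 matched neighborhood
# pairs: first element received the intervention, second did not.
pairs = [(4.1, 1.0), (3.2, 0.5), (5.0, 2.1), (2.8, 1.9),
         (4.4, 0.2), (3.9, 1.5), (2.5, 0.8), (4.7, 1.1)]

# Within-pair differences remove variation shared by the two pair members
# (e.g., similar demographics), shrinking the standard error.
diffs = [treated - control for treated, control in pairs]
mean_d = statistics.mean(diffs)
se_d = statistics.stdev(diffs) / math.sqrt(len(diffs))
t_stat = mean_d / se_d
print(f"mean pair difference {mean_d:.2f}, t = {t_stat:.2f} "
      f"(df = {len(diffs) - 1})")
```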
Multiple-baseline designs. The label multiple-baseline commonly refers to multiple pre-intervention measures or pretests. However, it sometimes refers to the multiple units/cases on which the repeated measures are taken. In the behavior analysis literature since the 1960s, the term has been used to refer to multiple settings, units/cases, or outcomes. We use the term multiple-baseline to refer to both multiple pretests and multiple cases, with staggered introduction of the intervention across cases, where cases can be groups or places (e.g., neighborhoods or subsets of neighborhoods). In some respects, this type of design might be the workhorse of neighborhood intervention evaluation and research. As interventionists start work with one case or one group of cases (e.g., a preschool, a family, a classroom, several blocks of a neighborhood, a neighborhood), they can adapt and improve the intervention before working with the next case(s). Project 16 provides an example.
As noted earlier, when many data points are available, even more powerful time-series analytical techniques can be applied to model complexity in the data, including cycles like monthly or seasonal variations, and specific functional forms of hypothesized intervention effects over time. Naturally, random assignment of timing of intervention implementation by unit (e.g., school) will further strengthen this design.
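The replication logic of staggered introduction can be sketched with simulated data: each case's level shifts only after that case's own start week, not at the other cases' start weeks, which is the pattern that supports causal inference in a multiple-baseline design. The cases, start weeks, and effect size below are invented for illustration.

```python
import random
import statistics

random.seed(7)

# Three cases (e.g., schools) observed weekly for 30 weeks, with the
# intervention introduced at staggered start weeks. Once it begins,
# the intervention raises a case's level by 5.
starts = {"case_a": 10, "case_b": 16, "case_c": 22}
series = {
    case: [20 + (5 if week >= start else 0) + random.gauss(0, 1)
           for week in range(30)]
    for case, start in starts.items()
}

# Evidence of effect: each case's pre/post shift lines up with its own
# staggered start, ruling out a shared external event as the cause.
for case, start in starts.items():
    pre = statistics.mean(series[case][:start])
    post = statistics.mean(series[case][start:])
    print(f"{case}: pre {pre:.1f} -> post {post:.1f}")
```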
Single-case studies. An interrupted time-series analysis can be used with a single neighborhood; this is a form of single-case study, in which the single case is assessed on multiple occasions before and after the introduction of an intervention, and then another case is (later) provided with the same or a similar intervention. The assessments before the intervention help to estimate the normal level and the variation around that level. The effects of the introduction of an intervention appear as a change in the level or slope of the indicator as assessed on multiple occasions after the introduction of the intervention. Causal statements are possible from single-case studies as long as the standards of the What Works Clearinghouse (WWC) are met [52-53], especially the requirement that the process be repeated with multiple cases.
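A minimal sketch of this single-case logic on hypothetical counts: the baseline phase establishes the normal level and variation, and post-intervention observations are compared against a band around that level. (The 2-SD band used here is one simple convention, not a WWC standard.)

```python
import statistics

# Hypothetical weekly counts for a single case: 10 baseline observations,
# then 5 observations after the intervention is introduced.
baseline = [12, 14, 13, 11, 15, 13, 12, 14, 13, 12]
post = [8, 7, 9, 6, 8]

# Characterize the normal level and variation from the baseline phase,
# then flag post-intervention points falling outside a 2-SD band.
level = statistics.mean(baseline)
sd = statistics.stdev(baseline)
outside = [y for y in post if abs(y - level) > 2 * sd]
print(f"baseline level {level:.1f} (SD {sd:.2f}); "
      f"{len(outside)} of {len(post)} post points outside the 2-SD band")
```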
Response to Intervention (RtI), as practiced in education today, is a variation of the single-case study approach. Students are observed repeatedly; if they do not learn as expected, they receive additional or different instruction, and when they do learn as expected, they move to the next level. It is sometimes very much a trial-and-error approach, in which the educator applies different interventions until finding one that works. This approach can be valuable for getting rapid feedback about an intervention's effects and for tailoring an intervention to a specific case. It is not recommended for rigorous evaluation, because its focus on making progress with a single case does not necessarily lead to improved understanding of the conditions under which particular intervention components are effective.
Strengthening Research Designs – A Hierarchical Approach
Adding more waves of measurement, both before and after interventions start, can improve all designs, including randomized trials. Including multiple settings, and in particular multiple neighborhoods, can also improve all of the above designs. At the neighborhood level, any design will (and should) include multiple dependent variables, and the expectation is that the package of interventions will produce effects on all of the targeted outcomes. Including comparison dependent variables that are not expected to change due to the intervention is an additional way to improve the design. This design option, which builds on the assumption that a particular intervention will produce effects on targeted outcomes and not on others, is sometimes called the non-equivalent dependent variable design.
Adding randomization and multiple comparisons can also improve all of the designs. Time-series or repeated-measures designs with multiple comparisons are rigorous designs that have been utilized effectively in the evaluation of policy effects on public health. A time-series or repeated-measures design across multiple experimental units enables the careful evaluation of an intervention in a succession of cases in a way that can provide confidence that the intervention worked if its impact was replicated in all the cases of the series. [42-44, 54]
Adding randomization to a time-series or repeated-measures design could also be used to evaluate initiatives across multiple neighborhoods, treating each neighborhood as an experimental unit. One could, for example, implement the package of interventions in one neighborhood randomly assigned from a matched set of three neighborhoods, obtain repeated measures to enable a statistical analysis of the outcomes and processes of interest in that and the two comparison neighborhoods, then replicate the intervention in a second and, later, a third neighborhood, once evidence of effect of the intervention is shown in the first (and then second) cases.
Adding multiple comparisons to time-series or multiple-baseline designs further enhances the strength of causal inference. For example, for the overall evaluation of intervention packages, two comparisons could be utilized. First, when examining the effectiveness of interventions by developmental phase in the first experimental neighborhood, the delayed-intervention neighborhoods could serve as comparisons. Second, for each archival outcome measure, the aggregate of all the other low-income neighborhoods within the same city or county could be used as a comparison covariate when examining effects in the experimental neighborhoods. This approach would efficiently control for broad trends as well as any specific change or perturbation in outcomes due to a host of factors that act in common across neighborhoods within a city or county, including such factors that are not (or cannot be) measured.
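The logic of using the aggregate of other neighborhoods as a comparison covariate resembles a difference-in-differences calculation: the comparison aggregate absorbs citywide trends and shocks, and the estimate of interest is the change in the experimental neighborhood beyond that shared trend. The rates below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical annual rates (per 1,000) for the experimental neighborhood
# and the aggregate of all other low-income neighborhoods in the same
# city, two years before and two years after the package starts.
experimental = {"pre": [42.0, 41.0], "post": [33.0, 31.0]}
comparison = {"pre": [40.0, 39.5], "post": [37.5, 36.0]}

def change(series):
    # Mean post-intervention level minus mean pre-intervention level.
    return (sum(series["post"]) / len(series["post"])
            - sum(series["pre"]) / len(series["pre"]))

# Difference-in-differences: the experimental change net of the change
# shared with comparable neighborhoods (secular trends, common shocks).
did = change(experimental) - change(comparison)
print(f"experimental change {change(experimental):.1f}, "
      f"comparison change {change(comparison):.1f}, DiD {did:.1f}")
```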
It is unlikely that the first attempt to implement a package of interventions will be totally successful, and adjustments will be necessary. Therefore, an adaptive intervention design (as distinct from an adaptive research design) is desirable. A time-series or multiple-baseline design with monthly assessments of implementation and intermediate outcomes could simultaneously address the needs for a rigorous evaluation and for a practical way of implementing and refining the intervention. Careful monitoring of implementation fidelity, uptake, and effects of each of the components of the intervention can provide feedback to guide refinement of the intervention. Thus, the intervention implementation procedures and their effects are likely to improve over time. This is an important learning process natural to the refinement of any intervention (and in engineering as well). In this way, continuous evaluation and refinement of the intervention (including implementation strategies and intervention components) can occur within one neighborhood, and the lessons learned can be applied to implementation in subsequent neighborhoods.
The Fundamental Importance of a Measurement System
Traditionally, organizations have delivered educational and human service interventions without accompanying efforts to monitor the quality or effects of the intervention. It is increasingly clear, however, that effective education and human services need ongoing systems for monitoring the fidelity of implementation of evidence-based interventions, the proportion of the target population reached, their immediate effects on targeted processes, and their ultimate effects on longer-term outcomes. [9, 57-59]
Thus, a standardized measurement system is fundamental to the evaluation of complex multicomponent interventions. Such a system, to be usable in local neighborhoods and communities across the country, would include assessments of (1) implementation of each intervention component, (2) reach of each component, (3) effects of each component on immediate outcomes, and (4) effects of the comprehensive intervention package on outcomes. It would also meet the following nine requirements:
1. Be standardized and usable for evaluations within and between communities
2. Include a comprehensive set of reliable and valid measures
3. Be easy to use by a variety of individuals and organizations within communities
4. Allow local evaluators to select the measures most relevant to their community
5. Include only key measures of process, intermediate, and primary outcomes
6. Be readily accessible to all communities across the country (e.g., economically, technologically, and culturally)
7. Rapidly provide results to relevant individuals and organizations to inform continuous quality improvement
8. Be secure and protect confidentiality
9. Support the use of unique identifiers in order to aggregate data at multiple levels (e.g., individuals within families, within schools/organizations, within neighborhoods/communities, etc.)
Measuring process, implementation, and reach. Systematic methods to monitor community change processes have also been developed as part of complex community-wide intervention research. [60-64] Others have designed standardized web-based systems to track complex community participation and action, and systems to monitor school- and family-based implementation fidelity. [62, 65-66] Such systems could grow into a standardized, yet highly flexible, web-based system that becomes a routine function within community-based organizations. A user-friendly measurement system to document and monitor adoption, implementation, and reach of each intervention component is not only vital to an overall evaluation effort; it is part of a continuous quality improvement system to improve the effectiveness of interventions over time.
Monitoring outcomes. NIH and CDC support several initiatives to monitor health and wellbeing systematically at the state and national levels. However, these systems do not allow for monitoring of outcomes at the local level, where many preventive initiatives exist. Many of these systems also have a limited focus on a single health issue. Therefore, we propose the development of a standardized and comprehensive system that local organizations can implement feasibly and routinely at the local level. The PNRC is creating such a system for all constructs in the PNRC Model, including measures of distal and proximal influences (measurable intermediate outcomes of individual community strategies) and comprehensive primary outcomes of child health and wellbeing, including cognitive, social and emotional, behavioral, and physical health domains.
We recommend complex multicomponent designs to evaluate complex multicomponent interventions, such as the Promise Neighborhoods, Prevention Prepared Communities, and Choice Neighborhoods initiatives. Careful attention to rigorous evaluation strategies for each individual intervention component can help design an overall evaluation of the effectiveness of complex multicomponent initiatives. Attention to designing rigorous evaluations and continuous quality improvement efforts for each intervention component will provide the building blocks for a more successful overall effort. The elements and principles to strengthen causal inference should be applied to the evaluation of each intervention component, as well as to the overall effort. A simple causal model should be prepared for each intervention component, specifying the intervention component’s inputs and adoption, intervention delivery, and immediate and ultimate outcomes (see Figure 1). Critical for a rigorous evaluation of each component, as well as the overall effort, is implementation of a standardized, efficient system to measure adoption, intervention delivery and reach, and both immediate and ultimate outcomes for each component. To guide decisions for optimum use of multiple design elements for constructing experiments and quasi-experiments, we suggest a hierarchical decision-making approach as follows:
Randomization whenever feasible and acceptable at neighborhood, family, organization, and/or individual levels.
When randomization is not feasible, use matched comparisons (sites and/or outcomes), in combination with (a) time series or multilevel design with (at least) monthly assessments of implementation, intermediate and ultimate outcomes, in addition to (b) elements of adaptive intervention designs, feedback loops, and continuous quality improvement.
A monitoring and measurement system is a necessary component of the kind of evaluation/research we envision for complex multicomponent interventions. Bringing about permanent improvements in the prevalence of successfully developing children requires such a system, in the same way that having good measures of economic performance is vital to management of our economy. Substantial advances have occurred in such monitoring systems, including the development of increasingly precise measures of students' academic performance. In summary, the interventions that neighborhoods and communities implement should lead to permanent changes in practices, as long as the data show their value. Thus, we envision a system of continuous quality improvement that starts with the implementation of practices that existing evidence has found valuable. Then the evidence of their effects in specific neighborhoods and communities will further shape the system. The maintenance of effective interventions will require both ongoing evidence of their continued value and ongoing quality implementation.
A grant from the National Institute on Drug Abuse (DA028946) supports the PNRC and the work on this manuscript. We thank Christine Cody for her excellent editorial assistance.