DRAFT: Please do not cite or circulate without permission.
Research on prevention, treatment, and education has brought us to the point where very significant improvements in the life trajectories of at-risk young people are possible. [1, 2] Major advances in public health are especially likely if we can translate existing knowledge into effective, multicomponent packages of interventions in high-poverty neighborhoods where psychological, social, and behavioral problems are concentrated. However, new evaluation and research strategies will be needed if the promise of existing evidence is to be translated into effective practice in multiple communities. The present paper examines the challenges that must be addressed and describes the infrastructure of measurement systems and experimental practices most likely to help the nation use existing knowledge so that many communities become places that nurture child and adolescent development and realize greater human potential.
The Potential of Recent Behavioral Science Research
Ambitious federal efforts are underway to transform child and adolescent development in neighborhoods, communities, and even states. Inspired by the success of the Harlem Children’s Zone, the Obama administration introduced the Promise Neighborhoods initiative, which supports efforts in high-poverty neighborhoods around the nation to replicate what Geoffrey Canada and his team are accomplishing in Harlem. The Department of Education has other initiatives that include statewide data systems and interventions for Safe and Supportive Schools. The Department of Housing and Urban Development has developed the Choice Neighborhoods initiative to help neighborhoods with federally supported housing improve wellbeing and prevent the social and academic failure of young people. The National Drug Control Strategy of the Office of National Drug Control Policy includes a plan to fund 30 “Prevention-Prepared” communities to implement comprehensive interventions that would prevent drug abuse and related problems.
These efforts are amply justified by the substantial accumulation of rigorously evaluated prevention, treatment, and education interventions over the past 20 years. The NRC and IOM report on prevention documents this progress. It underscores that multiple mental, emotional, and behavioral disorders are inter-related and stem from substantially the same risk factors. It documents numerous preventive interventions, from the prenatal period through adolescence, that not only prevent problems but also enhance general wellbeing. Similar progress has been reported in the treatment of drug abuse [4] as well as in interventions to reduce criminal behavior. [1, 5] Raudenbush has summarized the evidence, indicating that the achievement gap between at-risk youth and other youth could be substantially reduced by fully implementing the evidence-based teaching practices identified over the last 30 years.
Despite all of the evidence about what works, actual wellbeing lags far behind what could be achieved, especially in neighborhoods with concentrated poverty. High-poverty neighborhoods are more prevalent in the U.S. than in Europe. These neighborhoods have substantially higher levels of drug abuse, crime and violence, depression, obesity, cardiovascular disease (CVD), and all-cause mortality. [8-11] The higher rates of poverty may help explain why the U.S. has higher rates of mental, emotional, and behavioral disorders. [12-17] Substantial evidence suggests that the impact of concentrated poverty is mediated at the social-environmental level by its effect on social relations and at the biological level by stress. Thus, because of the size of their contribution to drug abuse, psychological and behavioral disorders, physical illness, and health disparities, high-poverty neighborhoods must be a central focus of prevention research and practice.
The Need for a New Research Infrastructure
The existing research infrastructure is not adequate to the task of transforming high-poverty neighborhoods. It is not organized to create the measurement and evaluation systems needed to guide us to increasingly effective practices and to determine whether comprehensive interventions produce their expected effects. Indeed, the science relevant to improving human wellbeing has never been well coordinated with the major societal efforts to improve social, psychological, and behavioral wellbeing. For example, the war on poverty, Head Start, the war on drugs, and multiple educational reforms have been only loosely based on existing behavioral science evidence; sometimes practices have been contrary to available evidence. Their implementation was not guided by careful measures of their delivery, reach, or proximal effects, and none of them was evaluated with the kind of scientific rigor that would definitively indicate whether it had worked. Scientific activities contributed neither to the initial design and evaluation of program elements nor to the continuous improvement of these elements, alone or in combination, as programs were adopted and implemented. To an extent, this was because the needed measures and scientific techniques were not yet widely understood and available, but that is no longer the case. Feedback mechanisms of this kind, which measure delivery, reach, and proximal effects and use the results to improve programs, are fundamental to achieving both immediate and long-term benefits from efforts to change human environments. [19-26]
If we do not define and develop the data and feedback infrastructure needed for these initiatives to succeed, we run the risk of never having adequate evidence of their effectiveness. Moreover, we will not even be able to ensure that the components of comprehensive interventions are implemented with fidelity and that they affect the proximal targets essential to their ultimate success. Simply implementing evidence-based programs is insufficient, since there is no guarantee that the effects will be replicated. For example, Hallfors and colleagues, using a well-designed experimental study, found that they could not replicate the effects of an intervention that had positive effects in an efficacy trial. 
Thus, the problem is not simply a matter of evaluating the overall package of interventions that each community adopts. A bigger threat is that there will not be sufficient research support to ensure that the many components of comprehensive interventions (a) are implemented with fidelity and (b) affect the proximal targets essential to ultimate success. The essential first step must therefore be to create a system that ensures that each intervention component works. We need to build a comprehensive infrastructure that leads from careful descriptive analysis to planned, controlled, treatment-building interventions, careful attention to implementation science, and finally to experimental evaluation of program effectiveness. In this way, we can construct a research infrastructure that drives, and is driven by, increasingly effective, sustainable interventions in high-need communities.
Evolving a Public Health System to Nurture Human Wellbeing
We envision a system evolving to foster development of the family, school, workplace, and community environments needed to nurture child, adolescent, and adult wellbeing. Such a system would (a) monitor wellbeing and the conditions that affect it in defined populations; (b) suggest the implementation of programs, policies, and practices that evidence indicates are likely to affect wellbeing; (c) monitor how well these interventions are implemented; (d) establish the extent to which the interventions produce the effects expected of them; and (e) provide timely feedback on progress. It would not be a static effort; rather, it would continuously monitor the implementation and impact of interventions and would adjust them in light of the data.
Fixsen and Blase noted that the major challenge to the widespread and effective implementation of evidence-based practices is the development of a system of organizations that adopt, maintain, and refine these interventions based on their impact. [29, 30] In what follows, we describe the key practices that we think such a system needs. We recognize, of course, that defining a system is a far cry from achieving it.
Evidence-Based Practices Focused On Key Outcomes
Others [1, 2, 31, 32] have documented the nature of wellbeing and the conditions and interventions necessary to achieve it. Indeed, so much evidence has accumulated that it is necessary to be judicious in selecting an efficient and effective mix of interventions that are likely to succeed. This is one reason the National Institute on Drug Abuse funded the Promise Neighborhoods Research Consortium (PNRC). The PNRC is a network of scientists who, among other things, are identifying evidence-based programs, policies, and practices that research shows have the potential to significantly improve wellbeing in high-poverty neighborhoods and that might have especially large effects among children and adolescents when adopted as a package.
The PNRC has identified the major cognitive, behavioral, social, and health outcomes that serve as standards for each phase of development (from the prenatal period through adolescence), as well as the major proximal (immediate) and distal (background) influences on these outcomes. It has also identified measures of each construct. This developmental perspective is vital, as it is clear that modifiable patterns of behavior at every age can and do influence both immediate and long-term outcomes of public health and safety significance. [1, 2] For example, modifying the pattern of early disruptive behavior in first grade prevents ADHD and Oppositional Defiant Disorder (ODD) in third grade, conduct disorders and tobacco use in sixth grade [34-36], and the use of illegal drugs such as cocaine in eighth grade. It also increases high-school graduation and college entry rates and reduces lifetime psychiatric disorders, as well as criminal arrests, violence, and attempted suicides in early adulthood. Such interventions can be thought of as behavioral vaccines, in which early “inoculations” can provide a lifetime of protection, just as medical inoculations such as the polio vaccine do. Understanding the developmental trajectory of outcomes is vital for monitoring progress.
Following this life-span approach, the PNRC identified strategies that might modify children’s and adolescents’ developmental trajectories. Teams of scientists identified judicious sets of policies, programs, and practices that could be implemented together in neighborhoods to achieve desired outcomes, with an emphasis on reaching the whole neighborhood. A brief synthesis of all of this evidence is captured by the concept of nurturing environments. Human wellbeing is supported by the creation of environments that nurture development by (a) minimizing biologically and socially toxic stressors; (b) teaching, modeling, and reinforcing prosocial and healthy behavior; (c) limiting opportunities for antisocial behavior; and (d) promoting pragmatic, values-driven action.
Komro et al. summarize all of this information. The PNRC website also provides the information in a format that allows neighborhood-serving organizations and neighborhood residents to access it and use it to advance change in their neighborhoods.
A Measurement System
The measurement of the incidence and prevalence of disease is the foundation of the public health system. The practice began with the tracking of infectious disease, which was vital in learning how to control the spread of infection. Now it is essential to the ongoing management of epidemics. The practice grew to include the monitoring of other diseases, such as cancer and cardiovascular disease, as their contribution to mortality increased. Eventually, as the contribution of behavioral and psychological problems to health became clear, systems for monitoring these problems were also devised. For example, the system for monitoring cigarette smoking that the Centers for Disease Control and Prevention promulgated has been an essential component in reducing the rate of smoking.  Although the impact of stressful, non-nurturing environments on wellbeing is increasingly understood [31, 45], a system for monitoring the prevalence of nurturing environments is in its infancy. Indeed, we do not yet have a system for monitoring mental, emotional, and behavioral disorders at the community or neighborhood level.
Monitoring outcomes. Any effort to evaluate the impact of comprehensive interventions will require a system for accurately monitoring wellbeing and the environmental conditions that research shows must be modified in order to affect wellbeing. Unfortunately, no agency of the federal government is charged with developing such a system. Thus, the first step for research that evaluates any of these initiatives will be the development of a measurement system. Given the research traditions of NIH, a high-quality system could well be created for the purpose of this research, but it is much less likely that provision will be made to maintain it permanently.
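A neighborhood-level monitoring system ultimately has to turn survey responses into prevalence estimates with a stated margin of uncertainty, because neighborhood samples are small. The sketch below is illustrative only: it assumes a simple random sample of residents in each neighborhood (no survey weights or clustering), uses the Wilson score interval as one reasonable choice for a binomial proportion, and the neighborhood names and counts are hypothetical.

```python
import math


def wilson_interval(cases, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = cases / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half_width, center + half_width


# Hypothetical survey counts: (respondents meeting a screening criterion, respondents)
neighborhoods = {"Neighborhood A": (38, 240), "Neighborhood B": (21, 180)}
for name, (cases, n) in neighborhoods.items():
    low, high = wilson_interval(cases, n)
    print(f"{name}: {cases / n:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The Wilson interval is used here rather than the simple normal approximation because it stays within 0 to 1 and behaves better when prevalence is low or samples are small, which is typical of neighborhood surveys.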
In our view, this would be a significant mistake. An effective set of practices to prevent multiple problems requires continuous measurement of the reach, fidelity of implementation, and immediate impact of each practice. These initiatives ostensibly aspire to create permanent changes in the practices of neighborhoods and communities. It is inconceivable that a system for ensuring wellbeing could be sustained without continuing evidence of its impact.
Consider some successfully managed aspects of our society. Each has a measurement system that guides the adjustments in practice needed to maintain desired outcomes. The quality of manufactured goods has improved enormously in the last half century as companies have adopted more effective methods of monitoring product quality and adjusted their practices based on the results. Several continuously monitored economic performance indicators drive the worldwide management of the economy. Although the deregulation of the late 1990s contributed to a nearly catastrophic downturn, it was precisely because of our system of economic indicators that we were able to sustain greater growth and minimize volatility between the late 1940s and the recent crisis. Furthermore, the indicator system has been essential for knowing whether efforts to rebuild stability and growth are succeeding. As evidence mounts that human wellbeing depends less on economic indicators and the quality of our material goods than on the quality of our social relations, why not devote as many resources to monitoring wellbeing and the conditions that affect it as we do to monitoring the economy and the quality of our material goods?
Measuring reach, implementation, and proximal impact. An effective system for managing the wellbeing of populations must have an ongoing system for monitoring the reach, fidelity of implementation, and immediate impact of practices designed to nurture wellbeing. Without this, practices will inevitably drift, their impact will deteriorate, and there will be no way to know if they need adjustment. From this perspective, the implementation of an evidence-based practice needs a system for monitoring implementation quality—exactly as is the case for our manufacturing systems.
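The manufacturing analogy can be made concrete with a statistical process control chart. The sketch below assumes hypothetical weekly fidelity scores (for example, the percentage of protocol elements observed in coded sessions) and applies an individuals (XmR) chart: limits are set from a baseline period, and a signal is raised when a score falls outside the limits or when eight consecutive scores fall below the center line, one common run rule for detecting drift. The scores, the run length, and the function names are assumptions for illustration.

```python
from statistics import mean


def xmr_limits(baseline):
    """Lower limit, center line, and upper limit for an individuals (XmR) chart."""
    center = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    # 2.66 = 3 / d2, where d2 = 1.128 for moving ranges of two consecutive points
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread


def drift_signals(scores, limits, run_length=8):
    """Indices of scores outside the limits, or ending a run of `run_length`
    consecutive scores below the center line (downward drift)."""
    lower, center, upper = limits
    signals, below = [], 0
    for i, score in enumerate(scores):
        below = below + 1 if score < center else 0
        if score < lower or score > upper or below >= run_length:
            signals.append(i)
    return signals


# Hypothetical fidelity scores (% of protocol elements delivered) per week
baseline = [86, 88, 85, 90, 87, 89, 86, 88]
recent = [87, 86, 84, 85, 83, 84, 82, 83, 81, 78]
limits = xmr_limits(baseline)
print(drift_signals(recent, limits))  # → [7, 8, 9]
```

Here the run rule flags the slow decline (weeks 7 and 8) before any single score crosses the lower limit (week 9), which is the kind of early warning that would prompt coaching before drift becomes large.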
Perhaps the most impressive example of this approach is provided by the work of Forgatch and her colleagues. [48-51] Beginning with their work in Norway in 1999, they developed an implementation system that provides full program transfer from purveyor to adopting communities using the Oregon Model of Parent Management Training (PMTO). The transfer method involves extensive training for the first generation of community practitioners, with subsequent support as they train future generations of practitioners. Once a certified group is established, the purveyors provide coaching and support through a system-wide infrastructure that incorporates training, coaching, practitioner certification, and regular fidelity checks to prevent drift from the original model. Fidelity is maintained by continued monitoring, coaching, and recertification based on direct observation of intervention sessions. Norway now has more than 900 practitioners. The Norwegians conducted an RCT to test whether the intervention would reproduce the effects reported in the Oregon studies, and the multiple-method findings were replicated. Iceland now has 290 practitioners, the Netherlands 84, and Michigan 181.
Ultimately, achieving population-wide effects requires robust systems with high reach, effectiveness, adoption, implementation, and maintenance (RE-AIM). For example, the Triple P system of parenting supports [56-58] has multiple levels of support, each of which has been shown to affect parenting: a media component including a TV show [59-61], internet delivery, self-help materials, brief supports in doctors’ offices and other community settings, and workplace seminars for parents. Two population-level studies have now shown that this systems approach produces prevention effects in whole counties or large communities. [65, 66]
Systematic methods to monitor community change processes have also been developed as part of complex community-wide intervention research. [67-72] Standardized web-based systems have been designed to track complex community participation and action [67-70], as well as systems to monitor school- and family-based implementation fidelity. [70, 73, 74] Such systems could be developed into a standardized, yet highly flexible, web-based system that becomes a routine function within community-based organizations. Yet, to create such permanent systems, changes will be needed in the way we conceive of and fund interventions and research.
By the standards of traditional human services dissemination practices, systems approaches such as these may seem excessive. However, there is ample evidence that interventions found effective under efficacy conditions do not necessarily produce the same effects in effectiveness trials and are not automatically implemented correctly in “real-world” conditions. They may even have iatrogenic effects. Thus, the application of prevention technologies requires the kind of care and precision that is commonplace in the dissemination of manufacturing processes. Such practices should be seen not as a matter of research but as the standard system for disseminating practices and maintaining their effectiveness. There is increasing evidence that systems for monitoring the reliability and robustness of prevention or intervention strategies contribute to better outcomes. [26, 75, 76] As the field becomes more adept at monitoring and promoting intervention fidelity, it is reasonable to move toward more streamlined and efficient ways of achieving quality implementation.
A System for Experimental Evaluation
As knowledge of specific evidence-based interventions accumulates, and measures of wellbeing and environmental influences on wellbeing become more widely available, we can evolve a system that steadily increases the prevalence of wellbeing in defined populations. However, a third component of this system will be needed: experimental evaluation of the component practices and programs and of the entire multicomponent intervention package. In a separate document, we have detailed the methodological considerations regarding the validity of various methods of experimental evaluation. Here we summarize them briefly, highlighting the ways in which they can improve the effectiveness of interventions over time.
Experimental evaluations of components. It would be naïve to think we can implement comprehensive interventions in multiple neighborhoods or communities and expect that each component will reach all intended targets, will be delivered with fidelity, and will have its intended impact. The measurement system that we described above will provide information about the level and trend of each of these targets; however, by itself, a measurement system will not deliver sufficient information to tell if a specific practice is making a difference or if modifying that practice would achieve the intended improvement in impact. Experimental methods are necessary in order to accumulate increasingly effective components.
The Society for Prevention Research Standards of Evidence Committee describes three types of experimental designs that can be used for this purpose. [78, 79] The randomized controlled trial (RCT) is the best known and most widely used method of experimentation. It provides rigorous information about the degree to which an intervention has an effect across multiple cases. It would be ideal if this method could be used whenever an evidence-based intervention is implemented in a new place, as was done with PMTO in Norway. However, significant obstacles prevent this in most instances. There may be too few cases, and services often must be provided to everyone. Additionally, RCTs are used to evaluate effectiveness and thus are best suited to end-stage evaluations of well-developed programs. Although powerful (and perhaps uniquely appropriate) for this purpose, RCTs offer little support for, and pose significant challenges to, the developmental design, testing, and refinement of interventions, and they are sometimes poorly matched to the logistical and political demands of intervention research in high-need communities.
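When only a handful of communities can be randomized, one way to analyze the trial without relying on large-sample approximations is a randomization (permutation) test, which compares the observed difference with the differences produced by every other possible assignment. The sketch below is a minimal illustration under stated assumptions: one community-level outcome per community, equal weighting of communities, no covariate adjustment, and hypothetical prevalence values.

```python
from itertools import combinations
from statistics import mean


def permutation_test(treated, control):
    """Exact two-sided randomization test for a difference in community means.

    Enumerates every way the pooled outcomes could have been split into groups
    of the observed sizes and returns (observed difference, p-value)."""
    pooled = list(treated) + list(control)
    observed = mean(treated) - mean(control)
    extreme = total = 0
    for chosen in combinations(range(len(pooled)), len(treated)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        diff = mean(group_a) - mean(group_b)
        total += 1
        # Small tolerance so ties with the observed value survive rounding
        if abs(diff) >= abs(observed) - 1e-9:
            extreme += 1
    return observed, extreme / total


# Hypothetical post-intervention prevalence (%) in 8 randomized communities
treated = [12.1, 10.4, 11.0, 9.8]
control = [14.2, 13.5, 15.1, 12.9]
diff, p = permutation_test(treated, control)
print(f"difference = {diff:.2f} points, p = {p:.3f}")
```

With four communities per arm there are only 70 possible assignments, so the smallest attainable two-sided p-value is 2/70; this makes concrete why an insufficient number of cases is an obstacle for community-level RCTs.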
A second design is the regression discontinuity design. In this design, cases are assigned to the intervention on the basis of a cutoff score on a predictor (assignment) variable, and the impact of the intervention is gauged by whether those receiving it have lower scores on an outcome variable than would be predicted from the known regression between the outcome variable and the assignment variable. Unfortunately, these designs require more cases than RCTs do and are also not helpful in refining interventions.
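The regression discontinuity logic can be sketched in a few lines. The example assumes a sharp design in which everyone at or above a hypothetical screening cutoff receives the intervention, fits a separate straight line on each side of the cutoff, and takes the gap between the two fitted lines at the cutoff as the estimated effect; a negative value means that intervention recipients scored lower on a problem outcome than the comparison regression predicts. Real analyses would also examine functional form and bandwidth, and the data here are constructed for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def rd_effect(scores, outcomes, cutoff):
    """Sharp regression-discontinuity estimate: the jump in fitted outcome at
    the cutoff, with scores centered so that the cutoff is at zero."""
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes) if s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes) if s >= cutoff]
    intercept_below, _ = fit_line(*zip(*below))
    intercept_above, _ = fit_line(*zip(*above))
    return intercept_above - intercept_below


# Hypothetical data: risk-screening scores 0-19; scores >= 10 receive the
# intervention, which lowers the problem-behavior outcome by 3 points.
scores = list(range(20))
outcomes = [5 + 0.5 * s - (3 if s >= 10 else 0) for s in scores]
print(round(rd_effect(scores, outcomes, cutoff=10), 2))  # → -3.0
```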
While these designs are ideal for assessing whether an intervention is having its intended effect on proximal and distal outcomes, they are less useful for ongoing monitoring of the effects of interventions or for the fine-grained refinement of component interventions. For these purposes, repeated-measures or time-series designs [A] appear more useful. In these designs, repeated measures of the variable of interest (i.e., implementation or proximal and/or distal outcomes) are obtained, and the effect of introducing an independent variable (an intervention component) on the level or slope of the measure is observed.
A: We use the terms repeated-measures designs and time-series designs interchangeably. In our review of the literature, we note variation in the use of the term “interrupted time-series designs.” Some use the term to refer to any design in which repeated measures are obtained and the effect of the independent variable on the time series is observed [81, 82], while others reserve the term for instances in which sufficient data are collected to employ sophisticated analytic techniques such as ARIMA modeling. We suggest that the former usage is helpful in pointing to the fundamental logic of such designs, namely the analysis of the impact of an independent variable on the level or slope of the time series. Nonetheless, having sufficient time points to analyze such effects statistically has distinct benefits.
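The level-and-slope logic of repeated-measures designs can be expressed as a segmented regression. The sketch assumes a single series with a component introduced at a known week; fitting separate straight lines to the baseline and intervention phases is equivalent to a fully interacted segmented regression, and the estimated effects are the jump between the two fitted lines at the introduction point and the difference in their slopes. It ignores autocorrelation and gives point estimates only; the series (weekly counts of families reached) and the function names are hypothetical.

```python
from statistics import mean


def fit_line(ts, ys):
    """Ordinary least-squares intercept and slope for y = a + b * t."""
    mt, my = mean(ts), mean(ys)
    sxx = sum((t - mt) ** 2 for t in ts)
    sxy = sum((t - mt) * (y - my) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return my - slope * mt, slope


def level_and_slope_change(series, start):
    """Fit one line to the baseline phase (indices < start) and one to the
    intervention phase; return (change in level at `start`, change in slope)."""
    ts = list(range(len(series)))
    pre_a, pre_b = fit_line(ts[:start], series[:start])
    post_a, post_b = fit_line(ts[start:], series[start:])
    level_change = (post_a + post_b * start) - (pre_a + pre_b * start)
    return level_change, post_b - pre_b


# Hypothetical weekly counts of families reached; a new outreach strategy
# is introduced at index 6.
reached = [20, 21, 20, 22, 21, 22, 30, 32, 34, 36, 38, 40]
level, slope = level_and_slope_change(reached, start=6)
print(f"level change: {level:.1f}, slope change: {slope:.2f} per week")
```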
As Flay et al. [77, 78] note, the strongest such designs involve large numbers of time points so that sophisticated statistical analyses, such as ARIMA (autoregressive integrated moving average) models, can be used to estimate reliable changes in intercept or slope following introduction of the intervention. However, useful information can be obtained even when far fewer data points are available. Indeed, a sizable behavior-analytic literature exists on the impact of antecedent and consequent events on behavior, in which brief introductions and withdrawals of independent variables have pinpointed effective interventions. An additional strategy to strengthen causal inference is to incorporate comparisons whenever feasible: if not with comparison cases, then with alternative outcome measures that are not hypothesized to be affected by the intervention.
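For readers who want a model behind this logic, one common formulation of a segmented (interrupted time-series) regression is written below. It assumes a single intervention point $t_0$ and, as the simplest case of the ARIMA-type error structures mentioned above, first-order autoregressive errors. Here $D_t$ is 1 from $t_0$ onward and 0 before, $\beta_2$ is the change in level at $t_0$, $\beta_3$ is the change in slope, and $\phi$ captures week-to-week autocorrelation. The second line states the comparison strategy: fitting the same model to an outcome $z_t$ that the intervention is not expected to affect should yield level and slope changes near zero. All symbols are introduced here for illustration.

```latex
y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - t_0) D_t + \varepsilon_t,
\qquad \varepsilon_t = \phi\, \varepsilon_{t-1} + u_t,
\qquad D_t = \begin{cases} 0, & t < t_0 \\ 1, & t \ge t_0 \end{cases}

z_t = \gamma_0 + \gamma_1 t + \gamma_2 D_t + \gamma_3 (t - t_0) D_t + \eta_t,
\qquad \text{expected: } \gamma_2 \approx 0,\ \gamma_3 \approx 0
```

Evidence for an effect is strengthened when $\beta_2$ or $\beta_3$ departs clearly from zero while $\gamma_2$ and $\gamma_3$ do not.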
Systematic use of the logic of repeated-measures or interrupted time-series designs can be useful in every facet of the implementation of multicomponent interventions. For example, a vital issue is the reach of each intervention component. Weekly data on how many people have been reached can be obtained, and various strategies can be introduced to see how each affects reach. Similarly, repeated-measures designs can contribute to developing interventions with greater impact. For example, Biglan and colleagues [84, 85] attempted to reduce illegal sales of tobacco to young people in the context of an RCT of a comprehensive tobacco prevention intervention in 16 communities. In the first community, repeated measures of the number of stores willing to sell to young people showed that initial efforts to educate merchants about not selling to youth had no effect. The researchers therefore introduced an intervention in which young people rewarded clerks who refused to sell tobacco to them. This produced a dramatic decline in illegal sales in that community, which was replicated when the intervention was introduced six weeks later in a second community. Biglan et al. [84, 85] repeated the design in three additional pairs of communities, introducing the intervention in one community while delaying it in the second until there was evidence of an effect in the first. The results of these interventions prompted Embry to replicate the effects of the intervention across two states. He used the same staggered intervention design, analyzing effects with repeated measures of the percentage of stores willing to sell in each state.
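The staggered introduction described above is a multiple-baseline design across communities. The sketch below expresses its logic with a deliberately simple threshold rule rather than a formal statistical test: an effect is credited to a community only if the percentage of stores willing to sell falls by at least a chosen amount after that community's own start week, while the community showed no comparable fall, during its own baseline, when an earlier community began the intervention. The threshold (`min_drop`), the function names, and the weekly data are hypothetical.

```python
from statistics import mean


def decline(series, split):
    """Mean before index `split` minus mean from `split` onward (positive = decline)."""
    return mean(series[:split]) - mean(series[split:])


def multiple_baseline(data, starts, min_drop):
    """data maps community -> weekly % of stores willing to sell to young people;
    starts maps community -> first intervention week (index, assumed >= 1)."""
    report = {}
    for name, series in data.items():
        start = starts[name]
        effect = decline(series, start) >= min_drop
        # While this community is still in baseline, check that its sales did not
        # fall when each earlier-starting community began the intervention.
        earlier = [starts[other] for other in data if starts[other] < start]
        stable = all(decline(series[:start], e) < min_drop for e in earlier)
        report[name] = {"effect_after_own_start": effect, "baseline_stable": stable}
    return report


# Hypothetical data: the intervention starts in week 4 in C1 and six weeks later in C2.
data = {
    "C1": [60, 55, 65, 60, 30, 25, 20, 25, 20, 15, 20, 15, 20, 15],
    "C2": [55, 60, 60, 55, 60, 65, 55, 60, 60, 55, 25, 20, 15, 20],
}
print(multiple_baseline(data, {"C1": 4, "C2": 10}, min_drop=15))
```

The stability check is what gives the design its causal force: if sales had also fallen in C2 when C1 started, the drop could reflect some shared influence rather than the intervention.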
The logic of repeated-measures designs is also fundamental to the day-to-day management of any complex intervention. These designs encourage people to keep track of the progress of interventions and to modify their practices in light of trends in the process and outcome data. Over time, more effective practices are selected in light of the data. Admittedly, applying this logic when there are few data points could lead to spurious conclusions. For example, a change in recruitment procedure could be associated with an improvement in the number of people recruited that subsequent experience shows cannot be replicated. However, continued monitoring of the outcome, and any failure to replicate the effect, would reveal such errors and guide subsequent decisions.
Experimental evaluation of comprehensive interventions. In evaluating the long-term impact of complex, multicomponent interventions on multiple outcomes, RCTs are more feasible than many people may realize. There are now numerous examples in the literature of RCTs of community-wide interventions [65, 70, 71, 87-90], although many focused on a limited range of problems and all involved a narrower range of intervention components than the federal initiatives described above will require. It is likely that there will be more qualified applicants for funding than can be funded; randomly choosing which of them to fund could therefore enable an important randomized trial.
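Random selection among qualified applicants is simple to carry out transparently. The sketch assumes a fixed list of applicants already judged qualified, a fixed number of awards, and that unfunded applicants can serve as a measured comparison group (an assumption, since the text does not say how comparison data would be collected); recording the seed lets reviewers reproduce the draw. Applicant names, the number of awards, and the seed are hypothetical.

```python
import random


def funding_lottery(qualified, n_funded, seed):
    """Randomly select n_funded applicants; the rest form the comparison group.
    Sorting first makes the draw reproducible from the seed alone."""
    pool = sorted(qualified)
    rng = random.Random(seed)
    funded = sorted(rng.sample(pool, n_funded))
    comparison = [name for name in pool if name not in funded]
    return funded, comparison


applicants = [f"Neighborhood {i:02d}" for i in range(1, 13)]
funded, comparison = funding_lottery(applicants, n_funded=5, seed=20100901)
print("Funded:", funded)
print("Comparison:", comparison)
```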
Randomized trials are not the only way to evaluate these interventions and are probably not the most useful method for comprehensive packages of interventions that have never before been implemented or evaluated. The example of COMMIT [91-93] is instructive in this regard. It was a randomized trial of a comprehensive smoking reduction intervention in 22 communities. Despite the involvement of leading tobacco control experts, the intervention produced no impact on smoking. The study cost approximately $100 million. Might it not have been better to implement the intervention in one or two communities, using the adaptive design procedures we have been describing, and to continually monitor smoking prevalence and all of the proximal influences on smoking, such as the proportion of physicians giving advice and the number of smokers reached with such advice? In this way, weaknesses in intervention components could have been detected and the interventions refined until they achieved large effects. For example, the trial struggled to get physicians to give advice. Repeated measures of advice giving and experimental evaluation of efforts to increase it could have resulted in an effective component. Only when a decrement in smoking prevalence was detected in the first one (or two) communities would the intervention have been implemented in a subsequent community. Across communities, the effectiveness of the intervention would have steadily improved, and the cost of arriving at a reliably effective intervention would have been considerably lower.
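The sequential strategy proposed here needs an explicit rule for when the intervention is ready to move to the next community. The sketch below encodes one hypothetical rule: expand only after smoking prevalence has stayed at least a chosen number of percentage points below the community's baseline mean for several consecutive measurement waves. The threshold, the number of waves, the function name, and the data are illustrative assumptions; a working rule would also need to account for the sampling error of each prevalence estimate.

```python
from statistics import mean


def ready_to_expand(prevalence, baseline_waves, min_decrement, sustained_waves):
    """True once prevalence has been at least `min_decrement` points below the
    baseline mean for `sustained_waves` consecutive post-baseline waves."""
    baseline = mean(prevalence[:baseline_waves])
    run = 0
    for value in prevalence[baseline_waves:]:
        run = run + 1 if baseline - value >= min_decrement else 0
        if run >= sustained_waves:
            return True
    return False


# Hypothetical smoking prevalence (%) by survey wave in the first community
community_a = [24.0, 23.8, 24.2, 23.5, 22.9, 21.8, 21.5, 21.2]
if ready_to_expand(community_a, baseline_waves=3, min_decrement=2.0, sustained_waves=3):
    print("Decrement sustained: begin implementation in the next community")
else:
    print("Keep refining components in the current community")
```

Requiring a sustained run, rather than a single low reading, guards against the spurious one-off improvements discussed earlier for designs with few data points.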
An Infrastructure of Scientist-Practitioners
In our view, the system of methods we describe should be promoted in every community so that it is in the hands of those who are charged with ensuring the success of component interventions. Comprehensive interventions will succeed or fail in the day-to-day work in neighborhoods and communities. Creating an infrastructure of measures and experimental methods to guide the selection of progressively more effective practices will establish the procedures and the network of knowledgeable and experienced personnel who can extend successful efforts to a growing number of neighborhoods and communities. Thus, the roles of scientist and practitioner will be increasingly merged. 
Implications for NIH and the Federal Research Funding Enterprise
Evolving the system that we describe will require significant changes in the funding practices of the National Institutes of Health, the Centers for Disease Control and Prevention, the Substance Abuse and Mental Health Services Administration, and the Departments of Education and Justice. None of these agencies has a clear mandate to develop the measurement system that is needed. The CDC supports state-level assessment systems (the Behavioral Risk Factor Surveillance System and the Youth Risk Behavior Survey), but neither is designed to be used at the neighborhood or community level, and both would need considerable modification to provide all the information necessary to guide interventions.
The prevention research funded by NIH supports the testing of increasingly complex interventions [88, 90], but we are not aware of any instances in which NIH has funded the development of a permanent assessment system or of an infrastructure for ongoing experimental evaluations. Indeed, we worry that it will be difficult to get interrupted time-series designs accepted by NIH scientific review panels. Prompted in part by the IOM report on prevention, federal agencies are emphasizing collaboration to develop and implement interventions that target multiple, inter-related problems and the contextual conditions that contribute to these problems. But even if inter-agency collaboration puts a set of evidence-based preventive interventions in the field, progress will be limited if we do not adequately fund the research that (a) evaluates the ultimate impact of these interventions, (b) provides an infrastructure of measurement and adaptive experimentation that fosters the evolution of increasingly effective interventions, and (c) provides the infrastructure of personnel to disseminate effective interventions with fidelity to a growing number of neighborhoods and communities.