Without Replication, Should Program Evaluation Findings Be Suspect as Research Findings Currently Are?


Replication of research -- the reproducibility of findings -- is a methodological safeguard and hallmark of research universally lauded by scientists to justify their craft. As we continue to learn with more certainty, it is a theory not much put into practice. Claims about research findings may be more likely to be false than true. Scientific studies are tainted by poor study design, sloppy and often self-serving data analysis, and miscalculation -- problems that replication of the studies and duplication of the results would largely correct. Again, the problem is that it is not done.

The continuing work of John Ioannidis at Stanford University, Brian Nosek at the University of Virginia, and others shows that much research is not and cannot be replicated. Almost a decade ago in these pages (Courts Have No Business Doing Research Studies, Made2Measure, October 15, 2007), I highlighted a 2005 paper by Ioannidis titled “Why Most Published Research Findings Are False” that caused a stir in the scientific community and prompted many scientists and consumers of research to begin questioning whether we can trust the evidence produced by research studies. Today, with more than $80 million in funding for a “research integrity” initiative from the Laura and John Arnold Foundation, science critics and reformers like Ioannidis and Nosek have a solid platform from which to question the culture of science that produces studies that cannot be reproduced.

Can program evaluation be questioned as well? Program evaluations are assessments of changes in the well-being (status or condition) of individuals, households, communities or firms that can be attributed to a project, program or process, along with the systematic determination of their quality, value or merit. Rooted in the tradition of behavioral and social research, does program evaluation -- especially impact evaluation that relies on randomized controlled trials -- exist in a research culture like the one described by Ioannidis, Nosek, and others, one that does not support replication and reproducibility of results? In my experience, program evaluations in the area of justice and the rule of law are one-off affairs funded by donors who are seldom, if ever, prompted to support replication of the results.

For several years, I have called for a bigger space for performance measurement and management (PMM), relative to program evaluation and global indicators, in the toolkit of international development of justice and the rule of law. I argue that justice institutions and justice systems that take responsibility for measuring and managing their own performance in delivering justice using PMM, rather than relying on external assessments by third parties as is typical of program evaluation and global indicators, are likely to have more success and to gain more legitimacy, trust and confidence in the eyes of those they serve.

Replication or reproducibility of results highlights a critical design difference between PMM and program evaluation or evaluation research. Basically, replication means repeating the performance measurement or evaluation research to corroborate the results and to safeguard against overgeneralizations and other false claims. In contrast with program evaluation, repeated measurements -- i.e., replication of results on a regular and continuous basis, ideally in real time or near-real time -- are part of the required methodology of PMM.

I’d like to believe that my suspicion that program evaluation findings lack replicability and reproducibility strengthens my argument for more space in the toolkit of international development for PMM. Of course, my suspicions are just that until “program evaluation integrity” studies, like the research integrity studies funded by the Arnold Foundation, confirm them.

© Copyright CourtMetrics 2017. All rights reserved.
