Category Archives: Evaluation methods

Visual displays of evidence

I have become a fan of Edward Tufte ~ he suggests that newspapers, like the NY Times and the Wall Street Journal, are light years ahead of social science in their ability to communicate large amounts of data in straightforward, comprehensible, and aesthetically pleasing ways. Here is an example of a simple kind of data (the number of households to which mail is being delivered) that, tracked over time, illustrates the resurrection of New Orleans after Katrina. One could have used any uniform service to accomplish the same thing, but a mailing address exists whether or not any mail is actually delivered, and it does not differentiate people by class or race (as telephone or power service might).

There are any number of applications of this particular data display to evaluation, but more generally it illustrates ways to show the adoption or spread of something across geographic space, as sketched below.
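To make the idea concrete, here is a minimal sketch (in Python with matplotlib) of one way to build this kind of display: households receiving mail, plotted over time for a few areas. This is not the newspaper's actual graphic, and every area label and count below is hypothetical, invented only to show the shape of the display.

# A minimal sketch, not the NY Times graphic: plot households receiving mail
# over time for several areas. All area names and counts are hypothetical,
# invented purely to illustrate the kind of display described above.
import matplotlib.pyplot as plt

# hypothetical monthly counts of households receiving mail, by area
months = ["2005-08", "2005-11", "2006-02", "2006-05", "2006-08", "2006-11"]
areas = {
    "Area A (hypothetical)": [52000, 18000, 24000, 31000, 36000, 40000],
    "Area B (hypothetical)": [30000, 5000, 9000, 14000, 19000, 23000],
    "Area C (hypothetical)": [41000, 26000, 33000, 37000, 39000, 40000],
}

fig, ax = plt.subplots(figsize=(8, 4))
for name, counts in areas.items():
    ax.plot(months, counts, marker="o", label=name)

ax.set_xlabel("Month")
ax.set_ylabel("Households receiving mail")
ax.set_title("Recovery over time, shown by mail delivery (illustrative data)")
ax.legend(frameon=False)
plt.tight_layout()
plt.show()

A small-multiples version of the same plot, one panel per neighborhood, would come closer to the kind of geographic display Tufte praises.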

Most Significant Change Technique ~ an example

Most Significant Change (MSC) is a participatory monitoring technique based on stories of important or significant changes; these stories give a rich picture of the impact of an intervention. MSC can be better understood through the metaphor of a newspaper, which picks out the most interesting or significant story from the wide range of events and background details it could draw on. The technique was developed by Rick Davies and is described in a handbook on MSC.

One illustration of how MSC is used is this report, Stories of Significance: Redefining Change, an evaluation of community-based HIV/AIDS interventions for Indian women.

Image based research and evaluation

Images are all around us; we are all image-makers and image readers. Images are a rich source of data for understanding the social world and for representing our knowledge of that social world. Image-based research has a long history in cultural anthropology and sociology as well as the natural sciences, but is nonetheless still relatively uncommon.

This chapter, Seeing is Believing, describes image-based research and evaluation and focuses especially on issues of the credibility of images and on image-based inquiry strategies.

There are a few examples in this chapter from my research on the impact of high-stakes testing. Data collection focusing on kids’ experiences of testing involved drawing and writing. You can see more of these data on my website, as well as view a presentation I gave on this topic for the Claremont Graduate School 2006 summer institute on the credibility of evidence.

[Image: Tracy, Savannah, and Faith]

A bit of history on accreditation

The Flexner Report of 1910 was an evaluation of all medical education programs in Canada and the US. Flexner found medical education wanting, and his report led to significant changes in medical education. But for evaluators, the importance of this report is that it was the genesis of accreditation, a model of evaluation based on expert, professional judgement. The New England Journal of Medicine has published a 100-year retrospective look at medical education and still finds it wanting.

Accreditation has changed over the years, moving away from looking at resources and processes toward a more substantial examination of outcomes. But the approach still holds too fast to simplistic criteria, often looking at things like the ratio of support staff to faculty, the number of hours of clinical work, and the office space allocated. And it is common knowledge that accreditation is as much about feting the site visitors as it is about doing a good job of professional education.

There have been few advances in accreditation as evaluation; perhaps the one exception is TEAC, the Teacher Education Accreditation Council. Unlike other accrediting agencies, TEAC holds programs accountable for what they say they intend to do, based on reasonable evidence identified by the program being reviewed, rather than against an abstract, general notion of appropriate professional education and preconceived sorts of evidence.

Now, if only someone would do a 100 year retrospective analysis of accreditation qua evaluation…

Canadian University presidents refuse to participate in flawed rankings game

Canada has few universities compared to countries like the USA, but that doesn’t stop Maclean’s magazine from sponsoring a horse race among them. Because the number of institutions is small, it is relatively easy to work out which one best suits your needs by looking at the data available from each. Yet Maclean’s expects universities to foot the bill, expending resources and time to answer its questions, which simplify and average away the important differences any consumer would want to know about.

UBC’s new president, along with the other university presidents, is to be commended for taking a stand and refusing to participate in a simplistic evaluation system. The letter, sent to Maclean’s, was signed by eleven university presidents: most of the biggies, but not McGill.

What is the difference between an 89 and a 90? Or rate that wine for profit!

Here is an excerpt from an NYT story that critiques the 100-point rating system for wines. The ratings have become more useful for selling wine than for actually judging it, and thus the difference between an 89 and a 90 is absurd in an evaluative sense but worth lots of dollars as a marketing tool.

A rating system that draws a distinction between a cabernet scoring 90 and one receiving an 89 implies a precision of the senses that even many wine critics agree that human beings do not possess. Ratings are quick judgments that a single individual renders early in the life of a bottle of wine that, once expressed numerically, magically transform the nebulous and subjective into the authoritative and objective.

When pressed, critics allow that numerical ratings mean little if they are unaccompanied by corresponding tasting notes (“hints of blackberry,” “a good nose”). Yet in the hands of the marketers who have transformed wine into a multibillion-dollar industry, The Number is often all that counts. It is one of the wheels that keep the glamorous, lucrative machinery of the wine business turning, but it has become so overused and ubiquitous that it may well be meaningless — other than as an index of how a once mystical, high-end product for the elite has become embroidered with the same marketing high jinks as other products peddled to the masses.

Read the whole story.

The Role of RCTs in Evaluation

The controversy over the best evaluation design continues to rage. During the late 1970s and early 80s, many debates were held at professional evaluation meetings about the value of experimental design in evaluation. Many evaluators thought the issue had been laid to rest: experimental designs were acknowledged as useful and appropriate in some circumstances, but there was a consensus that such an approach constitutes neither a gold standard nor even the most frequently used design.

For many evaluators, especially in education but also in other fields, this prior understanding has been turned on its head by the US government’s endorsement of randomized clinical trials as the sine qua non approach in evaluation (followed by quasi-experimental and regression discontinuity designs). There has been considerable reaction to this methodological/ideological turn, with strong objections from AERA, AEA, and NEA.

AEA’s position, which follows, created much controversy within the organization and even resulted in a “Not AEA Statement” on the use of RCTs. The issue was debated by Michael Scriven and Mark Lipsey at a conference at Claremont Graduate School; the debate is summarized in Determining Causality in Program Evaluation and Applied Research: Should Experimental Evidence Be the Gold Standard?, which also includes the text of the “Not” statement.

Another useful presentation on the issue of RCTs is one given by Michael Q. Patton to NIH in September 2004. This talk, The Debate about Randomized Controls as the Gold Standard in Evaluation, focuses especially on the confusion of design with purpose.

You can read more about the issue in a previous post on the invited APA address I gave in August on educational evaluation as a public good.
___________________________________________________________________
American Evaluation Association Response
To
U. S. Department of Education
Notice of proposed priority, Federal Register RIN 1890-ZA00, November 4, 2003

“Scientifically Based Evaluation Methods”

The American Evaluation Association applauds the effort to promote high quality in the U.S. Secretary of Education’s proposed priority for evaluating educational programs using scientifically based methods. We, too, have worked to encourage competent practice through our Guiding Principles for Evaluators (1994), Standards for Program Evaluation (1994), professional training, and annual conferences. However, we believe the proposed priority manifests fundamental misunderstandings about (1) the types of studies capable of determining causality, (2) the methods capable of achieving scientific rigor, and (3) the types of studies that support policy and program decisions. We would like to help avoid the political, ethical, and financial disaster that could well attend implementation of the proposed priority.

(1) Studies capable of determining causality. Randomized control group trials (RCTs) are not the only studies capable of generating understandings of causality. In medicine, causality has been conclusively shown in some instances without RCTs, for example, in linking smoking to lung cancer and infested rats to bubonic plague. The secretary’s proposal would elevate experimental over quasi-experimental, observational, single-subject, and other designs which are sometimes more feasible and equally valid.

RCTs are not always best for determining causality and can be misleading. RCTs examine a limited number of isolated factors that are neither limited nor isolated in natural settings. The complex nature of causality and the multitude of actual influences on outcomes render RCTs less capable of discovering causality than designs sensitive to local culture and conditions and open to unanticipated causal factors.

RCTs should sometimes be ruled out for reasons of ethics. For example, assigning experimental subjects to educationally inferior or medically unproven treatments, or denying control group subjects access to important instructional opportunities or critical medical intervention, is not ethically acceptable even when RCT results might be enlightening. Such studies would not be approved by Institutional Review Boards overseeing the protection of human subjects in accordance with federal statute.

In some cases, data sources are insufficient for RCTs. Pilot, experimental, and exploratory education, health, and social programs are often small enough in scale to preclude use of RCTs as an evaluation methodology, however important it may be to examine causality prior to wider implementation.

(2) Methods capable of demonstrating scientific rigor. For at least a decade, evaluators publicly debated whether newer inquiry methods were sufficiently rigorous. This issue was settled long ago. Actual practice and many published examples demonstrate that alternative and mixed methods are rigorous and scientific. To discourage a repertoire of methods would force evaluators backward. We strongly disagree that the methodological “benefits of the proposed priority justify the costs.”

(3) Studies capable of supporting appropriate policy and program decisions. We also strongly disagree that “this regulatory action does not unduly interfere with State, local, and tribal governments in the exercise of their governmental functions.” As provision and support of programs are governmental functions so, too, is determining program effectiveness. Sound policy decisions benefit from data illustrating not only causality but also conditionality. Fettering evaluators with unnecessary and unreasonable constraints would deny information needed by policy-makers.

While we agree with the intent of ensuring that federally sponsored programs be “evaluated using scientifically based research . . . to determine the effectiveness of a project intervention,” we do not agree that “evaluation methods using an experimental design are best for determining project effectiveness.” We believe that the constraints in the proposed priority would deny use of other needed, proven, and scientifically credible evaluation methods, resulting in fruitless expenditures on some large contracts while leaving other public programs unevaluated entirely.