“Outcome harvesting”… a forensics-informed evaluation approach

Outcome Harvesting is an evaluation approach developed by Ricardo Wilson-Grau. Working much like a forensic investigator, the evaluator or ‘harvester’ retrospectively gleans information from reports, personal interviews, and other sources to document how a given program, project, organization or initiative has contributed to outcomes. Unlike the many evaluation approaches that begin with stated outcomes or objectives, this approach looks for evidence of outcomes, and explanations for those outcomes, in what has already happened… a process the creators call ‘sleuthing.’

This approach blends, and perhaps eliminates, the distinction between intended and unintended outcomes. Evaluators are enjoined to look beyond what programs say they will do to what they actually do, but in an objectives-driven world this requires evaluators to convince clients that such a broader view is important or necessary, and to justify spending evaluation resources on a more expansive concept of outcomes than is usually defined.

Wilson-Grau has written a clear explanation of the process, which can be downloaded here. In the downloadable PDF, the six steps of outcome harvesting are summarized below; a brief illustrative sketch follows the list:

1. Design the Outcome Harvest: Harvest users and harvesters identify useable questions to guide the harvest. Both users and harvesters agree on what information is to be collected and included in the outcome description as well as on the changes in the social actors and how the change agent influenced them.
2. Gather data and draft outcome descriptions: Harvesters glean information about changes that have occurred in social actors and how the change agent contributed to these changes. Information about outcomes may be found in documents or collected through interviews, surveys, and other sources. The harvesters write preliminary outcome descriptions with questions for review and clarification by the change agent.
3. Engage change agents in formulating outcome descriptions: Harvesters engage directly with change agents to review the draft outcome descriptions, identify and formulate additional outcomes, and classify all outcomes. Change agents often consult with well-informed individuals (inside or outside their organization) who can provide information about outcomes.
4. Substantiate: Harvesters obtain the views of independent individuals knowledgeable about the outcome(s) and how they were achieved; this validates and enhances the credibility of the findings.
5. Analyze and interpret: Harvesters organize outcome descriptions through a database in order to make sense of them, analyze and interpret the data, and provide evidence-based answers to the useable harvesting questions.
6. Support use of findings: Drawing on the evidence-based, actionable answers to the useable questions, harvesters propose points for discussion to harvest users, including how the users might make use of findings. The harvesters also wrap up their contribution by accompanying or facilitating the discussion amongst harvest users.
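
To make steps 2 and 5 more concrete, here is a minimal sketch, not part of Wilson-Grau’s method itself, of how harvested outcome descriptions might be structured and then organized for analysis. The field names and the single example record are invented for illustration (Python 3.9+ assumed):

```python
# A minimal, illustrative sketch of an outcome-description record (step 2) and of
# organizing the collected records to answer harvest questions (step 5).
# All field names and the example below are invented; they are not Wilson-Grau's.
from dataclasses import dataclass, field

@dataclass
class OutcomeDescription:
    social_actor: str            # who changed
    change: str                  # the observable change in behaviour, policy, or practice
    contribution: str            # how the change agent plausibly influenced the change
    source: str                  # where the evidence was gleaned (report, interview, survey)
    substantiated: bool = False  # flipped to True after step 4 (independent verification)
    tags: list[str] = field(default_factory=list)  # classification agreed in step 3

harvest = [
    OutcomeDescription(
        social_actor="Municipal health office",
        change="Adopted a community consultation policy",
        contribution="Program staff drafted the policy and lobbied councillors",
        source="Annual report; interview with program director",
        tags=["policy change"],
    ),
]

# Step 5: organize and query the collection to provide evidence-based answers
policy_outcomes = [o for o in harvest if "policy change" in o.tags]
print(f"{len(policy_outcomes)} policy-change outcome(s) harvested")
```

Even a flat spreadsheet with the same columns would serve; the point is simply that each outcome is recorded with its actor, change, and contribution so the harvest questions can be answered from evidence rather than from intentions.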

Other evaluation approaches (like the Most Significant Change technique or the Success Case Method) also look retrospectively at what happened and seek to analyze who changed, and why and how the change occurred, but Outcome Harvesting is nonetheless a good addition to the evaluation literature. An example of outcome harvesting is described on the BetterEvaluation Blog, and a short video introduces the example: https://www.youtube.com/watch?v=lNhIzzpGakE

Reflections of a Journal Editor

When my term as Editor-in-Chief of New Directions for Evaluation (NDE) ended, I was asked to write a short piece for the AEA newsletter, as I had done each year while I was EIC. I submitted a short reflection on knowledge and publishing rather than a summary of what was, and what would be, in NDE. Gwen Newman of AEA told me that the short piece would be published in the AEA newsletter, but three months have passed and it hasn’t appeared. I have no insight into why.

Below is the short reflective commentary I wrote.

In December 2012 my term as Editor-in-Chief of New Directions for Evaluation ended, and Paul Brandon’s term began. AEA has made a fine choice in appointing Paul, and I wish him good luck in his new role.

Closing the book on six years working on NDE leads me to reflect on being an editor and on the role of scholarly journals. I have enjoyed being the editor of NDE; I hope I have made a positive contribution to AEA; and I have tried to respect the diversity of viewpoints and varying degrees of cultural competence in the journal publishing game. I have enjoyed working with the newer generation of evaluators and with those whose voices might not otherwise have been heard, but I regret that this did not make up more of my time as NDE editor. I also have mixed feelings, even if, on balance, the good outweighs the bad.

Journal editors are gatekeepers, mediators, maybe even definers of the field, who are expected to oversee and ensure the fairness of an adjudication process that results in the stamp of approval and dissemination of knowledge that is most worthy and relevant to the field. But in fulfilling this role, journal editors participate in a larger ‘game’ of knowledge production. Of course, others participate in the game as well, including authors, the reward systems in higher education, professional associations, publishing companies, and indeed journal readers. Pierre Bourdieu’s notion of “illusio” captures the ‘game’ of publishing in scholarly journals, a game where everyone must play, and even be taken in by the game, in order for the game to continue.

And so I have played a key role in this game, a game that is mostly seen as necessary, benign, civil and collegial. I am, however, a bit disquieted by my complicity in a game in which knowledge about evaluation theory and practice is commodified, packaged and embargoed, a game that sometimes defines too narrowly what ought to be published, in what form, and by whom, and that limits access to knowledge. The illusio of the game leads us to believe that without stalwart gatekeepers and limited (often corporately owned) venues for sharing knowledge there will be excessive scholarly writing of dubious quality. There is little evidence to support this fear, and there is a growing number of highly regarded open access journals, blogs, and websites that do not forsake quality and that suggest the possibility of a new game.

In a vision of the future where knowledge is a public good, freely shared, I imagine journal editors might play a different role in the game, one that focuses less on gatekeeping and more on opening the gate to welcome the sharing of evaluation knowledge for free, with unfettered access, and without the need for authors to give away copyright to their works. While it may be the case that knowledge in some disciplines has a small, select audience, evaluation knowledge crosses all domains of human experience with an attendant desire to foster improvement. The audience for our work is vast, and I wish for thoughtful, inclusive sharing of evaluation knowledge.

If your job involves doing evaluation (and whose doesn’t), you might be sued

For many professionals, doing evaluation is part of the job. Lawyers make evaluative judgements about the quality of evidence; teachers judge the quality of students’ learning; builders judge the quality of materials. All work entails judgements of quality, and the quality of work depends on doing good evaluations.

But what happens when the evaluation done as part of professional work is contested? You might just find yourself being sued. Such is the case with Dale Askey, a librarian at McMaster University. Askey’s job requires him to make judgements about the quality of published works and publishers in order to make wise procurement decisions for his employer, decisions that have become ever more difficult with shrinking resources. The case can be easily summarized:

Librarian questions quality of a publishing house.

Librarian publicly criticizes said press on his personal blog.

Two years later, librarian and current employer get sued for libel and damages in excess of $4 million.

Read more: http://www.insidehighered.com/news/2013/02/08/academic-press-sues-librarian-raising-issues-academic-freedom (Inside Higher Ed)

There is no reason to believe that Askey rendered his judgement about the quality of scholarship offered by Mellen Press in a capricious or incompetent manner. Making judgements for procurement decisions is surely one of the tasks that Askey’s employer expects him to do, especially in a time of diminishing resources.

There has been considerable support for Askey, some of it a bit misguided in defending his right to express his opinion on his blog, but most of it in defense of Askey’s responsibility to do his job.

There is every reason to expect that the Mellen Press lawsuit will be dismissed, as was the similar lawsuit brought by Mellen Press against Lingua Franca.

So what is the relevance for evaluation? It is clear that evaluation is integral to, and applied in, virtually all intellectual and practical domains… it is, as Michael Scriven claims, a trans-discipline. As such, there is a need to pay more attention to preparing people to do publicly defensible evaluations in the context of their work. Perhaps more than program evaluation, this sort of evaluative thinking might be the raison d’être for the discipline of evaluation.

Holding accountability to account

One of the hallmarks of any quality evaluation is that it ought to be subject to evaluation itself. Many evaluation schemes in education, such as test-driven accountability schemes, are not evaluated. The Action Canada Task Force on Standardized Testing has released a report analyzing the place of standardized testing as an accountability measure in Canadian K-12 education systems, using Ontario as a case study. “A review of standardized testing in this province and others is not only timely – it’s urgently needed,” says Sébastien Després, a 2012-2013 Action Canada Fellow and co-author of the report.

The Task Force offers recommendations in four areas that could form the heart of an evaluation of accountability schemes in K-12 education across Canada.

Recommendations
We recommend that the Ontario government establish a suitable panel with a balanced and diverse set of experts to conduct a follow-up review of its standardized testing program. In particular:

A. Structure of the tests relative to objectives
i. The panel should review whether the scope of the current testing system continues to facilitate achievement of education system objectives.
ii. The panel should review whether the scale and frequency of testing remains consistent with the Ministry of Education’s objectives for EQAO testing.

B. Impact of testing within the classroom
i. The panel should review the impact on learning that results from classroom time devoted to test preparation and administration.
ii. The panel should review the impact of testing methods and instruments on broader skills and knowledge acquisition.
iii. The panel should review the appropriateness and impact of the pressure exerted by standardized testing on teachers and students.

C. Validity of test results
i. The panel should review whether or not standardized testing provides an assurance that students are performing according to the standards set for them.
ii. The panel should review the impact of measuring progress by taking a limited number of samples throughout a student’s career.

D. Public reporting and use of test results
i. The panel should review the impact of the potential misinterpretation and misuse of testing results data, and methods for ensuring they are used as intended.
ii. The panel should review supplemental or alternative methods of achieving public accountability of the educational system.

Kelly Conference @ UOttawa, April 12, 2013

What is the Kelly Conference?

The Edward F. Kelly Evaluation Conference is a graduate-student-organized regional evaluation conference whose goal is to provide graduate students in the field of evaluation with an opportunity to present original research and to network with professionals in the field. This year the conference consists of two main components: a professional development workshop and presentations of student research.

The Edward F. Kelly Conference originated at the University at Albany in 1987 in commemoration of beloved former faculty member Ed Kelly. Dr. Kelly founded the Evaluation Consortium on campus in collaboration with the School of Education in order to foster an authentic evaluation setting in which graduate students could work and learn alongside seasoned faculty members. The conference continues his commitment to providing authentic learning and research experiences to graduate students today.

Evaluators cannot be useful if their only skill is data generation

The New York Times columnist David Brooks nicely captures the problem that Carol Weiss identified several decades ago… data doesn’t speak authoritatively, nor should it. In evaluation and in decision making we take into account loads of data, available to us in greater amounts and in more sophisticated forms than ever, but on its own it is still never enough.

Brooks highlights the limitations of what he calls “big data.”

Data struggles with the social. Your brain is pretty bad at math (quick, what’s the square root of 437), but it’s excellent at social cognition. People are really good at mirroring each other’s emotional states, at detecting uncooperative behavior and at assigning value to things through emotion.

Computer-driven data analysis, on the other hand, excels at measuring the quantity of social interactions but not the quality. Network scientists can map your interactions with the six co-workers you see during 76 percent of your days, but they can’t capture your devotion to the childhood friends you see twice a year, let alone Dante’s love for Beatrice, whom he met twice.

Therefore, when making decisions about social relationships, it’s foolish to swap the amazing machine in your skull for the crude machine on your desk.

Data struggles with context. Human decisions are not discrete events. They are embedded in sequences and contexts. The human brain has evolved to account for this reality. People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking, and it cannot match the explanatory suppleness of even a mediocre novel.

Data creates bigger haystacks. This is a point Nassim Taleb, the author of “Antifragile,” has made. As we acquire more data, we have the ability to find many, many more statistically significant correlations. Most of these correlations are spurious and deceive us when we’re trying to understand a situation. Falsity grows exponentially the more data we collect. The haystack gets bigger, but the needle we are looking for is still buried deep inside.

One of the features of the era of big data is the number of “significant” findings that don’t replicate: the expansion, as Nate Silver would say, of noise relative to signal.

Big data has trouble with big problems. If you are trying to figure out which e-mail produces the most campaign contributions, you can do a randomized control experiment. But let’s say you are trying to stimulate an economy in a recession. You don’t have an alternate society to use as a control group. For example, we’ve had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides.

Data favors memes over masterpieces. Data analysis can detect when large numbers of people take an instant liking to some cultural product. But many important (and profitable) products are hated initially because they are unfamiliar.

Data obscures values. I recently saw an academic book with the excellent title, “ ‘Raw Data’ Is an Oxymoron.” One of the points was that data is never raw; it’s always structured according to somebody’s predispositions and values. The end result looks disinterested, but, in reality, there are value choices all the way through, from construction to interpretation.
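
Brooks’s “bigger haystacks” point is easy to demonstrate. The sketch below is my own illustration, not anything from the column: it generates pure noise and counts how many pairwise correlations nonetheless come out “statistically significant” (Python, assuming numpy and scipy are installed):

```python
# Spurious "significant" correlations in pure noise: the bigger haystack.
# The sample size and variable count are arbitrary choices for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_obs, n_vars = 100, 50                   # 100 observations of 50 unrelated variables
data = rng.normal(size=(n_obs, n_vars))   # pure noise: no real relationships exist

pairs = significant = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        pairs += 1
        if p < 0.05:
            significant += 1

print(f"{significant} of {pairs} pairs look 'significant' at p < 0.05")
```

With 50 unrelated variables there are 1,225 pairs to test, so around 60 of them will clear the p < 0.05 bar by chance alone; collect more variables and the haystack of spurious “findings” grows, while the needle stays just as hard to find.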

Dogs and evaluation

One of the things I look forward to in February is the Westminster Dog Show, a mega-spectacle of dog owners, trainers, handlers and, of course, dogs. I like dogs; I have two of them, and they are the very best dogs I know.

It is instructive to look beyond the crazy-haired, hairless, majestic, and cuter-than-ever faces to how, after many rounds of judging, one dog is named Best in Show. Investigating evaluation systems distinct from our own contexts of application provides a moment for reflection and perhaps learning.

Here’s what the WKC says about judging:

Each breed’s parent club creates a STANDARD, a written description of the ideal specimen of that breed. Generally relating form to function, i.e., the original function that the dog was bred to perform, most standards describe general appearance, movement, temperament, and specific physical traits such as height and weight, coat, colors, eye color and shape, ear shape and placement, feet, tail, and more. Some standards can be very specific, some can be rather general and leave much room for individual interpretation by judges. This results in the sport’s subjective basis: one judge, applying his or her interpretation of the standard, giving his or her opinion of the best dog on that particular day. Standards are written, maintained and owned by the parent clubs of each breed.

So come February there are successive rounds of dog judging leading to that final moment, naming the best dog. First, there are competitions within breeds to determine the best Lab or poodle (standard, miniature and toy) and so on. To make the next round of judging manageable, breeds are then grouped… there are seven groups: sporting, non-sporting, hound, working, terrier, toy, and herding. This grouping is really a matter of convenience; the groups make sense but they are not mutually exclusive. For example, terriers could be working dogs if hunting down vermin were considered work, or sporting dogs if hunting vermin were considered sport, and at least some terriers are small enough to be considered part of the toy group.

The grouping makes sense, but it isn’t a key feature of the evaluation process because in this round of judging dogs are not compared to one another, but to their own breed standard. For example, in the non-sporting group the Bichon Frise is not compared to the Dalmatian even though they are in the same group. The judge makes a judgement about each dog in relation to its breed standard and declares which dog (actually, four dogs are placed at this round) is the best example of its breed. So if the Bichon wins the group, it means that the Bichon is an excellent Bichon and the Dalmatian is a less excellent Dalmatian.

The last round of judging, to find the best dog, looks at the best in each of the groups. Within the groups, and at this culminating stage, what continues to be notable is the variation; the dogs don’t appear to be ones that belong together. The judge again compares each dog to its breed standard, choosing the one that best meets the standard for its own breed.

So dog judging is a criterion-based evaluation system. This example reminds me of the protest often thrown in the way of evaluation: “That’s comparing apples and oranges!”, with the implication that doing so is unfair or even impossible. The WKC show illustrates that this is a false protest: we can judge, given an apple and an orange, which is better, but doing so requires a clear articulation of what makes an apple a good apple and what makes an orange a good orange. I tell my students that we make such evaluative judgements all the time, even when we are deciding whether to buy apples or oranges.

Grading each dog is a separate evaluation, and then the grades, not the dogs, are compared to one another. This grading and comparison isn’t made explicit; it happens in the mind of the judge, who is presumably an experienced evaluator. Conceivably, though, this stage of the evaluation could be made explicit. That it isn’t reflects the confidence the procedure places in the evaluator’s expert knowledge: we don’t expect the grading to be explicit because we don’t need it to be in order to trust the judge/evaluator. It is a criterion-based evaluation system that includes a considerable exercise of expert judgement (criterion-based evaluation meets connoisseurship, perhaps).
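
That logic can be sketched in a few lines of code. The breed standards, criteria, weights, and scores below are all invented for illustration; the point is only that each dog is graded against its own breed’s standard and that the grades, not the dogs, are then compared:

```python
# Criterion-based judging: each entrant is scored against its OWN breed standard,
# and only the resulting grades are compared. Standards, weights, and ratings
# here are invented for illustration.

# Each breed standard: criterion -> weight (weights sum to 1.0 per breed)
standards = {
    "Bichon Frise": {"coat": 0.4, "gait": 0.3, "temperament": 0.3},
    "Dalmatian":    {"spotting": 0.4, "movement": 0.4, "proportion": 0.2},
}

# The judge's ratings of each dog on its own breed's criteria (0-10 scale)
entrants = {
    "Bichon Frise": {"coat": 9, "gait": 8, "temperament": 9},
    "Dalmatian":    {"spotting": 7, "movement": 8, "proportion": 9},
}

def grade(breed: str, ratings: dict[str, float]) -> float:
    """Weighted score of one dog against its own breed's standard."""
    return sum(weight * ratings[criterion]
               for criterion, weight in standards[breed].items())

# The dogs are never compared directly; their grades are.
grades = {breed: round(grade(breed, ratings), 2) for breed, ratings in entrants.items()}
print(grades)                                  # {'Bichon Frise': 8.7, 'Dalmatian': 7.8}
print("Best in group:", max(grades, key=grades.get))
```

Making the grading explicit in this way is precisely the step the WKC system leaves implicit, entrusting it to the connoisseurship of the judge.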

And the winner is…