Category Archives: Evaluation theory

Does Evaluation Contribute to the Public Good?

In September I was honoured to give the initial keynote address at the 2017 Australasian Evaluation Society meeting in Canberra. I am thankful for the opportunity and for the warm response my keynote received.

I express my pessimism, maybe even cynicism, about the extent to which evaluation has contributed to the public good, by which I mean the well-being of all people, globally, manifested in things such as food security, healthcare, education, clean water, adequate housing. I offered some hopeful suggestions about how evaluation as a practice might do better in its contribution to the public good.

This talk has been translated to French and has been published in La Vigie de l’évaluation and can be accessed here. It will soon be published in English and I will post a link here soon.

I also appreciate the media coverage this talk received in the Mandarin, an independent online newspaper devoted to government policy and practice in Australia. Click here for a link to that story, “Whoever Heard of an Independent Evaluation Keynote Tell It Like It Is?”


evaluation and independence… it matters

In the run up to the Academy Awards I was catching up on nominated movies. This past weekend I saw several including The Big Short.MV5BMjM2MTQ2MzcxOF5BMl5BanBnXkFtZTgwNzE4NTUyNzE@._V1_UX182_CR0,0,182,268_AL_ A Salon review summarizes the all to familiar story of the movie:

In the late 1990s, banks and private mortgage lenders began pushing subprime mortgages, many with “adjustable” rates that jumped sharply after a few years. These risky loans comprised 8.6 percent of all mortgages in 2001, soaring to 20.1 percent by 2006. That year alone, 10 lenders accounted for 56 percent of all subprime loans, totaling $362 billion. As the film explains, these loans were a ticking time bomb, waiting to explode.

While there is really nothing new revealed in the movie, there is a great scene in which Mark Baum (Steve Carrell) confronts the Standard and Poor’s staffer who admits to giving high ratings to mortgage security bonds because the banks pay for the ratings. If S&P doesn’t deliver the high ratings, the banks will take their business elsewhere, perhaps to Moody’s. The profit incentive to be uncritical, to not evaluate, is overwhelming. Without admitting any wrong doing it has taken until 2015 for S&P (whose parent company is MacGraw Hill Financials) to make reparations in a $1.4M settlement with the US Justice Department.

cashThis is a particular and poignant message for evaluation and evaluators. Like so much else about the financial crisis, shortsightedness and greed resulted in false evaluations, ones with very serious consequences. S&P lied: they claimed to be making independent evaluations of the value of mortgage backed securities, and the lie meant making a larger than usual profit and facilitating banks’ bogus instruments. Moody did the same thing. While the ratings agencies have made some minor changes in their evaluation procedures the key features, lack of independence and the interconnection of their profit margin with that of their customers, have not. The consensus seems to be there is nothing that would preclude the evaluators from playing precisely the same role in the future.

In addition, while the ratings companies profits took a serious hit the big three agencies—Moody’s, S&P and Fitch— their revenues surpassed pre-crisis levels, and Moody’s and S&P now look more attractive as businesses than most other financial firms do. Something worth pondering another day.

p56a191acde5f1Individual evaluators may say, “Well, I wouldn’t do that” and that may be to some extent true, but the same underlying relationships are repeated in all contracted evaluation work. If you are hiring me to do evaluation for you and I want you to consider hiring me again in the future then I am in the same relationship as the ratings agencies are to financial institutions. This is a structural deficiency, and a serious one. In a soon to be published book chapter (in Evaluation for an Equitable Society), I analyze how capitalism has overwhelmed pretty much everything. We are unable to see a role for evaluation theory and practice outside the fee-for-service framework dictated in the current neoliberal frames of social engagement.

In that chapter I offer suggestions about what evaluation can do, alongside being more responsible within a fee for service framework. First, evaluation needs to evaluate its own systems and instruments. Meta-analysis of evaluations (like that done by S&P, and pharmaceutical companies, by grant funding agencies, in education, and so on) are necessary. Using our skills to insure that what is being done in the name of evaluation is indeed evaluative and not merely profiteering is critically important. Second, professional evaluation associations need to promote structures for truly independent evaluations, evaluations solicited and paid for by third parties that have no profit to make although, of course, an interest (government agencies, funding agencies, and so on) in competently done, valid evaluation studies.



the difference between external and independent evaluation

The terms external and independent evaluation are often conflated, largely because external evaluations are (wrongly) assumed to be more independent than internal evaluations. A good example is the evaluation of the LAUSD iPad initiative conducted by the American Institutes for Research, which is described in an EdWeek story like this:

An independent evaluation of the Los Angeles Unified School District’s ambitious—and much-maligned—effort to provide digital devices to all students found that the new, multi-million dollar digital curriculum purchased as part of the initiative was seldom used last year because it had gaping holes, was seen by some teachers to lack rigor, and was plagued by technical glitches.

To be fair, AIR calls their evaluation external, NOT independent. And the evaluation IS external because the evaluators (AIR staff) are not members of the organization (LAUSD) in which the evaluand exists. They are external also to the iPad initiative, the program itself.

Screen Shot 2014-09-19 at 10.15.22 AMWhile a bit pedestrian, it is worth asking what is meant by independent so it is clearer how external and independent are not synonyms.

Using the LAUSD iPad example, is AIR’s evaluation independent? The first sense of independence would suggest the evaluation is free from control by any one outside of AIR and the AIR evaluation team ~ that the evaluation is not influenced by any one, including the LAUSD, Pearson or Apple. It is clear from the report that indeed the evaluation is influenced by the LAUSD by asking questions that are relevant and desirable to them, although there is no obvious influence from Pearson or Apple, the two corporations providing the hardware, software, and professional development. This is absolutely typical in evaluation ~ those who commission the evaluation influence the focus of the evaluation, and often how the evaluation is done (although whether that was the case in this evaluation is not explicit in the report).

A key to the influence the LAUSD has on the evaluation is illustrated in the description of the program milestones, the first of which is characterized as awarding the contract to Apple in June 2013. But it is clear this is not the first milestone as a LAUSD Board report released in August 2014 points to Superintendent Deasy’s manipulation of the bidding process so it would be a foregone conclusion the successful vendor would be the Apple/Pearson combo. AIR evaluators would have known about this. There is also no mention of the LAUSD’s refusal, when the project was rolled out, to reveal how much money had been paid to Pearson, a subcontractor to Apple on the $30 million first phase of the project. 

Evaluators might argue that these matters are not the focus of the evaluation as framed by the evaluation questions, and that is likely true. The problem is that the evaluation questions are usually (and no reason to believe this wasn’t the case with the AIR evaluation of the iPad initiative) mutually agreed upon by the external evaluator and the organization contracting for the evaluation. That an organization would not want to include issues of malfeasance, transparency and accountability is understandable in many cases. A truly independent evaluation would necessarily include these issues, as well as other unanticipated circumstances and outcomes. The lack of independence is structural (in who commissions evaluations) privileging the perspectives of decision-makers, funders and CEOs.

The second sense of independence points to a failure for every external evaluation ~ external evaluators are in an immediate sense dependent on whomever commissions the evaluation for their subsistence and in the longer term sense if they wish to do evaluations for this organization again, or even other organizations who may monitor how the first sense of independence is treated in past evaluations. External evaluations lack financial independence.

And, external evaluations fail on the third sense of independence because the evaluators and the organizations commissioning evaluations of themselves or their programs are connected to one another, certainly financially but also often in an ongoing relationship with one another.

Whose interests are served and how?

Screen Shot 2014-09-19 at 11.53.22 AMBecause of the lack of structural and financial independence, external evaluations (as much as internal evaluations) emphasize some interests and serve some ends, while ignoring or bracketing others. In the LAUSD iPad initiative, the interests of both the LAUSD as a whole, the Board, and John Deasy are served both by what is included and excluded. The AIR evaluation provides a good descriptive account of the roll out of a major technology initiative, including issues with levels and types of use, quality of curriculum, and what worked well (the use of apps, for example). The evaluation could not be construed as positive on the Pearson curriculum content.

But by avoiding the inclusion of issues around the initial bidding process, so too are specific interests of Deasy, Apple and Pearson served. What does it mean that both Deasy and Apple were involved in manipulating the bidding for the contract? Put in the context of Apple’s aggressive marketing of iPads to schools, this becomes potentially an example of profit-making over learning. Apple’s last quarterly earnings claims more than 13 million iPads have been sold globally for education; 2 and a half iPads are sold for every Mac in K-12 education. The secretive partnering with Pearson, a company recognized more for making profit than making educational gains, should be an additional piece of an independent evaluation. Corporations whose primary interest is profit making and who mastermind programs and products deserve scrutiny for how their interests intersect with other interests (like teaching and learning).

Although there are few mechanisms for truly independent evaluations, professional evaluation associations and professional evaluators should be pondering how their work as either internal or external evaluators might be more independent, as well as developing strategies for conducting truly independent evaluations that are simply not compromised by the structural and financial relationships that characterize virtually all evaluations.

Logic Models

Logic models (similar to program theory) are popular in evaluation. The presumption is that programs or interventions can be depicted in a linear input output schema, simplistically depicted as:

This simple example can be illustrated by using this model to evaluate how an information fair on reproductive health contributes to the prevention of unwanted pregnancies.

The inputs are the money, labour, and facilities needed to produce the information fair.
The activity is organizing and presenting the information fair.
The output is that some people attend the info fair.
The outcome is that some of those who attend the info act on the information provided.
The impact is that unwanted pregnancies are reduced.The idea is that each step in this causal chain can be evaluated.Did the inputs (money etc.) really produce the intervention?

And did the activities produce the output (an informed audience)?
Did the output produce the outcome (how many attendees acted on the information)?
To measure the impacts, public health statistics could be used.

A quick overview of logic models is provided on the Audience Dialogue website. One of the best online resources for developing and using logic models is Kellogg Foundation’s Logic Model Development Guide, and loads of visual images of logic models are available, and aboriginal logic models have also been developed.

See also Usable Knowledge’s short tutorial on creating a logic model.

And readIan David Moss’ In Defense of Logic Models, which is probably the most reasoned response to many of the criticisms… take a look at the comments to his blog post as they extend the discussion nicely.

Precision measurement ~ sometimes it matters, like in Luge, but not most of the time

In some Olympic sports thousandsth of a second matter. In the men’s doubles luge run the difference between the gold and silver medals was about 1/2 a second (.522 of a second to be exact). Lugers compete against a timer and luge is probably one of the most precisely timed sports in the world. Just to be clear, luge specifies a base weight (90 kg for individuals, 180 kg for doubles) and lugers may add weights to their sleds so that each run is precisely the same weight, and skill in maneuvering the track is what accounts for differences in time. Luge is a sport that is judged entirely on the outcome ~ the shortest time. How you get there doesn’t matter, other than that it is understood that following the “perfect line” is likely to get you to the finish line in the least amount of time. However, in luge nuance is critical. But often that nuance escapes even the knowledgable commentators who attempt to give spectators a sense of what is happening during a luge run. Mostly it comes down to a better run is one where the luger moves very little and doesn’t hit a wall!

For those of us doing program evaluation in the world of social, educational, health, policy interventions we might envy such precise measurements, but the work we do is different in a number of ways. Precision of measurement must be judged within the context of evaluation. First, we have no singular and unambiguous outcomes to measure. Our outcomes are constructs, ones that depend for their definition on values and ideologies. For example, poverty reduction might be an agreed upon outcome, but how that is conceptualized is quite elastic. And poverty reduction is likely conflated with other constructs like food security or affordable housing. Second, measures used in evaluation are not like time. We have no analogous high precision outcome measure to time in luge competitions, in large part because of the ambiguity of our outcomes. And last, we seldom want to give up investigating process and focus solely on outcomes. In the social world, how we attempt to ameliorate problems is an essential component of the quality of those efforts… outcomes matter to be sure, but getting to outcomes matters as much, and sometimes more.

an organic, evolving definition of evaluation

Perhaps a step closer to being a discipline, the American Evaluation Association project to define evaluation might signal that we are getting down to the fundamental ideas in our field. A committee has developed a definition that its chair, Michael Q. Patton describes as “a living document, ever being updated and revised, but never becoming dogma or an official, endorsed position statement.” Bravo to all for this initiative!

The open-access, participatory strategy is an interesting and forwarding thinking one, and I will be curious to see if and how that statement changes over time. My prediction is that it won’t change much. The statement as it is pretty much captures what anyone would say an introductory evaluation course, but we shall see.

I think, however, there are a couple of key details missing from this definition… details that might bring clarity about the foundations of evaluation. As the definition now stands, it focuses primarily on evaluation practice and less so on the discipline of evaluation. The initial definition is what we all say when we explain what evaluation is:

Evaluation is a systematic process to determine merit, worth, value or significance.

The string of descriptors about what evaluation is a determination of are important, and they are not the same. The definition provides no guidance about what the differences are and why we provide this string in our definition. What is the difference between merit and worth, and how are those different from value or significance? This is not a trivial matter and lack of understanding about these distinctions sometimes gives evaluation a bad name. For example, when an evaluation focuses on determining the worth of an evaluand and is found wanting there is often a hue and cry when that same evaluand is simultaneously meritorious.

The second detail that is missing is the logic of how we get to those judgements of merit, worth, value and significance. The definition says that evaluation is a “systematic process” but provides no hint of what makes evaluation systematic. Perhaps this is one of those contentious areas that Patton describes when he introduced the statement, “There was lots of feedback, much of it contradictory.” But, from the statement, we cannot know whether the committee talked about including details about what makes evaluation systematic and couldn’t come to agreement, or if this was never discussed in the first place. Perhaps being systematic has two meanings that get entangled… we use models/approaches in evaluating that provide guidance about how to do evaluation (UFE, RCT, participatory, and so on) AND there is a logic to thinking evaluatively that is embedded in all models/approaches to evaluation. There is no need to include the former in a definition of evaluation, but there is a need to include the latter.

Michael Scriven has provided the grounding for articulating the logic of evaluation, Deborah Fournier has done considerable work on articulating what that logic looks like in practice (that is, how it is manifest in various evaluation approaches/models), and both Michael Scriven and Ernie House have tackled the specific issue of synthesis in evaluation. This logic is at the heart of what makes evaluation systematic and I’d like to see this in this definition. (For a quick introduction to these ideas, check out the entries in the Encyclopedia of Evaluation by these authors.)

As an organic, evolving definition of evaluation, perhaps these are components that will still be developed and included.

new book ~ Feminist Evaluation & Research: Theory & Practice

Available in April, a new edited book (Guilford Press) that explores the ‘whats,’ ‘whys,’ and ‘hows’ of integrating feminist theory and methods into applied research and evaluation practice.


I. Feminist Theory, Research and Evaluation

1. Feminist Theory: Its Domain and Applications, Sharon Brisolara
2. Research and Evaluation: Intersections and Divergence, Sandra Mathison
3. Researcher/Evaluator Roles and Social Justice, Elizabeth Whitmore
4. A Transformative Feminist Stance: Inclusion of Multiple Dimensions of Diversity with Gender, Donna M. Mertens
5. Feminist Evaluation for Nonfeminists, Donna Podems

II. Feminist Evaluation in Practice

6. An Explication of Evaluator Values: Framing Matters, Kathryn Sielbeck-Mathes and Rebecca Selove
7. Fostering Democracy in Angola: A Feminist-Ecological Model for Evaluation, Tristi Nichols
8. Feminist Evaluation in South Asia: Building Bridges of Theory and Practice, Katherine Hay
9. Feminist Evaluation in Latin American Contexts, Silvia Salinas Mulder and Fabiola Amariles

III. Feminist Research in Practice

10. Feminist Research and School-Based Health Care: A Three-Country Comparison, Denise Seigart
11. Feminist Research Approaches to Empowerment in Syria, Alessandra Galié
12. Feminist Research Approaches to Studying Sub-Saharan Traditional Midwives, Elaine Dietsch
Final Reflection. Feminist Social Inquiry: Relevance, Relationships, and Responsibility, Jennifer C. Greene

“outcome harvesting”… forensics-informed evaluation approach

Outcome Harvesting is an evaluation approach developed by Ricardo Wilson-Grau. Using a forensics approach, outcome harvesting has the evaluator or ‘harvester’ retrospectively glean information from reports, personal interviews, and other sources to document how a given program, project, organization or initiative has contributed to outcomes. Unlike so many evaluation approaches that begin with stated outcomes or objectives, this approach looks for evidence of outcomes, and explanations for those outcomes, in what has already happened… a process the creators call ‘sleuthing.’

This approach blends together, and maybe eliminates the distinctions, among intended and unintended outcomes. Evaluators are enjoined to look beyond what programs say they will do to what they actually do, but in an objectives driven world this requires evaluators to convince clients that this is important or necessary, and justifying the expenditure of evaluation resources on a broader concept of outcomes than is often defined.

Wilson-Grau has written a clear explanation of the process, which can be downloaded here. In the downloadable pdf, the six steps of outcome harvesting are summarized:

1. Design the Outcome Harvest: Harvest users and harvesters identify useable questions to guide the harvest. Both users and harvesters agree on what information is to be collected and included in the outcome description as well as on the changes in the social actors and how the change agent influenced them.
2. Gather data and draft outcome descriptions: Harvesters glean information about changes that have occurred in social actors and how the change agent contributed to these changes. Information about outcomes may be found in documents or collected through interviews, surveys, and other sources. The harvesters write preliminary outcome descriptions with questions for review and clarification by the change agent.
3. Engage change agents in formulating outcome descriptions: Harvesters engage directly with change agents to review the draft outcome descriptions, identify and formulate additional outcomes, and classify all outcomes. Change agents often consult with well- informed individuals (inside or outside their organization) who can provide information about outcomes.
4. Substantiate: Harvesters obtain the views of independent individuals knowledgeable about the outcome(s) and how they were achieved; this validates and enhances the credibility of the findings.
5. Analyze and interpret: Harvesters organize outcome descriptions through a database in order to make sense of them, analyze and interpret the data, and provide evidence-based answers to the useable harvesting questions.
6. Support use of findings: Drawing on the evidence-based, actionable answers to the useable questions, harvesters propose points for discussion to harvest users, including how the users might make use of findings. The harvesters also wrap up their contribution by accompanying or facilitating the discussion amongst harvest users.

Other evaluation approaches (like the Most Significant Change technique or the Success Case Method) also look retrospectively at what happened and seek to analyze who, why and how change occurred, but this is a good addition to the evaluation literature. An example of outcome harvesting is described on the BetterEvaluation Blog. A short video introduces the example. watch?v=lNhIzzpGakE&

If your job involves doing evaluation (and whose doesn’t), you might be sued

For many professionals doing evaluation is part of the job. Lawyers make evaluative judgements about the quality of evidence; teachers judge the quality of students’ learning; builders judge the quality of materials. All work entails judgements of quality, and the quality of work is dependent on doing good evaluations.

But what happens when the evaluation done as part of professional work is contested? You might just find yourself being sued. Such is the case with Dale Askey, librarian at McMaster University. Askey’s job requires him to make judgements about the quality of published works and in turn publishers to make wise procurement decisions for his employer, decisions that have become ever more difficult with shrinking resources. The case can be easily summarized:

Librarian questions quality of a publishing house.

Librarian publicly criticizes said press on his personal blog.

Two years later, librarian and current employer get sued for libel and damages in excess of $4 million.

Read more:
Inside Higher Ed

There is no reason to believe that Askey rendered his judgement about the quality of scholarship offered by Mellen Press in a capricious or incompetent manner. Making judgements for procurement decisions is surely one of the tasks that Askey’s employer expects him to do, especially in a time of diminishing resources.

There has been considerable support for Askey, some a bit misguided by defending his write to express his opinion on his blog, but most in defense of Askey’s responsibility to do his job.

Screen Shot 2013-02-28 at 9.20.31 AMThere is every reason to expect that the Mellen Press lawsuit will be dismissed as was the similar lawsuit brought by Mellen Press against Lingua Franca.

So what is the relevance for evaluation? It is clear that evaluation is integral to all and applied in virtually all other intellectual and practical domains… it is as Michael Scriven claims, a trans-discipline. As such, there is a need to pay more attention in preparing people to do publicly defensible evaluations in the context of their work. Perhaps more than program evaluation, this sort of evaluative thinking might be the raison d’etre for the discipline of evaluation.