learning to be an evaluator ~ many ways, many contexts

Many of us naturally think about learning to do evaluation within the context of degree programs, professional development workshops, and sometimes on-the-job training. In so doing, education in evaluation is seen as more limited than it actually is. Because evaluation is perhaps one of the most common forms of thinking (whether it is done well or not), there is a dizzying array of contexts in which people learn to make judgements about what good is.

Yesterday, hundreds of young people gathered in rural North Carolina to demonstrate their evaluation skills… in dairy cow judging.

participants are scored based on how well they apply dairy cattle evaluation skills learned in the classroom. Each team evaluated six classes of dairy cattle and defended its reasoning to a panel of judges

While Future Farmers of America members may do cow judging in preparation for careers as dairy farmers, historically the evaluation skills demonstrated were key to selecting the best, most productive, and healthiest herd, upon which the farmer’s livelihood depended.

the difference between external and independent evaluation

The terms external and independent evaluation are often conflated, largely because external evaluations are (wrongly) assumed to be more independent than internal evaluations. A good example is the evaluation of the LAUSD iPad initiative conducted by the American Institutes for Research, which is described in an EdWeek story like this:

An independent evaluation of the Los Angeles Unified School District’s ambitious—and much-maligned—effort to provide digital devices to all students found that the new, multi-million dollar digital curriculum purchased as part of the initiative was seldom used last year because it had gaping holes, was seen by some teachers to lack rigor, and was plagued by technical glitches.

To be fair, AIR calls their evaluation external, NOT independent. And the evaluation IS external because the evaluators (AIR staff) are not members of the organization (LAUSD) in which the evaluand exists. They are external also to the iPad initiative, the program itself.

While a bit pedestrian, it is worth asking what is meant by independent ~ the word has several distinct senses ~ so it is clearer that external and independent are not synonyms.

Using the LAUSD iPad example, is AIR’s evaluation independent? The first sense of independence would require that the evaluation be free from control by anyone outside of AIR and the AIR evaluation team ~ that it not be influenced by anyone, including the LAUSD, Pearson, or Apple. It is clear from the report that the evaluation is indeed influenced by the LAUSD, which shaped the evaluation questions so they were relevant and desirable to the district, although there is no obvious influence from Pearson or Apple, the two corporations providing the hardware, software, and professional development. This is absolutely typical in evaluation ~ those who commission the evaluation influence its focus, and often how it is done (although whether that was the case in this evaluation is not explicit in the report).

A key to the influence the LAUSD has on the evaluation is illustrated in the description of the program milestones, the first of which is characterized as the awarding of the contract to Apple in June 2013. But this is clearly not the first milestone: a LAUSD Board report released in August 2014 points to Superintendent Deasy’s manipulation of the bidding process, which made it a foregone conclusion that the successful vendor would be the Apple/Pearson combination. AIR evaluators would have known about this. There is also no mention of the LAUSD’s refusal, when the project was rolled out, to reveal how much money had been paid to Pearson, a subcontractor to Apple on the $30 million first phase of the project.

Evaluators might argue that these matters are not the focus of the evaluation as framed by the evaluation questions, and that is likely true. The problem is that the evaluation questions are usually mutually agreed upon by the external evaluator and the organization contracting for the evaluation (and there is no reason to believe this wasn’t the case with the AIR evaluation of the iPad initiative). That an organization would not want to include issues of malfeasance, transparency, and accountability is understandable in many cases. A truly independent evaluation would necessarily include these issues, as well as other unanticipated circumstances and outcomes. The lack of independence is structural (rooted in who commissions evaluations), privileging the perspectives of decision-makers, funders, and CEOs.

The second sense of independence points to a failure for every external evaluation ~ external evaluators depend, in the immediate sense, on whoever commissions the evaluation for their livelihood, and in the longer term they depend on that organization if they wish to do evaluations for it again, or for other organizations that may look at how past evaluations handled the first sense of independence. External evaluations lack financial independence.

And external evaluations fail on the third sense of independence because the evaluators and the organizations commissioning evaluations of themselves or their programs are connected to one another ~ certainly financially, but often also through an ongoing relationship.

Whose interests are served and how?

Because of this lack of structural and financial independence, external evaluations (as much as internal evaluations) emphasize some interests and serve some ends, while ignoring or bracketing others. In the LAUSD iPad initiative, the interests of the LAUSD as a whole, the Board, and John Deasy are served both by what is included and by what is excluded. The AIR evaluation provides a good descriptive account of the roll-out of a major technology initiative, including issues with levels and types of use, the quality of the curriculum, and what worked well (the use of apps, for example). The evaluation could not be construed as positive on the Pearson curriculum content.

But by leaving out issues around the initial bidding process, the evaluation also serves specific interests of Deasy, Apple, and Pearson. What does it mean that both Deasy and Apple were involved in manipulating the bidding for the contract? Put in the context of Apple’s aggressive marketing of iPads to schools, this becomes potentially an example of profit-making over learning. Apple’s most recent quarterly earnings report claims more than 13 million iPads have been sold globally for education; two and a half iPads are sold for every Mac in K-12 education. The secretive partnering with Pearson, a company recognized more for making profit than for making educational gains, should be an additional focus of an independent evaluation. Corporations whose primary interest is profit-making and who mastermind programs and products deserve scrutiny for how their interests intersect with other interests (like teaching and learning).

Although there are few mechanisms for truly independent evaluations, professional evaluation associations and professional evaluators should be pondering how their work as internal or external evaluators might be made more independent, and developing strategies for conducting truly independent evaluations that are not compromised by the structural and financial relationships that characterize virtually all evaluations.

Logic Models

Logic models (similar to program theory) are popular in evaluation. The presumption is that programs or interventions can be depicted in a simple, linear input-output schema: inputs → activities → outputs → outcomes → impact.

This simple schema can be illustrated by using the model to evaluate how an information fair on reproductive health contributes to the prevention of unwanted pregnancies.

The inputs are the money, labour, and facilities needed to produce the information fair.
The activity is organizing and presenting the information fair.
The output is that some people attend the info fair.
The outcome is that some of those who attend the info fair act on the information provided.
The impact is that unwanted pregnancies are reduced.

The idea is that each step in this causal chain can be evaluated.

Did the inputs (money etc.) really produce the intervention?

And did the activities produce the output (an informed audience)?
Did the output produce the outcome (how many attendees acted on the information)?
To measure the impacts, public health statistics could be used.
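
To make the chain concrete, here is a minimal sketch in Python of the information fair example as a logic model, with an evaluation question attached to each link. The stage names follow the list above; the descriptions and questions are illustrative only, not taken from any actual evaluation plan.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str          # link in the causal chain (inputs, activities, ...)
    description: str   # what the stage looks like for the information fair
    question: str      # the evaluation question for this link

# The information fair example, expressed as a linear logic model.
info_fair_model = [
    Stage("inputs", "money, labour, and facilities",
          "Did the inputs really produce the intervention?"),
    Stage("activities", "organizing and presenting the information fair",
          "Were the activities carried out as planned?"),
    Stage("outputs", "some people attend the info fair",
          "Did the activities produce an informed audience?"),
    Stage("outcomes", "some attendees act on the information provided",
          "How many attendees acted on the information?"),
    Stage("impact", "unwanted pregnancies are reduced",
          "Do public health statistics show a reduction?"),
]

# Walking the chain makes the evaluation plan explicit, one link at a time.
for stage in info_fair_model:
    print(f"{stage.name}: {stage.description}\n  -> {stage.question}")
```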

A quick overview of logic models is provided on the Audience Dialogue website. One of the best online resources for developing and using logic models is the Kellogg Foundation’s Logic Model Development Guide. Loads of visual examples of logic models are available, and Aboriginal logic models have also been developed.

See also Usable Knowledge’s short tutorial on creating a logic model.

And read Ian David Moss’s In Defense of Logic Models, which is probably the most reasoned response to many of the criticisms… take a look at the comments on his blog post as they extend the discussion nicely.

Google Glass & GoPro… gimmicky or useful in evaluation?

Doctors, forensic scientists, and police officers have been early adopters of Google Glass as a way of collecting data, of recording events that matter to their professional practice. These recording devices are double-edged: on the one hand, they make the transmission of surveillance data incredibly easy (maybe too easy); on the other hand, the data might be used to evaluate the performance of the surgeon, physician, or police officer wearing them. So personnel evaluation might evolve with a data record of actual performance ~ you can see the quality of the surgery performed or the propriety of an arrest.

Some evaluators spend considerable time in programmatic contexts collecting observational data. One wonders if recording devices that simply come along with us and record what is going on might be useful for evaluators. For example, the GoPro, strapped to your head or chest, is now standard equipment for sports enthusiasts capturing their accomplishments or nature enthusiasts capturing their surroundings. It might well be the means to record that program activity or meeting, but it might also be a bit intrusive.

Google Glass is definitely more stylish, less obtrusive, and provides interactive capabilities. It is in the beta stage, what Google calls the Explorer Program, and if a space is available you can be an early adopter for the cost of $1,500 ~ that is, if you live in the USA. In short, you tell it what to do: take a picture or video (which you can share), send a message, look up information. The example below shows some of its capabilities. Imagine an evaluation context that would allow you to record what you see and do, and to share and connect with program stakeholders.

Google Glass has been controversial when people wear it as a matter of course in their daily lives, creating exaggerated tensions in an already surveillance-rich society (smartphones being the obvious device). But used in an evaluation context, where people have accepted that events, interactions, and talk will be recorded, these controversies might be obviated.

Creating Educative Personal Experiences ~ learning evaluation from the Olympics and other things that happen in your life

In the early 1990s, Wayne Ross and I wrote an article with this title. (The full article is available here.) While we were talking about the role of personal experiences in learning to teach, rereading the article suggests a broader scope for the value of personal experiences in learning just about anything, including evaluation or research. Because evaluation is absolutely everywhere, the opportunities to hone our knowledge and skills are limitless. I’ve had fun with the Olympics and evaluation project, revisiting some basic ideas in evaluation and sharing them with you.

Athletes (and evaluators) learn from mistakes.

And they learn from successes.

Ranking ~ who’s the best now that the Olympics are over?

Wherever in the world you were watching the Olympics from, there would have been a nationalistic bias in what you saw and a constant counting and recounting of medals to assert the superiority of your country over all others (you hope) or at least over some other countries. That Russia, the host country, earned the most medals, and especially the most gold and silver medals, declares Russia simply #1, best in the world, and highly accomplished in amateur sports. Russia is followed by the USA, Norway, Canada, and the Netherlands in terms of national prowess in winter sports.

This ranking is based on the number of medals received, regardless of the level of medal. Naturally, it is the media that creates these rankings (not the IOC), and this rather simple strategy might distort who is the best (if this notion of the best has any construct validity, but that’s another discussion). It seems fairly obvious that getting the gold is better than getting the silver, and that both trump getting a bronze medal. If we weighted the medal count (3 points for gold, 2 for silver, and 1 for bronze), would the rankings of countries change? They do, a bit, and there are two noticeable changes. The first is that Russia is WAY better than even the other top-five countries, with a score of 70 compared to the next highest scoring country, Canada (which moves from fourth to second place), with a score of 55. Perhaps less profound, but still interesting, is that although overall the USA had two more medals than Norway, their weighted scores are identical at 53.
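
For readers who want to check the arithmetic, here is a minimal sketch of both ranking strategies in Python, using the Sochi 2014 medal counts for the top five countries as they were reported at the time (later doping rulings revised some of these figures).

```python
# Medal counts (gold, silver, bronze) as reported at the close of Sochi 2014.
sochi_2014 = {
    "Russia":      (13, 11, 9),
    "USA":         (9, 7, 12),
    "Norway":      (11, 5, 10),
    "Canada":      (10, 10, 5),
    "Netherlands": (8, 7, 9),
}

def total_medals(counts):
    """Rank simply by the number of medals, regardless of level."""
    return sum(counts)

def weighted_score(counts, weights=(3, 2, 1)):
    """Weight the count: 3 points for gold, 2 for silver, 1 for bronze."""
    return sum(n * w for n, w in zip(counts, weights))

def rank(table, scorer):
    return sorted(table, key=lambda country: scorer(table[country]), reverse=True)

print("By total medals:  ", rank(sochi_2014, total_medals))
print("By weighted score:", rank(sochi_2014, weighted_score))
# Russia tops both lists (weighted score of 70), but Canada jumps from
# fourth to second (55), and the USA and Norway tie at 53.
```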

But wait. The Olympics are held every four years, and one might expect relative stability in the rankings. The table to the left shows the top six ranked countries in 2010, when the Olympics were held in beautiful Vancouver, BC (no bias on my part here). Russia squeaks into the top six.

So two things to note: 1) using the weighted scoring suggested above, the order doesn’t change and we get a similar magnitude of performance [USA score = 70; Germany = 63; Canada = 61; Norway = 49; Austria = 30; Russia = 26], and 2) something miraculous happened in Russia in the last four years! Russia’s weighted score went from 26 in 2010 to 70 in 2014.
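
The same weighting applied to the 2010 Vancouver top six (again a sketch, using the counts as they stood at the time) reproduces the scores above.

```python
# 2010 Vancouver medal counts (gold, silver, bronze) for the top six countries.
vancouver_2010 = {
    "USA":     (9, 15, 13),
    "Germany": (10, 13, 7),
    "Canada":  (14, 7, 5),
    "Norway":  (9, 8, 6),
    "Austria": (4, 6, 6),
    "Russia":  (3, 5, 7),
}

# Same 3/2/1 weighting as before, computed inline.
weighted = {country: 3 * g + 2 * s + b
            for country, (g, s, b) in vancouver_2010.items()}
print(weighted)
# {'USA': 70, 'Germany': 63, 'Canada': 61, 'Norway': 49, 'Austria': 30, 'Russia': 26}
```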

Looking across 2006, 2010, and 2014 you get a different picture, with the countries that appear in the top six changing and the weighted rankings fluctuating notably. There are a couple of take-away messages for evaluators. The simple one is to be cautious when using ranking. There are quite specific instances when evaluators might use ranking (textbook selection, admissions decisions, and research proposal evaluation are examples), and a quick examination of how that ranking is done illustrates the need for thoughtfulness in creating algorithms. Michael Scriven and Jane Davidson offer an alternative, a qualitative weight-and-sum technique, to the numeric weight-and-sum strategy I have used here, and it is often a great improvement. When we rank things we can too easily confuse rankings with grades; in other words, the thing that is ranked most highly is taken to be good. In fact, it may or may not be good… it’s all relative. The most highly ranked thing isn’t necessarily a good thing.
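
A toy illustration of the ranking-versus-grading point: the proposal names, scores, and funding bar below are invented, but they show how the top-ranked thing can still fall short of any standard of good.

```python
# Hypothetical proposal scores out of 100, with an (invented) funding bar of 70.
scores = {"Proposal A": 62, "Proposal B": 58, "Proposal C": 41}

# Ranking orders the proposals relative to one another...
ranking = sorted(scores, key=scores.get, reverse=True)

# ...while grading judges each one against a standard.
grades = {name: ("fundable" if score >= 70 else "not fundable")
          for name, score in scores.items()}

print(ranking)  # ['Proposal A', 'Proposal B', 'Proposal C']
print(grades)   # Proposal A ranks first, yet none of the three clears the bar.
```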

Fidelity is over-rated… or understanding “hurry, hurry hard”

I couldn’t get through this project of learning about evaluation from the Olympics without a mention of curling. Born on the Canadian prairies, I curl! We curled during phys ed class, and as a young adult it was an important context for socializing. Curling is a polite game: winning is important, but good sportsmanship is more important ~ players are on their honour and there are no judges or referees. And what other sport has a tradition of all the competitors getting together after the match for rounds of drinks, a practice called “broomstacking”? Maybe it’s an easy game to make fun of, but try it and you’ll discover there’s more to it than it seems.

Curling is a sport with many skills that can be isolated, practiced, and mastered ~ like drawing to the button, peeling off a guard, a take-out with a roll behind a guard, or throwing hack weight. And there’s learning to know when to sweep and yell at the top of your lungs, “hurry, hurry hard!” Countries relatively new to the sport focus on these skills and demonstrate extraordinary abilities of execution, which is important to winning. But winning the game also requires something more elusive. These teams often confuse fidelity with quality, an all too common mistake in program evaluation. Being able to execute shots with precision is necessary, but not sufficient, to win ~ in either curling or programs.

Strategy is also key in curling and is not so easily mastered through repetitious practice of isolated skills. Curling has been called “chess on ice.” There are aggressive and conservative strategies. Strategy depends in large part on the context ~ factors such as the ice, skill levels, whether you have the hammer (the last rock thrown), and so on. Strategy in program delivery, especially on the ground interpretations and practice, also depends on the context and practitioners use their strategic knowledge to adjust interventions to achieve maximum success. This strategic adjustment must often trade away fidelity to the intervention plan or map, and too frequently this is seen as a failure. Program evaluations sensitive to both programmatic intentions and local variation are more comprehensive and meaningful for understanding how and why programs work, or don’t.

Precision measurement ~ sometimes it matters, like in Luge, but not most of the time

In some Olympic sports, thousandths of a second matter. In the men’s doubles luge run, the difference between the gold and silver medals was about half a second (0.522 seconds, to be exact). Lugers compete against a timer, and luge is probably one of the most precisely timed sports in the world. Just to be clear, luge specifies a base weight (90 kg for individuals, 180 kg for doubles) and lugers may add weights to their sleds so that each run is precisely the same weight; skill in maneuvering the track is what accounts for differences in time. Luge is a sport that is judged entirely on the outcome ~ the shortest time. How you get there doesn’t matter, other than that following the “perfect line” is understood to be what gets you to the finish line in the least amount of time. However, in luge nuance is critical, and that nuance often escapes even the knowledgeable commentators who attempt to give spectators a sense of what is happening during a run. Mostly it comes down to this: a better run is one where the luger moves very little and doesn’t hit a wall!

For those of us doing program evaluation in the world of social, educational, health, and policy interventions, we might envy such precise measurements, but the work we do is different in a number of ways. Precision of measurement must be judged within the context of evaluation. First, we have no singular and unambiguous outcomes to measure. Our outcomes are constructs, ones that depend for their definition on values and ideologies. For example, poverty reduction might be an agreed-upon outcome, but how it is conceptualized is quite elastic, and poverty reduction is likely conflated with other constructs like food security or affordable housing. Second, measures used in evaluation are not like time. We have no outcome measure analogous to the high-precision timing of luge competitions, in large part because of the ambiguity of our outcomes. And last, we seldom want to give up investigating process and focus solely on outcomes. In the social world, how we attempt to ameliorate problems is an essential component of the quality of those efforts… outcomes matter to be sure, but getting to outcomes matters as much, and sometimes more.