Fidelity is over-rated… or understanding “hurry, hurry hard”

I couldn’t get through this project of learning about evaluation from the Olympics without a mention of curling. Born on the Canadian prairies, I curl! We curled during phys ed class, and as a young adult it was an important context for socializing. Curling is a polite game: winning is important, but good sportsmanship is more important ~ players are on their honour and there are no judges or referees. And what other sport has a tradition of all the competitors getting together after the match for rounds of drinks, what is called “broomstacking”? Maybe it’s an easy game to make fun of, but try it and you’ll discover there’s more to it than it seems.

Curling is a sport that has many skills that can be isolated, practiced and mastered. Like drawing to the button, or peeling off a guard, or a takeout with a roll behind a guard, or throwing hack weight. And there’s learning to know when to sweep and yell at the top of your lungs, “hurry, hurry hard!” Countries relatively new to the sport focus on these skills and demonstrate extraordinary abilities of execution, which is important to winning. But winning the game also requires something more elusive. These teams often confuse fidelity with quality, an all-too-common mistake in program evaluation. Being able to execute shots with precision is necessary, but not sufficient, to win ~ in either curling or programs.

Strategy is also key in curling and is not so easily mastered through repetitious practice of isolated skills. Curling has been called “chess on ice.” There are aggressive and conservative strategies. Strategy depends in large part on the context ~ factors such as the ice, skill levels, whether you have the hammer (the last rock thrown), and so on. Strategy in program delivery, especially in on-the-ground interpretations and practice, also depends on the context, and practitioners use their strategic knowledge to adjust interventions to achieve maximum success. This strategic adjustment must often trade away fidelity to the intervention plan or map, and too frequently this is seen as a failure. Program evaluations sensitive to both programmatic intentions and local variation are more comprehensive and meaningful for understanding how and why programs work, or don’t.

Precision measurement ~ sometimes it matters, like in Luge, but not most of the time

In some Olympic sports, thousandths of a second matter. In the men’s doubles luge run, the difference between the gold and silver medals was about half a second (0.522 seconds, to be exact). Lugers compete against a timer, and luge is probably one of the most precisely timed sports in the world. Just to be clear, luge specifies a base weight (90 kg for individuals, 180 kg for doubles), and lugers may add weights to their sleds so that each run is at precisely the same weight; skill in maneuvering the track is what accounts for differences in time. Luge is a sport that is judged entirely on the outcome ~ the shortest time. How you get there doesn’t matter, other than that it is understood that following the “perfect line” is likely to get you to the finish line in the least amount of time. However, in luge nuance is critical, and often that nuance escapes even the knowledgeable commentators who attempt to give spectators a sense of what is happening during a run. Mostly it comes down to this: a better run is one where the luger moves very little and doesn’t hit a wall!

For those of us doing program evaluation in the world of social, educational, health, and policy interventions, we might envy such precise measurements, but the work we do is different in a number of ways. Precision of measurement must be judged within the context of evaluation. First, we have no singular, unambiguous outcomes to measure. Our outcomes are constructs, ones that depend for their definition on values and ideologies. For example, poverty reduction might be an agreed-upon outcome, but how it is conceptualized is quite elastic. And poverty reduction is likely conflated with other constructs like food security or affordable housing. Second, measures used in evaluation are not like time. We have no high-precision outcome measure analogous to time in luge competitions, in large part because of the ambiguity of our outcomes. And last, we seldom want to give up investigating process and focus solely on outcomes. In the social world, how we attempt to ameliorate problems is an essential component of the quality of those efforts… outcomes matter to be sure, but getting to outcomes matters as much, and sometimes more.

Evaluators (and figure skating judges) should be impartial

Although figure skating is still one of the most popular Olympic sports, it has lost some of its romance and charm, what with Tonya Harding’s henchmen whacking Nancy Kerrigan’s knee and the ongoing real and alleged buying and selling of judging.

We were all familiar with the 6-point grading scale used in figure skating, scrapped after the cheating scandals at the 2002 Olympics in Salt Lake City. The old 6-point scale required each judge to publicly give a grade to a skating performance, and the judges’ scores have been synthesized in a number of different ways over the years.

The new evaluation system, the ISU’s International Judging System, took effect in 2005. It breaks the performance into elements (determined by a technical judge) and uses a computerized tabulation, a primary function of which is to make the judges’ grading anonymous. Low and high scores are discarded and the remaining scores averaged. It’s a complicated evaluation system… many criteria, use of video playback to analyze the technical elements, checks for extreme errors in judging, anonymous judging, and so on. It isn’t clear that this new system is better.
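For the evaluation-minded, the aggregation step itself is simple enough to sketch. Here is a minimal illustration (in Python) of discarding the low and high scores and averaging the rest; the function name, the example panel of marks, and the choice to trim exactly one score from each end are my illustrative assumptions, not the ISU’s specification.

```python
def trimmed_panel_score(scores):
    """Drop the single lowest and highest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim the low and high")
    trimmed = sorted(scores)[1:-1]       # discard one low and one high score
    return sum(trimmed) / len(trimmed)   # average what remains

panel = [7.25, 7.50, 7.75, 7.75, 8.00, 8.25, 9.00]   # hypothetical marks from seven judges
print(round(trimmed_panel_score(panel), 2))          # -> 7.85
```

Note that because judge identities are nowhere in that calculation, the sketch also mirrors the anonymity feature: you can audit the arithmetic, but not who gave which mark.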

At the heart of the judging issues in figure skating is an important evaluation issue: impartiality. Even though judges’ scores are anonymous, which many agree has compromised accountability and transparency, judges are selected by nations, and so nationalist favoritism may still be at play. Eric Zitzewitz, a Dartmouth economist, analyzed judging data and found that the chance judges give higher marks to skaters from their own country is now about 20 percent greater than under the 6.0 system.

How can impartiality in evaluation be fostered? First, those doing the evaluations ought to be accountable for the justification of their judgements. That means they are known, and there needs to be transparency in the evaluation process ~ what is the evidence and how has it been synthesized into an evaluative claim? This is a feature of meta-evaluation and isn’t much more than expecting that evaluations should be auditable. But impartiality requires more than transparency; it also requires fairness, as well as integrity and honesty (one of AEA’s guiding principles). What we mean by impartiality is quite complex and the matter won’t be resolved here, but figure skating judging sure reminds us of the importance of minding this matter in our practice.

Olympic judges shouldn’t cheat, neither should evaluators

This is a pretty easy take-away message, and figure skating is, not surprisingly, the sport to deliver it. Figure skating might be one of the roughest non-contact sports there is. Cheating by judges and skaters attacking other skaters off the ice are legendary. Judging scandals have resulted in a revised evaluation system that most would suggest isn’t much of an improvement (more about that in another post). To say that judging in figure skating has credibility problems is an understatement.

So, it’s not surprising (even if it isn’t true) that as the competition begins there are rumours that the Russian and US judges are colluding to squeeze Canada out of any medals. As reported in the Globe and Mail, “The allegation implies the pact would see the U.S. judge dish out favourable marks to Russia in the team event, where the U.S. is not a contender for the podium, in exchange for the Russian judge boosting the scores for Americans Meryl Davis and Charlie White in the ice dance.” This sort of collusion harkens back to the 2002 Salt Lake City Olympics, where the Canadian pairs team lost the gold to Russia and the French judge Marie-Reine Le Gougne eventually revealed she was pressured by the French, with the influence of a Russian mobster, to award the Russians high marks in exchange for similar treatment for France’s ice dance team. (For a quick summary, click here.) So yeah, rumour or truth, the fact that it’s happened before lends just a little weight to the “current” collusion accusations.

Most evaluators aren’t in a position to collude in quite the same way as these Machiavellian figure skating judges, but the advice ~ do not cheat ~ still holds. The cheating might take a different form… like designing an evaluation you know will make the evaluand look like a failure. The best (meaning most egregious and obvious) example of this that comes to mind is Charles Murray’s evaluation of PUSH/Excel in the 1980s. Designing an evaluation that some have contended was inappropriate and doomed the program before the evaluation began is cheating. Rigging the evaluation through a priori manipulation of the means for judging, whether in figure skating or program evaluation, just isn’t what we should do!

Olympic controversy: slopestyle boarding ~ it’s all about the holistic scoring!

I introduced this series of posts by highlighting the evaluation of snowboarding… but there are multiple events within snowboarding, and they do not all use the same evaluation strategy. While many of the specific events use a component evaluation strategy (separate judges looking at different parts of the athletes’ performance), the slopestyle event uses a holistic evaluation strategy: each of six judges gives a grade from 1 to 100, considering a range of features of the run (including things like creativity, difficulty, execution of tricks, and landings), but it is the overall impression that is the primary focus.

Yesterday’s first round of slopestyle boarding introduces us to a number of evaluation issues, but let’s focus on one: holistic scoring isn’t transparent and justifying the evaluative claim can be dodgy.

When top ranked Canadian slopestyler Mark McMorris received a score of 89.25 (which put him 7th) his response was: “It’s a judged sport; what can you do?” Canadian team coach Leo Addington repeated this perspective: “It’s a judged sport, and they saw what they saw and they put down what they thought.” He went on: “It’s always hard to tell without putting each run side by side, and the judging has many criteria – execution, amplitude, use of force, variety, progression. All those things are included in their thoughts … and they’re judging the entire run, completely, all those little things that sometimes we miss or don’t miss. It’s really hard to tell unless you study each one in slow motion all the way through.”

Holistic scoring is common in many evaluation contexts (assessments of student writing, lots of program evaluation) and expert judges (like teachers) assert they know a good performance when they see one without necessarily dissecting the performance. But it is more difficult to justify a holistic score and more difficult to challenge its veracity.
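To make the contrast concrete, here is a toy sketch of the two strategies ~ a component panel whose separate marks get combined, and a holistic panel where each judge gives one overall-impression mark on the 1–100 scale. I’m assuming simple averaging in both cases, which is my simplification rather than the official procedure, but it shows the transparency problem: the holistic number is easy to compute and hard to unpack.

```python
def component_score(aspect_marks):
    """Combine separate judges' marks on separate aspects (assumed: simple average)."""
    return sum(aspect_marks.values()) / len(aspect_marks)

def holistic_score(overall_marks):
    """Average one overall-impression mark (1-100) from each judge (assumed: simple average)."""
    return sum(overall_marks) / len(overall_marks)

# The component number can be traced back to its parts; nothing in the holistic
# 87.5 below tells you how creativity, difficulty, execution or landings were weighed.
print(component_score({"moves": 8.0, "amplitude": 7.5, "rotations": 8.5, "impression": 8.0}))  # -> 8.0
print(holistic_score([89.25, 86.0, 88.5, 85.0, 90.0, 86.25]))                                  # -> 87.5
```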

Coach Addington’s suggestion that each run be judged in slow motion, rather than as it actually occurs, is interesting, although misguided. Of course we see things differently in slow mo (that’s what the sports replay is all about), but that isn’t what is being judged in most sports… what is being judged is the actual, authentic performance, and even when replays show an error in judgement (say, about a penalty in hockey or a ball/strike call in baseball) that judgement is mostly not overturned. So the justification for a holistic score can’t be that you change the evaluand in order to make it clearer how you arrived at the judgement.

So how can holistic scoring be improved and justified? Slopestyle is a very recent “sport,” and so the collective expertise about what counts as a quality performance isn’t well formulated; one imagines that over time the quality of the judging will improve… there will be higher levels of agreement among judges, and among judges, coaches and athletes. In fact, in the slopestyle instance, the coaches and judges do have relationships that provide for learning from each other. Again, quoting Coach Addington: “We [coaches and judges] can sit down and discuss and say, ‘What did you see, what did we see?’ Maybe we missed something, and we learn from it and that’s how it’s evolving. They learn our perspective and we learn their perspective.” The media has mistakenly conjured up an image of fraternization between judges and coaches, misunderstanding that evaluations that are fair, transparent and justifiable necessarily depend on just such conversations. Holistic scoring approaches can only get better with the development of greater expertise through experience ~ knowing what that ideal type, the strong overall impression, looks like.

NOTE: For the rest of the snowboarding world and media the real story in snowboarding has been US snowboarder Shaun White’s withdrawal from the slopestyle competition.

Bring on the Olympics and learn about evaluation

The Olympics are a rich learning opportunity. We can learn about global politics, “celebration capitalism”, gender identity, gender politics, fascism ~ to name just a few analytic frames for the Olympic Games. We can also learn a great deal about evaluation from the Olympics. While in graduate school, I took an evaluation class with Terry Denny, and we tried to understand educational evaluation by investigating how evaluation was done in other contexts. It was fun and instructive to consider how wine tasting, diamond grading, dog trials & shows, and, yes, judging sports might help us to think more carefully and creatively about evaluating educational programs.

So the Olympics give us a peek at how evaluation is done within many specific sports. And they aren’t all the same!

For example, judging the snowboard halfpipe involves five judges grading each snowboarder’s run on a scale of 0.1 to 10, with deductions for specific errors. The judges have specific components of the run to judge: one judge scores the standardized moves, another the height of maneuvers, one the quality of rotations, and two the overall impression. There are bonus points for really high maneuvers… an additional point is given for every additional 30 centimeters the competitor reaches above the lip of the pipe.

Falls and other mistakes lead to deductions. The format for point deduction in halfpipe is as follows:
0.1–0.4 for an unstable body, flat landing, or missed airs
0.5–0.9 for using hand for stability
1.0–1.5 for minor falls or body contact with the snow
1.6–1.9 for complete falls
2.0 for a complete stop

Note that this is not the same system used in judging all snowboarding… the World Snowboard Tour uses a different system.
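The arithmetic of a single judge’s mark is easy to sketch from the description above: a base mark on the 0.1–10 scale, minus the listed deductions, plus the amplitude bonus. In this sketch the deduction values (the upper end of each range), the function name, and the way the pieces are combined are my illustrative assumptions, not the official scoring procedure.

```python
# Upper end of each deduction range from the list above (my simplification:
# real judges presumably pick a value within the range).
DEDUCTIONS = {
    "unstable_body": 0.4,   # 0.1-0.4 unstable body, flat landing, or missed airs
    "hand_drag": 0.9,       # 0.5-0.9 using a hand for stability
    "minor_fall": 1.5,      # 1.0-1.5 minor falls or body contact with the snow
    "complete_fall": 1.9,   # 1.6-1.9 complete falls
    "complete_stop": 2.0,   # 2.0 a complete stop
}

def judge_score(base_mark, errors, extra_height_cm=0.0):
    """One judge's mark: base mark, minus deductions, plus one point per extra 30 cm of amplitude."""
    score = base_mark - sum(DEDUCTIONS[e] for e in errors)
    score += extra_height_cm / 30.0      # height bonus described in the post
    return max(0.1, min(score, 10.0))    # keep the result within the 0.1-10 scale

# Example: an 8.5 base run with one minor fall and 30 cm of extra amplitude -> 8.0
print(judge_score(8.5, ["minor_fall"], extra_height_cm=30))
```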

So over the next couple of weeks I’ll be posting about what evaluation practitioners and theorists might possibly learn from the judging fest in Sochi.

an organic, evolving definition of evaluation

Perhaps a step closer to becoming a discipline: the American Evaluation Association’s project to define evaluation might signal that we are getting down to the fundamental ideas in our field. A committee has developed a definition that its chair, Michael Q. Patton, describes as “a living document, ever being updated and revised, but never becoming dogma or an official, endorsed position statement.” Bravo to all for this initiative!

The open-access, participatory strategy is an interesting and forward-thinking one, and I will be curious to see if and how the statement changes over time. My prediction is that it won’t change much. The statement as it is pretty much captures what anyone would say in an introductory evaluation course, but we shall see.

I think, however, there are a couple of key details missing from this definition… details that might bring clarity about the foundations of evaluation. As the definition now stands, it focuses primarily on evaluation practice and less so on the discipline of evaluation. The initial definition is what we all say when we explain what evaluation is:

Evaluation is a systematic process to determine merit, worth, value or significance.

The string of descriptors about what evaluation determines is important, and the terms are not the same. The definition provides no guidance about what the differences are and why we provide this string in our definition. What is the difference between merit and worth, and how are those different from value or significance? This is not a trivial matter, and lack of understanding about these distinctions sometimes gives evaluation a bad name. For example, when an evaluation focuses on determining the worth of an evaluand and the evaluand is found wanting, there is often a hue and cry because that same evaluand is simultaneously meritorious.

The second detail that is missing is the logic of how we get to those judgements of merit, worth, value and significance. The definition says that evaluation is a “systematic process” but provides no hint of what makes evaluation systematic. Perhaps this is one of those contentious areas Patton describes when introducing the statement: “There was lots of feedback, much of it contradictory.” But from the statement we cannot know whether the committee talked about including details about what makes evaluation systematic and couldn’t come to agreement, or whether this was never discussed in the first place. Perhaps being systematic has two meanings that get entangled… we use models/approaches in evaluating that provide guidance about how to do evaluation (UFE, RCT, participatory, and so on) AND there is a logic to thinking evaluatively that is embedded in all models/approaches to evaluation. There is no need to include the former in a definition of evaluation, but there is a need to include the latter.

Michael Scriven has provided the grounding for articulating the logic of evaluation, Deborah Fournier has done considerable work on articulating what that logic looks like in practice (that is, how it is manifest in various evaluation approaches/models), and both Michael Scriven and Ernie House have tackled the specific issue of synthesis in evaluation. This logic is at the heart of what makes evaluation systematic and I’d like to see this in this definition. (For a quick introduction to these ideas, check out the entries in the Encyclopedia of Evaluation by these authors.)

As an organic, evolving definition of evaluation, perhaps these are components that will still be developed and included.

The “evaluate that” campaign

I am totally sympathetic with teachers’ reactions to the simplistic, pedestrian ways of evaluating the quality of their work, the quality of student work, and the quality of schools. That efforts are made to reduce complex evaluands to simple ones is a serious problem. The “EVALUATE THAT” campaign identifies important aspects of teaching and education that aren’t measured and therefore aren’t evaluated… things like compassion, empathy, cooperation… the emotional, interactional content of the work of teaching. [Click here for the heartfelt remarks of one teacher.] The campaign (started by BadAss Teachers, who created the meme shown in this post) also suggests these things can’t be measured and can’t be evaluated. Stories are being aggregated with the use of the Twitter hashtag #evaluatethat.

Whether you are a teacher, student, parent, administrator… tell us, in a brief sentence or two, YOUR moments of teaching or learning (yours or someone else’s) that was never formally measured but made an impression on you. These ‘bites’ of reality do not have to be all gloriously positive, the only criteria – true, real and not measured (no hypotheticals please).

We are collecting these via Twitter by using #evaluatethat hashtag in each relevant tweet. This will ensure all of these are kept in one place and can be easily seen by all.

The hashtag has taken on a bit of a f*&k-you tone… you can sort of imagine the tweeter grabbing their crotch while they shout “EVALUATE THAT.” Even so, the collection of stories is an important reminder of the complexity of teaching and schooling… a complexity that needs to be incorporated into judgements of the quality of teaching, learning and schooling. While it may be very difficult to measure such things as compassion and empathy, that’s not a reason to step away, but all the more reason to find sound ways of incorporating those behaviors and actions into evaluations.

a blog post about whether I should be blogging…

The International Studies Association (political science folks) is discussing a proposal to ban Association journal editors, editorial board members and anyone associated with its journals from blogging. Here is the language:

“No editor of any ISA journal or member of any editorial team of an ISA journal can create or actively manage a blog unless it is an official blog of the editor’s journal or the editorial team’s journal,” the proposal reads. “This policy requires that all editors and members of editorial teams to apply this aspect of the Code of Conduct to their ISA journal commitments. All editorial members, both the Editor in Chief(s) and the board of editors/editorial teams, should maintain a complete separation of their journal responsibilities and their blog associations.”

Singling out blogs, but not other social media, letters to the editor, or op-eds, the ISA asserts that blogging is somehow unseemly, that it is a kind of discourse that is not proper professional behavior, and that if one blogs one is likely to sink into some abyss, losing a grasp on one’s dignity and respectability.

At best this proposal is quaint, a desire for a past when professors stayed in their offices and wrote for and engaged with their peers through narrow publication channels (like the ISA journals). At worst, this is a draconian effort to challenge academic freedom, to squelch professors’ engagement in public life, and to control access to knowledge. The silliness of this proposal does little to diminish its threat to the civic engagement of scholars, both the activist-minded and those who simply understand that the world is bigger than the university campus.