
Ranking ~ who’s the best now that the Olympics are over?

Wherever in the world you were watching the Olympics from, there would have been a nationalistic bias in what you saw and a constant counting and recounting of medals to assert the superiority of your country over all others (you hope), or at least over some other countries. That Russia, the host country, earned the most medals ~ and especially the most gold and silver medals ~ declares Russia simply #1, best in the world, and highly accomplished in amateur sports. Russia is followed by the USA, Norway, Canada, and the Netherlands in terms of national prowess in winter sports.

This ranking is based on the number of medals received, regardless of the level of medal. Naturally, it is the media that creates these rankings (not the IOC), and this rather simple strategy might distort who is the best (if this notion of the best has any construct validity, but that’s another discussion). It seems fairly obvious that getting the gold is better than getting the silver, and that both trump getting a bronze medal. If we weighted the medal count (3 points for gold, 2 for silver, and 1 for bronze), would the rankings of countries change? They do, a bit, and there are two noticeable changes. First, Russia is WAY better than even the other top five ranked countries, with a score of 70 compared to the next highest scoring country, Canada (which has moved from fourth to second place), with a score of 55. Perhaps less profound, but still an interesting difference: although overall the USA had two more medals than Norway, their weighted scores are identical at 53.
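To make the arithmetic concrete, here is a minimal sketch in Python of the numeric weight & sum strategy, using the Sochi 2014 medal counts as they stood at the time (these reproduce the weighted scores above):

```python
# 2014 Sochi medal counts (gold, silver, bronze) for the top five
# countries by total medals, as reported at the time.
MEDALS_2014 = {
    "Russia":      (13, 11, 9),
    "USA":         (9, 7, 12),
    "Norway":      (11, 5, 10),
    "Canada":      (10, 10, 5),
    "Netherlands": (8, 7, 9),
}

def weighted_score(gold, silver, bronze, weights=(3, 2, 1)):
    """Numeric weight & sum: 3 points per gold, 2 per silver, 1 per bronze."""
    return weights[0] * gold + weights[1] * silver + weights[2] * bronze

# Rank by total medals (the media's usual table) ...
by_total = sorted(MEDALS_2014.items(), key=lambda kv: sum(kv[1]), reverse=True)
# ... and by the weighted score, to see how the order shifts.
by_weighted = sorted(MEDALS_2014.items(), key=lambda kv: weighted_score(*kv[1]), reverse=True)

for country, (g, s, b) in by_weighted:
    print(f"{country:12s} total={g + s + b:2d}  weighted={weighted_score(g, s, b)}")
```

Running this shows Canada (total 25, weighted 55) leapfrogging both the USA (28, 53) and Norway (26, 53), exactly the reshuffling described above.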

But wait. The Olympics are held every four years, and one might expect relative stability in the rankings. Not so. The top six ranked countries in 2010, when the Olympics were held in beautiful Vancouver, BC (no bias on my part here), were the USA, Germany, Canada, Norway, Austria, and Russia. Russia just squeaks into the top six.

So two things to note: 1) using the weighted scoring suggested above, the order doesn’t change and we get a similar magnitude of performance [USA score = 70; Germany = 63; Canada = 61; Norway = 49; Austria = 30; Russia = 26], and 2) something miraculous happened in Russia in the last four years! Russia’s weighted score went from 26 in 2010 to 70 in 2014.
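For the curious, the same calculation run against the Vancouver 2010 counts (taken from the official tallies; they reproduce the bracketed scores above) confirms that in that year the weighted order simply tracks the total-medal order:

```python
# Vancouver 2010 counts (gold, silver, bronze); the weighted scores
# reproduce the bracketed figures in the text.
MEDALS_2010 = {
    "USA":     (9, 15, 13),
    "Germany": (10, 13, 7),
    "Canada":  (14, 7, 5),
    "Norway":  (9, 8, 6),
    "Austria": (4, 6, 6),
    "Russia":  (3, 5, 7),
}

for country, (g, s, b) in MEDALS_2010.items():
    print(f"{country:8s} total={g + s + b:2d}  weighted={3 * g + 2 * s + b}")
```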

Looking across 2006, 2010, and 2014 you get a different picture, with the countries that appear in the top six changing and the stability of the weighted ranking fluctuating notably. There are a couple of take-away messages for evaluators. The simple one is to be cautious when using rankings. There are quite specific instances when evaluators might use ranking (textbook selection, admissions decisions, and research proposal evaluation are examples), and a quick examination of how that ranking is done illustrates the need for thoughtfulness in creating algorithms. Michael Scriven and Jane Davidson offer an alternative to the numeric weight & sum strategy I have used here ~ a qualitative weight & sum technique ~ and it is often a great improvement. When we rank things we can too easily confuse rankings with grades; in other words, the thing that is ranked most highly is defined as good. In fact, it may or may not be good… it’s all relative. The most highly ranked thing isn’t necessarily a good thing.

Fidelity is over-rated… or understanding “hurry, hurry hard”

I couldn’t get through this project of learning about evaluation from the Olympics without a mention of curling. Born on the Canadian prairies, I curl! We curled during phys ed class, and as a young adult it was an important context for socializing. Curling is a polite game; winning is important, but good sportsmanship is more important ~ players are on their honour and there are no judges or referees. And what other sport has a tradition of all the competitors getting together after the match for rounds of drinks, in what is called “broomstacking”? Maybe it’s an easy game to make fun of, but try it and you’ll discover there’s more to it than it seems.

Curling is a sport with many skills that can be isolated, practiced, and mastered: drawing to the button, peeling off a guard, a take-out with a roll behind a guard, or throwing hack weight. And there’s learning to know when to sweep and yell at the top of your lungs, “hurry, hurry hard!” Countries relatively new to the sport focus on these skills and demonstrate extraordinary abilities of execution, which is important to winning. But winning the game also requires something more elusive. These teams often confuse fidelity with quality, an all too common mistake in program evaluation. Being able to execute shots with precision is necessary, but not sufficient, to win ~ in either curling or programs.

Strategy is also key in curling, and it is not so easily mastered through repetitious practice of isolated skills. Curling has been called “chess on ice.” There are aggressive and conservative strategies, and which to use depends in large part on the context ~ factors such as the ice, skill levels, whether you have the hammer (the last rock thrown), and so on. Strategy in program delivery, especially on-the-ground interpretations and practice, also depends on the context, and practitioners use their strategic knowledge to adjust interventions to achieve maximum success. This strategic adjustment must often trade away fidelity to the intervention plan or map, and too frequently this is seen as a failure. Program evaluations sensitive to both programmatic intentions and local variation are more comprehensive and meaningful for understanding how and why programs work, or don’t.

Precision measurement ~ sometimes it matters, like in Luge, but not most of the time

In some Olympic sports, thousandths of a second matter. In the men’s doubles luge the difference between the gold and silver medals was about half a second (0.522 of a second, to be exact). Lugers compete against a timer, and luge is probably one of the most precisely timed sports in the world. Just to be clear, luge specifies a base weight (90 kg for individuals, 180 kg for doubles), and lugers may add weights to their sleds so that each run is precisely the same weight; skill in maneuvering the track is what accounts for differences in time. Luge is a sport that is judged entirely on the outcome ~ the shortest time. How you get there doesn’t matter, other than that it is understood that following the “perfect line” is likely to get you to the finish line in the least amount of time. Still, in luge nuance is critical, and that nuance often escapes even the knowledgeable commentators who attempt to give spectators a sense of what is happening during a run. Mostly it comes down to this: a better run is one where the luger moves very little and doesn’t hit a wall!

For those of us doing program evaluation in the world of social, educational, health, and policy interventions, we might envy such precise measurement, but the work we do is different in a number of ways, and precision of measurement must be judged within the context of evaluation. First, we have no singular and unambiguous outcomes to measure. Our outcomes are constructs, ones that depend for their definition on values and ideologies. For example, poverty reduction might be an agreed-upon outcome, but how it is conceptualized is quite elastic, and poverty reduction is likely conflated with other constructs like food security or affordable housing. Second, measures used in evaluation are not like time. We have no outcome measure with precision analogous to time in luge, in large part because of the ambiguity of our outcomes. And last, we seldom want to give up investigating process and focus solely on outcomes. In the social world, how we attempt to ameliorate problems is an essential component of the quality of those efforts… outcomes matter, to be sure, but getting to outcomes matters as much, and sometimes more.

Evaluators (and figure skating judges) should be impartial

Although figure skating is still one of the most popular Olympic sports, it has lost some of its romance and charm, what with Tonya Harding’s henchmen whacking Nancy Kerrigan’s knee and the ongoing real and alleged buying and selling of judging.

We were all familiar with the 6-point grading scale used in figure skating, scrapped after the cheating scandals at the 2002 Olympics in Salt Lake City. The old scale required each judge to publicly give a grade to a skating performance, and the synthesis of the judges’ scores has been done in a number of different ways over the years.

The new evaluation system, the ISU (International Skating Union) Judging System, took effect in 2005. It breaks the performance into elements (determined by a technical judge) and uses a computerized tabulation, a primary function of which is to make the judges’ grading anonymous. Low and high scores are discarded and the remaining scores averaged. It’s a complicated evaluation system… many criteria, use of video playback to analyze the technical elements, checks for extreme errors in judging, anonymous judging, and so on. It isn’t clear that this new system is better.
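As a rough illustration (not the full ISU protocol, which involves many more steps), the discard-and-average step described above might look like this, with a hypothetical panel of scores:

```python
import random

def trimmed_mean(scores):
    """Drop the single lowest and highest scores, average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least 3 scores to trim both ends")
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)

# Hypothetical panel of judges' scores; shuffling stands in for the
# anonymization step (a score can't be traced back to a judge).
panel = [8.75, 9.00, 8.50, 9.25, 7.75, 8.75, 9.00]
random.shuffle(panel)
print(f"panel score: {trimmed_mean(panel):.2f}")  # 8.80
```

Trimming blunts the effect of one wildly high or low mark, but as the next paragraph suggests, it does nothing about a judge who shades scores consistently rather than extremely.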

At the heart of the judging issues in figure skating is an important evaluation issue: impartiality. Even though judges’ scores are now anonymous, which many agree has compromised accountability and transparency, judges are selected by nations, and so nationalist favoritism may still be at play. Eric Zitzewitz, a Dartmouth economist, analyzed judging data and found the chance that judges give higher marks to skaters from their own country is now about 20 percent greater than it was under the 6.0 system.

How can impartiality in evaluation be fostered? First, those doing the evaluations ought to be accountable for the justification of their judgements. That means they are known, and there needs to be transparency in the evaluation process ~ what is the evidence, and how has it been synthesized into an evaluative claim? This is a feature of meta-evaluation and isn’t much more than expecting that evaluations should be auditable. But impartiality requires more than transparency; it also requires fairness, as well as integrity & honesty (one of AEA’s guiding principles). What we mean by impartiality is quite complex and the matter won’t be resolved here, but figure skating judging sure reminds us of the importance of minding this matter in our practice.

Olympic judges shouldn’t cheat ~ neither should evaluators

This is a pretty easy take-away message, and figure skating is, not surprisingly, the sport to deliver it. Figure skating might be one of the roughest non-contact sports there is. Cheating by judges and skaters attacking other skaters off the ice are legendary. Judging scandals have resulted in a revised evaluation system that most would suggest isn’t much of an improvement (more about that in another post). To say that judging in figure skating has credibility problems is an understatement.

So it’s not surprising (even if it isn’t true) that as the competition begins there are rumours that the Russian and US judges are colluding to squeeze Canada out of any medals. As reported in the Globe and Mail, “The allegation implies the pact would see the U.S. judge dish out favourable marks to Russia in the team event, where the U.S. is not a contender for the podium, in exchange for the Russian judge boosting the scores for Americans Meryl Davis and Charlie White in the ice dance.” This sort of collusion harkens back to the 2002 Salt Lake City Olympics, where the Canadian pairs team lost the gold to Russia and the French judge, Marie-Reine Le Gougne, eventually revealed she had been pressured by her own federation, allegedly under the influence of a Russian mobster, to award the Russians high marks in exchange for similar treatment for France’s ice dance team. So yeah, rumour or truth, the fact that it’s happened before lends just a little weight to the “current” collusion accusations.

Most evaluators aren’t in a position to collude in quite the same way as these Machiavellian figure skating judges, but the advice ~ do not cheat ~ still holds. The cheating might take a different form… like designing an evaluation you know will make the evaluand look like a failure. The best (meaning most egregious and obvious) example that comes to mind is Charles Murray’s evaluation of PUSH/Excel in the 1980s. Designing an evaluation that some have contended was inappropriate, and that doomed the program before the evaluation began, is cheating. Rigging the evaluation through a priori manipulation of the means for judging, whether in figure skating or program evaluation, just isn’t what we should do!