Problems with grading rubrics for complex assignments

In an earlier post I discussed a paper by D. Royce Sadler on how peer marking could be a means for students to learn how to become better assessors themselves, of their own and others’ work. This could not only allow them to become more self-regulated learners, but also fulfill roles outside of the university in which they will need to evaluate the work of others. In that essay Sadler argues against giving students preset marking criteria to use to evaluate their own work or that of other students (when that work is complex, such as an essay), because:

“Quality” is more of a global concept that can’t easily be captured by a set of criteria, as it often includes things that can’t be easily articulated.
As Sadler pointed out in a comment to the post noted above, having a set of criteria in advance predisposes students to look for only those things, and yet in any particular complex work there may be other things that are relevant for judging quality.
Giving students criteria in advance doesn’t prepare them for life beyond their university courses, where they won’t often have such criteria.

I was skeptical about asking students to evaluate each other’s work without any criteria to go on, so I decided to read another one of his articles in which this point is argued for more extensively.

Here I’ll give a summary of Sadler’s book chapter entitled “Transforming Holistic Assessment and Grading into a Vehicle for Complex Learning” (in Assessment, Learning and Judgement in Higher Education, Ed. G. Joughin. Dordrecht: Springer, 2009. DOI: 10.1007/978-1-4020-8905-3_4).

[Update April 22, 2013] Since the above is behind a paywall, I am attaching here a short article by Sadler that discusses similar points, and that I’ve gotten permission to post (from both Sadler and the publisher): Are we short-changing our students? The use of preset criteria in assessment. TLA Interchange 3 (Spring 2009): 1-8. This was a publication from what is now the Institute for Academic Development at the University of Edinburgh, but these newsletters are no longer online.

Note: this is a long post! That’s because it’s a complicated article, and I want to ensure that I’ve got all the arguments down before commenting.

Sadler distinguishes between two kinds of assessment: analytic grading and holistic grading. One of the main arguments of the essay is that analytic grading has significant problems when used for certain kinds of assignments, enough to suggest we should not be using it in those contexts. The other part of the argument is that we should be using peer assessment to help students learn how to use holistic methods in evaluating their own and others’ works.

The kinds of assignments Sadler is focused on, the ones where analytic grading is problematic, are “divergent” tasks: these could have multiple responses that are quite different but still of high quality, and they “provide opportunities for learners to demonstrate sophisticated cognitive abilities, integration of knowledge, complex problem solving, critical reasoning, original thinking, and innovation” (47). Those are precisely the kinds of assignments I often give in both Philosophy and Arts One courses, when I ask students to write essays.

Analytic and holistic grading

One engages in analytic grading when one evaluates work using separate judgments on various criteria (whether given by the instructor, negotiated with students, or devised by students themselves). The judgments on each criterion are “combined using a rule or formula, and converted to a grade” (45). Clearly this would be the sort of thing one does when using a rubric that has points attached to each part of the rubric and in which the final grade is determined by adding up the points.
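To make the “rule or formula” idea concrete, here is a minimal sketch in Python of what a points-based analytic scheme amounts to. The criterion names, weights, and grade cutoffs are all hypothetical, made up for illustration; they are not taken from Sadler’s chapter or from any actual rubric.

```python
# Hypothetical criteria with percentage weights (they sum to 100).
# These are illustrative only, not drawn from Sadler or a real rubric.
WEIGHTS = {"thesis": 30, "argument": 40, "structure": 20, "writing": 10}

# Hypothetical cutoffs for converting a 0-100 score into a letter grade,
# listed from highest to lowest.
CUTOFFS = [(80, "A"), (68, "B"), (55, "C"), (50, "D")]


def analytic_grade(scores):
    """Combine separate per-criterion scores (each 0-100) by a fixed
    weighted sum, then convert the total to a letter grade."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
    for minimum, letter in CUTOFFS:
        if total >= minimum:
            return total, letter
    return total, "F"


# One made-up essay, judged criterion by criterion.
essay = {"thesis": 90, "argument": 60, "structure": 70, "writing": 80}
total, letter = analytic_grade(essay)
print(total, letter)
```

With these invented numbers the essay totals 73 and receives a B. The point of the sketch is that the grade is fixed entirely by the separate judgments and the combining rule, whatever overall impression the work makes on the reader.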

On a personal note, I have resisted going this route. I have used rubrics extensively, but mainly to let students know in advance the sorts of things they need to try to put into their essays, and to enable me to organize my comments so they can see which sorts of things they need to work on most (given the prevalence of comments in each category). I have also used rubrics as a check to help with fairness–it helps me make sure I don’t overlook one category in someone’s paper, while focusing on it in another. I feel like it helps me be more consistent.

However, I have refused to go the route of assigning marks or points to each category and adding up a grade that way. In fact, I have explicitly said on my rubrics that students are not to think of the rubrics and categories as providing some formula out of which they could or I could calculate a grade. I have said that marking essays is too complicated for that sort of thing.

For reference, and in case anyone is interested, here is the latest iteration of the grading rubric I use for philosophy essays: HendricksMarkingRubric-Jan2012

Sadler notes later in the essay, however, that analytic grading could also take place using a rubric without specific points or weights assigned, where an assessor picks a single “cell” in the rubric for each criterion or standard that best fits the work (52). That isn’t quite what I do, either. I actually tie each of my comments, as much as possible, to one of the “cells” in the rubric, so as to say, e.g., here the essay is doing something in the B-range for “structure.” But I don’t assign a single mark or cell for each criterion to the essay.

Holistic grading, on the other hand, occurs when an instructor judges a work as a whole and provides a “global judgment.”

Although the teacher may note specific features that stand out while appraising, arriving directly at a global judgment is foremost. Reflection on that judgment gives rise to an explanation, which necessarily refers to criteria. (46)

In holistic grading, then, the criteria come afterwards, as it were, when one explains to oneself and the student the judgment made. As Sadler puts it, holistic grading can be characterized as “impressionistic or intuitive” (46). To summarize the difference between holistic and analytic grading, Sadler says:

Holistic grading involves appraising student works as integrated entities; analytic grading requires criterion-by-criterion judgments. (48)

Reflecting on my own practice again, using “impressionistic or intuitive” judgments is what I used to do before using rubrics. Or rather, I still do it while using rubrics, but less. I would read an essay, give comments, and at the end find myself thinking that the essay as a whole deserved a certain grade. I was uncomfortable with this, though–where was that judgment coming from? Now I still do that sort of thing, but check it with the rubric–how many aspects of the essay are in the “A” range according to the rubric, how many in the “B” range, etc., and does this roughly correspond to the grade I’ve just impressionistically determined? This isn’t a formulaic sort of activity, as I don’t actually count and add, but it serves as a kind of check for me to make sure I’ve thought about all aspects of the essay (or rather, at least those on the rubric) before coming up with a grade.

Sadler points out later in the essay that there isn’t any reason to be uncomfortable with impressionistic judgments. This sort of holistic process is “rational, normal and professional” (59), as it is how judgment of complex works does and must work. Of course, one must have significant experience of various kinds of work in a genre, and works of various quality, to be able to come to such judgments well, as an expert. More on this below.

The supposed value of analytic grading

Sadler notes that analytic grading schemes have gained in popularity, and they “introduce formal structure into the grading process, ostensibly to make it more objective and thus reduce the likelihood of favouritism or arbitrariness” (48). He lists the various aspects of the rationale many have for analytic grading, including improving consistency and objectivity, making the grading process transparent to students, encouraging students “to attend to the assessment criteria during development of their work,” and providing feedback “more efficiently, with less need for the teacher to write extensive comments” (50-51).

These reflect why I moved to using grading rubrics, except that I’d add: helping students see what they need to improve. Students can get lost in comments, so having a rubric organizes feedback and pinpoints certain things they need to do next time (e.g., be sure to have an introduction to your essay with a clear thesis statement).

The problem with analytic grading

Despite the supposed benefits listed above, Sadler argues that analytic grading schemes “can, and for some student works do, lead to deficient or distorted grading decisions,” or grading anomalies (51). He focuses on two such anomalies.

I’ll combine the two anomalies somewhat here; both have to do with a mismatch between what an instructor thinks of a work globally and what sort of judgment would be suggested by using an analytic grading method. This is, actually, the first anomaly: when one finishes reading an essay, for example, one may have the sense that it is truly excellent, yet the rubric shows that it falls short in a number of ways and therefore wouldn’t appear so excellent on the rubric alone. The opposite can, of course, occur as well.

The second problem can occur when one finds that the above issue is due to a criterion being missing from one’s list. This seems like it could be easy to fix, right? Just add a new criterion to the rubric. But to do so and judge that work on the new criterion is problematic: it “would breach the implicit contract between teacher and student that only specified criteria will be used” (54).

These problems occur for several reasons:

There may be a significant amount of knowledge that goes beyond what can be expressed in words (here he cites Polanyi, 1962) (53).
Experts may process information to come up with judgments in complex assessment scenarios in ways that “do not necessarily map neatly onto explicit sets of specified criteria, or simple rules for combination” (here he cites Sadler, 1981) (53).
When specifying a set of criteria for assessing certain kinds of works, one has to choose from a larger set–there are many, many criteria that could be used for each kind of work, and to use them all would be unwieldy (if one could even specify them all, which might not be possible) (54).

I have experienced both of these problems, and have done what Sadler says some instructors do as a response to the first problem: trust the holistic impression and fudge the use of the rubric to fit the former. For the second problem, my response has been to simply note the reasons for the holistic judgment in separate comments on the essay, rather than relying on the rubric alone. This works, because I have explicitly stated on the rubric that it is not to be used to mechanically determine a mark, and that it can’t possibly cover all aspects of judgments on quality (my rubric states at the top, among other things: “Note that the statements below are not exhaustive for what may occur in each category, but serve as common examples”).

The irony of analytic grading

Sadler notes that analytic grading schemes are often used to make the grading process more transparent, yet the anomalies above are often hidden from students, so they get the impression they are getting the real story when they are not (55).

Now, of course, if one tells students in advance that the rubric isn’t the full story, and that some of the grading process remains subjective, due to the nature of having experience in the field and knowing what counts as good work, then this particular problem doesn’t seem so bad. But Sadler goes further than this remedy, which I have already implemented. And it keeps the values of disclosure and openness intact.

Holistic grading and peer assessment

He isn’t suggesting we go back to simply judging works impressionistically and leaving students without a lot of guidance as to how we got to those judgments. Indeed, he supports a combination of holistic and analytic grading:

To advocate that a teacher should grade solely by making global judgments without reference to any criteria is as inappropriate as requiring all grades to be compiled from components according to set rules. Experienced assessors routinely alternate between the two approaches in order to produce what they consider to be the most valid grade. (57)

But it’s more than that–we need to also “induct students into the art of making appraisals” themselves (56). To do so is to start “learners on the path towards becoming connoisseurs” (56), where connoisseurs or experts are able to recognize quality in particular cases even without being able to give a general definition of quality for those kinds of works, or without being able to give a set of criteria for quality that applies to all such works.

How to help students become connoisseurs

Clearly, peer evaluation and feedback are key. Three aspects of such activities are highlighted by Sadler: (1) students need to be exposed to a variety of works in the same genre as what they’ll be producing; (2) they need exposure to works in a wide range of quality; (3) they need exposure to responses to a variety of “assessment tasks” (57).

Sadler notes that students, as well as instructors, should be using both holistic approaches and analytic approaches to evaluation, focusing on the holistic assessments first and “only afterwards formulating valid reasons for them” (57). I assume this means formulating valid reasons that appeal to criteria that attach to those particular works, since as noted above, experts may not be able to formulate a set of criteria for all such works. This sounds right, as Sadler later goes on to discuss how students and instructors can come up with new criteria to add to their working set as they review more works (58). These new criteria can be shared amongst the class, he notes, but “not with a view to assembling a master list,” because one should help students to see the limitations in trying to develop general sets of criteria (58).

In another interesting move, Sadler suggests that a large amount of class time could be devoted to peer assessment activities. Students could be asked to do formative responses to particular tasks related to course content, and much of the class meeting times could be devoted to students reading and commenting on each other’s works. As Sadler puts it:

In this way, student engagement with the substance of the course takes place through a sequence of produce and appraise rather than study and learn activities. (59)

In the remaining section of the paper, Sadler discusses obstacles to implementing his suggestions, and ways to get around them. I won’t discuss those here, in the interest of not extending this blog post too much further.

My thoughts

I must admit I am warming to the idea of not providing a set of criteria for essays in advance as if they were the only things I look for when grading. Then again, I already state that the things on my rubric are not exhaustive, so I’m already moving in that direction. And Sadler notes in this article that “certain criteria may always be relevant” to a genre of works (59). He cites things like grammar, paragraph organization and logical development as examples for written essays. I like to think that the things I’ve put on my rubric are things that are “always relevant,” but I guess I’d need to think about that further. Is it absolutely critical that essays have a clear thesis statement at the end of the intro, and a conclusion that rounds out the essay (for example)? Could there be an A+ essay that doesn’t have these but is truly excellent in other ways?

What I haven’t been doing is working on helping students to become connoisseurs themselves. I do have some peer feedback in my philosophy courses, but usually students only do it once or twice, which may not be enough to really move them along this path (unless they get a lot in other courses as well, which I am not sure of). And I don’t encourage them to come up with their own criteria for quality, necessarily, but rather to use the rubric I’ve provided (at least in 1st and 2nd year courses). I guess I think they need guidance in the early years…how can they know what is a good philosophy essay if this is their first philosophy course? I am still unsure about that one.

Perhaps I could give them a pared down rubric, with just those things I do really think are always relevant, and then encourage them to come up with other standards or criteria and share them with the rest of the class, and talk about how complex assessment really is. I could also talk about peer assessment as a way to help them learn to see quality themselves. And I am very intrigued by the idea of having more peer assessment in class, using formative (ungraded) assignments.

What do you think?

Rubrics are popular; I have heard of their value in multiple professional development workshops. Do you think they might be stifling in the ways noted above? Is there anything in Sadler’s article you agree/disagree with? If you use rubrics yourself, do you think they’re valuable in ways not yet mentioned here?

Works cited

Polanyi, M. (1962). Personal knowledge. London: Routledge and Kegan Paul.

Sadler, D. R. (1981). Intuitive data processing as a potential source of bias in naturalistic evaluations. Educational Evaluation and Policy Analysis, 3(4), 25–31.

Problems with grading rubrics for complex assignments / You're the Teacher by chendric is licensed under a Creative Commons Attribution 4.0 CC BY
