Comparative judgement: the next big revolution in assessment?

Daisy Christodoulou, Director of Education at No More Marking, outlines why teachers should rethink how they assess, why they assess and, vitally, how much time they should spend doing it.

Marking writing reliably is hard. To understand why, try this thought experiment. Imagine that you have a mathematics exam paper: a simple paper of just 40 fairly straightforward questions, with one mark available for each question and no marks for method. Suppose I give that paper to a pupil and get them to complete it. If I then copied their answer script and gave it to a group of 100 maths teachers, I would expect all of those teachers to agree on the mark the script should be awarded, even if they had never met before or never discussed the questions on the paper.

Now take the same pupil, and imagine they have been asked to write a short description of the town where they live. Suppose again that we copy their script, distribute it to 100 teachers, and ask them to give the script a mark out of 40. It is far less likely that the teachers will all agree on the mark that script should be awarded. Even if they had all been trained on the mark scheme, and met in advance to discuss what it meant, it would be highly unlikely that they would then independently agree on the mark that one script deserved.

To a certain extent, this is to be expected. There is no one right answer to an extended writing question, and different people will have different ideas about how to weight the various aspects that make up a piece of writing. However, whilst we might accept that we will never get markers to agree on the exact mark, we surely do want them to agree on an approximate mark. We may not all agree that a pupil deserves exactly 20/40, but perhaps we can all agree that they deserve 20/40, plus or minus a certain number of marks. The larger this margin of error, the harder it is to work out what the assessment is telling us. Suppose, hypothetically, that the margin of error on this question was plus or minus 15. A pupil with 20/40 might have scored anywhere between 5 and 35! Large margins of error make it difficult to see how well a pupil is doing, and they make it even more difficult to see whether a pupil is making progress, because you then have to contend with the margin of error on two assessed pieces of work.

In order to know how well pupils are doing, and whether they are improving, we therefore need a method of assessing extended writing reliably. To see how we might arrive at one, let us first look at two reasons why extended writing is so difficult to mark at the moment.

First, traditional writing assessment often depends on absolute judgements. Markers look at a piece of writing and attempt to decide which grade best fits it. This may feel like the obvious thing to do, but in fact humans are very bad at making such absolute judgements. This is not just true of marking essays: it is true of all kinds of absolute judgement. For example, if you are given a shade of blue and asked to say how dark it is on a scale of 1 to 10, or given a line and asked to estimate its exact length, you will probably struggle. However, if you are given two shades of blue and asked to find the darker one, or two lines and asked to find the longer one, you will find the task much easier. Absolute judgement is hard; comparative judgement is much easier, but traditional essay marking works mainly on the absolute model.1

Second, traditional writing assessment depends on prose descriptions of performance, such as those found in mark schemes or exam rubrics. The idea is that markers can use these descriptions to guide their judgements. For example, one exam board describes the top band for writing in the following way:

  • Writing is compelling, incorporating a range of convincing and complex ideas
  • Varied and inventive use of structural features2

The next band down is described as follows:

  • Writing is highly engaging, with a range of developed complex ideas
  • Varied and effective structural features

It is not hard to see the kinds of problems such descriptors can cause. What is the difference between ‘compelling’ and ‘highly engaging’? Or between ‘effective’ use of structural features and ‘inventive’ use? Such descriptors cause as many disagreements as they resolve, because prose descriptors can be interpreted in a number of different ways. As Alison Wolf says, ‘One cannot, either in principle or in theory, develop written descriptors so tight that they can be applied reliably, by multiple assessors, to multiple assessment situations.’3

Comparative judgement offers a way of assessing writing which, as its name suggests, does not involve difficult absolute judgements, and which also reduces the reliance on prose descriptors. Instead of grading one essay at a time, comparative judgement asks the marker to look at a pair of essays and judge which one is better. The judgement is a holistic one about the overall quality of the writing; it is not guided by a rubric, and it can be made fairly quickly. If each marker makes a series of such judgements, an algorithm can combine them all and use them to construct a measurement scale.4 The approach is not new: the underlying model was developed in the 1920s by Louis Thurstone.5 What has changed is that, in the last few years, online comparative judgement engines have made it quick and easy for teachers to experiment with this method of assessment.
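To give a flavour of what such an algorithm does, here is a minimal sketch in Python. It fits a simple Bradley-Terry-style model, one common way of realising Thurstone's approach: each script is given a score, and the probability of one script being preferred to another depends on the gap between their scores. The function names, the toy data and the fitting method here are illustrative assumptions only, not a description of any particular engine.

  import math
  import random

  def fit_scale(judgements, n_scripts, iters=2000, lr=0.01):
      """Turn pairwise judgements into a score for each script.

      judgements is a list of (winner, loser) pairs of script indices.
      Higher fitted scores mean a script was consistently preferred.
      """
      scores = [0.0] * n_scripts
      for _ in range(iters):
          grads = [0.0] * n_scripts
          for winner, loser in judgements:
              # Probability the model currently gives to the observed outcome
              p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
              # Nudge the winner up and the loser down by the 'surprise'
              grads[winner] += 1.0 - p_win
              grads[loser] -= 1.0 - p_win
          scores = [s + lr * g for s, g in zip(scores, grads)]
          # The scale has no natural zero, so centre it on zero
          mean = sum(scores) / n_scripts
          scores = [s - mean for s in scores]
      return scores

  # Toy demonstration with five scripts of known, increasing quality
  random.seed(0)
  true_quality = [0.0, 1.0, 2.0, 3.0, 4.0]
  judgements = []
  for _ in range(200):
      a, b = random.sample(range(5), 2)
      p_a_wins = 1.0 / (1.0 + math.exp(true_quality[b] - true_quality[a]))
      judgements.append((a, b) if random.random() < p_a_wins else (b, a))

  scores = fit_scale(judgements, 5)
  print("Estimated scores:", [round(s, 2) for s in scores])
  print("Best to worst:", sorted(range(5), key=lambda i: scores[i], reverse=True))

Run on the toy data, the fitted scores recover the intended ordering of the five scripts; with real pupils' writing, the same idea produces a scaled measure of quality built entirely from quick, paired comparisons.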

At No More Marking, where I am Director of Education, we have used our comparative judgement engine for a number of projects at primary and secondary level. In our assessments of pupils’ writing, we can measure the reliability of our markers, and we are routinely able to reduce the margin of error to just plus or minus two marks on a 40-mark question. Teachers are also able to complete these judgements relatively rapidly, leading to reductions in workload too. In the longer term, our hope is that wider use of comparative judgement will allow teachers to identify promising teaching methods with greater accuracy, and to reduce the influence that tick-box-style mark schemes have on teaching and learning.

To find out more, read Making Good Progress? The Future of Assessment for Learning (2016) by Daisy Christodoulou, published by Oxford University Press.



References

1. Laming, D. (2003) Human judgment: the eye of the beholder. Boston, MA: Cengage Learning.

2. AQA, GCSE English Language 8700, Paper 2 Mark Scheme. filestore.aqa.org.uk/resources/english/AQA-87002-SMS.PDF

3. Wolf, A. (1998) ‘Portfolio assessment as national policy: the National Council for Vocational Qualifications and its quest for a pedagogical revolution’, Assessment in Education: Principles, Policy & Practice, 5 (3) pp. 413–445, p. 442.

4. Pollitt, A. (2012) ‘Comparative judgement for assessment’, International Journal of Technology and Design Education, 22 (2) pp. 157–170.

5. Thurstone, L. L. (1927) ‘A law of comparative judgment’, Psychological Review, 34 (4) pp. 273–286.