Why all the fuss about randomised trials?

Hamish Chalmers is a teacher and a lecturer in applied linguistics at the University of Oxford. Here he demystifies the opportunities and challenges that randomised controlled trials – RCTs – offer education and the classroom. They are often seen as a gold standard in research, and being aware of both the opportunities and the challenges is essential to appreciating their value.

In the past five years or so, randomised controlled trials (RCTs) have firmly entered the lexicon of educational research. They are fast becoming the preferred method by which to evaluate the effects of educational interventions in the UK. One-third of English state schools have taken part in RCTs funded by the Education Endowment Foundation, and RCTs are routinely referred to in order to guide policy decisions. But what is so special about RCTs that they are enjoying such privilege?

In 1957, pioneering experimental social scientist Donald Campbell laid down the fundamental principle of experimentation, saying that ‘The very minimum of useful scientific information involves at least one formal comparison and therefore at least two careful observations’(1). In education research this means that to understand the effects of a new teaching approach we need to compare what happens when pupils are taught using it with what happens when they are taught using an alternative approach.

It is impossible for one group of pupils to be taught simultaneously using more than one approach. Therefore, we need to create comparison groups that are approximations of each other. This has been attempted in several ways. For example, data from one group of pupils can be compared with data from another group (PISA rankings are a good example of this). Alternatively, pupil outcomes before a new intervention is introduced can be compared with outcomes afterwards (average reading attainment in the UK before and after the introduction of the phonics screener, for example). Or, groups of pupils can be matched on characteristics such as age and socioeconomic status, then each group taught using different approaches and their outcomes compared.

As any primary school pupil can tell you, a key requirement of any scientific experiment is that it is a fair test. One way of helping to make an educational experiment fair is to ensure that the groups of children being compared are as similar as possible. The designs described above fall short of this basic requirement. For example, PISA ranking relies on data from different children in different countries to assert the relative effectiveness of different approaches to teaching. Comparing attainment before and after an intervention does not account for changes in the children over time (the children at the beginning of the intervention are essentially different people by the end of it). Matched groups of children may be similar on characteristics we know about, but what about important things we don’t know about or haven’t measured?

A fair test requires that comparison groups have similar proportions of pupils who share characteristics that could affect the way they respond to the interventions being compared. That’s all well and good if you are confident that you can identify every conceivably influential characteristic of your pupils. But even if that were possible, would it result in a fair distribution of all influential characteristics? The only honest answer is ‘We can’t know.’ In addition to characteristics that we can identify, there are likely to be some that we can’t. How do we account for things like personal enthusiasm for a subject, relevant experience outside of school, individual idiosyncrasies, and so on? These are all potentially important characteristics that we have no clear way to identify and quantify, and therefore no way to deliberately distribute equally across groups.

Differences among pupils emphasise the complexity of human beings. These differences and the resulting complexity are why random allocation to comparison groups is so powerful. Random allocation takes into account how messy human beings are and distributes the mess fairly. By deciding at the flip of a coin who goes in one group and who goes in the other, random allocation creates groups that differ only as a result of the play of chance. This is not the same as saying that groups are ‘equal’ (they probably won’t be in some respects), but it does mean that the groups are not systematically different, and that any differences result from pure coincidence. As a result, we can be more confident than with other research designs that any differences in outcomes between comparison groups are due to differences in the interventions and not to non-random differences (biases) between the pupils in the comparison groups.
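
For readers who like to see this in action, here is a minimal sketch in Python of the coin-flip idea. Everything in it is made up for illustration: the 200 pupils, their scores on an unmeasured trait such as enthusiasm, and the helper name `allocate_at_random` are assumptions, not data or methods from any study. It allocates the same pupils to two groups many times over and looks at the gap between the groups on the hidden trait.

```python
import random
import statistics


def allocate_at_random(pupils, rng):
    """Shuffle a copy of the pupils and split it into two equal halves."""
    shuffled = list(pupils)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


rng = random.Random(2018)  # fixed seed so the sketch is repeatable

# Hypothetical scores for a trait nobody measured, e.g. enthusiasm.
pupils = [rng.gauss(50, 10) for _ in range(200)]

# Repeat the allocation many times and record the gap between group means.
gaps = []
for _ in range(1000):
    group_a, group_b = allocate_at_random(pupils, rng)
    gaps.append(statistics.mean(group_a) - statistics.mean(group_b))

average_gap = statistics.mean(gaps)          # near zero: no systematic tilt
largest_gap = max(abs(gap) for gap in gaps)  # a single allocation can differ

print(f"average gap across allocations: {average_gap:.2f}")
print(f"largest gap in a single allocation: {largest_gap:.2f}")
```

Any one allocation can leave the groups somewhat unequal, but across allocations the gaps should centre on zero. That is the sense in which groups formed this way differ by chance rather than systematically.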

Failing to properly account for systematic differences between comparison groups can massively influence how we interpret the results of educational research. Consider driver’s education, a popular way to try to reduce car crashes among young drivers. Data from non-randomised comparisons has been used to promote this intervention. Researchers looked at the rates of car crashes among youths who had taken these classes and youths who had not, and they found that the latter were more than twice as likely to have been involved in a car crash as the former(2). When driver’s education was evaluated in a series of RCTs, however, very little difference in accident rates was detected between drivers randomly allocated to attend the classes and drivers randomly allocated to not take those classes(3). So, which evidence do you trust more? The non-randomised studies did nothing to account for possible differences between people who took the classes and those who did not. The RCTs ensured that, even if not identical, the comparison groups differed only by chance.

As it turns out, there is a good explanation for why these two approaches came to conflicting conclusions. In a separate study, researchers found that people who take driver’s education courses tend to display psychological characteristics that are compatible with safer attitudes to road use. The drivers in each group in the non-randomised studies were systematically different from each other.
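
A toy simulation in Python can show how this kind of self-selection produces a misleading result. It is a sketch under loudly stated assumptions, not a reconstruction of the cited studies: the hidden ‘cautious’ trait, every probability, and the function name `crash_rates` are invented, and the classes are assumed to have no effect at all on crash risk (a simplification of the ‘very little difference’ the trials reported).

```python
import random


def crash_rates(n_drivers, randomised, rng):
    """Return (crash rate with classes, crash rate without classes)."""
    took_classes, no_classes = [], []
    for _ in range(n_drivers):
        cautious = rng.random() < 0.5  # hidden trait, never recorded
        if randomised:
            attends = rng.random() < 0.5  # a coin flip decides
        else:
            # Self-selection: cautious drivers sign up far more often.
            attends = rng.random() < (0.8 if cautious else 0.2)
        # Assumption: crash risk depends only on caution, not on the classes.
        crashed = rng.random() < (0.05 if cautious else 0.20)
        (took_classes if attends else no_classes).append(crashed)
    return (sum(took_classes) / len(took_classes),
            sum(no_classes) / len(no_classes))


rng = random.Random(1999)  # fixed seed so the sketch is repeatable
observed_with, observed_without = crash_rates(100_000, randomised=False, rng=rng)
trial_with, trial_without = crash_rates(100_000, randomised=True, rng=rng)

print(f"self-selected: {observed_with:.3f} with classes, {observed_without:.3f} without")
print(f"randomised:    {trial_with:.3f} with classes, {trial_without:.3f} without")
```

Under these made-up numbers the self-selected comparison should show roughly twice the crash rate among drivers who skipped the classes, even though the classes do nothing in this model. The randomised comparison should show the two groups crashing at about the same rate.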

The difference in results in the driver’s education studies had a plausible explanation. However, we are not always able to unpick causal relationships so easily. Even so, teachers must still take decisions about their practice. In a study of an after-school programme designed to reduce anti-social behaviour in primary school children(4), non-randomised evaluations of the programme suggested that it helped. On the basis of that finding, schools were preparing to roll out the programme to all children. When it was evaluated in an RCT, however, researchers found that instances of anti-social behaviour increased in children who had taken part in the programme compared to their peers who had not. Unlike the driver’s education studies, there was little to explain why this was. Nonetheless, schools were faced with a choice over what to do. Should they trust the results of the non-randomised study, and roll out the programme to all children? Or should they trust the results of the RCT and cancel it? As with the driver’s education example, their choice was between a study in which they could not confidently say whether like was being compared with like, and one in which they knew that researchers had used the best method available for creating unbiased comparison groups. Logic prevailed and they chose to cancel the programme.

Random allocation to comparison groups is the only defining feature of an RCT, and it is the only feature that prevents allocation bias. This simple feature is why RCTs are the preferred method for assessing programme effectiveness. When faced with decisions about practice, all else being equal, teachers and policy makers must decide whether they trust the findings of these fair tests or the findings of studies for which no similar reassurance is possible.


1. Campbell, D. T. (1957) ‘Factors relevant to the validity of experiments in social settings’, Psychological Bulletin 54 (4) pp. 297–312, p. 298.

2. MacFarland, R. A. (1958) ‘Health and safety in transportation’, Public Health Reports 73 (8) pp. 663–680.

3. Vernick, J. S., Li, G., Ogaitis, S., MacKenzie, E. J., Baker, S. P. and Gielen, A. C. (1999) ‘Effects of high school driver education on motor vehicle crashes, violations, and licensure’, American Journal of Preventive Medicine 16 (1S) pp. 40–46.

4. O’Hare, L., Kerr, K., Biggart, A. and Connolly, P. (2012) Evaluation of the effectiveness of the childhood development initiative’s ‘Mate-Tricks’ pro-social behaviour after-school programme. Available online at: www.goo.gl/sVUtFJ (Accessed 10 July 2018).