Dr Wayne Harrison

The '2 Sigma Problem' and EdTech: Scrutinising the Evidence for Impact

I am seeing more and more posts about the impact of AI in education and how it will transform learning. With the AI advances of the last 12 months, Silicon Valley appears to be referencing ‘The 2 Sigma Problem’ from a study by the educational researcher Benjamin Bloom (1). The problem stems from the observation that students who are tutored one-on-one using mastery learning techniques perform much better than students who learn via conventional instructional methods. Bloom found that the average tutored student performed better than about 98% of the students in the traditional learning setting. This difference in performance is referred to as the "two sigma" advantage, as it represents two standard deviations on a normal distribution curve. This language comes from a long, detailed and rich history of scientific enquiry, in which it has been noticed, in countless observations across all of nature (it started with counting stars), that events with single causes tend to cluster in the same pattern around an average. We know something unusual is happening when it falls outside this pattern.
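The link between "two standard deviations" and "about 98%" can be checked with a couple of lines of standard-library Python: on a normal distribution, a score two sigma above the mean sits at roughly the 98th percentile.

```python
from statistics import NormalDist

# Percentile rank of a score two standard deviations above the mean
# on a standard normal curve -- Bloom's "two sigma" advantage.
percentile = NormalDist(mu=0, sigma=1).cdf(2.0) * 100
print(f"{percentile:.1f}")  # ~97.7, commonly rounded to "about 98%"
```

The exact figure is 97.7%, which is why summaries of Bloom's study usually say "about 98%".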

The purpose of this short blog is not to address the methodological challenges with effectiveness studies when scaled, as this needs to be covered in greater depth when considering the 2 Sigma Problem. I will be tackling that over the summer, as I believe we first need to take a step back and consider the basic concepts in testing the effectiveness of any type of educational intervention. I think we can all relate to the following example from our early childhoods.

Try to remember being back in primary school, as you eagerly prepare for a simple yet fascinating science experiment. Your teacher introduces the concept of fair tests and variables by having you and your classmates grow two sets of bean plants. To determine the impact of sunlight on plant growth, you place one set near the window where they receive ample sunlight, while the other set is tucked away in a dark corner of the classroom. By carefully monitoring and comparing the growth of these two sets of plants, your young mind begins to grasp the importance of using a comparator in research. This early exposure to experimental design not only sparks your curiosity but also lays the groundwork for understanding the significance of robust evidence in education and beyond. 

Now let’s return to the present day, and my current LinkedIn news feed informing me of the next big transformational AI product that will revolutionise learning: EdTech entrepreneurs and established companies releasing new ground-breaking technology and claiming it will scale to millions of users in the next 12 months.

I am personally puzzled that in education, as a society, we accept that anyone can create a product and release it to millions of users without any robust evidence of impact and without knowing whether the product works. I have previously blogged about this (2) so I will not repeat my frustrations, but I do believe it is because the effects are long term and not life threatening, unlike a plane crashing from 30,000 feet or a medical treatment causing serious side effects.

In the ever-evolving field of education, the need for creating robust evidence cannot be overstated. It is crucial that we critically assess the interventions we implement in our classrooms, ensuring they are not only effective but also grounded in high-quality evidence. By embracing a rigorous approach to evaluating educational strategies, we can make informed decisions on which interventions to adopt and which to discard. The consequences of relying on ineffective interventions can be far-reaching, potentially stalling students' progress, wasting valuable resources, and inadvertently widening achievement gaps. By prioritising evidence-based practices, we can minimise these negative outcomes and contribute to the overall success of our students, schools, and communities.

Generating reliable evidence in education comes with its unique set of challenges, one of which is establishing a credible counterfactual or benchmark against which to measure whether something works. As in the plant experiment above, how do we know whether sunlight helps unless everything else that might make the plants grow is kept the same apart from the sunlight? Without a comparison, if we see plants growing it could be for any number of reasons. In research design, the counterfactual represents a hypothetical scenario in which the intervention in question is not implemented, enabling us to gauge its effectiveness by comparing outcomes between the actual situation and the counterfactual. The difficulty in educational settings, however, is that we cannot observe both the presence and absence of an intervention simultaneously. Unfortunately, we do not have the power of the Time Stone and Dr Stephen Strange (apologies if you are not a Marvel fan!).

In the absence of a clear counterfactual, it becomes increasingly challenging to ascertain whether observed changes in student performance are truly attributable to the intervention. In education, the design of the research is therefore critical in ruling out other plausible explanations for the apparent impact of the intervention.

Internal validity is a critical aspect of educational research design: it refers to the extent to which the observed effects of an intervention can be confidently attributed to the intervention itself, rather than to confounding variables or biases. In the plant example, a study may claim it measured the effect of sunlight, but its internal validity depends on whether it reasonably did, for example whether the control plants were kept in shadow the whole time. In other words, internal validity is concerned with establishing a causal relationship between the intervention and the measured outcomes.

When evaluating the quality of research evidence in education, it is crucial to consider internal validity, as it helps us determine whether an intervention is truly effective in producing the intended results. Several factors can threaten internal validity in educational research, such as selection bias, history effects, maturation, and instrumentation issues, among others.

In EdTech, I often see claims of evidence based on single group designs used to measure effectiveness. A single group design simply means that the study has no comparator group. Take a hypothetical new AI product tested with a class of thirty students, who complete pre- and post-assessments with questions selected by the developers and also used as part of the intervention. At the end of the term, the research reports an increase in attainment of 40%, and the developers claim this provides robust evidence of impact.

In this example, the single group design without a comparator has low internal validity, as many other plausible explanations could account for the 40% increase in attainment. I hope that by now you are returning to your early science experiments from your primary years, questioning alternative reasons for the increase. What if the students were familiar with the questions from the pre-test, were exposed to them again during the intervention, and performed better on the post-test simply because of that familiarity? What if the students would have improved anyway over time through ordinary classroom teaching? What if, on the day of the pre-test, a large proportion of high-ability students were on a school trip and missed the assessment, but were included in the post-test? And does a sample of 30 students give us confidence that the findings will generalise to the wider student population?
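To make these threats concrete, here is a small simulation in which the hypothetical product has zero effect, yet the single-group pre/post comparison still shows a healthy gain. All the numbers (class size aside) are invented for illustration: the "maturation" and "familiarity" gains stand in for ordinary teaching and test-exposure effects.

```python
import random

random.seed(42)

# Hypothetical illustration: thirty students sit a pre-test, learn normally
# for a term with NO special intervention, then sit a post-test. Ordinary
# classroom teaching (maturation) and having seen similar questions before
# (testing effects) both raise scores on their own.
N_STUDENTS = 30
MATURATION_GAIN = 8    # assumed gain from normal teaching over the term
FAMILIARITY_GAIN = 4   # assumed gain from prior exposure to the questions

pre = [random.gauss(50, 10) for _ in range(N_STUDENTS)]
post = [score + MATURATION_GAIN + FAMILIARITY_GAIN + random.gauss(0, 3)
        for score in pre]

mean_pre = sum(pre) / N_STUDENTS
mean_post = sum(post) / N_STUDENTS
print(f"Apparent 'impact': {100 * (mean_post - mean_pre) / mean_pre:.0f}% gain")
```

Without a comparator group exposed to the same maturation and familiarity effects, this entirely intervention-free gain is indistinguishable from a real effect.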

This is why research design matters, along with sample size and outcome measures, when considering the robustness of educational research. Rather than fixating on the ‘2 Sigma Problem’ as the north star for AI or EdTech, I think we should all take a step back and start by testing the effectiveness of new interventions robustly.

Advances in technology, and AI in particular, have the potential to transform education. However, it is important to remember that until we evaluate them, we do not know whether their impact on learner outcomes will be positive, negative, or neutral.

1) Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13(6), 4-16.

2) https://interventions.whatworked.education/blog/why-should-edtech-be-different