Assessment – A Reading List.

I’m creating a reading list to share with staff at my school. I plan to give them some time to read some of these at the start of my assessment INSET next week.

Formative Assessment:

Avoid complex tasks for formative assessment:
What makes good formative assessment by Daisy Christodoulou

Making the feedback more work for the pupils than the teacher:
Five Ways to give Feedback as Actions by Tom Sherrington

A tip for using mini-whiteboards:
Slow motion learning by Greg Ashman

Why it’s dangerous to use summative assessments formatively:
Assessment practice that is wide of the mark by Matthew Benyohai

Use feedback to change behaviour:
Moving from Marking to Feedback by Harry Fletcher-Wood

Formative assessment can be generic and subject specific:
AfL in Science, A Symposium by Adam Boxer

A classic, in case you haven’t read it. Giving pupils grades is not generally a good idea:
Inside the Black Box by Dylan Wiliam and Paul Black

Summative Assessment:

Data collection shouldn’t be the main focus of assessment:
Breaking useless assessment habits by Stephen Tierney

Long tests are better than short tests, but even then we should only take note of significant differences. Teacher accountability is bad.
What if we can’t measure progress? by Becky Allen

Why teacher assessment isn’t as good as it may seem:
Tests are Inhuman, and that’s what is so good about them by Daisy Christodoulou

A way to report pupils’ performance which gives better information to parents, tutors and the pupils themselves:
Pragmatic Assessment by Matthew Benyohai


I’d be interested to know if you’ve read any blogs which summarise Inside the Black Box nicely, as it’s a bit long for staff to read during an INSET session. I’d also be grateful if you could share a blog on the difference between summative and formative assessment, in case there are any members of staff who are unsure about this.

Apart from those two specific questions… these are just five blogs that have stuck in my memory over the past year. I’m sure there are many others that I have read and forgotten about, or completely missed in the first place. What would you add to this list?

I love podcasts

I really do love podcasts. Since discovering them, I now hardly ever listen to the radio (at least live: many of my podcasts are radio shows) and have pretty much completely stopped listening to music. You may see these as negative consequences but personally, I feel so much more positive about time spent travelling now I have a convenient form of audio entertainment which keeps my attention.

So, inspired by Caitlin Clock on twitter, I thought I’d publish my current list of favourites. They err very much towards the informational side of things, representing how much I love learning, but there are a few pure entertainment ones in there too. I’ll answer Caitlin’s questions for each one:

More or Less is BBC Radio 4’s program which investigates numbers in the news and everyday life. This is my go-to easy listening and it keeps me well informed: so many things which feature in this come up in day-to-day conversation. I am planning to use some episodes as part of my teaching next year.

Episodes to try: Women, the Oscars and the Bechdel Test, Trump tells the Truth, Grammar Schools. But probably the most recent episodes are most interesting because they’re topical.

In Revisionist History, Malcolm Gladwell ‘goes back and reinterprets something from the past: an event, a person, an idea. Something overlooked. Something misunderstood.’ I’m not hooked by every episode, perhaps because my lack of knowledge of history means that I never misunderstood in the first place, but when it works it can make you think about things in a completely different way. Above all, I find Gladwell to be a great storyteller, so even when I’m not completely sold on his point, I enjoy it nonetheless.

Episodes to try: The Lady Vanishes, Blame Game, The Satire Paradox, McDonald’s Broke My Heart, Free Brian Williams, Malcolm Gladwell’s 12 Rules for living.

The Life Scientific, also by Radio 4 (yes, a theme is developing here) features Jim Al-Khalili interviewing famous scientists to ‘find out what inspires and motivates them and ask what their discoveries might do for mankind.’ A bit of a mixed bag (not all top academics are good at explaining their work to lay-people!), but I really enjoy most of these and feel that I learn something along the way.

Episodes to try: Tim Birkhead, Eugenia Cheng, Daniel Dennett, Sadaf Farooqi, Nick Davies, Peter Piot, Carol Black.

50 Things That Made the Modern Economy is created by Tim Harford, the regular presenter of More or Less, so I was bound to like it, but I even ended up preferring it. Again (a bit of a theme) it takes simple ideas but tells you something quite surprising about them. In one of my favourite episodes, Tim starts the show calling up a bookmakers to try to place a bet on his own death. “William Hill won’t gamble on life and death. A life insurance company does little else. Legally and culturally, there’s a clear distinction between gambling and insurance. Economically, the difference is not so easy to see.”

Episodes to try: All of them. I’ve listened to them all 3 times.

Coffee Break French and Spanish. Seasons 3 and 4 are great for intermediate level speakers: there is a lovely interaction between the hosts (the regular presenter Mark, and a native speaker) as they discuss a “text” which is read out at the start of the episode. My wife and I make this even more valuable by listening through the text a couple of times and asking each other questions about any parts we don’t understand, before listening to the discussion. I’m not so sure about the earlier seasons for beginners.

Episodes to try: Spanish Season 3 with Alba, French Season 4 with Pierre-Benoit.

Desert Island Discs is a classic Radio 4 program, attracting high-level celebrities to discuss their life and work, whilst choosing the 8 tracks they would take with them to a desert island. I think that Kirsty Young is such a great presenter: she’s very kind and her warmth encourages the guests to open up about their lives, but she’s also not afraid to delve into sensitive issues if they’re being cagey.

Episodes to try: Tom Hanks, Nigel Owens, Sir Anthony Seldon (crucial listening for teachers), Sir David Attenborough. Really, just whoever interests you most.

A History of the World in 100 Objects (and, if you like that, Germany: Memories of a Nation) is presented by Neil MacGregor, the former director of the British Museum. It has got me interested in a subject which school lessons completely failed to. I particularly like the chronological approach; there are some nice themes running through several episodes and an attempt to cover the whole world.

Episodes to try: All of them. This time, just because the chronological approach is important. My wife hates the music: don’t let that put you off.

I feel that there is almost no point in mentioning Mr Barton Maths Podcast as anyone who reads this blog will probably already be an avid listener. I absolutely love these two-hour+ episodes in which he interviews interesting people from the world of education. This pretty much takes up all my teacher-focussed listening, but I sometimes find time for The Education Research Reading Room by Ollie Lovell, which follows a similar (usually not quite so epic!) format.

Episodes to try: Jo Morgan, Dan Meyer, Jamie Frost, Greg Ashman part 1, Daisy Christodoulou, Dylan Wiliam part 2, Harry Fletcher-Wood and of course the Slice of Advice (because I feature for a couple of minutes!) From the ERRR: Adrian Simpson and John Hattie.


Special mentions go to…

I suspect that Freakonomics Radio will make the main list in a month or two but I’ve only just started listening.

Friday Night Comedy from Radio 4 (of course) interchanges between Dead Ringers (a classic impersonation show), The News Quiz and The Now Show, all of which are solid political-focussed comedy panel shows. Think of it as the podcast version of Have I Got News For You.

Serial which I remember reading was the most popular podcast ever, and for good reason. Season 1 is a gripping murder mystery with a twist: it’s a real life story.

In Our Time which is often a bit too intellectual for me, but I still enjoy it. I know that some people find Melvyn Bragg annoying.

No Such Thing As A Fish which is basically QI in radio format, minus Stephen Fry.

Reply All and This American Life can sometimes be great, and I like the way they give me a bit of insight into the US, but I find them inconsistent.

I feel that I should love The Infinite Monkey Cage but I’m not sure the mix of comedy and science is actually that great. Brian Cox should stick to ‘Wonders of the…’ but it’s still worth a try.


So… what am I missing? Let me know what you listen to and why I should join you!

Post-intervention Progress?

This idea is totally stolen from a former colleague of mine, Andrew Dales.

His question: what happens after an intervention? Specifically, he framed the question perfectly as the following graph:

So, assuming that the intervention produces a greater increase in attainment than control (which I’ve exaggerated significantly for the purposes of clarity), what happens afterwards?

Does the intervention instil in pupils some newfound ability to continue progressing at a greater rate than they would have before (route A)?

Do the pupils return to the same rate of progress as before the intervention, staying ahead of where they were before (route B)?

Or do they, post-intervention, progress at a lower rate for some time, returning to their original path (route C)?

Of course, this will depend on the intervention. As teachers, I feel that we should consider which approaches will help our pupils to follow route A (or at least B!), not just those which produce the largest improvement over the time span of the intervention.
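To make the three routes concrete, here is a minimal sketch of them as simple straight-line models. Everything numeric here is an assumption for illustration only (the rates, the intervention running from week 10 to week 20, and route C taking 20 weeks to rejoin the control path); the function and parameter names are hypothetical too.

```python
def control(week, base_rate=1.0):
    """Attainment without any intervention: steady progress at base_rate."""
    return base_rate * week


def attainment(week, route, base_rate=1.0, boost_rate=2.0,
               start=10, end=20, rejoin_weeks=20):
    """Attainment for a pupil who gets an intervention from `start` to `end`.

    All default numbers are illustrative assumptions, not real data.
    """
    if week <= start:
        return base_rate * week
    if week <= end:
        return base_rate * start + boost_rate * (week - start)

    at_end = base_rate * start + boost_rate * (end - start)
    after = week - end
    if route == "A":  # keeps progressing at the boosted rate
        return at_end + boost_rate * after
    if route == "B":  # original rate again, but stays ahead
        return at_end + base_rate * after
    if route == "C":  # slower for a while, then back on the control path
        if after >= rejoin_weeks:
            return control(week, base_rate)
        target = control(end + rejoin_weeks, base_rate)
        return at_end + (target - at_end) * after / rejoin_weeks
    raise ValueError("route must be 'A', 'B' or 'C'")


for route in "ABC":
    print(route, attainment(40, route), "control:", control(40))
```

With these made-up numbers, a route A pupil is far ahead of control at week 40, a route B pupil keeps the gain made during the intervention, and a route C pupil is exactly where the control pupil is.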

In the same way as Daniel Kahneman introduces new terminology in Thinking, Fast and Slow, I think it could be valuable to introduce the new terminology into educational research: “Is this a Route A intervention, or will the students revert to control via Route C?”

I wonder how much research tries to answer this question? My suspicion is not a lot.

To Mark Or Not To Mark

One of the two big takeaways which I discussed in my clip for Craig Barton’s ‘slice of advice’ podcast was that you don’t have to mark pupils’ written work.

I think the first place that I heard of this radical idea was from several of the teachers at Michaela School. They promote the idea of whole-class feedback: instead of marking pupils’ work, look through it, note down common strengths and weaknesses or misconceptions and use this to plan feedback which you deliver to the whole class.

Many of the advantages of this approach are discussed by Andrew Percival from minutes 14-18 on Craig’s podcast: It’s easier to give detail and nuance verbally, it gives teachers more time to think about planning lessons, it encourages focus on ‘improving the pupil, not the work’ (a quote from Dylan Wiliam).

For me, another major advantage is not having to spend hours writing individual comments… I did so for 10 years and it was the one part of my job that I hated, even though I knew that the research showed that it was more effective than giving numerical scores. I guess I just didn’t think enough about whether there may be an alternative.

I used whole-class feedback last term together with online homework. I set this through but next term I plan to trial, particularly because I like the fact that there is the option for adaptive level of difficulty in the questions. Pupils get immediate feedback as to whether or not their answers are correct and hence the chance, which many have taken, to correct their mistakes.

Once all pupils have completed the task, I look through the answers and choose which questions I wish to discuss in class. Last term, I then presented solutions to these questions, or asked a pupil who had correctly solved the problem to do so. Next term, I think I will try to present a slightly different question (possibly just changing the numbers) because this will be less frustrating for those pupils who have already answered the question correctly. It will also give those pupils who struggled originally the chance to re-do the question to show that they have improved their understanding.

I’ve just mentioned one negative of the whole-class feedback approach: some of the feedback won’t be relevant to some pupils, so we may see this as a waste of time. An alternative would be to try to give verbal feedback in a more individual way, but this would take a large amount of time with a whole class. One advantage of individual written feedback is that producing it does not use up the limited resource of lesson time.

Another advantage of written feedback may be its longer-term nature. A pupil may well have forgotten what you said to them yesterday, but if you wrote it down they can go back and look at it again. As I developed as a teacher before I stopped writing comments, I was getting better at directing pupils to look back at my past comments when they had not taken my suggestions on board.

Overall, I feel that these two issues do not outweigh the advantages of whole-class feedback (though from my comments above, you can tell I’m a bit biased!). However, they mean that I haven’t completely ruled out written feedback.

What do you think?

Scheming, Part 2: Spaced Practice and Interweaving.

You can tell that I had 9 months off work last year because I’m spending an unnaturally large amount of my summer holidays working. Most of this time has been improving my scheme of work. In part 1, I talked about how I’d been thinking carefully about which topics were prerequisites of each other to ensure that everything was taught in a logical order.

This time, I want to discuss this:

If you’re not mathematically or computationally inclined, please don’t be put off! This bit of code simply answers the question: “When is this idea revisited in the scheme of work?” in order to help me build spaced practice into the scheme and improve my interleaving.

Here is what my scheme of work looks like:

I can see that the first topic “Fraction Division A” (in my scheme: a model for division, writing it as a fraction and learning the relevant vocabulary) is ‘revisited’ after 2, 14, 16, 29, 52 and 66 sequences of lessons. Each sequence in my scheme takes around 3 hours / 1 week of teaching, so you can think of this roughly as weeks. By ‘revisited’, I mean either that it is listed as a prerequisite or application of a future sequence.
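For readers who prefer code to spreadsheets, the same idea can be sketched in Python. This is not my actual formula: the `Sequence` class, the `revisit_gaps` function and the tiny four-sequence scheme (apart from the name “Fraction Division A”) are all hypothetical, made up purely to show the logic of “count how many sequences later a topic is listed as a prerequisite or application”.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """One sequence of lessons (roughly a week) in the scheme of work."""
    name: str
    prerequisites: list = field(default_factory=list)
    applications: list = field(default_factory=list)


def revisit_gaps(scheme, topic):
    """How many sequences after `topic` is it revisited?

    'Revisited' means the topic is listed as a prerequisite or an
    application of a later sequence.
    """
    position = next(i for i, seq in enumerate(scheme) if seq.name == topic)
    return [
        i - position
        for i, seq in enumerate(scheme)
        if i > position
        and (topic in seq.prerequisites or topic in seq.applications)
    ]


# Hypothetical mini-scheme, for illustration only.
scheme = [
    Sequence("Fraction Division A"),
    Sequence("Place Value A"),
    Sequence("Fraction Division B", prerequisites=["Fraction Division A"]),
    Sequence("Ratio A", applications=["Fraction Division A"]),
]

print(revisit_gaps(scheme, "Fraction Division A"))  # [2, 3]
```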

How are these numbers useful? Well, I’m not sure, but I think that in an ideal world they would go something like 2, 6, 20… I’m basing this roughly on something Mark McCourt said in his Slice of Advice, which comes from analysing data from his complete maths platform. I did read another blog recently which suggested that in some situations linear spaced practice (e.g. 4, 8, 12, 16) may be more effective (annoyingly I can’t find this blog now).

Regardless, with Fraction Division A, the gap from 2 to 14 seems a bit large so this may prompt me to move the topic which comes 14 sequences later forward in my scheme.
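One way to spot jumps like this automatically is sketched below. The threshold is purely an assumption: I’ve picked “the next revisit is more than 4 times later than the previous one” because it lets a 2, 6, 20 pattern through, but there is no research behind that number, and the function name `large_jumps` is made up.

```python
def large_jumps(gaps, max_ratio=4.0):
    """Return (previous, next) pairs where the next revisit comes more than
    `max_ratio` times later than the previous one.

    `max_ratio` is an arbitrary, illustrative threshold.
    """
    return [
        (earlier, later)
        for earlier, later in zip(gaps, gaps[1:])
        if later > max_ratio * earlier
    ]


print(large_jumps([2, 14, 16, 29, 52, 66]))  # [(2, 14)]
print(large_jumps([2, 6, 20]))               # []
```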

Other topics have fewer applications. For example, Rounding A had no close follow-ons at all. In this case, I didn’t want to move Rounding B forward in the curriculum, so I looked for another topic which could be used to apply Rounding A and decided I could apply that when teaching Area A. In this way, it’s prompting me to think much more about genuine interweaving (not just interleaving) of topics.

Another harder question that immediately arose was Constructions A, as there is a vague link after 10 topics but the next comes after 76, by which point the students will surely have forgotten most of what they learned. In this case, I need to think more carefully about whether I really want this topic at this point in the curriculum at all, as it doesn’t have many links to the rest of the curriculum (a few people on twitter helped me with this recently, but I’m still slightly lacking links).

I’ve only just started using this formula, but I already feel that it’s very powerful and is allowing me to make decisions about where to place topics and how to interweave in a much more informed way than I ever had before.

Are there any other uses for these numbers that you think I’ve missed?

What are the weaknesses of this approach?

Are you impressed by my Excel skills?!

Professional Judgement

As a teacher, I have been asked to make predictions as to how my pupils will do in GCSE and A-level exams more times than I can remember. At my previous school, we did this three times a year for A-level students (which made up 80% of my teaching).

I questioned the value of these predictions, especially after reading about the illusion of expertise in Thinking, Fast and Slow: the example given was of stockbrokers who consistently thought that they could out-perform algorithms in making good predictions. The data did not support them.

I had a database of several hundred A-level students from my school so I decided to calculate how accurate our predictions were and compare this to my super-hi-tech algorithm for predicting A2 performance: AS grade + 8 UMS points.

I then calculated the mean squared error in all of these predictions and you can see these numbers in the top right of the spreadsheet.

My super-hi-tech algorithm produced an error of 0.42 (note that I could have added anywhere between 6 and 11 UMS points and this doesn’t change much).
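For anyone who wants to try this on their own data, here is a rough sketch of the calculation in Python rather than a spreadsheet. Several things are assumptions on my part for the sake of a runnable example: scores are treated as UMS percentages, grades are taken as 10-point bands (E from 40%, A from 80%, ignoring A*), and errors are measured in whole grades. The pupil data is entirely made up and the function names are hypothetical.

```python
def ums_to_grade_points(ums_percent):
    """Convert a UMS percentage to grade points: U=0, E=1, ..., A=5.

    Assumes 10-point grade bands starting at 40% and ignores A*.
    """
    return max(0, min(5, int(ums_percent // 10) - 3))


def algorithm_prediction(as_ums_percent, bonus=8):
    """The 'super-hi-tech' baseline: AS score plus a fixed number of UMS points."""
    return ums_to_grade_points(as_ums_percent + bonus)


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and results."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Made-up data for four pupils, for illustration only.
as_scores = [62, 75, 81, 55]       # AS results as UMS percentages
actual_grades = [4, 4, 5, 2]       # final A2 grades in grade points
teacher_predictions = [5, 4, 5, 4]

algorithm_grades = [algorithm_prediction(s) for s in as_scores]
print("algorithm:", mean_squared_error(algorithm_grades, actual_grades))
print("teachers: ", mean_squared_error(teacher_predictions, actual_grades))
```

Squaring the errors means that a prediction which is two grades out counts four times as much as one which is a single grade out.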

In January, the team of expert teachers (I’m not joking here: my colleagues were very experienced and effective teachers) produced an error of 0.64, in March they’d reduced this to 0.45 but it wasn’t until April, about a month before the exams, that the experts finally beat the algorithm, with an error of 0.35.

This suggests that there was absolutely no point in making the earlier predictions. To be honest, I’m not sure what use the April predictions were either but at least they were slightly more accurate than the simplest model I could think of. Moreover, I think it shows how bad teachers are at judging students and why we shouldn’t use teacher assessment in reports, or school data generally. This point is also made well in Daisy Christodoulou’s blog: Tests are inhuman, and that is what’s so good about them.

Draft Assessment Policy

Over the past two months, I’ve been writing the assessment policy for my new school. It was inspired very much by Making Good Progress, but also contains ideas from several policies that teachers kindly shared with me on twitter.

I have a two-hour INSET slot on assessment at the start of next term. One of my ideas is to share this policy with our teachers, ask them for their views and work on improving it as a team.
Before I do that, it would be great if anyone would help me by making any suggestions for improvement.

It will cover all subjects (as we currently only have one or two teachers in each) so I wonder if it lacks flexibility: it’s written very much from my maths teaching perspective?

Here you go:

GES Assessment Policy

Formative Assessment

Formative assessment refers to practices used by teachers which assess their pupils’ progress in order that the teacher and pupil can plan their future learning more effectively.

Teachers should use assessment:

  • to determine pupils’ prior learning
  • to check on pupils’ understanding
  • as a form of retrieval practice to improve memory
  • to correct factual and literacy errors / poor effort
  • to modify teaching and decide whether or not to move on

In order to achieve these aims, they may use the following forms of assessment.

Individual Verbal Feedback
Verbal feedback should be specific and positive (‘do this’, rather than ‘don’t do that’) and should ensure that the responsibility to improve the work remains with the pupil. Teachers should spend a brief amount of time giving feedback to a pupil; if feedback takes longer, then further instruction is needed and the teacher should assess whether a reteach of the topic/concept is required.

Questioning
This has two main purposes:
Sharing knowledge and understanding around the class;
Checking on the knowledge and understanding of the class.
In the first case, teachers may accept ‘hands up’ but the majority of the time, teachers should use named questions. This enables teachers to assess whether pupils have understood the topic they are teaching, rather than just hearing from the highest-attaining pupils.
It is wise to ask a question, pause for the whole class to think, then target the question towards an individual, to encourage all pupils to engage in thinking.

Critiquing Good Work
Teachers photograph a good piece of work and project it on the board. The class discusses the strengths of the work and how it can be improved.

Mini-whiteboards
Teachers ask questions and pupils write their responses on mini-whiteboards which they hold up for the teacher to see.
Pupils should generally wait to show their answers at the same time, so that they are not encouraged to rush.
Complex tasks should be broken down into smaller steps for this activity.

Multiple Choice Questions
A teacher presents a multiple choice question and pupils can hold up a number of fingers to indicate their answer.
The wrong answers should ideally contain common misconceptions.
Higher-attaining pupils can be encouraged to think through what misconceptions might lead to the other answers given.

Self and Peer Assessment
If trying to address misconceptions, self-assessment is often more effective as pupils learn from their own mistakes more readily than from the mistakes of others.
Peer assessment can have the advantage of pupils being exposed to and learning from the ideas of others, but this may be more effectively managed through the strategy of ‘critiquing good work’.

Tests are an effective way to encourage pupils to retrieve information and test their understanding. In studies, pupils learn most from such tests if they are low-stakes (self-marked, with no negative consequences for poor performance) or even no-stakes (the teacher doesn’t even find out the score).
Such tests should not be assigned a percentage or grade. The focus should be on what the test tells the pupil and teacher about the next steps required for the pupils to improve.
Short, specific tasks usually provide better formative information than complex tasks, because they help to highlight the exact misconceptions of a pupil.

Teachers are encouraged to look through all homework, make brief notes and to give ‘whole class feedback’ the following lesson.
Teachers are encouraged to make notes about common errors and add them to the schemes of learning in order to address these potential pitfalls when the topic is delivered in the future.

If teachers wish to give individual written feedback on homework, it should not come in the form of a grade, but should comment on what is specifically good about the work and give one or two suggestions for improvement.
These suggestions may come in the form of follow up tasks (possibly one of the 5R’s: see appendix 1.)
Ideally, the teacher should check that these follow up tasks have been completed, but we do not want to encourage an endless cycle which burdens teachers with unmanageable workload. The follow up tasks should be more work for the pupil than for the teacher.
Teachers should encourage pupils to look back on previous feedback before completing future tasks.

Summative Assessment

The main purpose of summative assessment is to give all stakeholders (pupils, teachers and parents) an idea of how pupils are performing and progressing over time.
In order for this to happen, the results of such assessment need to be reliable and valid, and to communicate shared meaning.

Reliability:
A twenty minute test will not give a reliable picture of a pupil’s knowledge and understanding of an entire subject. Homework is not a reliable indicator of performance, as the level of time, effort and assistance sought can vary significantly.
In order to be as reliable as possible, tests should be long, and ideally set over several different days to allow for pupils having a ‘bad day’.

Validity:
A test on the conditional tense in Spanish will probably not be a valid indicator of how well a pupil will perform in GCSE Spanish. Similarly, the quality of a long-term project will not be a valid indicator for a subject which is assessed by examination.
Tests will be valid if they sample from a large and wide ranging proportion of the expected knowledge and understanding for a pupil of this age.

Shared meaning:
A raw score (e.g. 21 / 30) or percentage (e.g. 53%) does not communicate shared meaning because there is no common basis of understanding. It is not clear to pupils or parents, and to some extent even teachers, whether 21/30 or 53% is a ‘good’ score, nor what ‘good’ even means in this context. In order to communicate shared meaning, summative assessments results should be scaled appropriately.

How do we apply these principles at GES?

Each year group takes part in an extended period of exams towards the end of the academic year.
In year 7 and 8, these tests are sat in classrooms within the normal school timetable.
In year 9 and above, these tests are sat in an exam hall over the course of one week. Revision periods are allocated between exams.
Where a department demonstrates that examination is not the most reliable predictor of GCSE success, flexibility will be given as to the method of assessment used.

Each subject is tested at least twice, with the length of exam being related to the number of lessons taught in each subject and the age of the pupils involved.
Pupils in year 7 and 8 can expect at least 2 hours of tests in Maths, English and Science. Pupils in year 9 and above can expect to take at least 3 hours of tests in these subjects. The length of these tests helps make them a reliable indicator of a pupil’s performance.
These tests will aim to cover as much of the material taught up to that point as possible, in order to make them as valid an assessment as possible.

Results for each subject will be provided as a standardised score, such that the average score for the year group in each subject is 100 and the standard deviation is 20. This helps us to compare pupils’ performance between subjects and from year to year. See appendix 2 for more detail on this process.

In year 9 and 10, there will also be an indication of what a score of 70, 100 and 130 might mean in terms of a ‘working towards’ GCSE grade. These will be produced using CEM data, alongside assessments from national comparative judgement assessments in English and Maths. This will help to communicate shared meaning to parents, without giving the false impression that we can accurately predict grades at this stage.
In year 11, mock exams will be sat in February and the grades will be reported to parents, alongside that term’s progress report from teachers.

As a school, we only use summative assessment once per year because, in order to be reliable and valid, the tests must take up a significant amount of potential teaching time. We also feel that formative assessment is more important for pupils’ learning; summative tests are not easy to use formatively as they include complex tasks which require a variety of knowledge and skills, making it less clear to the teacher which of these are lacking.
As a result, teachers are discouraged from using summative assessment at other times of the year.

Reporting to Parents

Assessment of Effort

We use the following effort descriptors:

  • Listens carefully during whole-class discourse.
  • Works hard during individual tasks in class.
  • Collaborates well with peers.
  • Completes homework carefully and on time.
  • Asks questions to clarify or probe as appropriate.

The score for each criterion is on the following scale:

  • Almost always
  • Mostly
  • Sometimes
  • Rarely

Pupils self-assess their effort before teachers assess it.
Teachers meet with parents and pupils. They discuss the effort assessments and agree upon one or two targets for the pupil to work on.
Pupils create a Google document with their targets and share it with their tutor, who makes sure they know what they need to do in order to meet their targets.

Teachers write brief comments on how each pupil is working towards the targets they set in Autumn term. They should be aimed at the pupils and hence written in the second person.
They are sent to parents, pupils and tutors, who discuss them with the pupils.

Early May
Pupils self-assess and teachers separately assess pupils’ effort.
Teachers meet with pupils and parents to discuss the effort assessments and progress towards their targets. During this meeting, targets are revised if appropriate.
Pupils update their target sheet and discuss this with their tutors, particularly focussing on targets that have remained from the autumn term.

Late June
Teachers mark the end of year assessments and work with the head of assessment to convert the scores into scaled scores, which are then reported to parents.

Appendix 1


Appendix 2

Let’s say Jamie scores 75% on an English test and 60% on a science test.
It appears at first that he’s doing better in English, but this does not take account of the difficulty of the test.
It could be that the class average in English was 80% and the average in science was 50%. Then Jamie is actually below average for English and above average for science. Pupils intuitively know this, which is why they want to ask their peers how they did after results of a test are delivered.
There is also a more subtle issue, which is that the results of some tests may be more spread out than others.

To account for differing averages and spread, we can standardise the scores in the following way:

The standardised score in every test will have an average of 100 and a ‘spread’ of 20. In the example of Jamie’s test results above, his English grade may (it would depend on the spread of results) be standardised to 93 and his Science grade may be standardised to 120.
This will allow his tutor and parents to compare these results fairly: he can’t use the classic excuse “but everyone did badly in English”.
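A minimal sketch of the calculation, assuming that ‘spread’ means the standard deviation of the year group’s scores: subtract the year group average, divide by the spread, multiply by 20 and add 100. The function names are made up for illustration, and the spreads used for Jamie’s tests (about 14.3 for English and 10 for science) are assumptions chosen so that the output matches the figures quoted above.

```python
from statistics import mean, pstdev


def standardise(score, cohort_mean, cohort_spread,
                target_mean=100, target_spread=20):
    """Rescale one raw score so the cohort has mean 100 and spread 20."""
    return target_mean + target_spread * (score - cohort_mean) / cohort_spread


def standardise_cohort(scores):
    """Standardise a whole year group's raw scores, rounded to whole numbers."""
    average = mean(scores)
    spread = pstdev(scores)  # population standard deviation of the cohort
    return [round(standardise(s, average, spread)) for s in scores]


# Jamie's results, with assumed spreads of 14.3 (English) and 10 (science).
print(round(standardise(75, 80, 14.3)))  # 93
print(round(standardise(60, 50, 10)))    # 120
```

In practice the average and spread would be calculated from the whole year group’s results, which is what `standardise_cohort` does.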

Next year, he will receive another science grade on the same standardised measure. Let’s say this is 115. In this case, we should be careful not to assume that he has done worse this year than last / made less than average progress in science. If, however, his score is 90 in science in the second year, this significant drop is probably worth investigating.

This system is not perfect:
It does not allow us to compare the performance of departments or teachers but we don’t believe that we should use test results to do this.
It doesn’t give students an idea of how they’re doing nationally. This issue is tackled in the feedback policy by relating standardised scores to GCSE grades.

Note that we are only talking about summative tests here, in which the aim is to “track pupils’ attainment and progress, to give them, their teachers and parents an idea of how they might perform in future external exams.”
Formative tests, which form the vast majority of testing, should not be analysed in this way and pupils should be discouraged from comparing their performance to each other.