How Replicable is the Imperial College Model?

11 June 2020. Updated 16 June 2020.

By Sue Denim

After Toby published my first and second pieces, Imperial College London (ICL) produced two responses. In this article I will study them. I’ve also written an appendix that provides some notes on the C programming language to address some common confusions observed amongst modellers, which Toby will publish tomorrow.

Attempted replication. On June 1st, ICL published a press release on its website stating that Stephen Eglen, an academic at Cambridge, was able to reproduce the numbers in ICL’s influential Report 9. I was quite interested to see how that was achieved. As a reminder, Imperial College’s Report 9 modelling drove lockdown decisions in many countries.

Unfortunately, this press release continues ICL’s rather worrying practice of making misleading statements about its work. The headline is “Codecheck confirms reproducibility of COVID-19 model results”, and the article highlights this quote:

I was able to reproduce the results… from Report 9.

This is an unambiguous statement. However, the press release quotes the report as saying: “Small variations (mostly under 5%) in the numbers were observed between Report 9 and our runs.”

This is an odd definition of “replicate” for the output of a computer program, but it doesn’t really matter because what ICL doesn’t mention is this: the very next sentence of Eglen’s report says:

I observed 3 significant differences:

1. Table A1: R0=2.2, trigger = 3000, PC_CI_HQ_SDOL70, peak beds (in thousands): 40 vs 30, a 25% decrease.
2. Table 5: on trigger = 300, off trigger = 0.75, PC_CI_HQ_SD, total deaths: 39,000 vs 43,000, a 10% increase.
3. Table 5: on trigger = 400, off trigger = 0.75, CI_HQ_SD, total deaths: 100,000 vs 110,000, a 10% increase.

In other words, he wasn’t able to replicate Report 9. There were multiple “significant differences” between what he got and what the British Government based its decisions on.

How significant? The supposedly minor difference in peak bed demand between his run and Report 9 is 10,000 beds, roughly the size of the entire UK field hospital deployment. This supports the argument that ICL’s model is unusable for planning purposes, even though planning is the entire justification for its existence.

Eglen claims this non-replication is in fact a replication by arguing:

although the absolute values do not match the initial report, the overall trends are consistent with the original report

A correctly written model will be replicable to the last decimal place. When using the same seeds and same input data the expected variance is zero, not 25%. Stephen Eglen should retract his “code check”, as it’s incorrect to claim a model is replicable when nobody can get it to generate the same outputs that other people saw.

Number of simulation runs. ICL have contradicted themselves about how Report 9 was generated. Their staff previously claimed that, “Many tens of thousands of runs contributed to the spread of results in report 9.” In Eglen’s report we see a very different claim. He explains some of the difference between his results and ICL’s by saying:

These results are the average of NR=10 runs, rather than just one simulation as used in Report 9

Imperial College’s internal controls are so poor they can’t give a straight accounting of how Report 9 was generated.

The point of stochasticity is to estimate confidence bounds. If incorporating random chance into your simulation changes the output only a little, you can assume random chance won’t affect real-world outcomes much either, which increases your confidence. Report 9 is notable for not providing any confidence bounds whatsoever. All numbers are given as precise predictions in different scenarios, with no discussion of uncertainty beyond a few possible values of R0. None of the graphs render uncertainty bounds either (unlike e.g. the University of Washington model). The lack of bounds would certainly be explained if the simulation was run only once.
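For illustration, estimating such bounds needs nothing exotic: run the model many times with different seeds and report the spread. A minimal sketch follows; `run_model` is a hypothetical stand-in for a real simulation, and the mean ± 2 standard errors interval is the crudest possible choice of bound.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Hypothetical stand-in for a stochastic epidemic model run.
double run_model(unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> outcome(1000.0, 50.0);  // assumed output scale
    return outcome(rng);
}

struct Summary { double mean, lo, hi; };

// Run the model n_runs times with distinct seeds and summarise the spread.
Summary summarise(int n_runs) {
    std::vector<double> out;
    for (int i = 0; i < n_runs; ++i) out.push_back(run_model(1000 + i));
    double mean = 0.0;
    for (double v : out) mean += v;
    mean /= out.size();
    double var = 0.0;
    for (double v : out) var += (v - mean) * (v - mean);
    var /= (out.size() - 1);                      // sample variance
    double se = std::sqrt(var / out.size());      // standard error of the mean
    return {mean, mean - 2 * se, mean + 2 * se};  // crude ~95% bounds
}
```

A report built this way can honestly say “we expect X, plus or minus Y”, which is what decision-makers actually need.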

People working on the ICL model have argued the huge variety of bug reports they received don’t matter, because they just run it repeatedly and average the outputs. This argument is nonsense as discussed repeatedly, but if they didn’t actually run it multiple times at all then the argument falls apart on its own terms.

Models vs experiments. The belief that you can just average out model bugs appears to be based on a deep confusion between simulations and reality. A shockingly large number of academics seem to believe that running a program is the same thing as running an experiment, and thus any unexplained variance in output should just be recorded and treated as cosmic uncertainty. However, models aren’t experiments; they are predictions generated by entirely controllable machines. When replicating software-generated predictions, the goal is not to explore the natural world, but to ensure that the program can be correctly tested, and to stop model authors simply cherry-picking outputs to fit their pre-conceived beliefs. As we shall see, that is a vital requirement.

Does replication matter? It does. You don’t have to take my word for it: ask Richard Horton, editor of the Lancet, who in 2015 stated:

The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness. As one participant put it, “poor methods get results”.

Alternatively ask Professor Neil Ferguson, who is a signatory to this open letter to the Lancet requesting retraction of the “hydroxychloroquine is dangerous” paper because of the unreliability of the data it’s based on, supplied by an American health analytics company called Surgisphere. The letter justifies the demand for retraction by saying:

The authors have not adhered to standard practices in the machine learning and statistics community. They have not released their code or data.

ICL should give the authors the benefit of the doubt – maybe Surgisphere just need a couple of months to release their code. They are peer-reviewed experts, after all. And statistics isn’t a sub-field of epidemiology, so according to Imperial College spokespeople that means Ferguson isn’t qualified to criticise it anyway.

Initial response and the British Computer Society. Via its opinion writers, the Daily Telegraph picked up on my analysis. ICL gave them this statement:

A spokesperson for the Imperial College COVID-19 Response Team responded to criticism of its code by saying the Government “has never relied on a single disease model to inform decision-making”.

“Within the Imperial research team we use several models of differing levels of complexity, all of which produce consistent results. We are working with a number of legitimate academic groups and technology companies to develop, test and further document the simulation code referred to. However, we reject the partisan reviews of a few clearly ideologically motivated commentators.”

The first statement, that the Government “has never relied on a single disease model”, is typically misleading. In the SAGE publication from March 9th addressing lockdowns, the British Government was given the conclusions of the SPI-M SAGE subgroup in tables 1 and 2. On page 8, that document states the tables and assumptions are sourced to a single paper from ICL which has never been published, but from the title and content it seems clear that it was an earlier draft of Report 9. There is no evidence of modelling from any other institution contributing to this report. In other words, it doesn’t appear to be true that the Government has “never” relied on a single model: that’s exactly what it was fed by its own advisory panel.

The second statement, the dismissal of “ideologically motivated commentators”, is merely unfortunate. By ideologically motivated commentators they must have meant the vast array of professional software engineers who posted their reactions on Twitter, on GitHub and on this site. The beliefs of the vast majority in the software industry were summarised by the British Computer Society (BCS), a body that represents people working in computer science in the UK. The BCS stated:

Computer code used to model the spread of diseases including coronavirus “must meet professional standards” … “the quality of the software implementations of scientific models appear to rely too much on the individual coding practices of the scientists who develop them”

Is Imperial College going to argue that the BCS is partisan and ideologically motivated?

On motivations. It’s especially unfortunate when academics defend themselves by claiming their critics – all of them, apparently – are ideological. Observing that coding standards are much higher in the private sector than in the academy isn’t even controversial, let alone ideological, as shown by the numerous responses from academics agreeing with this point, and stressing that they can’t be expected to produce code up to commercial standards. (They “need more funding”, obviously.)

But in recent days people have observed that “for months, health experts told people to stay home. Now, many are encouraging the public to join mass protests.” The world has watched as over 1,200 American epidemiologists, academics and other public health officials published an open letter which said: “[A]s public health advocates, we do not condemn these gatherings as risky for COVID-19 transmission … this should not be confused with a permissive stance on all gatherings, particularly protests against stay at home orders.”

According to “the science” the danger posed by this virus depends on the ideological views of whoever is protesting. This is clearly nonsense and explains why Imperial College administrators were so quick to accuse others of political bias: they see it everywhere because academia is riven with it.

To rebuild trust in public science will require a firm policy response. As nobody rational will trust the claims of academic epidemiologists again any time soon, as the UK’s public finances are now seriously damaged by furlough and recession, and as professional modelling firms are attempting to develop reliable epidemic models themselves anyway, it’s unclear why this field should continue to receive taxpayer funding. The modellers with better standards can, and should, advise the Government in future.

Appendix: Common errors when working with C/C++. This section is meant only for modellers. Non-modellers or programmers already familiar with these languages should stop reading here.

The C/C++ programming languages are unlike most others. It’s apparent from talking to some modellers that this isn’t sufficiently clear. Some believe that the impact of bugs (any bugs) is always likely to be small relative to errors in assumptions, which isn’t the case. An academic working in molecular biology wrote an open letter in response to my analysis, arguing that the ICL fiasco is the fault of software developers for not putting warning labels on C++:

It’s you, the software engineering community, that is responsible for tools like C++ that look as if they were designed for shooting yourself in the foot. It’s also you, the software engineering community, that has made no effort to warn the non-expert public of the dangers of these tools. Sure, you have been discussing these dangers internally, even a lot. But to outsiders, such as computational scientists looking for implementation tools for their models, these discussions are hard to find and hard to understand.

Blaming professional software engineers for disasters caused by untrained academics is hardly a helpful take, especially given the attacks on “armchair epidemiologists”, yet the problem he identifies is clearly a real one. Very few scientists work with C/C++. They mostly prefer R or Python. These languages are far better choices and don’t suffer the problems I’m about to outline, but are less efficient than C++. If you want something efficient yet safe, try exploring a more modern language like Kotlin for Data Science.

As these articles have been seen by a lot of scientists, I’ll now provide some quick explanations meant for that audience. If you’re a scientist working with C/C++ be aware of the following things:

  • Firstly and most critically, if at all possible don’t use these languages. They are designed for efficiency above all else. The ICL COVID-Sim program has several cases of so-called memory safety errors. Beyond data corruption, memory safety errors can create security vulnerabilities that could lead to your institution getting hacked. Google employs some of the best C++ programmers in the world and has a large industrial infrastructure devoted to catching memory safety errors. Despite that they routinely ship exploitable bugs to the Chrome userbase. To stop this leading to the sort of “code red” security disasters that were routine in the early years of the 21st century, they built a complete firewall around their own code (the “sandbox”) that assumes they have in fact failed and which tries to contain any subsequent attacks. They also moved to silent security upgrades that users cannot control. In this analysis the Chrome engineering team show that around 70% of all Chrome security bugs are related to memory safety and explore moving away from using C++.
  • A few modellers who commented believed that memory safety errors would surely cause the program to crash, so if it didn’t crash when producing Report 9, any bugs must have been introduced afterwards. This isn’t the case. Crashing is an intentional process started by the operating system when it detects a violation of system rules, indicating internal corruption inside a program. It is a best-effort process, because the OS cannot detect every possible corruption. If a bug incorrectly sets an integer variable to zero and you then divide by it, your program will crash, because the hardware traps integer division by zero. If the bug incorrectly sets the variable to anything else, the division will succeed and yield an invalid result. Likewise, allocating a list with five elements and trying to read the tenth will reliably crash in almost every language except C and C++, which omit index checks for performance reasons. If you read the tenth element of a five-element list and the OS doesn’t detect that, your variable will be set to an arbitrary value.
  • Memory safety errors do not yield uniformly random values. As I’ve repeatedly stressed, some modellers appeared to believe that memory safety errors don’t matter if you average the results. An out of bounds read like this bug is far more likely to yield some values than others, for example, 0, 1, -1, INT_MAX, INT_MIN and pointers into a heap arena or stack frame. You have no idea what it’ll be and cannot predict it, so don’t try.
  • Memory safety errors open up what’s called “undefined behaviour”. The compiler is allowed to assume your program has no memory safety errors in it, even though that may be hard to achieve in practice for large programs. It may change your program in complex ways before you run it, based on that assumption. For example, the compiler may silently delete parts of your code, including important parts like security checks. Obviously if the program you’re actually running silently skips a step in your model the results are scientifically meaningless, even if they look plausible.

A major understanding gap between the software industry and academic science appears to be caused by this last point. Once your program contains undefined behaviour you cannot reason about what it will do or whether the outputs are correct. Common sense logic like “it looks right to me” means nothing because something as trivial as an overnight operating system update could cause the results to change totally. Any chance of reliably replicating your results goes out the window. To a software engineer, a program with memory safety errors could do literally anything at all, which is why it’s seen as pointless to argue about whether such a bug has a significant effect.

Here are two more issues that can bite you when working with C/C++:

  • The default random number generators are often too weak for scientific purposes. COVID-Sim attempted to solve this with its own RNG, but that was also buggy, so it probably just made things worse. If you need a fast source of pseudo-random numbers for Monte Carlo techniques, use a pre-written open-source RNG that has been run against a battery of statistical tests. Mersenne Twisters work if you’re careful, but a better algorithm is Xorshift+. For example, this article gives implementations of xorshift and splitmix. Treat your RNG with care, especially when splitting the stream in multi-threaded contexts. If you’re working with something safer like Java or Kotlin on the JVM, SplittableRandom exists to help you.
  • When doing floating point calculations in parallel, don’t add the result of each thread to a shared variable at the end of the loop. This can (a) cause lost writes if you forget to use an interlocked exchange, and (b) cause non-deterministic runs due to the non-associativity of floating point arithmetic.

Because so many C/C++-specific bugs are avoidable with experience, if you decide your research needs the performance C or C++ offers, you should seek funding to hire a software engineer who has worked with these languages for several years. Resist the temptation to go it alone: you risk the reputation of your institution by doing so.
