OpenSAFELY: more proof that tackling the coronavirus pandemic does not require privacy to be compromised

Posted on May 23, 2020 by Glyn Moody

In recent weeks, there has been an intense focus on the use of contact tracing apps as a way to emerge safely from the lockdowns that are in place around the world. A key question is whether to use a centralized or decentralized architecture. After some division, the balance has firmly swung towards the latter, with only a few hold-outs such as the UK and France sticking with centralized approaches. That’s clearly good news for privacy, since it’s riskier to keep data in one location, both in terms of leaks and abuse by governments. But it’s not the only area where some see a tension between data protection and tackling the Covid-19 pandemic effectively.

For example, Vint Cerf, widely recognized as one of the creators of the Internet, has recently written an article for the Indian site Medianama entitled “Internet Lessons from COVID19”, in which he warns:

Variations of the European Union’s General Data Protection Regulation (GDPR) are propagating around the world with good intent although implementation has shown some unintended consequences, not least of which may be the ability to share health information that would assist in finding a vaccine against SARS-COV-2.

Coming from Cerf, that point carries some weight – although it is worth remembering that he is now employed by Google as the company’s “Chief Internet Evangelist”. Since Google naturally wants as few obstacles as possible to sharing information, of whatever kind, it’s perhaps not so surprising that Cerf is trying to raise questions about the GDPR’s “unintended consequences”. Nonetheless, given the gravity of the coronavirus situation, it’s certainly worth examining the extent to which the EU law might be having an adverse effect on research that could result in finding a vaccine, or in improving treatments for the disease.

In general, it’s clear that scientific information about Covid-19 is flowing around the world at an unprecedented rate. Two of the top sites for preprints – quick-release versions of academic research – already hold around 4000 papers relating to the disease. That such a large number have been posted in such a short time suggests there are no major barriers to sharing information. However, there is a particular area where the GDPR could be a significant factor. This is where research involves personal data about people, both those with and without coronavirus infections. Within the EU, this information is strictly controlled, and it is easy to see that GDPR rules might be holding up important work.

For example, it is vitally important to establish which factors are linked to deaths caused by Covid-19: this will allow health professionals to understand who is most at risk and to explore ways to mitigate the problems. In order to spot subtle effects, it is important to use as large a dataset as possible. However, the larger the dataset, the greater the risks to privacy, particularly when data is moved from where it is held to where it is analyzed.

An important academic paper from a group of researchers describes a way around this problem. OpenSAFELY is a new secure analytics platform for electronic health records in the UK’s National Health Service (NHS). It has been created to deliver rapid results during the global coronavirus emergency. It allows more than 24 million patients’ full pseudonymised primary care NHS records to be analyzed in detail. The basic software is open source, and can be freely downloaded from GitHub for security review, scientific review, re-use and re-writing. Here’s the key feature of the system:

OpenSAFELY uses a new model for enhanced security and timely access to data: we don’t transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments managed by the electronic health record software company; instead, trusted analysts can run large scale computation across live pseudonymised patient records inside the data centre of the electronic health records software company.

By interfacing with the secure databases used to store and manage this sensitive data, OpenSAFELY allows analyses to be conducted without copying the underlying data for external use. Instead, researchers see only aggregated results, which can then be analyzed and published. Although that might seem an obvious approach, it’s not one that has been adopted on such a massive scale before. The paper shows that it can be done with real-world datasets, and extremely quickly.
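To make the idea concrete, here is a minimal sketch in Python of what “the analysis goes to the data” looks like in practice. The names used here (SecureRecordStore, run_aggregate_query and so on) are entirely hypothetical and are not part of the actual OpenSAFELY codebase; the point is simply that the pseudonymised records stay inside the secure environment, and the only thing that comes back out is an aggregated, small-number-suppressed summary.

```python
# Illustrative only: the raw pseudonymised records never leave the secure
# environment; an analyst submits a query and gets back aggregated counts.
from collections import Counter
from typing import Callable, Dict, List, Optional


class SecureRecordStore:
    """Stands in for the database inside the EHR vendor's data centre."""

    def __init__(self, records: List[Dict]):
        # Row-level records are held privately; no method returns them.
        self._records = records

    def run_aggregate_query(
        self,
        group_by: str,
        where: Optional[Callable[[Dict], bool]] = None,
        min_count: int = 5,
    ) -> Dict[str, int]:
        """Count records per group, suppressing small cells so that
        individual patients cannot be singled out."""
        rows = [r for r in self._records if where is None or where(r)]
        counts = Counter(r.get(group_by, "unknown") for r in rows)
        return {k: v for k, v in counts.items() if v >= min_count}


# A toy "data centre" with a handful of fake records; in reality this
# would be millions of live pseudonymised primary care records.
store = SecureRecordStore([
    {"age_band": "70-79", "died": True},
    {"age_band": "70-79", "died": True},
    {"age_band": "80+", "died": True},
    {"age_band": "50-59", "died": False},
])

# The analyst only ever sees aggregated output like {'70-79': 2, '80+': 1}.
# (min_count is lowered here purely so the toy example prints something.)
print(store.run_aggregate_query("age_band", where=lambda r: r["died"], min_count=1))
```

Small-number suppression of the kind sketched above is a standard disclosure-control technique; the real platform’s safeguards are naturally far more extensive, but the basic contract is the same: queries go in, only aggregates come out.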

In some ways, the UK’s NHS is unusual. It is a centrally-managed health system that brings together health records in large, unified databases. But in other respects, OpenSAFELY’s approach should be applicable anywhere: it simply requires the companies writing healthcare data management software to provide interfaces that let researchers carry out controlled analyses on the databases where they sit. The aggregated results can then be extracted and combined with other findings.

The more general point is that privacy can be respected while carrying out much-needed research; it just requires innovative thinking to come up with alternative approaches. The great danger is that governments – and companies – might try to exploit the current extreme situation to call for data protection to be watered down with the plausible aim of encouraging more rapid medical breakthroughs. Privacy campaigners should be clear that this is by no means necessary, as the decentralized contact tracing apps and the new OpenSAFELY project both show.

Featured image by Paul Flannery.