NHS care.data - a horribly botched operation

If respect for personal data had been put at the core of the care.data programme from the start, the PR disaster of its rollout could have been avoided

The care.data programme, currently being piloted in a few hundred GP practices before a planned national rollout, expands NHS England's collection of medical data for analytical use. Since 1989, information on hospital stays has been collected for analysis, and the new programme will extend this to cover data resulting from patients' visits to their GPs. Evidence from studying such data can be invaluable for medical research, including epidemiology and screening for cancer, and for monitoring key information, such as how many patients a doctor has seen.

The rollout, which began last year, is widely considered a badly botched operation because of the high-handed way it was communicated, with patient consent taken for granted, and because of a lack of transparency and clarity about what the programme aims to achieve. Its cause was not helped when it was revealed that NHS hospital data had been sold to insurance firms, despite assurances during the care.data rollout that patient data was not for sale. Put simply, the public is not convinced, and neither are many GPs.

At the root of the backlash is concern about what happens to our personal information. While many people have grown comfortable with sharing at least some aspects of their personal lives in return for tailored services, and even targeted marketing, information about their health is a different matter, especially if it could potentially be used to their disadvantage by healthcare, insurance or finance companies. With care.data, no one seems to know exactly what data will be shown to whom, and for what purpose.

It's personal
In terms of how it is treated, the difference between personal data and anonymised data is that the former is subject to data protection legislation whereas the latter is not. So, aggregated data of the sort used by statisticians to track health trends, to determine the spread of an illness or to find national average height and weight figures is anonymous and may be published in the public domain. Information that might allow the identification of a living individual may not, at least not without the consent of that person.

Sounds straightforward, but for research purposes a simple two-way split between the personal and the anonymised is not always possible. For example, investigation of rare diseases often requires study of individual records rather than aggregated data. Or a researcher might need to track a patient's interactions with the NHS over time. The workaround that allows this is called pseudonymisation. Pseudonymisation enables data to be linked to the same person across multiple records or information systems without revealing that person's identity. Identifiers (such as the NHS number, gender and age) are included in the record, but personal information (such as names and doctor's notes) is removed or replaced by a randomised string of characters.
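
As a loose illustration, here is a minimal sketch of what record-level pseudonymisation can look like in practice. The field names and the keyed-hash (HMAC) approach are assumptions for the sake of the example; this is one common technique, not necessarily the mechanism care.data itself uses.

```python
import hashlib
import hmac

# Hypothetical secret key held only by the data controller. Without it,
# no one else can regenerate or reverse the pseudonyms.
SECRET_KEY = b"held-only-by-the-data-controller"

def pseudonymise(record: dict) -> dict:
    """Return a research copy of a patient record: direct identifiers
    are dropped and the NHS number becomes a stable pseudonym."""
    # Same NHS number + same key => same pseudonym, so one patient's
    # records can still be linked across datasets.
    pseudonym = hmac.new(SECRET_KEY, record["nhs_number"].encode(),
                         hashlib.sha256).hexdigest()
    return {
        "pseudonym": pseudonym,
        "gender": record["gender"],   # quasi-identifiers kept
        "age": record["age"],         # for research value
        "diagnosis": record["diagnosis"],
        # name and free-text notes are removed entirely
    }

visit = {"nhs_number": "9434765919", "name": "Jane Doe", "gender": "F",
         "age": 47, "diagnosis": "asthma", "notes": "Patient reports..."}
print(pseudonymise(visit))
```

The catch, as the rest of this article explores, is that whoever holds the key can rebuild the mapping, and the quasi-identifiers that remain may be enough to re-identify someone anyway.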

Most people would be happy for their data to be used to improve medical science, so long as the process were consensual and transparent and they were satisfied that sensitive information about their health would not be passed on elsewhere (more about that later). But issues arise when defining what fields in the dataset are considered "personal", and what happens when pseudonymised data leaves the confines of the NHS, such as if it is sold on to the private sector.

Horse trading, grey areas and loopholes
Issues around data privacy legislation inevitably involve a good deal of horse trading. Cultures and laws around personal privacy vary markedly from country to country (in the US the rules are far less stringent than they are in the EU, for example). State bodies such as intelligence agencies, and global internet corporations whose business models depend on processing personal data, frequently lobby hard for privacy laws to be levelled down rather than up. On the other hand, many citizens are becoming increasingly concerned about what happens to their data, not least as a result of Edward Snowden's revelations, and are pushing for better protection and transparency.

Privacy campaigner Caspar Bowden says that within the EU, the UK and Ireland have historically tended to oppose strengthening privacy laws, interpreting EU directives in such a way that the effect on commercial interests and state access is minimised. With regard to the treatment of pseudonymised data, he says the UK and Ireland failed to transpose part of Recital 26 of the EU Data Protection Directive into the Data Protection Act 1998. Recital 26 states that data shall be considered anonymous if the person to whom the original data referred cannot be identified by the controller, or by any other person. Bowden says the UK and Ireland omitted the "or by any other person" part, and thus "created EU DP's nastiest loophole: that UK and Ireland treat pseudonymised data as anonymous".

Narrowing the definition of what is considered "personal" in this way certainly creates a lot of grey areas. Should an IP address be considered anonymous or personal? It is unique to a property or a router rather than to an individual, but if there are only one or two inhabitants in a house, can an IP address really be considered anonymous? With a little more data pertaining to the same IP address, an individual can often be easily identified, along with their online activity. The same applies to other identifiers that come very close to being unique to an individual, such as a car number plate or even a postcode. And what about geo-positioning information relating to a smartphone? Removing the "or by any other person" piece from the DPA, suggests Bowden, is at the root of the mistrust.

Kathryn Wynn, senior associate at law firm Pinsent Masons, explains that in enforcing the DPA the Information Commissioner's Office (ICO) takes the likelihood of re-identification into account when distinguishing between personal and anonymous data.

"ICO guidance on determining what is personal data ... depends on 'all the means likely reasonably to be used either by the controller or by any other person to identify the said person'," she says. So, although the "or by any other person" piece didn't make it into the DPA, confusingly it is included in the ICO's guidance.

Wynn also points out that while the DPA works on a binary basis (information is either personal data or it is not), the ICO's Anonymisation Code of Practice does treat pseudonymised information differently.

"Pseudonymised data is more likely to require careful treatment and be disclosed only to limited categories of recipients and subject to tight controls. Accordingly, it is not in our opinion necessarily a loophole," she says.

The personal as probability
The likelihood of an individual being identified by combining disparate datasets is an important part of the definition of personal data: "...The mere existence of such information elsewhere should not make the data personal within the meaning of the Directive. There must be a reasonable likelihood of the two pieces of information being capable of being brought together," says a UK government discussion document issued prior to the passing of DPA 1998 into law.

But such an occurrence is much more likely now than it was back in 1998, for a number of reasons.

One is the sheer amount of data that is now available. The rise of open data initiatives, which many European countries including the UK are adopting enthusiastically, means that there is a lot more data free to download from government sites like data.gov.uk. (Incidentally, the money to fund these initiatives often comes from selling data - the Health and Social Care Information Centre (HSCIC), the body administering care.data, has even published a price list for data linkage and extract services.) Then there is the growing industry of third-party brokers who gather up data assets from across the web and aggregate, integrate and sell that information on. Floods of personal data are also available from the social "firehose" of sites like Twitter. And last but by no means least are the extensive customer databases that have been built up by all types of organisations.

So, anyone wanting to cross-match pseudonymised data to create a detailed personal profile of someone is spoiled for choice. The tools at their disposal are advancing all the time too. At Computing we speak frequently to big data companies and practitioners. Time and again they tell us how easy it is to crunch disparate datasets to identify individuals with a high degree of probability, even in the absence of some sort of consistent key to link them together.

Interviewed recently by Computing on this subject, David Davis MP gave the example of his own medical records. He argued that he could be easily identified from pseudonymised data if it were to be cross-referenced with information that is in the public domain, such as his age and the fact that he's broken his nose five times. "That takes it down to about 50... even if they take out all postcodes, there will be other things in the public domain, which have been in the newspapers," he said.
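
A toy sketch shows how little is needed. The data and field names below are invented, but the join is exactly the kind of cross-matching Davis describes: filter a pseudonymised extract on a handful of publicly known attributes and see how few candidates remain.

```python
# Toy linkage attack on invented data: filter a pseudonymised extract
# using facts about the target that are already in the public domain.
pseudonymised_records = [
    {"pseudonym": "a91f", "age": 66, "region": "YO", "nose_fractures": 5},
    {"pseudonym": "77c2", "age": 66, "region": "YO", "nose_fractures": 0},
    {"pseudonym": "d4e8", "age": 34, "region": "SW", "nose_fractures": 5},
]

# Quasi-identifiers gleaned from newspapers and public records
public_profile = {"age": 66, "region": "YO", "nose_fractures": 5}

matches = [r for r in pseudonymised_records
           if all(r[k] == v for k, v in public_profile.items())]

# A single match means the "anonymous" record is re-identified outright
print(f"{len(matches)} candidate record(s):", matches)
```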

Indeed, there is a lively debate around how much data you need to identify someone. In his blog 33 Bits of Entropy, Princeton professor Arvind Narayanan postulates that you need just 33 bits of information about any person in the world to determine who they are. And if you know their hometown, and that town has 100,000 people or fewer, then you already have at least half of those bits of information.
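
Narayanan's figure is simple arithmetic: 2^33 is roughly 8.6 billion, more than the world's population, so 33 independent yes/no facts suffice to single anyone out. A couple of lines verify the hometown claim too:

```python
import math

world_population = 6.6e9             # roughly the figure Narayanan used
print(math.log2(world_population))   # ~32.6, so 33 bits single anyone out

# A hometown of 100,000 people narrows 6.6 billion candidates to 100,000,
# worth log2(6.6e9 / 1e5) = ~16 bits: about half of the 33 already gone.
print(math.log2(world_population / 100_000))
```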

Crimes & misdemeanours
Following the revelation that hospital data was being sold to insurance companies, the law was changed in 2014. The NHS is now barred from selling personal data unless there is a "clear health benefit", rather than for "purely commercial" use by insurers and other companies. However, there are plenty of grey areas here too. Many insurance and pharmaceutical companies undertake their own research activities, either on their own or through partnerships with universities, and may still be able to get access to pseudonymised data that way. If information such as postcode and date of birth is included in that data, such companies could potentially reap huge commercial rewards by cross-matching it to create detailed personal profiles.

However, this would be illegal, says Pinsent Masons' Wynn:

"If there is evidence of re-identification taking place, such as reconstructing individual profiles using fragments of data from different data sets - the ICO is likely to take regulatory action, including potentially imposing a fine of civil monetary penalty of up to £500,000 or take enforcement action if the data was collected through re-identification process without knowledge or consent of the individual," she says.

But detecting such a crime and providing enough evidence for a prosecution would not be easy, and given the value of such information in providing a competitive edge, the temptation for companies or individuals to bend the rules might just prove too great. Also, the re-identification might well take place outside the UK's jurisdiction, and in any case the ICO does not have a strong record of punishing private companies that break the rules.

There is also the thorny issue of leaks and hacks, either from HSCIC or from the third parties with whom the data is shared. It has been noted that there is no stipulation that GP data needs to be encrypted before it is uploaded, and in 2014 it was revealed that more than two million "serious data breaches" of NHS patient records had been reported since the start of 2011.

And then there is the cloud. Last year, MP and GP Sarah Wollaston questioned how, in 2011, PA Consulting Group was able to upload the entire NHS hospital patient database for England to Google's cloud, where it was processed using the firm's BigQuery platform. The data was pseudonymised, but it is not known what information was redacted. This is a serious issue because of the asymmetry between the data protection laws of the US and the EU, and because, as a US firm, Google is subject to the Patriot Act and Section 702 of FISA, giving the US government a legal right to access its servers. Moving such data outside of the UK may be illegal under the DPA.

The ins and outs of consent
The issue of consent has been central to the problems with the care.data rollout. There have been numerous complaints that the public has been under-informed about what can be done with their data, and that the option to opt out was not made clear in the initial leaflet sent to households, which seemed to assume consent. There is now an opt-out, which confusingly offers two different options, Type 1 and Type 2, but there have been issues with what these mean too. Recently it has been suggested that patients choosing a Type-2 opt-out may miss out on bowel cancer screening. There is no opt-in option, which the British Medical Association says should be the method by which the scheme achieves consent.

However, some academics believe that having an opt-in would reduce the amount and the quality of data available for research.

Speaking to the European Data in Health Research Alliance recently, Professor Simon Wessely, vice dean for academic psychiatry at King's College London, explained that some groups are more likely to give consent for their data to be used than others, with young unmarried men being the least likely to respond. This can skew the results of research.

"Many countries have national disease registries where patient data is included without consent," Wessley said.

The current system in the UK "provides a system of checks and balances - in which the public interest in having valid answers to important questions is balanced against the risk of any detriment, distress or harm to the individual, as well as the continued importance of confidentiality and autonomy. It's a good system that has worked well for both the individual and the public," he said.

Arguably, if HSCIC had done a better job of explaining care.data, and specifically if it had barred the commercial use of personal or pseudonymised health data by private companies from the off, such concerns by researchers would have been reduced. The public are generally happy to consent to data being used for research, but suspicious about others profiting from their sensitive information. These suspicions make large-scale opting out more likely.

Other measures that should have been put in place include stipulating that pseudonymised data cannot be reconstituted, for example by using one-way encryption. That way the data would be truly anonymised rather than merely treated as anonymised, protected by technology rather than by a confusing legal framework. Currently the distinction between reversible and irreversible pseudonymisation is not even considered in the ICO's Anonymisation Code of Practice.
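
To sketch what irreversible pseudonymisation could look like (an assumption about one possible design, not a description of current HSCIC practice): derive each pseudonym with a one-way keyed hash, then destroy the key once the datasets have been linked, so that not even the data controller can regenerate or reverse the mapping.

```python
import hashlib
import hmac
import os

# One-off random linkage key. An unkeyed hash would be brute-forceable,
# since valid NHS numbers form a small, guessable space.
linkage_key = os.urandom(32)

def irreversible_pseudonym(nhs_number: str) -> str:
    return hmac.new(linkage_key, nhs_number.encode(),
                    hashlib.sha256).hexdigest()

records = [("9434765919", "visit A"), ("9434765919", "visit B")]
linked = [(irreversible_pseudonym(n), visit) for n, visit in records]

# Destroying the key makes the pseudonymisation irreversible: nobody,
# including the data controller, can now regenerate or reverse it.
del linkage_key
print(linked)  # both visits share one pseudonym; identity unrecoverable
```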

Change on the horizon?
The political horse trading is still going on, of course, but NHS England may be forced to change the way it treats personal and pseudonymised data by the European Commission's (EC's) General Data Protection Regulation (GDPR), which is expected to see the light of day at the end of this year. GDPR replaces the woefully out-of-date 1995 Data Protection Directive. The current draft of GDPR contains several Articles that deal with pseudonymisation in relation to health data. Processing personal data for the purposes of "historical, statistical or scientific research" is likely to remain lawful only if these purposes cannot otherwise be fulfilled using anonymous data and if "data enabling the attribution of information to an identified or identifiable data subject is kept separately from the other information under the highest technical standards, and all necessary measures are taken to prevent unwarranted re-identification of the data subjects".

Wynn says that under the draft regulation, data processing for care.data will likely (i) require consent; (ii) require obligatory pseudonymisation; and (iii) give patients a right to object. However, she believes it is likely that HSCIC is already compliant with the first two points as a result of changes made during the rollout. But the right to object to data processing for research may lead to some changes in the way it operates.

On the face of it then, the new regulations from Europe are unlikely to be enough to answer the many objections of the public and the medical profession to the way that sensitive personal data is treated under care.data. The initial rollout had to be canned after these objections were aired. The current Pathfinder pilot project now taking place in a few hundred practices may iron out a few more problems, but perhaps the whole scheme should be rebooted, with the duty to protect personal data placed at the centre of the programme, rather than tagged on as an afterthought.