Anonymous data can be 'de-anonymised' to reveal people's real identities, researchers warn

Machine learning algorithm can identify 99.98 per cent of people in any anonymised dataset, claim Imperial College researchers

Researchers have developed an algorithm that can correctly determine the real identities of individuals from anonymised data sets using just 15 demographic attributes.

The research, conducted by scientists from Imperial College London and UCLouvain in Belgium, indicates that current methods of data anonymisation can't protect complex data sets of personal information against re-identification.


The new study, published in the journal Nature Communications, shows that machine learning algorithms can easily reverse-engineer such anonymous data to re-identify individuals with a high degree of accuracy.

According to the researchers, their new tool can re-identify 99.98 per cent of Americans in any available anonymised dataset by using only 15 attributes, including gender, age, and marital status.

"While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog," explained study co-author Dr Luc Rocher of UCLouvain.

Such details could enable buyers of the supposedly anonymous data to create detailed personal profiles of individuals.

Dr Yves-Alexandre de Montjoye, of Imperial's Department of Computing and Data Science Institute, pointed out that while personal data is covered by GDPR, once it has been anonymised it can be sold to anyone.

"Although they [companies] are bound by GDPR guidelines, they're free to sell the data to anyone once it's anonymised. Our research shows just how easily - and how accurately - individuals can be traced once this happens.

"Companies and governments downplay the risk of re-identification by arguing that the datasets they sell are always incomplete. Our findings show this might not help.

"The results demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for."

Professor Julien Hendrickx from UCLouvain added: "We're often assured that anonymisation will keep our personal information safe. Our paper shows that de-identification is nowhere near enough to protect the privacy of people's data."

The researchers have also published an online tool "to help people see which characteristics make them unique in datasets". The tool is for demonstration purposes only and doesn't save users' data.

In recent years, major tech firms have faced intense scrutiny from the public and data privacy watchdogs over their handling of user data.

Earlier this year, privacy campaigners said that they had found new evidence to back up claims that internet giant Google is not complying with the EU's General Data Protection Regulation.

And in May, a lawyer for Facebook told a judge in a US court that Facebook users should not expect privacy on Facebook, as there is no user privacy on any social media platform. The company has also been accused of hawking users' smartphone data to telecoms and phone makers.

Healthcare and technology firms often collect user data - including information from healthcare records - and convert it into supposedly anonymous data.

Such data doesn't include personally identifiable information, such as names, email addresses and phone numbers. Stripping the data of identifying attributes is intended to ensure, at least in theory, that nobody can identify any individual in that data.

Moreover, such anonymised data is no longer subject to data protection regulations such as GDPR, and can be freely shared with data brokers and advertising firms.
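For illustration, here is a minimal, hypothetical sketch of what that stripping step typically amounts to: the direct identifiers are dropped, but the quasi-identifiers that the Imperial and UCLouvain researchers exploit are left untouched (the field names below are invented for the example):

```python
# Illustrative sketch only: "de-identification" by removing direct identifiers.
# The remaining quasi-identifiers (age, sex, postcode, ...) stay in the data.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def deidentify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "phone": "555-0100",
    "age": 34,
    "sex": "female",
    "postcode": "SW7 2AZ",
    "diagnosis": "asthma",
}

print(deidentify(record))
# {'age': 34, 'sex': 'female', 'postcode': 'SW7 2AZ', 'diagnosis': 'asthma'}
# It is exactly this remaining combination of attributes that the study
# shows can be used to re-identify individuals.
```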