How Barclays data scientists created an algorithm to anonymise data at scale
IT industry should be doing more to create privacy solutions for the big data age, says Raffael Strassnig
Is there a technical fix to allow privacy to keep pace with advances in big data technologies? That, in essence, was the topic addressed by Raffael Strassnig, VP data scientist at Barclays retail bank, in his presentation at the Computing Big Data Summit last week.
"Research into the anonymisation of datasets was very active until 2007 when big data became a major player, when Facebook took off and when we threw our private information at big companies and let them do what ever they wanted to do with it," he said.
"At that time researchers came to the conclusion that anonymity is a very hard thing to achieve."
In particular there were problems with the prevailing anonymisation methodologies, such as k-anonymity, which requires every individual in a dataset to be indistinguishable from at least k-1 others on any combination of identifying attributes. Such methods become increasingly difficult to apply the larger and more complex a dataset gets - the curse of dimensionality, as it's known: every extra attribute multiplies the number of possible combinations, so ever more individuals end up in a group of one.
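To make the constraint concrete, here is a minimal sketch of a k-anonymity check in Python; the records and attribute names are invented for illustration and have nothing to do with Barclays' data:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values is shared
    by at least k records, so no individual stands out."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Illustrative records: the 'N1' group contains one person, so 2-anonymity fails.
records = [
    {"age_band": "30-39", "postcode_prefix": "SW1", "income": 52000},
    {"age_band": "30-39", "postcode_prefix": "SW1", "income": 48000},
    {"age_band": "40-49", "postcode_prefix": "N1",  "income": 61000},
]
print(is_k_anonymous(records, ["age_band", "postcode_prefix"], k=2))  # False
```

Add more columns and almost every combination becomes unique, which is why such checks fail ever more often as datasets grow wider.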
The privacy issue is once again in the spotlight as people have become more concerned about their private data and as new legislation such as the EU General Data Protection Regulation (GDPR) is set to enshrine Privacy by Design and consent in law.
Strassnig reminded the audience that companies such as banks hold a great deal of very sensitive information about individuals and that the quantity of this information "is growing exponentially with no sign that this is going to stop".
"At Barclays we know where you've been, what you've bought, how much you're earning, who's living with you," he said, going on to discuss the many tensions that exist between the right to privacy and the utility of personal data to companies who base their business model on it.
Sensitive data must be closely guarded or its subjects will withdraw their consent to its use; at the same time, advances in technology are making it ever easier to distribute, reproduce and process such data at scale.
"For a company there's nothing worse than a customer withdrawing consent for them to use their data ... but on the other hand data is most useful when it's widely shared."
The most valuable data for predictive modelling is generally the highly sensitive information that is unique to an individual, such as financial or health records - precisely the data that must be shared with as few people as possible and kept secure. So how can this particular circle be squared? How can large, complex datasets containing sensitive data be effectively anonymised without seriously reducing the ability of analysts to extract useful information from them?
Strassnig and his team built an automated data anonymisation algorithm based on a paper by a PhD student. The approach clusters the data into non-overlapping k-means-style clusters, each containing at least k records to satisfy the k-anonymity constraint, while using a dissimilarity measure to minimise the information lost when the procedure is applied to the dataset.
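Strassnig did not name the paper or share code, so the following is only a generic sketch of the idea, assuming Euclidean distance as the dissimilarity measure: greedily group each record with its k-1 nearest unassigned neighbours, then publish every record as its cluster centroid.

```python
import numpy as np

def cluster_anonymise(X, k):
    """Greedy sketch of clustering-based k-anonymisation: group each
    unassigned record with its k-1 nearest unassigned neighbours, then
    replace every record with its cluster's centroid."""
    assert len(X) >= k, "need at least k records"
    unassigned = set(range(len(X)))
    out = np.empty_like(X, dtype=float)
    while unassigned:
        if len(unassigned) < 2 * k:
            # Too few records left to split into two valid clusters.
            members = np.fromiter(unassigned, dtype=int)
        else:
            seed = next(iter(unassigned))
            rest = np.fromiter(unassigned - {seed}, dtype=int)
            # Dissimilarity measure: Euclidean distance to the seed record.
            d = np.linalg.norm(X[rest] - X[seed], axis=1)
            members = np.append(rest[np.argsort(d)[:k - 1]], seed)
        out[members] = X[members].mean(axis=0)  # generalise to the centroid
        unassigned -= set(members.tolist())
    return out

# Invented two-column data (age, income); every output row is shared by >= 2 records.
X = np.array([[30, 52000], [31, 48000], [45, 61000], [44, 60000]])
print(cluster_anonymise(X, k=2))
```

Publishing centroids guarantees at least k identical rows per group while keeping the released values as close as possible to the originals - which is what minimising the dissimilarity-based information loss amounts to.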
"The big question is does this scale? Yes under certain assumptions," Strassnig said, explaining that a key innovation was speeding up the clustering process computationally using a data structure called a VP-tree.
In future, developments in the fields of homomorphic encryption and machine learning may allow datasets to be interrogated without anyone ever seeing the data, but that is still at the research stage, Strassnig said, going on to admonish the industry as a whole for its lack of ambition in this area.
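For a flavour of what is already possible, the open-source python-paillier library implements the Paillier scheme, which is homomorphic for addition only: anyone holding the public key can sum encrypted values, but only the private key holder can read the result. A toy sketch, with invented salary figures and no connection to anything Barclays runs:

```python
# Requires the open-source python-paillier package: pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Whoever holds only the public key can add encrypted values together
# without ever seeing the plaintexts.
salaries = [52000, 48000, 61000]          # invented figures
encrypted = [public_key.encrypt(s) for s in salaries]
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the private key holder can read the aggregate.
print(private_key.decrypt(encrypted_total))  # 161000
```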
"If a small data science team with one research paper can create a scalable anonymity algorithm in just two days for one data scientist, surely the whole IT industry can do better, and can come up with ideas and algorithms that can guarantee anonymity to a much greater level than we can now?"
@_johnleonard