We have a big data problem, but it's not what you think
Does big data do as much harm as good?
The last decade has seen the world's store of data grow faster than at any other point in history. We're now creating more data every two days than was created from the dawn of civilisation all the way up to 2003 (around five exabytes). According to Hal Varian, Google's chief economist, humanity will hold 53 zettabytes (that's 53 trillion gigabytes) of data by 2020.
Smartphones and social media have been instrumental in that, but so have professional uses like business analysis and scientific research. This data has enabled massive changes to the way that people live and work, even prompting developments in machine learning, AI and storage to handle it all.
But more data is no good if it isn't being used - and sometimes its collection, not to mention the way that it's used, is itself questionable. Is big data a big problem? It can be if handled incorrectly.
Profiling can reinforce perceptions
Facebook is one of the many websites that display targeted adverts based on what you've been talking about on social media, and Google edits its results based on your search history. Been reading a lot of pro-McDonald's stories? Eventually your top results will be nothing but praise for McDonald's, and Burger King will only be noticeable by its absence. That creates an echo chamber effect, which can limit your world view.
"Any tool can be used for good or bad," says Stephen Brobst of Teradata, referencing an idea put forth in Chris Anderson's The Long Tail: because the internet is not finite, we should be exposed to many more interesting and innovative ideas than anyone was beforehand. That theory doesn't hold water if people are automatically directed to what algorithms think that they'll already agree with:
"The reality is that, largely because of recommendation engines funnelling people to the lowest common denominator...we go to the average much faster than before. The recommendation engines often push us away from the interesting and innovative things [and] towards the boring norm."
Dredging the data swamp
Can you blame the data? It is more likely that the problem comes down to people and the uses that they make of it. Brobst argues for more sophisticated tools:
"We always need better analytical tools… An individual has lots and lots of dimensions to it and traditional mathematics don't handle high dimensionality very well… Better tools definitely offer an advantage. Most organisations are not using the latest stuff yet but they will be soon… Driving towards understanding individuals rather than forcing individuals into the pack...
"In the end these recommendation engines' goal is to influence behaviour. You can influence behaviour in a way that allows someone to exploit their individuality more effectively, or you can influence their behaviour and force them to the norm… The ‘boring norm' approach is probably commercially easier, but I think in the long run not as beneficial to the consumer nor, ultimately, to an innovative business."
You can't just turn a handle and get gold out - Stephen Brobst
Better tools will help to gain more value from the massive repositories of raw data that companies hold until it is needed. Ian Farquhar, cybersecurity strategist at Gigamon, refers to blind use of data lakes as a "cargo cult". He told us: "The concept of taking huge quantities of data [and] putting it in some data lake in the hope that some commensurate value will be derived is looking only at the (generally nebulous) results, without looking at the risks inherent in that data lake… The hoarding of huge amounts of data in the hope that some future business value will be derived is questionable at best, and outright reckless at worst."
"You can't just pour a bunch of data in, turn a handle and think you're going to get gold out," Brobst said.
When does persuasion become manipulation?
The question of ethics and big data has come up several times over the last year, especially in regards to politics. A full discussion about comparative morality is probably beyond the scope of this article; but here are some of the salient points:
- Targeting is not unethical. If you are reasonably certain that people in an area are going to split their votes, then aiming messaging at them is a good use of resources. That was key to Trump winning the 2016 election; his team made excellent use of the data that they had.
- Manipulation is a grey area. When does persuading the electorate to lean one way turn into outright vote manipulation?
- Using big data to target the needy is wrong. Data processors who use their information to take advantage of a person or group cross a moral line. Example: the lottery is overwhelmingly played by people below the poverty line, because they are targeted by lottery organisers.
The elephant in the room
Consumers have grown increasingly wary of companies' use of data, especially after recent high-profile data breaches like those at Uber, Equifax and Deloitte. The EU's GDPR, coming into effect next May, will punish these infractions harshly, and Farquhar is not optimistic for companies that use data lakes:
"I fail to see how an organisation which has a data lake, even one where sensitive data has been pseudo-anonymised - that being the only technology specifically called out by the GDPR - can honestly meet their GDPR requirements."
However, Brobst thinks that the regulation should be greeted positively by the companies that hold and are using customer data, as the necessary protections will serve to raise levels of trust: "I think that the GDPR's going to be a positive thing," he said. "It will create transparency and put the consumers more in control [of their data], so they can decide who they do and don't trust."
Is big data a big problem? It can be used in some very shady ways, and it can improve people's lives. It can streamline company workflows, and it can drive them towards bankruptcy.
In the end, big data is neither unethical nor evil, moral nor good; it just is, and it is up to data processors to use it responsibly.