The biggest mistakes in Big Data and how to avoid them
The experts reveal their darkest big data nightmares, and explain how to avoid the pitfalls
Don't drown in your own data lake, don't reinvent the wheel, and if you're going to hire a bunch of expensive data scientists, try to avoid letting them sit around doing nothing.
That's a brief summary of the advice of a panel of experts speaking at a recent Computing event, as they revealed their mistakes when attempting to launch big data initiatives within their organisations.
Speaking at the Big Data and IoT Summit recently, Gael Decoudu, head of data science & digital analytics at Shop Direct, said that his firm began by creating an impressive data lake, but then quickly drowned in it.
"The approach we took was to create a massive data lake, to collect as much data as we could," said Decoudu. "So we invested money in that, then after a couple of years realised that we couldn't do anything with it. So then we started thinking about hiring a team of data scientists and analysts who can find insights in that data," he added.
Dr Kevin Findlay, IT & digital board director at insurance firm Complete Cover Group, described the problems his organisation found with open source software.
"We went down the open source route - and we used [coding language] Python. There are just one or two platforms in the data lake world, based around the Hadoop infrastructure, so we went for one of those. Then one guy last year spent a lot of time writing neural network algorithms instead of just using the standard packages," said Findlay.
So far so good. But Findlay then admitted that the work, whilst technically impressive, didn't actually create any value.
"It was more for his own educational value though. The other side of the coin is that there's a freely available Python library that does the same, so what's the point in making your own?"
Decoudu also had words to add about open source, suggesting that it can be hard to know which supporting tools and software to use.
"We've slowly moved on to open source, and we're now on AWS [Amazon Web Services], and we're starting to use [programming language] R. One of the problems with open source is it's hard for someone who doesn't have lots of experience to pick the right package. There are probably 20 different ways of doing neural networks in R and Python, but which is the right one? Which should you use in a business setting? Getting that wrong can cost the company millions of pounds," said Decoudu.
Jude McCorry, head of business development at The Data Lab, advised firms to encourage their younger staff to be proactive.
"Some companies get excited about saying they hire graduates in data science programmes, but often those graduates just sit there and wait for work to come to them. They're supposed to be there to answer questions about the data, and be self-starters," she argued.