Don't do too much big data on one cluster, warns Exasol

Big data is still about the right tools for the right job, says principal consultant Shuttleworth

Despite all the new tools and solutions flooding the big data landscape, there are still a few absolute "dos and don'ts" to be aware of when setting up a BI platform, Exasol principal consultant David Shuttleworth told delegates at today's Computing Big Data & Analytics Summit 2015 in London.

Shuttleworth said that when building BI platforms on SQL, fast response times are still hugely important, while "on some of the Hadoop-style things it's still okay to kick off a batch-style process every few hours".

"The name of the game is use the right tool for the job, and match the process with the environment," said Shuttleworth.

"The modern approach is typically using data and menu techniques to speed things up, and that works really well for analytics-style queries, as opposed to the old RAID-style of storage data," he added.

"So [my advice is] it's column based, or adding column-based functionality on top."

Shuttleworth reflected on how analytical workloads, unlike scheduled batch jobs, tend to be "ad hoc, or coming from many angles".

"These things are very different in terms of what they need to run well, and it's probably a mistake to try and run these on a single cluster. People do try and do it, but it turns out that because of the different requirements, it's not a good idea. There's no environment I know of that can schedule both."

Shuttleworth advised those building big data platforms to "keep them separate - configure them separately in terms of spindles, memory and CPU, and then use the capabilities that are out there in modern products to bring the data together transparently".

"An end user or, more likely, an end user tool can then generate the SQL to drag all this data together as it needs to." The access time, he added, is now generally fast enough across modern systems to support this method.

"You have problems with data being fragged between two different environments, so it's better to push some logic into the external system and do some processing there. Much tighter integration means you could do the mapping on Hadoop and do the reduce on SQL."