Doing ad hoc big data analytics? Don't throw the data away when you're finished, warns IBM

IBM's Alex Chen warns that regulatory requirements mean that big data projects will increasingly need to be closely managed

Organisations and lines-of-business have been warned not to casually throw away data after performing ad hoc big data analytics because, increasingly, regulatory and other requirements may oblige them to refer back to the decisions they made.

That is the warning of Alex Chen, director of file, object storage and big data flash at IBM, speaking at Computing's recent IT Leaders Forum.

"A lot of businesses are still in the early adopter stage of big data. They are figuring out how they can transition and operationalise into this IT environment.

"It's likely that someone in a line-of-business [in many organisations] has spinned-up a Hadoop cluster and called it their big-data analytics engine. They find a bunch of x86 servers with storage, and run HDFS," said Chen.

"You have all sorts of endpoints ingesting data into a storage pool and, on a periodic basis, you move data from the storage pool into the analytics engine. You start performing the analytics until, you get your answer. Then what do you do?

"You throw away the data that helped you devise the answer you were looking for," he said.

This is because, in many cases, it's not just terabytes of data that are being analysed, but petabytes, and the more data that has to be analysed, the longer it takes.

"Performing analytics generates a lot more meta-data, too, and due to regulations or business requirements people may just want to see what happened and why they made certain decisions. So you wlll need to re-run the analytics that were run before," Chen warned. "So you can't just throw away the data any more."

Furthermore, he added, in many cases the data needs to be encrypted, especially if it is personally identifiable and with regulations such as the EU's General Data Protection Regulation coming into force. At the same time, companies are becoming steadily more dependent on big data.

"Business decisions are tied into the analytics engine itself and therefore your business processes will become dependent on the availability of the analytics engine. No longer can you just say, ‘ah, the Hadoop cluster went down for a few hours, but it's fine now'.

"You're actually making business decisions on these analytics engines, and you're not just running in Hadoop, but multiple analytics engines - Spark in-memory analytics, SAP Hana, SAS."