Experiment with big data now, and worry about ROI later, advises Pentaho 'guru'

The sooner you discover the best way to exploit big data the better

If you're one of the many enterprises that are still only experimenting on the edges of big data and analytics, you may want some advice on how to take the plunge.

According to Pentaho's "data guru", Wael Elrifai, successful big data policies are about focusing on a clear journey, rather than worrying too early about the return on investment.

"Companies insist they want to see an ROI on a project right away. Now, the reality is that's not the nature of data science. Data science - the whole purpose - is to explore whether relationships exist to allow us to make predictions," he told Computing.

"You can't explore and know the outcomes already, otherwise you wouldn't need the exploration part, right?"

At the same time, Elrifai warned that knowing the difference between "data laking" and data warehousing is key.

"There are a couple of business cases you can make for data laking. One is warm storage [data accessed less often than "hot", but more often than "cold"] - it's much faster and cheaper to run than a high-end data warehouse. On the other hand, that's not where the real value is - the real value is in exploring, so that's why you do at least need to have a data scientist, to do some real research and development."

And in a competitive field, Elrifai also said it's important to be conscious of your goals.

"If you want to be the second to market with a solution, just wait for the first to do it. But the risk is that the first to the market is someone like an Amazon, who does something that blows everyone else so far out of the water, they're all out of business."

In his recent web seminar with Computing's sister publication V3, Elrifai described many companies as being barely out of the "playground" in terms of big data.

"What I meant by that was companies are still building out their infrastructure - so they haven't even done the exploration or [data] science pieces yet. And [as part of my role] I'm moving them out of the playground into the business-critical operations as I discover more use cases.

"I think of the playground as the place where they set up their infrastructure, but don't know what they are doing yet - what they have is not yet compelling. Warehouse processing is just repeating what they were doing in EDW (enterprise data warehouse).

"But we don't know what a transformative operation is going to be - you don't know you need a car until cars proliferate and then you see there's a great use for them," he said.

But with more and more companies now running successful big data operations, isn't it possible now to look at rivals or peers and get at least some clue from what they've done?

"You have people trying to build out use cases, and the reality is we have a broad set of use cases that may apply to around 90 per cent of businesses," Elrifai said.

"So supply chain optimisation is a known element, or predictive maintenance around Internet of Things or machine learning - the predictive maintenance applications are huge. And real-time analytics is another known quantity.

"But planning for a specific business is never simple - even two businesses operating in retail are going to be different."

Elrifai does agree, however, that the data science skills gap is real and applies to the entire big data sector.

"I think there is a gap. A data scientist is comparable to a statistician. The way I like to think about it is if I'm going to bring this to the CIO or CEO, I might ask them: 'If you had perfect information, what would you be capable of doing? If you had all the information out there, what could you optimise and improve?' And then we can start looking at where to get that information. Big data is all about talking about bringing more information into a company. So once we know the context, the data science problem becomes much more easily defined."

Elrifai said that the skills gap is beginning to narrow thanks to a proliferation of new courses.

The final piece of a big data journey, Elrifai said, is choosing the right technology from an increasingly advanced, and thus more useful, set of industry offerings.

"I'd say the underlying technology on the Hadoop ecosystem is quite nascent, or in early versions. That landscape is rapidly moving, so if you want the blending or the abstraction layer, [using those systems] could take some of that complexity away."

While obviously biased towards Pentaho's own offerings in this area, Elrifai championed the cause of data integration technology.

"[Abstraction tools] look at all the underlying systems and allow you to blend them, with a drag and drop interface. Such technology also allows you to integrate business intelligence into the systems and the data lake - with both structured and unstructured data. So the first thing I'd say is look at the tools that provide a measure of abstraction."

Finally, Elrifai was keen to point out that security within big data is now a hot topic.

"Another thing I'd say is look at systems that provide good levels of data lineage and security, which is particularly relevant when we look at the problems of TalkTalk in recent weeks," he concluded.

Pentaho is a partner of Computing's upcoming IT Leaders Summit, where top-level UK IT executives will come together for a packed half-day of thought leadership sessions and networking (including more tips from Elrifai). Register your interest here and join the conversation.