IoT and AI: centralise the people, decentralise the processing

Edge computing and new management structures required for success in real-time data projects, say Tibco and Databricks

Processing and analytics have been moving steadily away from the centre and out towards the edge where much of the new data is now created.

However, the most widespread architecture for managing data from multiple sources still sees it corralled into a central repository such as an enterprise data warehouse (EDW) to be refined, processed, ordered and queried with the results sent back to the requester. But for many newer use cases - big data analytics, IoT, unstructured data, real-time responsiveness - this process is just too slow and rigid, especially at scale. It becomes a bottleneck.

There's a human analogue to this technological bottleneck too, one that's come about as a result of a shift away from BI to real-time analytics. When the data and analytics are the responsibility of separate teams there will inevitably be friction; a promising machine learning project can soon grind to a halt.

Computing spoke separately to two senior representatives of companies working in the data analytics space, albeit from different ends. Tibco's roots are in the infrastructure side with its background in high-speed data buses, while Databricks was founded to promote Apache Spark and thus comes more from the analytics side - although increasingly the two fields and the two companies in question are meeting in the middle.

"Taking the analytics to the data is where we are heading," said Shawn Rogers, senior director of analytic strategy at Tibco.

"Why bring all this data across the network with all the complexity and cost just to do it at the centre when you should be making decisions where data lives?"

In a world defined by IoT, web apps, global enterprise datasets and distributed applications, the centralised approach typified by an EDW or large cloud repository makes less and less sense.

If a driverless car had to push sensor data to a central cloud for analysis and wait for the result before making a decision, the inevitable latency would be very bad news for other road users. In this and many other use cases time is very much of the essence: processing needs to take place as soon as the data is generated.

"The days of constantly consolidating all of our data into a single data warehouse are over," Rogers said. "EDWs are still important and most enterprises will always have one but that's not the only place your data lives. It's in applications, in the cloud and in various enterprise silos so it's important to put the analytics next to the data instead of always bringing the data back."

As well as latency, transferring data from large numbers of sensors to the centre and back quickly runs into bandwidth scalability problems as the IoT grows. In addition, edge devices may not always be online. Largely because of the rise of IoT, edge computing has become an area of focus for Tibco, Databricks and other data architecture firms.
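As a rough illustration of the edge pattern described above, the sketch below shows a device that makes its decision locally on each reading, sends only compact summaries upstream, and buffers them while it is offline. Everything in it is an assumption made for illustration: the `EdgeNode` class, its threshold rule, the window size and the `send` callback are hypothetical and do not represent any Tibco or Databricks product.

```python
from collections import deque
from statistics import mean


class EdgeNode:
    """Hypothetical edge device: decides locally, uploads only summaries."""

    def __init__(self, threshold, window_size):
        self.threshold = threshold      # assumed decision rule: a simple cut-off
        self.window_size = window_size  # readings per uploaded summary
        self.window = []
        self.outbox = deque()           # summaries waiting for a connection

    def ingest(self, reading):
        """Handle one reading on the device and return the local decision."""
        decision = "alert" if reading > self.threshold else "ok"
        self.window.append(reading)
        if len(self.window) == self.window_size:
            # Replace the raw readings with one small summary record.
            self.outbox.append({
                "count": len(self.window),
                "mean": mean(self.window),
                "max": max(self.window),
            })
            self.window = []
        return decision

    def flush(self, online, send):
        """Upload queued summaries if connected; otherwise keep buffering."""
        if not online:
            return 0
        sent = 0
        while self.outbox:
            send(self.outbox.popleft())
            sent += 1
        return sent


node = EdgeNode(threshold=0.8, window_size=3)
decisions = [node.ingest(r) for r in [0.1, 0.9, 0.2, 0.3, 0.4, 0.5]]
received = []
node.flush(online=False, send=received.append)  # offline: nothing leaves the device
node.flush(online=True, send=received.append)   # back online: summaries are uploaded
```

The decision never waits on the network, and six raw readings cross it as two summary records, which is the latency and bandwidth argument in miniature.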

But it's not just about technical solutions. Another factor that can slow the flow of data to where it's needed is the internal structure of the business, says Databricks CEO Ali Ghodsi.

"The data engineers, the systems guys, they're managing the infrastructure and what they care about is security and reliability," he said. "These guys are pretty protective and conservative and they usually sit in IT."

The primary users of data tend to be located elsewhere.

"The other team sit in the wider business. They are the data scientists and the business analysts who build mathematical models. They are employed by the lines-of-business and they understand the business problems," Ghodsi said.

This arrangement is the reason why 99 per cent of machine learning and AI projects fail, he claimed.

"Basically there's tons of politics between these two teams and they usually don't like each other. The IT mentality is very different from the business mentality."

This separation of duties is particularly problematic for AI development. Machine learning algorithms require large volumes of learning data, and this data often needs to be enriched as flaws in the algorithms emerge through iteration. Any blockage to the flow of data can be fatal to the viability of an AI project, just as having to rely on remote processing would be fatal to a driverless car (and anyone unfortunate enough to be in the way of one).

"The engineers say 'sure, I'll get you your data when I have time'," said Ghodsi. "So the data is basically controlled by the engineering team."

The few companies that are actually succeeding in machine learning and AI share a common factor in that the same team is responsible for data engineering and data science, he said.

So the rule of thumb for IoT analytics seems to be 'centralise the team, decentralise the processing'.