Spark catching up with Hadoop as the primary big data platform

Computing's in-depth research finds that Spark has a growing userbase in the UK

Hadoop and Spark are leading the way as the primary big data processing platforms for organisations in the UK, according to research from Computing.

More than 500 IT professionals responded to a nationwide online quantitative study of companies with 100 or more employees across a range of sectors. Respondents included CTOs, CIOs, COOs, CEOs, IT managers and developers, among others.

While Apache Hadoop has become the de facto big data storage engine, there has been talk of it being displaced for some processing tasks by newer technologies such as Apache Spark. However, the research still gives Hadoop a substantial lead.

When respondents were asked which big data processing platform they believed their company would be using as its primary tool in 18 months, the largest share of those who said their company would be processing big data named Hadoop (59 per cent), followed by Spark (17 per cent). Kinesis (seven per cent), Storm (four per cent) and Flink (two per cent) were also named, while more than a quarter (26 per cent) said that they would use "other" big data processing platforms.

Computing's research also found that "advanced" organisations - those businesses that are leaders when it comes to adopting and using technology to drive change - are relatively more likely to be using Spark as their primary platform, suggesting that it is catching up with Hadoop.

It is worth noting that Hadoop and Spark are commonly used in conjunction with one another - but respondents were made to pick only one processing platform as Computing wanted to find out which platforms are making their mark.

Spark is a storage-agnostic, general-purpose compute engine that can run on a wide range of back-ends, including Hadoop, the NoSQL database Cassandra, and cloud-based storage and data warehousing systems.

While Hadoop has been around for several years, with organisations using the platform to distribute data across cheap hardware, obtaining the promised analytical insight using some of the applications in the Hadoop ecosystem is not always straightforward.

In response to end-user feedback, Hadoop vendors have started talking up the use of Spark, which is designed to speed up and simplify many common data-crunching and analytics tasks by pulling them together under one interface and doing all the processing in memory.
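
To give a rough sense of what doing the processing "in memory" means, the plain-Python sketch below runs a word count as a chain of map, group and reduce steps whose intermediate results never leave memory. It is an illustration only: it uses no Spark or Hadoop API, and the function names and sample data are assumptions made for this example. Disk-based batch engines such as Hadoop MapReduce typically write intermediate results to disk between stages, which is one reason an in-memory approach can be faster.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def group_phase(pairs):
    """Group the emitted values by word (the 'shuffle' between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values to get a count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    # Each step hands its result straight to the next one in memory;
    # nothing is written to disk between stages.
    return reduce_phase(group_phase(map_phase(lines)))


# Hypothetical sample input, not data from the research.
sample = ["Spark and Hadoop", "Hadoop and Storm"]
counts = word_count(sample)
```

A real cluster engine would split these steps across many machines; the sketch only shows the shape of the pipeline on a single one.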

As well as the quantitative study, Computing also interviewed IT decision makers by phone, in face-to-face interviews and in a focus group - and Spark was a subject that kept coming up in conversation.

"Spark for its speed and simplicity. It's easy to get it up and running, it's very easy to code and it's blindingly fast compared to Hadoop," said one CTO from the technology sector.

According to another CTO, it is easier to find people who have experience with Hadoop, but Spark and Storm are "much more attractive and faster".

"They are both a generation ahead of Hadoop, but not as widely adopted," the CTO said.

According to a data scientist, Spark is replacing the MapReduce element in the Hadoop ecosystem.

"Spark actually does the real-time data streaming, whereas it used to be MapReduce but that could only do batch. Now Spark does batch and real-time. So everybody's actually - if they haven't [already] deployed it, they'll go straight to Spark, if they're on MapReduce or they're in the process of migrating to Spark. Again, Spark is really the only game in town," the data scientist said.
