What is Spark? Six reasons why CIOs should find out (and one why they shouldn't)

Spark is a favourite among developers and data scientists, but why would a CIO need to know about it? John Leonard asks delegates at Spark Summit Europe for their views

A couple of years ago, those of us who follow developments in big data technologies began to hear a new word: Spark.

This was at the stage when many organisations had started to take notice of the Hadoop platform and were trying it out for pilot projects or small production tasks. Many found that while Hadoop was a great way of distributing data across cheap hardware, obtaining the promised analytical insight using some of the applications in the Hadoop ecosystem was not simple. For some people it was all a bit too much effort, and the speed at which results could be produced a little underwhelming. That's when we started hearing Hadoop vendors, who had spent a lot of money marketing the revolutionary benefits of that platform, talking about Spark.

Apache Spark is a fast-growing open-source project designed to speed up and simplify many common data-crunching and analytics tasks by pulling them together under one interface and doing all the processing in memory. It is a storage-agnostic, general-purpose compute engine that can run on (and thus effectively integrate) a wide range of back-ends, including Hadoop, Cassandra, and cloud-based storage and data-warehousing systems.
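To make "chained, in-memory processing" concrete, here is a rough illustration in plain Python (not Spark's actual API): Spark programs are built as pipelines of transformations that stay in memory and only run when a result is requested, much like generator pipelines.

```python
# Plain-Python sketch of the lazy, chained-transformation style that
# Spark applies across a cluster: generators keep data flowing in memory,
# and nothing executes until a final result (an "action") is requested.

records = ["3.1", "4.5", "bad", "2.0", "9.9"]

parsed = (float(r) for r in records if r.replace(".", "").isdigit())  # transform
big = (x for x in parsed if x > 3.0)                                  # filter
total = sum(big)                                                      # action: triggers evaluation

print(total)  # 17.5
```

In real Spark the transformations are distributed across machines, but the programming model is the same: compose steps lazily, then ask for an answer.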

It's certainly creating a buzz within the developer and analyst communities, but why would a CIO need to know about Spark? At the recent Spark Summit Europe, Computing asked various organisations who are using Spark, or integrating it into their own products, for the top reasons why they were doing so.

Reason 1: Speed

Spark is fast. Very fast. Everyone we spoke to mentioned speed. Its backers say that for some use cases it processes data 10 to 100 times faster than equivalent programs that write to disk.

"Memory is king in this world," says Natalino Busa (right), senior data architect at banking and financial services firm ING. "Everything is moving to memory and SSDs, and Spark is a data operating system that's designed to run in memory."

The falling price and increasing efficiency of memory and storage, he adds, means that unless you are a Facebook or a Twitter, almost all processing can now be done cost effectively in memory, even within very large corporations such as an international bank.

Anjul Bhambhri, VP of product development, big data and analytics platform at IBM, says that even for batch-type jobs her labs have measured speeds two to four times faster than MapReduce. And by re-engineering the company's venerable SPSS predictive analytics server to run on Spark, customers are seeing a three-to-six times performance benefit when running over terabytes of data.

The most common use case for Spark is business intelligence. Being able to analyse streaming data on the fly and combine it with other sources in real time "brings the speed of analysis much closer to the speed of thought," says Isabel Nuage, product marketing at Talend.

"Decision-makers can make better decisions. It's making good on the promises of Hadoop," she adds.

Reason 2: A happy data scientist is a productive data scientist

Data scientists - in great demand these days as firms seek to utilise data to fill gaps in their knowledge - love Spark. Because Spark offers an interactive web interface and supports the languages that data scientists and statisticians tend to use (Python and R), they find the barriers to using platforms like Hadoop much reduced.

"As a data scientist I can type interactively. It sounds like a small thing but it's a game changer. You couldn't do that before in Hadoop," says Sean Owen (right), director of data science at Cloudera.

The fact that Spark can interrogate data sitting in many different sources from a single web interface (called a "notebook" in Spark parlance) without having to worry about formats and file systems appeals to Gianmario Spacagna, data scientist for customer and retail banking at Barclays.

"Any data scientist has to do some exploratory work and then some coding. With a notebook I can quickly do the exploratory work, check the output, create some charts, and do it again in the same session. It responds very quickly. It's interactive," he explains. "With Hadoop you can't do that because it's asynchronous."

Reason 3: It cuts development time

"It's end-to-end, a complete system within one framework," says Ayraman Farahat, distinguished architect at Yahoo. "Before you had a database part and an analytics part and you'd have to do some ad hoc things to try to put those together. With Spark I can go from the initial data collection all the way up to the design and running of the experiment and the analysis of the output all in one framework."

"You only have to know one system," Farahat adds.

"Spark speeds up the development of iterative and machine-learning-type applications because the data is in memory," says IBM's Bhambhri (left), who puts Spark's "expressiveness" as her number one attribute.

"It also makes developers more productive. They can solve the same big data problems in a fraction of the time. Because of how much processing Spark can handle on its own, the number of lines of code is a fraction of what it used to be with other big data technologies."

By refactoring the IBM DataWorks data refinery and ETL (extract, transform and load) solution to incorporate Spark, the company was able to reduce the number of lines of code by 87 per cent, she says.
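The compactness claim is easiest to see with the canonical word-count example. In classic MapReduce it takes a full job class with mapper and reducer boilerplate; in Spark it collapses to a few chained calls. This plain-Python sketch (not Spark code) shows the same flatMap / reduce-by-key shape in a handful of lines:

```python
from collections import Counter

# Word count in the flatMap -> reduce-by-key shape that Spark's API
# expresses as a few chained calls, versus a full MapReduce job class.
lines = ["spark is fast", "spark is simple"]

words = (w for line in lines for w in line.split())  # flatMap: lines -> words
counts = Counter(words)                              # reduce by key: word -> count

print(counts["spark"])  # 2
```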

Reason 4: You can start small and scale up

Tug Grall, technical evangelist at MapR, comments that being able to use Spark independently of other software is a key advantage.

"You can start a really small deployment independently of Hadoop, so it makes deployment very simple for developers who didn't look at big data platforms before," he says, adding: "As soon as they get more data they can store it in Hadoop, or in NoSQL engines, or in a data warehouse and they can build very complex applications like machine learning."

"You can start small and scale up almost infinitely with the same easy model," Grall adds.

This is particularly true if you employ Scala developers, adds Barclays' Spacagna.

"You write some code in Scala and that same code magically works in Spark. It's executed in a distributed big data cluster but the way you define your logic is exactly the same" he says.

Reason 5: The internet of things is coming

Spark was initially created for real-time analysis of streaming data and machine learning, and this is exactly what will be required as the IoT takes off. The number of sensors and data sources may be rising exponentially, but at its heart the IoT is a big data problem.

"A lot of people are moving into the mobile scenario in retail, and this means analysing data in real time to track the devices in order to serve customers better" says Amit Satoor, a director of marketing at SAP. "In manufacturing we're seeing analysis of streaming IoT data, and for genomics or any large-scale projects that use unstructured data Spark lets you get results in an hour rather than a couple of days."

Machine learning is commonly used in the building of recommendation engines that match customers with products they might want to buy based on past behaviour. Yahoo's Farahat worked on a system to match 50,000 apps with 500 million users. "That's 25 trillion possible combinations," he points out.
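A minimal sketch of the scoring step inside such a recommender (plain Python with made-up feature values; a production system would use Spark's MLlib at vastly larger scale): represent each user and each app as a small vector of latent features, and rank apps by their inner product with the user's vector.

```python
# Dot-product scoring, the core of matrix-factorisation recommenders:
# the predicted affinity between a user and an app is the inner product
# of their latent-feature vectors. All values here are hypothetical.
user = [0.9, 0.1]
apps = {"chess": [0.8, 0.0], "news": [0.1, 0.9]}

def score(u, a):
    return sum(x * y for x, y in zip(u, a))

best = max(apps, key=lambda name: score(user, apps[name]))
print(best)  # chess
```

The engineering challenge Farahat describes is doing this scoring, or a clever approximation of it, across trillions of user-app pairs, which is where a distributed engine earns its keep.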

Reason 6: Spark is going to be big

In the febrile world of cutting-edge technology, making predictions can be a dangerous game, but consider the facts. The number of active developers on the open-source project has risen more than six-fold in two years, making it Apache's largest. Its user base has doubled in the past year, and it is backed by some big guns such as Huawei and IBM and investment groups Andreessen Horowitz and NEA, so funding is not an issue.

As a general-purpose big data platform that supports machine learning, streaming, integration and analytics, it has the potential to unite or even displace older technologies, simplifying the landscape.

"At IBM we describe Spark as an analytics operating system and we're proud that over the years we have bet on game-changing technologies that have stood the test of time, be that V-System then Linux and now Spark," says Bhambhri.

One reason why you shouldn't

You don't have big data. While Spark can run on a single laptop, to stop there would truly be using a very large sledgehammer to crack a particularly tiny nut. Like all software it is merely a tool, and unless you have big-data-type problems to solve, your time, effort and resources might be better spent on more appropriate ones, as is also the case with Hadoop.

There is a lot of hype around big data, and Spark is certainly the shiny thing du jour. Its newness means it is still immature in some respects, and while companies like IBM are investing heavily in improving its SQL capabilities and optimising other features, most users will still get more from the traditional tools they have invested time and money in learning to use. Most organisations don't have big data problems to solve right now, and their time may be better spent on other projects.

That said, looking down the road a bit, more and more will. CIOs need to keep on top of emerging technologies and given that Spark is general purpose, growing fast and free to download and try out, it's surely worth a look.

Computing's Big Data Summit returns to London in March. Registration is FREE for most delegates. Book your place today