Meet the 'nerdiest rock star': Matei Zaharia, co-creator of Apache Spark

John Leonard talks to the Databricks CTO about why he created the fast-rising big data platform and his vision for its future

Billed by a colleague as a "truly nerdy rock star", Matei Zaharia strolls onto the stage for his keynote to whoops of appreciation from a capacity Amsterdam crowd that can only be described as excitable: "I'm here because Spark is great," says someone in the front row.

In person he is every inch not the rock star, and still very much the UC Berkeley computing systems PhD (in fact he's now an assistant professor at MIT) who a few short years ago tried to solve machine learning problems using Hadoop, an experience that led him to create Spark together with Benjamin Hindman and other colleagues. Slight, intense and amiable, he seems faintly bemused by the pace of change. "I'm still amazed at how fast it's shifted. I used to know personally everyone who was using Spark. Not any more," he says.

In the past two years Spark has gone from a promising experiment to a project that, while not yet mainstream in the generally accepted sense of the term, is certainly the de facto choice for many developers, data scientists and analysts working with big data, machine learning and the IoT.

Spark is now the most popular Apache project in terms of the number of active developers, growing from 100 in 2013 to more than 600 today. A unified "data operating system" that pulls together several functions in one place, it has also seen its user base double in the last year, and IBM, among others, is now making use of its capabilities to speed up some of its traditional analytics products.

Why Spark and why now?

"There's a growing interest in big data and what you can do with it, and changes in the hardware that have made in-memory computing more possible," says Zaharia.

"More people are interested in big data so more people want it to be easy to use. Most people come to Spark because it's just an easier way to program applications."

There has also been a snowball effect, he says, with books and online courses proliferating, and the fact that it is independent of Hadoop or any other platform gives it a wide constituency of users.

Spark has evolved to become a general-purpose, big data compute platform, with support for languages popular with data scientists - Python and R - added recently, Spark SQL for querying, and libraries and algorithms built in for use cases including streaming, machine learning, data science and graph analysis. Is there any danger of it becoming too generalist, of losing sight of what developers liked in the first place: that it is fast and simple?

"You see people who want to do various things with big data and they say 'what's the easiest thing to pick up' and the answer is Spark with Python, so that's what they go for," Zaharia says.

"We spent a ton of effort to make Spark work well with these other languages like Python and R and we're starting to see significant use of these - 60 per cent are using it with Python now."

He adds: "Spark is written in Scala, which is a great language for programmers, but there are huge numbers of people that know a bit of programming but who are not software engineers."

Out with the old

While Spark can run on many storage back ends, it is commonly used with Hadoop and YARN. There is a general consensus that Spark has now replaced the batch-oriented MapReduce on Hadoop as a faster and more flexible way of crunching large amounts of data stored in Hadoop, even for many batch jobs, and there is talk of it displacing other elements of the Hadoop ecosystem, such as Hive, Mahout and Pig, too. So are developers moving away from these projects and on to Spark?

"The number of MapReduce developers has reduced, but Hadoop as a whole has roughly remained the same overall," Zaharia says. "But with Spark we get more developers because you can also run it on many other environments."

Nothing stands still in the world of software, particularly open source. As it moves towards mainstream recognition, some are even describing Spark as old hat. What about the new streaming platforms such as Apex and Flink that are bubbling under?

"With open source it's never like someone decides this is first generation and this is second generation," Zaharia explains. "There are lots of choices and people tend to pick one. It's like languages: some programming languages have been around almost forever: C, C++, Java. They're not going anywhere, whereas other languages come up and sometimes they're short lived and sometimes they're really good at a specific thing.

"With Spark we look at what users are doing and we try to improve it to make sure it's really good at doing those things. I think this goal of being a general platform to allow different sorts of processing is still unique to Spark. I don't know of any other projects that are doing that. It's very valuable because no one wants to learn and manage and hook together two or three different things when they could just have one."

Zaharia is co-founder and CTO of Databricks, a company set up to commercialise Spark, particularly in the public cloud, where about half of its users currently deploy it. The CEO is Ion Stoica, his former professor at Berkeley, and many of his former student colleagues are also with the firm. Has this made a difference?

"Probably. It's a new technology. There wasn't much like it before and we have a really good team, we had one of the best research groups in computer science, really smart people in machine learning and databases and computer systems who were all working on this stuff together," Zaharia says.

And the "truly nerdy rock star" introduction by his colleague?

"I'll have to work out what I'm going to call him next time," he laughs, a little ruefully.