According to market intelligence company Transparency Market Research, the global Hadoop market is growing at almost 55 per cent a year and is expected to be worth $20.9bn in 2018.
No-one professes himself more surprised by this meteoric rise than Hadoop's co-creator Doug Cutting. After all, the platform that is emerging as the de facto means of pulling together, storing and processing diverse types of data started life in 2005 as a side project at Yahoo!, a way to enable two open-source search projects that Cutting was working on at home to run across lots of machines.
"I wasn't expecting it," he told Computing. "In a way it was a case of being in the right place at the right time - although of course you have to be able to take advantage of being in the right place at the right time."
Another surprise, at least in the early days, was that none of the major software vendors - the Oracles, IBMs and Microsofts - picked up on the growing demand for managing and processing data on an ever-increasing scale by developing their own distributed platforms. This he puts down to the strength of the open source model, or rather the weakness of the proprietary one.
"Actually Microsoft had a technology in house called Dryad, but they abandoned it because they realised it would only be a small proprietary corner," Cutting said.
"We may have had a little bit of first mover advantage, but if someone had gone the whole hog with an open source alternative that might have been a competitor," he added.
"Hadoop was sustained and solidified by the open source methodology. There's a growing awareness that with platform technology open source gives a tremendous market advantage. People aren't comfortable with gravitating towards a platform that's controlled by a single vendor."
The open source methodology "is not something these traditional software vendors are good at," Cutting said, going on to explain how the ethos at Yahoo!, Google and in university departments in the mid-2000s gave them the edge.
"We had experience of building open source communities, which eventually gave us an unassailable advantage," he said.
However, Cutting is keen to dispel any myth that he saw this all coming.
"If that makes me sound like a master tactician, I'm not!" he said modestly. "It's all after the fact."
In 2009 Cutting left Yahoo! to become chief architect at Hadoop startup Cloudera, just as the big data movement was starting to break out of the laboratories.
Hadoop has now been co-opted by all of the major vendors that might have once sought to outdo it. It sits at the centre of a huge ecosystem of open source databases, libraries, algorithms, query languages and middleware that together are being tasked by enterprises around the world with storing, managing and analysing the masses of data that flow across modern organisations, or that lie hidden away in silos.
Meanwhile Hadoop continues to evolve and mature. Much effort has been focussed on finding more flexible alternatives to the MapReduce data-crunching engine. Hadoop 2.0 was released in October last year and included Apache YARN, an execution layer that opens up the platform to real-time processing, broadening the range of data processing tasks that can be run and applications that can be developed for the platform. YARN was developed under the leadership of Arun Murthy, co-founder of rival Hadoop distributor Hortonworks.
Cutting is chairman of the Apache Software Foundation and as such presides over a diverse community of open source projects. But while open source development is collaborative, it's also competitive. Sounding a little like a man praising a surrogate parent for the way they've brought up his children, Cutting tempered his praise for YARN by mentioning that Cloudera released Impala, which is designed to process mixed workloads and allow real-time ad hoc querying, before YARN arrived.
"I think YARN is a good piece of engineering to help assist the platform. YARN allows you to dynamically allocate time and hardware more efficiently for mixed workloads," he said.
"But we saw the direction of travel before YARN came along. MapReduce is still a mainstay, a workhorse, but there are workloads that can be done in other ways. HBase and search sit well with Hadoop, and Impala is the leading interactive SQL engine for Hadoop. The trend towards a variety of workloads was well in progress, and in a way YARN was a response to that trend," Cutting said, adding that Impala, HBase and search are now integrated with YARN.
Just last week, Cloudera announced it is supporting another technology designed to increase the versatility of the Hadoop ecosystem, Spark. Spark is an in-memory technology that was developed to reach some of the other parts that workhorse MapReduce cannot reach. It is focused on supporting iterative algorithms, such as those used in machine learning, and also interactive data mining.
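To see why keeping data in memory suits iterative algorithms, consider a minimal Python sketch. It is not Spark code and uses none of Spark's API; the names `CountingSource` and `iterate_mean` are hypothetical, invented for illustration. The toy algorithm needs a full pass over the data on every step, so the sketch simply counts how often the data has to be fetched from storage with and without an in-memory copy.

```python
from typing import Callable, List, Sequence


class CountingSource:
    """Hypothetical stand-in for slow storage; counts how often it is read."""

    def __init__(self, values: Sequence[float]) -> None:
        self.values = list(values)
        self.reads = 0

    def __call__(self) -> List[float]:
        self.reads += 1
        return list(self.values)


def iterate_mean(load: Callable[[], List[float]], steps: int,
                 rate: float = 0.5, cache: bool = False) -> float:
    """Estimate the mean by gradient descent; each step scans all the data."""
    # With cache=True the data is fetched once and held in memory.
    cached = load() if cache else None
    estimate = 0.0
    for _ in range(steps):
        # Without caching, every iteration goes back to storage.
        data = cached if cached is not None else load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= rate * gradient
    return estimate


uncached = CountingSource([1.0, 2.0, 3.0, 4.0])
cached = CountingSource([1.0, 2.0, 3.0, 4.0])
result_uncached = iterate_mean(uncached, steps=10)
result_cached = iterate_mean(cached, steps=10, cache=True)
print(uncached.reads, cached.reads)
```

Both runs arrive at the same estimate, but the uncached one reads its source once per iteration while the cached one reads it once in total. On a real cluster, where each read may mean going back to disk, that difference is the gap an in-memory engine aims to close.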