eBay exec urges Hadoop community to get more focused

The online marketplace uses the Apache software to run CPU-intensive workloads

SAN DIEGO: Darren Bruntz, senior director of e-commerce at eBay, has indicated that the online marketplace might be more inclined to increase its use of Hadoop as a data analytics tool if its open-source community had more "focus".

Bruntz was speaking at the Teradata conference in San Diego this week.

Hadoop is an open-source tool that uses a process called parallel programming to analyse petabytes of unstructured data that were previously too large to work with.

Parallel programming allows a single analytics job to be split up and run across hundreds of servers and their disk drives at the same time.
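
The idea can be sketched in a few lines of Python. This is only an illustration of the split, process and merge pattern, not Hadoop's own API: the function names are hypothetical, and local threads stand in for the separate servers a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def split_into_chunks(lines, n_chunks):
    """Divide the input into roughly equal chunks, one per worker."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parallel_word_count(lines, workers=4):
    """Count words by processing chunks concurrently, then merging."""
    chunks = split_into_chunks(lines, workers)
    # Threads are an assumption made for illustration; on a cluster
    # each chunk would be handled by a different machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    # Reduce step: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


if __name__ == "__main__":
    events = ["click view click", "view buy", "click"]
    print(parallel_word_count(events, workers=2))
```

Each worker sees only its own chunk, so adding more workers, or more servers, lets the same job cover more data in the same time.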

Hadoop stores this data in a file system called HDFS (Hadoop Distributed File System), which splits files into blocks and spreads copies of those blocks across multiple disk drives and servers.
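
A rough sketch of how data might be spread across servers is shown below. It is a simplified model rather than HDFS's actual placement logic: the function name, the round-robin rule, the node names and the sizes are all assumptions made for illustration.

```python
def place_blocks(file_size, block_size, nodes, replicas=3):
    """Assign each block of a file to `replicas` distinct nodes.

    Uses a simple round-robin rule (an assumption for illustration);
    a real distributed file system applies its own placement policy.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replicas")

    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        # Consecutive nodes, wrapping around, hold the copies of a block.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replicas)
        ]
    return placement


if __name__ == "__main__":
    # Hypothetical 300 MB file, 128 MB blocks, four servers, two copies.
    layout = place_blocks(300, 128, ["n1", "n2", "n3", "n4"], replicas=2)
    for block, holders in layout.items():
        print(block, holders)
```

Because every block lives on more than one server, a job can read different blocks in parallel and the loss of a single drive does not lose the data.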

eBay currently has three platforms for its data warehousing and analytics, one of which is Hadoop.

"We have a Teradata enterprise-class, data-warehousing system, which is actually two systems run virtually as one, that manages high-concurrency workloads. It has between five and 10 petabytes of storage in it," said Bruntz.

"Then we have a separate Teradata system that is really about deep storage. That is where we put all our behavioural data, our high-volume data, clickstream and event data. We also keep all our traditional warehousing data in there," he added.

"The third system is built on Hadoop. We target that for high CPU workload, but it is not where we would do relational or structured work. However, it is great for image processing and modelling, which is something that obviously requires a lot of CPU."

Despite its strengths on CPU-intensive workloads, Bruntz described Hadoop as a "programmatic" system that needs a lot of coding before it provides the tools users require.

Hadoop is widely regarded in the industry as a complex system to master, one that demands strong developer skills. The open-source offering also lacks a mature ecosystem of tools and standards.

"I think we will stay on our setup of the three platforms for a few more years, but Hadoop could be a more compelling offering if the open source community and its contributors got some more focus and energy, as you would have a whole community of people working on new tools and features," said Bruntz.

"We are not really biased towards a particular technology - we look at the value we are getting from that technology. We look at all the different dimensions of service: are we working with a partner that can meet our needs in an aggressive way?" he added.

"So, it's not just the technology, but it's the ecosystem that goes around that. In future we could perhaps move to a single platform, but I don't see there being a single compelling technology for several years."