Storage vendor EMC announced earlier this month that it plans to incorporate open source technology Apache Hadoop into some of its appliances, with a view to helping enterprises analyse large amounts of unstructured data from Twitter, Facebook and other social networks.
The most interesting aspect of this announcement is not that EMC is to use open source technology, but that it has chosen to use Hadoop.
Hadoop has recently gained quite a following among data analytics providers, with IBM and Aster Data among those releasing Hadoop-based products for enterprise use. It has also been reported that Yahoo is planning to create a company around the technology.
Rod Smith, IBM’s vice president of emerging technologies, says EMC is responding to the growing realisation among enterprises that they need to be able to store and analyse what he calls “big data”.
“Every company I speak to is considering how they can harness the ever-increasing amounts of social media information to help improve their products or services,” says Smith. “People are not necessarily going to company web sites or to customer services to talk about solutions, they are going to their own social networks.
“I would say around 70 per cent of businesses, based on the companies I have spoken to, are already looking to the benefits of Hadoop and how it can help analyse this mass of information,” he adds.
How it works
The main benefit of Hadoop is that it uses parallel programming to help companies make sense of petabytes of unstructured data that were previously too large to work with.
Parallel programming allows analytics to be run on hundreds of servers, with lots of disk drives, all at the same time.
Hadoop stores this data in a file system called HDFS (the Hadoop Distributed File System), essentially a flat file system that can spread data across multiple disk drives and servers.
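The idea of spreading data across many drives and servers can be sketched in a few lines of Python. This is purely illustrative, not the real HDFS API: the block size, node names and round-robin placement are assumptions chosen to keep the example small (real HDFS uses blocks of 64 MB or more and places replicas with rack topology in mind).

```python
# Illustrative sketch (not the real HDFS API): split one large file into
# fixed-size blocks and spread replicated copies across several data nodes.

BLOCK_SIZE = 8          # bytes per block (HDFS defaults to 64 MB or more)
REPLICATION = 2         # copies kept of each block
NODES = ["node1", "node2", "node3"]

def place_blocks(data: bytes):
    """Split data into blocks and assign each block to REPLICATION nodes."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Simple round-robin placement; real HDFS also considers rack topology.
        chosen = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = {"nodes": chosen, "bytes": len(block)}
    return placement

layout = place_blocks(b"petabytes of unstructured social data")
for idx, info in layout.items():
    print(idx, info)
```

Because each block lives on more than one node, analytics jobs can be sent to whichever server already holds a local copy of the data, which is what makes the parallel approach described above efficient.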
However, this distributed file system is not enough on its own - it requires an additional programming framework, as Donald Feinberg, vice president and distinguished analyst at Gartner, explains: “Three years ago, companies such as Aster Data and Greenplum released the capabilities to put a programming framework containing Map-Reduce type data inside the database.”
Greenplum went about this by allowing MapReduce algorithms to be written that would reach out to data held in HDFS and run analytics on it.
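The MapReduce pattern itself can be shown with the classic word-count example. This is a minimal single-process sketch of the three stages Hadoop distributes across a cluster: a map step emits key/value pairs from each chunk of input, a shuffle groups the pairs by key, and a reduce step aggregates each group. The chunk data and function names here are invented for illustration.

```python
# Minimal sketch of the MapReduce pattern: map, shuffle, reduce.
from collections import defaultdict

def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# In a real cluster each chunk would sit on a different server and the
# map calls would run there in parallel; here they run sequentially.
chunks = ["big data big analytics", "big data"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 3, 'data': 2, 'analytics': 1}
```

Because the map calls are independent of one another, they can be scattered across hundreds of machines, which is exactly the parallelism the article describes.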
“Hadoop can effectively act as a big file store for a bunch of detailed data that a company does not want to put in its data warehouse, and this data is all available for analysis,” Feinberg says.
Feinberg suggests that companies are attracted to Hadoop because it is a “less expensive and more efficient way of storing and analysing data”, which means that businesses do not have to work out how to otherwise manage petabytes’ worth of data.
How it needs to develop
Despite industry enthusiasm for Hadoop, analysts and companies still recognise that it is early days for the technology and that more needs to be done.
“The main problem with Hadoop is that it is still too geeky; you have to be a developer to use it,” says IBM’s Smith.
He also argues that the technology could help IT departments to change how they are perceived by the business: “The IT department needs to adapt from managing IT and information to providing access to it.”
Tony Baer, industry analyst at Ovum, agrees that the Hadoop ecosystem needs refinement, and argues that it will be years before the necessary standards are established.
“The industry is still trying to figure out a way to establish best practices,” he says.
“For instance, you may be allowed to collect public data from Facebook, but the reuse and repurpose of it may not be widely accepted. So we still need to figure out how this is going to work,” he adds.
“Over the next 18 months you will see the industry continue to commercialise Hadoop, but it will take at least three years before any good best practices are established. We are managing data of a kind we have never really managed before.”