Hadoop in danger of becoming a silo of silos, says Trillium

Trillium's VP of product management, Ed Wrazen, says the practice of dumping data in Hadoop without cleaning it risks big data failure

A lot of companies are very confused about Hadoop.

"I think they're somewhat misguided," says Ed Wrazen, vice-president product management at data quality firm Trillium Software, during an interview at Computing's recent Big Data and Analytics Summit.

"I ask them how much data they want to process and they say 30GB or about 30 million records. I tell them, you don't have big data - not only that but Hadoop won't work particularly well for that anyway."

Others try to use Hadoop to build real-time querying platforms, something for which the batch-oriented platform was not really designed and for which there may be simpler solutions - although developments such as Cloudera's Impala and Apache Spark go some way to addressing this issue.

And still others are stumped by the complexities of an ecosystem typically made up of multiple interacting elements with names like Pig, Hive and Sqoop, for which skills and familiarity may be in short supply.

Nevertheless, in line with analyses by Computing and others, Wrazen says that he has seen interest in Hadoop as the basis for serious business processes leap over the past few months.

"There are analytics and data science-type projects running in many big organisations, but using Hadoop for core enterprise mission-critical applications is still relatively new. But in the last six to nine months, suddenly we're seeing much more of Hadoop, in particular being used for mission-critical applications," he says.

The growing maturity of Hadoop, a new emphasis on analytics, the substantial cost savings on offer and the promise of investing in tomorrow's technology today are behind this surge, he says.

"Organisations are perhaps looking to divest from proprietary expensive hardware and software where they can gain more value long term at a lower cost of ownership using technology such as Hadoop. It offers them the benefits of more flexibility and scalability as well. When you're looking to spend millions on another server or a bigger server [or look at an alternative] it becomes a business decision, but it has to be fit for purpose."

While some firms are looking for a cheaper alternative for their mainframes or analytics appliances, most will retain traditional data warehouses and BI systems and run Hadoop in parallel with these traditional systems for the time being. Not only that, but Wrazen sees an imminent consolidation of the market. The lack of in-house skills and understanding will see many turning to the suppliers of those traditional systems rather than taking a chance on one of the new start-ups that are spending so much on bringing the technology to market, he believes.

"We've seen customers trying to grapple with this complex ecosystem and then turn to their traditional technology providers to provide solutions for Hadoop," Wrazen said. "IBM have a Hadoop offering and it's quite robust, with an ecosystem around it.

"It's going to be an interesting two or three years," he muses.

Big garbage in, big garbage out

Just because Hadoop is capable of ingesting and processing huge quantities of data doesn't mean that the basics of data governance, profiling, de-duplication, cleansing and quality control can be tossed aside in the rush to embrace the new schema-less world. Quite the opposite, Wrazen says.

"The technology is not a panacea. It is not going to fix your problems. People think their data management will automatically get better because they have more data and they're loading it into one place, but bringing all the data together onto one platform may actually exacerbate the problems," he states, quoting an as-yet-unpublished survey by The Data Warehouse Institute (TDWI) that found that 45 per cent of organisations don't clean their data before loading it into Hadoop, and a further 11 per cent don't clean it once it's within the framework, with that last figure set to rise dramatically as data volumes increase.

Instead, they are using the platform as an unfiltered data lake, or a "data swamp" as Wrazen puts it, simply dumping the data in there wholesale, perhaps "enhancing" it by mixing in some unstructured social media feeds or open data, and then feeding it to whichever process needs it, be that new customer-facing web applications or traditional data warehouses, CRM systems or multi-dimensional cubes.

"My feeling is that Hadoop is in danger of making things worse for data quality. It may become a silo of silos, with siloed information loading into another silo which doesn't match the data that's used elsewhere. And there's a lot more of it to contend with as well," he says.

For most implementations, extracting the data from Hadoop for cleansing and standardisation is impossible within the usual overnight processing window.

"You cannot pull that data out to clean it as it would take far too long and you'd need the same amount of disk storage again to put it on. It's not cost-effective or scalable," says Wrazen, suggesting that data needs to be cleaned at source or (preferably, given the increasing number of sources) within Hadoop itself.

How clean is clean?

But how clean does the data need to be? A recent Computing survey found that, for most tasks, accuracy of between 80 and 90 per cent is good enough. But as Wrazen points out: "How do you know what is in the missing 20 per cent?"

In the case of customer records used for mass marketing, a name and an email may be sufficient to get the job done. But for some functions quality needs to be much closer to 100 per cent.

"For finance, invoices and purchase orders you need correct address information, maybe even more than that: you may need correct address information in the format that is required," Wrazen says.

"We had a customer whose invoices weren't being paid because they had the wrong legal entity for their customer. The name of the customer was correct, but their finance department wouldn't accept the invoice because the name of the legal entity was wrong."

Another customer, a large beverages distributor, was keen to discover what its best-selling brands were. After crunching all its sales data, it discovered that Perrier Water was number one. And number two? Erm, Perrier Water.
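The trap is easy to reproduce. A short, self-contained Python sketch (the sales figures are invented purely for illustration) shows how two spellings of the same brand split the totals in a naive count, and how a simple normalisation step merges them again:

    from collections import defaultdict

    # Invented data: the same brand recorded under two spellings.
    sales = [
        ("Perrier Water", 120_000),
        ("PERRIER  WATER", 95_000),
        ("Evian", 80_000),
    ]

    def totals(rows, normalise=False):
        out = defaultdict(int)
        for brand, amount in rows:
            key = " ".join(brand.lower().split()) if normalise else brand
            out[key] += amount
        return dict(out)

    print(totals(sales))                  # Perrier appears as both number one and number two
    print(totals(sales, normalise=True))  # a single Perrier entry with the combined total

Real brand data usually needs fuzzier matching than folding case and whitespace, but the principle is the same: duplicates have to be reconciled before the ranking means anything.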

The "garbage in, garbage out" adage is one of the oldest truisms in IT, and one that is remarkably platform agnostic.