Cloudera and Trillium announce alliance for data quality in Hadoop

Alliance will allow Cloudera users to clean and profile data natively within that platform

Data quality software provider Trillium Software has announced a technology alliance with Hadoop distributor Cloudera. Under the alliance, Cloudera has certified the Trillium Big Data solution, allowing it to run natively in the Cloudera Hadoop distribution, while the Cloudera Enterprise Hub will be made available to Trillium's customers.

"With Trillium Big Data and Cloudera Enterprise, companies can natively and seamlessly integrate data quality solutions into their Hadoop platform," said Cloudera vice president business and corporate development Tim Stevens, in a statement.

"To get maximum value from data, it needs to be accurate, complete and accessible," he continued.

Keith Kohl, vice president product management at Trillium Software, added:

"Hadoop's extraordinary growth and adoption by so many organisations drives the need for them to focus on the quality of big data, and utilise data quality solutions that are infinitely scalable across multiple domains and data sources."

Last month, Ed Wrazen, vice president product management at Trillium, warned that the common practice of dumping data into Hadoop without cleaning it, either before loading or once it is within Hadoop, risks creating a "silo of silos".

"My feeling is that Hadoop is in danger of making things worse for data quality. It may become a silo of silos, with siloed information loading into another silo which doesn't match the data that's used elsewhere. And there's a lot more of it to contend with as well," he told Computing.

"You cannot pull that data out to clean it as it would take far too long and you'd need the same amount of disk storage again to put it on. It's not cost-effective or scalable," said Wrazen, suggesting that data needs to be cleaned at source or within Hadoop itself.

The alliance with Cloudera, provider of one of the more popular Hadoop distributions, should make it easier to deploy processes for data governance, profiling, de-duplication, cleansing and quality control within that platform.