Big data part 2: technology

By Martin Courtney

07 Sep 2011

As the requirement for extracting information from huge databases of structured and unstructured data has grown, so hardware and software platforms have evolved to meet the demand. In some cases existing products have been adapted or combined to accelerate performance, while new applications have been developed to help analyse and interpret the results. 

Hadoop

Apache Hadoop is seen by many experts as a driving force behind big data analytics. Hadoop is an open-source software framework, built around a distributed file system, that supports parallel processing of large-scale unstructured data spread across multiple connected systems. Its wider ecosystem incorporates various open-source software elements, including the Chukwa data collection and monitoring system, the HBase database, the Hive data warehouse, the Pig query tool and the ZooKeeper configuration and synchronisation software.

Hadoop also relies on the MapReduce programming model, introduced by Google in 2004 to process its web indexes, which supports distributed computing on large structured or unstructured data sets sitting on clusters or grids of connected computers. Other developers have instead used the Enterprise Control Language (ECL) on the High-Performance Computing Cluster (HPCC) platform to build their own distributed file systems or data mining grids.
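
The MapReduce model is easier to grasp with a small example. The sketch below is a single-machine illustration in Python, not Hadoop's own API: the function names (map_phase, shuffle, reduce_phase, map_reduce) and the sample log lines are assumptions made for this example. It counts words by emitting key/value pairs, grouping them by key, and then reducing each group, which is the same sequence a cluster would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (word, 1) pair for every word in a single input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample records standing in for lines of a large log file.
    logs = ["sensor fault in warehouse", "sensor ok", "warehouse fault"]
    print(map_reduce(logs))
```

Because each call to map_phase looks at only one record and each call to reduce_phase looks at only one key, both steps can be spread across many machines without co-ordination, which is what makes the model suit large clusters.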

Because it can be used to process and analyse data that resides outside, as well as inside, the corporate firewall, some multinational organisations use Hadoop to collect data from thousands of sensors in warehouses and factories located in different parts of the world. The software is batch-oriented, however: it can be complicated to configure, and results can be slow to arrive. While any IT department can deploy it in-house, several companies offer support and management packages around the Hadoop platform, most notably Cloudera.

IBM has also looked to simplify Hadoop-based analytics, using the technology as the base for its InfoSphere BigInsights and Streams analytics applications, which process text, audio, video, social media, stock prices and data captured by sensors.

Business intelligence and visualisation software

While extracting and analysing big data is complex in itself, presenting the findings in a meaningful way can be just as hard. For years, business intelligence and analytics software has been used to output results into Microsoft Excel or specialist reporting tools, such as Crystal Reports, but fresh approaches to visualising that data can help business departments interpret predictive analytics more easily.

Data visualisation tools pull information from other BI applications or directly from underlying data sets, before presenting it in graphical format rather than as numbers and text only. A good example is Tableau Software, but similar tools, both proprietary and open-source, include Tibco Spotfire, IBM OpenDX, Tom Sawyer Software, Mondrian and Avizo (the last of these specialising in manipulating and understanding scientific and industrial data).

MPP appliances

Hardware appliances specifically designed to support massively parallel processing (MPP) of large datasets have recently surfaced following a round of software acquisitions by hardware vendors.

Storage giant EMC offers a data computing appliance built on Greenplum Database 4.0, delivering data loading performance of up to 10TB an hour and aimed primarily at large telecommunications companies and big retailers. The Oracle Exadata, Netezza TwinFin and Teradata 2580 are other MPP appliances built on multiple servers and CPUs, offering data storage capacities ranging from 20TB to 128TB, with data load rates ranging from 2TB to 5TB per hour (terabytes per hour, or TB/hr, is a unit measuring data throughput).
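
To put those throughput figures in context, the rough Python sketch below works out how long a bulk load would take at a given rate. It assumes the quoted rate is sustained for the whole load, which real systems may not manage, and the function name load_hours is an illustrative choice. Pairing the 10TB an hour figure with the 20TB and 128TB capacities is done here only for comparison and is not a vendor claim.

```python
def load_hours(data_tb: float, rate_tb_per_hour: float) -> float:
    """Return the hours needed to load data_tb terabytes at a sustained rate."""
    if rate_tb_per_hour <= 0:
        raise ValueError("rate_tb_per_hour must be positive")
    return data_tb / rate_tb_per_hour


if __name__ == "__main__":
    # Capacities and load rates are the figures quoted in the article.
    for data_tb in (20, 128):
        for rate in (2, 5, 10):
            hours = load_hours(data_tb, rate)
            print(f"{data_tb}TB at {rate}TB/hr: {hours:.1f} hours")
```

On these assumptions, filling a 128TB appliance at 2TB an hour would take 64 hours, which shows why load rate matters as much as raw capacity when sizing an appliance.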

Dell, for example, has hooked up with Aster Data, optimising its nCluster MPP data warehouse platform to run on Dell PowerEdge C-Series servers for large-scale data warehousing and advanced analytics. However, it is unclear whether that partnership will continue, as Aster Data is now wholly owned by Teradata.

The latest version of HP Vertica uses a mix of cloud computing infrastructure-as-a-service (IaaS), virtual and physical resources to run analytics on SQL databases. Though the software has yet to make it onto a specialised hardware appliance, HP promises this is imminent and is including a software development kit for Vertica 5.0 so that customers can adapt or add APIs to existing analytics applications to pull data out of the MPP platform.

IBM has also developed a pay-as-you-go data storage system, dubbed Scale Out Network Attached Storage (SONAS), which uses its own clustered file system and is capable of hosting up to 14.4PB of information.
