As the requirement for extracting information from huge databases of structured and unstructured data has grown, so hardware and software platforms have evolved to meet the demand. In some cases existing products have been adapted or combined to accelerate performance, while new applications have been developed to help analyse and interpret the results.
Hadoop
Apache Hadoop is seen by many experts as a driving force behind big data analytics. Hadoop is an open-source framework, built around a distributed file system (HDFS), that supports parallel processing of large-scale unstructured data spread across multiple connected systems. It is accompanied by various open-source software elements, such as the Chukwa data collection and monitoring system, the HBase database, the Hive data warehouse, the Pig query tool and ZooKeeper configuration and synchronisation software.
Hadoop also relies on the MapReduce programming model, first developed by Google in 2004 to analyse web indexes, which supports distributed computing on large structured or unstructured data sets sitting on clusters or grids of connected computers. Other developers have used the Enterprise Control Language (ECL) on the High-Performance Computing Cluster (HPCC) platform to build their own distributed file systems or data mining grids.
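To make the MapReduce model concrete, here is a minimal single-machine sketch in Python of the classic word-count job. It does not use Hadoop's own API; the function names (map_phase, shuffle, reduce_phase, word_count) are invented for illustration, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine (illustrative only)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data on clusters"])
print(counts)
```

Because each map call looks at only one record and each reduce call at only one key, the work can be split across many machines without the steps needing to coordinate with each other.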
Because it can be used to process and analyse data that resides outside, as well as inside, the corporate firewall, some multi-national organisations use Hadoop to collect data from thousands of sensors in warehouses and factories located in different parts of the world. However, the software is batch-oriented and can be complicated to configure, and results are slow to gather. While any IT department can deploy it in-house, several companies offer support and management packages around the Hadoop platform, most notably Cloudera.
IBM has also looked to simplify Hadoop-based analytics, using the technology as the base for its InfoSphere BigInsights and Streams analytics applications, which process text, audio, video, social media, stock prices and data captured by sensors.
Business intelligence and visualisation software
While extracting and analysing big data is complex in itself, presenting the findings in a meaningful way can be just as hard. For years, business intelligence and analytics software has been used to output results into Microsoft Excel or specialist reporting tools, such as Crystal Reports, but fresh approaches to visualising that data can help business departments interpret predictive analytics more easily.
Data visualisation tools pull information from other BI applications or directly from underlying data sets, then present it in graphical form rather than as numbers and text alone. A good example is Tableau Software, but similar tools, both proprietary and open-source, include Tibco Spotfire, IBM OpenDX, Tom Sawyer Software, Mondrian and Avizo (the last of which specialises in manipulating and understanding scientific and industrial data).
MPP appliances
Hardware appliances specifically designed to support massively parallel processing (MPP) of large datasets have recently surfaced following a round of software acquisitions by hardware vendors.
Storage giant EMC offers a data computing appliance built on the Greenplum database 4.0, delivering data loading performance of up to 10TB an hour, aimed primarily at large telecommunications companies and big retailers. The Oracle Exadata, Netezza TwinFin and Teradata 2580 are other MPP appliances built on multiple servers and CPUs, offering data storage capacities ranging from 20TB to 128TB, with load rates ranging from 2TB to 5TB per hour (terabytes per hour, or TB/hr, is a unit measuring data transmission rates or throughput).
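To give a feel for what these load rates mean in practice, the following Python sketch estimates how long a bulk load would take. It assumes the quoted rate is sustained for the whole load, and pairing the largest quoted capacity (128TB) with each rate is purely illustrative rather than a vendor claim; load_hours is a hypothetical helper.

```python
def load_hours(dataset_tb, rate_tb_per_hr):
    """Estimate load time in hours, assuming a constant sustained load rate."""
    if rate_tb_per_hr <= 0:
        raise ValueError("load rate must be positive")
    return dataset_tb / rate_tb_per_hr


# Load rates quoted in the article, in TB/hr.
for rate in (2, 5, 10):
    print(f"128TB at {rate}TB/hr: {load_hours(128, rate):.1f} hours")
```

On these assumptions, filling 128TB would take 64 hours at 2TB/hr, 25.6 hours at 5TB/hr and 12.8 hours at 10TB/hr.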
Dell has partnered with Aster Data to optimise its nCluster MPP data warehouse platform to run on Dell PowerEdge C-Series servers for large-scale data warehousing and advanced analytics. However, it is unclear whether that partnership will continue, as Aster Data is now wholly owned by Teradata.
The latest version of HP Vertica uses a mix of cloud computing infrastructure-as-a-service (IaaS), virtual and physical resources to run analytics on SQL databases. Though the software has yet to make it onto a specialised hardware appliance, HP promises this is imminent and is including a software development kit for Vertica 5.0 so that customers can adapt or add APIs to existing analytics applications to pull data out of the MPP platform.
IBM has also developed a pay-as-you-go data storage system, dubbed Scale Out Network Attached Storage (SONAS), which uses its own clustered file system and can host up to 14.4PB of information.