Computing research: Big data and the quest for the ultimate truth

John Leonard discovers why organisations are turning their attention to data that lies outside enterprise databases

It has long been a key goal of enterprise IT to hoover up all the data sitting in corporate databases in far-flung corners of the organisation and deposit it in a central data warehouse, from where it can be interrogated to provide a single view of the business from one horizon to the other.

Recently, though, these horizons have expanded. No longer is it sufficient to consolidate just this structured data. Enterprises in sectors such as retail and pharmaceutical have their eyes on a bigger prize: data that lies outside conventional structured enterprise databases.

As a proportion of the total data that passes through an organisation, this non-tabulated segment is increasing all the time. Blind to its existence, enterprise applications such as BI and CRM systems process only information that can be easily imported into relational databases. This means that enterprises are increasingly obtaining only a limited version of the truth on which to base forecasts and strategy.

Hidden in the depths of this "dark data" may be crucial insights into customers and marketplaces, offering a real competitive edge to those organisations that can successfully manage and interrogate it.

An untapped resource

Comprising both big data (high volume, fast changing, diverse) and the sort of unstructured files that are stored within applications and on server hard drives in all organisations - spreadsheets in networked folders, PDFs in email attachments and video footage on intranets - the "invisible" data that exists outside of enterprise databases is an untapped resource.

The development of big data technologies is presenting new ways that such data can be brought onto centralised, high-performance platforms without creating whole new silos. With all manner of data integrated into a single platform, these technologies promise to provide unparalleled business intelligence (BI), competitive information and customer insights, including new knowledge about the markets that organisations operate in, emerging trends and potential new business opportunities.

However, consolidating existing enterprise databases onto a unified platform is a far from straightforward task even before these new sources are considered. How much more complex will it become when this additional, far less homogeneous data needs to be brought into the fold? We surveyed 100 IT decision makers in large organisations to learn more about the picture on the ground.

Understanding the data

What we now call big data is only one sub-set of the data that exists outside of relational databases. The remainder consists of files that are stored away on hard drives, attached storage, networks and within applications such as email: data that is basically static rather than changing rapidly.

A 2006 study by the Data Warehousing Institute found that 47 per cent of data within centralised corporate computing was structured (ie in a database), while 31 per cent was unstructured (eg PDFs, spreadsheets, emails, reports, presentations, images, videos, plain text files) and 22 per cent was semi-structured (JSON, XML and similar flat formats). The proportions represented by unstructured and semi-structured data will have only increased since 2006.

Semi-structured data tends to be found in IT, where formats such as XML are used to transfer data between databases or into ECM applications, for example (see figure 1).
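To make the distinction concrete, the sketch below shows how a semi-structured record might be flattened into relational rows. The XML order export is hypothetical and Python's standard library is used purely for illustration; it is not drawn from any product or dataset mentioned in this research.

```python
# Illustration only: flattening a hypothetical semi-structured XML export
# into tabular rows that a relational database could ingest.
import xml.etree.ElementTree as ET

xml_doc = """
<orders>
  <order id="1001">
    <customer>Acme Ltd</customer>
    <item sku="A-17" qty="3"/>
    <item sku="B-02" qty="1"/>
  </order>
</orders>
"""

rows = []
for order in ET.fromstring(xml_doc).findall("order"):
    for item in order.findall("item"):
        # Each row combines order-level and item-level fields.
        rows.append({
            "order_id": order.get("id"),
            "customer": order.findtext("customer"),
            "sku": item.get("sku"),
            "qty": int(item.get("qty")),
        })

print(rows)
```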

Unstructured data, meanwhile, is most common in corporate, finance and HR departments - where strategic decisions are made (see figure 2). Spreadsheets and presentations are still the key management tools for corporate decision-making at all levels. In terms of BI, therefore, unstructured enterprise data is a potential goldmine.

It is worth adding that for many organisations the unstructured data challenge does not lie solely outside the realm of enterprise databases. Many large organisations hold vast amounts of unstructured data within enterprise database systems themselves, as variable-character or large-text fields, or as binary large objects (BLOBs). Some enterprise databases can be optimised for full-text search, while others lack advanced full-text functions. But implementing sophisticated full-text analytics on large enterprise databases can be both resource intensive and complex.
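As a rough illustration of what such optimisation involves, the snippet below builds a full-text index over free-text notes, using SQLite's FTS5 extension as a stand-in for an enterprise database's full-text features. The data is invented and FTS5 availability depends on how SQLite was built; this is a sketch of the technique, not a recommendation of a particular product.

```python
# Sketch: a full-text index over free-text fields, using SQLite's FTS5
# extension as a stand-in for enterprise full-text search (availability
# of FTS5 depends on the SQLite build; the data below is invented).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(customer, body)")
conn.executemany(
    "INSERT INTO notes VALUES (?, ?)",
    [
        ("Acme Ltd", "Customer reported a delayed shipment and asked for a refund."),
        ("Globex", "Renewal call went well; upsell opportunity on the analytics module."),
    ],
)

# Return notes matching the query term 'refund', best matches first.
for customer, body in conn.execute(
    "SELECT customer, body FROM notes WHERE notes MATCH ? ORDER BY rank", ("refund",)
):
    print(customer, "-", body)
```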

If corporate unstructured and semi-structured data is generated largely within organisations, big data is driven by the increasing convergence of organisations with web technologies, although many organisations, including those in finance, insurance and government, face significant big data processing challenges that have little to do with web use.

The “big” part of big data refers to the size of the dataset (which may be orders of magnitude more voluminous than the structured data held in databases), as well as its pace of change relative to the time needed to query it.

The “data” part of the big data equation can comprise almost any content: server logs; content captured from the web; social media streams; case notes; financial market data; banking transactions; audio and video footage; manufacturing process information; geophysical data; lab data; satellite telemetry; and so on. The wider the variety of formats, the more favourable big data techniques become when compared with relational database approaches.

The challenge

The combination of massive volumes and large variety of formats that characterises big data means that conventional databases can struggle to process information in a timely fashion. This has the crippling effect of greatly reducing the types of analysis that organisations can effectively run on very large bodies of data.

The challenge, then, is to take a broad approach to data management: one that makes use of all available data, regardless of where it resides or how rapidly it is being generated; that captures that data from its widely dispersed locations; and that renders or processes it rapidly in a way that yields organisational knowledge and BI.

One approach to the issue of rapid processing of large data sets within relational databases is to co-opt existing big data technologies such as Hadoop, supplementing conventional databases – which store the data – with technologies that distribute the processing across multiple servers, then reduce the result set and return it to the database or application. This allows data mining and analytics to be performed very rapidly, while the number of factors that can be analysed is often multiplied hundreds of times over.
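The map/reduce pattern underlying tools such as Hadoop can be sketched in a few lines. In the toy example below, Python's multiprocessing pool stands in for a cluster: a counting function is mapped over partitions of a hypothetical web log in parallel, and the partial results are then reduced into a single summary that could be handed back to a database or BI tool.

```python
# A toy sketch of the map/reduce pattern: multiprocessing stands in for a
# cluster such as Hadoop, and the 'log' below is invented sample data.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_partition(lines):
    """Map step: count status codes within one partition of the log."""
    return Counter(line.split()[1] for line in lines)

def reduce_counts(a, b):
    """Reduce step: merge partial counts from two partitions."""
    return a + b

if __name__ == "__main__":
    log = [
        "/home 200", "/login 500", "/home 200", "/basket 404",
        "/home 200", "/login 500", "/checkout 200", "/home 404",
    ]
    partitions = [log[i::4] for i in range(4)]  # split the work four ways

    with Pool(4) as pool:
        partials = pool.map(map_partition, partitions)

    totals = reduce(reduce_counts, partials, Counter())
    print(totals)  # e.g. Counter({'200': 4, '500': 2, '404': 2})
```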

An example of the performance advantage of big data techniques over SQL relational approaches is provided by the credit reporting agency Equifax, which developed an algorithm to predict who across the United States might go bankrupt over the next 30 days. Unfortunately, written in SQL and running on a relational database, the function took 26 days to run. Transferred to a big data MapReduce platform, and rewritten in ECL, it took just six minutes to complete.

Another approach does away with underlying SQL databases altogether, replacing them with databases that operate on documents rather than tables, are capable of very rapid record storage and retrieval, and offer greater flexibility and efficiency in handling very different types of data. Dedicated NoSQL databases do much better at keeping pace with very fast storage and retrieval, albeit at the cost of reduced functionality for complex querying and transactions.
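The document model itself is simple enough to mock up in a few lines. The in-memory store below is purely illustrative: records of different shapes sit side by side with no fixed schema, which is the core idea that production NoSQL document databases build on with indexing, persistence and distribution.

```python
# A minimal in-memory sketch of the document model: records of different
# shapes live side by side, with no fixed schema required up front.
import uuid

class TinyDocumentStore:
    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        """Store a document (any dict) and return its generated id."""
        doc_id = str(uuid.uuid4())
        self._docs[doc_id] = doc
        return doc_id

    def find(self, **criteria):
        """Return documents whose fields match all of the given criteria."""
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]

store = TinyDocumentStore()
store.insert({"type": "tweet", "user": "acme", "text": "Loving the new release"})
store.insert({"type": "invoice", "customer": "Globex", "total": 1250.00})
store.insert({"type": "tweet", "user": "globex_fan", "text": "Support was slow today"})

print(store.find(type="tweet"))
```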

The rapidity of processing is a potential boon not only in the analysis of data generated by transactional websites, but also for financial trading systems, banking transaction systems, scientific observational data and military intelligence.

Big data systems also set out to preserve the variety and scope of source data. Where possible, according to big data advocates, keep everything. Converting data to be stored in a relational database often strips away vital information or metadata.

For most organisations, the optimum solution is likely to be a combination of both approaches. However, considerable challenges exist in integrating big data file systems. These include a lack of big data integration skills and expertise, since these systems present a very different paradigm from conventional enterprise database and programming systems. Discovery and selection of appropriate file sets is a task in itself, as is integrating those data sets into big data file systems. There may also be issues around pilot studies and testing, which can look very different with big data than with conventional enterprise computing.

Towards unity

Various specialist vendors such as Pentaho, Talend and Informatica are now offering big data integration platforms and development environments for processing, manipulating and querying data from any source and at any scale, with major players like IBM also getting in on the act.

The benefits of this approach are not limited to more comprehensive corporate data analysis, taking in the totality of enterprise data rather than just a fraction of it; the speed at which queries can be set up and executed can also shrink from days or hours to minutes or seconds.

There is also the prospect of more accurate information extending right across the enterprise. Just as big data can enhance decision making at the corporate level by integrating crucial unstructured data into BI and other enterprise data systems, it also promises to improve decision making, planning and execution at the departmental and workgroup levels.

Respondents to the Computing survey were asked what benefits they would most wish for from a big data implementation. Top of the list were: better data quality; a single version of the truth; eliminating confusion caused by multiple spreadsheet versions; and improved opportunity identification and decision support.

In terms of functionality, ease of use and the ability to integrate multiple data formats rapidly were the most sought-after features. Second to these were high performance and scalability. Technical issues, such as concerns about programming languages and APIs, were a lesser consideration (see figure 3).

Whatever the promises, big data processing and integration is currently at an early stage of development as an enterprise technology, with only two per cent of the Computing survey respondents saying that they are now actively processing big data – while 12 per cent are at a planning and pilot study stage.

However, there can be little doubt that this is an area that will see much investment in the coming months and years. As standards emerge and solutions become easier to implement and operate, take-up will inevitably increase. The drivers – both the need to manage the exponential growth in data volumes and the promise of better intelligence – remain very much in place.

About the survey

The survey was conducted by email in June 2012 among senior IT decision makers in public- and private-sector organisations, 85 per cent of which had more than 500 staff and 43 per cent more than 5,000. Ninety-eight respondents answered all of the questions.