Computing research: Big data and the quest for the ultimate truth

By John Leonard
17 Oct 2012

It has long been a key goal of enterprise IT to hoover up all the data sitting in corporate databases in far-flung corners of the organisation and deposit it into a central data warehouse from where it can be interrogated to provide a single view of the organisation from one horizon to the other.

Recently, though, these horizons have expanded. No longer is it sufficient to consolidate just this structured data. Enterprises in sectors such as retail and pharmaceutical have their eyes on a bigger prize: data that lies outside conventional structured enterprise databases.

As a proportion of the total data that passes through an organisation, this non-tabulated segment is increasing all the time. Blind to its existence, enterprise applications such as BI and CRM systems process only information that can be easily imported into relational databases. This means that enterprises are increasingly obtaining only a limited version of the truth on which to base forecasts and strategy.

Hidden in the depths of this “dark data” may be crucial insights into customers and marketplaces, offering a real competitive edge to those organisations that can successfully manage and interrogate it.

An untapped resource

Comprising both big data (high volume, fast changing, diverse) and the sort of unstructured files that are stored within applications and on server hard drives in all organisations – spreadsheets in networked folders, PDFs in email attachments and video footage on intranets – the “invisible” data that exists outside enterprise databases is an untapped resource.

The development of big data technologies is opening up new ways of bringing such data onto centralised, high-performance platforms without creating whole new silos. With all manner of data integrated into a single platform, these technologies promise to provide unparalleled business intelligence (BI), competitive information and customer insights, including new knowledge about the markets in which organisations operate, emerging trends and potential new business opportunities.

However, consolidating existing enterprise databases onto a unified platform is far from straightforward even before these new sources are taken into account. How much more complex will it become when this additional, far less homogeneous data needs to be brought into the fold? We surveyed 100 IT decision makers in large organisations to learn more about the picture on the ground.

Understanding the data

What we now call big data is only one sub-set of the data that exists outside of relational databases. The remainder consists of files that are stored away on hard drives, attached storage, networks and within applications such as email: data that is basically static rather than changing rapidly.

A 2006 study by the Data Warehousing Institute found that 47 per cent of data within centralised corporate computing was structured (ie in a database), while 31 per cent was unstructured (eg PDFs, spreadsheets, emails, reports, presentations, images, videos, plain text files) and 22 per cent was semi-structured (JSON, XML and similar self-describing text formats). The proportions represented by unstructured and semi-structured data will only have increased since 2006.
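
As a rough illustration of how an organisation might begin to size up these categories in its own file stores, the sketch below sorts file names into the study's unstructured and semi-structured groups by extension. The extension lists are assumptions drawn from the examples above, not part of the study, and structured data does not appear because by definition it sits in a database rather than in files.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical extension lists, based only on the examples named in the text.
SEMI_STRUCTURED = {".json", ".xml"}
UNSTRUCTURED = {
    ".pdf", ".xls", ".xlsx", ".eml", ".msg", ".doc", ".docx",
    ".ppt", ".pptx", ".jpg", ".png", ".mp4", ".txt",
}


def classify(filename: str) -> str:
    """Return the category suggested by the file extension."""
    ext = PurePath(filename).suffix.lower()
    if ext in SEMI_STRUCTURED:
        return "semi-structured"
    if ext in UNSTRUCTURED:
        return "unstructured"
    return "unknown"


def proportions(filenames) -> dict:
    """Percentage of files per category, rounded to whole numbers."""
    counts = Counter(classify(name) for name in filenames)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: round(100 * n / total) for category, n in counts.items()}
```

Counting files is only a crude proxy, since the study measured data rather than file numbers; weighting by file size would be a natural refinement.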

Semi-structured data tends to be found in IT, where formats such as XML are used to transfer data between databases or into ECM applications, for example (see figure 1).


Unstructured data, meanwhile, is most common in corporate, finance and HR departments, where strategic decisions are made (see figure 2). Spreadsheets and presentations are still the key management tools for corporate decision-making at all levels. In terms of BI, therefore, unstructured enterprise data is a potential goldmine.


It is worth adding that for many organisations, the unstructured data challenge does not lie only outside the realm of enterprise databases. Many large organisations hold vast amounts of unstructured data within enterprise database systems themselves, as Variable Character or Large Text fields, or as Binary Large Objects (BLOBs). Some of these databases can be optimised for full-text search, while others lack advanced full-text functions. Even so, implementing sophisticated full-text analytics on large enterprise databases can be both resource-intensive and complex.
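
To give a sense of why searching free text held in database fields can be costly, here is a minimal sketch using SQLite from Python's standard library as a stand-in for an enterprise database. The `complaints` table, its columns and the sample rows are all hypothetical. A keyword search with a leading wildcard cannot use an ordinary index, so the database has to read every row's text.

```python
import sqlite3

# In-memory database with a hypothetical table holding free text in a TEXT column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE complaints (id INTEGER PRIMARY KEY, customer TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO complaints (customer, notes) VALUES (?, ?)",
    [
        ("Acme", "Customer asked for a refund after late delivery."),
        ("Globex", "Query about invoice format, resolved by email."),
        ("Initech", "Second refund request this quarter."),
    ],
)


def search_notes(connection, keyword):
    """Naive keyword search: a LIKE pattern with a leading wildcard scans every row.

    Note that % and _ in the keyword would be treated as wildcards.
    """
    pattern = f"%{keyword}%"
    cursor = connection.execute(
        "SELECT id, customer FROM complaints WHERE notes LIKE ? ORDER BY id",
        (pattern,),
    )
    return cursor.fetchall()
```

On three rows the scan is trivial; on many millions of large text fields the same query becomes the kind of resource-intensive operation described above, which is where dedicated full-text indexing comes in.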
