Computing research: how and why big data has hit the mainstream
Exclusive Computing research explores how and why big data is here, and what UK organisations are doing to manage it effectively
Some IT trends can be seen coming from a long way off. Heralded by a cloud of dust on the horizon, a low rumble on the specialist blogs, and a flame war among the cognoscenti, The Next Big Thing rides into town.
Entrepreneurs plan for its arrival, columns grow in the tech press, discussion groups proliferate on the web and new acronyms and metaphors are created. Eventually, The Big Thing (as it is now) becomes a topic of casual conversation, even among those with no technical nous.
Big data is the latest such buzzphrase. But is it really new, or simply an old technology rebadged - like those aspects of cloud computing that we called ASPs (application service providers) and utility computing in the 1990s?
The truth is that big data is an old idea whose time has come, thanks to the twin forces of technological innovation and economic necessity - just as cloud computing has become viable on the back of broadband and mobile communications, combined with those constant forces propelling the IT industry forward: faster processors and cheaper, bigger storage.
Also like "cloud computing", the blandness of the phrase "big data" adds to many users' confusion. It has come to mean, variously: the explosion in the volumes of data generated, stored and processed; real-time analytics; virtualised and parallel processing; data visualisation; and many other things.
You cannot blame vendors for hitching their own wagons to the train. The big players - Oracle, IBM, SAP, Microsoft - have spent billions buying up analytics and data management firms, and "big data" is a term they want to own. However, pushing the debate down such narrow tracks does not lead to clarity; indeed, it can simply lead to cynicism among IT professionals.
Asked by Computing earlier this year about their opinion of the term "big data", 28 per cent of senior IT professionals at large UK organisations replied "vendor hype". Thirty-seven per cent saw big data as a big problem (presumably because they associate the term with increased data volumes), while 27 per cent said that it is shorthand for a big opportunity that few organisations grasp.
Take two organisations, identical in every respect apart from the quality of their data analytics. The firm that makes the best use of the largest quantity of the data it holds - turning that raw data into usable information, and then knowledge - will be the one that gets ahead. Analysts at Forrester Research estimate that enterprises use only five per cent of their available data, leaving the field open to those that can corral the remaining 95 per cent into a usable form.
Hidden value
The vast majority of Computing's survey respondents (72 per cent) recognise that their data holds hidden value. For any company - not just the oil giants and hedge funds most associated with big data - near-real-time analysis of diverse pools of data has the potential to illuminate business trends, unlock new sources of economic value, improve business processes, and more. However, this presents a massive challenge.
The amount of digital information in the world increases tenfold every five years. Much of that data is unstructured. Moreover, as the information scientists of the 1960s noted, the very act of processing that data itself generates more data.
Bigger, faster and ever more connected: we are living in a world of exponentials, of graphs tending towards infinity. Everyone is storing and processing more and more data and looking to use it as an asset. This is one reason why the term big data has entered the mainstream.
Among the main categories of data being collected is detailed customer and transactional data. Achieving higher sales from the existing customer base has long been a goal for many firms, especially in the retail sector, where innovations such as RFID and loyalty cards are driven by this aim. For a large retailer, these data sets can be analysed to produce highly detailed pictures of patterns of demand, regional variations, supply-chain inefficiencies, and so on.
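As a rough illustration of the kind of aggregation involved, the sketch below groups a handful of made-up transactions by region and product. The column names and figures are hypothetical, and it assumes the pandas library is available; a real retailer's pipeline would run comparable queries over billions of rows.

```python
# Illustrative sketch only: a toy aggregation of retail transaction data,
# using hypothetical column names (region, sku, qty, week) to show the kind
# of demand-pattern analysis described above. Requires pandas.
import pandas as pd

transactions = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "sku":    ["tea",   "milk",  "tea",   "tea",   "milk"],
    "qty":    [120,     340,     95,      110,     275],
    "week":   [1,       1,       1,       2,       2],
})

# Regional variation in demand per product line
regional_demand = (
    transactions.groupby(["region", "sku"])["qty"]
    .sum()
    .unstack(fill_value=0)
)
print(regional_demand)
```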
Organisations are hanging on to their data too – perhaps because it is so difficult to work out which files can safely be deleted. Thirty-two per cent of respondents say that they are retaining more than 90 per cent of the data they collect for more than three months.
Indeed, deciding what to keep and what to discard is the biggest headache reported by the survey's respondents, followed closely by the sheer volume of data they are required to store and then by the lack of personnel.
Analytical deficit
So what is the response of the Computing enterprise audience to this situation? Seventy-six per cent are dealing with increasing volumes by simply purchasing extra storage. Acquiring analytical skills (14 per cent) and analytical tools (six per cent) come way behind. Asked about the major issues they have with stored data, respondents again put deciding what to keep at the top, followed by simply dealing with excessive volumes.
This paints a picture of organisations frantically building new homes for data that they have little or no idea what to do with. Moore's Law allows them to do this – just – but for how long and to what purpose? Data retention policies are struggling to keep pace with reality.
The three Vs
For more than a decade, big data (or rather the line between big data and traditional analytics) has been defined in terms of "the three Vs": volume, velocity, and variety.
Volume – the quantity of data relative to the ability to store and manage it
Velocity – the speed of calculation needed to query the data relative to the rate of change of the data
Variety – a measure of the number of different formats the data exists in (eg text, audio, video, logs etc)
If any of these Vs is low, then it may be more efficient to analyse the data using traditional BI methods. However, if volume and required velocity are high, then big data techniques and technologies become more efficient and economical – increasingly so as variety increases.
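The trade-off can be expressed as a simple rule of thumb. The sketch below is purely illustrative: the thresholds are arbitrary assumptions rather than industry benchmarks, and any real decision would also weigh cost, skills and existing infrastructure.

```python
# Purely illustrative heuristic: the thresholds are invented for this example
# and exist only to express the "three Vs" trade-off in code form.
def suggested_approach(volume_tb: float, velocity_qps: float, variety_formats: int) -> str:
    """Return a rough suggestion: traditional BI versus big data techniques."""
    high_volume = volume_tb > 10          # assumption: tens of terabytes and up
    high_velocity = velocity_qps > 1_000  # assumption: queries/events per second
    high_variety = variety_formats > 3    # assumption: text, logs, audio, video...

    if high_volume and high_velocity:
        return ("big data techniques (increasingly so as variety rises)"
                if high_variety else "big data techniques")
    return "traditional BI methods"

print(suggested_approach(volume_tb=50, velocity_qps=5_000, variety_formats=5))
```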
So what are these techniques and technologies?
Big data in practice
First, we need cheap, responsive storage devices. This tends to mean direct attached storage (DAS) and low-latency solid state or SATA disks rather than SAN or NAS.
Next, in order to process masses of data, we need to employ parallel processing techniques, spreading the task across hundreds or thousands of servers, as Google does, or basing the processing in the cloud.
Then we need software to do the processing. Variants of the MapReduce programming model pioneered by Google are popular here, most notably the open-source Hadoop framework. New architectures are emerging all the time.
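For readers unfamiliar with the pattern, the sketch below is a toy, single-process version of the map and reduce phases, counting word frequencies across a couple of "documents". A Hadoop-style framework distributes these same two phases across a cluster; nothing here reflects the actual Hadoop API.

```python
# Toy, single-process sketch of the map/reduce pattern that Hadoop-style
# frameworks distribute across many machines. The "documents" are plain
# strings and the job counts word frequencies.
from collections import defaultdict

def map_phase(document: str):
    # Emit (key, value) pairs: one (word, 1) per word in the document.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Group by key and sum the values for each key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data is here", "big data is a big opportunity"]
all_pairs = (pair for doc in documents for pair in map_phase(doc))
print(reduce_phase(all_pairs))  # e.g. {'big': 3, 'data': 2, ...}
```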
Among the survey audience, the relational database still dominates, using either standard built-in tools or specialist bespoke queries. Contrary to what some would have you believe, big data is not only about Hadoop and the like. It is about using the right tool for the job.
Even as specialist tools become more applicable to the mass market, it is likely that the most commonly used tool to analyse retained data – even in very large data sets – will be SQL. Serious data crunching using parallel processing is used by only eight per cent of respondents, although mapping tools score a respectable 22 per cent.
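To illustrate the point, the sketch below runs a typical aggregate query against a hypothetical sales table using Python's built-in sqlite3 module. The table and column names are invented for the example; in practice the same SQL would run against whichever relational store the organisation already operates.

```python
# Illustrative only: SQL remains the workhorse for analysing retained data.
# Uses Python's built-in sqlite3 module and a hypothetical sales table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("North", 80.5), ("South", 200.0)],
)

# A typical aggregate query of the kind run against relational stores
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
for region, total in conn.execute(query):
    print(region, total)
```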
Perhaps the most significant shortage is not of cheap storage or processing power, but of skills across the organisation. The programming and processing skills required are far from trivial, and 43 per cent of the Computing survey sample said that a lack of analytical expertise was holding them back.
As we saw above, only a small proportion of respondents were investing in the sort of cross-functional skills required to obtain meaning from the data, to communicate it to where it is needed and to use it as a basis for informed, complex decision-making.
Without investing in such skills, big data runs the risk of becoming the next big waste of time.
About the survey
The survey was conducted by email in January 2012. Respondents are all IT decision makers with responsibility for their organisation's enterprise databases or data warehousing functions. The vast majority of respondents (85 per cent) work in organisations with more than 500 employees, with 42 per cent having more than 5,000. All major industry sectors are represented in the cross-section, the most highly populated being IT & telecoms, public sector and finance.
Further reading
Acquire data skills now, or pay big money for big data analysis
Big data – big misunderstandings, big mistakes?
Big data: hands on or hands off?
Big data: how to get the board on board
Computing Big Data Summit 2012