11 Mar 2011
The term “big data” has emerged to cover the issues of identifying and analysing ever-growing data sets. However, the need to deal with mixed information – not only data held in databases, but also information held in files and on the internet – is causing confusion over where and how “big data” tools fit within the overall information management field.
A prime example of this has been IBM’s Watson and how it played in the Jeopardy! challenge in the US. The need was for a computer system that could take a nominal answer and come up with the question that would be associated with it. So, for example, the “answer” presented could have been something like “the place where the world’s greatest athletes congregated in summer 2008”. The human mind can pretty much work this out as Beijing, as we can make the linkages between “place” and “city”, and “world’s greatest athletes” and “Olympics”. But a computer can’t do this so easily. Watson had to be able to parse the sentence, use a combination of different approaches and algorithms to find a range of possible answers, apply a score as to how much it trusted each answer and so be prepared to provide what it felt was a suitable result.
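To make that flow concrete, here is a deliberately tiny sketch in Python of the general idea: parse the clue, let several independent scorers rate each candidate, combine the scores into a confidence, and only answer when that confidence clears a threshold. This is not how Watson is actually built; the `EVIDENCE` table, the two scorers, the weights and the 0.6 threshold are all hypothetical values made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical toy "evidence": terms loosely associated with each candidate.
# A real system would gather this from millions of documents, not a dict.
EVIDENCE: Dict[str, Set[str]] = {
    "Beijing": {"city", "place", "olympics", "2008", "summer", "athletes"},
    "London": {"city", "place", "olympics", "2012", "summer", "athletes"},
    "Wimbledon": {"place", "tennis", "summer", "athletes"},
}


def tokenize(clue: str) -> Set[str]:
    """Very crude parsing: lower-case words with edge punctuation stripped."""
    return {word.strip(".,!?\"'").lower() for word in clue.split()}


def keyword_overlap(terms: Set[str], candidate: str) -> float:
    """Fraction of the candidate's evidence terms that appear in the clue."""
    evidence = EVIDENCE[candidate]
    return len(terms & evidence) / len(evidence)


def year_match(terms: Set[str], candidate: str) -> float:
    """1.0 if a year in the clue matches the evidence, 0.0 if it conflicts."""
    years = {t for t in terms if t.isdigit() and len(t) == 4}
    if not years:
        return 0.5  # no year in the clue: stay neutral
    return 1.0 if years & EVIDENCE[candidate] else 0.0


# Each scorer is paired with an assumed weight; the weights sum to 1.
SCORERS: List[Tuple[Callable[[Set[str], str], float], float]] = [
    (keyword_overlap, 0.5),
    (year_match, 0.5),
]


def rank(clue: str) -> List[Tuple[float, str]]:
    """Return (confidence, candidate) pairs, most trusted first."""
    terms = tokenize(clue)
    scored = [
        (sum(weight * scorer(terms, candidate) for scorer, weight in SCORERS), candidate)
        for candidate in EVIDENCE
    ]
    return sorted(scored, reverse=True)


def answer(clue: str, threshold: float = 0.6) -> Optional[str]:
    """Answer only when the best candidate is trusted enough; otherwise stay silent."""
    confidence, best = rank(clue)[0]
    return best if confidence >= threshold else None


if __name__ == "__main__":
    clue = "the place where the world's greatest athletes congregated in summer 2008"
    print(rank(clue))
    print(answer(clue))
```

The point of the threshold is the last step described above: the system offers a result only when it trusts it enough, and otherwise declines to answer.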
Is this “big data”? I don’t think so. Very little of the information Watson would be dealing with would come from the archetypal data stores that many would think of – such as DB2, Oracle, SQL Server or other databases. The information would be spread across millions of sources: HTML web pages, PDF files, textual entries contained in blogs, wikis and so on.
This has so little to do with where the term “big data” originally came from that it is stretching its use beyond credibility – and will confuse users and specialists if the vendor community are not careful.
To my mind, what we are looking at here is “unbounded data”. We don’t know where it is, we don’t know what form it will take, we don’t know how important it is until we have got to it and analysed it. We cannot federate the data in the normal sense, as if everyone created a data warehouse housing copies of all the content of the internet, we’d have a few energy crises on our hands, so we have to work against the data in situ. Some of the findings we get from the analysis may need to be stored for future reference; a lot of it can be deleted straight after we have used it.
Indeed, in some cases, this is far more of a “little data” issue than a “big data” one. For example, some information may be so esoteric that there are only a hundred or so references that can be trawled. Once these instances have been found, analysing them and reporting on them does not require much in the way of computer power; creating the right terms of reference to find them may well be the biggest issue.
Tools for dealing with a mass of different sources and formats require a different approach. Formal data can be accessed via rows and columns, will generally have some form of identifier as to what it refers to, and will be capable of being reported against in a graphical manner. The more ad hoc queries based on informal information require tools that can understand context, can précis sources, and can build up a trust model as to whether the information found is verifiable (for example, does Wikipedia trump The Wall Street Journal?). Video and sound have to be included as sources, with tools enabling these to be rapidly searched and broken down into meaningful concepts and contextual “chunks” that can be embraced in the overall process. Admittedly, formal data still needs to be part of the mix – but times have changed from when 80% of all data was held in a database: that ratio has now flipped the other way.
So, I’d like to urge the industry – both vendors and users – to put “big data” back where it belongs, with the likes of oil and gas, pharmaceuticals, finance and others finding themselves dealing with aggregations of formal data within their direct or indirect control running beyond their capabilities to deal with through historical approaches. For the rest of us looking at how to make sense of a world of mixed information sources often well beyond our control, I propose that we start to use the term “unbounded data” in order to allow differentiation between the database tools aimed at dealing with formalised data and the different approaches needed in dealing with more open informational queries.
Watson has shown that a computer can now begin to apply broad information analysis in ways that would have been difficult to think of outside a sci-fi novel only a few years ago. The opportunities for such an approach in anti-terrorism, healthcare, fraud detection and other areas are phenomenal. Indeed, imagine if IBM and Honda got together, putting an Asimo robot as the front end to a Watson system. The robot would be able to use the reasoning power of Watson to make sense of its surroundings in real time – a major step forward in the evolution of the socially useful robot. But, to my mind, it is not a big data issue: it’s different, and in so many important ways.
Clive Longbottom, Service Director, Business Process Analysis
Quocirca