Traditional databases use an approach called schema-on-write: a schema must be created before any data can be loaded. Hadoop, on the other hand, is schema-on-read. Data is simply dropped into the file store, and columns are created by a probabilistic interpretation of queries on the data.
Moreover, the more data you have and the more it is queried, the better it works. However, Dunn admits that explaining this can be difficult.
"You could see the quizzical look on his face," Dunn says of one customer, an experienced data warehouse operative.
"He said 'When I make a schema I can call a column PostalCode. Then when I drop the data in, I know that what is in that column will be the postal code. You have no columns. How do you know what the postal code is?' Great question.
"It's the statistical nature of data," Dunn explains. "MapReduce works out what the relationships between the data are. The system will work out [by its position in relation to other characters] which set of characters is the postal code. After enough swings at the plate, enough iterations, it just becomes obvious. That's why you don't have to create the categorisation. The data will do it for you."
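Dunn's postal code example can be made concrete with a small sketch. What follows is an illustration only, not Cloudera's or Hadoop's actual mechanism: the sample records, the simplified UK postcode pattern and the function names are all assumptions invented for this example. The records are stored exactly as they arrive, with no column names, and structure is applied only when they are read.

```python
import re
from collections import Counter

# Hypothetical raw records, stored as they arrived: no column names, no declared types.
RAW_RECORDS = [
    "Alice Smith,12 High Street,London,SW1A 1AA",
    "Bob Jones,4 Mill Lane,Leeds,LS1 4AP",
    "Carol White,M1 1AE,Manchester,7 Canal Road",  # fields in a different order
    "Dan Brown,9 Park Row,Bristol,BS1 5TR",
]

# Simplified UK postcode shape, assumed for illustration; real validation is stricter.
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")


def infer_postcode_position(records):
    """Vote across records on which field position most often looks like a postcode."""
    votes = Counter()
    for line in records:
        for position, field in enumerate(line.split(",")):
            if POSTCODE.match(field.strip()):
                votes[position] += 1
    if not votes:
        return None  # nothing in the data looked like a postcode
    position, _ = votes.most_common(1)[0]
    return position


def read_postcodes(records):
    """Apply structure at read time: pick the postcode-shaped field from each record."""
    for line in records:
        fields = [field.strip() for field in line.split(",")]
        yield next((field for field in fields if POSTCODE.match(field)), None)


if __name__ == "__main__":
    print("Most likely postcode position:", infer_postcode_position(RAW_RECORDS))
    print("Postcodes found at read time:", list(read_postcodes(RAW_RECORDS)))
```

The more records that pass through the vote, the less a single oddly ordered row matters, which is the point Dunn makes about volume. In a Hadoop setting the per-record matching would correspond loosely to a map step and the vote counting to a reduce step, though this sketch runs in a single process.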
In the "old data world", as Dunn labels traditional analytics, you had to know the answer to the question you wanted answered before you asked it. In the new data world, you want the data to show you what questions to ask.
"The best example is the Google spellchecker. Google knows all the permutations, all the ways you can mis-spell a word. A word mis-spelled in a certain sentence could have one of many meanings. But Google knows which one you actually mean, because they've seen the word in the context of the sentence millions of times, not just in the context of the mis-spelling. They haven't simply categorised the word; they've allowed the word to be categorised in the context of the sentence. Allowing the experience to form the question."
The name Cloudera gives a clue to the company's origins. The original idea was to deploy Hadoop in the cloud. Indeed, Dunn sees opportunities in forging further partnerships with public cloud providers like Rackspace, running CDH as a cloud application that customers can use to process data held in that environment.
However, Dunn concedes that analytics in the cloud has not taken off as quickly as predicted. For now, most analytics is performed in the datacentre, which is why Cloudera has forged partnerships with the likes of Teradata, Netezza and Oracle. There, Dunn says, the firm's solution adds an extra string to these vendors' bows rather than competing with their data warehousing and BI offerings.
"Hadoop was created to deal cost-effectively with a lot of randomly generated data at a high volume and variety. This was a new problem requiring a new technology. Cloudera will be a billion-dollar company before we threaten the traditional data warehouse business because it's not about trying to take a slice of an existing pie, but to make the pie dramatically bigger."
As the Chancellor pointed out, this pie is turning out to be a very large one indeed.