If you have less than a petabyte of data you don't need Hadoop

'Use it when you need it but when you don't, don't bother,' says Vincent de Lagabbe, CTO of bitcoin analysis firm Kaiko

Hadoop is unnecessary for smaller projects, and most firms should avoid using it unless they have to.

That's according to Vincent de Lagabbe, CTO of Kaiko, a company that offers real-time tracking of bitcoin exchanges.

"Considering the volume that we are dealing with Hadoop would be overkill. It was fancy technology for the time and people started using it for everything, but most things you can do without it. From experience, it's better to try to do without Hadoop - I mean use it when you need it but when you don't, don't bother," he said.

Kaiko pulls in transaction data from the Bitcoin blockchain and also monitors the major exchanges to see who is buying the crypto-currency, in order to track its price in real-time and provide additional information about the market. While speed and the ability to handle unstructured data are important, volume is less of an issue for the firm.
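To illustrate the general idea of blending trades from several exchanges into one real-time price, here is a minimal Python sketch of a volume-weighted average price (VWAP). The trade figures and exchange labels are purely hypothetical; this is not Kaiko's actual method or feed format.

```python
# A minimal sketch: blend recent trades pooled from multiple exchanges
# into a single price, weighted by traded volume (VWAP).
# All numbers and exchange labels below are illustrative.

def vwap(trades):
    """trades: iterable of (price, volume) pairs from any exchange."""
    total_value = sum(price * volume for price, volume in trades)
    total_volume = sum(volume for _, volume in trades)
    return total_value / total_volume if total_volume else None

# Hypothetical snapshot of recent trades across exchanges
recent_trades = [
    (430.10, 2.5),   # e.g. from exchange A
    (430.25, 1.0),   # e.g. from exchange B
    (429.90, 4.2),   # e.g. from exchange C
]

print(vwap(recent_trades))  # blended real-time price estimate
```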

"If you've got less that a petabyte of data Hadoop is probably overkill," de Lagabbe said.

Instead of Hadoop, the company is deploying DataStax Enterprise (DSE), a commercial distribution of the Apache Cassandra NoSQL database, to perform storage duties.

"We didn't know what we were going to be putting into the database and we wanted something that could be flexible. Cassandra seemed like a pretty good solution to our requirements. We tried several other things but they weren't as stable for our usage. So Cassandra is our main data store. We store everything in it - blockchain data, exchanges data, everything else."

So why not opt for the free community version?

"We chose DSE because we found it was more stable than the version we were using before, maybe because the builds are more carefully monitored, but I don't know," de Lagabbe said.

"Then there's the extensibility, so you can easily have a Spark cluster on it to do further analysis. We have not deployed such cluster yet but we plan to do that for real-time streaming and in-memory map reduce jobs.

"The support from DataStax has been helpful," he added.

Recent research from Computing has found that Spark is catching up with Hadoop as a primary general-purpose big data platform - although the two are most frequently used together.
