Max Schireson is president of New York NoSQL pioneer 10gen, developer and supporter of the open-source, document-oriented database MongoDB.
Computing: Can you summarise the scenarios in which using a NoSQL database like MongoDB becomes preferable to using a standard RDBMS system?
Schireson: "It's all about scale, high data volumes, and where you have a lot of variety in the data. For example, an online retailer might have a large number of product lines, with each product being very different, meaning you would struggle with a traditional relational database.
"Most relational databases don't work well in a scale-out environment. You are restricted to one server, meaning that queries can take a long time to run. There are a few things we do that relational databases don't do. We have built-in MapReduce capability – with both a 'lite' version of Hadoop and a version with for complex analytics.
"If you're storing data in a relational database and you want to run it through Hadoop, you need to take the data out of the database, put it into HDFS [Hadoop File System], do the analytics in Hadoop, take the result of that and put it back into the database. With Mongo you can do those operations in real time while it's still in the operational database. You can also mix and match database-style queries with Hadoop-style MapReduce analytics. The ability to do realtime analytics on operational data in a system that's also optimised for transactional workloads is very powerful.
"There are some queries, especially column-oriented queries, where some of the more specialised relational databases will have performance advantages. On the other hand because the data model is richer in Mongo, you can have a single topic entry that is rich and hierarchical ... so depending on the detail of the data structures you're querying and how you're querying them then there could be performance advantages either way."
What about the whole issue of ACID transactions not being possible in distributed datasets? How do you get round that problem?
"If you have a partitioned system, either you can update information on only one side of the partition, or you can have updates that are inconsistent either side. If you have two sides, one in London and one in New York, and the cable between them is cut you have a choice. Either you say that the two can't talk and only New York will be updated, or you say both sides can be updated and I will resolve the conflicting updates once the cable is repaired and the two sides have come back up.
"[The latter approach] creates a lot of complexity for developers and users. If I enter something into the system I expect it to be there. What [10gen] does is to distribute the data in such a way that for each set of data there is always one set of machines that's the master. All updates happen on those machines first and then get propagated to the other systems for redundancy. If the master system goes down the cluster will detect that and another will become master, but at any point in time there's only one master. By default you will always be reading the master and so will always have up to date information, unless you specifically say in your query not to, for non-realtime analytics or backup for example.