This close integration with Hadoop, together with the ability of a NoSQL cluster to retrieve documents from multiple data stores several orders of magnitude faster than equivalent RDBMS makes it ideal for web portals and similar applications.
This is the environment in which the Guardian Group is using MongoDB. Transitioning away from Oracle, the media company’s CMS and registration systems are moving towards a NoSQL-only solution to escape from the template-driven approach that has become restrictive, and to allow for rapid development and deployment of applications.
MongoDB also underpins the new single-domain UK government portal GOV.UK, which is currently in beta, as well as the National Archive where details of 11 million records are being moved from Microsoft SQL Server to Mongo for publication online.
Impressive as it may be for large-scale analytics, NoSQL can be something of a blunt instrument when it comes to reporting. Applications such as BI typically employ complex column-oriented queries with multi-table joins, nested joins and filters. NoSQL is a distributed system with the data set sharded (horizontally partitioned) across several servers. While this reduces its size and increases speed, there is a trade-off in terms of its ability to handle complex and ad hoc queries (see below). For these, it will often make more sense to process the data using NoSQL and MapReduce, then pipe it into a traditional RDBMS-based system for reporting.
Since the P in CAP theorem (see below) is a given for a distributed system, the various NoSQL databases must choose between 100 per cent availability (A) or complete consistency (C). How they choose depends very much on typical use cases.
For example, CouchDB uses the concept of multi-version concurrency control, meaning that it can handle a large number of concurrent readers and writers without conflict, avoiding the need to lock the database during writes, which makes it suitable for supporting collaborative applications, at the expense of always being bang up to date.
MongoDB, meanwhile, with its focus on web-based server applications, ensures that one set of machines is always the master, and all updates happen on those machines first. Updates then get propagated to the other servers for redundancy. This means that the data as read by an application will always be up to date. MongoDB is designed to come as close as possible to a traditional RDBMS in terms of querying and reporting.
Cassandra is optimised for fast writing. It prioritises availability over consistency (the database is not locked during writes), although this trade-off is tunable. It is perhaps a better choice at the upper end of the volume scale (Twitter is run on Cassandra) but not for more complex querying.
Horses for courses again then. Every benefit comes with a cost.
Schireson acknowledges that the ability to scale up while maintaining consistency has its price, in this case the complexity inherent in running data across vast numbers of servers.
“We are becoming more widely used in enterprise, government, telcos, banks, which often have very demanding security requirements for their applications. [MongoDB] is not as easy to use in those environments as we’d like it to be. We have to do a lot of work to get it to conform to their security regulations and audit requirements. We want to make that a lot easier.
“In general, the developer experience is fantastic: operating one server of Mongo is much easier than operating one server of Oracle – it might be one third of the work. However, one of the capabilities we offer is to run a cluster of 100 Mongo servers – Disney is running 1,200 – and that’s a lot of work. So we want to make it easier for customers to operate big clusters.”
So, should Larry Ellison be thinking about moving to a smaller yacht? Perhaps not just yet, as most of the NoSQL operations are still tiny compared with the Big O. Certainly, though, he should be prepared for rough waters in the analytics and web services areas of his business. NoSQL is attracting a lot of interest from venture capitalists, and with the likes of Disney, Netflix, O2, Craigslist, Ebay, Reddit, the UK government and the BBC all looking to NoSQL providers such as DataStax and 10gen for cheaper, more flexible alternatives to the proprietary solutions provided by Oracle and other large vendors, a sea change could be upon us.
If the CAP fits…
CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
• consistency (C) equivalent to having a single up-to-date copy of the data;
• high availability (A) of that data (for updates); and
• tolerance to network partitions (P).
Since NoSQL data is held in partitions, either consistency or availability cannot be 100 per cent guaranteed, which has an impact on the ability to run complex transactions across collections (tables).