# MongoDB vs Cassandra


Over the 2 years we’ve been using MongoDB in production with our server monitoring tool, Server Density, we’ve built up significant experience and knowledge about how it works. Back in 2009 when I was looking at a replacement for MySQL I looked at Cassandra but dismissed it because MongoDB had several advantages, and Cassandra was still extremely early stage (even more so than MongoDB at the time). Having been invited to give a comparison at the Cassandra London Meetup, I thought I’d revisit it to see how it compares today.

Disclaimer: It’s important to note that much of what I know about MongoDB has been learnt through using it in production. We don’t use Cassandra, so any comparisons are going to be fairly superficial, but they will still be relevant because that’s the stage most people will be at when they are considering which database to pick. As a result, I will try to avoid making technical comparisons about specific features because these would be biased by my extensive understanding of MongoDB vs a limited understanding of Cassandra.

As such, this comparison is split into 2 types of difference – usage and operations.

• Usage: The actual usage as a developer implementing the application with the database.
• Operations: Points which are not directly about the core database but about its suitability for production and management on an operational level.

That said, I will start with several technical comparisons because these are important to understand.

## Usage – Structure

MongoDB acts much like a relational database. Its data model consists of a database at the top level, then collections, which are like tables in MySQL (for example), and then documents, which are contained within a collection like rows in MySQL. Each document is made up of fields and values, similar to columns and values in MySQL. Fields can be simple key/value pairs, e.g. { 'name': 'David Mytton' }, but they can also contain other documents, e.g. { 'name': { 'first': 'David', 'last': 'Mytton' } }.
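As a rough illustration of that hierarchy, the sketch below uses plain Python dicts and lists rather than a real MongoDB connection. The collection name `users` and the helper `get_field` are made up for this example; the helper only mimics the idea of reaching into an embedded document with a dotted path.

```python
# Plain Python stand-ins for: database -> collection -> document -> field/value.
database = {
    "users": [  # a collection, comparable to a MySQL table (name is hypothetical)
        {"name": "David Mytton"},                        # simple key/value field
        {"name": {"first": "David", "last": "Mytton"}},  # field holding another document
    ]
}

def get_field(document, dotted_path):
    """Follow a dotted path such as 'name.first' into nested documents."""
    value = document
    for key in dotted_path.split("."):
        value = value[key]
    return value

flat, nested = database["users"]
print(get_field(flat, "name"))          # David Mytton
print(get_field(nested, "name.first"))  # David
```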

In Cassandra the basic unit is the “column”, which is really just a single key and value, e.g. { 'key': 'name', 'value': 'David Mytton' }. There’s also a timestamp field which is used internally for replication and consistency. The value can be a single value but can also contain another “column”. These columns then exist within column families, which order data based on a specific value in the columns, referenced by a key. At the top level there is a keyspace, which is similar to the MongoDB database.
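The same kind of sketch for Cassandra, again in plain Python and only as a mental model: the `Column` class, the row key `dmytton` and the column family name `users` are assumptions for illustration, not Cassandra's API.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str        # the 'key' in the example above
    value: object    # a single value (or, per the text, another column)
    timestamp: int   # kept for replication and consistency

# keyspace -> column family -> row key -> columns
keyspace = {
    "users": {                 # column family (hypothetical name)
        "dmytton": {           # row key (hypothetical)
            "name": Column("name", "David Mytton", timestamp=1),
        }
    }
}

column = keyspace["users"]["dmytton"]["name"]
print(column.name, column.value)  # name David Mytton
```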

A good set of data model diagrams for Cassandra can be found here.

## Usage – Indexes

MongoDB indexes work very similarly to those in relational databases. You create single or compound indexes at the collection level and every document inserted into that collection has those fields indexed. Querying by index is extremely fast so long as all your indexes fit in memory.

Prior to version 0.7, Cassandra was essentially a key/value store, so if you wanted to query by the contents of a key (i.e. the value) you needed to create a separate column which referenced the other columns, i.e. you created your own indexes. This changed in Cassandra 0.7, which allowed secondary indexes on column values, but only through the column families mechanism.
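To make “create your own indexes” concrete, here is a minimal sketch of the pattern against an in-memory key/value store. Nothing here is Cassandra code; `rows`, `state_index`, `insert_user` and `users_in_state` are hypothetical names, and the point is only that the application has to keep the lookup structure in step with every write.

```python
from collections import defaultdict

rows = {}                        # row key -> column values
state_index = defaultdict(set)   # hand-built index: state -> row keys

def insert_user(row_key, full_name, state):
    """Write the row and keep the hand-built index consistent with it."""
    old = rows.get(row_key)
    if old is not None:
        # The row is being overwritten, so drop its stale index entry first.
        state_index[old["state"]].discard(row_key)
    rows[row_key] = {"full_name": full_name, "state": state}
    state_index[state].add(row_key)

def users_in_state(state):
    """Find row keys via the index, then fetch each row by key."""
    return [rows[key] for key in sorted(state_index.get(state, ()))]

insert_user("alice", "Alice Example", "UT")
insert_user("bob", "Bob Example", "WI")
insert_user("carol", "Carol Example", "UT")
print([user["full_name"] for user in users_in_state("UT")])
```

Every extra query pattern needs another structure like `state_index`, which is the bookkeeping that built-in secondary indexes remove.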

Cassandra requires a lot more metadata for indexes and requires secondary indexes if you want to do range queries. E.g. if we define a new column family with 1 index:

    $ bin/cassandra-cli --host localhost
    Connected to: "Test Cluster" on localhost/9160
    Welcome to cassandra CLI.
    Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit.
    [default@unknown] create keyspace demo;
    [default@unknown] use demo;
    [default@demo] create column family users with comparator=UTF8Type
    ... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
    ... {column_name: birth_date, validation_class: LongType, index_type: KEYS}];

then a range query fails, because no indexed column is compared with an equality operator:

    [default@demo] get users where state = 'UT' and birth_date > 1970;
    No indexed columns present in index clause with operator EQ

We must create a secondary index on state:

    update column family users with comparator=UTF8Type
    ... and column_metadata=[{column_name: full_name, validation_class: UTF8Type},
    ... {column_name: birth_date, validation_class: LongType, index_type: KEYS},
    ... {column_name: state, validation_class: UTF8Type, index_type: KEYS}];

Then Cassandra can use the equality match on state as the primary lookup and filter the results by birth_date:

    get users where state = 'UT' and birth_date > 1970;

(Code samples taken from this blog post).
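The rule behind that error message can be sketched in a few lines of Python. This is a toy model written for this post, not how Cassandra is implemented: `build_index`, `query` and the sample rows are all hypothetical, and the only behaviour taken from the CLI session is that a query needs at least one equality clause on an indexed column before other clauses can be applied as filters.

```python
import operator

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def build_index(rows, column):
    """Map each value of `column` to the row keys holding it (equality lookups only)."""
    index = {}
    for key, row in rows.items():
        index.setdefault(row[column], []).append(key)
    return index

def query(rows, indexes, clauses):
    """Evaluate (column, op, value) clauses: one indexed '=' lookup, then filtering."""
    primary = next((c for c in clauses if c[1] == "=" and c[0] in indexes), None)
    if primary is None:
        raise ValueError("No indexed columns present in index clause with operator EQ")
    column, _, value = primary
    candidates = indexes[column].get(value, [])
    # Every clause, including the range one, is then checked row by row.
    return [key for key in candidates
            if all(OPS[op](rows[key][col], val) for col, op, val in clauses)]

rows = {
    "user_a": {"state": "UT", "birth_date": 1975},
    "user_b": {"state": "UT", "birth_date": 1968},
    "user_c": {"state": "WI", "birth_date": 1973},
}
clauses = [("state", "=", "UT"), ("birth_date", ">", 1970)]

only_birth_date = {"birth_date": build_index(rows, "birth_date")}
both = {**only_birth_date, "state": build_index(rows, "state")}

print(query(rows, both, clauses))  # ['user_a']
```

Calling `query(rows, only_birth_date, clauses)` raises the `ValueError`, mirroring the first attempt in the CLI session where only `birth_date` was indexed.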

## Usage – Deployment

MongoDB is written in C++ and provided in binary form for Linux, OS X, Windows and several other platforms. It’s extremely easy to “install” – download, extract and run mongod.

Cassandra is written in Java and has the overhead that brings, but also the easy ability to integrate into existing Java projects. It takes a little longer to get started but there is a demonstration of setting up a 4 node cluster in less than 2 minutes, which you’d struggle to beat with MongoDB.

I know plenty of people running MongoDB on Windows but would be interested to hear if that’s the same with Cassandra (I suspect it’s more Linux).

## Operations/Usage – Consistency/Replication

In MongoDB replication is achieved through replica sets. This is an enhanced master/slave model where you have a set of nodes where one is the master. Data is replicated to all nodes so that if the master fails, another member will take over. There are configuration options to determine which nodes have priority and you can set options like sync delay to have nodes lag behind (for disaster recovery, for example).

Writes in MongoDB are “unsafe” by default: data isn’t written to disk right away, so it’s possible that a write operation could return success but be lost if the server fails before the data is flushed. This is how Mongo attains high performance. If you need increased durability then you can specify a safe write, which will guarantee the data is written to disk before returning. Further, you can require that the data also be successfully written to n replication slaves.
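The trade-off is easier to see in a toy model. The class below is not MongoDB and does not use any driver API; `ToyServer` and its methods are invented for this sketch and model only the behaviour described above: an acknowledged write sits in memory until a flush, unless the caller asks for a safe write.

```python
class ToyServer:
    """Toy model: writes are acknowledged before they reach disk unless safe=True."""

    def __init__(self):
        self.memory = []  # acknowledged but not yet durable
        self.disk = []    # survives a crash

    def write(self, doc, safe=False):
        self.memory.append(doc)
        if safe:
            self.flush()  # wait for the data to reach disk before acknowledging
        return True       # acknowledgement sent to the client

    def flush(self):
        self.disk.extend(self.memory)
        self.memory.clear()

    def crash(self):
        self.memory.clear()  # anything not flushed is lost

server = ToyServer()
server.write({"metric": "load", "value": 0.7})             # fast, but at risk
server.crash()
print(server.disk)                                         # []

server.write({"metric": "load", "value": 0.9}, safe=True)  # slower, durable
server.crash()
print(server.disk)                                         # [{'metric': 'load', 'value': 0.9}]
```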

MongoDB drivers also support the ability to read from slaves. This can be done at a connection, database, collection or even query level, and the drivers handle sending the right queries to the right slaves, but there is no guarantee of consistency (unless you are using the option to write to all slaves before returning). In contrast, a Cassandra read is sent to the replicas holding the key (how many must respond depends on the consistency level) and the most up-to-date column is returned, based on the timestamp value.
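The timestamp-based choice on the Cassandra side amounts to “newest write wins”. A minimal sketch, with made-up replica names and values and a hypothetical `reconcile` helper:

```python
def reconcile(replica_responses):
    """Pick the response carrying the highest timestamp."""
    return max(replica_responses, key=lambda response: response["timestamp"])

responses = [
    {"replica": "node1", "value": "David", "timestamp": 100},
    {"replica": "node2", "value": "David Mytton", "timestamp": 250},
    {"replica": "node3", "value": "Dave", "timestamp": 180},
]
latest = reconcile(responses)
print(latest["value"])  # David Mytton
```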

Cassandra has much more advanced support for replication by being aware of the network topology. The server can be set to use a specific consistency level to ensure that queries are replicated locally, or to remote data centres. This means you can let Cassandra handle redundancy across nodes where it is aware of which rack and data centre those nodes are on. Cassandra can also monitor nodes and route queries away from “slow” responding nodes.

The only disadvantage with Cassandra is that these settings are done at a node level with configuration files, whereas MongoDB allows very granular ad-hoc control down to the query level through driver options which can be called in code at run time.

## Operations – Who’s behind it?

Both Cassandra (Apache 2.0 license) and MongoDB (AGPL) are open source. You can freely download the code, write patches and submit them upstream. However, Cassandra is purely an open source project whereas MongoDB is “owned” by a commercial company, 10gen. The original authors of MongoDB are core contributors to the code and work for 10gen (indeed, 10gen was founded specifically to support MongoDB and the CEO and CTO are the original creators).

In contrast, Cassandra was created by 2 engineers at Facebook and is now developed under the Apache Software Foundation. This is not a disadvantage (indeed, the Apache Web server used by the majority of websites has similar roots and is part of the same foundation) but is important to understand when it comes to support, ongoing development and the community (below).

## Operations – Support

Although there are independent consultants for MongoDB, the best place to get support is from 10gen themselves because they wrote the database so they know it best. They’re able to provide support contracts with phone and e-mail SLAs.

In contrast, Cassandra has several companies offering commercial support and whilst they do have committers to the core Cassandra code, I’d argue it’s not the same as having access to the entire engineering team and original authors from a single contact point, as is the case with MongoDB.

## Operations – Ongoing development

Interacting directly with the company that controls the main project, especially for support purposes, means you can have bug fixes and changes implemented to the code base. We’ve had numerous fixes committed as a result of problems discovered in our production usage of MongoDB. We pay 10gen for support now but even before we did they were very responsive to bugs. We also get votes for features and improvements.

In theory this is the same in Cassandra: you’d want bugs to be fixed and features implemented, but that doesn’t have to happen because of the nature of open source projects run by volunteers (this becomes more complex when companies are paying developers to work on the project, e.g. Eric Evans from Rackspace working on Cassandra full time).

Of course there is a risk that the company behind the project disappears and all the engineers move on somewhere else but the project is still open source and this is the same with any piece of software you might use.

You could also argue there is more direction and focus from a commercial company working solely on the product (and more engineers dedicated to it) but I don’t want to go any further with this point as this post isn’t about open source vs commercial. This is just one point to be aware of.

## Operations – Documentation

The official Cassandra documentation is poor. While researching this post I had to visit several websites and watch videos even to get explanations for key concepts like indexes. There is better documentation from DataStax but that is still lacking in explaining concepts in any depth.

The MongoDB documentation was good when I first looked at it but is even better nowadays. It’s actually kept up to date and covers all the features, with examples. Nobody likes writing documentation and it shows with many open source projects; this is another advantage of having a company behind the project, forcing developers to write the docs! Incidentally, one of the biggest advantages of the PHP language is the extensive documentation, examples and user-submitted notes.

When you’re using a completely new data store then documentation is important, and is one of the reasons why I chose MongoDB back in 2009.

## Operations – Community

MongoDB has to be a case study in how to build a community around a product. There have been almost 40 MongoDB conferences in the last year, and there is a very active mailing list and user groups around the world. You know you’re well known when a phrase like “web scale” is associated with your product (as a parody). Again, this is because there is a company behind the product actively promoting it and encouraging and managing these events.

Cassandra has had 1 conference in that time, and whilst there are user groups (I presented this talk at the London one) it’s certainly not on the same scale as MongoDB.

Does that matter? None of that existed when we chose MongoDB so we learnt everything ourselves. But for new users today, there’s a huge community of people who are using MongoDB and sharing their knowledge freely in easily accessible places.

## Operations/Usage – Drivers

The other main reason I chose MongoDB was the driver support. All the key drivers for MongoDB were available and, most importantly, maintained by 10gen themselves. MongoDB has official drivers for C, C#, C++, Erlang, JavaScript, Java, Perl, PHP, Python, Ruby and Scala, all fully supported.

The Python and PHP drivers were most important to us but we also use the C# driver in our Windows monitoring agent, and having these maintained as well as the core server makes a massive difference.

Cassandra only has official Java and Python drivers with a few others written by 3rd parties. I’ve found that Python is usually well catered for when it comes to libraries that work well. PHP is another story and we’ve had issues with RabbitMQ and ZeroMQ in the past (specifically not working well under heavy load; they all work fine for playing around). Good PHP libraries are hard to come by.

## Conclusion

There is no conclusion. This post isn’t about which is best, it’s about comparing the two. Both have advantages and disadvantages, and to truly compare them you need to run both in production under significant load for a long period of time. MongoDB has worked well for us: it has proven itself at scale and shown the flexibility to do things like building a queueing system as well as being the main data store for our server monitoring service.

For me, the operational considerations play a major part in making a decision because these types of databases are so new. I would suspect they’re also important to companies looking to adopt this technology. We don’t need a support contract for Apache, for example, because it’s so well proven. Our support contract with 10gen has been well worth the money!
