With the rapid adoption of schema-less, NoSQL data stores like MongoDB, Cassandra and Riak in the last few years, developers now have the ability enjoy greater agility when it comes to their application’s persistence model. However, just because a datastore is schema-less, doesn’t mean the structure of the stored documents won’t play an important role in the overall performance and resilience of the application. In this first, of a four part blog series about MongoDB we’ll explore a few strategies you should consider when designing your document structure.
Application requirements should drive schema design
If you ask a dozen experienced developers to design the relational database structure of an application, such as a book review site, it’s likely that each of the structures will be very similar. You’ll likely see tables for authors, books, commenters and comments and so on.. The likelihood of having varied relational structures is small because relational database structures are generally well understood. However, if you ask dozen experienced NoSQL developers to create a similar structure, you’re likely to get a dozen different answers.
Why is there so much variability when it comes to designing a NoSQL schema? To optimize application performance and reliability, a NoSQL schema must be driven by the application’s use case. It’s a novel idea, but it works. Luckily, there are only a few key factors you need to understand when deriving your schema from application requirements. These factors include: • How your documents reference children collections • The structure and the use of indexes • How your data will be sharded
Elements of MongoDB Schemas
Of these factors, how your documents reference child collections, or embedding, is the most important decision you need to make. This point is best demonstrated with an example.
Suppose we’re building the book review site as we mentioned in the introduction. Our application will have authors and books, as well as reviews with threaded comments. How should we structure the collections? Unfortunately, the answers depend on the number of comments we’re expecting per book and how frequently comments are read vs. written. Let’s look at our possible use cases.
The first possibility is were we’re only going to have a few dozen reviews per book, and each review is likely to have a few hundred comments. In this case, embedding the reviews and comments with the book is a viable possibility. Here’s what that might look like:
Listing 1 – Embedded
{ “_id”: ObjectId(“500c5f601fe9193a7eb5046a”), “title”: “MongoDB: The Definitive Guide”, “authors”: [{“lastName”: “Chodorow”, “firstName”: “Kristina”}, {“lastName”: “Dirolf”, “firstName”: “Michael”}], “description”: “How does MongoDB help you…”, “publisher”: “O’Reilly Media”, “formats”: [“Print”, “Ebook”, “Safari Books Online”], “isbn”: “978-1-4493-8156-1”, “pages”: “210”, “reviews”: [{ “rating”: 5, “description”: “The Authors made an excellent work…”, “title”: “One of O’Reilly excellent books”, “created”: ISODate(“2012-07-04T09:48:17Z”), “comments”: [{ “comment”: “This review helped me choose the correct book.”, “commenter”: “Nick”, “created”: ISODate(“2012-07-20T13:15:37Z”) }], “reviewer”: “Giuseppe” }] }
There are several advantages to a fully embedded structure. Since the entire structure is contained in a single document, your application only has to perform a single seek to load the object. This is ideal for read intensive applications. The problems with a fully embedded structure occur when writing to the document. If a new blockbuster book is released, the flood of incoming comments may impact application performance. When a document nears the capacity MongoDB has allocated for it, MongoDB will allocate a larger space for the document and move it to the new location. Your application will incur a performance penalty if MongoDB is constantly moving large documents around on disk.
To avoid this type of penalty, you should not embed your reviews and comments in the base document. Here’s a non-embedded version of our earlier document, using separate collections for books, reviews and comments:
Listing 2 – Non-Embedded
// Books { “_id”: ObjectId(“500c680c1fe9193b67b898a3”), “publisher”: “O’Reilly Media”, “isbn”: “978-1-4493-8156-1”, “description”: “How does MongoDB help you…”, “title”: “MongoDB: The Definitive Guide”, “formats”: [“Print”, “Ebook”, “Safari Books Online”], “authors”: [{ “lastName”: “Chodorow”, “firstName”: “Kristina” }, { “lastName”: “Dirolf”, “firstName”: “Michael” }], “pages”: “210” }
// Reviews { “_id”: ObjectId(“500c680c1fe9193b67b898a4”), “rating”: 5, “description”: “The Authors made an excellent work…”, “title”: “One of O’Reilly excellent books”, “created”: ISODate(“2012-07-04T09:48:17Z”), “book_id”: { “$ref”: “books”, “$id”: ObjectId(“500c680c1fe9193b67b898a3”) }, “reviewer”: “Giuseppe” }
// Comments { “_id”: ObjectId(“500c680c1fe9193b67b898a5”), “comment”: “This review helped me choose the correct book.”, “commenter”: “Nick”, “review_id”: { “$ref”: “reviews”, “$id”: ObjectId(“500c680c1fe9193b67b898a4”) }, “created”: ISODate(“2012-07-20T13:15:37Z”) }
While simple, this method does have some trade-offs. First, our reviews and comments are strewn throughout the disk. We’re potentially loading thousands of documents to display a page. This leads us to another common embedding strategy – “buckets”.
By bucketing review comments, we can maintain the benefit of fewer reads to display substantial amounts of content, while at the same time maintaining fast writes to smaller documents. An example of a bucketed structure is presented below:
Figure 1 – Hybrid Structure

In this example, the bucket, or hybrid, structure breaks the comments into chunks of roughly 100 comments. Each comment collection maintains a reference to the parent review, as well as its page and current number of contained comments.
Of course, as software developers, we’re painfully aware there’s no free lunch. The downside to buckets is the increased complexity your application has to deal with. The previous strategies were trivial to implement from an application perspective, but suffered from inefficiencies at scale. Buckets address these inefficiencies, but your application has to do a bit more bookkeeping, such as keeping track of the number of comment buckets for a given review.
Conclusion
My own personal projects with MongoDB have used each one of these strategies at one point or another, but I’ve always grown into more complicated strategies from the most basic, as the application requirements changed. One of the benefits of MongoDB is the ability to change your storage strategy at will and you shouldn’t be afraid to take advantage of this flexibility. By starting simple, you can maintain development velocity early and migrate to a more scalable strategy as the need arises. Stay tuned for additional blogs in this series covering the use of MongoDB indexes, sharding and replica sets.
If you are interested in experimenting with a few of the concepts without having to download and install MongoDB, try in on Red Hat’s OpenShift. It’s FREE to sign up and all it takes is an email and your minutes from having a MongoDB instance running in the cloud.