(For the duration of this discussion, I’m going to assume you have at least heard of HBase. If not, go check it out first or you might be a little confused.)
Ever since I read the original Bigtable paper, I knew that its design was something that would befuddle a lot of developers. As an industry, we are largely educated into the world of relational databases, the ubiquitous system of tables, relationships, and SQL. On the whole, relational databases are one of the most widespread, reliable, and well-understood technologies out there. This is one reason why so many developers today are resistant to different storage technologies, such as object databases and distributed hash tables.
However, at some point, the model starts to break down. Usually there are two kinds of pain that people run into: scaling and impedance mismatch. The scaling issue usually boils down to the fact that most RDBMSs are monolithic, single-process systems. The way you scale this type of database (MySQL, Oracle, etc.) is by adding bigger and more expensive hardware – more CPUs, RAM, and especially disks. In this regard, at least the problem is already solved: you just have to spend the money. Unfortunately, the cost of this approach scales worse than linearly – getting a machine that can support twice as many disks costs more than twice as much money.
Impedance mismatch is a more subtle and challenging problem to get over. The problem occurs when more and more complex schemas are shoehorned into a tabular format. The traditional issue is mapping object graphs to tables and relationships and back again. One common case where this sort of problem comes to light is when your objects have a lot of possible fields but most objects don’t have an instance of every field. In a traditional RDBMS, you have to have a separate column for each field and store NULLs. Essentially, you have to decide on a homogeneous set of fields for every object. Another problem is when your data is less structured than a standard RDBMS allows. If you will have an undefined, unpredictable set of fields for your objects, you either have to make a generic field schema (Object has many Fields) or use something like RDF to represent your schema.
HBase seeks to address some of these issues. Still, there are situations where HBase is the wrong tool for the job. As a developer, you need to make sure you take the time to see beyond the hype about this technology or that and really be sure that you’re matching impedance.
When HBase Shines
One place where HBase really does well is when you have records that are very sparse. This might mean un- or semi-structured data. In any case, unlike row-oriented RDBMSs, HBase is column-oriented, meaning that nulls are free: a cell that has no value is simply never written. If you have a row that only has one out of dozens of possible columns, literally only that single column is stored. This can mean huge savings in both disk space and IO read time.
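To make the sparse-storage point concrete, here is a toy comparison in Python. It is only a sketch: the column list and the sample records are made up for illustration, and the dictionaries are a stand-in for storage, not HBase's actual on-disk format.

```python
# Toy comparison of row-oriented vs. sparse storage.
# COLUMNS and records are hypothetical; this is not how HBase lays out files.
COLUMNS = ["name", "email", "phone", "fax", "twitter", "blog"]

records = {
    "user1": {"name": "Ada", "email": "ada@example.com"},
    "user2": {"name": "Bob"},
    "user3": {"name": "Cy", "phone": "555-0100", "blog": "cy.example.com"},
}

def row_oriented(records):
    """One slot per declared column for every row; None stands in for NULL."""
    return {key: [fields.get(col) for col in COLUMNS]
            for key, fields in records.items()}

def sparse(records):
    """Only the cells that actually exist, keyed by (row, column)."""
    return {(key, col): value
            for key, fields in records.items()
            for col, value in fields.items()}

dense_slots = sum(len(row) for row in row_oriented(records).values())
stored_cells = len(sparse(records))
print(dense_slots, stored_cells)  # prints: 18 6
```

Three rows with six declared columns take 18 slots in the row-oriented layout, while the sparse layout holds only the 6 cells that were ever written.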
Another way that HBase matches well to un- or semi-structured data is in its treatment of column families. In HBase, individual records of data are called cells. Cells are addressed with a row key/column family/cell qualifier/timestamp tuple. However, when you define your schema, you only specify what column families you want, with the qualifier portion determined dynamically by consumers of the table at runtime. This means that you can store pretty much anything in a column family without having to know what it will be in advance. This also allows you to essentially store one-to-many relationships in a single row! Note that this is not denormalization in the traditional sense, as you aren’t storing one row per parent-child tuple. This can be very powerful – if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations.
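The addressing scheme is easier to see in a small model. The class below is a hypothetical in-memory toy (ToyTable and its methods are invented for this sketch, not part of any HBase client API). It fixes the column families up front, lets qualifiers appear at write time, and uses that to keep a post's comments in the same row as the post.

```python
class ToyTable:
    """In-memory stand-in for a table; only column families are declared."""

    def __init__(self, families):
        self.families = set(families)
        self.cells = {}  # (row, family, qualifier, timestamp) -> value

    def put(self, row, family, qualifier, value, timestamp):
        # The family must exist in the schema; the qualifier can be anything.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.cells[(row, family, qualifier, timestamp)] = value

    def get_family(self, row, family):
        """Return the newest value for each qualifier in one family of one row."""
        newest = {}
        for (r, f, q, ts), value in self.cells.items():
            if r == row and f == family:
                if q not in newest or ts > newest[q][0]:
                    newest[q] = (ts, value)
        return {q: value for q, (ts, value) in newest.items()}

# One parent (the post) and many children (its comments) in a single row.
posts = ToyTable(families=["content", "comments"])
posts.put("post42", "content", "title", "When to use HBase", timestamp=1)
posts.put("post42", "comments", "c001", "Nice post", timestamp=2)
posts.put("post42", "comments", "c002", "What about Lucene?", timestamp=3)

comments = posts.get_family("post42", "comments")
```

Reading the comments family of post42 returns both comments without touching any other row, which is the "no join" property described above.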
In addition to handling sparse data well, HBase is also great for versioned data. As mentioned, the timestamp is a part of the cell “coordinates”. This is handy, because HBase stores a configurable number of versions of each cell you write, and then allows you to query what the state of that cell is at different points in time. Imagine, for instance, a record of a person with a column for location. Over time, that location might change. HBase’s schema would allow you to easily store a person’s location history along with when it changed, all in the same logical place.
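Here is a minimal sketch of the versioning idea, using the location example. VersionedCell is a made-up class for illustration, the timestamps are plain years, and the cities are invented; the real system uses its own timestamps and retention settings.

```python
class VersionedCell:
    """Toy cell that keeps at most max_versions (timestamp, value) pairs."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # newest first

    def put(self, timestamp, value):
        # Writing again at the same timestamp replaces that version.
        self.versions = [tv for tv in self.versions if tv[0] != timestamp]
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: tv[0], reverse=True)
        # Drop anything older than the configured number of versions.
        del self.versions[self.max_versions:]

    def get(self, as_of=None):
        """Newest value, or the newest value written at or before as_of."""
        for timestamp, value in self.versions:
            if as_of is None or timestamp <= as_of:
                return value
        return None

location = VersionedCell(max_versions=3)
location.put(2005, "Boston")
location.put(2007, "Seattle")
location.put(2008, "San Francisco")

current = location.get()            # "San Francisco"
in_2006 = location.get(as_of=2006)  # "Boston"
```

Note the trade-off the version limit implies: once a fourth location is written, the oldest one falls off, so a query for 2006 would then come back empty.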
Finally, of course, there’s the scaling. HBase is designed to partition horizontally across tens to hundreds of commodity PCs. This is how HBase deals with the problem of adding more CPUs, RAM and disks. I don’t feel like I need to go far down the road of discussing this idea, because it seems to be the one thing everyone gets about HBase. (If you need more convincing, read the original Bigtable paper. It’s got graphs!)
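For anyone who wants the partitioning idea in miniature: a table is cut into contiguous ranges of row keys, and each range lives on some server. The sketch below assumes a hypothetical layout (the start keys and server names are invented) and only shows how a row key maps to the range that owns it.

```python
import bisect

# Hypothetical layout: each region covers [start_key, next_start_key).
# The empty string is the start of the first region.
region_starts = ["", "g", "n", "t"]
region_servers = ["server-a", "server-b", "server-c", "server-a"]

def locate(row_key):
    """Return (region start key, server) for the region that owns row_key."""
    index = bisect.bisect_right(region_starts, row_key) - 1
    return region_starts[index], region_servers[index]

first = locate("apple")   # ("", "server-a")
middle = locate("g")      # ("g", "server-b")
last = locate("zebra")    # ("t", "server-a")
```

Adding capacity in this picture means splitting a range and handing one half to a new machine, rather than buying a bigger machine.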
When HBase Isn’t Right
I’ll just go ahead and say it: HBase isn’t right for every purpose. Sure, you could go ahead and take your problem domain and squeeze it into HBase in one way or another, but then you’d be committing the same error we’re trying to avoid by moving away from RDBMSs in the first place.
Firstly, if your data fits into a standard RDBMS without too much squeezing, chances are you don’t need HBase. That is, if a modestly expensive server loaded with MySQL fits your needs, then that’s probably what you want. Don’t make the mistake of assuming you need massive scale right off the bat.
Next, if your data model is pretty simple, you probably want to use an RDBMS. If your entities are all homogeneous, you’ll probably have an easy time of mapping your objects to tables. You also get some nice flexibility in terms of your ability to add indexes, query on non-primary-key values, do aggregations, and so on without much additional work. This is where RDBMSs shine – for decades they’ve been doing this sort of thing and doing it well, at least at lower scale. HBase, on the other hand, doesn’t allow for querying on anything but the row key, at least directly. HBase allows get operations by row key (its only primary key) and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don’t worry – Lucene to the rescue! But that’s another post.)
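The access pattern is worth seeing side by side. The toy store below (SortedStore is a hypothetical class, and the keys and rows are invented) keeps rows sorted by key, so a get and a range scan are cheap, while a lookup on any other field has no choice but to walk every row.

```python
import bisect

class SortedStore:
    """Toy store with rows kept in row-key order."""

    def __init__(self):
        self.keys = []   # sorted row keys
        self.rows = {}   # row key -> dict of fields

    def put(self, row_key, row):
        if row_key not in self.rows:
            bisect.insort(self.keys, row_key)
        self.rows[row_key] = row

    def get(self, row_key):
        return self.rows.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield (key, row) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self.keys, start_key)
        hi = bisect.bisect_left(self.keys, stop_key)
        for key in self.keys[lo:hi]:
            yield key, self.rows[key]

    def find_by_value(self, field, value):
        """No index on values, so this has to look at every row."""
        return [key for key in self.keys if self.rows[key].get(field) == value]

store = SortedStore()
store.put("user:bob", {"city": "Seattle"})
store.put("order:1001", {"total": 25})
store.put("user:carol", {"city": "Boston"})
store.put("user:alice", {"city": "Boston"})

# ";" sorts immediately after ":", so this range covers every "user:" key.
user_keys = [key for key, row in store.scan("user:", "user;")]
bostonians = store.find_by_value("city", "Boston")
```

This is also why row key design matters so much: whatever you want to scan by has to be encoded at the front of the key.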
Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter) is store large amounts of binary data. When I say large amounts, I mean values of tens to hundreds of megabytes each. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer each row gets to that region limit, the more overhead you pay to host those rows. If you have to store a lot of big files, then you’re best off storing them in the local filesystem or, if you have LOTS of data, in HDFS. You can still keep the metadata in an RDBMS or HBase – but do us all a favor and just keep the path in the metadata.
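Keeping just the path in the metadata can look something like the sketch below. Everything here is an assumption made for illustration: a temporary local directory stands in for the filesystem or HDFS, a plain dict stands in for the HBase or RDBMS table, and store_blob and load_blob are hypothetical helper names.

```python
import hashlib
import os
import tempfile

def store_blob(blob_dir, metadata, row_key, data):
    """Write the bytes to a file and record only a small metadata row."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(blob_dir, digest)
    with open(path, "wb") as handle:
        handle.write(data)
    metadata[row_key] = {"path": path, "size": len(data), "sha256": digest}
    return metadata[row_key]

def load_blob(metadata, row_key):
    """Look up the path in the metadata row, then read the file itself."""
    with open(metadata[row_key]["path"], "rb") as handle:
        return handle.read()

blob_dir = tempfile.mkdtemp()  # stand-in for the local filesystem or HDFS
metadata = {}                  # stand-in for the HBase or RDBMS table
payload = b"\x00" * 1024       # small placeholder for a large binary file

record = store_blob(blob_dir, metadata, "video:0001", payload)
```

The metadata row stays a few hundred bytes no matter how large the file gets, which keeps rows small in the way the table store prefers.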
This post certainly doesn’t cover every use case and benefit or drawback of HBase, but I think it gives a pretty decent start. My hope is that people will be able to gain some insight into when they should start thinking of HBase for their applications, and also use this as a springboard for more questions about how to make use of HBase and ideas about how to make it better. So, I’ll end with a request – please, tell us what’s missing!