Matching Impedance: When to use HBase

Reposted 2011-07-29 13:36:41

(For the duration of this discussion, I’m going to assume you have at least heard of HBase. If not, go check it out first or you might be a little confused.)

Ever since I read the original Bigtable paper, I knew that its design was something that would befuddle a lot of developers. As an industry, we are largely educated into the world of relational databases, the ubiquitous system of tables, relationships, and SQL. On the whole, relational databases are one of the most widespread, reliable, and well-understood technologies out there. This is one reason why so many developers today are resistant to different storage technologies, such as object databases and distributed hash tables.

However, at some point, the model starts to break down. Usually there are two kinds of pain that people run into: scaling and impedance mismatch. The scaling issue usually boils down to the fact that most RDBMSs are monolithic, single-process systems. The way you scale this type of database (MySQL, Oracle, etc.) is by adding bigger and more expensive hardware: more CPUs, RAM, and especially disks. In this regard, at least the problem is already solved: you just have to spend the money. Unfortunately, the cost of this approach grows much faster than linearly: a machine that can support twice as many disks costs more than twice as much money.

Impedance mismatch is a more subtle and challenging problem to get over. The problem occurs when more and more complex schemas are shoehorned into a tabular format. The traditional issue is mapping object graphs to tables and relationships and back again. One common case where this sort of problem comes to light is when your objects have a lot of possible fields but most objects don’t have an instance of every field. In a traditional RDBMS, you have to have a separate column for each field and store NULLs. Essentially, you have to decide on a homogeneous set of fields for every object. Another problem is when your data is less structured than a standard RDBMS allows. If you will have an undefined, unpredictable set of fields for your objects, you either have to make a generic field schema (Object has many Fields) or use something like RDF to represent your schema.
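To make the two relational workarounds concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and field names (person_wide, object, field, favorite_color) are made up for illustration; the point is only to show a wide table padded with NULLs next to the generic "Object has many Fields" schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: one column per possible field, NULL where a value is missing.
    CREATE TABLE person_wide (
        id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT, fax TEXT
    );
    -- Option 2: generic schema, Object has many Fields.
    CREATE TABLE object (id INTEGER PRIMARY KEY);
    CREATE TABLE field (
        object_id INTEGER REFERENCES object(id), name TEXT, value TEXT
    );
""")

# The wide table stores a slot for every declared column, used or not.
conn.execute("INSERT INTO person_wide (id, name) VALUES (1, 'Alice')")
wide_row = conn.execute("SELECT * FROM person_wide WHERE id = 1").fetchone()

# The generic schema accepts fields nobody planned for, at the cost of
# turning every object read into a query over the field table.
conn.execute("INSERT INTO object (id) VALUES (1)")
conn.executemany(
    "INSERT INTO field VALUES (?, ?, ?)",
    [(1, "name", "Alice"), (1, "favorite_color", "green")],
)
fields = dict(conn.execute("SELECT name, value FROM field WHERE object_id = 1"))

print(wide_row)
print(fields)
```

Neither option is wrong, but both are workarounds: the first wastes slots on absent values, and the second gives up most of what a typed, tabular schema was supposed to buy you.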

HBase seeks to address some of these issues. Still, there are situations where HBase is the wrong tool for the job. As a developer, you need to make sure you take the time to see beyond the hype about this technology or that and really be sure that you’re matching impedance.

When HBase Shines

One place where HBase really does well is when you have records that are very sparse. This might mean un- or semi-structured data. In any case, unlike row-oriented RDBMSs, HBase is column-oriented, meaning that nulls are stored for free. If you have a row that only has one out of dozens of possible columns, literally only that single column is stored. This can mean huge savings in both disk space and IO read time.
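As a rough illustration of the sparse-row idea, here is a toy comparison in Python. It is not how HBase actually lays data out on disk, and the column names and row key are invented; it only contrasts a fixed-width row that reserves a slot per column with a layout that records only the values that exist.

```python
# Toy comparison only; this is not HBase's on-disk format.
COLUMNS = ["name", "email", "phone", "fax", "twitter", "blog"]  # hypothetical fields


def row_oriented(record):
    """Fixed-width row: one slot per declared column, None where a value is missing."""
    return [record.get(col) for col in COLUMNS]


def sparse_cells(row_key, record):
    """Sparse layout: one (row key, column, value) entry per value that exists."""
    return [(row_key, col, value) for col, value in record.items()]


record = {"email": "alice@example.com"}
fixed = row_oriented(record)
cells = sparse_cells("user-001", record)

print(len(fixed), "slots in the fixed row,", fixed.count(None), "of them empty")
print(len(cells), "entry in the sparse layout")
```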

Another way that HBase matches well to un- or semi-structured data is in its treatment of column families. In HBase, individual records of data are called cells. Cells are addressed with a row key/column family/cell qualifier/timestamp tuple. However, when you define your schema, you only specify what column families you want, with the qualifier portion determined dynamically by consumers of the table at runtime. This means that you can store pretty much anything in a column family without having to know what it will be in advance. This also allows you to essentially store one-to-many relationships in a single row! Note that this is not denormalization in the traditional sense, as you aren’t storing one row per parent-child tuple. This can be very powerful – if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations.
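Here is a small in-memory sketch of that idea in Python. ToyTable is a hypothetical stand-in, not the HBase client API: the "schema" is just a set of column family names, and qualifiers are whatever the writer chooses at runtime. The example stores a blog post and its comments in one row, with a made-up qualifier format of timestamp plus author.

```python
class ToyTable:
    """In-memory stand-in for a table whose schema fixes only the column families."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = {}  # row key -> family -> qualifier -> value

    def put(self, row_key, family, qualifier, value):
        # Families are fixed up front; qualifiers are free-form.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get_family(self, row_key, family):
        """Return a copy of every qualifier/value pair in one family of one row."""
        return dict(self.rows.get(row_key, {}).get(family, {}))


posts = ToyTable(families=["post", "comment"])
posts.put("post-42", "post", "title", "When to use HBase")

# One-to-many in a single row: each child comment gets its own qualifier.
posts.put("post-42", "comment", "2011-02-11T13:24:00|alice", "Nice overview")
posts.put("post-42", "comment", "2011-02-12T09:05:00|bob", "What about indexes?")

comments = posts.get_family("post-42", "comment")
for qualifier in sorted(comments):
    print(qualifier, "->", comments[qualifier])
```

Reading the parent and all of its children is a single row lookup here, which is the "no joins" benefit described above.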

In addition to handling sparse data well, HBase is also great for versioned data. As mentioned, the timestamp is a part of the cell “coordinates”. This is handy, because HBase stores a configurable number of versions of each cell you write, and then allows you to query what the state of that cell is at different points in time. Imagine, for instance, a record of a person with a column for location. Over time, that location might change. HBase’s schema would allow you to easily store a person’s location history along with when it changed, all in the same logical place.
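The location example can be sketched with a toy versioned cell in Python. VersionedCell and its max_versions parameter are hypothetical names for illustration, and plain years stand in for real timestamps; the sketch only shows the behavior described above (keep a bounded number of versions, read the value as of a point in time).

```python
import bisect


class VersionedCell:
    """Toy cell that keeps the newest `max_versions` (timestamp, value) pairs."""

    def __init__(self, max_versions=3):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._versions = []  # kept sorted by timestamp, oldest first

    def put(self, timestamp, value):
        bisect.insort(self._versions, (timestamp, value))
        # Drop the oldest entries once the configured limit is exceeded.
        del self._versions[:-self.max_versions]

    def get(self, as_of=None):
        """Return the newest value at or before `as_of` (the latest if None)."""
        candidates = [
            value for ts, value in self._versions if as_of is None or ts <= as_of
        ]
        return candidates[-1] if candidates else None


location = VersionedCell(max_versions=3)
location.put(2008, "Seattle")
location.put(2010, "Austin")
location.put(2011, "Boston")

current = location.get()
in_2009 = location.get(as_of=2009)
before_any = location.get(as_of=2007)
print(current, in_2009, before_any)
```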

Finally, of course, there’s the scaling. HBase is designed to partition horizontally across tens to hundreds of commodity PCs. This is how HBase deals with the problem of adding more CPUs, RAM and disks. I don’t feel like I need to go far down the road of discussing this idea, because it seems to be the one thing everyone gets about HBase. (If you need more convincing, read the original Bigtable paper. It’s got graphs!)

When HBase Isn’t Right

I’ll just go ahead and say it: HBase isn’t right for every purpose. Sure, you could go ahead and take your problem domain and squeeze it into HBase in one way or another, but then you’d be committing the same error we’re trying to avoid by moving away from RDBMSs in the first place.

Firstly, if your data fits into a standard RDBMS without too much squeezing, chances are you don’t need HBase. That is, if a modestly expensive server loaded with MySQL fits your needs, then that’s probably what you want. Don’t make the mistake of assuming you need massive scale right off the bat.

Next, if your data model is pretty simple, you probably want to use an RDBMS. If your entities are all homogeneous, you’ll probably have an easy time of mapping your objects to tables. You also get some nice flexibility in terms of your ability to add indexes, query on non-primary-key values, do aggregations, and so on without much additional work. This is where RDBMSs shine – for decades they’ve been doing this sort of thing and doing it well, at least at lower scale. HBase, on the other hand, doesn’t allow for querying on non-primary-key values, at least directly. HBase allows get operations by primary key and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don’t worry – Lucene to the rescue! But that’s another post.)
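To show what "gets by key and scans over row ranges" leaves you with, here is a toy sorted key store in Python. SortedKeyStore is a hypothetical in-memory model, not the HBase API, and the half-open scan range (start included, stop excluded) is an assumption of the sketch. Note what a query on a non-key value has to do: visit every row.

```python
import bisect


class SortedKeyStore:
    """Toy store with rows kept in key order: get by key, scan by key range."""

    def __init__(self):
        self._keys = []    # sorted row keys
        self._values = {}  # row key -> row data

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

    def find_by_value(self, predicate):
        """No index on values, so the only option is to check every row."""
        return [key for key in self._keys if predicate(self._values[key])]


users = SortedKeyStore()
users.put("user-003", {"city": "Austin"})
users.put("user-001", {"city": "Seattle"})
users.put("user-004", {"city": "Austin"})
users.put("user-002", {"city": "Boston"})

one = users.get("user-002")
ranged = [key for key, _ in users.scan("user-002", "user-004")]
in_austin = users.find_by_value(lambda row: row["city"] == "Austin")
print(one, ranged, in_austin)
```

The get and the scan only touch the rows they return; the lookup by city walks the whole table, which is exactly the work a secondary index exists to avoid.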

Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter) is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes per value. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. At the time of writing, the default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you’re best off storing them in the local filesystem, or if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase – but do us all a favor and just keep the path in the metadata.
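The "just keep the path" advice might look something like this in Python. Everything here is a stand-in: a temporary directory plays the role of the local filesystem or HDFS, a plain dict plays the role of the metadata table, and store_blob and load_blob are hypothetical helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(blob_dir, metadata, row_key, data):
    """Write the bytes to a file; record only path, size and checksum as metadata."""
    path = Path(blob_dir) / f"{row_key}.bin"
    path.write_bytes(data)
    metadata[row_key] = {
        "path": str(path),
        "size": len(data),
        "sha1": hashlib.sha1(data).hexdigest(),
    }
    return metadata[row_key]


def load_blob(metadata, row_key):
    """Follow the stored path to read the bytes back."""
    return Path(metadata[row_key]["path"]).read_bytes()


with tempfile.TemporaryDirectory() as blob_dir:
    metadata = {}  # stand-in for a row in an RDBMS or HBase table
    payload = b"\x00" * 1024
    entry = store_blob(blob_dir, metadata, "video-001", payload)
    restored = load_blob(metadata, "video-001")

print(entry["size"], restored == payload)
```

The metadata row stays tiny no matter how large the file gets, which keeps the store doing what it is good at.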


This post certainly doesn’t cover every use case and benefit or drawback of HBase, but I think it gives a pretty decent start. My hope is that people will be able to gain some insight into when they should start thinking of HBase for their applications, and also use this as a springboard for more questions about how to make use of HBase and ideas about how to make it better. So, I’ll end with a request – please, tell us what’s missing!
