Building indexes using HBase: mapping strings, numbers and dates onto bytes

The article below is Bruno Dumon's explanation of how to build secondary indexes on top of HBase. Besides the article, he also provides a concrete library, and I will start by using that library for some simple functional checks.

Source of this article:

http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/

Implementation:

http://www.lilyproject.org/lily/about/playground/hbaseindexes.html

Another recommended implementation: http://blog.rajeevsharma.in/2009/06/secondary-indexes-in-hbase.html

GitHub repository with the latest support for 0.90: https://github.com/hbase-trx/hbase-transactional-tableindexed

Official wiki: http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing

Note: a library implementing the ideas below is now available: hbase indexing library.

I am looking into exploiting the sorted nature of HBase tables to build indexes. See this interesting presentation by Ryan Barrett (or these slides by Joe Gregorio) for how Google does the same for App Engine’s datastore on Bigtable.

HBase identifies rows by a key, which is a byte array, and keeps the rows sorted on these byte[] keys. If we want to index higher-level data types like strings, numbers or dates, we have to figure out how to map them onto bytes so that the desired sort order is maintained. This post is about exactly this problem, but first I will quickly go into the basics of building an index with HBase.

Building an index with HBase

HBase keeps rows sorted lexicographically by row key, and allows range scans (from-to) over these rows. Lexicographic sorting means sorting like in a dictionary, where the corresponding characters of two words (here bytes in byte arrays) are compared from left to right.

Suppose we have stored entities in HBase, one HBase-row per entity, and each of these entities has a property Country. Now we want to build a secondary index for these entities, in a different HBase table, to allow querying these entities by their Country property. The row keys in this index table would be composed of the Country property and the key of the row that uses this property value. In the example below, the target row key is shown as a number, but it can be any byte array.

Country   Row
Belgium   1
Belgium   5
Brazil    7
France    4
France    12

If we want to find all entities whose Country is France, we can use an HBase scanner to find all rows starting with France. Technically, you would set the start row for the scanner to France and stop the scanning by using a RowFilter with a BinaryPrefixComparator on the end value, here again France.

We are not limited to equals searches. We can also do range searches (e.g. from ‘Belgium’ to ‘France’) or prefix searches (e.g. all entities whose country name starts with a B).
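As a rough illustration of the scan described above, the following sketch simulates the index table with an in-memory sorted set instead of a real HBase table. The class name IndexScanSketch, its method names, and the zero byte placed between the Country value and the target row key are assumptions made for this example; none of them come from HBase or from the library mentioned at the top.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class IndexScanSketch {
    // Stand-in for the index table: row keys kept sorted as unsigned bytes.
    private final TreeSet<byte[]> rows = new TreeSet<byte[]>(Arrays::compareUnsigned);

    // Index row key = country bytes + 0x00 separator (an assumption) + target row key.
    static byte[] indexKey(String country, String targetRow) {
        return (country + '\u0000' + targetRow).getBytes(StandardCharsets.UTF_8);
    }

    void add(String country, String targetRow) {
        rows.add(indexKey(country, targetRow));
    }

    // Equals search: start at the prefix and stop at the first key that no longer has it.
    List<String> findByCountry(String country) {
        byte[] prefix = indexKey(country, "");
        List<String> targets = new ArrayList<>();
        for (byte[] key : rows.tailSet(prefix, true)) {
            if (!startsWith(key, prefix)) {
                break;
            }
            targets.add(new String(key, prefix.length, key.length - prefix.length,
                    StandardCharsets.UTF_8));
        }
        return targets;
    }

    private static boolean startsWith(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
                && Arrays.equals(key, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

Because the target row keys are stored as text in this sketch, they come back in byte order rather than numeric order. A prefix search on the country name would use the same loop with a prefix that leaves out the separator.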

Note that we have no use for HBase columns here: all information we need is stored in the row key. The key of the indexed row has to be part of the index row key, otherwise entities with the same property value would lead to duplicate row keys, which is not possible. Since HBase actually requires at least one column, we need to store a dummy column.

The same mechanism also allows building composite indexes, where the row key is built up from multiple properties. Let’s extend our previous index with a new property Category.

Country   Category   Row
Belgium   C          1
France    A          4
France    A          12
France    B          4

The Category property is multi-valued, as exemplified by the entity 4 which occurs two times in the index.

We can again use HBase scanners to search on this index. With composite indexes, you do not necessarily need to search on each field, but you have to use the index-fields from left to right, and only the rightmost field can be used for range or prefix searching. In our example, you could equal-search on just country, equal-search on the combination of country and category, range/prefix scan on country, or equal-search on country in combination with a range/prefix scan on category.

When you do not search on all fields of a composite index, or when you do a range search and the field is multi-valued (like in the Category example), then the index can return the same row multiple times. In the example above, searching for France without a condition on Category would return row 4 two times, and not grouped together. This can be annoying for the consumer of the index results.

Merge joins

If we use indexes such that their returned row keys are unique (by searching on all fields of a composite index, or by not using range-scans for multi-valued properties) and sorted by row key (this last point is automatically assured by HBase), then we can easily merge-join results from multiple indexes! Merge-joins are nicely explained in this presentation by Brett Slatkin.
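A merge-join over two such result lists can be sketched as follows. This is a minimal in-memory version that intersects two lists of row keys, assuming both are already sorted in unsigned byte order and contain no duplicates. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MergeJoinSketch {
    // Intersects two result lists that are each sorted by row key and free of duplicates.
    static List<byte[]> mergeJoin(List<byte[]> left, List<byte[]> right) {
        List<byte[]> matches = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = Arrays.compareUnsigned(left.get(i), right.get(j));
            if (cmp == 0) {
                matches.add(left.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;   // left key is smaller, so it cannot occur further on in right
            } else {
                j++;   // right key is smaller, so it cannot occur further on in left
            }
        }
        return matches;
    }
}
```

Each list is walked only once, which is why the uniqueness and ordering conditions above matter: with duplicates or unsorted input the single pass would miss or repeat matches.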

From the logical index model to bytes

As mentioned before, HBase row keys are plain byte arrays, and HBase determines the row order by byte comparison. This means that to compare two byte arrays, HBase compares the corresponding bytes, from left to right. All corresponding bytes being equal, shorter arrays compare as smaller than longer arrays.
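The comparison rule just described can be written out in a few lines. This is a sketch of the rule and not HBase's own code; the class name is invented here. On Java 9 and later, Arrays.compareUnsigned(byte[], byte[]) gives the same ordering.

```java
final class ByteOrderSketch {
    // Lexicographic comparison of unsigned bytes; a proper prefix sorts before the longer array.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;   // Java bytes are signed, so mask to get the unsigned value
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

The masking step matters: without it, a byte such as 0x80 would be treated as negative and sort before 0x7f.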

If we want to build indexes on data types such as strings, numbers or dates, we have to map them onto bytes in such a way that HBase’s byte comparison yields the same order as comparing the logical values would.

For composite indexes, we will need to pad the key entries with zeros so that corresponding values align in the byte arrays. So if the country field could be 10 bytes wide and the category field 3 bytes, index entries could look like:

France0000A002
Belgium000C001

The zeros represent bytes with all bits set to zero, so that in comparisons they will always be smaller than anything else.

Note that for strings, this means that you’ll have to decide beforehand how long the indexed string value can get.

Update: as Chris points out in the comments, this padding is not necessary, a well-chosen separator will also do the trick. After all, we only need to compare corresponding values if all the values more to the left are already equal.
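Both ways of laying out a composite key, fixed-width padding and a separator, can be sketched like this. The helper names are hypothetical, the field widths follow the example above, and the choice of a zero byte as separator is an assumption that only works if field values never contain a zero byte themselves.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CompositeKeySketch {
    // Fixed-width variant: the value is right-padded with zero bytes up to width.
    static byte[] pad(String value, int width) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        if (raw.length > width) {
            throw new IllegalArgumentException("value longer than field width: " + value);
        }
        return Arrays.copyOf(raw, width);   // copyOf fills the tail with zeros
    }

    // Separator variant: a single zero byte after a variable-length field.
    // Assumes the field value itself never contains a zero byte.
    static byte[] terminated(String value) {
        byte[] raw = value.getBytes(StandardCharsets.UTF_8);
        return Arrays.copyOf(raw, raw.length + 1);
    }

    // Concatenates already-encoded fields into one row key.
    static byte[] concat(byte[]... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] field : fields) {
            out.write(field, 0, field.length);
        }
        return out.toByteArray();
    }
}
```

For instance, concat(pad("France", 10), pad("A", 3), rowKey) gives the padded layout, while concat(terminated("France"), terminated("A"), rowKey) gives the shorter separator-based layout.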

String sorting

When we want to keep an index of string values, we need to convert those strings first to bytes to be able to store them in HBase row keys. If this conversion is done using UTF-8, then according to Wikipedia, “Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points”.

Often we will not want our strings to be sorted by Unicode code points, since we want, for example, é or E to be sorted before f. Note that this ordering is important because we want to do range scans; for equals searches alone it does not matter what comes first.
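The effect of plain UTF-8 ordering is easy to reproduce. The sketch below sorts a few strings the way their UTF-8 bytes would be ordered as row keys; the class name is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Utf8OrderSketch {
    // Sorts strings the way they would be ordered as UTF-8 encoded row keys.
    static List<String> sortAsUtf8(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort((a, b) -> Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }
}
```

Sorting f, é, E and e this way yields E, e, f, é: the uppercase letter comes first and the accented letter ends up after f, which is rarely what a user expects.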

Another aspect besides the order of the index is that, when searching on an index, one will often (but not always) want to ignore certain spelling variations like missing accents on characters.

Ignoring case

Case can be ignored by converting all strings to their lowercase variant. Lowercasing a string in Java is a locale-dependent operation, though there are only a few locales for which this really makes a difference: Turkish, Azerbaijani and Lithuanian.

Normalizing

Sometimes there are multiple possible Unicode representations for the same visual character. A typical example is that characters with accents can be represented either as a single Unicode character or as the combination of the base character and a combining accent character. Java’s Normalizer class can canonicalize these different forms.

Simplify strings

A possible solution for accented characters is to remove the accents, and more generally to reduce the text to plain ASCII. Lucene does this, and we can re-use the code of its ASCIIFoldingFilter.
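The three steps above (lowercasing, normalizing, removing accents) can be combined using only the JDK. Note that this sketch is not Lucene's ASCIIFoldingFilter: it strips combining marks after decomposition, which covers common accented letters but leaves characters without a decomposition (such as ø or ß) untouched. The class name and the choice of Locale.ROOT are assumptions for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

final class StringFoldingSketch {
    // Lowercases, decomposes accented characters and drops the combining marks.
    static String fold(String value) {
        String lower = value.toLowerCase(Locale.ROOT);   // fixed locale keeps the result stable
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");      // \p{M} matches combining marks
    }
}
```

The same folding has to be applied to the search terms at query time, otherwise the scan boundaries will not line up with the stored keys.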

Collator

The standard Java solution for locale-sensitive sorting is to use the Collator class. As far as sorting is concerned, it does exactly what we want. And it is even possible to materialize this ordering into something we can use as HBase row key: via the Collator you can get access to a CollationKey, which “converts a String to a series of bits that can be compared bitwise against other CollationKeys”. And you can get access to these bits using CollationKey’s toByteArray() method.

This sounds like an ideal solution, though there are some things to be aware of:

  • The collation byte arrays are rather long: seems like it uses 6 to 8 bytes per character, plus some global overhead.
  • The inverse translation, from collation bytes to string is not supported. This is not really needed for our indexing purpose, but might be handy for debugging indexes.
  • The algorithm for the construction of the collation key bits is not specified as part of the API, so it might differ between JVM implementations or JVM versions.
  • While the Collator offers optimal sorting, it does not help if you want to search ignoring accents. But the reverse is true too: if you want to perform exact case-sensitive searches, while also having locale-sensitive and case-insensitive sorting, then the collator solution is perfect.
  • The collation key of a shorter string is not a prefix of the collation key of a longer string, so if you want to search on a prefix of the string, this is not possible. I find this an important disadvantage.
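Obtaining the collation bytes takes only a few lines. The sketch below uses the JDK's default collator for a given locale. The class and method names are invented for the example, and, as noted in the list above, the exact bytes may differ between JVM implementations or versions.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class CollationKeySketch {
    // Returns the collation key bytes for a value under the locale's default collator.
    static byte[] collationBytes(Locale locale, String value) {
        Collator collator = Collator.getInstance(locale);
        return collator.getCollationKey(value).toByteArray();
    }

    // Compares two strings through their collation key bytes, the way row keys are compared.
    static int compareAsRowKeys(Locale locale, String a, String b) {
        return Arrays.compareUnsigned(collationBytes(locale, a), collationBytes(locale, b));
    }
}
```

With a French collator, both é and E compare as smaller than f this way, which is the ordering asked for earlier.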

String sorting conclusion

There is no obvious choice for a default fits-all solution, so for my HBase-indexing purpose I am looking into making the string to byte conversion pluggable.

Integer sorting

If we compare the byte representation of two integers, will this behave such that the smaller integer is considered smaller than the larger integer?

To know the answer to this question, we need to know that the binary representation of an integer is two’s complement.

Ignoring the two’s complement for a moment, in plain binary numbers, the more significant bits are more to the left, so HBase’s left-to-right comparison will automatically do the right thing.

The first bit of a two’s complement integer is a sign bit: 1 for negative numbers, 0 for positive numbers. 1 is larger than 0, so with byte-based comparison this would mean negative numbers are considered larger than positive numbers. This can be easily solved by flipping that bit.

In the negative number range, -1000 is smaller than -10, so a bigger magnitude means a smaller number. However, because in two’s complement the bits of all negative numbers are inverted, the order will be fine.

So in conclusion, to make integers compare correctly, we only need to flip the sign bit.
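In Java this comes down to a single XOR before writing the integer in big-endian order. A minimal sketch, with made-up class and method names:

```java
import java.nio.ByteBuffer;

final class IntKeySketch {
    // Big-endian two's complement with the sign bit flipped.
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value ^ 0x80000000).array();
    }

    // Reverses encode: read the big-endian int and flip the sign bit back.
    static int decode(byte[] key) {
        return ByteBuffer.wrap(key).getInt() ^ 0x80000000;
    }
}
```

ByteBuffer writes big-endian by default, so the most significant byte ends up on the left as required. The same approach works for long values with Long.MIN_VALUE as the mask.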

Float sorting

Floats are a more interesting problem than integers. For one, floating point numbers are an approximate representation, so equals searches will be a problem. So let us assume that these indexes will be only used for range searches, possibly with a small epsilon range. In composite indexes, they will hence only be usable as the last field.

Java uses the float representation as defined by IEEE:

[1 sign bit][8 exponent bits][23 mantissa bits]

The sign bit is again 1 for negative numbers and 0 for positive numbers. The exponent and mantissa are such that the most significant bits are to the left. The exponent is unsigned. In the mantissa each place represents a negative power of 2: 2^-1, 2^-2, 2^-3, … similar to the decimal system. See this document for more details on the float format.

With this representation positive floats will compare correctly. We only have to flip the sign bit so that positive numbers will be larger than negative numbers.

For negative numbers, in contrast with the two’s complement integers, the bits are not inverted. Simply flipping all the exponent and mantissa bits will get exactly the behavior we need.

Note that there is a lot more to say about floats: there is a way to encode positive and negative infinity, there is something called subnormal numbers, and there is a positive and negative zero. The float representation is designed such that all these will be sorted fine without further consideration. A special case is NaN (not a number), for which multiple representations are possible; depending on their sign bit these will be sorted either before negative infinity or after positive infinity.
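The two rules for floats (flip only the sign bit for positive numbers, flip all bits for negative numbers) can be sketched as follows. The class name is made up. One detail specific to this sketch: Float.floatToIntBits collapses all NaN values into one canonical bit pattern, which here ends up sorted after positive infinity.

```java
import java.nio.ByteBuffer;

final class FloatKeySketch {
    static byte[] encode(float value) {
        int bits = Float.floatToIntBits(value);
        if (bits < 0) {
            bits = ~bits;            // negative: invert sign, exponent and mantissa bits
        } else {
            bits ^= 0x80000000;      // positive: only flip the sign bit
        }
        return ByteBuffer.allocate(4).putInt(bits).array();
    }
}
```

Negative zero sorts directly before positive zero with this encoding. Doubles can be handled the same way with Double.doubleToLongBits and an 8-byte buffer.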

Decimal sorting

To sort BigDecimals, or also for floats, we can use the decimal string representation of the numbers. A simple example is:

055.23
124.359

Note that you have to pad your numbers with leading zeros in order to have them ordered correctly.

There are some difficulties with this approach though.

First, this will not work as-is for negative numbers: -1 is larger than -2, but its string is sorted first. This can be solved by replacing each digit d by 9 - d (0 becomes 9, 1 becomes 8, and so on). Then there is still the problem that -3.33 is larger than -3.333, while lexicographically the longer string will be sorted after the shorter one. This can be solved by suffixing each negative number with something that is larger than any digit, like the character ‘a’.

Second, you need to know beforehand how large your numbers can get. And if they can get really large, you will end up with large strings. However, inspired by studying the floats encoding above, I think we can do a decimal equivalent of the floating point’s exponent-mantissa approach. With a bit of Googling, I found this post by Steven A Rowe which is about the same idea.
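The string-based approach, including the handling of negative numbers, could look roughly like this. Several details are assumptions of the sketch and not part of the description above: the class name, the fixed number of integer digits passed as intDigits, the prefix characters 'n' and 'p' used to keep negative numbers before positive ones, and the digit replacement rule d to 9 - d.

```java
import java.math.BigDecimal;

final class DecimalStringKeySketch {
    // intDigits is the fixed number of digits before the decimal point.
    static String encode(BigDecimal value, int intDigits) {
        String plain = value.abs().stripTrailingZeros().toPlainString();
        int dot = plain.indexOf('.');
        int intLength = dot < 0 ? plain.length() : dot;
        if (intLength > intDigits) {
            throw new IllegalArgumentException("too many integer digits: " + value);
        }
        String padded = "0".repeat(intDigits - intLength) + plain;
        if (value.signum() >= 0) {
            return "p" + padded;                 // 'p' sorts after 'n'
        }
        StringBuilder out = new StringBuilder("n");
        for (char c : padded.toCharArray()) {
            // Replace each digit d by 9 - d; keep the decimal point as is.
            out.append(Character.isDigit(c) ? (char) ('0' + '9' - c) : c);
        }
        return out.append('a').toString();       // 'a' sorts after every digit and after '.'
    }
}
```

With three integer digits, 55.23 becomes p055.23 and -3.33 becomes n996.66a. Since the result is plain ASCII, its byte order matches its string order.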

Update:

There is actually no need for a string-like approach for decimals, since HBase does not use strings anyway. I have probably been looking too much at Lucene lately.

As the BigDecimal javadoc says: “A BigDecimal consists of an arbitrary precision integer unscaled value and a 32-bit integer scale”. We can simply use this unscaled value as mantissa, it is available as (two’s complement) bytes via BigDecimal.unscaledValue().toByteArray(). The scale of a BigDecimal is the number of digits to the right of the decimal point. For the exponent, we are rather interested in the number of digits to the left, which can be computed via BigDecimal.precision() - BigDecimal.scale().

Given all this, we can build a byte array in the same way as for floats: a sign bit, some exponent bits, and a variable number of mantissa bits. As in IEEE floats, we can offset the exponent so that it becomes unsigned. For the rest, just invert bits with a similar reasoning as for floats.
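The two building blocks mentioned above can be extracted as shown below. This sketch only shows how to obtain the mantissa and exponent candidates; it deliberately stops short of a full encoding, because values with a different number of digits (for example 1.5 and 1.25) have unscaled values that are not directly comparable, so a complete scheme needs a normalization step that is not worked out here. The class and method names are hypothetical.

```java
import java.math.BigDecimal;

final class BigDecimalPartsSketch {
    // Mantissa candidate: the unscaled value as big-endian two's complement bytes.
    static byte[] mantissaBytes(BigDecimal value) {
        return value.unscaledValue().toByteArray();
    }

    // Exponent candidate: number of digits to the left of the decimal point.
    static int exponent(BigDecimal value) {
        return value.precision() - value.scale();
    }
}
```

For 123.45 the unscaled value is 12345 and the exponent is 3; for 0.005 the unscaled value is 5 and the exponent is -2.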

Date sorting

The widely used ISO 8601 date-time format is designed such that lexicographic order corresponds to the chronological order, so we can just use that format. It only uses plain ASCII characters, so the string-to-byte conversion does not pose a problem.

But again, there are some things to be aware of.

First, we should normalize all our date-times to the same timezone, preferably UTC, in which case the ISO 8601 string ends with a Z.

Second, the sorting will not be correct for negative years. This could be solved in a similar way as for decimals.

Update:

Dates can of course also be treated as integer/longs.
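Both date variants can be sketched with java.time. The fixed pattern with second precision is an assumption of this example: a fixed width is needed because a format with an optional fraction of a second would no longer sort chronologically. The class and method names are invented, and negative years are not handled.

```java
import java.nio.ByteBuffer;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class DateKeySketch {
    // Fixed-width ISO 8601 pattern in UTC with second precision.
    private static final DateTimeFormatter ISO_UTC =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss'Z'").withZone(ZoneOffset.UTC);

    // String variant: normalizes to UTC and formats with a fixed width.
    static String isoKey(Instant instant) {
        return ISO_UTC.format(instant);
    }

    // Integer variant: epoch milliseconds as a sign-flipped big-endian long.
    static byte[] longKey(Instant instant) {
        return ByteBuffer.allocate(8).putLong(instant.toEpochMilli() ^ Long.MIN_VALUE).array();
    }
}
```

The long variant is shorter (8 bytes instead of 20) and also orders dates before 1970 correctly, at the cost of being unreadable when inspecting the index by hand.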

Ascending and descending indexes

By scanning over an index, the matching rows will be automatically returned in the order of the values we indexed on. HBase cannot scan in reverse order, so if you would like to be able to retrieve results in reverse order, you could do so by inverting all the bits in the index row keys.
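Inverting the bits of a key is a one-line loop. One caveat, which is an observation added here and not taken from the text above: the reversal is only exact for keys of equal length, because a key that is a prefix of another one still sorts first after inversion. The class name is made up.

```java
final class DescendingKeySketch {
    // Inverts every bit so that larger keys become smaller ones.
    static byte[] invert(byte[] key) {
        byte[] inverted = new byte[key.length];
        for (int i = 0; i < key.length; i++) {
            inverted[i] = (byte) ~key[i];
        }
        return inverted;
    }
}
```

Applying invert twice gives back the original key, so the indexed value can still be recovered from a descending index.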

Status

All the above is preliminary thought work; it might be full of errors and oversights. I hope to put it into practice sometime soon.

 
