HBase Study Notes 1: Table Scans

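This article introduces table scans in HBase and stresses the importance of designing an efficient rowkey. It works through an example of creating and using a Scan to fetch one user's data, and shows how a Filter narrows the returned data to improve query efficiency. It covers full-table scans, scans over a key range, and content matching with ValueFilter.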

Table scans

You likely noticed the lack of a query command of any kind. You won’t find one, at least not today. The only way to access records containing a specific value is by using the Scan command to read across some portion of the table, applying a filter to retrieve only the relevant records. As you might imagine, the records returned while scanning are presented in sorted order. HBase is designed to support this kind of behavior, so it’s fast.

To scan the entire contents of a table, use the bare Scan constructor:

Scan s = new Scan();


Often, however, you’re only interested in a subset of the entire table. Perhaps you only want users with IDs starting with the letter T. Provide the Scan constructor with start and end rows:

Scan s = new Scan(
	Bytes.toBytes("T"),
	Bytes.toBytes("U"));

This is a contrived example, perhaps, but you get the idea. How about a practical example? You need to store twits. Further, you know you’ll want to access the most recent twits from a particular user. Let’s start there.

Designing tables for scans

Just as you would when designing a relational schema, designing schema for HBase tables requires that you consider the data shape and access patterns. Twits are a different kind of data with different access patterns than users, so let’s put them in their own table. For kicks, you’ll create the new table using the Java API instead of the shell. Table manipulation is performed using an instance of the HBaseAdmin object:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);


Making an HBaseAdmin instance explicitly requires a Configuration instance, a detail hidden from you by the default HTable and HTablePool constructors. That’s simple enough. Now you can define a new table and create it:

HTableDescriptor desc = new HTableDescriptor("twits");
HColumnDescriptor c = new HColumnDescriptor("twits");
c.setMaxVersions(1);
desc.addFamily(c);
admin.createTable(desc);


The HTableDescriptor object lets you build up the description of the new table, starting with its name: twits. Likewise, you build up the column family, also named twits, using the HColumnDescriptor. As with the users table, you only need one column family here. You don’t need twit versioning, so you’ll limit the retained versions to one.

With a fancy new twits table, you can begin storing twits. A twit consists of a message and the date and time it was posted. You’ll need a unique value for the rowkey, so let’s try the username plus the timestamp. Easy enough; let’s store twits like this:

Put put = new Put(
	Bytes.toBytes("TheRealMT" + 1329088818321L));
// column family "twits", qualifier "twit", value: the message text
put.add(
	Bytes.toBytes("twits"),
	Bytes.toBytes("twit"),
	Bytes.toBytes("Hello, TwitBase!"));

You know you’ll want the most recent twits first. You know that HBase stores rows in sorted order by rowkey in its physical data model. You take advantage of that feature. By including the timestamp of the twit in the rowkey and multiplying it by -1, you have the most recent twits first.
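
To see why negating the timestamp puts the newest twit first, here is a minimal sketch in plain Java with no HBase dependency. It assumes the timestamp is stored as a fixed-width, big-endian 8-byte long appended to the user bytes; the class and the rowKey helper are hypothetical names used only for illustration. Since HBase compares rowkeys as unsigned bytes, Arrays.compareUnsigned stands in for that ordering. The argument holds for positive timestamps, whose negated values all share the same sign.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: shows why a negated timestamp sorts newest-first.
class ReverseTimestampKey {

    // Hypothetical helper: user bytes followed by the negated timestamp
    // as a fixed-width, big-endian 8-byte long.
    static byte[] rowKey(String user, long timestampMillis) {
        byte[] userBytes = user.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(userBytes.length + Long.BYTES)
                .put(userBytes)
                .putLong(-1 * timestampMillis)
                .array();
    }

    public static void main(String[] args) {
        byte[] older = rowKey("TheRealMT", 1329088818321L);
        byte[] newer = rowKey("TheRealMT", 1329088818321L + 60_000L);
        // Unsigned lexicographic comparison mimics rowkey ordering.
        // A negative result means the newer twit sorts before the older one.
        System.out.println(Arrays.compareUnsigned(newer, older) < 0); // prints true
    }
}
```

A fixed-width binary encoding matters here: concatenating the number as decimal text, as in the Put above, would not give the same ordering guarantee once it is negated.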

Rowkey design is critical in HBase schema

This point we can’t stress enough: HBase rowkeys are the number one most important thing to think about when designing a table. We cover this in much greater detail in chapter 4. We mention it now so you can keep it in mind as you pursue the examples. The first question you should always ask yourself when looking at an HBase schema is, “What’s in the rowkey?” The next question should be, “How can I use the rowkey more effectively?”


Executing a scan

Using the user as the first portion of the twits rowkey turns out to be useful. It effectively creates buckets of data by user in the natural ordering of rows. All data from one user is in contiguous rows. What does the Scan look like? More or less the same as before, just with more complexity in calculating the stop key. The code here uses a fixed-length MD5 hash of the username as the user portion of the rowkey:

// longLength appears to be the byte width of a long (8), judging by its use when decoding the rowkey
byte[] userHash = Md5Utils.md5sum(user);
byte[] startRow = Bytes.padTail(userHash, longLength); // 212d...866f00...
byte[] stopRow = Bytes.padTail(userHash, longLength);
stopRow[Md5Utils.MD5_LENGTH-1]++; // 212d...867000...
Scan s = new Scan(startRow, stopRow);
ResultScanner rs = twits.getScanner(s);


In this case, you create the stop key by incrementing the value of the last byte of the user ID portion of the rowkey. Scanners return records inclusive of the start key and exclusive of the end key, so this gives you twits for only the matching user.
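
The start-inclusive, stop-exclusive behavior and the stop-key calculation can be sketched without HBase. In the sketch below a TreeMap ordered by unsigned bytes stands in for a sorted table, and stopKeyFor and countInRange are hypothetical helpers, not HBase APIs. One deliberate difference from the code above: a plain increment wraps around when the last byte is 0xFF, so this version carries to the left instead. That carry is an addition for illustration, not something the original code does.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

// Illustrative sketch only; a TreeMap stands in for a sorted HBase table.
class PrefixScanSketch {

    // Hypothetical helper: the smallest key that sorts after every key
    // beginning with prefix. Bytes equal to 0xFF carry to the left.
    static byte[] stopKeyFor(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
        }
        return new byte[0]; // all bytes were 0xFF: no finite stop key
    }

    // Counts keys in [start, stop): start inclusive, stop exclusive.
    // An empty stop key is treated as "scan to the end".
    static int countInRange(TreeMap<byte[], String> table, byte[] start, byte[] stop) {
        if (stop.length == 0) {
            return table.tailMap(start, true).size();
        }
        return table.subMap(start, true, stop, false).size();
    }

    public static void main(String[] args) {
        Comparator<byte[]> unsignedOrder = Arrays::compareUnsigned;
        TreeMap<byte[], String> table = new TreeMap<>(unsignedOrder);
        table.put(new byte[] {0x21, 0x2c, (byte) 0xFF}, "previous user");
        table.put(new byte[] {0x21, 0x2d, 0x00}, "target user, twit 1");
        table.put(new byte[] {0x21, 0x2d, 0x7f}, "target user, twit 2");
        table.put(new byte[] {0x21, 0x2e, 0x00}, "next user");

        byte[] start = {0x21, 0x2d};
        byte[] stop = stopKeyFor(start); // {0x21, 0x2e}
        System.out.println(countInRange(table, start, stop)); // prints 2
    }
}
```

Only the two rows sharing the target prefix fall inside the range; the neighboring users sort just outside it on either side.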

Reading twits off the ResultScanner is a simple loop:

for (Result r : rs) {
	// extract the username
	byte[] b = r.getValue(
		Bytes.toBytes("twits"),
		Bytes.toBytes("user"));
	String user = Bytes.toString(b);
	// extract the twit
	b = r.getValue(
		Bytes.toBytes("twits"),
		Bytes.toBytes("twit"));
	String message = Bytes.toString(b);
	// extract the timestamp
	b = Arrays.copyOfRange(
		r.getRow(),
		Md5Utils.MD5_LENGTH,
		Md5Utils.MD5_LENGTH + longLength);
	DateTime dt = new DateTime(-1 * Bytes.toLong(b));
}


The only work done in the loop is fixing the timestamp value and converting byte[] values back to their proper data types. Voila! You’ll have something like this:

<Twit: TheRealMT 2012-02-20T00:13:27.931-08:00 Hello, TwitBase!>

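The timestamp "fix" in the loop can be tried in isolation. The following is a rough sketch using only the JDK, assuming the rowkey layout implied by the scan code: a 16-byte MD5 hash of the username followed by the negated timestamp as an 8-byte big-endian long. TwitKeyDecoder and timestampOf are hypothetical names, and java.time.Instant is used here in place of the DateTime class from the loop.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.Instant;
import java.util.Arrays;

// Illustrative sketch only: decodes the timestamp portion of a twit rowkey.
class TwitKeyDecoder {
    static final int MD5_LENGTH = 16; // an MD5 digest is 16 bytes

    // Slice out the 8 bytes after the hash and undo the negation.
    static long timestampOf(byte[] rowKey) {
        byte[] b = Arrays.copyOfRange(rowKey, MD5_LENGTH, MD5_LENGTH + Long.BYTES);
        return -1 * ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) throws Exception {
        byte[] userHash = MessageDigest.getInstance("MD5")
                .digest("TheRealMT".getBytes(StandardCharsets.UTF_8));
        // Build a rowkey the way the assumed layout describes it.
        byte[] rowKey = ByteBuffer.allocate(MD5_LENGTH + Long.BYTES)
                .put(userHash)
                .putLong(-1 * 1329088818321L)
                .array();

        long millis = timestampOf(rowKey);
        System.out.println(millis == 1329088818321L); // prints true
        System.out.println(Instant.ofEpochMilli(millis)); // the same instant, in UTC
    }
}
```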
Scanner caching

A scan can be configured to retrieve a batch of rows in every RPC call it makes to HBase. This configuration can be done at a per-scanner level by using the setCaching(int) API on the scan object. It can also be set in the hbase-site.xml configuration file using the hbase.client.scanner.caching property. If the caching value is set to n, the scanner will return n rows with every RPC call and they will be cached at the client side while it works through them. The default value of this configuration is 1 in the HBase releases this text describes (later releases ship a larger default), which means that when you scan through a table, only one row is returned per RPC call that the client makes to HBase. That’s a conservative number, and you can tune it for better performance. But setting the value too high would mean that the client’s interaction with HBase would have longer pauses, and this could result in timeouts on HBase’s side.

The ResultScanner interface also has a next(int) call that you can use to ask it to return the next n rows from the scan. This is an API convenience that doesn’t have any relation to the number of RPC calls the client makes to HBase to get those n rows. Under the hood, ResultScanner makes as many RPC calls as necessary to satisfy the request; the number of rows returned per RPC call is solely dependent on the caching value you configure for the scanner.
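
The effect of the caching value on round trips is simple arithmetic, sketched below. This is a deliberately simplified model, not a measurement of HBase: it counts only the RPC calls that return rows and ignores calls for opening and closing the scanner, as well as any size-based limits. ScannerCachingModel and rpcCalls are hypothetical names.

```java
// Illustrative sketch only: a simplified model of row-returning RPC calls.
class ScannerCachingModel {

    // With n rows per RPC, scanning `rows` rows takes ceil(rows / n) calls.
    static long rpcCalls(long rows, int caching) {
        if (caching < 1) {
            throw new IllegalArgumentException("caching must be >= 1");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 10_000;
        System.out.println(rpcCalls(rows, 1));    // prints 10000
        System.out.println(rpcCalls(rows, 100));  // prints 100
        System.out.println(rpcCalls(rows, 1000)); // prints 10
    }
}
```

The same model shows the trade-off described above: fewer calls means each call carries more rows, so the client waits longer between them.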

Applying filters

It’s not always possible to design a rowkey to perfectly match your access patterns. Sometimes you’ll have use cases where you need to scan through a set of data in HBase but return only a subset of it to the client. This is where filters come in. Add a filter to your Scan object like this:

Filter f = ...
Scan s = new Scan();
s.setFilter(f);


A filter is a predicate that executes in HBase instead of on the client. When you specify a Filter in your Scan, HBase uses it to determine whether a record should be returned. This can avoid a lot of unnecessary data transfer. It also keeps the filtering on the server instead of placing that burden on the client.

The filter applied is anything implementing the org.apache.hadoop.hbase.filter.Filter interface. HBase provides a number of filters, but it’s easy to implement your own.

To filter all twits that mention TwitBase, you can use a ValueFilter in combination with a RegexStringComparator:

Scan s = new Scan();
s.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit"));
Filter f = new ValueFilter(
	CompareOp.EQUAL,
	new RegexStringComparator(".*TwitBase.*"));
s.setFilter(f);
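
To get a feel for what the pattern selects, the same regular expression can be applied as an ordinary predicate in plain Java. This is only a client-side illustration of the predicate idea; it is not how HBase evaluates the filter, which runs on the server side as described above. ValueRegexSketch and mentioning are hypothetical names, and the sample messages other than "Hello, TwitBase!" are made up.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative sketch only: applies the regex as a local predicate.
class ValueRegexSketch {

    // Keeps the values in which the pattern is found.
    static List<String> mentioning(List<String> twits, String regex) {
        Predicate<String> matches = Pattern.compile(regex).asPredicate();
        return twits.stream().filter(matches).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> twits = List.of(
                "Hello, TwitBase!",
                "Lunch was great",
                "TwitBase is fun");
        // Only the two messages containing "TwitBase" survive the predicate.
        System.out.println(mentioning(twits, ".*TwitBase.*").size()); // prints 2
    }
}
```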


HBase also provides a class for filter construction. The ParseFilter object implements a kind of query language used to construct a Filter instance for you. The same TwitBase filter can be constructed from an expression:

Scan s = new Scan();
s.addColumn(TWITS_FAM, TWIT_COL);
String expression = "ValueFilter(=,'regexString:.*TwitBase.*')";
ParseFilter p = new ParseFilter();
Filter f = p.parseSimpleFilterExpression(Bytes.toBytes(expression));
s.setFilter(f);


