笨鸟的平凡之路 - HBase Pre-Splitting

Latest recommended article published 2024-11-06 11:12:27

笨鸟的平凡之路

Category: HBase. Tags: big data

Article link: https://blog.csdn.net/weixin_45109718/article/details/92798108

Preface

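When HBase creates a table, by default the table has only a single Region to store data, and that Region has no StartKey or EndKey.
[Figure: the table's single Region, with empty StartKey and EndKey]
As a result, all of the table's data is written to this one Region. As the data grows, the Region eventually reaches the threshold defined by the hbase.hregion.max.filesize property (10 GB by default) and splits into two Regions of roughly equal size. The split process consumes a lot of I/O, and frequent splits can severely degrade HBase performance.
To avoid the splits triggered by storing large volumes of data, we can pre-split the table before writing any data to it.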

Pre-splitting

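Because regions are defined by rowkey ranges, you need to know how the rowkey is composed, or at least its value range, before pre-splitting.
Take a rowkey layout commonly suggested online: two random digits + timestamp + customer number.
The two random digits range from 00 to 99, so the table can be split on that prefix into 10 regions:
-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90- (the first region has no start key and the last region has no end key)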
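
To make the ranges above concrete, here is a minimal sketch of how a rowkey maps to one of the ten regions. It is plain Java rather than HBase code: the class name `RegionLookup`, the sample rowkeys, and the customer number `C001` are hypothetical, and the byte-wise comparison is only an assumed stand-in for the ordering HBase applies to rowkeys.

```java
import java.nio.charset.StandardCharsets;

public class RegionLookup {
    // Hypothetical boundaries matching the ten ranges listed above.
    static final String[] SPLIT_KEYS = {"10", "20", "30", "40", "50", "60", "70", "80", "90"};

    /** Unsigned, byte-by-byte lexicographic comparison. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Returns the 0-based index of the region whose key range contains rowKey. */
    static int regionIndex(String rowKey) {
        byte[] row = rowKey.getBytes(StandardCharsets.UTF_8);
        int index = 0;
        for (String split : SPLIT_KEYS) {
            // The row moves on to the next region once it is >= that region's start key.
            if (compare(row, split.getBytes(StandardCharsets.UTF_8)) < 0) {
                break;
            }
            index++;
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(regionIndex("05" + "1560000000000" + "C001")); // 0
        System.out.println(regionIndex("10" + "1560000000000" + "C001")); // 1
        System.out.println(regionIndex("99" + "1560000000000" + "C001")); // 9
    }
}
```

Each region covers a half-open range [start key, end key), so a rowkey that begins with a boundary value falls into the region that starts at that boundary.
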
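Below we first create the table with the Java API. Before creating it, we need to build the splitKeys two-dimensional byte array, which holds the rowkey boundary values: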

// Imports needed by this method
import java.util.TreeSet;
import org.apache.hadoop.hbase.util.Bytes;

static byte[][] getSplitKeys() {
        String[] keys = new String[]{"10|", "20|", "30|", "40|", "50|",
                "60|", "70|", "80|", "90|"};
        // A TreeSet ordered by Bytes.BYTES_COMPARATOR keeps the keys in ascending
        // byte order and drops duplicates
        TreeSet<byte[]> rows = new TreeSet<byte[]>(Bytes.BYTES_COMPARATOR);
        for (String key : keys) {
            rows.add(Bytes.toBytes(key));
        }
        // Two-dimensional array holding the boundary values, already sorted
        return rows.toArray(new byte[rows.size()][]);
    }

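Note that the code above uses a TreeSet to sort the boundary rowkeys. The split keys must be sorted; otherwise the call to admin.createTable(tableDescriptor, splitKeys) can fail. The code that creates the table is as follows: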

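import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

    private static Configuration conf = HBaseConfiguration.create();

    static {
        conf.set("hbase.zookeeper.quorum", "masternode1:2181,masternode2:2181,masternode3:2181");
    }

    public static void main(String[] args) throws Exception {
        // try-with-resources closes admin and connection even if an exception is thrown
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // If the table already exists, drop it first
            TableName tableName = TableName.valueOf("test_region"); // table name
            if (admin.tableExists(tableName)) {
                admin.disableTable(tableName);
                admin.deleteTable(tableName);
            }

            HTableDescriptor desc = new HTableDescriptor(tableName);
            HColumnDescriptor family1 = new HColumnDescriptor(Bytes.toBytes("cf")); // column family
//            family1.setTimeToLive(3 * 60 * 60 * 24);     // TTL, in seconds
//            family1.setMaxVersions(3);                   // number of versions to keep
            desc.addFamily(family1);

            byte[][] splitKeys = getSplitKeys();

            // Create the table pre-split at the given boundaries
            admin.createTable(desc, splitKeys);
        }
    }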
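
Since the createTable call relies on the split keys being sorted, a small pre-flight check can surface a bad key list before the request is sent. The following is only a sketch in plain Java: the class `SplitKeyCheck` and its methods are hypothetical helpers, not part of the HBase API, and the unsigned byte comparison is assumed to match the ordering of the comparator used in getSplitKeys().

```java
import java.nio.charset.StandardCharsets;

public class SplitKeyCheck {
    /** Unsigned, byte-by-byte lexicographic comparison. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if every key is non-empty and strictly greater than the key before it. */
    static boolean isStrictlyAscending(byte[][] keys) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i].length == 0) {
                return false; // an empty boundary cannot separate two regions
            }
            if (i > 0 && compare(keys[i - 1], keys[i]) >= 0) {
                return false; // out of order, or a duplicate of the previous key
            }
        }
        return true;
    }

    /** Convenience conversion for the examples below. */
    static byte[][] toBytes(String... keys) {
        byte[][] result = new byte[keys.length][];
        for (int i = 0; i < keys.length; i++) {
            result[i] = keys[i].getBytes(StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isStrictlyAscending(toBytes("10|", "20|", "30|"))); // true
        System.out.println(isStrictlyAscending(toBytes("20|", "10|", "30|"))); // false
        System.out.println(isStrictlyAscending(toBytes("10|", "10|", "30|"))); // false
    }
}
```
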

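After running the program:
[Figure: the regions of test_region and the region servers hosting them]
As you can see, my cluster has 2 region servers, and the pre-split regions are evenly distributed across them.
Next, let's test loading data: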

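import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TestHBasePartition {
    private static Configuration conf = HBaseConfiguration.create();

    static {
        conf.set("hbase.zookeeper.quorum", "masternode1:2181,masternode2:2181,masternode3:2181");
    }

    public static void main(String[] args) throws Exception {
        // Connection and Table replace the deprecated HTable(conf, name) constructor
        // and are closed automatically by try-with-resources
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("test_region"))) {
            table.put(batchPut());
        }
    }

    // Random rowkey prefix in the range 00-99, always exactly two digits
    private static String getRandomNumber() {
        return String.format("%02d", ThreadLocalRandom.current().nextInt(100));
    }

    // Generate the batch of rows to write
    private static List<Put> batchPut() {
        List<Put> list = new ArrayList<Put>();
        for (int i = 1; i <= 10000; i++) {
            byte[] rowkey = Bytes.toBytes(getRandomNumber() + "|" + System.currentTimeMillis() + "-" + i);
            Put put = new Put(rowkey);
            // addColumn replaces the deprecated Put.add(family, qualifier, value)
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("zs" + i));
            list.add(put);
        }
        return list;
    }
}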

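As the figure shows, the table in HBase now holds 10,000 rows in total:
[Figure: row count of test_region]
Check in the HBase web UI whether the data is evenly distributed.

The data is now spread evenly across the 10 regions.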
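
The even spread follows from the random two-digit prefix: with the split keys used above, a row lands in the region given by the tens digit of its prefix. The sketch below simulates this without a cluster. It is plain Java with a hypothetical class name `SaltDistribution` and an arbitrary seed, and it only imitates the prefix logic of the load test; it does not read anything from HBase.

```java
import java.util.Arrays;
import java.util.Random;

public class SaltDistribution {
    /**
     * Draws `rows` random prefixes in 00-99 and counts how many fall into
     * each block of ten (00-09, 10-19, ..., 90-99), one block per region.
     */
    static int[] countPerRegion(int rows, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[10];
        for (int i = 0; i < rows; i++) {
            int prefix = random.nextInt(100); // the two leading digits of the rowkey
            counts[prefix / 10]++;            // tens digit selects the region
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same row count as the load test; the seed is arbitrary.
        int[] counts = countPerRegion(10000, 42L);
        System.out.println(Arrays.toString(counts));
    }
}
```

With 10,000 rows each region should receive roughly 1,000, with small random deviations from run to run.
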
Tips for region pre-splitting:
Decimal vs. hexadecimal prefixes:
https://blog.csdn.net/weixin_33924770/article/details/90839928
An approach that can be combined with ES:
https://blog.csdn.net/qq_31289187/article/details/80869906
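
As a rough illustration of the decimal versus hexadecimal choice, the sketch below generates evenly spaced split keys for a fixed-width prefix in either radix. It is not taken from the linked articles: the class `SplitKeyGenerator` and its parameters are hypothetical, and it assumes rowkeys start with a zero-padded, lowercase prefix of the given width.

```java
import java.util.Arrays;

public class SplitKeyGenerator {
    /**
     * Returns regions - 1 evenly spaced boundaries over the prefix space
     * [0, radix^width), each formatted as a zero-padded string in that radix.
     */
    static String[] splitKeys(int regions, int radix, int width) {
        long total = 1;
        for (int i = 0; i < width; i++) {
            total *= radix; // size of the prefix space, e.g. 10^2 or 16^2
        }
        String[] keys = new String[regions - 1];
        for (int i = 1; i < regions; i++) {
            long boundary = total * i / regions;
            StringBuilder key = new StringBuilder(Long.toString(boundary, radix));
            while (key.length() < width) {
                key.insert(0, '0'); // left-pad so all keys have the same width
            }
            keys[i - 1] = key.toString();
        }
        return keys;
    }

    public static void main(String[] args) {
        // Decimal prefix 00-99 split into 10 regions
        System.out.println(Arrays.toString(splitKeys(10, 10, 2)));
        // Hexadecimal prefix 00-ff split into 4 regions
        System.out.println(Arrays.toString(splitKeys(4, 16, 2)));
    }
}
```

A two-character hexadecimal prefix gives 256 distinct values instead of 100, which leaves more room to choose a region count that divides the prefix space evenly.
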