1. Overview
http://hbase.apache.org
Apache HBase is a reliable, distributed database built on Hadoop, well suited to storing structured big data.
Apache HBase is an open-source implementation of Google BigTable: an open-source, distributed, multi-versioned, column-family-oriented, non-relational database. HBase is built on top of Hadoop's HDFS.
Column storage vs. row storage
Column storage and row storage refer to how data is organized on the storage medium.
Row-oriented (relational) databases: Oracle, MySQL, DB2, SQL Server, etc.
Column-oriented (non-relational) stores: HBase, Druid, Vertica, Infobright, etc.
HBase Data Model
- Row key: rowkey, the unique identifier used to look up a row. Row keys cannot repeat, rows are kept sorted lexicographically by rowkey, and the key is stored as byte[] underneath.
- Column family: column family (cf), a group of columns; a column family usually holds a set of columns that are functionally or business-wise related.
- Column: column, a field inside a column family, used to store one category of data.
- Cell: rowkey + column family + column together locate a cell; a cell may keep multiple versions of its data, 1 by default.
- Multi-version: a cell is allowed to keep multiple versions of its data.
- Version number: the timestamp at the time of the write; by default the cell value with the newest timestamp is returned to the user.
Characteristics
- Big: a single table can hold tens of billions of rows and millions of columns.
- Column-oriented: storage and access control are organized per column (family), and column (families) can be retrieved independently.
- Sparse: columns that are NULL take up no storage space, so tables can be designed to be extremely sparse.
- Schema-free: every row has a sortable row key and an arbitrary number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns.
- Multi-versioned: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was written.
- Single data type: at the lowest level, all data in HBase is stored as byte[], so in principle any type of data can be stored.
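Because every value ultimately becomes a byte[], the client is responsible for encoding and decoding. Below is a minimal sketch using the Bytes helper class from hbase-common (the same utility the Java examples later in these notes rely on); the values here are made up for illustration:

import org.apache.hadoop.hbase.util.Bytes;

public class BytesDemo {
    public static void main(String[] args) {
        // Encode typed values into the byte[] form that HBase actually stores
        byte[] rawName  = Bytes.toBytes("iphone");  // String -> byte[]
        byte[] rawPrice = Bytes.toBytes(1999.0D);   // double -> byte[]
        byte[] rawCount = Bytes.toBytes(2);         // int    -> byte[]

        // Decode them back; the caller has to remember the original type
        String name  = Bytes.toString(rawName);
        double price = Bytes.toDouble(rawPrice);
        int count    = Bytes.toInt(rawCount);
        System.out.println(name + " " + price + " " + count);
    }
}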
2. Basic Usage
Environment Setup
Pseudo-distributed cluster
Prerequisites
- Make sure the HDFS cluster is running normally
- Make sure the ZooKeeper cluster is running normally
[root@hadoop ~]# jps
83920 SecondaryNameNode
83602 NameNode
83698 DataNode
2548 QuorumPeerMain
Installation and configuration
[root@hadoop ~]# tar -zxf hbase-1.2.4-bin.tar.gz -C /usr
- conf/hbase-site.xml
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop:9000/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
- conf/regionservers
hadoop
Update the environment variables
[root@hadoop hbase-1.2.4]# vi ~/.bashrc
# Remove the earlier environment variable settings and add the following
HBASE_MANAGES_ZK=false
HBASE_HOME=/usr/hbase-1.2.4
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export HBASE_HOME
export HBASE_MANAGES_ZK
[root@hadoop hbase-1.2.4]# source ~/.bashrc
Start the service
[root@hadoop hbase-1.2.4]# start-hbase.sh
Verify that the HBase service is healthy
Option 1:
[root@hadoop hbase-1.2.4]# jps
83920 SecondaryNameNode
85104 HRegionServer # HBase worker (slave) node
84963 HMaster # HBase master node
83602 NameNode
83698 DataNode
2548 QuorumPeerMain
85516 Jps
Option 2:
http://hadoop:16010/master-status
Shell Commands
Connect to the HBase server with the command-line client
[root@hadoop hbase-1.2.4]# hbase shell
In the shell, use the help command to view usage information
hbase(main):007:0> help "get"     # help "<command>" shows the help for a single command
hbase(main):007:0> help "general" # help "<command group>" shows the help for the commands in a group
COMMAND GROUPS:
Group name: general # general-purpose commands
Commands: status, table_help, version, whoami
Group name: ddl # table-definition commands
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters
Group name: namespace # similar to a database in MySQL; organizes and manages tables
Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Group name: dml # CRUD operations on data
Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve
General commands
- status
hbase(main):010:0* status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load
- version
hbase(main):013:0* version
1.2.4, rUnknown, Wed Feb 15 18:58:00 CST 2017
- whoami
hbase(main):014:0> whoami
root (auth:SIMPLE)
    groups: root
Namespace commands
A namespace is very similar to a database in MySQL: it is used to organize and manage HBase tables. HBase ships with a default namespace named default.
- alter_namespace
hbase(main):022:0* alter_namespace 'baizhi',{METHOD=>'set', 'AUTHOR'=>'GAOZHY'}
0 row(s) in 0.0440 seconds
- create_namespace
hbase(main):017:0> create_namespace 'baizhi'
0 row(s) in 0.0690 seconds
- describe_namespace
hbase(main):019:0> describe_namespace 'baizhi'
DESCRIPTION
{NAME => 'baizhi'}
1 row(s) in 0.0210 seconds
- drop_namespace
hbase(main):026:0> drop_namespace 'baizhi'
0 row(s) in 0.0460 seconds
- list_namespace
hbase(main):027:0> list_namespace
NAMESPACE
default
hbase
2 row(s) in 0.0480 seconds
- list_namespace_tables
hbase(main):025:0> list_namespace_tables 'hbase'
TABLE
meta
namespace
2 row(s) in 0.0280 seconds
DDL commands
Table-related operations
- Create a table: create
# 1. Syntax: create '<table>','<cf1>','<cf2>',...
# 2. Syntax: create '<namespace>:<table>',{NAME=>'<cf>',VERSIONS=>n}   n = maximum number of versions a cell may keep
hbase(main):002:0> create 't_user','cf1'
0 row(s) in 1.6240 seconds

=> Hbase::Table - t_user
hbase(main):003:0> create 't_order',{NAME=>'cf1',VERSIONS=>3}
0 row(s) in 1.3200 seconds

=> Hbase::Table - t_order
- List tables: list
hbase(main):004:0> list
TABLE
t_order
t_user
2 row(s) in 0.0510 seconds

=> ["t_order", "t_user"]
- Alter a table: alter
hbase(main):005:0* alter 't_user',NAME=>'cf1',TTL=>1800
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.3520 seconds

hbase(main):006:0> describe 't_user'
Table t_user is ENABLED
t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '1800 SECONDS (30 MINUTES)', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0340 seconds

hbase(main):008:0* alter 't_user',{NAME=>'cf1',VERSIONS=>3}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9960 seconds

hbase(main):009:0> describe 't_user'
Table t_user is ENABLED
t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '1800 SECONDS (30 MINUTES)', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0370 seconds
- Describe a table: describe
hbase(main):001:0> describe 't_user'
Table t_user is ENABLED
t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
- Disable a table: disable, disable_all
hbase(main):011:0* disable 't_user'
0 row(s) in 2.3580 seconds
- Drop a table: drop, drop_all
# A table must be disabled before it can be dropped
hbase(main):018:0> drop 't_user'

ERROR: Table t_user is enabled. Disable it first.

Here is some help for this command:
Drop the named table. Table must first be disabled:
  hbase> drop 't1'
  hbase> drop 'ns1:t1'

hbase(main):019:0> disable 't_user'
0 row(s) in 2.2830 seconds

hbase(main):020:0> drop 't_user'
0 row(s) in 1.3090 seconds
- Enable a table: enable, enable_all
hbase(main):017:0* enable 't_user'
0 row(s) in 1.3400 seconds
- Check whether a table exists: exists
hbase(main):021:0> exists 't_user'
Table t_user does not exist
0 row(s) in 0.0240 seconds

hbase(main):022:0> exists 't_order'
Table t_order does exist
0 row(s) in 0.0240 seconds
- Check whether a table is disabled/enabled: is_disabled, is_enabled
hbase(main):023:0> is_disabled 't_order'
false
0 row(s) in 0.0150 seconds
DML commands (important)
Create, read, update and delete operations on the data in a table
- Count rows: count
hbase(main):051:0> count 'default:t_order'
2 row(s) in 0.0570 seconds

=> 2
- Delete: delete, deleteall
# delete removes a single cell (a given column, optionally a specific version)
# deleteall removes an entire row
hbase(main):053:0> delete 't_order','order102','cf1:count'
0 row(s) in 0.0470 seconds

hbase(main):054:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
 cf1:name   timestamp=1566374163173, value=vivo
 cf1:name   timestamp=1566374139746, value=oppo
 cf1:name   timestamp=1566374045248, value=mix2s
3 row(s) in 0.0440 seconds

hbase(main):055:0> delete 't_order','order102','cf1:name',1566374045248
0 row(s) in 0.0340 seconds

hbase(main):056:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
 cf1:name   timestamp=1566374163173, value=vivo
 cf1:name   timestamp=1566374139746, value=oppo
2 row(s) in 0.0160 seconds

hbase(main):059:0> deleteall 't_order','order102'
0 row(s) in 0.0140 seconds

hbase(main):060:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
0 row(s) in 0.0220 seconds
- Get data: get
# get '<namespace>:<table>','<rowkey>'
# get '<namespace>:<table>','<rowkey>',{COLUMN=>'<cf>'}
# get '<namespace>:<table>','<rowkey>',{COLUMN=>'<cf>',VERSIONS=>n}

# Get all columns of a given row key
hbase(main):033:0* get 'default:t_order','order101'
COLUMN      CELL
 cf1:count  timestamp=1566373554307, value=2
 cf1:name   timestamp=1566373502504, value=iphone
 cf1:price  timestamp=1566373537106, value=1999
3 row(s) in 0.0560 seconds

hbase(main):034:0> get 'default:t_order','order102'
COLUMN      CELL
 cf1:count  timestamp=1566373582394, value=1
 cf1:name   timestamp=1566373615024, value=HUAWEI P30

# Get all columns of a given column family
hbase(main):037:0> get 'default:t_order','order102',{COLUMN=>'cf1'}

# Get multi-version data for all columns of a given column family
hbase(main):047:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
 cf1:count  timestamp=1566373582394, value=1
 cf1:name   timestamp=1566374163173, value=vivo
 cf1:name   timestamp=1566374139746, value=oppo
 cf1:name   timestamp=1566374045248, value=mix2s

# Get the cell data of a specific version
hbase(main):048:0> get 'default:t_order','order102',{COLUMN=>'cf1',TIMESTAMP=>1566374045248,VERSIONS=>3}
COLUMN      CELL
 cf1:name   timestamp=1566374045248, value=mix2s
1 row(s) in 0.0240 seconds
- Insert (or update) data: put
hbase(main):026:0* put 'default:t_order','order101','cf1:name','iphone'
0 row(s) in 0.1220 seconds

hbase(main):027:0> put 'default:t_order','order101','cf1:price',1999
0 row(s) in 0.0370 seconds

hbase(main):028:0> put 'default:t_order','order101','cf1:count',2
0 row(s) in 0.0330 seconds

hbase(main):029:0> put 'default:t_order','order102','cf1:count',1
0 row(s) in 0.0230 seconds

hbase(main):030:0> put 'default:t_order','order102','cf1:name','HUAWEI P30'
- Scan a table: scan
# Roughly "select everything"
hbase(main):063:0> scan 't_order'
ROW        COLUMN+CELL
 order101  column=cf1:count, timestamp=1566373554307, value=2
 order101  column=cf1:name, timestamp=1566373502504, value=iphone
 order101  column=cf1:price, timestamp=1566373537106, value=1999
 order103  column=cf1:name, timestamp=1566374793825, value=Apple Watch
- Truncate a table: truncate
Truncating removes all of the data in the table.
hbase(main):065:0> truncate 't_order'
Truncating 't_order' table (it may take a while):
 - Disabling table...
 - Truncating table...
0 row(s) in 3.5150 seconds

hbase(main):066:0> scan 't_order'
ROW        COLUMN+CELL
0 row(s) in 0.1550 seconds
Java API
Maven dependencies
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
Test code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
public class HBaseClientTest {
// Admin object (handles DDL operations)
private Admin admin;
// Connection object (handles DML operations)
private Connection connection;
@Before
public void doBefore() throws IOException {
// Configuration object
Configuration configuration = HBaseConfiguration.create();
// Declare the HBase connection parameters
// The cluster entry point is stored in ZooKeeper
configuration.set(HConstants.ZOOKEEPER_QUORUM, "hadoop:2181");
connection = ConnectionFactory.createConnection(configuration);
admin = connection.getAdmin();
}
/**
* Create a namespace
* @throws IOException
*/
@Test
public void testCreateNamespace() throws IOException {
NamespaceDescriptor namespaceDescriptor = NamespaceDescriptor.create("baizhi").addConfiguration("author", "gaozhy").build();
admin.createNamespace(namespaceDescriptor);
}
/**
* Create a table
*/
@Test
public void testCreateTable() throws IOException {
HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("baizhi:t_user"));
HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
cf1.setMaxVersions(5); // keep at most 5 versions per cell
HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
cf2.setTimeToLive(3600); // TTL = 1 hour (value is in seconds)
hTableDescriptor.addFamily(cf1);
hTableDescriptor.addFamily(cf2);
admin.createTable(hTableDescriptor);
}
/**
* Insert (or update) data:
* put command: put 'namespace:table','rowkey','cf1:name','value'
* @throws IOException
*/
@Test
public void testInsert() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
// Put put = new Put("user101".getBytes()); // rowkey
Put put = new Put(Bytes.toBytes("user103")); // HBase ships the Bytes utility class to simplify byte[] handling
put.addColumn(Bytes.toBytes("cf1"),Bytes.toBytes("name"),Bytes.toBytes("小胖子"));
table.put(put);
}
/**
* Get data:
* get command: get 'namespace:table','rowkey','cf:column'
*/
@Test
public void testSelect() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Get get = new Get("user101".getBytes());
// Fetch a single cell only
// get.addColumn("cf1".getBytes(),"name".getBytes());
// Fetch all columns of a given column family
// get.addFamily("cf1".getBytes())
Result result = table.get(get);
String name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")));
System.out.println(name);
}
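/**
 * Multi-version read (a sketch added here, not part of the original notes): the rough Java
 * equivalent of the shell's get ... {COLUMN=>'cf1',VERSIONS=>3}. It assumes 'baizhi:t_user'
 * was created with cf1 allowing several versions (see setMaxVersions above) and that the
 * rowkey "user103" has been written more than once.
 */
@Test
public void testSelectMultiVersion() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Get get = new Get(Bytes.toBytes("user103"));
get.setMaxVersions(3); // ask for up to 3 versions per cell
Result result = table.get(get);
// Iterate over every stored version of cf1:name, newest first
for (Cell cell : result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("name"))) {
System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
}
}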
/**
* Delete data:
* delete
* deleteall
*/
@Test
public void testDelete() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Delete delete = new Delete(Bytes.toBytes("user101"));
ArrayList<Delete> list = new ArrayList<Delete>();
list.add(delete);
table.delete(list);
}
/**
* Scan a table
* scan 'namespace:table'
*/
@Test
public void testScan() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Scan scan = new Scan();
// the start row is inclusive, the stop row is exclusive
scan.setStartRow(Bytes.toBytes("user101"));
scan.setStopRow(Bytes.toBytes("user103"));
ResultScanner rs = table.getScanner(scan);
Iterator<Result> iterator = rs.iterator();
while(iterator.hasNext()){
Result result = iterator.next();
String rowkey = Bytes.toString(result.getRow());
String name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")));
System.out.println(rowkey + " | " +name);
}
}
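/**
 * Scan with a server-side filter (a sketch added here, not part of the original notes):
 * besides start/stop rows, a Scan can carry a Filter. A PrefixFilter keeps only rowkeys
 * beginning with "user1"; the rowkeys are the hypothetical ones used above.
 */
@Test
public void testScanWithFilter() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Scan scan = new Scan();
// fully qualified here to avoid adding an import to the listing above
scan.setFilter(new org.apache.hadoop.hbase.filter.PrefixFilter(Bytes.toBytes("user1")));
for (Result result : table.getScanner(scan)) {
System.out.println(Bytes.toString(result.getRow()));
}
}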
@After
public void doAfter() throws IOException {
if(admin != null) admin.close();
if(connection != null) connection.close();
}
}
Homework
- Use HBase as the data store and implement create/read/update/delete for user information
- Preview tomorrow's material
3. HBase on MapReduce
Maven dependencies
<dependencies>
<!--mapreduce + hbase-->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.4</version>
</dependency>
</dependencies>
Test data
@Test
public void testInsertSampleData() throws IOException {
Table table = connection.getTable(TableName.valueOf("t_order"));
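// Rowkey layout of the sample orders: "<userId>:<orderTimestamp>", e.g. "1:20181010153020100";
// the mapper below splits on ':' to recover the userId, and the timestamp starts with the year.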
Put put1 = new Put(Bytes.toBytes("1:20181010153020100"));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(2500.0D));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("p20"));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put2 = new Put(Bytes.toBytes("2:20180510121011233"));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(199.0D));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("连衣裙"));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put3 = new Put(Bytes.toBytes("3:20180612111111111"));
put3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(999.9D));
put3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("小天鹅洗衣机"));
put3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put4 = new Put(Bytes.toBytes("1:20181212011011111"));
put4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(200.0D));
put4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("搓衣板"));
put4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put5 = new Put(Bytes.toBytes("1:20190206101010101"));
put5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(10D));
put5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("钢丝球"));
put5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put6 = new Put(Bytes.toBytes("2:20180306101010101"));
put6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(9.9D));
put6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("丝袜"));
put6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
ArrayList<Put> puts = new ArrayList<Put>();
puts.add(put1);
puts.add(put2);
puts.add(put3);
puts.add(put4);
puts.add(put5);
puts.add(put6);
table.put(puts);
}
Create the input table
@Test
public void testCreateOrderTable() throws IOException {
boolean exists = admin.tableExists(TableName.valueOf("t_order"));
if (exists) {
admin.disableTable(TableName.valueOf("t_order"));
admin.deleteTable(TableName.valueOf("t_order"));
}
HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("t_order"));
HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
hTableDescriptor.addFamily(cf1);
admin.createTable(hTableDescriptor);
}
Custom Mapper
package com.baizhi;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @Author:Gaozhy
*/
public class OrderMapper extends TableMapper<Text, DoubleWritable> {
/**
* @param key rowkey
* @param result one row read from the HBase table
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void map(ImmutableBytesWritable key, Result result, Context context) throws IOException, InterruptedException {
String rowkey = Bytes.toString(key.get());
String userId = rowkey.split(":")[0];
double money = Bytes.toDouble(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("money")));
context.write(new Text(userId), new DoubleWritable(money));
}
}
Custom Reducer
package com.baizhi;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.Iterator;
public class OrderReducer extends TableReducer<Text, DoubleWritable, NullWritable>{
/**
* @param key userId
* @param values the order amounts of this user for the year
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
Double sum = 0.0D;
Iterator<DoubleWritable> iterator = values.iterator();
while (iterator.hasNext()) {
sum += iterator.next().get();
}
// Output rowkey has the form "<userId>:2018", e.g. "1:2018"
Put put = new Put((key.toString() + ":2018").getBytes());
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(sum));
context.write(null, put);
}
}
Driver class (job setup)
package com.baizhi;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import java.io.IOException;
public class OrderComputeApplication {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration configuration = HBaseConfiguration.create();
configuration.set(HConstants.ZOOKEEPER_QUORUM, "hadoop:2181");
Job job = Job.getInstance(configuration, "order compute");
job.setJarByClass(OrderComputeApplication.class);
job.setInputFormatClass(TableInputFormat.class);
job.setOutputFormatClass(TableOutputFormat.class);
// Map-side setup
Scan scan = new Scan();
// Compute the 2018 annual bill
// Regular expression keeping only rowkeys that match ^.*:2018.*$
RowFilter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^.*:2018.*$"));
scan.setFilter(filter);
// Wire the table mapper and table reducer into the job
TableMapReduceUtil.initTableMapperJob(TableName.valueOf("t_order"), scan, OrderMapper.class, Text.class, DoubleWritable.class, job);
TableMapReduceUtil.initTableReducerJob("t_result", OrderReducer.class, job);
job.waitForCompletion(true);
}
}
Run the computation locally and check the result
@Test
public void testGetOrderTotal() throws IOException {
Table table = connection.getTable(TableName.valueOf("t_result"));
Result result = table.get(new Get(Bytes.toBytes("2:2018")));
double total = Bytes.toDouble(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("total")));
System.out.println("Total spent by user 2 in 2018: " + total);
}
Remote execution
Once development is finished, the HBase-on-MapReduce application is run on a remote YARN cluster.
Package the application as a JAR
Runtime dependencies
When it runs on the YARN cluster, the application needs third-party JARs (the HBase libraries) on its classpath.
Solutions
- Copy the JARs the HBase application depends on into share/hadoop/yarn/lib, or
- Configure the HADOOP_CLASSPATH environment variable:
[root@hadoop ~]# vi .bashrc
# Append the path of the third-party dependencies at the end of the file
export HADOOP_CLASSPATH=/usr/hbase-1.2.4/lib/*
[root@hadoop ~]# source .bashrc
4. HBase Fully Distributed Cluster
Prerequisites
- Start the fully distributed Hadoop cluster built earlier
- The ZooKeeper cluster is running normally
- The HDFS cluster is running normally
Environment Setup
- Clock synchronization
Note: the maximum allowed clock skew between HBase cluster nodes is 30 s; if the skew is larger, the nodes must be synchronized first.
[root@nodex ~]# date
2019年 08月 20日 星期二 17:13:53 CST
[root@nodex ~]# date -s '2019-08-22 15:49:00'
2019年 08月 22日 星期四 15:49:00 CST
[root@nodex ~]# date
2019年 08月 22日 星期四 15:49:03 CST
[root@nodex ~]# clock -w
- Distribute the HBase installation package
[root@node1 ~]# scp hbase-1.2.4-bin.tar.gz root@node2:~
hbase-1.2.4-bin.tar.gz                     100%   74MB 100.8MB/s   00:00
[root@node1 ~]# scp hbase-1.2.4-bin.tar.gz root@node3:~
hbase-1.2.4-bin.tar.gz
- Extract and install HBase
[root@nodex ~]# tar -zxf hbase-1.2.4-bin.tar.gz -C /usr
- Edit the configuration file hbase-site.xml
[root@nodex ~]# vi /usr/hbase-1.2.4/conf/hbase-site.xml
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://mycluster/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
- Edit the configuration file regionservers
[root@nodex ~]# vi /usr/hbase-1.2.4/conf/regionservers
node1
node2
node3
- Edit the per-user environment file .bashrc
[root@nodex ~]# vi .bashrc
HBASE_MANAGES_ZK=false
HBASE_HOME=/usr/hbase-1.2.4
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export HBASE_HOME
export HBASE_MANAGES_ZK
[root@nodex ~]# source .bashrc
Start the services
- Start the HMaster
[root@nodex ~]# hbase-daemon.sh start master
- Start the HRegionServer
[root@nodex ~]# hbase-daemon.sh start regionserver
Verify the result
5. HBase Architecture in Detail
HBase uses a master/slave architecture and sits inside the Hadoop ecosystem. A cluster is made up of HMaster nodes, HRegionServer nodes and a ZooKeeper ensemble; underneath, data is stored in HDFS, which brings the HDFS NameNode and DataNode into the picture. The responsibilities of each component are outlined below:
The HMaster node:
- Manages the HRegionServers and balances load across them;
- Manages and assigns HRegions, for example assigning the new HRegions produced by a region split;
- Migrates the HRegions of an HRegionServer to other HRegionServers when that server goes away;
- Performs DDL operations (Data Definition Language: creating/altering/dropping namespaces and tables, adding/changing/removing column families, etc.);
- Manages namespace and table metadata (which is actually stored on HDFS);
- Enforces access control (ACLs).
The HRegionServer nodes:
- Host and manage the local HRegions;
- Read from and write to HDFS, managing the data in the tables;
- Serve client reads and writes directly (the client looks up metadata to locate the HRegion/HRegionServer that holds a RowKey, then talks to that HRegionServer rather than routing requests through the HMaster).
The ZooKeeper ensemble:
- Stores the HBase cluster's metadata entry point and cluster state information;
- Provides failover between the active and backup HMaster nodes.
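Because clients locate data themselves, an application only needs the ZooKeeper quorum to find out which region server holds a given rowkey. Below is a minimal sketch (assuming the t_order table and the hadoop:2181 quorum used earlier in these notes) that asks the cluster where a row lives:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRegionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The only address the client needs: the ZooKeeper quorum
        conf.set(HConstants.ZOOKEEPER_QUORUM, "hadoop:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("t_order"))) {
            // Ask which HRegion / HRegionServer currently serves this rowkey
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("order101"));
            System.out.println("region    : " + location.getRegionInfo().getRegionNameAsString());
            System.out.println("served by : " + location.getHostnamePort());
        }
    }
}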