1. Overview
http://hbase.apache.org
Apache HBase is a reliable, distributed database built on Hadoop, well suited to storing structured big data.
Apache HBase is an open-source implementation of Google BigTable: an open-source, distributed, multi-versioned, column-family-oriented, non-relational database. HBase is built on top of Hadoop's HDFS.
Column storage vs. row storage
Column storage and row storage refer to how data is organized on the storage medium.
Row-oriented (relational) databases: Oracle, MySQL, DB2, SQL Server, etc.
Column-oriented (non-relational) stores: HBase, Druid, Vertica, Infobright, etc.
HBase Data Model
- Row key: rowkey, the unique identifier used to look up a row. Row keys cannot repeat, rows are kept sorted lexicographically by rowkey, and the key is stored as byte[] underneath.
- Column family: column family (cf), a group of columns; a column family usually holds a set of columns that are functionally or business-wise related.
- Column: column, a field inside a column family, used to store one category of data.
- Cell: rowkey + column family + column together locate a cell; a cell may keep multiple versions of its data, 1 by default.
- Multi-version: a cell is allowed to keep multiple versions of its data.
- Version number: the timestamp at the time of the write; by default the cell value with the newest timestamp is returned to the user.
Characteristics
- Big: a single table can hold tens of billions of rows and millions of columns.
- Column-oriented: storage and access control are organized per column (family), and column (families) can be retrieved independently.
- Sparse: columns that are NULL take up no storage space, so tables can be designed to be extremely sparse.
- Schema-free: every row has a sortable row key and an arbitrary number of columns; columns can be added dynamically as needed, and different rows in the same table can have completely different columns.
- Multi-versioned: each cell can hold multiple versions of its data; by default the version number is assigned automatically and is the timestamp at which the cell was written.
- Single data type: at the lowest level, all data in HBase is stored as byte[], so in principle any type of data can be stored.
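Because every value ultimately becomes a byte[], the client is responsible for encoding and decoding. Below is a minimal sketch using the Bytes helper class from hbase-common (the same utility the Java examples later in these notes rely on); the values here are made up for illustration:

import org.apache.hadoop.hbase.util.Bytes;

public class BytesDemo {
    public static void main(String[] args) {
        // Encode typed values into the byte[] form that HBase actually stores
        byte[] rawName  = Bytes.toBytes("iphone");  // String -> byte[]
        byte[] rawPrice = Bytes.toBytes(1999.0D);   // double -> byte[]
        byte[] rawCount = Bytes.toBytes(2);         // int    -> byte[]

        // Decode them back; the caller has to remember the original type
        String name  = Bytes.toString(rawName);
        double price = Bytes.toDouble(rawPrice);
        int count    = Bytes.toInt(rawCount);
        System.out.println(name + " " + price + " " + count);
    }
}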
2. Basic Usage
Environment Setup
Pseudo-distributed cluster
Prerequisites
- Make sure the HDFS cluster is running normally
- Make sure the ZooKeeper cluster is running normally
[root@hadoop ~]# jps
83920 SecondaryNameNode
83602 NameNode
83698 DataNode
2548 QuorumPeerMain
Installation and configuration
[root@hadoop ~]# tar -zxf hbase-1.2.4-bin.tar.gz -C /usr
- conf/hbase-site.xml
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop:9000/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
- conf/regionservers
hadoop
Update the environment variables
[root@hadoop hbase-1.2.4]# vi ~/.bashrc
# Remove the earlier environment variable settings and add the following
HBASE_MANAGES_ZK=false
HBASE_HOME=/usr/hbase-1.2.4
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export HBASE_HOME
export HBASE_MANAGES_ZK
[root@hadoop hbase-1.2.4]# source ~/.bashrc
Start the service
[root@hadoop hbase-1.2.4]# start-hbase.sh
Verify that the HBase service is healthy
Option 1:
[root@hadoop hbase-1.2.4]# jps
83920 SecondaryNameNode
85104 HRegionServer # HBase worker (slave) node
84963 HMaster # HBase master node
83602 NameNode
83698 DataNode
2548 QuorumPeerMain
85516 Jps
Option 2:
http://hadoop:16010/master-status
Shell Commands
Connect to the HBase server with the command-line client
[root@hadoop hbase-1.2.4]# hbase shell
In the shell, use the help command to view usage information
hbase(main):007:0> help "get"     # help "<command>" shows the help for a single command
hbase(main):007:0> help "general" # help "<command group>" shows the help for the commands in a group
COMMAND GROUPS:
Group name: general # general-purpose commands
Commands: status, table_help, version, whoami
Group name: ddl # table-definition commands
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, locate_region, show_filters
Group name: namespace # similar to a database in MySQL; organizes and manages tables
Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Group name: dml # CRUD operations on data
Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve
General commands
- status
hbase(main):010:0* status
1 active master, 0 backup masters, 1 servers, 0 dead, 2.0000 average load
- version
hbase(main):013:0* version
1.2.4, rUnknown, Wed Feb 15 18:58:00 CST 2017
- whoami
hbase(main):014:0> whoami
root (auth:SIMPLE)
    groups: root
Namespace commands
A namespace is very similar to a database in MySQL: it is used to organize and manage HBase tables. HBase ships with a default namespace named default.
- alter_namespace
hbase(main):022:0* alter_namespace 'baizhi',{METHOD=>'set', 'AUTHOR'=>'GAOZHY'}
0 row(s) in 0.0440 seconds
- create_namespace
hbase(main):017:0> create_namespace 'baizhi'
0 row(s) in 0.0690 seconds
- describe_namespace
hbase(main):019:0> describe_namespace 'baizhi'
DESCRIPTION
{NAME => 'baizhi'}
1 row(s) in 0.0210 seconds
- drop_namespace
hbase(main):026:0> drop_namespace 'baizhi'
0 row(s) in 0.0460 seconds
- list_namespace
hbase(main):027:0> list_namespace
NAMESPACE
default
hbase
2 row(s) in 0.0480 seconds
- list_namespace_tables
hbase(main):025:0> list_namespace_tables 'hbase'
TABLE
meta
namespace
2 row(s) in 0.0280 seconds
DDL commands
Table-related operations
- Create a table: create
# 1. Syntax: create '<table>','<cf1>','<cf2>',...
# 2. Syntax: create '<namespace>:<table>',{NAME=>'<cf>',VERSIONS=>n}   n = maximum number of versions a cell may keep
hbase(main):002:0> create 't_user','cf1'
0 row(s) in 1.6240 seconds

=> Hbase::Table - t_user
hbase(main):003:0> create 't_order',{NAME=>'cf1',VERSIONS=>3}
0 row(s) in 1.3200 seconds

=> Hbase::Table - t_order
- List tables: list
hbase(main):004:0> list
TABLE
t_order
t_user
2 row(s) in 0.0510 seconds

=> ["t_order", "t_user"]
- Alter a table: alter
hbase(main):005:0* alter 't_user',NAME=>'cf1',TTL=>1800
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 2.3520 seconds

hbase(main):006:0> describe 't_user'
Table t_user is ENABLED
t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '1800 SECONDS (30 MINUTES)', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0340 seconds

hbase(main):008:0* alter 't_user',{NAME=>'cf1',VERSIONS=>3}
Updating all regions with the new schema...
1/1 regions updated.
Done.
0 row(s) in 1.9960 seconds

hbase(main):009:0> describe 't_user'
Table t_user is ENABLED
t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '3', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '1800 SECONDS (30 MINUTES)', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.0370 seconds
- Describe a table: describe
hbase(main):001:0> describe 't_user'
Table t_user is ENABLED
t_user
COLUMN FAMILIES DESCRIPTION
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
- Disable a table: disable, disable_all
hbase(main):011:0* disable 't_user'
0 row(s) in 2.3580 seconds
- Drop a table: drop, drop_all
# A table must be disabled before it can be dropped
hbase(main):018:0> drop 't_user'

ERROR: Table t_user is enabled. Disable it first.

Here is some help for this command:
Drop the named table. Table must first be disabled:
  hbase> drop 't1'
  hbase> drop 'ns1:t1'

hbase(main):019:0> disable 't_user'
0 row(s) in 2.2830 seconds

hbase(main):020:0> drop 't_user'
0 row(s) in 1.3090 seconds
- Enable a table: enable, enable_all
hbase(main):017:0* enable 't_user'
0 row(s) in 1.3400 seconds
- Check whether a table exists: exists
hbase(main):021:0> exists 't_user'
Table t_user does not exist
0 row(s) in 0.0240 seconds

hbase(main):022:0> exists 't_order'
Table t_order does exist
0 row(s) in 0.0240 seconds
- Check whether a table is disabled/enabled: is_disabled, is_enabled
hbase(main):023:0> is_disabled 't_order'
false
0 row(s) in 0.0150 seconds
DML commands (important)
Create, read, update and delete operations on the data in a table
- Count rows: count
hbase(main):051:0> count 'default:t_order'
2 row(s) in 0.0570 seconds

=> 2
- Delete: delete, deleteall
# delete removes a single cell (a given column, optionally a specific version)
# deleteall removes an entire row
hbase(main):053:0> delete 't_order','order102','cf1:count'
0 row(s) in 0.0470 seconds

hbase(main):054:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
 cf1:name   timestamp=1566374163173, value=vivo
 cf1:name   timestamp=1566374139746, value=oppo
 cf1:name   timestamp=1566374045248, value=mix2s
3 row(s) in 0.0440 seconds

hbase(main):055:0> delete 't_order','order102','cf1:name',1566374045248
0 row(s) in 0.0340 seconds

hbase(main):056:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
 cf1:name   timestamp=1566374163173, value=vivo
 cf1:name   timestamp=1566374139746, value=oppo
2 row(s) in 0.0160 seconds

hbase(main):059:0> deleteall 't_order','order102'
0 row(s) in 0.0140 seconds

hbase(main):060:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
0 row(s) in 0.0220 seconds
- Get data: get
# get '<namespace>:<table>','<rowkey>'
# get '<namespace>:<table>','<rowkey>',{COLUMN=>'<cf>'}
# get '<namespace>:<table>','<rowkey>',{COLUMN=>'<cf>',VERSIONS=>n}

# Get all columns of a given row key
hbase(main):033:0* get 'default:t_order','order101'
COLUMN      CELL
 cf1:count  timestamp=1566373554307, value=2
 cf1:name   timestamp=1566373502504, value=iphone
 cf1:price  timestamp=1566373537106, value=1999
3 row(s) in 0.0560 seconds

hbase(main):034:0> get 'default:t_order','order102'
COLUMN      CELL
 cf1:count  timestamp=1566373582394, value=1
 cf1:name   timestamp=1566373615024, value=HUAWEI P30

# Get all columns of a given column family
hbase(main):037:0> get 'default:t_order','order102',{COLUMN=>'cf1'}

# Get multi-version data for all columns of a given column family
hbase(main):047:0> get 'default:t_order','order102',{COLUMN=>'cf1',VERSIONS=>3}
COLUMN      CELL
 cf1:count  timestamp=1566373582394, value=1
 cf1:name   timestamp=1566374163173, value=vivo
 cf1:name   timestamp=1566374139746, value=oppo
 cf1:name   timestamp=1566374045248, value=mix2s

# Get the cell data of a specific version
hbase(main):048:0> get 'default:t_order','order102',{COLUMN=>'cf1',TIMESTAMP=>1566374045248,VERSIONS=>3}
COLUMN      CELL
 cf1:name   timestamp=1566374045248, value=mix2s
1 row(s) in 0.0240 seconds
- Insert (or update) data: put
hbase(main):026:0* put 'default:t_order','order101','cf1:name','iphone'
0 row(s) in 0.1220 seconds

hbase(main):027:0> put 'default:t_order','order101','cf1:price',1999
0 row(s) in 0.0370 seconds

hbase(main):028:0> put 'default:t_order','order101','cf1:count',2
0 row(s) in 0.0330 seconds

hbase(main):029:0> put 'default:t_order','order102','cf1:count',1
0 row(s) in 0.0230 seconds

hbase(main):030:0> put 'default:t_order','order102','cf1:name','HUAWEI P30'
- Scan a table: scan
# Roughly "select everything"
hbase(main):063:0> scan 't_order'
ROW        COLUMN+CELL
 order101  column=cf1:count, timestamp=1566373554307, value=2
 order101  column=cf1:name, timestamp=1566373502504, value=iphone
 order101  column=cf1:price, timestamp=1566373537106, value=1999
 order103  column=cf1:name, timestamp=1566374793825, value=Apple Watch
- Truncate a table: truncate
Truncating removes all of the data in the table.
hbase(main):065:0> truncate 't_order'
Truncating 't_order' table (it may take a while):
 - Disabling table...
 - Truncating table...
0 row(s) in 3.5150 seconds

hbase(main):066:0> scan 't_order'
ROW        COLUMN+CELL
0 row(s) in 0.1550 seconds
Java API
Maven dependencies
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-protocol</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.4</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
</dependency>
Test code
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
public class HBaseClientTest {
// Admin object (handles DDL operations)
private Admin admin;
// Connection object (handles DML operations)
private Connection connection;
@Before
public void doBefore() throws IOException {
// Configuration object
Configuration configuration = HBaseConfiguration.create();
// Declare the HBase connection parameters
// The cluster entry point is stored in ZooKeeper
configuration.set(HConstants.ZOOKEEPER_QUORUM, "hadoop:2181");
connection = ConnectionFactory.createConnection(configuration);
admin = connection.getAdmin();
}
/**
* Create a namespace
* @throws IOException
*/
@Test
public void testCreateNamespace() throws IOException {
NamespaceDescriptor namespaceDescriptor = NamespaceDescriptor.create("baizhi").addConfiguration("author", "gaozhy").build();
admin.createNamespace(namespaceDescriptor);
}
/**
* Create a table
*/
@Test
public void testCreateTable() throws IOException {
HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("baizhi:t_user"));
HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
cf1.setMaxVersions(5); // keep at most 5 versions per cell
HColumnDescriptor cf2 = new HColumnDescriptor("cf2");
cf2.setTimeToLive(3600); // TTL = 1 hour (value is in seconds)
hTableDescriptor.addFamily(cf1);
hTableDescriptor.addFamily(cf2);
admin.createTable(hTableDescriptor);
}
/**
* Insert (or update) data:
* put command: put 'namespace:table','rowkey','cf1:name','value'
* @throws IOException
*/
@Test
public void testInsert() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
// Put put = new Put("user101".getBytes()); // rowkey
Put put = new Put(Bytes.toBytes("user103")); // HBase ships the Bytes utility class to simplify byte[] handling
put.addColumn(Bytes.toBytes("cf1"),Bytes.toBytes("name"),Bytes.toBytes("小胖子"));
table.put(put);
}
/**
* Get data:
* get command: get 'namespace:table','rowkey','cf:column'
*/
@Test
public void testSelect() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Get get = new Get("user101".getBytes());
// Fetch a single cell only
// get.addColumn("cf1".getBytes(),"name".getBytes());
// Fetch all columns of a given column family
// get.addFamily("cf1".getBytes())
Result result = table.get(get);
String name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")));
System.out.println(name);
}
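/**
 * Multi-version read (a sketch added here, not part of the original notes): the rough Java
 * equivalent of the shell's get ... {COLUMN=>'cf1',VERSIONS=>3}. It assumes 'baizhi:t_user'
 * was created with cf1 allowing several versions (see setMaxVersions above) and that the
 * rowkey "user103" has been written more than once.
 */
@Test
public void testSelectMultiVersion() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Get get = new Get(Bytes.toBytes("user103"));
get.setMaxVersions(3); // ask for up to 3 versions per cell
Result result = table.get(get);
// Iterate over every stored version of cf1:name, newest first
for (Cell cell : result.getColumnCells(Bytes.toBytes("cf1"), Bytes.toBytes("name"))) {
System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
}
}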
/**
* Delete data:
* delete
* deleteall
*/
@Test
public void testDelete() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Delete delete = new Delete(Bytes.toBytes("user101"));
ArrayList<Delete> list = new ArrayList<Delete>();
list.add(delete);
table.delete(list);
}
/**
* Scan a table
* scan 'namespace:table'
*/
@Test
public void testScan() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Scan scan = new Scan();
// the start row is inclusive, the stop row is exclusive
scan.setStartRow(Bytes.toBytes("user101"));
scan.setStopRow(Bytes.toBytes("user103"));
ResultScanner rs = table.getScanner(scan);
Iterator<Result> iterator = rs.iterator();
while(iterator.hasNext()){
Result result = iterator.next();
String rowkey = Bytes.toString(result.getRow());
String name = Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("name")));
System.out.println(rowkey + " | " +name);
}
}
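/**
 * Scan with a server-side filter (a sketch added here, not part of the original notes):
 * besides start/stop rows, a Scan can carry a Filter. A PrefixFilter keeps only rowkeys
 * beginning with "user1"; the rowkeys are the hypothetical ones used above.
 */
@Test
public void testScanWithFilter() throws IOException {
Table table = connection.getTable(TableName.valueOf("baizhi:t_user"));
Scan scan = new Scan();
// fully qualified here to avoid adding an import to the listing above
scan.setFilter(new org.apache.hadoop.hbase.filter.PrefixFilter(Bytes.toBytes("user1")));
for (Result result : table.getScanner(scan)) {
System.out.println(Bytes.toString(result.getRow()));
}
}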
@After
public void doAfter() throws IOException {
if(admin != null) admin.close();
if(connection != null) connection.close();
}
}
Homework
- Use HBase as the data store and implement create/read/update/delete for user information
- Preview tomorrow's material
3. HBase on MapReduce
Maven dependencies
<dependencies>
<!--mapreduce + hbase-->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-core</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-jobclient</artifactId>
<version>2.6.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.2.4</version>
</dependency>
</dependencies>
Test data
@Test
public void testInsertSampleData() throws IOException {
Table table = connection.getTable(TableName.valueOf("t_order"));
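// Rowkey layout of the sample orders: "<userId>:<orderTimestamp>", e.g. "1:20181010153020100";
// the mapper below splits on ':' to recover the userId, and the timestamp starts with the year.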
Put put1 = new Put(Bytes.toBytes("1:20181010153020100"));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(2500.0D));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("p20"));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put2 = new Put(Bytes.toBytes("2:20180510121011233"));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(199.0D));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("连衣裙"));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put3 = new Put(Bytes.toBytes("3:20180612111111111"));
put3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(999.9D));
put3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("小天鹅洗衣机"));
put3.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put4 = new Put(Bytes.toBytes("1:20181212011011111"));
put4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(200.0D));
put4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("搓衣板"));
put4.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put5 = new Put(Bytes.toBytes("1:20190206101010101"));
put5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(10D));
put5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("钢丝球"));
put5.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
Put put6 = new Put(Bytes.toBytes("2:20180306101010101"));
put6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("money"), Bytes.toBytes(9.9D));
put6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("product"), Bytes.toBytes("丝袜"));
put6.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("count"), Bytes.toBytes(1));
ArrayList<Put> puts = new ArrayList<Put>();
puts.add(put1);
puts.add(put2);
puts.add(put3);
puts.add(put4);
puts.add(put5);
puts.add(put6);
table.put(puts);
}
Create the input table
@Test
public void testCreateOrderTable() throws IOException {
boolean exists = admin.tableExists(TableName.valueOf("t_order"));
if (exists) {
admin.disableTable(TableName.valueOf("t_order"));
admin.deleteTable(TableName.valueOf("t_order"));
}
HTableDescriptor hTableDescriptor = new HTableDescriptor(TableName.valueOf("t_order"));
HColumnDescriptor cf1 = new HColumnDescriptor("cf1");
hTableDescriptor.addFamily(cf1);
admin.createTable(hTableDescriptor);
}
Custom Mapper
package com.baizhi;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* @Author:Gaozhy
*/
public class OrderMapper extends TableMapper<Text, DoubleWritable> {
/**
* @param key rowkey
* @param result one row read from the HBase table
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void map(ImmutableBytesWritable key, Result result, Context context) throws IOException, InterruptedException {
String rowkey = Bytes.toString(key.get());
String userId = rowkey.split(":")[0];
double money = Bytes.toDouble(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("money")));
context.write(new Text(userId), new DoubleWritable(money));
}
}
Custom Reducer
package com.baizhi;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.Iterator;
public class OrderReducer extends TableReducer<Text, DoubleWritable, NullWritable>{
/**
* @param key userId
* @param values the order amounts of this user for the year
* @param context
* @throws IOException
* @throws InterruptedException
*/
@Override
protected void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
Double sum = 0.0D;
Iterator<DoubleWritable> iterator = values.iterator();
while (iterator.hasNext()) {
sum += iterator.next().get();
}
// Output rowkey has the form "<userId>:2018", e.g. "1:2018"
Put put = new Put((key.toString() + ":2018").getBytes());
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("total"), Bytes.toBytes(sum));
context.write(null, put);
}
}
Driver class (job setup)
package com.baizhi;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import java.io.IOException;
public class OrderComputeApplication {
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration configuration = HBaseConfiguration.create();
configuration.set(HConstants.ZOOKEEPER_QUORUM, "hadoop:2181");
Job job = Job.getInstance(configuration, "order compute");
job.setJarByClass(OrderComputeApplication.class);
job.setInputFormatClass(TableInputFormat.class);
job.setOutputFormatClass(TableOutputFormat.class);
// Map-side setup
Scan scan = new Scan();
// Compute the 2018 annual bill
// Regular expression keeping only rowkeys that match ^.*:2018.*$
RowFilter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^.*:2018.*$"));
scan.setFilter(filter);
// Wire the table mapper and table reducer into the job
TableMapReduceUtil.initTableMapperJob(TableName.valueOf("t_order"), scan, OrderMapper.class, Text.class, DoubleWritable.class, job);
TableMapReduceUtil.initTableReducerJob("t_result", OrderReducer.class, job);
job.waitForCompletion(true);
}
}
Run the computation locally and check the result
@Test
public void testGetOrderTotal() throws IOException {
Table table = connection.getTable(TableName.valueOf("t_result"));
Result result = table.get(new Get(Bytes.toBytes("2:2018")));
double total = Bytes.toDouble(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("total")));
System.out.println("Total spent by user 2 in 2018: " + total);
}
Remote execution
Once development is finished, the HBase-on-MapReduce application is run on a remote YARN cluster.
Package the application as a JAR
Runtime dependencies
When it runs on the YARN cluster, the application needs third-party JARs (the HBase libraries) on its classpath.
Solutions
- Copy the JARs the HBase application depends on into share/hadoop/yarn/lib, or
- Configure the HADOOP_CLASSPATH environment variable:
[root@hadoop ~]# vi .bashrc
# Append the path of the third-party dependencies at the end of the file
export HADOOP_CLASSPATH=/usr/hbase-1.2.4/lib/*
[root@hadoop ~]# source .bashrc
4. HBase Fully Distributed Cluster
Prerequisites
- Start the fully distributed Hadoop cluster built earlier
- The ZooKeeper cluster is running normally
- The HDFS cluster is running normally
Environment Setup
- Clock synchronization
Note: the maximum allowed clock skew between HBase cluster nodes is 30 s; if the skew is larger, the nodes must be synchronized first.
[root@nodex ~]# date
2019年 08月 20日 星期二 17:13:53 CST
[root@nodex ~]# date -s '2019-08-22 15:49:00'
2019年 08月 22日 星期四 15:49:00 CST
[root@nodex ~]# date
2019年 08月 22日 星期四 15:49:03 CST
[root@nodex ~]# clock -w
- Distribute the HBase installation package
[root@node1 ~]# scp hbase-1.2.4-bin.tar.gz root@node2:~
hbase-1.2.4-bin.tar.gz                     100%   74MB 100.8MB/s   00:00
[root@node1 ~]# scp hbase-1.2.4-bin.tar.gz root@node3:~
hbase-1.2.4-bin.tar.gz
- Extract and install HBase
[root@nodex ~]# tar -zxf hbase-1.2.4-bin.tar.gz -C /usr
- Edit the configuration file hbase-site.xml
[root@nodex ~]# vi /usr/hbase-1.2.4/conf/hbase-site.xml
<property>
    <name>hbase.rootdir</name>
    <value>hdfs://mycluster/hbase</value>
</property>
<property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
</property>
<property>
    <name>hbase.zookeeper.quorum</name>
    <value>node1,node2,node3</value>
</property>
<property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
</property>
- Edit the configuration file regionservers
[root@nodex ~]# vi /usr/hbase-1.2.4/conf/regionservers
node1
node2
node3
- Edit the per-user environment file .bashrc
[root@nodex ~]# vi .bashrc
HBASE_MANAGES_ZK=false
HBASE_HOME=/usr/hbase-1.2.4
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
export HBASE_HOME
export HBASE_MANAGES_ZK
[root@nodex ~]# source .bashrc
Start the services
- Start the HMaster
[root@nodex ~]# hbase-daemon.sh start master
- Start the HRegionServer
[root@nodex ~]# hbase-daemon.sh start regionserver
Verify the result
5. HBase Architecture in Detail
HBase uses a master/slave architecture and sits inside the Hadoop ecosystem. A cluster is made up of HMaster nodes, HRegionServer nodes and a ZooKeeper ensemble; underneath, data is stored in HDFS, which brings the HDFS NameNode and DataNode into the picture. The responsibilities of each component are outlined below:
The HMaster node:
- Manages the HRegionServers and balances load across them;
- Manages and assigns HRegions, for example assigning the new HRegions produced by a region split;
- Migrates the HRegions of an HRegionServer to other HRegionServers when that server goes away;
- Performs DDL operations (Data Definition Language: creating/altering/dropping namespaces and tables, adding/changing/removing column families, etc.);
- Manages namespace and table metadata (which is actually stored on HDFS);
- Enforces access control (ACLs).
The HRegionServer nodes:
- Host and manage the local HRegions;
- Read from and write to HDFS, managing the data in the tables;
- Serve client reads and writes directly (the client looks up metadata to locate the HRegion/HRegionServer that holds a RowKey, then talks to that HRegionServer rather than routing requests through the HMaster).
The ZooKeeper ensemble:
- Stores the HBase cluster's metadata entry point and cluster state information;
- Provides failover between the active and backup HMaster nodes.
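Because clients locate data themselves, an application only needs the ZooKeeper quorum to find out which region server holds a given rowkey. Below is a minimal sketch (assuming the t_order table and the hadoop:2181 quorum used earlier in these notes) that asks the cluster where a row lives:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRegionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The only address the client needs: the ZooKeeper quorum
        conf.set(HConstants.ZOOKEEPER_QUORUM, "hadoop:2181");
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("t_order"))) {
            // Ask which HRegion / HRegionServer currently serves this rowkey
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("order101"));
            System.out.println("region    : " + location.getRegionInfo().getRegionNameAsString());
            System.out.println("served by : " + location.getHostnamePort());
        }
    }
}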