HBase Study Notes

俬倴desiben

Last modified 2023-05-30 09:51:32

Tags: study notes, database

First published 2023-05-30 09:36:04

Article link: https://blog.csdn.net/qq_63720568/article/details/130941626

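Linear, modular scalability
Strictly consistent reads and writes
Automatic, configurable sharding (splitting) of tables
Automatic failover between RegionServers
Query predicate push-down via server-side filters
Column-oriented database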

-----------------------------------------------------------------------------------------------

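HBase storage model
Storage is column-oriented (grouped by column family), and the rows of a table are kept sorted by row key
A table is a collection of rows, a row is a collection of column families, a column family is a collection of columns, and a column is a collection of key-value pairs, each carrying a data version (timestamp)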
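
One way to picture this nesting is as sorted maps inside sorted maps. The sketch below is plain Java with no HBase dependency. The class `DataModelSketch` and its methods are made up for illustration; they only mimic the logical layout and are not how HBase stores data internally.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical illustration of the logical model:
// row key -> column family -> column -> timestamp -> value.
final class DataModelSketch {
    // Rows are kept sorted by row key, like the rows of an HBase table.
    private final TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> rows =
            new TreeMap<>();

    void put(String row, String family, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            // versions are ordered newest first
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Returns the newest version of a cell, or null if the cell does not exist.
    String getLatest(String row, String family, String column) {
        TreeMap<String, TreeMap<String, TreeMap<Long, String>>> families = rows.get(row);
        if (families == null) return null;
        TreeMap<String, TreeMap<Long, String>> columns = families.get(family);
        if (columns == null) return null;
        TreeMap<Long, String> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) return null;
        return versions.firstEntry().getValue();
    }

    // The smallest row key, i.e. the first row a full scan would return.
    String firstRowKey() {
        return rows.firstKey();
    }
}
```

Writing two versions of `r1/f1:name` and then calling `getLatest` returns the value with the larger timestamp, which is the behaviour a plain `get` shows in the shell.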

----------------------------------------------------------------------------------------------

HBase installation
Unpack the archive and set the environment variables
HBASE_HOME=/usr/local/hbase-1.2.7
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$MONGODB_HOME/bin:$HBASE_HOME/bin
export JAVA_HOME HADOOP_HOME HADOOP_CONF_DIR MONGODB_HOME HBASE_HOME PATH

-------------
vi hbase-env.sh
export JAVA_HOME=/usr/local/jdk1.8.0_192
export HBASE_MANAGES_ZK=false

vi hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://hdp12/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>node3:2181,node4:2181,node5:2181</value>
</property>
</configuration>

vi regionservers
node1
node2
node3
node4
node5

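Start HBase on the master node
start-hbase.sh
On the backup master, run the following to start it
hbase-daemon.sh start master

Open http://node1:16010/ to view the HBase web UI

Run hbase shell to enter the HBase command line
Type help for the list of commands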

--------------------------------------------------------------------------------------

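Command groups
---------
   General group [general]
   version whoami
---------------------------------------------
   DDL group [ddl]

Create a namespace:

create_namespace 'mydatabase'

   Create a table
   create 'mydatabase:t1',{NAME => 'f1'}, {NAME => 'f2'}, {NAME => 'f3'}
   The arguments are: namespace mydatabase, table name t1, then column families f1, f2 and f3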
------------------------
   Show the table structure
   describe 'mydatabase:t1'

----------------------------------------------

   [namespace] A namespace is similar to a database in MySQL

   list_namespace //list the namespaces
   list_namespace_tables 'default' //list the tables in the default namespace
   create_namespace 'mydatabase' //create a namespace

----------------------------------------------
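   DML
   Insert data
   put 'mydatabase:t1', 'r1', 'f1:id', 100
   put 'mydatabase:t1', 'r1', 'f1:name', 'zhangsan'
   The arguments are: namespace:table, row key, column family:column, value

Query data
get 'mydatabase:t1','r2' //get the row whose row key is r2
scan 'mydatabase:t1' //scan table t1

count 'mydatabase:t1' //count the rows in the table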
-----------------------------------------------------------------------------------------

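A RegionServer hosts many regions. A region is a smaller table cut out of a large table, and region information is stored in hbase:meta

/hbase/data/mydatabase/t1/6bf41c58b439f4b29b1b4ece1725e2f5/f1
The path components are: namespace, table name, encoded region name, column family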

-----------------------------------------------------------------------------------------------

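When the HBase cluster starts, the master assigns regions to specific RegionServers

How a client interacts with HBase
1. Contact ZooKeeper to find the RegionServer that hosts the meta table
2. Query meta to locate the row key and find the RegionServer that serves it
3. Cache this location information locally
4. Contact that RegionServer
5. The HRegionServer opens the HRegion object and creates a Store instance for each column family. A Store is a lightweight wrapper around HFiles, and each Store also has a MemStore (which holds data in memory)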

------------------------------------------------------------------------------------------------
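[WAL directory layout]
hdfs://hdp12/hbase/WALs/$regionServerName/
On a write, the edit is first appended to the write-ahead log (WAL) and then written to the MemStore; when the MemStore fills up it is flushed to disk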
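
The order of these steps can be sketched as a toy model in plain Java. Everything here is hypothetical: `WritePathSketch`, its fields and the entry-count threshold are stand-ins for illustration, while the real MemStore flush is driven by size in bytes and writes HFiles to HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of the write path: log first, then memory, then spill to "disk".
final class WritePathSketch {
    private final List<String> wal = new ArrayList<>();               // stand-in for the write-ahead log
    private final TreeMap<String, String> memStore = new TreeMap<>(); // sorted in-memory buffer
    private final List<TreeMap<String, String>> flushedFiles = new ArrayList<>(); // stand-in for HFiles
    private final int flushThreshold; // hypothetical limit, counted in entries rather than bytes

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        wal.add(key + "=" + value); // 1. record the edit so it could be replayed after a crash
        memStore.put(key, value);   // 2. apply the edit to the MemStore
        if (memStore.size() >= flushThreshold) {
            flushedFiles.add(new TreeMap<>(memStore)); // 3. spill the sorted buffer
            memStore.clear();
        }
    }

    int walSize() { return wal.size(); }
    int memStoreSize() { return memStore.size(); }
    int flushedFileCount() { return flushedFiles.size(); }
}
```

With a threshold of 2, three writes leave three WAL entries, one flushed file and one entry still in memory.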

---------------------------------------------------------------------------------------------

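Batch writes
Turn off the write-ahead log and the automatic flush after each put. Skipping the WAL speeds up loading, but edits that are still only in memory are lost if a RegionServer crashes.
@Test
public void batchInsert() throws IOException {
    Configuration conf = HBaseConfiguration.create();
    Connection conn = ConnectionFactory.createConnection(conf);

    TableName tname = TableName.valueOf("mydatabase:t1");
    HTable table = (HTable) conn.getTable(tname);
    // do not flush the client-side write buffer after every put
    table.setAutoFlush(false);
    for (int i = 4; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes("r" + i));
        put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("id"), Bytes.toBytes(i));
        put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes("tom" + i));
        // skip the write-ahead log (replaces the deprecated setWriteToWAL(false))
        put.setDurability(Durability.SKIP_WAL);
        table.put(put);
        // send the buffered puts every 2000 rows
        if (i % 2000 == 0) {
            table.flushCommits();
        }
    }
    // send whatever is left in the buffer
    table.flushCommits();
    table.close();
    conn.close();
}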

-----------------------------------------------------------------------------------------------

Dropping a table
Disable the table first
disable 'mydatabase:t1'
Then drop it
drop 'mydatabase:t1'

enable 'ns2:t2' //enable a table

--------------------------------------------------------------------------

flush 'mydatabase:t1' //flush the table's in-memory data to disk

---------------------------------------------------------------------------------------------

By default HBase splits a region at a threshold of 10737418240 bytes, that is, when a file grows to 10 GB
<property>
<name>hbase.hregion.max.filesize</name>
<value>10737418240</value>
<source>hbase-default.xml</source>
</property>

---------------------------------------------------------------------------------------------

Region information in the meta table: STARTKEY is inclusive, ENDKEY is exclusive

split 'mydatabase:t1' //split the whole table in half
split 'regionName', 'rowKey' //split the given region at rowKey
Find the regionName with scan 'hbase:meta'
mydatabase:t1,025090,1554086476468.5f4791b3b3171ccc28912e72470cd9c2. column=info:regioninfo, timestamp=1554086477462, value={ENCODED => 5f4791b3b3171ccc28912e72470cd9c2, NAME => 'mydatabase:t1,025090,1554086476468.5f4791b3b3171ccc28912e72470cd9c2.', STARTKEY => '025090', ENDKEY => '050175'}

The regionName of this entry is mydatabase:t1,025090,1554086476468.5f4791b3b3171ccc28912e72470cd9c2.

------------------------

Moving a region
move 'ENCODED', 'SERVER_NAME'
For example, to move the region above to another RegionServer,
run
move '5f4791b3b3171ccc28912e72470cd9c2', 'node3,16020,1554087822363'
and check the result in the web UI

----------------------------

Merging two regions
merge_region '72fd464eddbddf2ec7e1e2bd447a59ad', '5f4791b3b3171ccc28912e72470cd9c2'

---------------------------------------------------------------------------------------------------

Integrating HBase with high-availability HDFS
vi hbase-env.sh
# Extra Java CLASSPATH elements. Optional.
export HBASE_CLASSPATH=$HBASE_CLASSPATH:/usr/local/hadoop-2.8.5/etc/hadoop
or
copy hdfs-site.xml and core-site.xml into HBase's conf directory

---------------------------------------------------------------------------------------------------

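Split storms
If the regions of several tables reach the 10 GB split threshold at the same time, they all split at once, which causes performance problems
To avoid a split storm, raise
<property>
<name>hbase.hregion.max.filesize</name>
<value>107374182400</value> <!-- set to 100 GB -->
<source>hbase-default.xml</source>
</property>
to a larger value so that automatic splits are hard to trigger, and then split manually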

-----------------------------------------------------------------------------------------------

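HBase stores data as key-value pairs <K,V>. K is the full coordinate rowId+family+col+time, and it is written out in full for every cell, so there is a lot of redundant data
So keep rowId, family and col short when you define them: aim for a single character for the column family, and ideally for the column name too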
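
A rough estimate makes the cost visible. The sketch below is plain Java and assumes the classic KeyValue key layout as I understand it (2 bytes row length, 1 byte family length, 8 bytes timestamp, 1 byte type, plus the row, family and qualifier bytes themselves). Treat the constants as assumptions and ignore tags, compression and block encoding. `KeySizeSketch` is a made-up name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimate of how many key bytes are repeated for every cell.
final class KeySizeSketch {
    // Assumed fixed key fields: row length (2) + family length (1) + timestamp (8) + type (1).
    static final int FIXED_KEY_BYTES = 2 + 1 + 8 + 1;

    static int keyBytes(String row, String family, String qualifier) {
        return FIXED_KEY_BYTES
                + row.getBytes(StandardCharsets.UTF_8).length
                + family.getBytes(StandardCharsets.UTF_8).length
                + qualifier.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(keyBytes("r000001", "userinfo", "username")); // prints 35
        System.out.println(keyBytes("r000001", "f", "n"));               // prints 21
    }
}
```

Under these assumptions, shortening the family and column names alone cuts the key from 35 to 21 bytes per cell, and that saving repeats for every cell of every row.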

---------------------------------------------------------------------------------------------------

Pre-splitting
   Split the table in advance when it is created
   The split points are rowKeys
   create 'ns2:t3', 'f1', SPLITS => ['10000', '20000', '30000']
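
Three split points give four regions. Which region a row lands in can be sketched with a sorted map keyed by region start key. This is plain Java, `PreSplitSketch` is a hypothetical helper, and it assumes ASCII row keys passed in ascending order so that `String` ordering matches HBase's byte ordering.

```java
import java.util.TreeMap;

// Hypothetical lookup: which pre-split region owns a given row key?
final class PreSplitSketch {
    private final TreeMap<String, Integer> regionsByStartKey = new TreeMap<>();

    // splitPoints are assumed to be sorted in ascending order
    PreSplitSketch(String... splitPoints) {
        regionsByStartKey.put("", 0); // the first region starts at the empty key
        int index = 1;
        for (String split : splitPoints) {
            regionsByStartKey.put(split, index++); // each split point starts a new region
        }
    }

    // The start key is inclusive and the end key exclusive,
    // so the greatest start key <= rowKey identifies the owning region.
    int regionIndexFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }
}
```

For the split points `'10000', '20000', '30000'`, row key `05000` falls in region 0, `10000` and `19999` in region 1, and `99999` in region 3.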

---------------------------------------------------------------------------------------------------

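Create a table that keeps 3 historical versions (the version count is set per column family)
create 'ns2:t1', {NAME=>'f1',VERSIONS=>3}, SPLITS => ['10000', '20000', '30000']
Get the historical versions
get 'ns2:t1', '000001', {COLUMN => 'f1', VERSIONS => 3}
The arguments are: rowKey, column family, number of versions to return

Get the data at a given timestamp
get 'ns2:t1', '000001', {COLUMN => 'f1', TIMESTAMP => 1554105730811}
Query a time range (the upper bound is exclusive)
get 'ns2:t1', '000001', {COLUMN => 'f1', TIMERANGE => [1554105730810,1554105737842], VERSIONS => 3}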

---------------------------------------------------------------------------------------------------

Raw scan
scan 'ns2:t1', {COLUMNS=>'f1', RAW => true, VERSIONS => 10}

Specify the timestamp when deleting
delete 'ns2:t1', '000001', 'f1:name', 1554105730811
Raw scan again after the delete
ROW COLUMN+CELL
000001 column=f1:name, timestamp=1554108525382, value=tom6
000001 column=f1:name, timestamp=1554108522225, value=tom5
000001 column=f1:name, timestamp=1554108517374, value=tom4
000001 column=f1:name, timestamp=1554108514077, value=tom3
000001 column=f1:name, timestamp=1554105737841, value=tom2
000001 column=f1:name, timestamp=1554105734553, value=tom1
000001 column=f1:name, timestamp=1554105730811, type=Delete
000001 column=f1:name, timestamp=1554105730811, value=tom
If tom5 is then deleted as well
and we get again
get 'ns2:t1', '000001', {COLUMN=>'f1', VERSIONS => 10}
the result is (the column family was created with VERSIONS => 3, so at most three live versions come back)
COLUMN CELL
f1:name timestamp=1554108525382, value=tom6
f1:name timestamp=1554108517374, value=tom4
f1:name timestamp=1554108514077, value=tom3
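
The output above can be reproduced with a small model: versions sorted newest first, a set of version-level delete markers, and a cap on the number of versions returned. This is a plain-Java sketch under the assumption that the marker is the version-level `type=Delete` seen in the raw scan, which hides only the cell with exactly the same timestamp. `VersionDeleteSketch` is a hypothetical name, and other marker types (column or family deletes) are not modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model of one column's versions plus version-level delete markers.
final class VersionDeleteSketch {
    private final TreeMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.reverseOrder()); // newest first
    private final Set<Long> deleteMarkers = new HashSet<>();

    void put(long ts, String value) { versions.put(ts, value); }

    // Marks exactly one version as deleted; the cell stays in storage, as in a raw scan.
    void deleteVersion(long ts) { deleteMarkers.add(ts); }

    // Returns up to maxVersions live values, newest first, skipping masked versions.
    List<String> get(int maxVersions) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            if (deleteMarkers.contains(e.getKey())) continue;
            if (out.size() == maxVersions) break;
            out.add(e.getValue());
        }
        return out;
    }
}
```

Loading the seven versions from the raw scan, marking the timestamps of tom and tom5 as deleted and calling `get(3)` yields tom6, tom4, tom3, matching the shell output.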

-------------------------------------------------------------------------------------------------------------------

Setting a time to live (TTL)
create 'ns2:t2' , {NAME=>'f1', TTL=>60, VERSIONS => 3}
Cells expire after 60 seconds
This applies to all data, whether it has been deleted or not

---------------------------------------------------------------------------------------

Create a table with KEEP_DELETED_CELLS=>true
create 'ns2:t4',{NAME=>'f1',VERSIONS=>3,KEEP_DELETED_CELLS=>true}
put 'ns2:t4','r1','f1:name','tom1'
put 'ns2:t4','r1','f1:name','tom2'
put 'ns2:t4','r1','f1:name','tom3'

scan 'ns2:t4',{COLUMN=>'f1',RAW=>true,VERSIONS=>5} returns
ROW COLUMN+CELL
r1 column=f1:name, timestamp=1554172164547, value=tom3
r1 column=f1:name, timestamp=1554172161313, value=tom2
r1 column=f1:name, timestamp=1554172161310, value=tom1
Then run delete 'ns2:t4', 'r1', 'f1', 1554172164547
Running scan 'ns2:t4',{COLUMN=>'f1',RAW=>true,VERSIONS=>5} again returns
ROW COLUMN+CELL
r1 column=f1:, timestamp=1554172167655, type=Delete
r1 column=f1:name, timestamp=1554172167655, value=tom3
r1 column=f1:name, timestamp=1554172164547, value=tom2
r1 column=f1:name, timestamp=1554172161313, value=tom1
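However, get and a non-raw scan still return tom3 (in the raw scan above, the Delete marker sits on column f1: with an empty qualifier, not on f1:name)

But if a TTL is set, expired cells really are removed
See hbase数据清除策略.png (HBase data purge policy diagram)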

---------------------------------------------------------------------------------------------

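Scanner leases
ResultScanner scanner = table.getScanner(scan);
A lease keeps a ResultScanner from holding server resources for too long
It is set in the HBase configuration file (newer releases deprecate this property name in favor of hbase.client.scanner.timeout.period)
<property>
<name>hbase.regionserver.lease.period</name>
<value>120000</value> <!-- in milliseconds; 2 minutes -->
</property>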

--------------------------------------------------------------------------------------------------

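Scanner caching (row-level)
Without caching, every next() on a ResultScanner sends one RPC to the server and the server-side cursor advances by one row. Scanner caching is off by default (this matches older releases; the default differs between HBase versions, so check hbase-default.xml for yours)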

--------------------

Scanner caching can be enabled globally, so that every scan uses it
Edit hbase-site.xml
and add
<property>
<name>hbase.client.scanner.caching</name>
<value>10</value> <!-- cache 10 rows per RPC -->
</property>

-1 means no caching

--------------------

Caching can also be set per operation at query time
Scan scan = new Scan();
scan.setCaching(10);

----------------------------------------------------------------------------------------------------------------------

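Batching works at the column level
It controls how many columns the server returns for each next()
scan.setBatch(int batch)

Each next() on the ResultScanner yields up to batch columns of the row
For example, if a row has 5 columns and batch is 3, the row takes two next() calls: the first returns the first 3 columns, the second returns the remaining 2, and the following next() moves on to the next row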
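
The 5-column example can be checked with a few lines of plain Java. `BatchSketch.partition` is a hypothetical helper that only imitates how one row is cut into chunks; it does not call the HBase client.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of setBatch: one row is returned as several partial results.
final class BatchSketch {
    // Splits one row's columns into chunks of at most `batch` columns, one chunk per next().
    static List<List<String>> partition(List<String> columns, int batch) {
        List<List<String>> results = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batch) {
            int end = Math.min(i + batch, columns.size());
            results.add(new ArrayList<>(columns.subList(i, end)));
        }
        return results;
    }
}
```

With columns c1 to c5 and a batch of 3, the helper returns two chunks: [c1, c2, c3] and [c4, c5].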

---------------------------------------------------------------------------------------------------------------------

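Suppose there are 10 rows with 20 cells each.
With caching = 2 and batch = 100,
each Result holds a whole row of 20 cells (batch is larger than the row), which gives 10 Results in total. Only 2 Results are cached per round trip, so the number of RPCs sent to HBase is
5 + 1 = 6; the extra RPC is the one that finds out whether the scan is complete

With caching = 2 and batch = 10,
each Result holds 10 cells of a row, which gives 20 Results in total. Only 2 Results are cached per round trip, so the number of RPCs sent to HBase is
10 + 1 = 11; the extra RPC is again the one that finds out whether the scan is complete

As a formula (add 1 for the final completion check):
RPCs = (Rows * Cols per Row) / Min(Cols per Row, Batch size) / Scanner caching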
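
The formula is easy to verify against both worked examples. The sketch below is plain Java; `ScanRpcSketch.estimateRpcs` is a hypothetical helper, it rounds up where the division is not exact, and it includes the one extra RPC that detects the end of the scan.

```java
// Hypothetical helper that evaluates the RPC formula from the notes.
final class ScanRpcSketch {
    static long estimateRpcs(long rows, long colsPerRow, long batch, long caching) {
        long colsPerResult = Math.min(colsPerRow, batch);
        long resultsPerRow = (colsPerRow + colsPerResult - 1) / colsPerResult; // ceiling division
        long results = rows * resultsPerRow;
        long fetchRpcs = (results + caching - 1) / caching;                    // ceiling division
        return fetchRpcs + 1; // one extra RPC to learn that the scan is finished
    }

    public static void main(String[] args) {
        System.out.println(estimateRpcs(10, 20, 100, 2)); // prints 6
        System.out.println(estimateRpcs(10, 20, 10, 2));  // prints 11
    }
}
```

Raising caching reduces round trips at the cost of more client memory per fetch, while lowering batch multiplies the number of Results and therefore the number of RPCs.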

----------------------------------------------------------------------------------------------

HBase filters
RowFilter: filters on the row key
RowFilter rowFilter = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes("000100")));
scan.setFilter(rowFilter);

------------------------

FamilyFilter: filters on the column family

FamilyFilter familyFilter = new FamilyFilter(CompareFilter.CompareOp.LESS, new BinaryComparator(Bytes.toBytes("f2")));

----------------------------

QualifierFilter: filters on the column (qualifier)

QualifierFilter qualifierFilter = new QualifierFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("id")));

---------------------------

ValueFilter: filters on the cell value

ValueFilter valueFilter = new ValueFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("beijing"));
//apart from rowid and addr, all other values come back null

---------------------------

DependentColumnFilter: dependent-column filter
DependentColumnFilter dependentColumnFilter = new DependentColumnFilter(Bytes.toBytes("f2")
, Bytes.toBytes("id")
, false
, CompareFilter.CompareOp.NOT_EQUAL
, new BinaryComparator(Bytes.toBytes("2")));
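With dropDependentColumn = false, the whole row is returned
With dropDependentColumn = true, the column used as the condition is not returned

Note: this filter can be seen as a combination of a timestamp filter and a ValueFilter. DependentColumnFilter takes a reference column, selects all columns that have the same timestamp as that reference column, and then keeps those whose values satisfy the ValueFilter condition.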

---------------------------

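SingleColumnValueFilter: single-column value filter. If the column does not satisfy the condition, the whole row is filtered out; otherwise the whole row is returned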

SingleColumnValueFilter singleColumnValueFilter = new SingleColumnValueFilter(Bytes.toBytes("f2"), Bytes.toBytes("addr"), CompareFilter.CompareOp.NOT_EQUAL, Bytes.toBytes("beijing"));

Returns
f1Id=[B@776aec5c/f1Name=[B@1d296da/f1Age=null/f2Id=[B@7c7a06ec/f2Name=[B@75d4a5c2/f2Age[B@557caf28/f2Addr=[B@408d971b
f1Id=[B@6c6cb480/f1Name=[B@3c46e67a/f1Age=[B@c730b35/f2Id=[B@206a70ef/f2Name=[B@292b08d6/f2Age[B@22555ebf/f2Addr=[B@36ebc363

-------------------------

SingleColumnValueExcludeFilter: a single-column value filter that leaves the column used in the condition out of the returned row

SingleColumnValueExcludeFilter singleColumnValueExcludeFilter = new SingleColumnValueExcludeFilter(Bytes.toBytes("f2"), Bytes.toBytes("addr"), CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("beijing")));

Returns
f1Id=[B@776aec5c/f1Name=null/f1Age=null/f2Id=[B@1d296da/f2Name=[B@7c7a06ec/f2Age[B@75d4a5c2/f2Addr=null

---------------------------

PrefixFilter: prefix filter, applied to the rowKey

PrefixFilter prefixFilter = new PrefixFilter(Bytes.toBytes("r1"));

----------------------------

PageFilter: paging filter, applied to rowKeys. Paging is done separately in each region, so with 3 regions a page size of 10 can return 30 rows

PageFilter pageFilter = new PageFilter(10);
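
Why 30 rows come back can be sketched in plain Java: each region applies the limit on its own, so the client has to cut the merged result down to the page size itself. `PageFilterSketch` and its methods are hypothetical, and regions are represented as simple lists of row keys.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a page limit that is enforced per region, not globally.
final class PageFilterSketch {
    // Each region returns at most pageSize rows, because filter state is not shared across regions.
    static List<String> scanAllRegions(List<List<String>> regions, int pageSize) {
        List<String> merged = new ArrayList<>();
        for (List<String> regionRows : regions) {
            merged.addAll(regionRows.subList(0, Math.min(pageSize, regionRows.size())));
        }
        return merged;
    }

    // The client enforces the real page size by truncating the merged rows.
    static List<String> firstPage(List<List<String>> regions, int pageSize) {
        List<String> merged = scanAllRegions(regions, pageSize);
        return new ArrayList<>(merged.subList(0, Math.min(pageSize, merged.size())));
    }
}
```

With 3 regions of 15 rows each and a page size of 10, `scanAllRegions` yields 30 rows and `firstPage` trims them to 10. In a real client the same effect can be had by stopping the iteration after the tenth row.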

-----------------------------

KeyOnlyFilter: returns only the column family, column and timestamp of each cell, not its value

KeyOnlyFilter keyOnlyFilter = new KeyOnlyFilter();
Scan scan = new Scan();
scan.setFilter(keyOnlyFilter);
ResultScanner scanner = table.getScanner(scan);
Iterator<Result> iterator = scanner.iterator();
while (iterator.hasNext()){
Result next = iterator.next();
List<Cell> columnCells = next.getColumnCells(Bytes.toBytes("f1"), Bytes.toBytes("id"));
for(Cell cell: columnCells){
String f = Bytes.toString(cell.getFamilyArray(), cell.getFamilyOffset(), cell.getFamilyLength());
String c = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
String v = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength());
long ts = cell.getTimestamp();
System.out.println(f+"-"+c+"-"+v+"-"+ts);
}
}

Returns
f1-id--1554178507735
f1-id--1554178565235
f1-id--1554178636355

------------------------------

ColumnPaginationFilter: column paging filter. If a row has 5 columns in total, limit 2 and offset 2 select the third and fourth columns

ColumnPaginationFilter columnPaginationFilter = new ColumnPaginationFilter(2, 2);
Scan scan = new Scan();
scan.setFilter(columnPaginationFilter);

ResultScanner scanner = table.getScanner(scan);
Iterator<Result> iterator = scanner.iterator();
while (iterator.hasNext()){
System.out.println("========================================");
Result next = iterator.next();
byte[] row = next.getRow();
//System.out.println(Bytes.toString(row));
NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map = next.getMap();
for(Map.Entry<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> familyMap:map.entrySet()){
byte[] family = familyMap.getKey();

for(Map.Entry<byte[], NavigableMap<Long, byte[]>> valueMap : familyMap.getValue().entrySet()){
byte[] col = valueMap.getKey();
System.out.println(Bytes.toString(row)+"/"+Bytes.toString(family)+"/"+Bytes.toString(col));
}
}
}

Returns
========================================
r1/f2/age
r1/f2/id
========================================
r2/f2/addr
r2/f2/age
========================================
r3/f1/name
r3/f2/addr

---------------------------------------

/**
* A compound query,
* equivalent to the SQL
* select * from user where (age >= 13 and name like 'tom%') or addr like 'beijing%'
* @throws IOException
*/
@Test
public void testComboFilter() throws IOException {
Configuration conf = HBaseConfiguration.create();
Connection conn = ConnectionFactory.createConnection(conf);

TableName tableName = TableName.valueOf("ns2:t5");
Table table = conn.getTable(tableName);

// BinaryComparator compares raw bytes, so "13" is compared as a string, not as a number
SingleColumnValueFilter ft1 = new SingleColumnValueFilter(
Bytes.toBytes("f2"),
Bytes.toBytes("age"),
CompareFilter.CompareOp.GREATER_OR_EQUAL,
new BinaryComparator(Bytes.toBytes("13"))
);

SingleColumnValueFilter ft2 = new SingleColumnValueFilter(
Bytes.toBytes("f2"),
Bytes.toBytes("name"),
CompareFilter.CompareOp.EQUAL,
new RegexStringComparator("^tom")
);
//equivalent to AND
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL,ft1,ft2);

ValueFilter ft3 = new ValueFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^beijing"));
//equivalent to OR
FilterList filterList1 = new FilterList(FilterList.Operator.MUST_PASS_ONE,filterList,ft3);

Scan scan = new Scan();
scan.setFilter(filterList1);

ResultScanner scanner = table.getScanner(scan);

Iterator<Result> iterator = scanner.iterator();

while (iterator.hasNext()){
System.out.println("====================================");
Result next = iterator.next();
byte[] row = next.getRow();
byte[] f1Id = next.getValue(Bytes.toBytes("f1"), Bytes.toBytes("id"));
byte[] f1Name = next.getValue(Bytes.toBytes("f1"), Bytes.toBytes("name"));
byte[] f1Age = next.getValue(Bytes.toBytes("f1"), Bytes.toBytes("age"));
byte[] f2Id = next.getValue(Bytes.toBytes("f2"), Bytes.toBytes("id"));
byte[] f2Name = next.getValue(Bytes.toBytes("f2"), Bytes.toBytes("name"));
byte[] f2Age = next.getValue(Bytes.toBytes("f2"), Bytes.toBytes("age"));
byte[] f2Addr = next.getValue(Bytes.toBytes("f2"), Bytes.toBytes("addr"));

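// the byte[] values are concatenated directly, so this prints array references such as [B@292b08d6; wrap them in Bytes.toString(...) to print the stored text
System.out.println(Bytes.toString(row)+":f1Id="+f1Id+"/f1Name="+f1Name+"/f1Age="+f1Age+"/f2Id="+f2Id+"/f2Name="+f2Name+"/f2Age"+f2Age+"/f2Addr="+f2Addr);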
}

conn.close();
}

Returns
====================================
r1:f1Id=[B@292b08d6/f1Name=null/f1Age=null/f2Id=null/f2Name=null/f2Agenull/f2Addr=[B@22555ebf
====================================
r2:f1Id=[B@36ebc363/f1Name=[B@45752059/f1Age=null/f2Id=[B@34e9fd99/f2Name=[B@3c41ed1d/f2Age[B@54d9d12d/f2Addr=[B@38425407
====================================
r3:f1Id=[B@43bc63a3/f1Name=[B@702657cc/f1Age=[B@6a6cb05c/f2Id=[B@40a4337a/f2Name=[B@6025e1b6/f2Age[B@22ff4249/f2Addr=[B@2d1ef81a

-----------------------------------------------------------------------------------------------------------------------

When a numeric value is put from the hbase shell, HBase stores it as a string
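
The difference matters when shell data and Java client data are mixed in the same column. The plain-Java sketch below contrasts the two encodings of the number 100: the text bytes a shell put would store, and a 4-byte big-endian integer, which as far as I know is what `Bytes.toBytes(int)` in the Java client produces. `EncodingSketch` is a hypothetical name and the code uses only the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical comparison of a number stored as text and as a binary int.
final class EncodingSketch {
    // What a shell put of 100 stores: the characters '1', '0', '0'.
    static byte[] asShellString(int n) {
        return Integer.toString(n).getBytes(StandardCharsets.UTF_8);
    }

    // A 4-byte big-endian integer (ByteBuffer is big-endian by default).
    static byte[] asBinaryInt(int n) {
        return ByteBuffer.allocate(4).putInt(n).array();
    }

    public static void main(String[] args) {
        System.out.println(asShellString(100).length); // prints 3
        System.out.println(asBinaryInt(100).length);   // prints 4
    }
}
```

A reader has to decode with the same convention the writer used: parsing the 4 binary bytes as text, or the 3 text bytes as an int, gives garbage or an error.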

-----------------------------------------------------------------------------------------------------------------------

HBase counters
Using the shell
incr 'ns2:t6', 'r1', 'f1:click', 1
incr 'ns2:t6', 'r1', 'f1:click', -1
incr 'ns2:t6', 'r1', 'f1:click', 5
Get the counter value
get_counter 'ns2:t6', 'r1', 'f1:click'

In code
public void testIncr() throws IOException {
Configuration conf = HBaseConfiguration.create();
Connection conn = ConnectionFactory.createConnection(conf);

TableName tableName = TableName.valueOf("ns2:t6");
Table table = conn.getTable(tableName);

Increment incr = new Increment(Bytes.toBytes("r1"));
incr.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("click"), 1);
incr.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("daily"), 10);
incr.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("weekly"), 30);
incr.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("monthly"), 50);
table.increment(incr);

conn.close();
}

------------------------------------------------------------------------------------------------------------------------

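Coprocessors
They do batch-style processing on the server side and are comparable to stored procedures or triggers
A coprocessor is associated with a region (HRegion)

----------------

Observer //similar to a trigger: event-based, the matching callback method is invoked when an action occurs
RegionObserver: handles data modification events; these observers are tightly bound to a table's regions
MasterObserver: used for administrative or DDL-type operations, which are cluster-level events
WALObserver: provides hooks for controlling the WAL

----------------

Endpoint //similar to a stored procedure: invoked on request by the client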

----------------

Global coprocessors can be loaded from the configuration file hbase-site.xml

<property>
   <name>hbase.coprocessor.region.classes</name>
   <value>coprocessor.RegionObserverExample,coprocessor.AnotherCoprocessor</value>
</property>

<property>
   <name>hbase.coprocessor.master.classes</name>
   <value>coprocessor.MasterObserverExample</value>
</property>

<property>
   <name>hbase.coprocessor.wal.classes</name>
   <value>coprocessor.WALObserverExample,bar.foo.MyWALObserver</value>
</property>

-----------------------

Implementing an endpoint
Given the following data
ROW COLUMN+CELL
id1 column=0:c, timestamp=1554270618545, value=100
id2 column=0:c, timestamp=1554270624774, value=200
id3 column=0:c, timestamp=1554270630368, value=300
id4 column=0:c, timestamp=1554270637156, value=400
id5 column=0:c, timestamp=1554270643199, value=500

Download protobuf 2.5
https://github.com/protocolbuffers/protobuf/releases/tag/v2.5.0
tar -zxvf protobuf-2.5.0.tar.gz -C /usr/local/src
cd /usr/local/src/protobuf-2.5.0
./configure --prefix=/usr/local/protobuf
make
make check
make install
Add the environment variables
vi /etc/profile
PROTOBUF_HOME=/usr/local/protobuf
PATH=$PATH:$PROTOBUF_HOME/bin
export PROTOBUF_HOME PATH

Write the proto file
vi count_sum.proto

syntax = "proto2";
option java_package = "com.hny.hbase.coprocessor"; //package name inside the project
option java_outer_classname = "CountAndSumProtocol"; //class name
option java_generic_services = true;
option java_generate_equals_and_hash = true;
option optimize_for = SPEED;

message CountAndSumRequest { //input parameters
required string family = 1;
required string column = 2;
}

message CountAndSumResponse { //return value
required int64 count = 1 [default = 0];
required double sum = 2 [default = 0];
}

service RowCountAndSumService { //service
rpc getCountAndSum(CountAndSumRequest)
returns (CountAndSumResponse);
}

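On Linux, run protoc --java_out=./ count_sum.proto to generate the Java file in the current directory
Copy it into the matching package directory of the IDEA project

Write a class that extends the generated service class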

package com.hny.hbase.coprocessor;

import com.google.protobuf.RpcCallback;
import com.google.protobuf.RpcController;
import com.google.protobuf.Service;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.CoprocessorEnvironment;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.coprocessor.CoprocessorException;
import org.apache.hadoop.hbase.coprocessor.CoprocessorService;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.protobuf.ResponseConverter;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CountAndSum extends CountAndSumProtocol.RowCountAndSumService implements Coprocessor, CoprocessorService {

private RegionCoprocessorEnvironment env;

@Override
public void getCountAndSum(RpcController controller, CountAndSumProtocol.CountAndSumRequest request, RpcCallback<CountAndSumProtocol.CountAndSumResponse> done) {
String family = request.getFamily();
if (null == family || "".equals(family)) {
throw new NullPointerException("you need specify the family");
}
String column = request.getColumn();
if (null == column || "".equals(column)) {
throw new NullPointerException("you need specify the column");
}
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes(family), Bytes.toBytes(column));

CountAndSumProtocol.CountAndSumResponse response = null;
InternalScanner scanner = null;
try {
// row count
long count = 0;
// running sum
double sum = 0;

scanner = env.getRegion().getScanner(scan);
List<Cell> results = new ArrayList<>();
boolean hasMore;
// Be sure not to use a while () {} loop here: it would drop the last row
do {
hasMore = scanner.next(results);
if (results.isEmpty()) {
continue;
}
Cell kv = results.get(0);
double value = 0;
try {
value = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(kv)));
} catch (Exception e) {
// a value that is not numeric is counted with value 0
}
count++;
sum += value;
results.clear();
} while (hasMore);

// build the response
response = CountAndSumProtocol.CountAndSumResponse.newBuilder().setCount(count).setSum(sum).build();
} catch (IOException e) {
e.printStackTrace();
ResponseConverter.setControllerException(controller, e);
} finally {
if (scanner != null) {
try {
scanner.close();
} catch (IOException ignored) {
}
}
}
done.run(response);
}

@Override
public void start(CoprocessorEnvironment env) throws IOException {
if (env instanceof RegionCoprocessorEnvironment) {
this.env = (RegionCoprocessorEnvironment) env;
} else {
throw new CoprocessorException("Must be loaded on a table region!");
}
}

@Override
public void stop(CoprocessorEnvironment env) throws IOException {
// do nothing
}

@Override
public Service getService() {
return this;
}
}

----------------------------------
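Static deployment
Upload the jar to the lib directory of every HBase RegionServer
Then add the following to hbase-site.xml on every RegionServer (the value must be the fully qualified name of the endpoint class as packaged in the jar)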
<property>
<name>hbase.coprocessor.region.classes</name>
<value>com.wyd.hbase.observer.CountAndSum</value>
</property>
Then write the client-side code
package com.wyd.hbase.client;

import com.wyd.hbase.coprocessor.CountAndSumProtocol;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.coprocessor.Batch;
import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.Map;

public class CountAndSumClient {

public static class CountAndSumResult {
public long count;
public double sum;

}

private Connection connection;

public CountAndSumClient(Connection connection) {
this.connection = connection;
}

public CountAndSumResult call(String tableName, String family, String column, String
startRow, String endRow) throws Throwable {
Table table = connection.getTable(TableName.valueOf(Bytes.toBytes(tableName)));
final CountAndSumProtocol.CountAndSumRequest request = CountAndSumProtocol.CountAndSumRequest
.newBuilder()
.setFamily(family)
.setColumn(column)
.build();

byte[] startKey = (null != startRow) ? Bytes.toBytes(startRow) : null;
byte[] endKey = (null != endRow) ? Bytes.toBytes(endRow) : null;
// The second and third arguments of coprocessorService select the regions. It is a range query: the data on every region between startKey and endKey takes part in the computation
Map<byte[], CountAndSumResult> map = table.coprocessorService(CountAndSumProtocol.RowCountAndSumService.class,
startKey, endKey, new Batch.Call<CountAndSumProtocol.RowCountAndSumService,
CountAndSumResult>() {
@Override
public CountAndSumResult call(CountAndSumProtocol.RowCountAndSumService service) throws IOException {
BlockingRpcCallback<CountAndSumProtocol.CountAndSumResponse> rpcCallback = new BlockingRpcCallback<>();
service.getCountAndSum(null, request, rpcCallback);
CountAndSumProtocol.CountAndSumResponse response = rpcCallback.get();
// returning the response directly would also work
CountAndSumResult responseInfo = new CountAndSumResult();
responseInfo.count = response.getCount();
responseInfo.sum = response.getSum();
return responseInfo;
}
});

CountAndSumResult result = new CountAndSumResult();
for (CountAndSumResult ri : map.values()) {
result.count += ri.count;
result.sum += ri.sum;
}

return result;
}

}
Test code that calls the client
package com.wyd.hbase.client;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class Test {
public static void main(String[] args) throws Throwable {
Configuration conf = HBaseConfiguration.create();
Connection conn = ConnectionFactory.createConnection(conf);

String tableName = "test";
CountAndSumClient client = new CountAndSumClient(conn);
CountAndSumClient.CountAndSumResult result = client.call(tableName, "0", "c", null, null);

System.out.println("count: " + result.count + ", sum: " + result.sum);
}
}

---------------------------------
Dynamic loading
https://yq.aliyun.com/articles/670075
https://github.com/fayson/cdhproject/blob/master/hbasedemo/proto/MyFirstCoprocessor.proto
For details see the 新hbase笔/hbasedemo project

Loading from code
/**
* Dynamically load coprocessors onto an existing table
* @param connection
* @param table
* @param jarPath
* @param cls
*/
public static void setupToExistTable(Connection connection, Table table, String jarPath, Class<?>... cls) {
try {
if(jarPath != null && !jarPath.isEmpty()) {
Path path = new Path(jarPath);
HTableDescriptor hTableDescriptor = table.getTableDescriptor();
for(Class cass : cls) {
hTableDescriptor.addCoprocessor(cass.getCanonicalName(), path, Coprocessor.PRIORITY_USER, null);
}
connection.getAdmin().modifyTable(table.getName(), hTableDescriptor);
}

} catch (IOException e) {
e.printStackTrace();
}
}

/**
* Remove coprocessors from an HBase table
* @param connection
* @param table
* @param cls
*/
public static void deleteCoprocessor(Connection connection, Table table, Class<?>... cls) {
System.out.println("begin delete " + table.getName().toString() + " Coprocessor......");
try {
HTableDescriptor hTableDescriptor = table.getTableDescriptor();
for(Class cass : cls) {
hTableDescriptor.removeCoprocessor(cass.getCanonicalName());
}
connection.getAdmin().modifyTable(table.getName(), hTableDescriptor);
} catch (IOException e) {
e.printStackTrace();
}
System.out.println("end delete " + table.getName().toString() + " Coprocessor......");
}

----------------

Loading from the shell

Add the following to hbase-site.xml
<property>
<name>hbase.table.sanity.checks</name>
<value>false</value>
</property>

create 'guanzhu', 'f1'
create 'fensi', 'f1'

Write the observer code
package com.wyd.hbase.observer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

import java.io.IOException;
import java.util.List;

public class InvertedCoprocessor extends BaseRegionObserver {

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> e, Put put, WALEdit edit, Durability durability) throws IOException {
        byte[] row = put.getRow();
        List<Cell> cells = put.get("f1".getBytes(), "from".getBytes());
        // puts that do not write f1:from (for example f1:user) have nothing to index
        if (cells.isEmpty()) {
            return;
        }
        Cell cell = cells.get(0);
        // the value of f1:from becomes the row key in the index table,
        // and the original row key becomes the value
        Put putIndex = new Put(cell.getValueArray(), cell.getValueOffset(),
                cell.getValueLength());
        putIndex.addColumn("f1".getBytes(), "from".getBytes(), row);
        // opening a connection for every put is simple but expensive
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("fensi"))) {
            table.put(putIndex);
        }
    }
}

Package the jar and upload it to HDFS

disable 'guanzhu'
alter 'guanzhu', METHOD => 'table_att', 'Coprocessor'=>'hdfs://hdp12/hbase/coprocessor/MycountAndSum.jar|com.wyd.hbase.observer.InvertedCoprocessor|1001|args...'
The attribute value is: jar path|class name|priority|arguments (1001 is the priority)

alter 'ns1:guanzhu', METHOD => 'table_att', 'Coprocessor'=>'hdfs://hdp12/hbasetest-1.0-SNAPSHOT.jar|com.wyd.hbase.observer.InvertedCoprocessor|1001'

enable 'guanzhu'

After this, running
put 'guanzhu', 'r1', 'f1:from', 'a'
put 'guanzhu', 'r1', 'f1:user', 'wangyadi'
put 'guanzhu', 'r1', 'f1:start', 'xietingfeng'
also adds a row to 'fensi'
ROW COLUMN+CELL
a column=f1:from, timestamp=1554288186666, value=r1

This is an inverted index

Removing a coprocessor from an HBase table on the command line
disable 'guanzhu'
alter 'guanzhu', METHOD => 'table_att_unset', NAME=>'coprocessor$1'
enable 'guanzhu'

-----------------------------------------------------------------------------------------------------------------------

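HBase connection utility class (shares a single Connection)
package com.wyd.hbase.util;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

import java.io.IOException;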

public class HBaseUtil {

//private static final String QUORUM = "192.168.1.100";
//private static final String CLIENTPORT = "2181";
private static Configuration conf = null;
private static Connection conn = null;

public static synchronized Configuration getConfiguration(){
if(conf == null){
conf = HBaseConfiguration.create();
//conf.set("hbase.zookeeper.quorum", QUORUM);
//conf.set("hbase.zookeeper.property.clientPort", CLIENTPORT);
}
return conf;
}

public static synchronized Connection getConnection() throws IOException {
if(conn == null){
conn = ConnectionFactory.createConnection(getConfiguration());
}
return conn;
}

}

----------------------------------------------------------------------------------------------------
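Open issues: the HTable connection-pool problem and removing a coprocessor from the shell are still unresolved

Salting: a salt is a prefix added to the rowKey

Do not use a time component as the high-order (leading) part of the rowKey

Call-log table design
Use regNo + caller + callTime(yyyyMMddHHmmss) + callee + call duration (00000s) as the row key
regNo = ((caller + yyyyMM).hashCode() & Integer.MAX_VALUE) % number of partitions (100 regions here)
This way each record is stored in the partition that matches its regNo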
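
The prefix computation can be sketched in plain Java. Note the parentheses: in Java `%` binds tighter than `&`, so without them the modulo would be applied to `Integer.MAX_VALUE` rather than to the masked hash. `SaltSketch.regNo` is a hypothetical helper, and the two-digit zero padding is an assumption chosen to line up with two-digit split points such as '01' to '99'.

```java
import java.util.Locale;

// Hypothetical helper that derives the region prefix (salt) for a call record.
final class SaltSketch {
    static String regNo(String caller, String yyyyMM, int regions) {
        // clear the sign bit so the hash is never negative
        int hash = (caller + yyyyMM).hashCode() & Integer.MAX_VALUE;
        // zero-pad so the prefix sorts correctly as text
        return String.format(Locale.ROOT, "%02d", hash % regions);
    }
}
```

All records of one caller within one month get the same prefix, so a month of call details for a caller stays in one region and can be read with a single range scan.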

Create the caller table
create 'ns1:t1', 'f', SPLITS => ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40','41','42','43','44','45','46','47','48','49','50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99']
Create the callee table
...

alter 'ns1:t1', METHOD => 'table_att', 'Coprocessor1'=>'hdfs://hdp12/hbase/coprocessor/MycountAndSum.jar|com.wyd.hbase.observer.CallLogCoprocessor|1002|'

alter 'ns1:t1', METHOD => 'table_att', 'Coprocessor1'=>'hdfs://hdp12/hbase/coprocessor/original-hbasetest-1.0-SNAPSHOT.jar|com.wyd.hbase.coprocessor.observer.CallLogCoprocessor|1001'

alter 'ns1:t1', METHOD => 'table_att_unset', NAME=>'coprocessor$1'
enable 'guanzhu'

The start row is inclusive, the stop row is exclusive
/**
* Create a Scan operation for the range of rows specified.
* @param startRow row to start scanner at or after (inclusive)
* @param stopRow row to stop scanner before (exclusive)
*/
public Scan(byte [] startRow, byte [] stopRow) {
this.startRow = startRow;
this.stopRow = stopRow;
//if the startRow and stopRow both are empty, it is not a Get
this.getScan = isStartRowAndEqualsStopRow();
}

To query the call records for March, for example:
new Scan(Bytes.toBytes(regNo+callid+"201703"),Bytes.toBytes(regNo+callid+"201704"))
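
A small sketch of how the start and stop keys for one month could be built. It does not touch HBase: a TreeMap stands in for the sorted table, since TreeMap.subMap(from, to) is likewise start-inclusive and stop-exclusive. The class name MonthScanRange, its methods, and the sample keys are assumptions made for illustration.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class MonthScanRange {

    // Inclusive start key: regNo + caller + yyyyMM.
    public static String startKey(String regNo, String caller, String yearMonth) {
        return regNo + caller + yearMonth;
    }

    // Exclusive stop key: same prefix with the following month (December rolls over to January).
    public static String stopKey(String regNo, String caller, String yearMonth) {
        int year = Integer.parseInt(yearMonth.substring(0, 4));
        int month = Integer.parseInt(yearMonth.substring(4, 6));
        YearMonth next = YearMonth.of(year, month).plusMonths(1);
        return regNo + caller + String.format("%04d%02d", next.getYear(), next.getMonthValue());
    }

    // Stand-in for a scan over a sorted table; returns the row keys in [start, stop).
    public static List<String> rowsInMonth(TreeMap<String, String> table,
                                           String regNo, String caller, String yearMonth) {
        String start = startKey(regNo, caller, yearMonth);
        String stop = stopKey(regNo, caller, yearMonth);
        return new ArrayList<>(table.subMap(start, stop).keySet());
    }

    public static void main(String[] args) {
        // Made-up rows; for simplicity they all use the same regNo "42".
        TreeMap<String, String> table = new TreeMap<>();
        table.put("42" + "13800000000" + "20170228235959", "feb");
        table.put("42" + "13800000000" + "20170301000000", "mar-first");
        table.put("42" + "13800000000" + "20170331235959", "mar-last");
        table.put("42" + "13800000000" + "20170401000000", "apr");
        System.out.println(rowsInMonth(table, "42", "13800000000", "201703"));
    }
}
```

Using the next month as the stop key avoids having to know the last day or second of the month: every key that begins with the March prefix sorts before the April prefix.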

--------------------------------------------------------------------------------------------------

Writing to HBase from MapReduce
Copy the following files
/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/src/main/resources/core-site.xml
/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/src/main/resources/hbase-site.xml
/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/src/main/resources/hdfs-site.xml
/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/src/main/resources/log4j.properties
/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/src/main/resources/mapred-site.xml
/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/src/main/resources/yarn-site.xml
into the project
Then write the code

package com.wyd.userdrawmr;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;

public class UserDrawPutHbaseMapReduce {

public static class UserDrawPutHbaseMapper extends Mapper<LongWritable, Text, Text, NullWritable>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(value, NullWritable.get());
}
}

public static class UserDrawPutHbaseReducer extends TableReducer<Text, NullWritable, NullWritable>{
@Override
protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
for(NullWritable value : values){
String[] arr = key.toString().split("[|]");
String rowKey = arr[1];
if(!StringUtils.isEmpty(rowKey)){
Put put = new Put(Bytes.toBytes(rowKey));

// Skip the write-ahead log: faster, but unflushed data is lost if the RegionServer crashes
put.setDurability(Durability.SKIP_WAL);
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("mdn"), Bytes.toBytes(arr[1]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("male"), Bytes.toBytes(arr[2]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("female"), Bytes.toBytes(arr[3]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("age1"), Bytes.toBytes(arr[4]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("age2"), Bytes.toBytes(arr[5]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("age3"), Bytes.toBytes(arr[6]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("age4"), Bytes.toBytes(arr[7]));
put.addColumn(Bytes.toBytes("draw"), Bytes.toBytes("age5"), Bytes.toBytes(arr[8]));

context.write(NullWritable.get(), put);
}
}
}
}

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
System.setProperty("HADOOP_USER_NAME","root");
Configuration conf = HBaseConfiguration.create();
conf.addResource("core-site.xml");
conf.addResource("hdfs-site.xml");
conf.addResource("yarn-site.xml");
conf.addResource("mapred-site.xml");

Job job = Job.getInstance(conf);
job.setJobName("HDFSToHBase");
//job.setJarByClass(UserDrawPutHbaseMapReduce.class);
job.setJar("/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/target/userdrawself-1.0-SNAPSHOT.jar");

// Set the target table name
TableMapReduceUtil.initTableReducerJob("t_draw", UserDrawPutHbaseReducer.class, job);
job.setMapperClass(UserDrawPutHbaseMapper.class);

job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Put.class);

FileInputFormat.addInputPath(job, new Path("userdraw/out2"));

job.waitForCompletion(true);
}
}

truncate 'ns1:calllogs' // disables the table, drops it, then recreates it

-----------------------------------------------------------------------------------------------------

Reading from HBase with MapReduce

package com.wyd.userdrawmr;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ReadHbaseToHdfsMapReduce {

public static class HbaseToHdfsMapper extends TableMapper<Text, Text>{
@Override
protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {

StringBuffer sb = new StringBuffer();
byte[] mdnBytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("mdn"));
if(mdnBytes != null && mdnBytes.length != 0){
sb.append(Bytes.toString(mdnBytes)).append("|");
}
byte[] maleBytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("male"));
if(maleBytes != null && maleBytes.length != 0){
sb.append(Bytes.toString(maleBytes)).append("|");
}
byte[] femaleBytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("female"));
if(femaleBytes != null && femaleBytes.length != 0){
sb.append(Bytes.toString(femaleBytes)).append("|");
}
byte[] age1Bytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("age1"));
if(age1Bytes != null && age1Bytes.length != 0){
sb.append(Bytes.toString(age1Bytes)).append("|");
}
byte[] age2Bytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("age2"));
if(age2Bytes != null && age2Bytes.length != 0){
sb.append(Bytes.toString(age2Bytes)).append("|");
}
byte[] age3Bytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("age3"));
if(age3Bytes != null && age3Bytes.length != 0){
sb.append(Bytes.toString(age3Bytes)).append("|");
}
byte[] age4Bytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("age4"));
if(age4Bytes != null && age4Bytes.length != 0){
sb.append(Bytes.toString(age4Bytes)).append("|");
}
byte[] age5Bytes = value.getValue(Bytes.toBytes("draw"), Bytes.toBytes("age5"));
if(age5Bytes != null && age5Bytes.length != 0){
sb.append(Bytes.toString(age5Bytes));
}

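// Decode the row key bytes; ImmutableBytesWritable.toString() would give a hex dump instead
context.write(new Text(Bytes.toString(key.get(), key.getOffset(), key.getLength())), new Text(sb.toString()));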

}
}

public static class HbaseToHdfsReducer extends Reducer<Text, Text, NullWritable, Text>{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text value:values){
context.write(NullWritable.get(), value);
}
}
}

public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
System.setProperty("HADOOP_USER_NAME","root");

Configuration conf = HBaseConfiguration.create();
conf.addResource("core-site.xml");
conf.addResource("hdfs-site.xml");
conf.addResource("yarn-site.xml");
conf.addResource("mapred-site.xml");

Job job = Job.getInstance(conf);
job.setJar("/Users/wangyadi/IdeaProjects/bigdatatest/userdrawself/target/userdrawself-1.0-SNAPSHOT.jar");
job.setJobName("HbaseToHdfsMapReduce");
//job.setJarByClass(ReadHbaseToHdfsMapReduce.class);

Scan scan = new Scan();

TableMapReduceUtil.initTableMapperJob(Bytes.toBytes("t_draw")
,scan // scan that defines what to read
,HbaseToHdfsMapper.class // mapper class
,Text.class // mapper output key type
,Text.class // mapper output value type
,job
,false);

job.setReducerClass(HbaseToHdfsReducer.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);

FileOutputFormat.setOutputPath(job, new Path("userdraw/out3"));

job.waitForCompletion(true);
}
}

-----------------------------------------------------------------------------------------------------

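HBase tuning

hbase-site.xml settings
hbase.tmp.dir

Temporary directory on the local file system. It mainly matters in local (standalone) mode, but it is still worth setting explicitly because many other paths default to locations underneath it.
Production setting:
<property>
<name>hbase.tmp.dir</name>
<value>/mnt/dfs/11/hbase/hbase-tmp</value>
</property>
Default:
${java.io.tmpdir}/hbase-${user.name}
which writes to the system /tmp directory

hbase.rootdir
Directory shared by all RegionServers in the cluster and used to persist HBase data. It is usually an HDFS path such as hdfs://namenode.example.org:9000/hbase
Production setting:
<property>
<name>hbase.rootdir</name>
<value>hdfs://mycluster/hbase</value>
</property>
Default:
${hbase.tmp.dir}/hbase

hbase.cluster.distributed
Cluster mode: distributed or standalone. When set to false, the HBase and ZooKeeper daemons run in the same JVM.
Production setting: true
Default: false

hbase.zookeeper.quorum
Hosts of the ZooKeeper ensemble, separated by commas (,)
Production setting:
<property>
<name>hbase.zookeeper.quorum</name>
<value>inspurXXX.xxx.xxx.org,inspurXXX.xxx.xxx.org,inspurXXX.xxx.xxx.org,inspurXXX.xxx.xxx.org,inspurXXX.xxx.xxx.org</value>
</property>
Default: localhost

hbase.zookeeper.property.dataDir
Corresponds to dataDir in ZooKeeper's zoo.cfg: the directory where snapshots are stored
Production setting: /home/hadoop/zookeeperData
Default: ${hbase.tmp.dir}/zookeeper

zookeeper.session.timeout
Session timeout between the client and ZooKeeper
Production setting: 1200000 (20 min)
Default: 180000 (3 min)

hbase.zookeeper.property.tickTime
Interval at which the client sends heartbeats to ZooKeeper
Production setting: 6000 (6 s)
Default: 6000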

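hbase.security.authentication
Authentication mechanism for the HBase cluster. The current version supports only Kerberos.
Production setting: kerberos
Default: empty

hbase.security.authorization
Whether HBase authorization is enabled
Production setting: true
Default: false

hbase.regionserver.kerberos.principal
Kerberos principal of the RegionServer (three parts: service or user name, instance name, and realm)
Production setting: hbase/_HOST@HADOOP.xxx.xxx.COM
Default: none

hbase.regionserver.keytab.file
Path of the RegionServer keytab file
Production setting: /home/hadoop/etc/conf/hbase.keytab
Default: none

hbase.master.kerberos.principal
Kerberos principal of the master (three parts: service or user name, instance name, and realm)
Production setting: hbase/_HOST@HADOOP.xxx.xxx.COM
Default: none

hbase.master.keytab.file
Path of the master keytab file
Production setting: /home/hadoop/etc/conf/hbase.keytab
Default: none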

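hbase.regionserver.handler.count
Number of RegionServer threads that handle I/O requests
Production setting: 50
Default: 10

hbase.regionserver.global.memstore.upperLimit
Threshold at which the RegionServer blocks updates and forces flushes: the combined size of all memstores on the node reaches upperLimit * heap size
Production setting: 0.45
Default: 0.4

hbase.regionserver.global.memstore.lowerLimit
One of the conditions that triggers a flush on the RegionServer: the combined size of all memstores on the node reaches lowerLimit * heap size
Production setting: 0.4
Default: 0.35

hbase.client.write.buffer
Client-side write buffer. With autoFlush set to false, the client flushes only when the buffer is full.
Production setting: 8388608 (8 MB)
Default: 2097152 (2 MB)

hbase.hregion.max.filesize
Maximum region size measured per column family. Under ConstantSizeRegionSplitPolicy the region is split automatically once this value is exceeded.
Production setting: 107374182400 (100 GB)
Default: 21474836480 (20 GB)

hbase.hregion.memstore.block.multiplier
When a memstore grows to this multiple of the configured memstore size, all writes to it are blocked as self-protection
Production setting: 8 (can be raised if there is plenty of memory; if blocking does occur, the client side needs to be adjusted)
Default: 2

hbase.hregion.memstore.flush.size
Memstore size at which the memstore is flushed to disk
Production setting: 104857600 (100 MB)
Default: 134217728 (128 MB)

hbase.hregion.memstore.mslab.enabled
Whether MSLAB is enabled. It reduces full GCs caused by heap fragmentation and improves overall performance.
Production setting: true
Default: true

hbase.regionserver.maxlogs
Number of HLog files kept per RegionServer
Production setting: 128
Default: 32

hbase.regionserver.hlog.blocksize
Upper bound on the size of an HLog file; when it is reached the log is rolled
Production setting: 536870912 (512 MB)
Default: the block size configured in HDFS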

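hbase.hstore.compaction.min
Minimum number of store files needed before a minor compaction is queued
Production setting: 10
Default: 3

hbase.hstore.compaction.max
Maximum number of files merged in a single minor compaction
Production setting: 30
Default: 10

hbase.hstore.blockingStoreFiles
When the number of store files in a region reaches this value, writes are blocked until compaction catches up
Production setting: 100 (can be set very high in production)
Default: 7

hbase.hstore.blockingWaitTime
How long writes stay blocked
Production setting: 90000 (90 s)
Default: 90000 (90 s)

hbase.hregion.majorcompaction
Interval between automatic major compactions
Production setting: 0 (major compaction turned off)
Default: 86400000 (1 day)

Command for a manual major compaction: major_compact 'mydatabase:t2'

hbase.regionserver.thread.compaction.large
Number of threads in the large compaction pool
Production setting: 5
Default: 1

hbase.regionserver.thread.compaction.small
Number of threads in the small compaction pool
Production setting: 5
Default: 1

hbase.regionserver.thread.compaction.throttle
Size threshold that decides whether a compaction request (major or minor) goes to the large or the small compaction thread pool
Production setting: 10737418240 (10 GB)
Default: 2 * this.minFilesToCompact * this.region.memstoreFlushSize
The RegionServer has two thread pools, large compactions and small compactions, so that compactions are handled separately by size. This parameter decides which pool handles a given compaction. With the default taken as 2 * hbase.hstore.compaction.max * hbase.hregion.memstore.flush.size, that is 2 * 10 * 128 MB = 2.5 GB: if the total size of the files to be merged is larger than this value the compaction goes to the large compactions pool, otherwise to the small compactions pool. The usual advice is to leave it unchanged or raise it slightly.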
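
The arithmetic behind the 2.5 GB figure, as a small sketch. It follows the formula stated in the explanation (2 * hbase.hstore.compaction.max * hbase.hregion.memstore.flush.size); the class and method names are hypothetical, and this is not HBase's own code.

```java
public class CompactionThrottle {

    public static final long MB = 1024L * 1024L;

    // Threshold per the formula in the notes: 2 * compaction.max * memstore flush size.
    public static long defaultThrottleBytes(int compactionMax, long flushSizeBytes) {
        return 2L * compactionMax * flushSizeBytes;
    }

    // A compaction whose input is larger than the threshold goes to the large pool.
    public static String poolFor(long totalInputBytes, long throttleBytes) {
        return totalInputBytes > throttleBytes ? "large" : "small";
    }

    public static void main(String[] args) {
        long throttle = defaultThrottleBytes(10, 128 * MB);
        System.out.println("throttle in MB: " + throttle / MB);
        System.out.println("1 GB of input goes to: " + poolFor(1024 * MB, throttle));
        System.out.println("4 GB of input goes to: " + poolFor(4096 * MB, throttle));
    }
}
```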

hbase.hstore.compaction.max.size
Maximum size of a store file that is eligible for minor compaction
Production setting: 21474836480 (20 GB)
Default: Long.MAX_VALUE

hbase.rpc.timeout
Timeout for RPC requests
Production setting: 300000 (5 min)
Default: 60000 (1 min)

hbase.regionserver.region.split.policy
Default policy for split operations
Production setting: org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy (the older policy; splits are controlled manually)
Default: org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy (as long as the region has not reached maxFileSize, it is split once its file size reaches regionCount * regionCount * flushSize)

hbase.regionserver.regionSplitLimit
Upper limit on the number of regions on a single RegionServer
Production setting: 150
Default: 2147483647

hbase-env.sh settings
Runtime environment
export JAVA_HOME=/usr/lib/jvm/java-6-sun/ # JDK home
export HBASE_HOME=/home/hadoop/cdh4/hbase-0.94.2-cdh4.2.1 # HBase installation directory
export HBASE_LOG_DIR=/mnt/dfs/11/hbase/hbase-logs # log output directory

JVM parameter tuning

export HBASE_OPTS="-verbose:gc -XX:+PrintGCDetails -Xloggc:${HBASE_LOG_DIR}/hbase-gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime \
-server -Xmx20480m -Xms20480m -Xmn10240m -Xss256k -XX:SurvivorRatio=4 -XX:MaxPermSize=256m -XX:MaxTenuringThreshold=15 \
-XX:ParallelGCThreads=16 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSFullGCsBeforeCompaction=5 -XX:+UseCMSCompactAtFullCollection \
-XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSMaxAbortablePrecleanTime=5000 \
"

-----------------------------------------------------------------------

https://blog.csdn.net/xiaoshunzi111/article/details/69844526

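The main reason for the Memstore is that data stored on HDFS has to be sorted by row key, while HDFS itself is designed for sequential reads and writes and does not allow files to be modified. HBase therefore could not write data efficiently as it arrives, because incoming data is unsorted, which means it would not be organized for later retrieval. To solve this, HBase buffers recently received data in memory (in the Memstore), sorts it before persisting it, and then writes it to HDFS quickly and sequentially. Note that a real HFile is more than a simple sorted list of column data; see Apache HBase I/O – HFile.

Besides solving the ordering problem, the Memstore has other benefits, for example:

It acts as an in-memory cache of recently added data. An obvious case is when newly inserted data is used more often than old data.
Rows/cells can be optimized in memory before they are persisted. For example, when the number of versions is set to 1 for a column family and the Memstore holds several updates to the same cell, only the newest version needs to be written to the HFile and the others can be discarded.
One point needs special attention: every Memstore flush creates a new HFile for each column family. Reads are comparatively simple: HBase first checks whether the requested data is in the Memstore, otherwise looks in the HFiles, and finally returns the merged result to the user.

The first group of settings triggers "normal" flushes, which do not affect concurrent write requests. The settings in this group are: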
hbase.hregion.memstore.flush.size
<property>
<name>hbase.hregion.memstore.flush.size</name>
<value>134217728</value>
<description>
Memstore will be flushed to disk if size of the memstore
exceeds this number of bytes. Value is checked by a thread that runs
every hbase.server.thread.wakefrequency.
</description>
</property>

hbase.regionserver.global.memstore.lowerLimit
<property>
<name>hbase.regionserver.global.memstore.lowerLimit</name>
<value>0.35</value>
<description>Maximum size of all memstores in a region server before
flushes are forced. Defaults to 35% of heap.
This value equal to hbase.regionserver.global.memstore.upperLimit causes
the minimum possible flushing to occur when updates are blocked due to
memstore limiting.
</description>
</property>

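About lowerLimit: it works like upperLimit, except that when global memstore usage reaches 35% of the heap, HBase does not flush every memstore. It picks some of the memstores that use the most memory and flushes those individually; updates are of course still blocked while this happens. lowerLimit is a remedial step before a global flush. Imagine all memstores having to flush within a short period during which no writes can be accepted: the impact on the performance of the HBase cluster would be severe.
Tuning: this is a heap-protection parameter and the default already suits most scenarios. It is usually changed only to support a specific optimization, for example in a read-heavy application where the read cache is enlarged and this value is lowered to free memory for other components.
How does this parameter affect users?
For example, with 10 GB of memory, 100 regions and 64 MB per memstore (assume one memstore per region), the lowerLimit is reached when the 100 memstores are on average a little over half full (0.35 * 10 GB = 3.5 GB out of 100 * 64 MB = 6.4 GB). Suppose the other memstores keep receiving many writes at that moment. Before the large regions have finished flushing, usage may exceed upperLimit, and then all regions are blocked and a global flush begins.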
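
The example above can be checked with a few lines of arithmetic. This is only a sketch: the class and method names are hypothetical, sizes use binary units (1 GB = 1024 MB), and "heap" is taken to be the 10 GB mentioned in the example.

```java
public class MemstoreLowerLimit {

    public static final long MB = 1024L * 1024L;

    // Average fill fraction of each memstore at the moment the global limit is reached.
    public static double averageFillAtLimit(long heapBytes, double limit,
                                            int memstoreCount, long memstoreSizeBytes) {
        double limitBytes = heapBytes * limit;
        double totalCapacityBytes = (double) memstoreCount * memstoreSizeBytes;
        return limitBytes / totalCapacityBytes;
    }

    public static void main(String[] args) {
        long heap = 10L * 1024 * MB; // 10 GB heap from the example
        double fill = averageFillAtLimit(heap, 0.35, 100, 64 * MB);
        System.out.printf("lowerLimit is reached at an average fill of %.0f%%%n", fill * 100);
    }
}
```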

------------------------------------------------------------------

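The second group of settings is mainly about safety. Sometimes the write load on the cluster is so high that writes keep outpacing flushes, and in that case we want the memstores not to grow beyond a safe limit. Writes are then blocked until the memstores shrink back to a manageable size. The settings in this group are: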

hbase.regionserver.global.memstore.upperLimit
<property>
<name>hbase.regionserver.global.memstore.upperLimit</name>
<value>0.4</value>
<description>Maximum size of all memstores in a region server before new
updates are blocked and flushes are forced. Defaults to 40% of heap.
Updates are blocked and flushes are forced until size of all memstores
in a region server hits hbase.regionserver.global.memstore.lowerLimit.
</description>
</property>

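About upperLimit: hbase.hregion.memstore.flush.size flushes a single memstore once it reaches the configured size. A RegionServer, however, may host hundreds or thousands of memstores, and the JVM heap can run out even though no single memstore has reached flush.size. This parameter limits the total memory used by all memstores.
When the memstores on a RegionServer together occupy 40% of the heap, HBase blocks all updates and flushes these memstores to release the memory they hold.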

hbase.hregion.memstore.block.multiplier
<property>
<name>hbase.hregion.memstore.block.multiplier</name>
<value>2</value>
<description>
Block updates if memstore has hbase.hregion.block.memstore
time hbase.hregion.flush.size bytes. Useful preventing
runaway memstore during spikes in update traffic. Without an
upper-bound, memstore fills such that when it flushes the
resultant flush files take a long time to compact or split, or
worse, we OOME.
</description>
</property>

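Explanation: when a memstore in a region exceeds twice the configured single-memstore size, all requests to that region are blocked and a flush is run to free memory. Although a memstore size is configured, say 64 MB, imagine that at 63.9 MB a single 100 MB Put arrives, or the write rate spikes to 10,000 puts in the last second: the memstore would instantly grow far beyond the expected size. This parameter blocks all requests once the memstore grows past 2 times the configured size, to keep the risk from spreading.
Tuning: the default is reasonable. If you expect that your normal workload (excluding anomalies) has no write bursts, or that the write volume is controllable, keep the default. If writes spike regularly even under normal conditions, raise this multiplier and adjust other parameters as well, such as hfile.block.cache.size and hbase.regionserver.global.memstore.upperLimit/lowerLimit, to reserve more memory and keep the HBase server from running out of memory.

The "Lower limit" can be configured closer to the "Upper limit"

In many cases a single column family is the best design.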

--------------------------------------------------------------
hbase.regionserver.hlog.blocksize
hbase.regionserver.maxlogs
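As you may have noticed, the maximum WAL size is determined by hbase.regionserver.maxlogs * hbase.regionserver.hlog.blocksize (2GB by default). Once it is reached, Memstore flushes are triggered. So when you increase the Memstore size or adjust other Memstore settings, you also need to adjust the HLog settings. Otherwise the WAL size limit may be hit first, and you lose the benefit of the optimizations designed for the Memstore. Apart from that, triggering Memstore flushes through the WAL limit is not ideal: it may flush many regions at once, even when writes are well distributed across the cluster, and can easily cause a "flush storm".

It is best to set hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs slightly larger than hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE.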
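
A sketch of this sizing rule as arithmetic. The class and method names are hypothetical. The sample inputs are taken from values that appear in these notes (a 20480 MB heap from -Xmx20480m, lowerLimit 0.4, a 512 MB hlog blocksize), and the default case assumes 32 logs of 64 MB each, which matches the "2GB by default" figure.

```java
public class WalSizing {

    public static final long MB = 1024L * 1024L;

    // Total WAL capacity: maxlogs * hlog blocksize.
    public static long walCapacityBytes(int maxLogs, long hlogBlockSizeBytes) {
        return (long) maxLogs * hlogBlockSizeBytes;
    }

    // Global memstore lower limit in bytes: lowerLimit * heap size.
    public static long memstoreLowerLimitBytes(long heapBytes, double lowerLimit) {
        return (long) (heapBytes * lowerLimit);
    }

    // Smallest maxlogs whose WAL capacity is strictly larger than the memstore lower limit.
    public static int minMaxLogs(long heapBytes, double lowerLimit, long hlogBlockSizeBytes) {
        long limit = memstoreLowerLimitBytes(heapBytes, lowerLimit);
        return (int) (limit / hlogBlockSizeBytes) + 1;
    }

    public static void main(String[] args) {
        System.out.println("default WAL capacity in MB: " + walCapacityBytes(32, 64 * MB) / MB);
        int logs = minMaxLogs(20480 * MB, 0.4, 512 * MB);
        System.out.println("smallest maxlogs for a 20480 MB heap: " + logs);
    }
}
```

A maxlogs value well above the computed minimum, such as the 128 used in the production settings, keeps the WAL limit from becoming the first flush trigger.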

==================================================================
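Installing Phoenix
Copy phoenix-4.14.1-HBase-1.2-server.jar into HBase's lib directory, on the master and on every RegionServer
Start the command-line client: bin/sqlline.py node3:2181,node4:2181,node5:2181

Phoenix table operations
By default Phoenix converts table names and other identifiers to upper case. To keep lower case, use double quotes, e.g. "us_population"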

CREATE TABLE IF NOT EXISTS WEB_STAT (
HOST CHAR(2) NOT NULL,
DOMAIN VARCHAR NOT NULL,
FEATURE VARCHAR NOT NULL,
DATE DATE NOT NULL,
USAGE.CORE BIGINT,
USAGE.DB BIGINT,
STATS.ACTIVE_VISITOR INTEGER
CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)
);

CREATE TABLE IF NOT EXISTS us_population (
STATE CHAR(2) NOT NULL,
city VARCHAR,
population BIGINT,
CONSTRAINT PK PRIMARY KEY (STATE)
);

Insert a record
upsert into us_population values('NY','NewYork',8143197)

Query records
select * from us_population;
select * from us_population where state='NY'
select * from us_population where city='NewYork'

Delete records
delete from us_population where state='NY'
drop table us_population

============================

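Phoenix table mapping
By default, tables created directly in HBase are not visible through Phoenix
To work in Phoenix with a table that was created directly in HBase, the table has to be mapped in Phoenix. There are two kinds of mapping: view mapping and table mapping.

View mapping
Views created by Phoenix are read-only, so they can only be used for queries; the underlying data cannot be modified through the view

Create the table in the hbase shell
create 'test', 'name', 'company'
Insert data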
put 'test','1001','name:firstname','san'
put 'test','1001','name:lastname','zhang'
put 'test','1001','company:name','alibaba'
put 'test','1001','company:address','hangzhou'

Create the mapping view in Phoenix
create view "test"(
"empid" varchar primary key,
"name"."firstname" varchar,
"name"."lastname" varchar,
"company"."name" varchar,
"company"."address" varchar
) as select * from "test"

Drop the mapping
drop view "test"

====================

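Table mapping
There are two ways to create a table mapping for HBase with Apache Phoenix:
1) If the table already exists in HBase, create the mapped table the same way as a view, simply replacing create view with create table.
2) If the table does not exist in HBase, create it directly with create table; the statement can describe the HBase table structure explicitly where needed.
With a mapped table created by create table, changes made to the table also change the source data, and if the mapped table is dropped the source table is dropped too. A view behaves differently: dropping a view leaves the source data unchanged.

Views created by Phoenix are read-only, so they can only be used for queries; the source data cannot be modified through the view. Queries on a view are also slower than on a mapped table, because when a mapped table is created Phoenix adds some empty key-values to the table, and these empty key-values can be used to speed up queries.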

create table "test"(
"empid" varchar primary key,
"name"."firstname" varchar,
"name"."lastname" varchar,
"company"."name" varchar,
"company"."address" varchar
)

Accessing Phoenix through the Java API
Add the Maven dependency

<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-core</artifactId>
<version>4.14.1-HBase-1.2</version>
</dependency>

package com.wyd.phoenixclient;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ClientDemo {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Load the Phoenix JDBC driver
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver");
        Connection conn = DriverManager.getConnection("jdbc:phoenix:node3:2181", "root", "");

        Statement stat = conn.createStatement();

        //stat.execute("create table user(id varchar not null primary key,info.name varchar,info.age integer,info.sex varchar)");

        stat.executeUpdate("upsert into USER values('1001','zhangsan',18,'man')");
        // The commit must be issued explicitly
        conn.commit();
        stat.close();
        conn.close();
    }
}

=====================================================================================================

Using Snappy compression in HBase, after Hadoop has been configured with Snappy support
Configure the environment variables in hbase-env.sh
Add:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/Linux-amd64-64/:/usr/local/lib/
export HBASE_LIBRARY_PATH=$HBASE_LIBRARY_PATH:$HBASE_HOME/lib/native/Linux-amd64-64/:/usr/local/lib/

Restart HBase
Run
hbase org.apache.hadoop.hbase.util.CompressionTest /tmp/testfile snappy
to check that the configuration works

create 'tsnappy', { NAME => 'f', COMPRESSION => 'snappy'}
put 'tsnappy','row1','f:col1','value'
scan 'tsnappy'

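3. Raise the ZooKeeper connection limit on the HBase side

<property>
<name>hbase.zookeeper.property.maxClientCnxns</name>
<value>300</value>
</property>
The default is 30. After changing it I restarted the RegionServers, but it made no difference.
4. Edit the zoo.cfg file under ZooKeeper

#maxClientCnxns=60 (this value matched exactly the number of ESTABLISHED connections seen earlier)
Uncomment the line, change it to 150, and restart ZooKeeper

If a coprocessor throws an error that prevents the RegionServer from starting, set
hbase.coprocessor.abortonerror to false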