HBase Phoenix Pitfalls

humorrr

Category: Big Data

Article link: https://blog.csdn.net/qq_36561105/article/details/105779411

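We use HBase at work. Because the native HBase API is low-level and inconvenient to use, we adopted Phoenix on top of it. These notes record the pitfalls encountered along the way.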

·················································································································

Things to watch out for when using Phoenix

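I. In SQL statements, write table names and column names in upper case where possible. Phoenix converts unquoted identifiers to upper case, so a lower-case name must be wrapped in double quotes, and HBase itself is case-sensitive.
II. Connection pooling: Phoenix does not recommend pooling connections. A Phoenix connection is a thin object over the underlying HBase connection and is cheap to create, and a used connection should not be shared with another caller, so close each Connection when you are done with it. The official FAQ entry "Should I pool Phoenix JDBC Connections?" says: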

Phoenix’s Connection objects are different from most other JDBC Connections due to the underlying HBase connection. The Phoenix Connection object is designed to be a thin object that is inexpensive to create. If Phoenix Connections are reused, it is possible that the underlying HBase connection is not always left in a healthy state by the previous user. It is better to create new Phoenix Connections to ensure that you avoid any potential issues.

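III. A newly created HBase table has a single Region managed by one Region Server by default. Once the data grows past a threshold, the Region splits, and repeated splits produce more Regions that are managed by different Region Servers. Each Region holds one contiguous range of row keys, bounded by a start row key and an end row key. This causes two problems:

The cluster's parallelism is wasted until the Region has split into several on its own, which can take a long time.
Because each Region holds a contiguous row key range, reads and writes that are not random enough (auto-increment IDs, or heavy traffic on one row key range) can put all the load on a single Region.
The Region split policy is configured in hbase-site.xml: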

<property>
<name>hbase.regionserver.region.split.policy</name>
<value>org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy</value>
<description>
  A split policy determines when a region should be split. The various other split policies that
  are available currently are ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy,
  DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy etc.
</description>
</property>

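The default policy is IncreasingToUpperBoundRegionSplitPolicy. In HBase 1.2 its default behavior is to split a Region when its size reaches the cube of the Region count (the number of Regions of the same table on that Region Server) times hbase.hregion.memstore.flush.size (128 MB by default) times 2, or when it reaches hbase.hregion.max.filesize (10 GB by default), whichever is smaller.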

First split threshold: 1^3 * 128MB * 2 = 256MB
Second split threshold: 2^3 * 128MB * 2 = 2048MB
Third split threshold: 3^3 * 128MB * 2 = 6912MB
Fourth split threshold: 4^3 * 128MB * 2 = 16384MB, which exceeds 10GB, so 10GB is used
Every later split threshold is 10GB

So on a cluster with many nodes it can take a long time before all of them are put to work, which is why tables should be pre-split.
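The threshold arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the rule as described in these notes (cube of the Region count, times the flush size, times 2, capped at the maximum file size); the function name `split_threshold` is made up for illustration and is not an HBase API.

```python
MB = 1024 * 1024
GB = 1024 * MB


def split_threshold(region_count, flush_size=128 * MB, max_filesize=10 * GB):
    """Size at which a Region splits, following the rule described above."""
    return min(region_count ** 3 * flush_size * 2, max_filesize)


# Thresholds in MB for Region counts 1 through 5.
thresholds_mb = [split_threshold(n) // MB for n in range(1, 6)]
print(thresholds_mb)  # [256, 2048, 6912, 10240, 10240]
```

From the fourth split onward the cap of 10 GB applies, which matches the list above.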

IV. Salting in Phoenix

CREATE TABLE IF NOT EXISTS Product (
    id           VARCHAR not null,
    time         VARCHAR not null,
    price        FLOAT,
    sale         INTEGER,
    inventory    INTEGER,

    CONSTRAINT pk PRIMARY KEY (id, time)
) COMPRESSION = 'GZ', SALT_BUCKETS = 6

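In essence, Phoenix hashes the HBase row key, takes the result modulo SALT_BUCKETS, and inserts that value (0 to 5 in the example above) as a single byte at the front of the row key; rows are then assigned to Regions by this byte. Because it is stored as one byte, SALT_BUCKETS can be at most 256. Rows with the same salt byte land on the same Region Server, so SALT_BUCKETS is usually set to the number of Region Servers.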
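As a rough illustration of the mechanism, the sketch below prepends a salt byte computed as hash modulo bucket count. The hash used here (`zlib.crc32`) is a stand-in chosen for the example and is an assumption; it is not the hash Phoenix uses internally, and `salted_key` is a hypothetical helper name.

```python
import zlib
from collections import Counter


def salted_key(row_key: bytes, salt_buckets: int) -> bytes:
    """Prepend one salt byte (hash of the key modulo salt_buckets)."""
    if not 1 <= salt_buckets <= 256:
        raise ValueError("salt_buckets must be between 1 and 256")
    # Stand-in hash for illustration only; Phoenix uses its own hash function.
    salt = zlib.crc32(row_key) % salt_buckets
    return bytes([salt]) + row_key


# Sequential keys, which would otherwise all land in one Region.
keys = [f"id{i:05d}".encode() for i in range(1000)]
per_bucket = Counter(salted_key(k, 6)[0] for k in keys)
print(sorted(per_bucket.items()))
```

The printed counts show how many of the sequential keys fall into each of the six buckets under this stand-in hash.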

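Because salted rows carry an extra leading byte, rows fetched from different Region Servers do not come back in the original row key order by default. If that order is required, change this setting: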

phoenix.query.force.rowkeyorder = true
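To see why the order is lost and what restoring it involves, consider the toy example below. Each bucket is sorted on its own, but reading the buckets one after another does not give a globally sorted result; merging the sorted streams does. This is only an illustration of the idea with made-up keys, not a description of how Phoenix implements the setting.

```python
import heapq


def merge_buckets(buckets):
    """Merge per-bucket results (each already sorted) into one sorted list."""
    return list(heapq.merge(*buckets))


# Hypothetical scan results from three salt buckets, salt byte already stripped.
bucket_results = [
    ["a03", "a05"],
    ["a01", "a04"],
    ["a02", "a06"],
]

concatenated = [key for bucket in bucket_results for key in bucket]
merged = merge_buckets(bucket_results)

print(concatenated)  # bucket-by-bucket order
print(merged)        # original row key order
```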

SQL syntax rules

1. Inserting data

Phoenix has no INSERT statement; it uses UPSERT instead. UPSERT comes in two forms:
upsert into and upsert select

upsert into:
Similar to insert into; used to insert individual rows of external data
upsert into tb values('ak','hhh',222)
upsert into tb(stat,city,num) values('ak','hhh',222)

upsert select:
Similar to insert select in Hive; used to bulk-insert rows from another table.
upsert into tb1 (state,city,population) select state,city,population from tb2 where population < 40000;
upsert into tb1 select state,city,population from tb2 where population > 40000;
upsert into tb1 select * from tb2 where population > 40000;
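Note: unlike in a traditional database, inserts in Phoenix never produce duplicate rows.
Phoenix is built on HBase, so every table must have a primary key.
A later write overwrites an earlier one with the same key, but with a different timestamp.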
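The overwrite behavior can be modeled with a small in-memory table keyed by primary key. The class `TinyTable` and its methods are hypothetical and exist only to illustrate the semantics described in the note; the integer counter stands in for the cell timestamp.

```python
import itertools


class TinyTable:
    """Toy model of upsert: one row per primary key, newest write wins."""

    def __init__(self):
        self._rows = {}
        self._clock = itertools.count(1)  # stand-in for the write timestamp

    def upsert(self, key, **columns):
        ts = next(self._clock)
        row = self._rows.setdefault(key, {})
        for name, value in columns.items():
            row[name] = (value, ts)

    def select(self, key):
        return {name: value for name, (value, _) in self._rows[key].items()}

    def timestamp(self, key, column):
        return self._rows[key][column][1]

    def row_count(self):
        return len(self._rows)


table = TinyTable()
table.upsert(("ak", "juneau"), population=30000)
table.upsert(("ak", "juneau"), population=40711)  # same key: overwrites

print(table.row_count())               # 1
print(table.select(("ak", "juneau")))  # {'population': 40711}
```

The second upsert does not add a row; it replaces the value and carries a newer timestamp.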

2. Deleting data
delete from tb; -- removes every row; truncate table tb cannot be used in Phoenix
delete from tb where city = 'kenai';
drop table tb; -- drops the table
delete from system.catalog where table_name = 'int_s6a';
drop table if exists tb;
drop table my_schema.tb;
drop table my_schema.tb cascade; -- drops the table together with all views based on it

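3. Updating data
Because of how HBase keys work, writing to an existing rowkey simply overwrites its contents, which amounts to an update.
Updates in Phoenix are therefore also done with upsert into and upsert select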
upsert into us_population (state,city,population) values('ak','juneau',40711);

4. Querying data
union all, group by, order by and limit are all supported
select * from test limit 1000;
select * from test limit 1000 offset 100;
select full_name from sales_person where ranking >= 5.0 union all select reviewer_name from customer_review where score >= 8.0

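5. Phoenix has no concept of a database; by default all tables live in the same namespace. Multiple namespaces are supported, however.
When the property below is set to true, a table created with a schema is mapped to a namespace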
<property>
<name>phoenix.schema.isNamespaceMappingEnabled</name>
<value>true</value>
</property>
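The effect on the underlying HBase table name can be sketched as follows. The helper `hbase_table_name` is hypothetical, and the naming rule it encodes (schema and table joined by a colon when mapping is enabled, by a dot otherwise) is stated here as my understanding rather than something taken from this article, so verify it against your Phoenix version.

```python
def hbase_table_name(phoenix_name: str, namespace_mapping_enabled: bool) -> str:
    """Return the assumed HBase table name for a Phoenix SCHEMA.TABLE name."""
    schema, dot, table = phoenix_name.partition(".")
    if not dot:
        # No schema part: the table stays in the default namespace.
        return phoenix_name
    if namespace_mapping_enabled:
        return f"{schema}:{table}"  # HBase namespace:table notation
    return phoenix_name             # plain table whose name contains a dot


print(hbase_table_name("MY_SCHEMA.TB", True))
print(hbase_table_name("MY_SCHEMA.TB", False))
```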

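6. Creating tables
A. SALT_BUCKETS (salting)
Salting can significantly improve read and write performance by pre-splitting data across multiple regions.
In HBase terms, a system-generated byte is placed in the first position of the rowkey byte array.
That byte is computed by hashing the rowkey byte array built from the primary key.
Once salted, data is spread over different regions, which helps Phoenix read and write in parallel.

SALT_BUCKETS accepts values from 1 to 256: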
create table test(host varchar not null primary key, description varchar)salt_buckets=16;

upsert into test (host,description) values ('192.168.0.1','s1');
upsert into test (host,description) values ('192.168.0.2','s2');
upsert into test (host,description) values ('192.168.0.3','s3');

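A salted table automatically prefixes every rowkey with one byte, so a run of consecutive rowkeys is physically spread over different regions when stored.
Reads and writes that target that key range are then no longer concentrated on a single region.
In short, when a salted table is created through Phoenix, every write to it
gets one byte prepended to the original rowkey (derived from the rowkey's hash, so neighboring rowkeys usually get different bytes), which lets consecutive rowkeys spread evenly over multiple regions.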

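B. Pre-split
Salting pre-splits the table automatically, but sometimes you need to control exactly how the table is split.
When creating a Phoenix table you can specify the exact values to split on, for example: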
create table test (host varchar not null primary key, description varchar) split on ('cs','eu','na');
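Three split points produce four regions, and a row lands in the region whose key range contains it. The sketch below mimics that lookup with a binary search over the split points; `region_index` is a hypothetical helper, and the host names are made up.

```python
from bisect import bisect_right

# The split points from the statement above, as raw bytes.
SPLIT_POINTS = [b"cs", b"eu", b"na"]


def region_index(row_key: bytes, split_points=SPLIT_POINTS) -> int:
    """Index of the region whose [start, end) key range contains row_key."""
    return bisect_right(split_points, row_key)


for host in [b"asia-host1", b"cs-host7", b"eu-west-3", b"na-east-1"]:
    print(host.decode(), "-> region", region_index(host))
```

Keys below 'cs' go to region 0, and a key equal to a split point starts the next region, since a region's start key is inclusive.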

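C. Using multiple column families
Each column family keeps its related data in its own files, so defining multiple column families in Phoenix can improve query performance.
Creating two column families: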
create table test (
mykey varchar not null primary key,
a.col1 varchar,
a.col2 varchar,
b.col3 varchar
);
upsert into test values ('key1','a1','b1','c1');
upsert into test values ('key2','a2','b2','c2');

D. Using compression
create table test (host varchar not null primary key, description varchar) compression='snappy';

7. Creating and dropping views
create view "my_hbase_table"( k varchar primary key, "v" unsigned_long) default_column_family='a';
create view my_view ( new_col smallint ) as select * from my_table where k = 100;
create view my_view_on_view as select * from my_view where new_col > 70
create view v1 as select * from test where description in ('s1','s2','s3')

drop view my_view
drop view if exists my_schema.my_view
drop view if exists my_schema.my_view cascade
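The double quotes around "my_hbase_table" and "v" in the first view statement matter because Phoenix upper-cases unquoted identifiers and keeps quoted ones as written. The function below is a simplified model of that rule for illustration; `normalize_identifier` is a hypothetical name and it ignores escaping and other parser details.

```python
def normalize_identifier(name: str) -> str:
    """Simplified model: quoted names keep their case, others are upper-cased."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        return name[1:-1]
    return name.upper()


for ident in ['my_view', '"my_hbase_table"', 'k', '"v"']:
    print(ident, "->", normalize_identifier(ident))
```

So an existing lower-case HBase table or column qualifier can only be matched by a quoted identifier.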

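8. Creating secondary indexes
Secondary indexes can be built on both mutable data and immutable data (data that is never updated after insertion)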
create index my_idx on sales.opportunity(last_updated_date desc)
create index my_idx on log.event(created_date desc) include (name, payload) salt_buckets=10
create index if not exists my_comp_idx on server_metrics ( gc_time desc, created_date desc )
data_block_encoding='none',versions=?,max_filesize=2000000 split on (?, ?, ?)
create index my_idx on sales.opportunity(upper(contact_name))
create index test_index on test (host) include (description);

Dropping indexes:
drop index my_idx on sales.opportunity
drop index if exists my_idx on server_metrics
drop index if exists xdgl_acct_fee_index on xdgl_acct_fee

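Tables are mutable by default; an immutable table must be declared explicitly
create table hao2 (k varchar primary key, v varchar) immutable_rows=true;
alter table HAO2 set IMMUTABLE_ROWS = false; -- switch back to mutable
alter index index1 on tb rebuild; -- a rebuild empties the index table and repopulates it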

Global Indexing: suited to read-heavy, write-light workloads with few filter conditions
create index my_index on items(price);
Usage:
1. Forcing an index with a hint
select /*+ index(items my_index) */ * from items where price=0.8824734;
drop index my_name on usertable;

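2. Covered Indexes: use include to add the columns that the query needs to return.
create index index1_c on hao1 (age) include(name); -- name is now stored in the index table as well
For select name from hao1 where age=2, the query is answered from the index alone and is fastest.
For select * from hao1 where age =2, the other columns are not in the index table, so a full table scan is performed.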
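Whether a query can be served from a covered index comes down to a subset check: every column the query touches must be available in the index. The sketch below illustrates this. The schema of hao1 (primary key id, plus name, age and address) is an assumption made up for the example, and `index_covers` is a hypothetical helper.

```python
def index_covers(query_columns, indexed, included, primary_key):
    """True if every column the query references is stored in the index."""
    # Primary key columns of the data table are part of each index row too.
    available = set(indexed) | set(included) | set(primary_key)
    return set(query_columns) <= available


# Assumed schema for hao1: primary key id, plus name, age, address.
ALL_COLUMNS = ["id", "name", "age", "address"]
index1_c = {"indexed": ["age"], "included": ["name"], "primary_key": ["id"]}

# select name from hao1 where age=2
print(index_covers(["name", "age"], **index1_c))  # True
# select * from hao1 where age=2
print(index_covers(ALL_COLUMNS, **index1_c))      # False
```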

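Local Indexing: suited to write-heavy, read-light workloads. The index can be used even when the query references columns that are not in it, and the index data is stored on the same machine as the table data.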
create local index index3_l_name on hao1 (name);

Asynchronous index creation: the index table is created empty, and a separate command-line tool is run to populate it
create index index1_c on hao1 (age) include(name) async;
hbase org.apache.phoenix.mapreduce.index.IndexTool \
  --schema my_schema --data-table my_table --index-table async_idx \
  --output-path async_idx_hfiles
