Trying Impala and HBase Integration for ETL (Part 3)

fishhunter

Categories: impala, hbase

Article link: https://blog.csdn.net/fishhunter/article/details/50328005

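I. Performance Verification

If this approach is really going to be used in production, the following scenarios need to be verified:

- Forward operation: bulk-loading or updating HBase records from Impala via SQL insert

- Reverse operation: exporting an HBase table into Impala as a table that can be analyzed and aggregated

If these scenarios do not meet performance requirements, the approach is hard to use for ETL in production and is limited to small, localized batch updates.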

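1. Sample data preparation

To simulate a large data volume, the table is widened to 200 columns and populated with 10 million records as a full table.

(1) Create the target table in HBase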

create 'cust_full', 'cf_01', 'cf_02'

(2) Create a table with 10 million records in Impala

CREATE TABLE cust_full(
  cust_id string,
  col_01_001 string,
  col_01_002 string,
  col_01_003 string,
  ...
  col_02_001 string,
  col_02_002 string,
  col_02_003 string,
  col_02_004 string,
  ...
)
row format delimited fields terminated by '|' lines terminated by '\n'
stored as textfile;

Then upload the prepared file to the table's HDFS directory so that it becomes the table data.

This gives the base Impala table: 10 million records, 200 columns.
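
The article does not show how the sample file was produced. As a minimal sketch, a '|'-delimited file matching the table layout could be generated as below. The ID range (10-digit IDs starting at 1000000000) is an assumption chosen to be consistent with the split keys used later in this article, and the filler values and the function names `make_row` and `write_rows` are hypothetical.

```python
import io

NUM_COLUMNS = 200        # cust_id plus 199 value columns (assumed breakdown of the 200 fields)
BASE_ID = 1_000_000_000  # assumed: fixed-width 10-digit IDs


def make_row(i, num_columns=NUM_COLUMNS):
    """Return one record as a list of strings: cust_id followed by filler values."""
    cust_id = str(BASE_ID + i)
    values = [f"v{col:03d}_{i}" for col in range(1, num_columns)]
    return [cust_id] + values


def write_rows(out, start, count):
    """Write `count` rows, starting at offset `start`, as '|'-delimited lines."""
    for i in range(start, start + count):
        out.write("|".join(make_row(i)) + "\n")


if __name__ == "__main__":
    # Tiny demo in memory; the article's table would need count=10_000_000
    # written to a real file before uploading it to HDFS.
    buf = io.StringIO()
    write_rows(buf, 0, 3)
    print(buf.getvalue().splitlines()[0][:40])
```

Writing in chunks (several calls with different `start` values) keeps memory use flat and lets the file be produced in parts.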

2. Verifying Impala write performance into HBase

Now try to insert this large table into HBase, simulating the write path for wide-table data.

[bd-131:21000] > insert into hbase_cust_full select * from cust_full ;

Query: insert into hbase_cust_full select * from cust_full

WARNINGS:

RetriesExhaustedWithDetailsException: Failed 172 actions: RegionTooBusyException: 172 times,

RetriesExhaustedWithDetailsException: Failed 86 actions: RegionTooBusyException: 86 times,

RetriesExhaustedWithDetailsException: Failed 172 actions: RegionTooBusyException: 172 times,

RetriesExhaustedWithDetailsException: Failed 86 actions: RegionTooBusyException: 86 times,

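This error arises because HBase performs region splits while the data is being loaded, and a split blocks writes. It is a fairly common problem in HBase development.

Try optimizing how the HBase table is created by pre-splitting its regions: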

disable 'cust_full'

drop 'cust_full'

create 'cust_full',{METHOD => 'table_att', MAX_FILESIZE => '6442450944'}, { NAME => 'cf_01'}, { NAME => 'cf_02'},{SPLITS => ['1001000000','1002000000','1003000000','1004000000','1005000000','1006000000','1007000000','1008000000','1009000000']}
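
Split keys like the ones above can be computed rather than typed by hand. The sketch below assumes the row keys are fixed-width numeric strings spread evenly over a known ID range (here 1000000000 to 1009999999, which is an assumption, not something the article states); `even_split_keys` and `hbase_create_statement` are hypothetical helper names.

```python
def even_split_keys(first_id, last_id, regions):
    """Return regions-1 split keys that divide [first_id, last_id] into equal ranges."""
    if regions < 2:
        return []
    span = last_id - first_id + 1
    width = len(str(last_id))  # keep keys fixed-width so string order matches numeric order
    return [str(first_id + span * k // regions).zfill(width) for k in range(1, regions)]


def hbase_create_statement(table, families, splits):
    """Render an HBase shell `create` command with pre-split regions."""
    fams = ", ".join("{NAME => '%s'}" % f for f in families)
    keys = ",".join("'%s'" % s for s in splits)
    return "create '%s', %s, {SPLITS => [%s]}" % (table, fams, keys)


if __name__ == "__main__":
    keys = even_split_keys(1_000_000_000, 1_009_999_999, 10)
    print(hbase_create_statement("cust_full", ["cf_01", "cf_02"], keys))
```

Even ranges only balance the load if the keys really are uniformly distributed. If they are skewed, the split points would have to come from the data itself, for example from sampled quantiles of cust_id.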

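During the insert the same timeout problem still occurred and the statement failed; about 1.16 million records were actually inserted, in more than 20 minutes. So there are hidden risks here:

- The precision of the pre-split key ranges

- The instability of data loading

Unless a program controls the timeouts and waits, it is hard to make the load succeed in a single attempt.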
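
A minimal sketch of such programmatic control is a retry wrapper with exponential backoff. The `action` callable is hypothetical: it is assumed to submit one INSERT through whatever client is in use and to raise an exception when the statement fails or reports errors such as RegionTooBusyException. No real Impala client API is shown here.

```python
import time


def run_with_retry(action, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds or the attempts run out.

    Waits base_delay * 2**(attempt - 1) seconds after each failed attempt.
    Returns the action's result; re-raises the last error if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_insert():
        # Stand-in for a real INSERT: fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RuntimeError("RegionTooBusyException")
        return "inserted"

    print(run_with_retry(flaky_insert, base_delay=0.01))
```

Re-running a partially applied insert should be tolerable here because HBase writes are keyed by row key, so a repeated write replaces the earlier value rather than adding a duplicate row. Whether waiting is enough to get past the busy regions on this cluster is not something the article tested.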

Try writing only some of the columns (1 primary key + 2 columns):

[bd-131:21000] > insert into hbase_cust_full(cust_id,col_01_001,col_01_002) select cust_id,col_01_001,col_01_002 from cust_full ;

Query: insert into hbase_cust_full(cust_id,col_01_001,col_01_002) select cust_id,col_01_001,col_01_002 from cust_full

Inserted 10000000 row(s) in 136.67s

Try writing more columns (1 primary key + 10 columns):

[bd-131:21000] > insert into hbase_cust_full(cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010) select cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010 from cust_full ;

Query: insert into hbase_cust_full(cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010) select cust_id,col_01_001,col_01_002,col_01_003,col_01_004,col_01_005,col_01_006,col_01_007,col_01_008,col_01_009,col_01_010 from cust_full

WARNINGS:

RetriesExhaustedWithDetailsException: Failed 1024 actions: IOException: 1024 times,

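3. Verifying HBase export performance into Impala

Read-back performance check: write the roughly 1.16 million records that were just inserted back into an Impala table.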

[bd-131:21000] > create table tt_1 as select * from hbase_cust_full ;
Query: create table tt_1 as select * from hbase_cust_full
+-------------------------+
| summary                 |
+-------------------------+
| Inserted 1164460 row(s) |
+-------------------------+
Fetched 1 row(s) in 576.32s

This takes about 10 minutes, so exporting is expensive as well. Full scans have always been a weak point of key/value stores.
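
To put the measurements on one scale, the timings reported in this article can be converted to rows per second. The sketch below uses only the figures quoted above. For the failed full-width insert the article gives only "more than 20 minutes", so 1200 s is used as a lower bound on the time and the result is an upper bound on throughput.

```python
def rows_per_second(rows, seconds):
    """Average throughput of one run."""
    return rows / seconds


# (rows, seconds) taken from the runs reported in this article.
MEASUREMENTS = {
    "insert, key + 2 columns (Impala -> HBase)": (10_000_000, 136.67),
    "insert, all columns, failed run (upper bound)": (1_164_460, 1200.0),
    "export, all columns (HBase -> Impala)": (1_164_460, 576.32),
}

if __name__ == "__main__":
    for name, (rows, secs) in MEASUREMENTS.items():
        print(f"{name}: {rows_per_second(rows, secs):,.0f} rows/s")
```

The narrow insert runs at roughly 73,000 rows/s, while the full-width export manages only about 2,000 rows/s and the full-width insert stays below about 1,000 rows/s. Note that the rows in the wide runs carry far more columns, so rows per second alone overstates the gap in bytes moved.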

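4. Preliminary conclusion

On the current test environment (a 1+3 configuration), the outlook for this approach is not encouraging. There are of course other optimizations, such as implementing the column updates in program code and adding a retry-on-error mechanism, but if the SQL layer cannot carry the workload well on its own, the problem becomes more complicated.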
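
One way such program-level control might look is to break the single large INSERT into one statement per row-key range, so that a failure affects only one range and can be rerun alone. The sketch below only generates the SQL text; the table and column names are taken from this article, the helper name `range_insert_statements` is hypothetical, and whether smaller batches actually avoid the RegionTooBusyException was not tested here.

```python
def range_insert_statements(split_keys, target="hbase_cust_full",
                            source="cust_full", key="cust_id"):
    """Yield one INSERT ... SELECT per key range delimited by `split_keys`.

    Ranges are half-open: [low, high). The first range has no lower bound
    and the last has no upper bound, so together they cover every row once.
    """
    bounds = [None] + list(split_keys) + [None]
    for low, high in zip(bounds, bounds[1:]):
        conds = []
        if low is not None:
            conds.append(f"{key} >= '{low}'")
        if high is not None:
            conds.append(f"{key} < '{high}'")
        where = (" where " + " and ".join(conds)) if conds else ""
        yield f"insert into {target} select * from {source}{where}"


if __name__ == "__main__":
    for stmt in range_insert_statements(["1001000000", "1002000000"]):
        print(stmt)
```

Using the same boundaries as the HBase pre-split keys means each statement writes mostly to a single region, which makes it easier to see which region is the slow one. The string comparison on cust_id relies on the keys being fixed-width.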
