Hive分桶表性能测试

Hive数仓 专栏收录该内容
16 篇文章 0 订阅

环境说明:
1、源表:orc格式 按照天、小时分区  
2、目标表(分桶表)按天分区 按照event_id分100个桶 数据源来源于源表使用insert into select写入
3、一天的数据量约1.7亿+
4、查询时使用spark thriftserver 资源非固定

目的:测试Hive分桶表的在大数据量下的查询性能

当前spark2.4.0版本不支持spark写入桶表 https://issues.apache.org/jira/browse/SPARK-19256

1、样本数据:1天

桶表(SQL)

查询耗时(秒)

原始表 (SQL)

查询耗时(秒)

select 
app_id,user_id,event_id,page_id,page_time
from 
tmp.track_event_bucket
where pt='2020-12-09' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
35s
select 
app_id,user_id,event_id,page_id,page_time
from 
behavior.track_event
where pt='2020-12-09' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
37s

 

select 
*
from 
tmp.track_event_bucket
where pt='2020-12-09' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'

 

47s
select 
*
from 
behavior.track_event
where pt='2020-12-09' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
58s
select 
* 
from 
tmp.track_event_bucket tablesample(bucket 1 out of 5 )
where pt ='2020-12-09' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
limit 100;
49.8s  

 

2、样本数据:1个月

桶表(SQL)

查询耗时(秒)

原始表 (SQL)

查询耗时(秒)

提升

select
*
from

tmp.track_event_bucket

where pt between '2020-11-01' and '2020-11-30'
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'

261.8sselect
*
from
behavior.track_event
where pt between '2020-11-01' and '2020-11-30'
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
368.3s29%

select
app_id,user_id,event_id,page_id,page_time
from
tmp.track_event_bucket
where pt between '2020-11-01' and '2020-11-30'
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'

 

96.9s

select
app_id,user_id,event_id,page_id,page_time
from
behavior.track_event
where pt between '2020-11-01' and '2020-11-30' and app_id ='11205'

and event_id='OP_E_H5_CLICK_20200513_9415'

262.142s63%
--桶表取样1 20个桶
select 
app_id,user_id,event_id,page_id,page_time
from 
tmp.track_event_bucket tablesample(bucket 1 out of 5 )
where  pt between '2020-11-01' and '2020-11-30' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
160.89s   
--桶表取样2 10个桶
select 
app_id,user_id,event_id,page_id,page_time
from 
tmp.track_event_bucket tablesample(bucket 1 out of 10 )
where  pt between '2020-11-01' and '2020-11-30' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
106.87s   
--桶表取样3 5个桶
select 
app_id,user_id,event_id,page_id,page_time
from 
tmp.track_event_bucket tablesample(bucket 1 out of 20 )
where  pt between '2020-11-01' and '2020-11-30' 
and app_id ='11205'
and event_id='OP_E_H5_CLICK_20200513_9415'
142.58s   

 

3、结论

  • 除非知道所查询的数据在哪个桶,否则取样桶的个数递减查询时间并非跟着减少。
  • 桶表和非桶表在查询一天的数据时耗时相差不明显。
  • 桶表相对于hive orc表的指定字段查询在大范围数据查询中有约50%的性能提升,使用全列查询也能约提升30%。
  • 0
    点赞
  • 0
    评论
  • 0
    收藏
  • 一键三连
    一键三连
  • 扫一扫,分享海报

©️2021 CSDN 皮肤主题: 编程工作室 设计师:CSDN官方博客 返回首页
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值