Hive Performance Optimization: Data Skew

仙道Bob

Article link: https://blog.csdn.net/jsbylibo/article/details/96117265

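Hive optimizations fall into two groups: those related to joins and those unrelated to joins. In practice, join-related optimizations account for most of the work, and they in turn split into join problems that a mapjoin can solve and join problems that it cannot.

1 Data Skew

The term "skew" comes from the skewed distribution in statistics. Put simply, the keys in the data are distributed very unevenly, so a few keys carry a very large number of rows while most carry very few.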
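
To make the definition concrete, the following Python sketch counts how many rows each reducer would receive when rows are routed by key. The key names, the row counts, and the byte-sum partition function are illustrative assumptions; Hive's real partitioner hashes keys differently, but a single hot key has the same effect.

```python
def partition(key, num_reducers):
    """Toy stand-in for a hash partitioner: equal keys always map to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer_loads(keys, num_reducers):
    """Return a list where entry r is the number of rows routed to reducer r."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition(key, num_reducers)] += 1
    return loads


# Hypothetical data: one hot key with 9,000 rows plus 1,000 keys with one row each.
rows = ["hot_seller"] * 9000 + [f"seller_{i}" for i in range(1000)]
loads = reducer_loads(rows, num_reducers=4)
print(loads)  # one reducer ends up with at least 9,000 of the 10,000 rows
```

However many reducers are added, all rows of the hot key still land on one of them, so the job finishes only when that reducer does.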

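2 Hive Optimization

2.1 General Optimization

2.1.1 select * optimization
Avoid writing select * from your_table. Select only the columns you actually use, for example select col1, col2 from your_table. Likewise, put filter conditions in the where clause wherever possible to drop irrelevant rows, which reduces the amount of data the whole MapReduce job has to process and shuffle.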

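2.2 Optimizations Unrelated to Joins

2.2.1 Skew caused by group by
Skew from group by arises when the values of the group by columns are unevenly distributed. The fix is simple: set the following parameters.

-- partial aggregation on the map side
set hive.map.aggr=true;
-- load balancing when the data is skewed
set hive.groupby.skewindata=true;

With these settings Hive balances the load when the data is skewed, and the generated query plan contains two MapReduce jobs. In the first job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result. Rows with the same group by key may therefore land on different reducers, which is what balances the load. The second job then distributes the pre-aggregated results by group by key (which guarantees that rows with the same key reach the same reducer) and completes the final aggregation.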
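
The two-job plan can be imitated in a few lines of Python. This is only a simulation of the idea, not Hive's implementation: the function name two_stage_count, the seed, and the sample keys are assumptions made for illustration.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Count rows per key in two stages, imitating a two-job aggregation plan."""
    rng = random.Random(seed)

    # Job 1: each row goes to a random reducer, so a hot key is split across all of them.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    stage1_loads = [sum(p.values()) for p in partials]

    # Job 2: partial results are routed by key and merged. Every job-1 reducer
    # contributes at most one record per key, so the hot key is now cheap.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return final, stage1_loads


keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
final, stage1_loads = two_stage_count(keys, num_reducers=4)
print(final["hot"], stage1_loads)
```

The first stage sees roughly equal loads even though one key owns 90% of the rows, and the second stage only has to merge a handful of partial counts per key.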

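2.2.2 count distinct optimization
Be careful with count distinct in Hive SQL, because it easily causes performance problems. For example:

select count(distinct user_id) from your_table;

Because the values must be de-duplicated globally, Hive sends all of the map output to a single reduce task, which easily becomes a bottleneck. The query can be rewritten as:

select count(1) from (select user_id from your_table group by user_id) tmp;

How it works: group by does the de-duplication, spread across many reducers, and the outer query then counts the resulting rows.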
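
The equivalence of the two queries can be checked locally. The sketch below uses Python's built-in sqlite3 module rather than Hive, with a hypothetical visits table and user_id column, so it only demonstrates the SQL semantics, not the performance difference. It also shows one caveat: count(distinct ...) ignores NULL, while group by yields a NULL group that the outer count includes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table visits (user_id text)")
conn.executemany(
    "insert into visits values (?)",
    [("u1",), ("u2",), ("u1",), ("u3",), ("u2",)],
)

DIRECT = "select count(distinct user_id) from visits"
REWRITTEN = "select count(1) from (select user_id from visits group by user_id) tmp"

direct = conn.execute(DIRECT).fetchone()[0]
rewritten = conn.execute(REWRITTEN).fetchone()[0]
print(direct, rewritten)  # 3 3

# With a NULL user_id the two forms differ by one.
conn.execute("insert into visits values (NULL)")
direct_null = conn.execute(DIRECT).fetchone()[0]
rewritten_null = conn.execute(REWRITTEN).fetchone()[0]
print(direct_null, rewritten_null)  # 3 4
```

If the column can contain NULL, adding where user_id is not null to the inner query keeps the rewritten form consistent with count(distinct ...).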

2.3 Joining a Large Table with a Small Table
To query one day's member transactions by age, you would typically write:

select b.age,count(1)
from (select mid,amount,remark from fund_mem_detail where dt='20190101') a
left join (select mid,age from mem) b
on a.mid=b.mid
group by b.age;

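The number of members is usually bounded, whereas the volume of transaction records is very large, and in reality some members have far more transactions than others, which causes data skew. For this large-table-join-small-table case, a mapjoin solves the problem; just add a mapjoin hint, as follows: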

select /*+mapjoin(b)*/ b.age,count(1)
from (select mid,amount,remark from fund_mem_detail where dt='20190101') a
left join (select mid,age from mem) b
on a.mid=b.mid
group by b.age;

/*+mapjoin(b)*/ is the mapjoin hint. To mapjoin several tables, write /*+mapjoin(b,c,d)*/. Automatic conversion to mapjoin is enabled by default in Hive; the parameter is:

set hive.auto.convert.join=true;

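A mapjoin performs the join in the map phase, instead of the usual approach of shuffling rows by the join column and then joining on each reduce task. With no shuffle there is no skew. Instead, Hive copies the small table in full to every map task (only the columns named in the SQL), and each map task simply looks rows up in that copy.
Note that the small table must not be too large, otherwise copying it everywhere costs more than it saves. Hive decides whether a table is small enough from hive.mapjoin.smalltable.filesize (default 25000000 bytes, about 25 MB); when hive.auto.convert.join.noconditionaltask is on, the combined size of the small tables is checked against hive.auto.convert.join.noconditionaltask.size (default 10000000 bytes, about 10 MB). These thresholds can be raised, but in general should not exceed 1 GB, or the nodes running the map tasks will run out of memory.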
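
The lookup that each map task performs can be pictured as a plain hash join. The following Python sketch is an analogy under assumed sample data, not Hive code: mem plays the small member table, detail the large transaction table, and the function names are made up for illustration.

```python
def map_side_left_join(big_rows, small_rows):
    """Left join big_rows to small_rows on mid without any shuffle."""
    # Built once per map task from the full copy of the small table.
    age_by_mid = {mid: age for mid, age in small_rows}
    for mid, amount in big_rows:
        # Each row of the large table is resolved by a local dictionary lookup.
        yield mid, amount, age_by_mid.get(mid)


def count_by_age(joined):
    """Equivalent of: select age, count(1) ... group by age."""
    counts = {}
    for _mid, _amount, age in joined:
        counts[age] = counts.get(age, 0) + 1
    return counts


mem = [(1, 30), (2, 25)]                                # (mid, age), small table
detail = [(1, 10.0), (1, 5.0), (2, 7.5), (3, 1.0)]      # (mid, amount), large table
result = count_by_age(map_side_left_join(detail, mem))
print(result)  # {30: 2, 25: 1, None: 1}
```

Because rows of the large table never move to a reducer for the join, a member with a huge number of transactions costs no more per row than any other member. The dictionary is also why the small table has to fit in memory.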

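2.4 Joining a Large Table with a Large Table

2.4.1 Problem scenario
Table A is a summary table of seller-buyer transactions over the last 90 days: for each seller, how many orders each of its buyers placed in the last 90 days and the total amount. Its columns are buyer_id, seller_id, and pay_cnt_90d.
Table B holds basic seller information, including the seller's rating, such as S0, S1, S2, S3. Its columns are seller_id and s_level.
The goal is to get, for each buyer, the share of transactions with sellers at each level, for example S0: 10%, S1: 20%, S2: 30%, S3: 40%. A straightforward version might be: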

select
m.buyer_id
,sum(pay_cnt_90d) as pay_cnt_90d
,sum(case when m.s_level=0 then pay_cnt_90d end) as pay_cnt_90d_s0
,sum(case when m.s_level=1 then pay_cnt_90d end) as pay_cnt_90d_s1
,sum(case when m.s_level=2 then pay_cnt_90d end) as pay_cnt_90d_s2
,sum(case when m.s_level=3 then pay_cnt_90d end) as pay_cnt_90d_s3
from
(
select
a.buyer_id,a.seller_id,b.s_level,a.pay_cnt_90d
from (select buyer_id ,seller_id,pay_cnt_90d
from table_A ) a
join (select seller_id,s_level
from table_B ) b
on a.seller_id=b.seller_id
) m
group by m.buyer_id;

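This SQL causes data skew: some sellers have millions or even tens of millions of buyers, while most sellers have few. The join between table_A and table_B shuffles rows by seller_id, so the big sellers in table_A skew the load. A mapjoin on table_B cannot fix this, because there are more than ten million sellers and the file is several GB in size, beyond the 1 GB ceiling for a mapjoin table.

2.4.2 Approach 1: convert to a mapjoin
Table B cannot be mapjoined directly, but it can be mapjoined indirectly, in two ways: restricting rows and restricting columns.
Restricting rows: there is no need to join all of B, only the sellers that appear in A.
Restricting columns: take only the columns that are needed.
If B meets the mapjoin condition after these restrictions, the query can be written as: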

select
m.buyer_id
,sum(pay_cnt_90d) as pay_cnt_90d
,sum(case when m.s_level=0 then pay_cnt_90d end) as pay_cnt_90d_s0
,sum(case when m.s_level=1 then pay_cnt_90d end) as pay_cnt_90d_s1
,sum(case when m.s_level=2 then pay_cnt_90d end) as pay_cnt_90d_s2
,sum(case when m.s_level=3 then pay_cnt_90d end) as pay_cnt_90d_s3
from
(
select /*+mapjoin(b)*/
a.buyer_id,a.seller_id,b.s_level,a.pay_cnt_90d
from (select buyer_id ,seller_id,pay_cnt_90d
from table_A ) a
join (select b0.seller_id,b0.s_level
from table_B b0
join (select seller_id from table_A group by seller_id) a0
on b0.seller_id=a0.seller_id) b
on a.seller_id=b.seller_id
) m
group by m.buyer_id;

If table B is still large after filtering, this approach no longer works.
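
The row and column restriction amounts to a semi-join filter on B. As a rough illustration in Python, with made-up rows and a hypothetical extra column shop_name standing in for the columns that are not needed:

```python
def shrink_table_b(table_a, table_b):
    """Keep only B rows whose seller appears in A, and only the needed columns."""
    # Row restriction: the distinct seller ids present in A.
    sellers_in_a = {seller_id for _buyer_id, seller_id, _cnt in table_a}
    # Column restriction: keep seller_id and s_level, drop everything else.
    return [
        (row["seller_id"], row["s_level"])
        for row in table_b
        if row["seller_id"] in sellers_in_a
    ]


table_a = [(1, "s1", 3), (2, "s1", 1), (3, "s2", 5)]   # (buyer_id, seller_id, pay_cnt_90d)
table_b = [
    {"seller_id": "s1", "s_level": 0, "shop_name": "a"},
    {"seller_id": "s2", "s_level": 1, "shop_name": "b"},
    {"seller_id": "s9", "s_level": 3, "shop_name": "c"},  # never sells to anyone in A
]
small_b = shrink_table_b(table_a, table_b)
print(small_b)  # [('s1', 0), ('s2', 1)]
```

Whether the shrunken table is small enough for a mapjoin depends entirely on the data, which is why the approach can fail.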

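2.4.3 Approach 2: a case when expression in the join
This approach applies when the skewed values are known and few, for example skew caused by null values. The idea is to scatter the offending values randomly across the reducers: in the join condition, concat a random number onto those special values so that they are distributed randomly.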

select a.user_id,a.order_id,b.user_id
from table_A a
join table_B b
on (case when a.user_id is null then concat('hive',rand()) else a.user_id end)=b.user_id

This approach does not solve the skew in the problem scenario either, because the skewed sellers are numerous and change over time.
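
The effect of salting null keys can be simulated in Python. The sketch below is an illustration under assumptions: the byte-sum partition function is a toy stand-in for Hive's partitioner, and the data (9,000 null ids, 1,000 real ids) is invented. The join_key function mirrors the case when expression in the join condition.

```python
import random


def partition(key, num_reducers):
    """Toy stand-in for a hash partitioner: equal keys map to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def join_key(user_id, rng):
    """Mirror of the case when expression: salt null ids, pass others through."""
    if user_id is None:
        return "hive" + str(rng.random())
    return str(user_id)


def reducer_loads(user_ids, num_reducers, salted, seed=0):
    """Count rows per reducer with or without salting of null keys."""
    rng = random.Random(seed)
    loads = [0] * num_reducers
    for user_id in user_ids:
        if salted:
            key = join_key(user_id, rng)
        else:
            # Without salting, every null shares one key and therefore one reducer.
            key = "" if user_id is None else str(user_id)
        loads[partition(key, num_reducers)] += 1
    return loads


user_ids = [None] * 9000 + list(range(1000))
plain = reducer_loads(user_ids, 4, salted=False)
salted = reducer_loads(user_ids, 4, salted=True)
print(plain, salted)
```

A salted key such as hive0.37 never equals a real user_id, so the salted rows still find no match, exactly as null would. Only their placement changes.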

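2.4.4 Approach 3: replicate table B by a multiple, then take the modulus
I did not understand this approach, so I will not explain it here.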
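
The post gives no detail for this approach. One common reading of the heading, offered here as an assumption rather than as the referenced book's method, is: copy every row of B n times with a bucket number 0..n-1, then join on seller_id plus buyer_id mod n, so that a hot seller's rows spread over n join keys instead of one. A Python sketch of that reading, with invented data and function names:

```python
def expand_table_b(table_b, n):
    """Replicate each B row n times, tagging every copy with a bucket number."""
    return [
        (seller_id, bucket, s_level)
        for seller_id, s_level in table_b
        for bucket in range(n)
    ]


def salted_join(table_a, table_b, n):
    """Join A to the replicated B on (seller_id, buyer_id mod n)."""
    lookup = {
        (seller_id, bucket): s_level
        for seller_id, bucket, s_level in expand_table_b(table_b, n)
    }
    joined, keys_used = [], set()
    for buyer_id, seller_id, pay_cnt in table_a:
        key = (seller_id, buyer_id % n)
        if key in lookup:
            keys_used.add(key)
            joined.append((buyer_id, seller_id, lookup[key], pay_cnt))
    return joined, keys_used


# Hypothetical data: seller s_hot has 100 buyers, seller s_small has one.
table_a = [(b, "s_hot", 1) for b in range(100)] + [(1000, "s_small", 2)]
table_b = [("s_hot", 0), ("s_small", 3)]
joined, keys_used = salted_join(table_a, table_b, n=4)

hot_keys = {k for k in keys_used if k[0] == "s_hot"}
print(len(joined), len(hot_keys))  # 101 4
```

The join result is unchanged, but the rows of s_hot are now shared among four join keys. The price is that B grows n times, which is the trade-off this kind of technique makes.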

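2.4.5 Approach 4: dynamically split in two
For problems that a mapjoin cannot solve, the last resort is to split the data dynamically in two: handle skewed and non-skewed keys separately. Non-skewed keys are joined normally, skewed keys are identified and mapjoined, and the two results are combined with union all.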

-- First find the sellers with more than 10000 buyers in the last 90 days
create table if not exists tmp_table_B as
select m.seller_id,n.s_level
from (select seller_id from
(select seller_id,count(buyer_id) byr_cnt from table_A group by seller_id) a where a.byr_cnt>10000
) m
left join (select seller_id,s_level from table_B) n
on m.seller_id=n.seller_id;

-- mapjoin the sellers with more than 10000 buyers; join the other sellers normally
select
m.buyer_id
,sum(pay_cnt_90d) as pay_cnt_90d
,sum(case when m.s_level=0 then pay_cnt_90d end) as pay_cnt_90d_s0
,sum(case when m.s_level=1 then pay_cnt_90d end) as pay_cnt_90d_s1
,sum(case when m.s_level=2 then pay_cnt_90d end) as pay_cnt_90d_s2
,sum(case when m.s_level=3 then pay_cnt_90d end) as pay_cnt_90d_s3
from
(
select a.buyer_id,a.seller_id,b.s_level,a.pay_cnt_90d
from (select seller_id,buyer_id,pay_cnt_90d from table_A) a
join
(
select a.seller_id,a.s_level
from table_B a
left join tmp_table_B b on a.seller_id=b.seller_id
where b.seller_id is null
) b
on a.seller_id=b.seller_id

union all

select /*+mapjoin(b)*/
a.buyer_id,a.seller_id,b.s_level,a.pay_cnt_90d
from (select seller_id,buyer_id,pay_cnt_90d from table_A) a
join (select seller_id,s_level from tmp_table_B ) b
on a.seller_id=b.seller_id
) m
group by m.buyer_id;

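Approach 4 needs a new temporary table to hold the big sellers, which change from day to day. It is the most general and flexible of the approaches and can serve as the last resort.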
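
The split can be sketched in Python to check that the two branches together reproduce the plain join. Everything here is illustrative: the threshold of 10 stands in for 10000, the data is invented, and the dictionary lookups stand in for the regular join and the mapjoin.

```python
from collections import Counter


def split_join(table_a, table_b, threshold):
    """Join A and B in two branches: hot sellers separately from the rest."""
    # Find the hot sellers: those with more than threshold buyer rows in A.
    rows_per_seller = Counter(seller_id for _b, seller_id, _c in table_a)
    hot = {s for s, n in rows_per_seller.items() if n > threshold}

    level = dict(table_b)
    hot_lookup = {s: level[s] for s in hot if s in level}        # small: mapjoin side
    regular_lookup = {s: l for s, l in level.items() if s not in hot}

    # Branch 1: non-skewed sellers, joined the normal way.
    regular_part = [
        (b, s, regular_lookup[s], c) for b, s, c in table_a if s in regular_lookup
    ]
    # Branch 2: skewed sellers, joined against the small hot-seller table.
    hot_part = [(b, s, hot_lookup[s], c) for b, s, c in table_a if s in hot_lookup]

    # union all of the two branches.
    return regular_part + hot_part, hot


table_a = [(b, "s_hot", 1) for b in range(50)] + [(100, "s_small", 2), (101, "s_other", 4)]
table_b = [("s_hot", 0), ("s_small", 3), ("s_other", 1)]
rows, hot = split_join(table_a, table_b, threshold=10)
print(len(rows), hot)  # 52 {'s_hot'}
```

Each seller falls into exactly one branch, so no row is joined twice and none is lost. The hot set is recomputed from the data on every run, which is what makes the approach adapt to sellers that change from day to day.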

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set mapreduce.map.memory.mb=16000;
set mapreduce.map.java.opts=-Xmx15g;
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

Reference: 朱松岭, 《离线和实时大数据开发实践》.
