Hive怎么处理数据倾斜问题
group by 倾斜
group by语句中出现的倾斜,通过改变写法或参数设置
原理:1次reduce->2次reduce
1 写法调整
对于确定的倾斜值,先均匀分布到各个reducer上,然后开启新一轮reducer进行统计操作。写法如下
-- 正常写法
select key,count(1) as cnt
from tb_name
group by key;
-- 改进后写法
select a.key,sum(cnt) as cnt
from (
select key,if(key = 'key001',random(),0),count(1) as cnt
from tb_name
group by key,if(key = 'key001',random(),0)
) t
group by t.key;
2 倾斜参数设置
如果在不确定倾斜值的情况下,可以设置hive.groupby.skewindata参数
set hive.groupby.skewindata=true;
select key,count(1) as cnt
from tb_name
group by key;
其原理和上述写法调整中类似,是先对key值进行均匀分布,然后开启新一轮reducer求值
join 倾斜
1 大表关联小表
通过map join优化 , 一般Hive默认是开启自动优化。
2 大表关联大表
(1) 设置倾斜参数
Skewed Join is exactly targeting this problem. At runtime, it scans the data and detects the keys with a large skew, which is controlled by parameter hive.skewjoin.key. Instead of processing those keys, it stores them temporarily in an HDFS directory. Then in a map-reduce job, process those skewed keys. The same key need not be skewed for all the tables, and so the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a map-join.
原理: 倾斜的key 用 map join 单独处理
(2) 增加虚拟列,通过虚拟列关联
原理:增加虚拟列,小key保持原值,大key随机打散。从而可以分到不同reduce任务上去执行
具体可以看下 Skew Join Optimization in Hive