Hive怎么处理数据倾斜问题

最新推荐文章于 2024-05-25 11:32:41 发布

tinyland

最新推荐文章于 2024-05-25 11:32:41 发布

阅读量495

点赞数 1

分类专栏：大数据文章标签： hive

本文链接：https://blog.csdn.net/xiaoluzjg/article/details/107814443

版权

大数据专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Hive怎么处理数据倾斜问题

group by 倾斜

group by语句中出现的倾斜，通过改变写法或参数设置
原理：1次reduce->2次reduce

1 写法调整

对于确定的倾斜值，先均匀分布到各个reducer上，然后开启新一轮reducer进行统计操作。写法如下

-- 正常写法
select key,count(1) as cnt
from tb_name
group by key;

-- 改进后写法
select a.key,sum(cnt) as cnt
from (
	select key,if(key = 'key001',random(),0),count(1) as cnt
	from tb_name
	group by key,if(key = 'key001',random(),0)
) t
group by t.key;

2 倾斜参数设置

如果在不确定倾斜值的情况下，可以设置hive.groupby.skewindata参数

set hive.groupby.skewindata=true;

select key,count(1) as cnt
from tb_name
group by key;

其原理和上述写法调整中类似，是先对key值进行均匀分布，然后开启新一轮reducer求值

join 倾斜

1 大表关联小表

通过map join优化，一般Hive默认是开启自动优化。

2 大表关联大表

(1) 设置倾斜参数

Skewed Join is exactly targeting this problem. At runtime, it scans the data and detects the keys with a large skew, which is controlled by parameter hive.skewjoin.key. Instead of processing those keys, it stores them temporarily in an HDFS directory. Then in a map-reduce job, process those skewed keys. The same key need not be skewed for all the tables, and so the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a map-join.

原理：倾斜的key 用 map join 单独处理
在这里插入图片描述