Big Data: How to Gracefully Solve Data Skew in Spark and Hive in Production

Latest recommended article published on 2022-11-07 16:45:00

大数据学习僧

Article link: https://blog.csdn.net/yu7888/article/details/112980821

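When a big data job keeps failing during the shuffle, this article proposes resolving the data skew by appending a random number to the join fields. It shows how to add the random numbers to the business table, the rule table, and the location info table, and stresses that the data must be restored afterwards. The goal is to make the job run efficiently while still producing correct results.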

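Problem: the source data is matched against event names and location information. One field value accounts for a very large number of rows, so the job runs for a long time and keeps failing during the shuffle.

Error message: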

ShuffleMapStage has failed the maximum allowable number of times
Caused by: io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 16777216 bytes of direct memory (used: 3741319168, max: 3750756352)

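The data volume is clearly large, and having the job run that long only to fail is frustrating. So let's fix the data skew properly, once and for all.

**Solution: salt the join key by concatenating a random number onto the join-condition field of the large table, and replicate the matching field of each small table once per possible salt value. Remember to strip the random number at the end to restore the data; only then is the result correct.**

Pay close attention: every table involved in the join gets the random number. The concrete steps:

Business table: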

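-- append a random salt in 0..5 to every join key of the large table
-- (the salt is cast to string so that concat_ws accepts it on Hive as well as Spark SQL)
select
  concat_ws('_', event_id,    cast(cast(rand() * 1000 % 6 as int) as string)) as event_id,
  concat_ws('_', s_ip_string, cast(cast(rand() * 1000 % 6 as int) as string)) as s_ip_string,
  concat_ws('_', d_ip_string, cast(cast(rand() * 1000 % 6 as int) as string)) as d_ip_string,
  s_port,
  d_port,
  c_time
from atable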
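
To make the salt expression concrete, here is a minimal Python sketch of what `cast(rand() * 1000 % 6 as int)` produces. It is only an illustration, not the job itself: the helper names `salt_suffix` and `salted` are made up for this example, and Python's `random.Random` stands in for the SQL `rand()`.

```python
import random

def salt_suffix(rng: random.Random) -> int:
    # Mirrors cast(rand() * 1000 % 6 as int): rand() is in [0, 1),
    # so the value before the cast lies in [0, 6) and truncates to 0..5.
    return int(rng.random() * 1000 % 6)

def salted(key: str, rng: random.Random) -> str:
    # Mirrors concat_ws('_', key, salt).
    return f"{key}_{salt_suffix(rng)}"

rng = random.Random(42)  # fixed seed only so the demo is repeatable
suffixes = {salt_suffix(rng) for _ in range(10_000)}
print(sorted(suffixes))        # the distinct salt values that were drawn
print(salted("10.0.0.1", rng)) # one salted key
```

So the large table only ever emits six distinct suffixes. That is still safe with small tables replicated for the ten suffixes 0..9, because every suffix the large side can produce has a matching replica.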

Rule table:

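-- replicate every rule row once per salt value 0..9
-- (the business table query only emits 0..5, so replicas 6..9 are never matched)
select
  concat_ws('_', event_id, num) as event_id,
  event_name
from btable
lateral view explode(split('1,2,3,4,5,6,7,8,9,0', ',')) tp as num
-- tp is the alias of the temporary table produced by the lateral view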

Location info table:

select
  concat_ws('_', ip, num) as ip,
  company,
  city
from ctable
lateral view explode(split('1,2,3,4,5,6,7,8,9,0', ',')) tp as num
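
Why does replicating the small tables keep the join correct? The following Python sketch imitates the idea on tiny in-memory lists. Everything in it is hypothetical sample data and helper naming (`explode_small`, `salt_big`, `hash_join`); it only assumes the same salt rules as the queries above (0..5 on the large side, 0..9 replicas on the small side).

```python
import random
from collections import defaultdict

SALTS = range(10)  # matches the '1,...,9,0' list exploded on the small tables

def explode_small(rows):
    # Replicate each (key, value) row once per salt, like lateral view explode.
    return [(f"{key}_{s}", value) for key, value in rows for s in SALTS]

def salt_big(rows, rng):
    # Append one random salt (0..5) per row, like the business table query.
    return [(f"{key}_{int(rng.random() * 1000 % 6)}", payload)
            for key, payload in rows]

def hash_join(big, small):
    # Plain inner equi-join on the first element of each row.
    index = defaultdict(list)
    for key, value in small:
        index[key].append(value)
    return [(key, payload, value)
            for key, payload in big
            for value in index.get(key, [])]

# Hypothetical data: "e1" is the hot key, "e3" has no matching rule.
rules = [("e1", "port scan"), ("e2", "brute force")]
events = [("e1", i) for i in range(1000)] + [("e2", 1000), ("e3", 1001)]

plain = sorted(hash_join(events, rules))
salted_join = hash_join(salt_big(events, random.Random(7)), explode_small(rules))
restored = sorted((key.rsplit("_", 1)[0], payload, name)
                  for key, payload, name in salted_join)
print(restored == plain)
```

Each large-side row carries exactly one salt, and each small-side row exists exactly once per salt, so every original match is found once and only once. The price is that the small tables grow tenfold, which is why this only makes sense when they really are small.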

With the random numbers in place, the shuffle can now spread the work far more evenly across the executors.
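
A rough way to see the effect is to count how many rows land in the busiest shuffle partition before and after salting. This is a toy simulation under stated assumptions, not a measurement: the key mix is invented, 200 partitions is simply Spark's default for `spark.sql.shuffle.partitions`, and `zlib.crc32` is a deterministic stand-in for the real hash partitioner.

```python
import random
import zlib
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    # Stand-in for a shuffle hash partitioner (not Spark's actual hash).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def largest_partition(keys, num_partitions):
    # Row count of the busiest partition.
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return max(counts.values())

rng = random.Random(0)
# Hypothetical skew: one hot key owns 90% of the rows.
keys = ["hot_event"] * 90_000 + [f"event{i}" for i in range(10_000)]
salted_keys = [f"{k}_{int(rng.random() * 1000 % 6)}" for k in keys]

before = largest_partition(keys, 200)
after = largest_partition(salted_keys, 200)
print(before, after)
```

Note the limit of the approach: with six possible suffixes, a hot key is split into at most six pieces, so the busiest partition shrinks to roughly a sixth of the hot key's rows at best. If that is still too much, the salt range on the large table can be widened, as long as the small tables are replicated for every suffix it can produce.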

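One last thing, and it is important enough to say three times: remember, remember, remember!
We are not done yet. The result still needs one more step to strip the random numbers: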

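-- strip the '_<salt>' suffix to restore the original values
-- (a is presumably the result of joining the three salted queries above)
select
  split(event_id, '_')[0] as event_id,
  event_name,
  split(s_ip_string, '_')[0] as s_ip_string,
  s_city,
  s_company,
  split(d_ip_string, '_')[0] as d_ip_string,
  d_city,
  d_company,
  d_port,
  s_port,
  time
from a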
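
One caveat on the restore step: `split(x, '_')[0]` keeps only the text before the first underscore. That is fine for IP addresses, but if an `event_id` can itself contain underscores, part of the original value would be lost. A small Python sketch of the difference, using a hypothetical helper `strip_salt` and made-up sample values:

```python
def strip_salt(salted_key: str) -> str:
    # Remove only the last '_<salt>' segment, so underscores inside the
    # original key survive.
    return salted_key.rsplit("_", 1)[0]

print(strip_salt("10.0.0.1_3"))         # 10.0.0.1
print(strip_salt("login_failed_5"))     # login_failed
print("login_failed_5".split("_")[0])   # login  (what split(...)[0] keeps)
```

In SQL, the same idea could be expressed with something like `regexp_replace(event_id, '_[0-9]+$', '')`, assuming the salt is always a trailing run of digits as in the queries above.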