This post describes an approach to mitigating data skew in offline data processing by salting (splitting) hot keys, offered for reference.
Business case
Query the deduplicated user count (UV) per resource position (pos_id) over a time range. The logic is as follows:
```sql
set stable.level = 5;

select pos_id, count(distinct user_id)
from xxxxxxx
where partition_date between '2023-11-01' and '2023-11-19'
group by 1
```
Execution stats of the SQL above:
Query ID: 745285720
Query status: finished
Submitted at: 2023-11-20 12:53:04 (2,217 s since submission)
Started at: 2023-11-20 12:53:05 (2,217 s of execution)
Task duration distribution in the shuffle stage:
Metric | Min | 25th percentile | Median | 75th percentile | Max
---|---|---|---|---|---
Duration | 5s | 15s | 21s | 29s | 17 min |
GC Time | 30.0 ms | 0.6s | 1s | 2s | 15s |
Shuffle Read Size/Records | NaN undefined/11653260 | 208.1 MiB/11726798 | 208.4 MiB/11746961 | 208.7 MiB/11766420 | 209.2 MiB/11869073 |
Shuffle Write Size/Records | 23.6 KiB/361 | 24.1 KiB/368 | 24.2 KiB/370 | 24.4 KiB/372 | 25 KiB/381 |
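The long tail comes from how a group-by shuffle partitions data: rows are routed by the grouping key alone, so every row of a hot pos_id lands on the same task no matter how many there are. A minimal Python sketch of that routing, with entirely synthetic row counts (the hot key 72 is taken from the text; the volumes are illustrative, not from the job above):

```python
from collections import Counter

# Synthetic (pos_id, user_id) rows: pos_id 72 is a hot key with
# far more users than the other positions.
rows = [(72, uid) for uid in range(100_000)] + \
       [(pos, uid) for pos in range(1, 11) for uid in range(1_000)]

# A group-by shuffle hashes on the key only, so all rows sharing a
# pos_id go to one partition regardless of volume.
NUM_PARTITIONS = 8
load = Counter(hash(pos) % NUM_PARTITIONS for pos, _ in rows)

# The partition holding pos_id 72 carries an order of magnitude
# more rows than any other -- that task becomes the 17-min straggler.
print(sorted(load.values(), reverse=True))
```

The skew ratio in `load` mirrors the Duration column above: one task does almost all the work while the rest finish quickly.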
Optimization approach
A few pos_id values, e.g. 72, carry a very large share of the users and are hot keys. The idea is to append a suffix to each key, splitting the users of one pos_id into several groups and computing count(distinct user_id) per group to get partial UVs; the suffix logic must assign any given user to one fixed group. Then strip the suffix and sum the per-group UVs to obtain the UV of each pos_id.
Expressed in SQL:
```sql
select pos_id,
       sum(uv) as uv
from (
    select split(pos_id_mod_user, '#')[0] as pos_id,
           uv
    from (
        select concat(pos_id, '#', mod(user_id, pos_id)) as pos_id_mod_user,
               count(distinct user_id) as uv
        from xxxxxxx
        where partition_date between '2023-11-01' and '2023-11-19'
        group by concat(pos_id, '#', mod(user_id, pos_id))
    ) as t1
) as t2
group by pos_id
```
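The rewrite is exact, not approximate: mod(user_id, N) deterministically assigns each user to one bucket per pos_id, so the per-bucket user sets partition the full user set and their distinct counts sum to the overall distinct count. A Python sketch of the same two-level aggregation on synthetic data (the fixed bucket count `N = 10` is illustrative; the SQL above uses mod(user_id, pos_id) as its bucket rule):

```python
from collections import defaultdict

# Synthetic (pos_id, user_id) fact rows, with duplicate visits.
rows = [(72, u % 500) for u in range(5_000)] + [(3, u % 40) for u in range(200)]

# Stage 1: distinct users per salted key (pos_id, user_id mod N).
# mod is deterministic, so a given user always lands in the same
# bucket and is never double-counted across buckets.
N = 10  # illustrative bucket count
stage1 = defaultdict(set)
for pos, uid in rows:
    stage1[(pos, uid % N)].add(uid)

# Stage 2: strip the salt and sum the per-bucket UVs.
uv = defaultdict(int)
for (pos, _), users in stage1.items():
    uv[pos] += len(users)

# Cross-check against a direct count(distinct user_id) per pos_id.
direct = {pos: len({u for p, u in rows if p == pos}) for pos in uv}
assert uv == direct
print(dict(uv))  # {72: 500, 3: 40}
```

Stage 1 spreads the hot key's rows over N shuffle partitions; stage 2 only moves one small UV row per bucket, which is why the rewritten job's tail shrinks.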
Execution stats:
Query ID: 745306136
Query status: finished
Submitted at: 2023-11-20 13:25:44 (1,697 s since submission)
Started at: 2023-11-20 13:25:44 (1,697 s of execution)
Task duration distribution in the shuffle stage:
Metric | Min | 25th percentile | Median | 75th percentile | Max
---|---|---|---|---|---
Duration | 8s | 21s | 29s | 42s | 8.0 min |
GC Time | 27.0ms | 0.3s | 0.8s | 2s | 19s |
Shuffle Read Size/Records | NaN undefined/11652122 | 266.9 MiB/11725571 | 267.3 MiB/11747132 | 267.7 MiB/11766599 | 268.2 MiB/11862435 |
Shuffle Write Size/Records | 8.6 MiB/652850 | 8.6 MiB/654455 | 8.6 MiB/654846 | 8.6 MiB/655220 | 8.7 MiB/656641 |
Result analysis
After the optimization, the skew in the shuffle stage is clearly alleviated: the longest task drops from 17 min to 8.0 min. A comparison also confirmed that the new logic returns exactly the same results as the old one.