【SQL】数据倾斜处理之加盐哈希
处理数据倾斜最好的办法就是将数据打散,其中加盐哈希最为常用。
之前面阿里被问到如何加盐哈希。
其实就是将一个阶段处理拆分为两个阶段出来,比如以前是 key1(60),key2(20),key3(20),在这个字段上做聚合。加盐之后,可以分散为 1_key1(20),2_key1(20),3_key1(20),key2(20),key3(20),此时数据就较为平均。在处理过后的字段上聚合再做处理。
例如:
select
date,
app_id,
count(uid) as pv
from
source_tb
group by
date,
app_id;
某个 app 流量远超其他 app 就可能倾斜,因此可以改写:
with t1 as (
select
date,
-- 加随机前缀,用 “_” 连接
concat(cast(cast(RAND()*100 as int) as string), "_", app_id) as new_app_id,
uid
from
source_tb
),
t2 as (
select
date,
new_app_id,
count(uid) as pv
from
t1
group by
date,
app_id
)
select
date,
-- 用 “_” 拆分为两个部分,第二个部分为原始的 app_id
split(new_app_id,'_')[1] as app_id,
sum(pv) as pv
from
t2
group by
date,
split(new_app_id,'_')[1]