For more scenario examples, see: https://blog.csdn.net/SKY_02/article/details/113394634
- In recent SQL engine versions (Hive 3.0+, Spark 2.x), the built-in group by optimization is already quite mature and rarely needs extra tuning; treat the following as an explanation of how the optimization works.
- The two methods share the same principle: use random numbers to spread keys as evenly as possible across partitions, then complete the aggregation in multiple passes.
- Two-stage aggregation with random numbers: append cast(rand() * N as int) to the original group by columns for a first aggregation, then aggregate over that subquery a second time.
- HiveSQL parameter: set hive.groupby.skewindata=true;
-- Example setup (a later example sets this back to true)
set hive.groupby.skewindata=false;
-- Original logic
select
t.col_a
,count(1) as cnt
,sum(t.amount) as amount
from test.test_table_1 t
group by t.col_a
;
-- Solution 1: two-stage aggregation with random numbers
select
m2.col_a
,sum(m2.cnt) as cnt
,sum(m2.amount) as amount
from
(
select
m1.rand_num
,m1.col_a
,count(1) as cnt
,sum(m1.amount) as amount
from
(
select
cast(rand() * 100 as int) as rand_num
,t.col_a
,t.amount
from test.test_table_1 t
) m1
group by m1.col_a
,m1.rand_num
) m2
group by m2.col_a
;
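The logic of solution 1 can be sketched outside of Hive as well. Below is a minimal Python simulation of the salted two-stage aggregation, using hypothetical skewed data (one hot key plus a few small ones); it is an illustration of the principle, not Hive's actual execution plan:

```python
import random
from collections import defaultdict

# Hypothetical skewed input: (col_a, amount) rows with one hot key.
rows = [("a", 1)] * 1000 + [("b", 2)] * 10 + [("c", 3)] * 5

# Stage 1: salt each key with a random bucket, mirroring
# cast(rand() * 100 as int), and pre-aggregate per (key, salt).
N = 100
stage1 = defaultdict(lambda: [0, 0])  # (key, salt) -> [cnt, amount]
for key, amount in rows:
    salt = random.randrange(N)
    agg = stage1[(key, salt)]
    agg[0] += 1
    agg[1] += amount

# Stage 2: drop the salt and merge the partial aggregates per key,
# like the outer group by m2.col_a.
final = defaultdict(lambda: [0, 0])
for (key, _salt), (cnt, amount) in stage1.items():
    final[key][0] += cnt
    final[key][1] += amount
```

Because the hot key "a" is scattered across up to N buckets in stage 1, no single partition has to aggregate all 1000 of its rows at once, yet the merged totals in `final` match a direct single-pass group by.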
-- Solution 2: set the parameter (Hive then splits the aggregation into two jobs:
-- the first distributes group by keys randomly for partial aggregation,
-- the second produces the final result)
set hive.groupby.skewindata=true;
select
t.col_a
,count(1) as cnt
,sum(t.amount) as amount
from test.test_table_1 t
group by t.col_a
;