PostgreSQL DBA(34) - (HashAggregate vs GroupAggregate)

PostgreSQL数据库中,聚合函数如max/min/count等有两种实现方式:HashAggregate与GroupAggregate.本节简单介绍这两者的区别.
案例一

首先我们看一个案例:
测试表:


drop table  if exists t_agg;
create table t_agg(bh varchar(20),c1 int,c2 int,c3 int,c4 int,c5 int,c6 int);
insert into t_agg select 'GZ01',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ02',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ03',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ04',col,col,col,col,col,col from generate_series(1,100000) as col;
insert into t_agg select 'GZ05',col,col,col,col,col,col from generate_series(1,100000) as col;

执行查询:


testdb=# -- 禁用并行
testdb=# set max_parallel_workers_per_gather=0;
SET
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=22427.00..22427.05 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5, c6
(5 rows)

PG的优化器选择了HashAggregate.
下面禁用HashAggregate,优化器只能选择GroupAggregate.可以看到两者的总成本比较:22427.05 vs 82968.97


testdb=# set enable_hashagg = off;
SET
testdb=# explain verbose select bh,min(c1),max(c1),min(c2),max(c2),min(c3),max(c3),min(c4),max(c4),min(c5),max(c5) from t_agg group by bh;
                                               QUERY PLAN                                               
--------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=67968.92..82968.97 rows=5 width=45)
   Output: bh, min(c1), max(c1), min(c2), max(c2), min(c3), max(c3), min(c4), max(c4), min(c5), max(c5)
   Group Key: t_agg.bh
   ->  Sort  (cost=67968.92..69218.92 rows=500000 width=25)
         Output: bh, c1, c2, c3, c4, c5
         Sort Key: t_agg.bh
         ->  Seq Scan on public.t_agg  (cost=0.00..8677.00 rows=500000 width=25)
               Output: bh, c1, c2, c3, c4, c5
(8 rows)

案例二
下面用一个宽表来进行测试:分组键值很少,但聚合列很多


drop table  if exists t_agg_width;
create table t_agg_width
(bh varchar(20)
,c1 int,c2 int,c3 int,c4 int,c5 int,c6 int,c7 int,c8 int,c9 int
,c11 int,c12 int,c13 int,c14 int,c15 int,c16 int,c17 int,c18 int,c19 int
,c21 int,c22 int,c23 int,c24 int,c25 int,c26 int,c27 int,c28 int,c29 int
,c31 int,c32 int,c33 int,c34 int,c35 int,c36 int,c37 int,c38 int,c39 int);
insert into t_agg_width 
select 'GZ01'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ02'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ03'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
insert into t_agg_width 
select 'GZ04'
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
,col,col,col,col,col,col,col,col,col 
from generate_series(1,100000) as col;
-- 禁用hashagg
set enable_hashagg = off;
-- 禁用并行
set max_parallel_workers_per_gather=0;
select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;

在这种情况下,优化器仍会选择Hash


testdb=# explain verbose select bh
testdb-# ,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
testdb-# ,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
testdb-# ,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
testdb-# ,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
testdb-# from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=49889.00..49889.04 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(5 rows)
testdb=# set enable_hashagg = off;
SET
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=110266.28..148266.32 rows=4 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=110266.28..111266.28 rows=400000 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..12889.00 rows=400000 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c2
4, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

下面增大分组键值的分布,同时提高c1等列的选择率,再次测试:


testdb=# insert into t_agg_width 
testdb-# select 'GZ'||col
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# ,mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100),mod(col,100) 
testdb-# from generate_series(1,1000000) as col;
INSERT 0 1000000
testdb=# set enable_hashagg = on;
SET
testdb=# explain verbose select bh
,min(c1),min(c2) ,min(c3) ,min(c4) ,min(c5) ,min(c6) ,min(c7) ,min(c8) ,min(c9)
,min(c11),min(c12) ,min(c13) ,min(c14) ,min(c15) ,min(c16) ,min(c17) ,min(c18) ,min(c19)
,min(c21),min(c22) ,min(c23) ,min(c24) ,min(c25) ,min(c26) ,min(c27) ,min(c28) ,min(c29)
,min(c31),min(c32) ,min(c33) ,min(c34) ,min(c35) ,min(c36) ,min(c37) ,min(c38) ,min(c39)
from t_agg_width group by bh;
                                                    QUERY PLAN                                                               
----------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=440012.46..586553.52 rows=7414 width=149)
   Output: bh, min(c1), min(c2), min(c3), min(c4), min(c5), min(c6), min(c7), min(c8), min(c9), min(c11), min(c12), min(c13),
 min(c14), min(c15), min(c16), min(c17), min(c18), min(c19), min(c21), min(c22), min(c23), min(c24), min(c25), min(c26), min(
c27), min(c28), min(c29), min(c31), min(c32), min(c33), min(c34), min(c35), min(c36), min(c37), min(c38), min(c39)
   Group Key: t_agg_width.bh
   ->  Sort  (cost=440012.46..443866.86 rows=1541757 width=149)
         Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c24, c25
, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
         Sort Key: t_agg_width.bh
         ->  Seq Scan on public.t_agg_width  (cost=0.00..49681.57 rows=1541757 width=149)
               Output: bh, c1, c2, c3, c4, c5, c6, c7, c8, c9, c11, c12, c13, c14, c15, c16, c17, c18, c19, c21, c22, c23, c2
4, c25, c26, c27, c28, c29, c31, c32, c33, c34, c35, c36, c37, c38, c39
(8 rows)
testdb=#

这一次选择的是GroupAggregate.

HashAggregate
HashAggregate,数据库会根据group by字段后面的值算出hash值,并在内存中维护对应的Hash表,比如select有n个聚合函数,那么在内存中就会维护n个Hash表.这种方式使用的内存比GroupAggregate要大,内存的使用与group by COLUMN中的COLUMN的唯一键值以及聚合列的多少成正比.

GroupAggregate
GroupAggregate,数据库先将表中的数据按group by的字段进行排序,然后对排好序的数据进行一次扫描,计算得到聚合的结果.这种方式需要先执行一次排序,计算复杂度上面要比HashAggregate要高,但这种方法的好处是与group by COLUMN中的COLUMN的唯一键值多寡/聚合列多寡无关,分组键值很多而且聚合列很多且列数据选择很高的情况下,会优于HashAggregate.

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/6906/viewspace-2642384/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/6906/viewspace-2642384/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值