PostgreSQL: How to Change an Execution Plan from GroupAggregate to HashAggregate

1. Introduction

PostgreSQL has two aggregation strategies: HashAggregate and GroupAggregate. GroupAggregate requires its input rows to be sorted, while HashAggregate does not, so HashAggregate is usually much faster. Yet we often see the optimizer choose GroupAggregate, and even setting enable_sort=off does not get rid of the sort. So how can we get the optimizer to use HashAggregate?
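Before diving into an example, it is worth confirming that hash aggregation is not simply disabled by configuration. The following quick check of the relevant planner settings is not part of the original test session; the GUC names are standard PostgreSQL settings:

select name, setting
from pg_settings
where name in ('enable_hashagg', 'enable_sort', 'work_mem');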

2. An Optimization Example

Create the test data: each column has 100 distinct values, but the columns are correlated, so the number of distinct combined rows (count(distinct *), so to speak) is also only 100.

create table t1
(
  id1 integer,
  id2 integer,
  id3 integer,
  id4 integer,
  id5 integer,
  id6 integer, 
  name text
);

-- the six generate_series(1,100) calls advance in lockstep, so every row has
-- id1 = id2 = ... = id6; each column has 100 distinct values, and so does the
-- combination of all six columns
insert into t1
select
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  repeat('a',100);

do $$
begin
  -- double the table 16 times: 100 rows * 2^16 = 6,553,600 rows
  for i in 1..16 loop
    insert into t1 select * from t1;
  end loop;
end;
$$;

test=# select count(*) from t1;
  count
---------
 6553600
(1 row)

Different GROUP BY column lists cause the planner to pick different aggregate algorithms.

test=# explain analyze select id1,count(*) from t1 group by id1;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=224335.18..224336.18 rows=100 width=12) (actual time=2084.369..2084.387 rows=100 loops=1)
   Group Key: id1
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=4) (actual time=1.043..876.510 rows=6553600 loops=1)
 Planning Time: 0.051 ms
 Execution Time: 2084.422 ms
(5 rows)

test=# explain analyze select id1,id2,count(*) from t1 group by id1,id2;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=240719.21..240819.21 rows=10000 width=16) (actual time=4801.232..4801.302 rows=100 loops=1)
   Group Key: id1, id2
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=8) (actual time=0.017..3457.207 rows=6553600 loops=1)
 Planning Time: 0.076 ms
 Execution Time: 4801.430 ms
(5 rows)

-- as the number of GROUP BY columns grows, the optimizer switches to GroupAggregate
test=# explain analyze select id1,id2,id3,count(*) from t1 group by id1,id2,id3;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1157569.44..1246043.20 rows=655361 width=20) (actual time=4519.386..6399.677 rows=100 loops=1)
   Group Key: id1, id2, id3
   ->  Sort  (cost=1157569.44..1173953.47 rows=6553612 width=12) (actual time=4501.004..5619.469 rows=6553600 loops=1)
         Sort Key: id1, id2, id3
         Sort Method: external merge  Disk: 141144kB
         ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=12) (actual time=0.492..1274.615 rows=6553600 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 6415.101 ms
(8 rows)

test=# explain analyze select id1,id2,id3,id4,count(*) from t1 group by id1,id2,id3,id4;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1157569.44..1262427.23 rows=655361 width=24) (actual time=4873.150..7008.342 rows=100 loops=1)
   Group Key: id1, id2, id3, id4
   ->  Sort  (cost=1157569.44..1173953.47 rows=6553612 width=16) (actual time=4855.042..6153.156 rows=6553600 loops=1)
         Sort Key: id1, id2, id3, id4
         Sort Method: external merge  Disk: 166792kB
         ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=16) (actual time=0.403..1320.904 rows=6553600 loops=1)
 Planning Time: 0.208 ms
 Execution Time: 7023.674 ms
(8 rows)
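As noted in the introduction, turning off sort does not force a hash plan. A sketch of that experiment (not run in the original session) would look like this; the planner typically still produces a sort-based GroupAggregate here, presumably because it expects a hash table for roughly 655k estimated groups to overflow work_mem:

set enable_sort = off;
explain select id1,id2,id3,id4,count(*) from t1 group by id1,id2,id3,id4;
reset enable_sort;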

Problem analysis: based on per-column statistics, the optimizer estimates 100 distinct values when grouping by one column, 100*100 = 10,000 for two columns, 100*100*100 = 1,000,000 for three columns, and so on (the planner also clamps such products, which is where the 655,361-group estimate in the plans above comes from). In other words, the more columns appear in the GROUP BY, the more distinct groups the optimizer expects, and a large group estimate pushes it toward GroupAggregate. Based on this, we consider creating multi-column extended statistics on the table.
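As a quick sanity check (not part of the original article), we can confirm that the real number of distinct combinations is only 100, far below the planner's estimate:

-- counts the actual distinct (id1,id2,id3,id4) combinations; returns 100 for this data set
select count(*) from (select distinct id1, id2, id3, id4 from t1) s;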

test=# create statistics s2(ndistinct) on id1,id2,id3,id4,id5,id6 from t1;
CREATE STATISTICS
test=# analyze t1;
ANALYZE
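As a sanity check (not shown in the original output), the new statistics object can be inspected in the system catalogs; on PostgreSQL 12 and later the collected n-distinct counts are also exposed through the pg_stats_ext view:

select stxname, stxkeys, stxkind
from pg_statistic_ext
where stxrelid = 't1'::regclass;

-- PostgreSQL 12+: per-combination n-distinct estimates gathered by ANALYZE
select statistics_name, n_distinct
from pg_stats_ext
where tablename = 't1';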
test=# explain analyze select id1,id2,id3,id4,count(*) from t1 group by id1,id2,id3,id4;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=273487.27..273488.27 rows=100 width=24) (actual time=2787.681..2787.695 rows=100 loops=1)
   Group Key: id1, id2, id3, id4
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=16) (actual time=0.024..926.445 rows=6553600 loops=1)
 Planning Time: 0.243 ms
 Execution Time: 2787.764 ms
(5 rows)

test=# explain analyze select id1,id2,id3,id4,id5,id6,count(*) from t1 group by id1,id2,id3,id4,id5,id6;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=306255.33..306256.33 rows=100 width=32) (actual time=3282.206..3282.223 rows=100 loops=1)
   Group Key: id1, id2, id3, id4, id5, id6
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=24) (actual time=0.232..951.364 rows=6553600 loops=1)
 Planning Time: 0.633 ms
 Execution Time: 3282.329 ms
(5 rows)

As you can see, once the multi-column statistics are in place the optimizer estimates only 100 groups, chooses HashAggregate, and performance improves dramatically (the four-column query drops from roughly 7 seconds to under 3 seconds).

3. Conclusion

When deciding which aggregation strategy to use for a GROUP BY, the optimizer estimates the number of result groups; if that estimate is too high, it leans toward GroupAggregate. However, PostgreSQL's default estimate for combinations of multiple correlated columns can be badly off, and that is exactly what manually created multi-column (extended) statistics fix. In practical tuning work, creating multi-column statistics is often what it takes to steer the optimizer toward HashAggregate.
