PostgreSQL 如何修改执行计划 GroupAggregate 为 HashAggregate

最新推荐文章于 2023-04-12 11:50:42 发布

Kingbase 研究院

最新推荐文章于 2023-04-12 11:50:42 发布

阅读量1.2k

点赞数

本文链接：https://blog.csdn.net/lyu1026/article/details/124893602

版权

关键词由CSDN通过智能技术生成

1、前言

PostgreSQL 聚合算法有两种，HashAggregate and GroupAggregate 。我们知道GroupAggregate 需要对记录进行排序，而 HashAggregate 则无需进行排序，通常 HashAggregate 要快很多。但是，我们经常会看到优化器使用 GroupAggregate，即使 enable_sort=off 也不能阻止排序操作。那么有什么办法让优化器使用 HashAggregate 了？

2、优化例子

创建测试数据：数据每个列有100个不同值，但是列值之间是相关的，也就是 count(distinct *) 也就100 个。

create table t1
(
  id1 integer,
  id2 integer,
  id3 integer,
  id4 integer,
  id5 integer,
  id6 integer, 
  name text
);

insert into t1 
select 
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  generate_series(1,100),
  repeat('a',100);

begin 
  for i in 1..16 loop
    insert into t1 select * from t1;
  end loop;
end;

test=# select count(*) from t1;
  count
---------
 6553600
(1 row)

不同的group by 走不同 aggregrate 算法。

test=# explain analyze select id1,count(*) from t1 group by id1;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=224335.18..224336.18 rows=100 width=12) (actual time=2084.369..2084.387 rows=100 loops=1)
   Group Key: id1
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=4) (actual time=1.043..876.510 rows=6553600 loops=1)
 Planning Time: 0.051 ms
 Execution Time: 2084.422 ms
(5 rows)

test=# explain analyze select id1,id2,count(*) from t1 group by id1,id2;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=240719.21..240819.21 rows=10000 width=16) (actual time=4801.232..4801.302 rows=100 loops=1)
   Group Key: id1, id2
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=8) (actual time=0.017..3457.207 rows=6553600 loops=1)
 Planning Time: 0.076 ms
 Execution Time: 4801.430 ms
(5 rows)

--随着 group by 列的数量增加，优化器开始走 GroupAggregate
test=# explain analyze select id1,id2,id3,count(*) from t1 group by id1,id2,id3;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1157569.44..1246043.20 rows=655361 width=20) (actual time=4519.386..6399.677 rows=100 loops=1)
   Group Key: id1, id2, id3
   ->  Sort  (cost=1157569.44..1173953.47 rows=6553612 width=12) (actual time=4501.004..5619.469 rows=6553600 loops=1)
         Sort Key: id1, id2, id3
         Sort Method: external merge  Disk: 141144kB
         ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=12) (actual time=0.492..1274.615 rows=6553600 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 6415.101 ms
(8 rows)

test=# explain analyze select id1,id2,id3,id4,count(*) from t1 group by id1,id2,id3,id4;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1157569.44..1262427.23 rows=655361 width=24) (actual time=4873.150..7008.342 rows=100 loops=1)
   Group Key: id1, id2, id3, id4
   ->  Sort  (cost=1157569.44..1173953.47 rows=6553612 width=16) (actual time=4855.042..6153.156 rows=6553600 loops=1)
         Sort Key: id1, id2, id3, id4
         Sort Method: external merge  Disk: 166792kB
         ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=16) (actual time=0.403..1320.904 rows=6553600 loops=1)
 Planning Time: 0.208 ms
 Execution Time: 7023.674 ms
(8 rows)

问题解析：根据优化器的估算，如果 group by 只有一列，有 100 个不同值，2列 100*100 个， 3列 100*100*100 个，也就是group by 后面的列越多，优化器会认为返回的不同值越多，优化器更偏向于采用GroupAggregate算法。基于以上思路，我们考虑对表创建多列的统计数据。

test=# create statistics s2(ndistinct) on id1,id2,id3,id4,id5,id6 from t1;
CREATE STATISTICS
test=# analyze t1;
ANALYZE
test=# explain analyze select id1,id2,id3,id4,count(*) from t1 group by id1,id2,id3,id4;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=273487.27..273488.27 rows=100 width=24) (actual time=2787.681..2787.695 rows=100 loops=1)
   Group Key: id1, id2, id3, id4
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=16) (actual time=0.024..926.445 rows=6553600 loops=1)
 Planning Time: 0.243 ms
 Execution Time: 2787.764 ms
(5 rows)

test=# explain analyze select id1,id2,id3,id4,id5,id6,count(*) from t1 group by id1,id2,id3,id4,id5,id6;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=306255.33..306256.33 rows=100 width=32) (actual time=3282.206..3282.223 rows=100 loops=1)
   Group Key: id1, id2, id3, id4, id5, id6
   ->  Seq Scan on t1  (cost=0.00..191567.12 rows=6553612 width=24) (actual time=0.232..951.364 rows=6553600 loops=1)
 Planning Time: 0.633 ms
 Execution Time: 3282.329 ms
(5 rows)

可以看到，对表创建多列统计数据后，优化器采用了 HashAggregate 算法，性能也得到了很大提升。

3、结论

优化器在决定采用哪种 group by 算法时，会估算结果分组个数，如果结果分组个数过多，优化器会偏向于 GroupAggregate。但是我们知道，PostgreSQL 对于多列值的估算存在问题，这就需要手动创建多列统计信息。实际优化过程中，可能经常需要创建多列统计信息，以促使优化器采用HashAggregate。