greenplum 简单sql优化案例

最新推荐文章于 2023-09-25 16:27:35 发布

ckqq46580

最新推荐文章于 2023-09-25 16:27:35 发布

阅读量511

点赞数

数据仓库刚迁到GP上，发现下面sql语句运行起来并没有想象中的快，三个表总数据量150万，结果集70万条记录，却运行了7秒多，感觉有蹊跷。这里记录了整个优化过程。
语句如下：

select

a.pay_date,

c.user_id,

a.order_id,

COALESCE(a.amount,0.00),

d.product_type,

d.base_rate,

d.add_rate,

a.product_period,

c.name,

c.sex,

substr(c.birthday,1,7),

c.phone,

COALESCE(c.province,'未知'),

c.salesman

from CDM.cdm_invest_itemized a,CDM.cdm_user c,CDM.cdm_product d

where a.user_id=c.user_id and a.product_name=d.product_name and a.pay_day= d.startdate and a.pay_day<=d.enddate

执行计划如下

Gather Motion 4:1 (slice3; segments: 4) (cost=1065.82..1066.07 rows=102 width=334)

Merge Key: "?column15?"

-> Sort (cost=1065.82..1066.07 rows=26 width=334)

Sort Key: a.pay_day

-> Hash Join (cost=656.01..1062.44 rows=26 width=334)

Hash Cond: d.product_name::text = a.product_name::text

Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate

-> Seq Scan on cdm_product d (cost=0.00..322.00 rows=5550 width=56)

-> Hash (cost=610.34..610.34 rows=914 width=324)

-> Broadcast Motion 4:4 (slice2; segments: 4) (cost=190.00..610.34 rows=914 width=324)

-> Hash Join (cost=190.00..564.67 rows=229 width=324)

Hash Cond: a.user_id::text = c.user_id::text

-> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..353.50 rows=975 width=133)

Hash Key: a.user_id::text

-> Seq Scan on cdm_invest_itemized a (cost=0.00..275.50 rows=975 width=133)

Filter: pay_day::numeric < 20161111::numeric

-> Hash (cost=140.00..140.00 rows=1000 width=239)

-> Seq Scan on cdm_user c (cost=0.00..140.00 rows=1000 width=239)

运行7.5s秒

1、看下统计信息是否有问题

edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');

relname | relpages | reltuples

---------------------+----------+-----------

cdm_user | 0 | 0

cdm_product | 0 | 0

cdm_invest_itemized | 0 | 0

果然，刚迁移过来的数据，统计信息还是空的

二话不说，先收集统计信息

edw=# VACUUM ANALYSE CDM.cdm_product;

edw=# VACUUM ANALYSE CDM.cdm_user;

edw=# VACUUM ANALYSE CDM.cdm_invest_itemized;

edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');

relname | relpages | reltuples

---------------------+----------+-----------

cdm_user | 6760 | 571022

cdm_product | 104 | 21280

cdm_invest_itemized | 6620 | 734028

统计信息这下正常了，执行时间也降低到了3.8秒

Gather Motion 4:1 (slice3; segments: 4) (cost=52471.97..68149.70 rows=101713 width=191)

-> Hash Join (cost=52471.97..68149.70 rows=25429 width=191)

Hash Cond: c.user_id::text = a.user_id::text

-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)

-> Hash (cost=51200.56..51200.56 rows=25429 width=140)

-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=2444.80..51200.56 rows=25429 width=140)

Hash Key: a.user_id::text

-> Hash Join (cost=2444.80..49166.32 rows=25429 width=140)

Hash Cond: a.product_name::text = d.product_name::text

Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate

-> Seq Scan on cdm_invest_itemized a (cost=0.00..17630.42 rows=61169 width=131)

Filter: pay_day::numeric < 20161111::numeric

-> Hash (cost=1380.80..1380.80 rows=21280 width=57)

-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1380.80 rows=21280 width=57)

-> Seq Scan on cdm_product d (cost=0.00..316.80 rows=5320 width=57)

2、看下执行计划cdm_product表发生广播，cdm_invest_itemized发生重分布，说明三个表的分布键都不相同，看下三个张表的分布键，发现cdm_invest_itemized表是随机分布的

edw=# select gp_segment_id,"count"(*) from cdm.cdm_user GROUP BY 1;

gp_segment_id | count

---------------+--------

3 | 143999

0 | 143818

1 | 143931

2 | 143462

(4 rows)

edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;

gp_segment_id | count

---------------+-------

2 | 5322

0 | 5322

3 | 5322

1 | 5322

(4 rows)

edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;

gp_segment_id | count

---------------+--------

2 | 181572

3 | 181580

0 | 181580

1 | 181576

(4 rows)

edw=# \d cdm.cdm_invest_itemized

....

Distributed randomly

edw=# \d cdm.cdm_user

....

Distributed by: (user_id)

edw=# \d cdm.cdm_product

....

Distributed by: (id)

修改分布键

edw=# alter table CDM.cdm_invest_itemized set distributed by(user_id);

edw=# alter table CDM.cdm_product set distributed by(product_name);

再看下数据分布

edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;

gp_segment_id | count

---------------+--------

3 | 184454

1 | 174896

2 | 186732

0 | 180226

edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;

gp_segment_id | count

---------------+-------

1 | 5288

3 | 5329

0 | 5370

2 | 5301

没发生严重的数据倾斜

再看下执行计划

Gather Motion 4:1 (slice2; segments: 4) (cost=50272.97..65941.86 rows=101207 width=191)

-> Hash Join (cost=50272.97..65941.86 rows=25302 width=191)

Hash Cond: c.user_id::text = a.user_id::text

-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)

-> Hash (cost=49007.88..49007.88 rows=25302 width=140)

-> Hash Join (cost=2445.47..49007.88 rows=25302 width=140)

Hash Cond: a.product_name::text = d.product_name::text

Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate

-> Seq Scan on cdm_invest_itemized a (cost=0.00..17613.69 rows=61088 width=131)

Filter: pay_day::numeric < 20161111::numeric

-> Hash (cost=1381.62..1381.62 rows=21277 width=57)

-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1381.62 rows=21277 width=57)

-> Seq Scan on cdm_product d (cost=0.00..317.77 rows=5320 width=57)

表cdm_product和表cdm_invest_itemized关联，分布键不一致，cdm_product 表数据量（21280 ）*节点数（4）<cdm_invest_itemized数据量（734028 ），表cdm_product广播没毛病；表cdm_invest_itemized和表cdm_user关联分布键不发生广播和重分布，走hash="" join没毛病最终执行时间优化到2.65秒

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/29989552/viewspace-2128307/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/29989552/viewspace-2128307/