数据仓库刚迁到GP上,发现下面sql语句运行起来并没有想象中的快,三个表总数据量150万,结果集70万条记录,却运行了7秒多,感觉有蹊跷。这里记录了整个优化过程。
语句如下:
语句如下:
select
a.pay_date,
c.user_id,
a.order_id,
COALESCE(a.amount,0.00),
d.product_type,
d.base_rate,
d.add_rate,
a.product_period,
c.name,
c.sex,
substr(c.birthday,1,7),
c.phone,
COALESCE(c.province,'未知'),
c.salesman
from CDM.cdm_invest_itemized a,CDM.cdm_user c,CDM.cdm_product d
where a.user_id=c.user_id and a.product_name=d.product_name and a.pay_day= d.startdate and a.pay_day<=d.enddate
执行计划如下
Gather Motion 4:1 (slice3; segments: 4) (cost=1065.82..1066.07 rows=102 width=334)
Merge Key: "?column15?"
-> Sort (cost=1065.82..1066.07 rows=26 width=334)
Sort Key: a.pay_day
-> Hash Join (cost=656.01..1062.44 rows=26 width=334)
Hash Cond: d.product_name::text = a.product_name::text
Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
-> Seq Scan on cdm_product d (cost=0.00..322.00 rows=5550 width=56)
-> Hash (cost=610.34..610.34 rows=914 width=324)
-> Broadcast Motion 4:4 (slice2; segments: 4) (cost=190.00..610.34 rows=914 width=324)
-> Hash Join (cost=190.00..564.67 rows=229 width=324)
Hash Cond: a.user_id::text = c.user_id::text
-> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..353.50 rows=975 width=133)
Hash Key: a.user_id::text
-> Seq Scan on cdm_invest_itemized a (cost=0.00..275.50 rows=975 width=133)
Filter: pay_day::numeric < 20161111::numeric
-> Hash (cost=140.00..140.00 rows=1000 width=239)
-> Seq Scan on cdm_user c (cost=0.00..140.00 rows=1000 width=239)
运行7.5s秒
1、看下统计信息是否有问题
edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');
relname | relpages | reltuples
---------------------+----------+-----------
cdm_user | 0 | 0
cdm_product | 0 | 0
cdm_invest_itemized | 0 | 0
果然,刚迁移过来的数据,统计信息还是空的
二话不说,先收集统计信息
edw=# VACUUM ANALYSE CDM.cdm_product;
edw=# VACUUM ANALYSE CDM.cdm_user;
edw=# VACUUM ANALYSE CDM.cdm_invest_itemized;
edw=# select relname,relpages,reltuples from pg_class where relname in ('cdm_invest_itemized','cdm_user','cdm_product');
relname | relpages | reltuples
---------------------+----------+-----------
cdm_user | 6760 | 571022
cdm_product | 104 | 21280
cdm_invest_itemized | 6620 | 734028
统计信息这下正常了,执行时间也降低到了3.8秒
Gather Motion 4:1 (slice3; segments: 4) (cost=52471.97..68149.70 rows=101713 width=191)
-> Hash Join (cost=52471.97..68149.70 rows=25429 width=191)
Hash Cond: c.user_id::text = a.user_id::text
-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)
-> Hash (cost=51200.56..51200.56 rows=25429 width=140)
-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=2444.80..51200.56 rows=25429 width=140)
Hash Key: a.user_id::text
-> Hash Join (cost=2444.80..49166.32 rows=25429 width=140)
Hash Cond: a.product_name::text = d.product_name::text
Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
-> Seq Scan on cdm_invest_itemized a (cost=0.00..17630.42 rows=61169 width=131)
Filter: pay_day::numeric < 20161111::numeric
-> Hash (cost=1380.80..1380.80 rows=21280 width=57)
-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1380.80 rows=21280 width=57)
-> Seq Scan on cdm_product d (cost=0.00..316.80 rows=5320 width=57)
2、看下执行计划cdm_product表发生广播,cdm_invest_itemized发生重分布,说明三个表的分布键都不相同,看下三个张表的分布键,发现cdm_invest_itemized表是随机分布的
edw=# select gp_segment_id,"count"(*) from cdm.cdm_user GROUP BY 1;
gp_segment_id | count
---------------+--------
3 | 143999
0 | 143818
1 | 143931
2 | 143462
(4 rows)
edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;
gp_segment_id | count
---------------+-------
2 | 5322
0 | 5322
3 | 5322
1 | 5322
(4 rows)
edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;
gp_segment_id | count
---------------+--------
2 | 181572
3 | 181580
0 | 181580
1 | 181576
(4 rows)
edw=# \d cdm.cdm_invest_itemized
....
Distributed randomly
edw=# \d cdm.cdm_user
....
Distributed by: (user_id)
edw=# \d cdm.cdm_product
....
Distributed by: (id)
修改分布键
edw=# alter table CDM.cdm_invest_itemized set distributed by(user_id);
edw=# alter table CDM.cdm_product set distributed by(product_name);
再看下数据分布
edw=# select gp_segment_id,"count"(*) from cdm.cdm_invest_itemized GROUP BY 1;
gp_segment_id | count
---------------+--------
3 | 184454
1 | 174896
2 | 186732
0 | 180226
edw=# select gp_segment_id,"count"(*) from cdm.cdm_product GROUP BY 1;
gp_segment_id | count
---------------+-------
1 | 5288
3 | 5329
0 | 5370
2 | 5301
没发生严重的数据倾斜
再看下执行计划
Gather Motion 4:1 (slice2; segments: 4) (cost=50272.97..65941.86 rows=101207 width=191)
-> Hash Join (cost=50272.97..65941.86 rows=25302 width=191)
Hash Cond: c.user_id::text = a.user_id::text
-> Seq Scan on cdm_user c (cost=0.00..12470.22 rows=142756 width=98)
-> Hash (cost=49007.88..49007.88 rows=25302 width=140)
-> Hash Join (cost=2445.47..49007.88 rows=25302 width=140)
Hash Cond: a.product_name::text = d.product_name::text
Join Filter: a.pay_day >= d.startdate AND a.pay_day <= d.enddate
-> Seq Scan on cdm_invest_itemized a (cost=0.00..17613.69 rows=61088 width=131)
Filter: pay_day::numeric < 20161111::numeric
-> Hash (cost=1381.62..1381.62 rows=21277 width=57)
-> Broadcast Motion 4:4 (slice1; segments: 4) (cost=0.00..1381.62 rows=21277 width=57)
-> Seq Scan on cdm_product d (cost=0.00..317.77 rows=5320 width=57)
表cdm_product和表cdm_invest_itemized关联,分布键不一致,cdm_product 表数据量(21280 )*节点数(4)<cdm_invest_itemized数据量(734028 ),表cdm_product广播没毛病;表cdm_invest_itemized和表cdm_user关联分布键不发生广播和重分布,走hash="" join没毛病最终执行时间优化到2.65秒
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/29989552/viewspace-2128307/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/29989552/viewspace-2128307/