Greenplum 注意对其数据类型，否则优化器让你好看-CSDN博客

在测试tpch时，发现有一些SQL跑超时，原因是测试SQL中有一些JOIN查询，而这些JOIN的列数据类型不一致，导致无法使用索引，或者无法使用hash join。
例子：
int和int8，都是整型，只是长度不一样。关联时，会遇到性能问题，因为不能使用HASH JOIN。
digoal=# create table t(id int);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# insert into t select generate_series(1,10000);
INSERT 0 10000
digoal=# insert into t1 select generate_series(1,10000);
INSERT 0 10000
关联字段为int和int8，不能使用hash join，只能用nestloop。
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
即使强制关掉nestloop也不行。
digoal=# set enable_nestloop=off;
SET
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
和数据量当然并没有关系。
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN                                              
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
接下来，使用同样的数据类型，可以用HASH JOIN关联。性能大幅提升。
digoal=# create table t2(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t2 select * from t1;
INSERT 0 10010000
digoal=# create index idx_t2 on t2(id);
CREATE INDEX
digoal=# explain select * from t2 join t1 on( t2.id=t1.id );
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=236320.07..497677.13 rows=10013203 width=16)
   ->  Hash Join  (cost=236320.07..497677.13 rows=455146 width=16)
         Hash Cond: t1.id = t2.id
         ->  Seq Scan on t1  (cost=0.00..111158.22 rows=455161 width=8)
         ->  Hash  (cost=111155.03..111155.03 rows=455146 width=8)
               ->  Seq Scan on t2  (cost=0.00..111155.03 rows=455146 width=8)
 Optimizer status: legacy query optimizer
(7 rows)

另一个例子，当char长度不一样，不会有以上类似的问题。
digoal=# create table t(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id char(20));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# explain select * from t join t1 on( t.id=t1.id and t.id='1');
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..401.54 rows=4 width=32)
   ->  Nested Loop  (cost=0.00..401.54 rows=1 width=32)
         ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
               Index Cond: id = '1'::bpchar
         ->  Index Scan using idx_t1 on t1  (cost=0.00..200.83 rows=1 width=21)
               Index Cond: '1'::bpchar = id
 Optimizer status: legacy query optimizer
(7 rows)
varchar和char(n)也不存在问题。
postgres=# create table t(id varchar);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# create table t1(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# explain select * from t , t1 where t.id=t1.id;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.01..0.07 rows=4 width=76)
   ->  Hash Join  (cost=0.01..0.07 rows=1 width=76)
         Hash Cond: t.id::bpchar = t1.id
         ->  Redistribute Motion 22:22  (slice1; segments: 22)  (cost=0.00..0.02 rows=1 width=32)
               Hash Key: t.id::bpchar
               ->  Seq Scan on t  (cost=0.00..0.00 rows=1 width=32)
         ->  Hash  (cost=0.00..0.00 rows=1 width=44)
               ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=44)
 Optimizer status: legacy query optimizer
(9 rows)

数据类型和查询条件的类型不一致时，不走索引的例子。
digoal=# explain select * from t where id=1;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=0.00..163369.87 rows=9995 width=11)
   ->  Seq Scan on t  (cost=0.00..163369.87 rows=455 width=11)
         Filter: id::text = '1'::text
 Optimizer status: legacy query optimizer
(4 rows)

digoal=# explain select * from t where id='1';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)
显示转换后，可以走索引。
digoal=# explain select * from t where id=1::bpchar;
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)

greenplum因为用了早期的postgresql版本，所以这一方面优化器并不好，在使用greenplum时，需要注意一下。
greenplum将合并到postgresql 9.5的版本，以上问题可以消除。
postgresql 9.4的例子：
digoal=# create table t(id int);
CREATE TABLE
digoal=# create table t1(id int8);
CREATE TABLE
digoal=# insert into t select generate_series(1,100000);
INSERT 0 100000
digoal=# insert into t1 select generate_series(1,100000);
INSERT 0 100000
digoal=# explain select * from t,t1 where t.id=t1.id;
                              QUERY PLAN                              
----------------------------------------------------------------------
 Hash Join  (cost=2693.00..6136.00 rows=100000 width=12)
   Hash Cond: (t.id = t1.id)
   ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
         ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8)
(5 rows)