Db2数据分布不均匀导致的"select ... fetch first n rows only"性能问题

问题描述:
某应用SQL语句如下,其中id和name上均有索引:

select id from t1 where name is null order by id fetch first 10 rows only;

发现SQL语句执行时间特别长,如果order by id后面加上desc就会特别快:

select id from t1 where name is null order by id desc fetch first 10 rows only;

一般情况下fetch first xx rows only会执行得非常快,只要取到xx条记录就结束了,本次的SQL里,ID上有索引,所以排序不占时间,查看访问计划, 发现确实使用了id列上的索引,怀疑是id较小的记录中,对应name很少是null的,导致索引扫描效率低下.

复现:
为了验证该问题,设计了两个实验

试验1中,表T1有2列:ID和name, 2千万条记录,其中id为1-5000000的记录中name不是NULL的,剩余1千5百万条记录中name是NULL的。也就是NULL字段比较“集中”在后面

inst105@NODE01:~> cat 1.py 
for i in range(1,5000001):
    print '%s,name%s' % (i, i)
for i in range(5000001,20000001):
    print '%s,' % i
inst105@NODE01:~> python 1.py > t1.del
inst105@NODE01:~> db2 "connect to sample"
inst105@NODE01:~> db2 "drop table t1"
inst105@NODE01:~> db2 "create table t1(id int not null, name char(20))"
inst105@NODE01:~> db2 "create UNIQUE index t1_id_idx on t1(id)"
inst105@NODE01:~> db2 "create index t1_name_idx on t1(name)"
inst105@NODE01:~> db2 "load from t1.del of del insert into t1 nonrecoverable"
inst105@NODE01:~> db2 "runstats on table t1 with distribution on all columns and detailed indexes all"

inst105@NODE01:~> db2 "select * from t1 order by id fetch first 10 rows only"

ID          NAME                
----------- --------------------
          1 name1               
          2 name2               
          3 name3               
          4 name4               
          5 name5               
          6 name6               
          7 name7               
          8 name8               
          9 name9               
         10 name10              

  10 record(s) selected.

分别运行SQL,查看执行时间:

inst105@NODE01:~> time db2 "select id from t1 where name is null order by id fetch first 10 rows only"

ID         
-----------
    5000001
    5000002
    5000003
    5000004
    5000005
    5000006
    5000007
    5000008
    5000009
    5000010

  10 record(s) selected.


real    0m9.279s
user    0m0.048s
sys     0m0.024s

inst105@NODE01:~> time db2 "select id from t1 where name is null order by id desc fetch first 10 rows only"

ID         
-----------
   20000000
   19999999
   19999998
   19999997
   19999996
   19999995
   19999994
   19999993
   19999992
   19999991

  10 record(s) selected.


real    0m0.083s
user    0m0.045s
sys     0m0.026s

可以看到第一条SQL用时很长, 第2条SQL用时很短,这两条SQL的访问计划都是一样的:

Statement:
  
  select id 
  from t1 
  where name is null 
  order by id 
  fetch first 10 rows only

Optimizer Plan:

         Rows   
       Operator 
         (ID)   
         Cost   
               
         10    
       RETURN  
        ( 1)   
       27.2237 
         |    
         10   
       FETCH  
        ( 2)  
       238650 
      /      \
   2e+07    2e+07  
  IXSCAN    Table: 
   ( 3)     inst105 
  80383.7   T1     
    |      
   2e+07   
 Index:    
 inst105    
 T1_ID_IDX 

试验2:
试验2中, 同样name不为NULL的字段为5000000条,为NULL字段的记录为1千5百万条,只不过分布的比较“均匀”,如下:

inst105@NODE01:~> cat 3.py 
for i in range(1,20000001,4):
    print '%s,name%s' % (i, i)
    print '%s,' % (i+1)
    print '%s,' % (i+2)
    print '%s,' % (i+3)
inst105@NODE01:~> python 3.py > t3.del
inst105@NODE01:~> db2 "drop table t3"
inst105@NODE01:~> db2 "create table t3(id int not null, name char(20))"
inst105@NODE01:~> db2 "create UNIQUE index t3_id_idx on t3(id) cluster"
inst105@NODE01:~> db2 "create index t3_name_idx on t3(name)"
inst105@NODE01:~> db2 "load from t3.del of del insert into t3 nonrecoverable"
inst105@NODE01:~> db2 "runstats on table t3 with distribution on all columns and detailed indexes all"

inst105@NODE01:~> db2 "select * from t3 order by id fetch first 10 rows only"

ID          NAME                
----------- --------------------
          1 name1               
          2 -                   
          3 -                   
          4 -                   
          5 name5               
          6 -                   
          7 -                   
          8 -                   
          9 name9               
         10 -                   

  10 record(s) selected.

inst105@NODE01:~> time db2 "select id from t3 where name is null order by id fetch first 10 rows only"

ID         
-----------
          2
          3
          4
          6
          7
          8
         10
         11
         12
         14

  10 record(s) selected.


real    0m0.079s
user    0m0.047s
sys     0m0.023s

inst105@NODE01:~> time db2 "select id from t3 where name is null order by id desc fetch first 10 rows only"

ID         
-----------
   20000000
   19999999
   19999998
   19999996
   19999995
   19999994
   19999992
   19999991
   19999990
   19999988

  10 record(s) selected.


real    0m0.079s
user    0m0.045s
sys     0m0.024s

可以看到,上面两条SQL执行的时间都比较短,其执行计划相比试验1,并没有任何不同。

Statement:
  
  select id 
  from t3 
  where name is null 
  order by id 
  fetch first 10 rows only

Optimizer Plan:

         Rows   
       Operator 
         (ID)   
         Cost   
               
         10    
       RETURN  
        ( 1)   
       27.2237 
         |    
         10   
       FETCH  
        ( 2)  
       238650 
      /      \
   2e+07    2e+07  
  IXSCAN    Table: 
   ( 3)     inst105 
  80383.7   T3     
    |      
   2e+07   
 Index:    
 inst105    
 T3_ID_IDX 

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值