SQL性能区别(2)——join on/and顺序

1. hive中使用join时,and 和 on条件等的先后顺序影响

2019-08-28记录

在sql中,往往会使用上一篇博文【性能区别(1)】中分析的那样,在join时,就采用and条件提前过滤不需要的数据,但是有一个地方注意到没有:

-- 1. 先on 连接条件,后 and 筛选条件
explain
select o.orderid,s.sequence
from flt_bidb.v_ffo o
left join flt_bidb.DW_FactFltSegment s
on s.orderid = o.orderid 
and s.d = '2019-09-18' 
and s.flightclass = 'I'
where o.flightclass = 'I'
and s.sequence = 1

-- 2. 先on 筛选条件,后 and 连接条件
explain
select o.orderid,s.sequence
from flt_bidb.v_ffo o
left join flt_bidb.DW_FactFltSegment s
on s.flightclass = 'I'   
and s.d = '2019-09-18' 
and s.orderid = o.orderid
where o.flightclass = 'I'
and s.sequence = 1

这两个得到的结果是一样的,至少在hive中是,那么这个运气起来有啥区别呢?

看一下执行计划。

Hive中在做多表关联时,由于Hive的SQL优化引擎还不够强大,表的关联顺序不同往往导致产生不同数量的MapReduce作业数。这时就需要通过分析执行计划对SQL进行调整,以获得最少的MapReduce作业数。

(看的迷糊的可以直接看后面结论)

-- 1. on 连接条件 and 筛选条件
STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-4
    Conditional Operator
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: o
            Statistics: Num rows: 580159641 Data size: 8244188132803 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (flightclass = 'I') (type: boolean)
              Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: orderid (type: bigint)
                sort order: +
                Map-reduce partition columns: orderid (type: bigint)
                Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: s
            Statistics: Num rows: 655134731 Data size: 58757293446 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (flightclass = 'I') (type: boolean)
              Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: orderid (type: bigint)
                sort order: +
                Map-reduce partition columns: orderid (type: bigint)
                Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
                value expressions: sequence (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 orderid (type: bigint)
            1 orderid (type: bigint)
          outputColumnNames: _col0, _col291
          Statistics: Num rows: 360324109 Data size: 32316512046 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col291 = 1) (type: boolean)
            Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: bigint), 1 (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

和下面第二种对比,发现其执行计划没有区别,但是有个意外收获,不知道大家主要到where中的两个条件没有,

o.flightclass = 'I' and s.sequence = 1

这两个条件虽然都在where中,但是其执行计划却放在了不同的位置!

-- 2. on 筛选条件 and 连接条件
STAGE DEPENDENCIES:
  Stage-4 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1
STAGE PLANS:
  Stage: Stage-4
    Conditional Operator
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: o
            Statistics: Num rows: 580159641 Data size: 8244188132803 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (flightclass = 'I') (type: boolean)
              Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: orderid (type: bigint)
                sort order: +
                Map-reduce partition columns: orderid (type: bigint)
                Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
          TableScan
            alias: s
            Statistics: Num rows: 655134731 Data size: 58757293446 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (flightclass = 'I') (type: boolean)
              Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: orderid (type: bigint)
                sort order: +
                Map-reduce partition columns: orderid (type: bigint)
                Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
                value expressions: sequence (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          keys:
            0 orderid (type: bigint)
            1 orderid (type: bigint)
          outputColumnNames: _col0, _col291
          Statistics: Num rows: 360324109 Data size: 32316512046 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col291 = 1) (type: boolean)
            Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: bigint), 1 (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

所以可以得出的结论是:

  1. 在hive的join中,无论是先on连接条件,还是先on筛选条件,其执行计划是一样的,都是先在map operator tree 阶段根据筛选条件过滤数据,然后再在reduce operator tree阶段执行 join 操作;
  2. hive中,会自动进行谓词下推操作,在where条件中写的对主表的条件,会在map operator tree 阶段就执行,但是where条件中对其他连接表的操作,就不会进行下推,而是放在了reduce operator tree阶段执行。

第二个结论也符合上一篇博文中写的on/where的区别分析,因为对主表进行谓词下推,无论是早点筛选,还是晚点,对结果是没啥影响的,但是在连接表中,比如左连接,谓词下推就只能对右边的连接表进行过滤,但放在where中,就会对最后的整个结果集进行过滤,所以得到的结果是不一样的,也就不会将其进行自动下推!

这是我自己分析的结果哈,欢迎指导!

2. 谓词下推和子查询的区别

2019-09-11记录

-- 1.
  select count(*),count(distinct seo.uid)
	  from (
    		select se.uid 
    		  from tmp_fltdb.tmp_search_3m_uid se
    	 left join tmp_fltdb.tmp_order_3m o
    			on upper(o.uid) = upper(se.uid)
    		 where o.uid is null
  			  ) seo
left join tmp_fltdb.tmp_order_cf s
       on upper(s.uid) = upper(seo.uid)
  	where s.uid is not null

-- 2. 
	select count(*),count(distinct se.uid)
	  from  tmp_fltdb.tmp_search_3m_uid se
 left join tmp_fltdb.tmp_order_3m o
		on upper(o.uid) = upper(se.uid)
left join tmp_fltdb.tmp_order_cf s
	   on upper(s.uid) = upper(se.uid)
	where s.uid is not null
	  and o.uid is null

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值