1. hive中使用join时,and 和 on条件等的先后顺序影响
2019-08-28记录
在sql中,往往会使用上一篇博文【性能区别(1)】中分析的那样,在join时,就采用and条件提前过滤不需要的数据,但是有一个地方注意到没有:
-- 1. 先on 连接条件,后 and 筛选条件
explain
select o.orderid,s.sequence
from flt_bidb.v_ffo o
left join flt_bidb.DW_FactFltSegment s
on s.orderid = o.orderid
and s.d = '2019-09-18'
and s.flightclass = 'I'
where o.flightclass = 'I'
and s.sequence = 1
-- 2. 先on 筛选条件,后 and 连接条件
explain
select o.orderid,s.sequence
from flt_bidb.v_ffo o
left join flt_bidb.DW_FactFltSegment s
on s.flightclass = 'I'
and s.d = '2019-09-18'
and s.orderid = o.orderid
where o.flightclass = 'I'
and s.sequence = 1
这两个得到的结果是一样的,至少在hive中是,那么这个运气起来有啥区别呢?
看一下执行计划。
Hive中在做多表关联时,由于Hive的SQL优化引擎还不够强大,表的关联顺序不同往往导致产生不同数量的MapReduce作业数。这时就需要通过分析执行计划对SQL进行调整,以获得最少的MapReduce作业数。
(看的迷糊的可以直接看后面结论)
-- 1. on 连接条件 and 筛选条件
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: o
Statistics: Num rows: 580159641 Data size: 8244188132803 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (flightclass = 'I') (type: boolean)
Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: orderid (type: bigint)
sort order: +
Map-reduce partition columns: orderid (type: bigint)
Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: s
Statistics: Num rows: 655134731 Data size: 58757293446 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (flightclass = 'I') (type: boolean)
Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: orderid (type: bigint)
sort order: +
Map-reduce partition columns: orderid (type: bigint)
Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
value expressions: sequence (type: int)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 orderid (type: bigint)
1 orderid (type: bigint)
outputColumnNames: _col0, _col291
Statistics: Num rows: 360324109 Data size: 32316512046 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col291 = 1) (type: boolean)
Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), 1 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
和下面第二种对比,发现其执行计划没有区别,但是有个意外收获,不知道大家主要到where中的两个条件没有,
o.flightclass = 'I' and s.sequence = 1
这两个条件虽然都在where中,但是其执行计划却放在了不同的位置!
-- 2. on 筛选条件 and 连接条件
STAGE DEPENDENCIES:
Stage-4 is a root stage , consists of Stage-1
Stage-1
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-4
Conditional Operator
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: o
Statistics: Num rows: 580159641 Data size: 8244188132803 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (flightclass = 'I') (type: boolean)
Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: orderid (type: bigint)
sort order: +
Map-reduce partition columns: orderid (type: bigint)
Statistics: Num rows: 290079820 Data size: 4122094059296 Basic stats: COMPLETE Column stats: NONE
TableScan
alias: s
Statistics: Num rows: 655134731 Data size: 58757293446 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (flightclass = 'I') (type: boolean)
Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: orderid (type: bigint)
sort order: +
Map-reduce partition columns: orderid (type: bigint)
Statistics: Num rows: 327567365 Data size: 29378646678 Basic stats: COMPLETE Column stats: NONE
value expressions: sequence (type: int)
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
keys:
0 orderid (type: bigint)
1 orderid (type: bigint)
outputColumnNames: _col0, _col291
Statistics: Num rows: 360324109 Data size: 32316512046 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col291 = 1) (type: boolean)
Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), 1 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 180162054 Data size: 16158255978 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
所以可以得出的结论是:
- 在hive的join中,无论是先on连接条件,还是先on筛选条件,其执行计划是一样的,都是先在map operator tree 阶段根据筛选条件过滤数据,然后再在reduce operator tree阶段执行 join 操作;
- hive中,会自动进行谓词下推操作,在where条件中写的对主表的条件,会在map operator tree 阶段就执行,但是where条件中对其他连接表的操作,就不会进行下推,而是放在了reduce operator tree阶段执行。
第二个结论也符合上一篇博文中写的on/where的区别分析,因为对主表进行谓词下推,无论是早点筛选,还是晚点,对结果是没啥影响的,但是在连接表中,比如左连接,谓词下推就只能对右边的连接表进行过滤,但放在where中,就会对最后的整个结果集进行过滤,所以得到的结果是不一样的,也就不会将其进行自动下推!
这是我自己分析的结果哈,欢迎指导!
2. 谓词下推和子查询的区别
2019-09-11记录
-- 1.
select count(*),count(distinct seo.uid)
from (
select se.uid
from tmp_fltdb.tmp_search_3m_uid se
left join tmp_fltdb.tmp_order_3m o
on upper(o.uid) = upper(se.uid)
where o.uid is null
) seo
left join tmp_fltdb.tmp_order_cf s
on upper(s.uid) = upper(seo.uid)
where s.uid is not null
-- 2.
select count(*),count(distinct se.uid)
from tmp_fltdb.tmp_search_3m_uid se
left join tmp_fltdb.tmp_order_3m o
on upper(o.uid) = upper(se.uid)
left join tmp_fltdb.tmp_order_cf s
on upper(s.uid) = upper(se.uid)
where s.uid is not null
and o.uid is null