hive中join on和where一起使用时的执行顺序及关联字段为null的情况

查看很多博客,都说join on和where一起使用时,先进行join产生临时表,再进行where条件过滤得到结果表,但使用explain查看执行计划,会发现执行顺序并不是这样。

在两表join(注意: 此处只表示内连接, 如果是left join的话,不会过滤null, 会产生数据倾斜)时如果关联字段为null,有些博客上写会产生数据倾斜或笛卡儿积,其实不会,因为在join之前就会对join的表中关联字段进行非空过滤。

> explain select
> *
> from
> deal_tb
> join
> order_tb
> on daystr=day
> where
> month="2019-02";
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage             # 先执行Stage-4
  Stage-3 depends on stages: Stage-4  # Stage-3依赖于Stage-4
  Stage-0 depends on stages: Stage-3  # Stage-0依赖于Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_1:order_tb   #先执行join后的表
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_1:order_tb 
          TableScan  #扫描表
            alias: order_tb
            Statistics: Num rows: 14 Data size: 280 Basic stats: COMPLETE Column stats: NONE
            Filter Operator  #执行where条件进行过滤
              predicate: ((UDFToString(month) = '2019-01') and day is not null) (type: boolean)
              Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
              Select Operator  #在执行select语句
                expressions: day (type: varchar(10)), money (type: int)
                outputColumnNames: _col1, _col2
                Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col1 (type: varchar(10))
                    1 _col1 (type: varchar(10))

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan  #扫描表
            alias: deal_tb
            Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
            Filter Operator  #执行where语句进行过滤
              predicate: daystr is not null (type: boolean)  #对关联字段进行非空过滤
              Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
              Select Operator  #执行select语句
                expressions: id (type: int), daystr (type: varchar(10)), amount (type: decimal(10,2))
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:  #执行join
                       Inner Join 0 to 1
                  keys:
                    0 _col1 (type: varchar(10))
                    1 _col1 (type: varchar(10))
                  outputColumnNames: _col0, _col1, _col2, _col4, _col5
                  Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: int), _col1 (type: varchar(10)), _col2 (type: decimal(10,2)), '2019-01' (type: varchar(7)), _col4 (type: varcha
r(10)), _col5 (type: int)                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                    Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 41 Data size: 794 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 1.019 seconds, Fetched: 71 row(s)

在explain后的执行计划中的29行和47行可以看出,先扫描join后的表,对此表进行where语句的条件过滤和关联字段非空过滤,再扫描from后的表进行where语句中条件的过滤和关联字段的非空过滤。

当where语句中的条件为关联字段的过滤,不管你过滤的字段名用A表还是B表,在join之前都会对两表的关联字段进行过滤,不信可以查看SQL语句的执行计划:

> explain select
> *
> from
> deal_tb
> join
> order_tb
> on daystr=day
> where
> daystr="2019-01-20";
OK
Explain
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        order_tb 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        order_tb 
          TableScan
            alias: order_tb
            Statistics: Num rows: 14 Data size: 280 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (day = '2019-01-20') (type: boolean)
              Statistics: Num rows: 7 Data size: 140 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 daystr (type: varchar(10))
                  1 day (type: varchar(10))

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: deal_tb
            Statistics: Num rows: 38 Data size: 722 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (daystr = '2019-01-20') (type: boolean)
              Statistics: Num rows: 19 Data size: 361 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 daystr (type: varchar(10))
                  1 day (type: varchar(10))
                outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8
                Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: varchar(10)), _col2 (type: decimal(10,2)), _col6 (type: varchar(7)), _col7 (type: varchar(10))
, _col8 (type: int)                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                  Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 20 Data size: 397 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Execution mode: vectorized
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 2.237 seconds, Fetched: 63 row(s)

在执行计划中30行和44行出现这两句话,说明对两表都进行了关联字段where语句条件的过滤
predicate: (day = ‘2019-01-20’) (type: boolean)
predicate: (daystr = ‘2019-01-20’) (type: boolean)

对于hive SQL执行的顺序如果有模糊的,建议使用explain查看执行计划,这样能看到正确的执行顺序。

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 2
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值