sql join on 与where区别

sql join on 与where区别

explain select count(1) from cellinfo_20171124 ci  join shandong_lac_ci slc on concat(cast(ci.lac as string),",",cast(ci.ci as string)) =slc.lac_ci;

| == Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
      +- *Project
         +- *SortMergeJoin [concat(cast(lac#72382 as string), ,, cast(ci#72383 as string))], [lac_ci#72394], Inner
            :- *Sort [concat(cast(lac#72382 as string), ,, cast(ci#72383 as string)) ASC], false, 0
            :  +- Exchange hashpartitioning(concat(cast(lac#72382 as string), ,, cast(ci#72383 as string)), 200)
            :     +- HiveTableScan [lac#72382, ci#72383], MetastoreRelation default, cellinfo_20171124, ci
            +- *Sort [lac_ci#72394 ASC], false, 0
               +- Exchange hashpartitioning(lac_ci#72394, 200)
                  +- *Filter isnotnull(lac_ci#72394)
                     +- HiveTableScan [lac_ci#72394], MetastoreRelation default, shandong_lac_ci, slc  |

explain select count(*) from cellinfo_20171124 where concat(cast(lac as string),",",cast(ci as string)) in (select lac_ci from shandong_lac_ci);

| == Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)])
      +- *Project
         +- SortMergeJoin [concat(cast(lac#72407 as string), ,, cast(ci#72408 as string))], [lac_ci#72420], LeftSemi
            :- *Sort [concat(cast(lac#72407 as string), ,, cast(ci#72408 as string)) ASC], false, 0
            :  +- Exchange hashpartitioning(concat(cast(lac#72407 as string), ,, cast(ci#72408 as string)), 200)
            :     +- HiveTableScan [mcc#72405, mnc#72406, lac#72407, ci#72408, lat#72409, lon#72410, acc#72411, date#72412, validity#72413, addr#72414, province#72415, city#72416, district#72417, township#72418], MetastoreRelation default, cellinfo_20171124
            +- *Sort [lac_ci#72420 ASC], false, 0
               +- Exchange hashpartitioning(lac_ci#72420, 200)
                  +- HiveTableScan [lac_ci#72420], MetastoreRelation default, shandong_lac_ci  |

以上两条语句实现的功能一样,查询基站的覆盖率,结果条数也一样,但是速度却有差别。
image
从上图中可以看出来,速度差好几倍,为什么?


  • 查看后台执行计划
    image
    image

通过后台执行的图中可以看出来,join on 与in的主要区别在Shuffle Write与第二段的Shuffle Read @(这两个值是一个概念,第一阶段的输出是第二阶段的输入)
通过查询join on与where的区别得知以下情况:
https://www.cnblogs.com/Jessy/p/3525419.html
1,on条件是在生成临时表时使用的条件
2,where条件是在临时表生成好后,再对临时表进行过滤的条件
理解这两个原因之后,join的时候,临时表的内容只有lac,ci这两个字段(972.5M);而in的where是临时表之后的条件,所以in的临时表有所有字段(11.9G,为什么不是21.3G,因为spark做了压缩,条数一样,数据量变小了),导致输出时间长,因为写文件大,下阶段的输入时间长,因为要对这些数据读取做比较处理。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值