sql join on 与where区别
explain select count(1) from cellinfo_20171124 ci join shandong_lac_ci slc on concat(cast(ci.lac as string),",",cast(ci.ci as string)) =slc.lac_ci;
| == Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)])
+- *Project
+- *SortMergeJoin [concat(cast(lac#72382 as string), ,, cast(ci#72383 as string))], [lac_ci#72394], Inner
:- *Sort [concat(cast(lac#72382 as string), ,, cast(ci#72383 as string)) ASC], false, 0
: +- Exchange hashpartitioning(concat(cast(lac#72382 as string), ,, cast(ci#72383 as string)), 200)
: +- HiveTableScan [lac#72382, ci#72383], MetastoreRelation default, cellinfo_20171124, ci
+- *Sort [lac_ci#72394 ASC], false, 0
+- Exchange hashpartitioning(lac_ci#72394, 200)
+- *Filter isnotnull(lac_ci#72394)
+- HiveTableScan [lac_ci#72394], MetastoreRelation default, shandong_lac_ci, slc |
explain select count(*) from cellinfo_20171124 where concat(cast(lac as string),",",cast(ci as string)) in (select lac_ci from shandong_lac_ci);
| == Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)])
+- *Project
+- SortMergeJoin [concat(cast(lac#72407 as string), ,, cast(ci#72408 as string))], [lac_ci#72420], LeftSemi
:- *Sort [concat(cast(lac#72407 as string), ,, cast(ci#72408 as string)) ASC], false, 0
: +- Exchange hashpartitioning(concat(cast(lac#72407 as string), ,, cast(ci#72408 as string)), 200)
: +- HiveTableScan [mcc#72405, mnc#72406, lac#72407, ci#72408, lat#72409, lon#72410, acc#72411, date#72412, validity#72413, addr#72414, province#72415, city#72416, district#72417, township#72418], MetastoreRelation default, cellinfo_20171124
+- *Sort [lac_ci#72420 ASC], false, 0
+- Exchange hashpartitioning(lac_ci#72420, 200)
+- HiveTableScan [lac_ci#72420], MetastoreRelation default, shandong_lac_ci |
以上两条语句实现的功能一样,查询基站的覆盖率,结果条数也一样,但是速度却有差别。
![]()
从上图中可以看出来,速度差好几倍,为什么?
- 查看后台执行计划
通过后台执行的图中可以看出来,join on 与in的主要区别在Shuffle Write与第二段的Shuffle Read
@(这两个值是一个概念,第一阶段的输出是第二阶段的输入)
通过查询join on与where的区别得知以下情况:
https://www.cnblogs.com/Jessy/p/3525419.html
1,on条件是在生成临时表时使用的条件
2,where条件是在临时表生成好后,再对临时表进行过滤的条件
理解这两个原因之后,join的时候,临时表的内容只有lac,ci这两个字段(972.5M);而in的where是临时表之后的条件,所以in的临时表有所有字段(11.9G,为什么不是21.3G,因为spark做了压缩,条数一样,数据量变小了),导致输出时间长,因为写文件大,下阶段的输入时间长,因为要对这些数据读取做比较处理。