Left join结果却是Inner join?(Spark/Hive)

将右表条件写在了where上,导致执行结果变成inner join。

1. left join where和on条件的区别

测试数据

--建表
create table zzzt1(id string, name string) partitioned by (dt string);
create table zzzt2(id string, class string) partitioned by (dt string);

-- 插入数据
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT overwrite TABLE zzzt1 partition(dt) values
('1','A','2022-08-10'), 
('2','B','2022-08-10'),
('1','A1','2022-08-11'),
('2','B1','2022-08-11')

INSERT overwrite TABLE zzzt2 partition(dt) values
('1','101', '2022-08-10'),
('2','101', '2022-08-10'),
('1','102','2022-08-11'),
('2','102','2022-08-11')
  • 示例1
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t1.id = '1'
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+
  • 示例2
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id AND t1.id = '1'
+------+-------+-------+--------+-------------+-------------+
| id1  |  id2  | name  | class  |     dt1     |     dt2     |
+------+-------+-------+--------+-------------+-------------+
| 1    | 1     | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1     | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1     | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1     | A1    | 102    | 2022-08-11  | 2022-08-11  |
| 2    | NULL  | B     | NULL   | 2022-08-10  | NULL        |
| 2    | NULL  | B1    | NULL   | 2022-08-11  | NULL        |
+------+-------+-------+--------+-------------+-------------+

将左表的t1.id = '1'条件从where移到on中后,结果中仅仅包含了右表id=1的连接结果。

  • 示例3
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id AND t2.id = '1'
+------+-------+-------+--------+-------------+-------------+
| id1  |  id2  | name  | class  |     dt1     |     dt2     |
+------+-------+-------+--------+-------------+-------------+
| 1    | 1     | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1     | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1     | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1     | A1    | 102    | 2022-08-11  | 2022-08-11  |
| 2    | NULL  | B     | NULL   | 2022-08-10  | NULL        |
| 2    | NULL  | B1    | NULL   | 2022-08-11  | NULL        |
+------+-------+-------+--------+-------------+-------------+

on条件中增加t2.id = '1',结果中仅仅包含了右表id=1的连接结果。

  • 示例4
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t2.id = '1'
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

右表t2.id = '1'条件写在where中,结果同inner join。

上述4个示意中,存在两个疑问:

  • 为什么示例2和3中的结果中仅包含右表中id=1的连接结果。
  • 为什么右表中的条件是写在where中,结果同inner join?

分别查看上述三个SQL的执行计划。

示例1执行计划如下,从执行计划中可知写在where中的左表t1.id = '1'条件分别加入到t1表和t2表的fliter和谓词下推中,EqualTo(id,1)、(id#6 = 1)、(id#9 = 1)。

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- SortMergeJoin [id#6], [id#9], LeftOuter
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) Project [id#6, name#7, dt#8]
      :        +- *(1) Filter (isnotnull(id#6) && (id#6 = 1))
      :           +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter ((id#9 = 1) && isnotnull(id#9))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [EqualTo(id,1), IsNotNull(id)], ReadSchema: struct<id:string,class:string>

示例2执行计划如下,从执行计划可知,写在on中的左表t1.id = '1'条件,并没有作用在左表上,而是作用在了右表,EqualTo(id,1)、(id#9 = 1)。

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- SortMergeJoin [id#6], [id#9], LeftOuter, (id#6 = 1)
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter ((id#9 = 1) && isnotnull(id#9))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [EqualTo(id,1), IsNotNull(id)], ReadSchema: struct<id:string,class:string>

示例3执行计划如下,从执行计划可知,写在on中的右表t2.id = '1'条件作用在了右表上,EqualTo(id,1)、(id#9 = 1)。

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- SortMergeJoin [id#6], [id#9], LeftOuter
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter (isnotnull(id#9) && (id#9 = 1))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:string,class:string>

示例4执行计划如下,从执行计划可知,写在where中的右表t2.id = '1'条件,同时作用在了左表和右表,EqualTo(id,1)、(id#6 = 1)、(id#9 = 1)

== Physical Plan ==
*CollectLimit 1000
+- *(3) Project [id#6 AS id1#2, id#9 AS id2#3, name#7, class#10, dt#8 AS dt1#4, dt#11 AS dt2#5]
   +- *(3) SortMergeJoin [id#6], [id#9], Inner
      :- *(1) Sort [id#6 ASC NULLS FIRST], false, 0
      :  +- *Exchange hashpartitioning(id#6, 900)
      :     +- *(1) Project [id#6, name#7, dt#8]
      :        +- *(1) Filter ((id#6 = 1) && isnotnull(id#6))
      :           +- *(1) FileScan orc zzzt1[id#6,name#7,dt#8] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt01], PartitionFilters: [], PushedFilters: [EqualTo(id,1), IsNotNull(id)], ReadSchema: struct<id:string,name:string>
      +- *(2) Sort [id#9 ASC NULLS FIRST], false, 0
         +- *Exchange hashpartitioning(id#9, 900)
            +- *(1) Project [id#9, class#10, dt#11]
               +- *(1) Filter (isnotnull(id#9) && (id#9 = 1))
                  +- *(1) FileScan orc zzzt2[id#9,class#10,dt#11] Batched: true, Format: ORC, Location: CatalogFileIndex[hdfs://***db/zzzt2], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct<id:string,class:string>

从以上4个执行计划中还可以发现,id字段做作为条件是,会自动加上 is not null 限制。

1.1. right join

接下来看下右链接中,左右表条件分别写在where和on的结果。

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t1.id = '1'
-- 左表条件写在where中,同上述示例4
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id AND t1.id = '1'
-- 左表条件写在on中,同上述示例3
+-------+------+-------+--------+-------------+-------------+
|  id1  | id2  | name  | class  |     dt1     |     dt2     |
+-------+------+-------+--------+-------------+-------------+
| 1     | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1     | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1     | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1     | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
| NULL  | 2    | NULL  | 102    | NULL        | 2022-08-11  |
| NULL  | 2    | NULL  | 101    | NULL        | 2022-08-10  |
+-------+------+-------+--------+-------------+-------------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id AND t2.id = '1'
-- 右表条件写在on中,同上述示例2
+-------+------+-------+--------+-------------+-------------+
|  id1  | id2  | name  | class  |     dt1     |     dt2     |
+-------+------+-------+--------+-------------+-------------+
| 1     | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1     | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1     | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1     | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
| NULL  | 2    | NULL  | 101    | NULL        | 2022-08-10  |
| NULL  | 2    | NULL  | 102    | NULL        | 2022-08-11  |
+-------+------+-------+--------+-------------+-------------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
FROM zzzt1 AS t1
right JOIN zzzt2 AS t2
ON t1.id = t2.id
WHERE t2.id = '1'
-- 右表条件写在where中,同上述示例1
+------+------+-------+--------+-------------+-------------+
| id1  | id2  | name  | class  |     dt1     |     dt2     |
+------+------+-------+--------+-------------+-------------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  |
+------+------+-------+--------+-------------+-------------+

结论
在使用left join时最好将右表的条件写在on中。rigth join同理,将左表的条件写在on条件中。

2. 多个join的执行顺序

验证多个left join时得执行顺序

create table zzzt3(id string, des string);
INSERT overwrite TABLE zzzt3 values
('101','班级1'),
('102','班级2')
SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
       ,t3.des
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id and t2.id = '1'
left join zzzt3 t3
on t2.class = t3.id
-- 执行结果为
+------+-------+-------+--------+-------------+-------------+---------+
| id1  |  id2  | name  | class  |     dt1     |     dt2     | t3.des  |
+------+-------+-------+--------+-------------+-------------+---------+
| 1    | 1     | A     | 101    | 2022-08-10  | 2022-08-10  | 班级1     |
| 1    | 1     | A     | 102    | 2022-08-10  | 2022-08-11  | 班级2     |
| 1    | 1     | A1    | 101    | 2022-08-11  | 2022-08-10  | 班级1     |
| 1    | 1     | A1    | 102    | 2022-08-11  | 2022-08-11  | 班级2     |
| 2    | NULL  | B     | NULL   | 2022-08-10  | NULL        | NULL    |
| 2    | NULL  | B1    | NULL   | 2022-08-11  | NULL        | NULL    |
+------+-------+-------+--------+-------------+-------------+---------+

SELECT  t1.id AS id1
       ,t2.id AS id2
       ,name
       ,class
       ,t1.dt AS dt1
       ,t2.dt AS dt2
       ,t3.des
FROM zzzt1 AS t1
LEFT JOIN zzzt2 AS t2
ON t1.id = t2.id and t2.id = '1'
join zzzt3 t3
on t2.class = t3.id
+------+------+-------+--------+-------------+-------------+---------+
| id1  | id2  | name  | class  |     dt1     |     dt2     | t3.des  |
+------+------+-------+--------+-------------+-------------+---------+
| 1    | 1    | A     | 101    | 2022-08-10  | 2022-08-10  | 班级1     |
| 1    | 1    | A     | 102    | 2022-08-10  | 2022-08-11  | 班级2     |
| 1    | 1    | A1    | 102    | 2022-08-11  | 2022-08-11  | 班级2     |
| 1    | 1    | A1    | 101    | 2022-08-11  | 2022-08-10  | 班级1     |
+------+------+-------+--------+-------------+-------------+---------+

从结果得知,多表进行join时,执行顺序按照join的前后顺序,先执行前两个表join得到一个虚拟的表,再和第三个表进行join,得到虚拟结果后,继续和第四个表进进行join……

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值