A deep dive into joins in Spark SQL


Q1: What is the difference between left join, right join, and join?

A1: A left join keeps every row from the left table, a right join is the mirror image, and an inner join keeps only the rows that satisfy the on condition.

Q2: Then when I use join or left join, how do I add extra filter conditions?

A2: With a where clause, or with an and inside the on clause.

Q3: And what is the difference between putting a condition in and versus in where?

A3: ...no idea.

These three questions boil down to one thing the interviewer wants to know: do you actually understand joins? Joins easily cause data skew, and only a deeper understanding of how they execute lets you avoid it.

Terminology

Predicate pushdown: as the name suggests, this means moving filter operators (the where conditions you write in your SQL) as early as possible in the execution plan. The benefit is that unnecessary data is filtered out as soon as possible, so every downstream step processes less data, which improves performance.

Below we walk through four cases in Spark SQL: inner join, left join, right join, and full join.

Setup

create table person_t(
id          string
,lession_id string
)
;
insert into person_t select '001','sx';
insert into person_t select '001','yw';
insert into person_t select '002','sx';
insert into person_t select '003','yy';
insert into person_t select '003','ty';
insert into person_t select '004','cp';
insert into person_t select '004','wz';
create table lession_t(
lession_id string
,lession   string
)
;
insert into lession_t select 'sx','数学';
insert into lession_t select 'yw','语文';
insert into lession_t select 'yy','英语';
insert into lession_t select 'ty','体育';
insert into lession_t select 'cp','计算机';

1. For inner join

Whether the condition is written in a subquery, in the join's on ... and clause, or in a where after the on, the physical plan is identical. The reason is that Spark's optimizer performs predicate pushdown: when the condition appears after join ... on, the optimizer pushes it below the join so that less data is joined. For any specific case, check the actual physical plan. My recommendation: write explicit joins and put the conditions in subqueries, so the data is filtered before the join whether or not the optimizer manages to push the predicate down.
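The equivalence can be sketched without Spark at all. Below is a minimal plain-Python simulation (a hypothetical `inner_join` helper, not a Spark API; the table contents mirror the setup above) showing that for an inner join the filter gives the same rows no matter where it is placed:

```python
# Plain-Python sketch (no Spark) of why the three placements are equivalent
# for an inner join: the filter commutes with the join condition, so
# filtering before or after the join yields the same rows.
person_t = [("001", "sx"), ("001", "yw"), ("002", "sx"),
            ("003", "yy"), ("003", "ty"), ("004", "cp"), ("004", "wz")]
lession_t = [("sx", "数学"), ("yw", "语文"), ("yy", "英语"),
             ("ty", "体育"), ("cp", "计算机")]

def inner_join(left, right, extra=lambda a, b: True):
    return [(a, b) for a in left for b in right
            if a[1] == b[0] and extra(a, b)]

# 1) condition inside "on ... and a.id = '001'"
in_on = inner_join(person_t, lession_t, extra=lambda a, b: a[0] == "001")
# 2) condition in a where clause: filter AFTER the join
in_where = [r for r in inner_join(person_t, lession_t) if r[0][0] == "001"]
# 3) condition pushed into a subquery: filter BEFORE the join
in_subquery = inner_join([a for a in person_t if a[0] == "001"], lession_t)

assert in_on == in_where == in_subquery  # same rows; 3) joins the least data
```

All three variants produce the same two rows; the only difference is how much data flows into the join, which is exactly what pushdown optimizes.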

Spark also optimizes Cartesian products: a cross join whose condition is a simple equality is converted into an inner join.

2. For left join

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id and a.id='001'
order by id
;
== Physical Plan ==
*(6) Sort [id#3538928 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3538928 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3538928, lession_id#3538929, lession#3538931]
      +- SortMergeJoin [lession_id#3538929], [lession_id#3538930], LeftOuter, (id#3538928 = 001)
         :- *(2) Sort [lession_id#3538929 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 69478428) hashpartitioning(lession_id#3538929, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3538928,lession_id#3538929] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3538930 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 69478428) hashpartitioning(lession_id#3538930, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3538930, lession#3538931]
                  +- *(3) Filter isnotnull(lession_id#3538930)
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3538930,lession#3538931] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

Combining the SQL above with its physical plan: when a left-table condition is written inside the left join's on clause, predicate pushdown does not happen, i.e. the left-table condition is not applied before the join.

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id and b.lession='语文'
order by id
;
== Physical Plan ==
*(6) Sort [id#3538940 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3538940 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3538940, lession_id#3538941, lession#3538943]
      +- SortMergeJoin [lession_id#3538941], [lession_id#3538942], LeftOuter
         :- *(2) Sort [lession_id#3538941 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 1556362687) hashpartitioning(lession_id#3538941, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3538940,lession_id#3538941] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3538942 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 1556362687) hashpartitioning(lession_id#3538942, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3538942, lession#3538943]
                  +- *(3) Filter ((isnotnull(lession#3538943) && (lession#3538943 = 语文)) && isnotnull(lession_id#3538942))
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3538942,lession#3538943] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession), EqualTo(lession,语文), IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

Combining the SQL above with its physical plan: when a right-table condition is written inside the left join's on clause, predicate pushdown does happen, i.e. the right table is filtered before the left join.
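This pushdown is safe because a right-table predicate in the on clause only shrinks the set of candidate matches; unmatched left rows come out with NULLs either way. Reusing the same plain-Python model (hypothetical helper, not a Spark API):

```python
# Plain-Python check that, for a LEFT JOIN, a right-table condition in the
# ON clause is equivalent to filtering the right table before the join.
person_t = [("001", "sx"), ("001", "yw"), ("002", "sx")]
lession_t = [("sx", "数学"), ("yw", "语文")]

def left_join(left, right, on):
    out = []
    for a in left:
        matches = [(a, b) for b in right if on(a, b)]
        out += matches if matches else [(a, None)]  # keep unmatched left rows
    return out

# on a.lession_id = b.lession_id and b.lession = '语文'
in_on = left_join(person_t, lession_t,
                  on=lambda a, b: a[1] == b[0] and b[1] == "语文")
# pushdown: filter the right table first, then join on the key alone
pushed = left_join(person_t, [b for b in lession_t if b[1] == "语文"],
                   on=lambda a, b: a[1] == b[0])

assert in_on == pushed  # same result, but the join sees less right-side data
```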

Next, let's check whether conditions written after the left join (in the where clause) get pushed down.

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id 
where a.id='001'
order by id
;
== Physical Plan ==
*(6) Sort [id#3539594 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3539594 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3539594, lession_id#3539595, lession#3539597]
      +- SortMergeJoin [lession_id#3539595], [lession_id#3539596], LeftOuter
         :- *(2) Sort [lession_id#3539595 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 853774433) hashpartitioning(lession_id#3539595, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) Project [id#3539594, lession_id#3539595]
         :        +- *(1) Filter (isnotnull(id#3539594) && (id#3539594 = 001))
         :           +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3539594,lession_id#3539595] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,001)], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3539596 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 853774433) hashpartitioning(lession_id#3539596, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3539596, lession#3539597]
                  +- *(3) Filter isnotnull(lession_id#3539596)
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3539596,lession#3539597] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

Combining the SQL above with its physical plan: when a left-table condition is written after the left join (in where), it is also pushed down, i.e. the left table is filtered before the join. The final result is the same as joining first and filtering afterwards, but far less data enters the join.
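In the same plain-Python model (hypothetical helper, not a Spark API), a left-table condition in the where clause is a filter on the join's output, and that filter does commute with the left join:

```python
# Plain-Python check that filtering the LEFT JOIN output on a left-table
# column gives the same rows as filtering the left table before the join.
person_t = [("001", "sx"), ("002", "sx")]
lession_t = [("sx", "数学")]

def left_join(left, right, on):
    out = []
    for a in left:
        matches = [(a, b) for b in right if on(a, b)]
        out += matches if matches else [(a, None)]  # keep unmatched left rows
    return out

def on_key(a, b):
    return a[1] == b[0]

# left join ... where a.id = '001'  (filter AFTER the join)
join_then_filter = [(a, b) for (a, b) in left_join(person_t, lession_t, on_key)
                    if a[0] == "001"]
# pushed down: filter the left table BEFORE the join
filter_then_join = left_join([a for a in person_t if a[0] == "001"],
                             lession_t, on_key)

assert join_then_filter == filter_then_join
```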

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id 
where b.lession='语文' 
order by id
;
== Physical Plan ==
*(6) Sort [id#3539606 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3539606 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3539606, lession_id#3539607, lession#3539609]
      +- *(5) SortMergeJoin [lession_id#3539607], [lession_id#3539608], Inner
         :- *(2) Sort [lession_id#3539607 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 2068297835) hashpartitioning(lession_id#3539607, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) Project [id#3539606, lession_id#3539607]
         :        +- *(1) Filter isnotnull(lession_id#3539607)
         :           +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3539606,lession_id#3539607] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession_id)], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3539608 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 2068297835) hashpartitioning(lession_id#3539608, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3539608, lession#3539609]
                  +- *(3) Filter ((isnotnull(lession#3539609) && (lession#3539609 = 语文)) && isnotnull(lession_id#3539608))
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3539608,lession#3539609] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession), EqualTo(lession,语文), IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

A right-table condition written after the left join is the interesting case. Some people say no predicate pushdown happens here, arguing that if the right table were filtered first and then left-joined, the result would still contain every left-table row, which does not match the expected output. I disagree: look at the physical plan above. Spark does push the predicate down to filter the right table first, and additionally converts the left join into an inner join.
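The rewrite is valid because the where predicate b.lession='语文' can never be true when the right side is NULL, so every left row that the left join would preserve with NULLs gets filtered out anyway. A plain-Python check (hypothetical helper, not a Spark API):

```python
# Plain-Python check that LEFT JOIN + a null-rejecting WHERE on the right
# table collapses to an INNER JOIN, which is exactly Spark's rewrite.
person_t = [("001", "sx"), ("001", "yw"), ("002", "sx")]
lession_t = [("sx", "数学"), ("yw", "语文")]

def left_join(left, right, on):
    out = []
    for a in left:
        matches = [(a, b) for b in right if on(a, b)]
        out += matches if matches else [(a, None)]  # keep unmatched left rows
    return out

def on_key(a, b):
    return a[1] == b[0]

# left join ... where b.lession = '语文': NULL right sides never pass
left_then_where = [(a, b) for (a, b) in left_join(person_t, lession_t, on_key)
                   if b is not None and b[1] == "语文"]
# the equivalent inner join, with the predicate pushed to the right table
inner = [(a, b) for a in person_t
         for b in lession_t if a[1] == b[0] and b[1] == "语文"]

assert left_then_where == inner
```

A predicate like this is called "null-rejecting" (or null-filtering); any such predicate on the preserved-NULL side lets the optimizer downgrade an outer join.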

Conclusion:

Predicate pushdown?        condition on left table    condition on right table
inside the on clause       not pushed                 pushed
in the where clause        pushed                     pushed & left join → inner join

3. For right join

We will not walk through every case; the conclusions are:

Predicate pushdown?        condition on left table             condition on right table
inside the on clause       pushed                              not pushed
in the where clause        pushed & right join → inner join    pushed

4. For full join

Not discussed in detail either; reason by analogy with left join, or try the examples yourself. Note that a null-rejecting where condition on the left table eliminates the right-only rows (whose left side is NULL), so the full join degrades to a left join, and symmetrically for a right-table condition:

Predicate pushdown?        condition on left table           condition on right table
inside the on clause       not pushed                        not pushed
in the where clause        pushed & full join → left join    pushed & full join → right join

Final advice:

Since newcomers are often unfamiliar with the codebase and do not know Spark in depth, I recommend always filtering data in subqueries. This keeps the code readable, guarantees you get exactly the data you intend, and, most importantly, reduces the chance of data skew during the join.
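As a runnable sketch of the recommended shape (SQLite stands in for Spark SQL here purely to keep the example self-contained; the data mirrors the setup above):

```python
# Recommended pattern: filter each table in a subquery, then join explicitly,
# so the result does not depend on which pushdowns the optimizer performs.
# SQLite is used only to make the sketch self-contained and runnable.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table person_t(id text, lession_id text);
insert into person_t values ('001','sx'),('001','yw'),('002','sx');
create table lession_t(lession_id text, lession text);
insert into lession_t values ('sx','数学'),('yw','语文');
""")

rows = con.execute("""
select a.id, a.lession_id, b.lession
from (select * from person_t where id = '001') a   -- filter BEFORE the join
left join lession_t b
  on a.lession_id = b.lession_id
order by a.id, a.lession_id
""").fetchall()

assert rows == [('001', 'sx', '数学'), ('001', 'yw', '语文')]
```

The same SQL shape works unchanged in Spark SQL: only the filtered '001' rows ever reach the join.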
