A deep dive into joins in Spark SQL


Q1: What is the difference between left join, right join, and join?

A1: A left join keeps every row from the left table, a right join is the mirror image, and an inner join keeps only the rows that satisfy the on condition.

Q2: Then when I use join or left join, how do I add extra filter conditions?

A2: With a where clause, or with an and inside the on clause.

Q3: And what is the difference between putting a condition in and versus in where?

A3: ...no idea.

These three questions boil down to one thing the interviewer wants to know: do you actually understand joins? Joins easily cause data skew, and only a deeper understanding of how they execute lets you avoid it.

Terminology

Predicate pushdown: as the name suggests, this means moving filter operators (the where conditions you write in your SQL) as early as possible in the execution plan. The benefit is that unnecessary data is filtered out as soon as possible, so every downstream step processes less data, which improves performance.

Below we walk through four cases in Spark SQL: inner join, left join, right join, and full join.

Setup

create table person_t(
id          string
,lession_id string
)
;
insert into person_t select '001','sx';
insert into person_t select '001','yw';
insert into person_t select '002','sx';
insert into person_t select '003','yy';
insert into person_t select '003','ty';
insert into person_t select '004','cp';
insert into person_t select '004','wz';
create table lession_t(
lession_id string
,lession   string
)
;
insert into lession_t select 'sx','数学';
insert into lession_t select 'yw','语文';
insert into lession_t select 'yy','英语';
insert into lession_t select 'ty','体育';
insert into lession_t select 'cp','计算机';

1. For inner join

Whether the condition is written in a subquery, in the join's on ... and clause, or in a where after the on, the physical plan is identical. The reason is that Spark's optimizer performs predicate pushdown: when the condition appears after join ... on, the optimizer pushes it below the join so that less data is joined. For any specific case, check the actual physical plan. My recommendation: write explicit joins and put the conditions in subqueries, so the data is filtered before the join whether or not the optimizer manages to push the predicate down.
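The equivalence can be sketched without Spark at all. Below is a minimal plain-Python simulation (a hypothetical `inner_join` helper, not a Spark API; the table contents mirror the setup above) showing that for an inner join the filter gives the same rows no matter where it is placed:

```python
# Plain-Python sketch (no Spark) of why the three placements are equivalent
# for an inner join: the filter commutes with the join condition, so
# filtering before or after the join yields the same rows.
person_t = [("001", "sx"), ("001", "yw"), ("002", "sx"),
            ("003", "yy"), ("003", "ty"), ("004", "cp"), ("004", "wz")]
lession_t = [("sx", "数学"), ("yw", "语文"), ("yy", "英语"),
             ("ty", "体育"), ("cp", "计算机")]

def inner_join(left, right, extra=lambda a, b: True):
    return [(a, b) for a in left for b in right
            if a[1] == b[0] and extra(a, b)]

# 1) condition inside "on ... and a.id = '001'"
in_on = inner_join(person_t, lession_t, extra=lambda a, b: a[0] == "001")
# 2) condition in a where clause: filter AFTER the join
in_where = [r for r in inner_join(person_t, lession_t) if r[0][0] == "001"]
# 3) condition pushed into a subquery: filter BEFORE the join
in_subquery = inner_join([a for a in person_t if a[0] == "001"], lession_t)

assert in_on == in_where == in_subquery  # same rows; 3) joins the least data
```

All three variants produce the same two rows; the only difference is how much data flows into the join, which is exactly what pushdown optimizes.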

Spark also optimizes Cartesian products: a cross join whose condition is a simple equality is converted into an inner join.

2. For left join

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id and a.id='001'
order by id
;
== Physical Plan ==
*(6) Sort [id#3538928 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3538928 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3538928, lession_id#3538929, lession#3538931]
      +- SortMergeJoin [lession_id#3538929], [lession_id#3538930], LeftOuter, (id#3538928 = 001)
         :- *(2) Sort [lession_id#3538929 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 69478428) hashpartitioning(lession_id#3538929, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3538928,lession_id#3538929] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3538930 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 69478428) hashpartitioning(lession_id#3538930, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3538930, lession#3538931]
                  +- *(3) Filter isnotnull(lession_id#3538930)
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3538930,lession#3538931] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

Combining the SQL above with its physical plan: when a left-table condition is written inside the left join's on clause, predicate pushdown does not happen, i.e. the left-table condition is not applied before the join.

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id and b.lession='语文'
order by id
;
== Physical Plan ==
*(6) Sort [id#3538940 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3538940 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3538940, lession_id#3538941, lession#3538943]
      +- SortMergeJoin [lession_id#3538941], [lession_id#3538942], LeftOuter
         :- *(2) Sort [lession_id#3538941 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 1556362687) hashpartitioning(lession_id#3538941, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3538940,lession_id#3538941] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3538942 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 1556362687) hashpartitioning(lession_id#3538942, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3538942, lession#3538943]
                  +- *(3) Filter ((isnotnull(lession#3538943) && (lession#3538943 = 语文)) && isnotnull(lession_id#3538942))
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3538942,lession#3538943] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession), EqualTo(lession,语文), IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

Combining the SQL above with its physical plan: when a right-table condition is written inside the left join's on clause, predicate pushdown does happen, i.e. the right table is filtered before the left join.
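This pushdown is safe because a right-table predicate in the on clause only shrinks the set of candidate matches; unmatched left rows come out with NULLs either way. Reusing the same plain-Python model (hypothetical helper, not a Spark API):

```python
# Plain-Python check that, for a LEFT JOIN, a right-table condition in the
# ON clause is equivalent to filtering the right table before the join.
person_t = [("001", "sx"), ("001", "yw"), ("002", "sx")]
lession_t = [("sx", "数学"), ("yw", "语文")]

def left_join(left, right, on):
    out = []
    for a in left:
        matches = [(a, b) for b in right if on(a, b)]
        out += matches if matches else [(a, None)]  # keep unmatched left rows
    return out

# on a.lession_id = b.lession_id and b.lession = '语文'
in_on = left_join(person_t, lession_t,
                  on=lambda a, b: a[1] == b[0] and b[1] == "语文")
# pushdown: filter the right table first, then join on the key alone
pushed = left_join(person_t, [b for b in lession_t if b[1] == "语文"],
                   on=lambda a, b: a[1] == b[0])

assert in_on == pushed  # same result, but the join sees less right-side data
```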

Next, let's check whether conditions written after the left join (in the where clause) get pushed down.

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id 
where a.id='001'
order by id
;
== Physical Plan ==
*(6) Sort [id#3539594 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3539594 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3539594, lession_id#3539595, lession#3539597]
      +- SortMergeJoin [lession_id#3539595], [lession_id#3539596], LeftOuter
         :- *(2) Sort [lession_id#3539595 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 853774433) hashpartitioning(lession_id#3539595, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) Project [id#3539594, lession_id#3539595]
         :        +- *(1) Filter (isnotnull(id#3539594) && (id#3539594 = 001))
         :           +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3539594,lession_id#3539595] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [IsNotNull(id), EqualTo(id,001)], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3539596 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 853774433) hashpartitioning(lession_id#3539596, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3539596, lession#3539597]
                  +- *(3) Filter isnotnull(lession_id#3539596)
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3539596,lession#3539597] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

Combining the SQL above with its physical plan: when a left-table condition is written after the left join (in where), it is also pushed down, i.e. the left table is filtered before the join. The final result is the same as joining first and filtering afterwards, but far less data enters the join.
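In the same plain-Python model (hypothetical helper, not a Spark API), a left-table condition in the where clause is a filter on the join's output, and that filter does commute with the left join:

```python
# Plain-Python check that filtering the LEFT JOIN output on a left-table
# column gives the same rows as filtering the left table before the join.
person_t = [("001", "sx"), ("002", "sx")]
lession_t = [("sx", "数学")]

def left_join(left, right, on):
    out = []
    for a in left:
        matches = [(a, b) for b in right if on(a, b)]
        out += matches if matches else [(a, None)]  # keep unmatched left rows
    return out

def on_key(a, b):
    return a[1] == b[0]

# left join ... where a.id = '001'  (filter AFTER the join)
join_then_filter = [(a, b) for (a, b) in left_join(person_t, lession_t, on_key)
                    if a[0] == "001"]
# pushed down: filter the left table BEFORE the join
filter_then_join = left_join([a for a in person_t if a[0] == "001"],
                             lession_t, on_key)

assert join_then_filter == filter_then_join
```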

explain
select a.id,a.lession_id,b.lession
from person_t a left join lession_t  b 
on a.lession_id=b.lession_id 
where b.lession='语文' 
order by id
;
== Physical Plan ==
*(6) Sort [id#3539606 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(id#3539606 ASC NULLS FIRST, 200)
   +- *(5) Project [id#3539606, lession_id#3539607, lession#3539609]
      +- *(5) SortMergeJoin [lession_id#3539607], [lession_id#3539608], Inner
         :- *(2) Sort [lession_id#3539607 ASC NULLS FIRST], false, 0
         :  +- Exchange(coordinator id: 2068297835) hashpartitioning(lession_id#3539607, 200), coordinator[target post-shuffle partition size: 16777216]
         :     +- *(1) Project [id#3539606, lession_id#3539607]
         :        +- *(1) Filter isnotnull(lession_id#3539607)
         :           +- *(1) FileScan parquet whzhcs_dataplatform_dev.person_t[id#3539606,lession_id#3539607] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/person_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession_id)], ReadSchema: struct<id:string,lession_id:string>
         +- *(4) Sort [lession_id#3539608 ASC NULLS FIRST], false, 0
            +- Exchange(coordinator id: 2068297835) hashpartitioning(lession_id#3539608, 200), coordinator[target post-shuffle partition size: 16777216]
               +- *(3) Project [lession_id#3539608, lession#3539609]
                  +- *(3) Filter ((isnotnull(lession#3539609) && (lession#3539609 = 语文)) && isnotnull(lession_id#3539608))
                     +- *(3) FileScan parquet whzhcs_dataplatform_dev.lession_t[lession_id#3539608,lession#3539609] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://wuhubigdata/user/hive/warehouse/whzhcs_dataplatform_dev.db/lession_t], PartitionFilters: [], PushedFilters: [IsNotNull(lession), EqualTo(lession,语文), IsNotNull(lession_id)], ReadSchema: struct<lession_id:string,lession:string>

A right-table condition written after the left join is the interesting case. Some people say no predicate pushdown happens here, arguing that if the right table were filtered first and then left-joined, the result would still contain every left-table row, which does not match the expected output. I disagree: look at the physical plan above. Spark does push the predicate down to filter the right table first, and additionally converts the left join into an inner join.
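The rewrite is valid because the where predicate b.lession='语文' can never be true when the right side is NULL, so every left row that the left join would preserve with NULLs gets filtered out anyway. A plain-Python check (hypothetical helper, not a Spark API):

```python
# Plain-Python check that LEFT JOIN + a null-rejecting WHERE on the right
# table collapses to an INNER JOIN, which is exactly Spark's rewrite.
person_t = [("001", "sx"), ("001", "yw"), ("002", "sx")]
lession_t = [("sx", "数学"), ("yw", "语文")]

def left_join(left, right, on):
    out = []
    for a in left:
        matches = [(a, b) for b in right if on(a, b)]
        out += matches if matches else [(a, None)]  # keep unmatched left rows
    return out

def on_key(a, b):
    return a[1] == b[0]

# left join ... where b.lession = '语文': NULL right sides never pass
left_then_where = [(a, b) for (a, b) in left_join(person_t, lession_t, on_key)
                   if b is not None and b[1] == "语文"]
# the equivalent inner join, with the predicate pushed to the right table
inner = [(a, b) for a in person_t
         for b in lession_t if a[1] == b[0] and b[1] == "语文"]

assert left_then_where == inner
```

A predicate like this is called "null-rejecting" (or null-filtering); any such predicate on the preserved-NULL side lets the optimizer downgrade an outer join.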

Conclusion:

Predicate pushdown?        condition on left table    condition on right table
inside the on clause       not pushed                 pushed
in the where clause        pushed                     pushed & left join → inner join

3. For right join

We will not walk through every case; the conclusions are:

Predicate pushdown?        condition on left table             condition on right table
inside the on clause       pushed                              not pushed
in the where clause        pushed & right join → inner join    pushed

4. For full join

Not discussed in detail either; reason by analogy with left join, or try the examples yourself. Note that a null-rejecting where condition on the left table eliminates the right-only rows (whose left side is NULL), so the full join degrades to a left join, and symmetrically for a right-table condition:

Predicate pushdown?        condition on left table           condition on right table
inside the on clause       not pushed                        not pushed
in the where clause        pushed & full join → left join    pushed & full join → right join

Final advice:

Since newcomers are often unfamiliar with the codebase and do not know Spark in depth, I recommend always filtering data in subqueries. This keeps the code readable, guarantees you get exactly the data you intend, and, most importantly, reduces the chance of data skew during the join.
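As a runnable sketch of the recommended shape (SQLite stands in for Spark SQL here purely to keep the example self-contained; the data mirrors the setup above):

```python
# Recommended pattern: filter each table in a subquery, then join explicitly,
# so the result does not depend on which pushdowns the optimizer performs.
# SQLite is used only to make the sketch self-contained and runnable.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table person_t(id text, lession_id text);
insert into person_t values ('001','sx'),('001','yw'),('002','sx');
create table lession_t(lession_id text, lession text);
insert into lession_t values ('sx','数学'),('yw','语文');
""")

rows = con.execute("""
select a.id, a.lession_id, b.lession
from (select * from person_t where id = '001') a   -- filter BEFORE the join
left join lession_t b
  on a.lession_id = b.lession_id
order by a.id, a.lession_id
""").fetchall()

assert rows == [('001', 'sx', '数学'), ('001', 'yw', '语文')]
```

The same SQL shape works unchanged in Spark SQL: only the filtered '001' rows ever reach the join.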
