(08)Hive——Join连接、谓词下推_hive join是内连接吗(3)

select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id;

ps:full join 本质等价于 left join union  right join;

select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id;

----- 等价于下述代码

select
    t1.user_id as t1_user_id ,
    t1.name,
    t1.gender,
    t2.user_id as  t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
 left  join studentScore t2
   on t1.user_id = t2.user_id
union
select
    t1.user_id as t1_user_id ,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
 right  join studentScore t2
   on t1.user_id = t2.user_id

1.5 多表连接

注意:连接
n
个表,至少需要
n-1
个连接条件。例如:连接三个表,至少需要两个连接

条件。
**join on使用的key有几组就会被转化为几个MR任务,**使用相
同的key来连接,则只会被转化为1个MR任务。

1.6 cross join 交叉连接

交叉连接cross join,将会返回被连接的两个表的笛卡尔积,返回结果的行数等于两个表行数的乘积 N*M。对于大表来说,cross join慎用(笛卡尔积可能会造成数据膨胀

在SQL标准中定义的cross join就是无条件的inner join。返回两个表的笛卡尔积,无需指定关联 键。
  在HiveSQL语法中,cross join 后面可以跟where子句进行过滤,或者on条件过滤。

---举例:
select
    t1.user_id as t1_user_id ,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1, studentScore t2

--- 等价于:
select
     t1.user_id as t1_user_id ,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 join studentScore t2

---等价于:
select
     t1.user_id as t1_user_id ,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 cross  join studentScore t2

1.7 join on和where条件区别

两者之间的区别见文章:
Hive中left join 中的where 和 on的区别-CSDN博客文章浏览阅读1.2k次,点赞21次,收藏23次。Hive中left join 中的where 和 on的区别外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传https://blog.csdn.net/SHWAITME/article/details/135892183?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522170780016016800197016026%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=170780016016800197016026&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2blogfirst_rank_ecpm_v1~rank_v31_ecpm-1-135892183-null-null.nonecase&utm_term=where&spm=1018.2226.3001.4450

1.8 join中不能有null

  • group by字段为null,会导致结果不正确(null值也会参与group by 分组)
group by column1
  • join字段为null会导致结果不正确(例如:下述 t2.b字段是null值)
t1 left join t2 on t1.a=t2.a and t1.b=t2.b 

1.9 join操作导致数据膨胀

select *
from a 
left join b 
on a.id = b.id 

如果主表a的id是唯一的,副表b的id有重复值,非唯一,那当on a.id = b.id 时,就会导致数据膨胀(一条变多条)。因此两表或多表join的时候,需保证join的字段唯一性,否则会出现一对多的数据膨胀现象。

二、Hive的谓词下推

2.1 谓词下推概念

在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,提升任务性能。

在hive生成的物理执行计划中,有一个配置项用于管理谓词下推是否开启。

set hive.optimize.ppd=true; 默认是true

疑问:如果hive谓词下推的功能与join同时存在,那下推功能可以在哪些场景下生效

2.2 谓词下推场景分析

数据准备:以上述两张表studentInfo、studentScore为例

查看谓词下推是否开启:set hive.optimize.ppd;

(1) inner join 内连接
  • 对左表where过滤
 explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id
where t1.user_id >2

explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。

  • 对右表where过滤
 explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id
where t2.user_id is not null

explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。

  • 对左表on过滤
explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id and t1.user_id >2

explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。

  • 对右表on过滤
 explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id and t2.user_id is not null

explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。

(2) left join(right join 同理)
  • 对左表where过滤
explain
select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 left  join studentScore t2
   on t1.user_id = t2.user_id
where t1.user_id >2;

explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。

  • 对右表where过滤
explain
select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 left  join studentScore t2
   on t1.user_id = t2.user_id
where t2.user_id is not null;

explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。 

  • 对左表on过滤
explain 
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
   left join studentScore t2
     on t1.user_id = t2.user_id and t1.user_id >2

explain查看执行计划,在对t2表进行scan后,在对t1表未进行filter,即谓词下推不生效

  • 对右表on过滤
explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
   left join studentScore t2
     on t1.user_id = t2.user_id and t2.user_id is not null;

explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。

(3) full join
  • 对左表where过滤
explain 
select
     t1.user_id as t1_user_id,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id
where  t1.user_id >2 ;

explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。

  • 对右表where过滤
explain
select
     t1.user_id as t1_user_id,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id
where  t2.user_id is not null

explain查看执行计划,在对t1 表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。

  • 对左表on过滤
explain
select
     t1.user_id as t1_user_id,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id and t1.user_id >2;


**自我介绍一下,小编13年上海交大毕业,曾经在小公司待过,也去过华为、OPPO等大厂,18年进入阿里一直到现在。**

**深知大多数大数据工程师,想要提升技能,往往是自己摸索成长或者是报班学习,但对于培训机构动则几千的学费,着实压力不小。自己不成体系的自学效果低效又漫长,而且极易碰到天花板技术停滞不前!**

**因此收集整理了一份《2024年大数据全套学习资料》,初衷也很简单,就是希望能够帮助到想自学提升又不知道该从何学起的朋友。**
![img](https://img-blog.csdnimg.cn/img_convert/de79a7bad20e6ed5b3ad7a11d6ea43de.png)
![img](https://img-blog.csdnimg.cn/img_convert/933158de0a471a8e03a1b0bb020236e6.png)
![img](https://img-blog.csdnimg.cn/img_convert/bb936b5b6fc0d4f1a33f958b785acae8.png)
![img](https://img-blog.csdnimg.cn/img_convert/286d03aad3b8df2fdefdb03a46e41cda.png)
![img](https://img-blog.csdnimg.cn/img_convert/6e39fe039752d07412491b48fe743157.png)

**既有适合小白学习的零基础资料,也有适合3年以上经验的小伙伴深入学习提升的进阶课程,基本涵盖了95%以上大数据开发知识点,真正体系化!**

**由于文件比较大,这里只是将部分目录大纲截图出来,每个节点里面都包含大厂面经、学习笔记、源码讲义、实战项目、讲解视频,并且后续会持续更新**

**如果你觉得这些内容对你有帮助,可以添加VX:vip204888 (备注大数据获取)**
![img](https://img-blog.csdnimg.cn/img_convert/be741be3db367a8bf4ecc5abaf986111.png)

2024年大数据全套学习资料》,初衷也很简单,就是希望能够帮助到想自学提升又不知道该从何学起的朋友。**
[外链图片转存中...(img-3z7E45Gt-1712818761280)]
[外链图片转存中...(img-YaGbm1YL-1712818761280)]
[外链图片转存中...(img-NGVmnJXd-1712818761281)]
[外链图片转存中...(img-veihlQbp-1712818761281)]
[外链图片转存中...(img-qD6Obz35-1712818761281)]

**既有适合小白学习的零基础资料,也有适合3年以上经验的小伙伴深入学习提升的进阶课程,基本涵盖了95%以上大数据开发知识点,真正体系化!**

**由于文件比较大,这里只是将部分目录大纲截图出来,每个节点里面都包含大厂面经、学习笔记、源码讲义、实战项目、讲解视频,并且后续会持续更新**

**如果你觉得这些内容对你有帮助,可以添加VX:vip204888 (备注大数据获取)**
[外链图片转存中...(img-YvneUJPs-1712818761281)]

  • 16
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值