(08)Hive——Join连接、谓词下推_hive join是内连接吗(3)

最新推荐文章于 2024-09-08 18:21:03 发布

2301_79097795

最新推荐文章于 2024-09-08 18:21:03 发布

阅读量826

点赞数 16

分类专栏： 2024年程序员学习文章标签： hive hadoop 数据仓库

本文链接：https://blog.csdn.net/2301_79097795/article/details/137639671

版权

2024年程序员学习专栏收录该内容

133 篇文章 0 订阅

订阅专栏

select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id;

ps：full join 本质等价于 left join union right join;

select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id;

----- 等价于下述代码

select
    t1.user_id as t1_user_id ,
    t1.name,
    t1.gender,
    t2.user_id as  t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
 left  join studentScore t2
   on t1.user_id = t2.user_id
union
select
    t1.user_id as t1_user_id ,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
 right  join studentScore t2
   on t1.user_id = t2.user_id

1.5 多表连接

注意：连接
n
个表，至少需要
n-1
个连接条件。例如：连接三个表，至少需要两个连接

条件。
**join on使用的key有几组就会被转化为几个MR任务，**使用相
同的key来连接，则只会被转化为1个MR任务。

1.6 cross join 交叉连接

交叉连接cross join，将会返回被连接的两个表的笛卡尔积，返回结果的行数等于两个表行数的乘积 N*M。对于大表来说，cross join慎用（笛卡尔积可能会造成数据膨胀）

在SQL标准中定义的cross join就是无条件的inner join。返回两个表的笛卡尔积,无需指定关联键。
在HiveSQL语法中，cross join 后面可以跟where子句进行过滤，或者on条件过滤。

---举例：
select
    t1.user_id as t1_user_id ,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1, studentScore t2

--- 等价于：
select
     t1.user_id as t1_user_id ,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 join studentScore t2

---等价于：
select
     t1.user_id as t1_user_id ,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 cross  join studentScore t2

1.7 join on和where条件区别

两者之间的区别见文章：
Hive中left join 中的where 和 on的区别-CSDN博客文章浏览阅读1.2k次，点赞21次，收藏23次。Hive中left join 中的where 和 on的区别外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传 https://blog.csdn.net/SHWAITME/article/details/135892183?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522170780016016800197016026%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=170780016016800197016026&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2_blogfirst_rank_ecpm_v1~rank_v31_ecpm-1-135892183-null-null.nonecase&utm_term=where&spm=1018.2226.3001.4450

1.8 join中不能有null

group by字段为null，会导致结果不正确（null值也会参与group by 分组）

group by column1

join字段为null会导致结果不正确(例如：下述 t2.b字段是null值)

t1 left join t2 on t1.a=t2.a and t1.b=t2.b

1.9 join操作导致数据膨胀

select *
from a 
left join b 
on a.id = b.id

如果主表a的id是唯一的，副表b的id有重复值，非唯一，那当on a.id = b.id 时，就会导致数据膨胀（一条变多条）。因此两表或多表join的时候，需保证join的字段唯一性，否则会出现一对多的数据膨胀现象。

二、Hive的谓词下推

2.1 谓词下推概念

在不影响结果的情况下，尽量将过滤条件提前执行。谓词下推后，过滤条件在map端执行，减少了map端的输出，降低了数据在集群上传输的量，提升任务性能。

在hive生成的物理执行计划中，有一个配置项用于管理谓词下推是否开启。

set hive.optimize.ppd=true; 默认是true

疑问：如果hive谓词下推的功能与join同时存在，那下推功能可以在哪些场景下生效？

2.2 谓词下推场景分析

数据准备：以上述两张表studentInfo、studentScore为例

查看谓词下推是否开启：set hive.optimize.ppd;

(1) inner join 内连接

对左表where过滤

 explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id
where t1.user_id >2

explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即谓词下推生效。

对右表where过滤

 explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id
where t2.user_id is not null

explain查看执行计划，在对t2表进行scan后，优先进行filter，过滤t2.user_id is not null，即谓词下推生效。

对左表on过滤

explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id and t1.user_id >2

explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即谓词下推生效。

对右表on过滤

 explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
    inner join studentScore t2 on t1.user_id = t2.user_id and t2.user_id is not null

explain查看执行计划，在对t2表进行scan后，优先进行filter，过滤t2.user_id is not null，即谓词下推生效。

(2) left join（right join 同理）

对左表where过滤

explain
select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 left  join studentScore t2
   on t1.user_id = t2.user_id
where t1.user_id >2;

explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即谓词下推生效。

对右表where过滤

explain
select
    t1.user_id,
    t1.name,
    t1.gender,
    t2.user_id,
    t2.subject,
    t2.score
from studentInfo t1
 left  join studentScore t2
   on t1.user_id = t2.user_id
where t2.user_id is not null;

explain查看执行计划，在对t2表进行scan后，优先进行filter，过滤t2.user_id is not null，即谓词下推生效。

对左表on过滤

explain 
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
   left join studentScore t2
     on t1.user_id = t2.user_id and t1.user_id >2

explain查看执行计划，在对t2表进行scan后，在对t1表未进行filter，即谓词下推不生效。

对右表on过滤

explain
select
    t1.user_id as t1_user_id,
    t1.name,
    t1.gender,
    t2.user_id as t2_user_id,
    t2.subject,
    t2.score
from studentInfo t1
   left join studentScore t2
     on t1.user_id = t2.user_id and t2.user_id is not null;

explain查看执行计划，在对t2表进行scan后，优先进行filter，过滤t2.user_id is not null，即谓词下推生效。

(3) full join

对左表where过滤

explain 
select
     t1.user_id as t1_user_id,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id
where  t1.user_id >2 ;

explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即谓词下推生效。

对右表where过滤

explain
select
     t1.user_id as t1_user_id,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id
where  t2.user_id is not null

explain查看执行计划，在对t1 表进行scan后，优先进行filter，过滤t2.user_id is not null，即谓词下推生效。

对左表on过滤

explain
select
     t1.user_id as t1_user_id,
     t1.name,
     t1.gender,
     t2.user_id as t2_user_id,
     t2.subject,
     t2.score
from studentInfo t1
 full  join studentScore t2
   on t1.user_id = t2.user_id and t1.user_id >2;


**自我介绍一下，小编13年上海交大毕业，曾经在小公司待过，也去过华为、OPPO等大厂，18年进入阿里一直到现在。**

**深知大多数大数据工程师，想要提升技能，往往是自己摸索成长或者是报班学习，但对于培训机构动则几千的学费，着实压力不小。自己不成体系的自学效果低效又漫长，而且极易碰到天花板技术停滞不前！**

**因此收集整理了一份《2024年大数据全套学习资料》，初衷也很简单，就是希望能够帮助到想自学提升又不知道该从何学起的朋友。**
![img](https://img-blog.csdnimg.cn/img_convert/de79a7bad20e6ed5b3ad7a11d6ea43de.png)
![img](https://img-blog.csdnimg.cn/img_convert/933158de0a471a8e03a1b0bb020236e6.png)
![img](https://img-blog.csdnimg.cn/img_convert/bb936b5b6fc0d4f1a33f958b785acae8.png)
![img](https://img-blog.csdnimg.cn/img_convert/286d03aad3b8df2fdefdb03a46e41cda.png)
![img](https://img-blog.csdnimg.cn/img_convert/6e39fe039752d07412491b48fe743157.png)

**既有适合小白学习的零基础资料，也有适合3年以上经验的小伙伴深入学习提升的进阶课程，基本涵盖了95%以上大数据开发知识点，真正体系化！**

**由于文件比较大，这里只是将部分目录大纲截图出来，每个节点里面都包含大厂面经、学习笔记、源码讲义、实战项目、讲解视频，并且后续会持续更新**

**如果你觉得这些内容对你有帮助，可以添加VX：vip204888 （备注大数据获取）**
![img](https://img-blog.csdnimg.cn/img_convert/be741be3db367a8bf4ecc5abaf986111.png)

2024年大数据全套学习资料》，初衷也很简单，就是希望能够帮助到想自学提升又不知道该从何学起的朋友。**
[外链图片转存中...(img-3z7E45Gt-1712818761280)]
[外链图片转存中...(img-YaGbm1YL-1712818761280)]
[外链图片转存中...(img-NGVmnJXd-1712818761281)]
[外链图片转存中...(img-veihlQbp-1712818761281)]
[外链图片转存中...(img-qD6Obz35-1712818761281)]

**既有适合小白学习的零基础资料，也有适合3年以上经验的小伙伴深入学习提升的进阶课程，基本涵盖了95%以上大数据开发知识点，真正体系化！**

**由于文件比较大，这里只是将部分目录大纲截图出来，每个节点里面都包含大厂面经、学习笔记、源码讲义、实战项目、讲解视频，并且后续会持续更新**

**如果你觉得这些内容对你有帮助，可以添加VX：vip204888 （备注大数据获取）**
[外链图片转存中...(img-YvneUJPs-1712818761281)]

2301_79097795

关注

16
点赞
踩
18

收藏

觉得还不错? 一键收藏
0
评论
(08)Hive——Join连接、谓词下推_hive join是内连接吗(3)

explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即。explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即。explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即。explain查看执行计划，在对t2表进行scan后，优先对t1表进行filter，过滤t1.user_id >2，即。
复制链接

扫一扫