1.4 full join 满外连接
包含左、右两个表的全部行,不管另外一边的表中是否存在与它们匹配的行;在功能上等价于对这两个数据集合分别进行左外连接和右外连接,然后再使用消去重复行的操作将上述两个结果集合并为一个结果集。full join 本质等价于 left join union right join;
select
t1.user_id,
t1.name,
t1.gender,
t2.user_id,
t2.subject,
t2.score
from studentInfo t1
full join studentScore t2
on t1.user_id = t2.user_id;
ps:full join 本质等价于 left join union right join;
select
t1.user_id,
t1.name,
t1.gender,
t2.user_id,
t2.subject,
t2.score
from studentInfo t1
full join studentScore t2
on t1.user_id = t2.user_id;
----- 等价于下述代码
select
t1.user_id as t1_user_id ,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
left join studentScore t2
on t1.user_id = t2.user_id
union
select
t1.user_id as t1_user_id ,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
right join studentScore t2
on t1.user_id = t2.user_id
1.5 多表连接
注意:连接
n
个表,至少需要
n-1
个连接条件。例如:连接三个表,至少需要两个连接
条件。
**join on使用的key有几组就会被转化为几个MR任务,**使用相
同的key来连接,则只会被转化为1个MR任务。
1.6 cross join 交叉连接
交叉连接cross join,将会返回被连接的两个表的笛卡尔积,返回结果的行数等于两个表行数的乘积 N*M。对于大表来说,cross join慎用(笛卡尔积可能会造成数据膨胀)
在SQL标准中定义的cross join就是无条件的inner join。返回两个表的笛卡尔积,无需指定关联 键。
在HiveSQL语法中,cross join 后面可以跟where子句进行过滤,或者on条件过滤。
---举例:
select
t1.user_id as t1_user_id ,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1, studentScore t2
--- 等价于:
select
t1.user_id as t1_user_id ,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
join studentScore t2
---等价于:
select
t1.user_id as t1_user_id ,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
cross join studentScore t2
1.7 join on和where条件区别
1.8 join中不能有null
- group by字段为null,会导致结果不正确(null值也会参与group by 分组)
group by column1
- join字段为null会导致结果不正确(例如:下述 t2.b字段是null值)
t1 left join t2 on t1.a=t2.a and t1.b=t2.b
1.9 join操作导致数据膨胀
select *
from a
left join b
on a.id = b.id
如果主表a的id是唯一的,副表b的id有重复值,非唯一,那当on a.id = b.id 时,就会导致数据膨胀(一条变多条)。因此两表或多表join的时候,需保证join的字段唯一性,否则会出现一对多的数据膨胀现象。
二、Hive的谓词下推
2.1 谓词下推概念
在不影响结果的情况下,尽量将过滤条件提前执行。谓词下推后,过滤条件在map端执行,减少了map端的输出,降低了数据在集群上传输的量,提升任务性能。
在hive生成的物理执行计划中,有一个配置项用于管理谓词下推是否开启。
set hive.optimize.ppd=true; 默认是true
疑问:如果hive谓词下推的功能与join同时存在,那下推功能可以在哪些场景下生效?
2.2 谓词下推场景分析
数据准备:以上述两张表studentInfo、studentScore为例
查看谓词下推是否开启:set hive.optimize.ppd;
(1) inner join 内连接
- 对左表where过滤
explain
select
t1.user_id as t1_user_id,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
inner join studentScore t2 on t1.user_id = t2.user_id
where t1.user_id >2
explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。
- 对右表where过滤
explain
select
t1.user_id as t1_user_id,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
inner join studentScore t2 on t1.user_id = t2.user_id
where t2.user_id is not null
explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。
- 对左表on过滤
explain
select
t1.user_id as t1_user_id,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
inner join studentScore t2 on t1.user_id = t2.user_id and t1.user_id >2
explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。
- 对右表on过滤
explain
select
t1.user_id as t1_user_id,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
inner join studentScore t2 on t1.user_id = t2.user_id and t2.user_id is not null
explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。
(2) left join(right join 同理)
- 对左表where过滤
explain
select
t1.user_id,
t1.name,
t1.gender,
t2.user_id,
t2.subject,
t2.score
from studentInfo t1
left join studentScore t2
on t1.user_id = t2.user_id
where t1.user_id >2;
explain查看执行计划,在对t2表进行scan后,优先对t1表进行filter,过滤t1.user_id >2,即谓词下推生效。
- 对右表where过滤
explain
select
t1.user_id,
t1.name,
t1.gender,
t2.user_id,
t2.subject,
t2.score
from studentInfo t1
left join studentScore t2
on t1.user_id = t2.user_id
where t2.user_id is not null;
explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。
- 对左表on过滤
explain
select
t1.user_id as t1_user_id,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
left join studentScore t2
on t1.user_id = t2.user_id and t1.user_id >2
explain查看执行计划,在对t2表进行scan后,在对t1表未进行filter,即谓词下推不生效。
- 对右表on过滤
explain
select
t1.user_id as t1_user_id,
t1.name,
t1.gender,
t2.user_id as t2_user_id,
t2.subject,
t2.score
from studentInfo t1
left join studentScore t2
on t1.user_id = t2.user_id and t2.user_id is not null;
explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user_id is not null,即谓词下推生效。
既有适合小白学习的零基础资料,也有适合3年以上经验的小伙伴深入学习提升的进阶课程,涵盖了95%以上大数据知识点,真正体系化!
id = t2.user_id and t2.user_id is not null;
explain查看执行计划,在对t2表进行scan后,优先进行filter,过滤t2.user\_id is not null,即**谓词下推生效。**
![](https://img-blog.csdnimg.cn/direct/705dc99a70c94a0595c9e3af6eabc69e.png)
[外链图片转存中...(img-5d9MgCCC-1725714113439)]
[外链图片转存中...(img-GuAZJjCY-1725714113441)]
[外链图片转存中...(img-yC00i7lb-1725714113442)]
**既有适合小白学习的零基础资料,也有适合3年以上经验的小伙伴深入学习提升的进阶课程,涵盖了95%以上大数据知识点,真正体系化!**