Hive连接查询join用法详解

最新推荐文章于 2024-05-09 02:30:00 发布

weixin_34059951

最新推荐文章于 2024-05-09 02:30:00 发布

阅读量264

点赞数

文章标签：大数据 python

原文链接：https://my.oschina.net/iamchenli/blog/845850

版权

2019独角兽企业重金招聘Python工程师标准>>>

先来个总的概括，Hive的join用法基本与SQL一致

建表语句不说了，直接看看两个表中的数据

表一：cl_student

hive (test)> select * from cl_student;
OK
1	chenli	21
2	xuzeng	22
3	xiaodan	23
4	hua	24

表2：cl_stu_sub

hive (test)> select * from cl_stu_sub;
OK
1	chinese
2	english
3	science
5	nature

1、内连接。inner join，即基于on语句，仅列出表1和表2符合连接条件的数据。

hive (test)> select a.*,b.* from cl_student a join cl_stu_sub b on a.id=b.id;
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science

hive (test)> select a.*,b.* from cl_student a,cl_stu_sub b where a.id=b.id;
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science

第一条语句与第二条SQL条语句的效果是一样的

2、左连接。左连接是显示左边的表的所有数据，如果有右边表与之对应，则显示；否则显示NULL

hive (test)> select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id;
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science
4	hua	24	NULL	NULL

深入了解，比较两者的不同，在上述语句中增加了一个where条件或者是and的条件

on条件会产生一个临时表，where条件是对这个临时表进行过滤

hive (test)> select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id where b.name='chinese';
1	chenli	21	1	chinese

hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id where a.name='chenli';
1	chenli	21	1	chinese

标准查询关键字执行顺序为 from->where->group by->having->order by

on是先对表进行筛选后再关联的，left关联则on对右表才有效的，左表都是要选出来的

如果是要条件查询后才连接应该把查询件放置于ON后。如果是想再连接完毕后才筛选就应把条件放置于where后面，对主表的帅选要用where条件

hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b where a.id=b.id;
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science

hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id and a.name='chenli';
1	chenli	21	1	chinese
2	xuzeng	22	NULL	NULL
3	xiaodan	23	NULL	NULL
4	hua	24	NULL	NULL

hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id and b.name='chinese';
1	chenli	21	1	chinese
2	xuzeng	22	NULL	NULL
3	xiaodan	23	NULL	NULL
4	hua	24	NULL	NULL

3、右连接.与左连接同理

hive (test)> select a.*,b.* from cl_student a right outer join cl_stu_sub b on a.id=b.id;
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science
NULL	NULL	NULL	5	nature

4、全连接。相当于表1和表2的数据都显示，如果没有对应的数据，则显示NULL

hive (test)> select a.*,b.* from cl_student a full outer join cl_stu_sub b on a.id=b.id;
1	chenli	21	1	chinese
2	xuzeng	22	2	english
3	xiaodan	23	3	science
4	hua	24	NULL	NULL
NULL	NULL	NULL	5	nature

5、左半开连接。left semi join，语法与左连接不一样，只能选择出左边表的数据，此数据符合on后面的条件。

semi join通常比inner join的效率更高

hive (test)>select a.* from cl_student a left semi join cl_stu_sub b on a.id=b.id;
1	chenli	21
2	xuzeng	22
3	xiaodan	23


hive (test)>select * from cl_student a left semi join cl_stu_sub b on a.id=b.id;
1	chenli	21
2	xuzeng	22
3	xiaodan	23

hive (test)>select a.*,b.* from cl_student a left semi join cl_stu_sub b on a.id=b.id;
FAILED: SemanticException [Error 10009]: Line 1:11 Invalid table alias 'b'

6、Hive不支持右半开连接

7、笛卡尔积join

笛卡尔积是一种连接，表示左边表的行数乘以右边表的行数。

hive (test)>select a.*,b.* from cl_student a join cl_stu_sub b;
1	chenli	21	1	chinese
1	chenli	21	2	english
1	chenli	21	3	science
1	chenli	21	5	nature
2	xuzeng	22	1	chinese
2	xuzeng	22	2	english
2	xuzeng	22	3	science
2	xuzeng	22	5	nature
3	xiaodan	23	1	chinese
3	xiaodan	23	2	english
3	xiaodan	23	3	science
3	xiaodan	23	5	nature
4	hua	24	1	chinese
4	hua	24	2	english
4	hua	24	3	science
4	hua	24	5	nature

转载于:https://my.oschina.net/iamchenli/blog/845850