先来个总的概括,Hive的join用法基本与SQL一致
建表语句不说了,直接看看两个表中的数据
表一:cl_student
hive (test)> select * from cl_student;
OK
1 chenli 21
2 xuzeng 22
3 xiaodan 23
4 hua 24
表2:cl_stu_sub
hive (test)> select * from cl_stu_sub;
OK
1 chinese
2 english
3 science
5 nature
1、内连接。inner join,即基于on语句,仅列出表1和表2符合连接条件的数据。
hive (test)> select a.*,b.* from cl_student a join cl_stu_sub b on a.id=b.id;
1 chenli 21 1 chinese
2 xuzeng 22 2 english
3 xiaodan 23 3 science
hive (test)> select a.*,b.* from cl_student a,cl_stu_sub b where a.id=b.id;
1 chenli 21 1 chinese
2 xuzeng 22 2 english
3 xiaodan 23 3 science
第一条语句与第二条SQL条语句的效果是一样的
2、左连接。左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示NULL
hive (test)> select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id;
1 chenli 21 1 chinese
2 xuzeng 22 2 english
3 xiaodan 23 3 science
4 hua 24 NULL NULL
深入了解,比较两者的不同,在上述语句中增加了一个where条件或者是and的条件
on条件会产生一个临时表,where条件是对这个临时表进行过滤
hive (test)> select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id where b.name='chinese';
1 chenli 21 1 chinese
hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id where a.name='chenli';
1 chenli 21 1 chinese
标准查询关键字执行顺序为 from->where->group by->having->order by
on是先对表进行筛选后再关联的,left关联则on对右表才有效的,左表都是要选出来的
如果是要条件查询后才连接应该把查询件放置于ON后。如果是想再连接完毕后才筛选就应把条件放置于where后面,对主表的帅选要用where条件
hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b where a.id=b.id;
1 chenli 21 1 chinese
2 xuzeng 22 2 english
3 xiaodan 23 3 science
hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id and a.name='chenli';
1 chenli 21 1 chinese
2 xuzeng 22 NULL NULL
3 xiaodan 23 NULL NULL
4 hua 24 NULL NULL
hive (test)>select a.*,b.* from cl_student a left outer join cl_stu_sub b on a.id=b.id and b.name='chinese';
1 chenli 21 1 chinese
2 xuzeng 22 NULL NULL
3 xiaodan 23 NULL NULL
4 hua 24 NULL NULL
3、右连接.与左连接同理
hive (test)> select a.*,b.* from cl_student a right outer join cl_stu_sub b on a.id=b.id;
1 chenli 21 1 chinese
2 xuzeng 22 2 english
3 xiaodan 23 3 science
NULL NULL NULL 5 nature
4、全连接。相当于表1和表2的数据都显示,如果没有对应的数据,则显示NULL
hive (test)> select a.*,b.* from cl_student a full outer join cl_stu_sub b on a.id=b.id;
1 chenli 21 1 chinese
2 xuzeng 22 2 english
3 xiaodan 23 3 science
4 hua 24 NULL NULL
NULL NULL NULL 5 nature
5、左半开连接。left semi join,语法与左连接不一样,只能选择出左边表的数据,此数据符合on后面的条件。
semi join通常比inner join的效率更高
hive (test)>select a.* from cl_student a left semi join cl_stu_sub b on a.id=b.id;
1 chenli 21
2 xuzeng 22
3 xiaodan 23
hive (test)>select * from cl_student a left semi join cl_stu_sub b on a.id=b.id;
1 chenli 21
2 xuzeng 22
3 xiaodan 23
hive (test)>select a.*,b.* from cl_student a left semi join cl_stu_sub b on a.id=b.id;
FAILED: SemanticException [Error 10009]: Line 1:11 Invalid table alias 'b'
6、Hive不支持右半开连接
7、笛卡尔积join
笛卡尔积是一种连接,表示左边表的行数乘以右边表的行数。
hive (test)>select a.*,b.* from cl_student a join cl_stu_sub b;
1 chenli 21 1 chinese
1 chenli 21 2 english
1 chenli 21 3 science
1 chenli 21 5 nature
2 xuzeng 22 1 chinese
2 xuzeng 22 2 english
2 xuzeng 22 3 science
2 xuzeng 22 5 nature
3 xiaodan 23 1 chinese
3 xiaodan 23 2 english
3 xiaodan 23 3 science
3 xiaodan 23 5 nature
4 hua 24 1 chinese
4 hua 24 2 english
4 hua 24 3 science
4 hua 24 5 nature