hive中的join有以下几类:
inner join/left join/right join/full join/cross join/left semi join
测试数据:
ja.txt
1 zhang3
2 li4
3 wang5
jb.txt
1 19
2 20
4 21
==============================================================================
创建ja表:
hive> create table ja(id int,name String)
>row format delimited fields terminated by '\t';
插入数据:
hive> load data local inpath '/root/user.tmp/ja.txt' into table ja;
查看:
hive> select * from ja;
OK
1 zhang3
2 li4
3 wang5
创建jb表:
hive> create table jb(id int,age int)
>row format delimited fields terminated by '\t';
插入数据:
hive> load data local inpath '/root/user.tmp/jb.txt' into table jb;
查看:
hive> select * from jb;
OK
1 19
2 20
4 21
准备工作:
hive> set hive.strict.checks.cartesian.product=false;
hive> set hive.mapred.mode=nostrict;
===========================================================================
1、inner join:
hive> select * from ja inner join jb on ja.id=jb.id;
OK
1 zhang3 1 19
2 li4 2 20
=============================================================================
2、left join:
hive> select * from ja left join jb on ja.id=jb.id;
OK
1 zhang3 1 19
2 li4 2 20
3 wang5 NULL NULL
hive> select * from ja left join jb on ja.id=jb.id where jb.id is null;
OK
3 wang5 NULL NULL
============================================================================
3、right join:
hive> select * from ja right join jb on ja.id=jb.id;
OK
1 zhang3 1 19
2 li4 2 20
NULL NULL 4 21
hive> select * from ja right join jb on ja.id=jb.id where ja.id is null;
OK
NULL NULL 4 21
==============================================================================
4、full join:
hive> select * from ja full join jb on ja.id=jb.id;
OK
1 zhang3 1 19
2 li4 2 20
3 wang5 NULL NULL
NULL NULL 4 21
hive> select * from ja full join jb on ja.id=jb.id where ja.id is null or jb.id is null;
OK
3 wang5 NULL NULL
NULL NULL 4 21
================================================================================
5、cross join:
hive中不支持笛卡尔积操作,但任何join(除left semi join外)只要不写连接条件,就是笛卡尔积。
==================================================================================
6、left semi join
用来替换SQL中的exists/in操作
hive> select * from ja left semi join jb on ja.id=jb.id;
OK
1 zhang3
2 li4
以上的语句相当于MySQL中的:
select * from ja where ja.id in(select jb.id from jb);
select * from ja where exists(select jb.id from jb on ja.id=jb.id);
==============================================================================
7、补充MySQL中的exists/in操作
in:
select * from ja where ja.id in(select jb.id from jb);
流程:
先执行子查询:select jb.id from jb;
jb.id
1
2
4
再执行主查询:select * from ja;
ja.id ja.name
1 zhang3
2 li4
3 wang5
两个结果笛卡尔积:
ja.id ja.name jb.id
1 zhang3 1
2 li4 1
3 wang5 1
1 zhang3 2
2 li4 2
3 wang5 2
1 zhang3 4
2 li4 4
3 wang5 4
执行on操作:
ja.id ja.name jb.id
1 zhang3 1
2 li4 2
去除多余数据:
ja.id ja.name
1 zhang3
2 li4
exists:
select * from a where exists (select b.id from b where a.id=b.id);
流程:
执行主查询select * from a;
ja.id ja.name
1 zhang3
2 li4
3 wang5
将主查询结果的每条记录执行a.id=b.id操作,成立返回true,留下该条记录,否则丢弃该条记录:
ja.id ja.name
1 zhang3
2 li4
场景:
子表较大用exists。
not in 和no exists,无论哪种情况都用not exists。
参考文章:
https://www.cnblogs.com/liyasong/p/sql_in_exists.html
https://blog.csdn.net/wqc19920906/article/details/79800374
一张图总结join