亲测有效,在hive-2.1.1版本,存储为hadoop-2.7.3下,hive支持 in或not in 中包含查询子句的语法,此外,在有些版本或hive thrift包不支持的话,也可能通过left join实现
1、创建表
假设有一个用户浏览商品表scan_info,和一个用户购买商品表buy_info。如下
hive> create table scan_info (userId string,itemId string,time string) row format delimited fields terminated by '\t';
OK
Time taken: 0.784 seconds
hive> create table buy_info (userId string,itemId string,time string) row format delimited fields terminated by '\t';
OK
Time taken: 0.133 seconds
2、准备数据并导入数据
hive> load data local inpath '/home/wo/yy_work/scan_info.txt' overwrite into table scan_info;
Loading data to table test_hive.scan_info
OK
Time taken: 2.568 seconds
hive> load data local inpath '/home/wo/yy_work/buy_info.txt' overwrite into table buy_info;
Loading data to table test_hive.buy_info
OK
Time taken: 0.605 second
3、验证查看数据
hive> select * from scan_info;
OK
001 342 2018-12-14
002 382 2018-12-15
003 458 2018-12-15
004 468 2018-12-15
hive> select * from buy_info;
OK
001 342 2018-12-13
002 382 2018-12-14
003 458 2018-12-15
005 325 2018-12-15
4、左外连接查询
select * from scan_info left outer join buy_info on scan_info.userId = buy_info.userId and scan_info.itemId = buy_info.itemIdTotal MapReduce CPU Time Spent: 2 seconds 690 msec
OK
001 342 2018-12-14 001 342 2018-12-13
002 382 2018-12-15 002 382 2018-12-14
003 458 2018-12-15 003 458 2018-12-15
004 468 2018-12-15 NULL NULL NULL
Time taken: 23.802 seconds, Fetched: 4 row(s)
5、如果要查询在scan_info表中并且也在buy_info表中的信息,需要用in查询,hive sql如下:
select scan_info.userId , scan_info.itemId from scan_info left outer join buy_info
on scan_info.userId = buy_info.userId and scan_info.itemId = buy_info.itemId
where buy_info.userId is not null;
001 342
002 382
003 458
Time taken: 22.313 seconds, Fetched: 3 row(s)
6、如果要查询在scan_info表中并且不也在buy_info表中的信息,需要用not in查询,hive sql如下:
select scan_info.userId, scan_info.itemId from scan_info left outer join buy_info
on scan_info.userId=buy_info.userId and scan_info.itemId=buy_info.itemId
where buy_info.userId is null;
OK
004 468
Time taken: 20.133 seconds, Fetched: 1 row(s)
7、not in 子查询
select scan_info.userId, scan_info.itemId from scan_info where scan_info.userId not in (
select distinct buy_info.userId from buy_info )
OK
004 468
Time taken: 104.018 seconds, Fetched: 1 row(s)
8、in子查询
select scan_info.userId, scan_info.itemId from scan_info where scan_info.userId in (
select distinct buy_info.userId from buy_info )
OK
001 342
002 382
003 458
Time taken: 44.224 seconds, Fetched: 3 row(s)
参考:https://blog.csdn.net/yzhhughyzh/article/details/45604291
https://blog.csdn.net/chenyuexing0822/article/details/78525383