Hive Set Difference Explained (Repost)

qcg_qcg

Category: Big Data

Original link: https://blog.csdn.net/Dr_Guo/article/details/51182626

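First, the concept of a set difference.

In the figure (a Venn diagram of sets A and B), the blue region of A is set A minus set B, called the difference of A and B: the elements that are in A but not in B.
For example, take these two tables: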

hive> select * from A;
OK
1	2
1	3
2	1
2	3
3	1
Time taken: 0.3 seconds, Fetched: 5 row(s)
hive> select * from B;
OK
1	2
1	4
2	2
2	3
Time taken: 0.086 seconds, Fetched: 4 row(s)
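
To reproduce the two tables, a minimal setup sketch follows. The column names uid and goods are taken from the queries later in this post; the INT types and the use of INSERT ... VALUES (which assumes Hive 0.14 or later) are assumptions, since the original does not show how the tables were created.

```sql
-- Hypothetical setup: the column types are assumed, not given in the post.
CREATE TABLE A (uid INT, goods INT);
CREATE TABLE B (uid INT, goods INT);

-- Same rows as the SELECT * output shown above.
INSERT INTO TABLE A VALUES (1, 2), (1, 3), (2, 1), (2, 3), (3, 1);
INSERT INTO TABLE B VALUES (1, 2), (1, 4), (2, 2), (2, 3);
```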

We want the difference of A and B (A - B):

1	3
2	1
3	1

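Can Hive use NOT IN? Yes, but only on a single column. The multi-column form select * from A where (uid,goods) not in (select uid,goods from B); is supported by Oracle but not by Hive. Because a single-column NOT IN compares uid alone, the query below returns only the rows whose uid never appears in B, which is not the full difference A - B.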

hive> select * from A  where uid not in (select uid from B);
3	1
Time taken: 46.09 seconds, Fetched: 1 row(s)
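
One possible workaround for the single-column limit is to fold both columns into one string key, so that NOT IN still compares a single expression. This is only a sketch and not part of the original post: it assumes uid and goods are never NULL, and that the '_' separator cannot occur inside either value (true for integers).

```sql
-- Sketch: emulate a two-column NOT IN by comparing one composite key.
-- Assumes uid and goods are non-NULL; '_' keeps (1, 12) distinct from (11, 2).
SELECT *
FROM A
WHERE concat_ws('_', CAST(uid AS STRING), CAST(goods AS STRING)) NOT IN (
  SELECT concat_ws('_', CAST(uid AS STRING), CAST(goods AS STRING))
  FROM B
);
```

On the sample tables this should return the rows (1, 3), (2, 1) and (3, 1), the same as A - B. Building string keys for every row adds cost, so it is unlikely to be cheaper than the approaches that follow.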

Can Hive use NOT EXISTS? Yes, it can, and with a correlated subquery on both columns it returns the full difference:

hive> select * from A  where not exists (select * from B where A.uid=B.uid and A.goods=B.goods);
1	3
2	1
3	1
Time taken: 12.989 seconds, Fetched: 3 row(s)

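However, the first two approaches appear to be resource-intensive, and both are restricted in ODPS. The method commonly used in Hive to compute a set difference is a left (or right) outer join.
First, look at what the result looks like after the left join: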

hive> select * from A a left outer join B b on a.uid=b.uid and a.goods=b.goods;
1	2	1	2
1	3	NULL	NULL
2	1	NULL	NULL
2	3	2	3
3	1	NULL	NULL
Time taken: 12.735 seconds, Fetched: 5 row(s)

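Every row of A without a match in B has NULL in B's columns, so now we only need to keep the rows where B's uid and goods are NULL: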

hive> select a.* from A a left outer join B b on a.uid=b.uid and a.goods=b.goods where b.uid is null and b.goods is null;
1	3
2	1
3	1
Time taken: 13.023 seconds, Fetched: 3 row(s)
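
Two variants of the query above, as a sketch rather than something from the original post. The first is the right outer join mentioned earlier, with the tables swapped so that A is still the preserved side. The second uses the EXCEPT set operator, which assumes Hive 2.3.0 or later. Note that EXCEPT removes duplicate rows by default, while the join keeps any duplicates present in A.

```sql
-- Variant 1: the same A - B written as a right outer join.
-- A is on the preserved (right) side, so unmatched A rows get NULLs for B.
SELECT a.*
FROM B b
RIGHT OUTER JOIN A a
  ON a.uid = b.uid AND a.goods = b.goods
WHERE b.uid IS NULL AND b.goods IS NULL;

-- Variant 2 (assumes Hive 2.3.0+): set operator, result is de-duplicated.
SELECT uid, goods FROM A
EXCEPT
SELECT uid, goods FROM B;
```

On the sample tables, which contain no duplicate rows, both variants should return (1, 3), (2, 1) and (3, 1), though the row order is not guaranteed.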

Reposted from: https://blog.csdn.net/Dr_Guo/article/details/51182626
