Hive中HSQL中left semi join

最新推荐文章于 2023-05-11 09:26:38 发布

☞空白页

最新推荐文章于 2023-05-11 09:26:38 发布

阅读量700

点赞数

分类专栏： Hive

原文链接：https://blog.csdn.net/helloxiaozhe/article/details/87910386

版权

Hive 专栏收录该内容

32 篇文章 6 订阅

订阅专栏

Hive中HSQL中left semi join

证明在Hive 2.1.1版本中，是支持where子句中的子查询，SQL常用的exist in子句在Hive中是不支持的，但可以使用一个更高效的实现方式---- semi join最主要的使用场景就是解决exist in。

SQL1:

SELECT table1.student_no, table1.student_name FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no);

SQL2:

SELECT * FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no);

SQL1和SQL2等价，只输出左表包含的那些列。且输出结果如下：

可以看到，只打印出了左边的表中的列，规律是如果主键在右边表中存在，则打印，否则过滤掉了。以上两个测试证明在博客https://blog.csdn.net/AntKengElephant/article/details/83029573中有错。

此外，注意哈，只存在 left SEMI JOIN，不存在SEMI JOIN 和 right SEMI JOIN。

hive> SELECT table1.student_no, table1.student_name FROM table1 SEMI JOIN table2 on (table1.student_no=table2.student_no);
FAILED: SemanticException [Error 10009]: Line 1:79 Invalid table alias 'table1'

此外，注意需要以下几项：
1、left semi join 的限制是， JOIN 子句中右边的表只能在 ON 子句中设置过滤条件，在 WHERE 子句、SELECT 子句或其他地方过滤都不行。

hive> SELECT * FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no) where table2.student_no>3;
FAILED: SemanticException [Error 10004]: Line 1:92 Invalid table alias or column reference 'table2': (possible column names are: student_no, student_name)

对右表的过滤条件只能写在on子句中：

hive> SELECT * FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table2.student_no and  table2.student_no>3);

2、left semi join 是只传递表的 join key 给 map 阶段，因此left semi join 中最后 select 的结果只许出现左表的那些列（参见SQL1和SQL2区别）。

3、因为 left semi join 是 in(keySet) 的关系，遇到右表重复记录，左表会跳过，而 join 则会一直遍历。这就导致右表有重复值得情况下 left semi join 只产生一条，join 会产生多条，也会导致 left semi join 的性能更高。

☞空白页

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Hive中HSQL中left semi join

Hive中HSQL中left semi join证明在Hive 2.1.1版本中，是支持where子句中的子查询，SQL常用的exist in子句在Hive中是不支持的，但可以使用一个更高效的实现方式---- semi join最主要的使用场景就是解决exist in。SQL1:SELECT table1.student_no, table1.student_name FROM table1 LEFT SEMI JOIN table2 on ( table1.student_no =table
复制链接

扫一扫