Hive Join详解

Join 语法:

join_table:
    table_reference JOIN table_factor [join_condition]//内连接
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition//外连接
  | table_reference LEFT SEMI JOIN table_reference join_condition//左半连接
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)


table_reference:
    table_factor//表
  | join_table//join语句


table_factor:
    tbl_name [alias]//表名[别名]
  | table_subquery alias//子查寻[别名]
  | ( table_references )//带空号的table_reference


join_condition:
    ON expression//on开头的条件语句
hive> select * from example1;
1
2
hive> select * from example2;
2
3

内连接

  • 匹配不上的都丢弃
hive> select * from example1 join example2 on example1.id = example2.id;
2 2

外连接

  • 左外连接,左表都保留
hive> select * from example1 left outer join example2 on example1.id = example2.id;
1 NULL
2 2
  • 右外连接,右表都保留
hive> select * from example1 full outer join example2 on example1.id = example2.id;
2 2
NULL 3
  • 全外连接,左右表都保留
hive> select * from example1 left outer join example2 on example1.id = example2.id;
1 NULL
2 2
NULL 3

笛卡尔积连接

  • 排列组合所有情况
hive> select * from example1 cross join example2 on example1.id = example2.id;
1 2
1 3
2 2
2 3

左半连接

  • 功能等同于传统SQL中的IN,此时表b只能出现在ON子句中,不能出现在SELECT和WHERE中。(uncorrelated)

  • 传统

SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
  • Hive
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)

子查询

  • 0.12以后的Hive开始支持FROM从句的子查寻,并在0.13以后的版本加入AS
SELECT ... FROM (subquery) name ...
SELECT ... FROM (subquery) AS name ...   (Note: Only valid starting with Hive 0.13.0)
  • 0.13以后的Hive开始支持WHERE从句的子查询

  • IN和NOT IN中声明中的查询将被当作常量。(被称作uncorrelated subqueries,因为自查寻语句不能使用父查询的列。)

SELECT *
FROM A
WHERE A.a IN (SELECT foo FROM B);
  • 除了IN,现在也支持EXISTS和NOT EXISTS的子查寻,可以使用父查询的列:
SELECT A
FROM T1
WHERE EXISTS (SELECT B FROM T2 WHERE T1.X = T2.Y)
  • 一些限制:

    • 这些子查询只支持表达式在右边;

    • IN/NOT IN子句只能select单列;

    • EXISTS/NOT EXISTS拥有一个或多个有关联的断言;(EXISTS/NOT EXISTS must have one or more correlated predicates.)

    • References to the parent query are only supported in the WHERE clause of the subquery.

生成MapReduce作业

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
  • 解释:生成1个MapReduce作业,a和b被缓存在内存里,在reducers中与c进行join。
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
  • 解释:生成2个MapReduce作业,第一个作业a被缓存起来,在reducers中和b进行join。生成的结果被缓存起来,在第二个作业的reducers中与c进行join。
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
  • 解释:生成1个MapReduce作业,通过hint(提示)指定谁作为STREAM,其他的被缓存。此次b和c被缓存在内存里,在reducers中与a进行join。如果不写默认最右表为STREAM。

MAPJOIN

  • 如果所有表都很小,可以只是用map,不使用reduce。对a所有的maper都把b表读全。此时a FULL/RIGHT OUTER JOIN b无法实现,因为无法保证b表每条记录被且只被考虑一次。
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
  • 如果使用待join字段对表分了桶,并且分桶数目一致,就不再需要把b表所有数据分到a表,而是只用把对应桶的b分给a。如a的桶1在mapper过程中只会获得b的桶1。这种做法不是默认行为,需要设置参数。
set hive.optimize.bucketmapjoin = true
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
  • 如果使用待join字段对表分了桶还排了序,并且分桶数目一致,The mapper for the bucket for A will traverse the corresponding bucket for B(怕翻译不正确,这里保留官网原文,个人理解是,有顺序后,两个表可以同属顺寻遍历,时间复杂度从O(n^2)降低为O(n))。这种做法不是默认行为,需要设置参数。
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a JOIN B b ON a.key = b.key
  • 设定下面参数为True,当所有输入表都是分桶或分桶+排序过的时,Hive会自动进行分桶MapJoin或分桶排序MapJoin。
hive.auto.convert.join
  • 当多个表只有一个是大表,其他的都是小表时,我们希望使用2个map-only job实现,如果直接使用下面的代码则会生成2个map-only jobs。
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
  ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
    bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)                                                   
  ) firstjoin                                                            
  JOIN                                                                 
  smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)   
  • 使用下面的配置参数才能保证该查询只执行1个map-reduce作业
hive.auto.convert.join.noconditionaltask - 该参数为on,且n表(tables)/分区(partitions)join中的n-1个部分的总尺寸小于预设大小时,自动转换成1个mapjoin。

hive.auto.convert.join.noconditionaltask.size - 上面提到的预设大小,默认为10M。

新特性:

  • 以前hive只支持等连接,外连接,左半连接。hive不支持非相等的join条件,因为它很难在map/reduce job实现这样的条件。Hive 2.2.0以后,开始支持非相等join条件。SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)

  • INNER joins

-- Disjunctions多逻辑或
SELECT *
FROM src1 JOIN src
ON (src1.key=src.key
  OR src1.value between 100 and 102
  OR src.value between 100 and 102)
LIMIT 10;
-- Conjunction with multiple inputs references in one side一侧多输入
SELECT *
FROM src1 JOIN src
ON (src1.key+src.key >= 100
  AND src1.key+src.key <= 102)
LIMIT 10;
-- Conjunct with no references无参
SELECT *
FROM src1 JOIN src
ON (src1.value between 100 and 102
  AND src.value between 100 and 102
  AND true)
LIMIT 10;
  • OUTER joins
-- Disjunctions多逻辑或
SELECT *
FROM src1 LEFT OUTER JOIN src
ON (src1.key=src.key
  OR src1.value between 100 and 102
  OR src.value between 100 and 102)
LIMIT 10;
-- Conjunction with multiple inputs references in one side一侧多输入
SELECT *
FROM src1 RIGHT OUTER JOIN src
ON (src1.key+src.key >= 100
  AND src1.key+src.key <= 102)
LIMIT 10;
-- Conjunct with no references无参
SELECT *
FROM src1 FULL OUTER JOIN src
ON (src1.value between 100 and 102
  AND src.value between 100 and 102
  AND true)
LIMIT 10;
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值