Join 语法:
join_table:
table_reference JOIN table_factor [join_condition]//内连接
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition//外连接
| table_reference LEFT SEMI JOIN table_reference join_condition//左半连接
| table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
table_reference:
table_factor//表
| join_table//join语句
table_factor:
tbl_name [alias]//表名[别名]
| table_subquery alias//子查寻[别名]
| ( table_references )//带空号的table_reference
join_condition:
ON expression//on开头的条件语句
hive> select * from example1;
1
2
hive> select * from example2;
2
3
内连接
- 匹配不上的都丢弃
hive> select * from example1 join example2 on example1.id = example2.id;
2 2
外连接
- 左外连接,左表都保留
hive> select * from example1 left outer join example2 on example1.id = example2.id;
1 NULL
2 2
- 右外连接,右表都保留
hive> select * from example1 full outer join example2 on example1.id = example2.id;
2 2
NULL 3
- 全外连接,左右表都保留
hive> select * from example1 left outer join example2 on example1.id = example2.id;
1 NULL
2 2
NULL 3
笛卡尔积连接
- 排列组合所有情况
hive> select * from example1 cross join example2 on example1.id = example2.id;
1 2
1 3
2 2
2 3
左半连接
功能等同于传统SQL中的IN,此时表b只能出现在ON子句中,不能出现在SELECT和WHERE中。(uncorrelated)
传统
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
- Hive
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)
子查询
- 0.12以后的Hive开始支持FROM从句的子查寻,并在0.13以后的版本加入AS
SELECT ... FROM (subquery) name ...
SELECT ... FROM (subquery) AS name ... (Note: Only valid starting with Hive 0.13.0)
0.13以后的Hive开始支持WHERE从句的子查询
IN和NOT IN中声明中的查询将被当作常量。(被称作uncorrelated subqueries,因为自查寻语句不能使用父查询的列。)
SELECT *
FROM A
WHERE A.a IN (SELECT foo FROM B);
- 除了IN,现在也支持EXISTS和NOT EXISTS的子查寻,可以使用父查询的列:
SELECT A
FROM T1
WHERE EXISTS (SELECT B FROM T2 WHERE T1.X = T2.Y)
一些限制:
这些子查询只支持表达式在右边;
IN/NOT IN子句只能select单列;
EXISTS/NOT EXISTS拥有一个或多个有关联的断言;(EXISTS/NOT EXISTS must have one or more correlated predicates.)
References to the parent query are only supported in the WHERE clause of the subquery.
生成MapReduce作业
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
- 解释:生成1个MapReduce作业,a和b被缓存在内存里,在reducers中与c进行join。
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
- 解释:生成2个MapReduce作业,第一个作业a被缓存起来,在reducers中和b进行join。生成的结果被缓存起来,在第二个作业的reducers中与c进行join。
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
- 解释:生成1个MapReduce作业,通过hint(提示)指定谁作为STREAM,其他的被缓存。此次b和c被缓存在内存里,在reducers中与a进行join。如果不写默认最右表为STREAM。
MAPJOIN
- 如果所有表都很小,可以只是用map,不使用reduce。对a所有的maper都把b表读全。此时
a FULL/RIGHT OUTER JOIN b
无法实现,因为无法保证b表每条记录被且只被考虑一次。
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
- 如果使用待join字段对表分了桶,并且分桶数目一致,就不再需要把b表所有数据分到a表,而是只用把对应桶的b分给a。如a的桶1在mapper过程中只会获得b的桶1。这种做法不是默认行为,需要设置参数。
set hive.optimize.bucketmapjoin = true
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
- 如果使用待join字段对表分了桶还排了序,并且分桶数目一致,The mapper for the bucket for A will traverse the corresponding bucket for B(怕翻译不正确,这里保留官网原文,个人理解是,有顺序后,两个表可以同属顺寻遍历,时间复杂度从O(n^2)降低为O(n))。这种做法不是默认行为,需要设置参数。
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM A a JOIN B b ON a.key = b.key
- 设定下面参数为True,当所有输入表都是分桶或分桶+排序过的时,Hive会自动进行分桶MapJoin或分桶排序MapJoin。
hive.auto.convert.join
- 当多个表只有一个是大表,其他的都是小表时,我们希望使用2个map-only job实现,如果直接使用下面的代码则会生成2个map-only jobs。
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)
) firstjoin
JOIN
smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)
- 使用下面的配置参数才能保证该查询只执行1个map-reduce作业
hive.auto.convert.join.noconditionaltask - 该参数为on,且n表(tables)/分区(partitions)join中的n-1个部分的总尺寸小于预设大小时,自动转换成1个mapjoin。
hive.auto.convert.join.noconditionaltask.size - 上面提到的预设大小,默认为10M。
新特性:
以前hive只支持等连接,外连接,左半连接。hive不支持非相等的join条件,因为它很难在map/reduce job实现这样的条件。Hive 2.2.0以后,开始支持非相等join条件。
SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id)
INNER joins
-- Disjunctions多逻辑或
SELECT *
FROM src1 JOIN src
ON (src1.key=src.key
OR src1.value between 100 and 102
OR src.value between 100 and 102)
LIMIT 10;
-- Conjunction with multiple inputs references in one side一侧多输入
SELECT *
FROM src1 JOIN src
ON (src1.key+src.key >= 100
AND src1.key+src.key <= 102)
LIMIT 10;
-- Conjunct with no references无参
SELECT *
FROM src1 JOIN src
ON (src1.value between 100 and 102
AND src.value between 100 and 102
AND true)
LIMIT 10;
- OUTER joins
-- Disjunctions多逻辑或
SELECT *
FROM src1 LEFT OUTER JOIN src
ON (src1.key=src.key
OR src1.value between 100 and 102
OR src.value between 100 and 102)
LIMIT 10;
-- Conjunction with multiple inputs references in one side一侧多输入
SELECT *
FROM src1 RIGHT OUTER JOIN src
ON (src1.key+src.key >= 100
AND src1.key+src.key <= 102)
LIMIT 10;
-- Conjunct with no references无参
SELECT *
FROM src1 FULL OUTER JOIN src
ON (src1.value between 100 and 102
AND src.value between 100 and 102
AND true)
LIMIT 10;