Hive之常用join

Hive 有很多join,这里主要讲述常用的三种join。

 

0 Map Reduce

map reduce

 

1 Common Join

common join

最为普通的join策略,不受数据量的大小影响,也可以叫做reduce side join ,最没效率的一种join 方式. 它由一个mapreduce job 完成.

首先将大表和小表分别进行map 操作, 在map shuffle 的阶段每一个map output key 变成了table_name_tag_prefix + join_column_value , 但是在进行partition 的时候它仍然只使用join_column_value 进行hash.

每一个reduce 接受所有的map 传过来的split , 在reducce 的shuffle 阶段,它将map output key 前面的table_name_tag_prefix 给舍弃掉进行比较. 

In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers where as the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. 

 

2 Map Join

map join

MAPJOINs are processed by loading the smaller table into an in-memory hash map and matching keys with the larger table as they are streamed through. The prior implementation has this division of labor:

  • Local work:
    • read records via standard table scan (including filters and projections) from source on local machine
    • build hashtable in memory
    • write hashtable to local disk
    • upload hashtable to dfs
    • add hashtable to distributed cache
  • Map task
    • read hashtable from local disk (distributed cache) into memory
    • match records' keys against hashtable
    • combine matches and write to output
  • No reduce task

 

常见的优化就是,把关联的小表加载到内存中,去除reducer操作减少网络传输时间。参数如下:

SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000000; 

If hive.auto.convert.join is set to true the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible.

Optimize Auto Join Conversion

When auto join is enabled, there is no longer a need to provide the map-join hints in the query. The auto join option can be enabled with two configuration parameters:

set hive.auto.convert.join.noconditionaltask = true;
set hive.auto.convert.join.noconditionaltask.size = 10000000;

The default for hive.auto.convert.join.noconditionaltask is true which means auto conversion is enabled. (Originally the default was false – see HIVE-3784 – but it was changed to true by HIVE-4146 before Hive 0.11.0 was released.)

The size configuration enables the user to control what size table can fit in memory. This value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory. Currently, n-1 tables of the join have to fit in memory for the map-join optimization to take effect. There is no check to see if the table is a compressed one or not and what the potential size of the table can be. The effect of this assumption on the results is discussed in the next section.

注:当hash table分发相同key的value时,所有的value会在一个list里。便于生成所有的关联,这点和left semi join不同。

 

3 Left Semi Join

left semi join

LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN are that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.

SELECT a.key, a.value

FROM a

WHERE a.key in

 (SELECT b.key

  FROM B);

can be rewritten to:

SELECT a.key, a.val

FROM a LEFT SEMI JOIN b ON (a.key = b.key)

left semi join 实现的是 hive in的功能,具体实现和map join类似。区别在于小表hashtable会进行key value的去重,同时不在hash table里的key会被丢弃,从而实现in的功能。

 

参考文件:

https://www.cnblogs.com/zzhangyuhang/p/9792800.html

http://shiyanjun.cn/archives/588.html

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值