Optimizing Skewed Joins

What is a skewed join? MapReduce is a distributed processing system: after the map phase, rows are routed to different reducers by key, and because all rows with the same key must stay together, they cannot be split apart. If one key is extremely large, the single reducer that receives it runs very slowly and consumes a great deal of memory. The remedy is to spread such an oversized key across several reducers. Pig implements a skewed join in three steps. First, it samples the data to determine which keys should go to which reducers, and how many reducers an oversized key requires. Second, it uses a special partitioner that routes each row to the correct reducer according to that plan, splitting the oversized keys across multiple reducers. This splitting is applied only to the left-hand relation; the skew handled here is skew on the left side, so when doing a skewed join in Pig, be sure to put the skewed relation on the left. Third, in the reduce phase, each reducer receives its assigned slice of the left-hand rows, while the right-hand relation is replicated to every reducer; each reducer then joins its left-hand slice against the complete right-hand relation.
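Under the assumptions above, Pig's three steps (sample the left side, apply a special partitioner, replicate the right side) can be simulated in miniature. This is only a hedged sketch of the idea, not Pig's actual implementation; the function names (`plan_from_sample`, `partition_left`, `skewed_join`), the 30% hotness threshold, and the reducer count are all invented for illustration:

```python
from collections import Counter, defaultdict

NUM_REDUCERS = 4
HOT_THRESHOLD = 0.3  # illustrative: a key taking >30% of the sample is "oversized"

def plan_from_sample(left_sample):
    """Step 1: sample the left relation to find oversized keys and decide
    how many reducers each one needs (here: proportional to its share)."""
    counts = Counter(k for k, _ in left_sample)
    total = len(left_sample)
    plan = {}
    for key, c in counts.items():
        share = c / total
        if share > HOT_THRESHOLD:
            plan[key] = max(2, round(share * NUM_REDUCERS))
    return plan

def partition_left(key, row_index, plan):
    """Step 2: the special partitioner. Oversized keys are spread
    round-robin over their allotted reducers; other keys hash normally."""
    if key in plan:
        return row_index % plan[key]
    return hash(key) % NUM_REDUCERS

def skewed_join(left, right, plan):
    """Step 3: left rows go to their assigned reducer; right rows for an
    oversized key are replicated to every reducer that key was split over."""
    reducers = defaultdict(lambda: ([], []))
    for i, (k, v) in enumerate(left):
        reducers[partition_left(k, i, plan)][0].append((k, v))
    for k, v in right:
        targets = range(plan[k]) if k in plan else [hash(k) % NUM_REDUCERS]
        for r in targets:
            reducers[r][1].append((k, v))
    out = []
    for lrows, rrows in reducers.values():
        for lk, lv in lrows:
            for rk, rv in rrows:
                if lk == rk:
                    out.append((lk, lv, rv))
    return out
```

Splitting a hot key over k reducers means the right-hand rows for that key are copied k times, which is the price paid for balancing the left-hand load.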

The Problem

A join of two large tables is done by a set of MapReduce jobs which first sort the tables on the join key and then join them. The Mapper sends all rows with a particular key to the same Reducer.

e.g., Suppose we have table A with a key column "id" whose values are 1, 2, 3, and 4, and table B with a similar column whose values are 1, 2, and 3.
We want to do a join corresponding to the following query

  • select A.id from A join B on A.id = B.id

A set of Mappers reads the tables and routes the rows to Reducers based on their keys: rows with key 1 go to Reducer R1, rows with key 2 go to Reducer R2, and so on. Each Reducer computes the cross product of the matching values from A and B and writes the output. Reducer R4 receives rows only from A, so it produces no results.
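As a toy model of this routing (hypothetical code, not Hadoop's API; a real MapReduce job shuffles with a configurable partitioner, typically hash(key) mod the number of reducers):

```python
from collections import defaultdict

def shuffle_join(left, right, num_reducers):
    """Plain repartition join: every row is routed by hash(key), so all rows
    sharing a key meet at one reducer, which emits their cross product."""
    reducers = defaultdict(lambda: ([], []))
    for k, v in left:
        reducers[hash(k) % num_reducers][0].append((k, v))
    for k, v in right:
        reducers[hash(k) % num_reducers][1].append((k, v))
    out = []
    for lrows, rrows in reducers.values():
        out.extend((lk, lv, rv) for lk, lv in lrows
                   for rk, rv in rrows if lk == rk)
    return out
```

With A holding ids 1–4 and B holding ids 1–3, the reducer that receives key 4 sees rows only from A and emits nothing, matching the R4 case above.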

Now let's assume that A was highly skewed in favor of id = 1. Reducers R2 and R3 will complete quickly but R1 will continue for a long time, thus becoming the bottleneck. If the user has information about the skew, the bottleneck can be avoided manually as follows:

Do two separate queries:

  • select A.id from A join B on A.id = B.id where A.id <> 1;
  • select A.id from A join B on A.id = B.id where A.id = 1 and B.id = 1;

The first query has no skew, so all the Reducers finish at roughly the same time. If we assume that B has only a few rows with B.id = 1, those rows fit into memory, and the join can be done efficiently by storing the B values in an in-memory hash table. The join is then performed by the Mapper itself and the data never have to go to a Reducer. The partial results of the two queries are merged to get the final result.
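The second query's strategy can be sketched as a broadcast (map-side) hash join. `map_side_join` is a hypothetical helper written for this note, not Hive's implementation:

```python
def map_side_join(big_rows, small_rows):
    """Load the small side into an in-memory hash table, then join each row
    of the big side in the mapper itself -- no shuffle, no reducer."""
    table = {}
    for k, v in small_rows:
        table.setdefault(k, []).append(v)
    return [(k, bv, sv) for k, bv in big_rows for sv in table.get(k, [])]
```

Here `small_rows` plays the role of B's few rows with B.id = 1; every mapper holds the whole table, so the skewed A rows never cross the network.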

  • Advantages
    • If a small number of skewed keys account for a significant percentage of the data, they will not become bottlenecks.
  • Disadvantages
    • The tables A and B have to be read and processed twice.
    • Because the join is computed as two partial results, the results also have to be read and written twice in order to merge them.
    • The user needs to be aware of the skew in the data and manually do the above process.

We can improve this further by reducing the processing of the skewed keys. First read B and store the rows with key 1 in an in-memory hash table. Then run a set of Mappers to read A and do the following:

  • If it has key 1, then use the hashed version of B to compute the result.
  • For all other keys, send them to a Reducer that performs the join. This Reducer will receive the rows of B from a Mapper as well.

This way, only B ends up being read twice. The skewed keys in A are read and processed only by the Mapper and are never sent to a Reducer. The rest of the keys in A go through a single Map/Reduce pass.

The assumption is that B has only a few rows with the keys that are skewed in A, so those rows can be loaded into memory.
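Putting the pieces together, the improved single-pass flow described above can be sketched as follows (the names, thresholds, and data structures are illustrative only; real Hive/Hadoop execution differs):

```python
from collections import defaultdict

def hybrid_join(A, B, skewed_keys, num_reducers):
    """Single pass over A: rows with a skewed key are joined in the mapper
    against an in-memory table of B's matching rows; everything else is
    shuffled to reducers together with B's remaining rows."""
    # First read of B: build the in-memory table for the skewed keys only.
    hot_b = defaultdict(list)
    for k, v in B:
        if k in skewed_keys:
            hot_b[k].append(v)

    out = []
    reducers = defaultdict(lambda: ([], []))
    # Map over A: skewed keys are finished here, the rest are shuffled.
    for k, v in A:
        if k in skewed_keys:
            out.extend((k, v, rv) for rv in hot_b[k])
        else:
            reducers[hash(k) % num_reducers][0].append((k, v))
    # Second read of B feeds the reducers (skewed keys are not needed there).
    for k, v in B:
        if k not in skewed_keys:
            reducers[hash(k) % num_reducers][1].append((k, v))
    for lrows, rrows in reducers.values():
        out.extend((lk, lv, rv) for lk, lv in lrows
                   for rk, rv in rrows if lk == rk)
    return out
```

A is read once, B twice, and no reducer ever sees a skewed key — the behavior the paragraphs above describe.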

Hive Enhancements

The skew data will be obtained from list bucketing (https://cwiki.apache.org/confluence/display/Hive/ListBucketing). There are no additions to the Hive grammar.

