MapReduce: Writing a Custom Partitioner

Latest recommended article published on 2021-06-28 19:36:20

奔跑的max蜗牛

Tags: MR

Article link: https://blog.csdn.net/qq_34896163/article/details/84578445

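The output of every map task is split into partitions, and each partition has its own number. That number is computed by hashing the key k passed to context.write(k, v). Equal keys always hash to the same value, so all records with the same key end up in the same partition. Each partition is then sent to one reduce task for processing.
The MapReduce framework ships with a default partitioner called HashPartitioner. Its source code is shown below (this is the old-API version in org.apache.hadoop.mapred.lib; the new-API class org.apache.hadoop.mapreduce.lib.partition.HashPartitioner uses the same formula):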

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapred.lib;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.JobConf;

/** 
 * Partition keys by their {@link Object#hashCode()}. 
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

  public void configure(JobConf job) {}

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

} 
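To see what the formula in getPartition does, here is a minimal standalone sketch that applies it to plain Java objects. It makes some assumptions: the class name HashPartitionDemo and the sample keys are made up for illustration, and String.hashCode() stands in for the hash of a Hadoop key type such as Text, so the partition numbers it prints will not necessarily match what a real job computes.

```java
// Standalone sketch, no Hadoop dependency. HashPartitionDemo is a hypothetical name.
class HashPartitionDemo {

    // Same formula as HashPartitioner.getPartition: clear the sign bit,
    // then take the remainder by the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;

        // Made-up sample keys; the repeated key always maps to the same partition.
        String[] keys = {"13512345678", "18112345678", "13512345678"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }

        // Why the mask matters: Integer.valueOf(-7).hashCode() is -7,
        // and in Java -7 % 3 is -1, which is not a valid partition number.
        Integer negativeKey = -7;
        System.out.println("unmasked: " + (negativeKey.hashCode() % numReduceTasks));
        System.out.println("masked:   " + partitionFor(negativeKey, numReduceTasks));
    }
}
```

The mask with Integer.MAX_VALUE only guarantees a non-negative result in the range 0 to numReduceTasks - 1. It does nothing to group keys by a business rule, which is the reason for writing a custom partitioner.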
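The default partitioner looks only at the hash code of the key, so it cannot group records by a business rule such as the first three digits of a phone number. For that you need to define your own partitioner, which inspects the key and returns the matching partition number. The code is as follows: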
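import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartition extends Partitioner<Text, ProvinceBean> {

    // Data dictionary: phone-number prefix -> partition number
    private static final Map<String, Integer> provinceDict = new HashMap<String, Integer>();
    static {
        provinceDict.put("135", 0);
        provinceDict.put("181", 1);
        provinceDict.put("177", 2);
        provinceDict.put("170", 3);
    }

    @Override
    public int getPartition(Text key, ProvinceBean flowBean, int numPartitions) {
        String phone = key.toString();
        // Guard against keys shorter than three characters, which would make substring throw
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        Integer id = provinceDict.get(prefix);
        // Unknown prefixes all go to the catch-all partition 4
        return id == null ? 4 : id;
    }
}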

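Code walkthrough:
1. First convert the Text key to a String by calling toString().
2. Take the first three digits of the phone number and look them up in the map with get() to obtain the partition number for that prefix. The dictionary could also be loaded from a database.
3. Check whether a value was found. If so, return the value from the map; if not, put the record in a separate catch-all partition (partition 4 in the code above).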
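The three steps above can be tried out without a Hadoop cluster. The following is a minimal sketch under stated assumptions: PrefixPartitionDemo and DEFAULT_PARTITION are hypothetical names, a plain String replaces the Text key, and the phone numbers are made up. Only the prefix-to-partition mapping is taken from the article.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of the lookup logic. PrefixPartitionDemo is a hypothetical name.
class PrefixPartitionDemo {

    // Same prefix -> partition mapping as in the article
    static final Map<String, Integer> PROVINCE_DICT = new HashMap<>();
    static {
        PROVINCE_DICT.put("135", 0);
        PROVINCE_DICT.put("181", 1);
        PROVINCE_DICT.put("177", 2);
        PROVINCE_DICT.put("170", 3);
    }

    // Catch-all partition for prefixes that are not in the dictionary
    static final int DEFAULT_PARTITION = 4;

    static int partitionFor(String phone) {
        // Steps 1 and 2: take the first three characters (or the whole key if it is shorter)
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        Integer id = PROVINCE_DICT.get(prefix);
        // Step 3: fall back to the catch-all partition when the lookup fails
        return id == null ? DEFAULT_PARTITION : id;
    }

    public static void main(String[] args) {
        // Made-up phone numbers, one per known prefix plus one unknown prefix
        String[] phones = {"13512345678", "18100000000", "17799990000", "17012341234", "13900001111"};
        for (String phone : phones) {
            System.out.println(phone + " -> partition " + partitionFor(phone));
        }
    }
}
```

Keeping the lookup in a small static method like this makes the rule easy to unit-test before it is wired into a real Partitioner subclass.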

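Notes:
If the number of reduce tasks is greater than the number of distinct values returned by getPartition, the extra reduce tasks receive no data and produce empty output files part-r-000xx.
If the number of reduce tasks is greater than 1 but smaller than the number of distinct values returned by getPartition, some records are assigned a partition that has no reduce task to receive it, and the job fails with an exception (Hadoop reports an illegal partition).
If the number of reduce tasks is 1, everything is handed to that single reduce task no matter how the map side would have partitioned its output, so only one result file, part-r-00000, is produced.
The driver must register the custom partitioner class: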

// Register the custom partitioner class
job.setPartitionerClass(ProvincePartition.class);
// Also set the number of reduce tasks so that it matches the number of partitions (0 to 4, so 5)
job.setNumReduceTasks(5);
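The three cases in the notes can be illustrated with a small model. This is only a sketch of the behavior described above, not Hadoop's actual code: ReducerCountDemo and outcome are hypothetical names, and the model throws IllegalStateException where a real job would fail with its own exception.

```java
// Simplified model of how the reduce-task count interacts with partition numbers.
// ReducerCountDemo is a hypothetical name; this is not Hadoop source code.
class ReducerCountDemo {

    // Returns the output file a record with the given partition number would land in.
    static String outcome(int partition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            // A single reduce task receives everything
            return "part-r-00000";
        }
        if (partition < 0 || partition >= numReduceTasks) {
            // No reduce task exists for this partition number
            throw new IllegalStateException("Illegal partition " + partition);
        }
        // Partition p is processed by reduce task p, which writes part-r-0000p
        return String.format("part-r-%05d", partition);
    }

    public static void main(String[] args) {
        int catchAllPartition = 4; // highest value returned by ProvincePartition
        for (int numReduceTasks : new int[] {1, 3, 5, 7}) {
            try {
                System.out.println(numReduceTasks + " reduce tasks: "
                        + outcome(catchAllPartition, numReduceTasks));
            } catch (IllegalStateException e) {
                System.out.println(numReduceTasks + " reduce tasks: " + e.getMessage());
            }
        }
    }
}
```

With 7 reduce tasks the record still lands in part-r-00004; the two reduce tasks that never receive data are the ones that would write the empty files.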
