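The output of a map task is placed into partitions, and each partition has its own number. That number is produced by a hash algorithm: the key k passed to context.write(k, v) is hashed, and because equal keys always hash to the same value, all records with the same key end up in the same partition. The data in each partition is then sent to a reduce task for processing.
The MapReduce framework ships with a default partitioner called HashPartitioner. Its source code is as follows: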
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapred.lib;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.JobConf;
/**
* Partition keys by their {@link Object#hashCode()}.
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

  public void configure(JobConf job) {}

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
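To make the formula in getPartition concrete, here is a minimal plain-Java sketch of the same arithmetic. The class name HashPartitionDemo and the sample keys are made up for illustration and are not part of Hadoop; the sketch only shows that equal keys always get the same partition number and that the mask with Integer.MAX_VALUE keeps the result non-negative even when hashCode() is negative.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: reproduces the arithmetic of HashPartitioner.getPartition
// in plain Java, with no Hadoop dependency. The class name is hypothetical.
final class HashPartitionDemo {

    // Same expression as the default partitioner: clear the sign bit,
    // then take the remainder by the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Sample keys (made up); the first and third are equal on purpose.
        List<String> keys = Arrays.asList("13512345678", "18112345678", "13512345678");
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 3));
        }

        // Integer.valueOf(-7).hashCode() is -7. Without the mask, -7 % 3 would be
        // negative in Java; with the mask the result stays in the range [0, 3).
        Integer negativeHash = Integer.valueOf(-7);
        System.out.println("unmasked: " + (negativeHash.hashCode() % 3));
        System.out.println("masked:   " + partitionFor(negativeHash, 3));
    }
}
```

Because the partition number depends only on the hash code, the default partitioner balances keys across reduce tasks but gives no control over which key goes where.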
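The default partitioner distributes keys purely by hash code, so it cannot group records by a rule of your own, such as the prefix of a phone number. For that you need to define your own partitioner, which inspects the key and returns the corresponding partition number. The code is as follows: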
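import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// ProvinceBean is the value type defined elsewhere in this project.
public class ProvincePartition extends Partitioner<Text, ProvinceBean> {
    // Load the data dictionary: phone-number prefix -> partition number
    public static Map<String, Integer> provinceDict = new HashMap<String, Integer>();
    static {
        provinceDict.put("135", 0);
        provinceDict.put("181", 1);
        provinceDict.put("177", 2);
        provinceDict.put("170", 3);
    }

    @Override
    public int getPartition(Text key, ProvinceBean flowBean, int numPartitions) {
        // The first three characters of the key are the phone-number prefix
        String prefix = key.toString().substring(0, 3);
        Integer id = provinceDict.get(prefix);
        // Debug output
        System.err.println(prefix);
        System.err.println(id);
        // Unknown prefixes all go to partition 4
        return id == null ? 4 : id;
    }
}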
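Code explanation:
1. First convert the Text key to a String by calling toString().
2. Take the first three digits of the phone number and look them up with get() to obtain the value mapped to that prefix. This mapping could also be loaded from a database.
3. Check whether the lookup returned a value: if it did, return the value from the map; if not, put the record into a separate catch-all partition.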
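The three steps above can be tried out without a Hadoop cluster. The following is a minimal plain-Java sketch of the same lookup; the class name PrefixPartitionSketch, the constant FALLBACK_PARTITION, and the sample phone numbers are hypothetical. The guard for keys shorter than three characters is an added assumption: the partitioner shown earlier calls substring(0, 3) directly and would throw on such a key.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the prefix lookup from the custom partitioner,
// written against String so it runs without Hadoop. Names are hypothetical.
final class PrefixPartitionSketch {

    // Partition used for any prefix that is not in the dictionary.
    static final int FALLBACK_PARTITION = 4;

    private static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();
    static {
        PREFIX_TO_PARTITION.put("135", 0);
        PREFIX_TO_PARTITION.put("181", 1);
        PREFIX_TO_PARTITION.put("177", 2);
        PREFIX_TO_PARTITION.put("170", 3);
    }

    static int partitionFor(String phoneNumber) {
        // Assumption not in the original code: treat keys that are too short
        // as unknown instead of letting substring() throw.
        if (phoneNumber == null || phoneNumber.length() < 3) {
            return FALLBACK_PARTITION;
        }
        // Step 2: look up the first three digits.
        Integer id = PREFIX_TO_PARTITION.get(phoneNumber.substring(0, 3));
        // Step 3: fall back to the catch-all partition when there is no match.
        return id == null ? FALLBACK_PARTITION : id;
    }

    public static void main(String[] args) {
        String[] samples = {"13512345678", "17012345678", "13912345678", "12"};
        for (String sample : samples) {
            System.out.println(sample + " -> partition " + partitionFor(sample));
        }
    }
}
```

Keeping the lookup in a small static method like this also makes it easy to unit-test the mapping before wiring it into a job.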
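Note:
If the number of reduce tasks > the number of partitions that getPartition can return, a few extra empty output files part-r-000xx are produced;
If 1 < the number of reduce tasks < the number of partitions that getPartition can return, some partitions have no reduce task to receive them and the job fails with an exception (typically an IOException reporting an illegal partition);
If the number of reduce tasks = 1, then no matter how many partitions the map tasks produce, everything is handed to that single reduce task, so only one result file, part-r-00000, is produced;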
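The Driver must be told which partitioner class to use:
// Set the custom partitioner class
job.setPartitionerClass(ProvincePartition.class);
// Also set the number of reduce tasks so that it matches the number of partitions
job.setNumReduceTasks(5);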
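The value 5 follows from the partitioner: it can return the indices 0 to 3 from the dictionary plus the catch-all index 4. As a rough sketch of that reasoning, the plain-Java helper below derives the smallest reduce task count that covers every index and checks whether a given index fits. The class and method names are hypothetical, and the range check only imitates the rule described in the note above; it is not Hadoop's own code.

```java
import java.util.Arrays;
import java.util.Collection;

// Illustrative only: derives a reduce task count from the partition indices
// a partitioner can return. All names here are hypothetical.
final class ReduceTaskCountSketch {

    // Partition indices start at 0, so the smallest task count that
    // covers all of them is the highest index plus one.
    static int requiredReduceTasks(Collection<Integer> mappedPartitions, int fallbackPartition) {
        int highest = fallbackPartition;
        for (int partition : mappedPartitions) {
            highest = Math.max(highest, partition);
        }
        return highest + 1;
    }

    // A partition index is usable only if it lies in [0, numReduceTasks).
    static boolean fits(int partition, int numReduceTasks) {
        return partition >= 0 && partition < numReduceTasks;
    }

    public static void main(String[] args) {
        // Indices 0..3 come from the dictionary, 4 is the catch-all partition.
        int required = requiredReduceTasks(Arrays.asList(0, 1, 2, 3), 4);
        System.out.println("required reduce tasks: " + required);
        System.out.println("partition 4 fits with 5 tasks: " + fits(4, 5));
        System.out.println("partition 4 fits with 3 tasks: " + fits(4, 3));
    }
}
```

If a new prefix is added to the dictionary with a higher index, the argument to setNumReduceTasks has to grow with it.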