Pro Hadoop (10) - The Basics of a MapReduce Job - Creating a Custom Mapper and Reducer

罗伯特北京

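1.1 Creating a Custom Mapper and Reducer

As you have seen, your first Hadoop program, in the MapReduceIntro class, produces sorted output. Because the job's keys are numbers, however, the order is not what you would expect: the keys are sorted as character strings, not numerically. This section first shows how to get a numeric sort by using a custom mapper, and then how to use a custom reducer to produce output in a format that is easy to parse.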
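The difference between the two orderings can be seen without Hadoop. The following is a minimal sketch in plain Java; the class KeySortDemo and its method names are hypothetical and not part of the book's sample code. String keys stand in for Text keys, and Long.parseLong stands in for the conversion to LongWritable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical demo class; not part of the book's sample code. */
final class KeySortDemo {
    /** Sorts keys as character strings, which is how textual keys are ordered. */
    static List<String> sortAsText(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    /** Sorts the same keys by their numeric value. */
    static List<String> sortAsLong(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(Comparator.comparingLong(Long::parseLong));
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("10", "9", "100", "2");
        System.out.println(sortAsText(keys)); // [10, 100, 2, 9]
        System.out.println(sortAsLong(keys)); // [2, 9, 10, 100]
    }
}
```

With string ordering, "100" sorts before "2" because the comparison proceeds character by character; this is the behavior the custom mapper is meant to avoid.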

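1.1.1 Setting Up a Custom Mapper

Sorting numerically does not sound difficult. As a first attempt, change the output key class to LongWritable, another type provided by the framework.

Replace

conf.setOutputKeyClass(Text.class);

with

conf.setOutputKeyClass(LongWritable.class);

This change is in the file MapReduceIntroLongWritable.java. Run the class with the following command: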

hadoop jar DOWNLOAD_PATH/ch2.jar com.apress.hadoopbook.examples.ch2.MapReduceIntroLongWritable

You will see output like the following:

mapred.LocalJobRunner: job_local_0001

java.io.IOException: Type mismatch in key from map: expected

org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text

at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)

at org.apache.hadoop.mapred.lib.IdentityMapper.map(IdentityMapper.java:37)

at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)

at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)

ch2.MapReduceIntroLongWritable: The job has failed due to an IO Exception

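As you can see, changing the output key class alone is not enough. If you change the output key class to LongWritable, you must also change the map function so that it emits LongWritable keys.

To make the job produce numerically sorted output, you must change the job configuration and supply a custom mapper class. This takes two calls on the JobConf object:

· conf.setOutputKeyClass(LongWritable.class): tells the framework the key type of both the map output and the reduce output.

· conf.setMapperClass(TransformKeysToLongMapper.class): tells the framework which custom class provides the map method. This class takes Text input keys and emits LongWritable output keys.

The sample class MapReduceIntroLongWritableCorrect.java provides this configuration. Apart from these two method calls, it is identical to MapReduceIntro.

Note: The job configuration also offers ways to customize the sort. One option is to provide your own implementation of the WritableComparable interface for the key. Another is to specify a custom comparator in the configuration, through the setOutputKeyComparatorClass() method of the JobConf object. Chapter 9 implements a sample custom comparator.

You also need to supply a mapper class that performs the transformation. The TransformKeysToLongMapper class does this as provided, with no modification needed. It differs from the IdentityMapper class (shown earlier in Listing 2-2) in several ways.

First, the class declaration is no longer generic; it names concrete types.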

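/** Transform the input Text, Text key value
* pairs into LongWritable, Text key/value pairs.
*/
public class TransformKeysToLongMapper
extends MapReduceBase implements Mapper<Text, Text, LongWritable, Text>

Note that this code spells out the actual types of the input and output key/value pairs, whereas IdentityMapper is a generic class. The IdentityMapper declaration reads implements Mapper<K, V, K, V>; in TransformKeysToLongMapper, the declaration is implements Mapper<Text, Text, LongWritable, Text>.

The map method of TransformKeysToLongMapper is very different from that of IdentityMapper, and it makes use of the reporter object.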

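1.1.1.1 The Reporter Object

Both the map and reduce methods take four parameters: the key, the value (for reduce, an iterator over the values), the output collector, and the reporter. The reporter gives the task a way to inform the framework of its current status.

Three methods of the reporter object are of interest here:

· incrCounter(): increments a counter. The counters are reported back to the framework when the job completes.

· setStatus(): sets status information for the map or reduce task.

· getInputSplit(): returns information about the task's input source. When the input is a simple file, this is useful information to log.

Each call on the reporter object or the output collector results in a communication with the framework. This tells the framework that the task is still responsive and not deadlocked. If your map or reduce method takes a long time, it must call the reporter periodically to let the framework know it is still working. By default, the framework kills a task from which it has received no communication for 600 seconds.

Listing 2-6 shows the code of the TransformKeysToLongMapper class, which uses the reporter object.

Listing 2-6. The Reporter Object in TransformKeysToLongMapper.java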

/** Map input to the output, transforming the input {@link Text}
* keys into {@link LongWritable} keys.
* The values are passed through unchanged.
*
* Report on the status of the job.
* @param key The input key, supplied by the framework, a {@link Text} value.
* @param value The input value, supplied by the framework, a {@link Text} value.
* @param output The {@link OutputCollector} that takes
* {@link LongWritable}, {@link Text} pairs.
* @param reporter The object that provides a way
* to report status back to the framework.
* @exception IOException if there is any error.
*/
public void map(Text key, Text value,
        OutputCollector<LongWritable, Text> output, Reporter reporter)
        throws IOException {
    try {
        try {
            reporter.incrCounter( "Input", "total records", 1 );
            LongWritable newKey =
                new LongWritable( Long.parseLong( key.toString() ) );
            reporter.incrCounter( "Input", "parsed records", 1 );
            output.collect(newKey, value);
        } catch( NumberFormatException e ) {
            /** This is a somewhat expected case and we handle it specially. */
            logger.warn( "Unable to parse key as a long for key,"
                + " value " + key + " " + value, e );
            reporter.incrCounter( "Input", "number format", 1 );
            return;
        }
    } catch( Throwable e ) {
        /** It is very important to report back if there were
         * exceptions in the mapper.
         * In particular it is very handy to report the number of exceptions.
         * If this is done, the driver can make better assumptions
         * on the success or failure of the job.
         */
        logger.error( "Unexpected exception in mapper for key,"
            + " value " + key + ", " + value, e );
        reporter.incrCounter( "Input", "Exception", 1 );
        reporter.incrCounter( "Exceptions", e.getClass().getName(), 1 );
        if (e instanceof IOException) {
            throw (IOException) e;
        }
        if (e instanceof RuntimeException) {
            throw (RuntimeException) e;
        }
        throw new IOException( "Unknown Exception", e );
    }
}

This code introduces a new object, reporter, as well as some best practices. The key step is the transformation of the Text key into a LongWritable key:

LongWritable newKey = new LongWritable(Long.parseLong(key.toString()));
output.collect(newKey, value);

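The code in Listing 2-6 is sufficient to perform the transformation; it also includes additional code for tracking and reporting.

Code Efficiency

Creating a new key object for every transformed key in the mapper is not the most efficient approach. Most key classes provide a set() method that sets the current value of the key. The output.collect() method uses the current value of the key, and once collect() has returned, the key and value objects are free to be reused.

If you configure a job to use a multithreaded map runner, via conf.setMapRunnerClass(MultithreadedMapRunner.class), the map method will be called by multiple threads. In that case you must be very careful with member variables in your mapper class. A ThreadLocal LongWritable object can be used to ensure thread safety. To keep this example simple, a new LongWritable is constructed for each record. In the reduce method there is no thread-safety issue.

Object churn is a significant performance issue in a map method and, to a lesser extent, in a reduce method. Reusing objects can improve performance considerably.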
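The reuse pattern described above can be sketched in plain Java. This is only an illustration under stated assumptions: MutableLongKey is a hypothetical stand-in for a mutable key class such as LongWritable, and ReusableKeyDemo is not part of the book's sample code. A ThreadLocal gives each thread its own reusable key, so concurrent calls never share one.

```java
/** Hypothetical stand-in for a mutable key class such as LongWritable. */
final class MutableLongKey {
    private long value;
    void set(long value) { this.value = value; }
    long get() { return value; }
}

/** Hypothetical demo class; not part of the book's sample code. */
final class ReusableKeyDemo {
    /** One reusable key per thread, so concurrent callers never share it. */
    private static final ThreadLocal<MutableLongKey> KEY =
            ThreadLocal.withInitial(MutableLongKey::new);

    /** Parses the text into this thread's reusable key instead of allocating. */
    static MutableLongKey toKey(String text) {
        MutableLongKey key = KEY.get();
        key.set(Long.parseLong(text));
        return key;
    }

    public static void main(String[] args) throws InterruptedException {
        MutableLongKey first = toKey("7");
        MutableLongKey second = toKey("42");
        System.out.println(first == second); // true: same object, new value
        System.out.println(second.get());    // 42

        // A different thread gets its own key object.
        MutableLongKey[] other = new MutableLongKey[1];
        Thread worker = new Thread(() -> other[0] = toKey("1"));
        worker.start();
        worker.join();
        System.out.println(other[0] != first); // true
    }
}
```

The reused object is only safe because the consumer (in Hadoop, output.collect()) is finished with the current value before the next set() call overwrites it.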

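1.1.1.2 Counters and Exceptions

This example contains two try/catch blocks and several calls to reporter.incrCounter(). It is good practice to wrap the body of your map and reduce methods in a try block, catch Throwable, and report status in the catch block.

The JobTracker, the Hadoop core server process that manages job execution on the cluster, collects the counter values and places the final totals in the job output. It also shows the counter values in near real time in the JobTracker web interface, which by default is at http://jobtracker_host:50030/. Chapter 6 discusses this interface and also covers setting up a multi-machine cluster.

You can now run the job:

hadoop jar ch2.jar com.apress.hadoopbook.examples.ch2.MapReduceIntroLongWritableCorrect

The counters produce log output like the following: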

mapred.JobClient: Job complete: job_local_0001

mapred.JobClient: Counters: 13

mapred.JobClient: File Systems

mapred.JobClient: Local bytes read=78562

mapred.JobClient: Local bytes written=157868

mapred.JobClient: Input

mapred.JobClient: total records=126

mapred.JobClient: parsed records=126

mapred.JobClient: Map-Reduce Framework

mapred.JobClient: Reduce input groups=126

mapred.JobClient: Combine output records=0

mapred.JobClient: Map input records=126

mapred.JobClient: Reduce output records=126

mapred.JobClient: Map output bytes=5670

mapred.JobClient: Map input bytes=5992

mapred.JobClient: Combine input records=0

mapred.JobClient: Map output records=126

mapred.JobClient: Reduce input records=126

The first catch block handles the exception that can be thrown while the key is being converted, and records it with the line reporter.incrCounter( "Input", "number format", 1 );:

} catch( NumberFormatException e ) {
/** This is a somewhat expected case and we handle it specially. */
reporter.incrCounter( "Input", "number format", 1 );
return;
}

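If a key cannot be converted to a long value, the exception is caught here; this is a case you expect. The reporter.incrCounter() call tells the framework to increment by 1 the counter named number format in the Input group. If the counter does not exist yet, it is created.

The sample input contains no records that cause a number format exception. The only counters used are total records and parsed records, which appear in the job output as part of the Input group: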

mapred.JobClient: Input

mapred.JobClient: total records=126

mapred.JobClient: parsed records=126

If one or more keys raise an exception when converted to a long, you will see output like the following:

mapred.JobClient: Input

mapred.JobClient: total records=126

mapred.JobClient: parsed records=125

mapred.JobClient: number format=1

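Note that the number of parsed records plus the number of number format records equals the total number of records. The counters are also accessible through the RunningJob object, which allows a more detailed check of the job's success. The total record count of your job may differ from the one in this example.

1.1.2 After the Job Finishes

Once the job completes, the framework provides a fully populated RunningJob object. Its isSuccessful() method, called as job.isSuccessful(), tells you whether the framework considers the job successful. If the framework could not complete one of the map tasks, or the job was killed manually, the framework reports to the client that the job did not succeed.

This information alone is not sufficient to conclude that the job actually did its work. The map method may have hit an exception for every key, or for most keys. If your map and reduce methods record these cases in job counters, your driver program can judge much more accurately whether the job succeeded.

The sample mapper uses several counters for different situations:

reporter.incrCounter( TransformKeysToLongMapper.INPUT, TransformKeysToLongMapper.TOTAL_RECORDS, 1 ): reports the number of records seen.
reporter.incrCounter( TransformKeysToLongMapper.INPUT, TransformKeysToLongMapper.PARSED_RECORDS, 1 ): reports the number of records parsed successfully.
reporter.incrCounter( TransformKeysToLongMapper.INPUT, TransformKeysToLongMapper.NUMBER_FORMAT, 1 ): reports the number of keys that could not be parsed.
reporter.incrCounter( TransformKeysToLongMapper.INPUT, TransformKeysToLongMapper.EXCEPTION, 1 ): reports the number of unexpected exceptions raised while processing records.
reporter.incrCounter( TransformKeysToLongMapper.EXCEPTIONS, e.getClass().getName(), 1 ): reports the number of exceptions per exception type.

1.1.2.1 Examining the Counters

Once the framework has populated the RunningJob object, it returns control to the driver program. The driver can then examine the counter values as well as the success or failure status reported by the framework.

Using a counter takes the following steps: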

/** Get the job counters. {@see RunningJob.getCounters()}. */
Counters jobCounters = job.getCounters();
/** Look up the "Input" Group of counters. */
Counters.Group inputGroup = jobCounters.getGroup( TransformKeysToLongMapper.INPUT );
/** The map task potentially outputs 4 counters in the input group.
* Get each of them.
*/
long total = inputGroup.getCounter( TransformKeysToLongMapper.TOTAL_RECORDS );
long parsed = inputGroup.getCounter( TransformKeysToLongMapper.PARSED_RECORDS );
long format = inputGroup.getCounter( TransformKeysToLongMapper.NUMBER_FORMAT );
long exceptions = inputGroup.getCounter( TransformKeysToLongMapper.EXCEPTION );

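Now that the driver has access to the counters set by the map method, it can determine the success or failure of the job much more precisely.

Warning: It is very important to determine accurately whether a job succeeded. On one of my production clusters, a TaskTracker node was misconfigured. Because of the misconfiguration, the compute-intensive work of the map task could not run there, and the map method returned immediately after hitting an exception. As far as the framework could tell, this machine was extremely fast, so it scheduled nearly all of the map tasks on it. From the framework's point of view the job succeeded; from the business point of view it failed completely. The failure went unnoticed because nothing checked the exception counters, and it was discovered only when the consumer of the results found that it had no valid input records. To avoid this kind of embarrassment, collect success and failure information in your mappers and reducers, and check it in your driver program.

1.1.2.2 Checking Whether the Job Succeeded

An important step in checking for success is to make sure that the number of records output matches the number of records input. Hadoop jobs usually process large volumes of bulk data, and there is no guarantee that 100% of it is valid, so a small fraction of invalid records is generally acceptable.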

if (format != 0) {
logger.warn( "There were " + format + " keys that were not "
+ "transformable to long values");
}
/** Check to see if we had any unexpected exceptions.
* This usually indicates some significant problem,
* either with the machine running the task that had
* the exception, or the map or reduce function code.
* Log an error for each type of exception with the count.
*/
if (exceptions > 0 ) {
Counters.Group exceptionGroup = jobCounters.getGroup(
TransformKeysToLongMapper.EXCEPTIONS );
for (Counters.Counter counter : exceptionGroup) {
logger.error( "There were " + counter.getCounter()
+ " exceptions of type " + counter.getDisplayName() );
}
}
if (total == parsed) {
logger.info("The job completed successfully.");
System.exit(0);
}
// We had some failures in handling the input records.
// Did enough records process for this to be a successful job?
// is 90% good enough?
if (total * .9 <= parsed) {
logger.warn( "The job completed with some errors, "
+ (total - parsed) + " out of " + total );
System.exit( 0 );
}
logger.error( "The job did not complete successfully,"
+" too many errors processing the input, only "
+ parsed + " of " + total + " records completed" );
System.exit( 1 );

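In this particular case, a few NumberFormatExceptions are normal, but there should be no other exceptions. If the number of parsed records is nearly equal to the total number of input records and you see no unexpected exceptions, the job succeeded.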
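The success rule stated above can be isolated into a small, testable function. The following is a sketch under stated assumptions, not the book's driver code: the class JobOutcome, its Status enum, and the minParsedFraction parameter are hypothetical names, and treating any unexpected exception as a failure is a stricter policy than the driver code shown earlier, which only logs such exceptions.

```java
/** Hypothetical helper; not part of the book's sample code. */
final class JobOutcome {
    enum Status { SUCCESS, SUCCESS_WITH_ERRORS, FAILURE }

    /**
     * Classifies a job from its counter values.
     *
     * @param total             records seen by the mappers
     * @param parsed            records whose key parsed successfully
     * @param exceptions        unexpected exceptions (assumed here to be fatal)
     * @param minParsedFraction fraction of parsed records still acceptable, e.g. 0.9
     */
    static Status evaluate(long total, long parsed, long exceptions,
                           double minParsedFraction) {
        if (exceptions > 0) {
            return Status.FAILURE;
        }
        if (total == parsed) {
            return Status.SUCCESS;
        }
        if (parsed >= total * minParsedFraction) {
            return Status.SUCCESS_WITH_ERRORS;
        }
        return Status.FAILURE;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(126, 126, 0, 0.9)); // SUCCESS
        System.out.println(evaluate(126, 125, 0, 0.9)); // SUCCESS_WITH_ERRORS
        System.out.println(evaluate(126, 100, 0, 0.9)); // FAILURE
    }
}
```

Keeping the decision in one pure function makes the acceptance threshold explicit and lets it be tested without running a job.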

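1.1.3 Creating a Custom Reducer

The reduce method is called once for each key and receives the key together with an iterator over the map output values for that key. The reduce task is the ideal place to aggregate data and to remove duplicates.

Note: When duplicates are to be removed against an existing dataset, it makes sense to keep the existing dataset in HBase (the Hadoop database) or in a sorted format such as a Hadoop map file. Otherwise you have to merge and sort the existing dataset together with the incoming records, which wastes a great deal of time when both datasets are large. With HBase, if the input data is already sorted, it is easy to decide whether a record is a duplicate. If the existing records are simply sorted, a map task can also perform the merge. Chapter 10 covers HBase, and Chapters 8 and 9 cover performing merges in map tasks.

The sample custom reducer in this section merges the values for a key into a single line of comma-separated values (CSV). There is one output line per key, and all of the values can easily be parsed from this simple format.

The previous sections showed how a custom mapper works; creating a custom reducer is similar. The sample program is MapReduceIntroLongWritableReduce.java, which is based on MapReduceIntroLongWritableCorrect.java. The key step is to tell the framework which reducer class to use, which takes a single added line: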

/** Inform the framework that the reducer class will be the
* {@link MergeValuesToCSVReducer}.
* This class merges the values it receives for each key
* into a single CSV output record.
* The value ordering is arbitrary.
*/
conf.setReducerClass(MergeValuesToCSVReducer.class);

The output key and value classes stay the same, so no other change to the configuration from MapReduceIntroLongWritableCorrect.java is needed.

The class that actually does the work is MergeValuesToCSVReducer.java. In contrast to the sample mapper TransformKeysToLongMapper, which is declared with concrete types, the reducer is declared as a generic class:

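public class MergeValuesToCSVReducer<K, V>
        extends MapReduceBase implements Reducer<K, V, K, Text> {

The reduce method does not need to know the type of the input values; it only needs their toString() method to work. It does have to construct a new output value, and to keep the conversion simple, the output value is declared as Text.

The method declaration carries the same type parameters: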

/** Merge the values for each key into a CSV text string.
*
* @param key The key object for this group.
* @param values Iterator to the set of values that share the key.
* @param output The {@see OutputCollector} to pass the transformed output to.
* @param reporter The reporter object to update counters and set task status.
* @exception IOException if there is an error.
*/

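public void reduce(K key, Iterator<V> values,
        OutputCollector<K, Text> output, Reporter reporter)
        throws IOException {

If the job is configured to expect an output value type other than Text, the framework throws an error. As in the mapper example, the method body uses the reporter object; the incrCounter() calls give the job detailed information that you can view in the web interface. As a performance optimization that reduces object creation, the class declares two member variables, which are used in the reduce() method: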

/** Used to construct the merged value.
* The {@link Text.set() Text.set} method is used
* to prevent object churn.
*/
protected Text mergedValue = new Text();
/** Working storage for constructing the resulting string. */
protected StringBuilder buffer = new StringBuilder();

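The buffer object builds the CSV line for the output, and mergedValue is the object actually passed to the output in each call to reduce(). Declaring these as member variables rather than local variables is safe because the framework runs each reduce task in a single thread.

Note: Many reduce tasks may run at the same time, but each task runs in its own JVM, and these JVMs may be on different physical machines.

The framework passes the reduce method a key and the values for that key. As you may recall, a reduce task ideally does not change the key; it uses the key it was given as the key argument of the output.collect() call in reduce(). This reduce() method is designed to output exactly one line per key, consisting of all the values for that key as comma-separated data. The core of the method resets the StringBuilder object to avoid creating new objects, and loops over each of the values for the key:

buffer.setLength(0);
for (;values.hasNext(); valueCount++) {
    reporter.incrCounter( OUTPUT, MergeValuesToCSVReducer.TOTAL_VALUES, 1 );
    String value = values.next().toString();
    if (value.contains("\"")) { // Perform Excel style quoting
        // Double each embedded quote and keep the result;
        // Strings are immutable, so the return value must be assigned.
        value = value.replace( "\"", "\"\"" );
    }
    buffer.append( '"' );
    buffer.append( value );
    buffer.append( "\"," );
}
// Drop the trailing comma.
buffer.setLength( buffer.length() - 1 );

It is rare for a reduce() method not to contain a loop that iterates over all the values for a key. Keeping a count of the records processed and output is a good habit; in this example the line reporter.incrCounter( OUTPUT, MergeValuesToCSVReducer.TOTAL_VALUES, 1 ) does the reporting.

This reducer relies on the toString() method of the value objects, which is reasonable for a text output job because the framework also uses toString() to produce the output. The rest of the preceding code block builds a comma-separated line that is compatible with Excel-style CSV files.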
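The merge-and-quote logic can be exercised outside Hadoop. The following is a minimal sketch assuming Excel-style quoting means wrapping every value in double quotes and doubling any embedded quote; CsvMergeDemo and its merge method are hypothetical names, and a plain java.util.Iterator stands in for the iterator the framework passes to reduce().

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical demo class; not part of the book's sample code. */
final class CsvMergeDemo {
    /** Reused across calls to avoid allocating a new builder per key. */
    private final StringBuilder buffer = new StringBuilder();

    /** Joins the values into one CSV line, quoting each value. */
    String merge(Iterator<?> values) {
        buffer.setLength(0); // reset instead of reallocating
        while (values.hasNext()) {
            String value = values.next().toString();
            if (buffer.length() > 0) {
                buffer.append(',');
            }
            // Wrap in quotes and double any embedded quote characters.
            buffer.append('"').append(value.replace("\"", "\"\"")).append('"');
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        CsvMergeDemo demo = new CsvMergeDemo();
        String line = demo.merge(Arrays.asList("a", "say \"hi\"", 3).iterator());
        System.out.println(line); // "a","say ""hi""","3"
    }
}
```

Writing the separator before every value except the first avoids the need to trim a trailing comma, and it also handles an empty iterator without error.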

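The output block constructs the new value for the output. In this case a member variable, mergedValue, is used. In a larger job, where tens of thousands of keys may be passed to reduce(), using an instance variable greatly reduces the number of objects created. The example also counts the output records: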

mergedValue.set(buffer.toString());
reporter.incrCounter( OUTPUT, TOTAL_OUTPUT_RECORDS, 1 );
output.collect( key, mergedValue );

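The statement mergedValue.set(buffer.toString()) sets the value of the mergedValue object, and output.collect( key, mergedValue ) passes it to the framework. This example uses Text as the output value type, but any Writable is allowed as the output value type. If the output format is a SequenceFile, the value objects do not need to implement toString().

Note: During the collect() call, the framework serializes the key/value pair to the output stream. Once the method returns, you are free to change the contents of your objects.

1.1.4 Why Do the Mapper and Reducer Extend MapReduceBase?

The custom mapper TransformKeysToLongMapper and the custom reducer MergeValuesToCSVReducer both extend the base class org.apache.hadoop.mapred.MapReduceBase. This class provides basic implementations of two additional methods that the framework requires of mappers and reducers. The framework calls configure() when it initializes a task, and calls close() when the task has finished processing its input split: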

/** Default implementation that does nothing. */
public void close() throws IOException {
}
/** Default implementation that does nothing. */
public void configure(JobConf job) {
}

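1.1.4.1 The configure Method

The configure() method is the only way for your task to get access to the JobConf object. It is the place to do per-task configuration and initialization. If your application is initialized through the Spring Framework, this is where the application context would be created and the beans wired.

It is very common to keep a JobConf member variable that is initialized in this method from the JobConf object passed in. (I often log the details of the input split here.) The configure() method is also the ideal place to open any additional files that your map() and reduce() methods need to read or write.

1.1.4.2 The close Method

The framework calls close() after all the entries of the input split have been processed by map() or reduce(). Close any additional files here to make sure their buffered contents are flushed to the file system. This matters especially for HDFS: if a file is not closed, its last block of data may be lost.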
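The effect of a missing close can be illustrated with standard Java I/O. This is only an in-memory analogy under stated assumptions, not HDFS behavior: CloseFlushDemo is a hypothetical class, and a ByteArrayOutputStream stands in for the file system so that the bytes that actually arrived can be counted.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class; an in-memory analogy, not HDFS itself. */
final class CloseFlushDemo {
    /** Returns {bytes visible before close(), bytes visible after close()}. */
    static int[] bytesBeforeAndAfterClose(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Writer writer = new BufferedWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        writer.write(text);
        int before = sink.size(); // small writes are still held in the buffer
        writer.close();           // close() flushes the buffer to the sink
        int after = sink.size();
        return new int[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = bytesBeforeAndAfterClose("last record\n");
        System.out.println("before close: " + sizes[0] + " bytes"); // 0
        System.out.println("after close: " + sizes[1] + " bytes");  // 12
    }
}
```

If the writer were abandoned without close(), the sink would never receive the buffered data, which is the same kind of loss the close() method of a mapper or reducer is meant to prevent.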

The following example calls reporter methods from the close() method:

/** Keep track of the maximum number of values a key had.
* Report it in the counters so that per task counters can be examined as needed
* and set the task status to include this maximum count.
*/
@Override
public void close() throws IOException {
super.close();
if (reporter!=null) {
reporter.incrCounter( OUTPUT, MAX_VALUES, maxValueCount );
reporter.setStatus( "Job Complete, maximum ValueCount was "
+ maxValueCount );
}
}

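The reporter field is also an instance variable, declared as protected Reporter reporter, and is initialized in the reduce() method by the line this.reporter = reporter. In reduce(), the number of values for the key is kept in valueCount; if it is greater than the instance variable maxValueCount, maxValueCount is set to valueCount.

In this example the job-wide total is not very useful, because it is the sum of the per-task maximums, but the per-task values are of interest and are available through the web interface. A more useful solution would be to maintain an additional output file and write the value counts to it.

When you select a completed or running job in the web interface (by default on port 50030 of the machine running the JobTracker), the page shows overall information about the job along with links to the details of the map and reduce tasks. Each map and reduce task has a link to its counters.

1.1.5 Using a Custom Partitioner

By default, the framework uses the HashPartitioner class to divide your output into partitions according to the hash value of the key. There are many cases in which you need the output divided differently. The standard examples are producing a single output file instead of several, which you get by reducing the number of reduce tasks to 1 with conf.setNumReduceTasks(1), and producing unsorted, unreduced output, which you get with conf.setNumReduceTasks(0). If you need a different partitioning, more options are available.

The examples in this chapter have long keys. Some simple partitioning schemes are to partition by odd and even keys or, if you know the minimum and maximum key values, by key range. Partitioning on the value is also possible.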
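The odd/even and key-range schemes can be sketched as plain functions from a key to a partition number. This is an illustration under stated assumptions: LongKeyPartitioning and its methods are hypothetical, primitive long keys stand in for LongWritable, and wiring the logic into a real Partitioner implementation is left out.

```java
/** Hypothetical helper; not part of the book's sample code. */
final class LongKeyPartitioning {
    /** Even keys go to partition 0, odd keys to partition 1. */
    static int byParity(long key, int numPartitions) {
        if (numPartitions < 2) {
            return 0; // with a single partition everything goes to it
        }
        return (int) (key & 1L); // the low bit is 1 for odd keys, also for negatives
    }

    /** Splits [min, max] into numPartitions contiguous ranges of near-equal width. */
    static int byRange(long key, long min, long max, int numPartitions) {
        if (key <= min) {
            return 0;
        }
        if (key >= max) {
            return numPartitions - 1;
        }
        // Work in double to avoid overflow when the range is very wide.
        double width = (double) max - (double) min + 1.0;
        double fraction = ((double) key - (double) min) / width;
        // Clamp in case rounding pushes the result up to numPartitions.
        return Math.min(numPartitions - 1, (int) (fraction * numPartitions));
    }

    public static void main(String[] args) {
        System.out.println(byParity(10, 2));       // 0
        System.out.println(byParity(7, 2));        // 1
        System.out.println(byRange(24, 0, 99, 4)); // 0
        System.out.println(byRange(25, 0, 99, 4)); // 1
        System.out.println(byRange(99, 0, 99, 4)); // 3
    }
}
```

With range partitioning and one reduce task per partition, every key in partition N is smaller than every key in partition N+1, so the sorted output files can simply be concatenated in order.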

How Partitioning Is Done

When the framework performs the shuffle, it examines each key output by the mapper and performs the following operation:

int partition = partitioner.getPartition(key, value, partitions);

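The value of partitions is the number of reduce tasks to be run. If the output is ultimately written by the reducers, the key ends up in the output file for that partition; the partition number in the file name is padded with leading zeros so that all file names have the same length.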
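The default hash partitioning and the zero-padded file naming can be sketched as follows. The names here are hypothetical, and two points are assumptions rather than statements from the text: the formula clears the sign bit of the hash code and takes the remainder, which is how HashPartitioner is understood to work, and the file name pattern "part-" followed by a five-digit partition number is the convention assumed for reducer output. A java.lang.Long stands in for a LongWritable key, whose hash code may differ.

```java
import java.util.Locale;

/** Hypothetical demo class; not part of the book's sample code. */
final class HashPartitionDemo {
    /** Clears the sign bit so the remainder is never negative. */
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Assumed naming convention: "part-" plus a five-digit partition number. */
    static String outputFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    public static void main(String[] args) {
        int partition = partitionFor(Long.valueOf(42L), 4);
        System.out.println(partition);                 // 2
        System.out.println(outputFileName(partition)); // part-00002
    }
}
```

Masking with Integer.MAX_VALUE matters because hashCode() may be negative, and the remainder of a negative number in Java is negative, which would not be a valid partition index.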

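The critical points are that the number of partitions is fixed when the job starts, and that the partition of a record is determined at the time of the map task's output.collect() call. The only information the partitioner has is the key, the value, the number of partitions, and whatever data was made available to it when it was initialized.

The partitioner interface is very simple, as Listing 2-7 shows: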

/**
* Partitions the key space.
*
* Partitioner controls the partitioning of the keys of the
* intermediate map-outputs. The key (or a subset of the key) is used to derive
* the partition, typically by a hash function. The total number of partitions
* is the same as the number of reduce tasks for the job. Hence this controls
* which of the m reduce tasks the intermediate key (and hence the
* record) is sent for reduction.
*
* @see Reducer
*/
public interface Partitioner<K2, V2> extends JobConfigurable {

    /**
     * Get the partition number for a given key (hence record) given the total
     * number of partitions i.e. number of reduce-tasks for the job.
     *
     * Typically a hash function on all or a subset of the key.
     *
     * @param key the key to be partitioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the key.
     */
    int getPartition(K2 key, V2 value, int numPartitions);
}