Secondary sorting with InverseMapper.class

Original source: http://bbs.chinaunix.net/thread-1650880-1-1.html

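A single parallel job clearly cannot both count word frequencies and sort the results. Instead, we can use Hadoop's ability to chain jobs: the output of the first job (word-frequency counting) becomes the input of the second job (sorting), and the two parallel jobs run one after the other. The main work is to modify the run function from Listing 3 so that it also defines and runs a sorting job.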
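The chaining idea can be sketched in plain Java, without Hadoop: the first stage writes its result to a temporary file, the second stage reads that file as its input, and the temporary file is removed afterwards. This is only an illustration under simplifying assumptions. The class TwoStagePipelineSketch and its methods are hypothetical names, and a local file stands in for the temporary HDFS directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoStagePipelineSketch {

    // Stage 1: count words and write "word<TAB>count" lines to the intermediate file.
    static void countWords(List<String> lines, Path out) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        List<String> records = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            records.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(out, records);
    }

    // Stage 2: read stage 1's output and order the records by count, highest first.
    static List<String> sortByCount(Path in) throws IOException {
        List<String> records = new ArrayList<>(Files.readAllLines(in));
        records.sort((a, b) -> Integer.compare(
                Integer.parseInt(b.split("\t")[1]),
                Integer.parseInt(a.split("\t")[1])));
        return records;
    }

    // Run the two stages back to back; the temporary file is removed even if a stage fails.
    static List<String> run(List<String> lines) {
        Path temp = null;
        try {
            temp = Files.createTempFile("wordcount-temp-", ".txt");
            countWords(lines, temp);
            return sortByCount(temp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (temp != null) {
                try {
                    Files.deleteIfExists(temp);
                } catch (IOException ignored) {
                    // Best-effort cleanup in this sketch.
                }
            }
        }
    }

    public static void main(String[] args) {
        for (String record : run(Arrays.asList("a b a", "c a b"))) {
            System.out.println(record);
        }
    }
}
```

The try/finally around the two stages plays the same role as the cleanup of the temporary directory in the Hadoop version: the intermediate data is only needed while the second stage runs.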

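Sorting is easy to implement in Hadoop. During MapReduce, the intermediate results are sorted by key and partitioned by key into R parts, one for each of the R Reduce functions, and each Reduce function also sorts its share of the intermediate results by key before processing it. As a result, the output of each reducer is already sorted by key; with a single reducer, the whole output is one sorted file. The word-count job emits the word as the key and its frequency as the value. To sort by frequency, we set InverseMapper as the Mapper class of the sorting job (sortJob.setMapperClass(InverseMapper.class);). Its map function simply swaps the input key and value and emits them as the intermediate result. Here the frequency becomes the key and the word becomes the value, so the final output comes out sorted by frequency. We do not need to specify a Reduce class: Hadoop falls back to the default IdentityReducer, which writes the intermediate results out unchanged.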
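To make the effect of the swap concrete, here is a small plain-Java simulation. It is not Hadoop's InverseMapper itself. The class InverseMapSketch and its methods are hypothetical, and a TreeMap stands in for the framework's default ascending sort on keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InverseMapSketch {

    // "Map" step: turn each (word, count) record into (count, word).
    // The TreeMap stands in for the framework's sort, which orders keys ascending by default.
    static TreeMap<Integer, List<String>> invertAndSort(Map<String, Integer> wordCounts) {
        TreeMap<Integer, List<String>> byCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : wordCounts.entrySet()) {
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return byCount;
    }

    // "Identity reduce" step: write every (count, word) pair out unchanged.
    static List<String> emit(TreeMap<Integer, List<String>> byCount) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            for (String word : e.getValue()) {
                out.add(e.getKey() + "\t" + word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordCounts = new TreeMap<>();
        wordCounts.put("hadoop", 3);
        wordCounts.put("map", 1);
        wordCounts.put("reduce", 1);
        for (String record : emit(invertAndSort(wordCounts))) {
            System.out.println(record);
        }
    }
}
```

In this simulation the records come out in ascending order of frequency, lowest count first, which is why a further step is needed to get the most frequent words at the top.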

One problem remains. The key type in the sorting job is IntWritable (sortJob.setOutputKeyClass(IntWritable.class)), and Hadoop sorts IntWritable keys in ascending order by default, whereas we want descending order. We therefore implement an IntWritableDecreasingComparator class and tell the job to use this custom Comparator for the output keys (the frequencies): sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class)
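The comparator works on keys in their serialized form as well as on key objects. The following plain-Java sketch shows the idea on raw bytes, assuming that an int key is stored as 4 big-endian bytes (the layout DataOutput.writeInt produces, which IntWritable uses). The class DescendingIntBytesSketch and its methods are hypothetical and only mimic what a descending raw comparator does.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class DescendingIntBytesSketch {

    // Serialize an int as 4 big-endian bytes.
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Ascending comparison on the serialized form: decode the int stored at each offset, then compare.
    static int compareAscending(byte[] b1, int s1, byte[] b2, int s2) {
        int v1 = ByteBuffer.wrap(b1, s1, 4).getInt();
        int v2 = ByteBuffer.wrap(b2, s2, 4).getInt();
        return Integer.compare(v1, v2);
    }

    // Descending comparison: flip the sign of the ascending result.
    static int compareDescending(byte[] b1, int s1, byte[] b2, int s2) {
        return -compareAscending(b1, s1, b2, s2);
    }

    // Sort counts from highest to lowest by comparing only their serialized bytes.
    static int[] sortDescending(int[] counts) {
        byte[][] keys = new byte[counts.length][];
        for (int i = 0; i < counts.length; i++) {
            keys[i] = toBytes(counts[i]);
        }
        Arrays.sort(keys, (x, y) -> compareDescending(x, 0, y, 0));
        int[] sorted = new int[counts.length];
        for (int i = 0; i < counts.length; i++) {
            sorted[i] = ByteBuffer.wrap(keys[i]).getInt();
        }
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sortDescending(new int[] {2, 7, 1})));
    }
}
```

Negating the result is enough to reverse the order here because the ascending comparison only ever returns a small negative, zero, or positive value, so the sign flip cannot overflow.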

See Listing 5 and its comments for details.

public int run(String[] args) throws Exception {
    // Temporary directory that holds the output of the word-count job.
    Path tempDir = new Path("wordcount-temp-"
            + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

    JobConf conf = new JobConf(getConf(), WordCount.class);
    try {
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputPath(new Path(args[0]));
        // Write the word-count output to the temporary directory first;
        // the sorting job that follows uses it as its input directory.
        conf.setOutputPath(tempDir);
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        JobClient.runJob(conf);

        JobConf sortJob = new JobConf(getConf(), WordCount.class);
        sortJob.setJobName("sort");

        sortJob.setInputPath(tempDir);
        sortJob.setInputFormat(SequenceFileInputFormat.class);

        // InverseMapper ships with the Hadoop library; its map() emits
        // each input pair with the key and value swapped.
        sortJob.setMapperClass(InverseMapper.class);

        // Limit the job to one reducer so the final output is a single file.
        sortJob.setNumReduceTasks(1);

        sortJob.setOutputPath(new Path(args[1]));
        sortJob.setOutputKeyClass(IntWritable.class);
        sortJob.setOutputValueClass(Text.class);

        sortJob.setOutputKeyComparatorClass(IntWritableDecreasingComparator.class);

        JobClient.runJob(sortJob);
    } finally {
        FileSystem.get(conf).delete(tempDir); // delete the temporary directory
    }
    return 0;
}

private static class IntWritableDecreasingComparator extends IntWritable.Comparator {
    // Compares two deserialized keys; negating the parent's result
    // turns the default ascending order into descending order.
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }

    // Compares two keys in serialized form (byte array, start offset, length),
    // so keys can be ordered without being deserialized first.
    // The result is negated here for the same reason.
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}