Custom Sorting in Hadoop MapReduce with WritableComparable

Latest recommended article published on 2024-04-30 00:18:16

冥想者-定

This post was originally published on my own blog.

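Continuing with the practice exercises today. Last time I got a rough understanding of partitioning. Following the sequence of partitioning, sorting, grouping, and combining, today's example should be about sorting, so let's get started!

For sorting, a good starting point is the LongWritable type used by the WordCount example in the Hadoop source. It implements the WritableComparable interface, which is defined as follows: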

public interface WritableComparable<T> extends Writable, Comparable<T> {
}

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
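To make the write/readFields contract concrete, here is a minimal sketch that round-trips two long fields through an in-memory byte stream. PairRecord and roundTrip are hypothetical names used only for illustration. The class deliberately does not implement Hadoop's Writable so that it runs with the JDK alone, but its two methods have the same shape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Writable: same two method shapes, no Hadoop dependency.
final class PairRecord {
    long first;
    long second;

    PairRecord() {}

    PairRecord(long first, long second) {
        this.first = first;
        this.second = second;
    }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    // Deserialize in exactly the same order as write().
    void readFields(DataInput in) throws IOException {
        first = in.readLong();
        second = in.readLong();
    }

    // Writes a record to a byte buffer and reads it back into a fresh instance.
    static PairRecord roundTrip(PairRecord source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                source.write(out);
            }
            PairRecord copy = new PairRecord();
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairRecord copy = roundTrip(new PairRecord(3L, 2L));
        System.out.println(copy.first + "\t" + copy.second);
    }
}
```

If readFields read second before first, the round trip would silently swap the two columns, which is why the read order has to mirror the write order.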

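The Writable interface defines write and readFields, which write an object to a data stream and read it back. Comparable contributes compareTo, which defines the ordering. Since Hadoop's built-in types can be compared, how are they actually sorted when one of them is used as the map output key? To answer that, look at the source of the map task class MapTask. It contains an inner class MapOutputBuffer, and below the "spill accounting" comment there is a sorter field: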

private final IndexedSorter sorter;

This field is initialized by:

sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class", QuickSort.class, IndexedSorter.class), job);

So the sorting algorithm can be set in the configuration, and the default is QuickSort. QuickSort has a few important methods:

public void sort(final IndexedSortable s, int p, int r, final Progressable rep);

private static void sortInternal(final IndexedSortable s, int p, int r, final Progressable rep, int depth);

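The IndexedSortable argument passed in is the current MapOutputBuffer instance, because MapOutputBuffer also implements IndexedSortable. As a result, QuickSort's sort uses the compare method of MapOutputBuffer for its comparisons, as the following source shows: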

public int compare(int i, int j) {
    final int ii = kvoffsets[i % kvoffsets.length];
    final int ij = kvoffsets[j % kvoffsets.length];
    // sort by partition
    if (kvindices[ii + PARTITION] != kvindices[ij + PARTITION]) {
        return kvindices[ii + PARTITION] - kvindices[ij + PARTITION];
    }
    // sort by key
    return comparator.compare(kvbuffer,
        kvindices[ii + KEYSTART],
        kvindices[ii + VALSTART] - kvindices[ii + KEYSTART],
        kvbuffer,
        kvindices[ij + KEYSTART],
        kvindices[ij + VALSTART] - kvindices[ij + KEYSTART]);
}
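The two-level ordering in compare (partition first, then key) can be imitated outside Hadoop. The sketch below is a simplified model and not Hadoop's implementation: it assumes plain parallel arrays instead of kvbuffer, kvoffsets and kvindices, and the names PartitionThenKeySort and sortedIndices are made up for illustration. It sorts an array of record indices rather than the records themselves, similar in spirit to the kvoffsets indirection seen in compare.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical, simplified model: parallel arrays instead of Hadoop's byte buffer.
final class PartitionThenKeySort {
    // Returns record indices ordered by partition first, then by key.
    static Integer[] sortedIndices(int[] partitions, long[] keys) {
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Primary criterion: partition number. Secondary criterion: key.
        Comparator<Integer> byPartition = Comparator.comparingInt(i -> partitions[i]);
        Arrays.sort(order, byPartition.thenComparingLong(i -> keys[i]));
        return order;
    }

    public static void main(String[] args) {
        int[] partitions = {1, 0, 1, 0};
        long[] keys = {5L, 9L, 2L, 3L};
        // Partition 0 holds records 3 and 1, partition 1 holds records 2 and 0.
        System.out.println(Arrays.toString(sortedIndices(partitions, keys))); // [3, 1, 2, 0]
    }
}
```

Sorting by partition first keeps all records for one reducer contiguous, and within each partition the records end up in key order.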

The comparator used in this method is taken from the configuration property "mapred.output.key.comparator.class" when that property is set, as the source shows:

public RawComparator getOutputKeyComparator() {
    Class<? extends RawComparator> theClass = getClass("mapred.output.key.comparator.class",
        null, RawComparator.class);
    if (theClass != null)
        return ReflectionUtils.newInstance(theClass, this);
    return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class));
}
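The lookup pattern in getOutputKeyComparator (use the configured class if there is one, otherwise fall back to the key type's own ordering) can be sketched with the JDK alone. Everything here is an assumption for illustration: a Map stands in for the job configuration, the property name example.key.comparator.class is invented, and ComparatorLookup, resolve and DescendingLong are hypothetical names.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of "configured comparator class, else natural order".
final class ComparatorLookup {
    // Invented property name, used only in this sketch.
    static final String PROPERTY = "example.key.comparator.class";

    // Example custom comparator that a configuration could name.
    public static final class DescendingLong implements Comparator<Long> {
        @Override
        public int compare(Long a, Long b) {
            return Long.compare(b, a);
        }
    }

    @SuppressWarnings("unchecked")
    static Comparator<Long> resolve(Map<String, String> conf, String property) {
        String className = conf.get(property);
        if (className == null) {
            // Fallback: the key type's own compareTo decides the order.
            return Comparator.naturalOrder();
        }
        try {
            // Instantiate the configured class by reflection.
            return (Comparator<Long>) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create comparator " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(resolve(conf, PROPERTY).compare(1L, 2L) < 0); // true: natural order
        conf.put(PROPERTY, DescendingLong.class.getName());
        System.out.println(resolve(conf, PROPERTY).compare(1L, 2L) > 0); // true: reversed order
    }
}
```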

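That is how sorting and the comparison method are tied together. When the property is not set, WritableComparator.get(...) supplies a comparator for the map output key class, and for a type without its own registered comparator that ends up calling the key's compareTo. Following the approach of LongWritable, we can now implement a custom type that can be read, written, and compared. Let's write some code to reinforce this. Since this is about sorting, we need some data. Below are two columns, to be sorted by the first column in ascending order and then by the second column in ascending order: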

1    2
1    1
3    0
3    2
2    2
1    2

First, define the custom type SortAPI:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class SortAPI implements WritableComparable<SortAPI> {
    /** First column. */
    public long first;
    /** Second column. */
    public long second;

    /** Hadoop creates key instances by reflection, so a no-argument constructor is required. */
    public SortAPI() {}

    public SortAPI(long first, long second) {
        this.first = first;
        this.second = second;
    }

    /**
     * Defines the sort order: ascending by first, then ascending by second.
     * Swapping the two arguments of a Long.compare call reverses that column.
     */
    @Override
    public int compareTo(SortAPI o) {
        // Long.compare avoids the overflow that casting (this.first - o.first) to int can cause.
        int byFirst = Long.compare(this.first, o.first);
        if (byFirst != 0) {
            return byFirst;
        }
        return Long.compare(this.second, o.second);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Read the fields in the same order as write() wrote them.
        this.first = in.readLong();
        this.second = in.readLong();
    }

    @Override
    public int hashCode() {
        return Long.hashCode(first) + Long.hashCode(second);
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof SortAPI) {
            SortAPI o = (SortAPI) obj;
            // Primitive longs compare by value; == on boxed Long objects would compare references.
            return this.first == o.first && this.second == o.second;
        }
        return false;
    }

    @Override
    public String toString() {
        return "first:" + this.first + "second:" + this.second;
    }
}

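This type overrides compareTo(SortAPI o), write(DataOutput out) and readFields(DataInput in). Because the type takes part in comparisons, hashCode() and equals(Object obj) must be overridden as well, as mentioned before; don't forget this! Also note that write and readFields must handle the fields in the same order: whatever is written first must be read first. Finally, compareTo(SortAPI o) returns an int that is greater than 0, equal to 0, or less than 0, meaning greater than, equal to, or less than. How two rows are judged equal, and how unequal rows are ordered, is worth reading through slowly in the code.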
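One detail of the compareTo contract deserves a small demonstration: only the sign of the result matters, and a comparison computed as a long difference cast to int can report the wrong sign for large values. The sketch below contrasts that with Long.compare. CompareOverflowDemo and its two helper methods are hypothetical names used only here.

```java
// Contrasts a subtraction-based comparison with Long.compare.
final class CompareOverflowDemo {
    // Narrowing the long difference to int can flip the sign.
    static int bySubtraction(long a, long b) {
        return (int) (a - b);
    }

    // Long.compare only reports the sign, so it cannot overflow.
    static int byLongCompare(long a, long b) {
        return Long.compare(a, b);
    }

    public static void main(String[] args) {
        long big = 3_000_000_000L; // larger than Integer.MAX_VALUE
        System.out.println(bySubtraction(big, 0L) > 0); // false: the cast wrapped to a negative int
        System.out.println(byLongCompare(big, 0L) > 0); // true
        System.out.println(bySubtraction(3L, 1L) > 0);  // true: small values are unaffected
    }
}
```

With the small sample values in this post both forms give the same order, so the problem only appears once the columns hold values whose difference does not fit in an int.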

Next come a custom Mapper, a custom Reducer, and the main function:

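import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, SortAPI, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split on any run of whitespace so that both tabs and spaces work as separators.
        String[] fields = value.toString().trim().split("\\s+");
        try {
            long first = Long.parseLong(fields[0]);
            long second = Long.parseLong(fields[1]);
            // The composite key carries both columns; the value is only a placeholder.
            context.write(new SortAPI(first, second), new LongWritable(1));
        } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
            // Skip malformed lines instead of failing the task.
            System.err.println("Skipping malformed line: " + value);
        }
    }
}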

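import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReduce extends Reducer<SortAPI, LongWritable, LongWritable, LongWritable> {

    @Override
    protected void reduce(SortAPI key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Keys arrive already sorted; emit each distinct key once as two columns.
        context.write(new LongWritable(key.first), new LongWritable(key.second));
    }
}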

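import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class Test {
    static final String OUTPUT_DIR = "hdfs://hadoop-master:9000/sort/output/";
    static final String INPUT_DIR = "hdfs://hadoop-master:9000/sort/input/test.txt";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On Hadoop 2 and later, Job.getInstance(conf, name) is the non-deprecated form.
        Job job = new Job(conf, Test.class.getSimpleName());
        deleteOutputFile(OUTPUT_DIR);

        // 1. Set the input path
        FileInputFormat.setInputPaths(job, INPUT_DIR);
        // 2. Set the input format class
        job.setInputFormatClass(TextInputFormat.class);
        // 3. Set the custom Mapper and its output key/value types
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(SortAPI.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 4. Partitioning
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);

        // 5. Sorting and grouping (defaults are used, so SortAPI.compareTo decides the order)
        // 6. Set the custom Reducer and its output key/value types
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(LongWritable.class);
        // 7. Set the output path
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_DIR));
        // 8. Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Deletes the output directory if it exists, because the job fails when it is already present.
    static void deleteOutputFile(String path) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI(INPUT_DIR), conf);
        if (fs.exists(new Path(path))) {
            fs.delete(new Path(path), true); // true = delete recursively
        }
    }
}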

With this in place, the job can be run directly from Eclipse. The result:

1       1
1       2
2       2
3       0
3       2

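The result is correct. What if the first column should be in descending order and the second in ascending order? Only compareTo(SortAPI o) needs to change: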

@Override
public int compareTo(SortAPI o) {
    // Arguments swapped: first column in descending order.
    int byFirst = Long.compare(o.first, this.first);
    if (byFirst != 0) {
        return byFirst;
    }
    // Second column stays in ascending order.
    return Long.compare(this.second, o.second);
}

Save and run again. The result:

3       0
3       2
2       2
1       1
1       2

This is also correct and matches the requirement.
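Both orderings can be checked locally before running a job. The sketch below is only a JDK approximation under stated assumptions: rows are plain long[] arrays rather than SortAPI keys, the two Comparator chains stand in for the two compareTo variants, and a TreeSet is used to drop rows the comparator considers equal, which loosely mimics the reducer being called once per distinct key. LocalSortCheck and sortDistinct are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Local approximation of the job's output order, without Hadoop.
final class LocalSortCheck {
    // The six input rows from the sample data.
    static final long[][] ROWS = {{1, 2}, {1, 1}, {3, 0}, {3, 2}, {2, 2}, {1, 2}};

    // First column ascending, then second column ascending.
    static final Comparator<long[]> BOTH_ASCENDING =
            Comparator.<long[]>comparingLong(r -> r[0]).thenComparingLong(r -> r[1]);

    // First column descending, then second column ascending.
    static final Comparator<long[]> FIRST_DESCENDING =
            Comparator.<long[]>comparingLong(r -> r[0]).reversed().thenComparingLong(r -> r[1]);

    // Sorts the rows and keeps one row per distinct key.
    static List<String> sortDistinct(Comparator<long[]> order) {
        TreeSet<long[]> sorted = new TreeSet<>(order);
        for (long[] row : ROWS) {
            sorted.add(row);
        }
        List<String> lines = new ArrayList<>();
        for (long[] row : sorted) {
            lines.add(row[0] + "\t" + row[1]);
        }
        return lines;
    }

    public static void main(String[] args) {
        sortDistinct(BOTH_ASCENDING).forEach(System.out::println);
        System.out.println("----");
        sortDistinct(FIRST_DESCENDING).forEach(System.out::println);
    }
}
```

The duplicate row "1 2" appears only once in each listing, which is consistent with the five output rows shown for six input rows.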

A small question to think about: when is compareTo(SortAPI o) called, and how many times in total?
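One way to start exploring this question is to instrument compareTo with a counter. The sketch below only shows that technique on a local sort with the JDK. CountingKey and countCallsForSampleSort are hypothetical names, and the number it reports depends on the JDK's sort algorithm. In a real job the count also depends on Hadoop's own sort, merge and grouping steps, so the local number is not the answer to the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical key type that counts how often compareTo runs.
final class CountingKey implements Comparable<CountingKey> {
    static final AtomicLong CALLS = new AtomicLong();

    final long first;
    final long second;

    CountingKey(long first, long second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int compareTo(CountingKey o) {
        CALLS.incrementAndGet(); // record every invocation
        int byFirst = Long.compare(first, o.first);
        return byFirst != 0 ? byFirst : Long.compare(second, o.second);
    }

    // Sorts the six sample rows locally and reports the number of compareTo calls.
    static long countCallsForSampleSort() {
        CALLS.set(0);
        long[][] rows = {{1, 2}, {1, 1}, {3, 0}, {3, 2}, {2, 2}, {1, 2}};
        List<CountingKey> keys = new ArrayList<>();
        for (long[] row : rows) {
            keys.add(new CountingKey(row[0], row[1]));
        }
        Collections.sort(keys);
        return CALLS.get();
    }

    public static void main(String[] args) {
        System.out.println("compareTo calls in a local sort: " + countCallsForSampleSort());
    }
}
```

The same idea carries over to the real key type: a temporary counter or log statement inside SortAPI.compareTo shows in which phases of the job it is invoked.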

That's all for this time. Keep recording the little things!
