Article link: https://blog.csdn.net/lljjyy001/article/details/102583759

MapReduce Secondary Sort

Requirement:

Given a data set like this:

22      12
22      13
22      6
22      17
21      5
28      79
28      63
28      100
1       79
23      84
1       63
67      45
18      23
19      74
1       100
21      41
57      21
23      79
12      13
22      12
22      13
.......

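Records with the same key must be grouped together. The output is sorted by key in descending order, and for each key the values are sorted in ascending order. The result looks like this: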

100:1 1 1 28 28 28 
84:23 23 23 
79:1 1 1 23 23 23 28 28 28 
74:19 19 19 
67:23 23 45 45 45 79 
63:1 1 1 28 28 28 
57:21 21 21 22 22 
45:67 67 67 
41:21 21 21 
28:18 18 19 19 63 63 63 67 67 79 79 79 100 100 100 
23:18 18 18 21 21 21 21 41 79 79 79 84 84 84 
22:1 1 6 6 6 12 12 12 13 13 13 17 17 17 23 23 28 28 28 28 
21:1 1 5 5 5 22 22 41 41 41 57 57 57 
19:22 22 74 74 74 
18:12 12 13 23 23 23 
.........

How can this simple requirement be implemented with MapReduce?

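Approach 1

Sort in memory. In the map phase, emit the key and value unchanged. The reduce side fetches the data and merges the values that share a key, so each reduce call receives <key, Iterable<value>>. Inside the reduce method, read all the values into a sortable collection, sort it, and write the result out. This approach is simple and easy to understand, but as the data volume grows it risks running out of memory, so it is not recommended.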
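Approach 1 can be sketched in plain Java without any Hadoop dependency. The class and method names below (`InMemorySortSketch`, `reduce`) are hypothetical; in a real job the same logic would sit inside `Reducer.reduce`. The `buffer` list is the part that can exhaust memory when one key has very many values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of approach 1 (no Hadoop dependency; all names are hypothetical).
class InMemorySortSketch {

    // Mimics the body of one reduce() call: buffer every value of the key, sort, then emit.
    static String reduce(long key, Iterable<Long> values) {
        List<Long> buffer = new ArrayList<>();
        for (Long v : values) {
            buffer.add(v); // the whole value list of this key is held in memory
        }
        Collections.sort(buffer); // ascending order of values
        StringBuilder sb = new StringBuilder().append(key).append(":");
        for (Long v : buffer) {
            sb.append(v).append(" ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Values of key 22, taken from the first lines of the sample input.
        System.out.println(reduce(22L, Arrays.asList(12L, 13L, 6L, 17L)));
    }
}
```

For the four sample values of key 22 this prints `22:6 12 13 17`, followed by a trailing space as in the output format shown earlier.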

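Approach 2

We know that the shuffle phase partitions and sorts the data. We can take advantage of this and let the MapReduce framework do the sorting for us. The steps are: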

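1. Use both the key and the value from the file as the map output key, and the value from the file as the map output value. To do this we create a class that serves as the map output key and holds the file's key and value as fields. To avoid confusion, the file's key becomes the class's first field and the file's value becomes its second field. The class must implement the WritableComparable interface. Its compareTo method compares first, and only if the first fields are equal does it go on to compare second.
2. After step 1 we still need a grouping step, configured with job.setGroupingComparatorClass. It decides which of the sorted map output keys are treated as equal, so that their values are handed to the same reduce call (which reduce task a key is sent to is decided by the partitioner, see step 3). The method takes a Class of type RawComparator. Hadoop already provides the WritableComparator class, which implements RawComparator, so we write a class that extends WritableComparator (call it the grouping comparator class) and override its compare method. The implementation uses a small trick: it compares only the first field of the key built in step 1. All keys with the same first therefore end up in one reduce call, while keys with the same first are still ordered by the compareTo method from step 1. That gives us what we need, namely a secondary sort. This part is a little hard to grasp, so read it alongside the code a few times. A question to think about: what would the result look like without this step? Comment out job.setGroupingComparatorClass and look at the output.
3. Because this is a distributed computation, a global order also requires work on the partitioning (or setting the number of reducers to 1, which is not recommended). For the requirement above I use range partitioning: each partition is assigned a range of keys, derived from the key value and the number of partitions. The ranges are ordered relative to one another, and steps 1 and 2 guarantee the order inside each partition, so together the output can be considered globally ordered.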
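The interplay of the sort order (step 1) and the grouping (step 2) can be simulated in plain Java, without Hadoop. This is only a sketch: `SecondarySortSketch`, `SORT`, `GROUP_BY_FIRST` and `run` are hypothetical names, each `long[]{first, second}` stands in for the composite key, and a plain list sort stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of approach 2 (no Hadoop dependency; all names are hypothetical).
class SecondarySortSketch {

    // Sort order of the composite key: first descending, then second ascending.
    static final Comparator<long[]> SORT =
            Comparator.<long[]>comparingLong(k -> k[0]).reversed()
                      .thenComparingLong(k -> k[1]);

    // Grouping that looks at first only, like the grouping comparator in step 2.
    static final Comparator<long[]> GROUP_BY_FIRST = Comparator.comparingLong(k -> k[0]);

    // Sorts the keys the way the shuffle would, then starts a new output line (one
    // simulated reduce call) whenever the grouping comparator sees a difference
    // between neighbouring keys.
    static List<String> run(List<long[]> pairs, Comparator<long[]> grouping) {
        List<long[]> sorted = new ArrayList<>(pairs);
        sorted.sort(SORT);
        List<String> lines = new ArrayList<>();
        StringBuilder sb = null;
        long[] prev = null;
        for (long[] k : sorted) {
            if (prev == null || grouping.compare(prev, k) != 0) {
                if (sb != null) {
                    lines.add(sb.toString());
                }
                sb = new StringBuilder().append(k[0]).append(":");
            }
            sb.append(k[1]).append(" ");
            prev = k;
        }
        if (sb != null) {
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<long[]> pairs = Arrays.asList(
                new long[]{22, 12}, new long[]{22, 6}, new long[]{21, 5},
                new long[]{28, 79}, new long[]{28, 63});
        System.out.println(run(pairs, GROUP_BY_FIRST)); // grouping on first only
        System.out.println(run(pairs, SORT));           // grouping falls back to the sort order
    }
}
```

With `GROUP_BY_FIRST` the five pairs collapse into three lines, one per distinct first. Passing `SORT` as the grouping mimics a job without `setGroupingComparatorClass`, where Hadoop groups with the sort comparator: every distinct (first, second) pair then becomes its own line, which answers the question posed in step 2.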

Code

The Key class:

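import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

import org.apache.hadoop.io.WritableComparable;

// Composite key: first is the key from the file, second is the value from the file.
class Key implements WritableComparable<Key> {
    private Long first;
    private Long second;

    // Sort by first in descending order, then by second in ascending order.
    @Override
    public int compareTo(Key o) {
        int res = o.first.compareTo(first);
        if (res == 0) {
            res = second.compareTo(o.second);
        }
        return res;
    }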

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(first);
        dataOutput.writeLong(second);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.first = dataInput.readLong();
        this.second = dataInput.readLong();
    }

    public Long getFirst() {
        return first;
    }

    public void setFirst(Long first) {
        this.first = first;
    }

    public Long getSecond() {
        return second;
    }

    public void setSecond(Long second) {
        this.second = second;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Key key = (Key) o;
        return Objects.equals(first, key.first) && Objects.equals(second,key.second);
    }

    @Override
    public int hashCode() {
        return Objects.hash(first,second);
    }
}
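Between map and reduce the key travels as bytes, so `write` and `readFields` have to mirror each other field by field. The following sketch checks that round trip in plain Java. `KeyRoundTripSketch` is a hypothetical stand-in that copies only the two serialization methods of `Key` and does not implement Hadoop's `WritableComparable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Plain-Java stand-in for Key (hypothetical; no Hadoop dependency).
class KeyRoundTripSketch {
    long first;
    long second;

    void write(DataOutputStream out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    void readFields(DataInputStream in) throws IOException {
        first = in.readLong();  // fields must be read in the order they were written
        second = in.readLong();
    }

    // Serializes a key to bytes and reads it back into a fresh instance.
    static KeyRoundTripSketch roundTrip(long first, long second) {
        KeyRoundTripSketch original = new KeyRoundTripSketch();
        original.first = first;
        original.second = second;
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            KeyRoundTripSketch copy = new KeyRoundTripSketch();
            try (DataInputStream in =
                         new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        KeyRoundTripSketch copy = roundTrip(22L, 12L);
        System.out.println(copy.first + " " + copy.second);
    }
}
```

Swapping the two reads in `readFields` would silently exchange first and second, which in the real job would corrupt both the sort order and the grouping.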

The grouping comparator class:

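import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

class PairGroupComparator extends WritableComparator {

    public PairGroupComparator() {
        // true: let WritableComparator create Key instances to deserialize into
        super(Key.class, true);
    }

    // Compare first only, so keys that differ just in second fall into the same group.
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Key pa = (Key) a;
        Key pb = (Key) b;
        return pa.getFirst().compareTo(pb.getFirst());
    }
}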

The partitioner:

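import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Partitioner;

class PairSortPartitioner extends Partitioner<Key, LongWritable> {
    private static final int MAX = 100;

    /**
     * All keys in my data lie between 0 and 100, so the range 0-100 is simply split into
     * as many sub-ranges as there are partitions, and each key goes to the partition
     * whose sub-range contains it.
     * This approach has serious limitations:
     * 1. It is prone to severe hot spots (data skew).
     * 2. If the keys are not between 0 and 100, it cannot adapt.
     *
     * If you know a better algorithm, please let me know. Thanks!
     */
    @Override
    public int getPartition(Key key, LongWritable value, int numPartitions) {
        long first = key.getFirst();
        if (first < 0 || first > MAX) {
            throw new IllegalArgumentException("key is not between 0 and " + MAX + ": " + first);
        }
        // Bucket width chosen so that the buckets cover 0..MAX for any partition count,
        // including counts that do not divide MAX evenly.
        int step = MAX / numPartitions + 1;
        int bucket = (int) (first / step);
        // Keys are sorted in descending order, so the highest range goes to partition 0;
        // the output files read in order are then globally descending.
        return numPartitions - 1 - bucket;
    }
}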
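One common answer to the request for a better algorithm is to derive the range boundaries from a sample of the keys instead of a fixed 0-100 range. Hadoop ships `TotalOrderPartitioner` and `InputSampler` (package `org.apache.hadoop.mapreduce.lib.partition`), which are built around that general idea. The sketch below is not their implementation. It is a plain-Java illustration with hypothetical names (`SampledRangePartitionSketch`, `splitPoints`, `partitionFor`) that assumes a non-empty sample and keeps the convention that larger keys go to lower partition numbers.

```java
import java.util.Arrays;

// Plain-Java sketch of sample-based range partitioning (hypothetical names, no Hadoop dependency).
class SampledRangePartitionSketch {

    // Picks numPartitions - 1 split points at evenly spaced positions of the sorted sample,
    // so each range holds roughly the same number of sampled keys.
    static long[] splitPoints(long[] sample, int numPartitions) {
        long[] sorted = sample.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            splits[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return splits;
    }

    // Maps a key to a partition: binary search finds the key's range in ascending order,
    // and the result is reversed so that the largest keys land in partition 0.
    static int partitionFor(long key, long[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        int bucket = pos >= 0 ? pos + 1 : -(pos + 1);
        return splits.length - bucket;
    }

    public static void main(String[] args) {
        // The first column of the sample input lines shown at the top of the article.
        long[] sample = {22, 22, 22, 22, 21, 28, 28, 28, 1, 23, 1, 67, 18, 19, 1, 21, 57, 23, 12};
        long[] splits = splitPoints(sample, 4);
        System.out.println(Arrays.toString(splits));
        for (long key : new long[]{67, 23, 19, 1}) {
            System.out.println(key + " -> partition " + partitionFor(key, splits));
        }
    }
}
```

Because the boundaries follow the sampled distribution, dense key ranges are split more finely than sparse ones, which reduces the hot-spot problem. A key that dominates the data still ends up in a single partition.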

The Mapper class:

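import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class PairSortMapper extends Mapper<LongWritable, Text, Key, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line is "key value", separated by a tab or by spaces.
        String[] pair = value.toString().trim().split("\\s+");
        Key sortKey = new Key();
        sortKey.setFirst(Long.parseLong(pair[0]));
        long second = Long.parseLong(pair[1]);
        sortKey.setSecond(second);
        // The value travels both inside the composite key (for sorting) and as the map output value.
        context.write(sortKey, new LongWritable(second));
    }
}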

The Reducer class:

class PairSortReducer extends Reducer<Key, LongWritable, NullWritable, Text> {

    private Text out = new Text();

    @Override
    protected void reduce(Key key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

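        // The grouping comparator puts all keys with the same first into this one call,
        // and the values arrive already sorted by second in ascending order.
        StringBuilder sb = new StringBuilder();
        sb.append(key.getFirst()).append(":");
        for (LongWritable value : values) {
            sb.append(value.get()).append(" ");
        }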
        String outline = sb.toString();
        out.set(outline);
        context.write(NullWritable.get(), out);
        System.err.println(outline);
    }
}

The Driver class:

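import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class PairSecondarySortDriver extends Configured implements Tool {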

    private final static Path input = new Path("/tmp/pair/in/*");
    private final static Path output = new Path("/tmp/pair/out");

    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJarByClass(this.getClass());
        job.setJobName(this.getClass().getSimpleName());

        job.setMapperClass(PairSortMapper.class);
        job.setMapOutputKeyClass(Key.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setReducerClass(PairSortReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setNumReduceTasks(4);
        job.setPartitionerClass(PairSortPartitioner.class);
        job.setGroupingComparatorClass(PairGroupComparator.class);

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, input);

        FileSystem fs = FileSystem.get(getConf());
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, output);

        return job.waitForCompletion(true) ? 0 : 1;
    }

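    public static void main(String[] args) throws Exception {
        // Pass the command-line arguments through so that generic Hadoop options (-D, -conf, ...) are honoured.
        int run = ToolRunner.run(new PairSecondarySortDriver(), args);
        System.exit(run);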
    }
}