Hadoop Example 11: Secondary Sort Explained

garychenqin

Tags: hadoop, sorting example

Article link: https://blog.csdn.net/garychenqin/article/details/48286409

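Notes:
Secondary sort mainly involves the following settings.

Before 0.20.0 (the old JobConf API):

    setPartitionerClass

    setOutputKeyComparatorClass

    setOutputValueGroupingComparator

From 0.20.0 onward (the new Job API), each setter takes a Class object:

    job.setPartitionerClass(Class<? extends Partitioner> cls);

    job.setSortComparatorClass(Class<? extends RawComparator> cls);

    job.setGroupingComparatorClass(Class<? extends RawComparator> cls);
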
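1. How secondary sort works
In the map stage, the InputFormat set with job.setInputFormatClass divides the input data set into smaller chunks called splits, and also provides a RecordReader implementation.

This example uses TextInputFormat. Its RecordReader uses the byte offset of each line within the file as the key and the text of that line as the value.

That is why the input of the custom Mapper is <LongWritable, Text>. The framework then calls the map method of the custom Mapper once for each <LongWritable, Text> pair.

Note that the output must match the output types declared by the custom Mapper, <IntPair, IntWritable> in this description (the code later in this article emits <IntPair, Text>). The result is a list of <IntPair, IntWritable> pairs.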
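
To make the <LongWritable, Text> input concrete, here is a minimal plain-Java sketch of how byte offsets line up with lines of text. It does not use Hadoop; the class name LineOffsetDemo and the method lineOffsets are hypothetical, and the sketch assumes UTF-8 text with '\n' line terminators.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsetDemo {

    /** Returns the byte offset at which each line of the content starts. */
    static List<Long> lineOffsets(String content) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            offsets.add(offset);
            // +1 accounts for the '\n' that terminates this line.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Two tab-separated records, similar in shape to the job input.
        System.out.println(lineOffsets("a\t1\tx\nb\t2\ty\n"));
    }
}
```

Each offset plays the role of the LongWritable key, and the line that starts there plays the role of the Text value.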

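At the end of the map stage, the class set with job.setPartitionerClass partitions this list, and each partition maps to one reducer.

Within each partition, the records are sorted with the key comparator class set by job.setSortComparatorClass. As you can see, this is already a two-level sort: first by partition, then by key.

If no key comparator class is set with job.setSortComparatorClass, the compareTo method implemented by the key is used.

In the first example, the compareTo method implemented by IntPair is used; in the next example, a dedicated key comparator class is defined.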

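In the reduce stage, after the reducer has received all of the map output assigned to it, it again uses the key comparator class set with job.setSortComparatorClass to sort all of the key/value pairs.

It then starts building the value iterator for each key. This is where grouping comes in, using the grouping comparator class set with job.setGroupingComparatorClass.

Whenever this comparator considers two keys equal, they belong to the same group and their values go into the same value iterator. The key paired with that iterator is the first key of the group.

Finally the reduce method of the Reducer is called, once for each (key, value iterator) pair. Again, the input and output types must match those declared by the custom Reducer.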

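Key points:
1. At the end of the map stage the data is partitioned, normally with the class set by job.setPartitionerClass; if none is set, the default HashPartitioner partitions by the key's hashCode() method.
2. Within each partition, the key comparator class set by job.setSortComparatorClass sorts the records; if none is set, the key's compareTo method is used.
3. Once the reducer has received all the data sent by the maps, the key comparator class set by job.setSortComparatorClass sorts all the key/value pairs; if none is set, the key's compareTo method is used.
4. The grouping comparator class set by job.setGroupingComparatorClass is then used for grouping; the values of all keys in the same group are placed in one iterator.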
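
The four steps above can be sketched in plain Java without a cluster. This is only a simulation under stated assumptions: keys are (first, second) string pairs held in String[] arrays, the second field doubles as the value, the sort is ascending on both fields for readability, and the names SecondarySortPipelineDemo and run are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortPipelineDemo {

    /** Simulates partition -> sort -> group; returns one entry per simulated reduce call. */
    static List<String> run(List<String[]> keys, int numPartitions) {
        // Step 1: partition on the first field only.
        List<List<String[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String[] k : keys) {
            int p = (k[0].hashCode() & Integer.MAX_VALUE) % numPartitions;
            partitions.get(p).add(k);
        }

        // Steps 2 and 3: sort by first, then by second (the role of the sort comparator).
        Comparator<String[]> sortCmp = (a, b) -> {
            int c = a[0].compareTo(b[0]);
            return c != 0 ? c : a[1].compareTo(b[1]);
        };

        List<String> reduceCalls = new ArrayList<>();
        for (List<String[]> part : partitions) {
            part.sort(sortCmp);
            // Step 4: adjacent keys with the same first field share one value iterator.
            int i = 0;
            while (i < part.size()) {
                String first = part.get(i)[0];
                List<String> values = new ArrayList<>();
                while (i < part.size() && part.get(i)[0].equals(first)) {
                    values.add(part.get(i)[1]);
                    i++;
                }
                reduceCalls.add(first + "=" + values);
            }
        }
        return reduceCalls;
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
                new String[] {"b", "2"}, new String[] {"a", "3"},
                new String[] {"a", "1"}, new String[] {"b", "1"});
        System.out.println(run(keys, 1)); // prints [a=[1, 3], b=[1, 2]]
    }
}
```

Within each simulated reduce call the values arrive already ordered by the second field, which is the whole point of secondary sort.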

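2. How to define a custom key
Every custom key should implement the WritableComparable interface, because it must be both serializable and comparable, and should override the following methods:

        // Deserialization: rebuild the IntPair from the binary data in the stream
        public void readFields(DataInput in) throws IOException

        // Serialization: convert the IntPair into binary data sent through the stream
        public void write(DataOutput out) throws IOException

        // Key comparison
        public int compareTo(IntPair o)

        // Two more methods that a newly defined key class should override
        // The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce)
        public int hashCode()
        public boolean equals(Object right)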

package cn.edu.bjut.twosort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class IntPair implements WritableComparable<IntPair> {

    private String first = "";
    private String two = "";

    public IntPair() {
        super();
    }

    public IntPair(String first, String two) {
        super();
        this.first = first;
        this.two = two;
    }


    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.first);
        out.writeUTF(this.two);
    }

    public void readFields(DataInput in) throws IOException {
        this.first = in.readUTF();
        this.two = in.readUTF();
    }

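    @Override
    public int compareTo(IntPair o) {
        // o is compared against this, so both fields sort in descending order
        // (lexicographically, because they are Strings).
        if(!o.getFirst().equals(this.first)) {
            return o.getFirst().compareTo(this.first);
        } else if(!o.getTwo().equals(this.two)) {
            // Same first field: fall back to the second field.
            return o.getTwo().compareTo(this.two);
        } else {
            return 0;
        }
    }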

    @Override
    public int hashCode() {
        return this.getFirst().hashCode()*127 + this.getTwo().hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if(null == obj) {
            return false;
        }
        if(this == obj) {
            return true;
        }
        if(obj instanceof IntPair) {
            IntPair intPair = (IntPair) obj;
            return this.first.equals(intPair.getFirst()) && this.two.equals(intPair.getTwo());
        } else {
            return false;
        }
    }

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public String getTwo() {
        return two;
    }

    public void setTwo(String two) {
        this.two = two;
    }

}
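
The write/readFields pair can be checked with a round trip through a byte array. The sketch below is an assumption-laden stand-in rather than the real key: the nested Pair class copies only the two String fields of IntPair and has no Hadoop dependency, and the names PairRoundTripDemo and roundTrip are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class PairRoundTripDemo {

    /** Stand-in for IntPair: same field layout, no Hadoop interfaces. */
    static class Pair {
        String first = "";
        String two = "";

        void write(DataOutput out) throws IOException {
            out.writeUTF(first);
            out.writeUTF(two);
        }

        void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            first = in.readUTF();
            two = in.readUTF();
        }
    }

    /** Serializes a pair to bytes and reads it back into a fresh instance. */
    static Pair roundTrip(String first, String two) {
        try {
            Pair original = new Pair();
            original.first = first;
            original.two = two;

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            // A fresh, empty instance is filled from the stream.
            Pair copy = new Pair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Pair p = roundTrip("2015", "32");
        System.out.println(p.first + "\t" + p.two);
    }
}
```

If readFields read the fields in a different order than write emitted them, the copy would come back with first and two swapped, which is an easy mistake to make when a key gains more fields.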

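3. How to define a custom partitioner class. This is the first comparison on the key.

public static class FirstPartitioner extends Partitioner<IntPair,IntWritable>

Register it on the job with setPartitionerClass. The partitioner used by this article's job, MyPartitioner, takes Text values and partitions on the first field only: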

package cn.edu.bjut.twosort;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<IntPair, Text> {

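    @Override
    public int getPartition(IntPair key, Text value, int numPartitions) {
        // Partition on the first field only, so all keys sharing it reach the same reducer.
        // Masking the sign bit keeps the index non-negative; Math.abs(Integer.MIN_VALUE) is still negative.
        return ((key.getFirst().hashCode() * 127) & Integer.MAX_VALUE) % numPartitions;
    }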

}
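
A partitioner must return an index in the range [0, numPartitions). The plain-Java sketch below compares two common ways of turning a hash into such an index; the class and method names are hypothetical. Masking with Integer.MAX_VALUE is the approach Hadoop's default HashPartitioner takes.

```java
class PartitionIndexDemo {

    /** Math.abs idiom: returns a negative index when the hash is Integer.MIN_VALUE. */
    static int absPartition(int hash, int numPartitions) {
        return Math.abs(hash) % numPartitions;
    }

    /** Clears the sign bit first, so the result is always in [0, numPartitions). */
    static int maskedPartition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int hash = Integer.MIN_VALUE; // Math.abs cannot represent its positive counterpart
        System.out.println(absPartition(hash, 3));    // prints -2
        System.out.println(maskedPartition(hash, 3)); // prints 0
    }
}
```

The failing case is rare, but a hash computed with multiplication can overflow to any int value, so the mask is the safer habit.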

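4. How to define a custom key comparator class. This is the second comparison on the key. It is a comparator and needs to extend WritableComparator.

public static class KeyComparator extends WritableComparator

It must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
Another approach is to implement the RawComparator interface.
Register it on the job with setSortComparatorClass.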
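
A separate sort comparator is useful when the desired order differs from the key's own compareTo. As a hedged illustration, the plain-Java Comparator below orders by the first field ascending and the second field descending; the same two-step logic would go into the body of compare(WritableComparable, WritableComparable) after casting both arguments to IntPair. The String[] key representation and the names SortComparatorDemo and sorted are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortComparatorDemo {

    /** Orders by first ascending, then by second descending. */
    static final Comparator<String[]> FIRST_ASC_SECOND_DESC = (a, b) -> {
        int c = a[0].compareTo(b[0]);
        // Arguments are swapped for the second field to reverse its order.
        return c != 0 ? c : b[1].compareTo(a[1]);
    };

    /** Returns the keys as "first:second" strings in sorted order. */
    static List<String> sorted(List<String[]> keys) {
        List<String[]> copy = new ArrayList<>(keys);
        copy.sort(FIRST_ASC_SECOND_DESC);
        List<String> out = new ArrayList<>();
        for (String[] k : copy) {
            out.add(k[0] + ":" + k[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"b", "2"},
                new String[] {"a", "3"}, new String[] {"b", "1"});
        System.out.println(sorted(keys)); // prints [a:3, a:1, b:2, b:1]
    }
}
```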

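5. How to define a custom grouping comparator class.
In the reduce stage, when the value iterator for a key is built, all keys with the same first field belong to the same group and go into one value iterator. This is a comparator and needs to extend WritableComparator.

public static class GroupingComparator extends WritableComparator

Like the key comparator class, it must have a constructor and override public int compare(WritableComparable w1, WritableComparable w2).
Also like the key comparator class, a grouping comparator can alternatively implement the RawComparator interface.
Register it on the job with setGroupingComparatorClass.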

package cn.edu.bjut.twosort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class MyComparator extends WritableComparator {

    protected MyComparator() {
        super(IntPair.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        IntPair a1 = (IntPair) a;
        IntPair a2 = (IntPair) b;

        return a1.getFirst().compareTo(a2.getFirst());
    }

}
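
One consequence worth illustrating: as far as I understand the framework, grouping only compares each key with the one that precedes it in sorted order, so the sort order has to bring all keys of a group next to each other. The plain-Java sketch below simulates that under the assumption that keys are (first, second) String[] pairs and that grouping is on the first field; all class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingAdjacencyDemo {

    static final Comparator<String[]> BY_FIRST_THEN_SECOND = (a, b) -> {
        int c = a[0].compareTo(b[0]);
        return c != 0 ? c : a[1].compareTo(b[1]);
    };

    static final Comparator<String[]> BY_SECOND_ONLY = (a, b) -> a[1].compareTo(b[1]);

    /** Counts simulated reduce calls: a new group starts when the first field changes. */
    static int countGroups(List<String[]> sortedKeys) {
        int groups = 0;
        String previousFirst = null;
        for (String[] k : sortedKeys) {
            if (previousFirst == null || !k[0].equals(previousFirst)) {
                groups++;
            }
            previousFirst = k[0];
        }
        return groups;
    }

    /** Sorts a copy of the keys with the given comparator, then counts the groups. */
    static int groupsAfterSort(List<String[]> keys, Comparator<String[]> sortCmp) {
        List<String[]> copy = new ArrayList<>(keys);
        copy.sort(sortCmp);
        return countGroups(copy);
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
                new String[] {"a", "1"}, new String[] {"b", "2"},
                new String[] {"a", "3"}, new String[] {"b", "4"});
        System.out.println(groupsAfterSort(keys, BY_FIRST_THEN_SECOND)); // prints 2
        System.out.println(groupsAfterSort(keys, BY_SECOND_ONLY));       // prints 4
    }
}
```

With a sort that leads with the first field, the two logical groups stay intact. With a sort on the second field alone, keys of the same group are interleaved and each one ends up in its own reduce call.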

6. Mapper program:

package cn.edu.bjut.twosort;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TwoSortMapper extends Mapper<LongWritable, Text, IntPair, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] arr = line.split("\t");
        if(3 == arr.length) {
            IntPair intPair = new IntPair();
            intPair.setFirst(arr[0]);
            intPair.setTwo(arr[1]);
            context.write(intPair, value);
        }
    }
}

7. Reducer program:

package cn.edu.bjut.twosort;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoReducer extends Reducer<IntPair, Text, NullWritable, Text> {

    private static final Text SEP = new Text("---------------------------------"); 
    @Override
    protected void reduce(IntPair key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        context.write(NullWritable.get(), SEP);

        for(Text text : values) {
            context.write(NullWritable.get(), text);
        }
    }

}

8. Main program:

package cn.edu.bjut.twosort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MainJob {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "twosort");
        job.setJarByClass(MainJob.class);

        job.setMapperClass(TwoSortMapper.class);
        job.setMapOutputKeyClass(IntPair.class);
        job.setMapOutputValueClass(Text.class);

        job.setPartitionerClass(MyPartitioner.class);
        job.setGroupingComparatorClass(MyComparator.class);

        job.setReducerClass(TwoReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        Path outPath = new Path(args[1]);
        FileSystem fs = FileSystem.get(conf);
        if(fs.exists(outPath)) {
            fs.delete(outPath, true);
        }

        FileOutputFormat.setOutputPath(job, outPath);

        // Report job success or failure through the process exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
