How Secondary Sort Works in MapReduce
1. In the Mapper phase, the input data set is divided into splits by the InputFormat's getSplits method:
public abstract class InputFormat<K, V> {
    public InputFormat() {}

    public abstract List<InputSplit> getSplits(JobContext var1) throws IOException, InterruptedException;
2. The InputFormat also provides a RecordReader that reads each record and hands it to map for processing:
    public abstract RecordReader<K, V> createRecordReader(InputSplit var1, TaskAttemptContext var2) throws IOException, InterruptedException;
}
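The key trick in secondary sort happens in the map step: the value field that needs sorting is promoted into the key. A minimal plain-Java sketch of that idea (the "word count" line format and the mapLine helper are hypothetical; a real Mapper would emit a composite WritableComparable, not a string):

```java
// Sketch of the map step for secondary sort: each input line "word count"
// is turned into a composite key that carries the count, so the framework
// (not the reducer) performs the secondary ordering during the shuffle.
// The "word#count" string is a stand-in for a real composite WritableComparable.
public class CompositeKeyMap {
    public static String mapLine(String line) {
        String[] fields = line.split("\\s+");
        return fields[0] + "#" + fields[1]; // natural key + secondary field
    }

    public static void main(String[] args) {
        System.out.println(mapLine("apple 3")); // apple#3
    }
}
```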
3. At the end of the Mapper phase, the map output is partitioned by a Partitioner; a custom Partitioner can be set via Job's setPartitionerClass:
public abstract class Partitioner<KEY, VALUE> {
    public Partitioner() {}

    public abstract int getPartition(KEY var1, VALUE var2, int var3);
}

public void setPartitionerClass(Class<? extends Partitioner> cls) throws IllegalStateException {
    this.ensureState(Job.JobState.DEFINE);
    this.conf.setClass("mapreduce.job.partitioner.class", cls, Partitioner.class);
}
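With a composite key, the default hash partitioner would scatter records with the same natural key across reducers. A secondary-sort Partitioner must therefore hash only the natural key. A plain-Java sketch of the getPartition logic (the (word, count) key shape is a hypothetical example; a real class would extend Hadoop's Partitioner):

```java
// Sketch of a secondary-sort partitioner: hash only the natural key (word)
// so every (word, *) composite key is routed to the same reduce partition.
// Mirrors the getPartition(key, value, numPartitions) contract shown above.
public class NaturalKeyPartition {
    public static int getPartition(String word, int numPartitions) {
        // Mask the sign bit before the modulo so the result is non-negative.
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Composite keys ("apple", 3) and ("apple", 7) share a partition
        // because only "apple" enters the hash.
        System.out.println(getPartition("apple", 4) == getPartition("apple", 4)); // true
    }
}
```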
4. Within each partition, the records can be sorted by a key comparator set via Job's setSortComparatorClass:
public void setSortComparatorClass(Class<? extends RawComparator> cls) throws IllegalStateException {
    this.ensureState(Job.JobState.DEFINE);
    this.conf.setOutputKeyComparatorClass(cls);
}

public interface RawComparator<T> extends Comparator<T> {
    int compare(byte[] var1, int var2, int var3, byte[] var4, int var5, int var6);
}
If no sort comparator is set, the key's own compareTo method is used.
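For secondary sort, that compareTo method (or the custom sort comparator) must order by the natural key first and by the promoted value second. A sketch using a hypothetical (word, count) key in plain Java; a real key would implement Hadoop's WritableComparable and also provide write/readFields:

```java
// Composite key for secondary sort: primary order by word, secondary by count.
// In Hadoop this class would implement WritableComparable<WordCountKey>.
public class WordCountKey implements Comparable<WordCountKey> {
    final String word;  // natural key
    final int count;    // secondary field, promoted into the key

    public WordCountKey(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public int compareTo(WordCountKey other) {
        int cmp = this.word.compareTo(other.word);       // natural key first
        if (cmp != 0) return cmp;
        return Integer.compare(this.count, other.count); // then the secondary field
    }
}
```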
5. The Reducer receives all the data that belongs to it from every Mapper. The sort comparator set via Job's setSortComparatorClass also orders all the data a Reducer receives by key. Since on the Reducer side each key corresponds to a list of values, Job's setGroupingComparatorClass is used to set the class that decides which keys are grouped together:
public void setGroupingComparatorClass(Class<? extends RawComparator> cls) throws IllegalStateException {
    this.ensureState(Job.JobState.DEFINE);
    this.conf.setOutputValueGroupingComparator(cls);
}
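The grouping comparator is deliberately looser than the sort comparator: it compares only the natural key, so composite keys that differ in the secondary field still arrive in a single reduce() call. A plain-Java sketch with a hypothetical (word, count) key; a real implementation would extend Hadoop's WritableComparator:

```java
import java.util.Comparator;

// Grouping comparator sketch: only the natural key (word) is compared,
// so ("apple", 1) and ("apple", 9) fall into the same reduce group even
// though the sort comparator ordered them as distinct keys.
public class NaturalKeyGrouping {
    // Minimal composite key, used here only for illustration.
    public static class Key {
        final String word;
        final int count;
        public Key(String word, int count) { this.word = word; this.count = count; }
    }

    public static final Comparator<Key> GROUPING =
            Comparator.comparing(k -> k.word); // the count field is ignored
}
```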
6. Finally, the Reducer processes the data it has received and produces the final output.
Summary:
Mapper side:
set a Partitioner to control the partitioning step;
set a sort Comparator to control the ordering within each partition;
Reducer side:
set a sort Comparator to control the ordering of all the data a reducer receives;
set a GroupingComparator to group the sorted data into (key, value-list) pairs;
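Putting the summary together: the shuffle's sort phase uses the full composite-key order, while the group phase uses only the natural key. A self-contained plain-Java simulation of those two phases (the (word, count) records and the shuffle helper are illustrative, not Hadoop API):

```java
import java.util.*;

// Simulates the shuffle for secondary sort: sort by (word, count), then
// group adjacent records by word -- the value lists come out pre-sorted,
// which is exactly what the reducer observes.
public class SecondarySortDemo {
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> records) {
        // Sort phase: full composite-key order (setSortComparatorClass's job).
        records.sort(Comparator.<Map.Entry<String, Integer>, String>comparing(Map.Entry::getKey)
                .thenComparing(Map.Entry::getValue));
        // Group phase: equal natural keys merge into one group
        // (setGroupingComparatorClass's job).
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> r : records) {
            groups.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> recs = new ArrayList<>(List.of(
                Map.entry("b", 2), Map.entry("a", 5), Map.entry("a", 1)));
        System.out.println(shuffle(recs)); // {a=[1, 5], b=[2]}
    }
}
```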