I. Overview
By default, the MapReduce framework sorts its output by key. This default sort covers some use cases but is quite limited, and in practice we often need a secondary sort on the reduce output. Many implementations of secondary sort have been shared online, but their explanations of how it works, and of the overall MapReduce processing flow, differ widely, and some of them were never verified. This article walks through a concrete MapReduce secondary sort example, explains the implementation and the full MapReduce processing flow, and uses the results together with the map-side and reduce-side logs to verify that the described flow is correct.
II. Requirements
1. Input data:
zhangsan,3
lisi,7
wangwu,11
lisi,4
wangwu,66
lisi,7
wangwu,12
zhangsan,45
lisi,72
zhangsan,34
lisi,89
zhangsan,34
lisi,77
2. Expected output:
zhangsan 3
zhangsan 34
zhangsan 34
zhangsan 45
lisi 4
lisi 7
lisi 7
lisi 72
lisi 77
lisi 89
wangwu 11
wangwu 12
wangwu 66
III. Solution approach
1. First, before looking for a solution we should thoroughly understand how MapReduce processes data end to end; this is the foundation, and without it no solution will present itself. In rough outline the flow is: the framework first splits the input file via the getSplit() method, and each split corresponds to one map task. Each InputSplit is fed to the map function, and the intermediate results are sorted in the circular memory buffer, then partitioned, secondary-sorted (if a custom sort is configured), and merged. The shuffle then transfers the data to the reduce tasks. The reduce side also has a buffer, and there too the data is merged and sorted between memory and disk. The data is then grouped by key, and the reduce function is invoked once per group, producing the final output; a small illustrative sketch of this grouping step follows.
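The sketch below is plain Java and deliberately simplified, not Hadoop's actual implementation: it only models how a grouping comparator slices a sorted key/value stream into one reduce() call per group, using a few records from our sample data.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustrative model only: walk a sorted (key, value) stream and start a new
// "reduce() call" whenever the grouping comparator reports a key change.
public class GroupingDemo {
    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = List.of(
                new SimpleEntry<>("lisi", 4), new SimpleEntry<>("lisi", 7),
                new SimpleEntry<>("zhangsan", 3), new SimpleEntry<>("zhangsan", 34));
        Comparator<String> grouping = String::compareTo; // stand-in grouping comparator

        String prev = null;
        List<Integer> group = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sorted) {
            if (prev != null && grouping.compare(prev, e.getKey()) != 0) {
                System.out.println("reduce(" + prev + ", " + group + ")");
                group = new ArrayList<>();
            }
            group.add(e.getValue());
            prev = e.getKey();
        }
        if (prev != null) {
            System.out.println("reduce(" + prev + ", " + group + ")"); // last group
        }
    }
}

In the job built later in this article, MyGroup plays the role of this grouping comparator, and the sort order of the stream it walks is determined by FirstSecondary.compareTo().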
2. Detailed approach
(1) Building a key for the secondary sort
Given the requirement above, the goal is clear: merge records that share the same first column, and sort the merged numbers. As we know, MapReduce sorts only on the key, whether with the default or a custom comparator, and here the numbers to be sorted are not the key. What can we do? We can combine the original key and its associated number into a new composite key, while the value remains the original number. There are two ways to do this: the first is to define a class implementing the WritableComparable interface that carries the fields to sort on, here the first and secondary properties; the second is to concatenate the key and value into a single key on the map side, keeping the value unchanged. We use the first approach here; the custom key class is as follows:
package com.ibeifeng.sort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FirstSecondary implements WritableComparable<FirstSecondary> {

    private String first;
    private Integer secondary;

    public String getFirst() {
        return first;
    }

    public void setFirst(String first) {
        this.first = first;
    }

    public Integer getSecondary() {
        return secondary;
    }

    public void setSecondary(Integer secondary) {
        this.secondary = secondary;
    }

    @Override
    public String toString() {
        return "first=" + first + ", secondary=" + secondary;
    }

    // Serialization: write() and readFields() must handle the fields in the
    // same order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.getFirst());
        out.writeInt(this.getSecondary());
    }

    public void readFields(DataInput in) throws IOException {
        this.first = in.readUTF();
        this.secondary = in.readInt();
    }

    // Order by first, then by secondary when the first fields are equal;
    // this comparator is what produces the secondary sort.
    public int compareTo(FirstSecondary o) {
        int comp = this.first.compareTo(o.getFirst());
        if (0 != comp) {
            return comp;
        }
        return this.secondary.compareTo(o.getSecondary());
    }
}
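For contrast, here is a minimal sketch of the second approach mentioned above, in which the map side packs the name and the number into one Text key. The class name CompositeKeyMapper and the "#" separator are illustrative choices, not part of this article's implementation; note that this variant still needs a custom sort comparator that parses the numeric part, since plain string comparison would put 11 before 7.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of approach two: the composite key is a plain string "first#secondary".
public class CompositeKeyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text outputKey = new Text();
    private final IntWritable outputValue = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] split = value.toString().split(",");
        // "lisi,7" becomes key "lisi#7", value 7; a custom sort comparator must
        // later split on "#" and compare the numeric part as a number.
        outputKey.set(split[0] + "#" + split[1]);
        outputValue.set(Integer.valueOf(split[1]));
        context.write(outputKey, outputValue);
    }
}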
(2) Partitioning and grouping for the secondary sort
To make the effect of the secondary sort visible, the reduce stage performs no iterative accumulation and simply writes the values out. Because the map output key is a custom type, a custom partitioner and a custom grouping comparator are needed: the partitioner hashes on the first property of FirstSecondary, with the number of partitions set to 2, and the grouping comparator likewise compares only the first property. Without this grouping, every distinct (first, secondary) combination would form its own reduce group. The detailed code follows.
Main program entry code:
package com.ibeifeng.sort;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SecondarySortMapReduce extends Configured implements Tool {

    // Mapper: turns each "name,number" line into a composite key
    // (FirstSecondary) with the number repeated as the value.
    public static class map extends Mapper<LongWritable, Text, FirstSecondary, IntWritable> {
        private final IntWritable outputValue = new IntWritable();
        private final FirstSecondary outputKey = new FirstSecondary();

        @Override
        protected void map(LongWritable key, Text values, Context context)
                throws IOException, InterruptedException {
            // 1. Split the input line.
            String[] split = values.toString().split(",");
            // 2. Populate the composite key.
            outputKey.setFirst(split[0]);
            outputKey.setSecondary(Integer.valueOf(split[1]));
            // 3. Emit.
            outputValue.set(Integer.valueOf(split[1]));
            context.write(outputKey, outputValue);
        }
    }

    // Reducer: no accumulation; each value is written out under the "first"
    // field so the effect of the secondary sort remains visible.
    public static class reduce extends Reducer<FirstSecondary, IntWritable, Text, IntWritable> {
        private final Text text = new Text();

        @Override
        protected void reduce(FirstSecondary key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            for (IntWritable in : values) {
                text.set(key.getFirst());
                context.write(text, in);
            }
        }
    }

    // Driver configuration.
    public int run(String[] args) throws Exception {
        // 1. Get the configuration object.
        Configuration configuration = new Configuration();
        // 2. Create the job.
        Job job = Job.getInstance(configuration, this.getClass().getSimpleName());
        // 3. Input path.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // 4. Output path.
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Run from the packaged jar.
        job.setJarByClass(SecondarySortMapReduce.class);
        // 5. Map settings.
        job.setMapperClass(map.class);
        job.setMapOutputKeyClass(FirstSecondary.class);
        job.setMapOutputValueClass(IntWritable.class);
        // ================ shuffle ========================
        // 1. Partitioning.
        job.setPartitionerClass(MyPartitioner.class);
        // 2. Sorting: left unset, so FirstSecondary.compareTo() is used.
        // job.setSortComparatorClass(cls);
        // 3. Grouping.
        job.setGroupingComparatorClass(MyGroup.class);
        // 4. Optional combiner: a map-side reduce pass, as an optimization.
        // job.setCombinerClass(Combiner.class);
        // Number of reduce tasks.
        job.setNumReduceTasks(2);
        // ================ shuffle ========================
        // 6. Reduce settings.
        job.setReducerClass(reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 7. Submit the job to YARN and wait for completion.
        return job.waitForCompletion(true) ? 0 : -1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // 1. Hard-coded input and output paths (command-line args are ignored).
        String[] args1 = new String[] {
                "hdfs://hadoop-senior01.ibeifeng.com:8020/user/beifeng/wordcount/input",
                "hdfs://hadoop-senior01.ibeifeng.com:8020/user/beifeng/wordcount/output"
        };
        // 2. The user the job runs as.
        System.setProperty("HADOOP_USER_NAME", "beifeng");
        // 3. Run the job.
        int status = ToolRunner.run(configuration, new SecondarySortMapReduce(), args1);
        // 4. Exit with the job status.
        System.exit(status);
    }
}
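Note that setSortComparatorClass() is deliberately left commented out: when no sort comparator is configured, Hadoop falls back to the comparator of the map output key class, here FirstSecondary.compareTo(), which already orders by first and then secondary. The secondary sort therefore happens during the shuffle's sort phase; the partitioner and grouping comparator below only ensure that all records with the same first value reach the same reduce task and the same reduce() call.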
Custom partitioner code:
package com.ibeifeng.sort;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the "first" field only, so that all records with the same name
// go to the same reduce task.
public class MyPartitioner extends Partitioner<FirstSecondary, IntWritable> {
    @Override
    public int getPartition(FirstSecondary key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so a negative hashCode() cannot produce a
        // negative index; numPartitions is 2 here (two reduce tasks).
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
Custom grouping comparator code:
package com.ibeifeng.sort;

import org.apache.hadoop.io.RawComparator;

// Group on the "first" field only, so that one reduce() call receives all
// values sharing the same name.
public class MyGroup implements RawComparator<FirstSecondary> {

    public int compare(FirstSecondary o1, FirstSecondary o2) {
        return o1.getFirst().compareTo(o2.getFirst());
    }

    // Raw byte version. A serialized key is: a 2-byte UTF length prefix, the
    // "first" string bytes, then a 4-byte int for "secondary". Dropping the
    // trailing 4 bytes leaves only the "first" part; the comparison must start
    // at the given offsets s1/s2, not at index 0.
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return new String(b1, s1, l1 - 4).compareTo(new String(b2, s2, l2 - 4));
    }
}
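Raw comparators are easy to get wrong, because the byte-level logic must exactly mirror the layout produced by write(). As an alternative sketch (the class name MyGroupComparator is mine, not part of the article's implementation), one can extend Hadoop's WritableComparator, which deserializes both keys before comparing so the fields can be compared directly:

package com.ibeifeng.sort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Alternative grouping comparator (sketch): WritableComparator handles the
// byte-level work by deserializing both keys, so we only compare fields.
public class MyGroupComparator extends WritableComparator {

    protected MyGroupComparator() {
        // true => create FirstSecondary instances for the object-based compare()
        super(FirstSecondary.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return ((FirstSecondary) a).getFirst()
                .compareTo(((FirstSecondary) b).getFirst());
    }
}

Registering it with job.setGroupingComparatorClass(MyGroupComparator.class) would be a drop-in replacement for MyGroup.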
IV. Packaging and testing
Upload the built jar and the test data to HDFS, then run:
bin/yarn jar datas/sort.jar /user/beifeng/wordcount/input/ /user/beifeng/wordcount/output
The output of the two reduce tasks is now on HDFS; view it with: bin/hdfs dfs -text /user/beifeng/wordcount/output/part*
The result:
zhangsan 3
zhangsan 34
zhangsan 34
zhangsan 45
lisi 4
lisi 7
lisi 7
lisi 72
lisi 77
lisi 89
wangwu 11
wangwu 12
wangwu 66
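The listing above is the concatenation of the two part files: zhangsan hashed to one reduce task, while lisi and wangwu hashed to the other. Within each name the numbers appear in ascending order, confirming that the secondary sort worked as described.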