自定义combiner
map端合并数据,减少网络io
前言:在map端使用combiner合并数据可以减少需要通过网络io的数据,有效增加map reduce程序的运行效率。
一、普通的combiner
在map端提前使用combiner合并数据是广为人知的一种优化策略。
但是这种优化策略有两个缺陷,一个是数据量要比较大,不过考虑到map reduce程序处理的数据一般都是大量的数据,所以这个问题不是关键。
使combiner不那么受人重视的是另一个关键缺陷,因为combiner是要reducer程序在map端的提前,所以普遍的策略是combiner直接采用已有的reducer代码,而采用这种相同逻辑的combiner要求提前执行combiner程序,合并的数据不会影响到reducer端最终的合并。
这种要求只符合一些简单逻辑的程序,比如统计单词、求最大/小值等,这些程序的数据提前合并不会影响到reducer端的最终合并。
对比较复杂的程序逻辑来说是不能满足的,比如求平均数,一般的对一个学生的各科成绩score求平均数,reduce的逻辑是平均数(单个map的平均数)=score值的和/score个数,如果提前执行这样的combiner,reduce处理的数据将会变成最终平均数=单个map的平均数的和/map个数,将前者带入后者中即有最终平均数=score值的和/(score个数*map个数),可以发现如果每个map的score个数相等还是可以得出正确答案,但如果score个数不相等,则score个数小的会变成score大的值,从而使分母变大,进而导致最终平均值变小。
有如下数据:
score1.txt:
zhangsan 英语 60
zhangsan 政治 70
zhangsan 化学 80
lisi 英语 60
lisi 政治 70
lisi 化学 80
wanger 英语 60
wanger 政治 70
wanger 化学 80
score2.txt
zhangsan 语文 60
zhangsan 数学 70
lisi 语文 60
lisi 数学 70
wanger 语文 60
wanger 数学 70
mapper程序:
package cn.yy.hadoop.mapreduce.average;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* description: a mapper program for get average number from data files
* author: bob yy
* since: 1.8
**/
public class AverageMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
private Text text = new Text();
private FloatWritable number = new FloatWritable();
/**
* get student and score from input file, push its to reduce
*
* @param key line number
* @param value a line data of input file
* @param context context of map reduce
* @throws IOException write
* @throws InterruptedException write
*/
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] row = value.toString().split("\t");
String student = row[0];
String score = row[2];
text.set(student);
number.set(Float.parseFloat(score));
context.write(text, number);
}
}
reducer程序:
package cn.yy.hadoop.mapreduce.average;
import cn.yy.test.Test;
import com.sun.tools.javac.comp.Flow;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* description: a reducer program for get average number from mapper output
* author: bob yy
* since: 1.8
**/
public class AverageReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
private FloatWritable avg = new FloatWritable();
/**
* add sum and count,get average by sum/count.
*
* @param key student flag
* @param values scores of key student
* @param context context of map reduce, write data by the context.
* @throws IOException write
* @throws InterruptedException write
*/
@Override
protected void reduce(Text key, Iterable<FloatWritable> values, Context context) throws IOException, InterruptedException {
float sum = 0;
int count = 0;
for (FloatWritable value : values) {
sum += value.get();
count++;
}
avg.set(sum / count);
context.write(key, avg);
}
}
正常的求平均值驱动程序:对于相同逻辑的combiner和reducer的求平均值程序是不能使用combiner在map端提前聚合的。
/**
* get average use map reduce
*
* @throws IOException getInstance
* @throws ClassNotFoundException waitForCompletion
* @throws InterruptedException waitForCompletion
*/
private void getAverage() throws IOException, ClassNotFoundException, InterruptedException {
String in = "J:\\data\\average\\input";
String out = "J:\\data\\average\\output";
Path inPath = new Path(in);
Path outPath = new Path(out);
// 删除该文件
LocalFileSystem.deleteFile(out);
Job job = Job.getInstance();
job.setMapperClass(AverageMapper.class);
job.setReducerClass(AverageReducer.class);
// set output format of mapper and reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
// set input and output path
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
}
正确的结果:
lisi 68.0
wanger 68.0
zhangsan 68.0
如下是错误的案例:
/**
* it is fail, combiner of get average must be self defined; combiner of map side added count number, reduced
* average.
*
* @throws IOException getInstance
* @throws ClassNotFoundException waitForCompletion
* @throws InterruptedException waitForCompletion
*/
private void getAverageNoCombiner() throws IOException, ClassNotFoundException, InterruptedException {
String in = "J:\\data\\average\\input";
String out = "J:\\data\\average\\output_noCombiner";
Path inPath = new Path(in);
Path outPath = new Path(out);
// 删除该文件
LocalFileSystem.deleteFile(out);
Job job = Job.getInstance();
job.setMapperClass(AverageMapper.class);
job.setReducerClass(AverageReducer.class);
// set output format of mapper and reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FloatWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
// set Combiner no define
job.setCombinerClass(AverageReducer.class);
// set input and output path
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
}
错误的结果:可以发现最终结果比正确的平均值变小了。
lisi 67.5
wanger 67.5
zhangsan 67.5
主要错误是在map端使用combiner程序进行提前聚合;如上述所示,提前使用combiner程序聚合平均值会导致最终结果发生变化。
二、自定义combiner,实现自由合并
根据官方api我们知道,job设置combiner时接收Reducer的子类,一般的为了简化代码,直接将必有的Reducer实现类AverageReducer(业务reducer实现类)当作combiner传入job执行。
但是根据上面的论述,这种简单的方式导致了对于较为复杂的业务不能使用combiner提前合并。
所以,对于复杂业务,我们需要自定义combiner类。
因为mapreduce程序处理的都是大数据,所以map数据传递给reduce端所经过的网络i/o一般都是比较大,这使得我们可以对大多数mapreduce程序都采用提前combiner策略,即map传给reduce程序的数据都可以是经过提前合并的结果数据,从而极大的优化网络i/o。
对于求平均值的mapreduce程序,提前combiner会丢失隐性的数值,即score的个数。
自定义combiner逻辑:
很明显,当提前计算平均值就会导致数值变化,所以,可以统计结果(<student,<sum,count>>,保留对score求和得到的sum,统计score个数得到的count)但不求平均值(所以也就不会丢失数值),而将这些统计结果传给reduce程序,这里我们需要改变reduce原来的逻辑,现在的逻辑是遍历values是统计每个student在每个map中求得的sum和count,设所有sum和为ss,所有count和为cs,得到student平均值=ss/cs。
这里使用combiner统计结果元素,但不执行会导致数值变化的运算,在减少数据量的同时保持了combiner的正确性。
如此,可以说,对于大数据量的复杂业务,使用自定义combiner也完全可以成为常规的有效的优化手段,而代价则是增加了代码复杂度,所以使用这种优化手段的同时,我们应该注意系统的复杂度,如果太过复杂,则应该及时优化该代码。
自定义的combiner程序:这里只是简单的使用text完成数据的序列化,如果业务复杂可以使用javabean优化。
package cn.yy.hadoop.mapreduce.average;
import cn.yy.test.Test;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* description: a define Average Combiner for get average number
* author: bob yy
* since: 1.8
**/
public class AverageCombiner extends Reducer<Text, Text, Text, Text> {
private Text text = new Text();
/**
* get count and sum of student score, push it to reducer by use context write Text; the combiner output data
* is result of a mapper, so data is reduced.
*
* @param key student from mapper
* @param values scores of student
* @param context write count and sum to reducer
* @throws IOException write
* @throws InterruptedException write
*/
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
int sum = 0;
int count = 0;
for (Text value : values) {
sum += Integer.parseInt(value.toString());
count++;
}
text.set("" + sum + "\t" + count);
context.write(key, text);
}
}
因为combiner改变了map和reduce的输入输出格式,所以map进行关于text的简化,简化后的map程序:
package cn.yy.hadoop.mapreduce.average;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/**
* description: a mapper program for use combiner get average number,combiner get data from mapper output
* author: bob yy
* since: 1.8
**/
public class CombinerMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text student = new Text();
private Text score = new Text();
/**
* get student and score, serially score by Text for unique output format of mapper and reducer.
*
* @param key line number
* @param value line data of input file, include student and score...
* @param context context of map reduce, write key value to reducer by the context
* @throws IOException write
* @throws InterruptedException write
*/
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] split = value.toString().split("\t");
student.set(split[0]);
score.set(split[2]);
context.write(student, score);
}
}
逻辑改变后的reduce程序:
package cn.yy.hadoop.mapreduce.average;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/**
* description: a reducer for define combin,it get data from combiner output
* author: bob yy
* since: 1.8
**/
public class CombinerReducer extends Reducer<Text, Text, Text, FloatWritable> {
private FloatWritable avg = new FloatWritable();
/**
* count score number and score sum of a student, run sum/count(score number) get average; because cache
* count(score number) so that combiner not lose count.
*
* @param key student flag
* @param values some count and sum of a student
* @param context use context write student and average to file
* @throws IOException write
* @throws InterruptedException write
*/
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
float sum = 0;
float count = 0;
for (Text value : values) {
String[] split = value.toString().split("\t");
sum += Integer.parseInt(split[0]);
count += Integer.parseInt(split[1]);
}
avg.set(sum / count);
context.write(key, avg);
}
}
改变job流程的driver程序:
/**
* define combiner, get average use map reduce and add define combiner.
*
* @throws IOException getInstance
* @throws ClassNotFoundException waitForCompletion
* @throws InterruptedException waitForCompletion
*/
private void getAverageByCombiner() throws IOException, ClassNotFoundException, InterruptedException {
String in = "J:\\data\\average\\input";
String out = "J:\\data\\average\\output_Combiner";
Path inPath = new Path(in);
Path outPath = new Path(out);
// 删除该文件
LocalFileSystem.deleteFile(out);
Job job = Job.getInstance();
// use define combiner for get average number
job.setMapperClass(CombinerMapper.class);
job.setReducerClass(CombinerReducer.class);
// set output format of mapper and reducer
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
// set combiner
job.setCombinerClass(AverageCombiner.class);
// set input and output path
FileInputFormat.setInputPaths(job, inPath);
FileOutputFormat.setOutputPath(job, outPath);
job.waitForCompletion(true);
}
使用自定义combiner求平均值的mapreduce程序运行如上数据,结果:
lisi 68.0
wanger 68.0
zhangsan 68.0
对于大多数的mapreduce程序,自定义combiner都可以运行,所以我们又多了一种对mapreduce程序优化的有效手段。
并且网上说combiner对于复杂业务不适用的情况也是可以避免的。
大数据运算技术之mapreduce。