Combiner
1. Purpose: runs on the Mapper side
【It must not change the final result: max and sum qualify, avg does not】
a. Reduces the Mapper's local disk output
b. Reduces network traffic to the Reducer
【Effectively performs one Reduce pass on the Map side】
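Why max and sum are combiner-safe while avg is not can be checked with a small standalone sketch (hypothetical, not part of the job below): combining per-partition partial results must give the same answer as processing all values at once.

```java
import java.util.Arrays;

public class CombinerSafety {
    static double avg(int[] a) {
        return Arrays.stream(a).average().getAsDouble();
    }

    public static void main(String[] args) {
        int[] part1 = {10, 20};   // values seen by mapper 1
        int[] part2 = {30};       // values seen by mapper 2

        // max: max(max(part1), max(part2)) == max(all values) -> safe
        int maxCombined = Math.max(Arrays.stream(part1).max().getAsInt(),
                                   Arrays.stream(part2).max().getAsInt());
        System.out.println(maxCombined);              // 30, same as the global max

        // avg: avg(avg(part1), avg(part2)) != avg(all values) -> NOT safe
        double avgOfAvgs = (avg(part1) + avg(part2)) / 2.0;  // (15 + 30) / 2 = 22.5
        double globalAvg = avg(new int[]{10, 20, 30});       // 60 / 3 = 20.0
        System.out.println(avgOfAvgs + " vs " + globalAvg);
    }
}
```

An avg job can still use a combiner, but only by emitting (sum, count) pairs and dividing once in the Reducer.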
2. Temperature example
【Enabling a Combiner after the Mapper means one Reduce pass runs before the Reducer, which lowers the Mapper's local disk output and cuts network traffic to the Reducer】
-
TempMapper
package combiner;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TempMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

    private final IntWritable _year = new IntWritable();
    private final IntWritable _temp = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Fixed-width NCDC record: year at [15,19), temperature at [87,92),
        // quality code at [92,93)
        String year = line.substring(15, 19);
        String temp = line.substring(87, 92);
        String quality = line.substring(92, 93);
        int iy = Integer.parseInt(year);
        int it = Integer.parseInt(temp);
        // 9999 marks a missing reading; keep only valid quality codes
        if (Math.abs(it) != 9999 && quality.matches("[01459]")) {
            _year.set(iy);
            _temp.set(it);
            context.write(_year, _temp);
        }
    }
}
-
TempCombiner
package combiner;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class TempCombiner extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    private final IntWritable max_temp = new IntWritable();

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        max_temp.set(max);
        context.write(key, max_temp);
    }
}
-
TempReducer
package combiner;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class TempReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    private final IntWritable max_temp = new IntWritable();

    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        max_temp.set(max);
        context.write(key, max_temp);
    }
}
-
TempDriver
package combiner;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TempDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");

        // Delete the output directory if it already exists
        Path outPut = new Path("file:///D:/out");
        FileSystem fs = outPut.getFileSystem(conf);
        if (fs.exists(outPut)) {
            fs.delete(outPut, true);
        }

        Job job = Job.getInstance(conf);
        job.setJobName("temp");
        job.setJarByClass(TempDriver.class);

        job.setMapperClass(TempMapper.class);
        job.setCombinerClass(TempCombiner.class);  // enable the map-side combine
        job.setReducerClass(TempReducer.class);

        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);    // reducer output types must also
        job.setOutputValueClass(IntWritable.class);  // be declared, or writes will fail

        FileInputFormat.addInputPath(job, new Path("file:///D:/temp"));
        FileOutputFormat.setOutputPath(job, outPut);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
-
Before enabling the Combiner:
File System Counters
FILE: Number of bytes read=4707063
FILE: Number of bytes written=1333907
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=0
HDFS: Number of bytes written=0
HDFS: Number of read operations=0
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Map-Reduce Framework
Map input records=13130
Map output records=13129
Map output bytes=105032
Map output materialized bytes=131302
Input split bytes=166
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=131302
Reduce input records=13129
Reduce output records=2
Spilled Records=26258
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=4
Total committed heap usage (bytes)=879230976
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1777168
File Output Format Counters
Bytes Written=30
-
After enabling the Combiner:
File System Counters
FILE: Number of bytes read=4444523
FILE: Number of bytes written=871031
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=0
HDFS: Number of bytes written=0
HDFS: Number of read operations=0
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Map-Reduce Framework
Map input records=13130
Map output records=13129
Map output bytes=105032
Map output materialized bytes=32
Input split bytes=166
Combine input records=13129
Combine output records=2
Reduce input groups=2
Reduce shuffle bytes=32
Reduce input records=2
Reduce output records=2
Spilled Records=4
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=0
Total committed heap usage (bytes)=868220928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1777168
File Output Format Counters
Bytes Written=30
3. Result
1901	317
1902	244
【When a natural key is turned into a composite key, grouping, partitioning, and sorting are all affected, so those operations must be adjusted accordingly】
4. MapReduce data flow (without Combiner)
- InputFormat
- InputSplit (input splitting)
- map() function
- Buffer (circular in-memory buffer)
- Partition (partitioning)
- Sort (quicksort)
- Spill to disk
- Merge on disk
- Sort (Collections.sort())
- fetch (pull map output over HTTP)【5 parallel copier threads by default (mapreduce.reduce.shuffle.parallelcopies)】
- Merge
- Sort (Collections.sort())
- reduce()
- OutputFormat
- close()
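The Partition step in the flow above is HashPartitioner by default; its logic can be sketched in a few lines (a standalone illustration, not Hadoop's actual class): the target reducer is derived from the key's hash, masked to stay non-negative, modulo the number of reduce tasks.

```java
public class HashPartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner.getPartition()
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With 2 reducers, the year keys from the example land in
        // different partitions (Integer.hashCode() is the value itself)
        System.out.println(getPartition(1901, 2));  // 1
        System.out.println(getPartition(1902, 2));  // 0
    }
}
```

Because partitioning happens before the combiner, the combiner only ever merges values that were already destined for the same reducer.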
5. MapReduce data flow (with Combiner)
- InputFormat
- InputSplit (input splitting)
- map() function
- Buffer (circular in-memory buffer)
- Partition (partitioning)
- Sort (quicksort)
- >> combiner()
- Spill to disk
- >>【During the map-side merge, the combiner runs again only when there are at least 3 spill files】
- Merge on disk
- Sort (Collections.sort())
- fetch (pull map output over HTTP)
- >>【Runs again here, during the reduce-side merge, only if a combiner is set】
- Merge
- Sort (Collections.sort())
- reduce()
- OutputFormat
- close()
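The spill-file threshold noted above is configurable. A hedged sketch for the driver (the property name `mapreduce.map.combine.minspills` and its default of 3 are from Hadoop 2.x/3.x; verify against your version's mapred-default.xml):

```java
// In TempDriver, before Job.getInstance(conf):
// run the combiner during the map-side merge only when at least
// this many spill files exist (assumed default: 3)
conf.setInt("mapreduce.map.combine.minspills", 3);
```

Raising it skips the extra combine pass for maps that spill only a few times, where the savings would not repay the cost.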