1. MapReduce Output
The output depends on the number of Reduce tasks in the job (each Reduce task writes one output file). Some optimization suggestions:
- Compress the output to save storage space and to improve HDFS write throughput (see the sketch after this list);
- Avoid writing out-of-band side files as the output of Reduce tasks;
- Depending on how the consumers of the job's output files will read them, a splittable compression format may be the better choice;
- Writing larger HDFS files with a larger block size helps reduce the number of Map tasks in downstream jobs.
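A minimal sketch of the compression and block-size suggestions above, assuming the Hadoop 2.x API that the counter example later in this section also uses; the class name CompressedOutputSketch and the 256 MB block-size value are illustrative choices, not requirements:

import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputSketch {

    /**
     * Configures a job to write compressed, splittable output.
     */
    public static void configureOutput(Job job) {
        // Compress the files written by FileOutputFormat; BZip2Codec is used here
        // only because its output remains splittable for downstream Map tasks.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

        // Assumption: ask the HDFS client for a larger block size (256 MB here)
        // so that downstream jobs see fewer input splits and start fewer Map tasks.
        job.getConfiguration().setLong("dfs.blocksize", 256L * 1024 * 1024);
    }
}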
2. Speculative Execution of Tasks
A straggler is a task that runs slowly but will eventually complete successfully. A straggling Map task prevents the Reduce tasks from starting.
Hadoop cannot automatically fix a straggling task, but it can detect tasks that are running noticeably slower than expected. It then launches an equivalent backup task, uses the result of whichever copy finishes first, and asks the other copy to stop. This technique is called speculative execution.
Speculative execution is enabled by default and is controlled by the properties below (a per-job configuration sketch follows the table).
Property | Description |
---|---|
mapreduce.map.speculative | Enables speculative execution of Map tasks (default: true) |
mapreduce.reduce.speculative | Enables speculative execution of Reduce tasks (default: true) |
mapreduce.job.speculative.speculativecap | Maximum fraction of running tasks that may be speculatively re-executed at any time (default: 0.1, range 0-1) |
mapreduce.job.speculative.slownodethreshold | Threshold for deciding whether a node (TaskTracker) is suitable for running speculative copies of its tasks (default: 1) |
mapreduce.job.speculative.slowtaskthreshold | Threshold for deciding whether a task is slow enough to have a speculative copy launched (default: 1) |
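A minimal sketch of turning speculative execution off for a single job, assuming the Hadoop 2.x mapreduce API (the class and job names are arbitrary); the comments show the equivalent property-based configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfigSketch {

    public static Job newJobWithoutSpeculation() throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to conf.setBoolean("mapreduce.map.speculative", false);
        //           and conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "no-speculation-demo");
        job.setMapSpeculativeExecution(false);
        job.setReduceSpeculativeExecution(false);
        return job;
    }
}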
3. MapReduce Job Counters
Counters collect statistics at the job level and help with quality control, performance monitoring, and problem diagnosis of MapReduce jobs. Unlike log records, counters are global by nature, so they can be analyzed without a separate aggregation step.
Because counters are global variables in a distributed environment, they must be used sparingly; otherwise tracking them puts too much load on the cluster.
Example:
package com.zw.mr.counter;
import com.zw.util.HdfsUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import java.io.IOException;
import java.util.StringTokenizer;
/**
 * Counter demo program.
 *
 * Created by zhangws on 16/10/11.
 */
public class CounterDemo {

    // User-defined counters: how many input lines fall into each word-count bucket.
    public static enum WORDS_IN_LINE_COUNTER {
        ZERO_WORDS,
        LESS_THAN_FIVE_WORDS,
        MORE_THAN_FIVE_WORDS
    }

    public static class CounterMapper extends Mapper<LongWritable, Text,
            Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            int words = tokenizer.countTokens();
            if (words == 0) {
                context.getCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS).increment(1);
            } else if (words > 0 && words <= 5) {
                context.getCounter(WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS).increment(1);
            } else {
                context.getCounter(WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS).increment(1);
            }
            while (tokenizer.hasMoreTokens()) {
                String target = tokenizer.nextToken();
                context.write(new Text(target), new IntWritable(1));
            }
        }
    }
    public static class CounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();
            }
            // Emit the word and its total count.
            context.write(key, new IntWritable(count));
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] values = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (values.length < 2) {
            System.err.println("Usage: CounterDemo <in> <out>");
            System.exit(2);
        }

        // Delete the output directory first (HdfsUtil is the author's own HDFS helper class).
        HdfsUtil.rmr(conf, values[values.length - 1]);

        Job job = Job.getInstance(conf, CounterDemo.class.getSimpleName());
        job.setJarByClass(CounterDemo.class);
        job.setMapperClass(CounterMapper.class);
        job.setReducerClass(CounterReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(values[0]));
        FileOutputFormat.setOutputPath(job, new Path(values[1]));

        if (job.waitForCompletion(true)) {
            HdfsUtil.cat(conf, values[1] + "/part-r-00000");
            System.out.println("success");
        } else {
            System.out.println("fail");
        }
    }
}
Input data file:
xxx xxxx
hadoop spark
storm hadoop
storm storm storm storm storm storm
stop stop stop stop stop stop stop
Output:
File System Counters
FILE: Number of bytes read=792
FILE: Number of bytes written=510734
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=214
HDFS: Number of bytes written=45
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=6
Map output records=19
Map output bytes=182
Map output materialized bytes=226
Input split bytes=104
Combine input records=0
Combine output records=0
Reduce input groups=6
Reduce shuffle bytes=226
Reduce input records=19
Reduce output records=6
Spilled Records=38
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=2
Total committed heap usage (bytes)=391118848
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
com.zw.mr.counter.CounterDemo$WORDS_IN_LINE_COUNTER
LESS_THAN_FIVE_WORDS=3
MORE_THAN_FIVE_WORDS=2
ZERO_WORDS=1
File Input Format Counters
Bytes Read=107
File Output Format Counters
Bytes Written=45
hadoop 2
spark 1
stop 7
storm 7
xxx 1
xxxx 1
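The WORDS_IN_LINE_COUNTER values shown above come from the job's console summary; they can also be read programmatically in the driver once waitForCompletion returns. A minimal sketch, assuming it lives in the same package as CounterDemo (the class and method names below are illustrative only):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterReportSketch {

    // Call after job.waitForCompletion(true) has returned true.
    public static void printLineCounters(Job job) throws IOException {
        Counters counters = job.getCounters();
        long zero = counters.findCounter(CounterDemo.WORDS_IN_LINE_COUNTER.ZERO_WORDS).getValue();
        long upToFive = counters.findCounter(CounterDemo.WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS).getValue();
        long moreThanFive = counters.findCounter(CounterDemo.WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS).getValue();
        System.out.println("lines with no words: " + zero);
        System.out.println("lines with 1 to 5 words: " + upToFive);
        System.out.println("lines with more than 5 words: " + moreThanFive);
    }
}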
4. References
Mastering Hadoop, Sandeep Karanth (Chinese edition 《精通Hadoop》, translated by Liu Miao et al.)
Hadoop2.6.0运行mapreduce之推断(speculative)执行(上) (Hadoop 2.6.0 MapReduce speculative execution, part 1)