MapReduce Study Notes (5): Output and Job Counters

1. MapReduce Output

A job's output depends on the number of Reduce tasks in the job (each Reduce task writes one output file). Some optimization suggestions:

  1. Compress the output to save storage space; this also improves effective HDFS write throughput.
  2. Avoid writing out-of-band side files as Reduce task output.
  3. Depending on how the job's output files will be consumed, a splittable compression codec may be appropriate (see the sketch after this list).
  4. Write larger HDFS files with a larger block-size setting; this helps reduce the number of Map tasks in jobs that later read the output.
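
As a sketch of items 1 and 3, the snippet below turns on compressed output for a job and selects BZip2Codec, which is splittable (unlike Gzip), so downstream jobs can still split the output files across Map tasks. The FileOutputFormat calls are standard Hadoop APIs; the wrapper class name is only illustrative.

import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputConfig {

    /** Enables compressed, splittable output for the given job. */
    public static void enableSplittableCompression(Job job) {
        // Equivalent to setting mapreduce.output.fileoutputformat.compress=true
        FileOutputFormat.setCompressOutput(job, true);
        // BZip2 is a splittable codec, so consumers can still parallelize reads
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    }
}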

2. Speculative Execution of Tasks

A straggler is a task that runs slowly but will eventually complete successfully. A straggling Map task delays the start of Reduce tasks.

Hadoop cannot automatically fix stragglers, but it can detect tasks that are running noticeably slower than their peers. It then launches another, equivalent task as a backup and uses the output of whichever copy finishes first; the remaining copy is then told to stop. This technique is called speculative execution.

Speculative execution is enabled by default. The relevant properties:

| Property | Description |
| --- | --- |
| mapreduce.map.speculative | Controls speculative execution of Map tasks (default: true) |
| mapreduce.reduce.speculative | Controls speculative execution of Reduce tasks (default: true) |
| mapreduce.job.speculative.speculativecap | Fraction of total tasks that may be speculatively executed at once (default: 0.1, range 0-1) |
| mapreduce.job.speculative.slownodethreshold | Threshold for judging whether a TaskTracker is suitable for running a speculative copy of a task (default: 1) |
| mapreduce.job.speculative.slowtaskthreshold | Threshold for judging whether a task is slow enough to start a speculative copy (default: 1) |
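
When tasks have non-idempotent side effects, or the cluster is already saturated, speculative execution can be disabled per job. A minimal sketch using the standard Job API (the class name is illustrative); this has the same effect as setting the first two properties above to false:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfig {

    public static Job newJobWithoutSpeculation(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "no-speculation-demo");
        // Same effect as mapreduce.map.speculative=false
        job.setMapSpeculativeExecution(false);
        // Same effect as mapreduce.reduce.speculative=false
        job.setReduceSpeculativeExecution(false);
        return job;
    }
}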

3. MapReduce Job Counters

Counters collect statistics at the job level, helping with quality control, performance monitoring, and problem diagnosis for MapReduce jobs. Unlike log records, they are global by design, so they can be analyzed without an extra aggregation step.


Because counters are in effect global variables in a distributed environment, they must be used sparingly; otherwise tracking them places too much load on the cluster.
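
Besides the enum counters used in the demo below, counters can also be created dynamically by group and counter name. A one-line sketch as it would appear inside a map() or reduce() method; the group and counter names here are made up for illustration:

// "Quality" and "BadRecords" are illustrative names, not standard counters
context.getCounter("Quality", "BadRecords").increment(1);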

Example:

package com.zw.mr.counter;

import com.zw.util.HdfsUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.util.StringTokenizer;

/**
 * Counter demo: classifies each input line by its word count while
 * performing a standard word count.
 *
 * Created by zhangws on 16/10/11.
 */
public class CounterDemo {

    // Custom counter group: buckets input lines by how many words they contain
    public enum WORDS_IN_LINE_COUNTER {
        ZERO_WORDS,
        LESS_THAN_FIVE_WORDS,
        MORE_THAN_FIVE_WORDS
    }

    public static class CounterMapper extends Mapper<LongWritable, Text,
            Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            int words = tokenizer.countTokens();

            if (words == 0) {
                context.getCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS).increment(1);
            } else if (words <= 5) {
                context.getCounter(WORDS_IN_LINE_COUNTER.LESS_THAN_FIVE_WORDS).increment(1);
            } else {
                context.getCounter(WORDS_IN_LINE_COUNTER.MORE_THAN_FIVE_WORDS).increment(1);
            }
            // Emit each word with a count of 1, as in the classic word count
            while (tokenizer.hasMoreTokens()) {
                String target = tokenizer.nextToken();
                context.write(new Text(target), new IntWritable(1));
            }
        }
    }

    public static class CounterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {

            int count = 0;
            for (IntWritable v : values) {
                count += v.get();
            }
            // Emit the word and its total count
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] values = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (values.length < 2) {
            System.err.println("Usage: CounterDemo <in> [<in>...] <out>");
            System.exit(2);
        }

        // Remove the output directory first (HdfsUtil is the author's HDFS helper class)
        HdfsUtil.rmr(conf, values[values.length - 1]);

        Job job = Job.getInstance(conf, CounterDemo.class.getSimpleName());

        job.setJarByClass(CounterDemo.class);

        job.setMapperClass(CounterMapper.class);
        job.setReducerClass(CounterReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The last argument is the output path; all preceding ones are inputs
        for (int i = 0; i < values.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(values[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(values[values.length - 1]));

        if (job.waitForCompletion(true)) {
            HdfsUtil.cat(conf, values[values.length - 1] + "/part-r-00000");
            System.out.println("success");
        } else {
            System.out.println("fail");
        }
    }
}
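
After waitForCompletion returns, the same counters can be read back on the client through the Job API. A minimal sketch of what could be added to main() above:

        // Read a custom counter's final value on the client side
        long zeroWordLines = job.getCounters()
                .findCounter(WORDS_IN_LINE_COUNTER.ZERO_WORDS)
                .getValue();
        System.out.println("Lines with zero words: " + zeroWordLines);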

Input data file (the blank line is intentional):

xxx xxxx
hadoop spark
storm hadoop
storm storm storm storm storm storm

stop stop stop stop stop stop stop

Output:

    File System Counters
        FILE: Number of bytes read=792
        FILE: Number of bytes written=510734
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=214
        HDFS: Number of bytes written=45
        HDFS: Number of read operations=15
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=6
    Map-Reduce Framework
        Map input records=6
        Map output records=19
        Map output bytes=182
        Map output materialized bytes=226
        Input split bytes=104
        Combine input records=0
        Combine output records=0
        Reduce input groups=6
        Reduce shuffle bytes=226
        Reduce input records=19
        Reduce output records=6
        Spilled Records=38
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=2
        Total committed heap usage (bytes)=391118848
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    com.zw.mr.counter.CounterDemo$WORDS_IN_LINE_COUNTER
        LESS_THAN_FIVE_WORDS=3
        MORE_THAN_FIVE_WORDS=2
        ZERO_WORDS=1
    File Input Format Counters 
        Bytes Read=107
    File Output Format Counters 
        Bytes Written=45
hadoop  2
spark   1
stop    7
storm   7
xxx 1
xxxx    1

4. References

《精通Hadoop》 (Mastering Hadoop), Sandeep Karanth; Chinese translation by 刘淼 et al.

Hadoop2.6.0运行mapreduce之推断(speculative)执行(上)

Hadoop2.6.0运行mapreduce之推断(speculative)执行(下)

Hadoop中Speculative Task调度策略
