MapReduce Counters Explained

Counters: counters record a job's progress and status as it runs; you can think of them as a kind of log. We typically insert a counter at some point in the program to track how the data or the progress changes, and counters are easier to analyze than raw log output.


1. Built-in counters

Hadoop actually ships with many built-in counters. Where can we see them?

Let's start with the simplest WordCount program.

The source file on HDFS:

[hadoop@master logfile]$ hadoop fs -cat  /MR_Counter/diary
Today is 2016-3-22
I study mapreduce counter
I realized that mapreduce counter is simple

WordCount.java:

package com.oner.mr.mrcounter;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

	public static void main(String[] args) throws IOException,
			InterruptedException, URISyntaxException, ClassNotFoundException {
		Path inPath = new Path("/MR_Counter/");// input directory
		Path outPath = new Path("/MR_Counter/out");// output directory

		Configuration conf = new Configuration();
		// conf.set("fs.defaultFS", "hdfs://master:9000");
		FileSystem fs = FileSystem.get(new URI("hdfs://master:9000"), conf,
				"hadoop");
		if (fs.exists(outPath)) {// delete the output directory if it already exists
			fs.delete(outPath, true);
		}

		Job job = Job.getInstance(conf);

		job.setJarByClass(WordCount.class);

		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		FileInputFormat.setInputPaths(job, inPath);
		FileOutputFormat.setOutputPath(job, outPath);

		job.waitForCompletion(true);
	}

	public static class MyMapper extends
			Mapper<LongWritable, Text, Text, LongWritable> {

		private static Text k = new Text();
		private static LongWritable v = new LongWritable(1);

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			String[] words = line.split(" ");
			for (String word : words) {
				k.set(word);
				context.write(k, v);
			}
		}
	}

	public static class MyReducer extends
			Reducer<Text, LongWritable, Text, LongWritable> {

		private static LongWritable v = new LongWritable();

		@Override
		protected void reduce(Text key, Iterable<LongWritable> values,
				Reducer<Text, LongWritable, Text, LongWritable>.Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (LongWritable value : values) {
				sum += value.get();
			}

			v.set(sum);

			context.write(key, v);
		}
	}

}


Package it into a jar and run: hadoop jar wc.jar com.oner.mr.mrcounter.WordCount

The console prints the following (the comments after // are my own annotations):

16/03/22 14:25:30 INFO mapreduce.Job: Counters: 49 // this job reported 49 counters in total
	File System Counters // bytes and operations on the local FS and on HDFS
		FILE: Number of bytes read=235
		FILE: Number of bytes written=230421
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=189
		HDFS: Number of bytes written=86
		HDFS: Number of read operations=6
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters // job-level counters
		Launched map tasks=1 // one map task was launched
		Launched reduce tasks=1 // one reduce task was launched
		Data-local map tasks=1 
		Total time spent by all maps in occupied slots (ms)=12118
		Total time spent by all reduces in occupied slots (ms)=11691
		Total time spent by all map tasks (ms)=12118
		Total time spent by all reduce tasks (ms)=11691
		Total vcore-seconds taken by all map tasks=12118
		Total vcore-seconds taken by all reduce tasks=11691
		Total megabyte-seconds taken by all map tasks=12408832
		Total megabyte-seconds taken by all reduce tasks=11971584
	Map-Reduce Framework // MapReduce framework counters
		Map input records=3
		Map output records=14
		Map output bytes=201
		Map output materialized bytes=235
		Input split bytes=100
		Combine input records=0
		Combine output records=0
		Reduce input groups=10
		Reduce shuffle bytes=235
		Reduce input records=14
		Reduce output records=10
		Spilled Records=28
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=331
		CPU time spent (ms)=2820
		Physical memory (bytes) snapshot=306024448
		Virtual memory (bytes) snapshot=1690583040
		Total committed heap usage (bytes)=136122368
	Shuffle Errors // shuffle error counters
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters // counters from the InputFormat
		Bytes Read=89 // bytes the map read from HDFS: 89 in total
	File Output Format Counters // counters from the OutputFormat
		Bytes Written=86 // bytes the reduce wrote to HDFS: 86 in total

The information above comes from the built-in counters, which fall into the following groups:

File System Counters

Job Counters

Map-Reduce Framework counters

Shuffle Errors

File Input Format Counters

File Output Format Counters
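
Beyond reading these groups off the console, the driver can also query them programmatically once the job finishes. The following is a minimal sketch, not part of the original program: it assumes it is placed in main() after waitForCompletion() returns, and it uses Hadoop 2.x's Counters and TaskCounter classes.

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.TaskCounter;

// In main(), after job.waitForCompletion(true) has returned:
Counters counters = job.getCounters();

// Built-in framework counters are addressed by enum constants.
long mapInputRecords =
		counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
long reduceOutputRecords =
		counters.findCounter(TaskCounter.REDUCE_OUTPUT_RECORDS).getValue();

System.out.println("Map input records: " + mapInputRecords); // 3 for the diary file
System.out.println("Reduce output records: " + reduceOutputRecords); // 10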


2. Custom counters

Hadoop also supports user-defined counters. In Hadoop 2.x a custom counter is obtained through Context's getCounter() method (strictly speaking, the method is declared on the TaskAttemptContext interface, which Context inherits):

public Counter getCounter(Enum<?> counterName):Get the Counter for the given counterName

public Counter getCounter(String groupName, String counterName):Get the Counter for the given groupName and counterName

So a counter can be obtained either through an enum constant or through a pair of strings; a minimal sketch of both styles follows.
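
Both styles inside a map() method; the MyCounters enum and the group/counter names here are hypothetical, not part of the WordCount example:

// Hypothetical enum: Hadoop uses the enum's class name as the group name
// and the constant name as the counter name.
enum MyCounters {
	EMPTY_LINES
}

// Inside map():
if (value.toString().trim().isEmpty()) {
	// enum style
	context.getCounter(MyCounters.EMPTY_LINES).increment(1L);
	// string style: group name and counter name given explicitly
	context.getCounter("Data Quality", "EmptyLines").increment(1L);
}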

The commonly used Counter methods are the following (an illustrative fragment comes after the list):

String getName():Get the name of the counter

String getDisplayName():Get the display name of the counter

long getValue():Get the current value

void setValue(long value):Set this counter by the given value

void increment(long incr):Increment this counter by the given value
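
For illustration, a hypothetical fragment that touches each method; the group and counter names are made up, and in real jobs increment() is almost always the only mutator you need:

Counter c = context.getCounter("Data Quality", "EmptyLines");

c.increment(1L);                     // add 1 to this task's tally
long current = c.getValue();         // read the current (task-local) value
c.setValue(0L);                      // overwrite the value outright; rarely needed
String name = c.getName();           // "EmptyLines"
String display = c.getDisplayName(); // same as getName() for custom counters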


Suppose we now want to print to the console the number of occurrences of certain sensitive words in the source file, with "mapreduce" designated as the sensitive word. How do we do it?

package com.oner.mr.mrcounter;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

	public static void main(String[] args) throws IOException,
			InterruptedException, URISyntaxException, ClassNotFoundException {
		Path inPath = new Path("/MR_Counter/");// input directory
		Path outPath = new Path("/MR_Counter/out");// output directory

		Configuration conf = new Configuration();
		// conf.set("fs.defaultFS", "hdfs://master:9000");
		FileSystem fs = FileSystem.get(new URI("hdfs://master:9000"), conf,
				"hadoop");
		if (fs.exists(outPath)) {// delete the output directory if it already exists
			fs.delete(outPath, true);
		}

		Job job = Job.getInstance(conf);

		job.setJarByClass(WordCount.class);

		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(TextOutputFormat.class);

		FileInputFormat.setInputPaths(job, inPath);
		FileOutputFormat.setOutputPath(job, outPath);

		job.waitForCompletion(true);

	}

	public static class MyMapper extends
			Mapper<LongWritable, Text, Text, LongWritable> {

		private static Text k = new Text();
		private static LongWritable v = new LongWritable(1);

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {

			Counter sensitiveCounter = context.getCounter("Sensitive Words:",
					"mapreduce");// create a counter in group "Sensitive Words:" named "mapreduce"

			String line = value.toString();
			String[] words = line.split(" ");

			for (String word : words) {
				if (word.equalsIgnoreCase("mapreduce")) {// bump the counter each time "mapreduce" appears
					sensitiveCounter.increment(1L);
				}
				k.set(word);
				context.write(k, v);
			}
		}
	}

	public static class MyReducer extends
			Reducer<Text, LongWritable, Text, LongWritable> {

		private static LongWritable v = new LongWritable();

		@Override
		protected void reduce(Text key, Iterable<LongWritable> values,
				Reducer<Text, LongWritable, Text, LongWritable>.Context context)
				throws IOException, InterruptedException {
			int sum = 0;
			for (LongWritable value : values) {
				sum += value.get();
			}

			v.set(sum);

			context.write(key, v);
		}
	}

}

After repackaging the jar and rerunning the job, the console output indeed gains a new counter group, Sensitive Words:, containing one counter named mapreduce with a value of 2.
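
If the driver should act on the value itself rather than leaving it in the console log, it can read the counter back once the job finishes. A minimal sketch, assuming the last line of main() is changed as follows (the group and counter strings must match those used in the Mapper):

import org.apache.hadoop.mapreduce.Counters;

// In main(), replacing the plain job.waitForCompletion(true) call:
boolean success = job.waitForCompletion(true);
if (success) {
	Counters counters = job.getCounters();
	long hits = counters.findCounter("Sensitive Words:", "mapreduce").getValue();
	System.out.println("Occurrences of the sensitive word: " + hits); // 2 for the diary file
}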


