Counters in Hadoop are somewhat like logs: they report statistics about a job while it runs. In the WordCount example we ran earlier, the console output included the following (you can re-run the WordCount example to see it):
Counters: 38
File System Counters  # 10 counters
FILE: Number of bytes read=462
FILE: Number of bytes written=541399
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=38
HDFS: Number of bytes written=19
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework  # 20 counters
Map input records=2
Map output records=4
Map output bytes=35
Map output materialized bytes=49
Input split bytes=109
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=49
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=59
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=242360320
Shuffle Errors  # 6 counters
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters  # 1 counter
Bytes Read=19
File Output Format Counters  # 1 counter
Bytes Written=19
As the log shows, there are 38 counters in total, divided into 5 groups; names such as File System Counters are group names, and the groups contain 10, 20, 6, 1, and 1 counters respectively.
We do not care about every one of these 38 counters. Below we focus on the ones that matter most.
I. A Closer Look at the Counters
1. File Input Format Counters
File Input Format Counters  # 1 counter
Bytes Read=19
This counter shows that a total of 19 bytes were read from the file in HDFS.
Recall the content of word.txt from earlier:
hello you
hello me
The letters add up to 5+3+5+2=15 bytes; adding the 2 spaces and the 2 newline characters (one ending each line) gives exactly 19 bytes.
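The arithmetic above can be verified with a few lines of plain Java. The exact file content, including the trailing newline, is an assumption inferred from the counter value:

```java
import java.nio.charset.StandardCharsets;

public class ByteCount {
    public static void main(String[] args) {
        // Assumed exact content of word.txt: two lines, each ending with '\n'
        String content = "hello you\nhello me\n";
        int bytes = content.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(bytes); // prints 19, matching Bytes Read=19
    }
}
```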
2. Map-Reduce Framework
Map-Reduce Framework  # 20 counters
Map input records=2
Map output records=4
Map output bytes=35
Map output materialized bytes=49
Input split bytes=109
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=49
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=59
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=242360320
Map input records=2
The content of the input file is:
hello you
hello me
which is exactly 2 lines, i.e. 2 input records.
Map output records=4
In our mapper, every time a word is read we emit one key-value pair, so the output of the map task is:
<hello,1>
<you,1>
<hello,1>
<me,1>
exactly four records.
Reduce input records=4
The records output by map are exactly the records input to reduce, so this is also 4.
Reduce input groups=3
We will explain the concept of grouping (group) in detail later. Essentially, the mapper's output records are grouped by key, so records with the same key go into one group. After grouping we get:
<hello,{1,1}>
<you,{1}>
<me,{1}>
exactly 3 groups.
Reduce output records=3
The output of the WordCount example is:
hello 2
you 1
me 1
exactly 3 lines.
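The relationship between these counters can be illustrated with a small plain-Java sketch. This is only a simulation of the group-then-reduce steps, not Hadoop code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Map output: one (word, 1) pair per token -> Map output records=4
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("hello", 1), Map.entry("you", 1),
                Map.entry("hello", 1), Map.entry("me", 1));

        // Grouping: values with the same key are collected into one group
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            groups.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        System.out.println(groups.size()); // prints 3 -> Reduce input groups=3

        // Reduce: one output record per group -> Reduce output records=3
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            System.out.println(g.getKey() + "\t" + sum);
        }
    }
}
```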
Combine input records=0, Combine output records=0
These belong to the combiner (local aggregation); we will explain the combiner in detail later.
II. Custom Counters
A counter is represented by a Counter object. Every counter belongs to a group: as long as the group name (groupName) is the same, the counters automatically belong to the same group. Each counter also has its own name (counterName) to distinguish it from the other counters in the same group.
A counter instance is obtained as follows:
Counter counter = context.getCounter(groupName, counterName);
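To make the two-level groupName/counterName idea concrete, here is a minimal plain-Java sketch of a counter registry. This is only an illustration of the naming scheme, not Hadoop's actual implementation; the class name CounterRegistry is made up:

```java
import java.util.HashMap;
import java.util.Map;

public class CounterRegistry {
    // groupName -> (counterName -> value), mirroring the two-level naming
    private final Map<String, Map<String, Long>> groups = new HashMap<>();

    // Analogous to context.getCounter(groupName, counterName).increment(amount)
    public void increment(String groupName, String counterName, long amount) {
        groups.computeIfAbsent(groupName, g -> new HashMap<>())
              .merge(counterName, amount, Long::sum);
    }

    public long get(String groupName, String counterName) {
        return groups.getOrDefault(groupName, Map.of()).getOrDefault(counterName, 0L);
    }

    public static void main(String[] args) {
        CounterRegistry registry = new CounterRegistry();
        registry.increment("Custom Group", "Sensitive words", 1);
        registry.increment("Custom Group", "Sensitive words", 1);
        System.out.println(registry.get("Custom Group", "Sensitive words")); // prints 2
    }
}
```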
For example, suppose we want to count sensitive words, i.e. how many times sensitive words appear in a piece of text, and we treat "hello" as a sensitive word. Building on the WordCount example, we can modify TokenizerMapper as follows:

public static class TokenizerMapper extends
        Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // StringTokenizer is a java.util class that splits a string on whitespace
        StringTokenizer itr = new StringTokenizer(value.toString());
        // Custom counter
        String groupName = "Custom Group";
        String counterName = "Sensitive words";
        Counter counter = context.getCounter(groupName, counterName);
        // For every word encountered, emit (word, 1)
        while (itr.hasMoreTokens()) {
            String nextToken = itr.nextToken();
            // "hello" is our sensitive word: increment the counter on each occurrence
            if (nextToken.equals("hello")) {
                counter.increment(1);
            }
            word.set(nextToken);
            context.write(word, one);
        }
    }
}
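As a sanity check of the mapper logic above, the following standalone snippet (plain Java, no Hadoop required; the class name is made up) tokenizes the same two input lines the way the mapper does and counts the occurrences of "hello":

```java
import java.util.StringTokenizer;

public class SensitiveWordCheck {
    public static void main(String[] args) {
        // Same input lines as word.txt; the mapper sees one line per map() call
        String[] lines = {"hello you", "hello me"};
        long counter = 0;
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                if (itr.nextToken().equals("hello")) {
                    counter++; // mirrors counter.increment(1) in the mapper
                }
            }
        }
        System.out.println(counter); // prints 2, matching Sensitive words=2
    }
}
```

After a real job finishes, the same value can also be read in the driver via job.getCounters().findCounter(groupName, counterName).getValue().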
Run the WordCount example again, and the console output now includes our custom counter:
Counters: 39
File System Counters
FILE: Number of bytes read=462
FILE: Number of bytes written=541399
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=38
HDFS: Number of bytes written=19
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=6
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=35
Map output materialized bytes=49
Input split bytes=109
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=49
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=38
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=242360320
Custom Group  # our custom group name
Sensitive words=2  # the value of our custom counter is 2
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=19
File Output Format Counters
Bytes Written=19