Map Aggregation
Aggregation
The term refers to a Mapper combining its <key, value> pairs, with the goal of reducing the amount of network traffic between the Mapper and the Reducer.
There are two ways to perform Map Aggregation in Hadoop:
Combiners --- The MapReduce framework has the concept of a Combiner, where you write a class that defines the aggregation, and the framework decides when, and how many times, to perform it (possibly not at all).
In-map Aggregation --- The Mapper contains logic that aggregates records, typically accomplished by buffering records in memory prior to writing them out.
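The buffering idea behind in-map aggregation can be illustrated outside Hadoop. A minimal plain-Java sketch, assuming whitespace-delimited input (the class and method names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class InMapAggregationSketch {
    // Instead of emitting one <word, 1> pair per word, buffer the counts
    // in memory and emit one <word, count> pair per distinct word.
    public static Map<String, Integer> aggregate(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : line.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(aggregate("to be or not to be"));
    }
}
```

In a real Mapper the buffer would live across many map() calls and be flushed in cleanup(), as in the TopResultsMapper example later in this section.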
Overview of Combiners
The <key, value> records output by the Mapper are serialized, so the Combiner has to deserialize them.
A Combiner only aggregates data on one node. It does not combine the output of multiple Mappers.
Reduce-side Combining
The Combiner is also used in the reduce phase if the intermediate <key, value> pairs from Mappers are spilled to disk.
In other words, the Reducer uses the Combiner behind the scenes to reduce file I/O during the merge of spilled data.
Counters
The pre-defined counters include useful information, like the number of map input records or the number of bytes written to HDFS.
Hadoop counters are global: they are a summation of events that occur across the entire cluster.
User-defined Counters
Two ways to define your own counter in Hadoop:
1. Use an enum to define a group, and the elements in the enum become the counter names.
2. Use strings for the group name and counter name.
Combiner Example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outputValue = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
                          Context context)
            throws IOException, InterruptedException {
        // Sum the partial counts for this key before they leave the node.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        outputValue.set(sum);
        context.write(key, outputValue);
    }
}
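To have the framework use a Combiner, it must be registered in the job driver. A minimal configuration sketch (the job name and the surrounding Mapper/Reducer setup are assumed, not shown in this section):

```java
// Driver-side wiring (sketch). Note that the framework, not the
// programmer, decides when (and whether) the Combiner actually runs.
Job job = Job.getInstance(new Configuration(), "word count");
job.setCombinerClass(WordCountCombiner.class);
```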
In-Map Aggregation
In-Map Aggregation Example
import java.io.IOException;
import java.util.ArrayList;
import java.util.PriorityQueue;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

public class TopResultsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private ArrayList<Word> words = new ArrayList<Word>();
    private PriorityQueue<Word> queue;
    private int maxResults;

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        maxResults = Integer.parseInt(context.getConfiguration()
                .get("maxResults"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Buffer word frequencies in memory; nothing is written here.
        String[] input = StringUtils.split(value.toString(), '\\', ' ');
        for (String word : input) {
            Word currentWord = new Word(word, 1);
            if (words.contains(currentWord)) {
                // Increment the existing Word's frequency.
                for (Word w : words) {
                    if (w.equals(currentWord)) {
                        w.frequency++;
                        break;
                    }
                }
            } else {
                words.add(currentWord);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // All input has been consumed; emit only the top maxResults words.
        Text outputKey = new Text();
        IntWritable outputValue = new IntWritable();
        // PriorityQueue requires an initial capacity of at least 1.
        queue = new PriorityQueue<Word>(Math.max(1, words.size()));
        queue.addAll(words);
        for (int i = 1; i <= maxResults; i++) {
            Word tail = queue.poll();
            if (tail != null) {
                outputKey.set(tail.value);
                outputValue.set(tail.frequency);
                context.write(outputKey, outputValue);
            }
        }
    }
}
public class Word implements Comparable<Word> {

    public String value;
    public int frequency;

    public Word(String value, int frequency) {
        this.value = value;
        this.frequency = frequency;
    }

    // Case-insensitive equality; ArrayList.contains uses equals, not
    // hashCode, so no hashCode override is needed here.
    @Override
    public boolean equals(Object obj) {
        if (obj instanceof Word) {
            return value.equalsIgnoreCase(((Word) obj).value);
        } else {
            return false;
        }
    }

    // Orders Words by descending frequency, so a PriorityQueue polls the
    // most frequent word first.
    @Override
    public int compareTo(Word w) {
        return w.frequency - this.frequency;
    }
}
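The cleanup-phase top-N polling can be exercised outside Hadoop. A minimal sketch using a stand-in for the Word class above (same descending-frequency compareTo; the class names are illustrative):

```java
import java.util.PriorityQueue;

public class TopWordsDemo {
    // Minimal stand-in for the Word class: compareTo orders by
    // descending frequency.
    static class Word implements Comparable<Word> {
        String value;
        int frequency;
        Word(String value, int frequency) {
            this.value = value;
            this.frequency = frequency;
        }
        @Override
        public int compareTo(Word w) {
            return w.frequency - this.frequency;
        }
    }

    public static void main(String[] args) {
        PriorityQueue<Word> queue = new PriorityQueue<Word>();
        queue.add(new Word("map", 2));
        queue.add(new Word("reduce", 9));
        queue.add(new Word("hadoop", 5));
        // A PriorityQueue polls its "smallest" element; with this
        // ordering that is the most frequent word.
        Word top = queue.poll();
        System.out.println(top.value + " " + top.frequency); // reduce 9
    }
}
```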
public enum MyCounters {
GOOD_RECORDS, BAD_RECORDS
}
context.getCounter(MyCounters.GOOD_RECORDS).increment(1);
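The second, string-based form requires no enum. A one-line sketch in the same style as the enum example above (the group and counter names here are illustrative, not framework constants):

```java
// "MyGroup" and "GOOD_RECORDS" are hypothetical names chosen for this sketch.
context.getCounter("MyGroup", "GOOD_RECORDS").increment(1);
```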