“Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.” -- 《Hadoop: The Definitive Guide》


reducer端的combiner

while ( my $line = <STDIN> ) { chomp($line);
( $user_id,$country, $timestamp ) = split( /\t/,$line );

# set base key
$key =$base_key;

if ($cur_key) { if ($key ne $cur_key ) { &onEndKey(); &onBeginKey(); } &onSameKey(); } else { &onBeginKey(); &onSameKey(); } } if ($cur_key) {
&onEndKey();
}

• 广告
• 抄袭
• 版权
• 政治
• 色情
• 无意义
• 其他

120