In MapReduce, the programmer needs to implement only the mapper and the reducer, and optionally the combiner and the partitioner.
The execution framework handles everything else.
Local Aggregation
Local aggregation of intermediate results is one of the keys to efficient algorithms.
By using the combiner and by preserving state across multiple inputs, it is often possible to substantially reduce both the number and the size of the key-value pairs that need to be shuffled from the mappers to the reducers.
Importance of Local Aggregation
- Ideal scaling characteristics: Twice the data, twice the running time; Twice the resources, half the running time
- Why can’t we achieve this? Synchronization requires communication; Communication kills performance
- Thus, we can avoid communication: Reduce intermediate data via local aggregation; Combiners can help, too
Example 1: Word Count
//Baseline
class MAPPER
    method MAP(docid a, doc d) //the words to count are in doc d
        for all term t∈doc d do
            EMIT(term t, count 1)
class REDUCER
    method REDUCE(term t, counts[c1, c2,...])
        sum = 0
        for all count c∈counts[c1, c2,...] do
            sum = sum + c
        EMIT(term t, count sum)
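The baseline can be sketched in Python as follows. This is a minimal single-process illustration, not a real MapReduce job: `run_job` is a hypothetical driver that stands in for the execution framework, and the `emitted` counter is added only to show how many intermediate pairs would have to be shuffled.

```python
from collections import defaultdict

def map_fn(docid, doc):
    """Baseline mapper: emit (term, 1) for every token in the document."""
    for term in doc.split():
        yield term, 1

def reduce_fn(term, counts):
    """Reducer: sum all partial counts for one term."""
    return term, sum(counts)

def run_job(docs):
    """Hypothetical toy driver standing in for the execution framework."""
    shuffled = defaultdict(list)  # groups intermediate values by key
    emitted = 0                   # number of intermediate pairs produced
    for docid, doc in docs.items():
        for term, count in map_fn(docid, doc):
            shuffled[term].append(count)
            emitted += 1
    result = dict(reduce_fn(t, cs) for t, cs in shuffled.items())
    return result, emitted

docs = {"d1": "a b a", "d2": "b c a"}
result, emitted = run_job(docs)
print(result, emitted)
```

With this input the mapper emits one pair per token, so six pairs are shuffled to produce three final counts.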
//Version 1
class MAPPER
    method MAP(docid a, doc d) //the words to count are in doc d
        H = new ASSOCIATIVEARRAY //H is local to one MAP call, so counts are combined within a single document
        for all term t∈doc d do
            H{t} = H{t} + 1 //Tally counts for the entire document
        for all term t∈H do
            EMIT(term t, count H{t})
//Version 2
class MAPPER
    method INITIALIZE
        H = new ASSOCIATIVEARRAY //H is defined outside MAP, so counts are combined across all MAP calls of this mapper
    method MAP(docid a, doc d) //the words to count are in doc d
        for all term t∈doc d do
            H{t} = H{t} + 1 //Tally counts across all documents seen by this mapper
    method CLOSE
        for all term t∈H do
            EMIT(term t, count H{t})
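A minimal Python sketch of Version 2 follows. The class name `InMapperCombiningMapper` and the helper `run_mapper_task` are hypothetical; the helper only imitates how a framework would call INITIALIZE once, MAP once per input record, and CLOSE once at the end of a mapper task.

```python
from collections import defaultdict

class InMapperCombiningMapper:
    def initialize(self):
        # State that survives across map() calls within one mapper task.
        self.h = defaultdict(int)

    def map(self, docid, doc):
        # Only update the in-memory tally; nothing is emitted here.
        for term in doc.split():
            self.h[term] += 1

    def close(self):
        # Emit one (term, count) pair per distinct term seen by this mapper.
        yield from self.h.items()

def run_mapper_task(docs):
    """Hypothetical stand-in for the framework driving one mapper task."""
    mapper = InMapperCombiningMapper()
    mapper.initialize()
    for docid, doc in docs:
        mapper.map(docid, doc)
    return list(mapper.close())

split = [("d1", "a b a"), ("d2", "b c a")]
pairs = run_mapper_task(split)
print(pairs)
```

For this input split the mapper emits three pairs, one per distinct term, although the two documents contain six tokens in total.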
Design Pattern for Local Aggregation
- “In-mapper combining”
- Fold the functionality of the combiner into the mapper by preserving state across multiple map calls
- Advantages
- Speed
- Disadvantages
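- Explicit memory management required: H cannot be freed when a single MAP call returns; it must stay in memory until CLOSE and can grow with the number of distinct keys the mapper sees
- Potential for order-dependent bugs: because state is preserved across MAP calls, behavior may depend on the order in which input key-value pairs are encountered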
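One common way to handle the memory concern is to flush partial results whenever the in-memory table grows too large. The sketch below assumes this flushing strategy; it is not part of the notes above, and the class name `BoundedInMapperCombiner`, the `max_keys` threshold, and the `out` list (a stand-in for emitted pairs) are all hypothetical.

```python
from collections import defaultdict

class BoundedInMapperCombiner:
    def __init__(self, max_keys):
        self.max_keys = max_keys   # hypothetical cap on distinct keys held in memory
        self.h = defaultdict(int)
        self.out = []              # stand-in for pairs emitted to the framework

    def _flush(self):
        # Emit the partial counts accumulated so far and release the memory.
        self.out.extend(self.h.items())
        self.h.clear()

    def map(self, docid, doc):
        for term in doc.split():
            self.h[term] += 1
            if len(self.h) >= self.max_keys:
                self._flush()

    def close(self):
        self._flush()
        return self.out

m = BoundedInMapperCombiner(max_keys=3)
m.map("d1", "a b a")
m.map("d2", "b c a")
partials = m.close()

# The reducer still sums partial counts per term, so the totals are unchanged.
totals = defaultdict(int)
for term, count in partials:
    totals[term] += count
print(partials, dict(totals))
```

A term may now be emitted more than once by the same mapper, which is harmless for word count because the reducer adds up the partial counts.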
Combiner Design
- Combiners and Reducers share the same method signature
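- However, a combiner generally aggregates intermediate results locally (a “mini-reducer”), whereas a reducer aggregates the output of different mappers
- Combiners are optional optimizations
- Should not affect algorithm correctness
- May be run 0, 1 or multiple times: therefore the combiner's output format should match the mapper's output format, which is also the reducer's input format
Example 2: Find the average of the integers associated with the same key (for example: find the average age of everyone named “Zhang San”)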
//Version 1
class MAPPER
    method MAP(string t, integer r)
        EMIT(string t, integer r)
class REDUCER
    method REDUCE(string t, integers[r1, r2, ...])
        sum = 0
        cnt = 0
        for all integer r∈integers[r1, r2, ...] do
            sum = sum + r
            cnt = cnt + 1
        r_avg = sum / cnt
        EMIT(string t, integer r_avg)
//Version 2
//In fact, this version is not valid
//Combiners are optional: when the combiner does not run, the reducer cannot work, because its input format (pairs) does not match the mapper's output format (integers)
class MAPPER
    method MAP(string t, integer r)
        EMIT(string t, integer r)
//Aggregate the ages of the people named “Zhang San” within one mapper
class COMBINER
    method COMBINE(string t, integers[r1, r2, ...])
        sum = 0
        cnt = 0
        for all integer r∈integers[r1, r2, ...] do
            sum = sum + r
            cnt = cnt + 1
        EMIT(string t, pair(sum, cnt))
class REDUCER
    method REDUCE(string t, pairs[(s1, c1), (s2, c2), ...])
        sum = 0
        cnt = 0
        for all pair (s,c)∈pairs[(s1, c1), (s2, c2), ...] do
            sum = sum + s //add the partial sum
            cnt = cnt + c //add the partial count
        r_avg = sum / cnt
        EMIT(string t, integer r_avg)
//Version 3
//Change the mapper's output format so that it matches the combiner's output format
class MAPPER
    method MAP(string t, integer r)
        EMIT(string t, pair(r, 1))
class COMBINER
    method COMBINE(string t, pairs[(s1, c1), (s2, c2), ...])
        sum = 0
        cnt = 0
        for all pair (s,c)∈pairs[(s1, c1), (s2, c2), ...] do
            sum = sum + s
            cnt = cnt + c
        EMIT(string t, pair(sum, cnt))
class REDUCER
    method REDUCE(string t, pairs[(s1, c1), (s2, c2), ...])
        sum = 0
        cnt = 0
        for all pair (s,c)∈pairs[(s1, c1), (s2, c2), ...] do
            sum = sum + s
            cnt = cnt + c
        r_avg = sum / cnt
        EMIT(string t, integer r_avg)
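The property that makes Version 3 correct is that the result does not change whether the combiner runs 0, 1, or several times. The Python sketch below checks this on a tiny input; `run_job` and its `combiner_passes` parameter are hypothetical, and the driver only imitates a framework that may apply the combiner any number of times to each mapper's local output.

```python
from collections import defaultdict

def map_fn(key, value):
    # Emit (value, 1) so mapper output already has the (sum, count) shape.
    yield key, (value, 1)

def combine_fn(key, pairs):
    # Same input and output type: a list of (sum, count) pairs in, one pair out.
    return key, (sum(s for s, _ in pairs), sum(c for _, c in pairs))

def reduce_fn(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, total / count

def run_job(splits, combiner_passes):
    """Hypothetical driver: applies the combiner `combiner_passes` times per split."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for key, value in split:
            for k, pair in map_fn(key, value):
                local[k].append(pair)
        for _ in range(combiner_passes):
            local = {k: [combine_fn(k, ps)[1]] for k, ps in local.items()}
        for k, ps in local.items():
            shuffled[k].extend(ps)
    return dict(reduce_fn(k, ps) for k, ps in shuffled.items())

splits = [[("zhang", 20), ("zhang", 30)], [("zhang", 40)]]
results = [run_job(splits, n) for n in (0, 1, 2)]
print(results)
```

All three runs give an average of 30.0. Averaging the per-split averages instead, (25 + 40) / 2 = 32.5, would be wrong, which is why sums and counts are carried separately until the reducer.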
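//Version 4
//Two variables, S (sums) and C (counts), preserve state across the MAP calls of a mapper
//Because S and C cover all input of the mapper, a separate combiner is no longer needed
//The reducer is the same as in Version 3
class MAPPER
    method INITIALIZE
        S = new ASSOCIATIVEARRAY
        C = new ASSOCIATIVEARRAY
    method MAP(string t, integer r)
        S{t} = S{t} + r
        C{t} = C{t} + 1
    method CLOSE
        for all term t∈S do
            EMIT(term t, pair(S{t}, C{t}))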
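Version 4 can be sketched in Python as below, assuming a reducer that sums the partial sums and counts. The names `MeanMapper` and `run_job` are hypothetical, and the driver simply runs one mapper object per input split before grouping the emitted pairs by key.

```python
from collections import defaultdict

class MeanMapper:
    def initialize(self):
        self.s = defaultdict(int)  # running sum per key
        self.c = defaultdict(int)  # running count per key

    def map(self, key, value):
        self.s[key] += value
        self.c[key] += 1

    def close(self):
        # Emit one (sum, count) pair per key seen by this mapper task.
        for key in self.s:
            yield key, (self.s[key], self.c[key])

def reduce_fn(key, pairs):
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return key, total / count

def run_job(splits):
    """Hypothetical driver: one mapper task per split, then a grouped reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        mapper = MeanMapper()
        mapper.initialize()
        for key, value in split:
            mapper.map(key, value)
        for key, pair in mapper.close():
            shuffled[key].append(pair)
    return dict(reduce_fn(k, ps) for k, ps in shuffled.items())

splits = [[("zhang", 20), ("zhang", 30), ("li", 50)], [("zhang", 40)]]
means = run_job(splits)
print(means)
```

Each mapper task emits at most one pair per key, so no combiner is involved and the reducer still sees well-formed (sum, count) pairs.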