###### MapReduce Patterns, Algorithms, and Use Cases

```
class Mapper
    method Map(docid id, doc d)
        for all term t in doc d do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2,...])
        sum = 0
        for all count c in [c1, c2,...] do
            sum = sum + c
        Emit(term t, count sum)
```

```
class Mapper
    method Map(docid id, doc d)
        H = new AssociativeArray
        for all term t in doc d do
            H{t} = H{t} + 1
        for all term t in H do
            Emit(term t, count H{t})
```

```
class Mapper
    method Map(docid id, doc d)
        for all term t in doc d do
            Emit(term t, count 1)

class Combiner
    method Combine(term t, [c1, c2,...])
        sum = 0
        for all count c in [c1, c2,...] do
            sum = sum + c
        Emit(term t, count sum)

class Reducer
    method Reduce(term t, counts [c1, c2,...])
        sum = 0
        for all count c in [c1, c2,...] do
            sum = sum + c
        Emit(term t, count sum)
```
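The word-count variants above can be sketched as a single-process Python simulation. This is not real MapReduce; the sample documents and the function names (`map_phase`, `combine`, `reduce_phase`) are my own, chosen to mirror the roles in the pseudocode:

```python
from collections import defaultdict

def map_phase(docs):
    """Mapper: emit (term, 1) for every term in every document."""
    for doc in docs:
        for term in doc.split():
            yield term, 1

def combine(pairs):
    """Combiner: pre-aggregate counts before the shuffle."""
    partial = defaultdict(int)
    for term, count in pairs:
        partial[term] += count
    return partial.items()

def reduce_phase(pairs):
    """Reducer: sum the (possibly pre-aggregated) counts per term."""
    totals = defaultdict(int)
    for term, count in pairs:
        totals[term] += count
    return dict(totals)

docs = ["a b a", "b c"]
counts = reduce_phase(combine(map_phase(docs)))
# counts == {'a': 2, 'b': 2, 'c': 1}
```

Because summation is associative and commutative, the combiner can reuse the reducer's logic without changing the result.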

Log analysis, data querying

ETL, data analysis

The network is stored as a set of nodes, where each node contains a list of its adjacent node IDs. Conceptually, MapReduce jobs run iteratively: in each iteration, every node sends messages to its neighbors, and each neighbor updates its own state based on the messages it receives. Iteration stops when some condition is met, such as reaching a maximum number of iterations (e.g., the network diameter) or two consecutive iterations producing almost no state changes. Technically, the Mapper emits each message keyed by the ID of the receiving adjacent node, so all messages are grouped by receiving node; the Reducer can then recompute each node's state and rewrite the nodes whose state changed. The algorithm is shown below:

```
class Mapper
    method Map(id n, object N)
        Emit(id n, object N)
        for all id m in N.OutgoingRelations do
            Emit(id m, message getMessage(N))

class Reducer
    method Reduce(id m, [s1, s2,...])
        M = null
        messages = []
        for all s in [s1, s2,...] do
            if IsObject(s) then
                M = s
            else                // s is a message
                messages.add(s)
        M.State = calculateState(messages)
        Emit(id m, item M)
```

```
class N
    State in {True = 2, False = 1, null = 0},
        initialized 1 or 2 for end-of-line categories, 0 otherwise

    method getMessage(object N)
        return N.State

    method calculateState(state s, data [d1, d2,...])
        return max( [d1, d2,...] )
```

```
class N
    State is distance, initialized 0 for source node, INFINITY for all other nodes

    method getMessage(N)
        return N.State + 1

    method calculateState(state s, data [d1, d2,...])
        return min( [d1, d2,...] )
```
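The breadth-first-search variant above can be simulated in plain Python. The toy graph is an assumption for illustration; the message-building step plays the role of `getMessage` and the update step the role of `calculateState`. One small deviation: each node takes the minimum over its current distance as well as the incoming messages, which keeps the source node's distance of 0 stable even though it receives no messages:

```python
import math

# adjacency: node -> list of neighbour IDs (toy graph, assumed)
graph = {'s': ['a', 'b'], 'a': ['c'], 'b': ['c'], 'c': []}
state = {n: (0 if n == 's' else math.inf) for n in graph}

changed = True
while changed:                  # stop when an iteration changes nothing
    changed = False
    # "map": every node messages its neighbours with State + 1
    messages = {n: [] for n in graph}
    for n, neighbours in graph.items():
        for m in neighbours:
            messages[m].append(state[n] + 1)
    # "reduce": each node keeps the minimum of its state and its messages
    for n in graph:
        new_state = min([state[n]] + messages[n])
        if new_state != state[n]:
            state[n], changed = new_state, True
# state == {'s': 0, 'a': 1, 'b': 1, 'c': 2}
```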

Case Study: PageRank and Mapper-Side Data Aggregation

```
class N
    State is PageRank

    method getMessage(object N)
        return N.State / N.OutgoingRelations.size()

    method calculateState(state s, data [d1, d2,...])
        return ( sum([d1, d2,...]) )
```

```
class Mapper
    method Initialize
        H = new AssociativeArray
    method Map(id n, object N)
        p = N.PageRank / N.OutgoingRelations.size()
        Emit(id n, object N)
        for all id m in N.OutgoingRelations do
            H{m} = H{m} + p
    method Close
        for all id n in H do
            Emit(id n, value H{n})

class Reducer
    method Reduce(id m, [s1, s2,...])
        M = null
        p = 0
        for all s in [s1, s2,...] do
            if IsObject(s) then
                M = s
            else
                p = p + s
        M.PageRank = p
        Emit(id m, item M)
```
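A compact simulation of the simplified PageRank iteration above (damping is omitted, exactly as in the pseudocode). The three-node graph is an assumption for illustration; each node sends `State / out_degree` to its successors, and the new state is the sum of incoming contributions:

```python
graph = {'x': ['y', 'z'], 'y': ['z'], 'z': ['x']}
rank = {n: 1.0 / len(graph) for n in graph}   # uniform initial rank

for _ in range(50):                  # fixed iteration count
    incoming = {n: 0.0 for n in graph}
    for n, out in graph.items():
        share = rank[n] / len(out)   # getMessage: State / OutgoingRelations.size()
        for m in out:
            incoming[m] += share
    rank = incoming                  # calculateState: sum of messages

# Every node has outgoing links, so total rank mass is conserved:
assert abs(sum(rank.values()) - 1.0) < 1e-9
# converges toward x: 0.4, y: 0.2, z: 0.4
```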

```
Record 1: F=1, G={a, b}
Record 2: F=2, G={a, d, e}
Record 3: F=1, G={b}
Record 4: F=3, G={a, b}

Result:
a -> 3   // F=1, F=2, F=3
b -> 2   // F=1, F=3
d -> 1   // F=2
e -> 1   // F=2
```

```
class Mapper
    method Map(null, record [value f, categories [g1, g2,...]])
        for all category g in [g1, g2,...]
            Emit(record [g, f], count 1)

class Reducer
    method Reduce(record [g, f], counts [n1, n2,...])
        Emit(record [g, f], null)
```

```
class Mapper
    method Map(record [f, g], null)
        Emit(value g, count 1)

class Reducer
    method Reduce(value g, counts [n1, n2,...])
        Emit(value g, sum( [n1, n2,...] ))
```

```
class Mapper
    method Map(null, record [value f, categories [g1, g2,...]])
        for all category g in [g1, g2,...]
            Emit(value f, category g)

class Reducer
    method Initialize
        H = new AssociativeArray : category -> count
    method Reduce(value f, categories [g1, g2,...])
        [g1', g2',...] = ExcludeDuplicates( [g1, g2,...] )
        for all category g in [g1', g2',...]
            H{g} = H{g} + 1
    method Close
        for all category g in H do
            Emit(category g, count H{g})
```
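Both solutions boil down to deduplicating (g, f) pairs and then counting distinct f values per g. A single-process Python sketch, using the sample records above, with a set standing in for the deduplicating shuffle of phase 1:

```python
from collections import defaultdict

# Records from the example above: (F, G) pairs.
records = [(1, {'a', 'b'}), (2, {'a', 'd', 'e'}), (1, {'b'}), (3, {'a', 'b'})]

# Phase 1: emit (g, f) pairs; the reducer's only job is deduplication,
# which a set gives us for free here.
unique_gf = {(g, f) for f, gs in records for g in gs}

# Phase 2: count distinct f values per g.
counts = defaultdict(int)
for g, f in unique_gf:
    counts[g] += 1

# counts == {'a': 3, 'b': 2, 'd': 1, 'e': 1}
```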

• The benefit from combiners is limited, since it is likely that all item pairs are unique.
• Memory cannot be used effectively.
```
class Mapper
    method Map(null, items [i1, i2,...])
        for all item i in [i1, i2,...]
            for all item j in [i1, i2,...]
                Emit(pair [i j], count 1)

class Reducer
    method Reduce(pair [i j], counts [c1, c2,...])
        s = sum([c1, c2,...])
        Emit(pair [i j], count s)
```

Stripes Approach

• There are relatively few intermediate keys, so sorting costs are reduced.
• Combiners can be used effectively.
• The computation can be performed in memory, although problems arise if it is not done correctly.
• It is more complex to implement.
• In general, "stripes" is faster than "pairs".
```
class Mapper
    method Map(null, items [i1, i2,...])
        for all item i in [i1, i2,...]
            H = new AssociativeArray : item -> counter
            for all item j in [i1, i2,...]
                H{j} = H{j} + 1
            Emit(item i, stripe H)

class Reducer
    method Reduce(item i, stripes [H1, H2,...])
        H = new AssociativeArray : item -> counter
        H = merge-sum( [H1, H2,...] )
        for all item j in H.keys()
            Emit(pair [i j], H{j})
```
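The two approaches can be checked against each other in a few lines of Python. One simplification relative to the pseudocode: `itertools.permutations` skips self-pairs (i = j), which the inner loops above would include; the toy baskets are assumptions for illustration:

```python
from collections import defaultdict
from itertools import permutations

baskets = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c']]

# Pairs approach: emit one ((i, j), 1) record per co-occurrence.
pairs = defaultdict(int)
for items in baskets:
    for i, j in permutations(items, 2):
        pairs[(i, j)] += 1

# Stripes approach: per item i, build an associative array j -> count,
# then merge-sum the stripes; expanding them yields the same pairs.
stripes = defaultdict(lambda: defaultdict(int))
for items in baskets:
    for i, j in permutations(items, 2):
        stripes[i][j] += 1
merged = {(i, j): c for i, s in stripes.items() for j, c in s.items()}

# Both approaches produce identical co-occurrence counts.
```

The stripes version emits one record per distinct left item instead of one per pair, which is exactly why it shuffles far fewer keys.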

References:

1. J. Lin, C. Dyer. Data-Intensive Text Processing with MapReduce.

```
class Mapper
    method Map(rowkey key, tuple t)
        if t satisfies the predicate
            Emit(tuple t, null)
```

```
class Mapper
    method Map(rowkey key, tuple t)
        tuple g = project(t)   // extract required fields to tuple g
        Emit(tuple g, null)

class Reducer
    method Reduce(tuple t, array n)   // n is an array of nulls
        Emit(tuple t, null)
```

```
class Mapper
    method Map(rowkey key, tuple t)
        Emit(tuple t, null)

class Reducer
    method Reduce(tuple t, array n)   // n is an array of one or two nulls
        Emit(tuple t, null)
```

```
class Mapper
    method Map(rowkey key, tuple t)
        Emit(tuple t, null)

class Reducer
    method Reduce(tuple t, array n)   // n is an array of one or two nulls
        if n.size() = 2
            Emit(tuple t, null)
```

```
class Mapper
    method Map(rowkey key, tuple t)
        Emit(tuple t, string t.SetName)   // t.SetName is either 'R' or 'S'

class Reducer
    method Reduce(tuple t, array n)   // n can be ['R'], ['S'], ['R' 'S'], or ['S' 'R']
        if n.size() = 1 and n[1] = 'R'
            Emit(tuple t, null)
```
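The last block computes the set difference R − S: keep a tuple only if its group of tags contains 'R' and nothing else. A single-process sketch, with the sample tuples assumed for illustration:

```python
from collections import defaultdict

R = [('k1',), ('k2',), ('k3',)]
S = [('k2',), ('k4',)]

# Mapper: tag each tuple with the name of the set it came from.
tagged = [(t, 'R') for t in R] + [(t, 'S') for t in S]

# Shuffle: group tags by tuple.
groups = defaultdict(list)
for t, tag in tagged:
    groups[t].append(tag)

# Reducer: keep a tuple only if it was seen once, in R.
difference = [t for t, tags in groups.items() if tags == ['R']]
# difference == [('k1',), ('k3',)]
```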

```
class Mapper
    method Map(null, tuple [value GroupBy, value AggregateBy, value ...])
        Emit(value GroupBy, value AggregateBy)

class Reducer
    method Reduce(value GroupBy, [v1, v2,...])
        Emit(value GroupBy, aggregate( [v1, v2,...] ))   // aggregate() : sum(), max(),...
```
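A minimal simulation of the GroupBy/aggregation pattern, with `sum` standing in for `aggregate()` and the rows assumed for illustration:

```python
from collections import defaultdict

# tuples of (GroupBy, AggregateBy)
rows = [('fruit', 3), ('veg', 1), ('fruit', 4), ('veg', 2)]

# Mapper emits (GroupBy, AggregateBy); the shuffle groups values by key.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Reducer applies the aggregate to each group.
result = {key: sum(values) for key, values in groups.items()}
# result == {'fruit': 7, 'veg': 3}
```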

The MapReduce framework handles joins well, but there are some techniques to choose from depending on data volumes and processing-efficiency requirements. In this section we introduce the basic approaches; the references listed include dedicated articles on this topic.

• The Mapper must emit all of the data, even for keys that appear in only one of the datasets.
• The Reducer must hold all of the data for a key in memory; if the data exceeds memory, it has to be buffered to disk, which increases disk I/O costs.

```
class Mapper
    method Map(null, tuple [join_key k, value v1, value v2,...])
        Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2,...] ])

class Reducer
    method Reduce(join_key k, tagged_tuples [t1, t2,...])
        H = new AssociativeArray : set_name -> values
        for all tagged_tuple t in [t1, t2,...]   // separate values into 2 arrays
            H{t.tag}.add(t.values)
        for all values r in H{'R'}               // produce a cross-join of the two arrays
            for all values l in H{'L'}
                Emit(null, [k r l])
```
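A single-process sketch of the repartition (reduce-side) join above. The two tagged buckets `'R'` and `'L'` mirror the pseudocode's associative array; the sample tuples are assumptions:

```python
from collections import defaultdict

R = [(1, 'r1'), (2, 'r2')]             # (join_key, payload) from set R
L = [(1, 'l1'), (1, 'l2'), (3, 'l3')]  # (join_key, payload) from set L

# Mapper + shuffle: tag every tuple with its set name, grouped by join_key.
shuffled = defaultdict(lambda: {'R': [], 'L': []})
for k, v in R:
    shuffled[k]['R'].append(v)
for k, v in L:
    shuffled[k]['L'].append(v)

# Reducer: cross-join the two buckets for each key.
joined = [(k, r, l)
          for k, h in shuffled.items()
          for r in h['R']
          for l in h['L']]
# joined == [(1, 'r1', 'l1'), (1, 'r1', 'l2')]
```

Keys present in only one set (2 and 3 here) produce an empty cross-join, which is exactly the wasted shuffle traffic the drawbacks above describe.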

```
class Mapper
    method Initialize
        H = new AssociativeArray : join_key -> tuple from R
        R = loadR()
        for all [ join_key k, tuple [r1, r2,...] ] in R
            H{k} = H{k}.append( [r1, r2,...] )

    method Map(join_key k, tuple l)
        for all tuple r in H{k}
            Emit(null, tuple [k r l])
```
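And a sketch of the replicated (map-side) join, where the smaller set R is held in memory by every mapper (the role of `loadR()` in the pseudocode) and the larger set L is streamed through the map phase with no reduce needed; the sample data is assumed:

```python
from collections import defaultdict

# Smaller set R, loaded into an in-memory hash table keyed by join_key.
R = [(1, 'r1'), (2, 'r2'), (1, 'r3')]
H = defaultdict(list)
for k, r in R:
    H[k].append(r)

# Map over the large set L: join each tuple against the in-memory table.
L = [(1, 'l1'), (2, 'l2'), (3, 'l3')]
joined = [(k, r, l) for k, l in L for r in H[k]]
# joined == [(1, 'r1', 'l1'), (1, 'r3', 'l1'), (2, 'r2', 'l2')]
```

This avoids the shuffle entirely, at the cost of requiring R to fit in each mapper's memory.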
