Google MapReduce Study Notes

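It was easier to understand than I expected, perhaps because the paper does not go into much detail, or perhaps because its elegance only shows in real applications.
In any case, the most important thing is to understand these two signatures: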
  • map(k1, v1) -> list(k2, v2)
  • reduce(k2, list(v2)) -> list(v2)
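What they mean:
k1, v1 are the original input key and value;
list(k2, v2) is the set of results the map function produces from k1, v1 in a distributed computation; it has not been merged yet, so it is an intermediate result;
a further step, the reduce function, is therefore needed to merge the v2 values that share the same k2;
the resulting output list(v2) is the final answer we want.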
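
To make the two signatures concrete, here is a minimal single-machine sketch. Word counting is used purely as an illustration, and the names `map_fn`, `reduce_fn`, and `run` are hypothetical, not part of any MapReduce library. It only shows how the types line up: (k1, v1) is (document name, text), (k2, v2) is (word, 1), and the reduce output is a list(v2) holding one total.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    # (k1, v1) = (document name, document text); emits (k2, v2) = (word, 1).
    for word in text.split():
        yield word, 1


def reduce_fn(word: str, counts: List[int]) -> List[int]:
    # (k2, list(v2)) -> list(v2); here the list holds a single total.
    return [sum(counts)]


def run(inputs: Dict[str, str]) -> Dict[str, List[int]]:
    # Collect the intermediate values by k2, then reduce each group.
    groups: Dict[str, List[int]] = defaultdict(list)
    for k1, v1 in inputs.items():
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in groups.items()}


result = run({"a.txt": "the cat", "b.txt": "the dog"})
print(result)
```

Nothing here is distributed; the grouping by k2 inside `run` stands in for what the framework does between the map and reduce phases.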

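An interesting example:
Reverse web-link graph: the Map function scans a source page for all of its link targets and outputs (target, source) pairs. (Here k1 is the source page and v1 is its contents, which contain the links to the targets. The input is divided into M pieces, and for each piece the Map function emits pairs with k2 = target and v2 = source.) The Reduce function gathers all the sources that link to a given target into one list and outputs (target, list(source)). (That is, Reduce groups the sources by target once more, this time across the results of all the pieces.)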
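
The reverse web-link example can be sketched the same way. This is a rough illustration under stated assumptions: links are extracted with a simplistic `href="..."` regular expression, the page contents are made up, and `map_links`, `reduce_links`, and `invert` are hypothetical names.

```python
import re
from collections import defaultdict

# Simplistic link extraction; real HTML would need a proper parser (assumption).
LINK_RE = re.compile(r'href="([^"]+)"')


def map_links(source, html):
    # (k1, v1) = (source page, its contents); emit (k2, v2) = (target, source).
    for target in LINK_RE.findall(html):
        yield target, source


def reduce_links(target, sources):
    # (target, list(source)) -> (target, list(source)), sorted for stable output.
    return target, sorted(sources)


def invert(pages):
    # Group the emitted sources by target, then reduce each group.
    grouped = defaultdict(list)
    for source, html in pages.items():
        for target, src in map_links(source, html):
            grouped[target].append(src)
    return dict(reduce_links(t, s) for t, s in grouped.items())


pages = {
    "a.html": '<a href="b.html">b</a> <a href="c.html">c</a>',
    "b.html": '<a href="c.html">c</a>',
}
inverted = invert(pages)
print(inverted)
```

Each key of `inverted` is a target page and each value lists the pages that link to it.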

Execution process:
  1. The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
  2. One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
  3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
  5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
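    A question here: a reduce worker has to obtain all of the k/v pairs that belong to it before it can sort them and then run the reduce function.
    But when can it be sure it has all of them? In other words, when does the master know that every intermediate k2/v2 region destined for this reduce worker is ready, so that it can send the worker the addresses of all the map workers holding its data? Or does the master wait until all map workers have finished and then hand everything out at once? That seems like it would be slow. Was the data perhaps sorted when the input was divided into M pieces, which would make this easier to control? But then the master would have to maintain yet another table.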
  6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
  7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
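
The steps above can be sketched in a single process. This is only a rough simulation under several assumptions: there is no master, no workers, and no disk or RPC; the input is split round-robin rather than by byte size; the partitioning function is taken to be a CRC32 hash of the key modulo R; and `run_job`, `partition`, `wc_map`, and `wc_reduce` are hypothetical names. In this sketch the reduce phase simply starts after every map task has finished, which sidesteps the scheduling question raised under step 5 rather than answering it.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, r):
    # Assumed partitioning function: stable hash of the key modulo R.
    # (Python's built-in hash() for str varies between processes.)
    return zlib.crc32(key.encode("utf-8")) % r


def run_job(records, map_fn, reduce_fn, m, r):
    # Step 1: split the input into M pieces (round-robin, for simplicity).
    splits = [records[i::m] for i in range(m)]

    # Steps 3-4: each map task writes its output into R regions.
    regions = []  # regions[map_task][reduce_task] -> list of (k2, v2)
    for split in splits:
        buckets = [[] for _ in range(r)]
        for k1, v1 in split:
            for k2, v2 in map_fn(k1, v1):
                buckets[partition(k2, r)].append((k2, v2))
        regions.append(buckets)

    # Steps 5-6: each reduce task reads its region from every map task,
    # sorts by key so equal keys are adjacent, then reduces each group.
    outputs = []
    for j in range(r):
        fetched = [pair for buckets in regions for pair in buckets[j]]
        fetched.sort(key=itemgetter(0))
        part = []
        for k2, group in groupby(fetched, key=itemgetter(0)):
            part.append((k2, reduce_fn(k2, [v for _, v in group])))
        outputs.append(part)

    # Step 7: return the R output partitions to the caller.
    return outputs


def wc_map(name, text):
    for word in text.split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


records = [("d1", "a b a"), ("d2", "b c"), ("d3", "a")]
outputs = run_job(records, wc_map, wc_reduce, m=2, r=3)
merged = dict(pair for part in outputs for pair in part)
print(merged)
```

Because every occurrence of a key is routed to the same region, each key ends up in exactly one of the R output partitions, so merging the partitions never produces conflicting entries.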
 

