MapReduce Implementation: 6.824, Part 1

Xiao He's Study Diary

Last modified: 2023-10-02 17:41:23

Tags: mapreduce, big data, distributed systems

First published: 2023-10-02 17:37:45

Article link: https://blog.csdn.net/Hxy_666/article/details/133496640

Before reading this, you should have read the MapReduce paper closely, or at least its first four sections.

MapReduce node types

There are two types of node.

Master (Coordinator/main): schedules tasks and coordinates the job. Each MapReduce job has exactly one.

Worker

Map: takes an input pair and produces a set of intermediate key/value pairs.
Reduce: accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values.

Map and Reduce tasks both run on worker nodes, and a MapReduce job can have many workers running in parallel.

Execution flow

As shown in Figure 1:

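The Master node assigns the tasks: there are M map tasks and N reduce tasks.
A worker is a server or a terminal; on a single machine it is a process.
In the map phase, each map task reads one input file and applies the map function to it. Taking wordcount as an example, every occurrence of a word produces a key/value pair (word, "1"). Repeated occurrences of the same word each add another pair; the value for that word is not incremented at this stage. The pairs are stored in intermediate files, whose names should encode both the map task and the reduce task, so that each reduce task can pick out exactly its own files (see Figure 2). Note that the intermediate files should be stored on local disk.
In the reduce phase, each reduce task reads the intermediate files that belong to it, sorts all of their records by key, and aggregates them into the final output file.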
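
To make the wordcount example concrete, here is a minimal sketch of a map and a reduce function in Go. It is modeled on the usual word-count application; the KeyValue type and the function names are assumptions made for illustration, not code taken from this post.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is an assumed pair type for intermediate data.
type KeyValue struct {
	Key   string
	Value string
}

// Map emits one (word, "1") pair per occurrence. Repeated words are not
// merged here; merging is the job of the reduce phase.
func Map(filename string, contents string) []KeyValue {
	isSeparator := func(r rune) bool { return !unicode.IsLetter(r) }
	words := strings.FieldsFunc(contents, isSeparator)
	kva := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

// Reduce receives every value emitted for one key and returns the count.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	kva := Map("demo.txt", "a b a")
	fmt.Println(kva)                              // one pair per occurrence, duplicates kept
	fmt.Println(Reduce("a", []string{"1", "1"})) // number of values seen for "a"
}
```

Since every value is "1", the count for a word is simply the number of values that reach Reduce for that key.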

Code logic

A worker asks the Coordinator for a task (map or reduce). Nodes communicate over RPC. Reduce tasks are handed out only after every map task has finished.
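
The task request and the "maps before reduces" rule could be sketched roughly as follows. All type, field, and method names here (TaskKind, TaskArgs, TaskReply, RequestTask) are hypothetical. The handler only has the method shape that Go's net/rpc expects (an args value, a reply pointer, an error result); the sketch calls it directly instead of going over a socket, and it leaves out locking and the re-issuing of tasks from crashed workers.

```go
package main

import "fmt"

// TaskKind tells the worker what to do next.
type TaskKind int

const (
	MapTask TaskKind = iota
	ReduceTask
	WaitTask // nothing to hand out right now; ask again later
	ExitTask // the whole job is finished
)

type TaskArgs struct {
	WorkerID int
}

type TaskReply struct {
	Kind    TaskKind
	TaskID  int
	File    string // input file, set only for map tasks
	NReduce int
}

type Coordinator struct {
	files       []string
	nReduce     int
	nextMap     int // next map task not yet handed out
	mapsDone    int // map tasks reported as finished
	nextReduce  int
	reducesDone int
}

// RequestTask hands out map tasks first and holds back every reduce task
// until all map tasks have been reported done.
func (c *Coordinator) RequestTask(args *TaskArgs, reply *TaskReply) error {
	reply.NReduce = c.nReduce
	switch {
	case c.nextMap < len(c.files):
		reply.Kind, reply.TaskID, reply.File = MapTask, c.nextMap, c.files[c.nextMap]
		c.nextMap++
	case c.mapsDone < len(c.files):
		reply.Kind = WaitTask // some maps are still running
	case c.nextReduce < c.nReduce:
		reply.Kind, reply.TaskID = ReduceTask, c.nextReduce
		c.nextReduce++
	case c.reducesDone < c.nReduce:
		reply.Kind = WaitTask
	default:
		reply.Kind = ExitTask
	}
	return nil
}

func main() {
	c := &Coordinator{files: []string{"a.txt"}, nReduce: 1}
	for step := 0; step < 3; step++ {
		var reply TaskReply
		c.RequestTask(&TaskArgs{WorkerID: 1}, &reply)
		fmt.Printf("kind=%d task=%d file=%q\n", reply.Kind, reply.TaskID, reply.File)
		if step == 1 {
			c.mapsDone = 1 // pretend the worker reported its map task done
		}
	}
}
```

Replying with an explicit WaitTask is one possible design; a worker could equally sleep and retry on an empty reply.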

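A worker running a map task should receive an input file, the number of reduce tasks, its map task number (used to name the intermediate files), and mapf. It reads the file, applies mapf, and writes the resulting key/value pairs to intermediate files. Atomicity matters here: a file must either be written in full or not appear at all. The lab page gives a hint: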

To ensure that nobody observes partially written files in the presence of crashes, the MapReduce paper mentions the trick of using a temporary file and atomically renaming it once it is completely written. You can use ioutil.TempFile to create a temporary file and os.Rename to atomically rename it.
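
The temp-file-then-rename trick from the hint might look like the sketch below. The helper names writeFileAtomic and roundTrip, and the file name used in the demo, are made up for illustration. It uses os.CreateTemp, which has replaced the deprecated ioutil.TempFile since Go 1.16, and it creates the temporary file in the target's own directory so the rename stays on one filesystem.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file next to path and then
// renames it into place, so a reader sees either no file or the whole file.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), "mr-tmp-*")
	if err != nil {
		return err
	}
	// Clean up the temp file on failure. After a successful rename this
	// Remove finds nothing, and its error is ignored.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// roundTrip writes data atomically into a scratch directory and reads it back.
func roundTrip(data string) (string, error) {
	dir, err := os.MkdirTemp("", "mr-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "mr-0-0") // hypothetical intermediate file name
	if err := writeFileAtomic(path, []byte(data)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip("hello")
	fmt.Println(got, err)
}
```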

For the file format, the lab recommends JSON and gives the encoding and decoding code:

The worker's map task code will need a way to store intermediate key/value pairs in files in a way that can be correctly read back during reduce tasks. One possibility is to use Go's encoding/json package. To write key/value pairs in JSON format to an open file:
  enc := json.NewEncoder(file)
  for _, kv := ... {
    err := enc.Encode(&kv)
and to read such a file back:
  dec := json.NewDecoder(file)
  for {
    var kv KeyValue
    if err := dec.Decode(&kv); err != nil {
      break
    }
    kva = append(kva, kv)
  }
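
The two fragments above can be combined into a complete round trip. This is only a sketch: the helper names encodeAll, decodeAll, and roundTrip are assumptions, and an in-memory buffer stands in for the intermediate file. Unlike the fragment above, it tells a clean end of input (io.EOF) apart from a real decoding error.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

type KeyValue struct {
	Key   string
	Value string
}

// encodeAll writes each pair as one JSON object, one per line.
func encodeAll(w io.Writer, kva []KeyValue) error {
	enc := json.NewEncoder(w)
	for _, kv := range kva {
		if err := enc.Encode(&kv); err != nil {
			return err
		}
	}
	return nil
}

// decodeAll reads pairs until the stream ends.
func decodeAll(r io.Reader) ([]KeyValue, error) {
	dec := json.NewDecoder(r)
	var kva []KeyValue
	for {
		var kv KeyValue
		err := dec.Decode(&kv)
		if err == io.EOF {
			return kva, nil // clean end of input
		}
		if err != nil {
			return kva, err // malformed or truncated data
		}
		kva = append(kva, kv)
	}
}

// roundTrip encodes into a buffer and decodes the result again.
func roundTrip(kva []KeyValue) ([]KeyValue, error) {
	var buf bytes.Buffer
	if err := encodeAll(&buf, kva); err != nil {
		return nil, err
	}
	return decodeAll(&buf)
}

func main() {
	out, err := roundTrip([]KeyValue{{Key: "a", Value: "1"}, {Key: "b", Value: "1"}})
	fmt.Println(out, err)
}
```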

The file naming scheme and the hash mapping are given as well:

A reasonable naming convention for intermediate files is mr-X-Y, where X is the Map task number, and Y is the reduce task number.
The map part of your worker can use the ihash(key) function (in worker.go) to pick the reduce task for a given key.
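
A rough sketch of how a map task could split its output by reduce task and name the files. The ihash body is reproduced from memory of the lab skeleton, so check it against your own worker.go; partition and intermediateName are hypothetical helpers.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type KeyValue struct {
	Key   string
	Value string
}

// ihash maps a key to a non-negative int (as recalled from the lab's worker.go).
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// partition puts each pair into the bucket of the reduce task that owns its
// key, so all pairs with the same key land in the same bucket.
func partition(kva []KeyValue, nReduce int) [][]KeyValue {
	buckets := make([][]KeyValue, nReduce)
	for _, kv := range kva {
		r := ihash(kv.Key) % nReduce
		buckets[r] = append(buckets[r], kv)
	}
	return buckets
}

// intermediateName builds the mr-X-Y name for map task X and reduce task Y.
func intermediateName(mapID, reduceID int) string {
	return fmt.Sprintf("mr-%d-%d", mapID, reduceID)
}

func main() {
	kva := []KeyValue{{Key: "apple", Value: "1"}, {Key: "pear", Value: "1"}, {Key: "apple", Value: "1"}}
	for r, bucket := range partition(kva, 3) {
		fmt.Println(intermediateName(0, r), bucket)
	}
}
```

Reduce task Y then only needs the files whose name ends in -Y, one from each map task.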

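Finally, the worker reports the intermediate files back to the Coordinator over RPC. The Coordinator tracks how many map tasks have completed and uses the file names to hand each intermediate file to the matching reduce task. Once all map tasks are done, it can start assigning reduce tasks.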

Note: several workers talk to a single coordinator, so the coordinator has to handle concurrent access to its state.
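
One common way to handle this is to guard the coordinator's state with a mutex. The sketch below is an assumption about how that could look, not the author's code: ReportMapDone, AllMapsDone, and simulate are hypothetical names, and goroutines stand in for workers calling in over RPC. Completed tasks are kept in a set, so a duplicate report for the same task is harmless.

```go
package main

import (
	"fmt"
	"sync"
)

type Coordinator struct {
	mu      sync.Mutex   // protects the fields below
	nMap    int          // total number of map tasks
	mapDone map[int]bool // set of finished map task IDs
}

// ReportMapDone records a finished map task. Reporting the same task twice
// leaves the state unchanged.
func (c *Coordinator) ReportMapDone(taskID int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.mapDone[taskID] = true
}

// AllMapsDone reports whether every map task has finished.
func (c *Coordinator) AllMapsDone() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.mapDone) == c.nMap
}

// simulate sends reportsPerTask concurrent reports for each of nMap tasks
// and returns whether the coordinator ends up in the "all done" state.
func simulate(nMap, reportsPerTask int) bool {
	c := &Coordinator{nMap: nMap, mapDone: make(map[int]bool)}
	var wg sync.WaitGroup
	for id := 0; id < nMap; id++ {
		for r := 0; r < reportsPerTask; r++ {
			wg.Add(1)
			go func(id int) {
				defer wg.Done()
				c.ReportMapDone(id)
			}(id)
		}
	}
	wg.Wait()
	return c.AllMapsDone()
}

func main() {
	fmt.Println(simulate(8, 3))
}
```

Without the lock, concurrent writes to the map would be a data race, which Go's race detector (go run -race) reports.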

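A worker running a reduce task should receive a list of intermediate files, its own task number, and reducef.

It first decodes the intermediate files, then sorts the key/value pairs by key, calls reducef once for each key with all of that key's values, and writes the results to the final output file. This write should be atomic as well. Finally, it returns the output file name to the coordinator.

Once all reduce tasks have finished, the coordinator reports that the job is done and all processes exit.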
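
The sort-and-group step of a reduce task can be sketched like this, assuming the pairs have already been decoded from the intermediate files. reduceAll and countValues are hypothetical names, and the "key value" line format is an assumption about what the final file should contain.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

type KeyValue struct {
	Key   string
	Value string
}

// reduceAll sorts the pairs by key (in place), calls reducef once per
// distinct key with all of that key's values, and returns the output lines.
func reduceAll(kva []KeyValue, reducef func(string, []string) string) []string {
	sort.Slice(kva, func(i, j int) bool { return kva[i].Key < kva[j].Key })

	var lines []string
	for i := 0; i < len(kva); {
		// After sorting, equal keys are adjacent: find the end of this run.
		j := i
		var values []string
		for j < len(kva) && kva[j].Key == kva[i].Key {
			values = append(values, kva[j].Value)
			j++
		}
		lines = append(lines, fmt.Sprintf("%v %v", kva[i].Key, reducef(kva[i].Key, values)))
		i = j
	}
	return lines
}

// countValues is a wordcount-style reducef: it returns how many values it got.
func countValues(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	kva := []KeyValue{{Key: "b", Value: "1"}, {Key: "a", Value: "1"}, {Key: "b", Value: "1"}}
	for _, line := range reduceAll(kva, countValues) {
		fmt.Println(line)
	}
}
```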

References:

MapReduce paper: http://research.google.com/archive/mapreduce-osdi04.pdf

6.824 Lab 1: http://nil.csail.mit.edu/6.824/2022/labs/lab-mr.html

MapReduce paper reading guide: https://hardcore.feishu.cn/docs/doccnxwr1i2y3Ak3WXmFlWLaCbh