MIT 6.824 (2018) MapReduce Lab 1 Summary

What is MapReduce

In my own words, MapReduce is a general-purpose distributed computation framework: the user only needs to supply a mapF function and a reduceF function, without knowing anything about the underlying distributed machinery, to solve a certain class of problems.

 

So the point of this lab is to implement a mini MapReduce framework.

 

Introduction

In this lab you'll build a MapReduce library as an introduction to programming in Go and to building fault tolerant distributed systems. In the first part you will write a simple MapReduce program. In the second part you will write a Master that hands out tasks to MapReduce workers, and handles failures of workers. The interface to the library and the approach to fault tolerance is similar to the one described in the original MapReduce paper.

 

Preamble: Getting familiar with the source

The mapreduce package provides a simple Map/Reduce library (in the mapreduce directory). Applications should normally call Distributed() [located in master.go] to start a job, but may instead call Sequential() [also in master.go] to get a sequential execution for debugging.

The code executes a job as follows:

 

1. The application provides a number of input files, a map function, a reduce function, and the number of reduce tasks (nReduce).

 

2. A master is created with this knowledge. It starts an RPC server (see master_rpc.go), and waits for workers to register (using the RPC call Register() [defined in master.go]). As tasks become available (in steps 4 and 5), schedule() [schedule.go] decides how to assign those tasks to workers, and how to handle worker failures.

 

3. The master considers each input file to be one map task, and calls doMap() [common_map.go] at least once for each map task. It does so either directly (when using Sequential()) or by issuing the DoTask RPC to a worker [worker.go]. Each call to doMap() reads the appropriate file, calls the map function on that file's contents, and writes the resulting key/value pairs to nReduce intermediate files. doMap() hashes each key to pick the intermediate file and thus the reduce task that will process the key. There will be nMap x nReduce files after all map tasks are done. Each file name contains a prefix, the map task number, and the reduce task number. If there are two map tasks and three reduce tasks, the map tasks will create these six intermediate files:

mrtmp.xxx-0-0
mrtmp.xxx-0-1
mrtmp.xxx-0-2
mrtmp.xxx-1-0
mrtmp.xxx-1-1
mrtmp.xxx-1-2

 

4. Each worker must be able to read files written by any other worker, as well as the input files. Real deployments use distributed storage systems such as GFS to allow this access even though workers run on different machines. In this lab you'll run all the workers on the same machine, and use the local file system.
5. The master next calls doReduce() [common_reduce.go] at least once for each reduce task. As with doMap(), it does so either directly or through a worker. The doReduce() for reduce task r collects the r'th intermediate file from each map task, and calls the reduce function for each key that appears in those files. The reduce tasks produce nReduce result files.
6. The master calls mr.merge() [master_splitmerge.go], which merges all the nReduce files produced by the previous step into a single output.
7. The master sends a Shutdown RPC to each of its workers, and then shuts down its own RPC server.

 

Over the course of the following exercises, you will have to write/modify doMap, doReduce, and schedule yourself. These are located in common_map.go, common_reduce.go, and schedule.go respectively. You will also have to write the map and reduce functions in ../main/wc.go.

 

You should not need to modify any other files, but reading them might be useful in order to understand how the other methods fit into the overall architecture of the system.

 

In the 2018 version, you only need to implement a few interfaces.

 

Part I: Map/Reduce input and output

The Map/Reduce implementation you are given is missing some pieces. Before you can write your first Map/Reduce function pair, you will need to fix the sequential implementation. In particular, the code we give you is missing two crucial pieces: the function that divides up the output of a map task, and the function that gathers all the inputs for a reduce task. These tasks are carried out by the doMap() function in common_map.go, and the doReduce() function in common_reduce.go respectively. The comments in those files should point you in the right direction.

To help you determine if you have correctly implemented doMap() and doReduce(), we have provided you with a Go test suite that checks the correctness of your implementation. These tests are implemented in the file test_test.go. To run the tests for the sequential implementation that you have now fixed, run:
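With GOPATH pointed at the lab's working directory, that amounts to something like cd src/mapreduce followed by go test -run Sequential (or go test -run Sequential mapreduce/... from the src directory); the exact paths depend on how your checkout of the course repository is laid out.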

 

Part I implements the input and output of the Map/Reduce framework, corresponding to the doMap() function in common_map.go and the doReduce() function in common_reduce.go. For the doMap function:

func doMap(
	jobName string, // the name of the MapReduce job
	mapTask int, // which map task this is
	inFile string, // the input file for this map task
	nReduce int, // the number of reduce task that will be run ("R" in the paper)
	mapF func(filename string, contents string) []KeyValue,
)

The output files follow the naming pattern jobName + mapTask + iReduce, where iReduce runs from 0 to nReduce-1; inFile is the input file, and mapF is the user-defined map function.

The logic of doMap is: read the inFile file to get its contents, pass inFile and the contents to mapF, and get back a []KeyValue slice.

Then, using a hash function on each key, partition the key/value pairs into nReduce different files.

Save these files using the jobName + mapTask + iReduce naming pattern, and the task is done.

Details: Go file I/O and encoding the files as JSON; group the KeyValue slice by reduce partition first, then write each group to its corresponding file.
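A minimal sketch of doMap along these lines, assuming the skeleton's reduceName() and ihash() helpers and the usual encoding/json, io/ioutil, log, and os imports (error handling kept short):

func doMap(
	jobName string,
	mapTask int,
	inFile string,
	nReduce int,
	mapF func(filename string, contents string) []KeyValue,
) {
	// Read the whole input file and run the user's map function on it.
	contents, err := ioutil.ReadFile(inFile)
	if err != nil {
		log.Fatal("doMap: read ", inFile, ": ", err)
	}
	kvs := mapF(inFile, string(contents))

	// One intermediate file (and one JSON encoder) per reduce task.
	files := make([]*os.File, nReduce)
	encoders := make([]*json.Encoder, nReduce)
	for r := 0; r < nReduce; r++ {
		f, err := os.Create(reduceName(jobName, mapTask, r))
		if err != nil {
			log.Fatal("doMap: create: ", err)
		}
		files[r] = f
		encoders[r] = json.NewEncoder(f)
	}

	// Hash each key to decide which reduce task (and hence which file) receives the pair.
	for _, kv := range kvs {
		r := ihash(kv.Key) % nReduce
		if err := encoders[r].Encode(&kv); err != nil {
			log.Fatal("doMap: encode: ", err)
		}
	}

	for _, f := range files {
		f.Close()
	}
}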

 

func doReduce(
	jobName string, // the name of the whole MapReduce job
	reduceTask int, // which reduce task this is
	outFile string, // write the output here
	nMap int, // the number of map tasks that were run ("M" in the paper)
	reduceF func(key string, values []string) string,
) 

The logic of doReduce is: read the corresponding jobName + mapTask + reduceTask intermediate files (iterating over all map tasks), run reduceF for each key, and write the final key/value results to outFile.

Details: Go file I/O (JSON); use a map to turn the key/value pairs into key -> []string groups, then pass each group into reduceF to do the reduce.
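A matching sketch of doReduce, again assuming the skeleton's reduceName() helper (with encoding/json, log, os, and sort imported); keys are sorted so the output is deterministic:

func doReduce(
	jobName string,
	reduceTask int,
	outFile string,
	nMap int,
	reduceF func(key string, values []string) string,
) {
	// Gather every value for each key across the nMap intermediate files
	// that belong to this reduce task.
	kvs := make(map[string][]string)
	for m := 0; m < nMap; m++ {
		f, err := os.Open(reduceName(jobName, m, reduceTask))
		if err != nil {
			log.Fatal("doReduce: open: ", err)
		}
		dec := json.NewDecoder(f)
		for {
			var kv KeyValue
			if err := dec.Decode(&kv); err != nil {
				break // reached the end of this file
			}
			kvs[kv.Key] = append(kvs[kv.Key], kv.Value)
		}
		f.Close()
	}

	// Sort the keys so the output is deterministic.
	keys := make([]string, 0, len(kvs))
	for k := range kvs {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	// Apply reduceF to each key's values and write the results as JSON.
	out, err := os.Create(outFile)
	if err != nil {
		log.Fatal("doReduce: create: ", err)
	}
	defer out.Close()
	enc := json.NewEncoder(out)
	for _, k := range keys {
		enc.Encode(KeyValue{k, reduceF(k, kvs[k])})
	}
}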

 

Implement these two functions, make the tests pass, and Part I is done.

 

Part II: Single-worker word count

 

This part is word count on a single worker: design your own mapF and reduceF functions and make the tests pass.

func mapF(filename string, contents string) []mapreduce.KeyValue

This function takes filename and contents and returns the corresponding key/value slice. The key point is to split contents into words and emit one KeyValue per word, with every value set to "1"; no combining (reducing) happens here yet.

func reduceF(key string, values []string) string

This function returns the value obtained by combining values — for word count, the number of occurrences of the key.
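A minimal sketch of the word-count pair in ../main/wc.go, splitting contents into words with strings.FieldsFunc and unicode.IsLetter as the lab hints suggest (strconv is also imported):

func mapF(filename string, contents string) []mapreduce.KeyValue {
	// Split contents into words: a word is a maximal run of letters.
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]mapreduce.KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, mapreduce.KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

func reduceF(key string, values []string) string {
	// Every value is "1", so the count is simply how many values there are.
	return strconv.Itoa(len(values))
}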

 

Then run the tests against these two functions.

 

Part III: Distributing MapReduce tasks

This part is the master's task-assignment logic: schedule().

func schedule(jobName string, mapFiles []string, nReduce int, phase jobPhase, registerChan chan string)

jobName is the name of the job being scheduled, mapFiles is the slice of input file names for the map tasks, nReduce is the number of reduce tasks, phase is either mapPhase or reducePhase, and registerChan is a channel from which the master receives worker addresses so it can hand tasks out to workers over RPC.

 

How to hand out the tasks here is a classic producer-consumer problem.

The master first pushes tasks onto a task queue, then reads tasks back out of that queue, takes a worker address from registerChan, and assigns the task to that worker over RPC. When the worker finishes, it is put back on registerChan; if the task fails, the master puts the task back onto the task queue.

With concurrency in mind, producing tasks and handing them out to workers can run fully asynchronously.

The code below uses a fairly classic concurrency pattern:

func schedule(jobName string, mapFiles []string, nReduce int, phase jobPhase, registerChan chan string) {
	var ntasks int
	var n_other int // number of inputs (for reduce) or outputs (for map)
	switch phase {
	case mapPhase:
		ntasks = len(mapFiles)
		n_other = nReduce
	case reducePhase:
		ntasks = nReduce
		n_other = len(mapFiles)
	}

	fmt.Printf("Schedule: %v %v tasks (%d I/Os)\n", ntasks, phase, n_other)

	// All ntasks tasks have to be scheduled on workers. Once all tasks
	// have completed successfully, schedule() should return.
	//
	// Your code here (Part III, Part IV).
	//
	taskChan := make(chan int)
	var wg sync.WaitGroup
	go func() {
		for taskNumber := 0; taskNumber < ntasks; taskNumber++ {
			wg.Add(1) // Add before the send, so a fast Done() can never race ahead of it
			taskChan <- taskNumber
			fmt.Printf("taskChan <- %d in %s\n", taskNumber, phase)
		}

		wg.Wait()							// only returns once all ntasks tasks have completed successfully
		close(taskChan)
	}()

	for task := range taskChan {			// exits once every task has succeeded and taskChan is closed
		worker := <-registerChan            // take an idle worker
		fmt.Printf("given task %d to %s in %s\n", task, worker, phase)

		arg := DoTaskArgs{
			JobName: jobName,
			Phase: phase,
			TaskNumber: task,
			NumOtherPhase: n_other,
		}

		if phase == mapPhase {
			arg.File = mapFiles[task]
		}

		go func(worker string, arg DoTaskArgs) {
			if call(worker, "Worker.DoTask", arg, nil) {
				// Success: this worker can take on another task.
				// Note: call wg.Done() before registerChan <- worker. The final send on
				// registerChan may block forever (nothing reads it once this phase ends),
				// so doing Done() last would keep wg.Wait() from returning, taskChan would
				// never be closed, and schedule() would deadlock.
				wg.Done()
				registerChan <- worker // return the worker to the idle pool
			} else {
				// Failure: the task must be rescheduled.
				// Note: use arg.TaskNumber rather than the loop variable task. The goroutine
				// running the for loop may already have advanced task to the next value, so
				// using task here could rerun the wrong task and never retry this one.
				taskChan <- arg.TaskNumber
			}
		}(worker, arg)

	}
	fmt.Printf("Schedule: %v done\n", phase)


	fmt.Printf("Schedule: %v done\n", phase)

}

 

Part IV: Handling worker failures

This part is handling worker failures, which is in fact already covered by the schedule() above: when a DoTask RPC fails, the task is simply put back on the task queue and handed to another worker.

 

Part V: Inverted index generation

This part again only changes mapF and reduceF, this time to implement inverted index generation: for each word, produce the list of documents that contain it.
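A rough sketch of what ../main/ii.go might look like (fmt, sort, strings, and unicode imported). The exact output string the test script expects should be taken from the comments in ii.go itself, so the "count doc1,doc2,..." format below is an assumption:

func mapF(document string, value string) (res []mapreduce.KeyValue) {
	// Emit one (word, document) pair per word occurrence.
	words := strings.FieldsFunc(value, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	for _, w := range words {
		res = append(res, mapreduce.KeyValue{Key: w, Value: document})
	}
	return
}

func reduceF(key string, values []string) string {
	// Deduplicate the document names, then sort them.
	seen := make(map[string]bool)
	docs := []string{}
	for _, d := range values {
		if !seen[d] {
			seen[d] = true
			docs = append(docs, d)
		}
	}
	sort.Strings(docs)
	// Assumed output format: "<count> doc1,doc2,..." — verify against ii.go's comments.
	return fmt.Sprintf("%d %s", len(docs), strings.Join(docs, ","))
}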

 

 

The hardest part is Part III, distributing the tasks, which is really a Go concurrent-programming problem. It takes some extra practice to build up experience there.

 

You don't need to understand everything. Program against the interfaces: just work out what we need to do and where the data comes from.

 

 

 

 

 

 
