The MapReduce Paper
MapReduce handles data partitioning, communication between machines, cluster scheduling, failure handling, and so on.
A Map operation is applied to the "logical" records of the input data to produce a set of intermediate key/value pairs, and a Reduce operation is then applied to all the values that share the same key, merging the intermediate data into the desired result. With the MapReduce model plus user-supplied Map and Reduce functions, large-scale parallel computation becomes easy to express; the model's built-in re-execution also provides a basic fault-tolerance scheme.
Scala is used in Spark for cloud computing; Go is used for network programming and distributed systems.
Go's io package defines the Reader and Writer interfaces, with Read and Write methods; essentially all other Go packages build on them.
func Copy(dst Writer, src Reader) (written int64, err error) copies data from a Reader to a Writer.
var buf bytes.Buffer (Buffer is a struct type)
buf.Write([]byte("test"))
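A minimal runnable sketch of how these pieces fit together; bytes.Buffer satisfies both io.Reader and io.Writer, so it can sit on either side of io.Copy:

    package main

    import (
        "bytes"
        "fmt"
        "io"
        "strings"
    )

    func main() {
        // bytes.Buffer is a struct type that implements both io.Reader and io.Writer.
        var buf bytes.Buffer
        buf.Write([]byte("test"))

        // io.Copy moves everything from any Reader into any Writer.
        n, err := io.Copy(&buf, strings.NewReader(" more data"))
        if err != nil {
            panic(err)
        }
        fmt.Println(n, buf.String()) // 10 test more data
    }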
MapReduce is essentially the layer that decides how computation tasks are assigned; GFS is the distributed file system that manages the data stored on different disks.
More MapReduce examples
Reverse web-link graph: the Map function finds every link target in a source page and emits (target, source). The Reduce function concatenates the list of all source URLs associated with a given target and emits (target, list(source)). [In effect, this builds a reverse map from a linked-to page back to the pages that link to it.]
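A hedged Go sketch of this reverse-link-graph job (not the paper's code; the KeyValue type, extractLinks helper, and function names are assumptions in the style of 6.824 Lab 1):

    package mr

    import "strings"

    // KeyValue is an intermediate pair (hypothetical type, Lab-1 style).
    type KeyValue struct {
        Key   string // link target URL
        Value string // source page URL
    }

    // extractLinks is a crude stand-in for real HTML parsing: it treats every
    // whitespace-separated token that starts with "http" as a link target.
    func extractLinks(page string) []string {
        var links []string
        for _, tok := range strings.Fields(page) {
            if strings.HasPrefix(tok, "http") {
                links = append(links, tok)
            }
        }
        return links
    }

    // mapF emits (target, source) for every link found in the source page.
    func mapF(sourceURL, pageContents string) []KeyValue {
        var out []KeyValue
        for _, target := range extractLinks(pageContents) {
            out = append(out, KeyValue{Key: target, Value: sourceURL})
        }
        return out
    }

    // reduceF sees one target plus every source that links to it and emits
    // the (target, list(source)) output described above.
    func reduceF(target string, sources []string) string {
        return target + " -> [" + strings.Join(sources, ", ") + "]"
    }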
One master and many workers; each worker carries out one map or reduce task at a time.
The input data is split into partitions and assigned to different machines; a map worker reads its input split, parses key/value pairs from it, passes each pair to the user-defined Map function, and buffers the resulting intermediate key/value pairs in local memory;
the buffered pairs are periodically written to local disk, where the partitioning function splits them into R regions; the locations on local disk are reported back to the master, which forwards them to the reduce workers.
The master notifies the reduce workers; a reduce worker uses RPCs to read this data from the map workers' local disks and sorts it by intermediate key so that all occurrences of the same key are grouped together [typically many different keys map to the same reduce task; for data too large for memory an external sort is used]; the reduce worker then iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding list of intermediate values to the Reduce function, appending the output to the final output file for this reduce partition. In the end there are R output files, which typically become the input to the next distributed job.
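A rough sketch of the reduce-side step just described, assuming a Lab-1-style KeyValue type: sort all fetched intermediate pairs by key, walk the sorted slice, and call the user's Reduce once per unique key:

    package mr

    import "sort"

    // KeyValue is one intermediate key/value pair.
    type KeyValue struct{ Key, Value string }

    // groupAndReduce sorts the intermediate pairs fetched from the map workers,
    // groups the values by key, and calls reduceF once per unique key. A real
    // reduce worker would append each returned line to the final output file
    // for its partition; for data too large for memory, an external sort would
    // replace sort.Slice.
    func groupAndReduce(kvs []KeyValue, reduceF func(key string, values []string) string) []string {
        sort.Slice(kvs, func(i, j int) bool { return kvs[i].Key < kvs[j].Key })

        var out []string
        for i := 0; i < len(kvs); {
            j := i
            var values []string
            for j < len(kvs) && kvs[j].Key == kvs[i].Key {
                values = append(values, kvs[j].Value)
                j++
            }
            out = append(out, kvs[i].Key+" "+reduceF(kvs[i].Key, values))
            i = j
        }
        return out
    }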
Note that the map/reduce workers do most of the mechanical work (partitioning, storage, sorting, etc.); Map and Reduce are the user-supplied functions that the corresponding workers invoke.
The master tracks each task's status (idle, in-progress, completed) and the identity of the worker running it; it also stores the locations and sizes of the intermediate files.
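A rough sketch (field names are assumptions, not the paper's) of the bookkeeping the master keeps:

    package mr

    // TaskState mirrors the three task states the paper lists.
    type TaskState int

    const (
        Idle TaskState = iota
        InProgress
        Completed
    )

    // TaskInfo is what the master tracks for each map or reduce task.
    type TaskInfo struct {
        State    TaskState
        WorkerID string // identity of the worker machine, for non-idle tasks
    }

    // MasterState also records, for each completed map task, the locations and
    // sizes of its R intermediate file regions, which it pushes incrementally
    // to workers with in-progress reduce tasks.
    type MasterState struct {
        MapTasks         []TaskInfo
        ReduceTasks      []TaskInfo
        IntermediateLocs [][]string // [mapTask][reducePartition] -> file location
        IntermediateSize [][]int64  // [mapTask][reducePartition] -> size in bytes
    }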
Fault handling
The master pings each worker periodically. If a map worker fails, its map tasks are re-executed whether or not they had completed, because a map task's results are stored locally on that worker; if a reduce task was in progress on the failed machine, it is rescheduled; if it had already completed, nothing needs to be done, since its output is already stored in the global file system.
The master's state is checkpointed periodically; if the master fails, the client can decide whether to restart the whole MapReduce job.
Each in-progress task writes its output to private temporary files (atomic commits).
A reduce task produces one such file, and a map task produces R such files (one per reduce task).
If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file. We rely on the atomic rename operation provided by the underlying file system to guarantee that the final file system state contains just the data produced by one execution of the reduce task.
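In Go, the same write-to-a-private-temp-file-then-atomically-rename commit would look roughly like this; os.Rename is atomic on POSIX filesystems, playing the role of the underlying file system's atomic rename in the paper:

    package mr

    import (
        "os"
        "path/filepath"
    )

    // commitOutput writes a task's output to a private temporary file in the
    // same directory and then atomically renames it to the final name. If two
    // executions of the same reduce task race, the rename guarantees the final
    // file contains the data from exactly one execution.
    func commitOutput(finalPath string, data []byte) error {
        tmp, err := os.CreateTemp(filepath.Dir(finalPath), "mr-tmp-*")
        if err != nil {
            return err
        }
        if _, err := tmp.Write(data); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        return os.Rename(tmp.Name(), finalPath)
    }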
Input files are replicated (three copies); the master tries to schedule a map task on a machine that holds a replica of the corresponding input data. Failing that, it tries to schedule the task on a machine near a replica, e.g. on the same network switch/subnet.
M >> R >> number of machines
M is usually determined by the amount of input data, keeping each task's input around 16~64 MB; R is determined by how many output files the user wants. The master makes O(M+R) scheduling decisions and keeps O(M*R) pieces of state in memory, because for each map task it must record which reduce partition each piece of its output belongs to and where it lives.
“When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution finishes.
By restricting the programming model, the MapReduce framework partitions the problem into a large number of small tasks. These tasks are dynamically scheduled onto the available workers, so faster workers process more tasks. The restricted model also lets us schedule backup tasks near the end of the job, which greatly shortens total completion time when the hardware is non-uniform.”
MapReduce users typically specify R, the number of reduce tasks and thus of output files. Data is partitioned across these tasks by applying a partitioning function to the intermediate key before handing it to the downstream task; the default is hash(key) mod R, though users can supply their own partitioning function.
A map/reduce task normally writes its output to its own file (or files); multiple tasks do not write to the same output file.
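A minimal sketch of that default partitioning, using Go's FNV hash as a stand-in for whatever hash the real library uses:

    package mr

    import "hash/fnv"

    // partition picks the reduce partition for an intermediate key,
    // i.e. hash(key) mod R.
    func partition(key string, R int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32() % uint32(R))
    }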
4.7 Local Execution
Debugging problems in Map and Reduce functions is hard, because the actual computation runs distributed across thousands of machines, with work assignment decided dynamically by the master, which makes debugging much harder. To simplify debugging, profiling, and small-scale testing, there is an alternative implementation of the MapReduce library that executes all of the work sequentially on the local machine. The user can control the execution and can restrict it to particular Map tasks. Users invoke their program with a special flag and can then easily use whatever local debugging and testing tools they like (e.g. gdb).
4.8 Status Information
[Somewhat like a Jenkins-style job-status platform]
The master runs an embedded HTTP server (like Jetty) and exports a set of status pages the user can monitor. The pages show the progress of the computation: how many tasks have completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing percentage, and so on. They also link to the stderr and stdout files generated by each task. Users can use this data to predict roughly how long the computation will take and whether more computing resources should be added, and to figure out when the computation runs slower than expected.
In addition, the top-level status page shows which workers have failed and which Map and Reduce tasks they were running when they failed. This information is useful when diagnosing bugs in the user's code.
Counters can also be embedded in the Map/Reduce functions; the counter values are piggybacked on the ping/pong (heartbeat) responses to the master, which aggregates them and can display them on the status page to show progress. The master must avoid double-counting when the same Map/Reduce task is executed more than once.
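A rough sketch of that counter mechanism (names and layout are assumptions): workers ship their per-task counters with the heartbeat, and the master only accepts the counts from one execution of each task:

    package mr

    // Counters is the per-task counter set a worker accumulates while running
    // Map/Reduce and reports back to the master with its heartbeat responses.
    type Counters map[string]int64

    // mergeCompletedTask folds one finished task's counters into the global
    // totals, but only for the first accepted execution of that task, so a
    // duplicate (backup) execution is not double-counted.
    func mergeCompletedTask(totals Counters, accepted map[int]bool, taskID int, c Counters) {
        if accepted[taskID] {
            return // some execution of this task was already counted
        }
        accepted[taskID] = true
        for name, v := range c {
            totals[name] += v
        }
    }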
For example, suppose the job is to sort 1 TB of documents: the Map function extracts the sort key from each record, with the value being the record's text line, yielding the intermediate key/value pairs;
“
The input data is split into 64 MB blocks (M = 15000). The sorted output is partitioned into 4000 files (R = 4000). The partitioning function uses the initial bytes of the key to segregate it into one of the R pieces.
” In other words, every Map task can contribute input to every Reduce task, which is why the master keeps O(M*R) pieces of state, as noted above.
5.3 Sort
Question: since every Map task's output can feed every Reduce task, and Map tasks finish at different times, a Reduce task may start when only a few Maps have finished. Does that amount to a single file having multiple writers and one reader?
MapReduce hides the messy details of parallelization, fault tolerance, data locality optimization, load balancing, and so on, which is what makes the library easy to use.
Lecture 1
Lec 1 Intro: https://pdos.csail.mit.edu/6.824/notes/l01.txt
Lab 1: MapReduce
Lab 2: replication for fault-tolerance using Raft
Lab 3: fault-tolerant key/value store
Lab 4: sharded key/value store
we give you the tests, so you know whether you'll do well
careful: if it often passes, but sometimes fails,
chances are it will fail when we run it
Topic: implementation
RPC, threads, concurrency control.
Topic: performance
The dream: scalable throughput.
Nx servers -> Nx total throughput via parallel CPU, disk, net.
So handling more load only requires buying more computers.
Scaling gets harder as N grows:
    Load imbalance, stragglers. A straggler is when 99% of the tasks have finished but the remaining 1% stay blocked for a long time.
Non-parallelizable code: initialization, interaction.
Bottlenecks from shared resources, e.g. network.
Note that some performance problems aren't easily attacked by scaling
e.g. decreasing response time for a single user request
might require programmer effort rather than just more computers
Topic: fault tolerance
1000s of servers, complex net -> always something broken
We'd like to hide these failures from the application.
We often want:
Availability -- app can make progress despite failures
Durability -- app will come back to life when failures are repaired
Big idea: replicated servers.
If one server crashes, client can proceed using the other(s).
Topic: consistency
General-purpose infrastructure needs well-defined behavior.
E.g. "Get(k) yields the value from the most recent Put(k,v)."
Achieving good behavior is hard!
"Replica" servers are hard to keep identical.
Clients may crash midway through multi-step update.
Servers crash at awkward moments, e.g. after executing but before replying.
Network may make live servers look dead; risk of "split brain".
Consistency and performance are enemies.
Consistency requires communication, e.g. to get latest Put().
"Strong consistency" often leads to slow systems.
High performance often imposes "weak consistency" on applications.
People have pursued many design points in this spectrum.
M/R
[diagram: MapReduce API --
map(k1, v1) -> list(k2, v2)
   reduce(k2, list(v2)) -> list(k2, v3)] The framework groups the map output so that all values with the same key form one list(v) handed to a single reduce call.
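In Go these signatures look like the ones used in 6.824 Lab 1 (the exact names come from the lab, not the paper):

    package mr

    // KeyValue is an intermediate (k2, v2) pair.
    type KeyValue struct {
        Key   string
        Value string
    }

    // MapF: (k1, v1) -> list(k2, v2); here k1 is an input file name and v1 its contents.
    type MapF func(filename string, contents string) []KeyValue

    // ReduceF: (k2, list(v2)) -> v3; the framework has already grouped all
    // values sharing one key into a single slice before calling it.
    type ReduceF func(key string, values []string) string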
The biggest limitation in 2004 was network bandwidth, so the optimizations try to keep data local or within the same rack/subnet; modern datacenter networks have made this much less of a constraint.
Map worker hashes intermediate keys into R partitions, on local disk
Q: What's a good data structure for implementing this?
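One plausible answer (an assumption, not necessarily what Google's implementation did): bucket the intermediate pairs into R in-memory slices by hash(key) mod R, then flush each bucket to its own local file:

    package mr

    import "hash/fnv"

    // KeyValue is one intermediate key/value pair.
    type KeyValue struct{ Key, Value string }

    // bucketByPartition groups one map task's intermediate output into R buckets
    // keyed by hash(key) mod R; bucket r would then be written to a local file
    // (e.g. mr-<mapTask>-<r>) for reduce worker r to fetch later.
    func bucketByPartition(kvs []KeyValue, R int) [][]KeyValue {
        buckets := make([][]KeyValue, R)
        for _, kv := range kvs {
            h := fnv.New32a()
            h.Write([]byte(kv.Key))
            r := int(h.Sum32() % uint32(R))
            buckets[r] = append(buckets[r], kv)
        }
        return buckets
    }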
How does detailed design reduce effect of slow network?
Map input is read from GFS replica on local disk, not over network.
Intermediate data goes over network just once.
Map worker writes to local disk, not GFS.
Intermediate data partitioned into files holding many keys.
Q: Why not stream the records to the reducer (via TCP) as they are being
produced by the mappers?
How do they get good load balance?
Critical to scaling -- bad for N-1 servers to wait for 1 to finish.
But some tasks likely take longer than others.
[diagram: packing variable-length tasks into workers]
Solution: many more tasks than workers.
Master hands out new tasks to workers who finish previous tasks.
So no task is so big it dominates completion time (hopefully).
  So faster servers do more work than slower ones, and they finish at about the same time.
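A toy Go sketch of that scheduling idea, with a channel as the task queue and each worker pulling the next task whenever it finishes one:

    package mr

    import "sync"

    // runTasks hands out nTasks to nWorkers through a shared channel; each worker
    // pulls a new task as soon as it finishes the previous one, so faster workers
    // simply end up doing more tasks and everyone finishes at about the same time.
    func runTasks(nTasks, nWorkers int, doTask func(task int)) {
        tasks := make(chan int)
        var wg sync.WaitGroup
        for w := 0; w < nWorkers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for t := range tasks {
                    doTask(t)
                }
            }()
        }
        for t := 0; t < nTasks; t++ {
            tasks <- t
        }
        close(tasks)
        wg.Wait()
    }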
What about fault tolerance?
I.e. what if a server crashes during a MR job?
Hiding failures is a huge part of ease of programming!
Q: Why not re-start the whole job from the beginning?
MR re-runs just the failed Map()s and Reduce()s.
MR requires them to be pure functions:
they don't keep state across calls,
they don't read or write files other than expected MR inputs/outputs,
there's no hidden communication among tasks.
So re-execution yields the same output.
The requirement for pure functions is a major limitation of
MR compared to other parallel programming schemes.
But it's critical to MR's simplicity.
Details of worker crash recovery:
* Map worker crashes:
master sees worker no longer responds to pings
crashed worker's intermediate Map output is lost
but is likely needed by every Reduce task!
master re-runs, spreads tasks over other GFS replicas of input.
some Reduce workers may already have read failed worker's intermediate data.
here we depend on functional and deterministic Map()!
master need not re-run Map if Reduces have fetched all intermediate data
    though then a Reduce crash would force re-execution of the failed Map
* Reduce worker crashes.
    finished tasks are OK -- stored in GFS, with replicas.
master re-starts worker's unfinished tasks on other workers.
* Reduce worker crashes in the middle of writing its output.
GFS has atomic rename that prevents output from being visible until complete.
so it's safe for the master to re-run the Reduce tasks somewhere else.
Other failures/problems:
* What if the master gives two workers the same Map() task?
perhaps the master incorrectly thinks one worker died.
it will tell Reduce workers about only one of them.
* What if the master gives two workers the same Reduce() task?
they will both try to write the same output file on GFS!
atomic GFS rename prevents mixing; one complete file will be visible.
* What if a single worker is very slow -- a "straggler"?
perhaps due to flakey hardware.
master starts a second copy of last few tasks.
* What if a worker computes incorrect output, due to broken h/w or s/w?
too bad! MR assumes "fail-stop" CPUs and software.
* What if the master crashes?
recover from check-point, or give up on job
For what applications *doesn't* MapReduce work well?
Not everything fits the map/shuffle/reduce pattern.
Small data, since overheads are high. E.g. not web site back-end.
Small updates to big data, e.g. add a few documents to a big index
Unpredictable reads (neither Map nor Reduce can choose input)
Multiple shuffles, e.g. page-rank (can use multiple MR but not very efficient)
More flexible systems allow these, but more complex model.
MapReduce is well suited to large-scale repetitive work where the tasks don't interact and the outputs don't depend on one another.
How might a real-world web company use MapReduce?
"CatBook", a new company running a social network for cats; needs to:
1) build a search index, so people can find other peoples' cats
2) analyze popularity of different cats, to decide advertising value
3) detect dogs and remove their profiles
Can use MapReduce for all these purposes!
- run large batch jobs over all profiles every night
  1) build inverted index: map(profile text) -> (word, cat_id); the map output is partitioned by word into R different files
                           reduce(word, list(cat_id)) -> list(word, list(cat_id)); the output is spread across R partitions/files, and each reduce call receives a single word (all occurrences of that key grouped) and processes its list(cat_id).
2) count profile visits: map(web logs) -> (cat_id, "1")
reduce(cat_id, list("1")) -> list(cat_id, count)
3) filter profiles: map(profile image) -> img analysis -> (cat_id, "dog!")
reduce(cat_id, list("dog!")) -> list(cat_id)