https://www.bilibili.com/video/BV1R7411t71W?spm_id_from=333.337.search-card.all.click
What is a distributed system?
multiple cooperating computers
distributed system examples: storage for big web sites, big data computations such as MapReduce, peer-to-peer file sharing
lots of critical infrastructure is distributed
Why do people build distributed systems?
to increase capacity via parallelism
to tolerate faults via replication
to place computing physically close to external entities
to achieve security via isolation
But:
many concurrent parts, complex interactions
must cope with partial failure
tricky to realize performance potential
The labs in this course focus on performance and fault tolerance.
The paper that started all this interesting work: MapReduce.
Lab 1: MapReduce
Lab 2: replication for fault-tolerance using Raft
Lab 3: fault-tolerant key/value store
Lab 4: sharded key/value store
MAIN TOPICS
This is a course about infrastructure for applications.
- Storage. /how to build and use replicated, fault-tolerant, high-performance distributed implementations of storage
- Communication. /see 6.829
- Computation. /like MapReduce; reliability
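How to simplify the interfaces to distributed storage and computation infrastructure, much like a file system interface, so that a distributed system feels like a non-distributed one.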
The big goal: abstractions that hide the complexity of distribution.
A couple of topics will come up repeatedly in our search.
Topic: implementation
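RPC (remote procedure call): its goal is to mask the fact that we are communicating over an unreliable network.
threads: a programming technique that gives a structured way to express concurrent operations and lets us exploit multi-core computers. Threads force us to think about concurrency control, e.g. locks.
concurrency control, e.g. locks
All of the above are tools needed to build distributed systems: to implement one you need to think about RPC, threads, concurrency, and locks.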
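A minimal sketch of the threads-plus-locks idea above, in Python rather than the Go used by the labs. The names Counter and run_workers are made up for this illustration. Several threads update one shared counter, and the lock makes each read-modify-write step happen as a unit.

```python
import threading

class Counter:
    """A shared counter whose updates are protected by a lock (hypothetical example)."""
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        # Hold the lock across the read-modify-write so no update is lost.
        with self.lock:
            self.value += 1

def run_workers(num_threads, increments_per_thread):
    """Start num_threads threads that each increment the same counter."""
    counter = Counter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, run_workers(4, 1000) returns 4000. Without it, concurrent increments could overwrite each other.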
Topic: performance
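The goal: scalable throughput (the aim of a high-performance distributed system is scalability, i.e. scalable speed-up)
Rather than spending more of the budget on making the algorithms faster, increase performance by adding nodes to the distributed system.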
Nx servers -> Nx total throughput via parallel CPU, disk, net.
[diagram: users, application servers, storage servers]
But this scalability is rarely unlimited: as you keep adding servers, the DB becomes the bottleneck.
So handling more load only requires buying more computers.
Rather than re-design by expensive programmers.
Effective when you can divide work w/o much interaction.
Scaling gets harder as N grows:
Load imbalance, stragglers, slowest-of-N latency.
Non-parallelizable code: initialization, interaction.
Bottlenecks from shared resources, e.g. network.
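One way to put a number on the "non-parallelizable code" point is Amdahl's law, which these notes do not name; it is brought in here only as an illustration. The function name speedup and the 5% serial fraction are assumptions for the example.

```python
def speedup(serial_fraction, n_servers):
    """Amdahl's law: best-case speedup when part of the work cannot be parallelized."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_servers)

# With no serial part, N servers give an Nx speedup.
ideal = speedup(0.0, 10)
# With an assumed 5% serial part, even 1000 servers stay below 20x.
limited = speedup(0.05, 1000)
```

The limit as N grows is 1 / serial_fraction, so a small amount of initialization or interaction caps the benefit of buying more computers.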
Some performance problems aren’t easily solved by scaling
e.g. quick response time for a single user request
e.g. all users want to update the same data
often requires better design rather than just more computers
Lab 4
Topic: fault tolerance
Fault tolerance, defined: 1000s of servers, big network -> always something broken
We’d like to hide these failures from the application.
We often want:
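Availability – app can make progress despite failures /availability is always defined relative to a particular set of failure types; a failure outside that set makes the system unavailable
Recoverability – app will come back to life when failures are repaired /when something goes wrong the system stops working and stops responding to requests; once it is repaired, and provided nothing worse has happened, it resumes running correctly. This is weaker than availability.
To stay available, a system tolerates up to a certain number of failures. If too many failures occur, it stops working or stops responding altogether, but once enough of the problems are repaired it continues to work correctly. A good highly available system should also be recoverable. The most important tool is non-volatile storage (hard drives, flash, SSDs), used to hold a checkpoint or a log of the system's state.
There is a lot involved in managing non-volatile storage, but writes to it should be avoided where possible because they are expensive: a hard drive has to move the disk arm and wait for the platter to rotate, and both are slow.
The other important fault-tolerance tool is replication.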
Big idea: replicated servers.
If one server crashes, can proceed using the other(s).
Labs 1, 2 and 3
Topic: consistency
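Consistency is about well-defined behavior. Example: a server sends an operation to DB1 but then loses power, so it never sends the same operation to DB2. Now, for k=1, both an old and a new value exist, which violates strong consistency.
Strong consistency guarantees that a get returns the value written by the most recent put.
Weak consistency makes no such guarantee: a get may keep returning a stale value for an unbounded time after a put. It is interesting because, although strong consistency guarantees you see the latest value, implementing it can be expensive. It takes a lot of communication, since you need to talk to every replica.
Strong consistency requires more expensive communication.
Much current research looks at how to build weak consistency guarantees that are actually useful to applications, and how to exploit them to get substantially higher performance.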
General-purpose infrastructure needs well-defined behavior.
E.g. “Get(k) yields the value from the most recent Put(k,v).”
Achieving good behavior is hard!
“Replica” servers are hard to keep identical.
Clients may crash midway through multi-step update.
Servers may crash, e.g. after executing but before replying.
Network partition may make live servers look dead; risk of “split brain”.
Consistency and performance are enemies.
Strong consistency requires communication,
e.g. Get() must check for a recent Put().
Many designs provide only weak consistency, to gain speed.
e.g. Get() does not yield the latest Put()!
Painful for application programmers but may be a good trade-off.
Many design points are possible in the consistency/performance spectrum!
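A toy sketch of the DB1/DB2 scenario above, assuming each value carries a version number. The versioning scheme and the names Replica, weak_get and strong_get are made up for this illustration and are not how any particular system does it. Reading one replica is cheap but can be stale; asking every replica costs more communication.

```python
class Replica:
    """One copy of the key/value data; each key maps to (version, value)."""
    def __init__(self):
        self.data = {}

    def put(self, key, version, value):
        self.data[key] = (version, value)

    def get(self, key):
        return self.data.get(key, (0, None))

def weak_get(replicas, key, which):
    """Read from a single replica: fast, but may return a stale value."""
    return replicas[which].get(key)[1]

def strong_get(replicas, key):
    """Ask every replica and return the value with the highest version."""
    return max((r.get(key) for r in replicas), key=lambda vv: vv[0])[1]

def demo():
    db1, db2 = Replica(), Replica()
    for r in (db1, db2):
        r.put("k", 1, "old")
    # The writer updates DB1, then loses power before reaching DB2.
    db1.put("k", 2, "new")
    replicas = [db1, db2]
    return weak_get(replicas, "k", which=1), strong_get(replicas, "k")
```

Here demo() returns ("old", "new"): the single-replica read sees the stale value. This strong_get only works while every replica is reachable, which is part of why real designs are harder.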
CASE STUDY: MapReduce
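[This is really batch processing, because Reduce cannot start until the Maps have finished. Someone asked whether there is a streaming approach.]
The 2004 paper describes a framework Google designed for running large computations over terabytes of data, such as building an index of all web content, or analyzing the link structure of the whole web to identify the most important or most authoritative pages. Building an index basically means passing over all the data and sorting the entire contents. The goal was to finish such computations quickly on thousands of computers.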
Let’s talk about MapReduce (MR) as a case study
a good illustration of 6.824’s main topics
hugely influential
the focus of Lab 1
MapReduce overview
context: multi-hour computations on multi-terabyte data-sets
e.g. build search index, or sort, or analyze structure of web
only practical with 1000s of computers
applications not written by distributed systems experts
overall goal: easy for non-specialist programmers
programmer just defines Map and Reduce functions
often fairly simple sequential code
MR takes care of, and hides, all aspects of distribution!
Abstract view of a MapReduce job
input is (already) split into M files
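Maps run in parallel. Map's input is a file and its output is a list of key/value pairs, called intermediate data. The second step runs Reduce, which gathers every instance of a key from all the Maps. The example below is word count. The whole computation is called a job; each Map invocation is called a task.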
Input1 -> Map -> a,1 b,1
Input2 -> Map ->     b,1
Input3 -> Map -> a,1     c,1
                  |   |   |
                  |   |   -> Reduce -> c,1
                  |   -----> Reduce -> b,2
                  ---------> Reduce -> a,2
MR calls Map() for each input file, produces set of k2,v2
“intermediate” data /the output of Map is called intermediate data
each Map() call is a “task”
MR gathers all intermediate v2’s for a given k2,
and passes each key + values to a Reduce call
final output is set of <k2,v3> pairs from Reduce()s
Example: word count
input is thousands of text files
Map(k, v) /k is the file name, v is the contents of this Map's input file
  split v into words
  for each word w
    emit(w, “1”)
Reduce(k, v) /k is the key (a word), v is the list of values emitted for that key
  emit(len(v))
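The word-count pseudocode above can be made runnable as a small sequential sketch. run_mapreduce is a hypothetical stand-in for the MR framework that runs everything in one process, with no distribution or fault tolerance; the three inputs mirror the diagram earlier in these notes.

```python
from collections import defaultdict

def map_fn(filename, contents):
    """Emit (word, "1") for every word in one input file."""
    return [(word, "1") for word in contents.split()]

def reduce_fn(key, values):
    """The count for a word is the number of values emitted for it."""
    return str(len(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    intermediate = defaultdict(list)
    for filename, contents in inputs.items():
        for k2, v2 in map_fn(filename, contents):
            intermediate[k2].append(v2)
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(intermediate.items())}

inputs = {"Input1": "a b", "Input2": "b", "Input3": "a c"}
result = run_mapreduce(inputs, map_fn, reduce_fn)
# result is {"a": "2", "b": "2", "c": "1"}
```

The application programmer writes only map_fn and reduce_fn; everything inside run_mapreduce is what the real system distributes and hides.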
MapReduce scales well:
N “worker” computers get you Nx throughput.
Map()s can run in parallel, since they don’t interact.
Same for Reduce()s.
So you can get more throughput by buying more computers.
MapReduce hides many details:
sending app code to servers
tracking which tasks are done
moving data from Maps to Reduces
balancing load over servers
recovering from failures
However, MapReduce limits what apps can do:
No interaction or state (other than via intermediate output).
No iteration, no multi-stage pipelines.
No real-time or streaming processing.
Input and output are stored on the GFS cluster file system
MR needs huge parallel input and output throughput.
GFS splits files over many servers, in 64 MB chunks
Maps read in parallel
Reduces write in parallel
GFS also replicates each file on 2 or 3 servers
Having GFS is a big win for MapReduce
What will likely limit the performance?
We care since that’s the thing to optimize.
CPU? memory? disk? network?
In 2004 authors were limited by network capacity.
What does MR send over the network?
Maps read input from GFS.
Reduces read Map output.
Can be as large as input, e.g. for sorting.
Reduces write output files to GFS.
[diagram: servers, tree of network switches]
In MR’s all-to-all shuffle, half of traffic goes through root switch.
Paper’s root switch: 100 to 200 gigabits/second, total
1800 machines, so 55 megabits/second/machine.
55 is small, e.g. much less than disk or RAM speed.
Today: networks and root switches are much faster relative to CPU/disk.
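The per-machine figure above is just the root switch capacity divided by the number of machines. A quick check, where the function name per_machine_mbps is made up for this illustration:

```python
def per_machine_mbps(root_switch_gbps, machines):
    """Share of root-switch bandwidth per machine, in megabits/second."""
    return root_switch_gbps * 1000 / machines

low = per_machine_mbps(100, 1800)   # about 55.6
high = per_machine_mbps(200, 1800)  # about 111.1
```

So the 55 megabits/second in these notes corresponds to the 100 gigabit/second end of the quoted range.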
Some details (paper’s Figure 1):
one master, that hands out tasks to workers and remembers progress.
1. master gives Map tasks to workers until all Maps complete
Maps write output (intermediate data) to local disk
Maps split output, by hash, into one file per Reduce task
2. after all Maps have finished, master hands out Reduce tasks
each Reduce fetches its intermediate output from (all) Map workers
each Reduce task writes a separate output file on GFS
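A sketch of the "split output, by hash, into one file per Reduce task" step above. The choice of crc32 is an assumption for this example; it is used because Python's built-in hash() of a string differs between processes, and every Map worker must send a given key to the same Reduce task. In-memory lists stand in for the per-Reduce files.

```python
import zlib

def partition(key, n_reduce):
    """Pick the Reduce task for a key; crc32 gives the same answer in every process."""
    return zlib.crc32(key.encode("utf-8")) % n_reduce

def split_map_output(pairs, n_reduce):
    """Split one Map task's output into n_reduce buckets (one file each in real MR)."""
    buckets = [[] for _ in range(n_reduce)]
    for key, value in pairs:
        buckets[partition(key, n_reduce)].append((key, value))
    return buckets

buckets = split_map_output([("a", "1"), ("b", "1"), ("a", "1")], 3)
```

Because the partition depends only on the key, all values for one key end up at a single Reduce task no matter which Map produced them.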
How does MR minimize network use?
Master tries to run each Map task on GFS server that stores its input.
All computers run both GFS and MR workers
So input is read from local disk (via GFS), not over network.
Intermediate data goes over network just once.
Map worker writes to local disk.
Reduce workers read directly from Map workers, not via GFS.
Intermediate data partitioned into files holding many keys.
R is much smaller than the number of keys.
Big network transfers are more efficient.
How does MR get good load balance?
Wasteful and slow if N-1 servers have to wait for 1 slow server to finish.
But some tasks likely take longer than others.
Solution: many more tasks than workers.
Master hands out new tasks to workers who finish previous tasks.
So no task is so big it dominates completion time (hopefully).
So faster servers do more tasks than slower ones, finish about the same time.
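A small simulation of the "many more tasks than workers" idea above. The worker speeds and task sizes are made-up numbers, and completion_time is a hypothetical helper: the master always hands the next task to whichever worker becomes free first.

```python
import heapq

def completion_time(task_costs, worker_speeds):
    """Time at which the last task finishes under greedy hand-out."""
    # Heap of (time the worker becomes free, worker index).
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for cost in task_costs:
        t, w = heapq.heappop(free_at)
        done = t + cost / worker_speeds[w]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, w))
    return finish

speeds = [1.0, 1.0, 0.5]                       # third worker is half as fast
few_big = completion_time([40] * 3, speeds)    # 3 tasks of 40 units each
many_small = completion_time([2] * 60, speeds) # 60 tasks of 2 units each
```

Both runs do 120 units of work. With one big task per worker the job waits for the slow machine (80 time units); with 60 small tasks the fast workers simply take more of them and the job finishes in 48.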
What about fault tolerance?
I.e. what if a worker crashes during a MR job?
We want to completely hide failures from the application programmer!
Does MR have to re-run the whole job from the beginning?
Why not?
MR re-runs just the failed Map()s and Reduce()s.
Suppose MR runs a Map twice, one Reduce sees first run’s output,
another Reduce sees the second run’s output?
Correctness requires re-execution to yield exactly the same output.
So Map and Reduce must be pure deterministic functions:
they are only allowed to look at their arguments.
no state, no file I/O, no interaction, no external communication.
What if you wanted to allow non-functional Map or Reduce?
Worker failure would require whole job to be re-executed,
or you’d need to create synchronized global checkpoints.
Details of worker crash recovery:
- Map worker crashes:
master notices worker no longer responds to pings
master knows which Map tasks it ran on that worker
those tasks’ intermediate output is now lost, must be re-created
master tells other workers to run those tasks
can omit re-running if Reduces already fetched the intermediate data
- Reduce worker crashes:
finished tasks are OK – stored in GFS, with replicas.
master re-starts worker’s unfinished tasks on other workers.
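The recovery rules above can be sketched as master-side bookkeeping. The table layout and the name tasks_to_rerun are assumptions for this illustration, and the sketch ignores the optimization of skipping a Map whose output every Reduce has already fetched.

```python
def tasks_to_rerun(task_table, dead_worker):
    """Reset and return the ids of tasks the master must give to other workers.

    task_table maps task id -> {"kind": "map" or "reduce",
                                "state": "idle", "in-progress" or "completed",
                                "worker": worker id or None}.
    """
    rerun = []
    for task_id, t in task_table.items():
        if t["worker"] != dead_worker:
            continue
        # Completed Map output lived on the dead worker's local disk, so it is lost.
        lost_map_output = t["kind"] == "map" and t["state"] == "completed"
        if t["state"] == "in-progress" or lost_map_output:
            t["state"], t["worker"] = "idle", None
            rerun.append(task_id)
    return sorted(rerun)

table = {
    "m1": {"kind": "map", "state": "completed", "worker": "w1"},
    "m2": {"kind": "map", "state": "completed", "worker": "w2"},
    "r1": {"kind": "reduce", "state": "completed", "worker": "w1"},
    "r2": {"kind": "reduce", "state": "in-progress", "worker": "w1"},
}
lost = tasks_to_rerun(table, "w1")
```

When w1 dies, m1 and r2 are re-run. r1 is left alone because its finished output is already in GFS.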
Other failures/problems:
- What if the master gives two workers the same Map() task?
perhaps the master incorrectly thinks one worker died.
it will tell Reduce workers about only one of them.
- What if the master gives two workers the same Reduce() task?
they will both try to write the same output file on GFS!
atomic GFS rename prevents mixing; one complete file will be visible.
- What if a single worker is very slow – a “straggler”?
perhaps due to flakey hardware.
master starts a second copy of last few tasks.
- What if a worker computes incorrect output, due to broken h/w or s/w?
too bad! MR assumes “fail-stop” CPUs and software.
- What if the master crashes?
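The atomic-rename point above (two workers writing the same Reduce output) can be imitated on a local file system. This is only an analogy, using Python's os.replace rather than GFS; the function name and the output file name used with it are made up. Each writer fills a private temporary file and then renames it into place, so readers see one complete file and never a mix.

```python
import os
import tempfile

def write_output_atomically(final_path, data):
    """Write to a private temporary file, then rename it over final_path."""
    directory = os.path.dirname(final_path) or "."
    # Create the temp file in the same directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # os.replace renames atomically on POSIX; an existing file is replaced whole.
    os.replace(tmp_path, final_path)
```

If two workers both call this for the same final_path, whichever rename happens last wins. Since Reduce is deterministic, both files hold the same contents anyway.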
Current status?
Hugely influential (Hadoop, Spark, &c).
Probably no longer in use at Google.
Replaced by Flume / FlumeJava (see paper by Chambers et al).
GFS replaced by Colossus (no good description), and BigTable.
Conclusion
MapReduce single-handedly made big cluster computation popular.
- Not the most efficient or flexible.
- Scales well.
- Easy to program – failures and data movement are hidden.
These were good trade-offs in practice.
We’ll see some more advanced successors later in the course.
Have fun with the lab!