MIT 6.824 L01: Introduction

6.824 2020 Lecture 1: Introduction

6.824: Distributed Systems Engineering

What is a distributed system?

  • multiple cooperating computers
  • storage for big web sites, MapReduce, peer-to-peer sharing, &c
  • lots of critical infrastructure is distributed

Why do people build distributed systems?

  • to increase capacity via parallelism
  • to tolerate faults via replication
  • to place computing physically close to external entities
  • to achieve security via isolation

But:

  • many concurrent parts, complex interactions
  • must cope with partial failure
  • tricky to realize performance potential

Why take this course?

  • interesting – hard problems, powerful solutions
  • used by real systems – driven by the rise of big Web sites
  • active research area – important unsolved problems
  • hands-on – you’ll build real systems in the labs

COURSE STRUCTURE

http://pdos.csail.mit.edu/6.824

Course staff:

  • Robert Morris, lecturer
  • Anish Athalye, TA
  • Aakriti Shroff, TA
  • Favyen Bastani, TA
  • Tossaporn Saengja, TA

Course components:

  • lectures
  • papers
  • two exams
  • labs
  • final project (optional)

Lectures:

  • big ideas, paper discussion, and labs
  • will be video-taped, available online

Papers:

  • research papers, some classic, some new
  • problems, ideas, implementation details, evaluation
  • many lectures focus on papers
  • please read papers before class!
  • each paper has a short question for you to answer,
    and we ask you to send us a question you have about the paper;
    submit your question and answer by midnight the night before class

Exams:

  • Mid-term exam in class
  • Final exam during finals week
  • Mostly about papers and labs

Labs:

  • goal: deeper understanding of some important techniques
  • goal: experience with distributed programming
  • first lab is due a week from Friday
  • one per week after that for a while

Lab 1: MapReduce
Lab 2: replication for fault-tolerance using Raft
Lab 3: fault-tolerant key/value store
Lab 4: sharded key/value store

Optional final project at the end, in groups of 2 or 3.

  • The final project substitutes for Lab 4.
  • You think of a project and clear it with us.
  • Code, short write-up, short demo on last day.

Lab grades depend on how many test cases you pass; we give you the tests, so you know whether you’ll do well.

Debugging the labs can be time-consuming

  • start early
  • come to TA office hours
  • ask questions on Piazza

MAIN TOPICS

This is a course about infrastructure for applications.

  • Storage.
  • Communication.
  • Computation.

The big goal:

  • abstractions that hide the complexity of distribution.
  • A couple of topics will come up repeatedly in our search.

Topic: implementation
RPC, threads, concurrency control.
The labs…
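
The labs build these pieces in Go, where RPC and goroutines do most of the work. Purely as a hedged illustration (the Echo service, the Args/Reply types, and the addresses below are made up, not the labs' API), here is a minimal RPC round trip with Go's standard net/rpc package:

    package main

    import (
    	"fmt"
    	"log"
    	"net"
    	"net/rpc"
    )

    // Args and Reply are hypothetical request/response types for one RPC.
    type Args struct{ X int }
    type Reply struct{ Y int }

    // Echo is a toy RPC service; Double returns 2*X.
    type Echo struct{}

    func (e *Echo) Double(args *Args, reply *Reply) error {
    	reply.Y = 2 * args.X
    	return nil
    }

    func main() {
    	// Server side: register the service and serve connections in the background.
    	srv := rpc.NewServer()
    	if err := srv.Register(new(Echo)); err != nil {
    		log.Fatal(err)
    	}
    	ln, err := net.Listen("tcp", "127.0.0.1:0")
    	if err != nil {
    		log.Fatal(err)
    	}
    	go srv.Accept(ln)

    	// Client side: dial the server and make a synchronous call.
    	client, err := rpc.Dial("tcp", ln.Addr().String())
    	if err != nil {
    		log.Fatal(err)
    	}
    	var reply Reply
    	if err := client.Call("Echo.Double", &Args{X: 21}, &reply); err != nil {
    		log.Fatal(err)
    	}
    	fmt.Println(reply.Y) // prints 42
    }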

Topic: performance
The goal: scalable throughput
Nx servers -> Nx total throughput via parallel CPU, disk, net.
[diagram: users, application servers, storage servers]
So handling more load only requires buying more computers,
rather than re-design by expensive programmers.
Effective when you can divide work w/o much interaction.
Scaling gets harder as N grows:

  • Load imbalance, stragglers, slowest-of-N latency.
  • Non-parallelizable code: initialization, interaction.
  • Bottlenecks from shared resources, e.g. network.

Some performance problems aren’t easily solved by scaling
  e.g. quick response time for a single user request
  e.g. all users want to update the same data
  often requires better design rather than just more computers
Lab 4

Topic: fault tolerance
1000s of servers, big network -> always something broken
We’d like to hide these failures from the application.
We often want:
Availability – app can make progress despite failures
Recoverability – app will come back to life when failures are repaired
Big idea: replicated servers.
If one server crashes, can proceed using the other(s).
Labs 1, 2 and 3

Topic: consistency
General-purpose infrastructure needs well-defined behavior.
E.g. “Get(k) yields the value from the most recent Put(k,v).”
Achieving good behavior is hard!
“Replica” servers are hard to keep identical.
Clients may crash midway through multi-step update.
Servers may crash, e.g. after executing but before replying.
Network partition may make live servers look dead; risk of “split brain”.
Consistency and performance are enemies.
Strong consistency requires communication, e.g. Get() must check for a recent Put().
Many designs provide only weak consistency, to gain speed.
e.g. Get() does not yield the latest Put()!
Painful for application programmers but may be a good trade-off.
Many design points are possible in the consistency/performance spectrum!
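
To make the Get/Put contract concrete, here is a hypothetical key/value client interface in Go; the comments restate the strong-consistency guarantee described above. The Clerk name echoes the labs, but this exact interface is an illustration, not the lab API.

    package kv

    // Clerk is a hypothetical client handle for a replicated key/value service
    // (Labs 3 and 4 build something in this spirit on top of Raft).
    type Clerk interface {
    	// Put stores value under key.
    	Put(key, value string)

    	// Get returns the value written by the most recent completed Put(key, ...).
    	// That is the strong (linearizable) contract sketched above; a weakly
    	// consistent service may instead return a stale value, trading
    	// guarantees for less communication between replicas.
    	Get(key string) string
    }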

CASE STUDY: MapReduce

Baidu Baike: MapReduce

MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its central ideas, “Map” and “Reduce”, are borrowed from functional programming languages, along with features taken from vector programming languages. It makes it much easier for programmers who know nothing about distributed parallel programming to run their programs on a distributed system. Current implementations have the programmer specify a Map function, which turns a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function, which ensures that all of the mapped key/value pairs sharing the same key are grouped together.

Let’s talk about MapReduce (MR) as a case study
a good illustration of 6.824’s main topics
hugely influential
the focus of Lab 1

MapReduce overview
context: multi-hour computations on multi-terabyte data-sets
e.g. build search index, or sort, or analyze structure of web
only practical with 1000s of computers
applications not written by distributed systems experts
overall goal: easy for non-specialist programmers
programmer just defines Map and Reduce functions
often fairly simple sequential code
MR takes care of, and hides, all aspects of distribution!

Abstract view of a MapReduce job
input is (already) split into M files
  Input1 -> Map -> a,1 b,1
  Input2 -> Map ->     b,1
  Input3 -> Map -> a,1     c,1
                    |   |   |
                    |   |   -> Reduce -> c,1
                    |   -----> Reduce -> b,2
                    ---------> Reduce -> a,2
MR calls Map() for each input file, produces set of k2,v2
“intermediate” data
each Map() call is a “task”
MR gathers all intermediate v2’s for a given k2,
and passes each key + values to a Reduce call
final output is set of <k2,v3> pairs from Reduce()s

Example: word count
  input is thousands of text files
  Map(k, v)
    split v into words
    for each word w
      emit(w, "1")   // one (w, "1") pair per occurrence of w
  Reduce(k, v)
    emit(len(v))
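
For comparison, here is roughly what this pseudocode looks like in Go, in the shape Lab 1 uses: Map takes a file name and its contents and returns a list of KeyValue pairs, and Reduce takes one key together with all the values emitted for it. Treat the KeyValue type and the exact signatures as a sketch rather than the definitive lab interface; the small main only shows the grouping step the framework normally performs between Map and Reduce.

    package main

    import (
    	"fmt"
    	"strconv"
    	"strings"
    	"unicode"
    )

    // KeyValue is one piece of intermediate data.
    type KeyValue struct {
    	Key   string
    	Value string
    }

    // Map emits a (word, "1") pair for every word in the file's contents.
    // The filename argument is unused here, but MapReduce passes it anyway.
    func Map(filename string, contents string) []KeyValue {
    	notLetter := func(r rune) bool { return !unicode.IsLetter(r) }
    	words := strings.FieldsFunc(contents, notLetter)

    	kva := []KeyValue{}
    	for _, w := range words {
    		kva = append(kva, KeyValue{Key: w, Value: "1"})
    	}
    	return kva
    }

    // Reduce receives one word and all of the "1"s emitted for it;
    // the count of occurrences is simply the number of values.
    func Reduce(key string, values []string) string {
    	return strconv.Itoa(len(values))
    }

    func main() {
    	kva := Map("doc.txt", "the quick brown fox jumps over the lazy dog the end")
    	// Group values by key, as the framework would do between Map and Reduce.
    	grouped := map[string][]string{}
    	for _, kv := range kva {
    		grouped[kv.Key] = append(grouped[kv.Key], kv.Value)
    	}
    	fmt.Println("the ->", Reduce("the", grouped["the"])) // the -> 3
    }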

MapReduce scales well:
N “worker” computers get you Nx throughput.
Map()s can run in parallel, since they don’t interact.
Same for Reduce()s.
So you can get more throughput by buying more computers.

MapReduce hides many details:

  • sending app code to servers
  • tracking which tasks are done
  • moving data from Maps to Reduces
  • balancing load over servers
  • recovering from failures

However, MapReduce limits what apps can do:

  • No interaction or state (other than via intermediate output).
  • No iteration, no multi-stage pipelines.
  • No real-time or streaming processing.

Input and output are stored on the GFS cluster file system

  • MR needs huge parallel input and output throughput.
  • GFS splits files over many servers, in 64 MB chunks
    Maps read in parallel
    Reduces write in parallel
  • GFS also replicates each file on 2 or 3 servers
    Having GFS is a big win for MapReduce

What will likely limit the performance?
We care since that’s the thing to optimize.
CPU? memory? disk? network?
In 2004 the authors were limited by network capacity.
What does MR send over the network?
Maps read input from GFS.
Reduces read Map output.
Can be as large as input, e.g. for sorting.
Reduces write output files to GFS.
[diagram: servers, tree of network switches]
In MR’s all-to-all shuffle, half of the traffic goes through the root switch.
Paper’s root switch: 100 to 200 gigabits/second, total 1800 machines,
so roughly 55 megabits/second/machine (100,000 Mb/s ÷ 1800 ≈ 55 Mb/s).
55 is small, e.g. much less than disk or RAM speed.
Today: networks and root switches are much faster relative to CPU/disk.

Some details (paper’s Figure 1):
one master, that hands out tasks to workers and remembers progress.

  1. master gives Map tasks to workers until all Maps complete
    Maps write output (intermediate data) to local disk
    Maps split output, by hash, into one file per Reduce task
    (see the partitioning sketch just after this list)
  2. after all Maps have finished, master hands out Reduce tasks
    each Reduce fetches its intermediate output from (all) Map workers
    each Reduce task writes a separate output file on GFS
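
The “split output by hash” step in item 1 amounts to picking an intermediate file number per key. Here is a minimal sketch of that partitioning in Go; the FNV-based ihash follows the Lab 1 skeleton’s suggestion, and nReduce and the sample keys are made-up values.

    package main

    import (
    	"fmt"
    	"hash/fnv"
    )

    // ihash maps a key to a Reduce-task number in [0, nReduce); any stable
    // hash would do, FNV is simply what the Lab 1 skeleton suggests.
    func ihash(key string) int {
    	h := fnv.New32a()
    	h.Write([]byte(key))
    	return int(h.Sum32() & 0x7fffffff)
    }

    func main() {
    	nReduce := 10 // assumed number of Reduce tasks
    	// A Map worker appends key k's pairs to intermediate file ihash(k) % nReduce,
    	// so every value for a given key ends up at exactly one Reduce task.
    	for _, k := range []string{"apple", "banana", "cherry"} {
    		fmt.Printf("%q -> reduce task %d\n", k, ihash(k)%nReduce)
    	}
    }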

How does MR minimize network use?
Master tries to run each Map task on the GFS server that stores its input.
All computers run both GFS and MR workers
So input is read from local disk (via GFS), not over the network.
Intermediate data goes over the network just once.
Map workers write to local disk.
Reduce workers read directly from Map workers, not via GFS.
Intermediate data is partitioned into files holding many keys.
R is much smaller than the number of keys.
Big network transfers are more efficient.

How does MR get good load balance?
Wasteful and slow if N-1 servers have to wait for 1 slow server to finish.
But some tasks likely take longer than others.
Solution: many more tasks than workers.
Master hands out new tasks to workers who finish previous tasks.
So no task is so big it dominates completion time (hopefully).
So faster servers do more tasks than slower ones, and finish at about the same time.
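
A hedged sketch of how a master might implement this pull-based assignment in Go: workers ask for work whenever they finish, the master hands out the next idle task, and faster workers naturally end up doing more of them. The type and field names here are illustrations, not the paper’s or the labs’ exact interface.

    package mr

    import (
    	"sync"
    	"time"
    )

    // taskState is one entry in the master's task table (names are made up).
    type taskState struct {
    	done    bool
    	running bool
    	started time.Time
    }

    // GetTaskArgs/GetTaskReply are hypothetical RPC messages; ID == -1 means
    // "nothing to hand out right now, ask again later".
    type GetTaskArgs struct{}
    type GetTaskReply struct{ ID int }

    // Master tracks many more tasks than there are workers.
    type Master struct {
    	mu    sync.Mutex
    	tasks []taskState
    }

    // GetTask is an RPC handler: a worker that has just finished calls in, and
    // the master hands it the next idle task. Fast workers call more often and
    // therefore end up doing more tasks.
    func (m *Master) GetTask(args *GetTaskArgs, reply *GetTaskReply) error {
    	m.mu.Lock()
    	defer m.mu.Unlock()
    	for i := range m.tasks {
    		t := &m.tasks[i]
    		if !t.done && !t.running {
    			t.running = true
    			t.started = time.Now()
    			reply.ID = i
    			return nil
    		}
    	}
    	reply.ID = -1
    	return nil
    }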

What about fault tolerance?
I.e. what if a worker crashes during an MR job?
We want to completely hide failures from the application programmer!
Does MR have to re-run the whole job from the beginning?
Why not?
MR re-runs just the failed Map()s and Reduce()s.
Suppose MR runs a Map twice, one Reduce sees the first run’s output,
and another Reduce sees the second run’s output?
Correctness requires re-execution to yield exactly the same output.
So Map and Reduce must be pure deterministic functions:
they are only allowed to look at their arguments.
no state, no file I/O, no interaction, no external communication.
What if you wanted to allow non-functional Map or Reduce?
Worker failure would require the whole job to be re-executed,
or you’d need to create synchronized global checkpoints.

Details of worker crash recovery (see the sketch after this list):

  • Map worker crashes:
    master notices the worker no longer responds to pings
    master knows which Map tasks it ran on that worker;
    those tasks’ intermediate output is now lost and must be re-created.
    master tells other workers to run those tasks

    can omit re-running if Reduces have already fetched the intermediate data

  • Reduce worker crashes:
    finished tasks are OK – stored in GFS, with replicas.
    master re-starts the worker’s unfinished tasks on other workers.
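
The paper’s master detects failures with pings; the Lab 1 coordinator can get the same effect with a per-task timeout, which also covers stragglers. Below is a self-contained sketch of that idea, reusing the hypothetical names from the earlier master sketch; the 10-second timeout is an assumption, not something from the paper.

    package mr

    import (
    	"sync"
    	"time"
    )

    // taskState and Master repeat the earlier sketch so this snippet stands alone.
    type taskState struct {
    	done    bool
    	running bool
    	started time.Time
    }

    type Master struct {
    	mu    sync.Mutex
    	tasks []taskState
    }

    // reapLostTasks marks tasks that have been in progress too long as idle
    // again, so the next worker that asks for work re-executes them. This
    // covers both crashed workers and stragglers. Re-executing a task that
    // actually finished is harmless because Map/Reduce are deterministic and
    // output files are installed atomically (e.g. GFS atomic rename).
    func (m *Master) reapLostTasks() {
    	const timeout = 10 * time.Second // assumed value; the paper uses pings instead
    	m.mu.Lock()
    	defer m.mu.Unlock()
    	for i := range m.tasks {
    		t := &m.tasks[i]
    		if t.running && !t.done && time.Since(t.started) > timeout {
    			t.running = false // idle again; another worker can pick it up
    		}
    	}
    }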

Other failures/problems:

  • What if the master gives two workers the same Map() task?
    perhaps the master incorrectly thinks one worker died.
    it will tell Reduce workers about only one of them.
  • What if the master gives two workers the same Reduce() task?
    they will both try to write the same output file on GFS!
    atomic GFS rename prevents mixing; one complete file will be visible.
  • What if a single worker is very slow – a “straggler”?
    perhaps due to flaky hardware.
    master starts a second copy of the last few tasks.
  • What if a worker computes incorrect output, due to broken h/w or s/w?
    too bad! MR assumes “fail-stop” CPUs and software.
  • What if the master crashes?

Current status?
Hugely influential (Hadoop, Spark, &c).
Probably no longer in use at Google.
Replaced by Flume / FlumeJava (see paper by Chambers et al).
GFS replaced by Colossus (no good description), and BigTable.

Conclusion
MapReduce single-handedly made big cluster computation popular.

  • Not the most efficient or flexible.
  • Scales well.
  • Easy to program – failures and data movement are hidden.
    These were good trade-offs in practice.

We’ll see some more advanced successors later in the course.
Have fun with the lab!