What is a distributed system?
  multiple cooperating computers
  storage for big web sites, MapReduce, peer-to-peer sharing, &c
  lots of critical infrastructure is distributed
Why do people build distributed systems?
  to increase capacity via parallelism
  to tolerate faults via replication
  to place computing physically close to external entities
  to achieve security via isolation
But:
  many concurrent parts, complex interactions
  must cope with partial failure
  tricky to realize performance potential
MAIN TOPICS

This is a course about infrastructure for applications.
  * Storage.
  * Communication.
  * Computation.
The big goal: abstractions that hide the complexity of distribution.
A couple of topics will come up repeatedly in our search.

Topic: implementation
  RPC, threads, concurrency control.
  The labs...

Topic: performance
  The goal: scalable throughput.
    Nx servers -> Nx total throughput via parallel CPU, disk, net.
    [diagram: users, application servers, storage servers]
    So handling more load only requires buying more computers,
      rather than re-design by expensive programmers.
    Effective when you can divide work w/o much interaction.
  Scaling gets harder as N grows:
    load imbalance, stragglers, slowest-of-N latency
    non-parallelizable code: initialization, interaction
    bottlenecks from shared resources, e.g. network
  Some performance problems aren't easily solved by scaling:
    e.g. quick response time for a single user request
    e.g. all users want to update the same data
    These often require better design rather than just more computers.
  Lab 4

Topic: fault tolerance
  1000s of servers + a big network -> always something broken.
  We'd like to hide these failures from the application.
  We often want:
    Availability -- app can make progress despite failures
    Recoverability -- app will come back to life when failures are repaired
  Big idea: replicated servers.
    If one server crashes, the app can proceed using the other(s).
  Labs 1, 2, and 3

Topic: consistency
  General-purpose infrastructure needs well-defined behavior,
    e.g. "Get(k) yields the value from the most recent Put(k,v)"
    (see the single-server sketch below).
  Achieving good behavior is hard!
    "Replica" servers are hard to keep identical.
    Clients may crash midway through a multi-step update.
    Servers may crash, e.g. after executing but before replying.
    A network partition may make live servers look dead; risk of "split brain".
  Consistency and performance are enemies:
    strong consistency requires communication,
      e.g. Get() must check for a recent Put().
    Many designs provide only weak consistency, to gain speed,
      e.g. Get() does *not* yield the latest Put()!
    Weak consistency is painful for application programmers,
      but may be a good trade-off.
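To make the implementation and consistency topics concrete, here is a
minimal sketch in Go (the labs' language) of a single-server key/value
store. The KV type and its Put/Get methods are illustrative, not any
lab's API. With one server and one lock, the Get/Put spec above holds
trivially; the hard part, which the labs tackle, is preserving it
across replicas and failures.

  package main

  import (
      "fmt"
      "sync"
  )

  // KV is a toy single-server key/value store. The mutex is the
  // concurrency control: it serializes Puts and Gets, so a Get
  // observes a single well-defined "most recent" Put.
  type KV struct {
      mu   sync.Mutex
      data map[string]string
  }

  func NewKV() *KV {
      return &KV{data: map[string]string{}}
  }

  func (kv *KV) Put(k, v string) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      kv.data[k] = v
  }

  func (kv *KV) Get(k string) (string, bool) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      v, ok := kv.data[k]
      return v, ok
  }

  func main() {
      kv := NewKV()
      var wg sync.WaitGroup
      // Concurrent Puts from several goroutines: the lock
      // serializes them, so the later Get sees exactly one of
      // the written values, never a mix.
      for i := 0; i < 4; i++ {
          wg.Add(1)
          go func(i int) {
              defer wg.Done()
              kv.Put("x", fmt.Sprint(i))
          }(i)
      }
      wg.Wait()
      v, _ := kv.Get("x")
      fmt.Println("x =", v)
  }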
CASE STUDY: MapReduce

Let's talk about MapReduce (MR) as a case study:
  a good illustration of 6.824's main topics
  hugely influential
  the focus of Lab 1

MapReduce overview
  context: multi-hour computations on multi-terabyte data-sets
    e.g. build search index, or sort, or analyze structure of web
    only practical with 1000s of computers
    applications not written by distributed systems experts
  overall goal: easy for non-specialist programmers
    programmer just defines Map and Reduce functions,
      often fairly simple sequential code
    MR takes care of, and hides, all aspects of distribution!
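For flavor, here is roughly the Map and Reduce a programmer would
write for word count, in Go. This is a sketch: the KeyValue type and
the exact signatures are assumptions for illustration, not Lab 1's
required interface. Note that both functions are plain sequential
code with no visible distribution.

  package main

  import (
      "fmt"
      "strings"
      "unicode"
  )

  // KeyValue is an assumed intermediate pair type; Lab 1
  // supplies its own equivalent.
  type KeyValue struct {
      Key   string
      Value string
  }

  // Map is called once per input file and emits ("word", "1")
  // for every word in the file's contents.
  func Map(filename string, contents string) []KeyValue {
      words := strings.FieldsFunc(contents, func(r rune) bool {
          return !unicode.IsLetter(r)
      })
      kva := []KeyValue{}
      for _, w := range words {
          kva = append(kva, KeyValue{w, "1"})
      }
      return kva
  }

  // Reduce is called once per key, with every value emitted for
  // that key; for word count the answer is just the count.
  func Reduce(key string, values []string) string {
      return fmt.Sprint(len(values))
  }

  func main() {
      kva := Map("Input1", "a b a")
      fmt.Println(kva)                             // [{a 1} {b 1} {a 1}]
      fmt.Println(Reduce("a", []string{"1", "1"})) // 2
  }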
Abstract view of a MapReduce job
  input is (already) split into M files
  Input1 -> Map -> a,1 b,1
  Input2 -> Map ->     b,1
  Input3 -> Map -> a,1     c,1
                    |   |   |
                    |   |   -> Reduce -> c,1
                    |   -----> Reduce -> b,2
                    ---------> Reduce -> a,2
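To see the diagram as code, here is a minimal sequential driver,
again a sketch in Go: run Map over each input, group intermediate
pairs by key (the "shuffle"), then call Reduce once per key. The
three inputs are hard-coded to match the picture, and nothing here
is distributed or fault-tolerant.

  package main

  import (
      "fmt"
      "sort"
      "strings"
  )

  func main() {
      // The three "input files" from the diagram above.
      inputs := []string{"a b", "b", "a c"}

      // Map phase: one Map task per input, emitting (word, 1) pairs.
      type kv struct{ key, val string }
      intermediate := []kv{}
      for _, contents := range inputs {
          for _, w := range strings.Fields(contents) {
              intermediate = append(intermediate, kv{w, "1"})
          }
      }

      // "Shuffle": bring all pairs with the same key together.
      // In a real MR job this is the all-to-all movement of data
      // from Map workers to Reduce workers.
      sort.Slice(intermediate, func(i, j int) bool {
          return intermediate[i].key < intermediate[j].key
      })

      // Reduce phase: one Reduce call per distinct key.
      for i := 0; i < len(intermediate); {
          j := i + 1
          for j < len(intermediate) && intermediate[j].key == intermediate[i].key {
              j++
          }
          fmt.Printf("%s,%d\n", intermediate[i].key, j-i) // a,2  b,2  c,1
          i = j
      }
  }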