What is a distributed system?
  multiple cooperating computers
  storage for big web sites, MapReduce, peer-to-peer sharing, &c
  lots of critical infrastructure is distributed
Why do people build distributed systems?
  to increase capacity via parallelism
  to tolerate faults via replication
  to place computing physically close to external entities
  to achieve security via isolation
But:
  many concurrent parts, complex interactions
  must cope with partial failure
  tricky to realize performance potential
MAIN TOPICS

This is a course about infrastructure for applications.
  * Storage.
  * Communication.
  * Computation.
The big goal: abstractions that hide the complexity of distribution.
A couple of topics will come up repeatedly in our search.

Topic: implementation
  RPC, threads, concurrency control.
  The labs...

Topic: performance
  The goal: scalable throughput.
    Nx servers -> Nx total throughput via parallel CPU, disk, net.
    [diagram: users, application servers, storage servers]
    So handling more load only requires buying more computers,
      rather than re-design by expensive programmers.
    Effective when you can divide work w/o much interaction.
  Scaling gets harder as N grows:
    load imbalance, stragglers, slowest-of-N latency
    non-parallelizable code: initialization, interaction
    bottlenecks from shared resources, e.g. network
  Some performance problems aren't easily solved by scaling:
    e.g. quick response time for a single user request
    e.g. all users want to update the same data
    These often require better design rather than just more computers.
  Lab 4

Topic: fault tolerance
  1000s of servers + a big network -> always something broken.
  We'd like to hide these failures from the application.
  We often want:
    Availability -- app can make progress despite failures
    Recoverability -- app will come back to life when failures are repaired
  Big idea: replicated servers.
    If one server crashes, the app can proceed using the other(s).
  Labs 1, 2, and 3

Topic: consistency
  General-purpose infrastructure needs well-defined behavior,
    e.g. "Get(k) yields the value from the most recent Put(k,v)"
    (see the single-server sketch below).
  Achieving good behavior is hard!
    "Replica" servers are hard to keep identical.
    Clients may crash midway through a multi-step update.
    Servers may crash, e.g. after executing but before replying.
    A network partition may make live servers look dead; risk of "split brain".
  Consistency and performance are enemies:
    strong consistency requires communication,
      e.g. Get() must check for a recent Put().
    Many designs provide only weak consistency, to gain speed,
      e.g. Get() does *not* yield the latest Put()!
    Weak consistency is painful for application programmers,
      but may be a good trade-off.
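To make the implementation and consistency topics concrete, here is a
minimal sketch in Go (the labs' language) of a single-server key/value
store. The KV type and its Put/Get methods are illustrative, not any
lab's API. With one server and one lock, the Get/Put spec above holds
trivially; the hard part, which the labs tackle, is preserving it
across replicas and failures.

  package main

  import (
      "fmt"
      "sync"
  )

  // KV is a toy single-server key/value store. The mutex is the
  // concurrency control: it serializes Puts and Gets, so a Get
  // observes a single well-defined "most recent" Put.
  type KV struct {
      mu   sync.Mutex
      data map[string]string
  }

  func NewKV() *KV {
      return &KV{data: map[string]string{}}
  }

  func (kv *KV) Put(k, v string) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      kv.data[k] = v
  }

  func (kv *KV) Get(k string) (string, bool) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      v, ok := kv.data[k]
      return v, ok
  }

  func main() {
      kv := NewKV()
      var wg sync.WaitGroup
      // Concurrent Puts from several goroutines: the lock
      // serializes them, so the later Get sees exactly one of
      // the written values, never a mix.
      for i := 0; i < 4; i++ {
          wg.Add(1)
          go func(i int) {
              defer wg.Done()
              kv.Put("x", fmt.Sprint(i))
          }(i)
      }
      wg.Wait()
      v, _ := kv.Get("x")
      fmt.Println("x =", v)
  }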
CASE STUDY: MapReduce

Let's talk about MapReduce (MR) as a case study:
  a good illustration of 6.824's main topics
  hugely influential
  the focus of Lab 1

MapReduce overview
  context: multi-hour computations on multi-terabyte data-sets
    e.g. build search index, or sort, or analyze structure of web
    only practical with 1000s of computers
    applications not written by distributed systems experts
  overall goal: easy for non-specialist programmers
    programmer just defines Map and Reduce functions,
      often fairly simple sequential code
    MR takes care of, and hides, all aspects of distribution!
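For flavor, here is roughly the Map and Reduce a programmer would
write for word count, in Go. This is a sketch: the KeyValue type and
the exact signatures are assumptions for illustration, not Lab 1's
required interface. Note that both functions are plain sequential
code with no visible distribution.

  package main

  import (
      "fmt"
      "strings"
      "unicode"
  )

  // KeyValue is an assumed intermediate pair type; Lab 1
  // supplies its own equivalent.
  type KeyValue struct {
      Key   string
      Value string
  }

  // Map is called once per input file and emits ("word", "1")
  // for every word in the file's contents.
  func Map(filename string, contents string) []KeyValue {
      words := strings.FieldsFunc(contents, func(r rune) bool {
          return !unicode.IsLetter(r)
      })
      kva := []KeyValue{}
      for _, w := range words {
          kva = append(kva, KeyValue{w, "1"})
      }
      return kva
  }

  // Reduce is called once per key, with every value emitted for
  // that key; for word count the answer is just the count.
  func Reduce(key string, values []string) string {
      return fmt.Sprint(len(values))
  }

  func main() {
      kva := Map("Input1", "a b a")
      fmt.Println(kva)                             // [{a 1} {b 1} {a 1}]
      fmt.Println(Reduce("a", []string{"1", "1"})) // 2
  }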
Abstract view of a MapReduce job
  input is (already) split into M files
  Input1 -> Map -> a,1 b,1
  Input2 -> Map ->     b,1
  Input3 -> Map -> a,1     c,1
                    |   |   |
                    |   |   -> Reduce -> c,1
                    |   -----> Reduce -> b,2
                    ---------> Reduce -> a,2
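To see the diagram as code, here is a minimal sequential driver,
again a sketch in Go: run Map over each input, group intermediate
pairs by key (the "shuffle"), then call Reduce once per key. The
three inputs are hard-coded to match the picture, and nothing here
is distributed or fault-tolerant.

  package main

  import (
      "fmt"
      "sort"
      "strings"
  )

  func main() {
      // The three "input files" from the diagram above.
      inputs := []string{"a b", "b", "a c"}

      // Map phase: one Map task per input, emitting (word, 1) pairs.
      type kv struct{ key, val string }
      intermediate := []kv{}
      for _, contents := range inputs {
          for _, w := range strings.Fields(contents) {
              intermediate = append(intermediate, kv{w, "1"})
          }
      }

      // "Shuffle": bring all pairs with the same key together.
      // In a real MR job this is the all-to-all movement of data
      // from Map workers to Reduce workers.
      sort.Slice(intermediate, func(i, j int) bool {
          return intermediate[i].key < intermediate[j].key
      })

      // Reduce phase: one Reduce call per distinct key.
      for i := 0; i < len(intermediate); {
          j := i + 1
          for j < len(intermediate) && intermediate[j].key == intermediate[i].key {
              j++
          }
          fmt.Printf("%s,%d\n", intermediate[i].key, j-i) // a,2  b,2  c,1
          i = j
      }
  }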