Notes: The Google File System - MIT 6.824

6.824 2016 Lecture 3: GFS Case Study

 

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003

 

Why are we reading this paper?
  the file system for map/reduce
  case study of handling storage failures
    trading consistency for simplicity and performance
    motivation for subsequent designs
  good performance -- great parallel I/O performance
  good systems paper -- details from apps all the way to network
  all main themes of 6.824 show up in this paper
    performance, fault-tolerance, consistency
  influential
    many other systems use GFS (e.g., Bigtable, Spanner @ Google)
    HDFS (Hadoop Distributed File System) based on GFS

 

What is consistency?
  A correctness condition
  Important when data is replicated and concurrently accessed by applications
    if an application performs a write, what will a later read observe?
      what if the read is from a different application?
  Weak consistency
    read() may return stale data --- not the result of the most recent write
  Strong consistency
    read() always returns the data from the most recent write()

 

  General tension between these:
    strong consistency is easy for application writers
    strong consistency is bad for performance
    weak consistency has good performance and is easy to scale to many servers
    weak consistency is complex to reason about
  Many trade-offs give rise to different correctness conditions
    These are called "consistency models"

 

History of consistency models
  Much independent development in architecture, systems, and database communities
    Concurrent processors with private caches accessing a shared memory
    Concurrent clients accessing a distributed file system
    Concurrent transactions on distributed database
  Many different models with different trade-offs
    serializability
    sequential consistency
    linearizability
    entry consistency
    release consistency
    ....

 

"Ideal" consistency model
  A replicated file system behaves like a non-replicated file system
    picture: many clients on the same machine accessing files on a single disk
  If one application writes, later reads will observe that write
  What if two applications concurrently write to the same file?
    In file systems often undefined --- file may have some mixed content
  What if two applications concurrently write to the same directory?
    One goes first, the other goes second (use locking; see the sketch below)
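
  A minimal sketch (my own Go code, not from the paper) of why directory
  operations can behave ideally: serialize them with a lock, so one create
  goes first and the other observes its effect.

    // Concurrent creates in one directory, serialized by a lock.
    package main

    import (
      "fmt"
      "sync"
    )

    type Dir struct {
      mu      sync.Mutex
      entries map[string]bool
    }

    // Create returns true if it added the name, false if it already existed.
    func (d *Dir) Create(name string) bool {
      d.mu.Lock()
      defer d.mu.Unlock()
      if d.entries[name] {
        return false
      }
      d.entries[name] = true
      return true
    }

    func main() {
      d := &Dir{entries: map[string]bool{}}
      var wg sync.WaitGroup
      for i := 0; i < 2; i++ {
        wg.Add(1)
        go func(id int) {
          defer wg.Done()
          fmt.Printf("client %d created the file: %v\n", id, d.Create("f"))
        }(i)
      }
      wg.Wait() // exactly one of the two creates returns true
    }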

 

Sources of inconsistency
  Concurrency
  Machine failures
  Network partitions

 

Example from GFS paper:
  primary is partitioned from backup B
  client appends 1
  primary sends 1 to itself and backup A
  primary cannot reach backup B, so it reports failure to the client
  meanwhile client 2 may read from backup B and observe the old value (sketched below)
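
  A toy sketch (my own, not GFS code) of this scenario: the append reaches the
  primary and backup A but never the partitioned backup B, so a read from B
  returns the old value.

    // The append reaches the primary and backup A, but not partitioned backup B.
    package main

    import "fmt"

    type replica struct {
      name string
      data []int
    }

    func main() {
      primary := &replica{name: "primary"}
      backupA := &replica{name: "backup A"}
      backupB := &replica{name: "backup B"} // partitioned: never sees the append

      // Client 1 appends 1; the primary applies it locally and forwards to A only.
      for _, r := range []*replica{primary, backupA} {
        r.data = append(r.data, 1)
      }
      // The primary cannot reach B, so it reports failure to client 1.

      // Client 2 reads from backup B and observes the old (empty) value.
      fmt.Println("client 2 reads from", backupB.name, ":", backupB.data)
      fmt.Println("while", primary.name, "has:", primary.data)
    }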

 

Why is the ideal difficult to achieve in a distributed file system?
  Protocols can become complex
  Difficult to implement system correctly
  Protocols require communication between clients and servers
  May cost performance

 

GFS designers give up on ideal to get better performance and simpler design
  Can make life of application developers harder
    applications observe behaviors that cannot occur in an ideal system
    e.g., reading stale data
    e.g., duplicate append records
  But the data isn't your bank account, so maybe ok
  The paper is an example of the struggle between:
    consistency
    fault-tolerance
    performance
    simplicity of design

 

GFS goal
  create a shared file system
  hundreds or thousands of (commodity, Linux based) physical machines
  to enable storing massive data sets

 

 

 

What does GFS store?
  authors don't actually say
  guesses for 2003:
    search indexes & databases
    all the HTML files on the web
    all the images on the web
    ...

 

Properties of files:
  Multi-terabyte data sets
  Many of the files are large
  Authors suggest 1M files x 100 MB = 100 TB
    but that was in 2003
  Files are generally append only

 

Central challenge:
  With so many machines failures are common
    assume a machine fails once per year
    w/ 1000 machines, ~3 will fail per day (1000 failures/year / 365 days ≈ 2.7/day)
  High-performance: many concurrent readers and writers
    Map/Reduce jobs read and store final result in GFS
    Note: *not* the temporary, intermediate files
  Use network efficiently

 

High-level design
  Directories, files, names, open/read/write
    But not POSIX
  100s of Linux chunk servers with disks
    store 64MB chunks (an ordinary Linux file for each chunk)
    each chunk replicated on three servers

 


  Q: why 3x replication?
  A: 1. For reliability, each chunk should be replicated on multiple chunkservers
     2. For simplicity and performance, they store 3 replicas
  Q: Besides availability of data, what does 3x replication give us?
  A: 1. load balancing for reads to hot files
     2. affinity
  Q: why not just store one copy of each file on a RAID'd disk?
  A: 1. RAID isn't commodity
     2. Want fault-tolerance for the whole machine, not just the storage device
  Q: why are the chunks so big?
  A: 1. Reduce clients' need to interact with master for chunk location information
     2. Reduce network overhead if a client performs many operations on a given chunk
     3. Reduce metadata stored on the master

 


  GFS master server knows directory hierarchy
    for dir, what files are in it
    for file, knows chunk servers for each 64 MB
  master keeps state in memory
    64 bytes of metadata per chunk
  master has private recoverable database for metadata
    master can recover quickly from a power failure
  shadow masters that lag a little behind master
    can be promoted to master
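
  A rough sketch (my own field and type names, not the paper's) of the state
  the master keeps in memory:

    // Master metadata, roughly as described above (names are guesses).
    package main

    type ChunkHandle uint64

    type ChunkInfo struct {
      Version int      // chunk version number, bumped when a lease is granted
      Servers []string // chunk servers holding a replica; not persisted,
                       // rebuilt by asking chunk servers after a restart
      Primary string   // replica currently holding the lease, if any
    }

    type FileInfo struct {
      Chunks []ChunkHandle // one handle per 64 MB region of the file
    }

    type Master struct {
      // Persisted via the operation log and checkpoints:
      namespace map[string]*FileInfo // path -> file (and directory) metadata
      // Roughly 64 bytes of state per chunk:
      chunks map[ChunkHandle]*ChunkInfo
    }

    func main() {}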

 

Basic operation
  client read:
    send file name and offset to master
    master replies with set of servers that have that chunk
     response includes version # of chunk; clients cache that information for a little while
    ask nearest chunk server: checks version #, if version # is wrong, client re-contacts master
  client write:
    ask master where to store
    master may choose a new set of chunk servers if the write crosses a 64 MB boundary
    one chunk server is primary
    it chooses the order of updates and forwards them to the two backups
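
  A sketch of the client read steps above (RPC names, paths, and types are
  invented, not the real GFS interfaces):

    // Client read path: ask the master for chunk locations, then read from a replica.
    package main

    import "fmt"

    type ChunkLocation struct {
      Handle  uint64
      Version int      // client passes this along so stale replicas are detected
      Servers []string // chunk servers holding this chunk
    }

    const chunkSize = 64 << 20 // 64 MB

    // askMaster stands in for the client->master RPC: it maps (file, chunk index)
    // to a chunk handle, its version, and the servers that hold it.
    func askMaster(file string, chunkIndex int64) ChunkLocation {
      // the real client caches this answer for a little while
      return ChunkLocation{Handle: 42, Version: 7, Servers: []string{"cs1", "cs2", "cs3"}}
    }

    // readChunk stands in for the client->chunkserver RPC; the server checks the
    // version number, and a mismatch makes the client re-contact the master.
    func readChunk(server string, loc ChunkLocation, offsetInChunk, n int64) ([]byte, error) {
      return make([]byte, n), nil
    }

    func main() {
      offset := int64(200 << 20) // read 4 KB at byte offset 200 MB of the file
      loc := askMaster("/some/big/file", offset/chunkSize)
      data, err := readChunk(loc.Servers[0], loc, offset%chunkSize, 4096)
      fmt.Println(len(data), err)
    }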

 

Two different fault-tolerance plans
  One for master
  One for chunk servers

 

Master fault tolerance
  Single master
    Clients always talk to master
    Master orders all operations
  Stores limited information persistently
    name spaces (directories)
    file-to-chunk mappings
  Changes to these two kinds of metadata are recorded in an operation log
    log is replicated on several backups
    client operations that modify state return *after* the changes are recorded in the *log*
    logs play a central role in many systems we will read about (a small sketch follows this section)
  Limiting the size of the log
    Make a checkpoint of the master state
    Remove all operations from log from before checkpoint
    Checkpoint is replicated to backups
  Recovery
    replay log starting from last checkpoint
    chunk location information is recreated by asking chunk servers
  Master is single point of failure
    recovery is fast, because master state is small
      so maybe unavailable for short time
    shadow masters
      lag behind master
        they replay from the log that is replicated
      can serve read-only operations, but may return stale data
    if the master cannot recover, a new master is started somewhere else
    must be done with great care to avoid two masters
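
  A minimal sketch (my own formats, not the paper's) of the log + checkpoint
  idea mentioned above: acknowledge a mutation only after it is logged, and
  recover by replaying the log suffix on top of the last checkpoint.

    // Write-ahead log plus checkpoint for the master's metadata (simplified).
    package main

    import "fmt"

    type Op struct {
      Kind string // e.g. "create" or "addChunk"
      Path string
    }

    type MasterState struct {
      files map[string][]uint64 // path -> chunk handles
    }

    func (s *MasterState) apply(op Op) {
      switch op.Kind {
      case "create":
        s.files[op.Path] = nil
      case "addChunk":
        // use the current length as a fake chunk handle
        s.files[op.Path] = append(s.files[op.Path], uint64(len(s.files[op.Path])))
      }
    }

    // A mutation is acknowledged to the client only after its op is appended to
    // the (replicated) log; the in-memory state is updated afterwards.
    type Log struct {
      checkpoint MasterState // snapshot of state; older ops are discarded
      ops        []Op        // ops recorded since the checkpoint
    }

    // recover rebuilds master state from the last checkpoint plus the log suffix.
    func (l *Log) recover() MasterState {
      s := MasterState{files: map[string][]uint64{}}
      for p, chunks := range l.checkpoint.files {
        s.files[p] = append([]uint64(nil), chunks...)
      }
      for _, op := range l.ops {
        s.apply(op)
      }
      return s
    }

    func main() {
      l := &Log{checkpoint: MasterState{files: map[string][]uint64{}}}
      l.ops = append(l.ops, Op{"create", "/a"}, Op{"addChunk", "/a"})
      fmt.Println(l.recover().files) // map[/a:[0]]
    }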

 

Chunk fault tolerance
  Master grants a chunk lease to one of the replicas
    That replica is the primary chunk server
  Primary determines the order of operations
  Client pushes data to replicas
    Replicas form a chain
    Chain respects network topology
    Allows fast replication
  Client sends write request to primary
    Primary assigns sequence number
    Primary applies change locally
    Primary forwards request to the replicas
    Primary responds to client after receiving acks from all replicas (see the sketch below)
  If one replica doesn't respond, client retries
  Master re-replicates chunks if the number of replicas drops below a threshold
  Master rebalances replicas
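
  A sketch of the primary's role in a write (types and RPCs are invented; the
  data push along the replica chain is omitted):

    // The primary assigns a sequence number, applies the write locally,
    // forwards it to the secondaries, and replies only after all of them ack.
    package main

    import "fmt"

    type WriteReq struct {
      Seq    int // serial number assigned by the primary
      Offset int64
      Data   []byte
    }

    type Primary struct {
      nextSeq     int
      secondaries []string
    }

    // forward stands in for the primary->secondary RPC; false means no ack.
    func forward(secondary string, req WriteReq) bool { return true }

    func (p *Primary) Write(offset int64, data []byte) error {
      p.nextSeq++
      req := WriteReq{Seq: p.nextSeq, Offset: offset, Data: data}
      // apply locally (omitted), then forward in sequence-number order
      for _, s := range p.secondaries {
        if !forward(s, req) {
          // a missing ack turns into an error; the client will retry
          return fmt.Errorf("secondary %s did not ack write %d", s, req.Seq)
        }
      }
      return nil // every replica applied the mutation in the same order
    }

    func main() {
      p := &Primary{secondaries: []string{"cs2", "cs3"}}
      fmt.Println(p.Write(0, []byte("hello")))
    }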

 

Consistency of chunks
  Some chunks may get out of date
    they miss mutations
  Detect stale data with chunk version number
    before handing out a lease
      increments chunk version number
      sends it to primary and backup chunk servers
    master and chunk servers store version persistently
  Send version number also to client
  Version number allows master and client to detect stale replicas
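
  A small sketch of how version numbers expose a stale replica (simplified):

    // A replica whose version is behind the master's missed a mutation.
    package main

    import "fmt"

    type chunkServer struct {
      name    string
      version int
    }

    func main() {
      masterVersion := 3 // bumped by the master each time it grants a lease
      replicas := []chunkServer{
        {"cs1", 3},
        {"cs2", 3},
        {"cs3", 2}, // was down during the last lease grant: stale
      }
      for _, r := range replicas {
        if r.version < masterVersion {
          // the master stops handing out this replica; a client that knows
          // the current version number likewise avoids it
          fmt.Println(r.name, "is stale")
        }
      }
    }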

 

Concurrent writes/appends
  clients may write to the same region of file concurrently
  the result is some mix of those writes--no guarantees
    few applications do this anyway, so it is fine
    concurrent writes on Unix can also result in a strange outcome
  many clients may want to append concurrently to, e.g., a log file
    GFS supports atomic, at-least-once record append
    the primary chunk server chooses the offset at which to append the record
    and sends it to all replicas
    if it fails to contact a replica, the primary reports an error to the client
    client retries; if retry succeeds:
      some replicas will have the record twice (the ones that succeeded the first time; see the sketch after this list)
    the file may have a "hole" too
      GFS pads to the chunk boundary when an append would cross the chunk boundary
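
  A toy model (my own) of why at-least-once append leaves duplicates: the retry
  re-appends the record on replicas that already applied the first attempt.

    // Record append with a failure on one replica, followed by a client retry.
    package main

    import "fmt"

    type replica struct {
      name    string
      records []string
    }

    // appendOnce applies the record on every replica except the one at failAt,
    // which simulates a replica missing the mutation.
    func appendOnce(replicas []*replica, rec string, failAt int) bool {
      ok := true
      for i, r := range replicas {
        if i == failAt {
          ok = false // this replica missed the append; primary reports an error
          continue
        }
        r.records = append(r.records, rec)
      }
      return ok
    }

    func main() {
      rs := []*replica{{name: "cs1"}, {name: "cs2"}, {name: "cs3"}}
      if !appendOnce(rs, "record-A", 2) { // first attempt: cs3 misses it
        appendOnce(rs, "record-A", -1) // client retries; now everyone applies it
      }
      for _, r := range rs {
        fmt.Println(r.name, r.records) // cs1 and cs2 hold record-A twice
      }
    }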

 

Consistency model
  Strong consistency for directory operations
    Master performs changes to metadata atomically
    Directory operations follow the "ideal"
    But, when the master is off-line, only shadow masters are available
      Read-only operations only, which may return stale data
  Weak consistency for chunk operations
    A failed mutation leaves chunks inconsistent
      The primary chunk server updated its chunk
      but then failed, and the replicas are out of date
    A client may read a stale (not-up-to-date) chunk
    When client refreshes lease it will learn about new version #
  Authors claim weak consistency is not a big problem for apps
    Most file updates are append-only updates
      Application can use a UID in appended records to detect duplicates (see the sketch after this list)
      Application may just read less data (but not stale data)
    Application can use temporary files and atomic rename
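
  A sketch of the duplicate-detection idea from the list above: writers embed a
  unique ID in each record, and readers drop records whose ID they have seen.

    // Filter out duplicate records left behind by retried record appends.
    package main

    import "fmt"

    type record struct {
      ID   string // unique ID chosen by the writer
      Body string
    }

    func dedup(in []record) []record {
      seen := map[string]bool{}
      var out []record
      for _, r := range in {
        if seen[r.ID] {
          continue // duplicate from a retried append
        }
        seen[r.ID] = true
        out = append(out, r)
      }
      return out
    }

    func main() {
      raw := []record{{"u1", "a"}, {"u2", "b"}, {"u2", "b"}, {"u3", "c"}}
      fmt.Println(dedup(raw)) // [{u1 a} {u2 b} {u3 c}]
    }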

 

Performance
  huge aggregate throughput for read (3 copies, striping)
   125 MB/sec in aggregate
    Close to saturating network
  writes to different files are lower than the possible maximum
    authors blame their network stack
    it causes delays in propagating chunks from one replica to the next
  concurrent appends to single file
    limited by the server that stores the last chunk
  numbers and specifics have changed a lot in 15 years! (2018)

 

Summary
  Important FT techniques used by GFS
    Logging & checkpointing
    Primary-backup replication for chunks
      but with weaker consistency guarantees than the ideal

 

  what works well in GFS?
    huge sequential reads and writes
    appends
    huge throughput (3 copies, striping)
    fault tolerance of data (3 copies)
  what works less well in GFS?
    fault-tolerance of master
    small files (master a bottleneck)
    clients may see stale data
    appends may be duplicated

 


References
  http://queue.acm.org/detail.cfm?id=1594206 (discussion of gfs evolution)
  http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.htm

 

Reposted from: https://www.cnblogs.com/william-cheung/p/5268533.html
