6.824 2016 Lecture 3: GFS Case Study
The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003
Why are we reading this paper?
the file system for map/reduce
case study of handling storage failures
trading consistency for simplicity and performance
motivation for subsequent designs
good performance -- great parallel I/O performance
good systems paper -- details from apps all the way to network
all main themes of 6.824 show up in this paper
performance, fault-tolerance, consistency
influential
many other systems use GFS (e.g., Bigtable, Spanner @ Google)
HDFS (Hadoop Distributed File System) based on GFS
What is consistency?
A correctness condition
Important when data is replicated and concurrently accessed by applications
if an application performs a write, what will a later read observe?
what if the read is from a different application?
Weak consistency
read() may return stale data --- not the result of the most recent write
Strong consistency
read() always returns the data from the most recent write()
General tension between these:
strong consistency is easy for application writers
strong consistency is bad for performance
weak consistency has good performance and is easy to scale to many servers
weak consistency is complex to reason about
Many trade-offs give rise to different correctness conditions
These are called "consistency models"
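To make the weak/strong distinction concrete, here is a tiny sketch (in Go; not from the paper) of a write that reaches one replica but not the other: under weak consistency, a read served by the lagging replica returns stale data.

    package main

    import "fmt"

    type replica map[string]string

    func main() {
        primary := replica{"x": "old"}
        backup := replica{"x": "old"}

        // client 1 writes; replication to the backup is delayed (or lost)
        primary["x"] = "new"

        // client 2 happens to read from the backup: weak consistency lets
        // it observe the old value
        fmt.Println("client 2 reads:", backup["x"]) // "old"

        // strong consistency would require the write protocol to rule this
        // out, e.g. by updating the backup before acknowledging the write
        backup["x"] = primary["x"]
        fmt.Println("after replication:", backup["x"]) // "new"
    }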
History of consistency models
Much independent development in architecture, systems, and database communities
Concurrent processors with private caches accessing a shared memory
Concurrent clients accessing a distributed file system
Concurrent transactions on distributed database
Many different models with different trade-offs
serializability
sequential consistency
linearizability
entry consistency
release consistency
....
"Ideal" consistency model
A replicated file system behaves like a non-replicated file system
picture: many clients on the same machine accessing files on a single disk
If one application writes, later reads will observe that write
What if two applications concurrently write to the same file?
In file systems often undefined --- file may have some mixed content
What if two applications concurrently write to the same directory?
One goes first, the other goes second (use locking)
Sources of inconsistency
Concurrency
Machine failures
Network partitions
Example from GFS paper:
primary is partitioned from backup B
client appends 1
primary sends 1 to itself and backup A
but cannot reach backup B, so it reports failure to the client
meanwhile client 2 may read backup B and observe old value
Why is the ideal difficult to achieve in a distributed file system?
Protocols can become complex
Difficult to implement system correctly
Protocols require communication between clients and servers
May cost performance
GFS designers give up on ideal to get better performance and simpler design
Can make life of application developers harder
applications observe behaviors that cannot occur in an ideal system
e.g., reading stale data
e.g., duplicate append records
But the data isn't your bank account, so maybe ok
The paper is an example of the struggle between:
consistency
fault-tolerance
performance
simplicity of design
GFS goal
create a shared file system
hundreds or thousands of (commodity, Linux based) physical machines
to enable storing massive data sets
What does GFS store?
authors don't actually say
guesses for 2003:
search indexes & databases
all the HTML files on the web
all the images on the web
...
Properties of files:
Multi-terabyte data sets
Many of the files are large
Authors suggest 1M files x 100 MB = 100 TB
but that was in 2003
Files are generally append only
Central challenge:
With so many machines, failures are common
assume a machine fails once per year
w/ 1000 machines, ~3 will fail per day.
High-performance: many concurrent readers and writers
Map/Reduce jobs read and store final result in GFS
Note: *not* the temporary, intermediate files
Use network efficiently
High-level design
Directories, files, names, open/read/write
But not POSIX
100s of Linux chunk servers with disks
store 64MB chunks (an ordinary Linux file for each chunk)
each chunk replicated on three servers
Q: why 3x replication?
A: 1. For reliability, each chunk should be replicated on multiple chunk servers
2. For simplicity and performance, they store 3 replicas
Q: Besides availability of data, what does 3x replication give us?
A: 1. load balancing for reads to hot files
2. affinity: a client can read from a nearby replica
Q: why not just store one copy of each file on a RAID'd disk?
A: 1. RAID isn't commodity
2. Want fault-tolerance for whole machine; not just storage device
Q: why are the chunks so big?
A: 1. Reduce clients' need to interact with master for chunk location information
2. Reduce network overhead if a client performs many operations on a given chunk
3. Reduce metadata stored on the master
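For illustration, a minimal sketch (hypothetical helper, not GFS client code) of how a client maps a byte offset to a chunk index; with 64 MB chunks a sequential scan needs only one master lookup per 64 MB of data.

    package main

    import "fmt"

    const chunkSize = 64 << 20 // 64 MB

    // chunkIndex is what the client would send to the master (along with
    // the file name) to learn the chunk handle and replica locations
    func chunkIndex(offset int64) (index int64, offsetInChunk int64) {
        return offset / chunkSize, offset % chunkSize
    }

    func main() {
        idx, off := chunkIndex(200 << 20) // byte 200 MB of the file
        fmt.Println(idx, off)             // chunk 3, 8 MB into that chunk
    }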
GFS master server knows directory hierarchy
for dir, what files are in it
for file, knows chunk servers for each 64 MB
master keeps state in memory
64 bytes of metadata per chunk
master has private recoverable database for metadata
master can recover quickly from power failure
shadow masters that lag a little behind master
can be promoted to master
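Rough arithmetic: 100 TB / 64 MB is about 1.6 million chunks, and at ~64 bytes of metadata per chunk that is only ~100 MB, so the master can keep it all in memory. A rough sketch of that in-memory state (field names are guesses, not from the paper):

    package main

    type chunkHandle uint64

    type chunkInfo struct {
        version  uint64   // persisted; used to detect stale replicas
        replicas []string // chunk server addresses; NOT persisted
    }

    type fileInfo struct {
        chunks []chunkHandle // chunks[i] holds bytes [i*64MB, (i+1)*64MB)
    }

    type master struct {
        namespace map[string]*fileInfo       // full path name -> file
        chunks    map[chunkHandle]*chunkInfo // handle -> version, locations
    }

    func main() {}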
Basic operation
client read:
send file name and offset to master
master replies with set of servers that have that chunk
response includes version # of chunk; clients cache that information for a little while
ask nearest chunk server; it checks the version #; if the version # is wrong, the client re-contacts the master
client write:
ask master where to store
maybe master chooses a new set of chunk servers if crossing 64 MB
one chunk server is primary
it chooses order of updates and forwards to two backups
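A sketch of the client read path just described; the RPC names and types are invented for illustration, since the paper only describes the protocol informally.

    package main

    const chunkSize = 64 << 20

    type chunkLocations struct {
        handle   uint64
        version  uint64
        replicas []string // chunk server addresses
    }

    // askMaster and readFromChunkServer stand in for the real RPCs
    func askMaster(file string, chunkIdx int64) chunkLocations { return chunkLocations{} }
    func readFromChunkServer(addr string, loc chunkLocations, off, n int64) ([]byte, error) {
        return nil, nil
    }

    func read(file string, offset, n int64) ([]byte, error) {
        // 1. map the offset to a chunk index and ask the master
        //    (the client caches this reply for a little while)
        loc := askMaster(file, offset/chunkSize)

        // 2. read from the nearest replica; the chunk server checks the
        //    version number and rejects the request if its copy is stale,
        //    in which case the client re-contacts the master
        return readFromChunkServer(loc.replicas[0], loc, offset%chunkSize, n)
    }

    func main() {}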
Two different fault-tolerance plans
One for master
One for chunk servers
Master fault tolerance
Single master
Clients always talk to master
Master orders all operations
Stores limited information persistently
name spaces (directories)
file-to-chunk mappings
Log changes to these two in a log
log is replicated on several backups
client operations that modify state return *after* the changes are recorded in the *log*
logs play a central role in many systems we will read about
Limiting the size of the log
Make a checkpoint of the master state
Remove from the log all operations from before the checkpoint
Checkpoint is replicated to backups
Recovery
replay log starting from last checkpoint
chunk location information is recreated by asking chunk servers
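A toy sketch (not the master's actual format) of the log-and-checkpoint idea: mutations are logged before replying, a checkpoint lets the log be truncated, and recovery replays only the suffix after the checkpoint.

    package main

    type op struct {
        kind string // e.g. "create", "addChunk"
        args []string
    }

    type state map[string][]string // file name -> chunk handles (toy metadata)

    type master struct {
        meta       state
        checkpoint state // last checkpointed copy of meta
        log        []op  // ops since the checkpoint (on disk + backups in real GFS)
    }

    func (m *master) mutate(o op) {
        m.log = append(m.log, o) // 1. record in the log (and on backups) first
        apply(m.meta, o)         // 2. then apply, then reply to the client
    }

    func (m *master) makeCheckpoint() {
        m.checkpoint = clone(m.meta)
        m.log = nil // log entries before the checkpoint can be discarded
    }

    func (m *master) recover() {
        m.meta = clone(m.checkpoint)
        for _, o := range m.log { // replay only ops after the checkpoint
            apply(m.meta, o)
        }
        // chunk locations are not logged; re-learned from chunk servers
    }

    func apply(s state, o op) {
        if o.kind == "addChunk" {
            s[o.args[0]] = append(s[o.args[0]], o.args[1])
        }
    }

    func clone(s state) state {
        c := state{}
        for k, v := range s {
            c[k] = append([]string(nil), v...)
        }
        return c
    }

    func main() {}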
Master is single point of failure
recovery is fast, because master state is small
so it may be unavailable only for a short time
shadow masters
lag behind master
they replay from the log that is replicated
can serve read-only operations, but may return stale data
if the master cannot recover, a new master is started somewhere else
must be done with great care to avoid two masters
Chunk fault tolerance
Master grants a chunk lease to one of the replicas
That replica is the primary chunk server
Primary determines the order of operations
Client pushes data to replicas
Replicas form a chain
Chain respects network topology
Allows fast replication
Client sends write request to primary
Primary assigns sequence number
Primary applies change locally
Primary forwards request to replicas
Primary responds to client after receiving acks from all replicas
If one replica doesn't respond, client retries
Master re-replicates chunks if the number of replicas drops below some threshold
Master rebalances replicas
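A sketch of the primary's role in a write, with made-up stubs for the local apply and the RPCs to replicas: the sequence number makes all replicas apply concurrent writes in the same order, and the client hears success only after every replica acks.

    package main

    import "errors"

    type write struct {
        seq         int
        offset, len int64
        data        []byte
    }

    type primary struct {
        nextSeq  int
        replicas []string
    }

    // applyLocally and forwardTo stand in for the real chunk server logic
    // and RPCs to the other replicas
    func (p *primary) applyLocally(w write)       {}
    func forwardTo(replica string, w write) error { return nil }

    func (p *primary) handleWrite(offset, length int64, data []byte) error {
        p.nextSeq++
        w := write{seq: p.nextSeq, offset: offset, len: length, data: data}
        p.applyLocally(w)
        for _, r := range p.replicas {
            if err := forwardTo(r, w); err != nil {
                // the client sees an error and retries the whole write
                return errors.New("replica did not ack")
            }
        }
        return nil // all replicas acked; the client's write succeeds
    }

    func main() {}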
Consistency of chunks
Some chunks may get out of date
they miss mutations
Detect stale data with chunk version number
before handing out a lease, the master
increments the chunk version number
sends it to the primary and backup chunk servers
master and chunk servers store version persistently
Send version number also to client
Version number allows master and client to detect stale replicas
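A sketch (types invented for illustration) of how version numbers expose stale replicas: the master bumps the version when granting a lease, and a replica that misses the bump keeps the old version and can be recognized as stale.

    package main

    import "fmt"

    type replica struct {
        addr    string
        version uint64
    }

    func grantLease(masterVersion *uint64, reachable []*replica) uint64 {
        *masterVersion++ // persisted by the master before use
        for _, r := range reachable {
            r.version = *masterVersion // persisted by each reachable replica
        }
        return *masterVersion
    }

    func main() {
        var masterVersion uint64 = 3
        a := &replica{addr: "a", version: 3}
        b := &replica{addr: "b", version: 3}

        // b is down or partitioned when the lease is granted, so it
        // misses the version bump and is now stale
        current := grantLease(&masterVersion, []*replica{a})

        fmt.Println(a.version == current) // true  -> up to date
        fmt.Println(b.version == current) // false -> stale, don't read from it
    }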
Concurrent writes/appends
clients may write to the same region of file concurrently
the result is some mix of those writes--no guarantees
few applications do this anyway, so it is fine
concurrent writes on Unix can also result in a strange outcome
many client may want to append concurrently to, e.g., a log file
GFS supports atomic, at-least-once append
the primary chunk server chooses the offset where to append a record
sends it to all replicas.
if it fails to contact a replica, the primary reports an error to client
client retries; if retry succeeds:
some replicas will have the append twice (the ones that succeeded)
the file may have a "hole" too
GFS pads to the chunk boundary when an append would cross a chunk boundary
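A simplified sketch of record append at the primary (real GFS also pushes the record and any padding to the backups): the primary picks the offset, pads the chunk if the record would cross the 64 MB boundary (the "hole"), and a client retry after a partial failure is what leaves duplicates on the replicas that had already applied the append.

    package main

    import "errors"

    const chunkSize = 64 << 20

    type chunk struct {
        used int64 // bytes already used in the last chunk of the file
    }

    var errRetryOnNewChunk = errors.New("chunk padded; retry append on next chunk")

    func (c *chunk) recordAppend(record []byte) (offset int64, err error) {
        if c.used+int64(len(record)) > chunkSize {
            c.used = chunkSize // pad to the boundary: readers see a hole here
            return 0, errRetryOnNewChunk
        }
        offset = c.used
        c.used += int64(len(record)) // also applied at the replicas in real GFS
        return offset, nil
        // if a replica fails to apply the append, the primary instead
        // returns an error, the client retries, and replicas that already
        // applied it end up with the record twice
    }

    func main() {}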
Consistency model
Strong consistency for directory operations
Master performs changes to metadata atomically
Directory operations follow the "ideal"
But when the master is off-line, only shadow masters are available
They serve read-only operations only, which may return stale data
Weak consistency for chunk operations
A failed mutation leaves chunks inconsistent
The primary chunk server updated the chunk
But then failed and the replicas are out of date
A client may read an out-of-date chunk
When client refreshes lease it will learn about new version #
Authors claim weak consistency is not a big problem for apps
Most file updates are append-only updates
Application can use UID in append records to detect duplicates
Application may just read less data (but not stale data)
Application can use temporary files and atomic rename
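A sketch of the app-level duplicate filtering mentioned above; the record format and id scheme are up to the application and are invented here.

    package main

    import "fmt"

    type record struct {
        id   string // unique per logical append, chosen by the writer
        data string
    }

    func dedup(records []record) []record {
        seen := map[string]bool{}
        var out []record
        for _, r := range records {
            if seen[r.id] {
                continue // duplicate produced by a client retry; skip it
            }
            seen[r.id] = true
            out = append(out, r)
        }
        return out
    }

    func main() {
        // the second "r1" is what a retried append can leave behind
        appended := []record{{"r1", "a"}, {"r1", "a"}, {"r2", "b"}}
        fmt.Println(len(dedup(appended))) // 2
    }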
Performance
huge aggregate throughput for read (3 copies, striping)
125 MB/sec in aggregate
Close to saturating network
writes to different files are lower than the possible maximum
authors blame their network stack
it causes delays in propagating chunks from one replica to the next
concurrent appends to single file
limited by the server that stores the last chunk
numbers and specifics have changed a lot in 15 years! (2018)
Summary
Important FT techniques used by GFS
Logging & checkpointing
Primary-backup replication for chunks
but with weaker consistency guarantees
what works well in GFS?
huge sequential reads and writes
appends
huge throughput (3 copies, striping)
fault tolerance of data (3 copies)
what works less well in GFS?
fault-tolerance of master
small files (master a bottleneck)
clients may see stale data
appends may be duplicated
References
http://queue.acm.org/detail.cfm?id=1594206 (discussion of gfs evolution)
http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.htm