Beaver, D., Kumar, S., Li, H. C., Sobel, J., & Vajgel, P. (2010, October). Finding a Needle in Haystack: Facebook’s Photo Storage. In OSDI (Vol. 10, pp. 1-8).
Introduction
Haystack is Facebook’s photo storage system, an object store that handles billions of images and more than 20 petabytes of data.
Environment:
- Written once
- Read often
- Never modified
- Rarely deleted
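Disadvantage of a POSIX-based filesystem:
Each file carries metadata (such as permissions) that the photo workload never uses, yet that metadata must be read from disk in order to locate the file. The extra disk operations limit read throughput, which forces a heavy reliance on CDNs to serve reads.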
Goal:
- High throughput and low latency: at most one disk operation per read
- Fault-tolerant: replicates each photo in geographically distinct locations
- Cost-effective
- Simple
Typical design:
NFS-based design:
Design & Implementation
Serving a photo:
Core components:
- Haystack Store
- Haystack Directory
- Haystack Cache
http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
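As a rough illustration of how each layer consumes its part of the URL, the sketch below splits a URL of the shape above into its components. The names `PhotoURL`, `parse_photo_url` and `strip_cdn`, the example host names, and the comma-separated encoding of the last segment are assumptions made for illustration; the notes only give the overall shape. `strip_cdn` mirrors the paper's description of a CDN miss, where the CDN part is removed and the request goes to the Cache.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class PhotoURL(NamedTuple):
    cdn: str
    cache: str
    machine_id: str
    logical_volume: str
    photo_id: str


def parse_photo_url(url: str) -> PhotoURL:
    """Split http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>."""
    parsed = urlparse(url)
    cache, machine_id, last = parsed.path.lstrip("/").split("/")
    # Assumed encoding: logical volume and photo id separated by a comma.
    logical_volume, photo_id = last.split(",")
    return PhotoURL(parsed.netloc, cache, machine_id, logical_volume, photo_id)


def strip_cdn(url: str) -> str:
    """URL the CDN would use to contact the Cache on a miss (CDN part removed)."""
    p = parse_photo_url(url)
    return f"http://{p.cache}/{p.machine_id}/{p.logical_volume},{p.photo_id}"
```

For example, `parse_photo_url("http://cdn.example.com/cache7/store42/1001,555")` yields the Cache address `cache7`, the Store machine `store42`, logical volume `1001` and photo `555`.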
Uploading a photo:
Haystack Directory
- Provides a mapping from logical volumes to physical volumes.
- Load balances writes across logical volumes and reads across physical volumes.
- Determines whether a photo request should be handled by the CDN or by the Cache.
- Identifies those logical volumes that are read-only either because of operational reasons or because those volumes have reached their storage capacity.
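The first, second and fourth responsibilities can be sketched as a small in-memory table. This is only a toy model: the class and method names are hypothetical, and a uniform random choice stands in for whatever load-balancing policy the real Directory uses. The CDN-versus-Cache decision is left out.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogicalVolume:
    replicas: List[str]        # Store machines holding this volume's physical volumes
    read_only: bool = False


@dataclass
class Directory:
    volumes: Dict[int, LogicalVolume] = field(default_factory=dict)

    def add_volume(self, volume_id: int, replicas: List[str]) -> None:
        self.volumes[volume_id] = LogicalVolume(list(replicas))

    def mark_read_only(self, volume_id: int) -> None:
        # Used when a volume is full or taken out of service for operational reasons.
        self.volumes[volume_id].read_only = True

    def pick_volume_for_write(self) -> int:
        # Writes are balanced across write-enabled logical volumes.
        writable = [vid for vid, v in self.volumes.items() if not v.read_only]
        if not writable:
            raise RuntimeError("no write-enabled logical volume")
        return random.choice(writable)

    def pick_replica_for_read(self, volume_id: int) -> str:
        # Reads are balanced across the physical volumes of one logical volume.
        return random.choice(self.volumes[volume_id].replicas)
```

A write goes to every replica of the chosen logical volume, while a read needs only one of them, which is why the two methods balance over different sets.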
Haystack Cache
Organized as a distributed hash table that uses a photo’s id as the key.
It caches a photo only if:
- the request comes directly from a user and not the CDN
- the photo is fetched from a write-enabled Store machine
Why only write-enabled:
- photos are most heavily accessed soon after they are uploaded
- filesystems on Store machines generally perform better when doing either reads or writes, but not both
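The caching rule above amounts to a two-condition predicate. The function and parameter names below are hypothetical.

```python
def should_cache(request_from_cdn: bool, store_write_enabled: bool) -> bool:
    """Cache only direct user requests served by a write-enabled Store machine."""
    return (not request_from_cdn) and store_write_enabled
```

A request that arrives via the CDN is not cached here, since a photo that missed in the CDN is unlikely to hit in the Cache either.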
Haystack Store
Physical volume -> a very large file (100 GB) saved as /hay/haystack_<logical volume id>
Store machine keeps:
- open file descriptors
- in-memory mapping of photo ids to the filesystem metadata (file, offset and size in bytes)
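Physical volume: a large file consisting of a superblock followed by a sequence of needles, where each needle represents one photo.
The alternate key exists for historical reasons: each uploaded photo is stored in several scaled sizes that share the same key (the photo id), and the alternate key identifies the size.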
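To make the needle idea concrete, here is a minimal sketch of packing and unpacking one needle. The layout is simplified and assumed for illustration (cookie, key, alternate key, flags, data size, data, CRC32 checksum); the field widths and the function names are not taken from the paper.

```python
import struct
import zlib

# Assumed layout: 8-byte cookie, 8-byte key, 4-byte alternate key,
# 1-byte flags, 4-byte data size, then the data and a 4-byte checksum.
HEADER = struct.Struct(">QQIBI")
FOOTER = struct.Struct(">I")


def pack_needle(cookie: int, key: int, alt_key: int, data: bytes, flags: int = 0) -> bytes:
    header = HEADER.pack(cookie, key, alt_key, flags, len(data))
    return header + data + FOOTER.pack(zlib.crc32(data))


def unpack_needle(buf: bytes, offset: int = 0):
    cookie, key, alt_key, flags, size = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    data = bytes(buf[start:start + size])
    (checksum,) = FOOTER.unpack_from(buf, start + size)
    if checksum != zlib.crc32(data):
        raise ValueError("needle data failed its checksum")
    return cookie, key, alt_key, flags, data
```

Because the header records the data size, a Store machine that knows a needle's offset can fetch the whole photo with a single contiguous read.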
Photo Read
The Cache machine supplies to the Store machine:
- logical volume id
- key
- alternate key
- cookie (randomly assigned, eliminates attacks aimed at guessing valid URLs for photos)
Photo Write
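Needles are appended to the physical volume and never overwritten. A photo is modified by appending a new needle with the same key and alternate key; the latest version of a needle within a physical volume is the one at the highest offset.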
Photo Delete
Sets the delete flag in both the in-memory mapping and the volume file.
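The three operations can be sketched together against a toy volume. This is a minimal model, not the Store's actual code: a `bytearray` stands in for the volume file, the header layout and the `Volume` class are assumptions, and replication and checksums are omitted.

```python
import struct

HEADER = struct.Struct(">QQIBI")  # cookie, key, alternate key, flags, data size (assumed)
FLAGS_OFFSET = 8 + 8 + 4          # byte position of the flags field inside a header
DELETED = 1


class Volume:
    def __init__(self):
        self.data = bytearray()   # stands in for the on-disk volume file
        self.index = {}           # (key, alt_key) -> [offset, size, flags]

    def write(self, cookie: int, key: int, alt_key: int, photo: bytes) -> int:
        offset = len(self.data)
        self.data += HEADER.pack(cookie, key, alt_key, 0, len(photo)) + photo
        # An update just appends; the mapping now points at the highest offset.
        self.index[(key, alt_key)] = [offset, len(photo), 0]
        return offset

    def read(self, cookie: int, key: int, alt_key: int):
        entry = self.index.get((key, alt_key))
        if entry is None or entry[2] & DELETED:
            return None
        offset, size, _flags = entry
        stored_cookie = HEADER.unpack_from(self.data, offset)[0]
        if stored_cookie != cookie:
            return None           # wrong cookie: treat as a guessed URL
        start = offset + HEADER.size
        return bytes(self.data[start:start + size])

    def delete(self, key: int, alt_key: int) -> bool:
        entry = self.index.get((key, alt_key))
        if entry is None:
            return False
        entry[2] |= DELETED                              # in-memory mapping
        self.data[entry[0] + FLAGS_OFFSET] |= DELETED    # needle in the volume
        return True
```

The lookup in `index` costs no disk access, so a read touches the volume exactly once, which is the "at most one disk operation per read" goal.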
The Index File
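Each Store machine keeps an index file for each of its volumes.
The index file is used to reconstruct the in-memory mappings quickly after a restart, without scanning the whole volume.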
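A minimal sketch of rebuilding the in-memory mapping from an index file follows. The fixed-width record layout (key, alternate key, flags, offset, size) and the function names are assumptions for illustration, and a `bytearray` stands in for the file on disk.

```python
import struct

INDEX_RECORD = struct.Struct(">QIBQI")  # key, alternate key, flags, offset, size (assumed)


def append_index_record(index: bytearray, key: int, alt_key: int,
                        flags: int, offset: int, size: int) -> None:
    index.extend(INDEX_RECORD.pack(key, alt_key, flags, offset, size))


def rebuild_mapping(index: bytes) -> dict:
    mapping = {}
    for key, alt_key, flags, offset, size in INDEX_RECORD.iter_unpack(index):
        # Records are in write order, so a later record for the same photo
        # (a higher offset) replaces the earlier one.
        mapping[(key, alt_key)] = (offset, size, flags)
    return mapping
```

Reading these small fixed-size records is far cheaper than walking every needle in a 100 GB volume, which is the point of keeping the index.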
Filesystem
Haystack should use a filesystem that can perform random seeks within a large file quickly without needing much memory.
XFS:
- blockmaps for several contiguous large files can be small enough to be stored in the main memory
- provides efficient file preallocation, mitigating fragmentation and reining in how large block maps can grow
Recovery from failures
Detection:
A background task, dubbed pitchfork, periodically checks the health of each Store machine.
Repair:
Bulk sync: resets the data of a Store machine using the volume files supplied by a replica.
Optimizations
Compaction
Frees up the space held by deleted and duplicate needles by copying the live needles into a new volume file. (Young photos are a lot more likely to be deleted.)
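Compaction can be sketched as a single filtering pass. In this toy model a volume is a list of needles in write order; the `Needle` type and the `compact` function are hypothetical, and the final swap of the old and new files is not shown.

```python
from typing import List, NamedTuple

DELETED = 1


class Needle(NamedTuple):
    key: int
    alt_key: int
    flags: int
    data: bytes


def compact(volume: List[Needle]) -> List[Needle]:
    # Position of the last needle written for each photo; earlier ones are duplicates.
    latest = {(n.key, n.alt_key): i for i, n in enumerate(volume)}
    return [
        n for i, n in enumerate(volume)
        if latest[(n.key, n.alt_key)] == i and not n.flags & DELETED
    ]
```

Only the highest-offset needle for each photo survives, and only if its delete flag is clear, so the new volume holds exactly the photos that are still readable.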