1、HDFS: Motivation
(1)Based on Google’s GFS
(2)Redundant storage of massive amounts of data on cheap and unreliable computers
(3)Why not use an existing file system?
– Different workload and design priorities
– Handles much larger datasets than most filesystems
2、HDFS Design Decisions
(1)Files stored as blocks
– Much larger block size than most filesystems (default is 64 MB)
(2)Reliability through replication
– Each block replicated across 3+ DataNodes
(3)Single master (NameNode) coordinates access, metadata
– Simple centralized management
(4)No data caching
– Little benefit due to large datasets and streaming reads
(5)Familiar interface, but a customized API
– Simplify the problem; focus on distributed apps
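The block-and-replication decisions above can be illustrated with a small, self-contained sketch. This is plain Python for illustration, not real HDFS code; the replication factor matches the default above, and the DataNode names and round-robin placement policy are assumptions for the example:

```python
# Illustrative sketch only -- not the real HDFS implementation.
# DataNode names and the round-robin placement are assumptions.
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size (64 MB)
REPLICATION = 3                # each block replicated across 3+ DataNodes

DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def split_into_blocks(file_size: int) -> int:
    """Number of blocks needed for a file of the given size in bytes."""
    return max(1, -(-file_size // BLOCK_SIZE))  # ceiling division

def place_replicas(num_blocks: int) -> list:
    """Assign each block to REPLICATION distinct DataNodes (round-robin)."""
    placements = []
    for b in range(num_blocks):
        replicas = [DATANODES[(b + r) % len(DATANODES)]
                    for r in range(REPLICATION)]
        placements.append(replicas)
    return placements

file_size = 200 * 1024 * 1024          # a 200 MB file
blocks = split_into_blocks(file_size)  # 200 MB / 64 MB -> 4 blocks
print(blocks)                          # 4
print(place_replicas(blocks)[0])       # ['dn1', 'dn2', 'dn3']
```

Losing any single DataNode still leaves two replicas of every block, which is why cheap, unreliable machines are acceptable.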
3、HDFS Client Block Diagram
4、Based on GFS Architecture
5、Metadata
(1)Single NameNode stores all metadata
– Filenames, and the DataNode locations of each file's blocks
(2)Maintained entirely in RAM for fast lookup
(3)DataNodes store opaque file contents in “block” objects on underlying local filesystem
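A minimal sketch of the NameNode's in-RAM metadata, assuming simple dictionaries for the namespace and block map (real HDFS also persists an image and edit log; the structure and function names here are illustrative):

```python
# Sketch of NameNode metadata held in RAM -- names and structures
# are assumptions for the example, not the real HDFS data model.
namespace = {}        # filename -> list of block IDs
block_locations = {}  # block ID -> DataNodes holding a replica

def add_file(name, block_ids, placements):
    """Record a file's blocks and where each replica lives."""
    namespace[name] = list(block_ids)
    for bid, nodes in zip(block_ids, placements):
        block_locations[bid] = list(nodes)

def lookup(name):
    """A client asks the NameNode for block locations, then reads the
    opaque block contents directly from the DataNodes themselves."""
    return [(bid, block_locations[bid]) for bid in namespace[name]]

add_file("/logs/day1", ["blk_1", "blk_2"],
         [["dn1", "dn2", "dn3"], ["dn2", "dn3", "dn4"]])
print(lookup("/logs/day1")[0])  # ('blk_1', ['dn1', 'dn2', 'dn3'])
```

Because lookups are pure in-memory dictionary reads, the single NameNode can answer metadata queries quickly; actual file data never flows through it.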
6、HDFS Conclusions
(1)HDFS supports large-scale processing workloads on commodity hardware
– designed to tolerate frequent component failures
– optimized for huge files that are mostly appended to and read
– filesystem interface is customized for the job, but still retains familiarity for developers
– simple solutions can work (e.g., single master)
(2)Reliably stores several TB in individual clusters