HDFS is a filesystem designed for storing very large files with streaming data access patterns.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times.
HDFS is not well suited to:
Low-latency data access
HDFS is optimized for delivering a high throughput of data, and this may come at the expense of latency.
Lots of small files
Since the namenode holds the filesystem metadata in memory, the number of files in a filesystem is limited by the amount of memory on the namenode.
Block
In HDFS, files are divided into blocks; the default block size is 64 MB. A block size this large keeps the seek time small relative to the transfer time. Dividing a file into blocks also lets its data sit in different locations across disks and machines, so a single file can grow larger than any one disk. It also simplifies the system: because a block is just a fixed-size chunk of data, the metadata for a file does not need to be stored with every block and can be managed separately.
Blocks also fit well with replication for providing fault tolerance and availability: each block is replicated to a small number of separate machines.
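
To make the block structure visible, the Hadoop FileSystem API can list the blocks of a file and the datanodes that hold each replica. A minimal sketch in Java, assuming a cluster at localhost:8020 and a hypothetical file /user/me/data.bin:

    import java.net.URI;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            // Hypothetical file path; replace with a file that exists in your cluster.
            Path file = new Path("/user/me/data.bin");
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                                           new Configuration());
            FileStatus status = fs.getFileStatus(file);
            // One BlockLocation per block, each listing the datanodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
            }
        }
    }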
NameNode and DataNodes
NameNode: manages the filesystem namespace; it maintains the filesystem tree and the metadata for all files. This information is persisted in two files: the namespace image and the edit log. The namenode also knows where all the blocks for a given file are located (this block mapping is held in memory rather than persisted, and is rebuilt from datanode block reports).
DataNode: stores and retrieves blocks, and periodically reports back to the namenode with the list of blocks it is storing.
How to deal with failure of the namenode:
1. Back up the files that make up the persistent state of the filesystem metadata, writing them to multiple filesystems.
2. Run a secondary namenode, which periodically merges the namespace image with the edit log to prevent the edit log from becoming too large.
Operations
Setting up
Two properties are needed to set up the filesystem. The first is fs.default.name, which sets HDFS as the default filesystem; its value is a URI, and the default port is 8020. The second is dfs.replication; when running on a single machine with a single datanode, replication should be set to 1 (otherwise blocks would be flagged as under-replicated).
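
As a minimal sketch, the same two properties can be set programmatically on a Hadoop Configuration object (the localhost address is an assumption for a single-machine setup):

    import org.apache.hadoop.conf.Configuration;

    public class HdfsConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Use HDFS on localhost:8020 as the default filesystem
            // (localhost is an assumption for a single-machine setup).
            conf.set("fs.default.name", "hdfs://localhost:8020");
            // A single datanode cannot hold multiple replicas, so use 1.
            conf.setInt("dfs.replication", 1);
            System.out.println(conf.get("fs.default.name"));
        }
    }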
We can copy files from the local filesystem to HDFS, and vice versa, using the -copyFromLocal and -copyToLocal shell commands.
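
The same copies can be done programmatically through the FileSystem API; a minimal sketch with hypothetical local and HDFS paths:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                                           new Configuration());
            // Equivalent of: hadoop fs -copyFromLocal input.txt /user/me/input.txt
            fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/me/input.txt"));
            // Equivalent of: hadoop fs -copyToLocal /user/me/input.txt output.txt
            fs.copyToLocalFile(new Path("/user/me/input.txt"), new Path("output.txt"));
        }
    }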
Hadoop abstracts the filesystem behind an API and supports multiple filesystems such as HDFS, KFS, S3, and the local filesystem. For non-Java applications, it provides access through Thrift, a C library, and other APIs.
Reading data in HDFS
The client calls open() on a FileSystem object, which for HDFS is a DistributedFileSystem instance. The instance calls the NameNode using RPC to get the locations of the first few blocks in the file. For each block, the NameNode returns the addresses of the datanodes that hold a copy, sorted by their proximity to the client. The DistributedFileSystem returns an FSDataInputStream for the client to read from; the client then calls read() on the FSDataInputStream and reads continuously, with the stream connecting to the closest datanode for each block in turn.
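
A minimal sketch of this read path in Java, with a hypothetical file URI:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // Hypothetical URI; replace with the file you want to read.
            String uri = "hdfs://localhost:8020/user/me/file.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                // open() triggers the RPC to the namenode for block locations.
                in = fs.open(new Path(uri));
                // read() (via copyBytes) streams bytes from the datanodes.
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }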
Writing data to HDFS