HDFS is a filesystem designed for storing very large files with streaming data access patterns.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times.
HDFS is not well suited to:
Low-latency data access
HDFS is optimized for delivering a high throughput of data, and this may come at the expense of latency.
Lots of small files
Since the namenode holds the filesystem metadata in memory, the number of files in a filesystem is limited by the amount of memory on the namenode.
Block
In HDFS, files are divided into blocks; the default block size is 64 MB. A block size this large keeps the seek time small relative to the transfer time. Dividing a file into blocks also lets its data sit in different locations across disks and machines, so a single file can grow larger than any one disk. It also simplifies the system: because a block is just a fixed-size chunk of data, the metadata for a file does not need to be stored with every block and can be managed separately.
Blocks also fit well with replication for providing fault tolerance and availability: each block is replicated to a small number of separate machines.
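
To make the block structure visible, the Hadoop FileSystem API can list the blocks of a file and the datanodes that hold each replica. A minimal sketch in Java, assuming a cluster at localhost:8020 and a hypothetical file /user/me/data.bin:

    import java.net.URI;
    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            // Hypothetical file path; replace with a file that exists in your cluster.
            Path file = new Path("/user/me/data.bin");
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                                           new Configuration());
            FileStatus status = fs.getFileStatus(file);
            // One BlockLocation per block, each listing the datanodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
            }
        }
    }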
NameNode and DataNodes
NameNode: manages the filesystem namespace; it maintains the filesystem tree and the metadata for all files. This information is persisted in two files: the namespace image and the edit log. The namenode also knows where all the blocks for a given file are located (this block mapping is held in memory rather than persisted, and is rebuilt from datanode block reports).
DataNode: stores and retrieves blocks, and periodically reports back to the namenode with the list of blocks it is storing.
How to deal with failure of the namenode:
1. Back up the files that make up the persistent state of the filesystem metadata, writing them to multiple filesystems.
2. Run a secondary namenode, which periodically merges the namespace image with the edit log to prevent the edit log from becoming too large.
Operations
Setting up
Two properties are needed to set up the filesystem. The first is fs.default.name, which sets HDFS as the default filesystem; its value is a URI, and the default port is 8020. The second is dfs.replication; when running on a single machine with a single datanode, replication should be set to 1 (otherwise blocks would be flagged as under-replicated).
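
As a minimal sketch, the same two properties can be set programmatically on a Hadoop Configuration object (the localhost address is an assumption for a single-machine setup):

    import org.apache.hadoop.conf.Configuration;

    public class HdfsConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Use HDFS on localhost:8020 as the default filesystem
            // (localhost is an assumption for a single-machine setup).
            conf.set("fs.default.name", "hdfs://localhost:8020");
            // A single datanode cannot hold multiple replicas, so use 1.
            conf.setInt("dfs.replication", 1);
            System.out.println(conf.get("fs.default.name"));
        }
    }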
We can copy files from the local filesystem to HDFS, and vice versa, using the -copyFromLocal and -copyToLocal shell commands.
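
The same copies can be done programmatically through the FileSystem API; a minimal sketch with hypothetical local and HDFS paths:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"),
                                           new Configuration());
            // Equivalent of: hadoop fs -copyFromLocal input.txt /user/me/input.txt
            fs.copyFromLocalFile(new Path("input.txt"), new Path("/user/me/input.txt"));
            // Equivalent of: hadoop fs -copyToLocal /user/me/input.txt output.txt
            fs.copyToLocalFile(new Path("/user/me/input.txt"), new Path("output.txt"));
        }
    }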
Hadoop abstracts the filesystem behind an API and supports multiple filesystems such as HDFS, KFS, S3, and the local filesystem. For non-Java applications, it provides access through Thrift, a C library, and other APIs.
Reading data in HDFS
The client calls open() on a FileSystem object, which for HDFS is a DistributedFileSystem instance. The instance calls the NameNode using RPC to get the locations of the first few blocks in the file. For each block, the NameNode returns the addresses of the datanodes that hold a copy, sorted by their proximity to the client. The DistributedFileSystem returns an FSDataInputStream for the client to read from; the client then calls read() on the FSDataInputStream and reads continuously, with the stream connecting to the closest datanode for each block in turn.
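
A minimal sketch of this read path in Java, with a hypothetical file URI:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            // Hypothetical URI; replace with the file you want to read.
            String uri = "hdfs://localhost:8020/user/me/file.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataInputStream in = null;
            try {
                // open() triggers the RPC to the namenode for block locations.
                in = fs.open(new Path(uri));
                // read() (via copyBytes) streams bytes from the datanodes.
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }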
Writing data to HDFS