HDFS

zjfzjf2012

Published 2018-04-19 14:51:12

- suitable

- not suitable for some applications with,

- Block

Replication is done per block.
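As a rough illustration of what per-block replication means, the sketch below splits a file into blocks and counts the raw storage used. The 128 MB block size and the replication factor of 3 are assumed values for the example, and the function names are made up here.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    full, rest = divmod(file_size, block_size)
    sizes = [block_size] * full
    if rest:
        # The last block only takes up as much space as the data it holds.
        sizes.append(rest)
    return sizes

def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across the cluster when every block is replicated."""
    return sum(split_into_blocks(file_size)) * replication

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
```

Because each block is replicated on its own, losing one datanode only requires re-copying the blocks it held, not whole files.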

- namenode and datanode

namenode - holds the filesystem namespace image and the edit log, which are written to multiple filesystems (local disks or NFS). They are critical and must not be lost.
namenode - also holds the block locations, which are rebuilt from the blocks that datanodes report at system startup.
secondary namenode - periodically merges the edit log into the namespace image. When the primary namenode is down, copy the namespace image and edit logs from NFS to the secondary namenode and run it as the new primary namenode. Then,
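The merge of edit logs into the namespace image can be pictured as replaying a list of operations on top of a snapshot. This is only a toy model: the dictionary layout, the operation names and `apply_edits` are assumptions for illustration, not the real on-disk formats.

```python
def apply_edits(image, edits):
    """Return a new namespace image with the edit log replayed on top of it."""
    merged = dict(image)  # leave the old image untouched
    for op, path, *args in edits:
        if op == "create":
            merged[path] = {"replication": args[0]}
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged

# Hypothetical image and edit log.
image = {"/a.txt": {"replication": 3}}
edits = [
    ("create", "/b.txt", 2),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
checkpoint = apply_edits(image, edits)
```

After a checkpoint like this, the edits that were merged no longer need to be replayed at startup, which is why the merge keeps restart time down.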

- HDFS Federation

Each namenode manages part of the filesystem namespace, for example one for /usr and another for /share.
Block pool storage is not partitioned: every datanode registers with each namenode, so the namenodes get block reports from the same datanodes.
Clients use a mount table to map paths to namenodes.
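The mount table lookup can be sketched as a longest-prefix match from path to namenode. The table contents and the names `namenode1` and `namenode2` are hypothetical; a real client reads its mount table from configuration.

```python
# Hypothetical mount table: path prefix -> namenode.
MOUNT_TABLE = {
    "/usr": "namenode1",
    "/share": "namenode2",
}

def resolve(path, table=MOUNT_TABLE):
    """Return the namenode whose mount point is the longest prefix of path."""
    best = None
    for mount in table:
        # Match whole path components only, so "/usrx" does not match "/usr".
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        raise KeyError(f"no mount point for {path}")
    return table[best]
```

With this table, `resolve("/usr/alice/data.txt")` goes to `namenode1` and anything under `/share` goes to `namenode2`.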

- HA (High Availability)

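The active namenode and the standby namenode use shared, highly available storage for the edit log.
Datanodes send block reports to both namenodes.
The standby namenode merges the edit logs, taking over the checkpointing role of the secondary namenode.
ZooKeeper is used to elect the active namenode. On failover, the previously active namenode is fenced so that it cannot keep writing.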
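One way to picture fencing is a shared edit log that accepts writes from a single namenode at a time, so a stale active namenode is rejected after failover. The class below is a simplified model under that assumption; `SharedEditLog` and `failover` are invented names, and the election itself is left out.

```python
class SharedEditLog:
    """Toy shared edit log that allows exactly one writer at a time."""

    def __init__(self, writer):
        self.writer = writer
        self.entries = []

    def append(self, node, entry):
        if node != self.writer:
            # A fenced (former) active namenode must not modify the log.
            raise PermissionError(f"{node} is fenced")
        self.entries.append(entry)

def failover(log, standby):
    """Promote the standby; the old active loses write access (is fenced)."""
    log.writer = standby
    return standby

log = SharedEditLog(writer="nn1")
log.append("nn1", "mkdir /data")
active = failover(log, "nn2")
log.append("nn2", "mkdir /logs")
```

Any later `log.append("nn1", ...)` raises `PermissionError`, which is the behaviour fencing is meant to guarantee: two namenodes never write to the namespace at once.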

- Java API

- Anatomy of file read

The client reads the file block by block.
It closes the connection to a datanode once it has finished reading that datanode's block.
The network topology needs to be configured for Hadoop. Order of closeness: same node -> same rack -> same data center -> different data centers.
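The closeness ordering can be expressed as a tree distance: count the hops from each node up to their closest common ancestor. The sketch assumes locations are written as `/datacenter/rack/node`; the path format and the helper names are illustrative.

```python
def distance(a, b):
    """Tree distance between two locations like '/d1/r1/n1'."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Hops from each node up to the closest common ancestor.
    return (len(pa) - common) + (len(pb) - common)

def closest_replica(client, replicas):
    """Pick the replica location nearest to the client."""
    return min(replicas, key=lambda r: distance(client, r))
```

Under this model the distances are 0 for the same node, 2 for the same rack, 4 for the same data center and 6 across data centers, so a reader on `/d1/r1/n1` prefers a replica on its own rack over one in another data center.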

- Anatomy of file write

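DataStreamer asks the namenode to allocate blocks. The first replica is placed on the local node (the client's node), the second on a node in a different rack, and the third on a different node in the same rack as the second.
The write succeeds once the minimum number of replicas has been written (dfs.namenode.replication.min, 1 by default). Packets wait in a data queue before they are sent and in an ack queue until the datanodes in the pipeline acknowledge them.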
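The three-replica placement described above can be sketched as follows. It assumes the client runs on a datanode, that every rack used has at least two nodes, and that nodes are chosen at random within the allowed set; `place_replicas` and the cluster layout are hypothetical.

```python
import random

def place_replicas(client, nodes, rng=random):
    """Choose three nodes for a block. nodes maps node name -> rack name."""
    first = client  # first replica: the client's own node
    off_rack = [n for n in nodes if nodes[n] != nodes[first]]
    second = rng.choice(off_rack)  # second replica: a different rack
    same_rack = [n for n in nodes if nodes[n] == nodes[second] and n != second]
    third = rng.choice(same_rack)  # third replica: same rack as the second
    return [first, second, third]

# Hypothetical cluster: two racks with two nodes each.
cluster = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
pipeline = place_replicas("n1", cluster)
```

The placement trades off reliability (two racks hold the data) against write bandwidth (only one transfer crosses racks).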

- coherency