HDFS Overview
- distributed
- commodity hardware
- fault-tolerant
- high throughput
- large data sets
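HDFS Assumptions and Design Goals
Hardware Failure
Each machine stores only part of a file's data: blocksize=128M, blocks are placed on different servers, and each block has 3 replicas by default, so losing one machine does not lose the file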
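The block and replica arithmetic above can be sketched as follows. This is a minimal illustration, not HDFS code: `block_layout` and the 300 MB file are made-up names and values, and only the 128 MB block size and the replication factor of 3 come from the notes.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size from the notes
REPLICATION = 3                 # default replica count from the notes


def block_layout(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return block count, last-block size, and raw bytes stored cluster-wide.

    Hypothetical helper for illustration only; not part of any Hadoop API.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Integer ceiling division: a partial block still counts as one block.
    num_blocks = (file_size + block_size - 1) // block_size
    # The last block only holds the bytes that are left over.
    last_block = file_size - (num_blocks - 1) * block_size if num_blocks else 0
    return {
        "num_blocks": num_blocks,
        "last_block_bytes": last_block,
        "raw_bytes": file_size * replication,  # every byte is stored 3 times
    }


if __name__ == "__main__":
    # Assumed example: a 300 MB file becomes 128 MB + 128 MB + 44 MB.
    print(block_layout(300 * 1024 * 1024))
```

With these assumptions a 300 MB file is split into 3 blocks and takes 900 MB of raw disk across the cluster.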
Streaming Data Access
The emphasis is on high throughput of data access rather than low latency of data access.
Large Data Sets
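Large data is not a problem; small data is (HDFS is tuned for large files, and many small files are the bad case)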
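One way to see why small files hurt: the NameNode has to track every file and every block, so the same amount of data costs far more metadata when it is split into many small files. The sketch below only counts entries; `namespace_objects` is a hypothetical helper, and the file sizes are assumed for illustration.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as in the notes


def namespace_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count file entries plus block entries the NameNode would track.

    Hypothetical helper for illustration; real NameNode bookkeeping is richer.
    """
    files = len(file_sizes)
    # Each non-empty file needs at least one block, however small it is.
    blocks = sum((size + block_size - 1) // block_size for size in file_sizes)
    return files + blocks


one_gib = 1024 ** 3
one_big_file = [one_gib]                      # 1 GiB stored as a single file
many_small_files = [one_gib // 8192] * 8192   # the same 1 GiB as 8192 files of 128 KiB

print(namespace_objects(one_big_file))      # 9      (1 file + 8 blocks)
print(namespace_objects(many_small_files))  # 16384  (8192 files + 8192 blocks)
```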
Moving Computation is Cheaper than Moving Data
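The idea can be sketched as a scheduling rule: run the task on a machine that already holds a replica of the block, and ship data over the network only when no such machine is free. This is a toy model; `pick_worker`, the node names, and the block id are all hypothetical, and real schedulers also weigh rack locality and load.

```python
def pick_worker(block_id, block_locations, idle_workers):
    """Prefer an idle worker that already stores a replica of the block.

    block_locations: dict mapping block id -> list of nodes holding a replica.
    idle_workers: set of node names that can accept a task right now.
    Returns (worker, "local" | "remote" | "none").
    """
    for worker in block_locations.get(block_id, []):
        if worker in idle_workers:
            return worker, "local"   # computation moves to the data
    if idle_workers:
        # No replica holder is free, so the block must travel over the network.
        return sorted(idle_workers)[0], "remote"
    return None, "none"


locations = {"blk_0": ["dn1", "dn2", "dn3"]}   # assumed replica placement
print(pick_worker("blk_0", locations, {"dn2", "dn4"}))  # ('dn2', 'local')
print(pick_worker("blk_0", locations, {"dn4", "dn5"}))  # ('dn4', 'remote')
```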
HDFS Architecture *****
- NameNode(master) and DataNodes
- master/slave architecture
- NN: the file system namespace; regulates access to files by clients
- DN: storage
- HDFS exposes a file system namespace and allows user data to be stored in files
- a file is split into one or more blocks
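- blocks are stored in a set of DataNodes (for fault tolerance)
- NameNode executes file system namespace operations (CRUD): opening, closing, and renaming files and directories
- NameNode determines the mapping of blocks to DataNodes (it decides where each file's blocks go; the user does not see this)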
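The division of labor above can be sketched with a toy in-memory model: the NameNode keeps the namespace (path to blocks) and the block map (block to DataNodes), while DataNodes would only hold the bytes. `ToyNameNode` and all of its methods are hypothetical, and the round-robin placement is an assumption made for brevity; real HDFS placement is rack-aware.

```python
import itertools


class ToyNameNode:
    """Toy metadata-only model of a NameNode; not the Hadoop API."""

    def __init__(self, datanodes, block_size=128 * 1024 * 1024, replication=3):
        if len(datanodes) < replication:
            raise ValueError("need at least as many DataNodes as replicas")
        self.datanodes = list(datanodes)
        self.block_size = block_size
        self.replication = replication
        self.namespace = {}   # path -> list of block ids
        self.block_map = {}   # block id -> list of DataNode names
        self._ids = itertools.count()
        self._cursor = 0      # round-robin position (assumed placement policy)

    def create(self, path, size):
        """Register a file and decide where each of its blocks will live."""
        if path in self.namespace:
            raise FileExistsError(path)
        num_blocks = (size + self.block_size - 1) // self.block_size
        blocks = []
        for _ in range(num_blocks):
            block_id = f"blk_{next(self._ids)}"
            n = len(self.datanodes)
            # Pick `replication` distinct DataNodes starting at the cursor.
            targets = [self.datanodes[(self._cursor + i) % n]
                       for i in range(self.replication)]
            self._cursor = (self._cursor + 1) % n
            self.block_map[block_id] = targets
            blocks.append(block_id)
        self.namespace[path] = blocks
        return blocks

    def locate(self, path):
        """What a client would ask before reading: which nodes hold each block."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]

    def rename(self, old, new):
        # Pure metadata change: no block moves between DataNodes.
        self.namespace[new] = self.namespace.pop(old)

    def delete(self, path):
        for block_id in self.namespace.pop(path):
            del self.block_map[block_id]


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.create("/logs/a.log", 300 * 1024 * 1024)   # assumed 300 MB file -> 3 blocks
for block_id, nodes in nn.locate("/logs/a.log"):
    print(block_id, nodes)
```

The client only names a path; which DataNodes end up holding each block is decided entirely inside `create`, which mirrors the point that the mapping is invisible to the user.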