HDFS disks: dfs.data.dir and some related notes

zhzf1511

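HDFS reads the dfs.data.dir property from its configuration file to find where DFS data is stored on the local file system. If a server has several disks (assume they are all mounted on the local file system), we want HDFS to use them as evenly and as fully as possible. In principle HDFS can do this. In HDFS, each such local directory that stores data is called a volume.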
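
To make the "several disks" case concrete, here is a minimal sketch of how a comma-separated dfs.data.dir value could be turned into one directory per volume. DataDirParser and the example paths are hypothetical and not part of Hadoop; the class only approximates what conf.getStrings("dfs.data.dir") hands back and makes no claim about Hadoop's exact trimming rules.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: splits a comma-separated
// dfs.data.dir value into one File per configured volume.
class DataDirParser {
    static List<File> parse(String value) {
        List<File> dirs = new ArrayList<>();
        if (value == null) {
            return dirs; // property not set: no volumes configured
        }
        for (String part : value.split(",")) {
            String path = part.trim();
            if (!path.isEmpty()) {
                dirs.add(new File(path));
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Example paths are assumptions: one directory per mounted disk.
        for (File dir : parse("/data/1/dfs/dn, /data/2/dfs/dn,/data/3/dfs/dn")) {
            System.out.println(dir.getPath());
        }
    }
}
```

Each returned File corresponds to one volume in the sense used in this post.
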
Let's go straight to the code in DataNode.java:

public static DataNode createDataNode(String args[], Configuration conf) throws IOException {
    DataNode dn = instantiateDataNode(args, conf);
    runDatanodeDaemon(dn);
    return dn;
}

public static DataNode instantiateDataNode(String args[], Configuration conf) throws IOException {
    //...
    // dfs.data.dir holds a comma-separated list of local directories
    String[] dataDirs = conf.getStrings("dfs.data.dir");
    dnThreadName = "DataNode: [" +
        StringUtils.arrayToString(dataDirs) + "]";
    return makeInstance(dataDirs, conf);
}

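Before the DataNode is actually instantiated, the code first reads the dfs.data.dir value defined in the configuration file into the string array dataDirs. The makeInstance(dataDirs, conf) method then checks whether each entry in dataDirs exists and is usable on the local file system. As long as at least one directory is usable, a new DataNode is created.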
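
The "at least one usable directory" rule can be sketched as follows. UsableDirs is a hypothetical stand-in rather than Hadoop's own check, and the criteria used here (is a directory, readable, writable) are an assumption about what "usable" means; the real makeInstance may apply different or additional checks.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class UsableDirs {
    // Hypothetical stand-in for the check in makeInstance: keep a directory
    // only if it exists as a directory and is both readable and writable.
    static List<File> filter(List<File> candidates) {
        List<File> usable = new ArrayList<>();
        for (File dir : candidates) {
            if (dir.isDirectory() && dir.canRead() && dir.canWrite()) {
                usable.add(dir);
            } else {
                System.err.println("Skipping unusable data dir: " + dir);
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        List<File> candidates = new ArrayList<>();
        candidates.add(new File(System.getProperty("java.io.tmpdir")));
        candidates.add(new File("/no/such/dir/for/this/sketch"));
        List<File> usable = filter(candidates);
        // One usable directory is enough to go on and build the DataNode.
        System.out.println(usable.isEmpty() ? "no usable data dir" : "usable: " + usable);
    }
}
```

The sketch leaves the decision about an empty result to the caller, which is where the "only one directory needs to work" rule would be enforced.
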
The DataNode() constructor calls startDataNode(conf, dataDirs) directly. The data-related code in that method is as follows:

void startDataNode(Configuration conf, AbstractList<File> dataDirs) throws IOException {
    //…
    storage = new DataStorage();
    //…
    // read storage info, lock data dirs and transition fs state if necessary
    storage.recoverTransitionRead(nsInfo, dataDirs, startOpt);
    // adjust
    this.dnRegistration.setStorageInfo(storage);
    // initialize data node internal structure
    this.data = new FSDataset(storage, conf);
}

Inside storage.recoverTransitionRead(nsInfo, dataDirs, startOpt), dataDirs is checked once more:

for (Iterator<File> it = dataDirs.iterator(); it.hasNext();) {
    File dataDir = it.next();
    StorageDirectory sd = new StorageDirectory(dataDir);
    StorageState curState;
    try {
        curState = sd.analyzeStorage(startOpt);
        // sd is locked but not opened
        switch (curState) {
        case NORMAL:
            break;
        case NON_EXISTENT:
            // ignore this storage
            LOG.info("Storage directory " + dataDir + " does not exist.");
            it.remove();
            continue;
        case NOT_FORMATTED: // format
            LOG.info("Storage directory " + dataDir + " is not formatted.");
            LOG.info("Formatting ...");
            format(sd, nsInfo);
            break;
        default: // recovery part is common
            sd.doRecover(curState);
        }
    } catch (IOException ioe) {
        sd.unlock();
        throw ioe;
    }
    // add to the storage list
    addStorageDir(sd);
    dataDirStates.add(curState);
}
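
The shape of this loop (classify each directory, drop the missing ones in place, remember the state of the rest) can be tried out in isolation. In the sketch below, StorageScan and its three-value State enum are hypothetical, and the rule "formatted means current/VERSION exists" is a simplifying assumption; the real analyzeStorage distinguishes more states and also takes the directory lock.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class StorageScan {
    enum State { NORMAL, NON_EXISTENT, NOT_FORMATTED }

    // Simplified assumption: a directory counts as formatted once it
    // contains current/VERSION. The real analyzeStorage checks more.
    static State analyze(File dir) {
        if (!dir.isDirectory()) {
            return State.NON_EXISTENT;
        }
        File version = new File(new File(dir, "current"), "VERSION");
        return version.isFile() ? State.NORMAL : State.NOT_FORMATTED;
    }

    // Drops missing directories from dataDirs in place and returns the
    // state of each directory that is kept, in order.
    static List<State> scan(List<File> dataDirs) {
        List<State> states = new ArrayList<>();
        for (Iterator<File> it = dataDirs.iterator(); it.hasNext();) {
            State state = analyze(it.next());
            if (state == State.NON_EXISTENT) {
                it.remove(); // safe removal while iterating
                continue;
            }
            states.add(state);
        }
        return states;
    }

    public static void main(String[] args) {
        List<File> dirs = new ArrayList<>();
        dirs.add(new File(System.getProperty("java.io.tmpdir")));
        dirs.add(new File("/no/such/dir/for/this/sketch"));
        System.out.println(scan(dirs) + " kept=" + dirs);
    }
}
```

Removing through the iterator is what allows the list of data directories to shrink while it is being traversed, which is why the original code calls it.remove() rather than dataDirs.remove(...).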

In startDataNode(), the code that deals with volumes directly is the last line:

this.data = new FSDataset(storage, conf);

FSDataset.java defines many of the DFS data structures, such as FSDir, FSVolume and FSVolumeSet.

public FSDataset(DataStorage storage, Configuration conf) throws IOException {
    this.maxBlocksPerDir = conf.getInt("dfs.datanode.numblocks", 64);
    FSVolume[] volArray = new FSVolume[storage.getNumStorageDirs()];
    for (int idx = 0; idx < storage.getNumStorageDirs(); idx++) {
        volArray[idx] = new FSVolume(storage.getStorageDir(idx).getCurrentDir(), conf);
    }
    volumes = new FSVolumeSet(volArray);
    volumeMap = new HashMap<Block, DatanodeBlockInfo>();
    volumes.getVolumeMap(volumeMap);
    registerMBean(storage.getStorageID());
}

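In this constructor, volumeMap holds the mapping from each Block stored on this DataNode to a DatanodeBlockInfo, and DatanodeBlockInfo in turn holds that block's metadata, namely the volume it belongs to and its block file: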

class DatanodeBlockInfo {
    private FSVolume volume;   // volume where the block belongs
    private File file;         // block file
    private boolean detached;  // copy-on-write done for block
    //...
}

The call volumes.getVolumeMap(volumeMap) then walks each volume recursively and records the mapping for every block that already exists under it.
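
A recursive walk of this kind can be sketched without any Hadoop classes. BlockScanner below is hypothetical, and so is the naming rule it relies on (data files named blk_<id>, with anything else, such as files ending in .meta, ignored); it maps a block id to its file instead of building DatanodeBlockInfo objects.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockScanner {
    // Assumed naming: data files are blk_<id>; other files are skipped.
    private static final Pattern BLOCK_FILE = Pattern.compile("blk_(-?\\d+)");

    // Recursively records every block file under dir, keyed by block id.
    static void scan(File dir, Map<Long, File> blocks) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                scan(entry, blocks); // descend into subdirectories
                continue;
            }
            Matcher m = BLOCK_FILE.matcher(entry.getName());
            if (m.matches()) {
                blocks.put(Long.parseLong(m.group(1)), entry);
            }
        }
    }

    public static void main(String[] args) {
        Map<Long, File> blocks = new HashMap<>();
        for (String volume : args) { // pass one or more volume directories
            scan(new File(volume), blocks);
        }
        System.out.println(blocks.size() + " block(s) found");
    }
}
```

Sharing one map across all volumes mirrors how a single volumeMap is filled from every configured volume.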

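At this point HDFS has essentially finished mapping files on the local file system to files/blocks in DFS. FSDataset is a very important class in this process. The next blog post will describe how to modify HDFS so that a SequenceFile is created on a specified volume, and that requires digging into many of the methods here. For example: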

synchronized FSVolume getNextVolume(long blockSize) throws IOException {
    int startVolume = curVolume;
    while (true) {
        FSVolume volume = volumes[curVolume];
        curVolume = (curVolume + 1) % volumes.length;
        if (volume.getAvailable() > blockSize) { return volume; }
        if (curVolume == startVolume) {
            throw new DiskOutOfSpaceException("Insufficient space for an additional block");
        }
    }
}

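This method is what lets HDFS use every configured volume in a 'balanced' way: new blocks are handed out to the volumes in round-robin order, skipping any volume that does not have enough free space for the block.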
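
To see the round-robin behaviour in action, the same loop can be run against plain numbers. RoundRobinVolumes is a hypothetical simulation: it replaces FSVolume with an array of free-space values, returns a volume index instead of a volume object, and throws an unchecked IllegalStateException in place of DiskOutOfSpaceException.

```java
class RoundRobinVolumes {
    private final long[] available; // simulated free bytes per volume
    private int curVolume = 0;

    RoundRobinVolumes(long... available) {
        this.available = available.clone();
    }

    // Same loop shape as getNextVolume, but returns the volume index.
    synchronized int next(long blockSize) {
        int startVolume = curVolume;
        while (true) {
            int candidate = curVolume;
            curVolume = (curVolume + 1) % available.length;
            if (available[candidate] > blockSize) {
                return candidate;
            }
            if (curVolume == startVolume) {
                // every volume was tried once and none had room
                throw new IllegalStateException("Insufficient space for an additional block");
            }
        }
    }

    public static void main(String[] args) {
        // Volume 1 is nearly full, so it is skipped for 20-byte blocks.
        RoundRobinVolumes volumes = new RoundRobinVolumes(100, 10, 100);
        StringBuilder picks = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            picks.append(volumes.next(20)).append(' ');
        }
        System.out.println(picks.toString().trim()); // prints "0 2 0 2"
    }
}
```

The simulation never reduces the free space after a pick, so it shows only the selection order. It also makes visible that the rotation evens out the number of blocks placed on each volume, not the number of bytes each volume ends up holding.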