http://hi.baidu.com/thinkdifferent/blog/item/95de0e2416c4da3fc89559b8.html
I had never found a solid reference for how HDFS handles storage on nodes with multiple disks; the only thing I had seen was a remark on the Hadoop website roughly to the effect that "nodes with multiple disks should be managed internally" (something like that; I did not bother to look it up again). Today I came across a blog post that simply pastes the relevant code snippet. Reposted below:
from: kzk's blog
To use multiple disks on a Hadoop DataNode, add a comma-separated list of directories to dfs.data.dir in hdfs-site.xml. The following is an example using four disks.
<property>
  <name>dfs.data.dir</name>
  <value>/disk1, /disk2, /disk3, /disk4</value>
</property>
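Conceptually, the DataNode turns that comma-separated value into one storage directory (and later one FSVolume) per disk. Here is a rough stand-alone sketch of that split, not Hadoop's actual parsing code; in particular, whether the real code trims the spaces after the commas depends on the version.

import java.util.ArrayList;
import java.util.List;

public class DataDirSplit {
  // Split a dfs.data.dir-style value into individual directories.
  static List<String> split(String dfsDataDir) {
    List<String> dirs = new ArrayList<String>();
    for (String dir : dfsDataDir.split(",")) {
      String trimmed = dir.trim();  // trimming is this sketch's choice, not necessarily Hadoop's
      if (!trimmed.isEmpty()) {
        dirs.add(trimmed);
      }
    }
    return dirs;
  }

  public static void main(String[] args) {
    // The value from the hdfs-site.xml example above.
    System.out.println(split("/disk1, /disk2, /disk3, /disk4"));
    // -> [/disk1, /disk2, /disk3, /disk4]
  }
}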
But how does Hadoop actually use these disks? I found the following code snippet in ./hdfs/org/apache/hadoop/hdfs/server/datanode/FSDataset.java in hadoop-0.20.1.
synchronized FSVolume getNextVolume(long blockSize) throws IOException {
  int startVolume = curVolume;
  while (true) {
    // Try the current volume, then advance the round-robin pointer.
    FSVolume volume = volumes[curVolume];
    curVolume = (curVolume + 1) % volumes.length;
    if (volume.getAvailable() > blockSize) {
      return volume;
    }
    // We have gone all the way around without finding enough space.
    if (curVolume == startVolume) {
      throw new DiskOutOfSpaceException("Insufficient space for an additional block");
    }
  }
}
FSVolume represents a single directory specified in dfs.data.dir. This code places blocks on the disks in round-robin fashion, while taking each disk's available capacity into account.
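To see what that means in practice, here is a toy stand-alone simulation of the same logic. It is not Hadoop code; the volume capacities and block size are made up: four simulated disks, one of them nearly full, receiving 128 MB blocks in turn.

public class RoundRobinDemo {
  // Free bytes per simulated volume; disk3 is almost full.
  static long[] available = {500L << 20, 500L << 20, 100L << 20, 500L << 20};
  static int curVolume = 0;

  static int getNextVolume(long blockSize) {
    int startVolume = curVolume;
    while (true) {
      int volume = curVolume;
      curVolume = (curVolume + 1) % available.length;
      if (available[volume] > blockSize) {
        available[volume] -= blockSize;  // pretend the block was written here
        return volume;
      }
      if (curVolume == startVolume) {
        throw new RuntimeException("Insufficient space for an additional block");
      }
    }
  }

  public static void main(String[] args) {
    long blockSize = 128L << 20;  // 128 MB block
    for (int i = 0; i < 6; i++) {
      System.out.println("block " + i + " -> disk" + (getNextVolume(blockSize) + 1));
    }
    // Prints disk1, disk2, disk4, disk1, disk2, disk4: the nearly-full
    // disk3 is skipped, the other disks are used in turn.
  }
}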
One more thing: if disk utilization reaches 100%, other important data (e.g., error logs) can no longer be written. To prevent this, Hadoop provides the "dfs.datanode.du.reserved" setting. When Hadoop calculates a disk's capacity, this value is always subtracted from the real capacity. Setting it to several hundred megabytes should be safe.
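For example, reserving 512 MB per volume would look roughly like this in hdfs-site.xml (the value is in bytes; 512 MB is just an illustrative choice):

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>536870912</value>
</property>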
This is Hadoop's default strategy, but I think taking the disk load average into account would be better: if one disk is busy, Hadoop should avoid using it. However, with that approach the block distribution would no longer be even across the disks, so read performance would drop. This is a very difficult problem. Can you come up with a better strategy?
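Just to make the idea concrete, here is a purely hypothetical sketch of a load-aware chooser. Nothing in it is Hadoop API: Volume, getLoad(), and getAvailable() are stand-ins for a per-disk I/O load metric (e.g. derived from /proc/diskstats) and the free-space check that FSVolume already performs.

import java.util.List;

interface Volume {
  long getAvailable();   // free bytes on this disk (assumed accessor)
  double getLoad();      // current I/O load, lower means idler (assumed accessor)
}

class LoadAwareChooser {
  // Pick the least-loaded volume that still has room for the block.
  Volume chooseVolume(List<Volume> volumes, long blockSize) {
    Volume best = null;
    for (Volume v : volumes) {
      if (v.getAvailable() <= blockSize) {
        continue;  // skip full disks, as the round-robin code does
      }
      if (best == null || v.getLoad() < best.getLoad()) {
        best = v;
      }
    }
    if (best == null) {
      throw new RuntimeException("Insufficient space for an additional block");
    }
    return best;
  }
}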