Chapter 7: Xiaozhu's Notes on Hadoop Source Code Analysis - HDFS
Section 3: HDFS Implementation Analysis
3.4 DataNode Data Structures
The Storage-related classes describe, at a high level, how each storage directory is organized. They manage the directories and files (current, previous, detach, tmp, storage, etc.) under the paths given by the HDFS property dfs.data.dir, and define the operations on the storage as a whole.
The Dataset-related classes describe how block files and their metadata files are organized.
All block-level operations are handled by the FSDataset family of classes. From coarse to fine, the storage structure is: volume (FSVolume), directory (FSDir), and file (the Block data file, its metadata file, and so on).
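For orientation, one storage directory (one FSVolume) is laid out on disk roughly as follows; the block ID, generation stamp (1001) and subdirectory names here are only illustrative:

${dfs.data.dir}/
    storage                                # storage marker/guard file
    current/                               # finalized blocks (the FSDir tree)
        VERSION
        blk_3148782637964391313            # block data file
        blk_3148782637964391313_1001.meta  # block metadata (checksums)
        subdir0/ ... subdir63/             # created once the directory fills up
    previous/                              # previous version, kept across upgrades
    tmp/                                   # blocks currently being written
    detach/                                # copy-on-write staging (see the detach technique below)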
(1) FSDatasetInterface
FSDatasetInterface is the DataNode's abstraction of its underlying storage. Its main methods (a usage sketch follows the list):
getMetaDataLength(Block b): returns the length of the block's metadata file
getMetaDataInputStream(Block b): opens an input stream on the block's metadata file
metaFileExists(Block b): checks whether the block's metadata file exists
getLength(Block b): returns the length of the block's data file
getStoredBlock(long blkid): returns the block's information, looked up by block ID
getBlockInputStream(long blkid): opens an input stream on the block's data file
getBlockInputStream(Block b, long seekOffset): opens an input stream positioned at a given offset in the block's data file
getTmpInputStreams(Block b, long blkoff, long ckoff): opens input streams at given offsets in the block's data and metadata files while the block still resides in the temporary directory
writeToBlock(Block b, boolean isRecovery, boolean replicationRequest): creates the block file and returns output streams for it
updateBlock(Block oldblock, Block newblock): updates a block (generation stamp and/or length)
finalizeBlock(Block b): completes the write of a block
unfinalizeBlock(Block b): aborts the write of a block and deletes its temporary files
getBlockReport(): returns the report of blocks stored on this DataNode
isValidBlock(Block b): checks whether the block is valid
invalidate(Block invalidBlks[]): invalidates the given blocks, deleting their data and metadata files
checkDataDir(): checks whether the storage directories are healthy
shutdown(): shuts down the FSDataset
getChannelPosition(Block b, BlockWriteStreams stream): returns the current position in the block's output streams
setChannelPosition(Block b, BlockWriteStreams stream, long dataOffset, long ckOffset): sets the position in the block's output streams
validateBlockMetadata(Block b): validates the block's metadata file
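To make the call pattern concrete, here is a minimal usage sketch (not from the original text): it assumes dataset is an initialized FSDatasetInterface instance and blockId names a block this DataNode already stores.

// Illustrative sketch only: `dataset` and `blockId` are assumed to exist.
static void printBlockInfo(FSDatasetInterface dataset, long blockId) throws IOException {
  Block b = dataset.getStoredBlock(blockId);        // resolve ID -> Block(id, length, genstamp)
  if (b != null && dataset.metaFileExists(b)) {
    long dataLen = dataset.getLength(b);            // size of the blk_<id> data file
    long metaLen = dataset.getMetaDataLength(b);    // size of the blk_<id>_<gs>.meta file
    System.out.println(b + ": data=" + dataLen + " bytes, meta=" + metaLen + " bytes");
  }
}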
(2) FSDataset
FSDataset implements the FSDatasetInterface interface; all operations on data blocks are handled by FSDataset and its related classes. Key member variables:
// The set of all storage volumes used by this FSDataset
FSVolumeSet volumes;
// Map from Block to ActiveFile; every block that is being created is recorded in ongoingCreates
private HashMap<Block,ActiveFile> ongoingCreates = new HashMap<Block,ActiveFile>();
// Maximum number of blocks a single directory may hold
private int maxBlocksPerDir = 0;
// In-memory map from each block to its on-disk location: the FSVolume the block lives on and its concrete file
HashMap<Block,DatanodeBlockInfo> volumeMap = new HashMap<Block, DatanodeBlockInfo>();
Key methods:
// Look up a Block by its block ID
public synchronized Block getStoredBlock(long blkid) throws IOException {
  File blockfile = findBlockFile(blkid);
  if (blockfile == null) {
    return null;
  }
  File metafile = findMetaFile(blockfile);
  Block block = new Block(blkid);
  return new Block(blkid, getVisibleLength(block),
                   parseGenerationStamp(blockfile, metafile));
}
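getStoredBlock recovers the generation stamp from the metadata file's name. The naming convention is blk_<blockId>_<generationStamp>.meta, so the helper can be sketched as follows (a simplified version of the real parseGenerationStamp; the ".meta" literal stands in for the METADATA_EXTENSION constant):

// The stamp sits between the block file name and the ".meta" extension,
// e.g. blk_3148782637964391313_1001.meta -> generation stamp 1001.
static long parseGenerationStamp(File blockFile, File metaFile) throws IOException {
  String metaname = metaFile.getName();
  String gs = metaname.substring(blockFile.getName().length() + 1,
                                 metaname.length() - ".meta".length());
  try {
    return Long.parseLong(gs);
  } catch (NumberFormatException nfe) {
    throw new IOException("Corrupt meta file name: " + metaname, nfe);
  }
}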
// Check whether the block's metadata file exists
public boolean metaFileExists(Block b) throws IOException {
  return getMetaFile(b).exists();
}
// Get the length of a block's metadata: find the metadata file for the block's ID and return its length
public long getMetaDataLength(Block b) throws IOException {
  File checksumFile = getMetaFile(b);
  return checksumFile.length();
}
// Get an input stream on a block's metadata: find the metadata file for the block's ID and open a stream on it
public MetaDataInputStream getMetaDataInputStream(Block b)
    throws IOException {
  File checksumFile = getMetaFile(b);
  return new MetaDataInputStream(new FileInputStream(checksumFile),
                                 checksumFile.length());
}
/**
 * Returns handles to the block file and its metadata file.
 * Gets "temporary" input streams for a block; temporary means the underlying
 * files still live in the tmp directory. While a block is being created its
 * data is written under tmp/, and only after the write succeeds is the file
 * moved into current/; if the write fails, current/ is never affected.
 */
public synchronized BlockInputStreams getTmpInputStreams(Block b,
    long blkOffset, long ckoff) throws IOException {
  DatanodeBlockInfo info = volumeMap.get(b);
  if (info == null) {
    throw new IOException("Block " + b + " does not exist in volumeMap.");
  }
  FSVolume v = info.getVolume();
  File blockFile = info.getFile();
  // a newly created block is still written under the tmp directory
  if (blockFile == null) {
    blockFile = v.getTmpFile(b);
  }
  RandomAccessFile blockInFile = new RandomAccessFile(blockFile, "r");
  if (blkOffset > 0) {
    blockInFile.seek(blkOffset);
  }
  File metaFile = getMetaFile(blockFile, b);
  RandomAccessFile metaInFile = new RandomAccessFile(metaFile, "r");
  if (ckoff > 0) {
    metaInFile.seek(ckoff);
  }
  return new BlockInputStreams(new FileInputStream(blockInFile.getFD()),
                               new FileInputStream(metaInFile.getFD()));
}
/** {@inheritDoc}
 *
 * The outer loop of updateBlock runs until no write threads remain for this
 * block. Each iteration calls the internal method tryUpdateBlock. If
 * tryUpdateBlock finds that no thread is still writing the block, it updates
 * everything associated with the block, including the files on disk and the
 * in-memory map volumeMap. If tryUpdateBlock finds threads still attached to
 * the block, updateBlock interrupts them and waits for them in join().
 */
public void updateBlock(Block oldblock, Block newblock) throws IOException {
  if (oldblock.getBlockId() != newblock.getBlockId()) {
    throw new IOException("Cannot update oldblock (=" + oldblock
        + ") to newblock (=" + newblock + ").");
  }
  // Protect against a straggler updateblock call moving a block backwards
  // in time.
  boolean isValidUpdate =
      (newblock.getGenerationStamp() > oldblock.getGenerationStamp()) ||
      (newblock.getGenerationStamp() == oldblock.getGenerationStamp() &&
       newblock.getNumBytes() == oldblock.getNumBytes());
  if (!isValidUpdate) {
    throw new IOException(
        "Cannot update oldblock=" + oldblock +
        " to newblock=" + newblock + " since generation stamps must " +
        "increase, or else length must not change.");
  }
  for (;;) {
    final List<Thread> threads = tryUpdateBlock(oldblock, newblock);
    if (threads == null) {
      return;
    }
    interruptAndJoinThreads(threads);
  }
}
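interruptAndJoinThreads is not reproduced in this note; a plausible sketch of what the loop above relies on (an assumption, not verbatim source) is:

// Interrupt every ongoing writer of the block and wait for each to exit,
// after which updateBlock retries tryUpdateBlock.
private void interruptAndJoinThreads(List<Thread> threads) throws IOException {
  for (Thread t : threads) {
    t.interrupt();
  }
  for (Thread t : threads) {
    try {
      t.join();   // block until the writer thread has terminated
    } catch (InterruptedException e) {
      throw new IOException("Interrupted while waiting for writer " + t);
    }
  }
}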
/**
 * Try to update an old block to a new block.
 * If there are ongoing create threads running for the old block,
 * the threads will be returned without updating the block.
 * Truncates the old block into the new block (via truncateBlock), truncates
 * the corresponding metadata file, and updates ongoingCreates and volumeMap.
 * @return ongoing create threads if there is any. Otherwise, return null.
 */
private synchronized List<Thread> tryUpdateBlock(Block oldblock, Block newblock) throws IOException {
  // check ongoing create threads: find the threads currently accessing this block
  ArrayList<Thread> activeThreads = getActiveThreads(oldblock);
  if (activeThreads != null) {
    return activeThreads; // writers are still active; return them instead of updating
  }
  // No ongoing create threads is alive. Update block.
  // locate the old block's data file
  File blockFile = findBlockFile(oldblock.getBlockId());
  if (blockFile == null) {
    throw new IOException("Block " + oldblock + " does not exist.");
  }
  File oldMetaFile = findMetaFile(blockFile);
  long oldgs = parseGenerationStamp(blockFile, oldMetaFile);
  // First validate the update
  // update generation stamp: the old stamp must not exceed the new one
  if (oldgs > newblock.getGenerationStamp()) {
    throw new IOException("Cannot update block (id=" + newblock.getBlockId()
        + ") generation stamp from " + oldgs
        + " to " + newblock.getGenerationStamp());
  }
  // update length: the new block must not be larger than the old one
  if (newblock.getNumBytes() > oldblock.getNumBytes()) {
    throw new IOException("Cannot update block file (=" + blockFile
        + ") length from " + oldblock.getNumBytes() + " to " + newblock.getNumBytes());
  }
  // Now perform the update
  // rename the old meta file to a tmp file
  File tmpMetaFile = new File(oldMetaFile.getParent(),
      oldMetaFile.getName() + "_tmp" + newblock.getGenerationStamp());
  if (!oldMetaFile.renameTo(tmpMetaFile)) {
    throw new IOException("Cannot rename block meta file to " + tmpMetaFile);
  }
  // if the new block is smaller than the old one, truncate the block file and its meta file
  if (newblock.getNumBytes() < oldblock.getNumBytes()) {
    truncateBlock(blockFile, tmpMetaFile, oldblock.getNumBytes(), newblock.getNumBytes());
  }
  // rename the tmp file to the new meta file (with new generation stamp)
  File newMetaFile = getMetaFile(blockFile, newblock);
  if (!tmpMetaFile.renameTo(newMetaFile)) {
    throw new IOException("Cannot rename tmp meta file to " + newMetaFile);
  }
  updateBlockMap(ongoingCreates, oldblock, newblock);
  updateBlockMap(volumeMap, oldblock, newblock);
  // paranoia! verify that the contents of the stored block
  // matches the block file on disk.
  validateBlockMetadata(newblock);
  return null;
}
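getActiveThreads, called at the top of tryUpdateBlock, consults ongoingCreates. A simplified sketch (close to, but not verbatim, the source):

// Look the block up in ongoingCreates, prune threads that already died,
// and return the still-live writers, or null if there are none.
private synchronized ArrayList<Thread> getActiveThreads(Block block) {
  ActiveFile activefile = ongoingCreates.get(block);
  if (activefile != null && !activefile.threads.isEmpty()) {
    for (Iterator<Thread> i = activefile.threads.iterator(); i.hasNext(); ) {
      if (!i.next().isAlive()) {
        i.remove();                       // drop writers that have exited
      }
    }
    if (!activefile.threads.isEmpty()) {
      return new ArrayList<Thread>(activefile.threads);
    }
  }
  return null;                            // no live writers: safe to update
}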
// truncateBlock truncates the old block file blockFile and its metadata file metaFile;
// after truncation the block's length is newlen (newlen < oldlen).
static void truncateBlock(File blockFile, File metaFile, long oldlen, long newlen) throws IOException {
  if (newlen == oldlen) {
    return;
  }
  if (newlen > oldlen) {
    throw new IOException("Cannot truncate block from oldlen (=" + oldlen
        + ") to newlen (=" + newlen + ")");
  }
  if (newlen == 0) {
    // Special case for truncating to 0 length, since there's no previous
    // chunk.
    RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw");
    try {
      // truncate blockFile
      blockRAF.setLength(newlen);
    } finally {
      blockRAF.close();
    }
    // update metaFile
    RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw");
    try {
      metaRAF.setLength(BlockMetadataHeader.getHeaderSize());
    } finally {
      metaRAF.close();
    }
    return;
  }
  // Because the block is simply cut short, the last checksum chunk of the new
  // block may differ from the old one, so after setLength the last chunk must
  // be re-read and its checksum recomputed.
  DataChecksum dcs = BlockMetadataHeader.readHeader(metaFile).getChecksum();
  int checksumsize = dcs.getChecksumSize();
  int bpc = dcs.getBytesPerChecksum();
  long newChunkCount = (newlen - 1)/bpc + 1;   // number of checksum chunks
  long newmetalen = BlockMetadataHeader.getHeaderSize() + newChunkCount*checksumsize; // new length of the meta file
  long lastchunkoffset = (newChunkCount - 1)*bpc; // offset of the last chunk in the block file
  int lastchunksize = (int)(newlen - lastchunkoffset); // size of the last (possibly partial) chunk
  byte[] b = new byte[Math.max(lastchunksize, checksumsize)];
  RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw"); // open the old block file
  try {
    // truncate blockFile
    blockRAF.setLength(newlen);
    // read last chunk
    blockRAF.seek(lastchunkoffset);
    blockRAF.readFully(b, 0, lastchunksize);
  } finally {
    blockRAF.close();
  }
  // compute checksum
  dcs.update(b, 0, lastchunksize);
  dcs.writeValue(b, 0, false);
  // update metaFile
  RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw");
  try {
    metaRAF.setLength(newmetalen);
    metaRAF.seek(newmetalen - checksumsize);
    metaRAF.write(b, 0, checksumsize);
  } finally {
    metaRAF.close();
  }
}
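A concrete example of the chunk arithmetic: with bytesPerChecksum = 512 and checksumSize = 4, truncating a block to newlen = 1000 gives newChunkCount = (1000 - 1)/512 + 1 = 2, lastchunkoffset = (2 - 1)*512 = 512, and lastchunksize = 1000 - 512 = 488; the meta file is cut to headerSize + 2*4 bytes, and the 488-byte tail chunk is re-read so that its 4-byte checksum can be recomputed, since truncation changed that chunk's contents.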
/**
 * Commits (finalizes) a block opened via writeToBlock: the write completed
 * without error, so the block can be moved from the tmp folder into the
 * current folder for good. In FSDataset, finalizeBlock removes the block from
 * ongoingCreates and puts the block's DatanodeBlockInfo into volumeMap.
 * Taking blk_3148782637964391313 as an example again: when the DataNode
 * finalizes the data block with block ID 3148782637964391313, it moves
 * tmp/blk_3148782637964391313 into some directory under current/, say
 * subdir12, so the file ends up as current/subdir12/blk_3148782637964391313.
 * The corresponding meta file is placed in current/subdir12 as well.
 */
@Override
public void finalizeBlock(Block b) throws IOException {
finalizeBlockInternal(b, false);
}
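finalizeBlockInternal is not reproduced in this note; sketched from the description above (close to, but not verbatim, the source), its core is:

// Move the finished block from tmp/ into the FSDir tree (current/...) and
// update the in-memory maps accordingly.
private synchronized void finalizeBlockInternal(Block b, boolean reFinalizeOk) throws IOException {
  ActiveFile activeFile = ongoingCreates.get(b);
  if (activeFile == null) {
    if (reFinalizeOk) {
      return;                              // already finalized; tolerated on this path
    }
    throw new IOException("Block " + b + " is already finalized.");
  }
  File f = activeFile.file;
  if (f == null || !f.exists()) {
    throw new IOException("No temporary file " + f + " for block " + b);
  }
  FSVolume v = volumeMap.get(b).getVolume();
  File dest = v.addBlock(b, f);            // FSDir picks a spot under current/
  volumeMap.put(b, new DatanodeBlockInfo(v, dest));
  ongoingCreates.remove(b);                // no longer under construction
}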
/**
 * Start writing to a block file. If isRecovery is true and the block
 * pre-exists, then we kill all other threads that might be writing to this
 * block, and then reopen the file. If replicationRequest is true, then this
 * operation is part of a block replication request.
 *
 * If the block's data file already exists and this is a recovery, detach the
 * data file from any files hard-linked to it. Recovery is needed in two
 * cases: the client re-opened the connection and is resending packets, or
 * data is being appended to the block.
 * If the block is in ongoingCreates, remove it.
 * If this is not a recovery, pick a volume to hold the block, create a
 * temporary file, and add the block to volumeMap.
 * For a recovery: if the block's temporary file exists, reuse it and add the
 * block to volumeMap; if it does not exist, move the block's data file and
 * its metadata file into the temporary directory and add the block to volumeMap.
 * Add the block to ongoingCreates.
 * Return the block file's output streams.
 */
public BlockWriteStreams writeToBlock(Block b, boolean isRecovery,
boolean replicationRequest) throws IOException {
//
// Make sure the block isn't a valid one - we're still creating it!
//
if (isValidBlock(b)) {
if (!isRecovery) {
throw new BlockAlreadyExistsException("Block " + b + " is valid, and cannot be written to.");
}
// If the block was successfully finalized because all packets
// were successfully processed at the Datanode but the ack for
// some of the packets were not received by the client. The client
// re-opens the connection and retries sending those packets.
// The other reason is that an "append" is occurring to this block.
detachBlock(b, 1);
}
long blockSize = b.getNumBytes();
//
// Serialize access to /tmp, and check if file already there.
//
File f = null;
List<Thread> threads = null;
synchronized (this) {
//
// Is it already in the create process?
//
ActiveFile activeFile = ongoingCreates.get(b);
if (activeFile != null) {
f = activeFile.file;
threads = activeFile.threads;
if (!isRecovery) {
throw new BlockAlreadyExistsException("Block " + b +
" has already been started (though not completed), and thus cannot be created.");
} else {
for (Thread thread:threads) {
thread.interrupt();
}
}
ongoingCreates.remove(b);
}
FSVolume v = null;
if (!isRecovery) {
v = volumes.getNextVolume(blockSize);
// create temporary file to hold block in the designated volume
f = createTmpFile(v, b, replicationRequest);
} else if (f != null) {
DataNode.LOG.info("Reopen already-open Block for append " + b);
// create or reuse temporary file to hold block in the designated volume
v = volumeMap.get(b).getVolume();
volumeMap.put(b, new DatanodeBlockInfo(v, f));
} else {
// reopening block for appending to it.
DataNode.LOG.info("Reopen Block for append " + b);
v = volumeMap.get(b).getVolume();
f = createTmpFile(v, b, replicationRequest);
File blkfile = getBlockFile(b);
File oldmeta = getMetaFile(b);
File newmeta = getMetaFile(f, b);
// rename meta file to tmp directory
DataNode.LOG.debug("Renaming " + oldmeta + " to " + newmeta);
if (!oldmeta.renameTo(newmeta)) {
throw new IOException("Block " + b + " reopen failed. " +
" Unable to move meta file " + oldmeta +
" to tmp dir " + newmeta);
}
// rename block file to tmp directory
DataNode.LOG.debug("Renaming " + blkfile + " to " + f);
if (!blkfile.renameTo(f)) {
if (!f.delete()) {
throw new IOException("Block " + b + " reopen failed. " +
" Unable to remove file " + f);
}
if (!blkfile.renameTo(f)) {
throw new IOException("Block " + b + " reopen failed. " +
" Unable to move block file " + blkfile +
" to tmp dir " + f);
}
}
}
if (f == null) {
DataNode.LOG.warn("Block " + b + " reopen failed " +
" Unable to locate tmp file.");
throw new IOException("Block " + b + " reopen failed " +
" Unable to locate tmp file.");
}
// If this is a replication request, then this is not a permanent
// block yet, it could get removed if the datanode restarts. If this
// is a write or append request, then it is a valid block.
if (replicationRequest) {
volumeMap.put(b, new DatanodeBlockInfo(v));
} else {
volumeMap.put(b, new DatanodeBlockInfo(v, f));
}
ongoingCreates.put(b, new ActiveFile(f, threads));
}
try {
if (threads != null) {
for (Thread thread:threads) {
thread.join();
}
}
} catch (InterruptedException e) {
throw new IOException("Recovery waiting for thread interrupted.");
}
//
// Finally, allow a writer to the block file
// REMIND - mjc - make this a filter stream that enforces a max
// block size, so clients can't go crazy
//
File metafile = getMetaFile(f, b);
DataNode.LOG.debug("writeTo blockfile is " + f + " of size " + f.length());
DataNode.LOG.debug("writeTo metafile is " + metafile + " of size " + metafile.length());
return createBlockWriteStreams( f , metafile);
}
The detach technique:
When the system is upgraded, a snapshot is created. The files in the snapshot and the block data and metadata files in current are hard links pointing at the same content. If a file in current were modified without a detach, the change would also show through in the snapshot's file, so the hard link has to be broken first. The method is simple: copy the file in a temporary folder, then rename the temporary file to the corresponding name in current; after that, the file in current and the file in the snapshot are detached. This technique is also known as copy-on-write and is an effective way to improve system performance. detachBlock in DatanodeBlockInfo performs this detach operation on a block's data file and metadata file, as sketched below.
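A sketch of the copy step behind detach (modeled on DatanodeBlockInfo.detachFile; createDetachFile is assumed here to create an empty staging file under detach/):

// Copy the shared, hard-linked file into detach/, verify the copy, then
// rename the private copy back over the original name, which breaks the
// hard link with the snapshot.
private void detachFile(File file, Block b) throws IOException {
  File tmpFile = volume.createDetachFile(b, file.getName());
  IOUtils.copyBytes(new FileInputStream(file),
                    new FileOutputStream(tmpFile), 16 * 1024, true);
  if (file.length() != tmpFile.length()) {
    throw new IOException("Copy of file " + file + " into " + tmpFile
        + " changed its size from " + file.length() + " to " + tmpFile.length());
  }
  if (!tmpFile.renameTo(file)) {
    throw new IOException("Unable to rename " + tmpFile + " to " + file);
  }
}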
(3) DataStorage
DataStorage manages all local storage paths in a unified way, i.e. it manages the StorageDirectory objects; it does not manage the concrete data files inside those paths.
(4) FSVolumeSet
FSVolumeSet manages all FSVolume objects, which in practice means managing all the storage paths. Its main job for the upper layer (the DataNode process) is to choose a storage path (partition) for a data block, i.e. to create the local disk file for that block (see the getNextVolume sketch below); it is also responsible for collecting statistics on storage space usage and for gathering all block information.
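Volume selection is a simple round-robin over the configured storage directories; a sketch of getNextVolume (close to, but not verbatim, the source):

// Advance a cursor over the volumes and return the first one with enough
// free space for the new block; fail once every volume has been tried.
synchronized FSVolume getNextVolume(long blockSize) throws IOException {
  int startVolume = curVolume;
  while (true) {
    FSVolume volume = volumes[curVolume];
    curVolume = (curVolume + 1) % volumes.length;   // round-robin cursor
    if (volume.getAvailable() > blockSize) {
      return volume;
    }
    if (curVolume == startVolume) {                 // wrapped around: all full
      throw new DiskOutOfSpaceException("Insufficient space for an additional block");
    }
  }
}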
(5) FSVolume
FSVolume manages block files and tracks the usage of its storage directory.
private File currentDir;          // the current/ directory
private File blocksBeingWritten;  // clients write here
private FSDir dataDir;            // final home of valid data blocks (current/)
private File tmpDir;              // intermediate home of data blocks being written (tmp/)
private File detachDir;           // copy-on-write staging for data blocks (detach/)
private DF usage;                 // reports space on the disk partition holding this directory
private DU dfsUsage;              // reports space consumed by this storage directory
private long reserved;            // reserved storage space in bytes
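reserved, DF and DU combine when the volume reports its free space; sketched from FSVolume (close to, but not verbatim, the source):

// Capacity excludes the reserved bytes. The space available for new blocks
// is bounded both by capacity minus what DFS already uses (DU) and by what
// the partition actually has free (DF).
long getCapacity() throws IOException {
  if (reserved > usage.getCapacity()) {
    return 0;
  }
  return usage.getCapacity() - reserved;
}

long getAvailable() throws IOException {
  long remaining = getCapacity() - getDfsUsed();  // getDfsUsed() reads dfsUsage (DU)
  long available = usage.getAvailable();          // free space on the partition (DF)
  if (remaining > available) {
    remaining = available;
  }
  return (remaining > 0) ? remaining : 0;
}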
(6) FSDir
An FSDir corresponds to one directory in HDFS storage; the directory holds block data files and their meta files. By default each directory has at most 64 subdirectories and can store at most 64 blocks. When a directory is initialized, the directories and files beneath it are scanned recursively, producing a tree structure. When a data block arrives at the DataNode, the DataNode does not immediately pick a final home for it under current/; instead it first stores the block under the storage path's tmp/ subdirectory, and only once the block has been received successfully is it moved into the appropriate directory under current/. The DataNode first stores blocks directly in the current/ subdirectory of the storage path; once current/ holds maxBlocksPerDir blocks, it creates maxBlocksPerDir subdirectories under current/ and picks one of them to store the next block. If the chosen subdirectory has also filled up with maxBlocksPerDir blocks, maxBlocksPerDir subdirectories are created beneath it and one of those is chosen, and so on recursively, until the storage path's remaining space can no longer hold a block. maxBlocksPerDir defaults to 64 and can be set in the DataNode's configuration via the option dfs.datanode.numblocks. A sketch of the placement logic follows the field list below.
File dir;             // this FSDir's directory (current/ or a subdirNN under it)
int numBlocks = 0;    // number of data blocks currently stored directly in this directory
FSDir children[];     // subdirectories of this directory
int lastChildIdx = 0; // index of the child that stored the previous data block
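A simplified sketch of the placement logic (modeled on FSDir.addBlock; the real method also retries the last-used child before recursing):

// Place the block in this directory if there is room; otherwise lazily
// create the subdirNN children and recurse into one of them.
private File addBlock(Block b, File src) throws IOException {
  if (numBlocks < maxBlocksPerDir) {
    // room here: move the data file and its meta file in, side by side
    File dest = new File(dir, b.getBlockName());
    File oldmeta = FSDataset.getMetaFile(src, b);
    File newmeta = FSDataset.getMetaFile(dest, b);
    if (!oldmeta.renameTo(newmeta) || !src.renameTo(dest)) {
      throw new IOException("could not move files for " + b + " into " + dir);
    }
    numBlocks += 1;
    return dest;
  }
  if (children == null) {
    // this level is full: create maxBlocksPerDir subdirectories once
    children = new FSDir[maxBlocksPerDir];
    for (int idx = 0; idx < maxBlocksPerDir; idx++) {
      children[idx] = new FSDir(new File(dir, DataStorage.BLOCK_SUBDIR_PREFIX + idx));
    }
  }
  // round-robin over the children and recurse
  lastChildIdx = (lastChildIdx + 1) % children.length;
  return children[lastChildIdx].addBlock(b, src);
}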
(7) BlockAndFile ActiveFile
BlockAndFile: a block's information together with its file. ActiveFile: a file that is currently being written, as sketched below.
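A sketch of ActiveFile (close to, but not verbatim, the source):

// An entry in ongoingCreates: the temporary file plus the threads writing
// it. Besides any threads inherited from a recovery attempt, the thread
// that opens the block registers itself.
static class ActiveFile {
  final File file;
  final List<Thread> threads = new ArrayList<Thread>(2);

  ActiveFile(File f, List<Thread> list) {
    file = f;
    if (list != null && !list.isEmpty()) {
      threads.addAll(list);
    }
    threads.add(Thread.currentThread());
  }
}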
(8) DatanodeBlockInfo
DatanodeBlockInfo stores a Block's location in the local file system: the volume (FSVolume) the block is stored on, its file name, and its detach state.
private FSVolume volume; // volume where the block belongs
private File file; // block file
private boolean detached; // copy-on-write done for block