1. HDFS元数据写入机制剖析
1.1 HDFS元数据如何写入内存?
我们接着上一篇博客的
mkdirs
之后到底发生了什么?之后我们目录已经创建到了我们的目录树上,那么接下来,我们是不是需要看一下元数据是如何写入磁盘的呢
1.1.1 createSingleDirectory
private static INodesInPath createSingleDirectory(FSDirectory fsd,
INodesInPath existing, String localName, PermissionStatus perm)
throws IOException {
assert fsd.hasWriteLock();
//TODO 更新目录树,这颗目录树是存在内存中的
//更新内存里面的数据
existing = unprotectedMkdir(fsd, fsd.allocateNewInodeId(), existing,
localName.getBytes(Charsets.UTF_8), perm, null, now());
if (existing == null) {
return null;
}
final INode newNode = existing.getLastINode();
// Directory creation also count towards FilesCreated
// to match count of FilesDeleted metric.
NameNode.getNameNodeMetrics().incrFilesCreated();
String cur = existing.getPath();
//TODO 把元数据信息记录到磁盘上(但是一开始先写到内存)
//往磁盘上面记录元数据日志
fsd.getEditLog().logMkDir(cur, newNode);
if (NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("mkdirs: created directory " + cur);
}
return existing;
}
1.1.2 fsd.getEditLog().logMkDir(cur, newNode)
/**
* Add create directory record to edit log
*/
public void logMkDir(String path, INode newNode) {
PermissionStatus permissions = newNode.getPermissionStatus();
//TODO 创建日志对象 建造者模式
MkdirOp op = MkdirOp.getInstance(cache.get())
.setInodeId(newNode.getId())
.setPath(path)
.setTimestamp(newNode.getModificationTime())
.setPermissionStatus(permissions);
AclFeature f = newNode.getAclFeature();
if (f != null) {
op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
}
XAttrFeature x = newNode.getXAttrFeature();
if (x != null) {
op.setXAttrs(x.getXAttrs());
}
//TODO 记录日志
logEdit(op);
}
1.1.3 logEdit
/**
* Write an operation to the edit log. Do not sync to persistent
* store yet.
*/
void logEdit(final FSEditLogOp op) {
//来了就加锁
synchronized (this) {
assert isOpenForWrite() :
"bad state: " + state;
// wait if an automatic sync is scheduled
//一开始不需要等待
waitIfAutoSyncScheduled();
//TODO 步骤一:获取当前的独一无二的事务ID
long start = beginTransaction();
op.setTransactionId(txid);
try {
/**
* 1. namenode editlog 文件缓冲里面
* 2. journalnode的内存缓冲
* JournalSetOutputStream
*/
//TODO QuormJournalManager QuormOutputStream
//TODO FileJournalManager EditLogFileOutputStream
//TODO 步骤二: 把元数据接入到内存缓冲
editLogStream.write(op);
} catch (IOException ex) {
// All journals failed, it is handled in logSync.
} finally {
op.reset();
}
endTransaction(start);
// check if it is time to schedule an automatic sync
if (!shouldForceSync()) {
return;
}
isAutoSyncScheduled = true;
}
// sync buffered edit log entries to persistent store
//TODO 写元数据到磁盘
logSync();
}
1.1.4 QuorumOutputStream的write
@Override
public void write(FSEditLogOp op) throws IOException {
buf.writeOp(op);
}
1.1.5 EditsDoubleBuffer的writeOp
public void writeOp(FSEditLogOp op) throws IOException {
if (firstTxId == HdfsConstants.INVALID_TXID) {
firstTxId = op.txid;
} else {
assert op.txid > firstTxId;
}
writer.writeOp(op);
numTxns++;
}
1.1.5 EditsDoubleBuffer的重要属性
//第一个缓冲
private TxnBuffer bufCurrent; // current buffer for writing
//第二个缓冲
private TxnBuffer bufReady; // buffer ready for flushing
1.1.6 FSEditLogOp的writeOp
/**
* Write an operation to the output stream
*
* @param op The operation to write
* @throws IOException if an error occurs during writing.
*/
public void writeOp(FSEditLogOp op) throws IOException {
int start = buf.getLength();
// write the op code first to make padding and terminator verification
// work
buf.writeByte(op.opCode.getOpCode());
buf.writeInt(0); // write 0 for the length first
buf.writeLong(op.txid);
op.writeFields(buf);
int end = buf.getLength();
// write the length back: content of the op + 4 bytes checksum - op_code
int length = end - start - 1;
buf.writeInt(length, start + 1);
checksum.reset();
checksum.update(buf.getData(), start, end-start);
int sum = (int)checksum.getValue();
buf.writeInt(sum);
}
/** 其实调用的就是NIO写入内存**/
1.1.7 FSEditLogOp的writeOp(和JournalOutputStream是一样的)
/**
* Write an operation to the output stream
*
* @param op The operation to write
* @throws IOException if an error occurs during writing.
*/
public void writeOp(FSEditLogOp op) throws IOException {
int start = buf.getLength();
// write the op code first to make padding and terminator verification
// work
buf.writeByte(op.opCode.getOpCode());
buf.writeInt(0); // write 0 for the length first
buf.writeLong(op.txid);
op.writeFields(buf);
int end = buf.getLength();
// write the length back: content of the op + 4 bytes checksum - op_code
int length = end - start - 1;
buf.writeInt(length, start + 1);
checksum.reset();
checksum.update(buf.getData(), start, end-start);
int sum = (int)checksum.getValue();
buf.writeInt(sum);
}
}
1.1.8 双缓冲交换内存
public void setReadyToFlush() {
assert isFlushed() : "previous data not flushed yet";
TxnBuffer tmp = bufReady;
bufReady = bufCurrent;
bufCurrent = tmp;
}
1.2 HDFS元数据如何写入磁盘?
1.2.1 EditLogFileOutputStream的写入磁盘
/**
* Writes <code>len</code> bytes from the specified byte array
* starting at offset <code>off</code> to this output stream.
* The general contract for <code>write(b, off, len)</code> is that
* some of the bytes in the array <code>b</code> are written to the
* output stream in order; element <code>b[off]</code> is the first
* byte written and <code>b[off+len-1]</code> is the last byte written
* by this operation.
* <p>
* The <code>write</code> method of <code>OutputStream</code> calls
* the write method of one argument on each of the bytes to be
* written out. Subclasses are encouraged to override this method and
* provide a more efficient implementation.
* <p>
* If <code>b</code> is <code>null</code>, a
* <code>NullPointerException</code> is thrown.
* <p>
* If <code>off</code> is negative, or <code>len</code> is negative, or
* <code>off+len</code> is greater than the length of the array
* <code>b</code>, then an <tt>IndexOutOfBoundsException</tt> is thrown.
*
* @param b the data.
* @param off the start offset in the data.
* @param len the number of bytes to write.
* @exception IOException if an I/O error occurs. In particular,
* an <code>IOException</code> is thrown if the output
* stream is closed.
*/
public void write(byte b[], int off, int len) throws IOException {
if (b == null) {
throw new NullPointerException();
} else if ((off < 0) || (off > b.length) || (len < 0) ||
((off + len) > b.length) || ((off + len) < 0)) {
throw new IndexOutOfBoundsException();
} else if (len == 0) {
return;
}
for (int i = 0 ; i < len ; i++) {
write(b[off + i]);
}
}
1.2.2 EditLogFileOutputStream的flushAndSync
protected void flushAndSync(boolean durable) throws IOException {
int numReadyBytes = buf.countReadyBytes();
if (numReadyBytes > 0) {
int numReadyTxns = buf.countReadyTxns();
long firstTxToFlush = buf.getFirstReadyTxId();
assert numReadyTxns > 0;
// Copy from our double-buffer into a new byte array. This is for
// two reasons:
// 1) The IPC code has no way of specifying to send only a slice of
// a larger array.
// 2) because the calls to the underlying nodes are asynchronous, we
// need a defensive copy to avoid accidentally mutating the buffer
// before it is sent.
DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
buf.flushTo(bufToSend);
assert bufToSend.getLength() == numReadyBytes;
byte[] data = bufToSend.getData();
assert data.length == bufToSend.getLength();
//把数据写入到journalnode中磁盘上面
QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
segmentTxId, firstTxToFlush,
numReadyTxns, data);
//TODO 这是一个阻塞的方法等待写入到journalnode集群的处理结果
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
}
}
1.2.3 sendEdits
public QuorumCall<AsyncLogger, Void> sendEdits(
long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
for (AsyncLogger logger : loggers) {
//TODO 遍历所有的loggers
// 每个AsyncLogger对象代表就是一个journalnode
ListenableFuture<Void> future =
// 往journal发送数据
logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
calls.put(logger, future);
}
return QuorumCall.create(calls);
}
1.2.4 logger.sendEdits
@Override
public ListenableFuture<Void> sendEdits(
final long segmentTxId, final long firstTxnId,
final int numTxns, final byte[] data) {
try {
reserveQueueSpace(data.length);
} catch (LoggerTooFarBehindException e) {
return Futures.immediateFailedFuture(e);
}
// When this batch is acked, we use its submission time in order
// to calculate how far we are lagging.
final long submitNanos = System.nanoTime();
ListenableFuture<Void> ret = null;
try {
ret = singleThreadExecutor.submit(new Callable<Void>() {
@Override
public Void call() throws IOException {
throwIfOutOfSync();
long rpcSendTimeNanos = System.nanoTime();
try {
//TODO 获取Journalnode的代理
getProxy().journal(createReqInfo(),
segmentTxId, firstTxnId, numTxns, data);
} catch (IOException e) {
QuorumJournalManager.LOG.warn(
"Remote journal " + IPCLoggerChannel.this + " failed to " +
"write txns " + firstTxnId + "-" + (firstTxnId + numTxns - 1) +
". Will try to write to this JN again after the next " +
"log roll.", e);
synchronized (IPCLoggerChannel.this) {
outOfSync = true;
}
throw e;
} finally {
long now = System.nanoTime();
long rpcTime = TimeUnit.MICROSECONDS.convert(
now - rpcSendTimeNanos, TimeUnit.NANOSECONDS);
long endToEndTime = TimeUnit.MICROSECONDS.convert(
now - submitNanos, TimeUnit.NANOSECONDS);
metrics.addWriteEndToEndLatency(endToEndTime);
metrics.addWriteRpcLatency(rpcTime);
if (rpcTime / 1000 > WARN_JOURNAL_MILLIS_THRESHOLD) {
QuorumJournalManager.LOG.warn(
"Took " + (rpcTime / 1000) + "ms to send a batch of " +
numTxns + " edits (" + data.length + " bytes) to " +
"remote journal " + IPCLoggerChannel.this);
}
}
synchronized (IPCLoggerChannel.this) {
highestAckedTxId = firstTxnId + numTxns - 1;
lastAckNanos = submitNanos;
}
return null;
}
});
} finally {
if (ret == null) {
// it didn't successfully get submitted,
// so adjust the queue size back down.
unreserveQueueSpace(data.length);
} else {
// It was submitted to the queue, so adjust the length
// once the call completes, regardless of whether it
// succeeds or fails.
Futures.addCallback(ret, new FutureCallback<Void>() {
@Override
public void onFailure(Throwable t) {
unreserveQueueSpace(data.length);
}
@Override
public void onSuccess(Void t) {
unreserveQueueSpace(data.length);
}
});
}
}
return ret;
}
1.2.6 JournalNodeRpcServer的journal方法(写入到Journalnode磁盘)
/**
* Write a batch of edits to the journal.
* {@see QJournalProtocol#journal(RequestInfo, long, long, int, byte[])}
*/
synchronized void journal(RequestInfo reqInfo,
long segmentTxId, long firstTxnId,
int numTxns, byte[] records) throws IOException {
checkFormatted();
checkWriteRequest(reqInfo);
checkSync(curSegment != null,
"Can't write, no segment open");
if (curSegmentTxId != segmentTxId) {
// Sanity check: it is possible that the writer will fail IPCs
// on both the finalize() and then the start() of the next segment.
// This could cause us to continue writing to an old segment
// instead of rolling to a new one, which breaks one of the
// invariants in the design. If it happens, abort the segment
// and throw an exception.
JournalOutOfSyncException e = new JournalOutOfSyncException(
"Writer out of sync: it thinks it is writing segment " + segmentTxId
+ " but current segment is " + curSegmentTxId);
abortCurSegment();
throw e;
}
checkSync(nextTxId == firstTxnId,
"Can't write txid " + firstTxnId + " expecting nextTxId=" + nextTxId);
long lastTxnId = firstTxnId + numTxns - 1;
if (LOG.isTraceEnabled()) {
LOG.trace("Writing txid " + firstTxnId + "-" + lastTxnId);
}
// If the edit has already been marked as committed, we know
// it has been fsynced on a quorum of other nodes, and we are
// "catching up" with the rest. Hence we do not need to fsync.
boolean isLagging = lastTxnId <= committedTxnId.get();
boolean shouldFsync = !isLagging;
curSegment.writeRaw(records, 0, records.length);
curSegment.setReadyToFlush();
StopWatch sw = new StopWatch();
sw.start();
curSegment.flush(shouldFsync);
sw.stop();
long nanoSeconds = sw.now();
metrics.addSync(
TimeUnit.MICROSECONDS.convert(nanoSeconds, TimeUnit.NANOSECONDS));
long milliSeconds = TimeUnit.MILLISECONDS.convert(
nanoSeconds, TimeUnit.NANOSECONDS);
if (milliSeconds > WARN_SYNC_MILLIS_THRESHOLD) {
LOG.warn("Sync of transaction range " + firstTxnId + "-" + lastTxnId +
" took " + milliSeconds + "ms");
}
if (isLagging) {
// This batch of edits has already been committed on a quorum of other
// nodes. So, we are in "catch up" mode. This gets its own metric.
metrics.batchesWrittenWhileLagging.incr(1);
}
metrics.batchesWritten.incr(1);
metrics.bytesWritten.incr(records.length);
metrics.txnsWritten.incr(numTxns);
highestWrittenTxId = lastTxnId;
nextTxId = lastTxnId + 1;
}
1.3 写入元数据总结
- 我们在编码过程中首先会创建一个FileSystem的一个方法来进行创建我们的文件,其实他底层调用的是我们的NameNodeRpcServer代理对象的mkdirs方法
- 通过我们的NameNodeRpcServer中的FSNamesyem管理元数据的对象,获取目录树,将我们的要创建的目录节点依次封装为INode(INodeDirectory,INodeFile)对象,然后添加到我们的目录树上,也就是更新我们的目录树,此时我们通过50070端口就可以进行查看了
- 然后会调用EditLog对象进行元数据的刷写,这里是整个HDFS代码写入的精华部分,采用了分段加锁双缓冲机制,分为两条直线,其实是一个方法实现的,
- 写入磁盘是首先写入内存,内存满了或者达到一定条件之后进行双缓冲区的交换,将我们的buffReady采用NIO的方式写入到磁盘上面,也就是我们看到的hdfs元数据文件,带有很多数字
- 写入Journalnode也是采用的双缓冲机制,但是刷写的时候有一些不同,因为是进程间通信,需要通过Rpc或者Http这里采用的是Rpc,首先是获取到我们的AsyncL oggerSet集合,这个时候采用的是异步写入,调用的JournalnodeRpcServer的journal方法进行刷写到journalnode的磁盘