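This article uses directory creation as an example to trace how Hadoop manages file and directory metadata. We start with a small Java example that operates on HDFS.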
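When we use the Hadoop Java API to operate on HDFS, we usually go through the FileSystem class:
// Create the Configuration object
Configuration conf=new Configuration();
// Create the FileSystem object for the URI given on the command line
FileSystem fs=FileSystem.get(URI.create(args[0]),conf);
// Create the directory (missing parent directories are created as well)
boolean created=fs.mkdirs(new Path(args[0]));
FileSystem is an abstract class. Its concrete implementations include LocalFileSystem, which wraps the local file system, and DistributedFileSystem, which talks to HDFS; the latter is the one we need to analyze. We follow the creation of a directory through the whole flow, starting from mkdir in DistributedFileSystem: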
public boolean mkdir(Path f, FsPermission permission) throws IOException {
return mkdirsInternal(f, permission, false);
}
Step into mkdirsInternal:
private boolean mkdirsInternal(Path f, final FsPermission permission,
final boolean createParent) throws IOException {
statistics.incrementWriteOps(1);
Path absF = fixRelativePart(f);
return new FileSystemLinkResolver<Boolean>() {
@Override
public Boolean doCall(final Path p)
throws IOException, UnresolvedLinkException {
// The call that actually creates the directory
return dfs.mkdirs(getPathName(p), permission, createParent);
}
@Override
public Boolean next(final FileSystem fs, final Path p)
throws IOException {
// FileSystem doesn't have a non-recursive mkdir() method
// Best we can do is error out
if (!createParent) {
throw new IOException("FileSystem does not support non-recursive"
+ "mkdir");
}
return fs.mkdirs(p, permission);
}
}.resolve(this, absF);
}
This calls dfs.mkdirs(getPathName(p), permission, createParent), where dfs is a DFSClient object. Its class comment explains its role:
/********************************************************
* DFSClient can connect to a Hadoop Filesystem and
* perform basic file tasks. It uses the ClientProtocol
* to communicate with a NameNode daemon, and connects
* directly to DataNodes to read/write block data.
*
* Hadoop DFS users should obtain an instance of
* DistributedFileSystem, which uses DFSClient to handle
* filesystem tasks.
*
********************************************************/
@InterfaceAudience.Private
public class DFSClient implements java.io.Closeable, RemotePeerFactory,
DataEncryptionKeyFactory {
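DFSClient is the low-level client we use to communicate with the NameNode (NN) and the DataNodes (DN): it is the RPC client of the NameNode's RPC server and connects directly to DataNodes for block data. When we connect to a Hadoop cluster, the FileSystem we hold (the DistributedFileSystem implementation) holds a DFSClient, and all communication with the NameNode or DataNodes goes through it. The chain so far is: application code, then DistributedFileSystem, then DFSClient, then the NameNode and DataNodes.
Next, the mkdirs method of DFSClient: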
public boolean mkdirs(String src, FsPermission permission,
boolean createParent) throws IOException {
if (permission == null) {
permission = FsPermission.getDefault();
}
FsPermission masked = permission.applyUMask(dfsClientConf.uMask);
return primitiveMkdir(src, masked, createParent);
}
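The masking step in mkdirs can be illustrated with plain octal arithmetic. The following is a minimal sketch, not Hadoop's FsPermission code: UmaskSketch and applyUMask are hypothetical names, and permissions are modelled as integers on the assumption that applying a umask means clearing every bit that is set in the mask.

```java
// Simplified model of applying a umask to a requested permission.
// UmaskSketch and applyUMask are hypothetical names, not Hadoop classes.
class UmaskSketch {
    /** Clears every permission bit that is set in the umask. */
    static int applyUMask(int requested, int umask) {
        return requested & ~umask & 0777;
    }

    public static void main(String[] args) {
        // Requesting rwxrwxrwx with umask 022 leaves rwxr-xr-x.
        int masked = applyUMask(0777, 022);
        System.out.println(Integer.toOctalString(masked)); // prints 755
    }
}
```

With a umask of 022, a request for 777 ends up as 755, which is the value that would then be passed on to primitiveMkdir.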
primitiveMkdir:
public boolean primitiveMkdir(String src, FsPermission absPermission,
boolean createParent)
throws IOException {
checkOpen();
if (absPermission == null) {
absPermission =
FsPermission.getDefault().applyUMask(dfsClientConf.uMask);
}
if(LOG.isDebugEnabled()) {
LOG.debug(src + ": masked=" + absPermission);
}
try {
// namenode is a ClientProtocol proxy, i.e. an RPC client stub; this call invokes the matching method on the NameNodeRpcServer side
return namenode.mkdirs(src, absPermission, createParent);
} catch(RemoteException re) {
throw re.unwrapRemoteException(AccessControlException.class,
InvalidPathException.class,
FileAlreadyExistsException.class,
FileNotFoundException.class,
ParentNotDirectoryException.class,
SafeModeException.class,
NSQuotaExceededException.class,
DSQuotaExceededException.class,
UnresolvedPathException.class,
SnapshotAccessControlException.class);
}
}
Calling this method on the ClientProtocol interface triggers an RPC over the network, which invokes the corresponding method on the NameNode's RPC server, NameNodeRpcServer.
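To make the idea of "calling an interface method that actually runs on the server" concrete, here is a minimal sketch built on a JDK dynamic proxy. It is an illustration only: MkdirProtocol, FakeServer and clientFor are hypothetical names, and the handler invokes the server object directly in the same JVM, whereas a real RPC engine would serialize the call and send it over a socket.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class RpcProxySketch {
    /** Hypothetical stand-in for the protocol interface shared by client and server. */
    interface MkdirProtocol {
        boolean mkdirs(String src, boolean createParent);
    }

    /** Hypothetical server side: records the directories it was asked to create. */
    static class FakeServer implements MkdirProtocol {
        final List<String> created = new ArrayList<>();

        @Override
        public boolean mkdirs(String src, boolean createParent) {
            created.add(src);
            return true;
        }
    }

    /** Builds a client stub whose method calls are forwarded to the server object. */
    static MkdirProtocol clientFor(MkdirProtocol server) {
        // A real RPC engine would serialize method name and arguments here.
        InvocationHandler handler = (proxy, method, args) -> method.invoke(server, args);
        return (MkdirProtocol) Proxy.newProxyInstance(
            MkdirProtocol.class.getClassLoader(),
            new Class<?>[] {MkdirProtocol.class},
            handler);
    }

    public static void main(String[] args) {
        FakeServer server = new FakeServer();
        MkdirProtocol namenode = clientFor(server);
        System.out.println(namenode.mkdirs("/a/b", true));
        System.out.println(server.created);
    }
}
```

The caller only sees the interface, which is why DFSClient can treat namenode as an ordinary object even though the work happens elsewhere.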
On the server side, the RPC server running on the NameNode keeps listening for client requests. When it receives the RPC request to create a directory, it calls the corresponding method. Here is the mkdirs method of NameNodeRpcServer:
public boolean mkdirs(String src, FsPermission masked, boolean createParent)
throws IOException {
checkNNStartup();
if(stateChangeLog.isDebugEnabled()) {
stateChangeLog.debug("*DIR* NameNode.mkdirs: " + src);
}
if (!checkPathLength(src)) {
throw new IOException("mkdirs: Pathname too long. Limit "
+ MAX_PATH_LENGTH + " characters, " + MAX_PATH_DEPTH + " levels.");
}
return namesystem.mkdirs(src,
new PermissionStatus(getRemoteUser().getShortUserName(),
null, masked), createParent);
}
The RPC server holds a namesystem and calls its mkdirs method. This namesystem is the FSNamesystem class we analyzed earlier during NameNode startup; it maintains all of the NameNode's metadata. Step into namesystem.mkdirs:
boolean mkdirs(String src, PermissionStatus permissions,
boolean createParent) throws IOException, UnresolvedLinkException {
boolean ret = false;
try {
// Create the directory
ret = mkdirsInt(src, permissions, createParent);
} catch (AccessControlException e) {
logAuditEvent(false, "mkdirs", src);
throw e;
}
return ret;
}
Step into mkdirsInt:
private boolean mkdirsInt(final String srcArg, PermissionStatus permissions,
boolean createParent) throws IOException, UnresolvedLinkException {
String src = srcArg;
if(NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("DIR* NameSystem.mkdirs: " + src);
}
if (!DFSUtil.isValidName(src)) {
throw new InvalidPathException(src);
}
FSPermissionChecker pc = getPermissionChecker();
checkOperation(OperationCategory.WRITE);
// Split the path on '/' into an array of byte[] components
byte[][] pathComponents = FSDirectory.getPathComponentsForReservedPath(src);
HdfsFileStatus resultingStat = null;
boolean status = false;
// Take the namesystem write lock
writeLock();
try {
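// Check that this NameNode's HA state allows WRITE operations (a standby rejects them)
checkOperation(OperationCategory.WRITE);
// Check whether the NameNode is in safe mode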
checkNameNodeSafeMode("Cannot create directory " + src);
src = resolvePath(src, pathComponents);
// Entry point for the actual directory creation
status = mkdirsInternal(pc, src, permissions, createParent);
if (status) {
resultingStat = getAuditFileInfo(src, false);
}
} finally {
// Release the lock
writeUnlock();
}
// Sync the buffered edit log entries to persistent storage, outside the lock
getEditLog().logSync();
if (status) {
logAuditEvent(true, "mkdirs", srcArg, null, resultingStat);
}
return status;
}
Two lines here matter most:
1. status = mkdirsInternal(pc, src, permissions, createParent);
2. getEditLog().logSync();
Start with the first, which continues the directory creation:
private boolean mkdirsInternal(FSPermissionChecker pc, String src, PermissionStatus permissions, boolean createParent)
throws IOException, UnresolvedLinkException {
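// Assert that the current thread holds the write lock
assert hasWriteLock();
// Permission check: the caller must be able to traverse the path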
if (isPermissionEnabled) {
checkTraverse(pc, src);
}
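// dir is the FSDirectory, which holds the file/directory tree
// isDirMutable returns true if src already exists as a (mutable) directory, so there is nothing left to create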
if (dir.isDirMutable(src)) {
// all the users of mkdirs() are used to expect 'true' even if a new directory is not created.
return true;
}
if (isPermissionEnabled) {
checkAncestorAccess(pc, src, FsAction.WRITE);
}
if (!createParent) {
verifyParentDir(src);
}
// validate that we have enough inodes. This is, at best, a
// heuristic because the mkdirs() operation might need to
// create multiple inodes.
// Check that the inode count limit is not exceeded (a heuristic, as noted above)
checkFsObjectLimit();
// mkdirsRecursively does the actual creation
if (!mkdirsRecursively(src, permissions, false, now())) {
throw new IOException("Failed to create directory: " + src);
}
return true;
}
The most important call here is mkdirsRecursively:
// Create the directory; parent directories that do not exist are created as well
private boolean mkdirsRecursively(String src, PermissionStatus permissions,
boolean inheritPermission, long now)
throws FileAlreadyExistsException, QuotaExceededException,
UnresolvedLinkException, SnapshotAccessControlException,
AclException {
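// Remove a trailing '/' from the path, if any
src = FSDirectory.normalizePath(src);
// Split the path into an array of byte[] components
byte[][] components = INode.getPathComponents(src);
final int lastInodeIndex = components.length - 1;
// Take the FSDirectory write lock
dir.writeLock();
try {
// Look up the inodes that already exist along the path. For example, when creating /a/a/c/v/c
// while /a/a/c already exists, only the remaining directories need to be created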
INodesInPath iip = dir.getExistingPathINodes(components);
if (iip.isSnapshot()) {
throw new SnapshotAccessControlException(
"Modification on RO snapshot is disallowed");
}
// Inodes along the path; entries for components that do not exist yet are null
INode[] inodes = iip.getINodes();
// find the index of the first null in inodes[]
StringBuilder pathbuilder = new StringBuilder();
int i = 1;
// This loop appends the components that already exist to pathbuilder
for(; i < inodes.length && inodes[i] != null; i++) {
pathbuilder.append(Path.SEPARATOR).
append(DFSUtil.bytes2String(components[i]));
if (!inodes[i].isDirectory()) {
throw new FileAlreadyExistsException(
"Parent path is not a directory: "
+ pathbuilder + " "+inodes[i].getLocalName());
}
}
// default to creating parent dirs with the given perms
PermissionStatus parentPermissions = permissions;
// if not inheriting and it's the last inode, there's no use in
// computing perms that won't be used
if (inheritPermission || (i < lastInodeIndex)) {
// if inheriting (ie. creating a file or symlink), use the parent dir,
// else the supplied permissions
// NOTE: the permissions of the auto-created directories violate posix
FsPermission parentFsPerm = inheritPermission
? inodes[i-1].getFsPermission() : permissions.getPermission();
// ensure that the permissions allow user write+execute
if (!parentFsPerm.getUserAction().implies(FsAction.WRITE_EXECUTE)) {
parentFsPerm = new FsPermission(
parentFsPerm.getUserAction().or(FsAction.WRITE_EXECUTE),
parentFsPerm.getGroupAction(),
parentFsPerm.getOtherAction()
);
}
if (!parentPermissions.getPermission().equals(parentFsPerm)) {
parentPermissions = new PermissionStatus(
parentPermissions.getUserName(),
parentPermissions.getGroupName(),
parentFsPerm
);
// when inheriting, use same perms for entire path
if (inheritPermission) permissions = parentPermissions;
}
}
// This loop appends the remaining (missing) components to pathbuilder and creates them
// create directories beginning from the first null index
for(; i < inodes.length; i++) {
pathbuilder.append(Path.SEPARATOR).
append(DFSUtil.bytes2String(components[i]));
// Create the directory inode and attach it to the tree
dir.unprotectedMkdir(allocateNewInodeId(), iip, i, components[i],
(i < lastInodeIndex) ? parentPermissions : permissions, null,
now);
if (inodes[i] == null) {
return false;
}
// Directory creation also count towards FilesCreated
// to match count of FilesDeleted metric.
NameNode.getNameNodeMetrics().incrFilesCreated();
final String cur = pathbuilder.toString();
// Record a mkdir op in the edit log
// (buffered in memory here; it reaches disk when the log is synced)
getEditLog().logMkDir(cur, inodes[i]);
if(NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug(
"mkdirs: created directory " + cur);
}
}
} finally {
dir.writeUnlock();
}
return true;
}
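The core idea of mkdirsRecursively, walking the existing prefix of the path and creating only what is missing, can be sketched with a tiny in-memory tree. This is a simplified model under stated assumptions: DirTreeSketch and Node are hypothetical names, and permissions, quotas, snapshots, locking and the edit log are all left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DirTreeSketch {
    /** Hypothetical in-memory directory node; children are keyed by name. */
    static class Node {
        final Map<String, Node> children = new HashMap<>();
    }

    final Node root = new Node();

    /** Creates every missing directory on the path and returns the paths that were created. */
    List<String> mkdirs(String src) {
        List<String> created = new ArrayList<>();
        Node cur = root;
        StringBuilder path = new StringBuilder();
        for (String name : src.split("/")) {
            if (name.isEmpty()) {
                continue; // skip the empty component before the leading slash
            }
            path.append('/').append(name);
            Node next = cur.children.get(name);
            if (next == null) {
                // Missing component: create it and attach it to its parent.
                next = new Node();
                cur.children.put(name, next);
                created.add(path.toString());
            }
            cur = next;
        }
        return created;
    }

    public static void main(String[] args) {
        DirTreeSketch tree = new DirTreeSketch();
        System.out.println(tree.mkdirs("/a/a/c"));     // [/a, /a/a, /a/a/c]
        System.out.println(tree.mkdirs("/a/a/c/v/c")); // [/a/a/c/v, /a/a/c/v/c]
    }
}
```

The second call reuses the existing /a/a/c prefix and only creates the last two directories, matching the example in the code comment above.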
This calls unprotectedMkdir on dir, which is an FSDirectory object. FSDirectory represents the entire file/directory tree maintained by the NameNode:
void unprotectedMkdir(long inodeId, INodesInPath inodesInPath,
int pos, byte[] name, PermissionStatus permission,
List<AclEntry> aclEntries, long timestamp)
throws QuotaExceededException, AclException {
assert hasWriteLock();
// Build an INodeDirectory object representing the new directory
final INodeDirectory dir = new INodeDirectory(inodeId, name, permission,
timestamp);
// Attach the INodeDirectory to the directory tree
if (addChild(inodesInPath, pos, dir, true)) {
if (aclEntries != null) {
AclStorage.updateINodeAcl(dir, aclEntries, Snapshot.CURRENT_STATE_ID);
}
inodesInPath.setINode(pos, dir);
}
}
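This creates each node that does not exist yet and attaches it to the directory tree. In the summary diagram, the red nodes are the ones newly attached to the tree.
After a directory is created, mkdirsRecursively also calls getEditLog().logMkDir(cur, inodes[i]). getEditLog returns an FSEditLog object, which maintains the log of namespace changes. Every metadata update is assigned a transaction id; one mkdir, for example, corresponds to one transaction id. The /editslog directory contains edit log and fsimage files in the following format. The first file, edits_00000001-00000005.log, holds the operations with transaction ids 1 to 5, and edits_00000006-00000009.log holds transaction ids 6 to 9, so the log is stored in segments. edits_inprogress_0000000455 is the segment currently being written, and it starts at transaction id 455. The transaction id in an fsimage file name indicates up to which transaction the edits have already been merged into that image.
The logMkDir method of FSEditLog is then called to record a mkdir entry in the edit log: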
public void logMkDir(String path, INode newNode) {
PermissionStatus permissions = newNode.getPermissionStatus();
// Reuse a cached MkdirOp and fill it in with chained setters
MkdirOp op = MkdirOp.getInstance(cache.get())
.reset()
.setInodeId(newNode.getId())
.setPath(path)
.setTimestamp(newNode.getModificationTime())
.setPermissionStatus(permissions);
AclFeature f = newNode.getAclFeature();
if (f != null) {
op.setAclEntries(AclStorage.readINodeLogicalAcl(newNode));
}
XAttrFeature x = newNode.getXAttrFeature();
if (x != null) {
op.setXAttrs(x.getXAttrs());
}
// Log the MkdirOp
logEdit(op);
}
Step into logEdit:
void logEdit(final FSEditLogOp op) {
// Main path for writing the edit log. All writers share one FSEditLog instance, so synchronizing on it serializes concurrent writes
synchronized (this) {
assert isOpenForWrite() :
"bad state: " + state;
// wait if an automatic sync is scheduled
waitIfAutoSyncScheduled();
// Assign a unique transaction id
long start = beginTransaction();
op.setTransactionId(txid);
try {
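// Append the op to every journal stream (local edits file and JournalNodes).
// The op only goes into each stream's in-memory buffer here; the standby NameNode later reads the edits from the JournalNodes.
// JournalNodes store edit logs only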
editLogStream.write(op);
} catch (IOException ex) {
// All journals failed, it is handled in logSync.
}
// End the current transaction
endTransaction(start);
// check if it is time to schedule an automatic sync
if (!shouldForceSync()) {
return;
}
isAutoSyncScheduled = true;
}
// sync buffered edit log entries to persistent store
// Flush the edit log: ops are first written to an in-memory buffer, then the buffered ops are synced to disk in one batch
logSync();
}
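editLogStream.write(op) feeds two destinations: the local edits file on disk and the JournalNodes. At this point the op is only placed in each stream's in-memory buffer; the later flush writes it to the local file and sends it to the JournalNodes, which is why the configured JournalNode directories contain all the edit logs as well. First, look at long start = beginTransaction();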
private long beginTransaction() {
// The current thread must hold the lock
assert Thread.holdsLock(this);
// get a new transactionId
// Increment txid; every new modification bumps this value
txid++;
//
// record the transactionId when new data was written to the edits log
// myTransactionId is a ThreadLocal<TransactionId>: each thread keeps its own copy,
// set here to the current global txid, so a thread only has to track its own transaction id
// and other threads cannot change it
TransactionId id = myTransactionId.get();
id.txid = txid;
return now();
}
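The main job here is to record a transaction id for the calling thread. It also shows a typical use of ThreadLocal: each thread has its own copy of the variable and reads it from its own thread, and no other thread can modify that copy. Because txid is incremented inside a synchronized block, only one thread at a time can increment it, and copying txid into the thread's TransactionId is thread-safe as well. Once beginTransaction() has assigned the transaction id, control returns to logEdit, where op.setTransactionId(txid) stores the freshly incremented txid in the FSEditLogOp.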
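The combination of a global counter and a per-thread copy can be tried out in isolation. The sketch below is a simplified model, not FSEditLog itself: TxidSketch and its method names are hypothetical, and the thread-local value is a one-element long array rather than Hadoop's TransactionId holder.

```java
class TxidSketch {
    private long txid = 0; // global counter, guarded by "this"
    private final ThreadLocal<long[]> myTxid = ThreadLocal.withInitial(() -> new long[1]);

    /** Assigns the next transaction id and remembers it for the calling thread. */
    synchronized long beginTransaction() {
        txid++;
        myTxid.get()[0] = txid;
        return txid;
    }

    /** The id of the last transaction started by the calling thread (0 if none). */
    long lastTxidOfCurrentThread() {
        return myTxid.get()[0];
    }

    /** Reads the thread-local value from a brand-new thread that has started no transaction. */
    long txidSeenByFreshThread() {
        long[] seen = new long[1];
        Thread t = new Thread(() -> seen[0] = lastTxidOfCurrentThread());
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0];
    }

    public static void main(String[] args) {
        TxidSketch log = new TxidSketch();
        log.beginTransaction();
        log.beginTransaction();
        System.out.println(log.lastTxidOfCurrentThread()); // 2 for the main thread
        System.out.println(log.txidSeenByFreshThread());   // 0 for a thread with no transaction
    }
}
```

The fresh thread sees 0 even though the global counter is already at 2, which is the isolation that logSync later relies on when it reads the caller's own transaction id.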
Next, editLogStream.write(op), which places the op into the in-memory buffers of the local and JournalNode streams. editLogStream is of type EditLogOutputStream, which is an abstract class: public abstract class EditLogOutputStream implements Closeable
Its write method is abstract as well: abstract public void write(FSEditLogOp op) throws IOException;
So the actual write happens in a subclass, and we need to find which one:
editLogStream = journalSet.startLogSegment(segmentTxId,
NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);
In this code, journalSet is a JournalSet, which represents a collection of journals. Its startLogSegment method returns a JournalSetOutputStream:
public EditLogOutputStream startLogSegment(final long txId,
final int layoutVersion) throws IOException {
mapJournalsAndReportErrors(new JournalClosure() {
@Override
public void apply(JournalAndStream jas) throws IOException {
jas.startLogSegment(txId, layoutVersion);
}
}, "starting log segment " + txId);
return new JournalSetOutputStream();
}
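This method opens a new edit log segment. When is it called? A search shows that FSImage.saveNamespace calls FSEditLog.startLogSegment, which in turn calls journalSet.startLogSegment: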
/**
* Save the contents of the FS image to a new image file in each of the
* current storage directories.
*/
public synchronized void saveNamespace(FSNamesystem source, NameNodeFile nnf,
Canceler canceler) throws IOException {
assert editLog != null : "editLog must be initialized";
LOG.info("Save namespace ...");
storage.attemptRestoreRemovedStorage();
boolean editLogWasOpen = editLog.isSegmentOpen();
if (editLogWasOpen) {
editLog.endCurrentLogSegment(true);
}
long imageTxId = getLastAppliedOrWrittenTxId();
try {
saveFSImageInAllDirs(source, nnf, imageTxId, canceler);
storage.writeAll();
} finally {
if (editLogWasOpen) {
editLog.startLogSegment(imageTxId + 1, true);
// Take this opportunity to note the current transaction.
// Even if the namespace save was cancelled, this marker
// is only used to determine what transaction ID is required
// for startup. So, it doesn't hurt to update it unnecessarily.
storage.writeTransactionIdFileToStorage(imageTxId + 1);
}
}
}
As mentioned in the earlier article, one step of NameNode startup merges the fsimage and the edit log into a new fsimage and opens a new edit log segment. In the loadFSImage method of FSNamesystem:
if (needToSave) {
// Write the merged fsimage file into each of the configured directories
fsImage.saveNamespace(this);
} else {
updateStorageVersionForRollingUpgrade(fsImage.getLayoutVersion(),
startOpt);
// No need to save, so mark the phase done.
StartupProgress prog = NameNode.getStartupProgress();
prog.beginPhase(Phase.SAVING_CHECKPOINT);
prog.endPhase(Phase.SAVING_CHECKPOINT);
}
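The fsImage.saveNamespace(this) call in this code ends up in the three-argument saveNamespace method, shown again here with the key line marked: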
public synchronized void saveNamespace(FSNamesystem source, NameNodeFile nnf,
Canceler canceler) throws IOException {
assert editLog != null : "editLog must be initialized";
LOG.info("Save namespace ...");
storage.attemptRestoreRemovedStorage();
boolean editLogWasOpen = editLog.isSegmentOpen();
if (editLogWasOpen) {
editLog.endCurrentLogSegment(true);
}
long imageTxId = getLastAppliedOrWrittenTxId();
try {
saveFSImageInAllDirs(source, nnf, imageTxId, canceler);
storage.writeAll();
} finally {
if (editLogWasOpen) {
// Start a new log segment, which initializes editLogStream
editLog.startLogSegment(imageTxId + 1, true);
// Take this opportunity to note the current transaction.
// Even if the namespace save was cancelled, this marker
// is only used to determine what transaction ID is required
// for startup. So, it doesn't hurt to update it unnecessarily.
storage.writeTransactionIdFileToStorage(imageTxId + 1);
}
}
}
The editLog.startLogSegment(imageTxId + 1, true) call above invokes startLogSegment, which triggers the initialization:
synchronized void startLogSegment(final long segmentTxId,
boolean writeHeaderTxn) throws IOException {
LOG.info("Starting log segment at " + segmentTxId);
Preconditions.checkArgument(segmentTxId > 0,
"Bad txid: %s", segmentTxId);
Preconditions.checkState(state == State.BETWEEN_LOG_SEGMENTS,
"Bad state: %s", state);
Preconditions.checkState(segmentTxId > curSegmentTxId,
"Cannot start writing to log segment " + segmentTxId +
" when previous log segment started at " + curSegmentTxId);
Preconditions.checkArgument(segmentTxId == txid + 1,
"Cannot start log segment at txid %s when next expected " +
"txid is %s", segmentTxId, txid + 1);
numTransactions = totalTimeTransactions = numTransactionsBatchedInSync = 0;
// TODO no need to link this back to storage anymore!
// See HDFS-2174.
storage.attemptRestoreRemovedStorage();
try {
// Returns a JournalSetOutputStream, a subclass of EditLogOutputStream
editLogStream = journalSet.startLogSegment(segmentTxId,
NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);
} catch (IOException ex) {
throw new IOException("Unable to start log segment " +
segmentTxId + ": too few journals successfully started.", ex);
}
curSegmentTxId = segmentTxId;
state = State.IN_SEGMENT;
if (writeHeaderTxn) {
logEdit(LogSegmentOp.getInstance(cache.get(),
FSEditLogOpCodes.OP_START_LOG_SEGMENT));
logSync();
}
}
So during NameNode startup, editLogStream is initialized after the fsimage and the edit log are merged. Look at this line:
editLogStream = journalSet.startLogSegment(segmentTxId,
NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);
It calls startLogSegment on journalSet. Next, the initialization of journalSet:
private synchronized void initJournals(List<URI> dirs) {
int minimumRedundantJournals = conf.getInt(
DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_KEY,
DFSConfigKeys.DFS_NAMENODE_EDITS_DIR_MINIMUM_DEFAULT);
synchronized(journalSetLock) {
// Initialize journalSet
journalSet = new JournalSet(minimumRedundantJournals);
for (URI u : dirs) {
boolean required = FSNamesystem.getRequiredNamespaceEditsDirs(conf)
.contains(u);
// LOCAL_URI_SCHEME is "file"
// If the URI scheme is file, the journal lives on the local file system
if (u.getScheme().equals(NNStorage.LOCAL_URI_SCHEME)) {
StorageDirectory sd = storage.getStorageDirectory(u);
if (sd != null) {
// Create a FileJournalManager, which is responsible for writing the edit log to the local disk
journalSet.add(new FileJournalManager(conf, sd, storage),
required, sharedEditsDirs.contains(u));
}
} else {
// Otherwise createJournal(u) creates the JournalManager for the URI's scheme;
// for a qjournal:// URI this is a QuorumJournalManager, which manages the JournalNodes
journalSet.add(createJournal(u), required,
sharedEditsDirs.contains(u));
}
}
}
if (journalSet.isEmpty()) {
LOG.error("No edits directories configured!");
}
}
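journalSet maintains the list of journals, covering both the local files and the JournalNodes.
JournalSetOutputStream defines the stream operations such as write, create and flush. Here is its write method: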
public void write(final FSEditLogOp op)
throws IOException {
mapJournalsAndReportErrors(new JournalClosure() {
@Override
public void apply(JournalAndStream jas) throws IOException {
if (jas.isActive()) {
jas.getCurrentStream().write(op);
}
}
}, "write op");
}
An anonymous class acting as a closure is passed to mapJournalsAndReportErrors; its apply method carries the behavior that varies from call to call. Here is apply in JournalClosure:
private interface JournalClosure {
/**
* The operation on JournalAndStream.
* @param jas Object on which operations are performed.
* @throws IOException
*/
public void apply(JournalAndStream jas) throws IOException;
}
Now mapJournalsAndReportErrors itself. It receives the behavior defined by the anonymous class above and, in the loop for (JournalAndStream jas : journals), calls apply on every JournalAndStream in journals (a List<JournalAndStream>).
private void mapJournalsAndReportErrors(
JournalClosure closure, String status) throws IOException{
List<JournalAndStream> badJAS = Lists.newLinkedList();
// journals is a List<JournalAndStream>
for (JournalAndStream jas : journals) {
try {
// Call apply for every JournalAndStream in the list,
// i.e. the jas.getCurrentStream().write(op) defined above
closure.apply(jas);
} catch (Throwable t) {
if (jas.isRequired()) {
final String msg = "Error: " + status + " failed for required journal ("
+ jas + ")";
LOG.fatal(msg, t);
// If we fail on *any* of the required journals, then we must not
// continue on any of the other journals. Abort them to ensure that
// retry behavior doesn't allow them to keep going in any way.
abortAllJournals();
// the current policy is to shutdown the NN on errors to shared edits
// dir. There are many code paths to shared edits failures - syncs,
// roll of edits etc. All of them go through this common function
// where the isRequired() check is made. Applying exit policy here
// to catch all code paths.
terminate(1, msg);
} else {
LOG.error("Error: " + status + " failed for (journal " + jas + ")", t);
badJAS.add(jas);
}
}
}
disableAndReportErrorOnJournals(badJAS);
if (!NameNodeResourcePolicy.areResourcesAvailable(journals,
minimumRedundantJournals)) {
String message = status + " failed for too many journals";
LOG.error("Error: " + message);
throw new IOException(message);
}
}
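In other words, journals is a copy-on-write list maintained by the JournalSet class:
List<JournalAndStream> journals =
new CopyOnWriteArrayList<JournalSet.JournalAndStream>();
It holds every stream that must be written. Each call to mapJournalsAndReportErrors writes the data to each of these streams in turn, so the local file stream and the JournalNode stream are written together.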
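The fan-out with a per-journal closure can be modelled in a few lines. This is a sketch under stated assumptions rather than the real JournalSet: JournalSetSketch, Journal and Closure are hypothetical names, a journal is just an in-memory list, and a failed required journal raises an exception here where the code above aborts all journals and terminates the NameNode.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class JournalSetSketch {
    /** Hypothetical journal: an in-memory list of ops that can be told to fail. */
    static class Journal {
        final String name;
        final boolean required;
        final List<String> ops = new ArrayList<>();
        boolean broken = false;
        boolean active = true;

        Journal(String name, boolean required) {
            this.name = name;
            this.required = required;
        }

        void write(String op) throws IOException {
            if (broken) {
                throw new IOException(name + " is unavailable");
            }
            ops.add(op);
        }
    }

    /** The per-journal action, playing the role of JournalClosure. */
    interface Closure {
        void apply(Journal j) throws IOException;
    }

    final List<Journal> journals = new ArrayList<>();

    /** Applies the closure to every journal; optional journals that fail are disabled. */
    void mapJournals(Closure closure) {
        for (Journal j : journals) {
            try {
                closure.apply(j);
            } catch (IOException e) {
                if (j.required) {
                    throw new IllegalStateException("required journal failed: " + j.name, e);
                }
                j.active = false;
            }
        }
    }

    /** Writes one op to every active journal. */
    void write(String op) {
        mapJournals(j -> {
            if (j.active) {
                j.write(op);
            }
        });
    }

    public static void main(String[] args) {
        JournalSetSketch set = new JournalSetSketch();
        Journal local = new Journal("local", true);
        Journal remote = new Journal("remote", false);
        remote.broken = true;
        set.journals.add(local);
        set.journals.add(remote);
        set.write("mkdir /a");
        System.out.println(local.ops + " remoteActive=" + remote.active);
    }
}
```

Keeping the error handling in one place is the point of the closure: every operation (write, flush, start segment) gets the same required-versus-optional policy without repeating it.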
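So the EditLogOutputStream seen earlier can wrap several streams. During initialization there is a JournalSet holding a FileJournalManager (which writes to the local disk) and a QuorumJournalManager (which writes to the JournalNodes), and the JournalSet produces one EditLogOutputStream that wraps the underlying streams. A call to write() iterates over all of them and calls write on each in turn, and every one of these streams writes to an in-memory buffer first. Once the buffers are written, a separate method flushes the buffered data to disk or sends it over the network to the JournalNodes.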
In the code above, the stream held by jas may be a local edit log file stream or a JournalNode stream: jas.getCurrentStream().write(op);
If it is the local file stream, EditLogFileOutputStream:
public void write(FSEditLogOp op) throws IOException {
doubleBuf.writeOp(op);
}
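This goes through the double-buffer class EditsDoubleBuffer. Its writeOp forwards to the current buffer, and the second writeOp below is the buffer's own method: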
public void writeOp(FSEditLogOp op) throws IOException {
bufCurrent.writeOp(op);
}
public void writeOp(FSEditLogOp op) throws IOException {
if (firstTxId == HdfsConstants.INVALID_TXID) {
firstTxId = op.txid;
} else {
assert op.txid > firstTxId;
}
writer.writeOp(op);
numTxns++;
}
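A word on EditsDoubleBuffer. It has two buffers: one receives new writes and the other is flushed to disk. When a flush is due, the two buffers are swapped first, so edit log entries can keep arriving in the in-memory buffer while the previous batch is written to disk or the network. After editLogStream.write(op) completes, logSync() runs, either immediately from logEdit when a forced sync is due or from the caller (mkdirsInt calls getEditLog().logSync() after releasing its lock):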
/**
* Sync all modifications done by this thread.
*
* The internal concurrency design of this class is as follows:
* - Log items are written synchronized into an in-memory buffer,
* and each assigned a transaction ID.
* - When a thread (client) would like to sync all of its edits, logSync()
* uses a ThreadLocal transaction ID to determine what edit number must
* be synced to.
* - The isSyncRunning volatile boolean tracks whether a sync is currently
* under progress.
*
* The data is double-buffered within each edit log implementation so that
* in-memory writing can occur in parallel with the on-disk writing.
*
* Each sync occurs in three steps:
* 1. synchronized, it swaps the double buffer and sets the isSyncRunning
* flag.
* 2. unsynchronized, it flushes the data to storage
* 3. synchronized, it resets the flag and notifies anyone waiting on the
* sync.
*
* The lack of synchronization on step 2 allows other threads to continue
* to write into the memory buffer while the sync is in progress.
* Because this step is unsynchronized, actions that need to avoid
* concurrency with sync() should be synchronized and also call
* waitForSyncToFinish() before assuming they are running alone.
*/
public void logSync() {
long syncStart = 0;
// Fetch the transactionId of this thread.
long mytxid = myTransactionId.get().txid;
boolean sync = false;
try {
EditLogOutputStream logStream = null;
synchronized (this) {
try {
printStatistics(false);
// if somebody is already syncing, then wait
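// After writing, every thread tries to sync the buffered data to disk.
// Only one thread at a time may flush the buffer; isSyncRunning is true while a flush is in progress.
// mytxid > synctxid means this thread's edit has not been synced yet, so it waits for the running sync.
// Otherwise (mytxid <= synctxid) another thread's sync has already covered
// this thread's edit, which the if (mytxid <= synctxid) check below handles:
// there is nothing left to do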
while (mytxid > synctxid && isSyncRunning) {
try {
wait(1000);
} catch (InterruptedException ie) {
}
}
//
// If this transaction was already flushed, then nothing to do
// A sync that covered a txid >= mytxid has already flushed this thread's edit
if (mytxid <= synctxid) {
// Count this transaction as batched into another thread's sync
numTransactionsBatchedInSync++;
if (metrics != null) {
// Metrics is non-null only when used inside name node
metrics.incrTransactionsBatchedInSync();
}
// Nothing more to do, return directly
return;
}
// Reaching here means mytxid > synctxid and isSyncRunning == false
// now, this thread will do the sync
syncStart = txid;
isSyncRunning = true;
sync = true;
// swap buffers
try {
if (journalSet.isEmpty()) {
throw new IOException("No journals available to flush");
}
// Swap the two buffers
editLogStream.setReadyToFlush();
} catch (IOException e) {
final String msg =
"Could not sync enough journals to persistent storage " +
"due to " + e.getMessage() + ". " +
"Unsynced transactions: " + (txid - synctxid);
LOG.fatal(msg, new Exception());
synchronized(journalSetLock) {
IOUtils.cleanup(LOG, journalSet);
}
terminate(1, msg);
}
} finally {
// Prevent RuntimeException from blocking other log edit write
doneWithAutoSyncScheduling();
}
//editLogStream may become null,
//so store a local variable for flush.
logStream = editLogStream;
}
// do the sync
long start = now();
try {
if (logStream != null) {
// Flush outside the lock. Only the syncing thread gets here; other threads can keep writing to the in-memory buffer meanwhile
logStream.flush();
}
} catch (IOException ex) {
synchronized (this) {
final String msg =
"Could not sync enough journals to persistent storage. "
+ "Unsynced transactions: " + (txid - synctxid);
LOG.fatal(msg, new Exception());
synchronized(journalSetLock) {
IOUtils.cleanup(LOG, journalSet);
}
terminate(1, msg);
}
}
long elapsed = now() - start;
if (metrics != null) { // Metrics non-null only when used inside name node
metrics.addSync(elapsed);
}
} finally {
// Prevent RuntimeException from blocking other log edit sync
synchronized (this) {
if (sync) {
// Record the txid that has now been synced, clear the flag, and wake up waiting threads
synctxid = syncStart;
isSyncRunning = false;
}
this.notifyAll();
}
}
}
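The batching effect of logSync, where one flush covers every transaction buffered so far, can be shown with a deliberately simplified model. GroupCommitSketch and its fields are hypothetical names; unlike the real method, this sketch keeps the lock for the whole flush and passes the caller's txid as a parameter instead of reading a ThreadLocal.

```java
import java.util.ArrayList;
import java.util.List;

class GroupCommitSketch {
    private long txid = 0;     // last transaction id handed out
    private long synctxid = 0; // highest transaction id known to be flushed
    private final List<String> buffer = new ArrayList<>(); // ops not yet flushed
    final List<String> disk = new ArrayList<>();           // stands in for durable storage
    int flushCount = 0;

    /** Buffers an op and returns the transaction id assigned to it. */
    synchronized long logEdit(String op) {
        txid++;
        buffer.add(op);
        return txid;
    }

    /** Makes everything up to mytxid durable; returns true if this call did the flush. */
    synchronized boolean logSync(long mytxid) {
        if (mytxid <= synctxid) {
            return false; // an earlier flush already covered this transaction
        }
        disk.addAll(buffer);
        buffer.clear();
        synctxid = txid;
        flushCount++;
        return true;
    }

    public static void main(String[] args) {
        GroupCommitSketch log = new GroupCommitSketch();
        long first = log.logEdit("mkdir /a");
        long second = log.logEdit("mkdir /b");
        System.out.println(log.logSync(second)); // true: this call flushes both ops
        System.out.println(log.logSync(first));  // false: already covered
        System.out.println(log.flushCount);      // 1
    }
}
```

Two edits cost one flush here, which is the reason the real code bothers to compare mytxid with synctxid before touching the disk.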
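A thread with a smaller txid returns immediately, because the sync performed for a larger txid has already flushed everything before it. The buffer swap in editLogStream.setReadyToFlush() looks like this: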
public void setReadyToFlush() {
assert isFlushed() : "previous data not flushed yet";
TxnBuffer tmp = bufReady;
bufReady = bufCurrent;
bufCurrent = tmp;
}
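The swap is easier to follow with the writes and the flush next to it. Below is a minimal sketch of the double-buffer idea, assuming the buffers are plain lists of strings: DoubleBufferSketch, flushTo and pendingInCurrent are hypothetical names, not the EditsDoubleBuffer API.

```java
import java.util.ArrayList;
import java.util.List;

class DoubleBufferSketch {
    private List<String> bufCurrent = new ArrayList<>(); // receives new ops
    private List<String> bufReady = new ArrayList<>();   // waiting to be flushed

    void writeOp(String op) {
        bufCurrent.add(op);
    }

    /** Swaps the buffers; the previous ready buffer must already be flushed. */
    void setReadyToFlush() {
        if (!bufReady.isEmpty()) {
            throw new IllegalStateException("previous data not flushed yet");
        }
        List<String> tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
    }

    /** Drains the ready buffer into the given sink. */
    void flushTo(List<String> sink) {
        sink.addAll(bufReady);
        bufReady.clear();
    }

    int pendingInCurrent() {
        return bufCurrent.size();
    }

    public static void main(String[] args) {
        DoubleBufferSketch buf = new DoubleBufferSketch();
        List<String> disk = new ArrayList<>();
        buf.writeOp("a");
        buf.writeOp("b");
        buf.setReadyToFlush();
        buf.writeOp("c"); // arrives while the first batch is waiting to be flushed
        buf.flushTo(disk);
        System.out.println(disk + " pending=" + buf.pendingInCurrent()); // [a, b] pending=1
    }
}
```

The op written after the swap stays in the current buffer, so a slow flush never blocks new writes.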
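That was the logic for writing the edit log to the local disk. Next comes the logic for writing to the JournalNodes, which lives in QuorumJournalManager.
This class has a loggers field: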
/**
* Wraps several AsyncLoggers. On flush, each AsyncLogger sends the edit log entries to one JournalNode.
* A quorum rule is applied: the write succeeds once a majority of the JournalNodes have written it
*/
private final AsyncLoggerSet loggers;
The QuorumOutputStream is obtained through the following method:
public EditLogOutputStream startLogSegment(long txId, int layoutVersion)
throws IOException {
Preconditions.checkState(isActiveWriter,
"must recover segments before starting a new one");
QuorumCall<AsyncLogger, Void> q = loggers.startLogSegment(txId,
layoutVersion);
loggers.waitForWriteQuorum(q, startSegmentTimeoutMs,
"startLogSegment(" + txId + ")");
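// The stream that writes to the JournalNodes; it also uses the double-buffer mechanism
return new QuorumOutputStream(loggers, txId,
outputBufferCapacity, writeTxnsTimeoutMs);
}
The code that pushes data to the JournalNodes is the following: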
@Override
protected void flushAndSync(boolean durable) throws IOException {
int numReadyBytes = buf.countReadyBytes();
if (numReadyBytes > 0) {
int numReadyTxns = buf.countReadyTxns();
long firstTxToFlush = buf.getFirstReadyTxId();
assert numReadyTxns > 0;
// Copy from our double-buffer into a new byte array. This is for
// two reasons:
// 1) The IPC code has no way of specifying to send only a slice of
// a larger array.
// 2) because the calls to the underlying nodes are asynchronous, we
// need a defensive copy to avoid accidentally mutating the buffer
// before it is sent.
DataOutputBuffer bufToSend = new DataOutputBuffer(numReadyBytes);
buf.flushTo(bufToSend);
assert bufToSend.getLength() == numReadyBytes;
byte[] data = bufToSend.getData();
assert data.length == bufToSend.getLength();
// Each AsyncLogger sends its RPC asynchronously on its own executor; the QuorumCall collects the results
QuorumCall<AsyncLogger, Void> qcall = loggers.sendEdits(
segmentTxId, firstTxToFlush,
numReadyTxns, data);
// Wait until a majority of JournalNodes have acknowledged; a majority counts as a successful write
loggers.waitForWriteQuorum(qcall, writeTimeoutMs, "sendEdits");
// Since we successfully wrote this batch, let the loggers know. Any future
// RPCs will thus let the loggers know of the most recent transaction, even
// if a logger has fallen behind.
loggers.setCommittedTxId(firstTxToFlush + numReadyTxns - 1);
}
}
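The buffered data is first copied into a new byte array. As the code comments say, this is because the IPC layer cannot send a slice of a larger array, and because the sends are asynchronous, so a defensive copy keeps the data from being modified before it is sent. The sendEdits method of the AsyncLoggerSet then fans the batch out: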
public QuorumCall<AsyncLogger, Void> sendEdits(
long segmentTxId, long firstTxnId, int numTxns, byte[] data) {
Map<AsyncLogger, ListenableFuture<Void>> calls = Maps.newHashMap();
for (AsyncLogger logger : loggers) {
// Each AsyncLogger corresponds to one JournalNode and sends the data to it
ListenableFuture<Void> future =
logger.sendEdits(segmentTxId, firstTxnId, numTxns, data);
calls.put(logger, future);
}
return QuorumCall.create(calls);
}
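The majority rule behind waitForWriteQuorum can be sketched with JDK futures. This is an illustration only: QuorumWriteSketch and its methods are hypothetical names, each "logger" returns an already completed CompletableFuture, and there is no timeout handling, whereas the real code waits on Guava ListenableFutures with a timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class QuorumWriteSketch {
    /** Hypothetical logger call: succeeds or fails depending on "healthy". */
    static CompletableFuture<Void> sendEdits(boolean healthy) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        if (healthy) {
            f.complete(null);
        } else {
            f.completeExceptionally(new RuntimeException("journal node unreachable"));
        }
        return f;
    }

    /** Returns true when more than half of the calls completed successfully. */
    static boolean hasWriteQuorum(List<CompletableFuture<Void>> calls) {
        int ok = 0;
        for (CompletableFuture<Void> call : calls) {
            if (call.isDone() && !call.isCompletedExceptionally()) {
                ok++;
            }
        }
        return ok >= calls.size() / 2 + 1;
    }

    /** Sends to one hypothetical logger per flag and reports whether a majority succeeded. */
    static boolean writeSucceeds(boolean... healthy) {
        List<CompletableFuture<Void>> calls = new ArrayList<>();
        for (boolean h : healthy) {
            calls.add(sendEdits(h));
        }
        return hasWriteQuorum(calls);
    }

    public static void main(String[] args) {
        System.out.println(writeSucceeds(true, true, false));  // true: 2 of 3 succeeded
        System.out.println(writeSucceeds(true, false, false)); // false: only 1 of 3
    }
}
```

With three JournalNodes, one can be down without failing the write, but two cannot.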
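QuorumJournalManager and QuorumOutputStream are the classes that send data to the JournalNodes. They share an AsyncLoggerSet, which holds a set of AsyncLoggers, one per JournalNode. The data is sent asynchronously through each AsyncLogger, and once a majority of the AsyncLoggers in the set have succeeded, the write to the JournalNodes counts as successful. AsyncLogger is an interface; its implementation is IPCLoggerChannel. Here is that class's sendEdits method: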
public ListenableFuture<Void> sendEdits(
final long segmentTxId, final long firstTxnId,
final int numTxns, final byte[] data) {
try {
reserveQueueSpace(data.length);
} catch (LoggerTooFarBehindException e) {
return Futures.immediateFailedFuture(e);
}
// When this batch is acked, we use its submission time in order
// to calculate how far we are lagging.
final long submitNanos = System.nanoTime();
ListenableFuture<Void> ret = null;
try {
// Submit to the single-thread executor
ret = singleThreadExecutor.submit(new Callable<Void>() {
@Override
public Void call() throws IOException {
throwIfOutOfSync();
long rpcSendTimeNanos = System.nanoTime();
try {
// Send the data. getProxy() returns a QJournalProtocol, which is an RPC interface; data holds the bytes taken from the buffer
getProxy().journal(createReqInfo(),
segmentTxId, firstTxnId, numTxns, data);
} catch (IOException e) {
QuorumJournalManager.LOG.warn(
"Remote journal " + IPCLoggerChannel.this + " failed to " +
"write txns " + firstTxnId + "-" + (firstTxnId + numTxns - 1) +
". Will try to write to this JN again after the next " +
"log roll.", e);
synchronized (IPCLoggerChannel.this) {
outOfSync = true;
}
throw e;
} finally {
long now = System.nanoTime();
long rpcTime = TimeUnit.MICROSECONDS.convert(
now - rpcSendTimeNanos, TimeUnit.NANOSECONDS);
long endToEndTime = TimeUnit.MICROSECONDS.convert(
now - submitNanos, TimeUnit.NANOSECONDS);
metrics.addWriteEndToEndLatency(endToEndTime);
metrics.addWriteRpcLatency(rpcTime);
if (rpcTime / 1000 > WARN_JOURNAL_MILLIS_THRESHOLD) {
QuorumJournalManager.LOG.warn(
"Took " + (rpcTime / 1000) + "ms to send a batch of " +
numTxns + " edits (" + data.length + " bytes) to " +
"remote journal " + IPCLoggerChannel.this);
}
}
synchronized (IPCLoggerChannel.this) {
highestAckedTxId = firstTxnId + numTxns - 1;
lastAckNanos = submitNanos;
}
return null;
}
});
} finally {
if (ret == null) {
// it didn't successfully get submitted,
// so adjust the queue size back down.
unreserveQueueSpace(data.length);
} else {
// It was submitted to the queue, so adjust the length
// once the call completes, regardless of whether it
// succeeds or fails.
Futures.addCallback(ret, new FutureCallback<Void>() {
@Override
public void onFailure(Throwable t) {
unreserveQueueSpace(data.length);
}
@Override
public void onSuccess(Void t) {
unreserveQueueSpace(data.length);
}
});
}
}
return ret;
}
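The key call here is getProxy().journal(createReqInfo(), segmentTxId, firstTxnId, numTxns, data). getProxy() returns a QJournalProtocol, the interface through which this logger on the NameNode talks to its JournalNode; the data reaches the JournalNode through an RPC on that interface.
When sending data to a JournalNode, the NameNode acts as the RPC client and the JournalNode as the server. To see how the JournalNode handles the request, find the JournalNodeRpcServer class; from the method name used by the client, the method is journal: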
@Override
public void journal(RequestInfo reqInfo,
long segmentTxId, long firstTxnId,
int numTxns, byte[] records) throws IOException {
jn.getOrCreateJournal(reqInfo.getJournalId())
.journal(reqInfo, segmentTxId, firstTxnId, numTxns, records);
}
jn.getOrCreateJournal(reqInfo.getJournalId()) returns a Journal object, whose journal method is then called:
synchronized void journal(RequestInfo reqInfo,
long segmentTxId, long firstTxnId,
int numTxns, byte[] records) throws IOException {
checkFormatted();
checkWriteRequest(reqInfo);
checkSync(curSegment != null,
"Can't write, no segment open");
if (curSegmentTxId != segmentTxId) {
// Sanity check: it is possible that the writer will fail IPCs
// on both the finalize() and then the start() of the next segment.
// This could cause us to continue writing to an old segment
// instead of rolling to a new one, which breaks one of the
// invariants in the design. If it happens, abort the segment
// and throw an exception.
JournalOutOfSyncException e = new JournalOutOfSyncException(
"Writer out of sync: it thinks it is writing segment " + segmentTxId
+ " but current segment is " + curSegmentTxId);
abortCurSegment();
throw e;
}
checkSync(nextTxId == firstTxnId,
"Can't write txid " + firstTxnId + " expecting nextTxId=" + nextTxId);
long lastTxnId = firstTxnId + numTxns - 1;
if (LOG.isTraceEnabled()) {
LOG.trace("Writing txid " + firstTxnId + "-" + lastTxnId);
}
// If the edit has already been marked as committed, we know
// it has been fsynced on a quorum of other nodes, and we are
// "catching up" with the rest. Hence we do not need to fsync.
boolean isLagging = lastTxnId <= committedTxnId.get();
boolean shouldFsync = !isLagging;
curSegment.writeRaw(records, 0, records.length);
curSegment.setReadyToFlush();
Stopwatch sw = new Stopwatch();
sw.start();
curSegment.flush(shouldFsync);
sw.stop();
metrics.addSync(sw.elapsedTime(TimeUnit.MICROSECONDS));
if (sw.elapsedTime(TimeUnit.MILLISECONDS) > WARN_SYNC_MILLIS_THRESHOLD) {
LOG.warn("Sync of transaction range " + firstTxnId + "-" + lastTxnId +
" took " + sw.elapsedTime(TimeUnit.MILLISECONDS) + "ms");
}
if (isLagging) {
// This batch of edits has already been committed on a quorum of other
// nodes. So, we are in "catch up" mode. This gets its own metric.
metrics.batchesWrittenWhileLagging.incr(1);
}
metrics.batchesWritten.incr(1);
metrics.bytesWritten.incr(records.length);
metrics.txnsWritten.incr(numTxns);
highestWrittenTxId = lastTxnId;
nextTxId = lastTxnId + 1;
}
So when a JournalNode receives data from the NameNode, it writes the data to its current segment and flushes it to disk.
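The sanity checks in journal are what keep a JournalNode's segment gap-free. The sketch below models only those checks under simplifying assumptions: JournalNodeSketch is a hypothetical class, records are strings instead of raw bytes, the segment is an in-memory list, and a counter stands in for the fsync decision.

```java
import java.util.ArrayList;
import java.util.List;

class JournalNodeSketch {
    private final long curSegmentTxId; // first txid of the open segment
    private long nextTxId;             // txid expected at the start of the next batch
    long committedTxId = 0;            // highest txid known to be committed on a quorum
    final List<String> segment = new ArrayList<>(); // stands in for the segment file
    int fsyncCount = 0;

    JournalNodeSketch(long segmentTxId) {
        this.curSegmentTxId = segmentTxId;
        this.nextTxId = segmentTxId;
    }

    long nextTxId() {
        return nextTxId;
    }

    /** Accepts a batch only if it targets the open segment and starts at the expected txid. */
    synchronized void journal(long segmentTxId, long firstTxnId, List<String> records) {
        if (segmentTxId != curSegmentTxId) {
            throw new IllegalStateException("writer out of sync: segment " + segmentTxId);
        }
        if (firstTxnId != nextTxId) {
            throw new IllegalStateException(
                "can't write txid " + firstTxnId + ", expecting " + nextTxId);
        }
        long lastTxnId = firstTxnId + records.size() - 1;
        // A batch that is already committed elsewhere is only being caught up, so no fsync.
        boolean isLagging = lastTxnId <= committedTxId;
        segment.addAll(records);
        if (!isLagging) {
            fsyncCount++;
        }
        nextTxId = lastTxnId + 1;
    }

    public static void main(String[] args) {
        JournalNodeSketch jn = new JournalNodeSketch(1);
        jn.journal(1, 1, List.of("mkdir /a", "mkdir /b"));
        System.out.println(jn.nextTxId()); // 3
        try {
            jn.journal(1, 5, List.of("mkdir /c")); // skips txids 3 and 4
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A batch that skips transaction ids is rejected outright, so a JournalNode that missed an RPC cannot silently end up with a hole in its log.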
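To summarize: when a client issues a request, DFSClient calls the RPC server of the remote NameNode. The RPC server holds an FSNamesystem, which maintains an FSEditLog and an FSDirectory. FSEditLog is the operation log and FSDirectory is the in-memory file/directory tree. After the RPC server receives a request, a node is first attached to the FSDirectory tree, and then an entry is written to the FSEditLog. That entry goes to two places: the local edits file on disk and the remote JournalNodes.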