Hadoop Source Code Analysis: The HDFS Upload (Write) Path


Contents

Overview

The HDFS upload path through the client, NameNode, and DataNode source

Main text:

The HDFS write data flow:
1. The client asks the NameNode for permission to upload a file.
2. The NameNode responds that the file may be uploaded.
3. The client requests to upload the first block (bytes 0-128M) and asks for DataNodes to write to.
4. The NameNode returns three nodes, dn1, dn2, and dn3, meaning these three will store the data.
5. The client's FSDataOutputStream asks dn1 to establish a block transfer pipeline; dn1 forwards the request to dn2, and dn2 forwards it to dn3.
6. The acknowledgements come back in reverse: dn3 answers dn2, dn2 answers dn1, and dn1 answers the client.
7. The client transmits the data as packets (each packet is made of 512-byte chunks, each followed by a 4-byte checksum).
8. This repeats until the transfer completes.
A minimal client that exercises this flow is sketched below.
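A minimal, self-contained client sketch (the NameNode URI, user name, and path are my placeholder assumptions, not from the source):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.create() drives steps 1-2 (ask the NN, get the go-ahead);
    // write() and close() drive steps 3-8 (blocks, pipeline, packets, acks).
    try (FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf, "hadoop");
         FSDataOutputStream fos = fs.create(new Path("/demo/hello.txt"))) {
      fos.write("hello hdfs".getBytes());
    } // close() flushes the remaining packets and completes the file on the NN
  }
}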

The source-code flow is:
1. fs.create
2. DistributedFileSystem resolves the path
3. a new DFSOutputStream is created
4. the user writes bytes into the stream
5. the bytes are summed into chunks
6. 512 bytes of data plus a 4-byte checksum make a chunk; enough chunks make a packet.
If a write fails, the packet is put back into dataQueue and removed from ackQueue.
dataQueue.wait() blocks until the DataStreamer is notified that data is available.
7. The DataStreamer then sends the packet over a socket to DataNode1, which forwards it to DataNode2 and then DataNode3; each DataNode persists the data to disk.

The source walkthrough follows.

3.1 The create path

The client-side Maven dependencies (pom.xml):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs-client</artifactId>
    <version>3.1.3</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.30</version>
</dependency>

3.1.1 The client sends the create request to the NN
The user's own code:

@Test
public void testPut2() throws IOException {
FSDataOutputStream fos = fs.create(new Path("/input"));
fos.write("hello world".getBytes());
fos.close();
}
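For this test to run, the fs field has to be initialized first. A minimal setup sketch (the URI and user are assumptions, not from the source):

private FileSystem fs;

@Before
public void init() throws IOException, InterruptedException, URISyntaxException {
    // connect to the NameNode; the URI and user are placeholders
    fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration(), "hadoop");
}

@After
public void close() throws IOException {
    fs.close();
}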

FileSystem.java
public FSDataOutputStream create(Path f) throws IOException {
return create(f, true);
}

public FSDataOutputStream create(Path f, boolean overwrite)
throws IOException {
return create(f, overwrite,
getConf().getInt(IO_FILE_BUFFER_SIZE_KEY,
IO_FILE_BUFFER_SIZE_DEFAULT),
getDefaultReplication(f),
getDefaultBlockSize(f));
}
public FSDataOutputStream create(Path f,
boolean overwrite,
int bufferSize,
short replication,
long blockSize) throws IOException {
return create(f, overwrite, bufferSize, replication, blockSize, null);
}
public FSDataOutputStream create(Path f,
boolean overwrite,
int bufferSize,
short replication,
long blockSize,
Progressable progress
) throws IOException {
return this.create(f, FsCreateModes.applyUMask(
FsPermission.getFileDefault(), FsPermission.getUMask(getConf())),
overwrite, bufferSize, replication, blockSize, progress);
}

public abstract FSDataOutputStream create(Path f,
FsPermission permission,
boolean overwrite,
int bufferSize,
short replication,
long blockSize,
Progressable progress) throws IOException;
public FSDataOutputStream create(Path f, FsPermission permission,
boolean overwrite, int bufferSize, short replication, long blockSize,
Progressable progress) throws IOException {
return this.create(f, permission,
overwrite ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
: EnumSet.of(CreateFlag.CREATE), bufferSize, replication,
blockSize, progress, null);
}
DistributedFileSystem.java
@Override
public FSDataOutputStream create(final Path f, final FsPermission permission,
final EnumSet<CreateFlag> cflags, final int bufferSize,
final short replication, final long blockSize,
final Progressable progress, final ChecksumOpt checksumOpt)
throws IOException {
statistics.incrementWriteOps(1);
storageStatistics.incrementOpCounter(OpType.CREATE);
Path absF = fixRelativePart(f);
return new FileSystemLinkResolver<FSDataOutputStream>() {
@Override
public FSDataOutputStream doCall(final Path p) throws IOException {
// create and obtain the underlying output stream
final DFSOutputStream dfsos = dfs.create(getPathName(p), permission,
cflags, replication, blockSize, progress, bufferSize,
checksumOpt);
// wrap the dfsos created above and return it
return dfs.createWrappedOutputStream(dfsos, statistics);
}
... ...
}.resolve(this, absF);
}
DFSClient.java
public DFSOutputStream create(String src, FsPermission permission,
EnumSet<CreateFlag> flag, boolean createParent, short replication,
long blockSize, Progressable progress, int buffersize,
ChecksumOpt checksumOpt, InetSocketAddress[] favoredNodes,
String ecPolicyName) throws IOException {
checkOpen();
final FsPermission masked = applyUMask(permission);
LOG.debug("{}: masked={}", src, masked);
final DFSOutputStream result = DFSOutputStream.newStreamForCreate(this,
src, masked, flag, createParent, replication, blockSize, progress,
dfsClientConf.createChecksum(checksumOpt),
getFavoredNodesStr(favoredNodes), ecPolicyName);
beginFileLease(result.getFileId(), result);
return result;
}

static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
short replication, long blockSize, Progressable progress,
DataChecksum checksum, String[] favoredNodes, String ecPolicyName)
throws IOException {
try (TraceScope ignored =
dfsClient.newPathTraceScope("newStreamForCreate", src)) {
HdfsFileStatus stat = null;
// Retry the create if we get a RetryStartFileException up to a maximum
// number of times
boolean shouldRetry = true;
int retryCount = CREATE_RETRY_COUNT;
while (shouldRetry) {
shouldRetry = false;
try {
// the client sends the create request to the NN over RPC
stat = dfsClient.namenode.create(src, masked, dfsClient.clientName,
new EnumSetWritable<>(flag), createParent, replication,
blockSize, SUPPORTED_CRYPTO_VERSIONS, ecPolicyName);
break;
} catch (RemoteException re) {
... ...
}
}
Preconditions.checkNotNull(stat, "HdfsFileStatus should not be null!");
final DFSOutputStream out;
if(stat.getErasureCodingPolicy() != null) {
out = new DFSStripedOutputStream(dfsClient, src, stat,
flag, progress, checksum, favoredNodes);
} else {
out = new DFSOutputStream(dfsClient, src, stat,
flag, progress, checksum, favoredNodes, true);
}
// start the thread's run(): DataStreamer extends Daemon extends Thread
out.start();
return out;
}
}

3.1.2 The NN handles the client's create request
1) Click into create
ClientProtocol.java
HdfsFileStatus create(String src, FsPermission masked,
String clientName, EnumSetWritable<CreateFlag> flag,
boolean createParent, short replication, long blockSize,
CryptoProtocolVersion[] supportedVersions, String ecPolicyName)
throws IOException;

Find the implementation of create in NameNodeRpcServer.java:
public HdfsFileStatus create(String src, FsPermission masked,
String clientName, EnumSetWritable<CreateFlag> flag,
boolean createParent, short replication, long blockSize,
CryptoProtocolVersion[] supportedVersions, String ecPolicyName)
throws IOException {
// check that the NN has started
checkNNStartup();
... ...
HdfsFileStatus status = null;
try {
PermissionStatus perm = new PermissionStatus(getRemoteUser()
.getShortUserName(), null, masked);
// the important call: start the file in the namesystem
status = namesystem.startFile(src, perm, clientName, clientMachine,
flag.get(), createParent, replication, blockSize, supportedVersions,
ecPolicyName, cacheEntry != null);
} finally {
RetryCache.setState(cacheEntry, status != null, status);
}
metrics.incrFilesCreated();
metrics.incrCreateFileOps();
return status;
}

startFile then delegates (FSNamesystem.java):
HdfsFileStatus startFile(String src, PermissionStatus permissions,
String holder, String clientMachine, EnumSet<CreateFlag> flag,
boolean createParent, short replication, long blockSize,
CryptoProtocolVersion[] supportedVersions, String ecPolicyName,
boolean logRetryCache) throws IOException {
HdfsFileStatus status;
try {
status = startFileInt(src, permissions, holder, clientMachine, flag,
createParent, replication, blockSize, supportedVersions, ecPolicyName,
logRetryCache);
} catch (AccessControlException e) {
logAuditEvent(false, "create", src);
throw e;
}
logAuditEvent(true, "create", src, status);
return status;
}

private HdfsFileStatus startFileInt(String src,
PermissionStatus permissions, String holder, String clientMachine,
EnumSet<CreateFlag> flag, boolean createParent, short replication,
long blockSize, CryptoProtocolVersion[] supportedVersions,
String ecPolicyName, boolean logRetryCache) throws IOException {
… …
stat = FSDirWriteFileOp.startFile(this, iip, permissions, holder,
clientMachine, flag, createParent, replication, blockSize, feInfo,
toRemoveBlocks, shouldReplicate, ecPolicyName, logRetryCache);

static HdfsFileStatus startFile(
… …)
throws IOException {
… …
FSDirectory fsd = fsn.getFSDirectory();
// check whether the file path already exists
if (iip.getLastINode() != null) {
if (overwrite) {
List<INode> toRemoveINodes = new ChunkedArrayList<>();
List<Long> toRemoveUCFiles = new ChunkedArrayList<>();
long ret = FSDirDeleteOp.delete(fsd, iip, toRemoveBlocks,
toRemoveINodes, toRemoveUCFiles, now());
if (ret >= 0) {
iip = INodesInPath.replace(iip, iip.length() - 1, null);
FSDirDeleteOp.incrDeletedFileCount(ret);
fsn.removeLeasesAndINodes(toRemoveUCFiles, toRemoveINodes, true);
}
} else {
// If lease soft limit time is expired, recover the lease
fsn.recoverLeaseInternal(FSNamesystem.RecoverLeaseOp.CREATE_FILE, iip,
src, holder, clientMachine, false);
throw new FileAlreadyExistsException(src + " for client " +
clientMachine + " already exists");
}
}
fsn.checkFsObjectLimit();

INodeFile newNode = null;
INodesInPath parent = FSDirMkdirOp.createAncestorDirectories(fsd, iip, permissions);
if (parent != null) {
// add the file's metadata to the namespace
iip = addFile(fsd, parent, iip.getLastLocalName(), permissions,
replication, blockSize, holder, clientMachine, shouldReplicate,
ecPolicyName);
newNode = iip != null ? iip.getLastINode().asFile() : null;
}
… …
setNewINodeStoragePolicy(fsd.getBlockManager(), iip, isLazyPersist);
fsd.getEditLog().logOpenFile(src, newNode, overwrite, logRetryEntry);
if (NameNode.stateChangeLog.isDebugEnabled()) {
NameNode.stateChangeLog.debug("DIR* NameSystem.startFile: added " +
src + " inode " + newNode.getId() + " " + holder);
}
return FSDirStatAndListingOp.getFileInfo(fsd, iip, false, false);
}

private static INodesInPath addFile(
FSDirectory fsd, INodesInPath existing, byte[] localName,
PermissionStatus permissions, short replication, long preferredBlockSize,
String clientName, String clientMachine, boolean shouldReplicate,
String ecPolicyName) throws IOException {

Preconditions.checkNotNull(existing);
long modTime = now();
INodesInPath newiip;
fsd.writeLock();
try {
… …
newiip = fsd.addINode(existing, newNode, permissions.getPermission());
} finally {
fsd.writeUnlock();
}
… …
return newiip;
}

FSDirectory.java
INodesInPath addINode(INodesInPath existing, INode child,
FsPermission modes)
throws QuotaExceededException, UnresolvedLinkException {
cacheName(child);
writeLock();
try {
// write the data into the INode directory tree
return addLastINode(existing, child, modes, true);
} finally {
writeUnlock();
}
}

3.1.3 DataStreamer startup flow
static DFSOutputStream newStreamForCreate(DFSClient dfsClient, String src,
FsPermission masked, EnumSet<CreateFlag> flag, boolean createParent,
short replication, long blockSize, Progressable progress,
DataChecksum checksum, String[] favoredNodes, String ecPolicyName)
throws IOException {
... ...
// the client sends the create request to the NN over RPC
stat = dfsClient.namenode.create(src, masked, dfsClient.clientName,
new EnumSetWritable<>(flag), createParent, replication,
blockSize, SUPPORTED_CRYPTO_VERSIONS, ecPolicyName);
... ...
// create the output stream
out = new DFSOutputStream(dfsClient, src, stat,
flag, progress, checksum, favoredNodes, true);
// start the thread's run(): DataStreamer extends Daemon extends Thread
out.start();
return out;
}

protected DFSOutputStream(DFSClient dfsClient, String src,
HdfsFileStatus stat, EnumSet<CreateFlag> flag, Progressable progress,
DataChecksum checksum, String[] favoredNodes, boolean createStreamer) {
this(dfsClient, src, flag, progress, stat, checksum);
this.shouldSyncBlock = flag.contains(CreateFlag.SYNC_BLOCK);
// Directory => File => Block(128M) => packet(64K) => chunk(512 byte data + 4 byte checksum)
computePacketChunkSize(dfsClient.getConf().getWritePacketSize(),
bytesPerChecksum);
if (createStreamer) {
streamer = new DataStreamer(stat, null, dfsClient, src, progress,
checksum, cachingStrategy, byteArrayManager, favoredNodes,
addBlockFlags);
}
}
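computePacketChunkSize is where the 64 KB packet is divided into 516-byte chunk slots. A back-of-the-envelope sketch of that arithmetic (the 33-byte header allowance is my assumption of PacketHeader.PKT_MAX_HEADER_LEN; treat it as such):

public class PacketMath {
  public static void main(String[] args) {
    int bytesPerChecksum = 512;      // dfs.bytes-per-checksum: data bytes per chunk
    int checksumSize = 4;            // a CRC32C checksum is 4 bytes
    int chunkSize = bytesPerChecksum + checksumSize;   // 516 bytes per chunk on the wire
    int writePacketSize = 64 * 1024; // dfs.client-write-packet-size
    int maxHeaderLen = 33;           // assumed PacketHeader.PKT_MAX_HEADER_LEN
    int bodySize = writePacketSize - maxHeaderLen;
    int chunksPerPacket = Math.max(1, bodySize / chunkSize);
    // prints ~126; without the header allowance, 65536 / 516 = 127,
    // which is where the commonly quoted "127 chunks per packet" comes from
    System.out.println(chunksPerPacket);
  }
}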

protected synchronized void start() {
getStreamer().start();
}
protected DataStreamer getStreamer() {
return streamer;
}

class DataStreamer extends Daemon {
... ...
}

public class Daemon extends Thread {
... ...
}

Note: out.start() actually starts a thread; open DataStreamer and search for its run method (covered in 3.2.2 below).

3.2 The write path

3.2.1 Writing data into the DataStreamer's queue
The write call chain starts from the output stream:

public void write(byte b[]) throws IOException {
write(b, 0, b.length);
}

public void write(byte b[], int off, int len) throws IOException {
if ((off | len | (b.length - (len + off)) | (off + len)) < 0)
throw new IndexOutOfBoundsException();
for (int i = 0 ; i < len ; i++) {
write(b[off + i]);
}
}

java.io.OutputStream declares the single-byte write as abstract:
public abstract void write(int b) throws IOException;

FSOutputSummer.java implements it:
public synchronized void write(int b) throws IOException {
buf[count++] = (byte)b;
if(count == buf.length) {
flushBuffer();
}
}
protected synchronized void flushBuffer() throws IOException {
flushBuffer(false, true);
}

The flushBuffer implementation:
protected synchronized int flushBuffer(boolean keep,
boolean flushPartial) throws IOException {
int bufLen = count;
int partialLen = bufLen % sum.getBytesPerChecksum();
int lenToFlush = flushPartial ? bufLen : bufLen - partialLen;
if (lenToFlush != 0) {
// write the data into the queue
// Directory => File => Block(128M) => packet(64K) => chunk(512 byte data + 4 byte checksum)
writeChecksumChunks(buf, 0, lenToFlush);
if (!flushPartial || keep) {
count = partialLen;
System.arraycopy(buf, bufLen - count, buf, 0, count);
} else {
count = 0;
}
}
// total bytes left minus unflushed bytes left
return count - (bufLen - lenToFlush);
}

private void writeChecksumChunks(byte b[], int off, int len)
throws IOException {
// compute the checksum of each chunk
sum.calculateChunkedSums(b, off, len, checksum, 0);
TraceScope scope = createWriteTraceScope();
try {
for (int i = 0; i < len; i += sum.getBytesPerChecksum()) {
int chunkLen = Math.min(sum.getBytesPerChecksum(), len - i);
int ckOffset = i / sum.getBytesPerChecksum() * getChecksumSize();
// write the data into the queue one chunk at a time
writeChunk(b, off + i, chunkLen, checksum, ckOffset,
getChecksumSize());
}
} finally {
if (scope != null) {
scope.close();
}
}
}

How is a chunk written? FSOutputSummer declares it abstract:
protected abstract void writeChunk(byte[] b, int bOffset, int bLen,
byte[] checksum, int checksumOffset, int checksumLen) throws IOException;

DFSOutputStream.java implements it:
protected synchronized void writeChunk(byte[] b, int offset, int len,
byte[] checksum, int ckoff, int cklen) throws IOException {
writeChunkPrepare(len, ckoff, cklen);

// write the chunk's 4-byte checksum into the packet
currentPacket.writeChecksum(checksum, ckoff, cklen);

// write one 512-byte chunk into the packet
currentPacket.writeData(b, offset, len);

// count the chunks written into the packet; at 127 chunks the packet is full
currentPacket.incNumChunks();
getStreamer().incBytesCurBlock(len);

// If packet is full, enqueue it for transmission
if (currentPacket.getNumChunks() == currentPacket.getMaxChunks() ||
getStreamer().getBytesCurBlock() == blockSize) {
enqueueCurrentPacketFull();
}
}

synchronized void enqueueCurrentPacketFull() throws IOException {
LOG.debug("enqueue full {}, src={}, bytesCurBlock={}, blockSize={},"
+ " appendChunk={}, {}", currentPacket, src, getStreamer()
.getBytesCurBlock(), blockSize, getStreamer().getAppendChunk(),
getStreamer());
enqueueCurrentPacket();
adjustChunkBoundary();
endBlock();
}

void enqueueCurrentPacket() throws IOException {
getStreamer().waitAndQueuePacket(currentPacket);
currentPacket = null;
}

void waitAndQueuePacket(DFSPacket packet) throws IOException {
synchronized (dataQueue) {
try {
// If queue is full, then wait till we have enough space
boolean firstWait = true;
try {
while (!streamerClosed && dataQueue.size() + ackQueue.size() >
dfsClient.getConf().getWriteMaxPackets()) {
if (firstWait) {
Span span = Tracer.getCurrentSpan();
if (span != null) {
span.addTimelineAnnotation("dataQueue.wait");
}
firstWait = false;
}
try {
dataQueue.wait();
} catch (InterruptedException e) {
... ...
}
}
} finally {
Span span = Tracer.getCurrentSpan();
if ((span != null) && (!firstWait)) {
span.addTimelineAnnotation("end.wait");
}
}
checkClosed();
// if the queue is not full, add the packet to it
queuePacket(packet);
} catch (ClosedChannelException ignored) {
}
}
}

DataStreamer.java
void queuePacket(DFSPacket packet) {
synchronized (dataQueue) {
if (packet == null) return;
packet.addTraceParent(Tracer.getCurrentSpanId());
// add the packet to the queue
dataQueue.addLast(packet);
lastQueuedSeqno = packet.getSeqno();
LOG.debug("Queued {}, {}", packet, this);
// notify waiters that a packet has been added
dataQueue.notifyAll();
}
}
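Stripped of HDFS details, waitAndQueuePacket/queuePacket and the DataStreamer loop form a classic bounded producer-consumer pair over two linked lists. A toy model of the pattern (my own sketch, not HDFS code; field names mirror the real ones, and it is extended in 3.2.5):

import java.util.LinkedList;

class PacketQueues<P> {
  private final LinkedList<P> dataQueue = new LinkedList<>();
  private final LinkedList<P> ackQueue = new LinkedList<>();
  private final int maxPackets; // cf. dfs.client.write.max-packets-in-flight

  PacketQueues(int maxPackets) { this.maxPackets = maxPackets; }

  // Producer side (the user thread, via writeChunk -> waitAndQueuePacket):
  // block while too many packets are in flight, then enqueue and notify.
  void waitAndQueue(P packet) throws InterruptedException {
    synchronized (dataQueue) {
      while (dataQueue.size() + ackQueue.size() >= maxPackets) {
        dataQueue.wait(); // back-pressure on the writer
      }
      dataQueue.addLast(packet);
      dataQueue.notifyAll(); // wake the streamer thread
    }
  }

  // Consumer side (the DataStreamer thread): take a packet for sending
  // and park it on ackQueue until the pipeline acknowledges it.
  P takeForSend() throws InterruptedException {
    synchronized (dataQueue) {
      while (dataQueue.isEmpty()) {
        dataQueue.wait(1000); // cf. dataQueue.wait(timeout) in DataStreamer#run
      }
      P one = dataQueue.removeFirst();
      ackQueue.addLast(one);
      dataQueue.notifyAll();
      return one;
    }
  }
}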

3.2.2 Pipeline setup: rack awareness (where block replicas are placed)

DataStreamer.java

public void run() {
long lastPacket = Time.monotonicNow();
TraceScope scope = null;
while (!streamerClosed && dfsClient.clientRunning) {
// if the Responder encountered an error, shutdown Responder
if (errorState.hasError()) {
closeResponder();
}
DFSPacket one;
try {
// process datanode IO errors if any
boolean doSleep = processDatanodeOrExternalError();
final int halfSocketTimeout = dfsClient.getConf().getSocketTimeout()/2;
synchronized (dataQueue) {
// wait for a packet to be sent.
long now = Time.monotonicNow();
while ((!shouldStop() && dataQueue.size() == 0 &&
(stage != BlockConstructionStage.DATA_STREAMING ||
now - lastPacket < halfSocketTimeout)) || doSleep) {
long timeout = halfSocketTimeout - (now-lastPacket);
timeout = timeout <= 0 ? 1000 : timeout;
timeout = (stage == BlockConstructionStage.DATA_STREAMING)?
timeout : 1000;
try {
// if dataQueue is empty, the code blocks here
dataQueue.wait(timeout); // woken up by a notify
... ...
}
// the packet is taken from the queue here (elided; the full loop
// body appears in 3.2.5)
SpanId[] parents = one.getTraceParents();
if (parents.length > 0) {
scope = dfsClient.getTracer().
newScope("dataStreamer", parents[0]);
scope.getSpan().setParents(parents);
}
}
}

// get new block from namenode.
if (LOG.isDebugEnabled()) {
LOG.debug("stage=" + stage + ", " + this);
}
if (stage == BlockConstructionStage.PIPELINE_SETUP_CREATE) {
LOG.debug("Allocating new block: {}", this);
// step 1: ask the NameNode for a block and set up the data pipeline
setPipeline(nextBlockOutputStream());
// step 2: start the ResponseProcessor that listens for packet acks
initDataStreaming();
} else if (stage == BlockConstructionStage.PIPELINE_SETUP_APPEND) {
setupPipelineForAppendOrRecovery();
if (streamerClosed) {
continue;
}
initDataStreaming();
}

// send the packet
SpanId spanId = SpanId.INVALID;
synchronized (dataQueue) {
// move packet from dataQueue to ackQueue
if (!one.isHeartbeatPacket()) {
if (scope != null) {
spanId = scope.getSpanId();
scope.detach();
one.setTraceScope(scope);
}
scope = null;
// step 3: remove this packet from dataQueue
dataQueue.removeFirst();
// step 4: add it to ackQueue until the DNs acknowledge it
ackQueue.addLast(one);
packetSendTime.put(one.getSeqno(), Time.monotonicNow());
dataQueue.notifyAll();
}
}

// write out data to remote datanode
try (TraceScope ignored = dfsClient.getTracer().
newScope("DataStreamer#writeTo", spanId)) {
// send the data
one.writeTo(blockStream);
blockStream.flush();
} catch (IOException e) {
errorState.markFirstNodeIfNotMarked();
throw e;
}
... ...
}

Click into nextBlockOutputStream:
protected LocatedBlock nextBlockOutputStream() throws IOException {
LocatedBlock lb;
DatanodeInfo[] nodes;
StorageType[] nextStorageTypes;
String[] nextStorageIDs;
int count = dfsClient.getConf().getNumBlockWriteRetry();
boolean success;
final ExtendedBlock oldBlock = block.getCurrentBlock();
do {
errorState.resetInternalError();
lastException.clear();
DatanodeInfo[] excluded = getExcludedNodes();
// ask the NN which DNs the block should be written to
lb = locateFollowingBlock(
excluded.length > 0 ? excluded : null, oldBlock);
nodes = lb.getLocations();
nextStorageTypes = lb.getStorageTypes();
nextStorageIDs = lb.getStorageIDs();
// set up the block transfer pipeline
success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
0L, false);
... ...
} while (!success && --count >= 0);
if (!success) {
throw new IOException("Unable to create new block.");
}
return lb;
}

private LocatedBlock locateFollowingBlock(DatanodeInfo[] excluded,
ExtendedBlock oldBlock) throws IOException {
return DFSOutputStream.addBlock(excluded, dfsClient, src, oldBlock,
stat.getFileId(), favoredNodes, addBlockFlags);
}

ClientProtocol.java
LocatedBlock addBlock(String src, String clientName,
ExtendedBlock previous, DatanodeInfo[] excludeNodes, long fileId,
String[] favoredNodes, EnumSet<AddBlockFlag> addBlockFlags)
throws IOException;

NameNodeRpcServer.java
public LocatedBlock addBlock(String src, String clientName,
ExtendedBlock previous, DatanodeInfo[] excludedNodes, long fileId,
String[] favoredNodes, EnumSet<AddBlockFlag> addBlockFlags)
throws IOException {
checkNNStartup();
LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, fileId,
clientName, previous, excludedNodes, favoredNodes, addBlockFlags);
if (locatedBlock != null) {
metrics.incrAddBlockOps();
}
return locatedBlock;
}
LocatedBlock getAdditionalBlock(
String src, long fileId, String clientName, ExtendedBlock previous,
DatanodeInfo[] excludedNodes, String[] favoredNodes,
EnumSet<AddBlockFlag> flags) throws IOException {
final String operationName = "getAdditionalBlock";
NameNode.stateChangeLog.debug("BLOCK* getAdditionalBlock: {} inodeId {}" +
" for {}", src, fileId, clientName);
// choose the target DNs for the new block
DatanodeStorageInfo[] targets = FSDirWriteFileOp.chooseTargetForNewBlock(
blockManager, src, excludedNodes, favoredNodes, flags, r);
... ...
return lb;
}
static DatanodeStorageInfo[] chooseTargetForNewBlock(
BlockManager bm, String src, DatanodeInfo[] excludedNodes,
String[] favoredNodes, EnumSet<AddBlockFlag> flags,
ValidateAddBlockResult r) throws IOException {
... ...
return bm.chooseTarget4NewBlock(src, r.numTargets, clientNode,
excludedNodesSet, r.blockSize,
favoredNodesList, r.storagePolicyID,
r.blockType, r.ecPolicy, flags);
}

Inside BlockManager.chooseTarget4NewBlock, the block placement policy picks the targets:
final DatanodeStorageInfo[] targets = blockplacement.chooseTarget(src,
numOfReplicas, client, excludedNodes, blocksize,
favoredDatanodeDescriptors, storagePolicy, flags);
... ...
return targets;
}

DatanodeStorageInfo[] chooseTarget(String src,
int numOfReplicas, Node writer,
Set<Node> excludedNodes,
long blocksize,
List<DatanodeDescriptor> favoredNodes,
BlockStoragePolicy storagePolicy,
EnumSet<AddBlockFlag> flags) {
return chooseTarget(src, numOfReplicas, writer,
new ArrayList<DatanodeStorageInfo>(numOfReplicas), false,
excludedNodes, blocksize, storagePolicy, flags);
}

public abstract DatanodeStorageInfo[] chooseTarget(String srcPath,
int numOfReplicas,
Node writer,
List<DatanodeStorageInfo> chosen,
boolean returnChosenNodes,
Set<Node> excludedNodes,
long blocksize,
BlockStoragePolicy storagePolicy,
EnumSet<AddBlockFlag> flags);

Inside BlockPlacementPolicyDefault, chooseTarget then selects the nodes:
int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
numOfReplicas = result[0];
int maxNodesPerRack = result[1];
for (DatanodeStorageInfo storage : chosenStorage) {
// add localMachine and related nodes to excludedNodes
addToExcludedNodes(storage.getDatanodeDescriptor(), excludedNodes);
}
List<DatanodeStorageInfo> results = null;
Node localNode = null;
boolean avoidStaleNodes = (stats != null
&& stats.isAvoidingStaleDataNodesForWrite());
// when requested, avoid placing a replica on the writer's local node
if (avoidLocalNode) {
results = new ArrayList<>(chosenStorage);
Set<Node> excludedNodeCopy = new HashSet<>(excludedNodes);
if (writer != null) {
excludedNodeCopy.add(writer);
}
localNode = chooseTarget(numOfReplicas, writer,
excludedNodeCopy, blocksize, maxNodesPerRack, results,
avoidStaleNodes, storagePolicy,
EnumSet.noneOf(StorageType.class), results.isEmpty(), sTypes);
if (results.size() < numOfReplicas) {
// not enough nodes; discard results and fall back
results = null;
}
}

if (results == null) {
results = new ArrayList<>(chosenStorage);
// the actual DN selection happens here
localNode = chooseTarget(numOfReplicas, writer, excludedNodes,
blocksize, maxNodesPerRack, results, avoidStaleNodes,
storagePolicy, EnumSet.noneOf(StorageType.class), results.isEmpty(),
sTypes);
}

if (!returnChosenNodes) {
results.removeAll(chosenStorage);
}
return getPipeline(
(writer != null && writer instanceof DatanodeDescriptor) ? writer
: localNode,
results.toArray(new DatanodeStorageInfo[results.size()]));
}

private Node chooseTarget(int numOfReplicas,
… …) {
writer = chooseTargetInOrder(numOfReplicas, writer, excludedNodes, blocksize,
maxNodesPerRack, results, avoidStaleNodes, newBlock, storageTypes);
… …
}

protected Node chooseTargetInOrder(int numOfReplicas,
Node writer,
final Set<Node> excludedNodes,
final long blocksize,
final int maxNodesPerRack,
final List<DatanodeStorageInfo> results,
final boolean avoidStaleNodes,
final boolean newBlock,
EnumMap<StorageType, Integer> storageTypes)
throws NotEnoughReplicasException {
final int numOfResults = results.size();
if (numOfResults == 0) {
// the first replica goes on the writer's local node/storage if possible
DatanodeStorageInfo storageInfo = chooseLocalStorage(writer,
excludedNodes, blocksize, maxNodesPerRack, results, avoidStaleNodes,
storageTypes, true);
writer = (storageInfo != null) ? storageInfo.getDatanodeDescriptor()
: null;
if (--numOfReplicas == 0) {
return writer;
}
}
final DatanodeDescriptor dn0 = results.get(0).getDatanodeDescriptor();

// the second replica is stored on a different rack
if (numOfResults <= 1) {
chooseRemoteRack(1, dn0, excludedNodes, blocksize, maxNodesPerRack,
results, avoidStaleNodes, storageTypes);
if (--numOfReplicas == 0) {
return writer;
}
}
if (numOfResults <= 2) {
final DatanodeDescriptor dn1 = results.get(1).getDatanodeDescriptor();
// if the first two replicas share a rack, put the third on another rack
if (clusterMap.isOnSameRack(dn0, dn1)) {
chooseRemoteRack(1, dn0, excludedNodes, blocksize, maxNodesPerRack,
results, avoidStaleNodes, storageTypes);
} else if (newBlock){
// for a new block, the third replica shares the second replica's rack
chooseLocalRack(dn1, excludedNodes, blocksize, maxNodesPerRack,
results, avoidStaleNodes, storageTypes);
} else {
// otherwise, place it on the writer's rack
chooseLocalRack(writer, excludedNodes, blocksize, maxNodesPerRack,
results, avoidStaleNodes, storageTypes);
}
if (--numOfReplicas == 0) {
return writer;
}
}
// any remaining replicas go on random nodes
chooseRandom(numOfReplicas, NodeBase.ROOT, excludedNodes, blocksize,
maxNodesPerRack, results, avoidStaleNodes, storageTypes);
return writer;
}
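The net effect for the common three-replica case: replica 1 on the writer's node, replica 2 on a different rack, replica 3 on the same rack as replica 2. A toy model of that rule (my own sketch under simplified assumptions, not HDFS code):

import java.util.*;

class ToyPlacement {
  static String rackOf(String node, Map<String, String> topology) {
    return topology.get(node);
  }

  static List<String> chooseThree(String writer, Map<String, String> topology) {
    List<String> result = new ArrayList<>();
    result.add(writer); // replica 1: the writer's own node (chooseLocalStorage)
    String writerRack = rackOf(writer, topology);
    // replica 2: any node on a different rack (chooseRemoteRack)
    String second = topology.keySet().stream()
        .filter(n -> !rackOf(n, topology).equals(writerRack))
        .findFirst().orElseThrow(IllegalStateException::new);
    result.add(second);
    // replica 3: another node on the SAME rack as replica 2 (chooseLocalRack)
    String secondRack = rackOf(second, topology);
    String third = topology.keySet().stream()
        .filter(n -> rackOf(n, topology).equals(secondRack) && !n.equals(second))
        .findFirst().orElseThrow(IllegalStateException::new);
    result.add(third);
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> topology = Map.of(
        "dn1", "/rack1", "dn2", "/rack1",
        "dn3", "/rack2", "dn4", "/rack2");
    System.out.println(chooseThree("dn1", topology)); // e.g. [dn1, dn3, dn4]
  }
}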
3.2.3 Pipeline setup: sending over the socket
Back in nextBlockOutputStream:

protected LocatedBlock nextBlockOutputStream() throws IOException {
LocatedBlock lb;
DatanodeInfo[] nodes;
StorageType[] nextStorageTypes;
String[] nextStorageIDs;
int count = dfsClient.getConf().getNumBlockWriteRetry();
boolean success;
final ExtendedBlock oldBlock = block.getCurrentBlock();
do {
errorState.resetInternalError();
lastException.clear();
DatanodeInfo[] excluded = getExcludedNodes();
// ask the NN which DNs the block should be written to
lb = locateFollowingBlock(
excluded.length > 0 ? excluded : null, oldBlock);
nodes = lb.getLocations();
nextStorageTypes = lb.getStorageTypes();
nextStorageIDs = lb.getStorageIDs();
// set up the block transfer pipeline
success = createBlockOutputStream(nodes, nextStorageTypes, nextStorageIDs,
0L, false);
... ...
} while (!success && --count >= 0);

if (!success) {
throw new IOException("Unable to create new block.");
}
return lb;
}
boolean createBlockOutputStream(DatanodeInfo[] nodes,
StorageType[] nodeStorageTypes, String[] nodeStorageIDs,
long newGS, boolean recoveryFlag) {
... ...
// create a socket to the first DN in the pipeline
s = createSocketForPipeline(nodes[0], nodes.length, dfsClient);
// get the output stream used to write data to the DN
OutputStream unbufOut = NetUtils.getOutputStream(s, writeTimeout);

InputStream unbufIn = NetUtils.getInputStream(s, readTimeout);

IOStreamPair saslStreams = dfsClient.saslClient.socketSend(s,
unbufOut, unbufIn, dfsClient, accessToken, nodes[0]);
unbufOut = saslStreams.out;
unbufIn = saslStreams.in;
out = new DataOutputStream(new BufferedOutputStream(unbufOut,
DFSUtilClient.getSmallBufferSize(dfsClient.getConfiguration())));
blockReplyStream = new DataInputStream(unbufIn);
// send the write-block request down the pipeline
new Sender(out).writeBlock(blockCopy, nodeStorageTypes[0], accessToken,
dfsClient.clientName, nodes, nodeStorageTypes, null, bcs,
nodes.length, block.getNumBytes(), bytesSent, newGS,
checksum4WriteBlock, cachingStrategy.get(), isLazyPersistFile,
(targetPinnings != null && targetPinnings[0]), targetPinnings,
nodeStorageIDs[0], nodeStorageIDs);

Sender.java
public void writeBlock(... ...) throws IOException {
... ...
send(out, Op.WRITE_BLOCK, proto.build());
}
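On the wire, Sender#send frames each request the same way: a 2-byte protocol version, a 1-byte opcode, then the varint-length-delimited protobuf body. A hedged sketch of that framing (the constant values are my recollection of the Hadoop sources; treat them as assumptions):

import java.io.DataOutputStream;
import java.io.IOException;
import com.google.protobuf.Message;

final class WireFrameSketch {
  static final short DATA_TRANSFER_VERSION = 28; // assumed value of DataTransferProtocol.DATA_TRANSFER_VERSION
  static final byte OP_WRITE_BLOCK = 80;         // assumed code of Op.WRITE_BLOCK

  // Mirrors what Sender#send puts on the socket before packets start flowing.
  static void send(DataOutputStream out, Message proto) throws IOException {
    out.writeShort(DATA_TRANSFER_VERSION); // protocol version
    out.writeByte(OP_WRITE_BLOCK);         // opcode, read back by readOp() on the DN side
    proto.writeDelimitedTo(out);           // OpWriteBlockProto payload
    out.flush();
  }
}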

3.2.4 Pipeline setup: receiving over the socket

DataXceiverServer.java (runs on each DataNode):

public void run() {
Peer peer = null;
while (datanode.shouldRun && !datanode.shutdownForUpgrade) {
try {
// accept an incoming socket request
peer = peerServer.accept();
// Make sure the xceiver count is not exceeded
int curXceiverCount = datanode.getXceiverCount();
if (curXceiverCount > maxXceiverCount) {
throw new IOException("Xceiver count " + curXceiverCount
+ " exceeds the limit of concurrent xcievers: "
+ maxXceiverCount);
}
// for each block the client sends, start a DataXceiver to handle it
new Daemon(datanode.threadGroup,
DataXceiver.create(peer, datanode, this))
.start();
} catch (SocketTimeoutException ignored) {
... ...
}
}
... ...
}

Click into DataXceiver (a thread) and find its run method:
public void run() {
int opsProcessed = 0;
Op op = null;
try {
synchronized(this) {
xceiver = Thread.currentThread();
}
dataXceiverServer.addPeer(peer, Thread.currentThread(), this);
peer.setWriteTimeout(datanode.getDnConf().socketWriteTimeout);
InputStream input = socketIn;
try {
IOStreamPair saslStreams = datanode.saslServer.receive(peer, socketOut,
socketIn, datanode.getXferAddress().getPort(), ... ...);
... ...
} catch (InvalidMagicNumberException imne) {
... ...
return;
}

super.initialize(new DataInputStream(input));
do {
updateCurrentThreadName("Waiting for operation #" + (opsProcessed + 1));
try {
if (opsProcessed != 0) {
assert dnConf.socketKeepaliveTimeout > 0;
peer.setReadTimeout(dnConf.socketKeepaliveTimeout);
} else {
peer.setReadTimeout(dnConf.socketTimeout);
}
// read the type of this request
op = readOp();
} catch (InterruptedIOException ignored) {
// Time out while we wait for client rpc
break;
} catch (EOFException | ClosedChannelException e) {
// Since we optimistically expect the next op, it's quite normal to
// get EOF here.
LOG.debug("Cached {} closing after {} ops. " +
"This message is usually benign.", peer, opsProcessed);
break;
} catch (IOException err) {
incrDatanodeNetworkErrors();
throw err;
}
// restore normal timeout
if (opsProcessed != 0) {
peer.setReadTimeout(dnConf.socketTimeout);
}

// dispatch on the operation type
processOp(op);

protected final void processOp(Op op) throws IOException {
switch(op) {
... ...
case WRITE_BLOCK:
opWriteBlock(in);
break;
... ...
default:
throw new IOException("Unknown op " + op + " in data stream");
}
}
private void opWriteBlock(DataInputStream in) throws IOException {
final OpWriteBlockProto proto = OpWriteBlockProto.parseFrom(vintPrefixed(in));
final DatanodeInfo[] targets = PBHelperClient.convert(proto.getTargetsList());
TraceScope traceScope = continueTraceSpan(proto.getHeader(),
proto.getClass().getSimpleName());
try {
writeBlock(PBHelperClient.convert(proto.getHeader().getBaseHeader().getBlock()),
PBHelperClient.convertStorageType(proto.getStorageType()),
PBHelperClient.convert(proto.getHeader().getBaseHeader().getToken()),
proto.getHeader().getClientName(),
targets,
PBHelperClient.convertStorageTypes(proto.getTargetStorageTypesList(),
targets.length),
PBHelperClient.convert(proto.getSource()),
fromProto(proto.getStage()),
proto.getPipelineSize(),
proto.getMinBytesRcvd(), proto.getMaxBytesRcvd(),
proto.getLatestGenerationStamp(),
fromProto(proto.getRequestedChecksum()),
(proto.hasCachingStrategy() ?
getCachingStrategy(proto.getCachingStrategy()) :
CachingStrategy.newDefaultStrategy()),
(proto.hasAllowLazyPersist() ? proto.getAllowLazyPersist() : false),
(proto.hasPinning() ? proto.getPinning(): false),
(PBHelperClient.convertBooleanList(proto.getTargetPinningsList())),
proto.getStorageId(),
proto.getTargetStorageIdsList().toArray(new String[0]));
} finally {
if (traceScope != null) traceScope.close();
}
}

public void writeBlock(... ...) throws IOException {
... ...
try {
final Replica replica;

if (isDatanode ||
stage != BlockConstructionStage.PIPELINE_CLOSE_RECOVERY) {
// open a block receiver
setCurrentBlockReceiver(getBlockReceiver(block, storageType, in,
peer.getRemoteAddressString(),
peer.getLocalAddressString(),
stage, latestGenerationStamp, minBytesRcvd, maxBytesRcvd,
clientname, srcDataNode, datanode, requestedChecksum,
cachingStrategy, allowLazyPersist, pinning, storageId));
replica = blockReceiver.getReplica();
} else {
replica = datanode.data.recoverClose(
block, latestGenerationStamp, minBytesRcvd);
}
storageUuid = replica.getStorageUuid();
isOnTransientStorage = replica.isOnTransientStorage();

// Connect to downstream machine, if appropriate

// connect to the next DN downstream in the pipeline
if (targets.length > 0) {
InetSocketAddress mirrorTarget = null;
// Connect to backup machine
mirrorNode = targets[0].getXferAddr(connectToDnViaHostname);
LOG.debug("Connecting to datanode {}", mirrorNode);
mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
// open a socket to the next replica's DN
mirrorSock = datanode.newSocket();
try {
… …
if (targetPinnings != null && targetPinnings.length > 0) {
// forward the write-block request to the downstream socket
new Sender(mirrorOut).writeBlock(originalBlock, targetStorageTypes[0],
blockToken, clientname, targets, targetStorageTypes,
srcDataNode, stage, pipelineSize, minBytesRcvd, maxBytesRcvd,
latestGenerationStamp, requestedChecksum, cachingStrategy,
allowLazyPersist, targetPinnings[0], targetPinnings,
targetStorageId, targetStorageIds);
} else {
new Sender(mirrorOut).writeBlock(originalBlock, targetStorageTypes[0],
blockToken, clientname, targets, targetStorageTypes,
srcDataNode, stage, pipelineSize, minBytesRcvd, maxBytesRcvd,
latestGenerationStamp, requestedChecksum, cachingStrategy,
allowLazyPersist, false, targetPinnings,
targetStorageId, targetStorageIds);
}
mirrorOut.flush();
DataNodeFaultInjector.get().writeBlockAfterFlush();
// read connect ack (only for clients, not for replication req)
if (isClient) {
BlockOpResponseProto connectAck =
BlockOpResponseProto.parseFrom(PBHelperClient.vintPrefixed(mirrorIn));
mirrorInStatus = connectAck.getStatus();
firstBadLink = connectAck.getFirstBadLink();
if (mirrorInStatus != SUCCESS) {
LOG.debug("Datanode {} got response for connect" +
"ack from downstream datanode with firstbadlink as {}",
targets.length, firstBadLink);
}
}
… …
//update metrics
datanode.getMetrics().addWriteBlockOp(elapsed());
datanode.getMetrics().incrWritesFromClient(peer.isLocal(), size);
}

BlockReceiver getBlockReceiver(
final ExtendedBlock block, final StorageType storageType,
final DataInputStream in,
final String inAddr, final String myAddr,
final BlockConstructionStage stage,
final long newGs, final long minBytesRcvd, final long maxBytesRcvd,
final String clientname, final DatanodeInfo srcDataNode,
final DataNode dn, DataChecksum requestedChecksum,
CachingStrategy cachingStrategy,
final boolean allowLazyPersist,
final boolean pinning,
final String storageId) throws IOException {
return new BlockReceiver(block, storageType, in,
inAddr, myAddr, stage, newGs, minBytesRcvd, maxBytesRcvd,
clientname, srcDataNode, dn, requestedChecksum,
cachingStrategy, allowLazyPersist, pinning, storageId);
}

BlockReceiver(final ExtendedBlock block, final StorageType storageType,
final DataInputStream in,
final String inAddr, final String myAddr,
final BlockConstructionStage stage,
final long newGs, final long minBytesRcvd, final long maxBytesRcvd,
final String clientname, final DatanodeInfo srcDataNode,
final DataNode datanode, DataChecksum requestedChecksum,
CachingStrategy cachingStrategy,
final boolean allowLazyPersist,
final boolean pinning,
final String storageId) throws IOException {
… …
if (isDatanode) { //replication or move
replicaHandler =
datanode.data.createTemporary(storageType, storageId, block, false);
} else {
switch (stage) {
case PIPELINE_SETUP_CREATE:
// set up the pipeline: create an RBW (replica-being-written) replica
replicaHandler = datanode.data.createRbw(storageType, storageId,
block, allowLazyPersist);

datanode.notifyNamenodeReceivingBlock(
block, replicaHandler.getReplica().getStorageUuid());
break;
… …
default: throw new IOException("Unsupported stage " + stage +
" while receiving block " + block + " from " + inAddr);
}
}
… …
}

FsDatasetImpl.java
public ReplicaHandler createRbw(
StorageType storageType, String storageId, ExtendedBlock b,
boolean allowLazyPersist) throws IOException {
try (AutoCloseableLock lock = datasetLock.acquire()) {
... ...
if (ref == null) {
// there may be several volumes; pick the next one for this replica
ref = volumes.getNextVolume(storageType, storageId, b.getNumBytes());
}
FsVolumeImpl v = (FsVolumeImpl) ref.getVolume();
// create an rbw file to hold block in the designated volume
if (allowLazyPersist && !v.isTransientStorage()) {
datanode.getMetrics().incrRamDiskBlocksWriteFallback();
}
ReplicaInPipeline newReplicaInfo;
try {
// create the temporary rbw file that the output stream writes into
newReplicaInfo = v.createRbw(b);
if (newReplicaInfo.getReplicaInfo().getState() != ReplicaState.RBW) {
throw new IOException("CreateRBW returned a replica of state "
+ newReplicaInfo.getReplicaInfo().getState()
+ " for block " + b.getBlockId());
}
} catch (IOException e) {
IOUtils.cleanup(null, ref);
throw e;
}
volumeMap.add(b.getBlockPoolId(), newReplicaInfo.getReplicaInfo());
return new ReplicaHandler(newReplicaInfo, ref);
}
}

Click into v.createRbw (FsVolumeImpl.java), which creates the rbw file on the chosen volume:
public ReplicaInPipeline createRbw(ExtendedBlock b) throws IOException {
File f = createRbwFile(b.getBlockPoolId(), b.getLocalBlock());
LocalReplicaInPipeline newReplicaInfo = new ReplicaBuilder(ReplicaState.RBW)
.setBlockId(b.getBlockId())
.setGenerationStamp(b.getGenerationStamp())
.setFsVolume(this)
.setDirectoryToUse(f.getParentFile())
.setBytesToReserve(b.getNumBytes())
.buildLocalReplicaInPipeline();
return newReplicaInfo;
}

3.2.5 The client receives the DNs' write acks (ResponseProcessor)
@Override
public void run() {
long lastPacket = Time.monotonicNow();
TraceScope scope = null;
while (!streamerClosed && dfsClient.clientRunning) {
// if the Responder encountered an error, shutdown Responder
if (errorState.hasError()) {
closeResponder();
}

DFSPacket one;
try {
// process datanode IO errors if any
boolean doSleep = processDatanodeOrExternalError();
final int halfSocketTimeout = dfsClient.getConf().getSocketTimeout()/2;
synchronized (dataQueue) {
// wait for a packet to be sent.
long now = Time.monotonicNow();
while ((!shouldStop() && dataQueue.size() == 0 &&
(stage != BlockConstructionStage.DATA_STREAMING ||
now - lastPacket < halfSocketTimeout)) || doSleep) {
long timeout = halfSocketTimeout - (now-lastPacket);
timeout = timeout <= 0 ? 1000 : timeout;
timeout = (stage == BlockConstructionStage.DATA_STREAMING)?
timeout : 1000;
try {
// if dataQueue is empty, the code blocks here
dataQueue.wait(timeout); // woken up by a notify
} catch (InterruptedException e) {
LOG.warn("Caught exception", e);
}
doSleep = false;
now = Time.monotonicNow();
}
if (shouldStop()) {
continue;
}
// get packet to be sent.
if (dataQueue.isEmpty()) {
one = createHeartbeatPacket();
} else {
try {
backOffIfNecessary();
} catch (InterruptedException e) {
LOG.warn("Caught exception", e);
}
// the queue is not empty: take the first packet
one = dataQueue.getFirst(); // regular data packet
SpanId[] parents = one.getTraceParents();
if (parents.length > 0) {
scope = dfsClient.getTracer().
newScope("dataStreamer", parents[0]);
scope.getSpan().setParents(parents);
}
}
}
// get new block from namenode.
if (LOG.isDebugEnabled()) {
LOG.debug("stage=" + stage + ", " + this);
}
if (stage == BlockConstructionStage.PIPELINE_SETUP_CREATE) {
LOG.debug("Allocating new block: {}", this);
// step 1: ask the NameNode for a block and set up the data pipeline
setPipeline(nextBlockOutputStream());
// step 2: start the ResponseProcessor that listens for packet acks
initDataStreaming();
} else if (stage == BlockConstructionStage.PIPELINE_SETUP_APPEND) {
LOG.debug("Append to block {}", block);
setupPipelineForAppendOrRecovery();
if (streamerClosed) {
continue;
}
initDataStreaming();
}
long lastByteOffsetInBlock = one.getLastByteOffsetBlock();
if (lastByteOffsetInBlock > stat.getBlockSize()) {
throw new IOException("BlockSize " + stat.getBlockSize() +
" < lastByteOffsetInBlock, " + this + ", " + one);
}
if (one.isLastPacketInBlock()) {
// wait for all data packets have been successfully acked
synchronized (dataQueue) {
while (!shouldStop() && ackQueue.size() != 0) {
try {
// wait for acks to arrive from datanodes
dataQueue.wait(1000);
} catch (InterruptedException e) {
LOG.warn("Caught exception", e);
}
}
}
if (shouldStop()) {
continue;
}
stage = BlockConstructionStage.PIPELINE_CLOSE;
}
// send the packet
SpanId spanId = SpanId.INVALID;
synchronized (dataQueue) {
// move packet from dataQueue to ackQueue
if (!one.isHeartbeatPacket()) {
if (scope != null) {
spanId = scope.getSpanId();
scope.detach();
one.setTraceScope(scope);
}
scope = null;
// step 3: remove this packet from dataQueue
dataQueue.removeFirst();
// step 4: add it to ackQueue until the DNs acknowledge it
ackQueue.addLast(one);
packetSendTime.put(one.getSeqno(), Time.monotonicNow());
dataQueue.notifyAll();
}
// write out data to remote datanode
try (TraceScope ignored = dfsClient.getTracer().
newScope("DataStreamer#writeTo", spanId)) {
// send the data
one.writeTo(blockStream);
blockStream.flush();
} catch (IOException e) {
errorState.markFirstNodeIfNotMarked();
throw e;
}
lastPacket = Time.monotonicNow();
// update bytesSent
long tmpBytesSent = one.getLastByteOffsetBlock();
if (bytesSent < tmpBytesSent) {
bytesSent = tmpBytesSent;
}
if (shouldStop()) {
continue;
}
// Is this block full?
if (one.isLastPacketInBlock()) {
// wait for the close packet has been acked
synchronized (dataQueue) {
while (!shouldStop() && ackQueue.size() != 0) {
dataQueue.wait(1000);// wait for acks to arrive from datanodes
}
}
if (shouldStop()) {
continue;
}
endBlock();
}
if (progress != null) { progress.progress(); }
// This is used by unit test to trigger race conditions.
if (artificialSlowdown != 0 && dfsClient.clientRunning) {
Thread.sleep(artificialSlowdown);
}
} catch (Throwable e) {
… …
} finally {
if (scope != null) {
scope.close();
scope = null;
}
}
}
closeInternal();
}
private void initDataStreaming() {
this.setName("DataStreamer for file " + src +
" block " + block);
... ...
response = new ResponseProcessor(nodes);
response.start();
stage = BlockConstructionStage.DATA_STREAMING;
}

ResponseProcessor.run(): once a packet has been acknowledged by every DN in the pipeline, it is removed from ackQueue:
public void run() {
... ...
ackQueue.removeFirst();
packetSendTime.remove(seqno);
dataQueue.notifyAll();
... ...
}
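Tying this back to the toy PacketQueues sketch from 3.2.1: a successful ack retires the head of ackQueue, while a pipeline failure moves every unacked packet back onto the front of dataQueue so the rebuilt pipeline can resend them (cf. processDatanodeOrExternalError). Sketch methods for the toy class (my assumptions, not HDFS code):

// Additional methods for the PacketQueues toy model from 3.2.1.
void onAckSuccess() {
  synchronized (dataQueue) {
    ackQueue.removeFirst(); // the packet is durable on every DN in the pipeline
    dataQueue.notifyAll();  // unblock a writer waiting on back-pressure
  }
}

void onPipelineFailure() {
  synchronized (dataQueue) {
    // push unacked packets back onto the front of dataQueue for resending
    dataQueue.addAll(0, ackQueue);
    ackQueue.clear();
  }
}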

Summary:

This post walked the HDFS upload path end to end: create on the client side, file creation in the NameNode's namespace, the chunk/packet queues inside DataStreamer, rack-aware pipeline setup, socket send and receive on the DataNodes, and the ack path back to the client. It mainly records my own understanding and surely has gaps; corrections are welcome.

