HDFS Client File Write Path: Source Code Analysis
Key Concepts in HDFS File Writes
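An HDFS file is made up of multiple blocks. Blocks are read and written in units of packets (64 KB each by default). Each packet consists of a number of chunks (512 bytes each by default). The chunk is the basic unit of data verification: a checksum (4 bytes by default) is computed for every chunk and stored alongside it.
In other words, when a block is written, the unit of data transfer is the packet, and the unit of checksumming is the chunk.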
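The relationship between these sizes can be worked through with a little arithmetic. The sketch below uses the default values quoted above; the class and method names are made up for illustration, and the packet header is deliberately ignored, so the numbers are an approximation rather than what a real client computes.

```java
// Sketch: how chunks, checksums and packets relate, using the defaults quoted above.
// Assumption: the packet header is ignored, so real packets carry slightly less.
class PacketLayout {
    static final int PACKET_SIZE = 64 * 1024; // 64 KB packet
    static final int CHUNK_SIZE = 512;        // bytes of data per chunk
    static final int CHECKSUM_SIZE = 4;       // bytes of checksum per chunk

    // Number of whole (data + checksum) chunks that fit in one packet, at least 1.
    static int chunksPerPacket(int packetSize, int chunkSize, int checksumSize) {
        return Math.max(packetSize / (chunkSize + checksumSize), 1);
    }

    // Number of packets needed to carry dataLen bytes of user data (rounded up).
    static long packetsFor(long dataLen, int chunksPerPacket, int chunkSize) {
        long dataPerPacket = (long) chunksPerPacket * chunkSize;
        return (dataLen + dataPerPacket - 1) / dataPerPacket;
    }

    public static void main(String[] args) {
        int n = chunksPerPacket(PACKET_SIZE, CHUNK_SIZE, CHECKSUM_SIZE);
        System.out.println("chunks per packet: " + n);
        System.out.println("data bytes per packet: " + n * CHUNK_SIZE);
        System.out.println("packets for 1 MB: " + packetsFor(1L << 20, n, CHUNK_SIZE));
    }
}
```

Each chunk costs 516 bytes on the wire (512 of data plus 4 of checksum), which is why a 64 KB packet carries a little less than 64 KB of user data.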
Client
Sample code: writing a file from an HDFS client
FileSystem hdfs = FileSystem.get(new Configuration());
Path path = new Path("/testfile");
// writing
FSDataOutputStream dos = hdfs.create(path);
byte[] readBuf = "Hello World".getBytes("UTF-8");
dos.write(readBuf, 0, readBuf.length);
dos.close();
hdfs.close();
Opening the file
Uploading a file to HDFS usually goes through DistributedFileSystem.create, which is implemented as follows:
public FSDataOutputStream create(Path f, FsPermission permission,boolean overwrite,int bufferSize, short replication, long blockSize,Progressable progress) throws IOException {
return new FSDataOutputStream
(dfs.create(getPathName(f), permission,overwrite, replication, blockSize, progress, bufferSize),
statistics);
}
It returns an FSDataOutputStream used to write data into the newly created file. The member variable dfs is of type DFSClient, whose create function is as follows:
public OutputStream create(String src,FsPermission permission,boolean overwrite,short replication,long blockSize,Progressable progress,int buffersize) throws IOException {
checkOpen();
if (permission == null) {
permission = FsPermission.getDefault();
}
FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));
OutputStream result = new DFSOutputStream(src, masked,overwrite, replication, blockSize, progress, buffersize,
conf.getInt("io.bytes.per.checksum", 512));
leasechecker.put(src, result);
return result;
}
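This constructs a DFSOutputStream, whose constructor calls the NameNode's create over RPC to create the file.
The constructor also does one other important thing: it calls streamer.start(), which starts the DataStreamer thread that drives the write pipeline. We analyze it in detail in the section on writing data.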
DFSOutputStream(String src, FsPermission masked, boolean overwrite,short replication, long blockSize, Progressable progress,
int buffersize, int bytesPerChecksum) throws IOException {
this(src, blockSize, progress, bytesPerChecksum);
computePacketChunkSize(writePacketSize, bytesPerChecksum);
try {
namenode.create(src, masked, clientName, overwrite, replication, blockSize);
} catch(RemoteException re) {
throw re.unwrapRemoteException(AccessControlException.class,QuotaExceededException.class);
}
streamer.start();
}
NameNode
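The RPC call to the NameNode's create function invokes namesystem.startFile, which in turn calls startFileInternal. That function creates a new file in the under construction state, with no data blocks associated with it yet.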
private synchronized void startFileInternal(String src, PermissionStatus permissions,String holder,
String clientMachine, boolean overwrite, boolean append, short replication, long blockSize) throws IOException {
......
// Create a new file in the under construction state, with no data blocks yet
long genstamp = nextGenerationStamp();
INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
replication, blockSize, holder, clientMachine, clientNode, genstamp);
......
}
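Writing the file on the client
Now the client writes data into the newly created file, usually through the write method of FSDataOutputStream.
By design, HDFS writes block data through a pipeline: the data is split into packets, and if three replicas are required, to be written to DataNode 1, 2, and 3, the process is as follows:
First, packet 1 is written to DataNode 1.
DataNode 1 then forwards packet 1 to DataNode 2, while the client can already write packet 2 to DataNode 1.
Next, DataNode 2 forwards packet 1 to DataNode 3, while the client writes packet 3 to DataNode 1 and DataNode 1 forwards packet 2 to DataNode 2.
The packets are passed down the line one after another in this way until all of the data has been written and replicated.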
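The overlap in the steps above can be made concrete with a small simulation. This is only a sketch under a simplifying assumption: every hop takes exactly one time step, so packet p reaches DataNode d at step (p - 1) + (d - 1). The class and method names are hypothetical and nothing here talks to a real cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: which transfers happen in parallel at each step of the write pipeline.
// Assumption: every hop takes exactly one step.
class PipelineSchedule {
    // Transfers in flight at step t (0-based) for the given packet and node counts.
    static List<String> step(int t, int packets, int nodes) {
        List<String> moves = new ArrayList<>();
        for (int d = 1; d <= nodes; d++) {
            int p = t - (d - 1) + 1; // packet arriving at DataNode d during step t
            if (p >= 1 && p <= packets) {
                String from = (d == 1) ? "client" : "DataNode " + (d - 1);
                moves.add(from + " -> DataNode " + d + ": packet " + p);
            }
        }
        return moves;
    }

    // Steps needed until the last packet reaches the last DataNode.
    static int totalSteps(int packets, int nodes) {
        return packets + nodes - 1;
    }

    public static void main(String[] args) {
        int packets = 3, nodes = 3;
        for (int t = 0; t < totalSteps(packets, nodes); t++) {
            System.out.println("step " + (t + 1) + ": " + step(t, packets, nodes));
        }
    }
}
```

With three packets and three DataNodes the pipeline finishes in five steps instead of the nine that strictly sequential copying would need, which is the point of pipelining.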
The write method of FSDataOutputStream calls the write method of DFSOutputStream. Since DFSOutputStream extends FSOutputSummer, the method actually invoked is FSOutputSummer's write, shown below:
public synchronized void write(int b) throws IOException {
this.buf[this.count++] = (byte)b;
if (this.count == this.buf.length) {
this.flushBuffer(); // eventually calls writeChecksumChunks
}
}
writeChecksumChunks is implemented as follows:
private void writeChecksumChunks(byte[] b, int off, int len) throws IOException {
this.sum.calculateChunkedSums(b, off, len, this.checksum, 0);
for(int i = 0; i < len; i += this.sum.getBytesPerChecksum()) {
int chunkLen = Math.min(this.sum.getBytesPerChecksum(), len - i);
int ckOffset = i / this.sum.getBytesPerChecksum() * this.getChecksumSize();
this.writeChunk(b, off + i, chunkLen, this.checksum, ckOffset, this.getChecksumSize());
}
}
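The per-chunk checksum loop can be reproduced outside Hadoop. The following is a minimal sketch, assuming CRC32 from java.util.zip as a stand-in for whatever checksum type the cluster is configured with; the class and method names are hypothetical and this is not Hadoop's own checksum class.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Sketch: compute one 4-byte checksum per chunk, mirroring the loop structure above.
// Assumption: java.util.zip.CRC32 stands in for the configured HDFS checksum.
class ChunkedChecksum {
    static final int CHECKSUM_SIZE = 4;

    // Returns the concatenated checksums for data[off .. off+len), one per chunk.
    static byte[] checksums(byte[] data, int off, int len, int bytesPerChecksum) {
        int chunks = (len + bytesPerChecksum - 1) / bytesPerChecksum;
        ByteBuffer out = ByteBuffer.allocate(chunks * CHECKSUM_SIZE);
        CRC32 crc = new CRC32();
        for (int i = 0; i < len; i += bytesPerChecksum) {
            // The last chunk may be shorter than bytesPerChecksum.
            int chunkLen = Math.min(bytesPerChecksum, len - i);
            crc.reset();
            crc.update(data, off + i, chunkLen);
            out.putInt((int) crc.getValue());
        }
        return out.array();
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300]; // 2 full chunks plus one partial chunk
        byte[] sums = checksums(data, 0, data.length, 512);
        System.out.println("checksum bytes: " + sums.length);
    }
}
```

The checksum for chunk k sits at byte offset k * 4 in the result, which matches the ckOffset computation in the excerpt above.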
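writeChunk is implemented by the subclass DFSOutputStream, as shown below. Note that this excerpt uses a four-argument signature, whereas the call above passes six arguments; the two excerpts appear to come from different Hadoop versions.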
protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException {
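int cklen = checksum.length;
// Create a packet if none is in progress, then write the chunk into it
if (currentPacket == null) {
currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock);
}
currentPacket.writeChecksum(checksum, 0, cklen);
currentPacket.writeData(b, offset, len);
currentPacket.numChunks++;
bytesCurBlock += len;
// If the packet is full, put it on the queue to be sent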
if (currentPacket.numChunks == currentPacket.maxChunks ||
bytesCurBlock == blockSize) {
......
dataQueue.addLast(currentPacket);
// Wake up the transfer thread waiting on dataQueue, i.e. the DataStreamer
dataQueue.notifyAll();
currentPacket = null;
......
}
}
The run function of DataStreamer is as follows:
public void run() {
while (!closed && clientRunning) {
Packet one = null;
synchronized (dataQueue) {
// Wait while the queue holds no packets
while ((!closed && !hasError && clientRunning
&& dataQueue.size() == 0) || doSleep) {
try {
dataQueue.wait(1000);
} catch (InterruptedException e) {
}
doSleep = false;
}
try {
// Take the first packet in the queue
one = dataQueue.getFirst();
long offsetInBlock = one.offsetInBlock;
// Have the NameNode allocate a block, and open an output stream to that block
if (blockStream == null) {
nodes = nextBlockOutputStream(src);
response = new ResponseProcessor(nodes);
response.start();
}
ByteBuffer buf = one.getBuffer();
// Move the packet from dataQueue to ackQueue to await acknowledgement
dataQueue.removeFirst();
dataQueue.notifyAll();
synchronized (ackQueue) {
ackQueue.addLast(one);
ackQueue.notifyAll();
}
// Write the packet to the block on the DataNode through the output stream
blockStream.write(buf.array(), buf.position(), buf.remaining());
if (one.lastPacketInBlock) {
blockStream.writeInt(0); // marks the end of this block
}
blockStream.flush();
} catch (Throwable e) {
}
}
......
}
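The handoff between the two queues is the core of this loop, so it may help to see it in isolation. The sketch below is single-threaded and leaves out all locking, wait/notify, and network I/O; packets are represented only by their sequence numbers. The class is hypothetical, and the requeueUnacked step goes beyond the excerpt: it reflects the assumption that, after a pipeline failure, packets that were sent but not acknowledged have to be sent again.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: the dataQueue -> ackQueue handoff, without threads or sockets.
// Assumption: packets are identified by sequence number only.
class TwoQueueSender {
    final Deque<Long> dataQueue = new ArrayDeque<>(); // packets waiting to be sent
    final Deque<Long> ackQueue = new ArrayDeque<>();  // packets sent, awaiting ack

    void enqueue(long seqno) {
        dataQueue.addLast(seqno);
    }

    // Streamer step: move the head of dataQueue to ackQueue, then "send" it.
    // Returns null when there is nothing to send.
    Long sendNext() {
        Long one = dataQueue.pollFirst();
        if (one != null) {
            ackQueue.addLast(one);
        }
        return one;
    }

    // Responder step: acks are expected in the order the packets were sent.
    void ack(long seqno) {
        Long head = ackQueue.peekFirst();
        if (head == null || head.longValue() != seqno) {
            throw new IllegalStateException("unexpected ack for seqno " + seqno);
        }
        ackQueue.removeFirst();
    }

    // Assumed recovery step: unacked packets go back to the front of dataQueue,
    // preserving their original order, so that they are sent again.
    void requeueUnacked() {
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.removeLast());
        }
    }

    public static void main(String[] args) {
        TwoQueueSender s = new TwoQueueSender();
        for (long i = 1; i <= 3; i++) s.enqueue(i);
        s.sendNext();
        s.sendNext();
        s.ack(1);
        s.requeueUnacked();
        System.out.println("to send: " + s.dataQueue + ", awaiting ack: " + s.ackQueue);
    }
}
```

Keeping a sent packet in ackQueue until it is acknowledged is what lets the client retransmit it; a packet is only dropped once every DataNode in the pipeline has confirmed it.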
An important function here is nextBlockOutputStream, implemented as follows:
private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {
LocatedBlock lb = null;
boolean retry = false;
DatanodeInfo[] nodes;
int count = conf.getInt("dfs.client.block.write.retries", 3);
boolean success;
do {
......
// Have the NameNode allocate a block and DataNodes for the file
lb = locateFollowingBlock(startTime);
block = lb.getBlock();
nodes = lb.getLocations();
// Open the output stream to the DataNode
success = createBlockOutputStream(nodes, clientName, false);
......
} while (retry && --count >= 0);
return nodes;
}
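The loop condition `retry && --count >= 0` is easy to misread, so here is a standalone sketch of just that control flow. It assumes that retry is set whenever an attempt fails (that part is elided in the excerpt); the class name and the BooleanSupplier parameter are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Sketch: the do/while retry shape used above, isolated from HDFS.
// Assumption: retry is true exactly when the attempt failed.
class BoundedRetry {
    // Runs attempt until it succeeds or the retry budget is used up.
    // Returns the number of attempts that were made.
    static int run(BooleanSupplier attempt, int retries) {
        int attempts = 0;
        int count = retries;
        boolean retry;
        do {
            attempts++;
            retry = !attempt.getAsBoolean();
        } while (retry && --count >= 0);
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println("always failing: " + run(() -> false, 3) + " attempts");
        System.out.println("succeeds at once: " + run(() -> true, 3) + " attempts");
    }
}
```

Under this reading, a retry count of 3 allows up to four attempts in total: the initial one plus three retries.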
locateFollowingBlock calls namenode.addBlock(src, clientName) over RPC.
NameNode
The NameNode's addBlock function is implemented as follows:
public LocatedBlock addBlock(String src,
String clientName) throws IOException {
LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);
return locatedBlock;
}
FSNamesystem's getAdditionalBlock is implemented as follows:
public LocatedBlock getAdditionalBlock(String src, String clientName) throws IOException {
long fileLength, blockSize;
int replication;
DatanodeDescriptor clientNode = null;
Block newBlock = null;
......
// Choose DataNodes for the new block
DatanodeDescriptor targets[] = replicator.chooseTarget(replication, clientNode, null, blockSize);
......
// Get the INodes of every component on the file path; the last one is the INode of the newly added file, in the under construction state
INode[] pathINodes = dir.getExistingPathINodes(src);
int inodesLen = pathINodes.length;
INodeFileUnderConstruction pendingFile = (INodeFileUnderConstruction)
pathINodes[inodesLen - 1];
// Allocate a block for the file and record which DataNodes it will be written to
newBlock = allocateBlock(src, pathINodes);
pendingFile.setTargets(targets);
......
return new LocatedBlock(newBlock, targets, fileLength);
}
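Client
Once the DataNodes and the block have been allocated, createBlockOutputStream connects to the first DataNode and sends the write request header, which sets up the pipeline for the data that follows.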
private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) {
// Create a socket and connect to the first DataNode
InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
s = socketFactory.createSocket();
int timeoutValue = 3000 * nodes.length + socketTimeout;
s.connect(target, timeoutValue);
s.setSoTimeout(timeoutValue);
s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length + datanodeWriteTimeout;
DataOutputStream out = new DataOutputStream(new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout), DataNode.SMALL_BUFFER_SIZE));
blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));
// Write the request header
out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
out.write( DataTransferProtocol.OP_WRITE_BLOCK );
out.writeLong( block.getBlockId() );
out.writeLong( block.getGenerationStamp() );
out.writeInt( nodes.length );
out.writeBoolean( recoveryFlag );
Text.writeString( out, client );
out.writeBoolean(false);
out.writeInt( nodes.length - 1 );
// Note that this loop starts at 1, not 0: the information for the two DataNodes other than the first is sent to the first DataNode, which uses it to forward the data to them
for (int i = 1; i < nodes.length; i++) {
nodes[i].write(out);
}
checksum.writeHeader( out );
out.flush();
firstBadLink = Text.readString(blockReplyStream);
if (firstBadLink.length() != 0) {
throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);
}
blockStream = out;
}
Once the DataStreamer's run function has created the output stream, the client calls blockStream.write to send data to the DataNode.
DataNode
In the DataNode's DataXceiver, receiving the DataTransferProtocol.OP_WRITE_BLOCK opcode triggers a call to writeBlock:
private void writeBlock(DataInputStream in) throws IOException {
DatanodeInfo srcDataNode = null;
// Read the header
Block block = new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong());
int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
boolean isRecovery = in.readBoolean(); // is this part of recovery?
String client = Text.readString(in); // working on behalf of this client
boolean hasSrcDataNode = in.readBoolean(); // is src node info present
if (hasSrcDataNode) {
srcDataNode = new DatanodeInfo();
srcDataNode.readFields(in);
}
int numTargets = in.readInt();
if (numTargets < 0) {
throw new IOException("Mislabelled incoming datastream.");
}
// Read the rest of the DataNode list. On the first DataNode this list holds the second and third DataNodes; on the second DataNode it holds only the third
DatanodeInfo targets[] = new DatanodeInfo[numTargets];
for (int i = 0; i < targets.length; i++) {
DatanodeInfo tmp = new DatanodeInfo();
tmp.readFields(in);
targets[i] = tmp;
}
DataOutputStream mirrorOut = null; // stream to next target
DataInputStream mirrorIn = null; // reply from next target
DataOutputStream replyOut = null; // stream to prev target
Socket mirrorSock = null; // socket to next target
BlockReceiver blockReceiver = null; // responsible for data handling
String mirrorNode = null; // the name:port of next target
String firstBadLink = ""; // first datanode that failed in connection setup
try {
// Create a BlockReceiver. Its member DataInputStream in reads data from the client or the previous DataNode, its member DataOutputStream mirrorOut writes data to the next DataNode, and its member OutputStream out writes data to local storage.
blockReceiver = new BlockReceiver(block, in, s.getRemoteSocketAddress().toString(), s.getLocalSocketAddress().toString(), isRecovery, client, srcDataNode, datanode);
// get a connection back to the previous target
replyOut = new DataOutputStream(NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
// If this is not the last DataNode, open a socket connection to the next DataNode
if (targets.length > 0) {
InetSocketAddress mirrorTarget = null;
// Connect to backup machine
mirrorNode = targets[0].getName();
mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
mirrorSock = datanode.newSocket();
int timeoutValue = numTargets * datanode.socketTimeout;
int writeTimeout = datanode.socketWriteTimeout +
(HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
mirrorSock.connect(mirrorTarget, timeoutValue);
mirrorSock.setSoTimeout(timeoutValue);
mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
// Create the stream for writing data to the next DataNode
mirrorOut = new DataOutputStream(new BufferedOutputStream(NetUtils.getOutputStream(mirrorSock, writeTimeout), SMALL_BUFFER_SIZE));
mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));
mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
mirrorOut.writeLong( block.getBlockId() );
mirrorOut.writeLong( block.getGenerationStamp() );
mirrorOut.writeInt( pipelineSize );
mirrorOut.writeBoolean( isRecovery );
Text.writeString( mirrorOut, client );
mirrorOut.writeBoolean(hasSrcDataNode);
if (hasSrcDataNode) { // pass src node information
srcDataNode.write(mirrorOut);
}
mirrorOut.writeInt( targets.length - 1 );
// This loop also starts at 1: the information for every remaining DataNode except the next one is sent to the next DataNode
for ( int i = 1; i < targets.length; i++ ) {
targets[i].write( mirrorOut );
}
blockReceiver.writeChecksumHeader(mirrorOut);
mirrorOut.flush();
}
// Receive the block with the BlockReceiver
String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
mirrorAddr, null, targets.length);
......
} finally {
// close all opened streams
IOUtils.closeStream(mirrorOut);
IOUtils.closeStream(mirrorIn);
IOUtils.closeStream(replyOut);
IOUtils.closeSocket(mirrorSock);
IOUtils.closeStream(blockReceiver);
}
}
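Both the client and each DataNode skip index 0 when forwarding the target list, so the list shrinks by one at every hop. The sketch below traces that for an arbitrary pipeline; DataNodes are plain strings here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: how the list of downstream targets shrinks at each hop of the pipeline.
// Assumption: DataNodes are represented by plain names instead of DatanodeInfo.
class TargetForwarding {
    // What a sender transmits: every node except the one it connects to (index 0).
    static String[] downstream(String[] nodes) {
        if (nodes.length == 0) {
            throw new IllegalArgumentException("pipeline is empty");
        }
        return Arrays.copyOfRange(nodes, 1, nodes.length);
    }

    // One line per connection: who connects to whom, and which targets are passed on.
    static List<String> hops(String[] pipeline) {
        List<String> out = new ArrayList<>();
        String sender = "client";
        String[] remaining = pipeline;
        while (remaining.length > 0) {
            String receiver = remaining[0];
            String[] forwarded = downstream(remaining);
            out.add(sender + " -> " + receiver + " targets=" + Arrays.toString(forwarded));
            sender = receiver;
            remaining = forwarded;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String hop : hops(new String[] {"dn1", "dn2", "dn3"})) {
            System.out.println(hop);
        }
    }
}
```

The last DataNode receives an empty target list, which corresponds to the `targets.length > 0` check above being false, so it opens no mirror connection.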
In BlockReceiver's receiveBlock function, one important piece of logic is the following:
void receiveBlock(
DataOutputStream mirrOut, // output to next datanode
DataInputStream mirrIn, // input from next datanode
DataOutputStream replyOut, // output to previous datanode
String mirrAddr, BlockTransferThrottler throttlerArg,
int numTargets) throws IOException {
......
// Keep receiving packets until the block ends
while (receivePacket() > 0) {}
if (mirrorOut != null) {
try {
mirrorOut.writeInt(0); // mark the end of the block
mirrorOut.flush();
} catch (IOException e) {
handleMirrorOutError(e);
}
}
......
}
BlockReceiver's receivePacket function is as follows:
private int receivePacket() throws IOException {
// Receive one packet from the client or the previous node
int payloadLen = readNextPacket();
buf.mark();
//read the header
buf.getInt(); // packet length
offsetInBlock = buf.getLong(); // get offset of packet in block
long seqno = buf.getLong(); // get seqno
boolean lastPacketInBlock = (buf.get() != 0);
int endOfHeader = buf.position();
buf.reset();
setBlockPosition(offsetInBlock);
// Forward the packet to the next DataNode
if (mirrorOut != null) {
try {
mirrorOut.write(buf.array(), buf.position(), buf.remaining());
mirrorOut.flush();
} catch (IOException e) {
handleMirrorOutError(e);
}
}
buf.position(endOfHeader);
int len = buf.getInt();
offsetInBlock += len;
int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*
checksumSize;
int checksumOff = buf.position();
int dataOff = checksumOff + checksumLen;
byte pktBuf[] = buf.array();
buf.position(buf.limit()); // move to the end of the data.
......
// Write the data to the local block file
out.write(pktBuf, dataOff, len);
/// flush entire packet before sending ack
flush();
// put in queue for pending acks
if (responder != null) {
((PacketResponder)responder.getRunnable()).enqueue(seqno,
lastPacketInBlock);
}
return payloadLen;
}
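The header parsing in receivePacket can be tried out with a plain ByteBuffer. The sketch below follows the field order read in the excerpt (packet length, offset in block, sequence number, last-packet flag, data length, then checksums, then data). The build helper and the way the packet length field is filled in are assumptions made for the demonstration, as are all class and method names.

```java
import java.nio.ByteBuffer;

// Sketch: build a packet in the field order read by receivePacket, then parse it back.
// Assumption: the packet length field counts the data length int, checksums and data.
class PacketHeaderSketch {
    static final int HEADER_LEN = 4 + 8 + 8 + 1; // length, offset, seqno, last flag

    static ByteBuffer build(long offsetInBlock, long seqno, boolean last,
                            byte[] checksums, byte[] data) {
        int pktLen = 4 + checksums.length + data.length;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN + pktLen);
        buf.putInt(pktLen);
        buf.putLong(offsetInBlock);
        buf.putLong(seqno);
        buf.put((byte) (last ? 1 : 0));
        buf.putInt(data.length);
        buf.put(checksums);
        buf.put(data);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    // Returns {offsetInBlock, seqno, lastFlag, dataLen, checksumOff, dataOff}.
    static long[] parse(ByteBuffer buf, int bytesPerChecksum, int checksumSize) {
        buf.getInt(); // packet length, not needed here
        long offsetInBlock = buf.getLong();
        long seqno = buf.getLong();
        boolean last = buf.get() != 0;
        int len = buf.getInt();
        // One checksum per chunk, with the final partial chunk rounded up.
        int checksumLen = ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
        int checksumOff = buf.position();
        int dataOff = checksumOff + checksumLen;
        return new long[] {offsetInBlock, seqno, last ? 1 : 0, len, checksumOff, dataOff};
    }

    public static void main(String[] args) {
        ByteBuffer pkt = build(1024L, 7L, true, new byte[8], new byte[1000]);
        long[] f = parse(pkt, 512, 4);
        System.out.println("seqno=" + f[1] + " dataLen=" + f[3]
                + " checksumOff=" + f[4] + " dataOff=" + f[5]);
    }
}
```

Because the checksums for all chunks come before the data, the receiver has to compute checksumLen from the data length before it knows where the data starts, which is what the dataOff calculation in the excerpt does.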