This article dissects how Hadoop's Java client implements writing a file. It does not dig into what the namenode and datanode do on their side; the focus is on the client.
Background: a brief walkthrough of the client-namenode-datanode write flow (enough if you just want a general picture of Hadoop): http://www.cnblogs.com/duguguiyu/archive/2009/02/22/1396034.html
First, the FileSystem is loaded; connecting to an HDFS cluster creates a DistributedFileSystem. Let's start from DistributedFileSystem.create:
@Override
public HdfsDataOutputStream create(Path f, FsPermission permission,
    EnumSet<CreateFlag> cflags, int bufferSize, short replication, long blockSize,
    Progressable progress) throws IOException {
  statistics.incrementWriteOps(1);
  final DFSOutputStream out = dfs.create(getPathName(f), permission, cflags,
      replication, blockSize, progress, bufferSize);
  return new HdfsDataOutputStream(out, statistics);
}
When we write Java programs against Hadoop we mostly use FSDataOutputStream from hadoop-common; HdfsDataOutputStream extends FSDataOutputStream and is essentially its hadoop-hdfs counterpart. The benefit of this wrapping layer is that a user only needs to depend on FSDataOutputStream to operate on HDFS. The really central class is DFSOutputStream, which we will discuss in detail.
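For context, application code never touches DFSOutputStream directly; it goes through the FileSystem facade. A minimal usage sketch (the path and the bare Configuration here are just for illustration; the actual cluster settings come from core-site.xml / hdfs-site.xml):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // With an hdfs:// default filesystem this resolves to DistributedFileSystem.
    try (FileSystem fs = FileSystem.get(conf);
         // create() ends up in DistributedFileSystem.create() shown above
         FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
      // close() flushes the last packet and completes the file on the namenode
    }
  }
}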
The client first calls FileSystem.create to obtain an FSDataOutputStream; underneath, the core step is constructing a DFSOutputStream:
/** Construct a new output stream for creating a file. */
private DFSOutputStream(DFSClient dfsClient, String src, FsPermission masked,
    EnumSet<CreateFlag> flag, boolean createParent, short replication,
    long blockSize, Progressable progress, int buffersize,
    DataChecksum checksum) throws IOException {
  this(dfsClient, src, blockSize, progress, checksum, replication);
  this.shouldSyncBlock = flag.contains(CreateFlag.SYNC_BLOCK);

  computePacketChunkSize(dfsClient.getConf().writePacketSize,
      checksum.getBytesPerChecksum());

  try {
    dfsClient.namenode.create(
        src, masked, dfsClient.clientName, new EnumSetWritable<CreateFlag>(flag),
        createParent, replication, blockSize);
  } catch(RemoteException re) {
    throw re.unwrapRemoteException(AccessControlException.class,
                                   DSQuotaExceededException.class,
                                   FileAlreadyExistsException.class,
                                   FileNotFoundException.class,
                                   ParentNotDirectoryException.class,
                                   NSQuotaExceededException.class,
                                   SafeModeException.class,
                                   UnresolvedPathException.class);
  }
  streamer = new DataStreamer();
}
1. dfsClient.namenode.create asks the namenode to create the file. The namenode creates the corresponding INode in its namespace and sets up a lease for the client.
2. A DataStreamer is created and its thread started. DataStreamer is the class dedicated to the logic of sending data.
Creating the DataStreamer:
private DataStreamer() {
  isAppend = false;
  stage = BlockConstructionStage.PIPELINE_SETUP_CREATE;
}
The initial stage is PIPELINE_SETUP_CREATE, meaning the namenode has already been told to create the file and we are waiting for blocks to be written.
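The stages the streamer goes through for a plain write can be pictured roughly like this (a simplified sketch; the real BlockConstructionStage enum has additional states for append and pipeline recovery):

// Simplified view of the client-side pipeline states for a create-then-write.
enum StreamerStageSketch {
  PIPELINE_SETUP_CREATE, // file created on the namenode, no block allocated yet
  DATA_STREAMING,        // pipeline to the datanodes is open, packets are flowing
  PIPELINE_CLOSE         // last packet of the block was sent, waiting for final acks
}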
Now look at DataStreamer's run method:
// get new block from namenode.
if (stage == BlockConstructionStage.PIPELINE_SETUP_CREATE) {
  if(DFSClient.LOG.isDebugEnabled()) {
    DFSClient.LOG.debug("Allocating new block");
  }
  nodes = nextBlockOutputStream(src);
  initDataStreaming();
}
This is where block allocation starts.
nextBlockOutputStream returns the list of datanodes the client is now connected to, and initDataStreaming initializes the streaming state.
do {
  hasError = false;
  lastException = null;
  errorIndex = -1;
  success = false;

  long startTime = Time.now();
  DatanodeInfo[] excluded = excludedNodes.toArray(
      new DatanodeInfo[excludedNodes.size()]);
  block = oldBlock;
  lb = locateFollowingBlock(startTime,
      excluded.length > 0 ? excluded : null);
  block = lb.getBlock();
  block.setNumBytes(0);
  accessToken = lb.getBlockToken();
  nodes = lb.getLocations();

  //
  // Connect to first DataNode in the list.
  //
  success = createBlockOutputStream(nodes, 0L, false);

  if (!success) {
    DFSClient.LOG.info("Abandoning block " + block);
    dfsClient.namenode.abandonBlock(block, src, dfsClient.clientName);
    block = null;
    DFSClient.LOG.info("Excluding datanode " + nodes[errorIndex]);
    excludedNodes.add(nodes[errorIndex]);
  }
} while (!success && --count >= 0);
nextBlockOutputStream asks the namenode for the datanode list needed to write one block and may retry the connection to the first datanode several times. Concretely (see the sketch after this list):
1. locateFollowingBlock calls dfsClient.namenode.addBlock(src, dfsClient.clientName, block, excludedNodes) and gets back a LocatedBlock. The namenode returns the block id and the datanode list; as an aside, the namenode chooses datanodes by balancing write throughput, safety, and read performance.
2. createBlockOutputStream opens a pipeline (socket) to the first datanode in the list and waits for its ack (it appears to wait synchronously for the datanode to respond).
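The retry logic boils down to: ask the namenode for a block and its datanode list, try to open the pipeline, and on failure abandon the block, blacklist the bad datanode, and ask again. A condensed sketch of that pattern (askNamenodeForBlock and openPipeline are hypothetical stand-ins, not the real DFSClient calls):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class PipelineSetupSketch {
  /** Stand-in for namenode.addBlock(): returns candidate datanodes, honoring exclusions. */
  static List<String> askNamenodeForBlock(List<String> excluded) {
    return new ArrayList<>();
  }

  /** Stand-in for createBlockOutputStream(): true if the socket pipeline opened. */
  static boolean openPipeline(List<String> nodes) {
    return !nodes.isEmpty();
  }

  static List<String> setUpPipeline(int retries) throws IOException {
    List<String> excluded = new ArrayList<>();
    while (retries-- > 0) {
      List<String> nodes = askNamenodeForBlock(excluded);
      if (openPipeline(nodes)) {
        return nodes;                              // pipeline is open, start streaming
      }
      // an abandonBlock() equivalent would go here; then exclude the failed datanode
      // so the namenode hands out a different one on the next attempt
      excluded.add(nodes.isEmpty() ? "unknown" : nodes.get(0));
    }
    throw new IOException("Unable to set up a write pipeline");
  }
}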
Next is initDataStreaming:
private void initDataStreaming() {
  this.setName("DataStreamer for file " + src +
      " block " + block);
  response = new ResponseProcessor(nodes);
  response.start();
  stage = BlockConstructionStage.DATA_STREAMING;
}
With the datanode list in hand we can create the ResponseProcessor, whose job is to handle the ack packets coming back from the datanodes. The stage is then set to DATA_STREAMING, and we enter the data-streaming state.
Now the write itself:
When the client calls FSDataOutputStream.write, the call actually lands in FSOutputSummer's write:
public synchronized void write(int b) throws IOException {
  sum.update(b);
  buf[count++] = (byte)b;
  if(count == buf.length) {
    flushBuffer();
  }
}
sum.update maintains the CRC checksum. Once the class's internal buf fills up, flushBuffer is called, which ultimately reaches DFSOutputStream's writeChunk.
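The buffer-then-checksum idea can be reduced to a small sketch (ChunkSink and its writeChunk are hypothetical stand-ins for the FSOutputSummer/DFSOutputStream pair):

import java.io.IOException;
import java.util.zip.CRC32;

// Minimal sketch of the FSOutputSummer idea: buffer bytes one chunk at a time,
// and once a 512-byte chunk is full, emit the chunk together with its CRC.
class SummerSketch {
  interface ChunkSink {
    void writeChunk(byte[] data, int len, long crc) throws IOException;
  }

  private final byte[] buf = new byte[512];   // one chunk worth of payload
  private final CRC32 sum = new CRC32();
  private int count = 0;
  private final ChunkSink out;

  SummerSketch(ChunkSink out) {
    this.out = out;
  }

  synchronized void write(int b) throws IOException {
    sum.update(b);                    // keep the running checksum current
    buf[count++] = (byte) b;
    if (count == buf.length) {        // chunk full: hand it to the lower layer
      out.writeChunk(buf, count, sum.getValue());
      count = 0;
      sum.reset();
    }
  }
}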
Before going into writeChunk, three related objects need explaining: chunk, packet, and block.
chunk: 512 bytes; the unit of CRC checksumming. Each flush produces one chunk. Its internal layout is nicely pictured by a diagram in the code:
buf is pointed into like follows:
 (C is checksum data, D is payload data)

[_________CCCCCCCCC________________DDDDDDDDDDDDDDDD___]
          ^        ^               ^               ^
          |        checksumPos     dataStart       dataPos
          checksumStart
Within the buffer, the checksums come first, followed by the corresponding data.
packet: 64 KB, made up of multiple chunks; it is the smallest unit the client writes to a datanode. When the number of chunks reaches what a packet can hold, the client wraps them into a packet, puts it on the dataQueue, and lets the DataStreamer send it (the arithmetic sketch after these definitions shows how many chunks fit).
block: the familiar unit of actual storage in Hadoop, 64 MB by default. The datanode collects the incoming chunks and writes them out to disk as blocks.
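Putting the numbers together: on the wire each chunk is 512 payload bytes plus a 4-byte CRC32 checksum, so a 64 KB packet holds about 65536 / 516 = 127 chunks, which is essentially what computePacketChunkSize works out. A tiny sketch of that arithmetic (the 512 / 4 / 64 KB values are the usual defaults, assumed here rather than read from a configuration):

public class PacketMath {
  public static void main(String[] args) {
    int bytesPerChecksum = 512;        // chunk payload size
    int checksumSize = 4;              // CRC32 checksum is 4 bytes
    int writePacketSize = 64 * 1024;   // default packet size (64 KB)

    int chunkSize = bytesPerChecksum + checksumSize;            // 516 bytes per chunk
    int chunksPerPacket = Math.max(writePacketSize / chunkSize, 1);
    int packetDataSize = chunksPerPacket * chunkSize;

    // prints: 127 chunks per packet, 65532 bytes of chunk data per packet
    System.out.println(chunksPerPacket + " chunks per packet, "
        + packetDataSize + " bytes of chunk data per packet");
  }
}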
With those concepts clear, here is the writeChunk method:
if (currentPacket == null) {
  currentPacket = new Packet(packetSize, chunksPerPacket,
      bytesCurBlock);
  if (DFSClient.LOG.isDebugEnabled()) {
    DFSClient.LOG.debug("DFSClient writeChunk allocating new packet seqno=" +
        currentPacket.seqno +
        ", src=" + src +
        ", packetSize=" + packetSize +
        ", chunksPerPacket=" + chunksPerPacket +
        ", bytesCurBlock=" + bytesCurBlock);
  }
}

currentPacket.writeChecksum(checksum, 0, cklen);
currentPacket.writeData(b, offset, len);
currentPacket.numChunks++;
bytesCurBlock += len;

// If packet is full, enqueue it for transmission
//
if (currentPacket.numChunks == currentPacket.maxChunks ||
    bytesCurBlock == blockSize) {
  if (DFSClient.LOG.isDebugEnabled()) {
    DFSClient.LOG.debug("DFSClient writeChunk packet full seqno=" +
        currentPacket.seqno +
        ", src=" + src +
        ", bytesCurBlock=" + bytesCurBlock +
        ", blockSize=" + blockSize +
        ", appendChunk=" + appendChunk);
  }
  waitAndQueueCurrentPacket();

  // If the reopened file did not end at chunk boundary and the above
  // write filled up its partial chunk. Tell the summer to generate full
  // crc chunks from now on.
  if (appendChunk && bytesCurBlock%bytesPerChecksum == 0) {
    appendChunk = false;
    resetChecksumChunk(bytesPerChecksum);
  }

  if (!appendChunk) {
    int psize = Math.min((int)(blockSize-bytesCurBlock), dfsClient.getConf().writePacketSize);
    computePacketChunkSize(psize, bytesPerChecksum);
  }
  //
  // if encountering a block boundary, send an empty packet to
  // indicate the end of block and reset bytesCurBlock.
  //
  if (bytesCurBlock == blockSize) {
    currentPacket = new Packet(0, 0, bytesCurBlock);
    currentPacket.lastPacketInBlock = true;
    currentPacket.syncBlock = shouldSyncBlock;
    waitAndQueueCurrentPacket();
    bytesCurBlock = 0;
    lastFlushOffset = 0;
  }
}
1. If there is no current packet, create one.
2. Write the chunk into the packet: increment the packet's chunk count and add len to bytesCurBlock.
3. If the current packet is full, put it on the dataQueue via waitAndQueueCurrentPacket (see the sketch after this list).
4. If bytesCurBlock equals blockSize, i.e. the default 64 MB, send an empty packet with lastPacketInBlock and syncBlock set, telling the datanode it is time to finalize the block.
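waitAndQueueCurrentPacket does not let the dataQueue grow without bound: the writing thread blocks while dataQueue plus ackQueue already hold the maximum number of outstanding packets. A simplified sketch of that back-pressure (the queue types and the maxPackets value are assumptions, not the exact DFSOutputStream fields):

import java.util.LinkedList;

class QueueSketch<P> {
  private final LinkedList<P> dataQueue = new LinkedList<>();
  private final LinkedList<P> ackQueue = new LinkedList<>();
  private final int maxPackets = 80;   // illustrative bound on packets in flight

  /** Called by the writer thread; blocks while too many packets are outstanding. */
  void waitAndQueuePacket(P packet) throws InterruptedException {
    synchronized (dataQueue) {
      while (dataQueue.size() + ackQueue.size() > maxPackets) {
        dataQueue.wait();              // woken by notifyAll() when an ack frees a slot
      }
      dataQueue.addLast(packet);
      dataQueue.notifyAll();           // wake the DataStreamer so it can send
    }
  }
}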
The DataStreamer thread keeps watching the dataQueue; as soon as a packet shows up it starts writing data to the datanode:
// send the packet
synchronized (dataQueue) {
  // move packet from dataQueue to ackQueue
  if (!one.isHeartbeatPacket()) {
    dataQueue.removeFirst();
    ackQueue.addLast(one);
    dataQueue.notifyAll();
  }
}

if (DFSClient.LOG.isDebugEnabled()) {
  DFSClient.LOG.debug("DataStreamer block " + block +
      " sending packet " + one);
}

// write out data to remote datanode
try {
  one.writeTo(blockStream);
  blockStream.flush();
} catch (IOException e) {
  // HDFS-3398 treat primary DN is down since client is unable to
  // write to primary DN
  errorIndex = 0;
  throw e;
}
The packet is taken off the dataQueue and appended to the end of the ackQueue.
The data has now been written out. As mentioned earlier, the DataStreamer starts a ResponseProcessor thread to handle the datanodes' replies. Here is ResponseProcessor's run method:
while (!responderClosed && dfsClient.clientRunning && !isLastPacketInBlock) {
  // process responses from datanodes.
  try {
    // read an ack from the pipeline
    ack.readFields(blockReplyStream);
    if (DFSClient.LOG.isDebugEnabled()) {
      DFSClient.LOG.debug("DFSClient " + ack);
    }

    long seqno = ack.getSeqno();
    // processes response status from datanodes.
    for (int i = ack.getNumOfReplies()-1; i >=0 && dfsClient.clientRunning; i--) {
      final Status reply = ack.getReply(i);
      if (reply != SUCCESS) {
        errorIndex = i; // first bad datanode
        throw new IOException("Bad response " + reply +
            " for block " + block +
            " from datanode " +
            targets[i]);
      }
    }

    assert seqno != PipelineAck.UNKOWN_SEQNO :
      "Ack for unkown seqno should be a failed ack: " + ack;
    if (seqno == Packet.HEART_BEAT_SEQNO) {  // a heartbeat ack
      continue;
    }

    // a success ack for a data packet
    Packet one = null;
    synchronized (dataQueue) {
      one = ackQueue.getFirst();
    }
    if (one.seqno != seqno) {
      throw new IOException("Responseprocessor: Expecting seqno " +
          " for block " + block +
          one.seqno + " but received " + seqno);
    }
    isLastPacketInBlock = one.lastPacketInBlock;

    // update bytesAcked
    block.setNumBytes(one.getLastByteOffsetBlock());

    synchronized (dataQueue) {
      lastAckedSeqno = seqno;
      ackQueue.removeFirst();
      dataQueue.notifyAll();
    }
1. ack.readFields(blockReplyStream) blocks, reading the datanodes' reply from the stream.
2. When a response arrives, check whether its seqno matches the expected one; if not, the transfer's "sliding window" has gone wrong.
3. Check the reply status from every datanode in the pipeline.
4. Remove the packet from the ackQueue.
5. dataQueue.notifyAll lets operations blocked on the dataQueue continue.
A note on flow control and reliability while writing. Although client-datanode communication runs over TCP, Hadoop's philosophy is that anything can fail in any environment, so the write path uses a sliding-window model reminiscent of TCP. Every packet goes onto the ackQueue after it is sent; when a datanode ack arrives, the seqno is checked, which enforces strict packet ordering. Only after the matching ack is received is the packet really dropped from memory. At the same time, once a packet has moved to the ackQueue the DataStreamer is free to send another one. Unlike TCP, this is not a strict one-ack-per-send scheme; the sender's pace is governed by a combination of timeouts and notifyAll. Here is the sender-side control:
synchronized (dataQueue) {
  // wait for a packet to be sent.
  long now = Time.now();
  while ((!streamerClosed && !hasError && dfsClient.clientRunning
      && dataQueue.size() == 0 &&
      (stage != BlockConstructionStage.DATA_STREAMING ||
       stage == BlockConstructionStage.DATA_STREAMING &&
       now - lastPacket < dfsClient.getConf().socketTimeout/2)) || doSleep ) {
    long timeout = dfsClient.getConf().socketTimeout/2 - (now-lastPacket);
    timeout = timeout <= 0 ? 1000 : timeout;
    timeout = (stage == BlockConstructionStage.DATA_STREAMING)?
        timeout : 1000;
    try {
      dataQueue.wait(timeout);
    } catch (InterruptedException e) {
    }
    doSleep = false;
    now = Time.now();
  }
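Condensing that loop: the DataStreamer waits on the dataQueue until a packet shows up or it has been idle for half the socket timeout, in which case it sends a heartbeat to keep the pipeline alive. A simplified sketch of just that waiting logic (the timeout value and the null-means-heartbeat convention are illustrative, not the real DFSOutputStream code):

import java.util.LinkedList;

class StreamerWaitSketch<P> {
  private final LinkedList<P> dataQueue = new LinkedList<>();
  private final long socketTimeoutMs = 60_000;   // illustrative; half of it bounds the wait
  private long lastPacketSentAt = System.currentTimeMillis();

  /** Returns the next packet to send, or null to signal "send a heartbeat packet". */
  P nextPacketOrHeartbeat() throws InterruptedException {
    synchronized (dataQueue) {
      long now = System.currentTimeMillis();
      // Wait while nothing is queued and we have not yet been idle for socketTimeout/2.
      while (dataQueue.isEmpty() && now - lastPacketSentAt < socketTimeoutMs / 2) {
        long timeout = socketTimeoutMs / 2 - (now - lastPacketSentAt);
        dataQueue.wait(Math.max(timeout, 1));
        now = System.currentTimeMillis();
      }
      lastPacketSentAt = now;
      return dataQueue.isEmpty() ? null : dataQueue.removeFirst();
    }
  }
}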
To sum up:
1. Overall, the write path balances write speed against correctness, which is how it arrived at its current design.
2. create asks the namenode to create the file; the datanode list is then obtained and the stream to the pipeline opened.
3. Writes first go into an internal buffer and are flushed into a packet once the buffer is full.
4. A full packet is not sent immediately either; it is put on the dataQueue and handled asynchronously by another thread.
5. Once sending starts, a ResponseProcessor thread is also spawned to handle datanode acks asynchronously.
6. Batching data and handing sends to a dedicated thread is a classic technique in high-throughput systems, and acks are an effective way to achieve reliable transfer. Well worth studying.