Analysis of the HDFS File Write Flow

    In HDFS, uploading, opening, and reading a file all come down to interaction among three main classes: the client DFSClient, the NameNode, and the DataNode.

    The flow for uploading a file to HDFS is traced below.

1. First, DistributedFileSystem.create is called; its implementation is as follows:

    public FSDataOutputStream create(Path f, FsPermission permission,
            boolean overwrite,
            int bufferSize, short replication, long blockSize,
            Progressable progress) throws IOException {
        return new FSDataOutputStream(
                dfs.create(getPathName(f), permission,
                        overwrite, replication, blockSize, progress, bufferSize),
                statistics);
    }

2. Here dfs is of type DFSClient; tracing into DFSClient leads to:

    public OutputStream create(String src, FsPermission permission,
            boolean overwrite, short replication, long blockSize,
            Progressable progress, int buffersize) throws IOException {}

The main call inside this function is:

    OutputStream result = new DFSOutputStream(src, masked, overwrite, replication,
            blockSize, progress, buffersize, conf.getInt("io.bytes.per.checksum", 512));

In DFSClient we find the constructor:

    DFSOutputStream(String src, FsPermission masked, boolean overwrite,
            boolean createParent, short replication, long blockSize,
            Progressable progress, int buffersize, int bytesPerChecksum) throws IOException {}

This constructor mainly makes the following calls:

    namenode.create(src, masked, clientName, overwrite, false, replication, blockSize);
    streamer.start(); // starts a pipeline that is used to write the data

namenode is of type NameNode; next, trace the create() function in the NameNode class.

3. The create() function in the NameNode class:

    public void create(String src, FsPermission masked, String clientName,
            boolean overwrite, boolean createParent, short replication, long blockSize) throws IOException { }

It contains this important call:

    namesystem.startFile(src, new PermissionStatus(UserGroupInformation.getCurrentUser().getShortUserName(), null, masked)

namesystem is of type FSNamesystem. Tracing into the FSNamesystem class, startFile leads to startFileInternal:

    private synchronized void startFileInternal(String src,
            PermissionStatus permissions, String holder, String clientMachine, boolean overwrite,
            boolean append, short replication, long blockSize) throws IOException { }

This function contains:

    INodeFileUnderConstruction newNode = dir.addFile(src, permissions,
            replication, blockSize, holder, clientMachine, clientNode, genstamp);
    // Creates a new file in the "under construction" state, with no data blocks attached to it yet

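4. The client then writes data into the newly created file. This normally goes through the write function of FSDataOutputStream, which eventually calls the writeChunk function of DFSOutputStream. DFSOutputStream is an inner class of DFSClient.

    By design, HDFS writes block data in a pipeline: the data is split into packets, and if three replicas are required they are written to DataNode1, DataNode2, and DataNode3 as follows:

First, packet1 is written to DataNode1.

Next, DataNode1 forwards packet1 to DataNode2, while the client can write packet2 to DataNode1.

Then DataNode2 forwards packet1 to DataNode3, while the client can write packet3 to DataNode1 and DataNode1 forwards packet2 to DataNode2.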
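
The three steps above can be sketched as a small schedule calculation. This is only an illustrative model, not HDFS code: the class name PipelineSketch and the method schedule are made up for this example, and it assumes every hop takes exactly one time step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the write pipeline: at each time step, every
// DataNode receives at most one packet, and a packet moves one hop per step.
class PipelineSketch {

    /**
     * Returns one row per time step. row[d] is the 1-based number of the
     * packet that DataNode d+1 receives at that step, or 0 if it is idle.
     */
    static List<int[]> schedule(int packets, int dataNodes) {
        List<int[]> steps = new ArrayList<>();
        int totalSteps = packets + dataNodes - 1;
        for (int t = 0; t < totalSteps; t++) {
            int[] row = new int[dataNodes];
            for (int d = 0; d < dataNodes; d++) {
                int p = t - d + 1; // packet that reaches node d at step t
                row[d] = (p >= 1 && p <= packets) ? p : 0;
            }
            steps.add(row);
        }
        return steps;
    }

    public static void main(String[] args) {
        for (int[] row : schedule(3, 3)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With three packets and three DataNodes, the third row is [3, 2, 1]: the client is sending packet3 to DataNode1 while packet2 reaches DataNode2 and packet1 reaches DataNode3, which matches the last step described above.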

    protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException { … }

    synchronized (dataQueue) {
      // If queue is full, then wait till we can create enough space
      while (!closed && dataQueue.size() + ackQueue.size() > maxPackets) {
        try {
          dataQueue.wait(); // wait
        } catch (InterruptedException e) {
        }
      }
      isClosed();

      if (currentPacket == null) {
        currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock);
        if (LOG.isDebugEnabled()) {
          LOG.debug("DFSClient writeChunk allocating new packet seqno=" + currentPacket.seqno +
                    ", src=" + src +
                    ", packetSize=" + packetSize +
                    ", chunksPerPacket=" + chunksPerPacket +
                    ", bytesCurBlock=" + bytesCurBlock);
        }
      }

      currentPacket.writeChecksum(checksum, 0, cklen);
      currentPacket.writeData(b, offset, len);
      currentPacket.numChunks++;
      bytesCurBlock += len;

      // If packet is full, enqueue it for transmission
      if (currentPacket.numChunks == currentPacket.maxChunks || bytesCurBlock == blockSize) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("DFSClient writeChunk packet full seqno=" + currentPacket.seqno +
                    ", src=" + src +
                    ", bytesCurBlock=" + bytesCurBlock +
                    ", blockSize=" + blockSize +
                    ", appendChunk=" + appendChunk);
        }

        // if we allocated a new packet because we encountered a block
        // boundary, reset bytesCurBlock.
        if (bytesCurBlock == blockSize) {
          currentPacket.lastPacketInBlock = true;
          bytesCurBlock = 0;
          lastFlushOffset = 0;
        }
        enqueueCurrentPacket();

        // If this was the first write after reopening a file, then the above
        // write filled up any partial chunk. Tell the summer to generate full
        // crc chunks from now on.
        if (appendChunk) {
          appendChunk = false;
          resetChecksumChunk(bytesPerChecksum);
        }
        int psize = Math.min((int)(blockSize - bytesCurBlock), writePacketSize);
        computePacketChunkSize(psize, bytesPerChecksum);
      }
    }
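
To make the packet-full and block-boundary conditions in writeChunk easier to follow, here is a minimal sketch of the same bookkeeping. It is a simplified model under stated assumptions, not the DFSClient code: PacketPackingSketch, its Packet class, and pack are hypothetical names; every chunk is assumed to have the same size; chunksPerPacket stays fixed (the real code recomputes it through computePacketChunkSize); and a trailing partial packet is simply not enqueued.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of how writeChunk groups chunks into packets.
class PacketPackingSketch {

    static final class Packet {
        int numChunks;
        boolean lastPacketInBlock;
    }

    /** Packs chunkCount equally sized chunks and returns the enqueued packets. */
    static List<Packet> pack(int chunkCount, int chunkSize, int chunksPerPacket, long blockSize) {
        List<Packet> dataQueue = new ArrayList<>();
        Packet current = null;
        long bytesCurBlock = 0;
        for (int i = 0; i < chunkCount; i++) {
            if (current == null) {
                current = new Packet();
            }
            current.numChunks++;
            bytesCurBlock += chunkSize;
            // Same test as in writeChunk: packet full, or block boundary reached.
            if (current.numChunks == chunksPerPacket || bytesCurBlock == blockSize) {
                if (bytesCurBlock == blockSize) {
                    current.lastPacketInBlock = true;
                    bytesCurBlock = 0; // the next chunk starts a new block
                }
                dataQueue.add(current);
                current = null;
            }
        }
        return dataQueue;
    }
}
```

For example, ten 512-byte chunks with four chunks per packet and a 3072-byte block produce three enqueued packets: a full one, a two-chunk packet flagged lastPacketInBlock because the block boundary is hit, and another full one belonging to the next block.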

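    Meanwhile, the streamer started earlier by streamer.start() is of type DataStreamer, which is also an inner class of DFSClient.

    Some of the calls made inside its method public void run() {…}:

    blockStream.write(buf.array(), buf.position(), buf.remaining());
    // writes the data into the block on the DataNode through the output stream created earlier
    blockStream.writeInt(0); // marks the end of the write; blockStream is a DataOutputStream
    nodes = nextBlockOutputStream(src); // the NameNode allocates a block, and an output stream pointing at that block is created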
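
The interplay between writeChunk and the streamer thread can be pictured with a single-threaded model of the two queues. This is a sketch under assumptions, not the DataStreamer implementation: the class StreamerQueueSketch and its methods offer, sendNext, and ack are invented for illustration, packets are reduced to sequence numbers, and the blocking wait in writeChunk is replaced by a boolean return value.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model: packets wait in dataQueue until sent, then sit in
// ackQueue until acknowledged. Both queues count against maxPackets.
class StreamerQueueSketch {
    final Deque<Integer> dataQueue = new ArrayDeque<>();
    final Deque<Integer> ackQueue = new ArrayDeque<>();
    final int maxPackets;

    StreamerQueueSketch(int maxPackets) {
        this.maxPackets = maxPackets;
    }

    /** Writer side: returns false where writeChunk would block in dataQueue.wait(). */
    boolean offer(int seqno) {
        if (dataQueue.size() + ackQueue.size() > maxPackets) {
            return false;
        }
        dataQueue.addLast(seqno);
        return true;
    }

    /** Streamer side: sends the oldest packet and parks it until it is acknowledged. */
    Integer sendNext() {
        Integer seqno = dataQueue.pollFirst();
        if (seqno != null) {
            ackQueue.addLast(seqno);
        }
        return seqno;
    }

    /** Ack side: removes the acknowledged packet, which frees room for the writer. */
    boolean ack(int seqno) {
        Integer head = ackQueue.peekFirst();
        if (head == null || head != seqno) {
            return false;
        }
        ackQueue.removeFirst();
        return true;
    }
}
```

The point of the model is that sending a packet does not free any space for the writer; only an acknowledgement does, because the limit applies to the combined size of both queues.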

In the DataStreamer class: private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException { }

It contains:

       lb = locateFollowingBlock(startTime);
       // the NameNode allocates DataNodes and a block for the file

Tracing further, the inner class DFSOutputStream has:

private LocatedBlock locateFollowingBlock(long start,
        DatanodeInfo[] excludedNodes) throws IOException {…}

The most important line in it is:

   return namenode.addBlock(src, clientName, excludedNodes);

5. Trace the addBlock(…) function in the NameNode class:

   public LocatedBlock addBlock(String src,
          String clientName) throws IOException {…}

It contains:

LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);

return locatedBlock;

This in turn leads to getAdditionalBlock in the FSNamesystem class:

public LocatedBlock getAdditionalBlock(String src, String clientName, List<Node> excludedNodes) throws IOException {…}
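
The NameNode side of steps 3 and 5 can be summarised with a toy namespace: creating a file only records an under-construction entry with no blocks, and each later request adds one block to it. This is a rough sketch, not FSNamesystem: NamespaceSketch and its members are hypothetical, block placement on DataNodes is left out, and the holder check is a stand-in for the real lease handling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the NameNode bookkeeping for a file being written.
class NamespaceSketch {

    static final class FileUnderConstruction {
        final String holder;                          // client that is writing the file
        final List<Long> blockIds = new ArrayList<>(); // empty right after creation
        FileUnderConstruction(String holder) {
            this.holder = holder;
        }
    }

    private final Map<String, FileUnderConstruction> files = new HashMap<>();
    private long nextBlockId = 1;

    /** Models startFile: registers the file, with no data blocks yet. */
    void startFile(String src, String holder, boolean overwrite) {
        if (files.containsKey(src) && !overwrite) {
            throw new IllegalStateException("file already exists: " + src);
        }
        files.put(src, new FileUnderConstruction(holder));
    }

    /** Models getAdditionalBlock: allocates one new block for the file. */
    long getAdditionalBlock(String src, String clientName) {
        FileUnderConstruction file = files.get(src);
        if (file == null || !file.holder.equals(clientName)) {
            throw new IllegalStateException("no file under construction for " + clientName);
        }
        long blockId = nextBlockId++;
        file.blockIds.add(blockId);
        return blockId;
    }

    int blockCount(String src) {
        return files.get(src).blockIds.size();
    }
}
```

A client would call startFile once and then getAdditionalBlock each time the streamer needs a new block, so the block list grows only as data is actually written.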

6. Once the DataNodes and the block have been allocated for the client, createBlockOutputStream in the inner class DFSOutputStream starts the write:

private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) { … }

It contains:

      {
        LOG.debug("Connecting to " + nodes[0].getName());
        // Create a socket and connect to the DataNode
        InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());
        s = socketFactory.createSocket();
        timeoutValue = 3000 * nodes.length + socketTimeout;
        NetUtils.connect(s, target, timeoutValue);
        s.setSoTimeout(timeoutValue);
        s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
        LOG.debug("Send buf size " + s.getSendBufferSize());
        long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length +
                            datanodeWriteTimeout;

        // Xmit header info to datanode
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),
                                     DataNode.SMALL_BUFFER_SIZE));
        blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

        out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
        out.write( DataTransferProtocol.OP_WRITE_BLOCK );
        out.writeLong( block.getBlockId() );
        out.writeLong( block.getGenerationStamp() );
        out.writeInt( nodes.length );
        out.writeBoolean( recoveryFlag );       // recovery flag
        Text.writeString( out, client );
        out.writeBoolean(false); // Not sending src node information
        out.writeInt( nodes.length - 1 );
        for (int i = 1; i < nodes.length; i++) {
          nodes[i].write(out);
        }
        accessToken.write(out);
        checksum.writeHeader( out );
        out.flush();

        // receive ack for connect
        pipelineStatus = blockReplyStream.readShort();
        firstBadLink = Text.readString(blockReplyStream);
        if (pipelineStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
          if (pipelineStatus == DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN) {
            throw new InvalidBlockTokenException(
                "Got access token error for connect ack with firstBadLink as "
                    + firstBadLink);
          } else {
            throw new IOException("Bad connect ack with firstBadLink as "
                + firstBadLink);
          }
        }

        blockStream = out;
        result = true;     // success
      }
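
The field order of the header written above can be checked with an in-memory round trip. The following is a sketch only: WriteBlockHeaderSketch, encode, and decodeTargets are hypothetical names; the two constants are placeholders rather than values taken from DataTransferProtocol; writeUTF and readUTF stand in for Hadoop's Text.writeString and Text.readString as well as for DatanodeInfo serialization; and the access token and checksum header are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical round trip of a simplified OP_WRITE_BLOCK header.
class WriteBlockHeaderSketch {
    // Placeholder values for illustration only.
    static final short DATA_TRANSFER_VERSION = 1;
    static final byte OP_WRITE_BLOCK = 80;

    /** Writes the header fields in the same order as createBlockOutputStream. */
    static byte[] encode(long blockId, long genStamp, String client, String[] nodes) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(DATA_TRANSFER_VERSION);
            out.write(OP_WRITE_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeInt(nodes.length);       // size of the whole pipeline
            out.writeBoolean(false);          // recovery flag
            out.writeUTF(client);             // stand-in for Text.writeString
            out.writeBoolean(false);          // no source node information
            out.writeInt(nodes.length - 1);   // targets after the first DataNode
            for (int i = 1; i < nodes.length; i++) {
                out.writeUTF(nodes[i]);       // stand-in for nodes[i].write(out)
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the header back and returns the downstream targets. */
    static String[] decodeTargets(byte[] header) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(header));
            in.readShort();    // version
            in.readByte();     // opcode
            in.readLong();     // block id
            in.readLong();     // generation stamp
            in.readInt();      // pipeline size
            in.readBoolean();  // recovery flag
            in.readUTF();      // client name
            in.readBoolean();  // source node flag
            String[] targets = new String[in.readInt()];
            for (int i = 0; i < targets.length; i++) {
                targets[i] = in.readUTF();
            }
            return targets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that the first DataNode never appears in the target list it receives: the client connects to nodes[0] directly and only sends nodes[1] onward, which is why the count written is nodes.length - 1.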

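   After the client has created the output stream in DataStreamer's run function, it calls blockStream.write to send the data to the DataNode.

7. Finally, the block has to be written to disk. According to some material I have read, this happens on the DataNode:

In the DataNode's DataXceiver, receiving the DataTransferProtocol.OP_WRITE_BLOCK instruction triggers a call to the writeBlock function.

I have not yet worked out how the DataXceiver class is tied to the DataNode class, but DataXceiver does contain this writeBlock() function, and the comments in the source also say that it writes a block to disk: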

  private void writeBlock(DataInputStream in) throws IOException {
    DatanodeInfo srcDataNode = null;
    LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() +
              " tcp nodelay " + s.getTcpNoDelay());

    // Read in the header
    Block block = new Block(in.readLong(),
        dataXceiverServer.estimateBlockSize, in.readLong());
    LOG.info("Receiving block " + block +
             " src: " + remoteAddress +
             " dest: " + localAddress);
    int pipelineSize = in.readInt(); // num of datanodes in entire pipeline
    boolean isRecovery = in.readBoolean(); // is this part of recovery?
    String client = Text.readString(in); // working on behalf of this client
    boolean hasSrcDataNode = in.readBoolean(); // is src node info present
    if (hasSrcDataNode) {
      srcDataNode = new DatanodeInfo();
      srcDataNode.readFields(in);
    }
    int numTargets = in.readInt();
    if (numTargets < 0) {
      throw new IOException("Mislabelled incoming datastream.");
    }
    DatanodeInfo targets[] = new DatanodeInfo[numTargets];
    for (int i = 0; i < targets.length; i++) {
      DatanodeInfo tmp = new DatanodeInfo();
      tmp.readFields(in);
      targets[i] = tmp;
    }
    Token<BlockTokenIdentifier> accessToken = new Token<BlockTokenIdentifier>();
    accessToken.readFields(in);
    DataOutputStream replyOut = null;   // stream to prev target
    replyOut = new DataOutputStream(
                   NetUtils.getOutputStream(s, datanode.socketWriteTimeout));
    if (datanode.isBlockTokenEnabled) {
      try {
        datanode.blockTokenSecretManager.checkAccess(accessToken, null, block,
            BlockTokenSecretManager.AccessMode.WRITE);
      } catch (InvalidToken e) {
        try {
          if (client.length() != 0) {
            replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR_ACCESS_TOKEN);
            Text.writeString(replyOut, datanode.dnRegistration.getName());
            replyOut.flush();
          }
          throw new IOException("Access token verification failed, for client "
              + remoteAddress + " for OP_WRITE_BLOCK for block " + block);
        } finally {
          IOUtils.closeStream(replyOut);
        }
      }
    }

    DataOutputStream mirrorOut = null;  // stream to next target
    DataInputStream mirrorIn = null;    // reply from next target
    Socket mirrorSock = null;           // socket to next target
    BlockReceiver blockReceiver = null; // responsible for data handling
    String mirrorNode = null;           // the name:port of next target
    String firstBadLink = "";           // first datanode that failed in connection setup
    short mirrorInStatus = (short)DataTransferProtocol.OP_STATUS_SUCCESS;
    try {
      // open a block receiver and check if the block does not exist
      blockReceiver = new BlockReceiver(block, in,
          s.getRemoteSocketAddress().toString(),
          s.getLocalSocketAddress().toString(),
          isRecovery, client, srcDataNode, datanode);

      // Open network conn to backup machine, if appropriate
      if (targets.length > 0) {
        InetSocketAddress mirrorTarget = null;
        // Connect to backup machine
        mirrorNode = targets[0].getName();
        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
        mirrorSock = datanode.newSocket();
        try {
          int timeoutValue = datanode.socketTimeout +
                             (HdfsConstants.READ_TIMEOUT_EXTENSION * numTargets);
          int writeTimeout = datanode.socketWriteTimeout +
                             (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
          NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
          mirrorSock.setSoTimeout(timeoutValue);
          mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
          mirrorOut = new DataOutputStream(
             new BufferedOutputStream(
                         NetUtils.getOutputStream(mirrorSock, writeTimeout),
                         SMALL_BUFFER_SIZE));
          mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

          // Write header: Copied from DFSClient.java!
          mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
          mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
          mirrorOut.writeLong( block.getBlockId() );
          mirrorOut.writeLong( block.getGenerationStamp() );
          mirrorOut.writeInt( pipelineSize );
          mirrorOut.writeBoolean( isRecovery );
          Text.writeString( mirrorOut, client );
          mirrorOut.writeBoolean(hasSrcDataNode);
          if (hasSrcDataNode) { // pass src node information
            srcDataNode.write(mirrorOut);
          }
          mirrorOut.writeInt( targets.length - 1 );
          for (int i = 1; i < targets.length; i++ ) {
            targets[i].write( mirrorOut );
          }
          accessToken.write(mirrorOut);
          blockReceiver.writeChecksumHeader(mirrorOut);
          mirrorOut.flush();

          // read connect ack (only for clients, not for replication req)
          if (client.length() != 0) {
            mirrorInStatus = mirrorIn.readShort();
            firstBadLink = Text.readString(mirrorIn);
            if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
              LOG.info("Datanode " + targets.length +
                       " got response for connect ack " +
                       " from downstream datanode with firstbadlink as " +
                       firstBadLink);
            }
          }
        } catch (IOException e) {
          if (client.length() != 0) {
            replyOut.writeShort((short)DataTransferProtocol.OP_STATUS_ERROR);
            Text.writeString(replyOut, mirrorNode);
            replyOut.flush();
          }
          IOUtils.closeStream(mirrorOut);
          mirrorOut = null;
          IOUtils.closeStream(mirrorIn);
          mirrorIn = null;
          IOUtils.closeSocket(mirrorSock);
          mirrorSock = null;
          if (client.length() > 0) {
            throw e;
          } else {
            LOG.info(datanode.dnRegistration + ":Exception transfering block " +
                     block + " to mirror " + mirrorNode +
                     ". continuing without the mirror.\n" +
                     StringUtils.stringifyException(e));
          }
        }
      }

      // send connect ack back to source (only for clients)
      if (client.length() != 0) {
        if (LOG.isDebugEnabled() || mirrorInStatus != DataTransferProtocol.OP_STATUS_SUCCESS) {
          LOG.info("Datanode " + targets.length +
                   " forwarding connect ack to upstream firstbadlink is " +
                   firstBadLink);
        }
        replyOut.writeShort(mirrorInStatus);
        Text.writeString(replyOut, firstBadLink);
        replyOut.flush();
      }

      // receive the block and mirror to the next target
      String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
      blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,
                                 mirrorAddr, null, targets.length);

      // if this write is for a replication request (and not
      // from a client), then confirm block. For client-writes,
      // the block is finalized in the PacketResponder.
      if (client.length() == 0) {
        datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT);
        LOG.info("Received block " + block +
                 " src: " + remoteAddress +
                 " dest: " + localAddress +
                 " of size " + block.getNumBytes());
      }

      if (datanode.blockScanner != null) {
        datanode.blockScanner.addBlock(block);
      }

    } catch (IOException ioe) {
      LOG.info("writeBlock " + block + " received exception " + ioe);
      throw ioe;
    } finally {
      // close all opened streams
      IOUtils.closeStream(mirrorOut);
      IOUtils.closeStream(mirrorIn);
      IOUtils.closeStream(replyOut);
      IOUtils.closeSocket(mirrorSock);
      IOUtils.closeStream(blockReceiver);
    }
  }
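
One detail of writeBlock worth isolating is how the pipeline is unrolled: each DataNode connects to targets[0] and passes on only targets[1..], so the target list shrinks by one per hop until the last DataNode has nothing to mirror to. The sketch below models just that list handling; MirrorChainSketch and forward are hypothetical names, DataNodes are reduced to strings, and no sockets or data are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of how the target list shrinks along the pipeline.
class MirrorChainSketch {

    /**
     * first is the DataNode the client connects to; targets is the list it
     * receives in the header. Returns one description per hop.
     */
    static List<String> forward(String first, String[] targets) {
        List<String> hops = new ArrayList<>();
        String current = first;
        String[] remaining = targets;
        while (true) {
            hops.add(current + " sees " + remaining.length + " target(s)");
            if (remaining.length == 0) {
                break; // last DataNode: targets.length == 0, so no mirror is opened
            }
            current = remaining[0]; // mirrorNode = targets[0]
            // Downstream receives targets.length - 1 entries, starting at index 1.
            remaining = Arrays.copyOfRange(remaining, 1, remaining.length);
        }
        return hops;
    }
}
```

For a three-replica write, forward("dn1", new String[]{"dn2", "dn3"}) yields three hops, with dn1 seeing two targets, dn2 seeing one, and dn3 seeing none.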

Many other related classes are involved in this process and still need further analysis!
