Source Code Analysis of the HDFS Client File Write Path

thaddeuslu

Tags: hdfs, source code

Original article: https://www.cnblogs.com/forfuture1978/archive/2010/11/10/1874222.html


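Key concepts for writing files in HDFS

An HDFS file consists of multiple blocks. HDFS reads and writes block data in units of packets (64 KB each by default). Each packet consists of several chunks (512 bytes each by default). The chunk is the basic unit of data verification: a checksum (4 bytes by default) is generated and stored for every chunk.

When a block is written, the basic unit of data transfer is therefore the packet, and each packet carries several chunks together with their checksums.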
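To make the sizes concrete, here is a rough sketch of the packet arithmetic using the default values quoted above. The class and method names are made up for illustration, and the calculation ignores the packet header, so the real chunk count per packet may differ slightly.

```java
// Rough packet arithmetic with the default sizes quoted in the text.
// Assumption: the packet header is ignored, so real counts may differ slightly.
class PacketMath {
    static final int PACKET_SIZE = 64 * 1024; // bytes per packet
    static final int CHUNK_SIZE = 512;        // data bytes per chunk
    static final int CHECKSUM_SIZE = 4;       // checksum bytes per chunk

    // Each chunk occupies its data bytes plus one checksum inside the packet.
    static int chunksPerPacket() {
        return PACKET_SIZE / (CHUNK_SIZE + CHECKSUM_SIZE);
    }

    // Data bytes (excluding checksums) carried by one full packet.
    static int dataBytesPerPacket() {
        return chunksPerPacket() * CHUNK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("chunks per packet: " + chunksPerPacket());
        System.out.println("data bytes per packet: " + dataBytesPerPacket());
    }
}
```

Under these assumptions a packet holds 127 chunks, so a little under 64 KB of each packet is actual file data.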

Client

Sample code for writing a file with the HDFS client

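import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem hdfs = FileSystem.get(new Configuration());
Path path = new Path("/testfile");

// Write a short UTF-8 string to the new file, then release the stream and the file system
FSDataOutputStream dos = hdfs.create(path);
byte[] readBuf = "Hello World".getBytes("UTF-8");
dos.write(readBuf, 0, readBuf.length);
dos.close();
hdfs.close();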

Opening the file
To upload a file to HDFS, the client typically calls DistributedFileSystem.create, which is implemented as follows:

public FSDataOutputStream create(Path f, FsPermission permission,boolean overwrite,int bufferSize, short replication, long blockSize,Progressable progress) throws IOException {
    return new FSDataOutputStream
       (dfs.create(getPathName(f), permission,overwrite, replication, blockSize, progress, bufferSize),
        statistics);
}

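It ultimately returns an FSDataOutputStream that is used to write data into the newly created file. The member variable dfs is of type DFSClient, and DFSClient's create function is as follows: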

public OutputStream create(String src,FsPermission permission,boolean overwrite,short replication,long blockSize,Progressable progress,int buffersize) throws IOException {
    checkOpen();
    if (permission == null) {
      permission = FsPermission.getDefault();
    }
    FsPermission masked = permission.applyUMask(FsPermission.getUMask(conf));
    OutputStream result = new DFSOutputStream(src, masked,overwrite, replication, blockSize, progress, buffersize,
        conf.getInt("io.bytes.per.checksum", 512));
    leasechecker.put(src, result);
    return result;
}
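The umask step in create can be illustrated with plain integers. The sketch below is not the FsPermission implementation; it only shows the usual bitwise rule (clear every permission bit that is set in the umask), with a hypothetical helper name.

```java
// Illustrates the usual umask rule with octal integers.
// Assumption: this mirrors what applyUMask does conceptually; it is not Hadoop code.
class UmaskSketch {
    // Clears every permission bit that is set in the umask.
    static int applyUmask(int permission, int umask) {
        return permission & ~umask & 0777;
    }

    public static void main(String[] args) {
        // Requested rw-rw-rw- with umask 022 yields rw-r--r--
        System.out.println(Integer.toOctalString(applyUmask(0666, 0022)));
    }
}
```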

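The create function constructs a DFSOutputStream, whose constructor calls the NameNode's create function over RPC to create the file.
The constructor also does one more important thing: it calls streamer.start(), which starts the DataStreamer thread that drives the write pipeline. This is analyzed in detail in the section on writing data.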

DFSOutputStream(String src, FsPermission masked, boolean overwrite,short replication, long blockSize, Progressable progress,
                int buffersize, int bytesPerChecksum) throws IOException {
    this(src, blockSize, progress, bytesPerChecksum);
    computePacketChunkSize(writePacketSize, bytesPerChecksum);
    try {
      namenode.create(src, masked, clientName, overwrite, replication, blockSize);
    } catch(RemoteException re) {
      throw re.unwrapRemoteException(AccessControlException.class,QuotaExceededException.class);
    }
    streamer.start();
}

NameNode

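The RPC reaches the NameNode's create function, which calls namesystem.startFile, which in turn calls startFileInternal. That function creates a new file in the under construction state, with no data blocks associated with it yet.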

 private synchronized void startFileInternal(String src, PermissionStatus permissions,String holder, 
 String clientMachine, boolean overwrite, boolean append, short replication, long blockSize) throws IOException {

    ......

   // Create a new file in the under construction state, with no data blocks associated with it

   long genstamp = nextGenerationStamp();

   INodeFileUnderConstruction newNode = dir.addFile(src, permissions,

      replication, blockSize, holder, clientMachine, clientNode, genstamp);

   ......

  }

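Writing the file from the client

Next, the client writes data into the newly created file, typically through the write method of FSDataOutputStream.

By design, HDFS writes block data through a pipeline: the data is split into packets, and when three replicas are required, on DataNode 1, 2, and 3, the write proceeds as follows:

First, packet 1 is written to DataNode 1.
Then DataNode 1 forwards packet 1 to DataNode 2, while the client can write packet 2 to DataNode 1.
Then DataNode 2 forwards packet 1 to DataNode 3, while the client can write packet 3 to DataNode 1 and DataNode 1 forwards packet 2 to DataNode 2.
The packets move down the line in this way until all the data has been written and replicated.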
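The steps above can be sketched as a small schedule. This is a simplified model, not HDFS code: it assumes every hop takes exactly one uniform time step, and the class and method names are invented for illustration.

```java
// Simplified model of the write pipeline: which packet each DataNode
// receives at each step. Assumption: every hop takes one uniform step.
class PipelineSchedule {
    // Packet number that DataNode `node` receives at `step` (both 1-based),
    // or 0 if that node is idle at that step.
    static int packetAt(int step, int node, int numPackets) {
        int packet = step - node + 1;
        return (packet >= 1 && packet <= numPackets) ? packet : 0;
    }

    // Steps needed until the last packet reaches the last DataNode.
    static int totalSteps(int numPackets, int numNodes) {
        return numPackets + numNodes - 1;
    }

    public static void main(String[] args) {
        int numPackets = 4, numNodes = 3;
        for (int step = 1; step <= totalSteps(numPackets, numNodes); step++) {
            StringBuilder line = new StringBuilder("step " + step + ":");
            for (int node = 1; node <= numNodes; node++) {
                int p = packetAt(step, node, numPackets);
                line.append(" DN").append(node).append('=').append(p == 0 ? "-" : "p" + p);
            }
            System.out.println(line);
        }
    }
}
```

In this model the pipeline needs numPackets + numNodes - 1 steps, rather than numPackets * numNodes, which is the point of overlapping the transfers.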

FSDataOutputStream's write method calls DFSOutputStream's write method. Because DFSOutputStream extends FSOutputSummer, the call actually lands in FSOutputSummer's write method, shown below:

    public synchronized void write(int b) throws IOException {
        this.buf[this.count++] = (byte)b;
        if (this.count == this.buf.length) {
            this.flushBuffer(); // eventually calls writeChecksumChunks
        }
    }

writeChecksumChunks is implemented as follows:

    private void writeChecksumChunks(byte[] b, int off, int len) throws IOException {
        this.sum.calculateChunkedSums(b, off, len, this.checksum, 0);

        for(int i = 0; i < len; i += this.sum.getBytesPerChecksum()) {
            int chunkLen = Math.min(this.sum.getBytesPerChecksum(), len - i);
            int ckOffset = i / this.sum.getBytesPerChecksum() * this.getChecksumSize();
            this.writeChunk(b, off + i, chunkLen, this.checksum, ckOffset, this.getChecksumSize());
        }

    }
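The per-chunk checksum loop can be tried outside Hadoop. The sketch below uses java.util.zip.CRC32 as a stand-in for the checksum object in the excerpt (an assumption made for illustration), and the class and method names are hypothetical.

```java
import java.util.zip.CRC32;

// Computes one checksum per chunk, mirroring the loop structure in the excerpt.
// Assumption: CRC32 from the JDK stands in for Hadoop's own checksum class.
class ChunkedChecksumSketch {
    // Returns one CRC32 value per chunk of at most bytesPerChecksum bytes.
    static long[] chunkedSums(byte[] data, int off, int len, int bytesPerChecksum) {
        int numChunks = (len + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[numChunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < len; i += bytesPerChecksum) {
            int chunkLen = Math.min(bytesPerChecksum, len - i); // the last chunk may be short
            crc.reset();
            crc.update(data, off + i, chunkLen);
            sums[i / bytesPerChecksum] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] sums = chunkedSums(new byte[1300], 0, 1300, 512);
        System.out.println("chunks: " + sums.length);
    }
}
```

With 512-byte chunks, 1300 bytes of data yield three checksums: two for full chunks and one for the trailing 276 bytes.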

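writeChunk is implemented by the subclass DFSOutputStream, as follows. (This excerpt uses the older four-argument signature, while the writeChecksumChunks excerpt above calls a six-argument variant that appears to come from a later Hadoop release; the extra arguments are the checksum offset and length.)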

protected synchronized void writeChunk(byte[] b, int offset, int len, byte[] checksum) throws IOException {

      int cklen = checksum.length;

      // Create a packet if none is currently open, then write the chunk into it

      if (currentPacket == null) {

        currentPacket = new Packet(packetSize, chunksPerPacket, bytesCurBlock);

      }

      currentPacket.writeChecksum(checksum, 0, cklen);

      currentPacket.writeData(b, offset, len);

      currentPacket.numChunks++;

      bytesCurBlock += len;

      // If this packet is full, put it on the queue for sending

      if (currentPacket.numChunks == currentPacket.maxChunks ||

          bytesCurBlock == blockSize) {

          ......

          dataQueue.addLast(currentPacket);

          // Wake up the transfer thread waiting on dataQueue, i.e. the DataStreamer

          dataQueue.notifyAll();

          currentPacket = null;

          ......

      }

  }
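The hand-off that writeChunk performs, enqueueing a full packet and waking a waiting thread, is a classic wait/notify producer-consumer pattern. The sketch below is a minimal stand-alone model of it, not the DFSClient code: packets are represented by sequence numbers, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Minimal model of the dataQueue/ackQueue hand-off using wait/notifyAll.
// Assumption: packets are plain sequence numbers; no network I/O is involved.
class QueueHandoffSketch {
    private final LinkedList<Integer> dataQueue = new LinkedList<>();
    private final LinkedList<Integer> ackQueue = new LinkedList<>();

    // Producer side: enqueue a full packet and wake the streamer thread.
    void enqueue(int seqno) {
        synchronized (dataQueue) {
            dataQueue.addLast(seqno);
            dataQueue.notifyAll();
        }
    }

    // Streamer side: wait for a packet, then move it to ackQueue to await its ack.
    void moveOneToAckQueue() throws InterruptedException {
        synchronized (dataQueue) {
            while (dataQueue.isEmpty()) {
                dataQueue.wait(1000); // re-check the condition after every wake-up
            }
            int seqno = dataQueue.removeFirst();
            synchronized (ackQueue) {
                ackQueue.addLast(seqno);
                ackQueue.notifyAll();
            }
        }
    }

    // Runs a producer and a streamer thread; returns the final ackQueue contents.
    static List<Integer> run(int numPackets) {
        QueueHandoffSketch q = new QueueHandoffSketch();
        Thread streamer = new Thread(() -> {
            try {
                for (int i = 0; i < numPackets; i++) {
                    q.moveOneToAckQueue();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        streamer.start();
        for (int seqno = 1; seqno <= numPackets; seqno++) {
            q.enqueue(seqno);
        }
        try {
            streamer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        synchronized (q.ackQueue) {
            return new ArrayList<>(q.ackQueue);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(3));
    }
}
```

The wait call sits inside a while loop so that a spurious or early wake-up simply re-checks the queue instead of reading from an empty list.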

DataStreamer's run function is as follows:

 public void run() {

    while (!closed && clientRunning) {

      Packet one = null;

      synchronized (dataQueue) {

        // Wait while the queue holds no packets

        while ((!closed && !hasError && clientRunning

               && dataQueue.size() == 0) || doSleep) {

          try {

            dataQueue.wait(1000);

          } catch (InterruptedException  e) {

          }

          doSleep = false;

        }

        try {

          // Take the first packet in the queue

          one = dataQueue.getFirst();

          long offsetInBlock = one.offsetInBlock;

          // Have the NameNode allocate a block, and open an output stream to that block

          if (blockStream == null) {

            nodes = nextBlockOutputStream(src);

            response = new ResponseProcessor(nodes);

            response.start();

          }

          ByteBuffer buf = one.getBuffer();

          // Move the packet from dataQueue to ackQueue, where it waits for acknowledgement

          dataQueue.removeFirst();

          dataQueue.notifyAll();

          synchronized (ackQueue) {

            ackQueue.addLast(one);

            ackQueue.notifyAll();

          }

          // Write the data to the block on the DataNode through the output stream

          blockStream.write(buf.array(), buf.position(), buf.remaining());

          if (one.lastPacketInBlock) {

            blockStream.writeInt(0); // marks the end of this block

          }

          blockStream.flush();

        } catch (Throwable e) {

        }

      }

      ......

  }

An important function here is nextBlockOutputStream, implemented as follows:

 private DatanodeInfo[] nextBlockOutputStream(String client) throws IOException {

    LocatedBlock lb = null;

    boolean retry = false;

    DatanodeInfo[] nodes;

    int count = conf.getInt("dfs.client.block.write.retries", 3);

    boolean success;

    do {

      ......

      // Ask the NameNode to allocate a block and DataNodes for the file

      lb = locateFollowingBlock(startTime);

      block = lb.getBlock();

      nodes = lb.getLocations();

      // Create the output stream to the DataNode

      success = createBlockOutputStream(nodes, clientName, false);

      ......

    } while (retry && --count >= 0);

    return nodes;

  }
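The shape of the do/while loop above determines how many attempts are made. The body of the loop is elided in the excerpt, so the sketch below assumes that retry is set whenever an attempt fails; the class and method names are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Counts the attempts made by a do/while loop of the form
//   do { ... } while (retry && --count >= 0);
// Assumption: retry is set to true whenever an attempt fails.
class RetryLoopSketch {
    // Runs `attempt` until it succeeds or the retry budget is used up.
    // Returns the number of attempts that were made.
    static int attemptsMade(BooleanSupplier attempt, int retries) {
        int count = retries;
        int attempts = 0;
        boolean retry;
        do {
            attempts++;
            retry = !attempt.getAsBoolean(); // retry only after a failure
        } while (retry && --count >= 0);
        return attempts;
    }

    public static void main(String[] args) {
        System.out.println("always failing: " + attemptsMade(() -> false, 3) + " attempts");
        System.out.println("succeeding at once: " + attemptsMade(() -> true, 3) + " attempts");
    }
}
```

Under that assumption, a retry count of 3 allows one initial attempt plus three retries before the loop gives up.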

locateFollowingBlock calls namenode.addBlock(src, clientName) over RPC.

NameNode

The NameNode's addBlock function is implemented as follows:

  public LocatedBlock addBlock(String src,

                               String clientName) throws IOException {

    LocatedBlock locatedBlock = namesystem.getAdditionalBlock(src, clientName);

    return locatedBlock;

  }

FSNamesystem's getAdditionalBlock is implemented as follows:

  public LocatedBlock getAdditionalBlock(String src, String clientName) throws IOException {

    long fileLength, blockSize;

    int replication;

    DatanodeDescriptor clientNode = null;

    Block newBlock = null;

    ......

    // Choose the DataNodes for the new block

    DatanodeDescriptor targets[] = replicator.chooseTarget(replication, clientNode, null, blockSize);

    ......

    // Get the INode of every component in the file path; the last one is the INode of the newly added file, in the under construction state

    INode[] pathINodes = dir.getExistingPathINodes(src);

    int inodesLen = pathINodes.length;

    INodeFileUnderConstruction pendingFile  = (INodeFileUnderConstruction)

                                                pathINodes[inodesLen - 1];

    // Allocate a block for the file and record which DataNodes it will be written to

    newBlock = allocateBlock(src, pathINodes);

    pendingFile.setTargets(targets);

    ......

    return new LocatedBlock(newBlock, targets, fileLength);

  }

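Client

Once the DataNodes and the block have been allocated, createBlockOutputStream opens the connection to the first DataNode and sends the write-block header.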

private boolean createBlockOutputStream(DatanodeInfo[] nodes, String client, boolean recoveryFlag) {

      // Create a socket and connect to the DataNode

      InetSocketAddress target = NetUtils.createSocketAddr(nodes[0].getName());

      s = socketFactory.createSocket();

      int timeoutValue = 3000 * nodes.length + socketTimeout;

      s.connect(target, timeoutValue);

      s.setSoTimeout(timeoutValue);

      s.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

      long writeTimeout = HdfsConstants.WRITE_TIMEOUT_EXTENSION * nodes.length + datanodeWriteTimeout;

      DataOutputStream out = new DataOutputStream(new BufferedOutputStream(NetUtils.getOutputStream(s, writeTimeout),  DataNode.SMALL_BUFFER_SIZE));

      blockReplyStream = new DataInputStream(NetUtils.getInputStream(s));

      // Write the operation header

      out.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

      out.write( DataTransferProtocol.OP_WRITE_BLOCK );

      out.writeLong( block.getBlockId() );

      out.writeLong( block.getGenerationStamp() );

      out.writeInt( nodes.length );

      out.writeBoolean( recoveryFlag );

      Text.writeString( out, client );

      out.writeBoolean(false);

      out.writeInt( nodes.length - 1 );

      // Note that this loop starts at 1, not 0. It sends the information of every DataNode except the first to the first DataNode, which uses it to forward the data to the others

      for (int i = 1; i < nodes.length; i++) {

        nodes[i].write(out);

      }

      checksum.writeHeader( out );

      out.flush();

      firstBadLink = Text.readString(blockReplyStream);

      if (firstBadLink.length() != 0) {

        throw new IOException("Bad connect ack with firstBadLink " + firstBadLink);

      }

      blockStream = out;

      return true; // the connection to the pipeline was established

  }

After the DataStreamer's run function has created the output stream, the client calls blockStream.write to send the data to the DataNode.
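The header written by createBlockOutputStream only works because the receiving side reads the fields back in exactly the same order. The sketch below shows that round trip with in-memory streams. It is a simplified model: the version and opcode constants are placeholders, writeUTF stands in for Text.writeString, DataNodes are plain strings, and the checksum header is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Round trip of a simplified write-block header through in-memory streams.
// Assumptions: placeholder constants, writeUTF instead of Text.writeString,
// DataNodes modeled as strings, checksum header omitted.
class WriteBlockHeaderSketch {
    static final short VERSION = 1;        // placeholder, not the real protocol version
    static final byte OP_WRITE_BLOCK = 80; // placeholder opcode value

    static byte[] encode(long blockId, long genStamp, String client, List<String> pipeline) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(VERSION);
            out.write(OP_WRITE_BLOCK);
            out.writeLong(blockId);
            out.writeLong(genStamp);
            out.writeInt(pipeline.size());     // nodes in the whole pipeline
            out.writeBoolean(false);           // recovery flag
            out.writeUTF(client);
            out.writeBoolean(false);           // no source DataNode info
            out.writeInt(pipeline.size() - 1); // number of downstream targets
            for (int i = 1; i < pipeline.size(); i++) {
                out.writeUTF(pipeline.get(i)); // skip the node we connect to
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the fields back in the order they were written.
    static String decode(byte[] header) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(header));
            in.readShort();                    // version
            in.readByte();                     // opcode
            long blockId = in.readLong();
            long genStamp = in.readLong();
            int pipelineSize = in.readInt();
            in.readBoolean();                  // recovery flag
            String client = in.readUTF();
            in.readBoolean();                  // source DataNode flag
            int numTargets = in.readInt();
            List<String> targets = new ArrayList<>();
            for (int i = 0; i < numTargets; i++) {
                targets.add(in.readUTF());
            }
            return "block=" + blockId + " gen=" + genStamp + " pipeline=" + pipelineSize
                    + " client=" + client + " targets=" + targets;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("dn1", "dn2", "dn3");
        System.out.println(decode(encode(42L, 7L, "demo", pipeline)));
    }
}
```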

DataNode

In the DataNode's DataXceiver, receiving the DataTransferProtocol.OP_WRITE_BLOCK operation invokes the writeBlock function:

private void writeBlock(DataInputStream in) throws IOException {

    DatanodeInfo srcDataNode = null;

    // Read the header

    Block block = new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong());

    int pipelineSize = in.readInt(); // num of datanodes in entire pipeline

    boolean isRecovery = in.readBoolean(); // is this part of recovery?

    String client = Text.readString(in); // working on behalf of this client

    boolean hasSrcDataNode = in.readBoolean(); // is src node info present

    if (hasSrcDataNode) {

      srcDataNode = new DatanodeInfo();

      srcDataNode.readFields(in);

    }

    int numTargets = in.readInt();

    if (numTargets < 0) {

      throw new IOException("Mislabelled incoming datastream.");

    }

    // Read the list of remaining DataNodes. If this is the first DataNode, the list holds the second and third DataNodes; if this is the second DataNode, it holds only the third

    DatanodeInfo targets[] = new DatanodeInfo[numTargets];

    for (int i = 0; i < targets.length; i++) {

      DatanodeInfo tmp = new DatanodeInfo();

      tmp.readFields(in);

      targets[i] = tmp;

    }

    DataOutputStream mirrorOut = null;  // stream to next target

    DataInputStream mirrorIn = null;    // reply from next target

    DataOutputStream replyOut = null;   // stream to prev target

    Socket mirrorSock = null;           // socket to next target

    BlockReceiver blockReceiver = null; // responsible for data handling

    String mirrorNode = null;           // the name:port of next target

    String firstBadLink = "";           // first datanode that failed in connection setup

    try {

      // Create a BlockReceiver. Its member DataInputStream in reads data from the client or the previous DataNode, its member DataOutputStream mirrorOut writes data to the next DataNode, and its member OutputStream out writes data to local storage.

      blockReceiver = new BlockReceiver(block, in, s.getRemoteSocketAddress().toString(), s.getLocalSocketAddress().toString(), isRecovery, client, srcDataNode, datanode);

      // get a connection back to the previous target

      replyOut = new DataOutputStream(NetUtils.getOutputStream(s, datanode.socketWriteTimeout));

      // If this is not the last DataNode, open a socket connection to the next DataNode

      if (targets.length > 0) {

        InetSocketAddress mirrorTarget = null;

        // Connect to backup machine

        mirrorNode = targets[0].getName();

        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);

        mirrorSock = datanode.newSocket();

        int timeoutValue = numTargets * datanode.socketTimeout;

        int writeTimeout = datanode.socketWriteTimeout +

                             (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);

        mirrorSock.connect(mirrorTarget, timeoutValue);

        mirrorSock.setSoTimeout(timeoutValue);

        mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);

        // Create the stream used to write data to the next DataNode

        mirrorOut = new DataOutputStream(new BufferedOutputStream(NetUtils.getOutputStream(mirrorSock, writeTimeout), SMALL_BUFFER_SIZE));

        mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

        mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );

        mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );

        mirrorOut.writeLong( block.getBlockId() );

        mirrorOut.writeLong( block.getGenerationStamp() );

        mirrorOut.writeInt( pipelineSize );

        mirrorOut.writeBoolean( isRecovery );

        Text.writeString( mirrorOut, client );

        mirrorOut.writeBoolean(hasSrcDataNode);

        if (hasSrcDataNode) { // pass src node information

          srcDataNode.write(mirrorOut);

        }

        mirrorOut.writeInt( targets.length - 1 );

        // This loop also starts at 1: it sends the next DataNode the information of all DataNodes after it

        for ( int i = 1; i < targets.length; i++ ) {

          targets[i].write( mirrorOut );

        }

        blockReceiver.writeChecksumHeader(mirrorOut);

        mirrorOut.flush();

      }

      // Receive the block with the BlockReceiver

      String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;

      blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut,

                                 mirrorAddr, null, targets.length);

      ......

    } finally {

      // close all opened streams

      IOUtils.closeStream(mirrorOut);

      IOUtils.closeStream(mirrorIn);

      IOUtils.closeStream(replyOut);

      IOUtils.closeSocket(mirrorSock);

      IOUtils.closeStream(blockReceiver);

    }

  }
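Both createBlockOutputStream and writeBlock connect to the first node in their list and pass on only the rest, so the target list shrinks by one at every hop. A minimal sketch of that chain, with DataNodes modeled as strings and hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models how the list of targets shrinks as the write-block header is
// forwarded down the pipeline. Assumption: nodes are plain strings.
class PipelineSetupSketch {
    // Returns one entry per hop: "sender -> receiver : targets sent along".
    static List<String> hops(List<String> nodes) {
        List<String> log = new ArrayList<>();
        String sender = "client";
        List<String> remaining = nodes;
        while (!remaining.isEmpty()) {
            String receiver = remaining.get(0);
            // Everything after the receiver is forwarded as its downstream targets.
            List<String> forwarded = remaining.subList(1, remaining.size());
            log.add(sender + " -> " + receiver + " : " + forwarded);
            sender = receiver;
            remaining = forwarded;
        }
        return log;
    }

    public static void main(String[] args) {
        for (String hop : hops(Arrays.asList("dn1", "dn2", "dn3"))) {
            System.out.println(hop);
        }
    }
}
```

The last DataNode receives an empty target list, which corresponds to the targets.length > 0 check failing, so it opens no mirror connection.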

The key logic in BlockReceiver's receiveBlock function is as follows:

  void receiveBlock(

      DataOutputStream mirrOut, // output to next datanode

      DataInputStream mirrIn,   // input from next datanode

      DataOutputStream replyOut,  // output to previous datanode

      String mirrAddr, BlockTransferThrottler throttlerArg,

      int numTargets) throws IOException {

      ......

      // Keep receiving packets until the block ends

      while (receivePacket() > 0) {}

      if (mirrorOut != null) {

        try {

          mirrorOut.writeInt(0); // mark the end of the block

          mirrorOut.flush();

        } catch (IOException e) {

          handleMirrorOutError(e);

        }

      }

      ......

  }

BlockReceiver's receivePacket function is as follows:

private int receivePacket() throws IOException {

    // Receive one packet from the client or the previous node

    int payloadLen = readNextPacket();

    buf.mark();

    //read the header

    buf.getInt(); // packet length

    offsetInBlock = buf.getLong(); // get offset of packet in block

    long seqno = buf.getLong();    // get seqno

    boolean lastPacketInBlock = (buf.get() != 0);

    int endOfHeader = buf.position();

    buf.reset();

    setBlockPosition(offsetInBlock);

    // Forward the packet to the next DataNode

    if (mirrorOut != null) {

      try {

        mirrorOut.write(buf.array(), buf.position(), buf.remaining());

        mirrorOut.flush();

      } catch (IOException e) {

        handleMirrorOutError(e);

      }

    }

    buf.position(endOfHeader);       

    int len = buf.getInt();

    offsetInBlock += len;

    int checksumLen = ((len + bytesPerChecksum - 1)/bytesPerChecksum)*

                                                            checksumSize;

    int checksumOff = buf.position();

    int dataOff = checksumOff + checksumLen;

    byte pktBuf[] = buf.array();

    buf.position(buf.limit()); // move to the end of the data.

    ......

    // Write the data to the local block

    out.write(pktBuf, dataOff, len);

    /// flush entire packet before sending ack

    flush();

    // put in queue for pending acks

    if (responder != null) {

      ((PacketResponder)responder.getRunnable()).enqueue(seqno,

                                      lastPacketInBlock);

    }

    return payloadLen;

  }
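The offset arithmetic in receivePacket can be checked with a small ByteBuffer experiment. The sketch below builds a header with the same field order that the excerpt reads (packet length, offset in block, sequence number, last-packet flag, data length) and then computes where the checksums and the data start. The field values are made up, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;

// Reproduces the header parsing and offset arithmetic of the excerpt
// on a locally built buffer. Assumption: field values are arbitrary.
class PacketHeaderSketch {
    // Number of checksum bytes covering `len` data bytes (the last chunk may be partial).
    static int checksumLen(int len, int bytesPerChecksum, int checksumSize) {
        return ((len + bytesPerChecksum - 1) / bytesPerChecksum) * checksumSize;
    }

    // Builds a header, parses it in the same order, and returns the data offset.
    static int dataOffset(int dataLen, int bytesPerChecksum, int checksumSize) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(0);       // packet length (unused in this sketch)
        buf.putLong(0L);     // offset of the packet within the block
        buf.putLong(1L);     // sequence number
        buf.put((byte) 0);   // last-packet-in-block flag
        buf.putInt(dataLen); // length of the data section
        buf.flip();

        buf.getInt();
        buf.getLong();
        buf.getLong();
        buf.get();
        int len = buf.getInt();
        int checksumOff = buf.position(); // checksums start right after the header
        return checksumOff + checksumLen(len, bytesPerChecksum, checksumSize);
    }

    public static void main(String[] args) {
        System.out.println("checksum bytes: " + checksumLen(1000, 512, 4));
        System.out.println("data offset: " + dataOffset(1000, 512, 4));
    }
}
```

For 1000 data bytes with 512-byte chunks and 4-byte checksums, two checksums (8 bytes) precede the data, so the data begins 8 bytes after the 25-byte header built here.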