HDFS File Operation Flow (4): The Write Operation (DataNode Side)

    In the previous post I analyzed in detail how the client works during an HDFS file write. Its work boils down to two things: asking the NameNode to allocate a Block, and streaming that Block's packets to the DataNodes. So how does a DataNode actually receive the block? The story starts with DataNode registration.

    After a DataNode starts up, it registers with the NameNode to report information about itself, including its storage information and its service information. The service information covers its data transfer address, its RPC address, and its info (HTTP) address. The data transfer address is the one used to accept block-related requests from clients. This work is managed by DataXceiverServer, which can essentially be viewed as a thread pool: the actual work of serving a particular client is handed off to a worker thread, DataXceiver. A minimal sketch of that accept-and-dispatch pattern is shown below, followed by the real DataXceiver.
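    The sketch below only illustrates the pattern: one server thread accepts connections on the data transfer address and hands each accepted socket to a worker thread. All class and member names here are invented for the example; the real DataXceiverServer additionally tracks its child sockets, enforces maxXceiverCount, and throttles balancing traffic.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch only: accept one socket per incoming request and hand
// it to a worker thread, which parses the opcode and serves the request.
class MiniXceiverServer implements Runnable {
  private final ServerSocket ss;

  MiniXceiverServer(int port) throws IOException {
    // in real HDFS this is the data transfer address (dfs.datanode.address)
    this.ss = new ServerSocket(port);
  }

  public void run() {
    while (!ss.isClosed()) {
      try {
        Socket s = ss.accept();                                 // one connection per request
        new Thread(new MiniXceiver(s), "dataXceiver").start();  // worker serves it
      } catch (IOException e) {
        // the real server logs and keeps accepting unless it was asked to stop
      }
    }
  }

  // Worker stub: the real DataXceiver reads the protocol version and opcode,
  // then dispatches to readBlock(), writeBlock(), copyBlock(), ...
  static class MiniXceiver implements Runnable {
    private final Socket s;
    MiniXceiver(Socket s) { this.s = s; }
    public void run() { /* parse the header from s and dispatch */ }
  }
}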

class DataXceiver implements Runnable, FSConstants {
    public void run() {
      
    DataInputStream in = null;
    try {
      in = new DataInputStream( new BufferedInputStream(NetUtils.getInputStream(s), SMALL_BUFFER_SIZE));
      short version = in.readShort();
      if ( version != DataTransferProtocol.DATA_TRANSFER_VERSION ) {
        throw new IOException( "Version Mismatch" );
      }
    
      boolean local = s.getInetAddress().equals(s.getLocalAddress());
      byte op = in.readByte();
      
      // Make sure the xceiver count is not exceeded
      int curXceiverCount = datanode.getXceiverCount();
      if (curXceiverCount > dataXceiverServer.maxXceiverCount) {
        throw new IOException("xceiverCount " + curXceiverCount  + " exceeds the limit of concurrent xcievers " + dataXceiverServer.maxXceiverCount);
      }
      
      long startTime = DataNode.now();
      switch ( op ) {
      case DataTransferProtocol.OP_READ_BLOCK:
        readBlock( in );
        datanode.myMetrics.readBlockOp.inc(DataNode.now() - startTime);
        if (local)
          datanode.myMetrics.readsFromLocalClient.inc();
        else
          datanode.myMetrics.readsFromRemoteClient.inc();
        break;
      case DataTransferProtocol.OP_WRITE_BLOCK:
        writeBlock( in );
        datanode.myMetrics.writeBlockOp.inc(DataNode.now() - startTime);
        if (local)
          datanode.myMetrics.writesFromLocalClient.inc();
        else
          datanode.myMetrics.writesFromRemoteClient.inc();
        break;
      case DataTransferProtocol.OP_READ_METADATA:
        readMetadata( in );
        datanode.myMetrics.readMetadataOp.inc(DataNode.now() - startTime);
        break;
      case DataTransferProtocol.OP_REPLACE_BLOCK: // for balancing purpose; send to a destination
        replaceBlock(in);
        datanode.myMetrics.replaceBlockOp.inc(DataNode.now() - startTime);
        break;
      case DataTransferProtocol.OP_COPY_BLOCK:
            // for balancing purpose; send to a proxy source
        copyBlock(in);
        datanode.myMetrics.copyBlockOp.inc(DataNode.now() - startTime);
        break;
      case DataTransferProtocol.OP_BLOCK_CHECKSUM: //get the checksum of a block
        getBlockChecksum(in);
        datanode.myMetrics.blockChecksumOp.inc(DataNode.now() - startTime);
        break;
      default:
        throw new IOException("Unknown opcode " + op + " in data stream");
      }
    } catch (Throwable t) {
      LOG.error(datanode.dnRegistration + ":DataXceiver",t);
    } finally {
      LOG.debug(datanode.dnRegistration + ":Number of active connections is: " + datanode.getXceiverCount());
      IOUtils.closeStream(in);
      IOUtils.closeSocket(s);
      dataXceiverServer.childSockets.remove(s);
    }
  }
}
    In DataXceiver's run() method, the worker dispatches to the corresponding handler according to the operation the client requested (read a block, write a block, copy a block, and so on). Since this post is about the write path, we will focus on writeBlock(); a short sketch of the wire framing that every such request shares comes first.
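    Every request on this connection starts with the framing that run() parses above: a protocol version short, a one-byte opcode, and then the operation-specific fields. The snippet below is a hypothetical sender-side sketch of that framing; the constant values are placeholders, and real code would use DataTransferProtocol.DATA_TRANSFER_VERSION and DataTransferProtocol.OP_WRITE_BLOCK.

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Hypothetical sketch of the sender side of the framing read by run():
// version short + opcode byte, followed by the op-specific fields.
class WriteRequestFraming {
  // placeholder values; use the DataTransferProtocol constants in real code
  static final short DATA_TRANSFER_VERSION = 0;
  static final byte OP_WRITE_BLOCK = (byte) 80;

  static DataOutputStream startWriteBlock(Socket toDatanode) throws IOException {
    DataOutputStream out = new DataOutputStream(
        new BufferedOutputStream(toDatanode.getOutputStream()));
    out.writeShort(DATA_TRANSFER_VERSION); // checked first by DataXceiver.run()
    out.writeByte(OP_WRITE_BLOCK);         // selects the writeBlock() branch
    // ... followed by blockId, generationStamp, pipelineSize, isRecovery,
    //     client name, source-datanode flag, downstream targets and the
    //     checksum header, exactly as writeBlock() below reads them
    return out;
  }
}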

private void writeBlock(DataInputStream in) throws IOException {
    DatanodeInfo srcDataNode = null;
    LOG.debug("writeBlock receive buf size " + s.getReceiveBufferSize() + " tcp no delay " + s.getTcpNoDelay());
    // first, read the fields of the request header
    Block block = new Block(in.readLong(), dataXceiverServer.estimateBlockSize, in.readLong()); // block id and generation stamp
    int pipelineSize = in.readInt();           // number of datanodes in the pipeline
    boolean isRecovery = in.readBoolean();     // is this part of block recovery?
    String client = Text.readString(in);       // client name (empty for datanode-initiated replication)
    boolean hasSrcDataNode = in.readBoolean(); // is the request relayed from another datanode?
    if (hasSrcDataNode) {
      srcDataNode = new DatanodeInfo();
      srcDataNode.readFields(in);              // source datanode info
    }
    int numTargets = in.readInt();             // number of remaining downstream targets
    if (numTargets < 0) {
      throw new IOException("Mislabelled incoming datastream.");
    }
    DatanodeInfo targets[] = new DatanodeInfo[numTargets];
    for (int i = 0; i < targets.length; i++) {
      DatanodeInfo tmp = new DatanodeInfo();
      tmp.readFields(in);                      // info for each remaining downstream datanode
      targets[i] = tmp;
    }

    DataOutputStream mirrorOut = null;  // stream to the next datanode in the pipeline
    DataInputStream mirrorIn = null;    // reply from next target
    
    DataOutputStream replyOut = null;   // stream to prev target
    
    Socket mirrorSock = null;           // socket to next target
    BlockReceiver blockReceiver = null; // responsible for data handling
    String mirrorNode = null;           // the name:port of next target
    String firstBadLink = "";           // first datanode that failed in connection setup
    try {
      // open a block receiver and check that the block does not already exist
      blockReceiver = new BlockReceiver(block, in, s.getRemoteSocketAddress().toString(), s.getLocalSocketAddress().toString(), isRecovery, client, srcDataNode, datanode);

      // open a reply stream back to the previous node (client or upstream datanode)
      replyOut = new DataOutputStream(NetUtils.getOutputStream(s, datanode.socketWriteTimeout));

      if (targets.length > 0) {
        InetSocketAddress mirrorTarget = null;
        // Connect to backup machine
        mirrorNode = targets[0].getName();
        mirrorTarget = NetUtils.createSocketAddr(mirrorNode);
        mirrorSock = datanode.newSocket();
        try {
          int timeoutValue = numTargets * datanode.socketTimeout;
          int writeTimeout = datanode.socketWriteTimeout + (HdfsConstants.WRITE_TIMEOUT_EXTENSION * numTargets);
          NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
          mirrorSock.setSoTimeout(timeoutValue);
          mirrorSock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);
          mirrorOut = new DataOutputStream(new BufferedOutputStream(NetUtils.getOutputStream(mirrorSock, writeTimeout),SMALL_BUFFER_SIZE));
          mirrorIn = new DataInputStream(NetUtils.getInputStream(mirrorSock));

          // write the block write request header to the next datanode
          mirrorOut.writeShort( DataTransferProtocol.DATA_TRANSFER_VERSION );
          mirrorOut.write( DataTransferProtocol.OP_WRITE_BLOCK );
          mirrorOut.writeLong( block.getBlockId() );
          mirrorOut.writeLong( block.getGenerationStamp() );
          mirrorOut.writeInt( pipelineSize );
          mirrorOut.writeBoolean( isRecovery );
          Text.writeString( mirrorOut, client );
          mirrorOut.writeBoolean(hasSrcDataNode);
          if (hasSrcDataNode) { // pass src node information
            srcDataNode.write(mirrorOut);
          }
          mirrorOut.writeInt( targets.length - 1 );
          for ( int i = 1; i < targets.length; i++ ) {
            targets[i].write( mirrorOut );
          }

          // forward the checksum header to the next datanode
          blockReceiver.writeChecksumHeader(mirrorOut);
          mirrorOut.flush();

          // read the connect ack from the next datanode
          if (client.length() != 0) {
            firstBadLink = Text.readString(mirrorIn);
            if (LOG.isDebugEnabled() || firstBadLink.length() > 0) {
              LOG.info("Datanode " + targets.length +
                       " got response for connect ack " +
                       " from downstream datanode with firstbadlink as " +
                       firstBadLink);
            }
          }

        } catch (IOException e) {
          if (client.length() != 0) {
            Text.writeString(replyOut, mirrorNode);
            replyOut.flush();
          }
          IOUtils.closeStream(mirrorOut);
          mirrorOut = null;
          IOUtils.closeStream(mirrorIn);
          mirrorIn = null;
          IOUtils.closeSocket(mirrorSock);
          mirrorSock = null;
          if (client.length() > 0) {
            throw e;
          }
          // for a datanode-initiated replication request, continue without the mirror
        }
      }


      // send the connect ack (the first bad datanode, if any) back to the
      // previous datanode or to the client
      if (client.length() != 0) {
        Text.writeString(replyOut, firstBadLink);
        replyOut.flush();
      }

      // receive the block from the previous datanode or the client, relaying it to the next datanode
      String mirrorAddr = (mirrorSock == null) ? null : mirrorNode;
      blockReceiver.receiveBlock(mirrorOut, mirrorIn, replyOut, mirrorAddr, null, targets.length);

      if (client.length() == 0) {
        // a datanode-initiated replication: add the received block to this
        // datanode's receivedBlockList so it gets reported to the namenode
        datanode.notifyNamenodeReceivedBlock(block, DataNode.EMPTY_DEL_HINT);
      }

      if (datanode.blockScanner != null) {
        datanode.blockScanner.addBlock(block);
      }
      
    } catch (IOException ioe) {
      throw ioe;
    } finally {
      // close all opened streams
      IOUtils.closeStream(mirrorOut);
      IOUtils.closeStream(mirrorIn);
      IOUtils.closeStream(replyOut);
      IOUtils.closeSocket(mirrorSock);
      IOUtils.closeStream(blockReceiver);
    }
  }

    In writeBlock(), the DataNode relies mainly on BlockReceiver to receive the block from the previous node (the client or an upstream datanode) and to relay it to the next datanode. I won't go into the implementation details of BlockReceiver here; interested readers can study its source code, and a very rough sketch of its core loop follows below. Once a DataXceiver has successfully received a block and relayed it to the next datanode, it puts the newly received block on the DataNode's receivedBlockList queue, and the DataNode in turn reports the block to the NameNode.
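    To make the data path concrete, here is a heavily simplified sketch of the receive-and-forward idea inside receiveBlock(). The packet framing below (a plain length-prefixed payload) is invented for the example; the real HDFS packet carries an offset-in-block, a sequence number, a last-packet flag and per-chunk checksums, and acknowledgements flow back upstream through a separate PacketResponder thread.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Simplified receive-and-forward loop: read one "packet" from upstream,
// relay it to the mirror (if any), then persist it locally. The framing here
// is a made-up length prefix, not the real HDFS packet format.
class ReceiveBlockSketch {
  static void receiveBlock(DataInputStream fromUpstream,
                           DataOutputStream toMirror,    // null on the last datanode
                           OutputStream toLocalDisk) throws IOException {
    while (true) {
      int len = fromUpstream.readInt();       // simplified header: payload length
      if (len == 0) {                         // 0 marks the last, empty packet
        if (toMirror != null) { toMirror.writeInt(0); toMirror.flush(); }
        break;
      }
      byte[] payload = new byte[len];
      fromUpstream.readFully(payload);        // read the packet body
      if (toMirror != null) {                 // forward downstream first
        toMirror.writeInt(len);
        toMirror.write(payload);
        toMirror.flush();
      }
      toLocalDisk.write(payload);             // persist locally (real code also
    }                                         // writes per-chunk checksums to .meta)
    toLocalDisk.flush();
  }
}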

    One more thing I would like to add concerns the DataNode's data transfer address. I once ran into an interesting problem: I deployed HDFS on a single PC, with one NameNode and one DataNode, where the PC's IP address was *.*.*.* (a campus-network address). I configured the NameNode's service address as localhost:8020 and the DataNode's data transfer address as *.*.*.*:50010. The result was that the client could never connect to the DataNode's data transfer port when writing a block. Do you know why? It is because, when the DataNode registers, the NameNode executes the following code:

    String dnAddress = Server.getRemoteAddress();
    if (dnAddress == null) {
      // mostly called inside an RPC; if not, use the address the datanode reported
      dnAddress = nodeReg.getHost();
    }

    // check if the datanode is allowed to connect to the namenode
    if (!verifyNodeRegistration(nodeReg, dnAddress)) {
      throw new DisallowedDatanodeException(nodeReg);
    }

    String hostName = nodeReg.getHost();

    // update the datanode's name with ip:port
    DatanodeID dnReg = new DatanodeID(dnAddress + ":" + nodeReg.getPort(),
                                      nodeReg.getStorageID(),
                                      nodeReg.getInfoPort(),
                                      nodeReg.getIpcPort());
    nodeReg.updateRegInfo(dnReg);
     In the code above, the NameNode immediately updates the registration info nodeReg with the remote address of the connecting DataNode, effectively rewriting the DataNode's data transfer address, and that is exactly where the problem lies. Because I had configured the NameNode's address as localhost:8020, the DataNode connected to the NameNode over the loopback interface rather than through the network, so the remote address the NameNode saw, the value returned by Server.getRemoteAddress(), was 127.0.0.1. The data transfer address the client got back for the DataNode was therefore 127.0.0.1:50010. The DataNode, meanwhile, had bound its data service socket specifically to *.*.*.*:50010, rather than listening on that port on all interfaces. Together, these left the client with a data transfer address it could not reach. I then simply configured the NameNode's address as *.*.*.*:8020 instead, and everything worked without a hitch.
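    For reference, a single-machine setup that avoids the problem looks roughly like the following. The property names are the 0.20-era ones (fs.default.name in core-site.xml, dfs.datanode.address in hdfs-site.xml); *.*.*.* again stands for the machine's real, externally visible IP, and an alternative workaround is to leave dfs.datanode.address at its default of 0.0.0.0:50010 so it listens on every interface.

<!-- core-site.xml: point clients and datanodes at the real IP, not localhost,
     so the remote address the NameNode sees at registration is that IP -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://*.*.*.*:8020</value>
</property>

<!-- hdfs-site.xml: the DataNode's data transfer address; the default
     0.0.0.0:50010 also works, since it binds to every interface -->
<property>
  <name>dfs.datanode.address</name>
  <value>*.*.*.*:50010</value>
</property>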
