Hadoop DataNode sendBlock: analysis of latency spikes

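Over the past few days I have been studying the data communication between the HDFS datanode and the client, and found that the datanode currently uses NIO for its low-level communication. The readBlock method of the datanode's DataXceiver sends the data of a block stored on the datanode to the client or to another datanode. Let's look at the following pieces of code.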

  try {
      try {
        blockSender = new BlockSender(block, startOffset, length,
            true, true, false, datanode, clientTraceFmt);
      } catch(IOException e) {
        out.writeShort(DataTransferProtocol.OP_STATUS_ERROR);
        throw e;
      }

      out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
      long read = blockSender.sendBlock(out, baseStream, null); // send data
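After reading the client's request header, readBlock starts sending the requested data. Sending goes through two streams: out is the ordinary socket stream, and baseStream is the one used for NIO. After some checksum computation and offset adjustment, sendBlock calls sendChunks to send the data the client asked for. If transferToAllowed is enabled in the configuration file (and the other conditions in the check below hold), the sendfile-style transferTo path is used to speed up sending; otherwise the data is written directly through the socket stream.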

      if (transferToAllowed && !verifyChecksum && 
          baseStream instanceof SocketOutputStream && 
          blockIn instanceof FileInputStream) {
        
        FileChannel fileChannel = ((FileInputStream)blockIn).getChannel();
        
        // blockInPosition also indicates sendChunks() uses transferTo.
        blockInPosition = fileChannel.position();
        streamForSendChunks = baseStream;
......
 while (endOffset > offset) {
        long len = sendChunks(pktBuf, maxChunksPerPacket, 
                              streamForSendChunks);
        offset += len;
        totalRead += len + ((len + bytesPerChecksum - 1)/bytesPerChecksum*
                            checksumSize);
        seqno++;
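
To make the transferTo path concrete, here is a minimal sketch using only the JDK's FileChannel API. It is not the Hadoop implementation: the class name ZeroCopySketch and the method sendRegion are made up for illustration, and the sketch leaves out the packet header and checksum handling that BlockSender performs.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Hadoop.
final class ZeroCopySketch {
    /**
     * Sends up to `count` bytes of `file`, starting at `position`, to `target`
     * and returns the number of bytes actually sent.
     */
    static long sendRegion(Path file, long position, long count,
                           WritableByteChannel target) throws IOException {
        long sent = 0;
        try (FileChannel in = FileChannel.open(file, StandardOpenOption.READ)) {
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop on the remainder.
                long n = in.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file, or the target accepted nothing this round
                }
                sent += n;
            }
        }
        return sent;
    }
}
```

When the target is a socket channel on Linux, the JDK can hand this call to the kernel's sendfile so the data does not pass through a user-space buffer; with other targets it falls back to an ordinary copy.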
In sendChunks we find that when data is sent over NIO with transferTo(), a wait step has to be performed first; according to the code comment, this works around a JRE bug.

          //first write the packet
          sockOut.write(buf, 0, dataOff);
          // no need to flush. since we know out is not a buffered stream.
          sockOut.transferToFully(fileChannel, blockInPosition, len);
        }
---------------------------------------------
 public void transferToFully(FileChannel fileCh, long position, int count) 
                              throws IOException {
    
    while (count > 0) {
      /* 
       * Ideally we should wait after transferTo returns 0. But because of
       * a bug in JRE on Linux (http://bugs.sun.com/view_bug.do?bug_id=5103988),
       * which throws an exception instead of returning 0, we wait for the
       * channel to be writable before writing to it. If you ever see 
       * IOException with message "Resource temporarily unavailable" 
       * thrown here, please let us know.
       * 
       * Once we move to JAVA SE 7, wait should be moved to correct place.
       */
      waitForWritable();
      int nTransfered = (int) fileCh.transferTo(position, count, getChannel());

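It is exactly this waitForWritable() call where, in jstack dumps, we found many threads blocked. The reason is that the file transferTo that follows has to wait until the channel is writable before it writes. Underneath, a register call and a select call do the waiting, with a timeout check.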

    /**
     * Waits on the channel with the given timeout using one of the 
     * cached selectors. It also removes any cached selectors that are
     * idle for a few seconds.
     * 
     * @param channel
     * @param ops
     * @param timeout
     * @return
     * @throws IOException
     */
    int select(SelectableChannel channel, int ops, long timeout) 
                                                   throws IOException {
     
      SelectorInfo info = get(channel);
      
      SelectionKey key = null;
      int ret = 0;
      
      try {
        while (true) {
          long start = (timeout == 0) ? 0 : System.currentTimeMillis();

          key = channel.register(info.selector, ops);// Step1
          ret = info.selector.select(timeout);//Step2
          
          if (ret != 0) {
            return ret;
          }
          
          /* Sometimes select() returns 0 much before timeout for 
           * unknown reasons. So select again if required.
           */
          if (timeout > 0) {
            timeout -= System.currentTimeMillis() - start;
            if (timeout <= 0) {
              return 0;
            }
          }
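
The register-then-select wait can be reproduced outside Hadoop with the plain JDK NIO API. The sketch below is an illustration under assumptions: the class WritableWaitSketch and the method waitForWritableMicros are hypothetical names, a fresh Selector is opened per call instead of using Hadoop's cached selector pool, and the retry loop shown above is omitted. It measures the time between Step1 and Step2 in microseconds.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of Hadoop.
final class WritableWaitSketch {
    /**
     * Waits until `channel` (which must be non-blocking) is writable or
     * `timeoutMs` elapses. `timeoutMs` must be positive, because select(0)
     * blocks indefinitely. Returns the elapsed microseconds, or -1 on timeout.
     */
    static long waitForWritableMicros(SocketChannel channel, long timeoutMs) throws IOException {
        try (Selector selector = Selector.open()) {
            long start = System.nanoTime();
            SelectionKey key = channel.register(selector, SelectionKey.OP_WRITE); // Step1
            int ready = selector.select(timeoutMs);                               // Step2
            long elapsedMicros = (System.nanoTime() - start) / 1_000;
            key.cancel();
            return ready > 0 ? elapsedMicros : -1;
        }
    }

    public static void main(String[] args) throws IOException {
        // Loopback connection used only to have a socket to wait on.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.configureBlocking(false);
                System.out.println("wait (us): " + waitForWritableMicros(client, 1000));
            }
        }
    }
}
```

On an idle loopback socket the send buffer is empty, so the wait should return almost immediately; on a loaded datanode the same wait depends on how quickly the peer drains the socket.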
Oddly, the time spent between Step1 and Step2 is extremely uneven. Tracing it with btrace shows a widely scattered distribution.

      timecost (microseconds) ------------- Distribution ------------- count
                    128 |                                         0
                    256 |                                         51895
                    512 |@@                                       243717
                   1024 |@@@@@@@@@@@@@                            1410085
                   2048 |@@@@@@@@@@@@@@@@@@@@                     2135480
                   4096 |@@                                       283689
                   8192 |                                         8247
                  16384 |                                         6856
                  32768 |                                         6741
                  65536 |                                         8390
                 131072 |                                         2019
                 262144 |                                         171
                 524288 |                                         21
                1048576 |                                         3
                2097152 |                                         0
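
For readers unfamiliar with this kind of output, the sketch below shows one way to group timings into power-of-two buckets. It assumes that each label is the lower bound of its bucket, which the original output does not state; the class Log2HistogramSketch and the sample values are made up for illustration and are not the btrace script used here.

```java
import java.util.TreeMap;

// Hypothetical helper, not the btrace script behind the histogram above.
final class Log2HistogramSketch {
    /** Largest power of two that is <= micros (0 for values below 1). */
    static long bucketFor(long micros) {
        return micros < 1 ? 0 : Long.highestOneBit(micros);
    }

    /** Counts samples per power-of-two bucket, ordered by bucket label. */
    static TreeMap<Long, Long> histogram(long[] samplesMicros) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        for (long s : samplesMicros) {
            counts.merge(bucketFor(s), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] samples = {300, 900, 1500, 1800, 2500, 70000}; // made-up timings in microseconds
        histogram(samples).forEach((bucket, n) -> System.out.println(bucket + " | " + n));
    }
}
```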




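A large number of spikes show up. The timeout HDFS currently uses for this select is too long: the default is 8 minutes. As an experiment, it can be lowered to around 100 ms with the parameter below (the value is in milliseconds; the snippet shows 1000, i.e. 1 second).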

<property> 
     <name>dfs.datanode.socket.write.timeout</name> 
     <value>1000</value> 
</property>

As the histogram below shows, the spikes are largely eliminated.

                  value  ------------- Distribution ------------- count
                    128 |                                         0
                    256 |                                         221
                    512 |                                         6896
                   1024 |@@@@@@@@@@@@@                            98509
                   2048 |@@@@@@@@@@@@@@@@@@@@@                    154922
                   4096 |@@@                                      23579
                   8192 |                                         1490
                  16384 |                                         438
                  32768 |                                         85
                  65536 |                                         5
                 131072 |                                         0
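
Because the two runs have very different sample counts, comparing the share of samples in the high buckets is more telling than comparing raw counts. The sketch below copies the counts from the two histograms above; the class TailShareSketch is a hypothetical name, and treating the buckets the second histogram does not list as zero is an assumption.

```java
// Hypothetical helper for comparing the two histograms above.
final class TailShareSketch {
    static final long[] LABELS = {128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768,
                                  65536, 131072, 262144, 524288, 1048576, 2097152};
    // Counts with the default timeout.
    static final long[] BEFORE = {0, 51895, 243717, 1410085, 2135480, 283689, 8247, 6856,
                                  6741, 8390, 2019, 171, 21, 3, 0};
    // Counts with the reduced timeout; buckets above 131072 are not listed
    // in the output and are assumed to be zero.
    static final long[] AFTER = {0, 221, 6896, 98509, 154922, 23579, 1490, 438,
                                 85, 5, 0, 0, 0, 0, 0};

    static long total(long[] counts) {
        long sum = 0;
        for (long c : counts) {
            sum += c;
        }
        return sum;
    }

    /** Number of samples in buckets whose label is >= threshold. */
    static long tailCount(long[] counts, long threshold) {
        long sum = 0;
        for (int i = 0; i < counts.length; i++) {
            if (LABELS[i] >= threshold) {
                sum += counts[i];
            }
        }
        return sum;
    }

    static double tailShare(long[] counts, long threshold) {
        return (double) tailCount(counts, threshold) / total(counts);
    }

    public static void main(String[] args) {
        long threshold = 65536;
        System.out.printf("before: %.4f%%%n", 100 * tailShare(BEFORE, threshold));
        System.out.printf("after:  %.4f%%%n", 100 * tailShare(AFTER, threshold));
    }
}
```

With these counts, the buckets labelled 65536 and above hold 10604 of 4157314 samples before the change and 5 of 286145 after it.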

---------------------------------------------






