Notes on How Hadoop Handles Lines That Span Blocks: A Source Code Walkthrough

young-ming

Categories: Hadoop, Hadoop source code. Tags: hadoop, lines spanning blocks, boundary handling, source code

Article link: https://blog.csdn.net/u011750989/article/details/11368129

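Hadoop's default block size is 64 MB (the default in Hadoop 1.x, which this code comes from; later releases raised it to 128 MB). A large file is cut into 64 MB blocks purely by byte offset, with no regard for line boundaries, and the blocks are distributed across DataNodes for storage. As a result, a single line of data will often end up spread across two blocks, possibly on different DataNodes. How does Hadoop handle this case?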
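A quick piece of arithmetic shows how easily a line lands on a block boundary. The sketch below is not Hadoop code; the class name, the method `spansBlocks`, and the fixed 100-byte record length are hypothetical choices made only for illustration.

```java
final class BlockBoundaryDemo {
    /** True if the byte range [lineStart, lineStart + lineLength) touches more than one block. */
    static boolean spansBlocks(long lineStart, long lineLength, long blockSize) {
        long firstBlock = lineStart / blockSize;
        long lastBlock = (lineStart + lineLength - 1) / blockSize;
        return firstBlock != lastBlock;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 67,108,864 bytes
        long lineLength = 100;              // hypothetical fixed-width records
        // Start offset of the last line that begins inside block 0.
        long lineStart = (blockSize - 1) / lineLength * lineLength;
        System.out.println("line at " + lineStart + " spans blocks: "
                + spansBlocks(lineStart, lineLength, blockSize));
    }
}
```

Because 67,108,864 is not a multiple of 100, the line that starts at offset 67,108,800 runs through offset 67,108,899 and so straddles the first block boundary.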

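Take TextInputFormat as an example. LineRecordReader is designed to be robust here: when it reaches the last record of its split and has not yet seen a line terminator, it keeps reading into the next split's data until it has a complete line. The reader for the next split, in turn, skips its first line by default (unless the split starts at the beginning of the file).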
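The rule can be tried out without a cluster. The following is a minimal sketch over an in-memory byte array, not the Hadoop implementation: the class `SplitBoundaryDemo`, the method `readSplit`, and the 8-byte split size are assumptions for illustration, and only `\n` is treated as a terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SplitBoundaryDemo {
    /** Returns the lines owned by the split [start, start + length) of data. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: back up one byte, then discard everything
            // up to and including the next '\n'. The previous split's reader
            // is responsible for that (possibly partial) line.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the '\n'
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it STARTS before 'end', even when
        // its terminator lies beyond 'end', inside the next split.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the '\n' (or past the end of the data)
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 8; // deliberately tiny so that lines straddle boundaries
        for (int start = 0; start < data.length; start += splitSize) {
            int length = Math.min(splitSize, data.length - start);
            System.out.println("split@" + start + " -> " + readSplit(data, start, length));
        }
    }
}
```

With these inputs every line is returned by exactly one split, although "bravo", "charlie" and "delta" each cross a split boundary, and the final two-byte split owns no line at all.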

public class TextInputFormat extends FileInputFormat<LongWritable, Text>
  implements JobConfigurable {

  private CompressionCodecFactory compressionCodecs = null;
  
  public void configure(JobConf conf) {
    compressionCodecs = new CompressionCodecFactory(conf);
  }
  
  protected boolean isSplitable(FileSystem fs, Path file) {
    return compressionCodecs.getCodec(file) == null;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
                                          InputSplit genericSplit, JobConf job,
                                          Reporter reporter)
    throws IOException {
    
    reporter.setStatus(genericSplit.toString());
    return new LineRecordReader(job, (FileSplit) genericSplit);
  }
}

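The LineRecordReader class:

LineRecordReader wraps the fileIn input stream in a LineReader and then calls its readLine method, which stores the data it reads in its first argument, a Text.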
  public LineRecordReader(Configuration job, 
                          FileSplit split) throws IOException {
    this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                                    Integer.MAX_VALUE);
    start = split.getStart();
    end = start + split.getLength(); // marks where this split ends
    final Path file = split.getPath();
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // open the file and seek to the start of the split
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(split.getPath());
    boolean skipFirstLine = false;
    if (codec != null) {
      in = new LineReader(codec.createInputStream(fileIn), job);
      end = Long.MAX_VALUE;
    } else {
      if (start != 0) {
        skipFirstLine = true;
        --start;

        fileIn.seek(start);
      }
      in = new LineReader(fileIn, job);
    }
    if (skipFirstLine) {  // skip first line and re-establish "start"
      // Not the first split of the file: the first line is read into a
      // throwaway new Text() instead of value, so it is discarded.
      start += in.readLine(new Text(), 0,
                           (int)Math.min((long)Integer.MAX_VALUE, end - start));
    }
    this.pos = start;
  }
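The `--start` before the seek is easy to overlook. A small sketch of why it matters, assuming `\n`-terminated lines in an in-memory array; the class and method names (`SkipFirstLineDemo`, `naiveStart`, `backedUpStart`) are hypothetical and not part of Hadoop:

```java
import java.nio.charset.StandardCharsets;

final class SkipFirstLineDemo {
    /** Offset just past the first '\n' found at or after 'from'. */
    static int skipPastNewline(byte[] data, int from) {
        int pos = from;
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Naive variant: skip a line starting from the split's own first byte. */
    static int naiveStart(byte[] data, int splitStart) {
        return splitStart == 0 ? 0 : skipPastNewline(data, splitStart);
    }

    /** As in the constructor above: back up one byte before skipping. */
    static int backedUpStart(byte[] data, int splitStart) {
        return splitStart == 0 ? 0 : skipPastNewline(data, splitStart - 1);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Offset 6 is exactly where "bravo" begins.
        System.out.println("naive:     " + naiveStart(data, 6));
        System.out.println("backed up: " + backedUpStart(data, 6));
        // Offset 8 is in the middle of "bravo"; both variants agree here.
        System.out.println("mid-line:  " + naiveStart(data, 8) + " " + backedUpStart(data, 8));
    }
}
```

When the split begins exactly on a line start, the naive variant jumps to "charlie" and "bravo" would be read by no one, since the previous split stops once its position reaches its end. Backing up one byte makes the skipped "line" consist only of the preceding `\n`, so the reader starts at "bravo" as intended.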


/** Read a line. */
  public synchronized boolean next(LongWritable key, Text value)
    throws IOException {

    while (pos < end) {
      key.set(pos);
      // Read one line per iteration; readLine takes care of a line that
      // crosses the split boundary. in.readLine here resolves to
      // org.apache.hadoop.util.LineReader.readLine.
      int newSize = in.readLine(value, maxLineLength,
                                Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                         maxLineLength));
      if (newSize == 0) {
        return false;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        return true;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + (pos - newSize));
    }

    return false;
  }
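One consequence of the `newSize < maxLineLength` check in next() is that newSize counts the terminator as well as the content. The sketch below models only that comparison, under the assumptions that every line ends in a single `\n` and is consumed by one readLine call; `MaxLineLengthDemo` and `keptLines` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class MaxLineLengthDemo {
    /** Lines that a next()-style loop would return for the given limit. */
    static List<String> keptLines(String text, int maxLineLength) {
        List<String> kept = new ArrayList<>();
        for (String line : text.split("\n")) {
            // newSize mirrors readLine's return value: content bytes plus the '\n'.
            int newSize = line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (newSize < maxLineLength) {
                kept.add(line);
            }
            // Otherwise the line is skipped and the loop moves on to the next one.
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(keptLines("ab\nabcdefgh\nabcd\n", 5));
    }
}
```

With a limit of 5, only "ab" is kept: "abcd" has four content bytes, but together with its newline it consumes five, which is not strictly less than the limit.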

public class LineReader {
  private static final int DEFAULT_BUFFER_SIZE = 64 * 1024;
  private int bufferSize = DEFAULT_BUFFER_SIZE;
  private InputStream in;
  private byte[] buffer;
  // the number of bytes of real data in the buffer
  private int bufferLength = 0;
  // the current position in the buffer
  private int bufferPosn = 0;

  private static final byte CR = '\r';
  private static final byte LF = '\n';
....................
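LineReader (in the util package): its core method, readLine, handles the boundary. It loops, calling in.read(buffer) to pull data into buffer (64 KB), then scans the buffer byte by byte for a \n or \r line terminator. When it finds one, it appends the bytes before it to the output Text and exits the loop. If the buffer holds no terminator, it appends what it has, refills the buffer, and keeps going until a terminator (or EOF) is found.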
  
  /**
   * Read one line from the InputStream into the given Text.  A line
   * can be terminated by one of the following: '\n' (LF) , '\r' (CR),
   * or '\r\n' (CR+LF).  EOF also terminates an otherwise unterminated
   * line.
   *
   * @param str the object to store the given line (without newline)
   * @param maxLineLength the maximum number of bytes to store into str;
   *  the rest of the line is silently discarded.
   * @param maxBytesToConsume the maximum number of bytes to consume
   *  in this call.  This is only a hint, because if the line cross
   *  this threshold, we allow it to happen.  It can overshoot
   *  potentially by as much as one buffer length.
   *
   * @return the number of bytes read including the (longest) newline
   * found.
   *
   * @throws IOException if the underlying stream throws
   */
  public int readLine(Text str, int maxLineLength,
                      int maxBytesToConsume) throws IOException {
    /* We're reading data from in, but the head of the stream may be
     * already buffered in buffer, so we have several cases:
     * 1. No newline characters are in the buffer, so we need to copy
     *    everything and read another buffer from the stream.
     * 2. An unambiguously terminated line is in buffer, so we just
     *    copy to str.
     * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
     *    in CR.  In this case we copy everything up to CR to str, but
     *    we also need to see what follows CR: if it's LF, then we
     *    need consume LF as well, so next call to readLine will read
     *    from after that.
     * We use a flag prevCharCR to signal if previous character was CR
     * and, if it happens to be at the end of the buffer, delay
     * consuming it until we have a chance to look at the char that
     * follows.
     */
    str.clear();
    int txtLength = 0; //tracks str.getLength(), as an optimization
    int newlineLength = 0; //length of terminating newline
    boolean prevCharCR = false; //true if prev char was CR
    long bytesConsumed = 0;
    do {
      int startPosn = bufferPosn; //starting from where we left off the last time
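      // If the whole buffer held no newline, the for loop below ran off its
      // end. Refill the buffer here and keep appending to the same str; str
      // is only cleared by str.clear() at the start of the next readLine
      // call, after a complete line has been returned.
      if (bufferPosn >= bufferLength) {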
        startPosn = bufferPosn = 0;
        if (prevCharCR)
          ++bytesConsumed; //account for CR from previous read
        bufferLength = in.read(buffer);
        if (bufferLength <= 0)
          break; // EOF
      }
      for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
        if (buffer[bufferPosn] == LF) {
          newlineLength = (prevCharCR) ? 2 : 1;
          // A CR immediately followed by LF counts as a 2-byte newline, so
          // appendLength below is reduced by 2.
          ++bufferPosn; // at next invocation proceed from following byte
          break;
        }
        if (prevCharCR) { //CR + notLF, we are at notLF
          newlineLength = 1;
          break;  // CR on its own: a 1-byte newline, so appendLength is reduced by 1
        }
        prevCharCR = (buffer[bufferPosn] == CR); // remember whether this byte was CR
      }
      int readLength = bufferPosn - startPosn;
      if (prevCharCR && newlineLength == 0)
        --readLength; // CR at the end of the buffer: the loop above ended without setting newlineLength to 1 or 2

      bytesConsumed += readLength;
      int appendLength = readLength - newlineLength;
      if (appendLength > maxLineLength - txtLength) {
        appendLength = maxLineLength - txtLength;
      }
      if (appendLength > 0) {
        str.append(buffer, startPosn, appendLength);
        txtLength += appendLength;
      }
    } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

    if (bytesConsumed > (long)Integer.MAX_VALUE)
      throw new IOException("Too many bytes before newline: " + bytesConsumed);    
    return (int)bytesConsumed;
  }
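The case where a CR sits at the very end of one buffer and its LF arrives in the next is easier to follow with a tiny buffer. Below is a simplified port of the readLine loop, intended as a sketch rather than the Hadoop class: `TinyLineReader` and `describe` are hypothetical names, a ByteArrayOutputStream stands in for Text, and the maxLineLength and maxBytesToConsume limits are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class TinyLineReader {
    private static final byte CR = '\r';
    private static final byte LF = '\n';
    private final InputStream in;
    private final byte[] buffer;
    private int bufferLength = 0; // bytes of real data in the buffer
    private int bufferPosn = 0;   // current position in the buffer

    TinyLineReader(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /** Stores one line (without terminator) in out; returns bytes consumed. */
    int readLine(ByteArrayOutputStream out) throws IOException {
        out.reset();
        int newlineLength = 0;
        boolean prevCharCR = false;
        int bytesConsumed = 0;
        do {
            int startPosn = bufferPosn;
            if (bufferPosn >= bufferLength) { // buffer exhausted: refill it
                startPosn = bufferPosn = 0;
                if (prevCharCR) {
                    ++bytesConsumed; // count the CR held back from the last buffer
                }
                bufferLength = in.read(buffer);
                if (bufferLength <= 0) {
                    break; // EOF
                }
            }
            for (; bufferPosn < bufferLength; ++bufferPosn) {
                if (buffer[bufferPosn] == LF) {
                    newlineLength = prevCharCR ? 2 : 1;
                    ++bufferPosn;
                    break;
                }
                if (prevCharCR) { // CR followed by something other than LF
                    newlineLength = 1;
                    break;
                }
                prevCharCR = (buffer[bufferPosn] == CR);
            }
            int readLength = bufferPosn - startPosn;
            if (prevCharCR && newlineLength == 0) {
                --readLength; // CR at the end of the buffer: decide after the refill
            }
            bytesConsumed += readLength;
            int appendLength = readLength - newlineLength;
            if (appendLength > 0) {
                out.write(buffer, startPosn, appendLength);
            }
        } while (newlineLength == 0);
        return bytesConsumed;
    }

    /** Reads every line of input and reports "text:bytesConsumed" pairs. */
    static String describe(String input, int bufferSize) {
        byte[] data = input.getBytes(StandardCharsets.US_ASCII);
        TinyLineReader reader = new TinyLineReader(new ByteArrayInputStream(data), bufferSize);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        StringBuilder report = new StringBuilder();
        try {
            int consumed;
            while ((consumed = reader.readLine(line)) > 0) {
                if (report.length() > 0) {
                    report.append(',');
                }
                report.append(new String(line.toByteArray(), StandardCharsets.US_ASCII))
                      .append(':').append(consumed);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("ab\r\ncd", 3));  // CR is the last byte of the first buffer
        System.out.println(describe("ab\r\ncd", 64)); // everything fits in one buffer
    }
}
```

Both calls in main report the same lines and byte counts, which is the point of the prevCharCR flag: the split of CR and LF across two buffer fills does not change what the caller sees.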
