ORC: From Stream Reading to Column Reading

ORC Stream

When ORC reads a stripe, it reads stream by stream: the stream, not the column, is the unit of I/O. A column in a stripe may be stored as a single stream or as several streams of different kinds, so a stream is not simply a sub-unit of a column. The stream kinds defined in the ORC protobuf are:

enum Kind {
   // boolean stream of whether the next value is non-null
   PRESENT = 0;
   // the primary data stream
   DATA = 1;
   // the length of each value for variable length data
   LENGTH = 2;
   // the dictionary blob
   DICTIONARY_DATA = 3;
   // deprecated prior to Hive 0.11
   // It was used to store the number of instances of each value in the
   // dictionary
   DICTIONARY_COUNT = 4;
   // a secondary data stream
   SECONDARY = 5;
   // the index for seeking to particular row groups
   ROW_INDEX = 6;
   // original bloom filters used before ORC-101
   BLOOM_FILTER = 7;
   // bloom filters that consistently use utf8
   BLOOM_FILTER_UTF8 = 8;

   // Virtual stream kinds to allocate space for encrypted index and data.
   ENCRYPTED_INDEX = 9;
   ENCRYPTED_DATA = 10;

   // stripe statistics streams
   STRIPE_STATISTICS = 100;
   // A virtual stream kind that is used for setting the encryption IV.
   FILE_STATISTICS = 101;
 }

  public static Area getArea(OrcProto.Stream.Kind kind) {
    switch (kind) {
      case ROW_INDEX:
      case DICTIONARY_COUNT:
      case BLOOM_FILTER:
      case BLOOM_FILTER_UTF8:
        return Area.INDEX;
      default:
        return Area.DATA;
    }
  }
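To make the stream-to-column relationship concrete, here is a minimal, self-contained sketch. `StreamLayoutSketch` and its inner `Stream` record are hypothetical stand-ins for the real `OrcProto.Stream` class; the point is only that several streams in a stripe footer can carry the same column id (e.g. PRESENT + DATA + LENGTH for a string column):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StreamLayoutSketch {
    // Hypothetical minimal stand-in for OrcProto.Stream: each stream
    // records which column it belongs to, its kind, and its byte length.
    record Stream(int column, String kind, long length) {}

    // Count how many streams each column contributes to the stripe.
    static Map<Integer, Integer> streamsPerColumn(Stream[] streams) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (Stream s : streams) {
            counts.merge(s.column(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Stream[] stripe = {
            new Stream(1, "PRESENT", 10),
            new Stream(1, "DATA", 200),
            new Stream(2, "PRESENT", 10),
            new Stream(2, "DATA", 500),
            new Stream(2, "LENGTH", 40),  // variable-length column needs LENGTH
        };
        // column 1 owns 2 streams, column 2 owns 3 streams
        System.out.println(streamsPerColumn(stripe));
    }
}
```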

From Stream Reading to Column Reading

To meet our development requirements, we now change stream reading into column reading.

Rewriting the relevant RecordReaderImpl interfaces

Because data is now read one column at a time, the includedRowGroups = pickRowGroups() call and readAllDataStreams (which reads every stream in a stripe) are commented out:

protected void readStripe() throws IOException {
    StripeInformation stripe = beginReadStripe();

    // includedRowGroups = pickRowGroups();
    ...
    // readAllDataStreams(stripe);
    ...

    if (rowInStripe < rowCountInStripe) {
      readPartialDataStreams(stripe);
      reader.startStripe(streams, stripeFooter);
      // if we skipped the first row group, move the pointers forward
      if (rowInStripe != 0) {
        seekToRowEntry(reader, (int) (rowInStripe / rowIndexStride));
      }
    }
  }
  private void readPartialDataStreams(StripeInformation stripe) throws IOException {
    // plan the disk ranges to read for the included columns
    DiskRangeList toRead = planReadColumnData(streamList, fileIncluded);
    ...
    // read the column data
    bufferChunks = ((RecordReaderBinaryCacheUtils.DefaultDataReader) dataReader)
            .readFileColumnData(toRead, stripe.getOffset(), false);
    ...
    createStreams(streamList, bufferChunks, fileIncluded,
            dataReader.getCompressionCodec(), bufferSize, streams);
  }

// planReadColumnData

/**
   * Plan the ranges of the file that we need to read, given the list of
   * columns in one stripe.
   *
   * @param streamList        the list of streams available
   * @param includedColumns   which columns are needed
   * @return the list of disk ranges that will be loaded
   */
  private DiskRangeList planReadColumnData(
          List<OrcProto.Stream> streamList,
          boolean[] includedColumns) {
    long offset = 0;
    Map<Integer, RecordReaderBinaryCacheImpl.ColumnDiskRange> columnDiskRangeMap =
            new HashMap<>();
    ColumnDiskRangeList.CreateColumnRangeHelper list =
            new ColumnDiskRangeList.CreateColumnRangeHelper();
    for (OrcProto.Stream stream : streamList) {
      long length = stream.getLength();
      int column = stream.getColumn();
      OrcProto.Stream.Kind streamKind = stream.getKind();
      // since stream kind is optional, first check if it exists
      if (stream.hasKind() &&
              (StreamName.getArea(streamKind) == StreamName.Area.DATA) &&
              (column < includedColumns.length && includedColumns[column])) {

        if (columnDiskRangeMap.containsKey(column)) {
          columnDiskRangeMap.get(column).length += length;
        } else {
          columnDiskRangeMap.put(column, new RecordReaderBinaryCacheImpl.ColumnDiskRange(offset, length));
        }
      }
      offset += length;
    }
    for (int columnId = 1; columnId < includedColumns.length; ++columnId) {
      if (includedColumns[columnId]) {
        RecordReaderBinaryCacheImpl.ColumnDiskRange range = columnDiskRangeMap.get(columnId);
        list.add(columnId, currentStripe, range.offset, range.offset + range.length);
      }
    }
    return list.extract();
  }
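The essence of planReadColumnData can be sketched without the ORC classes. In the sketch below, `PlanColumnSketch`, `Stream`, and `Range` are hypothetical stand-ins for `OrcProto.Stream` and `ColumnDiskRange`; it walks the streams in file order and accumulates one contiguous (offset, length) range per included column, assuming, as the code above does, that a column's data streams are laid out adjacently:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PlanColumnSketch {
    // Hypothetical stand-ins: a stream with a column id, an area flag
    // (true = DATA area, false = INDEX area), and a byte length.
    record Stream(int column, boolean dataArea, long length) {}

    static final class Range {
        long offset, length;
        Range(long offset, long length) { this.offset = offset; this.length = length; }
    }

    // Walk streams in file order, tracking the running offset; for each
    // included column's DATA-area streams, start a range at the first
    // stream's offset and extend it by each later stream's length.
    static Map<Integer, Range> planColumnRanges(Stream[] streams, boolean[] included) {
        long offset = 0;
        Map<Integer, Range> ranges = new LinkedHashMap<>();
        for (Stream s : streams) {
            if (s.dataArea() && s.column() < included.length && included[s.column()]) {
                Range r = ranges.get(s.column());
                if (r != null) {
                    r.length += s.length();  // assumes the column's streams are adjacent
                } else {
                    ranges.put(s.column(), new Range(offset, s.length()));
                }
            }
            offset += s.length();
        }
        return ranges;
    }

    public static void main(String[] args) {
        Stream[] streams = {
            new Stream(1, true, 100),
            new Stream(1, true, 50),
            new Stream(2, true, 200),
        };
        Map<Integer, Range> ranges = planColumnRanges(streams, new boolean[] {false, true, true});
        // column 1 -> offset 0, length 150; column 2 -> offset 150, length 200
        System.out.println(ranges.get(1).offset + "," + ranges.get(1).length);
    }
}
```

Note that, like the real method, the sketch would fail if a column were included but had no DATA-area streams; the caller is expected to pass a consistent stream list.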