ORC: From Stream Reading to Column Reading

ORC Stream

When ORC reads a stripe, it reads stream by stream: the stream, not the column, is the unit of I/O. A column in a stripe may be stored as a single stream or as several streams of different kinds, so a stream is not simply a sub-unit of a column. The stream kinds defined in the ORC protobuf are:

enum Kind {
   // boolean stream of whether the next value is non-null
   PRESENT = 0;
   // the primary data stream
   DATA = 1;
   // the length of each value for variable length data
   LENGTH = 2;
   // the dictionary blob
   DICTIONARY_DATA = 3;
   // deprecated prior to Hive 0.11
   // It was used to store the number of instances of each value in the
   // dictionary
   DICTIONARY_COUNT = 4;
   // a secondary data stream
   SECONDARY = 5;
   // the index for seeking to particular row groups
   ROW_INDEX = 6;
   // original bloom filters used before ORC-101
   BLOOM_FILTER = 7;
   // bloom filters that consistently use utf8
   BLOOM_FILTER_UTF8 = 8;

   // Virtual stream kinds to allocate space for encrypted index and data.
   ENCRYPTED_INDEX = 9;
   ENCRYPTED_DATA = 10;

   // stripe statistics streams
   STRIPE_STATISTICS = 100;
   // A virtual stream kind that is used for setting the encryption IV.
   FILE_STATISTICS = 101;
 }

  public static Area getArea(OrcProto.Stream.Kind kind) {
    switch (kind) {
      case ROW_INDEX:
      case DICTIONARY_COUNT:
      case BLOOM_FILTER:
      case BLOOM_FILTER_UTF8:
        return Area.INDEX;
      default:
        return Area.DATA;
    }
  }
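To make the stream-to-column relationship concrete, here is a minimal, self-contained sketch. `StreamLayoutSketch` and its inner `Stream` record are hypothetical stand-ins for the real `OrcProto.Stream` class; the point is only that several streams in a stripe footer can carry the same column id (e.g. PRESENT + DATA + LENGTH for a string column):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StreamLayoutSketch {
    // Hypothetical minimal stand-in for OrcProto.Stream: each stream
    // records which column it belongs to, its kind, and its byte length.
    record Stream(int column, String kind, long length) {}

    // Count how many streams each column contributes to the stripe.
    static Map<Integer, Integer> streamsPerColumn(Stream[] streams) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (Stream s : streams) {
            counts.merge(s.column(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Stream[] stripe = {
            new Stream(1, "PRESENT", 10),
            new Stream(1, "DATA", 200),
            new Stream(2, "PRESENT", 10),
            new Stream(2, "DATA", 500),
            new Stream(2, "LENGTH", 40),  // variable-length column needs LENGTH
        };
        // column 1 owns 2 streams, column 2 owns 3 streams
        System.out.println(streamsPerColumn(stripe));
    }
}
```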

From Stream Reading to Column Reading

To meet our development requirements, we now change stream reading into column reading.

Rewriting the relevant RecordReaderImpl interfaces

Because data is now read one column at a time, the includedRowGroups = pickRowGroups() call and readAllDataStreams (which reads every stream in a stripe) are commented out:

protected void readStripe() throws IOException {
    StripeInformation stripe = beginReadStripe();

    // includedRowGroups = pickRowGroups();
    ...
    // readAllDataStreams(stripe);
    ...

    if (rowInStripe < rowCountInStripe) {
      readPartialDataStreams(stripe);
      reader.startStripe(streams, stripeFooter);
      // if we skipped the first row group, move the pointers forward
      if (rowInStripe != 0) {
        seekToRowEntry(reader, (int) (rowInStripe / rowIndexStride));
      }
    }
  }
  private void readPartialDataStreams(StripeInformation stripe) throws IOException {
    // plan the disk ranges to read for the included columns
    DiskRangeList toRead = planReadColumnData(streamList, fileIncluded);
    ...
    // read the column data
    bufferChunks = ((RecordReaderBinaryCacheUtils.DefaultDataReader) dataReader)
            .readFileColumnData(toRead, stripe.getOffset(), false);
    ...
    createStreams(streamList, bufferChunks, fileIncluded,
            dataReader.getCompressionCodec(), bufferSize, streams);
  }

// planReadColumnData

/**
   * Plan the ranges of the file that we need to read, given the list of
   * columns in one stripe.
   *
   * @param streamList        the list of streams available
   * @param includedColumns   which columns are needed
   * @return the list of disk ranges that will be loaded
   */
  private DiskRangeList planReadColumnData(
          List<OrcProto.Stream> streamList,
          boolean[] includedColumns) {
    long offset = 0;
    Map<Integer, RecordReaderBinaryCacheImpl.ColumnDiskRange> columnDiskRangeMap =
            new HashMap<>();
    ColumnDiskRangeList.CreateColumnRangeHelper list =
            new ColumnDiskRangeList.CreateColumnRangeHelper();
    for (OrcProto.Stream stream : streamList) {
      long length = stream.getLength();
      int column = stream.getColumn();
      OrcProto.Stream.Kind streamKind = stream.getKind();
      // since stream kind is optional, first check if it exists
      if (stream.hasKind() &&
              (StreamName.getArea(streamKind) == StreamName.Area.DATA) &&
              (column < includedColumns.length && includedColumns[column])) {

        if (columnDiskRangeMap.containsKey(column)) {
          columnDiskRangeMap.get(column).length += length;
        } else {
          columnDiskRangeMap.put(column, new RecordReaderBinaryCacheImpl.ColumnDiskRange(offset, length));
        }
      }
      offset += length;
    }
    for (int columnId = 1; columnId < includedColumns.length; ++columnId) {
      if (includedColumns[columnId]) {
        RecordReaderBinaryCacheImpl.ColumnDiskRange range = columnDiskRangeMap.get(columnId);
        list.add(columnId, currentStripe, range.offset, range.offset + range.length);
      }
    }
    return list.extract();
  }
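The essence of planReadColumnData can be sketched without the ORC classes. In the sketch below, `PlanColumnSketch`, `Stream`, and `Range` are hypothetical stand-ins for `OrcProto.Stream` and `ColumnDiskRange`; it walks the streams in file order and accumulates one contiguous (offset, length) range per included column, assuming, as the code above does, that a column's data streams are laid out adjacently:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PlanColumnSketch {
    // Hypothetical stand-ins: a stream with a column id, an area flag
    // (true = DATA area, false = INDEX area), and a byte length.
    record Stream(int column, boolean dataArea, long length) {}

    static final class Range {
        long offset, length;
        Range(long offset, long length) { this.offset = offset; this.length = length; }
    }

    // Walk streams in file order, tracking the running offset; for each
    // included column's DATA-area streams, start a range at the first
    // stream's offset and extend it by each later stream's length.
    static Map<Integer, Range> planColumnRanges(Stream[] streams, boolean[] included) {
        long offset = 0;
        Map<Integer, Range> ranges = new LinkedHashMap<>();
        for (Stream s : streams) {
            if (s.dataArea() && s.column() < included.length && included[s.column()]) {
                Range r = ranges.get(s.column());
                if (r != null) {
                    r.length += s.length();  // assumes the column's streams are adjacent
                } else {
                    ranges.put(s.column(), new Range(offset, s.length()));
                }
            }
            offset += s.length();
        }
        return ranges;
    }

    public static void main(String[] args) {
        Stream[] streams = {
            new Stream(1, true, 100),
            new Stream(1, true, 50),
            new Stream(2, true, 200),
        };
        Map<Integer, Range> ranges = planColumnRanges(streams, new boolean[] {false, true, true});
        // column 1 -> offset 0, length 150; column 2 -> offset 150, length 200
        System.out.println(ranges.get(1).offset + "," + ranges.get(1).length);
    }
}
```

Note that, like the real method, the sketch would fail if a column were included but had no DATA-area streams; the caller is expected to pass a consistent stream list.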