When data is stored in ORC format, only the columns actually used by a query are read. How is that implemented? This article walks through the details.
Take the table web_site, which has 26 columns. We select two columns as the result, and use one more column in the filter condition.
select web_site_id, web_rec_start_date from web_site where web_site_sk <> 1;
First, use explain to look at the query plan.
Plan optimized by CBO.
Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:["_col0","_col1"]
      Filter Operator [FIL_4]
        predicate:(web_site_sk <> 1L)
        TableScan [TS_0]
          Output:["web_site_sk","web_site_id","web_rec_start_date"]
- Fetch Operator returns results to the client; limit: -1 means the number of returned rows is unlimited.
- Select Operator lists the two columns we SELECT.
- Filter Operator applies our predicate: web_site_sk <> 1.
- TableScan: Output contains 3 fields, meaning the TableScan operator must output 3 columns.
The compilation phase
Already while parsing the SQL, Hive knows which columns the query needs and which tables they come from. During optimization, Hive invokes the ColumnPruner transformation. Its context object, ColumnPrunerProcCtx, stores, for each operator, the list of pruned (i.e., still needed) columns.
public class ColumnPrunerProcCtx implements NodeProcessorCtx {
private final ParseContext pctx;
private final Map<Operator<? extends OperatorDesc>, List<FieldNode>> prunedColLists;
private final Map<CommonJoinOperator, Map<Byte, List<FieldNode>>> joinPrunedColLists;
}
Every operator calls ColumnPrunerProcCtx's genColLists to obtain the columns required by its child operators.
public List<FieldNode> genColLists(Operator<? extends OperatorDesc> curOp)
throws SemanticException {
if (curOp.getChildOperators() == null) {
return null;
}
List<FieldNode> colList = null;
for (Operator<? extends OperatorDesc> child : curOp.getChildOperators()) {
List<FieldNode> prunList = null;
if (child instanceof CommonJoinOperator) {
int tag = child.getParentOperators().indexOf(curOp);
prunList = joinPrunedColLists.get(child).get((byte) tag);
} else if (child instanceof FileSinkOperator) {
prunList = new ArrayList<>();
RowSchema oldRS = curOp.getSchema();
for (ColumnInfo colInfo : oldRS.getSignature()) {
prunList.add(new FieldNode(colInfo.getInternalName()));
}
} else {
prunList = prunedColLists.get(child);
}
if (prunList == null) {
continue;
}
if (colList == null) {
colList = new ArrayList<>(prunList);
} else {
colList = mergeFieldNodes(colList, prunList);
}
}
return colList;
}
Take the filter operator as an example. It first calls ExprNodeDesc condn = op.getConf().getPredicate();
to get the columns used by this operator, then calls mergeFieldNodesWithDesc to merge the child operators' columns with the columns referenced by the predicate.
Next, cppCtx.getPrunedColLists().put(op, filterOpPrunedColListsOrderPreserved);
stores the merged list for the current operator.
Finally, pruneOperator
updates the operator's rowSchema to contain only the needed columns.
public static class ColumnPrunerFilterProc implements NodeProcessor {
@Override
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
Object... nodeOutputs) throws SemanticException {
FilterOperator op = (FilterOperator) nd;
ColumnPrunerProcCtx cppCtx = (ColumnPrunerProcCtx) ctx;
ExprNodeDesc condn = op.getConf().getPredicate();
List<FieldNode> filterOpPrunedColLists = mergeFieldNodesWithDesc(cppCtx.genColLists(op), condn);
List<FieldNode> filterOpPrunedColListsOrderPreserved = preserveColumnOrder(op,
filterOpPrunedColLists);
cppCtx.getPrunedColLists().put(op,
filterOpPrunedColListsOrderPreserved);
pruneOperator(cppCtx, op, cppCtx.getPrunedColLists().get(op));
cppCtx.handleFilterUnionChildren(op);
return null;
}
}
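For this query, the effect of that merge can be sketched in isolation. The helper below is hypothetical, not Hive code: it unions two column-name lists while preserving first-seen order, which is what merging the Select child's columns with the predicate's column amounts to for flat (non-nested) fields:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class ColListMerge {
    // Union of two column-name lists, keeping first-seen order;
    // a simplified stand-in for mergeFieldNodesWithDesc on flat fields.
    static List<String> merge(List<String> a, List<String> b) {
        LinkedHashSet<String> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        List<String> fromSelect = List.of("web_site_id", "web_rec_start_date");
        List<String> fromFilter = List.of("web_site_sk");
        // → [web_site_id, web_rec_start_date, web_site_sk]
        System.out.println(merge(fromSelect, fromFilter));
    }
}
```

This is why the TableScan's Output in the plan above ends up with exactly three columns: the two selected columns plus the one from the predicate.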
- ColumnPrunerProcFactory.ColumnPrunerTableScanProc uses the needed cols to populate the scan's needed column IDs (setNeededColumnIDs) and names (setupNeededColumns).
public static ColumnPrunerTableScanProc getTableScanProc() {
return new ColumnPrunerTableScanProc();
}
public static class ColumnPrunerTableScanProc implements SemanticNodeProcessor {
@Override
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
Object... nodeOutputs) throws SemanticException {
TableScanOperator scanOp = (TableScanOperator) nd;
ColumnPrunerProcCtx cppCtx = (ColumnPrunerProcCtx) ctx;
List<FieldNode> cols = cppCtx
.genColLists((Operator<? extends OperatorDesc>) nd);
if (cols == null && !scanOp.getConf().isGatherStats() ) {
scanOp.setNeededColumnIDs(null);
return null;
}
cols = cols == null ? new ArrayList<FieldNode>() : cols;
cppCtx.getPrunedColLists().put((Operator<? extends OperatorDesc>) nd, cols);
RowSchema inputRS = scanOp.getSchema();
setupNeededColumns(scanOp, inputRS, cols);
return null;
}
}
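setupNeededColumns ultimately records both the needed column names and their positions in the table schema. A minimal stand-in for that name-to-ID resolution (the schema layout below is assumed for illustration; real IDs come from the table's metadata):

```java
import java.util.ArrayList;
import java.util.List;

public class NeededColumns {
    // Resolve needed column names to their positions in the table schema;
    // a simplified stand-in for what setupNeededColumns records on the scan.
    static List<Integer> ids(List<String> schema, List<String> needed) {
        List<Integer> ids = new ArrayList<>();
        for (String col : needed) {
            int idx = schema.indexOf(col);
            if (idx >= 0) {
                ids.add(idx);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Assumed layout: the three columns sit at the front of web_site's
        // 26-column schema (illustration only).
        List<String> schema = List.of("web_site_sk", "web_site_id", "web_rec_start_date");
        // → [0, 1, 2]
        System.out.println(ids(schema, schema));
    }
}
```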
In HiveInputFormat's genSplits method, if the operator is a TableScanOperator, neededColumnIDs and neededColumns are taken from tableScan (of type TableScanOperator) and pushProjection is called.
if ((op != null) && (op instanceof TableScanOperator)) {
tableScan = (TableScanOperator) op;
//Reset buffers to store filter push down columns
readColumnsBuffer.setLength(0);
readColumnNamesBuffer.setLength(0);
// push down projections.
ColumnProjectionUtils.appendReadColumns(readColumnsBuffer, readColumnNamesBuffer,
tableScan.getNeededColumnIDs(), tableScan.getNeededColumns());
pushDownProjection = true;
// push down filters and as of information
pushFiltersAndAsOf(newjob, tableScan, this.mrwork);
}
...
pushProjection(newjob, readColumnsBuffer, readColumnNamesBuffer);
In pushProjection, these values are stored in the JobConf.
private void pushProjection(final JobConf newjob, final StringBuilder readColumnsBuffer,
final StringBuilder readColumnNamesBuffer) {
String readColIds = readColumnsBuffer.toString();
String readColNames = readColumnNamesBuffer.toString();
newjob.setBoolean(ColumnProjectionUtils.READ_ALL_COLUMNS, false);
newjob.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, readColIds);
newjob.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, readColNames);
LOG.info("{} = {}", ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, readColIds);
LOG.info("{} = {}", ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, readColNames);
}
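For our example, assuming the three columns have IDs 0, 1 and 2, the column-ID buffer ends up as the string "0,1,2". A simplified sketch of how appendReadColumns builds that comma-separated string (the real ColumnProjectionUtils method also handles the names buffer and appending to non-empty buffers):

```java
import java.util.List;

public class ReadColumnsDemo {
    // Join column IDs into a comma-separated string,
    // as stored under hive.io.file.readcolumn.ids in the JobConf.
    static String appendIds(List<Integer> ids) {
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(id);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // → 0,1,2
        System.out.println(appendIds(List.of(0, 1, 2)));
    }
}
```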
The read path
When reading, OrcInputFormat reads only the values of these three columns.
Since the table's format is ORC, the input format is org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
The createReaderFromFile method contains the following statement.
public static RecordReader createReaderFromFile(Reader file,
                                                Configuration conf,
                                                long offset, long length
                                                ) throws IOException {
  // ...
  options.include(genIncludedColumns(schema, conf));
  // ...
}
In Reader.Options's include method, a column's slot is true if that column needs to be read, false otherwise. If the table has 26 columns and we read only 3, just those 3 positions are true and the rest are false.
public Reader.Options include(boolean[] include) {
this.include = include;
return this;
}
OrcInputFormat.genIncludedColumns: if not all columns are to be read, it builds the included array from the configured column IDs; otherwise it returns null, meaning read everything.
static boolean[] genIncludedColumns(TypeDescription readerSchema,
Configuration conf) {
if (!ColumnProjectionUtils.isReadAllColumns(conf)) {
List<Integer> included = ColumnProjectionUtils.getReadColumnIDs(conf);
return genIncludedColumns(readerSchema, included);
} else {
return null;
}
}
The getReadColumnIDs method returns the IDs of the needed columns, and genIncludedColumns turns them into the corresponding boolean array.
public static List<Integer> getReadColumnIDs(Configuration conf) {
String skips = conf.get(READ_COLUMN_IDS_CONF_STR, READ_COLUMN_IDS_CONF_STR_DEFAULT);
String[] list = StringUtils.split(skips);
List<Integer> result = new ArrayList<Integer>(list.length);
for (String element : list) {
// it may contain duplicates, remove duplicates
Integer toAdd = Integer.parseInt(element);
if (!result.contains(toAdd)) {
result.add(toAdd);
}
// NOTE: some code uses this list to correlate with column names, and yet these lists may
// contain duplicates, which this call will remove and the other won't. As far as I can
// tell, no code will actually use these two methods together; all is good if the code
// gets the ID list without relying on this method. Or maybe it just works by magic.
}
return result;
}
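A minimal sketch of the parsing above, using String.split instead of Hadoop's StringUtils; as in getReadColumnIDs, duplicates are dropped while preserving order:

```java
import java.util.ArrayList;
import java.util.List;

public class ReadColumnIds {
    // Parse the comma-separated ID string from the conf, dropping duplicates.
    static List<Integer> parse(String conf) {
        List<Integer> result = new ArrayList<>();
        if (conf.isEmpty()) {
            return result;
        }
        for (String element : conf.split(",")) {
            Integer toAdd = Integer.parseInt(element.trim());
            if (!result.contains(toAdd)) {
                result.add(toAdd);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // → [0, 1, 2]
        System.out.println(parse("0,1,2,1"));
    }
}
```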
Next, let's look at when the columns to be read are put into the conf.