How an ORC file reads only the needed columns

With ORC-format data, only the columns the query actually uses are read. How is this implemented? This article walks through it in detail.
Take the table web_site, which has 26 columns: we select two of them as the result and reference one more in the filter condition.

select web_site_id, web_rec_start_date from web_site where web_site_sk <> 1;

First, look at the query plan with explain.

Plan optimized by CBO.

Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:["_col0","_col1"]
      Filter Operator [FIL_4]
        predicate:(web_site_sk <> 1L)
        TableScan [TS_0]
          Output:["web_site_sk","web_site_id","web_rec_start_date"]

The Fetch Operator returns results to the client; limit: -1 means the number of returned rows is unlimited.
The Select Operator lists the two columns we SELECT.
The Filter Operator carries our predicate: web_site_sk <> 1.
TableScan: Output has 3 fields, meaning the TableScan operator must output 3 columns.

Compilation phase

Already at SQL parse time, the compiler knows which fields the query needs and which tables they come from.
During optimization, the ColumnPruner is applied. ColumnPrunerProcCtx stores, per operator, the list of columns that survive pruning for that operator.

public class ColumnPrunerProcCtx implements NodeProcessorCtx {
  private final ParseContext pctx;
  private final Map<Operator<? extends OperatorDesc>, List<FieldNode>> prunedColLists;
  private final Map<CommonJoinOperator, Map<Byte, List<FieldNode>>> joinPrunedColLists;
}

Every operator calls ColumnPrunerProcCtx.genColLists to obtain the columns required by its child operators.

public List<FieldNode> genColLists(Operator<? extends OperatorDesc> curOp)
      throws SemanticException {
    if (curOp.getChildOperators() == null) {
      return null;
    }
    List<FieldNode> colList = null;
    for (Operator<? extends OperatorDesc> child : curOp.getChildOperators()) {
      List<FieldNode> prunList = null;
      if (child instanceof CommonJoinOperator) {
        int tag = child.getParentOperators().indexOf(curOp);
        prunList = joinPrunedColLists.get(child).get((byte) tag);
      } else if (child instanceof FileSinkOperator) {
        prunList = new ArrayList<>();
        RowSchema oldRS = curOp.getSchema();
        for (ColumnInfo colInfo : oldRS.getSignature()) {
          prunList.add(new FieldNode(colInfo.getInternalName()));
        }
      } else {
        prunList = prunedColLists.get(child);
      }
      if (prunList == null) {
        continue;
      }
      if (colList == null) {
        colList = new ArrayList<>(prunList);
      } else {
        colList = mergeFieldNodes(colList, prunList);
      }
    }
    return colList;
  }

Take the filter operator as an example. It first calls ExprNodeDesc condn = op.getConf().getPredicate(); to get the predicate, then calls mergeFieldNodesWithDesc to merge the columns needed by the child operators with the columns referenced by the predicate.
It then calls cppCtx.getPrunedColLists().put(op, filterOpPrunedColListsOrderPreserved); to record the merged list for this operator,
and finally calls pruneOperator to update the operator's rowSchema to contain only the needed columns.

public static class ColumnPrunerFilterProc implements NodeProcessor {
    @Override
    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
        Object... nodeOutputs) throws SemanticException {
      FilterOperator op = (FilterOperator) nd;
      ColumnPrunerProcCtx cppCtx = (ColumnPrunerProcCtx) ctx;
      ExprNodeDesc condn = op.getConf().getPredicate();
      List<FieldNode> filterOpPrunedColLists = mergeFieldNodesWithDesc(cppCtx.genColLists(op), condn);
      List<FieldNode> filterOpPrunedColListsOrderPreserved = preserveColumnOrder(op,
          filterOpPrunedColLists);
      cppCtx.getPrunedColLists().put(op,
          filterOpPrunedColListsOrderPreserved);

      pruneOperator(cppCtx, op, cppCtx.getPrunedColLists().get(op));
      cppCtx.handleFilterUnionChildren(op);
      return null;
    }
  }
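The merge performed by mergeFieldNodesWithDesc can be illustrated with a minimal sketch. The class and method below are stand-ins for illustration, not Hive's actual implementation: the child operator needs web_site_id and web_rec_start_date, the predicate references web_site_sk, and the filter's pruned list is the order-preserving union of the two.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Simplified model of building a filter's pruned column list:
// columns required by the child operator, plus columns referenced
// by the filter predicate, de-duplicated while preserving order.
public class FilterPruneSketch {
  static List<String> merge(List<String> childCols, List<String> predicateCols) {
    LinkedHashSet<String> merged = new LinkedHashSet<>(childCols);
    merged.addAll(predicateCols);
    return new ArrayList<>(merged);
  }

  public static void main(String[] args) {
    List<String> childCols = List.of("web_site_id", "web_rec_start_date");
    List<String> predicateCols = List.of("web_site_sk");
    System.out.println(merge(childCols, predicateCols));
    // -> [web_site_id, web_rec_start_date, web_site_sk]
  }
}
```

This is why the TableScan in the plan above outputs 3 columns even though the SELECT lists only 2.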

ColumnPrunerProcFactory.ColumnPrunerTableScanProc takes the needed cols and applies them via setNeededColumnIDs and setupNeededColumns.
public static ColumnPrunerTableScanProc getTableScanProc() {
    return new ColumnPrunerTableScanProc();
  }
public static class ColumnPrunerTableScanProc implements SemanticNodeProcessor {
    @Override
    public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
        Object... nodeOutputs) throws SemanticException {
      TableScanOperator scanOp = (TableScanOperator) nd;
      ColumnPrunerProcCtx cppCtx = (ColumnPrunerProcCtx) ctx;
      List<FieldNode> cols = cppCtx
          .genColLists((Operator<? extends OperatorDesc>) nd);
      if (cols == null && !scanOp.getConf().isGatherStats() ) {
        scanOp.setNeededColumnIDs(null);
        return null;
      }

      cols = cols == null ? new ArrayList<FieldNode>() : cols;

      cppCtx.getPrunedColLists().put((Operator<? extends OperatorDesc>) nd, cols);
      RowSchema inputRS = scanOp.getSchema();
      setupNeededColumns(scanOp, inputRS, cols);

      return null;
    }
  }

In HiveInputFormat's genSplits method, when the operator is a TableScanOperator, the neededColumnIDs and neededColumns are taken from tableScan (of type TableScanOperator) and pushProjection is called.

if ((op != null) && (op instanceof TableScanOperator)) {
         tableScan = (TableScanOperator) op;
         //Reset buffers to store filter push down columns
         readColumnsBuffer.setLength(0);
         readColumnNamesBuffer.setLength(0);
         // push down projections.
         ColumnProjectionUtils.appendReadColumns(readColumnsBuffer, readColumnNamesBuffer,
           tableScan.getNeededColumnIDs(), tableScan.getNeededColumns());
         pushDownProjection = true;
         // push down filters and as of information
         pushFiltersAndAsOf(newjob, tableScan, this.mrwork);
 }
...
pushProjection(newjob, readColumnsBuffer, readColumnNamesBuffer);

In pushProjection, they are stored in the JobConf.

private void pushProjection(final JobConf newjob, final StringBuilder readColumnsBuffer,
      final StringBuilder readColumnNamesBuffer) {
    String readColIds = readColumnsBuffer.toString();
    String readColNames = readColumnNamesBuffer.toString();
    newjob.setBoolean(ColumnProjectionUtils.READ_ALL_COLUMNS, false);
    newjob.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, readColIds);
    newjob.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, readColNames);

    LOG.info("{} = {}", ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, readColIds);
    LOG.info("{} = {}", ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, readColNames);
  }
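For our query, the three needed columns end up as comma-separated strings in the job conf, under the keys READ_COLUMN_IDS_CONF_STR and READ_COLUMN_NAMES_CONF_STR. The joining that ColumnProjectionUtils.appendReadColumns performs can be sketched as follows (a simplified stand-in, not the Hive class itself):

```java
import java.util.List;

// Simplified stand-in for ColumnProjectionUtils.appendReadColumns:
// join column IDs and names into the comma-separated strings that
// pushProjection later stores in the JobConf.
public class ReadColumnsSketch {
  static String join(List<?> items) {
    StringBuilder sb = new StringBuilder();
    for (Object item : items) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(item);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // the column-ID conf value
    System.out.println(join(List.of(0, 1, 2)));
    // -> 0,1,2
    // the column-name conf value
    System.out.println(join(List.of("web_site_sk", "web_site_id", "web_rec_start_date")));
    // -> web_site_sk,web_site_id,web_rec_start_date
  }
}
```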

Reading logic

When reading, OrcInputFormat reads only the values of these three fields.

Since the table's format is ORC, the input format is org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
Its createReaderFromFile method contains the following statement.

public static RecordReader createReaderFromFile(Reader file,
                                                Configuration conf,
                                                long offset, long length
                                                ) throws IOException {
    ...
    options.include(genIncludedColumns(schema, conf));
    ...
}

Reader.Options' include method takes a boolean array: a position is true if the corresponding column should be read, and false otherwise. If the table has 26 columns and we read only 3, just those 3 positions are true and the rest are false.

public Reader.Options include(boolean[] include) {
    this.include = include;
    return this;
}
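For the 26-column web_site table with column IDs 0-2 needed, the include array looks like the sketch below. The helper here is hypothetical, just to show the shape of the array; in real ORC the array is indexed by the full type tree (position 0 is the root struct, nested types get their own slots), so actual offsets differ slightly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the shape of the include array:
// one flag per column, true only for the columns we actually read.
public class IncludeSketch {
  static boolean[] buildInclude(int columnCount, List<Integer> neededIds) {
    boolean[] include = new boolean[columnCount];
    for (int id : neededIds) {
      include[id] = true;
    }
    return include;
  }

  public static void main(String[] args) {
    boolean[] include = buildInclude(26, List.of(0, 1, 2));
    System.out.println(Arrays.toString(include));
    // first three entries true, the remaining 23 false
  }
}
```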

OrcInputFormat.genIncludedColumns: if not all columns are to be read, it builds the include array from the column IDs stored in the conf; otherwise it returns null, which means read everything.

static boolean[] genIncludedColumns(TypeDescription readerSchema,
                                             Configuration conf) {
     if (!ColumnProjectionUtils.isReadAllColumns(conf)) {
      List<Integer> included = ColumnProjectionUtils.getReadColumnIDs(conf);
      return genIncludedColumns(readerSchema, included);
    } else {
      return null;
    }
  }

getReadColumnIDs returns the IDs of the needed columns; genIncludedColumns then turns them into the corresponding boolean array.

public static List<Integer> getReadColumnIDs(Configuration conf) {
    String skips = conf.get(READ_COLUMN_IDS_CONF_STR, READ_COLUMN_IDS_CONF_STR_DEFAULT);
    String[] list = StringUtils.split(skips);
    List<Integer> result = new ArrayList<Integer>(list.length);
    for (String element : list) {
      // it may contain duplicates, remove duplicates
      Integer toAdd = Integer.parseInt(element);
      if (!result.contains(toAdd)) {
        result.add(toAdd);
      }
      // NOTE: some code uses this list to correlate with column names, and yet these lists may
      //       contain duplicates, which this call will remove and the other won't. As far as I can
      //       tell, no code will actually use these two methods together; all is good if the code
      //       gets the ID list without relying on this method. Or maybe it just works by magic.
    }
    return result;
  }
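The parsing and de-duplication done by getReadColumnIDs can be reproduced with a minimal sketch (assuming the comma-separated format that pushProjection wrote; the Hive version uses Hadoop's StringUtils.split instead of String.split):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of getReadColumnIDs: split the comma-separated
// ID string from the conf and drop duplicate IDs while keeping order.
public class ReadColumnIdsSketch {
  static List<Integer> parseIds(String idString) {
    List<Integer> result = new ArrayList<>();
    for (String element : idString.split(",")) {
      if (element.isEmpty()) {
        continue;
      }
      Integer id = Integer.parseInt(element.trim());
      if (!result.contains(id)) {
        result.add(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parseIds("0,1,2,1"));
    // -> [0, 1, 2]
  }
}
```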

This completes the path: column pruning at compile time records the needed columns on the TableScanOperator, pushProjection writes them into the conf, and OrcInputFormat turns them into an include array so that only those columns are read.
