When data is stored in ORC format, only the columns actually used by a query are read. How is that implemented? This article walks through the details.
Take the table web_site, which has 26 columns. We select two columns as the result, and use one more column in the filter condition.
select web_site_id, web_rec_start_date from web_site where web_site_sk <> 1;
First, use explain to look at the query plan.
Plan optimized by CBO.
Stage-0
  Fetch Operator
    limit:-1
    Select Operator [SEL_2]
      Output:["_col0","_col1"]
      Filter Operator [FIL_4]
        predicate:(web_site_sk <> 1L)
        TableScan [TS_0]
          Output:["web_site_sk","web_site_id","web_rec_start_date"]
- Fetch Operator returns results to the client; limit: -1 means the number of returned rows is unlimited.
- Select Operator lists the two columns we SELECT.
- Filter Operator applies our predicate: web_site_sk <> 1.
- TableScan: Output contains 3 fields, meaning the TableScan operator must output 3 columns.
The compilation phase
Already while parsing the SQL, Hive knows which columns the query needs and which tables they come from. During optimization, Hive invokes the ColumnPruner transformation. Its context object, ColumnPrunerProcCtx, stores, for each operator, the list of pruned (i.e., still needed) columns.
public class ColumnPrunerProcCtx implements NodeProcessorCtx {
private final ParseContext pctx;
private final Map<Operator<? extends OperatorDesc>, List<FieldNode>> prunedColLists;
private final Map<CommonJoinOperator, Map<Byte, List<FieldNode>>> joinPrunedColLists;
}
Every operator calls ColumnPrunerProcCtx's genColLists to obtain the columns required by its child operators.
public List<FieldNode> genColLists(Operator<? extends OperatorDesc> curOp)
throws SemanticException {
if (curOp.getChildOperators() == null) {
return null;
}
List<FieldNode> colList = null;
for (Operator<? extends OperatorDesc> child : curOp.getChildOperators()) {
List<FieldNode> prunList = null;
if (child instanceof CommonJoinOperator) {
int tag = child.getParentOperators().indexOf(curOp);
prunList = joinPrunedColLists.get(child).get((byte) tag);
} else if (child instanceof FileSinkOperator) {
prunList = new ArrayList<>();
RowSchema oldRS = curOp.getSchema();
for (ColumnInfo colInfo : oldRS.getSignature()) {
prunList.add(new FieldNode(colInfo.getInternalName()));
}
} else {
prunList = prunedColLists.get(child);
}
if (prunList == null) {
continue;
}
if (colList == null) {
colList = new ArrayList<>(prunList);
} else {
colList = mergeFieldNodes(colList, prunList);
}
}
return colList;
}
Take the filter operator as an example. It first calls ExprNodeDesc condn = op.getConf().getPredicate();
to get the columns used by this operator, then calls mergeFieldNodesWithDesc to merge the child operators' columns with the columns referenced by the predicate.
Next, cppCtx.getPrunedColLists().put(op, filterOpPrunedColListsOrderPreserved);
stores the merged list for the current operator.
Finally, pruneOperator
updates the operator's rowSchema to contain only the needed columns.
public static class ColumnPrunerFilterProc implements NodeProcessor {
@Override
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
Object... nodeOutputs) throws SemanticException {
FilterOperator op = (FilterOperator) nd;
ColumnPrunerProcCtx cppCtx = (ColumnPrunerProcCtx) ctx;
ExprNodeDesc condn = op.getConf().getPredicate();
List<FieldNode> filterOpPrunedColLists = mergeFieldNodesWithDesc(cppCtx.genColLists(op), condn);
List<FieldNode> filterOpPrunedColListsOrderPreserved = preserveColumnOrder(op,
filterOpPrunedColLists);
cppCtx.getPrunedColLists().put(op,
filterOpPrunedColListsOrderPreserved);
pruneOperator(cppCtx, op, cppCtx.getPrunedColLists().get(op));
cppCtx.handleFilterUnionChildren(op);
return null;
}
}
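For this query, the effect of that merge can be sketched in isolation. The helper below is hypothetical, not Hive code: it unions two column-name lists while preserving first-seen order, which is what merging the Select child's columns with the predicate's column amounts to for flat (non-nested) fields:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class ColListMerge {
    // Union of two column-name lists, keeping first-seen order;
    // a simplified stand-in for mergeFieldNodesWithDesc on flat fields.
    static List<String> merge(List<String> a, List<String> b) {
        LinkedHashSet<String> out = new LinkedHashSet<>(a);
        out.addAll(b);
        return new ArrayList<>(out);
    }

    public static void main(String[] args) {
        List<String> fromSelect = List.of("web_site_id", "web_rec_start_date");
        List<String> fromFilter = List.of("web_site_sk");
        // → [web_site_id, web_rec_start_date, web_site_sk]
        System.out.println(merge(fromSelect, fromFilter));
    }
}
```

This is why the TableScan's Output in the plan above ends up with exactly three columns: the two selected columns plus the one from the predicate.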
- ColumnPrunerProcFactory.ColumnPrunerTableScanProc uses the needed cols to populate the scan's needed column IDs (setNeededColumnIDs) and names (setupNeededColumns).
public static ColumnPrunerTableScanProc getTableScanProc() {
return new ColumnPrunerTableScanProc();
}
public static class ColumnPrunerTableScanProc implements SemanticNodeProcessor {
@Override
public Object process(Node nd, Stack<Node> stack, NodeProcessorCtx ctx,
Object... nodeOutputs) throws SemanticException {
TableScanOperator scanOp = (TableScanOperator) nd;
ColumnPrunerProcCtx cppCtx = (ColumnPrunerProcCtx) ctx;
List<FieldNode> cols = cppCtx
.genColLists((Operator<? extends OperatorDesc>) nd);
if (cols == null && !scanOp.getConf().isGatherStats() ) {
scanOp.setNeededColumnIDs(null);
return null;
}
cols = cols == null ? new ArrayList<FieldNode>() : cols;
cppCtx.getPrunedColLists().put((Operator<? extends OperatorDesc>) nd, cols);
RowSchema inputRS = scanOp.getSchema();
setupNeededColumns(scanOp, inputRS, cols);
return null;
}
}
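setupNeededColumns ultimately records both the needed column names and their positions in the table schema. A minimal stand-in for that name-to-ID resolution (the schema layout below is assumed for illustration; real IDs come from the table's metadata):

```java
import java.util.ArrayList;
import java.util.List;

public class NeededColumns {
    // Resolve needed column names to their positions in the table schema;
    // a simplified stand-in for what setupNeededColumns records on the scan.
    static List<Integer> ids(List<String> schema, List<String> needed) {
        List<Integer> ids = new ArrayList<>();
        for (String col : needed) {
            int idx = schema.indexOf(col);
            if (idx >= 0) {
                ids.add(idx);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Assumed layout: the three columns sit at the front of web_site's
        // 26-column schema (illustration only).
        List<String> schema = List.of("web_site_sk", "web_site_id", "web_rec_start_date");
        // → [0, 1, 2]
        System.out.println(ids(schema, schema));
    }
}
```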
In HiveInputFormat's genSplits method, if the operator is a TableScanOperator, neededColumnIDs and neededColumns are taken from tableScan (of type TableScanOperator) and pushProjection is called.
if ((op != null) && (op instanceof TableScanOperator)) {
tableScan = (TableScanOperator) op;
//Reset buffers to store filter push down columns
readColumnsBuffer.setLength(0);
readColumnNamesBuffer.setLength(0);
// push down projections.
ColumnProjectionUtils.appendReadColumns(readColumnsBuffer, readColumnNamesBuffer,
tableScan.getNeededColumnIDs(), tableScan.getNeededColumns());
pushDownProjection = true;
// push down filters and as of information
pushFiltersAndAsOf(newjob, tableScan, this.mrwork);
}
...
pushProjection(newjob, readColumnsBuffer, readColumnNamesBuffer);
In pushProjection, these values are stored in the JobConf.
private void pushProjection(final JobConf newjob, final StringBuilder readColumnsBuffer,
final StringBuilder readColumnNamesBuffer) {
String readColIds = readColumnsBuffer.toString();
String readColNames = readColumnNamesBuffer.toString();
newjob.setBoolean(ColumnProjectionUtils.READ_ALL_COLUMNS, false);
newjob.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, readColIds);
newjob.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, readColNames);
LOG.info("{} = {}", ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, readColIds);
LOG.info("{} = {}", ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, readColNames);
}
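For our example, assuming the three columns have IDs 0, 1 and 2, the column-ID buffer ends up as the string "0,1,2". A simplified sketch of how appendReadColumns builds that comma-separated string (the real ColumnProjectionUtils method also handles the names buffer and appending to non-empty buffers):

```java
import java.util.List;

public class ReadColumnsDemo {
    // Join column IDs into a comma-separated string,
    // as stored under hive.io.file.readcolumn.ids in the JobConf.
    static String appendIds(List<Integer> ids) {
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(id);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // → 0,1,2
        System.out.println(appendIds(List.of(0, 1, 2)));
    }
}
```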
The read path
When reading, OrcInputFormat reads only the values of these three columns.
Since the table's format is ORC, the input format is org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.
The createReaderFromFile method contains the following statement.
public static RecordReader createReaderFromFile(Reader file,
                                                Configuration conf,
                                                long offset, long length
                                                ) throws IOException {
  // ...
  options.include(genIncludedColumns(schema, conf));
  // ...
}
In Reader.Options's include method, a column's slot is true if that column needs to be read, false otherwise. If the table has 26 columns and we read only 3, just those 3 positions are true and the rest are false.
public Reader.Options include(boolean[] include) {
this.include = include;
return this;
}
OrcInputFormat.genIncludedColumns: if not all columns are to be read, it builds the included array from the configured column IDs; otherwise it returns null, meaning read everything.
static boolean[] genIncludedColumns(TypeDescription readerSchema,
Configuration conf) {
if (!ColumnProjectionUtils.isReadAllColumns(conf)) {
List<Integer> included = ColumnProjectionUtils.getReadColumnIDs(conf);
return genIncludedColumns(readerSchema, included);
} else {
return null;
}
}
The getReadColumnIDs method returns the IDs of the needed columns, and genIncludedColumns turns them into the corresponding boolean array.
public static List<Integer> getReadColumnIDs(Configuration conf) {
String skips = conf.get(READ_COLUMN_IDS_CONF_STR, READ_COLUMN_IDS_CONF_STR_DEFAULT);
String[] list = StringUtils.split(skips);
List<Integer> result = new ArrayList<Integer>(list.length);
for (String element : list) {
// it may contain duplicates, remove duplicates
Integer toAdd = Integer.parseInt(element);
if (!result.contains(toAdd)) {
result.add(toAdd);
}
// NOTE: some code uses this list to correlate with column names, and yet these lists may
// contain duplicates, which this call will remove and the other won't. As far as I can
// tell, no code will actually use these two methods together; all is good if the code
// gets the ID list without relying on this method. Or maybe it just works by magic.
}
return result;
}
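A minimal sketch of the parsing above, using String.split instead of Hadoop's StringUtils; as in getReadColumnIDs, duplicates are dropped while preserving order:

```java
import java.util.ArrayList;
import java.util.List;

public class ReadColumnIds {
    // Parse the comma-separated ID string from the conf, dropping duplicates.
    static List<Integer> parse(String conf) {
        List<Integer> result = new ArrayList<>();
        if (conf.isEmpty()) {
            return result;
        }
        for (String element : conf.split(",")) {
            Integer toAdd = Integer.parseInt(element.trim());
            if (!result.contains(toAdd)) {
                result.add(toAdd);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // → [0, 1, 2]
        System.out.println(parse("0,1,2,1"));
    }
}
```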
Next, let's look at when the columns to be read are put into the conf.