The Implementation of the coalesce Function in Presto and Expression Codegen

Latest recommended article published on 2024-04-28 14:48:38

weixin_39783915


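This article describes how the Presto engine handles the coalesce function, from SQL parsing to generated Java bytecode. During parsing, Presto's ANTLR4-generated SQL parser turns the coalesce call into an AST node. During semantic analysis, the argument types are checked and a logical plan is produced. During physical code generation, LocalExecutionPlanner and CursorProcessor produce the concrete operations. The article discusses the strengths and weaknesses of stack-based bytecode codegen and suggests a register-based optimization.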


Preface

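Presto uses dynamic code generation mainly for expressions and for parts of the local logic of some operators; generating code at runtime yields flexible, efficient execution code. This article walks through how the coalesce function in Presto goes from SQL parsing to generated Java bytecode.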

The coalesce function

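Semantics of the coalesce function:

coalesce(value1, value2[, ...])
Returns the first non-null value in the argument list. Like a CASE expression, arguments are evaluated only when necessary.

This is semantically equivalent to:
CASE WHEN (value1 IS NOT NULL) THEN value1
...
WHEN (valueN IS NOT NULL) THEN valueN ELSE NULL END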
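
The "evaluated only when necessary" part of these semantics can be illustrated outside SQL. The following is a minimal sketch, not Presto code: the class `LazyCoalesceSketch`, the `counted` helper, and the `evaluations` counter are hypothetical names introduced here, and arguments are modeled as `Supplier`s so that evaluation can be deferred and observed.

```java
import java.util.List;
import java.util.function.Supplier;

final class LazyCoalesceSketch {
    // Counts how many arguments were actually evaluated (hypothetical helper state).
    static int evaluations = 0;

    // Wraps a value in a supplier that records when it is evaluated.
    static Supplier<String> counted(String value) {
        return () -> {
            evaluations++;
            return value;
        };
    }

    // Returns the first non-null value; suppliers after it are never invoked.
    static String coalesce(List<Supplier<String>> arguments) {
        for (Supplier<String> argument : arguments) {
            String value = argument.get();
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String result = coalesce(List.of(counted(null), counted("AFRICA"), counted("unused")));
        System.out.println(result + " after " + evaluations + " evaluations");
    }
}
```

Running `main` prints `AFRICA after 2 evaluations`: the third argument is never evaluated, which matches the CASE form above.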

The following SQL query is used as the running example for the coalesce implementation:

SELECT coalesce(name, comment) FROM tpch.tiny.region

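Start with SQL parsing. Presto supports standard SQL syntax and uses a SQL parser generated by ANTLR4. According to its grammar file SqlBase.g4, the coalesce(name, comment) expression in the query above matches the functionCall rule, so it appears in the parse tree as a FunctionCallContext node.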

Presto builds its AST by recursively walking the parse tree, with AstBuilder.visit(tree) starting at the root. The part that matters here is how AstBuilder handles function calls, in visitFunctionCall:

// AstBuilder.java

    public Node visitFunctionCall(SqlBaseParser.FunctionCallContext context)
    {
        ...

        if (name.toString().equalsIgnoreCase("coalesce")) {
            check(context.expression().size() >= 2, "The 'coalesce' function must have at least two arguments", context);
            check(!window.isPresent(), "OVER clause not valid for 'coalesce' function", context);
            check(!distinct, "DISTINCT not valid for 'coalesce' function", context);
            check(!filter.isPresent(), "FILTER not valid for 'coalesce' function", context);

            return new CoalesceExpression(getLocation(context), visit(context.expression(), Expression.class));
        }

        ...

        return new FunctionCall(
            getLocation(context),
            getQualifiedName(context.qualifiedName()),
            window,
            filter,
            orderBy,
            distinct,
            visit(context.expression(), Expression.class));
    }

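Here coalesce is subject to a few syntactic restrictions (at least two arguments, no OVER clause, no DISTINCT, no FILTER), and the method returns a CoalesceExpression node instead of the FunctionCall node that visitFunctionCall returns by default.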
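
The special-casing pattern in visitFunctionCall can be reduced to a small standalone sketch. The node classes and the `buildCall` method below are hypothetical simplifications introduced for illustration; they are not Presto's AST classes, and only the argument-count check is reproduced.

```java
import java.util.List;

final class AstBuilderSketch {
    // Hypothetical, heavily simplified AST node types (not Presto's classes).
    interface Expression {}

    static final class Identifier implements Expression {
        final String name;
        Identifier(String name) { this.name = name; }
    }

    static final class FunctionCall implements Expression {
        final String name;
        final List<Expression> arguments;
        FunctionCall(String name, List<Expression> arguments) {
            this.name = name;
            this.arguments = arguments;
        }
    }

    static final class CoalesceExpression implements Expression {
        final List<Expression> operands;
        CoalesceExpression(List<Expression> operands) { this.operands = operands; }
    }

    // Validates the call, then returns a dedicated node for coalesce.
    static Expression buildCall(String name, List<Expression> arguments) {
        if (name.equalsIgnoreCase("coalesce")) {
            if (arguments.size() < 2) {
                throw new IllegalArgumentException(
                        "The 'coalesce' function must have at least two arguments");
            }
            return new CoalesceExpression(arguments);
        }
        // Every other function name falls through to the generic node.
        return new FunctionCall(name, arguments);
    }

    public static void main(String[] args) {
        List<Expression> operands = List.of(new Identifier("name"), new Identifier("comment"));
        System.out.println(buildCall("COALESCE", operands).getClass().getSimpleName());
        System.out.println(buildCall("upper", operands).getClass().getSimpleName());
    }
}
```

Giving coalesce its own node type means later phases (analysis, codegen) can dispatch on the node class rather than on a function name, which is what the rest of the walkthrough relies on.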

The resulting AST, that is, Presto's Statement, has the following in-memory structure:

Query{queryBody=QuerySpecification{select=Select{distinct=false, selectItems=[COALESCE(name, comment)]}, 
from=Optional[Table{tpch.tiny.region}], where=null, groupBy=Optional.empty, having=null, 
orderBy=Optional.empty, limit=null}, orderBy=Optional.empty}

Next comes semantic analysis. ExpressionAnalyzer.visitCoalesceExpression shows how Presto handles a CoalesceExpression:

//ExpressionAnalyzer.java

protected Type visitCoalesceExpression(CoalesceExpression node, StackableAstVisitorContext<Context> context)
{
   Type type = coerceToSingleType(context, "All COALESCE operands must be the same type: %s", node.getOperands());
   return setExpressionType(node, type);
}

Here Presto checks whether the arguments of coalesce can be implicitly coerced to a single common type; if they cannot, it raises a semantic analysis error.
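
The idea behind coercing all operands to one type can be sketched with a toy type model. Everything below is an assumption made for illustration: `ToyType`, `commonSuperType`, and the rule "same base type widens to the larger length" are hypothetical and much simpler than Presto's real coercion rules, and the varchar(25) operand is just an example of a shorter varchar.

```java
import java.util.List;
import java.util.Optional;

final class CoerceSketch {
    // Toy type model: a base name plus a length (only meaningful for varchar).
    static final class ToyType {
        final String base;
        final int length;
        ToyType(String base, int length) {
            this.base = base;
            this.length = length;
        }
        @Override
        public String toString() {
            return base + "(" + length + ")";
        }
    }

    // Assumed rule for this sketch: the same base type widens to the larger
    // length; different base types have no common super type.
    static Optional<ToyType> commonSuperType(ToyType left, ToyType right) {
        if (left.base.equals(right.base)) {
            return Optional.of(new ToyType(left.base, Math.max(left.length, right.length)));
        }
        return Optional.empty();
    }

    // Folds the operand types into one common type or fails the analysis.
    static ToyType coerceToSingleType(List<ToyType> operands) {
        ToyType result = operands.get(0);
        for (ToyType operand : operands.subList(1, operands.size())) {
            result = commonSuperType(result, operand)
                    .orElseThrow(() -> new IllegalArgumentException(
                            "All COALESCE operands must be the same type"));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyType common = coerceToSingleType(
                List.of(new ToyType("varchar", 25), new ToyType("varchar", 152)));
        System.out.println(common);
    }
}
```

Under these assumptions `main` prints `varchar(152)`: the shorter operand is widened to the longer one, which is the kind of implicit cast an analyzer would then record for that operand.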

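Next, LogicalPlanner turns the AST into a logical plan (relational algebra) and optimizes it, PlanFragmenter splits the logical plan into stages to form a distributed execution plan, and the plan is then scheduled and executed across the cluster. These steps are not covered in detail here.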

The resulting execution plan looks roughly like this:

-------Stage-0-------#
TaskOutputOperator
          ^
          |
ExchangeOperator
---------------------#
             ^
             |
------------Stage-1----------#
              ^
              |
TaskOutputOperator
              ^
              |
ScanFilterAndProjectOperator
-----------------------------#

In the logical plan, the coalesce() expression is an assignment of a Project node, which is combined with the TableScan node to form a single ScanFilterAndProjectOperator.

The in-memory structure of this Project node is roughly:

ProjectNode {
    source: TableScanNode{table=tpch:region:sf0.01, outputSymbols=[name, comment]...
    assignments[1]: "expr" -> "COALESCE(CAST("name" AS varchar(152)), "comment")"
}

Next, we look at how physical code is generated for this Project node.

Physical code generation

After a SqlTask is scheduled onto a Worker node, LocalExecutionPlanner creates the concrete physical operators. The part that matters here is how it handles ProjectNode:

    public PhysicalOperation visitProject(ProjectNode node, LocalExecutionPlanContext context)
    {
        PlanNode sourceNode;
        Optional<Expression> filterExpression = Optional.empty();
        if (node.getSource() instanceof FilterNode) {
            FilterNode filterNode = (FilterNode) node.getSource();
            sourceNode = filterNode.getSource();
            filterExpression = Optional.of(filterNode.getPredicate());
        }
        else {
            sourceNode = node.getSource();
        }

        List<Symbol> outputSymbols = node.getOutputSymbols();

        return visitScanFilterAndProject(context, node.getId(), sourceNode, filterExpression,
                node.getAssignments(), outputSymbols);
    }

visitScanFilterAndProject calls ExpressionCompiler.compileCursorProcessor to generate CursorProcessor, the class that ScanFilterAndProjectOperator uses to process the rows fetched by the table scan. The actual generation logic lives in CursorProcessorCompiler.generateMethods, whose inputs include the filter expression (filter) and the projection expressions (projects). In this example, the coalesce() projection is compiled in CursorProcessorCompiler.generateProjectMethod through RowExpressionCompiler.compile, and the codegen logic itself is in CoalesceCodeGenerator.generateExpression.

A walkthrough of the code:

// CoalesceCodeGenerator.java

    public BytecodeNode generateExpression(Signature signature, BytecodeGeneratorContext generatorContext,
                     Type returnType, List<RowExpression> arguments)
    {
        List<BytecodeNode> operands = new ArrayList<>();
        for (RowExpression expression : arguments) {
            operands.add(generatorContext.generate(expression)); 
        }

        Variable wasNull = generatorContext.wasNull();
        BytecodeNode nullValue = new BytecodeBlock()
                .append(wasNull.set(constantTrue()))
                .pushJavaDefault(returnType.getJavaType());

        // reverse list because current if statement builder doesn't support 
        // if/else so we need to build the if statements bottom up
        for (BytecodeNode operand : Lists.reverse(operands)) {
            IfStatement ifStatement = new IfStatement();

            ifStatement.condition()
                    .append(operand)
                    .append(wasNull);

            // if value was null, pop the null value, clear the null flag, and process the next operand
            ifStatement.ifTrue()
                    .pop(returnType.getJavaType())
                    .append(wasNull.set(constantFalse()))
                    .append(nullValue);

            nullValue = ifStatement;
        }

        return nullValue;
    }

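The main flow of the code above:

1. Run codegen on every input argument, that is, on each subexpression of this expression tree.

2. Build the main logic of coalesce in a way that resembles constructing a linked list: starting from the last operand, each IfStatement wraps the one built before it.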
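
The bottom-up construction in step 2 can be imitated without any bytecode. The sketch below is an analogy, not Presto's generator: `NestedIfSketch`, `constant`, and `buildCoalesce` are hypothetical names, and each operand is a `Supplier` standing in for a generated subexpression. Iterating the operands in reverse, each new closure wraps the one built before it, just as each IfStatement wraps the previous `nullValue` node.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

final class NestedIfSketch {
    // Stand-in for a compiled operand that always yields the given value.
    static Supplier<String> constant(String value) {
        return () -> value;
    }

    static Supplier<String> buildCoalesce(List<Supplier<String>> operands) {
        // Innermost fallback: every operand was null.
        Supplier<String> fallback = () -> null;

        // Walk the operands last-to-first so the first operand ends up outermost.
        List<Supplier<String>> reversed = new ArrayList<>(operands);
        Collections.reverse(reversed);
        for (Supplier<String> operand : reversed) {
            Supplier<String> inner = fallback; // the chain built so far
            fallback = () -> {
                String value = operand.get();
                // Only descend into the inner chain when this operand is null.
                return value == null ? inner.get() : value;
            };
        }
        return fallback;
    }

    public static void main(String[] args) {
        Supplier<String> compiled = buildCoalesce(List.of(constant(null), constant("comment text")));
        System.out.println(compiled.get());
    }
}
```

`main` prints `comment text`. The first operand is the outermost wrapper, so it is tested first at run time even though it was wrapped last at build time.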

The generated pseudo-bytecode is as follows:

// Projection: COALESCE(#0, #1)
    public void evaluate(io.prestosql.spi.connector.ConnectorSession session, io.prestosql.spi.Page page, int position)
    {
        block_0 = page.getBlock(0);
        block_1 = page.getBlock(1);
        wasNull = false;
        this.blockBuilder
        if {
            condition {
                {
                    if {
                        condition {
                            block_0.isNull(position)
                        }
                        ifTrue {
                            {
                                load constant true
                                store wasNull)
                                ACONST_NULL
                            }
                        }
                        ifFalse {
                            varchar(152).getSlice(block_0, position)
                        }
                    }
                    wasNull
                }
            }
            ifTrue {
                {
                    POP
                    wasNull = false;
                    if {
                        condition {
                            {
                                if {
                                    condition {
                                        block_1.isNull(position)
                                    }
                                    ifTrue {
                                        {
                                            load constant true
                                            store wasNull)
                                            ACONST_NULL
                                        }
                                    }
                                    ifFalse {
                                        varchar(152).getSlice(block_1, position)
                                    }
                                }
                                wasNull
                            }
                        }
                        ifTrue {
                            {
                                POP
                                wasNull = false;
                                wasNull = true;
                                ACONST_NULL
                            }
                        }
                    }
                }
            }
        }
        
        // if (wasNull)
        if {
            condition {
                wasNull
            }
            ifTrue {
                
                // output.appendNull();
                POP
                invoke io.prestosql.spi.block.BlockBuilder.appendNull()Lio/prestosql/spi/block/BlockBuilder;
                POP
            }
            ifFalse {
                
                // varchar(152).writeSlice(output, Slice)
                store temp_0)
                store temp_1)
                [bootstrap(2L)]=>constant_2()
                load temp_1
                load temp_0
                invoke io.prestosql.spi.type.Type.writeSlice(Lio/prestosql/spi/block/BlockBuilder;Lio/airlift/slice/Slice;)V
            }
        }
        RETURN
    }

Summary

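Presto's bytecode codegen framework is built on ASM, with an easier-to-use bytecode generation API layered on top. Because it targets JVM bytecode, the generated code follows the JVM execution model: bytecode instructions form a zero-address instruction set, and execution relies on an operand stack.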

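Evaluating an expression tree is recursive: subexpressions are computed first, and their results become the operands of the parent. Because Presto's codegen is a stack-based, zero-address evaluation, all intermediate state of an expression (operands and partial results) stays on the operand stack. The coalesce code generation above shows how frequent push/pop operations become. When implementing complex codegen logic, a developer has to keep the current stack state in mind for nearly every line written; a single missing pop leaves an extra operand on the stack, the generated class fails to compile, and tracking that down can take a long time.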
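
The bookkeeping problem described above can be made concrete with a toy operand stack. This is a simulation under stated assumptions, not JVM bytecode: `OperandStackSketch`, `pushOperand`, and `stackDepthAfterCoalesce` are hypothetical names, and a string placeholder plays the role of the ACONST_NULL value that a null operand leaves on the stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class OperandStackSketch {
    // Placeholder pushed for a null operand, standing in for ACONST_NULL.
    static final Object NULL_PLACEHOLDER = "ACONST_NULL";

    // Pushes the operand (or the placeholder) and reports whether it was null.
    static boolean pushOperand(Deque<Object> stack, String value) {
        stack.push(value == null ? NULL_PLACEHOLDER : value);
        return value == null;
    }

    // Simulates coalesce(first, second) and returns the final stack depth.
    static int stackDepthAfterCoalesce(String first, String second, boolean popPlaceholder) {
        Deque<Object> stack = new ArrayDeque<>();
        boolean wasNull = pushOperand(stack, first);
        if (wasNull) {
            if (popPlaceholder) {
                stack.pop(); // discard the placeholder left by the null operand
            }
            pushOperand(stack, second);
        }
        return stack.size();
    }

    public static void main(String[] args) {
        System.out.println("with pop: " + stackDepthAfterCoalesce(null, "x", true));
        System.out.println("without pop: " + stackDepthAfterCoalesce(null, "x", false));
    }
}
```

With the pop, exactly one value remains (depth 1) whichever branch ran. Without it, the null branch ends at depth 2 while the non-null branch ends at depth 1, and this kind of mismatch between branches is what makes a generated class unusable.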

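In addition, for an expression such as (expr1 + expr2) * expr1, where expr1 appears twice, Presto's bytecode codegen framework emits the bytecode for expr1 twice, so expr1 is computed twice, which is inefficient. One possible improvement borrows from the register-based design of the Dalvik virtual machine: expr1 is computed once, and whenever the same expression appears again, its result is read from a "register".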
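
The difference between the two strategies for (expr1 + expr2) * expr1 can be sketched in plain Java. The names below (`SubexpressionReuseSketch`, `expr1`, `expr2`, `expr1Evaluations`) are hypothetical, and a local variable stands in for the "register". Reusing the result is only valid on the assumption that expr1 is deterministic and free of side effects.

```java
final class SubexpressionReuseSketch {
    // Counts how often expr1 is evaluated (hypothetical instrumentation).
    static int expr1Evaluations = 0;

    // Stand-in for a possibly expensive subexpression.
    static int expr1() {
        expr1Evaluations++;
        return 3;
    }

    static int expr2() {
        return 4;
    }

    // Mirrors emitting the code for expr1 at every occurrence.
    static int recomputeEachUse() {
        return (expr1() + expr2()) * expr1();
    }

    // Mirrors computing expr1 once and reading it back from a "register".
    static int reuseFromRegister() {
        int r1 = expr1();
        int r2 = expr2();
        return (r1 + r2) * r1;
    }

    public static void main(String[] args) {
        System.out.println(recomputeEachUse() + " with " + expr1Evaluations + " evaluations of expr1");
        expr1Evaluations = 0;
        System.out.println(reuseFromRegister() + " with " + expr1Evaluations + " evaluations of expr1");
    }
}
```

Both variants produce 21, but the first evaluates expr1 twice and the second once.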
