openGauss Source Code Study (2): Selectivity Estimation

Article link: https://blog.csdn.net/qq_42282731/article/details/133198516

Previous post: EXPLAIN Plan Analysis

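Contents

openGauss Source Code Study (2): Selectivity Estimation
Preface
I. Base-Table Row Estimation
- 1. Selectivity of a Single Condition
II. Selectivity of Multiple Conditions
- 1. AND/OR
- 2. Range Selectivity Correction
Summary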

Preface

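The previous post used an example to show that, during plan generation, an operator's estimated row count can deviate substantially from the actual count. This post looks at the code to see how openGauss (OG) estimates a table's row count, and why the deviation is sometimes so large.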

I. Base-Table Row Estimation

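A base table is an ordinary table, a temporary table, or a similar kind of table. For an introduction to base-table statistics and row estimation, see the post "计划生成揭秘" (roughly, "Plan Generation Demystified"). In short, base-table row estimation is driven by statistics: if the base table has filter conditions, then rows = total rows in the base table * selectivity of the filter conditions.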
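The relation above can be sketched in a few lines of C. This is only an illustration, not openGauss code: `estimate_rows` is a hypothetical helper, and the rounding and the clamp to at least one row are assumptions about how a planner typically reports the estimate.

```c
/* Hypothetical helper: estimate how many rows survive a filter. */
double estimate_rows(double total_rows, double selectivity)
{
    double rows;

    /* Keep the selectivity inside [0, 1]. */
    if (selectivity < 0.0)
        selectivity = 0.0;
    if (selectivity > 1.0)
        selectivity = 1.0;

    /* Assumed behavior: round to a whole row count... */
    rows = (double)(long long)(total_rows * selectivity + 0.5);

    /* ...and never report fewer than one row. */
    return rows < 1.0 ? 1.0 : rows;
}
```

For a 10-row table and a selectivity of 0.1, `estimate_rows(10.0, 0.1)` yields one row, which is the shape of estimate that appears as `rows=1` in an EXPLAIN plan.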

1. Selectivity of a Single Condition

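Condition selectivity is generally estimated from the MCV (most common values) list and the histogram. Take the simple example from last time: for the condition a=1, the planner estimates quite accurately that the SeqScan operator returns only one row, and this estimate is computed from the MCV values.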

CREATE TABLE t1(a INT, b INT);
INSERT INTO t1 SELECT x,x FROM generate_series(1,10) x; -- insert 10 rows
ANALYZE t1;
EXPLAIN ANALYZE SELECT * FROM t1 WHERE a=1;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..1.12 rows=1 width=8) (actual time=0.037..0.044 rows=1 loops=1)
   Filter: (a = 1)
   Rows Removed by Filter: 9
 Total runtime: 0.238 ms
(4 rows)

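The entry point for estimating the selectivity of a single restriction clause is restriction_selectivity. For scalar inequality operators such as <, the function that actually computes the selectivity is scalarltsel_internal; its source is analyzed below: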

float8 scalarltsel_internal(PlannerInfo* root, Oid opera, List* args, int varRelid)
{
    ...
    /*
     * If expression is not variable op something or something op variable,
     * then punt and return a default estimate.
     */
    if (!get_restriction_variable(root, args, varRelid, &vardata, &other, &varonleft))
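        // First check whether the clause has the form "column operator constant"; if not (e.g. a=b), use a fixed selectivity.
        return DEFAULT_INEQ_SEL; // 0.3333333333333333

    // The constant's value is then checked; some special cases also fall back to a fixed selectivity.
    ...

    // vardata was filled in by examine_variable, called from get_restriction_variable
    // scalarineqsel estimates the selectivity from the statistics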
    selec = scalarineqsel(root, opera, isgt, &vardata, constval, consttype);

    ReleaseVariableStats(vardata);

    return (float8)selec;
}

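The main logic lives in examine_variable (which fetches the base-table statistics for a Var) and scalarineqsel (which computes the selectivity from those statistics). For now, we analyze only the fast-path code that the example exercises.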

void examine_variable(PlannerInfo* root, Node* node, int varRelid, VariableStatData* vardata)
{
    ...
    /* Fast path for a simple Var */
    // The query tree for a=3 looks roughly like
    // opr =
    //    larg: var 'a' varno=1 varattno=1  (the first column of the first table in the query)
    //    rarg: const '3'
    if (IsA(basenode, Var) && (varRelid == 0 || (uint)varRelid == ((Var*)basenode)->varno)) {
        Var* var = (Var*)basenode;

        /* Set up result fields other than the stats tuple */
        vardata->var = basenode; /* return Var without relabeling */
        vardata->rel = find_base_rel(root, var->varno); // fetch the table's physical information from simple_rel_array
        vardata->atttype = var->vartype;
        vardata->atttypmod = var->vartypmod;
        vardata->isunique = has_unique_index(vardata->rel, var->varattno);

        /* Try to locate some stats */
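        // This function fetches the statistics for the RTE; for an ordinary table it looks up pg_statistic directly by the table's OID.
        examine_simple_variable(root, var, vardata);

        // At this point the fast path has obtained the single-column statistics for a simple Var.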
        return;
    }
    ...
    // The remaining logic handles more complex expressions, e.g. a Var that comes from a join, or a Var inside an expression such as a+3=4.
}

static double scalarineqsel(
    PlannerInfo* root, Oid opera, bool isgt, VariableStatData* vardata, Datum constval, Oid consttype)
{
    ...
    if (!HeapTupleIsValid(vardata->statsTuple)) {
        /* no stats available, so default result */
        return DEFAULT_INEQ_SEL; // 0.3333333333333333
    }
    stats = (Form_pg_statistic)GETSTRUCT(vardata->statsTuple);

    fmgr_info(get_opcode(opera), &opproc);
    equaloperator = get_equal(opera);

    /*
     * If we have most-common-values info, add up the fractions of the MCV
     * entries that satisfy MCV OP CONST.  These fractions contribute directly
     * to the result selectivity.  Also add up the total fraction represented
     * by MCV entries.
     */
    // mcv_selectivity walks every MCV entry in the statistics and checks which of them satisfy the condition.
    mcv_selec = mcv_selectivity(vardata, &opproc, constval, true, &sumcommon, equaloperator, &inmcv, &lastcommon);
    
    /*
     * If there is a histogram, determine which bin the constant falls in, and
     * compute the resulting contribution to selectivity.
     */
    // Estimate the condition's selectivity from the histogram; mostly used for range conditions such as greater-than and less-than.
    hist_selec = ineq_histogram_selectivity(root, vardata, &opproc, isgt, constval, consttype);

    /*
     * Now merge the results from the MCV and histogram calculations,
     * realizing that the histogram covers only the non-null values that are
     * not listed in MCV.
     */
    selec = 1.0 - stats->stanullfrac - sumcommon;

    if (hist_selec >= 0.0)
        selec *= hist_selec;
    else {
        /*
         * If no histogram but there are values not accounted for by MCV,
         * arbitrarily assume half of them will match.
         */
        selec *= 0.5;
    }

    selec += mcv_selec;

    // Depending on a GUC, correct the selectivity for certain scenarios that are known to be estimated poorly
    ...

    return selec;
}
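The merge step at the end of scalarineqsel can be tried out in isolation. The sketch below is a simplified illustration under stated assumptions, not the openGauss implementation: `hist_fraction_below` and `merge_selectivity` are hypothetical names, values are plain doubles rather than Datums, the histogram is assumed to be equi-depth (every bucket holds the same share of rows), and the position inside a bucket is assumed to be linear.

```c
#include <stddef.h>

/*
 * Hypothetical helper: fraction of the histogram-covered rows whose value
 * is below c. bounds[] holds n sorted bucket boundaries, so there are
 * n - 1 buckets. Returns -1.0 when there is no usable histogram.
 */
double hist_fraction_below(const double *bounds, size_t n, double c)
{
    size_t i = 0;

    if (n < 2)
        return -1.0;
    if (c <= bounds[0])
        return 0.0;
    if (c >= bounds[n - 1])
        return 1.0;

    /* Find the bucket [bounds[i], bounds[i + 1]) that contains c. */
    while (c >= bounds[i + 1])
        i++;

    /* Whole buckets below c, plus a linear share of the bucket holding c. */
    return ((double)i + (c - bounds[i]) / (bounds[i + 1] - bounds[i]))
           / (double)(n - 1);
}

/* Merge step as in the excerpt: the histogram covers only non-null, non-MCV rows. */
double merge_selectivity(double nullfrac, double sumcommon,
                         double mcv_selec, double hist_selec)
{
    double selec = 1.0 - nullfrac - sumcommon;

    if (hist_selec >= 0.0)
        selec *= hist_selec;
    else
        selec *= 0.5;   /* no histogram: assume half of the remaining rows match */

    return selec + mcv_selec;
}
```

As a worked example with made-up statistics: with 10% NULLs, MCV entries covering 30% of the rows (of which 20% satisfy the condition), and histogram boundaries {0, 10, 20, 30, 40}, the constant 25 falls halfway into the third bucket, so the histogram fraction is 2.5 / 4 = 0.625 and the merged selectivity is 0.6 * 0.625 + 0.2, about 0.575.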

II. Selectivity of Multiple Conditions

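clauselist_selectivity estimates the selectivity of the whole expression: it first estimates each clause individually and then combines the results according to the clause type.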

1. AND/OR

Selectivity estimation model for OR conditions:

        s1 = 0.0;
        foreach (arg, ((BoolExpr*)clause)->args) {
            /* DO NOT cache the var ratio of single or-clauses */
            Selectivity s2 = clause_selectivity(root, (Node*)lfirst(arg), varRelid, jointype, sjinfo, false);

            s1 = s1 + s2 - s1 * s2;
        }
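The loop above is the inclusion-exclusion rule for independent events, P(A or B) = P(A) + P(B) - P(A)P(B), applied one clause at a time. A minimal standalone sketch, where `or_selectivity` is a hypothetical name and the per-clause selectivities are passed in as plain doubles:

```c
#include <stddef.h>

/* Combine the selectivities of OR-ed clauses, assuming they are independent. */
double or_selectivity(const double *sels, size_t n)
{
    double s1 = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double s2 = sels[i];

        /* Add the new clause, then subtract the rows counted twice. */
        s1 = s1 + s2 - s1 * s2;
    }
    return s1;
}
```

For two clauses with selectivities 0.1 and 0.2 this gives about 0.28 rather than 0.3, because rows matching both clauses are counted only once. The result can never exceed 1.0, however many clauses are combined.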

Selectivity estimation model for AND conditions:

    /* Not the right form, so treat it generically. */
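    // The columns in AND conditions may be correlated, and OG does not yet support collecting correlation statistics,
    // so the GUC parameter cost_param is provided to adjust how AND conditions are estimated.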
    if ((uint32)u_sess->attr.attr_sql.cost_param & COST_ALTERNATIVE_CONJUNCT) {
        s1 = MIN(s1, s2);
        expr->xpr.selec = s1;
    } else {
        s1 = s1 * s2;
        expr->xpr.selec = s2;
    }
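The two branches encode opposite assumptions about the columns: multiplying treats the conditions as independent, while taking the minimum treats them as fully correlated. The sketch below contrasts them; `and_selectivity` and its `use_min` flag are hypothetical stand-ins for the cost_param check and do not reproduce the openGauss code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Combine the selectivities of AND-ed clauses under one of two assumptions. */
double and_selectivity(const double *sels, size_t n, bool use_min)
{
    double s1 = 1.0;
    size_t i;

    for (i = 0; i < n; i++) {
        double s2 = sels[i];

        if (use_min)
            s1 = (s2 < s1) ? s2 : s1;   /* assume fully correlated columns */
        else
            s1 = s1 * s2;               /* assume independent columns */
    }
    return s1;
}
```

As a made-up illustration, take a 10,000-row table in which column b always equals column a and each of a=1 and b=1 has selectivity 0.01. The product gives 0.0001 (about 1 row), while the minimum gives 0.01 (about 100 rows), which matches the true count for such perfectly correlated columns. For columns that really are independent, the minimum overestimates instead.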

2. Range Selectivity Correction

clauselist_selectivity also corrects the selectivity of range conditions.

        /* Successfully matched a pair of range clauses */
        Selectivity s2;
        /*
         * Exact equality to the default value probably means the
         * selectivity function punted.  This is not airtight but should
         * be good enough.
         */
        if (rqlist->hibound == DEFAULT_INEQ_SEL || rqlist->lobound == DEFAULT_INEQ_SEL) {
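            s2 = DEFAULT_RANGE_INEQ_SEL;
        } else {
            // The optimizer applies a correction for cases such as a > 1 AND a < 5.
            // Rather than multiplying the selectivities of the two AND-ed conditions, as in the generic case above,
            // it computes the fraction covered by the overlap of the matched pair.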
            s2 = rqlist->hibound + rqlist->lobound - 1.0;
            /* Adjust for double-exclusion of NULLs */
            s2 += nulltestsel(root, IS_NULL, rqlist->var, varRelid, jointype, sjinfo);

            /*
             * A zero or slightly negative s2 should be converted into a
             * small positive value; we probably are dealing with a very
             * tight range and got a bogus result due to roundoff errors.
             * However, if s2 is very negative, then we probably have
             * default selectivity estimates on one or both sides of the
             * range that we failed to recognize above for some reason.
             */
            if (s2 <= 0.0) {
                if (s2 < -0.01) {
                    /*
                     * No data available --- use a default estimate that
                     * is small, but not real small.
                     */
                    s2 = DEFAULT_RANGE_INEQ_SEL;
                } else {
                    /*
                     * It's just roundoff error; use a small positive
                     * value
                     */
                    s2 = 1.0e-10;
                }
            }
        }
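The formula follows from the fact that, for a range lo < a < hi, every non-null row satisfies at least one of the two conditions, and the rows inside the range satisfy both. Hence sel(a < hi) + sel(a > lo) = sel(range) + (1 - nullfrac), which rearranges to the expression in the code. A standalone sketch of the same arithmetic follows; `range_selectivity` is a hypothetical name, the null fraction is passed in directly instead of calling nulltestsel, and the value 0.005 for DEFAULT_RANGE_INEQ_SEL is an assumption, since the excerpt does not show it.

```c
#define DEFAULT_INEQ_SEL        0.3333333333333333
#define DEFAULT_RANGE_INEQ_SEL  0.005   /* assumed value, not shown in the excerpt */

/*
 * hibound: selectivity of the upper-bound clause, e.g. a < 5
 * lobound: selectivity of the lower-bound clause, e.g. a > 1
 * nullfrac: fraction of NULLs in the column
 */
double range_selectivity(double hibound, double lobound, double nullfrac)
{
    double s2;

    /* A default on either side means there were no usable statistics. */
    if (hibound == DEFAULT_INEQ_SEL || lobound == DEFAULT_INEQ_SEL)
        return DEFAULT_RANGE_INEQ_SEL;

    /* Overlap of the two one-sided conditions, with NULLs added back once. */
    s2 = hibound + lobound - 1.0 + nullfrac;

    if (s2 <= 0.0) {
        /* Very negative: unusable inputs. Slightly negative: roundoff. */
        s2 = (s2 < -0.01) ? DEFAULT_RANGE_INEQ_SEL : 1.0e-10;
    }
    return s2;
}
```

For a column holding the values 1 to 10 with no NULLs, a < 5 has selectivity 0.4 and a > 1 has selectivity 0.9, so the range estimate is 0.4 + 0.9 - 1.0, about 0.3, which corresponds to the three values 2, 3 and 4. Multiplying the two selectivities would have given 0.36.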

Summary

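This post covered selectivity estimation for a few simple scenarios; many other, more complex scenarios are planned for later posts. From the implementation and the estimation mechanism, though, it is clear that in many scenarios (complex ones especially) PG/OG cannot produce an accurate selectivity and must fall back on a fixed value derived from experience or assumptions. Subqueries and conditions that involve computations on columns generally produce even more cases with large estimation errors.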