PostgreSQL 并行查询选择
标签
PostgreSQL , 并行 , 代价 , 额外的启动代价
背景
为什么PostgreSQL 并行查询有额外的启动成本, 按照CBO的代价低优先原则, 还是会选择使用并行查询?
算法请参考:
PostgreSQL 估算并行计算代价时, 会考虑并行计算worker进程和leader进程之间的memory copy开销, 以及启动worker节点是分配共享内存的额外开销:
src/backend/optimizer/path/costsize.c
- parallel_tuple_cost Cost of CPU time to pass a tuple from worker to leader backend
- parallel_setup_cost Cost of setting up shared memory for parallelism
例如, 额外的代价会体现在gather node:
/*
* cost_gather
* Determines and returns the cost of gather path.
*
* 'rel' is the relation to be operated upon
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
* 'rows' may be used to point to a row estimate; if non-NULL, it overrides
* both 'rel' and 'param_info'. This is useful when the path doesn't exactly
* correspond to any particular RelOptInfo.
*/
void
cost_gather(GatherPath *path, PlannerInfo *root,
RelOptInfo *rel, ParamPathInfo *param_info,
double *rows)
{
Cost startup_cost = 0;
Cost run_cost = 0;
/* Mark the path with the correct row estimate */
if (rows)
path->path.rows = *rows;
else if (param_info)
path->path.rows = param_info->ppi_rows;
else
path->path.rows = rel->rows;
startup_cost = path->subpath->startup_cost;
run_cost = path->subpath->total_cost - path->subpath->startup_cost;
/* Parallel setup and communication cost. */
startup_cost += parallel_setup_cost;
run_cost += parallel_tuple_cost * path->path.rows;
path->path.startup_cost = startup_cost;
path->path.total_cost = (startup_cost + run_cost);
}
但是, 在估算node cpu的代价时, 用到的并不是所有worker节点代价的总和, 而是每个worker的代价.
/*
* cost_seqscan
* Determines and returns the cost of scanning a relation sequentially.
*
* 'baserel' is the relation to be scanned
* 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
*/
/* Adjust costing for parallelism, if used. */
if (path->parallel_workers > 0)
{
double parallel_divisor = get_parallel_divisor(path);
/* The CPU cost is divided among all the workers. */
cpu_run_cost /= parallel_divisor;
/*
* It may be possible to amortize some of the I/O cost, but probably
* not very much, because most operating systems already do aggressive
* prefetching. For now, we assume that the disk run cost can't be
* amortized at all.
*/
/*
* In the case of a parallel plan, the row count needs to represent
* the number of tuples processed per worker.
*/
path->rows = clamp_row_est(path->rows / parallel_divisor);
}
/*
* Estimate the fraction of the work that each worker will do given the
* number of workers budgeted for the path.
*/
static double
get_parallel_divisor(Path *path)
{
double parallel_divisor = path->parallel_workers;
/*
* Early experience with parallel query suggests that when there is only
* one worker, the leader often makes a very substantial contribution to
* executing the parallel portion of the plan, but as more workers are
* added, it does less and less, because it's busy reading tuples from the
* workers and doing whatever non-parallel post-processing is needed. By
* the time we reach 4 workers, the leader no longer makes a meaningful
* contribution. Thus, for now, estimate that the leader spends 30% of
* its time servicing each worker, and the remainder executing the
* parallel plan.
*/
if (parallel_leader_participation)
{
double leader_contribution;
leader_contribution = 1.0 - (0.3 * path->parallel_workers);
if (leader_contribution > 0)
parallel_divisor += leader_contribution;
}
return parallel_divisor;
}
注意: page scan的代价是算总的, 并没有像cpu tuple cost那样除以worker数. 为啥呢? 可能觉得IO的并行提升能力有限? 又或者是pg hacker们没有考虑到这个问题?
/*
* disk costs
*/
disk_run_cost = spc_seq_page_cost * baserel->pages;
最后总结一下, PostgreSQL的CBO思路是哪个耗时短就选哪个. 所以 代价模型=执行时间模型, 因此并行计算既然能更快, 那就要在代价估算算法中有所体现, PG在计算cpu tuple cost时会除以worker数, 实际上算的是每个worker的代价, 和并行执行的时间模型匹配.
所以虽然并行计算有启动worker的创建共享内存, worker和leader进程的内存拷贝的额外代价, 但是最终并行计算的代价还是更低, 最终有可能选择并行计算.