PostgreSQL Source Code Study - Chapter 3 Query Processing

Chapter 3 Query Processing (Part 1)

  • PG supports the SQL standard of 2011, and query processing in PostgreSQL handles the supported SQL efficiently. (Note: "processing" here does not refer to an OS process.)
  • There are three technically interesting and practical features.

3.1 Overview(概述)

  • The back-end process (the per-session server process) consists of five subsystems (i.e., five steps):

    • ①Parser: the parser generates a parse tree from an SQL statement in plain text.
    • ②Analyzer: the analyzer carries out a semantic analysis of the parse tree generated in the previous step and generates a query tree.
    • ③Rewriter: the rewriter transforms the query tree using the rules stored in the rule system, if such rules exist.
    • ④Planner: the planner generates, from the query tree, the plan tree that can be executed most efficiently. This subsystem is very complicated!
    • ⑤Executor: the executor executes the query by accessing the tables and indexes in the order given by the plan tree. This subsystem is very complicated, too!
3.1.1 Parser (parses the syntax of SQL statements)
  • The parser generates, from an SQL statement in plain text, a parse tree that can be read by the subsequent subsystems.
  • The root node is a SelectStmt structure defined in the file parsenodes.h.
  • Vocabulary: clause, a component of an SQL statement (e.g., the WHERE clause).
/*
* The comments below use as an example a query over the tables
* Weather(cityId, low_tem, hig_tem, date) and City(cityId, cityName, altitude, longitude):

SELECT City.cityName,
       Weather.date,
       Weather.low_tem,
       Weather.hig_tem
FROM   City, Weather
WHERE  City.cityId = Weather.cityId
ORDER BY date;

*/
typedef struct SelectStmt
{
  NodeTag         type;

        /*
         * These fields are used only in "leaf" SelectStmts.
         */
        List       *distinctClause;     /* NULL, list of DISTINCT ON exprs, or
                                         * lcons(NIL,NIL) for all (SELECT DISTINCT) */
        IntoClause *intoClause;         /* target for SELECT INTO */
        List       *targetList;         /* the target list (of ResTarget); holds every column listed after SELECT, i.e. City.cityName, Weather.date, Weather.low_tem and Weather.hig_tem in the example query */
        List       *fromClause;         /* the FROM clause; the relation(s) the query reads from, i.e. City and Weather above; it is a List because more than one table may appear */
        Node       *whereClause;        /* WHERE qualification; the search condition, stored as an expression tree whose root is the operator evaluated last and whose children are its operand expressions */
        List       *groupClause;        /* GROUP BY clauses */
        Node       *havingClause;       /* HAVING conditional-expression */
        List       *windowClause;       /* WINDOW window_name AS (...), ... */

        /*
         * In a "leaf" node representing a VALUES list, the above fields are all
         * null, and instead this field is set.  Note that the elements of the
         * sublists are just expressions, without ResTarget decoration. Also note
         * that a list element can be DEFAULT (represented as a SetToDefault
         * node), regardless of the context of the VALUES list. It's up to parse
         * analysis to reject that where not valid.
         */
        List       *valuesLists;        /* untransformed list of expression lists */

        /*
         * These fields are used in both "leaf" SelectStmts and upper-level
         * SelectStmts.
         */
        List       *sortClause;         /* sort clause (a list of SortBy's), e.g. the ORDER BY clause */
        Node       *limitOffset;        /* # of result tuples to skip */
        Node       *limitCount;         /* # of result tuples to return */
        List       *lockingClause;      /* FOR UPDATE (list of LockingClause's) */
        WithClause *withClause;         /* WITH clause */

        /*
         * These fields are used only in upper-level SelectStmts.
         */
        SetOperation op;                /* type of set op */
        bool            all;            /* ALL specified? */
        struct SelectStmt *larg;        /* left child */
        struct SelectStmt *rarg;        /* right child */
        /* Eventually add fields for CORRESPONDING spec here */
} SelectStmt;
  • In this step (generating the parse tree), an error is returned only if there is a syntax error in the query statement.
  • It does not check the semantics of the input query. For example, even if the FROM clause names a table that does not exist in the database, the parser does not return an error.
  • Typical syntax errors: misspelling the command keyword SELECT as SELCT, or placing a WHERE clause before SELECT.
3.1.2 Analyzer (semantic analysis of the parse tree)
  • The analyzer generates a query tree; the root node of the query tree is a structure named Query.
/*
 * Query -
 *	  Parse analysis turns all statements into a Query tree
 *	  for further processing by the rewriter and planner.
 *
 *	  Utility statements (i.e. non-optimizable statements) have the
 *	  utilityStmt field set, and the Query itself is mostly dummy.
 *	  DECLARE CURSOR is a special case: it is represented like a SELECT,
 *	  but the original DeclareCursorStmt is stored in utilityStmt.
 *
 *	  Planning converts a Query tree into a Plan tree headed by a PlannedStmt
 *	  node --- the Query structure is not used by the executor.
 */
typedef struct Query
{
	NodeTag		type;
	CmdType		commandType;		/* select|insert|update|delete|utility */
	QuerySource 	querySource;		/* where did I come from? */
	uint32		queryId;		/* query identifier (can be set by plugins) */

	bool		canSetTag;		/* do I set the command result tag? */
	Node	   	*utilityStmt;		/* non-null if this is DECLARE CURSOR or a non-optimizable statement */
	int		resultRelation; 	/* rtable index of target relation for INSERT/UPDATE/DELETE; 0 for SELECT */
	bool		hasAggs;		/* has aggregates in tlist or havingQual */
	bool		hasWindowFuncs; 	/* has window functions in tlist */
	bool		hasSubLinks;		/* has subquery SubLink */
	bool		hasDistinctOn;		/* distinctClause is from DISTINCT ON */
	bool		hasRecursive;		/* WITH RECURSIVE was specified */
	bool		hasModifyingCTE;	/* has INSERT/UPDATE/DELETE in WITH */
	bool		hasForUpdate;		/* FOR [KEY] UPDATE/SHARE was specified */
	bool		hasRowSecurity; 	/* row security applied? */
	List	   	*cteList;		/* WITH list (of CommonTableExpr's) */
	List	   	*rtable;		/* list of range table entries; the relations used in the query, e.g. the OIDs and names of the tables Weather and City in the example above */
	FromExpr   	*jointree;		/* table join tree (FROM and WHERE clauses) */
	List	   	*targetList;		/* target list (of TargetEntry); the result columns, like targetList in the parse tree; if the query uses *, the analyzer expands it into all columns */
	List	   	*withCheckOptions;	/* a list of WithCheckOption's */
	OnConflictExpr 	*onConflict; 		/* ON CONFLICT DO [NOTHING | UPDATE] */
	List	   	*returningList;		/* return-values list (of TargetEntry) */
	List	   	*groupClause;		/* a list of SortGroupClause's (GROUP BY) */
	List	   	*groupingSets;		/* a list of GroupingSet's if present */
	Node	   	*havingQual;		/* qualifications applied to groups */
	List	   	*windowClause;		/* a list of WindowClause's */
	List	   	*distinctClause; 	/* a list of SortGroupClause's */
	List	   	*sortClause;		/* a list of SortGroupClause's; sorting clauses such as ORDER BY */
	Node	   	*limitOffset;		/* # of result tuples to skip (int8 expr) */
	Node	   	*limitCount;		/* # of result tuples to return (int8 expr) */
	List	   	*rowMarks;		/* a list of RowMarkClause's */
	Node	   	*setOperations;		/* set-operation tree if this is top level of a UNION/INTERSECT/EXCEPT query */
	List	   	*constraintDeps; 	/* a list of pg_constraint OIDs that the query depends on to be semantically valid */
} Query;

3.1.3 Rewriter
  • Whenever a query tree matches a rule in the rule system, the rewriter uses that rule to rewrite the tree. Typical examples are view merging (a query against a view is transformed into a query against the underlying tables) and sub-query optimization, as in the following example:
-- A query which contains a sub-query
SELECT cityId, hig_tem
FROM Weather
WHERE cityId IN (
	SELECT cityId
    FROM City
    WHERE cityName='xx'
);

-- The SQL above after the rewrite
SELECT Weather.cityId, Weather.hig_tem
FROM Weather, City
WHERE Weather.cityId = City.cityId
AND City.cityName='xx';
  • If necessary, the rewriter transforms the query tree according to the rules stored in the pg_rules system catalog.
  • The view system is implemented using the rule system. The essence of the CREATE VIEW command is that the corresponding rule is generated automatically and stored in the catalog pg_rules. (A view essentially stores the rule of a query.)
  • Accordingly, when SELECT * FROM view_name; is issued, the parser creates a parse tree for the view query, which the rewriter then expands into a query over the underlying tables.
  • The purposes of rewriting are:
    • pre-optimizing SQL statements
    • avoiding unnecessary operations
    • helping the optimizer find better plans
3.1.4 Planner and Executor
  • The planner receives the query tree from the rewriter and generates a (query) plan tree that can be processed most effectively by the executor.

  • The planner is the most complex subsystem in an RDBMS (relational database management system).

  • PostgreSQL's planner uses pure cost-based optimization (CBO); it does not use rule-based optimization (RBO). What is cost-based optimization?

    • Cost: when executing a query statement, the executor spends cost in two aspects: I/O cost and CPU cost. I/O cost is the time spent reading data from disk into memory or cache; CPU cost is the time spent reading records and checking whether they satisfy the search conditions, plus the time spent sorting the result set.

    • How is cost-based optimization carried out?

      • First, based on the search conditions, find all candidate indexes that could be used.
      • Second, calculate the cost of scanning the whole table (two values are needed: the number of pages the table occupies and the number of tuples in it).
      • Third, calculate the cost of executing the query with each candidate index.
      • Finally, compare the costs of the candidate execution plans and choose the cheapest one as the final plan.
    • For examples, see PostgreSQL's cost-calculation formulas and MySQL's cost-calculation examples.
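The choose-the-cheapest-plan loop described above can be sketched in C. This is an illustrative toy, not the planner's actual API; the struct and function names here are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical, simplified picture of cost-based plan choice:
 * estimate the cost of every candidate access path, then keep the
 * one with the lowest total cost. */
typedef struct CandidatePath {
    const char *method;     /* "SeqScan", "IndexScan", ... */
    double      total_cost; /* estimated total cost (dimensionless) */
} CandidatePath;

static const CandidatePath *
choose_cheapest(const CandidatePath *paths, size_t n)
{
    const CandidatePath *best = &paths[0];
    for (size_t i = 1; i < n; i++)
        if (paths[i].total_cost < best->total_cost)
            best = &paths[i];
    return best;
}
```

The real planner stores its candidates in the pathlist of a RelOptInfo structure (Section 3.3.2); this loop only illustrates the comparison step.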

  • How can a plan tree be displayed? By using the command EXPLAIN.

my_first_pgdb=# EXPLAIN SELECT * FROM weather WHERE city = 'Changsha' ORDER BY date;
                           QUERY PLAN                           
----------------------------------------------------------------
 Sort  (cost=14.51..14.52 rows=2 width=194)
   Sort Key: date
   ->  Seq Scan on weather  (cost=0.00..14.50 rows=2 width=194)
         Filter: ((city)::text = 'Changsha'::text)
(4 rows)
/* cost=14.51..14.52 means the start-up cost is 14.51 and the total cost is 14.52 */
  • The relationship between the plan tree and the result of the EXPLAIN command is shown in the figure below.
  • The data structures in the figure, such as PlannedStmt and SortNode, are defined in plannodes.h (e.g. /opt/pgsql-15.3/include/server/nodes/plannodes.h).

  • Explanation of some fields in the illustration above:

    • CmdType commandType: the type of the command, e.g. SELECT | CREATE | UPDATE; since this chapter deals with queries, it is SELECT here.
    • List *plantree: points to the plan tree.
    • List *rtable: points to the range table, which contains the table names, OIDs and the attributes involved in this query.
    • List *relationOids: the OIDs of all database objects involved in this query; here only a table is involved, so there is a single table OID.
    • SortNode: the data structure of the root node of this plan tree.
    typedef struct {
        Cost startup_cost;      // start-up cost: the cost spent before the first tuple is fetched
        Cost total_cost;        // total cost = start-up cost + run cost, where run cost = I/O cost + CPU cost
        double rows;            // the estimated number of tuples
        List *targetlist;       // the columns listed after SELECT, like the targetList fields of the parse tree and the query tree
        List *qual;             // qualification conditions
        struct Plan *lefttree;  // left child plan node
        struct Plan *righttree; // right child plan node
        Oid *sortOperators;     // OIDs of the sort operators
    } SortNode;
    
  • Notice: in the case of a single-table query, the executor processes the plan tree from its leaf nodes up to the root.

  • The executor carries out the query processing with the assistance of the buffer manager (described in Chapter 8) and the concurrency control mechanism (described in Chapter 5). The back-end process accesses the database objects of the database cluster through the buffer manager, using shared memory areas, and creates temporary files during query processing if necessary.

3.2 Cost Estimation(成本估算) in Single-Table Query

  • PG's query optimization is cost-based.
  • Costs are dimensionless values; they are not absolute performance indicators but indicators for comparing the relative cost of operations.
  • Costs are estimated by the functions defined in costsize.c. All operations executed by the executor have corresponding cost functions; for example, the costs of sequential scans and index scans are estimated by cost_seqscan() and cost_index(), respectively.
  • There are three kinds of costs in PostgreSQL: start-up, run and total (start-up cost + run cost = total cost).
    • The start-up cost of a sequential scan is the cost expended before the first tuple is fetched; for an index scan it is the cost of reading the index pages needed to access the first tuple in the target table.
  • In this section we explore in detail how the costs of the sequential scan (3.2.1), the index scan (3.2.2) and the sort operation (3.2.3) are estimated.
3.2.1 Sequential Scan(顺序扫描)
  • The function that estimates the cost of a sequential scan is cost_seqscan().

    • The start-up cost is equal to 0.

    • The run cost is defined by the equation: run_cost = cpu_run_cost + disk_run_cost = (cpu_tuple_cost + cpu_operator_cost) × N_tuple + seq_page_cost × N_page

    • The three constants in the equation above, seq_page_cost, cpu_tuple_cost and cpu_operator_cost, are set in postgresql.conf. Their default values are 1.0, 0.01 and 0.0025, respectively.

  • How do we get the numbers of tuples and pages?

my_first_pgdb=# select relpages,reltuples from pg_class where relname='weather';
 relpages | reltuples 
----------+-----------
        0 |        -1
(1 row)
  • WHY IS THE VALUE OF RELTUPLES DIFFERENT FROM THE ACTUAL RECORD COUNT?
relpages	      int4	 	              //size of the table on disk, in pages (of size BLCKSZ); only an approximation used by the planner, updated by VACUUM, ANALYZE and a few DDL commands such as CREATE INDEX
reltuples	      float4	       	      //number of rows in the table; only an estimate used by the planner, updated by VACUUM, ANALYZE and a few DDL commands such as CREATE INDEX
  • The ANALYZE command collects statistics and updates the values of reltuples and relpages; for a table that already contains data, the system does not update them automatically.

    • ANALYZE collects statistics about the contents of tables in the database and stores the results in the pg_statistic catalog; the query planner uses this information to help choose an efficient execution plan.

    • After running ANALYZE on the table, query relpages and reltuples again:

    • my_first_pgdb=# analyze verbose weather;
      INFO:  analyzing "public.weather"
      INFO:  "weather": scanned 1 of 1 pages, containing 8 live rows and 0 dead rows; 8 rows in sample, 8 estimated total rows
      ANALYZE
      
      my_first_pgdb=# select relpages,reltuples from pg_class where relname='weather';
       relpages | reltuples 
      ----------+-----------
              1 |         8
      (1 row)
      
      
      
    • Then, using the equation, the sequential scan cost is: total_cost = start-up_cost + run_cost = 0.0 + [(0.01 + 0.0025) × 8 + 1.0 × 1] = 1.10

    • Confirmation in psql:

      my_first_pgdb=# explain select * from weather where prcp>0;
                             QUERY PLAN                       
      --------------------------------------------------------
       Seq Scan on weather  (cost=0.00..1.10 rows=8 width=30)
         Filter: (prcp > '0'::double precision)
      (2 rows)
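The arithmetic above can be checked with a small C sketch that mirrors the cost_seqscan() equations using the default cost constants. The function name and signature are invented for illustration; only the formula comes from the text.

```c
#include <assert.h>
#include <math.h>

/* Default cost constants from postgresql.conf */
#define SEQ_PAGE_COST     1.0
#define CPU_TUPLE_COST    0.01
#define CPU_OPERATOR_COST 0.0025

/* total cost = start-up cost (0 for a sequential scan) + run cost,
 * where run cost = (cpu_tuple_cost + cpu_operator_cost) * N_tuple
 *                + seq_page_cost * N_page */
static double
seqscan_total_cost(double ntuples, double npages)
{
    double startup_cost  = 0.0;
    double cpu_run_cost  = (CPU_TUPLE_COST + CPU_OPERATOR_COST) * ntuples;
    double disk_run_cost = SEQ_PAGE_COST * npages;
    return startup_cost + cpu_run_cost + disk_run_cost;
}
```

With the values collected by ANALYZE above (8 tuples, 1 page), this yields 1.10, matching the cost shown by EXPLAIN.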
  • Under what query conditions is a sequential scan used? For example, when no usable index exists.

  • This type of filter is used when reading all the tuples in the table; it does not narrow the scanned range of table pages.

3.2.2 Index Scan(索引扫描)
  • The different index methods in PostgreSQL basically all use the common cost function cost_index() to estimate the cost.
  • Some values depend on the table, i.e. each table (and each index) has its own values (SELECT relpages, reltuples FROM pg_class WHERE relname='index_name'):
    • N_index,tuple: the number of index tuples
    • N_index,page: the number of index pages
3.2.2.1 Start-up Cost
  • The start-up cost of the index scan is the cost of reading the index pages needed to access the first tuple in the target table.
  • The equation: start-up_cost = {ceil(log2(N_index,tuple)) + (H_index + 1) × 50} × cpu_operator_cost
    • H_index is the height of the index tree.
    • ceil() rounds its argument up to the nearest integer greater than or equal to it.
    • The default value of cpu_operator_cost is 0.0025.
    • Only when the value of reltuples is sufficiently large will the start-up cost exceed 1.
3.2.2.2 Run Cost
  • The run cost = CPU cost + I/O cost = (index CPU cost + table CPU cost) + (index I/O cost + table I/O cost).

  • The equation of this cost item: run_cost = (index_cpu_cost + table_cpu_cost) + (index_io_cost + table_io_cost)

    • index_cpu_cost = Selectivity × N_index,tuple × (cpu_index_tuple_cost + qual_op_cost)

      • qual_op_cost is the evaluation cost of the index quals; its default value is 0.0025.
      • cpu_index_tuple_cost is set in the file postgresql.conf; the default value is 0.005.
    • table_cpu_cost = Selectivity × N_tuple × cpu_tuple_cost

      • cpu_tuple_cost is set in the file postgresql.conf; the default value is 0.01.
      • Both CPU-cost equations above have the form: (fraction of tuples selected) × (number of tuples) × (cost constant).
    • index_io_cost = ceil(Selectivity × N_index,page) × random_page_cost

      • random_page_cost is set in the file postgresql.conf; the default value is 4.0. It is the planner's estimate of the cost of a non-sequentially-fetched disk page. Reducing this value makes the executor prefer index scans. Although the default is 4, random access to mechanical disk storage is normally much more expensive than 4 × sequential access; the default assumes that the majority of random accesses hit the cache.
    • table_io_cost = max_io_cost + indexCorrelation² × (min_io_cost − max_io_cost)

      • where max_io_cost = N_page × random_page_cost

      • min_io_cost = 1 × random_page_cost + (ceil(Selectivity × N_page) − 1) × seq_page_cost

      • indexCorrelation is a statistical correlation between the physical row ordering and the logical ordering of the column values (cited from the official documentation). It ranges from -1 to 1. An example to explain this concept:

      • SELECT col,col_asc,col_desc,col_rand FROM tbl_corr;
                  /* ascending | descending | random */
           col    | col_asc | col_desc | col_rand 
        ----------+---------+----------+----------
         Tuple_1  |       1 |       12 |        3
         Tuple_2  |       2 |       11 |        8
         Tuple_3  |       3 |       10 |        5
         Tuple_4  |       4 |        9 |        9
         Tuple_5  |       5 |        8 |        7
         Tuple_6  |       6 |        7 |        2
         Tuple_7  |       7 |        6 |       10
         Tuple_8  |       8 |        5 |       11
         Tuple_9  |       9 |        4 |        4
         Tuple_10 |      10 |        3 |        1
         Tuple_11 |      11 |        2 |       12
         Tuple_12 |      12 |        1 |        6
        (12 rows)
        
        /* Query the correlation of each column; text-typed columns also have this value. */
        SELECT tablename,attname,correlation FROM pg_stats WHERE tablename='tbl_corr';
        tablename | attname  | correlation 
        -----------+----------+-------------
         tbl_corr  | col_asc  |           1
         tbl_corr  | col_desc |          -1
         tbl_corr  | col_rand |    0.125874
        (3 rows)
        /* Executing the following commands: */
        SELECT * FROM tbl_corr WHERE col_asc BETWEEN 2 AND 4;
        /* For the command above, the PG executor reads only the first page, because all target tuples are stored in the first page. */
        SELECT * FROM tbl_corr WHERE col_rand BETWEEN 2 AND 4;
        /* For the second command, the PG executor has to read all pages, because the target values are scattered randomly. */

        /* Summary: with an index scan, the index ordering and the physical ordering of the tuples in the table may not correspond. Index correlation is a statistical correlation that reflects the influence of the random access caused by this mismatch. */
        
    • Selectivity is the proportion of the table selected by the search range of the specified WHERE clause; its value is a floating-point number between 0 and 1. There are two ways to estimate the Selectivity: MCVs and histogram_bounds, both of which are stored in the statistics view pg_stats.

      • MCV (Most Common Value): two pg_stats columns are related to it: most_common_vals is a list of the MCVs in the column, and most_common_freqs is a list of the frequencies of the MCVs. Example: SELECT most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='countries' AND attname='continent'; (attname means attribute name, i.e. the column name.)
      • When the values of the target column tend to be unique (e.g. integer or double precision columns), the MCVs cannot cover them, so histogram_bounds is used instead to estimate the cost: it divides the column's values into groups of approximately equal frequency. Example: SELECT histogram_bounds FROM pg_stats WHERE tablename='tbl_name' AND attname='data';
    • The values cpu_index_tuple_cost = 0.005 and random_page_cost = 4.0 are set in the postgresql.conf file.

3.2.2.3 Total Cost
  • As with the sequential scan, the total cost of the index scan is the sum of the start-up and run costs: total_cost = start-up_cost + run_cost.
3.2.3 Sort
  • Which operations use sorting? ORDER BY, merge join operations and some other operations.

  • Which function is used to estimate the cost of sorting? cost_sort().

  • Line 159 in postgresql.conf:
    #work_mem = 4MB				# min 64kB
    The work_mem option is the maximum memory a single query operation (e.g. a sort or a hash table) may use; the default value is 4MB. ORDER BY, DISTINCT and merge joins use sort operations; hash joins, hash-based aggregation and hash-based IN sub-query processing use hash tables.
    
  • If all tuples can be sorted within work_mem, the quick-sort algorithm is used. Otherwise, a temporary file is created and the file merge sort algorithm is used.

  • START-UP COST: the cost of sorting the target tuples. With quick-sort, the sort cost is O(n·log2 n), where n is the number of tuples.

  • RUN COST: the cost of reading the sorted tuples, therefore the cost is O(n).

  • start-up_cost = C + comparison_cost × N_sort × log2(N_sort)

    • C is the total cost of the preceding scan that feeds the sort, e.g. the total cost of the index scan. (Without the ORDER BY clause, C alone would be the query's total cost.)
    • comparison_cost is defined as 2 × cpu_operator_cost.
  • run_cost = cpu_operator_cost × N_sort

3.3 Creating the Plan Tree(创建计划树) of a Single-Table Query

  • This section describes how the planner creates a plan tree for a single-table query. The planner in PostgreSQL performs three steps, as shown below:
    • ①Carry out preprocessing.
    • ②Get the cheapest access path by estimating the costs of all possible access paths. (An access path is a unit of processing for estimating cost; access paths are used only inside the planner, to create the plan tree.)
    • ③Create the plan tree from the cheapest path.
  • Structure PlannerInfo
typedef struct PlannerInfo {
    NodeTag type;
    Query *parse;                   /* the query tree currently being planned (see Section 3.1.2 for the Query structure) */
    PlannerGlobal *glob;            /* global information for the current planner run */
    RelOptInfo **simple_rel_array;  /* holds the RelOptInfo created in the first step of 3.3.2 */
    List *join_rel_list;            /* list of join-relation RelOptInfos */
    List *sort_pathkeys;            /* path keys for the ORDER BY clause */
    /* ... */
} PlannerInfo;

// The full definition appears at the end of this document; the fields shown above are the relatively important ones.
3.3.1 Preprocessing
  • Before creating a plan tree, the planner carries out some preprocessing of the query tree stored in the PlannerInfo structure.
  • Only discuss the main preprocessing for the single-table query in this subsection.
  • Three Points:
    • Simplifying target lists, LIMIT clauses, and so on.
    • Normalizing Boolean expressions, e.g. NOT (NOT a) is normalized to a.
    • Flattening AND/OR expressions. In the SQL standard, AND and OR are binary operators; in the PostgreSQL internals, however, they are n-ary operators, and the planner always assumes that all nested AND and OR expressions are to be flattened. For example, the OR expression id=1 OR id=2 OR id=3 becomes a single three-way OR.
3.3.2 Getting the Cheapest Access Path
  • The planner performs the following operations to choose the cheapest access path:

    • First, create a RelOptInfo structure to store the access paths and their corresponding costs; it is created by the function make_one_rel() and stored in the simple_rel_array field of the PlannerInfo structure.

    • typedef struct RelOptInfo {
          List *baserestrictinfo;  // holds the WHERE-clause contents of the query; see distribute_restrictinfo_to_rels()
          double rows;
          List *indexlist;         // points to the indexes of the relation if any exist; otherwise NULL
          List *pathlist;          // stores all candidate paths (of type Path); see standard_qp_callback()
          struct Path *cheapest_startup_path;
          struct Path *cheapest_total_path;
          struct Path *cheapest_unique_path;
          List       *cheapest_parameterized_paths;
      } RelOptInfo;

      typedef struct Path {
          NodeTag type;
          NodeTag pathtype;
          double rows;
          Cost startup_cost;
          Cost total_cost;     // e.g. estimated by cost_seqscan() for a sequential scan path
          RelOptInfo *parent;
          List *pathkeys;      // represents the sort ordering of the path; if the query has ORDER BY col_name, the column tbl_name.col_name is added to sort_pathkeys as a path key
      } Path;
      // Each Path instance is added to the RelOptInfo structure via the function add_path().
      
      typedef struct PathKey
      {
              NodeTag         type;
      
              EquivalenceClass *pk_eclass;    /* the value that is ordered */
              Oid             pk_opfamily;    /* btree opfamily defining the ordering */
              int             pk_strategy;    /* sort direction (ASC or DESC) */
              bool            pk_nulls_first; /* do NULLs come before normal values? */
      } PathKey;
      
    • Second, estimate the costs of all possible access paths, and add the access paths to the RelOptInfo structure. In detail:

      • ①A path is created, the cost of the sequential scan is estimated and written into the path, and the path is added to the pathlist of the RelOptInfo structure.
      • ②If indexes on the target table exist, index access paths are created, the index scan costs are estimated and written into the paths, and the index paths are added to the pathlist.
      • ③If a bitmap scan is possible, bitmap scan paths are created, the bitmap scan costs are estimated and written into the paths, and the paths are added to the pathlist.
    • Third, get the cheapest access path in the pathlist of the RelOptInfo structure. How is the cheapest path found in a simple query (e.g. SELECT * FROM tbl_1 WHERE id < 300 ORDER BY data;)? If the target table has no indexes, the planner estimates only the sequential scan cost, so the cheapest access path is determined automatically.

      • If an ORDER BY clause exists, PG creates a new RelOptInfo structure to process it, as follows.
      • This new RelOptInfo does not have the baserestrictinfo field set, i.e. the information of the WHERE clause.

      • A SortPath structure is created and added to the pathlist of the new RelOptInfo; the sequential scan path (the cheapest path found above) is then linked to the subpath of the sort path.

      • typedef struct SortPath {
            Path path;      // stores information about the sort operation itself
            Path *subpath;  // the input path, i.e. the cheapest Path found in the previous step; its parent pointer references the first RelOptInfo, the one that carries the WHERE-clause restriction info
        } SortPath;
        
      • In the next stage, creating a plan tree, the planner can create a sequential scan node that contains the WHERE clause as the 'Filter', even though the new RelOptInfo does not have the baserestrictinfo (see Section 3.3.3).

    • In a single-table query with indexes, the planner also creates a Path structure that estimates the cost of a sequential scan.

    • An IndexPath is created for each index. If the indexed column is restricted in the WHERE clause of the query, the content of the WHERE clause is stored in the indexclauses field of that IndexPath.

    • If the cost of an index scan path is lower than that of the sequential scan path or of the index scan paths on other columns, it is placed first in the pathlist; IndexPath instances are also added to the pathlist of the RelOptInfo via the add_path() function.

    • Then, as before, a new RelOptInfo structure is created (though not because of an ORDER BY clause this time), and the cheapest path is added directly to the pathlist of the new RelOptInfo.
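As a toy sketch of the idea (not PostgreSQL's actual add_path() implementation; the classes and cost numbers are illustrative), keeping every candidate access path in a pathlist and then picking the cheapest one can look like:

```python
# Toy model of a RelOptInfo pathlist and cheapest-path selection.
class Path:
    def __init__(self, pathtype, startup_cost, total_cost):
        self.pathtype = pathtype          # e.g. 'SeqScan', 'IndexScan'
        self.startup_cost = startup_cost
        self.total_cost = total_cost

class RelOptInfo:
    def __init__(self):
        self.pathlist = []                # all candidate access paths

    def add_path(self, path):
        self.pathlist.append(path)

    def cheapest_total_path(self):
        # the real planner also tracks the cheapest startup-cost and
        # cheapest parameterized paths separately
        return min(self.pathlist, key=lambda p: p.total_cost)

rel = RelOptInfo()
rel.add_path(Path('SeqScan', 0.0, 145.0))
rel.add_path(Path('IndexScan', 0.29, 13.5))
best = rel.cheapest_total_path()
```

With both candidates added, the index scan wins on total cost and becomes the chosen access path.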

3.3.3 Creating a Plan Tree

A plan tree is composed of various plan nodes. The basic node is named PlanNode, and the other nodes always contain it; for example, SeqScanNode, which is for sequential scanning, is composed of a PlanNode and an integer variable scanrelid. A PlanNode contains 14 fields. The following are 7 representative fields.

  • startup_cost
  • total_cost
  • rows, the number of rows to be scanned, as estimated by the planner.
  • targetlist, which holds the column expressions that follow SELECT, as in the targetList fields of the parse tree and query tree.
  • qual, a list that stores the qualification conditions (e.g. the WHERE-clause predicates shown as 'Filter' in EXPLAIN output).
  • lefttree, the node to which a child node is attached.
  • righttree

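The way plan nodes nest via lefttree can be sketched with a toy model (hypothetical Python classes, not PostgreSQL's C structures): a Sort node pulls rows from its child SeqScan node, whose qual plays the role of the 'Filter' shown in EXPLAIN output.

```python
# Toy model of executor recursion over a two-node plan tree.
class PlanNode:
    def __init__(self, total_cost=0.0, lefttree=None):
        self.total_cost = total_cost
        self.lefttree = lefttree          # child node, as in the C struct

class SeqScan(PlanNode):
    def __init__(self, rows, qual=None):
        super().__init__()
        self.rows, self.qual = rows, qual

    def execute(self):
        # 'qual' filters tuples while scanning, like EXPLAIN's "Filter:"
        return [r for r in self.rows if self.qual is None or self.qual(r)]

class Sort(PlanNode):
    def __init__(self, key, lefttree):
        super().__init__(lefttree=lefttree)
        self.key = key

    def execute(self):
        # pull all rows from the child, then sort them
        return sorted(self.lefttree.execute(), key=self.key)

plan = Sort(key=lambda r: r['id'],
            lefttree=SeqScan(rows=[{'id': 3}, {'id': 1}, {'id': 7}],
                             qual=lambda r: r['id'] < 5))
result = plan.execute()
```

Execution starts at the root and recurses down lefttree, which mirrors how the executor walks the plan tree built by the planner.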
3.5 Join Operations(连接操作)

  • Some kinds of join operations
    • Nested Loop Join
    • Merge Join
    • Hash Join
  • The three join methods supported by PostgreSQL can perform all join operations, not only INNER JOIN but also LEFT/RIGHT OUTER JOIN, FULL OUTER JOIN and so on.
3.5.1 Nested Loop Join(嵌套循环连接)
  • There are five variations of it:
    • Nested Loop Join
    • Materialized Nested Loop Join
    • Indexed Nested Loop Join
    • Other Variations
3.5.1.1 Nested Loop Join
  • HOW TO Nest Loop Join?
  • The nested loop join does not have any start-up operation: $\text{start-up cost} = 0$.
  • The run cost of the nested loop join is proportional to the product of the sizes of the outer and the inner tables ('product' here means multiplication): $\text{run cost} \simeq O(N_{outer} \times N_{inner})$. The formal equation is: $\text{run cost} = (cpu\_operator\_cost + cpu\_tuple\_cost) \times N_{outer} \times N_{inner} + C_{inner} \times N_{outer}$.
    • Here $C_{outer}$ and $C_{inner}$ are the costs of scanning the outer table and the inner table once, respectively.
    • The cost of the nested loop join is always estimated, but this join operation is rarely chosen in practice, because the more efficient variations described below are usually used instead.
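A minimal illustration of the algorithm itself (plain Python lists standing in for tables): the inner table is rescanned once per outer tuple, which is exactly where the $O(N_{outer} \times N_{inner})$ run cost comes from.

```python
# Plain nested loop join: every outer tuple triggers a full inner rescan.
def nested_loop_join(outer, inner, cond):
    result = []
    for o in outer:            # one pass over the outer table
        for i in inner:        # full rescan of the inner table per outer tuple
            if cond(o, i):
                result.append((o, i))
    return result

outer = [(1, 'a'), (2, 'b')]
inner = [(2, 'x'), (3, 'y')]
joined = nested_loop_join(outer, inner, lambda o, i: o[0] == i[0])
```

The double loop performs `len(outer) * len(inner)` condition checks regardless of how many tuples actually match.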
3.5.1.2 Materialized Nested Loop Join(物化嵌套循环连接)
  • The nested loop join described above has to scan all the tuples of the inner table whenever each tuple of the outer table is read.(上面小节的内嵌循环连接每读取一个外表中的元组都会扫描一遍所有内表元组) As scanning the entire inner table for each outer table tuple is a costly process, PostgreSQL supports the materialized nested loop join to reduce the total scanning cost of the inner table.(这是一个十分昂贵的过程,PG支持物化内嵌循环连接,它能减少整个内表扫描的成本。)

  • Before running a nested loop join, the executor writes the inner table tuples to work_mem or a temporary file by scanning the inner table once, using the temporary tuple storage module described below. This has the potential to process the inner table tuples more efficiently than using the buffer manager, especially if all the tuples fit into work_mem.

  • WHAT is TEMPORARY TUPLE STORAGE?? It is an internal module for materializing tables. The module is composed of the functions defined in tuplestore.c, which store and read a sequence of tuples to/from work_mem or temporary files. Whether work_mem or temporary files are used depends on the total size of the tuples to be stored: if they fit in work_mem, it is used; otherwise, temporary files are used.

  • HOW the COST is ESTIMATED?? && HOW the EXECUTOR PROCESSES the PLAN TREE of MATERIALIZED NESTED LOOP JOIN??——

    • The planner estimates the costs of two operations: 'Materialize' and 'Nested Loop'.
    • Executing the EXPLAIN command to a query statement with tables join:(对具有表联接的查询语句执行EXPLAIN命令详情如下:)
    testdb=# EXPLAIN SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id;
                                  QUERY PLAN                               
    -----------------------------------------------------------------------
     Nested Loop  (cost=0.00..750230.50 rows=5000 width=16)
       Join Filter: (a.id = b.id)
       ->  Seq Scan on tbl_a a  (cost=0.00..145.00 rows=10000 width=8)
       ->  Materialize  (cost=0.00..98.00 rows=5000 width=8)
             ->  Seq Scan on tbl_b b  (cost=0.00..73.00 rows=5000 width=8)
    (5 rows)
    
    • Line 7: The executor materializes the inner table tbl_b by sequential scanning (Line 8) (the 'Materialize' node).
    • Line 4: The executor carries out the nested loop join operation; the outer table is tbl_a and the inner one is the materialized tbl_b (the 'Nested Loop' node).
    • The cost estimation of Materialize:
      • $\text{start-up cost} = 0$.
      • $\text{run cost} = 2 \times cpu\_operator\_cost \times N_{inner} = 2 \times 0.0025 \times 5000 = 25.0$ (what the constant '2' stands for is left as an open question in these notes).
      • $\text{total cost} = (\text{start-up cost} + \text{total cost of seq scan}) + \text{run cost} = (0.0 + 73.0) + 25.0 = 98.0$.
    • The cost estimation of Nested Loop:
      • $\text{start-up cost} = 0$.
      • Rescan cost (the cost of rescanning the materialized inner table): $\text{rescan cost} = cpu\_operator\_cost \times N_{inner} = 0.0025 \times 5000 = 12.5$.
      • $\text{run cost} = (cpu\_operator\_cost + cpu\_tuple\_cost) \times N_{inner} \times N_{outer} + \text{rescan cost} \times (N_{outer} - 1) + C_{outer,seqscan}^{total} + C_{materialize}^{total}$,
      • where $C_{outer,seqscan}^{total}$ is the total cost of scanning the outer table and $C_{materialize}^{total}$ is the total cost of materializing the inner table. Substituting the numbers: $\text{run cost} = (0.0025 + 0.01) \times 5000 \times 10000 + 12.5 \times (10000 - 1) + 145.0 + 98.0 = 750230.5$.
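The numbers in the EXPLAIN output above can be reproduced from these formulas, using the default parameter values cpu_operator_cost = 0.0025 and cpu_tuple_cost = 0.01:

```python
# Reproducing the Materialize and Nested Loop costs from the EXPLAIN example.
cpu_operator_cost = 0.0025
cpu_tuple_cost = 0.01
N_inner, N_outer = 5000, 10000
seqscan_inner_total = 73.0     # total cost of Seq Scan on tbl_b
seqscan_outer_total = 145.0    # total cost of Seq Scan on tbl_a

# Materialize node
materialize_run = 2 * cpu_operator_cost * N_inner                  # 25.0
materialize_total = (0.0 + seqscan_inner_total) + materialize_run  # 98.0

# Nested Loop node
rescan_cost = cpu_operator_cost * N_inner                          # 12.5
run_cost = ((cpu_operator_cost + cpu_tuple_cost) * N_inner * N_outer
            + rescan_cost * (N_outer - 1)
            + seqscan_outer_total + materialize_total)             # 750230.5
```

Both results match the `cost=0.00..98.00` and `cost=0.00..750230.50` figures reported by EXPLAIN.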
3.5.1.3 Indexed Nested Loop Join(索引嵌套循环连接)
  • If there is an index on the inner table and this index can look up the tuples satisfying the join condition for each tuple of the outer table, the planner considers using this index to search the inner table tuples directly instead of scanning sequentially. This variation is called the indexed nested loop join; refer to Fig. 3.18. Despite the name indexed 'nested loop join', this algorithm runs as a single loop over the outer table and can therefore perform the join operation efficiently.
  • Executing the EXPLAIN command to a query statement with tables join:(对具有表联接的查询语句执行EXPLAIN命令详情如下:)
testdb=# EXPLAIN SELECT * FROM tbl_c AS c, tbl_b AS b WHERE c.id = b.id;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.29..1935.50 rows=5000 width=16)
   ->  Seq Scan on tbl_b b (cost=0.00..73.00 rows=5000 width=8)
   ->  Index Scan using tbl_c_pkey on tbl_c c  (cost=0.29..0.36 rows=1 width=8)
         Index Cond: (id = b.id)
(4 rows)
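The effect of the inner-table index can be sketched with a dictionary standing in for tbl_c's primary-key index (illustrative Python, not the executor's code): each outer tuple then costs one lookup instead of a full inner-table rescan.

```python
# Indexed nested loop join sketch: a hash map plays the role of the index.
def indexed_nested_loop_join(outer, inner, outer_key, inner_key):
    index = {}                          # stand-in for the B-tree index on inner
    for i in inner:
        index.setdefault(inner_key(i), []).append(i)
    result = []
    for o in outer:                     # single loop over the outer table
        for i in index.get(outer_key(o), []):   # index lookup, no rescan
            result.append((o, i))
    return result

tbl_b = [(1, 10), (2, 20)]
tbl_c = [(2, 'p'), (4, 'q')]
joined = indexed_nested_loop_join(tbl_b, tbl_c,
                                  outer_key=lambda o: o[0],
                                  inner_key=lambda i: i[0])
```

A real B-tree lookup is $O(\log N)$ rather than the dictionary's $O(1)$, but either way the inner table is no longer scanned once per outer tuple.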
3.5.1.4 Other Variations
  • If there is an index on the outer table and its attributes are involved in the join condition, it can be used for index scanning instead of the sequential scan of the outer table.
  • The outer table index can be combined with each of the three nested loop joins above (the plain nested loop join, the materialized nested loop join and the indexed nested loop join). In particular, if there is an index whose attribute can be used as an access predicate in the WHERE clause, the search range of the outer table is narrowed, so the cost of the nested loop join may be drastically reduced.
  • The three combinations of an outer-table index with the nested loop join variations are shown in the book's figures.
3.5.2 Merge Join(合并连接)
  • The merge join plan can only be used in natural joins and equi-joins. WHAT is a Natural Join??? It is a special case of an equi-join, one whose join condition equates the identically named columns of the two tables.
  • The cost of the merge join is estimated by the initial_cost_mergejoin() and final_cost_mergejoin() functions.(合并连接的成本是用这两个函数估算的。)
  • The $\text{start-up cost}$ is the sum of the sorting costs of both the inner and outer tables; thus, it is $O(N_{outer}\log_2(N_{outer}) + N_{inner}\log_2(N_{inner}))$, where $N_{outer}$ and $N_{inner}$ are the numbers of tuples of the outer and inner tables, respectively.
  • The run cost is $O(N_{outer} + N_{inner})$.
3.5.2.1 Merge Join
  • Join procedure: ①sort the outer table and the inner table; ②merge the two sorted tables in a coordinated scan. The conceptual illustration is shown below:
  • If all tuples can be stored in memory, the sorting operations will be able to be carried out in the memory itself; otherwise, temporary files are used.(如果所有元组都可以被读取到内存 ( 指work_mem ) 中,那么排序操作就可以完全在内存中运行,否则,就需要用到临时文件。)
  • Executing the EXPLAIN command to a query statement with tables join:(对具有表联接的查询语句执行EXPLAIN命令详情如下:)
testdb=# EXPLAIN SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id AND b.id < 1000;
                               QUERY PLAN
-------------------------------------------------------------------------
 Merge Join  (cost=944.71..984.71 rows=1000 width=16)
   Merge Cond: (a.id = b.id)
   ->  Sort  (cost=809.39..834.39 rows=10000 width=8)
         Sort Key: a.id
         ->  Seq Scan on tbl_a a  (cost=0.00..145.00 rows=10000 width=8)
   ->  Sort  (cost=135.33..137.83 rows=1000 width=8)
         Sort Key: b.id
         ->  Seq Scan on tbl_b b  (cost=0.00..85.50 rows=1000 width=8)
               Filter: (id < 1000)
(9 rows)
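The two-step procedure (sort, then merge in one coordinated pass) can be sketched as follows (illustrative Python; the real executor additionally uses mark/restore positions to handle duplicate keys across larger groups):

```python
# Merge join sketch: sort both inputs, then advance cursors in one pass.
def merge_join(outer, inner, key):
    outer = sorted(outer, key=key)      # the two sort steps (start-up cost)
    inner = sorted(inner, key=key)
    result, j = [], 0
    for o in outer:
        # skip inner tuples whose keys are smaller than the outer key
        while j < len(inner) and key(inner[j]) < key(o):
            j += 1
        # emit all inner tuples with an equal key, without moving j,
        # so the next outer tuple can rejoin the same group
        k = j
        while k < len(inner) and key(inner[k]) == key(o):
            result.append((o, inner[k]))
            k += 1
    return result

joined = merge_join([(3,), (1,), (2,)], [(2,), (2,), (4,)], key=lambda t: t[0])
```

After the sorts, each input is read once, which is why the run cost is $O(N_{outer} + N_{inner})$.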
3.5.2.2 Materialized Merge Join
  • Same as in the nested loop join, the merge join also supports the materialized merge join to materialize the inner table to make the inner table scan more efficient.(和内嵌循环连接一样,合并连接也支持物化合并连接,来物化内表,让内表扫描过程更加有效)
  • Executing the EXPLAIN command to a query statement with tables join:(对具有表联接的查询语句执行EXPLAIN命令详情如下:)
testdb=# EXPLAIN SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Merge Join  (cost=10466.08..10578.58 rows=5000 width=2064)
   Merge Cond: (a.id = b.id)
   ->  Sort  (cost=6708.39..6733.39 rows=10000 width=1032)
         Sort Key: a.id
         ->  Seq Scan on tbl_a a  (cost=0.00..1529.00 rows=10000 width=1032)
   ->  Materialize  (cost=3757.69..3782.69 rows=5000 width=1032)
         ->  Sort  (cost=3757.69..3770.19 rows=5000 width=1032)
               Sort Key: b.id
               ->  Seq Scan on tbl_b b  (cost=0.00..1193.00 rows=5000 width=1032)
(9 rows)
  • Line 10: The executor sorts the inner table tbl_b using sequential scanning (Line 12).(executor将内表用顺序扫描排序。)
  • Line 9: The executor materializes the result of the sorted tbl_b.(executor物化内表的排序之后的结果到内存,work_mem或者temp_files)
  • Line 6: The executor sorts the outer table tbl_a using sequential scanning (Line 8).(executor将外表用顺序扫描排序。)
  • Line 4: The executor carries out a merge join operation; the outer table is the sorted tbl_a and the inner one is the materialized sorted tbl_b.(Merge Join两个表,一个是物化+排序后内表,另一个是仅排序后的外表。)
3.5.2.3 Other Variations
  • Similar to the nested loop join, the merge join in PostgreSQL also has variations in which an index scan of the outer table is carried out, which can improve the overall join efficiency.

3.5.3 Hash Join(哈希连接)
  • Similar to the merge join, the hash join can only be used in natural joins and equi-joins.
  • The hash join in PostgreSQL behaves differently depending on the sizes of the tables. If the target table is small enough (more precisely, if the size of the inner table is 25% or less of work_mem), it will be a simple two-phase in-memory hash join; otherwise, the hybrid hash join with the skew method is used.
  • In this subsection, the execution of both hash joins in PostgreSQL is described.(本节将介绍两个PG中的哈希连接)
  • Discussion of the cost estimation is omitted because it is complicated. Roughly speaking, the start-up and run costs are both $O(N_{outer} + N_{inner})$, assuming there are no conflicts when searching and inserting into the hash table.
3.5.3.1 In-Memory Hash Join(内存中的哈希连接)
  • In this subsection, the in-memory hash join is described.(内存中的哈希连接)

  • This in-memory hash join is processed in work_mem, and the hash table area is called a batch in PostgreSQL. A batch has hash slots, internally called buckets, and the number of buckets is determined by the ExecChooseHashTableSize() function defined in nodeHash.c; the number of buckets is always $2^n$, where $n$ is an integer.

  • The in-memory hash join has two phases: the build and the probe phases. (内存中的哈希连接有两个阶段,构建和探测)In the build phase, all tuples of the inner table are inserted into a batch(在构建阶段,所有内表的元组被读取放到batch中,); in the probe phase, each tuple of the outer table is compared with the inner tuples in the batch and joined if the join condition is satisfied.(在探索阶段,外表的每一个元组都会和内表的元组进行比较,如果满足连接条件就直接连接)

  • A specific example:

    • testdb=# SELECT * FROM tbl_outer AS outer, tbl_inner AS inner WHERE inner.attr1 = outer.attr2;
      
    • ①Create a batch on work_mem, and the batch has eight buckets.(在work_mem区域里创建一块batch区域,batch区域有八块buckets哈希槽。)

    • ②Insert the first tuple of the inner table into the corresponding bucket of the batch.(将第一个内表元组插入batch区域相应的bucket)

    1. Calculate the hash-key of the first tuple’s attribute which is involved in the join condition. In this example, the hash-key of the attribute ‘attr1’ of the first tuple is calculated using the built-in hash function, because the WHERE clause is inner.attr1 = outer.attr2.(计算第一个元组属性的哈希键,哈希值与连接条件有关。在上面查询语句中,第一元组的属性列attr1的哈希键是由内置哈希函数计算得出的。)
    2. Insert the first tuple into the corresponding bucket. Assume that the hash-key of the first tuple is 0x000…001 by binary notation; that is, the last three bits are ‘001’. In this case, this tuple is inserted into the bucket for which the key is ‘001’.(插入第一个元组到相应的bucket,假设第一个元组哈希键二进制表示是0x000……001,最后三位是001,这种情况下,这个元组被插入相应的bucket的键就是001。)
    • ③Insert the remaining tuples of the inner table.(插入内表剩余所有的元组)
    • ④Probe the first tuple of the outer table.(探测外表的第一个元组)
    1. Calculate the hash-key of the first tuple’s attribute which is involved in the join condition of the outer table. In this example, assume that the hash-key of the first tuple’s attribute ‘attr2’ is 0x000…100; that is, the last three bits are ‘100’.(计算外表与连接相关的属性第一个元组的哈希键。这上面这个查询语句中,假设第一个元组属性列attr2的哈希键是0x000…100,最后三位是100。)
      2. Compare the first tuple of the outer table with the inner tuples in the batch and join tuples if the join condition is satisfied.(比较外表的第一个元组和batch中相对应位置的内表元组,如果连接条件满足的话直接连接。) Because the last three bits of the hash-key of the first tuple are ‘100’, the executor retrieves the tuples belonging to the bucket whose key is ‘100’ and compares both values of the respective attributes of the tables specified by the join condition (defined by the WHERE clause).(因为第一个元组的哈希键的最后三位是“100”,所以执行器检索属于其键为“100”的bucket的元组,并比较联接条件(由WHERE子句定义)指定的表的相应属性的两个值。) If the join condition is satisfied, the first tuple of the outer table and the corresponding tuple of the inner table will be joined; otherwise, the executor does not do anything.(连接条件满足的话就会直接连接,不满足的话,executor不会做任何操作。) In this example, the bucket whose key is ‘100’ has Tuple_C. If the attr1 of Tuple_C is equal to the attr2 of the first tuple (Tuple_W), then Tuple_C and Tuple_W will be joined and saved to memory or a temporary file.(连接完会被存储到内存或者一个临时文件中去)
    • ⑤Probe the remaining tuples of the outer table(依次探测外表剩下的其余元组)
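Steps ①-⑤ can be sketched as follows (illustrative Python; `hash()` stands in for PostgreSQL's hash functions, and the low three bits of the hash key select one of the eight buckets, as in the '001'/'100' examples above):

```python
# In-memory hash join sketch: one batch of eight buckets.
NBUCKETS = 8                      # always a power of two

def build(inner, inner_key):
    batch = [[] for _ in range(NBUCKETS)]
    for t in inner:               # build phase: insert every inner tuple
        batch[hash(inner_key(t)) & (NBUCKETS - 1)].append(t)
    return batch

def probe(batch, outer, outer_key, inner_key):
    result = []
    for o in outer:               # probe phase: one bucket lookup per tuple
        for i in batch[hash(outer_key(o)) & (NBUCKETS - 1)]:
            if inner_key(i) == outer_key(o):   # recheck the join condition
                result.append((o, i))
    return result

batch = build([('x', 1), ('y', 2)], inner_key=lambda t: t[1])
joined = probe(batch, [('w', 2)],
               outer_key=lambda t: t[1], inner_key=lambda t: t[1])
```

Masking with `NBUCKETS - 1` keeps the last $n$ bits of the hash key, which is why the number of buckets must be a power of two.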
3.5.3.2 Hybrid Hash Join with Skew
  • It is an algorithm used when the tuples of the inner table cannot be stored in one batch in work_mem; it is a variation based on the hybrid hash join.

  • WHAT is HYBRID HASH JOIN???

    • In the first build and probe phases, PostgreSQL prepares multiple batches. The number of batches is the same as the number of buckets, determined by the ExecChooseHashTableSize() function; it is always $2^m$, where $m$ is an integer. At this stage, only one batch is allocated in work_mem and the other batches are created as temporary files; the tuples belonging to those batches are written to the corresponding files and saved using the temporary tuple storage feature.
    • The following illustration shows how tuples are stored in four ($=2^2$) batches:

    • The binary form of the batch number is the prefix of the binary hash key of the tuples stored in it.(batch序号的二进制形式是存放在它里面的tuples的二进制hash-key的前缀)
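That prefix rule can be sketched as follows, in the spirit of ExecHashGetBucketAndBatch() in nodeHash.c (the concrete bit widths here are illustrative): the low bits of the hash value select the bucket, and the next bits, the "prefix" above, select the batch.

```python
# Splitting one hash value into bucket number and batch number.
NBUCKETS = 8        # 2^3 buckets per batch
NBATCHES = 4        # 2^2 batches

def bucket_and_batch(hashvalue):
    bucketno = hashvalue & (NBUCKETS - 1)          # low 3 bits
    batchno = (hashvalue >> 3) & (NBATCHES - 1)    # next 2 bits
    return bucketno, batchno

# hash value 0b10_101: bucket '101' (= 5), batch '10' (= 2)
bucketno, batchno = bucket_and_batch(0b10101)
```

Because both counts are powers of two, bucket and batch can be extracted with cheap mask-and-shift operations instead of division.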
  • In the hybrid hash join, the build and probe phases are performed as many times as there are batches, because the inner and outer tables are each divided into the same number of batches. In the first round of the build and probe phases, every batch is created and, in addition, the first batches of both the inner and the outer tables are processed. The processing of the second and subsequent rounds, however, requires writing to and reloading from temporary files, so these are costly processes. Therefore, PostgreSQL also prepares a special batch, called skew, to process many tuples more efficiently in the first round.

  • skew batch: The skew batch stores the inner table tuples that will be joined with the outer table tuples whose values of the join-condition attribute are MCVs (Most Common Values) of the outer table. Because this explanation is not easy to understand, it is explained below using a specific example.

    • Table Customers(customer_name, customer_address): 10,000 rows.
    • Table Purchase_history(p_customer_name, purchase_item): 1,000,000 rows.
  • Assumption: the top 10% customers have purchased 70% of all items. Under this assumptions, let us consider how the hybrid hash join with skew performs in the first round when the query shown below is executed:(假设:前10%的客户购买了所有商品的70%。在这种假设下,让我们考虑一下当执行下面显示的查询时,带有偏斜的混合散列联接在第一轮中的执行情况:)

    testdb=# SELECT * FROM customers as c, purchase_history as h WHERE c.name = h.customer_name;
    
    • If the customers table is the inner table and purchase_history is the outer table, the tuples of the top 10% of customers are stored in the skew batch, identified using the MCV values of the purchase_history table.
    • Note that the outer table's MCV values are referenced, but it is inner table tuples (here, customers tuples) that are inserted into the skew batch. In the probe phase of the first round, 70% of the tuples of the outer table (purchase_history) will be joined with the tuples stored in the skew batch.
    • This way, the more non-uniform the distribution of the outer table, the more outer table tuples can be processed in the first round.
  • In the following, the working of the hybrid hash join with skew is shown.(下文中显示了带有偏斜的混合散列连接的步骤:)

    • ①Create a batch and a skew batch on work_mem.(在work_mem中创建一个batch和一个skew batch)

    • ②Create temporary batch files for storing the inner table tuples.(创建一个临时batch文件,用于存储内表的元组)

    • In this example, three batch files are created because the inner table will be divided into four batches (one of which stays in work_mem).

    • ③Perform the build operation for the first tuple of the inner table. The detail are described below:(为内表的第一个元组执行build操作,细节如下:)

      • If the first tuple should be inserted into the skew batch, do it; otherwise, proceed to 2. In the example explained above, if the first tuple is one of the top 10% customers, it is inserted into the skew batch.(如果第一个元组要被插入到skew batch,那么执行插入操作,否则直接跳过这一步。即在上述的例子中,如果第一个顾客元组属于前10%,那么这个元组就会被插入到skew batch中去)
      • Calculate the hash-key of the first tuple and then insert the corresponding batch.(计算第一个内表元组的哈希键,然后根据哈希键(就是0x……这一串地址)将它插入相应的batch中去)
    • ④Perform the build operation for the remaining tuples of the inner table.(对内表剩余的所有元组执行以上build操作)

    • ⑤Create temporary batch files for storing the outer table tuples.(创建临时batch文件用于存储外表元组)

    • ⑥If the join-attribute value of the first outer tuple is an MCV, perform a probe operation against the skew batch; otherwise, proceed to ⑦. In the example above, if the first tuple is purchase data of one of the top 10% of customers, it is compared with the tuples in the skew batch.

    • ⑦Perform the probe operation of the first tuple.(对外表第一个元组执行probe操作,详细操作如下:)

    • Depending on the hash-key value of the first tuple, the following process is performed:(根据外表第一个元组的hash-key,执行以下操作)

      • If the first tuple belongs to Batch_0, perform the probe operation.(如果外表第一元组属于Batch_0,对它执行probe)
      • Otherwise, insert into the corresponding batch.(否则,插入相应的batch)
    • ⑧Perform the probe operation from the remaining tuples of the outer table. Note that, in the example, 70% of the tuples of the outer table have been processed by the skew in the first round without writing and reading to/from temporary files.(对外表其余剩余的元组依次执行以上操作。注意,在上述例子中,外部表的70%的元组已经在第一轮中被偏斜处理,而没有向临时文件写读,临时文件就是上面图中的batch_x_out,x非0)

    • ⑨Remove the skew batch and clear Batch_0 to prepare the second round.(移除偏斜批次并清除batch_0以准备第二轮)

    • ⑩Perform the build operation from the batch file batch_1_in.(对batch文件batch_1_in执行build操作)Perform the probe operation for tuples which are stored in the batch file batch_1_out.(对存储在batch_1_out中外表元组进行probe操作)

3.5.3.3 Index Scans in Hash Join

Hash join in PostgreSQL uses index scans if possible.(PG中的哈希连接尽可能的使用索引扫描)

3.5.4 Join Access Paths and Join Nodes(连接访问路径和连接节点)
3.5.4.1 Join Access Paths
  • An access path of the nested loop join is the JoinPath structure, and the other join access paths, MergePath and HashPath, are based on it.
typedef struct JoinPath
{
	Path	   path;			/* the common Path fields */
	JoinType   jointype;
	Path	   *outerjoinpath;	/* path for the outer side of the join */
	Path	   *innerjoinpath;	/* path for the inner side of the join */
	List	   *joinrestrictinfo;	/* RestrictInfos to apply to join */
	/*
	 * See the notes for RelOptInfo and ParamPathInfo to understand why
	 * joinrestrictinfo is needed in JoinPath, and can't be merged into the
	 * parent RelOptInfo.
	 */
} JoinPath;
  • All join access paths are illustrated in the following without explanation.

3.5.4.2 Join Nodes

3.6 Creating the Plan Tree of Multiple-Tables Query(创建多表查询的计划树)

3.6.1 Preprocessing
  • The subquery_planner() function defined in planner.c invokes preprocessing. The preprocessing for single-table queries has already been described in Section 3.3.1. In this subsection, the preprocessing for multiple-table queries is described; although there are many steps, only some of them are covered here.

    • ①Planning and Converting CTE—— If there are WITH lists, the planner processes each WITH query by the SS_process_ctes() function.(计划和转换CTE)
    • ②Pulling Sub-queries Up: If the FROM clause has a sub-query that does not have GROUP BY, HAVING, ORDER BY, LIMIT or DISTINCT clauses, and also does not use INTERSECT or EXCEPT, the planner converts it to a join form using the pull_up_subqueries() function. For example, the query shown below, which contains a sub-query in its FROM clause, can be converted to a natural join query. Needless to say, this conversion is done on the query tree.
    testdb=# SELECT * FROM tbl_a AS a, (SELECT * FROM tbl_b) as b WHERE a.id = b.id;
    	 	       	           ↓
    
    testdb=# SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id;
    
    • ③Transforming an Outer Join to an Inner Join—— The planner transforms an outer join query to an inner join query if possible.(将外连接转换成内连接——如果可以的话,计划器会将外连接转换成内连接)
3.6.2 Getting the Cheapest path
  • To get the optimal plan tree, the planner has to consider all combinations of indexes and join methods. This is a very expensive process, and it becomes infeasible if the number of tables exceeds a certain level, because of combinatorial explosion. Fortunately, if the number of tables is smaller than around 12, the planner can obtain the optimal plan by applying dynamic programming; otherwise, the planner uses a genetic algorithm, described below.

  • Genetic Query Optimizer: When a query joining many tables is executed, a huge amount of time would be needed to optimize the query plan. To deal with this situation, PostgreSQL implements an interesting feature: the Genetic Query Optimizer. This is a kind of approximate algorithm for determining a reasonable plan within a reasonable time. Hence, in the query optimization stage, if the number of joined tables is higher than the threshold specified by the parameter geqo_threshold (the default is 12), PostgreSQL generates the query plan using the genetic algorithm.

  • Determination of the optimal plan tree by dynamic programming can be explained by the following steps:(通过以下步骤解释怎样通过动态规划决定最优计划树:)

    • Level = 1

    • Get the cheapest path of each single table; the cheapest path is stored in the respective RelOptInfo.

    • Level = 2

    • Get the cheapest path for each combination that selects two of all the tables.

    • For example, if there are two tables, A and B, get the cheapest join path of tables A and B; this is the final answer. In the following, the RelOptInfo of the pair of tables A and B is represented by {A, B}.

    • If there are three tables, get the cheapest path for each of {A, B}, {A, C} and {B, C}.

    • Level = 3 and higher

    • The same processing is continued until the level that equals the number of tables is reached.(Level = 3或者更多的时候,继续进行相同的处理,直到达到等于表数的级别为止。)

  • In this way, the cheapest paths for the partial problems are obtained at each level and are used in the calculations at the next higher level. This makes it possible to compute the cheapest plan tree efficiently.
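The level-by-level procedure can be sketched as follows (illustrative Python with made-up costs, not the planner's cost formulas): the cheapest cost of each subset of tables is computed from the cheapest costs of its smaller sub-subsets.

```python
# Dynamic-programming join ordering sketch over three tables.
from itertools import combinations

tables = ('A', 'B', 'C')
scan_cost = {'A': 10.0, 'B': 20.0, 'C': 30.0}   # made-up single-table costs

def join_cost(cost_left, cost_right):
    # stand-in for estimating nested loop / merge / hash join paths
    return cost_left + cost_right + 5.0

# Level 1: cheapest path for each single table
cheapest = {frozenset([t]): scan_cost[t] for t in tables}

# Levels 2..N: build each subset from all ways of splitting it in two
for level in range(2, len(tables) + 1):
    for subset in combinations(tables, level):
        s = frozenset(subset)
        cheapest[s] = min(
            join_cost(cheapest[frozenset(left)], cheapest[s - frozenset(left)])
            for size in range(1, level)
            for left in combinations(subset, size))

total = cheapest[frozenset(tables)]
```

Here {A, B} costs 35.0 at Level 2, and every split of {A, B, C} at Level 3 reuses a Level 1 or Level 2 result instead of re-deriving it.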

  • Using a concrete example, the detailed processing at each level is as follows:

testdb=# \d tbl_a
     Table "public.tbl_a"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | not null
 data   | integer | 
Indexes:
    "tbl_a_pkey" PRIMARY KEY, btree (id)
    
# Note: the id column of tbl_a has an index, and the WHERE clause of the query below involves tbl_a.id, so the cost estimation mainly concerns index scans.

testdb=# \d tbl_b
     Table "public.tbl_b"
 Column |  Type   | Modifiers 
--------+---------+-----------
 id     | integer | 
 data   | integer | 

testdb=# SELECT * FROM tbl_a AS a, tbl_b AS b WHERE a.id = b.id AND b.data < 400;
3.6.2.1 Processing in Level 1
  • In Level 1, the planner creates a RelOptInfo structure and estimates the cheapest cost for each relation in the query. The RelOptInfo structures are added to the simple_rel_array of the PlannerInfo of this query.
The PlannerInfo and RelOptInfo after processing in Level 1.
  • The RelOptInfo of tbl_a has three access paths, which are added to the pathlist of the RelOptInfo, and they are linked to three cheapest cost paths, that is, the cheapest start-up (cost) path, the cheapest total (cost) path, and the cheapest parameterized (cost) path.(上图为Level = 1处理之后,各个数据结构的情况,tbl_aRelOptInfo结构体有三条访问路径,都被添加到了其RelOptInfo结构体的pathlist变量后面,它们分别被链接到三条最划算的成本路径指针上,分别是最划算的启动成本路径最划算的总成本路径最划算的参数化成本路径) As the cheapest start-up and total cost paths are obvious, the cost of the cheapest parameterized index scan path will be described.(最便宜的启动和总成本路径是显而易见的,下面将介绍最便宜的参数化索引扫描路径的成本)
  • As described in Section 3.5.1.3, the planner considers using a parameterized path for the indexed nested loop join (and, rarely, for the indexed merge join with an outer index scan). The cheapest parameterized cost is the cheapest cost among the estimated parameterized paths.
  • The RelOptInfo of tbl_b has only a sequential scan access path, because tbl_b has no related index.
3.6.2.2 Processing in Level 2
  • In Level 2, a RelOptInfo structure is created and added to the join_rel_list of the PlannerInfo. Then the costs of all possible join paths are estimated, and the best access path, i.e. the one whose total cost is the cheapest, is selected. The RelOptInfo stores this best access path as the cheapest total cost path. Refer to Fig. 3.33.

  • The following table shows all combinations of join access paths in this example. The query is an equi-join, so all three join methods (nested loop, merge, and hash) are estimated. For convenience, some notation for access paths is introduced:
    • SeqScanPath(table): the sequential scan path of table.
    • Materialized->SeqScanPath(table): the materialized sequential scan path of table.
    • IndexScanPath(table, attribute): the index scan path on the given attribute of table.
    • ParameterizedIndexScanPath(table, attribute1, attribute2): the index scan path on attribute1 of table, parameterized by attribute2 of the outer table.
**Nested Loop Join**

| # | Outer Path | Inner Path | Note |
|---|------------|------------|------|
| 1 | SeqScanPath(tbl_a) | SeqScanPath(tbl_b) | |
| 2 | SeqScanPath(tbl_a) | Materialized->SeqScanPath(tbl_b) | Materialized nested loop join |
| 3 | IndexScanPath(tbl_a,id) | SeqScanPath(tbl_b) | Nested loop join with outer index scan |
| 4 | IndexScanPath(tbl_a,id) | Materialized->SeqScanPath(tbl_b) | Materialized nested loop join with outer index scan |
| 5 | SeqScanPath(tbl_b) | SeqScanPath(tbl_a) | |
| 6 | SeqScanPath(tbl_b) | Materialized->SeqScanPath(tbl_a) | Materialized nested loop join |
| 7 | SeqScanPath(tbl_b) | ParameterizedIndexScanPath(tbl_a, id, tbl_b.id) | Indexed nested loop join |

**Merge Join**

| # | Outer Path | Inner Path | Note |
|---|------------|------------|------|
| 1 | SeqScanPath(tbl_a) | SeqScanPath(tbl_b) | |
| 2 | IndexScanPath(tbl_a,id) | SeqScanPath(tbl_b) | Merge join with outer index scan |
| 3 | SeqScanPath(tbl_b) | SeqScanPath(tbl_a) | |

**Hash Join**

| # | Outer Path | Inner Path | Note |
|---|------------|------------|------|
| 1 | SeqScanPath(tbl_a) | SeqScanPath(tbl_b) | |
| 2 | SeqScanPath(tbl_b) | SeqScanPath(tbl_a) | |
  • The planner finally selects the cheapest access path from the estimated join paths, and the cheapest path is added to the pathlist of the RelOptInfo {tbl_a,tbl_b}. Refer to Fig. 3.33.
  • In this example, as can be confirmed with EXPLAIN, the planner selects the hash join whose inner and outer tables are tbl_b and tbl_a, respectively.