TBase Source Code (6)

【Executor: executing optimizable statements】

Optimizable statements are compiled by the query compiler into query plan trees, and statements of this kind are handled by the executor (Executor). The module exposes three external interfaces: ExecutorStart, ExecutorRun and ExecutorEnd. Their input is the QueryDesc structure that wraps the query plan tree, and their output is the corresponding execution information or result data. To execute a plan tree, it is enough to build a QueryDesc containing it and call ExecutorStart, ExecutorRun and ExecutorEnd in sequence; a minimal C sketch of this calling sequence is given after the control-flow outline below.

Query Processing Control Flow

    CreateQueryDesc

    ExecutorStart
        CreateExecutorState
            creates per-query context
        switch to per-query context to run ExecInitNode
        AfterTriggerBeginQuery
        ExecInitNode --- recursively scans plan tree
            ExecInitNode
                recurse into subsidiary nodes
            CreateExprContext
                creates per-tuple context
            ExecInitExpr

    ExecutorRun
        ExecProcNode --- recursively called in per-query context
            ExecEvalExpr --- called in per-tuple context
            ResetExprContext --- to free memory

    ExecutorFinish
        ExecPostprocessPlan --- run any unfinished ModifyTable nodes
        AfterTriggerEndQuery

    ExecutorEnd
        ExecEndNode --- recursively releases resources
        FreeExecutorState
            frees per-query context and child contexts

    FreeQueryDesc
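
As a concrete illustration of that three-call contract, the following is a minimal sketch of how a caller could drive an already-planned statement. It is modeled loosely on ProcessQuery in tcop/pquery.c of stock PostgreSQL 10, on which TBase is based; run_planned_stmt is a made-up name, and plannedstmt, sourceText, params and dest are assumed to be supplied by the caller:

#include "postgres.h"
#include "executor/execdesc.h"
#include "executor/executor.h"
#include "utils/snapmgr.h"

static void
run_planned_stmt(PlannedStmt *plannedstmt, const char *sourceText,
                 ParamListInfo params, DestReceiver *dest)
{
    QueryDesc *queryDesc;

    /* Package the plan tree, snapshot and output target into a QueryDesc */
    queryDesc = CreateQueryDesc(plannedstmt, sourceText,
                                GetActiveSnapshot(), InvalidSnapshot,
                                dest, params, NULL, 0);

    /* Build estate/planstate and recursively ExecInitNode the plan tree */
    ExecutorStart(queryDesc, 0);

    /* Run to completion: count = 0 means "no limit" */
    ExecutorRun(queryDesc, ForwardScanDirection, 0L, true);

    /* Finish ModifyTable processing and fire deferred AFTER triggers */
    ExecutorFinish(queryDesc);

    /* ExecEndNode the tree and free the per-query context */
    ExecutorEnd(queryDesc);

    FreeQueryDesc(queryDesc);
}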
 

It is not important that ExecEndNode free any memory; it will all go away in FreeExecutorState anyway. However, we do need to be careful to close relations and drop buffer pins, so we do need to scan the planstate tree to find those sorts of resources.

The executor can also evaluate simple expressions without any plan tree ("simple" meaning "containing no aggregates and no sub-selects", though such may be hidden inside function calls). This case has a control flow like the following (a C sketch of the same pattern follows the listing):

    CreateExecutorState
        creates per-query context

    CreateExprContext    -- or use GetPerTupleExprContext(estate)
        creates per-tuple context

    ExecPrepareExpr
        temporarily switch to per-query context
        run the expression through expression_planner
        ExecInitExpr

    Repeatedly do:
        ExecEvalExprSwitchContext
            ExecEvalExpr --- called in per-tuple context
        ResetExprContext --- to free memory

    FreeExecutorState
        frees per-query context, as well as ExprContext
        (a separate FreeExprContext call is not necessary)
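
A minimal sketch of that pattern, assuming expr is an already-parsed, Boolean-typed "simple" expression and that any slots or parameters it references are set up by the caller (eval_bool_expr is a made-up name):

#include "postgres.h"
#include "executor/executor.h"

static bool
eval_bool_expr(Expr *expr)
{
    EState      *estate = CreateExecutorState();             /* per-query context */
    ExprContext *econtext = GetPerTupleExprContext(estate);  /* per-tuple context */
    ExprState   *exprstate;
    Datum        value;
    bool         isnull;
    bool         result;

    /* Runs the expression through expression_planner() and ExecInitExpr() */
    exprstate = ExecPrepareExpr(expr, estate);

    /* Evaluate in the per-tuple context */
    value = ExecEvalExprSwitchContext(exprstate, econtext, &isnull);
    result = !isnull && DatumGetBool(value);

    /* Free per-tuple memory; repeat the two steps above to evaluate again */
    ResetExprContext(econtext);

    /* Frees the per-query context and the ExprContext along with it */
    FreeExecutorState(estate);
    return result;
}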
 

EvalPlanQual (READ COMMITTED Update Checking) 

For a simple SELECT, the executor only needs to pay attention to tuples that are valid according to the snapshot seen by the current transaction (i.e., they were inserted by a previously committed transaction and not deleted by any previously committed transaction).

However, for UPDATE and DELETE it is not acceptable to modify or delete a tuple that has already been modified by an open or concurrently committed transaction.

If we are running under the SERIALIZABLE isolation level, we simply raise an error when this condition is seen to occur. Under READ COMMITTED isolation we have to work harder.
 

The basic idea in READ COMMITTED mode is to take the modified tuple committed by the concurrent transaction (after waiting for that transaction to commit, if necessary) and re-evaluate the query qualifications to see whether it would still pass them.

If so, we regenerate the updated tuple (if we are doing an UPDATE) from the modified tuple, and finally update/delete that modified tuple.

SELECT FOR UPDATE/SHARE behaves similarly, except that its action is simply to lock the modified tuple and return results based on that version of the tuple.

To implement this check, we actually re-run the query from scratch for each modified tuple (or set of tuples, for SELECT FOR UPDATE), with the relation scan nodes tweaked to return only the current tuples --- either the original ones, or the updated (and now locked) versions of the modified tuples.

If this re-run query returns a tuple, the modified tuple passes the quals (and, if we are doing an UPDATE, the query output is the suitably re-computed update tuple).

If no tuple is returned, the modified tuple fails the quals, so we ignore the current result tuple and continue with the original query.

In UPDATE/DELETE, only the target relation needs to be handled this way. In SELECT FOR UPDATE, there may be multiple relations flagged FOR UPDATE, so we obtain a lock on the current tuple version in each such relation before executing the recheck.

It is also possible for the query to contain relations that are not to be locked (they are neither the UPDATE/DELETE target nor specified to be locked in SELECT FOR UPDATE/SHARE). When re-running the test query we want to use the same rows from those relations that were joined to the locked rows. For ordinary relations this can be implemented relatively cheaply by including the row TID in the join outputs and re-fetching that TID. (The re-fetch is expensive, but we are optimizing for the normal case where no re-test is needed.) We must also consider non-table relations, such as ValuesScan and FunctionScan; for these there is no equivalent of TID, so the only practical solution appears to be to include the entire row value in the join output row.

Set-returning functions are not allowed in the target list of SELECT FOR UPDATE, so that at most one tuple can be returned for any particular set of scan tuples; otherwise we would get duplicates because the original query returns the same set of scan tuples multiple times. Likewise, SRFs are disallowed in the target list of UPDATE, where they would have the effect of updating the same row more than once, which is not very useful --- updates after the first would have no effect anyway.

Asynchronous Execution

To handle a node that is waiting on an event external to the database system, such as a ForeignScan awaiting network I/O, it is desirable for the node to indicate that it cannot return any tuple immediately but may be able to do so at a later time. A process that discovers this situation can always handle it simply by blocking, but that may waste time which could be spent executing some other part of the plan tree where progress can be made immediately; this is particularly likely when the plan tree contains an Append node. Asynchronous execution runs multiple parts of an Append node concurrently rather than serially to improve performance.

For asynchronous execution, the Append node must first request a tuple from an async-capable child node using ExecAsyncRequest. Next, it must run the asynchronous event loop using ExecAppendAsyncEventWait. Eventually, when a child node to which an asynchronous request was made produces a tuple, the Append node receives it from the event loop via ExecAsyncResponse. In the current implementation of asynchronous execution, the only node type that requests tuples from an async-capable child node is Append, and the only node type that can be async-capable is ForeignScan.

Typically, ExecAsyncResponse is the only callback needed by a node that wants to request tuples asynchronously. Async-capable nodes, on the other hand, generally need to implement three methods (a schematic skeleton follows this list):

1. When an asynchronous request is made, the node's ExecAsyncRequest callback is invoked; it should use ExecAsyncRequestPending to indicate that the request is pending and will be completed by one of the callbacks described below. Alternatively, if a result is available immediately, it can use ExecAsyncRequestDone instead.
2. When the event loop wants to wait or poll for file descriptor events, the node's ExecAsyncConfigureWait callback is invoked to configure the file descriptor event(s) the node wants to wait for.
3. When the file descriptor becomes ready, the node's ExecAsyncNotify callback is invoked; like #1, it should use ExecAsyncRequestPending to request another callback or ExecAsyncRequestDone to return a result immediately.
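
A schematic skeleton of those three callbacks as an async-capable ForeignScan FDW might provide them. This follows the upstream PostgreSQL asynchronous-append API (AsyncRequest, ExecAsyncRequestDone, ExecAsyncRequestPending, and the Append node's as_eventset wait-event set), which is newer than the PostgreSQL 10 base that TBase builds on; the mock_* helpers are hypothetical placeholders for the FDW's own connection handling, not real functions:

#include "postgres.h"
#include "executor/execAsync.h"
#include "nodes/execnodes.h"
#include "storage/latch.h"

/* Hypothetical FDW-specific helpers, shown only to mark where real work goes */
extern TupleTableSlot *mock_try_fetch_buffered(PlanState *ps);
extern TupleTableSlot *mock_fetch_from_connection(PlanState *ps);
extern pgsocket mock_connection_socket(PlanState *ps);

/* 1. Invoked via ExecAsyncRequest when the Append node asks for a tuple */
static void
mockForeignAsyncRequest(AsyncRequest *areq)
{
    TupleTableSlot *slot = mock_try_fetch_buffered(areq->requestee);

    if (slot != NULL)
        ExecAsyncRequestDone(areq, slot);   /* a result is available right away */
    else
        ExecAsyncRequestPending(areq);      /* ask to be called back later */
}

/* 2. Invoked from ExecAppendAsyncEventWait to register the fd to wait on */
static void
mockForeignAsyncConfigureWait(AsyncRequest *areq)
{
    AppendState *requestor = (AppendState *) areq->requestor;

    AddWaitEventToSet(requestor->as_eventset, WL_SOCKET_READABLE,
                      mock_connection_socket(areq->requestee), NULL, areq);
}

/* 3. Invoked when the file descriptor becomes ready */
static void
mockForeignAsyncNotify(AsyncRequest *areq)
{
    TupleTableSlot *slot = mock_fetch_from_connection(areq->requestee);

    if (slot != NULL)
        ExecAsyncRequestDone(areq, slot);
    else
        ExecAsyncRequestPending(areq);
}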

/* ----------------------------------------------------------------
 *        ExecutorStart
 *
 *        This routine must be called at the beginning of any execution of any
 *        query plan
 *
 * Takes a QueryDesc previously created by CreateQueryDesc (which is separate
 * only because some places use QueryDescs for utility commands).  The tupDesc
 * field of the QueryDesc is filled in to describe the tuples that will be
 * returned, and the internal fields (estate and planstate) are set up.
 *
 * eflags contains flag bits as described in executor.h.
 *
 * NB: the CurrentMemoryContext when this is called will become the parent
 * of the per-query context used for this Executor invocation.
 *
 * We provide a function hook variable that lets loadable plugins
 * get control when ExecutorStart is called.  Such a plugin would
 * normally call standard_ExecutorStart().
 *
 * ----------------------------------------------------------------
 */
void
ExecutorStart(QueryDesc *queryDesc, int eflags)
{
    if (ExecutorStart_hook)
        (*ExecutorStart_hook) (queryDesc, eflags);
    else
        standard_ExecutorStart(queryDesc, eflags);
}

 
/* ----------------------------------------------------------------
 *        ExecutorRun
 *
 *        This is the main routine of the executor module. It accepts
 *        the query descriptor from the traffic cop and executes the
 *        query plan.
 *
 *        ExecutorStart must have been called already.
 *
 *        If direction is NoMovementScanDirection then nothing is done
 *        except to start up/shut down the destination.  Otherwise,
 *        we retrieve up to 'count' tuples in the specified direction.
 *
 *        Note: count = 0 is interpreted as no portal limit, i.e., run to
 *        completion.  Also note that the count limit is only applied to
 *        retrieved tuples, not for instance to those inserted/updated/deleted
 *        by a ModifyTable plan node.
 *
 *        There is no return value, but output tuples (if any) are sent to
 *        the destination receiver specified in the QueryDesc; and the number
 *        of tuples processed at the top level can be found in
 *        estate->es_processed.
 *
 *        We provide a function hook variable that lets loadable plugins
 *        get control when ExecutorRun is called.  Such a plugin would
 *        normally call standard_ExecutorRun().
 *
 * ----------------------------------------------------------------
 */
void
ExecutorRun(QueryDesc *queryDesc,
            ScanDirection direction, uint64 count,
            bool execute_once)
{
    if (ExecutorRun_hook)
        (*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
    else
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
}

/* ----------------------------------------------------------------
 *        ExecutorEnd
 *
 *        This routine must be called at the end of execution of any
 *        query plan
 *
 *        We provide a function hook variable that lets loadable plugins
 *        get control when ExecutorEnd is called.  Such a plugin would
 *        normally call standard_ExecutorEnd().
 *
 * ----------------------------------------------------------------
 */
void
ExecutorEnd(QueryDesc *queryDesc)
{
    if (ExecutorEnd_hook)
        (*ExecutorEnd_hook) (queryDesc);
    else
        standard_ExecutorEnd(queryDesc);
}

Example of the node relationships in an execution plan:

     gather
       ^
       |
     limit
       ^
       |
      sort
       ^
       |
      join
       ^^
      /  \
   join    scan
    ^^
   /  \
scan    scan

The upper-level code initializes, runs and cleans up the nodes uniformly through the three interface functions ExecInitNode, ExecProcNode and ExecEndNode; each of these dispatches to the corresponding initialization, execution or cleanup routine according to the actual type of the node being processed.

During executor initialization, ExecutorStart builds the executor global state (estate) and the per-plan-node execution state (planstate) from the query plan tree. While the plan tree is being executed, the executor records each node's execution state and data in planstate and uses the es_tupleTable field of the global state to pass result tuples between nodes. The cleanup function ExecutorEnd releases both the executor global state and the per-node execution state.

PostgreSQL defines one state node type for every kind of plan node. All state nodes inherit from PlanState, which contains a pointer to the associated plan node (plan), a pointer to the executor global state (state), projection information (targetlist), selection conditions (qual), and pointers to the left and right child state nodes (lefttree, righttree).
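
As a small illustration of that shape, here is a sketch of a recursive walk over a planstate tree using only the common lefttree/righttree links (count_planstate_nodes is a made-up helper; real executor code such as planstate_tree_walker must also visit node-specific children, e.g. the subplans of an Append):

#include "postgres.h"
#include "nodes/execnodes.h"

static int
count_planstate_nodes(PlanState *ps)
{
    if (ps == NULL)
        return 0;

    /* one for this node plus everything below its two standard children */
    return 1 + count_planstate_nodes(ps->lefttree) +
               count_planstate_nodes(ps->righttree);
}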

Three interface functions are provided for invoking the executor: ExecutorStart, ExecutorRun and ExecutorEnd. When the executor is needed to process a query plan, calling these three functions in sequence completes the whole execution process.

ExecutorStart => standard_ExecutorStart => InitPlan => ExecInitNode

【ExecInitNode】

/* ------------------------------------------------------------------------
 *        ExecInitNode
 *
 *        Recursively initializes all the nodes in the plan tree rooted
 *        at 'node'.
 *
 *        Inputs:
 *          'node' is the current node of the plan produced by the query planner
 *          'estate' is the shared execution state for the plan tree
 *          'eflags' is a bitwise OR of flag bits described in executor.h
 *
 *        Returns a PlanState node corresponding to the given Plan node.
 * ------------------------------------------------------------------------
 */
PlanState *
ExecInitNode(Plan *node, EState *estate, int eflags)
{
......

#ifdef PGXC
        case T_RemoteQuery:
            result = (PlanState *) ExecInitRemoteQuery((RemoteQuery *) node,
                                                        estate, eflags);
            break;
#endif
#ifdef XCP
        case T_RemoteSubplan:
            result = (PlanState *) ExecInitRemoteSubplan((RemoteSubplan *) node,
                                                         estate, eflags);
......
}

  

Execution of the subplan on distributed nodes is mainly set up through the ExecInitRemoteSubplan function.

src/backend/executor/execProcNode.c

RemoteSubplanState *
ExecInitRemoteSubplan(RemoteSubplan *node, EState *estate, int eflags)

{// #lizard forgives
    RemoteStmt            rstmt;
    RemoteSubplanState *remotestate;
    ResponseCombiner   *combiner;
    CombineType            combineType;
    struct rusage        start_r;
    struct timeval        start_t;
#ifdef _MIGRATE_
    Oid                 groupid = InvalidOid;
    Oid                    reloid  = InvalidOid;
    ListCell *table;
#endif

    if (log_remotesubplan_stats)
        ResetUsageCommon(&start_r, &start_t);

#ifdef __AUDIT__
    memset((void *)(&rstmt), 0, sizeof(RemoteStmt));
#endif

    remotestate = makeNode(RemoteSubplanState);
    combiner = (ResponseCombiner *) remotestate;
    /*
     * We do not need to combine row counts if we will receive intermediate
     * results or if we won't return row count.
     */
    if (IS_PGXC_DATANODE || estate->es_plannedstmt->commandType == CMD_SELECT)
    {
        combineType = COMBINE_TYPE_NONE;
        remotestate->execOnAll = node->execOnAll;
    }
    else
    {
        if (node->execOnAll)
            combineType = COMBINE_TYPE_SUM;
        else
            combineType = COMBINE_TYPE_SAME;
        /*
         * If we are updating replicated table we should run plan on all nodes.
         * We are choosing single node only to read
         */
        remotestate->execOnAll = true;
    }
    remotestate->execNodes = list_copy(node->nodeList);
    InitResponseCombiner(combiner, 0, combineType);
    combiner->ss.ps.plan = (Plan *) node;
    combiner->ss.ps.state = estate;
    combiner->ss.ps.ExecProcNode = ExecRemoteSubplan;
#ifdef __TBASE__
    if (estate->es_instrument)
    {
        HASHCTL        ctl;
        
        ctl.keysize = sizeof(RemoteInstrKey);
        ctl.entrysize = sizeof(RemoteInstr);
        
        combiner->recv_instr_htbl = hash_create("Remote Instrument", 8 * NumDataNodes,
                                                &ctl, HASH_ELEM | HASH_BLOBS);
    }
    combiner->remote_parallel_estimated = false;
#endif
    combiner->ss.ps.qual = NULL;

    combiner->request_type = REQUEST_TYPE_QUERY;

    ExecInitResultTupleSlot(estate, &combiner->ss.ps);
    ExecAssignResultTypeFromTL((PlanState *) remotestate);

#ifdef __TBASE__
    if (IS_PGXC_COORDINATOR && !g_set_global_snapshot)
    {
        if (!need_global_snapshot)
        {
            int node_count = list_length(node->nodeList);
            
            if (node_count > 1)
            {
                need_global_snapshot = true;
            }
            else if (node_count == 1)
            {
                MemoryContext old = MemoryContextSwitchTo(TopTransactionContext);
                executed_node_list = list_append_unique_int(executed_node_list, linitial_int(node->nodeList));
                MemoryContextSwitchTo(old);

                if (list_length(executed_node_list) > 1)
                {
                    need_global_snapshot = true;
                }
            }
        }
    }
#endif

    /*
     * We optimize execution if we going to send down query to next level
     */
    remotestate->local_exec = false;
    if (IS_PGXC_DATANODE)
    {
        if (remotestate->execNodes == NIL)
        {
            /*
             * Special case, if subplan is not distributed, like Result, or
             * query against catalog tables only.
             * We are only interested in filtering out the subplan results and
             * get only those we are interested in.
             * XXX we may want to prevent multiple executions in this case
             * either, to achieve this we will set single execNode on planning
             * time and this case would never happen, this code branch could
             * be removed.
             */
            remotestate->local_exec = true;
        }
        else if (!remotestate->execOnAll)
        {
            /*
             * XXX We should change planner and remove this flag.
             * We want only one node is producing the replicated result set,
             * and planner should choose that node - it is too hard to determine
             * right node at execution time, because it should be guaranteed
             * that all consumers make the same decision.
             * For now always execute replicated plan on local node to save
             * resources.
             */

            /*
             * Make sure local node is in execution list
             */
            if (list_member_int(remotestate->execNodes, PGXCNodeId-1))
            {
                list_free(remotestate->execNodes);
                remotestate->execNodes = NIL;
                remotestate->local_exec = true;
            }
            else
            {
                /*
                 * To support, we need to connect to some producer, so
                 * each producer should be prepared to serve rows for random
                 * number of consumers. It is hard, because new consumer may
                 * connect after producing is started, on the other hand,
                 * absence of expected consumer is a problem too.
                 */
                ereport(ERROR,
                        (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                         errmsg("Getting replicated results from remote node is not supported")));
            }
        }
    }

    /*
     * If we are going to execute subplan locally or doing explain initialize
     * the subplan. Otherwise have remote node doing that.
     */
    if (remotestate->local_exec || (eflags & EXEC_FLAG_EXPLAIN_ONLY))
    {
        outerPlanState(remotestate) = ExecInitNode(outerPlan(node), estate,
                                                   eflags);
        if (node->distributionNodes)
        {
            Oid         distributionType = InvalidOid;
            TupleDesc     typeInfo;

            typeInfo = combiner->ss.ps.ps_ResultTupleSlot->tts_tupleDescriptor;
            if (node->distributionKey != InvalidAttrNumber)
            {
                Form_pg_attribute attr;
                attr = typeInfo->attrs[node->distributionKey - 1];
                distributionType = attr->atttypid;
            }
            /* Set up locator */
#ifdef _MIGRATE_        
            foreach(table, estate->es_range_table)
            {
                RangeTblEntry *tbl_entry = (RangeTblEntry *)lfirst(table);
                if (tbl_entry->rtekind == RTE_RELATION)
                {
                    reloid = tbl_entry->relid;
                    elog(DEBUG5, "[ExecInitRemoteSubplan]reloid=%d", reloid);
                    break;
                }
            }
            groupid = GetRelGroup(reloid);
            elog(DEBUG5, "[ExecInitRemoteSubplan]groupid=%d", groupid);
    
            remotestate->locator = createLocator(node->distributionType,
                                                 RELATION_ACCESS_INSERT,
                                                 distributionType,
                                                 LOCATOR_LIST_LIST,
                                                 0,
                                                 (void *) node->distributionNodes,
                                                 (void **) &remotestate->dest_nodes,
                                                 false,
                                                 groupid, InvalidOid, InvalidOid, InvalidAttrNumber,
                                                 InvalidOid);
#else
            remotestate->locator = createLocator(node->distributionType,
                                                 RELATION_ACCESS_INSERT,
                                                 distributionType,
                                                 LOCATOR_LIST_LIST,
                                                 0,
                                                 (void *) node->distributionNodes,
                                                 (void **) &remotestate->dest_nodes,
                                                 false);
#endif
        }
        else
            remotestate->locator = NULL;
    }

    /*
     * Encode subplan if it will be sent to remote nodes
     */
    if (remotestate->execNodes && !(eflags & EXEC_FLAG_EXPLAIN_ONLY))
    {
        ParamListInfo ext_params;
        /* Encode plan if we are going to execute it on other nodes */
        rstmt.type = T_RemoteStmt;
        if (node->distributionType == LOCATOR_TYPE_NONE && IS_PGXC_DATANODE)
        {
            /*
             * There are cases when planner can not determine distribution of a
             * subplan, in particular it does not determine distribution of
             * subquery nodes. Such subplans executed from current location
             * (node) and combine all results, like from coordinator nodes.
             * However, if there are multiple locations where distributed
             * executor is running this node, and there are more of
             * RemoteSubplan plan nodes in the subtree there will be a problem -
             * Instances of the inner RemoteSubplan nodes will be using the same
             * SharedQueue, causing error. To avoid this problem we should
             * traverse the subtree and change SharedQueue name to make it
             * unique.
             */
            RemoteSubplanMakeUnique((Node *) outerPlan(node), PGXCNodeId);
            elog(DEBUG3, "RemoteSubplanMakeUnique for LOCATOR_TYPE_NONE unique: %d, cursor: %s",
                 PGXCNodeId, node->cursor);
        }
        rstmt.planTree = outerPlan(node);
        /*
         * If datanode launch further execution of a command it should tell
         * it is a SELECT, otherwise secondary data nodes won't return tuples
         * expecting there will be nothing to return.
         */
        if (IsA(outerPlan(node), ModifyTable))
        {
            rstmt.commandType = estate->es_plannedstmt->commandType;
            rstmt.hasReturning = estate->es_plannedstmt->hasReturning;
            rstmt.resultRelations = estate->es_plannedstmt->resultRelations;
        }
        else
        {
            rstmt.commandType = CMD_SELECT;
            rstmt.hasReturning = false;
            rstmt.resultRelations = NIL;
        }
        rstmt.rtable = estate->es_range_table;
        rstmt.subplans = estate->es_plannedstmt->subplans;
        rstmt.nParamExec = estate->es_plannedstmt->nParamExec;
        ext_params = estate->es_param_list_info;
        rstmt.nParamRemote = (ext_params ? ext_params->numParams : 0) +
                bms_num_members(node->scan.plan.allParam);
        if (rstmt.nParamRemote > 0)
        {
            Bitmapset *tmpset;
            int i;
            int paramno;

            /* Allocate enough space */
            rstmt.remoteparams = (RemoteParam *) palloc(rstmt.nParamRemote *
                                                        sizeof(RemoteParam));
            paramno = 0;
            if (ext_params)
            {
                for (i = 0; i < ext_params->numParams; i++)
                {
                    ParamExternData *param = &ext_params->params[i];
                    /*
                     * If parameter type is not yet defined but can be defined
                     * do that
                     */
                    if (!OidIsValid(param->ptype) && ext_params->paramFetch)
                        (*ext_params->paramFetch) (ext_params, i + 1);

                    /*
                     * If the parameter type is still not defined, assume that
                     * it is unused. But we put a default INT4OID type for such
                     * unused parameters to keep the parameter pushdown code
                     * happy.
                     *
                     * These unused parameters are never accessed during
                     * execution and we will just a null value for these
                     * "dummy" parameters. But including them here ensures that
                     * we send down the parameters in the correct order and at
                     * the position that the datanode needs
                     */
                    if (OidIsValid(param->ptype))
                    {
                        rstmt.remoteparams[paramno].paramused =
                            bms_is_member(i, node->initPlanParams) ? REMOTE_PARAM_INITPLAN : REMOTE_PARAM_SUBPLAN;
                        rstmt.remoteparams[paramno].paramtype = param->ptype;
                    }
                    else
                    {
                        rstmt.remoteparams[paramno].paramused = REMOTE_PARAM_UNUSED;
                        rstmt.remoteparams[paramno].paramtype = INT4OID;
                    }

                    rstmt.remoteparams[paramno].paramkind = PARAM_EXTERN;
                    rstmt.remoteparams[paramno].paramid = i + 1;
                    paramno++;
                }
                /* store actual number of parameters */
                rstmt.nParamRemote = paramno;
            }

            if (!bms_is_empty(node->scan.plan.allParam))
            {
                Bitmapset *defineParams = NULL;
                tmpset = bms_copy(node->scan.plan.allParam);
                while ((i = bms_first_member(tmpset)) >= 0)
                {
                    ParamExecData *prmdata;

                    prmdata = &(estate->es_param_exec_vals[i]);
                    rstmt.remoteparams[paramno].paramkind = PARAM_EXEC;
                    rstmt.remoteparams[paramno].paramid = i;
                    rstmt.remoteparams[paramno].paramtype = prmdata->ptype;
                    rstmt.remoteparams[paramno].paramused =
                        bms_is_member(i, node->initPlanParams) ? REMOTE_PARAM_INITPLAN : REMOTE_PARAM_SUBPLAN;
                    /* Will scan plan tree to find out data type of the param */
                    if (prmdata->ptype == InvalidOid)
                        defineParams = bms_add_member(defineParams, i);
                    paramno++;
                }
                /* store actual number of parameters */
                rstmt.nParamRemote = paramno;
                bms_free(tmpset);
                if (!bms_is_empty(defineParams))
                {
                    struct find_params_context context;
                    bool all_found;

                    context.rparams = rstmt.remoteparams;
                    context.defineParams = defineParams;
                    context.subplans = estate->es_plannedstmt->subplans;

                    all_found = determine_param_types(node->scan.plan.lefttree,
                                                      &context);
                    /*
                     * Remove not defined params from the list of remote params.
                     * If they are not referenced no need to send them down
                     */
                    if (!all_found)
                    {
                        for (i = 0; i < rstmt.nParamRemote; i++)
                        {
                            if (rstmt.remoteparams[i].paramkind == PARAM_EXEC &&
                                    bms_is_member(rstmt.remoteparams[i].paramid,
                                                  context.defineParams))
                            {
                                /* Copy last parameter inplace */
                                rstmt.nParamRemote--;
                                if (i < rstmt.nParamRemote)
                                    rstmt.remoteparams[i] =
                                        rstmt.remoteparams[rstmt.nParamRemote];
                                /* keep current in the same position */
                                i--;
                            }
                        }
                    }
                    bms_free(context.defineParams);
                }
            }
            remotestate->nParamRemote = rstmt.nParamRemote;
            remotestate->remoteparams = rstmt.remoteparams;
        }
        else
            rstmt.remoteparams = NULL;
        rstmt.rowMarks = estate->es_plannedstmt->rowMarks;
        rstmt.distributionKey = node->distributionKey;
        rstmt.distributionType = node->distributionType;
        rstmt.distributionNodes = node->distributionNodes;
        rstmt.distributionRestrict = node->distributionRestrict;
#ifdef __TBASE__
        rstmt.parallelWorkerSendTuple = node->parallelWorkerSendTuple;
        if(IsParallelWorker())
        {
            rstmt.parallelModeNeeded = true;
        }
        else
        {
            rstmt.parallelModeNeeded = estate->es_plannedstmt->parallelModeNeeded;
        }

        if (estate->es_plannedstmt->haspart_tobe_modify)
        {
            rstmt.haspart_tobe_modify = estate->es_plannedstmt->haspart_tobe_modify;
            rstmt.partrelindex = estate->es_plannedstmt->partrelindex;
            rstmt.partpruning = bms_copy(estate->es_plannedstmt->partpruning);
        }
        else
        {
            rstmt.haspart_tobe_modify = false;
            rstmt.partrelindex = 0;
            rstmt.partpruning = NULL;
        }
#endif

        /*
         * A try-catch block to ensure that we don't leave behind a stale state
         * if nodeToString fails for whatever reason.
         *
         * XXX We should probably rewrite it someday by either passing a
         * context to nodeToString() or remembering this information somewhere
         * else which gets reset in case of errors. But for now, this seems
         * enough.
         */
        PG_TRY();
        {
            set_portable_output(true);
#ifdef __AUDIT__
            /* 
             * parseTree and queryString will be only send once while 
             * init the first RemoteSubplan in the whole plan tree
             */
            if (IS_PGXC_COORDINATOR && IsConnFromApp() && 
                estate->es_plannedstmt->parseTree != NULL &&
                estate->es_remote_subplan_num == 0)
            {
                rstmt.queryString = estate->es_sourceText;
                rstmt.parseTree = estate->es_plannedstmt->parseTree;
                estate->es_remote_subplan_num++;
            }
#endif
            remotestate->subplanstr = nodeToString(&rstmt);
#ifdef __AUDIT__
            rstmt.queryString = NULL;
            rstmt.parseTree = NULL;
            elog_node_display(DEBUG5, "SendPlanMessage", &rstmt, Debug_pretty_print);
#endif
        }
        PG_CATCH();
        {
            set_portable_output(false);
            PG_RE_THROW();
        }
        PG_END_TRY();
        set_portable_output(false);

        /*
         * Connect to remote nodes and send down subplan.
         */
        if (!(eflags & EXEC_FLAG_SUBPLAN))
        {
#ifdef __TBASE__
            remotestate->eflags = eflags;
            /* In parallel worker, no init, do it until we start to run. */
            if (IsParallelWorker())
            {
#endif
                
#ifdef __TBASE__
            }
            else
            {
                //ExecFinishInitRemoteSubplan(remotestate);
#ifdef __TBASE__
                /* Not parallel aware, build connections. */
                if (!node->scan.plan.parallel_aware)
                {
                    ExecFinishInitRemoteSubplan(remotestate);
                }
                else
                {
                    /* In session process, if we are under gather, do nothing. */
                }
#endif
            }
#endif
        }
    }
    remotestate->bound = false;
    /*
     * It does not makes sense to merge sort if there is only one tuple source.
     * By the contract it is already sorted
     */
    if (node->sort && remotestate->execOnAll &&
            list_length(remotestate->execNodes) > 1)
        combiner->merge_sort = true;

    if (log_remotesubplan_stats)
        ShowUsageCommon("ExecInitRemoteSubplan", &start_r, &start_t);

    return remotestate;
}


void
ExecFinishInitRemoteSubplan(RemoteSubplanState *node)

{// #lizard forgives
    ResponseCombiner   *combiner = (ResponseCombiner *) node;
    RemoteSubplan         *plan = (RemoteSubplan *) combiner->ss.ps.plan;
    EState               *estate = combiner->ss.ps.state;
    Oid                   *paramtypes = NULL;
    GlobalTransactionId gxid = InvalidGlobalTransactionId;
    Snapshot            snapshot;
    TimestampTz            timestamp;
    int                 i;
    bool                is_read_only;
    char                cursor[NAMEDATALEN];
    
#ifdef __TBASE__
    node->finish_init = true;
#endif
    /*
     * Name is required to store plan as a statement
     */
    Assert(plan->cursor);

    if (plan->unique)
        snprintf(cursor, NAMEDATALEN, "%s_%d", plan->cursor, plan->unique);
    else
        strncpy(cursor, plan->cursor, NAMEDATALEN);

    /* If it is alreaty fully initialized nothing to do */
    if (combiner->connections)
        return;

    /* local only or explain only execution */
    if (node->subplanstr == NULL)
        return;

    /* 
     * Check if any results are planned to be received here.
     * Otherwise it does not make sense to send out the subplan.
     */
    if (IS_PGXC_DATANODE && plan->distributionRestrict && 
            !list_member_int(plan->distributionRestrict, PGXCNodeId - 1))
        return;

    /*
     * Acquire connections and send down subplan where it will be stored
     * as a prepared statement.
     * That does not require transaction id or snapshot, so does not send them
     * here, postpone till bind.
     */
    if (node->execOnAll)
    {
        PGXCNodeAllHandles *pgxc_connections;
        pgxc_connections = get_handles(node->execNodes, NIL, false, true, true);
        combiner->conn_count = pgxc_connections->dn_conn_count;
        combiner->connections = pgxc_connections->datanode_handles;
        combiner->current_conn = 0;
        pfree(pgxc_connections);
    }
    else
    {
        combiner->connections = (PGXCNodeHandle **) palloc(sizeof(PGXCNodeHandle *));
        combiner->connections[0] = get_any_handle(node->execNodes);
        combiner->conn_count = 1;
        combiner->current_conn = 0;
    }

    gxid = GetCurrentTransactionIdIfAny();

    /* extract parameter data types */
    if (node->nParamRemote > 0)
    {
        paramtypes = (Oid *) palloc(node->nParamRemote * sizeof(Oid));
        for (i = 0; i < node->nParamRemote; i++)
            paramtypes[i] = node->remoteparams[i].paramtype;
    }
    /* send down subplan */
    snapshot = GetActiveSnapshot();
    timestamp = GetCurrentGTMStartTimestamp();

#ifdef __TBASE__
    /* set snapshot as needed */
    if (!g_set_global_snapshot && SetSnapshot(estate))
    {
        snapshot = estate->es_snapshot;
    }
#endif
    /*
     * Datanode should not send down statements that may modify
     * the database. Potgres assumes that all sessions under the same
     * postmaster have different xids. That may cause a locking problem.
     * Shared locks acquired for reading still work fine.
     */
    is_read_only = IS_PGXC_DATANODE ||
            !IsA(outerPlan(plan), ModifyTable);

#ifdef __TBASE__
    /* Set plpgsql transaction begin for all connections */
    for (i = 0; i < combiner->conn_count; i++)
    {
        SetPlpgsqlTransactionBegin(combiner->connections[i]);
    }
#endif 

#if 0
    for (i = 0; i < combiner->conn_count; i++)
    {
        PGXCNodeHandle *connection_tmp = combiner->connections[i];
        if (g_in_plpgsql_exec_fun && need_begin_txn)
        {
            connection_tmp->plpgsql_need_begin_txn = true;
            elog(LOG, "[PLPGSQL] ExecFinishInitRemoteSubplan conn nodename:%s backendpid:%d sock:%d need_begin_txn", 
                    connection_tmp->nodename, connection_tmp->backend_pid, connection_tmp->sock);
        }
         if (g_in_plpgsql_exec_fun && need_begin_sub_txn)
        {
            connection_tmp->plpgsql_need_begin_sub_txn = true;
            elog(LOG, "[PLPGSQL] ExecFinishInitRemoteSubplan conn nodename:%s backendpid:%d sock:%d need_begin_sub_txn", 
                    connection_tmp->nodename, connection_tmp->backend_pid, connection_tmp->sock);
        }
    }

    if (g_in_plpgsql_exec_fun && need_begin_txn)
    {
        need_begin_txn =  false;
        elog(LOG, "[PLPGSQL] ExecFinishInitRemoteSubplan need_begin_txn set to false");
    }
    if (g_in_plpgsql_exec_fun && need_begin_sub_txn)
    {
        need_begin_sub_txn = false;
        elog(LOG, "[PLPGSQL] ExecFinishInitRemoteSubplan need_begin_sub_txn set to false");
    }
#endif 

    for (i = 0; i < combiner->conn_count; i++)
    {
        PGXCNodeHandle *connection = combiner->connections[i];

        if (pgxc_node_begin(1, &connection, gxid, true,
                            is_read_only, PGXC_NODE_DATANODE))
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("Could not begin transaction on data node:%s.",
                             connection->nodename)));

        if (pgxc_node_send_timestamp(connection, timestamp))
        {
            combiner->conn_count = 0;
            pfree(combiner->connections);
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("Failed to send command to data nodes")));
        }
        if (snapshot && pgxc_node_send_snapshot(connection, snapshot))
        {
            combiner->conn_count = 0;
            pfree(combiner->connections);
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("Failed to send snapshot to data nodes")));
        }
        if (pgxc_node_send_cmd_id(connection, estate->es_snapshot->curcid) < 0 )
        {
            combiner->conn_count = 0;
            pfree(combiner->connections);
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("Failed to send command ID to data nodes")));
        }
        pgxc_node_send_plan(connection, cursor, "Remote Subplan",
                            node->subplanstr, node->nParamRemote, paramtypes, estate->es_instrument);

        if (enable_statistic)
        {
            elog(LOG, "Plan Message:pid:%d,remote_pid:%d,remote_ip:%s,"
                      "remote_port:%d,fd:%d,cursor:%s",
                      MyProcPid, connection->backend_pid, connection->nodehost,
                      connection->nodeport, connection->sock, cursor);
        }
        
        if (pgxc_node_flush(connection))
        {
            combiner->conn_count = 0;
            pfree(combiner->connections);
            ereport(ERROR,
                    (errcode(ERRCODE_INTERNAL_ERROR),
                     errmsg("Failed to send subplan to data nodes")));
        }
    }
}

TupleTableSlot *
ExecRemoteSubplan(PlanState *pstate)
{// #lizard forgives
    RemoteSubplanState *node = castNode(RemoteSubplanState, pstate);
    ResponseCombiner *combiner = (ResponseCombiner *) node;
    RemoteSubplan  *plan = (RemoteSubplan *) combiner->ss.ps.plan;
    EState           *estate = combiner->ss.ps.state;
    TupleTableSlot *resultslot = combiner->ss.ps.ps_ResultTupleSlot;
    struct rusage    start_r;
    struct timeval        start_t;
#ifdef __TBASE__
    int count = 0;
#endif
#ifdef __TBASE__
	if ((node->eflags & EXEC_FLAG_EXPLAIN_ONLY) != 0)
		return NULL;
	
    if (!node->local_exec && (!node->finish_init) && (!(node->eflags & EXEC_FLAG_SUBPLAN)))
    {
        if(node->execNodes)
        {
            ExecFinishInitRemoteSubplan(node);
        }
        else
        {
            return NULL;
        }
    }
#endif
    /* 
     * We allow combiner->conn_count == 0 after node initialization
     * if we figured out that current node won't receive any result
     * because of distributionRestrict is set by planner.
     * But we should distinguish this case from others, when conn_count is 0.
     * That is possible if local execution is chosen or data are buffered 
     * at the coordinator or data are exhausted and node was reset.
     * in last two cases connections are saved to cursor_connections and we
     * can check their presence.  
     */
    if (!node->local_exec && combiner->conn_count == 0 && 
            combiner->cursor_count == 0)
        return NULL;

    if (log_remotesubplan_stats)
        ResetUsageCommon(&start_r, &start_t);

primary_mode_phase_two:
    if (!node->bound)
    {
        int fetch = 0;
        int paramlen = 0;
		int epqctxlen = 0;
        char *paramdata = NULL;
		char *epqctxdata = NULL;
		
        /*
         * Conditions when we want to execute query on the primary node first:
         * Coordinator running replicated ModifyTable on multiple nodes
         */
        bool primary_mode = combiner->probing_primary ||
                (IS_PGXC_COORDINATOR &&
                 combiner->combine_type == COMBINE_TYPE_SAME &&
                 OidIsValid(primary_data_node) &&
                 combiner->conn_count > 1 && !g_UseDataPump);
        char cursor[NAMEDATALEN];

        if (plan->cursor)
        {
            fetch = PGXLRemoteFetchSize;
            if (plan->unique)
                snprintf(cursor, NAMEDATALEN, "%s_%d", plan->cursor, plan->unique);
            else
                strncpy(cursor, plan->cursor, NAMEDATALEN);
        }
        else
            cursor[0] = '\0';

#ifdef __TBASE__
        if(g_UseDataPump)
        {
            /* fetch all */
            fetch = 0;
        }

        /* get connection's count and handle */
        if (combiner->conn_count)
        {
            count = combiner->conn_count;
        }
        else
        {
            if (combiner->cursor)
            {
                if (!combiner->probing_primary)
                {
                    count = combiner->cursor_count;
                }
                else
                {
                    count = combiner->conn_count;
                }
            }
        }

        /* initialize */
        combiner->recv_node_count = count;
        combiner->recv_tuples     = 0;
        combiner->recv_total_time = -1;
        combiner->recv_datarows = 0;
#endif

        /*
         * Send down all available parameters, if any is used by the plan
         */
        if (estate->es_param_list_info ||
                !bms_is_empty(plan->scan.plan.allParam))
            paramlen = encode_parameters(node->nParamRemote,
                                         node->remoteparams,
                                         &combiner->ss.ps,
                                         &paramdata);

		if (estate->es_epqTuple != NULL)
			epqctxlen = encode_epqcontext(&combiner->ss.ps, &epqctxdata);

        /*
         * The subplan being rescanned, need to restore connections and
         * re-bind the portal
         */
        if (combiner->cursor)
        {
            int i;

            /*
             * On second phase of primary mode connections are properly set,
             * so do not copy.
             */
            if (!combiner->probing_primary)
            {
                combiner->conn_count = combiner->cursor_count;
                memcpy(combiner->connections, combiner->cursor_connections,
                            combiner->cursor_count * sizeof(PGXCNodeHandle *));
            }

            for (i = 0; i < combiner->conn_count; i++)
            {
                PGXCNodeHandle *conn = combiner->connections[i];

                CHECK_OWNERSHIP(conn, combiner);

                /* close previous cursor only on phase 1 */
                if (!primary_mode || !combiner->probing_primary)
                    pgxc_node_send_close(conn, false, combiner->cursor);

                /*
                 * If we now should probe primary, skip execution on non-primary
                 * nodes
                 */
                if (primary_mode && !combiner->probing_primary &&
                        conn->nodeoid != primary_data_node)
                    continue;

                /* rebind */
                pgxc_node_send_bind(conn, combiner->cursor, combiner->cursor,
									paramlen, paramdata, epqctxlen, epqctxdata);
                if (enable_statistic)
                {
                    elog(LOG, "Bind Message:pid:%d,remote_pid:%d,remote_ip:%s,remote_port:%d,fd:%d,cursor:%s",
                              MyProcPid, conn->backend_pid, conn->nodehost, conn->nodeport, conn->sock, cursor);
                }
                /* execute */
                pgxc_node_send_execute(conn, combiner->cursor, fetch);
                /* submit */
                if (pgxc_node_send_flush(conn))
                {
                   combiner->conn_count = 0;
                    pfree(combiner->connections);
                    ereport(ERROR,
                            (errcode(ERRCODE_INTERNAL_ERROR),
                             errmsg("Failed to send command to data nodes")));
                }

                /*
                 * There could be only one primary node, but can not leave the
                 * loop now, because we need to close cursors.
                 */
                if (primary_mode && !combiner->probing_primary)
                {
                    combiner->current_conn = i;
                }
            }
        }
        else if (node->execNodes)
        {
            CommandId        cid;
            int             i;

            /*
             * There are prepared statement, connections should be already here
             */
            Assert(combiner->conn_count > 0);

            combiner->extended_query = true;
            cid = estate->es_snapshot->curcid;

            for (i = 0; i < combiner->conn_count; i++)
            {
                PGXCNodeHandle *conn = combiner->connections[i];

#ifdef __TBASE__
                conn->recv_datarows = 0;
#endif

                CHECK_OWNERSHIP(conn, combiner);

                /*
                 * If we now should probe primary, skip execution on non-primary
                 * nodes
                 */
                if (primary_mode && !combiner->probing_primary &&
                        conn->nodeoid != primary_data_node)
                    continue;

                /*
                 * Update Command Id. Other command may be executed after we
                 * prepare and advanced Command Id. We should use one that
                 * was active at the moment when command started.
                 */
                if (pgxc_node_send_cmd_id(conn, cid))
                {
                    combiner->conn_count = 0;
                    pfree(combiner->connections);
                    ereport(ERROR,
                            (errcode(ERRCODE_INTERNAL_ERROR),
                             errmsg("Failed to send command ID to data nodes")));
                }

                /*
                 * Resend the snapshot as well since the connection may have
                 * been buffered and use by other commands, with different
                 * snapshot. Set the snapshot back to what it was
                 */
                if (pgxc_node_send_snapshot(conn, estate->es_snapshot))
                {
                    combiner->conn_count = 0;
                    pfree(combiner->connections);
                    ereport(ERROR,
                            (errcode(ERRCODE_INTERNAL_ERROR),
                             errmsg("Failed to send snapshot to data nodes")));
                }

                /* bind */
				pgxc_node_send_bind(conn, cursor, cursor, paramlen, paramdata,
				                    epqctxlen, epqctxdata);

                if (enable_statistic)
                {
                    elog(LOG, "Bind Message:pid:%d,remote_pid:%d,remote_ip:%s,remote_port:%d,fd:%d,cursor:%s",
                              MyProcPid, conn->backend_pid, conn->nodehost, conn->nodeport, conn->sock, cursor);
                }
                /* execute */
                pgxc_node_send_execute(conn, cursor, fetch);

                /* submit */
                if (pgxc_node_send_flush(conn))
                {
                    combiner->conn_count = 0;
                    pfree(combiner->connections);
                    ereport(ERROR,
                            (errcode(ERRCODE_INTERNAL_ERROR),
                             errmsg("Failed to send command to data nodes")));
                }

                /*
                 * There could be only one primary node, so if we executed
                 * subquery on the phase one of primary mode we can leave the
                 * loop now.
                 */
                if (primary_mode && !combiner->probing_primary)
                {
                    combiner->current_conn = i;
                    break;
                }
            }

            /*
             * On second phase of primary mode connections are backed up
             * already, so do not copy.
             */
            if (primary_mode)
            {
                if (combiner->probing_primary)
                {
                    combiner->cursor = pstrdup(cursor);
                }
                else
                {
                    combiner->cursor = pstrdup(cursor);
                    combiner->cursor_count = combiner->conn_count;
                    combiner->cursor_connections = (PGXCNodeHandle **) palloc(
                                combiner->conn_count * sizeof(PGXCNodeHandle *));
                    memcpy(combiner->cursor_connections, combiner->connections,
                                combiner->conn_count * sizeof(PGXCNodeHandle *));
                }
            }
            else
            {
                combiner->cursor = pstrdup(cursor);
                combiner->cursor_count = combiner->conn_count;
                combiner->cursor_connections = (PGXCNodeHandle **) palloc(
                            combiner->conn_count * sizeof(PGXCNodeHandle *));
                memcpy(combiner->cursor_connections, combiner->connections,
                            combiner->conn_count * sizeof(PGXCNodeHandle *));
            }
        }

        if (combiner->merge_sort)
        {
            /*
             * Requests are already made and sorter can fetch tuples to populate
             * sort buffer.
             */
            combiner->tuplesortstate = tuplesort_begin_merge(
                                       resultslot->tts_tupleDescriptor,
                                       plan->sort->numCols,
                                       plan->sort->sortColIdx,
                                       plan->sort->sortOperators,
                                       plan->sort->sortCollations,
                                       plan->sort->nullsFirst,
                                       combiner,
                                       work_mem);
        }
        if (primary_mode)
        {
            if (combiner->probing_primary)
            {
                combiner->probing_primary = false;
                node->bound = true;
            }
            else
                combiner->probing_primary = true;
        }
        else
            node->bound = true;
    }

    if (combiner->tuplesortstate)
    {
        if (tuplesort_gettupleslot((Tuplesortstate *) combiner->tuplesortstate,
                                   true, true, resultslot, NULL))
        {
            if (log_remotesubplan_stats)
                ShowUsageCommon("ExecRemoteSubplan", &start_r, &start_t);
            return resultslot;
        }
    }
    else
    {
        TupleTableSlot *slot = FetchTuple(combiner);
        if (!TupIsNull(slot))
        {
            if (log_remotesubplan_stats)
                ShowUsageCommon("ExecRemoteSubplan", &start_r, &start_t);
            return slot;
        }
        else if (combiner->probing_primary)
            /* phase1 is successfully completed, run on other nodes */
            goto primary_mode_phase_two;
    }
    if (combiner->errorMessage)
        pgxc_node_report_error(combiner);

    if (log_remotesubplan_stats)
        ShowUsageCommon("ExecRemoteSubplan", &start_r, &start_t);

#ifdef __TBASE__
    if (enable_statistic)
    {
        double __tmp__= (double)combiner->recv_tuples;
        if(__tmp__)
        {
        elog(LOG, "FetchTuple: worker:%d, recv_node_count:%d, recv_tuples:%lu, recv_total_time:%ld, avg_time:%lf.",
                   ParallelWorkerNumber, combiner->recv_node_count, combiner->recv_tuples, combiner->recv_total_time, 
                   ((double)combiner->recv_total_time) / __tmp__);
        }
        else
        {
            elog(LOG, "FetchTuple: worker:%d, recv_node_count:%d, recv_tuples:%lu, recv_total_time:%ld, avg_time:--.",
                       ParallelWorkerNumber, combiner->recv_node_count, combiner->recv_tuples, combiner->recv_total_time
                       );
        }
    }
#endif
    return NULL;
}


void
ExecReScanRemoteSubplan(RemoteSubplanState *node)
{
    ResponseCombiner *combiner = (ResponseCombiner *)node;

    /*
     * If we haven't queried remote nodes yet, just return. If outerplan'
     * chgParam is not NULL then it will be re-scanned by ExecProcNode,
     * else - no reason to re-scan it at all.
     */
    if (!node->bound)
        return;

    /*
     * If we execute locally rescan local copy of the plan
     */
    if (outerPlanState(node))
        ExecReScan(outerPlanState(node));

    /*
     * Consume any possible pending input
     */
    pgxc_connections_cleanup(combiner);

    /* misc cleanup */
    combiner->command_complete_count = 0;
    combiner->description_count = 0;

    /*
     * Force query is re-bound with new parameters
     */
    node->bound = false;
#ifdef __TBASE__
    node->eflags &= ~(EXEC_FLAG_DISCONN);
#endif
}

#ifdef __TBASE__
/*
 * ExecShutdownRemoteSubplan
 * 
 * for instrumentation only, init full planstate tree,
 * then attach received remote instrumentation.
 */
void
ExecShutdownRemoteSubplan(RemoteSubplanState *node)
{
	ResponseCombiner    *combiner = &node->combiner;
	PlanState           *ps = &combiner->ss.ps;
	Plan                *plan = ps->plan;
	EState              *estate = ps->state;
	
	if ((node->eflags & EXEC_FLAG_EXPLAIN_ONLY) != 0)
		return;
	
	elog(DEBUG1, "shutdown remote subplan worker %d, plan_node_id %d", ParallelWorkerNumber, plan->plan_node_id);
	
	if (estate->es_instrument)
	{
		MemoryContext oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
		AttachRemoteInstrContext ctx;
		
		if (!ps->lefttree)
			ps->lefttree = ExecInitNode(plan->lefttree, estate, EXEC_FLAG_EXPLAIN_ONLY);

		ctx.htab = combiner->recv_instr_htbl;
		ctx.node_idx_List = ((RemoteSubplan *) plan)->nodeList;
		ctx.printed_nodes = NULL;
		AttachRemoteInstr(ps->lefttree, &ctx);
		
		MemoryContextSwitchTo(oldcontext);
	}
}

pgxc_node_receive is called to read data from the distributed data nodes; it uses the Linux kernel poll mechanism as the network communication model to receive and send network data.
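
Before looking at the real function, here is a minimal standalone sketch (plain POSIX C, not TBase code; wait_for_readable is a made-up name) of the poll(2) pattern it is built around: fill a pollfd array with the connection sockets, wait for POLLIN, and treat POLLERR/POLLHUP/POLLNVAL as a broken connection:

#include <poll.h>
#include <stdio.h>

/* Returns >0 if some socket is ready, 0 on timeout, -1 on error */
static int
wait_for_readable(const int *socks, int nsocks, int timeout_ms)
{
    struct pollfd fds[nsocks];
    int     i;
    int     rc;

    for (i = 0; i < nsocks; i++)
    {
        fds[i].fd = socks[i];
        fds[i].events = POLLIN;
        fds[i].revents = 0;
    }

    rc = poll(fds, nsocks, timeout_ms);     /* timeout_ms = -1 waits forever */

    for (i = 0; rc > 0 && i < nsocks; i++)
    {
        if (fds[i].revents & POLLIN)
            printf("socket %d has data to read\n", socks[i]);
        else if (fds[i].revents & (POLLERR | POLLHUP | POLLNVAL))
            printf("socket %d failed, discard the connection\n", socks[i]);
    }
    return rc;
}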

/*
 * Wait while at least one of specified connections has data available and read
 * the data into the buffer

 *
 * Returning state code
 *         DNStatus_OK      = 0,
 *        DNStatus_ERR     = 1,
 *        DNStatus_EXPIRED = 2,
 *        DNStatus_BUTTY
 */
#ifdef __TBASE__
int
pgxc_node_receive(const int conn_count,
                  PGXCNodeHandle ** connections, struct timeval * timeout)

#else
bool
pgxc_node_receive(const int conn_count,
                  PGXCNodeHandle ** connections, struct timeval * timeout)
#endif

{// #lizard forgives
#ifndef __TBASE__
#define ERROR_OCCURED        true
#define NO_ERROR_OCCURED    false
#endif

    int        i,
               sockets_to_poll,
               poll_val;
    bool    is_msg_buffered;
    long    timeout_ms;
    struct pollfd pool_fd[conn_count];

    /* sockets to be polled index */
    sockets_to_poll = 0;

    is_msg_buffered = false;
    for (i = 0; i < conn_count; i++)
    {
        /* If connection has a buffered message */
        if (HAS_MESSAGE_BUFFERED(connections[i]))
        {
            is_msg_buffered = true;
            break;
        }
    }

......

    /* read data */
    for (i = 0; i < conn_count; i++)
    {
        PGXCNodeHandle *conn = connections[i];

        if( pool_fd[i].fd == -1 )
            continue;

        if ( pool_fd[i].fd == conn->sock )
        {
            if( pool_fd[i].revents & POLLIN )
            {
                int    read_status = pgxc_node_read_data(conn, true);
                if ( read_status == EOF || read_status < 0 )
                {
                    /* Can not read - no more actions, just discard connection */
                    PGXCNodeSetConnectionState(conn,
                            DN_CONNECTION_STATE_ERROR_FATAL);
                    add_error_message(conn, "unexpected EOF on datanode connection.");
                    elog(LOG, "unexpected EOF on node:%s pid:%d, read_status:%d, EOF:%d", conn->nodename, conn->backend_pid, read_status, EOF);
                    

#ifdef __TBASE__
                    return DNStatus_ERR;            
#else
                    return ERROR_OCCURED;
#endif
                }

            }
            else if (
                    (pool_fd[i].revents & POLLERR) ||
                    (pool_fd[i].revents & POLLHUP) ||
                    (pool_fd[i].revents & POLLNVAL)
                    )
            {
                PGXCNodeSetConnectionState(connections[i],
                        DN_CONNECTION_STATE_ERROR_FATAL);
                add_error_message(conn, "unexpected network error on datanode connection");
                elog(LOG, "unexpected EOF on datanode:%s pid:%d with event %d", conn->nodename, conn->backend_pid, pool_fd[i].revents);
                /* Should we check/read from the other connections before returning? */
#ifdef __TBASE__
                return DNStatus_ERR;            
#else
                return ERROR_OCCURED;
#endif
            }
        }
    }
#ifdef __TBASE__
    return DNStatus_OK;            
#else
    return NO_ERROR_OCCURED;
#endif
}

pgxc_node_read_data parses the received network data:

 /*
 * Read up incoming messages from the PGXC node connection
 */
int
pgxc_node_read_data(PGXCNodeHandle *conn, bool close_if_error)

pgxc_node_send_execute sends the EXECUTE message down to the Datanode:

/*
 * Send EXECUTE message down to the Datanode
 */
int
pgxc_node_send_execute(PGXCNodeHandle * handle, const char *portal, int fetch)

{
    /* portal name size (allow NULL) */
    int            pnameLen = portal ? strlen(portal) + 1 : 1;

    /* size + pnameLen + fetchLen */
    int            msgLen = 4 + pnameLen + 4;

    /* msgType + msgLen */
    if (ensure_out_buffer_capacity(handle->outEnd + 1 + msgLen, handle) != 0)
    {
        add_error_message(handle, "out of memory");
        return EOF;
    }

    handle->outBuffer[handle->outEnd++] = 'E';
    /* size */
    msgLen = htonl(msgLen);
    memcpy(handle->outBuffer + handle->outEnd, &msgLen, 4);
    handle->outEnd += 4;
    /* portal name */
    if (portal)
    {
        memcpy(handle->outBuffer + handle->outEnd, portal, pnameLen);
        handle->outEnd += pnameLen;
    }
    else
        handle->outBuffer[handle->outEnd++] = '\0';

    /* fetch */
    fetch = htonl(fetch);
    memcpy(handle->outBuffer + handle->outEnd, &fetch, 4);
    handle->outEnd += 4;

    PGXCNodeSetConnectionState(handle, DN_CONNECTION_STATE_QUERY);

    handle->in_extended_query = true;
    return 0;
}

 

/*
 * FetchTuple
 *
 *        Get next tuple from one of the datanode connections.
 * The connections should be in combiner->connections, if "local" dummy
 * connection presents it should be the last active connection in the array.
 *      If combiner is set up to perform merge sort function returns tuple from
 * connection defined by combiner->current_conn, or NULL slot if no more tuple
 * are available from the connection. Otherwise it returns tuple from any
 * connection or NULL slot if no more available connections.
 *         Function looks into combiner->rowBuffer before accessing connection
 * and return a tuple from there if found.
 *         Function may wait while more data arrive from the data nodes. If there
 * is a locally executed subplan function advance it and buffer resulting rows
 * instead of waiting.
 */
TupleTableSlot *
FetchTuple(ResponseCombiner *combiner)

  

ExecutorRun call sequence: ExecutorRun => standard_ExecutorRun => ExecutePlan

【ExecutePlan】

/* ----------------------------------------------------------------
 *        ExecutePlan
 *
 *        Processes the query plan until we have retrieved 'numberTuples' tuples,
 *        moving in the specified direction.
 *
 *        Runs to completion if numberTuples is 0
 *
 * Note: the ctid attribute is a 'junk' attribute that is removed before the
 * user can see it
 * ----------------------------------------------------------------
 */
static void
ExecutePlan(EState *estate,
            PlanState *planstate,
            bool use_parallel_mode,
            CmdType operation,
            bool sendTuples,
            uint64 numberTuples,
            ScanDirection direction,
            DestReceiver *dest,
            bool execute_once)
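
The body of ExecutePlan is essentially a fetch-and-dispatch loop over ExecProcNode. The following is an abridged sketch of that loop based on stock PostgreSQL 10 (slot and current_tuple_count are local variables of the real function; TBase adds further distribution- and parallelism-related handling around this core):

    for (;;)
    {
        /* Reset the per-output-tuple expression context each cycle */
        ResetPerTupleExprContext(estate);

        /* Ask the top-level plan node for its next tuple */
        slot = ExecProcNode(planstate);
        if (TupIsNull(slot))
            break;                      /* the plan tree is exhausted */

        /* Hand the tuple to the destination receiver, if we are sending any */
        if (sendTuples)
        {
            if (!dest->receiveSlot(slot, dest))
                break;                  /* the receiver asked us to stop */
        }

        /* For a SELECT, count the tuple; ModifyTable counts other operations */
        if (operation == CMD_SELECT)
            (estate->es_processed)++;

        /* Stop once the requested number of tuples has been returned */
        current_tuple_count++;
        if (numberTuples && numberTuples == current_tuple_count)
            break;
    }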

 【ExecProcNode】

/* ----------------------------------------------------------------
 *        ExecProcNode
 *
 *        Execute the given node to return a(nother) tuple.
 * ----------------------------------------------------------------
 */
#ifndef FRONTEND
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL) /* something changed? */
        ExecReScan(node);        /* let ReScan handle this */

    return node->ExecProcNode(node);
}
#endif

 /*
 * ExecReScan
 *        Reset a plan node so that its output can be re-scanned.
 *
 * Note that if the plan node has parameters that have changed value,
 * the output might be different from last time.
 */
void
ExecReScan(PlanState *node)

{

......

    /* And do node-type-specific processing */
    switch (nodeTag(node))
    {
        case T_ResultState:
            ExecReScanResult((ResultState *) node);
            break;

        case T_ProjectSetState:
            ExecReScanProjectSet((ProjectSetState *) node);
            break;

        case T_ModifyTableState:
            ExecReScanModifyTable((ModifyTableState *) node);
            break;

        case T_AppendState:
            ExecReScanAppend((AppendState *) node);
            break;

        case T_MergeAppendState:
            ExecReScanMergeAppend((MergeAppendState *) node);
            break;

        case T_RecursiveUnionState:
            ExecReScanRecursiveUnion((RecursiveUnionState *) node);
            break;

        case T_BitmapAndState:
            ExecReScanBitmapAnd((BitmapAndState *) node);
            break;

......

 #ifdef PGXC
        case T_RemoteSubplanState:
            ExecReScanRemoteSubplan((RemoteSubplanState *) node);

            break;
        case T_RemoteQueryState:
            ExecReScanRemoteQuery((RemoteQueryState *) node);

            break;
#endif

......

}


Cleanup of the distributed subplan, ExecEndRemoteSubplan, is mainly reached through the function below, ExecEndNode.

src/backend/executor/execProcNode.c

/* ----------------------------------------------------------------
 *        ExecEndNode
 *
 *        Recursively cleans up all the nodes in the plan rooted
 *        at 'node'.
 *
 *        After this operation, the query plan will not be able to be
 *        processed any further.  This should be called only after
 *        the query plan has been fully executed.
 * ----------------------------------------------------------------
 */
void
ExecEndNode(PlanState *node)
{
......


#ifdef PGXC
        case T_RemoteQueryState:
            ExecEndRemoteQuery((RemoteQueryState *) node);
            break;
#endif
#ifdef XCP
        case T_RemoteSubplanState:
            ExecEndRemoteSubplan((RemoteSubplanState *) node);
            break;
......
}

【ExecEndRemoteSubplan】

void
ExecEndRemoteSubplan(RemoteSubplanState *node)
{// #lizard forgives
    int32             count    = 0;
    ResponseCombiner *combiner = (ResponseCombiner *)node;
    RemoteSubplan    *plan = (RemoteSubplan *) combiner->ss.ps.plan;
    int i;
    struct rusage    start_r;
    struct timeval        start_t;

    if (log_remotesubplan_stats)
        ResetUsageCommon(&start_r, &start_t);

    if (outerPlanState(node))
        ExecEndNode(outerPlanState(node));
    if (node->locator)
        freeLocator(node->locator);

    /*
     * Consume any possible pending input
     */
    if (node->bound)
    {
        pgxc_connections_cleanup(combiner);
    }    
    
    /*
     * Update coordinator statistics
     */
    if (IS_PGXC_COORDINATOR)
    {
        EState *estate = combiner->ss.ps.state;
......

}

  
