Presto Source Code Walkthrough: A SQL Query's Fantastic Journey

Presto Technical Notes

1. Environment Setup

You will need a Hadoop environment, a Hive environment, MySQL, SSH, and a local Presto debug environment.

Recommended versions: Hadoop 2.2.0, Hive 1.2.1, MySQL 5.7, openssh-server & client, and the latest Presto release.

For setting up a local Presto debug environment, see "Presto in IDEA".

2. Query Entry Point & Flow

Every query first hits StatementResource, whose resource path is @Path("/v1/statement").
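
A minimal sketch of how a client reaches this endpoint (assuming a coordinator listening on localhost:8080 and a user named "test"; the Presto CLI and JDBC driver do essentially the same thing): the SQL is POSTed as the request body, and the JSON response carries a nextUri that the client keeps polling for state and result pages.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitStatement
{
    public static void main(String[] args) throws Exception
    {
        URL url = new URL("http://localhost:8080/v1/statement");    // assumed coordinator address
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("X-Presto-User", "test");           // the user header is required
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write("select 1".getBytes(StandardCharsets.UTF_8)); // the SQL text is the request body
        }
        System.out.println("HTTP " + conn.getResponseCode());
        // the JSON response contains a nextUri to poll for query state and result pages
    }
}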

Query query = Query.create(
                sessionContext,
                statement,         //the actual SQL text
                queryManager,
                sessionPropertyManager,
                exchangeClient,
                responseExecutor,
                timeoutExecutor,
                blockEncodingSerde);
        queries.put(query.getQueryId(), query);  //create the query and register it under its queryId

Inside the Query class, the following runs:

Query result = new Query(sessionContext, query, queryManager, sessionPropertyManager, exchangeClient, dataProcessorExecutor, timeoutExecutor, blockEncodingSerde);

The queryManager inside the Query class:

QueryInfo queryInfo = queryManager.createQuery(sessionContext, query);  //sessionContext is the user's session info; query is the user's SQL

2.1 Lexing & Parsing: Generating the AST

queryManager is an interface whose only implementation so far is SqlQueryManager; its createQuery method:

private final ConcurrentMap<QueryId, QueryExecution> queries = new ConcurrentHashMap<>();
//QueryQueueManager is an interface; the SQL-related implementation is SqlQueryQueueManager
private final QueryQueueManager queueManager;

//the core logic: lexical and syntactic analysis, producing the AST node
Statement wrappedStatement = sqlParser.createStatement(query, createParsingOptions(session));
statement = unwrapExecuteStatement(wrappedStatement, sqlParser, session);
List<Expression> parameters = wrappedStatement instanceof Execute ? ((Execute) wrappedStatement).getParameters() : emptyList();

//validate the parameters
validateParameters(statement, parameters);
//look up the query execution factory for this statement type
QueryExecutionFactory<?> queryExecutionFactory = executionFactories.get(statement.getClass());
//the factory creates the query execution
queryExecution = queryExecutionFactory.createQueryExecution(queryId, query, session, statement, parameters);
//map the queryId to its query execution
queries.put(queryId, queryExecution);
//submit the query execution to the queueManager
queueManager.submit(statement, queryExecution, queryExecutor);
//return the query info
return queryInfo;

SqlQueryQueueManager's submit method:

List<QueryQueue> queues;
        try {
            //pick the execution queues according to the configured rules
            queues = selectQueues(queryExecution.getSession(), executor);
        }
        catch (PrestoException e) {
            queryExecution.fail(e);
            return;
        }
        for (QueryQueue queue : queues) {
            if (!queue.reserve(queryExecution)) {
                queryExecution.fail(new PrestoException(QUERY_QUEUE_FULL, "Too many queued queries"));
                return;
            }
        }
        //if the query matches the rules, enqueue it
        queues.get(0).enqueue(createQueuedExecution(queryExecution, queues.subList(1, queues.size()), executor));

    //pick the execution queues according to the configured rules
    private List<QueryQueue> selectQueues(Session session, Executor executor)
    {
        for (QueryQueueRule rule : rules) {
            Optional<List<QueryQueueDefinition>> queues = rule.match(session.toSessionRepresentation());
            if (queues.isPresent()) {
               //get or create a query queue
                return getOrCreateQueues(session, executor, queues.get());
            }
        }
        throw new PrestoException(QUERY_REJECTED, "Query did not match any queuing rule");
    }

    //get or create a query queue
    private List<QueryQueue> getOrCreateQueues(Session session, Executor executor, List<QueryQueueDefinition> definitions)
    {
        ImmutableList.Builder<QueryQueue> queues = ImmutableList.builder();
        for (QueryQueueDefinition definition : definitions) {
            String expandedName = definition.getExpandedTemplate(session);
            QueueKey key = new QueueKey(definition, expandedName);
            if (!queryQueues.containsKey(key)) {
                QueryQueue queue = new QueryQueue(executor, definition.getMaxQueued(), definition.getMaxConcurrent());
                if (queryQueues.putIfAbsent(key, queue) == null) {
                    // Export the mbean, after checking for races
                    String objectName = ObjectNames.builder(QueryQueue.class, definition.getTemplate()).withProperty("expansion", expandedName).build();
                    mbeanExporter.export(objectName, queue);
                }
            }
            queues.add(queryQueues.get(key));
        }
        return queues.build();
    }

QueryQueue(Executor queryExecutor, int maxQueuedQueries, int maxConcurrentQueries)
    {
        requireNonNull(queryExecutor, "queryExecutor is null");
        checkArgument(maxQueuedQueries > 0, "maxQueuedQueries must be greater than zero");
        checkArgument(maxConcurrentQueries > 0, "maxConcurrentQueries must be greater than zero");

        int permits = maxQueuedQueries + maxConcurrentQueries;
        // Check for overflow
        checkArgument(permits > 0, "maxQueuedQueries + maxConcurrentQueries must be less than or equal to %s", Integer.MAX_VALUE);

        this.queuePermits = new AtomicInteger(permits);
        this.asyncSemaphore = new AsyncSemaphore<>(maxConcurrentQueries,
                queryExecutor,
                queueEntry -> {
                    QueuedExecution queuedExecution = queueEntry.dequeue();
                    if (queuedExecution != null) {
                        queuedExecution.start();
                        return queuedExecution.getCompletionFuture();
                    }
                    return Futures.immediateFuture(null);
                });
    }
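
The constructor above hands out maxQueuedQueries + maxConcurrentQueries permits in total. A minimal sketch (not the verbatim Presto source) of how reserve(), used in the submit method earlier, is commonly understood to account for those permits: it claims one permit per query and gives it back once the query reaches a terminal state.

public boolean reserve(QueryExecution queryExecution)
{
    if (queuePermits.decrementAndGet() < 0) {
        queuePermits.incrementAndGet();   // no permit available: the queue is full
        return false;
    }
    // return the permit when the query is done, no matter how it finished
    queryExecution.addStateChangeListener(state -> {
        if (state.isDone()) {
            queuePermits.incrementAndGet();
        }
    });
    return true;
}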

public void start()
    {
        // Only execute if the query is not already completed (e.g. cancelled)
        if (listenableFuture.isDone()) {
            return;
        }
        if (nextQueues.isEmpty()) {
            executor.execute(() -> {
                try (SetThreadName ignored = new SetThreadName("Query-%s", queryExecution.getQueryId())) {
                    //turn the statement into an analysis and a plan (see below)
                    queryExecution.start();
                }
            });
        }
        else {
            nextQueues.get(0).enqueue(new QueuedExecution(queryExecution, nextQueues.subList(1, nextQueues.size()), executor, listenableFuture));
        }
    }

2.2 Semantic Analysis & Logical Plan Generation

2.2.1 Semantic Analysis

First, the makeup of the SqlQueryExecution class:

private final QueryStateMachine stateMachine;

private final Statement statement;                               //the AST node produced by lexing/parsing
private final Metadata metadata;
private final AccessControl accessControl;
private final SqlParser sqlParser;                               //the SQL parser
private final SplitManager splitManager;
private final NodePartitioningManager nodePartitioningManager;
private final NodeScheduler nodeScheduler; //the core module that assigns tasks to nodes; explained in detail under stage scheduling
private final List<PlanOptimizer> planOptimizers;
private final RemoteTaskFactory remoteTaskFactory;
private final LocationFactory locationFactory;
private final int scheduleSplitBatchSize;
private final ExecutorService queryExecutor;
private final ScheduledExecutorService schedulerExecutor;
private final FailureDetector failureDetector;

private final QueryExplainer queryExplainer;
private final PlanFlattener planFlattener;
private final CostCalculator costCalculator;
private final AtomicReference<SqlQueryScheduler> queryScheduler = new AtomicReference<>();
private final AtomicReference<Plan> queryPlan = new AtomicReference<>();
private final NodeTaskMap nodeTaskMap;
private final ExecutionPolicy executionPolicy;
private final List<Expression> parameters;
private final SplitSchedulerStats schedulerStats;

SqlQueryExecution's start method:

PlanRoot plan = analyzeQuery();  //build the logical execution plan

//the call chain is
analyzeQuery() -> doAnalyzeQuery()
  
doAnalyzeQuery()
{
    //create the semantic analyzer
    Analyzer analyzer = new Analyzer(stateMachine.getSession(), metadata, sqlParser, accessControl, Optional.of(queryExplainer), parameters);
    //run the semantic analysis
    Analysis analysis = analyzer.analyze(statement);
    //create the logical planner
    LogicalPlanner logicalPlanner = new LogicalPlanner(stateMachine.getSession(), planOptimizers, idAllocator, metadata, sqlParser, costCalculator);
    //the logical planner builds the logical execution plan, which also covers plan optimization
    Plan plan = logicalPlanner.plan(analysis);
    queryPlan.set(plan);
    //fragment the logical plan, in preparation for the distributed execution plan
    SubPlan fragmentedPlan = PlanFragmenter.createSubPlans(stateMachine.getSession(), metadata, nodePartitioningManager, plan, false);
  
    return new PlanRoot(fragmentedPlan, !explainAnalyze, extractConnectors(analysis));
}

Analyzer's analyze method:

//rewrite the SQL statement
Statement rewrittenStatement = StatementRewrite.rewrite(session, metadata, sqlParser, queryExplainer, statement, parameters, accessControl);
//initialize the Analysis
Analysis analysis = new Analysis(rewrittenStatement, parameters, isDescribe);
//create the statement analyzer
StatementAnalyzer analyzer = new StatementAnalyzer(analysis, metadata, sqlParser, accessControl, session);
//invoke the statement analyzer
analyzer.analyze(rewrittenStatement, Optional.empty());

Concretely, analyze walks the AST and analyzes each kind of node; the work is mostly fetching metadata and validating it.
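
The analysis is visitor-style: each AST node type has its own visit method. The following self-contained toy (the node types and catalog here are hypothetical stand-ins, not Presto classes) illustrates the pattern of walking nodes and validating them against metadata:

import java.util.Set;

public class ToyAnalysis
{
    interface AstNode { <R> R accept(AstVisitor<R> visitor); }

    static final class TableRef implements AstNode
    {
        final String name;
        TableRef(String name) { this.name = name; }
        public <R> R accept(AstVisitor<R> v) { return v.visitTable(this); }
    }

    static final class Select implements AstNode
    {
        final AstNode from;
        Select(AstNode from) { this.from = from; }
        public <R> R accept(AstVisitor<R> v) { return v.visitSelect(this); }
    }

    interface AstVisitor<R>
    {
        R visitTable(TableRef node);
        R visitSelect(Select node);
    }

    // "semantic analysis": every referenced table must exist in the (fake) catalog
    static final class ToyAnalyzer implements AstVisitor<Void>
    {
        private final Set<String> catalog;
        ToyAnalyzer(Set<String> catalog) { this.catalog = catalog; }

        public Void visitTable(TableRef node)
        {
            if (!catalog.contains(node.name)) {
                throw new IllegalStateException("Table not found: " + node.name);
            }
            return null;
        }

        public Void visitSelect(Select node)
        {
            return node.from.accept(this);   // recurse into children
        }
    }

    public static void main(String[] args)
    {
        AstNode ast = new Select(new TableRef("orders"));
        ast.accept(new ToyAnalyzer(Set.of("orders", "lineitem")));   // passes validation
    }
}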

2.2.2 Generating the Logical Execution Plan

LogicalPlanner's plan method:

PlanNode root = planStatement(analysis, analysis.getStatement());
//check the validity of the intermediate plan
PlanSanityChecker.validateIntermediatePlan(root, session, metadata, sqlParser, symbolAllocator.getTypes());
//optimize the generated logical plan
root = optimizer.optimize(root, session, symbolAllocator.getTypes(), symbolAllocator, idAllocator);

LogicalPlanner's planStatement method:

//for an ordinary SQL query, only the following runs
createOutputPlan(planStatementWithoutOutput(analysis, statement), analysis);

LogicalPlanner's planStatementWithoutOutput method:

private RelationPlan planStatementWithoutOutput(Analysis analysis, Statement statement)
    {
        if (statement instanceof CreateTableAsSelect) {
            if (analysis.isCreateTableAsSelectNoOp()) {
                throw new PrestoException(NOT_SUPPORTED, "CREATE TABLE IF NOT EXISTS is not supported in this context " + statement.getClass().getSimpleName());
            }
            return createTableCreationPlan(analysis, ((CreateTableAsSelect) statement).getQuery());
        }
        else if (statement instanceof Insert) {
            checkState(analysis.getInsert().isPresent(), "Insert handle is missing");
            return createInsertPlan(analysis, (Insert) statement);
        }
        else if (statement instanceof Delete) {
            return createDeletePlan(analysis, (Delete) statement);
        }
        else if (statement instanceof Query) {
            return createRelationPlan(analysis, (Query) statement);
        }
        else if (statement instanceof Explain && ((Explain) statement).isAnalyze()) {
            return createExplainAnalyzePlan(analysis, (Explain) statement);
        }
        else {
            throw new PrestoException(NOT_SUPPORTED, "Unsupported statement type " + statement.getClass().getSimpleName());
        }
    }

LogicalPlanner's createRelationPlan method:

return new RelationPlanner(analysis, symbolAllocator, idAllocator, buildLambdaDeclarationToSymbolMap(analysis, symbolAllocator), metadata, session)
        .process(query, null);

RelationPlanner's implementation walks the AST and emits the corresponding nodes of the logical execution plan.

Common logical-plan nodes and visit operations are listed below:

AggregationNode            aggregation node; comes in FINAL, PARTIAL and SINGLE flavors (final, partial and single-node aggregation). Before optimization every aggregation is SINGLE; the optimizer splits it into a partial plus a final aggregation, much like a map-side combine followed by the reduce-side aggregation in an MR job

DeleteNode                 node for DELETE operations
ExchangeNode               node that exchanges data between stages in the logical plan
FilterNode                 node that applies a filter
JoinNode                   node that performs a join
LimitNode                  node that applies a limit
MarkDistinctNode           handles count distinct
OutputNode                 output node

ProjectNode                maps the output columns of the child node onto the parent node; e.g. for select a + 1 from b it maps TableScanNode's column a, plus 1, onto the OutputNode

RemoteSourceNode           like ExchangeNode, but exchanges data between stages in the distributed plan
SampleNode                 node for sampling functions
RowNumberNode              handles the row_number window function
SortNode                   sorting node
TableScanNode              reads table data
TableWriterNode            writes table data
TopNNode                   order by ... limit uses the more efficient TopNNode
UnionNode                  handles UNION
WindowNode                 handles window functions

RelationPlanner's visit operations:

visitTable                 produces a TableScanNode
visitAliasedRelation       handles a relation with an alias
visitSampledRelation       adds a SampleNode, mainly for sampling functions

visitJoin                  depending on the join type, builds different node structures; typically the left and right sides are each planned, a ProjectNode is added on each side, a JoinNode connects them in the middle, and a FilterNode carrying the join condition sits on top

visitQuery                 delegates the Query to QueryPlanner and returns the resulting plan
visitQuerySpecification    delegates the QueryBody to QueryPlanner and returns the resulting plan
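
As a rough illustration (a conceptual sketch, not actual EXPLAIN output), a simple query such as select a + 1 from b where c > 0 ends up, before optimization, as a node tree like:

OutputNode (a + 1)
  ProjectNode (expr := a + 1)
    FilterNode (c > 0)
      TableScanNode (b)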

QueryPlanner's plan operations. (A queryBody is essentially a complete SQL query and can be nested, e.g. select a from (QueryBody) b; the inner QueryBody is usually planned further as an AliasedRelation.)

Comparing Query and QuerySpecification: QuerySpecification is the concrete, complete QueryBody (the full SELECT ... form), while a Query contains a QueryBody. QueryBody is an abstract class and QuerySpecification extends it.

plan(Query query)                 first takes the Query's queryBody and hands it to RelationPlanner, whose visitQuerySpecification calls back into QueryPlanner's plan method
plan(QuerySpecification query)    builds all the component nodes of one queryBody

Below is the core flow of plan(QuerySpecification query):

PlanBuilder builder = planFrom(node);                      //builder.root is the node tree built so far
RelationPlan fromRelationPlan = builder.getRelationPlan(); //produces the TableScanNode
builder = filter(builder, analysis.getWhere(node), node);  //produces a FilterNode
builder = aggregate(builder, node);                        //produces an AggregationNode
builder = filter(builder, analysis.getHaving(node), node); //produces a FilterNode if there is a HAVING clause
builder = window(builder, node);                           //produces a WindowNode
List<Expression> outputs = analysis.getOutputExpressions(node);
builder = handleSubqueries(builder, node, outputs);
if (node.getOrderBy().isPresent() && !SystemSessionProperties.isLegacyOrderByEnabled(session)) {
    if (analysis.getGroupingSets(node).isEmpty()) {
        builder = project(builder, outputs, fromRelationPlan);
        outputs = toSymbolReferences(computeOutputs(builder, outputs));
        builder = planBuilderFor(builder, analysis.getScope(node.getOrderBy().get()));
    }
    else {
        List<Expression> orderByAggregates = analysis.getOrderByAggregates(node.getOrderBy().get());
        builder = project(builder, Iterables.concat(outputs, orderByAggregates));
        outputs = toSymbolReferences(computeOutputs(builder, outputs));
        List<Expression> complexOrderByAggregatesToRemap = orderByAggregates.stream()
                .filter(expression -> !analysis.isColumnReference(expression))
                .collect(toImmutableList());
        builder = planBuilderFor(builder, analysis.getScope(node.getOrderBy().get()), complexOrderByAggregatesToRemap);
    }
    builder = window(builder, node.getOrderBy().get());
}
List<Expression> orderBy = analysis.getOrderByExpressions(node);
builder = handleSubqueries(builder, node, orderBy);
builder = project(builder, Iterables.concat(orderBy, outputs));
builder = distinct(builder, node);
builder = sort(builder, node);
builder = project(builder, outputs);
builder = limit(builder, node);
return new RelationPlan(
        builder.getRoot(),
        analysis.getScope(node),
        computeOutputs(builder, outputs));
2.2.3 Logical Plan Optimization

LogicalPlanner's plan method:

root = optimizer.optimize(root, session, symbolAllocator.getTypes(), symbolAllocator, idAllocator);

The optimizer's job is simply to run each concrete optimizer implementation, one after another, over the node tree (logical plan) produced in the previous step.
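
A minimal sketch of that loop (consistent with the optimize() call shown above, though not the verbatim LogicalPlanner source):

PlanNode root = planStatement(analysis, analysis.getStatement());
for (PlanOptimizer optimizer : planOptimizers) {
    // each optimizer takes the whole plan tree and returns a (possibly) rewritten tree
    root = optimizer.optimize(root, session, symbolAllocator.getTypes(), symbolAllocator, idAllocator);
}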

The concrete optimizers are:

AddExchanges                   //turns the plan into a distributed plan, e.g. adds partial and final aggregations
AddLocalExchanges
BeginTableWrite
CanonicalizeExpressions        //normalizes expressions in the plan, e.g. rewrites is not null as not(is null) and IF into CASE WHEN
CheckSubqueryNodesAreRewritten
CountConstantOptimizer         //rewrites count over a constant (e.g. count(1)) as count(*) for better compatibility across data sources
DesugaringOptimizer
DetermineJoinDistributionType
EliminateCrossJoins
EmptyDeleteOptimizer
HashGenerationOptimizer     //precomputes hash values
ImplementIntersectAndExceptAsUnion
IndexJoinOptimizer          //rewrites a join as an index join, using the joined table's index to speed things up
IterativeOptimizer
LimitPushDown               //pushes limits down to reduce the amount of data flowing through lower nodes
MetadataDeleteOptimizer
MetadataQueryOptimizer      //rewrites aggregations over partition columns as metadata-only queries, avoiding table reads
OptimizeMixedDistinctAggregations
PickLayout
PredicatePushDown           //pushes predicates (filter conditions) down to reduce the data volume of lower nodes
ProjectionPushDown          //pushes ProjectNodes down to reduce the data volume of Union nodes
PruneUnreferencedOutputs    //removes ProjectNode columns that do not appear in the final output, reducing computation
PruneRedundantProjections   //removes redundant ProjectNodes; if a node maps every column straight through, that layer is dropped
PushTableWriteThroughUnion
RemoveUnreferencedScalarLateralNodes
SetFlatteningOptimizer         //merges Union branches that can be merged
SimplifyExpressions            //simplifies and optimizes the expressions appearing in the plan
TransformCorrelatedNoAggregationSubqueryToJoin
TransformCorrelatedScalarAggregationToJoin
TransformCorrelatedSingleRowSubqueryToProject
TransformQuantifiedComparisonApplyToLateralJoin
TransformUncorrelatedInPredicateSubqueryToSemiJoin
TransformUncorrelatedLateralToJoin
UnaliasSymbolReferences        //removes pointless mappings in ProjectNodes; a column that is passed through unchanged is referenced directly by the parent node
WindowFilterPushDown
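
For instance, PredicatePushDown conceptually turns a filter sitting above a join into a filter below it (a sketch, not actual optimizer output), so the join processes fewer rows:

before PredicatePushDown:
  Filter (t1.c > 0)
    Join (t1.id = t2.id)
      TableScan (t1)
      TableScan (t2)

after PredicatePushDown:
  Join (t1.id = t2.id)
    Filter (c > 0)
      TableScan (t1)
    TableScan (t2)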

2.3 Generating the Distributed Execution Plan

2.3.1 Fragmenting the Logical Plan

This phase splits the logical plan built above into multiple stages. A stage falls into one of four partitioning types:

Source, Fixed, Single, Coordinator_only

Source: typically TableScanNode, ProjectNode and FilterNode; the most downstream stage, the one that actually reads the data

Fixed: usually follows Source; spreads the data produced by the Source stage across multiple nodes, much like a map-side reduce, covering partial aggregation, partial joins and partial data writes

Single: usually follows Fixed; runs on a single node, gathers all results, performs the final aggregation or global sort, and hands the result to the Coordinator

Coordinator_only: runs only on the Coordinator
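
A hand-worked illustration (a sketch of the typical shape, not actual EXPLAIN output) of how a simple aggregation splits across these stage types:

select count(*) from t group by a

Stage 2 (Source): TableScanNode(t) -> AggregationNode(PARTIAL, group by a)
Stage 1 (Fixed):  RemoteSourceNode -> AggregationNode(FINAL, group by a)
Stage 0 (Single): RemoteSourceNode -> OutputNode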

SqlQueryExecution's doAnalyzeQuery method:

SubPlan fragmentedPlan = PlanFragmenter.createSubPlans(stateMachine.getSession(), metadata, nodePartitioningManager, plan, false);

//the structure of SubPlan
private final PlanFragment fragment;
private final List<SubPlan> children;

As you can see, SubPlan is a tree: each node holds one plan fragment plus its child SubPlans, so the logical plan has now been cut into a hierarchy of fragments.
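
A small hypothetical helper (not part of the Presto source, assuming Guava's Strings.repeat and the getters of the two fields above) that walks this tree and prints one line per fragment, indented by depth, makes the hierarchy easy to see:

private static void printFragments(SubPlan subPlan, int depth)
{
    // one line per fragment, indented by its depth in the SubPlan tree
    System.out.println(Strings.repeat("  ", depth) + subPlan.getFragment().getId());
    for (SubPlan child : subPlan.getChildren()) {
        printFragments(child, depth + 1);
    }
}
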
2.3.2 Building the Distributed Execution Plan
2.3.2.1 Obtaining the SplitSource Splits

SqlQueryExecution's start method:

//build the fragmented execution plan
PlanRoot plan = analyzeQuery();
//build the distributed execution plan
planDistribution(plan);

SqlQueryExecution's planDistribution method:

//get the stage execution plan
StageExecutionPlan outputStageExecutionPlan = distributedPlanner.plan(plan.getRoot(), stateMachine.getSession());
//create the SqlQuery scheduler
SqlQueryScheduler scheduler = new SqlQueryScheduler(
                //state machine (state listener)
                stateMachine,
                locationFactory,
                outputStageExecutionPlan,
                nodePartitioningManager,
                //the core module that assigns tasks to nodes
                nodeScheduler,
                remoteTaskFactory,
                stateMachine.getSession(),
                plan.isSummarizeTaskInfos(),
                scheduleSplitBatchSize,
                queryExecutor,
                schedulerExecutor,
                failureDetector,
                rootOutputBuffers,
                //keeps the mapping between this stage's tasks and their nodes
                nodeTaskMap,
                executionPolicy,
                schedulerStats);

DistributedExecutionPlanner's plan method:

The call chain is plan -> doPlan

private StageExecutionPlan doPlan(SubPlan root, Session session, ImmutableList.Builder<SplitSource> allSplitSources)
{
PlanFragment currentFragment = root.getFragment();
//walk the fragment with a visitor: for each TableScanNode, call splitManager.getSplits() to obtain the splits; the implementation here is HiveSplitManager, whose internals call HiveSplitLoader.start(), detailed below
//this seems to imply that a stage can only contain one table scan???
Map<PlanNodeId, SplitSource> splitSources = currentFragment.getRoot().accept(new Visitor(session, currentFragment.getPipelineExecutionStrategy(), allSplitSources), null);
ImmutableList.Builder<StageExecutionPlan> dependencies = ImmutableList.builder();
for (SubPlan childPlan : root.getChildren()) {
    dependencies.add(doPlan(childPlan, session, allSplitSources)); //recursive call: every child sub-plan is turned into a StageExecutionPlan, preserving the hierarchy
}
return new StageExecutionPlan(
        currentFragment,
        splitSources,
        dependencies.build());
}

splitManager.getSplits() is called on every TableScanNode to obtain its splits, and the results are stored in Map<PlanNodeId, SplitSource> splitSources (i.e. the SplitSource for each plan node):

public Map<PlanNodeId, SplitSource> visitTableScan(TableScanNode node, Void context)
{
    // get dataSource for table
    SplitSource splitSource = splitManager.getSplits(
            session,
            node.getLayout().get(),
            pipelineExecutionStrategy == GROUPED_EXECUTION ? GROUPED_SCHEDULING : UNGROUPED_SCHEDULING);
    splitSources.add(splitSource);
    return ImmutableMap.of(node.getId(), splitSource);
}

HiveSplitManager implements ConnectorSplitManager (a Presto SPI interface):

public ConnectorSplitSource getSplits(ConnectorTransactionHandle transaction, ConnectorSession session, ConnectorTableLayoutHandle layoutHandle, SplitSchedulingStrategy splitSchedulingStrategy)
{
    HiveTableLayoutHandle layout = (HiveTableLayoutHandle) layoutHandle;
    SchemaTableName tableName = layout.getSchemaTableName();

    // get table metadata
    SemiTransactionalHiveMetastore metastore = metastoreProvider.apply((HiveTransactionHandle) transaction);
    Table table = metastore.getTable(tableName.getSchemaName(), tableName.getTableName())
            .orElseThrow(() -> new TableNotFoundException(tableName));

    // verify table is not marked as non-readable
    String tableNotReadable = table.getParameters().get(OBJECT_NOT_READABLE);
    if (!isNullOrEmpty(tableNotReadable)) {
        throw new HiveNotReadableException(tableName, Optional.empty(), tableNotReadable);
    }

    // fetch the Hive partitions
    List<HivePartition> partitions = layout.getPartitions()
            .orElseThrow(() -> new PrestoException(GENERIC_INTERNAL_ERROR, "Layout does not contain partitions"));

    // short circuit if we don't have any partitions
    HivePartition partition = Iterables.getFirst(partitions, null);
    if (partition == null) {
        return new FixedSplitSource(ImmutableList.of());
    }

    // get buckets from first partition (arbitrary)
    List<HiveBucket> buckets = partition.getBuckets();

    // validate bucket bucketed execution
    Optional<HiveBucketHandle> bucketHandle = layout.getBucketHandle();
    if ((splitSchedulingStrategy == GROUPED_SCHEDULING) && !bucketHandle.isPresent()) {
        throw new PrestoException(GENERIC_INTERNAL_ERROR, "SchedulingPolicy is bucketed, but BucketHandle is not present");
    }

    // sort partitions
    partitions = Ordering.natural().onResultOf(HivePartition::getPartitionId).reverse().sortedCopy(partitions);

    Iterable<HivePartitionMetadata> hivePartitions = getPartitionMetadata(metastore, table, tableName, partitions, bucketHandle.map(HiveBucketHandle::toBucketProperty));

    HiveSplitLoader hiveSplitLoader = new BackgroundHiveSplitLoader(
            table,
            hivePartitions,
            layout.getCompactEffectivePredicate(),
            createBucketSplitInfo(bucketHandle, buckets),
            session,
            hdfsEnvironment,
            namenodeStats,
            directoryLister,
            executor,
            splitLoaderConcurrency,
            recursiveDfsWalkerEnabled);

    HiveSplitSource splitSource;
    switch (splitSchedulingStrategy) {
        case UNGROUPED_SCHEDULING:
            splitSource = HiveSplitSource.allAtOnce(
                    session,
                    table.getDatabaseName(),
                    table.getTableName(),
                    layout.getCompactEffectivePredicate(),
                    maxInitialSplits,
                    maxOutstandingSplits,
                    maxOutstandingSplitsSize,
                    hiveSplitLoader,
                    executor,
                    new CounterStat());
            break;
        case GROUPED_SCHEDULING:
            splitSource = HiveSplitSource.bucketed(
                    session,
                    table.getDatabaseName(),
                    table.getTableName(),
                    layout.getCompactEffectivePredicate(),
                    maxInitialSplits,
                    maxOutstandingSplits,
                    new DataSize(32, MEGABYTE),
                    hiveSplitLoader,
                    executor,
                    new CounterStat());
            break;
        default:
            throw new IllegalArgumentException("Unknown splitSchedulingStrategy: " + splitSchedulingStrategy);
    }
    hiveSplitLoader.start(splitSource);

    return splitSource;
}
2.3.2.2 Producing the Stage Execution Plan

The code above produced a StageExecutionPlan; its structure looks like this:

private final PlanFragment fragment;                       //the current plan fragment
private final Map<PlanNodeId, SplitSource> splitSources;   //the split mapping obtained from HiveSplitManager
private final List<StageExecutionPlan> subStages;          //child stage execution plans
private final Optional<List<String>> fieldNames;           //field names

After planDistribution, the fragmented logical plan has been turned into a stage execution plan. Since Presto schedules tasks per stage, SqlQueryScheduler next builds the SqlStage executors.

SqlQueryScheduler's constructor:

List<SqlStageExecution> stages = createStages(
                (fragmentId, tasks, noMoreExchangeLocations) -> updateQueryOutputLocations(queryStateMachine, rootBufferId, tasks, noMoreExchangeLocations),
                new AtomicInteger(),
                locationFactory,
                plan.withBucketToPartition(Optional.of(new int[1])),
                nodeScheduler,
                remoteTaskFactory,
                session,
                splitBatchSize,
                partitioningHandle -> partitioningCache.computeIfAbsent(partitioningHandle, handle -> nodePartitioningManager.getNodePartitioningMap(session, handle)),
                nodePartitioningManager,
                queryExecutor,
                schedulerExecutor,
                failureDetector,
                nodeTaskMap,
                stageSchedulers,
                stageLinkages);

SqlStageExecution rootStage = stages.get(0);
2.3.2.3 Producing the Stage Executors

SqlQueryScheduler's createStages method:

ImmutableList.Builder<SqlStageExecution> stages = ImmutableList.builder();

StageId stageId = new StageId(queryStateMachine.getQueryId(), nextStageId.getAndIncrement());
SqlStageExecution stage = new SqlStageExecution(  //create the SqlStageExecution for the current stage
        stageId,
        locationFactory.createStageLocation(stageId),
        plan.getFragment(),
        remoteTaskFactory,
        session,
        summarizeTaskInfo,
        nodeTaskMap,
        queryExecutor,
        failureDetector,
        schedulerStats);
stages.add(stage);

...
...
//the steps that create the stage scheduler and the placement policy are omitted here; see 2.4.3 for details

ImmutableSet.Builder<SqlStageExecution> childStagesBuilder = ImmutableSet.builder();
        for (StageExecutionPlan subStagePlan : plan.getSubStages()) {
            List<SqlStageExecution> subTree = createStages( //recursively create all child SqlStageExecutions
                    stage::addExchangeLocations,
                    nextStageId,
                    locationFactory,
                    subStagePlan.withBucketToPartition(bucketToPartition),
                    nodeScheduler,
                    remoteTaskFactory,
                    session,
                    splitBatchSize,
                    partitioningCache,
                    nodePartitioningManager,
                    queryExecutor,
                    schedulerExecutor,
                    failureDetector,
                    nodeTaskMap,
                    stageSchedulers,
                    stageLinkages);
            stages.addAll(subTree);

            SqlStageExecution childStage = subTree.get(0);
            childStagesBuilder.add(childStage);
        }
Set<SqlStageExecution> childStages = childStagesBuilder.build();

At this point every SqlStageExecution has been created. A simplified view of SqlStageExecution:

private final StageStateMachine stateMachine;       //stage state machine (state listener)
private final RemoteTaskFactory remoteTaskFactory;  //factory that creates tasks
private final NodeTaskMap nodeTaskMap;              //mapping between this stage's tasks and the nodes they run on
private final Map<Node, Set<RemoteTask>> tasks = new ConcurrentHashMap<>();
private final AtomicInteger nextTaskId = new AtomicInteger();
private final Set<TaskId> allTasks = newConcurrentHashSet();
private final Set<TaskId> finishedTasks = newConcurrentHashSet();
private final Multimap<PlanNodeId, RemoteTask> sourceTasks = HashMultimap.create();

2.4 Scheduling the Distributed Execution Plan

2.4.1 Scheduling-Related Service Classes

First, the NodeScheduler class referenced from SqlQueryExecution above.

Main fields:
InternalNodeManager nodeManager       //provides the list of live nodes, kept in a NodeMap and refreshed periodically (every 5 seconds by default)

Main methods:
List<Node> selectNodes                //pick a list of live nodes
NodeSelector createNodeSelector       //returns a NodeSelector, which contains the task-assignment algorithms used for each stage
ResettableRandomizedIterator<Node> randomizedNodes  //shuffles the given NodeMap

The InternalNodeManager interface is defined as:

public interface InternalNodeManager
{
    Set<Node> getNodes(NodeState state);                         //get the nodes in the given state
    Set<Node> getActiveConnectorNodes(ConnectorId connectorId);  //get the nodes for the given connectorId
    Node getCurrentNode();       //the current node's info
    Set<Node> getCoordinators(); //the list of Coordinators
    AllNodes getAllNodes();      //all known nodes
    void refreshNodes();         //refresh node information
}
//DiscoveryNodeManager is the implementation

The NodeSelector interface is defined as:

public interface NodeSelector
{
    void lockDownNodes();
    List<Node> allNodes();                   //all nodes
    Node selectCurrentNode();                //the current node
    List<Node> selectRandomNodes(int limit);  //pick limit random nodes
    List<Node> selectRandomNodes(int limit, Set<Node> excludedNodes);  //pick limit random nodes, excluding the given ones
    SplitPlacementResult computeAssignments(Set<Split> splits, List<RemoteTask> existingTasks);
    SplitPlacementResult computeAssignments(Set<Split> splits, List<RemoteTask> existingTasks, NodePartitionMap partitioning);
}
//implementations: SimpleNodeSelector and TopologyAwareNodeSelector; Presto chooses the NodeSelector based on the configured network topology

//in NodeScheduler's constructor, any topology setting other than LEGACY is treated as topology-aware; LEGACY is the (default) legacy mode that simply ignores network topology
this.useNetworkTopology = !config.getNetworkTopology().equals(NetworkTopologyType.LEGACY);

//createNodeSelector instantiates the NodeSelector
if (useNetworkTopology) {
    //so when a non-legacy topology (e.g. FLAT) is configured, the instantiated NodeSelector is a TopologyAwareNodeSelector
    return new TopologyAwareNodeSelector(
            nodeManager,
            nodeTaskMap,
            includeCoordinator,
            nodeMap,
            minCandidates,
            maxSplitsPerNode,
            maxPendingSplitsPerTask,
            topologicalSplitCounters,
            networkLocationSegmentNames,
            networkLocationCache);
}
else {
    return new SimpleNodeSelector(nodeManager, nodeTaskMap, includeCoordinator, nodeMap, minCandidates, maxSplitsPerNode, maxPendingSplitsPerTask);
}

The Node interface is defined as:

public interface Node
{
    HostAddress getHostAndPort();   //host and port
    URI getHttpUri();               //URL
    String getNodeIdentifier();
    String getVersion();            //version
    boolean isCoordinator();        //whether this node is a Coordinator
}

The createNodeSelector flow:

public NodeSelector createNodeSelector(ConnectorId connectorId)
  {
      //uses Guava's memoizing Supplier with a 5-second expiration as a cache
      Supplier<NodeMap> nodeMap = Suppliers.memoizeWithExpiration(() -> {
          ImmutableSetMultimap.Builder<HostAddress, Node> byHostAndPort = ImmutableSetMultimap.builder();
          ImmutableSetMultimap.Builder<InetAddress, Node> byHost = ImmutableSetMultimap.builder();
          ImmutableSetMultimap.Builder<NetworkLocation, Node> workersByNetworkPath = ImmutableSetMultimap.builder();
          Set<Node> nodes;
          if (connectorId != null) {
              nodes = nodeManager.getActiveConnectorNodes(connectorId);
          }
          else {
              nodes = nodeManager.getNodes(ACTIVE);
          }

          Set<String> coordinatorNodeIds = nodeManager.getCoordinators().stream()
                  .map(Node::getNodeIdentifier)
                  .collect(toImmutableSet());
          for (Node node : nodes) {
              if (useNetworkTopology && (includeCoordinator || !coordinatorNodeIds.contains(node.getNodeIdentifier()))) {
                  NetworkLocation location = networkLocationCache.get(node.getHostAndPort());
                  for (int i = 0; i <= location.getSegments().size(); i++) {
                      workersByNetworkPath.put(location.subLocation(0, i), node);
                  }
              }
              try {
                  byHostAndPort.put(node.getHostAndPort(), node);

                  InetAddress host = InetAddress.getByName(node.getHttpUri().getHost());
                  byHost.put(host, node);
              }
              catch (UnknownHostException e) {
                  // ignore
              }
          }

          return new NodeMap(byHostAndPort.build(), byHost.build(), workersByNetworkPath.build(), coordinatorNodeIds);
      }, 5, TimeUnit.SECONDS);
      if (useNetworkTopology) {
          return new TopologyAwareNodeSelector(
                  nodeManager,
                  nodeTaskMap,
                  includeCoordinator,
                  nodeMap,
                  minCandidates,
                  maxSplitsPerNode,
                  maxPendingSplitsPerTask,
                  topologicalSplitCounters,
                  networkLocationSegmentNames,
                  networkLocationCache);
      }
      else {
          return new SimpleNodeSelector(nodeManager, nodeTaskMap, includeCoordinator, nodeMap, minCandidates, maxSplitsPerNode, maxPendingSplitsPerTask);
      }
  }
2.4.2 Node Selection Strategies

The strategies for Single and Fixed stages are simple: both just call selectRandomNodes.
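
A minimal sketch of the idea (not the verbatim NodeSelector code, and assuming the NodeMap exposes the by-host multimap built in createNodeSelector): shuffle the live workers and take the first limit of them.

public List<Node> selectRandomNodes(int limit)
{
    // nodeMap is the cached Supplier<NodeMap> built in createNodeSelector above
    List<Node> nodes = new ArrayList<>(nodeMap.get().get().getNodesByHost().values());
    Collections.shuffle(nodes);                       // random order
    return ImmutableList.copyOf(nodes.subList(0, Math.min(limit, nodes.size())));
}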

2.4.3 Creating the Stage Scheduler and Placement Policy

Continuing with the code omitted in 2.3.2.3:

Optional<int[]> bucketToPartition;
PartitioningHandle partitioningHandle = plan.getFragment().getPartitioning();

    // create a different stage scheduler depending on the stage type
if (partitioningHandle.equals(SOURCE_DISTRIBUTION)) {
    // nodes are selected dynamically based on the constraints of the splits and the system load
    Entry<PlanNodeId, SplitSource> entry = Iterables.getOnlyElement(plan.getSplitSources().entrySet());
    PlanNodeId planNodeId = entry.getKey();
    SplitSource splitSource = entry.getValue();
    ConnectorId connectorId = splitSource.getConnectorId();
    if (isInternalSystemConnector(connectorId)) {
        connectorId = null;
    }
    //create a NodeSelector (backed by nodeManager) to pick the nodes to execute on
    NodeSelector nodeSelector = nodeScheduler.createNodeSelector(connectorId);
    //dynamic split placement policy
    SplitPlacementPolicy placementPolicy = new DynamicSplitPlacementPolicy(nodeSelector, stage::getAllTasks);

    checkArgument(plan.getFragment().getPipelineExecutionStrategy() == UNGROUPED_EXECUTION);
  
    //Source stages use simpleSourcePartitionedScheduler
    stageSchedulers.put(stageId, simpleSourcePartitionedScheduler(stage, planNodeId, splitSource, placementPolicy, splitBatchSize));
    bucketToPartition = Optional.of(new int[1]);
}
else if (partitioningHandle.equals(SCALED_WRITER_DISTRIBUTION)) {
    bucketToPartition = Optional.of(new int[1]);
}
else {
    // nodes are pre determined by the nodePartitionMap
    NodePartitionMap nodePartitionMap = partitioningCache.apply(plan.getFragment().getPartitioning());
    long nodeCount = nodePartitionMap.getPartitionToNode().values().stream().distinct().count();
    OptionalInt concurrentLifespansPerTask = getConcurrentLifespansPerNode(session);

    Map<PlanNodeId, SplitSource> splitSources = plan.getSplitSources();
    //if a Fixed stage was given SplitSources, use FixedSourcePartitionedScheduler, which creates its own FixedSplitPlacementPolicy internally
    if (!splitSources.isEmpty()) {
        List<PlanNodeId> schedulingOrder = plan.getFragment().getPartitionedSources();
        List<ConnectorPartitionHandle> connectorPartitionHandles;
        switch (plan.getFragment().getPipelineExecutionStrategy()) {
            case GROUPED_EXECUTION:
                connectorPartitionHandles = nodePartitioningManager.listPartitionHandles(session, partitioningHandle);
                checkState(!ImmutableList.of(NOT_PARTITIONED).equals(connectorPartitionHandles));
                break;
            case UNGROUPED_EXECUTION:
                connectorPartitionHandles = ImmutableList.of(NOT_PARTITIONED);
                break;
            default:
                throw new UnsupportedOperationException();
        }
        stageSchedulers.put(stageId, new FixedSourcePartitionedScheduler(
                stage,
                splitSources,
                plan.getFragment().getPipelineExecutionStrategy(),
                schedulingOrder,
                nodePartitionMap,
                splitBatchSize,
                concurrentLifespansPerTask.isPresent() ? OptionalInt.of(toIntExact(concurrentLifespansPerTask.getAsInt() * nodeCount)) : OptionalInt.empty(),
                nodeScheduler.createNodeSelector(null),
                connectorPartitionHandles));
        bucketToPartition = Optional.of(nodePartitionMap.getBucketToPartition());
    }
    else {
        //the list of live nodes
        Map<Integer, Node> partitionToNode = nodePartitionMap.getPartitionToNode();
        // todo this should asynchronously wait a standard timeout period before failing
        checkCondition(!partitionToNode.isEmpty(), NO_NODES_AVAILABLE, "No worker nodes available");

        //if a Fixed stage has no SplitSources, use FixedCountScheduler
        stageSchedulers.put(stageId, new FixedCountScheduler(stage, partitionToNode));
        bucketToPartition = Optional.of(nodePartitionMap.getBucketToPartition());
    }
}
2.4.4 The SqlQuery Scheduler Starts Scheduling

scheduler.start() kicks off SqlQueryScheduler's scheduling loop, which in turn drives task scheduling.

public void start()
{
    if (started.compareAndSet(false, true)) {
        executor.submit(this::schedule);
    }
}

The method reference then calls schedule():

 
private void schedule()
{
    try (SetThreadName ignored = new SetThreadName("Query-%s", queryStateMachine.getQueryId())) {
        Set<StageId> completedStages = new HashSet<>();
        ExecutionSchedule executionSchedule = executionPolicy.createExecutionSchedule(stages.values());
        while (!executionSchedule.isFinished()) {
            List<ListenableFuture<?>> blockedStages = new ArrayList<>();
            for (SqlStageExecution stage : executionSchedule.getStagesToSchedule()) {
                stage.beginScheduling();

                // invoke each stage's stage scheduler to schedule its tasks
                // perform some scheduling work
                ScheduleResult result = stageSchedulers.get(stage.getStageId())
                        .schedule();

                // modify parent and children based on the results of the scheduling
                if (result.isFinished()) {
                    stage.schedulingComplete();
                }
                else if (!result.getBlocked().isDone()) {
                    blockedStages.add(result.getBlocked());
                }
                stageLinkages.get(stage.getStageId())
                        .processScheduleResults(stage.getState(), result.getNewTasks());
                schedulerStats.getSplitsScheduledPerIteration().add(result.getSplitsScheduled());
                if (result.getBlockedReason().isPresent()) {
                    switch (result.getBlockedReason().get()) {
                        case WRITER_SCALING:
                            // no-op
                            break;
                        case WAITING_FOR_SOURCE:
                            schedulerStats.getWaitingForSource().update(1);
                            break;
                        case SPLIT_QUEUES_FULL:
                            schedulerStats.getSplitQueuesFull().update(1);
                            break;
                        case MIXED_SPLIT_QUEUES_FULL_AND_WAITING_FOR_SOURCE:
                        case NO_ACTIVE_DRIVER_GROUP:
                            break;
                        default:
                            throw new UnsupportedOperationException("Unknown blocked reason: " + result.getBlockedReason().get());
                    }
                }
            }

            // make sure to update stage linkage at least once per loop to catch async state changes (e.g., partial cancel)
            for (SqlStageExecution stage : stages.values()) {
                if (!completedStages.contains(stage.getStageId()) && stage.getState().isDone()) {
                    stageLinkages.get(stage.getStageId())
                            .processScheduleResults(stage.getState(), ImmutableSet.of());
                    completedStages.add(stage.getStageId());
                }
            }

            // wait for a state change and then schedule again
            if (!blockedStages.isEmpty()) {
                try (TimeStat.BlockTimer timer = schedulerStats.getSleepTime().time()) {
                    tryGetFutureValue(whenAnyComplete(blockedStages), 1, SECONDS);
                }
                for (ListenableFuture<?> blockedStage : blockedStages) {
                    blockedStage.cancel(true);
                }
            }
        }

        for (SqlStageExecution stage : stages.values()) {
            StageState state = stage.getState();
            if (state != SCHEDULED && state != RUNNING && !state.isDone()) {
                throw new PrestoException(GENERIC_INTERNAL_ERROR, format("Scheduling is complete, but stage %s is in state %s", stage.getStageId(), state));
            }
        }
    }
}
2.4.5 Stage Schedulers Start Scheduling

There are three main kinds of stage schedulers:

(1) Source stages

  • SourcePartitionedScheduler

(2) Fixed stages

  • FixedCountScheduler
  • FixedSourcePartitionedScheduler

There are two main split placement policies:
(1) DynamicSplitPlacementPolicy

(2) FixedSplitPlacementPolicy

Inside the query scheduler, each stage's scheduler is invoked with:

stageSchedulers.get(stage.getStageId()).schedule();

The first is SourcePartitionedScheduler:

public synchronized ScheduleResult schedule()
{
    int overallSplitAssignmentCount = 0;
    ImmutableSet.Builder<RemoteTask> overallNewTasks = ImmutableSet.builder();
    List<ListenableFuture<?>> overallBlockedFutures = new ArrayList<>();
    boolean anyBlockedOnPlacements = false;
    boolean anyBlockedOnNextSplitBatch = false;
    boolean anyNotBlocked = false;

    for (Entry<Lifespan, ScheduleGroup> entry : scheduleGroups.entrySet()) {
        Lifespan lifespan = entry.getKey();
        ScheduleGroup scheduleGroup = entry.getValue();
        Set<Split> pendingSplits = scheduleGroup.pendingSplits;

        if (scheduleGroup.state != ScheduleGroupState.DISCOVERING_SPLITS) {
            verify(scheduleGroup.nextSplitBatchFuture == null);
        }
        else if (pendingSplits.isEmpty()) {
            // try to get the next batch: if there are no pending splits, start fetching the next batch of splits
            if (scheduleGroup.nextSplitBatchFuture == null) {
                scheduleGroup.nextSplitBatchFuture = splitSource.getNextBatch(scheduleGroup.partitionHandle, lifespan, splitBatchSize - pendingSplits.size());

                long start = System.nanoTime();
                Futures.addCallback(scheduleGroup.nextSplitBatchFuture, new FutureCallback<SplitBatch>()
                {
                    @Override
                    public void onSuccess(SplitBatch result)
                    {
                        stage.recordGetSplitTime(start);
                    }

                    @Override
                    public void onFailure(Throwable t)
                    {
                    }
                });
            }

            if (scheduleGroup.nextSplitBatchFuture.isDone()) {
                SplitBatch nextSplits = getFutureValue(scheduleGroup.nextSplitBatchFuture);
                scheduleGroup.nextSplitBatchFuture = null;
                pendingSplits.addAll(nextSplits.getSplits());
                if (nextSplits.isLastBatch() && scheduleGroup.state == ScheduleGroupState.DISCOVERING_SPLITS) {
                   //if this is the last batch and the schedule group is still discovering splits, mark it as having no more splits
                    scheduleGroup.state = ScheduleGroupState.NO_MORE_SPLITS;
                }
            }
            else {
                overallBlockedFutures.add(scheduleGroup.nextSplitBatchFuture);
                anyBlockedOnNextSplitBatch = true;
                continue;
            }
        }

        Multimap<Node, Split> splitAssignment = ImmutableMultimap.of();
        if (!pendingSplits.isEmpty()) {
            if (!scheduleGroup.placementFuture.isDone()) {
                continue;
            }

            if (state == State.INITIALIZED) {
                state = State.SPLITS_ADDED;
            }

            // compute the split placements: using the policy built earlier, this step assigns the splits to nodes
            SplitPlacementResult splitPlacementResult = splitPlacementPolicy.computeAssignments(pendingSplits);
            splitAssignment = splitPlacementResult.getAssignments();

            // remove splits with successful placements
            splitAssignment.values().forEach(pendingSplits::remove); // AbstractSet.removeAll performs terribly here.
            overallSplitAssignmentCount += splitAssignment.size();

            // if not completed placed, mark scheduleGroup as blocked on placement
            if (!pendingSplits.isEmpty()) {
                scheduleGroup.placementFuture = splitPlacementResult.getBlocked();
                overallBlockedFutures.add(scheduleGroup.placementFuture);
                anyBlockedOnPlacements = true;
            }
        }

        // if no new splits will be assigned, update state and attach completion event
        Multimap<Node, Lifespan> noMoreSplitsNotification = ImmutableMultimap.of();
        if (pendingSplits.isEmpty() && scheduleGroup.state == ScheduleGroupState.NO_MORE_SPLITS) {
            scheduleGroup.state = ScheduleGroupState.DONE;
            if (!lifespan.isTaskWide()) {
                Node node = ((FixedSplitPlacementPolicy) splitPlacementPolicy).getNodeForBucket(lifespan.getId());
                noMoreSplitsNotification = ImmutableMultimap.of(node, lifespan);
            }
        }

        //hand the splits to their nodes for execution: for each node and its splits this returns a RemoteTask, which is then started
        overallNewTasks.addAll(assignSplits(splitAssignment, noMoreSplitsNotification));

        // Assert that "placement future is not done" implies "pendingSplits is not empty".
        // The other way around is not true. One obvious reason is (un)lucky timing, where the placement is unblocked between `computeAssignments` and this line.
        // However, there are other reasons that could lead to this.
        // Note that `computeAssignments` is quite broken:
        // 1. It always returns a completed future when there are no tasks, regardless of whether all nodes are blocked.
        // 2. The returned future will only be completed when a node with an assigned task becomes unblocked. Other nodes don't trigger future completion.
        // As a result, to avoid busy loops caused by 1, we check pendingSplits.isEmpty() instead of placementFuture.isDone() here.
        if (scheduleGroup.nextSplitBatchFuture == null && scheduleGroup.pendingSplits.isEmpty() && scheduleGroup.state != ScheduleGroupState.DONE) {
            anyNotBlocked = true;
        }
    }

    if (autoDropCompletedLifespans) {
        drainCompletedLifespans();
    }

    // * `splitSource.isFinished` invocation may fail after `splitSource.close` has been invoked.
    //   If state is NO_MORE_SPLITS/FINISHED, splitSource.isFinished has previously returned true, and splitSource is closed now.
    // * Even if `splitSource.isFinished()` return true, it is not necessarily safe to tear down the split source.
    //   * If anyBlockedOnNextSplitBatch is true, it means we have not checked out the recently completed nextSplitBatch futures,
    //     which may contain recently published splits. We must not ignore those.
    //   * If any scheduleGroup is still in DISCOVERING_SPLITS state, it means it hasn't realized that there will be no more splits.
    //     Next time it invokes getNextBatch, it will realize that. However, the invocation will fail we tear down splitSource now.
    if ((state == State.NO_MORE_SPLITS || state == State.FINISHED) || (scheduleGroups.isEmpty() && splitSource.isFinished())) {
        switch (state) {
            case INITIALIZED:
                // we have not scheduled a single split so far
                state = State.SPLITS_ADDED;
                ScheduleResult emptySplitScheduleResult = scheduleEmptySplit();
                overallNewTasks.addAll(emptySplitScheduleResult.getNewTasks());
                overallSplitAssignmentCount++;
                // fall through
            case SPLITS_ADDED:
                state = State.NO_MORE_SPLITS;
                splitSource.close();
                // fall through
            case NO_MORE_SPLITS:
                if (!scheduleGroups.isEmpty()) {
                    // we are blocked on split assignment
                    break;
                }
                state = State.FINISHED;
                whenFinishedOrNewLifespanAdded.set(null);
                // fall through
            case FINISHED:
                return new ScheduleResult(
                        true,
                        overallNewTasks.build(),
                        overallSplitAssignmentCount);
            default:
                throw new IllegalStateException("Unknown state");
        }
    }

    if (anyNotBlocked) {
        return new ScheduleResult(false, overallNewTasks.build(), overallSplitAssignmentCount);
    }

    // Only try to finalize task creation when scheduling would block
    overallNewTasks.addAll(finalizeTaskCreationIfNecessary());

    ScheduleResult.BlockedReason blockedReason;
    if (anyBlockedOnNextSplitBatch) {
        blockedReason = anyBlockedOnPlacements ? MIXED_SPLIT_QUEUES_FULL_AND_WAITING_FOR_SOURCE : WAITING_FOR_SOURCE;
    }
    else {
        blockedReason = anyBlockedOnPlacements ? SPLIT_QUEUES_FULL : NO_ACTIVE_DRIVER_GROUP;
    }

    overallBlockedFutures.add(whenFinishedOrNewLifespanAdded);
    return new ScheduleResult(
            false,
            overallNewTasks.build(),
            nonCancellationPropagating(whenAnyComplete(overallBlockedFutures)),
            blockedReason,
            overallSplitAssignmentCount);
}

The splitPlacementPolicy.computeAssignments() method.

DynamicSplitPlacementPolicy implements the dynamic placement logic:

@Override
public SplitPlacementResult computeAssignments(Set<Split> splits)
{
    //delegates to the nodeSelector's assignment computation
    return nodeSelector.computeAssignments(splits, remoteTasks.get());
}

//as mentioned earlier, with a topology-aware configuration the NodeSelector is a TopologyAwareNodeSelector; below is how TopologyAwareNodeSelector assigns splits
@Override
    public SplitPlacementResult computeAssignments(Set<Split> splits, List<RemoteTask> existingTasks)
    {
        //take the NodeMap out of the selector
        NodeMap nodeMap = this.nodeMap.get().get();
        //create the assignment multimap
        Multimap<Node, Split> assignment = HashMultimap.create();
        NodeAssignmentStats assignmentStats = new NodeAssignmentStats(nodeTaskMap, nodeMap, existingTasks);

        int[] topologicCounters = new int[topologicalSplitCounters.size()];
        Set<NetworkLocation> filledLocations = new HashSet<>();
        Set<Node> blockedExactNodes = new HashSet<>();
        boolean splitWaitingForAnyNode = false;
        for (Split split : splits) {
            //first check whether this split can be read remotely; if not, take the split's addresses and look for matching nodes (possibly including the Coordinator) as candidates; if none are found, throw
            if (!split.isRemotelyAccessible()) {
                List<Node> candidateNodes = selectExactNodes(nodeMap, split.getAddresses(), includeCoordinator);
                if (candidateNodes.isEmpty()) {
                    log.debug("No nodes available to schedule %s. Available nodes %s", split, nodeMap.getNodesByHost().keys());
                    throw new PrestoException(NO_NODES_AVAILABLE, "No nodes available to run query");
                }
                //the choice here weighs the minimum number of candidates against each node's assigned and pending split counts (bounded by maxPendingSplitsPerTask)
                Node chosenNode = bestNodeSplitCount(candidateNodes.iterator(), minCandidates, maxPendingSplitsPerTask, assignmentStats);
                //if a node could be chosen
                if (chosenNode != null) {
                    //record the chosen node and its split
                    assignment.put(chosenNode, split);
                    assignmentStats.addAssignedSplit(chosenNode);
                }
                // Exact node set won't matter, if a split is waiting for any node
                else if (!splitWaitingForAnyNode) {
                    //if the policy finds no node, remember all the candidates in the blocked set
                    blockedExactNodes.addAll(candidateNodes);
                }
                continue;
            }
            //if remote access is not an issue, fall through to the topology-based selection below
            Node chosenNode = null;
            //the number of network levels; they are walked from deepest to shallowest below
            int depth = networkLocationSegmentNames.size();
            int chosenDepth = 0;
            Set<NetworkLocation> locations = new HashSet<>();
            //collect the network locations of this split's addresses
            for (HostAddress host : split.getAddresses()) {
                locations.add(networkLocationCache.get(host));
            }
            //if the cache yields no location, use the root location and set the depth to 0
            if (locations.isEmpty()) {
                // Add the root location
                locations.add(ROOT_LOCATION);
                depth = 0;
            }
            // Try each address at progressively shallower network locations
            for (int i = depth; i >= 0 && chosenNode == null; i--) {
                for (NetworkLocation location : locations) {
                    // Skip locations which are only shallower than this level
                    // For example, locations which couldn't be located will be at the "root" location
                    if (location.getSegments().size() < i) {
                        continue;
                    }
                    location = location.subLocation(0, i);
                    if (filledLocations.contains(location)) {
                        continue;
                    }
                    Set<Node> nodes = nodeMap.getWorkersByNetworkPath().get(location);
                    chosenNode = bestNodeSplitCount(new ResettableRandomizedIterator<>(nodes), minCandidates, calculateMaxPendingSplits(i, depth), assignmentStats);
                    if (chosenNode != null) {
                        chosenDepth = i;
                        break;
                    }
                    filledLocations.add(location);
                }
            }
            if (chosenNode != null) {
                //record the chosen node and its split
                assignment.put(chosenNode, split);
                assignmentStats.addAssignedSplit(chosenNode);
                topologicCounters[chosenDepth]++;
            }
            else {
                splitWaitingForAnyNode = true;
            }
        }
        for (int i = 0; i < topologicCounters.length; i++) {
            if (topologicCounters[i] > 0) {
                topologicalSplitCounters.get(i).update(topologicCounters[i]);
            }
        }

        ListenableFuture<?> blocked;
        int maxPendingForWildcardNetworkAffinity = calculateMaxPendingSplits(0, networkLocationSegmentNames.size());
        if (splitWaitingForAnyNode) {
            blocked = toWhenHasSplitQueueSpaceFuture(existingTasks, calculateLowWatermark(maxPendingForWildcardNetworkAffinity));
        }
        else {
            blocked = toWhenHasSplitQueueSpaceFuture(blockedExactNodes, existingTasks, calculateLowWatermark(maxPendingForWildcardNetworkAffinity));
        }
        return new SplitPlacementResult(blocked, assignment);
    }

SourcePartitionedScheduler's assignSplits method:

//iterate over the splitAssignment argument; for each entry:
//look up the tasks already on that node; if there are none, schedule a new task, otherwise hand the node's splits to the task already running there
private Set<RemoteTask> assignSplits(Multimap<Node, Split> splitAssignment, Multimap<Node, Lifespan> noMoreSplitsNotification)
{
    ImmutableSet.Builder<RemoteTask> newTasks = ImmutableSet.builder();

    ImmutableSet<Node> nodes = ImmutableSet.<Node>builder()
            .addAll(splitAssignment.keySet())
            .addAll(noMoreSplitsNotification.keySet())
            .build();
    for (Node node : nodes) {
        // source partitioned tasks can only receive broadcast data; otherwise it would have a different distribution
        ImmutableMultimap<PlanNodeId, Split> splits = ImmutableMultimap.<PlanNodeId, Split>builder()
                .putAll(partitionedNode, splitAssignment.get(node))
                .build();
        ImmutableMultimap.Builder<PlanNodeId, Lifespan> noMoreSplits = ImmutableMultimap.builder();
        if (noMoreSplitsNotification.containsKey(node)) {
            noMoreSplits.putAll(partitionedNode, noMoreSplitsNotification.get(node));
        }
        newTasks.addAll(stage.scheduleSplits(
                node,
                splits,
                noMoreSplits.build()));
    }
    return newTasks.build();
}

SqlStageExecution's scheduleSplits method:

public synchronized Set<RemoteTask> scheduleSplits(Node node, Multimap<PlanNodeId, Split> splits, Multimap<PlanNodeId, Lifespan> noMoreSplitsNotification)
{
    requireNonNull(node, "node is null");
    requireNonNull(splits, "splits is null");

    splitsScheduled.set(true);

    checkArgument(stateMachine.getFragment().getPartitionedSources().containsAll(splits.keySet()), "Invalid splits");

    ImmutableSet.Builder<RemoteTask> newTasks = ImmutableSet.builder();
    Collection<RemoteTask> tasks = this.tasks.get(node);
    RemoteTask task;
    if (tasks == null) {
        // The output buffer depends on the task id starting from 0 and being sequential, since each
        // task is assigned a private buffer based on task id.
        TaskId taskId = new TaskId(stateMachine.getStageId(), nextTaskId.getAndIncrement());
        task = scheduleTask(node, taskId, splits);
        newTasks.add(task);
    }
    else {
        task = tasks.iterator().next();
        task.addSplits(splits);
    }
    if (noMoreSplitsNotification.size() > 1) {
        // The assumption that `noMoreSplitsNotification.size() <= 1` currently holds.
        // If this assumption no longer holds, we should consider calling task.noMoreSplits with multiple entries in one shot.
        // These kind of methods can be expensive since they are grabbing locks and/or sending HTTP requests on change.
        throw new UnsupportedOperationException("This assumption no longer holds: noMoreSplitsNotification.size() < 1");
    }
    for (Entry<PlanNodeId, Lifespan> entry : noMoreSplitsNotification.entries()) {
        task.noMoreSplits(entry.getKey(), entry.getValue());
    }
    return newTasks.build();
}

The second kind is FixedSourcePartitionedScheduler

public ScheduleResult schedule()
{
    // schedule a task on every node in the distribution
    List<RemoteTask> newTasks = ImmutableList.of();
    if (!scheduledTasks) {
        newTasks = partitioning.getPartitionToNode().entrySet().stream()
                .map(entry -> stage.scheduleTask(entry.getValue(), entry.getKey()))
                .collect(toImmutableList());
        scheduledTasks = true;
    }

    boolean allBlocked = true;
    List<ListenableFuture<?>> blocked = new ArrayList<>();
    BlockedReason blockedReason = BlockedReason.NO_ACTIVE_DRIVER_GROUP;
    int splitsScheduled = 0;

    Iterator<SourcePartitionedScheduler> schedulerIterator = sourcePartitionedSchedulers.iterator();
    List<Lifespan> driverGroupsToStart = ImmutableList.of();
    while (schedulerIterator.hasNext()) {
        SourcePartitionedScheduler sourcePartitionedScheduler = schedulerIterator.next();

        for (Lifespan lifespan : driverGroupsToStart) {
            sourcePartitionedScheduler.startLifespan(lifespan, partitionHandleFor(lifespan));
        }

        ScheduleResult schedule = sourcePartitionedScheduler.schedule();
        splitsScheduled += schedule.getSplitsScheduled();
        if (schedule.getBlockedReason().isPresent()) {
            blocked.add(schedule.getBlocked());
            blockedReason = blockedReason.combineWith(schedule.getBlockedReason().get());
        }
        else {
            verify(schedule.getBlocked().isDone(), "blockedReason not provided when scheduler is blocked");
            allBlocked = false;
        }

        driverGroupsToStart = sourcePartitionedScheduler.drainCompletedLifespans();

        if (schedule.isFinished()) {
            schedulerIterator.remove();
            sourcePartitionedScheduler.close();
        }
    }

    if (allBlocked) {
        return new ScheduleResult(sourcePartitionedSchedulers.isEmpty(), newTasks, whenAnyComplete(blocked), blockedReason, splitsScheduled);
    }
    else {
        return new ScheduleResult(sourcePartitionedSchedulers.isEmpty(), newTasks, splitsScheduled);
    }
}

The third kind is FixedCountScheduler

public ScheduleResult schedule()
{
    List<RemoteTask> newTasks = partitionToNode.entrySet().stream()
            .map(entry -> taskScheduler.apply(entry.getValue(), entry.getKey()))
            .collect(toImmutableList());

    return new ScheduleResult(true, newTasks, 0);
}

2.5 Creating RemoteTask tasks

In Presto's architecture, stage scheduling produces tasks that are dispatched to workers for execution.

The scheduleTask method of SqlStageExecution

private synchronized RemoteTask scheduleTask(Node node, TaskId taskId, Multimap<PlanNodeId, Split> sourceSplits)
    {
        ImmutableMultimap.Builder<PlanNodeId, Split> initialSplits = ImmutableMultimap.builder();
        //Collect all the sourceSplits. For a source-distributed stage the sourceSplits argument is populated; for fixed and single stages it is empty
        initialSplits.putAll(sourceSplits);
        
        sourceTasks.forEach((planNodeId, task) -> {
            TaskStatus status = task.getTaskStatus();
            if (status.getState() != TaskState.FINISHED) {
                initialSplits.put(planNodeId, createRemoteSplitFor(taskId, status.getSelf()));
            }
        });
        OutputBuffers outputBuffers = this.outputBuffers.get();
        checkState(outputBuffers != null, "Initial output buffers must be set before a task can be scheduled");

        //Create the remote task
        RemoteTask task = remoteTaskFactory.createRemoteTask(
                stateMachine.getSession(),
                taskId,
                node,
                stateMachine.getFragment(),
                initialSplits.build(),
                outputBuffers,
                nodeTaskMap.createPartitionedSplitCountTracker(node, taskId),
                summarizeTaskInfo);

        completeSources.forEach(task::noMoreSplits);
        allTasks.add(taskId);
        tasks.computeIfAbsent(node, key -> newConcurrentHashSet()).add(task);
        nodeTaskMap.addTask(node, task);
        task.addStateChangeListener(new StageTaskListener());
        if (!stateMachine.getState().isDone()) {
            task.start();
        }
        else {
            task.abort();
        }
        return task;
    }

The RemoteTask interface is implemented by HttpRemoteTask

   public HttpRemoteTask(Session session,
            TaskId taskId,
            String nodeId,
            URI location,
            PlanFragment planFragment,
            Multimap<PlanNodeId, Split> initialSplits,
            OutputBuffers outputBuffers,
            HttpClient httpClient,
            Executor executor,
            ScheduledExecutorService updateScheduledExecutor,
            ScheduledExecutorService errorScheduledExecutor,
            Duration minErrorDuration,
            Duration maxErrorDuration,
            Duration taskStatusRefreshMaxWait,
            Duration taskInfoUpdateInterval,
            boolean summarizeTaskInfo,
            JsonCodec<TaskStatus> taskStatusCodec,
            JsonCodec<TaskInfo> taskInfoCodec,
            JsonCodec<TaskUpdateRequest> taskUpdateRequestCodec,
            PartitionedSplitCountTracker partitionedSplitCountTracker,
            RemoteTaskStats stats)
    {
        requireNonNull(session, "session is null");
        requireNonNull(taskId, "taskId is null");
        requireNonNull(nodeId, "nodeId is null");
        requireNonNull(location, "location is null");
        requireNonNull(planFragment, "planFragment is null");
        requireNonNull(outputBuffers, "outputBuffers is null");
        requireNonNull(httpClient, "httpClient is null");
        requireNonNull(executor, "executor is null");
        requireNonNull(taskStatusCodec, "taskStatusCodec is null");
        requireNonNull(taskInfoCodec, "taskInfoCodec is null");
        requireNonNull(taskUpdateRequestCodec, "taskUpdateRequestCodec is null");
        requireNonNull(partitionedSplitCountTracker, "partitionedSplitCountTracker is null");
        requireNonNull(stats, "stats is null");

        try (SetThreadName ignored = new SetThreadName("HttpRemoteTask-%s", taskId)) {
            this.taskId = taskId;
            this.session = session;
            this.nodeId = nodeId;
            this.planFragment = planFragment;
            this.outputBuffers.set(outputBuffers);
            this.httpClient = httpClient;
            this.executor = executor;
            this.errorScheduledExecutor = errorScheduledExecutor;
            this.summarizeTaskInfo = summarizeTaskInfo;
            this.taskInfoCodec = taskInfoCodec;
            this.taskUpdateRequestCodec = taskUpdateRequestCodec;
            this.updateErrorTracker = new RequestErrorTracker(taskId, location, minErrorDuration, maxErrorDuration, errorScheduledExecutor, "updating task");
            this.partitionedSplitCountTracker = requireNonNull(partitionedSplitCountTracker, "partitionedSplitCountTracker is null");
            this.stats = stats;

            for (Entry<PlanNodeId, Split> entry : requireNonNull(initialSplits, "initialSplits is null").entries()) {
                ScheduledSplit scheduledSplit = new ScheduledSplit(nextSplitId.getAndIncrement(), entry.getKey(), entry.getValue());
                pendingSplits.put(entry.getKey(), scheduledSplit);
            }
            pendingSourceSplitCount = planFragment.getPartitionedSources().stream()
                    .filter(initialSplits::containsKey)
                    .mapToInt(partitionedSource -> initialSplits.get(partitionedSource).size())
                    .sum();

            List<BufferInfo> bufferStates = outputBuffers.getBuffers()
                    .keySet().stream()
                    .map(outputId -> new BufferInfo(outputId, false, 0, 0, PageBufferInfo.empty()))
                    .collect(toImmutableList());

            TaskInfo initialTask = createInitialTask(taskId, location, nodeId, bufferStates, new TaskStats(DateTime.now(), null));

            this.taskStatusFetcher = new ContinuousTaskStatusFetcher(
                    this::failTask,
                    initialTask.getTaskStatus(),
                    taskStatusRefreshMaxWait,
                    taskStatusCodec,
                    executor,
                    httpClient,
                    minErrorDuration,
                    maxErrorDuration,
                    errorScheduledExecutor,
                    stats);

            this.taskInfoFetcher = new TaskInfoFetcher(
                    this::failTask,
                    initialTask,
                    httpClient,
                    taskInfoUpdateInterval,
                    taskInfoCodec,
                    minErrorDuration,
                    maxErrorDuration,
                    summarizeTaskInfo,
                    executor,
                    updateScheduledExecutor,
                    errorScheduledExecutor,
                    stats);

            taskStatusFetcher.addStateChangeListener(newStatus -> {
                TaskState state = newStatus.getState();
                if (state.isDone()) {
                    cleanUpTask();
                }
                else {
                    partitionedSplitCountTracker.setPartitionedSplitCount(getPartitionedSplitCount());
                    updateSplitQueueSpace();
                }
            });

            long timeout = minErrorDuration.toMillis() / MIN_RETRIES;
            this.requestTimeout = new Duration(timeout + taskStatusRefreshMaxWait.toMillis(), MILLISECONDS);
            partitionedSplitCountTracker.setPartitionedSplitCount(getPartitionedSplitCount());
            updateSplitQueueSpace();
        }
    }

The task's start method kicks off polling of the corresponding task status

public void start()
{
    try (SetThreadName ignored = new SetThreadName("HttpRemoteTask-%s", taskId)) {
        // to start we just need to trigger an update
        scheduleUpdate();

        taskStatusFetcher.start();
        taskInfoFetcher.start();
    }
}

The exchange fetches data from upstream stages, while the outputBuffer hands the current stage's output to downstream stages.
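
To make that relationship concrete, here is a minimal, self-contained sketch (hypothetical classes, not Presto's real OutputBuffer/ExchangeClient API) of an upstream stage filling a bounded output buffer while the downstream stage's exchange drains it:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

//Hypothetical illustration only: the output buffer is modelled as a bounded queue that the
//upstream stage fills and the downstream exchange drains, which also gives natural back pressure.
public class BufferExchangeSketch
{
    static class Page
    {
        final int rowCount;

        Page(int rowCount)
        {
            this.rowCount = rowCount;
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        BlockingQueue<Page> outputBuffer = new ArrayBlockingQueue<>(4);

        //upstream stage: writes its result pages into its output buffer
        Thread upstream = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    outputBuffer.put(new Page(1024));   //blocks while the buffer is full
                }
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        //downstream stage: its exchange pulls pages out of the upstream output buffer
        Thread downstream = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    Page page = outputBuffer.take();
                    System.out.println("exchange received a page with " + page.rowCount + " rows");
                }
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        upstream.start();
        downstream.start();
        upstream.join();
        downstream.join();
    }
}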

2.6 Task execution

2.6.1 The worker receives the task

After the RemoteTask is created, the task is pushed down to the corresponding worker via an HTTP REST request.

@Path("/v1/task")
@POST
    @Path("{taskId}")
    @Consumes(MediaType.APPLICATION_JSON)
    @Produces(MediaType.APPLICATION_JSON)
    public Response createOrUpdateTask(@PathParam("taskId") TaskId taskId, TaskUpdateRequest taskUpdateRequest, @Context UriInfo uriInfo)
    {
        requireNonNull(taskUpdateRequest, "taskUpdateRequest is null");

        Session session = taskUpdateRequest.getSession().toSession(sessionPropertyManager);
        TaskInfo taskInfo = taskManager.updateTask(session,
                taskId,
                taskUpdateRequest.getFragment(),
                taskUpdateRequest.getSources(),
                taskUpdateRequest.getOutputIds());

        if (shouldSummarize(uriInfo)) {
            taskInfo = taskInfo.summarize();
        }

        return Response.ok().entity(taskInfo).build();
    }
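
On the coordinator side this request is built and sent by HttpRemoteTask; conceptually it is just an HTTP POST of a TaskUpdateRequest JSON document to /v1/task/{taskId}. The sketch below is only an illustration of what such a request looks like on the wire: the worker address, task id and abbreviated payload are made up, and the JDK HttpClient is used here purely for demonstration.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

//Conceptual sketch only: in Presto the request is assembled and sent internally by HttpRemoteTask.
public class PostTaskUpdateSketch
{
    public static void main(String[] args) throws Exception
    {
        //hypothetical worker address and task id
        String taskUri = "http://worker-1:8080/v1/task/20190101_000000_00000_abcde.1.0";

        //hypothetical, heavily abbreviated TaskUpdateRequest payload
        String taskUpdateRequestJson = "{\"session\":{},\"fragment\":{},\"sources\":[],\"outputIds\":{}}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(taskUri))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(taskUpdateRequestJson))
                .build();

        //the worker answers with the (possibly summarized) TaskInfo as JSON
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}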

The updateTask method of SqlTaskManager

@Override
public TaskInfo updateTask(Session session, TaskId taskId, Optional<PlanFragment> fragment, List<TaskSource> sources, OutputBuffers outputBuffers)
{
    requireNonNull(session, "session is null");
    requireNonNull(taskId, "taskId is null");
    requireNonNull(fragment, "fragment is null");
    requireNonNull(sources, "sources is null");
    requireNonNull(outputBuffers, "outputBuffers is null");

    if (resourceOvercommit(session)) {
        // TODO: This should have been done when the QueryContext was created. However, the session isn't available at that point.
        queryContexts.getUnchecked(taskId.getQueryId()).setResourceOvercommit();
    }

    SqlTask sqlTask = tasks.getUnchecked(taskId);
    sqlTask.recordHeartbeat();
    return sqlTask.updateTask(session, fragment, sources, outputBuffers);
}

The updateTask method of SqlTask

public TaskInfo updateTask(Session session, Optional<PlanFragment> fragment, List<TaskSource> sources, OutputBuffers outputBuffers)
{
    try {
        // The LazyOutput buffer does not support write methods, so the actual
        // output buffer must be established before drivers are created (e.g.
        // a VALUES query).
        outputBuffer.setOutputBuffers(outputBuffers);

        // assure the task execution is only created once
        SqlTaskExecution taskExecution;
        synchronized (this) {
            // is task already complete?
            TaskHolder taskHolder = taskHolderReference.get();
            if (taskHolder.isFinished()) {
                return taskHolder.getFinalTaskInfo();
            }
            taskExecution = taskHolder.getTaskExecution();
            if (taskExecution == null) {
                checkState(fragment.isPresent(), "fragment must be present");
                //on the first call, create a new task execution
                taskExecution = sqlTaskExecutionFactory.create(session, queryContext, taskStateMachine, outputBuffer, fragment.get(), sources);
                taskHolderReference.compareAndSet(taskHolder, new TaskHolder(taskExecution));
                needsPlan.set(false);
            }
        }

        if (taskExecution != null) {
            taskExecution.addSources(sources);
        }
    }
    catch (Error e) {
        failed(e);
        throw e;
    }
    catch (RuntimeException e) {
        failed(e);
    }

    return getTaskInfo();
}

The create method of SqlTaskExecutionFactory

public SqlTaskExecution create(Session session, QueryContext queryContext, TaskStateMachine taskStateMachine, OutputBuffer outputBuffer, PlanFragment fragment, List<TaskSource> sources)
{
    boolean verboseStats = getVerboseStats(session);
    TaskContext taskContext = queryContext.addTaskContext(
            taskStateMachine,
            session,
            verboseStats,
            cpuTimerEnabled);

    LocalExecutionPlan localExecutionPlan;
    try (SetThreadName ignored = new SetThreadName("Task-%s", taskStateMachine.getTaskId())) {
        try {
            localExecutionPlan = planner.plan(
                    taskContext,
                    fragment.getRoot(),
                    fragment.getSymbols(),
                    fragment.getPartitioningScheme(),
                    fragment.getPipelineExecutionStrategy() == GROUPED_EXECUTION,
                    fragment.getPartitionedSources(),
                    outputBuffer);

            for (DriverFactory driverFactory : localExecutionPlan.getDriverFactories()) {
                Optional<PlanNodeId> sourceId = driverFactory.getSourceId();
                if (sourceId.isPresent() && fragment.isPartitionedSources(sourceId.get())) {
                    checkArgument(fragment.getPipelineExecutionStrategy() == driverFactory.getPipelineExecutionStrategy(),
                            "Partitioned pipelines are expected to have the same execution strategy as the fragment");
                }
                else {
                    checkArgument(fragment.getPipelineExecutionStrategy() != UNGROUPED_EXECUTION || driverFactory.getPipelineExecutionStrategy() == UNGROUPED_EXECUTION,
                            "When fragment execution strategy is ungrouped, all pipelines should have ungrouped execution strategy");
                }
            }
        }
        catch (Throwable e) {
            // planning failed
            taskStateMachine.failed(e);
            throwIfUnchecked(e);
            throw new RuntimeException(e);
        }
    }
    return createSqlTaskExecution(
            taskStateMachine,
            taskContext,
            outputBuffer,
            sources,
            localExecutionPlan,
            taskExecutor,
            taskNotificationExecutor,
            queryMonitor);
}

The createSqlTaskExecution method of SqlTaskExecution

static SqlTaskExecution createSqlTaskExecution(
        TaskStateMachine taskStateMachine,
        TaskContext taskContext,
        OutputBuffer outputBuffer,
        List<TaskSource> sources,
        LocalExecutionPlan localExecutionPlan,
        TaskExecutor taskExecutor,
        Executor notificationExecutor,
        QueryMonitor queryMonitor)
{
    SqlTaskExecution task = new SqlTaskExecution(
            taskStateMachine,
            taskContext,
            outputBuffer,
            localExecutionPlan,
            taskExecutor,
            queryMonitor,
            notificationExecutor);
    try (SetThreadName ignored = new SetThreadName("Task-%s", task.getTaskId())) {
        // The scheduleDriversForTaskLifeCycle method calls enqueueDriverSplitRunner, which registers a callback with access to this object.
        // The call back is accessed from another thread, so this code can not be placed in the constructor.
        //register the driver lifecycle callbacks and return the newly created task execution
        task.scheduleDriversForTaskLifeCycle();
        return task;
    }
}
2.6.2 Worker startup and execution

When a worker starts, it calls the start method of TaskExecutor, whose main job is to process the splits of all tasks running on that worker.

@PostConstruct
public synchronized void start()
{
    //runnerThreads is configured via the task.max-worker-threads property; the default is 4 * the number of CPU cores
    checkState(!closed, "TaskExecutor is closed");
    for (int i = 0; i < runnerThreads; i++) {
        addRunnerThread();
    }
    splitMonitorExecutor.scheduleWithFixedDelay(this::monitorActiveSplits, 1, 1, TimeUnit.MINUTES);
}
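
Since the number of runner threads comes from task.max-worker-threads (by default 4 * the number of CPU cores, as noted above), the pool size can be tuned per worker. An illustrative override in a worker's etc/config.properties (the value 64 is just an example, not a recommendation):

# etc/config.properties on a worker node
task.max-worker-threads=64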

The addRunnerThread method of TaskExecutor

private synchronized void addRunnerThread()
{
    try {
        //TaskRunner is an inner class of TaskExecutor
        executor.execute(new TaskRunner());
    }
    catch (RejectedExecutionException ignored) {
    }
}

The TaskRunner class

private class TaskRunner
        implements Runnable
{
    private final long runnerId = NEXT_RUNNER_ID.getAndIncrement();

    @Override
    public void run()
    {
        try (SetThreadName runnerName = new SetThreadName("SplitRunner-%s", runnerId)) {
            while (!closed && !Thread.currentThread().isInterrupted()) {
                // select next worker
                //get the next PrioritizedSplitRunner to process; it wraps all the work performed on a single split, i.e. the chain of Operators applied to that split
                //a SplitRunner with a scheduling priority
                final PrioritizedSplitRunner split;
                try {
                    //take a split from the waiting queue
                    split = waitingSplits.take();
                }
                catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }

                String threadId = split.getTaskHandle().getTaskId() + "-" + split.getSplitId();
                try (SetThreadName splitName = new SetThreadName(threadId)) {
                    RunningSplitInfo splitInfo = new RunningSplitInfo(ticker.read(), threadId, Thread.currentThread());
                    runningSplitInfos.add(splitInfo);

                    //add the split to runningSplits, which holds all splits currently being processed
                    runningSplits.add(split);

                    ListenableFuture<?> blocked;
                    try {
                        //call the split's process() method
                        blocked = split.process();
                    }
                    finally {
                        runningSplitInfos.remove(splitInfo);
                        //after this run, remove the split from runningSplits
                        runningSplits.remove(split);
                    }
                    //finished indicates whether the whole split has been fully processed
                    if (split.isFinished()) {
                        log.debug("%s is finished", split.getInfo());
                        splitFinished(split);
                    }
                    else {
                        //blocked indicates whether this execution quanta has completed
                        if (blocked.isDone()) {
                            //this quanta is done but the split is not finished yet, so put it back into the waiting queue
                            waitingSplits.offer(split);
                        }
                        else {
                            //record it in blockedSplits
                            blockedSplits.put(split, blocked);
                            blocked.addListener(() -> {
                                //once the blocking future completes, remove the split from blockedSplits
                                blockedSplits.remove(split);
                                //reset its level priority
                                split.resetLevelPriority();
                                //put it back into the waiting queue
                                waitingSplits.offer(split);
                            }, executor);
                        }
                    }
                }
                catch (Throwable t) {
                    // ignore random errors due to driver thread interruption
                    if (!split.isDestroyed()) {
                        if (t instanceof PrestoException) {
                            PrestoException e = (PrestoException) t;
                            log.error("Error processing %s: %s: %s", split.getInfo(), e.getErrorCode().getName(), e.getMessage());
                        }
                        else {
                            log.error(t, "Error processing %s", split.getInfo());
                        }
                    }
                    splitFinished(split);
                }
            }
        }
        finally {
            //the thread was interrupted or the TaskExecutor was closed
            if (!closed) {
                //if the thread was interrupted but the TaskExecutor is not closed yet, start a replacement runner thread
                addRunnerThread();
            }
        }
    }
}

All processing of a split goes through split.process(); here split is an instance of PrioritizedSplitRunner.

public ListenableFuture<?> process()
        throws Exception
{
    try {
        long startNanos = ticker.read();
        start.compareAndSet(0, startNanos);
        lastReady.compareAndSet(0, startNanos);
        processCalls.incrementAndGet();

        waitNanos.getAndAdd(startNanos - lastReady.get());

        CpuTimer timer = new CpuTimer();

        //call split.processFor(Duration duration) to do the actual work. Here split is a SplitRunner, in practice usually a DriverSplitRunner; SPLIT_RUN_QUANTA is the time slice, one second by default
        ListenableFuture<?> blocked = split.processFor(SPLIT_RUN_QUANTA);

        CpuTimer.CpuDuration elapsed = timer.elapsedTime();

        long quantaScheduledNanos = ticker.read() - startNanos;
        scheduledNanos.addAndGet(quantaScheduledNanos);

        priority.set(taskHandle.addScheduledNanos(quantaScheduledNanos));
        lastRun.set(ticker.read());

        if (blocked == NOT_BLOCKED) {
            unblockedQuantaWallTime.add(elapsed.getWall());
        }
        else {
            blockedQuantaWallTime.add(elapsed.getWall());
        }

        long quantaCpuNanos = elapsed.getCpu().roundTo(NANOSECONDS);
        cpuTimeNanos.addAndGet(quantaCpuNanos);

        globalCpuTimeMicros.update(quantaCpuNanos / 1000);
        globalScheduledTimeMicros.update(quantaScheduledNanos / 1000);

        return blocked;
    }
    catch (Throwable e) {
        finishedFuture.setException(e);
        throw e;
    }
}
2.6.3 Creating the Driver

The processFor method of DriverSplitRunner (an inner class of SqlTaskExecution)

@Override
public ListenableFuture<?> processFor(Duration duration)
{
    //a driver wraps the chain of operators applied to a split; the splits the driver still has to process are stored in the newSources field
    Driver driver;
    synchronized (this) {
        //if the DriverSplitRunner was already closed before this method ran, there is nothing left to do; just return a ListenableFuture whose value is null
        if (closed) {
            return Futures.immediateFuture(null);
        }
        //if the driver is still null, first create one for the assigned split; partitionedSplit is a field of DriverSplitRunner of type ScheduledSplit, a wrapper around Split
        if (this.driver == null) {
            this.driver = driverSplitRunnerFactory.createDriver(driverContext, partitionedSplit);
        }
        driver = this.driver;
    }

    return driver.processFor(duration);
}
2.6.4 Driver execution

The processFor method of Driver

public ListenableFuture<?> processFor(Duration duration)
    {
        checkLockNotHeld("Can not process for a duration while holding the driver lock");

        requireNonNull(duration, "duration is null");

        // if the driver is blocked we don't need to continue
        SettableFuture<?> blockedFuture = driverBlockedFuture.get();
        if (!blockedFuture.isDone()) {
            return blockedFuture;
        }

        //the maximum time this call is allowed to run
        long maxRuntime = duration.roundTo(TimeUnit.NANOSECONDS);

        //the current thread acquires the driver lock; if another thread holds it, wait at most 100 milliseconds
        Optional<ListenableFuture<?>> result = tryWithLock(100, TimeUnit.MILLISECONDS, () -> {
            driverContext.startProcessTimer();
            driverContext.getYieldSignal().setWithDelay(maxRuntime, driverContext.getYieldExecutor());
            try {
                long start = System.nanoTime();
                do {
                    //the actual processing of the split happens in processInternal
                    ListenableFuture<?> future = processInternal();
                    if (!future.isDone()) {
                        return updateDriverBlockedFuture(future);
                    }
                }
                while (System.nanoTime() - start < maxRuntime && !isFinishedInternal());
            }
            finally {
                driverContext.getYieldSignal().reset();
                driverContext.recordProcessed();
            }
            return NOT_BLOCKED;
        });
        return result.orElse(NOT_BLOCKED);
    }

The processInternal method of Driver

@GuardedBy("exclusiveLock")
private ListenableFuture<?> processInternal()
{
    checkLockHeld("Lock must be held to call processInternal");

    handleMemoryRevoke();

    try {
        //if new splits have arrived that have not been consumed yet, add them to the source operator
        processNewSources();

        //special-case a pipeline with a single operator
        if (operators.size() == 1) {
            //if the driver is already done, return NOT_BLOCKED
            if (driverContext.isDone()) {
                return NOT_BLOCKED;
            }

            //get the operator
            Operator current = operators.get(0);
            //check whether the operator is blocked
            Optional<ListenableFuture<?>> blocked = getBlockedFuture(current);
            if (blocked.isPresent()) {
                current.getOperatorContext().recordBlocked(blocked.get());
                return blocked.get();
            }

            //if it is not blocked, simply finish this operator
            // there is only one operator so just finish it
            current.getOperatorContext().startIntervalTimer();
            current.finish();
            current.getOperatorContext().recordFinish();
            return NOT_BLOCKED;
        }

        boolean movedPage = false;
        //with more than one operator, run the loop below: each iteration takes two adjacent operators, pulls the output page of the first and feeds it as input to the second
        for (int i = 0; i < operators.size() - 1 && !driverContext.isDone(); i++) {
            //take two adjacent operators
            Operator current = operators.get(i);
            Operator next = operators.get(i + 1);

            // skip blocked operator
            if (getBlockedFuture(current).isPresent()) {
                continue;
            }

            //if the current operator is not finished, the next operator is not blocked, and the next operator needs input
            if (!current.isFinished() && !getBlockedFuture(next).isPresent() && next.needsInput()) {
                //get an output page from the current operator and hand it to the next operator as input
                current.getOperatorContext().startIntervalTimer();
                //getOutput is the heart of how an operator produces pages; each operator implements it differently (LimitOperator is shown as an example below)
                Page page = current.getOutput();
                current.getOperatorContext().recordGetOutput(page);

                //pass the output page to the next operator for processing
                if (page != null && page.getPositionCount() != 0) {
                    next.getOperatorContext().startIntervalTimer();
                    //addInput is the other half of the page-processing contract; each operator implements it differently
                    next.addInput(page);
                    next.getOperatorContext().recordAddInput(page);
                    //record that a page was moved
                    movedPage = true;
                }

                if (current instanceof SourceOperator) {
                    movedPage = true;
                }
            }

            //if the current operator has finished, tell the next operator that no more input will arrive so it can finish its processing and flush its results
            if (current.isFinished()) {
                // let next operator know there will be no more data
                next.getOperatorContext().startIntervalTimer();
                next.finish();
                next.getOperatorContext().recordFinish();
            }
        }

        //if we looped over all operators but no page was moved, check whether any operator is blocked
        if (!movedPage) {
            List<Operator> blockedOperators = new ArrayList<>();
            List<ListenableFuture<?>> blockedFutures = new ArrayList<>();
            //walk over all operators and collect each operator's blocked ListenableFuture; this also covers an operator that has finished but is still waiting for additional memory
            for (Operator operator : operators) {
                Optional<ListenableFuture<?>> blocked = getBlockedFuture(operator);
                if (blocked.isPresent()) {
                    blockedOperators.add(operator);
                    blockedFutures.add(blocked.get());
                }
            }

            //if some operators really are blocked
            if (!blockedFutures.isEmpty()) {
                // unblock when the first future is complete
                //as soon as any of these futures completes, the driver is unblocked
                ListenableFuture<?> blocked = firstFinishedFuture(blockedFutures);
                // driver records serial blocked time
                //the driver registers a listener so that it knows when it becomes unblocked
                driverContext.recordBlocked(blocked);
                // each blocked operator is responsible for blocking the execution
                // until one of the operators can continue
                //register a listener on each blocked operator so that it knows when it becomes unblocked
                for (Operator operator : blockedOperators) {
                    operator.getOperatorContext().recordBlocked(blocked);
                }
                return blocked;
            }
        }

        return NOT_BLOCKED;
    }
    catch (Throwable t) {
        List<StackTraceElement> interrupterStack = exclusiveLock.getInterrupterStack();
        if (interrupterStack == null) {
            driverContext.failed(t);
            throw t;
        }

        // Driver thread was interrupted which should only happen if the task is already finished.
        // If this becomes the actual cause of a failed query there is a bug in the task state machine.
        Exception exception = new Exception("Interrupted By");
        exception.setStackTrace(interrupterStack.stream().toArray(StackTraceElement[]::new));
        PrestoException newException = new PrestoException(GENERIC_INTERNAL_ERROR, "Driver was interrupted", exception);
        newException.addSuppressed(t);
        driverContext.failed(newException);
        throw newException;
    }
}
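
Stripped of the blocking checks, memory revocation and timing bookkeeping, the heart of processInternal is simply "pull a page from operator i and push it into operator i + 1". The simplified, self-contained sketch below shows that idea in isolation; the Operator interface here is a stripped-down stand-in, not Presto's real one:

import java.util.List;

//Simplified illustration of the page-pipelining loop in Driver.processInternal.
public class DriverLoopSketch
{
    interface Page {}

    //stripped-down stand-in for Presto's Operator interface
    interface Operator
    {
        boolean isFinished();

        boolean needsInput();

        Page getOutput();           //may return null when no page is ready

        void addInput(Page page);

        void finish();              //no more input will arrive
    }

    //one pass over the pipeline; returns whether any page was moved
    static boolean processOnce(List<Operator> operators)
    {
        boolean movedPage = false;
        for (int i = 0; i < operators.size() - 1; i++) {
            Operator current = operators.get(i);
            Operator next = operators.get(i + 1);

            //move one page from the upstream operator to the downstream operator
            if (!current.isFinished() && next.needsInput()) {
                Page page = current.getOutput();
                if (page != null) {
                    next.addInput(page);
                    movedPage = true;
                }
            }

            //propagate completion downstream
            if (current.isFinished()) {
                next.finish();
            }
        }
        //the real driver additionally collects the blocked futures when no page moved
        return movedPage;
    }
}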

The getBlockedFuture method of Driver determines whether a given operator is blocked

private Optional<ListenableFuture<?>> getBlockedFuture(Operator operator)
{
    ListenableFuture<?> blocked = revokingOperators.get(operator);
    if (blocked != null) {
        // We mark operator as blocked regardless of blocked.isDone(), because finishMemoryRevoke has not been called yet.
        return Optional.of(blocked);
    }
    blocked = operator.isBlocked();
    if (!blocked.isDone()) {
        return Optional.of(blocked);
    }
    blocked = operator.getOperatorContext().isWaitingForMemory();
    if (!blocked.isDone()) {
        return Optional.of(blocked);
    }
    blocked = operator.getOperatorContext().isWaitingForRevocableMemory();
    if (!blocked.isDone()) {
        return Optional.of(blocked);
    }
    return Optional.empty();
}
2.6.5 Operator execution

The getOutput() and addInput() methods of the Operator interface are the core of how an operator processes pages; LimitOperator is used as the example here.

@Override
public void addInput(Page page)
{
    checkState(needsInput());

    if (page.getPositionCount() <= remainingLimit) {
        remainingLimit -= page.getPositionCount();
        nextPage = page;
    }
    else {
        Block[] blocks = new Block[page.getChannelCount()];
        for (int channel = 0; channel < page.getChannelCount(); channel++) {
            Block block = page.getBlock(channel);
            blocks[channel] = block.getRegion(0, (int) remainingLimit);
        }
        nextPage = new Page((int) remainingLimit, blocks);
        remainingLimit = 0;
    }
}

@Override
public Page getOutput()
{
    Page page = nextPage;
    nextPage = null;
    return page;
}
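
As a self-contained illustration of this addInput/getOutput hand-off (hypothetical code over plain int arrays, not taken from Presto), the same limit logic together with a driver-like caller could look like this:

import java.util.Arrays;

//Hypothetical, simplified version of the LimitOperator contract:
//addInput buffers (and possibly truncates) one page, getOutput hands it over exactly once.
public class LimitSketch
{
    private long remainingLimit;
    private int[] nextPage;   //stands in for Presto's Page

    LimitSketch(long limit)
    {
        this.remainingLimit = limit;
    }

    boolean needsInput()
    {
        return remainingLimit > 0 && nextPage == null;
    }

    void addInput(int[] page)
    {
        if (page.length <= remainingLimit) {
            remainingLimit -= page.length;
            nextPage = page;
        }
        else {
            //keep only the first remainingLimit rows of the page
            nextPage = Arrays.copyOf(page, (int) remainingLimit);
            remainingLimit = 0;
        }
    }

    int[] getOutput()
    {
        int[] page = nextPage;
        nextPage = null;
        return page;
    }

    public static void main(String[] args)
    {
        LimitSketch limit = new LimitSketch(10);
        int[] page = new int[100];              //a "page" of 100 rows
        if (limit.needsInput()) {
            limit.addInput(page);
        }
        System.out.println(limit.getOutput().length);   //prints 10
    }
}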

3. Technical improvements

3.1 Support for Hive views

3.2 Custom connectors

3.3 Implicit type conversion

3.4 UDF support

3.5 Performance tuning

Still a work in progress, to be continued…

If you spot any mistakes, please point them out so we can improve together~

Updated every evening~

If you repost this article, please include a link back to it; original writing is not easy, thank you~
