1. Obtaining the Streaming Environment
Continuing from the previous section: after Flink invokes the main method of the user code, the user code typically obtains the Flink environment like this:
val env = StreamExecutionEnvironment.getExecutionEnvironment
By the time this line runs, the environment has already been initialized by the command-line client executing the method below; env wraps the user environment configuration, the streaming execution configuration, and so on.
//org/apache/flink/client/ClientUtils.java:66
public static void executeProgram(
        PipelineExecutorServiceLoader executorServiceLoader,
        Configuration configuration,
        PackagedProgram program,
        boolean enforceSingleJobExecution,
        boolean suppressSysout)
        throws ProgramInvocationException {
    checkNotNull(executorServiceLoader);
    final ClassLoader userCodeClassLoader = program.getUserCodeClassLoader();
    final ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
    try {
        Thread.currentThread().setContextClassLoader(userCodeClassLoader);

        LOG.info(
                "Starting program (detached: {})",
                !configuration.getBoolean(DeploymentOptions.ATTACHED));

        ContextEnvironment.setAsContext(
                executorServiceLoader,
                configuration,
                userCodeClassLoader,
                enforceSingleJobExecution,
                suppressSysout);

        StreamContextEnvironment.setAsContext(
                executorServiceLoader,
                configuration,
                userCodeClassLoader,
                enforceSingleJobExecution,
                suppressSysout);

        try {
            program.invokeInteractiveModeForExecution();
        } finally {
            ContextEnvironment.unsetAsContext();
            StreamContextEnvironment.unsetAsContext();
        }
    } finally {
        Thread.currentThread().setContextClassLoader(contextClassLoader);
    }
}
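For reference, a minimal user program on the other side of invokeInteractiveModeForExecution() might look like this (a sketch; the class, job name, and data are made up):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MyJob {
    public static void main(String[] args) throws Exception {
        // When launched through `flink run`, this returns the StreamContextEnvironment
        // installed above by StreamContextEnvironment.setAsContext(); inside an IDE it
        // falls back to a local environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "b", "a")
                .map(String::toUpperCase)
                .returns(Types.STRING) // explicit type hint for the lambda/method ref
                .print();

        env.execute("my-job");
    }
}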
2. Creating the DataStreamSource
DataStream is Flink's core streaming abstraction. It is produced by two methods on StreamExecutionEnvironment, addSource and fromSource, of which fromSource is the newer API. Both are described below.
The addSource method triggers the following logic:
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:1933
private <OUT> DataStreamSource<OUT> addSource(
        final SourceFunction<OUT> function,
        final String sourceName,
        @Nullable final TypeInformation<OUT> typeInfo,
        final Boundedness boundedness) {
    checkNotNull(function);
    checkNotNull(sourceName);
    checkNotNull(boundedness);

    TypeInformation<OUT> resolvedTypeInfo =
            getTypeInfo(function, sourceName, SourceFunction.class, typeInfo);

    boolean isParallel = function instanceof ParallelSourceFunction;

    clean(function);

    final StreamSource<OUT, ?> sourceOperator = new StreamSource<>(function);
    return new DataStreamSource<>(
            this, resolvedTypeInfo, sourceOperator, isParallel, sourceName, boundedness);
}
The SourceFunction passed in here is the data-source function implemented or referenced by the user code, e.g. env.addSource(new FlinkKafkaConsumer011(...)).
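To make this path concrete, here is a minimal hypothetical SourceFunction (not from the Flink codebase) and how it would enter this method:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// A hypothetical user-defined source. Since it does not implement
// ParallelSourceFunction, addSource() computes isParallel = false and the
// resulting DataStreamSource is forced to parallelism 1 (see its constructor below).
public class CountSource implements SourceFunction<Long> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long i = 0;
        while (running) {
            ctx.collect(i++); // emit elements until cancelled
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

// Usage: DataStreamSource<Long> src = env.addSource(new CountSource(), "count-source");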
Summary of the method's logic:
- Resolve the TypeInformation wrapper for the source's output type from the user-supplied SourceFunction.
- Flink wraps every streaming input/output type in a TypeInformation. Basic types and arrays of them are represented by BasicTypeInfo and BasicArrayTypeInfo; user-defined classes that satisfy the POJO rules get PojoTypeInfo, and everything else falls back to GenericTypeInfo (a sketch follows at the end of this subsection).
- Wrap the SourceFunction in a source operator: here it is wrapped in a StreamSource. StreamSource is a StreamOperator; in Flink, a DataStream encapsulates one stream transformation, while a StreamOperator encapsulates one function interface such as map, filter, or a source function. StreamSource sits in the hierarchy StreamSource -> AbstractUdfStreamOperator -> AbstractStreamOperator -> StreamOperator; the naming here is rather arbitrary...
The user's SourceFunction itself is stored in AbstractUdfStreamOperator:
//org/apache/flink/streaming/api/operators/AbstractUdfStreamOperator.java:54
/** The user function. */
protected final F userFunction;

public AbstractUdfStreamOperator(F userFunction) {
    this.userFunction = requireNonNull(userFunction);
    checkUdfCheckpointingPreconditions();
}
- Create the DataStreamSource stream. DataStreamSource extends SingleOutputStreamOperator, which in turn extends DataStream; its constructor is shown below:
//org/apache/flink/streaming/api/datastream/DataStreamSource.java:57
/** The constructor used to create legacy sources. */
public DataStreamSource(
        StreamExecutionEnvironment environment,
        TypeInformation<T> outTypeInfo,
        StreamSource<T, ?> operator,
        boolean isParallel,
        String sourceName,
        Boundedness boundedness) {
    super(
            environment,
            new LegacySourceTransformation<>(
                    sourceName,
                    operator,
                    outTypeInfo,
                    environment.getParallelism(),
                    boundedness));

    this.isParallel = isParallel;
    if (!isParallel) {
        setParallelism(1);
    }
}
- This calls the DataStreamSource constructor. Note the LegacySourceTransformation: it is the abstraction of one stream transformation, wrapping the StreamSource operator above together with the stream's parallelism. It extends PhysicalTransformation, which in turn extends Transformation.
- Finally the parent DataStream constructor is called, which stores this transformation in the DataStream. The code is as follows:
//org/apache/flink/streaming/api/datastream/DataStream.java:129
protected final StreamExecutionEnvironment environment;

protected final Transformation<T> transformation;

/**
 * Create a new {@link DataStream} in the given execution environment with partitioning set to
 * forward by default.
 *
 * @param environment The StreamExecutionEnvironment
 */
public DataStream(StreamExecutionEnvironment environment, Transformation<T> transformation) {
    this.environment =
            Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
    this.transformation =
            Preconditions.checkNotNull(
                    transformation, "Stream Transformation must not be null.");
}
- At this point the DataStream exists; all subsequent stream transformations are performed on it.
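As referenced in the TypeInformation bullet above, here is a short sketch of how output types resolve (the class and examples are made up; BasicTypeInfo, GenericTypeInfo, and TypeExtractor are the real Flink classes):

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;

public class TypeInfoDemo {
    public static void main(String[] args) {
        // Basic types resolve to the shared BasicTypeInfo constants.
        TypeInformation<String> stringInfo = TypeInformation.of(String.class);
        System.out.println(stringInfo.equals(BasicTypeInfo.STRING_TYPE_INFO)); // true

        // A class that violates the POJO rules (fields neither public nor
        // exposed via getters/setters, ...) falls back to GenericTypeInfo.
        TypeInformation<Thread> genericInfo = TypeExtractor.createTypeInfo(Thread.class);
        System.out.println(genericInfo.getClass().getSimpleName()); // GenericTypeInfo
    }
}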
The fromSource method triggers the following logic:
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:1999
public <OUT> DataStreamSource<OUT> fromSource(
        Source<OUT, ?, ?> source,
        WatermarkStrategy<OUT> timestampsAndWatermarks,
        String sourceName,
        TypeInformation<OUT> typeInfo) {

    final TypeInformation<OUT> resolvedTypeInfo =
            getTypeInfo(source, sourceName, Source.class, typeInfo);

    return new DataStreamSource<>(
            this,
            checkNotNull(source, "source"),
            checkNotNull(timestampsAndWatermarks, "timestampsAndWatermarks"),
            checkNotNull(resolvedTypeInfo),
            checkNotNull(sourceName));
}
fromSource is Flink's newer way to create a data source. The Source the user supplies here is different from the old SourceFunction and belongs to a separate interface hierarchy.
Summary of the logic:
- Resolve the TypeInformation wrapper for the source's output type.
- Build the DataStreamSource stream.
- Wrap the source together with the watermark strategy in a SourceTransformation.
- Build the DataStream, assigning the SourceTransformation as the stream's Transformation.
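A usage sketch of the new API, using the built-in NumberSequenceSource (the job and variable names are made up):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.lib.NumberSequenceSource;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FromSourceDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // NumberSequenceSource implements the new Source interface; fromSource()
        // wraps it together with the WatermarkStrategy in a SourceTransformation.
        DataStreamSource<Long> numbers =
                env.fromSource(
                        new NumberSequenceSource(0L, 100L),
                        WatermarkStrategy.noWatermarks(),
                        "number-sequence");

        numbers.print();
        env.execute("from-source-demo");
    }
}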
3. DataStream Transformations
After the previous step produced the DataStreamSource abstraction from the data source, all subsequent transformations happen on the DataStream.
Note that StreamExecutionEnvironment builds a List<Transformation<?>> transformations when it is initialized:
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:190
protected final List<Transformation<?>> transformations = new ArrayList<>();
Each DataStream transformation appends its Transformation to the operator list transformations (only operations that go through transform() add an operator; the others merely stack and wrap transformations for the time being).
Below, the map function serves as the example of a single stream transformation.
//org/apache/flink/streaming/api/datastream/DataStream.java:592
public <R> SingleOutputStreamOperator<R> map(
        MapFunction<T, R> mapper, TypeInformation<R> outputType) {
    return transform("Map", outputType, new StreamMap<>(clean(mapper)));
}
- A map call triggers the transform function:
//org/apache/flink/streaming/api/datastream/DataStream.java:1180
public <R> SingleOutputStreamOperator<R> transform(
        String operatorName,
        TypeInformation<R> outTypeInfo,
        OneInputStreamOperatorFactory<T, R> operatorFactory) {
    return doTransform(operatorName, outTypeInfo, operatorFactory);
}
//org/apache/flink/streaming/api/datastream/DataStream.java:1188
protected <R> SingleOutputStreamOperator<R> doTransform(
        String operatorName,
        TypeInformation<R> outTypeInfo,
        StreamOperatorFactory<R> operatorFactory) {

    // read the output type of the input Transform to coax out errors about MissingTypeInfo
    transformation.getOutputType();

    OneInputTransformation<T, R> resultTransform =
            new OneInputTransformation<>(
                    this.transformation,
                    operatorName,
                    operatorFactory,
                    outTypeInfo,
                    environment.getParallelism());

    @SuppressWarnings({"unchecked", "rawtypes"})
    SingleOutputStreamOperator<R> returnStream =
            new SingleOutputStreamOperator(environment, resultTransform);

    getExecutionEnvironment().addOperator(resultTransform);

    return returnStream;
}
- transform delegates to doTransform, which does the following:
- Read the output type of the transformation wrapped by the current DataStream, so that a MissingTypeInfo error surfaces early if the output type is unknown.
- Wrap the current DataStream's transformation together with the operator to apply into a new one-input transformation. OneInputTransformation is also a Transformation; its two generic parameters are the transformation's input type and its output type.
- Finally, wrap the environment object and the new transformation as member fields of another new DataStream object (a SingleOutputStreamOperator) and return it. Note that only streams that go through transform() produce a new DataStream operator.
- Note the getExecutionEnvironment().addOperator(resultTransform) call: this is how Flink keeps track of all transformation operations.
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:2325
public void addOperator(Transformation<?> transformation) {
    Preconditions.checkNotNull(transformation, "transformation must not be null.");
    this.transformations.add(transformation);
}
- In other words, the operations a user performs on a DataStream, such as map and filter, are transformations on the DataStream that Flink merely records; construction of the StreamGraph only truly begins when env.execute() is finally called.
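To make this bookkeeping concrete, here is a small pipeline and what each call contributes (a sketch; names are made up):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Integer> source = env.fromElements(1, 2, 3);

        // Each call below goes through transform()/doTransform(): a new
        // OneInputTransformation is appended to env's `transformations` list
        // and a new DataStream wrapping it is returned.
        DataStream<Integer> doubled = source.map(x -> x * 2).returns(Types.INT);
        DataStream<Integer> evens = doubled.filter(x -> x > 2);

        evens.print(); // adds the sink transformation

        // Only now does graph construction start.
        env.execute("transform-demo");
    }
}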
4. Generating the Logical Plan: StreamGraph
After the user calls env.execute(), the following logic runs:
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:2041
public JobExecutionResult execute(String jobName) throws Exception {
    final List<Transformation<?>> originalTransformations = new ArrayList<>(transformations);
    StreamGraph streamGraph = getStreamGraph();
    if (jobName != null) {
        streamGraph.setJobName(jobName);
    }

    try {
        return execute(streamGraph);
    } catch (Throwable t) {
        Optional<ClusterDatasetCorruptedException> clusterDatasetCorruptedException =
                ExceptionUtils.findThrowable(t, ClusterDatasetCorruptedException.class);
        if (!clusterDatasetCorruptedException.isPresent()) {
            throw t;
        }

        // Retry without cache if it is caused by corrupted cluster dataset.
        invalidateCacheTransformations(originalTransformations);
        streamGraph = getStreamGraph(originalTransformations);
        return execute(streamGraph);
    }
}
This code mainly calls getStreamGraph to build the StreamGraph.
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:2237
private StreamGraph getStreamGraph(List<Transformation<?>> transformations) {
    synchronizeClusterDatasetStatus();
    return getStreamGraphGenerator(transformations).generate();
}
A StreamGraphGenerator is created first, and its generate method then builds the StreamGraph. The generator mainly captures the transformations, the execution configuration, the checkpoint configuration, and some cache state as member fields. The actual StreamGraph construction is the following code:
//org/apache/flink/streaming/api/graph/StreamGraphGenerator.java:308
public StreamGraph generate() {
    streamGraph = new StreamGraph(executionConfig, checkpointConfig, savepointRestoreSettings);
    streamGraph.setEnableCheckpointsAfterTasksFinish(
            configuration.get(
                    ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH));
    shouldExecuteInBatchMode = shouldExecuteInBatchMode();
    configureStreamGraph(streamGraph);

    alreadyTransformed = new IdentityHashMap<>();

    for (Transformation<?> transformation : transformations) {
        transform(transformation);
    }

    streamGraph.setSlotSharingGroupResource(slotSharingGroupResources);

    setFineGrainedGlobalStreamExchangeMode(streamGraph);

    for (StreamNode node : streamGraph.getStreamNodes()) {
        if (node.getInEdges().stream().anyMatch(this::shouldDisableUnalignedCheckpointing)) {
            for (StreamEdge edge : node.getInEdges()) {
                edge.setSupportsUnalignedCheckpoints(false);
            }
        }
    }

    final StreamGraph builtStreamGraph = streamGraph;

    alreadyTransformed.clear();
    alreadyTransformed = null;
    streamGraph = null;

    return builtStreamGraph;
}
Summary of this logic:
- Build a StreamGraph object and configure it: mostly applying settings and deciding whether this is a streaming or a batch job, since the two are generated differently.
- The loop

for (Transformation<?> transformation : transformations) {
    transform(transformation);
}

converts each Transformation managed by the StreamExecutionEnvironment into its runtime TransformationTranslator, then uses that translation to create the StreamGraph's StreamNode vertices and wire up their connections.
//org/apache/flink/streaming/api/graph/StreamGraphGenerator.java:187
static {
    @SuppressWarnings("rawtypes")
    Map<Class<? extends Transformation>, TransformationTranslator<?, ? extends Transformation>>
            tmp = new HashMap<>();
    tmp.put(OneInputTransformation.class, new OneInputTransformationTranslator<>());
    tmp.put(TwoInputTransformation.class, new TwoInputTransformationTranslator<>());
    tmp.put(MultipleInputTransformation.class, new MultiInputTransformationTranslator<>());
    tmp.put(KeyedMultipleInputTransformation.class, new MultiInputTransformationTranslator<>());
    tmp.put(SourceTransformation.class, new SourceTransformationTranslator<>());
    tmp.put(SinkTransformation.class, new SinkTransformationTranslator<>());
    tmp.put(LegacySinkTransformation.class, new LegacySinkTransformationTranslator<>());
    tmp.put(LegacySourceTransformation.class, new LegacySourceTransformationTranslator<>());
    tmp.put(UnionTransformation.class, new UnionTransformationTranslator<>());
    tmp.put(PartitionTransformation.class, new PartitionTransformationTranslator<>());
    tmp.put(SideOutputTransformation.class, new SideOutputTransformationTranslator<>());
    tmp.put(ReduceTransformation.class, new ReduceTransformationTranslator<>());
    tmp.put(
            TimestampsAndWatermarksTransformation.class,
            new TimestampsAndWatermarksTransformationTranslator<>());
    tmp.put(BroadcastStateTransformation.class, new BroadcastStateTransformationTranslator<>());
    tmp.put(
            KeyedBroadcastStateTransformation.class,
            new KeyedBroadcastStateTransformationTranslator<>());
    tmp.put(CacheTransformation.class, new CacheTransformationTranslator<>());
    translatorMap = Collections.unmodifiableMap(tmp);
}
- The Transformation-to-TransformationTranslator mapping is one-to-one; the full set of pairs is listed above.
- For streaming jobs, the concrete StreamNodes are created by the translator's translateForStreaming method and added to the StreamGraph. Taking a OneInputTransformation as the example, it is first mapped to a OneInputTransformationTranslator, which extends AbstractOneInputTransformationTranslator, itself a SimpleTransformationTranslator.
- The abstract parent's translateInternal method then performs the actual insertion of the StreamNode:
//org/apache/flink/streaming/runtime/translators/OneInputTransformationTranslator.java:63
@Override
public Collection<Integer> translateForStreamingInternal(
        final OneInputTransformation<IN, OUT> transformation, final Context context) {
    return translateInternal(
            transformation,
            transformation.getOperatorFactory(),
            transformation.getInputType(),
            transformation.getStateKeySelector(),
            transformation.getStateKeyType(),
            context);
}
- Finally, StreamGraph's addNode method is called; the StreamNode stores the operator's information:
//org/apache/flink/streaming/api/graph/StreamGraph.java:511
protected StreamNode addNode(
        Integer vertexID,
        @Nullable String slotSharingGroup,
        @Nullable String coLocationGroup,
        Class<? extends TaskInvokable> vertexClass,
        StreamOperatorFactory<?> operatorFactory,
        String operatorName) {

    if (streamNodes.containsKey(vertexID)) {
        throw new RuntimeException("Duplicate vertexID " + vertexID);
    }

    StreamNode vertex =
            new StreamNode(
                    vertexID,
                    slotSharingGroup,
                    coLocationGroup,
                    operatorFactory,
                    operatorName,
                    vertexClass);

    streamNodes.put(vertexID, vertex);

    return vertex;
}
- Once the traversal completes, the DAG of StreamNodes, the StreamGraph, has been built.
- Finally, the finished logical plan, the StreamGraph, is returned.
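For inspection, the generated logical plan can be dumped without running the job; env.getExecutionPlan() builds the StreamGraph exactly as described above and renders it as JSON (a sketch; names are made up):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PlanDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).map(x -> x + 1).returns(Types.INT).print();

        // Dumps the StreamGraph as JSON: one entry per StreamNode with its
        // operator name, parallelism, and predecessor edges. Nothing executes.
        System.out.println(env.getExecutionPlan());
    }
}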
5. Generating the Optimized Logical Plan: JobGraph
The finished StreamGraph is then optimized once more, based on operator chains:
//org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java:2183
public JobClient executeAsync(StreamGraph streamGraph) throws Exception {
    checkNotNull(streamGraph, "StreamGraph cannot be null.");
    final PipelineExecutor executor = getPipelineExecutor();

    CompletableFuture<JobClient> jobClientFuture =
            executor.execute(streamGraph, configuration, userClassloader);

    try {
        JobClient jobClient = jobClientFuture.get();
        jobListeners.forEach(jobListener -> jobListener.onJobSubmitted(jobClient, null));
        collectIterators.forEach(iterator -> iterator.setJobClient(jobClient));
        collectIterators.clear();
        return jobClient;
    } catch (ExecutionException executionException) {
        final Throwable strippedException =
                ExceptionUtils.stripExecutionException(executionException);
        jobListeners.forEach(
                jobListener -> jobListener.onJobSubmitted(null, strippedException));

        throw new FlinkException(
                String.format("Failed to execute job '%s'.", streamGraph.getJobName()),
                strippedException);
    }
}
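On the user-facing side, the same path is reachable through env.executeAsync(), which returns the JobClient immediately instead of blocking (a sketch; the job name is made up, and the commented line assumes the newer no-argument getJobExecutionResult()):

import org.apache.flink.api.common.JobID;
import org.apache.flink.core.execution.JobClient;

// Assuming a pipeline has already been built on `env`:
JobClient client = env.executeAsync("my-async-job");
JobID jobId = client.getJobID();           // ID of the submitted job
// client.getJobExecutionResult().get();   // optionally block for completion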
A YARN per-job deployment goes through the following method:
//org/apache/flink/client/deployment/executors/AbstractJobClusterExecutor.java:66
@Override
public CompletableFuture<JobClient> execute(
        @Nonnull final Pipeline pipeline,
        @Nonnull final Configuration configuration,
        @Nonnull final ClassLoader userCodeClassloader)
        throws Exception {
    final JobGraph jobGraph = PipelineExecutorUtils.getJobGraph(pipeline, configuration);

    try (final ClusterDescriptor<ClusterID> clusterDescriptor =
            clusterClientFactory.createClusterDescriptor(configuration)) {
        final ExecutionConfigAccessor configAccessor =
                ExecutionConfigAccessor.fromConfiguration(configuration);

        final ClusterSpecification clusterSpecification =
                clusterClientFactory.getClusterSpecification(configuration);

        final ClusterClientProvider<ClusterID> clusterClientProvider =
                clusterDescriptor.deployJobCluster(
                        clusterSpecification, jobGraph, configAccessor.getDetachedMode());
        LOG.info("Job has been submitted with JobID " + jobGraph.getJobID());

        return CompletableFuture.completedFuture(
                new ClusterClientJobClientAdapter<>(
                        clusterClientProvider, jobGraph.getJobID(), userCodeClassloader));
    }
}
The StreamGraph is optimized into the JobGraph:
//org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java:204
private JobGraph createJobGraph() {
    preValidate();
    jobGraph.setJobType(streamGraph.getJobType());

    jobGraph.enableApproximateLocalRecovery(
            streamGraph.getCheckpointConfig().isApproximateLocalRecoveryEnabled());

    // Generate deterministic hashes for the nodes in order to identify them across
    // submission iff they didn't change.
    Map<Integer, byte[]> hashes =
            defaultStreamGraphHasher.traverseStreamGraphAndGenerateHashes(streamGraph);

    // Generate legacy version hashes for backwards compatibility
    List<Map<Integer, byte[]>> legacyHashes = new ArrayList<>(legacyStreamGraphHashers.size());
    for (StreamGraphHasher hasher : legacyStreamGraphHashers) {
        legacyHashes.add(hasher.traverseStreamGraphAndGenerateHashes(streamGraph));
    }

    setChaining(hashes, legacyHashes);

    setPhysicalEdges();

    markContainsSourcesOrSinks();

    setSlotSharingAndCoLocation();

    setManagedMemoryFraction(
            Collections.unmodifiableMap(jobVertices),
            Collections.unmodifiableMap(vertexConfigs),
            Collections.unmodifiableMap(chainedConfigs),
            id -> streamGraph.getStreamNode(id).getManagedMemoryOperatorScopeUseCaseWeights(),
            id -> streamGraph.getStreamNode(id).getManagedMemorySlotScopeUseCases());

    configureCheckpointing();

    jobGraph.setSavepointRestoreSettings(streamGraph.getSavepointRestoreSettings());

    final Map<String, DistributedCache.DistributedCacheEntry> distributedCacheEntries =
            JobGraphUtils.prepareUserArtifactEntries(
                    streamGraph.getUserArtifacts().stream()
                            .collect(Collectors.toMap(e -> e.f0, e -> e.f1)),
                    jobGraph.getJobID());

    for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
            distributedCacheEntries.entrySet()) {
        jobGraph.addUserArtifact(entry.getKey(), entry.getValue());
    }

    // set the ExecutionConfig last when it has been finalized
    try {
        jobGraph.setExecutionConfig(streamGraph.getExecutionConfig());
    } catch (IOException e) {
        throw new IllegalConfigurationException(
                "Could not serialize the ExecutionConfig."
                        + "This indicates that non-serializable types (like custom serializers) were registered");
    }
    jobGraph.setChangelogStateBackendEnabled(streamGraph.isChangelogStateBackendEnabled());

    addVertexIndexPrefixInVertexName();

    setVertexDescription();

    // Wait for the serialization of operator coordinators and stream config.
    try {
        FutureUtils.combineAll(
                        vertexConfigs.values().stream()
                                .map(
                                        config ->
                                                config.triggerSerializationAndReturnFuture(
                                                        serializationExecutor))
                                .collect(Collectors.toList()))
                .get();

        FutureUtils.combineAll(coordinatorSerializationFutures).get();
    } catch (Exception e) {
        throw new FlinkRuntimeException("Error in serialization.", e);
    }

    if (!streamGraph.getJobStatusHooks().isEmpty()) {
        jobGraph.setJobStatusHooks(streamGraph.getJobStatusHooks());
    }

    return jobGraph;
}
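Before moving on to chaining, note the hash generation at the top of this method: the deterministic hashes become the JobVertex/operator IDs, and when a transformation carries a user-specified uid, the hasher derives the hash from that uid rather than from the topology, which keeps operator state addressable across resubmissions. A user-level sketch (names are made up):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3)
                .map(x -> x + 1).returns(Types.INT)
                .uid("enrich-map")   // feeds the deterministic hash for this node
                .name("enrich")      // display name only; does not affect the hash
                .print();

        env.execute("uid-demo");
    }
}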
The core generation logic is the call to createChain inside setChaining (line 222 of the file above); its essentials are as follows:
//org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java:596
private List<StreamEdge> createChain(
        final Integer currentNodeId,
        final int chainIndex,
        final OperatorChainInfo chainInfo,
        final Map<Integer, OperatorChainInfo> chainEntryPoints) {

    Integer startNodeId = chainInfo.getStartNodeId();
    if (!builtVertices.contains(startNodeId)) {

        List<StreamEdge> transitiveOutEdges = new ArrayList<StreamEdge>();

        List<StreamEdge> chainableOutputs = new ArrayList<StreamEdge>();
        List<StreamEdge> nonChainableOutputs = new ArrayList<StreamEdge>();

        StreamNode currentNode = streamGraph.getStreamNode(currentNodeId);

        for (StreamEdge outEdge : currentNode.getOutEdges()) {
            if (isChainable(outEdge, streamGraph)) {
                chainableOutputs.add(outEdge);
            } else {
                nonChainableOutputs.add(outEdge);
            }
        }

        for (StreamEdge chainable : chainableOutputs) {
            transitiveOutEdges.addAll(
                    createChain(
                            chainable.getTargetId(),
                            chainIndex + 1,
                            chainInfo,
                            chainEntryPoints));
        }

        for (StreamEdge nonChainable : nonChainableOutputs) {
            transitiveOutEdges.add(nonChainable);
            createChain(
                    nonChainable.getTargetId(),
                    1, // operators start at position 1 because 0 is for chained source inputs
                    chainEntryPoints.computeIfAbsent(
                            nonChainable.getTargetId(),
                            (k) -> chainInfo.newChain(nonChainable.getTargetId())),
                    chainEntryPoints);
        }

        chainedNames.put(
                currentNodeId,
                createChainedName(
                        currentNodeId,
                        chainableOutputs,
                        Optional.ofNullable(chainEntryPoints.get(currentNodeId))));
        chainedMinResources.put(
                currentNodeId, createChainedMinResources(currentNodeId, chainableOutputs));
        chainedPreferredResources.put(
                currentNodeId,
                createChainedPreferredResources(currentNodeId, chainableOutputs));

        OperatorID currentOperatorId =
                chainInfo.addNodeToChain(
                        currentNodeId,
                        streamGraph.getStreamNode(currentNodeId).getOperatorName());

        if (currentNode.getInputFormat() != null) {
            getOrCreateFormatContainer(startNodeId)
                    .addInputFormat(currentOperatorId, currentNode.getInputFormat());
        }

        if (currentNode.getOutputFormat() != null) {
            getOrCreateFormatContainer(startNodeId)
                    .addOutputFormat(currentOperatorId, currentNode.getOutputFormat());
        }

        StreamConfig config =
                currentNodeId.equals(startNodeId)
                        ? createJobVertex(startNodeId, chainInfo)
                        : new StreamConfig(new Configuration());

        setVertexConfig(
                currentNodeId,
                config,
                chainableOutputs,
                nonChainableOutputs,
                chainInfo.getChainedSources());

        if (currentNodeId.equals(startNodeId)) {

            config.setChainStart();
            config.setChainIndex(chainIndex);
            config.setOperatorName(streamGraph.getStreamNode(currentNodeId).getOperatorName());

            LinkedHashSet<NonChainedOutput> transitiveOutputs = new LinkedHashSet<>();
            for (StreamEdge edge : transitiveOutEdges) {
                NonChainedOutput output =
                        opIntermediateOutputs.get(edge.getSourceId()).get(edge);
                transitiveOutputs.add(output);
                connect(startNodeId, edge, output);
            }

            config.setVertexNonChainedOutputs(new ArrayList<>(transitiveOutputs));
            config.setTransitiveChainedTaskConfigs(chainedConfigs.get(startNodeId));

        } else {
            chainedConfigs.computeIfAbsent(
                    startNodeId, k -> new HashMap<Integer, StreamConfig>());

            config.setChainIndex(chainIndex);
            StreamNode node = streamGraph.getStreamNode(currentNodeId);
            config.setOperatorName(node.getOperatorName());
            chainedConfigs.get(startNodeId).put(currentNodeId, config);
        }

        config.setOperatorID(currentOperatorId);

        if (chainableOutputs.isEmpty()) {
            config.setChainEnd();
        }
        return transitiveOutEdges;

    } else {
        return new ArrayList<>();
    }
}
Summary of the chaining optimization:
- If no JobVertex has been built yet for the current chain's startNode, the chaining logic runs. Line 622 performs a depth-first traversal that chains together the operators between the source node and the first non-chainable StreamNode (leaf chains are computed first, working back toward the root).
- Line 629: when a non-chainable edge is encountered, a new depth-first traversal starts there, which will produce a new JobVertex.
- Line 672 writes the StreamNode's input/output configuration, serialization settings included, into the StreamConfig above and keeps it in vertexConfigs; for a newly created JobVertex, the StreamConfig is stored under startNodeId as the key.
- transitiveOutEdges accumulates all of the node's downstream non-chainable edges; the method ultimately returns this collection.
- Connect the startNode to every edge in transitiveOutEdges (this creates an IntermediateDataSet with a pipelined partition type on the upstream JobVertex and generates a JobEdge).
- If a new JobVertex was created, keep filling in its config: setChainStart, all physical outputs, the direct logical outputs, chainedConfigs, and so on.
- If the vertex is not newly created, only record its config in chainedConfigs.
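Chainability can also be influenced from the user API; a quick sketch of how that shows up in the resulting JobGraph (names are made up; without disableChaining() the operators below would typically collapse into a single JobVertex):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3)
                .map(x -> x * 2).returns(Types.INT)     // chains with the filter below
                .filter(x -> x > 2)                     // same JobVertex as the map
                .map(x -> String.valueOf(x)).returns(Types.STRING)
                .disableChaining()                      // isChainable() fails here: own JobVertex
                .print();

        env.execute("chaining-demo");
    }
}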
A brief recap of how the JobGraph comes together:
- Operations on the DataStream produce the transformations list.
- The transformations are mapped to runtime TransformationTranslators, which generate the StreamNodes and StreamEdges.
- Operator chaining merges chainable operators into JobVertex instances, which are then connected by new JobEdges.
A JobVertex represents one node of the logical plan, i.e. one vertex of the DAG. Note that in the code, a JobVertex is effectively handed around as configuration wrapped in a StreamConfig.
So how exactly is this StreamConfig (JobVertex) created?
//org/apache/flink/streaming/api/graph/StreamingJobGraphGenerator.java:770
private StreamConfig createJobVertex(Integer streamNodeId, OperatorChainInfo chainInfo) {

    JobVertex jobVertex;
    StreamNode streamNode = streamGraph.getStreamNode(streamNodeId);

    byte[] hash = chainInfo.getHash(streamNodeId);

    if (hash == null) {
        throw new IllegalStateException(
                "Cannot find node hash. "
                        + "Did you generate them before calling this method?");
    }

    JobVertexID jobVertexId = new JobVertexID(hash);

    List<Tuple2<byte[], byte[]>> chainedOperators =
            chainInfo.getChainedOperatorHashes(streamNodeId);
    List<OperatorIDPair> operatorIDPairs = new ArrayList<>();
    if (chainedOperators != null) {
        for (Tuple2<byte[], byte[]> chainedOperator : chainedOperators) {
            OperatorID userDefinedOperatorID =
                    chainedOperator.f1 == null ? null : new OperatorID(chainedOperator.f1);
            operatorIDPairs.add(
                    OperatorIDPair.of(
                            new OperatorID(chainedOperator.f0), userDefinedOperatorID));
        }
    }

    if (chainedInputOutputFormats.containsKey(streamNodeId)) {
        jobVertex =
                new InputOutputFormatVertex(
                        chainedNames.get(streamNodeId), jobVertexId, operatorIDPairs);

        chainedInputOutputFormats
                .get(streamNodeId)
                .write(new TaskConfig(jobVertex.getConfiguration()));
    } else {
        jobVertex = new JobVertex(chainedNames.get(streamNodeId), jobVertexId, operatorIDPairs);
    }

    if (streamNode.getConsumeClusterDatasetId() != null) {
        jobVertex.addIntermediateDataSetIdToConsume(streamNode.getConsumeClusterDatasetId());
    }

    for (OperatorCoordinator.Provider coordinatorProvider :
            chainInfo.getCoordinatorProviders()) {
        coordinatorSerializationFutures.add(
                CompletableFuture.runAsync(
                        () -> {
                            try {
                                jobVertex.addOperatorCoordinator(
                                        new SerializedValue<>(coordinatorProvider));
                            } catch (IOException e) {
                                throw new FlinkRuntimeException(
                                        String.format(
                                                "Coordinator Provider for node %s is not serializable.",
                                                chainedNames.get(streamNodeId)),
                                        e);
                            }
                        },
                        serializationExecutor));
    }

    jobVertex.setResources(
            chainedMinResources.get(streamNodeId), chainedPreferredResources.get(streamNodeId));

    jobVertex.setInvokableClass(streamNode.getJobVertexClass());

    int parallelism = streamNode.getParallelism();

    if (parallelism > 0) {
        jobVertex.setParallelism(parallelism);
    } else {
        parallelism = jobVertex.getParallelism();
    }

    jobVertex.setMaxParallelism(streamNode.getMaxParallelism());

    if (LOG.isDebugEnabled()) {
        LOG.debug("Parallelism set: {} for {}", parallelism, streamNodeId);
    }

    jobVertices.put(streamNodeId, jobVertex);
    builtVertices.add(streamNodeId);
    jobGraph.addVertex(jobVertex);

    return new StreamConfig(jobVertex.getConfiguration());
}
Two points deserve attention here:
- Line 798: chainedInputOutputFormats.containsKey(streamNodeId) checks whether the chain starting at this node contains input/output formats (i.e. format-based sources or sinks); if so, the user code is written into the jobVertex's configuration so that it can be initialized when the job is submitted to the JobManager.
- Line 836: jobVertex.setInvokableClass(streamNode.getJobVertexClass()) sets the class that actually runs at execution time (for example SourceStreamTask or OneInputStreamTask). It is through this class that the user-defined operator functions are invoked; it is the class a Flink task really executes.
With that, the JobGraph is complete. The client submits it to the cluster created on YARN, and the physical plan is then generated in the JobManager. The next section covers how Flink requests resources from YARN, starts the JobManager and TaskManagers, and deploys the whole job.