[Flink] Flink 1.12.2 Source Code Walkthrough: Task



1. Overview

Reposted from: Flink 1.12.2 Source Code Walkthrough: Task

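A Task represents one execution of a parallel subtask on a TaskManager.

A Task wraps a Flink operator (which may be a user function) and runs it,
providing all the services needed to consume input data, produce results (intermediate result partitions), and communicate with the JobManager.

Flink operators (implemented as subclasses of {@link AbstractInvokable}) only have data readers, writers, and certain event callbacks.
The Task connects them to the network stack and actor messages, tracks the execution state, and handles exceptions.

Tasks have no knowledge of how they relate to other tasks, or of whether they are the first attempt to execute the task or a repeated attempt.
All of that is known only to the JobManager.

A Task knows only its own runnable code, the task's configuration, and the IDs of the intermediate results it consumes and produces (if any).
Each Task is run by one dedicated thread.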
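
To make the division of labour concrete, here is a minimal sketch of the idea rather than Flink's actual code. The names `MiniTask`, `Invokable`, and `runToCompletion` are hypothetical stand-ins: the wrapper owns one dedicated thread, runs the wrapped code, tracks the state, and records any failure, while the wrapped code knows nothing about threads or state.

```java
public class MiniTaskDemo {

    /** Hypothetical stand-in for Flink's AbstractInvokable: just the code to run. */
    interface Invokable {
        void invoke() throws Exception;
    }

    enum State { CREATED, RUNNING, FINISHED, FAILED }

    /** Simplified wrapper: owns one dedicated thread, tracks state, records failures. */
    static final class MiniTask implements Runnable {
        private final Invokable invokable;
        private final Thread executingThread;
        private volatile State state = State.CREATED;
        private volatile Throwable failureCause;

        MiniTask(String name, Invokable invokable) {
            this.invokable = invokable;
            // The thread is created here but not started yet.
            this.executingThread = new Thread(this, name);
        }

        void start() {
            executingThread.start();
        }

        void join() throws InterruptedException {
            executingThread.join();
        }

        State getState() {
            return state;
        }

        Throwable getFailureCause() {
            return failureCause;
        }

        @Override
        public void run() {
            state = State.RUNNING;
            try {
                invokable.invoke();
                state = State.FINISHED;
            } catch (Throwable t) {
                // The wrapper, not the wrapped code, decides what a failure means.
                failureCause = t;
                state = State.FAILED;
            }
        }
    }

    /** Runs the invokable on its own thread and returns the final state. */
    static State runToCompletion(Invokable invokable) {
        MiniTask task = new MiniTask("mini-task", invokable);
        task.start();
        try {
            task.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return task.getState();
    }

    public static void main(String[] args) {
        System.out.println(runToCompletion(() -> {}));
        System.out.println(runToCompletion(() -> {
            throw new IllegalStateException("boom");
        }));
    }
}
```

The real Task does far more (network registration, class loading, cancellation), but the overall shape is the same: a Runnable that owns its thread and wraps the invokable.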

2. Code Walkthrough

Task is a class that implements the Runnable interface (among others).

2.1. Fields

/** The thread group that contains all task threads. */
private static final ThreadGroup TASK_THREADS_GROUP = new ThreadGroup("Flink Task Threads");

/** For atomic state updates of the executionState field. */
private static final AtomicReferenceFieldUpdater<Task, ExecutionState> STATE_UPDATER =
        AtomicReferenceFieldUpdater.newUpdater(Task.class, ExecutionState.class, "executionState");

// ------------------------------------------------------------------------
//  Constant fields that are part of the initial Task construction
// ------------------------------------------------------------------------

/** The job that the task belongs to. */
private final JobID jobId;

/** The vertex in the JobGraph whose code the task executes. */
private final JobVertexID vertexId;

/** The execution attempt of the parallel subtask. */
private final ExecutionAttemptID executionId;

/** ID which identifies the slot in which the task is supposed to run. */
private final AllocationID allocationId;

/** TaskInfo object for this task. */
private final TaskInfo taskInfo;

/** The name of the task, including subtask indexes. */
private final String taskNameWithSubtask;

/** The job-wide configuration object. */
private final Configuration jobConfiguration;

/** The task-specific configuration. */
private final Configuration taskConfiguration;

/** The jar files used by this task. */
private final Collection<PermanentBlobKey> requiredJarFiles;

/** The classpaths used by this task. */
private final Collection<URL> requiredClasspaths;

/**
 * The name of the class that holds the invokable code, for example:
 * 1. org.apache.flink.streaming.runtime.tasks.SourceStreamTask
 * 2. org.apache.flink.streaming.runtime.tasks.OneInputStreamTask
 */
private final String nameOfInvokableClass;

/** Access to task manager configuration and host names. */
private final TaskManagerRuntimeInfo taskManagerConfig;

/** The memory manager to be used by this task. */
private final MemoryManager memoryManager;

/** The I/O manager to be used by this task. */
private final IOManager ioManager;

/** The BroadcastVariableManager to be used by this task. */
private final BroadcastVariableManager broadcastVariableManager;

/** The dispatcher for task events. */
private final TaskEventDispatcher taskEventDispatcher;

/** Information provider for external resources. */
private final ExternalResourceInfoProvider externalResourceInfoProvider;

/** The manager for state of operators running in this task/slot. */
private final TaskStateManager taskStateManager;

/**
 * Serialized version of the job specific execution configuration
 * (see {@link ExecutionConfig}).
 */
private final SerializedValue<ExecutionConfig> serializedExecutionConfig;

/**
 * A record-oriented runtime result writer API for producing results.
 *
 * ResultPartitionWriter
 */
private final ResultPartitionWriter[] consumableNotifyingPartitionWriters;

/**
 * An {@link InputGate} with a specific index.
 */
private final IndexedInputGate[] inputGates;

/** Connection to the task manager. */
private final TaskManagerActions taskManagerActions;

/** Input split provider for the task. */
private final InputSplitProvider inputSplitProvider;

/** Checkpoint notifier used to communicate with the CheckpointCoordinator. */
private final CheckpointResponder checkpointResponder;

/**
 * The gateway for operators to send messages to the operator coordinators on the Job Manager.
 */
private final TaskOperatorEventGateway operatorCoordinatorEventGateway;

/** GlobalAggregateManager used to update aggregates on the JobMaster. */
private final GlobalAggregateManager aggregateManager;

/** The library cache, from which the task can request its class loader. */
private final LibraryCacheManager.ClassLoaderHandle classLoaderHandle;

/** The cache for user-defined files that the invokable requires. */
private final FileCache fileCache;

/** The service for kvState registration of this task. */
private final KvStateService kvStateService;

/** The registry of this task which enables live reporting of accumulators. */
private final AccumulatorRegistry accumulatorRegistry;

/** The thread that executes the task. */
private final Thread executingThread;

/** Parent group for all metrics of this task. */
private final TaskMetricGroup metrics;

/** Partition producer state checker to request partition states from. */
private final PartitionProducerStateChecker partitionProducerStateChecker;

/** Executor to run future callbacks. */
private final Executor executor;

/** Future that is completed once {@link #run()} exits. */
private final CompletableFuture<ExecutionState> terminationFuture = new CompletableFuture<>();

// ------------------------------------------------------------------------
//  Fields that control the task execution. All these fields are volatile
//  (which means that they introduce memory barriers), to establish
//  proper happens-before semantics on parallel modification
// ------------------------------------------------------------------------

/**
 * Atomic flag that makes sure the invokable is canceled exactly once upon error.
 * Initialized to false in the constructor.
 */
private final AtomicBoolean invokableHasBeenCanceled;

/**
 * The invokable of this task, if initialized.
 *
 * All accesses must copy the reference and check for null,
 * as this field is cleared as part of the disposal logic.
 */
@Nullable private volatile AbstractInvokable invokable;

/** The current execution state of the task. */
private volatile ExecutionState executionState = ExecutionState.CREATED;

/** The observed exception, in case the task execution failed. */
private volatile Throwable failureCause;

/**
 * Initialized from the Flink configuration (default: 30000 ms).
 * May also be set at the ExecutionConfig
 */
private long taskCancellationInterval;

/**
 * Initialized from the Flink configuration.
 * May also be set at the ExecutionConfig
 */
private long taskCancellationTimeout;

/**
 * This class loader should be set as the context class loader for threads that may dynamically
 * load user code.
 */
private UserCodeClassLoader userCodeClassLoader;
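
The comment on the `invokable` field asks every accessor to copy the reference and check it for null, because another thread may clear the field during disposal. A minimal sketch of that access pattern follows. The `Holder` class and its method names are hypothetical and not part of Flink.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VolatileReferenceDemo {

    /** Hypothetical holder whose 'worker' field is cleared during disposal. */
    static final class Holder {
        private volatile Runnable worker;

        Holder(Runnable worker) {
            this.worker = worker;
        }

        /** Returns true if the worker was still present and has been run. */
        boolean runIfPresent() {
            // Read the volatile field exactly once. The local copy cannot become
            // null later, even if another thread calls dispose() concurrently.
            final Runnable local = this.worker;
            if (local == null) {
                return false;
            }
            local.run();
            return true;
        }

        /** Clears the reference, as the disposal logic would. */
        void dispose() {
            this.worker = null;
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Holder holder = new Holder(calls::incrementAndGet);

        System.out.println(holder.runIfPresent()); // worker is present
        holder.dispose();
        System.out.println(holder.runIfPresent()); // worker has been cleared
    }
}
```

Reading `this.worker` twice (once for the null check, once for the call) would leave a window in which the second read sees null. The local copy closes that window. `triggerCheckpointBarrier` in section 2.5.1 reads `this.invokable` in the same way.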

2.2. Constructor

   /**
     * <b>IMPORTANT:</b> This constructor may not start any work that would need to be undone in the
     * case of a failing task deployment.
     */
    public Task(
            JobInformation jobInformation,
            TaskInformation taskInformation,
            ExecutionAttemptID executionAttemptID,
            AllocationID slotAllocationId,
            int subtaskIndex,
            int attemptNumber,
            List<ResultPartitionDeploymentDescriptor> resultPartitionDeploymentDescriptors,
            List<InputGateDeploymentDescriptor> inputGateDeploymentDescriptors,
            int targetSlotNumber,
            MemoryManager memManager,
            IOManager ioManager,
            ShuffleEnvironment<?, ?> shuffleEnvironment,
            KvStateService kvStateService,
            BroadcastVariableManager bcVarManager,
            TaskEventDispatcher taskEventDispatcher,
            ExternalResourceInfoProvider externalResourceInfoProvider,
            TaskStateManager taskStateManager,
            TaskManagerActions taskManagerActions,
            InputSplitProvider inputSplitProvider,
            CheckpointResponder checkpointResponder,
            TaskOperatorEventGateway operatorCoordinatorEventGateway,
            GlobalAggregateManager aggregateManager,
            LibraryCacheManager.ClassLoaderHandle classLoaderHandle,
            FileCache fileCache,
            TaskManagerRuntimeInfo taskManagerConfig,
            @Nonnull TaskMetricGroup metricGroup,
            ResultPartitionConsumableNotifier resultPartitionConsumableNotifier,
            PartitionProducerStateChecker partitionProducerStateChecker,
            Executor executor) {

        Preconditions.checkNotNull(jobInformation);
        Preconditions.checkNotNull(taskInformation);

        Preconditions.checkArgument(0 <= subtaskIndex, "The subtask index must be positive.");
        Preconditions.checkArgument(0 <= attemptNumber, "The attempt number must be positive.");
        Preconditions.checkArgument(
                0 <= targetSlotNumber, "The target slot number must be positive.");

        this.taskInfo =
                new TaskInfo(
                        taskInformation.getTaskName(),
                        taskInformation.getMaxNumberOfSubtasks(),
                        subtaskIndex,
                        taskInformation.getNumberOfSubtasks(),
                        attemptNumber,
                        String.valueOf(slotAllocationId));

        this.jobId = jobInformation.getJobId();
        this.vertexId = taskInformation.getJobVertexId();
        this.executionId = Preconditions.checkNotNull(executionAttemptID);
        this.allocationId = Preconditions.checkNotNull(slotAllocationId);
        this.taskNameWithSubtask = taskInfo.getTaskNameWithSubtasks();
        this.jobConfiguration = jobInformation.getJobConfiguration();
        this.taskConfiguration = taskInformation.getTaskConfiguration();
        this.requiredJarFiles = jobInformation.getRequiredJarFileBlobKeys();
        this.requiredClasspaths = jobInformation.getRequiredClasspathURLs();
        this.nameOfInvokableClass = taskInformation.getInvokableClassName();
        this.serializedExecutionConfig = jobInformation.getSerializedExecutionConfig();

        Configuration tmConfig = taskManagerConfig.getConfiguration();
        this.taskCancellationInterval =
                tmConfig.getLong(TaskManagerOptions.TASK_CANCELLATION_INTERVAL);
        this.taskCancellationTimeout =
                tmConfig.getLong(TaskManagerOptions.TASK_CANCELLATION_TIMEOUT);

        this.memoryManager = Preconditions.checkNotNull(memManager);
        this.ioManager = Preconditions.checkNotNull(ioManager);
        this.broadcastVariableManager = Preconditions.checkNotNull(bcVarManager);
        this.taskEventDispatcher = Preconditions.checkNotNull(taskEventDispatcher);
        this.taskStateManager = Preconditions.checkNotNull(taskStateManager);
        this.accumulatorRegistry = new AccumulatorRegistry(jobId, executionId);

        this.inputSplitProvider = Preconditions.checkNotNull(inputSplitProvider);
        this.checkpointResponder = Preconditions.checkNotNull(checkpointResponder);
        this.operatorCoordinatorEventGateway =
                Preconditions.checkNotNull(operatorCoordinatorEventGateway);
        this.aggregateManager = Preconditions.checkNotNull(aggregateManager);
        this.taskManagerActions = checkNotNull(taskManagerActions);
        this.externalResourceInfoProvider = checkNotNull(externalResourceInfoProvider);

        this.classLoaderHandle = Preconditions.checkNotNull(classLoaderHandle);
        this.fileCache = Preconditions.checkNotNull(fileCache);
        this.kvStateService = Preconditions.checkNotNull(kvStateService);
        this.taskManagerConfig = Preconditions.checkNotNull(taskManagerConfig);

        this.metrics = metricGroup;

        this.partitionProducerStateChecker =
                Preconditions.checkNotNull(partitionProducerStateChecker);
        this.executor = Preconditions.checkNotNull(executor);

        // create the reader and writer structures

        final String taskNameWithSubtaskAndId = taskNameWithSubtask + " (" + executionId + ')';

        final ShuffleIOOwnerContext taskShuffleContext =
                shuffleEnvironment.createShuffleIOOwnerContext(
                        taskNameWithSubtaskAndId, executionId, metrics.getIOMetricGroup());

        // produced intermediate result partitions
        final ResultPartitionWriter[] resultPartitionWriters =
                shuffleEnvironment
                        .createResultPartitionWriters(
                                taskShuffleContext, resultPartitionDeploymentDescriptors)
                        .toArray(new ResultPartitionWriter[] {});

        this.consumableNotifyingPartitionWriters =
                ConsumableNotifyingResultPartitionWriterDecorator.decorate(
                        resultPartitionDeploymentDescriptors,
                        resultPartitionWriters,
                        this,
                        jobId,
                        resultPartitionConsumableNotifier);

        // consumed intermediate result partitions
        final IndexedInputGate[] gates =
                shuffleEnvironment
                        .createInputGates(taskShuffleContext, this, inputGateDeploymentDescriptors)
                        .toArray(new IndexedInputGate[0]);

        this.inputGates = new IndexedInputGate[gates.length];
        int counter = 0;
        for (IndexedInputGate gate : gates) {
            inputGates[counter++] =
                    new InputGateWithMetrics(
                            gate, metrics.getIOMetricGroup().getNumBytesInCounter());
        }

        if (shuffleEnvironment instanceof NettyShuffleEnvironment) {
            //noinspection deprecation
            ((NettyShuffleEnvironment) shuffleEnvironment)
                    .registerLegacyNetworkMetrics(
                            metrics.getIOMetricGroup(), resultPartitionWriters, gates);
        }

        invokableHasBeenCanceled = new AtomicBoolean(false);

        // finally, create the executing thread, but do not start it
        executingThread = new Thread(TASK_THREADS_GROUP, this, taskNameWithSubtask);
    }

2.3. startTaskThread

Starts the task thread.

/** Starts the task's thread. */
public void startTaskThread() {
    executingThread.start();
}

2.4. doRun

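Because Task implements the Runnable interface, it is started through Thread.start(), which in turn calls run().
run() delegates to doRun, the core work method that bootstraps the task and executes its code.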

/** The core work method that bootstraps the task and executes its code. */
@Override
public void run() {
    try {
        doRun();
    } finally {
        terminationFuture.complete(executionState);
    }
}
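
The try/finally in run() guarantees that terminationFuture completes whether doRun returns normally or throws. A small sketch of that guarantee, using a hypothetical `Job` class in place of Task:

```java
import java.util.concurrent.CompletableFuture;

public class TerminationFutureDemo {

    enum State { CREATED, FINISHED, FAILED }

    /** Hypothetical, heavily simplified stand-in for Task. */
    static final class Job implements Runnable {
        private final boolean shouldFail;
        private volatile State state = State.CREATED;
        final CompletableFuture<State> terminationFuture = new CompletableFuture<>();

        Job(boolean shouldFail) {
            this.shouldFail = shouldFail;
        }

        @Override
        public void run() {
            try {
                doRun();
            } finally {
                // Completed on every exit path, including exceptions.
                terminationFuture.complete(state);
            }
        }

        private void doRun() {
            if (shouldFail) {
                state = State.FAILED;
                throw new IllegalStateException("simulated failure");
            }
            state = State.FINISHED;
        }
    }

    /** Runs the job on its own thread and waits for the termination future. */
    static State runAndAwait(boolean shouldFail) {
        Job job = new Job(shouldFail);
        Thread thread = new Thread(job, "job-thread");
        // Keep the demo output quiet when doRun() throws.
        thread.setUncaughtExceptionHandler((t, error) -> {});
        thread.start();
        return job.terminationFuture.join();
    }

    public static void main(String[] args) {
        System.out.println(runAndAwait(false));
        System.out.println(runAndAwait(true));
    }
}
```

Callers that wait on the future therefore never hang because of an exception inside the work method.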

doRun

private void doRun() {
        // ----------------------------
        //  Initial State transition
        // ----------------------------
        while (true) {
            // read the current state
            ExecutionState current = this.executionState;
            if (current == ExecutionState.CREATED) {
                // CREATED -> DEPLOYING
                if (transitionState(ExecutionState.CREATED, ExecutionState.DEPLOYING)) {
                    // success, we can start our work:
                    // leave the loop and begin executing the task
                    break;
                }
            } else if (current == ExecutionState.FAILED) {
                // we were immediately failed. tell the TaskManager that we reached our final state
                notifyFinalState();
                if (metrics != null) {
                    metrics.close();
                }
                return;
            } else if (current == ExecutionState.CANCELING) {
                // canceled before the task started running
                if (transitionState(ExecutionState.CANCELING, ExecutionState.CANCELED)) {
                    // we were immediately canceled. tell the TaskManager that we reached our final
                    // state
                    notifyFinalState();
                    if (metrics != null) {
                        metrics.close();
                    }
                    return;
                }
            } else {
                if (metrics != null) {
                    metrics.close();
                }
                throw new IllegalStateException(
                        "Invalid state for beginning of operation of task " + this + '.');
            }
        }

        // all resource acquisitions and registrations from here on
        // need to be undone in the end
        Map<String, Future<Path>> distributedCacheEntries = new HashMap<>();
        AbstractInvokable invokable = null;

        try {
            // ----------------------------
            //  Task Bootstrap - We periodically
            //  check for canceling as a shortcut
            // ----------------------------

            // activate safety net for task thread
            // Sample log output:
            //   Creating FileSystem stream leak safety net for task
            //   Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction)
            //   -> Sink: Print to Std. Out (1/1)#0 (141dd597dc560a831b2b4bc195943f0b) [DEPLOYING]
            LOG.debug("Creating FileSystem stream leak safety net for task {}", this);
            FileSystemSafetyNet.initializeSafetyNetForThread();


            // first of all, get a user-code classloader
            // this may involve downloading the job's JAR files and/or classes

            // Sample log output:
            //   Loading JAR files for task Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1)#0 (141dd597dc560a831b2b4bc195943f0b) [DEPLOYING].
            LOG.info("Loading JAR files for task {}.", this);

            // obtain the user-code class loader (UserCodeClassLoader)
            // Sample log output:
            //   Getting user code class loader for task 141dd597dc560a831b2b4bc195943f0b at library cache manager took 10 milliseconds
            userCodeClassLoader = createUserCodeClassloader();

            final ExecutionConfig executionConfig =
                    serializedExecutionConfig.deserializeValue(userCodeClassLoader.asClassLoader());

            if (executionConfig.getTaskCancellationInterval() >= 0) {
                // override task cancellation interval from Flink config if set in ExecutionConfig
                taskCancellationInterval = executionConfig.getTaskCancellationInterval();
            }

            if (executionConfig.getTaskCancellationTimeout() >= 0) {
                // override task cancellation timeout from Flink config if set in ExecutionConfig
                taskCancellationTimeout = executionConfig.getTaskCancellationTimeout();
            }

            if (isCanceledOrFailed()) {
                throw new CancelTaskException();
            }

            // ----------------------------------------------------------------
            // register the task with the network stack
            // this operation may fail if the system does not have enough
            // memory to run the necessary data exchanges
            // the registration must also strictly be undone
            // ----------------------------------------------------------------

            // register the task at the network stack
            // Sample log output:
            //   Registering task at network: Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1)#0 (141dd597dc560a831b2b4bc195943f0b) [DEPLOYING].
            //   Registering task at network: Source: Socket Stream -> Flat Map (1/1)#0 (fc2db808f4399d580c05db4fd3c2d2df) [DEPLOYING].
            LOG.info("Registering task at network: {}.", this);

            setupPartitionsAndGates(consumableNotifyingPartitionWriters, inputGates);

            for (ResultPartitionWriter partitionWriter : consumableNotifyingPartitionWriters) {
                taskEventDispatcher.registerPartition(partitionWriter.getPartitionId());
            }

            // next, kick off the background copying of files for the distributed cache
            try {
                for (Map.Entry<String, DistributedCache.DistributedCacheEntry> entry :
                        DistributedCache.readFileInfoFromConfig(jobConfiguration)) {
                    LOG.info("Obtaining local cache file for '{}'.", entry.getKey());
                    Future<Path> cp =
                            fileCache.createTmpFile(
                                    entry.getKey(), entry.getValue(), jobId, executionId);
                    distributedCacheEntries.put(entry.getKey(), cp);
                }
            } catch (Exception e) {
                throw new Exception(
                        String.format(
                                "Exception while adding files to distributed cache of task %s (%s).",
                                taskNameWithSubtask, executionId),
                        e);
            }

            if (isCanceledOrFailed()) {
                throw new CancelTaskException();
            }

            // ----------------------------------------------------------------
            //  call the user code initialization methods
            // ----------------------------------------------------------------

            TaskKvStateRegistry kvStateRegistry =
                    kvStateService.createKvStateTaskRegistry(jobId, getJobVertexId());

            Environment env =
                    new RuntimeEnvironment(
                            jobId,
                            vertexId,
                            executionId,
                            executionConfig,
                            taskInfo,
                            jobConfiguration,
                            taskConfiguration,
                            userCodeClassLoader,
                            memoryManager,
                            ioManager,
                            broadcastVariableManager,
                            taskStateManager,
                            aggregateManager,
                            accumulatorRegistry,
                            kvStateRegistry,
                            inputSplitProvider,
                            distributedCacheEntries,
                            consumableNotifyingPartitionWriters,
                            inputGates,
                            taskEventDispatcher,
                            checkpointResponder,
                            operatorCoordinatorEventGateway,
                            taskManagerConfig,
                            metrics,
                            this,
                            externalResourceInfoProvider);

            // Make sure the user code classloader is accessible thread-locally.
            // We are setting the correct context class loader before instantiating the invokable
            // so that it is available to the invokable during its entire lifetime.
            executingThread.setContextClassLoader(userCodeClassLoader.asClassLoader());


            // now load and instantiate the task's invokable code
            invokable =
                    loadAndInstantiateInvokable(
                            userCodeClassLoader.asClassLoader(), nameOfInvokableClass, env);

            // ----------------------------------------------------------------
            //  actual task core work
            // ----------------------------------------------------------------

            // we must make strictly sure that the invokable is accessible to the cancel() call
            // by the time we switched to running.
            this.invokable = invokable;

            // switch to the RUNNING state, if that fails, we have been canceled/failed in the
            // meantime
            if (!transitionState(ExecutionState.DEPLOYING, ExecutionState.RUNNING)) {
                throw new CancelTaskException();
            }

            // notify everyone that we switched to running
            taskManagerActions.updateTaskExecutionState(
                    new TaskExecutionState(jobId, executionId, ExecutionState.RUNNING));

            // make sure the user code classloader is accessible thread-locally
            executingThread.setContextClassLoader(userCodeClassLoader.asClassLoader());

            // run the invokable
            // This is where execution actually starts. Depending on the job, the invokable is e.g.:
            //   source:         DataSourceTask
            //   streaming task: StreamTask
            //   sink:           DataSinkTask / IterationSynchronizationSinkTask
            //   batch task:     BatchTask
            invokable.invoke();

            // make sure, we enter the catch block if the task leaves the invoke() method due
            // to the fact that it has been canceled
            if (isCanceledOrFailed()) {
                throw new CancelTaskException();
            }

            // ----------------------------------------------------------------
            //  finalization of a successful execution
            // ----------------------------------------------------------------

            // finish the produced partitions. if this fails, we consider the execution failed.
            for (ResultPartitionWriter partitionWriter : consumableNotifyingPartitionWriters) {
                if (partitionWriter != null) {
                    partitionWriter.finish();
                }
            }

            // try to mark the task as finished
            // if that fails, the task was canceled/failed in the meantime
            if (!transitionState(ExecutionState.RUNNING, ExecutionState.FINISHED)) {
                throw new CancelTaskException();
            }
        } catch (Throwable t) {

            // unwrap wrapped exceptions to make stack traces more compact
            if (t instanceof WrappingRuntimeException) {
                t = ((WrappingRuntimeException) t).unwrap();
            }

            // ----------------------------------------------------------------
            // the execution failed. either the invokable code properly failed, or
            // an exception was thrown as a side effect of cancelling
            // ----------------------------------------------------------------

            TaskManagerExceptionUtils.tryEnrichTaskManagerError(t);

            try {
                // check if the exception is unrecoverable
                if (ExceptionUtils.isJvmFatalError(t)
                        || (t instanceof OutOfMemoryError
                                && taskManagerConfig.shouldExitJvmOnOutOfMemoryError())) {

                    // terminate the JVM immediately
                    // don't attempt a clean shutdown, because we cannot expect the clean shutdown
                    // to complete
                    try {
                        LOG.error(
                                "Encountered fatal error {} - terminating the JVM",
                                t.getClass().getName(),
                                t);
                    } finally {
                        Runtime.getRuntime().halt(-1);
                    }
                }

                // transition into our final state. we should be either in DEPLOYING, RUNNING,
                // CANCELING, or FAILED
                // loop for multiple retries during concurrent state changes via calls to cancel()
                // or
                // to failExternally()
                while (true) {
                    ExecutionState current = this.executionState;

                    if (current == ExecutionState.RUNNING || current == ExecutionState.DEPLOYING) {
                        if (t instanceof CancelTaskException) {
                            if (transitionState(current, ExecutionState.CANCELED)) {
                                cancelInvokable(invokable);
                                break;
                            }
                        } else {
                            if (transitionState(current, ExecutionState.FAILED, t)) {
                                // proper failure of the task. record the exception as the root
                                // cause
                                failureCause = t;
                                cancelInvokable(invokable);

                                break;
                            }
                        }
                    } else if (current == ExecutionState.CANCELING) {
                        if (transitionState(current, ExecutionState.CANCELED)) {
                            break;
                        }
                    } else if (current == ExecutionState.FAILED) {
                        // in state failed already, no transition necessary any more
                        break;
                    }
                    // unexpected state, go to failed
                    else if (transitionState(current, ExecutionState.FAILED, t)) {
                        LOG.error(
                                "Unexpected state in task {} ({}) during an exception: {}.",
                                taskNameWithSubtask,
                                executionId,
                                current);
                        break;
                    }
                    // else fall through the loop and
                }
            } catch (Throwable tt) {
                String message =
                        String.format(
                                "FATAL - exception in exception handler of task %s (%s).",
                                taskNameWithSubtask, executionId);
                LOG.error(message, tt);
                notifyFatalError(message, tt);
            }
        } finally {
            try {
                LOG.info("Freeing task resources for {} ({}).", taskNameWithSubtask, executionId);

                // clear the reference to the invokable. this helps guard against holding references
                // to the invokable and its structures in cases where this Task object is still
                // referenced
                this.invokable = null;

                // free the network resources
                releaseResources();

                // free memory resources
                if (invokable != null) {
                    memoryManager.releaseAll(invokable);
                }

                // remove all of the tasks resources
                fileCache.releaseJob(jobId, executionId);

                // close and de-activate safety net for task thread
                LOG.debug("Ensuring all FileSystem streams are closed for task {}", this);
                FileSystemSafetyNet.closeSafetyNetAndGuardedResourcesForThread();

                notifyFinalState();
            } catch (Throwable t) {
                // an error in the resource cleanup is fatal
                String message =
                        String.format(
                                "FATAL - exception in resource cleanup of task %s (%s).",
                                taskNameWithSubtask, executionId);
                LOG.error(message, t);
                notifyFatalError(message, t);
            }

            // un-register the metrics at the end so that the task may already be
            // counted as finished when this happens
            // errors here will only be logged
            try {
                metrics.close();
            } catch (Throwable t) {
                LOG.error(
                        "Error during metrics de-registration of task {} ({}).",
                        taskNameWithSubtask,
                        executionId,
                        t);
            }
        }
    }
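
The finally block above treats two kinds of cleanup failure differently: an error while releasing task resources is reported as fatal, while an error while closing the metrics is only logged. A minimal sketch of that shape follows. The `Step` interface, the `LOG` list and the `fatalReported` flag are hypothetical stand-ins for illustration and are not part of Flink.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: none of these names exist in Flink. */
final class CleanupSketch {

    /** Hypothetical cleanup step that may throw. */
    interface Step {
        void run() throws Exception;
    }

    static final List<String> LOG = new ArrayList<>();
    static boolean fatalReported = false;

    static void cleanup(Step releaseResources, Step closeMetrics) {
        try {
            // critical cleanup comes first; the final state is announced only if it succeeds
            releaseResources.run();
            LOG.add("final state notified");
        } catch (Throwable t) {
            // a failure here is escalated (stand-in for notifyFatalError)
            fatalReported = true;
            LOG.add("FATAL: " + t.getMessage());
        }

        try {
            // metrics are closed last, in their own try block
            closeMetrics.run();
        } catch (Throwable t) {
            // a failure here is only logged
            LOG.add("metrics error: " + t.getMessage());
        }
    }

    public static void main(String[] args) {
        cleanup(() -> {}, () -> { throw new IllegalStateException("registry closed"); });
        // prints "[final state notified, metrics error: registry closed] fatal=false"
        System.out.println(LOG + " fatal=" + fatalReported);
    }
}
```

Keeping the metrics step in a separate try block means a broken metrics registry cannot turn an otherwise clean shutdown into a fatal error.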

2.5. Checkpoint-related operations

2.5.1. triggerCheckpointBarrier

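This method calls the invokable's triggerCheckpointAsync method to trigger a checkpoint. If the task is not in the RUNNING state, the checkpoint is declined instead.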

/**
 * Calls the invokable to trigger a checkpoint.
 *
 * @param checkpointID The ID identifying the checkpoint.
 * @param checkpointTimestamp The timestamp associated with the checkpoint.
 * @param checkpointOptions Options for performing this checkpoint.
 */
public void triggerCheckpointBarrier(
        final long checkpointID,
        final long checkpointTimestamp,
        final CheckpointOptions checkpointOptions) {

    final AbstractInvokable invokable = this.invokable;
    final CheckpointMetaData checkpointMetaData =
            new CheckpointMetaData(checkpointID, checkpointTimestamp);
    // A checkpoint can only be triggered while the task is in the RUNNING state.
    if (executionState == ExecutionState.RUNNING && invokable != null) {
        try {
            // [Core] Trigger the checkpoint.
            invokable.triggerCheckpointAsync(checkpointMetaData, checkpointOptions);
        } catch (RejectedExecutionException ex) {
            // This may happen if the mailbox is closed. It means that the task is shutting
            // down, so we just ignore it.
            LOG.debug(
                    "Triggering checkpoint {} for {} ({}) was rejected by the mailbox",
                    checkpointID,
                    taskNameWithSubtask,
                    executionId);
        } catch (Throwable t) {
            if (getExecutionState() == ExecutionState.RUNNING) {
                failExternally(
                        new Exception(
                                "Error while triggering checkpoint "
                                        + checkpointID
                                        + " for "
                                        + taskNameWithSubtask,
                                t));
            } else {
                LOG.debug(
                        "Encountered error while triggering checkpoint {} for "
                                + "{} ({}) while being not in state running.",
                        checkpointID,
                        taskNameWithSubtask,
                        executionId,
                        t);
            }
        }
    } else {
        LOG.debug(
                "Declining checkpoint request for non-running task {} ({}).",
                taskNameWithSubtask,
                executionId);

        // send back a message that we did not do the checkpoint
        checkpointResponder.declineCheckpoint(
                jobId,
                executionId,
                checkpointID,
                new CheckpointException(
                        "Task name with subtask : " + taskNameWithSubtask,
                        CheckpointFailureReason.CHECKPOINT_DECLINED_TASK_NOT_READY));
    }
}
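
The branches of triggerCheckpointBarrier can be summarized as a small decision function. The sketch below is a simplification under stated assumptions: the `State` and `Outcome` enums and the `Invokable` interface are hypothetical, and the state is passed in as a parameter rather than re-read after a failure as the real method does.

```java
import java.util.concurrent.RejectedExecutionException;

/** Illustrative only: a simplified model of the branches, not Flink code. */
final class TriggerSketch {

    enum State { DEPLOYING, RUNNING, CANCELED }

    enum Outcome { TRIGGERED, IGNORED_MAILBOX_CLOSED, TASK_FAILED, DECLINED }

    /** Hypothetical stand-in for the invokable's asynchronous trigger call. */
    interface Invokable {
        void triggerAsync(long checkpointId) throws Exception;
    }

    static Outcome trigger(State state, Invokable invokable, long checkpointId) {
        if (state == State.RUNNING && invokable != null) {
            try {
                invokable.triggerAsync(checkpointId);
                return Outcome.TRIGGERED;
            } catch (RejectedExecutionException e) {
                // the mailbox is closed, so the task is shutting down: ignore
                return Outcome.IGNORED_MAILBOX_CLOSED;
            } catch (Throwable t) {
                // the real method fails the task only if it is still RUNNING at this point
                return Outcome.TASK_FAILED;
            }
        }
        // not running: the checkpoint coordinator is told that the checkpoint was declined
        return Outcome.DECLINED;
    }

    public static void main(String[] args) {
        System.out.println(trigger(State.RUNNING, id -> {}, 1L));   // prints "TRIGGERED"
        System.out.println(trigger(State.DEPLOYING, id -> {}, 2L)); // prints "DECLINED"
        System.out.println(trigger(State.RUNNING, null, 3L));       // prints "DECLINED"
    }
}
```

The catch order matters: RejectedExecutionException must be handled before the general Throwable branch, otherwise a task that is merely shutting down would be failed.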

2.5.2. notifyCheckpointComplete

This method confirms a checkpoint: it calls the invokable's notifyCheckpointCompleteAsync method to tell the task that the checkpoint has completed.

@Override
public void notifyCheckpointComplete(final long checkpointID) {
    final AbstractInvokable invokable = this.invokable;

    if (executionState == ExecutionState.RUNNING && invokable != null) {
        try {

            invokable.notifyCheckpointCompleteAsync(checkpointID);
        } catch (RejectedExecutionException ex) {
            // This may happen if the mailbox is closed. It means that the task is shutting
            // down, so we just ignore it.
            LOG.debug(
                    "Notify checkpoint complete {} for {} ({}) was rejected by the mailbox",
                    checkpointID,
                    taskNameWithSubtask,
                    executionId);
        } catch (Throwable t) {
            if (getExecutionState() == ExecutionState.RUNNING) {
                // fail task if checkpoint confirmation failed.
                failExternally(new RuntimeException("Error while confirming checkpoint", t));
            }
        }
    } else {
        LOG.debug(
                "Ignoring checkpoint commit notification for non-running task {}.",
                taskNameWithSubtask);
    }
}
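
Each of these methods first copies the `invokable` field into a local variable. The task thread sets `this.invokable = null` during cleanup, so reading the field once and working on the local copy avoids a null dereference between the check and the call. A minimal sketch of the idiom follows, assuming a field that another thread may clear. The `Invokable` interface and the `setInvokable` helper are hypothetical.

```java
/** Illustrative only: shows the "read the field once" idiom, not Flink code. */
final class SnapshotSketch {

    /** Hypothetical stand-in for the notification target. */
    interface Invokable {
        void notifyComplete(long checkpointId);
    }

    // may be set to null by another thread at any time
    private volatile Invokable invokable;

    void setInvokable(Invokable invokable) {
        this.invokable = invokable;
    }

    /** Returns true if the notification was forwarded. */
    boolean notifyCheckpointComplete(long checkpointId) {
        // single read of the field; later checks and calls use the local copy
        final Invokable local = this.invokable;
        if (local == null) {
            return false;
        }
        local.notifyComplete(checkpointId);
        return true;
    }

    public static void main(String[] args) {
        SnapshotSketch task = new SnapshotSketch();
        long[] last = {-1L};
        task.setInvokable(id -> last[0] = id);

        System.out.println(task.notifyCheckpointComplete(7L) + " " + last[0]); // prints "true 7"
        task.setInvokable(null);
        System.out.println(task.notifyCheckpointComplete(8L) + " " + last[0]); // prints "false 7"
    }
}
```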

2.5.3. notifyCheckpointAborted

This method calls the notifyCheckpointAbortAsync method of the operator instance (the invokable) to notify the task that the checkpoint was aborted.

    @Override
    public void notifyCheckpointAborted(final long checkpointID) {
        final AbstractInvokable invokable = this.invokable;

        if (executionState == ExecutionState.RUNNING && invokable != null) {
            try {
                // Forward the abort notification to the invokable.
                invokable.notifyCheckpointAbortAsync(checkpointID);
            } catch (RejectedExecutionException ex) {
                // This may happen if the mailbox is closed. It means that the task is shutting
                // down, so we just ignore it.
                LOG.debug(
                        "Notify checkpoint abort {} for {} ({}) was rejected by the mailbox",
                        checkpointID,
                        taskNameWithSubtask,
                        executionId);
            } catch (Throwable t) {
                if (getExecutionState() == ExecutionState.RUNNING) {
                    // fail task if checkpoint aborted notification failed.
                    failExternally(new RuntimeException("Error while aborting checkpoint", t));
                }
            }
        } else {
            LOG.info(
                    "Ignoring checkpoint aborted notification for non-running task {}.",
                    taskNameWithSubtask);
        }
    }

2.6. deliverOperatorEvent

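This method calls the dispatchOperatorEvent method of the operator instance (the invokable) to dispatch an operator event to the task.

If the event delivery does not succeed, the method throws an exception. Callers can use that exception for error reporting, but they do not need to fail the task themselves, because this method already takes care of that.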

    /**
     * Dispatches an operator event to the invokable task.
     *
     * <p>If the event delivery did not succeed, this method throws an exception. Callers can use
     * that exception for error reporting, but need not react with failing this task (this method
     * takes care of that).
     *
     * @throws FlinkException This method throws exceptions indicating the reason why delivery did
     *     not succeed.
     */
    public void deliverOperatorEvent(OperatorID operator, SerializedValue<OperatorEvent> evt)
            throws FlinkException {
        final AbstractInvokable invokable = this.invokable;

        if (invokable == null || executionState != ExecutionState.RUNNING) {
            throw new TaskNotRunningException("Task is not yet running.");
        }

        try {
            // Dispatch the event through the invokable's dispatchOperatorEvent method.
            invokable.dispatchOperatorEvent(operator, evt);
        } catch (Throwable t) {
            ExceptionUtils.rethrowIfFatalErrorOrOOM(t);

            if (getExecutionState() == ExecutionState.RUNNING) {
                FlinkException e = new FlinkException("Error while handling operator event", t);
                failExternally(e);
                throw e;
            }
        }
    }
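
The contract of deliverOperatorEvent (reject when not running, otherwise fail the task and rethrow on error) can be sketched as follows. All names here are hypothetical: `DeliveryException` stands in for FlinkException, and `Handler` stands in for the invokable. As in the method above, an error that occurs after the task has already left the running state is swallowed.

```java
/** Illustrative only: models the delivery contract, not Flink code. */
final class DeliverSketch {

    /** Hypothetical stand-in for FlinkException. */
    static final class DeliveryException extends Exception {
        DeliveryException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical stand-in for the invokable's event dispatch. */
    interface Handler {
        void handle(String event) throws Exception;
    }

    private final Handler handler;
    boolean running = false;
    boolean failed = false;

    DeliverSketch(Handler handler) {
        this.handler = handler;
    }

    void deliver(String event) throws DeliveryException {
        if (handler == null || !running) {
            throw new DeliveryException("Task is not running.", null);
        }
        try {
            handler.handle(event);
        } catch (Throwable t) {
            if (running) {
                DeliveryException e =
                        new DeliveryException("Error while handling operator event", t);
                // the method itself fails the task (stand-in for failExternally) ...
                failed = true;
                running = false;
                // ... and rethrows so that the caller can report the error
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        DeliverSketch task =
                new DeliverSketch(e -> { throw new IllegalArgumentException("bad event"); });
        task.running = true;
        try {
            task.deliver("evt");
        } catch (DeliveryException e) {
            // the caller only reports; the task is already marked as failed
            System.out.println(e.getMessage() + ", failed=" + task.failed);
        }
    }
}
```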