Data Streaming Fault Tolerance

huaishu


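Introduction

Apache Flink offers a fault tolerance mechanism that consistently recovers the state of data streaming applications. The mechanism ensures that, even in the presence of failures, the program's state eventually reflects every record from the data stream exactly once. Note that there is a switch to downgrade this guarantee to at least once (described below).

The fault tolerance mechanism continuously draws snapshots of the distributed streaming dataflow. For streaming applications with small state, these snapshots are very lightweight and can be drawn frequently without much impact on performance. The state of the streaming application is stored at a configurable location (such as the master node or HDFS).

If the program fails (due to a machine, network, or software failure), Flink stops the distributed streaming dataflow. The system then restarts the operators and resets them to the latest successful checkpoint. The input streams are reset to the point of the state snapshot. Any records processed as part of the restarted parallel dataflow are guaranteed not to have been part of the previously checkpointed state.

Note: Checkpointing is disabled by default. See Checkpointing for details on how to enable and configure it.

Note: For this mechanism to realize its full guarantees, the data stream source (such as a message queue or broker) must be able to rewind the stream to a defined recent point. Apache Kafka has this ability, and Flink's Kafka connector exploits it. See Fault Tolerance Guarantees of Data Sources and Sinks for more information about the guarantees provided by Flink's connectors.

Note: Because Flink's checkpoints are realized through distributed snapshots, the words snapshot and checkpoint are used interchangeably.

Checkpointing

The central part of Flink's fault tolerance mechanism is drawing consistent snapshots of the distributed data stream and operator state. These snapshots act as consistent checkpoints to which the system can fall back in case of a failure. Flink's mechanism for drawing these snapshots is described in "Lightweight Asynchronous Snapshots for Distributed Dataflows". It is inspired by the standard Chandy-Lamport algorithm for distributed snapshots and is specifically tailored to Flink's execution model.

Barriers

A core element of Flink's distributed snapshotting is the stream barrier. Barriers are injected into the data stream and flow with the records as part of the stream. Barriers never overtake records; they flow strictly in line. A barrier separates the records in the data stream into the set of records that goes into the current snapshot and the records that go into the next snapshot. Each barrier carries the ID of the snapshot whose records it pushed in front of it. Barriers do not interrupt the flow of the stream and are therefore very lightweight. Multiple barriers from different snapshots can be in the stream at the same time, which means that several snapshots may happen concurrently.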

Stream barriers are injected into the parallel data flow at the stream sources. The point where the barriers for snapshot n are injected (let’s call it Sn) is the position in the source stream up to which the snapshot covers the data. For example, in Apache Kafka, this position would be the last record’s offset in the partition. This position Sn is reported to the checkpoint coordinator (Flink’s JobManager).

The barriers then flow downstream. When an intermediate operator has received a barrier for snapshot n from all of its input streams, it emits a barrier for snapshot n into all of its outgoing streams. Once a sink operator (the end of a streaming DAG) has received the barrier n from all of its input streams, it acknowledges that snapshot n to the checkpoint coordinator. After all sinks have acknowledged a snapshot, it is considered completed.

Once snapshot n has been completed, the job will never again ask the source for records from before Sn, since at that point these records (and their descendant records) will have passed through the entire data flow topology.
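
The injection step described above can be sketched in plain Java. This is a minimal model, not Flink's source API: the class BarrierInjectionSketch, its Element type, and the rule of injecting a barrier after a fixed number of records are all assumptions made for illustration (Flink triggers checkpoints from the checkpoint coordinator, not by counting records). It shows how a barrier splits the stream and how the position Sn is recorded as the offset of the last record before barrier n.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a single source partition; not Flink's source API.
class BarrierInjectionSketch {

    /** A stream element: either a data record or a barrier carrying a snapshot ID. */
    static final class Element {
        final String value;    // null for barriers
        final long snapshotId; // -1 for data records

        private Element(String value, long snapshotId) {
            this.value = value;
            this.snapshotId = snapshotId;
        }

        static Element ofRecord(String value) { return new Element(value, -1); }
        static Element barrier(long snapshotId) { return new Element(null, snapshotId); }
        boolean isBarrier() { return value == null; }

        @Override
        public String toString() {
            return isBarrier() ? "barrier(" + snapshotId + ")" : value;
        }
    }

    /** Snapshot ID -> S_n, the offset of the last record emitted before barrier n. */
    final Map<Long, Long> reportedPositions = new LinkedHashMap<>();

    /** Emits records in order and injects a barrier after every `interval` records (an assumption). */
    List<Element> emit(List<String> records, int interval) {
        List<Element> out = new ArrayList<>();
        long nextSnapshotId = 1;
        for (int offset = 0; offset < records.size(); offset++) {
            out.add(Element.ofRecord(records.get(offset)));
            if ((offset + 1) % interval == 0) {
                // "Report" S_n to the coordinator, then inject barrier n behind the record.
                reportedPositions.put(nextSnapshotId, (long) offset);
                out.add(Element.barrier(nextSnapshotId));
                nextSnapshotId++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BarrierInjectionSketch source = new BarrierInjectionSketch();
        List<Element> stream = source.emit(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println(stream);                   // [a, b, barrier(1), c, d, barrier(2), e]
        System.out.println(source.reportedPositions); // {1=1, 2=3}
    }
}
```

Records a and b belong to snapshot 1, c and d to snapshot 2, and e to a snapshot that has not been triggered yet.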

Operators that receive more than one input stream need to align the input streams on the snapshot barriers. The figure above illustrates this:

As soon as the operator receives snapshot barrier n from an incoming stream, it cannot process any further records from that stream until it has received the barrier n from the other inputs as well. Otherwise, it would mix records that belong to snapshot n with records that belong to snapshot n+1.
Streams that report barrier n are temporarily set aside. Records that are received from these streams are not processed, but put into an input buffer.
Once the last stream has received barrier n, the operator emits all pending outgoing records, and then emits snapshot n barriers itself.
After that, it resumes processing records from all input streams, processing records from the input buffers before processing the records from the streams.
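
The alignment steps above can be sketched as a small multi-input operator in plain Java. The class BarrierAlignmentSketch, the BARRIER marker string, and the pass-through "processing" are assumptions for illustration and do not mirror Flink's internal classes. Elements that arrive on a channel after its barrier are buffered; once every channel has delivered the barrier, the operator forwards the barrier and then drains the buffers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical multi-input operator that only illustrates barrier alignment.
class BarrierAlignmentSketch {
    static final String BARRIER = "BARRIER";

    private final boolean[] barrierSeen;                           // per input channel
    private final List<Deque<String>> buffers = new ArrayList<>(); // per-channel input buffer
    final List<String> output = new ArrayList<>();                 // what flows downstream

    BarrierAlignmentSketch(int numInputs) {
        barrierSeen = new boolean[numInputs];
        for (int i = 0; i < numInputs; i++) {
            buffers.add(new ArrayDeque<>());
        }
    }

    /** Handles one element arriving on the given input channel. */
    void onElement(int channel, String element) {
        if (barrierSeen[channel]) {
            buffers.get(channel).add(element); // behind the barrier: set aside, do not process
            return;
        }
        if (!BARRIER.equals(element)) {
            output.add(element);               // ahead of the barrier: process immediately
            return;
        }
        barrierSeen[channel] = true;
        if (allBarriersSeen()) {
            // A stateful operator would snapshot its state here, before forwarding the barrier.
            output.add(BARRIER);
            Arrays.fill(barrierSeen, false);
            drainBuffers();
        }
    }

    private boolean allBarriersSeen() {
        for (boolean seen : barrierSeen) {
            if (!seen) return false;
        }
        return true;
    }

    /** Replays buffered elements before any new input is read from the channels. */
    private void drainBuffers() {
        for (int channel = 0; channel < buffers.size(); channel++) {
            Deque<String> pending = buffers.get(channel);
            buffers.set(channel, new ArrayDeque<>()); // fresh buffer in case a later barrier is replayed
            for (String element : pending) {
                onElement(channel, element);
            }
        }
    }

    public static void main(String[] args) {
        BarrierAlignmentSketch op = new BarrierAlignmentSketch(2);
        op.onElement(0, "a1");
        op.onElement(0, BARRIER); // channel 0 is now set aside
        op.onElement(0, "a2");    // buffered
        op.onElement(1, "b1");    // still processed: channel 1 has not delivered its barrier
        op.onElement(1, BARRIER); // alignment complete
        op.onElement(1, "b2");
        System.out.println(op.output); // [a1, b1, BARRIER, a2, b2]
    }
}
```

Record a2 arrives before b1 but is emitted after the barrier, so everything ahead of the forwarded barrier belongs to snapshot n and everything behind it to snapshot n+1.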

State

When operators contain any form of state, this state must be part of the snapshots as well. Operator state comes in different forms:

User-defined state: This is state that is created and modified directly by the transformation functions (like map() or filter()). See State in Streaming Applications for details.
System state: This state refers to data buffers that are part of the operator’s computation. A typical example for this state are the window buffers, inside which the system collects (and aggregates) records for windows until the window is evaluated and evicted.

Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams. At that point, all updates to the state from records before the barriers will have been made, and no updates that depend on records from after the barriers have been applied. Because the state of a snapshot may be large, it is stored in a configurable state backend. By default, this is the JobManager’s memory, but for production use a distributed reliable storage should be configured (such as HDFS). After the state has been stored, the operator acknowledges the checkpoint, emits the snapshot barrier into the output streams, and proceeds.

The resulting snapshot now contains:

For each parallel stream data source, the offset/position in the stream when the snapshot was started
For each operator, a pointer to the state that was stored as part of the snapshot

Exactly Once vs. At Least Once

The alignment step may add latency to the streaming program. Usually, this extra latency is on the order of a few milliseconds, but we have seen cases where the latency of some outliers increased noticeably. For applications that require consistently super low latencies (few milliseconds) for all records, Flink has a switch to skip the stream alignment during a checkpoint. Checkpoint snapshots are still drawn as soon as an operator has seen the checkpoint barrier from each input.

When the alignment is skipped, an operator keeps processing all inputs, even after some checkpoint barriers for checkpoint n arrived. That way, the operator also processes elements that belong to checkpoint n+1 before the state snapshot for checkpoint n was taken. On a restore, these records will occur as duplicates, because they are both included in the state snapshot of checkpoint n, and will be replayed as part of the data after checkpoint n.

NOTE: Alignment happens only for operators with multiple predecessors (joins) as well as operators with multiple senders (after a stream repartitioning/shuffle). Because of that, dataflows with only embarrassingly parallel streaming operations (map(), flatMap(), filter(), …) actually give exactly once guarantees even in at least once mode.
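
The difference between the two modes can be made concrete with a toy counting operator. This is a sketch under stated assumptions, not Flink code: the class AlignmentModeSketch, the encoding of arrivals as {channel, value} pairs, and the BARRIER marker value are hypothetical. The operator's state is the number of records it has processed; the snapshot is drawn when the barrier has arrived on every input, and after a restore the sources replay every record that followed the barrier on its channel.

```java
// Hypothetical two-input counting operator comparing aligned and unaligned barrier handling.
class AlignmentModeSketch {
    static final int BARRIER = -1; // marker value; data records are non-negative here

    /** Returns the record count captured in the snapshot for the given arrival order. */
    static int snapshotCount(int[][] arrivals, int numInputs, boolean align) {
        boolean[] barrierSeen = new boolean[numInputs];
        int barriers = 0;
        int count = 0; // operator state: number of records processed so far
        for (int[] arrival : arrivals) {
            int channel = arrival[0];
            int value = arrival[1];
            if (value == BARRIER) {
                barrierSeen[channel] = true;
                if (++barriers == numInputs) {
                    return count; // snapshot is drawn once every input delivered the barrier
                }
            } else if (align && barrierSeen[channel]) {
                continue; // exactly once: the record is buffered until alignment completes
            } else {
                count++;  // at least once: processed even if it follows the barrier
            }
        }
        throw new IllegalStateException("barrier did not arrive on every input");
    }

    /** Counts the records that follow the barrier on their channel; a restore replays these. */
    static int replayedAfterRestore(int[][] arrivals, int numInputs) {
        boolean[] barrierSeen = new boolean[numInputs];
        int replayed = 0;
        for (int[] arrival : arrivals) {
            if (arrival[1] == BARRIER) {
                barrierSeen[arrival[0]] = true;
            } else if (barrierSeen[arrival[0]]) {
                replayed++;
            }
        }
        return replayed;
    }

    public static void main(String[] args) {
        // Four data records in total; record 11 follows the barrier on channel 0.
        int[][] arrivals = { {0, 10}, {0, BARRIER}, {0, 11}, {1, 20}, {1, BARRIER}, {1, 21} };
        int replayed = replayedAfterRestore(arrivals, 2);
        System.out.println("aligned:   " + (snapshotCount(arrivals, 2, true) + replayed));  // 4
        System.out.println("unaligned: " + (snapshotCount(arrivals, 2, false) + replayed)); // 5
    }
}
```

With alignment, the restored count matches the four records in the stream. Without it, record 11 is both part of the snapshot and replayed, so it is counted twice.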

Asynchronous State Snapshots

Note that the above described mechanism implies that operators stop processing input records while they are storing a snapshot of their state in the state backend. This synchronous state snapshot introduces a delay every time a snapshot is taken.

It is possible to let an operator continue processing while it stores its state snapshot, effectively letting the state snapshots happen asynchronously in the background. To do that, the operator must be able to produce a state object that should be stored in a way such that further modifications to the operator state do not affect that state object. For example, copy-on-write data structures, such as are used in RocksDB, have this behavior.

After receiving the checkpoint barriers on its inputs, the operator starts the asynchronous snapshot copying of its state. It immediately emits the barrier to its outputs and continues with the regular stream processing. Once the background copy process has completed, it acknowledges the checkpoint to the checkpoint coordinator (the JobManager). The checkpoint is now only complete after all sinks have received the barriers and all stateful operators have acknowledged their completed backup (which may be after the barriers reach the sinks).
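
The requirement that later modifications must not affect the state object being stored can be illustrated with a small Java sketch. The class AsyncSnapshotSketch and its persist method are hypothetical, and the full defensive copy used here is only a simple stand-in for a copy-on-write structure: it gives the same isolation but not the same cost profile.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical keyed-count operator; illustrates snapshot isolation, not Flink's state backends.
class AsyncSnapshotSketch {
    private final Map<String, Long> counts = new HashMap<>(); // live operator state

    void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    long currentCount(String key) {
        return counts.getOrDefault(key, 0L);
    }

    /**
     * Synchronous part: freeze a view of the state on the calling thread.
     * Asynchronous part: "persist" the frozen view on a background thread.
     */
    CompletableFuture<Map<String, Long>> snapshotAsync() {
        Map<String, Long> frozen = Collections.unmodifiableMap(new HashMap<>(counts));
        return CompletableFuture.supplyAsync(() -> persist(frozen));
    }

    // Stand-in for a write to a state backend (assumption); returns what was "written".
    private static Map<String, Long> persist(Map<String, Long> state) {
        return state;
    }

    public static void main(String[] args) {
        AsyncSnapshotSketch op = new AsyncSnapshotSketch();
        op.process("a");
        op.process("a");
        op.process("b");
        CompletableFuture<Map<String, Long>> snapshot = op.snapshotAsync();
        op.process("a"); // processing continues while the snapshot is being stored
        System.out.println("snapshot a=" + snapshot.join().get("a")); // snapshot a=2
        System.out.println("live a=" + op.currentCount("a"));         // live a=3
    }
}
```

The acknowledgement to the checkpoint coordinator would be sent when the returned future completes, which may be after the barrier has already moved downstream.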

See State Backends for details on the state snapshots.

Recovery

Recovery under this mechanism is straightforward: Upon a failure, Flink selects the latest completed checkpoint k. The system then re-deploys the entire distributed dataflow, and gives each operator the state that was snapshotted as part of checkpoint k. The sources are set to start reading the stream from position Sk. For example in Apache Kafka, that means telling the consumer to start fetching from offset Sk.
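
The recovery procedure can be sketched as a single-threaded simulation. The class RecoverySketch, its Checkpoint type, and the rule of checkpointing after a fixed number of records are assumptions for illustration. The operator keeps a running sum; on failure it falls back to the latest completed checkpoint k, restores the sum stored there, and resumes reading at the stored source position (kept here as the next offset to read).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-operator job; illustrates restore-and-replay, not Flink's runtime.
class RecoverySketch {

    /** A completed checkpoint: the source position S_k plus the operator state (a running sum). */
    static final class Checkpoint {
        final int id;
        final int sourcePosition; // next offset to read after a restore
        final long sum;

        Checkpoint(int id, int sourcePosition, long sum) {
            this.id = id;
            this.sourcePosition = sourcePosition;
            this.sum = sum;
        }
    }

    /**
     * Sums the input, checkpointing after every `interval` records and failing once
     * right before the record at index `failAt` is processed (pass -1 for no failure).
     */
    static long run(List<Integer> input, int interval, int failAt) {
        List<Checkpoint> completed = new ArrayList<>();
        long sum = 0;
        int position = 0;
        boolean failed = false;
        while (position < input.size()) {
            if (!failed && position == failAt) {
                failed = true;
                if (completed.isEmpty()) {
                    sum = 0;       // no checkpoint yet: restart from the initial state
                    position = 0;
                } else {
                    Checkpoint k = completed.get(completed.size() - 1);
                    sum = k.sum;                 // give the operator its checkpointed state
                    position = k.sourcePosition; // rewind the source to S_k
                }
                continue;
            }
            sum += input.get(position);
            position++;
            if (position % interval == 0) {
                completed.add(new Checkpoint(completed.size() + 1, position, sum));
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(run(input, 3, -1)); // 28, no failure
        System.out.println(run(input, 3, 5));  // 28, records 4 and 5 are replayed but counted once
    }
}
```

Both runs end with the same sum because the updates made after the last checkpoint are discarded together with the source position, so the replayed records are applied exactly once to the restored state.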

If state was snapshotted incrementally, the operators start with the state of the latest full snapshot and then apply a series of incremental snapshot updates to that state.

See Restart Strategies for more information.

Operator Snapshot Implementation

When operator snapshots are taken, there are two parts: the synchronous and the asynchronous parts.

Operators and state backends provide their snapshots as a Java FutureTask. That task contains the state where the synchronous part is completed and the asynchronous part is pending. The asynchronous part is then executed by a background thread for that checkpoint.

Operators that checkpoint purely synchronously return an already completed FutureTask. If an asynchronous operation needs to be performed, it is executed in the run() method of that FutureTask.

The tasks are cancelable, so that streams and other resource consuming handles can be released.
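
The FutureTask contract described above can be sketched with the standard java.util.concurrent classes. The class SnapshotFutureSketch, its method names, and the string "handle" result are hypothetical; only FutureTask, ExecutorService, and Executors are real JDK APIs. A purely synchronous snapshot returns a task that has already run, while a snapshot with an asynchronous part defers that work to run(), which a background thread calls.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

// Hypothetical snapshot factory; illustrates the FutureTask pattern, not Flink's operator code.
class SnapshotFutureSketch {

    /** Purely synchronous snapshot: the returned task is already completed. */
    static FutureTask<String> synchronousSnapshot(String state) {
        FutureTask<String> task = new FutureTask<>(() -> "handle:" + state);
        task.run(); // all work happens on the calling thread
        return task;
    }

    /** Snapshot with an asynchronous part: state is captured now, written when run() is called. */
    static FutureTask<String> asynchronousSnapshot(String state) {
        final String captured = state;                       // synchronous part
        return new FutureTask<>(() -> "handle:" + captured); // asynchronous part, executed in run()
    }

    public static void main(String[] args) throws Exception {
        FutureTask<String> done = synchronousSnapshot("count=3");
        System.out.println(done.isDone() + " " + done.get()); // true handle:count=3

        FutureTask<String> pending = asynchronousSnapshot("count=4");
        ExecutorService background = Executors.newSingleThreadExecutor();
        background.execute(pending);       // the background thread calls run()
        System.out.println(pending.get()); // handle:count=4
        background.shutdown();

        FutureTask<String> aborted = asynchronousSnapshot("count=5");
        aborted.cancel(true);              // e.g. the checkpoint was declined or timed out
        System.out.println(aborted.isCancelled()); // true
    }
}
```

Because the caller only sees a FutureTask in both cases, it can treat synchronous and asynchronous snapshots uniformly and cancel either one to release resources.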
