Flink Checkpointing and Recovery

weixin_26713521

Tags: flink, java, python

Original article: https://towardsdatascience.com/flink-checkpointing-and-recovery-7e59e76c2d45

Apache Flink is a popular real-time data processing framework. It keeps gaining popularity thanks to its fault-tolerant, low-latency processing at very high throughput.

Although Flink's documentation is good, it took me some time to understand how the various mechanisms come together to make Flink checkpointing and recovery work end to end. In this article I explain the key steps you need to perform at each operator level to create a fault-tolerant Flink job. Flink's basic operators are Source, Process, and Sink, and Process operators come in several flavors.

So let's get started on what you need to do to enable checkpointing and make every operator checkpoint-aware.

Flink Environment Configuration (Checkpointing)

Flink job configuration for checkpointing
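
The configuration itself boils down to a handful of calls on the execution environment. Here is a minimal sketch, assuming the Flink 1.x DataStream API from around the time this article was written (newer releases deprecate some of these calls in favor of newer equivalents). The class name CheckpointedJobSetup, the pause and timeout values, and the file:///D:/flink-checkpoints location are my assumptions for illustration; only the 30-second interval comes from the text.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

/** Hypothetical helper that builds an execution environment with checkpointing enabled. */
public final class CheckpointedJobSetup {

    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 30 seconds (the interval used in this article).
        env.enableCheckpointing(30_000L, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        config.setMinPauseBetweenCheckpoints(10_000L); // assumed value
        config.setCheckpointTimeout(60_000L);          // assumed value
        config.setMaxConcurrentCheckpoints(1);

        // Keep the checkpoint files when the job is cancelled, so that a
        // checkpoint directory can later be passed to "flink run -s".
        config.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Assumed checkpoint location on the local file system.
        env.setStateBackend(new FsStateBackend("file:///D:/flink-checkpoints"));
        return env;
    }
}
```

Retaining checkpoints on cancellation is the setting that matters most for this walkthrough: without it, Flink deletes the checkpoint when the job is cancelled and there is nothing left to resume from.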

Source Operator Checkpointing

The source operator is the one that fetches data from the external source. I wrote a simple source operator based on a continuous SQL query, and it keeps track of the timestamp up to which the data has been queried. This timestamp is what Flink stores as part of the checkpointing process. Flink saves the state of the source at the operator level. The source function should implement either the CheckpointedFunction interface or the ListCheckpointed interface; the two methods that matter here are described below.

Flink calls the snapshotState method on every checkpoint, which is every 30 seconds as configured here. With ListCheckpointed, the method returns the value to be saved in the state backend.

The restoreState method is called when the operator restarts. It is the handler that sets the timestamp (state) last stored during a checkpoint.
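
The contract between these two methods can be sketched without any Flink dependency. The class below is only a model, not my actual source: TimestampTrackingSource, fetchUpTo, and the in-memory list of row timestamps are illustrative names. In a real job the same two method shapes come from Flink's ListCheckpointed<Long> interface, and Flink decides when to call them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Dependency-free model of a source that checkpoints the last queried timestamp. */
final class TimestampTrackingSource {

    // High-water mark: rows with a timestamp <= this value were already emitted.
    private long lastQueriedTimestamp = 0L;

    /** Simulates one round of the continuous query and returns the rows it emits. */
    List<Long> fetchUpTo(long now, List<Long> rowTimestamps) {
        List<Long> emitted = new ArrayList<>();
        for (long ts : rowTimestamps) {
            if (ts > lastQueriedTimestamp && ts <= now) {
                emitted.add(ts);
            }
        }
        lastQueriedTimestamp = Math.max(lastQueriedTimestamp, now);
        return emitted;
    }

    /** Same shape as ListCheckpointed#snapshotState: return what should be saved. */
    List<Long> snapshotState(long checkpointId, long checkpointTimestamp) {
        return Collections.singletonList(lastQueriedTimestamp);
    }

    /** Same shape as ListCheckpointed#restoreState: re-apply what was saved. */
    void restoreState(List<Long> state) {
        for (Long saved : state) {
            lastQueriedTimestamp = saved;
        }
    }

    long getLastQueriedTimestamp() {
        return lastQueriedTimestamp;
    }
}
```

After a restart, a fresh instance that receives the saved list through restoreState resumes the query from the stored timestamp instead of from the beginning, which is exactly what keeps the source from re-reading old rows.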

Process Function Checkpointing

Flink supports saving state per key via KeyedProcessFunction. With event-time processing, ProcessWindowFunction can also save the state of windows on a per-key basis.

For KeyedProcessFunction, a ValueState is declared once in the function, and Flink scopes its value to the key of the element being processed.
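
Conceptually, keyed state behaves like a map from key to value that Flink includes in every checkpoint. Here is a dependency-free model of that idea, in which the class PerKeyCounter and its methods are hypothetical names for illustration. In a real KeyedProcessFunction the map is replaced by a ValueState<Long> obtained in open() through getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class)), and Flink performs the snapshot and restore itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Dependency-free model of per-key state; the map stands in for Flink's keyed state. */
final class PerKeyCounter {

    private final Map<String, Long> countByKey = new HashMap<>();

    /** Plays the role of processElement: read this key's value, update it, write it back. */
    long processElement(String key) {
        Long current = countByKey.get(key);                    // like ValueState#value(), null when unset
        long updated = (current == null ? 0L : current) + 1L;
        countByKey.put(key, updated);                          // like ValueState#update(updated)
        return updated;
    }

    /** Copies the state as a checkpoint would capture it. */
    Map<String, Long> snapshot() {
        return new HashMap<>(countByKey);
    }

    /** Rebuilds a counter from a snapshot, as happens on recovery. */
    static PerKeyCounter restore(Map<String, Long> snapshot) {
        PerKeyCounter counter = new PerKeyCounter();
        counter.countByKey.putAll(snapshot);
        return counter;
    }
}
```

Anything processed after the last snapshot is lost from the state on failure and is rebuilt only when the source replays those elements, which is why the source's own checkpointed position matters.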

ValueState is just one example. There are other state types as well, such as ListState and MapState. ProcessWindowFunction saves the window state automatically, so no variable needs to be set.

Sink Function Checkpointing

Sink function checkpointing works much like source function checkpointing, and the state is saved at the operator level. I have implemented a sink function for a Postgres database. There are several ways to make a sink function fault tolerant and robust while keeping performance and efficiency in mind. I have taken a simple approach and will improve on it in the future.

By committing the statement in the snapshotState method, I ensure that all pending data is flushed and committed whenever a checkpoint is triggered.
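
The commit-on-checkpoint idea can be modeled in a few lines. This is a sketch under stated assumptions rather than my Postgres sink: BufferingSink, fail, and the two in-memory lists are hypothetical stand-ins, with the committed list playing the role of the database table and the pending list playing the role of the open transaction.

```java
import java.util.ArrayList;
import java.util.List;

/** Dependency-free model of a sink that commits its pending writes on every checkpoint. */
final class BufferingSink {

    private final List<String> pending = new ArrayList<>();   // written but not yet committed
    private final List<String> committed = new ArrayList<>(); // stands in for the database table

    /** Plays the role of the sink's invoke method: stage the record in the open transaction. */
    void invoke(String record) {
        pending.add(record);
    }

    /** Plays the role of snapshotState: flush and commit everything staged so far. */
    void snapshotState() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Models a crash: the open transaction is rolled back, committed rows survive. */
    void fail() {
        pending.clear();
    }

    List<String> committedRows() {
        return new ArrayList<>(committed);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

One caveat with this simple approach: the commit happens before Flink has confirmed that the whole checkpoint completed, so a failure in between can replay rows that were already committed. That makes the pattern at-least-once. Idempotent writes (upserts) or Flink's TwoPhaseCommitSinkFunction are the usual ways to tighten it.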

All Set

Finally, run your job. You can cancel it in the middle of processing and then rerun it by providing the checkpoint location as follows. You need to pass the latest checkpoint yourself, so pay attention to the -s parameter.

.\flink.bat run -m localhost:8081 -s D:\flink-checkpoints\1d96f28886b693452ab1c88ab72a35c8\chk-10 -c <Job class Name> <Path to Jar file>

Conclusion

This is a basic approach to checkpointing and failure recovery, and it may need further improvement depending on the use case. Feel free to send me your feedback. Happy reading!

Repository link to codebase: