DataStream API: Execution Mode (Batch/Streaming)

曹木青芸

Last modified: 2023-01-27 21:54:16

Category: Flink official documentation translation - DataStream API. Tags: batch, mapreduce, database

First published: 2022-08-30 20:31:47

Article link: https://blog.csdn.net/weixin_48813624/article/details/126586758

Execution Mode (Batch/Streaming)

The DataStream API supports different runtime execution modes. You can choose between them based on the requirements of your use case and the characteristics of your job.

There is the “classic” execution behavior of the DataStream API, which we call STREAMING execution mode. Use it for unbounded jobs that require continuous incremental processing and are expected to stay online indefinitely.

Additionally, there is a batch-style execution mode that we call BATCH execution mode. It executes jobs in a way that is more reminiscent of batch processing frameworks such as MapReduce. Use it for bounded jobs that have a known, fixed input and do not run continuously.

Apache Flink’s unified approach to stream and batch processing means that a DataStream application executed over bounded input will produce the same final results regardless of the configured execution mode. It is important to note what “final” means here: a job executing in STREAMING mode might produce incremental updates (think upserts in a database), while a BATCH job would only produce one final result at the end. The final result will be the same if interpreted correctly, but the way to get there can be different.
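The difference between incremental updates and a single final result can be illustrated without Flink at all. The following is a minimal plain-Java sketch, not Flink code: the class `FinalResultSketch` and its methods are hypothetical names chosen for illustration, and a word count stands in for an arbitrary aggregation. It shows that replaying the streaming-style updates as upserts yields the same table as the batch-style result.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this does not use or mirror Flink's API.
public class FinalResultSketch {

    // STREAMING-style: emit an updated (word, count) pair for every input record.
    public static List<Map.Entry<String, Integer>> streamingUpdates(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        List<Map.Entry<String, Integer>> updates = new ArrayList<>();
        for (String w : words) {
            int current = counts.merge(w, 1, Integer::sum);
            updates.add(Map.entry(w, current));
        }
        return updates;
    }

    // BATCH-style: consume the whole input first, then emit one count per word.
    public static Map<String, Integer> batchResult(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Interpret the update stream as upserts: the last update per key wins.
    public static Map<String, Integer> applyUpserts(List<Map.Entry<String, Integer>> updates) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> u : updates) {
            table.put(u.getKey(), u.getValue());
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "a");
        // One update per input record.
        System.out.println(streamingUpdates(input));
        // One entry per distinct word.
        System.out.println(batchResult(input));
        // Both views agree once the updates are applied as upserts.
        System.out.println(applyUpserts(streamingUpdates(input)).equals(batchResult(input)));
    }
}
```

The streaming-style method produces four updates for four input records, whereas the batch-style method produces one entry per distinct word. "Interpreted correctly" in the paragraph above corresponds to the `applyUpserts` step in this sketch.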

By enabling BATCH execution, we allow Flink to apply additional optimizations that are only possible when the input is known to be bounded. For example, different join/aggregation strategies can be used, in addition to a different shuffle implementation that allows more efficient task scheduling and failure recovery behavior. We will go into some of the details of the execution behavior below.

When can/should I use BATCH execution mode?

The BATCH execution mode can only be used for jobs/Flink programs that are bounded. Boundedness is a property of a data source that tells us whether all the input coming from that source is known before execution or whether new data will show up, potentially indefinitely. A job, in turn, is bounded if all its sources are bounded, and unbounded otherwise.
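The rule that a job is bounded only when every source is bounded can be written down in a few lines. This is a minimal plain-Java sketch: the `BoundednessSketch` class and its `Boundedness` enum are hypothetical stand-ins for illustration and are not Flink's own types.

```java
import java.util.List;

// Hypothetical illustration only; the enum below is not Flink's Boundedness type.
public class BoundednessSketch {

    public enum Boundedness { BOUNDED, UNBOUNDED }

    // A job is bounded only if every one of its sources is bounded;
    // a single unbounded source makes the whole job unbounded.
    public static Boundedness ofJob(List<Boundedness> sources) {
        boolean allBounded = sources.stream().allMatch(b -> b == Boundedness.BOUNDED);
        return allBounded ? Boundedness.BOUNDED : Boundedness.UNBOUNDED;
    }

    public static void main(String[] args) {
        // Two bounded sources, for example two files.
        System.out.println(ofJob(List.of(Boundedness.BOUNDED, Boundedness.BOUNDED)));
        // One bounded source joined with a continuous one, for example a message queue.
        System.out.println(ofJob(List.of(Boundedness.BOUNDED, Boundedness.UNBOUNDED)));
    }
}
```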

STREAMING execution mode, on the other hand, can be used for both bounded and unbounded jobs.

As a rule of thumb, you should use BATCH execution mode when your program is bounded, because it will be more efficient. You have to use STREAMING execution mode when your program is unbounded, because only this mode is general enough to deal with continuous data streams.

One obvious exception is when you want to use a bounded job to bootstrap some job state that you then want to use in an unbounded job. For example, you can run a bounded job in STREAMING mode, take a savepoint, and then restore that savepoint in an unbounded job. This is a very specific use case, and one that might soon become obsolete when we allow producing a savepoint as additional output of a BATCH execution job.

Another case where you might run a bounded job in STREAMING mode is when writing tests for code that will eventually run with unbounded sources. In those cases it can be more natural to test with a bounded source.

Configuring BATCH execution mode

The execution mode can be configured via the execution.runtime-mode setting. There are three possible values:

STREAMING: The classic DataStream execution mode (default)
BATCH: Batch-style execution on the DataStream API
AUTOMATIC: Let the system decide based on the boundedness of the sources
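How the three values relate to the boundedness of the sources can be sketched as a small decision function. This is a plain-Java illustration under stated assumptions, not Flink's implementation: `RuntimeModeSketch`, its `Mode` enum, and `resolve` are hypothetical names. Rejecting BATCH for unbounded input is modeled here with an exception, based only on the statement above that BATCH can only be used for bounded jobs.

```java
// Hypothetical illustration only; this is not how Flink resolves the mode internally.
public class RuntimeModeSketch {

    public enum Mode { STREAMING, BATCH, AUTOMATIC }

    // Returns the mode a job would effectively run in, given the configured
    // value and whether all of its sources are bounded.
    public static Mode resolve(Mode configured, boolean allSourcesBounded) {
        switch (configured) {
            case AUTOMATIC:
                // Let the boundedness of the sources decide.
                return allSourcesBounded ? Mode.BATCH : Mode.STREAMING;
            case BATCH:
                // BATCH is only valid for bounded jobs (modeled as an error here).
                if (!allSourcesBounded) {
                    throw new IllegalStateException("BATCH mode requires bounded sources");
                }
                return Mode.BATCH;
            default:
                // STREAMING works for both bounded and unbounded jobs.
                return Mode.STREAMING;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Mode.AUTOMATIC, true));
        System.out.println(resolve(Mode.AUTOMATIC, false));
        System.out.println(resolve(Mode.STREAMING, true));
    }
}
```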

This can be configured via command line parameters of bin/flink run …, or programmatically when creating/configuring the StreamExecutionEnvironment.

Here’s how you can configure the execution mode via the command line:

$ bin/flink run -Dexecution.runtime-mode=BATCH examples/streaming/WordCount.jar

This example shows how you can configure the execution mode in code:

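import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH); // source is cut off after "env.setR"; completed as the assumed intent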
