flink.10 DataStream API execution mode (very important)

我先森

Last modified: 2022-08-14 23:58:06

Category: Flink从无到有 (Flink from scratch) · Tags: flink

First published: 2021-09-06 16:01:02

Attribution is required when reposting.

Article link: https://blog.csdn.net/qq_36066039/article/details/120136704

Stream and batch processing with the DataStream API

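1. Overview

Before reading this you should be familiar with: DataStream API overview
The DataStream API supports different runtime execution modes (STREAMING and BATCH). You can choose between them based on the requirements of your use case and the characteristics of your job.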

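STREAMING

The DataStream API has a "classic" execution behavior, which we call STREAMING execution mode. Use it for unbounded jobs that need continuous, incremental processing and are expected to stay online indefinitely. This is the default execution mode.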

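BATCH

This mode executes jobs in a way that is more reminiscent of batch processing frameworks such as MapReduce. Use it for bounded jobs that have a known, fixed input and do not run continuously.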

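Note

A DataStream program can run in one of two modes, STREAMING or BATCH, as described above. The API that Flink exposes to the user is the same in both. If the input is bounded, that input determines the final result, regardless of whether the execution mode is STREAMING or BATCH. Because the source is bounded, the final result is the same; what differs is how the job is executed internally.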

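2. When to use BATCH execution mode

batch

BATCH execution mode can only be used for bounded jobs. Boundedness is a property of a data source: it tells us whether all the input from that source is known before execution starts, or whether new data may keep arriving, possibly indefinitely. A job, in turn, is bounded if all of its sources are bounded, and unbounded otherwise.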

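streaming

STREAMING execution mode can be used for both bounded and unbounded jobs. A job with an unbounded source can only run in STREAMING mode.

For bounded sources, BATCH and STREAMING produce the same final result; they differ somewhat in how the job is executed internally. As a rule of thumb, use BATCH execution mode when your program is bounded, because it will be more efficient. When your program is unbounded, you must use STREAMING execution mode, because only this mode is general enough to handle continuous data streams.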

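3. How to set the execution mode of a DataStream application
The execution mode is configured through the execution.runtime-mode setting. There are three possible values:

STREAMING: the classic DataStream execution mode (default)
BATCH: batch-style execution on the DataStream API
AUTOMATIC: let the system decide based on the boundedness of the sources

Any of these three values can be passed on the command line when submitting the Flink program, or set in code: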

bin/flink run -Dexecution.runtime-mode=BATCH examples/streaming/WordCount.jar

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
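
To make the setting and its default concrete, here is a minimal plain-Java sketch of how a `-Dexecution.runtime-mode=<value>` argument could be mapped to one of the three values. It is a toy model, not Flink's own option parser: the class `RuntimeModeSettingSketch`, the local `Mode` enum, the `fromArgs` helper, and the case-insensitive matching are all assumptions made for illustration.

```java
import java.util.Locale;

// Toy model only: Flink parses its options itself; this just illustrates the
// three allowed values and the STREAMING default described in the text.
class RuntimeModeSettingSketch {

    // Hypothetical stand-in for Flink's RuntimeExecutionMode enum.
    enum Mode { STREAMING, BATCH, AUTOMATIC }

    static final String KEY = "execution.runtime-mode";

    // Returns the mode named by a "-Dexecution.runtime-mode=<value>" argument,
    // or STREAMING when no such argument is present.
    static Mode fromArgs(String... args) {
        String prefix = "-D" + KEY + "=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                String value = arg.substring(prefix.length()).trim();
                // Assumption of this sketch: values are matched case-insensitively.
                return Mode.valueOf(value.toUpperCase(Locale.ROOT));
            }
        }
        return Mode.STREAMING;
    }

    public static void main(String[] args) {
        System.out.println(fromArgs("run", "-Dexecution.runtime-mode=BATCH", "job.jar"));
        System.out.println(fromArgs("run", "job.jar"));
    }
}
```

With the two sample argument lists in `main`, the sketch prints BATCH and then STREAMING. An unknown value makes `Mode.valueOf` throw an `IllegalArgumentException`, which is the sketch's way of rejecting anything outside the three listed values.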

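4. The two modes emit different output

Batch processing in Flink used to be done almost entirely with the DataSet API. As Flink has evolved, the DataSet API has come close to being deprecated, and the DataStream API now supports batch processing as well. In other words, the DataStream API can be used for both batch and stream processing. The two behave differently: stream processing emits every intermediate result and therefore produces updates, whereas batch processing emits only the final result and produces no updates. I won't go into more detail here; see: Thoughts on batch and stream processing
Below is an example that runs the same data in batch mode and in streaming mode, so we can compare the output: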


import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Runs in BATCH mode
public class ReduceDemo {
    // reduce behaves a little differently from what you might expect
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
                Tuple2.of("a", 1L),
                Tuple2.of("a", 2L),
                Tuple2.of("b", 1L),
                Tuple2.of("b", 3L),
                Tuple2.of("c", 100L)
        );
        // Key by the first tuple field and sum the second field per key.
        // A key selector replaces the deprecated index-based keyBy(0).
        stream1.keyBy(value -> value.f0)
                .reduce((x0, x1) -> Tuple2.of(x0.f0, x0.f1 + x1.f1))
                .print();
        env.execute();
    }
}
Output (the order of the lines may vary between runs):
(a,3)
(c,100)
(b,4)

Running in streaming mode:

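import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Runs in STREAMING mode
public class ReduceDemo {
    // reduce behaves a little differently from what you might expect
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
        DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
                Tuple2.of("a", 1L),
                Tuple2.of("a", 2L),
                Tuple2.of("b", 1L),
                Tuple2.of("b", 3L),
                Tuple2.of("c", 100L)
        );
        // Key by the first tuple field and sum the second field per key.
        // A key selector replaces the deprecated index-based keyBy(0).
        stream1.keyBy(value -> value.f0)
                .reduce((x0, x1) -> Tuple2.of(x0.f0, x0.f1 + x1.f1))
                .print();
        env.execute();
    }
}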
Output (the order of the lines may vary between runs):
(a,1)
(a,3)
(c,100)
(b,1)
(b,4)
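
The difference between the two outputs can be reproduced without a Flink cluster. The following is a minimal plain-Java sketch of a keyed running sum, assuming a single thread and in-order input; it is a toy model of the observable behavior, not Flink's implementation. The class `ReduceEmissionSketch` and its methods `streamingStyle`, `batchStyle`, and `sampleInput` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a keyed rolling sum in plain Java; not Flink's implementation.
class ReduceEmissionSketch {

    // STREAMING-like: emit the updated running sum after every incoming record.
    static List<String> streamingStyle(List<Map.Entry<String, Long>> input) {
        Map<String, Long> state = new LinkedHashMap<>();
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Long> record : input) {
            long sum = state.merge(record.getKey(), record.getValue(), Long::sum);
            emitted.add("(" + record.getKey() + "," + sum + ")");
        }
        return emitted;
    }

    // BATCH-like: consume the whole bounded input, then emit one final sum per key.
    static List<String> batchStyle(List<Map.Entry<String, Long>> input) {
        Map<String, Long> state = new LinkedHashMap<>();
        for (Map.Entry<String, Long> record : input) {
            state.merge(record.getKey(), record.getValue(), Long::sum);
        }
        List<String> emitted = new ArrayList<>();
        state.forEach((key, sum) -> emitted.add("(" + key + "," + sum + ")"));
        return emitted;
    }

    // Same five records as in the Flink example.
    static List<Map.Entry<String, Long>> sampleInput() {
        return List.of(
                Map.entry("a", 1L), Map.entry("a", 2L),
                Map.entry("b", 1L), Map.entry("b", 3L),
                Map.entry("c", 100L));
    }

    public static void main(String[] args) {
        System.out.println(streamingStyle(sampleInput())); // [(a,1), (a,3), (b,1), (b,4), (c,100)]
        System.out.println(batchStyle(sampleInput()));     // [(a,3), (b,4), (c,100)]
    }
}
```

The last value emitted per key is identical in both variants, which is why the final result for a bounded input does not depend on the mode. The sketch keeps the input order because it is single-threaded; a real Flink job may interleave keys differently across parallel subtasks.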

In the DataStream API:
Setting env.setRuntimeMode(RuntimeExecutionMode.BATCH) effectively takes over the role of the DataSet API.
This is part of Flink's work toward unified stream and batch processing.

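5. A look at RuntimeExecutionMode

This is an enum; its source code is shown below. Note that the default execution mode is STREAMING; AUTOMATIC has to be selected explicitly.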

public enum RuntimeExecutionMode {

    /**
     * The Pipeline will be executed with Streaming Semantics. All tasks will be deployed before
     * execution starts, checkpoints will be enabled, and both processing and event time will be
     * fully supported.
     */
    STREAMING,

    /**
     * The Pipeline will be executed with Batch Semantics. Tasks will be scheduled gradually based
     * on the scheduling region they belong, shuffles between regions will be blocking, watermarks
     * are assumed to be "perfect" i.e. no late data, and processing time is assumed to not advance
     * during execution.
     */
    BATCH,

    /**
     * Flink will set the execution mode to {@link RuntimeExecutionMode#BATCH} if all sources are
     * bounded, or {@link RuntimeExecutionMode#STREAMING} if there is at least one source which is
     * unbounded.
     */
    AUTOMATIC
}
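
The rule in the AUTOMATIC Javadoc, together with the restriction that BATCH needs bounded sources, can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions, not Flink's scheduler code: the class `ModeResolutionSketch`, the local `Mode` and `Boundedness` enums, the `resolve` method, and the exception message are all hypothetical.

```java
import java.util.List;

// Simplified model of how a configured mode could be resolved; not Flink's code.
class ModeResolutionSketch {

    // Hypothetical stand-ins for Flink's own enums.
    enum Mode { STREAMING, BATCH, AUTOMATIC }
    enum Boundedness { BOUNDED, UNBOUNDED }

    static Mode resolve(Mode configured, List<Boundedness> sources) {
        // A job is bounded only if every one of its sources is bounded.
        boolean allBounded = sources.stream().allMatch(s -> s == Boundedness.BOUNDED);
        if (configured == Mode.AUTOMATIC) {
            return allBounded ? Mode.BATCH : Mode.STREAMING;
        }
        if (configured == Mode.BATCH && !allBounded) {
            // BATCH cannot be used once any source is unbounded.
            throw new IllegalStateException("BATCH mode requires every source to be bounded");
        }
        return configured;
    }

    public static void main(String[] args) {
        List<Boundedness> bounded = List.of(Boundedness.BOUNDED, Boundedness.BOUNDED);
        List<Boundedness> mixed = List.of(Boundedness.BOUNDED, Boundedness.UNBOUNDED);
        System.out.println(resolve(Mode.AUTOMATIC, bounded)); // BATCH
        System.out.println(resolve(Mode.AUTOMATIC, mixed));   // STREAMING
        System.out.println(resolve(Mode.STREAMING, bounded)); // STREAMING
    }
}
```

STREAMING is returned unchanged for bounded input because that mode accepts both bounded and unbounded jobs; only the BATCH plus unbounded combination is rejected.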