Flink Development: The Relationship Between Tasks and Operator Chains

For distributed execution, Flink chains operator subtasks together into tasks; chained operators form an operator chain. Each task is executed by one thread. Chaining operators into tasks is a useful optimization: it reduces the overhead of thread-to-thread handover and of buffering, and it increases overall throughput while decreasing latency.
Each worker (TaskManager) is a JVM process that can execute one or more subtasks in separate threads. To control how many tasks a TaskManager accepts, it has so-called task slots (at least one).

Each task slot represents a fixed subset of the TaskManager's resources. A TaskManager with 3 slots, for example, dedicates 1/3 of its managed memory to each slot. Slotting the resources means that a subtask does not compete for managed memory with subtasks from other jobs, but instead has a guaranteed amount of reserved managed memory. Note that there is no CPU isolation here; slots currently only separate the managed memory of tasks.
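As a minimal, plain-Java sketch of that fixed split (no Flink dependency; the memory size is illustrative):

```java
public class ManagedMemoryPerSlot {
    // Each slot receives an equal, fixed share of the TaskManager's managed memory.
    static long perSlot(long managedMemoryBytes, int numSlots) {
        return managedMemoryBytes / numSlots;
    }

    public static void main(String[] args) {
        // A TaskManager with 3 GiB of managed memory and 3 slots
        // reserves 1 GiB (1/3) for each slot.
        long share = perSlot(3L * 1024 * 1024 * 1024, 3);
        System.out.println(share); // 1073741824 bytes = 1 GiB
    }
}
```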

By adjusting the number of task slots, users can define how subtasks are isolated from one another. Having one slot per TaskManager means that each task group runs in a separate JVM (which can be started in a separate container, for example). Having multiple slots means that more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. They may also share data sets and data structures, which reduces the per-task overhead.

By default, Flink allows subtasks to share slots even if they are subtasks of different tasks, as long as they come from the same job. The result is that one slot can hold an entire pipeline of the job. Allowing slot sharing has two main benefits:

  • A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. There is no need to calculate how many tasks (with varying parallelism) a program contains in total.

  • It is easier to get better resource utilization. Without slot sharing, the non-intensive subtasks (source/map()) would block as many resources as the resource-intensive subtasks (window). With slot sharing, increasing the base parallelism in our example from 2 to 6 yields full utilization of the slotted resources, while making sure that the heavy subtasks are fairly distributed among the TaskManagers.

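The two benefits above come down to a small calculation, sketched here in plain Java (no Flink dependency; the operator names and parallelisms are illustrative): with slot sharing, the slots a job needs equal the maximum operator parallelism; without it, every subtask would occupy a slot of its own.

```java
import java.util.Map;

public class SlotMath {
    // With slot sharing, one slot can hold one subtask of every operator,
    // so the job needs as many slots as the highest operator parallelism.
    static int slotsWithSharing(Map<String, Integer> parallelism) {
        return parallelism.values().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    // Without sharing, each subtask occupies its own slot,
    // so the job needs the sum of all operator parallelisms.
    static int slotsWithoutSharing(Map<String, Integer> parallelism) {
        return parallelism.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // Hypothetical pipeline: source/map and window at parallelism 6, sink at 1.
        Map<String, Integer> p = Map.of("source/map", 6, "keyBy/window/apply", 6, "sink", 1);
        System.out.println(slotsWithSharing(p));    // 6
        System.out.println(slotsWithoutSharing(p)); // 13
    }
}
```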

The sample program below gives a feel for how Flink splits a job into subtasks. A new task boundary appears:

  1. When data is redistributed, for example by a keyBy operation.
  2. When the parallelism of an operator changes.
  3. When the operator chain is split manually.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInteger("rest.port",8082);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
        env.setParallelism(2);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("group.id", "g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
                "mytest",
                new SimpleStringSchema(),
                properties
        );
        DataStreamSource<String> source = env.addSource(kafkaConsumer);
        SingleOutputStreamOperator<Tuple2<String, Integer>> mapStream = source.map(new RichMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split(" ");
                return Tuple2.of(split[0], Integer.parseInt(split[1]));
            }
        });
        SingleOutputStreamOperator<Tuple2<String, Integer>> applyStream = mapStream.keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (Tuple2<String, Integer> tuple2 : input) {
                            out.collect(tuple2);
                        }
                    }
                });
        applyStream.print().setParallelism(1);
        env.execute("");
    }


1. disableOperatorChaining

Flink enables operator chaining by default. It can be turned off for the entire job by calling env.disableOperatorChaining().

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInteger("rest.port",8082);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
        env.setParallelism(2);
        env.disableOperatorChaining(); // turn off operator chaining for the whole job
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "59.111.211.35:9092,59.111.211.36:9092,59.111.211.37:9092");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("group.id", "g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
                "mytest",
                new SimpleStringSchema(),
                properties
        );
        DataStreamSource<String> source = env.addSource(kafkaConsumer);
        SingleOutputStreamOperator<Tuple2<String, Integer>> mapStream = source.map(new RichMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split(" ");
                return Tuple2.of(split[0], Integer.parseInt(split[1]));
            }
        });
        SingleOutputStreamOperator<Tuple2<String, Integer>> applyStream = mapStream.keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (Tuple2<String, Integer> tuple2 : input) {
                            out.collect(tuple2);
                        }
                    }
                });
        SingleOutputStreamOperator<Tuple2<String, Integer>> filterStream = applyStream.filter(new FilterFunction<Tuple2<String, Integer>>() {
            @Override
            public boolean filter(Tuple2<String, Integer> integerTuple2) throws Exception {
                return integerTuple2.f0.startsWith("a");
            }
        });
        filterStream.print().setParallelism(1);
        env.execute("");
    }

In this case, every operator is split into a task of its own.

2. startNewChain

startNewChain starts a new operator chain beginning with the current operator: the link to the preceding operator is broken, while later operators may still chain onto this one. Combined with slot sharing groups, this can be used to give the new chain its own share of memory.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInteger("rest.port",8082);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
        env.setParallelism(2);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "59.111.211.35:9092,59.111.211.36:9092,59.111.211.37:9092");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("group.id", "g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
                "mytest",
                new SimpleStringSchema(),
                properties
        );
        DataStreamSource<String> source = env.addSource(kafkaConsumer);
        SingleOutputStreamOperator<Tuple2<String, Integer>> mapStream = source.map(new RichMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split(" ");
                return Tuple2.of(split[0], Integer.parseInt(split[1]));
            }
        });
        SingleOutputStreamOperator<Tuple2<String, Integer>> applyStream = mapStream.keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (Tuple2<String, Integer> tuple2 : input) {
                            out.collect(tuple2);
                        }
                    }
                });
        SingleOutputStreamOperator<Tuple2<String, Integer>> filterStream = applyStream.filter(new FilterFunction<Tuple2<String, Integer>>() {
            @Override
            public boolean filter(Tuple2<String, Integer> integerTuple2) throws Exception {
                return integerTuple2.f0.startsWith("a");
            }
        }).startNewChain(); // the filter starts a new chain instead of chaining to its predecessor
        filterStream.print();
        env.execute("");
    }

Before startNewChain: (job graph screenshot omitted)
After startNewChain: (job graph screenshot omitted)

3. disableChaining

disableChaining breaks the operator chain both before and after a given operator, so that the operator forms a task of its own.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInteger("rest.port",8082);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
        env.setParallelism(2);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "59.111.211.35:9092,59.111.211.36:9092,59.111.211.37:9092");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("group.id", "g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
                "mytest",
                new SimpleStringSchema(),
                properties
        );
        DataStreamSource<String> source = env.addSource(kafkaConsumer);
        SingleOutputStreamOperator<Tuple2<String, Integer>> mapStream = source.map(new RichMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split(" ");
                return Tuple2.of(split[0], Integer.parseInt(split[1]));
            }
        });
        SingleOutputStreamOperator<Tuple2<String, Integer>> applyStream = mapStream.keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (Tuple2<String, Integer> tuple2 : input) {
                            out.collect(tuple2);
                        }
                    }
                });
        SingleOutputStreamOperator<Tuple2<String, Integer>> filterStream = applyStream.filter(new FilterFunction<Tuple2<String, Integer>>() {
            @Override
            public boolean filter(Tuple2<String, Integer> integerTuple2) throws Exception {
                return integerTuple2.f0.startsWith("a");
            }
        }).disableChaining(); // the filter chains neither to its predecessor nor to its successor
        filterStream.print();
        env.execute("");
    }

Before disableChaining: (job graph screenshot omitted)
After disableChaining: (job graph screenshot omitted)

4. slotSharingGroup

slotSharingGroup sets the slot sharing group of an operator. Flink places operators with the same slot sharing group into the same slot, and keeps operators from different groups in different slots, thereby achieving slot isolation. If all input operators belong to the same group, the slot sharing group is inherited from the input operator (nearest-ancestor rule). Flink's default group is named "default"; an operator can be placed into it explicitly by calling slotSharingGroup("default").

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "59.111.211.35:9092,59.111.211.36:9092,59.111.211.37:9092");
        properties.setProperty("auto.offset.reset", "earliest");
        properties.setProperty("group.id", "g1");
        properties.setProperty("enable.auto.commit", "true");
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>(
                "mytest",
                new SimpleStringSchema(),
                properties
        );
        DataStreamSource<String> source = env.addSource(kafkaConsumer).setParallelism(1);
        SingleOutputStreamOperator<Tuple2<String, Integer>> mapStream = source.map(new RichMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                String[] split = s.split(" ");
                return Tuple2.of(split[0], Integer.parseInt(split[1]));
            }
        }).disableChaining().slotSharingGroup("map"); // operators from here on are placed in the "map" sharing group
        SingleOutputStreamOperator<Tuple2<String, Integer>> applyStream = mapStream.keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) throws Exception {
                        for (Tuple2<String, Integer> tuple2 : input) {
                            out.collect(tuple2);
                        }
                    }
                });
        SingleOutputStreamOperator<Tuple2<String, Integer>> filterStream = applyStream.filter(new FilterFunction<Tuple2<String, Integer>>() {
            @Override
            public boolean filter(Tuple2<String, Integer> integerTuple2) throws Exception {
                return integerTuple2.f0.startsWith("a");
            }
        });
        filterStream.print();
        env.execute("");
    }

Start the cluster with three slots: the source occupies one slot, and the operators from the map onward share the other two.
