Understanding Parallelism in Flink

Concepts

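A Flink program is composed of multiple operators (sources, transformations, and sinks).
Each operator is executed by multiple parallel tasks (threads). The number of parallel tasks of an operator is that operator's parallelism. In other words, parallelism is always defined relative to an operator.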
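
The idea that one operator runs as several parallel instances of the same user function can be sketched in plain Java. This is a simplified model, not Flink code: the class `ParallelOperatorSketch`, the method `run`, and the rule that record i goes to subtask `i % parallelism` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelOperatorSketch {

    /**
     * Applies userFunction to all records using `parallelism` subtasks,
     * each running on its own thread. Record i is handled by subtask
     * (i % parallelism); this routing rule is hypothetical.
     * Returns one output list per subtask.
     */
    public static <IN, OUT> List<List<OUT>> run(
            List<IN> records, int parallelism, Function<IN, OUT> userFunction)
            throws Exception {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<OUT>>> futures = new ArrayList<>();
            for (int subtask = 0; subtask < parallelism; subtask++) {
                final int index = subtask;
                // Each subtask is one parallel instance of the same user function.
                futures.add(pool.submit(() -> {
                    List<OUT> out = new ArrayList<>();
                    for (int i = index; i < records.size(); i += parallelism) {
                        out.add(userFunction.apply(records.get(i)));
                    }
                    return out;
                }));
            }
            List<List<OUT>> perSubtask = new ArrayList<>();
            for (Future<List<OUT>> future : futures) {
                perSubtask.add(future.get());
            }
            return perSubtask;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // One "map" operator with parallelism 2 over five records.
        List<List<String>> out =
                run(List.of("a", "b", "c", "d", "e"), 2, s -> s.toUpperCase());
        System.out.println(out); // [[A, C, E], [B, D]]
    }
}
```

With a parallelism of 2, the same function runs in two subtasks, and each subtask sees only its own share of the records.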

The official documentation describes Operator as follows:

Operator

Node of a Logical Graph. An Operator performs a certain operation, which is usually executed by a Function. Sources and Sinks are special Operators for data ingestion and data egress.

Logical Graph

A logical graph is a directed graph where the nodes are Operators and the edges define input/output-relationships of the operators and correspond to data streams or data sets. A logical graph is created by submitting jobs from a Flink Application.

Logical graphs are also often referred to as dataflow graphs.

The source code describes it as follows:

/**
 * Abstract base class for all operators. An operator is a source, sink, or it applies an operation
 * to one or more inputs, producing a result.
 *
 * @param <OUT> Output type of the records output by this operator
 */
@Internal
public abstract class Operator<OUT> implements Visitable<Operator<?>> {

    /**
     * Sets the parallelism for this contract instance. The parallelism denotes how many parallel
     * instances of the user function will be spawned during the execution.
     *
     * @param parallelism The number of parallel instances to spawn. Set this value to {@link
     *     ExecutionConfig#PARALLELISM_DEFAULT} to let the system decide on its own.
     */
    public void setParallelism(int parallelism) {
        this.parallelism = parallelism;
    }

Setting the Parallelism

An operator's parallelism can be set at four levels:

1. Operator Level

2. Execution Environment Level

3. Client Level

4. System Level (the system-wide default; not recommended, because it affects every job)

1. Operator Level

Call setParallelism(xxx) directly on the operator.

2. Execution Environment Level

Call env.setParallelism(xxx), where env is the StreamExecutionEnvironment.

3. Client Level

The parallelism can also be set on the client when the job is submitted to Flink.
With the CLI client, pass it with the -p option:
./bin/flink run -p 3 ...

4. System Level

At the system level, the parallelism.default property in flink-conf.yaml sets the default parallelism for all execution environments.

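Priority of the Four Methods

Priority: operator level > execution environment level > client level > system default level

That is, a value set at a higher-priority level overrides the values set at lower-priority levels.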
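
The precedence rule above can be modelled as "take the first level that was explicitly set". The following is a minimal plain-Java sketch of that rule, not Flink's internal implementation: the class `ParallelismResolver`, the method `resolve`, and the use of `null` for "not set at this level" are assumptions made for illustration.

```java
public class ParallelismResolver {

    /**
     * Returns the effective parallelism by checking the levels from highest
     * to lowest priority: operator, execution environment, client, and
     * finally the system default. A null argument means "not set".
     */
    public static int resolve(Integer operatorLevel, Integer envLevel,
                              Integer clientLevel, int systemDefault) {
        if (operatorLevel != null) {
            return operatorLevel;
        }
        if (envLevel != null) {
            return envLevel;
        }
        if (clientLevel != null) {
            return clientLevel;
        }
        return systemDefault;
    }

    public static void main(String[] args) {
        System.out.println(resolve(4, 3, 2, 1));          // 4: operator level wins
        System.out.println(resolve(null, 3, 2, 1));       // 3: env level overrides client
        System.out.println(resolve(null, null, 2, 1));    // 2: client overrides system default
        System.out.println(resolve(null, null, null, 1)); // 1: falls back to system default
    }
}
```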

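Note also that the configured parallelism and the parallelism actually used at runtime are not always the same. For example, a source that cannot run in parallel does not run in parallel even when a parallelism greater than 1 is specified. Reading from Kafka is a similar case: source subtasks beyond the number of partitions have nothing to read.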
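
The Kafka case can be illustrated with a simplified model of how partitions are spread over source subtasks. This sketch assumes a plain round-robin rule (partition p goes to subtask `p % parallelism`), which is not necessarily how the Flink Kafka connector assigns partitions. The class `PartitionAssignmentSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentSketch {

    /**
     * Spreads partitions 0..partitionCount-1 over `parallelism` subtasks
     * using a round-robin rule (an assumption for illustration).
     * Returns the list of partitions owned by each subtask.
     */
    public static List<List<Integer>> assign(int partitionCount, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        List<List<Integer>> perSubtask = new ArrayList<>();
        for (int subtask = 0; subtask < parallelism; subtask++) {
            perSubtask.add(new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            perSubtask.get(partition % parallelism).add(partition);
        }
        return perSubtask;
    }

    /** Counts the subtasks that received no partition and therefore stay idle. */
    public static long idleSubtasks(int partitionCount, int parallelism) {
        return assign(partitionCount, parallelism).stream()
                .filter(List::isEmpty)
                .count();
    }

    public static void main(String[] args) {
        // A topic with 2 partitions read by a source with parallelism 4.
        System.out.println(assign(2, 4));       // [[0], [1], [], []]
        System.out.println(idleSubtasks(2, 4)); // 2
    }
}
```

In this model, raising the source parallelism above the partition count adds only idle subtasks, so the useful parallelism is capped by the number of partitions.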

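In production, it is recommended to set each operator's parallelism explicitly at the operator level, which makes resource usage explicit and precisely controllable.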
