How to make your Flink job's data stream distribute evenly across subtasks

For non-keyBy operators this is barely a problem: records are distributed round-robin and do not all flow to one subtask. For keyBy operators, however, things are different. Consider the following case:


package test;

import com.bigdata.common.utils.FlinkParamUtil;
import com.bigdata.common.utils.StreamEnvUtil;
import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.util.Collector;

import java.net.InetAddress;
import java.util.Random;

@Slf4j
public class ParallelismTest {
    public static void main(String[] args) throws Exception {
        String jobName = ParallelismTest.class.getSimpleName();
        ParameterTool pmt = FlinkParamUtil.createJobParameters(args);
        StreamExecutionEnvironment env = StreamEnvUtil.initStreamEnv(pmt, jobName);
        // needed below for the random key range; missing in the original snippet
        int parallelism = env.getParallelism();
        DataStreamSource<Tuple2<Integer, String>> source = env.addSource(
                new RichSourceFunction<Tuple2<Integer, String>>() {
            @Override
            public void run(SourceContext<Tuple2<Integer, String>> ctx) throws Exception {
                int i=0;
                while (true) {
                    ctx.collect(Tuple2.of(i, "s"+i ));
                    i++;
                    Thread.sleep(500);
                }
            }
            @Override
            public void cancel() {

            }
        }).setParallelism(1);
        // random key in [0, parallelism): an attempt to spread records across subtasks
        source.keyBy(a -> new Random().nextInt(parallelism)).process(
                new KeyedProcessFunction<Integer, Tuple2<Integer, String>, String>() {
                    ValueState<String> valueState;
                    String ip;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        ip= InetAddress.getLocalHost().getHostAddress();
                        ValueStateDescriptor<String> vsd = new ValueStateDescriptor<String>("vsd",String.class);
                        valueState = getRuntimeContext().getState(vsd);
                    }

                    @Override
                    public void processElement(Tuple2<Integer, String> value, KeyedProcessFunction<Integer, Tuple2<Integer, String>, String>.Context ctx, Collector<String> out) throws Exception {
                        if(valueState.value()==null){
                            valueState.update("");
                        }
                        Integer key = ctx.getCurrentKey();
                        Integer f0 = value.f0;
                        valueState.update(valueState.value()+","+f0);


                        out.collect(ip+","+key+": "+valueState.value());
                    }
                }
        ).print();

        env.execute(jobName);
    }
}

My job runs with three TaskManagers, yet all the data ends up in a single subtask:

九师兄 has a good analysis of this in his CSDN post:

【Flink】Flink key 应该分配到哪个 KeyGroup 以及 KeyGroup 分配在哪个subtask

But if the default max parallelism really were 128, the behavior above should not occur. Checking the official documentation:

The docs show the default max parallelism is -1, and with that value the computed key group is always 0:

package test;

import org.apache.flink.runtime.state.KeyGroupRange;
import org.apache.flink.util.MathUtils;
import org.apache.flink.util.Preconditions;

public class Test {
    public static void main(String[] args) {
        System.out.println(assignToKeyGroup(0, -1)); // 0
        System.out.println(assignToKeyGroup(1, -1)); // 0
        System.out.println(assignToKeyGroup(2, -1)); // 0
    }

    // the methods below mirror org.apache.flink.runtime.state.KeyGroupRangeAssignment
    public static int assignToKeyGroup(Object key, int maxParallelism) {
        return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
    }

    public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
        return MathUtils.murmurHash(keyHash) % maxParallelism;
    }

    public static KeyGroupRange computeKeyGroupRangeForOperatorIndex(int maxParallelism, int parallelism, int operatorIndex) {
        int start = ((operatorIndex * maxParallelism + parallelism - 1) / parallelism);
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        System.out.println("start,"+start+",end "+end);
        return new KeyGroupRange(start, end);
    }
}
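The root cause is a quirk of Java's remainder operator: for any int n, n % -1 is 0 (even for Integer.MIN_VALUE), so the hash modulo above can never produce anything but key group 0. A minimal, Flink-free demo:

```java
public class ModMinusOneDemo {
    public static void main(String[] args) {
        // In Java, n % -1 == 0 for every int n, including Integer.MIN_VALUE,
        // so hash % maxParallelism with maxParallelism == -1 pins every key
        // to key group 0.
        for (int n : new int[]{0, 1, 2, 42, -7, Integer.MIN_VALUE}) {
            System.out.println(n + " % -1 = " + (n % -1));
        }
    }
}
```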

So we explicitly set the max parallelism in the job:

env.setMaxParallelism(3);
int maxParallelism = env.getMaxParallelism();
System.out.println("maxParallelism: "+maxParallelism);
int parallelism = env.getParallelism();
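To see why an explicit max parallelism fixes the skew, here is a small standalone sketch of the key-group-to-subtask mapping. The formula keyGroupId * parallelism / maxParallelism is assumed here to match Flink's KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup:

```java
public class KeyGroupMath {
    // Sketch of the key-group -> subtask mapping; assumed to match
    // KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup in Flink.
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
        return keyGroupId * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        // With maxParallelism = 3 and parallelism = 3, key groups 0..2 map
        // to subtasks 0..2 one-to-one, so the three random keys can land on
        // three different subtasks.
        for (int kg = 0; kg < 3; kg++) {
            System.out.println("keyGroup " + kg + " -> subtask "
                    + operatorIndexForKeyGroup(3, 3, kg));
        }
        // With the usual default of 128 and parallelism = 3, the 128 key
        // groups split 43/43/42 across the three subtasks.
        int[] counts = new int[3];
        for (int kg = 0; kg < 128; kg++) {
            counts[operatorIndexForKeyGroup(128, 3, kg)]++;
        }
        System.out.println(counts[0] + "/" + counts[1] + "/" + counts[2]);
    }
}
```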

With that in place, the data is spread across the subtasks and the problem is solved:
