For non-keyBy operators this is really a non-issue: data does not all flow to a single subtask. For keyBy operators, however, things are different. Consider the following case:
package test;

import com.bigdata.common.utils.FlinkParamUtil;
import com.bigdata.common.utils.StreamEnvUtil;
import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.util.Collector;

import java.net.InetAddress;
import java.util.Random;

@Slf4j
public class ParallismTest {

    public static void main(String[] args) throws Exception {
        String jobName = ParallismTest.class.getSimpleName();
        ParameterTool pmt = FlinkParamUtil.createJobParameters(args);
        StreamExecutionEnvironment env = StreamEnvUtil.initStreamEnv(pmt, jobName);
        // The random key below is drawn from [0, parallelism), so ideally
        // records should spread across all subtasks of the keyed operator.
        int parallelism = env.getParallelism();

        DataStreamSource<Tuple2<Integer, String>> source = env.addSource(
                new RichSourceFunction<Tuple2<Integer, String>>() {
                    @Override
                    public void run(SourceContext<Tuple2<Integer, String>> ctx) throws Exception {
                        int i = 0;
                        while (true) {
                            ctx.collect(Tuple2.of(i, "s" + i));
                            i++;
                            Thread.sleep(500);
                        }
                    }

                    @Override
                    public void cancel() {
                    }
                }).setParallelism(1);

        source.keyBy(a -> new Random().nextInt(parallelism)).process(
                new KeyedProcessFunction<Integer, Tuple2<Integer, String>, String>() {
                    ValueState<String> valueState;
                    String ip;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        ip = InetAddress.getLocalHost().getHostAddress();
                        ValueStateDescriptor<String> vsd = new ValueStateDescriptor<>("vsd", String.class);
                        valueState = getRuntimeContext().getState(vsd);
                    }

                    @Override
                    public void processElement(Tuple2<Integer, String> value,
                                               KeyedProcessFunction<Integer, Tuple2<Integer, String>, String>.Context ctx,
                                               Collector<String> out) throws Exception {
                        if (valueState.value() == null) {
                            valueState.update("");
                        }
                        Integer key = ctx.getCurrentKey();
                        Integer f0 = value.f0;
                        valueState.update(valueState.value() + "," + f0);
                        out.collect(ip + "," + key + ": " + valueState.value());
                    }
                }
        ).print();

        env.execute(jobName);
    }
}
My job has three TaskManagers, yet all of the data ends up in a single subtask:
九师兄 has a good analysis of this in his article 【Flink】Flink key 应该分配到哪个 KeyGroup 以及 KeyGroup 分配在哪个subtask (九师兄's blog on CSDN).
But if the default max parallelism were 128, the behavior above should not occur, so we check the official documentation. It turns out the default max parallelism is -1, and with -1 the computed key group is always 0, because in Java x % -1 evaluates to 0 for every int x:
package test;

import org.apache.flink.runtime.state.KeyGroupRange;
import org.apache.flink.util.MathUtils;

public class Test {

    public static void main(String[] args) {
        // With maxParallelism = -1, every key collapses into key group 0,
        // since any int modulo -1 is 0 in Java.
        System.out.println(assignToKeyGroup(0, -1));
        System.out.println(assignToKeyGroup(1, -1));
        System.out.println(assignToKeyGroup(2, -1));
    }

    public static int assignToKeyGroup(Object key, int maxParallelism) {
        return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
    }

    public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
        return MathUtils.murmurHash(keyHash) % maxParallelism;
    }

    public static KeyGroupRange computeKeyGroupRangeForOperatorIndex(int maxParallelism, int parallelism, int operatorIndex) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        System.out.println("start," + start + ",end " + end);
        return new KeyGroupRange(start, end);
    }
}
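To make the arithmetic concrete, here is a self-contained sketch of the subtask assignment, with no Flink dependency. The class and method names here are hypothetical; the formula mirrors Flink's KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup, which maps a key group to an operator subtask. It shows both why a max parallelism of -1 pins everything to subtask 0 and how key groups split across subtasks with a realistic max parallelism of 128:

```java
public class KeyGroupMath {
    // Mirrors Flink's KeyGroupRangeAssignment.computeOperatorIndexForKeyGroup:
    // the subtask that owns a given key group.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        // In Java, x % -1 is 0 for every int x, so with maxParallelism = -1
        // every key is assigned to key group 0 ...
        System.out.println(12345 % -1);               // 0
        System.out.println(-98765 % -1);              // 0
        // ... and key group 0 always maps to subtask 0, whatever the parallelism:
        System.out.println(subtaskFor(0, 128, 3));    // 0

        // With maxParallelism = 128 and parallelism = 3, the 128 key groups
        // are split across the three subtasks:
        System.out.println(subtaskFor(42, 128, 3));   // groups 0..42   -> subtask 0
        System.out.println(subtaskFor(43, 128, 3));   // groups 43..85  -> subtask 1
        System.out.println(subtaskFor(86, 128, 3));   // groups 86..127 -> subtask 2
    }
}
```

So as long as keys hash into many different key groups, a max parallelism of 128 spreads them over all three subtasks; with -1, group 0 is the only possible destination.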
So we explicitly set the max parallelism in the job:

env.setMaxParallelism(3);
int maxParallelism = env.getMaxParallelism();
System.out.println("maxParallelism: " + maxParallelism);
int parallelism = env.getParallelism();
Problem solved:
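As a sanity check on the fix, the same range formula as in computeKeyGroupRangeForOperatorIndex above shows that with maxParallelism = 3 and parallelism = 3 each of the three subtasks owns exactly one key group, so keys can now reach every subtask. The class name below is hypothetical, and the range is returned as a plain string to stay self-contained:

```java
public class MaxParallelismFix {
    // Same arithmetic as KeyGroupRangeAssignment.computeKeyGroupRangeForOperatorIndex,
    // returning "start-end" instead of a KeyGroupRange object.
    static String rangeFor(int maxParallelism, int parallelism, int operatorIndex) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return start + "-" + end;
    }

    public static void main(String[] args) {
        // After env.setMaxParallelism(3) with parallelism 3, every subtask
        // owns exactly one key group:
        System.out.println(rangeFor(3, 3, 0)); // 0-0
        System.out.println(rangeFor(3, 3, 1)); // 1-1
        System.out.println(rangeFor(3, 3, 2)); // 2-2
    }
}
```

One caveat worth noting: the max parallelism caps how far the job can ever be rescaled, so a value as small as 3 only makes sense if the job will never run with more than 3 parallel keyed subtasks.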