Random partitioning:
dataStream.shuffle()
Under the hood, it calls random.nextInt to pick a random output channel for each record.
public ShufflePartitioner() {
}

public int[] selectChannels(SerializationDelegate<StreamRecord<T>> record, int numberOfOutputChannels) {
    this.returnArray[0] = this.random.nextInt(numberOfOutputChannels);
    return this.returnArray;
}
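The random selection above can be tried outside Flink with a minimal sketch. The class ShuffleSketch and its methods are hypothetical names used only for illustration, and the fixed seed is an assumption made so that runs are repeatable; the only call shared with the snippet above is random.nextInt.

```java
import java.util.Random;

// Hypothetical stand-in for the channel selection shown above; not a Flink class.
final class ShuffleSketch {
    private final Random random;

    ShuffleSketch(long seed) {
        // Assumption: a fixed seed, used here only to make runs repeatable.
        this.random = new Random(seed);
    }

    // Picks a uniformly random channel index in [0, numberOfOutputChannels).
    int selectChannel(int numberOfOutputChannels) {
        return random.nextInt(numberOfOutputChannels);
    }

    // Counts how many of `records` random draws land on each channel.
    static int[] histogram(int records, int channels, long seed) {
        ShuffleSketch sketch = new ShuffleSketch(seed);
        int[] counts = new int[channels];
        for (int i = 0; i < records; i++) {
            counts[sketch.selectChannel(channels)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = histogram(10_000, 4, 42L);
        for (int c = 0; c < counts.length; c++) {
            System.out.println("channel " + c + ": " + counts[c]);
        }
    }
}
```

With many records the counts per channel come out roughly equal, but nothing guarantees an exact balance for any particular run.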
Rebalancing: redistributes the records evenly across the downstream partitions (repartitioning) to eliminate data skew.
dataStream.rebalance()
++this.returnArray[0] increments the channel index for every record, so partitions are assigned in turn and the load is rebalanced.
public int[] selectChannels(SerializationDelegate<StreamRecord<T>> record, int numberOfOutputChannels) {
    // Move to the next channel; wrap around to 0 after the last one.
    int newChannel = ++this.returnArray[0];
    if (newChannel >= numberOfOutputChannels) {
        this.returnArray[0] = 0;
    }
    return this.returnArray;
}
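The increment-and-wrap logic can be checked in isolation with a small sketch. RoundRobinSketch is a hypothetical class, not Flink code, and the starting value of -1 is an assumption chosen so that the first record goes to channel 0.

```java
// Hypothetical round-robin selector mirroring the increment-and-wrap logic above.
final class RoundRobinSketch {
    // Assumption: start one step before channel 0 so the first call returns 0.
    private int current = -1;

    // Advances to the next channel and wraps to 0 after the last one.
    int selectChannel(int numberOfOutputChannels) {
        int next = ++current;
        if (next >= numberOfOutputChannels) {
            current = 0;
        }
        return current;
    }

    public static void main(String[] args) {
        RoundRobinSketch selector = new RoundRobinSketch();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 6; i++) {
            sb.append(selector.selectChannel(3)).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints "0 1 2 0 1 2"
    }
}
```

Because each channel is visited in turn, the number of records per channel differs by at most one, regardless of how the keys are distributed.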
Rescaling:
dataStream.rescale()
public int[] selectChannels(SerializationDelegate<StreamRecord<T>> record, int numberOfOutputChannels) {
    int newChannel = ++this.returnArray[0];
    if (newChannel >= numberOfOutputChannels) {
        this.returnArray[0] = 0;
    }
    return this.returnArray;
}
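Explanation:
1. If the upstream operation has a parallelism of 2 and the downstream operation has a parallelism of 4, one upstream subtask sends its output to two of the downstream subtasks, and the other upstream subtask sends its output to the remaining two.
2. Conversely, if the downstream operation has a parallelism of 2 and the upstream operation has a parallelism of 4, two of the upstream subtasks send their output to one downstream subtask, and the other two send theirs to the other downstream subtask.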
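The two cases above can be sketched as a simple index mapping. RescaleSketch and targets are hypothetical names, and the sketch assumes that one parallelism is an exact multiple of the other. Which concrete subtasks Flink pairs up may differ; only the grouping described in the explanation is modeled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the rescale grouping; not Flink's implementation.
final class RescaleSketch {
    // Returns the downstream subtask indices that one upstream subtask sends to.
    // Assumption: one parallelism is an exact multiple of the other.
    static List<Integer> targets(int upstreamIndex, int upstream, int downstream) {
        List<Integer> result = new ArrayList<>();
        if (downstream >= upstream) {
            // Fan out: each upstream subtask owns a contiguous block of downstream subtasks.
            int factor = downstream / upstream;
            for (int i = 0; i < factor; i++) {
                result.add(upstreamIndex * factor + i);
            }
        } else {
            // Fan in: several upstream subtasks share a single downstream subtask.
            int factor = upstream / downstream;
            result.add(upstreamIndex / factor);
        }
        return result;
    }

    public static void main(String[] args) {
        for (int u = 0; u < 2; u++) {
            System.out.println("2 -> 4, upstream " + u + " sends to " + targets(u, 2, 4));
        }
        for (int u = 0; u < 4; u++) {
            System.out.println("4 -> 2, upstream " + u + " sends to " + targets(u, 4, 2));
        }
    }
}
```

Unlike rebalance, each upstream subtask only talks to its own subset of downstream subtasks, so a record is never spread across all of them.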
Custom partitioning:
Custom partitioning requires implementing the Partitioner interface.
dataStream.partitionCustom(partitioner,"someKey")
or dataStream.partitionCustom(partitioner, 0);
import org.apache.flink.api.common.functions.Partitioner;

public class MyPartition implements Partitioner<Long> {
    @Override
    public int partition(Long key, int numPartitions) {
        System.out.println("Total number of partitions: " + numPartitions);
        // Even keys go to partition 0, odd keys go to partition 1.
        if (key % 2 == 0) {
            return 0;
        } else {
            return 1;
        }
    }
}
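import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.streaming.api.datastream.DataStream;

// Convert the data from Long to a tuple type so that field 0 can be used as the partition key.
// dataStream is the DataStream<Long> being partitioned.
DataStream<Tuple1<Long>> tupleData = dataStream.map(new MapFunction<Long, Tuple1<Long>>() {
    @Override
    public Tuple1<Long> map(Long value) throws Exception {
        return new Tuple1<>(value);
    }
});

// Data after custom partitioning
DataStream<Tuple1<Long>> partitionData = tupleData.partitionCustom(new MyPartition(), 0);
DataStream<Long> result = partitionData.map(new MapFunction<Tuple1<Long>, Long>() {
    @Override
    public Long map(Tuple1<Long> value) throws Exception {
        System.out.println("Current thread ID: " + Thread.currentThread().getId() + ", value: " + value);
        return value.f0;
    }
});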
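To see how the even/odd rule in MyPartition spreads records, the following sketch applies the same rule without the Flink interface. ParityPartitionSketch and countPerPartition are hypothetical names, and the keys 1 to 10 with 4 partitions are assumed sample values.

```java
// Hypothetical stand-alone copy of the even/odd rule from MyPartition.
final class ParityPartitionSketch {
    // Same rule as MyPartition.partition: even keys -> 0, odd keys -> 1.
    // numPartitions is unused, exactly as in the original rule.
    static int partition(long key, int numPartitions) {
        return key % 2 == 0 ? 0 : 1;
    }

    // Counts how many keys end up in each partition index.
    static int[] countPerPartition(long[] keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (long key : keys) {
            counts[partition(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] keys = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] counts = countPerPartition(keys, 4);
        for (int p = 0; p < counts.length; p++) {
            System.out.println("partition " + p + ": " + counts[p] + " records");
        }
    }
}
```

With 4 partitions, only partitions 0 and 1 receive records and the other two stay empty. The rule also returns index 1 when there is only a single partition, which is outside the valid range [0, numPartitions), so it is presumably meant for a downstream parallelism of at least 2.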