Navigation
Please first read the previous post in this series, Flink从入门到放弃—Stream API—常用算子(map和flatMap), for the earlier material.
In this chapter
- filter: as the name suggests, the filter operator drops the data you don't need and keeps the data you do.
- keyBy: groups (that is, partitions) the stream by key.
demo
package com.stream.samples;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* @author DeveloperZJQ
* @since 2022/11/13
*/
public class CustomFilterAndKeyByOperator {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStream = env.socketTextStream("192.168.112.147", 7777);
        SingleOutputStreamOperator<String> filter = dataStream.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String value) {
                // keep only strings longer than 5 characters
                return value.length() > 5;
            }
        });
        KeyedStream<String, String> keyBy = filter.keyBy(new KeySelector<String, String>() {
            @Override
            public String getKey(String value) {
                return value;
            }
        });
        filter.print();
        keyBy.print();
        env.execute(CustomFilterAndKeyByOperator.class.getSimpleName());
    }
}
The code above has no real business meaning; it is the simplest possible usage, and its only purpose is to give us an entry point into the source code.
filter()
public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter) {
    return this.transform("Filter", this.getType(), (OneInputStreamOperator)(new StreamFilter((FilterFunction)this.clean(filter))));
}
The operator's default name is simply "Filter". Now let's look at the StreamFilter class:
public StreamFilter(FilterFunction<IN> filterFunction) {
    super(filterFunction);
    this.chainingStrategy = ChainingStrategy.ALWAYS;
}

public void processElement(StreamRecord<IN> element) throws Exception {
    if (((FilterFunction)this.userFunction).filter(element.getValue())) {
        this.output.collect(element);
    }
}
We saw FilterFunction above; here is that interface:
@Public
@FunctionalInterface
public interface FilterFunction<T> extends Function, Serializable {
    boolean filter(T value) throws Exception;
}
It returns a boolean. Combining that with this line of StreamFilter above, we can see:
if (((FilterFunction)this.userFunction).filter(element.getValue())) {
    this.output.collect(element);
}
If it returns true, the element is collected and forwarded downstream.
It's worth mentioning that some readers say source code is obscure and hard to read, but it really isn't: everything inside that if statement is just the basic inheritance and polymorphism we have all learned, and the filter being invoked is the filter implemented in user code.
The rest of the source follows the same logic as map and flatMap.
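To make the polymorphism point concrete, here is a stripped-down, Flink-free mimic of that if statement (all names here are hypothetical, for illustration only): the framework holds a reference typed as the interface, and the call dispatches to the user's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class MiniFilterDemo {
    // a minimal stand-in for Flink's FilterFunction
    interface MiniFilterFunction<T> {
        boolean filter(T value) throws Exception;
    }

    // a minimal stand-in for StreamFilter.processElement, applied to a whole list
    static <T> List<T> processAll(List<T> elements, MiniFilterFunction<T> userFunction) throws Exception {
        List<T> output = new ArrayList<>();
        for (T element : elements) {
            // same shape as StreamFilter: if the user's filter returns true, collect
            if (userFunction.filter(element)) {
                output.add(element);
            }
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        List<String> kept = processAll(
                List.of("short", "long enough line"),
                value -> value.length() > 5); // the "user code" being dispatched to
        System.out.println(kept); // [long enough line]
    }
}
```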
keyBy()
public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {
    // precondition: the key selector must not be null
    Preconditions.checkNotNull(key);
    // a new KeyedStream is created; next, let's see what this object does
    return new KeyedStream(this, (KeySelector)this.clean(key));
}
Here we can see a difference from the filter, map, and flatMap operators: keyBy does not return a plain DataStream, but a separate class that extends DataStream, namely KeyedStream.
@Public
@FunctionalInterface
public interface KeySelector<IN, KEY> extends Function, Serializable {
    KEY getKey(IN value) throws Exception;
}
Likewise, the KeySelector is different: rather than a boolean, it returns a KEY extracted from each element.
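Conceptually, a KeySelector just maps each element to the key that partitioning is based on. A plain-Java sketch of that idea (no Flink dependency; names are mine, not Flink's):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class KeySelectorDemo {
    // group elements by the key a selector extracts, the way keyBy logically groups a stream
    static <T, K> Map<K, List<T>> groupBy(List<T> elements, Function<T, K> keySelector) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T element : elements) {
            groups.computeIfAbsent(keySelector.apply(element), k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    public static void main(String[] args) {
        // key each word by its first letter
        Map<Character, List<String>> groups =
                groupBy(List.of("apple", "avocado", "banana"), word -> word.charAt(0));
        System.out.println(groups); // {a=[apple, avocado], b=[banana]}
    }
}
```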
Let's keep going.
// a delegating constructor
public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
    this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
}

// another delegating constructor; note that every layer of delegation actually does some work
// one thing to note here: the default maximum parallelism is 128
public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    this(dataStream, new PartitionTransformation(dataStream.getTransformation(), new KeyGroupStreamPartitioner(keySelector, 128)), dataStream.getType() == null ? keySelector : keySelector, keyType);
}

// here is the real work: in keyBy we finally see a PartitionTransformation
@Internal
KeyedStream(DataStream<T> stream, PartitionTransformation<T> partitionTransformation, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
    // call the parent constructor
    super(stream.getExecutionEnvironment(), partitionTransformation);
    // the key selector
    this.keySelector = (KeySelector)this.clean(keySelector);
    // validate and store the key type
    this.keyType = this.validateKeyType(keyType);
}
Now take a look at PartitionTransformation(dataStream.getTransformation(), new KeyGroupStreamPartitioner(keySelector, 128)):
// the key-group partitioner
@Internal
public class KeyGroupStreamPartitioner<T, K> extends StreamPartitioner<T> implements ConfigurableStreamPartitioner {
    private static final long serialVersionUID = 1L;
    private final KeySelector<T, K> keySelector;
    private int maxParallelism;

    public KeyGroupStreamPartitioner(KeySelector<T, K> keySelector, int maxParallelism) {
        Preconditions.checkArgument(maxParallelism > 0, "Number of key-groups must be > 0!");
        this.keySelector = (KeySelector)Preconditions.checkNotNull(keySelector);
        this.maxParallelism = maxParallelism;
    }

    public int getMaxParallelism() {
        return this.maxParallelism;
    }

    // this method partitions records at runtime; the upstream task does not hand a record
    // to the downstream task directly, but writes it into the buffer of the selected
    // channel, from which the downstream task pulls it
    public int selectChannel(SerializationDelegate<StreamRecord<T>> record) {
        Object key;
        try {
            key = this.keySelector.getKey(((StreamRecord)record.getInstance()).getValue());
        } catch (Exception var4) {
            throw new RuntimeException("Could not extract key from " + ((StreamRecord)record.getInstance()).getValue(), var4);
        }
        // assign the key to one of the parallel subtasks
        return KeyGroupRangeAssignment.assignKeyToParallelOperator(key, this.maxParallelism, this.numberOfChannels);
    }
}
Next, look at the KeyGroupRangeAssignment class:
public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
    Preconditions.checkNotNull(key, "Assigned key must not be null!");
    return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
}

public static int assignToKeyGroup(Object key, int maxParallelism) {
    Preconditions.checkNotNull(key, "Assigned key must not be null!");
    return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
}

// first take the key's hashCode; MathUtils.murmurHash(keyHash) guarantees the hash is
// non-negative, so a negative result is never returned; that hash modulo the maximum
// parallelism (128 by default) yields the keyGroupId
public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
    return MathUtils.murmurHash(keyHash) % maxParallelism;
}

// keyGroupId * parallelism (the job's parallelism) / maxParallelism (the default maximum
// parallelism) yields the subtask (partition) index
public static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
    return keyGroupId * parallelism / maxParallelism;
}
One more note: if you want to use a POJO as a key, you must also override its hashCode() method.
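To see why, here is a hypothetical POJO key (the class and field names are made up for illustration): with hashCode() overridden on its fields, two logically equal instances land in the same key group, whereas Object's default identity hash would generally send them to different groups. The keyGroup helper below reuses the simplified non-negative hash rather than Flink's internal MathUtils.murmurHash.

```java
import java.util.Objects;

public class PojoKeyDemo {
    // a hypothetical POJO key that overrides hashCode (and equals) based on its fields
    static final class UserId {
        final String id;
        UserId(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof UserId && ((UserId) o).id.equals(id);
        }
        @Override public int hashCode() { return Objects.hash(id); }
    }

    // same key-group arithmetic as KeyGroupRangeAssignment, with a simplified hash
    static int keyGroup(Object key, int maxParallelism) {
        return (key.hashCode() & 0x7FFFFFFF) % maxParallelism;
    }

    public static void main(String[] args) {
        // two distinct instances that are logically the same key
        int g1 = keyGroup(new UserId("alice"), 128);
        int g2 = keyGroup(new UserId("alice"), 128);
        System.out.println(g1 == g2); // true: equal POJOs map to the same key group
    }
}
```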