The Most Complete [Flink] (05) Apache Flink Talk Series (1)

2401_84186026

Published 2024-05-04 15:39:27


Article link: https://blog.csdn.net/2401_84186026/article/details/138443209


    /** Lazy execution: nothing runs until execute() is called */
    env.execute("Socket Window WordCount");
}

/** Data structure holding a word and its count */
public static class WordWithCount {
    public String word;
    public long count;

    public WordWithCount(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return word + " : " + count;
    }
}

}


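For the implementation above, the following points are analyzed next:


1. How to create a data source that receives data from a specified host and port;
2. How to apply a series of operations to that data source to implement the required functionality;
3. How to print the results of the analysis.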


### 3. Building the Data Source


The data source is built through an instance of a concrete implementation of `StreamExecutionEnvironment`.

    DataStream<String> text = env.socketTextStream(hostname, port);


In `StreamExecutionEnvironment`, this call builds a data source that receives network data from the specified host and port.

    public DataStreamSource<String> socketTextStream(String hostname, int port) {
        return socketTextStream(hostname, port, "\n");
    }

    public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter) {
        return socketTextStream(hostname, port, delimiter, 0);
    }

    public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter, long maxRetry) {
        return addSource(new SocketTextStreamFunction(hostname, port, delimiter, maxRetry),
                "Socket Stream");
    }


As the overloads show, a `SocketTextStreamFunction` instance is constructed from the given hostname and port, the default line delimiter "\n", and a maximum retry count of 0, and the source node is given the default name "Socket Stream".


The class hierarchy of `SocketTextStreamFunction` is shown below. It is an implementation of `SourceFunction`, which is the base interface for data sources in Flink.


![SocketTextStreamFunction class hierarchy](https://img-blog.csdnimg.cn/20200712124157864.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0JlaWlzQmVp,size_1,color_FFFFFF,t_70#pic_center)  
 `SourceFunction` declares the following methods:

    @Public
    public interface SourceFunction<T> extends Function, Serializable {
        void run(SourceContext<T> ctx) throws Exception;
        void cancel();

        @Public
        interface SourceContext<T> {
            void collect(T element);
            @PublicEvolving
            void collectWithTimestamp(T element, long timestamp);
            @PublicEvolving
            void emitWatermark(Watermark mark);
            @PublicEvolving
            void markAsTemporarilyIdle();
            Object getCheckpointLock();
            void close();
        }
    }


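`run(SourceContext)`: this is where the data acquisition logic lives. The `ctx` argument is used to forward data to downstream nodes.


`cancel()`: stops the source from producing data. A `run` method usually contains a loop that keeps producing data, and `cancel` is what makes that loop terminate.


The inner interface `SourceContext` is the interface used to emit data. With the role of `SourceFunction` understood, the next step is the concrete implementation in `SocketTextStreamFunction`, mainly its `run` method.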

    public void run(SourceContext<String> ctx) throws Exception {
        final StringBuilder buffer = new StringBuilder();
        long attempt = 0;
        /** First (outer) loop: keeps iterating for as long as the source is running */
        while (isRunning) {
            try (Socket socket = new Socket()) {
                /** Connect a socket to the given hostname and port, then wrap its input in a BufferedReader */
                currentSocket = socket;
                LOG.info("Connecting to server socket " + hostname + ':' + port);
                socket.connect(new InetSocketAddress(hostname, port), CONNECTION_TIMEOUT_TIME);
                BufferedReader reader = new BufferedReader(new InputStreamReader(socket.getInputStream()));
                char[] cbuf = new char[8192];
                int bytesRead;
                /** Second loop: re-checks the running flag and stops when read() reports end of stream (-1) */
                while (isRunning && (bytesRead = reader.read(cbuf)) != -1) {
                    buffer.append(cbuf, 0, bytesRead);
                    int delimPos;
                    /** Third loop: split the buffered data at the delimiter and forward each record downstream as one string */
                    while (buffer.length() >= delimiter.length() && (delimPos = buffer.indexOf(delimiter)) != -1) {
                        String record = buffer.substring(0, delimPos);
                        if (delimiter.equals("\n") && record.endsWith("\r")) {
                            record = record.substring(0, record.length() - 1);
                        }
                        /** Forward the record through the ctx argument */
                        ctx.collect(record);
                        buffer.delete(0, delimPos + delimiter.length());
                    }
                }
            }
            /** If the inner loop ended because the stream reached its end, use the running flag and the configured maximum number of retries to decide whether to sleep and retry or to leave the loop */
            if (isRunning) {
                attempt++;
                if (maxNumRetries == -1 || attempt < maxNumRetries) {
                    LOG.warn("Lost connection to server socket. Retrying in " + delayBetweenRetries + " msecs...");
                    Thread.sleep(delayBetweenRetries);
                }
                else {
                    break;
                }
            }
        }
        /** After the outer loop has exited, forward whatever is still left in the buffer */
        if (buffer.length() > 0) {
            ctx.collect(buffer.toString());
        }
    }


The logic of `run` is straightforward: it continuously reads data from the specified hostname and port, splits it into strings at the line delimiter, and forwards each string downstream.
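
The innermost loop is easier to follow in isolation. The following is a minimal sketch in plain Java, with no Flink dependency, of the same buffer-and-split technique. The class `DelimiterSplitSketch` and its methods `feed`, `finish` and `emitted` are hypothetical names introduced only for this illustration; a list stands in for `ctx.collect`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that mirrors the buffering logic of run(); not a Flink class. */
final class DelimiterSplitSketch {
    private final StringBuilder buffer = new StringBuilder();
    private final String delimiter;
    private final List<String> emitted = new ArrayList<>(); // stands in for ctx.collect

    DelimiterSplitSketch(String delimiter) {
        this.delimiter = delimiter;
    }

    /** Appends one chunk read from the socket and emits every complete record in the buffer. */
    void feed(String chunk) {
        buffer.append(chunk);
        int delimPos;
        while (buffer.length() >= delimiter.length()
                && (delimPos = buffer.indexOf(delimiter)) != -1) {
            String record = buffer.substring(0, delimPos);
            // With "\n" as delimiter, a trailing "\r" is dropped so "\r\n" line endings also work.
            if (delimiter.equals("\n") && record.endsWith("\r")) {
                record = record.substring(0, record.length() - 1);
            }
            emitted.add(record);
            buffer.delete(0, delimPos + delimiter.length());
        }
    }

    /** Emits whatever is left in the buffer, as run() does after its outer loop exits. */
    void finish() {
        if (buffer.length() > 0) {
            emitted.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    List<String> emitted() {
        return emitted;
    }
}
```

Feeding the chunks `"hello wo"` and `"rld\r\nfoo\nba"` and then calling `finish()` yields the records `hello world`, `foo` and `ba`: a record may span several reads, and the incomplete tail is only emitted at the very end.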


`cancel` is implemented as follows: it sets the running flag `isRunning` to false and closes the current socket if there is one.

    public void cancel() {
        isRunning = false;
        Socket theSocket = this.currentSocket;
        /** Close the current socket if it is not null */
        if (theSocket != null) {
            IOUtils.closeSocket(theSocket);
        }
    }
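
How `run` and `cancel` cooperate across threads can be sketched without Flink. The class below is a hypothetical stand-in, not Flink code: `run` loops on a flag, and `cancel`, called from another thread, flips it. The excerpt above does not show how `isRunning` is declared; the sketch assumes it must be `volatile` so that the write made by the cancelling thread becomes visible to the running thread.

```java
/** Hypothetical source that illustrates the run/cancel contract; not a Flink class. */
final class CancellableSourceSketch {
    /** volatile: the flag is written by the cancelling thread and read by the running thread. */
    private volatile boolean isRunning = true;
    /** Written only by the running thread; read by others after join(). */
    private long emitted;

    /** Produces "records" until cancel() is called. */
    void run() {
        while (isRunning) {
            emitted++; // stands in for ctx.collect(record)
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Makes the loop in run() terminate. */
    void cancel() {
        isRunning = false;
    }

    /** Runs the source on a worker thread, cancels it, and returns the record count (-1 if it did not stop). */
    static long demo() {
        CancellableSourceSketch source = new CancellableSourceSketch();
        Thread worker = new Thread(source::run);
        worker.start();
        try {
            Thread.sleep(50);
            source.cancel();
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return worker.isAlive() ? -1 : source.emitted;
    }
}
```

The real `cancel` additionally closes the socket. That matters because a thread blocked in `reader.read` would otherwise not notice the flag until more data arrives.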


With the implementation of `SocketTextStreamFunction` clear, return to `StreamExecutionEnvironment` and look at the `addSource` method.

    public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName) {
        return addSource(function, sourceName, null);
    }

    public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {
        /** If no output type information was passed in, try to extract it */
        if (typeInfo == null) {
            if (function instanceof ResultTypeQueryable) {
                /** If the function implements ResultTypeQueryable, ask it directly */
                typeInfo = ((ResultTypeQueryable<OUT>) function).getProducedType();
            } else {
                try {
                    /** Extract the type information via reflection */
                    typeInfo = TypeExtractor.createTypeInfo(
                            SourceFunction.class,
                            function.getClass(), 0, null, null);
                } catch (final InvalidTypesException e) {
                    /** If extraction fails, fall back to a MissingTypeInfo instance */
                    typeInfo = (TypeInformation<OUT>) new MissingTypeInfo(sourceName, e);
                }
            }
        }
        /** The source is parallel if the function is an instance of ParallelSourceFunction */
        boolean isParallel = function instanceof ParallelSourceFunction;
        /** Closure cleaning: reduces what gets serialized and guards against serialization errors */
        clean(function);
        StreamSource<OUT, ?> sourceOperator;
        /** Build a different StreamOperator depending on whether the function is a StoppableFunction */
        if (function instanceof StoppableFunction) {
            sourceOperator = new StoppableStreamSource<>(cast2StoppableSourceFunction(function));
        } else {
            sourceOperator = new StreamSource<>(function);
        }
        /** Return a newly built DataStreamSource instance */
        return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
    }


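Calling the `addSource` overloads in turn finally yields a `DataStreamSource` instance.


`TypeInformation` is the core class of Flink's type system. Every type used as the input or output of a function has to be described by a `TypeInformation`, which can be seen as a utility for a data type: the serializer, comparator and similar helpers for that type are obtained through it.


Because `SocketTextStreamFunction` neither extends `ParallelSourceFunction` nor implements the `StoppableFunction` interface, `isParallel` is false and `sourceOperator` refers to a `StreamSource` instance.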


The class hierarchy of `StreamSource` is shown below:


![StreamSource class hierarchy](https://img-blog.csdnimg.cn/20200712125240438.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0JlaWlzQmVp,size_1,color_FFFFFF,t_70#pic_center)


The diagram shows that `StreamSource` is a concrete implementation of the `StreamOperator` interface. Its constructor takes an instance of a `SourceFunction` implementation, which here is the `SocketTextStreamFunction` instance introduced earlier. Construction proceeds as follows:

    public StreamSource(SRC sourceFunction) {
        super(sourceFunction);
        this.chainingStrategy = ChainingStrategy.HEAD;
    }

    public AbstractUdfStreamOperator(F userFunction) {
        this.userFunction = requireNonNull(userFunction);
        checkUdfCheckpointingPreconditions();
    }

    private void checkUdfCheckpointingPreconditions() {
        if (userFunction instanceof CheckpointedFunction && userFunction instanceof ListCheckpointed) {
            throw new IllegalStateException("User functions are not allowed to implement CheckpointedFunction AND ListCheckpointed.");
        }
    }


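The constructor stores the given `userFunction` in a field, validates it, and then sets the chaining strategy to HEAD.


To improve execution efficiency, Flink merges (chains) adjacent nodes of a processing pipeline where it can. There are three chaining strategies:


* ALWAYS: chain with the preceding and following nodes whenever possible;
* NEVER: do not chain with the preceding or following nodes;
* HEAD: chain only with following nodes, never with a preceding node.


A data source is the topmost node and has nothing upstream of it, so only HEAD or NEVER makes sense. `StreamSource` uses HEAD.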
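
The three strategies can be read as a rule about pairs of adjacent nodes. The sketch below is a deliberately simplified model, not Flink's job graph code: the class `ChainingSketch` and the method `canChain` are hypothetical, and the only inputs considered are the two strategies. A real decision would also have to look at other conditions, such as parallelism and how data is partitioned between the two nodes, which are left out here.

```java
/** Simplified, hypothetical model of the three chaining strategies; not a Flink class. */
final class ChainingSketch {
    enum ChainingStrategy { ALWAYS, NEVER, HEAD }

    /**
     * Returns true if the downstream node may be merged into the upstream node's chain,
     * judging by the two strategies alone.
     */
    static boolean canChain(ChainingStrategy upstream, ChainingStrategy downstream) {
        // HEAD and NEVER both refuse to be attached to a predecessor.
        boolean downstreamAccepts = downstream == ChainingStrategy.ALWAYS;
        // NEVER refuses to have a successor attached; HEAD and ALWAYS allow it.
        boolean upstreamAccepts = upstream == ChainingStrategy.HEAD
                || upstream == ChainingStrategy.ALWAYS;
        return upstreamAccepts && downstreamAccepts;
    }
}
```

Under this model a HEAD source followed by an ALWAYS operator can be chained (`canChain(HEAD, ALWAYS)` is true), while nothing can be chained in front of a HEAD node (`canChain(ALWAYS, HEAD)` is false).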


`StreamOperator` is the base interface for stream operators in Flink. Its abstract subclass `AbstractStreamOperator` implements a number of common methods, and user-defined processing logic is wrapped in concrete `StreamOperator` implementations.


Once `sourceOperator` has been assigned, the `DataStreamSource` instance is built and returned as the result of the source construction call.

    return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);


The class hierarchy of `DataStreamSource` is shown below. It is a `DataStream` with a predefined output type.


![DataStreamSource class hierarchy](https://img-blog.csdnimg.cn/20200712131227744.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0JlaWlzQmVp,size_1,color_FFFFFF,t_70#pic_center)  
 In Flink, a `DataStream` describes a stream of elements of the same type and offers the APIs for operating on it, such as map and reduce. `DataStreamSource` is constructed as follows:

    public DataStreamSource(StreamExecutionEnvironment environment,
            TypeInformation<T> outTypeInfo, StreamSource<T, ?> operator,
            boolean isParallel, String sourceName) {
        super(environment, new SourceTransformation<>(sourceName, operator, outTypeInfo, environment.getParallelism()));
        this.isParallel = isParallel;
        if (!isParallel) {
            setParallelism(1);
        }
    }

    protected SingleOutputStreamOperator(StreamExecutionEnvironment environment, StreamTransformation<T> transformation) {
        super(environment, transformation);
    }

    public DataStream(StreamExecutionEnvironment environment, StreamTransformation<T> transformation) {
        this.environment = Preconditions.checkNotNull(environment, "Execution Environment must not be null.");
        this.transformation = Preconditions.checkNotNull(transformation, "Stream Transformation must not be null.");
    }


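Construction therefore amounts to initializing the two `DataStream` fields `environment` and `transformation`.


`transformation` is assigned a `SourceTransformation` instance. `SourceTransformation` is a subclass of `StreamTransformation`, and a `StreamTransformation` describes the operation that creates a `DataStream`. Every `DataStream` is backed by a concrete `StreamTransformation`, which is why the `transformation` field is set when the `DataStream` is constructed. Many `DataStream` methods simply delegate to the corresponding `StreamTransformation` methods, for example those for parallelism, id, output type information, and resource specification.


With that, the data source that produces data from the specified hostname and port is complete. The result is a `DataStreamSource` instance, which describes the source of a stream whose output type is String.


Four types appeared while building the data source: **Function (SourceFunction), StreamOperator, StreamTransformation, and DataStream**:


* **Function** interface: users implement their own processing logic by implementing one of its sub-interfaces. Above, `SourceFunction` is implemented to receive data from a given hostname and port and forward it as strings;
* **StreamOperator** interface: the base interface for stream operators. Its concrete implementations hold the user-defined function in a field and are responsible for calling `userFunction` with the arguments it needs. `StreamSource`, for example, builds a concrete `SourceContext` and passes it to the `SourceFunction`'s `run` method, where it is used to forward data;
* **StreamTransformation**: describes the operation that builds a `DataStream`, together with information such as its parallelism and output type, and holds a concrete `StreamOperator` instance in a field;
* **DataStream**: describes a stream of elements of the same type. It is backed by a concrete `StreamTransformation` and provides the APIs for transforming the data in the stream.


Through these relationships, the user-defined function, the parallelism, the output type and so on all end up contained in the `DataStream`, which is therefore able to describe a concrete data stream completely.


The containment relationship between the four is: `Function –> StreamOperator –> StreamTransformation –> DataStream`.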
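
The containment chain can be pictured as a set of nested wrappers. The sketch below is a hypothetical miniature in plain Java: `Operator`, `Transformation` and `Stream` are stand-ins invented for this illustration (they are not the Flink classes and omit almost everything those classes do), and `java.util.function.Function` stands in for Flink's `Function`.

```java
import java.util.function.Function;

/** Hypothetical miniature of the Function -> operator -> transformation -> stream nesting. */
final class ContainmentSketch {
    /** Stand-in for StreamOperator: holds the user function and is the one that calls it. */
    static final class Operator<I, O> {
        final Function<I, O> userFunction;

        Operator(Function<I, O> userFunction) {
            this.userFunction = userFunction;
        }
    }

    /** Stand-in for StreamTransformation: name and parallelism, plus the operator it holds. */
    static final class Transformation<I, O> {
        final String name;
        final int parallelism;
        final Operator<I, O> operator;

        Transformation(String name, int parallelism, Operator<I, O> operator) {
            this.name = name;
            this.parallelism = parallelism;
            this.operator = operator;
        }
    }

    /** Stand-in for DataStream: the API surface, backed by a transformation. */
    static final class Stream<I, O> {
        final Transformation<I, O> transformation;

        Stream(Transformation<I, O> transformation) {
            this.transformation = transformation;
        }

        /** Delegates to the transformation, as many DataStream methods do. */
        int getParallelism() {
            return transformation.parallelism;
        }
    }
}
```

Starting from a stream, the user function is reachable as `stream.transformation.operator.userFunction`, which is the containment order stated above read from the outside in.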


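Having used the source construction to sort out how these types in a Flink data stream relate to each other, the next step is to see how operations are applied to the source to produce the final statistics.


### 4. Operating on the Data Stream


The concrete transformations are:

    DataStream<WordWithCount> windowCounts = text
            .flatMap(new FlatMapFunction<String, WordWithCount>() {
                @Override
                public void flatMap(String value, Collector<WordWithCount> out) {
                    for (String word : value.split("\\s")) {
                        out.collect(new WordWithCount(word, 1L));
                    }
                }
            })
            .keyBy("word")
            .timeWindow(Time.seconds(5), Time.seconds(1))
            .reduce(new ReduceFunction<WordWithCount>() {
                @Override
                public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                    return new WordWithCount(a.word, a.count + b.count);
                }
            });


**This logic applies four operations to the stream: flatMap, keyBy, timeWindow, and reduce. The following sections describe what each transformation does.**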


#### 4.1 The flatMap Transformation


`flatMap` takes a concrete `FlatMapFunction` implementation. Here it splits each received string into words on whitespace, builds a `WordWithCount` instance for each word, and forwards it downstream for the later counting. The function is passed to `DataStream.flatMap`, which transforms the stream as follows:

    public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
        TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
                getType(), Utils.getCallLocationName(), true);
        /** Build a StreamFlatMap, a concrete StreamOperator, from the given flatMapper function */
        return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));
    }

    public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
        /** Read the output type of the input transformation; if it is a MissingTypeInfo, this throws right away and aborts */
        transformation.getOutputType();
        OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
                this.transformation,
                operatorName,
                operator,
                outTypeInfo,
                environment.getParallelism());
        @SuppressWarnings({ "unchecked", "rawtypes" })
        SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
        getExecutionEnvironment().addOperator(resultTransform);
        return returnStream;
    }


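The whole construction process resembles the one for the data source.



> 
> a. First, a `StreamFlatMap` instance, a concrete `StreamOperator`, is built from the given `flatMapper` function;  
>  b. From the `StreamFlatMap` instance built in (a), a `OneInputTransformation` instance is built, which is a subclass of `StreamTransformation`;  
>  c. Finally, a `SingleOutputStreamOperator` instance is built, which is a subclass of `DataStream`.
> 
> 
> 


Besides building and returning the `SingleOutputStreamOperator` instance, `transform` also runs this line:

    getExecutionEnvironment().addOperator(resultTransform);

    public void addOperator(StreamTransformation<?> transformation) {
        Preconditions.checkNotNull(transformation, "transformation must not be null.");
        this.transformations.add(transformation);
    }


This adds the `OneInputTransformation` instance built above to `transformations`, a field of `StreamExecutionEnvironment` whose type is a `List` of `StreamTransformation`.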


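#### 4.2 The keyBy Transformation


The argument of this keyBy transformation is the string "word", which means the stream is partitioned by the `word` field of `WordWithCount`.

    public KeyedStream<T, Tuple> keyBy(String... fields) {
        return keyBy(new Keys.ExpressionKeys<>(fields, getType()));
    }


First, an `ExpressionKeys` instance describing the key is built from the array of field names and the stream's output type information. `ExpressionKeys` has two fields:

    /** List of key fields; a FlatFieldDescriptor describes each key's position in the containing type and the key's own type information */
    private List<FlatFieldDescriptor> keyFields;
    /** Type information of the key types, in the same order as the fields passed to the constructor */
    private TypeInformation<?>[] originalKeyTypes;


Once the key description is available, the `keyBy` overload is called:

    private KeyedStream<T, Tuple> keyBy(Keys<T> keys) {
        return new KeyedStream<>(this, clean(KeySelectorUtil.getSelectorForKeys(keys,
                getType(), getExecutionConfig())));
    }


This first builds a `ComparableKeySelector` instance, a `KeySelector` implementation whose job is to extract, from a concrete input record, the values of the key fields (there may be several) as a `Tuple`.


In this example, that means extracting the value of the `word` field from each `WordWithCount` instance.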
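
What "extract the key field by name" means can be shown with a small stand-alone sketch. It uses plain Java reflection, which is an assumption made purely for illustration: Flink resolves the field through its own type system, and the class `FieldKeySelectorSketch` with its `getKey` method is hypothetical. The nested `WordWithCount` mirrors the class from the article's example.

```java
import java.lang.reflect.Field;

/** Hypothetical by-name key extraction via reflection; not how Flink's ComparableKeySelector is written. */
final class FieldKeySelectorSketch {
    /** Same shape as the WordWithCount class in the article's example. */
    static final class WordWithCount {
        public String word;
        public long count;

        WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }
    }

    /** Returns the value of the public field called fieldName on the given record. */
    static Object getKey(Object record, String fieldName) {
        try {
            Field field = record.getClass().getField(fieldName);
            return field.get(record);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "No accessible public field '" + fieldName + "' on " + record.getClass(), e);
        }
    }
}
```

Calling `getKey(new WordWithCount("flink", 1L), "word")` returns the string `flink`. Records with equal keys are what `keyBy` later routes to the same parallel instance.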


Then a `KeyedStream` instance is built. `KeyedStream` is also a subclass of `DataStream`. It is constructed as follows:

    public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector) {
        this(dataStream, keySelector, TypeExtractor.getKeySelectorTypes(keySelector, dataStream.getType()));
    }

    public KeyedStream(DataStream<T> dataStream, KeySelector<T, KEY> keySelector, TypeInformation<KEY> keyType) {
        super(
            dataStream.getExecutionEnvironment(),
            new PartitionTransformation<>(
                dataStream.getTransformation(),
                new KeyGroupStreamPartitioner<>(keySelector, StreamGraphGenerator.DEFAULT_LOWER_BOUND_MAX_PARALLELISM)));
        this.keySelector = keySelector;
        this.keyType = validateKeyType(keyType);
    }


Before the superclass constructor runs, a `KeyGroupStreamPartitioner` instance is built from the `keySelector`, and a `PartitionTransformation` instance is then built from that partitioner.


This differs slightly from the flatMap transformation:



> 
> a. In flatMap, the given `flatMapper` function is used to build an instance of a `StreamOperator` implementation, whereas in keyBy the `keySelector` is used to build an instance of a `ChannelSelector` implementation;  
>  b. The `StreamTransformation` instance built in keyBy is not added to the `transformations` list of `StreamExecutionEnvironment`.
> 
> 
> 


`ChannelSelector` declares a single method. Given a concrete record from the stream and the parallelism of the next node, it decides which channel or channels the record is forwarded to.

    public interface ChannelSelector<T extends IOReadableWritable> {
        int[] selectChannels(T record, int numChannels);
    }

`KeyGroupStreamPartitioner` implements this method as follows:

    public int[] selectChannels(
            SerializationDelegate<StreamRecord<T>> record,
            int numberOfOutputChannels) {
        K key;
        try {
            /** Use the keySelector to extract the key from the given record */
            key = keySelector.getKey(record.getInstance().getValue());
        } catch (Exception e) {
            throw new RuntimeException("Could not extract key from " + record.getInstance().getValue(), e);
        }
        /** From the extracted key, the maximum parallelism and the number of output channels, determine the channel the record goes to */
        returnArray[0] = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, numberOfOutputChannels);
        return returnArray;
    }


Next, look at how `assignKeyToParallelOperator` in `KeyGroupRangeAssignment` is implemented.

    public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
        return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
    }

    public static int assignToKeyGroup(Object key, int maxParallelism) {
        return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
    }

    public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
        return MathUtils.murmurHash(keyHash) % maxParallelism;
    }

    public static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
        return keyGroupId * parallelism / maxParallelism;
    }



> 
> a. First, the key's hashCode is run through murmurHash and reduced modulo maxParallelism, which gives a key group id, an integer in [0, maxParallelism);  
>  b. Then the formula keyGroupId \* parallelism / maxParallelism maps that id to an integer in [0, parallelism), which is the partition the record is sent to.
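
The two steps can be tried out with concrete numbers. The sketch below is plain Java and not the Flink class: `KeyGroupSketch` is a hypothetical name, and, as a loudly stated assumption, step (a) uses `Math.floorMod(key.hashCode(), maxParallelism)` as a stand-in for `MathUtils.murmurHash(...) % maxParallelism`, so the key group ids it produces will not match Flink's. Step (b) is the formula from the code above, unchanged.

```java
/** Hypothetical re-creation of the two-step key assignment; step (a) uses a stand-in hash. */
final class KeyGroupSketch {
    /** Step (a): map a key to a key group id in [0, maxParallelism). Stand-in for the murmurHash-based version. */
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Step (b): map a key group id to an operator index in [0, parallelism), using integer division. */
    static int computeOperatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
        return keyGroupId * parallelism / maxParallelism;
    }

    /** Both steps combined, in the same order as in the code above. */
    static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
        return computeOperatorIndexForKeyGroup(
                maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
    }
}
```

With maxParallelism = 128 and parallelism = 4, step (b) sends key groups 0 to 31 to operator 0, 32 to 63 to operator 1, 64 to 95 to operator 2, and 96 to 127 to operator 3. For example, key group 31 gives 31 \* 4 / 128 = 0 and key group 32 gives 32 \* 4 / 128 = 1. Each parallel instance therefore owns a contiguous range of key groups.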

