Flink Streaming Series 1: Flink Streaming Overview and Event Time Explained

Latest recommended article published on 2024-04-17 00:26:14

千狼

Categories: Hadoop Ecosystem, Flink. Tags: flink, big data, eventtime, event time, streaming

Article link: https://blog.csdn.net/qq_26654727/article/details/86701159

Streaming

1.1 Overview

（1）Data Sources

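A source is attached to a program with StreamExecutionEnvironment.addSource(sourceFunction). Flink ships with a number of pre-implemented source functions for reading data. You can also write your own: implement SourceFunction for a non-parallel source, or implement the ParallelSourceFunction interface or extend the RichParallelSourceFunction class for a parallel source.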

Pre-implemented source functions

File-based:

readTextFile(path) - Reads text files, i.e. files that respect the TextInputFormat specification, line-by-line and returns them as Strings.

readFile(fileInputFormat, path) - Reads (once) files as dictated by the specified file input format.

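readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - This is the method that the two above call internally. It reads the files in the path according to the given fileInputFormat. Depending on the provided watchType, the source either monitors the path for new data periodically, every interval ms (FileProcessingMode.PROCESS_CONTINUOUSLY), or processes the data currently in the path once and exits (FileProcessingMode.PROCESS_ONCE). With pathFilter, the user can further exclude files from being processed.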

IMPLEMENTATION:

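Under the hood, Flink splits the file reading process into two sub-tasks: directory monitoring and data reading. Each sub-task is implemented by a separate entity. Monitoring is done by a single, non-parallel task, while reading is done by multiple tasks running in parallel, whose parallelism equals the job parallelism. The single monitoring task scans the directory (periodically or only once, depending on the watchType), finds the files to be processed, divides them into splits, and assigns those splits to the downstream readers. The readers are the ones that read the actual data. Each split is read by exactly one reader, while a reader can read multiple splits, one after another.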
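The division of labor described above can be sketched in plain Java. This is a toy model, not Flink code: the class name, the `assign` helper, and the round-robin policy are all assumptions made for illustration, and Flink's real split assignment may differ. It only shows the invariant that each split goes to exactly one reader while a reader may receive several splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the monitor/reader split hand-off. NOT Flink API.
class SplitAssignmentSketch {

    // Hypothetical helper: the single monitoring task distributes splits to the
    // parallel readers. Round-robin is an assumption chosen for simplicity.
    // Returns, for each reader, the splits it will read one after another.
    static List<List<String>> assign(List<String> splits, int readerParallelism) {
        List<List<String>> perReader = new ArrayList<>();
        for (int i = 0; i < readerParallelism; i++) {
            perReader.add(new ArrayList<>());
        }
        for (int i = 0; i < splits.size(); i++) {
            // Every split is handed to exactly one reader.
            perReader.get(i % readerParallelism).add(splits.get(i));
        }
        return perReader;
    }

    public static void main(String[] args) {
        // Made-up split names: "<file>#<split index>".
        List<String> splits = Arrays.asList("a.txt#0", "a.txt#1", "b.txt#0", "c.txt#0", "c.txt#1");
        List<List<String>> plan = assign(splits, 2);
        for (int reader = 0; reader < plan.size(); reader++) {
            System.out.println("reader " + reader + " -> " + plan.get(reader));
        }
    }
}
```

With two readers and five splits, one reader ends up with three splits and the other with two, and no split appears twice.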

IMPORTANT NOTES:

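If the watchType is set to FileProcessingMode.PROCESS_CONTINUOUSLY, then when a file is modified its contents are re-processed entirely. This can break "exactly-once" semantics, because appending data to the end of a file causes all of its contents to be re-processed.

If the watchType is set to FileProcessingMode.PROCESS_ONCE, the source scans the path once and exits, without waiting for the readers to finish reading the file contents. The readers do continue until all file contents have been read. Closing the source means there are no more checkpoints after that point, which can make recovery after a failure slower, because the job resumes reading from the last checkpoint.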
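To make the first note concrete, here is a small plain-Java model of why appending to a file produces duplicates under PROCESS_CONTINUOUSLY. It is not Flink code: the class, the `scan` method, and the use of a modification timestamp as the change signal are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of continuous monitoring. NOT Flink API.
class ContinuousReprocessSketch {
    private final Map<String, Long> lastSeenModTime = new HashMap<>();
    private final List<String> emitted = new ArrayList<>();

    // One monitoring pass over a single file. If the file looks modified since
    // the last pass, the WHOLE file is read again, not just the appended part.
    void scan(String path, long modTime, List<String> lines) {
        Long seen = lastSeenModTime.get(path);
        if (seen == null || modTime > seen) {
            lastSeenModTime.put(path, modTime);
            emitted.addAll(lines);
        }
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        ContinuousReprocessSketch source = new ContinuousReprocessSketch();
        source.scan("log.txt", 1L, Arrays.asList("a", "b"));
        // One line is appended, so the modification time changes.
        source.scan("log.txt", 2L, Arrays.asList("a", "b", "c"));
        // "a" and "b" now appear twice downstream.
        System.out.println(source.emitted());
    }
}
```

In this model an unchanged file is skipped on later passes, but a single appended line causes every earlier line to be emitted again, which is the duplication the note warns about.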

Socket-based:

socketTextStream - Reads from a socket. Elements can be separated by a delimiter.

Collection-based:

fromCollection(Collection) - Creates a data stream from a Java java.util.Collection. All elements in the collection must be of the same type.

fromCollection(Iterator, Class) - Creates a data stream from an iterator. The class specifies the data type of the elements returned by the iterator.

fromElements(T ...) - Creates a data stream from the given sequence of objects. All objects must be of the same type.

fromParallelCollection(SplittableIterator, Class) - Creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator.

generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel.

Custom:

addSource - Attach a new source function. For example, to read from Apache Kafka you can use addSource(new FlinkKafkaConsumer08<>(...)). See connectors for more details.

（2）Data Sinks

writeAsText() / TextOutputFormat
writeAsCsv(...) / CsvOutputFormat
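print() / printToErr() - Prints the toString() value of each element to standard output / standard error. If the parallelism is greater than 1, each output line is also prefixed with the identifier of the task that produced it.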
writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.
writeToSocket - Writes elements to a socket according to a SerializationSchema
addSink - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions.

（3） Iterations

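An iterative streaming program implements a step function and embeds it into an IterativeStream. Because a DataStream program may never finish, there is no maximum number of iterations. Instead, you specify which part of the stream is fed back into the iteration and which part is forwarded downstream, using a split transformation or a filter. The example here uses filters. First, we define an IterativeStream: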

IterativeStream<Integer> iteration = input.iterate();
DataStream<Integer> iterationBody = iteration.map(/* this is executed many times */);

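To close the iteration and define the iteration tail, call the closeWith(feedbackStream) method of the IterativeStream. The DataStream passed to closeWith is fed back to the iteration head. A common pattern is to use filters to separate the part of the stream that is fed back from the part that is propagated forward. Such filters can, for example, define the "termination" logic, under which an element is allowed to propagate downstream instead of being fed back.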
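The feedback pattern can be modeled without Flink to see how elements circulate. The sketch below is plain Java, not the IterativeStream API: the queue stands in for the iteration head, the step function (subtract 1) and the two filter conditions are illustrative assumptions, and the class and method names are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model of an iteration with a feedback filter. NOT Flink API.
class IterationFeedbackSketch {

    // Applies the step function to every element. Results that are still
    // greater than zero are fed back to the head; the rest leave the loop.
    static List<Long> run(List<Long> input) {
        Deque<Long> head = new ArrayDeque<>(input); // iteration head: input plus feedback
        List<Long> forwarded = new ArrayList<>();   // what propagates downstream
        while (!head.isEmpty()) {
            long stepped = head.poll() - 1;         // iteration body (step function)
            if (stepped > 0) {
                head.add(stepped);                  // feedback filter: goes around again
            } else {
                forwarded.add(stepped);             // "termination" filter: leaves the loop
            }
        }
        return forwarded;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(3L, 1L, 2L)));
    }
}
```

Unlike a real streaming job, this loop terminates because the input is finite; in a DataStream program the head keeps receiving new elements, which is why no maximum iteration count exists.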

(4) Controlling Latency

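By default, records are not transferred over the network one by one, which would cause unnecessary network traffic; they are collected in buffers and sent in batches. This is good for throughput, but it can cause latency problems when the incoming stream is too slow to fill a buffer quickly, because records then wait in a partially filled buffer. To control throughput and latency, call env.setBufferTimeout(timeoutMillis) on the execution environment (or on an individual operator) to set the maximum time to wait for a buffer to fill. After this time, the buffer is sent automatically even if it is not full. The default timeout is 100 ms.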

LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setBufferTimeout(timeoutMillis);

env.generateSequence(1,10).map(new MyMapper()).setBufferTimeout(timeoutMillis);

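To maximize throughput, call setBufferTimeout(-1), which removes the timeout so that buffers are flushed only when they are full. To minimize latency, set the timeout to a value close to 0 (for example 5 or 10 ms). A buffer timeout of 0 should be avoided, because it can cause severe performance degradation.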
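The trade-off can be illustrated with a small plain-Java model of a buffer that is sent either when it is full or when the timeout expires. This is a simplified sketch, not Flink's network stack: the class, its methods, the logical clock passed in as `nowMillis`, and the choice to measure the timeout from the first buffered record are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "flush when full or when the timeout expires". NOT Flink API.
class BufferTimeoutSketch {
    private final int capacity;
    private final long timeoutMillis; // -1 disables the timeout, as with setBufferTimeout(-1)
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> sent = new ArrayList<>();
    private long firstRecordAt;

    BufferTimeoutSketch(int capacity, long timeoutMillis) {
        this.capacity = capacity;
        this.timeoutMillis = timeoutMillis;
    }

    // Buffers a record at logical time nowMillis; a full buffer is sent at once.
    void add(String record, long nowMillis) {
        if (buffer.isEmpty()) {
            firstRecordAt = nowMillis;
        }
        buffer.add(record);
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    // Periodic check: sends a partially filled buffer once the timeout has passed.
    void tick(long nowMillis) {
        if (timeoutMillis >= 0 && !buffer.isEmpty() && nowMillis - firstRecordAt >= timeoutMillis) {
            flush();
        }
    }

    private void flush() {
        sent.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    List<List<String>> sent() {
        return sent;
    }

    public static void main(String[] args) {
        BufferTimeoutSketch lowLatency = new BufferTimeoutSketch(3, 100);
        lowLatency.add("a", 0);
        lowLatency.tick(100); // timeout reached: "a" is sent although the buffer is not full
        System.out.println(lowLatency.sent());
    }
}
```

With a timeout of -1 in this model, a slow stream can leave a record waiting indefinitely until enough other records arrive to fill the buffer, which is the latency cost of maximizing throughput.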

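(5) Debugging

Before running a streaming program on a distributed cluster, it is a good idea to make sure the implemented algorithm works as expected. Implementing a data analysis program is therefore usually an incremental process of checking results, debugging, and improving. Flink simplifies this process considerably by supporting local debugging from within an IDE, injection of test data, and collection of result data. This section gives some hints on how to ease the development of Flink programs.

A. Creating a Local Execution Environment

A LocalStreamEnvironment starts a Flink system within the same JVM process in which it was created. If you start the LocalEnvironment from an IDE, you can set breakpoints in your code and debug your program easily.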

final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

DataStream<String> lines = env.addSource(/* some source */);
// build your program

env.execute();