PS: I am a half-baked Storm practitioner. Storm is a technology our company's stream-computing platform adopted very early on, and I never studied it in depth. At the time, Flink was already taking shape, but since we had not even gotten Storm under control, bringing in Flink on top of it seemed unmanageable, so we stuck with Storm for a long while.
After a year of sticking with it, my impression of Storm was: it works, but a single machine cannot run many tasks, and after a long struggle we never got the ACK mechanism working, so it must not be easy.
My introduction to Flink was Alibaba's book 《不仅仅是流计算 Apache Flink实践》 ("More than Stream Computing: Apache Flink in Practice"), published at the end of 2018. It walks through a pile of case studies explaining why teams chose Flink and migrated over from Storm. In summary, it comes down to two points: throughput and exactly-once semantics.
1. What is Flink
First of all, Apache Flink describes itself as "Stateful Computations over Data Streams", which is a remarkably precise self-description.
PS: The figures and information in this Flink series come from the official documentation for Flink 1.8: https://ci.apache.org/projects/flink/flink-docs-master/
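To make "stateful computation over a data stream" concrete, here is a minimal sketch in plain Python (not the Flink API; the event stream and names are made up for illustration): a per-key running count whose state persists across events. This is exactly the kind of computation Flink distributes, partitions by key, and checkpoints for fault tolerance.

```python
# Minimal illustration of "stateful computation over a data stream":
# a per-key running count. The state (counts) survives across events,
# which is what distinguishes this from a stateless map/filter.
# (Plain Python sketch; Flink would keep this state fault-tolerant,
# partitioned by key, and checkpointed.)

def keyed_count(events):
    counts = {}  # the operator's state, keyed by event key
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]  # emit the updated count downstream

stream = ["a", "b", "a", "a", "b"]
print(list(keyed_count(stream)))
# [('a', 1), ('b', 1), ('a', 2), ('a', 3), ('b', 2)]
```

The point is that correct results require the operator to remember something between events; once that state must survive machine failures and be repartitioned across workers, you want a framework to manage it.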
2. The Components of Flink
As a piece of software, Flink is internally layered, with the layers stacked on top of each other through abstract interfaces, which makes for a very clean design. The layers are shown in the figure below:
- The runtime layer receives a program in the form of a JobGraph. A JobGraph is a generic parallel data flow with arbitrary tasks that consume and produce data streams.
- Both the DataStream API and the DataSet API generate JobGraphs through separate compilation processes. The DataSet API uses an optimizer to determine the optimal plan for the program, while the DataStream API uses a stream builder.
- The JobGraph is executed according to a variety of deployment options available in Flink (e.g., local, remote, YARN, etc.).
- Libraries and APIs that are bundled with Flink generate DataSet or DataStream API programs. These are Table for queries on logical tables, FlinkML for Machine Learning, and Gelly for graph processing.
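To get a feel for what "a generic parallel data flow with arbitrary tasks that consume and produce data streams" means, here is a rough model in plain Python. This is a hypothetical sketch, not Flink's actual JobGraph classes: tasks are nodes, edges are data streams, and records are pushed from a source through the graph.

```python
# Rough model of a JobGraph: a directed graph of tasks that consume
# and produce data streams. Hypothetical sketch for intuition only;
# Flink's real runtime schedules these tasks in parallel across workers.

class Task:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn          # record -> iterable of output records
        self.downstream = []  # edges: data streams to downstream tasks

    def connect(self, other):
        self.downstream.append(other)
        return other

def run(source_task, records):
    """Push each record through the graph; tasks without downstream
    edges act as sinks, and their output is collected."""
    out = []
    def push(task, record):
        for r in task.fn(record):
            if task.downstream:
                for nxt in task.downstream:
                    push(nxt, r)
            else:
                out.append(r)
    for rec in records:
        push(source_task, rec)
    return out

# Example graph: source -> map(x*2) -> filter(>4); the filter,
# having no downstream edge, acts as the sink.
src = Task("source", lambda x: [x])
mp = src.connect(Task("map", lambda x: [x * 2]))
flt = mp.connect(Task("filter", lambda x: [x] if x > 4 else []))
print(run(src, [1, 2, 3, 4]))
# [6, 8]
```

The DataStream and DataSet APIs exist so that you never build such a graph by hand: you write transformations, and the compilation process (stream builder or optimizer) produces the JobGraph that the runtime then deploys locally, remotely, or on YARN.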