Introduction to Flink
Traditional Data Infrastructures
- Transactional Processing
Often, a database system serves multiple applications that sometimes access the same databases or tables. This application design can cause problems when applications need to evolve or scale. Since multiple applications might work on the same data representation or share the same infrastructure, changing the schema of a table or scaling a database system requires careful planning and a lot of effort. A recent approach to overcoming the tight bundling of applications is the microservices design pattern.
Microservices are designed as small, self-contained, and independent applications. They follow the UNIX philosophy of doing a single thing and doing it well. More complex applications are built by connecting several microservices with each other that only communicate over standardized interfaces such as RESTful HTTP connections. Because microservices are strictly decoupled from each other and only communicate over well-defined interfaces, each microservice can be implemented with a different technology stack including a programming language, libraries, and datastores. Microservices and all the required software and services are typically bundled and deployed in independent containers.
- Analytical Processing
Stateful Stream Processing
Apache Flink stores the application state locally in memory or in an embedded database. Since Flink is a distributed system, the local state needs to be protected against failures to avoid data loss in case of application or machine failure. Flink guarantees this by periodically writing a consistent checkpoint of the application state to a remote and durable storage.
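The idea of recovering local state from a periodic checkpoint can be illustrated with a toy model. This is only a sketch in plain Scala, not Flink code: `CheckpointSketch`, `Checkpoint`, and `sumWithRecovery` are hypothetical names, the "durable storage" is just a variable, and the failure is simulated once at a chosen input offset.

```scala
object CheckpointSketch {
  // A checkpoint pairs the input position with the state computed up to that position.
  final case class Checkpoint(offset: Int, sum: Long)

  /** Sums `input`, checkpointing every `checkpointEvery` records (must be > 0).
    * If `failAt` is set, local state is lost once at that offset and restored
    * from the last checkpoint, after which the input is replayed from there. */
  def sumWithRecovery(input: Vector[Long], checkpointEvery: Int, failAt: Option[Int]): Long = {
    var durable = Checkpoint(0, 0L) // stands in for remote, durable storage
    var offset = 0                  // local read position
    var sum = 0L                    // local state, lost on failure
    var pendingFailure = failAt
    while (offset < input.length) {
      if (pendingFailure.contains(offset)) {
        // Simulated failure: discard local state and roll back to the checkpoint.
        pendingFailure = None
        offset = durable.offset
        sum = durable.sum
      } else {
        sum += input(offset)
        offset += 1
        if (offset % checkpointEvery == 0) durable = Checkpoint(offset, sum)
      }
    }
    sum
  }
}
```

Because the state and the input position are saved together, replaying from the checkpoint counts every record exactly once in the final state, with or without the simulated failure.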
The lambda architecture augments the traditional periodic batch processing architecture with a speed layer that is powered by a low-latency stream processor. Data arriving at the lambda architecture is ingested by the stream processor and also written to batch storage. The stream processor computes approximated results in near real time and writes them into a speed table. The batch processor periodically processes the data in batch storage, writes the exact results into a batch table, and drops the corresponding inaccurate results from the speed table. Applications consume the results by merging approximated results from the speed table and the accurate results from the batch table.
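The final merge step of the lambda architecture might look roughly like the following. It is a minimal sketch that assumes both tables map a result key (for example a window identifier) to a count; `LambdaMerge` and `merge` are illustrative names, not part of any framework.

```scala
object LambdaMerge {
  /** Merges the two result tables. For keys present in both, the exact batch
    * result replaces the approximate speed-layer result. */
  def merge(batchTable: Map[String, Long], speedTable: Map[String, Long]): Map[String, Long] =
    speedTable ++ batchTable // right-hand entries win on key collisions
}
```

For keys the batch processor has not covered yet, the application sees only the approximate value from the speed table.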
Key Features of Flink
- Event-driven
- A stream-based view of the world
- Layered APIs
A Quick Look at Flink
Apache Flink is a third-generation distributed stream processor with a competitive feature set. It provides accurate stream processing with high throughput and low latency at scale. In particular, the following features make Flink stand out:
- Event-time and processing-time semantics. Event-time semantics provide consistent and accurate results despite out-of-order events. Processing-time semantics can be used for applications with very low latency requirements.
- Exactly-once state consistency guarantees.
- Millisecond latencies while processing millions of events per second. Flink applications can be scaled to run on thousands of cores.
- Layered APIs with varying tradeoffs for expressiveness and ease of use. This book covers the DataStream API and process functions, which provide primitives for common stream processing operations, such as windowing and asynchronous operations, and interfaces to precisely control state and time. Flink’s relational APIs, SQL and the LINQ-style Table API, are not discussed in this book.
- Connectors to the most commonly used storage systems such as Apache Kafka, Apache Cassandra, Elasticsearch, JDBC, Kinesis, and (distributed) filesystems such as HDFS and S3.
- Ability to run streaming applications 24/7 with very little downtime due to its highly available setup (no single point of failure), tight integration with Kubernetes, YARN, and Apache Mesos, fast recovery from failures, and the ability to dynamically scale jobs.
- Ability to update the application code of jobs and migrate jobs to different Flink clusters without losing the state of the application.
- Detailed and customizable collection of system and application metrics to identify and react to problems ahead of time.
Last but not least, Flink is also a full-fledged batch processor. In addition to these features, Flink is a very developer-friendly framework due to its easy-to-use APIs. The embedded execution mode starts an application and the whole Flink system in a single JVM process, which can be used to run and debug Flink jobs within an IDE. This feature comes in handy when developing and testing Flink applications.
Deploying Flink
- Standalone mode
// Start the cluster
./start-cluster.sh
// Submit the job (start nc -lk 7777 beforehand)
./flink run -c com.ccnuacmhdu.wordcount.StreamWordCountV2 -p 2 ./FlinkPrimer-1.0-SNAPSHOT-jar-with-dependencies.jar --host localhost --port 7777
// Web UI (check the output on the Task Managers page)
http://localhost:8081
- YARN mode
// Submit a job in session-cluster mode (omitted)
// Submit a job in per-job-cluster mode
./flink run -m yarn-cluster -c com.ccnuacmhdu.wordcount.StreamWordCountV2 FlinkPrimer-1.0-SNAPSHOT-jar-with-dependencies.jar --host localhost --port 7777
- Kubernetes deployment (omitted)
The Architecture of Apache Flink
Flink Runtime Architecture
- The JobManager is the master process that controls the execution of a single application; each application is controlled by a different JobManager. The JobManager receives an application for execution. The application consists of a so-called JobGraph, a logical dataflow graph (see “Introduction to Dataflow Programming”), and a JAR file that bundles all the required classes, libraries, and other resources. The JobManager converts the JobGraph into a physical dataflow graph called the ExecutionGraph, which consists of tasks that can be executed in parallel. The JobManager requests the necessary resources (TaskManager slots) to execute the tasks from the ResourceManager. Once it receives enough TaskManager slots, it distributes the tasks of the ExecutionGraph to the TaskManagers that execute them. During execution, the JobManager is responsible for all actions that require a central coordination such as the coordination of checkpoints (see “Checkpoints, Savepoints, and State Recovery”).
- Flink features multiple ResourceManagers for different environments and resource providers such as YARN, Mesos, Kubernetes, and standalone deployments. The ResourceManager is responsible for managing TaskManager slots, Flink’s unit of processing resources. When a JobManager requests TaskManager slots, the ResourceManager instructs a TaskManager with idle slots to offer them to the JobManager. If the ResourceManager does not have enough slots to fulfill the JobManager’s request, the ResourceManager can talk to a resource provider to provision containers in which TaskManager processes are started. The ResourceManager also takes care of terminating idle TaskManagers to free compute resources.
- TaskManagers are the worker processes of Flink. Typically, there are multiple TaskManagers running in a Flink setup. Each TaskManager provides a certain number of slots. The number of slots limits the number of tasks a TaskManager can execute. After it has been started, a TaskManager registers its slots to the ResourceManager. When instructed by the ResourceManager, the TaskManager offers one or more of its slots to a JobManager. The JobManager can then assign tasks to the slots to execute them. During execution, a TaskManager exchanges data with other TaskManagers that run tasks of the same application. The execution of tasks and the concept of slots is discussed in “Task Execution”.
- The Dispatcher runs across job executions and provides a REST interface to submit applications for execution. Once an application is submitted for execution, it starts a JobManager and hands the application over. The REST interface enables the dispatcher to serve as an HTTP entry point to clusters that are behind a firewall. The dispatcher also runs a web dashboard to provide information about job executions. Depending on how an application is submitted for execution (discussed in “Application Deployment”), a dispatcher might not be required.
Watermarks
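Put simply, a watermark holds back event-time progress a little to wait for late data, so that as much of the expected data as possible arrives (though there is no guarantee that all of it will).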
In Flink, watermarks are implemented as special records holding a
timestamp as a Long value. Watermarks flow in a stream of regular
records with annotated timestamps as Figure 3-8 shows.
Watermarks have two basic properties:
- They must be monotonically increasing to ensure the event-time clocks of tasks are progressing and not going backward.
- They are related to record timestamps. A watermark with a timestamp T indicates that all subsequent records should have timestamps > T.
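Example: handling out-of-order data with watermarks:
The 3-second delay used in this example reflects experience: the maximum out-of-orderness of the data is roughly 3 seconds. If a small amount of late data still arrives after that, allowedLateness(Time.minutes(1)) lets the window keep accepting it, and as a last resort sideOutputLateData(lateTag) routes the remaining late records to a side output.
The approach above generates a watermark for every incoming record. Watermarks can also be generated periodically, which suits typical high-volume streams where records arrive densely and the timestamps of neighboring records differ little.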
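The "wait 3 seconds" rule can be modeled without any Flink dependency. The sketch below is a plain Scala illustration, not Flink's implementation: `BoundedOutOfOrdernessClock` and `WatermarkDemo` are hypothetical names, and the watermark is simply the largest timestamp seen so far minus the allowed delay, computed after every record.

```scala
/** Tracks the highest timestamp seen and derives a watermark that lags it by a fixed delay. */
final class BoundedOutOfOrdernessClock(maxDelayMs: Long) {
  // Start low enough that subtracting the delay cannot underflow.
  private var maxTimestamp: Long = Long.MinValue + maxDelayMs

  /** Observes one record timestamp and returns the watermark after it. */
  def onEvent(timestampMs: Long): Long = {
    maxTimestamp = math.max(maxTimestamp, timestampMs)
    currentWatermark
  }

  /** The watermark never decreases because maxTimestamp never decreases. */
  def currentWatermark: Long = maxTimestamp - maxDelayMs
}

object WatermarkDemo {
  /** Returns the watermark produced after each record, in arrival order. */
  def watermarksFor(timestamps: Seq[Long], maxDelayMs: Long): Seq[Long] = {
    val clock = new BoundedOutOfOrdernessClock(maxDelayMs)
    timestamps.map(t => clock.onEvent(t))
  }
}
```

With a 3000 ms delay, a record with timestamp 3000 that arrives after one with timestamp 5000 leaves the watermark unchanged at 2000, so it is still on time. A periodic variant would call `currentWatermark` on a timer instead of after every record.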
Watermark propagation:
Figure 3-9 shows how a task with four input partitions and three output partitions receives watermarks, updates its partition watermarks and event-time clock, and emits watermarks.
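The bookkeeping a task performs when a watermark arrives can be sketched as follows. This is a simplified model in plain Scala under the usual description of the mechanism: each input partition keeps its own watermark, the task's event-time clock is the minimum over all partitions, and a new watermark is emitted to every output partition only when that clock advances. `TaskWatermarkTracker` is a hypothetical name.

```scala
/** Models how one task combines the watermarks of its input partitions. */
final class TaskWatermarkTracker(numInputPartitions: Int) {
  private val partitionWatermarks = Array.fill(numInputPartitions)(Long.MinValue)
  private var eventTimeClock = Long.MinValue

  /** Handles a watermark from one input partition.
    * Returns Some(clock) when the event-time clock advanced, meaning the task
    * would broadcast that value to all of its output partitions; None otherwise. */
  def onWatermark(partition: Int, watermark: Long): Option[Long] = {
    // A partition watermark only moves forward.
    partitionWatermarks(partition) = math.max(partitionWatermarks(partition), watermark)
    // The slowest input partition determines the task's event time.
    val min = partitionWatermarks.min
    if (min > eventTimeClock) {
      eventTimeClock = min
      Some(min)
    } else None
  }
}
```

Taking the minimum is what keeps results correct: a task cannot claim that event time has reached T while any of its inputs may still deliver records with smaller timestamps.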
State Management
In general, all data maintained by a task and used to compute the results of a function belong to the state of the task. You can think of state as a local or instance variable that is accessed by a task’s business logic. Figure 3-10 shows the typical interaction between a task and its state.
In Flink, state is always associated with a specific operator. In order to make Flink’s runtime aware of the state of an operator, the operator needs to register its state. There are two types of state, operator state and keyed state.
Operator State
Operator state is scoped to an operator task.
Flink offers three primitives for operator state:
- List state
- Union list state (represents state as a list of entries as well, but differs from regular list state in how it is restored in the case of a failure or when an application is started from a savepoint)
- Broadcast state
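The difference in how list state and union list state are restored shows up most clearly when the parallelism changes. The sketch below models both schemes in plain Scala; `OperatorStateRedistribution` is a hypothetical name, and the round-robin split is an assumption made for illustration (Flink documents an even split, not a specific assignment order).

```scala
object OperatorStateRedistribution {
  /** List state: the entries of all old tasks are pooled and split evenly,
    * so each new task receives a disjoint sublist. */
  def redistributeListState[A](oldTasks: Seq[Seq[A]], newParallelism: Int): Seq[Seq[A]] = {
    val all = oldTasks.flatten
    (0 until newParallelism).map { task =>
      all.zipWithIndex.collect { case (entry, i) if i % newParallelism == task => entry }
    }
  }

  /** Union list state: every new task receives the complete pooled list
    * and decides for itself which entries to keep. */
  def redistributeUnionListState[A](oldTasks: Seq[Seq[A]], newParallelism: Int): Seq[Seq[A]] = {
    val all = oldTasks.flatten
    Seq.fill(newParallelism)(all)
  }
}
```

Union list state is useful when a task cannot know in advance which entries it will need after a restore, at the cost of every task temporarily holding the full list.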
Keyed State
Keyed state is maintained and accessed with respect to a key defined in the records of an operator’s input stream. Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains the state for this key. When a task processes a record, it automatically scopes the state access to the key of the current record. Consequently, all records with the same key access the same state. Figure 3-12 shows how tasks interact with keyed state.
You can think of keyed state as a key-value map that is partitioned (or sharded) on the key across all parallel tasks of an operator. Flink provides different primitives for keyed state that determine the type of the value stored for each key in this distributed key-value map. We will briefly discuss the most common keyed state primitives.
- Value state: Stores a single value of arbitrary type per key. Complex data structures can also be stored as value state.
- List state: Stores a list of values per key. The list entries can be of arbitrary type.
- Map state: Stores a key-value map per key. The key and value of the map can be of arbitrary type.
- ReducingState
- AggregatingState
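The picture of keyed state as a key-value map sharded across parallel tasks can be made concrete with a small model. This is a sketch in plain Scala, not the Flink API: `KeyedCounter` is a hypothetical name, it keeps one value state (a count) per key, and it routes keys with a plain hash modulo, which is a simplification of Flink's actual key-group assignment.

```scala
import scala.collection.mutable

/** A per-key counter whose state is partitioned across `parallelism` tasks. */
final class KeyedCounter(parallelism: Int) {
  // One state map per parallel task; each holds the value state of the keys routed to it.
  private val taskState = Vector.fill(parallelism)(mutable.Map.empty[String, Long])

  /** Index of the task that owns `key` (simplified hash partitioning). */
  def taskFor(key: String): Int = java.lang.Math.floorMod(key.hashCode, parallelism)

  /** Processes one record: the state access is scoped to the record's key. */
  def process(key: String): Long = {
    val state = taskState(taskFor(key))
    val updated = state.getOrElse(key, 0L) + 1L
    state(key) = updated
    updated
  }
}
```

All records with the same key land on the same task and see the same counter, while counters for different keys never interfere even when they share a task.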
State Backends
A task of a stateful operator typically reads and updates its state for each incoming record. Because efficient state access is crucial to processing records with low latency, each parallel task locally maintains its state to ensure fast state accesses. How exactly the state is stored, accessed, and maintained is determined by a pluggable component that is called a state backend. A state backend is responsible for two things: local state management and checkpointing state to a remote location.
For local state management, a state backend stores all keyed states and ensures that all accesses are correctly scoped to the current key. Flink provides state backends that manage keyed state as objects stored in in-memory data structures on the JVM heap. Another state backend serializes state objects and puts them into RocksDB, which writes them to local hard disks. While the first option gives very fast state access, it is limited by the size of the memory. Accessing state stored by the RocksDB state backend is slower but its state may grow very large.
Currently, Flink offers three state backends, the MemoryStateBackend, the FsStateBackend, and the RocksDBStateBackend.
- MemoryStateBackend stores state as regular objects on the heap of the TaskManager JVM process. For example, MapState is backed by a Java HashMap object. While this approach provides very low latencies to read or write state, it has implications on the robustness of an application. If the state of a task instance grows too large, the JVM and all task instances running on it can be killed due to an OutOfMemoryError. Moreover, this approach can suffer from garbage collection pauses because it puts many long-lived objects on the heap. When a checkpoint is taken, MemoryStateBackend sends the state to the JobManager, which stores it in its heap memory. Hence, the total state of an application must fit into the JobManager’s memory. Since its memory is volatile, the state is lost in case of a JobManager failure. Due to these limitations, MemoryStateBackend is only recommended for development and debugging purposes.
- FsStateBackend stores the local state on the TaskManager’s JVM heap, just like MemoryStateBackend. However, instead of checkpointing the state to the JobManager’s volatile memory, FsStateBackend writes the state to a remote and persistent file system. Hence, FsStateBackend provides in-memory speed for local accesses and fault tolerance in the case of failures. However, it is limited by the size of the TaskManager memory and might suffer from garbage collection pauses.
- RocksDBStateBackend stores all state into local RocksDB instances. RocksDB is an embedded key-value store that persists data to the local disk. In order to read and write data from and to RocksDB, it needs to be de/serialized. The RocksDBStateBackend also checkpoints the state to a remote and persistent file system. Because it writes data to disk and supports incremental checkpoints (more on this in “Checkpoints, Savepoints, and State Recovery”), RocksDBStateBackend is a good choice for applications with very large state. Users have reported applications with state sizes of multiple terabytes leveraging RocksDBStateBackend. However, reading and writing data to disk and the overhead of de/serializing objects result in lower read and write performance compared to maintaining state on the heap.
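import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// path for checkpoints on the remote filesystem (left unspecified here)
val checkpointPath: String = ???
// create a RocksDB state backend that checkpoints to that path
val backend = new RocksDBStateBackend(checkpointPath)
// configure the state backend
env.setStateBackend(backend)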
Checkpoints, Savepoints, and State Recovery
(Omitted.)
Application Consistency Guarantees
Idempotent Writes/Transactional Writes
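The idempotent-write half of this heading can be illustrated with a tiny model. It is a sketch in plain Scala with hypothetical names (`IdempotentSink`, `replay`): the sink is a key-value store written by upsert, so replaying the same records after a recovery leaves the store in the same final state.

```scala
import scala.collection.mutable

object IdempotentSink {
  /** Writes `records` to a key-value store `times` times, simulating replays
    * after failures, and returns the final contents of the store. */
  def replay(records: Seq[(String, Long)], times: Int): Map[String, Long] = {
    val store = mutable.Map.empty[String, Long]
    for (_ <- 1 to times; record <- records) {
      // Upsert: writing the same key and value again changes nothing.
      store(record._1) = record._2
    }
    store.toMap
  }
}
```

An append-only sink would not have this property: each replay would add duplicate rows, which is the case transactional writes are meant to handle.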
API
For API practice code, see the code accompanying references [1] and [2].
References
[1] 尚硅谷 (Atguigu), Flink tutorial (Scala edition), taught by 武晟然 (Wu Shengran).
[2] Fabian Hueske and Vasiliki Kalavri, Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications.
[3] https://flink.apache.org/