What is Apache Flink? Architecture

huaishu

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

Here, we explain important aspects of Flink’s architecture.

Process Unbounded and Bounded Data

Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, or user interactions on a website or mobile application, all of these data are generated as a stream.

Data can be processed as unbounded or bounded streams.

Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated. Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested. It is not possible to wait for all input data to arrive because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events are ingested in a specific order, such as the order in which events occurred, to be able to reason about result completeness.

Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations. Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.
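The difference between the two processing styles can be sketched in plain Python. This is only an illustration, not Flink code: `sensor_readings`, `running_max`, and `batch_max` are hypothetical names, and the generator stands in for a real unbounded source.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def sensor_readings() -> Iterator[int]:
    """Hypothetical unbounded source: it has a start but never terminates."""
    for i in count():
        yield (i * 7) % 10  # deterministic stand-in for a measurement


def running_max(events: Iterable[int]) -> Iterator[int]:
    """Stream style: emit an updated result right after each event is ingested."""
    best = None
    for value in events:
        best = value if best is None else max(best, value)
        yield best


def batch_max(events: List[int]) -> int:
    """Batch style: the input is complete, so it can be sorted before computing."""
    return sorted(events)[-1]


# Unbounded input: only a prefix of the results can ever be observed.
first_results = list(islice(running_max(sensor_readings()), 5))

# Bounded input: ingest everything first, then compute once.
bounded_result = batch_max([3, 9, 1, 4])
```

The streaming function never waits for the end of its input, while the batch function relies on the whole data set being available.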

Apache Flink excels at processing unbounded and bounded data sets. Precise control of time and state enables Flink’s runtime to run any kind of application on unbounded streams. Bounded streams are internally processed by algorithms and data structures that are specifically designed for fixed-size data sets, yielding excellent performance.

Convince yourself by exploring the use cases that have been built on top of Flink.

Deploy Applications Anywhere

Apache Flink is a distributed system and requires compute resources in order to execute applications. Flink integrates with all common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also be set up to run as a standalone cluster.

Flink is designed to work well with each of the previously listed resource managers. This is achieved by resource-manager-specific deployment modes that allow Flink to interact with each resource manager in its idiomatic way.

When deploying a Flink application, Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager. In case of a failure, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls. This eases the integration of Flink in many environments.
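How resource demand can follow from configured parallelism is sketched below. This is a simplified model, not Flink's scheduler: it assumes that all operators share slots, so the operator with the highest parallelism determines the number of slots, and that every container offers the same number of slots. The function names and the example job are hypothetical.

```python
import math
from typing import Dict


def required_slots(operator_parallelism: Dict[str, int]) -> int:
    """Assumption: operators share slots, so the widest operator sets the demand."""
    return max(operator_parallelism.values())


def required_containers(slots: int, slots_per_container: int) -> int:
    """Round up: a partially used container still has to be requested."""
    return math.ceil(slots / slots_per_container)


job = {"source": 4, "window": 8, "sink": 2}  # illustrative parallelism settings
slots = required_slots(job)
containers = required_containers(slots, slots_per_container=3)
```

With these illustrative numbers the job needs 8 slots, which fit into 3 containers of 3 slots each.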

Run Applications at Any Scale

Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster. Therefore, an application can leverage virtually unlimited amounts of CPUs, main memory, disk and network IO. Moreover, Flink easily maintains very large application state. Its asynchronous and incremental checkpointing algorithm ensures minimal impact on processing latencies while guaranteeing exactly-once state consistency.

Users reported impressive scalability numbers for Flink applications running in their production environments, such as

applications processing multiple trillions of events per day,
applications maintaining multiple terabytes of state, and
applications running on thousands of cores.
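One way to picture how a stateful application is spread over parallel tasks is to route events by key, so that each task owns the state for its own keys. The sketch below is an assumption-laden simplification: it uses a CRC32 hash modulo the parallelism rather than Flink's actual key distribution, and `task_for_key` and `run_parallel` are hypothetical names.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def task_for_key(key: str, parallelism: int) -> int:
    """Stable hash so the same key always reaches the same parallel task."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def run_parallel(events: List[Tuple[str, int]], parallelism: int) -> List[Dict[str, int]]:
    """Sum amounts per key; each task keeps its own state and never shares it."""
    task_state = [defaultdict(int) for _ in range(parallelism)]
    for key, amount in events:
        task_state[task_for_key(key, parallelism)][key] += amount
    return [dict(state) for state in task_state]


events = [("alice", 5), ("bob", 3), ("alice", 2), ("carol", 7), ("bob", 1)]
states = run_parallel(events, parallelism=4)
```

Because every key maps to exactly one task, adding tasks spreads both the computation and the state without any coordination between tasks.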

Leverage In-Memory Performance

Stateful Flink applications are optimized for local state access. Task state is always maintained in memory or, if the state size exceeds the available memory, in access-efficient on-disk data structures. Hence, tasks perform all computations by accessing local, often in-memory, state, yielding very low processing latencies. Flink guarantees exactly-once state consistency in case of failures by periodically and asynchronously checkpointing the local state to durable storage.
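The idea behind checkpoint-based recovery can be sketched as follows. This is a toy model under stated assumptions, not Flink's checkpointing algorithm: the snapshot is taken synchronously, a Python tuple stands in for durable storage, the input is assumed to be a replayable log, and `CountingTask` is a hypothetical class.

```python
import copy
from typing import Dict, List, Tuple


class CountingTask:
    """Hypothetical task that counts events per key in local, in-memory state."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}
        self.offset = 0  # position of the next input event to read

    def process(self, log: List[str], upto: int) -> None:
        """Consume events from the current offset up to (not including) upto."""
        while self.offset < upto:
            key = log[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1

    def snapshot(self) -> Tuple[int, Dict[str, int]]:
        """Copy the state together with the input position it corresponds to."""
        return self.offset, copy.deepcopy(self.state)

    def restore(self, checkpoint: Tuple[int, Dict[str, int]]) -> None:
        """Reset both state and input position to the checkpointed values."""
        self.offset, state = checkpoint
        self.state = copy.deepcopy(state)


log = ["a", "b", "a", "c", "a", "b"]  # replayable input (assumption)

task = CountingTask()
task.process(log, upto=3)
durable_checkpoint = task.snapshot()  # stand-in for writing to durable storage

task.process(log, upto=5)  # progress made after the checkpoint ...
task = CountingTask()      # ... is lost when the task fails and is replaced
task.restore(durable_checkpoint)
task.process(log, upto=len(log))  # replay from the checkpointed offset

final_state = task.state
```

Because the state and the input offset are restored together, every event is reflected in the final state exactly once, even though some events were read twice.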

https://flink.apache.org/flink-architecture.html