What is Apache Flink? — Applications

Apache Flink is a framework for stateful computations over unbounded and bounded data streams. Flink provides multiple APIs at different levels of abstraction and offers dedicated libraries for common use cases.

Here, we present Flink’s easy-to-use and expressive APIs and libraries.

Building Blocks for Streaming Applications

The types of applications that can be built with and executed by a stream processing framework are defined by how well the framework controls streams, state, and time. In the following, we describe these building blocks for stream processing applications and explain Flink’s approaches to handle them.

Streams

Obviously, streams are a fundamental aspect of stream processing. However, streams can have different characteristics that affect how a stream can and should be processed. Flink is a versatile processing framework that can handle any kind of stream.

  • Bounded and unbounded streams: Streams can be unbounded or bounded, i.e., fixed-sized data sets. Flink has sophisticated features to process unbounded streams, but also dedicated operators to efficiently process bounded streams.
  • Real-time and recorded streams: All data are generated as streams. There are two ways to process the data: processing it in real-time as it is generated, or persisting the stream to a storage system, e.g., a file system or object store, and processing it later. Flink applications can process recorded or real-time streams (see the sketch after this list).
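
A minimal sketch of both kinds of sources in the DataStream API; the file path and socket address below are placeholders, not real endpoints:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// bounded stream: a file has a fixed size, so this stream eventually ends
DataStream<String> recordedEvents = env.readTextFile("/path/to/recorded-events.txt");

// unbounded stream: a socket keeps producing records, so this stream never ends
DataStream<String> liveEvents = env.socketTextStream("localhost", 9999);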

State

Every non-trivial streaming application is stateful, i.e., only applications that apply transformations on individual events do not require state. Any application that runs basic business logic needs to remember events or intermediate results to access them at a later point in time, for example when the next event is received or after a specific time duration.

Application state is a first-class citizen in Flink. You can see that by looking at all the features that Flink provides in the context of state handling.

  • Multiple State Primitives: Flink provides state primitives for different data structures, such as atomic values, lists, or maps. Developers can choose the state primitive that is most efficient based on the access pattern of the function (see the sketch after this list).
  • Pluggable State Backends: Application state is managed in and checkpointed by a pluggable state backend. Flink features different state backends that store state in memory or in RocksDB, an efficient embedded on-disk data store. Custom state backends can be plugged in as well.
  • Exactly-once state consistency: Flink’s checkpointing and recovery algorithms guarantee the consistency of application state in case of a failure. Hence, failures are transparently handled and do not affect the correctness of an application.
  • Very Large State: Flink is able to maintain application state of several terabytes in size due to its asynchronous and incremental checkpoint algorithm.
  • Scalable Applications: Flink supports scaling of stateful applications by redistributing the state to more or fewer workers.
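
As a hedged illustration of the first two points, a rich function might declare one descriptor per state primitive, and the application might plug in the RocksDB state backend. The state names and checkpoint URI below are placeholders:

// inside the open() method of a rich function: one descriptor per primitive
ValueState<Long> count = getRuntimeContext()
  .getState(new ValueStateDescriptor<>("count", Long.class));
ListState<String> events = getRuntimeContext()
  .getListState(new ListStateDescriptor<>("events", String.class));
MapState<String, Long> lookup = getRuntimeContext()
  .getMapState(new MapStateDescriptor<>("lookup", String.class, Long.class));

// plug in the RocksDB state backend (its constructor declares IOException);
// the checkpoint URI is a placeholder
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));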

Time

Time is another important ingredient of streaming applications. Most event streams have inherent time semantics because each event is produced at a specific point in time. Moreover, many common stream computations are based on time, such as window aggregations, sessionization, pattern detection, and time-based joins. An important aspect of stream processing is how an application measures time, i.e., the difference between event-time and processing-time.

Flink provides a rich set of time-related features.

  • Event-time Mode: Applications that process streams with event-time semantics compute results based on timestamps of the events. Thereby, event-time processing allows for accurate and consistent results regardless of whether recorded or real-time events are processed.
  • Watermark Support: Flink employs watermarks to reason about time in event-time applications. Watermarks are also a flexible mechanism to trade off the latency and completeness of results (see the sketch after this list).
  • Late Data Handling: When processing streams in event-time mode with watermarks, it can happen that a computation has been completed before all associated events have arrived. Such events are called late events. Flink features multiple options to handle late events, such as rerouting them via side outputs and updating previously completed results.
  • Processing-time Mode: In addition to its event-time mode, Flink also supports processing-time semantics which performs computations as triggered by the wall-clock time of the processing machine. The processing-time mode can be suitable for certain applications with strict low-latency requirements that can tolerate approximate results.
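
A minimal event-time setup sketch; the Purchase type, its timestamp field, and the 10-second out-of-orderness bound are assumptions of this example:

// enable event-time semantics for the application
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Purchase> purchases = ...

DataStream<Purchase> withTimestamps = purchases
  // extract event-time timestamps and emit watermarks that trail the
  // largest seen timestamp by 10 seconds, trading latency for completeness
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<Purchase>(Time.seconds(10)) {
      @Override
      public long extractTimestamp(Purchase purchase) {
        return purchase.timestamp;
      }
    });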

Layered APIs

Flink provides three layered APIs. Each API offers a different trade-off between conciseness and expressiveness and targets different use cases.

We briefly present each API, discuss its applications, and show a code example.

The ProcessFunctions

ProcessFunctions are the most expressive function interfaces that Flink offers. Flink provides ProcessFunctions to process individual events from one or two input streams or events that were grouped in a window. ProcessFunctions provide fine-grained control over time and state. A ProcessFunction can arbitrarily modify its state and register timers that will trigger a callback function in the future. Hence, ProcessFunctions can implement complex per-event business logic as required for many stateful event-driven applications.

The following example shows a KeyedProcessFunction that operates on a KeyedStream and matches START and END events. When a START event is received, the function remembers its timestamp in state and registers a timer that fires four hours later. If an END event is received before the timer fires, the function computes the duration between the END and START events, clears the state, and emits the value. Otherwise, the timer just fires and clears the state.

The example illustrates the expressive power of the KeyedProcessFunction but also highlights that it is a rather verbose interface.

/**
 * Matches keyed START and END events and computes the difference between 
 * both elements' timestamps. The first String field is the key attribute, 
 * the second String attribute marks START and END events.
 */
public static class StartEndDuration
    extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, Long>> {

  private ValueState<Long> startTime;

  @Override
  public void open(Configuration conf) {
    // obtain state handle
    startTime = getRuntimeContext()
      .getState(new ValueStateDescriptor<Long>("startTime", Long.class));
  }

  /** Called for each processed event. */
  @Override
  public void processElement(
      Tuple2<String, String> in,
      Context ctx,
      Collector<Tuple2<String, Long>> out) throws Exception {

    switch (in.f1) {
      case "START":
        // set the start time if we receive a start event.
        startTime.update(ctx.timestamp());
        // register a timer in four hours from the start event.
        ctx.timerService()
          .registerEventTimeTimer(ctx.timestamp() + 4 * 60 * 60 * 1000);
        break;
      case "END":
        // emit the duration between start and end event
        Long sTime = startTime.value();
        if (sTime != null) {
          out.collect(Tuple2.of(in.f0, ctx.timestamp() - sTime));
          // clear the state
          startTime.clear();
        }
        break;
      default:
        // do nothing
    }
  }

  /** Called when a timer fires. */
  @Override
  public void onTimer(
      long timestamp,
      OnTimerContext ctx,
      Collector<Tuple2<String, Long>> out) {

    // Timeout interval exceeded. Cleaning up the state.
    startTime.clear();
  }
}

The DataStream API

The DataStream API provides primitives for many common stream processing operations, such as windowing, record-at-a-time transformations, and enriching events by querying an external data store. The DataStream API is available for Java and Scala and is based on functions, such as map(), reduce(), and aggregate(). Functions can be defined by extending interfaces or as Java or Scala lambda functions.

The following example shows how to sessionize a clickstream and count the number of clicks per session.

// a stream of website clicks
DataStream<Click> clicks = ...

DataStream<Tuple2<String, Long>> result = clicks
  // project clicks to userId and add a 1 for counting
  .map(
    // define function by implementing the MapFunction interface.
    new MapFunction<Click, Tuple2<String, Long>>() {
      @Override
      public Tuple2<String, Long> map(Click click) {
        return Tuple2.of(click.userId, 1L);
      }
    })
  // key by userId (field 0)
  .keyBy(0)
  // define session window with 30 minute gap
  .window(EventTimeSessionWindows.withGap(Time.minutes(30L)))
  // count clicks per session. Define function as lambda function.
  .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

SQL & Table API

Flink features two relational APIs, the Table API and SQL. Both APIs are unified APIs for batch and stream processing, i.e., queries are executed with the same semantics on unbounded, real-time streams or bounded, recorded streams and produce the same results. The Table API and SQL leverage Apache Calcite for parsing, validation, and query optimization. They can be seamlessly integrated with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.

Flink’s relational APIs are designed to ease the definition of data analytics, data pipelining, and ETL applications.

The following example shows the SQL query to sessionize a clickstream and count the number of clicks per session. This is the same use case as in the example of the DataStream API.

SELECT userId, COUNT(*)
FROM clicks
GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId
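
As a rough sketch of how this query might be wired up against the clicks stream from the DataStream API example; the registered table name, field list, and rowtime attribute are assumptions of this sketch:

// register the stream as a table with an event-time (rowtime) attribute
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.registerDataStream("clicks", clicks, "userId, clicktime.rowtime");

// run the sessionization query shown above
Table sessionClicks = tableEnv.sqlQuery(
  "SELECT userId, COUNT(*) " +
  "FROM clicks " +
  "GROUP BY SESSION(clicktime, INTERVAL '30' MINUTE), userId");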

Libraries

Flink features several libraries for common data processing use cases. The libraries are typically embedded in an API and not fully self-contained. Hence, they can benefit from all features of the API and be integrated with other libraries.

  • Complex Event Processing (CEP): Pattern detection is a very common use case for event stream processing. Flink’s CEP library provides an API to specify patterns of events (think of regular expressions or state machines). The CEP library is integrated with Flink’s DataStream API, such that patterns are evaluated on DataStreams. Applications for the CEP library include network intrusion detection, business process monitoring, and fraud detection (see the pattern sketch after this list).

  • DataSet API: The DataSet API is Flink’s core API for batch processing applications. The primitives of the DataSet API include map, reduce, (outer) join, co-group, and iterate. All operations are backed by algorithms and data structures that operate on serialized data in memory and spill to disk if the data size exceeds the memory budget. The data processing algorithms of Flink’s DataSet API are inspired by traditional database operators, such as hybrid hash-join or external merge-sort.

  • Gelly: Gelly is a library for scalable graph processing and analysis. Gelly is implemented on top of and integrated with the DataSet API. Hence, it benefits from its scalable and robust operators. Gelly features built-in algorithms, such as label propagation, triangle enumeration, and page rank, but also provides a Graph API that eases the implementation of custom graph algorithms.
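
As a hedged sketch of the CEP Pattern API, the START/END matching from the ProcessFunction section could be expressed as a pattern. The Event type, its fields, and the four-hour window are assumptions of this example:

// match a START event followed by an END event for the same key within four hours
Pattern<Event, ?> startEnd = Pattern.<Event>begin("start")
  .where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event e) {
      return "START".equals(e.type);
    }
  })
  .followedBy("end")
  .where(new SimpleCondition<Event>() {
    @Override
    public boolean filter(Event e) {
      return "END".equals(e.type);
    }
  })
  .within(Time.hours(4));

// apply the pattern to the keyed event stream; matches can then be selected
PatternStream<Event> matches = CEP.pattern(events.keyBy(e -> e.id), startEnd);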

 https://flink.apache.org/flink-applications.html
