Learn Flink: Streaming Analytics

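This article takes a close look at event time and watermarks in Flink and why they matter. Event time keeps results accurate, and watermarks are used to handle delayed data. It explains how watermarks work through an example and emphasizes the tradeoff between latency and completeness. It also introduces Flink's windowing features, including window types, window functions, and ways of handling late events.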

Event Time and Watermarks

Introduction

Flink explicitly supports three different notions of time:

  • event time: the time when an event occurred, as recorded by the device producing (or storing) the event
  • ingestion time: a timestamp recorded by Flink at the moment it ingests the event
  • processing time: the time when a specific operator in your pipeline is processing the event

For reproducible results, e.g., when computing the maximum price a stock reached during the first hour of trading on a given day, you should use event time. In this way the result won’t depend on when the calculation is performed.
This kind of real-time application is sometimes performed using processing time, but then the results are determined by the events that happen to be processed during that hour, rather than the events that occurred then.
Computing analytics based on processing time causes inconsistencies, and makes it difficult to re-analyze historic data or test new implementations.

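The difference between the two notions of time can be made concrete with a small sketch in plain Java (no Flink dependency). The Trade class, the sample prices, and the timestamps below are all made up for illustration; times are seconds since the market opened, so the first hour of trading is the interval [0, 3600).

```java
import java.util.Arrays;
import java.util.List;

class EventTimeVsProcessingTime {
    // Hypothetical trade record; times are seconds since the market opened.
    static final class Trade {
        final double price;
        final long eventTime;       // when the trade actually happened
        final long processingTime;  // when the pipeline processed it

        Trade(double price, long eventTime, long processingTime) {
            this.price = price;
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }
    }

    // Made-up sample data: the 105.0 trade happened in the first hour
    // but was not processed until after the hour had ended.
    static List<Trade> sampleTrades() {
        return Arrays.asList(
                new Trade(101.0, 600, 610),
                new Trade(105.0, 3500, 3700),
                new Trade(103.0, 1800, 1805),
                new Trade(110.0, 3650, 3655));
    }

    // Maximum price among trades that occurred in [start, end).
    static double maxByEventTime(List<Trade> trades, long start, long end) {
        double max = Double.NEGATIVE_INFINITY;
        for (Trade t : trades) {
            if (t.eventTime >= start && t.eventTime < end) {
                max = Math.max(max, t.price);
            }
        }
        return max;
    }

    // Maximum price among trades that happened to be processed in [start, end).
    static double maxByProcessingTime(List<Trade> trades, long start, long end) {
        double max = Double.NEGATIVE_INFINITY;
        for (Trade t : trades) {
            if (t.processingTime >= start && t.processingTime < end) {
                max = Math.max(max, t.price);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Trade> trades = sampleTrades();
        System.out.println("event time:      " + maxByEventTime(trades, 0, 3600));
        System.out.println("processing time: " + maxByProcessingTime(trades, 0, 3600));
    }
}
```

With this sample data the event-time maximum is 105.0, while the processing-time maximum is 103.0 because the delayed trade is missed. Rerunning the event-time computation later, or on replayed data, gives the same answer; the processing-time answer depends on when each trade happened to be handled.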

Working with Event Time

If you want to use event time, you will also need to supply a Timestamp Extractor and Watermark Generator that Flink will use to track the progress of event time. This will be covered in the section below on Working with Watermarks, but first we should explain what watermarks are.

Watermarks

Let’s work through a simple example that will show why watermarks are needed, and how they work.

In this example you have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are timestamps that indicate when these events actually occurred. The first event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2, and so on:
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →

Now imagine that you are trying to create a stream sorter. This is meant to be an application that processes each event from a stream as it arrives, and emits a new stream containing the same events, but ordered by their timestamps.

Some observations:
(1) The first element your stream sorter sees is the 4, but you can’t just immediately release it as the first element of the sorted stream. It may have arrived out of order, and an earlier event might yet arrive. In fact, you have the benefit of some god-like knowledge of this stream’s future, and you can see that your stream sorter should wait at least until the 2 arrives before producing any results.

Some buffering, and some delay, is necessary.

(2) If you do this wrong, you could end up waiting forever. First the sorter saw an event from time 4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe. Maybe not. You could wait forever and never see a 1.

Eventually you have to be courageous and emit the 2 as the start of the sorted stream.

(3) What you need then is some sort of policy that defines when, for any given timestamped event, to stop waiting for the arrival of earlier events.

This is precisely what watermarks do — they define when to stop waiting for earlier events.

Event time processing in Flink depends on watermark generators that insert special timestamped elements into the stream, called watermarks. A watermark for time t is an assertion that the stream is (probably) now complete up through time t.

When should this stream sorter stop waiting, and push out the 2 to start the sorted stream? When a watermark arrives with a timestamp of 2, or greater.

(4) You might imagine different policies for deciding how to generate watermarks.

Each event arrives after some delay, and these delays vary, so some events are delayed more than others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink refers to this strategy as bounded-out-of-orderness watermarking. It is easy to imagine more complex approaches to watermarking, but for most applications a fixed delay works well enough.
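The observations above can be pulled together in a small plain-Java sketch of the stream sorter, driven by a bounded-out-of-orderness policy. This is not Flink code: the class name, the per-event watermark update, and the rule "watermark = largest timestamp seen so far minus maxDelay" are simplifying assumptions made here for illustration, and Flink's built-in generator may differ in detail (for example, it typically emits watermarks periodically rather than after every event).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class WatermarkedSorterSketch {
    // The sample stream from the text, in arrival order (4 arrives first).
    static final long[] ARRIVALS = {4, 2, 7, 11, 9, 15, 12, 13, 17, 14, 21, 24, 22, 19, 23};

    private final long maxDelay;
    private long maxTimestampSeen = Long.MIN_VALUE;
    private long watermark = Long.MIN_VALUE;
    private final PriorityQueue<Long> buffer = new PriorityQueue<>();

    final List<Long> sorted = new ArrayList<>(); // events released in timestamp order
    final List<Long> late = new ArrayList<>();   // events that arrived behind the watermark

    WatermarkedSorterSketch(long maxDelay) {
        this.maxDelay = maxDelay;
    }

    void onEvent(long timestamp) {
        if (timestamp <= watermark) {
            // The sorter already stopped waiting for this time; it cannot be emitted in order.
            late.add(timestamp);
            return;
        }
        buffer.add(timestamp);
        maxTimestampSeen = Math.max(maxTimestampSeen, timestamp);
        advanceWatermark(maxTimestampSeen - maxDelay);
    }

    // Watermark(t): the stream is assumed complete up through t, so release everything <= t.
    private void advanceWatermark(long t) {
        if (t <= watermark) {
            return; // watermarks never move backwards
        }
        watermark = t;
        while (!buffer.isEmpty() && buffer.peek() <= watermark) {
            sorted.add(buffer.poll());
        }
    }

    // At the end of a finite input nothing more can arrive, so flush the buffer.
    void endOfStream() {
        advanceWatermark(Long.MAX_VALUE);
    }

    static WatermarkedSorterSketch run(long[] arrivals, long maxDelay) {
        WatermarkedSorterSketch sorter = new WatermarkedSorterSketch(maxDelay);
        for (long ts : arrivals) {
            sorter.onEvent(ts);
        }
        sorter.endOfStream();
        return sorter;
    }

    public static void main(String[] args) {
        WatermarkedSorterSketch patient = run(ARRIVALS, 6);
        System.out.println("maxDelay=6 sorted=" + patient.sorted + " late=" + patient.late);
        WatermarkedSorterSketch eager = run(ARRIVALS, 2);
        System.out.println("maxDelay=2 sorted=" + eager.sorted + " late=" + eager.late);
    }
}
```

With a maximum delay of 6 the sample stream comes out fully sorted and nothing is late. With a maximum delay of 2 the sorter releases events sooner, but seven events (2, 9, 12, 13, 14, 22, 19) arrive after the watermark has already passed them. This sketch simply sets those aside; a real application has to decide what to do with them.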

Latency vs. Completeness

Another way to think about watermarks is that they give you, the developer of a streaming application, control over the tradeoff between latency and completeness. Unlike in batch processing, where one has the luxury of being able to have complete knowledge of the input before producing any results, with streaming you must eventually stop waiting to see more of the input, and produce some sort of result.

You can either configure your watermarking aggressively, with a short bounded delay, and thereby take the risk of producing results with rather incomplete knowledge of the input – i.e., a possibly wrong result, produced quickly. Or you can wait longer, and produce results that take advantage of having more complete knowledge of the input stream(s).
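To get a feel for this tradeoff, the following plain-Java sketch counts how many events of the sample stream from the Watermarks section would arrive too late under different bounded delays. It assumes, for illustration only, that the watermark always equals the largest timestamp seen so far minus the configured delay; the class and method names are hypothetical.

```java
class LatencyVsCompleteness {
    // The sample stream from the Watermarks section, in arrival order.
    static final long[] ARRIVALS = {4, 2, 7, 11, 9, 15, 12, 13, 17, 14, 21, 24, 22, 19, 23};

    // Counts events whose timestamp is at or behind the watermark implied by
    // the events that arrived before them (watermark = max timestamp - maxDelay).
    static int countLate(long[] arrivals, long maxDelay) {
        long maxSeen = Long.MIN_VALUE;
        int late = 0;
        for (long ts : arrivals) {
            if (maxSeen != Long.MIN_VALUE && ts <= maxSeen - maxDelay) {
                late++;
            }
            maxSeen = Math.max(maxSeen, ts);
        }
        return late;
    }

    public static void main(String[] args) {
        for (long delay : new long[] {1, 2, 3, 6}) {
            System.out.println("maxDelay=" + delay + " -> late events: " + countLate(ARRIVALS, delay));
        }
    }
}
```

For this stream, delays of 1, 2, 3, and 6 leave 8, 7, 3, and 0 events late, respectively. A longer delay gives a more complete view of the input, but every result has to wait that much longer before it can be produced.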

It is also possible to implement hybrid solutions that produce initial results quickly, and then supply updates to those results as additional (late) data is processed. This is a good approach for some applications.

Lateness

Lateness is defined relative to the watermarks. A Watermark(t) asserts that the stream is complete up through time t; any event following this watermark whose timestamp is ≤ t is late.

Working with Watermarks

In order to perform event-time-based event processing, Flink needs to know the time associated with each event, and it also needs the stream to include watermarks.

The Taxi data sources used in the hands-on exercises take care of these details for you. But in your own applications you will have to take care of this yourself, which is usually done by implementing a class that extracts the timestamps from the events, and generates watermarks on demand. The easiest way to do this is by using a WatermarkStrategy:

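DataStream<Event> stream = ...;

// Bounded-out-of-orderness strategy; the 20-second bound and the event.timestamp field are illustrative.
WatermarkStrategy<Event> strategy = WatermarkStrategy
        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(20))
        .withTimestampAssigner((event, timestamp) -> event.timestamp);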