Window
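In Flink, a DataStream created from a StreamExecutionEnvironment is usually an unbounded stream, yet projects often need statistics over a fixed period of time. This is where Flink's windows come in: they split the unbounded stream into bounded pieces.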
Official definition: Windows are at the heart of processing infinite streams. Windows split the stream into “buckets” of finite size, over which we can apply computations.
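In other words, a window does not change the stream itself; it only groups the elements of an endless stream into finite "buckets" so that a computation such as a count or a sum has a well-defined input.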
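To make the "bucket" idea concrete, here is a minimal plain-Java sketch of cutting a sequence of timestamps into fixed-size buckets. It is not Flink's API: the class WindowBuckets and the methods bucketStart and assign are hypothetical names made up for this illustration, and it assumes non-negative timestamps in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; WindowBuckets is a hypothetical helper, not part of Flink.
final class WindowBuckets {

    // Start of the fixed-size bucket that contains the given timestamp.
    // Assumes a non-negative timestamp and a positive size, both in milliseconds.
    static long bucketStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Groups timestamps by bucket start; the TreeMap keeps the buckets in time order.
    static Map<Long, List<Long>> assign(List<Long> timestamps, long sizeMillis) {
        Map<Long, List<Long>> buckets = new TreeMap<>();
        for (long ts : timestamps) {
            buckets.computeIfAbsent(bucketStart(ts, sizeMillis), k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // A few timestamps from an (in principle endless) stream, cut into 5-second buckets.
        List<Long> stream = List.of(1_000L, 4_999L, 5_000L, 7_500L, 12_000L, 14_999L);
        assign(stream, 5_000L).forEach((start, elems) ->
                System.out.println("[" + start + ", " + (start + 5_000L) + ") -> " + elems));
    }
}
```

Each bucket is finite, so an aggregate over it can be computed and emitted even though the stream never ends.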
Lifecycle
Official documentation: In a nutshell, a window is created as soon as the first element that should belong to this window arrives, and the window is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees removal only for time-based windows and not for other types, e.g. global windows (see Window Assigners).
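In other words, a window is created when the first element belonging to it arrives. It is completely removed once time (event time or processing time) passes the window's end timestamp plus the user-specified allowed lateness (see Allowed Lateness). Flink guarantees this removal only for time-based windows, not for other types such as global windows (see Window Assigners).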
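Put even more simply: the window is created when the first record arrives and is removed once time passes the end of the window. If an allowed lateness is configured, the removal time becomes the window's end time plus that lateness. For example, with 5-minute tumbling event-time windows and an allowed lateness of 1 minute, the window for 12:00 to 12:05 is created when the first element with a timestamp in that interval arrives, and it is removed once the watermark passes 12:06.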
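The removal rule can be written down as a small calculation. The sketch below is plain Java, not Flink code: WindowLifecycle, cleanupTime and isExpired are hypothetical names, and treating "current time equal to the cleanup time" as expired is an assumption of this sketch rather than a statement about Flink's exact boundary handling.

```java
// Plain-Java illustration only; WindowLifecycle is a hypothetical helper, not part of Flink.
final class WindowLifecycle {

    // Time at which a time-based window becomes eligible for removal:
    // its end timestamp plus the user-specified allowed lateness.
    static long cleanupTime(long windowEndMillis, long allowedLatenessMillis) {
        return windowEndMillis + allowedLatenessMillis;
    }

    // True once the current time (event or processing time) has reached the cleanup time.
    // Using >= at the boundary is an assumption made for this sketch.
    static boolean isExpired(long currentTimeMillis, long windowEndMillis, long allowedLatenessMillis) {
        return currentTimeMillis >= cleanupTime(windowEndMillis, allowedLatenessMillis);
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long windowEnd = 5 * minute;   // the window covers [0 min, 5 min)
        long allowedLateness = minute; // keep the window around for one extra minute

        System.out.println(isExpired(5 * minute, windowEnd, allowedLateness)); // false: still within the lateness period
        System.out.println(isExpired(6 * minute, windowEnd, allowedLateness)); // true: end + lateness reached
    }
}
```

With an allowed lateness of zero, the cleanup time is simply the end of the window.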
Time in Flink
Event Time
Official documentation: Event time is the time that each individual event occurred on its producing device. This time is typically embedded within the records before they enter Flink, and that event timestamp can be extracted from each record. In event time, the progress of time depends on the data, not on any wall clocks. Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time. This watermarking mechanism is described in a later section, below.
In a perfect world, event time processing would yield completely consistent and deterministic results, regardless of when events arrive, or their ordering. However, unless the events are known to arrive in-order (by timestamp), event time processing incurs some latency while waiting for out-of-order events. As it is only possible to wait for a finite period of time, this places a limit on how deterministic event time applications can be.
Assuming all of the data has arrived, event time operations will behave as expected, and produce correct and consistent results even when working with out-of-order or late events, or when reprocessing historic data. For example, an hourly event time window will contain all records that carry an event timestamp that falls into that hour, regardless of the order in which they arrive, or when they are processed. (See the section on late events for more information.)
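In real-world scenarios, data arrives with network delay and possibly out of order. For business logic whose results depend on when something actually happened, we need to use Event Time, so that operators compute on the timestamp carried by each message rather than on the moment it happened to arrive.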
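The trade-off quoted above (waiting a finite time for out-of-order events) can be sketched in plain Java. This is a simplified model, not Flink's watermark implementation: BoundedLatenessTracker and its methods are hypothetical names, and the rule "watermark = largest timestamp seen minus a fixed bound" is an assumption chosen for illustration.

```java
// Plain-Java illustration only; BoundedLatenessTracker is a hypothetical class, not part of Flink.
final class BoundedLatenessTracker {
    private final long maxOutOfOrdernessMillis;
    private long maxSeenTimestamp = Long.MIN_VALUE;

    BoundedLatenessTracker(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    // Records an event timestamp; returns true if the event is behind the current watermark (late).
    boolean onEvent(long eventTimestampMillis) {
        boolean late = eventTimestampMillis < currentWatermark();
        maxSeenTimestamp = Math.max(maxSeenTimestamp, eventTimestampMillis);
        return late;
    }

    // Event time is assumed to have progressed up to this point.
    long currentWatermark() {
        if (maxSeenTimestamp == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet, so event time has not started
        }
        return maxSeenTimestamp - maxOutOfOrdernessMillis;
    }

    public static void main(String[] args) {
        BoundedLatenessTracker tracker = new BoundedLatenessTracker(2_000L);
        long[] arrivals = {1_000L, 5_000L, 4_000L, 2_000L}; // event timestamps in arrival order
        for (long ts : arrivals) {
            boolean late = tracker.onEvent(ts);
            System.out.println("event " + ts + " late=" + late + " watermark=" + tracker.currentWatermark());
        }
    }
}
```

Here the event with timestamp 4000 arrives out of order but within the 2-second bound, so it is still on time, while the event with timestamp 2000 falls behind the watermark of 3000 and counts as late. A larger bound tolerates more disorder at the cost of more latency.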
Processing Time
Official documentation: When a streaming program runs on processing time, all time-based operations (like time windows) will use the system clock of the machines that run the respective operator. An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour.
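That is, when a streaming program runs on processing time, every time-based operation uses the system clock of the machine running that operator. An hourly processing-time window therefore holds whatever records reached the operator between two full hours on that clock, regardless of when the events actually occurred.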
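If the business does not have strict requirements on timing accuracy, processing time is a reasonable choice: it is the simplest notion of time, needs no coordination between streams and machines, and offers the best performance and lowest latency. The price is that results are not deterministic, because they depend on how fast records happen to arrive at each operator.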
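The difference between the two notions of time shows up as soon as a record is delayed across an hour boundary. The following plain-Java sketch is not Flink code: TimeNotions and its methods are hypothetical names, timestamps are assumed to be non-negative milliseconds, and a fixed java.time.Clock stands in for the operator machine's system clock.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;

// Plain-Java illustration only; TimeNotions is a hypothetical helper, not part of Flink.
final class TimeNotions {
    static final long HOUR_MILLIS = 3_600_000L;

    // Start of the hourly window containing the timestamp (non-negative milliseconds assumed).
    static long hourlyWindowStart(long timestampMillis) {
        return timestampMillis - (timestampMillis % HOUR_MILLIS);
    }

    // Event time: the window is chosen from the timestamp carried by the record.
    static long eventTimeWindow(long recordTimestampMillis) {
        return hourlyWindowStart(recordTimestampMillis);
    }

    // Processing time: the window is chosen from the operator machine's clock on arrival.
    static long processingTimeWindow(Clock operatorClock) {
        return hourlyWindowStart(operatorClock.millis());
    }

    public static void main(String[] args) {
        // Hypothetical record produced at 09:59:58 that reaches the operator at 10:00:03.
        long producedAt = 9 * HOUR_MILLIS + 59 * 60_000L + 58_000L;
        Clock operatorClock = Clock.fixed(Instant.ofEpochMilli(10 * HOUR_MILLIS + 3_000L), ZoneOffset.UTC);

        System.out.println("event-time window starts at hour " + eventTimeWindow(producedAt) / HOUR_MILLIS);
        System.out.println("processing-time window starts at hour " + processingTimeWindow(operatorClock) / HOUR_MILLIS);
    }
}
```

The same record is counted in the 09:00 window under event time and in the 10:00 window under processing time, which is why processing time is only suitable when that kind of drift is acceptable.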