Flink Watermarks

bigdata_tw

Article link: https://blog.csdn.net/qq_36532358/article/details/106920693

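1. What is a Flink watermark?
1.1 The official definition
A Flink watermark is, in essence, a special element in a DataStream; every watermark carries a timestamp.

When a watermark with timestamp T appears, it declares that all data with event time t <= T has already arrived, so only data with event time t > T should flow in after it.

In other words, the watermark is the yardstick Flink uses to decide whether data is late, and it is also the signal that triggers windows.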
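The contract above can be sketched in a few lines of plain Java. This is only an illustration of the rule "t <= T means late", not Flink code: the class `WatermarkContract` and its methods are hypothetical names, and real operators also account for things such as allowed lateness.

```java
/** Minimal sketch of the watermark contract; hypothetical class, not part of Flink. */
final class WatermarkContract {

    /** Highest watermark timestamp seen so far, in milliseconds. */
    private long currentWatermark = Long.MIN_VALUE;

    /** Advances event time; a watermark never moves event time backwards. */
    void onWatermark(long watermarkTimestamp) {
        currentWatermark = Math.max(currentWatermark, watermarkTimestamp);
    }

    /** An element is late if a watermark T with T >= t has already arrived. */
    boolean isLate(long eventTimestamp) {
        return eventTimestamp <= currentWatermark;
    }

    public static void main(String[] args) {
        WatermarkContract operator = new WatermarkContract();
        operator.onWatermark(4_000L);                 // W(4s)
        System.out.println(operator.isLate(3_000L));  // true: t <= T
        System.out.println(operator.isLate(7_000L));  // false: t > T
    }
}
```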

1.2 At the code level

public final class Watermark extends StreamElement {

   /** The watermark that signifies end-of-event-time. */
   public static final Watermark MAX_WATERMARK = new Watermark(Long.MAX_VALUE);

   // ------------------------------------------------------------------------

   /** The timestamp of the watermark in milliseconds. */
   private final long timestamp;

   /**
    * Creates a new watermark with the given timestamp in milliseconds.
    */
   public Watermark(long timestamp) {
      this.timestamp = timestamp;
   }

   /**
    * Returns the timestamp associated with this {@link Watermark} in milliseconds.
    */
   public long getTimestamp() {
      return timestamp;
   }

   // ------------------------------------------------------------------------

   @Override
   public boolean equals(Object o) {
      return this == o ||
            o != null && o.getClass() == Watermark.class && ((Watermark) o).timestamp == this.timestamp;
   }

   @Override
   public int hashCode() {
      return (int) (timestamp ^ (timestamp >>> 32));
   }

   @Override
   public String toString() {
      return "Watermark @ " + timestamp;
   }
}

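As the Watermark class shows, it extends StreamElement and has a timestamp member field: a standard element carrying a timestamp.
A StreamElement is an element in a data stream; it can be a record or a watermark. With that in mind, the official definition above should now make sense.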
1.3 Notions of time
1.3.1 Processing time:
Processing time refers to the system time of the machine that is executing the respective operation.
1.3.2 Event time:
Event time is the time that each individual event occurred on its producing device.
1.3.3 Ingestion time:
Ingestion time is the time that events enter Flink.

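1.4 Illustrated examples
First, a simple example of how watermarks work together with event time.

1.4.1 Single parallelism

The boxes in the figure are data elements, and the number in each box is its event time. W(x) denotes a watermark with timestamp x, and there is a tumbling window 4 time units long. Taking the unit as seconds, the elements with event times 2s, 3s, and 1s all fall into the window covering [1s, 4s], while the element with event time 7s falls into the window covering [5s, 8s]. When watermark W(4) arrives, it signals that no more elements with t <= 4s are expected, so the [1s, 4s] window is triggered and computed. Likewise, when W(9) arrives, the [5s, 8s] window is triggered and computed, and so on.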
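The walk-through can be replayed with a small sketch. It assumes, as far as I understand Flink's tumbling event-time windows, that windows are half-open intervals aligned to the epoch and that a window may fire once the watermark reaches its last timestamp (end minus 1 ms). Under that assumption the figure's [1s, 4s] and [5s, 8s] correspond to [0s, 4s) and [4s, 8s). The class and method names are hypothetical.

```java
/** Sketch of tumbling-window assignment and event-time firing; simplified, not Flink source. */
final class TumblingWindowSketch {

    /** Start (inclusive) of the window of length sizeMs that contains timestampMs. */
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - Math.floorMod(timestampMs, sizeMs);
    }

    /** A window [start, start + size) may fire once the watermark reaches its last timestamp. */
    static boolean canFire(long windowStartMs, long sizeMs, long watermarkMs) {
        long maxTimestamp = windowStartMs + sizeMs - 1;
        return watermarkMs >= maxTimestamp;
    }

    public static void main(String[] args) {
        long size = 4_000L;
        // Event times from the figure: 2s, 3s, 1s and 7s.
        for (long t : new long[] {2_000L, 3_000L, 1_000L, 7_000L}) {
            long start = windowStart(t, size);
            System.out.println(t + " ms -> [" + start + ", " + (start + size) + ")");
        }
        System.out.println(canFire(0L, size, 4_000L));     // W(4s) fires the first window
        System.out.println(canFire(4_000L, size, 4_000L)); // but not yet the second one
        System.out.println(canFire(4_000L, size, 9_000L)); // W(9s) fires the second window
    }
}
```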

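1.4.2 Multiple parallelism

Figure: parallel data streams and operators with events and watermarks
Watermarks are generated at, or directly after, the source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.
As watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.
Some operators consume multiple input streams: a union, for example, or operators following a keyBy(…) or partition(…) function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.
How should this be understood? See the figure below.

When a shuffle operation such as keyBy(…) or partition(…) is encountered, the watermark generated by each parallel subtask is broadcast to all downstream subtasks; this can be seen in the code later on.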
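The "minimum over all inputs" rule can be sketched as follows. This is a simplified stand-in written for illustration, not the logic Flink actually ships: the class `MinWatermarkTracker` is hypothetical, and it ignores idle inputs.

```java
import java.util.Arrays;

/** Sketch: an operator with several input channels tracks the minimum watermark. */
final class MinWatermarkTracker {

    /** Latest watermark received on each input channel, in milliseconds. */
    private final long[] channelWatermarks;

    /** The operator's own event time; it only ever moves forward. */
    private long operatorWatermark = Long.MIN_VALUE;

    MinWatermarkTracker(int numChannels) {
        if (numChannels <= 0) {
            throw new IllegalArgumentException("at least one input channel is required");
        }
        channelWatermarks = new long[numChannels];
        Arrays.fill(channelWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one channel and returns the operator's event time. */
    long onWatermark(int channel, long timestamp) {
        channelWatermarks[channel] = Math.max(channelWatermarks[channel], timestamp);
        long min = Arrays.stream(channelWatermarks).min().getAsLong();
        operatorWatermark = Math.max(operatorWatermark, min);
        return operatorWatermark;
    }

    public static void main(String[] args) {
        MinWatermarkTracker tracker = new MinWatermarkTracker(2);
        System.out.println(tracker.onWatermark(0, 29L)); // channel 1 is still silent
        System.out.println(tracker.onWatermark(1, 14L)); // min(29, 14)
        System.out.println(tracker.onWatermark(1, 33L)); // min(29, 33)
    }
}
```

Because every upstream subtask broadcasts its watermark, each downstream subtask sees one such channel per upstream subtask, and the slowest one holds back its event time.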

Note: both timestamps and watermarks are specified in milliseconds.

2. How Flink watermarks are generated
2.1 Through a timestamp assigner / watermark generator
DataStream.assignTimestampsAndWatermarks(…)
Class dependency diagram for the watermark classes:

2.1.1 How watermark emission is driven
2.1.1.1 AssignerWithPeriodicWatermarks (periodic watermarks)
Watermarks are emitted periodically; the default interval is 200 ms.
A different interval can be set with env.getConfig().setAutoWatermarkInterval(3000); the argument is in milliseconds.
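What "periodic" means in practice can be illustrated with a plain-Java simulation. It assumes, to the best of my understanding, that the generator is polled once per interval and that a watermark is forwarded only when it is larger than the last one emitted. The class `PeriodicEmissionSketch` is hypothetical, and using the largest timestamp seen as the watermark is a deliberate simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: watermarks are produced per timer tick, not per element. */
final class PeriodicEmissionSketch {

    /** Stand-in generator state: the largest event timestamp seen so far. */
    private long maxTimestamp = Long.MIN_VALUE;

    /** Last watermark actually forwarded downstream. */
    private long lastEmitted = Long.MIN_VALUE;

    private final List<Long> emitted = new ArrayList<>();

    /** Called for every element; only updates state, emits nothing. */
    void onElement(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    /** Called once per auto-watermark interval (200 ms by default, per the text). */
    void onPeriodicTick() {
        if (maxTimestamp > lastEmitted) {
            lastEmitted = maxTimestamp;
            emitted.add(lastEmitted);
        }
    }

    List<Long> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        PeriodicEmissionSketch sketch = new PeriodicEmissionSketch();
        sketch.onElement(1_000L);
        sketch.onElement(1_500L);
        sketch.onPeriodicTick(); // one watermark for both elements
        sketch.onPeriodicTick(); // no progress, nothing emitted
        sketch.onElement(2_000L);
        sketch.onPeriodicTick();
        System.out.println(sketch.emitted());
    }
}
```

A longer interval therefore means fewer watermarks and less overhead, at the cost of windows firing later.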
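2.1.1.1.1 How are periodic watermarks used?
As the class dependency diagram above shows, Flink itself already implements three commonly used periodic watermark classes for us.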

2.1.1.1.1.1 AscendingTimestampExtractor

/**
 * A timestamp assigner and watermark generator for streams where timestamps are monotonously
 * ascending. In this case, the local watermarks for the streams are easy to generate, because
 * they strictly follow the timestamps.
 *
 * @param <T> The type of the elements that this function can extract timestamps from
 */
@PublicEvolving
public abstract class AscendingTimestampExtractor<T> implements AssignerWithPeriodicWatermarks<T> {

   private static final long serialVersionUID = 1L;

   /** The current timestamp. */
   private long currentTimestamp = Long.MIN_VALUE;

   /** Handler that is called when timestamp monotony is violated. */
   private MonotonyViolationHandler violationHandler = new LoggingHandler();


   /**
    * Extracts the timestamp from the given element. The timestamp must be monotonically increasing.
    *
    * @param element The element that the timestamp is extracted from.
    * @return The new timestamp.
    */
   public abstract long extractAscendingTimestamp(T element);

   /**
    * Sets the handler for violations to the ascending timestamp order.
    *
    * @param handler The violation handler to use.
    * @return This extractor.
    */
   public AscendingTimestampExtractor<T> withViolationHandler(MonotonyViolationHandler handler) {
      this.violationHandler = requireNonNull(handler);
      return this;
   }

   // ------------------------------------------------------------------------

   @Override
   public final long extractTimestamp(T element, long elementPrevTimestamp) {
      final long newTimestamp = extractAscendingTimestamp(element);
      if (newTimestamp >= this.currentTimestamp) {
         this.currentTimestamp = newTimestamp;
         return newTimestamp;
      } else {
         violationHandler.handleViolation(newTimestamp, this.currentTimestamp);
         return newTimestamp;
      }
   }

   @Override
   public final Watermark getCurrentWatermark() {
      return new Watermark(currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1);
   }

   // ------------------------------------------------------------------------
   //  Handling violations of monotonous timestamps
   // ------------------------------------------------------------------------

   /**
    * Interface for handlers that handle violations of the monotonous ascending timestamps
    * property.
    */
   public interface MonotonyViolationHandler extends java.io.Serializable {

      /**
       * Called when the property of monotonously ascending timestamps is violated, i.e.,
       * when {@code elementTimestamp < lastTimestamp}.
       *
       * @param elementTimestamp The tim

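The core of AscendingTimestampExtractor, as far as the listing shows it, can be mimicked without any Flink dependency. The sketch below is a hypothetical stand-in: it works on raw long timestamps instead of elements of type T, and it counts violations where the real class delegates to a MonotonyViolationHandler.

```java
/** Sketch mimicking the extractTimestamp / getCurrentWatermark logic shown above. */
final class AscendingWatermarkSketch {

    /** Largest timestamp seen so far; Long.MIN_VALUE until the first element arrives. */
    private long currentTimestamp = Long.MIN_VALUE;

    /** Number of elements that broke the ascending order (assumption: counted, not logged). */
    private int violations = 0;

    /** Accepts an element timestamp and returns it unchanged, as the listing does. */
    long extractTimestamp(long elementTimestamp) {
        if (elementTimestamp >= currentTimestamp) {
            currentTimestamp = elementTimestamp;
        } else {
            violations++;
        }
        return elementTimestamp;
    }

    /** Watermark is currentTimestamp - 1, since equal timestamps may still arrive. */
    long currentWatermark() {
        return currentTimestamp == Long.MIN_VALUE ? Long.MIN_VALUE : currentTimestamp - 1;
    }

    int violations() {
        return violations;
    }

    public static void main(String[] args) {
        AscendingWatermarkSketch sketch = new AscendingWatermarkSketch();
        for (long t : new long[] {10L, 20L, 15L}) {
            sketch.extractTimestamp(t);
            System.out.println("element " + t + " -> watermark " + sketch.currentWatermark());
        }
        System.out.println("violations: " + sketch.violations());
    }
}
```

The subtraction of 1 matters: a watermark of T promises that nothing with t <= T is still coming, and with merely non-decreasing timestamps another element at exactly the current timestamp is still possible.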