Flink's three-piece kit for handling late data

  • watermark
  • allowedLateness (how long late data is still accepted)
  • sideOutputLateData (side output for data that arrives even later)

Sample code

package com.andy.flink.demo.datastream.sideoutputs

import com.andy.flink.demo.datastream.sideoutputs.FlinkHandleLateDataTest2.SensorReading
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time


object FlinkHandleLateDataTest2 {

  // Case class that models one sensor record
  case class SensorReading(id: String,
                           timestamp: Long,
                           temperature: Double)

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    env.setParallelism( 1 )
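    // From this call on, every stream created by env uses the event-time characteristic
    env.setStreamTimeCharacteristic( TimeCharacteristic.EventTime )
    // Watermark generation interval in milliseconds: periodic assigners emit a watermark every 100 ms. This is a global setting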
    env.getConfig.setAutoWatermarkInterval( 100L )

    val inputDStream: DataStream[String] = env.socketTextStream( "localhost", 9999 )

    val dataDstream: DataStream[SensorReading] = inputDStream
      .map( data => {
        val dataArray: Array[String] = data.split( "," )
        SensorReading( dataArray( 0 ), dataArray( 1 ).toLong, dataArray( 2 ).toDouble )
      } )
      // .assignAscendingTimestamps( _.timestamp * 1000L ) // Ideal case: no delay, records arrive in ascending time order, so naming the timestamp field is enough
      .assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]
      // Watermark delay (maximum out-of-orderness); as a rule of thumb it should cover roughly 70%~80% of the delayed records
      ( Time.milliseconds( 1000 ) ) {
        // The timestamp field is in seconds, so multiply by 1000 (Flink expects milliseconds; convert your own data accordingly)
        override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
      } )

    val lateOutputTag = new OutputTag[SensorReading]( "late" )

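    // Three layers of protection for late data: watermark | allowedLateness | sideOutputLateData
    val resultDStream: DataStream[SensorReading] = dataDstream
      .keyBy( "id" ) // Group by id to form a keyed stream
      .timeWindow( Time.seconds( 5 ) ) // Tumbling window, for simplicity
      .allowedLateness( Time.minutes( 1 ) ) // Maximum allowed lateness. The window finally closes once a record's event time reaches window end + watermark delay + allowed lateness (for the first window here: 5 + 1 + 60 seconds)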
      .sideOutputLateData( lateOutputTag )
      .reduce( new MyReduceFunc() )

    dataDstream.print( "main-flow" )
    resultDStream.print( "result-flow" )
    // Fetch the "late" side output and print it
    resultDStream.getSideOutput( lateOutputTag ).print( "late-flow" )

    env.execute( "FlinkHandleLateDataTest2" )
  }
}

/**
 * Custom reduce function: the timestamp is overwritten by the newer record, and the minimum temperature is kept.
 */
class MyReduceFunc extends ReduceFunction[SensorReading] {
  override def reduce(value1: SensorReading,
                      value2: SensorReading): SensorReading = {
    SensorReading(
      value1.id,
      value2.timestamp,
      value1.temperature.min( value2.temperature )
    )
  }
}

Piece one: watermark

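The window is 5 seconds long and the watermark delay is 1 second, so the first window holds the data in [0,5); 5 itself is excluded. (The real start and end of a window are computed by a method shown further down; the window does not simply start at the first record's timestamp and run for 5 seconds.) Because the watermark lags 1 s behind the largest timestamp seen so far, the window's result is only emitted when a record with timestamp 6 s arrives. At that point the window has fired but is not yet closed, because allowedLateness is set.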
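The timing described above can be checked with a few lines of plain Java. This is only a sketch, not Flink code: it assumes the watermark equals the largest timestamp seen minus the delay (as the extractor used in the sample does) and that an event-time window [start, end) fires once the watermark reaches end - 1. The class and method names are made up for illustration.

```java
class WatermarkFiring {
    // Assumed watermark rule: largest timestamp seen so far minus the configured delay.
    static long watermark(long maxTimestampMs, long delayMs) {
        return maxTimestampMs - delayMs;
    }

    // A window [start, end) fires once the watermark reaches its last valid timestamp, end - 1.
    static boolean fires(long windowEndMs, long watermarkMs) {
        return watermarkMs >= windowEndMs - 1;
    }

    public static void main(String[] args) {
        long delayMs = 1000L;      // watermark delay from the sample
        long windowEndMs = 5000L;  // first window is [0, 5000)
        for (long tsSec = 4; tsSec <= 6; tsSec++) {
            long wm = watermark(tsSec * 1000L, delayMs);
            System.out.println("record at " + tsSec + " s -> watermark " + wm
                    + " ms, window fires: " + fires(windowEndMs, wm));
        }
    }
}
```

With second-granularity timestamps, the records at 4 s and 5 s leave the watermark below 4999 ms, and the record at 6 s is the first to push it far enough.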

Piece two: allowedLateness

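Allowed lateness is set to one minute. Within that minute (measured in event time, by the watermark), every record that arrives with a timestamp in [0,5) updates the window and produces output immediately: one late record in, one updated result out. Only after that minute is the window really closed.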
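To make the three phases concrete, here is a simplified plain-Java sketch of how a record belonging to window [0, 5000) would be treated depending on the current watermark. It follows my understanding of the checks in Flink's window operator and event-time trigger (compare the watermark with end - 1, and with end - 1 + allowedLateness); the enum and method names are hypothetical.

```java
class LatenessCheck {
    enum Outcome { BUFFERED, FIRES_IMMEDIATELY, SIDE_OUTPUT }

    // Decide what happens to a record of window [start, end) given the current watermark.
    static Outcome classify(long windowEndMs, long allowedLatenessMs, long watermarkMs) {
        long maxTimestamp = windowEndMs - 1;  // last timestamp that belongs to the window
        if (maxTimestamp + allowedLatenessMs <= watermarkMs) {
            return Outcome.SIDE_OUTPUT;       // window already closed and cleaned up
        }
        if (maxTimestamp <= watermarkMs) {
            return Outcome.FIRES_IMMEDIATELY; // window already fired, late record re-fires it
        }
        return Outcome.BUFFERED;              // window has not fired yet
    }

    public static void main(String[] args) {
        long end = 5000L, lateness = 60_000L;
        long[] watermarks = {3000L, 5000L, 64_998L, 65_000L};
        for (long wm : watermarks) {
            System.out.println("watermark " + wm + " ms -> " + classify(end, lateness, wm));
        }
    }
}
```

A watermark of 65000 ms corresponds to a record at 66 s with the 1 s watermark delay, which matches the 5 + 1 + 60 figure in the sample code's comment.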

Piece three: sideOutputLateData

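The last line of defense: once the window has closed, records that still belong to it are sent to the side output. The records that show up on the late stream can then be merged manually with the results emitted earlier.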
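What such a manual merge might look like is sketched below in plain Java. It is an illustration under assumptions, not part of the sample job: Reading is a stand-in for the SensorReading case class, and the merge rule (keep the minimum temperature, keep the larger timestamp) is my choice, modeled on MyReduceFunc.

```java
class LateMerge {
    // Plain-Java stand-in for the SensorReading case class.
    static final class Reading {
        final String id;
        final long timestamp;     // seconds, as in the sample input
        final double temperature;

        Reading(String id, long timestamp, double temperature) {
            this.id = id;
            this.timestamp = timestamp;
            this.temperature = temperature;
        }
    }

    // Fold one late record into a result that the window already emitted.
    static Reading merge(Reading emitted, Reading late) {
        if (!emitted.id.equals(late.id)) {
            throw new IllegalArgumentException("records belong to different keys");
        }
        return new Reading(
                emitted.id,
                Math.max(emitted.timestamp, late.timestamp),      // assumption: keep the newest timestamp
                Math.min(emitted.temperature, late.temperature)); // same rule as MyReduceFunc
    }

    public static void main(String[] args) {
        Reading emitted = new Reading("sensor_1", 4L, 20.0); // result printed on result-flow
        Reading late = new Reading("sensor_1", 2L, 18.5);    // record printed on late-flow
        Reading merged = merge(emitted, late);
        System.out.println(merged.id + "," + merged.timestamp + "," + merged.temperature);
    }
}
```

In a real job the emitted results and the late records would have to be matched by key and window first, for example in an external store; that part is left out here.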

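How the window boundaries are actually computed under the hood

  1. Start from where the time window is used

The late-data handling code from the sample: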

val lateOutputTag = new OutputTag[SensorReading]( "late" )

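    // Three layers of protection for late data: watermark | allowedLateness | sideOutputLateData
    val resultDStream: DataStream[SensorReading] = dataDstream
      .keyBy( "id" ) // Group by id to form a keyed stream
      .timeWindow( Time.seconds( 5 ) ) // Tumbling window, for simplicity
      .allowedLateness( Time.minutes( 1 ) ) // Maximum allowed lateness. The window finally closes once a record's event time reaches window end + watermark delay + allowed lateness (for the first window here: 5 + 1 + 60 seconds)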
      .sideOutputLateData( lateOutputTag )
      .reduce( new MyReduceFunc() )

This sets up a tumbling window with a size of 5 seconds.

  2. The Scala keyed-stream class
    KeyedStream.scala
import org.apache.flink.streaming.api.datastream.{ DataStream => JavaStream, KeyedStream => KeyedJavaStream, WindowedStream => WindowedJavaStream}

@Public
class KeyedStream[T, K](javaStream: KeyedJavaStream[T, K]) extends DataStream[T](javaStream) {

  // ------------------------------------------------------------------------
  //  Properties
  // ------------------------------------------------------------------------

  /**
   * Gets the type of the key by which this stream is keyed.
   */
  @Internal
  def getKeyType = javaStream.getKeyType()

/**
   * Windows this [[KeyedStream]] into tumbling time windows.
   *
   * This is a shortcut for either `.window(TumblingEventTimeWindows.of(size))` or
   * `.window(TumblingProcessingTimeWindows.of(size))` depending on the time characteristic
   * set using
   * [[StreamExecutionEnvironment.setStreamTimeCharacteristic()]]
   *
   * @param size The size of the window.
   */
  def timeWindow(size: Time): WindowedStream[T, K, TimeWindow] = {
    new WindowedStream(javaStream.timeWindow(size))
  }

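From this line and the import statement:
new WindowedStream(javaStream.timeWindow(size))
we can see that the argument passed to WindowedStream comes from the timeWindow method of javaStream, that is, the method defined in KeyedStream.java.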

  3. The Java keyed-stream class
    KeyedStream.java
/**
 * A {@link KeyedStream} represents a {@link DataStream} on which operator state is
 * partitioned by key using a provided {@link KeySelector}. Typical operations supported by a
 * {@code DataStream} are also possible on a {@code KeyedStream}, with the exception of
 * partitioning methods such as shuffle, forward and keyBy.
 *
 * <p>Reduce-style operations, such as {@link #reduce}, {@link #sum} and {@link #fold} work on
 * elements that have the same key.
 *
 * @param <T> The type of the elements in the Keyed Stream.
 * @param <KEY> The type of the key in the Keyed Stream.
 */
@Public
public class KeyedStream<T, KEY> extends DataStream<T> {

	/**
	 * The key selector that can get the key by which the stream if partitioned from the elements.
	 */
	private final KeySelector<T, KEY> keySelector;

	/** The type of the key by which the stream is partitioned. */
	private final TypeInformation<KEY> keyType;

	// ------------------------------------------------------------------------
	//  Windowing
	// ------------------------------------------------------------------------

	/**
	 * Windows this {@code KeyedStream} into tumbling time windows.
	 *
	 * <p>This is a shortcut for either {@code .window(TumblingEventTimeWindows.of(size))} or
	 * {@code .window(TumblingProcessingTimeWindows.of(size))} depending on the time characteristic
	 * set using
	 * {@link org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#setStreamTimeCharacteristic(org.apache.flink.streaming.api.TimeCharacteristic)}
	 *
	 * @param size The size of the window.
	 */
	public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size) {
		if (environment.getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime) {
			return window(TumblingProcessingTimeWindows.of(size));
		} else {
			return window(TumblingEventTimeWindows.of(size));
		}
	}
  4. The Java class for tumbling event-time windows
    TumblingEventTimeWindows.java
/**
 * A {@link WindowAssigner} that windows elements into windows based on the timestamp of the
 * elements. Windows cannot overlap.
 *
 * <p>For example, in order to window into windows of 1 minute:
 * <pre> {@code
 * DataStream<Tuple2<String, Integer>> in = ...;
 * KeyedStream<Tuple2<String, Integer>, String> keyed = in.keyBy(...);
 * WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowed =
 *   keyed.window(TumblingEventTimeWindows.of(Time.minutes(1)));
 * } </pre>
 */
@PublicEvolving
public class TumblingEventTimeWindows extends WindowAssigner<Object, TimeWindow> {
	private static final long serialVersionUID = 1L;

	private final long size;

	private final long offset;

	protected TumblingEventTimeWindows(long size, long offset) {
		if (Math.abs(offset) >= size) {
			throw new IllegalArgumentException("TumblingEventTimeWindows parameters must satisfy abs(offset) < size");
		}

		this.size = size;
		this.offset = offset;
	}

	@Override
	public Collection<TimeWindow> assignWindows(Object element, long timestamp, WindowAssignerContext context) {
		if (timestamp > Long.MIN_VALUE) {
			// Long.MIN_VALUE is currently assigned when no timestamp is present
			long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);
			return Collections.singletonList(new TimeWindow(start, start + size));
		} else {
			throw new RuntimeException("Record has Long.MIN_VALUE timestamp (= no timestamp marker). " +
					"Is the time characteristic set to 'ProcessingTime', or did you forget to call " +
					"'DataStream.assignTimestampsAndWatermarks(...)'?");
		}
	}

The assignWindows method shows that the window start is computed by:
long start = TimeWindow.getWindowStartWithOffset(timestamp, offset, size);

  5. The Java TimeWindow class
    TimeWindow.java

Going one step further, TimeWindow.java contains the following method:

/**
 * A {@link Window} that represents a time interval from {@code start} (inclusive) to
 * {@code end} (exclusive).
 */
@PublicEvolving
public class TimeWindow extends Window {

	private final long start;
	private final long end;

	public TimeWindow(long start, long end) {
		this.start = start;
		this.end = end;
	}
	
	/**
	 * Method to get the window start for a timestamp.
	 *
	 * @param timestamp epoch millisecond to get the window start.
	 * @param offset The offset which window start would be shifted by.
	 * @param windowSize The size of the generated windows.
	 * @return window start
	 */
	public static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
		return timestamp - (timestamp - offset + windowSize) % windowSize;
	}

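This shows that the window start is computed as:
timestamp - (timestamp - offset + windowSize) % windowSize
The inner expression (timestamp - offset + windowSize) % windowSize first subtracts the offset, then adds one window size, and finally takes the remainder modulo the window size. From this we can conclude:

  1. Adding windowSize keeps the dividend from going negative. For non-negative timestamps, timestamp - offset can drop below zero by less than one window size (the constructor enforces abs(offset) < size), and Java's % returns a negative remainder for a negative dividend.
  2. Adding one full window size before taking the modulus does not change the remainder of a non-negative value, so it has no other effect on the result.
  3. (timestamp - offset + windowSize) % windowSize is therefore the distance of timestamp past the most recent window boundary, and the whole expression timestamp - (timestamp - offset + windowSize) % windowSize strips that remainder from timestamp, rounding it down to a window boundary,
    which becomes the window start.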
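A few worked values make the three points easier to follow. The sketch below copies the formula from TimeWindow.java into a standalone class; the class name, the naive variant without the + windowSize term, and the sample timestamps are mine, chosen only for illustration.

```java
class WindowStart {
    // Same formula as TimeWindow.getWindowStartWithOffset in the source above.
    static long getWindowStartWithOffset(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    // Hypothetical variant without "+ windowSize", to show why the term is there.
    static long naiveStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset) % windowSize;
    }

    public static void main(String[] args) {
        long size = 5000L; // 5-second tumbling window, in milliseconds

        // The first record at 1 s does not open a window [1000, 6000): the start is rounded down to 0.
        System.out.println(getWindowStartWithOffset(1000L, 0L, size));   // 0
        // A record at 7 s lands in [5000, 10000).
        System.out.println(getWindowStartWithOffset(7000L, 0L, size));   // 5000

        // With offset 1000 the boundaries are ..., -4000, 1000, 6000, ...
        // A record at 500 ms must fall into [-4000, 1000).
        System.out.println(getWindowStartWithOffset(500L, 1000L, size)); // -4000
        // Without "+ windowSize" the remainder is negative and the start overshoots the record.
        System.out.println(naiveStart(500L, 1000L, size));               // 1000
    }
}
```

The last two lines show point 1 in action: with a negative dividend, the naive version returns a start of 1000, which would put a record stamped 500 ms into a window that begins after it.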