Flink:Watermark


GScallion


Category: Flink

Article link: https://blog.csdn.net/qq_24325581/article/details/117515189


I. Watermark overview and purpose

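1. Watermarks are Flink's mechanism for handling out-of-order events; they take effect only in combination with windows.
2. Events can arrive disordered in two ways: as out-of-order elements and as late elements.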
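
To make the two cases concrete, here is a minimal plain-Java sketch (no Flink dependency). It assumes the usual reading of a watermark, namely that a watermark t declares that no more events with timestamp <= t are expected. The class `EventOrderDemo`, the enum `Arrival`, and the method `classify` are hypothetical names introduced only for illustration.

```java
/** Plain-Java model (no Flink dependency); all names are illustrative. */
final class EventOrderDemo {

    enum Arrival { IN_ORDER, OUT_OF_ORDER, LATE }

    /**
     * Classifies an arriving event.
     *
     * @param eventTimestamp   event time of the arriving element
     * @param maxSeenTimestamp largest event time observed so far
     * @param currentWatermark current watermark (events <= this value are no longer expected)
     */
    static Arrival classify(long eventTimestamp, long maxSeenTimestamp, long currentWatermark) {
        if (eventTimestamp <= currentWatermark) {
            return Arrival.LATE;          // the watermark has already passed this timestamp
        }
        if (eventTimestamp < maxSeenTimestamp) {
            return Arrival.OUT_OF_ORDER;  // older than a previous event, but still ahead of the watermark
        }
        return Arrival.IN_ORDER;
    }

    public static void main(String[] args) {
        long maxSeen = 100_000L;
        long watermark = maxSeen - 30_000L; // 70_000, i.e. a 30-second bound
        System.out.println(classify(105_000L, maxSeen, watermark)); // IN_ORDER
        System.out.println(classify(80_000L, maxSeen, watermark));  // OUT_OF_ORDER
        System.out.println(classify(60_000L, maxSeen, watermark));  // LATE
    }
}
```

Out-of-order elements are absorbed by the watermark delay; late elements are only processed if the window still accepts them, for example through allowedLateness.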

II. Watermark strategy

1. Purpose of a watermark strategy

To use the watermark mechanism, you designate one field of the event as the event time and generate watermarks from that event time.

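The WatermarkStrategy interface declares two methods for this. Only createWatermarkGenerator is abstract; createTimestampAssigner has a default implementation that returns a RecordTimestampAssigner.
1. TimestampAssigner createTimestampAssigner(TimestampAssignerSupplier.Context context)
Instantiates the timestamp assigner for this strategy, i.e. the component that extracts the event time.
2. WatermarkGenerator createWatermarkGenerator(WatermarkGeneratorSupplier.Context context)
Instantiates the watermark generator for this strategy.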

2. WatermarkStrategy source code


package org.apache.flink.api.common.eventtime;

import org.apache.flink.annotation.Public;

import java.io.Serializable;
import java.time.Duration;

import static org.apache.flink.util.Preconditions.checkArgument;
import static org.apache.flink.util.Preconditions.checkNotNull;

/**
 * The WatermarkStrategy defines how to generate {@link Watermark}s in the stream sources. The
 * WatermarkStrategy is a builder/factory for the {@link WatermarkGenerator} that generates the
 * watermarks and the {@link TimestampAssigner} which assigns the internal timestamp of a record.
 *
 * <p>This interface is split into three parts: 1) methods that an implementor of this interface
 * needs to implement, 2) builder methods for building a {@code WatermarkStrategy} on a base
 * strategy, 3) convenience methods for constructing a {code WatermarkStrategy} for common built-in
 * strategies or based on a {@link WatermarkGeneratorSupplier}
 *
 * <p>Implementors of this interface need only implement {@link #createWatermarkGenerator(WatermarkGeneratorSupplier.Context)}.
 * Optionally, you can implement {@link #createTimestampAssigner(TimestampAssignerSupplier.Context)}.
 *
 * <p>The builder methods, like {@link #withIdleness(Duration)} or {@link
 * #createTimestampAssigner(TimestampAssignerSupplier.Context)} create a new {@code
 * WatermarkStrategy} that wraps and enriches a base strategy. The strategy on which the method is
 * called is the base strategy.
 *
 * <p>The convenience methods, for example {@link #forBoundedOutOfOrderness(Duration)}, create a
 * {@code WatermarkStrategy} for common built in strategies.
 *
 * <p>This interface is {@link Serializable} because watermark strategies may be shipped
 * to workers during distributed execution.
 */
@Public
public interface WatermarkStrategy<T> extends
		TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {

	// ------------------------------------------------------------------------
	//  Methods that implementors need to implement.
	// ------------------------------------------------------------------------

	/**
	 * Instantiates a WatermarkGenerator that generates watermarks according to this strategy.
	 */
	@Override
	WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);

	/**
	 * Instantiates a {@link TimestampAssigner} for assigning timestamps according to this
	 * strategy.
	 */
	@Override
	default TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
		// By default, this is {@link RecordTimestampAssigner},
		// for cases where records come out of a source with valid timestamps, for example from Kafka.
		return new RecordTimestampAssigner<>();
	}

	// ------------------------------------------------------------------------
	//  Builder methods for enriching a base WatermarkStrategy
	// ------------------------------------------------------------------------

	/**
	 * Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
	 * {@link TimestampAssigner} (via a {@link TimestampAssignerSupplier}).
	 *
	 * <p>You can use this when a {@link TimestampAssigner} needs additional context, for example
	 * access to the metrics system.
	 *
	 * <pre>
	 * {@code WatermarkStrategy<Object> wmStrategy = WatermarkStrategy
	 *   .forMonotonousTimestamps()
	 *   .withTimestampAssigner((ctx) -> new MetricsReportingAssigner(ctx));
	 * }</pre>
	 */
	default WatermarkStrategy<T> withTimestampAssigner(TimestampAssignerSupplier<T> timestampAssigner) {
		checkNotNull(timestampAssigner, "timestampAssigner");
		return new WatermarkStrategyWithTimestampAssigner<>(this, timestampAssigner);
	}

	/**
	 * Creates a new {@code WatermarkStrategy} that wraps this strategy but instead uses the given
	 * {@link SerializableTimestampAssigner}.
	 *
	 * <p>You can use this in case you want to specify a {@link TimestampAssigner} via a lambda
	 * function.
	 *
	 * <pre>
	 * {@code WatermarkStrategy<CustomObject> wmStrategy = WatermarkStrategy
	 *   .forMonotonousTimestamps()
	 *   .withTimestampAssigner((event, timestamp) -> event.getTimestamp());
	 * }</pre>
	 */
	default WatermarkStrategy<T> withTimestampAssigner(SerializableTimestampAssigner<T> timestampAssigner) {
		checkNotNull(timestampAssigner, "timestampAssigner");
		return new WatermarkStrategyWithTimestampAssigner<>(this,
				TimestampAssignerSupplier.of(timestampAssigner));
	}

	/**
	 * Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
	 * created {@link WatermarkGenerator}.
	 *
	 * <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
	 * stream for that amount of time, then that partition is considered "idle" and will not hold
	 * back the progress of watermarks in downstream operators.
	 *
	 * <p>Idleness can be important if some partitions have little data and might not have events
	 * during some periods. Without idleness, these streams can stall the overall event time
	 * progress of the application.
	 */
	default WatermarkStrategy<T> withIdleness(Duration idleTimeout) {
		checkNotNull(idleTimeout, "idleTimeout");
		checkArgument(!(idleTimeout.isZero() || idleTimeout.isNegative()),
				"idleTimeout must be greater than zero");
		return new WatermarkStrategyWithIdleness<>(this, idleTimeout);
	}

	// ------------------------------------------------------------------------
	//  Convenience methods for common watermark strategies
	// ------------------------------------------------------------------------

	/**
	 * Creates a watermark strategy for situations with monotonously ascending timestamps.
	 *
	 * <p>The watermarks are generated periodically and tightly follow the latest
	 * timestamp in the data. The delay introduced by this strategy is mainly the periodic interval
	 * in which the watermarks are generated.
	 *
	 * @see AscendingTimestampsWatermarks
	 */
	static <T> WatermarkStrategy<T> forMonotonousTimestamps() {
		return (ctx) -> new AscendingTimestampsWatermarks<>();
	}

	/**
	 * Creates a watermark strategy for situations where records are out of order, but you can place
	 * an upper bound on how far the events are out of order. An out-of-order bound B means that
	 * once the an event with timestamp T was encountered, no events older than {@code T - B} will
	 * follow any more.
	 *
	 * <p>The watermarks are generated periodically. The delay introduced by this watermark
	 * strategy is the periodic interval length, plus the out of orderness bound.
	 *
	 * @see BoundedOutOfOrdernessWatermarks
	 */
	static <T> WatermarkStrategy<T> forBoundedOutOfOrderness(Duration maxOutOfOrderness) {
		return (ctx) -> new BoundedOutOfOrdernessWatermarks<>(maxOutOfOrderness);
	}

	/**
	 * Creates a watermark strategy based on an existing {@link WatermarkGeneratorSupplier}.
	 */
	static <T> WatermarkStrategy<T> forGenerator(WatermarkGeneratorSupplier<T> generatorSupplier) {
		return generatorSupplier::createWatermarkGenerator;
	}

	/**
	 * Creates a watermark strategy that generates no watermarks at all. This may be useful in
	 * scenarios that do pure processing-time based stream processing.
	 */
	static <T> WatermarkStrategy<T> noWatermarks() {
		return (ctx) -> new NoWatermarksGenerator<>();
	}

}

3. WatermarkStrategy test case

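Note: always set withIdleness on the strategy. Otherwise, when there are multiple partitions, a single partition without data prevents the watermark from advancing.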

SingleOutputStreamOperator<Tuple2<String, Long>> watermarkDS = inputMapDS
                .assignTimestampsAndWatermarks(new MyWatermarkStrategy().withIdleness(Duration.ofSeconds(10))); // 30 s of allowed out-of-orderness (set in the generator), 10 s idle timeout

package com.scallion.transform;

import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.java.tuple.Tuple2;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: watermark strategy
 * origin ->
 */
public class MyWatermarkStrategy implements WatermarkStrategy<Tuple2<String, Long>> {

    // extract the event time
    @Override
    public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new MyTimestampAssigner(); // custom TimestampAssigner implementation
    }

    // emit watermarks
    @Override
    public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new BoundedOutOfOrdernessGenerator(); // custom WatermarkGenerator implementation
    }

}

III. Custom event-time assigner

When processing data in event time, every element must be given an event-time timestamp explicitly. To do so, implement the TimestampAssigner interface.

1. TimestampAssigner interface source code


package org.apache.flink.api.common.eventtime;

import org.apache.flink.annotation.Public;

/**
 * A {@code TimestampAssigner} assigns event time timestamps to elements.
 * These timestamps are used by all functions that operate on event time,
 * for example event time windows.
 *
 * <p>Timestamps can be an arbitrary {@code long} value, but all built-in implementations
 * represent it as the milliseconds since the Epoch (midnight, January 1, 1970 UTC),
 * the same way as {@link System#currentTimeMillis()} does it.
 *
 * @param <T> The type of the elements to which this assigner assigns timestamps.
 */
@Public
@FunctionalInterface
public interface TimestampAssigner<T> {

	/**
	 * The value that is passed to {@link #extractTimestamp} when there is no previous timestamp
	 * attached to the record.
	 */
	long NO_TIMESTAMP = Long.MIN_VALUE;

	/**
	 * Assigns a timestamp to an element, in milliseconds since the Epoch. This is independent of
	 * any particular time zone or calendar.
	 *
	 * <p>The method is passed the previously assigned timestamp of the element.
	 * That previous timestamp may have been assigned from a previous assigner. If the element did
	 * not carry a timestamp before, this value is {@link #NO_TIMESTAMP} (= {@code Long.MIN_VALUE}:
	 * {@value Long#MIN_VALUE}).
	 *
	 * @param element The element that the timestamp will be assigned to.
	 * @param recordTimestamp The current internal timestamp of the element,
	 *                         or a negative value, if no timestamp has been assigned yet.
	 * @return The new timestamp.
	 */
	long extractTimestamp(T element, long recordTimestamp);
}

2. TimestampAssigner test case

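public class MyWatermarkStrategy implements WatermarkStrategy<Tuple2<String, Long>> {
    // extract the event time
    @Override
    public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new MyTimestampAssigner(); // custom TimestampAssigner implementation (shown below)
    }

    // createWatermarkGenerator(...) is omitted in this excerpt; see the full MyWatermarkStrategy class above
}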

package com.scallion.transform;

import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.java.tuple.Tuple2;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: assigns the event time
 * origin ->
 */
public class MyTimestampAssigner implements TimestampAssigner<Tuple2<String, Long>> {
    @Override
    public long extractTimestamp(Tuple2<String, Long> event, long l) {
        return event.f1;
    }
}

IV. Custom watermark generator

1. WatermarkGenerator interface source code

A custom watermark generator implements the WatermarkGenerator interface and fills in its methods according to the business logic.

package org.apache.flink.api.common.eventtime;

import org.apache.flink.annotation.Public;
import org.apache.flink.api.common.ExecutionConfig;

/**
 * The {@code WatermarkGenerator} generates watermarks either based on events or
 * periodically (in a fixed interval).
 *
 * <p><b>Note:</b> This WatermarkGenerator subsumes the previous distinction between the
 * {@code AssignerWithPunctuatedWatermarks} and the {@code AssignerWithPeriodicWatermarks}.
 */
@Public
public interface WatermarkGenerator<T> {

	/**
	 * Called for every event, allows the watermark generator to examine and remember the
	 * event timestamps, or to emit a watermark based on the event itself.
	 */
	void onEvent(T event, long eventTimestamp, WatermarkOutput output);

	/**
	 * Called periodically, and might emit a new watermark, or not.
	 *
	 * <p>The interval in which this method is called and Watermarks are generated
	 * depends on {@link ExecutionConfig#getAutoWatermarkInterval()}.
	 */
	void onPeriodicEmit(WatermarkOutput output);
}

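There are essentially two styles of watermark generation: periodic and punctuated.
1. A periodic generator usually observes incoming events through onEvent() and emits a watermark when the framework calls onPeriodicEmit().
2. A punctuated generator inspects events in onEvent() and waits for special marker events or punctuations that carry watermark information in the stream. When it sees one, it emits a watermark immediately. Punctuated generators usually do not emit watermarks from onPeriodicEmit().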

2. Custom periodic watermark generator

package com.scallion.transform;

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.java.tuple.Tuple2;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: watermark generator for data that is out of order within a bounded delay
 */
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<Tuple2<String, Long>> {
    private final long maxOutOfOrderness = 30000; // allow 30 seconds of out-of-orderness
    private long currentMaxTimestamp;

    @Override
    public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness));
    }
}
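
The arithmetic of this generator can be traced without a Flink cluster. The following plain-Java sketch mirrors the two callbacks of the class above. `BoundedWatermarkDemo` and its methods are hypothetical names, and the watermark is polled after every event purely for illustration, whereas Flink calls onPeriodicEmit() on a timer.

```java
/** Plain-Java model of the generator above (no Flink dependency); all names are illustrative. */
final class BoundedWatermarkDemo {
    private final long maxOutOfOrderness; // bound in milliseconds
    private long currentMaxTimestamp;     // starts at 0, as in the class above

    BoundedWatermarkDemo(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    /** Mirrors onEvent(): remember the largest timestamp seen so far. */
    void onEvent(long eventTimestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    /** Mirrors onPeriodicEmit(): the watermark trails the largest timestamp by the bound. */
    long currentWatermark() {
        return currentMaxTimestamp - maxOutOfOrderness;
    }

    public static void main(String[] args) {
        BoundedWatermarkDemo generator = new BoundedWatermarkDemo(30_000L);
        long[] timestamps = {1_000L, 45_000L, 20_000L, 61_000L}; // 20_000 arrives out of order
        for (long ts : timestamps) {
            generator.onEvent(ts);
            System.out.println("event=" + ts + " watermark=" + generator.currentWatermark());
        }
        // The watermarks are -29000, 15000, 15000, 31000: the out-of-order event does not move them back.
    }
}
```

Because only the maximum timestamp is tracked, an out-of-order event never lowers the watermark; it simply fails to advance it.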

/**
 * This generator emits watermarks that lag behind processing time by a fixed amount. It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements WatermarkGenerator<MyEvent> {

    private final long maxTimeLag = 5000; // 5 seconds

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        // nothing to do: this generator relies on processing time only
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(System.currentTimeMillis() - maxTimeLag));
    }
}

3. Custom punctuated watermark generator

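A punctuated watermark generator observes the events in the stream and emits a watermark whenever it sees a special element that carries watermark information.
The following shows how to implement a punctuated generator that emits a watermark whenever an event carries a certain marker: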

public class PunctuatedAssigner implements WatermarkGenerator<MyEvent> {

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        if (event.hasWatermarkMarker()) {
            output.emitWatermark(new Watermark(event.getWatermarkTimestamp()));
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // nothing to do: watermarks are already emitted in onEvent
    }
}

V. Where to apply a watermark strategy

1. On the Kafka connector

I have rarely needed this in production; I will write it up later.

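2. On an operator

Call assignTimestampsAndWatermarks(...) on a DataStream, as in the MyWatermarkStrategy test case above and in the complete test code at the end of this article.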

VI. Things to watch out for when using watermarks

1. Window trigger conditions

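Note that incoming data and generated watermarks are not enough on their own to trigger a window computation. Both of the following conditions must hold at the same time:
1. watermark >= window_end_time (strictly, Flink's event-time trigger compares against the window's maximum timestamp, which is window_end_time - 1 ms)
2. The window contains data, i.e. at least one element has a timestamp in [window_start_time, window_end_time)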

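The reasoning is straightforward: a window can only fire once all data belonging to it has arrived. A watermark with time t asserts that all data with timestamps up to t has arrived, so once the watermark reaches window_end_time, all data before the end of the window is assumed to be present.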
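
A small worked example may help. It assumes a 30-second tumbling window with zero offset and the 30-second out-of-orderness bound used in this article. As far as I know, Flink's EventTimeTrigger compares the watermark against the window's maximum timestamp (window_end_time - 1 ms), and that is what the sketch models. `WindowTriggerDemo` and its methods are hypothetical names; this is plain Java, not Flink's implementation.

```java
/** Plain-Java model of tumbling-window arithmetic (no Flink dependency); all names are illustrative. */
final class WindowTriggerDemo {

    /** Start of the zero-offset tumbling window that contains the timestamp. */
    static long windowStart(long timestamp, long windowSizeMillis) {
        return timestamp - Math.floorMod(timestamp, windowSizeMillis);
    }

    /** Exclusive end of that window. */
    static long windowEnd(long timestamp, long windowSizeMillis) {
        return windowStart(timestamp, windowSizeMillis) + windowSizeMillis;
    }

    /** Condition 1: the watermark has reached the window's maximum timestamp (end - 1). */
    static boolean watermarkPassed(long watermark, long windowEnd) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long size = 30_000L;             // 30-second tumbling window
        long bound = 30_000L;            // out-of-orderness bound of the generator
        long event = 1_487_225_040_000L; // sample timestamp used in the article's FlinkUtil.getText()

        long start = windowStart(event, size);
        long end = windowEnd(event, size);
        System.out.println("window = [" + start + ", " + end + ")");

        // With watermark = maxTimestamp - bound, the window fires once some event
        // carries a timestamp of at least end - 1 + bound.
        long needed = end - 1 + bound;
        System.out.println("fires once an event with timestamp >= " + needed + " arrives");
        System.out.println(watermarkPassed(needed - bound, end)); // true
    }
}
```

For the sample event the window is [1487225040000, 1487225070000), so nothing is emitted until an event with a timestamp of at least 1487225099999 pushes the watermark far enough.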

2. Handling idle sources

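This is a serious problem, and I ran into it in the complete test case at the end of this article. I had set the downstream parallelism to 10, i.e. env.setParallelism(10), but the source ran in a single thread, so only 1 of the 10 downstream parallel subtasks received data.
A downstream operator computes its watermark as the minimum of the watermarks of all its parallel upstream inputs, so in this setup its watermark never advanced.
As a result, no matter how large I made the time field in my test data, no computation was triggered. It only fired normally after I changed the downstream parallelism to 1, i.e. env.setParallelism(1).
The proper way to handle idle sources, as recommended by the official documentation, is idleness detection:
.withIdleness(Duration.ofSeconds(10))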

SingleOutputStreamOperator<Tuple2<String, Long>> watermarkDS = inputMapDS
                .assignTimestampsAndWatermarks(new MyWatermarkStrategy().withIdleness(Duration.ofSeconds(10))); // 30 s of allowed out-of-orderness, 10 s idle timeout

In other words, if a partition produces no data for 10 seconds, it is marked idle and left out of the watermark calculation.
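
The effect of idleness on the minimum can be sketched in plain Java. This is a simplified model rather than Flink's actual implementation: `MinWatermarkDemo` and `combinedWatermark` are hypothetical names, and returning an empty result when every input is idle is a choice made only for this sketch. The value -30000 is what the article's generator reports before it has seen any event (0 - 30000).

```java
import java.util.OptionalLong;

/** Plain-Java model of watermark merging across parallel inputs (no Flink dependency). */
final class MinWatermarkDemo {

    /**
     * Returns the minimum watermark over all inputs that are not idle,
     * or empty if every input is idle (a simplification chosen for this sketch).
     */
    static OptionalLong combinedWatermark(long[] watermarks, boolean[] idle) {
        OptionalLong min = OptionalLong.empty();
        for (int i = 0; i < watermarks.length; i++) {
            if (idle[i]) {
                continue; // idle inputs do not hold back the result
            }
            if (!min.isPresent() || watermarks[i] < min.getAsLong()) {
                min = OptionalLong.of(watermarks[i]);
            }
        }
        return min;
    }

    public static void main(String[] args) {
        // Input 0 receives data; inputs 1 and 2 have never seen an event.
        long[] watermarks = {1_487_225_070_000L, -30_000L, -30_000L};

        boolean[] noneIdle = {false, false, false};
        System.out.println(combinedWatermark(watermarks, noneIdle).getAsLong());   // -30000: stuck

        boolean[] silentIdle = {false, true, true};
        System.out.println(combinedWatermark(watermarks, silentIdle).getAsLong()); // 1487225070000: advances
    }
}
```

Without idleness detection the silent inputs pin the combined watermark at their initial value; once they are marked idle, the one active input drives event time forward.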

VII. Complete test code

1. Entry class

package com.scallion.entry;

import com.scallion.job.TumblingWindowAccumulatingJob;
import com.scallion.utils.FlinkUtil;

/**
 * created by gaowj.
 * created on 2021-05-28.
 * function: watermark  allowedLateness  RichWindowFunction
 * origin -> https://blog.csdn.net/lmalds/article/details/55259718
 */
public class TumblingWindowAccumulatingTest {
    public static void main(String[] args) {
        FlinkUtil.run(new TumblingWindowAccumulatingJob());
        FlinkUtil.execution("TumblingWindowAccumulatingTest");
    }
}

1.1. Flink utility class

package com.scallion.utils;

import com.scallion.common.Common;
import com.scallion.job.Job;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

import java.util.Properties;
import java.util.concurrent.TimeUnit;

/**
 * created by gaowj.
 * created on 2021-03-01.
 * function: Flink environment configuration and source-data access
 * origin ->
 */
public class FlinkUtil {
    private static StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    static {
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.enableCheckpointing(1000);
        env.setParallelism(10);
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3,
                Time.of(10, TimeUnit.SECONDS)
        ));
    }

    public static void run(Job job) {
        job.run();
    }

    public static DataStream<String> getSocketTextStream(String ip, int port) {
        DataStreamSource<String> source = env.socketTextStream(ip, port);
//        DataStreamSource<String> source = env.fromElements("a a a b c c c d d f h");
        return source;
    }

    public static DataStream<String> getSocketTextStream() {
        DataStreamSource<String> source = env.socketTextStream(Common.SOCKET_IP, Common.SOCKET_PORT);
        return source;
    }

    public static DataStream<String> getText() {
        DataStreamSource<String> source = env.fromElements("key2,1487225040000");
        return source;
    }

    public static DataStream<String> getKafkaStream(String broker, String topic, String groupId) {
        Properties prop = new Properties();
        prop.setProperty("bootstrap.servers", broker);
        prop.setProperty("group.id", groupId);
        FlinkKafkaConsumer011<String> kafkaConsumer = new FlinkKafkaConsumer011<>(topic, new SimpleStringSchema(), prop);
        kafkaConsumer.setStartFromLatest();
        DataStreamSource<String> source = env.addSource(kafkaConsumer).setParallelism(1);
        return source;
    }

    public static void execution(String jobName) {
        try {
            env.execute(jobName);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. Job class

package com.scallion.job;

import com.scallion.transform.MyWatermarkStrategy;
import com.scallion.transform.WordCountProcessWindowFunction;
import com.scallion.utils.FlinkUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

/**
 * created by gaowj.
 * created on 2021-05-28.
 * function: watermark  allowedLateness  RichWindowFunction
 * origin -> https://blog.csdn.net/lmalds/article/details/55259718
 */
public class TumblingWindowAccumulatingJob implements Job {
    @Override
    public void run() {
        /**
         * Source
         */
        DataStream<String> sourceDS = FlinkUtil.getSocketTextStream();
        /**
         * Transform
         */
        SingleOutputStreamOperator<Tuple2<String, Long>> inputMapDS = sourceDS
                .map(new MapFunction<String, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(String input) throws Exception {
                        String[] split = input.split(",");
                        return new Tuple2<>(split[0], Long.parseLong(split[1]));
                    }
                });
        SingleOutputStreamOperator<Tuple2<String, Long>> watermarkDS = inputMapDS
                .assignTimestampsAndWatermarks(new MyWatermarkStrategy().withIdleness(Duration.ofSeconds(10))); // 30 s of allowed out-of-orderness, 10 s idle timeout

        SingleOutputStreamOperator<String> outDS = watermarkDS
                .keyBy(tuple -> tuple.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(30))) // 30-second tumbling window
                .allowedLateness(Time.seconds(60)) // accept late data for up to 60 seconds
                .process(new WordCountProcessWindowFunction());
        /**
         * Sink
         */
        outDS.print();
    }
}

3. Watermark strategy class

package com.scallion.transform;

import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.java.tuple.Tuple2;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: watermark strategy
 * origin ->
 */
public class MyWatermarkStrategy implements WatermarkStrategy<Tuple2<String, Long>> {

    // extract the event time
    @Override
    public TimestampAssigner<Tuple2<String, Long>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
        return new MyTimestampAssigner(); // custom TimestampAssigner implementation
    }

    // emit watermarks
    @Override
    public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
        return new BoundedOutOfOrdernessGenerator(); // custom WatermarkGenerator implementation
    }

}

4. Event-time assigner

package com.scallion.transform;

import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.java.tuple.Tuple2;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: assigns the event time
 * origin ->
 */
public class MyTimestampAssigner implements TimestampAssigner<Tuple2<String, Long>> {
    @Override
    public long extractTimestamp(Tuple2<String, Long> event, long l) {
        return event.f1;
    }
}

5. Watermark generator

package com.scallion.transform;

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.java.tuple.Tuple2;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: watermark generator for data that is out of order within a bounded delay
 */
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<Tuple2<String, Long>> {
    private final long maxOutOfOrderness = 30000; // allow 30 seconds of out-of-orderness
    private long currentMaxTimestamp;

    @Override
    public void onEvent(Tuple2<String, Long> event, long eventTimestamp, WatermarkOutput output) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness));
    }
}

6. Window function

package com.scallion.transform;

import com.scallion.utils.TimeUtil;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.collect.Iterables;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * created by gaowj.
 * created on 2021-06-01.
 * function: counts the elements in the window
 * origin ->
 */
public class WordCountProcessWindowFunction extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
    ValueState<Integer> state;
    int count;

    @Override
    public void open(Configuration parameters) throws Exception {
        // this state is global for the key: it is not cleared when a window ends
        state = getRuntimeContext().getState(new ValueStateDescriptor<Integer>("WordCountProcess test", Integer.class));
    }

    @Override
    public void process(String key, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> collector) throws Exception {
        int inputSize = Iterables.size(elements);
        if (state.value() == null)
            count = inputSize;
        else
            count = state.value() + inputSize;
        state.update(count);
        String res = "key:" + key + "\t"
                + "windowStart:" + TimeUtil.getTimestampToDate(context.window().getStart()) + "\t"
                + "windowEnd:" + TimeUtil.getTimestampToDate(context.window().getEnd()) + "\t"
                + "inputSize:" + inputSize + "\t"
                + "allCount:" + count;
        collector.collect(res);
    }
}

VIII. References

Flink–WaterMark理解和实践 (Flink WaterMark: understanding and practice)
生成 Watermark (Generating Watermarks)
