Flink Review Notes (9): Advanced Flink API Development — WaterMark, One of Flink's Four Cornerstones

}

    /**
     * Water level sensor POJO used to receive water level readings
     */
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    private static class WaterSensor {
        private String id;  // sensor id
        private long ts;    // event time (seconds)
        private Integer vc; // water level
    }
}


Note: from Flink 1.12 onward, the fixed-delay (bounded out-of-orderness) watermark is added to an out-of-order stream as follows:



SingleOutputStreamOperator<WaterSensor> waterMarkStream = lines
        .assignTimestampsAndWatermarks(
                // fixed-delay watermark; the delay is given as a Duration
                WatermarkStrategy.<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                        .withTimestampAssigner(
                                new SerializableTimestampAssigner<WaterSensor>() {
                                    @Override
                                    public long extractTimestamp(WaterSensor waterSensor, long l) {
                                        return waterSensor.getTs() * 1000L;
                                    }
                                }));


Result:



Case 1: input with a single key
sensor_6,1547718201,15 -> 2019-01-17 17:43:21
sensor_6,1547718205,15 -> 2019-01-17 17:43:25
sensor_6,1547718210,15 -> 2019-01-17 17:43:30

Output:
data>>>> OutOfOrdernessWaterMark.WaterSensor(id=sensor_6, ts=1547718201, vc=15)
data>>>> OutOfOrdernessWaterMark.WaterSensor(id=sensor_6, ts=1547718205, vc=15)
data>>>> OutOfOrdernessWaterMark.WaterSensor(id=sensor_6, ts=1547718210, vc=15)
key: sensor_6
records: [OutOfOrdernessWaterMark.WaterSensor(id=sensor_6, ts=1547718201, vc=15)]
count: 1
window: 1547718200000->1547718205000
(2019-01-17 17:43:20 -> 2019-01-17 17:43:25)


Summary:


* 1- When the input contains different keys, the window firing is usually observed as the key changes; you do not need two records with the same key as a reference.
* 2- Because a delay is configured, at the moment a window fires: event time - delay = watermark time.


    + (In the example, 3 records were printed; the 3rd record triggers the computation. Its watermark, in seconds, is 30 - 3 = 27 >= the window endTime.)
    + A window fires on two conditions: the watermark has reached the window endTime, and the window contains data.
* 3- API differences between versions (a comparison sketch follows this list):


    + Before Flink 1.12: call assignTimestampsAndWatermarks and new a BoundedOutOfOrdernessTimestampExtractor;

      remember to set the `delay` and override the method that extracts the event time.
    + From Flink 1.12 on: call assignTimestampsAndWatermarks with the bounded-out-of-orderness (fixed-delay) watermark strategy:


        - WatermarkStrategy.forBoundedOutOfOrderness(delay as a `Duration`).withTimestampAssigner
        - new a SerializableTimestampAssigner and override the method that extracts the timestamp.
* 4- Use case: the fixed-delay watermark handles out-of-order data.
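
To make the version difference concrete, here is a minimal, hedged sketch of both styles side by side. It assumes the WaterSensor POJO and the `lines` stream from the example above; the extra imports needed for the legacy style are noted in the comments.

// extra imports needed for the legacy (pre-1.12) style:
// import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
// import org.apache.flink.streaming.api.windowing.time.Time;

// Flink < 1.12 style: extend BoundedOutOfOrdernessTimestampExtractor with a fixed delay
SingleOutputStreamOperator<WaterSensor> legacyStream = lines
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<WaterSensor>(Time.seconds(3)) {
                    @Override
                    public long extractTimestamp(WaterSensor sensor) {
                        return sensor.getTs() * 1000L; // event time in milliseconds
                    }
                });

// Flink >= 1.12 style: WatermarkStrategy + SerializableTimestampAssigner (here as a lambda)
SingleOutputStreamOperator<WaterSensor> newStream = lines
        .assignTimestampsAndWatermarks(
                WatermarkStrategy.<WaterSensor>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                        .withTimestampAssigner((sensor, recordTs) -> sensor.getTs() * 1000L));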




---


###### (3) Watermarks in Parallel Data Streams


The alignment mechanism takes the smallest watermark across all input channels, that is:


every parallel subtask must have received data, and each must satisfy the window-trigger condition, before the aligned (operator-wide) watermark can advance.
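
As a plain-Java illustration of the rule (conceptual, not Flink API), the operator-level watermark is simply the minimum over its input channels:

// watermarks last received from two upstream subtasks (milliseconds)
long[] channelWatermarks = {27_000L, 14_000L};
// the downstream operator's event-time clock advances only to the minimum -> 14_000L
long operatorWatermark = Math.min(channelWatermarks[0], channelWatermarks[1]);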


Example: set the parallelism to 2 and print the thread id of each subtask.



package cn.itcast.day09.WaterMark;

/**
 * @author lql
 * @time 2024-03-02 19:27:53
 * @description TODO: demo of watermarks with multiple parallel subtasks
 */

import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.Iterator;

/**
 * Test data (parallelism set to 2):
 * hadoop,1626934802000 -> 2021-07-22 14:20:02
 * hadoop,1626934805000 -> 2021-07-22 14:20:05
 * hadoop,1626934806000 -> 2021-07-22 14:20:06
 */
public class Watermark_Parallelism {
    // date format used when printing records
    private static final SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

    public static void main(String[] args) throws Exception {
        // todo 1) streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        // todo 2) source
        SingleOutputStreamOperator<Tuple2<String, Long>> tupleDataStream = env.socketTextStream("node1", 9999).map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String line) throws Exception {
                try {
                    String[] array = line.split(",");
                    return Tuple2.of(array[0], Long.parseLong(array[1]));
                } catch (NumberFormatException e) {
                    System.out.println("Malformed input record: " + line);
                    return Tuple2.of("null", 0L);
                }
            }
        }).filter(new FilterFunction<Tuple2<String, Long>>() {
            @Override
            public boolean filter(Tuple2<String, Long> tuple) throws Exception {
                return !tuple.f0.equals("null") && tuple.f1 != 0L;
            }
        });

        // todo 3) assign timestamps and watermarks
        SingleOutputStreamOperator<Tuple2<String, Long>> waterMarkDataStream = tupleDataStream.assignTimestampsAndWatermarks(
                //TODO custom watermark generator
                WatermarkStrategy.forGenerator(new WatermarkGeneratorSupplier<Tuple2<String, Long>>() {
                    @Override
                    public WatermarkGenerator<Tuple2<String, Long>> createWatermarkGenerator(Context context) {
                        return new MyWatermarkGenerator<>();
                    }
                }).withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                        // take the eventTime carried by the record
                        Long timestamp = element.f1;

                        // print key, thread id and event time
                        System.out.println("key: " + element.f0 + ", thread id: " + Thread.currentThread().getId() +
                                ", event time: [" + sdf.format(timestamp) + "]");
                        return timestamp;
                    }
                })
        );

        // todo 4) key by and window
        SingleOutputStreamOperator<String> result = waterMarkDataStream
                .keyBy(0)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        // collect all event-time fields that fell into this window
                        ArrayList<Long> timeArr = new ArrayList<>();
                        // get the iterator of the window's input
                        Iterator<Tuple2<String, Long>> iterator = input.iterator();
                        while (iterator.hasNext()) {
                            Tuple2<String, Long> tuple2 = iterator.next();
                            timeArr.add(tuple2.f1);
                        }
                        // sort the collected event times
                        Collections.sort(timeArr);
                        // describe this window firing
                        String outputData = "" +
                                "\n key: [" + tuple + "]" +
                                "\n number of records in the window: [" + timeArr.size() + "]" +
                                "\n latest record in the window: " + sdf.format(new Date(timeArr.get(timeArr.size() - 1))) +
                                "\n window start and end: " + sdf.format(new Date(window.getStart())) + "----->" +
                                sdf.format(new Date(window.getEnd()));
                        out.collect(outputData);
                    }
                });

        //TODO 6) print the results
        result.printToErr("window result>>>");

        //TODO 7) submit the job
        env.execute();
    }

    public static class MyWatermarkGenerator<T> implements WatermarkGenerator<T> {
        // largest timestamp seen so far
        private long maxTimestamp = -1L;

        /**
         * Called for every record
         * @param event
         * @param eventTimestamp
         * @param output
         */
        @Override
        public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
            //System.out.println("on Event...");
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        }

        /**
         * Called periodically, every 200 ms by default
         * @param output
         */
        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            //System.out.println("on Periodic..." + System.currentTimeMillis());
            // emit the watermark
            output.emitWatermark(new Watermark(maxTimestamp - 1L));
        }
    }
}


Result:



Input:
* hadoop,1626934802000 -> 2021-07-22 14:20:02
* hadoop,1626934805000 -> 2021-07-22 14:20:05
* hadoop,1626934806000 -> 2021-07-22 14:20:06

Output:
key: hadoop, thread id: 68, event time: [2021-07-22 14:20:02.000]
key: hadoop, thread id: 69, event time: [2021-07-22 14:20:05.000]
key: hadoop, thread id: 68, event time: [2021-07-22 14:20:06.000]
window result>>>:2>
 key: [(hadoop)]
 number of records in the window: [1]
 latest record in the window: 2021-07-22 14:20:02.000
 window start and end: 2021-07-22 14:20:00.000----->2021-07-22 14:20:05.000


Summary:


* 1- Get the current thread id with Thread.currentThread().getId().
* 2- Build a custom date format with new SimpleDateFormat().
* 3- The window input is an Iterable; call iterator() to obtain an iterator and use a while loop with hasNext() to add each element to a list (see the small sketch after this list).
* 4- Collections.sort(list) sorts the list in place.
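
A tiny standalone sketch of points 3 and 4 (class and variable names are illustrative, not from the original example):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

public class IterableSortSketch {
    public static void main(String[] args) {
        Iterable<Long> input = Arrays.asList(3L, 1L, 2L); // stands in for the window's Iterable
        ArrayList<Long> times = new ArrayList<>();
        Iterator<Long> it = input.iterator();
        while (it.hasNext()) {       // iterate with hasNext()/next()
            times.add(it.next());    // collect into a list
        }
        Collections.sort(times);     // sorts in place -> [1, 2, 3]
        System.out.println(times);
    }
}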




---


###### (4) Custom Watermarks


**A. Periodic watermark**



package cn.itcast.day09.WaterMark;

/**
 * @author lql
 * @time 2024-03-02 17:07:17
 * @description TODO
 */

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import org.apache.flink.api.common.eventtime.*;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * Custom periodic watermark. The built-in watermarks (bounded out-of-orderness and
 * monotonously increasing) are both built on periodic watermarks; by default a
 * watermark is generated every 200 ms.
 */
public class WaterMark_Periodic {
    public static void main(String[] args) throws Exception {
        // todo 1) streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        // todo 2) source
        SingleOutputStreamOperator<WaterSensor> lines = env.socketTextStream("node1", 9999).map(new MapFunction<String, WaterSensor>() {
            @Override
            public WaterSensor map(String value) throws Exception {
                String[] data = value.split(",");
                return new WaterSensor(data[0], Long.valueOf(data[1]), Integer.parseInt(data[2]));
            }
        });

        // todo 3) assign timestamps and watermarks
        SingleOutputStreamOperator<WaterSensor> sensorDS = lines.assignTimestampsAndWatermarks(WatermarkStrategy.forGenerator(
                new WatermarkGeneratorSupplier<WaterSensor>() {
                    @Override
                    public WatermarkGenerator<WaterSensor> createWatermarkGenerator(Context context) {
                        return new MyWatermarkGenerator<>();
                    }
                }).withTimestampAssigner(new SerializableTimestampAssigner<WaterSensor>() {
            @Override
            public long extractTimestamp(WaterSensor waterSensor, long l) {
                return waterSensor.getTs() * 1000L;
            }
        }));

        // todo 4) key by
        KeyedStream<WaterSensor, String> sensorKS = sensorDS.keyBy(t -> t.getId());

        // todo 5) window
        WindowedStream<WaterSensor, String, TimeWindow> sensorWS = sensorKS.window(TumblingEventTimeWindows.of(Time.seconds(10)));

        // todo 6) custom window function
        SingleOutputStreamOperator<String> result = sensorWS.process(new ProcessWindowFunction<WaterSensor, String, String, TimeWindow>() {
            @Override
            public void process(String s, Context context, Iterable<WaterSensor> iterable, Collector<String> out) throws Exception {
                out.collect("key: " + s + "\n" +
                        "records: " + iterable + "\n" +
                        "count: " + iterable.spliterator().estimateSize() + "\n" +
                        "window: " + context.window().getStart() + "->" + context.window().getEnd() + "\n");
            }
        });

        // todo 7) print and execute
        result.print();
        env.execute();
    }

    /**
     * Water level sensor POJO used to receive water level readings
     */
    @Data
    @AllArgsConstructor
    @NoArgsConstructor
    private static class WaterSensor {
        private String id;  // sensor id
        private long ts;    // event time (seconds)
        private Integer vc; // water level
    }

    private static class MyWatermarkGenerator<T> implements WatermarkGenerator<T> {
        private long maxTimestamp = -1L;

        /**
         * Called for every record
         * @param event
         * @param eventTimestamp
         * @param watermarkOutput
         */
        @Override
        public void onEvent(T event, long eventTimestamp, WatermarkOutput watermarkOutput) {
            System.out.println("onEvent......");
            maxTimestamp = Math.max(eventTimestamp, maxTimestamp);
        }

        /**
         * Called periodically
         * @param watermarkOutput
         */
        @Override
        public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
            System.out.println("onPeriodicEmit......" + System.currentTimeMillis());
            // emit the watermark
            watermarkOutput.emitWatermark(new Watermark(maxTimestamp));
        }
    }
}


Result:



onPeriodicEmit......1709376044007
onPeriodicEmit......1709376044214
onPeriodicEmit......1709376044415
onPeriodicEmit......1709376044631
onPeriodicEmit......1709376044834


Summary:


* 1- Custom watermark: WatermarkStrategy.forGenerator(new WatermarkGeneratorSupplier)
    + Override the factory method and return a new generator instance.
    + Implement WatermarkGenerator and override its two methods: one called once per record, one called periodically (every 200 ms by default).
* 2- Change the emit interval with env.getConfig().setAutoWatermarkInterval(2000) (a sketch follows this list).
* 3- Easy to get wrong: the method is forGenerator (not forGenerate), and the resulting strategy still offers withTimestampAssigner.
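
A minimal sketch of point 2, reusing the `lines` stream, the WaterSensor POJO and MyWatermarkGenerator from the example above; only the lines relevant to the emit interval are shown:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// onPeriodicEmit() is normally invoked every 200 ms; raise the interval to 2 s here
env.getConfig().setAutoWatermarkInterval(2000);

SingleOutputStreamOperator<WaterSensor> sensorDS = lines.assignTimestampsAndWatermarks(
        WatermarkStrategy.<WaterSensor>forGenerator(ctx -> new MyWatermarkGenerator<>())
                .withTimestampAssigner((sensor, recordTs) -> sensor.getTs() * 1000L));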




---


**B. Punctuated (per-event) watermark:**


* It can be implemented by moving the watermark-emitting line from onPeriodicEmit into onEvent of the custom periodic watermark above (a fuller sketch follows):



watermarkOutput.emitWatermark(new Watermark(maxTimestamp));
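
A minimal sketch of such a punctuated generator (the class name is illustrative; it uses the same org.apache.flink.api.common.eventtime imports as the periodic example above): a watermark is emitted from onEvent for every record, and onPeriodicEmit is left empty.

public static class MyPunctuatedWatermarkGenerator<T> implements WatermarkGenerator<T> {
    private long maxTimestamp = -1L;

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(eventTimestamp, maxTimestamp);
        // emit a watermark for every record instead of waiting for the periodic call
        output.emitWatermark(new Watermark(maxTimestamp));
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // intentionally empty: watermarks are emitted per event above
    }
}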


##### 2.6 Assigning Watermarks Right After the Source (Kafka) [Key Point]


###### 2.6.1 Writing Data to a Specific Kafka Partition



package cn.itcast.day09.watermark.kafka;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

/**
 * Kafka producer utility that mocks data and writes it to a specific partition.
 *
 * Written to the first partition: 1000,hadoop and 7000,hadoop -> the window does not fire
 * Written to the second partition: 7000,flink                 -> the window fires
 */
public class KafkaMock {
    private final KafkaProducer<String, String> producer;
    public final static String TOPIC = "test3";

    private KafkaMock() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "node01:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, 0);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

        producer = new KafkaProducer<String, String>(props);
    }

    public void producer() {
        long timestamp = 1000;
        String value = "hadoop";
        String key = String.valueOf(value);
        String data = String.format("%s,%s", timestamp, value);
        // send "1000,hadoop" to partition 1 of the topic
        producer.send(new ProducerRecord<String, String>(TOPIC, 1, key, data));
        producer.close();
    }

    public static void main(String[] args) {
        new KafkaMock().producer();
    }
}
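
The class above sends only one record. Here is a hedged sketch of a method that could be added to KafkaMock to issue the two test writes described in the header comment; the partition indices 0 and 1 are assumptions about the topic layout, not from the original post.

public void produceTestData() {
    producer.send(new ProducerRecord<>(TOPIC, 0, "hadoop", "1000,hadoop"));
    producer.send(new ProducerRecord<>(TOPIC, 0, "hadoop", "7000,hadoop")); // partition 0: window does not fire
    producer.send(new ProducerRecord<>(TOPIC, 1, "flink",  "7000,flink"));  // partition 1: window fires
    producer.flush();
    producer.close();
}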


###### 2.6.2 Consuming Kafka Data with Watermarks



package cn.itcast.day09.watermark.kafka;

import org.apache.flink.api.common.eventtime.TimestampAssigner;
import org.apache.flink.api.common.eventtime.TimestampAssignerSupplier;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.time.Duration;
import java.util.Iterator;
import java.util.Properties;
import java.util.concurrent.TimeUnit;

/**
 * Consume Kafka data with a watermark assigned on the source.
 */
public class WatermarkTest {
    public static void main(String[] args) throws Exception {
        //todo 1) initialize the streaming environment
        Configuration configuration = new Configuration();
        configuration.setInteger("rest.port", 8081); // port of the web UI
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(configuration);

        env.setParallelism(2);
        env.enableCheckpointing(5000);
        //todo 2) source
        // topic name
        String topicName = "test3";
        // consumer properties
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "node01:9092");
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "test001");
        props.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.setProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true");
        props.setProperty(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, "2000");
        props.setProperty("flink.partition-discovery.interval-millis", "5000"); // background thread checks the Kafka partitions every 5 s

        FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<String>(topicName, new SimpleStringSchema(), props);
        kafkaSource.setCommitOffsetsOnCheckpoints(true); //todo with checkpointing enabled, offsets are committed when a checkpoint completes (default is true), giving consistent semantics

        DataStreamSource<String> kafkaDS = env.addSource(kafkaSource);

        // assign the watermark right after the source
        SingleOutputStreamOperator<String> watermarkStream = kafkaDS.assignTimestampsAndWatermarks(
                WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                        .withTimestampAssigner(new TimestampAssignerSupplier<String>() {
                            @Override
                            public TimestampAssigner<String> createTimestampAssigner(Context context) {
                                return new TimestampAssigner<String>() {
                                    @Override
                                    public long extractTimestamp(String element, long recordTimestamp) {
                                        return Long.parseLong(element.split(",")[0]);
                                    }
                                };
                            }
                        }).withIdleness(Duration.ofSeconds(60))
        );

        //todo 3) map each record to (word, 1)
        SingleOutputStreamOperator<Tuple2<String, Long>> wordAndOne = watermarkStream.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                return new Tuple2<String, Long>(value.split(",")[1], 1L);
            }
        });
        //todo 4) key by, window and aggregate
        wordAndOne.keyBy(x -> x.f0).window(TumblingEventTimeWindows.of(Time.of(5, TimeUnit.SECONDS))).process(
                new ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow>() {
                    @Override
                    public void process(String s, Context context, Iterable<Tuple2<String, Long>> elements, Collector<String> out) throws Exception {
                        long sum = 0L;
                        Iterator<Tuple2<String, Long>> iterator = elements.iterator();
                        while (iterator.hasNext()) {
                            Tuple2<String, Long> tuple2 = iterator.next();
                            System.out.println(tuple2.f0);
                            sum += tuple2.f1;
                        }
                        out.collect(s + "," + sum);
                    }
                }
        ).print();

        //todo 6) submit the job
        env.execute();
    }
}


Result 1: without withIdleness



Input:
* Written to the first partition: 1000,hadoop and 7000,hadoop -> the window did not fire
* Written to the second partition: 7000,flink -> the window fired


Result 2: with withIdleness



Input:
* Written to the first partition: 1000,hadoop and 7000,hadoop -> the window fired after 30 s


Conclusion:


* 1- When one partition has satisfied its trigger condition but the other partitions have not, the operator-wide watermark is held back and the window cannot fire.
* 2- withIdleness(Duration.ofSeconds(30)) waits up to 30 s for the other partitions; if they still deliver nothing, they are marked idle and the active partition alone drives the computation (a minimal sketch follows this list).
* 3- In practice this is usually set to 1-10 minutes.
* 4- When adding the watermark on a Kafka source, withTimestampAssigner is given a new TimestampAssignerSupplier here (the first time this variant appears).
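
A condensed sketch of the idleness setting, assuming the same "timestamp,word" records and `kafkaDS` stream as in the full example above; apart from using 30 s instead of 60 s, it only restates the watermark block:

SingleOutputStreamOperator<String> watermarkStream = kafkaDS.assignTimestampsAndWatermarks(
        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                .withTimestampAssigner((element, recordTs) -> Long.parseLong(element.split(",")[0]))
                // a partition that stays silent for 30 s is marked idle and no longer
                // holds back the operator-wide watermark
                .withIdleness(Duration.ofSeconds(30)));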




---


##### 2.7 Handling Severely Late Data in Flink


Example: designing a late-data handling mechanism



package cn.itcast.day09.WaterMark;

/**
 * @author lql
 * @time 2024-03-03 13:11:44
 * @description TODO
 */

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

/**
 * By default Flink drops late records, but most businesses cannot afford to lose them,
 * so Flink's late-data mechanism can be used to capture and process them.
 */
public class LatenessDataDemo {
    public static void main(String[] args) throws Exception {
        // environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // source
        DataStreamSource<String> lines = env.socketTextStream("node1", 9999);

        SingleOutputStreamOperator<Tuple2<String, Long>> wordAndOne = lines.map(new MapFunction<String, Tuple2<String, Long>>() {
            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] data = value.split(",");
                return new Tuple2<String, Long>(data[0], Long.parseLong(data[1]));
            }
        });

        // watermark: 3 s fixed delay
        SingleOutputStreamOperator<Tuple2<String, Long>> watermarkStream = wordAndOne.assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(3))
                        .withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
                            @Override
                            public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
                                // common pitfall: the source already carries milliseconds, so do NOT multiply by 1000L here
                                return element.f1;
                            }
                        })
        );

        // window: 5 s tumbling window
        // todo 1. allow late records via allowedLateness(lateness: Time)
        WindowedStream<Tuple2<String, Long>, String, TimeWindow> windowStream = watermarkStream
                .keyBy(t -> t.f0)
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .allowedLateness(Time.seconds(2));

        // todo 2. create the output tag for records that arrive too late
        OutputTag<Tuple2<String, Long>> outputTag = new OutputTag<>(
                "side output",
                TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {})
        );

        // todo 3. route the too-late records into the side output
        WindowedStream<Tuple2<String, Long>, String, TimeWindow> sideOutputLateData = windowStream.sideOutputLateData(outputTag);

        // aggregation
        SingleOutputStreamOperator<Tuple2<String, Long>> result = sideOutputLateData.apply(new WindowFunction<Tuple2<String, Long>, Tuple2<String, Long>, String, TimeWindow>() {
            @Override
            public void apply(String s, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<Tuple2<String, Long>> out) throws Exception {
                String key = null;
                Long counter = 0L;
                for (Tuple2<String, Long> element : input) {
                    key = element.f0;
                    counter += 1;
                }
                out.collect(Tuple2.of(key, counter));
            }
        });
        result.print("on-time data>>>");

        // todo 4. fetch the too-late records from the side output
        DataStream<Tuple2<String, Long>> sideOutput = result.getSideOutput(outputTag);

        sideOutput.printToErr("late data>>>");
        env.execute();
    }
}


结果:



/*
* 每5s一个窗口,水印:3s,延迟等待:2s
* 测试数据:
* hadoop,1626936202000 -> 2021-07-22 14:43:22 第一个窗口的数据
* hadoop,1626936207000 -> 2021-07-22 14:43:27 因为设置了水印,所以不会触发窗口计算
* hadoop,1626936202000 -> 2021-07-22 14:43:22 第一个窗口的数据
* hadoop,1626936203000 -> 2021-07-22 14:43:23 第一个窗口的数据
* hadoop,1626936208000 -> 2021-07-22 14:43:28 触发了窗口计算(hadoop,3),水印时间满足窗口endtime
*
* ====================事件时间 28 秒 -> 水印时间 25 秒 刚好临界 endtime =======================
* =延迟 2s 等待机制:延迟到事件时间 30s 即 水印时间 27s 关闭第一个窗口=
*
* 第一个窗口时间 2021-07-22 14:43:20 -> 2021-07-22 14:43:25
*
* hadoop,1626936202000 -> 2021-07-22 14:43:22 已经触发过计算的窗口再次有新数据到达,(hadoop,4)(数据重复计算)
* hadoop,1626936203000 -> 2021-07-22 14:43:23 已经触发过计算的窗口再次有新数据到达,(hadoop,5)
* hadoop,1626936209000 -> 2021-07-22 14:43:29 虽然 水印时间达到endtime,但是窗口里面没有新数据,不触发计算
* hadoop,1626936202000 -> 2020-07-22 14:43:22 已经触发过计算的窗口再次有新数据到达,(hadoop,6)
* hadoop,1626936210000 -> 2021-07-22 14:43:30 满足了窗口销毁的条件,开始专注于第二个新窗口
*
* 第二个窗口时间 2021-07-22 14:43:25 -> 2021-07-22 14:43:30
*
* ====================事件时间 33 秒 -> 水印时间 30 秒 刚好临界 endtime ====================================
* * =延迟 2s 等待机制:延迟到事件时间 35s 即 水印时间 32s 关闭第二个窗口=
*
* hadoop,1626936202000 -> 2021-07-22 14:43:22 打印迟到数据,(hadoop,1626936202000)
* hadoop,1626936215000 -> 2021-07-22 14:43:35 达到水印时间触发窗口计算:(hadoop,3),之前27,28,29秒的数据
*/


Summary (a condensed sketch of the whole pattern follows this list):


* 1- Configure the allowed lateness:

  append allowedLateness(Time.seconds(...)) after the window assigner.
* 2- Create the tag for records that arrive too late:

  new OutputTag<>(tag name, TypeInformation.of(new TypeHint<late data type>(){ })).
* 3- Capture the too-late records:

  windowedStream.sideOutputLateData(outputTag).
* 4- Retrieve the too-late records:

  resultStream.getSideOutput(outputTag).
* 5- The side-output stream and OutputTag are important operators from the earlier Window Function API (worth reviewing!).
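
Putting steps 1-4 together, a condensed sketch of the pattern, assuming a `watermarkStream` of Tuple2<String, Long> as in the example code above:

OutputTag<Tuple2<String, Long>> lateTag = new OutputTag<>(
        "late-data", TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {}));

SingleOutputStreamOperator<Tuple2<String, Long>> result = watermarkStream
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(5)))
        .allowedLateness(Time.seconds(2))        // 1- keep the window open 2 s longer
        .sideOutputLateData(lateTag)             // 2/3- records later than that go to the side output
        .apply(new WindowFunction<Tuple2<String, Long>, Tuple2<String, Long>, String, TimeWindow>() {
            @Override
            public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input,
                              Collector<Tuple2<String, Long>> out) {
                long count = 0L;
                for (Tuple2<String, Long> ignored : input) {
                    count++;
                }
                out.collect(Tuple2.of(key, count));
            }
        });

result.getSideOutput(lateTag).printToErr("late data>>>");   // 4- read the side output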


Question:


* What is the difference between allowedLateness(Time.seconds(...)), which sets how long late data is still accepted,

  and withIdleness(Duration.ofSeconds(30)), which sets how long to wait for an idle partition before firing?


Answer:


* (1) Conceptually, allowedLateness postpones the closing of a window without changing when it first fires, whereas withIdleness waits for a partition for a while and, if nothing arrives, stops letting that partition hold back the trigger.
* (2) In practice, allowedLateness suits cases such as connected vehicles that stop reporting inside a tunnel, where you still want to wait for their data, whereas withIdleness addresses the "shortest plank" effect across partitions: if a partition delivers nothing, the remaining partitions fire on their own.





