Flink的Window

flink中的三种时间

flink中的window

window概述

window类型

适用场景:主要用来做一段范围内数据变化趋势的统计分析

实际中用得不多

GlobalWindow(CountWindow)

Nonkeyed的GlobalWindow(CountWindowAll)

package cn._51doit.flink.day05;

import org.apache.flink.streaming.api.datastream.AllWindowedStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

/**
 * 所有的window,按照是否先keyBy再划分窗口,分为KeyedWindow和NonKeyedWindow
 *
 * 如果没有keyBy就划分窗口,就是NonKeyedWindow,底层调用的是windowAll方法
 *
 * 如果是NonKeyedWindow,window和window operator对应的Task并行度永远为1,在1个subtask里有1个计数器
 *
 */
public class CountWindowAllDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<Integer> nums = lines.map(Integer::parseInt);//转为Int,因为后边要sum
        //没有keyBy,调用countWindowAll,按照条数划分窗口,当窗口中的数据达到一定的条数再输出计算结果
        AllWindowedStream<Integer, GlobalWindow> windowedStream = nums.countWindowAll(5);
        //划分窗口后,需要调用相应的方法,window operator
        SingleOutputStreamOperator<Integer> res = windowedStream.sum(0);
        res.print();
        env.execute();
      
      //1
      //2
      //3
      //4
      //5
      
      //4>15
      
      //1
      //1
      //1
      //1
      //1
      
      //2>5
      
      //每攒够五条数据输出一次计算结果,不足五条则一直不输出;输出的结果可以再做一些聚合操作等
      //窗口内部的处理是来一条计算一条,攒够五条时把结果发出,而不是攒够五条后再一次性计算五条的和,即增量聚合

    }
}

Keyed的GlobalWindow(CountWindow)

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

/**
 * 所有的window,按照是否先keyBy再划分窗口,分为KeyedWindow和NonKeyedWindow
 *
 * 如果先keyBy再划分窗口,就是KeyedWindow,底层调用的是window方法
 *
 * 如果是KeyedWindow,window和window operator对应的Task并行度可以是1到多个
 *
 * keyBy后划分的countWindow,是多并行的,当一个组(key)中的数据条数达到指定的数量,这个组对应的数据单独触发。每个分区中的每个组都有计数器
 *
 */
public class CountWindowDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //spark,4
        //hive,5
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //对数据进行映射
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = lines.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                String word = fields[0];
                int count = Integer.parseInt(fields[1]);
                return Tuple2.of(word, count);
            }
        });

        //先keyBy,再划分窗口
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, GlobalWindow> windowedStream = keyedStream.countWindow(5);
        //划分窗口后,还要调用window operator
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);

        res.print();

        env.execute();
      
      //spark,1
      //hive,1
      //hive,1
      //spark,1
      //spark,1
      //spark,1
      //spark,1
      
      //1>(spark,5)

    }
}

 

窗口起始时间

不论是processing还是eventTime,窗口起始时间都是窗口长度的整数倍

可以理解为:窗口的起始时间是把数据时间向下取整到窗口长度的整数倍,窗口的结束时间是起始时间加上窗口长度

package cn._51doit.flink.day05;

public class WindowBoundTest {
    public static void main(String[] args) {

        //窗口长度是5000毫秒(5秒)
        long time = 1645252201000L;

        long startTime = time - time % 5000;
        System.out.println(startTime);

        long endTime = startTime + 5000;
        System.out.println(endTime);

        //窗口是前闭后开的区间
        //[1645252200000, 1645252205000)
        //等价于闭区间:[1645252200000, 1645252204999]

    }
}
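Flink内部实际是用TimeWindow.getWindowStartWithOffset方法计算窗口起始时间的,offset为0时与上面的取模写法等价。下面是一个简单的验证示意(类名为演示自拟):

package cn._51doit.flink.day05;

import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class WindowStartWithOffsetTest {
    public static void main(String[] args) {
        long time = 1645252201000L;
        //参数依次为:数据时间、offset(偏移量,这里为0)、窗口长度
        long start = TimeWindow.getWindowStartWithOffset(time, 0L, 5000L);
        System.out.println(start); //1645252200000,与 time - time % 5000 的结果一致
    }
}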

TimeWindow(processingTime)

TimeWindow可以按三个维度组合:滚动/滑动、Keyed/NonKeyed、processingTime/eventTime

Tumbling滚动窗口

Nonkeyed的TimeWindow(windowAll)

划分方法:TumblingProcessingTimeWindows 按processingTime划分的滚动窗口

package cn._51doit.flink.day05;

import org.apache.flink.streaming.api.datastream.AllWindowedStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 按照ProcessingTime划分滚动窗口,
 * 没有KeyBy,就调用windowAll方法,得到的是NonKeyedWindow,window和window operator对应的Task并行度为1
 *
 */
public class ProcessingTimeTumblingWindowAllDemo {

    public static void main(String[] args) throws Exception {


        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<Integer> nums = lines.map(Integer::parseInt);
        //没有keyBy,然后按照ProcessingTime划分窗口
        AllWindowedStream<Integer, TimeWindow> windowedStream = nums.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)));
        //划分窗口后,需要调用相应的方法,window operator
        SingleOutputStreamOperator<Integer> res = windowedStream.sum(0);
        res.print();
        env.execute();

		//结果是每10秒输出一次sum的结果,每个窗口是独立算的,不与之前的窗口累加
    }
}

 

Keyed的TimeWindow(window)

划分方法:TumblingProcessingTimeWindows 按processingTime划分的滚动窗口

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 先keyBy,然后按照ProcessingTime划分滚动窗口
 * 底层调用的是window方法,返回的是keyedWindow,window和window operator对应的Task是多并行的
 *
 * 窗口触发后,每个分区中,每一个组的数据都会产生结果,然后输出
 */
public class ProcessingTimeTumblingWindowDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //spark,4
        //hive,5
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //对数据进行映射
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = lines.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                String word = fields[0];
                int count = Integer.parseInt(fields[1]);
                return Tuple2.of(word, count);
            }
        });

        //先keyBy,再划分窗口
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        //按照ProcessingTime划分滚动窗口
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(TumblingProcessingTimeWindows.of(Time.seconds(10)));
        //调用window operator
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);

        res.print();
        env.execute();

		//spark,1
        //spark,1
        //hive,1
        //flink,1
      
        //10s窗口触发
        //4>(flink,1)
        //1>(spark,2)
        //1>(hive,1)
      
        //hadoop,1
        //hive,1
        //spark,1
        //hive,1
      
        //10s窗口触发
        //4>(hadoop,1) 
        //1>(hive,2) 
        //1>(spark,1)

    }
}

sliding滑动窗口

Nonkeyed的TimeWindow(windowAll)

划分方法:SlidingProcessingTimeWindows 按processingTime划分的滑动窗口

package cn._51doit.flink.day05;

import org.apache.flink.streaming.api.datastream.AllWindowedStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 按照ProcessingTime划分滑动窗口,
 * 没有KeyBy,就调用windowAll方法,得到的是NonKeyedWindow,window和window operator对应的Task并行度为1
 *
 */
public class ProcessingTimeSlidingWindowAllDemo {

    public static void main(String[] args) throws Exception {


        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<Integer> nums = lines.map(Integer::parseInt);
        //没有keyBy,然后按照ProcessingTime划分窗口
        //SlidingProcessingTimeWindows要传入两个参数:第一个是窗口长度,第二个是滑动步长
        AllWindowedStream<Integer, TimeWindow> windowedStream = nums.windowAll(SlidingProcessingTimeWindows.of(Time.seconds(20), Time.seconds(10)));
        //划分窗口后,需要调用相应的方法,window operator
        SingleOutputStreamOperator<Integer> res = windowedStream.sum(0);
        res.print();
        env.execute();
		
      	//例如陆续输入9个1,某次实验输出依次为2,6,7,3
        //每条数据会同时落入两个窗口,所以各次输出之和为18,即每个1被累加了两次

    }
}
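补充:滑动窗口下,一条数据会落入 窗口长度/滑动步长 个窗口(这里是20/10=2个)。下面是一段与Flink内部窗口分配思路一致的简化示意(非Flink源码,类名自拟):

package cn._51doit.flink.day05;

public class SlidingWindowAssignTest {
    public static void main(String[] args) {
        long ts = 13000L;    //某条数据的时间
        long size = 20000L;  //窗口长度20秒
        long slide = 10000L; //滑动步长10秒
        //从该数据所属的最后一个窗口的起始时间开始,每次向前退一个步长
        long lastStart = ts - ts % slide;
        for (long start = lastStart; start > ts - size; start -= slide) {
            System.out.println("[" + start + ", " + (start + size) + ")");
        }
        //输出:[10000, 30000) 和 [0, 20000)
    }
}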

 

 

Keyed的TimeWindow(window)

划分方法:SlidingProcessingTimeWindows 按processingTime划分的滑动窗口

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 先keyBy,然后按照ProcessingTime划分滑动窗口
 * 底层调用的是window方法,返回的是keyedWindow,window和window operator对应的Task是多并行的
 *
 * 窗口触发后,每个分区中每个组的数据都会输出
 *
 */
public class ProcessingTimeSlidingWindowDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //spark,4
        //hive,5
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //对数据进行映射
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = lines.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                String word = fields[0];
                int count = Integer.parseInt(fields[1]);
                return Tuple2.of(word, count);
            }
        });

        //先keyBy,再划分窗口
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        //按照ProcessingTime划分滑动窗口
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(SlidingProcessingTimeWindows.of(Time.seconds(20), Time.seconds(10)));
        //调用window operator
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);

        res.print();

        env.execute();


    }
}

SessionWindow

NonKeyed的SessionWindow

package cn._51doit.flink.day05;

import org.apache.flink.streaming.api.datastream.AllWindowedStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 按照ProcessingTime划分会话窗口,session window是按照数据之间的时间间隔来划分窗口和触发计算的
 * 没有KeyBy,就调用windowAll方法,得到的是NonKeyedWindow,window和window operator对应的Task并行度为1
 *
 * 窗口触发的时机:当前的ProcessingTime(系统时间) - 进入到窗口中的最后一条数据对应的时间 > 指定的时间间隔
 *
 */
public class ProcessingTimeSessionWindowAllDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<Integer> nums = lines.map(Integer::parseInt);
        //没有keyBy,然后按照ProcessingTime划分窗口
        AllWindowedStream<Integer, TimeWindow> windowedStream = nums.windowAll(ProcessingTimeSessionWindows.withGap(Time.seconds(10)));
        //划分窗口后,需要调用相应的方法,window operator
        SingleOutputStreamOperator<Integer> res = windowedStream.sum(0);
        res.print();
        env.execute();
      
      //10秒内没有新数据进入,窗口就会触发输出;同样是增量聚合,来一条算一条,而不是触发前一刻才做全量计算

    }
}

Keyed的SessionWindow

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 先keyBy,然后按照ProcessingTime划分会话窗口
 * 底层调用的是window方法,返回的是keyedWindow,window和window operator对应的Task是多并行的
 *
 * 触发时间:当前时间 - 每个分区每个组最后进入数据的时间 > 指定的时间间隔(这个组的数据单独触发)
 */
public class ProcessingTimeSessionWindowDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //spark,4
        //hive,5
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //对数据进行映射
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = lines.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                String word = fields[0];
                int count = Integer.parseInt(fields[1]);
                return Tuple2.of(word, count);
            }
        });

        //先keyBy,再划分窗口
        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        //按照ProcessingTime划分会话窗口
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)));
        //调用window operator
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);

        res.print();
        env.execute();

    }
}

TimeWindow(eventTime)

单分区DataStream产生窗口的WaterMark问题

Keyed的TimeWindow(Window)

划分方法:TumblingEventTimeWindows 按eventTime划分的滚动窗口

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 先keyBy,然后按照EventTime划分滚动窗口
 * 底层调用的是window方法,返回的是keyedWindow,window和window operator对应的Task是多并行的
 *
 * 窗口触发后,每个分区中,每一个组的数据都会产生结果,然后输出
 *
 * 1.Flink中EventTime类型的窗口,是按照数据中的EventTime触发的,EventTime要转换成long类型的,精确到毫秒的时间戳
 * 2.窗口是根据输入的数据中的EventTime确定的,窗口的起始时间、结束时间是对齐的,是窗口长度的整数倍,而且是前闭后开的 [1645252200000, 1645252205000)
 *
 * 设置窗口延迟2秒触发
 * 理想情况,数据先产生,先进入窗口,但是在实际情况,可能会有网络延迟、服务器故障等原因,导致数据迟到
 * 就有可能导致数据先产生,但是后进入到窗口,迟到的数据如果迟到过长时间,就会被丢弃。
 *
 * 在提取EventTime时,可以设置窗口是允许数据乱序延迟触发的
 *
 * WaterMark = 每个分区中最大的EventTime - 延迟时间(WaterMark是从上游以广播的形式发送的)
 *
 * 窗口触发时机:WaterMark >= 窗口的结束时间(窗口结束时间是闭区间)
 *
 */
public class EventTimeTumblingWindowDemo2 {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //1000,spark,4
        //2000,hive,5
        //4000,hive,2
        //4998,spark,1
        //4999,spark,2
        //6666,flink,2
        //7777,spark,3
        //8888,spark,1
        //9998,spark,2
        //10000,spark,200
        //14999,spark,100
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //提取数据中的EventTime,按照数据中的时间划分窗口
        //该方法仅是提取数据中的时间,不会改变原有数据的样子
        //WaterMark是一种特殊的消息,由提取EventTime的算子向下游发送,发送给窗口对应的Task(而且是用广播的形式发送的)
        //在window算子之前的下游算子也可以收到WaterMark的信息;对于上下游并行度不同的情况,数据是轮询发送的,
        //而WaterMark作为一种特殊的消息是广播发送的,并且是每隔几百毫秒周期性地发送
        //WaterMark = 每个分区中最大的EventTime - 延迟时间
        //窗口触发的时机为:WaterMark >= 窗口的结束时间,窗口触发
        SingleOutputStreamOperator<String> linesWithWaterMark = lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(0)) {
            @Override
            public long extractTimestamp(String element) {
                return Long.parseLong(element.split(",")[0]); //将数据中的时间提取出来,返回long类型的时间戳
            }
        });

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = linesWithWaterMark.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                return Tuple2.of(fields[1], Integer.parseInt(fields[2]));
            }
        });

        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);
        res.print();
        env.execute();


    }
}

以上代码块中,延迟时间为0.

窗口延迟触发时间的问题

以下的ts均指数据中的eventTime时间戳

此时输入:其中ts为4997的消息由于延迟晚到了

1000,spark,1
4000,hive,1
4998,spark,2
4999,hive,5
4997,spark,10000
9998,spark,100
9999,spark,200

结果是,不论是[0,5000)还是[5000,10000)的窗口,都没有接收并输出这条消息,ts为4997的数据丢失了

为了让类似ts为4997这样迟到的数据能进入它本该属于的窗口,可以设置窗口延迟触发,给迟到数据留出一段宽容时间

SingleOutputStreamOperator<String> linesWithWaterMark = lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(2))
1000,spark,1       wm:-1000
3000,hive,1        wm: 1000
4999,spark,5       wm: 2999
4998,hive,100      wm: 2999(当前分区最大的eventTime为4999)
5000,spark,10000
6998,spark,10000
6999,hive,2000
1>(spark,6) 
1>(hive,101)

延迟时间为2000时,第一个触发的窗口仍然是[0,5000)而不是[0,7000):[5000,7000)的数据归属第二个窗口,只是要等WaterMark >= 4999(即分区内最大eventTime达到6999)时第一个窗口才触发。这样迟到的4998这条数据就被成功收进了第一个窗口

如果这时候要计算第一条信息:(1000,spark,1) 的wm,则为1000 - 2000 = -1000,watermark是可以为负的
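下面用一段简化的示意代码模拟老API中WaterMark的更新逻辑(仅帮助理解,不是Flink源码):

package cn._51doit.flink.day05;

public class WaterMarkSketch {

    private long maxEventTime = Long.MIN_VALUE;
    private final long delay = 2000L; //延迟时间

    //每来一条数据,先更新分区内最大的EventTime,WaterMark = 最大EventTime - 延迟时间
    public long update(long eventTime) {
        maxEventTime = Math.max(maxEventTime, eventTime);
        return maxEventTime - delay;
    }

    public static void main(String[] args) {
        WaterMarkSketch wm = new WaterMarkSketch();
        System.out.println(wm.update(1000L)); //-1000,WaterMark可以为负
        System.out.println(wm.update(4999L)); //2999
        System.out.println(wm.update(4998L)); //2999,最大EventTime仍是4999
    }
}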

多分区DataStream产生窗口的WaterMark问题

Keyed的TimeWindow(Window)

划分方法:TumblingEventTimeWindows 按eventTime划分的滚动窗口

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * assignTimestampsAndWatermarks方法也是一个Transformation,不会改变数据的样式,仅仅会提取数据中的EventTime,然后生成WaterMark,向下游发送
 * assignTimestampsAndWatermarks方法返回的DataStream的并行度与调用该方法的DataStream并行度一致
 *
 *
 */
public class EventTimeTumblingWindowDemo3 {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //1000,spark,4
        //2000,hive,5
        //4000,hive,2
        //4998,spark,1
        //4999,spark,2
        //6666,flink,2
        //7777,spark,3
        //8888,spark,1
        //9998,spark,2
        //10000,spark,200
        //14999,spark,100
        DataStreamSource<String> lines = env.socketTextStream("localhost", 9999);

        //(1000,spark,4)
        SingleOutputStreamOperator<Tuple3<Long, String, Integer>> tpStream = lines.map(new MapFunction<String, Tuple3<Long, String, Integer>>() {
            @Override
            public Tuple3<Long, String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                long timestamp = Long.parseLong(fields[0]);
                String word = fields[1];
                int count = Integer.parseInt(fields[2]);
                return Tuple3.of(timestamp, word, count);
            }
        });
      //相比于上一个程序,多做了以上map处理,目的是让调用assignTimestampsAndWatermarks方法的DataStream并行度为多个,
      //从而使发WaterMark的DataStream(tpStream)并行度也为多个(提供多分区的条件)
      //不过现在还没有能力发WaterMark

        //再调用assignTimestampsAndWatermarks
        SingleOutputStreamOperator<Tuple3<Long, String, Integer>> tpStreamWithWaterMark = tpStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Tuple3<Long, String, Integer>>(Time.seconds(2)) {
            @Override
            public long extractTimestamp(Tuple3<Long, String, Integer> tp) {
                return tp.f0;
            }
        });
      //现在已经有能力发waterMark了

		
      //时间字段就不需要了,因为已经提取为WaterMark发往下游了
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = tpStreamWithWaterMark.map(new MapFunction<Tuple3<Long, String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(Tuple3<Long, String, Integer> tp) throws Exception {
                return Tuple2.of(tp.f1, tp.f2);
            }
        });

        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);
        res.print();
        env.execute();

    }
}

每个分区的waterMark和整个窗口的waterMark

输入如下:延迟为2000,最大并行度为4,socketTextStream轮询写入

1000,spark,1 	index为0的分区(subTask)的waterMark:-1000
2000,spark,1 	index为1的分区(subTask)的waterMark:0
3000,spark,1 	index为2的分区(subTask)的waterMark:1000
500,spark,1 	index为3的分区(subTask)的waterMark:-1500

前三条数据都没有watermark产生(从web页面查看),第四条数据输入后才产生waterMark,且为500 - 2000 = -1500。注意需要写满所有的分区才能判断整个窗口的waterMark

每个分区里的waterMark是这个分区里最大的eventTime - 延迟时间,而整个窗口的waterMark是这个窗口所有分区里最小的waterMark

接着上边的输入:延迟为2000,最大并行度为4,socketTextStream轮询写入

1600,spark,1 	index为0的分区(subTask)的waterMark:-400
2200,spark,1 	index为1的分区(subTask)的waterMark:200
2500,spark,1 	index为2的分区(subTask)的waterMark:1000(该分区最大的eventTime仍为3000)
1800,spark,1 	index为3的分区(subTask)的waterMark:-200

由上边的分析,整个窗口的waterMark变为了-400
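可以用一段简化示意验证这个结论(非Flink实现):window算子把上游各分区广播来的WaterMark取最小值,作为整体的WaterMark:

package cn._51doit.flink.day05;

import java.util.Arrays;

public class MinWaterMarkTest {
    public static void main(String[] args) {
        //四个分区当前的WaterMark(对应上面第二批输入之后的状态)
        long[] partitionWaterMarks = {-400L, 200L, 1000L, -200L};
        long windowWaterMark = Arrays.stream(partitionWaterMarks).min().getAsLong();
        System.out.println(windowWaterMark); //-400
    }
}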

 

那么为什么是最小值呢?因为取最大值会丢数据

回顾一下窗口触发的条件

窗口触发的时机为:WaterMark >= 窗口的结束时间,窗口触发

丢失举例:

使用最小值的情况

数据1 wm1攒着 1000 

数据2 wm2攒着 1500 

数据3 wm3攒着 2000 

数据4 wm4攒着 2200   窗口wm为1000

数据5 wm5攒着 5000  替代wm1   窗口wm为1500

数据6 wm6攒着 4999  替代wm2   窗口wm为2000

数据7 wm7攒着 5500  替代wm3   窗口wm为2200

数据8 wm8攒着 6000  替代wm4    窗口wm为4999   触发第一窗口 输出数据1,2,3,4,6

数据9 wm9攒着 10000        替代wm5   窗口wm为4999

数据10 wm10攒着 10500    替代wm6   窗口wm为5500

数据11 wm11攒着 11000    替代wm7   窗口wm为6000

数据12 wm12攒着 12000    替代wm8   窗口wm为10000  触发第二窗口  输出数据 5 7 8

wm = min(wm9,wm10,wm11,wm12) = 10000  >= 9999?  

使用最大值的情况

数据1 wm1攒着 1000 

数据2 wm2攒着 1500 

数据3 wm3攒着 2000 

数据4 wm4攒着 2200  窗口wm为2200

数据5 wm5攒着 5400  替代wm1    窗口wm为 5400      在这里触发第一个窗口,输出数据1 2 3 4

数据6 wm6攒着 4999  替代wm2    窗口wm为 5400

数据7 wm7攒着 5500  替代wm3    窗口wm为 5500

数据8 wm8攒着 6000  替代wm4    窗口wm为 6000

数据9 wm9攒着 10000 替代wm5    窗口wm为 10000    在这里触发第二个窗口,输出数据5 7 8   此时数据6已经丢了

数据10 wm10攒着 10500

数据11 wm11攒着 11000

数据12 wm12攒着 12000

wm = max(wm9,wm10,wm11,wm12) = 12000>= 9999? 
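下面是一段简化的模拟代码(非Flink实现,延迟时间设为0,即wm就等于eventTime),直观对比min和max两种合并策略下窗口[0, 5000)的触发时机:

package cn._51doit.flink.day05;

import java.util.Arrays;

public class MinVsMaxWaterMarkTest {

    //返回窗口[0, 5000)在第几条数据之后触发(窗口结束时间的闭区间边界是4999)
    static int fireAt(long[] eventTimes, boolean useMin) {
        long[] channelWm = new long[4];
        Arrays.fill(channelWm, Long.MIN_VALUE);
        for (int i = 0; i < eventTimes.length; i++) {
            channelWm[i % 4] = eventTimes[i]; //4个通道轮询接收,延迟为0时wm即eventTime
            long merged = useMin ? Arrays.stream(channelWm).min().getAsLong()
                                 : Arrays.stream(channelWm).max().getAsLong();
            if (merged >= 4999) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] eventTimes = {1000, 1500, 2000, 2200, 5000, 4999, 5500, 6000};
        //min策略:第8条数据后才触发,第6条迟到的4999已经安全进入窗口
        System.out.println(fireAt(eventTimes, true));  //8
        //max策略:第5条数据后就触发,随后到达的4999只能被丢弃
        System.out.println(fireAt(eventTimes, false)); //5
    }
}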

SessionWindow(eventTime)

Keyed的SessionWindow

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 先keyBy,然后按照EventTime划分会话窗口
 * 底层调用的是window方法,返回的是keyedWindow,window和window operator对应的Task是多并行的
 * 窗口触发条件按组(同一个key)判断:看该组多久没有收到新的同组数据
 */
public class EventTimeSessionWindowDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //1000,spark,1
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<String> linesWithWaterMark = lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(0)) {
            @Override
            public long extractTimestamp(String element) {
                return Long.parseLong(element.split(",")[0]); //将数据中的时间提取出来,返回long类型的时间戳
            }
        });

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = linesWithWaterMark.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                return Tuple2.of(fields[1], Integer.parseInt(fields[2]));
            }
        });

        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(EventTimeSessionWindows.withGap(Time.seconds(5)));
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);
        res.print();
        env.execute();


    }
}

注意这里实验用的eventTime是手动写入的。对同一个组而言:写入一条新数据(无论是否同组),只要它把WaterMark推进到该组会话窗口的结束边界(约为该组最后一条数据的eventTime加5000),该组的窗口就会触发,输出窗口内所有同组数据;写入同组且eventTime间隔小于5000的数据,则不触发,只把该组的会话往后延续;如果一直不写入新数据,EventTime不会推进,窗口就一直不会触发。
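也就是说,EventTime会话窗口对某个组的触发判断可以概括为下面的示意(简化逻辑,非Flink实现):

package cn._51doit.flink.day05;

public class SessionFireSketch {
    public static void main(String[] args) {
        long lastEventTimeOfKey = 3000L; //该组最后一条数据的eventTime
        long gap = 5000L;                //会话间隔5秒
        long waterMark = 8000L;          //由所有分区中最小的WaterMark决定
        //会话窗口的结束时间是 lastEventTime + gap,前闭后开,闭区间边界要再减1毫秒
        boolean fire = waterMark >= lastEventTimeOfKey + gap - 1;
        System.out.println(fire); //true,8000 >= 7999,该组窗口触发
    }
}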

全量聚合还是增量聚合

在窗口里进行聚合操作时,是将数据攒起来,当窗口触发之后进行全量聚合还是在窗口触发之前就进行增量聚合呢?

对滚动窗口进行增量聚合

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/**
 * 对滚动窗口中的数据进行【增量】聚合
 *
 */
public class EventTimeWindowReduceDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //100,spark,1
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<String> linesWithWaterMark = lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(0)) {
            @Override
            public long extractTimestamp(String element) {
                return Long.parseLong(element.split(",")[0]); //将数据中的时间提取出来,返回long类型的时间戳
            }
        });

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = linesWithWaterMark.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                return Tuple2.of(fields[1], Integer.parseInt(fields[2]));
            }
        });

        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));
        //划分窗口后,调用reduce,对数据进行聚合
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> tp1, Tuple2<String, Integer> tp2) throws Exception {
                tp1.f1 = tp1.f1 + tp2.f1;
                return tp1;
            }
        });
        res.print();
        env.execute();

    }
}

如果是增量聚合,那么窗口没触发也会调reduce方法,如果是全量聚合那么要等窗口触发才会调这个方法。

实验结果是,输入了(1000,spark,1)(2000,hive,1)(3000,spark,2)后,尚未触发窗口的情况下,就进入了reduce方法。

这说明窗口里进行聚合操作时进行的是增量聚合。
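除了reduce,WindowedStream还提供了aggregate方法做增量聚合,累加器类型可以和输入、输出类型不同。下面是一个沿用上面windowedStream的示意写法(需要import org.apache.flink.api.common.functions.AggregateFunction,这里简单输出每个key在窗口内的累加值):

        //基于上面的windowedStream,用AggregateFunction做增量聚合的示意
        SingleOutputStreamOperator<Integer> aggRes = windowedStream.aggregate(
                new AggregateFunction<Tuple2<String, Integer>, Integer, Integer>() {
                    @Override
                    public Integer createAccumulator() {
                        return 0; //每个key在每个窗口中的初始累加值
                    }

                    @Override
                    public Integer add(Tuple2<String, Integer> value, Integer accumulator) {
                        return accumulator + value.f1; //来一条聚合一条,同样是增量聚合
                    }

                    @Override
                    public Integer getResult(Integer accumulator) {
                        return accumulator; //窗口触发时输出累加结果
                    }

                    @Override
                    public Integer merge(Integer a, Integer b) {
                        return a + b; //合并窗口(如会话窗口)时才会用到
                    }
                });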

对滚动窗口进行全量聚合(写入HBase时使用这种方法)

需要有一个windowState来缓存数据

package cn._51doit.flink.day05;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

/**
 * 对滚动窗口中的数据进行全量聚合(进入窗口中的数据先缓存到WindowState,当窗口触发后,再将缓存的数据取出来进行聚合)
 */
public class EventTimeWindowApplyDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //100,spark,1
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        SingleOutputStreamOperator<String> linesWithWaterMark = lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(0)) {
            @Override
            public long extractTimestamp(String element) {
                return Long.parseLong(element.split(",")[0]); //将数据中的时间提取出来,返回long类型的时间戳
            }
        });

        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = linesWithWaterMark.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                return Tuple2.of(fields[1], Integer.parseInt(fields[2]));
            }
        });

        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));
        //对窗口中的数据进行全量操作
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.apply(new WindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String, TimeWindow>() {
        //WindowFunction的泛型依次是:输入类型、输出类型、key的类型、窗口的类型
        //apply方法每个key的每个窗口调用一次

            @Override
            public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Integer>> buffer, Collector<Tuple2<String, Integer>> out) throws Exception { //参数依次是key、窗口、缓存在WindowState中的该组数据、输出结果用的Collector
                int count = 0;
                for (Tuple2<String, Integer> tp : buffer) {
                    count += tp.f1;
                }
                out.collect(Tuple2.of(key, count));
            }
        });

        res.print();
        env.execute();

    }
}

以上所用的assignTimestampsAndWatermarks(传入BoundedOutOfOrdernessTimestampExtractor)的写法是老API:

lines.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(0))

生成WaterMark的新API

Keyed的TimeWindow(Window)

划分方法:TumblingEventTimeWindows 按eventTime划分的滚动窗口

lines.assignTimestampsAndWatermarks(WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(0)).withTimestampAssigner
package cn._51doit.flink.day06;

import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

import java.time.Duration;

/**
 * 先keyBy,然后按照Event划分滚动窗口
 * 底层调用的是window方法,返回的是keyedWindow,window和window operator对应的Task是多并行的
 *
 * 如果使用新的API,窗口会比老API延迟触发1毫秒,因为新API生成WaterMark时减去了1毫秒。结果是老API输入eventTime为4999就能触发窗口,新API要输入5000才能触发,但eventTime=5000的数据不属于当前窗口,属于下一个窗口
 *
 */
public class EventTimeTumblingWindowNewAPIDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //设置生成WaterMark并向下游发送的时间周期(毫秒)
        env.getConfig().setAutoWatermarkInterval(200);
        //2000,spark,1
        DataStreamSource<String> lines = env.socketTextStream("localhost", 8888);

        //使用新的API提取EventTime生成WaterMark
        SingleOutputStreamOperator<String> linesWithWaterMark = lines.assignTimestampsAndWatermarks(WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(0)).withTimestampAssigner(
                new SerializableTimestampAssigner<String>() {
                    @Override
                    public long extractTimestamp(String line, long l) {
                        String[] fields = line.split(",");
                        return Long.parseLong(fields[0]);
                    }
                }
        ));


        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = linesWithWaterMark.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String line) throws Exception {
                String[] fields = line.split(",");
                return Tuple2.of(fields[1], Integer.parseInt(fields[2]));
            }
        });

        KeyedStream<Tuple2<String, Integer>, String> keyedStream = wordAndCount.keyBy(t -> t.f0);
        WindowedStream<Tuple2<String, Integer>, String, TimeWindow> windowedStream = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)));
        SingleOutputStreamOperator<Tuple2<String, Integer>> res = windowedStream.sum(1);
        res.print();
        env.execute();
    }
}
  in:
  1000,spark,1
  4000,spark,1
  3000,spark,2
  4998,hive,1
  4999,hive,5
  5000,hive,100
  
  out:
  1>(spark,4)
  1>(hive,6)
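之所以4999不能触发、5000才能触发,是因为新API的BoundedOutOfOrderness策略在生成WaterMark时额外减了1毫秒,简化示意如下:

        //示意(简化逻辑):新API发出的WaterMark = 分区内最大EventTime - 乱序时间 - 1毫秒
        long maxTimestamp = 4999L;
        long outOfOrdernessMillis = 0L;
        long waterMark = maxTimestamp - outOfOrdernessMillis - 1; //4998 < 4999,窗口[0, 5000)不触发
        //再输入eventTime为5000的数据后,WaterMark变为4999,才触发窗口[0, 5000),而5000这条数据属于下一个窗口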

用Lambda表达式改写上面map的部分(注意Lambda会丢失泛型信息,需要用returns显式指定返回类型):

        //需要import org.apache.flink.api.common.typeinfo.Types
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndCount = linesWithWaterMark
                .map(line -> {
                    String[] fields = line.split(",");
                    return Tuple2.of(fields[1], Integer.parseInt(fields[2]));
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT)); //Lambda会擦除泛型信息,需要用returns指定类型
