1. Preface
Time semantics only become useful when combined with window operations. Their main purpose is, of course, to open windows and run computations over time ranges. Below we look at how to use time attributes for window operations in the Table API and SQL. In the Table API and SQL there are two main kinds of windows: Group Windows and Over Windows.
2. Windows
2.1 Group Windows
Group Windows aggregate rows into finite groups based on time or row-count intervals and evaluate an aggregate function once per group. In the Table API, Group Windows are defined with the .window(w: GroupWindow) clause and must be given an alias with the as clause. To group the table by window, the window alias must be referenced in the groupBy clause just like a regular grouping field.
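Before looking at the Flink API itself, the bucketing rule behind a tumbling time window can be sketched in a few lines of plain Java. This is an illustration of the assignment logic only (a window of size s starting at the largest multiple of s not exceeding the event timestamp, i.e. zero offset), not Flink's actual implementation:

```java
// Illustration: how a tumbling window of a given size (ms) assigns an event
// timestamp (ms) to a window start. Zero window offset is assumed.
public class TumbleBucket {
    static long windowStart(long timestampMs, long sizeMs) {
        // The window start is the timestamp rounded down to a multiple of the size
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long size = 10_000L; // 10-second windows
        System.out.println(windowStart(13_500L, size)); // 10000 -> window [10s, 20s)
        System.out.println(windowStart(9_999L, size));  // 0     -> window [0s, 10s)
    }
}
```

Every row whose timestamp maps to the same window start lands in the same group, and the aggregate function runs once per such group.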
2.1.1 Tumbling Windows
The code is as follows:
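A minimal sketch in the string-expression Table API style used by the full example in section 3; the table `dataTable`, the rowtime attribute `rt`, and the field `id` are assumed from that example:

```java
// Tumbling event-time window: fixed, non-overlapping 10-second buckets
Table result = dataTable
        .window(Tumble.over("10.seconds").on("rt").as("tw")) // "10.rows" would define a count window instead
        .groupBy("id, tw")                                   // the window alias must appear in groupBy
        .select("id, id.count, tw.end");
```

For a processing-time window, .on() would reference a field declared with pt.proctime instead of the rowtime attribute.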
2.1.2 Sliding Windows
The code is as follows:
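A sketch in the same style, assuming the same `dataTable` and rowtime attribute `rt` as in section 3 (requires the additional import org.apache.flink.table.api.Slide):

```java
// Sliding event-time window: 10-second windows evaluated every 5 seconds (overlapping)
Table result = dataTable
        .window(Slide.over("10.seconds").every("5.seconds").on("rt").as("sw"))
        .groupBy("id, sw")
        .select("id, id.count, sw.end");
```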
2.1.3 Session Windows
The code is as follows:
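A sketch in the same style, assuming the same `dataTable` and rowtime attribute `rt` (requires the additional import org.apache.flink.table.api.Session); the 30-second gap is an arbitrary example value:

```java
// Session event-time window: a window closes after 30 seconds of inactivity per key
Table result = dataTable
        .window(Session.withGap("30.seconds").on("rt").as("sw"))
        .groupBy("id, sw")
        .select("id, id.count, sw.end");
```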
2.2 Over Windows
Over windows can be defined on event time or processing time, and over a range specified as either a time interval or a row count. An unbounded over window is specified with a constant: UNBOUNDED_RANGE for a time interval, or UNBOUNDED_ROW for a row count. A bounded over window is specified with the size of the interval.
2.2.1 Unbounded Over Windows
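A sketch in the string-expression style of section 3, assuming the same `dataTable` with fields `id`, `temp`, and rowtime attribute `rt`:

```java
// Unbounded over window: aggregates over all preceding rows (by time) per key
Table result = dataTable
        .window(Over.partitionBy("id").orderBy("rt").preceding("unbounded_range").as("ow"))
        .select("id, rt, temp.avg over ow");
```

For an unbounded row-count interval, preceding("unbounded_row") would be used instead.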
2.2.2 Bounded Over Windows
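A sketch under the same assumptions (`dataTable` with `id`, `temp`, and rowtime `rt`); the 1-minute interval is an arbitrary example value:

```java
// Bounded over window: aggregates over the preceding 1 minute of rows per key
Table result = dataTable
        .window(Over.partitionBy("id").orderBy("rt").preceding("1.minutes").as("ow"))
        .select("id, rt, temp.avg over ow");
// A bounded row-count variant would use preceding("10.rows") instead
```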
2.3 Defining Windows in SQL
We have seen how windows are invoked in the Table API; windows can likewise be defined and used directly in SQL.
2.3.1 Group Windows
Group Windows are defined in the GROUP BY clause of a SQL query. Just like a query with a regular GROUP BY clause, a query whose GROUP BY clause contains a group window function computes a single result row per group.
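A sketch against the `sensor` view registered in the full example of section 3, where `rt` is the rowtime attribute:

```sql
-- Tumbling window: one result row per sensor per 10-second bucket
SELECT id,
       COUNT(id) AS cnt,
       TUMBLE_END(rt, INTERVAL '10' SECOND) AS wEnd
FROM sensor
GROUP BY id, TUMBLE(rt, INTERVAL '10' SECOND)
```

Sliding and session windows take the same position in the GROUP BY clause, using HOP(rt, INTERVAL '5' SECOND, INTERVAL '10' SECOND) and SESSION(rt, INTERVAL '30' SECOND) respectively.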
2.3.2 Over Windows
Since OVER is natively supported SQL syntax, this is a basic aggregation operation in SQL. All aggregates must be defined over the same window, i.e., with the same partitioning, ordering, and range. Currently, only ranges that end at the current row are supported (both unbounded and bounded preceding). Note that ORDER BY must be specified on a single time attribute.
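A sketch against the same `sensor` view, equivalent to the Over Window SQL used in the full example of section 3:

```sql
-- Over window: running aggregates over the two preceding rows per sensor
SELECT id,
       rt,
       COUNT(id) OVER ow,
       AVG(temp) OVER ow
FROM sensor
WINDOW ow AS (
    PARTITION BY id
    ORDER BY rt
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
)
```

Both aggregates reference the same named window `ow`, satisfying the same-partition/order/range requirement, and ORDER BY uses the single time attribute `rt`.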
3. Code Summary
import com.atguigu.apitest.beans.SensorReading;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.table.api.Over;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.Tumble;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
public class TableTest5_TimeAndWindow {
public static void main(String[] args) throws Exception {
// 1. Create the environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
// 2. Read file data into a DataStream
DataStream<String> inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt");
// 3. Convert to POJOs and assign timestamps/watermarks
DataStream<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
})
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<SensorReading>(Time.seconds(2)) {
@Override
public long extractTimestamp(SensorReading element) {
return element.getTimestamp() * 1000L;
}
});
// 4. Convert the stream to a table, defining the time attribute
// Table dataTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp, pt.proctime");
Table dataTable = tableEnv.fromDataStream(dataStream, "id, timestamp as ts, temperature as temp, rt.rowtime");
tableEnv.createTemporaryView("sensor", dataTable);
// 5. Window operations
// 5.1 Group Window
// table API
Table resultTable = dataTable.window(Tumble.over("10.seconds").on("rt").as("tw"))
.groupBy("id, tw")
.select("id, id.count, temp.avg, tw.end");
// SQL
Table resultSqlTable = tableEnv.sqlQuery("select id, count(id) as cnt, avg(temp) as avgTemp, tumble_end(rt, interval '10' second) " +
"from sensor group by id, tumble(rt, interval '10' second)");
// 5.2 Over Window
// table API
Table overResult = dataTable.window(Over.partitionBy("id").orderBy("rt").preceding("2.rows").as("ow"))
.select("id, rt, id.count over ow, temp.avg over ow");
// SQL
Table overSqlResult = tableEnv.sqlQuery("select id, rt, count(id) over ow, avg(temp) over ow " +
" from sensor " +
" window ow as (partition by id order by rt rows between 2 preceding and current row)");
// dataTable.printSchema();
// tableEnv.toAppendStream(resultTable, Row.class).print("result");
// tableEnv.toRetractStream(resultSqlTable, Row.class).print("sql");
tableEnv.toAppendStream(overResult, Row.class).print("result");
tableEnv.toRetractStream(overSqlResult, Row.class).print("sql");
env.execute();
}
}
4. Code Exercise (Grouped Tumbling Window as an Example)
Putting together what we have learned, we can implement a concrete requirement with a complete piece of code. For example, we can open a tumbling window and count the number of readings from each sensor within every 10 seconds.
package com.ucas.windows;
import com.ucas.beans.SensorReading;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.Tumble;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
/**
* @author GONG
* @version 1.0
* @date 2021/3/14 23:01
*/
public class WindowDemo {
public static void main(String[] args) throws Exception {
/**
 * Set up the environment settings (Blink planner, streaming mode)
 */
EnvironmentSettings mySetting = EnvironmentSettings
.newInstance()
// .useOldPlanner()
.useBlinkPlanner()
.inStreamingMode()
.build();
// 1. Set up the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 2. Set the parallelism and use event time
env.setParallelism(1).setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// 3. Create the table environment from the stream execution environment
StreamTableEnvironment tableEnvironment = StreamTableEnvironment.create(env, mySetting);
// 4. Path of the input file
String filePath = "D:\\IDEA_WORK\\flinkstudy\\src\\main\\resources\\sensor.txt";
// 5. Read the text file with the stream execution environment
DataStreamSource<String> inputStream = env.readTextFile(filePath);
// 6. Convert to a Java bean and assign timestamps and watermarks
SingleOutputStreamOperator<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], Long.parseLong(fields[1]), Double.parseDouble(fields[2]));
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<SensorReading>(Time.seconds(1)) {
@Override
public long extractTimestamp(SensorReading sensorReading) {
return sensorReading.getTimestamp() * 1000L;
}
});
// 7. Convert the DataStream to a Table
Table sensorTable = tableEnvironment.fromDataStream(dataStream, "id,timestamp.rowtime as ts,temperature");
// 8. Windowed aggregation
Table selectTable = sensorTable.window(Tumble.over("10.seconds").on("ts").as("tw"))
.groupBy("id,tw")
.select("id,id.count,temperature.avg,tw.end");
// 9. The same query in SQL
tableEnvironment.createTemporaryView("sensor", sensorTable);
Table resultSqlTable = tableEnvironment.sqlQuery(
"select id, count(id), avg(temperature), tumble_end(ts, interval '10' second) " + "from sensor group by id, tumble(ts, interval '10' second)");
// 10. Convert back to streams and print
tableEnvironment.toAppendStream(selectTable, Row.class).print("select:");
tableEnvironment.toAppendStream(resultSqlTable, Row.class).print("sql:");
// 11. Execute the job
env.execute("time and window test");
}
}
package com.ucas.beans;
// Data type for sensor temperature readings
public class SensorReading {
// Fields: id, timestamp, temperature
private String id;
private Long timestamp;
private Double temperature;
public SensorReading() {
}
public SensorReading(String id, Long timestamp, Double temperature) {
this.id = id;
this.timestamp = timestamp;
this.temperature = temperature;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public Long getTimestamp() {
return timestamp;
}
public void setTimestamp(Long timestamp) {
this.timestamp = timestamp;
}
public Double getTemperature() {
return temperature;
}
public void setTemperature(Double temperature) {
this.temperature = temperature;
}
@Override
public String toString() {
return "SensorReading{" +
"id='" + id + '\'' +
", timestamp=" + timestamp +
", temperature=" + temperature +
'}';
}
}