Tools and versions: IntelliJ IDEA, Flink 1.10.2, Kafka 2.0.0, Scala 2.11
This chapter assumes some prior Flink experience.
- Apache Flink introduction, architecture, principles, and implementation: click here
Contents
- 1 Dependencies required for the Flink Table execution environment
- 2 Ways to create a Flink Table execution environment
- 3 The Table concept
- 4 Update modes
- 5 Converting a Table to a DataStream
- 6 Viewing the execution plan
- 7 Stream processing vs. relational algebra
- 8 Dynamic Tables
- 9 Time Attributes
- 10 Windows
- 11 Functions
- 12 User-defined functions (UDF)
- Practice: reading a local data file with Flink Table
- Practice: reading data from a Kafka topic with Flink Table
- Practice: converting a DataStream to a Table
- Practice: writing Table data to a local file
- Practice: writing Table data to Kafka
- Practice: writing Table data to Elasticsearch
- Practice: writing Table data to MySQL
1 Dependencies required for the Flink Table execution environment
<!-- Adjust the version numbers to match your own setup -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner-blink_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-scala-bridge_2.11</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-csv</artifactId>
<version>1.10.2</version>
</dependency>
2 Ways to create a Flink Table execution environment
//create the Flink streaming environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
/**
 * The following approaches all yield essentially the same result;
 * on newer versions only the blink planner is recommended
 */
//1.0 While getting started, simply calling create(env) is enough
val tableEnv = StreamTableEnvironment.create(env)
//1.1 Streaming based on the old planner
val settings = EnvironmentSettings.newInstance()
.useOldPlanner()
.inStreamingMode()
.build()
val oldStreamTableEnv = StreamTableEnvironment.create(env,settings)
//1.2 Batch based on the old planner
val batchEnv = ExecutionEnvironment.getExecutionEnvironment
val oldBatchTableEnv = BatchTableEnvironment.create(batchEnv)
//1.3 Streaming based on the blink planner
val blinkStreamSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build()
val blinkStreamTableEnv = StreamTableEnvironment.create(env,blinkStreamSettings)
//1.4 Batch based on the blink planner
val blinkBatchSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inBatchMode()
.build()
val blinkBatchTableEnv = TableEnvironment.create(blinkBatchSettings)
3 The Table concept
- A TableEnvironment can register a Catalog and register tables based on that Catalog
- A Table is specified by an "identifier" consisting of three parts: the Catalog name, the database name, and the object name (a sketch follows this list)
- A table can be regular, or virtual (a View)
- A regular Table is typically used to describe external data, such as a file, a database table, or a message queue; it can also be converted directly from a DataStream
- A View can be created from an existing table, usually as the result set of a Table API or SQL query
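A minimal sketch of how the three-part identifier resolves (the catalog and database names here are hypothetical):
//switch the current catalog and database; unqualified table names then resolve against them
tableEnv.useCatalog("custom_catalog")
tableEnv.useDatabase("custom_db")
//"MyTable" now refers to custom_catalog.custom_db.MyTable
val myTable: Table = tableEnv.from("MyTable")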
The standard way to create a table
- A TableEnvironment can call .connect() to connect to an external system, and then call .createTemporaryTable() to register the table in the Catalog
tableEnv
.connect(...) //define the table's data source and connect to the external system
.withFormat(...) //define the data format
.withSchema(...) //define the table schema
.createTemporaryTable("MyTable") //create a temporary table
4 Update modes
- For streaming queries, you must declare how the conversion between the table and the external connector is performed
- The type of messages exchanged with the external system is specified by the update mode (Update Mode); a sketch follows the three modes below
Saving a Table locally only supports Append mode; retract or upsert behavior requires connecting to an external system that supports those modes
1. Append mode
- The table only performs insert operations and exchanges only insert messages with the external connector
2. Retract mode
- The table and the external connector exchange Add and Retract messages
- An insert is encoded as an Add message; a delete as a Retract message; an update as a Retract message for the previous row plus an Add message for the new row
3. Upsert mode
- Updates and inserts are both encoded as Upsert messages; deletes are encoded as Delete messages
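A minimal sketch of where this is declared: the update mode is set on the connector descriptor when registering the table (the Elasticsearch host and index are placeholders; the ES section later in this article uses the same pattern):
tableEnv.connect(new Elasticsearch()
.version("6")
.host("192.168.**.**",9200,"http")
.index("sensor")
.documentType("temperature")
).inUpsertMode() //or .inAppendMode() / .inRetractMode()
.withFormat(new Json())
.withSchema(new Schema()
.field("id",DataTypes.STRING())
.field("cnt",DataTypes.BIGINT())
).createTemporaryTable("upsertSink")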
5 Converting a Table to a DataStream
- A table can be converted into a DataStream or DataSet, so that custom stream or batch programs can continue processing the result of a Table API or SQL query
- When converting a table into a DataStream or DataSet, you must specify the resulting data type, i.e. the type each row of the table is converted into
- A table that is the result of a streaming query is updated dynamically
- There are two conversion modes: Append mode and Retract mode
Example: Append Mode
- Used when the table is only modified by insert operations
val resultStream: DataStream[Row] = tableEnv.toAppendStream[Row](resultTable)
Example: Retract Mode
- Usable in any scenario. Similar to the Retract update mode described above, it only has Insert and Delete operations.
- Each record carries an extra Boolean flag (the first field returned) indicating whether the row was newly inserted (Insert) or deleted (Delete)
val aggResultStream: DataStream[(Boolean, (String, Long))] = tableEnv.toRetractStream[(String,Long)](aggResultTable)
6 Viewing the execution plan
- The Table API provides a mechanism to explain the logic of computing a table and the optimized query plan
- The execution plan is obtained via TableEnvironment.explain(table) or TableEnvironment.explain(), which returns a string describing three plans:
➢ the unoptimized logical query plan
➢ the optimized logical query plan
➢ the physical execution plan
//generate the execution plan
val explanation: String = tableEnv.explain(resultTable)
//print the execution plan
println(explanation)
7 Stream processing vs. relational algebra
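In short, per the framing in the Flink documentation:
- Relational algebra / SQL operates on bounded, fully available data sets; a query terminates after execution and returns a final, fixed result.
- Stream processing operates on unbounded streams; a query never terminates and has to continuously update its result as new records arrive.
Dynamic tables, introduced in the next section, bridge these two models.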
8 Dynamic Tables
- Dynamic tables are the core concept of Flink's Table API and SQL support for streaming data
- Unlike the static tables that represent batch data, dynamic tables change over time
➢ Continuous Query: a dynamic table can be queried just like a static batch table; querying a dynamic table produces a continuous query
- A continuous query never terminates and produces another dynamic table as its result
- The query continuously updates its dynamic result table to reflect changes on its dynamic input table
1. Dynamic tables and continuous queries
- The processing flow of a streaming table query (see the sketch after this list):
1. The stream is converted into a dynamic table
2. A continuous query is evaluated on the dynamic table, producing a new dynamic table
3. The resulting dynamic table is converted back into a stream
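A minimal sketch of these three steps (clickStream and its fields 'user and 'url are hypothetical):
//1. the stream is converted into a dynamic table
val clicks: Table = tableEnv.fromDataStream(clickStream, 'user, 'url)
//2. a continuous query on the dynamic table produces a new dynamic table
val counts: Table = clicks.groupBy('user).select('user, 'url.count as 'cnt)
//3. the result table is converted back into a stream; a retract stream, since the counts get updated
val result: DataStream[(Boolean, Row)] = tableEnv.toRetractStream[Row](counts)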
2. Converting a stream into a dynamic table
- To process a stream with a relational query, it first has to be converted into a table
- Conceptually, each record of the stream is interpreted as an insert modification on the result table
3. Continuous queries
- A continuous query computes over the dynamic table and produces a new dynamic table as the result
4. Converting a dynamic table into a DataStream
- Like a regular database table, a dynamic table is continuously modified by Insert, Update, and Delete changes
- When converting a dynamic table into a stream, or writing it to an external system, these changes need to be encoded
a. Append-only stream
- A dynamic table that is modified only by Insert changes can be converted directly into an append-only stream
b. Retract stream
- A retract stream contains two kinds of messages: Add messages and Retract messages
c. Upsert stream
- An upsert stream also contains two kinds of messages: Upsert messages and Delete messages.
9 Time Attributes
- Time-based operations (such as window operations in the Table API and SQL) require information about the time semantics and the source of the time data
- A Table can provide a logical time field for indicating time and accessing record timestamps in table programs
- Time attributes can be part of every table schema. Once defined, a time attribute can be referenced as a field and used in time-based operations
- A time attribute behaves like a regular timestamp: it can be accessed and used in calculations
Defining processing time (Processing Time)
- With processing-time semantics, the table program produces results based on the machine's local clock. It is the simplest notion of time: it requires neither timestamp extraction nor watermark generation
- When defining the schema, a processing-time field is declared by marking a field name with proctime
- The proctime attribute can only extend the physical schema with an appended logical field, so it can only be defined at the end of the schema
val sensorTable = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'timestamp, 'pt.proctime)
Defining event time (Event Time)
- Event-time semantics let table programs produce results based on the time contained in each record, so results are correct even with out-of-order or late events.
- To handle out-of-order events and distinguish on-time from late events in the stream, Flink needs to extract a timestamp from the event data and use it to advance event time
- There are likewise three ways to define event time; the examples below implement each of them
Example: specifying event time when converting a DataStream into a table
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Rowtime, Schema}
import org.apache.flink.table.types.DataType
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object Def_Time {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//use event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build()
//create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
//********************************specify the timestamp when converting a DataStream into a table******************************************
//read the local data file
val datas: DataStream[String] = env.readTextFile("in/StringToClass.txt")
//convert into the custom type and assign watermarks with a 1-second delay
val dataStream: DataStream[WaterSensor] = datas.filter(data=>{data.split(",").length==3}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts*1000L
})
//convert the stream into a table, marking the event-time field with rowtime
val table_DataStream: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime ,'vc)
//print the schema
table_DataStream.printSchema()
//print the result
table_DataStream.toAppendStream[Row].print()
//execute
env.execute()
}
}
Example: specifying event time when defining the Table Schema
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Rowtime, Schema}
import org.apache.flink.table.types.DataType
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object Def_Time {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//use event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build()
//create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
//************************************specify event time when defining the Table Schema**********************************************
tableEnv.connect(new FileSystem().path("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt"))
.withFormat(new Csv)
.withSchema(new Schema()
.field("id",DataTypes.STRING())
.field("ts",DataTypes.BIGINT())
.rowtime(
new Rowtime()
//the timestamp field
.timestampsFromField("ts")
//bounded out-of-orderness of 1000 milliseconds
.watermarksPeriodicBounded(1000)
)
.field("vc",DataTypes.DOUBLE())
).createTemporaryTable("timewindow_to_schema")
val table_Schema: Table = tableEnv.from("timewindow_to_schema")
//print the schema
table_Schema.printSchema()
//print the result
table_Schema.toAppendStream[Row].print()
//execute
env.execute()
}
}
Example: defining event time in the table's DDL
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Rowtime, Schema}
import org.apache.flink.table.types.DataType
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object Def_Time {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//use event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build()
//create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
//************************************define event time in the CREATE TABLE DDL*******************************************************
val tableDDL:String=
"""
|create table table_DDL(
| id varchar(20) not null,
| ts bigint,
| vc double,
| rt AS TO_TIMESTAMP(FROM_UNIXTIME(ts)),
| watermark for rt as rt - interval '1' second
| ) with (
| 'connector.type' = 'filesystem',
| 'connector.path' = 'D:\ideaProject\FlinkDemo\in\StringToClass.txt',
| 'format.type' = 'csv'
| )
""".stripMargin
//register the table
tableEnv.sqlUpdate(tableDDL)
val table_DDL: Table = tableEnv.from("table_DDL")
//print the schema
table_DDL.printSchema()
//print the result
table_DDL.toAppendStream[Row].print()
//execute
env.execute()
}
}
10 Windows
- Time semantics only take effect in combination with window operations
- In the Table API and SQL, there are two main kinds of windows
1. Group Windows
- Group windows aggregate rows into finite groups based on time or row-count intervals and evaluate an aggregation function once per group
- Group Windows are defined with the window(w: GroupWindow) clause and must be given an alias with the as clause.
- To group the table by window, the window alias must be referenced in the groupBy clause like a regular grouping field
val table = input
.window([w:GroupWindow] as 'w) //define the window with alias w
.groupBy('w, 'a) //group by field a and window w
.select('a,'b.sum) //aggregate
- The Table API provides a set of predefined Window classes with specific semantics, which are translated into underlying DataStream or DataSet window operations
Tumbling windows
Tumbling windows are defined with the Tumble class
- Event-time tumbling window
// Tumbling Event-time Window
.window(Tumble over 10.minutes on 'rowtime as 'w)
- Processing-time tumbling window
// Tumbling Processing-time Window
.window(Tumble over 10.minutes on 'proctime as 'w)
- Row-count tumbling window
// Tumbling Row-count Window
.window(Tumble over 10.rows on 'proctime as 'w)
Sliding windows
Sliding windows are defined with the Slide class
- Event-time sliding window
// Sliding Event-time Window
.window(Slide over 10.minutes every 5.minutes on 'rowtime as 'w)
- Processing-time sliding window
// Sliding Processing-time Window
.window(Slide over 10.minutes every 5.minutes on 'proctime as 'w)
- Row-count sliding window
// Sliding Row-count Window
.window(Slide over 10.rows every 5.rows on 'proctime as 'w)
Session windows
Session windows are defined with the Session class
- Event-time session window
// Session Event-time Window
.window(Session withGap 10.minutes on 'rowtime as 'w)
- Processing-time session window
// Session Processing-time Window
.window(Session withGap 10.minutes on 'proctime as 'w)
Example: tumbling window
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object TimeAndWindow {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//*********************************Group window****************************************************
val table_gw: Table = table
.window(Tumble over 10.seconds on 'ts as 'tw) //define the window
.groupBy('id, 'tw)
.select('id, 'id.count,'tw.end)
table_gw.toAppendStream[Row].print()
//start
env.execute()
}
}
2. Group Windows (SQL)
- Group windows are defined in the GROUP BY clause of a SQL query; HOP and SESSION sketches follow this list
➢ TUMBLE(time_attr, interval): defines a tumbling window; the first argument is the time field, the second the window size
➢ HOP(time_attr, interval, interval): defines a sliding window; the first argument is the time field, the second the slide step, the third the window size
➢ SESSION(time_attr, interval): defines a session window; the first argument is the time field, the second the session gap
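The TUMBLE example below carries over directly to the other two; a sketch, assuming the same registered view table_view with rowtime attribute ts:
-- sliding window: 10-second size, sliding every 5 seconds (HOP takes the slide step before the size)
select id, count(id), hop_end(ts, interval '5' second, interval '10' second) from table_view
group by id, hop(ts, interval '5' second, interval '10' second)
-- session window with a 10-second gap
select id, count(id), session_end(ts, interval '10' second) from table_view
group by id, session(ts, interval '10' second)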
Example: tumbling window (SQL)
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object TimeAndWindow {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//*********************************Group window (SQL)****************************************************
tableEnv.createTemporaryView("table_view",table) //register a view
val table_sql_gw: Table = tableEnv.sqlQuery(
"""
select id,count(id),sum(vc), tumble_end(ts,interval '10' second) from table_view
group by id,tumble(ts,interval '10' second)
""".stripMargin)
//print the result
table_sql_gw.toAppendStream[Row].print()
//start
env.execute()
}
}
3. Over Windows
- Over window aggregations exist in standard SQL (the OVER clause) and can be defined in the SELECT clause of a query
- An over window aggregation computes, for each input row, an aggregate over a range of neighboring rows
- Over windows are defined with the window(w: OverWindow*) clause and referenced by alias in the select() method
val table = input
.window([w: OverWindow] as 'w)
.select('a,'b.sum over 'w,'c.min over 'w)
- The Table API provides the Over class to configure the properties of an over window
Unbounded Over Windows
Over windows can be defined on event time or processing time, with a range specified as a time interval or a row count
Unbounded over windows are specified with constants
- Event-time over window
//unbounded event-time over window
.window(Over partitionBy 'a orderBy 'rowtime preceding UNBOUNDED_RANGE as 'w)
- Processing-time over window
//unbounded processing-time over window
.window(Over partitionBy 'a orderBy 'proctime preceding UNBOUNDED_RANGE as 'w)
- Event-time row-count over window
//unbounded event-time row-count over window
.window(Over partitionBy 'a orderBy 'rowtime preceding UNBOUNDED_ROW as 'w)
- Processing-time row-count over window
//unbounded processing-time row-count over window
.window(Over partitionBy 'a orderBy 'proctime preceding UNBOUNDED_ROW as 'w)
Bounded Over Windows
Bounded over windows are specified with the size of the interval
- Bounded event-time over window
//bounded event-time over window
.window(Over partitionBy 'a orderBy 'rowtime preceding 1.minutes as 'w)
- Bounded processing-time over window
//bounded processing-time over window
.window(Over partitionBy 'a orderBy 'proctime preceding 1.minutes as 'w)
- Bounded event-time row-count over window
//bounded event-time row-count over window
.window(Over partitionBy 'a orderBy 'rowtime preceding 10.rows as 'w)
- Bounded processing-time row-count over window
//bounded processing-time row-count over window
.window(Over partitionBy 'a orderBy 'proctime preceding 10.rows as 'w)
Example: Over Window (bounded, over the preceding 2 rows)
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object TimeAndWindow {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//*********************************Over window****************************************************
val table_ow: Table = table.window(Over partitionBy 'id orderBy 'ts preceding 2.rows as 'ow)
.select('id, 'id.count over 'ow, 'vc.sum over 'ow)//apply the aggregates over the window in select
//print
table_ow.toAppendStream[Row].print()
//execute
env.execute()
}
}
4. Over Windows (SQL)
- When aggregating with OVER, all aggregations must be defined over the same window, i.e. with the same partitioning, ordering, and range
- Currently only ranges that precede and include the current row are supported
- ORDER BY must be specified on a single time attribute
SELECT COUNT(amount) OVER (
PARTITION BY user
ORDER BY proctime
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
FROM Orders
Example: Over Windows (SQL)
- Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Over, Table, Tumble}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object TimeAndWindow {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//*********************************Over window (SQL)****************************************************
//register a view; the SQL below queries table_view
tableEnv.createTemporaryView("table_view",table)
val table_sql_ow=tableEnv.sqlQuery(
"""
|select id,count(id) over ow ,sum(vc) over ow from table_view
|window ow as(
|partition by id
|order by ts
|rows between 2 preceding and current row
|)
""".stripMargin)
table_sql_ow.toAppendStream[Row].print()
//execute
env.execute()
}
}
11 Functions
- The Flink Table API and SQL provide users with a set of built-in functions for data transformations
- Most functions supported in SQL have equivalents implemented in the Table API
Comparison functions
SQL:
value1 = value2
value1 > value2
Table API:
ANY1 === ANY2
ANY1 > ANY2
Logical functions
SQL:
boolean1 OR boolean2
boolean IS FALSE
NOT boolean
Table API:
BOOLEAN1 || BOOLEAN2
BOOLEAN.isFalse
!BOOLEAN
Arithmetic functions
SQL:
numeric1 + numeric2
POWER(numeric1, numeric2)
Table API:
NUMERIC1 + NUMERIC2
NUMERIC1.power(NUMERIC2)
String functions
SQL:
string1 || string2
UPPER(string)
CHAR_LENGTH(string)
Table API:
STRING1 + STRING2
STRING.upperCase()
STRING.charLength()
Temporal functions
SQL:
DATE string
TIMESTAMP string
CURRENT_TIME
INTERVAL string range
Table API:
STRING.toDate
STRING.toTimestamp
currentTime()
NUMERIC.days
NUMERIC.minutes
Aggregate functions
SQL:
COUNT(*)
SUM(expression)
RANK()
ROW_NUMBER()
Table API:
FIELD.count
FIELD.sum
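A small sketch combining a few of these built-ins in the Table API, against the sensor table ('id: String, 'vc: Double) used throughout this article:
val funcDemo: Table = table.select(
'id.upperCase() as 'upperId, //string function: UPPER(id)
'id.charLength() as 'idLen, //string function: CHAR_LENGTH(id)
('vc + 1).abs() as 'vcAbs, //arithmetic function
('vc > 30) as 'isHigh //comparison, yields a Boolean
)
funcDemo.toAppendStream[Row].print()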
12 User-defined functions (UDF)
- User-defined functions (UDFs) are an important feature; they significantly extend the expressiveness of queries
- In most cases, a user-defined function must be registered before it can be used in a query
- A function is registered in the TableEnvironment by calling registerFunction(). When a user-defined function is registered, it is inserted into the TableEnvironment's function catalog, so the Table API or SQL parser can recognize and interpret it correctly
Scalar Functions
- A user-defined scalar function maps zero, one, or more scalar values to a new scalar value
- To define a scalar function, extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation (eval) methods
- The behavior of a scalar function is determined by its evaluation methods, which must be declared public and named eval
- It implements one-in-one-out semantics, similar to map
class HashCode( factor: Int ) extends ScalarFunction {
def eval( s: String ): Int = {
s.hashCode * factor
}}
Example: Scalar Functions
Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
Code
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object Test_utf {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//instantiate the UDF
val hashCode = new HashCode(10)
//****************************Table API: using the custom ScalarFunction***************************
val table_tf: Table = table.select('id,hashCode('id))
table_tf.toAppendStream[Row].print("table_tf")
//********************************SQL: using the custom ScalarFunction******************************
//register a view
tableEnv.createTemporaryView("table_view",table)
//register the UDF
tableEnv.registerFunction("hashCode",hashCode)
//use the UDF
val table_sql: Table = tableEnv.sqlQuery(
"""
select id,hashCode(id) from table_view
""".stripMargin)
table_sql.toAppendStream[Row].print("table_sql")
env.execute()
}
}
//custom UDF: extend ScalarFunction
//ScalarFunction: one-in-one-out semantics, similar to map
class HashCode(factor:Int) extends ScalarFunction{
//the eval method is mandatory and must be public
def eval(s:String):Int={
//the actual logic
s.hashCode * factor - 10000
}
}
Table Functions
- A user-defined table function also takes zero, one, or more scalar values as input; unlike a scalar function, it can return any number of rows as output instead of a single value
- To define a table function, extend the base class TableFunction in org.apache.flink.table.functions and implement one or more evaluation methods
- The behavior of a table function is determined by its evaluation methods, which must be public and named eval
class Split(separator: String) extends TableFunction[(String, Int)]{
def eval(str: String): Unit = {
str.split(separator).foreach(word => collect((word, word.length)))
}}
Example: Table Functions
Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
Code
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object UDF_table {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//**********************************Table API: using the custom TableFunction**********************************
//create the TableFunction instance
val spilt = new Spilt("_")
val table_tf: Table = table
//joinLateral joins each input row with the rows produced by the UDF
.joinLateral(spilt('id) as ('word,'length))
//the fields produced by the lateral join can be used directly in select
.select('id,'ts,'vc,'word,'length)
//convert to a stream and print
table_tf.toAppendStream[Row].print("table_tf")
//**********************************SQL: using the custom TableFunction**********************************
//register a view
tableEnv.createTemporaryView("table_view",table)
//register the UDF
tableEnv.registerFunction("split",spilt)
//use the UDF
//lateral table() applies the UDF and names the returned fields
val table_sql: Table = tableEnv.sqlQuery(
"""
select id, ts, vc, word, length from table_view,
lateral table( split(id) ) as splitId(word,length)
""".stripMargin)
//convert to a stream and print
table_sql.toAppendStream[Row].print("table_sql")
env.execute()
}
}
//custom table function: extend TableFunction and declare the returned row type
//TableFunction: one-in-many-out semantics, similar to flatMap
class Spilt(separator:String) extends TableFunction[(String,Int)]{
//implement the eval method and declare the input type
def eval(s:String): Unit ={
//the actual logic
//example: split the incoming string and iterate over the parts
s.split(separator).foreach(word=>{
//emit each word together with its length via collect
collect((word,word.length))
})
}
}
Aggregate Functions
- User-Defined Aggregate Functions (UDAGGs) can aggregate the data of a table into a single scalar value
- A user-defined aggregate function is implemented by extending the AggregateFunction abstract class
- AggregateFunction requires the following methods to be implemented:
createAccumulator()
accumulate()
getValue()
- An AggregateFunction works as follows:
a. First, it needs an accumulator, the data structure that holds the intermediate result of the aggregation; an empty accumulator is created by calling createAccumulator()
b. Then, the function's accumulate() method is called for each input row to update the accumulator
c. Once all rows have been processed, the function's getValue() method is called to compute and return the final result
Example: Aggregate Functions
Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
Code
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.AggregateFunction
import org.apache.flink.types.Row
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object UDF_AggregateFunctions {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
val avgTemp = new AvgTemp()
//****************************Table API: using the custom AggregateFunction***************************
val table_avg: Table = table
.groupBy('id)
//apply the custom aggregate function and rename the returned field
.aggregate(avgTemp('vc) as 'avgTemp)
.select('id, 'avgTemp)
table_avg.toRetractStream[Row].print("table api->")
//********************************SQL: using the custom AggregateFunction******************************
tableEnv.createTemporaryView("table_view",table)
//register the UDF
tableEnv.registerFunction("avgTemp",avgTemp)
//use the UDF
val table_sql: Table = tableEnv.sqlQuery(
"""
select id, avgTemp(vc) from table_view group by id
""".stripMargin)
table_sql.toRetractStream[Row].print("table sql->")
env.execute()
}
}
//accumulator type
class AvgTempAcc{ var sum:Double=0.0; var count:Int=0 }
//custom aggregate function: extend AggregateFunction and declare the result type
//AggregateFunction: many-in-one-out semantics, like a built-in aggregate
class AvgTemp extends AggregateFunction[Double,AvgTempAcc] {
//compute the average from the accumulated state
override def getValue(accumulator: AvgTempAcc): Double = accumulator.sum/accumulator.count
//create the initial accumulator
override def createAccumulator(): AvgTempAcc = new AvgTempAcc
//accumulate() must also be implemented; it is invoked for every input row
def accumulate(accumulator:AvgTempAcc, temp:Double):Unit={
accumulator.sum+=temp
accumulator.count+=1
}
}
Table Aggregate Functions
- User-Defined Table Aggregate Functions (UDTAGGs) can aggregate the data of a table into a result table with multiple rows and columns
- A user-defined table aggregate function is implemented by extending the TableAggregateFunction abstract class
Example: Table Aggregate Functions
Test data
ws_001,1577844001,24.0
ws_001,1577844002,23.0
ws_001,1577844003,23.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
ws_002,1577844025,43.0
ws_002,1577844045,43.0
ws_001,1577844090,23.0
ws_002,1577844055,43.0
ws_003,1577844060,32.0
Code
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableAggregateFunction
import org.apache.flink.types.Row
import org.apache.flink.util.Collector
//custom type
case class WaterSensor(id:String,ts:Long,vc:Double)
object UDF_TableAggregateFunctions {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,settings)
val datas: DataStream[String] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
val dataStream: DataStream[WaterSensor] = datas.filter(x=>{
x.split(",").length==3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
}).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[WaterSensor](Time.seconds(1)) {
override def extractTimestamp(t: WaterSensor): Long = t.ts * 1000L
})
val table: Table = tableEnv.fromDataStream(dataStream,'id,'ts.rowtime,'vc)
//**********************************Table API: using the TableAggregateFunction**********************************
//instantiate the table aggregate function
val top2Temp = new Top2Temp
val table_api: Table = table
.groupBy('id)
//call the table aggregate function with the input field and name the returned fields
.flatAggregate(top2Temp('vc) as('temp, 'rank))
.select('id, 'temp, 'rank)
table_api.toRetractStream[Row].print("table_api")
env.execute()
}
}
//a class representing the state (accumulator) of the aggregation
class Top2TempAcc{
var highestTemp:Double=Double.MinValue
var secondHighestTemp:Double=Double.MinValue
}
//custom table aggregate function
class Top2Temp extends TableAggregateFunction[(Double,Int),Top2TempAcc]{
//create the state instance
override def createAccumulator(): Top2TempAcc = new Top2TempAcc
//accumulate() updates the aggregate for each incoming row
def accumulate(acc:Top2TempAcc,temp: Double):Unit={
//check whether the current temperature beats the stored maximum
if(temp>acc.highestTemp){
//demote the previous maximum to second place, dropping the old second place
acc.secondHighestTemp=acc.highestTemp
//the new value becomes the maximum
acc.highestTemp=temp
}else if (temp>acc.secondHighestTemp){//if it only beats second place, replace second place
acc.secondHighestTemp=temp
}
}
//emit the (value, rank) results; the method must be named emitValue
def emitValue(acc:Top2TempAcc,out:Collector[(Double,Int)]):Unit={
out.collect((acc.highestTemp,1))
out.collect((acc.secondHighestTemp,2))
}
}
Practice: reading a local data file with Flink Table
Older versions can format data directly with the OldCsv descriptor, but on newer Flink versions the format type needs its own dependency
Example: CSV format
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-csv</artifactId>
<version>1.10.2</version>
</dependency>
Test data
ws_001,1577844001,35.0
ws_002,1577844015,43.0
ws_003,1577844020,72.0
ws_001,1577844001,45.0
ws_002,1577844015,73.0
Code
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors._
import org.apache.flink.table.factories.TableSourceFactory
object TableApiTest {
def main(args: Array[String]): Unit = {
//create the stream environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//create the table environment
val tableEnv = StreamTableEnvironment.create(env)
//read the local CSV file and store it in a table
tableEnv.connect(new FileSystem().path("D:\\...\\InputTableCsv.txt"))
.withFormat(new Csv) //specify the file format
.withSchema(new Schema() //define the fields
.field("id",DataTypes.STRING()) //column 1: name id, type String
.field("ts",DataTypes.BIGINT()) //column 2: name ts, type Long
.field("vc",DataTypes.DOUBLE()) //column 3: name vc, type Double
).createTemporaryTable("inputTable") //create a table holding the data
//create a Table instance from the table name
val table: Table = tableEnv.from("inputTable")
//print the Table (toAppendStream requires import org.apache.flink.table.api.scala._)
table.toAppendStream[(String,Long,Double)].print()
//start
env.execute()
}
}
Run result
Practice: reading data from a Kafka topic with Flink Table
Add the dependency
<!-- Adjust to the version you use -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.10.2</version>
</dependency>
Create the Kafka topic
kafka-topics.sh --create --zookeeper 192.168.**.**:2181 --topic tableTest --partitions 1 --replication-factor 1
Create a producer
kafka-console-producer.sh --topic tableTest --broker-list 192.168.**.**:9092
Code
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors._
import org.apache.flink.table.factories.TableSourceFactory
object TableApiTest {
def main(args: Array[String]): Unit = {
//create the stream environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//create the table environment
val tableEnv = StreamTableEnvironment.create(env)
//2.2 read data from Kafka
tableEnv.connect(new Kafka()
.version("universal") //Kafka version: "universal" selects the generic connector, compatible with 0.10+
.topic("tableTest") //the topic to consume
.property("zookeeper.connect","192.168.95.99:2181") //zookeeper address
.property("bootstrap.servers","192.168.95.99:9092") //kafka address
).withFormat(new Csv()) //the format
.withSchema(new Schema() //define the fields
.field("id",DataTypes.STRING())
.field("ts",DataTypes.BIGINT())
.field("vc",DataTypes.DOUBLE())
).createTemporaryTable("kafkaInputTable") //register as a table
//create a Table instance from the table name
val table: Table = tableEnv.from("kafkaInputTable")
//print the Table (toAppendStream requires import org.apache.flink.table.api.scala._)
table.toAppendStream[(String,Long,Double)].print()
env.execute()
}
}
Input data in the producer
Data received after the program starts
Practice: converting a DataStream to a Table
Test data
ws_001,1577844001,35.0
ws_002,1577844015,43.0
ws_003,1577844020,72.0
ws_001,1577844001,45.0
ws_002,1577844015,73.0
Code
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
//define the case class
case class WaterSensor(id:String,ts:Long,vc:Double)
object Example {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//read the local data file into a DataStream
val dataStream: DataStream[String] = env.readTextFile("D:\\...\\inputTableCsv.txt")
//convert the input into WaterSensor
val dataStream2: DataStream[WaterSensor] = dataStream.filter(data => {
val strings: Array[String] = data.split(",")
strings.length == 3
}).map(data => {
val strings: Array[String] = data.split(",")
WaterSensor(strings(0), strings(1).toLong, strings(2).toDouble)
})
//first create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//Approach 1: create a Table from the stream
val dataTable: Table = tableEnv.fromDataStream(dataStream2)
//transform it with the Table API
val resultTable: Table = dataTable.select("id,vc").filter("id='ws_003'")
//Approach 2: analyze the data directly with SQL
//register a view
tableEnv.createTemporaryView("dataTable",resultTable)
val sqlTable: Table = tableEnv.sqlQuery("select id,vc from dataTable where id='ws_003'")
//convert the Table API result into a stream and print
resultTable.toAppendStream[(String,Double)].print("table api")
//convert the SQL result into a stream and print
sqlTable.toAppendStream[(String,Double)].print("sql")
env.execute()
}
}
Console output
Practice: writing Table data to a local file
Sample code
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}
//define the case class
case class WaterSensor(id:String,ts:Long,vc:Double)
object TableOutCsv {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//create the table environment
val table: EnvironmentSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env,table)
//read the local data file and convert it to WaterSensor
val dataStream: DataStream[WaterSensor] = env.readTextFile("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt")
.map(a=>{
val strings: Array[String] = a.split(",")
WaterSensor(strings(0),strings(1).toLong,strings(2).toDouble)
})
//create a Table from the stream
val dataTable: Table = tableEnv.fromDataStream(dataStream)
//query with the Table API
val dataTable2: Table = dataTable.select("id,vc")
//register the output table
val outputPath:String="D:\\ideaProject\\FlinkDemo\\in\\outputTable.txt"
tableEnv.connect(new FileSystem().path(outputPath))
.withFormat(new Csv())
.withSchema(
new Schema()
.field("id",DataTypes.STRING())
.field("vc",DataTypes.DOUBLE())
).createTemporaryTable("outputTable")
//insert the queried Table into the output table
dataTable2.insertInto("outputTable")
//start
env.execute()
}
}
Test data
ws_001,1577844001,24.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
Before execution
After execution
Note: CsvTableSink only implements BatchTableSink (batch) and AppendStreamTableSink (append-only streaming), so a result that involves aggregations or other updating operations cannot be written to a local file with CsvTableSink
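If an aggregated result is needed locally, one workaround sketch is to bypass the table sink and convert the result to a retract stream yourself:
//aggregations produce updates, which the append-only CsvTableSink cannot encode
val aggTable: Table = dataTable.groupBy('id).select('id, 'vc.avg as 'avgVc)
//the Boolean flag marks inserts (true) vs. retractions (false); handle both in custom sink logic
aggTable.toRetractStream[Row].print()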
Practice: writing Table data to Kafka
Sample code
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, Kafka, Schema}
object TableOutKafka {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//create the table environment
val tableStream: EnvironmentSettings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inStreamingMode()
.build()
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, tableStream)
//create the Kafka input table
tableEnv.connect(new Kafka()
.version("universal")
.topic("inTable") //the input topic
.property("zookeeper.connect","192.168.**.**:2181")
.property("bootstrap.servers","192.168.**.**:9092")
).withFormat(new Csv())
.withSchema(new Schema()
.field("id",DataTypes.STRING())
.field("ts",DataTypes.BIGINT())
.field("vc",DataTypes.DOUBLE())
).createTemporaryTable("kafkaInputTable")
//create a Table instance from the table name
val table: Table = tableEnv.from("kafkaInputTable")
//query
val resultTable: Table = table.select('id,'vc).filter('id === "ws_003")
//create the Kafka output table
tableEnv.connect(new Kafka()
.version("universal")
.topic("outTable")
.property("zookeeper.connect","192.168.95.99:2181")
.property("bootstrap.servers","192.168.95.99:9092")
).withFormat(new Csv())//RFC-compliant CSV
.withSchema(new Schema()
.field("id",DataTypes.STRING())
.field("vc",DataTypes.DOUBLE())
).createTemporaryTable("kafkaOutputTable")
//insert the query result into the output table
resultTable.insertInto("kafkaOutputTable")
env.execute()
}
}
Create the corresponding Kafka producer and consumer
- Producer:
kafka-console-producer.sh --topic inTable --broker-list 192.168.**.**:9092
- Consumer:
kafka-console-consumer.sh --topic outTable --bootstrap-server 192.168.**.**:9092
Start the program
- Enter data in the producer to test: ws_003,1577844020,32.0
This completes a Kafka-in, Kafka-out pipeline test; note, however, that KafkaTableSink likewise only implements the Append mode
Practice: writing Table data to Elasticsearch
Add the dependency
<!-- Adjust to the version you use -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch-base_2.11</artifactId>
<version>1.10.2</version>
</dependency>
Test data
ws_001,1577844001,24.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
Sample code
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, Table}
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors._
object TableOutES {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//2.1 read the CSV file and store it in a table
tableEnv.connect(new FileSystem().path("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt"))
.withFormat(new Csv) //specify the file format
.withSchema(new Schema() //define the fields
.field("id",DataTypes.STRING()) //column 1: name id, type String
.field("ts",DataTypes.BIGINT()) //column 2: name ts, type Long
.field("vc",DataTypes.DOUBLE()) //column 3: name vc, type Double
).createTemporaryTable("inputTable") //create a table holding the data
//create a Table instance from the table name
val tableData: Table = tableEnv.from("inputTable")
//aggregate query with the Table API
val dataTable2: Table = tableData.groupBy('id).select('id,'id.count as 'count)
//configure the ES connector and register the table
tableEnv.connect(new Elasticsearch()
.version("6")
.host("192.168.**.**",9200,"http")
.index("sensor")
.documentType("temperature")
).inUpsertMode() //ES supports the Upsert mode
.withFormat(new Json()) //requires the flink-json dependency
.withSchema(new Schema()
.field("id",DataTypes.STRING())
.field("count",DataTypes.BIGINT())
).createTemporaryTable("esOutputTable")
//insert the aggregated data into the registered ES table
dataTable2.insertInto("esOutputTable")
env.execute("es out put test")
}
}
Check the Elasticsearch indices
curl "192.168.**.**:9200/_cat/indices?v"
Check the data
curl "192.168.**.**:9200/sensor/_search?pretty"
Practice: writing Table data to MySQL
Add the dependency
<!-- Adjust to the version you use -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-jdbc_2.11</artifactId>
<version>1.10.2</version>
</dependency>
Test data
ws_001,1577844001,24.0
ws_002,1577844015,43.0
ws_003,1577844020,32.0
Code
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.scala._
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}
object TableOutMySQL {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//2.1 read the CSV file and store it in a table
tableEnv.connect(new FileSystem().path("D:\\ideaProject\\FlinkDemo\\in\\StringToClass.txt"))
.withFormat(new Csv) //specify the file format
.withSchema(new Schema() //define the fields
.field("id",DataTypes.STRING()) //column 1: name id, type String
.field("ts",DataTypes.BIGINT()) //column 2: name ts, type Long
.field("vc",DataTypes.DOUBLE()) //column 3: name vc, type Double
).createTemporaryTable("inputTable") //create a table holding the data
//create a Table instance from the table name
val tableData: Table = tableEnv.from("inputTable")
//aggregate query with the Table API
val dataTable2: Table = tableData.groupBy('id).select('id,'id.count as 'count)
//configure the DDL for connecting to MySQL
//OutputToMySQLTable is the name registered in the catalog
//flink_to_mysql is the target table name in MySQL
val sinkDDL=
"""
|create table OutputToMySQLTable(
| id varchar(20) not null,
| cnt bigint not null
| ) with (
| 'connector.type' = 'jdbc',
| 'connector.url' = 'jdbc:mysql://192.168.95.99:3306/test',
| 'connector.table' = 'flink_to_mysql',
| 'connector.driver' = 'com.mysql.jdbc.Driver',
| 'connector.username' = 'root',
| 'connector.password' = 'root123'
| )
""".stripMargin
//run the DDL to register the table
tableEnv.sqlUpdate(sinkDDL)
//insert the queried data into MySQL
dataTable2.insertInto("OutputToMySQLTable")
env.execute()
}
}
Create the table in MySQL
- Create the table:
create table flink_to_mysql(id varchar(20),cnt bigint);
- Check the table structure:
desc flink_to_mysql;
Start the program
Result
- Query the data:
select * from flink_to_mysql;