Flink's Table API and SQL

    The Table API is a unified relational API for stream and batch processing: a Table API query can run on streaming or batch input without any modification. The Table API is a superset of SQL designed specifically for Apache Flink, and it is a language-integrated API for Scala and Java. Unlike regular SQL, where queries are specified as strings, Table API queries are defined in a language-embedded style in Java or Scala, with IDE support such as auto-completion and syntax checking.

1. Required pom dependency

<!-- EnvironmentSettings and the planner selection used below require Flink 1.9+;
     the legacy flink-table_2.11:1.7.0 artifact does not contain these APIs -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
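    The Scala examples below also assume the following imports; this is a minimal set for the old-planner bridge API (the two table imports provide StreamTableEnvironment, the 'field symbol syntax, and the toAppendStream/toRetractStream conversions), plus the SensorReading case class the examples construct:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.{EnvironmentSettings, Slide, Table, Tumble}
import org.apache.flink.table.api.scala._

// Sample record type used by all examples: sensor id, timestamp (in seconds), temperature
case class SensorReading(id: String, timestamp: Long, temperature: Double)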

2. A first look at the Table API

def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)

  // Read the sensor file and map each line to a SensorReading
  val inputStream = env.readTextFile("..\\sensor.txt")
  val dataStream = inputStream
    .map( data => {
      val dataArray = data.split(",")
      SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
    })

  // Create a table environment based on env (old planner, streaming mode)
  val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
  val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)

  // Create a table from the stream
  val dataTable: Table = tableEnv.fromDataStream(dataStream)

  // Select specific data from the table
  val selectedTable: Table = dataTable.select('id, 'temperature)
    .filter("id = 'sensor_1'")

  // An append-only query can be converted back to a DataStream directly
  val selectedStream: DataStream[(String, Double)] = selectedTable
    .toAppendStream[(String, Double)]

  selectedStream.print()

  env.execute("table test")
}

2.1 Dynamic tables

    If the data type in the stream is a case class, a table can be generated directly from the case class's structure:

tableEnv.fromDataStream(startupLogDstream)  

    Or the fields can be named individually, in declaration order:

tableEnv.fromDataStream(startupLogDstream, 'mid, 'uid, ...)

    Finally, the dynamic table can be converted to a stream for output:

table.toAppendStream[(String,String)]
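    Putting the three snippets together, a minimal runnable sketch (StartupLog and the sample records are hypothetical, mirroring the identifiers used above; env and tableEnv are reused from the first example):

case class StartupLog(mid: String, uid: String)

val startupLogDstream: DataStream[StartupLog] = env.fromElements(
  StartupLog("mid_1", "uid_1"),
  StartupLog("mid_2", "uid_2"))

// Field names default to the case class fields; here they are listed explicitly
val startupLogTable: Table = tableEnv.fromDataStream(startupLogDstream, 'mid, 'uid)

// Append-only output of both fields
startupLogTable.toAppendStream[(String, String)].print()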

2.2 Fields

    A single quote placed in front of a name marks it as a field reference, e.g. 'name, 'mid, 'amount.
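    For example, the select/filter from the first program can be written with either symbol expressions or string expressions (a sketch against the dataTable defined earlier):

// Symbol (language-embedded) syntax; === is the Table API equality expression
dataTable.select('id, 'temperature).filter('id === "sensor_1")

// Equivalent string-expression syntax
dataTable.select("id, temperature").filter("id = 'sensor_1'")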

3. Window aggregation with the Table API

3.1 Understanding the Table API through an example

// Count, every 10 seconds, the number of temperature readings per sensor
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val inputStream = env.readTextFile("..\\sensor.txt")
  val dataStream = inputStream
    .map( data => {
      val dataArray = data.split(",")
      SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
    })
    // Assign event-time timestamps (seconds -> milliseconds) with a 1-second bounded-out-of-orderness watermark
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(1)) {
      override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
    })

  // Create a table environment based on env (old planner, streaming mode)
  val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
  val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)

  // Create a table from the stream, defining the fields and marking ts as the event-time attribute
  val dataTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'ts.rowtime)

  // Windowed aggregation: tumbling 10-second event-time windows per sensor id
  val resultTable: Table = dataTable
    .window( Tumble over 10.seconds on 'ts as 'tw )
    .groupBy('id, 'tw)
    .select('id, 'id.count)

  // A grouped aggregation produces updating results, so a retract stream is required
  val selectedStream: DataStream[(Boolean, (String, Long))] = resultTable
    .toRetractStream[(String, Long)]

  selectedStream.print()

  env.execute("table window test")
}

3.2 About group by

3.2.1 If groupBy is used, the table can only be converted to a stream with toRetractStream

  val rDstream: DataStream[(Boolean, (String, Long))] = table.toRetractStream[(String,Long)]

3.2.2 The first Boolean field returned by toRetractStream marks each row: true means newly added (latest) data, false means retracted (expired) old data

val rDstream: DataStream[(Boolean, (String, Long))] = table.toRetractStream[(String, Long)]

// Keep only the latest (true) rows
rDstream.filter(_._1).print()
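To illustrate with hypothetical data: when the grouped count for sensor_1 grows from 1 to 2, the retract stream first withdraws the old result and then emits the new one, so the unfiltered printed output contains both rows:

// (false, (sensor_1, 1))   <- old result retracted
// (true,  (sensor_1, 2))   <- new result added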

3.2.3 If the query uses a time window, the window's time field must be included in the group by clause.

val table: Table = startupLogTable.filter("ch = 'appstore'").window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch, 'tt).select("ch, ch.count")

3.3 About time windows

3.3.1 To use a time window, the time field must be declared in advance; for processing time, it can simply be appended when creating the dynamic table

val table: Table = startupLogTable.filter("ch ='appstore'").window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch ,'tt).select("ch,ch.count ")

3.3.2 For event time, it must be declared when creating the dynamic table, on a stream that already has timestamps and watermarks assigned

val startupLogTable: Table = tableEnv.fromDataStream(startupLogWithEtDstream, 'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)

3.3.3 A tumbling window can then be declared with Tumble over 10000.millis on the declared time field

val table: Table = startupLogTable.filter("ch = 'appstore'").window(Tumble over 10000.millis on 'ts as 'tt).groupBy('ch, 'tt).select("ch, ch.count")
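Besides Tumble, the Table API also offers Slide and Session windows; a sliding-window sketch over the same table (the 10-second size and 5-second slide are arbitrary):

val slideTable: Table = startupLogTable
  .window(Slide over 10.seconds every 5.seconds on 'ts as 'sw)
  .groupBy('ch, 'sw)
  .select('ch, 'ch.count)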

4. How to write SQL

// Count, every 10 seconds, the number of temperature readings per sensor
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val inputStream = env.readTextFile("..\\sensor.txt")
  val dataStream = inputStream
    .map( data => {
      val dataArray = data.split(",")
      SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
    })
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading](Time.seconds(1)) {
      override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
    })

  // Create a table environment based on env (old planner, streaming mode)
  val settings: EnvironmentSettings = EnvironmentSettings.newInstance().useOldPlanner().inStreamingMode().build()
  val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env, settings)

  // Create a table from the stream, defining the fields and marking ts as the event-time attribute
  val dataTable: Table = tableEnv.fromDataStream(dataStream, 'id, 'temperature, 'ts.rowtime)

  // Windowed aggregation written directly in SQL; concatenating the Table into
  // the query string registers it under a generated name in the old planner
  val resultSqlTable: Table = tableEnv.sqlQuery("select id, count(id) from "
    + dataTable + " group by id, tumble(ts, interval '10' second)")

  val selectedStream: DataStream[(Boolean, (String, Long))] = resultSqlTable.toRetractStream[(String, Long)]

  selectedStream.print()

  env.execute("table window test")
}
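The query can equivalently target a named view instead of concatenating the Table into the SQL string (a sketch; createTemporaryView is available from Flink 1.10, older versions use registerTable):

tableEnv.createTemporaryView("sensor", dataTable)

val resultByName: Table = tableEnv.sqlQuery(
  "select id, count(id) from sensor group by id, tumble(ts, interval '10' second)")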

