1. Preparing the Data
First, build a DataStream.
case class Sensor(id: String, timestamp: Long, temperature: Double)
The Sensor case class carries a sensor id, a timestamp, and a temperature.
Build the stream:
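For reference, the sensors.txt file read below would hold comma-separated lines in this shape (hypothetical values, inferred from the printed output in section 3; the actual file is not shown):
1, 7282164761, 0.1
1, 7282164762, 0.2
1, 7282164774, 1.14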
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// flatMap could subsume both map and filter, but map and filter carry clearer semantics: transforming and filtering read more directly
val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
  val array = data.split(",")
  // a case class needs no `new`; the companion apply builds the instance
  Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
})
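Picking up the comment above, flatMap can indeed subsume map and filter. A minimal sketch (assuming the same Sensor case class) that parses and drops malformed lines in one step:

import scala.util.Try

val safeStream: DataStream[Sensor] = dataFromFile.flatMap { line =>
  val array = line.split(",")
  // Try swallows lines with missing fields or unparsable numbers; toSeq emits zero or one Sensor per line
  Try(Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)).toOption.toSeq
}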
2. Splitting with split and select
This approach is marked as deprecated in newer versions of the API.
// split tags records with different labels; select then actually separates them into new DataStreams
// deprecated: see SideOutPutTest below, which implements the same idea with a ProcessFunction and side outputs
val splitStream = dataStream.split(data => {
  if (data.temperature >= 30) Seq("high")
  else if (data.temperature >= 20) Seq("mid") // the >= 30 case is already handled above
  else Seq("low")
})
val high = splitStream.select("high")
val mid = splitStream.select("mid")
val low = splitStream.select("low")
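Note that the deprecated SplitStream.select takes varargs, so one call can also pull several tagged sub-streams back out as a single DataStream:

val highAndMid = splitStream.select("high", "mid")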
3. Using OutputTag and a ProcessFunction
package com.hk.processFunctionTest

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
case class Sensor(id: String, timestamp: Long, temperature: Double)
/**
 * Description: use a ProcessFunction to split the stream; abnormal temperatures go to a separate stream, much like split
 *
 * @author heroking
 * @version 1.0.0
 */
object SideOutPutTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // flatMap could subsume both map and filter, but map and filter carry clearer semantics: transforming and filtering read more directly
    val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
    val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
      val array = data.split(",")
      Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
    })
    // the tag's type parameter is the type of records the side output stream emits
    val tag: OutputTag[Sensor] = new OutputTag[Sensor]("hot")
    val result = dataStream
      .process(new HotAlarm(tag))
    // fetch the records routed to the side output
    val sideOutPut: DataStream[Sensor] = result.getSideOutput(tag)
    sideOutPut.print("side output:")
    result.print("out:")
    env.execute("TransformTest")
  }
}
/**
 * If the temperature is too high, emit the alarm record to the side output stream;
 * the second type parameter of ProcessFunction is the type emitted on the main output
 */
class HotAlarm(alarmOutPutStream: OutputTag[Sensor]) extends ProcessFunction[Sensor, Sensor] {
  override def processElement(sensor: Sensor, context: ProcessFunction[Sensor, Sensor]#Context, collector: Collector[Sensor]): Unit = {
    if (sensor.temperature > 0.5) {
      // hot reading: route it to the side output
      context.output(alarmOutPutStream, sensor)
    } else {
      // normal reading: keep it on the main output
      collector.collect(sensor)
    }
  }
}
Output:
out:> Sensor(1,7282164761,0.1)
out:> Sensor(1,7282164765,0.5)
side output:> Sensor(1,7282164774,1.14)
out:> Sensor(1,7282164762,0.2)
out:> Sensor(1,7282164763,0.3)
out:> Sensor(1,7282164764,0.4)
side output:> Sensor(1,7282164766,0.6)
side output:> Sensor(1,7282164767,0.7)
side output:> Sensor(1,7282164768,0.8)
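A ProcessFunction is not limited to a single tag; one operator can route to several side outputs at once. A minimal sketch (the highTag/lowTag names and the RangeSplitter class are hypothetical, mirroring the high/mid/low ranges from the split example):

val highTag = new OutputTag[Sensor]("high")
val lowTag = new OutputTag[Sensor]("low")

class RangeSplitter(high: OutputTag[Sensor], low: OutputTag[Sensor]) extends ProcessFunction[Sensor, Sensor] {
  override def processElement(sensor: Sensor, ctx: ProcessFunction[Sensor, Sensor]#Context, out: Collector[Sensor]): Unit = {
    if (sensor.temperature >= 30) ctx.output(high, sensor) // hot readings to the "high" side output
    else if (sensor.temperature < 20) ctx.output(low, sensor) // cold readings to the "low" side output
    else out.collect(sensor) // mid-range readings stay on the main stream
  }
}

val splitResult = dataStream.process(new RangeSplitter(highTag, lowTag))
val highStream = splitResult.getSideOutput(highTag)
val lowStream = splitResult.getSideOutput(lowTag)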
4. Late Data in Windows
Records that arrive even after the window's watermark and allowedLateness have passed are likewise emitted through the side output mechanism, via .sideOutputLateData(outputTag) and result.getSideOutput(outputTag); once you hold this data you can process it however you like. Compared with Spark's watermarking and late-data handling, Flink's mechanism is more complete and easier to use.
val outputTag = new OutputTag[(String, Double)]("side")
val result = waterMarkDataStream
  .map(data => (data.id, data.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(2))
  .allowedLateness(Time.seconds(2))
  .sideOutputLateData(outputTag)
  .minBy(1)
  //.reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))
dataStream.print("in")
val sideOutPutStream = result.getSideOutput(outputTag)
sideOutPutStream.print("still late after watermark and allowedLateness:")
result.print("out")
env.execute("TransformTest")
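The snippet assumes waterMarkDataStream already carries event-time timestamps and watermarks. A minimal sketch of how it could be derived from dataStream on this same API version (the one-second out-of-orderness bound and the seconds-to-milliseconds conversion are assumptions about the data):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val waterMarkDataStream: DataStream[Sensor] = dataStream.assignTimestampsAndWatermarks(
  new BoundedOutOfOrdernessTimestampExtractor[Sensor](Time.seconds(1)) {
    // Flink expects millisecond timestamps; Sensor.timestamp is assumed to be in seconds
    override def extractTimestamp(sensor: Sensor): Long = sensor.timestamp * 1000
  })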