Flink Streaming API
The code is organized into four modules: environment, source, transform, and sink.
Transform
1. Basic transformation operators
map
Applies the given function to each element.
val dataStream2 = dataStream.filter(x => !x.isEmpty)
.map(data => {
val dataArray = data.split(",")
SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
})
flatMap
Flattens each input element (which may itself be a collection, e.g. a list) and applies the function, so one input can produce zero or more output elements.
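The semantics mirror Scala's own flatMap on collections; a minimal plain-Scala sketch (no Flink involved) of how one input line can yield several flattened output elements:

```scala
// Plain-Scala illustration of flatMap semantics: each input line
// yields zero or more elements, and the results are flattened.
object FlatMapSketch {
  def main(args: Array[String]): Unit = {
    val lines = List("a,b", "c", "")
    // split each line into tokens, drop empties, flatten the result
    val words = lines.flatMap(_.split(",").filter(_.nonEmpty))
    println(words) // List(a, b, c)
  }
}
```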
Filter
Keeps only the elements for which the predicate returns true.
KeyBy
keyBy turns a DataStream into a KeyedStream; records are repartitioned according to the hash of the key.
dataStream2.keyBy(_.id)
.sum(2)
.print("sum")
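The core idea of hash repartitioning can be sketched in plain Scala. Note this is a deliberate simplification, not Flink's actual algorithm (Flink murmur-hashes the key into key groups first); `subtaskFor` is a hypothetical helper:

```scala
// Simplified sketch of hash-based key partitioning (NOT Flink's exact
// algorithm): the key's hash deterministically picks a target subtask,
// so all records with the same key land on the same subtask.
object KeyPartitionSketch {
  // hypothetical helper: derive a subtask index from the key's hash
  def subtaskFor(key: String, parallelism: Int): Int =
    math.abs(key.hashCode % parallelism)

  def main(args: Array[String]): Unit = {
    val keys = List("sensor_1", "sensor_2", "sensor_1")
    // identical keys always map to the same index
    println(keys.map(subtaskFor(_, 4)))
  }
}
```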
Rolling aggregation operators
Aggregations over a KeyedStream: sum, max, min, etc. Each incoming element emits an updated aggregate.
reduce
More flexible than sum: combines the current element with the previous reduce result.
dataStream2.keyBy(_.id).reduce((x, y) => SensorReading(x.id, x.timestamp + 1, y.temperature + 10))
.print("reduce")
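Because the stream is unbounded, reduce emits one updated result per input element rather than a single final value; per key it behaves like a running fold. A plain-Scala sketch of that rolling behavior (shown on a finite list of temperatures):

```scala
// Rolling reduce semantics: every element emits the reduce of all
// elements seen so far for its key.
object RollingReduceSketch {
  def main(args: Array[String]): Unit = {
    val temps = List(35.0, 36.5, 34.0)
    // scanLeft over the tail, seeded with the first element, mimics
    // the stream of intermediate reduce results Flink would emit
    val rolling = temps.tail.scanLeft(temps.head)(_ + _)
    println(rolling) // List(35.0, 71.5, 105.5)
  }
}
```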
2. Multi-stream transformation operators
split and select
Split: DataStream -> SplitStream. Splits one DataStream into two or more streams according to some criterion (the stream is not actually cut apart; elements just carry different tags). Note that split/select has since been deprecated and removed in newer Flink versions in favor of side outputs.
Select: SplitStream -> DataStream. Retrieves one or more DataStreams from a SplitStream by tag.
val dataStreamSplit = dataStream2.split(data => {
if (data.temperature > 36) Seq("high") else Seq("low")
})
// select
val high = dataStreamSplit.select("high")
val low = dataStreamSplit.select("low")
val all = dataStreamSplit.select("high", "low")
connect and CoMap
Connect: (DataStream, DataStream) -> ConnectedStreams (the two streams may have different types). Connects two streams while preserving their types: after connect they merely share one ConnectedStreams; internally each keeps its own data and form unchanged, and the two streams remain independent.
CoMap and CoFlatMap: ConnectedStreams -> DataStream. Applies map or flatMap directly to a ConnectedStreams, with one function per input stream.
// Connect and CoMap
// connect merges the two streams into one ConnectedStreams
val warning = high.map(x => (x.id, x.temperature))
val connectedStream = warning.connect(low)
// CoMap: one mapping function per input stream
connectedStream.map(
warningData => (warningData._1, warningData._2, "warning"),
lowData => (lowData.id, "healthy")
).print("connect")
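Since the two connected streams keep different types, CoMap takes one function per stream; semantically it is like mapping over an Either. A plain-Scala sketch (the Either/tuple shapes here are illustrative, not Flink types):

```scala
// CoMap semantics sketch: two differently-typed inputs, one function
// each, both producing elements of a common output type.
object CoMapSketch {
  def main(args: Array[String]): Unit = {
    val mixed: List[Either[(String, Double), String]] =
      List(Left(("sensor_1", 40.0)), Right("sensor_2"))
    val out = mixed.map {
      case Left((id, temp)) => (id, temp.toString, "warning") // function for the first stream
      case Right(id)        => (id, "", "healthy")            // function for the second stream
    }
    println(out)
  }
}
```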
Union
Merges multiple streams: DataStream -> DataStream. Unions several DataStreams of the same type into one DataStream. Unlike connect, union can merge more than two streams, but all of them must share the same element type.
high.union(low, all).print("union") // print returns a sink, not a usable stream
3. Supported data types
- Basic types
- Java and Scala tuples
- Scala case classes
- Java POJOs (simple Java objects)
- Others (Arrays, Lists, Maps, Enums, etc.)
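The SensorReading used throughout these examples is assumed to be a Scala case class along these lines (field names inferred from the parsing code above; the exact definition is not shown in these notes):

```scala
// Assumed shape of SensorReading: sensor id, epoch timestamp, temperature.
// Case classes are analyzed by Flink's type system without extra configuration.
case class SensorReading(id: String, timestamp: Long, temperature: Double)

object SensorReadingDemo {
  def main(args: Array[String]): Unit = {
    val r = SensorReading("sensor_1", 1547718199L, 35.8)
    println(r.id + " " + r.temperature)
  }
}
```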
4. Implementing UDF functions
Function classes
Flink exposes interfaces (as interfaces or abstract classes) for all UDFs, e.g. MapFunction, FilterFunction, ProcessFunction.
import org.apache.flink.api.common.functions.{FilterFunction, MapFunction}
import org.apache.flink.streaming.api.scala._

object FunctionClass {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val dataStream = env.readTextFile("E:\\qmlidea\\flink\\src\\main\\resources\\sensor.txt")
dataStream.filter(new MyFilter).map(new MyMap).print()
env.execute("function test")
}
}
class MyFilter extends FilterFunction[String] {
override def filter(t: String): Boolean = {
!t.isEmpty
}
}
class MyMap extends MapFunction[String, SensorReading] {
override def map(t: String): SensorReading = {
val dataArray = t.split(",")
SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
}
}
/**
 * A function class can take constructor parameters, passed in when the object is created.
 *
 * @param keyword the keyword to filter on
 */
class KeyWordFilter(keyword: String) extends FilterFunction[String] {
  override def filter(t: String): Boolean = {
    t.contains(keyword)
  }
}
UDFs can also be implemented as anonymous classes, i.e. by instantiating the abstract type directly inside map, filter, etc.:
dataStream.filter(new RichFilterFunction[String] {
override def filter(t: String): Boolean = {
!t.isEmpty
}
}).print("anonymous class")
Anonymous functions
val dataStream2 = dataStream.filter(x => !x.isEmpty)
Rich functions
A "rich function" is a function-class interface provided by the DataStream API; every Flink function class has a Rich variant. Unlike a regular function class, a rich function can access the runtime context and has lifecycle methods, which enables more complex functionality.
Examples include RichMapFunction, RichFlatMapFunction, RichFilterFunction, etc.
A Rich Function has a lifecycle. Typical lifecycle methods:
1. open(): the initialization method, called before an operator such as map or filter processes its first record
2. close(): the last method called in the lifecycle, used for cleanup work
3. getRuntimeContext(): provides information about the function's RuntimeContext, such as the function's parallelism, the task name, and state
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object RichFunctionClass {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val dataStream = env.readTextFile("E:\\qmlidea\\flink\\src\\main\\resources\\sensor.txt")
dataStream.flatMap(new MyMapper).print()
env.execute("function test")
}
}
/**
 * Rich function: tags each record with the subtask index obtained from the runtime context
 */
class MyMapper() extends RichFlatMapFunction[String, String] {
var index = 0
override def open(parameters: Configuration): Unit = {
index = getRuntimeContext.getIndexOfThisSubtask
}
override def close(): Unit = {
index = 0
}
override def flatMap(in: String, collector: Collector[String]): Unit = {
collector.collect(in + " " + index)
}
}