1. Basic Transformations
1.1 map
val streamMap = stream.map { x => x * 2 }
1.2 flatMap
val streamFlatMap = stream.flatMap {
  x => x.split(" ")
}
1.3 filter
val streamFilter = stream.filter {
  x => x == 1
}
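Flink's map, flatMap, and filter mirror the Scala-collection operations of the same names, applied to each stream element as it arrives. As a quick sanity check of the semantics, here are the same three calls on a plain List (no Flink runtime required):

```scala
object BasicTransformsSketch {
  def main(args: Array[String]): Unit = {
    // map: apply a function to every element, exactly one output per input
    val mapped = List(1, 2, 3).map(x => x * 2)              // List(2, 4, 6)
    // flatMap: each input may produce zero or more outputs, which are flattened
    val flatMapped = List("a b", "c").flatMap(_.split(" ")) // List(a, b, c)
    // filter: keep only the elements for which the predicate holds
    val filtered = List(1, 2, 1).filter(x => x == 1)        // List(1, 1)
    println(mapped)
    println(flatMapped)
    println(filtered)
  }
}
```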
2. Keyed Operations
2.1 keyBy
DataStream → KeyedStream: logically splits a stream into disjoint partitions, each containing all elements that share the same key; internally this is implemented with hash partitioning.
import org.apache.flink.api.java.functions.KeySelector
import org.apache.flink.streaming.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val inputDataStream = env.fromCollection(List(
      "sensor1,193,39.5",
      "sensor1,194,38.5",
      "sensor1,195,40.5",
      "sensor2,196,39.8",
      "sensor1,197,39.1",
      "sensor2,198,34.5",
      "sensor2,199,37.1"
    ))
    // keyBy only partitions the stream logically; the two type parameters of a
    // KeySelector are the input element type and the key type.
    // keyBy() accepts four kinds of arguments:
    val resultDataStream = inputDataStream
      .map { data =>
        val dataArray = data.split(",")
        SensorReading2(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
      }
    // 1. A field position index
    val resultStream2: DataStream[SensorReading2] = resultDataStream.keyBy(0).sum(1)
    // 2. A key-selector lambda; rolling aggregates such as sum can only address
    //    the aggregated field by position index or by field name
    val resultStream3: DataStream[SensorReading2] = resultDataStream.keyBy(data => data.id).sum(1)
    // 3. A case-class field name
    val resultStream4: DataStream[SensorReading2] = resultDataStream.keyBy("id").sum("temperature")
    // 4. A custom KeySelector
    val resultStream5: DataStream[SensorReading2] = resultDataStream.keyBy(new MyIDSelector()).sum("temperature")
    resultStream2.print()
    resultStream3.print()
    resultStream4.print()
    resultStream5.print()
    env.execute()
  }
}
class MyIDSelector() extends KeySelector[SensorReading2, String] {
  override def getKey(value: SensorReading2): String = value.id
}
case class SensorReading2(id: String, timestamp: Long, temperature: Double)
3. Rolling Aggregation Operators
3.1 sum/min/max/minBy/maxBy
These operators compute a rolling aggregate over each partition of a KeyedStream:
sum()
min()
max()
minBy()
maxBy()
Note:
These rolling aggregation functions can only address the aggregated field by 1. field position index or 2. field name. The difference between min/max and minBy/maxBy is that min/max track only the extreme value of the chosen field, whereas minBy/maxBy return the entire element that carries the extreme value.
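To make the max vs. maxBy distinction concrete: with max("temperature") only the aggregated field is updated (in practice the remaining fields keep the values of the first element seen for that key), while maxBy("temperature") emits the complete record holding the maximum. A minimal plain-Scala simulation of the two per-key rolling behaviors (a sketch of the semantics, not the Flink runtime):

```scala
case class Reading(id: String, timestamp: Long, temperature: Double)

object RollingAggSketch {
  // Rolling max("temperature"): only the aggregated field is updated;
  // the other fields stay as they were in the first element of the key group
  def rollingMax(acc: Reading, next: Reading): Reading =
    acc.copy(temperature = acc.temperature.max(next.temperature))

  // Rolling maxBy("temperature"): the whole element with the larger temperature wins
  def rollingMaxBy(acc: Reading, next: Reading): Reading =
    if (next.temperature > acc.temperature) next else acc

  def main(args: Array[String]): Unit = {
    val group = List(
      Reading("sensor1", 193, 39.5),
      Reading("sensor1", 195, 40.5),
      Reading("sensor1", 197, 39.1)
    )
    println(group.reduce(rollingMax))   // timestamp 193 from the first element, max temperature
    println(group.reduce(rollingMaxBy)) // the full record of the hottest reading (timestamp 195)
  }
}
```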
3.2 reduce
import org.apache.flink.api.common.functions.ReduceFunction
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala._

object WordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val collectionStream = env.fromCollection(List(
      ("sensor1", 193, 39.5),
      ("sensor1", 194, 38.5),
      ("sensor1", 195, 40.5),
      ("sensor2", 196, 39.8),
      ("sensor1", 197, 39.1),
      ("sensor2", 198, 34.5),
      ("sensor2", 199, 37.1)
    ))
    val resultStream: DataStream[SensorReading2] = collectionStream.map(data => SensorReading2(data._1, data._2.toLong, data._3.toDouble))
    val resultStream2: KeyedStream[SensorReading2, Tuple] = resultStream.keyBy(0)
    resultStream2.print("keyBy result:") // e.g. SensorReading2(sensor2,196,39.8)
    // reduce() accepts either 1. an anonymous function or 2. an object implementing the ReduceFunction interface
    // 1. Anonymous function
    val resultStream3: DataStream[SensorReading2] = resultStream2.reduce {
      (curData, newData) =>
        // Aggregate by id: keep the largest timestamp and the smallest temperature in the group
        SensorReading2(curData.id, curData.timestamp.max(newData.timestamp), curData.temperature.min(newData.temperature))
    }
    // 2. Object implementing the ReduceFunction interface
    val resultStream4: DataStream[SensorReading2] = resultStream2.reduce(new MyReduceFunction)
    resultStream3.print("reduce result:")
    resultStream4.print("custom ReduceFunction result:")
    env.execute("xxxx")
  }
}
class MyReduceFunction extends ReduceFunction[SensorReading2] {
  // Combines the running aggregate curData with each newly arriving element newData
  // and returns the updated aggregate
  override def reduce(curData: SensorReading2, newData: SensorReading2): SensorReading2 = {
    SensorReading2(curData.id, curData.timestamp.max(newData.timestamp), curData.temperature.min(newData.temperature))
  }
}
case class SensorReading2(id: String, timestamp: Long, temperature: Double)
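On a KeyedStream, reduce does not wait for the whole group: it emits the updated aggregate once per arriving element. Per key, that corresponds to a scanLeft-style fold, which can be sketched with plain Scala collections (a simulation of the semantics, not Flink itself):

```scala
case class Reading2(id: String, timestamp: Long, temperature: Double)

object RollingReduceSketch {
  def main(args: Array[String]): Unit = {
    val sensor1 = List(
      Reading2("sensor1", 193, 39.5),
      Reading2("sensor1", 194, 38.5),
      Reading2("sensor1", 195, 40.5)
    )
    // Each input element produces one output: the aggregate so far
    // (max timestamp, min temperature), mirroring the streaming reduce above
    val emitted = sensor1.tail.scanLeft(sensor1.head) { (cur, next) =>
      Reading2(cur.id, cur.timestamp.max(next.timestamp), cur.temperature.min(next.temperature))
    }
    emitted.foreach(println)
    // Reading2(sensor1,193,39.5)
    // Reading2(sensor1,194,38.5)
    // Reading2(sensor1,195,38.5)
  }
}
```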
4. Splitting Streams
4.1 split and select
(Note: split/select has been deprecated in recent Flink versions in favor of side outputs.)
import org.apache.flink.streaming.api.scala._

object SplitAndSelect {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val collectionStream = env.fromCollection(List(
      ("sensor1", 193, 39.5),
      ("sensor1", 194, 38.5),
      ("sensor1", 195, 40.5),
      ("sensor2", 196, 39.8),
      ("sensor1", 197, 39.1),
      ("sensor2", 198, 34.5),
      ("sensor2", 199, 37.1)
    ))
    val resultStream = collectionStream.map {
      data => SensorReading3(data._1, data._2.toLong, data._3.toDouble)
    }
    // Split the stream by tagging each incoming element
    val splitStream = resultStream.split {
      data =>
        if (data.temperature > 39) {
          Seq("high")
        } else {
          Seq("low")
        }
    }
    // Select the desired streams by tag
    val hightStream: DataStream[SensorReading3] = splitStream.select("high")
    val lowStream: DataStream[SensorReading3] = splitStream.select("low")
    val allStream: DataStream[SensorReading3] = splitStream.select("high", "low")
    hightStream.print("high")
    lowStream.print("low")
    allStream.print("low-high")
    env.execute("xxxx")
  }
}
case class SensorReading3(id: String, timestamp: Long, temperature: Double)
5. Merging Streams
5.1 connect and coMap
5.1.1 connect
DataStream, DataStream → ConnectedStreams: connects two data streams while preserving their types. After being connected, the two streams are merely placed in the same stream; internally each keeps its own data and form unchanged, and the two remain independent of each other.
5.1.2 coMap and coFlatMap
ConnectedStreams → DataStream: operates on a ConnectedStreams and works like map and flatMap, applying a separate map/flatMap function to each of the two underlying streams.
val Stream1: DataStream[(Double, String)] = hightStream.map(data => (data.temperature, "warning information"))
// 1. Connect the two streams; their element types may differ
val connectedStream: ConnectedStreams[(Double, String), SensorReading3] = Stream1.connect(lowStream)
// 2. Map the connected streams; the two result types may still differ.
//    Note: map must be called with parentheses here, not curly braces, or the code will not compile
val coStream: DataStream[Any] = connectedStream.map(
  stream1 => stream1._1,
  stream2 => (stream2.temperature, "normal information")
)
coStream.print("coStream:")
env.execute("xxxx")
5.2 union
DataStream → DataStream: unions two or more DataStreams into a new DataStream containing all of their elements; every stream in the union must have the same element type.
val unionStream: DataStream[SensorReading3] = hightStream.union(lowStream).union(allStream)
5.3 connect vs. union
- Before a union the streams must have the same type; connect accepts different types, which the subsequent coMap can reconcile into one type (or keep different).
- connect can only combine two streams, while union can combine any number.
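The type contrast above can be sketched with plain collections: union corresponds to concatenating lists of one element type, while connect resembles tagging elements from two differently typed sources (modeled here with Either, an assumption for illustration only, not Flink's internal representation) and then mapping each side separately:

```scala
object ConnectVsUnionSketch {
  def main(args: Array[String]): Unit = {
    val warnings: List[(Double, String)] = List((40.5, "warning information"))
    val normals: List[Double] = List(38.5, 39.1)

    // union: identical element types required, any number of inputs
    val unioned: List[Double] = normals ++ List(34.5) ++ List(37.1)

    // connect: exactly two inputs whose types may differ; the pair of coMap
    // functions then brings both sides to a (possibly common) output type
    val connected: List[Either[(Double, String), Double]] =
      warnings.map(Left(_)) ++ normals.map(Right(_))
    val coMapped: List[String] = connected.map {
      case Left((temp, msg)) => s"$msg: $temp"
      case Right(temp)       => s"normal information: $temp"
    }
    println(unioned)
    coMapped.foreach(println)
  }
}
```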