Official documentation: Apache Flink 1.11 Documentation: Operators
map operator
The map operator transforms every record according to the logic you put inside map.
Here I convert each word into a (key, value) pair.
import org.apache.flink.streaming.api.scala._
object MapTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lineDS: DataStream[String] = env.socketTextStream("doker",8888)
val wordDS: DataStream[String] = lineDS.flatMap(_.split(","))
//Scala lambda style
val kvDS: DataStream[(String, Int)] = wordDS.map((_,1))
kvDS.print()
env.execute()
}
}
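The same map can also be written in the interface style with an explicit MapFunction, which is what the MapFunction import is for. A minimal sketch, assuming the same socket source; the object name MapFunctionTF is made up for this example:

```scala
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._

object MapFunctionTF {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val wordDS: DataStream[String] = env.socketTextStream("doker", 8888)
      .flatMap(_.split(","))
    // interface style: an anonymous MapFunction instead of a lambda
    val kvDS: DataStream[(String, Int)] = wordDS.map(new MapFunction[String, (String, Int)] {
      override def map(word: String): (String, Int) = (word, 1)
    })
    kvDS.print()
    env.execute()
  }
}
```

The lambda form is shorter; the interface form is useful when the mapping logic is long or needs to be shared.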
flatMap operator
flatMap flattens each input element into zero or more output elements, here splitting a line into words.
import org.apache.flink.streaming.api.scala._
object FlatMapTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lineDS: DataStream[String] = env.socketTextStream("doker",8888)
val wordDS: DataStream[String] = lineDS.flatMap(_.split(","))
wordDS.print()
env.execute()
}
}
filter operator
The predicate inside filter decides whether each record is kept: true keeps it, false drops it.
import org.apache.flink.streaming.api.scala._
object FilterTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lineDS: DataStream[String] = env.socketTextStream("doker",8888)
val wordDS: DataStream[String] = lineDS.flatMap(_.split(","))
val filterDS = wordDS.filter(_.contains("java"))
filterDS.print()
env.execute()
}
}
keyBy operator
keyBy logically partitions the stream by key: records with the same key are sent to the same partition (a partition here is one of the parallel instances of the downstream operator). Internally it partitions the stream with a hash function. It returns a KeyedStream.
import org.apache.flink.streaming.api.scala._
object KeyByTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lineDS: DataStream[String] = env.socketTextStream("doker", 8888)
val wordDS: DataStream[String] = lineDS.flatMap(_.split(","))
val kvDS: DataStream[(String, Int)] = wordDS.map((_, 1))
val keyByDS: KeyedStream[(String, Int), String] = kvDS.keyBy(_._1)
keyByDS.print()
env.execute()
}
}
As the output shows, records with the same key end up in the same partition.
reduce operator
reduce can only be used after keyBy; it performs a rolling aggregation over all records that share a key.
import org.apache.flink.streaming.api.scala._
object ReduceTF {
def main(args: Array[String]): Unit = {
//create the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lineDS: DataStream[String] = env.socketTextStream("doker",8888)
val wordDS: DataStream[String] = lineDS.flatMap(_.split(","))
val kvDS: DataStream[(String, Int)] = wordDS.map((_,1))
val keyByDS: KeyedStream[(String, Int), String] = kvDS.keyBy(_._1)
//reduce is only available after keyBy
val countDS = keyByDS.reduce((x, y) => (x._1, x._2 + y._2))
countDS.print()
env.execute()
}
}
Aggregations operators
The DataStream API supports various aggregations such as min, max, and sum. They are applied to a KeyedStream.
The field to aggregate is usually given by position: 0, 1, 2, 3, and so on.
If you use a case class, you can pass the field name instead, e.g. max("age").
KeyedStream.sum(0)     KeyedStream.sum("key")
KeyedStream.min(0)     KeyedStream.min("key")
KeyedStream.max(0)     KeyedStream.max("key")
KeyedStream.minBy(0)   KeyedStream.minBy("key")
KeyedStream.maxBy(0)   KeyedStream.maxBy("key")
The difference between max and maxBy is that max only updates the aggregated field (the other fields come from an earlier record), while maxBy returns the entire element that holds the maximum value; min and minBy behave analogously.
import org.apache.flink.streaming.api.scala._
object AggregationSTF {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment =
StreamExecutionEnvironment.getExecutionEnvironment
val linesDS: DataStream[String] = env.readTextFile("data/students.txt")
val clazzDS: DataStream[Student] = linesDS.map(line => {
val stuarr: Array[String] = line.split(",")
Student(stuarr(0),stuarr(1),stuarr(2).toInt,stuarr(3),stuarr(4))
})
val keyByDS: KeyedStream[Student, String] = clazzDS.keyBy(_.clazz)
val value: DataStream[Student] = keyByDS.max("age")
value.print()
env.execute()
}
}
case class Student(id:String,name:String,age:Int,gender:String,clazz:String)
With max, the clazz and age values do correspond, but the other fields are not those of the oldest student. If you need the whole record to be consistent, use maxBy.
After changing max to maxBy above, all fields line up.
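For the positional form listed above, here is a minimal word-count-style sketch on a tuple stream, assuming the same socket source as earlier; the object name SumTF is made up:

```scala
import org.apache.flink.streaming.api.scala._

object SumTF {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val kvDS: DataStream[(String, Int)] = env.socketTextStream("doker", 8888)
      .flatMap(_.split(","))
      .map((_, 1))
    // sum(1) aggregates tuple field position 1 (the count) per key
    val countDS: DataStream[(String, Int)] = kvDS.keyBy(_._1).sum(1)
    countDS.print()
    env.execute()
  }
}
```

On tuples the field is addressed by index; on case classes, by name, as in the Student example above.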
Window operators
There are many window operators; here is a simple timeWindow example.
The timeWindow method takes either one or two parameters.
With one parameter you get a tumbling window: the window's contents are recomputed once per interval.
With two parameters you get a sliding window: every slide interval (the second parameter), the contents of the preceding window length (the first parameter) are computed. For example:
timeWindow(Time.seconds(15), Time.seconds(5)) computes the last 15 s of data every 5 s.
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
object WindowTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lineDS: DataStream[String] = env.socketTextStream("doker",8888)
val wordDS: DataStream[String] = lineDS.flatMap(_.split(","))
val kvDS: DataStream[(String, Int)] = wordDS.map((_,1))
val keyByDS: KeyedStream[(String, Int), String] = kvDS.keyBy(_._1)
//tumbling window: computed every 5 s, the same as Spark Streaming's default behavior
val windowDS: WindowedStream[(String, Int), String, TimeWindow] = keyByDS.timeWindow(Time.seconds(5))
val countDS: DataStream[(String, Int)] = windowDS.reduce((x, y) => (x._1, x._2 + y._2))
countDS.print()
env.execute()
}
}
You can see that the count for the last "java" word has started over from scratch.
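The two-parameter sliding variant described above only changes the timeWindow call; a sketch of the same pipeline, with a made-up object name:

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

object SlidingWindowTF {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val keyByDS: KeyedStream[(String, Int), String] = env.socketTextStream("doker", 8888)
      .flatMap(_.split(","))
      .map((_, 1))
      .keyBy(_._1)
    // sliding window: every 5 s, count the words seen in the last 15 s
    val windowDS: WindowedStream[(String, Int), String, TimeWindow] =
      keyByDS.timeWindow(Time.seconds(15), Time.seconds(5))
    windowDS.reduce((x, y) => (x._1, x._2 + y._2)).print()
    env.execute()
  }
}
```

Because consecutive 15 s windows overlap, each record contributes to up to three window results.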
A later article will cover the different window types in detail.
Union operator
union merges two DataStreams of the same type into one.
import org.apache.flink.streaming.api.scala._
object UnionTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val socketDS1: DataStream[String] = env.socketTextStream("doker",8888)
val socketDS2: DataStream[String] = env.socketTextStream("doker",9999)
val unionDS: DataStream[String] = socketDS1.union(socketDS2)
val countDS = unionDS.flatMap(_.split(","))
.map((_, 1))
.keyBy(_._1)
.reduce((x, y) => (x._1, x._2 + y._2))
countDS.print()
env.execute()
}
}
SideOutput (side outputs)
Besides the main stream produced by DataStream operations, you can produce any number of additional side-output streams. A side output's element type does not have to match the main stream's, and different side outputs can have different types as well. This is useful when you would otherwise have to duplicate a stream and then filter unwanted data out of each copy. To use side outputs, you define an OutputTag that identifies each side-output stream.
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
object SideOutPutTF {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val fileDS = env.readTextFile("data/students.txt")
val male: OutputTag[String] = OutputTag[String]("男")
val female: OutputTag[String] = OutputTag[String]("女")
val sideDS = fileDS
.process(new ProcessFunction[String, String] {
override def processElement(
value: String,
ctx: ProcessFunction[String, String]#Context,
out: Collector[String]): Unit = {
val gender = value.split(",")(3)
if ("男".equals(gender)) {
ctx.output(male, value)
} else {
ctx.output(female, value)
}
}
})
val maleDS = sideDS.getSideOutput(male)
val femaleDS = sideDS.getSideOutput(female)
maleDS.print()
//femaleDS.print()
env.execute()
}
}