Operators
DataStream Transformations
DataStream → DataStream
- Map
Takes one element and produces one element. A map function that doubles the values of the input stream:
dataStream.map { x => x * 2 }
- FlatMap
Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words:
dataStream.flatMap { str => str.split(" ") }
- Filter
Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values:
dataStream.filter { _ != 0 }
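A minimal runnable sketch chaining the three operators above (the train host and port 9999 are placeholders that follow the socket examples used later in this section):
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("train", 9999)
  .flatMap(_.split("\\s+")) // one line -> zero or more words
  .map(_.length)            // one word -> its length
  .filter(_ != 0)           // drop empty tokens
  .print()
env.execute("Basic Transformations")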
DataStream* → DataStream
- Union
Union of two or more data streams creating a new stream containing all the elements from all the streams.
Note: If you union a data stream with itself, you will get each element twice in the resulting stream.
dataStream.union(otherStream1, otherStream2, ...)
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStreams
val text1 = env.socketTextStream("train",9999)
val text2 = env.socketTextStream("train",8888)
// 3. Apply DataStream transformations
val counts = text1.union(text2)
.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
// 4. Print the results to the console
counts.print()
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
DataStream, DataStream → ConnectedStreams
- Connect
“Connects” two data streams retaining their types, allowing for shared state between the two streams.
val someStream: DataStream[Int] = ...
val otherStream: DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)
ConnectedStreams → DataStream
- CoMap, CoFlatMap
Similar to map and flatMap on a connected data stream:
connectedStreams.map(
  (_: Int) => true,
  (_: String) => false
)
connectedStreams.flatMap( // both functions must return collections of the same element type
  (x: Int) => List(x.toString),
  (s: String) => s.split(" ").toList
)
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStreams
val text1 = env.socketTextStream("train",9999)
val text2 = env.socketTextStream("train",8888)
// 3. Apply DataStream transformations
val counts = text1.connect(text2)
.flatMap((line:String)=>line.split("\\s+"),(line:String)=>line.split(","))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
// 4. Print the results to the console
counts.print()
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
The difference between connect and union: connect keeps the two input streams and their types, so each stream can be processed with its own function (while sharing state), whereas union merges its inputs into a single stream of one type.
DataStream → SplitStream
- Split
Split the stream into two or more streams according to some criterion.
val split = someDataStream.split(
  (num: Int) => (num % 2) match {
    case 0 => List("even")
    case 1 => List("odd")
  }
)
SplitStream → DataStream
- Select
Select one or more streams from a split stream.
val even = split.select("even")
val odd = split.select("odd")
val all = split.select("even","odd")
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStream
val text1 = env.socketTextStream("train",9999)
// 3. Apply DataStream transformations
val counts = text1.split(line => {
  if (line.contains("error")) {
    List("error")
  } else {
    List("info")
  }
})
// 4. Print the results to the console
counts.select("error").printToErr("error")
counts.select("info").print("info")
counts.select("error","info").print("ALL")
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
- ProcessFunction
In practice, stream splitting is more commonly done with a ProcessFunction and side outputs:
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStream
val text1 = env.socketTextStream("train",9999)
val errorTag = new OutputTag[String]("error")
val allTag = new OutputTag[String]("all")
// 3. Apply DataStream transformations
val counts = text1.process(new ProcessFunction[String, String] {
  override def processElement(value: String,
                              context: ProcessFunction[String, String]#Context,
                              out: Collector[String]): Unit = {
    if (value.contains("error")) {
      context.output(errorTag, value) // route errors to the side output
    } else {
      out.collect(value) // normal records go to the main output
    }
    context.output(allTag, value) // every record also goes to the "all" side output
  }
})
// 4. Print the results to the console
counts.getSideOutput(errorTag).printToErr("error")
counts.getSideOutput(allTag).print("all")
counts.print("normal")
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
DataStream → KeyedStream
- KeyBy
Logically partitions a stream into disjoint partitions, each partition containing elements of the same key.
Internally, this is implemented with hash partitioning. See keys on how to specify keys. This transformation returns a KeyedStream.
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
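A key selector function can also be used instead of a field name or tuple position. A sketch (empStream is an assumed DataStream[Emp]; Emp is the case class used in the aggregation example below):
val keyed: KeyedStream[Emp, String] = empStream.keyBy(_.dept) // key by the dept field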
KeyedStream → DataStream
- Reduce
A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
A reduce function that creates a stream of partial sums:
keyedStream.reduce(_ + _)
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStream
val text1 = env.socketTextStream("train",9999)
// 3. Apply DataStream transformations
val counts = text1.flatMap(_.split(" "))
.map((_,1))
.keyBy(0)
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
// 4. Print the results to the console
counts.print()
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
- Fold
A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.
A fold function that, when applied on the sequence (1,2,3,4,5), emits the sequence “start-1”, “start-1-2”, “start-1-2-3”, …
val result: DataStream[String] = keyedStream.fold("start")((str, i) => { str + "-" + i })
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStream
val text1 = env.socketTextStream("train",9999)
// 3. Apply DataStream transformations
val counts = text1.flatMap(_.split(" "))
.map((_,1))
.keyBy(0)
.fold((null:String,0:Int))((z,v)=>(v._1,z._2+v._2))
// 4. Print the results to the console
counts.print()
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
- Aggregations
Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy)
keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
// 1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Sample input lines (name dept salary):
// zhangsan R&D 1000
// lisi R&D 9000
// ww R&D 9000
// 2. Create the DataStream
val text1 = env.socketTextStream("train",9999)
// 3. Apply DataStream transformations
// Emp is assumed to be defined as: case class Emp(name: String, dept: String, salary: Double)
val counts = text1.map(_.split(" "))
.map(ts=>Emp(ts(0),ts(1),ts(2).toDouble))
.keyBy("dept")
.maxBy("salary") //lisi 研发部 9000
// 4. Print the results to the console
counts.print()
// 5. Execute the streaming job
env.execute("Window Stream WordCount")
If max("salary") were used instead, the result would be Emp(zhangsan,R&D,9000.0): max only updates the aggregated field and keeps the other fields from an earlier element, whereas maxBy returns the whole element that holds the maximum.
Physical partitioning
Flink also provides the following functions to partition a transformed DataStream when needed.
Rebalancing (Round-robin partitioning):
Partitions elements round-robin, creating an equal load per partition. Useful for performance optimization in the presence of data skew.
dataStream.rebalance()
Random partitioning
Partitions elements randomly according to a uniform distribution.
dataStream.shuffle()
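A sketch contrasting the two (host/port are placeholders; the socket source runs with parallelism 1, the sinks with 4):
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.socketTextStream("train", 9999) // source parallelism is 1
lines.rebalance.print("round-robin").setParallelism(4) // even, cyclic distribution
lines.shuffle.print("random").setParallelism(4)        // uniform random distribution
env.execute("Repartition Demo")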
Rescaling
Like round-robin partitioning, rescaling redistributes data cyclically to rebalance load. The difference: round-robin partitioning rebalances globally, shipping data over the network to any other node, whereas rescaling rebalances only between the directly connected upstream and downstream operators, with the concrete routing determined by the parallelism of the two operators. For example, with upstream parallelism 2 and downstream parallelism 4, each upstream subtask routes its data, in equal proportion, to a fixed pair of the downstream subtasks.
dataStream.rescale()
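A sketch of the 2 → 4 setup described above (host/port, the map body, and the job name are placeholders):
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("train", 9999)
  .map(_.toUpperCase).setParallelism(2) // upstream: 2 subtasks
  .rescale                              // each upstream subtask feeds a fixed pair of downstream subtasks
  .print().setParallelism(4)            // downstream: 4 subtasks
env.execute("Rescale Demo")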
Broadcasting
Broadcasts elements to every partition.
dataStream.broadcast
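A sketch: with a downstream parallelism of 4, every input line is printed four times, once per subtask (host/port are placeholders):
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("train", 9999)
  .broadcast // send every element to all downstream subtasks
  .print()
  .setParallelism(4)
env.execute("Broadcast Demo")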
Custom partitioning
Uses a user-defined Partitioner to select the target task for each element:
dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
.map((_,1))
.partitionCustom(new Partitioner[String] {
  override def partition(key: String, numPartitions: Int): Int = {
    (key.hashCode & Integer.MAX_VALUE) % numPartitions // parentheses required: % binds tighter than &
  }
}, _._1)
.print()
.setParallelism(4)
println(env.getExecutionPlan)
env.execute("Stream WordCount")
Task chaining and resource groups
Chaining two sub-operations places the two operators in the same thread, which avoids unnecessary thread overhead and improves performance. Flink chains operators by default whenever possible. For example, a user can call:
StreamExecutionEnvironment.disableOperatorChaining()
to disable chaining, though this is not recommended.
startNewChain
someStream.filter(...).map(...).startNewChain().map(...)
This isolates the first map operator from the filter: the two map operators are chained together, but the filter is not chained to the first map.
disableChaining
someStream.map(...).disableChaining()
No operator will be chained to this map operator.
slotSharingGroup
Sets the slot sharing group of an operation. Flink places operators with the same slot sharing group into the same task slot, while keeping operators without a slot sharing group in other task slots. This can be used to isolate task slots. A downstream operator automatically inherits the slot sharing group of its inputs. By default, every operator's slot sharing group is named "default", so when the user does not partition resources, the number of slots a job needs equals the maximum parallelism among its tasks.
someStream.filter(...).slotSharingGroup("name")
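A sketch combining the three hints on one pipeline (host/port, the operator bodies, and the group name are placeholders):
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("train", 9999)
  .filter(_.nonEmpty)
  .map(_.toUpperCase).startNewChain()         // a new chain starts at this map; not chained to filter
  .flatMap(_.split("\\s+")).disableChaining() // this flatMap is chained to nothing
  .map((_, 1)).slotSharingGroup("group1")     // this map and its successors run in the "group1" slots
  .print()
env.execute("Chaining Demo")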