Flink Operators

Operators

DataStream Transformations
DataStream → DataStream
  • Map

Takes one element and produces one element. A map function that doubles the values of the input stream:


dataStream.map { x => x * 2 }
  • FlatMap
    Takes one element and produces zero, one, or more elements. A flatMap function that splits sentences into words:
dataStream.flatMap { str => str.split(" ") }
  • Filter
    Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values:
dataStream.filter { _ != 0 }
DataStream* → DataStream
  • Union
    Union of two or more data streams, creating a new stream containing all the elements from all the streams.
    Note: If you union a data stream with itself you will get each element twice in the resulting stream.
dataStream.union(otherStream1, otherStream2, ...)
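The self-union caveat can be pictured with an ordinary Scala collection standing in for a stream (a plain-Scala sketch of the semantics, not the Flink API):

```scala
// Union concatenates streams, so unioning a stream with itself
// yields every element twice in the result.
val stream = List(1, 2, 3)
val unioned = stream ++ stream
println(unioned) // List(1, 2, 3, 1, 2, 3)
```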
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStreams
    val text1 = env.socketTextStream("train", 9999)
    val text2 = env.socketTextStream("train", 8888)

    //3. Apply the DataStream transformations
    val counts = text1.union(text2)
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)

    //4. Print the result to the console
    counts.print()

    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
DataStream,DataStream → ConnectedStreams
  • connect
    “Connects” two data streams retaining their types, allowing for shared state between the two streams.
val someStream: DataStream[Int] = ...
val otherStream: DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)
ConnectedStreams → DataStream
  • CoMap, CoFlatMap
    Similar to map and flatMap on a connected data stream:
connectedStreams.map(
  (_: Int) => true,
  (_: String) => false
)
connectedStreams.flatMap(
  // each co-function must return a collection of the same output type
  (num: Int) => Seq(num.toString),
  (str: String) => str.split(" ")
)
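The two callbacks of CoMap can be pictured in plain Scala, with Either standing in for the two connected inputs (a semantics sketch, not the Flink API):

```scala
// Each element of a connected stream originates from one of two typed
// inputs; CoMap applies a separate function per input and emits a
// single common output type.
val connected: List[Either[Int, String]] = List(Left(1), Right("a"), Left(2))
val mapped: List[Boolean] = connected.map {
  case Left(_)  => true   // the function applied to the Int stream
  case Right(_) => false  // the function applied to the String stream
}
println(mapped) // List(true, false, true)
```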
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStreams
    val text1 = env.socketTextStream("train", 9999)
    val text2 = env.socketTextStream("train", 8888)

    //3. Apply the DataStream transformations
    val counts = text1.connect(text2)
      .flatMap((line: String) => line.split("\\s+"), (line: String) => line.split(","))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)

    //4. Print the result to the console
    counts.print()

    //5. Execute the streaming job
    env.execute("Window Stream WordCount")

The difference between connect and union: connect joins exactly two streams, which may have different types, and lets you process each with its own function; union merges any number of streams of the same type into a single stream that is processed uniformly.

DataStream → SplitStream
  • Split
    Split the stream into two or more streams according to some criterion. (split is deprecated in newer Flink releases; see the ProcessFunction approach below.)
val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case 1 => List("odd")
    }
)
SplitStream → DataStream
  • select
    Select one or more streams from a split stream.
val even = split.select("even")
val odd = split.select("odd")
val all = split.select("even","odd")
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStream
    val text1 = env.socketTextStream("train", 9999)

    //3. Apply the DataStream transformations
    val counts = text1.split(line => {
      if (line.contains("error")) {
        List("error")
      } else {
        List("info")
      }
    })

    //4. Print the result to the console
    counts.select("error").printToErr("error")
    counts.select("info").print("info")
    counts.select("error", "info").print("all")
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  • ProcessFunction
    In practice, splitting a stream is usually done with a ProcessFunction and side outputs instead:
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStream
    val text1 = env.socketTextStream("train", 9999)

    val errorTag = new OutputTag[String]("error")
    val allTag = new OutputTag[String]("all")

    //3. Apply the DataStream transformations
    val counts = text1.process(new ProcessFunction[String, String] {
      override def processElement(value: String,
                                  context: ProcessFunction[String, String]#Context,
                                  out: Collector[String]): Unit = {

        if (value.contains("error")) {
          context.output(errorTag, value) // side output for error records
        } else {
          out.collect(value) // normal records go to the main output
        }
        context.output(allTag, value) // every record also goes to the "all" side output
      }
    })

    //4. Print the result to the console
    counts.getSideOutput(errorTag).printToErr("error")
    counts.getSideOutput(allTag).print("all")
    counts.print("normal")
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
DataStream → KeyedStream
  • KeyBy
    Logically partitions a stream into disjoint partitions, each partition containing elements of the same key.
    Internally, this is implemented with hash partitioning. See keys on how to specify keys. This transformation returns a KeyedStream.
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
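The partitioning contract of keyBy, namely that all elements with the same key end up in the same logical partition, can be pictured with groupBy on a plain collection (a semantics sketch, not the Flink API):

```scala
// All (word, count) pairs with the same word land in the same group,
// which is what makes a per-key rolling sum possible downstream.
val words = List(("flink", 1), ("spark", 1), ("flink", 1))
val perKey: Map[String, Int] =
  words.groupBy(_._1).map { case (key, pairs) => key -> pairs.map(_._2).sum }
```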
KeyedStream → DataStream
  • Reduce
    A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

A reduce function that creates a stream of partial sums:

keyedStream.reduce(_ + _)
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStream
    val text1 = env.socketTextStream("train", 9999)

    //3. Apply the DataStream transformations
    val counts = text1.flatMap(_.split(" "))
        .map((_, 1))
        .keyBy(0)
        .reduce((v1, v2) => (v1._1, v1._2 + v2._2))

    //4. Print the result to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
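The “rolling” emission of reduce can be simulated on an ordinary list with scanLeft (a sketch of the semantics, not the Flink API): each arriving element is combined with the last reduced value, and the new value is emitted immediately.

```scala
// For the key "flink" arriving as (flink,1), (flink,1), (flink,1),
// a rolling reduce emits the running sum after every element.
val arrivals = List(("flink", 1), ("flink", 1), ("flink", 1))
val emitted = arrivals
  .scanLeft(("flink", 0)) { case ((key, acc), (_, n)) => (key, acc + n) }
  .tail // drop the seed; Flink emits one result per input element
println(emitted) // List((flink,1), (flink,2), (flink,3))
```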
  • Fold

A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value. (fold is deprecated in newer Flink releases.)

A fold function that, when applied on the sequence (1,2,3,4,5), emits the sequence “start-1”, “start-1-2”,“start-1-2-3”, …

val result: DataStream[String] = keyedStream.fold("start")((str, i) => { str + "-" + i })
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //2. Create the DataStream
    val text1 = env.socketTextStream("train", 9999)

    //3. Apply the DataStream transformations
    val counts = text1.flatMap(_.split(" "))
        .map((_, 1))
        .keyBy(0)
        .fold(("", 0))((z, v) => (v._1, z._2 + v._2))

    //4. Print the result to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
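The emitted sequence from the fold example can likewise be reproduced with scanLeft on a plain list (a semantics sketch only, not the Flink API):

```scala
// A rolling fold over (1,2,3,4,5) with initial value "start" emits
// one partially folded string per input element.
val folded = List(1, 2, 3, 4, 5)
  .scanLeft("start")((str, i) => str + "-" + i)
  .tail // Flink emits a value per element, not the initial accumulator
println(folded.head) // start-1
println(folded.last) // start-1-2-3-4-5
```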
  • Aggregations

Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (the same holds for max and maxBy).

keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
    // case class assumed by the example below
    case class Emp(name: String, dept: String, salary: Double)

    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    //Sample input (name department salary):
    //zhangsan 研发部 1000
    //lisi 研发部 9000
    //ww 研发部 9000
    //2. Create the DataStream
    val text1 = env.socketTextStream("train", 9999)

    //3. Apply the DataStream transformations
    val counts = text1.map(_.split(" "))
        .map(ts => Emp(ts(0), ts(1), ts(2).toDouble))
        .keyBy("dept")
        .maxBy("salary") //emits lisi 研发部 9000

    //4. Print the result to the console
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")

If max("salary") is used instead, the emitted element is Emp(zhangsan,研发部,9000.0): max only updates the aggregated field and keeps the remaining fields from the first element, while maxBy returns the whole element that actually holds the maximum.
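The max/maxBy distinction from the example above can be reproduced in plain Scala (a simulation of the rolling aggregation, not the Flink API; Emp is the same case class the example assumes):

```scala
case class Emp(name: String, dept: String, salary: Double)

val emps = List(Emp("zhangsan", "研发部", 1000.0), Emp("lisi", "研发部", 9000.0))

// maxBy keeps the whole element that holds the maximum salary
val byElement = emps.reduce((a, b) => if (b.salary > a.salary) b else a)

// max only updates the aggregated field, keeping the first element's other fields
val byField = emps.reduce((a, b) => a.copy(salary = math.max(a.salary, b.salary)))

println(byElement) // Emp(lisi,研发部,9000.0)
println(byField)   // Emp(zhangsan,研发部,9000.0)
```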

Physical partitioning

Flink also provides the following functions to explicitly repartition a transformed DataStream when needed.

Rebalancing (Round-robin partitioning):

Partitions elements round-robin, creating equal load per partition. Useful for performance optimization in the presence of data skew.

dataStream.rebalance()
Random partitioning

Partitions elements randomly according to a uniform distribution.

dataStream.shuffle()
Rescaling

Like round-robin partitioning, rescaling redistributes data cyclically to rebalance load. The difference is that rebalance sends data globally across the network to all downstream subtasks, while rescale only rebalances between directly connected upstream and downstream subtasks; the concrete routing is determined by the parallelism of the two operators. For example, if the upstream operator has parallelism 2 and the downstream operator has parallelism 4, each upstream subtask routes its data in equal proportion to a fixed pair of downstream subtasks.

dataStream.rescale()
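The upstream-2/downstream-4 routing described above can be sketched numerically (an illustration of the rescale pattern, not Flink internals; the helper name is hypothetical):

```scala
// With rescale, each upstream subtask round-robins only over its own
// fixed slice of downstream subtasks instead of all of them.
def rescaleTargets(up: Int, down: Int): Map[Int, List[Int]] = {
  require(down % up == 0, "sketch assumes downstream parallelism is a multiple of upstream")
  val factor = down / up
  (0 until up).map(i => i -> (i * factor until (i + 1) * factor).toList).toMap
}

println(rescaleTargets(2, 4)) // Map(0 -> List(0, 1), 1 -> List(2, 3))
```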
Broadcasting

Broadcasts elements to every partition

dataStream.broadcast
Custom partitioning

Uses a user-defined Partitioner to select the target task for each element, based on a key:

dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
  .map((_, 1))
  .partitionCustom(new Partitioner[String] {
    override def partition(key: String, numPartitions: Int): Int = {
      // parentheses required: % binds tighter than & in Scala
      (key.hashCode & Integer.MAX_VALUE) % numPartitions
    }
  }, _._1)
  .print()
  .setParallelism(4)
println(env.getExecutionPlan)
env.execute("Stream WordCount")
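One subtlety in the partition function: in Scala, `%` binds tighter than `&`, so the mask must be parenthesized, otherwise the expression computes `hash & (Int.MaxValue % n)`. A standalone sketch of the difference (helper names are illustrative):

```scala
// Unparenthesized: the modulo is applied to Int.MaxValue, not to the hash.
def buggy(hash: Int, n: Int): Int = hash & Int.MaxValue % n
// Parenthesized: the sign bit is cleared first, giving an index in [0, n).
def fixed(hash: Int, n: Int): Int = (hash & Int.MaxValue) % n

// For a negative hash and non-power-of-two parallelism the results diverge:
println(buggy(-5, 3)) // 1
println(fixed(-5, 3)) // 0
```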
Task chaining and resource groups

Chaining two operators means placing them in the same thread, which avoids unnecessary thread and serialization overhead and improves performance. Flink chains operators by default whenever possible. A user can call:

StreamExecutionEnvironment.disableOperatorChaining()

to disable chaining globally, but this is not recommended.

startNewChain
someStream.filter(...).map(...).startNewChain().map(...)

Starts a new chain beginning with the first map, isolating it from the preceding filter (the two maps are still chained together).

disableChaining
someStream.map(...).disableChaining()

Prevents this map operator from being chained with any other operator.

slotSharingGroup

Sets the slot sharing group of an operation. Flink places operators with the same slot sharing group into the same task slot, and keeps operators without one in other task slots; this can be used to isolate task slots. A downstream operator automatically inherits the slot sharing group of its inputs. By default every operator belongs to the group named "default", so if the user does not partition resources, the number of slots a job needs equals the parallelism of its most parallel task.

someStream.filter(...).slotSharingGroup("name")