Flink: A KeyBy and Reduce Example Combining Watermarks with Windows

Watermarks and Windows

Watermark

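Under event-time semantics we do not rely on system time. Instead, we define a clock from the timestamps carried by the data itself and use it to represent how far time has progressed. Each parallel subtask therefore has its own logical clock, and that clock advances only as the timestamps of incoming data advance. A watermark is the marker that carries this logical time through the stream.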
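
To make the idea of a data-driven clock concrete, here is a minimal sketch in plain Scala, with no Flink dependency. The class name `BoundedWatermarkSketch` and its methods are hypothetical and only model the behavior of a bounded-out-of-orderness strategy: the clock follows the largest timestamp seen so far, minus the allowed delay, minus one millisecond.

```scala
// Simplified model of a bounded-out-of-orderness watermark generator.
// This is an illustration, not Flink's own class; all names are assumptions.
final class BoundedWatermarkSketch(maxOutOfOrdernessMs: Long) {
  // Start high enough that the first watermark computation cannot underflow.
  private var maxTimestamp: Long = Long.MinValue + maxOutOfOrdernessMs + 1

  // Called for every event: the clock only ever moves forward.
  def onEvent(eventTimestampMs: Long): Unit =
    maxTimestamp = math.max(maxTimestamp, eventTimestampMs)

  // The watermark asserts that no event with a timestamp <= this value is still expected.
  def currentWatermark: Long = maxTimestamp - maxOutOfOrdernessMs - 1
}
```

With a delay of 5 seconds, an event stamped 10000 moves the watermark to 4999, and a later event stamped 7000 leaves it unchanged, because out-of-order data never moves the clock backwards.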

 

Window

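Flink is a stream-processing engine designed mainly for unbounded data streams, where data keeps arriving without end. One way to process an unbounded stream conveniently and efficiently is to cut the infinite data into finite "chunks" and process each chunk; these chunks are called windows. In Flink, windows are the core mechanism for processing unbounded streams.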
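
As an illustration of how a stream is cut into chunks, the sketch below assigns a timestamp to a fixed-size tumbling window using plain arithmetic. The object and method names are hypothetical, the window offset is assumed to be zero, and windows are treated as half-open intervals [start, end).

```scala
// Illustrative tumbling-window assignment; names are assumptions, not Flink API.
object TumblingWindowSketch {
  // Inclusive start of the window of length sizeMs that contains timestampMs.
  // floorMod keeps the result correct for negative timestamps as well.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - Math.floorMod(timestampMs, sizeMs)

  // The window as a (start, end) pair, where end is exclusive.
  def windowFor(timestampMs: Long, sizeMs: Long): (Long, Long) = {
    val start = windowStart(timestampMs, sizeMs)
    (start, start + sizeMs)
  }
}
```

For a 10-second window, a record stamped 12345 falls into [10000, 20000), while a record stamped 9999 still belongs to [0, 10000).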

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time


// One input record: user name, amount, and event time in milliseconds.
case class User(name: String, money: Double, time: Long)

object StreamWindow {
  def main(args: Array[String]): Unit = {
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    // A parallelism of 1 means there is a single watermark, which makes the output easier to follow.
    env.setParallelism(1)

    // Tolerate events arriving up to 5 seconds out of order and read the event time from User.time.
    val strategy: WatermarkStrategy[User] = WatermarkStrategy.forBoundedOutOfOrderness[User](Duration.ofSeconds(5))
      .withTimestampAssigner(new SerializableTimestampAssigner[User] {
        override def extractTimestamp(element: User, recordTimestamp: Long): Long = element.time
      })

    // Parse socket lines of the form "name,money,time", then attach timestamps and watermarks.
    val dataStream: DataStream[User] = env.socketTextStream("localhost", 7777)
      .map((data: String) => {
        val arr: Array[String] = data.split(",")
        User(arr(0), arr(1).toDouble, arr(2).toLong)
      })
      .assignTimestampsAndWatermarks(strategy)

    // Sum the money per user within 10-second tumbling event-time windows.
    val resultStream: DataStream[(String, Double)] = dataStream.map((data: User) => (data.name, data.money))
      .keyBy((_: (String, Double))._1)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .reduce((x: (String, Double), y: (String, Double)) => (x._1, y._2 + x._2))
    resultStream.print()

    env.execute("stream window")
  }
}
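
To see what the job above computes without starting a cluster, the following sketch reproduces the keyBy, window, and reduce steps on an in-memory list with plain Scala collections. It is only an approximation under stated assumptions: `UserRecord` and `WindowedSumSketch` are hypothetical names, the window offset is zero, and watermarks and late data are ignored, so every record counts.

```scala
// Batch-style approximation of keyBy + tumbling window + reduce.
// Names are assumptions made for this illustration; no Flink dependency is used.
final case class UserRecord(name: String, money: Double, time: Long)

object WindowedSumSketch {
  // Returns the summed money for each (user name, window start) pair.
  def sumPerUserAndWindow(records: Seq[UserRecord], sizeMs: Long): Map[(String, Long), Double] =
    records
      // Group by key and by the start of the window that contains the record.
      .groupBy(r => (r.name, r.time - Math.floorMod(r.time, sizeMs)))
      // Reduce each group to a single sum, mirroring the reduce function in the job.
      .map { case (key, group) => key -> group.map(_.money).reduce(_ + _) }
}
```

For example, records `alice,10.0,1000` and `alice,5.0,9000` land in the same window starting at 0 and are reduced to 15.0, while `alice,2.0,12000` starts a new result in the window beginning at 10000.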

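Basic Process Function (ProcessFunction)

A process function mainly defines a transformation on a data stream, so it can also be classed as a transformation operator. In Flink, almost every transformation operator provides a corresponding function-class interface, and process functions are no exception: the corresponding function class is called ProcessFunction. ProcessWindowFunction is both a process function and a full-window function. As its name suggests, it is closer in nature to a window function, and in practice its usage does differ considerably from that of the other process functions.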

package com.atguigu.chapter02

import java.time.Duration

import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector


case class Event(user: String, url: String, timestamp: Long)

object StreamWordCount {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Read the socket host and port from the command line, e.g. --host localhost --port 7777.
    val parameterTool: ParameterTool = ParameterTool.fromArgs(args)
    val hostname: String = parameterTool.get("host")
    val port: Int = parameterTool.getInt("port")
    val lineDataStream: DataStream[String] = env.socketTextStream(hostname, port)

    // Parse each line of the form "user,url,timestamp" into an Event.
    val stream: DataStream[Event] = lineDataStream.map((data: String) => {
      val fields: Array[String] = data.split(",")
      Event(fields(0).trim, fields(1).trim, fields(2).trim.toLong)
    })

    // Allow 5 seconds of out-of-orderness, key by user, and evaluate 10-second tumbling windows.
    stream.assignTimestampsAndWatermarks(WatermarkStrategy.forBoundedOutOfOrderness[Event](Duration.ofSeconds(5))
      .withTimestampAssigner(
        new SerializableTimestampAssigner[Event] {
          override def extractTimestamp(t: Event, l: Long): Long = t.timestamp
        }
      ))
      .keyBy((_: Event).user)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .process(new WatermarkWindowResult)
      .print()


    env.execute()
  }

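  // Full-window function: called once per key and window with all buffered elements.
  class WatermarkWindowResult extends ProcessWindowFunction[Event, String, String, TimeWindow] {
    override def process(user: String, context: Context, elements: Iterable[Event], out: Collector[String]): Unit = {
      // Window boundaries and the number of events collected for this user.
      val start: Long = context.window.getStart
      val end: Long = context.window.getEnd
      val count: Int = elements.size

      // The watermark at the moment the window is evaluated.
      val currentWatermark: Long = context.currentWatermark

      out.collect(s"Window $start ~ $end, user $user activity count: $count, watermark is now at: $currentWatermark")
    }
  }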

}
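
The watermark printed by the window function is easier to interpret with the firing rule in mind. The sketch below states that rule in plain Scala, assuming the default event-time trigger and a bounded-out-of-orderness watermark of the form "largest timestamp minus delay minus 1 ms". The object and method names are hypothetical.

```scala
// Illustrative firing rule for an event-time window [start, end); names are assumptions.
object WindowFiringSketch {
  // Last timestamp that still belongs to the window.
  def maxTimestamp(windowEndMs: Long): Long = windowEndMs - 1

  // The window is evaluated once the watermark reaches its max timestamp.
  def shouldFire(windowEndMs: Long, watermarkMs: Long): Boolean =
    watermarkMs >= maxTimestamp(windowEndMs)

  // Smallest event timestamp whose watermark (timestamp - delay - 1) fires the window.
  def firstTriggeringTimestamp(windowEndMs: Long, maxOutOfOrdernessMs: Long): Long =
    windowEndMs + maxOutOfOrdernessMs
}
```

With the 10-second window and 5-second delay used here, the window [0, 10000) should fire only after an event stamped 15000 or later arrives, at which point the watermark is at least 9999. Because watermarks are typically emitted periodically rather than per record, the output may appear a short moment after that event is read.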
