Sorting Out Flink's Three Stream and Window Transformations

1. Brief Introduction

As far as I currently understand it (to be updated as I learn more), Flink has four stream abstractions: DataStream, KeyedStream, WindowedStream, and AllWindowedStream.
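The conversion relationships between them can be sketched directly in code. The snippet below is only an illustration (not from the original post): the fromElements source and the tuple field used as the key are made up, and it assumes a Flink version in which timeWindow/timeWindowAll are still available, matching the examples later in this post.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // The basic stream type
    val ds: DataStream[(String, Int)] = env.fromElements(("a", 1), ("b", 2))
    // DataStream -> KeyedStream via keyBy
    val ks: KeyedStream[(String, Int), String] = ds.keyBy(_._1)
    // KeyedStream -> WindowedStream via window / timeWindow / countWindow
    val ws: WindowedStream[(String, Int), String, TimeWindow] = ks.timeWindow(Time.seconds(10))
    // DataStream -> AllWindowedStream via windowAll / timeWindowAll / countWindowAll (non-keyed)
    val aws: AllWindowedStream[(String, Int), TimeWindow] = ds.timeWindowAll(Time.seconds(10))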

1.1 A DataStream can be converted to an AllWindowedStream via the following methods

  // 1. Tumbling time window
  def timeWindowAll(size: Time): AllWindowedStream[T, TimeWindow] = {
    new AllWindowedStream(javaStream.timeWindowAll(size))
  }
  // 2. Sliding time window
  def timeWindowAll(size: Time, slide: Time): AllWindowedStream[T, TimeWindow] = {
    new AllWindowedStream(javaStream.timeWindowAll(size, slide))
  }
  // 3. Sliding count window
  def countWindowAll(size: Long, slide: Long): AllWindowedStream[T, GlobalWindow] = {
    new AllWindowedStream(stream.countWindowAll(size, slide))
  }
  // 4. Tumbling count window
  def countWindowAll(size: Long): AllWindowedStream[T, GlobalWindow] = {
    new AllWindowedStream(stream.countWindowAll(size))
  }
  // 5. Generic windowAll with a custom WindowAssigner
  def windowAll[W <: Window](assigner: WindowAssigner[_ >: T, W]): AllWindowedStream[T, W] = {
    new AllWindowedStream[T, W](new JavaAllWindowedStream[T, W](stream, assigner))
  }

AllWindowedStream exposes many methods similar to those on DataStream.

The operation flow on an AllWindowedStream (non-keyed windows):

stream
       .windowAll(...)           <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"
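For example, a minimal non-keyed pipeline following this flow might look like the sketch below. It is not from the original post; the socket source on port 7777 and the 5-second tumbling processing-time window are made-up illustration values.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 7777)
      .map(line => (line, 1))
      .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5))) // required: assigner
      .reduce((a, b) => (a._1, a._2 + b._2))                        // required: function
      .print()
    env.execute("windowAll sketch")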

1.2 A DataStream can be converted to a KeyedStream via the following methods


// 1. keyBy on tuple field positions: the key type is the generic Java Tuple
def keyBy(fields: Int*): KeyedStream[T, JavaTuple] = asScalaStream(stream.keyBy(fields: _*))

// 2. keyBy on field names: the key type is also the generic Java Tuple
def keyBy(firstField: String, otherFields: String*): KeyedStream[T, JavaTuple] =
    asScalaStream(stream.keyBy(firstField +: otherFields.toArray: _*))
    
// 3. keyBy with a key-selector function: the key K keeps its real type
def keyBy[K: TypeInformation](fun: T => K): KeyedStream[T, K] = {

    val cleanFun = clean(fun)
    val keyType: TypeInformation[K] = implicitly[TypeInformation[K]]

    val keyExtractor = new KeySelector[T, K] with ResultTypeQueryable[K] {
      def getKey(in: T) = cleanFun(in)
      override def getProducedType: TypeInformation[K] = keyType
    }
    asScalaStream(new JavaKeyedStream(stream, keyExtractor, keyType))
}

// 4. keyBy with an explicit KeySelector
def keyBy[K: TypeInformation](fun: KeySelector[T, K]): KeyedStream[T, K] = {

    val cleanFun = clean(fun)
    val keyType: TypeInformation[K] = implicitly[TypeInformation[K]]

    asScalaStream(new JavaKeyedStream(stream, cleanFun, keyType))
}
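The main practical difference between these overloads is the key type they produce. A small sketch (not from the original post; the Click case class and its fields are made up, and in a real job the case class should be defined at the top level):

    import org.apache.flink.api.java.tuple.{Tuple => JavaTuple}
    import org.apache.flink.streaming.api.scala._

    case class Click(userId: Long, url: String)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val clicks: DataStream[Click] = env.fromElements(Click(1L, "/home"), Click(2L, "/cart"))

    // keyBy on a field name: the key type is the generic Java Tuple
    val byName: KeyedStream[Click, JavaTuple] = clicks.keyBy("url")
    // keyBy with a key-selector function: the key keeps its real type
    val byFun: KeyedStream[Click, String] = clicks.keyBy(_.url)

This is also why Example 2 below switches from keyBy("url") to keyBy(_.url): the downstream functions can then use the key directly instead of unpacking a Tuple.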

The operation flow for keyed windows (keyBy followed by window):

stream
       .keyBy(...)               <-  keyed versus non-keyed windows
       .window(...)              <-  required: "assigner"
      [.trigger(...)]            <-  optional: "trigger" (else default trigger)
      [.evictor(...)]            <-  optional: "evictor" (else no evictor)
      [.allowedLateness(...)]    <-  optional: "lateness" (else zero)
      [.sideOutputLateData(...)] <-  optional: "output tag" (else no side output for late data)
       .reduce/aggregate/fold/apply()      <-  required: "function"
      [.getSideOutput(...)]      <-  optional: "output tag"

1.3 A KeyedStream can be converted to a WindowedStream via the following methods

Note that a WindowedStream is obtained from a KeyedStream, not directly from a DataStream.

  // 1. Tumbling time window
  def timeWindow(size: Time): WindowedStream[T, K, TimeWindow] = {
    new WindowedStream(javaStream.timeWindow(size))
  }

  // 2. Sliding count window
  def countWindow(size: Long, slide: Long): WindowedStream[T, K, GlobalWindow] = {
    new WindowedStream(javaStream.countWindow(size, slide))
  }

  // 3. Tumbling count window
  def countWindow(size: Long): WindowedStream[T, K, GlobalWindow] = {
    new WindowedStream(javaStream.countWindow(size))
  }

  // 4. Sliding time window
  def timeWindow(size: Time, slide: Time): WindowedStream[T, K, TimeWindow] = {
    new WindowedStream(javaStream.timeWindow(size, slide))
  }

  // 5. Generic window with a custom WindowAssigner
  def window[W <: Window](assigner: WindowAssigner[_ >: T, W]): WindowedStream[T, K, W] = {
    new WindowedStream(new WindowedJavaStream[T, K, W](javaStream, assigner))
  }
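Putting 1.2 and 1.3 together, a keyed window pipeline looks like the following sketch (not from the original post; the word-count logic, the socket source and the 10-second window are illustration values):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 7777)
      .flatMap(_.toLowerCase.split("\\W+"))
      .map((_, 1))
      .keyBy(_._1)                            // DataStream -> KeyedStream
      .timeWindow(Time.seconds(10))           // KeyedStream -> WindowedStream
      .reduce((a, b) => (a._1, a._2 + b._2))  // WindowedStream -> DataStream
      .print()
    env.execute("keyed window sketch")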

2. Usage Examples

2.1 Using WindowedStream

2.1.1 Example 1

import java.sql.Timestamp
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.ListStateDescriptor
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import scala.collection.mutable.ListBuffer

// Case class for the input data
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

// Case class for the window aggregation result
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

object HotItems {
    def main(args: Array[String]): Unit = {
        // Create a streaming execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)

        // Use event-time semantics
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // Read data from a file
        val inputStream = env.readTextFile("F:\\SparkWorkSpace\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv")
        // Map the data to the case class, extract timestamps and define watermarks
        val dataStream = inputStream.map(data => {
            val dataArray = data.split(",")
            UserBehavior(dataArray(0).toLong, dataArray(1).toLong, dataArray(2).toInt, dataArray(3), dataArray(4).toLong)
        }).assignAscendingTimestamps(_.timestamp * 1000L) // Assign event-time timestamps, converting seconds to milliseconds

        // Transform the data: filter for "pv" events, then window and aggregate the counts
        val aggStream: DataStream[ItemViewCount] = dataStream
            .filter(_.behavior == "pv") // keep only "pv" events
            .keyBy("itemId") // group by itemId
            .timeWindow(Time.hours(1), Time.minutes(5)) // sliding window
            .aggregate(new CountAgg(), new ItemCountWindowResult())

        // The same pipeline stopped before the aggregation, to show the intermediate WindowedStream type
        val aggStream111: WindowedStream[UserBehavior, Tuple, TimeWindow] = dataStream
            .filter(_.behavior == "pv") // keep only "pv" events
            .keyBy("itemId") // group by itemId
            .timeWindow(Time.hours(1), Time.minutes(5)) // sliding window

        //        aggStream.print("aggStream")
        // Key the window aggregation results by window end, then sort and emit the TopN
        val resultStream: DataStream[String] = aggStream
            .keyBy("windowEnd")
            .process(new TopNHotItems(5))

        resultStream.print("resultStream")
        env.execute("hot items job")
    }
}

// Custom pre-aggregation function: add 1 for every incoming record
class CountAgg() extends AggregateFunction[UserBehavior, Long, Long] {
    // Initial value
    override def createAccumulator(): Long = 0L

    // Add 1 for each record
    override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1

    // Return the result
    override def getResult(accumulator: Long): Long = accumulator

    // Merge two accumulators
    override def merge(a: Long, b: Long): Long = a + b
}

// Extended example: computing an average (here, the average timestamp)
class AggAvg() extends AggregateFunction[UserBehavior, (Int, Long), Double] {
    override def createAccumulator(): (Int, Long) = (0, 0L)

    override def add(value: UserBehavior, accumulator: (Int, Long)): (Int, Long) = {
        (accumulator._1 + 1, accumulator._2 + value.timestamp)
    }

    override def getResult(accumulator: (Int, Long)): Double = {
        accumulator._2 / accumulator._1.toDouble
    }

    override def merge(a: (Int, Long), b: (Int, Long)): (Int, Long) = {
        (a._1 + b._1, a._2 + b._2)
    }
}
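AggAvg is not wired into the job above. A hedged sketch of how it could be used inside main (the one-hour tumbling window is an illustration value, not from the original post):

        // Hypothetical usage: average event timestamp per itemId per hour
        val avgTsStream: DataStream[Double] = dataStream
            .filter(_.behavior == "pv")
            .keyBy(_.itemId)
            .timeWindow(Time.hours(1))
            .aggregate(new AggAvg())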

// Custom window function: combine the window info and wrap the result in a case class
// Type parameters: IN (the pre-aggregation output), OUT, KEY, WINDOW
class ItemCountWindowResult() extends WindowFunction[Long, ItemViewCount, Tuple, TimeWindow] {

    override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
        // The key must be read via the Java Tuple type
        val itemId = key.asInstanceOf[Tuple1[Long]].f0
        // Window end time
        val windowEnd = window.getEnd
        // Pre-aggregated value
        val count = input.iterator.next()
        // Emit the result
        out.collect(ItemViewCount(itemId, windowEnd, count))
    }
}

// Custom KeyedProcessFunction
class TopNHotItems(n: Int) extends KeyedProcessFunction[Tuple, ItemViewCount, String] {
    // ListState holding all current count results
    lazy val itemCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemcount-list", classOf[ItemViewCount]))

    override def processElement(value: ItemViewCount,
                                ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context,
                                out: Collector[String]): Unit = {
        // Save every incoming record into state
        itemCountListState.add(value)
        // Register a timer to fire at windowEnd + 100.
        // 1. Timers work per key (after keyBy); 2. registering the same timestamp for one key several times still creates only one timer.
        // When it fires, the records of this window are sorted and emitted.
        ctx.timerService().registerEventTimeTimer(value.windowEnd + 100)
    }

    // When the timer fires, read the data from state, sort it, and emit
    override def onTimer(timestamp: Long,
                         ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext,
                         out: Collector[String]): Unit = {
        // First copy the state contents into a ListBuffer
        val allItemCountList: ListBuffer[ItemViewCount] = ListBuffer()
        // itemCountListState is backed by a Java collection; import the implicit conversions to iterate with Scala syntax
        import scala.collection.JavaConversions._
        for (itemCount <- itemCountListState.get()) {
            allItemCountList += itemCount
        }

        // Sort by count in descending order and keep the top n
        val sortedItemCountList = allItemCountList.sortBy(_.count)(Ordering.Long.reverse).take(n)

        // Clear the state
        itemCountListState.clear()

        // Format the ranking info as a String for monitoring
        val result: StringBuilder = new StringBuilder
        result.append("Time: ").append(new Timestamp(timestamp - 100)).append("\n")
        // Iterate over the sorted list and output the TopN info
        for (i <- sortedItemCountList.indices) {
            // Count info of the current item
            val currentItemCount = sortedItemCountList(i)
            result.append("Top").append(i + 1).append(":")
                .append(" itemId=").append(currentItemCount.itemId)
                .append(" views=").append(currentItemCount.count)
                .append("\n")
        }
        result.append("===============================\n\n")

        Thread.sleep(1000) // slow down the output so the rankings are easier to read
        // Emit the result
        out.collect(result.toString())
    }
}

2.1.2 Example 2

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import scala.collection.mutable.ListBuffer

// Case class for the input data
case class ApacheLogEvent(ip: String, userId: String, eventTime: Long, method: String, url: String)

// Case class for the aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object NetworkTopNPage {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        env.setParallelism(1)

        // Read data from a file
        val inputStream: DataStream[String] = env.readTextFile("F:\\SparkWorkSpace\\UserBehaviorAnalysis\\NetworKFlowAnalysis\\src\\main\\resources\\apache.log")

        // Map to the case class and assign timestamps and watermarks
        val dataStream = inputStream
            .map(data => {
                val dataArray = data.split(" ")
                // Convert the time field into an epoch timestamp
                val simpleDataFormat = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss")
                val timestamp = simpleDataFormat.parse(dataArray(3)).getTime

                ApacheLogEvent(dataArray(0), dataArray(1), timestamp, dataArray(5), dataArray(6))
            })
            // 1. Allow 1 second of out-of-orderness: not all data is guaranteed to arrive within that second, but window results are emitted 1 second later
            .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
                override def extractTimestamp(element: ApacheLogEvent): Long = element.eventTime
            })

        // Window and aggregate
        // Side-output tag for late (out-of-order) data
        val lateOutputTag = new OutputTag[ApacheLogEvent]("late data")
        val aggStream = dataStream
            //            .keyBy("url")
            .keyBy(_.url) // with a key selector the key keeps its real type instead of a Java Tuple, so it can be used directly later
            .timeWindow(Time.minutes(10), Time.seconds(5))
            // 2. The 1-second watermark delay above cannot guarantee that all out-of-order data has arrived when the window fires.
            // So keep the window open: for one more minute, late records are added to the window and an updated result is emitted.
            .allowedLateness(Time.minutes(1))
            // 3. Records that arrive even after that extra minute are sent to the side output
            .sideOutputLateData(lateOutputTag)
            .aggregate(new PageCountAgg(), new PageCountWindowResult())

        // Late out-of-order data: it can later be merged with earlier results, or used to monitor which records were dropped
        val lateDataStream = aggStream.getSideOutput(lateOutputTag)
        // Sort and emit the statistics of each window
        val resultStream = aggStream
            //            .keyBy("windowEnd")
            .keyBy(_.windowEnd)
            .process(new TopNHotPage(3))

        aggStream.print("agg")
        lateDataStream.print("late data")
        resultStream.print("resultStream")
        env.execute("  ")
    }
}

// Custom pre-aggregation function
class PageCountAgg() extends AggregateFunction[ApacheLogEvent, Long, Long] {
    override def createAccumulator(): Long = 0L

    override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1

    override def getResult(accumulator: Long): Long = accumulator

    override def merge(a: Long, b: Long): Long = a + b
}

// Custom WindowFunction: wrap the result in a case class
class PageCountWindowResult() extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
    override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
        out.collect(PageViewCount(key, window.getEnd, input.head))
    }
}

// Custom process function
class TopNHotPage(n: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {

    // State holding all aggregation results; a MapState keyed by url is used so that an updated result for the same url (e.g. from late data) overwrites the old value
    //    lazy val pageCountListState = getRuntimeContext.getListState(new ListStateDescriptor[PageViewCount]("pagecount-list", classOf[PageViewCount]))
    lazy val pageCountMapState = getRuntimeContext.getMapState(new MapStateDescriptor[String, Long]("pagecount-list", classOf[String], classOf[Long]))

    override def processElement(value: PageViewCount,
                                ctx: KeyedProcessFunction[Long,
                                    PageViewCount, String]#Context, out: Collector[String]): Unit = {
        pageCountMapState.put(value.url, value.count)
        // Timer that emits the ranking result
        ctx.timerService().registerEventTimeTimer(value.windowEnd + 1)
        // Timer that actually clears the state, after the allowed lateness has passed
        ctx.timerService().registerEventTimeTimer(value.windowEnd + 60 * 1000L)
    }

    // When the data has arrived (a timer fires), read it from state, sort, and emit
    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {

        // If this is the cleanup timer (windowEnd + 1 minute), the allowed lateness has passed: clear the state and stop
        if (timestamp == ctx.getCurrentKey + 60 * 1000L) {
            pageCountMapState.clear()
            return
        }

        // Otherwise this is the output timer (windowEnd + 1): copy the state into a local list.
        // Do not clear the state here, so that late data within the allowed lateness can still update the ranking;
        // the cleanup timer above takes care of clearing.
        val allPageCountList: ListBuffer[(String, Long)] = ListBuffer()
        val iter = pageCountMapState.entries().iterator()
        while (iter.hasNext) {
            val entry = iter.next()
            allPageCountList += ((entry.getKey, entry.getValue))
        }

        val sortedPageCountList = allPageCountList.sortWith(_._2 > _._2).take(n)

        // Format the ranking info as a String for monitoring
        val result: StringBuilder = new StringBuilder
        result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")
        // Iterate over the sorted list and output the TopN info
        for (i <- sortedPageCountList.indices) {
            // Count info of the current page
            val currentItemCount = sortedPageCountList(i)
            result.append("Top").append(i + 1).append(":")
                .append(" url=").append(currentItemCount._1)
                .append(" views=").append(currentItemCount._2)
                .append("\n")
        }
        result.append("===============================\n\n")

        Thread.sleep(1000) // slow down the output so it is easier to read
        // Emit the result
        out.collect(result.toString())
    }
}

2.2 Using AllWindowedStream

2.2.1 Example 1

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.AllWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class UvCount(windowEnd: Long, count: Long)

object UniqueVisitor {
    def main(args: Array[String]): Unit = {
        // Create a streaming execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(4)

        // Use event-time semantics
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // Read data from a file
        val inputStream = env.readTextFile("F:\\SparkWorkSpace\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv")
        // Map the data to the case class, extract timestamps and define watermarks
        val dataStream = inputStream.map(data => {
            val dataArray = data.split(",")
            UserBehavior(dataArray(0).toLong, dataArray(1).toLong, dataArray(2).toInt, dataArray(3), dataArray(4).toLong)
        }).assignAscendingTimestamps(_.timestamp * 1000L) // Assign event-time timestamps, converting seconds to milliseconds

        val uvStream = dataStream
            .filter(_.behavior == "pv")
            .timeWindowAll(Time.hours(1)) // non-keyed window: this operator runs with parallelism 1 regardless of env.setParallelism(4)
            //            .apply(new UvCountResult()) // collects all window data with no pre-aggregation -- inefficient
            .aggregate(new UvCountAgg(), new UvCountResultWithIncreAgg()) // incremental pre-aggregation followed by a final window function

        uvStream.print()

        env.execute("pv job")
    }
}

// Custom full-window function
class UvCountResult() extends AllWindowFunction[UserBehavior, UvCount, TimeWindow] {
    override def apply(window: TimeWindow, input: Iterable[UserBehavior], out: Collector[UvCount]): Unit = {
        // Use a Set to hold all userIds; duplicates are removed automatically
        var idSet = Set[Long]()
        // Add every record of the current window to the set
        for (userBehavior <- input) {
            idSet += userBehavior.userId
        }

        // The size of the set is the deduplicated UV count
        out.collect(UvCount(window.getEnd, idSet.size))

    }

}

// Custom incremental aggregation function; a Set is used as the accumulator state
class UvCountAgg() extends AggregateFunction[UserBehavior, Set[Long], Long] {
    override def createAccumulator(): Set[Long] = Set[Long]()

    override def add(value: UserBehavior, accumulator: Set[Long]): Set[Long] = accumulator + value.userId

    override def getResult(accumulator: Set[Long]): Long = accumulator.size

    override def merge(a: Set[Long], b: Set[Long]): Set[Long] = a ++ b
}

// Custom window function: attach the window info and wrap the result in a case class
class UvCountResultWithIncreAgg() extends AllWindowFunction[Long, UvCount, TimeWindow] {
    override def apply(window: TimeWindow, input: Iterable[Long], out: Collector[UvCount]): Unit = {
        //        out.collect(UvCount(window.getEnd, input.head))
        out.collect(UvCount(window.getEnd, input.iterator.next()))
    }
}