20-09 Flink Project

nzch

Category: 尚硅谷 Big Data

Article link: https://blog.csdn.net/qq_28764557/article/details/114575771

Review: omitted.

---01---

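This time the data is out of order.

The task is hot-page statistics based on web server logs: a real-time ranking of the most visited pages.

The log is analyzed by the URL in each record.

In Scala, a singleton is declared with the object keyword.

Looking at the data, we can see that it is out of order.

Timestamps and watermarks are assigned at the data source.

This part mainly sets up the overall skeleton of the code.

Look at the return type of keyBy: the key is a tuple.

How can we avoid getting a tuple?

keyBy needs to be improved; note that the field-name form returns a tuple type:

How do we get a String key directly? Pass a key-selector function such as keyBy(_.url) instead of a field name.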

---02---

// Custom pre-aggregation function. Note the three type parameters: the input (the case class), the intermediate accumulator, and the result type
class PageCountAgg() extends AggregateFunction[ApacheLogEvent, Long, Long] {
  // Add 1 for every incoming record
  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1
  // Start from 0
  override def createAccumulator(): Long = 0L
  // Return the result
  override def getResult(accumulator: Long): Long = accumulator
  // Merge two partial accumulators
  override def merge(a: Long, b: Long): Long = a + b
}

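// Custom WindowFunction that wraps the result in a case class. Its input is the pre-aggregated result, its output is a PageViewCount, followed by the key type and the window type (TimeWindow)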
class PageCountWindowResult() extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.head))
  }
}

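This one is a process function:

import java.sql.Timestamp
import scala.collection.mutable.ListBuffer
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}

// Custom process function; the type parameters are key, input, output
class TopNHotPage(n: Int) extends KeyedProcessFunction[Long, PageViewCount, String]{
  // MapState that holds all aggregated results (url -> count) for the current windowEnd
  lazy val pageCountMapState: MapState[String, Long] = getRuntimeContext.getMapState(new MapStateDescriptor[String, Long]("pagecount-map", classOf[String], classOf[Long]))
  override def processElement(value: PageViewCount, ctx: KeyedProcessFunction[Long, PageViewCount, String]#Context, out: Collector[String]): Unit = {
    // put overwrites the old count, so a late update replaces the earlier result
    pageCountMapState.put(value.url, value.count)
    // Timer that emits the TopN right after the window end
    ctx.timerService().registerEventTimeTimer(value.windowEnd + 1)
    // Second timer, one minute later, that only clears the state
    ctx.timerService().registerEventTimeTimer(value.windowEnd + 60 * 1000L)
  }
  // Once all the data has arrived, read it from state, sort it, and output
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    // The current key is windowEnd, so this branch is the cleanup timer
    if( timestamp == ctx.getCurrentKey + 60*1000L ) {
      pageCountMapState.clear()
      return
    }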
    val allPageCountList: ListBuffer[(String, Long)] = ListBuffer()
    val iter = pageCountMapState.entries().iterator()
    while( iter.hasNext ){
      val entry = iter.next()
      allPageCountList += ((entry.getKey, entry.getValue))
    }
    val sortedPageCountList = allPageCountList.sortWith(_._2 > _._2).take(n)
    val result: StringBuilder = new StringBuilder
    result.append("Time: ").append( new Timestamp(timestamp - 1) ).append("\n")
    // Iterate over the sorted list and append the TopN info
    for( i <- sortedPageCountList.indices ){
      // Count info of the current page
      val currentItemCount = sortedPageCountList(i)
      result.append("Top").append(i+1).append(":")
        .append(" page url=").append(currentItemCount._1)
        .append(" count=").append(currentItemCount._2)
        .append("\n")
    }
    result.append("==============================\n\n")
    // Throttle the output rate
    Thread.sleep(1000)
    out.collect(result.toString())
  }
}

This is the simplest version, without any delay.

---03---

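Delay time:

If the delay is set to 1 minute, the result is not output until 10:14.

Results are emitted every 5 s, so a 1-minute watermark delay is unnecessary.

Can a delay alone really handle all the out-of-order data?

Flink's three layers of protection for out-of-order data:

1. Timestamps and watermarks

2. The window's allowed lateness for late data

Here the watermark delay is changed to 1 second; a relatively small value is enough to handle most cases.

Allowed lateness: one minute.

3. Send the record to a side output stream. Data in the side output essentially no longer takes part in the computation.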
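The three layers above can be sketched as a small decision function. This is a plain-Scala model, not Flink API code: the names `classify`, `OnTime`, `LateUpdate`, and `SideOutput` are made up for illustration, and the rule assumes the usual event-time behaviour where a window fires once the watermark reaches `windowEnd - 1` and is dropped once the watermark reaches `windowEnd - 1 + allowedLateness`.

```scala
object LateDataSketch {
  sealed trait Fate
  case object OnTime extends Fate      // window has not fired yet
  case object LateUpdate extends Fate  // window fired, but is still kept for late data
  case object SideOutput extends Fate  // window state is gone; record goes to the side output

  // Decide what happens to a record that belongs to the window ending at windowEnd,
  // given the current watermark (all values in milliseconds).
  def classify(windowEnd: Long, watermark: Long, allowedLatenessMs: Long): Fate = {
    val maxTimestamp = windowEnd - 1 // last timestamp that still belongs to the window
    if (watermark < maxTimestamp) OnTime
    else if (watermark < maxTimestamp + allowedLatenessMs) LateUpdate
    else SideOutput
  }
}
```

With a window ending at 50 s and one minute of allowed lateness, a record arriving while the watermark is at 51 s still updates the window, while one arriving at a watermark of 110 s goes to the side output.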

---

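Register a new timer one minute later. The window close times are in fact integer multiples of the slide interval.

Because the watermark delay is 1 second, the window ending at 50 s is only output once a record with timestamp 51 s arrives.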
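A minimal sketch of that arithmetic, in plain Scala rather than Flink API calls. It assumes a bounded-out-of-orderness style watermark (largest timestamp seen minus the delay) and a window that fires once the watermark reaches `windowEnd - 1`; the object and method names are hypothetical.

```scala
object WatermarkSketch {
  val maxOutOfOrdernessMs = 1000L // the 1-second delay from the notes

  // Watermark after seeing the given event timestamps: max timestamp minus the delay.
  def watermarkAfter(timestamps: Seq[Long]): Long =
    timestamps.max - maxOutOfOrdernessMs

  // A window fires when the watermark reaches its last valid timestamp.
  def windowFires(windowEnd: Long, watermark: Long): Boolean =
    watermark >= windowEnd - 1
}
```

An out-of-order record does not move the watermark backwards: after seeing 51 s and then 46 s, the watermark stays at 50 s.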

---

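One very important point: by 52 s the window has already closed, but then a record with timestamp 46 s arrives. What happens?

The count is increased by one on top of the previous count of 1.

Late data updates the earlier result.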
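To see which windows a late record touches, here is a small plain-Scala sketch of sliding-window assignment. The 5 s slide comes from the notes; the 10-minute window size is an assumption for illustration, and `assign` is a hypothetical helper (valid for non-negative timestamps and zero offset), not a Flink method.

```scala
object SlidingWindows {
  // Returns (start, end) of every sliding window that contains ts (milliseconds).
  def assign(ts: Long, sizeMs: Long, slideMs: Long): List[(Long, Long)] = {
    val lastStart = ts - (ts % slideMs) // latest window start that is <= ts
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(_ > ts - sizeMs)       // the window must still cover ts
      .map(start => (start, start + sizeMs))
      .toList
  }
}
```

For a record at 46 s the earliest window end is 50 s, so a window that already fired at 50 s receives the record and its count goes up by one.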

---

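The timer is registered at 50 s + 1 ms, because the window slides every 5 s.

There is a bug here: if the state of a window is cleared as soon as its result is output, a late record that arrives afterwards is counted against an empty state, which gives wrong results. So we also register a second timer that clears the state later.

currentKey is the windowEnd, because the stream is keyed by windowEnd.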
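The two-timer idea can be modelled without Flink. The sketch below is a simplified stand-in for one windowEnd key: `TopNModel`, its mutable map, and its timer set are hypothetical substitutes for MapState and the timer service, and the one-minute cleanup delay is taken from the allowed lateness in the notes.

```scala
import scala.collection.mutable

// Plain-Scala model of the state kept for one windowEnd key (not Flink API).
class TopNModel(windowEnd: Long, n: Int, allowedLatenessMs: Long = 60 * 1000L) {
  private val pageCounts = mutable.Map[String, Long]() // stands in for MapState
  val timers = mutable.Set[Long]()                      // a set, so repeated registrations collapse

  def processElement(url: String, count: Long): Unit = {
    pageCounts(url) = count                 // a late update overwrites the old count
    timers += windowEnd + 1                 // output timer
    timers += windowEnd + allowedLatenessMs // cleanup timer
  }

  // Returns the TopN for an output timer, or None for the cleanup timer.
  def onTimer(timestamp: Long): Option[Seq[(String, Long)]] = {
    timers -= timestamp
    if (timestamp == windowEnd + allowedLatenessMs) {
      pageCounts.clear()
      None
    } else {
      Some(pageCounts.toSeq.sortBy(-_._2).take(n))
    }
  }

  def stateSize: Int = pageCounts.size
}
```

Because the state survives the first output, a late count for the same URL simply replaces the old entry and the next output reflects it; only the cleanup timer empties the map.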

---04---

We want low latency and accurate results at the same time, so there are many details to watch.

---05---

Code:

package com.atguigu.networkflow_analysis

import org.apache.flink.api.common.functions.{AggregateFunction, MapFunction, RichMapFunction}
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.util.Random

/**
  * Copyright (c) 2018-2028 尚硅谷 All Rights Reserved 
  *
  * Project: UserBehaviorAnalysis
  * Package: com.atguigu.networkflow_analysis
  * Version: 1.0
  *
  * Created by wushengran on 2020/4/27 11:47
  */

// Case classes for input and output
case class UserBehavior( userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long )
case class PvCount( windowEnd: Long, count: Long )

object PageView {
  def main(args: Array[String]): Unit = {
    // Create a stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Read data from a file
    val inputStream: DataStream[String] = env.readTextFile("D:\\Projects\\BigData\\UserBehaviorAnalysis\\HotItemsAnalysis\\src\\main\\resources\\UserBehavior.csv")

    // Convert the data into the case class, extract the timestamp, and define the watermark
    val dataStream: DataStream[UserBehavior] = inputStream
      .map( data => {
        val dataArray = data.split(",")
        UserBehavior( dataArray(0).toLong, dataArray(1).toLong, dataArray(2).toInt, dataArray(3), dataArray(4).toLong )
      } )
      .assignAscendingTimestamps(_.timestamp * 1000L)

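    // Assign keys, wrap into 2-tuples, and aggregate in a window
    val pvStream: DataStream[PvCount] = dataStream
      .filter(_.behavior == "pv")
//      .map( data => ("pv", 1L) )    // map to a 2-tuple ("pv", count); this puts all data into one group
      .map( new MyMapper() )    // custom mapper that spreads the keys evenly
      .keyBy(_._1)    // key by the first tuple field
      .timeWindow(Time.hours(1))    // one-hour tumbling window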
      .aggregate( new PvCountAgg(), new PvCountResult() )

    // Merge the results of all partitions
    val pvTotalStream: DataStream[PvCount] = pvStream
      .keyBy(_.windowEnd)
      .process( new TotalPvCountResult() )
//      .sum("count")

    pvTotalStream.print()

    env.execute("pv job")
  }
}

// Custom pre-aggregation function
class PvCountAgg() extends AggregateFunction[(String, Long), Long, Long]{
  override def add(value: (String, Long), accumulator: Long): Long = accumulator + 1

  override def createAccumulator(): Long = 0L

  override def getResult(accumulator: Long): Long = accumulator

  override def merge(a: Long, b: Long): Long = a + b
}

// Custom window function that wraps the window info into the case class output
class PvCountResult() extends WindowFunction[Long, PvCount, String, TimeWindow]{
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PvCount]): Unit = {
    out.collect( PvCount(window.getEnd, input.head) )
  }
}

// Custom MapFunction that generates the key (here the subtask index rather than a random value)
class MyMapper() extends RichMapFunction[UserBehavior, (String, Long)]{
  lazy val index: Long = getRuntimeContext.getIndexOfThisSubtask
  override def map(value: UserBehavior): (String, Long) = (index.toString, 1L)
}

// Custom ProcessFunction that merges the aggregated results per window
class TotalPvCountResult() extends KeyedProcessFunction[Long, PvCount, PvCount]{
  // State that holds the running sum of all results so far
  lazy val totalCountState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("total-count", classOf[Long]))

  override def processElement(value: PvCount, ctx: KeyedProcessFunction[Long, PvCount, PvCount]#Context, out: Collector[PvCount]): Unit = {
    // Add the new count and update the state
    totalCountState.update( totalCountState.value() + value.count )
    // Register a timer that fires at windowEnd + 1
    ctx.timerService().registerEventTimeTimer(value.windowEnd + 1)
  }

  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PvCount, PvCount]#OnTimerContext, out: Collector[PvCount]): Unit = {
    // When the timer fires, the counts of all partitions have arrived; output the total
    out.collect( PvCount(ctx.getCurrentKey, totalCountState.value()) )
    totalCountState.clear()
  }
}

Timestamps and watermarks: https://www.cnblogs.com/Springmoon-venn/p/11403665.html

----------------------------

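What about data skew, and which approach should we take? If every record is mapped to the same key ("pv"), all data goes to a single subtask. MyMapper above spreads the records over several keys, and TotalPvCountResult then sums the partial counts per windowEnd.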
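The two-stage idea can be tried out on plain Scala collections. This is only a sketch of the data flow, not the Flink job: the round-robin key `i % parallelism` stands in for the subtask index, and `partialCounts` and `totals` are hypothetical helper names.

```scala
object TwoStagePv {
  case class PvCount(windowEnd: Long, count: Long)
  val hourMs: Long = 60 * 60 * 1000L

  // Stage 1: count per (key, one-hour window); each key yields its own partial result.
  def partialCounts(timestamps: Seq[Long], parallelism: Int): Seq[PvCount] =
    timestamps.zipWithIndex
      .groupBy { case (ts, i) => (i % parallelism, ts - ts % hourMs + hourMs) }
      .map { case ((_, windowEnd), group) => PvCount(windowEnd, group.size.toLong) }
      .toSeq

  // Stage 2: sum the partial results that share the same windowEnd.
  def totals(partials: Seq[PvCount]): Map[Long, Long] =
    partials.groupBy(_.windowEnd).map { case (end, ps) => end -> ps.map(_.count).sum }
}
```

Stage 1 emits several PvCount records for the same window, one per key, which is why the second keyBy on windowEnd is needed before the total can be output.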

---06-07---

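Addendum:

The same window may produce more than one aggregated result (one per key), so the results have to be summed by windowEnd.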

---08---

Code:

package com.atguigu.networkflow_analysis

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.AllWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * Copyright (c) 2018-2028 尚硅谷 All Rights Reserved 
  *
  * Project: UserBehaviorAnalysis
  * Package: com.atguigu.networkflow_analysis
  * Version: 1.0
  *
  * Created by wushengran on 2020/4/27 14:39
  */

case class UvCount( windowEnd: Long, count: Long )

object UniqueVisitor {
  def main(args: Array[String]): Unit = {
    // Create a stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Read data from a file
    val inputStream: DataStream[String] = env.readTextFile("D:\\codeMy\\CODY_MY_AFTER__KE\\sggBigData\\20-flink\\UserBehaviorAnalysis\\NetworkFlowAnalysis\\src\\main\\resources\\UserBehavior.csv")

    // Convert the data into the case class, extract the timestamp, and define the watermark
    val dataStream: DataStream[UserBehavior] = inputStream
      .map( data => {
        val dataArray = data.split(",")
        UserBehavior( dataArray(0).toLong, dataArray(1).toLong, dataArray(2).toInt, dataArray(3), dataArray(4).toLong )
      } )
      .assignAscendingTimestamps(_.timestamp * 1000L)

    // Open a window over the whole stream and aggregate (no keyBy here)
    val uvStream: DataStream[UvCount] = dataStream
      .filter(_.behavior == "pv")
      .timeWindowAll(Time.hours(1))    // one-hour tumbling window over the whole DataStream
//      .apply( new UvCountResult() )
      .aggregate( new UvCountAgg(), new UvCountResultWithIncreAgg() )

    uvStream.print()

    env.execute("uv job")
  }
}

// Custom full-window function
class UvCountResult() extends AllWindowFunction[UserBehavior, UvCount, TimeWindow]{
  override def apply(window: TimeWindow, input: Iterable[UserBehavior], out: Collector[UvCount]): Unit = {
      // A Set that holds all userIds and deduplicates automatically
      var idSet = Set[Long]()
      // Add every record of the current window to the set
      for( userBehavior <- input ){
        idSet += userBehavior.userId
      }
      // The size of the set is the deduplicated UV value
      out.collect( UvCount(window.getEnd, idSet.size) )
  }
}

// Custom incremental aggregation function; it needs a Set as the accumulator
class UvCountAgg() extends AggregateFunction[UserBehavior, Set[Long], Long]{
  override def add(value: UserBehavior, accumulator: Set[Long]): Set[Long] = accumulator + value.userId

  override def createAccumulator(): Set[Long] = Set[Long]()

  override def getResult(accumulator: Set[Long]): Long = accumulator.size

  override def merge(a: Set[Long], b: Set[Long]): Set[Long] = a ++ b
}
// Custom window function that adds the window info and wraps the result in the case class
class UvCountResultWithIncreAgg() extends AllWindowFunction[Long, UvCount, TimeWindow]{
  override def apply(window: TimeWindow, input: Iterable[Long], out: Collector[UvCount]): Unit = {
    out.collect( UvCount(window.getEnd, input.head) )
  }
}

Look: if we ignore data skew and incremental aggregation, the full-window version is:

// Custom full-window function
class UvCountResult() extends AllWindowFunction[UserBehavior, UvCount, TimeWindow]{
  override def apply(window: TimeWindow, input: Iterable[UserBehavior], out: Collector[UvCount]): Unit = {
      // A Set that holds all userIds and deduplicates automatically
      var idSet = Set[Long]()
      // Add every record of the current window to the set
      for( userBehavior <- input ){
        idSet += userBehavior.userId
      }
      // The size of the set is the deduplicated UV value
      out.collect( UvCount(window.getEnd, idSet.size) )
  }
}

---09---

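Improvement: the previous version performs poorly because it buffers everything and needs a full hour of data before it computes anything. The strength of stream processing is handling each record as it arrives rather than accumulating records, so buffering is a bad fit.

// Custom incremental aggregation function; it needs a Set as the accumulator. The type parameters are input, intermediate state, output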
class UvCountAgg() extends AggregateFunction[UserBehavior, Set[Long], Long]{
  override def add(value: UserBehavior, accumulator: Set[Long]): Set[Long] = accumulator + value.userId

  override def createAccumulator(): Set[Long] = Set[Long]()

  override def getResult(accumulator: Set[Long]): Long = accumulator.size

  override def merge(a: Set[Long], b: Set[Long]): Set[Long] = a ++ b
}
// Custom window function that adds the window info and wraps the result in the case class
class UvCountResultWithIncreAgg() extends AllWindowFunction[Long, UvCount, TimeWindow]{
  override def apply(window: TimeWindow, input: Iterable[Long], out: Collector[UvCount]): Unit = {
    out.collect( UvCount(window.getEnd, input.head) )
  }
}
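The difference between the two versions can be shown on a plain collection. The sketch below mirrors the four methods of the aggregate function as ordinary Scala functions (no Flink types involved); `fullWindow` and `incremental` are hypothetical names for the two styles.

```scala
object UvAggSketch {
  type Acc = Set[Long]

  // Same contract as the aggregate function above, written as plain functions.
  def createAccumulator(): Acc = Set.empty[Long]
  def add(userId: Long, acc: Acc): Acc = acc + userId
  def getResult(acc: Acc): Long = acc.size.toLong
  def merge(a: Acc, b: Acc): Acc = a ++ b

  // Full-window style: needs the whole window's data at once.
  def fullWindow(userIds: Seq[Long]): Long = userIds.toSet.size.toLong

  // Incremental style: fold one record at a time into the accumulator.
  def incremental(userIds: Seq[Long]): Long =
    getResult(userIds.foldLeft(createAccumulator())((acc, id) => add(id, acc)))
}
```

Both styles give the same UV value; the incremental one only keeps the accumulator instead of every record, although the Set itself still grows with the number of distinct users.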

---10---

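A Set takes up memory.

A commonly used tool for window operations is the trigger.

For deduplication, consider a Bloom filter.

This way a single bit can represent one userId.

One byte consists of 8 bits.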

The full code:

package com.atguigu.networkflow_analysis

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import redis.clients.jedis.Jedis

/**
  * Copyright (c) 2018-2028 尚硅谷 All Rights Reserved 
  *
  * Project: UserBehaviorAnalysis
  * Package: com.atguigu.networkflow_analysis
  * Version: 1.0
  *
  * Created by wushengran on 2020/4/27 15:45
  */
object UvWithBloomFilter {
  def main(args: Array[String]): Unit = {
    // Create a stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Read data from a file
    val inputStream: DataStream[String] = env.readTextFile("D:\\codeMy\\CODY_MY_AFTER__KE\\sggBigData\\20-flink\\UserBehaviorAnalysis\\NetworkFlowAnalysis\\src\\main\\resources\\UserBehavior.csv")

    // Convert the data into the case class, extract the timestamp, and define the watermark
    val dataStream: DataStream[UserBehavior] = inputStream
      .map( data => {
        val dataArray = data.split(",")
        UserBehavior( dataArray(0).toLong, dataArray(1).toLong, dataArray(2).toInt, dataArray(3), dataArray(4).toLong )
      } )
      .assignAscendingTimestamps(_.timestamp * 1000L)

    // Assign a key, wrap into 2-tuples, and open a window
    val uvStream: DataStream[UvCount] = dataStream
      .filter(_.behavior == "pv")
      .map( data => ("uv", data.userId) )
      .keyBy(_._1)
      .timeWindow(Time.hours(1))
      .trigger(new MyTrigger())    // custom trigger
      .process( new UvCountResultWithBloomFilter() )

    uvStream.print()

    env.execute("uv job")
  }
}

// Custom trigger that fires the window computation once for every incoming record
class MyTrigger() extends Trigger[(String, Long), TimeWindow]{
  override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {}

  // When a record arrives, fire the computation and purge the window contents, so no data is buffered
  override def onElement(element: (String, Long), timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.FIRE_AND_PURGE
}

// Custom ProcessWindowFunction that processes the current record; the bitmap is kept in Redis
class UvCountResultWithBloomFilter() extends ProcessWindowFunction[(String, Long), UvCount, String, TimeWindow]{
  var jedis: Jedis = _
  var bloom: Bloom = _

  override def open(parameters: Configuration): Unit = {
    jedis = new Jedis("192.168.244.133", 6379)
    jedis.auth("123456")
    // Bitmap size: about 1 billion bits (2^30), which takes 128 MB
    bloom = new Bloom(1<<30)
  }

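  // For each incoming record, use the Bloom filter to check whether the corresponding bit in the Redis bitmap is 1
  override def process(key: String, context: Context, elements: Iterable[(String, Long)], out: Collector[UvCount]): Unit = {
    // The bitmap is stored in Redis with the current window's end as its key: (windowEnd, bitmap)
    val storedKey = context.window.getEnd.toString

    // The UV count of each window is also kept in Redis as state, in a hash named countMap
    val countMap = "countMap"
    // First read the current count
    var count = 0L
    if( jedis.hget(countMap, storedKey) != null )
      count = jedis.hget(countMap, storedKey).toLong

    // Take the userId, compute its hash, and check whether it is already in the bitmap
    val userId = elements.last._2.toString
    val offset = bloom.hash(userId, 61)
    val isExist = jedis.getbit( storedKey, offset )

    // If absent, set the bit to 1 and increment count; if present, do nothing
    if( !isExist ){
      jedis.setbit( storedKey, offset, true )
      jedis.hset( countMap, storedKey, (count + 1).toString )
      // Emit the updated count; without a collect call, uvStream.print() shows nothing
      out.collect( UvCount(storedKey.toLong, count + 1) )
    }
  }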
}

// A custom Bloom filter
class Bloom(size: Long) extends Serializable{
  // Size of the bitmap; it should be a power of two
  private val cap = size

  // A simple hash function
  def hash(str: String, seed: Int): Long = {
    var result = 0
    for( i <- 0 until str.length ){
      result = result * seed + str.charAt(i)
    }
    // Return a value within the range [0, cap)
    (cap - 1) & result
  }
}

Code:

// Custom trigger that fires the window computation once for every incoming record
class MyTriggerMy() extends Trigger[(String, Long), TimeWindow]{
  override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {}

  // When a record arrives, fire the computation and purge the window contents, so no data is buffered
  override def onElement(element: (String, Long), timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = TriggerResult.FIRE_AND_PURGE
}

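onEventTime: called when an event-time timer fires, that is, when the watermark passes it.

onProcessingTime: called when a processing-time timer fires.

onElement: what to do when an element arrives.

TriggerResult: controls whether the window computation is triggered (CONTINUE, FIRE, PURGE, or FIRE_AND_PURGE).

There are two times to distinguish: the window end time and the time when the window is actually destroyed.

FIRE_AND_PURGE: fire the computation and then purge the window contents.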
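A toy model makes the effect of the different results visible. It is plain Scala and only imitates the buffering behaviour; `WindowModel` and the three result objects are hypothetical stand-ins, not Flink's Trigger or TriggerResult types.

```scala
import scala.collection.mutable.ListBuffer

sealed trait Result
case object Continue extends Result
case object Fire extends Result
case object FireAndPurge extends Result

// Buffers elements like a window and records what each firing would see.
class WindowModel(trigger: Long => Result) {
  private val buffer = ListBuffer[Long]()
  val firings = ListBuffer[List[Long]]()

  def add(userId: Long): Unit = {
    buffer += userId
    trigger(userId) match {
      case Continue => ()                 // keep buffering, no computation
      case Fire => firings += buffer.toList // compute, keep the contents
      case FireAndPurge =>
        firings += buffer.toList          // compute on the current contents
        buffer.clear()                    // then drop them
    }
  }

  def buffered: Int = buffer.size
}
```

With FIRE_AND_PURGE on every element, each firing sees exactly one record and nothing stays buffered, which is why the window function can simply read `elements.last`.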

---

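Implement a custom ProcessWindowFunction:

// Custom ProcessWindowFunction that processes the current record; the bitmap is kept in Redis. This is a full-window function; the type parameters are input, output, key, and window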
class UvCountResultWithBloomFilter() extends ProcessWindowFunction[(String, Long), UvCount, String, TimeWindow]{
  var jedis: Jedis = _
  var bloom: Bloom = _

  override def open(parameters: Configuration): Unit = {
    jedis = new Jedis("192.168.244.133", 6379)
    jedis.auth("123456")
    // Bitmap size: about 1 billion bits (2^30), which takes 128 MB
    bloom = new Bloom(1<<30)
  }
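  // For each incoming record, use the Bloom filter to check whether the corresponding bit in the Redis bitmap is 1
  override def process(key: String, context: Context, elements: Iterable[(String, Long)], out: Collector[UvCount]): Unit = {
    // The bitmap is stored in Redis with the current window's end as its key: (windowEnd, bitmap)
    val storedKey = context.window.getEnd.toString

    // The UV count of each window is also kept in Redis as state, in a hash named countMap
    val countMap = "countMap"
    // First read the current count
    var count = 0L
    if( jedis.hget(countMap, storedKey) != null )
      count = jedis.hget(countMap, storedKey).toLong

    // Take the userId, compute its hash, and check whether it is already in the bitmap
    val userId = elements.last._2.toString
    val offset = bloom.hash(userId, 61)
    val isExist = jedis.getbit( storedKey, offset )

    // If absent, set the bit to 1 and increment count; if present, do nothing
    if( !isExist ){
      jedis.setbit( storedKey, offset, true )
      jedis.hset( countMap, storedKey, (count + 1).toString )
      // Emit the updated count; without a collect call, uvStream.print() shows nothing
      out.collect( UvCount(storedKey.toLong, count + 1) )
    }
  }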
}

A custom Bloom filter:

// A custom Bloom filter
class Bloom(size: Long) extends Serializable{
  // Size of the bitmap; it should be a power of two
  private val cap = size

  // A simple hash function; ready-made Bloom filter implementations exist, this only sketches the principle
  def hash(str: String, seed: Int): Long = {
    var result = 0
    for( i <- 0 until str.length ){
      result = result * seed + str.charAt(i)
    }
    // Return a value within the range [0, cap)
    (cap - 1) & result
  }
}
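The same dedup logic can be tried locally without Redis. In this sketch a `java.util.BitSet` stands in for the Redis bitmap and a much smaller bitmap (2^20 bits) is used; `LocalBloom` and `LocalUvCount` are hypothetical names, and the hash is the same single-seed scheme as above, so two different userIds can collide and be counted once.

```scala
import java.util.BitSet

// Same hashing scheme as the Bloom class above; size should be a power of two.
class LocalBloom(size: Long) {
  private val cap = size
  def hash(str: String, seed: Int): Long = {
    var result = 0L
    for (i <- 0 until str.length) result = result * seed + str.charAt(i)
    (cap - 1) & result // keep only the low bits, so the offset is in [0, cap)
  }
}

object LocalUvCount {
  // Count userIds whose bit was not set before; a local BitSet replaces the Redis bitmap.
  def countDistinct(userIds: Seq[Long], bits: Int = 1 << 20): Long = {
    val bloom = new LocalBloom(bits)
    val bitmap = new BitSet(bits)
    var count = 0L
    for (id <- userIds) {
      val offset = bloom.hash(id.toString, 61).toInt
      if (!bitmap.get(offset)) {
        bitmap.set(offset)
        count += 1
      }
    }
    count
  }
}
```

A production Bloom filter would normally use several independent hash functions to lower the collision rate; a single hash keeps the sketch close to the code in these notes.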

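We now have 100 million records. How should we size the bitmap?

The bitmap has about 10^9 bits; dividing by 8 gives the size in bytes.

10^9 ≈ 2^30 (2^30 = 1,073,741,824), so the bitmap takes 2^30 / 8 = 2^27 bytes = 128 MB.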
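The sizing arithmetic can be checked in a few lines. The comparison with a Set is a rough lower bound that only counts 8 bytes per Long userId and ignores all collection overhead; that simplification is an assumption made here, not a measurement.

```scala
object BitmapSizing {
  val bits: Long = 1L << 30                    // 2^30 bits, about 10^9
  val bytes: Long = bits / 8                   // 8 bits per byte
  val megabytes: Long = bytes / (1024 * 1024)  // bitmap size in MB

  val users: Long = 100000000L                 // 100 million userIds
  // Raw payload if every userId were kept as an 8-byte Long (no Set overhead counted)
  val rawIdMegabytes: Long = users * 8 / (1024 * 1024)
}
```

The bitmap comes out at 128 MB regardless of how many users are seen, while the raw ids alone already need several hundred MB.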