1. Key points
- How to define a Scala case class
- How to set event-time semantics and assign a time characteristic to a stream
- How to add a file source and a Kafka source
- How to extract the business timestamp and use it as an ascending watermark
- How to use aggregate(AggregateFunction, WindowFunction): pre-aggregate, then compute statistics within the window
- How to use a KeyedProcessFunction to compute a TopN
- How to create and access state
- How to use the onTimer timer for event-driven triggering
2. Business goal
Every 5 minutes, output the N most-clicked items over the last hour.
Window: 1 hour
Slide: 5 minutes
Dimensions: top N, item id
Metric: click count
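With a 1-hour window sliding every 5 minutes, each event is counted in size / slide = 12 overlapping windows. The assignment arithmetic can be checked with a small standalone sketch (illustrative only, not Flink's internal implementation; the window offset is assumed to be 0):

```scala
// Which sliding windows (identified by their start time, in ms) contain an event at time ts?
// A window [start, start + size) contains ts iff start <= ts < start + size.
def windowStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
  val lastStart = ts - (ts % slide) // latest aligned window start at or before ts (offset 0)
  Iterator.iterate(lastStart)(_ - slide)
    .takeWhile(start => start > ts - size)
    .toSeq
}

val size  = 60 * 60 * 1000L // 1 hour
val slide =  5 * 60 * 1000L // 5 minutes
val windowsPerEvent = size / slide // every event falls into 12 overlapping windows
```

This is also why the same itemId shows up in up to 12 different ItemViewCount results downstream.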
3. Overall approach
Overall flow: define the input/output case classes ---> implement the main object
Main object:
1) Create the execution environment, set parallelism to 1 to avoid out-of-order data (since we simulate the stream by reading a file), and set event-time semantics
2) Add a file source or a Kafka source
3) Map the DataStream to the input case class, extract the business timestamp (timestamp * 1000L)
and assign ascending watermarks
4) Filter for pv behavior, keyBy item id, and aggregate over a sliding window (slide 5 minutes, window size 1 hour)
5) aggregate(AggregateFunction, WindowFunction): pre-aggregate, then compute per-window statistics and emit the output case class
6) keyBy window end time, then call process with a KeyedProcessFunction to sort within each group and take the TopN
4. Module walkthrough
4.1 Define the input and output case classes
User behavior log: user id, item id, category id, behavior (pv, cart), timestamp
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)
Item click-count result: item id, window end time, click count
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)
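As a quick sanity check, one line of the CSV input can be parsed into the input case class like this (a sketch; the sample values below are made up for illustration):

```scala
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

// Parse one CSV line "userId,itemId,categoryId,behavior,timestamp"; None for malformed lines.
def parseLine(line: String): Option[UserBehavior] = {
  val arr = line.trim.split(",")
  if (arr.length != 5) None
  else scala.util.Try(
    UserBehavior(arr(0).toLong, arr(1).toLong, arr(2).toInt, arr(3), arr(4).toLong)
  ).toOption
}
```

The job's map function below does the same split without the Option wrapper, since the dataset is assumed to be well formed.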
4.2 Main object implementation
4.2.1 Create the execution environment and add the source
val env = StreamExecutionEnvironment.getExecutionEnvironment
//reading from a file may produce out-of-order data, so set parallelism to 1
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) //event-time semantics: give every stream a time characteristic
//read from a file, convert to the case class, extract timestamps and generate watermarks
// val inputStream = env.readTextFile("xxx/UserBehaviorAnalysis/HotItemsAnalysis/src/main/resources/UserBehavior.csv")
//read from Kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
val inputStream = env.addSource( new FlinkKafkaConsumer[String]("hotitem0621", new SimpleStringSchema(), properties) )
4.2.2 Map the DataStream to the input case class
val dataStream = inputStream.map(data => {
//assume each line is a well-formed CSV record: userId,itemId,categoryId,behavior,timestamp
val arr = data.split(",")
UserBehavior(arr(0).toLong,arr(1).toLong,arr(2).toInt,arr(3),arr(4).toLong)
})
.assignAscendingTimestamps(_.timestamp * 1000L)
4.2.3 Processing logic (1): filter pv events, key by itemId, apply a sliding window to get a WindowedStream, and call its aggregate method to count per group
val aggStream:DataStream[ItemViewCount] = dataStream
.filter(_.behavior == "pv")
.keyBy("itemId") //group by item id; keying by field name returns KeyedStream[T, JavaTuple], so the key type is JavaTuple
.timeWindow(Time.hours(1),Time.minutes(5)) //sliding window: size 1 hour, slide 5 minutes
.aggregate(new CountAgg(),new ItemViewWindowResult())
The implementation of CountAgg:
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {}
Type parameters -- IN: the input type, the UserBehavior case class
ACC: the accumulator, of type Long
OUT: the output, a count, so also Long
class CountAgg() extends AggregateFunction[UserBehavior,Long,Long]{
override def createAccumulator(): Long = 0L
override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1
override def getResult(accumulator: Long): Long = accumulator
override def merge(a: Long, b: Long): Long = a + b
}
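To see the contract in action outside Flink, here is a self-contained re-creation of the four callbacks (a sketch using a simplified trait, not Flink's actual AggregateFunction interface), folded over a few fake events the way Flink would per key and window:

```scala
// Simplified stand-in for Flink's AggregateFunction contract (illustration only).
trait SimpleAgg[IN, ACC, OUT] {
  def createAccumulator(): ACC
  def add(value: IN, acc: ACC): ACC
  def getResult(acc: ACC): OUT
  def merge(a: ACC, b: ACC): ACC
}

case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

class CountAgg extends SimpleAgg[UserBehavior, Long, Long] {
  def createAccumulator(): Long = 0L
  def add(value: UserBehavior, acc: Long): Long = acc + 1
  def getResult(acc: Long): Long = acc
  def merge(a: Long, b: Long): Long = a + b
}

// Fold three (made-up) events through the accumulator.
val agg = new CountAgg
val events = Seq(
  UserBehavior(1L, 42L, 7, "pv", 1000L),
  UserBehavior(2L, 42L, 7, "pv", 2000L),
  UserBehavior(3L, 42L, 7, "pv", 3000L)
)
val acc = events.foldLeft(agg.createAccumulator())((a, e) => agg.add(e, a))
```

Because only the running Long travels through the window, this pre-aggregation keeps state tiny compared to buffering all events.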
The implementation of ItemViewWindowResult: extend the WindowFunction trait and implement its apply method
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {
  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key The key for which this window is evaluated.
   * @param window The window that is being evaluated.
   * @param input The elements in the window being evaluated.
   * @param out A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
The apply method:
IN: the value produced by the AggregateFunction, Long in this example
OUT: the output case class, ItemViewCount in this example
KEY: keyBy here returns KeyedStream[T, JavaTuple], so the key type is the Java Tuple:
key.asInstanceOf[Tuple1[Long]].f0
window: TimeWindow
Getting the count: after pre-aggregation the window holds a single value, so the iterator's next() is the count
val count = input.iterator.next()
class ItemViewWindowResult() extends WindowFunction[Long,ItemViewCount,Tuple,TimeWindow]{
override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
val itemId = key.asInstanceOf[Tuple1[Long]].f0 //the key is a Java Tuple1; cast it and read field f0 (import org.apache.flink.api.java.tuple.Tuple1)
val windowEnd = window.getEnd
val count = input.iterator.next() //the iterable holds exactly one element: the pre-aggregated count
out.collect(ItemViewCount(itemId,windowEnd,count)) //apply returns Unit, so emit results via out.collect
}
}
4.2.4 Processing logic (2): sort within each group and take the TopN
Since we are computing a per-group TopN, we have to group by key, and once we work per key the natural tool is a KeyedProcessFunction. You might ask: don't we need a window here? The windowing was already done in the previous step by the WindowFunction, so none is needed in this step. In effect, keyBy + aggregate(AggregateFunction, WindowFunction) + KeyedProcessFunction plays the role of keyBy + ProcessWindowFunction.
1. The previous step already produced the ItemViewCount records. To rank them we collect the records, group by window, and sort by count; within one key they all share the same window end.
2. We need a ListState state variable to hold each item's evolving count; the state handle is obtained in open.
//declare the state variable first: each window end (key) gets its own ListState holding the counts of all items in that window
//declare it here, then obtain the handle in open
var itemViewCountListState: ListState[ItemViewCount] = _
override def open(parameters: Configuration): Unit = {
itemViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemviewcountlist", classOf[ItemViewCount]))
}
3. In processElement, update the ListState and register the timer
//[IN; Context, which gives access to timers and more; OUT]
override def processElement(value: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
//every incoming record goes straight into the ListState
itemViewCountListState.add(value)
//register a timer for windowEnd + 1, i.e. it fires once the watermark passes windowEnd by 1 ms. Re-registering is harmless: windowEnd is the same for every record of this key
context.timerService().registerEventTimeTimer(value.windowEnd + 1)
}
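Why is re-registering the same timer safe? Flink keeps at most one event-time timer per key and timestamp, so duplicates collapse. A toy model of that behavior (an illustration, not Flink's timer service):

```scala
import scala.collection.mutable

// Toy timer service: timestamps live in a Set, so duplicate registrations collapse.
class FakeTimerService {
  private val timers = mutable.Set.empty[Long]
  def registerEventTimeTimer(t: Long): Unit = timers += t
  // Fire (and remove) all timers at or before the new watermark, in order.
  def advanceWatermarkTo(wm: Long): Seq[Long] = {
    val fired = timers.filter(_ <= wm).toSeq.sorted
    timers --= fired
    fired
  }
}

val svc = new FakeTimerService
val windowEnd = 3600000L
// three ItemViewCount records for the same window each register the same timer
svc.registerEventTimeTimer(windowEnd + 1)
svc.registerEventTimeTimer(windowEnd + 1)
svc.registerEventTimeTimer(windowEnd + 1)
val fired = svc.advanceWatermarkTo(windowEnd + 1)
```

So onTimer runs exactly once per window end, no matter how many records arrived for it.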
4. Define an onTimer callback that fires 1 ms after the window end, by which point all results for that window have arrived; copy the state into a ListBuffer for sorting, then clear the state.
override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
//for easy sorting, copy everything from the ListState into a ListBuffer; create an empty one via the companion object
val allItemViewCounts:ListBuffer[ItemViewCount] = ListBuffer()
val iter = itemViewCountListState.get().iterator() //ListState cannot be sorted directly, so move its contents into a ListBuffer
while (iter.hasNext){
allItemViewCounts += iter.next()
}
//once the data is copied out, the state can be cleared
itemViewCountListState.clear()
//sort by count in descending order and keep the top N
val sortedItemViewCounts = allItemViewCounts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
//format the ranking for readable printed output
val result:StringBuilder = new StringBuilder
//timestamp is value.windowEnd + 1 from processElement, so subtracting 1 recovers the window end
result.append("Window end: ").append(new Timestamp(timestamp - 1)).append("\n")
//iterate over every ItemViewCount in the result
for(i <- sortedItemViewCounts.indices){
val currentItemViewCount = sortedItemViewCounts(i)
result.append("No").append(i+1).append(": ")
.append("Item ID = ").append(currentItemViewCount.itemId).append("\t")
.append("Popularity = ").append(currentItemViewCount.count).append("\n")
}
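The sort-and-take step is plain Scala and can be checked in isolation (with hypothetical counts for one window):

```scala
import scala.collection.mutable.ListBuffer

case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

// Made-up counts for one window end.
val buf = ListBuffer(
  ItemViewCount(101L, 3600000L, 5L),
  ItemViewCount(102L, 3600000L, 9L),
  ItemViewCount(103L, 3600000L, 7L)
)
// sort descending by count and keep the two hottest items
val top2 = buf.sortBy(_.count)(Ordering.Long.reverse).take(2)
```

sortBy takes an implicit Ordering; passing Ordering.Long.reverse explicitly flips the default ascending order.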
4.3 Complete code
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector
import java.sql.Timestamp
import java.util.Properties
import scala.collection.mutable.ListBuffer
// input data case class
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)
// window aggregation result case class
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)
object HotItems{
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//reading from a file may produce out-of-order data, so set parallelism to 1
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) //event-time semantics: give every stream a time characteristic
//read from a file, convert to the case class, extract timestamps and generate watermarks
// val inputStream = env.readTextFile("/Users/attacker/iqiyi/UserBehaviorAnalysis/HotItemsAnalysis/src/main/resources/UserBehavior.csv")
//read from Kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
val inputStream = env.addSource( new FlinkKafkaConsumer[String]("hotitem0621", new SimpleStringSchema(), properties) )
val dataStream = inputStream.map(data => {
//assume each line is a well-formed CSV record: userId,itemId,categoryId,behavior,timestamp
val arr = data.split(",")
UserBehavior(arr(0).toLong,arr(1).toLong,arr(2).toInt,arr(3),arr(4).toLong)
})
.assignAscendingTimestamps(_.timestamp * 1000L)
//the windowed aggregation result
val aggStream:DataStream[ItemViewCount] = dataStream
.filter(_.behavior == "pv")
.keyBy("itemId") //group by item id; keying by field name returns KeyedStream[T, JavaTuple], so the key type is JavaTuple
.timeWindow(Time.hours(1),Time.minutes(5)) //sliding window: size 1 hour, slide 5 minutes
.aggregate(new CountAgg(),new ItemViewWindowResult())
val resultStream:DataStream[String] = aggStream
.keyBy("windowEnd") //group by window end, then rank within each window
.process(new TopNHotItems(5))
// dataStream.print("data")
// aggStream.print("agg")
resultStream.print()
env.execute("hotitems")
}
}
//custom pre-aggregation function (AggregateFunction)
class CountAgg() extends AggregateFunction[UserBehavior,Long,Long]{
override def createAccumulator(): Long = 0L
override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1
override def getResult(accumulator: Long): Long = accumulator
override def merge(a: Long, b: Long): Long = a + b
}
//custom WindowFunction
class ItemViewWindowResult() extends WindowFunction[Long,ItemViewCount,Tuple,TimeWindow]{
override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
val itemId = key.asInstanceOf[Tuple1[Long]].f0 //the key is a Java Tuple1; cast it and read field f0 (import org.apache.flink.api.java.tuple.Tuple1)
val windowEnd = window.getEnd
val count = input.iterator.next() //the iterable holds exactly one element: the pre-aggregated count
out.collect(ItemViewCount(itemId,windowEnd,count)) //apply returns Unit, so emit results via out.collect
}
}
//custom KeyedProcessFunction
class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Tuple,ItemViewCount,String] {
//declare the state variable first: each window end (key) gets its own ListState holding the counts of all items in that window
//declare it here, then obtain the handle in open
var itemViewCountListState: ListState[ItemViewCount] = _
override def open(parameters: Configuration): Unit = {
itemViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemviewcountlist", classOf[ItemViewCount]))
}
//[IN; Context, which gives access to timers and more; OUT]
override def processElement(value: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
//every incoming record goes straight into the ListState
itemViewCountListState.add(value)
//register a timer for windowEnd + 1, i.e. it fires once the watermark passes windowEnd by 1 ms. Re-registering is harmless: windowEnd is the same for every record of this key
context.timerService().registerEventTimeTimer(value.windowEnd + 1)
}
//when the timer fires, all window results for this key have arrived and can be sorted and emitted
override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
//for easy sorting, copy everything from the ListState into a ListBuffer; create an empty one via the companion object
val allItemViewCounts:ListBuffer[ItemViewCount] = ListBuffer()
val iter = itemViewCountListState.get().iterator() //ListState cannot be sorted directly, so move its contents into a ListBuffer
while (iter.hasNext){
allItemViewCounts += iter.next()
}
//once the data is copied out, the state can be cleared
itemViewCountListState.clear()
//sort by count in descending order and keep the top N
val sortedItemViewCounts = allItemViewCounts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
//format the ranking for readable printed output
val result:StringBuilder = new StringBuilder
//timestamp is value.windowEnd + 1 from processElement, so subtracting 1 recovers the window end
result.append("Window end: ").append(new Timestamp(timestamp - 1)).append("\n")
//iterate over every ItemViewCount in the result
for(i <- sortedItemViewCounts.indices){
val currentItemViewCount = sortedItemViewCounts(i)
result.append("No").append(i+1).append(": ")
.append("Item ID = ").append(currentItemViewCount.itemId).append("\t")
.append("Popularity = ").append(currentItemViewCount.count).append("\n")
}
//separator between windows
result.append("================\n\n")
Thread.sleep(1000) //throttle output: roughly one window per second
//emit via the collector
out.collect(result.toString())
}
}