Real-Time E-Commerce User Behavior Analysis with Flink (1): Real-Time Traffic Statistics, Hot Items TopN

Contents

1. Knowledge Points

2. Business Goal

3. Process Overview

4. Module Walkthrough

4.1 Defining the input and output case classes

4.2 The main object

4.2.1 Creating the execution environment and adding a source

4.2.2 Mapping the DataStream to the input case class

4.2.3 Processing logic (1): filter pv events, key by itemId, open a sliding window (a WindowedStream), and call its aggregate method to count per group

4.2.4 Processing logic (2): group, sort, and take the TopN

4.3 Full code


1. Knowledge Points

  • How to define a Scala case class
  • How to set event-time semantics and give a stream a time characteristic
  • How to add a file source and a Kafka source
  • How to extract the business timestamp and generate ascending watermarks
  • How to use aggregate(AggregateFunction, WindowFunction): pre-aggregate incrementally, then post-process inside the window
  • How to compute a TopN with a KeyedProcessFunction
  • How to declare keyed state and obtain its handle
  • How to trigger logic with an onTimer event-time timer

2. Business Goal

Every 5 minutes, output the N items with the most clicks over the last hour.

Window: 1 hour

Slide: 5 minutes

Dimensions: top N, item id

Metric: click count
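Under these parameters every event is counted into size / slide = 12 overlapping windows. The sliding-window start arithmetic can be sketched in plain Scala (a standalone illustration mirroring Flink's internal start alignment with offset 0; the object and method names here are made up):

```scala
object SlidingWindowArithmetic {
  val sizeMs: Long  = 60 * 60 * 1000L // window size: 1 hour
  val slideMs: Long =  5 * 60 * 1000L // slide: 5 minutes

  // Start timestamps of every sliding window that contains an event at time ts
  def windowStartsFor(ts: Long): Seq[Long] = {
    val lastStart = ts - (ts % slideMs) // latest window start <= ts
    Iterator.iterate(lastStart)(_ - slideMs).takeWhile(_ > ts - sizeMs).toSeq
  }

  def main(args: Array[String]): Unit =
    println(windowStartsFor(1623834123000L).length) // 12 overlapping windows
}
```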

3. Process Overview

Overall flow: define the input and output case classes ---> implement the main object

Main object:

     1) Create the execution environment; set parallelism to 1 to avoid out-of-order records (the stream is simulated from a file); set event-time semantics

     2) Add a file source or a Kafka source

     3) Map the DataStream to the input case class, extract the business timestamp (timestamp * 1000L, seconds to milliseconds), and generate ascending watermarks

     4) Filter pv events, keyBy item id, and open a sliding window (slide 5 minutes, size 1 hour)

     5) aggregate(AggregateFunction, WindowFunction): pre-aggregate incrementally, then compute the per-window result and emit the output case class

    6) keyBy the window end time, call process, and hand off to a KeyedProcessFunction that sorts each group and takes the TopN

4. Module Walkthrough

4.1 Defining the input and output case classes

  User behavior log: user id, item id, category id, behavior (pv, cart, ...), timestamp

case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

 Item click-count result: item id, window end time, click count

case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

4.2 The main object

4.2.1 Creating the execution environment and adding a source

val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Reading from a file can produce out-of-order records, so set parallelism to 1
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // event-time semantics: give every stream a time characteristic


// Read from a file, map to the case class, extract timestamps and generate watermarks
//    val inputStream = env.readTextFile("xxx/UserBehaviorAnalysis/HotItemsAnalysis/src/main/resources/UserBehavior.csv")

    // Read from Kafka

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "consumer-group")
    properties.setProperty("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val inputStream = env.addSource( new FlinkKafkaConsumer[String]("hotitem0621", new SimpleStringSchema(), properties) )

4.2.2 Mapping the DataStream to the input case class

 val dataStream = inputStream
      .filter(_.nonEmpty) // guard against blank lines: splitting "" yields Array(""), and arr(0).toLong would throw
      .map(data => {
        val arr = data.split(",")
        UserBehavior(arr(0).toLong, arr(1).toLong, arr(2).toInt, arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L) // business timestamp is in seconds; convert to milliseconds
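The timestamp step can be modeled without Flink: assignAscendingTimestamps assumes timestamps only ever increase and emits a watermark equal to the largest timestamp seen so far minus 1 ms. A minimal sketch of that arithmetic (not the Flink API, just its behavior; the object name is made up):

```scala
object AscendingWatermarkModel {
  // Watermark after each record = max timestamp seen so far - 1 ms
  def watermarks(timestampsMs: Seq[Long]): Seq[Long] =
    timestampsMs.scanLeft(Long.MinValue)(math.max).tail.map(_ - 1L)

  def main(args: Array[String]): Unit = {
    // Business timestamps are in seconds, hence the * 1000L in the job
    val tsMs = Seq(1511658000L, 1511658001L, 1511658002L).map(_ * 1000L)
    println(watermarks(tsMs)) // List(1511657999999, 1511658000999, 1511658001999)
  }
}
```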

4.2.3 Processing logic (1): filter pv events, key by itemId, open a sliding window (a WindowedStream), and call its aggregate method to count per group

val aggStream: DataStream[ItemViewCount] = dataStream
      .filter(_.behavior == "pv")
      .keyBy("itemId") // key by item id; returns KeyedStream[T, JavaTuple], so the key type is a Java Tuple
      .timeWindow(Time.hours(1), Time.minutes(5)) // sliding window
      .aggregate(new CountAgg(), new ItemViewWindowResult())

CountAgg implementation:

public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {}

Type parameters: IN is the input type, the UserBehavior case class;
  ACC is the accumulator type, a Long;
  OUT is the output, the count, so also a Long.

class CountAgg() extends AggregateFunction[UserBehavior,Long,Long]{
  override def createAccumulator(): Long = 0L

  override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1

  override def getResult(accumulator: Long): Long = accumulator

  override def merge(a: Long, b: Long): Long = a + b
}
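The three callbacks can be exercised outside Flink to see the contract at work: add() runs once per element, merge() combines partial counts, getResult() returns the final value. A standalone simulation of the same logic (plain functions with illustrative names, detached from Flink's AggregateFunction):

```scala
object CountAggContract {
  // The same logic as CountAgg, detached from Flink's AggregateFunction
  def createAccumulator: Long = 0L
  def add(acc: Long): Long = acc + 1
  def merge(a: Long, b: Long): Long = a + b
  def getResult(acc: Long): Long = acc

  def main(args: Array[String]): Unit = {
    // Fold 3 + 2 elements through two independent accumulators, then merge
    val paneA = (1 to 3).foldLeft(createAccumulator)((acc, _) => add(acc))
    val paneB = (1 to 2).foldLeft(createAccumulator)((acc, _) => add(acc))
    println(getResult(merge(paneA, paneB))) // 5
  }
}
```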

ItemViewWindowResult implementation: extend the WindowFunction trait and implement its apply method
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {

  /**
    * Evaluates the window and outputs none or several elements.
    *
    * @param key    The key for which this window is evaluated.
    * @param window The window that is being evaluated.
    * @param input  The elements in the window being evaluated.
    * @param out    A collector for emitting elements.
    * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
    */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}

The apply method's type parameters:

IN: the value computed by the AggregateFunction, a Long in this example

OUT: the output case class, ItemViewCount in this example

KEY: keyBy returned KeyedStream[T, JavaTuple], so the key arrives as a Java Tuple:

         key.asInstanceOf[Tuple1[Long]].f0

W: TimeWindow

Reading the count: the pre-aggregated value is the single element behind the iterator

val count = input.iterator.next()  // the only element is the pre-aggregated count

class ItemViewWindowResult() extends WindowFunction[Long,ItemViewCount,Tuple,TimeWindow]{
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
    val itemId = key.asInstanceOf[Tuple1[Long]].f0  // the key is a Java Tuple1; import org.apache.flink.api.java.tuple.Tuple1
    val windowEnd = window.getEnd
    val count = input.iterator.next()  // the only element is the pre-aggregated count

    out.collect(ItemViewCount(itemId,windowEnd,count))  // apply returns Unit; emit via out.collect
  }
}
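A detail worth noting about window.getEnd: a TimeWindow covers the half-open interval [start, end), so getEnd is exclusive and the largest timestamp actually inside the window is end - 1. A small model of those semantics (SimpleTimeWindow is a made-up stand-in, not Flink's class):

```scala
object WindowBounds {
  // Models Flink's TimeWindow semantics: half-open interval [start, end)
  case class SimpleTimeWindow(start: Long, end: Long) {
    def maxTimestamp: Long = end - 1
    def contains(ts: Long): Boolean = start <= ts && ts < end
  }

  def main(args: Array[String]): Unit = {
    val w = SimpleTimeWindow(0L, 3600000L) // a 1-hour window starting at epoch 0
    println(w.contains(w.maxTimestamp)) // true: the last millisecond is inside
    println(w.contains(w.end))          // false: the end bound is exclusive
  }
}
```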

4.2.4 处理逻辑(2)----分组排序取TopN

由于是计算分组排序topn,肯定要以key分组,既然以key计算,肯定用到keyedProcessFunction。有同学会想到,没有窗口个吗?窗口我们已经在上一步定义好了一个WindowFunction。这一步就不需要windows了。那么自然我们会想到KeyBy + aggregate(AggregateFunction ,WindowFunction) + KeyedProcessFunction  = KeyBy + ProcessWindowFunction???

1.上一步的结果ItemViewCount 已经计算出来,那么要分组排序,自然想到将数据集合在一起,按窗口group by ,按照count进行排序。 窗口此刻都是同一个

2.需要定义个状态变量ListState,存储当前item不断变换的count值. open中取状态句柄

 //先定义状态变量,每个窗口都应该有一个ListState,保存窗口内所有商品的状态值
  //先定义状态变量,再从open里获取状态句柄
  var itemViewCountListState: ListState[ItemViewCount] = _

  override def open(parameters: Configuration): Unit = {
    itemViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemviewcountlist", classOf[ItemViewCount]))
  }

3. processElement appends each record to the ListState and registers a timer.

  // processElement(value, ctx, out): the context provides timers and other services
  override def processElement(value: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
    // Every incoming record goes straight into the ListState
    itemViewCountListState.add(value)
    // Register a timer for windowEnd + 1; it fires once the watermark passes windowEnd by 1 ms.
    // Re-registering is harmless: every record in this group carries the same windowEnd,
    // and event-time timers are deduplicated per key and timestamp.
    context.timerService().registerEventTimeTimer(value.windowEnd + 1)
  }
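Registering the timer on every record is cheap: event-time timers are deduplicated per key and timestamp, so a hundred records from the same window schedule exactly one firing. A minimal mock of the timer service for one key (illustrative only, not Flink's TimerService):

```scala
object TimerDedupModel {
  import scala.collection.mutable

  // Pending timers for one key: a sorted set, so duplicate timestamps collapse
  def register(pending: mutable.SortedSet[Long], ts: Long): Unit = pending += ts

  def main(args: Array[String]): Unit = {
    val pending = mutable.SortedSet.empty[Long]
    val windowEnd = 3600000L
    (1 to 100).foreach(_ => register(pending, windowEnd + 1)) // same windowEnd every time
    println(pending.size) // 1: only one firing is scheduled
  }
}
```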

4. onTimer fires 1 ms after the window end, i.e. once the watermark has passed windowEnd and all of that window's results have arrived. It copies the state into a ListBuffer for sorting, then clears the state variable.

override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    // For sorting, copy everything from the ListState into a ListBuffer (created empty via its companion object)
    val allItemViewCounts: ListBuffer[ItemViewCount] = ListBuffer()
    val iter = itemViewCountListState.get().iterator() // the state itself cannot be sorted, so copy it out first
    while (iter.hasNext){
      allItemViewCounts += iter.next()
    }

    // Once copied out, the state can be cleared
    itemViewCountListState.clear()

    // Sort by count, descending, and keep the top N
    val sortedItemViewCounts = allItemViewCounts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)

    // Format the ranking for readable output
    val result: StringBuilder = new StringBuilder
    // timestamp is processElement's value.windowEnd + 1, so subtract 1 to recover the window end
    result.append("window end: ").append(new Timestamp(timestamp - 1)).append("\n")
    // Walk through every ItemViewCount in the result
    for(i <- sortedItemViewCounts.indices){
      val currentItemViewCount = sortedItemViewCounts(i)
      result.append("No").append(i+1).append(": ")
        .append("item id = ").append(currentItemViewCount.itemId).append("\t")
        .append("popularity = ").append(currentItemViewCount.count).append("\n")
    }
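The sort line is worth seeing on concrete data: descending by count, then take the first topSize entries (ItemViewCount is redeclared below so the snippet runs standalone; TopNSortDemo is an illustrative name):

```scala
object TopNSortDemo {
  case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

  // Same sort as in onTimer: descending by count, keep the top entries
  def topN(counts: List[ItemViewCount], topSize: Int): List[ItemViewCount] =
    counts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)

  def main(args: Array[String]): Unit = {
    val counts = List(
      ItemViewCount(1L, 3600000L, 12L),
      ItemViewCount(2L, 3600000L, 47L),
      ItemViewCount(3L, 3600000L, 30L)
    )
    println(topN(counts, 2).map(_.itemId)) // List(2, 3)
  }
}
```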

4.3 Full code

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector

import java.sql.Timestamp
import java.util.Properties
import scala.collection.mutable.ListBuffer


// Input data case class
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

// Window aggregation result case class
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)



object HotItems{

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Reading from a file can produce out-of-order records, so set parallelism to 1
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // event-time semantics: give every stream a time characteristic

    // Read from a file, map to the case class, extract timestamps and generate watermarks
//    val inputStream = env.readTextFile("/Users/attacker/iqiyi/UserBehaviorAnalysis/HotItemsAnalysis/src/main/resources/UserBehavior.csv")

    // Read from Kafka

    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "consumer-group")
    properties.setProperty("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val inputStream = env.addSource( new FlinkKafkaConsumer[String]("hotitem0621", new SimpleStringSchema(), properties) )

    val dataStream = inputStream
      .filter(_.nonEmpty) // guard against blank lines: arr(0).toLong would throw on an empty record
      .map(data => {
        val arr = data.split(",")
        UserBehavior(arr(0).toLong, arr(1).toLong, arr(2).toInt, arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L) // business timestamp is in seconds; convert to milliseconds

    // Windowed aggregation result
    val aggStream: DataStream[ItemViewCount] = dataStream
      .filter(_.behavior == "pv")
      .keyBy("itemId") // key by item id; returns KeyedStream[T, JavaTuple], so the key type is a Java Tuple
      .timeWindow(Time.hours(1), Time.minutes(5)) // sliding window
      .aggregate(new CountAgg(),new ItemViewWindowResult())


    val resultStream:DataStream[String] = aggStream
      .keyBy("windowEnd") // group by window end, then sort and take the top N
      .process(new TopNHotItems(5))

//    dataStream.print("data")
//    aggStream.print("agg")
    resultStream.print()

    env.execute("hotitems")



  }
}

// Custom pre-aggregation function (AggregateFunction)
class CountAgg() extends AggregateFunction[UserBehavior,Long,Long]{
  override def createAccumulator(): Long = 0L

  override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1

  override def getResult(accumulator: Long): Long = accumulator

  override def merge(a: Long, b: Long): Long = a + b
}


// Custom WindowFunction
class ItemViewWindowResult() extends WindowFunction[Long,ItemViewCount,Tuple,TimeWindow]{
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
    val itemId = key.asInstanceOf[Tuple1[Long]].f0  // the key is a Java Tuple1; import org.apache.flink.api.java.tuple.Tuple1
    val windowEnd = window.getEnd
    val count = input.iterator.next()  // the only element is the pre-aggregated count

    out.collect(ItemViewCount(itemId,windowEnd,count))  // apply returns Unit; emit via out.collect
  }
}



// Custom KeyedProcessFunction
class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Tuple,ItemViewCount,String] {

  // Declare the state variable first: each window (key) gets one ListState that
  // holds all item counts for that window; the handle is obtained in open()
  var itemViewCountListState: ListState[ItemViewCount] = _

  override def open(parameters: Configuration): Unit = {
    itemViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemviewcountlist", classOf[ItemViewCount]))
  }

  // processElement(value, ctx, out): the context provides timers and other services
  override def processElement(value: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
    // Every incoming record goes straight into the ListState
    itemViewCountListState.add(value)
    // Register a timer for windowEnd + 1; it fires once the watermark passes windowEnd by 1 ms.
    // Re-registering is harmless: every record in this group carries the same windowEnd.
    context.timerService().registerEventTimeTimer(value.windowEnd + 1)
  }

  // When the timer fires, all of this window's results have arrived; sort and emit
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    // For sorting, copy everything from the ListState into a ListBuffer (created empty via its companion object)
    val allItemViewCounts: ListBuffer[ItemViewCount] = ListBuffer()
    val iter = itemViewCountListState.get().iterator() // the state itself cannot be sorted, so copy it out first
    while (iter.hasNext){
      allItemViewCounts += iter.next()
    }

    // Once copied out, the state can be cleared
    itemViewCountListState.clear()

    // Sort by count, descending, and keep the top N
    val sortedItemViewCounts = allItemViewCounts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)

    // Format the ranking for readable output
    val result: StringBuilder = new StringBuilder
    // timestamp is processElement's value.windowEnd + 1, so subtract 1 to recover the window end
    result.append("window end: ").append(new Timestamp(timestamp - 1)).append("\n")
    // Walk through every ItemViewCount in the result
    for(i <- sortedItemViewCounts.indices){
      val currentItemViewCount = sortedItemViewCounts(i)
      result.append("No").append(i+1).append(": ")
        .append("item id = ").append(currentItemViewCount.itemId).append("\t")
        .append("popularity = ").append(currentItemViewCount.count).append("\n")
    }
    }

    // Separate consecutive windows in the output
    result.append("================\n\n")
    Thread.sleep(1000) // throttle output to one window per second

    // Emit the formatted ranking
    out.collect(result.toString())
  }

}
