1. Key points
- How to define a Scala case class
- How to set event-time semantics and assign a time characteristic to a stream
- How to add a file source and a Kafka source
- How to extract the business timestamp and use it as an ascending watermark
- How to use aggregate(AggregateFunction, WindowFunction): pre-aggregate, then compute statistics within the window
- How to use a KeyedProcessFunction to compute a TopN
- How to create and access state
- How to use the onTimer timer for event-driven triggering
2. Business goal
Every 5 minutes, output the N most-clicked items over the last hour.
Window: 1 hour
Slide: 5 minutes
Dimensions: top N, item id
Metric: click count
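With a 1-hour window sliding every 5 minutes, each event is counted in size / slide = 12 overlapping windows. The assignment arithmetic can be checked with a small standalone sketch (illustrative only, not Flink's internal implementation; the window offset is assumed to be 0):

```scala
// Which sliding windows (identified by their start time, in ms) contain an event at time ts?
// A window [start, start + size) contains ts iff start <= ts < start + size.
def windowStarts(ts: Long, size: Long, slide: Long): Seq[Long] = {
  val lastStart = ts - (ts % slide) // latest aligned window start at or before ts (offset 0)
  Iterator.iterate(lastStart)(_ - slide)
    .takeWhile(start => start > ts - size)
    .toSeq
}

val size  = 60 * 60 * 1000L // 1 hour
val slide =  5 * 60 * 1000L // 5 minutes
val windowsPerEvent = size / slide // every event falls into 12 overlapping windows
```

This is also why the same itemId shows up in up to 12 different ItemViewCount results downstream.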
3. Overall approach
Overall flow: define the input/output case classes ---> implement the main object
Main object:
1) Create the execution environment, set parallelism to 1 to avoid out-of-order data (since we simulate the stream by reading a file), and set event-time semantics
2) Add a file source or a Kafka source
3) Map the DataStream to the input case class, extract the business timestamp (timestamp * 1000L)
and assign ascending watermarks
4) Filter for pv behavior, keyBy item id, and aggregate over a sliding window (slide 5 minutes, window size 1 hour)
5) aggregate(AggregateFunction, WindowFunction): pre-aggregate, then compute per-window statistics and emit the output case class
6) keyBy window end time, then call process with a KeyedProcessFunction to sort within each group and take the TopN
4. Module walkthrough
4.1 Define the input and output case classes
User behavior log: user id, item id, category id, behavior (pv, cart), timestamp
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)
Item click-count result: item id, window end time, click count
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)
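As a quick sanity check, one line of the CSV input can be parsed into the input case class like this (a sketch; the sample values below are made up for illustration):

```scala
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

// Parse one CSV line "userId,itemId,categoryId,behavior,timestamp"; None for malformed lines.
def parseLine(line: String): Option[UserBehavior] = {
  val arr = line.trim.split(",")
  if (arr.length != 5) None
  else scala.util.Try(
    UserBehavior(arr(0).toLong, arr(1).toLong, arr(2).toInt, arr(3), arr(4).toLong)
  ).toOption
}
```

The job's map function below does the same split without the Option wrapper, since the dataset is assumed to be well formed.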
4.2 Main object implementation
4.2.1 Create the execution environment and add the source
val env = StreamExecutionEnvironment.getExecutionEnvironment
//reading from a file may produce out-of-order data, so set parallelism to 1
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) //event-time semantics: give every stream a time characteristic
//read from a file, convert to the case class, extract timestamps and generate watermarks
// val inputStream = env.readTextFile("xxx/UserBehaviorAnalysis/HotItemsAnalysis/src/main/resources/UserBehavior.csv")
//read from Kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
val inputStream = env.addSource( new FlinkKafkaConsumer[String]("hotitem0621", new SimpleStringSchema(), properties) )
4.2.2 Map the DataStream to the input case class
val dataStream = inputStream.map(data => {
//assume each line is a well-formed CSV record: userId,itemId,categoryId,behavior,timestamp
val arr = data.split(",")
UserBehavior(arr(0).toLong,arr(1).toLong,arr(2).toInt,arr(3),arr(4).toLong)
})
.assignAscendingTimestamps(_.timestamp * 1000L)
4.2.3 Processing logic (1): filter pv events, key by itemId, apply a sliding window to get a WindowedStream, and call its aggregate method to count per group
val aggStream:DataStream[ItemViewCount] = dataStream
.filter(_.behavior == "pv")
.keyBy("itemId") //group by item id; keying by field name returns KeyedStream[T, JavaTuple], so the key type is JavaTuple
.timeWindow(Time.hours(1),Time.minutes(5)) //sliding window: size 1 hour, slide 5 minutes
.aggregate(new CountAgg(),new ItemViewWindowResult())
The implementation of CountAgg:
public interface AggregateFunction<IN, ACC, OUT> extends Function, Serializable {}
Type parameters -- IN: the input type, the UserBehavior case class
ACC: the accumulator, of type Long
OUT: the output, a count, so also Long
class CountAgg() extends AggregateFunction[UserBehavior,Long,Long]{
override def createAccumulator(): Long = 0L
override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1
override def getResult(accumulator: Long): Long = accumulator
override def merge(a: Long, b: Long): Long = a + b
}
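To see the contract in action outside Flink, here is a self-contained re-creation of the four callbacks (a sketch using a simplified trait, not Flink's actual AggregateFunction interface), folded over a few fake events the way Flink would per key and window:

```scala
// Simplified stand-in for Flink's AggregateFunction contract (illustration only).
trait SimpleAgg[IN, ACC, OUT] {
  def createAccumulator(): ACC
  def add(value: IN, acc: ACC): ACC
  def getResult(acc: ACC): OUT
  def merge(a: ACC, b: ACC): ACC
}

case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)

class CountAgg extends SimpleAgg[UserBehavior, Long, Long] {
  def createAccumulator(): Long = 0L
  def add(value: UserBehavior, acc: Long): Long = acc + 1
  def getResult(acc: Long): Long = acc
  def merge(a: Long, b: Long): Long = a + b
}

// Fold three (made-up) events through the accumulator.
val agg = new CountAgg
val events = Seq(
  UserBehavior(1L, 42L, 7, "pv", 1000L),
  UserBehavior(2L, 42L, 7, "pv", 2000L),
  UserBehavior(3L, 42L, 7, "pv", 3000L)
)
val acc = events.foldLeft(agg.createAccumulator())((a, e) => agg.add(e, a))
```

Because only the running Long travels through the window, this pre-aggregation keeps state tiny compared to buffering all events.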
The implementation of ItemViewWindowResult: extend the WindowFunction trait and implement its apply method
trait WindowFunction[IN, OUT, KEY, W <: Window] extends Function with Serializable {
  /**
   * Evaluates the window and outputs none or several elements.
   *
   * @param key The key for which this window is evaluated.
   * @param window The window that is being evaluated.
   * @param input The elements in the window being evaluated.
   * @param out A collector for emitting elements.
   * @throws Exception The function may throw exceptions to fail the program and trigger recovery.
   */
  def apply(key: KEY, window: W, input: Iterable[IN], out: Collector[OUT])
}
The apply method:
IN: the value produced by the AggregateFunction, Long in this example
OUT: the output case class, ItemViewCount in this example
KEY: keyBy here returns KeyedStream[T, JavaTuple], so the key type is the Java Tuple:
key.asInstanceOf[Tuple1[Long]].f0
window: TimeWindow
Getting the count: after pre-aggregation the window holds a single value, so the iterator's next() is the count
val count = input.iterator.next()
class ItemViewWindowResult() extends WindowFunction[Long,ItemViewCount,Tuple,TimeWindow]{
override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
val itemId = key.asInstanceOf[Tuple1[Long]].f0 //the key is a Java Tuple1; cast it and read field f0 (import org.apache.flink.api.java.tuple.Tuple1)
val windowEnd = window.getEnd
val count = input.iterator.next() //the iterable holds exactly one element: the pre-aggregated count
out.collect(ItemViewCount(itemId,windowEnd,count)) //apply returns Unit, so emit results via out.collect
}
}
4.2.4 Processing logic (2): sort within each group and take the TopN
Since we are computing a per-group TopN, we have to group by key, and once we work per key the natural tool is a KeyedProcessFunction. You might ask: don't we need a window here? The windowing was already done in the previous step by the WindowFunction, so none is needed in this step. In effect, keyBy + aggregate(AggregateFunction, WindowFunction) + KeyedProcessFunction plays the role of keyBy + ProcessWindowFunction.
1. The previous step already produced the ItemViewCount records. To rank them we collect the records, group by window, and sort by count; within one key they all share the same window end.
2. We need a ListState state variable to hold each item's evolving count; the state handle is obtained in open.
//declare the state variable first: each window end (key) gets its own ListState holding the counts of all items in that window
//declare it here, then obtain the handle in open
var itemViewCountListState: ListState[ItemViewCount] = _
override def open(parameters: Configuration): Unit = {
itemViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemviewcountlist", classOf[ItemViewCount]))
}
3. In processElement, update the ListState and register the timer
//[IN; Context, which gives access to timers and more; OUT]
override def processElement(value: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
//every incoming record goes straight into the ListState
itemViewCountListState.add(value)
//register a timer for windowEnd + 1, i.e. it fires once the watermark passes windowEnd by 1 ms. Re-registering is harmless: windowEnd is the same for every record of this key
context.timerService().registerEventTimeTimer(value.windowEnd + 1)
}
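Why is re-registering the same timer safe? Flink keeps at most one event-time timer per key and timestamp, so duplicates collapse. A toy model of that behavior (an illustration, not Flink's timer service):

```scala
import scala.collection.mutable

// Toy timer service: timestamps live in a Set, so duplicate registrations collapse.
class FakeTimerService {
  private val timers = mutable.Set.empty[Long]
  def registerEventTimeTimer(t: Long): Unit = timers += t
  // Fire (and remove) all timers at or before the new watermark, in order.
  def advanceWatermarkTo(wm: Long): Seq[Long] = {
    val fired = timers.filter(_ <= wm).toSeq.sorted
    timers --= fired
    fired
  }
}

val svc = new FakeTimerService
val windowEnd = 3600000L
// three ItemViewCount records for the same window each register the same timer
svc.registerEventTimeTimer(windowEnd + 1)
svc.registerEventTimeTimer(windowEnd + 1)
svc.registerEventTimeTimer(windowEnd + 1)
val fired = svc.advanceWatermarkTo(windowEnd + 1)
```

So onTimer runs exactly once per window end, no matter how many records arrived for it.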
4. Define an onTimer callback that fires 1 ms after the window end, by which point all results for that window have arrived; copy the state into a ListBuffer for sorting, then clear the state.
override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
//for easy sorting, copy everything from the ListState into a ListBuffer; create an empty one via the companion object
val allItemViewCounts:ListBuffer[ItemViewCount] = ListBuffer()
val iter = itemViewCountListState.get().iterator() //ListState cannot be sorted directly, so move its contents into a ListBuffer
while (iter.hasNext){
allItemViewCounts += iter.next()
}
//once the data is copied out, the state can be cleared
itemViewCountListState.clear()
//sort by count in descending order and keep the top N
val sortedItemViewCounts = allItemViewCounts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
//format the ranking for readable printed output
val result:StringBuilder = new StringBuilder
//timestamp is value.windowEnd + 1 from processElement, so subtracting 1 recovers the window end
result.append("Window end: ").append(new Timestamp(timestamp - 1)).append("\n")
//iterate over every ItemViewCount in the result
for(i <- sortedItemViewCounts.indices){
val currentItemViewCount = sortedItemViewCounts(i)
result.append("No").append(i+1).append(": ")
.append("Item ID = ").append(currentItemViewCount.itemId).append("\t")
.append("Popularity = ").append(currentItemViewCount.count).append("\n")
}
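The sort-and-take step is plain Scala and can be checked in isolation (with hypothetical counts for one window):

```scala
import scala.collection.mutable.ListBuffer

case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)

// Made-up counts for one window end.
val buf = ListBuffer(
  ItemViewCount(101L, 3600000L, 5L),
  ItemViewCount(102L, 3600000L, 9L),
  ItemViewCount(103L, 3600000L, 7L)
)
// sort descending by count and keep the two hottest items
val top2 = buf.sortBy(_.count)(Ordering.Long.reverse).take(2)
```

sortBy takes an implicit Ordering; passing Ordering.Long.reverse explicitly flips the default ascending order.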
4.3 Complete code
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector
import java.sql.Timestamp
import java.util.Properties
import scala.collection.mutable.ListBuffer
// input data case class
case class UserBehavior(userId: Long, itemId: Long, categoryId: Int, behavior: String, timestamp: Long)
// window aggregation result case class
case class ItemViewCount(itemId: Long, windowEnd: Long, count: Long)
object HotItems{
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
//reading from a file may produce out-of-order data, so set parallelism to 1
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) //event-time semantics: give every stream a time characteristic
//read from a file, convert to the case class, extract timestamps and generate watermarks
// val inputStream = env.readTextFile("/Users/attacker/iqiyi/UserBehaviorAnalysis/HotItemsAnalysis/src/main/resources/UserBehavior.csv")
//read from Kafka
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer",
"org.apache.kafka.common.serialization.StringDeserializer")
val inputStream = env.addSource( new FlinkKafkaConsumer[String]("hotitem0621", new SimpleStringSchema(), properties) )
val dataStream = inputStream.map(data => {
//assume each line is a well-formed CSV record: userId,itemId,categoryId,behavior,timestamp
val arr = data.split(",")
UserBehavior(arr(0).toLong,arr(1).toLong,arr(2).toInt,arr(3),arr(4).toLong)
})
.assignAscendingTimestamps(_.timestamp * 1000L)
//the windowed aggregation result
val aggStream:DataStream[ItemViewCount] = dataStream
.filter(_.behavior == "pv")
.keyBy("itemId") //group by item id; keying by field name returns KeyedStream[T, JavaTuple], so the key type is JavaTuple
.timeWindow(Time.hours(1),Time.minutes(5)) //sliding window: size 1 hour, slide 5 minutes
.aggregate(new CountAgg(),new ItemViewWindowResult())
val resultStream:DataStream[String] = aggStream
.keyBy("windowEnd") //group by window end, then rank within each window
.process(new TopNHotItems(5))
// dataStream.print("data")
// aggStream.print("agg")
resultStream.print()
env.execute("hotitems")
}
}
//custom pre-aggregation function (AggregateFunction)
class CountAgg() extends AggregateFunction[UserBehavior,Long,Long]{
override def createAccumulator(): Long = 0L
override def add(value: UserBehavior, accumulator: Long): Long = accumulator + 1
override def getResult(accumulator: Long): Long = accumulator
override def merge(a: Long, b: Long): Long = a + b
}
//custom WindowFunction
class ItemViewWindowResult() extends WindowFunction[Long,ItemViewCount,Tuple,TimeWindow]{
override def apply(key: Tuple, window: TimeWindow, input: Iterable[Long], out: Collector[ItemViewCount]): Unit = {
val itemId = key.asInstanceOf[Tuple1[Long]].f0 //the key is a Java Tuple1; cast it and read field f0 (import org.apache.flink.api.java.tuple.Tuple1)
val windowEnd = window.getEnd
val count = input.iterator.next() //the iterable holds exactly one element: the pre-aggregated count
out.collect(ItemViewCount(itemId,windowEnd,count)) //apply returns Unit, so emit results via out.collect
}
}
//custom KeyedProcessFunction
class TopNHotItems(topSize: Int) extends KeyedProcessFunction[Tuple,ItemViewCount,String] {
//declare the state variable first: each window end (key) gets its own ListState holding the counts of all items in that window
//declare it here, then obtain the handle in open
var itemViewCountListState: ListState[ItemViewCount] = _
override def open(parameters: Configuration): Unit = {
itemViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[ItemViewCount]("itemviewcountlist", classOf[ItemViewCount]))
}
//[IN; Context, which gives access to timers and more; OUT]
override def processElement(value: ItemViewCount, context: KeyedProcessFunction[Tuple, ItemViewCount, String]#Context, collector: Collector[String]): Unit = {
//every incoming record goes straight into the ListState
itemViewCountListState.add(value)
//register a timer for windowEnd + 1, i.e. it fires once the watermark passes windowEnd by 1 ms. Re-registering is harmless: windowEnd is the same for every record of this key
context.timerService().registerEventTimeTimer(value.windowEnd + 1)
}
//when the timer fires, all window results for this key have arrived and can be sorted and emitted
override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Tuple, ItemViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
//for easy sorting, copy everything from the ListState into a ListBuffer; create an empty one via the companion object
val allItemViewCounts:ListBuffer[ItemViewCount] = ListBuffer()
val iter = itemViewCountListState.get().iterator() //ListState cannot be sorted directly, so move its contents into a ListBuffer
while (iter.hasNext){
allItemViewCounts += iter.next()
}
//once the data is copied out, the state can be cleared
itemViewCountListState.clear()
//sort by count in descending order and keep the top N
val sortedItemViewCounts = allItemViewCounts.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
//format the ranking for readable printed output
val result:StringBuilder = new StringBuilder
//timestamp is value.windowEnd + 1 from processElement, so subtracting 1 recovers the window end
result.append("Window end: ").append(new Timestamp(timestamp - 1)).append("\n")
//iterate over every ItemViewCount in the result
for(i <- sortedItemViewCounts.indices){
val currentItemViewCount = sortedItemViewCounts(i)
result.append("No").append(i+1).append(": ")
.append("Item ID = ").append(currentItemViewCount.itemId).append("\t")
.append("Popularity = ").append(currentItemViewCount.count).append("\n")
}
//separator between windows
result.append("================\n\n")
Thread.sleep(1000) //throttle output: roughly one window per second
//emit via the collector
out.collect(result.toString())
}
}