一、数据字段如下:
字段名 |
数据类型 |
说明 |
userId |
Long |
加密后的用户ID |
itemId |
Long |
加密后的商品ID |
categoryId |
Int |
加密后的商品所属类别ID |
behavior |
String |
用户行为类型,包括(‘pv’, ‘’buy, ‘cart’, ‘fav’) |
timestamp |
Long |
行为发生的时间戳,单位秒 |
数据样本如下:
82170,3588374,2465336,pv,1511658004
587599,2067643,4818107,cart,1511658004
367451,15775,4756105,pv,1511658004
428316,2478780,4284875,pv,1511658004
284910,3680091,3829657,pv,1511658004
345119,737662,4357323,pv,1511658004
551442,1762997,1879194,pv,1511658004
550384,3908776,1029459,pv,1511658004
677500,4534693,2640118,pv,1511658004
398626,2791489,1467750,pv,1511658004
118053,3545571,2433095,pv,1511658005
457401,4063698,4801426,pv,1511658005
45105,3234847,3141941,fav,1511658005
604760,2661651,3738615,pv,1511658005
905383,2064903,2939262,cart,1511658005
740788,3657484,4936889,pv,1511658005
456838,1242724,4756105,fav,1511658005
585217,215764,2640118,pv,1511658006
658185,4025021,4048584,fav,1511658006
210431,2035568,2328673,pv,1511658006
二、需求:每隔5分钟输出最近一小时内点击量最多的前N个商品。
三、需求分析:
- 抽取出业务时间戳,告诉Flink框架基于业务时间做窗口
- 过滤出点击行为数据
- 按一小时的窗口大小,每5分钟统计一次,做滑动窗口聚合(Sliding Window)
- 按每个窗口聚合,输出每个窗口中点击量前N名的商品
四、代码实现
数据从kafka到Flink
import java.sql.Timestamp
import java.util.Properties
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.java.tuple.{Tuple, Tuple1}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import or