Contents
1. Key concepts
- RichSourceFunction: generating custom test data
- Custom ProcessWindowFunction[IN, OUT, KEY, W <: Window]
- Assigning watermarks with assignAscendingTimestamps
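2. Business goal
Every 5 seconds, count each behavior per channel over the preceding hour or day.
Window: 1 day
Slide: 5 s
Dimensions: channel (appstore, weibo, wechat, tieba), behavior (view, download, install, uninstall)
Output format: window start, window end, channel, behavior, count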
MarketViewCount(2021-07-01 15:15:40.0,2021-07-02 15:15:40.0,tieba,uninstall,305)
MarketViewCount(2021-07-01 15:15:45.0,2021-07-02 15:15:45.0,weibo,uninstall,301)
MarketViewCount(2021-07-01 15:15:45.0,2021-07-02 15:15:45.0,appstore,uninstall,274)
MarketViewCount(2021-07-01 15:15:45.0,2021-07-02 15:15:45.0,wechat,uninstall,281)
MarketViewCount(2021-07-01 15:15:45.0,2021-07-02 15:15:45.0,tieba,uninstall,314)
MarketViewCount(2021-07-01 15:15:50.0,2021-07-02 15:15:50.0,weibo,uninstall,309)
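With a 1-day window that slides every 5 s, each event falls into 86400 / 5 = 17280 overlapping windows, and window boundaries land on multiples of 5 s, which is why the sample rows above start at :40, :45 and :50. The sketch below reproduces that arithmetic in plain Scala. It assumes no window offset and non-negative epoch-millisecond timestamps; SlidingWindowSketch and windowStarts are illustrative names, not part of the Flink API.

```scala
object SlidingWindowSketch {
  val sizeMs: Long  = 24L * 60 * 60 * 1000 // window length: 1 day
  val slideMs: Long = 5L * 1000            // slide: 5 s

  // Start times of every window [start, start + sizeMs) that contains ts,
  // newest window first. Assumes ts >= 0 and a window offset of 0.
  def windowStarts(ts: Long): Seq[Long] = {
    val lastStart = ts - (ts % slideMs) // latest slide boundary at or before ts
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(_ > ts - sizeMs)
      .toVector
  }

  def main(args: Array[String]): Unit = {
    val starts = windowStarts(12345123L)
    println(starts.size) // 17280
    println(starts.head) // 12345000
  }
}
```

Every incoming record is therefore added to thousands of windows at once, which is worth keeping in mind when choosing such a large size-to-slide ratio.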
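3. Workflow
- Overall flow:
define the input case class ----> define the output case class ----> write a custom source extending RichSourceFunction ----> main object
- Main object
1) Create the execution environment
2) Add the custom source
3) Window and aggregate (filter, keyBy, timeWindow, then process with a ProcessWindowFunction): once the stream is keyed and windowed, process can be called with a custom ProcessWindowFunction, which receives all elements of one key in one window together
4) Implement the custom ProcessWindowFunction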
4. Module details
4.1 Define the input case class
User ID / user behavior / channel / timestamp
case class MarketUserBehavior(userId: String, behavior: String, channel: String, timestamp: Long)
4.2 Define the output case class
Window start time, window end time, channel, behavior, count
case class MarketViewCount(windowStart: String, windowEnd: String, channel: String, behavior: String, count: Long)
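4.3 Custom source extending RichSourceFunction
- A custom source needs a running flag so that data generation can be stopped on demand, plus the collections (Seq) of user behaviors and channels to draw from.
- The source code shows that RichSourceFunction extends AbstractRichFunction and implements the SourceFunction interface, so a concrete source ultimately has to implement run and cancel from SourceFunction: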
public abstract class RichSourceFunction<OUT> extends AbstractRichFunction implements SourceFunction<OUT> {
    private static final long serialVersionUID = 1L;

    public RichSourceFunction() {
    }
}

public interface SourceFunction<T> extends Function, Serializable {
    void run(SourceFunction.SourceContext<T> var1) throws Exception;

    void cancel();

    @Public
    public interface SourceContext<T> {
        void collect(T var1);

        @PublicEvolving
        void collectWithTimestamp(T var1, long var2);

        @PublicEvolving
        void emitWatermark(Watermark var1);

        @PublicEvolving
        void markAsTemporarilyIdle();

        Object getCheckpointLock();

        void close();
    }
}
The complete RichSourceFunction implementation:
import java.util.UUID
import scala.util.Random
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

class SimulatedSource() extends RichSourceFunction[MarketUserBehavior] {
  // Running flag; @volatile because cancel() may be called from another thread
  @volatile var running = true
  // Candidate user behaviors and channels
  val behaviorSet: Seq[String] = Seq("view", "download", "install", "uninstall")
  val channelSet: Seq[String] = Seq("appstore", "weibo", "wechat", "tieba")
  val rand: Random = Random

  override def run(ctx: SourceFunction.SourceContext[MarketUserBehavior]): Unit = {
    // Upper bound on the number of generated records
    val maxCounts = Long.MaxValue
    var count = 0L
    // Keep generating random records until cancelled or the bound is reached
    while (running && count < maxCounts) {
      val id = UUID.randomUUID().toString
      val behavior = behaviorSet(rand.nextInt(behaviorSet.size))
      val channel = channelSet(rand.nextInt(channelSet.size))
      val ts = System.currentTimeMillis()
      ctx.collect(MarketUserBehavior(id, behavior, channel, ts))
      count += 1
      Thread.sleep(50L)
    }
  }

  override def cancel(): Unit = {
    running = false
  }
}
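The run/cancel contract can be tried out without a Flink job. cancel is normally invoked from a different thread than the one looping inside run, so the flag has to be visible across threads, hence @volatile. A minimal plain-Scala sketch, in which CancellableLoop and its emitted counter are hypothetical stand-ins for the source and for ctx.collect:

```scala
import java.util.concurrent.atomic.AtomicLong

class CancellableLoop {
  // Written by the cancelling thread, read by the looping thread
  @volatile private var running = true
  val emitted = new AtomicLong(0L)

  def run(): Unit = {
    while (running) {
      emitted.incrementAndGet() // stand-in for ctx.collect(...)
      Thread.sleep(5L)
    }
  }

  def cancel(): Unit = {
    running = false
  }
}

object CancellableLoopDemo {
  def main(args: Array[String]): Unit = {
    val loop = new CancellableLoop
    val worker = new Thread(new Runnable {
      override def run(): Unit = loop.run()
    })
    worker.start()
    Thread.sleep(100L) // let the loop emit a few records
    loop.cancel()      // flip the flag from the main thread
    worker.join(1000L)
    println(s"stopped=${!worker.isAlive}, emitted=${loop.emitted.get}")
  }
}
```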
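4.4 Main object implementation
- Create the environment, set the time characteristic to EventTime, add the custom source, and assign watermarks with assignAscendingTimestamps. This method fits streams whose element timestamps are known to increase monotonically within each parallel instance; in that case the system can generate watermarks automatically and accurately by tracking the ascending timestamps. See "Three ways to create watermarks" for details.
- The actual processing logic lives in a custom ProcessWindowFunction.
- When overriding process, the parameter elements: Iterable[MarketUserBehavior] holds all the data in the window for the current key, so a count is simply elements.size.
- The context: Context parameter of process exposes the start and end time of the current window: context.window.getStart and context.window.getEnd.
- The context also exposes further information, such as the current processing time, the current watermark, and the properties of the window.
- Because the data is grouped by channel and behavior, keyBy can take a lambda that returns a two-element tuple as the key. To compute a grand total instead, map every record to a tuple whose first field is a constant and key by that field: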
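// Grand total: every record gets the same constant key
stream
  .filter(_.behavior != "uninstall")
  .map(_ => ("dummyKey", 1L))
  .keyBy(_._1)

// Per channel and behavior: a (channel, behavior) tuple as the key
val resultStream = dataStream
  .filter(_.behavior == "uninstall")
  .keyBy(data => (data.channel, data.behavior))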
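What the two keying choices mean for the result can be checked on an ordinary collection, with groupBy standing in for keyBy and a single batch standing in for one window. This is only an analogy, not Flink code, and the sample records are made up for illustration.

```scala
object KeyingSketch {
  case class MarketUserBehavior(userId: String, behavior: String, channel: String, timestamp: Long)

  // Made-up records standing in for the contents of one window
  val events: Seq[MarketUserBehavior] = Seq(
    MarketUserBehavior("u1", "uninstall", "weibo", 1000L),
    MarketUserBehavior("u2", "uninstall", "weibo", 2000L),
    MarketUserBehavior("u3", "uninstall", "tieba", 3000L),
    MarketUserBehavior("u4", "view", "tieba", 4000L)
  )

  // Tuple key: one count per (channel, behavior), uninstalls only
  def perChannel(es: Seq[MarketUserBehavior]): Map[(String, String), Long] =
    es.filter(_.behavior == "uninstall")
      .groupBy(e => (e.channel, e.behavior))
      .map { case (key, group) => key -> group.size.toLong }

  // Constant key: a single total over everything except uninstalls
  def total(es: Seq[MarketUserBehavior]): Map[String, Long] =
    es.filter(_.behavior != "uninstall")
      .map(_ => ("dummyKey", 1L))
      .groupBy(_._1)
      .map { case (key, group) => key -> group.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    println(perChannel(events)(("weibo", "uninstall"))) // 2
    println(perChannel(events)(("tieba", "uninstall"))) // 1
    println(total(events)("dummyKey"))                  // 1
  }
}
```

In Flink, a constant key routes every record to the same parallel subtask, so the grand-total variant does not scale with parallelism.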
The complete code:
import java.sql.Timestamp
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object AppMarketByChannel {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    // Add the custom source SimulatedSource and use its timestamps as event time
    val dataStream = env.addSource(new SimulatedSource)
      .assignAscendingTimestamps(_.timestamp)
    // Window the keyed stream and count
    val resultStream = dataStream
      .filter(_.behavior == "uninstall")
      .keyBy(data => (data.channel, data.behavior))
      .timeWindow(Time.days(1), Time.seconds(5))
      .process(new MarketCountByChannel())
    resultStream.print()
    env.execute("app market by channel job")
  }

  // Custom ProcessWindowFunction: called once per key and window with all buffered elements
  class MarketCountByChannel() extends ProcessWindowFunction[MarketUserBehavior, MarketViewCount, (String, String), TimeWindow] {
    override def process(key: (String, String), context: Context, elements: Iterable[MarketUserBehavior], out: Collector[MarketViewCount]): Unit = {
      val start = new Timestamp(context.window.getStart).toString
      val end = new Timestamp(context.window.getEnd).toString
      val channel = key._1
      val behavior = key._2
      val count = elements.size
      out.collect(MarketViewCount(start, end, channel, behavior, count))
    }
  }
}
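assignAscendingTimestamps, used in the job above, derives watermarks from the element timestamps themselves. The sketch below models that in plain Scala under two assumptions about Flink's behavior: the watermark trails the largest timestamp seen so far by 1 ms, and an event-time window [start, end) may fire once the watermark reaches end - 1. AscendingWatermarkSketch is a hypothetical model for illustration, not a Flink class.

```scala
class AscendingWatermarkSketch {
  private var maxTs: Long = Long.MinValue

  // Record an element timestamp; returns false if it breaks the ascending order
  def observe(ts: Long): Boolean =
    if (ts >= maxTs) { maxTs = ts; true } else false

  // Assumed rule: watermark = largest timestamp seen so far, minus 1 ms
  def currentWatermark: Long =
    if (maxTs == Long.MinValue) Long.MinValue else maxTs - 1

  // Assumed rule: a window ending at windowEnd fires once the watermark reaches windowEnd - 1
  def canFire(windowEnd: Long): Boolean =
    currentWatermark >= windowEnd - 1
}

object AscendingWatermarkDemo {
  def main(args: Array[String]): Unit = {
    val wm = new AscendingWatermarkSketch
    wm.observe(4999L)
    println(wm.canFire(5000L)) // false: watermark is 4998
    wm.observe(5000L)
    println(wm.canFire(5000L)) // true: watermark is 4999
    println(wm.observe(4000L)) // false: timestamp went backwards
  }
}
```

Under these assumptions a window's result is only emitted once a record with a timestamp at or past the window end has arrived. SimulatedSource stamps records with System.currentTimeMillis(), which keeps the timestamps ascending in practice.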