6 Flink Stream Processing in Practice
6.1 Traffic Statistics Based on Tracking-Log Data
6.1.1 Counting Total Page Views (PV)
The simplest metric for measuring website traffic is the page view count (Page View, PV). Each time a user opens a page, one PV is recorded; opening the same page repeatedly accumulates additional views.
In general, PV grows with the number of visitors, but it does not directly reflect the real number of visitors: a single visitor can produce an arbitrarily high PV simply by refreshing a page over and over. Below we implement PV counting with the Flink operators covered in the previous chapters.
// JavaBean that wraps one record of the tracking log
package com.atguigu.flink.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class UserBehavior {
    private Long userId;
    private Long itemId;
    private Integer categoryId;
    private String behavior;
    private Long timestamp;
}
package com.atguigu.flink;

import com.atguigu.flink.bean.UserBehavior;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class Project_PV {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setParallelism(2);
        // Variant 1: WordCount style, map every "pv" record to ("pv", 1L) and sum
        /* env
            .readTextFile("input/UserBehavior.csv")
            .map(line -> {
                String[] split = line.split(",");
                return new UserBehavior(
                        Long.valueOf(split[0]),
                        Long.valueOf(split[1]),
                        Integer.valueOf(split[2]),
                        split[3],
                        Long.valueOf(split[4]));
            })
            .filter(behavior -> "pv".equals(behavior.getBehavior()))
            .map(behavior -> Tuple2.of("pv", 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(value -> value.f0)
            .sum(1)
            .print(); */
        // Variant 2: count inside a KeyedProcessFunction
        env
            .readTextFile("input/UserBehavior.csv")
            .map(line -> {
                String[] split = line.split(",");
                return new UserBehavior(
                        Long.valueOf(split[0]),
                        Long.valueOf(split[1]),
                        Integer.valueOf(split[2]),
                        split[3],
                        Long.valueOf(split[4]));
            })
            .filter(behavior -> "pv".equals(behavior.getBehavior()))
            .keyBy(UserBehavior::getBehavior)
            .process(new KeyedProcessFunction<String, UserBehavior, Long>() {
                // one counter per parallel instance (see the notes at the end of this chapter)
                long count = 0;

                @Override
                public void processElement(UserBehavior value, Context ctx, Collector<Long> out) throws Exception {
                    count++;
                    out.collect(count);
                }
            })
            .print();
        env.execute();
    }
}
6.1.2 Counting Unique Visitors (UV)
In the previous example we counted every page-view action of every user, so the same user's repeated visits were counted multiple times. In practice we usually also want to know how many distinct users visited the site, so another important traffic metric is the unique visitor count (Unique Visitor, UV).
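Before wiring this into Flink, the difference between PV and UV can be pinned down in plain Java: PV counts every page-view event, while UV counts distinct users among those events. A toy sketch with made-up user IDs (not Flink code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class PvUvDemo {
    // PV: one count per page-view event
    static long pv(List<Long> pvUserIds) {
        return pvUserIds.size();
    }

    // UV: distinct users among the page-view events
    static long uv(List<Long> pvUserIds) {
        return new HashSet<>(pvUserIds).size();
    }

    public static void main(String[] args) {
        // user 1 viewed twice, user 2 once
        List<Long> userIds = Arrays.asList(1L, 1L, 2L);
        System.out.println("pv=" + pv(userIds) + ", uv=" + uv(userIds)); // prints pv=3, uv=2
    }
}
```

The HashSet deduplication here is exactly what the KeyedProcessFunction below does, one element at a time, on a stream.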
package com.atguigu.flink;

import com.atguigu.flink.bean.UserBehavior;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashSet;

public class Project_UV {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env
            .readTextFile("input/UserBehavior.csv")
            .flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
                String[] split = line.split(",");
                UserBehavior behavior = new UserBehavior(
                        Long.valueOf(split[0]),
                        Long.valueOf(split[1]),
                        Integer.valueOf(split[2]),
                        split[3],
                        Long.valueOf(split[4]));
                if ("pv".equals(behavior.getBehavior())) {
                    // UV deduplicates by user, so emit the userId (not the itemId)
                    out.collect(Tuple2.of("uv", behavior.getUserId()));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .process(new KeyedProcessFunction<String, Tuple2<String, Long>, Long>() {
                HashSet<Long> userIds = new HashSet<>();

                @Override
                public void processElement(Tuple2<String, Long> value, Context ctx, Collector<Long> out) throws Exception {
                    // the set size only grows when a previously unseen userId arrives
                    userIds.add(value.f1);
                    out.collect((long) userIds.size());
                }
            })
            .print("uv");
        env.execute();
    }
}
6.2 Statistical Analysis of Marketing Business Metrics
With the spread of smartphones, more and more users of today's e-commerce sites come from mobile devices; compared with logging in through a traditional browser, the mobile APP has become the preferred way for users to reach an e-commerce site. E-commerce companies promote their APPs through a variety of channels, and the statistics from those channels (for example, click counts of ad links on different sites, or APP download counts) become important marketing business metrics.
6.2.1 APP Marketing Promotion Statistics: By Channel / Overall
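The source for the MarketingUserBehavior bean does not appear in this section, although the job below imports and constructs it. Judging from the constructor call in AppMarketingDataSource, it presumably mirrors the other Lombok beans in this chapter (the field names here are assumptions inferred from the getters used below):

```java
package com.atguigu.flink.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@NoArgsConstructor
@AllArgsConstructor
public class MarketingUserBehavior {
    private Long userId;
    private String behavior;   // download / install / update / uninstall
    private String channel;    // e.g. huawei, xiaomi, apple, ...
    private Long timestamp;
}
```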
package com.atguigu.flink;

import com.atguigu.flink.bean.MarketingUserBehavior;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class Project_AppAnalysis_By_Channel {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Variant 1: statistics per channel (key = channel + "_" + behavior)
        /* env
            .addSource(new AppMarketingDataSource())
            .map(behavior -> Tuple2.of(behavior.getChannel() + "_" + behavior.getBehavior(), 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .sum(1)
            .print(); */
        // Variant 2: statistics regardless of channel (key = behavior only)
        env
            .addSource(new AppMarketingDataSource())
            .map(behavior -> Tuple2.of(behavior.getBehavior(), 1L))
            .returns(Types.TUPLE(Types.STRING, Types.LONG))
            .keyBy(t -> t.f0)
            .sum(1)
            .print();
        env.execute();
    }

    public static class AppMarketingDataSource extends RichSourceFunction<MarketingUserBehavior> {
        boolean canRun = true;
        Random random = new Random();
        List<String> channels = Arrays.asList("huawei", "xiaomi", "apple", "baidu", "qq", "oppo", "vivo");
        List<String> behaviors = Arrays.asList("download", "install", "update", "uninstall");

        @Override
        public void run(SourceContext<MarketingUserBehavior> ctx) throws Exception {
            while (canRun) {
                MarketingUserBehavior marketingUserBehavior = new MarketingUserBehavior(
                        (long) random.nextInt(1000000),
                        behaviors.get(random.nextInt(behaviors.size())),
                        channels.get(random.nextInt(channels.size())),
                        System.currentTimeMillis());
                ctx.collect(marketingUserBehavior);
                Thread.sleep(2000);
            }
        }

        @Override
        public void cancel() {
            canRun = false;
        }
    }
}
6.3 Real-Time Statistics of Page-Ad Clicks by Province
Among an e-commerce site's marketing metrics, besides promotion of its own APP there is also ad placement on its pages (for both its own products and other sites' products), so ad-related statistics are another important marketing metric.
For ads, the simplest and most important statistic is the page-ad click count. Sites use click counts to set pricing strategy and adjust how they promote, and can also collect user-preference information from them. More concretely, we can segment users by geographic location and derive which ads users in each province prefer, which helps target ads precisely.
Data preparation
For this case we provide an ad-click log from an e-commerce site, AdClickLog.csv. The file contains one day's stream of user ad-click events; each line is one click action, consisting of the user ID, ad ID, province, city and timestamp, separated by commas.
Place AdClickLog.csv in the project directory under input.
package com.atguigu.flink.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class AdsClickLog {
    private Long userId;
    private Long adId;
    private String province;
    private String city;
    private Long timestamp;
}
package com.atguigu.flink;

import com.atguigu.flink.bean.AdsClickLog;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import static org.apache.flink.api.common.typeinfo.Types.*;

public class Project_Ads_Click {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);
        env
            .readTextFile("input/AdClickLog.csv")
            .map(line -> {
                String[] split = line.split(",");
                return new AdsClickLog(
                        Long.valueOf(split[0]),
                        Long.valueOf(split[1]),
                        split[2],
                        split[3],
                        Long.valueOf(split[4]));
            })
            // composite key = (province, adId), value = 1
            .map(log -> Tuple2.of(Tuple2.of(log.getProvince(), log.getAdId()), 1L))
            .returns(TUPLE(TUPLE(STRING, LONG), LONG))
            .keyBy(new KeySelector<Tuple2<Tuple2<String, Long>, Long>, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> getKey(Tuple2<Tuple2<String, Long>, Long> value) throws Exception {
                    return value.f0;
                }
            })
            .sum(1)
            .print("province-ad");
        env.execute();
    }
}
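The keyed sum performed by the job above is equivalent to a grouped count over the composite key. A plain-Java sketch of that aggregation, using hypothetical sample lines that follow the userId,adId,province,city,timestamp layout described earlier:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ProvinceAdCount {
    // Equivalent of keyBy((province, adId)).sum(1): group by a composite key and count
    static Map<String, Long> count(List<String> csvLines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : csvLines) {
            String[] f = line.split(",");
            String key = f[2] + "-" + f[1]; // province-adId
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(   // hypothetical sample records
                "543462,1715,beijing,beijing,1511658000",
                "662867,2244074,guangdong,guangzhou,1511658060",
                "543462,1715,beijing,beijing,1511658120");
        System.out.println(count(lines));
    }
}
```

Flink performs the same accumulation incrementally, emitting the running count for a key each time a click for that key arrives.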
6.4 Real-Time Monitoring of Order Payments
On an e-commerce site, order payment is the step directly tied to revenue, so it is critical in the business flow. To control the flow correctly, and to increase users' willingness to pay, sites usually set a payment timeout: orders not paid within a certain time are cancelled. In addition, we should verify that each payment is correct, which can be done by reconciling in real time against the transaction data of the third-party payment platform.
Requirement: match order events against transaction events from two streams
For an order-payment event, the user completing payment is not the end of the story: we still need to confirm that the money arrived in the platform's account. That confirmation usually comes from a different log, so we read two streams and merge them for processing.
Order data is read from OrderLog.csv, transaction data from ReceiptLog.csv.
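Stripped of the Flink plumbing, the reconciliation strategy used below is symmetric buffering: each side first checks whether its counterpart has already arrived, and otherwise parks itself in a pending map keyed by the transaction id. A standalone plain-Java sketch of that logic (class and field names are illustrative, not from the source):

```java
import java.util.HashMap;
import java.util.Map;

public class ReconcileSketch {
    // events still waiting for their counterpart, keyed by txId
    final Map<String, String> pendingOrders = new HashMap<>();
    final Map<String, String> pendingReceipts = new HashMap<>();

    // returns a match message, or null if the counterpart has not arrived yet
    String onOrder(String txId, String order) {
        if (pendingReceipts.remove(txId) != null) {
            return "order " + order + " reconciled";
        }
        pendingOrders.put(txId, order);
        return null;
    }

    String onReceipt(String txId, String receipt) {
        String order = pendingOrders.remove(txId);
        if (order != null) {
            return "order " + order + " reconciled";
        }
        pendingReceipts.put(txId, receipt);
        return null;
    }

    public static void main(String[] args) {
        ReconcileSketch s = new ReconcileSketch();
        System.out.println(s.onOrder("tx1", "order-34729"));  // null: receipt not seen yet
        System.out.println(s.onReceipt("tx1", "wx-receipt")); // matched
    }
}
```

The CoProcessFunction below implements exactly this, with processElement1 playing the role of onOrder and processElement2 the role of onReceipt.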
package com.atguigu.flink.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class OrderEvent {
    private Long orderId;
    private String eventType;
    // transaction id
    private String txId;
    private Long eventTime;
}
package com.atguigu.flink.bean;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class TxEvent {
    private String txId;
    private String payChannel;
    private Long eventTime;
}
package com.atguigu.flink;

import com.atguigu.flink.bean.OrderEvent;
import com.atguigu.flink.bean.TxEvent;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

public class Project_Order {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        SingleOutputStreamOperator<OrderEvent> orderEventDS = env
            .readTextFile("input/OrderLog.csv")
            .map(line -> {
                String[] split = line.split(",");
                return new OrderEvent(Long.valueOf(split[0]), split[1], split[2], Long.valueOf(split[3]));
            });
        SingleOutputStreamOperator<TxEvent> txDS = env
            .readTextFile("input/ReceiptLog.csv")
            .map(line -> {
                String[] split = line.split(",");
                return new TxEvent(split[0], split[1], Long.valueOf(split[2]));
            });
        ConnectedStreams<OrderEvent, TxEvent> orderAndTx = orderEventDS.connect(txDS);
        orderAndTx
            // key both streams by the transaction id so matching events meet in the same instance
            .keyBy(OrderEvent::getTxId, TxEvent::getTxId)
            .process(new CoProcessFunction<OrderEvent, TxEvent, String>() {
                // txId -> OrderEvent: orders still waiting for their transaction
                Map<String, OrderEvent> orderMap = new HashMap<>();
                // txId -> TxEvent: transactions still waiting for their order
                Map<String, TxEvent> txMap = new HashMap<>();

                @Override
                public void processElement1(OrderEvent value, Context ctx, Collector<String> out) throws Exception {
                    // has the matching transaction already arrived?
                    if (txMap.containsKey(value.getTxId())) {
                        out.collect("order " + value + " reconciled successfully");
                        txMap.remove(value.getTxId());
                    } else {
                        orderMap.put(value.getTxId(), value);
                    }
                }

                @Override
                public void processElement2(TxEvent value, Context ctx, Collector<String> out) throws Exception {
                    if (orderMap.containsKey(value.getTxId())) {
                        OrderEvent orderEvent = orderMap.get(value.getTxId());
                        out.collect("order " + orderEvent + " reconciled successfully");
                        orderMap.remove(value.getTxId());
                    } else {
                        txMap.put(value.getTxId(), value);
                    }
                }
            })
            .print();
        env.execute();
    }
}
Note 1:
new ProcessFunction
=> is instantiated once per parallel instance; that is, each parallel instance has exactly one object, so each of its fields is stored exactly once per instance.
=> "process one record at a time" refers to method invocation: for every incoming record, object.processElement(...) is called once on that instance.
Note 2:
- When using connect to join two streams for condition matching, under parallelism > 1 you must keyBy both streams on the matching key; otherwise the two halves of a pair can land in different parallel instances and never meet.
- Keying each stream separately before connect and keying once after connect have the same effect.
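The reason the key matters is that keyBy routes records to parallel instances by key hash, so two records with the same txId are guaranteed to reach the same instance, where the per-instance HashMaps can actually see both. A toy simulation of that routing (a simplified stand-in for Flink's real key-group assignment, not the actual algorithm):

```java
public class KeyRoutingDemo {
    // Simplified stand-in for Flink's key-group assignment: route by key hash.
    static int instanceFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 2;
        String txId = "tx-42"; // hypothetical transaction id shared by an order and a receipt
        // Both streams are keyed by txId, so the order event and the receipt
        // event with the same txId are routed to the same parallel instance.
        System.out.println(instanceFor(txId, parallelism) == instanceFor(txId, parallelism)); // prints true
    }
}
```

Without the keyBy, file-source splits are distributed round-robin and a matching order/receipt pair could end up in different instances, leaving both entries stranded in their pending maps.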