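Preface:
In recent years, organizations have placed growing emphasis on mining and exploiting large volumes of data, and statistical analysis of big data gives business decision-makers solid evidence to work from. For example, analyzing a website's log data yields its daily visit count, which indicates how popular the site is; analyzing download figures for a mobile app indicates how popular the application is. The same data can be analyzed along further dimensions to give reliable support for operations analysis and promotion decisions.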
ETL on front-end log data with Flink (exam exercise)
1. Define a UserPageView class with the fields user, url, and ts
import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@AllArgsConstructor
@NoArgsConstructor
public class UserPageView {
    private String user;
    private String url;
    private Long ts;
}
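For readers who do not use Lombok, the sketch below shows roughly what the three annotations provide: a no-argument constructor, an all-arguments constructor, getters and setters, and equals/hashCode/toString. The class name `UserPageViewPlain` is hypothetical, and the `toString` format only approximates what Lombok generates.

```java
import java.util.Objects;

// Hand-written stand-in for the Lombok-annotated class (illustrative only).
public class UserPageViewPlain {
    private String user;
    private String url;
    private Long ts;

    // Public no-argument constructor, needed for POJO-style (de)serialization.
    public UserPageViewPlain() {}

    public UserPageViewPlain(String user, String url, Long ts) {
        this.user = user;
        this.url = url;
        this.ts = ts;
    }

    public String getUser() { return user; }
    public void setUser(String user) { this.user = user; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public Long getTs() { return ts; }
    public void setTs(Long ts) { this.ts = ts; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserPageViewPlain)) return false;
        UserPageViewPlain other = (UserPageViewPlain) o;
        return Objects.equals(user, other.user)
                && Objects.equals(url, other.url)
                && Objects.equals(ts, other.ts);
    }

    @Override
    public int hashCode() {
        return Objects.hash(user, url, ts);
    }

    @Override
    public String toString() {
        return "UserPageViewPlain(user=" + user + ", url=" + url + ", ts=" + ts + ")";
    }

    public static void main(String[] args) {
        System.out.println(new UserPageViewPlain("Mary", "./home", 1000L));
    }
}
```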
2. Define a custom data source: user is picked at random from "Mary,Alice,Bob,Cary", url is picked at random from "./home,./cart,./fav,./prod?id=1,./prod?id=2", and ts is the current timestamp
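GeneratorFunction<Long, UserPageView> generatorFunction = new GeneratorFunction<Long, UserPageView>() {
    @Override
    public UserPageView map(Long aLong) throws Exception {
        String[] users = "Mary,Alice,Bob,Cary".split(",");
        String[] urls = "./home,./cart,./fav,./prod?id=1,./prod?id=2".split(",");
        Random random = new Random();
        UserPageView userPageView = new UserPageView();
        userPageView.setUser(users[random.nextInt(users.length)]); // random user
        userPageView.setUrl(urls[random.nextInt(urls.length)]);    // random url
        userPageView.setTs(System.currentTimeMillis());            // current timestamp
        return userPageView;
    }
};
// Wrap the generator function in a source (same as in the complete code)
DataGeneratorSource<UserPageView> source =
        new DataGeneratorSource<>(
                generatorFunction,
                Long.MAX_VALUE,                   // simulate an unbounded stream
                RateLimiterStrategy.perSecond(1), // one record per second
                Types.POJO(UserPageView.class));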
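The selection logic of the generator can be tried without a Flink job. The following is a minimal stdlib-only sketch, assuming a hypothetical class `PageViewPicker` that returns each record as a `String[]` of user, url, and timestamp; it also takes a seed so that the user and url sequence is repeatable, which the original generator does not do.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Stand-alone version of the random user/url selection (illustrative only).
public class PageViewPicker {
    public static final List<String> USERS =
            Arrays.asList("Mary", "Alice", "Bob", "Cary");
    public static final List<String> URLS =
            Arrays.asList("./home", "./cart", "./fav", "./prod?id=1", "./prod?id=2");

    private final Random random;

    public PageViewPicker(long seed) {
        this.random = new Random(seed);
    }

    // Returns {user, url, timestamp in milliseconds as text}.
    public String[] next() {
        String user = USERS.get(random.nextInt(USERS.size()));
        String url = URLS.get(random.nextInt(URLS.size()));
        return new String[] {user, url, String.valueOf(System.currentTimeMillis())};
    }

    public static void main(String[] args) {
        PageViewPicker picker = new PageViewPicker(42L);
        for (int i = 0; i < 5; i++) {
            System.out.println(String.join(",", picker.next()));
        }
    }
}
```

Creating the `Random` once, rather than on every call as the generator function does, avoids repeated allocation; in a Flink function the same effect would need a serializable field.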
3. Use event-time semantics, with a watermark delay of 5 seconds
DataStreamSource<UserPageView> stream =
        env.fromSource(source,
                WatermarkStrategy
                        .<UserPageView>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.getTs()),
                "Generator Source");
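To make the 5-second delay concrete: a bounded-out-of-orderness watermark generator tracks the largest event timestamp seen so far and reports a watermark equal to that maximum minus the delay minus 1 ms. The sketch below mimics only that arithmetic in plain Java; the class `BoundedWatermarkSketch` and its method names are hypothetical and are not Flink API.

```java
// Plain-Java imitation of bounded-out-of-orderness watermark arithmetic.
public class BoundedWatermarkSketch {
    private final long maxOutOfOrdernessMs;
    private long maxTimestamp;

    public BoundedWatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // Initial value chosen so the first subtraction cannot underflow.
        this.maxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    // Called for every event: only a larger timestamp advances the maximum.
    public void onEvent(long eventTimestampMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMs);
    }

    // Watermark = largest timestamp seen - allowed delay - 1 ms.
    public long currentWatermark() {
        return maxTimestamp - maxOutOfOrdernessMs - 1;
    }

    public static void main(String[] args) {
        BoundedWatermarkSketch wm = new BoundedWatermarkSketch(5_000L);
        wm.onEvent(10_000L);
        System.out.println(wm.currentWatermark()); // 4999
        wm.onEvent(8_000L);  // out-of-order event: watermark does not move back
        System.out.println(wm.currentWatermark()); // 4999
        wm.onEvent(12_000L);
        System.out.println(wm.currentWatermark()); // 6999
    }
}
```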
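4. Map each record to a 2-tuple of (user, count) (two approaches)
5. Key the stream by user, open a 1-hour window, and count how many times each user appears within the hour [steps 4 and 5 are solved by a single piece of code]
(1). Approach 1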
SingleOutputStreamOperator<Tuple2<String, Integer>> operator = stream
        .map(value -> Tuple2.of(value.getUser(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(value -> value.f0)
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        .aggregate(new AggregateFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> createAccumulator() {
                return Tuple2.of(null, 0);
            }

            @Override
            public Tuple2<String, Integer> add(Tuple2<String, Integer> value, Tuple2<String, Integer> accumulator) {
                accumulator.f0 = value.f0;  // remember the user name
                accumulator.f1 += value.f1; // add to the running count
                return accumulator;
            }

            @Override
            public Tuple2<String, Integer> getResult(Tuple2<String, Integer> accumulator) {
                return accumulator;
            }

            @Override
            public Tuple2<String, Integer> merge(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                // Only called for merging windows (e.g. session windows); combine instead of returning null
                return Tuple2.of(a.f0 != null ? a.f0 : b.f0, a.f1 + b.f1);
            }
        });
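A tumbling event-time window only emits its result once the watermark reaches the last millisecond of the window. Assuming a watermark of "largest timestamp seen minus 5000 ms minus 1 ms", a 1-hour window therefore produces output only after an event arrives that is at least 5 seconds past the window end. The sketch below works through that arithmetic in plain Java; `TumblingWindowSketch` and its methods are hypothetical names for illustration, not Flink API.

```java
// Plain-Java arithmetic for tumbling windows with zero offset (illustrative only).
public class TumblingWindowSketch {

    // Start of the window that contains the given timestamp.
    public static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    // Watermark produced with a bounded delay, given the largest timestamp seen.
    public static long watermarkFor(long maxTimestampMs, long delayMs) {
        return maxTimestampMs - delayMs - 1;
    }

    // A window [start, end) fires once the watermark reaches end - 1.
    public static boolean fires(long watermarkMs, long windowEndMs) {
        return watermarkMs >= windowEndMs - 1;
    }

    public static void main(String[] args) {
        long hour = 3_600_000L;
        long eventTs = 3_725_000L; // 1 h 2 min 5 s after the epoch
        long start = windowStart(eventTs, hour);
        long end = start + hour;
        System.out.println(start + " .. " + end); // 3600000 .. 7200000

        // With a 5 s delay the window fires only for an event at end + 5 s or later.
        System.out.println(fires(watermarkFor(7_204_999L, 5_000L), end)); // false
        System.out.println(fires(watermarkFor(7_205_000L, 5_000L), end)); // true
    }
}
```

This is why a shorter window is more practical when testing by hand: with one record per second, a 1-hour window prints nothing for about an hour.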
(2). Approach 2
SingleOutputStreamOperator<Tuple2<String, Integer>> reduce = stream
        .map(value -> Tuple2.of(value.getUser(), 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(value -> value.f0)
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> v1, Tuple2<String, Integer> v2) {
                // Same key on both sides, so keep the user and add the counts
                return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
            }
        });
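Both approaches compute the same thing: one count per user per window. The sketch below reproduces that result on a small in-memory batch with a map, which can help when checking expected output by hand. The class `WindowedCountSketch`, the `windowStart/user` key format, and the sample data are all assumptions for illustration; this is a batch imitation, not how Flink evaluates windows.

```java
import java.util.Map;
import java.util.TreeMap;

// Batch imitation of "key by user, tumbling window, count" (illustrative only).
public class WindowedCountSketch {

    // Returns counts keyed by "<windowStartMs>/<user>".
    public static Map<String, Integer> countPerWindowAndUser(
            long[] timestampsMs, String[] users, long windowSizeMs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < timestampsMs.length; i++) {
            long ts = timestampsMs[i];
            long start = ts - Math.floorMod(ts, windowSizeMs);
            // Same effect as reducing (user, 1) tuples by adding their counts.
            counts.merge(start + "/" + users[i], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] ts = {1_000L, 2_000L, 9_000L, 11_000L, 12_000L};
        String[] users = {"Mary", "Alice", "Mary", "Mary", "Bob"};
        // 10-second windows: [0, 10000) and [10000, 20000)
        System.out.println(countPerWindowAndUser(ts, users, 10_000L));
        // {0/Alice=1, 0/Mary=2, 10000/Bob=1, 10000/Mary=1}
    }
}
```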
6. Print the result to check it [either one is enough]
operator.print();
reduce.print();
Complete code:
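// Import paths assume Flink 1.17-1.19 with the flink-connector-datagen dependency
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;
import java.util.Random;

public class UserPageViewCountDmo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1
        env.setParallelism(1);
        GeneratorFunction<Long, UserPageView> generatorFunction = new GeneratorFunction<Long, UserPageView>() {
            @Override
            public UserPageView map(Long aLong) throws Exception {
                String[] users = "Mary,Alice,Bob,Cary".split(",");
                String[] urls = "./home,./cart,./fav,./prod?id=1,./prod?id=2".split(",");
                Random random = new Random();
                UserPageView userPageView = new UserPageView();
                userPageView.setUser(users[random.nextInt(users.length)]); // user picked at random from "Mary,Alice,Bob,Cary"
                userPageView.setUrl(urls[random.nextInt(urls.length)]);    // url picked at random from the url list
                userPageView.setTs(System.currentTimeMillis());            // ts is the current timestamp
                return userPageView;
            }
        };
        long numberOfRecords = Long.MAX_VALUE; // simulate an unbounded stream
        DataGeneratorSource<UserPageView> source =
                new DataGeneratorSource<>(
                        generatorFunction,
                        numberOfRecords,
                        RateLimiterStrategy.perSecond(1), // one record per second
                        Types.POJO(UserPageView.class));
        DataStreamSource<UserPageView> stream =
                env.fromSource(source,
                        WatermarkStrategy
                                // Event-time semantics with a 5-second watermark delay
                                .<UserPageView>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                .withTimestampAssigner((event, ts) -> event.getTs()),
                        "Generator Source");
        // stream.print();
        // Map to a 2-tuple of (user, count) (approach 1)
        SingleOutputStreamOperator<Tuple2<String, Integer>> operator = stream
                .map(value -> Tuple2.of(value.getUser(), 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(value -> value.f0)
                // Key by user, open a 1-hour window, count how often each user appears in the hour
                .window(TumblingEventTimeWindows.of(Time.hours(1)))
                .aggregate(new AggregateFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> createAccumulator() {
                        return Tuple2.of(null, 0);
                    }

                    @Override
                    public Tuple2<String, Integer> add(Tuple2<String, Integer> value, Tuple2<String, Integer> accumulator) {
                        accumulator.f0 = value.f0;  // remember the user name
                        accumulator.f1 += value.f1; // add to the running count
                        return accumulator;
                    }

                    @Override
                    public Tuple2<String, Integer> getResult(Tuple2<String, Integer> accumulator) {
                        return accumulator;
                    }

                    @Override
                    public Tuple2<String, Integer> merge(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                        // Only called for merging windows (e.g. session windows); combine instead of returning null
                        return Tuple2.of(a.f0 != null ? a.f0 : b.f0, a.f1 + b.f1);
                    }
                });
        // Map to a 2-tuple of (user, count) (approach 2)
        SingleOutputStreamOperator<Tuple2<String, Integer>> reduce = stream
                .map(value -> Tuple2.of(value.getUser(), 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(value -> value.f0)
                // The exercise asks for a 1-hour window; 10 seconds is used here so results appear quickly when testing
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> v1, Tuple2<String, Integer> v2) {
                        return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
                    }
                });
        // operator.print();
        reduce.print();
        // Run the job
        env.execute();
    }
}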
Test output:
// Results are printed every 10 seconds (for reference only)
(Alice,2)
(Mary,2)
(Bob,1)