Flink Practice: Fraud Detection Example (with a Time Window)
Background
In the previous example, the core detection logic was: for the same account, if one transaction is below 1 yuan and the next is above 100 yuan, the account is flagged as potentially fraudulent. As a beginner exercise it already gave us a taste of the power of Flink state.
This example adds a time dimension, so the fraud rule changes slightly: if the same account has two transactions within five minutes, one below 1 yuan and one above 100 yuan, we consider the account potentially fraudulent and generate an alert.
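To make the rule itself concrete before wiring it into Flink, here is a minimal sketch, independent of the code that follows; the class, method, and parameter names are hypothetical and only spell out the condition being checked.

public final class FraudRule {
    private static final double SMALL_AMOUNT = 1d;
    private static final double LARGE_AMOUNT = 100d;
    private static final long WINDOW_MS = 5 * 60 * 1000L;

    /**
     * @param smallTxTimeMs time of the last small (below 1 yuan) transaction of this account, or null if none
     * @param amount        amount of the current transaction
     * @param nowMs         current time
     */
    public static boolean isSuspicious(Long smallTxTimeMs, double amount, long nowMs) {
        return smallTxTimeMs != null
                && amount > LARGE_AMOUNT
                && nowMs - smallTxTimeMs <= WINDOW_MS;
    }
}

The Flink program later in this article implements exactly this condition, keeping the "last small transaction" information in keyed state and using a timer to forget it once the window has passed.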
State Management in Flink
Stateful stream processing is the feature the official Flink documentation presents as the most representative of the framework.
What is State?
While many operations in a dataflow simply look at one individual event at a time (for example an event parser), some operations remember information across multiple events (for example window operators). These operations are called stateful. URL
This is the official explanation of state. Roughly: in a dataflow, some operations remember information across multiple events, e.g. window operators, and such operations are called stateful.
The quote says state is what an operation remembers. In Flink this "operation" splits into the parallel task of an operator and the operator itself; both the task and the operator (what we usually call an 算子) are able to record state, and that state is what allows a task to recover its data after a failure.
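As a tiny illustration of the quote, the sketch below (not from the article; the Tuple2 input of hypothetical (userId, 1L) events is made up) uses a keyed running sum. The sum is a stateful operation because it must remember the total per key across events, whereas a simple map would look at each event in isolation.

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StatefulCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(
                Tuple2.of("user-1", 1L),
                Tuple2.of("user-2", 1L),
                Tuple2.of("user-1", 1L))
            .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                private static final long serialVersionUID = 1L;
                @Override
                public String getKey(Tuple2<String, Long> value) {
                    return value.f0;
                }
            })
            // sum() is stateful: it has to remember the running total for every key
            // across events, unlike a map() that looks at one event in isolation
            .sum(1)
            .print();
        env.execute("stateful count sketch");
    }
}

Running it prints (user-1,1), (user-2,1), (user-1,2): the second user-1 event only produces 2 because the operator remembered the earlier one.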
Types of State in Flink
- Keyed State
  - "Keyed state is maintained in what can be thought of as an embedded key/value store." In other words, keyed state behaves like an embedded key-value store, and it can only be used on partitioned streams, i.e. on a keyed stream.
  - Kinds (a short sketch that registers several of these follows right after this list):
    - ValueState
    - MapState
    - ListState
    - ReducingState
    - AggregatingState
- Operator State
  - This is the task-level state we usually see: each parallel task keeps its own piece of state. E.g. the Kafka consumer: one task normally consumes one Kafka partition, so it needs to record the offset of that partition. (A minimal operator-state sketch appears further below.)
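The sketch below, referenced in the list above, shows how several of these keyed state kinds are declared and used inside a RichFlatMapFunction. It is not from the article: the input type (a plain Double amount on a stream assumed to already be keyed by account) and all names are illustrative only.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class KeyedStateKindsSketch extends RichFlatMapFunction<Double, String> {
    private static final long serialVersionUID = 1L;
    // One value per key, e.g. the last amount seen for this account
    private transient ValueState<Double> lastAmount;
    // A list per key, e.g. all amounts seen for this account
    private transient ListState<Double> amounts;
    // A map per key, e.g. amount bucket -> count for this account
    private transient MapState<String, Long> bucketCounts;

    @Override
    public void open(Configuration parameters) {
        lastAmount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastAmount", Types.DOUBLE));
        amounts = getRuntimeContext().getListState(
                new ListStateDescriptor<>("amounts", Types.DOUBLE));
        bucketCounts = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("bucketCounts", Types.STRING, Types.LONG));
    }

    @Override
    public void flatMap(Double amount, Collector<String> out) throws Exception {
        lastAmount.update(amount);
        amounts.add(amount);
        String bucket = amount > 100d ? "large" : "normal";
        Long current = bucketCounts.get(bucket);
        bucketCounts.put(bucket, current == null ? 1L : current + 1L);
        out.collect("seen " + amount + " in bucket " + bucket);
    }
}

Like every keyed state, these handles can only be obtained in a function that runs downstream of a keyBy; requesting them on a non-keyed stream fails at runtime.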
The official documentation mainly covers keyed state, largely because keyed state is what we use most of the time in day-to-day business logic.
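Before moving on, to make the operator-state item above concrete, here is a minimal sketch (not from the article) of a counting source that keeps its per-task "offset" in operator state via the CheckpointedFunction interface. It only mimics the idea behind the Kafka consumer example; the real connector is far more involved, and every name here is made up.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CountingSourceSketch implements SourceFunction<Long>, CheckpointedFunction {
    private static final long serialVersionUID = 1L;
    private volatile boolean running = true;
    // The "offset" this task has emitted so far; restored after a failure
    private long offset = 0L;
    private transient ListState<Long> offsetState;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            // Hold the checkpoint lock so the emitted record and the offset stay consistent
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(offset);
                offset++;
            }
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(offset);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("offset", Types.LONG));
        for (Long restored : offsetState.get()) {
            offset = restored;
        }
    }
}

On recovery, initializeState() hands the task back the offset it had snapshotted, which is exactly the "record the offset of the partition" behaviour described in the Kafka example above.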
Stateful Processing with a Timer
The core logic has already been described above, so let's get straight to the code.
package com.qingshan.practise;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

import com.qingshan.source.ReadLineSource;

/**
 *
 * @author qingshanit
 *
 */
public class FraudWithTimerDemo {

    public static void main(String[] args) {
        try {
            // Create the execution context for the job; here it actually builds a local environment
            final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Attach the data source
            final DataStream<String> source = env.addSource(new ReadLine2Source());
            // Normalize the data: convert each line of data2.csv into a ComsumeRecord object
            final DataStream<ComsumeRecord> consumeRecordsStream = source.map(
                    new MapFunction<String, ComsumeRecord>() {
                        private static final long serialVersionUID = 1L;

                        @Override
                        public ComsumeRecord map(String line) throws Exception {
                            String[] info = line.split(",");
                            return new ComsumeRecord(info[0], info[1], Double.parseDouble(info[2].trim()), Long.parseLong(info[3].trim()));
                        }
                    });
            // Partition the stream by user id
            final KeyedStream<ComsumeRecord, String> keyedDataStream = consumeRecordsStream.keyBy(new KeySelector<ComsumeRecord, String>() {
                private static final long serialVersionUID = 1L;

                // Return the partition key
                @Override
                public String getKey(ComsumeRecord record) throws Exception {
                    return record.getUserid();
                }
            });
            // Generate alerts via MyKeyedProcessWithTimerFunction
            final DataStream<Alert> alertStream = keyedDataStream.process(new MyKeyedProcessWithTimerFunction());
            // Print to the console
            //alertStream.print();
            alertStream.printToErr();
            // Start the job; do not forget this step!
            env.execute("demo");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
class MyKeyedProcessWithTimerFunction extends KeyedProcessFunction<String, ComsumeRecord, Alert> {

    private static final long serialVersionUID = 1L;

    // Flag state: non-null means the previous transaction of this account was below 1 yuan
    private transient ValueState<Boolean> isFraudState;
    // Timer state: remembers the cleanup timer currently registered for this account
    private transient ValueState<Long> timerState;

    private static final Double LARGE_AMOUNT = 100d;
    private static final Double SMALL_AMOUNT = 1d;

    @Override
    public void open(Configuration parameters) throws Exception {
        // First register a ValueStateDescriptor; registration happens only once, which is why it is done in open()
        ValueStateDescriptor<Boolean> isFraudStateDescriptor = new ValueStateDescriptor<Boolean>("isFraudState", Types.BOOLEAN);
        // Obtain the ValueState from the runtime context, similar to initializing a field
        isFraudState = this.getRuntimeContext().getState(isFraudStateDescriptor);
        // Register a timer state that records the currently registered timer for each key
        ValueStateDescriptor<Long> timerStateDescriptor = new ValueStateDescriptor<Long>("timerState", Types.LONG);
        // Obtain the ValueState from the runtime context, similar to initializing a field
        timerState = this.getRuntimeContext().getState(timerStateDescriptor);
    }

    /**
     * processElement handles one record at a time, while the ValueState is scoped to the
     * current key, i.e. the account we partitioned the stream by with keyBy.
     */
    @Override
    public void processElement(ComsumeRecord record, KeyedProcessFunction<String, ComsumeRecord, Alert>.Context context,
            Collector<Alert> collector) throws Exception {
        // Fetch the state for the current key
        Boolean lastWasSmallAmount = isFraudState.value();
        // A non-null value means the previous record of this account was a transaction below 1 yuan
        if (lastWasSmallAmount != null) {
            if (record.getAmount() > LARGE_AMOUNT) {
                Alert alert = new Alert(record.getUserid(), record.getAmount());
                collector.collect(alert);
            }
            // Delete the pending cleanup timer
            Long timer = timerState.value();
            context.timerService().deleteProcessingTimeTimer(timer);
            // Clean up the state
            timerState.clear();
            isFraudState.clear();
        }
        // If the amount is below SMALL_AMOUNT, set the fraud flag and register a new cleanup timer,
        // which starts the next round of detection
        if (record.getAmount() < SMALL_AMOUNT) {
            isFraudState.update(true);
            // Register a processing-time timer five minutes from the current processing time,
            // matching the five-minute window of the rule, and remember it in timerState
            long timer = context.timerService().currentProcessingTime() + 5 * 60 * 1000;
            context.timerService().registerProcessingTimeTimer(timer);
            timerState.update(timer);
        }
    }

    /**
     * onTimer is invoked when a registered timer fires.
     * Here it implements the cleanup logic: once the configured time window has passed,
     * the information recorded in the state for this key is cleared.
     */
    @Override
    public void onTimer(long timestamp, KeyedProcessFunction<String, ComsumeRecord, Alert>.OnTimerContext ctx,
            Collector<Alert> out) throws Exception {
        // Clear the state
        isFraudState.clear();
        timerState.clear();
    }
}
class ReadLine2Source implements SourceFunction<String> {

    private static final long serialVersionUID = 1L;

    private volatile boolean running = true;

    /**
     * Stop emitting data
     */
    @Override
    public void cancel() {
        running = false;
    }

    /**
     * The logic here simply emits one line of the file every two seconds
     */
    @Override
    public void run(SourceContext<String> context) throws Exception {
        BufferedReader br = null;
        String filePath = ReadLineSource.class.getClassLoader().getResource("./data2.csv").getPath();
        try {
            br = new BufferedReader(new FileReader(new File(filePath)));
            String line = null;
            while (running && (line = br.readLine()) != null) {
                //System.err.println("source output:" + line);
                context.collect(line);
                Thread.sleep(1000 * 2);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } finally {
            if (br != null) {
                br.close();
            }
        }
    }
}