In a stream processing application, handling a single event is straightforward as long as it does not need to interact with or access other events. It gets harder when processing an event depends on events arriving from another topic, or when later events depend on the one currently being processed. The situation resembles a join between tables: joining A with B and taking a few fields from A. Flink state makes this possible.
Consider the following scenario:
Course information (table,class_id,class_name,money) arrives on one topic, and student information (table,student_id,student_name,class_id) on another. When a student record arrives, we need to determine the price of the corresponding course. If both datasets sat in tables this would be easy: join the two tables on class_id. Flink SQL can express joins, but it would join the full datasets, which does not fit this streaming lookup. Instead, we can keep the two kinds of records in two pieces of state (note: Operator State here) and simply read the state whenever the related data is needed.
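The lookup itself can be sketched in plain Scala, independent of Flink. This is only an illustration of the join logic; classCache is a hypothetical stand-in for the state, and the record layouts are the ones described above.

```scala
import scala.collection.mutable

// Minimal sketch of the join logic: class records are cached in a map
// keyed by class_id, and each incoming student record is enriched with
// the course price looked up from that cache.
object JoinSketch {
  def main(args: Array[String]): Unit = {
    // class topic: table,class_id,class_name,money
    val classRecord = "class,1001,piano,10000"
    // student topic: table,student_id,student_name,class_id
    val studentRecord = "user,user01,zhangsan,1001"

    val classCache = mutable.HashMap[String, Map[String, String]]()

    val c = classRecord.split(",")
    classCache += (c(1) -> Map("class_name" -> c(2), "money" -> c(3)))

    val s = studentRecord.split(",")
    val money = classCache.get(s(3)).flatMap(_.get("money")).getOrElse("")
    println((s(1), money)) // (user01,10000)
  }
}
```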
Why not use keyed state?
With keyed state, the stream is partitioned by key. If class_id and student_id were used as keys here, course and student records would land under different keys, and the states defined inside the map function cannot be accessed across keys. One workaround is to assign course records and student records the same key, so both flow into the same keyed context, where the states defined there can all be reached.
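That "same key" trick can be sketched as follows. This is only a sketch: the ClassInfo/StudentInfo case classes and the pre-parsed classDs/studentDs streams are assumptions, not part of the original job. Keying both streams of a connected stream by class_id lets one CoProcessFunction share keyed state between them.

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical parsed record types for the two topics.
case class ClassInfo(class_id: String, class_name: String, money: String)
case class StudentInfo(student_id: String, student_name: String, class_id: String)

// Course and student records with the same class_id reach the same keyed
// context, so they can share a ValueState holding the course price.
class KeyedJoin extends CoProcessFunction[ClassInfo, StudentInfo, (String, String)] {
  lazy val moneyState: ValueState[String] = getRuntimeContext.getState(
    new ValueStateDescriptor[String]("money", classOf[String]))

  override def processElement1(c: ClassInfo,
                               ctx: CoProcessFunction[ClassInfo, StudentInfo, (String, String)]#Context,
                               out: Collector[(String, String)]): Unit =
    moneyState.update(c.money) // cache the price for this class_id

  override def processElement2(s: StudentInfo,
                               ctx: CoProcessFunction[ClassInfo, StudentInfo, (String, String)]#Context,
                               out: Collector[(String, String)]): Unit =
    out.collect((s.student_id, Option(moneyState.value()).getOrElse("")))
}

// Wiring, assuming classDs and studentDs are already-parsed streams:
// classDs.connect(studentDs)
//   .keyBy(_.class_id, _.class_id)
//   .process(new KeyedJoin())
```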
A cruder alternative is to store the course and student records in external storage such as HBase or Kudu and query them back at computation time. This works, but it leans heavily on a third-party component, gives up the advantages of Flink's stateful computation, and scales poorly.
The code implementing this logic is as follows:
package com.xxx.flink.demo

import com.xxx.flink.demo.KafkaConsts._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.contrib.streaming.state.{RocksDBOptions, RocksDBStateBackend}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext, StateBackend}
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.util.{StringUtils, TernaryBoolean}

import scala.collection.mutable

object StateVisitStreaming {

  private val checkpointDataUri = "hdfs:///flink/checkpoints"
  private val tmpDir = "file:///tmp/rocksdb/data/"

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // val rocksDBBackend: StateBackend = new RocksDBStateBackend(checkpointDataUri, true)
    val fsStateBackend: StateBackend = new FsStateBackend(checkpointDataUri)
    val rocksDBBackend: RocksDBStateBackend = new RocksDBStateBackend(fsStateBackend, TernaryBoolean.TRUE)
    val config = new Configuration()
    // The timer service can run on HEAP (default, better performance) or in RocksDB (better scalability)
    config.setString(RocksDBOptions.TIMER_SERVICE_FACTORY, RocksDBStateBackend.PriorityQueueStateType.ROCKSDB.toString)
    rocksDBBackend.configure(config)
    rocksDBBackend.setDbStoragePath(tmpDir)
    env.setStateBackend(rocksDBBackend.asInstanceOf[StateBackend])
    env.enableCheckpointing(5000)
    val consumer = new FlinkKafkaConsumer010[String](KAFKA_CONSUMER_TOPIC, new SimpleStringSchema(), KAFKA_PROP)
    consumer.setStartFromLatest()
    val ds = env.addSource(consumer)
    ds.filter(s => !StringUtils.isNullOrWhitespaceOnly(s))
      .map(new StateJoin())
      .print()
    env.execute()
  }

  /**
    * Simulates a join.
    *
    * table,class_id,class_name,money
    * class,1001,piano,10000
    *
    * table,student_id,student_name,class_id
    * user,user01,zhangsan,1001
    *
    * One piece of state per subject area:
    * classListState: ListState[Map[String, Map[String, String]]] - holds class info
    * studentListState: ListState[Map[String, Map[String, String]]] - holds student info
    */
  class StateJoin extends RichMapFunction[String, (String, String)] with CheckpointedFunction {
    @transient
    private var classListState: ListState[Map[String, Map[String, String]]] = _
    @transient
    private var studentListState: ListState[Map[String, Map[String, String]]] = _

    private val classMap = new mutable.HashMap[String, Map[String, String]]()
    private val studentMap = new mutable.HashMap[String, Map[String, String]]()

    override def snapshotState(context: FunctionSnapshotContext): Unit = {
      // On each checkpoint, replace the operator state with a fresh snapshot
      // of the in-memory maps instead of appending one copy per event.
      classListState.clear()
      classListState.add(classMap.toMap)
      studentListState.clear()
      studentListState.add(studentMap.toMap)
    }

    override def initializeState(context: FunctionInitializationContext): Unit = {
      val classDescriptor = new ListStateDescriptor[Map[String, Map[String, String]]]("class-info",
        TypeInformation.of(new TypeHint[Map[String, Map[String, String]]]() {}))
      classListState = context.getOperatorStateStore.getListState(classDescriptor)
      val studentDescriptor = new ListStateDescriptor[Map[String, Map[String, String]]]("student-info",
        TypeInformation.of(new TypeHint[Map[String, Map[String, String]]]() {}))
      studentListState = context.getOperatorStateStore.getListState(studentDescriptor)
      // After a failure, rebuild the in-memory maps from the restored operator state.
      if (context.isRestored) {
        val classIt = classListState.get().iterator()
        while (classIt.hasNext) classMap ++= classIt.next()
        val studentIt = studentListState.get().iterator()
        while (studentIt.hasNext) studentMap ++= studentIt.next()
      }
    }

    override def map(value: String): (String, String) = {
      val arr = value.split(",")
      val table = arr(0)
      if ("class".equals(table)) {
        // class,1001,piano,10000 -> cache the course info under its class_id
        val class_id = arr(1)
        classMap += (class_id -> Map("class_name" -> arr(2), "money" -> arr(3)))
        ("", "")
      } else if ("user".equals(table)) {
        // user,user01,zhangsan,1001 -> enrich the student with the course price
        val student_id = arr(1)
        val student_name = arr(2)
        val class_id = arr(3)
        val money = classMap.get(class_id).flatMap(_.get("money")).getOrElse("")
        studentMap += (student_id -> Map(
          "student_name" -> student_name,
          "class_id" -> class_id,
          "money" -> money))
        (student_id, money)
      } else {
        ("", "")
      }
    }
  }
}
This example uses RocksDB to hold very large state, with incremental checkpoints enabled for better performance. Flink's .out log files show the printed output, so you can check whether the results are correct.