Implementing Mutual Access Between State Data in Flink (Dependent Data Access, Similar to a Join Between Tables)

In a stream-processing application, handling a single event is straightforward as long as it does not need to interact with or access other events. But when processing an event depends on events arriving from another topic, or downstream processing depends on the event currently being handled, the situation resembles a join between tables: join A and B and take a few columns from A. Flink state can implement exactly this.

Here is one such scenario:

     Course information (table,class_id,class_name,money) arrives from one topic, and student information (table,student_id,student_name,class_id) arrives from another. When a student record arrives, we need to compute the price of the corresponding course. If both datasets lived in database tables this would be trivial: join the two tables on class_id. Flink SQL can perform joins, but it would join the full datasets, which does not fit here. Instead, we can keep the two kinds of records in two pieces of state, using Operator State (note: not Keyed State), and simply read that state whenever we need the joined data.
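     For example, with the two sample records used later in the code:

class,1001,piano,10000       (a course: class_id 1001, name piano, price 10000)
user,user01,zhangsan,1001    (a student: user01, zhangsan, enrolled in class 1001)

     When the user record arrives, the expected joined output is (user01, 10000).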

Why not use Keyed State?

     With Keyed State, records are partitioned by key. If class_id and student_id are used as the keys, the states defined inside the map function cannot see each other across keys. One workaround is to assign the course record and the student record the same key so that they flow into the same instance, where the several states defined there are all accessible; a sketch of this trick follows below.
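     A minimal sketch of that workaround, assuming the two record types arrive on two separate streams and both are keyed by class_id (the class name KeyedStateJoin and the stream wiring here are illustrative, not part of the original code):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Both inputs share the same key (class_id), so flatMap1 and flatMap2
// operate on the same keyed state.
class KeyedStateJoin extends RichCoFlatMapFunction[String, String, (String, String)] {

  @transient private var moneyState: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {
    moneyState = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("class-money", classOf[String]))
  }

  // Course record "class,1001,piano,10000": remember the price for this class_id.
  override def flatMap1(value: String, out: Collector[(String, String)]): Unit = {
    moneyState.update(value.split(",")(3))
  }

  // Student record "user,user01,zhangsan,1001": read the price stored under the same key.
  override def flatMap2(value: String, out: Collector[(String, String)]): Unit = {
    val arr = value.split(",")
    out.collect((arr(1), Option(moneyState.value()).getOrElse("")))
  }
}

// Wiring (illustrative): key course records by field 1, student records by field 3,
// so that a course and its students land on the same parallel instance.
// classStream.keyBy(_.split(",")(1))
//   .connect(studentStream.keyBy(_.split(",")(3)))
//   .flatMap(new KeyedStateJoin())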

     Another, clumsier approach is to store the course and student records externally, for example in HBase or Kudu, and query them back at computation time. This works, but it depends heavily on a third-party component, fails to exploit Flink's stateful computation, and scales poorly.
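     For comparison, a sketch of that external-lookup variant; the KvClient trait below is a hypothetical stand-in for an HBase or Kudu client, not a real API:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// Hypothetical key-value client standing in for HBase/Kudu.
trait KvClient extends Serializable {
  def put(table: String, key: String, row: Map[String, String]): Unit
  def get(table: String, key: String): Option[Map[String, String]]
  def close(): Unit
}

class ExternalJoin(mkClient: () => KvClient) extends RichMapFunction[String, (String, String)] {

  @transient private var client: KvClient = _

  override def open(parameters: Configuration): Unit = { client = mkClient() }

  override def close(): Unit = { if (client != null) client.close() }

  override def map(value: String): (String, String) = {
    val arr = value.split(",")
    if (arr(0) == "class") {
      // Write the course row out to the external store; nothing to emit yet.
      client.put("class", arr(1), Map("class_name" -> arr(2), "money" -> arr(3)))
      ("", "")
    } else {
      // Every student record pays a network round trip to the external store.
      val money = client.get("class", arr(3)).flatMap(_.get("money")).getOrElse("")
      (arr(1), money)
    }
  }
}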

     The logical computation flow is as follows (figure omitted):

The code is as follows:

package com.xxx.flink.demo

import com.xxx.flink.demo.KafkaConsts._
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.contrib.streaming.state.{RocksDBOptions, RocksDBStateBackend}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext, StateBackend}
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.util.{StringUtils, TernaryBoolean}

import scala.collection.mutable



object StateVisitStreaming {
  private val checkpointDataUri = "hdfs:///flink/checkpoints"
  private val tmpDir = "file:///tmp/rocksdb/data/"

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //    val rocksDBBackend:StateBackend = new RocksDBStateBackend(checkpointDataUri,true)
    val fsStateBackend: StateBackend = new FsStateBackend(checkpointDataUri)
    val rocksDBBackend: RocksDBStateBackend = new RocksDBStateBackend(fsStateBackend, TernaryBoolean.TRUE)
    val config = new Configuration()
    // The timer service can live on HEAP (default, better performance) or in ROCKSDB (better scalability)
    config.setString(RocksDBOptions.TIMER_SERVICE_FACTORY, RocksDBStateBackend.PriorityQueueStateType.ROCKSDB.toString)
    rocksDBBackend.configure(config)
    rocksDBBackend.setDbStoragePath(tmpDir)
    env.setStateBackend(rocksDBBackend.asInstanceOf[StateBackend])
    env.enableCheckpointing(5000)
    val consumer = new FlinkKafkaConsumer010[String](KAFKA_CONSUMER_TOPIC, new SimpleStringSchema(), KAFKA_PROP)
    consumer.setStartFromLatest()
    val ds = env.addSource(consumer)
    ds.filter(s => !StringUtils.isNullOrWhitespaceOnly(s))
      .map(new StateJoin())
      .print()
    env.execute()
  }


  /**
    * Simulates a join.
    *
    * Course records:  table,class_id,class_name,money
    *   e.g. class,1001,piano,10000
    *
    * Student records: table,student_id,student_name,class_id
    *   e.g. user,user01,zhangsan,1001
    *
    * One piece of operator state per subject area:
    * classListState:   ListState[Map[String, Map[String, String]]] - stores class info
    * studentListState: ListState[Map[String, Map[String, String]]] - stores student info
    */
  class StateJoin extends RichMapFunction[String, (String, String)] with CheckpointedFunction {

    @transient
    private var classListState: ListState[Map[String, Map[String, String]]] = _
    @transient
    private var studentListState: ListState[Map[String, Map[String, String]]] = _
    // In-memory working copies of the state; persisted to operator state on each checkpoint.
    private val classMap = new mutable.HashMap[String, Map[String, String]]()
    private val studentMap = new mutable.HashMap[String, Map[String, String]]()

    override def snapshotState(context: FunctionSnapshotContext): Unit = {
      // Replace the previous snapshot with the current contents of the maps,
      // so each list state holds exactly one element and does not grow without bound.
      classListState.clear()
      classListState.add(classMap.toMap)
      studentListState.clear()
      studentListState.add(studentMap.toMap)
    }

    override def initializeState(context: FunctionInitializationContext): Unit = {
      val classDescriptor = new ListStateDescriptor[Map[String, Map[String, String]]]("class-info", TypeInformation.of(new TypeHint[Map[String, Map[String, String]]]() {}))
      classListState = context.getOperatorStateStore.getListState(classDescriptor)
      val studentDescriptor = new ListStateDescriptor[Map[String, Map[String, String]]]("student-info", TypeInformation.of(new TypeHint[Map[String, Map[String, String]]]() {}))
      studentListState = context.getOperatorStateStore.getListState(studentDescriptor)
      // On recovery, rebuild the in-memory maps from the restored operator state.
      if (context.isRestored) {
        val classIt = classListState.get().iterator()
        while (classIt.hasNext) classMap ++= classIt.next()
        val studentIt = studentListState.get().iterator()
        while (studentIt.hasNext) studentMap ++= studentIt.next()
      }
    }

    override def map(value: String): (String, String) = {
      val arr = value.split(",")
      val table = arr(0)
      var student_id = ""
      var money = ""
      if ("class".equals(table)) {
        // class,1001,piano,10000 -> remember the course so later students can join on it
        val class_id = arr(1)
        val class_name = arr(2)
        classMap += (class_id -> Map("class_name" -> class_name, "money" -> arr(3)))
      } else if ("user".equals(table)) {
        // user,user01,zhangsan,1001 -> look up the course price by class_id
        student_id = arr(1)
        val student_name = arr(2)
        val class_id = arr(3)
        money = classMap.get(class_id).flatMap(_.get("money")).getOrElse("")
        studentMap += (student_id -> Map(
          "student_name" -> student_name,
          "class_id" -> class_id,
          "money" -> money))
      }
      if (StringUtils.isNullOrWhitespaceOnly(student_id)) ("", "") else (student_id, money)
    }
  }
}

 This example uses RocksDB to hold very large state, with incremental checkpointing enabled for better performance. The printed results can be checked in Flink's .out log to verify that the output is correct.
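 As a quick smoke test (assuming KAFKA_CONSUMER_TOPIC in KafkaConsts points at a topic you can produce to), feeding the two sample records should behave like this:

class,1001,piano,10000      -> stores the course; the job prints (,) since there is no student yet
user,user01,zhangsan,1001   -> joins on class_id 1001; the job prints (user01,10000)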
