Spark Streaming stateful computation: how updateStateByKey and mapWithState differ

冬瓜螺旋雪碧

Category: Spark. Tags: SparkStreaming, stateful computation, updateStateByKey, mapWithState

Article link: https://blog.csdn.net/kzw11/article/details/102781861

Spark Streaming stateful computation (updateStateByKey, mapWithState): pros and cons

The updateStateByKey operator

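updateStateByKey returns a stateful DStream in which the state of every key can be updated. A user-supplied function combines a key's previous state with the new values that arrived for that key, which lets you maintain arbitrary state per key. Put simply, the result of each earlier computation is recorded, and each new batch is used to update that recorded state.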

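In other words, it tracks the state of all keys globally: even when a batch contains no input, it still returns the previous state of every key.
Drawback: the update function runs over every key in the state on every batch, and with a large number of keys the data that has to be checkpointed takes up a lot of storage, so it becomes inefficient.
Using it takes two steps: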

Define the state - The state can be an arbitrary data type.
Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
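Because the state can be an arbitrary data type, the two steps are not limited to integer counters. The following is a minimal sketch in plain Scala (no Spark dependency) of a state update function that keeps a running average per key. The object name `RunningAverageState`, the type alias `AvgState`, and the function `updateAverage` are hypothetical names chosen for illustration; only the function shape (new values plus optional previous state in, optional new state out) is what updateStateByKey expects.

```scala
// Plain Scala sketch: only the shape of the update function is shown here.
object RunningAverageState {
  // Step 1: define the state. Hypothetical choice: (values seen, sum of values).
  type AvgState = (Long, Double)

  // Step 2: define the state update function.
  // (new values in this batch, previous state) => new state
  def updateAverage(newValues: Seq[Double], oldState: Option[AvgState]): Option[AvgState] = {
    val (oldCount, oldSum) = oldState.getOrElse((0L, 0.0))
    Some((oldCount + newValues.size, oldSum + newValues.sum))
  }

  def main(args: Array[String]): Unit = {
    // Simulate two batches for a single key.
    val afterBatch1 = updateAverage(Seq(2.0, 4.0), None)
    val afterBatch2 = updateAverage(Seq(6.0), afterBatch1)
    afterBatch2.foreach { case (count, sum) =>
      println(s"count=$count avg=${sum / count}")
    }
  }
}
```

In a Spark job, a function of this shape would presumably be passed as `pairs.updateStateByKey[(Long, Double)](updateAverage _)` on a `DStream[(String, Double)]`, where `pairs` is an assumed name for the keyed input stream.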

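import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingState02 {

  val checkPointPath = "." // save checkpoints to the current directory

  def main(args: Array[String]): Unit = {

    /*
    Recreate the StreamingContext from checkpoint data, or create a new one.
    If checkpoint data exists under `checkPointPath`, the StreamingContext is rebuilt from it.
    If it does not exist, the StreamingContext is created by calling the supplied function.
     */
    val ssc = StreamingContext.getOrCreate(checkPointPath, createStreamingContext _)

    ssc.start()
    ssc.awaitTermination()
  }


  // Builds a fresh StreamingContext; only called when no checkpoint data is found.
  def createStreamingContext(): StreamingContext = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getSimpleName)
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint(checkPointPath)

    val lines = ssc.socketTextStream("hadoop001", 9999)
    // Split each line on commas, pair every word with 1, and keep a running count per word.
    val wc = lines.flatMap(_.split(",")).map((_, 1)).updateStateByKey[Int](updateFunction _)

    wc.print()
    ssc
  }


  /**
    *
    * @param newValues  values for the key in the current batch;
    *         a key may receive several new values, hence a Seq
    *
    * @param oldValue   state accumulated over the previous batches;
    *         the key may or may not have state yet, hence an Option
    * @return the updated state for the key
    */
  def updateFunction(newValues: Seq[Int], oldValue: Option[Int]): Option[Int] = {
    val newCount = newValues.sum + oldValue.getOrElse(0)

    Some(newCount)
  }

}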

Whether or not the program is restarted, its state is preserved intact:

-------------------------------------------
Time: 1572245440000 ms
-------------------------------------------
(d,7)
(b,6)
(f,4)
(e,4)
(a,6)
(g,7)
(c,6)
-------------------------------------------
Time: 1572245445000 ms
-------------------------------------------
(d,7)
(b,6)
(f,4)
(e,4)
(a,6)
(g,7)
(c,6)
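The two identical batches above follow from how the update function is applied: it runs for every key that already has state, even when the batch brings no new values for it. The following is a rough plain-Scala simulation of that behavior, not Spark's actual implementation; `UpdateStateByKeySketch` and `runBatch` are hypothetical names, and an in-memory `Map` stands in for the checkpointed state.

```scala
// Plain Scala simulation of the per-batch behavior described above.
object UpdateStateByKeySketch {
  // Same logic as the word-count update function in the article.
  def updateFunction(newValues: Seq[Int], oldValue: Option[Int]): Option[Int] =
    Some(newValues.sum + oldValue.getOrElse(0))

  // One simulated batch: the update function is called for every key that has
  // existing state or new values, and the whole state is the batch output.
  def runBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = (state.keySet ++ grouped.keySet).toSeq
    keys.flatMap { k =>
      updateFunction(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val s1 = runBatch(Map.empty, Seq("a" -> 1, "b" -> 1, "a" -> 1))
    val s2 = runBatch(s1, Seq.empty) // empty batch: every key is still visited
    println(s1)
    println(s2)
  }
}
```

The cost of each simulated batch grows with the total number of keys rather than with the size of the batch, which is the drawback the article points out.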

mapWithState (recommended in production)

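mapWithState is also used to track the state of keys globally, but when no data arrives for a key it does not return that key's previous state, so it behaves somewhat incrementally. It is more efficient and is the recommended choice in production.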

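Advantage: we only have to deal with the keys that actually changed; keys with no new input are not returned. As a result, even with a large amount of data, checkpointing does not take up as much storage as it does with updateStateByKey.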
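The incremental behavior can be illustrated with a small plain-Scala simulation. This is a sketch of the semantics only, not of Spark's internals: `MapWithStateSketch` and `runBatch` are hypothetical names, and an in-memory `Map` stands in for the managed state. The mapping logic runs once per incoming record, so only keys present in the batch are emitted.

```scala
// Plain Scala simulation of the incremental behavior described above.
object MapWithStateSketch {
  // One simulated batch. Returns (updated state, records emitted for this batch).
  def runBatch(state: Map[String, Int],
               batch: Seq[(String, Int)]): (Map[String, Int], Seq[(String, Int)]) = {
    val init: (Map[String, Int], Seq[(String, Int)]) = (state, Seq.empty)
    batch.foldLeft(init) { case ((st, out), (key, value)) =>
      val sum = value + st.getOrElse(key, 0) // combine new value with old state
      (st.updated(key, sum), out :+ (key -> sum)) // update state, emit changed key
    }
  }

  def main(args: Array[String]): Unit = {
    val initial = Map("hello" -> 1, "world" -> 1)
    val (s1, out1) = runBatch(initial, Seq("hello" -> 1, "spark" -> 1))
    val (s2, out2) = runBatch(s1, Seq.empty) // empty batch: nothing is emitted
    println(out1)
    println(out2)
    println(s2)
  }
}
```

The first element of the returned pair plays the role of the full state, which in the real API is what `stateSnapshots()` exposes; the second element corresponds to the mapped stream, which only contains keys that received data.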

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object SparkStreamingMapState {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getSimpleName)
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint(".")

    // RDD holding the initial state
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))



    val lines = ssc.socketTextStream("tool", 9999)
    val wc = lines.flatMap(_.split(",")).map((_, 1))


    val spec = StateSpec.function(mappingFunction _)//.numPartitions(ssc.sparkContext.defaultParallelism)
    // stateSnapshots() exposes the state of all keys in every batch;
    // the stream returned by mapWithState itself only carries keys that received data.
    val stateDstream = wc.mapWithState(spec.initialState(initialRDD)).stateSnapshots()

    stateDstream.print()


    ssc.start()
    ssc.awaitTermination()
  }


  /**
    *
    * @param key   the key being processed
    * @param value the current value
    * @param state the historical state for the key
    * @return the record emitted into the stream returned by mapWithState
    */
  def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
    // Use state.exists(), state.get(), state.update() and state.remove()
    // to manage state, and return the necessary string
    val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
    // update the state
    state.update(sum)
    // return the mapped data
    Some(key)
  }
}