An introduction to the updateStateByKey operator in Spark Streaming

Preface

Spark Streaming operations fall into two groups: stateful and stateless.
Stateless operations treat every batch in isolation; each batch starts from scratch.
Stateful operations provide a pipeline between batches, and the information carried through that pipeline is the historical state.
Common stateful operators include updateStateByKey, mapWithState, and the window operations.
updateStateByKey and mapWithState are quite similar. The difference is that updateStateByKey runs the update logic for every key held in state on every batch, whether or not the batch contains new data for that key, whereas mapWithState only invokes its function for keys that actually appear in the batch.
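To make the contrast concrete, here is a minimal mapWithState sketch (the `pairs` DStream and the variable names are illustrative assumptions, not part of the demos below); its mapping function is only called for keys that receive data in the current batch (and for keys that time out, if a timeout is configured):

import org.apache.spark.streaming.{State, StateSpec}

// Called only for keys present in the current batch, unlike updateStateByKey.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // persist the new running total for this key
  (word, sum)         // emit the (word, runningTotal) pair for this batch
}

// pairs is assumed to be a DStream[(String, Int)], e.g. the (word, 1) stream used below:
// pairs.mapWithState(StateSpec.function(mappingFunc)).print()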
The following sections walk through the different forms of updateStateByKey:

1. The basic form

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S]
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner())
  }

A word-count demonstration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo1 {
  // Add this batch's values for a key to the key's previous state (0 if there is none).
  def updateFunc(one: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = one.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    // A checkpoint directory is mandatory for stateful operations.
    ssc.checkpoint("file:///E:/chk")

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .updateStateByKey(updateFunc)
        .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
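As an illustration (assuming the socket is fed with `nc -lk 8888` on the host `mini`), the counts accumulate across batches, and keys already in state are re-emitted even when a batch brings no new data for them:

// Batch 1 input: "hello world"   -> (hello,1) (world,1)
// Batch 2 input: "hello"         -> (hello,2) (world,1)
// Batch 3 input: (nothing)       -> (hello,2) (world,1)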

2. updateStateByKey with an explicit number of partitions

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @param numPartitions Number of partitions of each RDD in the new DStream.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      numPartitions: Int
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
  }

Demonstration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo2 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

    // Same update logic as before, written as a function value.
    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, 1) // state RDDs will have a single partition
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
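If you want to confirm the partition count of the state DStream, one possible (hypothetical) check is to inspect each batch's RDD, where `stateDStream` stands for the result of the updateStateByKey call above:

// Prints the number of partitions of every generated state RDD.
stateDStream.foreachRDD(rdd => println(s"state partitions: ${rdd.getNumPartitions}"))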

3. updateStateByKey with a custom partitioner

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * In every batch the updateFunc will be called for each state even if there are no new values.
 * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}

Demonstration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo3 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4)) // state partitioned by the custom partitioner
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}


import org.apache.spark.Partitioner

// A simple hash partitioner: key.hashCode modulo numPartitions, shifted into [0, numPartitions).
class MyPartitionerDemo(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    if (code < 0) {
      code + numPartitions // hashCode can be negative; move it into the valid range
    } else {
      code
    }
  }
}
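In practice this reproduces the behavior of Spark's built-in HashPartitioner (non-negative key.hashCode modulo the partition count), so a custom Partitioner is only worth writing when you need different routing logic, for example keeping related keys on the same partition.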

4. updateStateByKey with control over whether the partitioner is remembered

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
   * @param updateFunc State update function. Note, that this function may generate a different
   *                   tuple with a different key than the input key. Therefore keys may be removed
   *                   or added in this way. It is up to the developer to decide whether to
   *                   remember the partitioner despite the key being changed.
   * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
   *                    DStream
   * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
    val cleanedFunc = ssc.sc.clean(updateFunc)
    val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
      cleanedFunc(it)
    }
    new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
  }

Note that this overload's updateFunc takes an iterator as its argument.

In other words, the function is invoked once per partition and processes all of that partition's (key, new values, previous state) tuples in a single call, whereas the earlier overloads wrap your function so that it is applied once per key, with that key's new values collected into a Seq.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo4 {

  // Per-key update logic, reused by the iterator-based updateFunc below.
  def MyFunction(key: String, value: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = value.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

    // Adapt the per-key function to the per-partition iterator signature.
    val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(it => MyFunction(it._1, it._2, it._3).map(s => (it._1, s)))
    }

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
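A note on the flag: with rememberPartitioner = false the generated state RDDs do not record the partitioner object, so a downstream key-based operation on the same keys will generally have to reshuffle; passing true lets Spark reuse the existing partitioning.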

5. updateStateByKey with an initial state

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of the key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
   *                    DStream.
   * @param initialRDD initial state value of each key.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      partitioner: Partitioner,
      initialRDD: RDD[(K, S)]
    ): DStream[(K, S)] = ssc.withScope {
    val cleanedUpdateF = sparkContext.clean(updateFunc)
    val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
      iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
    }
    updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
  }

Code demonstration. Note that the demo below actually calls a different overload: the four-argument variant that takes the iterator-based updateFunc, a partitioner, rememberPartitioner, and initialRDD, rather than the three-argument one whose Scaladoc is quoted above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo5 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

//    // The simpler (Seq, Option) form would fit the three-argument overload quoted above:
//    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
//      val sum = one.sum + state.getOrElse(0)
//      Some(sum)
//    }

    // Iterator-based form required by the four-argument overload used below.
    val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
      iter.flatMap { case (key, one, state) =>
        val sum = state.getOrElse(0) + one.sum
        Option(sum).map(s => (key, s))
      }
    }

    // Every key in this RDD starts with the given state before the first batch is processed.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 10)))

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false, initialRDD)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
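With this initial state, the key "hello" starts from 10 rather than 0, so the first batch that contains it already reflects the pre-loaded value (illustrative input and output, assuming the same socket source as before):

// Batch 1 input: "hello spark"  -> (hello,11) (spark,1)
// Batch 2 input: "hello"        -> (hello,12) (spark,1)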

Summary

Pick the overload of updateStateByKey that matches your needs: the basic form, the form with an explicit partition count, the form with a custom partitioner, the iterator-based form that also controls whether the partitioner is remembered, and the form that accepts an initial state RDD.
