An introduction to the updateStateByKey operator in Spark Streaming

Preface

Spark Streaming operations fall into two groups: stateful and stateless.
Stateless operations treat every batch in isolation; each batch starts from scratch.
Stateful operations provide a pipeline between batches, and the information carried through that pipeline is the historical state.
Common stateful operators include updateStateByKey, mapWithState, and the window operations.
updateStateByKey and mapWithState are quite similar. The difference is that updateStateByKey runs the update logic for every key held in state on every batch, whether or not the batch contains new data for that key, whereas mapWithState only invokes its function for keys that actually appear in the batch.
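To make the contrast concrete, here is a minimal mapWithState sketch (the `pairs` DStream and the variable names are illustrative assumptions, not part of the demos below); its mapping function is only called for keys that receive data in the current batch (and for keys that time out, if a timeout is configured):

import org.apache.spark.streaming.{State, StateSpec}

// Called only for keys present in the current batch, unlike updateStateByKey.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)   // persist the new running total for this key
  (word, sum)         // emit the (word, runningTotal) pair for this batch
}

// pairs is assumed to be a DStream[(String, Int)], e.g. the (word, 1) stream used below:
// pairs.mapWithState(StateSpec.function(mappingFunc)).print()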
The following sections walk through the different forms of updateStateByKey:

1. The basic form

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S]
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner())
  }

A word-count demonstration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo1 {
  // Add this batch's values for a key to the key's previous state (0 if there is none).
  def updateFunc(one: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = one.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    // A checkpoint directory is mandatory for stateful operations.
    ssc.checkpoint("file:///E:/chk")

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .updateStateByKey(updateFunc)
        .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
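As an illustration (assuming the socket is fed with `nc -lk 8888` on the host `mini`), the counts accumulate across batches, and keys already in state are re-emitted even when a batch brings no new data for them:

// Batch 1 input: "hello world"   -> (hello,1) (world,1)
// Batch 2 input: "hello"         -> (hello,2) (world,1)
// Batch 3 input: (nothing)       -> (hello,2) (world,1)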

2. updateStateByKey with an explicit number of partitions

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @param numPartitions Number of partitions of each RDD in the new DStream.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      numPartitions: Int
    ): DStream[(K, S)] = ssc.withScope {
    updateStateByKey(updateFunc, defaultPartitioner(numPartitions))
  }

Demonstration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo2 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

    // Same update logic as before, written as a function value.
    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, 1) // state RDDs will have a single partition
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
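If you want to confirm the partition count of the state DStream, one possible (hypothetical) check is to inspect each batch's RDD, where `stateDStream` stands for the result of the updateStateByKey call above:

// Prints the number of partitions of every generated state RDD.
stateDStream.foreachRDD(rdd => println(s"state partitions: ${rdd.getNumPartitions}"))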

3. updateStateByKey with a custom partitioner

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * In every batch the updateFunc will be called for each state even if there are no new values.
 * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}

Demonstration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo3 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
      val sum = one.sum + state.getOrElse(0)
      Some(sum)
    }

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4)) // state partitioned by the custom partitioner
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}


import org.apache.spark.Partitioner

// A simple hash partitioner: key.hashCode modulo numPartitions, shifted into [0, numPartitions).
class MyPartitionerDemo(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val code = key.hashCode % numPartitions
    if (code < 0) {
      code + numPartitions // hashCode can be negative; move it into the valid range
    } else {
      code
    }
  }
}
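In practice this reproduces the behavior of Spark's built-in HashPartitioner (non-negative key.hashCode modulo the partition count), so a custom Partitioner is only worth writing when you need different routing logic, for example keeping related keys on the same partition.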

4. updateStateByKey with control over whether the partitioner is remembered

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of each key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * [[org.apache.spark.Partitioner]] is used to control the partitioning of each RDD.
   * @param updateFunc State update function. Note, that this function may generate a different
   *                   tuple with a different key than the input key. Therefore keys may be removed
   *                   or added in this way. It is up to the developer to decide whether to
   *                   remember the partitioner despite the key being changed.
   * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
   *                    DStream
   * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
      partitioner: Partitioner,
      rememberPartitioner: Boolean): DStream[(K, S)] = ssc.withScope {
    val cleanedFunc = ssc.sc.clean(updateFunc)
    val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
      cleanedFunc(it)
    }
    new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, None)
  }

Note that this overload's updateFunc takes an iterator as its argument.

In other words, the function is invoked once per partition and processes all of that partition's (key, new values, previous state) tuples in a single call, whereas the earlier overloads wrap your function so that it is applied once per key, with that key's new values collected into a Seq.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo4 {

  // Per-key update logic, reused by the iterator-based updateFunc below.
  def MyFunction(key: String, value: Seq[Int], state: Option[Int]): Option[Int] = {
    val sum = value.sum + state.getOrElse(0)
    Some(sum)
  }

  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

    // Adapt the per-key function to the per-partition iterator signature.
    val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      iterator.flatMap(it => MyFunction(it._1, it._2, it._3).map(s => (it._1, s)))
    }

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
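A note on the flag: with rememberPartitioner = false the generated state RDDs do not record the partitioner object, so a downstream key-based operation on the same keys will generally have to reshuffle; passing true lets Spark reuse the existing partitioning.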

5. updateStateByKey with an initial state

  /**
   * Return a new "state" DStream where the state for each key is updated by applying
   * the given function on the previous state of the key and the new values of the key.
   * In every batch the updateFunc will be called for each state even if there are no new values.
   * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
   * @param updateFunc State update function. If `this` function returns None, then
   *                   corresponding state key-value pair will be eliminated.
   * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
   *                    DStream.
   * @param initialRDD initial state value of each key.
   * @tparam S State type
   */
  def updateStateByKey[S: ClassTag](
      updateFunc: (Seq[V], Option[S]) => Option[S],
      partitioner: Partitioner,
      initialRDD: RDD[(K, S)]
    ): DStream[(K, S)] = ssc.withScope {
    val cleanedUpdateF = sparkContext.clean(updateFunc)
    val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
      iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
    }
    updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
  }

Code demonstration. Note that the demo below actually calls a different overload: the four-argument variant that takes the iterator-based updateFunc, a partitioner, rememberPartitioner, and initialRDD, rather than the three-argument one whose Scaladoc is quoted above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyDemo5 {
  def main(args: Array[String]): Unit = {
    val sparkconf = new SparkConf().setAppName(this.getClass.getSimpleName).setMaster("local[4]")
    val ssc = new StreamingContext(sparkconf, Seconds(5))
    ssc.checkpoint("file:///E:/chk")

//    // The simpler (Seq, Option) form would fit the three-argument overload quoted above:
//    val updateFunc = (one: Seq[Int], state: Option[Int]) => {
//      val sum = one.sum + state.getOrElse(0)
//      Some(sum)
//    }

    // Iterator-based form required by the four-argument overload used below.
    val updateFunc = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
      iter.flatMap { case (key, one, state) =>
        val sum = state.getOrElse(0) + one.sum
        Option(sum).map(s => (key, s))
      }
    }

    // Every key in this RDD starts with the given state before the first batch is processed.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 10)))

    val socketDStream = ssc.socketTextStream("mini", 8888)
    socketDStream.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .updateStateByKey(updateFunc, new MyPartitionerDemo(4), false, initialRDD)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
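With this initial state, the key "hello" starts from 10 rather than 0, so the first batch that contains it already reflects the pre-loaded value (illustrative input and output, assuming the same socket source as before):

// Batch 1 input: "hello spark"  -> (hello,11) (spark,1)
// Batch 2 input: "hello"        -> (hello,12) (spark,1)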

Summary

Pick the overload of updateStateByKey that matches your needs: the basic form, the form with an explicit partition count, the form with a custom partitioner, the iterator-based form that also controls whether the partitioner is remembered, and the form that accepts an initial state RDD.
