Lesson 14: Spark Streaming Source Code Walkthrough — State Management with updateStateByKey and mapWithState

Original post: May 31, 2016, 16:51:32

Background:

  Spark Streaming divides its work into Jobs by Batch Duration. Often, however, we need to compute over the past day or even the past week of data, which inevitably requires state management. Since every Batch Duration produces a new Job, and each Job operates only on the RDDs of that batch, the question becomes: how do we maintain state across batches? This is where updateStateByKey and mapWithState come in.
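A minimal sketch of a stateful word count built on updateStateByKey may make the problem concrete; the socket source, checkpoint path and batch interval below are illustrative assumptions, not part of the original article:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // State has to survive across batches, so a checkpoint directory is required
    ssc.checkpoint("/tmp/stateful-wordcount-checkpoint")

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // newValues: this batch's counts for a key; runningCount: the accumulated state
    val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, runningCount) => Some(newValues.sum + runningCount.getOrElse(0))

    val totals = pairs.updateStateByKey[Int](updateFunc)   // DStream[(String, Int)]
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}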

Source code analysis:

1. Neither updateStateByKey nor mapWithState is defined on DStream itself; both become available through an implicit conversion, as illustrated after the snippet below.

object DStream {

  // `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to
  // `import StreamingContext._` to enable it. Now we move it here to make the compiler find
  // it automatically. However, we still keep the old function in StreamingContext for backward
  // compatibility and forward to the following function directly.

  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }
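In other words, a DStream of pairs such as `pairs` in the sketch above picks up updateStateByKey through this implicit. A rough illustration of the equivalent explicit form, reusing `pairs` and `updateFunc` from that sketch:

import org.apache.spark.streaming.dstream.DStream

// Implicit form, resolved automatically since Spark 1.3
val totalsImplicit = pairs.updateStateByKey[Int](updateFunc)

// Equivalent explicit form, spelling out the conversion the compiler inserts
val totalsExplicit = DStream.toPairDStreamFunctions(pairs).updateStateByKey[Int](updateFunc)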

updateStateByKey 

1. In PairDStreamFunctions, updateStateByKey is implemented as follows:

updateStateByKey applies the supplied updateFunc to the previously accumulated state of each key together with the new values for that key; the method returns a new DStream of (key, state) pairs.

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 *
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}
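As the scaladoc above notes, returning None removes the key's state. A hedged sketch of an update function that drops a key once it stops receiving data (the expiry rule is illustrative, reusing the word-count types from the earlier sketch):

// Hypothetical update function: drop a key from the state when it gets no new values in a batch
val expireIfIdle: (Seq[Int], Option[Int]) => Option[Int] = (newValues, state) =>
  if (newValues.isEmpty) None                       // state for this key is eliminated
  else Some(newValues.sum + state.getOrElse(0))     // otherwise keep accumulating

// pairs.updateStateByKey[Int](expireIfIdle)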

2. defaultPartitioner creates a HashPartitioner with Spark's default parallelism:

private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
  new HashPartitioner(numPartitions)
}
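An explicit partitioner can also be passed instead of the default one. For example, reusing `pairs` and `updateFunc` from the earlier sketch (the partition count is illustrative):

import org.apache.spark.HashPartitioner

// Same behaviour as the default, but with an explicit number of partitions
val totalsPartitioned = pairs.updateStateByKey[Int](updateFunc, new HashPartitioner(20))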

3. The partitioner controls the partitioning of each generated RDD:

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 *
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
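The wrapping above turns the per-key function into an iterator-based one. The same running count written directly against that overload might look like this (a sketch under the same assumptions as the earlier word-count example):

import org.apache.spark.HashPartitioner

// Iterator-based form: each tuple is (key, new values in this batch, previous state)
val iterUpdateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) =>
  it.map { case (word, newValues, state) =>
    (word, newValues.sum + state.getOrElse(0))
  }

val totalsIter =
  pairs.updateStateByKey[Int](iterUpdateFunc, new HashPartitioner(20), rememberPartitioner = true)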

4. rememberPartitioner is passed as true by default:

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 *
 * @param updateFunc State update function. Note, that this function may generate a different
 *                   tuple with a different key than the input key. Therefore keys may be removed
 *                   or added in this way. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
  new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}

5. StateDStream persists its RDDs at StorageLevel.MEMORY_ONLY_SER (serialized in memory), because the accumulated state data can be very large:

private[streaming]
class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD : Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  super.persist(StorageLevel.MEMORY_ONLY_SER)

The source of computeUsingPreviousRDD is as follows:

private [this] def computeUsingPreviousRDD(
  parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]) = {
  // Define the function for the mapPartition operation on cogrouped RDD;
  // first map the cogrouped tuple to tuples of required type,
  // and then apply the update function
  val updateFuncLocal = updateFunc
  val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
    val i = iterator.map(t => {
      val itr = t._2._2.iterator
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      (t._1, t._2._1.toSeq, headOption)
    })
    updateFuncLocal(i)
  }
  // On every batch the cogroup walks over every partition of prevStateRDD,
  // i.e. the entire historical state is traversed again
  val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
  val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
  Some(stateRDD)
}
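To make the cogroup-plus-mapPartitions pattern concrete, here is a standalone sketch with plain RDDs; the names and values are illustrative and not part of the Spark source:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Standalone illustration of the cogroup pattern used by computeUsingPreviousRDD
val sc = new SparkContext(new SparkConf().setAppName("cogroup-demo").setMaster("local[2]"))
val parentRDD    = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))   // new values in this batch
val prevStateRDD = sc.parallelize(Seq(("a", 10), ("c", 7)))            // state from the last batch

// Every key of prevStateRDD appears in the cogrouped result, even when it has no new data
val cogrouped = parentRDD.cogroup(prevStateRDD, new HashPartitioner(2))
val newState = cogrouped.mapValues { case (newValues, oldState) =>
  newValues.sum + oldState.headOption.getOrElse(0)
}
newState.collect().foreach(println)   // (a,13), (b,3), (c,7) in some order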

Because the entire historical state is recomputed in this way on every batch, updateStateByKey is not recommended when the amount of state data is very large.

mapWithState: 

1. mapWithState returns a MapWithStateDStream. Historical state is maintained and updated per key, using a user-supplied function applied to the key-value data:

/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using [[StateSpec]] class. The state data is accessible
 * as a parameter of type [[State]] in the mapping function.
 *
 * Example of using `mapWithState`:
 * {{{
 *    // A mapping function that maintains an integer state and return a String
 *    // The state here can be seen as a table that records all of the historical state
 *    // being maintained for each key.
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 * }}}
 *
 * @param spec         Specification of this transformation
 * @tparam StateType   Class type of the state data
 * @tparam MappedType  Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    // StateSpecImpl wraps the user-facing StateSpec operations
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
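A hedged usage sketch, reusing `ssc` and `pairs` from the word-count example above; the mapping function, checkpoint path and partition count are illustrative:

import org.apache.spark.streaming.{State, StateSpec}

// Keep a running total per word and emit (word, total) for every record in the batch
def mappingFunc(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val total = state.getOption.getOrElse(0) + one.getOrElse(0)
  state.update(total)
  (word, total)
}

ssc.checkpoint("/tmp/mapwithstate-checkpoint")   // mapWithState also requires checkpointing
val spec = StateSpec.function(mappingFunc _).numPartitions(10)
val stateDStream = pairs.mapWithState(spec)      // emits one (word, total) per input record
stateDStream.print()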

2. The source of MapWithStateDStream is as follows:

/**
 * :: Experimental ::
 * DStream representing the stream of data generated by `mapWithState` operation on a
 * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pairDStream]].
 * Additionally, it also gives access to the stream of state snapshots, that is, the state data of
 * all keys after a batch has updated them.
 *
 * @tparam KeyType Class of the key
 * @tparam ValueType Class of the value
 * @tparam StateType Class of the state data
 * @tparam MappedType Class of the mapped data
 */
@Experimental
sealed abstract class MapWithStateDStream[KeyType, ValueType, StateType, MappedType: ClassTag](
    ssc: StreamingContext) extends DStream[MappedType](ssc) {

  /** Return a pair DStream where each RDD is the snapshot of the state of all the keys. */
  def stateSnapshots(): DStream[(KeyType, StateType)]
}
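stateSnapshots gives the complete key-to-state table after each batch. Continuing the sketch above:

// One (word, runningTotal) pair per key currently held in the state, emitted every batch
stateDStream.stateSnapshots().print()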

/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  // The actual computation is delegated to InternalMapWithStateDStream
  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }

StateSpecImpl

/** Internal implementation of [[org.apache.spark.streaming.StateSpec]] interface. */
private[streaming]
case class StateSpecImpl[K, V, S, T](
    function: (Time, K, Option[V], State[S]) => Option[T]) extends StateSpec[K, V, S, T] {

  require(function != null)

  @volatile private var partitioner: Partitioner = null
  @volatile private var initialStateRDD: RDD[(K, S)] = null
  @volatile private var timeoutInterval: Duration = null

  override def initialState(rdd: RDD[(K, S)]): this.type = {
    this.initialStateRDD = rdd
    this
  }

  override def initialState(javaPairRDD: JavaPairRDD[K, S]): this.type = {
    this.initialStateRDD = javaPairRDD.rdd
    this
  }
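These fields are populated through StateSpec's fluent setters. A hedged sketch of configuring an initial state and a timeout, building on the earlier word-count objects; the values, and the timeout-aware variant of the mapping function, are illustrative:

import org.apache.spark.streaming.{Minutes, State, StateSpec}

// A timeout-aware mapping function: a state that is timing out must not be updated, only read
def mappingFuncWithTimeout(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val total = state.getOption.getOrElse(0) + one.getOrElse(0)
  if (!state.isTimingOut()) state.update(total)
  (word, total)
}

// Pre-load existing totals and drop keys that receive no data for 30 minutes
val initialRDD = ssc.sparkContext.parallelize(Seq(("hello", 100), ("world", 42)))
val specWithOptions = StateSpec
  .function(mappingFuncWithTimeout _)
  .initialState(initialRDD)
  .numPartitions(10)
  .timeout(Minutes(30))

val statefulWithTimeout = pairs.mapWithState(specWithOptions)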

 

3. The historical state is updated by InternalMapWithStateDStream:

/**
 * A DStream that allows per-key state to be maintained, and arbitrary records to be generated
 * based on updates to the state. This is the main DStream that implements the `mapWithState`
 * operation on DStreams.
 *
 * @param parent Parent (key, value) stream that is the source
 * @param spec Specifications of the mapWithState operation
 * @tparam K   Key type
 * @tparam V   Value type
 * @tparam S   Type of the state maintained
 * @tparam E   Type of the mapped data
 */
private[streaming]
class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])
  extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {

  // The in-memory state data structure is updated continuously, so it is persisted in memory
  persist(StorageLevel.MEMORY_ONLY)

  private val partitioner = spec.getPartitioner().getOrElse(
    new HashPartitioner(ssc.sc.defaultParallelism))

  private val mappingFunction = spec.getFunction()

  override def slideDuration: Duration = parent.slideDuration

  override def dependencies: List[DStream[_]] = List(parent)

  /** Enable automatic checkpointing */
  override val mustCheckpoint = true

  /** Override the default checkpoint duration */
  override def initialize(time: Time): Unit = {
    if (checkpointDuration == null) {
      checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
    }
    super.initialize(time)
  }
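Because mustCheckpoint is true, a checkpoint directory must be set on the StreamingContext before mapWithState is used, and the checkpoint interval defaults to the slide duration multiplied by DEFAULT_CHECKPOINT_DURATION_MULTIPLIER. If needed it can be overridden on the resulting stream; the interval below is illustrative:

// Without ssc.checkpoint(...) the job fails at start-up when mapWithState is used.
// An explicit checkpoint interval can be set on the resulting DStream:
stateDStream.checkpoint(Seconds(60))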

4. InternalMapWithStateDStream.compute generates the state RDD for each batch:

  /** Method that generates an RDD for the given time */
  override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
    // Get the previous state or create a new empty state RDD
    val prevStateRDD = getOrCompute(validTime - slideDuration) match {
      case Some(rdd) =>
        if (rdd.partitioner != Some(partitioner)) {
          // If the RDD is not partitioned the right way, let us repartition it using the
          // partition index as the key. This is to ensure that state RDD is always partitioned
          // before creating another state RDD using it
          MapWithStateRDD.createFromRDD[K, V, S, E](
            rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
        } else {
          rdd
        }
      case None =>
        MapWithStateRDD.createFromPairRDD[K, V, S, E](
          spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
          partitioner,
          validTime
        )
    }

    // Compute the new state RDD with previous state RDD and partitioned data RDD
    // Even if there is no data RDD, use an empty one to create a new state RDD
    // The data RDD for the current batch interval is created here
    val dataRDD = parent.getOrCompute(validTime).getOrElse {
      context.sparkContext.emptyRDD[(K, V)]
    }
    val partitionedDataRDD = dataRDD.partitionBy(partitioner)
    val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
      (validTime - interval).milliseconds
    }
    Some(new MapWithStateRDD(
      prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
  }
}
