Lesson 14: Decoding Spark Streaming Source Code, State Management with updateStateByKey and mapWithState

Reposted on 2016-05-31 15:57:07


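Background:
Spark Streaming divides all of its work into Jobs by Batch Duration. Often, however, we need to compute over the past day or even the past week of data, which makes state management unavoidable. Every Batch Duration produces a Job, and each Job consists of RDDs, so the question is how to maintain state across batches. The updateStateByKey and mapWithState methods carry out this core step.
Source code analysis:
1. Neither updateStateByKey nor mapWithState is defined in DStream itself; both become available through an implicit conversion function.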

object DStream {

  // `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to
  // `import StreamingContext._` to enable it. Now we move it here to make the compiler find
  // it automatically. However, we still keep the old function in StreamingContext for backward
  // compatibility and forward to the following function directly.

  implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
      (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
    PairDStreamFunctions[K, V] = {
    new PairDStreamFunctions[K, V](stream)
  }
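
The implicit-conversion pattern can be tried without Spark. The following is a minimal sketch in plain Scala: `MiniStream`, `PairMiniStreamFunctions`, and `sumByKey` are hypothetical stand-ins invented for illustration, not Spark classes. It shows how a method that exists only on a wrapper class becomes callable on a stream of pairs because the conversion lives in the companion object.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for DStream; not a Spark class.
class MiniStream[T](val items: Seq[T])

// Hypothetical stand-in for PairDStreamFunctions: methods that only make
// sense for streams of key-value pairs live here.
class PairMiniStreamFunctions[K, V](self: MiniStream[(K, V)]) {
  def sumByKey(implicit num: Numeric[V]): Map[K, V] =
    self.items.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }
}

object MiniStream {
  // Defined in the companion object, so the compiler finds it without an import.
  implicit def toPairFunctions[K, V](s: MiniStream[(K, V)]): PairMiniStreamFunctions[K, V] =
    new PairMiniStreamFunctions[K, V](s)
}

object ImplicitDemo {
  // sumByKey is not a member of MiniStream; the call compiles only because
  // the implicit conversion wraps the stream in PairMiniStreamFunctions.
  def counts: Map[String, Int] =
    new MiniStream(Seq("a" -> 1, "b" -> 2, "a" -> 3)).sumByKey
}
```

Calling `sumByKey` on a `MiniStream[Int]` would not compile, which mirrors why updateStateByKey is only offered on a DStream of pairs.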

updateStateByKey:
1. updateStateByKey is implemented in PairDStreamFunctions as follows:
updateFunc updates the existing historical state with the new data. The method returns a DStream.

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * Hash partitioning is used to generate the RDDs with Spark's default number of partitions.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  // defaultPartitioner
  updateStateByKey(updateFunc, defaultPartitioner())
}
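
To make the shape of updateFunc concrete, here is a minimal sketch in plain Scala with no Spark dependency. `updateCount` has the same type as the `updateFunc` parameter above; the `run` helper is a hypothetical driver that only imitates how the state of a single key evolves from batch to batch.

```scala
object UpdateFuncSketch {
  // Same shape as updateFunc: the new values of one key in the current batch,
  // plus the previous state of that key, produce the new state.
  val updateCount: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, previous) => Some(previous.getOrElse(0) + newValues.sum)

  // Hypothetical driver: feeds the batches of a single key through the
  // update function, carrying the state forward.
  def run(batches: Seq[Seq[Int]]): Option[Int] =
    batches.foldLeft(Option.empty[Int])((state, batch) => updateCount(batch, state))
}
```

For example, `UpdateFuncSketch.run(Seq(Seq(1, 1), Seq(), Seq(1)))` carries the count through an empty batch and ends at `Some(3)`.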

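2.  defaultPartitioner: by default, a HashPartitioner is created with as many partitions as the SparkContext's defaultParallelism.

private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
  new HashPartitioner(numPartitions)
}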
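
Hash partitioning assigns a key to a partition from its hash code. The sketch below is a plain-Scala illustration of that idea (a non-negative modulo of `hashCode`, with null keys sent to partition 0); the object and function names are chosen for this example and it is not a copy of Spark's HashPartitioner.

```scala
object HashPartitionSketch {
  // hashCode may be negative, so normalize the remainder into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition index for a key; null keys go to partition 0.
  def partitionFor(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case k    => nonNegativeMod(k.hashCode, numPartitions)
  }
}
```

Because the index depends only on the key and the partition count, the same key always lands in the same partition, which is what lets the previous state and the new data for a key meet in one place.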

3.  The partitioner controls how each RDD is partitioned. This overload lifts the per-key updateFunc into a function over an iterator of (key, new values, previous state) triples.

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of the key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. If `this` function returns None, then
 *                   corresponding state key-value pair will be eliminated.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S],
    partitioner: Partitioner
  ): DStream[(K, S)] = ssc.withScope {
  val cleanedUpdateF = sparkContext.clean(updateFunc)
  val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
    iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
  }
  updateStateByKey(newUpdateFunc, partitioner, true)
}
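
The lifting step (wrapping a per-key function into one that works on an iterator of triples) can be sketched in plain Scala. `lift` follows the same flatMap-and-map shape as `newUpdateFunc` above; `sumOrDrop` and `demo` are hypothetical names added to show that a key whose update returns None disappears from the output.

```scala
object LiftUpdateSketch {
  // Lift a per-key update function to the iterator form.
  def lift[K, V, S](
      f: (Seq[V], Option[S]) => Option[S]
  ): Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator => iterator.flatMap { case (k, values, prev) => f(values, prev).map(s => (k, s)) }

  // Hypothetical rule: accumulate a sum, but drop the key when no new values arrive.
  val sumOrDrop: (Seq[Int], Option[Int]) => Option[Int] =
    (values, prev) => if (values.isEmpty) None else Some(prev.getOrElse(0) + values.sum)

  def demo: List[(String, Int)] = {
    val input = Iterator(
      ("a", Seq(1, 2), Option(10)),          // existing key with new values
      ("b", Seq.empty[Int], Option(5)),      // existing key, nothing new: dropped
      ("c", Seq(4), Option.empty[Int])       // brand-new key
    )
    lift[String, Int, Int](sumOrDrop)(input).toList
  }
}
```

Here key "b" is eliminated because its update returned None, matching the scaladoc note that a None result removes the key-value pair.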

4.  The overload above passes rememberPartitioner as true, so by default the generated RDDs remember the partitioner. This overload creates the StateDStream.

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. Note, that this function may generate a different
 *                   tuple with a different key than the input key. Therefore keys may be removed
 *                   or added in this way. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean
  ): DStream[(K, S)] = ssc.withScope {
   new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)
}

5.  In StateDStream, the StorageLevel is MEMORY_ONLY_SER: the state is kept in memory in serialized form (it is not written straight to disk), because the state data can be very large.

class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
    parent: DStream[(K, V)],
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    preservePartitioning: Boolean,
    initialRDD : Option[RDD[(K, S)]]
  ) extends DStream[(K, S)](parent.ssc) {

  super.persist(StorageLevel.MEMORY_ONLY_SER)

6.  The source of computeUsingPreviousRDD is as follows:

private [this] def computeUsingPreviousRDD (
  parentRDD : RDD[(K, V)], prevStateRDD : RDD[(K, S)]) = {
  // Define the function for the mapPartition operation on cogrouped RDD;
  // first map the cogrouped tuple to tuples of required type,
  // and then apply the update function
  val updateFuncLocal = updateFunc
  val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
    val i = iterator.map(t => {
      val itr = t._2._2.iterator
      val headOption = if (itr.hasNext) Some(itr.next()) else None
      (t._1, t._2._1.toSeq, headOption)
    })
    updateFuncLocal(i)
  }
  // On every batch, cogroup goes through all partitions of prevStateRDD, that is,
  // the state of every key, not only the keys present in the current batch.
  val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
  val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
  Some(stateRDD)
}

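Therefore, updateStateByKey is not recommended when the amount of state is large: its cost per batch grows with the total number of keys rather than with the size of the batch.
[Figure: updateStateByKey implementation; image not preserved]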
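
To see why the cost follows the total number of keys, the sketch below models the cogroup-based update in plain Scala. It is a simplified illustration, not Spark code: the state is a `Map`, the batch is a `Seq`, and `updateAll` is a hypothetical helper that also reports how many keys the update logic had to visit.

```scala
object FullScanSketch {
  // Returns the new state and the number of keys the update was applied to.
  def updateAll(
      state: Map[String, Int],
      batch: Seq[(String, Int)]
  ): (Map[String, Int], Int) = {
    val newValues: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Like cogroup: every key from either side appears, even with no new values.
    val allKeys = state.keySet ++ newValues.keySet
    val updated = allKeys.toSeq.map { k =>
      k -> (state.getOrElse(k, 0) + newValues.getOrElse(k, Seq.empty[Int]).sum)
    }
    (updated.toMap, allKeys.size)
  }
}
```

With three keys in the state and a batch that touches only one of them, the update still visits all three keys.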

mapWithState:
1. mapWithState returns a MapWithStateDStream. State is maintained and updated per key: a function is applied to the key-value data to maintain the state.

/**
 * :: Experimental ::
 * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
 * `this` stream, while maintaining some state data for each unique key. The mapping function
 * and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this
 * transformation can be specified using [[StateSpec]] class. The state data is accessible
 * as a parameter of type [[State]] in the mapping function.
 *
 * Example of using `mapWithState`:
 * {{{
 *    // A mapping function that maintains an integer state and return a String
 *    // (Note: the state can be viewed as a table that records all of the
 *    // historical state being maintained.)
 *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
 *      // Use state.exists(), state.get(), state.update() and state.remove()
 *      // to manage state, and return the necessary string
 *    }
 *
 *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
 *
 *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
 * }}}
 *
 * @param spec          Specification of this transformation
 * @tparam StateType    Class type of the state data
 * @tparam MappedType   Class type of the mapped data
 */
@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self,
    // StateSpecImpl encapsulates the StateSpec operations.
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
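
The mapping function left empty in the scaladoc example could be filled in along the following lines. This is a sketch in plain Scala so that it runs without Spark: `MiniState` is a hypothetical stand-in offering only the exists/get/update/remove calls named in the scaladoc (plus getOption for inspection), and the running-total logic is an assumption made for illustration.

```scala
// Hypothetical stand-in for Spark's State[S]; not the real class.
class MiniState[S](private var value: Option[S]) {
  def exists(): Boolean = value.isDefined
  def get(): S = value.get
  def update(s: S): Unit = { value = Some(s) }
  def remove(): Unit = { value = None }
  def getOption(): Option[S] = value
}

object MappingFunctionSketch {
  // Keeps a running integer total per key and returns a description string.
  def mappingFunction(key: String, value: Option[Int], state: MiniState[Int]): Option[String] = {
    val total = (if (state.exists()) state.get() else 0) + value.getOrElse(0)
    state.update(total)
    Some(s"$key=$total")
  }

  // Feeds two values for the same key through the function, sharing one state.
  def demo: (Seq[Option[String]], Option[Int]) = {
    val state = new MiniState[Int](None)
    val outputs = Seq(3, 4).map(v => mappingFunction("clicks", Some(v), state))
    (outputs, state.getOption())
  }
}
```

Unlike the updateStateByKey function shape, the mapping function is called once per record and both updates the state and returns a mapped value.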

2.  The source of MapWithStateDStream is as follows:

/**
 * :: Experimental ::
 * DStream representing the stream of data generated by `mapWithState` operation on a
 * [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]].
 * Additionally, it also gives access to the stream of state snapshots, that is, the state data of
 * all keys after a batch has updated them.
 *
 * @tparam KeyType Class of the key
 * @tparam ValueType Class of the value
 * @tparam StateType Class of the state data
 * @tparam MappedType Class of the mapped data
 */
@Experimental
sealed abstract class MapWithStateDStream[KeyType, ValueType, StateType, MappedType: ClassTag](
    ssc: StreamingContext) extends DStream[MappedType](ssc) {

  /** Return a pair DStream where each RDD is the snapshot of the state of all the keys. */
  def stateSnapshots(): DStream[(KeyType, StateType)]
}

/** Internal implementation of the [[MapWithStateDStream]] */
private[streaming] class MapWithStateDStreamImpl[
    KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
    dataStream: DStream[(KeyType, ValueType)],
    spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
  extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

  private val internalStream =
    new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

  override def slideDuration: Duration = internalStream.slideDuration

  override def dependencies: List[DStream[_]] = List(internalStream)

  // The actual computation is delegated to InternalMapWithStateDStream.
  override def compute(validTime: Time): Option[RDD[MappedType]] = {
    internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
  }

3.  Updating the historical state.

/**
 * A DStream that allows per-key state to be maintained, and arbitrary records to be generated
 * based on updates to the state. This is the main DStream that implements the `mapWithState`
 * operation on DStreams.
 *
 * @param parent (key, value) stream that is the source
 * @param spec Specifications of the mapWithState operation
 * @tparam K   Key type
 * @tparam V   Value type
 * @tparam S   Type of the state maintained
 * @tparam E   Type of the mapped data
 */
private[streaming]
class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])
  extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {

  // The in-memory data structure is updated continuously, so it is persisted in memory.
  persist(StorageLevel.MEMORY_ONLY)

4.  InternalMapWithStateDStream.compute

  /** Method that generates a RDD for the given time */
  override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
    // Get the previous state or create a new empty state RDD
    val prevStateRDD = getOrCompute(validTime - slideDuration) match {
      case Some(rdd) =>
        if (rdd.partitioner != Some(partitioner)) {
          // If the RDD is not partitioned the right way, let us repartition it using the
          // partition index as the key. This is to ensure that state RDD is always partitioned
          // before creating another state RDD using it
          MapWithStateRDD.createFromRDD[K, V, S, E](
            rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
        } else {
          rdd
        }
      case None =>
        MapWithStateRDD.createFromPairRDD[K, V, S, E](
          spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
          partitioner,
          validTime
        )
    }

    // Create the RDD for the current batch time (validTime).
    // Compute the new state RDD with previous state RDD and partitioned data RDD
    // Even if there is no data RDD, use an empty one to create a new state RDD
    val dataRDD = parent.getOrCompute(validTime).getOrElse {
      context.sparkContext.emptyRDD[(K, V)]
    }
    val partitionedDataRDD = dataRDD.partitionBy(partitioner)
    val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
      (validTime - interval).milliseconds
    }
    Some(new MapWithStateRDD(
      prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
  }
}
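
The timeout arithmetic in compute is small enough to work through by hand. The sketch below restates it in plain Scala with times as milliseconds. `threshold` mirrors the `validTime - interval` expression above; `timedOutKeys` is a hypothetical helper, and the rule that a key counts as timed out when its last update is strictly older than the threshold is an assumption of this sketch.

```scala
object TimeoutThresholdSketch {
  // Threshold in milliseconds; None when no timeout interval is configured.
  def threshold(validTimeMs: Long, timeoutIntervalMs: Option[Long]): Option[Long] =
    timeoutIntervalMs.map(interval => validTimeMs - interval)

  // Assumption for this sketch: a key is timed out when lastUpdatedMs < threshold.
  def timedOutKeys(lastUpdated: Map[String, Long], thresholdMs: Option[Long]): Set[String] =
    thresholdMs match {
      case Some(t) => lastUpdated.collect { case (k, ts) if ts < t => k }.toSet
      case None    => Set.empty
    }
}
```

For a batch at 10000 ms with a 3000 ms timeout, the threshold is 7000 ms, so a key last updated at 6000 ms is a timeout candidate while one updated at 8000 ms is not.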

5.  MapWithStateRDD is an RDD that holds both the data for the mapWithState operation and the logic for processing that data. A MapWithStateRDDRecord represents one partition of the RDD.

/**
 * RDD storing the keyed states of `mapWithState` operation and corresponding mapped data.
 * Each partition of this RDD has a single record of type [[MapWithStateRDDRecord]]. This contains a
 * [[StateMap]] (containing the keyed-states) and the sequence of records returned by the mapping
 * function of  `mapWithState`.
 * @param prevStateRDD The previous MapWithStateRDD on whose StateMap data `this` RDD
 *                    will be created
 * @param partitionedDataRDD The partitioned data RDD which is used update the previous StateMaps
 *                           in the `prevStateRDD` to create `this` RDD
 * @param mappingFunction  The function that will be used to update state and return new data
 * @param batchTime        The time of the batch to which this RDD belongs to. Use to update
 * @param timeoutThresholdTime The time to indicate which keys are timeout
 */
private[streaming] class MapWithStateRDD[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    private var prevStateRDD: RDD[MapWithStateRDDRecord[K, S, E]],
    private var partitionedDataRDD: RDD[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long]
  ) extends RDD[MapWithStateRDDRecord[K, S, E]](
    partitionedDataRDD.sparkContext,
    List(
      new OneToOneDependency[MapWithStateRDDRecord[K, S, E]](prevStateRDD),
      new OneToOneDependency(partitionedDataRDD))
  ) {

  @volatile private var doFullScan = false

  require(prevStateRDD.partitioner.nonEmpty)
  require(partitionedDataRDD.partitioner == prevStateRDD.partitioner)

  override val partitioner = prevStateRDD.partitioner

  override def checkpoint(): Unit = {
    super.checkpoint()
    doFullScan = true
  }

  override def compute(
      partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {

    val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
    val prevStateRDDIterator = prevStateRDD.iterator(
      stateRDDPartition.previousSessionRDDPartition, context)
    val dataIterator = partitionedDataRDD.iterator(
      stateRDDPartition.partitionedDataRDDPartition, context)

    val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
    val newRecord = MapWithStateRDDRecord.updateRecordWithData(
      prevRecord,
      dataIterator,
      mappingFunction,
      batchTime,
      timeoutThresholdTime,
      removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
    )
    Iterator(newRecord)
  }

6.  updateRecordWithData: an RDD itself is immutable, but it can still process changing data.

def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
    prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
    dataIterator: Iterator[(K, V)],
    mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
    batchTime: Time,
    timeoutThresholdTime: Option[Long],
    removeTimedoutData: Boolean
  ): MapWithStateRDDRecord[K, S, E] = {
    // Create a new state map by cloning the previous one (if it exists) or by creating an empty one
    val newStateMap = prevRecord.map { _.stateMap.copy() }. getOrElse { new EmptyStateMap[K, S]() }

    val mappedData = new ArrayBuffer[E]
    val wrappedState = new StateImpl[S]()

    // Call the mapping function on each record in the data iterator, and accordingly
    // update the states touched, and collect the data returned by the mapping function
    dataIterator.foreach { case (key, value) =>
      wrappedState.wrap(newStateMap.get(key))
      val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
      if (wrappedState.isRemoved) {
        newStateMap.remove(key)
      } else if (wrappedState.isUpdated || timeoutThresholdTime.isDefined) {
        // Every record of the current batch is visited, the user-defined function is
        // applied to it, and the newStateMap data structure is updated accordingly.
        // newStateMap holds the historical state.
        newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
      }
      mappedData ++= returned
    }

    // Get the timed out state records, call the mapping function on each and collect the
    // data returned
    if (removeTimedoutData && timeoutThresholdTime.isDefined) {
      newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
        wrappedState.wrapTimingOutState(state)
        val returned = mappingFunction(batchTime, key, None, wrappedState)
        mappedData ++= returned
        newStateMap.remove(key)
      }
    }
    // From the RDD's point of view, the partition represented by MapWithStateRDDRecord
    // has not changed; only the data inside it has changed.
    MapWithStateRDDRecord(newStateMap, mappedData)
  }
}
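
The key property of updateRecordWithData is that the mapping logic runs only for records in the current batch. The sketch below models that in plain Scala; it is a simplified illustration and not Spark code. The state is a `Map`, copying it stands in for `stateMap.copy()`, and `updateTouched` is a hypothetical helper that also reports how many times the per-record logic ran.

```scala
object IncrementalUpdateSketch {
  // Returns the new state and the number of times the per-record logic ran.
  def updateTouched(
      prev: Map[String, Int],
      batch: Seq[(String, Int)]
  ): (Map[String, Int], Int) = {
    // Mutable copy of the previous state, standing in for stateMap.copy().
    val newState = scala.collection.mutable.Map.empty[String, Int] ++= prev
    var calls = 0
    // Only keys that appear in the batch are looked up and updated.
    batch.foreach { case (key, value) =>
      newState.put(key, newState.getOrElse(key, 0) + value)
      calls += 1
    }
    (newState.toMap, calls)
  }
}
```

With three keys in the previous state and a single record in the batch, the per-record logic runs once, and the untouched keys are carried over as they are.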

[Figure: mapWithState implementation; image not preserved]


Source: material from DT_大数据梦工厂 (Spark发行版本定制).
