Spark Streaming Window: Core Source Code Analysis

Spark Streaming provides window operations that transform the data inside a sliding window. As the window slides along the DStream, the RDDs that fall inside it are merged to produce the RDDs of the windowed DStream. The sliding-window operation is illustrated below:
[Figure: a window sliding over a DStream, merging the RDDs it covers]
As the figure shows, a window operation takes two parameters:

  • Window length: the duration of the window.
  • Sliding interval: the interval at which the window operation is performed.

Both parameters must be multiples of the batch interval of the source DStream.

Spark Streaming provides several window-related APIs:

  • window(windowLength, slideInterval)
  • countByWindow(windowLength, slideInterval): returns the number of elements in the window
  • reduceByWindow(func, windowLength, slideInterval)
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
  • countByValueAndWindow(windowLength, slideInterval, [numTasks])

See the official documentation for details; they are not repeated here.
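As a quick illustration, here is a sketch of a few of these calls; the `lines` and `pairs` DStreams are assumed inputs (the word-count sketch below shows one way to build them):

import org.apache.spark.streaming.Seconds

// `lines: DStream[String]` and `pairs: DStream[(String, Int)]` are assumed inputs.
val windowedLines = lines.window(Seconds(30), Seconds(10))  // raw windowed DStream

// countByWindow maintains an incremental (inverse) reduce internally,
// so it needs checkpointing enabled on the StreamingContext.
val counts = lines.countByWindow(Seconds(30), Seconds(10))

val summed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))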

Suppose we need to reduce over the last 30 seconds of data every 10 seconds. The code is written as follows:

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
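For context, here is a minimal end-to-end sketch showing where `pairs` might come from; the socket source, host, and port are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumed setup: a local two-thread context, a 10 s batch interval,
// and a socket text source on localhost:9999.
val conf = new SparkConf().setMaster("local[2]").setAppName("WindowedWordCount")
val ssc  = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce the last 30 seconds of data, every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedWordCounts.print()
ssc.start()
ssc.awaitTermination()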

From earlier articles, we know how Spark Streaming receives data through a Receiver, where that data is stored, and how it is processed afterwards; through this pipeline, a DStream is ultimately translated into RDDs. So we only need to examine how WindowedDStream performs its transformation at this stage.
(For background, see the earlier Spark Streaming source-code article 《Spark Streaming执行流程源码剖析》.)

Take the reduceByKeyAndWindow call from the example above. Invoking it enters reduceByKeyAndWindow in PairDStreamFunctions, which in turn delegates to an overload:

  def reduceByKeyAndWindow(
      reduceFunc: (V, V) => V,
      windowDuration: Duration,
      slideDuration: Duration
    ): DStream[(K, V)] = ssc.withScope {
    reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
  }
  
  def reduceByKeyAndWindow(
      reduceFunc: (V, V) => V,
      windowDuration: Duration,
      slideDuration: Duration,
      partitioner: Partitioner
    ): DStream[(K, V)] = ssc.withScope {
    self.reduceByKey(reduceFunc, partitioner)
        .window(windowDuration, slideDuration)
        .reduceByKey(reduceFunc, partitioner)
  }

Here we can see that reduceByKeyAndWindow is internally decomposed into the window and reduceByKey operators: each batch is first reduced by key, the pre-reduced batches are windowed, and the window is reduced by key once more. Execution then enters the window() function in DStream:

  def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
    new WindowedDStream(this, windowDuration, slideDuration)
  }
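Given this decomposition, the reduceByKeyAndWindow call from the earlier example should behave the same as spelling out the three steps by hand; a sketch, reusing the assumed `pairs` DStream:

// Equivalent to pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)):
val byHand = pairs
  .reduceByKey(_ + _)                // pre-reduce each 10 s batch by key
  .window(Seconds(30), Seconds(10))  // gather three pre-reduced batches
  .reduceByKey(_ + _)                // final reduce across the whole window

Pre-reducing each batch before windowing shrinks the data that has to be held and shuffled for the window.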

This finally brings us to the WindowedDStream class, the core of the window mechanism. Let's examine its compute method:

  override def compute(validTime: Time): Option[RDD[T]] = {
    val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
    val rddsInWindow = parent.slice(currentWindow)
    Some(ssc.sc.union(rddsInWindow))
  }

This method does three main things:

  1. Computes the current window interval, which consists of a begin time and an end time (a worked example follows this list):
     begin time = validTime - windowDuration + parent.slideDuration
     end time = validTime
  2. Slices the parent DStream over that interval to collect all the RDDs that fall inside the window.
  3. Calls SparkContext's union to merge those RDDs into the window's single RDD.
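As promised, a worked example of the interval arithmetic, in plain milliseconds since Interval itself is private[streaming]; the 10 s batch and 30 s window match the earlier example, and validTime is hypothetical:

val batchMs  = 10000L // parent.slideDuration: 10 s batches
val windowMs = 30000L // windowDuration: 30 s window
val validMs  = 60000L // validTime of the current batch (hypothetical)

val beginMs = validMs - windowMs + batchMs // 60 s - 30 s + 10 s = 40 s
// currentWindow = [40 s, 60 s]; slice is end-inclusive, so it picks up the
// parent RDDs generated at t = 40 s, 50 s and 60 s -- three 10 s batches,
// i.e. exactly 30 s of data.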

The Interval class is implemented as follows; it mainly defines arithmetic and comparison operations on window ranges:

private[streaming]
class Interval(val beginTime: Time, val endTime: Time) {
  def this(beginMs: Long, endMs: Long) = this(new Time(beginMs), new Time(endMs))

  def duration(): Duration = endTime - beginTime

  def + (time: Duration): Interval = {
    new Interval(beginTime + time, endTime + time)
  }

  def - (time: Duration): Interval = {
    new Interval(beginTime - time, endTime - time)
  }

  def < (that: Interval): Boolean = {
    if (this.duration != that.duration) {
      throw new Exception("Comparing two intervals with different durations [" + this + ", "
        + that + "]")
    }
    this.endTime < that.endTime
  }

  def <= (that: Interval): Boolean = (this < that || this == that)

  def > (that: Interval): Boolean = !(this <= that)

  def >= (that: Interval): Boolean = !(this < that)

  override def toString: String = "[" + beginTime + ", " + endTime + "]"
}
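Interval is private[streaming], so it cannot be used directly outside Spark, but its semantics are easy to mirror; a small stand-in sketch (all names here are illustrative):

// Millisecond-based stand-in for streaming's Interval.
final case class SimpleInterval(beginMs: Long, endMs: Long) {
  def +(ms: Long): SimpleInterval = SimpleInterval(beginMs + ms, endMs + ms)
  // As in the real class, ordering is only defined for equal durations.
  def <(that: SimpleInterval): Boolean = {
    require(endMs - beginMs == that.endMs - that.beginMs,
      s"Comparing two intervals with different durations [$this, $that]")
    this.endMs < that.endMs
  }
}

val window = SimpleInterval(40000L, 60000L) // [40 s, 60 s]
val next   = window + 10000L                // one slide later: [50 s, 70 s]
assert(window < next)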

Next, slice is called to carve the window's RDDs out of the parent DStream. Its implementation:

  /**
   * Return all the RDDs defined by the Interval object (both end times included)
   */
  def slice(interval: Interval): Seq[RDD[T]] = ssc.withScope {
    slice(interval.beginTime, interval.endTime)
  }

  /**
   * Return all the RDDs between 'fromTime' to 'toTime' (both included)
   */
  def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = ssc.withScope {
    if (!isInitialized) {
      throw new SparkException(this + " has not been initialized")
    }

    val alignedToTime = if ((toTime - zeroTime).isMultipleOf(slideDuration)) {
      toTime
    } else {
      logWarning(s"toTime ($toTime) is not a multiple of slideDuration ($slideDuration)")
      toTime.floor(slideDuration, zeroTime)
    }

    val alignedFromTime = if ((fromTime - zeroTime).isMultipleOf(slideDuration)) {
      fromTime
    } else {
      logWarning(s"fromTime ($fromTime) is not a multiple of slideDuration ($slideDuration)")
      fromTime.floor(slideDuration, zeroTime)
    }

    logInfo(s"Slicing from $fromTime to $toTime" +
      s" (aligned to $alignedFromTime and $alignedToTime)")

    alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>
      if (time >= zeroTime) getOrCompute(time) else None
    }
  }
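Before unpacking this, note that the flooring behaviour is easy to reproduce; a minimal sketch assuming zeroTime = 0 and a 10 s slideDuration (the helper name floorTo is hypothetical):

// Hypothetical helper mirroring Time.floor(slideDuration, zeroTime).
def floorTo(timeMs: Long, slideMs: Long, zeroMs: Long = 0L): Long =
  zeroMs + ((timeMs - zeroMs) / slideMs) * slideMs

val slideMs = 10000L
val alignedFrom = floorTo(12000L, slideMs) // 12 s floors to 10 s
val alignedTo   = floorTo(47000L, slideMs) // 47 s floors to 40 s
// slice(12 s, 47 s) therefore asks getOrCompute for the RDDs at
// t = 10 s, 20 s, 30 s and 40 s.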

So slice aligns fromTime and toTime against zeroTime, flooring each to a multiple of slideDuration when necessary, then steps from the aligned start to the aligned end in slideDuration increments, calling getOrCompute for each batch time:

  /**
   * Get the RDD corresponding to the given time; either retrieve it from cache
   * or compute-and-cache it.
   */
  private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {
        // Set the thread-local property for call sites to this DStream's creation site
        // such that RDDs generated by compute gets that as their creation site.
        // Note that this `getOrCompute` may get called from another DStream which may have
        // set its own call site. So we store its call site in a temporary variable,
        // set this DStream's creation site, generate RDDs and then restore the previous call site.
        val prevCallSite = ssc.sparkContext.getCallSite()
        ssc.sparkContext.setCallSite(creationSite)
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        val rddOption = PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
        ssc.sparkContext.setCallSite(prevCallSite)

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          // Put the generated RDD into the HashMap cache
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

This source shows how the RDDs inside a window are materialized once the window interval is known. The HashMap-backed cache (generatedRDDs) is consulted first; on a hit, the cached RDD is returned directly. Otherwise the method checks whether the requested time is valid for this DStream's sliding schedule; if it is, the RDD for that time is computed, persistence and checkpointing are applied if configured, and the new RDD is put into the cache and returned.
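The cache-or-compute pattern itself is simple to isolate; a minimal sketch with a plain HashMap standing in for generatedRDDs (types and names are illustrative, not Spark's):

import scala.collection.mutable

// A String stands in for the RDD to keep the sketch self-contained.
val generated = mutable.HashMap.empty[Long, String]

def getOrCompute(timeMs: Long)(isTimeValid: Long => Boolean)
                (compute: Long => Option[String]): Option[String] =
  generated.get(timeMs).orElse {
    if (isTimeValid(timeMs)) {
      val result = compute(timeMs)              // generate only on a cache miss
      result.foreach(generated.put(timeMs, _))  // cache for later windows
      result
    } else None
  }

This caching is what makes overlapping windows cheap: each parent batch RDD is generated once and then reused by every window that covers it.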

Once the window's RDD has been produced, the usual SparkContext machinery takes over to run the job. With that, we have walked through the underlying implementation of the entire window mechanism.

