Spark Streaming provides windowed computations, which let you apply transformations over a sliding window of data. Each time the window slides over the source DStream, the RDDs that fall inside the window are combined to produce the RDDs of the windowed DStream. The figure below illustrates a sliding-window operation:
As the figure shows, a window operation takes two parameters:
- Window length: the duration of the window.
- Sliding interval: the interval at which the window operation is performed.
Spark Streaming offers several window-related APIs:
- window(windowLength, slideInterval)
- countByWindow(windowLength, slideInterval): returns the number of elements in the window
- reduceByWindow(func, windowLength, slideInterval)
- reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
- reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
- countByValueAndWindow(windowLength, slideInterval, [numTasks])
See the official documentation for the details of each; I won't repeat them here.
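As a quick illustration of a few of these operators, here is a minimal sketch. The socket source on localhost:9999, the checkpoint directory, and the local master are all hypothetical; checkpointing is enabled because countByWindow and countByValueAndWindow are built on inverse reductions and therefore require it.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowApiSketch")
val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batch interval
ssc.checkpoint("/tmp/window-checkpoint")             // required by the count*Window operators below

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// All elements received in the last 30 seconds, recomputed every 10 seconds
val windowedWords = words.window(Seconds(30), Seconds(10))
// Total number of elements in that window
val elementCount  = words.countByWindow(Seconds(30), Seconds(10))
// Occurrences of each distinct word in that window
val valueCounts   = words.countByValueAndWindow(Seconds(30), Seconds(10))

windowedWords.print()
elementCount.print()
valueCounts.print()

ssc.start()
ssc.awaitTermination()
```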
Suppose we need to reduce the last 30 seconds of data every 10 seconds. The code is written as follows:
```scala
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```
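The snippet refers to a `pairs` DStream that is not defined above. One way it might be built is sketched below; the socket source and the (word, 1) mapping are assumptions, and the batch interval is 10 seconds so that the 30 s window and 10 s slide are valid multiples of it.

```scala
// Hypothetical context for the example above, reusing the 10-second-batch ssc from the earlier sketch
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()
```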
From the earlier articles we already know how Spark Streaming receives data through a Receiver, where that data is stored, and how it is subsequently processed; at the end of that pipeline every DStream is turned into RDDs. So all we need to examine here is how WindowedDStream performs its transformation at that stage.
(For background, see the earlier Spark Streaming source-code analysis article 【Spark Streaming执行流程源码剖析】.)
Taking the reduceByKeyAndWindow call from the example above: invoking it takes us into PairDStreamFunctions, whose reduceByKeyAndWindow in turn delegates to an overloaded method:
```scala
def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration
  ): DStream[(K, V)] = ssc.withScope {
  reduceByKeyAndWindow(reduceFunc, windowDuration, slideDuration, defaultPartitioner())
}

def reduceByKeyAndWindow(
    reduceFunc: (V, V) => V,
    windowDuration: Duration,
    slideDuration: Duration,
    partitioner: Partitioner
  ): DStream[(K, V)] = ssc.withScope {
  self.reduceByKey(reduceFunc, partitioner)
      .window(windowDuration, slideDuration)
      .reduceByKey(reduceFunc, partitioner)
}
```
Here we can see that reduceByKeyAndWindow is internally decomposed into two operators, window and reduceByKey. The call then enters DStream and invokes the window() method:
```scala
def window(windowDuration: Duration, slideDuration: Duration): DStream[T] = ssc.withScope {
  new WindowedDStream(this, windowDuration, slideDuration)
}
```
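Put differently, the decomposition above means the windowed word count from the earlier example could be hand-written as the equivalent chain below (a sketch that assumes the same hypothetical `pairs` DStream and 10-second batch interval as before):

```scala
// Equivalent to pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10)):
// pre-aggregate within each 10-second batch, window the per-batch results over 30 seconds,
// then reduce once more across the batches that fall inside the window.
val windowedWordCounts = pairs
  .reduceByKey(_ + _)
  .window(Seconds(30), Seconds(10))
  .reduceByKey(_ + _)
```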
The call to window() ultimately constructs a WindowedDStream; this class is where the core of the window mechanism lives. Let's look at its compute method:
```scala
override def compute(validTime: Time): Option[RDD[T]] = {
  val currentWindow = new Interval(validTime - windowDuration + parent.slideDuration, validTime)
  val rddsInWindow = parent.slice(currentWindow)
  Some(ssc.sc.union(rddsInWindow))
}
```
This method does three things (a worked example follows this list):
- Compute the current window interval, defined by a begin time and an end time:
  begin time = current valid time - window duration + the parent DStream's slide duration; end time = current valid time.
- Slice the parent DStream over that interval to collect the RDDs that fall inside the window.
- Call SparkContext's union to merge those RDDs into a single RDD for the window.
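To make the arithmetic concrete, here is a small sketch with hypothetical numbers: the parent DStream slides every 10 seconds, the window is 30 seconds long, and compute() is called for validTime = 60 s.

```scala
import org.apache.spark.streaming.{Seconds, Time}

// Hypothetical numbers: 10-second batches, a 30-second window, compute() called at 60 s
val windowDuration      = Seconds(30)
val parentSlideDuration = Seconds(10)
val validTime           = Time(60 * 1000L)

val begin = validTime - windowDuration + parentSlideDuration  // 40 s
val end   = validTime                                         // 60 s

// The window is therefore [40 s, 60 s]; parent.slice() returns the parent RDDs generated
// at 40 s, 50 s and 60 s -- the three 10-second batches covering the last 30 seconds --
// and ssc.sc.union merges them into the single RDD for this window.
println(s"window = [$begin, $end]")   // window = [40000 ms, 60000 ms]
```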
The Interval class is shown below; it mainly defines a set of helpers for computing and comparing window ranges:
```scala
private[streaming]
class Interval(val beginTime: Time, val endTime: Time) {
  def this(beginMs: Long, endMs: Long) = this(new Time(beginMs), new Time(endMs))

  def duration(): Duration = endTime - beginTime

  def + (time: Duration): Interval = {
    new Interval(beginTime + time, endTime + time)
  }

  def - (time: Duration): Interval = {
    new Interval(beginTime - time, endTime - time)
  }

  def < (that: Interval): Boolean = {
    if (this.duration != that.duration) {
      throw new Exception("Comparing two intervals with different durations [" + this + ", "
        + that + "]")
    }
    this.endTime < that.endTime
  }

  def <= (that: Interval): Boolean = (this < that || this == that)

  def > (that: Interval): Boolean = !(this <= that)

  def >= (that: Interval): Boolean = !(this < that)

  override def toString: String = "[" + beginTime + ", " + endTime + "]"
}
```
Next, compute() calls the slice method to collect the RDDs inside the window. Its implementation is as follows:
```scala
/**
 * Return all the RDDs defined by the Interval object (both end times included)
 */
def slice(interval: Interval): Seq[RDD[T]] = ssc.withScope {
  slice(interval.beginTime, interval.endTime)
}

/**
 * Return all the RDDs between 'fromTime' to 'toTime' (both included)
 */
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]] = ssc.withScope {
  if (!isInitialized) {
    throw new SparkException(this + " has not been initialized")
  }

  val alignedToTime = if ((toTime - zeroTime).isMultipleOf(slideDuration)) {
    toTime
  } else {
    logWarning(s"toTime ($toTime) is not a multiple of slideDuration ($slideDuration)")
    toTime.floor(slideDuration, zeroTime)
  }

  val alignedFromTime = if ((fromTime - zeroTime).isMultipleOf(slideDuration)) {
    fromTime
  } else {
    logWarning(s"fromTime ($fromTime) is not a multiple of slideDuration ($slideDuration)")
    fromTime.floor(slideDuration, zeroTime)
  }

  logInfo(s"Slicing from $fromTime to $toTime" +
    s" (aligned to $alignedFromTime and $alignedToTime)")

  alignedFromTime.to(alignedToTime, slideDuration).flatMap { time =>
    if (time >= zeroTime) getOrCompute(time) else None
  }
}
```
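The alignment step is easiest to see with numbers. In the sketch below (hypothetical values: zeroTime at 0, a 10-second slide duration, and a requested range of [40 s, 65 s]), the end time does not fall on a batch boundary and is floored down to 60 s, so the RDDs returned are those generated at 40 s, 50 s and 60 s.

```scala
import org.apache.spark.streaming.{Seconds, Time}

// Hypothetical values: the stream started at zeroTime = 0 and slides every 10 seconds
val zeroTime      = Time(0L)
val slideDuration = Seconds(10)
val fromTime      = Time(40 * 1000L)
val toTime        = Time(65 * 1000L)   // not a multiple of 10 s past zeroTime

// The same alignment logic as in slice(): floor to the previous batch boundary
val alignedToTime =
  if ((toTime - zeroTime).isMultipleOf(slideDuration)) toTime
  else toTime.floor(slideDuration, zeroTime)              // 60 s

// The batch times whose RDDs slice() would then fetch via getOrCompute()
val batchTimes = fromTime.to(alignedToTime, slideDuration)
println(batchTimes)   // 40000 ms, 50000 ms, 60000 ms
```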
As shown in slice(), the window boundaries are aligned against zeroTime (the time the stream started) so that they land on batch boundaries, and then getOrCompute is called for every batch time in the window:
```scala
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      // Set the thread-local property for call sites to this DStream's creation site
      // such that RDDs generated by compute gets that as their creation site.
      // Note that this `getOrCompute` may get called from another DStream which may have
      // set its own call site. So we store its call site in a temporary variable,
      // set this DStream's creation site, generate RDDs and then restore the previous call site.
      val prevCallSite = ssc.sparkContext.getCallSite()
      ssc.sparkContext.setCallSite(creationSite)
      // Disable checks for existing output directories in jobs launched by the streaming
      // scheduler, since we may need to write output to an existing directory during checkpoint
      // recovery; see SPARK-4835 for more details. We need to have this call here because
      // compute() might cause Spark jobs to be launched.
      val rddOption = PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
        compute(time)
      }
      ssc.sparkContext.setCallSite(prevCallSite)

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        // Put the generated RDD into the HashMap cache
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
```
From the source above we can see how the RDD for each batch time inside the window is obtained. Spark first looks it up in the HashMap-backed cache (generatedRDDs); if it is there, it is returned directly. Otherwise, Spark checks whether the time is a valid RDD-generation time for this DStream and, if it is, computes the RDD for that batch. If a storage level or checkpoint interval has been configured, the newly generated RDD is persisted and/or marked for checkpointing accordingly. Finally the RDD is put into the cache and returned.
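To tie this back to user code: the storageLevel and checkpointDuration branches above are driven by what we set on the DStream ourselves. A minimal sketch, reusing the hypothetical `pairs` stream from the earlier example:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds

// Reusing the hypothetical `pairs` DStream and 10-second batch interval from above
val windowedWordCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

// persist() sets the storageLevel that getOrCompute() checks before caching each generated RDD ...
windowedWordCounts.persist(StorageLevel.MEMORY_ONLY_SER)
// ... and checkpoint() sets checkpointDuration, so RDDs whose batch time is a multiple of
// 60 s past zeroTime are also marked for checkpointing (this additionally requires
// ssc.checkpoint(<dir>) to have been called, as in the first sketch).
windowedWordCounts.checkpoint(Seconds(60))
```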
Once the window's RDD has been produced, the usual SparkContext operations are invoked to run the job. With that, we have walked through the underlying implementation of the whole window mechanism.