Spark Streaming Programming Guide: Window Operations
1. Overview of Window Operations
As shown in the figure, every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. In this specific case, the operation is applied over the last 3 time units of data, and slides by 2 time units. This shows that any window operation needs to specify two parameters (see the sketch after this list).
- window length - The duration of the window (3 in the figure).
- sliding interval - The interval at which the window operation is performed (2 in the figure).
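To make these two parameters concrete, here is a minimal sketch using the basic window operation, assuming a StreamingContext ssc with a batch interval of 1 second and an illustrative hostname and port. Note that both parameters must be multiples of the source DStream's batch interval.
// A sketch of window(windowLength, slideInterval); `ssc`, the hostname, and the port are assumptions
val lines = ssc.socketTextStream("localhost", 9999)
// Combine the RDDs of the last 3 seconds, recomputing every 2 seconds (as in the figure)
val windowedLines = lines.window(Seconds(3), Seconds(2))
windowedLines.print()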
Example:
Suppose you want to generate word counts over the last 30 seconds of data, every 10 seconds. To do this, we have to apply the reduceByKey operation on the pairs DStream of (word, 1) pairs over the last 30 seconds of data. This is done using the operation reduceByKeyAndWindow.
// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
2. Window Operation Functions
Some of the common window operations are as follows:
Transformation | Meaning |
---|---|
window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. |
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
A worked example of the incremental (invFunc) variant from the table above: to compute the word counts for the 5-second window ending at time t+4, take the counts for the 5-second window ending at t+3, add the counts for [t+3, t+4], and subtract the counts for [t-2, t-1]. The counts for the seconds shared by the two windows are reused, which is what makes this version more efficient.
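A minimal sketch of this incremental form, assuming pairs is the (word, 1) DStream from the example above and ssc is the surrounding StreamingContext. Checkpointing must be enabled for the inverse-reduce variant; the directory name here is an assumption.
// Checkpointing is required for the invFunc variant; the directory is an assumed example
ssc.checkpoint("reduce_checkpoint")
val incrementalCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // reduce the new data entering the window
  (a: Int, b: Int) => a - b,  // "inverse reduce" the old data leaving the window
  Seconds(30), Seconds(10))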
====================================================================================================================================
3. Examples
3.1 countByWindow
(windowLength, slideInterval)
Return a sliding window count of elements in the stream.
The contents of TransfromData.txt are as follows (one word per line):
hello hell0 hello hello hjw hjw hjw hjw hello hello hello
Program arguments: ./srcFile/TransfromData.txt 9999 1000
package com.dt.spark.main.Streaming

import java.io.PrintWriter
import java.net.ServerSocket
import java.util.Date

import scala.io.Source

/**
  * Created by hjw on 17/5/1.
  */
object StreamingSimulation {
  /*
  Return a random index in [0, length)
  */
  def index(length: Int) = {
    import java.util.Random
    val rdm = new Random()
    rdm.nextInt(length)
  }

  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: <filename> <port> <millisecond>")
      System.exit(1)
    }

    val filename = args(0)
    val lines = Source.fromFile(filename).getLines().toList
    val fileRow = lines.length

    val listener = new ServerSocket(args(1).toInt)
    // Listen on the given port and open a connection for each incoming request
    while (true) {
      val socket = listener.accept()
      new Thread() {
        override def run() = {
          println("Got client connect from: " + socket.getInetAddress)
          val out = new PrintWriter(socket.getOutputStream, true)
          while (true) {
            Thread.sleep(args(2).toLong)
            // Send a randomly chosen line to the client
            val content = lines(index(fileRow))
            val now = new Date()
            println("time: " + now.getTime + " " + content)
            out.write(content + '\n')
            out.flush()
          }
          socket.close()
        }
      }.start()
    }
  }
}
CountByWindow
The batch interval is 1 second:
val ssc = new StreamingContext(conf, Seconds(1))
val windowedWordCounts = pairs.countByWindow(Seconds(3), Seconds(2))
With one word arriving per second, the 3-second window holds 3 elements, so windowedWordCounts should be 3.
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark._
import org.apache.spark.streaming._

object CountByWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    ssc.checkpoint("CountByWindow_checkpoint")

    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Pair each word with a count of 1
    val pairs = words.map(word => (word, 1))
    // Count the pairs that fall in the last 3 seconds, every 2 seconds
    val windowedWordCounts = pairs.countByWindow(Seconds(3), Seconds(2))

    // Print the first ten elements of each RDD generated in this DStream to the console
    windowedWordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
Run the StreamingSimulation simulator first, then run CountByWindow.
StreamingSimulation output:
Got client connect from: /127.0.0.1
time: 1493700495475 hello
time: 1493700496476 hello
time: 1493700497480 hell0
time: 1493700498482 hello
time: 1493700499485 hello
time: 1493700500486 hello
time: 1493700501490 hell0
time: 1493700502491 hjw
time: 1493700503496 hello
time: 1493700504501 hjw
17/05/02 12:48:13 INFO Slf4jLogger: Slf4jLogger started
17/05/02 12:48:13 INFO Remoting: Starting remoting
17/05/02 12:48:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.18.136.133:49188]
17/05/02 12:48:14 INFO WriteAheadLogManager for Thread: Recovered 1 write ahead log files from file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata
-------------------------------------------
Time: 1493700496000 ms
-------------------------------------------
1
17/05/02 12:48:16 INFO WriteAheadLogManager for Thread: Attempting to clear 1 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700473000: file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata/log-1493700373007-1493700433007
17/05/02 12:48:16 INFO WriteAheadLogManager for Thread: Cleared log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700473000
-------------------------------------------
Time: 1493700498000 ms
-------------------------------------------
3
17/05/02 12:48:18 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493700475000:
-------------------------------------------
Time: 1493700500000 ms
-------------------------------------------
3
3.2 reduceByKeyAndWindow
(func, windowLength, slideInterval, [numTasks])
The contents of TransfromData.txt are as follows, ensuring that each send delivers exactly one "hello":
hello hello hello hello
The ReduceByKeyAndWindow program is as follows:
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/5/2.
  */
object ReduceByKeyAndWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    ssc.checkpoint("CountByWindow_checkpoint")

    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Pair each word with a count of 1
    val pairs = words.map(word => (word, 1))
    // Reduce by key over the last 10 seconds of data, every 2 seconds, with 2 reduce tasks
    val reduceWordCounts = pairs.reduceByKeyAndWindow((v1: Int, v2: Int) => v1 + v2, Seconds(10), Seconds(2), 2)

    reduceWordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
Simulator output:
Got client connect from: /127.0.0.1
time: 1493704449107 hello
time: 1493704450110 hello
time: 1493704451110 hello
time: 1493704452114 hello
time: 1493704453116 hello
time: 1493704454119 hello
time: 1493704455122 hello
time: 1493704456127 hello
time: 1493704457128 hello
time: 1493704458128 hello
time: 1493704459134 hello
time: 1493704460138 hello
time: 1493704461138 hello
time: 1493704462142 hello
time: 1493704463144 hello
The steady-state output of ReduceByKeyAndWindow:
17/05/02 13:54:17 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByWindow_checkpoint/receivedBlockMetadata older than 1493704445000:
-------------------------------------------
Time: 1493704459000 ms
-------------------------------------------
(hello,10)
====================================================================================================================================
3.3 CountByValueAndWindow
Here countByValueAndWindow uses a window length of 10 seconds, and each count reports "hello" with a frequency of 20 (each line sent evidently contains two "hello"s, so a 10-second window at one line per second sees 20 of them). It can also be understood as a reduceByKeyAndWindow over ("hello", 1) pairs in disguise.
val countByValueAndWindow = words.countByValueAndWindow(Seconds(10), Seconds(2), 2)
The simulator's TransfromData.txt now contains:
hello hello hello hello hello hello hello hello hello hello hello hello hello hello
The CountByValueAndWindow program is as follows:
package com.dt.spark.main.Streaming.Window

import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/5/2.
  */
object CountByValueAndWindow {
  Logger.getLogger("org").setLevel(Level.ERROR)

  def main(args: Array[String]): Unit = {
    // Create a local StreamingContext with two working threads and a batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    ssc.checkpoint("CountByValueAndWindow_checkpoint")

    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count each distinct value's frequency over the last 10 seconds, every 2 seconds, with 2 reduce tasks
    val countByValueAndWindow = words.countByValueAndWindow(Seconds(10), Seconds(2), 2)

    countByValueAndWindow.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}
The CountByValueAndWindow output is as follows (the (,10) entries count the empty tokens that split(" ") produces when a line contains extra whitespace):
17/05/02 17:47:57 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByValueAndWindow_checkpoint/receivedBlockMetadata older than 1493718447000:
-------------------------------------------
Time: 1493718479000 ms
-------------------------------------------
(,10)
(hello,20)
17/05/02 17:47:59 INFO WriteAheadLogManager for Thread: Attempting to clear 0 old log files in file:/Users/hjw/Documents/Java/spark/LearnSpark/LearnSpark/CountByValueAndWindow_checkpoint/receivedBlockMetadata older than 1493718449000:
-------------------------------------------
Time: 1493718481000 ms
-------------------------------------------
(,10)
(hello,20)
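As noted in 3.3, countByValueAndWindow can be viewed as a reduceByKeyAndWindow over (word, 1) pairs in disguise. A minimal sketch of that equivalent formulation, reusing the words DStream from the program above; Long is used because countByValueAndWindow returns a DStream[(String, Long)]:
// Equivalent formulation: pair each word with 1L, then reduce over the same window
val equivalentCounts = words.map(word => (word, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(10), Seconds(2), 2)
equivalentCounts.print()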