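(Original article: http://blog.csdn.net/codemosi/article/category/2777045. Please include this address when reposting. This is part of an ongoing series on hadoop, hive, hbase, mahout, storm, spark, kafka and flume, written in the spirit of sharing.)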
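Spark Streaming is the part of the Spark stack that handles streaming data, that is, incrementally arriving data such as sockets, new text files appearing in a directory, or message topics in a Kafka queue. As usual, we start from the code.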
Data source for the streaming incremental data: a Kafka producer

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    // args: broker list, topic, messages per iteration, words per message
    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args
    // Kafka producer connection properties
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val config = new ProducerConfig(props)
    val producer = new Producer[String, String](config)
    // Send messages forever: each message is a line of random digits 0-9
    while (true) {
      val messages = (1 to messagesPerSec.toInt).map { messageNum =>
        val str = (1 to wordsPerMessage.toInt).map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        new KeyedMessage[String, String](topic, str)
      }.toArray
      producer.send(messages: _*)
      // Pause 100 ms between iterations
      Thread.sleep(100)
    }
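Receiver of the streaming incremental data: a Kafka consumer that uses Spark Streaming to process the incoming data

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits, needed before Spark 1.3
    import org.apache.spark.streaming.kafka.KafkaUtils

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    // Batch interval of 2 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Checkpointing is required by the inverse-function form of reduceByKeyAndWindow
    ssc.checkpoint("checkpoint")
    // Map each topic name to the number of consumer threads
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Keep only the message part of each (key, message) pair
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()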
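Analyzing how Spark Streaming works, starting from the code

1: DStream (a continuous stream of data, split into a series of data sets)

    val lines = KafkaUtils.createStream()

lines is a DStream. A DStream is the basic unit that Spark Streaming processes, and it represents a sequence of RDDs, one per batch.

Suppose the data arriving through KafkaUtils.createStream() is the text

(DStreams support many of the transformations available on normal Spark RDD's)

Spark Streaming cuts the incoming data into a series of data sets, for example:

RDD1: DStreams support many
RDD2: of the transformations available
RDD3: on normal Spark RDD's

RDD1, RDD2 and RDD3 together make up one DStream, which is lines in this example. Spark Streaming uses the DStream as its basic unit and offers a high-level API on top of it for processing streaming data.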
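The cutting of a stream into per-interval data sets can be modelled in plain Scala. This is only a sketch, not Spark code: `Record`, `toBatches` and the arrival timestamps are hypothetical names and values chosen for illustration, and a plain `Seq[String]` stands in for each RDD.

```scala
object DStreamSketch {
  // Hypothetical record type: a value plus the time it arrived (in ms).
  final case class Record(arrivalMs: Long, value: String)

  // Group records by the batch interval they arrived in.
  // Each inner Seq plays the role of one RDD; the outer Seq plays the role of the DStream.
  // Simplification: intervals with no records are skipped in this model.
  def toBatches(records: Seq[Record], batchMs: Long): Seq[Seq[String]] =
    records
      .groupBy(_.arrivalMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (_, rs) => rs.sortBy(_.arrivalMs).map(_.value) }

  def main(args: Array[String]): Unit = {
    // Made-up arrival times so that the words fall into three 2-second batches.
    val records = Seq(
      Record(100L, "DStreams"), Record(900L, "support"), Record(1800L, "many"),
      Record(2100L, "of"), Record(2500L, "the"), Record(3000L, "transformations"),
      Record(3900L, "available"), Record(4200L, "on"), Record(4800L, "normal"),
      Record(5100L, "Spark"), Record(5900L, "RDD's")
    )
    toBatches(records, 2000L).zipWithIndex.foreach { case (batch, i) =>
      println(s"RDD${i + 1}: ${batch.mkString(" ")}")
    }
  }
}
```

With these assumed timestamps and a 2-second interval, the eleven words fall into the same three groups as RDD1, RDD2 and RDD3 above.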
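2: Transformations (functions that convert one DStream into another, with the same semantics as RDD transformations in Spark)

Spark Streaming processes data with the DStream as its basic unit, for example:

    val words = lines.flatMap(_.split(" "))

Here the DStream lines goes through a flatMap transformation and produces a new DStream, words. Under the hood, the flatMap is applied to each of the three RDDs: RDD1, RDD2 and RDD3. Spark Streaming provides a high-level API on DStreams that hides the underlying RDD transformations. The DStream API currently offers the following transformations: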

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
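
To make the "one DStream operation equals the same operation on every RDD" idea concrete, here is a small plain-Scala model. It is a sketch under assumptions, not the Spark API: `Batch`, `Stream`, `flatMapStream`, `countByValueStream` and `runningCounts` are hypothetical names, and ordinary collections stand in for RDDs and DStreams. The last helper models the idea behind updateStateByKey, where state is carried from one batch to the next.

```scala
object TransformSketch {
  // Hypothetical stand-ins: a Batch plays the role of one RDD, a Stream the role of a DStream.
  type Batch[A] = Seq[A]
  type Stream[A] = Seq[Batch[A]]

  // Model of DStream.flatMap: the same flatMap is applied to every batch.
  def flatMapStream[A, B](s: Stream[A])(f: A => Seq[B]): Stream[B] =
    s.map(batch => batch.flatMap(f))

  // Model of countByValue(): frequencies are computed per batch, not across batches.
  def countByValueStream[A](s: Stream[A]): Seq[Map[A, Long]] =
    s.map(batch => batch.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong })

  // Model of the idea behind updateStateByKey: a running count per key,
  // where each batch updates the state left behind by the previous batch.
  def runningCounts[A](s: Stream[A]): Seq[Map[A, Long]] =
    s.scanLeft(Map.empty[A, Long]) { (state, batch) =>
      batch.foldLeft(state)((m, k) => m.updated(k, m.getOrElse(k, 0L) + 1L))
    }.tail

  def main(args: Array[String]): Unit = {
    val lines: Stream[String] = Seq(Seq("a b", "b"), Seq("b c"))
    val words = flatMapStream(lines)(_.split(" ").toSeq)
    println(words)                     // two batches of words
    println(countByValueStream(words)) // per-batch frequencies
    println(runningCounts(words))      // frequencies accumulated across batches
  }
}
```

The difference between the last two helpers is the point of the stateful operation: per-batch results forget earlier batches, while the running counts do not.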
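3: Window operations (a special kind of DStream-to-DStream function, specific to Spark Streaming)

    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2) // reduceByKeyAndWindow is the window operation

A window operation is a transformation over a window of time. In effect, it periodically computes a value over the recent past.

Here, every Seconds(2), the word counts over the past Minutes(10) of the words DStream are computed, producing the DStream wordCounts.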
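
The sliding behaviour can be sketched in plain Scala by measuring the window length and slide interval in batches rather than in time. This is an illustrative model, not the Spark API: `windowed` and `windowedCounts` are hypothetical names. Under the article's settings (2-second batches, a 10-minute window, a 2-second slide), the window would span 300 batches and slide by 1 batch.

```scala
object WindowSketch {
  type Batch[A] = Seq[A]

  // Every `slide` batches, emit the concatenation of the last `windowLen` batches.
  // Early windows are shorter because fewer than `windowLen` batches exist yet.
  def windowed[A](stream: Seq[Batch[A]], windowLen: Int, slide: Int): Seq[Batch[A]] =
    (slide to stream.length by slide).map { end =>
      stream.slice(math.max(0, end - windowLen), end).flatten
    }

  // Word counts recomputed from scratch for each window.
  def windowedCounts(stream: Seq[Batch[String]], windowLen: Int, slide: Int): Seq[Map[String, Long]] =
    windowed(stream, windowLen, slide).map { w =>
      w.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }
    }

  def main(args: Array[String]): Unit = {
    val words = Seq(Seq("a", "b"), Seq("b"), Seq("c", "a"), Seq("a"))
    // Window of 2 batches, sliding by 1 batch.
    windowedCounts(words, 2, 1).foreach(println)
  }
}
```

Note that a word leaves the result once the batch containing it slides out of the window, which is what distinguishes a windowed count from a running total.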
Window operations are well suited to top-N scenarios. The window operations currently provided are:

| Transformation | Meaning |
| --- | --- |
| window(windowLength, slideInterval) | Return a new DStream which is computed based on windowed batches of the source DStream. |
| countByWindow(windowLength, slideInterval) | Return a sliding window count of elements in the stream. |
| reduceByWindow(func, windowLength, slideInterval) | Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel. |
| reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]) | A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enter the sliding window, and "inverse reducing" the old data that leave the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable to only "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
| countByValueAndWindow(windowLength, slideInterval, [numTasks]) | When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. |
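
The incremental form with an inverse function (the `_ + _, _ - _` pair used in the word count above) can be modelled in plain Scala as "previous window, plus the batch that entered, minus the batch that left". This is a sketch under assumptions, not Spark's implementation: all names below are hypothetical, the slide is fixed at one batch, and keys whose count drops to zero are removed so that the result can be compared directly with a from-scratch recomputation.

```scala
object IncrementalWindowSketch {
  type Counts = Map[String, Long]

  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }

  // Plays the role of func (`_ + _`): merge the counts of the entering batch.
  def add(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0L) + v) }

  // Plays the role of invFunc (`_ - _`): remove the counts of the leaving batch.
  // Modelling choice: keys that reach zero are dropped.
  def subtract(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (m, (k, v)) =>
      val left = m.getOrElse(k, 0L) - v
      if (left == 0L) m - k else m.updated(k, left)
    }

  // Incremental version: each window is derived from the previous one.
  def incremental(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch).toVector
    perBatch.indices.scanLeft(Map.empty[String, Long]: Counts) { (prev, i) =>
      val entered = add(prev, perBatch(i))
      if (i >= windowLen) subtract(entered, perBatch(i - windowLen)) else entered
    }.tail
  }

  // Naive version: recount every window from scratch.
  def naive(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] =
    batches.indices.map { i =>
      countBatch(batches.slice(math.max(0, i + 1 - windowLen), i + 1).flatten)
    }

  def main(args: Array[String]): Unit = {
    val words = Seq(Seq("a", "b"), Seq("b"), Seq("c", "a"), Seq("a"))
    println(incremental(words, 2) == naive(words, 2)) // both strategies agree
  }
}
```

The incremental version touches only two batches per slide regardless of window length, which is why it pays off for a long window with a short slide such as 10 minutes and 2 seconds. It also has to remember the previous window's result, which is consistent with the word count example setting a checkpoint directory.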