05 Big Data In-Memory Computing with Spark Series - Spark Streaming Stream Processing

   (Original post: http://blog.csdn.net/codemosi/article/category/2777045. Please keep this link when reposting. Hadoop, Hive, HBase, Mahout, Storm, Spark, Kafka, Flume and more are covered in this ongoing series - happy to share.)
Spark Streaming is the part of the Spark stack that handles streaming data - incremental data such as a socket, new text files appearing in a directory, or messages on a Kafka topic. As usual, let's start from the code.

The data source of the incremental stream: a Kafka producer

    import java.util.Properties

    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    val Array(brokers, topic, messagesPerSec, wordsPerMessage) = args

    // Kafka broker connection properties
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val config = new ProducerConfig(props)
    val producer = new Producer[String, String](config)

    // Every 100 ms, send messagesPerSec messages, each made of wordsPerMessage random digits
    while (true) {
      val messages = (1 to messagesPerSec.toInt).map { messageNum =>
        val str = (1 to wordsPerMessage.toInt)
          .map(x => scala.util.Random.nextInt(10).toString)
          .mkString(" ")
        new KeyedMessage[String, String](topic, str)
      }.toArray
      producer.send(messages: _*)
      Thread.sleep(100)
    }
   The receiver of the incremental stream: a Kafka consumer that uses Spark Streaming to process the incoming data

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.kafka.KafkaUtils

    val Array(zkQuorum, group, topics, numThreads) = args
    val sparkConf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    // Checkpointing is required because reduceByKeyAndWindow with an inverse function keeps window state
    ssc.checkpoint("checkpoint")

    // Map each topic to the number of receiver threads that should consume it
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()

  Analyzing how Spark Streaming works, starting from the code

                1: DStream (the continuous stream of data, broken down into a sequence of data sets)
                        val lines = KafkaUtils.createStream()
                        lines is a DStream. The DStream is the basic unit Spark Streaming works with; it represents a sequence of RDDs. Suppose the data coming out of KafkaUtils.createStream() is
                                (DStreams support many of the transformations available on normal Spark RDD's)
                        Spark Streaming will cut the stream into a sequence of data sets, for example
                                RDD1: DStreams support many
                                RDD2: of the transformations available
                                RDD3: on normal Spark RDD's
                        RDD1, RDD2 and RDD3 together form one DStream - the lines above. Spark Streaming takes the DStream as its basic unit and provides high-level APIs on top of it to process streaming data.
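
                        To see this batching concretely, here is a minimal sketch (the socket source, local master and 2-second batch interval are illustrative assumptions, not from the original post): each batch interval, the data received in that interval becomes one RDD, and foreachRDD exposes those per-batch RDDs directly.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamBatches").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // lines is a DStream: one RDD per 2-second batch (RDD1, RDD2, RDD3, ... in the text above)
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD hands us the plain RDD behind each batch
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} lines")
    }

    ssc.start()
    ssc.awaitTermination()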

               2: Transformations (functions from one DStream to another, with the same semantics as RDD transformations in core Spark)
                   Spark Streaming processes data at the granularity of DStreams. For example,
                                           val words = lines.flatMap(_.split(" "))
                      applies a flatMap transformation to the lines DStream and produces a new DStream, words. Under the hood it applies flatMap to each of RDD1, RDD2 and RDD3. Spark Streaming offers a high-level DStream API that hides these per-RDD transformations (see the sketch after the table below). The DStream transformations currently available are:


Transformation / Meaning

map(func): Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func): Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
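
                   As a concrete illustration of how a DStream transformation maps to per-RDD work, here is a minimal sketch (the socket source and local master are assumptions for illustration): flatMap on a DStream is equivalent to applying the same flatMap to every batch RDD through transform.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamTransformations").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)

    // High-level DStream API: written once, executed against every batch RDD
    val words1 = lines.flatMap(_.split(" "))

    // Equivalent low-level form: transform exposes each batch as a plain RDD
    val words2 = lines.transform(rdd => rdd.flatMap(_.split(" ")))

    words1.print()
    words2.print()

    ssc.start()
    ssc.awaitTermination()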
   
               3: Window operations (special DStream-to-DStream functions unique to Spark Streaming)
                           val wordCounts = words.map(x => (x, 1L))
                             .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)      // reduceByKeyAndWindow is a window operation

                           A window operation is a transformation over a sliding time window; it periodically computes a value over a recent stretch of the stream. Here, every Seconds(2), the counts over the past Minutes(10) of the words DStream are computed, producing the wordCounts DStream.

                           Window operations are a good fit for top-N scenarios (see the sketch after the table below). The window operations currently provided are:

Transformation / Meaning

window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
countByValueAndWindow(windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. As in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
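
                           To make the top-N use case mentioned above concrete, here is a minimal sketch building on the wordCounts DStream from the consumer code (the cutoff of 5 words is an arbitrary choice for illustration): each windowed batch is sorted by count and the most frequent words are printed.

    // Sort each windowed batch of (word, count) pairs by count, descending
    val topWords = wordCounts.transform { rdd =>
      rdd.sortBy(_._2, ascending = false)
    }

    // Print the 5 most frequent words of the current 10-minute window
    topWords.foreachRDD { rdd =>
      println("Top 5 words in the current window:")
      rdd.take(5).foreach { case (word, count) => println(s"$word: $count") }
    }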

                           
