Spark-Streaming wordCount

DStream
A DStream (Discretized Stream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data (a discretized stream), either the input stream itself or the result stream produced by applying Spark operations to it. Internally, a DStream is represented by a series of consecutive RDDs, where each RDD holds the data for one time interval.
DStreams can be created from various input sources, such as Flume, Kafka, or HDFS.

Operations on the data are carried out RDD by RDD.

The actual computation is performed by the Spark engine.
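
As a hedged illustration of creating a DStream from one of those sources, the sketch below monitors an HDFS directory with textFileStream; the application name, path, and batch interval are placeholder assumptions, not values from the original post.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]")
    // One batch (one RDD) is produced every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))
    // Monitor an HDFS directory for newly written text files (path is hypothetical)
    val lines = ssc.textFileStream("hdfs://namenode:9000/streaming/input")
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}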

DStream operations
The primitives on DStreams are similar to those on RDDs and fall into two categories: transformations and output operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
As of Spark 1.1 these are only available in Java and Scala.

Transformations on DStreams
map(func): Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func): Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

Special transformations
  1. UpdateStateByKey Operation: the updateStateByKey primitive is used to keep per-key state (historical records) across batches.
  2. Transform Operation: allows an arbitrary RDD-to-RDD function to be applied to a DStream. This makes it easy to extend the Spark API, and it is also how MLlib and GraphX are combined with Spark Streaming.
  3. Window Operation: similar in spirit to stateful windowing in Storm; by setting the window length and the sliding interval you can dynamically obtain the running state of the stream over that window (see the sketch after this list).
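
As a hedged sketch of a window operation (my own example, not from the original code), the snippet below counts words over a 30-second window that slides every 10 seconds; both durations are assumptions and must be multiples of the batch interval.

// Assuming ds is a DStream[String] created as in the wordCount demo below
val windowedCounts = ds.flatMap(_.split(" "))
  .map((_, 1))
  // Aggregate over the last 30 seconds, recomputed every 10 seconds
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()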

Output Operations on DStreams
Output operations write the data of a DStream out to an external database or file system. Only when an output operation is invoked (analogous to an action on an RDD) does the streaming program actually start the real computation.
print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix]): Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]): Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
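
Since foreachRDD is the most general output operation, here is a hedged sketch (my own example, not from the original post) of pushing each batch to an external system; the println is a placeholder for whatever sink-specific write you would actually use.

// Assuming result is a DStream[(String, Int)] as in the wordCount demo below
result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Runs on the executors; open one connection per partition here if writing to a real sink
    partition.foreach { case (word, count) =>
      println(s"$word -> $count") // replace with a write to a database or other external system
    }
  }
}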


Set the batch interval, i.e. how often a new batch (a new RDD) is produced.
5s -> RDD -> RDD (in order)

wordCountDemo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {

  def main(args: Array[String]): Unit = {
    // Lower the log level so the batch output is easier to read (see LoggerLevels below)
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // One batch is produced every 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Receive text data over a socket
    val ds = ssc.socketTextStream("127.0.0.1", 8888)
    // A DStream is a sequence of RDDs; apply the usual wordCount transformations
    val result = ds.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // Print the result of each batch
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }

}

Send test data to the socket with netcat: nc -lk 8888

Setting the output log level
import org.apache.spark.Logging
import org.apache.log4j.{Level, Logger}

object LoggerLevels extends Logging {

  def setStreamingLogLevels() {
    // Only change the level if the user has not configured log4j themselves
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
Cumulative (stateful) wordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object StatFullWordCount {

  // Update function for updateStateByKey: for each key x, sum the new counts of this
  // batch (y) and add them to the previous state (z), emitting (key, newTotal)
  val func = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    // Equivalent: iter.flatMap(it => Some(it._2.sum + it._3.getOrElse(0)).map(m => (it._1, m)))
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(m => (x, m)) }
  }

  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StatFullWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    // updateStateByKey requires a checkpoint directory, set on the StreamingContext
    ssc.checkpoint("/Users/chenxiaokang/Desktop/checkpoint")

    val ds = ssc.socketTextStream("127.0.0.1", 8888)
    val result = ds.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      .updateStateByKey(func, new HashPartitioner(sc.defaultParallelism), true)

    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
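
For reference, a hedged sketch of the simpler and more common updateStateByKey signature, which takes the new values and the previous state of a single key (my own illustration, not part of the original post):

// newValues are the counts from the current batch, runningCount is the accumulated total
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) => {
  Some(newValues.sum + runningCount.getOrElse(0))
}
// Usage, assuming the same pairs DStream as above:
// val result = ds.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(updateFunc)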


User profiling
How to identify a user who is not logged in:
  1. IP (not recommended)
  2. session (not recommended)
  3. cookie (recommended; persist it to a file)
Client -> cookie -> Kafka -> Streaming (broadcast variable), sketched below.
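
A hedged sketch of the last step of that pipeline: a broadcast variable holding reference data (here a hypothetical cookie-to-user map) is looked up inside the streaming job. The map contents and names are assumptions, and a socket stream stands in for the Kafka source; sc and ssc are assumed to exist as in the code above.

// Hypothetical reference data, shipped to all executors once as a broadcast variable
val cookieToUser = sc.broadcast(Map("cookie-001" -> "userA", "cookie-002" -> "userB"))

// In the described pipeline this would be a Kafka DStream; a socket stream stands in here
val events = ssc.socketTextStream("127.0.0.1", 8888)
val tagged = events.map { cookie =>
  // Look up the broadcast map on the executor side
  (cookie, cookieToUser.value.getOrElse(cookie, "unknown"))
}
tagged.print()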