Spark-Streaming wordCount

DStream
A DStream (Discretized Stream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data (a discretized stream), either the input stream itself or the result stream produced by applying Spark operations to it. Internally, a DStream is represented by a series of consecutive RDDs, where each RDD holds the data for one time interval.
DStreams can be created from various input sources, such as Flume, Kafka, or HDFS.

Operations on the data are carried out RDD by RDD.

The actual computation is performed by the Spark engine.
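
As a hedged illustration of creating a DStream from one of those sources, the sketch below monitors an HDFS directory with textFileStream; the application name, path, and batch interval are placeholder assumptions, not values from the original post.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]")
    // One batch (one RDD) is produced every 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))
    // Monitor an HDFS directory for newly written text files (path is hypothetical)
    val lines = ssc.textFileStream("hdfs://namenode:9000/streaming/input")
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}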

DStream operations
The primitives on DStreams are similar to those on RDDs and fall into two categories: transformations and output operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
As of Spark 1.1 these are only available in Java and Scala.

Transformations on DStreams
map(func): Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func): Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.

Special transformations
  1. UpdateStateByKey Operation: the updateStateByKey primitive is used to keep per-key state (historical records) across batches.
  2. Transform Operation: allows an arbitrary RDD-to-RDD function to be applied to a DStream. This makes it easy to extend the Spark API, and it is also how MLlib and GraphX are combined with Spark Streaming.
  3. Window Operation: similar in spirit to stateful windowing in Storm; by setting the window length and the sliding interval you can dynamically obtain the running state of the stream over that window (see the sketch after this list).
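
As a hedged sketch of a window operation (my own example, not from the original code), the snippet below counts words over a 30-second window that slides every 10 seconds; both durations are assumptions and must be multiples of the batch interval.

// Assuming ds is a DStream[String] created as in the wordCount demo below
val windowedCounts = ds.flatMap(_.split(" "))
  .map((_, 1))
  // Aggregate over the last 30 seconds, recomputed every 10 seconds
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedCounts.print()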

Output Operations on DStreams
Output operations write the data of a DStream out to an external database or file system. Only when an output operation is invoked (analogous to an action on an RDD) does the streaming program actually start the real computation.
print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix]): Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]): Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
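
Since foreachRDD is the most general output operation, here is a hedged sketch (my own example, not from the original post) of pushing each batch to an external system; the println is a placeholder for whatever sink-specific write you would actually use.

// Assuming result is a DStream[(String, Int)] as in the wordCount demo below
result.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Runs on the executors; open one connection per partition here if writing to a real sink
    partition.foreach { case (word, count) =>
      println(s"$word -> $count") // replace with a write to a database or other external system
    }
  }
}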


Set the batch interval, i.e. how often a new batch (a new RDD) is produced.
5s -> RDD -> RDD (in order)

wordCountDemo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {

  def main(args: Array[String]): Unit = {
    // Lower the log level so the batch output is easier to read (see LoggerLevels below)
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // One batch is produced every 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Receive text data over a socket
    val ds = ssc.socketTextStream("127.0.0.1", 8888)
    // A DStream is a sequence of RDDs; apply the usual wordCount transformations
    val result = ds.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // Print the result of each batch
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }

}

Send test data to the socket with netcat: nc -lk 8888

Setting the output log level
import org.apache.spark.Logging
import org.apache.log4j.{Level, Logger}

object LoggerLevels extends Logging {

  def setStreamingLogLevels() {
    // Only change the level if the user has not configured log4j themselves
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
Cumulative (stateful) wordCount
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object StatFullWordCount {

  // Update function for updateStateByKey: for each key x, sum the new counts of this
  // batch (y) and add them to the previous state (z), emitting (key, newTotal)
  val func = (iter: Iterator[(String, Seq[Int], Option[Int])]) => {
    // Equivalent: iter.flatMap(it => Some(it._2.sum + it._3.getOrElse(0)).map(m => (it._1, m)))
    iter.flatMap { case (x, y, z) => Some(y.sum + z.getOrElse(0)).map(m => (x, m)) }
  }

  def main(args: Array[String]): Unit = {
    LoggerLevels.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("StatFullWordCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))
    // updateStateByKey requires a checkpoint directory, set on the StreamingContext
    ssc.checkpoint("/Users/chenxiaokang/Desktop/checkpoint")

    val ds = ssc.socketTextStream("127.0.0.1", 8888)
    val result = ds.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      .updateStateByKey(func, new HashPartitioner(sc.defaultParallelism), true)

    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
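
For reference, a hedged sketch of the simpler and more common updateStateByKey signature, which takes the new values and the previous state of a single key (my own illustration, not part of the original post):

// newValues are the counts from the current batch, runningCount is the accumulated total
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) => {
  Some(newValues.sum + runningCount.getOrElse(0))
}
// Usage, assuming the same pairs DStream as above:
// val result = ds.flatMap(_.split(" ")).map((_, 1)).updateStateByKey(updateFunc)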


User profiling
How to identify a user who is not logged in:
  1. IP (not recommended)
  2. session (not recommended)
  3. cookie (recommended; persist it to a file)
Client -> cookie -> Kafka -> Streaming (broadcast variable), sketched below.
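
A hedged sketch of the last step of that pipeline: a broadcast variable holding reference data (here a hypothetical cookie-to-user map) is looked up inside the streaming job. The map contents and names are assumptions, and a socket stream stands in for the Kafka source; sc and ssc are assumed to exist as in the code above.

// Hypothetical reference data, shipped to all executors once as a broadcast variable
val cookieToUser = sc.broadcast(Map("cookie-001" -> "userA", "cookie-002" -> "userB"))

// In the described pipeline this would be a Kafka DStream; a socket stream stands in here
val events = ssc.socketTextStream("127.0.0.1", 8888)
val tagged = events.map { cookie =>
  // Look up the broadcast map on the executor side
  (cookie, cookieToUser.value.getOrElse(cookie, "unknown"))
}
tagged.print()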