Spark Streaming

1. Spark Streaming

  • Spark Streaming is a scalable, high-throughput, fault-tolerant stream-processing engine.

2. Features of Spark Streaming

  • 1. Ease of use
    • Streaming programs are written much like offline batch jobs
    • APIs are available in Scala, Java, and Python
  • 2. Fault tolerance
    • Can provide exactly-once processing semantics
  • 3. Integration with the Spark stack
    • Runs on the Spark engine, so streaming jobs can be combined with batch, SQL, and machine-learning workloads

3. How Spark Streaming works

  • Spark Streaming is a micro-batch stream-processing engine built on Spark. Its basic principle is to collect the input data into batches at a fixed time interval; when the batch interval is shortened to the order of seconds, it can be used to process near-real-time data streams.
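The batching idea can be sketched without Spark at all. The following pure-Scala illustration (the event timestamps and the 5-second interval are invented for the example) buckets timestamped events into batch-interval windows; Spark Streaming does the analogous thing to a live input stream and then runs an ordinary batch job on each bucket.

```scala
//A toy model of micro-batching (plain Scala, not Spark): events carry an
//arrival timestamp in milliseconds and are grouped into 5-second buckets.
object MicroBatchSketch {
  val batchIntervalMs = 5000L

  //Assign each (timestampMs, payload) event to its batch bucket.
  def toBatches(events: Seq[(Long, String)]): Map[Long, Seq[String]] =
    events.groupBy { case (ts, _) => ts / batchIntervalMs }
          .map { case (bucket, evs) => (bucket, evs.map(_._2)) }

  def main(args: Array[String]): Unit = {
    //events at 1s and 4s land in bucket 0; the event at 6s lands in bucket 1
    println(toBatches(Seq((1000L, "a"), (4000L, "b"), (6000L, "c"))))
  }
}
```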

4. DStream

  • The Discretized Stream (DStream) is Spark Streaming's basic abstraction. It represents a continuous stream of data: either the input stream itself or the result stream produced by applying Spark operators to it.

5. DStream operations

  • Transformations
    • Like the transformation operators on RDDs, these are lazy and produce a new DStream.
  • Output operations
    • Like RDD actions, these trigger execution of the job and produce result data.
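The transformation/output-operation split mirrors lazy versus eager evaluation, which can be demonstrated with plain Scala collections (an analogy only, not Spark code):

```scala
//A plain-Scala analogy: `view.map` is lazy like a DStream transformation,
//and forcing the result (here with `sum`) plays the role of an output
//operation such as print().
object LazyVsEager {
  //returns (evaluated count before forcing, after forcing, and the total)
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    val transformed = Seq(1, 2, 3).view.map { x => evaluated += 1; x * 2 }
    val before = evaluated      //still 0: the transformation ran nothing yet
    val total = transformed.sum //forcing triggers the actual work
    (before, evaluated, total)
  }

  def main(args: Array[String]): Unit =
    println(run()) //(0,3,12)
}
```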

6. DStream operations in practice

The following Maven dependency is required; the Flume and Kafka examples further below also need the spark-streaming-flume_2.11 and spark-streaming-kafka-0-8_2.11 artifacts at the same version:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.0.2</version>
        </dependency>
  • 1. Receive socket data with Spark Streaming and implement WordCount
package cn.test.stream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

//todo: receive socket data with Spark Streaming and count words
object SparkStreamingSocket {
  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf. The master must be local[N] with N > 1:
    //   one thread receives the data while another processes it.
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingSocket").setMaster("local[2]")
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext from the SparkContext and a batch interval
    val ssc = new StreamingContext(sc, Seconds(5))
    //4. Receive socket data through the StreamingContext
    val stream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.200.100", 9999)
    //5. Split each line into words
    val words: DStream[String] = stream.flatMap(_.split(" "))
    //6. Map each word to a count of 1
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    //7. Sum the counts for each word within the batch
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)

    //8. Print the result
    result.print()

    //9. Start the streaming computation
    ssc.start()
    //block until the computation is stopped
    ssc.awaitTermination()
  }
}
  • 2. Receive socket data with Spark Streaming and keep a running word count across batches
package cn.test.stream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

//todo: receive socket data with Spark Streaming and accumulate word counts across all batches
object SparkStreamingSocketTotal {

  //update function for updateStateByKey
  //currentValues: all the 1s recorded for a word in the current batch, e.g. (hadoop,1) (hadoop,1) (hadoop,1)
  //historyValues: the word's total count across all previous batches, e.g. (hadoop,100)
  def updateFunc(currentValues: Seq[Int], historyValues: Option[Int]): Option[Int] = {
    val newValue: Int = currentValues.sum + historyValues.getOrElse(0)
    Some(newValue)
  }

  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingSocketTotal").setMaster("local[2]")
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    //set the checkpoint directory; updateStateByKey needs it to store state
    ssc.checkpoint("./ck")
    //4. Receive socket data
    val stream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.200.100", 9999)
    //5. Split each line into words
    val words: DStream[String] = stream.flatMap(_.split(" "))
    //6. Map each word to a count of 1
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    //7. Fold each batch's counts into the running total per word
    val result: DStream[(String, Int)] = wordAndOne.updateStateByKey(updateFunc)

    //8. Print the result
    result.print()

    //9. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
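updateStateByKey calls the update function once per key on every batch. A hand-run of the same updateFunc in plain Scala (the batch contents are invented) shows how the running total accumulates:

```scala
//Pure Scala, no Spark: apply the update function batch by batch for one key.
object UpdateStateSketch {
  def updateFunc(currentValues: Seq[Int], historyValues: Option[Int]): Option[Int] =
    Some(currentValues.sum + historyValues.getOrElse(0))

  def main(args: Array[String]): Unit = {
    //batch 1: "hadoop" appears 3 times, no history yet
    val afterBatch1 = updateFunc(Seq(1, 1, 1), None)     //Some(3)
    //batch 2: "hadoop" appears twice; history is now 3
    val afterBatch2 = updateFunc(Seq(1, 1), afterBatch1) //Some(5)
    println(afterBatch2)
  }
}
```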
  • 3. Sliding-window word count with reduceByKeyAndWindow
package cn.test.stream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

//todo: count words over a sliding window with reduceByKeyAndWindow
object SparkStreamingSocketWindow {

  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingSocketWindow").setMaster("local[2]")
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    //set the checkpoint directory
    ssc.checkpoint("./ck")
    //4. Receive socket data
    val stream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.200.100", 9999)
    //5. Split each line into words
    val words: DStream[String] = stream.flatMap(_.split(" "))
    //6. Map each word to a count of 1
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    //7. Sum the counts for each word over a sliding window
    //reduceByKeyAndWindow takes three arguments:
    //reduceFunc: the reduce function
    //windowDuration: the window length
    //slideDuration: the slide interval, i.e. how often the window is evaluated
    //both durations must be multiples of the batch interval; a window shorter
    //than the slide would silently skip data, so the window comes first and is
    //the larger of the two here
    val result: DStream[(String, Int)] = wordAndOne.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(10), Seconds(5))

    //8. Print the result
    result.print()

    //9. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
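The window arithmetic can be checked in plain Scala. With a 5-second batch interval, a 10-second window and a 5-second slide would make each output cover the two most recent batches; the sketch below (batch counts invented) computes those windowed sums for a single word:

```scala
//Pure Scala, no Spark: each element of batchCounts is one batch's count
//for a single word; windowBatches = windowDuration / batchInterval.
object WindowSketch {
  def windowedSums(batchCounts: Seq[Int], windowBatches: Int): Seq[Int] =
    batchCounts.indices.map { i =>
      batchCounts.slice(math.max(0, i - windowBatches + 1), i + 1).sum
    }

  def main(args: Array[String]): Unit = {
    //counts 2, 3, 1 in consecutive batches, window of 2 batches
    println(windowedSums(Seq(2, 3, 1), 2)) //Vector(2, 5, 4)
  }
}
```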
  • 4. Use a window to find the hottest words over a period of time
    • call transform to apply custom RDD logic (sorting and ranking) to each batch
package cn.test.stream

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

//todo: use a window to find the hottest words, i.e. those appearing most frequently, within a unit of time
object SparkStreamingSocketWindowHotWords {

  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingSocketWindowHotWords").setMaster("local[2]")
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    //set the checkpoint directory
    ssc.checkpoint("./ck")
    //4. Receive socket data
    val stream: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.200.100", 9999)
    //5. Split each line into words
    val words: DStream[String] = stream.flatMap(_.split(" "))
    //6. Map each word to a count of 1
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    //7. Sum the counts for each word over a sliding window
    //reduceByKeyAndWindow takes three arguments:
    //reduceFunc: the reduce function
    //windowDuration: the window length
    //slideDuration: the slide interval, i.e. how often the window is evaluated
    val result: DStream[(String, Int)] = wordAndOne.reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(10), Seconds(5))
    //8. Sort by occurrence count in descending order
    val sortedDstream: DStream[(String, Int)] = result.transform(rdd => {
      //sort the RDD by occurrence count in descending order
      val sortedRDD: RDD[(String, Int)] = rdd.sortBy(_._2, false)
      //take the top 3
      val sortHotWords: Array[(String, Int)] = sortedRDD.take(3)
      //print the top 3 on the driver
      sortHotWords.foreach(x => println(x))
      sortedRDD
    })

    //9. Print the sorted result
    sortedDstream.print()

    //10. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }

}
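The ranking inside transform boils down to sort-descending-then-take(3). The same logic on a plain collection (counts invented for the example):

```scala
//Pure Scala, no Spark: sort word counts in descending order and keep the top 3.
object TopNSketch {
  def top3(wordCounts: Seq[(String, Int)]): Seq[(String, Int)] =
    wordCounts.sortBy(-_._2).take(3)

  def main(args: Array[String]): Unit = {
    val counts = Seq(("spark", 7), ("flume", 2), ("kafka", 5), ("hive", 1))
    println(top3(counts)) //List((spark,7), (kafka,5), (flume,2))
  }
}
```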

7. Integrating Spark Streaming with Flume

  • Poll mode (Spark pulls events from a Flume SparkSink)

      package cn.test.dstream.flume

      import java.net.InetSocketAddress

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
      import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

      //todo: read Flume data with Spark Streaming and count words ------ poll (pull) mode
      object SparkStreamingFlume_Poll {
        def main(args: Array[String]): Unit = {
          //1. Create the SparkConf
          val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume_Poll").setMaster("local[2]")
          //2. Create the SparkContext
          val sc = new SparkContext(sparkConf)
          sc.setLogLevel("WARN")
          //3. Create the StreamingContext
          val ssc = new StreamingContext(sc, Seconds(5))
          //define a set of Flume addresses; data can be pulled from several Flume agents at once
          val address = Seq(new InetSocketAddress("192.168.200.100", 9999), new InetSocketAddress("192.168.200.101", 9999))

          //4. Pull the data from Flume
          val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK_SER_2)
          //5. Extract the Flume event body from the DStream  {"header":xxxxx   "body":xxxxxx}
          val lineDstream: DStream[String] = stream.map(x => new String(x.event.getBody.array()))
          //6. Split each line and map each word to a count of 1
          val wordAndOne: DStream[(String, Int)] = lineDstream.flatMap(_.split(" ")).map((_, 1))
          //7. Sum the counts for each word
          val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
          //8. Print the result
          result.print()

          //start the computation
          ssc.start()
          ssc.awaitTermination()
        }
      }
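For poll mode, the Flume agent must use Spark's custom SparkSink, which requires placing the spark-streaming-flume-sink jar (and a matching scala-library jar) in Flume's lib/ directory. A sketch of a matching agent config follows; the agent name a1, the tailed file path, and the channel capacity are illustrative, and the hostname/port must match the addresses used in the code above.

```properties
# Flume agent sketch for poll mode: Spark pulls events from the SparkSink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/test.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# custom sink shipped in the spark-streaming-flume-sink artifact
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
a1.sinks.k1.hostname = 192.168.200.100
a1.sinks.k1.port = 9999
a1.sinks.k1.channel = c1
```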
  • Push mode (Flume pushes events to the Spark receiver)
package cn.test.dstream.flume

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

//todo: read Flume data with Spark Streaming and count words ------ push mode
object SparkStreamingFlume_Push {

  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingFlume_Push").setMaster("local[2]")
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    //4. Receive the data Flume pushes; the hostname/port is where this Spark receiver listens
    val stream: ReceiverInputDStream[SparkFlumeEvent] = FlumeUtils.createStream(ssc, "192.168.11.123", 9999)
    //5. Extract the Flume event body from the DStream  {"header":xxxxx   "body":xxxxxx}
    val lineDstream: DStream[String] = stream.map(x => new String(x.event.getBody.array()))
    //6. Split each line and map each word to a count of 1
    val wordAndOne: DStream[(String, Int)] = lineDstream.flatMap(_.split(" ")).map((_, 1))
    //7. Sum the counts for each word
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    //8. Print the result
    result.print()

    //start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
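In push mode the roles reverse: Flume pushes events over Avro to the host and port where the Spark receiver listens, so the Spark job should be running before the agent starts. A sketch of a matching agent config (the agent name, source, and tailed file path are illustrative):

```properties
# Flume agent sketch for push mode: an avro sink pointed at the Spark receiver.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/test.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = avro
# must match the host/port passed to FlumeUtils.createStream
a1.sinks.k1.hostname = 192.168.11.123
a1.sinks.k1.port = 9999
a1.sinks.k1.channel = c1
```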

8. Integrating Spark Streaming with Kafka

  • KafkaUtils.createStream (receiver-based, high-level Kafka API ----- offsets are stored in ZooKeeper)
package cn.test.kafka

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

import scala.collection.immutable

//todo: integrate Spark Streaming with Kafka --- receiver-based (high-level API)
object SparkStreamingKafkaReceiver {
  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf()
                                .setAppName("SparkStreamingKafkaReceiver")
                                .setMaster("local[4]") //the thread count must exceed the number of receivers
                                .set("spark.streaming.receiver.writeAheadLog.enable", "true")
                                //enable the write-ahead log (WAL) so received data survives failures
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("./spark_receiver")
    //4. Prepare the ZooKeeper quorum
    val zkQuorum = "node1:2181,node2:2181,node3:2181"
    //5. Prepare the consumer group id
    val groupId = "spark_receiver"
    //6. Define the topics. The value is not the topic's partition count but the
    //number of consumer threads used to read the topic
    val topics = Map("spark_01" -> 2)
    //7. KafkaUtils.createStream receives data from the Kafka topic
    //(String, String): the first String is the message key, the second is the message value
    //use several receivers to read the topic in parallel
    val dstreamSeq: immutable.IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(x => {
      val stream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
      stream
    })

    //union the streams from all receivers through the StreamingContext
    val totalDstream: DStream[(String, String)] = ssc.union(dstreamSeq)

    //8. Extract the message values
    val topicData: DStream[String] = totalDstream.map(_._2)
    //9. Split each line and map each word to a count of 1
    val wordAndOne: DStream[(String, Int)] = topicData.flatMap(_.split(" ")).map((_, 1))
    //10. Sum the counts for each word
    val result: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    //11. Print the result
    result.print()

    //12. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}

  • KafkaUtils.createDirectStream (direct, low-level Kafka API ----- offsets are tracked by the client program)
package cn.test.kafka

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils

//todo: integrate Spark Streaming with Kafka ---- direct mode (low-level API)
object SparkStreamingKafkaDirect {
  def main(args: Array[String]): Unit = {
    //1. Create the SparkConf
    val sparkConf: SparkConf = new SparkConf().setAppName("SparkStreamingKafkaDirect").setMaster("local[2]")
    //2. Create the SparkContext
    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("WARN")
    //3. Create the StreamingContext
    val ssc = new StreamingContext(sc, Seconds(5))
    ssc.checkpoint("./spark_direct") //the checkpoint stores the topic offsets
    //4. Prepare the Kafka parameters
    val kafkaParams = Map("metadata.broker.list" -> "node1:9092,node2:9092,node3:9092", "group.id" -> "spark_direct")
    //5. Prepare the topics
    val topics = Set("spark_01")
    //6. Read the data from Kafka
    val dstream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
    //7. Extract the message values
    val data: DStream[String] = dstream.map(_._2)
    //8. Split each line, map each word to 1, and sum the counts per word
    val result: DStream[(String, Int)] = data.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    //9. Print the result
    result.print()

    //10. Start the streaming computation
    ssc.start()
    ssc.awaitTermination()
  }
}
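In direct mode the application owns offset tracking (here delegated to the checkpoint). A toy model of that bookkeeping in plain Scala (partition numbers and offsets invented) shows the idea: after each batch, record the next offset to read per partition so a restart can resume without reprocessing:

```scala
//Pure Scala, no Spark/Kafka: minimal per-partition offset bookkeeping.
object OffsetBookkeeping {
  type Offsets = Map[Int, Long] //partition -> next offset to read

  //After consuming a batch that ended at lastConsumed, advance the cursor.
  def commit(offsets: Offsets, partition: Int, lastConsumed: Long): Offsets =
    offsets.updated(partition, lastConsumed + 1)

  def main(args: Array[String]): Unit = {
    var offsets: Offsets = Map(0 -> 0L, 1 -> 0L)
    offsets = commit(offsets, 0, 41L) //a batch read partition 0 through offset 41
    println(offsets(0)) //42
  }
}
```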