Apache Spark Streaming - Lecture Notes

Spark Streaming

Definition of Stream Computing

Stream computing is usually contrasted with batch computing. In the streaming model the input is continuous and can be considered unbounded in time, which means the full data set is never available for a computation; the results are likewise produced continuously and are unbounded in time. Stream computing generally has strict latency requirements: the computation is defined first and is then applied to the data as it arrives, and to improve efficiency incremental computation is used instead of full recomputation wherever possible. In the batch model, by contrast, the full data set exists first, the computation is defined afterwards and applied to the whole data set, and the result is produced once, in full.

Spark Streaming is a stream-processing framework built on top of the Spark batch engine. Unlike batch processing, the data a streaming job computes over is an unbounded data stream, and the output is produced continuously. Internally, Spark Streaming splits the continuous input into a series of small micro-batch RDDs to approximate stream processing; at the micro level Spark Streaming is therefore still a batch framework.
Batch Processing vs. Stream Processing

  • Batch processing - the input data is static; the data volume is large (GB and above); latency is high (minutes to hours); the computation runs in stages and terminates.
  • Stream processing - the input data is always dynamic; data arrives record by record (byte level); latency is low (milliseconds); the computation runs continuously, 7x24.

Mainstream stream-processing frameworks today: Kafka Streams (lightweight); Storm (JStorm) / Storm 2.0+ (first generation); Spark Streaming (second generation); Flink (Blink) (third generation).

Quick Start

  • pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
  • Write the Driver
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    //1. create the SparkConf and StreamingContext (1-second batch interval)
    val conf = new SparkConf().setAppName("wordcount").setMaster("local[6]")
    val ssc = new StreamingContext(conf,Seconds(1))

    //2. create a ReceiverInputDStream from the test socket source (netcat)
    val linesStream: ReceiverInputDStream[String] = ssc.socketTextStream("CentOS", 9999)

    //3. apply transformations to linesStream
    linesStream.flatMap(_.split(" "))
        .map((_,1))
        .reduceByKey(_+_)
        .print() // print the output to the console

    //4. start the computation
    ssc.start()
    //5. wait for termination
    ssc.awaitTermination()
  }
}

Netcat must be installed on the test machine (yum install -y nc); start a listener with nc -lk 9999 and type lines into it to feed the stream.

Discretized Streams (DStreams)

A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input stream received from a source or the processed stream produced by transforming another DStream. Internally, a DStream is represented by a continuous sequence of RDDs, Spark's abstraction for an immutable distributed data set. Each RDD in a DStream contains the data of one specific time interval, as shown in the figure below.
[Figure: a DStream is a sequence of RDDs, one per batch interval]

Any operation applied to a DStream is translated into operations on the underlying RDDs. For example, in the Quick Start example above, the flatMap operation is applied to every RDD of the lines DStream to generate the RDDs of the words DStream, as shown in the figure below.

[Figure: the flatMap operation is applied to each RDD of the lines DStream to produce the RDDs of the words DStream]
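Because every micro-batch of a DStream is just an RDD, this can be made visible with foreachRDD, which hands you the RDD of each batch together with its batch time. A minimal sketch, assuming the ssc and the netcat source from the Quick Start example (the printed values are only for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

val lines: DStream[String] = ssc.socketTextStream("CentOS", 9999)

// foreachRDD runs once per batch interval: a DStream is literally a sequence of RDDs over time
lines.foreachRDD { (rdd: RDD[String], time: Time) =>
  println(s"batch $time -> ${rdd.getNumPartitions} partitions, ${rdd.count()} records")
}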

Note: given how DStreams execute under the hood, the Seconds() interval configured on the StreamingContext should be slightly larger than the processing time of one micro-batch; otherwise batches queue up and back pressure builds. See https://blog.csdn.net/lingbo229/article/details/82380555
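Besides choosing a sufficiently large batch interval, Spark Streaming can also throttle its receivers. A minimal sketch, assuming the same local driver as in the Quick Start; spark.streaming.backpressure.enabled and spark.streaming.receiver.maxRate are standard Spark configuration keys, and the rate of 1000 records/s per receiver is only an illustrative value:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wordcount")
  .setMaster("local[6]")
  // let Spark adapt the receiving rate to the observed batch processing time
  .set("spark.streaming.backpressure.enabled", "true")
  // optional hard cap per receiver, in records per second (illustrative value)
  .set("spark.streaming.receiver.maxRate", "1000")

val ssc = new StreamingContext(conf, Seconds(1))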

Spark Streaming Program Structure

  • Build a StreamingContext(sparkConf, Seconds(1)).
  • Set up the data Receiver (Basic | Advanced).
  • Apply transformation operators to the DStream (micro-batch RDDs).
  • Start the streaming computation with ssc.start().
  • Wait for the computation to be shut down with ssc.awaitTermination().

Input DStreams and Receivers

In Spark Streaming every InputDStream (except file streams) is backed by a Receiver implementation. Each Receiver is responsible for receiving data from an external system and storing it in Spark's memory for later processing. Spark ships with two kinds of built-in sources for reading external data.

  • Basic sources: available directly through the StreamingContext API, for example file systems and socket connections.
  • Advanced sources: for example Kafka and Flume; these are created with helper libraries and generally require an extra third-party dependency.

For example, Kafka requires the spark-streaming-kafka-0-10_2.11 dependency.

Note: each Receiver occupies one core. When running a Spark Streaming program, make sure to allocate n cores to the application, where n is greater than the number of receivers; otherwise Spark can only receive the data but has no cores left to process it. See the sketch below.
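For example, a job that unions two socket streams (like the union example later in these notes) holds two receivers and therefore needs at least three local threads: two for the receivers and at least one for processing the batches. A minimal sketch, assuming the same CentOS/netcat endpoints used throughout these notes:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// two socket receivers occupy two cores permanently,
// so local[3] is the minimum that still leaves a core for the batch jobs
val conf = new SparkConf().setAppName("two-receivers").setMaster("local[3]")
val ssc  = new StreamingContext(conf, Seconds(1))

val s1 = ssc.socketTextStream("CentOS", 9999) // receiver #1
val s2 = ssc.socketTextStream("CentOS", 8888) // receiver #2
s1.union(s2).count().print()                  // processed on the remaining core(s)

ssc.start()
ssc.awaitTermination()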

Basic Sources
File Streams

File streams can read any file format compatible with the HDFS API. A DStream can be created with StreamingContext.fileStream[KeyClass, ValueClass, InputFormatClass]. File streams do not need an extra Receiver, so no cores have to be reserved for them.

  • textFileStream
val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
var ssc  =new StreamingContext(conf,Seconds(1))

val lines: DStream[String] = ssc.textFileStream("hdfs://CentOS:9000/files")

lines.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKey(_+_)
.print()

ssc.start()
ssc.awaitTermination()
  • fileStream
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
var ssc  =new StreamingContext(conf,Seconds(1))
ssc.sparkContext.setLogLevel("fatal")

val lines: DStream[(LongWritable,Text)] = ssc.fileStream[LongWritable,Text,TextInputFormat]("hdfs://CentOS:9000/files")

lines.map(t=>t._2.toString)
.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKey(_+_)
.print()

ssc.start()
ssc.awaitTermination()

Note: before using file streams, synchronize the clock of the node running the job with the clocks of the HDFS nodes.

[root@CentOS ~]# yum install -y ntp
[root@CentOS ~]# ntpdate time.ntp.org
16 Aug 15:10:55 ntpdate[1807]: step time server 139.199.215.251 offset -28784.287137 sec
[root@CentOS ~]# clock -w # write the current time to the hardware clock

Once a file has been considered processed (its processing window has passed), Spark ignores any later updates to it. Therefore, when using file streams to process text data, first upload the file to an HDFS directory that is not being monitored, and move it into the monitored directory only after the upload has finished.

[root@CentOS ~]# hdfs dfs -put install.log.syslog  /
[root@CentOS ~]# hdfs dfs -mv /install.log.syslog  /files
Custom Receivers

To build a custom receiver, extend org.apache.spark.streaming.receiver.Receiver and specify a storage level. A Receiver's job is to receive data from an external system and write it into Spark; the store method it provides writes the received data into Spark's memory.

class MySocketReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging{

  override def onStart(): Unit = { // start receiving data from the external system in onStart
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  override def onStop(): Unit = {}
  // read data from the remote end over a network socket
  private def receive() { 
    var socket: Socket = null
    var userInput: String = null
    try {
      // connect to netcat
      socket = new Socket(host, port)
      // read data from the remote socket
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      userInput = reader.readLine()
      while(!isStopped && userInput != null) {
        store(userInput) // store the received line in Spark
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      // restart the receiver once the connection is closed
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        // restart if could not connect to server
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        // restart if there is any other error
        restart("Error receiving data", t)
    }
  }
}
object MySocketReceiver{
  def apply(host: String, port: Int): MySocketReceiver = new MySocketReceiver(host, port)
}
Queue RDDs as a Stream (for testing)

To test a Spark Streaming application with test data, you can also create a DStream from a queue of RDDs with streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue is treated as one batch of data in the DStream and processed like a stream.

val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
var ssc  =new StreamingContext(conf,Seconds(1))
ssc.sparkContext.setLogLevel("fatal")
var queue=new mutable.Queue[RDD[String]]()

val lines: DStream[String] = ssc.queueStream(queue)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.reduceByKey(_+_)
.print()

ssc.start()

new Thread(){
    override def run(): Unit = {
        for(i<- 0 to 100){
            queue += ssc.sparkContext.makeRDD(List("hello spark"))
            Thread.sleep(100)
        }
    }
}.start()

ssc.awaitTermination()
Kafka Source
  • pom.xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.5</version>
</dependency>

This connector targets Kafka brokers 0.10 and later; it is not compatible with versions before 0.10. (A common interview topic: the differences between Kafka 0.8, 0.10 and 0.11.)

  • Write the driver
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectKafkaWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[6]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.sparkContext.setLogLevel("FATAL")
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "g1",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])

    // read from Kafka with the direct (receiver-less) API; topic offsets are tracked via Kafka / the checkpoint
    val messages = KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent, // location strategy: use PreferConsistent when the Spark executors and Kafka brokers run on different hosts
      ConsumerStrategies.Subscribe[String, String](List("topic01"), kafkaParams))

    messages.map(record=>record.value)
    .flatMap(line=>line.split(" "))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .print()
    
    ssc.start()
    ssc.awaitTermination()
  }
}

An older interview question on this topic: https://www.cnblogs.com/runnerjack/p/8597981.html

Transformations on DStreams

Similar to RDD transformations, these operators transform one DStream into another. Most of the common DStream operators behave the same way as their Spark RDD counterparts.

  • map(func): Return a new DStream by passing each element of the source DStream through a function func.
  • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
  • filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
  • repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
  • union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
  • count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
  • reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel.
  • countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
  • reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
  • join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
  • cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
  • transform(func): Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
  • updateStateByKey(func): Return a new “state” DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
map
//1,zhangsan,true
lines.map(line=> line.split(","))
    .map(words=>(words(0).toInt,words(1),words(2).toBoolean))
    .print()
flatMap
//hello spark
lines.flatMap(line=> line.split("\\s+"))
        .map((_,1)) //(hello,1)(spark,1)
        .print()
filter
// keep only the lines that contain "hello"
lines.filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()
repartition (change the number of partitions)
lines.repartition(10) // change the parallelism (number of partitions)
    .filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()
union (merge two streams)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
    .filter(line => line.contains("hello"))
    .flatMap(line=> line.split("\\s+"))
    .map((_,1))
    .print()
count
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)
    .flatMap(line=> line.split("\\s+"))
    .count() // count the number of elements in each micro-batch RDD
    .print()

reduce(func)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10)  // aa bb
    .flatMap(line=> line.split("\\s+"))
    .reduce(_+"|"+_)
    .print() //aa|bb
countByValue (count occurrences of each value)
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)
stream1.union(stream2).repartition(10) // a a b c
    .flatMap(line=> line.split("\\s+"))
    .countByValue() // (a,2) (b,1) (c,1)
    .print()
reduceByKey(func, [numTasks])
var lines:DStream[String]=ssc.socketTextStream("CentOS",9999) //this is spark this
    lines.repartition(10)
    .flatMap(line=> line.split("\\s+").map((_,1)))
    .reduceByKey(_+_)// (this,2)(is,1)(spark ,1)
    .print()

join(otherStream, [numTasks])

//1 zhangsan
val stream1: DStream[String] = ssc.socketTextStream("CentOS",9999)
//1 apple 1 4.5
val stream2: DStream[String] = ssc.socketTextStream("CentOS",8888)

val userPair:DStream[(String,String)]=stream1.map(line=>{
    var tokens= line.split(" ")
    (tokens(0),tokens(1))
})
val orderItemPair:DStream[(String,(String,Double))]=stream2.map(line=>{
    val tokens = line.split(" ")
    (tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
userPair.join(orderItemPair)
.map(t=>(t._1,t._2._1,t._2._2._1,t._2._2._2))//1 zhangsan apple 4.5
.print()

Both streams must produce matching records in the same batch interval, otherwise the join yields nothing, so joining two live streams this way is of limited practical use.

transform

transform lets you combine a stream with an RDD, because it exposes the underlying micro-batch RDD; this makes stream-batch joins possible.

//1 apple 2 4.5
val orderLog: DStream[String] = ssc.socketTextStream("CentOS",8888)
var userRDD=ssc.sparkContext.makeRDD(List(("1","zhangs"),("2","wangw")))
val orderItemPair:DStream[(String,(String,Double))]=orderLog.map(line=>{
    val tokens = line.split(" ")
    (tokens(0),(tokens(1),tokens(2).toInt * tokens(3).toDouble))
})
orderItemPair.transform(rdd=> rdd.join(userRDD))
.print()
updateStateByKey (stateful computation, full output)
ssc.checkpoint("file:///D:/checkpoints")//存储程序的状态信息,以及代码

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .updateStateByKey((newValues:Seq[Int],state:Option[Int])=>{ // full (complete) output for every key
        val newValue = newValues.sum
        val historyValue=state.getOrElse(0)
        Some(newValue+historyValue)
    })
.print()

A checkpoint directory must be configured to store the program's state.

mapWithState (stateful computation, incremental output)
ssc.checkpoint("file:///D:/checkpoints")//存储程序的状态信息,以及代码

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.mapWithState(StateSpec.function((k:String,v:Option[Int],stage:State[Int])=>{
    var total:Int=0
    if(stage.exists()){
        total=stage.getOption().getOrElse(0)
    }
    total += v.getOrElse(0)
    stage.update(total) // update the stored state
    (k,total)
}))
.print() // update-style output: only the key-value pairs updated in this batch are emitted

A checkpoint directory must be configured to store the program's state.

Window Operations

Spark Streaming supports window computations, which group data along the time dimension: all records that fall into a window's time range become the input of that window's computation.

The window length and the slide interval must both be integer multiples of the micro-batch interval. For example, with a 2 s micro-batch the window length can only be 2, 4, 6, 8, ... seconds, and the slide interval 2, 4, 6, ... seconds (up to the window length), as the sketch below illustrates.
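A minimal sketch of this constraint, assuming the same socket source used elsewhere in these notes (the durations are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf  = new SparkConf().setMaster("local[6]").setAppName("window-check")
val ssc   = new StreamingContext(conf, Seconds(2))   // micro-batch interval = 2 s
val lines = ssc.socketTextStream("CentOS", 9999)

// OK: window length 10 s and slide 4 s are both multiples of the 2 s batch interval
lines.window(Seconds(10), Seconds(4)).count().print()

// rejected when the DStream is constructed: 3 s is not a multiple of the 2 s batch interval
// lines.window(Seconds(3), Seconds(2)).count().print()

ssc.start()
ssc.awaitTermination()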

In stream processing there are generally three notions of time that a window can be based on: ingestion time, processing time, and event time.

From the definition of a DStream it follows that DStreams only support processing-time based computation. If you want event-time based computation, Spark supports that as well: event-time semantics are implemented in the Structured Streaming module.

Some common window operations are listed below. All of them take two parameters: windowLength and slideInterval.

  • window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
  • countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
  • reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.
  • reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark’s default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
  • reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
  • countByValueAndWindow(windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
window(windowLength, slideInterval)

Every 5 seconds, trigger a computation over a window covering the last 10 seconds.

var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .window(Seconds(10),Seconds(5))
    .reduceByKey(_+_)
    .print() 

From these examples one can observe, for instance, that countByWindow is equivalent to window + count and reduceByKeyAndWindow is equivalent to window + reduceByKey; see the sketch below.
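A minimal sketch of the countByWindow equivalence, reusing the socket source from the example above (countByWindow is implemented with an inverse reduce internally, so it also needs a checkpoint directory; the local path is illustrative):

ssc.checkpoint("file:///D:/checkpoints") // required by countByWindow's incremental implementation

val lines: DStream[String] = ssc.socketTextStream("CentOS", 9999)
val words = lines.flatMap(_.split("\\s+"))

// explicit form: window first, then count
words.window(Seconds(10), Seconds(5)).count().print()

// shorthand form: the same result in a single call
words.countByWindow(Seconds(10), Seconds(5)).print()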

Since count, reduce, reduceByKey and countByValue have all been covered in the sections above, we only demonstrate reduceByKeyAndWindow here:

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
    lines.flatMap(_.split("\\s+"))
    .map((_,1))
    //.window(Seconds(10),Seconds(5))
    //.reduceByKey(_+_)
    .reduceByKeyAndWindow((v1:Int,v2:Int)=>v1+v2,Seconds(10),Seconds(5))
    .print()

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

When more than half of the elements of consecutive windows overlap, the variant with invFunc is more efficient.

val conf = new SparkConf().setMaster("local[6]").setAppName("wordcount")
var ssc  =new StreamingContext(conf,Seconds(1))
ssc.sparkContext.setLogLevel("fatal")
ssc.checkpoint("file:///D:/checkpoints")
    var lines:DStream[String]=ssc.socketTextStream("CentOS",9999)
    lines.flatMap(_.split("\\s+"))
    .map((_,1)).reduceByKeyAndWindow(
        (v1:Int,v2:Int)=>v1+v2,
        (v1:Int,v2:Int)=>v1-v2,
        Seconds(6),Seconds(3),
        filterFunc = (t:(String,Int))=> t._2>0)
    .print()

ssc.start()
ssc.awaitTermination()
Output Operations
  • print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
  • foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
foreachRDD(func)

Build a custom KafkaSink that writes the results of the stream computation to Kafka.

import java.util.{Properties, UUID}

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

class KafkaSink(topic:String,severs:String) extends Serializable {

  def createKafkaConnection(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,severs)
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true") // enable the idempotent producer
    props.put(ProducerConfig.RETRIES_CONFIG,"2") // number of retries
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"100") // batch size in bytes
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000") // linger for at most 1000 ms
    new KafkaProducer[String,String](props)
  }

  lazy val kafkaProducer:KafkaProducer[String,String]= createKafkaConnection()
  Runtime.getRuntime.addShutdownHook(new Thread(){
    override def run(): Unit = {
      kafkaProducer.close()
    }
  })
  def save(vs: Iterator[(String, Int)]): Unit = {

    try{
      vs.foreach(tuple=>{
        val record = new ProducerRecord[String,String](topic,tuple._1,tuple._2.toString)
        kafkaProducer.send(record)
      })

    }catch {
      case e: Exception => println("send an alert email: failed to write to Kafka")
    }

  }
}
val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[6]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.sparkContext.setLogLevel("FATAL")
ssc.checkpoint("file:///D:/checkpoints")
val kafkaParams = Map[String, Object](
    ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
    ConsumerConfig.GROUP_ID_CONFIG -> "g1",
    ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
    ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])

val kafkaSinkBroadcast=ssc.sparkContext.broadcast(new KafkaSink("topic02","CentOS:9092"))

// read from Kafka with the direct (receiver-less) API; topic offsets are tracked via Kafka / the checkpoint
val messages = KafkaUtils.createDirectStream[String, String](ssc,
                                                             LocationStrategies.PreferConsistent, // location strategy: use PreferConsistent when the Spark executors and Kafka brokers run on different hosts
                                                             ConsumerStrategies.Subscribe[String, String](List("topic01"), kafkaParams))

messages.map(record=>record.value)
.flatMap(line=>line.split(" "))
.map(word => (word, 1))
.mapWithState(StateSpec.function((k:String,v:Option[Int],stage:State[Int])=>{
    var total:Int=0
    if(stage.exists()){
        total=stage.getOption().getOrElse(0)
    }
    total += v.getOrElse(0)
    stage.update(total) // update the stored state
    (k,total)
}))
.foreachRDD(rdd=>{
    rdd.foreachPartition(vs=>{
        val kafkaSink = kafkaSinkBroadcast.value
        kafkaSink.save(vs)
    })
})

ssc.start()
ssc.awaitTermination()
DStream Fault Recovery
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KafkaSink extends Serializable {

  def createKafkaConnection(): KafkaProducer[String, String] = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"CentOS:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,classOf[StringSerializer].getName)
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG,"true") // enable the idempotent producer
    props.put(ProducerConfig.RETRIES_CONFIG,"2") // number of retries
    props.put(ProducerConfig.BATCH_SIZE_CONFIG,"100") // batch size in bytes
    props.put(ProducerConfig.LINGER_MS_CONFIG,"1000") // linger for at most 1000 ms
    new KafkaProducer[String,String](props)
  }

  lazy val kafkaProducer:KafkaProducer[String,String]= createKafkaConnection()
  Runtime.getRuntime.addShutdownHook(new Thread(){
    override def run(): Unit = {
      kafkaProducer.close()
    }
  })
  def save(vs: Iterator[(String, Int)]): Unit = {

    try{
      vs.foreach(tuple=>{
        val record = new ProducerRecord[String,String]("topic02",tuple._1,tuple._2.toString)
        kafkaProducer.send(record)
      })

    }catch {
      case e: Exception => println("send an alert email: failed to write to Kafka")
    }

  }
}
val checkpointDir="file:///D:/checkpointdir"
val ssc=StreamingContext.getOrCreate(checkpointDir,()=>{
    println("==========init ssc==========")
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[6]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint(checkpointDir)
    val kafkaParams = Map[String, Object](
        ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "CentOS:9092",
        ConsumerConfig.GROUP_ID_CONFIG -> "g1",
        ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
        ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])


    // read from Kafka with the direct (receiver-less) API; topic offsets are tracked via Kafka / the checkpoint
    val messages = KafkaUtils.createDirectStream[String, String](ssc,
                                                                 LocationStrategies.PreferConsistent, // location strategy: use PreferConsistent when the Spark executors and Kafka brokers run on different hosts
                                                                 ConsumerStrategies.Subscribe[String, String](List("topic01"), kafkaParams))

    messages.map(record=>record.value)
    .flatMap(line=>line.split(" "))
    .map(word => (word, 1))
    .mapWithState(StateSpec.function((k:String,v:Option[Int],stage:State[Int])=>{
        var total:Int=0
        if(stage.exists()){
            total=stage.getOption().getOrElse(0)
        }
        total += v.getOrElse(0)
        stage.update(total) // update the stored state
        (k,total)
    }))
    .foreachRDD(rdd=>{
        rdd.foreachPartition(vs=>{
            KafkaSink.save(vs)
        })
    })
    ssc
})

ssc.sparkContext.setLogLevel("FATAL")

ssc.start()
ssc.awaitTermination()

If you change the code of the streaming computation, you must clear the checkpoint directory, otherwise the change will not take effect.
