Spark Streaming Notes

Entry class: StreamingContext

// Constructors
def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

// Two ways to construct one:
new StreamingContext(new SparkContext(new SparkConf()), batchDuration)
new StreamingContext(new SparkConf(), batchDuration)

Spark Streaming Example 1 (template, no business logic):

Note: a streaming job is normally left running rather than shut down; stopping it by hand may throw an exception, and that is expected (a graceful-stop sketch follows the example below).

package day09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object SteamingWordCount extends MyApp{
  val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
  val ssc = new StreamingContext(conf, Durations.seconds(5))

  // Receive the nc socket text stream
  private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
  lines.print()

  // Business logic goes here
  
  // Start the StreamingContext
  ssc.start()
  // Block so the application does not exit
  ssc.awaitTermination()
}
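If the job does need to be stopped from code, a minimal sketch (not part of the original notes) is to call stop with the graceful flag, typically from a separate thread or a JVM shutdown hook, since awaitTermination() blocks:

// Finish processing the data already received, then stop the StreamingContext
// and the underlying SparkContext; an abrupt stop is what produces the
// exceptions mentioned above.
ssc.stop(stopSparkContext = true, stopGracefully = true)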


Spark Streaming Example 2 (with business logic):

package day09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object SteamingWordCount1 extends MyApp{
  val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
  val ssc = new StreamingContext(conf, Durations.seconds(5))
  // Receive the nc socket text stream
  private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
  lines.print()
  // Business logic
  lines.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).print()
  // Start the StreamingContext
  ssc.start()
  // Block so the application does not exit
  ssc.awaitTermination()
}


Note: the input above was processed across two separate batches.


DStream transformation operators

Common transformation operators include map, flatMap, filter, union, count, reduce, reduceByKey, join, cogroup, transform, updateStateByKey, and the window operations.

Stateless operations

As with the RDD transformations, map / flatMap / reduceByKey ... are all stateless operations.

package day09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object SteamingWordCount2 extends MyApp{
  System.setProperty("HADOOP_USER_NAME", "root")
  val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
  val ssc = new StreamingContext(conf, Durations.seconds(5))

  ssc.checkpoint("/checkpints/20210623")

  // Receive the nc socket text stream
  private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
  lines.print()  // print() can be seen as a separate output of the raw stream
  // Business logic
  // Stateless operations
  // Approach 1: operate on the DStream directly
//  lines.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).print()
  // Approach 2: foreachRDD, then operate on each batch RDD
//  lines.foreachRDD(rdd=>{
//    rdd.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).foreach(println(_))
//  })

  // Approach 3: judging by its signature, transform is the RDD-level analogue of map (RDD => RDD)
  lines.transform(rdd=>{
        rdd.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_)
      }).print()
  // Start the StreamingContext
  ssc.start()
  // Block so the application does not exit
  ssc.awaitTermination()
}


Stateful operations

The main stateful operators are updateStateByKey and window (transform, as shown above, behaves like a stateless per-batch map).

Note: whenever a stateful operator is used, a checkpoint directory must be configured, because the state is persisted to it via checkpointing; here it is stored on HDFS.

1. updateStateByKey explained:

def updateStateByKey[S: ClassTag](
    updateFunc: (Seq[V], Option[S]) => Option[S]
  ): DStream[(K, S)] = ssc.withScope {
  updateStateByKey(updateFunc, defaultPartitioner())
}

updateFunc takes (Seq[V]: the values collected for a key in the current batch, Option[S]: the previous state, where S can be any structure) and returns the new state as an Option[S].

The Option[S] state is bound to its key: one state per key. updateStateByKey returns a DStream[(K, S)].

DStream[(String,Int)].updateStateByKey((values:Seq[Int],state:Option[Int])=>{
  Some(values.sum + state.getOrElse(0))
})

Explanation: Some(values.sum + state.getOrElse(0)) adds the sum of the current batch's values for a key to the state accumulated so far.

2. updateStateByKey example
package day09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object SteamingWordCount1 extends MyApp{
  System.setProperty("HADOOP_USER_NAME", "root")
  val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
  val ssc = new StreamingContext(conf, Durations.seconds(5))

  ssc.checkpoint("/checkpints/20210623")

  // Receive the nc socket text stream
  private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
  lines.print()
  // Business logic
  // Stateless version
//  lines.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).print()
  // Stateful version: updateStateByKey
  lines.flatMap(line=>line.split("\\s")).map((_,1)).updateStateByKey((values:Seq[Int],state:Option[Int])=>{
    Some(values.sum+state.getOrElse(0))
  }).print()
  
  // Start the StreamingContext
  ssc.start()
  // Block so the application does not exit
  ssc.awaitTermination()
}

Note: with this approach every new result also folds in the previously accumulated state.


Extra: passing a partial function to an RDD operator

scala> val rdd3=sc.makeRDD(List(1,2,3,"xx","a"))

scala> rdd3.map({case x=>x+"1"}).collect
res5: Array[String] = Array(11, 21, 31, xx1, a1)

scala> rdd3.map({case x:String=>x+"1" ; case y:Int=>y+1}).collect
res6: Array[Any] = Array(2, 3, 4, xx1, a1)
3. window example

Notes:

Windows are defined in terms of time: startTime ~ endTime.

The window length must be an integer multiple of the batch interval; put differently, both window parameters (length and slide) must be multiples of the batch interval.

Tumbling window: one parameter; the slide interval equals the window length.

Sliding window: two parameters; unlike a tumbling window, the slide interval differs from the window length.

3.1 Tumbling window

package day09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object SteamingWordCount3 extends MyApp{
  System.setProperty("HADOOP_USER_NAME", "root")
  val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
  val ssc = new StreamingContext(conf, Durations.seconds(5))

  ssc.checkpoint("/checkpints/20210623")

  // Receive the nc socket text stream
  private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
  lines.print()  // print() can be seen as a separate output of the raw stream
  // Business logic
  // Define a window so the computation runs once every two batches
  // Tumbling window: only the window length is given
  lines.flatMap(line=>line.split("\\s").map((_,1)))
    .reduceByKeyAndWindow((x:Int,y:Int)=>x+y,Durations.seconds(10L))
    .print()
  
  // Start the StreamingContext
  ssc.start()
  // Block so the application does not exit
  ssc.awaitTermination()
}


3.2 Sliding window

package day09

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object SteamingWordCount3 extends MyApp{
  System.setProperty("HADOOP_USER_NAME", "root")
  val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
  val ssc = new StreamingContext(conf, Durations.seconds(5))

  ssc.checkpoint("/checkpints/20210623")

  // Receive the nc socket text stream
  private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
  lines.print()  // print() outputs the raw stream once per 5-second batch
  // Business logic
  // Define a window so the computation runs once every two batches
  // Sliding window: both the window length and the slide interval are given
  // With window = slide = 10s this computes every 10s (equal parameters make it effectively a tumbling window; a smaller slide, e.g. 5s, would give overlapping windows)
  lines.flatMap(line=>line.split("\\s").map((_,1)))
    .reduceByKeyAndWindow((x:Int,y:Int)=>x+y,Durations.seconds(10L),Durations.seconds(10L))
    .print()

  // Start the StreamingContext
  ssc.start()
  // Block so the application does not exit
  ssc.awaitTermination()
}

DStream output

foreachRDD
print
saveAsTextFiles
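A minimal sketch of the output operators, assuming a DStream[(String, Int)] named wordCounts like the one built in the word-count examples above (the name is illustrative, not from the original notes):

// foreachRDD runs on the driver once per batch; work on the RDD itself runs on the executors
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // replace println with a write to an external sink (database, Kafka, ...)
    partition.foreach(println)
  }
}

// saveAsTextFiles writes one directory per batch, named <prefix>-<batch time in ms>[.<suffix>]
wordCounts.saveAsTextFiles("hdfs:///streaming/wordcount", "txt")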

Spark Streaming integration with Kafka

1. Spark consuming data from Kafka
package day09

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object StreamingKafka extends MyApp{

  val conf=new SparkConf().setMaster("local[2]").setAppName("Streaming-Kafka")
  val ssc=new StreamingContext(conf,Durations.seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "192.168.1.101:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "group0623",
    "auto.offset.reset" -> "latest", //如果组id不存在Kafka中,则从新位置消费,否则不⽣效
    "enable.auto.commit" -> (false: java.lang.Boolean) //⼿动
  )

  val topics = Seq("spark")

  private val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe(topics, kafkaParams))
  dstream.map(_.value()).print()

  // Start the StreamingContext
  ssc.start()

  // Block so the application does not exit
  ssc.awaitTermination()

}

Start the job first, then start the Kafka producer.
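Because enable.auto.commit is set to false, the offsets have to be committed by hand. A minimal sketch (not in the original notes) following the spark-streaming-kafka-0-10 commit pattern; these lines would replace the simple print() above, before ssc.start():

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

dstream.foreachRDD { rdd =>
  // capture this batch's offset ranges before any transformation that loses them
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // commit the offsets back to Kafka once the batch has been handled
  dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}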


2. Simple metric statistics
package day09

import com.alibaba.fastjson.JSON
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp

object StreamingKafka extends MyApp{

  val conf=new SparkConf().setMaster("local[2]").setAppName("Streaming-Kafka")
  val ssc=new StreamingContext(conf,Durations.seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "192.168.1.101:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "group0623",
    "auto.offset.reset" -> "latest", //如果组id不存在Kafka中,则从新位置消费,否则不⽣效
    "enable.auto.commit" -> (false: java.lang.Boolean) //⼿动
  )

  val topics = Seq("travel_ods_logs")

  // Read data from Kafka
  private val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe(topics, kafkaParams))
  dstream.map(_.value()).print()

  import scala.collection.JavaConverters._
  // JSON string -> Scala Map
  dstream.map(record=>{
    val jsonstr = record.value()
    val log = JSON.parse(jsonstr).asInstanceOf[java.util.Map[String,String]].asScala
    (log.getOrElse("userID","-1"),log.getOrElse("ct","-1"),log.getOrElse("sid","-1"))

  }).print()

  // Start the StreamingContext
  ssc.start()

  // Block so the application does not exit
  ssc.awaitTermination()
}
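The example above only extracts (userID, ct, sid) and prints the tuples. A minimal sketch of an actual metric on top of it (same field layout as above; the val name parsed is illustrative, and these lines would go inside the object before ssc.start()):

// parsed: DStream[(userID, ct, sid)]
val parsed = dstream.map(record => {
  val log = JSON.parse(record.value()).asInstanceOf[java.util.Map[String, String]].asScala
  (log.getOrElse("userID", "-1"), log.getOrElse("ct", "-1"), log.getOrElse("sid", "-1"))
})

// events per user in each 5-second batch
parsed.map { case (userId, _, _) => (userId, 1) }
  .reduceByKey(_ + _)
  .print()

// distinct sessions (sid) per batch, using transform to reach the underlying RDD
parsed.transform(rdd => rdd.map(_._3).distinct())
  .count()
  .print()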

Real-time processing is always stream processing, but stream processing is not necessarily real-time.

Spark Streaming is a near-real-time (micro-batch) computing framework.

Spark Streaming is the stream-computing engine of the Spark ecosystem.

Flink is a truly real-time engine with much lower latency than Spark Streaming can reach.

Flink is purely real-time, stream-oriented computation.

Whenever a stateful operator is used, a checkpoint directory must be configured.

window: a left-closed, right-open time interval; when the window closes it is computed and then discarded.

If the two window parameters are equal it is a tumbling window, otherwise a sliding window; the first parameter is the window length, the second is the slide interval.
