Spark Streaming Notes
Entry class: StreamingContext
// Constructors (from the StreamingContext source)
def this(sparkContext: SparkContext, batchDuration: Duration) = {
this(sparkContext, null, batchDuration)
}
def this(conf: SparkConf, batchDuration: Duration) = {
this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
// Two ways to construct a StreamingContext:
new StreamingContext(new SparkContext(new SparkConf()), batchDuration)
new StreamingContext(new SparkConf(), batchDuration)
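A minimal sketch of both styles with concrete values (the object name and the 5-second batch interval are placeholders, not part of the notes above):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Durations, StreamingContext}
object CtorDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CtorDemo")
    // Style 1: build the SparkContext yourself, then wrap it
    // val sc  = new SparkContext(conf)
    // val ssc = new StreamingContext(sc, Durations.seconds(5))
    // Style 2: let StreamingContext create the SparkContext from the conf
    val ssc = new StreamingContext(conf, Durations.seconds(5))
    // ... register DStreams and at least one output operation, then:
    // ssc.start(); ssc.awaitTermination()
  }
}
Only one StreamingContext can be active per JVM, which is why style 1 is commented out here.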
Spark Streaming Example 1 (template only, no business logic):
Note: a streaming job normally runs indefinitely and is not shut down; if you kill it by hand you may see an exception, which is expected (a sketch of a graceful stop follows the example).
package day09
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object SteamingWordCount extends MyApp{
val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
val ssc = new StreamingContext(conf, Durations.seconds(5))
//Receive the socket text stream from nc (netcat)
private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
lines.print()
//Business logic goes here
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
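If you do have to stop a running job yourself, here is a hedged sketch of a graceful stop for the example above (registering a shutdown hook is just one option; setting spark.streaming.stopGracefullyOnShutdown=true achieves the same effect):
// Register before ssc.awaitTermination(); on JVM shutdown, finish in-flight batches first.
sys.addShutdownHook {
  // stopSparkContext = true also stops the underlying SparkContext;
  // stopGracefully = true waits for received data to be processed before exiting.
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}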
Spark Streaming Example 2 (with business logic):
package day09
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object SteamingWordCount1 extends MyApp{
val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
val ssc = new StreamingContext(conf, Durations.seconds(5))
//Receive the socket text stream from nc (netcat)
private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
lines.print()
//Business logic
lines.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
Note: the output above comes from two separate batches.
DStream transformation operators
The common operators are covered below, first the stateless ones and then the stateful ones.
Stateless operations
Like RDD transformations: map, flatMap, reduceByKey, etc. are all stateless operations.
package day09
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object SteamingWordCount2 extends MyApp{
System.setProperty("HADOOP_USER_NAME", "root")
val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
val ssc = new StreamingContext(conf, Durations.seconds(5))
ssc.checkpoint("/checkpints/20210623")
//Receive the socket text stream from nc (netcat)
private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
lines.print() //print is an independent output, like opening another view of the stream
//Business logic
//Stateless operations
//Approach 1:
// lines.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).print()
//Approach 2:
// lines.foreachRDD(rdd=>{
// rdd.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).foreach(println(_))
// })
//Approach 3: from its signature, transform works like map, but on each batch's RDD
lines.transform(rdd=>{
rdd.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_)
}).print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
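Beyond word count, a common use of transform is combining each batch with a static RDD. A sketch reusing lines and ssc from the example above (the blacklist contents are hypothetical, and the code would go before ssc.start()):
// Hypothetical static blacklist; in practice it might be loaded from a file or a database.
val blacklist = ssc.sparkContext.parallelize(Seq("spam", "ad")).map(word => (word, true))
lines.flatMap(_.split("\\s"))
  .map((_, 1))
  .transform { rdd =>
    // Join each batch against the blacklist and drop the words that are on it.
    rdd.leftOuterJoin(blacklist)
      .filter { case (_, (_, flagged)) => flagged.isEmpty }
      .mapValues { case (count, _) => count }
  }
  .reduceByKey(_ + _)
  .print()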
[Screenshot: output of approach 2]
[Screenshot: output of approach 3]
Stateful operations
Mainly: updateStateByKey and window (transform, covered above, is itself stateless).
Note: any stateful operator needs a checkpoint directory, because the state is persisted there via checkpointing; in these examples it is stored on HDFS.
1. updateStateByKey explained:
def updateStateByKey[S: ClassTag](
updateFunc: (Seq[V], Option[S]) => Option[S]
): DStream[(K, S)] = ssc.withScope {
updateStateByKey(updateFunc, defaultPartitioner())
}
updateFunc takes (Seq[V]: the values for one key in the current batch, Option[S]: the previous state, where S can be any type) and returns the new state as Option[S].
The state Option[S] is bound to the key: one state per key.
updateStateByKey returns DStream[(K, S)].
DStream[(String,Int)].updateStateByKey((values:Seq[Int],state:Option[Int])=>{
Some(values.sum + state.getOrElse(0))
})
Explanation: Some(values.sum + state.getOrElse(0)) adds the sum of the current batch's values for the key to the previously stored total.
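Because the state type S is arbitrary, the state can hold more than a single count. A sketch, assuming pairs is a DStream[(String, Int)] like the one above, keeping (running total, contribution of the latest batch) per key:
val withState = pairs.updateStateByKey((values: Seq[Int], state: Option[(Int, Int)]) => {
  val batchSum = values.sum                              // sum for this key in the current batch
  val total    = state.map(_._1).getOrElse(0) + batchSum // previous total + current batch
  Some((total, batchSum))                                // new state: (total, latest increment)
})
withState.print()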
2. updateStateByKey example
package day09
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object SteamingWordCount1 extends MyApp{
System.setProperty("HADOOP_USER_NAME", "root")
val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
val ssc = new StreamingContext(conf, Durations.seconds(5))
ssc.checkpoint("/checkpints/20210623")
//Receive the socket text stream from nc (netcat)
private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
lines.print()
//Business logic
//Stateless version (commented out)
// lines.flatMap(line=>line.split("\\s")).map((_,1)).reduceByKey(_+_).print()
//Stateful version
lines.flatMap(line=>line.split("\\s")).map((_,1)).updateStateByKey((values:Seq[Int],state:Option[Int])=>{
Some(values.sum+state.getOrElse(0))
}).print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
Note: this approach folds the previous state into each new result.
Supplement: RDD operators accepting partial functions
scala> val rdd3=sc.makeRDD(List(1,2,3,"xx","a"))
scala> rdd3.map({case x=>x+"1"}).collect
res5: Array[String] = Array(11, 21, 31, xx1, a1)
scala> rdd3.map({case x:String=>x+"1" ; case y:Int=>y+1}).collect
res6: Array[Any] = Array(2, 3, 4, xx1, a1)
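A related sketch in the same session: RDD.collect also has an overload that takes a partial function and keeps only the elements it matches, so the non-Int elements are simply dropped (the result line shows the expected output):
scala> rdd3.collect{case i:Int=>i+1}.collect
res7: Array[Int] = Array(2, 3, 4)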
3. window example
Note:
Windows are defined in terms of time: startTime to endTime.
The window length must be an integer multiple of the batch interval; in fact both window parameters (length and slide) must be multiples of the batch interval.
Tumbling window: effectively one parameter, because the slide distance equals the window length.
Sliding window: two parameters; the slide distance differs from the window length, which is what distinguishes it from a tumbling window. (A sketch using the generic window operator follows.)
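Besides reduceByKeyAndWindow (used in the examples below), the generic window(length, slide) operator exposes the windowed data directly. A sketch of a sliding word count, assuming the same 5-second batch interval and lines DStream as in the examples:
// Every 5 seconds, recompute the word count over the data of the last 10 seconds.
lines.flatMap(_.split("\\s"))
  .map((_, 1))
  .window(Durations.seconds(10L), Durations.seconds(5L)) // window length, slide interval
  .reduceByKey(_ + _)
  .print()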
3.1 Tumbling window
package day09
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object SteamingWordCount3 extends MyApp{
System.setProperty("HADOOP_USER_NAME", "root")
val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
val ssc = new StreamingContext(conf, Durations.seconds(5))
ssc.checkpoint("/checkpints/20210623")
//Receive the socket text stream from nc (netcat)
private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
lines.print() //print is an independent output, separate from the window logic below
//Business logic
//Define a window over the last two batches (10 seconds)
//Note: with only the window length given, the slide defaults to the batch interval (5s),
//so output is emitted every batch; pass the slide explicitly (equal to the length) for a strict tumbling window
lines.flatMap(line=>line.split("\\s").map((_,1)))
.reduceByKeyAndWindow((x:Int,y:Int)=>x+y,Durations.seconds(10L))
.print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
3.2 Sliding window
package day09
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object SteamingWordCount3 extends MyApp{
System.setProperty("HADOOP_USER_NAME", "root")
val conf = new SparkConf().setMaster("local[*]").setAppName("SteamingWordCount")
val ssc = new StreamingContext(conf, Durations.seconds(5))
ssc.checkpoint("/checkpints/20210623")
//Receive the socket text stream from nc (netcat)
private val lines: ReceiverInputDStream[String] = ssc.socketTextStream("192.168.1.101",6666)
lines.print() //print is an independent output; it emits once per 5-second batch
//Business logic
//Define a window over the last two batches (10 seconds), computed once every 10 seconds
//Sliding-window form (two parameters); here the length and the slide are both 10s,
//so with equal parameters it effectively behaves as a tumbling window
lines.flatMap(line=>line.split("\\s").map((_,1)))
.reduceByKeyAndWindow((x:Int,y:Int)=>x+y,Durations.seconds(10L),Durations.seconds(10L))
.print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
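For a long window with a short slide, reduceByKeyAndWindow also has an overload that takes an inverse reduce function, so each step adds the new batches and subtracts the expired ones instead of re-aggregating the whole window. A sketch (this overload requires checkpointing, which is already enabled above; the 30-second window is just an example):
lines.flatMap(_.split("\\s"))
  .map((_, 1))
  .reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y,   // fold in values entering the window
    (x: Int, y: Int) => x - y,   // subtract values leaving the window
    Durations.seconds(30L),      // window length
    Durations.seconds(5L)        // slide interval
  )
  .print()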
DStream output operations
foreachRDD
print
saveAsTextFiles
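foreachRDD is the general-purpose sink. A sketch writing each batch to a relational database, assuming a DStream[String] named dstream; the JDBC URL, credentials, table, and HDFS path are placeholders, and a MySQL driver on the classpath is assumed:
import java.sql.DriverManager
// The foreachRDD closure runs on the driver once per batch; the code inside
// foreachPartition runs on the executors, one connection per partition (not per record).
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = DriverManager.getConnection("jdbc:mysql://192.168.1.101:3306/test", "root", "root")
    val stmt = conn.prepareStatement("INSERT INTO lines(line) VALUES (?)")
    records.foreach { record =>
      stmt.setString(1, record)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}
// Alternatively: dstream.saveAsTextFiles("hdfs://192.168.1.101:9000/streaming/out", "txt")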
Spark Streaming integration with Kafka
1. Spark consuming data from Kafka
package day09
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object StreamingKafka extends MyApp{
val conf=new SparkConf().setMaster("local[2]").setAppName("Streaming-Kafka")
val ssc=new StreamingContext(conf,Durations.seconds(5))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "192.168.1.101:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group0623",
"auto.offset.reset" -> "latest", //如果组id不存在Kafka中,则从新位置消费,否则不⽣效
"enable.auto.commit" -> (false: java.lang.Boolean) //⼿动
)
val topics = Seq("spark")
private val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe(topics, kafkaParams))
dstream.map(_.value()).print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
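Since enable.auto.commit is false, offsets have to be committed back to Kafka by hand. A sketch of the usual pattern with commitAsync, which would replace the simple print() above (note that the cast only works on the original direct stream, not on derived DStreams):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
dstream.foreachRDD { rdd =>
  // Offset ranges of the Kafka partitions contained in this batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, e.g. rdd.map(_.value()).foreach(println) ...
  // Commit the offsets once the batch has been processed.
  dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}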
After starting the application, start the Kafka producer.
2. Simple metric computation
package day09
import com.alibaba.fastjson.JSON
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}
import utils.MyApp
object StreamingKafka extends MyApp{
val conf=new SparkConf().setMaster("local[2]").setAppName("Streaming-Kafka")
val ssc=new StreamingContext(conf,Durations.seconds(5))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "192.168.1.101:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "group0623",
"auto.offset.reset" -> "latest", //如果组id不存在Kafka中,则从新位置消费,否则不⽣效
"enable.auto.commit" -> (false: java.lang.Boolean) //⼿动
)
val topics = Seq("travel_ods_logs")
//Read data from Kafka
private val dstream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe(topics, kafkaParams))
dstream.map(_.value()).print()
import scala.collection.JavaConverters._
//JSON string -> Scala map
dstream.map(record=>{
val jsonstr = record.value()
val log = JSON.parse(jsonstr).asInstanceOf[java.util.Map[String,String]].asScala
(log.getOrElse("userID","-1"),log.getOrElse("ct","-1"),log.getOrElse("sid","-1"))
}).print()
//Start the StreamingContext
ssc.start()
//Block so the StreamingContext does not exit
ssc.awaitTermination()
}
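Building on the (userID, ct, sid) fields parsed above, a sketch of one simple metric: the number of distinct sessions (sid) per 5-second batch. It reuses the imports and dstream of the example (and, like the rest of the logic, belongs before ssc.start()); the choice of metric is only an illustration:
dstream.map(record => {
    val log = JSON.parse(record.value()).asInstanceOf[java.util.Map[String, String]].asScala
    log.getOrElse("sid", "-1")          // keep only the session id
  })
  .transform(rdd => rdd.distinct())     // distinct session ids within the batch
  .count()                              // number of distinct sessions
  .print()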
Real-time processing is always stream processing, but stream processing is not necessarily real-time.
Spark Streaming is a near-real-time (micro-batch) computing framework.
Spark Streaming is the stream-computing engine in the Spark ecosystem.
Flink is a truly real-time, stream-oriented engine with much lower latency than Spark Streaming.
Any stateful operator requires a checkpoint directory.
window: a left-closed, right-open interval; when a window closes it is computed and then discarded.
If the two window parameters are equal it is a tumbling window; if they differ it is a sliding window.
The first parameter is the window length; the second is the slide interval.