DStream
DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data and is, in essence, a sequence of RDDs; the method dstream.foreachRDD() makes this explicit. Operations on a DStream are therefore operations on each RDD inside it. As with any data abstraction, we study it from three angles: how to obtain it, how to transform it, and how to output it.
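To make the point concrete, here is a minimal sketch (the socket host/port and batch interval are just placeholders borrowed from the examples below) that uses dstream.foreachRDD() to print the id and record count of the single RDD produced in each batch:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("foreachRDDDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    val lines = ssc.socketTextStream("hadoop101", 9999)
    // Each batch interval yields exactly one RDD; foreachRDD hands it to us directly
    lines.foreachRDD { rdd =>
      println(s"batch RDD id=${rdd.id}, records=${rdd.count()}")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}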
Obtaining a DStream
Spark Streaming natively supports many data sources, the most important of which is Kafka. An RDD queue can be used for stress testing.
RDD queue
As mentioned above, a DStream is itself made up of a sequence of RDDs, so building a DStream from a queue of RDDs is the natural starting point.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

object queueStreaming {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("queueStreaming").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    val sc: SparkContext = ssc.sparkContext
    import scala.collection.mutable
    // Build a mutable queue of RDDs
    val RDDQ: mutable.Queue[RDD[String]] = mutable.Queue[RDD[String]]()
    // ssc.queueStream reads from this queue (oneAtATime = false: drain all queued RDDs in each batch)
    val RDDStream: InputDStream[String] = ssc.queueStream(RDDQ, false)
    // Word count over each batch
    RDDStream.map(x => (x, 1)).reduceByKey(_ + _).print(10)
    // Start the streaming context
    ssc.start()
    // Note: only enqueue data after the context has started
    while (true) {
      RDDQ.enqueue(sc.parallelize(mutable.Seq("word")))
      Thread.sleep(1000)
    }
    ssc.awaitTermination()
  }
}
Socket data source
Mainly used for testing DStreams.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream

object socketStreaming {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setAppName("socketStreaming").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    // ssc.socketTextStream listens on a host:port and receives text data
    val RDDStream: InputDStream[String] = ssc.socketTextStream("hadoop101", 9999)
    RDDStream.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).print(10)
    ssc.start()
    ssc.awaitTermination()
  }
}
Kafka data source
This is the most important source for DStreams. The Kafka project introduced a new consumer API between versions 0.8 and 0.10, so there are two separate corresponding Spark Streaming packages, i.e. two sets of APIs.
spark-streaming-kafka-0-8
This API was deprecated as of Spark 2.3.0 and removed in later releases. It supports both the receiver-based approach (data is received on one partition and then redistributed, which is less efficient) and the direct approach, and it offers two ways of creating the stream. With checkpoint-based recovery alone, Spark Streaming only guarantees at-least-once; exactly-once requires managing the offsets yourself, as in Way2 below.
Way1
Use KafkaUtils to create the stream; this is the easy way.
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object KafKaStreaming2 {
  def createSSC() = {
    val conf: SparkConf = new SparkConf().setAppName("a").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Checkpointing gives at-least-once semantics on recovery
    ssc.checkpoint("./ck1")
    val para: Map[String, String] = Map[String, String](
      // Consumer group
      ConsumerConfig.GROUP_ID_CONFIG -> "0830",
      // Broker addresses
      "bootstrap.servers" -> "hadoop101:9092,hadoop102:9092,hadoop103:9092"
    )
    val sourceDStream: InputDStream[(String, String)] =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, para, topics = Set("SS"))
    sourceDStream.print(10)
    ssc
  }

  def main(args: Array[String]): Unit = {
    // getActiveOrCreate: recover the StreamingContext from the checkpoint, or create a new one
    val ssc = StreamingContext.getActiveOrCreate("./ck1", createSSC)
    ssc.start()
    ssc.awaitTermination()
  }
}
Way2
// Use the low-level API to talk to Kafka directly
// Reading and writing the offsets yourself is what makes true exactly-once possible
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaCluster.Err
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster, KafkaUtils, OffsetRange}

object KafKaStreaming3 {
  val GROUP_ID = "0830"
  val para: Map[String, String] = Map[String, String](
    ConsumerConfig.GROUP_ID_CONFIG -> GROUP_ID,
    "bootstrap.servers" -> "hadoop101:9092,hadoop102:9092,hadoop103:9092"
  )
  val topic = Set("SS")
  private val kafkaCluster = new KafkaCluster(para)
  var resultMap = Map[TopicAndPartition, Long]()

  /**
   * Read the stored offsets for every partition of the subscribed topics.
   */
  def readOffset(): Map[TopicAndPartition, Long] = {
    // Get all partitions of the topics
    val topicAndPartition: Either[Err, Set[TopicAndPartition]] = kafkaCluster.getPartitions(topic)
    topicAndPartition match {
      case Right(topicAndPartitionSet) =>
        val topicAndPartitionAndOffset: Either[Err, Map[TopicAndPartition, Long]] =
          kafkaCluster.getConsumerOffsets(GROUP_ID, topicAndPartitionSet)
        topicAndPartitionAndOffset match {
          // Not the first time this group consumes: reuse the stored offsets
          case Right(b) => resultMap ++= b
          // First time this group consumes: start every partition at offset 0
          case Left(a) => topicAndPartitionSet.foreach((topicAndPartition: TopicAndPartition) =>
            resultMap += topicAndPartition -> 0L
          )
        }
      case _ => // Left: nothing to do
    }
    resultMap
  }

  /**
   * Save the consumed offsets after every batch.
   */
  def saveOffset(sourceDStream: InputDStream[String]) = {
    sourceDStream.foreachRDD(
      // One RDD is produced per batch interval
      rdd => {
        var tempMap = Map[TopicAndPartition, Long]()
        val ranges: HasOffsetRanges = rdd.asInstanceOf[HasOffsetRanges]
        // One OffsetRange per Kafka partition
        val ranges1: Array[OffsetRange] = ranges.offsetRanges
        ranges1.foreach(offsetRange => {
          tempMap += offsetRange.topicAndPartition() -> offsetRange.untilOffset
        })
        // Commit the offsets of this batch (inside foreachRDD, once per batch)
        kafkaCluster.setConsumerOffsets(GROUP_ID, tempMap)
      }
    )
  }

  def createSSC() = {
    val conf: SparkConf = new SparkConf().setAppName("a").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Checkpointing gives at-least-once semantics on recovery
    ssc.checkpoint("./ck1")
    // messageHandler.message() extracts the value read from Kafka
    val sourceDStream: InputDStream[String] =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](ssc, para,
        readOffset(), (messageHandler: MessageAndMetadata[String, String]) => messageHandler.message())
    sourceDStream.print(10)
    saveOffset(sourceDStream)
    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getActiveOrCreate("./ck1", createSSC)
    ssc.start()
    ssc.awaitTermination()
  }
}
spark-streaming-kafka-0-10
This package uses the direct approach only: Kafka partitions map one-to-one to Spark partitions. Compared with the old package it supports SSL/TLS, committing offsets back to Kafka, and dynamic topic subscription, which makes it far more capable.
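As a rough sketch of what the 0-10 API looks like (not used elsewhere in this post; the broker list, group id and topic below simply reuse the values from the earlier examples as placeholders), the stream is built from a LocationStrategy plus a ConsumerStrategy, and offsets are committed back to Kafka with commitAsync:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}

object KafkaStreaming010 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka010").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(3))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "hadoop101:9092,hadoop102:9092,hadoop103:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "0830",
      "auto.offset.reset" -> "latest",
      // Disable auto commit so offsets are only committed after the batch has been processed
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Set("SS"), kafkaParams))
    stream.foreachRDD { rdd =>
      // Grab this batch's offset ranges before any shuffle breaks the 1:1 partition mapping
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(record => (record.value, 1)).reduceByKey(_ + _).collect().foreach(println)
      // Commit the offsets of this batch back to Kafka
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
PreferConsistent spreads the Kafka partitions evenly over the available executors and is the recommended default location strategy.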
DStream transformations
Stateless transformations
Each stateless transformation operates on one RDD per batch and has no relation to earlier batches. transform is the one method that covers almost every stateless need, since any RDD-to-RDD function can be applied through it.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object sparkTransform {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("sparkStreamTransFrom")
    val ssc = new StreamingContext(conf, Seconds(3))
    val ssDream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101", 9999)
    // A stateless transformation simply operates on each RDD in turn.
    // The difference from foreachRDD: transform is a transformation, foreachRDD is an output (action) operation.
    val dstream: DStream[(String, Int)] = ssDream.transform(rdd =>
      rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _))
    dstream.print(100)
    ssc.start()
    ssc.awaitTermination()
  }
}
Stateful transformations
Each output depends on the previous state, so a checkpoint directory must be set to persist that state. The drawback is that checkpointing produces lots of small files, so a third-party store can be used to hold the state instead.
updateStateByKey
updateStateByKey has many overloads; the simplest one is used here.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object sparkUpdateStateByKey {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("UpdateStateByKey")
    val ssc = new StreamingContext(conf, Seconds(3))
    ssc.checkpoint("./ck1")
    val receiverInputDStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101", 9999)
    // For every key in the batch: n: Seq[Int] holds the values of the current batch,
    // l: Option[Int] holds the previously saved state. The point is to update the Option and return it.
    receiverInputDStream.flatMap(x => x.split(" ")).map((_, 1)).updateStateByKey((n: Seq[Int], l: Option[Int]) => {
      Some(n.sum + l.getOrElse(0))
    }).print(10)
    ssc.start()
    ssc.awaitTermination()
  }
}
Windowed transformations
A window has a size and a slide interval, and both must be multiples of the batch interval. The core method is reduceByKeyAndWindow.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object sparkWindow {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("sparkWindow")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Clean up the local checkpoint directory left over from previous runs
    val ls: FileSystem = FileSystem.getLocal(new Configuration())
    ls.delete(new Path("./ck1"), true)
    ssc.checkpoint("./ck1")
    val receiverInputDStream: ReceiverInputDStream[String] = ssc.socketTextStream("hadoop101", 9999)
    // Simplest form: Seconds(9) is the window size
    receiverInputDStream.flatMap(x => x.split(" ")).map((_, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(9)).print(100)
    // Also control the slide interval
    receiverInputDStream.flatMap(x => x.split(" ")).map((_, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(9), slideDuration = Seconds(6)).print(100)
    // Optimization with an inverse function: s2 = s1 + new - old, so the overlapping part is not recomputed
    receiverInputDStream.flatMap(x => x.split(" ")).map((_, 1))
      .reduceByKeyAndWindow(_ + _, (n, l) => n - l, Seconds(9)).print(100)
    // filterFunc drops keys whose count has fallen to 0, keeping the state small
    receiverInputDStream.flatMap(x => x.split(" ")).map((_, 1))
      .reduceByKeyAndWindow(_ + _, (n, l) => n - l, Seconds(9), filterFunc = { case (_, v) => v > 0 }).print(100)
    ssc.start()
    ssc.awaitTermination()
  }
}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object sparkWindow2 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("sparkWindow2")
    val ssc = new StreamingContext(conf, Seconds(3))
    val ls: FileSystem = FileSystem.getLocal(new Configuration())
    ls.delete(new Path("./ck1"), true)
    ssc.checkpoint("./ck1")
    // Alternatively, put a window directly on the DStream; later operators then work on the windowed stream
    val windowedDStream: DStream[String] = ssc.socketTextStream("hadoop101", 9999).window(Seconds(9))
    windowedDStream.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print(10)
    ssc.start()
    ssc.awaitTermination()
  }
}
DStream output
DStream output operations are essentially the same as for RDDs: print(), saveAsTextFiles(), saveAsObjectFiles(), saveAsHadoopFiles(). The extra one is foreachRDD.
val spark = SparkSession.builder.config(conf).getOrCreate()
// Use foreachRDD to write each batch out; the drawback is that this produces lots of small files
import spark.implicits._
// count is assumed to be a DStream[(String, Int)] from a word count like the ones above
count.foreachRDD(rdd => {
  val df: DataFrame = rdd.toDF("word", "count")
  df.createOrReplaceTempView("words")
  spark.sql("select * from words").show
})
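For the file-based output operations listed above, a one-line usage sketch (the ./out/wordcount prefix is a placeholder); saveAsTextFiles writes one directory per batch, named prefix-<time in ms>.suffix:
// Assumes count is the same DStream[(String, Int)] as in the snippet above
count.saveAsTextFiles("./out/wordcount", "txt")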