Spark Streaming
Basic Concepts
Spark Streaming is the stream-processing component among Spark's five major components.
Two concepts are worth introducing first:
- Stream processing
Streaming data, as the name suggests, arrives continuously, like flowing water. A stream-processing program therefore never finishes: it keeps processing data or waiting for new data to arrive.
- Micro-batching
Spark Streaming does not process each record individually as it arrives. Instead, a time interval is configured at the start, and the data received within each interval is packaged into one batch for processing.
The processing can be understood simply as: the data produced within each configured interval forms one RDD, a series of operators is applied to it, and this repeats once per interval (whether or not any data arrived).
Spark Streaming abstracts the data stream as a Discretized Stream (DStream); all operations go through methods on DStream objects.
Usage Example
Import the core dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>${Spark-version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>${Spark-version}</version>
</dependency>
Because Spark Streaming operations are ultimately carried out as RDD operations, the spark-core module must be imported as well.
The WordCount example below illustrates the basic structure of a Spark Streaming program:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamDemo1")
    // Create the main entry point for stream processing; Seconds(3) means data is pulled in 3-second batches
    val ssc = new StreamingContext(conf, Seconds(3))
    // Receive data from the given IP and port
    ssc.socketTextStream("192.168.226.10", 7777)
      .flatMap(_.split("\\s+")) // split on whitespace
      .map((_, 1))
      .reduceByKey(_ + _)
      .print() // print the results
    // Up to this point we have only defined the processing logic; nothing has actually run yet
    // To execute the logic above, the following two calls are required
    ssc.start()            // start the program
    ssc.awaitTermination() // wait for the program to terminate
    ssc.stop()             // stop the program (never actually reached)
  }
}
Now start the program and, in another terminal, send data with netcat. Every line entered on the netcat side becomes input to the program, which prints the computed counts for that batch.
When some batches take longer to process than the configured interval, earlier data may need to be kept cached. Calling ssc.remember(duration) with the desired retention period makes Spark keep each recent batch's data for that long before cleaning it up.
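A one-line sketch of that call (Minutes is provided by org.apache.spark.streaming):

import org.apache.spark.streaming.Minutes

// Keep each batch's data for at least one minute before it is cleaned up
ssc.remember(Minutes(1))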
Common Functions
updateStateByKey
Updates the state maintained for each key.
Suppose that, building on the WordCount example above, previous results should be kept, with each new batch of input accumulated onto them. updateStateByKey does exactly that:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamDemo1")
    val ssc = new StreamingContext(conf, Seconds(3))
    // Stateful operators need a checkpoint directory to store state between batches
    ssc.checkpoint("checkpoint")
    ssc.socketTextStream("192.168.226.10", 7777)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      // seq holds this batch's values for the key, res the accumulated state;
      // seq.size works here because every value is 1
      .updateStateByKey((seq: Seq[Int], res: Option[Int]) => Option(res.getOrElse(0) + seq.size))
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
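If the mapped values were not always 1 (pre-aggregated counts, for instance), summing the batch's values is the more general update function; a sketch of a drop-in replacement for the updateStateByKey line above:

// values: all values that arrived for this key in the current batch
// state:  the accumulated result from previous batches, if any
.updateStateByKey((values: Seq[Int], state: Option[Int]) => Some(values.sum + state.getOrElse(0)))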
Reading from and Writing to Kafka
Reading data:
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IngestAndPushKafka {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("IngestAndPush")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.226.10:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.GROUP_ID_CONFIG -> "IngNo1",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest"
    )
    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkDemo"), kafkaParams)
    )
    kafkaStream.flatMap(_.value().split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
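Note that enable.auto.commit is set to false here, yet nothing in this program ever commits offsets, so on restart consumption is governed only by auto.offset.reset. If offsets should be tracked in Kafka, the kafka010 integration exposes each batch's offset ranges; a sketch of the documented pattern, using the same kafkaStream as above:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  // Offset ranges are only available on the original, untransformed stream
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // Commit asynchronously once the batch's work has succeeded
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}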
Writing data:
import java.util

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IngestAndPushKafka {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("IngestAndPush")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.226.10:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.GROUP_ID_CONFIG -> "IngNo1",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest"
    )
    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkDemo"), kafkaParams)
    )
    kafkaStream.foreachRDD(rdd => {
      rdd.foreach(record => {
        val words = record.value().split("\\s+") // extract the data
        /*
        Because this code runs on the Executors, creating the Producer outside (on the driver)
        and referencing it here does not work: KafkaProducer is not serializable.
        Below is the process of creating a Producer and sending messages.
        */
        val prop = new util.HashMap[String, Object]()
        prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.226.10:9092")
        prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](prop)
        for (word <- words) {
          val msg = new ProducerRecord[String, String]("SparkDemoOut", "", word + ",1")
          producer.send(msg)
        }
      })
      /*
      The block above is for demonstration only: it creates one Producer object per record,
      and the constant object creation and GC churn are extremely bad for performance.
      foreachPartition improves on this by creating one Producer per partition
      instead of one per record.
      **This is the approach recommended by the official docs**
      */
      rdd.foreachPartition(p => {
        val prop = new util.HashMap[String, Object]()
        prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.226.10:9092")
        prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](prop)
        p.foreach(record => {
          val words: Array[String] = record.value().split("\\s+")
          words.foreach(word => producer.send(new ProducerRecord[String, String]("SparkDemoOut", "", word + ",1")))
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
Besides that, a broadcast variable can be used (this runs correctly; actual runtime performance has not been tested) to ship the Producer object to every Executor:
import java.util

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IngestAndPushKafka {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("IngestAndPush")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint")
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.226.10:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.GROUP_ID_CONFIG -> "IngNo2",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest"
    )
    val prop = new util.HashMap[String, Object]()
    prop.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.226.10:9092")
    prop.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    prop.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    // Note: Serializable must be mixed in here, otherwise the object cannot be broadcast
    val p = new KafkaProducer[String, String](prop) with Serializable
    // broadcast it
    val broadcast: Broadcast[KafkaProducer[String, String]] = ssc.sparkContext.broadcast(p)
    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkDemo"), kafkaParams)
    )
    kafkaStream.foreachRDD(rdd => {
      rdd.foreach(y => {
        val words: Array[String] = y.value().split("\\s+")
        // fetch the broadcast Producer
        val producer: KafkaProducer[String, String] = broadcast.value
        for (word <- words) {
          val record = new ProducerRecord[String, String]("SparkDemoOut", "", word + ",1")
          producer.send(record)
        }
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
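A commonly used alternative to mixing in Serializable is to broadcast a serializable wrapper that builds the producer lazily on each Executor, so that no KafkaProducer ever needs to be serialized. The sketch below illustrates that pattern; KafkaSink is a hypothetical helper name, not part of any library:

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical helper: holds a factory function (which is serializable)
// and creates the actual producer lazily, once per Executor JVM
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer: KafkaProducer[String, String] = createProducer()
  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

// On the driver: broadcast the wrapper, not the producer itself
// val sink = ssc.sparkContext.broadcast(new KafkaSink(() => new KafkaProducer[String, String](prop)))
// On the Executors: rdd.foreach(r => sink.value.send("SparkDemoOut", r.value()))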
foreachRDD
The Kafka write-out above already made use of the foreachRDD operator.
It applies the given operation to the RDD of data produced in each time interval:
ssc.socketTextStream("192.168.226.10", 7777)
  .foreachRDD((rdd, time) => {
    rdd.flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(e => println(time.milliseconds + ":" + e))
  })
Since a Producer can be created inside each interval's RDD to produce messages, other modules can be brought in the same way, for example the Spark SQL component, to analyze the data with SQL.
First import the Spark SQL module:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>${Spark-version}</version>
</dependency>
Building on the very first WordCount example, the analysis below is done with Spark SQL:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWithSQL {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingWithSQL")
    val ssc = new StreamingContext(conf, Seconds(3))
    ssc.socketTextStream("192.168.226.10", 7777)
      .foreachRDD((rdd, time) => {
        // Reuse the streaming job's configuration to obtain (or create) a SparkSession
        val rConf: SparkConf = rdd.sparkContext.getConf
        val spark: SparkSession = SparkSession.builder().config(rConf).getOrCreate()
        import spark.implicits._
        import org.apache.spark.sql.functions._
        println(s"=========${time.milliseconds}===========")
        rdd.flatMap(_.split("\\s+")) // split lines into words, as in the first example
          .toDF("word")
          .select($"word", lit(1).as("number"))
          .groupBy($"word")
          .agg(sum($"number").as("count"))
          .select($"word", $"count")
          .show()
      })
    ssc.start()
    ssc.awaitTermination()
  }
}
transform
Applies the given operation to every RDD in the DStream and returns a new DStream:
ssc.socketTextStream("192.168.226.10", 7777)
  .transform(rdd => {
    rdd.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
  })
  .print()
The runtime behavior is the same as in the first example.
transform is very similar to foreachRDD, except that foreachRDD does not return a result while transform must. Like foreachRDD, transform can also bring in a SparkSession; after processing, the resulting DataFrame is typically converted back to an RDD and returned for the rest of the pipeline, as sketched below.
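A minimal sketch of that pattern, assuming the same socket source and SparkSession import as in the earlier examples (the column name "value" is what createDataset assigns to a dataset of strings):

ssc.socketTextStream("192.168.226.10", 7777)
  .transform(rdd => {
    val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
    import spark.implicits._
    // Run the aggregation in SQL, then hand an RDD back to the DStream pipeline
    spark.createDataset(rdd.flatMap(_.split("\\s+")))
      .groupBy($"value").count()
      .rdd
      .map(row => (row.getString(0), row.getLong(1)))
  })
  .print()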
Windows
Spark Streaming windows come in two kinds, tumbling and sliding; their meaning and types match the similarly named windows in Kafka Streams. The code below demonstrates them directly:
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TestWindow {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("TestWindow")
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("checkpoint")
    val kafkaParams = Map(
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.226.10:9092",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.GROUP_ID_CONFIG -> "WindowDemo1",
      ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "latest"
    )
    val kafkaStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe(Set("SparkDemo"), kafkaParams)
    )
    // Carve out windows directly with window
    /* val windowDStream: DStream[(String, Int)] = kafkaStream.flatMap(_.value().split("\\s+"))
      .map((_, 1))
      .window(Seconds(8), Seconds(6))
    windowDStream.print() */
    // Count occurrences of each value per window
    /* kafkaStream.flatMap(_.value().split(","))
      .countByValueAndWindow(Seconds(8), Seconds(4))
      .print() */
    // Reduce all elements in the window (note: this concatenates the words;
    // per-key aggregation would use reduceByKeyAndWindow, sketched below)
    kafkaStream.flatMap(_.value().split("\\s+"))
      .reduceByWindow(_ + _, Seconds(8), Seconds(4))
      .print()
    // Aggregate the whole window's data into one result
    /* kafkaStream.flatMap(_.value.split("\\s+"))
      .map((_, 1))
      .reduceByWindow((m, n) => ("word", m._2 + n._2), Seconds(8), Seconds(4))
      .print() */
    ssc.start()
    ssc.awaitTermination()
  }
}
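The per-key windowed aggregation alluded to above is reduceByKeyAndWindow; a minimal sketch against the same kafkaStream, counting each word over an 8-second window that slides every 4 seconds:

// Per-key counts over the window
kafkaStream.flatMap(_.value().split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(8), Seconds(4))
  .print()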
Spark actually represents tumbling and sliding windows with one set of overloaded methods:
def window(windowDuration: Duration): DStream[T]
Passing only the window size uses the batch interval as the slide step, so the window advances one batch at a time.
def window(windowDuration: Duration, slideDuration: Duration): DStream[T]
Passing a window size and a slide step gives a sliding window; a slide equal to the window size yields a tumbling window.
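For example, with the 2-second batch interval above, a sketch of a tumbling 8-second window (both durations must be multiples of the batch interval):

// Non-overlapping, back-to-back 8-second windows
kafkaStream.flatMap(_.value().split("\\s+"))
  .window(Seconds(8), Seconds(8))
  .print()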
Performance Tuning
Spark Streaming performance tuning can be divided roughly into two parts:
- reducing the running time of each batch:
  - setting a reasonable number of CPU cores;
  - setting the parallelism;
  - replacing inefficient operators with efficient ones;
- setting a reasonable batch interval.
In more detail (a combined configuration sketch follows at the end of this section):
- CPU cores and parallelism
Spark Streaming's demands on memory are relatively low; the number of CPU cores matters more, since both pulling data and processing it occupy cores. Only a core count configured for the actual workload reaches optimal performance.
- Kryo serialization
Kryo is a lightweight serialization framework; compared with Java's built-in serialization it produces smaller objects, an advantage for both I/O and storage.
- DStream optimization
DStream operations are ultimately executed through RDDs, so RDD optimization strategies apply equally to DStreams: avoid creating duplicate DStreams and reuse a single DStream instead. Persist DStreams that are used multiple times (MEMORY_AND_DISK_SER is the usual choice; with small data volumes, MEMORY_ONLY can give the best performance, but use it with care).
- Setting a reasonable batch interval
The batch interval should match the data's processing time: too long an interval leaves Executor resources idle, while too short an interval causes tasks to pile up. Consider the built-in backpressure mechanism, which dynamically adjusts the rate at which data is pulled:
conf.set("spark.streaming.backpressure.enabled", "true")
Apart from the batch interval, optimizing a DStream differs little from optimizing an RDD: build on the hardware side first (a reasonable number of cores, parallelism, and memory), then optimize the computation logic (persistence, DStream reuse, serialization, data-structure choices, and so on) for a good overall performance profile.
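Pulling these settings together, a hedged configuration sketch (the property keys are standard Spark options; the values are illustrative assumptions, not recommendations):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("TunedStreaming")
  // Kryo instead of Java serialization for smaller serialized objects
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Let Spark adapt the ingestion rate to the observed processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // Illustrative value; tune to the cores actually available
  .set("spark.default.parallelism", "8")

// Persisting a DStream that is used multiple times (someDStream is a placeholder):
// someDStream.persist(StorageLevel.MEMORY_AND_DISK_SER)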