Notes on Streaming in Spark
1 Window Operations
In the figure, the Spark Streaming batch interval is 5s, the window size is 15s, and the window slide interval is 10s.
Note:
1. The 5s batch interval packages the data stream into individual RDDs (the batches of the DStream); with a 10s slide interval, every 10s the 3 RDDs covering the window are merged into one large RDD;
2. The window size and the window slide interval must both be integer multiples of the batch interval (i.e., multiples of 5s in the figure).
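A minimal sketch of that constraint, assuming a hypothetical DStream named lines built from a StreamingContext with a 5s batch interval (as in the demo in section 3):
// 15s window sliding every 10s: both are multiples of the 5s batch interval
val windowed: DStream[String] = lines.window(
  Durations.seconds(15), // window size = 3 batches
  Durations.seconds(10)  // slide interval = 2 batches
)
// Something like window(Durations.seconds(7), Durations.seconds(10)) would fail
// validation, since 7s is not a multiple of the batch interval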
2 Window Optimization
As the figure shows, the first window computed covers t-1 to t+3 and the second covers t to t+4, so state clearly has to be stored between windows. The default storage level is MEMORY_ONLY; since memory alone is unreliable, a checkpoint can be used. For this example, notice that the span t through t+3 is recomputed in the second window: instead, we can take the first window's result, subtract the t-1 batch to get the result for t through t+3, then add the t+4 batch to obtain the second window's result.
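To make the saving concrete, here is a tiny self-contained sketch with made-up per-batch counts:
// Hypothetical word counts for the six batches t-1 .. t+4
val counts = Map("t-1" -> 3, "t" -> 5, "t+1" -> 2, "t+2" -> 4, "t+3" -> 1, "t+4" -> 6)
val window1 = counts("t-1") + counts("t") + counts("t+1") + counts("t+2") + counts("t+3") // 15
// Incremental update: subtract the batch that left the window, add the one that entered
val window2 = window1 - counts("t-1") + counts("t+4") // 15 - 3 + 6 = 18
// Recomputing t..t+4 from scratch also gives 18, but touches four old batches instead of one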
3 SparkStreaming demo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Durations, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

// 1. Create the Spark environment
val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("wc")
val sc = new SparkContext(conf)
// 2. Create the StreamingContext: it needs the Spark environment and the batch interval;
//    a longer interval raises latency, a shorter one lowers throughput
val ssc: StreamingContext = new StreamingContext(sc, Durations.seconds(5))
// Checkpointing is required by reduceByKeyAndWindow with an inverse function
ssc.checkpoint("spark/data/checkpoint")
// 3. Read data: build a DStream that passively receives data
val ds: ReceiverInputDStream[String] = ssc.socketTextStream("172.16.6.14", 9999)
// 4. Stateful processing
val kvDS: DStream[(String, Int)] = ds
  .flatMap(_.split(","))
  .map((_, 1))
/**
 * Count the words of the last 15 seconds, once every 5 seconds.
 *
 * Sliding window: the window size and slide interval must be integer multiples of the batch interval.
 * (When the slide interval equals the window size, it becomes a tumbling window.)
 */
/* val countDS: DStream[(String, Int)] = kvDS
  .reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y, // reduce function
    Durations.seconds(15),     // window size
    Durations.seconds(5)       // slide interval
  ) */
// Optimized window computation
val countDS: DStream[(String, Int)] = kvDS
  .reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y, // reduce function: add batches entering the window
    (x: Int, y: Int) => x - y, // inverse function: subtract batches leaving the window
    Durations.seconds(15),     // window size
    Durations.seconds(5)       // slide interval
  )
countDS.print()
// 5. Start
ssc.start()            // start the job
ssc.awaitTermination() // block until shutdown
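To try it, open a socket source on the target host first (for example nc -lk 9999) and type comma-separated words; the counts for the last 15 seconds are printed every 5 seconds.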
4 StructuredStreaming
Spark offers two stream-processing APIs: Structured Streaming and Spark Streaming.
Structured Streaming is event-driven, while Spark Streaming is time-driven; in other words, Structured Streaming computes only when data arrives, whereas Spark Streaming keeps computing regardless. Run them and watch the Spark web UI to see the difference in job counts directly. A Structured Streaming demo follows:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.OutputMode

// 1. Create the Structured Streaming environment (a SparkSession)
val spark: SparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("StructuredStreaming")
  .config("spark.sql.shuffle.partitions", 1) // the default is 200
  .getOrCreate()
// 2. Read from a socket
val df: DataFrame = spark
  .readStream
  .format("socket")
  .option("host", "172.16.6.14")
  .option("port", 9999)
  .load()
import spark.implicits._
import org.apache.spark.sql.functions._
// 3. Process the stream
df.groupBy($"value")
  .agg(count("value") as "c")
  .writeStream                       // calling show() directly would fail: the stream is unbounded
  .outputMode(OutputMode.Complete())
  .format("console")                 // write to the console
  .start()                           // start the query
  .awaitTermination()                // block until shutdown
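Feed it the same way, by typing lines into nc -lk 9999 on the source host. Since the query aggregates with OutputMode.Complete(), the full result table is rewritten to the console on every trigger.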
5 Structured Streaming Kafka read/write demo
Reading from Kafka
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.OutputMode

val spark: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("stru")
  .config("spark.sql.shuffle.partitions", 2)
  .getOrCreate()
import org.apache.spark.sql.functions._
import spark.implicits._
/**
 * Connect to Kafka and create the DataFrame
 */
val kafkaDF: DataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "master:9092") // Kafka cluster address
  .option("subscribe", "test_topic1")
  .option("startingOffsets", "earliest") // latest: only new data; earliest: read everything
  .load()
kafkaDF.printSchema()
/**
 * root
 *  |-- key: binary (nullable = true)
 *  |-- value: binary (nullable = true)
 *  |-- topic: string (nullable = true)
 *  |-- partition: integer (nullable = true)
 *  |-- offset: long (nullable = true)
 *  |-- timestamp: timestamp (nullable = true)
 *  |-- timestampType: integer (nullable = true)
 */
val wordDF: DataFrame = kafkaDF
  .selectExpr("cast(value as string) as value")
  .select(explode(split($"value", ",")) as "word") // explode one comma-separated line into one row per word
wordDF.createOrReplaceTempView("words")
val countDF: DataFrame = spark.sql(
  """
    |select word, count(1) as c from words group by word
  """.stripMargin)
countDF.writeStream
  .format("console")
  .outputMode(OutputMode.Complete())
  .start()
  .awaitTermination()
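To drive the demo, produce comma-separated lines into test_topic1, for example with Kafka's console producer (kafka-console-producer.sh --broker-list master:9092 --topic test_topic1; newer Kafka versions take --bootstrap-server instead of --broker-list).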
Writing to Kafka
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("stru")
  .config("spark.sql.shuffle.partitions", 2)
  // the Kafka sink requires a checkpoint location
  .config("spark.sql.streaming.checkpointLocation", "spark/data/checkpoint")
  .getOrCreate()
import spark.implicits._
/**
 * Connect to Kafka and create the DataFrame
 */
val kafkaDF: DataFrame = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "master:9092") // Kafka cluster address
  .option("subscribe", "test_topic1")
  .option("startingOffsets", "latest") // latest: only new data; earliest: read everything
  .load()
// kafkaDF has the same schema as shown in the reading demo above
val df: DataFrame = kafkaDF
  .selectExpr("cast(value as string) as value")
  .filter($"value" === "java")
df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "master:9092")
  .option("topic", "test1") // if the topic does not exist, Kafka auto-creates it with 1 partition and 1 replica by default
  .start()
  .awaitTermination()
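Messages whose value is "java" end up in topic test1 and can be checked with a console consumer. Note the spark.sql.streaming.checkpointLocation set above: the Kafka sink requires a checkpoint location so the query can track its progress across restarts.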