Spark Streaming Study Notes

kehan_c

Article link: https://blog.csdn.net/kehan_c/article/details/94461973

Official documentation: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

The origin of everything, word count:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._

// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

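Call start() to begin the streaming computation. The example above sets the output mode to complete; there are three output modes, described below.

Programming Model (output modes):

Complete Mode: the entire updated result table is written to external storage. How the full-table write is handled is up to the storage connector.
Append Mode: only the new rows appended to the result table since the last trigger are written to external storage. This applies only to queries where existing rows in the result table are not expected to change.
Update Mode: only the rows of the result table that were updated since the last trigger are written to external storage (available since Spark 2.1.1). Unlike Complete Mode, this mode outputs only the rows that changed.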
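To make the difference between the three modes concrete, here is a minimal plain-Scala sketch that models a running word count as an immutable Map and computes what each mode would emit after a trigger. It has no Spark dependency, and the names OutputModesSketch, step, completeOut, updateOut and appendOut are made up for this illustration. It is a simplified model, not how Spark implements output modes. In particular, Spark itself rejects Append mode for an aggregation like this one without a watermark, precisely because existing rows change.

```scala
// Plain-Scala model of the three output modes (illustrative only, no Spark involved).
object OutputModesSketch {
  type Table = Map[String, Long]

  // Fold one micro-batch of words into the running result table.
  def step(table: Table, batch: Seq[String]): Table =
    batch.foldLeft(table)((t, w) => t.updated(w, t.getOrElse(w, 0L) + 1L))

  // Complete mode: the whole result table is emitted after every trigger.
  def completeOut(before: Table, after: Table): Table = after

  // Update mode: only rows that are new or whose value changed since the last trigger.
  def updateOut(before: Table, after: Table): Table =
    after.filter { case (k, v) => !before.get(k).contains(v) }

  // Append mode: only rows that did not exist before the trigger.
  // (Assumption of this model: existing rows are simply never re-emitted.)
  def appendOut(before: Table, after: Table): Table =
    after.filter { case (k, _) => !before.contains(k) }

  def main(args: Array[String]): Unit = {
    val t1 = step(Map.empty, Seq("cat", "dog", "dog")) // result table after trigger 1
    val t2 = step(t1, Seq("dog", "owl"))               // result table after trigger 2
    println(s"complete: ${completeOut(t1, t2)}")
    println(s"update:   ${updateOut(t1, t2)}")
    println(s"append:   ${appendOut(t1, t2)}")
  }
}
```

After the second trigger, the complete output contains all three words, the update output contains only dog and owl, and the append output contains only owl.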

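Each mode applies to different types of queries. Below is a word count example.
[Figure: word count example; image not available]
Structured Streaming does not materialize the entire table. It reads the latest available data from the streaming source, processes it incrementally to update the result, and then discards the source data. It keeps only the minimal intermediate state needed to update the result (for example, the intermediate counts in the earlier example).

event-time: each event from a device is a row in the table, and the event time is a column value in that row.
end-to-end exactly-once: checkpointing and write-ahead logs are used to make fault tolerance exactly-once.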

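Creating streaming DataFrames:

Streaming DataFrames are created through SparkSession.readStream (in Scala this is called without parentheses). See the official documentation for the available options.
Example: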
val spark: SparkSession = …

// Read text from socket
val socketDF = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

socketDF.isStreaming // Returns True for DataFrames that have streaming sources

socketDF.printSchema

// Read all the csv files written atomically in a directory
import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")

Common operations:

Streaming DataFrames support most of the usual DataFrame/Dataset operations.
Example:
case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)

val df: DataFrame = … // streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IOT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10") // using untyped APIs
ds.filter(_.signal > 10).map(_.device) // using typed APIs

// Running count of the number of updates for each device type
df.groupBy("deviceType").count() // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.deviceType).agg(typed.avg(_.signal)) // using typed API

A streaming DataFrame can also be registered as a temporary view and then queried with Spark SQL.
df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates") // returns another streaming DF

Window operations on event time:

Windowed aggregations are expressed with the groupBy() and window() operations.
Example:
import spark.implicits._

val words = … // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
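Because the window above is 10 minutes long and slides every 5 minutes, each event falls into two overlapping windows. The plain-Scala sketch below shows that assignment. It is only an illustration: times are simplified to minutes since midnight, and the names SlidingWindowSketch, windowsFor and hhmm are hypothetical, not part of Spark.

```scala
// Illustrative model of sliding-window assignment (not Spark's implementation).
object SlidingWindowSketch {
  // All [start, end) windows of length `size`, sliding by `slide`, that contain time `t`.
  // Times are non-negative minutes since midnight; window starts are multiples of `slide`.
  def windowsFor(t: Int, size: Int, slide: Int): List[(Int, Int)] = {
    val lastStart = t - (t % slide) // latest window start that is <= t
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + size > t) // stop once the window no longer covers t
      .map(start => (start, start + size))
      .toList
      .reverse // earliest window first
  }

  private def hhmm(m: Int): String = f"${m / 60}%02d:${m % 60}%02d"

  def main(args: Array[String]): Unit = {
    val t = 12 * 60 + 7 // an event at 12:07
    windowsFor(t, size = 10, slide = 5).foreach { case (s, e) =>
      println(s"${hhmm(t)} falls in [${hhmm(s)}, ${hhmm(e)})")
    }
  }
}
```

For an event at 12:07 this reports the windows [12:00, 12:10) and [12:05, 12:15), so the word is counted once in each of them.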

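Watermarking:

Put simply, watermarking lets the streaming engine track event time automatically and clean up old state. It amounts to declaring a maximum lateness for incoming data: anything that arrives later than that is dropped.
Example: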
import spark.implicits._

val words = … // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()

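With watermarking, the intermediate result of each window is still kept in state. In Append mode, a window's result is not written to the sink the moment the window ends; it is written to the result table (or sink) only after the watermark has passed the end of the window. In Update mode, changed rows are written on every trigger. It works like a grace period: the engine waits a few minutes for you, and if you still have not shown up, it stops caring about you for good.

Watermarking only takes effect in Append or Update mode, because Complete mode must keep all aggregated data and therefore cannot drop intermediate state.
withWatermark must be applied to the same time column that the aggregation uses. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is not allowed (the time column must be the same in both places).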
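The watermark rule described above can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's implementation: times are plain minutes, windows are tumbling rather than sliding, the watermark is recomputed once per micro-batch as "largest event time seen so far minus the delay", and the names WatermarkSketch, State and processBatch are hypothetical.

```scala
// Simplified model of watermark-based handling of late data (illustrative only).
object WatermarkSketch {
  // counts maps a window start (in minutes) to the number of events in that window.
  final case class State(maxEventTime: Long, watermark: Long, counts: Map[Long, Int])

  val windowSize = 10L // tumbling 10-minute windows, to keep the model small
  val delay = 10L      // allowed lateness, as in withWatermark(..., "10 minutes")

  private def windowStart(t: Long): Long = (t / windowSize) * windowSize

  // Process one micro-batch of non-negative event times. The watermark applied to
  // this batch is the one computed at the end of the previous batch.
  def processBatch(state: State, eventTimes: Seq[Long]): State = {
    // An event is too late when its whole window already lies behind the watermark.
    val accepted = eventTimes.filter(t => windowStart(t) + windowSize > state.watermark)
    val counts = accepted.foldLeft(state.counts) { (c, t) =>
      c.updated(windowStart(t), c.getOrElse(windowStart(t), 0) + 1)
    }
    val maxSeen = (state.maxEventTime +: eventTimes).max
    State(maxSeen, maxSeen - delay, counts)
  }

  def main(args: Array[String]): Unit = {
    val s0 = State(0L, 0L, Map.empty)
    val s1 = processBatch(s0, Seq(12L, 14L, 25L)) // watermark becomes 25 - 10 = 15
    val s2 = processBatch(s1, Seq(8L, 13L, 27L))  // 8 is too late; 13 is late but accepted
    println(s"watermark=${s2.watermark}, counts=${s2.counts}")
  }
}
```

In the second batch, the event at minute 8 belongs to the window [0, 10), which ended before the watermark of 15, so it is dropped. The event at minute 13 is late but its window [10, 20) is still open, so it is counted.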

Join operations:

Stream-static join:
Inner join: no restrictions.
Outer join: only "stream LEFT JOIN static" and "static RIGHT JOIN stream" are supported.
Example:
val staticDf = spark.read. …
val streamingDf = spark.readStream. …

streamingDf.join(staticDf, "type")                    // inner equi-join with a static DF
streamingDf.join(staticDf, Seq("type"), "left_outer") // left outer join with a static DF

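Stream-stream join:

Inner join (the two conditions below are optional for inner joins, but without them the join state grows without bound):
1. Define a watermark delay on both streams.
2. Define an event-time constraint so the engine can work out when old rows can be dropped and no longer matched. There are two ways to do this, a time range condition or a join on equal windows.
e.g. …JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR
e.g. …JOIN ON leftTimeWindow = rightTimeWindow
Example: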
import org.apache.spark.sql.functions.expr

val impressions = spark.readStream. …
val clicks = spark.readStream. …

// Apply watermarks on event-time columns
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)

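Outer join:
Outer joins still need the watermark and event-time constraints (here they are required), and they take one extra argument compared with an inner join, joinType.
Example: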
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """),
  joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
)

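Full outer joins are not supported for streams in the Spark version these notes are based on (stream-stream full outer join was only added later, in Spark 3.1).

Streaming deduplication:

Example: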
val streamingDf = spark.readStream. … // columns: guid, eventTime, …

// Without watermark using guid column
streamingDf.dropDuplicates("guid")

// With watermark using guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime")
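The point of combining a watermark with dropDuplicates is that the engine can forget old keys instead of remembering every guid forever. The plain-Scala sketch below models that idea under simplifying assumptions: event times are plain seconds, the watermark is "largest event time seen so far minus 10 seconds" and is updated once per batch, and the names DedupSketch, Event, State and processBatch are hypothetical. It is not Spark's implementation.

```scala
// Simplified model of watermark-bounded deduplication (illustrative only).
object DedupSketch {
  final case class Event(guid: String, eventTime: Long)
  final case class State(seen: Set[Event], maxEventTime: Long)

  val delay = 10L // seconds, as in withWatermark("eventTime", "10 seconds")

  // Returns the new state and the events that pass deduplication for this batch.
  def processBatch(state: State, batch: Seq[Event]): (State, Seq[Event]) = {
    val watermark = state.maxEventTime - delay
    val (seen, out) = batch.foldLeft((state.seen, Vector.empty[Event])) {
      case ((s, acc), e) =>
        if (e.eventTime < watermark || s.contains(e)) (s, acc) // too late, or a duplicate
        else (s + e, acc :+ e)
    }
    val maxSeen = (state.maxEventTime +: batch.map(_.eventTime)).max
    // Keys older than the new watermark can never match again, so they are evicted.
    (State(seen.filter(_.eventTime >= maxSeen - delay), maxSeen), out)
  }

  def main(args: Array[String]): Unit = {
    val s0 = State(Set.empty, 0L)
    val (s1, out1) = processBatch(s0, Seq(Event("a", 100L), Event("a", 100L), Event("b", 105L)))
    val (s2, out2) = processBatch(s1, Seq(Event("a", 100L), Event("c", 120L), Event("d", 90L)))
    println(s"batch 1 emits $out1")
    println(s"batch 2 emits $out2, state kept: ${s2.seen}")
  }
}
```

In this run the second batch emits only the event with guid c: the repeated a is a duplicate and d is older than the watermark. Afterwards the state holds a single key, because everything older than the new watermark has been evicted.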

Output modes and query operations

Different output modes support different query operations.
All three modes (Append, Update, Complete) support aggregation on event-time with watermark.

Output sinks:

File:
writeStream
  .format("parquet") // can be "orc", "json", "csv", etc.
  .option("path", "path/to/destination/dir")
  .start()

Kafka:
writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "updates")
  .start()

Call start() to actually run the query:
// ========== DF with no aggregations ==========
val noAggDF = deviceDataDf.select("device").where("signal > 10")

// Print new data to console
noAggDF
  .writeStream
  .format("console")
  .start()

// Write new data to Parquet files
noAggDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .start()

// ========== DF with aggregation ==========
val aggDF = df.groupBy("device").count()

// Print updated aggregations to console
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Have all the aggregates in an in-memory table
aggDF
  .writeStream
  .queryName("aggregates") // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()

spark.sql("select * from aggregates").show() // interactively query in-memory table

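Triggers:

Default: micro-batch processing; each micro-batch starts as soon as the previous one finishes.
Fixed interval micro-batches: micro-batches kicked off at a fixed time interval.
One-time micro-batch: a single micro-batch that processes all the data accumulated so far and then stops. Suited to spinning up a cluster periodically, processing the backlog in one go, and shutting the cluster down again. (In that case, why not just schedule a plain Spark job on a timer? Why bother with streaming at all?)
Continuous Processing: continuous processing, still an experimental feature.
Example: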
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  .start()

// ProcessingTime trigger with two-seconds micro-batch interval
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

// One-time trigger
df.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start()

// Continuous trigger with one-second checkpointing interval
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()

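Micro-batch processing:
Exactly-once fault-tolerance guarantee.

Continuous processing:
At-least-once fault-tolerance guarantee. A continuous trigger must be specified, with the desired checkpoint interval as its argument.

Managing streaming queries:
Multiple queries can be started in a single SparkSession; they run concurrently and share the cluster's resources. sparkSession.streams returns the StreamingQueryManager, which can be used to manage the queries that are currently running.

Restarting a job from a checkpoint:

Once a checkpoint location is set, the query saves its progress and running state to that path (using checkpointing and write-ahead logs), and the job can be restarted from it. A query can be restarted from its checkpoint with either kind of trigger, micro-batch or continuous.
Example: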
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
