Getting Started with Spark Structured Streaming

Official guide: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Overview

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

Structured Streaming is a scalable, fault-tolerant stream processing engine built on top of the Spark SQL engine.

You can express your streaming computation the same way you would express a batch computation on static data.

You can think about (and express) a streaming computation the same way you would a batch computation on static data.

The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.

The Spark SQL engine takes care of running the query incrementally and continuously, updating the final result set as streaming data keeps arriving.

You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc.

Since it is built on Spark SQL, you can use the Dataset/DataFrame API (we will use Scala here); with these APIs you can express streaming aggregations, event-time windows, stream-to-batch joins, and so on.

The computation is executed on the same optimized Spark SQL engine.

The whole computation is executed on the same optimized Spark SQL engine.

Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs.

Finally, Structured Streaming guarantees end-to-end exactly-once fault tolerance through checkpointing and Write-Ahead Logs.

In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without requiring the user to reason about streaming itself.

Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.

By default, Structured Streaming queries are executed by a micro-batch engine, which processes the data stream as a series of small batch jobs, achieving end-to-end latencies as low as 100 ms with exactly-once fault-tolerance guarantees.

However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees.

Since Spark 2.3, however, a new low-latency mode called Continuous Processing is available; it can achieve end-to-end latencies as low as 1 ms, but with at-least-once guarantees.

Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.

Without changing the Dataset/DataFrame operations in your query, you can choose whichever mode fits your application requirements.
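
A minimal sketch of how the mode is chosen (df stands for a streaming DataFrame from a source that supports continuous mode, e.g. Kafka or the rate source; "1 second" is the checkpoint interval of continuous mode, not a latency target):

import org.apache.spark.sql.streaming.Trigger

// Micro-batch execution is the default; nothing special has to be configured.
// Continuous Processing (Spark 2.3+, experimental) is selected purely through
// the trigger, while the query itself stays unchanged:
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))   // checkpoint interval, not latency
  .start()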

Quick Example

Create a SparkSession:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()
  
import spark.implicits._
// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream    // readStream is the entry point for reading a stream
  .format("socket")             // format specifies the source type (socket, kafka, ...)
  .option("host", "localhost")
  .option("port", 9999)         // options are passed as key-value pairs
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()   // "value" is the single string column in the Dataset's schema

This lines DataFrame represents an unbounded table containing the streaming text data.

This lines DataFrame represents an unbounded table containing the streaming text data.

This table contains one column of strings named “value”, and each line in the streaming text data becomes a row in the table.

Its schema has a single string column named "value", and every line of text entered on the client side becomes a row of this table.

Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a Dataset of String using .as[String], so that we can apply the flatMap operation to split each line into multiple words. The resultant words Dataset contains all the words.

Note that no data is being received yet; we have only set up the transformations. The DataFrame is converted to a Dataset[String] with .as[String] so that flatMap can be applied to split each line into multiple words; the resulting words Dataset contains all the words.

Finally, we have defined the wordCounts DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.

Finally, wordCounts is defined by grouping on the unique values in the Dataset and counting them; this streaming DataFrame represents the running word counts of the stream.

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")   // one of the three output modes (complete / append / update)
  .format("console")
  .start()    // start the streaming computation

query.awaitTermination()   // block here so the query stays active until terminated

Structured Streaming writes its output through writeStream.
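
The checkpointing and Write-Ahead Logs mentioned in the Overview are driven by a checkpoint directory configured on the sink; a minimal sketch of the same query with checkpointing enabled (the path is only a placeholder):

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/wordcount")   // placeholder path; use reliable storage (e.g. HDFS) in production
  .start()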

Programming Model

The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended.

The core idea of Structured Streaming is to treat the continuously arriving data as an unbounded table that is being continuously appended to.

This leads to a new stream processing model that is very similar to a batch processing model.

You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table.

You can then run batch-like queries on this unbounded table just as you would on static data, i.e. the offline (batch) APIs can be reused for stream computation, and Spark executes them as incremental queries on the unbounded input table.

Let’s understand this model in more detail.

[Figure: the data stream viewed as an unbounded input table]

A query on the input will generate the “Result Table”.

A query on the input generates the "Result Table"; every trigger of the query updates this final result table.

[Figure: programming model — on every trigger, the query over the input table updates the Result Table]

Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.

On every trigger interval (say, every second), new rows are appended to the input table, which eventually updates the Result Table; whenever the Result Table is updated, the changed result rows are written to an external sink.

The “Output” is defined as what gets written out to the external storage. The output can be defined in a different mode:
There are three output modes (a short sketch follows this list):

  • Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table. That is, the full result produced at each trigger interval is written out to the external storage.
  • Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change. That is, only newly appearing rows are written to the external storage, while the DataFrame itself still holds the complete data.
  • Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode. That is, only the recomputed rows are output: e.g. the first trigger counts Xiaoming twice and Dafei once; the second trigger brings one more Xiaoming and one Goudan, so only "Xiaoming: 3" and "Goudan: 1" are output, while the overall result table still accumulates everything seen so far.
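
A quick sketch of where each mode typically applies, reusing wordCounts (an aggregation) and lines (the raw input) from the Quick Example above:

// Complete: re-emit the whole Result Table on every trigger (aggregation queries).
wordCounts.writeStream.outputMode("complete").format("console").start()

// Update: emit only the rows whose counts changed since the last trigger.
wordCounts.writeStream.outputMode("update").format("console").start()

// Append: emit only new rows; suited to queries whose result rows never change,
// e.g. a plain projection/filter with no aggregation.
lines.writeStream.outputMode("append").format("console").start()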

[Figure: the quick example mapped onto the programming model]

This model is significantly different from many other stream processing engines.

This model is significantly different from many other stream processing engines.

Many streaming systems require the user to maintain running aggregations themselves, thus having to reason about fault-tolerance, and data consistency (at-least-once, or at-most-once, or exactly-once).

Many streaming systems require the user to maintain the running aggregations themselves, and therefore to reason about fault tolerance and data consistency (at-least-once, at-most-once, or exactly-once).

In this model, Spark is responsible for updating the Result Table when there is new data, thus relieving the users from reasoning about it. As an example, let’s see how this model handles event-time based processing and late arriving data.

In this model, Spark itself updates the Result Table whenever new data arrives, relieving the user of that burden. As an example, let's see how the model handles event-time based processing and late-arriving data.

Handling Event-time and Late Data

Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time.

Event-time is the time embedded in the data itself, i.e. the time at which each row was generated. For stream processing, event-time is usually the most valuable field, because it tells you which point in time a record belongs to. For example, a record produced at 7:00 may only reach the application at 8:00: 8:00 is its processing time, while 7:00 is its event-time. The two must not be confused.

For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them.

This event-time is very naturally expressed in this model – each event from the devices is a row in the table, and event-time is a column value in the row.

Each event from a device is simply a row in the table, and its event-time is a column value in that row, so every row should carry a field from which its event-time can be read.

This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column – each time window is a group and each row can belong to multiple windows/groups.
Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.

Based on this event-time we can then define aggregations such as window-based operations: each time window is just a group, and a row may belong to multiple windows/groups.

Furthermore, this model naturally handles data that has arrived later than expected based on its event-time.

This model naturally handles data that arrives later than expected. Based on the event-time we can decide how old a record we are still willing to accept: if the current time is 8:00 we might still process data from 6:00-7:00 but drop anything from 5:00. This control is driven by event-time and determines which records still take part in the aggregation.

Since Spark is updating the Result Table, it has full control over updating old aggregates when there is late data, as well as cleaning up old aggregates to limit the size of intermediate state data. Since Spark 2.1, we have support for watermarking which allows the user to specify the threshold of late data, and allows the engine to accordingly clean up old state. These are explained later in more detail in the Window Operations section.

Streaming Deduplication

Deduplication is needed all the time in production. For example, if we count how many times each user appears every five minutes and the same user record may arrive more than once, we need to deduplicate first.

You can deduplicate records in data streams using a unique identifier in the events. This is exactly same as deduplication on static using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use deduplication with or without watermarking.

  • With watermark - If there is a upper bound on how late a duplicate record may arrive, then you can define a watermark on a event time column and deduplicate using both the guid and the event time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of the state the query has to maintain.

With a watermark, old state can eventually be removed. For example, a user appears between 10:00 and 12:00 and that state is recorded; if the same user shows up again another 5-10 minutes later, records that have already fallen behind the watermark are dropped from state.

  • Without watermark - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state.

Without a watermark, the query must keep the state for all past records, so the state can grow without bound.

val streamingDf = spark.readStream. ...  // columns: guid, eventTime, ...

// deduplicate the streaming DataFrame on the chosen column(s)

// Without watermark using guid column
streamingDf.dropDuplicates("guid")

// With watermark using guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime")
  

Window Operations on Event Time

[Figure: windowed grouped aggregation, 10-minute windows sliding every 5 minutes]

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }



// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),  // 10-minute windows over "timestamp", sliding every 5 minutes
  $"word"
).count()

Handling Late Data and Watermarking

Now consider what happens if one of the events arrives late to the application. For example, say, a word generated at 12:04 (i.e. event time) could be received by the application at 12:11. The application should use the time 12:04 instead of 12:11 to update the older counts for the window 12:00 - 12:10. This occurs naturally in our window-based grouping – Structured Streaming can maintain the intermediate state for partial aggregates for a long period of time such that late data can update aggregates of old windows correctly, as illustrated below.

In real production systems late data is unavoidable: network hiccups, upstream partners delivering their data late, and so on.

[Figure: late data being counted towards the window it belongs to]

Because the grouping is based on event time, a late record is still counted towards the window it belongs to, not the window in which it happened to arrive.

However, to run this query for days, it’s necessary for the system to bound the amount of intermediate in-memory state it accumulates.

However, if the query runs for days, the system has to bound the amount of intermediate in-memory state it accumulates.

This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more.

This means the system needs to know when an old aggregate can be dropped from in-memory state because no more late data will arrive for it. For example, if the job has been running for five days, by day three there is usually no reason to keep the state for day one; if it is not removed, memory is wasted, and the ever-growing state can eventually cause an OOM if the code does not manage it.

To enable this, in Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly.

Watermarking was introduced in Spark 2.1: it lets the engine automatically track the current event time in the data and clean up old state accordingly.

You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time.

You define the watermark of a query by specifying the event-time column and a threshold for how late the data is expected to be in terms of event time.

For a specific window ending at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will start getting dropped (see later in the section for the exact guarantees). Let’s understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
   .withWatermark("timestamp", "10 minutes")   // 10-minute watermark: data more than 10 minutes behind the max event time seen is dropped
   .groupBy(
       window($"timestamp", "10 minutes", "5 minutes"),
       $"word")
   .count()
   

[Figure: watermarking in windowed grouped aggregation]

Conditions for watermarking to clean aggregation state
It is important to note that the following conditions must be satisfied for the watermarking to clean the state in aggregation queries (as of Spark 2.1.1, subject to change in the future).

  • Output mode must be Append or Update. Complete mode requires all aggregate data to be preserved, and hence cannot use watermarking to drop intermediate state. See the Output Modes section for detailed explanation of the semantics of each output mode.

When a watermark is used, the output mode must be Append or Update. Complete mode has to output the full result, so it cannot use the watermark to drop intermediate state; that state would grow without bound in memory, so be careful with Complete mode in production.

  • The aggregation must have either the event-time column, or a window on the event-time column.

The aggregation must be on the event-time column, or on a window built from the event-time column.

  • withWatermark must be called on the same column as the timestamp column used in the aggregate. For example, df.withWatermark("time", "1 min").groupBy("time2").count() is invalid in Append output mode, as watermark is defined on a different column from the aggregation column.

  • withWatermark must be called before the aggregation for the watermark details to be used. For example, df.groupBy("time").count().withWatermark("time", "1 min") is invalid in Append output mode.
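
Put together, a pattern that satisfies all of these conditions looks roughly like the sketch below (df is assumed to be a streaming DataFrame with a Timestamp column named "time"):

import org.apache.spark.sql.functions.window
import spark.implicits._

// Valid: the watermark is defined on the same event-time column ("time") that the
// aggregation groups on, and it is applied before the aggregation.
val counts = df
  .withWatermark("time", "1 minute")
  .groupBy(window($"time", "5 minutes"))
  .count()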

Join Operations

Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame as well as another streaming Dataset/DataFrame.

Structured Streaming supports several kinds of joins (stream-static joins, stream-stream joins, ...).

A stream-static join joins the stream against static data, e.g. joining incoming IP addresses with a static table to look up the domain name.
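
A minimal sketch of that IP-to-domain example (the lookup table, socket source and column names are all made up for illustration):

import spark.implicits._

// Static lookup table: ip -> domain. Here it is a small in-memory table;
// in practice it might be loaded from a file or a database.
val ipToDomain = Seq(("1.2.3.4", "example.com"), ("5.6.7.8", "example.org"))
  .toDF("ip", "domain")

// Streaming input where every line is an ip address (socket source, as in the Quick Example).
val ips = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .toDF("ip")

// Stream-static inner join: every incoming ip is enriched with its domain.
val enriched = ips.join(ipToDomain, "ip")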

The result of the streaming join is generated incrementally, similar to the results of streaming aggregations in the previous section. In this section we will explore what type of joins (i.e. inner, outer, etc.) are supported in the above cases. Note that in all the supported join types, the result of the join with a streaming Dataset/DataFrame will be the exactly the same as if it was with a static Dataset/DataFrame containing the same data in the stream.

Stream-stream Joins

In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming Datasets/DataFrames. The challenge of generating join results between two data streams is that, at any point of time, the view of the dataset is incomplete for both sides of the join making it much harder to find matches between inputs. Any row received from one input stream can match with any future, yet-to-be-received row from the other input stream. Hence, for both the input streams, we buffer past input as streaming state, so that we can match every future input with past input and accordingly generate joined results. Furthermore, similar to streaming aggregations, we automatically handle late, out-of-order data and can limit the state using watermarks. Let’s discuss the different types of supported stream-stream joins and how to use them.

Stream-stream joins have been supported since Spark 2.3.

Stream-stream Joins code example

Joining the ad-impression stream with the user-click stream yields something like a user profile of click preferences.

import org.apache.spark.sql.functions.expr

val impressions = spark.readStream. ...
val clicks = spark.readStream. ...

// Apply watermarks on event-time columns
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// the watermarks here have to be generous enough that matching rows from the other stream are still kept in state, so pay attention to the watermark settings

// Join with event-time constraints
impressionsWithWatermark.join(
 clicksWithWatermark,
 expr("""
   clickAdId = impressionAdId AND
   clickTime >= impressionTime AND
   clickTime <= impressionTime + interval 1 hour
   """)
)

Output Sinks

http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview

  • File sink - Stores the output to a directory.
  • Kafka sink - Stores the output to one or more topics in Kafka.
  • Foreach sink - Runs arbitrary computation on the records in the output. See later in the section for more details.
  • Console sink (for debugging) - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory after every trigger.
  • Memory sink (for debugging) - The output is stored in memory as an in-memory table. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver’s memory. Hence, use it with caution.
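
As a rough sketch, this is how the file and Kafka sinks are typically configured (df stands for any streaming DataFrame; paths, servers and topic names are placeholders):

// File sink: append-only output to a directory; a checkpoint location is required.
df.writeStream
  .format("parquet")
  .option("path", "/tmp/output/events")
  .option("checkpointLocation", "/tmp/checkpoints/file-sink")
  .outputMode("append")
  .start()

// Kafka sink: the output must contain a string (or binary) column named "value".
df.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
  .start()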

Using Foreach and ForeachBatch

The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch. Let’s understand their usages in more detail.

ForeachBatch

foreachBatch(…) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.

foreachBatch hands you the output of the current micro-batch as a DataFrame, which can, for example, be cached in memory and reused.

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Transform and write batchDF 
}.start()

Once cached, the batch can be written out to MySQL or other stores, i.e. the data can be materialized mid-pipeline; you can both transform it and write it to several places. Before Spark 2.4 this was awkward, because a streaming query was end-to-end with a single output and could not easily write to multiple sinks.
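
A minimal sketch of that pattern (the output paths are placeholders):

import org.apache.spark.sql.DataFrame

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Cache the micro-batch so it is not recomputed once per sink.
  batchDF.persist()

  // Write the same batch to two different locations.
  batchDF.write.format("parquet").mode("append").save("/tmp/output/location1")
  batchDF.write.format("parquet").mode("append").save("/tmp/output/location2")

  batchDF.unpersist()
}.start()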

Note:

  • By default, foreachBatch provides only at-least-once write guarantees. However, you can use the batchId provided to the function as way to deduplicate the output and get an exactly-once guarantee.
  • foreachBatch does not work with the continuous processing mode as it fundamentally relies on the micro-batch execution of a streaming query. If you write data in the continuous mode, use foreach instead.

Note, though, that foreachBatch only ever sees one batch at a time. For example, if I want an aggregation over every 5 minutes, inside foreachBatch I can only work with the current batch's data and cannot mix it with previous batches; the pros and cons of this will be analyzed in detail later.
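
As the note above mentions, the batchId can be used to make the writes idempotent; one simple sketch is to key the output on the batch id (the path is a placeholder):

import org.apache.spark.sql.DataFrame

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // If the batch is replayed after a failure, the same directory is simply
  // overwritten again, so no duplicates accumulate.
  batchDF.write
    .mode("overwrite")
    .parquet(s"/tmp/output/batch_id=$batchId")
}.start()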

Foreach

streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()
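
As a concrete toy example, a writer that just prints every record of its partition could look like the sketch below (a real writer would open a JDBC or HTTP connection instead of printing):

import org.apache.spark.sql.ForeachWriter

val query = streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {
    private var partition: Long = _

    // Called once per partition per epoch; return false to skip this partition.
    def open(partitionId: Long, version: Long): Boolean = {
      partition = partitionId
      true
    }

    // Called for every record in the partition.
    def process(record: String): Unit =
      println(s"partition=$partition record=$record")

    // Called when the partition finishes (errorOrNull is non-null on failure).
    def close(errorOrNull: Throwable): Unit = ()
  }
).start()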

Triggers

Triggers define how often the streaming query is executed (the micro-batch interval).

The remaining details are in the official guide: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#overview
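
A sketch of the common trigger settings (df stands for any streaming DataFrame; the intervals are just examples):

import org.apache.spark.sql.streaming.Trigger

// Default (no trigger): run the next micro-batch as soon as the previous one finishes.
df.writeStream.format("console").start()

// Fixed-interval micro-batches.
df.writeStream.format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

// One-time micro-batch: process whatever is available, then stop.
df.writeStream.format("console")
  .trigger(Trigger.Once())
  .start()

// Continuous processing with a 1-second checkpoint interval (experimental;
// only certain sources/sinks, e.g. Kafka and the rate source, support it).
df.writeStream.format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()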
