Structured Streaming Programming Guide

• Overview
• Quick Example
• Programming Model
o Basic Concepts
o Handling Event-time and Late Data
o Fault Tolerance Semantics
• API using Datasets and DataFrames
o Creating streaming DataFrames and streaming Datasets
▪ Input Sources
▪ Schema inference and partition of streaming DataFrames/Datasets
o Operations on streaming DataFrames/Datasets
▪ Basic Operations - Selection, Projection, Aggregation
▪ Window Operations on Event Time
▪ Handling Late Data and Watermarking
▪ Join Operations
▪ Stream-static Joins
▪ Stream-stream Joins
▪ Inner Joins with optional Watermarking
▪ Outer Joins with Watermarking
▪ Semi Joins with Watermarking
▪ Support matrix for joins in streaming queries
▪ Streaming Deduplication
▪ Policy for handling multiple watermarks
▪ Arbitrary Stateful Operations
▪ Unsupported Operations
▪ Limitation of global watermark
o Starting Streaming Queries
▪ Output Modes
▪ Output Sinks
▪ Using Foreach and ForeachBatch
▪ ForeachBatch
▪ Foreach
▪ Streaming Table APIs
▪ Triggers
o Managing Streaming Queries
o Monitoring Streaming Queries
▪ Reading Metrics Interactively
▪ Reporting Metrics programmatically using Asynchronous APIs
▪ Reporting Metrics using Dropwizard
o Recovering from Failures with Checkpointing
o Recovery Semantics after Changes in a Streaming Query
• Continuous Processing
• Additional Information
Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
Internally, by default, Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.
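As a rough sketch of how the mode is chosen, the processing mode is selected only through the trigger on the write side, while the Dataset/DataFrame logic stays the same; the lines and wordCounts DataFrames referenced here are the ones defined in the quick example below.

import org.apache.spark.sql.streaming.Trigger

// Micro-batch mode (the default), here with an explicit 1-second micro-batch interval.
val microBatchQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(Trigger.ProcessingTime("1 second"))
  .start()

// Continuous mode (Spark 2.3+, experimental). Only map-like queries without aggregations
// are supported, so it is shown on the raw lines rather than on wordCounts.
val continuousQuery = lines.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a batch interval
  .start()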
In this guide, we are going to walk you through the programming model and the APIs. We are going to explain the concepts mostly using the default micro-batch processing model, and then later discuss the Continuous Processing model. First, let’s start with a simple example of a Structured Streaming query - a streaming word count.
Quick Example
Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Let’s see how you can express this using Structured Streaming. You can see the full code in Scala/Java/Python/R. And if you download Spark, you can directly run the example. In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()

import spark.implicits._
Next, let’s create a streaming DataFrame that represents text data received from a server listening on localhost:9999, and transform the DataFrame to calculate word counts.
// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()
This lines DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named “value”, and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a Dataset of String using .as[String], so that we can apply the flatMap operation to split each line into multiple words. The resultant words Dataset contains all the words. Finally, we have defined the wordCounts DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.
We have now set up the query on the streaming data. All that is left is to actually start receiving data and computing the counts. To do this, we set it up to print the complete set of counts (specified by outputMode("complete")) to the console every time they are updated. And then start the streaming computation using start().
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
After this code is executed, the streaming computation will have started in the background. The query object is a handle to that active streaming query, and we have decided to wait for the termination of the query using awaitTermination() to prevent the process from exiting while the query is active.
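For reference, here is a brief sketch of a few other things you can do with that handle; these are methods on the StreamingQuery interface:

// query is a StreamingQuery handle to the active streaming query.
println(query.id)            // unique identifier of the query, stable across restarts from checkpoint
println(query.status)        // current status, e.g. whether it is processing data or waiting for new data
println(query.lastProgress)  // metrics of the most recent completed trigger

// Stop the query explicitly instead of waiting for termination.
// query.stop()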
To actually execute this example code, you can either compile the code in your own Spark application, or simply run the example once you have downloaded Spark. We are showing the latter. You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using
$ nc -lk 9999
Then, in a different terminal, you can start the example by using
$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999
Then, any lines typed in the terminal running the netcat server will be counted and printed on screen every second. It will look something like the following.

TERMINAL 1:

Running Netcat

$ nc -lk 9999
apache spark
apache hadoop

…

TERMINAL 2: RUNNING StructuredNetworkWordCount

$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999


-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    1|
+------+-----+

Programming Model
The key idea in Structured Streaming is to treat a live data stream as a table that is being continuously appended. This leads to a new stream processing model that is very similar to a batch processing model. You will express your streaming computation as standard batch-like query as on a static table, and Spark runs it as an incremental query on the unbounded input table. Let’s understand this model in more detail.
Basic Concepts
Consider the input data stream as the “Input Table”. Every data item that is arriving on the stream is like a new row being appended to the Input Table.
[Figure: the data stream as an unbounded Input Table, with each arriving data item appended as a new row]

A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.
[Figure: the programming model — at every trigger interval, new rows appended to the Input Table update the Result Table, and the changes are written to the output sink]

The “Output” is defined as what gets written out to the external storage. The output can be defined in a different mode:
• Complete Mode - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
• Append Mode - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
• Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
Note that each mode is applicable on certain types of queries. This is discussed in detail later.
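As a small sketch of how a mode is selected (re-using the wordCounts query from the quick example; outputMode accepts "complete", "append", or "update"):

// Complete mode: the whole updated Result Table is written after every trigger.
val completeModeQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Update mode: only the rows whose counts changed since the last trigger are written.
val updateModeQuery = wordCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()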
To illustrate the use of this model, let’s understand the model in context of the Quick Example above. The first lines DataFrame is the input table, and the final wordCounts DataFrame is the result table. Note that the query on streaming lines DataFrame to generate wordCounts is exactly the same as it would be a static DataFrame. However, when this query is started, Spark will continuously check for new data from the socket connection. If there is new data, Spark will run an “incremental” query that combines the previous running counts with the new data to compute updated counts, as shown below.
[Figure: model of the quick example — the running word counts in the Result Table are updated incrementally as new lines arrive in each trigger]

Note that Structured Streaming does not materialize the entire table. It reads the latest available data from the streaming data source, processes it incrementally to update the result, and then discards the source data. It only keeps around the minimal intermediate state data as required to update the result (e.g. intermediate counts in the earlier example).
This model is significantly different from many other stream processing engines. Many streaming systems require the user to maintain running aggregations themselves, thus having to reason about fault-tolerance, and data consistency (at-least-once, or at-most-once, or exactly-once). In this model, Spark is responsible for updating the Result Table when there is new data, thus relieving the users from reasoning about it. As an example, let’s see how this model handles event-time based processing and late arriving data.
Handling Event-time and Late Data
Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model – each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column – each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
Furthermore, this model naturally handles data that has arrived later than expected based on its event-time. Since Spark is updating the Result Table, it has full control over updating old aggregates when there is late data, as well as cleaning up old aggregates to limit the size of intermediate state data. Since Spark 2.1, we have support for watermarking which allows the user to specify the threshold of late data, and allows the engine to accordingly clean up old state. These are explained later in more detail in the Window Operations section.
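To make this concrete, here is a minimal sketch of an event-time windowed count with a watermark; the events DataFrame and its timestamp and word columns are assumptions for illustration:

import org.apache.spark.sql.functions.{col, window}

// events: a streaming DataFrame assumed to have an event-time column "timestamp" and a "word" column.
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")              // allow data up to 10 minutes late
  .groupBy(
    window(col("timestamp"), "10 minutes", "5 minutes"), // 10-minute windows sliding every 5 minutes
    col("word"))
  .count()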
Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
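In practice, this tracking is enabled by giving the query a checkpoint location when it is started; a minimal sketch (the checkpoint path is an assumption for illustration):

// Offsets and intermediate state are recorded under the checkpoint directory via write-ahead logs,
// so a restarted query resumes exactly where it left off.
val checkpointedQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "path/to/HDFS/dir")  // hypothetical path on a fault-tolerant file system
  .start()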
API using Datasets and DataFrames
Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point SparkSession (Scala/Java/Python/R docs) to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the DataFrame/Dataset Programming Guide.
Creating streaming DataFrames and streaming Datasets
Streaming DataFrames can be created through the DataStreamReader interface (Scala/Java/Python docs) returned by SparkSession.readStream(). In R, with the read.stream() method. Similar to the read interface for creating static DataFrames, you can specify the details of the source – data format, schema, options, etc.
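For example, here is a sketch of reading a directory of CSV files as a stream with an explicit schema (the separator, path and columns are assumptions for illustration):

import org.apache.spark.sql.types.StructType

// File sources require a schema unless schema inference is explicitly enabled.
val userSchema = new StructType().add("name", "string").add("age", "integer")

val csvDF = spark.readStream
  .option("sep", ";")
  .schema(userSchema)          // schema of the CSV files
  .csv("/path/to/directory")   // equivalent to format("csv").load("/path/to/directory")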
