Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark (Translation)

Translator's note: below is a translation of this paper. Since Spark 2.4, the trend has been to move away from Spark Streaming toward Structured Streaming.
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark
Abstract
With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators.
Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL’s code generation engine and can outperform Apache Flink by up to 2× and Apache Kafka Streams by 90×. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system’s design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
1. Introduction
Many high-volume data sources operate in real time, including sensors, logs from mobile applications, and the Internet of Things. As organizations have gotten better at capturing this data, they also want to process it in real time, whether to give human analysts the freshest possible data or drive automated decisions. Enabling broad access to streaming computation requires systems that are scalable, easy to use, and easy to integrate into business applications.
While there has been tremendous progress in distributed stream processing systems in the past few years, these systems still remain fairly challenging to use in practice. In this paper, we begin by describing these challenges, based on our experience with Spark Streaming, one of the earliest stream processing systems to provide a high-level, functional API. We found that two challenges frequently came up with users. First, streaming systems often ask users to think in terms of complex physical execution concepts, such as at-least-once delivery, state storage, and triggering modes, that are unique to streaming. Second, many systems focus only on streaming computation, but in real use cases, streaming is often part of a larger business application that also includes batch analytics, joins with static data, and interactive queries. Integrating streaming systems with these other workloads (e.g., maintaining transactionality) requires significant engineering effort.
Motivated by these challenges, we describe Structured Streaming, a new high-level API for stream processing that was developed in Apache Spark starting in 2016. Structured Streaming builds on many ideas in recent stream processing systems, such as separating processing time from event time and triggers in Google Dataflow, using a relational execution engine for performance, and offering a language-integrated API, but aims to make them simpler to use and integrated with the rest of Apache Spark. Specifically, Structured Streaming differs from other widely used open source streaming APIs in two ways:

Incremental query model: Structured Streaming automatically incrementalizes queries on static datasets expressed through Spark’s SQL and DataFrame APIs, meaning that users typically only need to understand Spark’s batch APIs to write a streaming query. Event time concepts are especially easy to express and understand in this model. Although incremental query execution and view maintenance are well studied, we believe Structured Streaming is the first effort to adopt them in a widely used open source system. We found that this incremental API generally worked well for both novice and advanced users. For example, advanced users can use a set of stateful processing operators that give fine-grained control to implement custom logic while fitting into the incremental model.

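The incremental model can be illustrated with a toy sketch in plain Python (this is not the Spark API; `word_count` stands in for a relational query written against the batch DataFrame API). The same query definition is run once over a static dataset and once incrementally over micro-batches, with the engine, rather than the user, maintaining the running state:

```python
from collections import Counter

def word_count(records):
    """The 'static query': split lines into words and count them."""
    return Counter(w for line in records for w in line.split())

# Batch execution: run the query over the full dataset at once.
static_data = ["spark streams", "spark batch"]
batch_result = word_count(static_data)

# Incremental execution: running state is kept between micro-batches
# and each batch's partial result is folded in -- the user-visible
# query (word_count) is unchanged.
state = Counter()
for micro_batch in (["spark streams"], ["spark batch"]):
    state.update(word_count(micro_batch))

assert state == batch_result  # same answer, computed incrementally
```

The point of the sketch is that the user writes only the first function; incrementalization (the `state.update` loop) is the engine's job.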
Support for end-to-end applications: Structured Streaming’s API and built-in connectors make it easy to write code that is “correct by default” when interacting with external systems and can be integrated into larger applications using Spark and other software. Data sources and sinks follow a simple transactional model that enables “exactly-once” computation by default. The incrementalization-based API naturally makes it easy to run a streaming query as a batch job or develop hybrid applications that join streams with static data computed through Spark’s batch APIs. In addition, users can manage multiple streaming queries dynamically and run interactive queries on consistent snapshots of stream output, making it possible to write applications that go beyond computing a fixed result to let users refine and drill into streaming data.
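The “exactly-once by default” transactional model for sinks can be sketched in plain Python (the `TransactionalSink` class and epoch ids here are illustrative, not Spark's actual connector API). The key idea is that each output is committed atomically together with the id of the micro-batch (epoch) that produced it, so redelivery of the same batch after a failure is detected and skipped:

```python
class TransactionalSink:
    """Toy sink: a write commits atomically with its epoch (batch) id,
    so replaying an already-committed epoch is a no-op."""
    def __init__(self):
        self.committed_epoch = -1  # id of the last committed batch
        self.rows = []             # the sink's durable output

    def commit(self, epoch_id, batch_rows):
        if epoch_id <= self.committed_epoch:
            return False           # duplicate delivery: skip it
        self.rows.extend(batch_rows)
        self.committed_epoch = epoch_id
        return True

sink = TransactionalSink()
sink.commit(0, ["a"])
sink.commit(1, ["b"])
sink.commit(1, ["b"])  # replay after a simulated failure -- ignored
assert sink.rows == ["a", "b"]
```

Under this discipline, at-least-once delivery from the engine plus idempotent, epoch-keyed commits at the sink yields exactly-once output.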
Beyond these design decisions, we made several other design choices in Structured Streaming that simplify operation and increase performance. First, Structured Streaming reuses the Spark SQL execution engine, including its optimizer and runtime code generator. This leads to high throughput compared to other streaming systems (e.g., 2× the throughput of Apache Flink and 90× that of Apache Kafka Streams in the Yahoo! Streaming Benchmark), as in Trill, and also lets Structured Streaming automatically leverage new SQL functionality added to Spark. The engine runs in a microbatch execution mode by default, but it can also use low-latency continuous operators for some queries because the API is agnostic to execution strategy.
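The claim that the API is agnostic to execution strategy can be sketched in plain Python (again illustrative, not the Spark API): the user's query logic is defined once, and the engine may run it either by buffering records into micro-batches per trigger or by processing each record continuously as it arrives.

```python
def query(record):
    """User-defined query logic: independent of execution strategy."""
    return record.upper()

events = ["a", "b", "c", "d"]

# Micro-batch strategy: buffer events and process one batch per trigger.
microbatch_out = []
for i in range(0, len(events), 2):  # trigger fires every 2 events
    microbatch_out.extend(query(e) for e in events[i:i + 2])

# Continuous strategy: process each record as it arrives.
continuous_out = [query(e) for e in events]

# The declarative query is identical; only the engine's strategy differs.
assert microbatch_out == continuous_out == ["A", "B", "C", "D"]
```

Micro-batching favors throughput and simple fault recovery, while the continuous strategy favors latency; because the query does not encode either, the engine can choose.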
Second, we found that operating a streaming application can be challenging, so we designed the engine to support failures, code updates and recomputation of already outputted data. For example, one common issue is that new data in a stream causes an application to crash, or worse, to output an incorrect result that users do not notice until much later (e.g., due to mis-parsing an input field). In Structured Streaming, each application maintains a write-ahead event log in human-readable JSON format that administrators can use to restart it from an arbitrary point. If the application crashes due to an error in a user-defined function, administrators can update the UDF and restart from where it left off, which happens automatically when the restarted application reads the log. If the application was outputting incorrect data instead, administrators can manually roll it back to a point before the problem started and recompute its results starting from there.
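The write-ahead log mechanism can be sketched in plain Python (the log format below is made up for illustration; it only mirrors the idea of human-readable JSON entries recording which input offsets each micro-batch covered). Restart means resuming from the last committed entry; rollback means truncating the log to an earlier entry and recomputing from there:

```python
import json

# Toy write-ahead log: one human-readable JSON entry per micro-batch,
# recording the range of input offsets that batch consumed.
log = []

def run_batch(batch_id, start_offset, end_offset):
    log.append(json.dumps(
        {"batchId": batch_id, "offsets": [start_offset, end_offset]}))

run_batch(0, 0, 100)
run_batch(1, 100, 250)

# Restart after a crash: resume from the last committed entry...
last = json.loads(log[-1])
assert last["offsets"][1] == 250

# ...or roll back to before a bad batch and recompute from there.
del log[1:]  # administrator discards batch 1
resume_from = json.loads(log[-1])["offsets"][1]
assert resume_from == 100
```

Because the entries are plain JSON, an administrator can inspect them directly and choose any earlier point as the restart position.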
Our team has been running Structured Streaming applications for customers of Databricks’ cloud service since 2016, as well as using the system internally, so we end the paper with some example use cases. Production applications range from interactive network security analysis and automated alerts to incremental Extract, Transform and Load (ETL). Users often leverage the design of the engine in interesting ways, e.g., by running a streaming query “discontinuously” as a series of single-microbatch jobs to leverage Structured Streaming’s transactional input and output without having to pay for cloud servers running 24/7. The largest customer applications we discuss process over 1 PB of data per month on hundreds of machines. We also show that Structured Streaming outperforms Apache Flink and Kafka Streams by 2× and 90× respectively in the widely used Yahoo! Streaming Benchmark.
The rest of this paper is organized as follows. We start by discussing the stream processing challenges reported by users in Section 2. Next, we give an overview of Structured Streaming (Section 3), then describe its API (Section 4), query planning (Section 5), execution (Section 6) and operational features (Section 7). In Section 8, we describe several large use cases at Databricks and its customers. We then measure the system’s performance in Section 9, discuss related work in Section 10 and conclude in Section 11.
