Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Abstract

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL’s code generation engine and can outperform Apache Flink by up to 2× and Apache Kafka Streams by 90×. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system’s design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

ACM Reference Format:

M. Armbrust et al. 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In SIGMOD’18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3183713.3190664

1      Introduction

Many high-volume data sources operate in real time, including sensors, logs from mobile applications, and the Internet of Things. As organizations have gotten better at capturing this data, they also want to process it in real time, whether to give human analysts the freshest possible data or to drive automated decisions. Enabling broad access to streaming computation requires systems that are scalable, easy to use and easy to integrate into business applications.

While there has been tremendous progress in distributed stream processing systems in the past few years [2, 15, 17, 27, 32], these systems still remain fairly challenging to use in practice. In this paper, we begin by describing these challenges, based on our experience with Spark Streaming [37], one of the earliest stream processing systems to provide a high-level, functional API. We found that two challenges frequently came up with users. First, streaming systems often ask users to think in terms of complex physical execution concepts, such as at-least-once delivery, state storage and triggering modes, that are unique to streaming. Second, many systems focus only on streaming computation, but in real use cases, streaming is often part of a larger business application that also includes batch analytics, joins with static data, and interactive queries. Integrating streaming systems with these other workloads (e.g., maintaining transactionality) requires significant engineering effort.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SIGMOD’18, June 10–15, 2018, Houston, TX, USA

© 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. ACM ISBN 978-1-4503-4703-7/18/06...$15.00 https://doi.org/10.1145/3183713.3190664

Motivated by these challenges, we describe Structured Streaming, a new high-level API for stream processing that was developed in Apache Spark starting in 2016. Structured Streaming builds on many ideas in recent stream processing systems, such as separating processing time from event time and triggers in Google Dataflow [2], using a relational execution engine for performance [12], and offering a language-integrated API [17, 37], but aims to make them simpler to use and integrated with the rest of Apache Spark. Specifically, Structured Streaming differs from other widely used open source streaming APIs in two ways:

•  Incremental query model: Structured Streaming automatically incrementalizes queries on static datasets expressed through Spark’s SQL and DataFrame APIs [8], meaning that users typically only need to understand Spark’s batch APIs to write a streaming query. Event time concepts are especially easy to express and understand in this model. Although incremental query execution and view maintenance are well studied [11, 24, 29, 38], we believe Structured Streaming is the first effort to adopt them in a widely used open source system. We found that this incremental API generally worked well for both novice and advanced users. For example, advanced users can use a set of stateful processing operators that give fine-grained control to implement custom logic while fitting into the incremental model.

•  Support for end-to-end applications: Structured Streaming’s API and built-in connectors make it easy to write code that is “correct by default” when interacting with external systems and can be integrated into larger applications using Spark and other software. Data sources and sinks follow a simple transactional model that enables “exactly-once” computation by default. The incrementalization-based API naturally makes it easy to run a streaming query as a batch job or develop hybrid applications that join streams with static data computed through Spark’s batch APIs. In addition, users can manage multiple streaming queries dynamically and run interactive queries on consistent snapshots of stream output, making it possible to write applications that go beyond computing a fixed result to let users refine and drill into streaming data.
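The incremental query model above can be illustrated with a small conceptual sketch (plain Python, not Spark’s implementation; the function names are ours): the same logical aggregation query yields identical answers whether it is evaluated once over a static dataset or maintained incrementally over arriving micro-batches.

```python
from collections import Counter
from typing import Iterable, List

# Conceptual sketch only: the "query" is a grouped count. Batch evaluation
# sees all the data at once; incremental evaluation folds micro-batches
# into running keyed state, as an incrementalizing engine does automatically.

def batch_count(records: Iterable[str]) -> Counter:
    """Evaluate the query over a complete, static dataset."""
    return Counter(records)

def incremental_count(micro_batches: Iterable[List[str]]) -> Counter:
    """Evaluate the same query by updating state one micro-batch at a time."""
    state = Counter()          # the aggregation operator's keyed state
    for batch in micro_batches:
        state.update(batch)    # incremental maintenance of the aggregate
    return state

batches = [["a", "b"], ["b", "c"], ["a"]]
flat = [r for b in batches for r in b]
assert batch_count(flat) == incremental_count(batches)  # same answer either way
```

The point of the sketch is the equivalence: the user writes only the static query, and the engine supplies the stateful incremental evaluation.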

Beyond these design decisions, we made several other design choices in Structured Streaming that simplify operation and increase performance. First, Structured Streaming reuses the Spark SQL execution engine [8], including its optimizer and runtime code generator. This leads to high throughput compared to other streaming systems (e.g., 2× the throughput of Apache Flink and 90× that of Apache Kafka Streams in the Yahoo! Streaming Benchmark [14]), as in Trill [12], and also lets Structured Streaming automatically leverage new SQL functionality added to Spark. The engine runs in a microbatch execution mode by default [37], but it can also use low-latency continuous operators for some queries because the API is agnostic to the execution strategy [6].

Second, we found that operating a streaming application can be challenging, so we designed the engine to support failures, code updates and recomputation of already outputted data. For example, one common issue is that new data in a stream causes an application to crash, or worse, to output an incorrect result that users do not notice until much later (e.g., due to mis-parsing an input field). In Structured Streaming, each application maintains a write-ahead event log in human-readable JSON format that administrators can use to restart it from an arbitrary point. If the application crashes due to an error in a user-defined function, administrators can update the UDF and restart from where it left off, which happens automatically when the restarted application reads the log. If the application was outputting incorrect data instead, administrators can manually roll it back to a point before the problem started and recompute its results starting from there.
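A minimal sketch of such a write-ahead log (plain Python; the file layout and names are illustrative assumptions, not Structured Streaming’s actual on-disk format): one JSON file per committed micro-batch records the input offsets it covered, so recovery can find the last committed batch, and an administrator can truncate the log to roll back past a bad batch.

```python
import json
import os
import tempfile

def commit(log_dir: str, batch_id: int, offsets: dict) -> None:
    """Record a committed micro-batch as a human-readable JSON file."""
    with open(os.path.join(log_dir, f"{batch_id}.json"), "w") as f:
        json.dump({"batchId": batch_id, "offsets": offsets}, f)

def last_committed(log_dir: str):
    """Return the most recent committed entry, or None if the log is empty."""
    ids = [int(name.split(".")[0]) for name in os.listdir(log_dir)
           if name.endswith(".json")]
    if not ids:
        return None
    with open(os.path.join(log_dir, f"{max(ids)}.json")) as f:
        return json.load(f)

def rollback_to(log_dir: str, batch_id: int) -> None:
    """Discard entries after batch_id so recomputation restarts from there."""
    for name in os.listdir(log_dir):
        if name.endswith(".json") and int(name.split(".")[0]) > batch_id:
            os.remove(os.path.join(log_dir, name))

log = tempfile.mkdtemp()
for i in range(3):
    commit(log, i, {"kafka-partition-0": 100 * (i + 1)})
assert last_committed(log)["batchId"] == 2
rollback_to(log, 0)                       # administrator rolls back
assert last_committed(log)["batchId"] == 0
```

Because the log is plain JSON, an operator can inspect it directly to decide which batch introduced a problem before rolling back.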

Our team has been running Structured Streaming applications for customers of Databricks’ cloud service since 2016, as well as using the system internally, so we end the paper with some example use cases. Production applications range from interactive network security analysis and automated alerts to incremental Extract, Transform and Load (ETL). Users often leverage the design of the engine in interesting ways, e.g., by running a streaming query “discontinuously” as a series of single-microbatch jobs to leverage Structured Streaming’s transactional input and output without having to pay for cloud servers running 24/7. The largest customer applications we discuss process over 1 PB of data per month on hundreds of machines. We also show that Structured Streaming outperforms Apache Flink and Kafka Streams by 2× and 90× respectively in the widely used Yahoo! Streaming Benchmark [14].

The rest of this paper is organized as follows. We start by discussing the stream processing challenges reported by users in Section 2. Next, we give an overview of Structured Streaming (Section 3), then describe its API (Section 4), query planning (Section 5), execution (Section 6) and operational features (Section 7). In Section 8, we describe several large use cases at Databricks and its customers. We then measure the system’s performance in Section 9, discuss related work in Section 10 and conclude in Section 11.

2      Stream Processing Challenges

Despite extensive progress in the past few years, distributed streaming applications are still generally considered difficult to develop and operate. Before designing Structured Streaming, we spent time discussing these challenges with users and designers of other streaming systems, including Spark Streaming, Truviso, Storm, Dataflow and Flink. This section details the challenges we saw.

2.1        Complex and Low-Level APIs

Streaming systems were invariably considered more difficult to use than batch ones due to complex API semantics. Some complexity is to be expected due to new concerns that arise only in streaming: for example, the user needs to think about what type of intermediate results the system should output before it has received all the data relevant to a particular entity, e.g., to a customer’s browsing session on a website. However, other complexity arises due to the low-level nature of many streaming APIs: these APIs often ask users to specify applications at the level of physical operators with complex semantics instead of a more declarative level.

As a concrete example, the Google Dataflow model [2] has a powerful API with a rich set of options for handling event time aggregation, windowing and out-of-order data. However, in this model, users need to specify a windowing mode, triggering mode and trigger refinement mode (essentially, whether the operator outputs deltas or accumulated results) for each aggregation operator. Adding an operator that expects deltas after an aggregation that outputs accumulated results will lead to unexpected results. In essence, the raw API [10] asks the user to write a physical operator graph, not a logical query, so every user of the system needs to understand the intricacies of incremental processing.

Other APIs, such as Spark Streaming [37] and Flink’s DataStream API [18], are also based on writing DAGs of physical operators and offer a complex array of options for managing state [20]. In addition, reasoning about applications becomes even more complex in systems that relax exactly-once semantics [32], effectively requiring the user to design and implement a consistency model.

To address this issue, we designed Structured Streaming to make simple applications simple to express using its incremental query model. In addition, we found that adding customizable stateful processing operators to this model still enabled advanced users to build their own processing logic, such as custom session-based windows, while staying within the incremental model (e.g., these same operators also work in batch jobs). Other open source systems have also recently added incremental SQL queries [15, 19], and of course databases have long supported them [11, 24, 29, 38].
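As an illustration, a custom session-window operator of the kind described here can be sketched as a pure function over one key’s event timestamps (plain Python with an assumed 30-second inactivity gap; this mirrors the shape of the logic a user would write with Spark’s stateful operators, not their actual API).

```python
# Assumed parameter for this sketch: seconds of inactivity that close a session.
GAP = 30

def sessionize(event_times):
    """Group event timestamps for one key into (start, end) sessions."""
    sessions, state = [], None          # state: the currently open session
    for t in sorted(event_times):
        if state is None:
            state = [t, t]              # open the first session
        elif t - state[1] <= GAP:
            state[1] = t                # within the gap: extend the session
        else:
            sessions.append(tuple(state))   # gap exceeded: emit and reset
            state = [t, t]
    if state is not None:
        sessions.append(tuple(state))   # emit the final open session
    return sessions

assert sessionize([0, 10, 20, 100, 110]) == [(0, 20), (100, 110)]
```

Because the operator is a deterministic function of its input, the same code runs unchanged over a bounded batch of historical events, which is what lets such operators work in batch jobs too.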

2.2         Integration in End-to-End Applications

The second challenge we found was that nearly every streaming workload must run in the context of a larger application, and this integration often requires significant engineering effort. Many streaming APIs focus primarily on reading streaming input from a source and writing streaming output to a sink, but end-to-end business applications need to perform other tasks. Examples include:

(1)  The business purpose of the application may be to enable interactive queries on fresh data. In this case, a streaming job is used to update summary tables in a structured storage system such as an RDBMS or Apache Hive [33]. It is important that when the streaming job updates its result, it does so atomically, so users do not see partial results. This can be difficult with file-based big data systems like Hive, where tables are partitioned across files, or even with parallel loads into a data warehouse.

(2)  An Extract, Transform and Load (ETL) job might need to join a stream with static data loaded from another storage system or transformed using a batch computation. In this case, it is important to be able to reason about consistency across the two systems (e.g., what happens when the static data is updated?), and it is useful to write the whole computation in a single API.

(3)  A team may occasionally need to run its streaming business logic as a batch application, e.g., to backfill a result on old data or test alternate versions of the code. Rewriting the code in a separate system would be time-consuming and error-prone.
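For the atomic updates in point (1), one common approximation is a write-then-rename pattern; the sketch below (plain Python with an illustrative single-file table layout, not what any particular warehouse does) publishes a new version of a summary table so readers never observe a partially written result.

```python
import json
import os
import tempfile

def publish(table_path: str, rows: list) -> None:
    """Write rows to a temp file, then atomically swap it into place."""
    tmp_fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(table_path))
    with os.fdopen(tmp_fd, "w") as f:
        json.dump(rows, f)              # full new version of the table
    os.replace(tmp_path, table_path)    # atomic rename on POSIX filesystems

d = tempfile.mkdtemp()
table = os.path.join(d, "summary.json")
publish(table, [{"k": "a", "count": 1}])
publish(table, [{"k": "a", "count": 2}])   # readers see old or new, never a mix
with open(table) as f:
    assert json.load(f) == [{"k": "a", "count": 2}]
```

The difficulty noted in the text is exactly that this trick stops being simple once a table spans many files or many parallel loaders, which is why atomic output needs system support.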

We address this challenge by integrating Structured Streaming closely with Spark’s batch and interactive APIs.

2.3       Operational Challenges

One of the largest challenges to deploying streaming applications in practice is management and operation. Some key issues include:

•  Failures: This is the most heavily studied issue in the research literature. In addition to single node failures, systems also need to support graceful shutdown and restart of the whole application, e.g., to let operators migrate it to a new cluster.

•  Code Updates: Applications are rarely perfect, so developers may need to update their code. After an update, they may want the application to restart where it left off, or possibly to recompute past results that were erroneous due to a bug. Both cases need to be supported in the streaming system’s state management and fault recovery mechanisms. Systems should also support updating the runtime itself (e.g., patching Spark).

•  Rescaling: Applications see varying load over time, and generally increasing load in the long term, so operators may want to scale them up and down dynamically, especially in the cloud. Systems based on a static communication topology, while conceptually simple, are difficult to scale dynamically.

•  Stragglers: Instead of outright failing, nodes in the streaming system can slow down due to hardware or software issues and degrade the throughput of the whole application. Systems should automatically handle this situation.

•  Monitoring: Streaming systems need to give operators clear visibility into system load, backlogs, state size and other metrics.

2.4       Cost and Performance Challenges

Beyond operational and engineering issues, the cost-performance of streaming applications can be an obstacle because these applications run 24/7. For example, without dynamic rescaling, an application will waste resources outside peak hours; and even with rescaling, it may be more expensive to compute a result continuously than to run a periodic batch job. We thus designed Structured Streaming to leverage all the execution optimizations in Spark SQL [8].

So far, we chose to optimize throughput as our main performance metric because we found that it was often the most important metric in large-scale streaming applications. Applications that require a distributed streaming system usually work with large data volumes coming from external sources (e.g., mobile devices, sensors or IoT), where data may already incur a delay just getting to the system. This is one reason why event time processing is an important feature in these systems [2]. In contrast, latency-sensitive applications such as high-frequency trading or physical system control loops often run on a single scale-up processor, or even custom hardware like ASICs and FPGAs [3]. However, we also designed Structured Streaming to support executing over latency-optimized engines, and implemented a continuous processing mode for this task, which we describe in Section 6.3. This is a change over Spark Streaming, where microbatching was “baked into” the API.

Figure 1: The components of Structured Streaming.

3      Structured Streaming Overview

Structured Streaming aims to tackle the stream processing challenges we identified through a combination of API and execution engine design. In this section, we give a brief overview of the overall system. Figure 1 shows Structured Streaming’s main components.

Input and Output. Structured Streaming connects to a variety of input sources and output sinks for I/O. To provide “exactly-once” output and fault tolerance, it places two restrictions on sources and sinks, which are similar to other exactly-once systems [17, 37]:

(1)  Input sources must be replayable, allowing the system to re-read recent input data if a node crashes. In practice, organizations use a reliable message bus such as Amazon Kinesis or Apache Kafka [5, 23] for this purpose, or simply a durable file system.

(2)  Output sinks must support idempotent writes, to ensure reliable recovery if a node fails while writing. Structured Streaming can also provide atomic output for certain sinks that support it, where the entire update to the job’s output appears atomically even if it was written by multiple nodes working in parallel.
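Requirement (2) can be sketched as a sink that keys each write by micro-batch id, so a retried write after a failure becomes a no-op rather than a duplicate (a conceptual Python sketch; the class and method names are ours, not Structured Streaming’s sink interface).

```python
# Conceptual sketch of an idempotent sink: re-writing the same micro-batch
# after a failed node's work is retried leaves the output unchanged.

class IdempotentSink:
    def __init__(self):
        self.committed = {}            # batch_id -> rows already written

    def write(self, batch_id: int, rows: list) -> None:
        if batch_id in self.committed:
            return                     # duplicate attempt for this batch: no-op
        self.committed[batch_id] = rows

sink = IdempotentSink()
sink.write(0, ["x", "y"])
sink.write(0, ["x", "y"])              # retry after a simulated failure
assert sink.committed == {0: ["x", "y"]}   # written exactly once
```

Together with replayable sources, this is what lets the engine turn at-least-once re-execution into effectively exactly-once output.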

In addition to external systems, Structured Streaming also supports input and output from tables in Spark SQL.
