Translation: Spark: The Definitive Guide, Chapter 20: Stream Processing Fundamentals

This article covers the fundamentals of stream processing in Apache Spark, focusing on the Structured Streaming API, which integrates directly with the DataFrame and Dataset APIs and is the framework of choice for new streaming applications. It contrasts stream processing with batch processing, surveys stream processing use cases such as notifications, alerting, and real-time reporting, and discusses Spark's DStreams and Structured Streaming APIs, highlighting Structured Streaming's optimizations, ease of use, and its advantages for building end-to-end continuous applications.


This series is my own original side-by-side translation; corrections for any inaccuracies or errors are welcome. Zhihu's incomplete Markdown support hampers readability, so feel free to visit my personal blog: SnailDove's blog.

Chapter 20. Stream Processing Fundamentals

Stream processing is a key requirement in many big data applications. As soon as an application computes something of value—say, a report about customer activity, or a new machine learning model —an organization will want to compute this result continuously in a production setting. As a result, organizations of all sizes are starting to incorporate stream processing, often even in the first version of a new application.


Luckily, Apache Spark has a long history of high-level support for streaming. In 2012, the project incorporated Spark Streaming and its DStreams API, one of the first APIs to enable stream processing using high-level functional operators like map and reduce. Hundreds of organizations now use DStreams in production for large real-time applications, often processing terabytes of data per hour. Much like the Resilient Distributed Dataset (RDD) API, however, the DStreams API is based on relatively low-level operations on Java/Python objects that limit opportunities for higher-level optimization. Thus, in 2016, the Spark project added Structured Streaming, a new streaming API built directly on DataFrames that supports both rich optimizations and significantly simpler integration with other DataFrame and Dataset code. The Structured Streaming API was marked as stable in Apache Spark 2.2, and has also seen swift adoption throughout the Spark community.


In this book, we will focus only on the Structured Streaming API, which integrates directly with the DataFrame and Dataset APIs we discussed earlier in the book and is the framework of choice for writing new streaming applications. If you are interested in DStreams, many other books cover that API, including several dedicated books on Spark Streaming only, such as Learning Spark Streaming by Francois Garillot and Gerard Maas (O’Reilly, 2017). Much as with RDDs versus DataFrames, however, Structured Streaming offers a superset of the majority of the functionality of DStreams, and will often perform better due to code generation and the Catalyst optimizer. Before we discuss the streaming APIs in Spark, let’s more formally define streaming and batch processing. This chapter will discuss some of the core concepts in this area that we will need throughout this part of the book. It won’t be a dissertation on this topic, but will cover enough of the concepts to let you make sense of systems in this space.


What Is Stream Processing?

Stream processing is the act of continuously incorporating new data to compute a result. In stream processing, the input data is unbounded and has no predetermined beginning or end. It simply forms a series of events that arrive at the stream processing system (e.g., credit card transactions, clicks on a website, or sensor readings from Internet of Things [IoT] devices). User applications can then compute various queries over this stream of events (e.g., tracking a running count of each type of event or aggregating them into hourly windows). The application will output multiple versions of the result as it runs, or perhaps keep it up to date in an external “sink” system such as a key-value store.


Naturally, we can compare streaming to batch processing, in which the computation runs on a fixed input dataset. Oftentimes, this might be a large-scale dataset in a data warehouse that contains all the historical events from an application (e.g., all website visits or sensor readings for the past month). Batch processing also takes a query to compute, similar to stream processing, but only computes the result once.


Although streaming and batch processing sound different, in practice, they often need to work together. For example, streaming applications often need to join input data against a dataset written periodically by a batch job, and the output of streaming jobs is often files or tables that are queried in batch jobs. Moreover, any business logic in your applications needs to work consistently across streaming and batch execution: for example, if you have custom code to compute a user’s billing amount, it would be harmful to get a different result when running it in a streaming versus batch fashion! To handle these needs, Structured Streaming was designed from the beginning to interoperate easily with the rest of Spark, including batch applications. Indeed, the Structured Streaming developers coined the term continuous applications to capture end-to-end applications that consist of streaming, batch, and interactive jobs all working on the same data to deliver an end product. Structured Streaming is focused on making it simple to build such applications in an end-to-end fashion instead of only handling stream-level per-record processing.


Stream Processing Use Cases 流处理使用案例

We defined stream processing as the incremental processing of unbounded datasets, but that’s a strange way to motivate a use case. Before we get …
