Distributed Processing with the New PyArrow-Backed Pandas UDFs in PySpark 3.0

This article describes how the new Pandas UDFs backed by PyArrow in PySpark 3.0 enable distributed processing and improve data-processing efficiency.

Data processing time is valuable: every minute spent translates directly into financial cost for users. This article is aimed at data scientists and data engineers who want to take advantage of the newest enhancements in Apache Spark, which in a remarkably short time has emerged as the next-generation big data processing engine and is being adopted across the industry faster than ever.

Spark’s unified architecture supports compatible, composable APIs that achieve high performance by optimizing across the various libraries and functions combined within a program, enabling users to build applications that reach beyond the existing libraries. It also gives users the opportunity to write their own analytical libraries on top.

Data is costly to move, so Spark concentrates on performing computations over the data, regardless of where it resides. In its user-facing APIs, Spark strives to make these storage systems look broadly alike, so that applications do not need to care about where their data lives.

When the data is too big to fit on a single machine, or a computation would take too long to execute on one machine, the data has to be placed on more than one server or computer, which in turn requires processing it in a distributed manner. The Spark DataFrame is the flagship Structured API: it presents a table of data with rows and columns, and with its column-and-type schema it can span large numbers of data sources.

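As a quick illustration of this structure, here is a minimal sketch of creating a DataFrame with an explicit column-and-type schema (the application name, sample rows, and column names are invented for this example):

```python
from pyspark.sql import SparkSession

# Build (or reuse) a local Spark session; the app name is arbitrary.
spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# A small DataFrame with an explicit column-and-type (DDL-style) schema.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    schema="name string, age int",
)

df.printSchema()  # prints the column-and-type schema
df.show()         # renders the rows as a table
```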

The purpose of this article is to introduce the benefits of one of the newly released features of Spark 3.0: Pandas UDFs backed by Apache Arrow in PySpark, which make it possible to execute pandas-like UDFs in a parallel manner. In the following sections, PyArrow's crucial role in PySpark session configuration and PySpark-enabled Pandas UDFs are explained in detail, with code snippets for the corresponding topics. At the end of the article, references and additional resources are provided for further research.

1. PyArrow with PySpark

In previous versions of Spark, converting a DataFrame to pandas in PySpark involved inefficient steps: collecting all rows to the Spark driver, serializing each row into Python's pickle format (row by row), and sending them to a Python worker process. At the end of this conversion procedure, each row is unpickled into a massive list of tuples. To overcome these inefficient operations, Apache Arrow, which is integrated with Apache Spark, can be used to enable faster columnar data transfer and conversion.

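A minimal sketch of what this looks like in practice (the session name and data are illustrative; the configuration keys shown are the ones Spark 3.0 uses for its Arrow-based transfer path):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Enable Arrow-based columnar transfer between the JVM and Python
# (disabled by default in Spark 3.0).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Optionally fall back to the row-by-row pickle path if Arrow fails.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

df = spark.range(1_000_000)

# With Arrow enabled, this transfers columnar batches instead of
# pickling and unpickling each row individually.
pandas_df = df.toPandas()
```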

1.1. Why Use PyArrow with PySpark

Apache Arrow helps accelerate the conversion from columnar memory to pandas objects by providing high-performance in-memory columnar data structures.

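To see these columnar structures on their own, here is a standalone sketch using the pyarrow package directly, independent of Spark (the sample frame is invented for illustration):

```python
import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

# Arrow lays the data out column by column in contiguous memory.
table = pa.Table.from_pandas(pdf)
print(table.schema)

# Converting back to pandas is a columnar (largely copy-free) operation.
round_tripped = table.to_pandas()
```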

Previously, Spark exposed a row-based interface for interpreting and running user-defined functions (UDFs). This introduced high overhead in serialization and deserialization and made it difficult to work with Python libraries such as NumPy and pandas, which are designed to operate on whole arrays rather than individual rows.
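
This row-at-a-time overhead is exactly what the Arrow-backed Pandas UDFs address: the function receives and returns whole pandas Series, so data crosses the JVM/Python boundary in columnar Arrow batches. A minimal sketch of the Spark 3.0 style (the function name and arithmetic are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# In Spark 3.0, Python type hints (Series -> Series) mark this as a
# scalar Pandas UDF, evaluated on Arrow batches instead of single rows.
@pandas_udf("double")
def times_two(s: pd.Series) -> pd.Series:
    return s * 2.0

df = spark.range(10)  # a single 'id' column of longs
df.select(times_two(df["id"]).alias("doubled")).show()
```

With this vectorized interface, the per-row serialization described above disappears: each batch of rows moves between the JVM and the Python worker as a single Arrow columnar buffer, and libraries like NumPy and pandas can operate on the whole batch at once.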
