Spark 3.0.0 New Features

Apache Spark 3.0.0 is the first release of the 3.x line. The vote passed on the 10th of June, 2020. This release is based on git tag v3.0.0 which includes all commits up to June 10. Apache Spark 3.0 builds on many of the innovations from Spark 2.x, bringing new ideas as well as continuing long-term projects that have been in development. With the help of tremendous contributions from the open-source community, this release resolved more than 3400 tickets as the result of contributions from over 440 contributors.

Here are the feature highlights in Spark 3.0: adaptive query execution; dynamic partition pruning; ANSI SQL compliance; significant improvements in pandas APIs; new UI for structured streaming; up to 40x speedups for calling R user-defined functions; accelerator-aware scheduler; and SQL reference documentation.

The new features are mainly related to Spark SQL and Python, which reflects the two cores of the big data field: BI and AI. Below are the major features of this release, covering improvements in performance, APIs, the ecosystem, data sources, SQL compliance, monitoring, and debugging.

http://spark.apache.org/releases/spark-release-3-0-0.html

Dynamic partition pruning
https://issues.apache.org/jira/browse/SPARK-11150 

This implements dynamic partition pruning by adding a dynamic-partition-pruning filter when there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach:

  • as a broadcast relation if it is a broadcast hash join (the broadcast relation is then transformed into a reused broadcast exchange by the ReuseExchange rule); or
  • as a duplicated subquery if the estimated benefit of the saved partition-table scan is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise
  • as a bypassed condition (true).

A basic example of DPP is shown below.
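The following is a minimal PySpark sketch of the kind of star-schema query DPP targets. The table and column names are made up for illustration, and spark.sql.optimizer.dynamicPartitionPruning.enabled is set explicitly even though it defaults to true in Spark 3.0.

```python
# Sketch of a query shape that can benefit from dynamic partition pruning (DPP).
# Table and column names (fact, dim, part_key, flag) are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dpp-example")
    # Enabled by default in Spark 3.0; shown here for clarity.
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)

# A fact table partitioned by a key column, plus a small dimension table.
spark.range(0, 100000).selectExpr("id", "id % 10 AS part_key") \
    .write.mode("overwrite").partitionBy("part_key").saveAsTable("fact")
spark.range(0, 10).selectExpr("id AS part_key", "id % 2 AS flag") \
    .write.mode("overwrite").saveAsTable("dim")

# The filter on the dimension table is turned into a dynamic filter on
# fact.part_key, so only the matching partitions of `fact` are scanned.
result = spark.sql("""
    SELECT f.id
    FROM fact f JOIN dim d ON f.part_key = d.part_key
    WHERE d.flag = 1
""")
result.explain()  # look for the dynamic pruning filter in the partition filters
```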

SPIP: Accelerator-aware task scheduling for Spark 

https://issues.apache.org/jira/browse/SPARK-24615

GPUs and other accelerators have been widely used for accelerating special workloads, e.g., deep learning and signal processing. While users from the AI community use GPUs heavily, they often need Apache Spark to load and process large datasets and to handle complex data scenarios like streaming. YARN and Kubernetes already support GPUs in their recent releases. Although Spark supports those two cluster managers, Spark itself is not aware of GPUs exposed by them and hence Spark cannot properly request GPUs and schedule them for users. This leaves a critical gap to unify big data and AI workloads and make life simpler for end users.

To make Spark aware of GPUs, two major changes are needed at a high level:

  • At the cluster manager level, update or upgrade the cluster managers to include GPU support, and expose user interfaces for Spark to request GPUs from them.
  • Within Spark, update the scheduler to understand the GPUs allocated to executors and requested by tasks, and assign GPUs to tasks properly.

Based on the work done in YARN and Kubernetes to support GPUs and on some offline prototypes, the necessary features could be implemented in the next major release of Spark. A detailed scoping doc lists the user stories and their priorities.
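As a rough illustration, the sketch below uses the resource configurations introduced for this feature (spark.executor.resource.gpu.* and spark.task.resource.gpu.*) and shows how a task can read its assigned GPU addresses. The discovery-script path and the cluster setup are assumptions, so treat this as the shape of the API rather than a ready-to-run job.

```python
# Hedged sketch of accelerator-aware scheduling in Spark 3.0.
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-scheduling-example")
    # Ask the cluster manager for 1 GPU per executor ...
    .config("spark.executor.resource.gpu.amount", "1")
    # ... and let each task claim 1 GPU, so tasks do not oversubscribe it.
    .config("spark.task.resource.gpu.amount", "1")
    # Script that prints the GPU addresses visible to the executor
    # (the path here is an assumption for this sketch).
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/spark/scripts/getGpusResources.sh")
    .getOrCreate()
)

def which_gpu(_):
    # Inside a task, the assigned GPU addresses are exposed via TaskContext.
    return TaskContext.get().resources()["gpu"].addresses

print(spark.sparkContext.parallelize(range(2), 2).map(which_gpu).collect())
```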

New Adaptive Query Execution in Spark SQL

https://issues.apache.org/jira/browse/SPARK-31412

An earlier proposal introduced the basic idea of adaptive execution in Spark: a new API was added to DAGScheduler to support submitting a single map stage. The previous implementation of adaptive execution in Spark SQL supported changing the reducer number at runtime; an ExchangeCoordinator was used to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or more stages, and the coordinator was added at the same time as the Exchanges themselves. That implementation has several limitations:

  • It may introduce additional shuffles that decrease performance, which can be seen in the EnsureRequirements rule when it adds the ExchangeCoordinator.
  • Adding ExchangeCoordinators while adding Exchanges means there is no global picture of all shuffle dependencies of a post-shuffle stage. For example, for a three-table join in a single stage, the same ExchangeCoordinator should be used for all three Exchanges, but two separate ExchangeCoordinators end up being added.
  • With that framework it is not easy to flexibly implement other adaptive-execution features, such as changing the execution plan or handling skewed joins at runtime.
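The new AQE implementation in Spark 3.0 is controlled by a handful of spark.sql.adaptive.* settings. The sketch below is a toy example assuming a local session; the join itself is arbitrary and only serves to produce a shuffle that AQE can re-optimize.

```python
# Minimal sketch enabling the new adaptive query execution (AQE) in Spark 3.0.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-example")
    # Master switch for adaptive query execution (off by default in 3.0.0).
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce small post-shuffle partitions at runtime.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed partitions of a sort-merge join at runtime.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)

left = spark.range(0, 1000000).withColumnRenamed("id", "k")
right = spark.range(0, 1000).withColumnRenamed("id", "k")

# With AQE on, the plan is re-optimized after each shuffle stage finishes,
# e.g. the reducer count is chosen from the actual shuffle statistics.
joined = left.join(right, "k")
joined.explain()        # the root of the physical plan should be AdaptiveSparkPlan
print(joined.count())
```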

Revisiting Python / pandas UDF

https://issues.apache.org/jira/browse/SPARK-28264

Over the past two years, pandas UDFs have been perhaps the most important change to Spark for Python data science. However, these functionalities have evolved organically, leading to some inconsistencies and confusion among users. This document revisits UDF definition and naming, as a result of discussions among Xiangrui, Li Jin, Hyukjin, and Reynold.
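The outcome of that redesign in Spark 3.0 is a pandas UDF API based on Python type hints rather than explicit PandasUDFType constants. The following is a small sketch; the function and column names are made up, and pandas UDFs require pyarrow to be installed.

```python
# Sketch of the Spark 3.0 pandas UDF style driven by Python type hints.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-example").getOrCreate()

# Series -> Series type hints tell Spark this is a scalar pandas UDF.
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius("temp_f").alias("temp_c")).show()
```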

Support Structured Streaming UI

https://issues.apache.org/jira/browse/SPARK-29543

This JIRA was opened to add a dedicated web UI for Structured Streaming.
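Any running streaming query shows up in the new Structured Streaming tab of the Spark web UI, with per-query progress statistics. A tiny sketch that produces such a query, using the built-in rate source and console sink, might look like this:

```python
# Sketch of a short-lived streaming query whose progress appears in the
# Structured Streaming tab of the Spark UI while it runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-ui-example").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("console")      # write each micro-batch to stdout
    .outputMode("append")
    .start()
)
query.awaitTermination(30)  # let it run for about 30 seconds
query.stop()
```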

Catalog plugin API

https://issues.apache.org/jira/browse/SPARK-31121

For details, see the SPIP doc: https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d

This will bring multi-catalog support to Spark and allow external catalog implementations.
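A hedged sketch of how such a plugin is wired in: the spark.sql.catalog.<name> configuration key is the real hook, while the class name com.example.MyCatalog, its JDBC option, and the three-part table name below are hypothetical. The plugin class itself must implement Spark's JVM-side catalog interfaces and be on the classpath.

```python
# Sketch of registering and using an external catalog plugin (hypothetical
# implementation class and table names).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("catalog-plugin-example")
    # Register a catalog named "mycat" backed by a custom implementation.
    .config("spark.sql.catalog.mycat", "com.example.MyCatalog")
    # Plugins receive their own options from keys under the same prefix.
    .config("spark.sql.catalog.mycat.url", "jdbc:postgresql://db:5432/meta")
    .getOrCreate()
)

# With multi-catalog support, tables can be addressed by a three-part name:
# <catalog>.<namespace>.<table>.
spark.sql("SELECT * FROM mycat.sales.orders").show()
```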

Build and Run Spark on JDK11
https://issues.apache.org/jira/browse/SPARK-24417

This is an umbrella JIRA for Apache Spark to support JDK 11.

As JDK 8 is reaching end of life and JDK 9 and 10 are already end-of-life, per community discussion, Spark skips JDK 9 and 10 and supports JDK 11 directly.

Spark on Hadoop 3.0.0

https://issues.apache.org/jira/browse/SPARK-23534

Major Hadoop vendors have already stepped, or will soon step, into Hadoop 3.0, so Spark should also be able to run with Hadoop 3.0. This JIRA tracks the work to make Spark run on Hadoop 3.0.

The work includes:

  1. Add a new Hadoop 3.0.0 profile to make Spark buildable with Hadoop 3.0.
  2. Test whether there are dependency issues with Hadoop 3.0.
  3. Investigate the feasibility of using shaded client jars.


 
