Custom Partitioning of Spark Data Sets

Spark, as we all know, is generally used to process large data sets in a distributed manner. However, the performance of a Spark job depends heavily on whether the data being processed is properly (i.e. evenly) distributed across its executors.

The problem with uneven data distribution is that partitions with less data finish quickly, while partitions holding a huge amount of data take a long time to process. Because of this, the overall performance of the Spark job goes down.

The distribution of data into RDD partitions always depends on the default partitioner available in Spark. It is Spark's hash partitioner (the default partitioner) that plays the role of partitioning the data behind the scenes. For this reason, irrespective of how many times the "repartition()" method is applied on top of a data set, the data might still not be evenly distributed amongst all the partitions. Hence, even distribution of data in a Spark RDD is not always guaranteed with the default partitioner.

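As a rough illustration, the sketch below (assuming a local SparkContext; the data set, keys, and numbers are purely illustrative) shows how the default hash partitioner sends every record with the same key to the same partition, no matter how many partitions are requested:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("skew-demo").setMaster("local[4]"))

    // Artificially skewed key-value data: most records share the key "US".
    val records = sc.parallelize(Seq.fill(1000)(("US", 1)) ++ Seq(("IN", 1), ("UK", 1), ("DE", 1)))

    // HashPartitioner assigns a record to partition (key.hashCode mod numPartitions),
    // so every "US" record lands in the same partition regardless of the partition count.
    val hashed = records.partitionBy(new HashPartitioner(4))

    // Count records per partition to make the skew visible.
    hashed.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n records") }

Even with four partitions requested here, roughly a thousand records would end up in a single partition while the other three stay almost empty.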

To avoid these kinds of situations, the ability to apply a custom partitioner on the RDDs is critical. That is where Spark provides the option to create a "custom partitioner", where one can apply data-partitioning logic to RDDs based on custom conditions. Below, we will look in detail at an example of how the default partitioner works for a given data set, why applying the repartition() method is not sufficient to evenly distribute that data set, and how a custom partitioner can help resolve the problem of uneven distribution for the same data set.

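For the same kind of skewed data set, a custom partitioner could look like the sketch below. The class name, the hot key "US", and the routing logic are illustrative assumptions; what Spark actually requires is only that we extend org.apache.spark.Partitioner and implement numPartitions and getPartition:

    import org.apache.spark.Partitioner

    // Illustrative custom partitioner: isolates the heavily skewed key in its own
    // partition and hashes every other key across the remaining partitions.
    class CountryPartitioner(override val numPartitions: Int) extends Partitioner {
      require(numPartitions >= 2)

      override def getPartition(key: Any): Int = key match {
        case "US"  => 0  // dedicated partition for the hot key
        case other => 1 + math.abs(other.hashCode % (numPartitions - 1))
      }

      // Spark compares partitioners to decide whether a shuffle can be skipped,
      // so equals/hashCode should be implemented consistently.
      override def equals(other: Any): Boolean = other match {
        case p: CountryPartitioner => p.numPartitions == numPartitions
        case _                     => false
      }

      override def hashCode(): Int = numPartitions
    }

Applying it is then a one-liner, e.g. records.partitionBy(new CountryPartitioner(4)), which redistributes the data according to the custom logic instead of a plain hash of the key.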

Spark Default Partitioner:

Spark splits data into different partitions and executes computations on top of them in a parallel fashion. By default, it uses a hash partitioner to partition the data across its partitions.
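
One way to see that default in action, sketched here with the hypothetical sc and records from the earlier example, is to run a shuffle operation without naming a partitioner and then inspect the result:

    // reduceByKey does not specify a partitioner, so Spark falls back to the default
    // HashPartitioner; the partition count comes from spark.default.parallelism
    // (or the largest upstream partition count when that property is not set).
    val reduced = records.reduceByKey(_ + _)

    println(reduced.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
    println(reduced.getNumPartitions)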
