Custom Partitioning of Spark Data Sets

Spark, as we all know, is generally used to process large data sets in a distributed manner. However, the performance of a Spark job depends heavily on whether the data being processed is properly (i.e. evenly) distributed across its executors.

The problem with uneven data distribution is that partitions with less data finish quickly, while partitions holding a huge amount of data take a long time to process. Because of this, the overall performance of the Spark job goes down.

The distribution of data into RDD partitions always depends on the default partitioner available in Spark. It is Spark's hash partitioner (the default partitioner) that plays the role of partitioning the data behind the scenes. For this reason, irrespective of how many times the "repartition()" method is applied on top of a data set, the data might still not be evenly distributed amongst all the partitions. Hence, even distribution of data in a Spark RDD is not always guaranteed with the default partitioner.

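As a rough illustration, the sketch below (assuming a local SparkContext; the data set, keys, and numbers are purely illustrative) shows how the default hash partitioner sends every record with the same key to the same partition, no matter how many partitions are requested:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("skew-demo").setMaster("local[4]"))

    // Artificially skewed key-value data: most records share the key "US".
    val records = sc.parallelize(Seq.fill(1000)(("US", 1)) ++ Seq(("IN", 1), ("UK", 1), ("DE", 1)))

    // HashPartitioner assigns a record to partition (key.hashCode mod numPartitions),
    // so every "US" record lands in the same partition regardless of the partition count.
    val hashed = records.partitionBy(new HashPartitioner(4))

    // Count records per partition to make the skew visible.
    hashed.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx -> $n records") }

Even with four partitions requested here, roughly a thousand records would end up in a single partition while the other three stay almost empty.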

To avoid these kinds of situations, the ability to apply a custom partitioner on the RDDs is critical. That is where Spark provides the option to create a "custom partitioner", where one can apply data-partitioning logic to RDDs based on custom conditions. Below, we will look in detail at an example of how the default partitioner works for a given data set, why applying the repartition() method is not sufficient to evenly distribute that data set, and how a custom partitioner can help resolve the problem of uneven distribution for the same data set.

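For the same kind of skewed data set, a custom partitioner could look like the sketch below. The class name, the hot key "US", and the routing logic are illustrative assumptions; what Spark actually requires is only that we extend org.apache.spark.Partitioner and implement numPartitions and getPartition:

    import org.apache.spark.Partitioner

    // Illustrative custom partitioner: isolates the heavily skewed key in its own
    // partition and hashes every other key across the remaining partitions.
    class CountryPartitioner(override val numPartitions: Int) extends Partitioner {
      require(numPartitions >= 2)

      override def getPartition(key: Any): Int = key match {
        case "US"  => 0  // dedicated partition for the hot key
        case other => 1 + math.abs(other.hashCode % (numPartitions - 1))
      }

      // Spark compares partitioners to decide whether a shuffle can be skipped,
      // so equals/hashCode should be implemented consistently.
      override def equals(other: Any): Boolean = other match {
        case p: CountryPartitioner => p.numPartitions == numPartitions
        case _                     => false
      }

      override def hashCode(): Int = numPartitions
    }

Applying it is then a one-liner, e.g. records.partitionBy(new CountryPartitioner(4)), which redistributes the data according to the custom logic instead of a plain hash of the key.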

Spark Default Partitioner:

Spark splits data into different partitions and executes computations on top of them in a parallel fashion. By default, it uses a hash partitioner to partition the data across its partitions.
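
One way to see that default in action, sketched here with the hypothetical sc and records from the earlier example, is to run a shuffle operation without naming a partitioner and then inspect the result:

    // reduceByKey does not specify a partitioner, so Spark falls back to the default
    // HashPartitioner; the partition count comes from spark.default.parallelism
    // (or the largest upstream partition count when that property is not set).
    val reduced = records.reduceByKey(_ + _)

    println(reduced.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
    println(reduced.getNumPartitions)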
