RDD: Partitioners

Latest recommended article published on 2024-05-07 22:04:17

花和尚也有春天


RDD Partitioners

Partitioners have come up in passing in earlier chapters. I see three roles for an RDD's partitioner, and the three are closely related at heart.

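It determines the number of reducers in a shuffle (in effect, the number of partitions in the child RDD) and which reducer each map-side record is sent to. This is its most important role.
It determines the number of partitions of an RDD. For example, the ShuffledRDD produced by groupByKey(new HashPartitioner(2)) has 2 partitions.
It determines the dependency between a CoGroupedRDD and its parent RDDs, as discussed in the section on dependencies.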
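The second and third roles can be illustrated with a small sketch. It is written in plain Python so that it runs without Spark; the HashPartitioner class here is a toy stand-in, and dependency_kind is a hypothetical helper that only mirrors the idea that a parent already partitioned in the same way needs no shuffle. It is not Spark's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HashPartitioner:
    """Toy stand-in for Spark's HashPartitioner; equality depends only on the partition count."""
    num_partitions: int

    def get_partition(self, key) -> int:
        # None mirrors the `null => 0` case; Python's % is non-negative for a positive modulus.
        return 0 if key is None else hash(key) % self.num_partitions


def dependency_kind(parent: Optional[HashPartitioner], target: HashPartitioner) -> str:
    """Hypothetical helper: narrow if the parent is already partitioned like the target."""
    return "narrow" if parent == target else "shuffle"


# A shuffle with HashPartitioner(2) always yields exactly two child partitions.
records = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
partitioner = HashPartitioner(2)
child = [[] for _ in range(partitioner.num_partitions)]
for key, value in records:
    child[partitioner.get_partition(key)].append((key, value))
```

Because equality is defined on the partition count alone, two independently created partitioners with the same count compare equal, which is what lets the comparison detect co-partitioned parents.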

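Because a partitioner indirectly determines both the number of partitions in an RDD and the number of records in each partition, choosing a suitable one can noticeably improve the performance of parallel computation (recall the spark.default.parallelism configuration parameter mentioned in the section on partitions). Apache Spark ships with two built-in partitioners: the hash partitioner (HashPartitioner) and the range partitioner (RangePartitioner).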

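Developers can also write their own partitioner to suit their needs. In the source code a partitioner is the Partitioner abstract class; each subclass (custom partitioners included) must implement getPartition, which decides which partition of the child RDD receives a key-value record with a given key.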

/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

Hash Partitioner

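The hash partitioner is implemented in HashPartitioner. Its getPartition method is simple: take the key's hashCode and compute its non-negative remainder modulo the number of partitions in the child RDD. A null key always maps to partition 0.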

/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}

The figure below shows an example of partitioning with the hash partitioner. In this example the hashCode of an integer is the integer itself.

[Figure: hash partitioner]
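As a rough illustration of the getPartition logic, the following Python sketch emulates the JVM remainder (whose sign follows the dividend) and then shifts negative results up, which is the idea behind Utils.nonNegativeMod. The function names are my own, and integer keys are used as their own hash codes, as in the figure.

```python
def java_remainder(x: int, mod: int) -> int:
    """Emulate the JVM's % operator, whose result takes the sign of the dividend."""
    r = abs(x) % abs(mod)
    return -r if x < 0 else r


def non_negative_mod(x: int, mod: int) -> int:
    """Shift a negative remainder up by mod so the result lies in [0, mod)."""
    raw = java_remainder(x, mod)
    return raw + mod if raw < 0 else raw


def get_partition(hash_code, num_partitions: int) -> int:
    # None plays the role of a null key, which always goes to partition 0.
    if hash_code is None:
        return 0
    return non_negative_mod(hash_code, num_partitions)


# Integer keys hash to themselves on the JVM, so the key doubles as its hash code.
assignment = {key: get_partition(key, 3) for key in [1, 2, 3, 4, 5, 6, -1]}
```

Without the shift, a key with hash code -1 would yield the invalid partition ID -1 on the JVM; with it, the key lands in partition 2.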

Range Partitioner

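The hash partitioner is simple and fast, but it has one obvious drawback: because it ignores how the keys are distributed, how evenly keys hash across partitions depends on the data, and in some cases a few partitions end up with far more records than the others. The range partitioner avoids this problem to some extent: it tries to give every partition roughly the same amount of data, and the upper bounds of the partitions are in sorted order. The figure below shows an example of partitioning with the range partitioner.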

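If you test the example below yourself, you will find that key 4 is assigned to the first partition of the child RDD, which does not match the figure. This is because, since Apache Spark 1.1, partition boundaries are compared using > rather than >=; the code is covered in detail later. I have filed a JIRA issue and a PR with Apache Spark.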

[Figure: range partitioner example]
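The skew that motivates the range partitioner is easy to reproduce. In this Python sketch (the partition count and the key sets are made up for illustration) integer keys stand in for their own hash codes. When every key is a multiple of the partition count, hash partitioning sends all records to a single partition.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count chosen for the illustration

# Every key is a multiple of 4, so key % 4 is always 0.
skewed_keys = [0, 4, 8, 12, 16, 20, 24, 28]
skewed_sizes = Counter(key % NUM_PARTITIONS for key in skewed_keys)

# Consecutive keys, by contrast, spread evenly over the four partitions.
even_keys = list(range(8))
even_sizes = Counter(key % NUM_PARTITIONS for key in even_keys)
```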

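The range partitioner has two jobs: determine the partition boundaries of the child RDD from the characteristics of the parent RDD's data, and, given a key-value record, quickly locate the partition ID it belongs to from its key.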

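If you have worked with Apache Hadoop's TeraSort algorithm, you will recognize that the range partitioner solves the same problem TeraSort solves on the map side. The two take very similar approaches: sample the parent RDD's data, sort the sample, split it into M blocks, and use the keys at the split points for fast lookup later. Although the idea is essentially the same, some properties unique to RDDs make the range partitioner differ from TeraSort in many implementation details.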
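The sample, sort, split, and lookup steps can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (a hand-written sample, evenly spaced cut points, no handling of duplicate bounds) and not Spark's actual sampling algorithm. The lookup counts the bounds strictly smaller than the key, so a key equal to a bound stays in the lower partition, matching the > comparison described above.

```python
from bisect import bisect_left


def compute_bounds(sample, num_partitions):
    """Sort the sample and take num_partitions - 1 evenly spaced cut points as upper bounds."""
    ordered = sorted(sample)
    n = len(ordered)
    assert n >= num_partitions, "sample must contain at least one key per partition"
    return [ordered[i * n // num_partitions - 1] for i in range(1, num_partitions)]


def get_partition(key, bounds):
    """Binary search: the partition ID is the number of bounds strictly smaller than key."""
    return bisect_left(bounds, key)


keys = list(range(1, 10))             # parent data: the keys 1 to 9
sample = [9, 2, 7, 4, 1, 6, 3, 8, 5]  # pretend sample; here it happens to cover every key
bounds = compute_bounds(sample, 3)

partitions = [[] for _ in range(len(bounds) + 1)]
for key in keys:
    partitions[get_partition(key, bounds)].append(key)
```

With three partitions the bounds are 3 and 6, so each partition receives three keys and the partitions themselves are in key order.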

Original reference: https://ihainan.gitbooks.io/spark-source-code/content/section1/partitioner.html

Custom Partitioning (Partitioner)

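As we know, Spark provides two built-in partitioning strategies, HashPartitioner and RangePartitioner (for a walkthrough of their code, see "A Detailed Look at the Code of Spark's HashPartitioner and RangePartitioner"), and they suit many scenarios. When the built-in strategies do not meet our needs, we can define a custom one. Spark provides an interface for this: extend the Partitioner abstract class and implement three methods: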

package org.apache.spark

/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

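def numPartitions: Int: returns the number of partitions you want to create.
def getPartition(key: Any): Int: computes and returns the partition ID for the given key; the result must lie in the range 0 to numPartitions - 1.
equals(): the standard Java equality method. You need to implement it because Spark compares partitioner objects internally to decide whether two RDDs are partitioned in the same way.

Suppose we want all URLs from the same domain to end up on one node, for example https://www.iteblog.com and https://www.iteblog.com/archives/1368. With HashPartitioner the two URLs may have different hash values, so they may be placed on different nodes. In this case we need a custom partitioning strategy, which can be implemented as follows: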

package com.iteblog.utils

import org.apache.spark.Partitioner

/**
 * User: 过往记忆
 * Date: 2015-05-21
 * Time: 23:34
 * blog: https://www.iteblog.com
 * Article URL: https://www.iteblog.com/archives/1368
 * 过往记忆 blog: a technical blog focused on hadoop, hive, spark, shark, and flume, with plenty of practical content
 * 过往记忆 WeChat official account: iteblog_hadoop
 */

class IteblogPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts

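  override def getPartition(key: Any): Int = {
    // Partition by host name only, so every URL of a domain maps to the same partition.
    val domain = new java.net.URL(key.toString).getHost()
    val code = (domain.hashCode % numPartitions)
    // hashCode may be negative, so shift a negative remainder into the valid range.
    if (code < 0) {
      code + numPartitions
    } else {
      code
    }
  }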

  override def equals(other: Any): Boolean = other match {
    case iteblog: IteblogPartitioner =>
      iteblog.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}

Because hashCode may be negative, we have to adjust the result. We can then use our partitioner in the partitionBy() method:

iteblog.partitionBy(new IteblogPartitioner(20))

Defining a custom partitioning strategy in Java is similar to Scala: extend org.apache.spark.Partitioner and implement its methods.

In Python you do not need to extend the Partitioner class; just pass an extra hash function to iteblog.partitionBy(), as follows:

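from urllib.parse import urlparse  # Python 3; Python 2 used `import urlparse`

def iteblog_domain(url):
  # Hash only the host name so that all URLs of a domain share a partition.
  # str hashes are stable across worker processes only when PYTHONHASHSEED is fixed.
  return hash(urlparse(url).netloc)

iteblog.partitionBy(20, iteblog_domain)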
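To check the idea without a cluster, here is a small self-contained sketch. The helper domain_partition is hypothetical, and zlib.crc32 is swapped in as the hash (my assumption, not part of the original) because it returns the same value in every Python process, whereas hash() on strings is randomized per process by default.

```python
import zlib
from urllib.parse import urlparse


def domain_partition(url: str, num_partitions: int) -> int:
    """Hypothetical helper: map a URL to a partition using only its host name."""
    host = urlparse(url).netloc
    # crc32 is a stand-in hash (an assumption of this sketch); it is non-negative
    # and identical across processes, unlike hash() on str.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


home = domain_partition("https://www.iteblog.com", 20)
article = domain_partition("https://www.iteblog.com/archives/1368", 20)
```

Both URLs share the host www.iteblog.com, so home and article are the same partition ID regardless of the path.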

Reference: https://www.iteblog.com/archives/1368.html
