Spark Core Source Code Study: RDD Basics

linhao19891124

1. RDD Basic Concepts

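RDD stands for resilient distributed dataset: a fault-tolerant collection of elements that can be processed in parallel. An RDD is read-only; its data cannot be modified in place, and applying a transformation such as map produces a new RDD.

In Spark, an RDD can be created by passing an ordinary collection to SparkContext's parallelize method, or by referencing a dataset in an external storage system such as a shared file system, HDFS, or HBase.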

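In the Spark source, RDD is an abstract class with two especially important abstract methods that every subclass must implement. compute takes a partition and returns an iterator over that partition's data; in other words, it computes the contents of one partition and exposes them as an iterator. getPartitions returns all partitions of the RDD. There is also a partitioner field, which subclasses may override depending on how their data is partitioned.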

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
...
    def compute(split: Partition, context: TaskContext): Iterator[T]

    protected def getPartitions: Array[Partition]

    val partitioner: Option[Partitioner] = None
...
}
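To make this contract concrete, here is a minimal, Spark-free sketch of the same shape: an abstract class with getPartitions and compute, plus one toy subclass. MiniRDD, MiniPartition, and ChunkedRDD are hypothetical names used only for illustration; they are not part of Spark, and the real RDD class does far more (scheduling, caching, lineage).

```scala
object MiniRddSketch {
  // Hypothetical stand-in for Spark's Partition: it only carries an index.
  final case class MiniPartition(index: Int)

  // Hypothetical stand-in for the RDD contract: a subclass declares which
  // partitions exist and how to compute the data of a single partition.
  abstract class MiniRDD[T] {
    protected def getPartitions: Array[MiniPartition]
    def compute(split: MiniPartition): Iterator[T]

    final def partitions: Array[MiniPartition] = getPartitions

    // Helper for the sketch: compute every partition and concatenate the results.
    def collect(): List[T] = partitions.toList.flatMap(p => compute(p).toList)
  }

  // Toy subclass: partition i holds the i-th chunk of `chunkSize` elements.
  final class ChunkedRDD[T](data: Vector[T], chunkSize: Int) extends MiniRDD[T] {
    require(chunkSize > 0, "chunkSize must be positive")

    protected def getPartitions: Array[MiniPartition] = {
      val count = (data.length + chunkSize - 1) / chunkSize // ceiling division
      Array.tabulate(count)(i => MiniPartition(i))
    }

    def compute(split: MiniPartition): Iterator[T] =
      data.slice(split.index * chunkSize, (split.index + 1) * chunkSize).iterator
  }
}
```

With five elements and a chunk size of 2, the toy class reports three partitions, and computing each partition independently yields the original data back in order.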
 

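An RDD carries three core pieces of information: Partition, Partitioner, and Dependency.

1.1 RDD Partitions

A partition is one chunk of the RDD's data: the RDD splits its data into several chunks according to some rule, and each chunk is a partition. The number of partitions determines the RDD's degree of parallelism. Put another way, the computations on different partitions do not depend on one another, so they can run in parallel.
At the code level, Partition is a trait whose main piece of data is the partition index. RDD has many subclasses, and nearly every one defines its own Partition subclass; these implementations differ widely and carry quite different information.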

trait Partition extends Serializable {
  /**
   * Get the partition's index within its parent RDD
   */
  def index: Int

  // A better default implementation of HashCode
  override def hashCode(): Int = index

  override def equals(other: Any): Boolean = super.equals(other)
}
 

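1.2 RDD Partitioners

Spark core provides two general-purpose built-in partitioners: HashPartitioner and RangePartitioner. Both extend the base class Partitioner, which declares numPartitions, the total number of partitions, and getPartition, which takes an element's key and returns the index of the partition that element belongs to.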

abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
 

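HashPartitioner is the most common partitioner. The RDDs produced by operations such as reduceByKey, groupByKey, and combineByKey typically carry a HashPartitioner, which is the default when no other partitioner is specified. When two RDDs are joined and both have the same HashPartitioner, no shuffle is needed: equal keys are guaranteed to sit at the same partition index on both sides, so the join can be computed partition by partition without repartitioning the data.

The implementation of HashPartitioner is simple: it takes the key's hashCode modulo the total number of partitions, and null keys go to partition 0. Users who know the distribution of their keys well can write a custom partitioner that fits their workload better.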

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }

  override def hashCode: Int = numPartitions
}
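The modulo step needs one extra care: hashCode can be negative on the JVM, and `%` keeps the sign of the dividend. The sketch below reimplements that logic in plain Scala; HashPartitionSketch and partitionFor are hypothetical names, and nonNegativeMod is written out here as an assumption about what the Spark helper of the same name does, since that helper is internal to Spark.

```scala
object HashPartitionSketch {
  // Modulo that never returns a negative result, even for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Same decision as HashPartitioner.getPartition: null keys go to partition 0,
  // every other key goes to hashCode modulo the number of partitions.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    key match {
      case null => 0
      case _    => nonNegativeMod(key.hashCode, numPartitions)
    }
  }
}
```

For example, with 4 partitions the key -3 has a raw remainder of -3, which the helper shifts to partition 1 instead of producing an invalid negative index.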
 

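RangePartitioner is mainly used for sorting; for example, the RDD produced by sortByKey carries a RangePartitioner. Its source is longer and the logic more involved. The basic idea is to sample the data in each partition of the parent RDD; if the data is skewed across partitions, the larger partitions are sampled again. The sampled keys are then sorted, and the key boundaries of each partition are determined from the total number of partitions.

For example, suppose the sampled keys, once sorted, are 1, 3, 10, 11, 14, 16, 18, 19 and there are 4 partitions. The boundaries are then 3, 11, 16: keys less than or equal to 3 belong to partition 0, keys greater than 3 and at most 11 belong to partition 1, keys greater than 11 and at most 16 belong to partition 2, and keys greater than 16 belong to partition 3.

The real algorithm is somewhat more complex. Each sampled key is weighted by the reciprocal of its sampling probability; a larger weight means the key stands for more records in the full dataset, so the range of the partition containing that key should be kept narrower.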
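The unweighted example above can be sketched in a few lines of plain Scala. This is a simplification under stated assumptions: every sampled key counts equally, keys are Int, and the names RangeBoundsSketch, boundsFromSample, and partitionFor are hypothetical. It is not the weighted algorithm used by RangePartitioner.

```scala
object RangeBoundsSketch {
  // Pick numPartitions - 1 upper bounds from an already sorted sample,
  // treating every sampled key as equally weighted (a simplification).
  def boundsFromSample(sortedSample: Vector[Int], numPartitions: Int): Vector[Int] = {
    require(numPartitions >= 1 && sortedSample.length >= numPartitions)
    val step = sortedSample.length / numPartitions
    (1 until numPartitions).map(i => sortedSample(i * step - 1)).toVector
  }

  // A key goes to the first partition whose upper bound is >= key;
  // keys above every bound go to the last partition.
  def partitionFor(key: Int, bounds: Vector[Int]): Int =
    bounds.indexWhere(bound => key <= bound) match {
      case -1 => bounds.length
      case i  => i
    }
}
```

Applied to the sample 1, 3, 10, 11, 14, 16, 18, 19 with 4 partitions, the sketch picks the bounds 3, 11, 16 and assigns two sampled keys to each partition, matching the walk-through.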

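One more point deserves attention: sortByKey needs to sample, and sampling is done by running a job. So although sortByKey is a transformation, it triggers a job and is therefore relatively slow to call. The actual sort only happens when an action is eventually invoked; sorting involves a shuffle, and that step takes even longer.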

  def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
      : RDD[(K, V)] = self.withScope
  {
    val part = new RangePartitioner(numPartitions, self, ascending)
    new ShuffledRDD[K, V, V](self, part)
      .setKeyOrdering(if (ascending) ordering else ordering.reverse)
  }
 

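1.3 RDD Dependencies

RDD dependencies are either narrow or wide.
A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD, as with map and filter.
A wide dependency means a partition of the parent RDD is used by multiple partitions of the child RDD, as with groupByKey and reduceByKey.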

abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
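The difference between the two kinds can be illustrated without Spark by asking, for one parent partition, which child partitions read from it. The sketch below is only a model: DependencyKindSketch, narrowTargets, and shuffleTargets are hypothetical names, and the wide case assumes a hash-based redistribution by key.

```scala
object DependencyKindSketch {
  // Narrow case (map, filter): child partition i reads parent partition i only,
  // so a parent partition feeds exactly one child partition.
  def narrowTargets(parentIndex: Int): Set[Int] = Set(parentIndex)

  // Wide case (groupByKey, reduceByKey): each key in the parent partition is sent
  // to the child partition chosen by its hash, so one parent partition may feed
  // several child partitions.
  def shuffleTargets[K](parentPartitionKeys: Seq[K], numChildPartitions: Int): Set[Int] =
    parentPartitionKeys.map(k => Math.floorMod(k.hashCode, numChildPartitions)).toSet
}
```

A parent partition holding the keys 'a' and 'b' feeds both child partitions when the data is regrouped into two partitions, which is what makes the dependency wide.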
 

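The narrow dependency code is shown below. NarrowDependency is an abstract class whose subclasses implement getParents, which maps a partition index of the child RDD to the indices of all parent partitions it depends on. The rdd field is the parent RDD that the child depends on.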

abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {
  /**
   * Get the parent partitions for a child partition.
   * @param partitionId a partition of the child RDD
   * @return the partitions of the parent RDD that the child partition depends upon
   */
  def getParents(partitionId: Int): Seq[Int]

  override def rdd: RDD[T] = _rdd
}
 

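NarrowDependency has three subclasses, one for each kind of narrow dependency: OneToOneDependency, RangeDependency, and PruneDependency.
OneToOneDependency is a one-to-one dependency, typically produced by map, filter, and mapValues. In this case the parent and child partition indices are identical, so getParents simply returns the partitionId.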

class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
 

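RangeDependency is produced by the union operator. A child RDD holds several RangeDependency objects: its partitions are divided into ranges, and each range corresponds to one RangeDependency. inStart is the start of the partition range in the parent RDD, outStart is the start of the range in the child RDD, and length is the size of the range. getParents maps a child partition index to the corresponding parent partition index.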

class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {

  override def getParents(partitionId: Int): List[Int] = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
 

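The code below is how UnionRDD builds its dependencies. The child RDD depends on multiple parent RDDs, its partition count is the sum of the parents' partition counts, and each parent corresponds to one range of child partitions.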

  override def getDependencies: Seq[Dependency[_]] = {
    val deps = new ArrayBuffer[Dependency[_]]
    var pos = 0
    for (rdd <- rdds) {
      deps += new RangeDependency(rdd, 0, pos, rdd.partitions.length)
      pos += rdd.partitions.length
    }
    deps
  }
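The index arithmetic is easier to follow with concrete numbers. The sketch below models the same bookkeeping in plain Scala for a union of parents described only by their partition counts; UnionRangeSketch, RangeDep, rangesFor, and resolve are hypothetical names, and the `parent` field is added purely so the result can say which parent a child partition came from.

```scala
object UnionRangeSketch {
  // Mirrors the fields of RangeDependency, plus the parent's position in the union.
  final case class RangeDep(parent: Int, inStart: Int, outStart: Int, length: Int) {
    def getParents(partitionId: Int): List[Int] =
      if (partitionId >= outStart && partitionId < outStart + length)
        List(partitionId - outStart + inStart)
      else Nil
  }

  // One range per parent; outStart is the running sum of earlier partition counts.
  def rangesFor(parentPartitionCounts: Seq[Int]): Seq[RangeDep] = {
    val starts = parentPartitionCounts.scanLeft(0)(_ + _)
    parentPartitionCounts.indices.map(i => RangeDep(i, 0, starts(i), parentPartitionCounts(i)))
  }

  // Which (parent, parent partition) backs a given child partition, if any.
  def resolve(childPartition: Int, ranges: Seq[RangeDep]): Option[(Int, Int)] =
    ranges.flatMap(r => r.getParents(childPartition).map(p => (r.parent, p))).headOption
}
```

For two parents with 2 and 3 partitions, the child has 5 partitions: child partitions 0 and 1 map to the first parent, and child partitions 2 to 4 map to partitions 0 to 2 of the second parent.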
 

PruneDependency is rarely used and is not analyzed here.

A wide dependency corresponds to ShuffleDependency. Operators such as sortByKey and reduceByKey generally produce a ShuffleDependency, which implies a shuffle.

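2. How ParallelCollectionRDD Works

When practicing RDD operators in spark-shell, we usually create an RDD with sc.parallelize(). An RDD created this way is a ParallelCollectionRDD.
Below is the implementation of SparkContext's parallelize; its core is simply the creation of a ParallelCollectionRDD object.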

  def parallelize[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    assertNotStopped()
    new ParallelCollectionRDD[T](this, seq, numSlices, Map[Int, Seq[String]]())
  }
 

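Below is the ParallelCollectionRDD class. Its constructor takes four parameters: the SparkContext, the collection data, the number of partitions, and the preferred-location information. It extends the abstract class RDD and passes Nil as the second argument to the RDD constructor, meaning this RDD has no parent dependencies; it is a source of the lineage. After a chain of transformations such as map, the dependencies of the resulting child RDD ultimately lead back to it.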

private[spark] class ParallelCollectionRDD[T: ClassTag](
    sc: SparkContext,
    @transient private val data: Seq[T],
    numSlices: Int,
    locationPrefs: Map[Int, Seq[String]])
    extends RDD[T](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
  }

  override def compute(s: Partition, context: TaskContext): Iterator[T] = {
    new InterruptibleIterator(context, s.asInstanceOf[ParallelCollectionPartition[T]].iterator)
  }

  override def getPreferredLocations(s: Partition): Seq[String] = {
    locationPrefs.getOrElse(s.index, Nil)
  }
}
 

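It implements three methods: getPartitions, compute, and getPreferredLocations. Only the first two are required; the third is optional, because RDD already provides a default implementation. Preferred locations matter when a partition is about to be computed: if the location of its data is known, the task is submitted at that location to reduce I/O overhead. The ParallelCollectionRDD created by parallelize has no preferred locations, since parallelize passes an empty map.

getPartitions returns all partitions of the RDD. It first splits the collection evenly into numSlices slices, then creates one ParallelCollectionPartition for each slice and returns them all.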
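A minimal sketch of the even split, assuming slice i covers the index range from i * length / numSlices up to (i + 1) * length / numSlices. SliceSketch and its methods are hypothetical names written for this article; the actual ParallelCollectionRDD.slice also has dedicated handling for range-like inputs, which is left out here.

```scala
object SliceSketch {
  // Start (inclusive) and end (exclusive) index of every slice.
  def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  // Cut a collection into numSlices contiguous pieces of near-equal size.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "Positive number of slices required")
    val data = seq.toVector
    positions(data.length, numSlices).map { case (start, end) => data.slice(start, end) }
  }
}
```

Nine elements cut into four slices give pieces of size 2, 2, 2, and 3: integer division pushes the remainder toward the later slices.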

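ParallelCollectionPartition holds three pieces of data: rddId, slice (the slice number, which is the partition index), and values (the data of the partition). It first defines an iterator that points at values.iterator, then overrides hashCode(), and then overrides equals, which treats two partitions as equal only when they have the same type, the same rddId, and the same slice. After that it overrides index to return slice. Finally, writeObject and readObject handle serialization and deserialization; they are not examined further here.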

private[spark] class ParallelCollectionPartition[T: ClassTag](
    var rddId: Long,
    var slice: Int,
    var values: Seq[T]
  ) extends Partition with Serializable {

  def iterator: Iterator[T] = values.iterator

  override def hashCode(): Int = (41 * (41 + rddId) + slice).toInt

  override def equals(other: Any): Boolean = other match {
    case that: ParallelCollectionPartition[_] =>
      this.rddId == that.rddId && this.slice == that.slice
    case _ => false
  }

  override def index: Int = slice

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    ...
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    ...
  }
}
 

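With the partitions in place, the partition data can be computed. ParallelCollectionRDD's compute first casts the incoming Partition to a ParallelCollectionPartition, then obtains that partition's iterator, and finally wraps it in an InterruptibleIterator and returns it. The returned iterator is essentially the iterator over the data in the partition; with it, the partition's data can be read.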

  override def compute(s: Partition, context: TaskContext): Iterator[T] = {
    new InterruptibleIterator(context, s.asInstanceOf[ParallelCollectionPartition[T]].iterator)
  }
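The wrapping step can be pictured with a small stand-alone iterator that consults a cancellation flag before handing out elements. This is a sketch in the same spirit, not Spark's class: CancellableIterator is a hypothetical name, and an AtomicBoolean stands in for the task-kill state that Spark keeps in TaskContext.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object InterruptibleSketch {
  // Delegates to the wrapped iterator, but checks the cancellation flag on every
  // hasNext call so that a long-running loop over the data can be stopped.
  final class CancellableIterator[T](cancelled: AtomicBoolean, delegate: Iterator[T])
      extends Iterator[T] {

    override def hasNext: Boolean = {
      if (cancelled.get()) throw new InterruptedException("task was cancelled")
      delegate.hasNext
    }

    override def next(): T = delegate.next()
  }
}
```

Because the check sits in hasNext, the consumer needs no special code: an ordinary loop over the iterator stops with an exception shortly after the flag is set.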
 

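As the analysis above shows, ParallelCollectionRDD has partitions but no partitioner, and it does not depend on any other RDD.

3. How MapPartitionsRDD Works

Calling map or mapValues on an RDD produces a new RDD of type MapPartitionsRDD. In the code below, myrdd2 is a MapPartitionsRDD.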

scala> val myrdd = sc.parallelize(Array('a','b','c','d','e','f','a','b','c'),4)
myrdd: org.apache.spark.rdd.RDD[Char] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> val myrdd2 = myrdd.map(x=>(x,1))
myrdd2: org.apache.spark.rdd.RDD[(Char, Int)] = MapPartitionsRDD[8] at map at <console>:26

 

The MapPartitionsRDD source is as follows:

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
 

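The MapPartitionsRDD constructor takes three parameters. prev is the previous RDD, that is, the parent RDD. f is a function whose arguments are a TaskContext, the partition index, and the iterator over the parent RDD's partition; it returns the iterator for the same partition of this RDD. preservesPartitioning indicates whether to keep the parent RDD's partitioner. The default is not to keep it, in which case the MapPartitionsRDD's partitioner becomes None.

A map operation may change the keys of the elements. If the child kept the original partitioner, that partitioner would no longer describe where elements actually live: the partition it assigns to a new key may differ from the partition the element stays in. The child therefore drops the partitioner, and its partitions simply mirror the parent's partitions one to one.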

scala> myrdd3.partitioner
res9: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@4)
scala> val myrdd4=myrdd3.map(x=>x)
myrdd4: org.apache.spark.rdd.RDD[(Char, Int)] = MapPartitionsRDD[10] at map at <console>:30
scala> myrdd4.partitioner
res12: Option[org.apache.spark.Partitioner] = None
 

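In the test above, myrdd3 (created earlier in the session; its definition is not shown) has a HashPartitioner. After map produces myrdd4, the partitioner of myrdd4 is None.

mapValues, in contrast, never changes the keys, so the parent's partitioner can be kept: with the same keys and the same partitioner, each element has the same partition index in the parent and the child.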

scala> val myrdd5=myrdd3.mapValues(x=>x)
myrdd5: org.apache.spark.rdd.RDD[(Char, Int)] = MapPartitionsRDD[11] at mapValues at <console>:30

scala> myrdd5.partitioner
res13: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@4)

scala> myrdd3.partitioner == myrdd5.partitioner
res14: Boolean = true

 

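In the test above, myrdd3 goes through mapValues to produce myrdd5; myrdd5 also has a HashPartitioner, and it is the partitioner of myrdd3.

MapPartitionsRDD's getPartitions delegates directly to the parent RDD and returns the parent's partitions. Its compute applies f to the parent RDD's iterator, so every element of the parent passes through the user-defined function and yields a new element, which is exactly what map does.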
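The two behaviors described in this section, composing a function over the parent's iterator and deciding whether to keep the partitioner, fit in a short Spark-free model. Node, mapPartitions, map, and mapValues below are hypothetical names, and the partitioner is represented by a plain String label as a simplifying assumption.

```scala
object MapPartitionsSketch {
  // Hypothetical model of an RDD: a partition count, an optional partitioner
  // label, and a function that computes the data of one partition.
  final case class Node[T](
      numPartitions: Int,
      partitioner: Option[String],
      compute: Int => Iterator[T]) {

    // Like MapPartitionsRDD: same partitions as the parent, f applied to the
    // parent's iterator, partitioner kept only when explicitly requested.
    def mapPartitions[U](
        f: Iterator[T] => Iterator[U],
        preservesPartitioning: Boolean = false): Node[U] =
      Node(numPartitions, if (preservesPartitioning) partitioner else None, p => f(compute(p)))

    // map may change keys, so the partitioner is dropped.
    def map[U](f: T => U): Node[U] = mapPartitions[U](_.map(f))
  }

  // mapValues leaves keys untouched, so the partitioner can be kept.
  def mapValues[K, V, U](node: Node[(K, V)])(f: V => U): Node[(K, U)] =
    node.mapPartitions[(K, U)](_.map { case (k, v) => (k, f(v)) }, preservesPartitioning = true)
}
```

Nothing is computed when map or mapValues is called; the new node only stores a function that pulls from the parent's iterator, so the work happens lazily when a partition's iterator is consumed.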

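4. Implementing Your Own RDD

SparkContext's textFile reads local files as well as HDFS files, building on HadoopRDD. The RDD subclass implemented here is named LocalFileRdd; it reads a local file and turns it into an RDD. The code exists only to deepen the understanding of RDD; it ignores many error cases and performance concerns and is not suitable for production use.
First, how LocalFileRdd is used. Create a LocalFileRdd object directly with new LocalFileRdd; the first constructor argument is the SparkContext and the second is the file path. As the output shows, each element of the RDD is a Tuple2 whose first element is the line number and whose second element is the line content.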

scala> val myLocalFileRdd = new LocalFileRdd(sc,"/home/zte/soft/spark-2.1.0-bin-hadoop2.7/README.md")
myLocalFileRdd: LocalFileRdd = LocalFileRdd[8] at RDD at <console>:42

scala> myLocalFileRdd.take(10).foreach(println)
(0,# Apache Spark)
(1,)
(2,Spark is a fast and general cluster computing system for Big Data. It provides)
(3,high-level APIs in Scala, Java, Python, and R, and an optimized engine that)
(4,supports general computation graphs for data analysis. It also supports a)
(5,rich set of higher-level tools including Spark SQL for SQL and DataFrames,)
(6,MLlib for machine learning, GraphX for graph processing,)
(7,and Spark Streaming for stream processing.)
(8,)
(9,<http://spark.apache.org/>)
 

The implementation of LocalFileRdd is as follows:

import java.io.{IOException, ObjectInputStream, ObjectOutputStream}
import org.apache.spark._
import org.apache.spark.rdd._
import scala.io.Source

// Partition descriptor: the partition index plus the half-open line range [startLine, endLine).
class LocalFileRddPartition(val split: Int, val startLine: Int, val endLine: Int) extends Partition {
  override def index: Int = split

  // Serialization hooks; here they only delegate to default Java serialization.
  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
  }
}

class LocalFileRdd(sc: SparkContext, fileFullName: String, partitionsNum: Int = 4)
  extends RDD[(Int, String)](sc, Nil) {

  // Read the whole file eagerly as (lineNumber, lineContent) pairs,
  // then close the file handle.
  val lines: Vector[(Int, String)] = {
    val source = Source.fromFile(fileFullName)
    try source.getLines().toVector.zipWithIndex.map(x => (x._2, x._1))
    finally source.close()
  }

  override def getPartitions: Array[Partition] = {
    // Ceiling division so that every line falls into some partition.
    val linesPerPartition = (lines.length + partitionsNum - 1) / partitionsNum
    // `until` (not `to`) yields exactly partitionsNum partitions.
    (0 until partitionsNum).map { split =>
      new LocalFileRddPartition(split, split * linesPerPartition, (split + 1) * linesPerPartition)
    }.toArray
  }

  override def compute(s: Partition, context: TaskContext): Iterator[(Int, String)] = {
    val partition = s.asInstanceOf[LocalFileRddPartition]
    // Keep only the lines whose number falls inside this partition's range.
    lines.iterator.filter(x => x._1 >= partition.startLine && x._1 < partition.endLine)
  }
}
 

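We define two classes: LocalFileRddPartition and LocalFileRdd.

LocalFileRddPartition holds the partition information for LocalFileRdd: the partition index plus the start and end line numbers of the partition. It also defines the serialization hooks writeObject and readObject.

LocalFileRdd first reads the file and stores its content in the lines field as (line number, line content) pairs. getPartitions uses the total number of lines and the number of partitions to determine the line range of each partition, and returns all the partitions. compute uses the line range in the LocalFileRddPartition to filter out the data of that partition and returns a new iterator.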
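The line-range arithmetic can be checked in isolation, without Spark or a file. The sketch below reproduces the ceiling division and the range filter in plain Scala; LineRangeSketch, lineRanges, and linesOf are hypothetical names, and the sketch uses `0 until partitionsNum` so that exactly partitionsNum ranges are produced.

```scala
object LineRangeSketch {
  // Half-open [start, end) line ranges, one per partition, using ceiling division.
  def lineRanges(totalLines: Int, partitionsNum: Int): Vector[(Int, Int)] = {
    require(partitionsNum > 0, "partitionsNum must be positive")
    val linesPerPartition = (totalLines + partitionsNum - 1) / partitionsNum
    (0 until partitionsNum)
      .map(i => (i * linesPerPartition, (i + 1) * linesPerPartition))
      .toVector
  }

  // The lines that belong to one range; the same filter as in compute.
  def linesOf(lines: Vector[(Int, String)], range: (Int, Int)): Vector[(Int, String)] =
    lines.filter { case (n, _) => n >= range._1 && n < range._2 }
}
```

With 10 lines and 4 partitions, each range spans 3 line numbers, so the last range runs past the end of the file; that is harmless because the filter only keeps line numbers that exist, and the last partition simply ends up with a single line.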
