Spark's core design hasn't changed much over the years, so a quick way to understand the underlying implementation is to read the early source code on the branch-0.5 branch, https://github.com/apache/spark/tree/branch-0.5, available directly on GitHub. Compared with the sprawling Spark 2.x codebase and its dozens of packages, the early code, apart from being lightly commented, reads without much frustration.
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel.
* Each RDD is characterized by five main properties:
* - A list of splits (partitions)
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for HDFS)
That is the comment from the source. Translated: a Resilient Distributed Dataset (RDD) is the basic abstraction in Spark, representing an immutable, partitioned collection of elements that can be operated on in parallel.
Internally, an RDD has five main properties:
#A list of partitions
#A function for computing each partition
#A list of dependencies on other RDDs
#Optionally, a Partitioner for key-value data (e.g. a hash-partitioned RDD)
#Optionally, a list of preferred locations for computing each partition (in a distributed job, if a node already holds the data, compute there instead of copying it from another node)
#Source - The five properties of an RDD
abstract class RDD[T: ClassManifest](@transient sc: SparkContext) extends Serializable {
  // splits returns an array of Split, which later became Partition
  def splits: Array[Split]
  // compute() takes a partition and returns an iterator; every RDD subclass implements its own compute()
  def compute(split: Split): Iterator[T]
  // the dependencies, a List of Dependency
  @transient val dependencies: List[Dependency[_]]
  val partitioner: Option[Partitioner] = None // the optional partitioner
  def preferredLocations(split: Split): Seq[String] = Nil // preferred locations per partition
  def context = sc // important: the SparkContext initializes many components
  val id = sc.newRddId() // a unique id for this RDD
  private var shouldCache = false // caching flag
  // the RDD operators follow; covered later
  ......
}
#Source - RDD property 1: Split (the partition)
Partitions exist simply to scale out distributed computation; in my view they play the same role as Kafka's partitions: raising throughput.
trait Split extends Serializable {
  val index: Int // the partition's index within its RDD, starting at 0
  override def hashCode(): Int = index
}
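That is the whole contract. As a minimal sketch (not from the source), a hypothetical implementation only has to carry its index:
// Hypothetical minimal Split; the Split trait above is assumed to be in scope.
class SimpleSplit(val index: Int) extends Split

val s = new SimpleSplit(3)
s.index    // 3
s.hashCode // 3, via the default hashCode in the trait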
#Source - RDD property 2: Dependency
A dependency records whether a partition of an RDD needs all of a parent partition's data or only part of it. If it needs the parent partition in full, no shuffle happens: the partition data is fetched as a whole and computed. If it needs only part of each parent partition, a shuffle is required.
A shuffle dependency has to decide which records of a partition go to which partition of the next RDD; a narrow dependency just fetches the whole parent partition and calls compute().
// the abstract base class
abstract class Dependency[T](val rdd: RDD[T], val isShuffle: Boolean) extends Serializable
// the shuffle dependency; nothing implemented here, covered later when we meet it
class ShuffleDependency[K, V, C](
    val shuffleId: Int,
    rdd: RDD[(K, V)],
    val aggregator: Aggregator[K, V, C],
    val partitioner: Partitioner)
  extends Dependency(rdd, true)
// the narrow dependency
abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd, false) {
  // returns the parent partition ids that a given child partition depends on
  def getParents(outputPartition: Int): Seq[Int]
}
// one-to-one dependency, extending the narrow dependency
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int) = List(partitionId)
}
// RangeDependency, also extending the narrow dependency
// parameters, in order: the parent rdd, the range's start in the parent,
// the range's start in the child, and the range's length
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int) = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
OneToOneDependency maps each child partition to the parent partition with the same index; RangeDependency maps a contiguous range of child partitions onto a contiguous range of parent partitions (this is what union() relies on).
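To make the index arithmetic concrete, here is a standalone re-run of the getParents() logic with hypothetical offsets: a 3-partition parent appearing in the child starting at partition 4, i.e. inStart = 0, outStart = 4, length = 3.
def getParents(partitionId: Int): Seq[Int] = {
  val (inStart, outStart, length) = (0, 4, 3)
  if (partitionId >= outStart && partitionId < outStart + length) {
    List(partitionId - outStart + inStart)
  } else {
    Nil
  }
}
getParents(4) // List(0): child partition 4 reads parent partition 0
getParents(6) // List(2)
getParents(1) // Nil: child partition 1 is outside the range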
#Source - RDD property 3: the optional Partitioner
Partitioners mainly apply to key-value data; non-key-value data is first mapped into key-value form with a null key before it can be partitioned.
abstract class Partitioner extends Serializable {
  def numPartitions: Int // the number of partitions
  def getPartition(key: Any): Int // which partition a given key goes to
}
<1> HashPartitioner
Decides a record's partition by taking the key's hashCode modulo the number of partitions.
class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    if (key == null) { // a null key always goes to partition 0
      return 0
    } else {
      val mod = key.hashCode % partitions // key's hashCode modulo the partition count
      if (mod < 0) {
        mod + partitions // hashCode can be negative, so shift a negative remainder back into range
      } else {
        mod
      }
    }
  }
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }
}
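A quick worked check of that sign handling, assuming the HashPartitioner class above is in scope (hashCodes can be negative, so the raw remainder can be too):
val p = new HashPartitioner(4)
p.getPartition(10)   // 10 % 4 = 2
p.getPartition(-7)   // -7 % 4 = -3, negative, so -3 + 4 = 1
p.getPartition(null) // 0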
<2> RangePartitioner
Applies when the key type is ordered.
class RangePartitioner[K <% Ordered[K]: ClassManifest, V](
    partitions: Int,
    @transient rdd: RDD[(K,V)],
    private val ascending: Boolean = true)
  extends Partitioner {
  // step one: compute the range boundary keys
  private val rangeBounds: Array[K] = {
    if (partitions == 1) { // a single partition needs no boundaries
      Array()
    } else {
      val rddSize = rdd.count() // total number of elements
      val maxSampleSize = partitions * 20.0 // how many elements to sample
      val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
      // sampling fraction = sample size / total element count
      val rddSample = rdd.sample(true, frac, 1).map(_._1).collect().sortWith(_ < _)
      // sample at that fraction and sort the keys ascending
      if (rddSample.length == 0) { // nothing sampled: no boundaries
        Array()
      } else {
        val bounds = new Array[K](partitions - 1) // one boundary between each pair of partitions
        for (i <- 0 until partitions - 1) {
          // pick evenly spaced sample elements as the boundary values
          val index = (rddSample.length - 1) * (i + 1) / partitions
          bounds(i) = rddSample(index)
        }
        bounds
      }
    }
  }
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K] // cast to the key type
    var partition = 0
    while (partition < rangeBounds.length && k > rangeBounds(partition)) {
      // walk forward while the key is still greater than the current boundary
      partition += 1
    }
    // ascending or descending order; ascending by default
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_,_] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }
}
For example: 300 elements valued 1-300, split into 6 partitions.
#rangeBounds: returns an array of length 5, holding 50, 100, 150, 200, 250
#getPartition: compares the incoming key against that array to pick the target partition
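A standalone re-run of the boundary arithmetic confirms those numbers, assuming for simplicity that the sample kept all 300 keys:
val sample = (1 to 300).toArray
val partitions = 6
val bounds = (0 until partitions - 1).map { i =>
  sample((sample.length - 1) * (i + 1) / partitions)
}
bounds // Vector(50, 100, 150, 200, 250)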
#Source - RDD property 4: the compute() method
Every RDD subclass implements its own compute(); we will look at them when we get to the subclasses.
#Source - RDD property 5: preferredLocations, the partition's preferred locations
Under Spark's pipeline model, partition data flows to wherever the computation runs, which can mean HTTP-fetching data from other nodes; preferred locations exist to cut that network transfer by scheduling the computation where the data already lives.
#Source - Creating an RDD
There are three ways to create an RDD: textFile(), makeRDD(), and parallelize().
Since makeRDD() just calls parallelize(), there are really two: one reads a Seq, the other reads a file.
#Source - textFile()
1. The textFile() method
// two parameters: the file path and the minimum number of splits
def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minSplits)
    .map(pair => pair._2.toString)
}
When the caller doesn't specify a split count, it defaults to defaultMinSplits, the smaller of defaultParallelism and 2:
def defaultMinSplits: Int = math.min(defaultParallelism, 2)
defaultParallelism is the scheduler's default parallelism:
def defaultParallelism: Int = scheduler.defaultParallelism
Since we are reading early Spark code (see the earlier post, Spark源码《一》RDD), there is no YARN scheduler yet, only local and mesos.
For the local scheduler the value is the thread count: local[n] sets n threads, and plain local defaults to 1 (local[*], one thread per core, arrived in later releases).
override def defaultParallelism() = threads
For the mesos scheduler, it defaults to 8, overridable via spark.default.parallelism:
override def defaultParallelism() =
  System.getProperty("spark.default.parallelism", "8").toInt
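Putting that together, a hypothetical local session (in branch-0.5, SparkContext takes a master string and a job name):
val sc = new SparkContext("local[4]", "demo")
val rdd1 = sc.textFile("/tmp/data.txt")    // minSplits = min(4, 2) = 2
val rdd2 = sc.textFile("/tmp/data.txt", 8) // ask for at least 8 input splits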
textFile() simply delegates to hadoopFile().
2. hadoopFile()
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minSplits: Int = defaultMinSplits
  ) : RDD[(K, V)] = {
  val conf = new JobConf() // a fresh MapReduce JobConf
  FileInputFormat.setInputPaths(conf, path) // set the input path
  val bufferSize = System.getProperty("spark.buffer.size", "65536") // Spark's buffer size
  conf.set("io.file.buffer.size", bufferSize) // pass it on as Hadoop's io buffer size
  new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minSplits)
}
hadoopFile() returns an RDD of key-value pairs by creating a HadoopRDD; it is the map() in textFile() that then extracts the value (the line content).
3. The HadoopRDD class
class HadoopRDD[K, V](
    sc: SparkContext,
    @transient conf: JobConf,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minSplits: Int)
  extends RDD[(K, V)](sc) {
  val serializableConf = new SerializableWritable(conf) // make the JobConf serializable
  // an array with one element per partition, wrapping Hadoop's InputSplits
  @transient
  val splits_ : Array[Split] = {
    val inputFormat = createInputFormat(conf)
    val inputSplits = inputFormat.getSplits(conf, minSplits)
    val array = new Array[Split](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopSplit(id, i, inputSplits(i))
    }
    array
  }
  // instantiate the InputFormat via reflection
  def createInputFormat(conf: JobConf): InputFormat[K, V] = {
    ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
      .asInstanceOf[InputFormat[K, V]]
  }
  override def splits = splits_
  // compute() returns an iterator over the partition's key-value records
  override def compute(theSplit: Split) = new Iterator[(K, V)] {
    val split = theSplit.asInstanceOf[HadoopSplit]
    var reader: RecordReader[K, V] = null
    val conf = serializableConf.value
    val fmt = createInputFormat(conf)
    reader = fmt.getRecordReader(split.inputSplit.value, conf, Reporter.NULL)
    val key: K = reader.createKey()
    val value: V = reader.createValue()
    var gotNext = false
    var finished = false
    // read one record ahead to know whether the partition has more data
    override def hasNext: Boolean = {
      if (!gotNext) {
        try {
          finished = !reader.next(key, value)
        } catch {
          case eof: EOFException =>
            finished = true
        }
        gotNext = true
      }
      if (finished) {
        reader.close()
      }
      !finished
    }
    // hand out the record fetched by hasNext (or fetch one now)
    override def next: (K, V) = {
      if (!gotNext) {
        finished = !reader.next(key, value)
      }
      if (finished) {
        throw new NoSuchElementException("End of stream")
      }
      gotNext = false
      (key, value)
    }
  }
  override def preferredLocations(split: Split) = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopSplit]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
  // reads from storage, so no parent dependencies
  override val dependencies: List[Dependency[_]] = Nil
}
The class leans on a fair amount of MapReduce code that we won't dig into here. In HadoopRDD's compute(), the default RecordReader, e.g. LineRecordReader for TextInputFormat, uses each line's byte offset as the key and the line content as the value, and compute() wraps it in the iterator it returns.
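Note how hasNext and next cooperate: hasNext reads one record ahead and caches it, next hands it out. A minimal standalone sketch of the same read-ahead pattern over a BufferedReader (not from the source):
import java.io.BufferedReader

class ReadAheadIterator(reader: BufferedReader) extends Iterator[String] {
  private var nextLine: String = null
  private var finished = false
  override def hasNext: Boolean = {
    if (nextLine == null && !finished) {
      nextLine = reader.readLine() // pre-fetch one record, like reader.next(key, value)
      if (nextLine == null) {
        finished = true
        reader.close()
      }
    }
    !finished
  }
  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    val line = nextLine
    nextLine = null
    line
  }
}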
#Source - parallelize()
1. The parallelize() method
def parallelize[T: ClassManifest](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  new ParallelCollection[T](this, seq, numSlices)
}
def makeRDD[T: ClassManifest](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  parallelize(seq, numSlices)
}
As you can see, makeRDD() just calls parallelize(), which builds a ParallelCollection from the SparkContext, the Seq, and the number of slices.
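For example, in a hypothetical session where a SparkContext sc is already up:
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 2) // two partitions
val same = sc.makeRDD(Seq(1, 2, 3, 4, 5, 6), 2)    // identical: delegates to parallelize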
2. The ParallelCollection class
class ParallelCollection[T: ClassManifest](
    sc: SparkContext,
    @transient data: Seq[T],
    numSlices: Int)
  extends RDD[T](sc) {
  @transient
  val splits_ = {
    // calls the slice() method; one split per slice
    val slices = ParallelCollection.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionSplit(id, i, slices(i))).toArray
  }
  override def splits = splits_.asInstanceOf[Array[Split]]
  override def compute(s: Split) = s.asInstanceOf[ParallelCollectionSplit[T]].iterator
  // no preferred locations: the data lives in memory on the driver
  override def preferredLocations(s: Split): Seq[String] = Nil
  // no dependencies
  override val dependencies: List[Dependency[_]] = Nil
}
splits_ calls slice() and wraps each slice in a ParallelCollectionSplit; let's look at slice() first.
3. The slice() method
def slice[T: ClassManifest](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  // a slice count below 1 is an error
  if (numSlices < 1) {
    throw new IllegalArgumentException("Positive number of slices required")
  }
  seq match {
    case r: Range.Inclusive => {
      // turn the inclusive range into an exclusive one, then recurse
      val sign = if (r.step < 0) {
        -1
      } else {
        1
      }
      slice(new Range(
        r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
    }
    case r: Range => {
      // split the range arithmetically; each slice is itself a Range
      (0 until numSlices).map(i => {
        val start = ((i * r.length.toLong) / numSlices).toInt
        val end = (((i+1) * r.length.toLong) / numSlices).toInt
        new Range(r.start + start * r.step, r.start + end * r.step, r.step)
      }).asInstanceOf[Seq[Seq[T]]]
    }
    case nr: NumericRange[_] => {
      // peel the range off slice by slice, sliceSize elements at a time
      val slices = new ArrayBuffer[Seq[T]](numSlices)
      val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
      var r = nr
      for (i <- 0 until numSlices) {
        slices += r.take(sliceSize).asInstanceOf[Seq[T]]
        r = r.drop(sliceSize)
      }
      slices
    }
    case _ => {
      val array = seq.toArray // To prevent O(n^2) operations for List etc
      (0 until numSlices).map(i => {
        val start = ((i * array.length.toLong) / numSlices).toInt
        val end = (((i+1) * array.length.toLong) / numSlices).toInt
        array.slice(start, end).toSeq
      })
    }
  }
}
The method matches on the Seq's concrete type: Range.Inclusive, Range, NumericRange, or anything else.
The to method produces a Range.Inclusive (e.g. 1 to 10 contains 10).
The until method, like new Range(), produces a plain Range (1 until 10 does not contain 10).
Non-Int element types such as Float, Char, and Long yield a NumericRange.
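A few quick type checks of those cases (the exact runtime classes vary slightly across Scala versions):
import scala.collection.immutable.NumericRange

(1 to 10).isInstanceOf[Range.Inclusive]    // true: `to` keeps the end value
(1 until 10).isInstanceOf[Range.Inclusive] // false: `until` drops it
(1L to 10L).isInstanceOf[NumericRange[_]]  // true: non-Int ranges are NumericRanges
('a' to 'd').isInstanceOf[NumericRange[_]] // true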
A few worked examples make the code easier to follow.
<1> (1 to 6 by 1), split into 2 partitions: elements 1, 2, 3, 4, 5, 6
case r: Range.Inclusive => {
  val sign = if (r.step < 0) {
    -1
  } else {
    1
  }
  slice(new Range(
    r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
sign = 1, so slice() recurses with new Range(1, 7, 1) and numSlices = 2
(r.start and r.end are the Range's first and last values).
This time it matches:
case r: Range => {
  (0 until numSlices).map(i => {
    val start = ((i * r.length.toLong) / numSlices).toInt
    val end = (((i+1) * r.length.toLong) / numSlices).toInt
    new Range(r.start + start * r.step, r.start + end * r.step, r.step)
  }).asInstanceOf[Seq[Seq[T]]]
}
The loop runs twice:
when i = 0: start = 0, end = 3, producing Range(1, 4, 1), i.e. 1, 2, 3
when i = 1: start = 3, end = 6, producing Range(4, 7, 1), i.e. 4, 5, 6
One Range is thus split into two. Had the range been built with until, the resulting plain Range would match this second case directly.
<2> The NumericRange case
case nr: NumericRange[_] => {
  val slices = new ArrayBuffer[Seq[T]](numSlices)
  val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
  var r = nr
  for (i <- 0 until numSlices) {
    slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    r = r.drop(sliceSize)
  }
  slices
}
Example: 'a' to 'd' with 2 partitions: a, b, c, d.
slices is a growable array sized for 2 slices;
sliceSize = (4 + 2 - 1) / 2 = 2;
r = a, b, c, d.
The loop (take(n) keeps the first n elements, drop(n) removes them):
when i = 0, slices takes the first 2 elements, a and b, which are then dropped from r;
when i = 1, slices takes c and d.
The result is a buffer of two Seqs: one holding a, b, the other c, d.
<3> The default case
case _ => {
  val array = seq.toArray // To prevent O(n^2) operations for List etc
  (0 until numSlices).map(i => {
    val start = ((i * array.length.toLong) / numSlices).toInt
    val end = (((i+1) * array.length.toLong) / numSlices).toInt
    array.slice(start, end).toSeq
  })
}
Example: List(5, 7, 3, 9, 11, 4) with 2 partitions.
The Seq is first converted to an array, then the loop runs twice:
when i = 0: start = 0*6/2 = 0, end = (0+1)*6/2 = 3; array.slice(n, m) takes the elements from index n through m-1, so this takes indices 0, 1, 2;
when i = 1: start = 1*6/2 = 3, end = (1+1)*6/2 = 6, taking indices 3, 4, 5.
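A standalone re-run of that branch confirms the split:
def sliceDefault(seq: Seq[Int], numSlices: Int): Seq[Seq[Int]] = {
  val array = seq.toArray
  (0 until numSlices).map { i =>
    val start = ((i * array.length.toLong) / numSlices).toInt
    val end = (((i + 1) * array.length.toLong) / numSlices).toInt
    array.slice(start, end).toSeq
  }
}
sliceDefault(List(5, 7, 3, 9, 11, 4), 2) // Seq(Seq(5, 7, 3), Seq(9, 11, 4))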
4. The ParallelCollectionSplit class
val splits_ = {
  val slices = ParallelCollection.slice(data, numSlices).toArray
  slices.indices.map(i => new ParallelCollectionSplit(id, i, slices(i))).toArray
}
After the collection returned by slice() is converted to an array, indices yields all of its indices, and one ParallelCollectionSplit is created per slice:
with 2 partitions you get 2 ParallelCollectionSplit objects, the i-th one holding the i-th slice of the data.
class ParallelCollectionSplit[T: ClassManifest](
    val rddId: Long,
    val slice: Int,
    values: Seq[T])
  extends Split with Serializable {
  def iterator(): Iterator[T] = values.iterator // an iterator over this slice's elements
  override def hashCode(): Int = (41 * (41 + rddId) + slice).toInt
  override def equals(other: Any): Boolean = other match {
    case that: ParallelCollectionSplit[_] => (this.rddId == that.rddId && this.slice == that.slice)
    case _ => false
  }
  override val index = slice // the slice number doubles as the RDD's partition index
}