Spark's core design hasn't changed much over the years, so a quick way to understand the underlying implementation is to read the early source code on the branch-0.5 branch, https://github.com/apache/spark/tree/branch-0.5, available directly on GitHub. Compared with the sprawling Spark 2.x codebase and its dozens of packages, the early code, apart from being lightly commented, reads without much frustration.
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
partitioned collection of elements that can be operated on in parallel.
* Each RDD is characterized by five main properties:
* - A list of splits (partitions)
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for HDFS)
That is the comment from the source. Translated: a Resilient Distributed Dataset (RDD) is the basic abstraction in Spark, representing an immutable, partitioned collection of elements that can be operated on in parallel.
Internally, an RDD has five main properties:
#A list of partitions
#A function for computing each partition
#A list of dependencies on other RDDs
#Optionally, a Partitioner for key-value data (e.g. a hash-partitioned RDD)
#Optionally, a list of preferred locations for computing each partition (in a distributed job, if a node already holds the data, compute there instead of copying it from another node)
#Source - The five properties of an RDD
abstract class RDD[T: ClassManifest](@transient sc: SparkContext) extends Serializable {
  // splits returns an array of Split, which later became Partition
  def splits: Array[Split]
  // compute() takes a partition and returns an iterator; every RDD subclass implements its own compute()
  def compute(split: Split): Iterator[T]
  // the dependencies, a List of Dependency
  @transient val dependencies: List[Dependency[_]]
  val partitioner: Option[Partitioner] = None // the optional partitioner
  def preferredLocations(split: Split): Seq[String] = Nil // preferred locations per partition
  def context = sc // important: the SparkContext initializes many components
  val id = sc.newRddId() // a unique id for this RDD
  private var shouldCache = false // caching flag
  // the RDD operators follow; covered later
  ......
}
#Source - RDD property 1: Split (the partition)
Partitions exist simply to scale out distributed computation; in my view they play the same role as Kafka's partitions: raising throughput.
trait Split extends Serializable {
  val index: Int // the partition's index within its RDD, starting at 0
  override def hashCode(): Int = index
}
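That is the whole contract. As a minimal sketch (not from the source), a hypothetical implementation only has to carry its index:
// Hypothetical minimal Split; the Split trait above is assumed to be in scope.
class SimpleSplit(val index: Int) extends Split

val s = new SimpleSplit(3)
s.index    // 3
s.hashCode // 3, via the default hashCode in the trait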
#Source - RDD property 2: Dependency
A dependency records whether a partition of an RDD needs all of a parent partition's data or only part of it. If it needs the parent partition in full, no shuffle happens: the partition data is fetched as a whole and computed. If it needs only part of each parent partition, a shuffle is required.
A shuffle dependency has to decide which records of a partition go to which partition of the next RDD; a narrow dependency just fetches the whole parent partition and calls compute().
// the abstract base class
abstract class Dependency[T](val rdd: RDD[T], val isShuffle: Boolean) extends Serializable
// the shuffle dependency; nothing implemented here, covered later when we meet it
class ShuffleDependency[K, V, C](
    val shuffleId: Int,
    rdd: RDD[(K, V)],
    val aggregator: Aggregator[K, V, C],
    val partitioner: Partitioner)
  extends Dependency(rdd, true)
// the narrow dependency
abstract class NarrowDependency[T](rdd: RDD[T]) extends Dependency(rdd, false) {
  // returns the parent partition ids that a given child partition depends on
  def getParents(outputPartition: Int): Seq[Int]
}
// one-to-one dependency, extending the narrow dependency
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int) = List(partitionId)
}
// RangeDependency, also extending the narrow dependency
// parameters, in order: the parent rdd, the range's start in the parent,
// the range's start in the child, and the range's length
class RangeDependency[T](rdd: RDD[T], inStart: Int, outStart: Int, length: Int)
  extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int) = {
    if (partitionId >= outStart && partitionId < outStart + length) {
      List(partitionId - outStart + inStart)
    } else {
      Nil
    }
  }
}
OneToOneDependency maps each child partition to the parent partition with the same index; RangeDependency maps a contiguous range of child partitions onto a contiguous range of parent partitions (this is what union() relies on).
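To make the index arithmetic concrete, here is a standalone re-run of the getParents() logic with hypothetical offsets: a 3-partition parent appearing in the child starting at partition 4, i.e. inStart = 0, outStart = 4, length = 3.
def getParents(partitionId: Int): Seq[Int] = {
  val (inStart, outStart, length) = (0, 4, 3)
  if (partitionId >= outStart && partitionId < outStart + length) {
    List(partitionId - outStart + inStart)
  } else {
    Nil
  }
}
getParents(4) // List(0): child partition 4 reads parent partition 0
getParents(6) // List(2)
getParents(1) // Nil: child partition 1 is outside the range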
#Source - RDD property 3: the optional Partitioner
Partitioners mainly apply to key-value data; non-key-value data is first mapped into key-value form with a null key before it can be partitioned.
abstract class Partitioner extends Serializable {
  def numPartitions: Int // the number of partitions
  def getPartition(key: Any): Int // which partition a given key goes to
}
<1> HashPartitioner
Decides a record's partition by taking the key's hashCode modulo the number of partitions.
class HashPartitioner(partitions: Int) extends Partitioner {
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    if (key == null) { // a null key always goes to partition 0
      return 0
    } else {
      val mod = key.hashCode % partitions // key's hashCode modulo the partition count
      if (mod < 0) {
        mod + partitions // hashCode can be negative, so shift a negative remainder back into range
      } else {
        mod
      }
    }
  }
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      h.numPartitions == numPartitions
    case _ =>
      false
  }
}
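A quick worked check of that sign handling, assuming the HashPartitioner class above is in scope (hashCodes can be negative, so the raw remainder can be too):
val p = new HashPartitioner(4)
p.getPartition(10)   // 10 % 4 = 2
p.getPartition(-7)   // -7 % 4 = -3, negative, so -3 + 4 = 1
p.getPartition(null) // 0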
<2> RangePartitioner
Applies when the key type is ordered.
class RangePartitioner[K <% Ordered[K]: ClassManifest, V](
    partitions: Int,
    @transient rdd: RDD[(K,V)],
    private val ascending: Boolean = true)
  extends Partitioner {
  // step one: compute the range boundary keys
  private val rangeBounds: Array[K] = {
    if (partitions == 1) { // a single partition needs no boundaries
      Array()
    } else {
      val rddSize = rdd.count() // total number of elements
      val maxSampleSize = partitions * 20.0 // how many elements to sample
      val frac = math.min(maxSampleSize / math.max(rddSize, 1), 1.0)
      // sampling fraction = sample size / total element count
      val rddSample = rdd.sample(true, frac, 1).map(_._1).collect().sortWith(_ < _)
      // sample at that fraction and sort the keys ascending
      if (rddSample.length == 0) { // nothing sampled: no boundaries
        Array()
      } else {
        val bounds = new Array[K](partitions - 1) // one boundary between each pair of partitions
        for (i <- 0 until partitions - 1) {
          // pick evenly spaced sample elements as the boundary values
          val index = (rddSample.length - 1) * (i + 1) / partitions
          bounds(i) = rddSample(index)
        }
        bounds
      }
    }
  }
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K] // cast to the key type
    var partition = 0
    while (partition < rangeBounds.length && k > rangeBounds(partition)) {
      // walk forward while the key is still greater than the current boundary
      partition += 1
    }
    // ascending or descending order; ascending by default
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }
  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_,_] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }
}
For example: 300 elements valued 1-300, split into 6 partitions.
#rangeBounds: returns an array of length 5, holding 50, 100, 150, 200, 250
#getPartition: compares the incoming key against that array to pick the target partition
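A standalone re-run of the boundary arithmetic confirms those numbers, assuming for simplicity that the sample kept all 300 keys:
val sample = (1 to 300).toArray
val partitions = 6
val bounds = (0 until partitions - 1).map { i =>
  sample((sample.length - 1) * (i + 1) / partitions)
}
bounds // Vector(50, 100, 150, 200, 250)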
#Source - RDD property 4: the compute() method
Every RDD subclass implements its own compute(); we will look at them when we get to the subclasses.
#Source - RDD property 5: preferredLocations, the partition's preferred locations
Under Spark's pipeline model, partition data flows to wherever the computation runs, which can mean HTTP-fetching data from other nodes; preferred locations exist to cut that network transfer by scheduling the computation where the data already lives.
#Source - Creating an RDD
There are three ways to create an RDD: textFile(), makeRDD(), and parallelize().
Since makeRDD() just calls parallelize(), there are really two: one reads a Seq, the other reads a file.
#Source - textFile()
1. The textFile() method
// two parameters: the file path and the minimum number of splits
def textFile(path: String, minSplits: Int = defaultMinSplits): RDD[String] = {
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], minSplits)
    .map(pair => pair._2.toString)
}
When the caller doesn't specify a split count, it defaults to defaultMinSplits, the smaller of defaultParallelism and 2:
def defaultMinSplits: Int = math.min(defaultParallelism, 2)
defaultParallelism is the scheduler's default parallelism:
def defaultParallelism: Int = scheduler.defaultParallelism
Since we are reading early Spark code (see the earlier post, Spark源码《一》RDD), there is no YARN scheduler yet, only local and mesos.
For the local scheduler the value is the thread count: local[n] sets n threads, and plain local defaults to 1 (local[*], one thread per core, arrived in later releases).
override def defaultParallelism() = threads
For the mesos scheduler, it defaults to 8, overridable via spark.default.parallelism:
override def defaultParallelism() =
  System.getProperty("spark.default.parallelism", "8").toInt
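Putting that together, a hypothetical local session (in branch-0.5, SparkContext takes a master string and a job name):
val sc = new SparkContext("local[4]", "demo")
val rdd1 = sc.textFile("/tmp/data.txt")    // minSplits = min(4, 2) = 2
val rdd2 = sc.textFile("/tmp/data.txt", 8) // ask for at least 8 input splits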
textFile() simply delegates to hadoopFile().
2. hadoopFile()
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minSplits: Int = defaultMinSplits
  ) : RDD[(K, V)] = {
  val conf = new JobConf() // a fresh MapReduce JobConf
  FileInputFormat.setInputPaths(conf, path) // set the input path
  val bufferSize = System.getProperty("spark.buffer.size", "65536") // Spark's buffer size
  conf.set("io.file.buffer.size", bufferSize) // pass it on as Hadoop's io buffer size
  new HadoopRDD(this, conf, inputFormatClass, keyClass, valueClass, minSplits)
}
hadoopFile() returns an RDD of key-value pairs by creating a HadoopRDD; it is the map() in textFile() that then extracts the value (the line content).
3. The HadoopRDD class
class HadoopRDD[K, V](
    sc: SparkContext,
    @transient conf: JobConf,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minSplits: Int)
  extends RDD[(K, V)](sc) {
  val serializableConf = new SerializableWritable(conf) // make the JobConf serializable
  // an array with one element per partition, wrapping Hadoop's InputSplits
  @transient
  val splits_ : Array[Split] = {
    val inputFormat = createInputFormat(conf)
    val inputSplits = inputFormat.getSplits(conf, minSplits)
    val array = new Array[Split](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopSplit(id, i, inputSplits(i))
    }
    array
  }
  // instantiate the InputFormat via reflection
  def createInputFormat(conf: JobConf): InputFormat[K, V] = {
    ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
      .asInstanceOf[InputFormat[K, V]]
  }
  override def splits = splits_
  // compute() returns an iterator over the partition's key-value records
  override def compute(theSplit: Split) = new Iterator[(K, V)] {
    val split = theSplit.asInstanceOf[HadoopSplit]
    var reader: RecordReader[K, V] = null
    val conf = serializableConf.value
    val fmt = createInputFormat(conf)
    reader = fmt.getRecordReader(split.inputSplit.value, conf, Reporter.NULL)
    val key: K = reader.createKey()
    val value: V = reader.createValue()
    var gotNext = false
    var finished = false
    // read one record ahead to know whether the partition has more data
    override def hasNext: Boolean = {
      if (!gotNext) {
        try {
          finished = !reader.next(key, value)
        } catch {
          case eof: EOFException =>
            finished = true
        }
        gotNext = true
      }
      if (finished) {
        reader.close()
      }
      !finished
    }
    // hand out the record fetched by hasNext (or fetch one now)
    override def next: (K, V) = {
      if (!gotNext) {
        finished = !reader.next(key, value)
      }
      if (finished) {
        throw new NoSuchElementException("End of stream")
      }
      gotNext = false
      (key, value)
    }
  }
  override def preferredLocations(split: Split) = {
    // TODO: Filtering out "localhost" in case of file:// URLs
    val hadoopSplit = split.asInstanceOf[HadoopSplit]
    hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
  }
  // reads from storage, so no parent dependencies
  override val dependencies: List[Dependency[_]] = Nil
}
The class leans on a fair amount of MapReduce code that we won't dig into here. In HadoopRDD's compute(), the default RecordReader, e.g. LineRecordReader for TextInputFormat, uses each line's byte offset as the key and the line content as the value, and compute() wraps it in the iterator it returns.
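Note how hasNext and next cooperate: hasNext reads one record ahead and caches it, next hands it out. A minimal standalone sketch of the same read-ahead pattern over a BufferedReader (not from the source):
import java.io.BufferedReader

class ReadAheadIterator(reader: BufferedReader) extends Iterator[String] {
  private var nextLine: String = null
  private var finished = false
  override def hasNext: Boolean = {
    if (nextLine == null && !finished) {
      nextLine = reader.readLine() // pre-fetch one record, like reader.next(key, value)
      if (nextLine == null) {
        finished = true
        reader.close()
      }
    }
    !finished
  }
  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    val line = nextLine
    nextLine = null
    line
  }
}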
#Source - parallelize()
1. The parallelize() method
def parallelize[T: ClassManifest](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  new ParallelCollection[T](this, seq, numSlices)
}
def makeRDD[T: ClassManifest](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] = {
  parallelize(seq, numSlices)
}
As you can see, makeRDD() just calls parallelize(), which builds a ParallelCollection from the SparkContext, the Seq, and the number of slices.
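For example, in a hypothetical session where a SparkContext sc is already up:
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6), 2) // two partitions
val same = sc.makeRDD(Seq(1, 2, 3, 4, 5, 6), 2)    // identical: delegates to parallelize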
2. The ParallelCollection class
class ParallelCollection[T: ClassManifest](
    sc: SparkContext,
    @transient data: Seq[T],
    numSlices: Int)
  extends RDD[T](sc) {
  @transient
  val splits_ = {
    // calls the slice() method; one split per slice
    val slices = ParallelCollection.slice(data, numSlices).toArray
    slices.indices.map(i => new ParallelCollectionSplit(id, i, slices(i))).toArray
  }
  override def splits = splits_.asInstanceOf[Array[Split]]
  override def compute(s: Split) = s.asInstanceOf[ParallelCollectionSplit[T]].iterator
  // no preferred locations: the data lives in memory on the driver
  override def preferredLocations(s: Split): Seq[String] = Nil
  // no dependencies
  override val dependencies: List[Dependency[_]] = Nil
}
splits_ calls slice() and wraps each slice in a ParallelCollectionSplit; let's look at slice() first.
3. The slice() method
def slice[T: ClassManifest](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  // a slice count below 1 is an error
  if (numSlices < 1) {
    throw new IllegalArgumentException("Positive number of slices required")
  }
  seq match {
    case r: Range.Inclusive => {
      // turn the inclusive range into an exclusive one, then recurse
      val sign = if (r.step < 0) {
        -1
      } else {
        1
      }
      slice(new Range(
        r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
    }
    case r: Range => {
      // split the range arithmetically; each slice is itself a Range
      (0 until numSlices).map(i => {
        val start = ((i * r.length.toLong) / numSlices).toInt
        val end = (((i+1) * r.length.toLong) / numSlices).toInt
        new Range(r.start + start * r.step, r.start + end * r.step, r.step)
      }).asInstanceOf[Seq[Seq[T]]]
    }
    case nr: NumericRange[_] => {
      // peel the range off slice by slice, sliceSize elements at a time
      val slices = new ArrayBuffer[Seq[T]](numSlices)
      val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
      var r = nr
      for (i <- 0 until numSlices) {
        slices += r.take(sliceSize).asInstanceOf[Seq[T]]
        r = r.drop(sliceSize)
      }
      slices
    }
    case _ => {
      val array = seq.toArray // To prevent O(n^2) operations for List etc
      (0 until numSlices).map(i => {
        val start = ((i * array.length.toLong) / numSlices).toInt
        val end = (((i+1) * array.length.toLong) / numSlices).toInt
        array.slice(start, end).toSeq
      })
    }
  }
}
The method matches on the Seq's concrete type: Range.Inclusive, Range, NumericRange, or anything else.
The to method produces a Range.Inclusive (e.g. 1 to 10 contains 10).
The until method, like new Range(), produces a plain Range (1 until 10 does not contain 10).
Non-Int element types such as Float, Char, and Long yield a NumericRange.
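A few quick type checks of those cases (the exact runtime classes vary slightly across Scala versions):
import scala.collection.immutable.NumericRange

(1 to 10).isInstanceOf[Range.Inclusive]    // true: `to` keeps the end value
(1 until 10).isInstanceOf[Range.Inclusive] // false: `until` drops it
(1L to 10L).isInstanceOf[NumericRange[_]]  // true: non-Int ranges are NumericRanges
('a' to 'd').isInstanceOf[NumericRange[_]] // true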
A few worked examples make the code easier to follow.
<1> (1 to 6 by 1), split into 2 partitions: elements 1, 2, 3, 4, 5, 6
case r: Range.Inclusive => {
  val sign = if (r.step < 0) {
    -1
  } else {
    1
  }
  slice(new Range(
    r.start, r.end + sign, r.step).asInstanceOf[Seq[T]], numSlices)
}
sign = 1, so slice() recurses with new Range(1, 7, 1) and numSlices = 2
(r.start and r.end are the Range's first and last values).
This time it matches:
case r: Range => {
  (0 until numSlices).map(i => {
    val start = ((i * r.length.toLong) / numSlices).toInt
    val end = (((i+1) * r.length.toLong) / numSlices).toInt
    new Range(r.start + start * r.step, r.start + end * r.step, r.step)
  }).asInstanceOf[Seq[Seq[T]]]
}
The loop runs twice:
when i = 0: start = 0, end = 3, producing Range(1, 4, 1), i.e. 1, 2, 3
when i = 1: start = 3, end = 6, producing Range(4, 7, 1), i.e. 4, 5, 6
One Range is thus split into two. Had the range been built with until, the resulting plain Range would match this second case directly.
<2> The NumericRange case
case nr: NumericRange[_] => {
  val slices = new ArrayBuffer[Seq[T]](numSlices)
  val sliceSize = (nr.size + numSlices - 1) / numSlices // Round up to catch everything
  var r = nr
  for (i <- 0 until numSlices) {
    slices += r.take(sliceSize).asInstanceOf[Seq[T]]
    r = r.drop(sliceSize)
  }
  slices
}
Example: 'a' to 'd' with 2 partitions: a, b, c, d.
slices is a growable array sized for 2 slices;
sliceSize = (4 + 2 - 1) / 2 = 2;
r = a, b, c, d.
The loop (take(n) keeps the first n elements, drop(n) removes them):
when i = 0, slices takes the first 2 elements, a and b, which are then dropped from r;
when i = 1, slices takes c and d.
The result is a buffer of two Seqs: one holding a, b, the other c, d.
<3> The default case
case _ => {
  val array = seq.toArray // To prevent O(n^2) operations for List etc
  (0 until numSlices).map(i => {
    val start = ((i * array.length.toLong) / numSlices).toInt
    val end = (((i+1) * array.length.toLong) / numSlices).toInt
    array.slice(start, end).toSeq
  })
}
Example: List(5, 7, 3, 9, 11, 4) with 2 partitions.
The Seq is first converted to an array, then the loop runs twice:
when i = 0: start = 0*6/2 = 0, end = (0+1)*6/2 = 3; array.slice(n, m) takes the elements from index n through m-1, so this takes indices 0, 1, 2;
when i = 1: start = 1*6/2 = 3, end = (1+1)*6/2 = 6, taking indices 3, 4, 5.
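A standalone re-run of that branch confirms the split:
def sliceDefault(seq: Seq[Int], numSlices: Int): Seq[Seq[Int]] = {
  val array = seq.toArray
  (0 until numSlices).map { i =>
    val start = ((i * array.length.toLong) / numSlices).toInt
    val end = (((i + 1) * array.length.toLong) / numSlices).toInt
    array.slice(start, end).toSeq
  }
}
sliceDefault(List(5, 7, 3, 9, 11, 4), 2) // Seq(Seq(5, 7, 3), Seq(9, 11, 4))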
4. The ParallelCollectionSplit class
val splits_ = {
  val slices = ParallelCollection.slice(data, numSlices).toArray
  slices.indices.map(i => new ParallelCollectionSplit(id, i, slices(i))).toArray
}
After the collection returned by slice() is converted to an array, indices yields all of its indices, and one ParallelCollectionSplit is created per slice:
with 2 partitions you get 2 ParallelCollectionSplit objects, the i-th one holding the i-th slice of the data.
class ParallelCollectionSplit[T: ClassManifest](
    val rddId: Long,
    val slice: Int,
    values: Seq[T])
  extends Split with Serializable {
  def iterator(): Iterator[T] = values.iterator // an iterator over this slice's elements
  override def hashCode(): Int = (41 * (41 + rddId) + slice).toInt
  override def equals(other: Any): Boolean = other match {
    case that: ParallelCollectionSplit[_] => (this.rddId == that.rddId && this.slice == that.slice)
    case _ => false
  }
  override val index = slice // the slice number doubles as the RDD's partition index
}