摘要
RDD,弹性分布式数据集,是spark的底层数据结构。RDD是一个容错的,可以被并行操作的数据集合。RDD的特点之一是分布式存储,它的好处就是数据存储在不同的节点上,当需要数据进行计算的时候可以在这些节点上并行操作。弹性表现在节点在存储RDD数据的时候,既可以存储在内存中,也可以存储在磁盘上,也可以两者结合使用。RDD还有个特点就是延迟计算,当是transformation算子的时候,并不执行操作,直到遇到action算子的时候才开始执行计算。根据RDD源码里面的注释,我们来了解一下RDD的五大特性:
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* - A function for computing each split
* - A list of dependencies on other RDDs
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
* Internally, each RDD is characterized by five main properties:
*
* - A list of partitions
* RDD是由多个partition构成的
* - A function for computing each split
* RDD的每个分区上都有一个函数去作用
* - A list of dependencies on other RDDs
* RDD有依赖性,通常情况下一个RDD是来源于另一个RDD,这个叫做lineage,即血缘关系。RDD会记录下这些依赖,方便容错。
* - Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
* 可选项,如果RDD里面存的数据是key-value形式,则可以传递一个自定义的Partitioner进行重新分区,例如这里自定义的Partitioner是基于key进行分区,那则会将不同RDD里面的相同key的数据放到同一个partition里面
* - Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
* 最优的位置去计算,也就是数据的本地性
1.RDD的五大属性
1.1partitions(分区)
partitions :(分区属性): 每个RDD包括多个分区,这既是RDD的数据单位,也是计算粒度, 每个分区是由一个Task线程处理。在RDD创建的时候可以指定分区的个数,如果没有指定,那么默认分区个数由参数spark.default.parallelism指定(如果未设置这个参数 ,则在yarn或者standalone模式下有如下推导:spark.default.parallelism = max(所有executor使用的core总数,2))。每一分区对应一个内存block,,由BlockManager分配。
// Our dependencies and partitions will be gotten by calling subclass's methods below, and will
// be overwritten when we're checkpointed
private var dependencies_ : Seq[Dependency[_]] = null
@transient private var partitions_ : Array[Partition] = null
子类可以通过调用下面的方法来获取分区列表,当处于检查点时,分区信息会被重写
/**
* Get the array of partitions of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def partitions: Array[Partition] = {
checkpointRDD.map(_.partitions).getOrElse {
if (partitions_ == null) {
partitions_ = getPartitions
partitions_.zipWithIndex.foreach { case (partition, index) =>
require(partition.index == index,
s"partitions($index).partition == ${partition.index}, but it should equal $index")
}
}
partitions_
}
}
Partition实现:
/**
* An identifier for a partition in an RDD.
*/
trait Partition extends Serializable {
/**
* Get the partition's index within its parent RDD
*/
def index: Int
// A better default implementation of HashCode
override def hashCode(): Int = index
override def equals(other: Any): Boolean = super.equals(other)
}
partition 与 iterator 方法
RDD 的 iterator(split: Partition, context: TaskContext): Iterator[T] 方法用来获取 split 指定的 Partition 对应的数据的迭代器,有了这个迭代器就能一条一条取出数据来按 compute chain 来执行一个个transform 操作。iterator 的实现如下:
/**
* Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
* This should ''not'' be called by users directly, but is available for implementors of custom
* subclasses of RDD.
*/
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
if (storageLevel != StorageLevel.NONE) {
getOrCompute(split, context)
} else {
computeOrReadCheckpoint(split, context)
}
}
其先判断 RDD 的 storageLevel 是否为 NONE,若不是,则尝试从缓存中读取,读取不到则通过计算来获取该Partition对应的数据的迭代器;若是,尝试从 checkpoint 中获取 Partition 对应数据的迭代器,若 checkpoint 不存在则通过计算(compute属性)
1.2partitioner(分区方法)
RDD的分区方式,这个属性指的是RDD的partitioner函数(分片函数),分区函数就是将数据分配到指定的分区,这个目前实现了HashPartitioner和RangePartitioner,只有key-value的RDD才会有分片函数,否则为none。分片函数不仅决定了当前分片的个数,同时决定parent shuffle RDD的输出的分区个数。
/**
* An object that defines how the elements in a key-value pair RDD are partitioned by key.
* Maps each key to a partition ID, from 0 to `numPartitions - 1`.
*/
abstract class Partitioner extends Serializable {
def numPartitions: Int
def getPartition(key: Any): Int
}
HashPartitioner如何决定一个键对应的分区的:
def getPartition(key: Any): Int = key match {
case null => 0
case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
}
其中nonNegativeMod方法考虑到了key的符号,如果key是负数,就返回key%numPartitions +numPartitions(补数);HashPartitioner是基于Object的hashcode来分区的,所以不应该对集合类型进行哈希分区。
RangePartitioner如何决定一个键对应的分区的:
def getPartition(key: Any): Int = {
val k = key.asInstanceOf[K]
var partition = 0
if (rangeBounds.length <= 128) {
// If we have less than 128 partitions naive search
while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
partition += 1
}
} else {
// Determine which binary search method to use only once.
partition = binarySearch(rangeBounds, k)
// binarySearch either returns the match location or -[insertion point]-1
if (partition < 0) {
partition = -partition-1
}
if (partition > rangeBounds.length) {
partition = rangeBounds.length
}
}
if (ascending) {
partition
} else {
rangeBounds.length - partition
}
}
其中rangeBounds是各个分区的上边界的Array。而rangeBounds的具体计算是通过抽样进行估计的,具体代码可以参照RangePartitioner 实现简记。RangePartitioner是根据key值大小进行分区的,所以支持RDD的排序类算子。
1.3dependencies(依赖关系)
Spark的运行过程就是RDD之间的转换,因此,必须记录RDD之间的生成关系(新RDD是由哪个或哪几个父RDD生成),这就是所谓的依赖关系,这样既有助于阶段和任务的划分,也有助于在某个分区出错的时候,只需要重新计算与当前出错的分区有关的分区,而不需要计算所有的分区。dependencies_是一个记录Dependency关系的序列(Seq):
// Our dependencies and partitions will be gotten by calling subclass's methods below, and will
// be overwritten when we're checkpointed
private var dependencies_ : Seq[Dependency[_]] = null
@transient private var partitions_ : Array[Partition] = null
RDD是如何记录依赖关系的:
/**
* Get the array of partitions of this RDD, taking into account whether the
* RDD is checkpointed or not.
*/
final def partitions: Array[Partition] = {
checkpointRDD.map(_.partitions).getOrElse {
if (partitions_ == null) {
partitions_ = getPartitions
partitions_.zipWithIndex.foreach { case (partition, index) =>
require(partition.index == index,
s"partitions($index).partition == ${partition.index}, but it should equal $index")
}
}
partitions_
}
}
设计接口的一个关键问题就是,如何表示RDD之间的依赖。我们发现RDD之间的依赖关系可以分为两类,即:
(1)窄依赖(narrow dependencies):子RDD的每个分区依赖于常数个父分区(即与数据规模无关);
(2)宽依赖(wide dependencies):子RDD的每个分区依赖于所有父RDD分区。例如,map产生窄依赖,而join则是宽依赖(除非父RDD被哈希分区)。
如图1所示:
图1 窄依赖和宽依赖的例子。(方框表示RDD,实心矩形表示分区)
窄依赖是一种常数依赖的对应关系,所以可以直接从父分区中获取;宽依赖则不行。以下是宽依赖(实现是ShuffleDependency)的几个重要属性。
/**
* :: DeveloperApi ::
* Represents a dependency on the output of a shuffle stage. Note that in the case of shuffle,
* the RDD is transient since we don't need it on the executor side.
*
* @param _rdd the parent RDD
* @param partitioner partitioner used to partition the shuffle output
* @param serializer [[org.apache.spark.serializer.Serializer Serializer]] to use. If not set
* explicitly then the default serializer, as specified by `spark.serializer`
* config option, will be used.
* @param keyOrdering key ordering for RDD's shuffles
* @param aggregator map/reduce-side aggregator for RDD's shuffle
* @param mapSideCombine whether to perform partial aggregation (also known as map-side combine)
*/
@DeveloperApi
class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
@transient private val _rdd: RDD[_ <: Product2[K, V]],
val partitioner: Partitioner,
val serializer: Serializer = SparkEnv.get.serializer,
val keyOrdering: Option[Ordering[K]] = None,
val aggregator: Option[Aggregator[K, V, C]] = None,
val mapSideCombine: Boolean = false)
extends Dependency[Product2[K, V]] {
override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]
private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
// Note: It's possible that the combiner class tag is null, if the combineByKey
// methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
private[spark] val combinerClassName: Option[String] =
Option(reflect.classTag[C]).map(_.runtimeClass.getName)
val shuffleId: Int = _rdd.context.newShuffleId()
val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
shuffleId, _rdd.partitions.length, this)
_rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
1.4compute(获取分区迭代列表)
计算属性: 当调用 RDD#iterator 方法无法从缓存或 checkpoint 中获取指定 partition 的迭代器时,就需要调用 compute 方法来获取。RDD不仅包含有数据,还有在数据上的计算,每个RDD以分区为计算粒度,每个RDD会实现compute函数,compute函数会和迭代器(RDD之间转换的迭代器)进行复合,这样就不需要保存每次compute运行的结果。
代码:
/**
* :: DeveloperApi ::
* Implemented by subclasses to compute a given partition.
*/
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]
下面举几个算子操作:
1、map
/**
* An RDD that applies the provided function to every partition of the parent RDD.
*/
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
var prev: RDD[T],
f: (TaskContext, Int, Iterator[T]) => Iterator[U], // (TaskContext, partition index, iterator)
preservesPartitioning: Boolean = false)
extends RDD[U](prev) {
override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None
override def getPartitions: Array[Partition] = firstParent[T].partitions
override def compute(split: Partition, context: TaskContext): Iterator[U] =
f(context, split.index, firstParent[T].iterator(split, context))
override def clearDependencies() {
super.clearDependencies()
prev = null
}
}
/** Returns the first parent RDD */
protected[spark] def firstParent[U: ClassTag]: RDD[U] = {
dependencies.head.rdd.asInstanceOf[RDD[U]]
}
上面代码中的 firstParent 是指本 RDD 的依赖 dependencies: Seq[Dependency[_]] 中的第一个,MapPartitionsRDD 的依赖中只有一个父 RDD。而 MapPartitionsRDD 的 partition 与其唯一的父 RDD partition 是一一对应的,所以其 compute 方法可以描述为:对父 RDD partition 中的每一个元素执行传入 map (代码中的f(context,split.index,iterator)函数)的方法得到自身的 partition 及迭代器。
2、groupByKey
与 map、union 不同,groupByKey 是一个会产生宽依赖(ShuffleDependency)的 transform,其最终生成的 RDD 是 ShuffledRDD,来看看其 compute 实现:
override def compute(split: Partition, context: TaskContext): Iterator[(K, C)] = {
val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
SparkEnv.get.shuffleManager.getReader(dep.shuffleHandle, split.index, split.index + 1, context)
.read()
.asInstanceOf[Iterator[(K, C)]]
}
对应groupByKey的源码,官方给出三种实现方式。但是,三个方法只是传递的参数不同,整体需要实现的功能是相同的,需要对结果的分区进行控制的话可以使用带有分区器参数的方法,需要重新设置分区数量的话可以使用带有分区数参数的方法,使用官方默认设置的话则是用无参数的方法。
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with the existing partitioner/parallelism level. The ordering of elements
* within each group is not guaranteed, and may even differ each time the resulting RDD is
* evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
* 默认设置的方法
*/
def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(defaultPartitioner(self))
}
/**
* Group the values for each key in the RDD into a single sequence. Allows controlling the
* partitioning of the resulting key-value pair RDD by passing a Partitioner.
* The ordering of elements within each group is not guaranteed, and may even differ
* each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
* 带有分区器参数的方法
*/
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
// groupByKey shouldn't use map side combine because map side combine does not
// reduce the amount of data shuffled and requires all map side data be inserted
// into a hash table, leading to more objects in the old gen.
val createCombiner = (v: V) => CompactBuffer(v)
val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
/**
* Group the values for each key in the RDD into a single sequence. Hash-partitions the
* resulting RDD with into `numPartitions` partitions. The ordering of elements within
* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.
*
* @note This operation may be very expensive. If you are grouping in order to perform an
* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`
* or `PairRDDFunctions.reduceByKey` will provide much better performance.
*
* @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
* 带有分区数量参数的方法
*/
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {
groupByKey(new HashPartitioner(numPartitions))
}
groupByKey方法主要作用是将键相同的所有的键值对分组到一个集合序列当中,如(a,1),(a,3),(b,1),(c,1),(c,3),分组后结果是((a,1),(a,3)),(b,1),((c,1),(c,3)),分组后的集合中的元素顺序是不确定的,比如键a的值集合也可能是((a,3),(a,1))。
相对而言,groupByKey方法是比较昂贵的操作,意思就是说比较消耗资源。所以如果你的目的是分组后对每一个键所对应的所有值进行求和或者取平均的话,那么请使用PairRDD中的reduceByKey方法或者aggregateByKey方法,这两种方法可以提供更好的性能 。
groupBykey是把所有的键值对集合都加载到内存中存储计算,所以如果一个键对应的值太多的话,就会导致内存溢出的错误,这是需要重点关注的地方。
3、reduceByKey
总体来说下面三种形式的方法备注大意为:
根据用户传入的函数来对(K,V)中每个K对应的所有values做merge操作(具体的操作类型根据用户定义的函数),在将结果发送给reducer节点前该merge操作首先会在本地Mapper端进行。但是具体到每个方法,根据传入的参数其含义又有所延伸,下面会具体解释:
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce.
* 传入分区器,根据分区器重新分区
*/
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
* 重新设置分区数
*/
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {
reduceByKey(new HashPartitioner(numPartitions), func)
}
/**
* Merge the values for each key using an associative and commutative reduce function. This will
* also perform the merging locally on each mapper before sending results to a reducer, similarly
* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
* parallelism level.
* 使用默认分区器
*/
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
reduceByKey(defaultPartitioner(self), func)
}
接着往下面来看,reduceByKey方法主要执行逻辑在combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)这个方法中,源码如下:
def combineByKeyWithClassTag[C](
createCombiner: V => C, //把V装进C中
mergeValue: (C, V) => C, //把V整合进入C中
mergeCombiners: (C, C) => C, //整合两个C成为一个
partitioner: Partitioner,
mapSideCombine: Boolean = true,
serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
//这里可以看到,pairRDD的key类型不能为数组,否则会报错
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
//hash分区器不能作用于数组键
if (partitioner.isInstanceOf[HashPartitioner]) {
throw new SparkException("HashPartitioner cannot partition array keys.")
}
}
val aggregator = new Aggregator[K, V, C](
self.context.clean(createCombiner),
self.context.clean(mergeValue),
self.context.clean(mergeCombiners))
//判断传入分区器是否相同
if (self.partitioner == Some(partitioner)) {
self.mapPartitions(iter => {
val context = TaskContext.get()
new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
}, preservesPartitioning = true)
} else {
//不相同的话重新返回shufferRDD
new ShuffledRDD[K, V, C](self, partitioner)
.setSerializer(serializer)
.setAggregator(aggregator)
.setMapSideCombine(mapSideCombine)
}
}
reduceByKey与groupByKey进行对比
1、返回值类型不同:reduceByKey返回的是RDD[(K, V)],而groupByKey返回的是RDD[(K, Iterable[V])],举例来说这两者的区别。比如含有一下数据的rdd应用上面两个方法做求和:(a,1),(a,2),(a,3),(b,1),(b,2),(c,1);reduceByKey产生的中间结果(a,6),(b,3),(c,1);而groupByKey产生的中间结果结果为((a,1)(a,2)(a,3)),((b,1)(b,2)),(c,1),(以上结果为一个分区中的中间结果)可见groupByKey的结果更加消耗资源。
2、作用不同,reduceByKey作用是聚合,异或等,groupByKey作用主要是分组,也可以做聚合(分组之后)。
3、map端中间结果对键对应的值得聚合方式不同。
单词计数说明两种方式的区别
val words = Array("a", "a", "a", "b", "b", "b")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))
val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _) //reduceByKey
val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)) //groupByKey
4、combineByKey
作用: 将RDD[(K,V)] => RDD[(K,C)] 表示V的类型可以转成C两者可以不同类型。
def combineByKey[C](createCombiner:V =>C ,mergeValue:(C,V) =>C, mergeCombiners:(C,C) =>C):RDD[(K,C)]
def combineByKey[C](createCombiner:V =>C ,mergeValue:(C,V) =>C,
mergeCombiners:(C,C) =>C,numPartitions:Int ):RDD[(K,C)]
def combineByKey[C](createCombiner:V =>C ,mergeValue:(C,V) =>C,
mergeCombiners:(C,C)=>C,partitioner:Partitioner,mapSideCombine:Boolean=true,
serializer:Serializer= null):RDD[(K,C)]
第一个函数和第二个函数默认使用的是HashPartitioner、serialize为null。
这个算子还是比较复杂,解释下:
1)createCombiner:在遍历RDD的数据集合过程中,对于遍历到的(k,v),如果combineByKey第一次遇到值为k的Key(类型K),那么将对这个(k,v)调用createCombiner函数,它的作用是将v转换为c(类型是C,聚合对象的类型,c作为局和对象的初始值)
2)mergeValue:在遍历RDD的数据集合过程中,对于遍历到的(k,v),如果combineByKey不是第一次(或者第二次,第三次…)遇到值为k的Key(类型K),那么将对这个(k,v)调用mergeValue函数,它的作用是将v累加到聚合对象(类型C)中,mergeValue的类型是(C,V)=>C,参数中的C遍历到此处的聚合对象,然后对v 进行聚合得到新的聚合对象值。
3)mergeCombiners:因为combineByKey是在分布式环境下执行,RDD的每个分区单独进行combineByKey操作,最后需要对各个分区的结果进行最后的聚合,它的函数类型是(C,C)=>C,每个参数是分区聚合得到的聚合对象
例子:
scala> val data = sc.parallelize(List(("1","3"),("1","2"),("1","5"),("2","3")))
scala> val natPairRdd = data.combineByKey(List(_), (c: List[String], v: String) => v::c, (c1: List[String], c2: List[String]) => c1 ::: c2)
scala> natPairRdd.collect
res0: Array[(String, List[String])] = Array((1,List(3, 2, 5)), (2,List(3)))
1.5preferedLocations(优先分配节点列表)
对于分区而言返回数据本地化计算的节点列表。也就是说,每个RDD会报出一个列表(Seq),而这个列表保存着分片优先分配给哪个Worker节点计算,spark坚持移动计算而非移动数据的原则。也就是尽量在存储数据的节点上进行计算。
要注意的是,并不是每个 RDD 都有 preferedLocation,比如从 Scala 集合中创建的 RDD 就没有,而从 HDFS 读取的 RDD 就有。
/**
* Get the preferred locations of a partition, taking into account whether the
* RDD is checkpointed.
*/
final def preferredLocations(split: Partition): Seq[String] = {
checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
getPreferredLocations(split)
}
}
spark 本地化级别:PROCESS_LOCAL => NODE_LOCAL => NO_PREF => RACK_LOCAL => ANY
PROCESS_LOCAL 进程本地化:task要计算的数据在同一个Executor中。
NODE_LOCAL 节点本地化:速度比 PROCESS_LOCAL 稍慢,因为数据需要在不同进程之间传递或从文件中读取。
NODE_PREF 没有最佳位置这一说,数据从哪里访问都一样快,不需要位置优先。比如说SparkSQL读取MySql中的数据。
RACK_LOCAL 机架本地化,数据在同一机架的不同节点上。需要通过网络传输数据及文件 IO,比 NODE_LOCAL 慢。
ANY 跨机架,数据在非同一机架的网络上,速度最慢。