njzhujinhua@2017/12/15
1. RDD Implementation
1.1 Job Scheduling
When an action is run on an RDD, the scheduler examines the RDD's lineage (Lineage) and builds a directed acyclic graph (DAG) of stages, where each stage contains as many consecutive narrow-dependency transformations as possible. The scheduler then computes the stages in DAG order to produce the final RDD.
The scheduler assigns tasks to executor nodes lazily, placing each task according to where its data actually resides: instead of moving data, Spark moves the computation to the data.
What actually triggers RDD computation is an action. When an action executes, stages are delimited at wide dependencies; inside a stage all dependencies are narrow, so the transformations within a stage chain together into a pipeline.
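The stage-cutting rule above can be sketched as follows (a minimal sketch, assuming a running SparkContext `sc`; the input path is hypothetical):

```scala
// flatMap and map are narrow dependencies: they pipeline within one stage.
// reduceByKey needs a shuffle (wide dependency), so the scheduler cuts the
// DAG into two stages at that point. Nothing runs until the action.
val pairs = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
  .flatMap(_.split(" "))                           // narrow: same stage
  .map(word => (word, 1))                          // narrow: same stage
val counts = pairs.reduceByKey(_ + _)              // wide: stage boundary
counts.collect()                                   // action: triggers the whole DAG
```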
1.2 Memory Management
During job execution, a lost partition can be recovered by recomputing its parent partitions; under a narrow dependency only the corresponding parent partitions are needed, but under a wide dependency data from all parent partitions must be recomputed. Moreover, every action on an RDD triggers execution of the stages it depends on, which is very bad for performance if the RDD is recomputed each time.
Spark supports three storage strategies for persisted RDDs: deserialized objects in memory, serialized data in memory, and on-disk storage. The first is clearly the fastest; the third is intended for RDDs that are too large for memory but are used repeatedly and too expensive to recompute each time.
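The three strategies map directly to StorageLevel constants (a sketch, assuming a SparkContext `sc`; note that a storage level can be assigned to an RDD only once, hence the commented alternatives):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY)        // deserialized objects in memory (fastest)
// rdd.persist(StorageLevel.MEMORY_ONLY_SER) // serialized bytes in memory (more compact)
// rdd.persist(StorageLevel.DISK_ONLY)       // disk storage for very large RDDs
rdd.count()  // the first action materializes the persisted data
```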
Lineage can be used to rebuild an RDD after a failure, but if the lineage is long, recovery time becomes considerable; such an RDD should be saved to external storage via a checkpoint. In general, checkpointing is well worth it for an RDD with a long lineage containing wide dependencies, while checkpointing an RDD with only narrow dependencies is essentially unnecessary.
1.3 The Relationship Between cache(), persist() and checkpoint()
cache vs. persist
cache suits RDDs that are used frequently but are not too large. cache can only store to memory, whereas persist supports multiple storage levels; looking at the source, cache is simply implemented as a special case of persist with a fixed argument.
/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
// The private persist implementation: takes a custom storage level and a flag
// controlling whether an already-assigned level may be overridden
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  // Current behavior: if a different level was already set and override is not
  // allowed, throw an exception
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
The no-argument persist likewise defaults to memory-only storage:
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
cache is identical to the no-argument persist:
def cache(): this.type = persist()
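Putting the pieces above together, a short usage sketch (assuming a SparkContext `sc`):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.cache()  // equivalent to rdd.persist(StorageLevel.MEMORY_ONLY)
assert(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)

// Per the private persist shown above, assigning a *different* level now
// throws UnsupportedOperationException:
// rdd.persist(StorageLevel.DISK_ONLY)  // would throw
```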
cache/persist vs. checkpoint
With cache or persist, the data is written to memory or disk directly as it is computed, whereas checkpoint waits until the job finishes and then launches a separate job to perform the checkpoint; the RDD is therefore computed twice. To avoid the redundant work, call rdd.cache() before checkpointing so that the checkpoint job can read the cached data directly.
cache saves the RDD to memory or disk while its lineage remains recorded in the RDD's dependencies, so if part of a cached RDD is lost, it can still be recomputed. checkpoint, by contrast, clears the lineage entirely once the RDD has been saved to HDFS; it can thus be used to truncate a very long lineage while storing the data durably on HDFS.
Both persist and checkpoint can write data to disk, but beyond the difference above they also differ in lifecycle. When an RDD is persisted to disk, its partitions are managed by the BlockManager; once the driver process finishes, the executor processes exit, the BlockManager shuts down, and the persisted data is cleared. Data checkpointed to HDFS or a local directory, however, survives until it is explicitly deleted by hand, and can even be read by other applications, whereas persisted data cannot be shared across applications.
The implementation of checkpoint
/**
* Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
* directory set with `SparkContext#setCheckpointDir` and all references to its parent
* RDDs will be removed. This function must be called before any job has been
* executed on this RDD. It is strongly recommended that this RDD is persisted in
* memory, otherwise saving it on a file will require recomputation.
*/
checkpoint() merely marks this RDD for checkpointing. The data is saved under the directory set via SparkContext#setCheckpointDir, and all references to parent RDDs (the lineage) are removed.
The function must be called before any job has been executed on this RDD, and it is strongly recommended to persist the RDD in memory first, otherwise it will be recomputed.
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
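The checkpoint workflow implied by this code can be sketched as follows (assuming a SparkContext `sc`; the checkpoint directory is hypothetical):

```scala
// The checkpoint directory must be set first, or checkpoint() throws.
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical directory

val rdd = sc.parallelize(1 to 1000).map(_ + 1)
rdd.cache()       // recommended: the checkpoint job then reads the cache
rdd.checkpoint()  // only marks the RDD; must be called before any job on it
rdd.count()       // the first action runs the job, then a separate checkpoint job
```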