How caching works
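Before computing an RDD partition, Spark first checks whether that partition is supposed to be cached. If it is, the partition is computed and the result is stored in the cache, by default in memory. Writing to HDFS (or another external storage system) is a separate mechanism, checkpointing, which is checked for on its own.
Calling RDD.cache() marks the RDD as persisted with storage level MEMORY_ONLY (memory only), and the RDD registers itself with the Driver through persistRDD so that the Driver knows it must be persisted. Later, when RDD.iterator is called to compute one Partition of that RDD, the cacheManager builds the blockId for the Partition and asks the BlockManager whether a block with that blockId is already cached. If it is, the cached records are returned directly. If it is not, Spark checks whether the Partition has been checkpointed: if so, all records of the partition are read from the checkpoint into an ArrayBuffer (the cache); if not, the Partition is computed and all of its records are put into the cache. (The cacheManager belongs to older Spark releases; since Spark 2.0 the same logic lives in RDD.getOrCompute.)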
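The lookup order described above can be sketched in plain Scala without Spark. This is a toy model only: ToyBlockStore, ToyRdd, BlockId and computeCalls are hypothetical names invented for the illustration and are not Spark classes, and the checkpoint is modelled as a simple in-memory Map.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a block id: one block per (RDD, partition) pair.
final case class BlockId(rddId: Int, partitionIndex: Int)

// Hypothetical stand-in for the BlockManager: an in-memory map of cached partitions.
final class ToyBlockStore {
  private val blocks = mutable.Map.empty[BlockId, Vector[Int]]
  def get(id: BlockId): Option[Vector[Int]] = blocks.get(id)
  def put(id: BlockId, records: Vector[Int]): Unit = blocks(id) = records
}

// Hypothetical stand-in for a cached RDD of Ints.
final class ToyRdd(
    val id: Int,
    compute: Int => Vector[Int],
    checkpointed: Map[Int, Vector[Int]] = Map.empty) {

  // Counts how often a partition was really computed (for the demonstration only).
  var computeCalls = 0

  // Read the partition from checkpoint data when present, otherwise compute it.
  private def computeOrReadCheckpoint(partition: Int): Vector[Int] =
    checkpointed.getOrElse(partition, { computeCalls += 1; compute(partition) })

  // Cache lookup first; on a miss, obtain the records and store them in the cache.
  def iterator(partition: Int, store: ToyBlockStore): Iterator[Int] = {
    val blockId = BlockId(id, partition)
    val records = store.get(blockId).getOrElse {
      val fresh = computeOrReadCheckpoint(partition)
      store.put(blockId, fresh)
      fresh
    }
    records.iterator
  }
}
```

Calling iterator twice for the same partition computes it only once in this model; the second call is served from the store, which is the behaviour the cache is meant to provide.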
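Differences between cache, persist, and checkpoint
cache and persist are the same mechanism: both are caching
cache() simply calls the no-argument persist(), and the no-argument persist() fixes the storage level at MEMORY_ONLY by calling the one-argument persist(newLevel). The source from RDD.scala below shows that persist lets the caller choose the storage level, while cache does not: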
    def persist(newLevel: StorageLevel): this.type = {
      if (isLocallyCheckpointed) {
        // This means the user previously called localCheckpoint(), which should have already
        // marked this RDD for persisting. Here we should override the old storage level with
        // one that is explicitly requested by the user (after adapting it to use disk).
        persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
      } else {
        persist(newLevel, allowOverride = false)
      }
    }

    /**
     * Persist this RDD with the default storage level (`MEMORY_ONLY`).
     */
    def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

    /**
     * Persist this RDD with the default storage level (`MEMORY_ONLY`).
     */
    def cache(): this.type = persist()
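A minimal usage sketch of the two calls, assuming Spark is on the classpath and a local master is acceptable. The object name PersistLevels, the application name and the sample data are made up for the illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistLevels {
  def main(args: Array[String]): Unit = {
    // Assumption: a local two-thread master is enough for the demonstration.
    val conf = new SparkConf().setAppName("persist-levels").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = sc.parallelize(1 to 1000).map(x => x * x)

      // cache() takes no argument, so the level is always MEMORY_ONLY.
      squares.cache()

      // persist(level) lets the caller pick any storage level.
      val evens = squares.filter(_ % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK)

      println(squares.getStorageLevel == StorageLevel.MEMORY_ONLY)
      println(evens.getStorageLevel == StorageLevel.MEMORY_AND_DISK)

      // Nothing is stored until an action runs; count() materialises both caches.
      evens.count()

      // Release the cached blocks once they are no longer needed.
      evens.unpersist()
      squares.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Because the non-local-checkpoint branch passes allowOverride = false, calling persist with a different level on an RDD that already has one is rejected with an UnsupportedOperationException; call unpersist() first if the level has to change.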
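Differences between checkpoint and persist
Creating a checkpoint requires a checkpoint directory, and the RDD is written under that directory. Once the checkpoint has been written, all parent dependencies of the RDD are dropped, so the checkpoint becomes the end of the lineage. Cache the RDD before checkpointing it: the checkpoint is not written while the job that first computes the RDD is running; only after that job finishes does Spark launch a separate job to write the checkpoint. Without a cached copy, the RDD is therefore computed twice.
A checkpoint is persisted permanently on disk. persist can also store data on disk, but those partitions are managed by the BlockManager. Once the Driver program ends, the Executor process CoarseGrainedExecutorBackend stops, the BlockManager stops with it, and the cached RDD is cleared. A checkpoint is not cleared when the Driver program stops; it stays until it is removed manually and can be used by the next Driver program.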
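The cache-then-checkpoint order described above might look like the following sketch. It assumes Spark is on the classpath; the object name CheckpointSketch and the sample data are invented for the illustration, and a local temporary directory stands in for the HDFS path a real cluster would normally use.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Assumption: a local temp directory replaces an HDFS checkpoint directory.
      val dir = Files.createTempDirectory("spark-checkpoint").toString
      sc.setCheckpointDir(dir)

      val counts = sc.parallelize(Seq("a", "b", "a", "c"))
        .map(w => (w, 1))
        .reduceByKey(_ + _)

      counts.cache()      // keeps the data so the checkpoint job need not recompute it
      counts.checkpoint() // only marks the RDD; must be called before the first action

      counts.count()      // first action: computes the RDD, then the checkpoint is written

      println(counts.isCheckpointed)    // whether the checkpoint has been written
      println(counts.getCheckpointFile) // location of the checkpoint files, if any
      println(counts.toDebugString)     // lineage as seen after checkpointing
    } finally {
      sc.stop()
    }
  }
}
```

The checkpoint files under the chosen directory are left in place after sc.stop(), whereas the blocks stored by cache() disappear with the executors.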