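Persistence operators save the result of a computation so that it does not have to be recomputed.
Relevant source code
Persistence operators: cache, persist, unpersist
The source code is as follows: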
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def cache(): this.type = persist()

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}

/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

/**
 * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
 *
 * @param blocking Whether to block until all blocks are deleted.
 * @return This RDD.
 */
def unpersist(blocking: Boolean = true): this.type = {
  logInfo("Removing RDD " + id + " from persistence list")
  sc.unpersistRDD(id, blocking)
  storageLevel = StorageLevel.NONE
  this
}
As the source shows, cache is an alias for the no-argument persist, which in turn calls persist(StorageLevel.MEMORY_ONLY).
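The behaviour read off the source can be checked with a small program. This is a minimal sketch, assuming Spark is on the classpath and that a `local[*]` master is acceptable; the object name `CacheAliasSketch` and the app name are illustrative and do not come from the original. It checks that `cache()` and `persist()` both assign `MEMORY_ONLY`, and that the guard in the private `persist` rejects a different level until `unpersist` resets the level to `NONE`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheAliasSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this illustration.
    val conf = new SparkConf().setAppName("cache-alias-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val cached = sc.parallelize(Seq(1, 2, 3)).cache()
      val persisted = sc.parallelize(Seq(1, 2, 3)).persist()
      // Both calls end up assigning MEMORY_ONLY.
      assert(cached.getStorageLevel == StorageLevel.MEMORY_ONLY)
      assert(persisted.getStorageLevel == StorageLevel.MEMORY_ONLY)

      // Requesting a different level on an already persisted RDD is rejected.
      val rejected =
        try { cached.persist(StorageLevel.MEMORY_ONLY_SER); false }
        catch { case _: UnsupportedOperationException => true }
      assert(rejected)

      // unpersist resets the level to NONE, so a new level can be assigned.
      cached.unpersist()
      cached.persist(StorageLevel.MEMORY_ONLY_SER)
      assert(cached.getStorageLevel == StorageLevel.MEMORY_ONLY_SER)
    } finally {
      sc.stop()
    }
  }
}
```

Note that none of these calls computes anything: `persist` only records the level, and the blocks are stored when the first action runs.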
Usage recommendations
After persisting an RDD, remove it with the unpersist operator once it is no longer needed.
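One way to make sure the unpersist call is never forgotten is to scope the persistence to a block of work. The sketch below assumes Spark is on the classpath and a `local[*]` master; `PersistScope` and `withPersisted` are hypothetical names introduced here for illustration and are not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistScope {
  // Hypothetical helper, not part of Spark: persist `rdd` while `body` runs,
  // then unpersist it even if `body` throws.
  def withPersisted[T, R](rdd: RDD[T], level: StorageLevel = StorageLevel.MEMORY_ONLY)(
      body: RDD[T] => R): R = {
    rdd.persist(level)
    try body(rdd)
    finally rdd.unpersist()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for this illustration.
    val sc = new SparkContext(new SparkConf().setAppName("persist-scope").setMaster("local[*]"))
    try {
      val values = sc.parallelize(Seq(1.1, -2.2, 3.3))
      // Two actions reuse the cached partitions; the RDD is unpersisted afterwards.
      val (total, count) = withPersisted(values) { r => (r.sum(), r.count()) }
      println(s"total=$total count=$count")
      assert(values.getStorageLevel == StorageLevel.NONE)
    } finally {
      sc.stop()
    }
  }
}
```

The body should run its actions before returning. If it returned a lazily evaluated RDD derived from the persisted one, the cache would already be gone by the time that RDD is computed.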
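When choosing a storage level, avoid disk-backed levels where possible: writing to and reading from disk is expensive, and reading a partition back from disk can be no faster than recomputing it, unless the computation that produced the RDD is itself costly.
Recommended storage levels:
- MEMORY_ONLY (default)
- MEMORY_ONLY_SER (serialized; saves memory but costs more CPU)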
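The difference between these levels comes down to a few flags on `StorageLevel`. The following sketch, assuming Spark is on the classpath, prints those flags for the two recommended levels and for `MEMORY_AND_DISK_SER`, the level used in the DEMO; the object name `StorageLevelFlags` and the `describe` helper are illustrative only.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlags {
  // Summarize the flags that distinguish one storage level from another.
  def describe(level: StorageLevel): String =
    s"useMemory=${level.useMemory}, useDisk=${level.useDisk}, deserialized=${level.deserialized}"

  def main(args: Array[String]): Unit = {
    // Deserialized objects in memory, nothing on disk.
    println("MEMORY_ONLY:         " + describe(StorageLevel.MEMORY_ONLY))
    // Serialized bytes in memory, nothing on disk.
    println("MEMORY_ONLY_SER:     " + describe(StorageLevel.MEMORY_ONLY_SER))
    // Serialized bytes in memory, with disk also enabled.
    println("MEMORY_AND_DISK_SER: " + describe(StorageLevel.MEMORY_AND_DISK_SER))
  }
}
```

No `SparkContext` is needed here, because the flags are properties of the level itself rather than of any particular RDD.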
DEMO
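// Mark the RDD for persistence; nothing is stored until the first action runs.
val value: RDD[Double] = sc.parallelize(Array(1.1, -2.2, 3.3)).persist(StorageLevels.MEMORY_AND_DISK_SER)
value.foreach(println) // first action: computes the partitions and stores them
Thread.sleep(30 * 1000) // pause so the persisted RDD can be inspected in the UI
value.unpersist() // remove the stored blocks
Thread.sleep(30 * 1000) // pause again to confirm the RDD is gone from the UI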
After persisting, the UI shows:
After unpersisting: