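This post uses the Spark 2.4.0 source code to analyze the difference between the cache() and persist() operators.
First, here is the source of the cache() operator: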
def cache(self):
    """
    Persist this RDD with the default storage level (`MEMORY_ONLY`).
    """
    self.is_cached = True
    self.persist(StorageLevel.MEMORY_ONLY)
    return self
cache() simply delegates to persist(), which does not tell us much on its own, so we need to look at the source of persist():
def persist(self, storageLevel=StorageLevel.MEMORY_ONLY):
    """
    Set this RDD's storage level to persist its values across operations
    after the first time it is computed. This can only be used to assign
    a new storage level if the RDD does not have a storage level set yet.
    If no storage level is specified defaults to (`MEMORY_ONLY`).
    >>> rdd = sc.parallelize(["b", "a", "c"])
    >>> rdd.persist().is_cached
    True
    """
    self.is_cached = True
    javaStorageLevel = self.ctx._getJavaStorageLevel(storageLevel)
    self._jrdd.persist(javaStorageLevel)
    return self
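So cache() internally calls persist(StorageLevel.MEMORY_ONLY). persist() takes a StorageLevel parameter, which specifies the storage (cache) level.
In other words, cache() always uses the default level MEMORY_ONLY, while persist() lets the caller choose the storage level.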
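The delegation can be reproduced without a Spark installation. The following is only a toy model under stated assumptions: `ToyRDD` and the string constants are hypothetical stand-ins, not PySpark classes, and the "level already set" check imitates the rule quoted in the persist() docstring (in real Spark that rule is enforced on the JVM side, not in this Python method).

```python
# Hypothetical stand-ins for StorageLevel constants (plain strings for brevity).
MEMORY_ONLY = "MEMORY_ONLY"
MEMORY_AND_DISK = "MEMORY_AND_DISK"


class ToyRDD:
    """Toy stand-in for an RDD; it holds no data, only the caching state."""

    def __init__(self):
        self.is_cached = False
        self.storage_level = None

    def persist(self, storage_level=MEMORY_ONLY):
        # Imitates the docstring rule: a level can only be assigned
        # if no different level has been set yet.
        if self.storage_level is not None and storage_level != self.storage_level:
            raise ValueError("storage level already set to " + self.storage_level)
        self.is_cached = True
        self.storage_level = storage_level
        return self

    def cache(self):
        # cache() is persist() with the level fixed to MEMORY_ONLY.
        return self.persist(MEMORY_ONLY)


a = ToyRDD().cache()                   # fixed level
b = ToyRDD().persist()                 # same default level
c = ToyRDD().persist(MEMORY_AND_DISK)  # caller-chosen level
print(a.storage_level, b.storage_level, c.storage_level)
```

With a real RDD the calls look the same: rdd.cache() for the default, rdd.persist(StorageLevel.MEMORY_AND_DISK) when a different level is wanted.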
To go one step further, we can see which storage levels an RDD supports by looking at the StorageLevel source:
StorageLevel.DISK_ONLY = StorageLevel(True, False, False, False)
StorageLevel.DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
StorageLevel.MEMORY_ONLY = StorageLevel(False, True, False, False)
StorageLevel.MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
StorageLevel.MEMORY_AND_DISK = StorageLevel(True, True, False, False)
StorageLevel.MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
StorageLevel.OFF_HEAP = StorageLevel(True, True, True, False, 1)
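The positional arguments above are easier to read once they are named. In PySpark the constructor order is StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1). The sketch below mirrors that order with a plain namedtuple so it runs without Spark; `Level`, `LEVELS` and `describe` are names made up for this illustration.

```python
from collections import namedtuple

# Same field order as PySpark's StorageLevel constructor; replication defaults to 1.
Level = namedtuple(
    "Level",
    ["use_disk", "use_memory", "use_off_heap", "deserialized", "replication"],
    defaults=[1],
)

# The constants listed above, copied flag for flag.
LEVELS = {
    "DISK_ONLY": Level(True, False, False, False),
    "DISK_ONLY_2": Level(True, False, False, False, 2),
    "MEMORY_ONLY": Level(False, True, False, False),
    "MEMORY_ONLY_2": Level(False, True, False, False, 2),
    "MEMORY_AND_DISK": Level(True, True, False, False),
    "MEMORY_AND_DISK_2": Level(True, True, False, False, 2),
    "OFF_HEAP": Level(True, True, True, False, 1),
}


def describe(level):
    """Turn the boolean flags into a readable summary."""
    parts = []
    if level.use_disk:
        parts.append("disk")
    if level.use_memory:
        parts.append("memory")
    if level.use_off_heap:
        parts.append("off-heap")
    parts.append("deserialized" if level.deserialized else "serialized")
    parts.append("%dx replicated" % level.replication)
    return ", ".join(parts)


for name, level in LEVELS.items():
    print("%-18s %s" % (name, describe(level)))
```

Two things stand out: the _2 variants differ only in replication (each partition is stored on two nodes), and every Python-side constant has deserialized=False, since data coming from Python is stored in serialized form.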