Official docs: RDD Persistence
http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
1. Introduction to RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
scala> val lines = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> lines.cache
At this point, open the Spark UI: no Job has been produced, and nothing persisted shows up under the Storage tab.
scala> lines.collect
res1: Array[String] = Array(hello spark, hello mr, hello yarn, hello hive, hello spark)
A Job is produced only when an action is encountered; checking the Storage tab afterwards shows that the file data has been persisted, 100% of it in memory. This shows that cache() is lazy, even though it is not a transformation operator.
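cache() is just shorthand for the default storage level; to choose a different one, call persist() with an explicit StorageLevel. A minimal spark-shell sketch, reusing the same test file as above (MEMORY_AND_DISK is just an illustrative choice here):

scala> import org.apache.spark.storage.StorageLevel
scala> val lines2 = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> lines2.persist(StorageLevel.MEMORY_AND_DISK)
scala> lines2.count           // first action materializes the cached partitions
scala> lines2.getStorageLevel // returns the level assigned above
scala> lines2.unpersist()     // manually evicts the cached data

Note that a storage level can only be assigned once: calling persist() with a different level on an already-persisted RDD throws an UnsupportedOperationException.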
2. The persist() and cache() Operators
2.1 The Underlying Source Code of cache()
Looking at the cache() source, cache() delegates to persist() and uses the default storage level (MEMORY_ONLY):
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
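For comparison, the no-argument persist() that cache() calls simply forwards to persist(StorageLevel.MEMORY_ONLY). The snippet below follows the Spark source (RDD.scala); exact comments may differ slightly between versions:

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)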