CheckPoint, Cache, and Persist in Spark
Hi everyone, I'm the tough guy who can smash an A-pillar with a single punch.
These past few days I've been watching a video course, 《尚硅谷2021迎新版大数据Spark从入门到精通》 (Shang Silicon Valley's 2021 "Big Data Spark from Beginner to Expert"), which covers checkpoints (CheckPoint), so I'd like to recap the topic here in written form. The order will be: Spark's own description of persistence, how to use Cache, how to use Persist, and how to use CheckPoint, explaining the relationships among the three along the way.
1. Spark's description of persistence
On the Spark website, everything I could find about RDD persistence is the following:
RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), or replicate it across nodes. These levels are set by passing a `StorageLevel` object (Scala, Java, Python) to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The full set of storage levels is:

| Storage Level | Meaning |
| --- | --- |
| MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
| MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. |
| MEMORY_ONLY_SER (Java and Scala) | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. |
| MEMORY_AND_DISK_SER (Java and Scala) | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. |
| DISK_ONLY | Store the RDD partitions only on disk. |
| MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes. |
| OFF_HEAP (experimental) | Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled. |

Note: In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, and `DISK_ONLY_2`.

Spark also automatically persists some intermediate data in shuffle operations (e.g. `reduceByKey`), even without users calling `persist`. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call `persist` on the resulting RDD if they plan to reuse it.
The gist of this passage can be summarized as follows:

- An RDD can keep the data of all the partitions it computes so that it can be reused in later computations.
- You can use cache() and persist() to do this.
- The data can be stored in a node's memory, on its disk, or replicated across nodes (in memory or on disk); in total, Spark offers developers seven storage levels.

Note: RDDs are fault-tolerant datasets, and this shows in persistence as well: if data is lost or corrupted during persistence, it can be recomputed along the RDD's lineage and persisted again.
I'm not sure whether I simply couldn't find more or whether Spark put the rest into the RDD API docs, but the above is what my digging for persistence turned up. From this passage you can see that RDD persistence already handles storage levels, error correction, and data loss well, so from here on we will mainly focus on efficiency.
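As a quick illustration of the points above, here is a minimal sketch of my own (not from the official docs); it assumes an existing SparkContext named `sc`, and the variable names are made up. The key point is that `persist()`/`cache()` only mark the RDD: the data is actually stored the first time an action computes it, and later actions reuse it.

```scala
import org.apache.spark.storage.StorageLevel

// Assumption: sc is an existing SparkContext
val nums = sc.makeRDD(1 to 1000)
val squares = nums.map(n => n * n)

squares.cache()                                   // mark with the default level, MEMORY_ONLY
// squares.persist(StorageLevel.MEMORY_AND_DISK)  // or pick a level explicitly instead

squares.count() // first action: computes the partitions and stores them
squares.sum()   // later actions read the stored partitions instead of recomputing
```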
2. How to use Cache
The word "cache" refers to a high-speed buffer, which here effectively means memory. So this method caches the data in memory (note: there is no shuffle here; each node caches the data of its own partitions in its own memory). Below is a WordCount example using cache():
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CacheTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("cache_test")
    val sc = new SparkContext(conf)
    // Read the data
    val data: RDD[String] = sc.makeRDD(Array("hello spark", "fuck word"))
    val flatMap: RDD[String] = data.flatMap(_.split(" "))
    val maped: RDD[(String, Int)] = flatMap.map(line => {
      println("-------")
      (line, 1)
    })
    // maped.cache()
    val reduced: RDD[(String, Int)] = maped.reduceByKey(_ + _)
    reduced.collect().foreach(println)
    val reduced2: RDD[(String, Int)] = maped.reduceByKey(_ - _)
    reduced2.collect().foreach(println)
    sc.stop()
  }
}
```
In this WordCount example I print a marker in the map stage, so there will be one line of `-------` for every word produced by the split. Without cache() the output looks like this:
```
-------
-------
-------
-------
(hello,1)
(word,1)
(spark,1)
(fuck,1)
-------
-------
-------
-------
(hello,1)
(word,1)
(spark,1)
(fuck,1)
```

(The two results are identical because no word appears twice.)
You can see two blocks of `-------` were printed. In theory one block should be enough, since the RDD is reused and the earlier steps shouldn't need to run again. But **note that an RDD does not store data; it only defines how the data is to be processed.** So the action on reduced2 forces the whole chain of earlier transformations to be executed again, which obviously hurts efficiency. Uncomment the cache() line and look at the output again:
```
-------
-------
-------
-------
(hello,1)
(word,1)
(spark,1)
(fuck,1)
(hello,1)
(word,1)
(spark,1)
(fuck,1)
```
There is only one block of prints, which means the RDD was successfully cached in memory, so the earlier steps of reading and transforming the data don't have to be repeated, and the program becomes more efficient.

Question: is cache only needed when an RDD is reused?

Answer: not really. Data can be lost or corrupted during computation, so when the computation chain (the lineage) is very long, using cache reduces the time spent re-walking that chain and can also improve performance in some situations, as the sketch below shows.
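Here is a hypothetical sketch of a longer chain (the file name `data.txt` and the variable names are made up for illustration): caching an intermediate RDD means later jobs read the cached partitions instead of re-running the whole chain from the source, and if some cached partitions are lost, only those partitions are recomputed through the lineage.

```scala
// Assumption: sc is an existing SparkContext and "data.txt" is some input file
val raw = sc.textFile("data.txt")
val words = raw
  .filter(_.nonEmpty)
  .map(_.toLowerCase)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

words.cache() // keep the cleaned-up intermediate result around

val counts = words.reduceByKey(_ + _) // first action builds and caches `words`
counts.collect()
val vocab = words.keys.distinct()     // second action reuses the cached partitions
vocab.count()
```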
3. How to use Persist
Now that we've covered Cache, let's look at Persist. Clicking into the source of the cache method shows:
```scala
/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def cache(): this.type = persist()
```
So cache is simply persist called with the default storage level, `MEMORY_ONLY`. Persist is therefore used in much the same way as Cache, except that it supports many more storage levels:
```scala
maped.persist(StorageLevel.MEMORY_ONLY)
```
You just pass the storage level you want; of course, writing to memory and writing to disk differ in speed, so it depends on your requirements.
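Here is a small sketch of my own (not from the original post) for switching levels: `StorageLevel` has to be imported from `org.apache.spark.storage`, and an RDD that already has a level must be unpersisted before you can assign a different one.

```scala
import org.apache.spark.storage.StorageLevel

// maped is the (String, Int) RDD from the WordCount example above
maped.persist(StorageLevel.MEMORY_ONLY)          // equivalent to maped.cache()

// Spark refuses to change the level of an RDD that already has one,
// so release it first before picking a different level.
maped.unpersist()
maped.persist(StorageLevel.MEMORY_AND_DISK_SER)  // spill serialized partitions to disk when memory runs out

maped.reduceByKey(_ + _).collect()
```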
So this part focuses mainly on the differences between the storage levels:
```scala
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)
}
```
The five StorageLevel constructor parameters indicate, in order: whether to use disk, whether to use memory, whether to use off-heap memory, whether the data is stored deserialized, and the number of replicas.
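A quick way to confirm what those five parameters mean is to look at the corresponding fields of an existing level; a minimal sketch:

```scala
import org.apache.spark.storage.StorageLevel

val level = StorageLevel.MEMORY_AND_DISK_2
println(level.useDisk)      // true  - partitions that don't fit in memory go to disk
println(level.useMemory)    // true  - data is kept in memory
println(level.useOffHeap)   // false - no off-heap storage
println(level.deserialized) // true  - stored as deserialized Java objects
println(level.replication)  // 2     - each partition is replicated on two nodes
```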
For a detailed side-by-side comparison of the storage levels, see the table quoted from the official documentation in Section 1.
On some blogs I've seen people ask: what if an RDD's data is larger than memory?

The question itself isn't quite right: Spark's storage levels are rich enough that for large amounts of data you can choose a MEMORY_AND_DISK level to work around limited memory. Put differently, for an RDD that is too big, plain cache() is not recommended.
4. How to use CheckPoint
Given that we already have cache (or persist), why do we need checkpoint? Let's answer this first and then look at how checkpoint is used. Searching around (via Baidu), I found that many answers quote Tathagata Das:
There is a significant difference between cache and checkpoint. Cache materializes the RDD and keeps it in memory and/or disk (so in TD's view there is no real difference between cache and persist; cache is essentially persist anyway). But the lineage (i.e., the computing chain) of RDD (that is, seq of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the cached RDDs are lost, they can be regenerated. However, checkpoint saves the RDD to an HDFS file and actually forgets the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault tolerant by replication).
I looked up who this person is, found an introduction to him, and also found his profile on Stack Overflow. That's authoritative enough: he is the person in charge of Spark Streaming. Honestly, digging into this kind of thing is pretty fun sometimes.
So, according to Tathagata Das, the key point is that persist (or cache) keeps the lineage between RDDs, while checkpoint cuts the lineage. If you persist an RDD and a failure happens in a later operation, the RDD can still be recovered through the lineage-based fault-tolerance mechanism; a checkpointed RDD can no longer be recomputed from lineage and instead relies on HDFS replication for reliability.

Persist/Cache may indeed write data to files during persistence, but those files are deleted automatically once the application ends (even the ones written to disk), whereas the files produced by CheckPoint are kept and will not disappear unless the user deletes them.
With the relationship between the three clear, the usage of CheckPoint is easy to follow:
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("cache_test")
    val sc = new SparkContext(conf)
    // Set the checkpoint directory
    sc.setCheckpointDir("ckp")
    // Read the data
    val data: RDD[String] = sc.makeRDD(Array("hello spark", "fuck word"))
    val flatMap: RDD[String] = data.flatMap(_.split(" "))
    val maped: RDD[(String, Int)] = flatMap.map(line => {
      println("-------")
      (line, 1)
    })
    // Write the RDD to disk
    maped.checkpoint()
    val reduced: RDD[(String, Int)] = maped.reduceByKey(_ + _)
    reduced.collect().foreach(println)
    val reduced2: RDD[(String, Int)] = maped.reduceByKey(_ - _)
    reduced2.collect().foreach(println)
    sc.stop()
  }
}
```
The printed output is the same as in the cache example, so I won't paste it again.
Running with CheckPoint automatically creates a ckp directory under the project root, the data of every partition is written to disk, and the RDD's lineage is cut off.
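If you want to see the lineage being cut with your own eyes, one way (a minimal sketch of my own, based on the example above) is to print the RDD's `toDebugString` before the first action and again after the checkpoint files have been written; the second printout should no longer start from the original makeRDD/flatMap/map chain but from the checkpointed data.

```scala
// maped is the RDD from the checkpoint example above; checkpoint() has already been called on it
println(maped.toDebugString)        // full lineage: makeRDD -> flatMap -> map ...

maped.reduceByKey(_ + _).collect()  // the first action also triggers writing the checkpoint files

println(maped.toDebugString)        // lineage now starts from the checkpoint data under "ckp"
```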
Summary
That covers the usage, differences, and purpose of cache, persist, and checkPoint in Spark. In short, Cache is just Persist with the default level, while Persist offers multiple storage levels covering memory, disk, and replication. Persist does not cut the RDD lineage, so when data goes wrong it can be recomputed through the lineage, and the files it writes to disk are removed automatically when the program ends. CheckPoint, on the other hand, cuts the lineage and keeps the persisted data permanently as files on disk.
We also answered two questions along the way:

- Is cache only needed when an RDD is reused?
- What should you do when an RDD's data is larger than memory?