Official docs: RDD Persistence
http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
1. Introduction to RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
scala> val lines = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> lines.cache
At this point, open the Spark UI: no Job has been produced, and nothing persisted shows up under the Storage tab.
scala> lines.collect
res1: Array[String] = Array(hello spark, hello mr, hello yarn, hello hive, hello spark)
A Job is produced only when an action is encountered; checking the Storage tab afterwards shows that the file data has been persisted, 100% of it in memory. This shows that cache() is lazy, even though it is not a transformation operator.
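cache() is just shorthand for the default storage level; to choose a different one, call persist() with an explicit StorageLevel. A minimal spark-shell sketch, reusing the same test file as above (MEMORY_AND_DISK is just an illustrative choice here):

scala> import org.apache.spark.storage.StorageLevel
scala> val lines2 = sc.textFile("hdfs://192.168.137.130:9000/test.txt")
scala> lines2.persist(StorageLevel.MEMORY_AND_DISK)
scala> lines2.count           // first action materializes the cached partitions
scala> lines2.getStorageLevel // returns the level assigned above
scala> lines2.unpersist()     // manually evicts the cached data

Note that a storage level can only be assigned once: calling persist() with a different level on an already-persisted RDD throws an UnsupportedOperationException.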
2. The persist() and cache() Operators
2.1 The Underlying Source Code of cache()
Looking at the cache() source, cache() delegates to persist() and uses the default storage level (MEMORY_ONLY):
/**
* Persist this RDD with the default storage level (`MEMORY_ONLY`).
*/
def cache(): this.type = persist()
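For comparison, the no-argument persist() that cache() calls simply forwards to persist(StorageLevel.MEMORY_ONLY). The snippet below follows the Spark source (RDD.scala); exact comments may differ slightly between versions:

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)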