SparkCore: RDD Persistence Operations

Latest recommended article published on 2024-05-23 10:59:51

五块兰州拉面

Category: spark. Tags: spark

Article link: https://blog.csdn.net/qq_39331255/article/details/111413231

What is persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
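To make the reuse concrete, here is a minimal sketch that counts how often a map function actually runs across two actions. The object name CacheReuseSketch, the accumulator name, and the tiny input are assumptions made for illustration; the sketch assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of Spark.
object CacheReuseSketch {
  // Runs two actions on the same mapped RDD and returns how many times
  // the map function was evaluated in total.
  def evaluations(sc: SparkContext, cached: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val squares = sc.parallelize(1 to 5).map { x => calls.add(1L); x * x }
    if (cached) squares.cache()   // only marks the RDD; nothing is stored yet
    squares.count()               // first action: computes (and stores, if cached)
    squares.reduce(_ + _)         // second action: reuses or recomputes the partitions
    if (cached) squares.unpersist()
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CacheReuseSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println("evaluations with cache:    " + evaluations(sc, cached = true))
      println("evaluations without cache: " + evaluations(sc, cached = false))
    } finally sc.stop()
  }
}
```

With this small input in local mode, the cached run should evaluate the function once per element and the uncached run twice per element. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so treat the counter as a diagnostic rather than an exact measurement.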

How to persist an RDD

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
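As a rough illustration of the marking behaviour, the sketch below records the storage level of an RDD before it is marked, after cache(), and after unpersist(). The object name PersistMarkSketch and the sample data are assumptions; the sketch assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object, not part of Spark.
object PersistMarkSketch {
  // Returns the storage level seen before marking, after cache(), and after unpersist().
  def levels(sc: SparkContext): Seq[StorageLevel] = {
    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val before = words.getStorageLevel   // StorageLevel.NONE: not marked
    words.cache()                        // marks the RDD; no job runs here
    val marked = words.getStorageLevel   // StorageLevel.MEMORY_ONLY
    words.count()                        // first action computes and stores the partitions
    words.distinct().count()             // a derived RDD can read the stored partitions
    words.unpersist()                    // drops the stored partitions and clears the mark
    Seq(before, marked, words.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PersistMarkSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try levels(sc).foreach(println)
    finally sc.stop()
  }
}
```

The point of the sketch is that cache() and persist() only change the RDD's storage level; the data is stored when the first action runs.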

Storage levels

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).

You can specify the storage level by passing a StorageLevel object to persist(), e.g. StorageLevel.MEMORY_ONLY.
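The storage levels differ only in a handful of flags, which can be inspected directly. The following sketch prints those flags for the levels discussed in this article plus one replicated variant. The object name StorageLevelFlags and the describe helper are assumptions made for illustration; it needs only spark-core on the classpath, not a running SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object, not part of Spark.
object StorageLevelFlags {
  // Levels named in this article, plus one replicated (_2) variant.
  val levels: Seq[(String, StorageLevel)] = Seq(
    "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
    "MEMORY_ONLY_SER"     -> StorageLevel.MEMORY_ONLY_SER,
    "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
    "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
    "DISK_ONLY"           -> StorageLevel.DISK_ONLY,
    "MEMORY_ONLY_2"       -> StorageLevel.MEMORY_ONLY_2
  )

  // Summarises where data is kept, in which form, and on how many nodes.
  def describe(level: StorageLevel): String =
    s"memory=${level.useMemory} disk=${level.useDisk} " +
      s"deserialized=${level.deserialized} replication=${level.replication}"

  def main(args: Array[String]): Unit =
    levels.foreach { case (name, level) => println(f"$name%-20s ${describe(level)}") }
}
```

To apply one of these levels, pass it to persist(), for example rdd.persist(StorageLevel.MEMORY_ONLY_SER).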

How to choose a suitable storage level

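Start with the default, MEMORY_ONLY: it performs best but needs the most memory. If the data does not fit, fall back to MEMORY_ONLY_SER, which is still fast; its main extra cost compared with MEMORY_ONLY is serialization and deserialization. If memory is still insufficient, skip MEMORY_AND_DISK and go straight to MEMORY_AND_DISK_SER: at this point the dataset is clearly large, and since in-memory computation is what drives performance, as much of the data as possible should stay in memory, which the more compact serialized form allows. Avoid DISK_ONLY. Use the replicated levels (those ending in _2) only when the data must be highly available and recovering from a replica takes less time than recomputing the lost partitions; otherwise they are generally not used.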
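The rule of thumb above can be written down as a small decision function. This is only a sketch: StorageLevelChooser and its three boolean parameters are hypothetical, not part of Spark, and whether the data "fits" in memory is something you have to estimate yourself.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper that encodes the rule of thumb from this article.
object StorageLevelChooser {
  // fitsDeserialized: the RDD fits in memory as plain objects
  // fitsSerialized:   the RDD fits in memory once serialized
  // replicated:       high availability is required and recovering from a
  //                   replica is cheaper than recomputing lost partitions
  def choose(fitsDeserialized: Boolean,
             fitsSerialized: Boolean,
             replicated: Boolean = false): StorageLevel =
    if (fitsDeserialized) {
      if (replicated) StorageLevel.MEMORY_ONLY_2 else StorageLevel.MEMORY_ONLY
    } else if (fitsSerialized) {
      if (replicated) StorageLevel.MEMORY_ONLY_SER_2 else StorageLevel.MEMORY_ONLY_SER
    } else {
      // Skips MEMORY_AND_DISK on purpose and never returns DISK_ONLY.
      if (replicated) StorageLevel.MEMORY_AND_DISK_SER_2 else StorageLevel.MEMORY_AND_DISK_SER
    }
}
```

A call such as rdd.persist(StorageLevelChooser.choose(fitsDeserialized = false, fitsSerialized = true)) would then select MEMORY_ONLY_SER.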

Performance comparison: with and without persistence

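import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object Demo4 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Demo2")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Read external data and time a count without persistence
    var start = System.currentTimeMillis()
    val lines = sc.textFile("C:\\Users\\70201\\Desktop\\wordcount.txt")
    var count = lines.count()
    println("Without persistence: #######lines' count: " + count + ", cost time: " + (System.currentTimeMillis() - start) + "ms")
    lines.persist(StorageLevel.DISK_ONLY) // or lines.cache()
    // persist() only marks the RDD; this action computes and stores the partitions
    lines.count()
    // Time a count that reads the persisted partitions
    start = System.currentTimeMillis()
    count = lines.count()
    println("With persistence: #######lines' count: " + count + ", cost time: " + (System.currentTimeMillis() - start) + "ms")
    lines.unpersist() // release the persisted data
    sc.stop()
  }
}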