Spark: persist

persist
/**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
  • By default, an RDD is recomputed from its lineage every time an action is called on it. This can be avoided by persisting the RDD: after persist() is called, the first computation stores each partition's result, and later actions reuse those partitions instead of recomputing them.
  • The cache mechanism is used to speed up applications that access the same RDDs several times.
  • cache is a synonym for persist, specifically persist(MEMORY_ONLY); that is, cache is nothing but persist with the default storage level MEMORY_ONLY (see the sketch below).
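
A minimal sketch of both calls, assuming a local SparkContext and a hypothetical input file data.txt:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("persist-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val lines = sc.textFile("data.txt")   // hypothetical input path
val words = lines.flatMap(_.split(" "))

words.cache()   // equivalent to words.persist(StorageLevel.MEMORY_ONLY)
// words.persist(StorageLevel.MEMORY_AND_DISK)  // explicit level instead of cache()

println(words.count())             // first action: computes words and caches its partitions
println(words.distinct().count())  // second action: reads the cached partitions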
When to use persist

In short, any RDD that will be reused is a candidate for persist.
The cache mechanism is useful in the following situations.

  • When we reuse an RDD in iterative machine learning applications (as sketched after this list)
  • When we reuse an RDD in standalone Spark applications
  • When RDD computations are expensive; caching also reduces the cost of recovery if an executor fails
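
For the iterative case, a hedged sketch (assuming an existing SparkContext sc; the file name points.txt and the loop details are made up):

import org.apache.spark.storage.StorageLevel

val points = sc.textFile("points.txt")
  .map(_.split(",").map(_.toDouble))
  .persist(StorageLevel.MEMORY_AND_DISK)

var threshold = 0.0
for (i <- 1 to 10) {
  // each iteration runs an action over the same RDD; without persist,
  // the textFile + map lineage would be recomputed on every pass
  val above = points.filter(p => p.sum > threshold).count()
  println(s"iteration $i: $above points with sum > $threshold")
  threshold += 1.0
}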
Need for a Persistence Mechanism

Persistence allows us to use the same RDD multiple times in Apache Spark. Without it, every reuse of an RDD means that calling an action re-evaluates the RDD's entire lineage, which consumes both time and memory. Iterative algorithms, which pass over the same data many times, make this especially expensive. The persistence techniques were introduced to overcome this repeated computation.
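
To make the repeated computation visible, a small sketch (again assuming an existing SparkContext sc) can count how often a map function actually runs:

val evals = sc.longAccumulator("map-evaluations")
val doubled = sc.parallelize(1 to 1000).map { n => evals.add(1); n * 2 }

doubled.count()
doubled.count()
println(evals.value)  // 2000: without persist, the map ran for every element on both actions

// with doubled.persist() before the first count, the second action would read
// the cached partitions and the accumulator would stop at 1000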

How to unpersist an RDD in Spark

When the cached data outgrows the available memory, Spark automatically evicts old blocks. This follows an LRU (Least Recently Used) policy: the blocks that have gone longest without being accessed are dropped first. So eviction either happens automatically, or we can remove a cached RDD ourselves with the RDD.unpersist() method.
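
A minimal sketch, reusing the hypothetical doubled RDD from above:

doubled.persist()          // cache the RDD (MEMORY_ONLY by default)
doubled.count()            // materializes the cached partitions
doubled.unpersist()        // marks the RDD as non-persistent and removes its blocks
// doubled.unpersist(blocking = true)  // variant that waits until all blocks are removed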

Conclusion

Hence, Spark RDD persistence and caching are optimization techniques that store the results of RDD evaluation. Saving these results lets upcoming stages reuse them instead of recomputing, and the persisted RDDs can be stored in memory, on disk, or both.

Memory tuning is a very important part of Spark and can significantly improve the performance of a Spark application. Below are some memory-tuning techniques.

1. Adjust the heap size. Spark's default heap size is 1GB, which may not suit every workload. If your application needs more memory, increase the heap via the --driver-memory and --executor-memory options.
2. Adjust the memory fractions. The spark.memory.fraction setting controls the share of the JVM heap that Spark uses for execution and storage (the remainder is left for user data structures); the default of about 0.6 usually gives the best results.
3. Enable compression. Spark can compress internal data such as shuffle output and serialized RDD blocks to reduce memory and I/O; the codec is selected with spark.io.compression.codec, and Snappy (or the default lz4) usually performs well.
4. Use persistence. Spark can keep an RDD in memory for fast access via cache() or persist(). If your application accesses the same dataset frequently, persisting the RDD can significantly improve performance.
5. Increase parallelism. With more parallelism, tasks are spread across more executors, reducing the load on each and improving overall performance. Raise it via spark.default.parallelism.

These are some of the techniques for Spark memory tuning; choose the ones that fit your specific situation.
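
A hedged sketch of how most of these settings could be wired into a SparkConf (the values are illustrative, not recommendations; --driver-memory is normally passed to spark-submit rather than set in code):

import org.apache.spark.SparkConf

val tuned = new SparkConf()
  .setAppName("memory-tuning-demo")
  .set("spark.executor.memory", "4g")           // tip 1: larger executor heap
  .set("spark.memory.fraction", "0.6")          // tip 2: execution + storage share of the heap
  .set("spark.io.compression.codec", "snappy")  // tip 3: codec for shuffle and serialized blocks
  .set("spark.default.parallelism", "200")      // tip 5: more partitions for wide operations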