Learning Java/Scala from the Spark source, part 1: Java references, a reading of the ContextCleaner source

zhenhailiu

Tags: big data, spark, java, references

Article link: https://blog.csdn.net/zhenhailiu/article/details/90368291

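Spark ContextCleaner: reading the source

This article uses the Spark source code to deepen our understanding of Java references.

Java references

Java has four kinds of references: strong, soft, weak, and phantom.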

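Strong references

As the snippet below shows, an ordinary Java variable that holds an object is a strong reference to that object. As long as an object is reachable through a strong reference, it will not be garbage collected.

String handle = new String("a string");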

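Soft references

A soft reference object refers to another object, its referent. A referent that is reachable only through soft references, with no strong reference left, may be reclaimed when memory runs low; the JVM guarantees that all such referents are cleared before it throws an OutOfMemoryError. If the referent has not been reclaimed, get() returns it; otherwise get() returns null.

Note that a soft reference object is itself an object, and the SoftReference class can be subclassed. The soft reference object holds a soft reference to its referent, while the variable that holds the soft reference object is a strong reference to the reference object itself.

import java.lang.ref.SoftReference;

String handle = new String("a string");
SoftReference<String> sfh = new SoftReference<String>(handle);
assert sfh.get() != null;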

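Weak references

A weak reference object refers to another object, its referent. A referent that is reachable only through weak references, with no strong or soft reference left, is reclaimed at the next garbage collection. If the referent has not been reclaimed, get() returns it; otherwise get() returns null.

Note that a weak reference object is itself an object, and the WeakReference class can be subclassed; ThreadLocal is a good example of how weak references are used. The weak reference object holds a weak reference to its referent, while the variable that holds the weak reference object is a strong reference to the reference object itself.

import java.lang.ref.WeakReference;

String handle = new String("a string");
WeakReference<String> wfh = new WeakReference<String>(handle);
assert wfh.get() != null;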

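Phantom references

A phantom reference object also refers to another object, but it has no effect on the referent's lifetime. Whether or not the referent has been reclaimed, get() on a phantom reference always returns null. Phantom references are used together with a ReferenceQueue to notify the program that an object has been reclaimed, so that the program can do follow-up cleanup work.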

String handle=new String("a string");
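
The snippet above stops before any phantom reference is created. As a minimal sketch of the behavior described in this section (the class name PhantomDemo and its method are illustrative and not part of the original article), the following shows that get() returns null even while the referent is still strongly reachable, and that nothing has been queued at that point:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

final class PhantomDemo {
    /** Returns true if get() is null and the queue is empty while the referent is still alive. */
    static boolean getIsAlwaysNull() {
        String handle = new String("a string");
        ReferenceQueue<String> queue = new ReferenceQueue<>();
        // A phantom reference must be created with a queue; it is useless without one.
        PhantomReference<String> pfh = new PhantomReference<>(handle, queue);
        // handle is still strongly reachable, yet get() reports null and nothing is queued.
        boolean result = pfh.get() == null && queue.poll() == null;
        // Use handle after the checks so it is definitely live while they run.
        return result && handle.length() == 8;
    }

    public static void main(String[] args) {
        System.out.println(getIsAlwaysNull());
    }
}
```

Because get() never exposes the referent, the only useful signal a phantom reference gives is its appearance on the queue.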

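References and ReferenceQueue

Soft, weak, and phantom references can each be associated with a ReferenceQueue. When the referent of such a reference object is reclaimed, the garbage collector puts the reference object on the associated ReferenceQueue.

Note that soft, weak, and phantom reference objects are themselves objects with their own lifetimes. If a reference object is collected before its referent, then nothing is put on the ReferenceQueue when the referent is later reclaimed, because the reference object no longer exists.

The following test program shows what happens when a reference object is collected before its referent.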

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.LinkedList;

public class TestRef {
    static Weak<TestRef> testWeak2(TestRef tc, ReferenceQueue<TestRef> rq, LinkedList<Weak<TestRef>> wq) {
        TestRef tc2 = new TestRef();
        // The referent (tc2) becomes unreachable before the reference object does.
        Weak<TestRef> w1 = new Weak<TestRef>(tc2, "test2", rq);
        // The reference object (w2) becomes unreachable before its referent (tc).
        Weak<TestRef> w2 = new Weak<TestRef>(tc, "test3", rq);
        return w1;
        // wq.add(w1);
    }

    static void testWeak1(ReferenceQueue<TestRef> rq, LinkedList<Weak<TestRef>> wq) throws InterruptedException {
        TestRef tc = new TestRef();
        // Change the next line to Weak<TestRef> w = testWeak2(tc, rq, wq);
        // to see a different result.
        testWeak2(tc, rq, wq);
        // System.gc() is only a hint to the JVM, so wait a while and hope a collection ran.
        System.gc();
        Thread.sleep(1000);
        Weak<TestRef> w1 = new Weak<TestRef>(tc, "weak1", rq);
        Weak<TestRef> w2 = new Weak<TestRef>(tc, "weak2", rq);
        Weak<TestRef> w3 = new Weak<TestRef>(tc, "weak3", rq);
        Weak<TestRef> w4 = new Weak<TestRef>(tc, "weak4", rq);
        // When this method returns, every weak reference except w2 and w3 becomes unreachable.
        wq.add(w2);
        wq.add(w3);
    }

    // A weak reference that carries a name so we can tell which one was enqueued.
    static class Weak<T> extends WeakReference<T> {
        String name;

        public Weak(T referent, String name, ReferenceQueue<T> rq) {
            super(referent, rq);
            this.name = name;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<TestRef> rq = new ReferenceQueue<TestRef>();
        LinkedList<Weak<TestRef>> wq = new LinkedList<Weak<TestRef>>();
        testWeak1(rq, wq);
        // System.gc() is only a hint to the JVM, so wait a while and hope a collection ran.
        System.gc();
        Thread.sleep(1000);
        Weak<TestRef> wr = (Weak<TestRef>) rq.poll();
        System.out.println("print weakref");
        while (wr != null) {
            System.out.println(wr.name);
            wr = (Weak<TestRef>) rq.poll();
        }
    }
}

Running this program prints:

print weakref
weak2
weak3

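If line 20 of the listing (the bare call to testWeak2 inside testWeak1) is changed to Weak<TestRef> w = testWeak2(tc, rq, wq); the reference object named test2 stays strongly reachable through w until its referent has been collected, so it is enqueued as well and the output becomes: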

print weakref
weak2
weak3
test2

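Use cases for each kind of reference

The referent of a soft reference is reclaimed when the process runs low on memory. This makes soft references a good fit for caches: cached objects are reclaimed automatically when memory is tight.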
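
As a minimal sketch of such a cache (the class SoftCache and its loader function are hypothetical names chosen for illustration, not taken from Spark or the JDK), each value is held only through a SoftReference and is reloaded if the garbage collector has cleared it:

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SoftCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical: recomputes a value on a cache miss

    SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        SoftReference<V> ref = map.get(key);
        // ref.get() is null if the collector has cleared the soft reference.
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Miss or cleared entry: reload and cache again.
            // Two threads may both reload the same key; that is acceptable for a sketch.
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftCache<Integer, int[]> cache = new SoftCache<Integer, int[]>(n -> new int[n]);
        int[] first = cache.get(4);
        // first is strongly held here, so the cached entry cannot have been cleared.
        System.out.println(first == cache.get(4));
    }
}
```

The caller never sees the SoftReference; it only observes that a value may have to be recomputed after memory pressure. The sketch does not remove cleared entries from the map, which a real cache would also need to do.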

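Weak references can be used to implement a canonical map: once a key no longer exists outside the map, the entry for that key should be removed from the map. ThreadLocal uses weak references in a similar way; see its source for details.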
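
The JDK's java.util.WeakHashMap behaves this way: it holds its keys through weak references and drops an entry once the key is no longer reachable elsewhere. The following is a small sketch of that behavior; the class and method names are illustrative, and the second half of main depends on the garbage collector actually running, so its output is not guaranteed:

```java
import java.util.Map;
import java.util.WeakHashMap;

final class WeakKeyDemo {
    /** Number of entries seen while the key is still strongly reachable. */
    static int entriesWhileKeyHeld() {
        Map<Object, String> extra = new WeakHashMap<>();
        Object key = new Object();
        extra.put(key, "metadata for key");
        int size = extra.size();
        // key is used after size(), so it is strongly reachable during the check.
        return extra.containsKey(key) ? size : -1;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<Object, String> extra = new WeakHashMap<>();
        Object key = new Object();
        extra.put(key, "metadata for key");
        System.out.println("while key is held: " + extra.size());
        key = null;        // drop the only strong reference to the key
        System.gc();       // only a hint; the entry may or may not be gone afterwards
        Thread.sleep(200);
        System.out.println("after dropping key: " + extra.size());
    }
}
```

Note that the value must not hold a strong reference back to its key, otherwise the key stays reachable through the map and the entry is never removed.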

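Weak or phantom references can be combined with a ReferenceQueue to notify the program that some object has been reclaimed, so that it can do follow-up work such as releasing resources. Spark's ContextCleaner does exactly this.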

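Spark ContextCleaner

In Spark, an object on the driver may correspond to resources in the cluster. An RDD, for example, is only a handle: the data behind it (once computed and cached or checkpointed) is spread across the cluster and occupies disk and memory there. When the RDD object is garbage collected, we want the cluster resources it occupies to be released as well.

In C++ this would be easy: following RAII (Resource Acquisition Is Initialization), we could release the resources an object holds in its destructor. Scala and Java have no destructors, however, so Spark's ContextCleaner uses weak references and a ReferenceQueue to release resources when the owning object is collected.

Now let's look at the source.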

package org.apache.spark

import java.lang.ref.{ReferenceQueue, WeakReference}
import java.util.Collections
import java.util.concurrent.{ConcurrentHashMap, ConcurrentLinkedQueue, ScheduledExecutorService, TimeUnit}

import scala.collection.JavaConverters._

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.{RDD, ReliableRDDCheckpointData}
import org.apache.spark.util.{AccumulatorContext, AccumulatorV2, ThreadUtils, Utils}

/**
 * Classes that represent cleaning tasks.
 */
private sealed trait CleanupTask
private case class CleanRDD(rddId: Int) extends CleanupTask
private case class CleanShuffle(shuffleId: Int) extends CleanupTask
private case class CleanBroadcast(broadcastId: Long) extends CleanupTask
private case class CleanAccum(accId: Long) extends CleanupTask
private case class CleanCheckpoint(rddId: Int) extends CleanupTask

/**
 * A WeakReference associated with a CleanupTask.
 *
 * When the referent object becomes only weakly reachable, the corresponding
 * CleanupTaskWeakReference is automatically added to the given reference queue.
 */
private class CleanupTaskWeakReference(
    val task: CleanupTask,
    referent: AnyRef,
    referenceQueue: ReferenceQueue[AnyRef])
  extends WeakReference(referent, referenceQueue)

/**
 * An asynchronous cleaner for RDD, shuffle, and broadcast state.
 *
 * This maintains a weak reference for each RDD, ShuffleDependency, and Broadcast of interest,
 * to be processed when the associated object goes out of scope of the application. Actual
 * cleanup is performed in a separate daemon thread.
 */
private[spark] class ContextCleaner(sc: SparkContext) extends Logging {

  /**
   * A buffer to ensure that `CleanupTaskWeakReference`s are not garbage collected as long as they
   * have not been handled by the reference queue.
   */
  private val referenceBuffer =
    Collections.newSetFromMap[CleanupTaskWeakReference](new ConcurrentHashMap)

  private val referenceQueue = new ReferenceQueue[AnyRef]

  private val listeners = new ConcurrentLinkedQueue[CleanerListener]()

  private val cleaningThread = new Thread() { override def run() { keepCleaning() }}

  private val periodicGCService: ScheduledExecutorService =
    ThreadUtils.newDaemonSingleThreadScheduledExecutor("context-cleaner-periodic-gc")

  /**
   * How often to trigger a garbage collection in this JVM.
   *
   * This context cleaner triggers cleanups only when weak references are garbage collected.
   * In long-running applications with large driver JVMs, where there is little memory pressure
   * on the driver, this may happen very occasionally or not at all. Not cleaning at all may
   * lead to executors running out of disk space after a while.
   */
  private val periodicGCInterval =
    sc.conf.getTimeAsSeconds("spark.cleaner.periodicGC.interval", "30min")

  /**
   * Whether the cleaning thread will block on cleanup tasks (other than shuffle, which
   * is controlled by the `spark.cleaner.referenceTracking.blocking.shuffle` parameter).
   *
   * Due to SPARK-3015, this is set to true by default. This is intended to be only a temporary
   * workaround for the issue, which is ultimately caused by the way the BlockManager endpoints
   * issue inter-dependent blocking RPC messages to each other at high frequencies. This happens,
   * for instance, when the driver performs a GC and cleans up all broadcast blocks that are no
   * longer in scope.
   */
  private val blockOnCleanupTasks = sc.conf.getBoolean(
    "spark.cleaner.referenceTracking.blocking", true)

  /**
   * Whether the cleaning thread will block on shuffle cleanup tasks.
   *
   * When context cleaner is configured to block on every delete request, it can throw timeout
   * exceptions on cleanup of shuffle blocks, as reported in SPARK-3139. To avoid that, this
   * parameter by default disables blocking on shuffle cleanups. Note that this does not affect
   * the cleanup of RDDs and broadcasts. This is intended to be a temporary workaround,
   * until the real RPC issue (referred to in the comment above `blockOnCleanupTasks`) is
   * resolved.
   */
  private val blockOnShuffleCleanupTasks = sc.conf.getBoolean(
    "spark.cleaner.referenceTracking.blocking.shuffle", false)

  @volatile private var stopped = false

  /** Attach a listener object to get information of when objects are cleaned. */
  def attachListener(listener: CleanerListener): Unit = {
    listeners.add(listener)
  }

  /** Start the cleaner. */
  def start(): Unit = {
    cleaningThread.setDaemon(true)
    cleaningThread.setName("Spark Context Cleaner")
    cleaningThread.start()
    periodicGCService.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc()
    }, periodicGCInterval, periodicGCInterval, TimeUnit.SECONDS)
  }

  /**
   * Stop the cleaning thread and wait until the thread has finished running its current task.
   */
  def stop(): Unit = {
    stopped = true
    // Interrupt the cleaning thread, but wait until the current task has finished before
    // doing so. This guards against the race condition where a cleaning thread may
    // potentially clean similarly named variables created by a different SparkContext,
    // resulting in otherwise inexplicable block-not-found exceptions (SPARK-6132).
    synchronized {
      cleaningThread.interrupt()
    }
    cleaningThread.join()
    periodicGCService.shutdown()
  }

  /** Register an RDD for cleanup when it is garbage collected. */
  def registerRDDForCleanup(rdd: RDD[_]): Unit = {
    registerForCleanup(rdd, CleanRDD(rdd.id))
  }

  def registerAccumulatorForCleanup(a: AccumulatorV2[_, _]): Unit = {
    registerForCleanup(a, CleanAccum(a.id))
  }

  /** Register a ShuffleDependency for cleanup when it is garbage collected. */
  def registerShuffleForCleanup(shuffleDependency: ShuffleDependency[_, _, _]): Unit = {
    registerForCleanup(shuffleDependency, CleanShuffle(shuffleDependency.shuffleId))
  }

  /** Register a Broadcast for cleanup when it is garbage collected. */
  def registerBroadcastForCleanup[T](broadcast: Broadcast[T]): Unit = {
    registerForCleanup(broadcast, CleanBroadcast(broadcast.id))
  }

  /** Register a RDDCheckpointData for cleanup when it is garbage collected. */
  def registerRDDCheckpointDataForCleanup[T](rdd: RDD[_], parentId: Int): Unit = {
    registerForCleanup(rdd, CleanCheckpoint(parentId))
  }

  /** Register an object for cleanup. */
  private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
    referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
  }

  /** Keep cleaning RDD, shuffle, and broadcast state. */
  private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
    while (!stopped) {
      try {
        val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
          .map(_.asInstanceOf[CleanupTaskWeakReference])
        // Synchronize here to avoid being interrupted on stop()
        synchronized {
          reference.foreach { ref =>
            logDebug("Got cleaning task " + ref.task)
            referenceBuffer.remove(ref)
            ref.task match {
              case CleanRDD(rddId) =>
                doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
              case CleanShuffle(shuffleId) =>
                doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
              case CleanBroadcast(broadcastId) =>
                doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
              case CleanAccum(accId) =>
                doCleanupAccum(accId, blocking = blockOnCleanupTasks)
              case CleanCheckpoint(rddId) =>
                doCleanCheckpoint(rddId)
            }
          }
        }
      } catch {
        case ie: InterruptedException if stopped => // ignore
        case e: Exception => logError("Error in cleaning thread", e)
      }
    }
  }

  /** Perform RDD cleanup. */
  def doCleanupRDD(rddId: Int, blocking: Boolean): Unit = {
    try {
      logDebug("Cleaning RDD " + rddId)
      sc.unpersistRDD(rddId, blocking)
      listeners.asScala.foreach(_.rddCleaned(rddId))
      logInfo("Cleaned RDD " + rddId)
    } catch {
      case e: Exception => logError("Error cleaning RDD " + rddId, e)
    }
  }

  /** Perform shuffle cleanup. */
  def doCleanupShuffle(shuffleId: Int, blocking: Boolean): Unit = {
    try {
      logDebug("Cleaning shuffle " + shuffleId)
      mapOutputTrackerMaster.unregisterShuffle(shuffleId)
      blockManagerMaster.removeShuffle(shuffleId, blocking)
      listeners.asScala.foreach(_.shuffleCleaned(shuffleId))
      logInfo("Cleaned shuffle " + shuffleId)
    } catch {
      case e: Exception => logError("Error cleaning shuffle " + shuffleId, e)
    }
  }

  /** Perform broadcast cleanup. */
  def doCleanupBroadcast(broadcastId: Long, blocking: Boolean): Unit = {
    try {
      logDebug(s"Cleaning broadcast $broadcastId")
      broadcastManager.unbroadcast(broadcastId, true, blocking)
      listeners.asScala.foreach(_.broadcastCleaned(broadcastId))
      logDebug(s"Cleaned broadcast $broadcastId")
    } catch {
      case e: Exception => logError("Error cleaning broadcast " + broadcastId, e)
    }
  }

  /** Perform accumulator cleanup. */
  def doCleanupAccum(accId: Long, blocking: Boolean): Unit = {
    try {
      logDebug("Cleaning accumulator " + accId)
      AccumulatorContext.remove(accId)
      listeners.asScala.foreach(_.accumCleaned(accId))
      logInfo("Cleaned accumulator " + accId)
    } catch {
      case e: Exception => logError("Error cleaning accumulator " + accId, e)
    }
  }

  /**
   * Clean up checkpoint files written to a reliable storage.
   * Locally checkpointed files are cleaned up separately through RDD cleanups.
   */
  def doCleanCheckpoint(rddId: Int): Unit = {
    try {
      logDebug("Cleaning rdd checkpoint data " + rddId)
      ReliableRDDCheckpointData.cleanCheckpoint(sc, rddId)
      listeners.asScala.foreach(_.checkpointCleaned(rddId))
      logInfo("Cleaned rdd checkpoint data " + rddId)
    }
    catch {
      case e: Exception => logError("Error cleaning rdd checkpoint data " + rddId, e)
    }
  }

  private def blockManagerMaster = sc.env.blockManager.master
  private def broadcastManager = sc.env.broadcastManager
  private def mapOutputTrackerMaster = sc.env.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
}

private object ContextCleaner {
  private val REF_QUEUE_POLL_TIMEOUT = 100
}

/**
 * Listener class used for testing when any item has been cleaned by the Cleaner class.
 */
private[spark] trait CleanerListener {
  def rddCleaned(rddId: Int): Unit
  def shuffleCleaned(shuffleId: Int): Unit
  def broadcastCleaned(broadcastId: Long): Unit
  def accumCleaned(accId: Long): Unit
  def checkpointCleaned(rddId: Long): Unit
}

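The core of ContextCleaner is four objects: referenceBuffer, referenceQueue, cleaningThread, and periodicGCService.

referenceBuffer

referenceBuffer holds the weak references to the tracked objects. Each entry is an instance of CleanupTaskWeakReference, a subclass of WeakReference that carries the information needed to clean up the referent's resources. When an object acquires cluster resources, a weak reference to it is registered in referenceBuffer.

For example, in the code below, the first time an RDD is marked for persisting, a weak reference to that RDD is registered in referenceBuffer. The weak references are kept in referenceBuffer to prevent a reference object from being collected before its referent.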

private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    // If this is the first time this RDD is marked for persisting, register it
    // with the SparkContext for cleanups and accounting. Do this only once.
    if (storageLevel == StorageLevel.NONE) {
      sc.cleaner.foreach(_.registerRDDForCleanup(this))
      sc.persistRDD(this)
    }
    storageLevel = newLevel
    this
  }
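
To make the role of referenceBuffer concrete, here is a minimal Java sketch of the same registration pattern. It is not Spark's implementation: MiniCleaner, TaskRef, and the string task descriptions are hypothetical stand-ins for ContextCleaner, CleanupTaskWeakReference, and CleanupTask, and the queue is drained on demand instead of by a daemon thread.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class MiniCleaner {
    /** A weak reference that remembers what to clean up once its referent is gone. */
    static final class TaskRef extends WeakReference<Object> {
        final String task; // hypothetical task description, e.g. "CleanRDD(7)"

        TaskRef(String task, Object referent, ReferenceQueue<Object> queue) {
            super(referent, queue);
            this.task = task;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Strongly holds every TaskRef so the reference objects themselves survive
    // until they have been taken off the queue.
    private final Set<TaskRef> buffer = ConcurrentHashMap.newKeySet();

    TaskRef register(Object resourceOwner, String task) {
        TaskRef ref = new TaskRef(task, resourceOwner, queue);
        buffer.add(ref);
        return ref;
    }

    /** Drains the queue without blocking and returns the tasks that became due. */
    List<String> drain() {
        List<String> due = new ArrayList<>();
        Reference<?> polled;
        while ((polled = queue.poll()) != null) {
            TaskRef ref = (TaskRef) polled;
            buffer.remove(ref); // handled, so the reference object may now be collected
            due.add(ref.task);
        }
        return due;
    }

    int pending() {
        return buffer.size();
    }

    public static void main(String[] args) {
        MiniCleaner cleaner = new MiniCleaner();
        Object owner = new Object();
        TaskRef ref = cleaner.register(owner, "CleanRDD(7)");
        // Normally the garbage collector enqueues the reference once owner is unreachable.
        // Calling enqueue() by hand simulates that step deterministically.
        ref.enqueue();
        System.out.println(cleaner.drain());
    }
}
```

Without the buffer, a TaskRef that nobody else holds could be collected together with its referent, and the cleanup task would silently be lost, which is the situation the earlier TestRef program demonstrates.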