Spark Source Code: How a Spark Task Acquires Execution Memory

Author: Southwest-

Category: Spark source code. Tags: Spark, memory.

Article link: https://blog.csdn.net/lovetechlovelife/article/details/111026182

Contents

TaskMemoryManager class
MemoryManager.acquireExecutionMemory()
UnifiedMemoryManager.acquireExecutionMemory()
ExecutionMemoryPool.acquireMemory()
- What ExecutionMemoryPool does
Summary
References

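The code a task uses to acquire execution memory all lives in the memory package of the core module (org/apache/spark/memory).

In a Spark application, every task has its own TaskMemoryManager, which requests the memory the task needs and releases the memory it holds.

TaskMemoryManager class

TaskMemoryManager manages the memory allocated to a single task. It is created in the Executor class: launchTask() instantiates a TaskRunner, and the TaskRunner creates the TaskMemoryManager when it runs. How the Executor launches a task is analyzed in a separate article.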

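Most of the complexity in the TaskMemoryManager class comes from encoding off-heap addresses into 64-bit longs.

In off-heap mode, memory can be addressed directly with a 64-bit long.
In on-heap mode, memory is addressed by combining a base object reference with a 64-bit offset within that object.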
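
To make the idea of a single 64-bit address concrete, here is a minimal sketch of packing a page number and an in-page offset into one long. The 13-bit / 51-bit split is an assumption based on my reading of TaskMemoryManager, and the names PageAddressSketch, encode, decodePageNumber and decodeOffset are hypothetical helpers for illustration, not Spark APIs.

```java
// Hypothetical sketch, not Spark code: pack (page number, offset) into one 64-bit long.
final class PageAddressSketch {
    // Assumed layout: upper 13 bits hold the page number, lower 51 bits hold the offset.
    static final int PAGE_NUMBER_BITS = 13;
    static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    // Combine a page number and an offset within that page into a single long.
    static long encode(int pageNumber, long offsetInPage) {
        if (pageNumber < 0 || pageNumber >= (1 << PAGE_NUMBER_BITS)) {
            throw new IllegalArgumentException("page number out of range: " + pageNumber);
        }
        return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & OFFSET_MASK);
    }

    // Recover the page number from the upper bits (unsigned shift, so the sign bit is safe).
    static int decodePageNumber(long address) {
        return (int) (address >>> OFFSET_BITS);
    }

    // Recover the offset from the lower bits.
    static long decodeOffset(long address) {
        return address & OFFSET_MASK;
    }
}
```

Under this assumed layout, an on-heap address can be resolved by looking up the page's base object by page number and then applying the offset inside that object, which is the "object reference plus offset" combination described above.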

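The most intricate logic in the TaskMemoryManager class lives in acquireExecutionMemory(), that is, in how memory is requested for a task.

1. Flowchart of an execution memory request

The logic by which a task requests execution memory is ultimately implemented by the acquireMemory() function of the execution memory pool, ExecutionMemoryPool: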
[Figure: flowchart of the execution memory request path]

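2. TaskMemoryManager.acquireExecutionMemory()

Each TaskMemoryManager acquires execution memory for its task in acquireExecutionMemory(). This method acquires N bytes of memory for a memory consumer (MemoryConsumer). If not enough memory is available, it calls spill() on memory consumers to free the required amount, that is, it spills the memory those consumers hold to disk.

MemoryConsumer

A MemoryConsumer corresponds to one operator and one data structure within a task.

A MemoryConsumer is a client of TaskMemoryManager that supports spilling the memory it holds to disk (the spill() method). Currently spill() only supports releasing Tungsten-managed pages.

acquireExecutionMemory()

acquireExecutionMemory() acquires execution memory for the current consumer as follows:

First, it requests the required memory from the MemoryManager.
If less execution memory is granted than requested, it forces other consumers of the same task to release memory.
If the memory reclaimed from other consumers plus the memory already granted is still less than requested, it forces the current consumer to release the memory it holds.

The code in detail: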

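//Tracks the consumers whose memory can be spilled
private final HashSet<MemoryConsumer> consumers;

public long acquireExecutionMemory(long required, MemoryConsumer consumer) {
  assert(required >= 0); //the requested size must be non-negative
  assert(consumer != null); //the consumer requesting execution memory
  MemoryMode mode = consumer.getMode();
  //If we are allocating Tungsten pages off-heap and receive a request for on-heap memory here,
  //spilling may not help, because it would only free off-heap memory.
  synchronized (this) {
    //First ask the MemoryManager for the requested amount of execution memory for this task.
    //The amount granted is returned; it may be less than required, and is 0 if nothing was granted.
    //This call may block until enough free memory is available, so that every task gets a chance to reach
    //at least 1/2N of the total pool (N = number of active tasks) before it is forced to spill.
    //This can happen when the number of tasks grows while older tasks already hold a lot of memory.
    long got = memoryManager.acquireExecutionMemory(required, taskAttemptId, mode);

    //If less was granted than requested, first try to free memory from other consumers;
    //this reduces spill frequency and avoids producing many small spill files
    if (got < required) {
      //Call spill() on other consumers to release memory.

      //First sort the other consumers by memory usage with a TreeMap,
      //so that we avoid spilling the same consumer repeatedly, which would produce many small spill files.
      //Key: memory used by a consumer; value: the list of consumers using exactly that amount
      TreeMap<Long, List<MemoryConsumer>> sortedConsumers = new TreeMap<>();

      //Iterate over all consumers and add each usage size and its consumer to sortedConsumers
      for (MemoryConsumer c: consumers) {
        //Skip the consumer that is making this request,
        //skip consumers that hold no memory,
        //and only consider consumers with the same memory mode (ON_HEAP or OFF_HEAP)
        if (c != consumer && c.getUsed() > 0 && c.getMode() == mode) {
          //Memory used by this consumer
          long key = c.getUsed();

          //computeIfAbsent(): if the key has no value yet (or maps to null), compute the value with the
          //given function, put it into the map, and return it.
          //So a new ArrayList is created the first time a usage size is seen,
          //and later consumers with the same usage are appended to that list
          List<MemoryConsumer> list =
              sortedConsumers.computeIfAbsent(key, k -> new ArrayList<>(1));
          list.add(c);
        }
      }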
      
      //While there are still candidate consumers
      while (!sortedConsumers.isEmpty()) {
        //Among the sorted consumers, pick the entry with the smallest usage that is >= (required - got)
        Map.Entry<Long, List<MemoryConsumer>> currentEntry =
          sortedConsumers.ceilingEntry(required - got);

        //If no consumer uses at least (required - got), pick the one using the most memory
        if (currentEntry == null) {
          currentEntry = sortedConsumers.lastEntry();
        }

        //The list of consumers with that usage size
        List<MemoryConsumer> cList = currentEntry.getValue();
        //Take the last consumer in the list
        MemoryConsumer c = cList.get(cList.size() - 1);
        try {
          //Spill data to disk to free memory.
          //Note: to avoid deadlock, do not call acquireMemory() from within spill()
          //Note: spill() currently only supports releasing Tungsten-managed pages
          long released = c.spill(required - got, consumer);
          if (released > 0) {
            logger.debug("Task {} released {} from {} for {}", taskAttemptId, Utils.bytesToString(released), c, consumer);
            //Request the remaining amount again now that spill() has freed memory, and add what is granted to got.
            //The acquireExecutionMemory() called here is mainly implemented in UnifiedMemoryManager.acquireExecutionMemory()
            got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);

            //Stop once got covers the request
            if (got >= required) {
              break;
            }
          } else { //spill() released nothing
            cList.remove(cList.size() - 1);
            if (cList.isEmpty()) {
              //Remove the entry for currentEntry's key
              sortedConsumers.remove(currentEntry.getKey());
            }
          }
        } catch (ClosedByInterruptException e) {
          ...
        } catch (IOException e) {
          ...
        }
      }
    }

    //If the memory released by other consumers still cannot satisfy the request, spill the current consumer's own memory
    if (got < required) {
      try {
        //Release the remaining required memory from the current consumer
        long released = consumer.spill(required - got, consumer);
        if (released > 0) {
          logger.debug("Task {} released {} from itself ({})", taskAttemptId, Utils.bytesToString(released), consumer);
          //Add the memory granted after the current consumer spilled
          got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        }
      } catch (ClosedByInterruptException e) {
        ...
      } catch (IOException e) {
        ...
      }
    }

    //Register this consumer so that it can be asked to spill by later requests
    consumers.add(consumer);
    logger.debug("Task {} acquired {} for {}", taskAttemptId, Utils.bytesToString(got), consumer);
    return got;
  }
}
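
The victim selection in the loop above relies on two TreeMap calls, ceilingEntry() and lastEntry(). The following is a small standalone sketch of just that choice, under simplifying assumptions: consumers are represented by plain names with a usage in bytes, and SpillVictimSketch and pickVictim are hypothetical names, not part of Spark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Spark code: which consumer would be asked to spill first?
final class SpillVictimSketch {
    // usedByConsumer maps a consumer name to the bytes it currently holds.
    // Returns the chosen consumer name, or null if nobody holds any memory.
    static String pickVictim(Map<String, Long> usedByConsumer, long stillNeeded) {
        // Group consumers by usage, sorted ascending by usage.
        TreeMap<Long, List<String>> sorted = new TreeMap<>();
        for (Map.Entry<String, Long> e : usedByConsumer.entrySet()) {
            if (e.getValue() > 0) {
                sorted.computeIfAbsent(e.getValue(), k -> new ArrayList<>(1)).add(e.getKey());
            }
        }
        if (sorted.isEmpty()) {
            return null;
        }
        // Smallest usage that can cover the shortfall in a single spill.
        Map.Entry<Long, List<String>> entry = sorted.ceilingEntry(stillNeeded);
        if (entry == null) {
            // Nobody is large enough on their own, so fall back to the largest consumer.
            entry = sorted.lastEntry();
        }
        List<String> candidates = entry.getValue();
        return candidates.get(candidates.size() - 1);
    }
}
```

For example, with consumers holding 10, 50 and 200 bytes, a shortfall of 40 selects the 50-byte consumer, while a shortfall of 500 falls back to the 200-byte one. Picking the smallest consumer that is large enough tends to satisfy the request with one spill instead of several small ones.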

MemoryManager.acquireExecutionMemory()

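In the code above, execution memory is acquired through acquireExecutionMemory() of the abstract class MemoryManager. This method is abstract and has no implementation of its own; the actual logic is provided by the subclasses of MemoryManager.

Before Spark 1.6, static memory management was used; its implementation class is StaticMemoryManager.
Since Spark 1.6, unified memory management is the default; its implementation class is UnifiedMemoryManager.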

UnifiedMemoryManager.acquireExecutionMemory()

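In the unified memory management model, every attempt to acquire execution memory first tries to reclaim the execution memory space that Storage Memory has borrowed.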

private[spark] abstract class MemoryManager(
    conf: SparkConf,
    numCores: Int,
    onHeapStorageMemory: Long,
    onHeapExecutionMemory: Long) extends Logging {
  //Tracks on-heap Storage Memory usage in the unified memory model
  protected val onHeapStorageMemoryPool = new StorageMemoryPool(this, MemoryMode.ON_HEAP)
  //Tracks off-heap Storage Memory usage in the unified memory model
  protected val offHeapStorageMemoryPool = new StorageMemoryPool(this, MemoryMode.OFF_HEAP)
  //Tracks on-heap Execution Memory usage in the unified memory model
  protected val onHeapExecutionMemoryPool = new ExecutionMemoryPool(this, MemoryMode.ON_HEAP)
  //Tracks off-heap Execution Memory usage in the unified memory model
  protected val offHeapExecutionMemoryPool = new ExecutionMemoryPool(this, MemoryMode.OFF_HEAP)
}


//UnifiedMemoryManager is a subclass of MemoryManager
private[spark] class UnifiedMemoryManager private[memory] (
    conf: SparkConf,
    val maxHeapMemory: Long,
    onHeapStorageRegionSize: Long,
    numCores: Int)
  extends MemoryManager(
    conf,
    numCores,
    onHeapStorageRegionSize,
    maxHeapMemory - onHeapStorageRegionSize) {
    ...

  override private[memory] def acquireExecutionMemory(
      numBytes: Long,
      taskAttemptId: Long,
      memoryMode: MemoryMode): Long = synchronized {
    ...
    //The memoryMode argument decides whether on-heap (ON_HEAP) or off-heap (OFF_HEAP) memory is requested.
    //onHeapStorageMemoryPool, offHeapStorageMemoryPool, onHeapExecutionMemoryPool and offHeapExecutionMemoryPool
    //are all fields inherited from the parent class MemoryManager.
    val (executionPool, storagePool, storageRegionSize, maxMemory) = memoryMode match {
      case MemoryMode.ON_HEAP => (
        onHeapExecutionMemoryPool,
        onHeapStorageMemoryPool,
        onHeapStorageRegionSize,
        maxHeapMemory)
      case MemoryMode.OFF_HEAP => (
        offHeapExecutionMemoryPool,
        offHeapStorageMemoryPool,
        offHeapStorageMemory,
        maxOffHeapMemory)
    }
	
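    /**
     * In the unified memory model, Storage Memory and Execution Memory share one memory region and can borrow from each other.
     * When the free memory in the execution pool cannot satisfy a request, this method grows the execution pool
     * by evicting cached blocks from the storage pool, thereby shrinking the storage pool.
     *
     * When acquiring memory for a task, the execution pool may need to make several attempts. Each attempt must be
     * able to evict cached blocks from the storage pool, in case another task caches a new block in storage memory
     * between two attempts.
     */
    def maybeGrowExecutionPool(extraMemoryNeeded: Long): Unit = {
      if (extraMemoryNeeded > 0) {
        //When the execution pool does not have enough free memory, try to reclaim memory from the storage pool.
        //Any free memory in the storage pool can be reclaimed.
        //If the storage pool has grown beyond its initial size (storage and execution share one region, split
        //half and half by default, and may borrow from each other at runtime),
        //the part of the storage pool that was borrowed from execution can be evicted.
        val memoryReclaimableFromStorage = math.max(
          storagePool.memoryFree,
          storagePool.poolSize - storageRegionSize)
        if (memoryReclaimableFromStorage > 0) {
          //Reclaim only as much space as is needed and available
          val spaceToReclaim = storagePool.freeSpaceToShrinkPool(
            math.min(extraMemoryNeeded, memoryReclaimableFromStorage))
          //Shrink the storage pool by the reclaimed amount
          storagePool.decrementPoolSize(spaceToReclaim)
          //Grow the execution pool by the reclaimed amount
          executionPool.incrementPoolSize(spaceToReclaim)
        }
      }
    }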
	
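    /**
     * Computes the size the execution pool would have after evicting cached blocks from storage memory.
     *
     * The execution pool divides this maximum execution memory (including the part that could be freed from storage)
     * evenly among the active tasks to cap the execution memory of each task.
     * It is therefore important to keep this maximum larger than the current execution pool size, because the
     * current pool size does not account for the memory that could be freed by evicting storage.
     * Otherwise, if the cap were derived from the current execution pool size, tasks would hit the limit too early.
     * Example: the total memory is 100GB, cached blocks occupy 90GB, storage gets 0.5 of the total
     * (spark.memory.storageFraction=0.5), and there are 2 active tasks.
     * Based on the current pool size, the cap per task would be (100-90) / 2 = 5GB. If one task then requested 20GB
     * of execution memory, it would evict 20GB from the storage pool but could only be granted 5GB because of the cap,
     * which can lead to OOM.
     *
     * In addition, this maximum execution memory cannot exceed maxMemory (the maximum on-heap or off-heap memory).
     */
    def computeMaxExecutionPoolSize(): Long = {
      maxMemory - math.min(storagePool.memoryUsed, storageRegionSize)
    }

    //Ask the execution pool to allocate the memory
    executionPool.acquireMemory(
      numBytes, taskAttemptId, maybeGrowExecutionPool, () => computeMaxExecutionPoolSize)
  }
}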
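
The two formulas in maybeGrowExecutionPool() and computeMaxExecutionPoolSize() are easy to check by hand with the 100GB example from the comment. The sketch below restates them as plain Java functions; UnifiedPoolSketch and its method names are hypothetical, the numbers are in GB for readability, and it assumes the storage pool is exactly as large as the 90GB of cached blocks (no free storage memory).

```java
// Hypothetical sketch, not Spark code: the arithmetic behind growing the execution pool.
final class UnifiedPoolSketch {
    // How much the execution pool may take back from storage:
    // either storage's free memory, or whatever storage holds beyond its own region.
    static long reclaimableFromStorage(long storageFree, long storagePoolSize, long storageRegionSize) {
        return Math.max(storageFree, storagePoolSize - storageRegionSize);
    }

    // Execution pool size after evicting everything storage borrowed from execution.
    static long maxExecutionPoolSize(long maxMemory, long storageUsed, long storageRegionSize) {
        return maxMemory - Math.min(storageUsed, storageRegionSize);
    }

    // Per-task cap when the maximum pool size is shared evenly by the active tasks.
    static long capPerTask(long maxPoolSize, int numActiveTasks) {
        return maxPoolSize / numActiveTasks;
    }
}
```

With maxMemory = 100, a storage region of 50 and 90 in use, reclaimableFromStorage(0, 90, 50) is 40 and maxExecutionPoolSize(100, 90, 50) is 50, so each of the 2 tasks is capped at 25 rather than at the 5 that the current pool size alone would suggest. A 20GB request therefore fits under the cap.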

ExecutionMemoryPool.acquireMemory()

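As the flowchart earlier in this article shows, the logic by which a task finally requests execution memory is implemented by the acquireMemory() function of the ExecutionMemoryPool class.

What ExecutionMemoryPool does

ExecutionMemoryPool implements the policy for sharing an adjustable-size memory pool among tasks. It tries to ensure that each task gets a reasonable share of memory, rather than letting some tasks grab a large amount first and thereby forcing the others to spill to disk repeatedly.

Suppose there are N tasks. ExecutionMemoryPool ensures that each task can obtain at least 1/2N of the memory before it has to spill to disk, and at most 1/N. Because N changes dynamically, the pool tracks the set of active tasks and recomputes 1/2N and 1/N whenever the number of active tasks changes.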

private[memory] def acquireMemory(
    numBytes: Long, //number of bytes requested
    taskAttemptId: Long, //ID of the task attempt requesting memory
    maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => Unit, //callback that may grow the execution pool; corresponds to maybeGrowExecutionPool() in the previous section
    computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
    ...
  // memoryForTask records per-task memory usage: the key is taskAttemptId, the value is the number of bytes the task holds.
  if (!memoryForTask.contains(taskAttemptId)) {
    memoryForTask(taskAttemptId) = 0L
    // This will later cause waiting tasks to wake up and check numTasks again
    lock.notifyAll()
  }

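  //Loop until we are sure the request cannot be granted (because this task would exceed its cap of
  //1/numActiveTasks of the memory), or until there is enough free memory to give to this task
  //(every task is entitled to at least 1/(2*numActiveTasks) of the memory).
  while (true) {
    //Number of active tasks
    val numActiveTasks = memoryForTask.keys.size
    val curMem = memoryForTask(taskAttemptId)

    //In every iteration, first try to reclaim any execution memory borrowed by storage
    maybeGrowPool(numBytes - memoryFree)

    //Maximum size the execution pool would have after growing.
    //It is used to compute the cap on how much memory each task may hold, and must account for both the memory
    //that could be evicted from the storage region and the memory currently held by the execution pool.
    val maxPoolSize = computeMaxPoolSize()
    //Maximum memory a single task may hold
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    //Minimum memory a task is entitled to before it can be made to give up
    val minMemoryPerTask = poolSize / (2 * numActiveTasks)

    //Cap the grant so that the task's total stays within [0, 1 / numActiveTasks] of the pool
    val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
    //Only hand out memory that is actually free
    val toGrant = math.min(maxToGrant, memoryFree)

    //Every task should be able to reach at least 1 / (2 * numActiveTasks) of the memory before blocking;
    //if that much cannot be given to this task yet, wait for other tasks to release memory.
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
      logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
      lock.wait()
    } else {
      //Add the granted memory to what the task already holds
      memoryForTask(taskAttemptId) += toGrant
      //Return the amount granted
      return toGrant
    }
  }
  0L  // Never reached
}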
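
To see how the 1/2N and 1/N bounds play out for a single request, the decision inside the loop can be restated as a pure function. This is a simplified sketch under stated assumptions: GrantSketch and decide are hypothetical names, the pool is not grown during the call, and the blocking lock.wait() is replaced by returning -1.

```java
// Hypothetical sketch, not Spark code: one pass of the grant decision in acquireMemory().
final class GrantSketch {
    // Returns the number of bytes granted now, or -1 if the task would block and wait.
    static long decide(long numBytes, long curMem, long memoryFree,
                       long poolSize, long maxPoolSize, int numActiveTasks) {
        // Upper bound per task: 1/N of the (potentially grown) pool.
        long maxMemoryPerTask = maxPoolSize / numActiveTasks;
        // Lower bound per task: 1/(2N) of the current pool.
        long minMemoryPerTask = poolSize / (2L * numActiveTasks);

        // Never let the task exceed its cap, and never grant more than is free.
        long maxToGrant = Math.min(numBytes, Math.max(0L, maxMemoryPerTask - curMem));
        long toGrant = Math.min(maxToGrant, memoryFree);

        // The request is not fully met and the task is still below its 1/(2N) share:
        // the real code waits on the lock here and retries after being notified.
        if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
            return -1L;
        }
        return toGrant;
    }
}
```

With a pool of 1000 bytes and 4 active tasks, the cap is 250 and the guaranteed share is 125. A task already holding 200 that asks for 100 more is granted only 50, because it reaches the cap. A task holding nothing that asks for 200 while only 50 bytes are free would block, since a partial grant would leave it below 125.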