Spark Source Code Reading Notes: MemoryStore
BlockManager performs the actual storage of data through BlockStore. BlockStore is an abstract class with three implementations: DiskStore (disk-level persistence), MemoryStore (memory-level persistence), and TachyonStore (persistence on the Tachyon in-memory distributed file system).
MemoryStore stores a block in memory either as an array of deserialized Java objects (Array[Any]) or as a serialized byte buffer (ByteBuffer). As the class's Scaladoc puts it:
Stores blocks in memory, either as Arrays of deserialized Java objects or as serialized ByteBuffers.
MemoryStore keeps its data in a LinkedHashMap, so blocks can be traversed in the order in which they were stored (see the sketch below). It also tracks the total size of all stored blocks. When there is not enough memory for a new block, it walks the existing blocks in storage order and drops those whose storage level allows disk out to disk, until enough memory has been freed for the new block. If memory is still insufficient, the new block itself is written to disk when its storage level permits that; otherwise nothing is stored. When storing an Iterator, the data may come from disk or some other non-memory source, so unrolling it into an Array[Any] has to keep checking whether enough memory is available.
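A minimal illustration of the ordered traversal a LinkedHashMap provides (the block IDs and sizes below are made up):
import java.util.LinkedHashMap

// Simplified stand-in for entries: BlockId -> size in bytes.
val entries = new LinkedHashMap[String, Long]()
entries.put("rdd_0_0", 100L)
entries.put("rdd_0_1", 200L)
entries.put("rdd_0_2", 300L)

// Iteration follows the order in which entries were stored, so the oldest
// block is visited first when looking for eviction candidates.
val it = entries.entrySet().iterator()
while (it.hasNext) {
  val e = it.next()
  println(s"${e.getKey} -> ${e.getValue}")   // rdd_0_0, rdd_0_1, rdd_0_2
}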
MemoryStore Attributes
blockManager: BlockManager
maxMemory: Long
The maximum amount of memory available for storage.
entries: LinkedHashMap[BlockId, MemoryEntry]
The map that holds the block data, keyed by BlockId with MemoryEntry values; it can be traversed in the order in which blocks were stored.
accountingLock: Object
Synchronization lock that ensures only one thread is putting, and if necessary dropping, blocks at any given time.
currentMemory: Long
The amount of memory currently in use.
unrollMemoryMap: Map[Long, Long]
A mapping from thread ID to the amount of memory that thread is using to unroll a block (in bytes). All accesses of this map are assumed to be manually synchronized on accountingLock.
maxUnrollMemory: Long
The amount of space ensured for unrolling values in memory, shared across all cores; its value is maxMemory * conf.getDouble("spark.storage.unrollFraction", 0.2). This space is not reserved in advance, but is allocated dynamically by dropping existing blocks: if unrolling runs out of memory while the unroll memory in use has not yet reached maxUnrollMemory, blocks held in memory whose storage level allows disk are dropped to disk to free memory.
unrollMemoryThreshold: Long
The initial amount of memory each thread requests before unrolling any block; its value is conf.getLong("spark.storage.unrollMemoryThreshold", 1024 * 1024). When that is not enough, the thread requests more memory with a growth factor of 1.5; if the request cannot be satisfied and maxUnrollMemory has not been reached, disk-eligible blocks in memory are dropped to disk to free memory (see the sketch after this list).
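A standalone sketch of this unroll bookkeeping, under stated assumptions: tryReserve stands in for MemoryStore's internal accounting (including dropping blocks to free space), estimate stands in for SizeEstimator, and the every-16-elements size check mirrors the actual source but should be treated as an assumption here.
import scala.collection.mutable.ArrayBuffer

def unrollSketch(
    values: Iterator[Any],
    tryReserve: Long => Boolean,           // hypothetical: reserve more unroll memory
    estimate: ArrayBuffer[Any] => Long     // hypothetical: estimated size in bytes
  ): Either[Array[Any], Iterator[Any]] = {
  val initialThreshold = 1024L * 1024      // spark.storage.unrollMemoryThreshold default
  val checkPeriod = 16                     // re-estimate the size every 16 elements
  val growthFactor = 1.5                   // grow the reservation to 1.5x the estimate
  val buffer = new ArrayBuffer[Any]
  var reserved = initialThreshold
  var keepUnrolling = tryReserve(initialThreshold)
  var count = 0L
  while (values.hasNext && keepUnrolling) {
    buffer += values.next()
    count += 1
    if (count % checkPeriod == 0) {
      val currentSize = estimate(buffer)
      if (currentSize >= reserved) {
        val request = (currentSize * growthFactor).toLong - reserved
        keepUnrolling = tryReserve(request)
        if (keepUnrolling) reserved += request
      }
    }
  }
  if (keepUnrolling) Left(buffer.toArray)  // fully unrolled in memory
  else Right(buffer.iterator ++ values)    // out of memory: hand back what remains
}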
The code for MemoryEntry:
case class MemoryEntry(value: Any, size: Long, deserialized: Boolean)
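Here value holds either the unrolled Array[Any] or the serialized ByteBuffer, size is its size in bytes, and deserialized records which of the two forms is being stored.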
MemoryStore Methods
Methods for writing data:
- putBytes(blockId: BlockId, _bytes: ByteBuffer, level: StorageLevel): PutResult
Stores a block given as a byte buffer into memory. If level.deserialized is true, the buffer is deserialized and stored as an Array[Any]; if it is false, the block is stored as a ByteBuffer.
override def putBytes(blockId: BlockId, _bytes: ByteBuffer, level: StorageLevel): PutResult = {
  // Work on a duplicate - since the original input might be used elsewhere.
  val bytes = _bytes.duplicate()
  bytes.rewind()
  if (level.deserialized) {
    // Deserialized level: decode the buffer and go through putIterator.
    val values = blockManager.dataDeserialize(blockId, bytes)
    putIterator(blockId, values, level, returnValues = true)
  } else {
    // Serialized level: store the buffer as-is; its limit is the exact size.
    val putAttempt = tryToPut(blockId, bytes, bytes.limit, deserialized = false)
    PutResult(bytes.limit(), Right(bytes.duplicate()), putAttempt.droppedBlocks)
  }
}
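The duplicate()/rewind() step at the top matters because a ByteBuffer carries mutable cursor state (position and limit), and duplicate() yields an independent cursor over the same bytes. A small standalone illustration:
import java.nio.ByteBuffer

val original = ByteBuffer.wrap("block-data".getBytes("UTF-8"))
original.position(5)             // pretend the caller already consumed part of it

val copy = original.duplicate()  // shares the bytes, but not position/limit
copy.rewind()                    // reset only the copy's cursor to the start

println(original.position())     // 5  - the caller's view is untouched
println(copy.position())         // 0  - MemoryStore reads from the beginning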
- putArray(blockId: BlockId, values: Array[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores a block given as an array into memory. If level.deserialized is false, the array is serialized and stored as a ByteBuffer; if it is true, the block is stored as an Array[Any].
override def putArray(
    blockId: BlockId,
    values: Array[Any],
    level: StorageLevel,
    returnValues: Boolean): PutResult = {
  if (level.deserialized) {
    // Store the array as-is; its in-memory size can only be estimated.
    val sizeEstimate = SizeEstimator.estimate(values.asInstanceOf[AnyRef])
    val putAttempt = tryToPut(blockId, values, sizeEstimate, deserialized = true)
    PutResult(sizeEstimate, Left(values.iterator), putAttempt.droppedBlocks)
  } else {
    // Serialize first; the resulting buffer's limit is the exact size.
    val bytes = blockManager.dataSerialize(blockId, values.iterator)
    val putAttempt = tryToPut(blockId, bytes, bytes.limit, deserialized = false)
    PutResult(bytes.limit(), Right(bytes.duplicate()), putAttempt.droppedBlocks)
  }
}
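Note the sizing asymmetry between the two branches: the serialized size is exact (bytes.limit), while the deserialized size is only an estimate obtained by walking the object graph. A sketch of that estimate call (treat the visibility of SizeEstimator as version-dependent; in some Spark versions it is private to Spark's own packages):
import org.apache.spark.util.SizeEstimator

val values: Array[Any] = Array.fill(1000)("element")
// Approximate JVM heap footprint of the array and everything it references.
val approxBytes = SizeEstimator.estimate(values.asInstanceOf[AnyRef])
println(s"estimated size: $approxBytes bytes")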
- putIterator(blockId: BlockId, values: Iterator[Any], level: StorageLevel, returnValues: Boolean): PutResult
Unrolls an Iterator[Any] into an Array[Any] and stores it in memory; the unrolling has to keep checking that enough memory is available. If memory runs out while the unroll memory in use has not yet reached maxUnrollMemory, blocks held in memory that may be stored to disk are dropped to disk to free memory. If there is still not enough memory, the block itself is written to disk when its storage level allows it; otherwise a failed result is returned (PutResult(0, Left(iteratorValues), droppedBlocks)).
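A standalone sketch of that flow (every name here is a hypothetical stand-in: unroll plays the role of MemoryStore's internal unrolling step, putInMemory of putArray above, and putOnDisk of the DiskStore write; the result pair mimics PutResult's size and data):
def putIteratorSketch(
    values: Iterator[Any],
    useDisk: Boolean,                                   // level.useDisk
    unroll: Iterator[Any] => Either[Array[Any], Iterator[Any]],
    putInMemory: Array[Any] => Long,                    // returns the stored size
    putOnDisk: Iterator[Any] => Long): (Long, Option[Iterator[Any]]) = {
  unroll(values) match {
    case Left(arrayValues) =>
      // Fully unrolled: store the array in memory (serializing it first if
      // the storage level asks for serialized data).
      (putInMemory(arrayValues), None)
    case Right(iteratorValues) if useDisk =>
      // Could not unroll, but the level allows disk: fall back to the disk store.
      (putOnDisk(iteratorValues), None)
    case Right(iteratorValues) =>
      // No memory and no disk fallback: fail, handing the remaining iterator
      // back to the caller (cf. PutResult(0, Left(iteratorValues), droppedBlocks)).
      (0L, Some(iteratorValues))
  }
}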