Spark Source Code Reading Notes: BlockStore
Under the hood, BlockManager stores data through BlockStore. BlockStore is an abstract class with three implementations: DiskStore (disk-level persistence), MemoryStore (memory-level persistence), and TachyonStore (persistence on the Tachyon in-memory distributed file system).
The BlockStore code:
/**
 * Abstract class to store blocks.
 */
private[spark] abstract class BlockStore(val blockManager: BlockManager) extends Logging {

  def putBytes(blockId: BlockId, bytes: ByteBuffer, level: StorageLevel): PutResult

  /**
   * Put in a block and, possibly, also return its content as either bytes or another Iterator.
   * This is used to efficiently write the values to multiple locations (e.g. for replication).
   *
   * @return a PutResult that contains the size of the data, as well as the values put if
   *         returnValues is true (if not, the result's data field can be null)
   */
  def putIterator(
      blockId: BlockId,
      values: Iterator[Any],
      level: StorageLevel,
      returnValues: Boolean): PutResult

  def putArray(
      blockId: BlockId,
      values: Array[Any],
      level: StorageLevel,
      returnValues: Boolean): PutResult

  /**
   * Return the size of a block in bytes.
   */
  def getSize(blockId: BlockId): Long

  def getBytes(blockId: BlockId): Option[ByteBuffer]

  def getValues(blockId: BlockId): Option[Iterator[Any]]

  /**
   * Remove a block, if it exists.
   * @param blockId the block to remove.
   * @return True if the block was found and removed, False otherwise.
   */
  def remove(blockId: BlockId): Boolean

  def contains(blockId: BlockId): Boolean

  def clear() { }
}
BlockStore has three methods for storing data:

def putBytes(blockId: BlockId, bytes: ByteBuffer, level: StorageLevel): PutResult
Stores a byte buffer (ByteBuffer) to memory or disk.

def putArray(blockId: BlockId, values: Array[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores an array (Array[Any]) to memory or disk.

def putIterator(blockId: BlockId, values: Iterator[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores an Iterator to memory or disk. Since the Iterator may be read lazily from disk or some other non-memory source, the memory cost of unrolling it must be taken into account.
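The unrolling concern behind putIterator can be sketched as follows: rather than materializing the whole iterator at once, consume it element by element and give up once an assumed budget is exceeded. This is a simplified illustration with a hypothetical element-count budget, not Spark's actual unrolling logic (which estimates sizes in bytes):

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified unrolling: materialize an iterator incrementally, giving up
// if more than `maxElements` elements would have to be held in memory.
// (`maxElements` is a stand-in for Spark's byte-based memory budget.)
def unrollSafely[T](values: Iterator[T], maxElements: Int): Either[Iterator[T], Array[T]] = {
  val buffer = new ArrayBuffer[T]
  while (values.hasNext) {
    if (buffer.length >= maxElements) {
      // Budget exceeded: hand back everything, still as a lazy iterator.
      return Left(buffer.iterator ++ values)
    }
    buffer += values.next()
  }
  Right(buffer.toArray) // Fully unrolled: safe to store as an array.
}
```

A Right result means the data fit and can be stored as an array; a Left result means the caller must fall back to a streaming path (e.g. spilling to disk).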
Each of these put methods returns a PutResult:
/**
 * Result of adding a block into a BlockStore. This case class contains a few things:
 *   (1) The estimated size of the put,
 *   (2) The values put if the caller asked for them to be returned (e.g. for chaining
 *       replication), and
 *   (3) A list of blocks dropped as a result of this put. This is always empty for DiskStore.
 */
private[spark] case class PutResult(
    size: Long,
    data: Either[Iterator[_], ByteBuffer],
    droppedBlocks: Seq[(BlockId, BlockStatus)] = Seq.empty)
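Note that the data field is an Either: the values come back as an Iterator (Left) or as serialized bytes in a ByteBuffer (Right). A minimal sketch of how a caller might inspect it, using a simplified stand-in for PutResult (the droppedBlocks field and BlockId type are omitted here for brevity; this is not Spark's actual code):

```scala
import java.nio.ByteBuffer

// Simplified stand-in for Spark's PutResult.
case class PutResult(size: Long, data: Either[Iterator[_], ByteBuffer])

// A caller inspecting the returned data, e.g. before forwarding it for replication.
def describe(result: PutResult): String = result.data match {
  case Left(_)       => s"${result.size} bytes, values returned as Iterator"
  case Right(buffer) => s"${result.size} bytes, ${buffer.remaining()} bytes in ByteBuffer"
}

val fromValues = PutResult(3, Left(Iterator(1, 2, 3)))
val fromBytes  = PutResult(8, Right(ByteBuffer.allocate(8)))
```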
BlockStore has two methods for reading data:

def getBytes(blockId: BlockId): Option[ByteBuffer]
Retrieves the stored data, converting it to a byte buffer if necessary (i.e. when the data is stored as an Iterator).

def getValues(blockId: BlockId): Option[Iterator[Any]]
Retrieves the stored data, converting it to an Iterator[Any] if necessary (i.e. when the data is stored as a byte buffer).
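The conversion implied by these two methods can be illustrated with plain Java serialization (an assumption for illustration only; Spark actually delegates to its configured Serializer, and real deserialization detects end-of-stream rather than taking an element count):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// The getBytes direction: values stored as an Iterator are serialized
// into a ByteBuffer on demand.
def valuesToBytes(values: Iterator[Any]): ByteBuffer = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  values.foreach(oos.writeObject)
  oos.close()
  ByteBuffer.wrap(bos.toByteArray)
}

// The getValues direction: stored bytes are deserialized back into values.
// (`count` is a simplification; a real implementation streams until EOF.)
def bytesToValues(bytes: ByteBuffer, count: Int): Iterator[Any] = {
  val arr = new Array[Byte](bytes.remaining())
  bytes.duplicate().get(arr)
  val ois = new ObjectInputStream(new ByteArrayInputStream(arr))
  Iterator.fill(count)(ois.readObject())
}
```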
BlockStore's other methods:

def getSize(blockId: BlockId): Long
Returns the size of the block identified by blockId.

def remove(blockId: BlockId): Boolean
Removes the block identified by blockId.

def contains(blockId: BlockId): Boolean
Checks whether a block with the given blockId exists.

def clear()
Removes all stored blocks.
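To make the overall contract concrete, here is a minimal in-memory sketch of this API using plain Scala collections. It is a toy stand-in, not Spark's MemoryStore: BlockId is simplified to a String, and StorageLevel, PutResult, and the value-based put paths are omitted:

```scala
import java.nio.ByteBuffer
import scala.collection.mutable

// Toy in-memory store: blocks are kept as ByteBuffers keyed by a String id.
class SimpleBlockStore {
  private val blocks = mutable.Map.empty[String, ByteBuffer]

  def putBytes(blockId: String, bytes: ByteBuffer): Unit =
    blocks(blockId) = bytes

  def getBytes(blockId: String): Option[ByteBuffer] = blocks.get(blockId)

  def getSize(blockId: String): Long =
    blocks.get(blockId).map(_.remaining().toLong).getOrElse(0L)

  def remove(blockId: String): Boolean = blocks.remove(blockId).isDefined

  def contains(blockId: String): Boolean = blocks.contains(blockId)

  def clear(): Unit = blocks.clear()
}
```

Usage follows the same life cycle the abstract class prescribes: put a block, query its size or existence, then remove it or clear the whole store.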