Spark Source Code Reading Notes: BlockStore
Under the hood, BlockManager stores data through BlockStore. BlockStore is an abstract class with three implementations: DiskStore (disk-level persistence), MemoryStore (memory-level persistence), and TachyonStore (persistence on the Tachyon in-memory distributed file system).
The BlockStore code:
/**
 * Abstract class to store blocks.
 */
private[spark] abstract class BlockStore(val blockManager: BlockManager) extends Logging {

  def putBytes(blockId: BlockId, bytes: ByteBuffer, level: StorageLevel): PutResult

  /**
   * Put in a block and, possibly, also return its content as either bytes or another Iterator.
   * This is used to efficiently write the values to multiple locations (e.g. for replication).
   *
   * @return a PutResult that contains the size of the data, as well as the values put if
   *         returnValues is true (if not, the result's data field can be null)
   */
  def putIterator(
      blockId: BlockId,
      values: Iterator[Any],
      level: StorageLevel,
      returnValues: Boolean): PutResult

  def putArray(
      blockId: BlockId,
      values: Array[Any],
      level: StorageLevel,
      returnValues: Boolean): PutResult

  /**
   * Return the size of a block in bytes.
   */
  def getSize(blockId: BlockId): Long

  def getBytes(blockId: BlockId): Option[ByteBuffer]

  def getValues(blockId: BlockId): Option[Iterator[Any]]

  /**
   * Remove a block, if it exists.
   * @param blockId the block to remove.
   * @return True if the block was found and removed, False otherwise.
   */
  def remove(blockId: BlockId): Boolean

  def contains(blockId: BlockId): Boolean

  def clear() { }
}
BlockStore has three methods for storing data:

def putBytes(blockId: BlockId, bytes: ByteBuffer, level: StorageLevel): PutResult
Stores a byte buffer (ByteBuffer) to memory or disk.

def putArray(blockId: BlockId, values: Array[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores an array (Array[Any]) to memory or disk.

def putIterator(blockId: BlockId, values: Iterator[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores an Iterator to memory or disk. Since the Iterator may be read lazily from disk or some other non-memory source, the memory cost of unrolling it must be taken into account.
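The unrolling concern behind putIterator can be sketched as follows: rather than materializing the whole iterator at once, consume it element by element and give up once an assumed budget is exceeded. This is a simplified illustration with a hypothetical element-count budget, not Spark's actual unrolling logic (which estimates sizes in bytes):

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified unrolling: materialize an iterator incrementally, giving up
// if more than `maxElements` elements would have to be held in memory.
// (`maxElements` is a stand-in for Spark's byte-based memory budget.)
def unrollSafely[T](values: Iterator[T], maxElements: Int): Either[Iterator[T], Array[T]] = {
  val buffer = new ArrayBuffer[T]
  while (values.hasNext) {
    if (buffer.length >= maxElements) {
      // Budget exceeded: hand back everything, still as a lazy iterator.
      return Left(buffer.iterator ++ values)
    }
    buffer += values.next()
  }
  Right(buffer.toArray) // Fully unrolled: safe to store as an array.
}
```

A Right result means the data fit and can be stored as an array; a Left result means the caller must fall back to a streaming path (e.g. spilling to disk).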
Each of these put methods returns a PutResult:
/**
 * Result of adding a block into a BlockStore. This case class contains a few things:
 *   (1) The estimated size of the put,
 *   (2) The values put if the caller asked for them to be returned (e.g. for chaining
 *       replication), and
 *   (3) A list of blocks dropped as a result of this put. This is always empty for DiskStore.
 */
private[spark] case class PutResult(
    size: Long,
    data: Either[Iterator[_], ByteBuffer],
    droppedBlocks: Seq[(BlockId, BlockStatus)] = Seq.empty)
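Note that the data field is an Either: the values come back as an Iterator (Left) or as serialized bytes in a ByteBuffer (Right). A minimal sketch of how a caller might inspect it, using a simplified stand-in for PutResult (the droppedBlocks field and BlockId type are omitted here for brevity; this is not Spark's actual code):

```scala
import java.nio.ByteBuffer

// Simplified stand-in for Spark's PutResult.
case class PutResult(size: Long, data: Either[Iterator[_], ByteBuffer])

// A caller inspecting the returned data, e.g. before forwarding it for replication.
def describe(result: PutResult): String = result.data match {
  case Left(_)       => s"${result.size} bytes, values returned as Iterator"
  case Right(buffer) => s"${result.size} bytes, ${buffer.remaining()} bytes in ByteBuffer"
}

val fromValues = PutResult(3, Left(Iterator(1, 2, 3)))
val fromBytes  = PutResult(8, Right(ByteBuffer.allocate(8)))
```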
BlockStore has two methods for reading data:

def getBytes(blockId: BlockId): Option[ByteBuffer]
Retrieves the stored data, converting it to a byte buffer if necessary (i.e. when the data is stored as an Iterator).

def getValues(blockId: BlockId): Option[Iterator[Any]]
Retrieves the stored data, converting it to an Iterator[Any] if necessary (i.e. when the data is stored as a byte buffer).
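The conversion implied by these two methods can be illustrated with plain Java serialization (an assumption for illustration only; Spark actually delegates to its configured Serializer, and real deserialization detects end-of-stream rather than taking an element count):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

// The getBytes direction: values stored as an Iterator are serialized
// into a ByteBuffer on demand.
def valuesToBytes(values: Iterator[Any]): ByteBuffer = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  values.foreach(oos.writeObject)
  oos.close()
  ByteBuffer.wrap(bos.toByteArray)
}

// The getValues direction: stored bytes are deserialized back into values.
// (`count` is a simplification; a real implementation streams until EOF.)
def bytesToValues(bytes: ByteBuffer, count: Int): Iterator[Any] = {
  val arr = new Array[Byte](bytes.remaining())
  bytes.duplicate().get(arr)
  val ois = new ObjectInputStream(new ByteArrayInputStream(arr))
  Iterator.fill(count)(ois.readObject())
}
```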
BlockStore's other methods:

def getSize(blockId: BlockId): Long
Returns the size of the block identified by blockId.

def remove(blockId: BlockId): Boolean
Removes the block identified by blockId.

def contains(blockId: BlockId): Boolean
Checks whether a block with the given blockId exists.

def clear()
Removes all stored blocks.
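To make the overall contract concrete, here is a minimal in-memory sketch of this API using plain Scala collections. It is a toy stand-in, not Spark's MemoryStore: BlockId is simplified to a String, and StorageLevel, PutResult, and the value-based put paths are omitted:

```scala
import java.nio.ByteBuffer
import scala.collection.mutable

// Toy in-memory store: blocks are kept as ByteBuffers keyed by a String id.
class SimpleBlockStore {
  private val blocks = mutable.Map.empty[String, ByteBuffer]

  def putBytes(blockId: String, bytes: ByteBuffer): Unit =
    blocks(blockId) = bytes

  def getBytes(blockId: String): Option[ByteBuffer] = blocks.get(blockId)

  def getSize(blockId: String): Long =
    blocks.get(blockId).map(_.remaining().toLong).getOrElse(0L)

  def remove(blockId: String): Boolean = blocks.remove(blockId).isDefined

  def contains(blockId: String): Boolean = blocks.contains(blockId)

  def clear(): Unit = blocks.clear()
}
```

Usage follows the same life cycle the abstract class prescribes: put a block, query its size or existence, then remove it or clear the whole store.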