Spark Source Code Reading Notes: BlockObjectWriter
In Spark's hash shuffle stage, the outputs of multiple map tasks can be merged into a single file to reduce the number of files produced; this relies mainly on BlockObjectWriter. BlockObjectWriter is an interface for writing directly to the storage container backing a block (currently only disk storage is supported, i.e. data can only be appended to the block's disk file), so appending to the container appends data to the corresponding block. One storage container (e.g. one file) can hold multiple blocks; merging several blocks into one file reduces the file count, and each block then occupies a contiguous segment of that container. BlockObjectWriter supports rollback: if a write fails, the data can be reverted, preserving atomicity. The interface does not support concurrent writes. BlockObjectWriter has a single implementation: DiskBlockObjectWriter.
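The merging scheme described above can be sketched in a few lines of standalone Scala. All names here (`MultiBlockFile`, `appendBlock`, `Segment`) are hypothetical illustrations, not Spark API: each block is appended at the current end of a shared file and is identified afterwards by the contiguous (offset, length) span it occupies.

```scala
import java.io.{File, FileOutputStream}

object MultiBlockFile {
  // Hypothetical stand-in for a block's (offset, length) span in a shared file.
  final case class Segment(offset: Long, length: Long)

  // Append one block's data at the current end of the shared file and
  // return the contiguous span it now occupies.
  def appendBlock(file: File, data: Array[Byte]): Segment = {
    val offset = file.length()                  // block starts where the file currently ends
    val out = new FileOutputStream(file, true)  // append mode
    try out.write(data) finally out.close()
    Segment(offset, data.length.toLong)
  }

  def main(args: Array[String]): Unit = {
    val f = File.createTempFile("shuffle", ".data"); f.deleteOnExit()
    val s1 = appendBlock(f, "map-out-1".getBytes("UTF-8"))  // 9 bytes
    val s2 = appendBlock(f, "map-out-2".getBytes("UTF-8"))  // 9 bytes
    assert(s1 == Segment(0L, 9L))
    assert(s2 == Segment(9L, 9L))  // the second block starts where the first ended
  }
}
```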
An interface for writing JVM objects to some underlying storage. This interface allows appending data to an existing block, and can guarantee atomicity in the case of faults as it allows the caller to revert partial writes.
This interface does not support concurrent writes. Also, once the writer has been opened, it cannot be reopened again.
Methods of BlockObjectWriter
open(): BlockObjectWriter
Opens the underlying output stream.

close()
Closes the output stream.

isOpen: Boolean
Returns whether the writer has been opened.

commitAndClose(): Unit
Flushes the partial writes in the buffer and commits them as a single atomic block, mapping the written data to the corresponding block.

revertPartialWritesAndClose()
Reverts writes that haven't been flushed yet, restoring the file to its state before the data was written. Callers should invoke this method when there are runtime exceptions. It will not throw, though it may be unsuccessful in truncating written data.

write(value: Any)
Writes a single record.

fileSegment(): FileSegment
Returns the FileSegment of committed data that this writer has written. Only valid after commitAndClose() has been called.
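The commit/revert contract above can be illustrated with a simplified, self-contained writer. This is a sketch only: `SketchWriter` is a hypothetical stand-in for DiskBlockObjectWriter (which serializes JVM objects rather than raw bytes), but it shows the key idea of tracking the last committed position so that partial writes can be truncated away.

```scala
import java.io.{File, FileOutputStream, RandomAccessFile}

// Hypothetical stand-in for DiskBlockObjectWriter: remembers the end of the
// last committed block so failed partial writes can be rolled back.
class SketchWriter(file: File) {
  private var committed: Long = file.length()  // position of the last successful commit
  private var out: FileOutputStream = _

  def open(): this.type = { out = new FileOutputStream(file, true); this }

  def write(bytes: Array[Byte]): Unit = out.write(bytes)

  // Flush buffered data, close, and return the (offset, length) of the newly
  // committed span -- the information a FileSegment would carry.
  def commitAndClose(): (Long, Long) = {
    out.flush(); out.close()
    val start = committed
    committed = file.length()
    (start, committed - start)
  }

  // Close and truncate the file back to the last committed position,
  // discarding partial writes. Swallows close errors, as the real method does.
  def revertPartialWritesAndClose(): Unit = {
    try out.close() catch { case _: Exception => }
    val raf = new RandomAccessFile(file, "rw")
    try raf.setLength(committed) finally raf.close()
  }
}

object SketchWriterDemo {
  def main(args: Array[String]): Unit = {
    val f = File.createTempFile("block", ".dat"); f.deleteOnExit()
    val w = new SketchWriter(f).open()
    w.write("good".getBytes("UTF-8"))
    val (off, len) = w.commitAndClose()
    assert(off == 0L && len == 4L)

    val w2 = new SketchWriter(f).open()
    w2.write("partial".getBytes("UTF-8"))
    w2.revertPartialWritesAndClose()  // simulate a failure: roll back the write
    assert(f.length() == 4L)          // only the committed block remains
  }
}
```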
FileSegment represents a contiguous segment of a file:
References a particular segment of a file (potentially the entire file), based off an offset and a length.
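The essence of that description can be captured in a single case class. This is a sketch mirroring the quoted scaladoc, not the actual Spark source, and the shuffle file name below is hypothetical.

```scala
import java.io.File

// A contiguous span of a file (potentially the entire file),
// identified by an offset and a length.
final case class FileSegment(file: File, offset: Long, length: Long)

object FileSegmentDemo {
  def main(args: Array[String]): Unit = {
    val f = new File("shuffle_0_0.data")   // hypothetical shuffle data file
    val first  = FileSegment(f, 0L, 128L)  // first block: bytes [0, 128)
    val second = FileSegment(f, 128L, 64L) // second block starts where the first ends
    assert(second.offset == first.offset + first.length)
  }
}
```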