PackedRecordPointer
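Overview
A PackedRecordPointer uses a single 64-bit long to hold the information about one record:
[24 bit partition number][13 bit memory page number][27 bit offset in page]. ShuffleInMemorySorter sorts on these packed values.
With this layout, and with no 8-byte alignment requirement on the offset in page, the largest addressable page is 2^27 bytes (128 MB). Because org.apache.spark.memory.TaskMemoryManager uses 13 bits for the page number, the memory addressable by one task = number of pages * maximum page size = 2^13 * 2^27 bytes = 2^40 bytes = 1 TB.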
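The arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name PackedLayoutCapacity and its helper methods are made up for illustration and are not part of Spark; only the three bit widths come from the layout described above.

```java
// Hypothetical helper, not Spark code: it only restates the 24/13/27 bit layout.
final class PackedLayoutCapacity {
  static final int PARTITION_BITS = 24;
  static final int PAGE_NUMBER_BITS = 13;
  static final int OFFSET_BITS = 27;

  /** Largest page, in bytes, that a 27-bit byte offset can address. */
  static long maxPageSizeBytes() {
    return 1L << OFFSET_BITS;
  }

  /** Bytes addressable by one task: number of pages * maximum page size. */
  static long maxTaskBytes() {
    return (1L << PAGE_NUMBER_BITS) * maxPageSizeBytes();
  }

  public static void main(String[] args) {
    System.out.println(PARTITION_BITS + PAGE_NUMBER_BITS + OFFSET_BITS); // 64
    System.out.println(maxPageSizeBytes()); // 134217728 (128 MB)
    System.out.println(maxTaskBytes());     // 1099511627776 (1 TB)
  }
}
```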
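Member variables
Assembling and disassembling this 64-bit long relies on a set of bit masks:
Lower-40-bit mask: used to derive the upper-24-bit mask.
Upper-24-bit mask: ANDing the packed 64-bit long with this mask, then shifting right by 40 bits, extracts the partitionId.
Lower-51-bit mask: used to derive the upper-13-bit mask.
Upper-13-bit mask: when packing, the record address encoded by TaskMemoryManager is ANDed with this mask to obtain the pageNumber, which is then placed into the 64-bit long; when unpacking, the 64-bit long is first shifted left by 24 bits and then ANDed with this mask to recover the pageNumber.
Lower-27-bit mask: when packing, the record address encoded by TaskMemoryManager is ANDed with this mask to obtain the compressed offset in page, which is then placed into the 64-bit long; when unpacking, the 64-bit long is ANDed with this mask to recover the offset in page.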
static final int MAXIMUM_PAGE_SIZE_BYTES = 1 << 27;  // 128 megabytes
/**
 * The maximum partition identifier that can be encoded. Note that partition ids start from 0.
 */
static final int MAXIMUM_PARTITION_ID = (1 << 24) - 1;  // 16777215
/** Bit mask for the lower 40 bits of a long. */
private static final long MASK_LONG_LOWER_40_BITS = (1L << 40) - 1;
/** Bit mask for the upper 24 bits of a long */
private static final long MASK_LONG_UPPER_24_BITS = ~MASK_LONG_LOWER_40_BITS;
/** Bit mask for the lower 27 bits of a long. */
private static final long MASK_LONG_LOWER_27_BITS = (1L << 27) - 1;
/** Bit mask for the lower 51 bits of a long. */
// 2^51 - 1 = 2^0 + 2^1 + 2^2 + ... + 2^50; as a 64-bit binary value, the top 13 bits are 0 and the low 51 bits are 1
private static final long MASK_LONG_LOWER_51_BITS = (1L << 51) - 1;
/** Bit mask for the upper 13 bits of a long */
// the top 13 bits are 1 and the low 51 bits are 0
private static final long MASK_LONG_UPPER_13_BITS = ~MASK_LONG_LOWER_51_BITS;
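To make the five masks concrete, the following sketch recomputes them and prints each as 16 hex digits. The class PackedPointerMasks, its shortened field names and the hex helper are illustrative only; the mask expressions are the same ones used in the constants above.

```java
// Hypothetical demo class; the mask expressions mirror the constants above.
final class PackedPointerMasks {
  static final long LOWER_40 = (1L << 40) - 1;
  static final long UPPER_24 = ~LOWER_40;
  static final long LOWER_27 = (1L << 27) - 1;
  static final long LOWER_51 = (1L << 51) - 1;
  static final long UPPER_13 = ~LOWER_51;

  /** Formats a long as 16 zero-padded hex digits so the bit positions line up. */
  static String hex(long value) {
    return String.format("%016x", value);
  }

  public static void main(String[] args) {
    System.out.println(hex(LOWER_40)); // 000000ffffffffff
    System.out.println(hex(UPPER_24)); // ffffff0000000000
    System.out.println(hex(LOWER_27)); // 0000000007ffffff
    System.out.println(hex(LOWER_51)); // 0007ffffffffffff
    System.out.println(hex(UPPER_13)); // fff8000000000000
  }
}
```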
In addition, it has one more member variable, which holds the 64-bit long itself.
private long packedRecordPointer;
public void set(long packedRecordPointer) {
  this.packedRecordPointer = packedRecordPointer;
}
The packPointer method
Packs the encoded record address and the partitionId into a single long. The resulting 64-bit long has the format:
[24 bit partition number][13 bit memory page number][27 bit offset in page]
/**
 * Pack a record address and partition id into a single word.
 *
 * @param recordPointer a record pointer encoded by TaskMemoryManager.
 * @param partitionId a shuffle partition id (maximum value of 2^24).
 * @return a packed pointer that can be decoded using the {@link PackedRecordPointer} class.
 */
// @param recordPointer the encoded record address: the upper 13 bits are the pageNumber, the lower 51 bits are the offset in page
// @param partitionId must not exceed MAXIMUM_PARTITION_ID, i.e. 2^24 - 1
public static long packPointer(long recordPointer, int partitionId) {
  assert (partitionId <= MAXIMUM_PARTITION_ID);
  // Note that without word alignment we can address 2^27 bytes = 128 megabytes per page.
  // Also note that this relies on some internals of how TaskMemoryManager encodes its addresses.
  // AND recordPointer with the upper-13-bit mask to keep only the pageNumber of the encoded
  // record address, then shift those 13 bits right by 24 bits.
  final long pageNumber = (recordPointer & MASK_LONG_UPPER_13_BITS) >>> 24;
  // AND recordPointer with the lower-27-bit mask to get the compressed offset in page,
  // then OR the two parts together to get the compressed record address.
  final long compressedAddress = pageNumber | (recordPointer & MASK_LONG_LOWER_27_BITS);
  // The upper 24 bits of packedRecordPointer hold the partitionId, so it is shifted left by 40 bits to land there.
  return (((long) partitionId) << 40) | compressedAddress;
}
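In the original encoded record address the offset in page occupies 51 bits; after ANDing with the lower-27-bit mask, only 27 bits of it remain. If the offset in page is 2^27 or larger, its high-order bits are lost. This is why the largest addressable page is limited to 2^27 bytes.
The same applies to partitionId: the largest partitionId that can be encoded is 2^24 - 1.
org.apache.spark.memory.TaskMemoryManager uses 13 bits for the pageNumber, and that width is kept unchanged here.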
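The loss of high-order bits can be reproduced in isolation. The sketch below copies the packPointer logic into a standalone class; PackPointerTruncationDemo, encode and packedOffset are hypothetical names, and encode only assumes the address format stated in this article (upper 13 bits pageNumber, lower 51 bits offset in page), not TaskMemoryManager's actual API.

```java
// Hypothetical standalone copy of the packing logic, for illustration only.
final class PackPointerTruncationDemo {
  static final long LOWER_27 = (1L << 27) - 1;
  static final long UPPER_13 = ~((1L << 51) - 1);

  /** Assumed address format: [13 bit pageNumber][51 bit offset in page]. */
  static long encode(int pageNumber, long offsetInPage) {
    return ((long) pageNumber << 51) | (offsetInPage & ((1L << 51) - 1));
  }

  /** Same steps as packPointer above, without the assertion. */
  static long packPointer(long recordPointer, int partitionId) {
    long pageNumber = (recordPointer & UPPER_13) >>> 24;
    long compressedAddress = pageNumber | (recordPointer & LOWER_27);
    return (((long) partitionId) << 40) | compressedAddress;
  }

  /** The 27-bit offset that survives inside the packed value. */
  static long packedOffset(long packed) {
    return packed & LOWER_27;
  }

  public static void main(String[] args) {
    long fits = packPointer(encode(3, (1L << 27) - 1), 7);
    long tooBig = packPointer(encode(3, (1L << 27) + 5), 7);
    System.out.println(packedOffset(fits));   // 134217727, offset preserved
    System.out.println(packedOffset(tooBig)); // 5, bit 27 was dropped
  }
}
```

The second call shows why a record must start within the first 2^27 bytes of its page: the packed value silently points at offset 5 instead of 2^27 + 5.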
The getRecordPointer and getPartitionId methods
The following methods extract the partitionId and the encoded record address from the 64-bit long.
public int getPartitionId() {
  return (int) ((packedRecordPointer & MASK_LONG_UPPER_24_BITS) >>> 40);
}
// returns the encoded record address
public long getRecordPointer() {
  final long pageNumber = (packedRecordPointer << 24) & MASK_LONG_UPPER_13_BITS;
  final long offsetInPage = packedRecordPointer & MASK_LONG_LOWER_27_BITS;
  return pageNumber | offsetInPage;
}
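A quick way to convince oneself that packing and unpacking are inverses is a round trip. The following is a sketch with static helpers instead of Spark's instance methods; the class name PackedPointerRoundTrip and the method names are made up, while the bit operations are the ones shown above.

```java
// Hypothetical round-trip check; static helpers stand in for the instance methods above.
final class PackedPointerRoundTrip {
  static final long UPPER_24 = ~((1L << 40) - 1);
  static final long LOWER_27 = (1L << 27) - 1;
  static final long UPPER_13 = ~((1L << 51) - 1);

  static long pack(long recordPointer, int partitionId) {
    long pageNumber = (recordPointer & UPPER_13) >>> 24;
    return (((long) partitionId) << 40) | pageNumber | (recordPointer & LOWER_27);
  }

  static int partitionId(long packed) {
    return (int) ((packed & UPPER_24) >>> 40);
  }

  static long recordPointer(long packed) {
    long pageNumber = (packed << 24) & UPPER_13; // move bits 27..39 back to 51..63
    long offsetInPage = packed & LOWER_27;
    return pageNumber | offsetInPage;
  }

  public static void main(String[] args) {
    long original = (5L << 51) | 1024L;       // page 5, offset 1024
    long packed = pack(original, 16777215);   // largest partition id, 2^24 - 1
    System.out.println(partitionId(packed));               // 16777215
    System.out.println(recordPointer(packed) == original); // true
  }
}
```

Using the largest partitionId sets the sign bit of the packed long, which is why the unsigned shift >>> matters when reading it back.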
LongArray
ShuffleInMemorySorter creates a LongArray and keeps it as a member variable.
ShuffleInMemorySorter(MemoryConsumer consumer, int initialSize, boolean useRadixSort) {
  this.consumer = consumer;
  assert (initialSize > 0);
  this.initialSize = initialSize;
  this.useRadixSort = useRadixSort;
  this.array = consumer.allocateArray(initialSize);  // creates the LongArray
  this.usableCapacity = getUsableCapacity();
}
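ShuffleInMemorySorter itself is created in ShuffleExternalSorter, and the MemoryConsumer argument passed in is in fact the ShuffleExternalSorter, which is a subclass of MemoryConsumer. (ShuffleInMemorySorter does not extend anything and is not a MemoryConsumer, so allocating and freeing the LongArray's memory block both depend on ShuffleExternalSorter.)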
ShuffleExternalSorter(){
  // part of the code omitted
  this.inMemSorter = new ShuffleInMemorySorter(
    this, initialSize, conf.getBoolean("spark.shuffle.sort.useRadixSort", true));
}
In the end, the LongArray is instantiated by calling MemoryConsumer#allocateArray().
/**
 * Allocates a LongArray of `size`. Note that this method may throw `OutOfMemoryError` if Spark
 * doesn't have enough memory for this allocation, or throw `TooLargePageException` if this
 * `LongArray` is too large to fit in a single page. The caller side should take care of these
 * two exceptions, or make sure the `size` is small enough that won't trigger exceptions.
 *
 * @throws SparkOutOfMemoryError
 * @throws TooLargePageException
 */
public LongArray allocateArray(long size) {
  long required = size * 8L;
  MemoryBlock page = taskMemoryManager.allocatePage(required, this);
  if (page == null || page.size() < required) {
    throwOom(page, required);
  }
  used += required;
  return new LongArray(page);
}
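LongArray is essentially a wrapper around MemoryBlock:
1. The size of a MemoryBlock is measured in bytes, while the length of a LongArray is measured in units of one long (8 bytes).
2. MemoryBlock provides no methods for getting or setting the value at a given index of the memory block, whereas LongArray provides the corresponding set/get methods.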
/**
 * An array of long values. Compared with native JVM arrays, this:
 * <ul>
 *   <li>supports using both in-heap and off-heap memory</li>
 *   <li>has no bound checking, and thus can crash the JVM process when assert is turned off</li>
 * </ul>
 */
public final class LongArray {
  // This is a long so that we perform long multiplications when computing offsets.
  private static final long WIDTH = 8;
  private final MemoryBlock memory;
  private final Object baseObj;
  private final long baseOffset;
  private final long length;
  public LongArray(MemoryBlock memory) {
    assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size >= Integer.MAX_VALUE elements";
    this.memory = memory;
    this.baseObj = memory.getBaseObject();
    this.baseOffset = memory.getBaseOffset();
    this.length = memory.size() / WIDTH;
  }
  public MemoryBlock memoryBlock() {
    return memory;
  }
  public Object getBaseObject() {
    return baseObj;
  }
  public long getBaseOffset() {
    return baseOffset;
  }
  /**
   * Returns the number of elements this array can hold
   * (that is, how many 8-byte long values fit in the underlying memory block).
   */
  public long size() {
    return length;
  }
  /**
   * Fill this all with 0L.
   * (The underlying memory block is overwritten with zeros, 8 bytes at a time.)
   */
  public void zeroOut() {
    for (long off = baseOffset; off < baseOffset + length * WIDTH; off += WIDTH) {
      Platform.putLong(baseObj, off, 0);
    }
  }
  /**
   * Sets the value at position {@code index}.
   * (Writes a long into the underlying memory block at byte offset index * 8.)
   */
  public void set(int index, long value) {
    assert index >= 0 : "index (" + index + ") should >= 0";
    assert index < length : "index (" + index + ") should < length (" + length + ")";
    Platform.putLong(baseObj, baseOffset + index * WIDTH, value);
  }
  /**
   * Returns the value at position {@code index}.
   * (Reads the long stored in the underlying memory block at byte offset index * 8.)
   */
  public long get(int index) {
    assert index >= 0 : "index (" + index + ") should >= 0";
    assert index < length : "index (" + index + ") should < length (" + length + ")";
    return Platform.getLong(baseObj, baseOffset + index * WIDTH);
  }
}
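The relationship between the byte-sized block and the long-indexed view can be imitated without Spark. In the sketch below a java.nio.ByteBuffer stands in for MemoryBlock, which is an assumption made purely so the example runs on its own; HeapLongArraySketch is a hypothetical name, and unlike Spark's LongArray it checks bounds with an exception rather than an assert.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in: a ByteBuffer plays the role of Spark's MemoryBlock.
final class HeapLongArraySketch {
  private static final long WIDTH = 8; // bytes per long
  private final ByteBuffer memory;
  private final long length;           // measured in longs, not bytes

  HeapLongArraySketch(int sizeInBytes) {
    this.memory = ByteBuffer.allocate(sizeInBytes);
    this.length = sizeInBytes / WIDTH; // trailing bytes that do not fill a long are unused
  }

  long size() {
    return length;
  }

  void set(int index, long value) {
    checkIndex(index);
    memory.putLong((int) (index * WIDTH), value); // absolute put at byte offset index * 8
  }

  long get(int index) {
    checkIndex(index);
    return memory.getLong((int) (index * WIDTH));
  }

  private void checkIndex(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index + ", length " + length);
    }
  }

  public static void main(String[] args) {
    HeapLongArraySketch array = new HeapLongArraySketch(20); // 20 bytes hold 2 longs
    array.set(1, 42L);
    System.out.println(array.size()); // 2
    System.out.println(array.get(1)); // 42
  }
}
```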
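Postscript: in Spark 2.4, both MemoryBlock and LongArray were reworked. MemoryBlock became an abstract class with the subclasses OnHeapMemoryBlock, OffHeapMemoryBlock and ByteArrayMemoryBlock. The abstract MemoryBlock also declares abstract methods such as getLong and putLong, implemented by the subclasses, for getting and setting a value of a given type at a given index of the memory block. The set/get methods of LongArray then simply call putLong/getLong on the MemoryBlock subclass.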
ShuffleInMemorySorter
SortComparator
ShuffleInMemorySorter sorts records by partitionId. The Comparator it defines for comparing partitionIds is:
private static final class SortComparator implements Comparator<PackedRecordPointer> {
  @Override  // compares the partition IDs of two PackedRecordPointers
  public int compare(PackedRecordPointer left, PackedRecordPointer right) {
    int leftId = left.getPartitionId();
    int rightId = right.getPartitionId();
    return leftId < rightId ? -1 : (leftId > rightId ? 1 : 0);
  }
}
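The effect of such a comparator can be tried on plain boxed longs. This is a sketch under simplifying assumptions: it sorts a Long[] with java.util.Arrays.sort rather than Spark's Sorter over a LongArray, and PartitionIdOrderSketch, pack and sortedPartitionIds are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical demo: orders packed longs by their upper 24 bits only.
final class PartitionIdOrderSketch {
  /** The partition id lives in bits 40..63 of the packed value. */
  static int partitionId(long packed) {
    return (int) (packed >>> 40);
  }

  static final Comparator<Long> BY_PARTITION_ID =
      Comparator.comparingInt(PartitionIdOrderSketch::partitionId);

  /** Builds a packed value from a partition id and an already compressed 40-bit address. */
  static long pack(int partitionId, long compressedAddress) {
    return ((long) partitionId << 40) | compressedAddress;
  }

  /** Sorts a copy of the input and returns the partition ids in sorted order. */
  static int[] sortedPartitionIds(Long[] packed) {
    Long[] copy = packed.clone();
    Arrays.sort(copy, BY_PARTITION_ID);
    return Arrays.stream(copy).mapToInt(PartitionIdOrderSketch::partitionId).toArray();
  }

  public static void main(String[] args) {
    Long[] pointers = { pack(2, 100), pack(0, 300), pack(1, 200), pack(0, 50) };
    System.out.println(Arrays.toString(sortedPartitionIds(pointers))); // [0, 0, 1, 2]
  }
}
```

Only the partition id takes part in the comparison, so records of the same partition end up adjacent but are not ordered by address.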
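Member variables
Based on the analysis above, ShuffleInMemorySorter has the following member variables:
- SortComparator SORT_COMPARATOR: the comparator used to compare the partitionIds of records.
- MemoryConsumer consumer: holds a reference to the ShuffleExternalSorter, used to allocate and free the LongArray's memory block.
- LongArray array: stores the PackedRecordPointer values. Sorting operates on this LongArray, and part of it is reserved as a temporary buffer for sorting.
- int pos: the index of the next free slot in the LongArray, where the next entry will be inserted.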
private static final SortComparator SORT_COMPARATOR = new SortComparator();
private final MemoryConsumer consumer;
/**
 * An array of record pointers and partition ids that have been encoded by
 * {@link PackedRecordPointer}. The sort operates on this array instead of directly manipulating
 * records.
 * (Each entry is one packed long, so sorting moves 8-byte values rather than the records themselves.)
 *
 * Only part of the array will be used to store the pointers, the rest part is preserved as
 * temporary buffer for sorting.
 * (This is why usableCapacity below is smaller than the length of the array.)
 */
private LongArray array;
/**
 * Whether to use radix sort for sorting in-memory partition ids. Radix sort is much faster
 * but requires additional memory to be reserved memory as pointers are added.
 * (Radix sort trades extra reserved memory for speed.)
 */
private final boolean useRadixSort;  // whether to use radix sort
/**
 * The position in the pointer array where new records can be inserted.
 * (The array index at which the next packed pointer is written.)
 */
private int pos = 0;
/**
 * How many records could be inserted, because part of the array should be left for sorting.
 * (Fewer than the array's length, since part of the array is kept free for the sort.)
 */
private int usableCapacity = 0;
private int initialSize;
The insertRecord method
- Compresses the encoded record address and the partitionId into a PackedRecordPointer of the form [24 bit partitionId][13 bit memory page number][27 bit offset in page].
- Inserts the PackedRecordPointer into the LongArray.
/**
 * Inserts a record to be sorted.
 *
 * @param recordPointer a pointer to the record, encoded by the task memory manager. Due to
 *                      certain pointer compression techniques used by the sorter, the sort can
 *                      only operate on pointers that point to locations in the first
 *                      {@link PackedRecordPointer#MAXIMUM_PAGE_SIZE_BYTES} bytes of a data page.
 * @param partitionId the partition id, which must be less than or equal to
 *                    {@link PackedRecordPointer#MAXIMUM_PARTITION_ID}.
 */
// recordPointer: the record's address as encoded by the task memory manager; because of the sorter's
// pointer compression, it must point into the first PackedRecordPointer#MAXIMUM_PAGE_SIZE_BYTES bytes of a data page.
// partitionId: must be less than or equal to PackedRecordPointer#MAXIMUM_PARTITION_ID.
public void insertRecord(long recordPointer, int partitionId) {
  if (!hasSpaceForAnotherRecord()) {
    throw new IllegalStateException("There is no space for new record");
  }
  array.set(pos, PackedRecordPointer.packPointer(recordPointer, partitionId));
  pos++;
}
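The insert path can be mimicked with a plain long[] to show how pos and the usable capacity interact. This is a sketch, not Spark's class: PointerBufferSketch and numRecords are hypothetical names, and the bodies of hasSpaceForAnotherRecord and the capacity rule (half of the array reserved with radix sort, one third otherwise) are not shown in this article, so they are written here as assumptions.

```java
// Hypothetical sketch of the append logic; capacity rule and space check are assumptions.
final class PointerBufferSketch {
  private static final long LOWER_27 = (1L << 27) - 1;
  private static final long UPPER_13 = ~((1L << 51) - 1);

  private final long[] array;
  private final int usableCapacity;
  private int pos = 0;

  PointerBufferSketch(int size, boolean useRadixSort) {
    this.array = new long[size];
    // Assumption: the remainder of the array is left free as sort scratch space.
    this.usableCapacity = (int) (size / (useRadixSort ? 2 : 1.5));
  }

  boolean hasSpaceForAnotherRecord() {
    return pos < usableCapacity;
  }

  void insertRecord(long recordPointer, int partitionId) {
    if (!hasSpaceForAnotherRecord()) {
      throw new IllegalStateException("There is no space for new record");
    }
    long pageNumber = (recordPointer & UPPER_13) >>> 24;
    array[pos] = (((long) partitionId) << 40) | pageNumber | (recordPointer & LOWER_27);
    pos++;
  }

  int numRecords() {
    return pos;
  }

  public static void main(String[] args) {
    PointerBufferSketch buffer = new PointerBufferSketch(4, true); // 2 usable slots
    buffer.insertRecord(8L, 0);
    buffer.insertRecord(16L, 1);
    System.out.println(buffer.numRecords());               // 2
    System.out.println(buffer.hasSpaceForAnotherRecord()); // false
  }
}
```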
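The getSortedIterator method
Sorts the packed pointers in the LongArray by partitionId: with radix sort when useRadixSort is set, otherwise with a Sorter driven by the SORT_COMPARATOR defined above, which uses the unused tail of the array as its buffer.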
/**
 * Return an iterator over record pointers in sorted order.
 */
public ShuffleSorterIterator getSortedIterator() {
  int offset = 0;
  if (useRadixSort) {  // when radix sort is enabled
    offset = RadixSort.sort(
      array, pos,
      PackedRecordPointer.PARTITION_ID_START_BYTE_INDEX,
      PackedRecordPointer.PARTITION_ID_END_BYTE_INDEX, false, false);
  } else {
    MemoryBlock unused = new MemoryBlock(
      array.getBaseObject(),
      array.getBaseOffset() + pos * 8L,
      (array.size() - pos) * 8L);
    LongArray buffer = new LongArray(unused);
    Sorter<PackedRecordPointer, LongArray> sorter =
      new Sorter<>(new ShuffleSortDataFormat(buffer));
    sorter.sort(array, 0, pos, SORT_COMPARATOR);
  }
  // returns a ShuffleSorterIterator instance
  return new ShuffleSorterIterator(pos, array, offset);
}
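The radix-sort branch only needs to look at the bytes that hold the partitionId. The sketch below is a minimal least-significant-digit radix sort over a byte range of each long; it is not Spark's RadixSort, it allocates its own buffer instead of reusing the tail of the array, and it assumes the two byte-index constants refer to bytes 5 through 7, which is where bits 40 to 63 sit in the layout described earlier. PartitionIdRadixSortSketch and its methods are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical LSD radix sort on a byte range of each key; not Spark's RadixSort.
final class PartitionIdRadixSortSketch {
  /** Sorts keys in place by bytes startByte..endByte (0 = least significant), unsigned and stable. */
  static void sortByBytes(long[] keys, int startByte, int endByte) {
    long[] src = keys;
    long[] dst = new long[keys.length];
    for (int b = startByte; b <= endByte; b++) {
      int shift = b * 8;
      int[] counts = new int[257];
      for (long k : src) {
        counts[(int) ((k >>> shift) & 0xff) + 1]++;   // histogram of this byte
      }
      for (int i = 0; i < 256; i++) {
        counts[i + 1] += counts[i];                   // counts[v] = first output slot for byte value v
      }
      for (long k : src) {
        dst[counts[(int) ((k >>> shift) & 0xff)]++] = k;
      }
      long[] tmp = src; src = dst; dst = tmp;         // output of this pass feeds the next
    }
    if (src != keys) {
      System.arraycopy(src, 0, keys, 0, keys.length);
    }
  }

  static long pack(int partitionId, long compressedAddress) {
    return ((long) partitionId << 40) | compressedAddress;
  }

  static int[] partitionIds(long[] keys) {
    return Arrays.stream(keys).mapToInt(k -> (int) (k >>> 40)).toArray();
  }

  public static void main(String[] args) {
    long[] keys = { pack(70000, 1), pack(3, 1), pack(65536, 1), pack(3, 2), pack(0, 1) };
    sortByBytes(keys, 5, 7);
    System.out.println(Arrays.toString(partitionIds(keys))); // [0, 3, 3, 65536, 70000]
  }
}
```

Three counting passes are enough because the partitionId is 24 bits wide, and treating each byte as unsigned keeps ids with the top bit set at the end rather than at the front.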