- Lock-free data structures are touted as ideal for today's multi-core CPUs. They are, however, hard to implement for several reasons [10]. First, writing efficient and robust lock-free code requires the developer to reason about all possible race conditions, whose interactions can be complex. Moreover, the points at which concurrent threads synchronize with each other are usually not explicitly stated in the serial version of an algorithm. Programmers often implement lock-free algorithms incorrectly and end up with busy-wait loops.
- Another challenge is that lock-free data structures require safe memory reclamation, which must be deferred until all readers are finished with the data.
- Finally, atomic primitives, if used carelessly, can themselves become performance bottlenecks.
- The original Bw-Tree paper [29] claims that this lower synchronization and cache-coherence overhead gives it better scalability than lock-based indexes.
- Despite these prior claims that lock-free indexes outperform lock-based ones on multi-core CPUs, the overhead of the Bw-Tree's extra indirection layer and delta records makes it 1.5–4.5× slower than lock-based indexes.
- Note: pay attention to the incremental information carried in the delta records.
- The Mapping Table also serves to support log-structured updates when the index is deployed on SSDs. Without the extra indirection that the Mapping Table provides, an update to a tree node would otherwise propagate up through every level of the tree.
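As a rough illustration of that indirection, here is a minimal sketch (not the OpenBw-Tree's actual API; the names MappingTable and install_delta are invented) of how an array of atomic pointers indexed by node ID lets a single CAS prepend a delta record for all readers at once:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

struct NodeBase {};  // common header shared by base nodes and delta records

struct DeltaRecord : NodeBase {
    NodeBase* next;  // previous head of the node's Delta Chain
    // ... delta payload (key, value, record type) ...
};

// Logical node IDs -> physical pointers. Every link inside the tree stores
// a node ID, so one atomic pointer swap re-targets the node for all readers.
struct MappingTable {
    static constexpr size_t kCapacity = 1 << 20;  // illustrative fixed size
    std::atomic<NodeBase*> slots[kCapacity];

    // Prepend a delta record to node `id` with a single CAS. On failure
    // (another thread won the race) the caller re-reads the head and retries.
    bool install_delta(uint64_t id, DeltaRecord* delta) {
        NodeBase* head = slots[id].load(std::memory_order_acquire);
        delta->next = head;
        return slots[id].compare_exchange_strong(head, delta,
                                                 std::memory_order_release);
    }
};
```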
3 Missing Components
This section describes the design and implementation of three components that are missing, or lack details, in the Bw-Tree paper. Since we assume the Bw-Tree will be used inside a DBMS, Section 3.1 describes how to support non-unique keys and Section 3.2 covers iterators. Finally, Section 3.3 discusses how to enable dynamic Mapping Table expansion.
3.1 Non-unique Key Support
During a traversal, the Bw-Tree stops at the first leaf delta record that matches the search key, without continuing further down the chain. This behavior, however, is incompatible with non-unique key support.
We handle non-unique keys in the OpenBw-Tree by having threads compute the visibility of delta records on the fly, using two disjoint sets of values for the search key.
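A minimal sketch of my reading of this visibility computation, assuming integer keys and values; the set names present/deleted and the helper types are illustrative, not the OpenBw-Tree's actual code:

```cpp
#include <cstdint>
#include <set>
#include <vector>

using Key = uint64_t;
using Value = uint64_t;
enum class DeltaType { Insert, Delete };

struct Delta {          // one record on the Delta Chain
    DeltaType type;
    Key key;
    Value value;
    const Delta* next;  // nullptr marks the base node boundary here
};

// Visible values for search key K: walk newest -> oldest; the first record
// that mentions a value decides its fate, so the two sets stay disjoint.
std::set<Value> visible_values(const Delta* head,
                               const std::vector<Value>& base_values_of_K,
                               Key K) {
    std::set<Value> present, deleted;
    for (const Delta* d = head; d != nullptr; d = d->next) {
        if (d->key != K) continue;
        if (d->type == DeltaType::Insert && !deleted.count(d->value))
            present.insert(d->value);
        else if (d->type == DeltaType::Delete && !present.count(d->value))
            deleted.insert(d->value);
    }
    // Base node: values for K are visible unless a newer delta removed them.
    for (Value v : base_values_of_K)
        if (!deleted.count(v)) present.insert(v);
    return present;
}
```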
3.2 Iteration
To support range scans over the index, the DBMS's execution engine uses iterators. Implementing this interface in the Bw-Tree is complicated because it must support lock-free concurrent operations.
It is hard for an iterator to keep track of a thread's current position while other threads concurrently insert and delete items. Moreover, concurrent SMOs (Section 2.4) make it even more challenging for an iterator to move from one logical leaf node to its neighbor.
To overcome these problems, the OpenBw-Tree's iterators do not operate directly on tree nodes. Instead, each iterator maintains a private, read-only copy of a logical node for consistent and fast random access. The iterator also maintains the offset of the current item within this node, plus the information needed to determine the next item. When the iterator moves forward or backward and the current private copy is exhausted, the worker thread starts a new traversal from the root, using the current low key or high key to reach the previous or next sibling node.
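A sketch of such an iterator, assuming a hypothetical helper load_leaf_snapshot that performs the root-to-leaf traversal and consolidates the target leaf into a private buffer (forward movement only; backward movement via the low key is symmetric):

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using Key = uint64_t;
using Value = uint64_t;

// Hypothetical helper for this sketch: performs a fresh root-to-leaf
// traversal, consolidates the leaf covering `start` into `out`, and
// reports that leaf's high key. Returns false past the end of the index.
bool load_leaf_snapshot(Key start, std::vector<std::pair<Key, Value>>& out,
                        Key& next_high_key);

// The iterator works only on its private, read-only snapshot of one
// logical leaf node, never on live (concurrently mutated) tree nodes.
class BwTreeIterator {
    std::vector<std::pair<Key, Value>> snapshot;  // consolidated copy
    size_t pos = 0;    // offset of the current item within the snapshot
    Key high_key = 0;  // key used to reach the next sibling when exhausted

public:
    std::optional<std::pair<Key, Value>> next() {
        while (pos == snapshot.size()) {          // private copy exhausted:
            if (!load_leaf_snapshot(high_key, snapshot, high_key))
                return std::nullopt;              // reached the end
            pos = 0;
        }
        return snapshot[pos++];
    }
};
```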
3.3 Mapping Table Expansion
Because every thread accesses the Bw-Tree's Mapping Table multiple times per traversal, it is important that it not become a bottleneck.
Storing the Mapping Table as an array of physical pointers indexed by node ID is the fastest data structure. Using a fixed-size array, however, makes it hard to dynamically resize the Mapping Table as the number of items in the tree grows and shrinks. This last point is the problem addressed here.
The OpenBw-Tree pre-allocates a large virtual address space for the Mapping Table without requesting backing physical pages. This lets it rely on the OS to lazily allocate physical memory, without using any locks. The technique was previously used in the KISS-Tree [18]. As the index grows, a thread may touch a Mapping Table page that is not yet mapped to physical memory, triggering a page fault.
The OS then assigns a fresh zero-filled physical page to that virtual page. In practice, the amount of virtual address space to reserve is estimated from the total amount of physical memory and a lower bound on the virtual node size.
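A Linux-specific sketch of this reservation technique (the function name is illustrative):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Reserve a large virtual range for the Mapping Table without backing
// physical pages. MAP_NORESERVE plus an anonymous private mapping lets the
// OS hand out zero-filled physical pages lazily, on the first touch of each
// page (a page fault), with no locks in the index itself. The caller picks
// `bytes` from physical-memory size and a node-size lower bound, as above.
void* reserve_mapping_table(size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}
```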
Although this approach makes it easy to grow the number of Mapping Table entries as the index grows, it does not solve the problem of shrinking the Mapping Table. To our knowledge, there is no lock-free way to do this; the only way to shrink the Mapping Table is to block all worker threads and rebuild the index.
4 Component Optimizations
4.1 Delta Record Pre-allocation
As described in Section 2.1, a Delta Chain in the Bw-Tree is a linked list of delta records allocated on the heap. Traversing this linked list is slow because the thread may incur a cache miss on every pointer dereference. Moreover, the excessive allocation of small objects causes contention inside the allocator, which becomes a scalability bottleneck as the core count increases.
To avoid these problems, the OpenBw-Tree pre-allocates delta records inside each base node. As shown in Fig. 4, it stores the base node at the high-address end of the pre-allocated block and stores delta records from high addresses toward low addresses (right to left in the figure). Each chain also maintains an allocation marker that points to the last delta record, or to the base node.
When a worker thread claims a slot, it decrements this marker by the size in bytes of the new delta record, using an atomic subtraction. If the pre-allocated area is full, this triggers a node consolidation.
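A sketch of the slot-claiming step, assuming an illustrative 4KB block layout (marker underflow under heavy racing is ignored here):

```cpp
#include <atomic>
#include <cstdint>

// One logical node's pre-allocated block: the base node sits at the high
// end of `storage`, and delta records grow downward from it.
struct NodeBlock {
    std::atomic<uint32_t> alloc_marker;  // offset of the last-claimed slot;
                                         // starts at the base node's offset
    char storage[4096];                  // deltas + base node

    // Claim `bytes` for a new delta record with one atomic subtraction.
    // nullptr means the pre-allocated area is full: consolidate the node.
    void* claim(uint32_t bytes) {
        uint32_t off = alloc_marker.fetch_sub(bytes,
                                              std::memory_order_relaxed);
        if (off < bytes) return nullptr;  // no room left below the marker
        return storage + (off - bytes);
    }
};
```

Note that claiming a slot and prepending the record to the Delta Chain head are two separate steps, which is exactly why, as described next, the logical and physical orders of delta records can differ.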
This reverse-growth design is optimized for efficient Delta Chain traversal: reading delta records from newest to oldest may (though not always) access memory linearly from low to high addresses, which is ideal for modern CPUs with hardware prefetching. A thread must still traverse a node's Delta Chain by following each delta record's pointer to find the next entry, rather than simply scanning from low to high addresses.
This is because the logical order of the delta records may not match their physical placement in memory: slot allocation and Delta Chain append are not a single atomic step, so multiple threads can interleave them.
For example, Fig. 4 shows a node where delta record Δ3 was logically prepended before delta record Δ2, yet Δ3 physically appears after Δ2 in memory.
4.2 GC
Epoch-based GC:
When the thread completes its operation, it removes itself from the epoch it has entered. Any objects that are marked for deletion by a thread are added into the garbage list of the current epoch. Once all threads exit an epoch, the index’s GC component can then reclaim the objects in that epoch that are marked for deletion.
Solutions:
The original Bw-Tree uses centralized GC: each thread first registers itself with the current epoch, the data it deletes is attached to that epoch's GC list, and when its session ends the thread deregisters from the epoch. Once an epoch has no active threads, the data on its GC list can be reclaimed.
The OpenBw-Tree instead uses decentralized GC. To address cache-coherence problems on multi-core machines (the point, I believe, is to avoid centralizing GC state and thereby avoid operations on shared global memory), each thread maintains its own local GC list (l_local) and its own local epoch (e_local), and performs local GC by copying the global epoch into its e_local. --- I still need to confirm how large this cost actually is.
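A minimal sketch of my reading of this decentralized scheme, assuming a single global epoch counter and per-thread state (not the OpenBw-Tree's actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <utility>
#include <vector>

// One global epoch counter, advanced periodically; everything else lives in
// per-thread state, so threads never contend on a shared registry or a
// shared garbage list.
std::atomic<uint64_t> global_epoch{1};

struct ThreadGC {
    std::atomic<uint64_t> e_local{0};  // epoch the thread entered; 0 = idle
    std::vector<std::pair<uint64_t, void*>> l_local;  // (retire epoch, ptr)

    void enter() { e_local.store(global_epoch.load()); }
    void leave() { e_local.store(0); }
    void retire(void* p) { l_local.push_back({global_epoch.load(), p}); }

    // Reclaim objects retired before the oldest epoch any thread is still
    // in (the min over all threads' non-zero e_local; that scan is omitted).
    void collect(uint64_t min_active_epoch) {
        auto it = l_local.begin();
        while (it != l_local.end()) {
            if (it->first < min_active_epoch) {
                std::free(it->second);   // or run the object's destructor
                it = l_local.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```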
4.3 Efficient Delta Chain Consolidation
The original Bw-Tree's consolidation approach:
On consolidation, a thread has to first replay the Delta Chain to collect all (key, value) or (key, node ID) items in the logical node and then sort them. We present a faster consolidation algorithm that reduces the sorting overhead.
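A sketch of that baseline replay-and-sort consolidation for unique keys (illustrative types; the paper's faster algorithm reduces the sorting cost, which I have not summarized here):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Key = uint64_t;
using Value = uint64_t;
enum class DeltaType { Insert, Delete };
struct Delta { DeltaType type; Key key; Value value; };

// Baseline consolidation: replay the Delta Chain to rebuild the logical
// node's item set, then sort it to form the new base node contents.
std::vector<std::pair<Key, Value>> consolidate(
        const std::vector<Delta>& chain_newest_first,
        std::vector<std::pair<Key, Value>> items /* old base contents */) {
    for (auto it = chain_newest_first.rbegin();   // replay oldest -> newest
         it != chain_newest_first.rend(); ++it) {
        auto pos = std::find_if(items.begin(), items.end(),
                                [&](auto& kv) { return kv.first == it->key; });
        if (it->type == DeltaType::Insert) {
            if (pos == items.end()) items.push_back({it->key, it->value});
            else pos->second = it->value;          // update in place
        } else if (pos != items.end()) {
            items.erase(pos);                      // delete
        }
    }
    std::sort(items.begin(), items.end());         // the sorting overhead
    return items;
}
```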
The OpenBw-Tree's consolidation approach:
I have not had time to read this part yet; to be filled in later.
4.4 Node Search Shortcuts
If a base node spans multiple cache lines, the first few probes of a binary search are likely to cause cache misses.
We can optimize this last step by using the offset attribute to narrow the range a thread must search within the base node, as shown in Fig. 7. The technique we use is called micro-indexing [27].
When a worker thread traverses a Delta Chain, it initializes the binary search range [min, max] for its search key K to [0, +inf). During the traversal, whenever the thread sees a Δinsert or Δdelete record with key K' and an offset, it compares K with K'.
If K = K', the range immediately converges to [offset, offset], avoiding the binary search entirely. If offset > min and K > K', min is set to offset. Otherwise, if offset < max and K < K', max is set to offset. For non-unique keys, however, it is unclear how to interpret the offset attribute, since it may point into the middle of a run of items with key K; in that case the index ignores the offset attribute.
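A sketch of this range-narrowing logic (names are illustrative):

```cpp
#include <cstdint>
#include <limits>

using Key = uint64_t;

// Micro-indexing sketch: while walking the Delta Chain for search key K,
// shrink the base-node binary-search range [min, max] using the `offset`
// stored in each Δinsert/Δdelete record (its position in the base node).
struct SearchRange {
    uint32_t min = 0;
    uint32_t max = std::numeric_limits<uint32_t>::max();  // stands in for +inf
    bool exact = false;  // true once the range converges to a single slot

    void observe(Key K, Key delta_key, uint32_t offset) {
        if (K == delta_key) { min = max = offset; exact = true; }
        else if (offset > min && K > delta_key) min = offset;
        else if (offset < max && K < delta_key) max = offset;
    }
};
```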
B+Tree: Although originally designed for disk-oriented DBMSs [6], B+Trees are widely used in main-memory database systems [35]. Instead of using traditional latching [3], our B+Tree implementation uses the optimistic lock coupling (OLC) [22] method. In OLC, each node has a lock, but instead of acquiring locks eagerly, read operations validate version counters (and restart if the counter changes). Read validations across multiple nodes can be interleaved, which allows implementing the traditional lock coupling technique for synchronizing tree-based indexes. Our B+Tree has a similar node organization as the OpenBw-Tree (sorted keys). We configure the B+Tree to use a 4KB node size.
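A minimal sketch of an OLC versioned lock, based on the description above (illustrative, not the authors' actual implementation):

```cpp
#include <atomic>
#include <cstdint>

// Each node carries a versioned lock. Readers never write to the lock word;
// they validate that the version did not change (and the node was not
// locked) around their reads, restarting on failure.
struct VersionLock {
    std::atomic<uint64_t> word{0};  // bit 0 = locked, bits 1.. = version

    uint64_t read_begin() const {   // spin while a writer holds the node
        uint64_t v;
        while ((v = word.load(std::memory_order_acquire)) & 1) { /* spin */ }
        return v;
    }
    bool read_validate(uint64_t v) const {  // false => reader must restart
        return word.load(std::memory_order_acquire) == v;
    }
    void write_lock() {             // CAS the lock bit in
        uint64_t v;
        do { v = word.load() & ~1ULL; }
        while (!word.compare_exchange_weak(v, v | 1));
    }
    void write_unlock() {           // ((ver<<1)|1) + 1 == (ver+1)<<1:
        word.fetch_add(1);          // clears the lock bit, bumps the version
    }
};
```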
We compare the indexes using the YCSB-based workload in Section 5.1. We first run all four workloads with the three key configurations using a single worker thread. We then execute the trials again using 20 threads that are all pinned to a single CPU socket.
The peak amount of memory consumed by the index during the Read/Update workload is also measured. Finally, we measured the performance counters for the 20-thread Read/Update workload using perf and Intel's Performance Counter Monitor.
The results for the single-threaded and multi-threaded experiments are shown in Fig. 13 and Fig. 14 respectively. Memory numbers are in Fig. 15. Performance counter readings are in Table 3.
Although our optimizations made the OpenBw-Tree faster than the default Bw-Tree, it is still slower than its competitors except the SkipList. For example, the ART is more than 4× faster than the OpenBw-Tree for point lookups (though the ART is slower on the Scan/Insert workload).
The OpenBw-Tree is also slower than the Masstree and the B+Tree, often by a factor of ∼2×. Microbenchmark numbers show that the OpenBw-Tree in general has a higher instruction count and cache misses per operation (and hence lower IPC). Higher instruction count is a consequence of having complicated delta chain traversal routines. Higher cache misses are caused by features such as the Mapping Table.
The SkipList shows high variation and low performance for most multi-threaded experiments. This is because its threads do not create towers as they insert elements. Instead, the SkipList uses a background thread that periodically scans the entire list and adjusts the height of towers. As a consequence, the background thread may not process recent inserts fast enough, and worker threads iterate through the SkipList’s lowest level to locate a key, causing high cache misses and cycle counts.
The Masstree has high single-threaded Mono-Int Insert-only throughput, but scales only by 3× using 20 threads. This is because Masstree avoids splitting an overflowed leaf node when items are inserted sequentially: it creates a new empty leaf node instead of copying half of the items from the previous leaf. This optimization, however, is less effective in the multi-threaded experiments where the threads’ insert operations are interleaved. In general, the Masstree is comparable to the B+Tree for integer workloads (except Insert-only). And for Email, its performance is even comparable to the trie-based ART index, as its high-level structure is also a trie.
For integer keys, the B+Tree’s Read-only and Read/Update performance is comparable to the Masstree, and much faster than the OpenBw-Tree. For the Mono-Int Insert-only workload, the B+Tree without any optimizations even outperforms the Masstree and ART, and is 3.7× faster than the OpenBw-Tree. The B+Tree also achieves high throughput for Scan/Insert workloads, and is usually 3–5× faster than all other indexes. But it has relatively poor performance for Email workloads. The microbenchmark indicates high cache misses and low IPC during Rand-Int and Email (not shown) insertion, which explains why the B+Tree is slower in these workloads.
ART outperforms the other indexes for all workloads and key types except Scan/Insert, where its iteration requires more memory access than the OpenBw-Tree.
As shown in Fig. 15a, both the Bw-Tree and the OpenBw-Tree use a moderate amount of memory. The OpenBw-Tree consumes more memory than the Bw-Tree (10–31%) in all experiments due to pre-allocation and metadata. For the Rand-Int workload, the utilization of the pre-allocated space is lower than for the Mono-Int workload (Table 2); correspondingly, the OpenBw-Tree uses more memory in the Rand-Int workload. For the multi-threaded experiments, since worker threads keep garbage nodes in their thread-local chains, peak memory usage also increases slightly (8–17%).
Among all compared indexes, the ART has the lowest usage for the Mono-Int and Email keys, while the B+Tree has the lowest for the Rand-Int keys due to its compact internal structure and large node size (4 KB). The SkipList consumes more memory than the B+Tree/ART due to its customized memory allocator and pre-allocation; its memory usage is comparable to the OpenBw-Tree's.
The Masstree always has the highest memory usage, especially for the Email workload (2.0–5.7× higher). For the integer workloads, although the Masstree still uses the most memory, the gap is smaller than on the Email workload (only 1.3–2.5× higher, except versus the ART).
The high throughput and low memory usage of the ART index under both single-threaded and multi-threaded environments should be attributed to its flexible way of structuring trie nodes of different sizes. Furthermore, only a single byte is compared on each level. Table 3 shows that both properties minimize CPU cycles and reduce cache misses, resulting in high IPC.
(Table 3: IPC = instructions per cycle; Cycles = number of clock cycles.)
6.2 High Contention Workload
The salient aspect of the Bw-Tree’s design is that it is lock-free, whereas most other data structures that we tested here use locks (although sparingly). Lock-free data structures are often favored in high contention environments because threads can make global progress [30], even though the progress may be small in practice.
To better understand this issue, we created a specialized workload with extreme contention. Each thread in the benchmark uses the RDTSC instruction with a unique thread-ID suffix to generate monotonically increasing integers in real time as keys, mimicking multiple threads appending new records to the end of a table.
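A sketch of this key generator (the 8-bit thread-ID width is an assumption for illustration; x86-only):

```cpp
#include <cstdint>
#include <x86intrin.h>  // __rdtsc

// Each thread combines the CPU timestamp counter with its thread ID, so
// keys are unique per thread yet globally near-monotonic, mimicking many
// threads appending to the tail of a table.
uint64_t next_key(uint8_t thread_id) {
    return (__rdtsc() << 8) | thread_id;
}
```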
To further demonstrate how the NUMA configuration affects performance, we run the evaluation under three NUMA settings: 20 worker threads on a single NUMA node, 20 worker threads on two NUMA nodes, and 40 worker threads on two NUMA nodes. The last setting uses all available hardware threads on our test system.
The results shown in Fig. 16a indicate that all five indexes degrade under high contention: both Insert-only and Read/Update performance drop in both the one- and two-node NUMA settings.
The local and remote NUMA access rates, i.e., the number of DRAM accesses per second, are shown in Fig. 16b and Fig. 16c, respectively.
Under high contention, Masstree has the best result, followed by ART, and then B+Tree. OpenBw-Tree suffers from an extremely high abort rate as threads contend for the head of the Delta Chain. Table 2 shows that the abort rate is over 1000%, i.e., on average there are more than 10 aborts for every insert.
Overall, under high contention, none of these six data structures performs well. As shown in Fig. 17, compared with their multi-threaded performance numbers without high contention, all of them suffer degradation. In particular, the lock-free indexes struggled more than the lock-based ones; for example, the SkipList failed to make progress at all in this high-contention workload.