DAOS 系统内部介绍(二)—— VOS

Versioning Object Store

The Versioning Object Store (VOS) is responsible for providing and maintaining a persistent object store that supports byte-granular access and versioning for a single shard in a DAOS pool. It maintains its metadata in persistent memory and may store data either in persistent memory or on block storage, depending on available storage and performance requirements. It must provide this functionality with minimum overhead so that performance can approach the theoretical performance of the underlying hardware as closely as possible, both with respect to latency and bandwidth. Its internal data structures, in both persistent and non-persistent memory, must also support the highest levels of concurrency so that throughput scales over the cores of modern processor architectures. Finally, and critically, it must validate the integrity of all persisted object data to eliminate the possibility of silent data corruption, both in normal operation and under all possible recoverable failures.

//版本控制对象存储(VOS)负责提供和维护持久对象存储,该持久对象存储支持DAOS池中单个shard的字节粒度访问和版本控制。它将元数据保存在持久内存中,并且可以将数据存储在持久内存或块存储中,具体取决于可用存储和性能要求。它必须以最小的开销提供此功能,以便在延迟和带宽方面,性能尽可能接近底层硬件的理论性能。它的内部数据结构(在持久性和非持久性内存中)还必须支持最高级别的并发性,以便吞吐量在现代处理器体系结构的核心上扩展。最后,也是至关重要的一点,它必须验证所有持久化对象数据的完整性,以消除在正常操作和所有可能的可恢复故障下发生静默数据损坏的可能性。

This section provides the details for achieving the design goals discussed above in building a versioning object store for DAOS.

This document contains the following sections:

  • Persistent Memory based Storage
  • VOS Concepts
  • VOS Indexes
  • Object Listing
  • Key Value Stores (Single Value)
  • Key Array Stores
  • Conditional Update and MVCC
  • Epoch Based Operations
  • VOS Checksum Management
  • Metadata Overhead
  • Replica Consistency

Persistent Memory based Storage

In-Memory Storage

The VOS is designed to use a persistent-memory storage model that takes advantage of byte-granular, sub-microsecond storage access possible with new NVRAM technology. This enables a disruptive change in performance compared to conventional storage systems for application and system metadata and small, fragmented, and misaligned I/O. Direct access to byte-addressable low-latency storage opens up new horizons where metadata can be scanned in less than a second without bothering with seek time and alignment.

//VOS设计为使用持久内存存储模型,利用新NVRAM技术可能实现的字节粒度、亚微秒级存储访问。与传统存储系统相比,对于应用程序和系统元数据以及小的、零碎的和未对齐的I/O,这使得性能发生了颠覆性的变化。直接访问字节可寻址的低延迟存储开辟了一个新的视野,在这里,可以在不到一秒钟的时间内扫描元数据,而不必考虑查找时间和对齐。

The VOS relies on a log-based architecture using persistent memory primarily to maintain internal persistent metadata indexes. The actual data can be stored either in persistent memory directly or in block-based NVMe storage. The DAOS service has two tiers of storage: Storage Class Memory (SCM) for byte-granular application data and metadata, and NVMe for bulk application data. Similar to how PMDK is currently used to facilitate access to SCM, the Storage Performance Development Kit (SPDK) is used to provide seamless and efficient access to NVMe SSDs. The current DAOS storage model involves three DAOS server xstreams per core, along with one main DAOS server xstream per core mapped to an NVMe SSD device. DAOS storage allocations can occur on either SCM by using a PMDK pmemobj pool, or on NVMe, using an SPDK blob. All local server metadata will be stored in a per-server pmemobj pool on SCM and will include all current and relevant NVMe devices, pool, and xstream mapping information. Please refer to the Blob I/O (BIO) module for more information regarding NVMe, SPDK, and per-server metadata. Special care is taken when developing and modifying the VOS layer because any software bug could corrupt data structures in persistent memory. The VOS, therefore, checksums its persistent data structures despite the presence of hardware ECC.

//VOS依赖于基于日志的体系结构,该体系结构主要使用持久性内存来维护内部持久性元数据索引。实际数据可以直接存储在持久内存中,也可以存储在基于块的NVMe存储中。DAOS服务有两层存储:用于字节粒度应用程序数据和元数据的存储类内存(SCM),以及用于批量应用程序数据的NVMe。与当前使用PMDK方便访问SCM的方式类似,存储性能开发工具包(SPDK)用于提供对NVMe ssd的无缝高效访问。当前的DAOS存储模型涉及每个core三个DAOS服务xstream,以及映射到NVMe SSD设备的每个core一个主DAOS服务器xstream。DAOS存储分配可以通过使用PMDK pmemobj池在SCM上进行,也可以通过使用SPDK blob在NVMe上进行。所有本地服务器元数据将存储在SCM上的每服务器pmemobj池中,并将包括所有当前和相关的NVMe设备、池和xstream映射信息。有关NVMe、SPDK和每服务器元数据的更多信息,请参阅Blob I/O(BIO)模块。在开发和修改VOS层时要特别小心,因为任何软件错误都可能损坏持久内存中的数据结构。因此,尽管存在硬件ECC,VOS还是对其持久数据结构进行校验和。

The VOS provides a lightweight I/O stack fully in user space, leveraging the PMDK open-source libraries developed to support this programming model. //VOS在用户空间提供了一个轻量级的I/O堆栈,利用PMDK开源库来支持这个编程模型。

Lightweight I/O Stack: PMDK Libraries

Although persistent memory is accessible via direct load/store, updates go through multiple levels of caches, including the processor L1/2/3 caches and the NVRAM controller. Durability is guaranteed only after all those caches have been explicitly flushed. The VOS maintains internal data structures in persistent memory that must retain some level of consistency so that operation may be resumed without loss of durable data after an unexpected crash or power outage. The processing of a request will typically result in several memory allocations and updates that must be applied atomically. //尽管持久性内存可以通过直接加载/存储来访问,但是更新会经过多个级别的缓存,包括处理器L1/2/3缓存和NVRAM控制器。只有在所有缓存都被显式刷新之后,才能保证持久性。VOS在持久性内存中维护内部数据结构,这些数据结构必须保持一定程度的一致性,以便在意外崩溃或断电后可以恢复操作而不会丢失持久性数据。对请求的处理通常会导致一些必须以原子方式应用的内存分配和更新。

Consequently, a transactional interface must be implemented on top of persistent memory to guarantee internal VOS consistency. It is worth noting that such transactions are different from the DAOS transaction mechanism. Persistent memory transactions must guarantee consistency of VOS internal data structures when processing incoming requests, regardless of their epoch number. Transactions over persistent memory can be implemented in many different ways, e.g., undo logs, redo logs, a combination of both, or copy-on-write.

//因此,必须在持久性内存之上实现事务接口,以保证内部VOS的一致性。值得注意的是,这种事务不同于DAOS事务机制。持久内存事务在处理传入请求时必须保证VOS内部数据结构的一致性,而不管它们的历元数是多少。持久内存上的事务可以通过许多不同的方式实现,例如,撤消日志、重做日志、两者的组合或写时复制。

PMDK is an open source collection of libraries for using persistent memory, optimized specifically for NVRAM. Among these is the libpmemobj library, which implements relocatable persistent heaps called persistent memory pools. This includes memory allocation, transactions, and general facilities for persistent memory programming. The transactions are local to one thread (not multi-threaded) and rely on undo logs. Correct use of the API ensures that all memory operations are rolled back to the last committed state upon opening a pool after a server failure. VOS utilizes this API to ensure consistency of VOS internal data structures, even in the event of server failures.

// PMDK是一个开源的库集合,用于使用持久内存,专门针对NVRAM进行了优化。其中包括libpmemobj库,它实现了称为持久内存池的可重定位持久堆。这包括内存分配、事务和持久内存编程的通用工具。事务是一个线程(不是多线程)的本地事务,并且依赖于撤消日志。正确使用API可确保在服务器发生故障后打开池时,所有内存操作都回滚到上次提交的状态。VOS利用这个API来确保VOS内部数据结构的一致性,即使在服务器发生故障时也是如此。
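
To make the undo-log behavior concrete, below is a minimal libpmemobj sketch, assuming a trivial root object; the struct rec layout and the update_record helper are illustrative only and are not part of the actual VOS schema. Stores made between TX_BEGIN and TX_END on snapshotted ranges are rolled back automatically if the pool is reopened after a crash.

#include <libpmemobj.h>
#include <stdint.h>
#include <string.h>

/* Illustrative record type -- not the actual VOS layout. */
struct rec {
    uint64_t epoch;
    uint64_t len;
    char     data[64];
};

POBJ_LAYOUT_BEGIN(example);
POBJ_LAYOUT_ROOT(example, struct rec);
POBJ_LAYOUT_END(example);

int
update_record(PMEMobjpool *pop, uint64_t epoch, const char *buf, uint64_t len)
{
    TOID(struct rec) root = POBJ_ROOT(pop, struct rec);
    int rc = 0;

    /* Every store inside the transaction is covered by the undo log:
     * if the server dies before commit, libpmemobj rolls the object back
     * to its last committed state when the pool is next opened. */
    TX_BEGIN(pop) {
        TX_ADD(root);                        /* snapshot before modifying */
        D_RW(root)->epoch = epoch;
        D_RW(root)->len   = len;
        memcpy(D_RW(root)->data, buf, len);  /* assumes len <= 64 */
    } TX_ONABORT {
        rc = -1;
    } TX_END

    return rc;
}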

VOS Concepts

The versioning object store provides object storage local to a storage target by initializing a VOS pool (vpool) as one shard of a DAOS pool. A vpool can hold objects for multiple object address spaces called containers. Each vpool is given a unique UID on creation, which is different from the UID of the DAOS pool. The VOS also maintains and provides a way to extract statistics like total space, available space, and number of objects present in a vpool.

//版本控制对象存储通过将VOS池(vpool)初始化为DAOS池的一个分片来提供存储目标的本地对象存储。vpool可以为多个对象地址空间(称为容器)保存对象。每个vpool在创建时都有一个唯一的UID,这与DAOS池的UID不同。VOS还维护并提供了一种提取统计信息的方法,如总空间、可用空间和vpool中存在的对象数。

The primary purpose of the VOS is to capture and log object updates in arbitrary time order and integrate these into an ordered epoch history that can be traversed efficiently on demand. This provides a major scalability improvement for parallel I/O by correctly ordering conflicting updates without requiring them to be serialized in time. For example, if two application processes agree on how to resolve a conflict on a given update, they may write their updates independently with the assurance that they will be resolved in the correct order at the VOS.

//VOS的主要目的是捕获和记录任意时间顺序的对象更新,并将这些更新集成到一个有序的历元历史中,该历史可以根据需要高效地遍历。这为并行I/O提供了一个主要的可伸缩性改进,它正确地排序冲突的更新,而不需要及时序列化它们。例如,如果两个应用程序进程就如何解决给定更新上的冲突达成一致,则它们可以独立编写更新,并保证在VOS上以正确的顺序解决这些更新。 思考:同一个key按照epoch顺序更新IO,如果遇到有更小的epoch已经持久化,则返回错误?

The VOS also allows all object updates associated with a given epoch and process group to be discarded. This functionality ensures that when a DAOS transaction must be aborted, all associated updates are invisible before the epoch is committed for that process group and becomes immutable. This ensures that distributed updates are atomic - i.e., when a commit completes, either all updates have been applied or all have been discarded.

// VOS还允许丢弃与给定epoch和进程组关联的所有对象更新。此功能可确保在必须中止DAOS事务时,在为该进程组提交epoch之前,所有关联的更新都是不可见的,并且是不可变的。这确保了分布式更新是原子的—即,当提交完成时,所有更新都已应用或被丢弃。

Finally, the VOS may aggregate the epoch history of objects in order to reclaim space used by inaccessible data and to speed access by simplifying indices. For example, when an array object is "punched" from 0 to infinity in a given epoch, all data updated after the latest snapshot before this epoch becomes inaccessible once the container is closed.

//最后,VOS可以聚合对象的历元历史,以回收不可访问数据使用的空间,并通过简化索引来加快访问速度。例如,当数组对象在给定的历元中从0到无穷大被“穿孔”时,一旦容器关闭,在该历元之前的最新快照之后更新的所有数据都将变得不可访问。 思考:为啥加个一旦容器关闭呢? 不应该是关闭关闭都全部是0了吗?

Internally, the VOS maintains an index of container UUIDs that references each container stored in a particular pool. The container itself contains three indices. The first is an object index used to map an object ID and epoch to object metadata efficiently when servicing I/O requests. The other two indices are for maintaining active and committed DTX records for ensuring efficient updates across multiple replicas.

//在内部,VOS维护一个容器uuid索引,该索引引用存储在特定池中的每个容器。容器本身包含三个索引。第一个是对象索引,用于在服务I/O请求时有效地将对象ID和epoch映射到对象元数据。另外两个索引用于维护活动的和提交的DTX记录,以确保跨多个副本进行高效更新。 思考 :分布式事务首先找到容器?

DAOS supports two types of values, each associated with a Distribution Key (DKEY) and an Attribute Key (AKEY): Single value and Array value. The DKEY is used for placement, determining which VOS pool is used to store the data. The AKEY identifies the data to be stored. The ability to specify both a DKEY and an AKEY provides applications with the flexibility to either distribute or co-locate different values in DAOS. A single value is an atomic value, meaning that writes to an AKEY update the entire value and reads retrieve the latest value in its entirety. An array value is an index of equally sized records. Each update to an array value only affects the specified records, and reads retrieve the latest update to each record index requested. Each VOS pool maintains a per-container hierarchy of containers, objects, DKEYs, AKEYs, and values, as shown below. The DAOS API provides generic Key-Value and Array abstractions built on this underlying interface. The former sets the DKEY to the user-specified key and uses a fixed AKEY. The latter uses the upper bits of the array index to create a DKEY and uses a fixed AKEY, thus evenly distributing array indices over all VOS pools in the object layout. For the remainder of the VOS description, Key-Value and Key-Array shall be used to describe the VOS layout rather than these simplifying abstractions. In other words, they shall describe the DKEY-AKEY-Value in a single VOS pool.

//DAOS支持两种类型的值,每种值都与分布键(DKEY)和属性键(AKEY)相关联:单值和数组值。DKEY用于placement,确定哪个VOS池(pool 分片)用于存储数据。AKEY标识要存储的数据。同时指定DKEY和AKEY的能力为应用程序提供了在DAOS中分布或共同定位不同值的灵活性。单个值是一个原子值,这意味着写入AKEY将更新整个值,读取并检索整个最新值。数组值是大小相等的记录的索引。对数组值的每次更新仅影响指定的记录,并读取对请求的每个记录索引的最新更新。每个VOS池维护VOS,并为每个容器提供容器、对象、dkey、akey和值的层次结构,如下所示。Daos api提供了基于这个底层接口的通用键值和数组抽象。前者将DKEY设置为用户指定的key,并使用固定的AKEY。后者使用数组索引的高位来创建DKEY,并使用固定的AKEY,从而将数组索引均匀地分布在对象布局中的所有VOS池上。对于VOS描述的其余部分,应使用键值和键数组来描述VOS布局,而不是这些简化的抽象。换句话说,它们应描述单个VOS池中的 DKEY-AKEY-Value

VOS objects are not created explicitly but are created on the first write by creating the object metadata and inserting a reference to it in the owning container's object index. All object updates log the data for each update, which may be an object, DKEY, AKEY, single value, or array value punch, or an update to a single value or array value. Note that a "punch" of an extent of an array object is logged as zeroed extents, rather than causing relevant array extents or key values to be discarded. A punch of an object, DKEY, AKEY, or single value is logged, so that reads at a later timestamp see no data. This ensures that the full version history of objects remains accessible. The DAOS API, however, only allows accessing data at snapshots, so VOS aggregation can aggressively remove objects, keys, and values that are no longer accessible at a known snapshot.

//VOS对象不是显式创建的,而是在第一次写入时通过创建对象元数据并在所属容器的对象索引中插入对它的引用来创建的。所有对象更新都会记录每次更新的数据,这些更新可以是对象、DKEY、AKEY、单个值或数组值,也可以是单个值或数组值的更新。请注意,数组对象extent的“punch”记录为零extent,而不是导致丢弃相关的数组extent或键值。记录对象、DKEY、AKEY或单个值的punch,以便在稍后的时间戳读取时看不到数据。这样可以确保对象的完整版本历史记录保持可访问性。但是,daos api只允许在快照上访问数据,因此VOS聚合可以主动删除在已知快照上不再可访问的对象、键和值。思考:创建对象的过程中,要往容器服务中插入引用信息,那么,IO流程岂不是要变长???

When performing a lookup on a single value in an object, the object index is traversed to find the index node with the highest epoch number less than or equal to the requested epoch (near-epoch) that matches the key. If a value or negative entry is found, it is returned. Otherwise, a "miss" is returned, meaning that this key has never been updated in this VOS. This ensures that the most recent value in the epoch history of the key is returned irrespective of the time-order in which the updates were integrated, and that all updates after the requested epoch are ignored.

 

// 在对对象中的单个值执行查找时,将遍历对象索引,以查找最高历元数小于或等于与键匹配的请求历元(接近历元)的索引节点。如果找到值或负条目,则返回它。否则,将返回一个“miss”,这意味着此VOS中从未更新过此key。这样可以确保返回的历元历史中的最新值,而不考虑其集成的时间顺序,并且忽略请求的历元之后的所有更新。
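
As an illustration of the near-epoch rule, the sketch below scans a per-key version log (oldest entry first) standing in for the persistent index tree; the sv_version type and the function name are invented for the example.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One logged version of a single value; a punch is recorded as a negative entry. */
struct sv_version {
    uint64_t epoch;
    bool     punched;
    void    *value;
};

/* Return the visible version at 'epoch', i.e. the entry with the highest epoch
 * <= the requested one, or NULL for a "miss" (the key was never updated at or
 * before 'epoch' in this VOS).  'log' must be sorted by epoch, oldest first. */
static const struct sv_version *
near_epoch_lookup(const struct sv_version *log, size_t nr, uint64_t epoch)
{
    const struct sv_version *found = NULL;

    for (size_t i = 0; i < nr && log[i].epoch <= epoch; i++)
        found = &log[i];

    return found;   /* if found->punched, the key was deleted: the read sees no data */
}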

Similarly, when reading an array object, its index is traversed to create a gather descriptor that collects all object extent fragments in the requested extent with the highest epoch number less than or equal to the requested epoch. Entries in the gather descriptor either reference an extent containing data, a punched extent that the requestor can interpret as all zeroes, or a "miss", meaning that this VOS has received no updates in this extent. Again, this ensures that the most recent data in the epoch history of the array is returned for all offsets in the requested extent, irrespective of the time-order in which they were written, and that all updates after the requested epoch are ignored.

// 类似地,在读取数组对象时,遍历其索引以创建一个聚集描述符,该描述符收集请求范围内的所有对象范围片段,其最高epoch数小于或等于请求的epoch。聚集描述符中的条目要么引用包含数据的区段,要么引用请求者可以解释为全零的穿孔区段,要么引用“未命中”,这意味着此VOS在此区段中没有收到任何更新。同样,这确保了为请求范围内的所有偏移量返回数组的历元历史中的最新数据,而不管它们写入的时间顺序如何,并且忽略请求历元之后的所有更新。

VOS Indexes

The value of the object index table, indexed by OID, points to a DKEY index. The values in the DKEY index, indexed by DKEY, point to an AKEY index. The values in the AKEY index, indexed by AKEY, point to either a Single Value index or an Array index. A single value index is referenced by epoch and will return the latest value inserted at or prior to the epoch. An array value is indexed by the extent and the epoch and will return portions of extents visible at the epoch.

// 按OID索引的对象索引表的值指向DKEY索引。由DKEY索引的DKEY索引中的值指向AKEY索引。由AKEY索引的AKEY索引中的值指向单个值索引或数组索引。单个值索引由epoch引用,并将返回在epoch处或之前插入的最新值。数组值按范围和epoch索引,并将返回在epoch处可见的范围部分。

Hints about the expectations of the object can be encoded in the object ID. For example, an object can be replicated, erasure coded, use checksums, or have integer or lexical DKEYs and/or AKEYs. If integer or lexical keys are used, the object index is ordered by keys, making queries, such as array size, more efficient. Otherwise, keys are ordered by the hashed value in the index. The object ID is 128 bits. The upper 32 bits are used to encode the object type and key types, while the lower 96 bits are a user-defined identifier that must be unique within the container.

//有关对象期望值的提示可以编码在对象ID中。例如,可以多副本、EC编码、使用校验和,或者使用整数或词法dkey和/或akey。如果使用整数键或词法键,则对象索引将按键排序,从而使查询(如数组大小)更高效。否则,键将按索引中的哈希值排序。对象ID是128位。上面的32位用于编码对象类型和键类型,而下面的96位是用户定义的标识符,必须对容器唯一。
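
A sketch of the 32/96-bit split described above; the field names and the packing helper are illustrative and do not reproduce the exact daos_obj_id_t bit assignments.

#include <stdint.h>

/* Illustrative 128-bit object ID: 'hi' carries the upper 64 bits, 'lo' the lower. */
struct obj_id {
    uint64_t hi;
    uint64_t lo;
};

/* Pack 32 bits of type/key-type hints above a 96-bit, container-unique
 * user identifier (split here into its upper 32 and lower 64 bits). */
static struct obj_id
obj_id_encode(uint32_t type_bits, uint32_t user_hi32, uint64_t user_lo64)
{
    struct obj_id oid;

    oid.hi = ((uint64_t)type_bits << 32) | user_hi32;
    oid.lo = user_lo64;
    return oid;
}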

Each object, dkey, and akey has an associated incarnation log. The incarnation log can be described as an in-order log of creation and punch events for the associated entity. The log is checked for each entity in the path to the value to ensure the entity, and therefore the value, is visible at the requested time.

//每个对象、dkey和akey都有一个相关的incarnation 日志。incarnation 日志可以描述为关联实体的creation punch 事件的顺序日志。在值的路径中检查每个实体的日志,以确保在请求的时间可以看到实体,因此也可以看到值。

Object Listing

VOS provides a generic iterator that can be used to iterate through containers, objects, DKEYs, AKEYs, single values, and array extents in a VOS pool. The iteration API is shown in the figure below.

//VOS提供了一个通用迭代器,可用于遍历VOS池中的容器、对象、dkey、akey、单个值和数组范围。迭代API如下图所示。

/**
 * Iterate VOS entries (i.e., containers, objects, dkeys, etc.) and call \a
 * cb(\a arg) for each entry.
 *
 * If \a cb returns a nonzero (either > 0 or < 0) value that is not
 * -DER_NONEXIST, this function stops the iteration and returns that nonzero
 * value from \a cb. If \a cb returns -DER_NONEXIST, this function completes
 * the iteration and returns 0. If \a cb returns 0, the iteration continues.
 *
 * \param[in]  param      iteration parameters
 * \param[in]  type       entry type of starting level
 * \param[in]  recursive  iterate in lower level recursively
 * \param[in]  anchors    array of anchors, one for each iteration level
 * \param[in]  cb         iteration callback
 * \param[in]  arg        callback argument
 *
 * \retval     0          iteration complete
 * \retval     > 0        callback return value
 * \retval     -DER_*     error (but never -DER_NONEXIST)
 */
int
vos_iterate(vos_iter_param_t *param, vos_iter_type_t type, bool recursive,
            struct vos_iter_anchors *anchors, vos_iter_cb_t cb, void *arg);

The generic VOS iterator API enables both the DAOS enumeration API and DAOS internal features supporting rebuild, aggregation, and discard. It is flexible enough to iterate through all keys, single values, and extents for a specified epoch range. Additionally, it supports iteration through visible extents.

//通用VOS迭代器API支持DAOS枚举API和DAOS内部特性,支持重建、聚合和丢弃。它足够灵活,可以遍历指定历元范围内的所有键、单个值和范围。此外,它还支持通过可见extent进行迭代。
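
A usage sketch of the iterator above, counting the objects in a container over the full epoch range; the callback signature and the ip_hdl/ip_epr parameter fields are quoted from memory of the VOS headers and should be checked against the current source.

/* Callback invoked once per visited entry; returning 0 continues iteration. */
static int
count_obj_cb(daos_handle_t ih, vos_iter_entry_t *entry, vos_iter_type_t type,
             vos_iter_param_t *param, void *cb_arg, unsigned int *acts)
{
    unsigned int *nr = cb_arg;

    if (type == VOS_ITER_OBJ)
        (*nr)++;
    return 0;
}

static int
count_objects(daos_handle_t coh, unsigned int *nr)
{
    vos_iter_param_t        param = { 0 };
    struct vos_iter_anchors anchors = { 0 };

    param.ip_hdl        = coh;              /* container to iterate */
    param.ip_epr.epr_lo = 0;                /* full epoch range */
    param.ip_epr.epr_hi = DAOS_EPOCH_MAX;

    *nr = 0;
    /* Object level only; pass true to recurse into dkeys/akeys/values. */
    return vos_iterate(&param, VOS_ITER_OBJ, false, &anchors, count_obj_cb, nr);
}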

Key Value Stores (Single Value)

High-performance simulations generating large quantities of data require indexing and analysis of data, to achieve good insight. Key Value (KV) stores can play a vital role in simplifying the storage of such complex data and allowing efficient processing.

// 生成大量数据的高性能仿真需要对数据进行索引和分析,以获得良好的洞察力。键值(KV)存储在简化复杂数据的存储和允许高效处理方面起着至关重要的作用。

VOS provides a multi-version, concurrent KV store on persistent memory that can grow dynamically and provide quick near-epoch retrieval and enumeration of key values.

//VOS在持久性内存上提供了一个多版本的并发KV存储,可以动态增长,并提供快速的近历元检索和键值枚举

Although there is an array of previous work on KV stores, most of them focus on cloud environments and do not provide effective versioning support. Some KV stores provide versioning support but expect monotonically increasing ordering of versions and further, do not have the concept of near-epoch retrieval.

//尽管有一系列关于KV存储的先前工作,但大多数都集中在云环境上,并且没有提供有效的版本控制支持。有些KV存储提供版本控制支持,但期望版本的顺序单调递增,而且没有近历元检索的概念。 注:凸显了DAOS的优点

VOS must be able to accept insertion of KV pairs at any epoch and must be able to provide good scalability for concurrent updates and lookups on any key-value object. KV objects must also be able to support any type and size of keys and values. 

// VOS必须能够在任何epoch接受KV对的插入(????这个咋个理解,能够在过去的epoch更新数据?????,参考历元有效性 79页),并且必须能够为任何键值对象的并发更新和查找提供良好的可伸缩性。KV对象还必须能够支持任何类型和大小的键和值。

Operations Supported with Key Value Store

VOS supports large keys and values with four types of operations: update, lookup, punch, and key enumeration.

The update and punch operations add a new key to a KV store or log a new value of an existing key. Punch logs the special value "punched", effectively a negative entry, to record the epoch when the key was deleted. Sharing the same epoch for both an update and a punch of the same object, key, value, or extent is disallowed, and VOS will return an error when such is attempted.

//VOS支持四种类型的大键和大值操作;更新、查找、穿孔和键枚举。

update和punch操作将一个新key添加到KV存储或记录一个现有key的新值。Punch记录特殊值“punched”,实际上是一个负条目,用于记录删除键时的历元。不允许为同一对象、键、值或范围的更新和punched共享同一个epoch,并且尝试这样做时VOS将返回错误。

Lookup traverses the KV metadata to determine the state of the given key at the given epoch. If the key is not found at all, a "miss" is returned to indicate that the key is absent from this VOS. Otherwise, the value at the near-epoch or greatest epoch less than or equal to the requested epoch is returned. If this is the special "punched" value, it means the key was deleted in the requested epoch. The value here refers to the value in the internal tree-data structure. The key-value record of the KV-object is stored in the tree as the value of its node. So in case of punch this value contains a "special" return code/flag to identify the punch operation.//查找遍历KV元数据,以确定给定历元中给定key的状态。

//如果根本找不到key,则返回一个“miss”来指示该VOS中没有该key。否则,返回小于或等于请求的历元的近历元或最大历元的值。如果这是特殊的“穿孔”值,则表示该键在请求的历元中被删除。这里的值是指内部树数据结构中的值。KV对象的键值记录作为其节点的值存储在树中。因此,在打孔的情况下,该值包含一个“特殊”返回代码/标志,用于标识打孔操作。思考:在去重完后,是否还保留打孔标记?

VOS also supports the enumeration of keys belonging to a particular epoch.//VOS还支持枚举属于特定历元的key。

Key in VOS KV Stores

VOS KV supports key sizes from small keys to extremely large keys. For AKEYs and DKEYs, VOS supports either hashed keys or one of two types of "direct" keys: lexical or integer.

//VOS KV支持从小key到超大key的尺寸。对于AKEYs和DKEYs,VOS支持散列键或两种类型的“直接”键之一:词法键或整数键。

Hashed Keys

The most flexible key type is the hashed key. VOS runs two fast hash algorithms on the user supplied key and uses the combined hashed key values for the index. The intention of the combined hash is to avoid collisions between keys. The actual key still must be compared for correctness.

//最灵活的密钥类型是散列key 。VOS在用户提供的键上运行两个快速哈希算法,并使用组合的哈希键值作为索引。组合散列的目的是避免key之间的冲突。仍然必须比较实际key的正确性。
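
The sketch below illustrates the combined-hash idea with two stand-in hash functions (FNV-1a and djb2); the text does not name the algorithms VOS actually uses, and the stored original key must still be compared before a lookup is declared a hit.

#include <stddef.h>
#include <stdint.h>

static uint64_t
hash_fnv1a(const void *key, size_t len)
{
    const uint8_t *p = key;
    uint64_t h = 0xcbf29ce484222325ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

static uint64_t
hash_djb2(const void *key, size_t len)
{
    const uint8_t *p = key;
    uint64_t h = 5381;

    for (size_t i = 0; i < len; i++)
        h = h * 33 + p[i];
    return h;
}

/* Combined 128-bit index key: a simultaneous collision of both hashes is very
 * unlikely, but correctness still requires comparing the stored original key. */
struct hashed_key {
    uint64_t h1;
    uint64_t h2;
};

static struct hashed_key
hashed_key_make(const void *key, size_t len)
{
    struct hashed_key hk = { hash_fnv1a(key, len), hash_djb2(key, len) };

    return hk;
}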

Direct Keys

The use of hashed keys results in unordered keys. This is problematic in cases where the user's algorithms may benefit from ordering. Therefore, VOS supports two types of keys that are not hashed but rather interpreted directly.

//使用散列键会导致键无序。在用户的算法可能从排序中受益的情况下,这是有问题的。因此,VOS支持两种类型的键,它们不是散列的,而是直接解释的。

Lexical Keys

Lexical keys are compared using a lexical ordering. This enables usage such as sorted strings. Presently, however, lexical keys are limited in length to 80 characters. // 词法键使用词法排序进行比较。这将启用诸如排序字符串之类的用法。目前,词法键的长度是有限的,但是只有80个字符。

Integer Keys

Integer keys are unsigned 64-bit integers and are compared as such. This enables use cases such as the DAOS array API, which uses the upper bits of the index as a dkey and the lower bits as an offset. This enables such objects to use the DAOS key query API to calculate the size more efficiently. //整数键是无符号的64位整数,因此进行比较。这支持使用诸如DAOS数组API之类的用例,使用索引的高位作为dkey,低位作为偏移量。这使这些对象能够使用DAOS键查询API更有效地计算大小。
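
A sketch of the upper-bits/lower-bits split, assuming a hypothetical fixed 32/32 division of the 64-bit array index; the real DAOS array layout derives the split from the array's chunk size rather than a constant shift.

#include <stdint.h>

#define IDX_SHIFT 32   /* hypothetical split point, for illustration only */

static inline uint64_t
array_index_to_dkey(uint64_t idx)
{
    return idx >> IDX_SHIFT;                    /* upper bits -> integer dkey */
}

static inline uint64_t
array_index_to_offset(uint64_t idx)
{
    return idx & ((1ULL << IDX_SHIFT) - 1);     /* lower bits -> record offset */
}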

KV stores in VOS allow the user to maintain versions of the different KV pairs in random order. For example, an update can happen in epoch 10, followed by another update in epoch 5, where the HCE is less than 5. To provide this level of flexibility, each key in the KV store must maintain the epoch of update/punch along with the key. The ordering of entries in index trees first happens based on the key, and then based on the epochs. This kind of ordering allows epochs of the same key to land in the same subtree, thereby minimizing search costs. Conflict resolution and tracking are performed using DTX, described later. DTX ensures that replicas are consistent, and failed or uncommitted updates are not visible externally.

//VOS中的KV存储允许用户以随机顺序维护不同KV对的版本。例如,更新可以在epoch 10中进行,然后在epoch 5中进行另一次更新,其中HCE小于5。为了提供这种灵活性,KV存储中的每个密钥必须与密钥一起保持更新/穿孔的epoch。索引树中条目的排序首先基于键,然后基于历元。这种排序允许相同密钥的历元落在同一子树中,从而最小化搜索成本。使用稍后描述的DTX执行冲突解决和跟踪。DTX确保副本是一致的,并且失败或未提交的更新在外部不可见。

Internal Data Structures

Designing a VOS KV store requires a tree data structure that can grow dynamically and remain self-balanced. The tree needs to be balanced to ensure that time complexity does not increase with an increase in tree size. The tree data structures considered are red-black trees and B+ Trees; the former is a binary search tree, and the latter an n-ary search tree. //设计vos kv存储需要一个能够动态增长并保持自平衡的树数据结构。树需要平衡,以确保时间复杂度不会随着树大小的增加而增加。所考虑的树数据结构是红黑树和B+树,前者是二叉搜索树,后者是n元搜索树。

Although red-black trees provide less rigid balancing compared to AVL trees, they compensate by having cheaper rebalancing cost. Red-black trees are more widely used in examples such as the Linux kernel, the java-util library, and the C++ standard template library. B+ trees differ from B trees in the fact they do not have data associated with their internal nodes. This can facilitate fitting more keys on a page of memory. In addition, leaf-nodes of B+ trees are linked; this means doing a full scan would require just one linear pass through all the leaf nodes, which can potentially minimize cache misses to access data in comparison to a B Tree.

// 虽然红黑树相比AVL树提供了较低的刚性平衡,但他们再平衡成本更低。红黑树在Linux内核、java UTL库和C++标准模板库等实例中得到了更广泛的应用。B+树与B树的不同之处在于,它们没有与其内部节点关联的数据。这有助于在内存页上安装更多的键。另外,B+树的叶节点被连接起来;这意味着执行完整扫描只需要一次线性遍历所有叶节点,与B树相比,这可能会最大限度地减少访问数据时的缓存未命中。

To support update and punch as mentioned in the previous section (Operations Supported with Key Value Stores), an epoch-validity range is set along with the associated key for every update or punch request, which marks the key to be valid from the current epoch until the highest possible epoch. Updates to the same key on a future epoch or past epoch modify the end epoch validity of the previous update or punch accordingly. This way only one key has a validity range for any given key-epoch pair lookup while the entire history of updates to the key is recorded. This facilitates nearest-epoch search. Both punch and update have similar keys, except for a simple flag identifying the operation on the queried epoch. Lookups must be able to search a given key in a given epoch and return the associated value. In addition to the epoch-validity range, the container handle cookie generated by DAOS is also stored along with the key of the tree. This cookie is required to identify behavior in case of overwrites on the same epoch.

//为了支持上一节(键值存储支持的操作)中提到的更新和打孔,将为每个更新或打孔请求设置一个epoch-validity范围和相关键,该范围将标记从当前epoch到可能的最高epoch的有效键。在未来历元或过去历元上更新同一个键会相应地修改以前更新或冲孔的结束历元有效性。这样,在记录key更新的整个历史时,对于任何给定的key-epoch对查找,只有一个key具有有效范围。这便于最近历元搜索。punch和update都有相似的键,只是有一个简单的标志标识查询的epoch上的操作。查找必须能够在给定的历元中搜索给定的键并返回相关的值。除了epoch有效性范围之外,DAOS生成的容器句柄cookie也与树的键一起存储。此cookie用于标识在同一历元上发生覆盖时的行为。思考:没咋看懂这段。

A simple example input for creating a KV store is listed in the Table below. Both a B+ Tree based index and a red-black tree based index are shown in the Table and figure below, respectively. For explanation purposes, representative keys and values are used in the example.

//下表列出了创建KV存储的简单输入示例。基于B+树的索引和基于红黑树的索引分别显示在下表和图中。为了便于解释,示例中使用了代表键和值。

Example VOS KV Store input for Update/Punch

 

The red-black tree, like any traditional binary tree, organizes the keys lesser than the root to the left subtree and keys greater than the root to the right subtree. Value pointers are stored along with the keys in each node. On the other hand, a B+ Tree-based index stores keys in ascending order at the leaves, which is where the value is stored. The root nodes and internal nodes (color-coded in blue and maroon accordingly) facilitate locating the appropriate leaf node. Each B+ Tree node has multiple slots, where the number of slots is determined from the order. The nodes can have a maximum of order-1 slots. The container handle cookie must be stored with every key in case of red-black trees, but in case of B+ Trees having cookies only in leaf nodes would suffice, since cookies are not used in traversing.

// 红黑树与任何传统的二叉树一样,将小于根的键组织到左子树,将大于根的键组织到右子树。值指针与每个节点中的键一起存储。另一方面,基于B+树的索引按升序在叶子处存储键,这是存储值的地方。根节点和内部节点(相应地用蓝色和栗色编码)有助于定位适当的叶节点。每个B+树节点有多个slots,其中slots的数量由顺序决定。节点最多可以有一个1阶插槽。在红黑树的情况下,容器句柄cookie必须与每个键一起存储,但是在B+树的情况下,只有叶节点中有cookie就足够了,因为cookie不用于遍历。

In the table below, n is the number of entries in the tree, m is the number of keys, and k is the number of key-epoch entries between two unique keys.

Comparison of average case computational complexity for index

Although both these solutions are viable implementations, determining the ideal data structure would depend on the performance of these data structures on persistent memory hardware. //尽管这两种解决方案都是可行的实现,但确定理想的数据结构将取决于这些数据结构在持久内存硬件上的性能。

VOS also supports concurrent access to these structures, which mandates that the data structure of choice provides good scalability while there are concurrent updates. Compared to B+ Tree, rebalancing in red-black trees causes more intrusive tree structure change; accordingly, B+ Trees may provide better performance with concurrent accesses. Furthermore, because B+ Tree nodes contain many slots depending on the size of each node, prefetching in cache can potentially be easier. In addition, the sequential computational complexities in the Table above show that a B+ Tree-based KV store with a reasonable order, can perform better in comparison to a Red-black tree.

//VOS还支持对这些结构的并发访问,这要求所选择的数据结构在有并发更新的情况下提供良好的可伸缩性。与B+树相比,红黑树的再平衡导致更具侵入性的树结构变化;因此,B+树可以提供更好的并发访问性能。此外,由于B+树节点根据每个节点的大小包含许多插槽,因此缓存中的预取可能更容易。此外,上表中的顺序计算复杂性表明,与红黑树相比,具有合理顺序的基于B+树的KV存储性能更好。

VOS supports enumerating keys valid in a given epoch. VOS provides an iterator-based approach to extract all the keys and values from a KV object. Primarily, KV indexes are ordered by keys and then by epochs. With each key holding a long history of updates, the size of a tree can be huge. Enumeration with a tree-successors approach can result in an asymptotic complexity of O(m * log(n) + log(n)) with red-black trees, where m is the number of keys valid in the requested epoch. It takes O(log2(n)) to locate the first element in the tree and O(log2(n)) to locate a successor. Because "m" keys need to be retrieved, O(m * log2(n)) would be the complexity of this enumeration.

// VOS支持枚举给定历元中有效的键。VOS提供了一种基于迭代器的方法来从KV对象中提取所有键和值。首先,KV索引按键排序,然后按epoch排序。由于每个键都有很长的更新历史,因此树的大小可能很大。使用树后继方法的枚举可以导致红黑树的渐近复杂性为O(m*log(n)+log(n)),其中m是请求的历元中有效的key数。需要O(log2(n))来定位树中的第一个元素,O(log2(n))来定位后续元素。因为需要检索“m”个密钥,所以O(m*log2(n))将是这个枚举的复杂性。

In the case of B+ trees, leaf nodes are in ascending order, and enumeration simply parses the leaf nodes directly. The complexity would be O(m * k + log_b(n)), where m is the number of keys valid in an epoch, k is the number of entries between two different keys in B+ tree leaf nodes, and b is the order of the B+ tree. Having "k" epoch entries between two distinct keys incurs a complexity of O(m * k). The additional O(log_b(n)) is required to locate the leftmost key in the tree. The generic iterator interface shown in the figure above would also be used for KV enumeration.//在B+树的情况下,叶节点按升序排列,枚举将直接解析叶节点。复杂度为O(m*k+logbn),其中m是一个历元中有效的密钥数,k是B+树叶节点中两个不同密钥之间的条目数,b是B+树的顺序。在两个不同的键之间有“k”个历元条目会导致O(m*k)的复杂性。需要额外的O(logbn)来定位树中最左边的第一个键。上图所示的通用迭代器接口也将用于KV枚举。
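
Plugging illustrative numbers into the two expressions above (these values are made up for the comparison, not measurements): with n = 10^6 entries, m = 10^3 keys visible at the requested epoch, k = 4 epoch entries per key, and a B+ tree of order b = 128,

O(m \log_2 n + \log_2 n) \approx 10^3 \times 20 + 20 \approx 2 \times 10^4 \ \text{node visits (red-black tree)}

O(m k + \log_b n) \approx 10^3 \times 4 + \lceil \log_{128} 10^6 \rceil \approx 4 \times 10^3 + 3 \ \text{leaf entries scanned (B+ tree)}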

In addition to the enumeration of keys for an object valid in an epoch, VOS also supports enumerating keys of an object modified between two epochs. The epoch index table provides keys updated in each epoch. On aggregating the list of keys associated with each epoch (by keeping the latest update of each key and discarding the older versions), VOS can generate a list of keys with their latest epoch. By looking up each key from the list in its associated index data structure, VOS can extract values with an iterator-based approach.

//除了枚举在一个纪元中有效的对象的键外,VOS还支持枚举在两个纪元之间修改的对象的键。epoch索引表提供在每个epoch中更新的键。在聚合与每个epoch相关联的密钥列表时(通过保留密钥的最新更新并丢弃旧版本),VOS可以生成具有其最新epoch的key列表。通过在相关索引数据结构中从列表中查找每个键,VOS可以使用基于迭代器的方法提取值。

Key Array Stores

The second type of object supported by VOS is a Key-Array object. Array objects, similar to KV stores, allow multiple versions and must be able to write, read, and punch any part of the byte extent range concurrently. The figure below shows a simple example of the extents and epoch arrangement within a Key-Array object. In this example, the different lines represent the actual data stored in the respective extents and the color-coding points to different threads writing that extent range.

//VOS支持的第二种类型的对象是键数组对象。数组对象类似于KV存储,允许多个版本,并且必须能够同时写入、读取和打孔字节范围的任何部分。下图显示了键数组对象中范围和历元排列的简单示例。在此示例中,不同的行表示存储在各个区段中的实际数据,并且颜色编码点指向写入该区段范围的不同线程。

Example of extents and epochs in a Key Array object

In the above example, there is significant overlap between different extent ranges. VOS supports nearest-epoch access, which necessitates reading the latest value for any given extent range. For example, in the figure above, if there is a read request for extent range 4 - 10 at epoch 10, the resulting read buffer should contain extent 7-10 from epoch 9, extent 5-7 from epoch 8, and extent 4-5 from epoch 1. VOS array objects also support punch over both partial and complete extent ranges.

// 在上述示例中,不同范围之间存在显著重叠。VOS支持最近历元访问,这需要读取任何给定范围的最新值。例如,在上图中,如果在epoch 10有一个区段范围4-10的读取请求,则产生的读取缓冲区应该包含epoch 9的区段7-10、epoch 8的区段5-7和epoch 1的区段4-5。VOS数组对象还支持对部分和完整区段范围进行穿孔。

Example Input for Extent Epoch Table

 

 

Trees provide a reasonable way to represent both extent and epoch validity ranges in such a way as to limit the search space required to handle a read request. VOS provides a specialized R-Tree, called an Extent-Validity tree (EV-Tree), to store and query versioned array indices. In a traditional R-Tree implementation, rectangles are bounded and immutable. In VOS, the "rectangle" consists of the extent range on one axis and the epoch validity range on the other. However, the epoch validity range is unknown at the time of insert, so all rectangles are inserted assuming an upper bound of infinity. Originally, the DAOS design called for splitting such in-tree rectangles on insert to bound the validity range, but a few factors resulted in the decision to keep the original validity range. First, updates to persistent memory are an order of magnitude more expensive than lookups. Second, overwrites between snapshots can be deleted by aggregation, thus maintaining a reasonably small history of overlapping writes. As such, the EV-Tree implements a two-part algorithm on fetch (a simplified sketch follows the list below).

// R树提供了一种合理的方法来表示extent 和历元有效性范围,从而限制了处理读取请求所需的搜索空间。VOS提供了一个专门的R树,称为扩展有效性树(EV-Tree),用于存储和查询版本化的数组索引。在传统的R-树实现中,矩形是有界且不可变的。在VOS中,“矩形”由一个轴上的范围和另一个轴上的历元有效范围组成。但是,插入时历元有效范围未知,因此所有矩形均假定上界为无穷大插入最初,DAOS设计要求在insert上拆分这些树矩形以限制有效性范围,但有几个因素导致了保留原始有效性范围的决定。首先,对持久内存的更新要比查找昂贵一个数量级。其次,可以通过聚合删除快照之间的覆盖,从而保持较小的重叠写入历史。因此,EV树在fetch上实现了一个由两部分组成的算法。

  1. Find all overlapping extents. This will include all writes that happened before the requested epoch, even if they are covered by a subsequent write. // 查找所有重叠extent。这将包括在请求的历元之前发生的所有写入,即使它们被后续写入所覆盖。
  2. Sort this by extent start and then by epoch //按extent开始排序,然后按纪元排序
  3. Walk through the sorted array, splitting extents if necessary and marking them as visible as applicable //遍历已排序的数组,必要时拆分数据块,并将其标记为可见(如适用)
  4. Re-sort the array. This final sort can optionally keep or discard holes and covered extents, depending on the use case. //重新排序数组。根据用例的不同,最终排序可以选择保留或放弃孔和覆盖范围。
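
Below is a much-simplified, correct-but-naive stand-in for that fetch pass (the boundary-cut approach here replaces the in-place sort-and-split walk the tree actually performs; the struct and function names are invented): it cuts the requested range at every extent boundary and keeps, for each piece, the overlapping write with the highest epoch at or below the requested one. Bytes with no covering write are simply omitted, i.e. reported as a miss.

#include <stdint.h>
#include <stdlib.h>

struct ext { uint64_t start, end, epoch; };   /* one logged write, half-open [start, end) */

static int
cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return x < y ? -1 : x > y;
}

/* Visible fragments of [req_lo, req_hi) at 'epoch'; O(n^2), no error handling. */
static size_t
visible_fragments(const struct ext *in, size_t nr,
                  uint64_t req_lo, uint64_t req_hi, uint64_t epoch,
                  struct ext *out, size_t out_max)
{
    uint64_t *cuts = malloc((2 * nr + 2) * sizeof(*cuts));
    size_t    ncut = 0, nout = 0;

    /* Cut the request at every boundary of a potentially visible extent. */
    cuts[ncut++] = req_lo;
    cuts[ncut++] = req_hi;
    for (size_t i = 0; i < nr; i++) {
        if (in[i].epoch > epoch)
            continue;                            /* written after the read epoch */
        if (in[i].start > req_lo && in[i].start < req_hi)
            cuts[ncut++] = in[i].start;
        if (in[i].end > req_lo && in[i].end < req_hi)
            cuts[ncut++] = in[i].end;
    }
    qsort(cuts, ncut, sizeof(*cuts), cmp_u64);

    /* For each elementary piece, keep the newest write (<= epoch) covering it. */
    for (size_t c = 0; c + 1 < ncut && nout < out_max; c++) {
        uint64_t lo = cuts[c], hi = cuts[c + 1];
        const struct ext *best = NULL;

        if (lo == hi)
            continue;                            /* duplicate boundary */
        for (size_t i = 0; i < nr; i++)
            if (in[i].epoch <= epoch && in[i].start <= lo && in[i].end >= hi &&
                (best == NULL || in[i].epoch > best->epoch))
                best = &in[i];
        if (best != NULL)
            out[nout++] = (struct ext){ lo, hi, best->epoch };
    }
    free(cuts);
    return nout;
}

Fragments from the same epoch are not merged here, and the sketch does not distinguish punched (zero-filled) extents from true misses as the real gather descriptor does.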

TODO: Create a new figure Rectangles representing extent_range.epoch_validity arranged in 2-D space for an order-4 EV-Tree using input in the table above

 

The figure below shows the rectangles constructed with splitting and trimming operations of EV-Tree for the example in the previous table with an additional write at offset {0 - 100} introduced to consider the case for extensive splitting. The figure above shows the EV-Tree construction for the same example.

// 下面的图显示了用前一个表中的EV树的分裂和修剪操作构造的矩形,附加的写在偏移{ 0 - 100 }中,以考虑广泛分裂的情况。上图显示了同一示例的EV树结构。

Tree (order - 4) for the example in Table 6-3 (pictorial representation shown in the figure above)

 

Inserts in an EV-Tree locate the appropriate leaf-node to insert into by checking for overlap. If multiple bounding boxes overlap, the bounding box with the least enlargement is chosen. Further ties are resolved by choosing the bounding box with the least area. The maximum cost of each insert can be O(log_b(n)).

// 在EV树中插入通过检查重叠来定位要插入的适当叶节点。如果多个边界框重叠,则选择放大最小的边界框。通过选择面积最小的边界框来解决进一步的关系。每个插入的最大成本可以是O(logbn)。

Searching an EV-Tree works similarly to an R-Tree, aside from the false overlap issue described above. All overlapping internal nodes must be pursued until there are no more matching internal nodes and leaves. Since extent ranges can span multiple rectangles, a single search can hit multiple rectangles. In an ideal case (where the entire extent range falls on one rectangle), the read cost is O(log_b(n)), where b is the order of the tree. The sorting and splitting phase adds the additional overhead of O(n log n), where n is the number of matching extents. In the worst case, this is equivalent to all extents in the tree, but this is mitigated by aggregation and the expectation that the tree associated with a single shard of a single key will be relatively small.

// 除了上面描述的错误重叠问题外,搜索EV树的工作原理与R树类似。必须追踪所有重叠的内部节点,直到有匹配的内部节点和叶子。由于extent范围可以跨越多个矩形,因此单个搜索可以命中多个矩形。在理想情况下(整个范围落在一个矩形上),读取成本为O(logbn),其中b是树的顺序。排序和拆分阶段增加了O(n log n)的额外开销,其中n是匹配扩展数据块的数量。在最坏的情况下,这相当于树中的所有区段,但通过聚合和与单个键的单个碎片关联的树将相对较小的期望,这可以缓解这种情况。

For deleting nodes from an EV-Tree, the same approach as search can be used to locate nodes, and nodes/slots can be deleted. Once deleted, to coalesce multiple leaf-nodes that have less than order/2 entries, reinsertion is done. EV-tree reinserts are done (instead of merging leaf-nodes as in B+ trees) because on deletion of leaf node/slots, the size of bounding boxes changes, and it is important to make sure the rectangles are organized into minimum bounding boxes without unnecessary overlaps. In VOS, delete is required only during aggregation and discard operations. These operations are discussed in a following section (Epoch Based Operations).

//对于从EV树中删除节点,可以使用与搜索相同的方法来定位节点,并且可以删除节点/插槽。删除后,要合并具有少于order/2条目的多个叶节点,将执行重新插入。重新插入EV树(而不是像在B+树中那样合并叶节点)是因为删除叶节点/槽时,边界框的大小会发生变化,确保矩形组织成最小的边界框而没有不必要的重叠非常重要。在VOS中,只有在聚合和放弃操作期间才需要删除。这些操作将在下一节(基于历元的操作)中讨论。

Conditional Update and MVCC

VOS supports conditional operations on individual dkeys and akeys. The following operations are supported:

//VOS支持对单个DKY和AKEY执行条件操作。支持以下操作:

  • Conditional fetch: Fetch if the key exists, fail with -DER_NONEXIST otherwise
  • Conditional update: Update if the key exists, fail with -DER_NONEXIST otherwise
  • Conditional insert: Update if the key doesn't exist, fail with -DER_EXIST otherwise
  • Conditional punch: Punch if the key exists, fail with -DER_NONEXIST otherwise

These operations provide atomic operations enabling certain use cases that require such. Conditional operations are implemented using a combination of existence checks and read timestamps. The read timestamps enable limited MVCC to prevent read/write races and provide serializability guarantees.

// 这些操作提供了原子操作,以支持某些需要此类操作的用例。条件操作使用存在性检查和读取时间戳的组合来实现。读取时间戳使有限的MVCC能够防止读/写竞争,并提供可序列化性保证。

VOS Timestamp Cache

VOS maintains an in-memory cache of read and write timestamps in order to enforce MVCC semantics. The timestamp cache itself consists of two parts:

//VOS维护读写时间戳的内存缓存,以强制执行MVCC语义。时间戳缓存本身由两部分组成:

  1. Negative entry cache. A global array per target for each type of entity including objects, dkeys, and akeys. The index at each level is determined by the combination of the index of the parent entity, or 0 in the case of containers, and the hash of the entity in question (an illustrative index computation follows this list). If two different keys map to the same index, they share timestamp entries. This will result in some false conflicts but does not affect correctness so long as progress can be made. The purpose of this array is to store timestamps for entries that do not exist in the VOS tree. Once an entry is created, it will use the mechanism described in #2 below. Note that multiple pools in the same target use this shared cache, so it is also possible for false conflicts across pools before an entity exists. These entries are initialized at startup using the global time of the starting server. This ensures that any updates at an earlier time are forced to restart to ensure we maintain atomicity, since timestamp data is lost when a server goes down.

//1. 负项缓存。每种类型实体(包括对象、DKEY和AKEY)的每个target的全局数组。每一级的索引都是由父实体的索引(如果是容器,则为0)和相关实体的哈希值的组合来确定的。如果两个不同的键映射到同一索引,则它们共享时间戳条目。这将导致一些错误的冲突,但只要程序能够处理,就不会影响正确性。此数组的目的是为VOS树中不存在的条目存储时间戳。创建条目后,它将使用下面#2中描述的机制。请注意,同一目标中的多个池使用此共享缓存,因此在实体存在之前,池之间也可能发生错误冲突。这些条目在启动时使用启动服务器的全局时间进行初始化。这确保了在早期的任何更新都会被强制重新启动,以确保我们保持自动性,因为当服务器停机时,时间戳数据会丢失。

  2. Positive entry cache. An LRU cache per target for existing containers, objects, dkeys, and akeys. One LRU array is used for each level such that containers, objects, dkeys, and akeys only conflict with cache entries of the same type. Some accuracy is lost when existing items are evicted from the cache, as the values will be merged with the corresponding negative entry described in #1 above until such time as the entry is brought back into cache. The index of the cached entry is stored in the VOS tree, though it is only valid at runtime. On server restarts, the LRU cache is initialized from the global time when the restart occurs, and all entries are automatically invalidated. When a new entry is brought into the LRU, it is initialized using the corresponding negative entry. The index of the LRU entry is stored in the VOS tree, providing O(1) lookup on subsequent accesses.

//正项缓存。现有容器、对象、DKEY和AKEY的每个目标的LRU缓存。每个级别使用一个LRU数组,这样容器、对象、DKEY和AKEY仅与相同类型的缓存项冲突。当现有项从缓存中逐出时,会失去一些准确性,因为这些值将与上面#1中描述的相应负项合并,直到该项被带回缓存。缓存项的索引存储在VOS树中,尽管它仅在运行时有效。在服务器重新启动时,LRU缓存将从重新启动时的全局时间开始初始化,并且所有条目将自动失效。将新条目引入LRU时,将使用相应的负条目对其进行初始化。LRU条目的索引存储在VOS树中,为后续访问提供O(1)查找。
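
An illustrative computation of the per-level negative-entry index described in item 1; the array size, the mixing constants, and the function name are all made up for the sketch.

#include <stddef.h>
#include <stdint.h>

#define NEG_CACHE_SLOTS 4096   /* hypothetical per-level array size */

/* Mix the parent entry's index (0 for containers) with a hash of this
 * entity's key; colliding keys share a slot, which can produce false
 * conflicts but never wrong results, as noted above. */
static uint32_t
neg_cache_index(uint32_t parent_idx, const void *key, size_t len)
{
    const uint8_t *p = key;
    uint64_t h = parent_idx * 0x9e3779b97f4a7c15ULL;

    for (size_t i = 0; i < len; i++)
        h = (h ^ p[i]) * 0x100000001b3ULL;
    return (uint32_t)(h % NEG_CACHE_SLOTS);
}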

Read Timestamps

Each entry in the timestamp cache contains two read timestamps in order to provide serializability guarantees for DAOS operations. These timestamps are:

//时间戳缓存中的每个条目都包含两个读取时间戳,以便为DAOS操作提供序列化保证。这些时间戳是:

  1. A low timestamp (entity.low) indicating that all nodes in the subtree rooted at the entity have been read at entity.low //一个低时间戳(entity.low),表示在entity.low读取了以该entity为根的子树中的所有节点
  2. A high timestamp (entity.high) indicating that at least one node in the subtree rooted at the entity has been read at entity.high. //一个高时间戳(entity.high),表示在entity.high上至少读取了以该实体为根的子树中的一个节点

For any leaf node (i.e., akey), low == high; for any non-leaf node, low <= high.

The usage of these timestamps is described below

Write Timestamps

In order to detect epoch uncertainty violations, VOS also maintains a pair of write timestamps for each container, object, dkey, and akey. Logically, the timestamps represent the latest two updates to either the entity itself or to an entity in a subtree. At least two timestamps are required to avoid assuming uncertainty if there are any later updates. The figure below shows the need for at least two timestamps. With a single timestamp only, the first, second, and third cases would be indistinguishable and would be rejected as uncertain. The most accurate write timestamp is used in all cases. For instance, if the access is an array fetch, we will check for conflicting extents in the absence of an uncertain punch of the corresponding key or object. 

// 为了检测历元不确定性冲突,VOS还为每个容器、对象、dkey和akey维护一对写入时间戳。从逻辑上讲,时间戳表示实体本身或子树中实体的最近两次更新。如果有任何后续更新,则至少需要两个时间戳以避免假设不确定性。下图显示了至少需要两个时间戳。如果只使用一个时间戳,第一个、第二个和第三个案例将无法区分,并将被视为不确定而拒绝。在所有情况下都使用最准确的写入时间戳。例如,如果访问是一个数组获取,我们将在没有对应键或对象的不确定穿孔的情况下检查冲突的范围。

Scenarios illustrating utility of write timestamp cache 

// 说明写时间戳缓存实用性的场景

 

MVCC Rules (Multiversion concurrency control rules)

Every DAOS I/O operation belongs to a transaction. If a user does not associate an operation with a transaction, DAOS regards this operation as a single-operation transaction. A conditional update, as defined above, is therefore regarded as a transaction comprising a conditional check, and if the check passes, an update, or punch operation.

//每个DAOS I/O操作都属于一个事务。如果用户未将操作与事务关联,DAOS将此操作视为单个操作事务。因此,如上所述,条件更新被视为包含条件检查的事务,如果检查通过,则视为更新或打孔操作。

Every transaction gets an epoch. Single-operation transactions and conditional updates get their epochs from the redundancy group servers they access, snapshot read transactions get their epoch from the snapshot records and every other transaction gets its epoch from the HLC of the first server it accesses. (Earlier implementations use client HLCs to choose epochs in the last case. To relax the clock synchronization requirement for clients, later implementations have moved to use server HLCs to choose epochs, while introducing client HLC Trackers that track the highest server HLC timestamps clients have heard of.) A transaction performs all operations using its epoch.

//每一个事务都有一个epoch。单操作事务和条件更新从其访问的冗余组服务器获取其epoch,快照读取事务从快照记录获取其历元,其他每个事务从其访问的第一台服务器的HLC获取其历元(早期的实现在最后一种情况下使用客户机HLC来选择epoch。为了放宽客户机的时钟同步要求,以后的实现已经转向使用服务器HLC来选择epoch,同时引入了客户机HLC跟踪器,跟踪客户机听说过的最高服务器HLC时间戳。)事务使用其epoch执行所有操作。

The MVCC rules ensure that transactions execute as if they are serialized in their epoch order while ensuring that every transaction observes all conflicting transactions commit before it opens, as long as the system clock offsets are always within the expected maximum system clock offset (epsilon). For convenience, the rules classify the I/O operations into reads and writes:

MVCC规则确保事务按照epoch顺序序列化执行,同时确保每个事务在打开之前观察所有冲突事务提交,只要系统时钟偏移始终在预期的最大系统时钟偏移(epsilon)内。为方便起见,这些规则将I/O操作分为读操作和写操作:

  • Reads
    • Fetch akeys [akey level]
    • Check object emptiness [object level]
    • Check dkey emptiness [dkey level]
    • Check akey emptiness [akey level]
    • List objects under container [container level]
    • List dkeys under object [object level]
    • List akeys under dkey [dkey level]
    • List recx under akey [akey level]
    • Query min/max dkeys under object [object level]
    • Query min/max akeys under dkey [dkey level]
    • Query min/max recx under akey [akey level]
  • Writes
    • Update akeys [akey level]
    • Punch akeys [akey level]
    • Punch dkey [dkey level]
    • Punch object [object level]

And each read or write is at one of the four levels: container, object, dkey, and akey. An operation is regarded as an access to the whole subtree rooted at its level. Although this introduces a few false conflicts (e.g., a list operation versus a lower level update that does not change the list result), the assumption simplifies the rules.

//每个读或写都在四个级别之一:容器、对象、dkey和akey。一个操作被认为是对其层次上的整个子树的访问。尽管这会引入一些错误的冲突(例如,列表操作与不更改列表结果的较低级别更新之间的冲突),但该假设简化了规则。

A read at epoch e follows these rules:

// Epoch uncertainty check
if e is uncertain
    if there is any overlapping, unaborted write in (e, e_orig + epsilon]
        reject

find the highest overlapping, unaborted write in [0, e]
if the write is not committed
    wait for the write to commit or abort
    if aborted
        retry the find skipping this write

// Read timestamp update
for level i from container to the read's level lv
    update i.high
update lv.low

A write at epoch e follows these rules:

// Epoch uncertainty check
if e is uncertain
    if there is any overlapping, unaborted write in (e, e_orig + epsilon]
        reject

// Read timestamp check
for level i from container to one level above the write
    if (i.low > e) || ((i.low == e) && (other reader @ i.low))
        reject
    if (i.high > e) || ((i.high == e) && (other reader @ i.high))
        reject

find if there is any overlapping write at e
if found and from a different transaction
    reject

A transaction involving both reads and writes must follow both sets of rules. As optimizations, single-read transactions and snapshot (read) transactions do not need to update read timestamps. Snapshot creations, however, must update the read timestamps as if it is a transaction reading the whole container.

//同时涉及读取和写入的事务必须遵循这两组规则。作为优化,单次读取事务和快照(读取)事务不需要更新读取时间戳。但是,快照创建必须更新读取时间戳,就像它是读取整个容器的事务一样。

When a transaction is rejected, it restarts with the same transaction ID but a higher epoch. If the epoch becomes higher than the original epoch plus epsilon, the epoch becomes certain, guaranteeing the restarts due to the epoch uncertainty checks are bounded.

// 当事务被拒绝时,它将使用相同的事务ID但更高的历元重新启动。如果历元高于原始历元加上ε,则历元变得确定,从而保证由于历元不确定性检查而导致的重新启动是有界的。

Deadlocks among transactions are impossible. A transaction t_1 with epoch e_1 may block a transaction t_2 with epoch e_2 only when t_2 needs to wait for t_1's writes to commit. Since the client caching is used, t_1 must be committing, whereas t_2 may be reading or committing. If t_2 is reading, then e_1 <= e_2. If t_2 is committing, then e_1 < e_2. Suppose there is a cycle of transactions reaching a deadlock. If the cycle includes a committing-committing edge, then the epochs along the cycle must increase and then decrease, causing a contradiction. If all edges are committing-reading, then there must be two such edges together, causing a contradiction that a reading transaction cannot block other transactions. Deadlocks are, therefore, not a concern. 

// 事务之间的死锁是不可能的。只有当t_2需要等待t_1的写入提交时,具有epoch e_1的事务t_1才能阻止具有epoch e_2的事务t_2。因为使用了客户端缓存,所以t_1必须正在提交,而t_2可能正在读取或提交。如果t_2正在读取,则e_1<=e_2。如果t_2正在提交,则e_1<e_2。假设有一个事务循环达到死锁。如果周期包括提交边缘,则周期中的历元必须先增加后减少,从而导致矛盾。如果所有的边都在提交读取,那么必须有两条这样的边在一起,从而导致一个读取事务无法阻止其他事务的矛盾。因此,死锁不是一个问题。

If an entity keeps getting reads with increasing epochs, writes to this entity may keep being rejected due to the entity's ever-increasing read timestamps. Exponential backoffs with randomizations (see d_backoff_seq) have been introduced during daos_tx_restart calls. These are effective for dfs_move workloads, where readers also write.

//如果一个实体的读取次数不断增加,则由于该实体的读取时间戳不断增加,对该实体的写入可能会不断被拒绝。在daos_tx_重新启动调用期间,引入了带有随机性的指数退避(参见d_退避)。这些方法对于dfs_move 负载非常有效,因为读者也会在其中进行写操作。

Punch propagation

Since conditional operations rely on an emptiness semantic, VOS read operations, particularly listing, can be very expensive because they would potentially require reading the subtree to see if the entity is empty or not. In order to alleviate this problem, VOS instead does punch propagation. On a punch operation, the parent tree is read to see if the punch causes it to be empty. If it does, the parent tree is punched as well. Propagation presently stops at the dkey level, meaning the object will not be punched. Punch propagation only applies when punching keys, not values.

// 由于条件操作依赖于空语义,VOS读取操作,尤其是列表操作可能非常昂贵,因为它们可能需要读取子树以查看实体是否为空。为了避免这个问题,VOS转而进行punch 传播。在punch 操作中,读取父树以查看punch 是否导致其为空。如果是,父树也会被punch 。传播目前在dkey级别停止,这意味着对象将不会被punch 。冲压传播仅在punch 关键点时应用,而不是在punch 值时应用。
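
A pseudocode sketch of that rule; vos_punch_akey, dkey_is_empty_at, and vos_punch_dkey are invented helper names for the sketch, not the actual VOS entry points.

/* Punch an akey and, if the owning dkey is left with no live akeys, punch the
 * dkey too; per the text, propagation stops here and never punches the object. */
static int
punch_akey_with_propagation(daos_handle_t coh, daos_unit_oid_t oid,
                            daos_key_t *dkey, daos_key_t *akey,
                            daos_epoch_t epoch)
{
    int rc = vos_punch_akey(coh, oid, dkey, akey, epoch);    /* hypothetical */

    if (rc != 0)
        return rc;

    if (dkey_is_empty_at(coh, oid, dkey, epoch))             /* hypothetical */
        rc = vos_punch_dkey(coh, oid, dkey, epoch);          /* hypothetical */

    return rc;
}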

Epoch Based Operations

Epochs provide a way for modifying VOS objects without destroying the history of updates/writes. Each update consumes memory and discarding unused history can help reclaim unused space. VOS provides methods to compact the history of writes/updates and reclaim space in every storage node. VOS also supports rollback of history in case transactions are aborted. The DAOS API timestamp corresponds to a VOS epoch. The API only allows reading either the latest state or from a persistent snapshot, which is simply a reference on a given epoch.

//epoch提供了一种在不破坏更新/写入历史的情况下修改VOS对象的方法。每次更新都会消耗内存,丢弃未使用的历史记录有助于回收未使用的空间。VOS提供了压缩写入/更新历史记录和回收每个存储节点中空间的方法。VOS还支持在事务中止时回滚历史记录。DAOS API时间戳对应于VOS历元。API只允许读取最新状态或从持久快照中读取,持久快照只是给定历元上的引用。

To compact epochs, VOS allows all epochs between snapshots to be aggregated, i.e., the value/extent-data of the latest epoch of any key is always kept over older epochs. This also ensures that merging history does not cause loss of exclusive updates/writes made to an epoch. To rollback history, VOS provides the discard operation.

// 为了压缩历元,VOS允许聚合快照之间的所有历元,即任何键的最新历元的值/范围数据始终保留在较旧的历元上。这还可以确保合并历史不会导致丢失对历元进行的独占更新/写入。要回滚历史记录,VOS提供放弃操作。

int vos_aggregate(daos_handle_t coh, daos_epoch_range_t *epr);

int vos_discard(daos_handle_t coh, daos_epoch_range_t *epr);

int vos_epoch_flush(daos_handle_t coh, daos_epoch_t epoch);

Aggregate and discard operations in VOS accept a range of epochs to be aggregated, normally corresponding to ranges between persistent snapshots.

// VOS中的聚合和放弃操作接受要聚合的一系列epoch,这些纪元通常对应于持久快照之间的范围。
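
A minimal usage sketch of the two calls quoted above, assuming the epr_lo/epr_hi fields of daos_epoch_range_t from the public DAOS headers; the snapshot epochs and error handling are left to the caller.

/* Merge the history recorded between two persistent snapshots. */
static int
reclaim_between_snapshots(daos_handle_t coh, daos_epoch_t snap_lo,
                          daos_epoch_t snap_hi)
{
    daos_epoch_range_t epr;

    epr.epr_lo = snap_lo;      /* everything written after the older snapshot ... */
    epr.epr_hi = snap_hi;      /* ... up to the newer one is a candidate          */
    return vos_aggregate(coh, &epr);
}

/* Drop the same kind of range instead, e.g. to service an abort. */
static int
drop_epoch_range(daos_handle_t coh, daos_epoch_t lo, daos_epoch_t hi)
{
    daos_epoch_range_t epr = { .epr_lo = lo, .epr_hi = hi };

    return vos_discard(coh, &epr);
}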

VOS Discard

Discard forcefully removes epochs without aggregation. This operation is necessary only when the value/extent-data associated with a pair needs to be discarded. During this operation, VOS looks up all objects associated with each cookie in the requested epoch range from the cookie index table and removes the records directly from the respective object trees by looking at their respective epoch validity. DAOS requires a discard to service abort requests. Abort operations require a discard to be synchronous.

// Discard强制删除没有聚合的历元。仅当需要丢弃与对关联的值/范围数据时,才需要此操作。在此操作期间,VOS从cookie索引表中查找与请求的历元范围内的每个cookie相关联的所有对象,并通过查看各自的历元有效性直接从各自的对象树中删除记录。DAOS需要discard 以服务abort请求。中止操作要求丢弃是同步的。

During discard, keys and byte-array rectangles need to be searched for nodes/slots whose end-epoch is (discard_epoch - 1). This means that there was an update before the now discarded epoch, and its validity got modified to support near-epoch lookup. This epoch validity of the previous update has to be extended to infinity to ensure future lookups at near-epoch would fetch the last known updated value for the key/extent range.

// 在丢弃过程中,需要在键和字节数组矩形中搜索结束历元为(discard_epoch-1)的节点/插槽。这意味着在现在被丢弃的历元之前有一个更新,它的有效性被修改以支持近历元查找。必须将以前更新的历元有效性扩展到无穷大,以确保将来在近历元的查找将获取密钥/数据块范围的最后一个已知更新值。

VOS Aggregate

During aggregation, VOS must retain the latest update to a key/extent-range discarding the others and any updates visible at a persistent snapshot. VOS can freely remove or consolidate keys or extents so long as it doesn't alter the view visible at the latest timestamp or any persistent snapshot epoch. Aggregation makes use of the vos_iterate API to find both visible and hidden entries between persistent snapshots and removes hidden keys and extents and merges contiguous partial extents to reduce metadata overhead. Aggregation can be an expensive operation but doesn't need to consume cycles on the critical path. A special aggregation ULT processes aggregation, frequently yielding to avoid blocking the continuing I/O.

// 在聚合期间,VOS必须将最新更新保留到key/extent范围,丢弃其他更新和在持久快照上可见(?应该是不可见吧)的任何更新。VOS可以自由删除或合并键或extent,只要它不改变在最新时间戳或任何持久快照时可见的视图。聚合利用vos_iterate API在持久快照之间查找可见和隐藏项,并删除隐藏键和extent,合并连续的部分扩展数据块以减少元数据开销。聚合可能是一项昂贵的操作,但不需要消耗关键路径上的周期。一个特殊的聚合ULT程序处理聚合,经常退避以避免阻塞持续的I/O。

VOS Checksum Management

VOS is responsible for storing checksums during an object update and retrieving checksums on an object fetch. Checksums will be stored with other VOS metadata in storage class memory. For Single Value types, a single checksum is stored. For Array Value types, multiple checksums can be stored based on the chunk size.

//VOS负责在对象更新期间存储校验和,并在对象获取时检索校验和。校验和将与其他VOS元数据一起存储在存储类内存中。对于单值类型,将存储一个校验和。对于数组值类型,可以根据块(chunk)大小存储多个校验和。

The Chunk Size is defined as the maximum number of bytes of data that a checksum is derived from. While extents are defined in terms of records, the chunk size is defined in terms of bytes. When calculating the number of checksums needed for an extent, the number of records and the record size are needed. Checksums should typically be derived from Chunk Size bytes; however, if the extent is smaller than Chunk Size or an extent is not "Chunk Aligned," then a checksum might be derived from fewer bytes than Chunk Size.

//块大小定义为从中派生校验和的最大数据字节数。extent块是根据记录定义的,而chunk块大小是根据字节定义的。计算数据块所需的校验和数时,需要记录数和记录大小。校验和通常应从chunk块大小字节派生,但是,如果区段小于块大小或区段未“块对齐”,则校验和可能从小于chunk块大小的字节派生。

The Chunk Alignment will have an absolute offset, not an I/O offset. So even if an extent is exactly, or less than, Chunk Size bytes long, it may have more than one Chunk if it crosses the alignment barrier. 

//块对齐将具有绝对偏移量,而不是I/O偏移量。因此,即使一个extent块的长度正好等于或小于chunk块大小字节,如果它跨越对齐障碍,也可能有多个块。
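
A worked example of those two rules (the function name and sample sizes below are invented for illustration): the checksum count for an extent follows from its absolute byte range and the chunk size, not from its length alone.

#include <stdint.h>

/* Number of chunk-aligned checksums needed for an extent of 'nr_recs' records
 * of 'rec_size' bytes starting at record index 'rec_idx'. */
static uint32_t
csum_count(uint64_t rec_idx, uint64_t nr_recs, uint64_t rec_size,
           uint64_t chunk_size)
{
    uint64_t lo = rec_idx * rec_size;          /* absolute start byte            */
    uint64_t hi = lo + nr_recs * rec_size;     /* absolute end byte (exclusive)  */

    return (uint32_t)((hi - 1) / chunk_size - lo / chunk_size + 1);
}

/* With a 32768-byte chunk: 1000 records of 40 bytes starting at record 800
 * occupy bytes [32000, 72000) and cross two chunk boundaries -> 3 checksums,
 * while the same-sized extent starting at byte 0 would need only 2. */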

Configuration

Checksums will be configured for a container when a container is created. Checksum specific properties can be included in the daos_cont_create API. This configuration has not been fully implemented yet, but properties might include checksum type, chunk size, and server side verification.

// 创建容器时,将为容器配置校验和。特定于校验和的属性可以包含在daos_cont_create API中。此配置尚未完全实现,但属性可能包括校验和类型、chunk 块大小和服务器端验证。

Storage

Checksums will be stored in a record (vos_irec_df) or extent (evt_desc) structure for Single Value types and Array Value types, respectively. Because the checksum can be of variable size, depending on the type of checksum configured, the checksum itself will be appended to the end of the structure. The size needed for checksums is included while allocating memory for the persistent structures on SCM (vos_reserve_single/vos_reserve_recx).

//校验和将分别存储在单值类型和数组值类型的记录(vos_irec_df)或数据块(evt_desc)结构中。由于校验和的大小可以是可变的,这取决于配置的校验和类型,校验和本身将附加到结构的末尾。为SCM上的持久结构(vos_reserve_single/vos_reserve_recx)分配内存时,包括校验和所需的大小。

The following diagram illustrates the overall VOS layout and where checksums will be stored. Note that the checksum type isn't actually stored in vos_cont_df yet.

//下图说明了整个VOS布局以及校验和的存储位置。请注意,校验和类型实际上尚未存储在vos_cont_df中。

 

Checksum VOS Flow (vos_obj_update/vos_obj_fetch)

On update, the checksum(s) are part of the I/O Descriptor. Then, in akey_update_single/akey_update_recx, the checksum buffer pointer is included in the internal structures used for tree updates (vos_rec_bundle for SV and evt_entry_in for EV). As already mentioned, the size of the persistent structure allocated includes the size of the checksum(s). Finally, while storing the record (svt_rec_store) or extent (evt_insert), the checksum(s) are copied to the end of the persistent structure.

//更新时,校验和是I/O描述符的一部分。然后,在akey_update_single/akey_update_recx中,校验和缓冲区指针包含在用于树更新的内部结构中(SV的vos_rec_bundle和EV的evt_entry_in)。如前所述,分配的持久结构的大小包括校验和的大小。最后,在存储记录(svt_rec_store)或数据块(evt_insert)时,校验和被复制到持久结构的末尾。

On a fetch, the update flow is essentially reversed. //在获取时,更新流基本上是反向的。

For reference, key junction points in the flows are://供参考,流程中的关键连接点为:

  • SV Update: vos_update_end -> akey_update_single -> svt_rec_store
  • SV Fetch: vos_fetch_begin -> akey_fetch_single -> svt_rec_load
  • EV Update: vos_update_end -> akey_update_recx -> evt_insert
  • EV Fetch: vos_fetch_begin -> akey_fetch_recx -> evt_fill_entry

Metadata Overhead

There is a tool available to estimate the metadata overhead. It is described in the storage estimator section.

//有一个工具可用于估计元数据开销。存储估计器(storage estimator)部分对此进行了描述。

Replica Consistency

DAOS supports multiple replicas for data high availability. Inconsistency between replicas is possible when a target fails during an update to a replicated object, or when concurrent updates are applied to replicated targets in an inconsistent order.

//DAOS支持多个副本以实现数据的高可用性。当某个目标在对复制对象的更新过程中失败,或者并发更新以不一致的顺序应用到各个副本目标上时,副本之间就可能出现不一致。

The most intuitive solution to the inconsistency problem is a distributed lock manager (DLM), used by some distributed systems such as Lustre. For DAOS, a user-space system targeting powerful, next-generation hardware, maintaining distributed locks among multiple, independent application spaces would introduce unacceptable overhead and complexity. DAOS instead uses an optimized two-phase commit transaction to guarantee consistency among replicas.

//对于不一致性问题,最直观的解决方案是分布式锁管理器(DLM),一些分布式系统(如 Lustre)采用了这种方案。而 DAOS 是一个面向强大的下一代硬件的用户态系统,在多个相互独立的应用程序空间之间维护分布式锁会带来不可接受的开销和复杂性。因此 DAOS 转而使用优化的两阶段提交事务来保证副本之间的一致性。

Single redundancy group based DAOS Two-Phase Commit (DTX)

When an application wants to modify (update or punch) a replicated object or an EC object, the client sends the modification RPC to the leader shard (chosen via the DTX Leader Election algorithm discussed below). The leader dispatches the RPC to the other related shards, and each shard applies its modification in parallel. Bulk transfers are not forwarded by the leader but are transferred directly from the client, which improves load balance and decreases latency by utilizing the full client-server bandwidth.

// 当应用程序想要修改(更新或 punch)一个多副本对象或 EC 对象时,客户端将修改 RPC 发送给 leader shard(leader 通过下面讨论的 DTX Leader Election 算法选出)。leader 将 RPC 分发给其他相关 shard,每个 shard 并行执行各自的修改。批量数据传输不经 leader 转发,而是直接从客户端传输,从而利用完整的客户端-服务器带宽来改善负载均衡并降低延迟。

Before the modifications are made, a local transaction, called 'DTX', is started on each related shard (leader and non-leaders alike) with a client-generated DTX identifier that is unique for the modification within the container. All the modifications in a DTX are logged in the DTX transaction table, and back references to the table are kept in the related modified records. After its local modifications are done, each non-leader marks the DTX state as 'prepared' and replies to the leader. The leader sets the DTX state to 'committable' as soon as it has completed its own modifications and has received successful replies from all non-leaders. If any shard fails to execute the modification, it replies to the leader with a failure, and the leader globally aborts the DTX. Once the leader has set the DTX to 'committable' or 'aborted', it replies to the client with the appropriate status.

// 在进行修改之前,每个相关 shard(包括 leader 和非 leader)上都会启动一个称为 'DTX' 的本地事务,其 DTX 标识符由客户端生成,在容器内对该次修改是唯一的。DTX 中的所有修改都记录在 DTX 事务表中,并在相关的被修改记录中保留指向该表的反向引用。完成本地修改后,每个非 leader 将 DTX 状态标记为 'prepared'(已准备)并回复 leader。leader 在完成自身修改并收到所有非 leader 的成功回复后,立即将 DTX 状态设置为 'committable'(可提交)。如果任何 shard 执行修改失败,它会向 leader 回复失败,leader 将全局中止该 DTX。一旦 leader 将 DTX 设置为 'committable' 或 'aborted'(已中止),就会以相应的状态回复客户端。
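
The following is an illustrative summary (not actual DAOS source) of the leader-side outcome of this flow, using a simplified DTX state enum; the names are assumptions made for the sketch.

    #include <stdbool.h>

    /* Illustrative DTX states, simplified for this sketch. */
    enum dtx_state {
        DTX_ST_INIT,
        DTX_ST_PREPARED,     /* local modification logged, non-leader replied */
        DTX_ST_COMMITTABLE,  /* leader: all shards prepared, client can be acked */
        DTX_ST_COMMITTED,    /* committed asynchronously in batch later */
        DTX_ST_ABORTED,
    };

    /* Leader-side outcome of one modification, given the leader's own
     * local execution and the replies from the non-leader shards. */
    static enum dtx_state
    dtx_leader_resolve(bool leader_ok, const bool *shard_ok, int nr_shards)
    {
        if (!leader_ok)
            return DTX_ST_ABORTED;
        for (int i = 0; i < nr_shards; i++)
            if (!shard_ok[i])
                return DTX_ST_ABORTED;   /* any failure aborts globally */
        return DTX_ST_COMMITTABLE;        /* client is answered now; the
                                           * actual commit happens in batch */
    }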

The client may consider a modification complete as soon as it receives a successful reply from the leader, regardless of whether the DTX has actually been committed yet. It is the leader's responsibility to commit 'committable' DTXs asynchronously. This can happen when the count of 'committable' DTXs or the age of a DTX exceeds some threshold, or when the DTX commit is piggybacked on other dispatched RPCs due to a potential conflict with subsequent modifications.

//只要收到 leader 的成功回复,客户端就可以认为修改已经完成,而不必关心 DTX 是否已经真正 'committed'。leader 负责异步提交处于 'committable' 状态的 DTX。当 'committable' 的数量或 DTX 的存在时间超过某个阈值,或者由于可能与后续修改冲突而需要把 DTX 提交捎带在其他已分发的 RPC 上时,就会触发这种提交。
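
As a hypothetical illustration of the batched-commit trigger, the threshold names and values below are made up for the sketch; the real code uses its own internal limits.

    #include <stdbool.h>
    #include <stdint.h>

    #define DTX_COMMITTABLE_MAX   512    /* assumed count threshold */
    #define DTX_AGE_LIMIT_SEC      60    /* assumed age threshold   */

    /* Should the leader flush its 'committable' DTXs now? */
    static bool dtx_should_batch_commit(uint32_t nr_committable,
                                        uint64_t oldest_age_sec)
    {
        return nr_committable >= DTX_COMMITTABLE_MAX ||
               oldest_age_sec >= DTX_AGE_LIMIT_SEC;
    }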

When an application wants to read something from an object with multiple replicas, the client can send the RPC to any replica. On the server side, if the related DTX has been committed or is committable, the record can be returned. If the DTX state is 'prepared' and the replica is not the leader, the server replies to the client telling it to send the RPC to the leader instead. If the replica is the leader and the DTX is 'committed' or 'committable', then the entry is visible to the application. Otherwise, if the DTX on the leader is still 'prepared', then for a transactional read the server asks the client to wait and retry by returning -DER_INPROGRESS; for a non-transactional read, the related entry is ignored and the latest committed modification is returned to the client.

// 当应用程序想从具有多个副本的对象读取数据时,客户端可以把 RPC 发送到任意副本。在服务器端,如果相关 DTX 已提交或处于 committable 状态,则可以返回该记录。如果 DTX 状态为 prepared 且该副本不是 leader,服务器会回复客户端,让它改为把 RPC 发送给 leader。如果该副本是 leader 且 DTX 处于 'committed' 或 'committable' 状态,则该条目对应用程序可见。否则,如果 leader 上的 DTX 仍处于 'prepared' 状态:对于事务性读取,通过返回 -DER_INPROGRESS 让客户端等待并重试;对于非事务性读取,忽略相关条目,并把最新已提交的修改返回给客户端。

If the read operation refers to an EC object, and the data read from a data shard (a non-leader) has a 'prepared' DTX, the data may already be 'committable' on the leader due to the asynchronous batched commit mechanism mentioned above. In that case, the non-leader refreshes the related DTX status with the leader. If the DTX status after the refresh is 'committed', the related data can be returned to the client; otherwise, if the DTX state is still 'prepared', then for a transactional read the server asks the client to wait and retry by returning -DER_INPROGRESS; for a non-transactional read, the related entry is ignored and the latest committed modification is returned to the client.

// 如果读取操作涉及 EC 对象,且从数据 shard(非 leader)读取的数据带有 'prepared' 状态的 DTX,由于上述异步批量提交机制,这份数据在 leader 上可能已经是 'committable'。这种情况下,非 leader 会向 leader 刷新相关 DTX 的状态。如果刷新后的 DTX 状态为 'committed',则可以把相关数据返回给客户端;否则,如果 DTX 状态仍为 'prepared':对于事务性读取,通过返回 -DER_INPROGRESS 让客户端等待并重试;对于非事务性读取,忽略相关条目,并把最新已提交的修改返回给客户端。
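
The read-side handling described in the two paragraphs above can be summarized by the simplified decision function below; it reuses the dtx_state enum from the earlier sketch, and the action names are illustrative, not DAOS symbols.

    #include <stdbool.h>

    /* Possible outcomes of a read hitting a record with a pending DTX. */
    enum read_action {
        READ_RETURN_ENTRY,        /* entry is visible, return it              */
        READ_REDIRECT_TO_LEADER,  /* non-leader cannot decide; ask the leader
                                   * (or refresh the DTX status for EC)       */
        READ_RETRY_LATER,         /* transactional read: -DER_INPROGRESS      */
        READ_SKIP_ENTRY,          /* non-transactional read: ignore the entry
                                   * and fall back to the last committed value */
    };

    static enum read_action
    dtx_read_check(enum dtx_state st, bool i_am_leader, bool transactional)
    {
        if (st == DTX_ST_COMMITTED || st == DTX_ST_COMMITTABLE)
            return READ_RETURN_ENTRY;
        if (st == DTX_ST_PREPARED && !i_am_leader)
            return READ_REDIRECT_TO_LEADER;
        return transactional ? READ_RETRY_LATER : READ_SKIP_ENTRY;
    }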

The DTX model is built inside a DAOS container. Each container maintains its own DTX table that is organized as two B+trees in SCM: one for active DTXs and the other for committed DTXs. The following diagram represents the modification of a replicated object under the DTX model.

//DTX模型构建在DAOS容器中。每个容器维护自己的DTX表,该表在SCM中组织为两个B+树:一个用于活动DTX,另一个用于提交的DTX。下图表示DTX模型下复制对象的修改。

Modifying a replicated object under the DTX model

 

SV-tree猜测:由单值(Single Value)构成的树;因为存在多个历史版本,所以单值也可以构成一棵树。

EV-tree猜测:由多个值、多个版本构成的一棵树。

Single redundancy group based DTX Leader Election

In the single redundancy group based DTX model, leader selection is done for each object or dkey following these general guidelines (a hypothetical sketch combining them follows the list):

// 在基于单冗余组的DTX模型中,按照以下一般准则为每个对象或dkey选择leader:

R1: When different replicated objects share the same redundancy group, the same leader should not be used for each object. //当不同的复制对象共享同一冗余组时,不应为每个对象使用相同的leader。

R2: When a replicated object with multiple DKEYs spans multiple redundancy groups, the leaders in different redundancy groups should be on different servers. //当具有多个DKEY的复制对象跨越多个冗余组时,不同冗余组中的leader应位于不同的服务器上。

R3: Servers that fail frequently should be avoided in leader selection to avoid frequent leader migration.

//在leader选择中应避免频繁失败的服务器,以避免频繁的leader迁移。

R4: For an EC object, the leader will be one of the parity nodes within the current redundancy group.

//对于EC对象,leader将是当前冗余组中的奇偶校验节点之一。
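
A hypothetical way to combine these guidelines (not the actual DAOS election algorithm) is sketched below: hash the object id and dkey to spread leaders across shards (R1/R2), skip shards on frequently failing servers (R3), and restrict the choice to parity shards for EC objects (R4). All names and the hashing scheme are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /* Pick a leader shard index within one redundancy group.
     * is_parity may be NULL for replicated (non-EC) objects. */
    static int elect_leader(uint64_t oid_hash, uint64_t dkey_hash,
                            const bool *is_parity, const bool *is_flaky,
                            int nr_shards)
    {
        int start = (int)((oid_hash ^ dkey_hash) % (uint64_t)nr_shards);

        for (int i = 0; i < nr_shards; i++) {
            int shard = (start + i) % nr_shards;

            if (is_flaky[shard])                  /* R3: avoid unstable servers   */
                continue;
            if (is_parity != NULL && !is_parity[shard])
                continue;                         /* R4: EC leader must hold parity */
            return shard;
        }
        return start;                             /* fall back to the hash choice */
    }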
