Blobstore Programmer's Guide

Target Audience

This programmer's guide is intended for developers authoring applications that use the SPDK Blobstore. It supplements the source code by providing an overall understanding of how to integrate Blobstore into an application, along with some high-level insight into how Blobstore works behind the scenes. It is not intended to serve as a design document or an API reference, although in some cases source code snippets and high-level sequences are discussed; for the latest source code, refer to the repo.

Introduction

Blobstore is a persistent, power-fail safe block allocator designed to be used as the local storage system backing a higher-level storage service, typically in lieu of a traditional filesystem. These higher-level services can be local databases or key/value stores (MySQL, RocksDB), dedicated appliances (SAN, NAS), or distributed storage systems (e.g. Ceph, Cassandra). It is not designed to be a general-purpose filesystem, however, and it is intentionally not POSIX compliant. To avoid confusion, this guide avoids references to files or objects and uses the term "blob" instead. The Blobstore is designed to allow asynchronous, uncached, parallel reads and writes to groups of blocks on a block device; these groups are called "blobs". Blobs are typically large, measured in at least hundreds of kilobytes, and are always a multiple of the underlying block size.

The Blobstore is designed primarily to run on "next generation" media, meaning devices that support fast random reads and writes with no required background garbage collection. In practice, however, the design also runs well on NAND.

Theory of Operation

Abstractions

The Blobstore defines a hierarchy of storage abstractions as follows.

  • Logical Block: Logical blocks are exposed by the disk itself and are numbered from 0 to N-1, where N is the number of blocks on the disk. A logical block is typically either 512B or 4KiB.
  • Page: A page is a fixed number of logical blocks, defined at Blobstore creation time. The logical blocks that compose a page are always contiguous. Pages are also numbered from the beginning of the disk, such that the first page worth of blocks is page 0, the second is page 1, and so on. A page is typically 4KiB in size, so in practice it is either 8 logical blocks (512B each) or 1 logical block (4KiB). The SSD must be able to perform atomic reads and writes of at least the page size.
  • Cluster: A cluster is a fixed number of pages, defined at Blobstore creation time. The pages that compose a cluster are always contiguous. Clusters are also numbered from the beginning of the disk, where cluster 0 is the first cluster worth of pages, cluster 1 is the second grouping of pages, and so on. A cluster is typically 1MiB in size, or 256 pages of 4KiB.
  • Blob: A blob is an ordered list of clusters. Blobs are manipulated (created, sized, deleted, etc.) by the application and persist across power failures and reboots. Applications use a Blobstore-provided identifier to access a particular blob. Blobs are read and written in units of pages by specifying an offset from the start of the blob. Applications can also store metadata with each blob in the form of key/value pairs, referred to as xattrs (extended attributes).
  • Blobstore: An SSD that has been initialized by a Blobstore-based application is referred to as "a Blobstore." A Blobstore owns the entire underlying device, which is made up of a private Blobstore metadata region and the collection of blobs managed by the application.

+-----------------------------------------------------------------+
|                              Blob                               |
| +-----------------------------+ +-----------------------------+ |
| |           Cluster           | |           Cluster           | |
| | +----+ +----+ +----+ +----+ | | +----+ +----+ +----+ +----+ | |
| | |Page| |Page| |Page| |Page| | | |Page| |Page| |Page| |Page| | |
| | +----+ +----+ +----+ +----+ | | +----+ +----+ +----+ +----+ | |
| +-----------------------------+ +-----------------------------+ |
+-----------------------------------------------------------------+
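
The numbering scheme in this hierarchy reduces to simple integer arithmetic. The following is a minimal sketch of that arithmetic, not SPDK code: the `geometry` struct and all function names are hypothetical and exist only for illustration, and the sizes are the typical values mentioned in the list (512B blocks, 4KiB pages, 1MiB clusters).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry descriptor for illustration; not an SPDK type. */
struct geometry {
    uint32_t block_size;   /* bytes per logical block, e.g. 512 or 4096 */
    uint32_t page_size;    /* bytes per page, e.g. 4096 */
    uint32_t cluster_size; /* bytes per cluster, e.g. 1 MiB */
};

/* Number of contiguous logical blocks that make up one page. */
static inline uint32_t blocks_per_page(const struct geometry *g)
{
    assert(g->block_size > 0 && g->page_size % g->block_size == 0);
    return g->page_size / g->block_size;
}

/* Number of contiguous pages that make up one cluster. */
static inline uint32_t pages_per_cluster(const struct geometry *g)
{
    assert(g->page_size > 0 && g->cluster_size % g->page_size == 0);
    return g->cluster_size / g->page_size;
}

/* First logical block (LBA) of a given page, counted from the start of the disk. */
static inline uint64_t page_to_lba(const struct geometry *g, uint64_t page)
{
    return page * blocks_per_page(g);
}

/* First page of a given cluster, counted from the start of the disk. */
static inline uint64_t cluster_to_page(const struct geometry *g, uint64_t cluster)
{
    return cluster * pages_per_cluster(g);
}

/* Cluster that contains a given page. */
static inline uint64_t page_to_cluster(const struct geometry *g, uint64_t page)
{
    return page / pages_per_cluster(g);
}
```

With 512B blocks, page 3 starts at logical block 24, and cluster 2 starts at page 512. Note that these helpers describe device-wide numbering only; the pages of a blob map to whichever clusters the blob owns, in list order.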

Atomicity

All Blobstore atomicity guarantees depend on the underlying device guaranteeing atomic operations of at least one page in size. Atomicity here can refer to several kinds of operations:

  • Data Writes: For data writes, the unit of atomicity is one page. Therefore, if a write operation larger than one page is underway and the system suffers a power failure, the data on media will be consistent at page-size granularity. If a single page was in the middle of being updated when power was lost, then after power is restored the data at that page location will be as it was before the write operation started.
  • Blob Metadata Updates: Each blob has its own set of metadata (xattrs, size, etc.). For performance reasons, a copy of this metadata is kept in RAM and is only synchronized with the on-disk version when the application makes an explicit call to do so, or when the Blobstore is unloaded. Therefore, setting an xattr, for example, is not consistent until the call to synchronize it (covered later) is made; that synchronization is, however, performed atomically.
  • Blobstore Metadata Updates: Blobstore itself has its own metadata which, like per-blob metadata, has a copy both in RAM and on disk. Unlike the per-blob metadata, however, the Blobstore metadata region is not made consistent by a blob synchronization call; it is only synchronized when the Blobstore is properly unloaded via the API. Therefore, if the Blobstore metadata is updated (blob creation, deletion, resize, etc.) and the Blobstore is not unloaded properly, it will need to perform some extra steps the next time it is loaded. This takes a bit more time than loading after a clean shutdown, but there will be no inconsistencies.

Callbacks

Blobstore is callback driven. If a Blobstore API is unable to make forward progress, it does not block; instead it returns control at that point and, when the original call completes, invokes the callback function provided in the API along with its arguments. The callback is made on the same thread that the call was made from (more on threads later). Some APIs, however, take no callback arguments; in these cases the calls are fully synchronous. Examples of asynchronous calls that use callbacks include those that involve disk IO, where some amount of polling is required before the IO completes.
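
The callback pattern described above can be illustrated with a small stand-alone model. This is a sketch under stated assumptions rather than Blobstore code: `fake_dev`, `fake_read_async`, `fake_poll`, `read_ctx`, and `read_done` are hypothetical names, and the only detail borrowed from the text is the shape of the contract (the call returns without blocking, and a callback carrying a user argument and an error number fires later on the thread that polls for completions).

```c
#include <assert.h>
#include <stddef.h>

/* Completion callback: user argument plus an error number (0 means success). */
typedef void (*op_complete)(void *cb_arg, int bserrno);

#define MAX_PENDING 8

struct pending_op {
    op_complete cb_fn;
    void *cb_arg;
};

/* Hypothetical device model: a ring of operations awaiting completion. */
struct fake_dev {
    struct pending_op queue[MAX_PENDING];
    size_t head;
    size_t count;
};

/* Queue a read and return immediately. The callback is never invoked here.
 * Returns 0 if the operation was queued, -1 if the queue is full. */
static inline int fake_read_async(struct fake_dev *dev, op_complete cb_fn, void *cb_arg)
{
    assert(cb_fn != NULL);
    if (dev->count == MAX_PENDING) {
        return -1;
    }
    size_t tail = (dev->head + dev->count) % MAX_PENDING;
    dev->queue[tail].cb_fn = cb_fn;
    dev->queue[tail].cb_arg = cb_arg;
    dev->count++;
    return 0;
}

/* Complete at most one queued operation and invoke its callback on the
 * calling (polling) thread. Returns the number of operations completed. */
static inline int fake_poll(struct fake_dev *dev)
{
    if (dev->count == 0) {
        return 0;
    }
    struct pending_op op = dev->queue[dev->head];
    dev->head = (dev->head + 1) % MAX_PENDING;
    dev->count--;
    op.cb_fn(op.cb_arg, 0);
    return 1;
}

/* Caller-owned state updated by the callback. */
struct read_ctx {
    int done;
    int rc;
};

/* Example callback: record completion and the error number for the caller. */
static inline void read_done(void *cb_arg, int bserrno)
{
    struct read_ctx *ctx = cb_arg;
    ctx->done = 1;
    ctx->rc = bserrno;
}
```

A caller would invoke `fake_read_async(&dev, read_done, &ctx)`, continue with other work, and only treat the read as finished once `fake_poll` has run and `ctx.done` is set with `ctx.rc == 0`.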

Backend Support

Blobstore requires a backing storage device, which can be integrated either through the bdev layer or by integrating a device driver directly with Blobstore. Blobstore performs operations on the backing block device by calling function pointers supplied to it at initialization time. For convenience, an implementation of these function pointers that routes I/O to the bdev layer is available in bdev_blob.c. Alternatively, the SPDK NVMe driver, for example, may be integrated directly, bypassing a small amount of bdev layer overhead. These options are discussed further in the upcoming section on examples.

Metadata Operations

Because Blobstore is designed to be lock-free, metadata operations need to be isolated to a single thread to avoid taking locks on the in-memory data structures that describe the layout of blobs (along with other data). In Blobstore this is implemented as the metadata thread, defined as the thread on which the application makes metadata-related calls. It is up to the application to set up a separate thread for these calls and to ensure that it does not mix relevant IO operations with metadata operations, even if they are on separate threads. This is discussed further in the Design Considerations section.

Threads

An application using Blobstore with the SPDK NVMe driver, for example, can support a variety of thread scenarios. The simplest is a single-threaded application in which the application, the Blobstore code, and the NVMe driver share a single core. In this case, the single thread is used to submit both metadata operations and IO operations, and it is up to the application to ensure that only one metadata operation is issued at a time and that it is not intermingled with affected IO operations.

Channels

Channels are an SPDK-wide abstraction. With Blobstore, the best way to think about them is that they are required in order to do IO. The application performs IO on a channel, and channels are best thought of as being associated 1:1 with a thread.

Blob Identifiers

When an application creates a blob, it does not provide a name, as is the case with many other similar storage systems. Instead, the Blobstore returns a unique identifier, which the application must pass to subsequent APIs to operate on that blob.

Design Considerations

Initialization Options

When the Blobstore is initialized, there are multiple configuration options to consider. The options and their defaults are:

  • Cluster Size: By default, this value is 1MiB. The cluster size must be a multiple of the page size and should be selected based on the application's usage model in terms of allocation. Recall that blobs are made up of clusters, so when a blob is allocated, deallocated, or resized, disk LBAs are manipulated in groups of cluster size. If the application expects to deal mainly with very large blobs (always multiple GiB), it may make sense to change the cluster size to 1GiB, for example.
  • Number of Metadata Pages: By default, Blobstore assumes there can be as many metadata pages as there are clusters, which is the worst case in terms of metadata usage. This can be overridden here, although the resulting space savings are not significant.
  • Maximum Simultaneous Metadata Operations: Determines how many internally pre-allocated memory structures are set aside for performing metadata operations. It is unlikely that changes to this value (default 32) would be desirable.
  • Maximum Simultaneous Operations Per Channel: Determines how many internally pre-allocated memory structures are set aside for channel operations. Changes to this value are application dependent and are best determined by knowledge of the typical usage model, an understanding of the types of SSDs being used, and empirical data. The default is 512.
  • Blobstore Type: This field is a character array used by applications that need to identify whether the Blobstore found on a device is appropriate to claim. The default is NULL. Unless the application is deployed in an environment where multiple applications using the same disks are at risk of inadvertently using the wrong Blobstore, there is no need to set this value. It can, however, be set to any valid set of characters.
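
To make the cluster size trade-off concrete, the following sketch computes how many clusters a blob of a given size occupies and how much allocated space goes unused. It is plain arithmetic based on the rule stated above (allocation happens in whole clusters, and the cluster size is a multiple of the page size); the function names are hypothetical and nothing here calls into SPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole clusters required to hold `bytes` of data (rounds up). */
static inline uint64_t clusters_needed(uint64_t bytes, uint64_t cluster_size)
{
    assert(cluster_size > 0);
    return (bytes + cluster_size - 1) / cluster_size;
}

/* Bytes that are allocated to the blob but not used by its data. */
static inline uint64_t slack_bytes(uint64_t bytes, uint64_t cluster_size)
{
    return clusters_needed(bytes, cluster_size) * cluster_size - bytes;
}

/* Check the stated constraint: the cluster size is a nonzero multiple of the page size. */
static inline int cluster_size_valid(uint64_t cluster_size, uint64_t page_size)
{
    return page_size > 0 && cluster_size >= page_size && cluster_size % page_size == 0;
}
```

For a 300KiB blob, a 1MiB cluster leaves roughly 724KiB unused, while a 1GiB cluster would leave nearly the whole gibibyte unused. For a 3GiB blob, a 1GiB cluster size means tracking 3 clusters instead of 3072, which is why larger clusters suit workloads made up of very large blobs.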

Sub-page Sized Operations

Blobstore is only capable of page-sized read and write operations. If the application requires finer granularity, it must handle that itself.
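
One common way for an application to handle sub-page writes is read-modify-write: read the covering pages into a staging buffer, patch the bytes, and write the whole pages back. The sketch below demonstrates the idea against an in-memory array standing in for a blob. It is an assumption-laden illustration, not Blobstore code: `DEMO_PAGE_SIZE`, `pages_read`, `pages_write`, and `write_bytes` are hypothetical, the I/O is synchronous for brevity, and a real application would also have to serialize overlapping updates itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/* Page-granular read from an in-memory stand-in for a blob. */
static inline void pages_read(const uint8_t *blob, uint64_t page, uint64_t n, uint8_t *buf)
{
    memcpy(buf, blob + page * DEMO_PAGE_SIZE, n * DEMO_PAGE_SIZE);
}

/* Page-granular write to the same stand-in. */
static inline void pages_write(uint8_t *blob, uint64_t page, uint64_t n, const uint8_t *buf)
{
    memcpy(blob + page * DEMO_PAGE_SIZE, buf, n * DEMO_PAGE_SIZE);
}

/* Write `len` bytes at byte `offset` using only page-sized operations.
 * `staging` must be large enough for every page the byte range touches.
 * Returns the number of pages that were read and rewritten. */
static inline uint64_t write_bytes(uint8_t *blob, uint64_t offset, const void *src,
                                   uint64_t len, uint8_t *staging)
{
    assert(len > 0);
    uint64_t first = offset / DEMO_PAGE_SIZE;
    uint64_t last = (offset + len - 1) / DEMO_PAGE_SIZE;
    uint64_t n = last - first + 1;

    pages_read(blob, first, n, staging);                          /* read covering pages */
    memcpy(staging + (offset - first * DEMO_PAGE_SIZE), src, len); /* patch the bytes */
    pages_write(blob, first, n, staging);                         /* write pages back */
    return n;
}
```

A 5-byte write at byte offset 4094 straddles pages 0 and 1, so two full pages are read and rewritten even though only five bytes change.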

Threads

As mentioned earlier, Blobstore can share a single thread with an application, or the application can define any number of threads that makes sense within its resource constraints. The basic considerations that must be followed are:

  • Metadata operations (APIs with MD in the name) should be isolated from each other, as there is no internal locking on the memory structures affected by these APIs.
  • Metadata operations should be isolated from conflicting IO operations (an example of a conflicting IO is one that reads from or writes to an area of a blob that a metadata operation is deallocating).
  • Asynchronous callbacks always take place on the calling thread.
  • No assumptions about IO ordering can be made, regardless of how many or which threads were involved in issuing the IO.

Data Buffer Memory

As with all SPDK-based applications, Blobstore requires that memory used for data buffers be allocated with the SPDK API.

Error Handling

Asynchronous Blobstore callbacks all include an error number that should be checked; non-zero values indicate an error. Synchronous calls typically return an error value where applicable.

Asynchronous API

Asynchronous calls do not necessarily return control immediately; they return at the point in execution where no more forward progress can be made without blocking. Therefore, no assumptions can be made about the progress of an asynchronous call until its callback has been invoked.

Xattrs

Setting and removing xattrs in Blobstore are metadata operations, and xattrs are stored in per-blob metadata. Therefore, xattrs are not persisted until a blob synchronization call is made and completed. Having a two-step process for persisting per-blob metadata allows applications to perform batches of xattr updates, for example, with only one more expensive call to synchronize and persist the values.
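
The two-step behavior can be modeled with a RAM copy and a disk copy of the per-blob metadata. The sketch below is a toy model under stated assumptions, not the Blobstore implementation: every type and function (`model_blob`, `model_set_xattr`, `model_sync_md`, `model_power_cycle`) is hypothetical, and the fixed-size string storage is chosen only to keep the example short. In SPDK itself the corresponding operations are exposed through functions such as `spdk_blob_set_xattr` and `spdk_blob_sync_md`; check the repo for their current signatures.

```c
#include <assert.h>
#include <string.h>

#define MAX_XATTRS 4
#define XATTR_LEN 32

struct xattr {
    char name[XATTR_LEN];
    char value[XATTR_LEN];
};

struct md {
    struct xattr xattrs[MAX_XATTRS];
    int count;
};

/* Hypothetical blob model: one metadata copy in RAM, one on "disk". */
struct model_blob {
    struct md ram;   /* changed by every set call */
    struct md disk;  /* changed only by an explicit sync */
    int sync_calls;  /* counts the expensive persist operations */
};

/* Set or overwrite an xattr in the RAM copy only. Returns 0 on success. */
static inline int model_set_xattr(struct model_blob *b, const char *name, const char *value)
{
    if (strlen(name) >= XATTR_LEN || strlen(value) >= XATTR_LEN) {
        return -1;
    }
    for (int i = 0; i < b->ram.count; i++) {
        if (strcmp(b->ram.xattrs[i].name, name) == 0) {
            strcpy(b->ram.xattrs[i].value, value);
            return 0;
        }
    }
    if (b->ram.count == MAX_XATTRS) {
        return -1;
    }
    strcpy(b->ram.xattrs[b->ram.count].name, name);
    strcpy(b->ram.xattrs[b->ram.count].value, value);
    b->ram.count++;
    return 0;
}

/* Persist the whole RAM copy in one step. */
static inline void model_sync_md(struct model_blob *b)
{
    b->disk = b->ram;
    b->sync_calls++;
}

/* Simulate power loss and reload: unsynced RAM changes are discarded. */
static inline void model_power_cycle(struct model_blob *b)
{
    b->ram = b->disk;
}
```

Two xattrs set back to back cost a single sync, and an update made after the last sync does not survive a simulated power cycle.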

Synchronizing Metadata

As described earlier, there are two types of metadata in Blobstore: per-blob metadata and one global metadata region for the Blobstore itself. Only the per-blob metadata can be explicitly synchronized via the API. The global metadata will be inconsistent during run time and is only synchronized on proper shutdown. The only implication of an improper shutdown, however, is a performance penalty on the next startup, as the global metadata must be rebuilt by parsing the per-blob metadata. For consistent start times, it is important to always shut down the Blobstore properly via the API.

Iterating Blobs

Multiple examples of how to iterate through the blobs are included in the sample code and tools. It is worth noting, however, that when walking through the existing blobs via the iter API, an application that finds the blob it is looking for must either explicitly close it (because it was opened internally by the Blobstore) or finish walking the full list.

The Super Blob

The super blob is simply a single blob ID that can be stored as part of the global metadata to act as a sort of "root" blob. The application may choose to use this blob to store any information it needs or finds relevant for understanding the structure of what is on the Blobstore.
