[io_uring] [Personal notes] io_uring.pdf

This article is intended to serve as an introduction to the newest Linux IO interface, io_uring, and compare it to the existing offerings. We’ll go over the reasons for its existence, inner workings of it, and the user visible interface. The article will not go into details about specific commands and the likes, as that would just be duplicating the information available in the associated man pages. Rather, it will attempt to provide an introduction to io_uring and how it works, with the goal hopefully being that the reader will have gained a deeper understanding of how it all ties together. That said, there will be some overlap between this article and the man pages. It’s impossible to provide a description of io_uring without including some of those details.


1.0 Introduction

There are many ways to do file based IO in Linux. The oldest and most basic are the read(2) and write(2) system calls. These were later augmented with pread(2) and pwrite(2) versions which allow passing in of an offset, and later still we got preadv(2) and pwritev(2) which are vector-based versions of the former. Because that still wasn’t quite enough, Linux also has preadv2(2) and pwritev2(2) system calls, which further extend the API to allow modifier flags. The various differences of these system calls aside, they share the common trait that they are synchronous interfaces. This means that the system calls return when the data is ready (or written). For some use cases that is sub-optimal, and an asynchronous interface is desired. POSIX has aio_read(3) and aio_write(3) to satisfy that need, however the implementation of those is most often lackluster and performance is poor.


Linux does have a native async IO interface, simply dubbed aio. Unfortunately, it suffers from a number of limitations:

  • The biggest limitation is undoubtedly that it only supports async IO for O_DIRECT (or un-buffered) accesses. Due to the restrictions of O_DIRECT (cache bypassing and size/alignment restraints), this makes the native aio interface a no-go for most use cases. For normal (buffered) IO, the interface behaves in a synchronous manner.
  • Even if you satisfy all the constraints for IO to be async, it’s sometimes not. There are a number of ways that the IO submission can end up blocking - if meta data is required to perform IO, the submission will block waiting for that. For storage devices, there are a fixed number of request slots available. If those slots are currently all in use, submission will block waiting for one to become available. These uncertainties mean that applications that rely on submission always being async are still forced to offload that part.
  • The API isn’t great. Each IO submission ends up needing to copy 64 + 8 bytes and each completion copies 32 bytes. That’s 104 bytes of memory copy, for IO that’s supposedly zero copy. Depending on your IO size, this can definitely be noticeable. The exposed completion event ring buffer mostly gets in the way by making completions slower, and is hard (impossible?) to use correctly from an application. IO always requires at least two system calls (submit + wait-for-completion), which in these post spectre/meltdown days is a serious slowdown.


Over the years there have been various efforts at lifting the first limitation mentioned (I also made a stab at it back in 2010), but nothing succeeded. In terms of efficiency, with the arrival of devices that are capable of both sub-10usec latencies and very high IOPS, the interface is truly starting to show its age. Slow and non-deterministic submission latencies are very much an issue for these types of devices, as is the lack of performance that you can extract out of a single core. On top of that, because of the aforementioned limitations, it’s safe to say that native Linux aio doesn’t have a lot of use cases. It’s been relegated to a niche corner of applications, with all the issues that come with that (long term undiscovered bugs, etc).


Furthermore, the fact that “normal” applications have no use for aio means that Linux is still lacking an interface that provides the features that they desire. There is absolutely no reason that applications or libraries continue to need to create private IO offload thread pools to get decent async IO, especially when that can be done more efficiently in the kernel.


2.0 Improving the status quo

Initial efforts were focused on improving the aio interface, and work progressed fairly far down that path before being abandoned. There are multiple reasons why this initial direction was chosen:

  • If you can extend and improve an existing interface, that’s preferable to providing a new one. Adoption of new interfaces takes time, and getting new interfaces reviewed and approved is a potentially long and arduous task.
  • It’s a lot less work in general. As a developer, you’re always looking to accomplish the most with the least amount of work. Extending an existing interface gives you many advantages in terms of existing test infrastructure.


The existing aio interface is comprised of three main system calls: a system call to setup an aio context (io_setup(2)), one to submit IO (io_submit(2)), and one to reap or wait for completions of IO (io_getevents(2)). Since a change in behavior was required for multiple of these system calls, we needed to add new system calls to pass in this information. This created both multiple entry points to the same code, as well as shortcuts in other places. The end result wasn’t very pretty in terms of code complexity and maintainability, and it only ended up fixing one of the highlighted deficiencies from the previous section. On top of that, it actually made one of them worse, since now the API was even more complicated to understand and use.
While it’s always hard to abandon a line of work to start from scratch, it was clear that we needed something new entirely. Something that would allow us to deliver on all points. We needed it to be performant and scalable, while still making it easy to use and having the features that existing interfaces were lacking.


3.0 New interface design goals

While starting from scratch was not an easy decision to make, it did allow us full artistic freedom in coming up with something new. In rough ascending order of importance, the main design goals were:

  • Easy to use, hard to misuse. Any user/application visible interface should have this as a main goal. The interface should be easy to understand and intuitive to use.
  • Extendable. While my background is mostly storage related, I wanted the interface to be usable for more than just block oriented IO. That meant networking and non-block storage interfaces that may be coming down the line. If you’re creating a brand new interface, it should be (or at least attempt to be) future proof in some shape or form.
  • Feature rich. Linux aio caters to a subset (of a subset) of applications. I did not want to create yet another interface that only covered some of what applications need, or that required applications to reinvent the same functionality over and over again (like IO thread pools).
  • Efficiency. While storage IO is mostly still block based and hence at least 512b or 4kb in size, efficiency at those sizes is still critical for certain applications. Additionally, some requests may not even be carrying a data payload. It was important that the new interface was efficient in terms of per-request overhead.
  • Scalability. While efficiency and low latencies are important, it’s also critical to provide the best performance possible at the peak end. For storage in particular, we’ve worked very hard to deliver a scalable infrastructure. A new interface should allow us to expose that scalability all the way back to applications.

Some of the above goals may seem mutually exclusive. Interfaces that are efficient and scalable are often hard to use, and more importantly, hard to use correctly. Both feature rich and efficient can also be hard to achieve. Nevertheless, these were the goals we set out with.


4.0 Enter io_uring

Despite the ranked list of design goals, the initial design was centered around efficiency. Efficiency isn’t something that can be an afterthought, it has to be designed in from the start - you can’t wring it out of something later on once the interface is fixed. I knew I didn’t want any memory copies for either submissions or completion events, and no memory indirections either. At the end of the previous aio based design, both efficiency and scalability were visibly harmed by the multiple separate copies that aio had to do to handle both sides of the IO.
As copies aren’t desirable, it’s clear that the kernel and the application have to graciously share the structures defining the IO itself, and the completion event. If you’re taking the idea of sharing that far, it was a natural extension to have the coordination of shared data also reside in memory shared between the application and the kernel. Once you’ve made that leap, it also becomes clear that synchronization between the two has to be managed somehow. An application can’t share locking with the kernel without invoking system calls, and a system call would surely reduce the rate at which we communicate with the kernel. This was at odds with the efficiency goal. One data structure that would satisfy our needs would be a single producer and single consumer ring buffer. With a shared ring buffer, we could eliminate the need to have shared locking between the application and the kernel, getting away with some clever use of memory ordering and barriers instead.
There are two fundamental operations associated with an async interface: the act of submitting a request, and the event that is associated with the completion of said request. For submitting IO, the application is the producer and the kernel is the consumer. The opposite is true for completions - here the kernel produces completion events and the application consumes them. Hence, we need a pair of rings to provide an effective communication channel between an application and the kernel. That pair of rings is at the core of the new interface, io_uring. They are suitably named submission queue (SQ), and completion queue (CQ), and form the foundation of the new interface.


4.1 DATA STRUCTURES

With the communication foundation in place, it was time to look at defining the data structures that would be used to describe the request and completion event. The completion side is straight forward. It needs to carry information pertaining to the result of the operation, as well as some way to link that completion back to the request it originated from. For io_uring, the layout chosen is as follows:

/* completion queue event */
struct io_uring_cqe {
	__u64	user_data;	/* sqe->data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
};

The io_uring name should be recognizable by now, and the _cqe postfix refers to a Completion Queue Event. For the rest of this article, commonly referred to as just a cqe. The cqe contains a user_data field. This field is carried from the initial request submission, and can contain any information that the application needs to identify said request. One common use case is to have it be the pointer of the original request. The kernel will not touch this field, it’s simply carried straight from submission to completion event. res holds the result of the request. Think of it like the return value from a system call. For a normal read/write operation, this will be like the return value from read(2) or write(2). For a successful operation, it will contain the number of bytes transferred. If a failure occurred, it will contain the negative error value. For example, if an I/O error occurred, res will contain -EIO. Lastly, the flags member can carry meta data related to this operation. As of now, this field is unused.

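To make these fields concrete, here is a minimal sketch of how an application might process a single cqe, assuming it stored a pointer to its own request structure in user_data at submission time. struct app_request and app_process_cqe are hypothetical names, not part of the io_uring API:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <linux/io_uring.h>

struct app_request {
	void	*buf;
	size_t	len;
};

static void app_process_cqe(struct io_uring_cqe *cqe)
{
	/* recover the pointer stored in sqe->user_data at submission time */
	struct app_request *req = (struct app_request *) (uintptr_t) cqe->user_data;

	if (cqe->res < 0) {
		/* res holds the negated errno value, e.g. -EIO */
		fprintf(stderr, "request %p failed: %s\n", (void *) req, strerror(-cqe->res));
	} else {
		/* for a read or write, res is the number of bytes transferred */
		printf("request %p transferred %d bytes\n", (void *) req, cqe->res);
	}
}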

Definition of a request type is more complicated. Not only does it need to describe a lot more information than a completion event, it was also a design goal for io_uring to be extendable for future request types. What we came up with is as follows:


struct io_uring_sqe {
	__u8	opcode;		/* type of operation for this sqe */
	__u8	flags;		/* IOSQE_ flags */
	__u16	ioprio;		/* ioprio for the request */
	__s32	fd;		/* file descriptor to do IO on */
	union {
		__u64	off;	/* offset into file */
		__u64	addr2;
	};
	union {
		__u64	addr;	/* pointer to buffer or iovecs */
		__u64	splice_off_in;
	};
	__u32	len;		/* buffer size or number of iovecs */
	union {
		__kernel_rwf_t	rw_flags;
		__u32		fsync_flags;
		__u16		poll_events;	/* compatibility */
		__u32		poll32_events;	/* word-reversed for BE */
		__u32		sync_range_flags;
		__u32		msg_flags;
		__u32		timeout_flags;
		__u32		accept_flags;
		__u32		cancel_flags;
		__u32		open_flags;
		__u32		statx_flags;
		__u32		fadvise_advice;
		__u32		splice_flags;
		__u32		rename_flags;
		__u32		unlink_flags;
		__u32		hardlink_flags;
	};
	__u64	user_data;	/* data to be passed back at completion time */
	/* pack this to avoid bogus arm OABI complaints */
	union {
		/* index into fixed buffers, if used */
		__u16	buf_index;
		/* for grouped buffer selection */
		__u16	buf_group;
	} __attribute__((packed));
	/* personality to use, if used */
	__u16	personality;
	union {
		__s32	splice_fd_in;
		__u32	file_index;
	};
	__u64	__pad2[2];
};

Akin to the completion event, the submission side structure is dubbed the Submission Queue Entry, or sqe for short. It contains an opcode field that describes the operation code (or op-code for short) of this particular request. One such op-code is IORING_OP_READV, which is a vectored read. flags contains modifier flags that are common across command types. We’ll get into this a bit later in the advanced use case section. ioprio is the priority of this request. For normal read/writes, this follows the definition as outlined for the ioprio_set(2) system call. fd is the file descriptor associated with the request, and off holds the offset at which the operation should take place. addr contains the address at which the operation should perform IO, if the op-code describes an operation that transfers data. If the operation is a vectored read/write of some sort, this will be a pointer to a struct iovec array, as used by preadv(2), for example. For a non-vectored IO transfer, addr must contain the address directly. This carries into len, which is either a byte count for a non-vectored IO transfer, or the number of vectors described by addr for a vectored IO transfer.


Next follows a union of flags that are specific to the op-code. For example, for the mentioned vectored read (IORING_OP_READV), the flags follow those described for the preadv2(2) system call. user_data is common across opcodes, and is untouched by the kernel. It’s simply copied to the completion event, cqe, when a completion event is posted for this request. buf_index will be described in the advanced use cases section. Lastly, there’s some padding at the end of the structure. This serves the purpose of ensuring that the sqe is aligned nicely in memory at 64 bytes in size, but also for future use cases that may need to contain more data to describe a request. A few use cases for that come to mind - one would be a key/value store set of commands, another would be for end-to-end data protection where the application passes in a pre-computed checksum for the data it wants to write.

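As a sketch of how these fields fit together, the helper below fills in an sqe for a vectored read by hand, roughly mirroring what the liburing helper io_uring_prep_readv(3) does. app_prep_readv is a hypothetical name, and sqe is assumed to point at a free entry in the mmap'ed sqe array:

#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

static void app_prep_readv(struct io_uring_sqe *sqe, int fd,
			   struct iovec *iovecs, unsigned nr_vecs,
			   __u64 offset, __u64 user_data)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->fd = fd;
	sqe->off = offset;
	sqe->addr = (unsigned long) iovecs;	/* pointer to the struct iovec array */
	sqe->len = nr_vecs;			/* number of vectors, not bytes */
	sqe->user_data = user_data;		/* copied into the cqe at completion */
}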

4.2 COMMUNICATION CHANNEL

With the data structures described, let’s go into some detail on how the rings work. Even though there is symmetry in the sense that we have a submission and completion side, the indexing is different between the two. Like in the previous section, let’s start with the less complicated one, the completion ring.
The cqes are organized into an array, with the memory backing the array being visible and modifiable by both the kernel and the application. However, since the cqes are produced by the kernel, only the kernel is actually modifying the cqe entries. The communication is managed by a ring buffer. Whenever a new event is posted by the kernel to the CQ ring, it updates the tail associated with it. When the application consumes an entry, it updates the head. Hence, if the tail is different from the head, the application knows that it has one or more events available for consumption. The ring counters themselves are free flowing 32-bit integers, and rely on natural wrapping when the number of completed events exceeds the capacity of the ring. One advantage of this approach is that we can utilize the full size of the ring without having to manage a “ring is full” flag on the side, which would have complicated the management of the ring. With that, it also follows that the ring must be a power of 2 in size.
To find the array index of an event, the application must mask the current head index with the size mask of the ring. This commonly looks something like the below:


unsigned head;
head = cqring->head;
read_barrier();
if (head != cqring->tail) {
	struct io_uring_cqe *cqe;
	unsigned index;
	index = head & (cqring->mask);
	cqe = &cqring->cqes[index];
	/* process completed cqe here */
	// ...
	/* we've now consumed this entry */
	head++;
}
cqring->head = head;
write_barrier();

ring->cqes[] is the shared array of io_uring_cqe structures. In the next sections, we’ll get into the inner details of how this shared memory (and the io_uring instance itself) is set up and managed, and what the magic read and write barrier calls are doing here.
For the submission side, the roles are reversed. The application is the one updating the tail, and the kernel consumes entries from (and updates) the head. One important difference is that while the CQ ring is directly indexing the shared array of cqes, the submission side has an indirection array between them. Hence the submission side ring buffer is an index into this array, which in turn contains the index into the sqes. This might initially seem odd and confusing, but there’s some reasoning behind it. Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one operation. That in turn allows for easier conversion of said applications to the io_uring interface.
Adding an sqe for consumption by the kernel is basically the opposite operation of reaping a cqe from the kernel. A typical example would look something like this:


struct io_uring_sqe *sqe;
unsigned tail, index;
tail = sqring->tail;
index = tail & (*sqring->ring_mask);
sqe = &sqring->sqes[index];
/* this call fills in the sqe entries for this IO */
init_io(sqe);
/* fill the sqe index into the SQ ring array */
sqring->array[index] = index;
tail++;
write_barrier();
sqring->tail = tail;
write_barrier();

As with the CQ ring side, the read and write barriers will be explained later. The above is a simplified example; it assumes that the SQ ring is currently empty, or at least that it has room for one more entry.
As soon as an sqe is consumed by the kernel, the application is free to reuse that sqe entry. This is true even for cases where the kernel isn’t completely done with a given sqe yet. If the kernel does need to access it after the entry has been consumed, it will have made a stable copy of it. Why this can happen isn’t necessarily important, but it has an important side effect for the application. Normally an application would ask for a ring of a given size, and the assumption may be that this size corresponds directly to how many requests the application can have pending in the kernel. However, since the sqe lifetime is only that of the actual submission of it, it’s possible for the application to drive a higher pending request count than the SQ ring size would indicate. The application must take care not to do so, or it could risk overflowing the CQ ring. By default, the CQ ring is twice the size of the SQ ring. This allows the application some amount of flexibility in managing this aspect, but it doesn’t completely remove the need to do so. If the application does violate this restriction, it will be tracked as an overflow condition in the CQ ring. More details on that later.
Completion events may arrive in any order; there is no ordering between request submission and the associated completion. The SQ and CQ rings run independently of each other. However, a completion event will always correspond to a specific submission request.


5.0 io_uring interface

Just like aio, io_uring has a number of system calls associated with it that define its operation. The first one is a system call to setup an io_uring instance:


int io_uring_setup(unsigned entries, struct io_uring_params *params)

The application must provide a desired number of entries for this io_uring instance, and a set of parameters associated with it. entries denotes the number of sqes that will be associated with this io_uring instance. It must be a power of 2, in the range of 1..4096 (both inclusive). The params structure is both read and written by the kernel; it is defined as follows:


/*
 * Passed in for io_uring_setup(2). Copied back with updated info on success
 */
struct io_uring_params {
	__u32 sq_entries;
	__u32 cq_entries;
	__u32 flags;
	__u32 sq_thread_cpu;
	__u32 sq_thread_idle;
	__u32 features;
	__u32 wq_fd;
	__u32 resv[3];
	struct io_sqring_offsets sq_off;
	struct io_cqring_offsets cq_off;
};

The sq_entries will be filled out by the kernel, letting the application know how many sqe entries this ring supports. Likewise for the cqe entries, the cq_entries member tells the application how big the CQ ring is. Discussion of the rest of this structure is deferred to the advanced use cases section, with the exception of the sq_off and cq_off fields as they are necessary to setup the basic communication through the io_uring.

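As a small illustration of the call itself, a raw setup might look like the sketch below. It assumes headers recent enough to define __NR_io_uring_setup, and the entry count of 128 is arbitrary; glibc provides no wrapper, so syscall(2) is used directly:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

int main(void)
{
	struct io_uring_params p;
	int ring_fd;

	memset(&p, 0, sizeof(p));	/* no flags, default behavior */

	ring_fd = syscall(__NR_io_uring_setup, 128, &p);
	if (ring_fd < 0) {
		perror("io_uring_setup");
		return 1;
	}

	/* the kernel reports the ring sizes it actually created */
	printf("sq entries %u, cq entries %u\n", p.sq_entries, p.cq_entries);
	return 0;
}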

On a successful call to io_uring_setup(2), the kernel will return a file descriptor that is used to refer to this io_uring instance. This is where the sq_off and cq_off structures come in handy. Given that the sqe and cqe structures are shared by the kernel and the application, the application needs a way to gain access to this memory. This is done through mmap(2)'ing it into the application memory space. The application uses the sq_off member to figure out the offsets of the various ring members. The io_sqring_offsets structure looks as follows:


/*
 * Filled with the offset for mmap(2)
 */
struct io_sqring_offsets {
    __u32 head;         /* offset of ring head */
    __u32 tail;         /* offset of ring tail */
    __u32 ring_mask;    /* ring mask value */
    __u32 ring_entries; /* entries in ring */
    __u32 flags;        /* ring flags */
    __u32 dropped;      /* number of sqes not submitted */
    __u32 array;        /* sqe index array */
    __u32 resv1;
    __u64 resv2;
};

To access this memory, the application must call mmap(2) using the io_uring file descriptor and the memory offset associated with the SQ ring. The io_uring API defines the following mmap offsets for use by the application:


/*
 * Magic offsets for the application to mmap the data it needs
 */
#define IORING_OFF_SQ_RING		0ULL
#define IORING_OFF_CQ_RING		0x8000000ULL
#define IORING_OFF_SQES			0x10000000ULL

where IORING_OFF_SQ_RING is used to map the SQ ring into the application memory space, IORING_OFF_CQ_RING for the CQ ring ditto, and finally IORING_OFF_SQES to map the sqe array. For the CQ ring, the array of cqes is a part of the CQ ring itself. Since the SQ ring is an index of values into the sqe array, the sqe array must be mapped separately by the application.


The application will define its own structure holding these offsets. One example might look like the following:


struct app_io_sq_ring
{
    unsigned *head;
    unsigned *tail;
    unsigned *ring_mask;
    unsigned *ring_entries;
    unsigned *flags;
    unsigned *dropped;
    unsigned *array;
};

and a typical setup case will thus look like:

struct app_io_sq_ring app_setup_sq_ring(int ring_fd, struct io_uring_params *p)
{
    struct app_io_sq_ring sqring;
    void *ptr;

    ptr = mmap(NULL, p->sq_off.array + p->sq_entries * sizeof(__u32),
               PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
               ring_fd, IORING_OFF_SQ_RING);
    sqring.head = ptr + p->sq_off.head;
    sqring.tail = ptr + p->sq_off.tail;
    sqring.ring_mask = ptr + p->sq_off.ring_mask;
    sqring.ring_entries = ptr + p->sq_off.ring_entries;
    sqring.flags = ptr + p->sq_off.flags;
    sqring.dropped = ptr + p->sq_off.dropped;
    sqring.array = ptr + p->sq_off.array;
    return sqring;
}

The CQ ring is mapped similarly to this, using IORING_OFF_CQ_RING and the offset defined by the io_cqring_offsets cq_off member. Finally, the sqe array is mapped using the IORING_OFF_SQES offset. Since this is mostly boilerplate code that can be reused between applications, the liburing library interface provides a set of helpers to accomplish the setup and memory mapping in a simple manner. See the io_uring library section for details on that. Once all of this is done, the application is ready to communicate through the io_uring instance.

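For completeness, here is a sketch of the remaining two mappings, following the same pattern as the SQ ring example above. app_io_cq_ring is a hypothetical application-side struct analogous to app_io_sq_ring; as before, error checking and the <sys/mman.h> include are omitted for brevity:

struct app_io_cq_ring {
	unsigned *head;
	unsigned *tail;
	unsigned *ring_mask;
	unsigned *ring_entries;
	struct io_uring_cqe *cqes;
};

void app_setup_cq_ring_and_sqes(int ring_fd, struct io_uring_params *p,
				struct app_io_cq_ring *cqring,
				struct io_uring_sqe **sqes)
{
	void *ptr;

	/* the cqe array lives inside the CQ ring mapping itself */
	ptr = mmap(NULL, p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe),
		   PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
		   ring_fd, IORING_OFF_CQ_RING);
	cqring->head = ptr + p->cq_off.head;
	cqring->tail = ptr + p->cq_off.tail;
	cqring->ring_mask = ptr + p->cq_off.ring_mask;
	cqring->ring_entries = ptr + p->cq_off.ring_entries;
	cqring->cqes = ptr + p->cq_off.cqes;

	/* the sqe array is a separate mapping of its own */
	*sqes = mmap(NULL, p->sq_entries * sizeof(struct io_uring_sqe),
		     PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
		     ring_fd, IORING_OFF_SQES);
}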

The application also needs a way to tell the kernel that it has now produced requests for it to consume. This is done through another system call:


int io_uring_enter(int ring_fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

fd refers to the ring file descriptor, as returned by io_uring_setup(2). to_submit tells the kernel that there are up to that amount of sqes ready to be consumed and submitted, while min_complete asks the kernel to wait for completion of that amount of requests. Having the single call available to both submit and wait for completions means that the application can both submit and wait for request completions with a single system call. flags contains flags that modify the behavior of the call. The most important one being:


#define IORING_ENTER_GETEVENTS (1U << 0)

If IORING_ENTER_GETEVENTS is set in flags, then the kernel will actively wait for min_complete events to be available. The astute reader might be wondering what we need this flag for, if we have min_complete as well. There are cases where the distinction is important, which will be covered later. For now, if you wish to wait for completions, IORING_ENTER_GETEVENTS must be set.

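For instance, submitting one sqe and waiting for its completion in a single call could look like this sketch, using the prototype shown above (in practice glibc has no wrapper for io_uring_enter(2), so applications go through syscall(2) or let liburing issue the call):

/* submit one sqe and wait until at least one completion has been posted */
int ret = io_uring_enter(ring_fd, 1, 1, IORING_ENTER_GETEVENTS, NULL);
if (ret < 0)
	perror("io_uring_enter");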

That essentially covers the basic API of io_uring. io_uring_setup(2) will create an io_uring instance of the given size. With that setup, the application can start filling in sqes and submitting them with io_uring_enter(2). Completions can be waited for with the same call, or they can be done separately at a later time. Unless the application wants to wait for completions to come in, it can also just check the CQ ring tail for availability of any events. The kernel will modify the CQ ring tail directly, hence completions can be consumed by the application without necessarily having to call io_uring_enter(2) with IORING_ENTER_GETEVENTS set.


For the types of commands available and how to use them, please see the io_uring_enter(2) man page.


5.1 SQE ORDERING

Usually sqes are used independently, meaning that the execution of one does not affect the execution or ordering of subsequent sqe entries in the ring. This allows full flexibility of operations, and enables them to execute and complete in parallel for maximum efficiency and performance. One use case where ordering may be desired is for data integrity writes. A common example of that is a series of writes, followed by an fsync/fdatasync. As long as we can allow the writes to complete in any order, we only care about having the data sync executed when all the writes have completed. Applications often turn that into a write-and-wait operation, and then issue the sync when all the writes have been acknowledged by the underlying storage.


io_uring supports draining the submission side queue until all previous completions have finished. This allows the application to queue the above mentioned sync operation and know that it will not start before all previous commands have completed. This is accomplished by setting IOSQE_IO_DRAIN in the sqe flags field. Note that this stalls the entire submission queue. Depending on how io_uring is used for the specific application, this may introduce bigger pipeline bubbles than desired. An application may use an independent io_uring context just for integrity writes to allow better simultaneous performance of unrelated commands, if these kinds of drain operations are a common occurrence.

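Using the liburing helpers described later in this article, the write-plus-sync pattern with a drained fsync might look like the sketch below. ring, fd, nr_writes, iovs, and offsets are assumed application state, and error handling is omitted:

struct io_uring_sqe *sqe;
unsigned i;

/* queue a batch of independent writes; these may complete in any order */
for (i = 0; i < nr_writes; i++) {
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_writev(sqe, fd, &iovs[i], 1, offsets[i]);
}

/* the fsync must not start until everything queued before it has completed */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_fsync(sqe, fd, 0);
sqe->flags |= IOSQE_IO_DRAIN;

io_uring_submit(&ring);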

5.2 LINKED SQES

While IOSQE_IO_DRAIN includes a full pipeline barrier, io_uring also supports more granular sqe sequence control. Linked sqes provide a way to describe dependencies between a sequence of sqes within the greater submission ring, where each sqe execution depends on the successful completion of the previous sqe. Examples of such use cases may include a series of writes that must be executed in order, or perhaps a copy-like operation where a read from one file is followed by a write to another file, with the buffers of the two sqes being shared. To utilize this feature, the application must set IOSQE_IO_LINK in the sqe flags field. If set, the next sqe will not be started before the previous sqe has completed successfully. If the previous sqe does not fully complete, the chain is broken and the linked sqe is canceled with -ECANCELED as the error code. In this context, fully complete refers to the fully successful completion of the request. Any error or potentially short read/write will abort the chain, the request must complete to its full extent.

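A copy-like read-then-write pair could be expressed as a link, as in the sketch below (liburing helpers again; in_fd, out_fd, iov, and offset are assumed application state). If the read fails or comes up short, the write is canceled with -ECANCELED:

struct io_uring_sqe *sqe;

/* first sqe: read from the source file into the shared buffer */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, in_fd, &iov, 1, offset);
sqe->flags |= IOSQE_IO_LINK;	/* the next sqe depends on this one */

/* second sqe: write that same buffer to the destination file */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_writev(sqe, out_fd, &iov, 1, offset);

io_uring_submit(&ring);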

5.3 TIMEOUT COMMANDS

While most of the commands supported by io_uring work on data, either directly such as a read/write operation or indirectly like the fsync style commands, the timeout command is a bit different. Rather than work on data, IORING_OP_TIMEOUT helps manipulate waits on the completion ring. The timeout command supports two distinct trigger types, which may be used together in a single command. One trigger type is a classic timeout, with the caller passing in a (variant of) struct timespec that has a non-zero seconds/nanoseconds value. To retain compatibility between 32 vs 64-bit applications and kernel space, the type used must be of the following format:


struct __kernel_timespec {
	int64_t		tv_sec;
	long long	tv_nsec;
};

At some point userspace should have a struct timespec64 available that fits this description. Until then, the above type must be used. If timed timeouts are desired, the sqe addr field must point to a structure of this type. The timeout command will complete once the specified amount of time has passed.


The second trigger type is a count of completions. If used, the completion count value should be filled into the offset field of the sqe. The timeout command will complete once the specified number of completions have happened since the timeout command was queued up.


It’s possible to specify both trigger events in a single timeout command. If a timeout is queued with both, the first condition to trigger will generate the timeout completion event. When a timeout completion event is posted, any waiters of completions will be woken up, regardless of whether the amount of completions they asked for have been met or not.

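Using the liburing helper for this op-code, queueing a timeout that arms both triggers might look like the sketch below; the timespec must remain valid until the timeout completes:

struct __kernel_timespec ts = {
	.tv_sec		= 1,
	.tv_nsec	= 0,
};
struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);
/* completes after 1 second, or after 8 completions, whichever happens first */
io_uring_prep_timeout(sqe, &ts, 8, 0);
io_uring_submit(&ring);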

6.0 Memory ordering

For this section, the standard memory ordering documentation (for example, cppreference's description of memory_order) can be consulted; the definitions used here are the same.
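
As a hedged sketch of what the read_barrier() and write_barrier() calls in the earlier ring examples need to guarantee, they could be spelled with C11 fences along these lines. This is only an illustration of the required semantics; liburing ships its own load-acquire/store-release helpers, which applications should normally rely on:

#include <stdatomic.h>

/* order prior stores (e.g. the sqe contents) before a later store (the tail update) */
#define write_barrier()	atomic_thread_fence(memory_order_release)

/* order a prior load (e.g. the tail) before later loads (the cqe contents) */
#define read_barrier()	atomic_thread_fence(memory_order_acquire)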

7.0 liburing library

With the inner details of the io_uring out of the way, you’ll now be relieved to learn that there’s a simpler way to do much of the above. The liburing library serves two purposes:

  • Remove the need for boilerplate code for setup of an io_uring instance.
  • Provide a simplified API for basic use cases.


The latter ensures that the application does not have to worry about memory barriers at all, or do any ring buffer management on its own. This makes the API much simpler to use and understand, and in fact removes the need to understand all the details of how it works. If we focused solely on providing liburing-based examples, this article could have been a lot shorter, but at least some knowledge of the inner workings is often beneficial for extracting the most performance out of an application. Additionally, liburing is currently focused on reducing boilerplate code and providing basic helpers for standard use cases. Some of the more advanced features are not yet available through liburing. However, that doesn't mean you can't mix and match the two; they both operate on the same structures. In general, applications are encouraged to use the liburing setup helpers, even when they are using the raw interface.

7.1 LIBURING IO_URING SETUP

Let’s start with an example. Instead of calling io_uring_setup(2) manually and subsequently doing an mmap(2) of the three necessary regions, liburing provides the following basic helper to accomplish the very same task:


struct io_uring ring;
io_uring_queue_init(ENTRIES, &ring, 0);

The io_uring structure holds the information for both the SQ and CQ ring, and the io_uring_queue_init(3) call handles all the setup logic for you. For this particular example, we’re passing in 0 for the flags argument. Once an application is done using an io_uring instance, it simply calls:


io_uring_queue_exit(&ring);

to tear it down. Similarly to other resources allocated by an application, once the application exits, they are automatically reaped by the kernel. This is also true for any io_uring instances the application may have created.


7.2 LIBURING SUBMISSION AND COMPLETION

One very basic use case is submitting a request and, later on, waiting for it to complete. With the liburing helpers, this looks something like this:


struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

/* get an sqe and fill in a READV operation */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, &iovec, 1, offset);

/* tell the kernel we have an sqe ready for consumption */
io_uring_submit(&ring);

/* wait for the sqe to complete */
io_uring_wait_cqe(&ring, &cqe);

/* read and process cqe event */
app_handle_cqe(cqe);
io_uring_cqe_seen(&ring, cqe);

This should be mostly self explanatory. The last call to io_uring_wait_cqe(3) will return the completion event for the sqe that we just submitted, provided that you have no other sqes in flight. If you do, the completion event could be for another sqe.


If the application merely wishes to peek at the completion and not wait for an event to become available, io_uring_peek_cqe(3) does that. For both use cases, the application must call io_uring_cqe_seen(3) once it is done with this completion event. Otherwise, repeated calls to io_uring_peek_cqe(3) or io_uring_wait_cqe(3) will keep returning the same event. This split is necessary to avoid the kernel potentially overwriting the existing completion event before the application is done with it. io_uring_cqe_seen(3) increments the CQ ring head, which enables the kernel to fill in a new event at that same slot.

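A common non-blocking pattern is to drain whatever completions are already posted, pairing each successful peek with a seen call, as in the sketch below. app_handle_cqe stands in for the application's own handler from the earlier example, and io_uring_peek_cqe(3) returns 0 when a completion is available:

struct io_uring_cqe *cqe;

/* consume everything that is already available, without blocking */
while (io_uring_peek_cqe(&ring, &cqe) == 0) {
	app_handle_cqe(cqe);
	/* mark the slot as consumed so the kernel may reuse it */
	io_uring_cqe_seen(&ring, cqe);
}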

There are various helpers for filling in an sqe, io_uring_prep_readv(3) is just one example. I would encourage applications to always take advantage of the liburing provided helpers to the extent possible.


The liburing library is still in its infancy, and is continually being developed to expand both the supported features and the helpers available.


8.0 Advanced use cases and features
