io_uring

Contents

Matching queue depth to kernel and application processing speed

Mapping completion events back to requests

System call interface

Creating an io_uring instance

Shared memory mapping

How the application accesses an io_uring instance

Notifying the kernel that submissions are ready

Summary of the setup and submission interfaces

SQE ordering

Linked SQEs: dependencies between commands

Timeout commands

How the kernel processes commands

Memory ordering

References


git log links: https://git.kernel.dk/cgit/linux-block/log/?h=io_uring-5.14

https://git.kernel.dk/cgit/liburing/log/

Extendable. While my background is mostly storage related, I wanted the interface to be usable for more than
just block oriented IO. That meant networking and non-block storage interfaces that may be coming down the
line. If you're creating a brand new interface, it should be (or at least attempt to be) future proof in some shape
or form.

Problems the design addresses

1. Sharing memory between the kernel and user space.

2. Lock-free communication. With a shared ring buffer, we could eliminate the need to have shared locking between the application and the kernel, getting away with some clever use of memory ordering and barriers instead.

3. The data structures shared between the two sides:

struct io_uring_cqe {   /* the _cqe postfix refers to a Completion Queue Event */
    __u64 user_data; /* copied from the initial request submission; can contain any
                        information the application needs to identify that request.
                        Often a pointer. The kernel does not touch this field, it only
                        carries it from the submission through to the completion event. */
    __s32 res;       /* result of the submitted operation, much like a system call
                        return value: a negative value on error, or e.g. the number of
                        bytes read/written on success */
    __u32 flags;
};

/* The submission side has to accommodate many future command types, so the
   sqe is more complex by design. */
struct io_uring_sqe {  /* Submission Queue Entry */
    __u8 opcode;   /* the operation code (op-code for short) of this particular
                      request. One such op-code is IORING_OP_READV, a vectored read.
                      (Which op-codes exist, can they be extended, and what are the
                      rules for adding new ones?) */
    __u8 flags;    /* modifier flags that are common across command types.
                      (What exactly counts as a command type here?) */
    __u16 ioprio;  /* the priority of this request. For normal reads/writes this
                      follows the definition outlined for the ioprio_set(2) system
                      call. (How many priority levels are there, and how does the
                      kernel handle them?) */
    __s32 fd;      /* the file descriptor associated with the request */
    __u64 off;     /* offset at which the operation should take place, i.e. the
                      read/write offset */
    __u64 addr;    /* the address at which the operation should perform IO, if the
                      op-code describes an operation that transfers data. If the
                      operation is a vectored read/write of some sort, this is a
                      pointer to a struct iovec array. (Presumably a userspace
                      virtual address?) */
    __u32 len;     /* either a byte count for a non-vectored IO transfer, or the
                      number of vectors */
    union {        /* flags that are specific to the op-code */
        __kernel_rwf_t rw_flags;
        __u32 fsync_flags;
        __u16 poll_events;
        __u32 sync_range_flags;
        __u32 msg_flags;
    };
    __u64 user_data; /* same as in the cqe: passed back on completion */
    union {
        __u16 buf_index; /* purpose to be determined */
        __u64 __pad2[3];
    };
};

Matching queue depth to kernel and application processing speed

By default, the CQ ring is twice the size of the SQ ring. This allows the application some amount of flexibility in managing this aspect, but it doesn't completely remove the need to do so. If the application does violate this restriction, it will be tracked as an overflow condition in the CQ ring.

Mapping completion events back to requests

Completion events may arrive in any order; there is no ordering between request submission and the associated completion. The SQ and CQ rings run independently of each other. However, a completion event will always correspond to a given submission request, which the application identifies through the user_data field.

System call interface

Creating an io_uring instance

The application tells the kernel the parameters of the io_uring instance and the number of sqes it wants:
int io_uring_setup(unsigned entries, struct io_uring_params *params);

Parameters:
// entries: denotes the number of sqes that will be associated with this io_uring instance. Must be a power of two in [1, 4096].
// params:
struct io_uring_params {
    __u32 sq_entries; /* these two fields are filled in by the kernel, acting as   */
    __u32 cq_entries; /* output parameters: the SQ and CQ sizes the kernel granted */
    __u32 flags;
    __u32 sq_thread_cpu;
    __u32 sq_thread_idle;
    __u32 resv[5];
    struct io_sqring_offsets sq_off;
    struct io_cqring_offsets cq_off;
};
Return value:
a file descriptor that is used to refer to this io_uring instance.

Shared memory mapping

The io_uring instance created by the kernel is mapped into the application's address space with mmap(2).

Given that the sqe and cqe structures are shared by the kernel and the application, the application needs a way to gain access to this memory. This is done through mmap(2)'ing it into the application memory space.

The application uses the sq_off member to figure out the offsets of the various ring members.

struct io_sqring_offsets {
    __u32 head;         /* offset of ring head */
    __u32 tail;         /* offset of ring tail */
    __u32 ring_mask;    /* ring mask value */
    __u32 ring_entries; /* entries in ring */
    __u32 flags;        /* ring flags */
    __u32 dropped;      /* number of sqes not submitted */
    __u32 array;        /* sqe index array */
    __u32 resv1;
    __u64 resv2;
};

To access this memory, the application must call mmap(2) using the io_uring file descriptor and the memory offset
associated with the SQ ring. The io_uring API defines the following mmap offsets for use by the application:
#define IORING_OFF_SQ_RING 0ULL
#define IORING_OFF_CQ_RING 0x8000000ULL
#define IORING_OFF_SQES 0x10000000ULL

where IORING_OFF_SQ_RING is used to map the SQ ring into the application memory space, IORING_OFF_CQ_RING for
the CQ ring ditto, and finally IORING_OFF_SQES to map the sqe array. For the CQ ring, the array of cqes is a part of the
CQ ring itself. Since the SQ ring is an index of values into the sqe array, the sqe array must be mapped separately by the application.  

This describes the difference between the SQ and CQ rings; the layout difference between the sqe array and the cqe array would be worth illustrating with a diagram.

How the application accesses an io_uring instance

The following is a good illustration of how an application gets hold of the control pointers of an io_uring instance: it defines a structure like the one below.

The application will define its own structure holding these offsets.
 One example might look like the following:
struct app_sq_ring {
   unsigned *head;
   unsigned *tail;
   unsigned *ring_mask;
   unsigned *ring_entries;
   unsigned *flags;
   unsigned *dropped;
   unsigned *array;
};
and a typical setup case will thus look like:

The structure is then filled in from the instantiated io_uring; this is mainly an mmap(2) operation.


struct app_sq_ring app_setup_sq_ring(int ring_fd, struct io_uring_params *p)
{
    struct app_sq_ring sring;
    void *ptr;

    ptr = mmap(NULL, p->sq_off.array + p->sq_entries * sizeof(__u32),
               PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
               ring_fd, IORING_OFF_SQ_RING);
    sring.head = ptr + p->sq_off.head;
    sring.tail = ptr + p->sq_off.tail;
    sring.ring_mask = ptr + p->sq_off.ring_mask;
    sring.ring_entries = ptr + p->sq_off.ring_entries;
    sring.flags = ptr + p->sq_off.flags;
    sring.dropped = ptr + p->sq_off.dropped;
    sring.array = ptr + p->sq_off.array;
    return sring;
}

Notifying the kernel that submissions are ready

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);

Parameters:
fd: refers to the ring file descriptor of the io_uring instance.
to_submit: tells the kernel that there are up to that amount of sqes ready to be consumed and submitted.
min_complete: asks the kernel to wait for completion of that amount of requests. Having a single call available to both submit and wait for completions means that the application can do both with a single system call; once this many requests have completed, the kernel can return to the application.
flags: selects the specific behaviour of this call.

Details of the flags argument

The most important one being:
#define IORING_ENTER_GETEVENTS (1U << 0)
If IORING_ENTER_GETEVENTS is set in flags, then the kernel will actively wait for min_complete events to be available. For now, IORING_ENTER_GETEVENTS must be set if you wish to wait for completions (waiting, of course, makes the call synchronous).

Summary of the setup and submission interfaces

That essentially covers the basic API of io_uring:

  • io_uring_setup(2) creates an io_uring instance of the given size.
  • With that setup, the application can start filling in sqes and submitting them with io_uring_enter(2).

Requests are completed through one or more of the same system calls:

  • completions can be waited for with the same call (a synchronous wait),
  • or they can be reaped separately at a later time. Unless the application wants to wait for completions to come in, it can also just check the CQ ring tail for availability of any events. The kernel modifies the CQ ring tail directly, hence completions can be consumed by the application without necessarily having to call io_uring_enter(2) with IORING_ENTER_GETEVENTS set (asynchronous polling of the CQ ring).

SQE ordering

Normally, commands submitted to the kernel carry no ordering requirement: if the application submits 10 commands, the kernel may process all 10 in parallel. But some scenarios do require ordering, such as data integrity writes. A common example of that is a series of writes, followed by an fsync/fdatasync.

io_uring supports draining the submission side queue until all previous completions have finished. This allows the application to queue the above-mentioned sync operation and know that it will not start before all previous commands have completed. This is accomplished by setting IOSQE_IO_DRAIN in the sqe flags field.

This flag affects the entire submission queue, so an application may prefer a dedicated io_uring instance for drain-style operations: an application with ordering requirements can create several io_uring instances to track unordered and ordered commands separately.

The mechanism resembles the fence mechanism on GPUs: when a command carrying the DRAIN flag is reached, all commands ahead of it must complete normally first.

Linked SQEs: dependencies between commands

Another scenario: dependencies between a sequence of sqes within the greater submission ring.

Examples of such use cases include

  • a series of writes that must be executed in order,
  • or perhaps a copy-like operation, where a read from one file is followed by a write to another file, with the buffers of the two sqes being shared.

To use this, the application sets IOSQE_IO_LINK in the sqe flags field. If set, the next sqe will not be started before the previous sqe has completed successfully.

The difference from the drain ordering above is granularity: multiple SQEs are flagged to form one link chain.

A chain is defined as starting with the first sqe that has IOSQE_IO_LINK set, and ending with the first subsequent sqe that does not have it set. Chains are independent of each other and may be processed in parallel.

Timeout commands

There are two trigger types:

  • One trigger type is a classic timeout, i.e. an explicit expiry time.
  • The second trigger type is a count of completions: the timeout command completes once that many completion events have been posted.

How the kernel processes commands

Memory ordering

One important aspect of both safe and efficient communication through an io_uring instance is the proper use of memory ordering primitives.

Two simple memory ordering operations suffice:

  • read_barrier(): ensure previous writes are visible before doing subsequent memory reads.
  • write_barrier(): order this write after previous writes.

How these map to instructions is strongly CPU-architecture dependent.

Submission is a two stage process:

  • first the various sqe members are filled in and the sqe index is placed in the SQ ring array,
  • and then the SQ ring tail is updated to show the kernel that a new entry is available.

Without any ordering implied, it's perfectly legal for the processor to reorder these writes in any order it deems the most optimal for performance. Let's take a look at the following example, with each number indicating a memory operation:

1: sqe->opcode = IORING_OP_READV;
2: sqe->fd = fd;
3: sqe->off = 0;
4: sqe->addr = &iovec;
5: sqe->len = 1;
6: sqe->user_data = some_value;
   write_barrier(); /* ensure previous writes are seen before tail write */
7: sqring->tail = sqring->tail + 1;
   write_barrier(); /* ensure tail write is seen */

Without the first write_barrier() above, the kernel could see the sqe as soon as the store in step 7 lands. But because the CPU may reorder the writes, some of the other fields (e.g. the store in step 4) might not have been written back yet, leaving the kernel with an incomplete sqe.

The kernel will include a read_barrier() before reading the SQ ring tail, to ensure that the tail write from the application is visible.

The liburing library

liburing wraps the complexity of the raw system calls, hiding operations such as mmap and memory barriers. It currently focuses on the common use cases; some special scenarios still require the direct system calls.

Setting up an io_uring instance and its mappings

struct io_uring ring; /* the io_uring structure holds the information for both the SQ and CQ rings */
io_uring_queue_init(ENTRIES, &ring, 0);

Once an application is done using an io_uring instance, it simply calls:

 io_uring_queue_exit(&ring);

 to tear it down.

Submitting requests and handling completions

One very basic use case is submitting a request and, later on, waiting for it to complete. With the liburing helpers, this looks something like this:

struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
/* get an sqe and fill in a READV operation */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, &iovec, 1, offset);
/* tell the kernel we have an sqe ready for consumption */
io_uring_submit(&ring);
/* wait for the sqe to complete */
io_uring_wait_cqe(&ring, &cqe);
/* read and process the cqe event */
app_handle_cqe(cqe);
io_uring_cqe_seen(&ring, cqe);

The code above is a minimal example: submit one SQE, wait, then handle the completion event.

If the application merely wishes to peek at the completion and not wait for an event to become available, io_uring_peek_cqe(3) does that. For both use cases, the application must call io_uring_cqe_seen(3) once it is done with this completion event. 

For asynchronous consumption, the application calls io_uring_peek_cqe instead.


The liburing library is still in its infancy, and is continually being developed to expand both the supported features and the helpers available.


Advanced use cases and features

References

Memory barriers: https://zhuanlan.zhihu.com/p/125737864

A heated discussion about performance

https://github.com/axboe/liburing/issues/189

Benchmarking software

https://github.com/alexhultman/io_uring_epoll_benchmark

Alibaba's work: https://developer.aliyun.com/article/781368

https://kernel.taobao.org/2019/06/io_uring-a-new-linux-asynchronous-io-API/

https://openanolis.cn/sig/high-perf-storage

https://developers.mattermost.com/blog/hands-on-iouring-go/

https://zhuanlan.zhihu.com/p/348225926

https://www.phoronix.com/scan.php?page=news_item&px=KVM-IO-uring-Passthrough-LF2020

https://thenewstack.io/how-io_uring-and-ebpf-will-revolutionize-programming-in-linux/

Installing qemu with io_uring on Ubuntu

https://gist.github.com/sl45sms/e09ee281b0fb7707196fbf6de3d67e46

qemu error on deepin: https://github.com/flutter/flutter/issues/72987

"Failed to initialize KVM: No such file or directory" means virtualization is not enabled in the BIOS.

qemu-system-x86_64 -enable-kvm -smp 2 -m 2048 -hda uos.img -cdrom uniontechos-desktop-20-professional-1032_amd64.iso -vnc 127.0.0.1:0

Enabling virtio-blk

https://zhuanlan.zhihu.com/p/50550676

io_uring tutorial

https://unixism.net/loti/low_level.html

QEMU additions and history

https://github.com/rooshm/qemu/commits/io_uring
https://wiki.qemu.org/Features/IOUring
