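Introduction
In storage scenarios, performance requirements are extremely high. When choosing an IO technology for the lower layers of a storage engine, a discussion like the following may come up.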
http://davmac.org/davpage/linux/async-io.html
So from the above documentation, it seems that Linux doesn't have a true async file I/O that is not blocking (AIO, Epoll or POSIX AIO are all broken in some ways). I wonder if tlinux has any remedy. We should reach out to tlinux experts to get their opinions.
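After reading this passage, readers may have questions such as the following.
- What is being discussed here, and why did this discussion come up?
- Is there a better solution?
- Through what design and implementation does the better solution solve the problem?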
- ...
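In 2019, the Linux kernel officially entered the 5.x era. Among its many new features, the one most relevant to storage is the newest IO engine, io\_uring. Some performance test results show io\_uring performing far better than native AIO, which makes it big news for the asynchronous IO field.
- For question 1, this article briefly reviews the history of IO on Linux, the shortcomings of the synchronous IO interfaces and of the native asynchronous IO interface (AIO), and why the older approaches have these shortcomings.
- For question 2, this article introduces the newest IO engine, io\_uring, from a design perspective.
- For question 3, this article digs into a recent kernel, linux-5.10, to analyze the overall implementation of io\_uring (key data structures, control flow, how individual features are implemented, and so on).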
- ...
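What's Past Is Prologue
With history as a mirror, one can understand rise and fall. Let us first look at the shortcomings of the existing IO interfaces.
Past Synchronous IO Interfaces
Linux today offers many ways to operate on files. By functionality, the past synchronous IO interfaces fall roughly into the following categories.
- Original version
- Offset version
- Vector version
- Offset + vector version
read, write
The most basic file IO system calls are read and write.
The read system call reads data from the open file referred to by a file descriptor.
A brief introduction to read: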
    NAME
        read - read from a file descriptor

    SYNOPSIS
        #include <unistd.h>

        ssize_t read(int fd, void *buf, size_t count);

    DESCRIPTION
        read() attempts to read up to count bytes from file descriptor fd
        into the buffer starting at buf.

        On files that support seeking, the read operation commences at the
        file offset, and the file offset is incremented by the number of
        bytes read. If the file offset is at or past the end of file, no
        bytes are read, and read() returns zero.

        If count is zero, read() may detect the errors described below. In
        the absence of any errors, or if read() does not check for errors,
        a read() with a count of 0 returns zero and has no other effects.

        According to POSIX.1, if count is greater than SSIZE_MAX, the result
        is implementation-defined; see NOTES for the upper limit on Linux.
The write system call writes data to an open file.
A brief introduction to write:
    NAME
        write - write to a file descriptor

    SYNOPSIS
        #include <unistd.h>

        ssize_t write(int fd, const void *buf, size_t count);

    DESCRIPTION
        write() writes up to count bytes from the buffer starting at buf to
        the file referred to by the file descriptor fd.

        The number of bytes written may be less than count if, for example,
        there is insufficient space on the underlying physical medium, or
        the RLIMIT_FSIZE resource limit is encountered (see setrlimit(2)),
        or the call was interrupted by a signal handler after having written
        less than count bytes. (See also pipe(7).)

        For a seekable file (i.e., one to which lseek(2) may be applied, for
        example, a regular file) writing takes place at the file offset, and
        the file offset is incremented by the number of bytes actually
        written. If the file was open(2)ed with O_APPEND, the file offset is
        first set to the end of the file before writing. The adjustment of
        the file offset and the write operation are performed as an atomic
        step.

        POSIX requires that a read(2) that can be proved to occur after a
        write() has returned will return the new data. Note that not all
        filesystems are POSIX conforming.

        According to POSIX.1, if count is greater than SSIZE_MAX, the result
        is implementation-defined; see NOTES for the upper limit on Linux.
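Both excerpts above point out that read() and write() may transfer fewer bytes than requested. As a minimal sketch of how a caller might cope with that, the two helpers below simply loop until the work is done. The names `write_full` and `read_full` are hypothetical and are not part of any library.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are written.
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
int write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;
        }
        p += n;                 /* skip the part that was already written */
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: read until len bytes have arrived or end of file.
 * Returns the number of bytes read (possibly short at EOF), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t total = 0;

    while (total < len) {
        ssize_t n = read(fd, p + total, len - total);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

A caller would use these in place of a bare read() or write() whenever a partial transfer would otherwise have to be handled by hand.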
IO at a specific file offset: pread, pwrite
In a multithreaded environment, thread safety requires the following sequence of operations to be atomic.
    off_t orig;
    ssize_t s;

    orig = lseek(fd, 0, SEEK_CUR);   // Save current offset
    lseek(fd, offset, SEEK_SET);     // Move to the requested offset
    s = read(fd, buf, len);
    lseek(fd, orig, SEEK_SET);       // Restore original file offset
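Leaving atomicity to the caller is tedious. Having the interface itself guarantee it is a better choice, and the later pread does exactly that.
Like read and write, pread and pwrite perform file IO, but at a position specified by the caller rather than at the current file offset, and they do not change the current file offset. This reduces the amount of code and makes it more robust.
A brief introduction to pread and pwrite: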
    NAME
        pread, pwrite - read from or write to a file descriptor at a given
        offset

    SYNOPSIS
        #include <unistd.h>

        ssize_t pread(int fd, void *buf, size_t count, off_t offset);
        ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

    DESCRIPTION
        pread() reads up to count bytes from file descriptor fd at offset
        offset (from the start of the file) into the buffer starting at buf.
        The file offset is not changed.

        pwrite() writes up to count bytes from the buffer starting at buf to
        the file descriptor fd at offset offset. The file offset is not
        changed.

        The file referenced by fd must be capable of seeking.
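Of course, the same effect might have been achieved by giving read and write a set of flag parameters and adding a new flag to represent the new logic, but that would arguably be less elegant. If a parameter can take several values, and the function checks them with conditional expressions and behaves differently for each, then Replace Parameter with Explicit Methods is a suitable refactoring.
If lseek has to be executed repeatedly, each time followed by file IO, the pread and pwrite system calls have a performance advantage in some cases. This is because a single pread or pwrite call costs less than the two system calls lseek plus read/write (although the cost of the actual IO usually far exceeds that of the system call itself, so the benefit of saving a system call is limited). Historically, some databases gained considerably from adopting this newer kernel interface, for example PostgreSQL: [PATCH] Using pread instead of lseek (with analysis)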
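To make the contrast concrete, here is a minimal sketch that places the lseek-based sequence next to a single pread call. The function names `read_at_with_lseek` and `read_at` are hypothetical and chosen only for illustration.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: positioned read built from lseek() + read().
 * It takes up to four system calls, and another thread sharing fd can
 * move the file offset between any two of them. */
ssize_t read_at_with_lseek(int fd, void *buf, size_t len, off_t offset)
{
    off_t orig = lseek(fd, 0, SEEK_CUR);        /* save current offset */
    if (orig == (off_t)-1)
        return -1;
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;

    ssize_t n = read(fd, buf, len);
    int saved_errno = errno;                    /* keep read()'s error code */

    lseek(fd, orig, SEEK_SET);                  /* restore original offset */
    errno = saved_errno;
    return n;
}

/* Hypothetical helper: the same operation as one system call.
 * pread() never touches the file offset, so there is nothing to restore. */
ssize_t read_at(int fd, void *buf, size_t len, off_t offset)
{
    return pread(fd, buf, len, offset);
}
```

In a single-threaded program both functions return the same data and leave the file offset where it was. Only the pread version keeps that property when several threads share the descriptor.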
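Scatter input and gather output (Scatter-Gather IO): readv, writev
"The composition and structure of matter determine its properties; properties determine uses, and uses reflect properties." This is an important idea in natural science, and it holds in computer science as well. Under current computer architectures, stored data consists of one or more basic units, and its physical and logical structure determines its nature: it may be contiguous, or it may not be.
Handling non-contiguous data is comparatively tedious, for example using read to read data into non-contiguous memory, or using write to send out non-contiguous memory. More concretely, to read a contiguous piece of data from a file into different regions of a process, there are two options:
- Use a single read to bring the data into one larger buffer, then split it into parts and copy each part to its own region.
- Call read several times, reading the data into the different regions in batches.
Likewise, writing data blocks from different regions of a program contiguously to a file requires similar handling. The first option costs an extra memory copy, and the second requires multiple read or write system calls; either way, performance suffers.
So how can the programming be simplified and this overhead avoided? One effective answer is to manage the non-contiguous data with a dedicated data structure and transfer it in a batch. Having the interface itself guarantee this is a good choice, and the later readv and writev do exactly that.
These vector-based scatter-input and gather-output system calls do not read or write just a single buffer. They transfer the data of multiple buffers in one call, which removes the overhead of multiple system calls. The mechanism uses an array iov to define the set of buffers to transfer and an integer iovcnt to give the number of elements in iov, where each element of iov is a structure of the following form.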
    struct iovec {
        void  *iov_base;    /* Starting address */
        size_t iov_len;     /* Number of bytes to transfer */
    };
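As a minimal sketch of how an iov array is filled in and handed to writev and readv, the helpers below write a header and a payload that live in separate buffers with one system call, and read them back into two separate buffers with another. The names `write_record` and `read_record` are hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: gather output. The header and the payload are
 * separate NUL-terminated strings, yet they go out in a single writev(). */
ssize_t write_record(int fd, const char *header, const char *payload)
{
    struct iovec iov[2];

    /* iov_base is a plain void *, so the const qualifier has to be cast
     * away; writev() only reads from these buffers. */
    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)payload;
    iov[1].iov_len  = strlen(payload);

    return writev(fd, iov, 2);
}

/* Hypothetical helper: scatter input. readv() fills the header buffer
 * first and then continues into the payload buffer. */
ssize_t read_record(int fd, char *header, size_t header_len,
                    char *payload, size_t payload_len)
{
    struct iovec iov[2] = {
        { .iov_base = header,  .iov_len = header_len  },
        { .iov_base = payload, .iov_len = payload_len },
    };

    return readv(fd, iov, 2);
}
```

Like read and write, both calls return the total number of bytes transferred across all buffers, which may be less than the sum of the iov_len fields.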
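Intersection of features: preadv, pwritev
Both of the features above are progress, but they seem to stand apart from each other. Can the two be merged into one, taking both steps forward at once?
In mathematics, a set is a collection formed by gathering concrete or abstract objects that share some specific property. The objects that make up the set are called its elements. Here I treat an interface as a set, with each specific feature as one of its elements. When constructing a subset of a given finite set of N elements, each element is either included or not, so by the multiplication principle there are 2^N ways to construct it, that is, 2^N subsets. So many possible sets are clearly unwieldy. Based on factors such as which feature subsets real scenarios require, inclusion and exclusion among elements, whether the elements need to be ordered (how features surface at the interface level), and simplicity, we settle on a few elegant interfaces. This is also a philosophical topic in function interface design.
The later preadv and pwritev are the intersection of offset and vector, a small number of simple interfaces settled on out of the huge space of possible combinations.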
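A minimal sketch of that combination: the helper below reads from an explicit offset into two separate buffers with a single preadv call. The name `read_two_at` and its parameter layout are hypothetical.

```c
#define _DEFAULT_SOURCE     /* preadv() is declared under this feature macro in glibc */
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: scatter read at a given offset.
 * Buffer a is filled first, then buffer b. The file offset of fd is
 * left untouched, just as with pread(). */
ssize_t read_two_at(int fd, off_t offset,
                    void *a, size_t a_len,
                    void *b, size_t b_len)
{
    struct iovec iov[2] = {
        { .iov_base = a, .iov_len = a_len },
        { .iov_base = b, .iov_len = b_len },
    };

    return preadv(fd, iov, 2, offset);
}
```

The call needs neither the lseek bookkeeping from the pread section nor a staging buffer, since the offset and the buffer list are both passed directly.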
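IO with a flag set: preadv2, pwritev2
Later still came the variants preadv2 and pwritev2. Compared with preadv and pwritev, the v2 versions can also set flags for the individual IO, such as RWF\_DSYNC, RWF\_HIPRI, RWF\_SYNC, RWF\_NOWAIT, and RWF\_APPEND.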
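As a rough sketch of a per-call flag in use, the helper below asks for data durability on one particular gather write by passing RWF_DSYNC to pwritev2. The helper name `pwritev_dsync` is hypothetical, and the fallback to pwritev plus fdatasync when the flag is rejected is an assumption of this sketch, not something the interface requires.

```c
#define _GNU_SOURCE         /* pwritev2() and the RWF_* flags are GNU extensions in glibc */
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: gather write at offset, with data integrity
 * requested for this call only instead of opening the file with O_DSYNC. */
ssize_t pwritev_dsync(int fd, const struct iovec *iov, int iovcnt, off_t offset)
{
    ssize_t n;

#ifdef RWF_DSYNC
    n = pwritev2(fd, iov, iovcnt, offset, RWF_DSYNC);
    /* Return on success, and on any error other than "flag not supported". */
    if (n >= 0 || (errno != EOPNOTSUPP && errno != ENOSYS && errno != EINVAL))
        return n;
#endif

    /* Assumed fallback for older kernels or libraries: write, then flush data. */
    n = pwritev(fd, iov, iovcnt, offset);
    if (n >= 0 && fdatasync(fd) != 0)
        return -1;
    return n;
}
```

The other flags listed above are passed the same way, in the last argument, and can be combined with bitwise OR.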
A brief introduction to the readv, preadv, and preadv2 family:
    NAME
        readv, writev, preadv, pwritev, preadv2, pwritev2 - read or write
        data into multiple buffers

    SYNOPSIS
        #include <sys/uio.h>

        ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
        ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
        ssize_t preadv(int fd, const struct iovec *iov, int iovcnt,
                       off_t offset);
        ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset);
        ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
                        off_t offset, int flags);
        ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
                         off_t offset, int flags);

    DESCRIPTION
        The readv() system call reads iovcnt buffers from the file
        associated with the file descriptor fd into the buffers described by
        iov ("scatter input").

        The writev() system call writes iovcnt buffers of data described by
        iov to the file associated with the file descriptor fd ("gather
        output").

        The pointer iov points to an array of iovec structures, defined in
        <sys/uio.h> as:

            struct iovec {
                void  *iov_base;    /* Starting address */
                size_t iov_len;     /* Number of bytes to transfer */
            };

        The readv() system call works just like read(2) except that multiple
        buffers are filled.

        The writev() system call works just like write(2) except that
        multiple buffers are written out.

        Buffers are processed in array order. This means that readv()
        completely fills iov[0] before proceeding to iov[1], and so on. (If
        there is insufficient data, then not all buffers pointed to by iov
        may be filled.) Similarly, writev() writes out the entire contents
        of iov[0] before proceeding to iov[1], and so on.

        The data transfers performed by readv() and writev() are atomic: the
        data written by writev() is written as a single block that is not
        intermingled with output from writes in other processes (but see
        pipe(7) for an exception); analogously, readv() is guaranteed to
        read a contiguous block of data from the file, regardless of read
        operations performed in other threads or processes that have file
        descriptors referring to the same open file description (see
        open(2)).

      preadv() and pwritev()