The file_operations structure gets smaller

https://lwn.net/Articles/972081/ By Jonathan Corbet May 3, 2024

Kernel developers are encouraged to send their changes in small batches as a way of making life easier for reviewers. So when a longtime developer and maintainer hits the list with a 437-patch series touching 859 files, eyebrows are certain to head skyward. Specifically, this series from Jens Axboe is cleaning up one of the core abstractions that has been part of the Linux kernel almost since the beginning; authors of device drivers (among others) will have to take note.

The origin of struct file_operations

In the beginning, the Linux kernel lacked any sort of virtual filesystem layer. See, for example, the 0.01 implementation of read(), which contained explicit checks for each possible file-descriptor type. That approach worked to get an initial kernel to boot but, before long, Linus Torvalds realized that it would not scale well. As developers sought to add more device types, and to implement more than one filesystem type, the need for an abstraction layer became more urgent.
The Linux 0.95 release, which came out in March 1992, brought a number of changes, including a switch to the GPL license. It also added the first pieces of what was to become the kernel’s virtual filesystem layer. A core piece of that layer was the first file_operations structure, defined, in its entirety, as:

    struct file_operations {
        int (*lseek) (struct inode *, struct file *, off_t, int);
        int (*read) (struct inode *, struct file *, char *, int);
        int (*write) (struct inode *, struct file *, char *, int);
    };

This structure contains the pointers to the functions needed to implement specific system calls on anything that can be represented by a file descriptor. Rather than use an extended if-then-else sequence to determine which type of file was being operated on, the kernel could just do an indirect call to the appropriate file_operations member. As might be expected, the most fundamental operations — reading, writing, and seeking — showed up here first. In early versions of the kernel, there wasn’t much else that one could do with a file descriptor.
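The indirect-call idea can be sketched in user-space C. This is only an illustration, not kernel code: struct file_ops, struct file, do_read(), and the two toy "drivers" are hypothetical names invented for the example.

```c
#include <stddef.h>
#include <string.h>

struct file;  /* forward declaration so the table can refer to it */

/* Hypothetical, simplified operations table (not the kernel's definition). */
struct file_ops {
    long (*read)(struct file *f, char *buf, size_t len);
    long (*write)(struct file *f, const char *buf, size_t len);
};

struct file {
    const struct file_ops *ops;  /* which implementation backs this file */
    long pos;
};

/* Toy "driver" A: every read returns zero bytes of data, like /dev/zero. */
static long zero_read(struct file *f, char *buf, size_t len)
{
    memset(buf, 0, len);
    f->pos += (long)len;
    return (long)len;
}

/* Toy "driver" B: always at end-of-file, like /dev/null. */
static long null_read(struct file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

/* Both toy drivers discard whatever is written to them. */
static long null_write(struct file *f, const char *buf, size_t len)
{
    (void)f; (void)buf;
    return (long)len;
}

static const struct file_ops zero_ops = { .read = zero_read, .write = null_write };
static const struct file_ops null_ops = { .read = null_read, .write = null_write };

/* Generic entry point: no per-type if-then-else, just one indirect call. */
static long do_read(struct file *f, char *buf, size_t len)
{
    if (!f->ops || !f->ops->read)
        return -1;  /* operation not supported by this file type */
    return f->ops->read(f, buf, len);
}
```

Adding a third file type in this sketch means writing a new table; do_read() itself does not change, which is the scaling property the explicit checks in 0.01 lacked.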
The file_operations structure grew from there. The 1.0 version of this structure included ten members, implementing system calls like readdir(), ioctl(), and mmap(). The 2.0 version of struct file_operations had 13 members, and 2.2 added two more. Through all of this history, the read() and write() members remained the way to read from and write to a file descriptor, though their prototypes changed somewhat.

Things get more complicated

The 2.4 release, made at the beginning of 2001, included a version of struct file_operations with these new members:

    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

User-space developers often needed the ability to perform scatter/gather I/O — operations involving multiple segments of memory that needed to be transferred in a single operation. In response, the kernel gained support for readv() and writev() but, to properly support these system calls, the kernel needed to pass them down to the underlying implementations. The new members, which took an array of iovec structures containing an address (in user space) and size for each segment, were added for this purpose. For device drivers or filesystems that did not implement the new functions, the kernel would emulate them with a series of read() or write() calls instead.
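The fallback described above, emulating a vectored read with a series of plain reads, might look roughly like the following user-space sketch. The names struct seg, struct src, simple_read(), and emulated_readv() are hypothetical stand-ins; the kernel's real fallback code differs in detail.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for struct iovec: one destination segment. */
struct seg { char *base; size_t len; };

/* Hypothetical data source with a read position, standing in for a file. */
struct src { const char *data; size_t size; size_t pos; };

/* The "old" single-buffer operation: copy up to len bytes into buf. */
static long simple_read(struct src *s, char *buf, size_t len)
{
    size_t left = s->size - s->pos;
    size_t n = len < left ? len : left;

    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return (long)n;
}

/* Emulated vectored read: one simple_read() call per segment, in order. */
static long emulated_readv(struct src *s, const struct seg *segs, size_t nsegs)
{
    long total = 0;

    for (size_t i = 0; i < nsegs; i++) {
        long n = simple_read(s, segs[i].base, segs[i].len);

        if (n < 0)
            return total ? total : n;  /* report progress made before an error */
        total += n;
        if ((size_t)n < segs[i].len)
            break;                     /* short read: do not touch later segments */
    }
    return total;
}
```

A driver that can fill all of the segments in one pass would supply its own vectored implementation instead of relying on a loop like this one.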
Subsequent work added many more members to struct file_operations, including other variants of read() and write(). aio_read() and aio_write(), used to implement the kernel’s somewhat unloved asynchronous I/O mechanism, went into the 2.5.33 development release. splice_read() and splice_write(), implementing the splice() system call, were added for 2.6.17. Removals of file_operations members, like the removal of kernel code in general, were rare, but readv() and writev() were removed in 2.6.19 after all users were switched to use aio_read() and aio_write() instead.
The 3.16 version of struct file_operations had grown to 27 members, including these additions indicating a new approach to I/O within the kernel:

    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);

Increasingly, I/O operations were being initiated from the kernel, not just from user space; they often involved multiple segments and needed to be executed asynchronously. The data buffers involved could be referenced in a number of ways. The iov_iter structure used to describe these more complex I/O operations looked like this at the time:

    struct iov_iter {
        int type;
        size_t iov_offset;
        size_t count;
        union {
            const struct iovec *iov;
            const struct bio_vec *bvec;
        };
        unsigned long nr_segs;
    };

The key distinguishing feature of this structure is related to the type field. If it was ITER_IOVEC, then the iov union member contained an array of segments using user-space addresses. If it was, instead, ITER_KVEC, then the addresses were in kernel space. And if type was ITER_BVEC, then the bvec field pointed to an array of bio_vec structures (used to describe block I/O requests). An I/O API defined in this way could be called from a number of contexts and would work regardless of whether the operation was initiated from user space or from within the kernel.
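To make the role of the type field concrete, here is a small user-space sketch of a type-tagged iterator. All of the sk_* names are hypothetical; unlike the kernel's helpers, this version does not advance the iterator or distinguish user-space from kernel-space addresses, it only shows one copy routine serving two segment layouts.

```c
#include <stddef.h>
#include <string.h>

enum sk_type { SK_VEC, SK_PAGEVEC };

/* Address + length, loosely like an iovec or kvec segment. */
struct sk_vec { char *base; size_t len; };

/* Page + offset + length, loosely like a bio_vec segment. */
struct sk_pagevec { char *page; size_t offset; size_t len; };

struct sk_iter {
    enum sk_type type;   /* selects which union member is valid */
    union {
        const struct sk_vec *vec;
        const struct sk_pagevec *pvec;
    };
    size_t nr_segs;
};

/* Copy up to len bytes from src into the iterator's segments.
 * Returns the number of bytes copied. */
static size_t sk_copy_to_iter(const char *src, size_t len, struct sk_iter *it)
{
    size_t done = 0;

    for (size_t i = 0; i < it->nr_segs && done < len; i++) {
        char *dst;
        size_t seg_len;

        /* The only type-specific step: locate this segment's memory. */
        switch (it->type) {
        case SK_VEC:
            dst = it->vec[i].base;
            seg_len = it->vec[i].len;
            break;
        case SK_PAGEVEC:
            dst = it->pvec[i].page + it->pvec[i].offset;
            seg_len = it->pvec[i].len;
            break;
        default:
            return done;
        }

        size_t n = len - done < seg_len ? len - done : seg_len;
        memcpy(dst, src + done, n);
        done += n;
    }
    return done;
}
```

Callers of a routine like this never need to know which layout they were handed, which is what allows a single read_iter() implementation to serve several kinds of callers.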
The kiocb structure is used by the kernel to coordinate asynchronous I/O operations. Drivers are not required to implement asynchronous I/O (though they may not perform as well if they don’t), but if they do implement it, they need the information in this structure. The use of struct kiocb reflects the fact that, among other goals, the new methods were intended to replace aio_read() and aio_write(), which were duly removed for the 4.0 release.

struct iov_iter everywhere

Over time, struct iov_iter has evolved and become rather more complex; see the 6.8 version for the details. The kernel has also accumulated a set of helpers that free code from dealing with that complexity much of the time. Meanwhile, struct file_operations in 6.8 is up to 32 callable members. But, through all of this change, read() and write() have remained essentially unchanged, even though they only handle the simplest of I/O operations in what has become a complicated world.
Axboe has decided that, perhaps, those two members have reached the end of their useful life:

10 years ago we added ->read_iter() and ->write_iter() to struct file_operations. These are great, as they pass in an iov_iter rather than a user buffer + length, and they also take a struct kiocb rather than just a file. Since then we’ve had two paths for any read or write - one legacy one that can’t do per-IO hints like “This read should be non-blocking”, they strictly only work with O_NONBLOCK on the file, and a newer one that supports everything the old path does and a bunch more.
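The difference between a file-wide O_NONBLOCK setting and a per-I/O hint can be illustrated with a toy model. Everything here is an assumption made for the example: the sk_* names, the flag values, and the SK_WOULD_SLEEP marker (a real driver would sleep at that point rather than return). In the kernel, the per-request information travels in struct kiocb.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_O_NONBLOCK  0x1u      /* stand-in for O_NONBLOCK on the open file */
#define SK_REQ_NOWAIT  0x1u      /* stand-in for a per-request "do not block" hint */
#define SK_EAGAIN      (-11L)    /* "no data yet, try again later" */
#define SK_WOULD_SLEEP (-1000L)  /* marker: a real driver would sleep here */

struct sk_file { unsigned int flags; bool data_ready; };

/* Per-request context, playing the role that struct kiocb plays in the kernel. */
struct sk_req { struct sk_file *file; unsigned int flags; };

static long sk_fill(char *buf, size_t len)
{
    memset(buf, 'x', len);
    return (long)len;
}

/* Legacy-style read: blocking behavior can only come from the file's flags. */
static long sk_read(struct sk_file *f, char *buf, size_t len)
{
    if (!f->data_ready)
        return (f->flags & SK_O_NONBLOCK) ? SK_EAGAIN : SK_WOULD_SLEEP;
    return sk_fill(buf, len);
}

/* Iter-style read: each request can ask for non-blocking behavior itself. */
static long sk_read_iter(struct sk_req *req, char *buf, size_t len)
{
    bool nowait = (req->flags & SK_REQ_NOWAIT) ||
                  (req->file->flags & SK_O_NONBLOCK);

    if (!req->file->data_ready)
        return nowait ? SK_EAGAIN : SK_WOULD_SLEEP;
    return sk_fill(buf, len);
}
```

In this model, two requests against the same open file can differ in whether they are allowed to block, something the legacy entry point has no way to express.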

Since read_iter() and write_iter() can do everything that read() and write() can do, it makes sense to simply remove the older members. The only problem is, of course, there is a lot of code that only implements read() and write() in the kernel; much of it is in drivers that may not have seen significant development (or even use) in years. Some of them surely are being used, though, and breaking them would undoubtedly increase the (already high) level of grumpiness on the net.
Many modules that use the older interface can, with some effort, be converted to use read_iter() and write_iter() instead, perhaps gaining functionality in the process. But there are a lot of these modules, and trying to understand every one of them well enough to do such a conversion is a path to madness, with little benefit. So, instead, Axboe started by implementing a set of helpers that emulates the new functions with a series of calls to read() or write(); that minimizes the amount of change to any given module while maximizing the chances that the results will be correct. See this patch as an example of what the simplest conversions look like.
The final patch in the series removes read() and write() with a surprising lack of ceremony, given that they have been there for 32 years.
There have not been a lot of comments on the series; perhaps many developers are still waiting for the whole thing to download into their inboxes. Al Viro noted that some of the conversions might need to be done a bit more carefully. But nobody has objected to the overall concept, thus far.
For a series like this to be accepted, it will need to be split into more manageable chunks — which Axboe acknowledged at the outset. This set of changes does simplify the kernel, though, and it removes a fair amount of old code, so chances are that it will happen in some form, sooner or later. At that point, there will likely be a lot of out-of-tree modules that will need to be updated before they can be built on newer kernels. The good news is that developers can make those changes now and get ahead of the game.
