initramfs: a new model for initial RAM disks
KEY: system boot, Linux, in-memory filesystems, ramfs, tmpfs, ramdisk
The problem. (Why "root=" doesn't scale.)
When the Linux kernel boots the system, it must find and run the first user program, generally called "init". User programs live in filesystems, so the Linux kernel must find and mount the first (or "root") filesystem in order to boot successfully.
Ordinarily, available filesystems are listed in the file /etc/fstab so the mount program can find them. But /etc/fstab is itself a file, stored in a filesystem. Finding the very first filesystem is a chicken and egg problem, and to solve it the kernel developers created the kernel command line option "root=", to specify which device the root filesystem lives on.
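The problem
When the Linux kernel boots a system, it must find and execute the first user program (usually called init) before the boot can succeed. Because user programs (files) are stored in filesystems, the Linux kernel must first find and mount the first (or "root") filesystem.
Normally the available filesystems are listed in the file /etc/fstab so that mount can find them. But /etc/fstab is itself a file stored in a filesystem. Finding the first filesystem therefore becomes a chicken-and-egg problem, and to solve it the kernel developers created the kernel command line option "root=", which specifies the device that holds the root filesystem.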
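KEMIN: Don't be intimidated by the word "kernel". You can think of it as an ordinary program whose command line arguments can be configured before it runs.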
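To make the "root=" option concrete, here is a minimal sketch of splitting a kernel-style command line into options. The function name parse_cmdline and the example command line are made up for illustration; the real kernel parser also handles quoting and other details that this sketch ignores.

```python
def parse_cmdline(cmdline):
    """Split a kernel-style command line into a dict of options."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Flags without "=" (such as "ro" or "quiet") are stored as True.
        options[key] = value if sep else True
    return options


# Hypothetical example; on a running Linux system the real command line
# can be read from /proc/cmdline.
example = "ro root=/dev/sda1 quiet console=ttyS0,115200"
opts = parse_cmdline(example)
print(opts["root"])
```

With this example the lookup of "root" yields the device name the kernel would be asked to mount, which is exactly the single piece of information the rest of this article argues is no longer enough.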
Fifteen years ago, "root=" was easy to interpret. It was either a floppy drive or a partition on a hard drive. These days the root filesystem could be on dozens of different types of hardware (SCSI, SATA, flash MTD), or even spread across several of them in a RAID. Its location could move around from boot to boot, such as hot pluggable USB devices on a system with multiple USB ports -- when there are several USB devices, which one is correct? The root filesystem might be compressed (how?), encrypted (with what keys?), or loopback mounted (where?). It could even live out on a network server, requiring the kernel to acquire a DHCP address, perform a DNS lookup, and log in to a remote server (with username and password), all before the kernel can find and run the first userspace program.
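Fifteen years ago, the device named by "root=" was easy to find: it was either a floppy disk or a partition on a hard disk. Today the root filesystem can be stored on many different kinds of hardware (SCSI, SATA, flash MTD, USB), or even on a RAID built from several kinds of hardware. The root filesystem can also live on a network server outside the machine (before it can find "root", the kernel must obtain a DHCP address, perform a DNS lookup, and log in to the remote server with a username and password). The root filesystem may well be in a different location on every boot.
Besides the question of location, the way the root filesystem is stored also calls for policy decisions. For example, the root filesystem might be compressed (how?), encrypted (with what keys?), or loopback mounted (where?).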
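KEMIN: Whatever the policy for handling root, remember that the kernel's ultimate goal is still to find the root filesystem and run the first userspace program on it, completing the boot.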
These days, "root=" just isn't enough information. Even hard-wiring tons of special case behavior into the kernel doesn't help with device enumeration, encryption keys, or network logins that vary from system to system. Worse, programming the kernel to perform these kinds of complicated multipart tasks is like writing web software in assembly language: it can be done, but it's considerably easier to simply use the proper tools for the job. The kernel is designed to follow orders, not give them.
With no end to this ever-increasing complexity in sight, the kernel developers decided to back up and find a better way to deal with the whole problem.
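Faced with so many dynamic policies, a single command line parameter (root=) is nowhere near enough. Even hard-coding every special-case behavior (device enumeration, encryption keys, network logins) into the kernel would not help, because there are too many distinct systems. Worse, adding these complicated behaviors to the kernel is like writing web software in assembly language: it can be done, but it is far less economical and far harder than using the proper tools for the job.
As the kernel evolved, and with this complexity endless and still growing, the kernel developers decided to back up from the existing approach and look for a better overall solution.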
The solution
Linux 2.6 kernels bundle a small ram-based initial root filesystem into the kernel, and if this filesystem contains a program called "/init" the kernel runs that as its first program. At that point, finding some other filesystem containing some other program to run is no longer the kernel's problem, but is now the job of the new program.
The contents of initramfs don't have to be general purpose. If a given system's root filesystem lives on an encrypted network block device, and the network address, login, and decryption key are all to be found on a USB device named "larry" (which requires a password to access), that system's initramfs can have a special-purpose program that knows all about that, and makes it happen.
For systems that don't need a large root filesystem, there's no need to locate or switch to any other root filesystem.
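The decision described above can be sketched as a toy model. This is not kernel code: the function first_program, its arguments, and the "/sbin/init" fallback name are assumptions chosen for illustration.

```python
def first_program(initramfs_files, cmdline_options):
    """Toy model of how the first userspace program is chosen."""
    if "/init" in initramfs_files:
        # initramfs provides /init: run it and leave everything else,
        # including finding any other root filesystem, to userspace.
        return ("initramfs", "/init")
    # Otherwise fall back to the older path: mount the device named by
    # root= and run init from there ("/sbin/init" is assumed here).
    root = cmdline_options.get("root")
    if root is None:
        raise RuntimeError("no /init in initramfs and no root= given")
    return (root, "/sbin/init")


print(first_program({"/init", "/bin/busybox"}, {}))
print(first_program(set(), {"root": "/dev/sda1"}))
```

The point of the sketch is the asymmetry: once /init exists in initramfs, the "root=" value is no longer something the kernel itself has to interpret.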
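The solution
The Linux 2.6 kernel builds a small RAM-based initial root filesystem (initramfs) into the kernel, and if this filesystem contains a program named /init, the kernel runs it as the first program. From that point on, finding some other filesystem and running a program on it is no longer the kernel's problem but init's.
The contents (or function) of initramfs need not be general purpose. For example, if a system's root filesystem is on an encrypted network block device, and the network address, login, and decryption key are stored on a USB device named "larry" that requires a password to access, the system's initramfs can hold a special-purpose program that knows the password for "larry" and handles the network access logic and so on.
For systems that do not need a large root filesystem (a LiveCD, for example), there is no need to locate or switch to any other root filesystem.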
How is this different from initrd?
The Linux kernel already had a way to provide a RAM-based root filesystem, the initrd mechanism. For 2.4 and earlier kernels, initrd is still the only way to do this sort of thing. But the kernel developers chose to implement a new mechanism in 2.6 for several reasons.
ramdisk vs ramfs
A ramdisk (like initrd) is a ram based block device, which means it's a fixed size chunk of memory that can be formatted and mounted like a disk. This means the contents of the ramdisk have to be formatted and prepared with special tools (such as mke2fs and losetup), and like all block devices it requires a filesystem driver to interpret the data at runtime. This also imposes an artificial size limit that either wastes space (if the ramdisk isn't full, the extra memory it takes up still can't be used for anything else) or limits capacity (if the ramdisk fills up but other memory is still free, you can't expand it without reformatting it).
But ramdisks actually waste even more memory due to caching. Linux is designed to cache all files and directory entries read from or written to block devices, so Linux copies data to and from the ramdisk into the "page cache" (for file data), and the "dentry cache" (for directory entries). The downside of the ramdisk pretending to be a block device is it gets treated like a block device.
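How is this different from initrd?
The Linux kernel has in fact long had a way to provide a RAM-based root filesystem: the initrd mechanism. For 2.4 and earlier kernels, initrd is the only way to get a RAM-based root filesystem, but the kernel developers chose, for several reasons, to implement a new mechanism in 2.6.
ramdisk vs ramfs
A ramdisk (such as initrd) is a RAM-based block device, meaning it is a fixed-size chunk of memory that can be formatted and mounted like a disk. In other words, a ramdisk must be formatted with tools (such as mke2fs and losetup) before use and, like all block devices, it needs a filesystem driver to interpret it at runtime. It also means the ramdisk has an unnecessary, artificial size limit, which both wastes space (if the ramdisk is not full, the extra memory it occupies cannot be used for anything else) and limits capacity (if the ramdisk is full while other memory is still free, it cannot be expanded without reformatting).
In fact, a ramdisk wastes even more memory through caching. Linux is designed to cache reads and writes to block devices, so when it reads or writes this fake block device it copies file data into the "page cache" and directory data into the "dentry cache". The drawback of making a ramdisk look like a block device is that it gets treated entirely like one.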
A few years ago, Linus Torvalds had a neat idea: what if Linux's cache could be mounted like a filesystem? Just keep the files in cache and never get rid of them until they're deleted or the system reboots? Linus wrote a tiny wrapper around the cache called "ramfs", and other kernel developers created an improved version called "tmpfs" (which can write the data to swap space, and limit the size of a given mount point so it fills up before consuming all available memory). Initramfs is unpacked into rootfs, which is an instance of ramfs (or of tmpfs, on newer kernels built with tmpfs support).
These ram based filesystems automatically grow or shrink to fit the size of the data they contain. Adding files to a ramfs (or extending existing files) automatically allocates more memory, and deleting or truncating files frees that memory. There's no duplication between block device and cache, because there's no block device. The copy in the cache is the only copy of the data. Best of all, this isn't new code but a new application for the existing Linux caching code, which means it adds almost no size, is very simple, and is based on extremely well tested infrastructure.
A system using initramfs as its root filesystem doesn't even need a single filesystem driver built into the kernel, because there are no block devices to interpret as filesystems. Just files living in memory.
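The memory accounting argued above can be illustrated with a toy model. The classes Ramdisk and Ramfs below are hypothetical and only count bytes; the Ramdisk model assumes the worst case in which every file stored on it also has a copy in the cache.

```python
class Ramdisk:
    """Fixed-size block device in RAM whose files are also cached."""

    def __init__(self, size):
        self.size = size
        self.files = {}

    def write(self, name, nbytes):
        # Space used by every file except the one being (re)written.
        used = sum(self.files.values()) - self.files.get(name, 0)
        if used + nbytes > self.size:
            raise OSError("ramdisk full; it cannot grow without reformatting")
        self.files[name] = nbytes

    def memory_used(self):
        # The whole device stays reserved, plus a cached copy of file data.
        return self.size + sum(self.files.values())


class Ramfs:
    """Files live only in the cache, so usage follows the data."""

    def __init__(self):
        self.files = {}

    def write(self, name, nbytes):
        self.files[name] = nbytes

    def delete(self, name):
        del self.files[name]

    def memory_used(self):
        return sum(self.files.values())


disk, fs = Ramdisk(4096), Ramfs()
disk.write("init", 1000)
fs.write("init", 1000)
print(disk.memory_used(), fs.memory_used())  # 5096 1000
fs.delete("init")
print(fs.memory_used())  # 0
```

In this model the ramdisk pays for its full size plus the cached copy, while the ramfs figure tracks the stored data and drops back to zero when the file is deleted.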
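A few years ago, Linus Torvalds had a clever idea: what would happen if Linux's cache could be mounted like a filesystem? Could the files simply live in the cache, left alone until they are explicitly deleted or the system reboots? Linus wrote a tiny filesystem built on cached files and called it ramfs; other kernel developers then wrote an improved version called tmpfs (a filesystem for temporary files). tmpfs can write files out to swap space and can limit the size of a mount point. Initramfs is unpacked into rootfs, which is an instance of ramfs (or of tmpfs, on newer kernels built with tmpfs support).
These RAM-based filesystems grow and shrink according to the data they actually hold. There is no duplicated data as with a ramdisk, because there is no block device at all; memory holds only one copy of the data. Better still, RAM-based filesystems need no new code. They are a new application of the existing Linux caching code, which means they add almost nothing to the kernel's size, are very simple, and rest on thoroughly tested infrastructure.
Returning to the problem that booting requires a root filesystem: a system that uses initramfs as its root filesystem does not even need a filesystem driver built into the kernel, because there is no block device to interpret as a filesystem. It simply uses the files in memory. (KEMIN: This is not quite right. It would be more accurate to say that the kernel does not need any block-device-based filesystem driver built in; it still needs the cache-based ramfs/tmpfs code built in.)
(Understanding this part requires a good deal of background in filesystem principles and the kernel's caching mechanisms.)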
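The difference between tmpfs and ramfs
Tmpfs is a virtual memory filesystem. It differs both from the traditional ramdisk, which is implemented as a block device, and from ramfs, which works only with physical memory.
Tmpfs can use physical memory and can also use swap space. In the Linux kernel, virtual memory resources consist of physical memory (RAM) and swap space, and the kernel's virtual memory subsystem allocates and manages them.
Tmpfs asks the virtual memory subsystem for pages in which to store files. Like every other part of Linux that requests pages, it does not know whether the pages it has been given are in RAM or in swap. Like ramfs, its size is not fixed; it grows and shrinks dynamically with the space required.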
Initrd vs initramfs
The change in underlying infrastructure was a reason for the kernel developers to create a new implementation, but while they were at it they cleaned up a lot of bad behavior and assumptions.
Initrd was designed as a front end to the old "root=" root device detection code, not a replacement for it. It ran a program called "/linuxrc" which was intended to perform setup functions (like logging on to the network, determining which of several devices contained the root partition, or associating a loopback device with a file), tell the kernel which block device contained the real root device (by writing its dev_t number to /proc/sys/kernel/real-root-dev), and then return to the kernel so the kernel could mount the real root device and execute the real init program.
This assumed that the "real root device" was a block device rather than a network share, and also assumed that initrd wasn't itself going to be the real root filesystem. The kernel didn't even execute "/linuxrc" as the special process ID 1, because that process ID (and its special properties like being the only process that can not be killed with "kill -9") was reserved for init, which the kernel was waiting to run after it mounted the real root filesystem.
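The number written to /proc/sys/kernel/real-root-dev in the handoff described above packs a device's major and minor numbers into one value. A minimal sketch, assuming the old-style encoding with 8 bits each for major and minor; the function name old_dev_number is made up for illustration.

```python
def old_dev_number(major, minor):
    """Pack major/minor into an old-style 16-bit device number."""
    if not (0 <= major < 256 and 0 <= minor < 256):
        raise ValueError("old-style device numbers need major, minor < 256")
    return (major << 8) | minor


# /dev/hda1 is major 3, minor 1; /dev/sda1 is major 8, minor 1.
print(hex(old_dev_number(3, 1)))  # 0x301
print(hex(old_dev_number(8, 1)))  # 0x801
```

A /linuxrc script would write a value like 0x301 to real-root-dev, which shows why the scheme only works when the real root is a block device with a device number.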
With initramfs, the kernel developers removed all these assumptions. Once the kernel launches "/init" out of initramfs, the kernel is done making decisions and can go back to following orders. With initramfs, the kernel doesn't care where the real root filesystem is (it's initramfs until further notice), and the "/init" program from initramfs is run as a real init, with PID 1. (If initramfs's init needs to hand that special Process ID off to another program, it can use the exec() syscall just like everybody else.)
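The remark about exec() rests on the fact that exec replaces the program running in a process without changing its process ID. The sketch below demonstrates that property on a POSIX system with an ordinary process rather than with PID 1; the helper name pids_before_and_after_exec is made up for illustration.

```python
import subprocess
import sys

# Child script: print our PID, then replace this process image with a new
# Python interpreter that prints its own PID.
CHILD = r'''
import os, sys
print(os.getpid(), flush=True)
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
'''


def pids_before_and_after_exec():
    """Run the child script and return the two PIDs it reports."""
    result = subprocess.run([sys.executable, "-c", CHILD],
                            capture_output=True, text=True, check=True)
    before, after = result.stdout.split()
    return int(before), int(after)


before, after = pids_before_and_after_exec()
print(before == after)
```

This is the same mechanism an initramfs /init relies on when it hands control to the real init: the new program inherits the process ID, so it still runs as PID 1.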
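Initrd vs initramfs
Improving the kernel's underlying infrastructure was one of the reasons the kernel developers implemented a RAM-based filesystem, and the new implementation also removed a lot of bad behavior and assumptions.
Initrd was designed as a front end to the old "root=" root device detection code, not as a replacement for it. Initrd runs /linuxrc, which is responsible for setup logic (such as logging on to the network, determining which device contains the root partition, or associating a loopback device with a file), for telling the kernel which block device contains the real root device (by writing its dev_t number to /proc/sys/kernel/real-root-dev), and finally for returning to the kernel so that the kernel can mount the real root device and execute the real init program.
The initrd is loaded into memory by the bootloader, which then hands over control. Think of the initrd as doing only one part of the work of kernel initialization.
Here initrd assumes that the "real root device" is a block device rather than a network share, and also assumes that initrd is not the final root filesystem. The kernel does not even run /linuxrc as the special process ID 1, because that process ID (and its special properties, such as belonging to the only process that cannot be killed with "kill -9") is reserved for init, which is not run until the kernel has mounted the real root filesystem.
With initramfs, the kernel developers removed all of these assumptions. Once the kernel launches /init from initramfs, it hands control over to /init entirely and waits for /init to do its work. With initramfs, the kernel does not care where the final root filesystem is (unless something special applies, that is the job of /init), and the "/init" on initramfs runs as a real init, with PID 1. (If the init on initramfs wants to hand the PID 1 privilege over to another program, it can call the exec() system call.)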
Summary
The traditional root= kernel command-line option is still supported and usable, but new developments in the types of initial RAM disks supported by the kernel provide many optimizations and much-needed flexibility for the future of the Linux kernel. The next article in this series, available in next month's issue of TimeSource, explains how you can start making the transition to the new initramfs initial RAM disk mechanism.
Conclusion
The traditional root= kernel command line option is still supported and usable, but new developments in the types of initial RAM disks supported by the kernel provide many optimizations and much-needed flexibility for the future of the Linux kernel.