The "Virtual File System" in Linux

(May 1997)

Reprinted with permission of Linux Journal

This article outlines the VFS idea and gives an overview of how the Linux kernel accesses its file hierarchy. The information herein refers to Linux 2.0.x (for any x) and 2.1.y (with y up to at least 18). The sample code, on the other hand, is for 2.0 only.
by Alessandro Rubini

The main data item in any Unix-like system is the ``file'', and a unique pathname identifies each file within a running system. Every file appears like any other file in the way it is accessed and modified: the same system calls and the same user commands apply to every file. This applies independently of both the physical medium that holds information and the way information is laid out on the medium. Abstraction from the physical storage of information is accomplished by dispatching data transfer to different device drivers; abstraction from the information layout is obtained in Linux through the VFS implementation.

The Unix way

Linux looks at its file-system in the way Unix does: it adopts the concepts of super-block, inode, directory and file in the way Unix uses them. The tree of files that can be accessed at any time is determined by how the different parts are assembled together, each part being a partition of the hard drive or another physical storage device that is ``mounted'' to the system.

While the reader is assumed to be comfortable with the idea of mounting a file-system, I'd better detail the concepts of super-block, inode, directory and file.

  • The super-block owes its name to its historical heritage, from when the first data block of a disk or partition was used to hold meta-information about the partition itself. The super-block is now detached from the concept of data block, but still is the data structure that holds information about each mounted file-system. The actual data structure in Linux is called struct super_block and hosts various housekeeping information, like mount flags, mount time and device blocksize. The 2.0 kernel keeps a static array of such structures to handle up to 64 mounted file-systems.
  • An inode is associated with each file. Such an ``index node'' encloses all the information about a named file except its name and its actual data. The owner, group, permissions and access times for a file are stored in its inode, as well as the size of the data it holds, the number of links and other information. The idea of detaching file information from filename and data is what makes it possible to implement hard-links -- and to use the `dot' and `dot-dot' notations for directories without any need to treat them specially. An inode is described in the kernel by a struct inode.
  • The directory is a file that associates inodes with filenames. The kernel has no special data structure to represent a directory, which is treated like a normal file in most situations. Functions specific to each filesystem-type are used to read and modify the contents of a directory independently of the actual layout of its data.
  • The file itself is something that is associated with an inode. Usually files are data areas, but they can also be directories, devices, FIFOs or sockets. An ``open file'' is described in the Linux kernel by a struct file item; the structure encloses a pointer to the inode representing the file. file structures are created by system calls like open, pipe and socket, and are shared by parent and child across fork.
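
The separation between names and inodes can be observed from user space. The following is a minimal sketch, not part of the original article: the helper name names_share_inode is made up for illustration, and it only relies on the standard open, link, stat and unlink calls. It creates a file, gives it a second name with a hard link, and checks that both names report the same inode number and a link count of 2.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create `path`, hard-link it as `alias`, and report whether both names
 * resolve to the same inode. Returns 1 if they do, 0 if they don't and
 * -1 on error. Both names are removed again before returning.
 * (Hypothetical helper, for illustration only.)
 */
int names_share_inode(const char *path, const char *alias)
{
    struct stat a, b;
    int fd, result = -1;

    /* start from a clean state; errors are ignored on purpose */
    unlink(path);
    unlink(alias);

    fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    if (link(path, alias) == 0 &&
        stat(path, &a) == 0 && stat(alias, &b) == 0) {
        /* same device, same inode number, two links: one file, two names */
        result = (a.st_dev == b.st_dev &&
                  a.st_ino == b.st_ino &&
                  a.st_nlink == 2);
    }

    unlink(alias);
    unlink(path);
    return result;
}
```

Calling names_share_inode("/tmp/vfs_demo_a", "/tmp/vfs_demo_b") should return 1 on any filesystem that supports hard links; the paths are arbitrary examples.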

Object Orientedness

While the previous list describes the theoretical organization of information, an operating system must be able to deal with different ways to lay out information on disk. While it is theoretically possible to look for an optimum layout of information on disks and use it for every disk partition, most computer users need to access all of their hard drives without reformatting, mount NFS volumes across the network, and sometimes even access those funny CDROMs and floppy disks whose filenames can't exceed 8+3 characters.

The problem of handling different data formats in a transparent way has been addressed by making super-blocks, inodes and files into ``objects'': an object declares a set of operations that must be used to deal with it. The kernel doesn't get stuck in big switch statements to be able to access the different physical layouts of data, and new ``filesystem types'' can be added and removed at run time.

The whole VFS idea, therefore, is implemented around sets of operations to act on the objects. Each object includes a structure declaring its own operations, and most operations receive a pointer to the ``self'' object as first argument, thus allowing modification of the object itself.

In practice, a super-block structure encloses a field ``struct super_operations *s_op'', an inode encloses ``struct inode_operations *i_op'' and a file encloses ``struct file_operations *f_op''.
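
The ``object with a table of operations'' pattern can be sketched in plain user-space C. The code below is an illustration only: every demo_ name is invented here, and the read prototype is simplified (the 2.0 kernel passes both the inode and the file to its file operations). What it shows is that the generic layer never looks at the data format; it only calls through the pointer stored in the object.

```c
#include <errno.h>
#include <string.h>

struct demo_file;                        /* stand-in for struct file */

struct demo_file_operations {            /* stand-in for struct file_operations */
    int (*read)(struct demo_file *self, char *buf, int count);
};

struct demo_file {
    const struct demo_file_operations *f_op;  /* the object's method table */
    const char *data;                          /* private state used by read */
};

/* One "filesystem type": read copies the private string. */
static int text_read(struct demo_file *self, char *buf, int count)
{
    int len = (int)strlen(self->data);

    if (count < len)
        len = count;
    memcpy(buf, self->data, len);
    return len;
}

/* Another one: read always yields zero bytes, whatever the state. */
static int zero_read(struct demo_file *self, char *buf, int count)
{
    (void)self;
    memset(buf, 0, count);
    return count;
}

const struct demo_file_operations text_fops = { text_read };
const struct demo_file_operations zero_fops = { zero_read };

/* The generic layer: no switch statement, just a call through f_op. */
int demo_read(struct demo_file *f, char *buf, int count)
{
    if (!f->f_op || !f->f_op->read)
        return -EINVAL;                  /* operation not provided */
    return f->f_op->read(f, buf, count);
}
```

Adding a third behaviour means defining one more operations table; demo_read itself does not change, which is the property that lets filesystem types be added at run time.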

All the data handling and buffering that is performed by the Linux kernel is independent of the actual format of the stored data: every communication with the storage medium passes through one of the operations structures. The ``file-system type'', then, is the software module that is in charge of mapping the operations to the actual storage mechanism -- either a block device, a network connection (NFS) or virtually any other means to store/retrieve data. These software modules implementing filesystem types can either be linked to the kernel being booted or compiled in the form of loadable modules.

The current implementation of Linux allows loadable modules to be used for all the filesystem types but the root filesystem (the root filesystem must be mounted before loading a module from it). Actually, the initrd machinery makes it possible to load a module before mounting the root filesystem, but this technique is usually only exploited in installation floppies.

In this article I use the phrase ``filesystem module'' to refer indifferently to a loadable module or a filesystem decoder linked to the kernel.

This is, in summary, how all the file handling happens for any given file-system type, and it is depicted in figure 1.


Figure 1: VFS data structure (also available as PostScript)


  • struct file_system_type is a structure that declares only its own name and a read_super function. At mount time, the function is passed information about the storage medium being mounted and is asked to fill a super-block structure, as well as loading the inode of the root directory of the filesystem as sb->s_mounted (where sb is the super-block just filled). The additional field requires_dev is used by the filesystem type to state whether it will access a block device or not: for example, the NFS and proc types don't require a device, while ext2 and iso9660 do. After the superblock is filled, struct file_system_type is not used any more; only the superblock just filled will hold a pointer to it in order to be able to give back status information to the user (/proc/mounts is an example of such information). The structure is shown in panel 1.

    Panel 1

    struct file_system_type {
        struct super_block *(*read_super) (struct super_block *, void *, int);
        const char *name;
        int requires_dev;
        struct file_system_type * next; /* there's a linked list of types */
    };
    

  • The super_operations structure is used by the kernel to read/write inodes, write superblock information back to disk and collect statistics (to deal with the statfs and fstatfs system calls). When a filesystem is eventually unmounted, the put_super operation is called -- in standard kernel wording ``get'' means ``allocate and fill'', ``read'' means ``fill'' and ``put'' means ``release''. The super_operations declared by each filesystem type are shown in panel 2.

    Panel 2

    struct super_operations {
        void (*read_inode) (struct inode *);  /* fill the structure */
        int (*notify_change) (struct inode *, struct iattr *);
        void (*write_inode) (struct inode *);
        void (*put_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        void (*statfs) (struct super_block *, struct statfs *, int);
        int (*remount_fs) (struct super_block *, int *, char *);
    };
    

  • After a memory copy of the inode has been created, the kernel will act on it using its own operations. struct inode_operations is the second set of operations declared by filesystem modules, and they are listed below: they deal mainly with the directory tree. Directory-handling operations are part of the inode operations because the implementation of a dir_operations structure would bring in extra conditionals in filesystem access. Instead, inode operations that only make sense for directories will do their own error checking. The first field of the inode operations defines the file operations for regular files: if the inode is a FIFO, a socket or a device, specific file operations will be used instead. Inode operations appear in panel 3; note that the definition of rename was changed in 2.0.1.

    Panel 3

    struct inode_operations {
        struct file_operations * default_file_ops;
        int (*create) (struct inode *,const char *,int,int,struct inode **);
        int (*lookup) (struct inode *,const char *,int,struct inode **);
        int (*link) (struct inode *,struct inode *,const char *,int);
        int (*unlink) (struct inode *,const char *,int);
        int (*symlink) (struct inode *,const char *,int,const char *);
        int (*mkdir) (struct inode *,const char *,int,int);
        int (*rmdir) (struct inode *,const char *,int);
        int (*mknod) (struct inode *,const char *,int,int,int);
        int (*rename) (struct inode *,const char *,int, struct inode *,
                   const char *,int, int); /* this from 2.0.1 onwards */
        int (*readlink) (struct inode *,char *,int);
        int (*follow_link) (struct inode *,struct inode *,int,int,struct inode **);
        int (*readpage) (struct inode *, struct page *);
        int (*writepage) (struct inode *, struct page *);
        int (*bmap) (struct inode *,int);
        void (*truncate) (struct inode *);
        int (*permission) (struct inode *, int);
        int (*smap) (struct inode *,int);
    };
    

  • The file_operations, finally, specify how data in the actual file is handled: the operations implement the low-level details of read, write, lseek and the other data-handling system calls. Since the same file_operations structure is used to act on devices, it also encloses some fields that only make sense for char or block devices. It's interesting to note that the structure shown here is the one declared in the 2.0 kernels, while 2.1 changed the prototypes of read, write and lseek to allow a wider range of file offsets. The file operations (as of 2.0) are shown in panel 4.

    Panel 4

    struct file_operations {
        int (*lseek) (struct inode *, struct file *, off_t, int);
        int (*read) (struct inode *, struct file *, char *, int);
        int (*write) (struct inode *, struct file *, const char *, int);
        int (*readdir) (struct inode *, struct file *, void *, filldir_t);
        int (*select) (struct inode *, struct file *, int, select_table *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        int (*mmap) (struct inode *, struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        void (*release) (struct inode *, struct file *);
        int (*fsync) (struct inode *, struct file *);
        int (*fasync) (struct inode *, struct file *, int);
        int (*check_media_change) (kdev_t dev);
        int (*revalidate) (kdev_t dev);
    };
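
The comment on the next field in panel 1 mentions a linked list of filesystem types. How such a list can be maintained and searched by name at mount time is sketched below in user-space C. This is an assumption-laden illustration, not the kernel code: the demo_ names are invented, the structure keeps only three of the fields, and the return values are simplified to 0 and -1.

```c
#include <stddef.h>
#include <string.h>

struct demo_fs_type {            /* hypothetical mirror of struct file_system_type */
    const char *name;
    int requires_dev;
    struct demo_fs_type *next;   /* there's a linked list of types */
};

static struct demo_fs_type *demo_fs_list;   /* head of the list of known types */

/* Append a type to the list. Returns 0, or -1 if the name is already taken. */
int demo_register_fs(struct demo_fs_type *fs)
{
    struct demo_fs_type **p;

    for (p = &demo_fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;                     /* p now points at the terminating link */
    return 0;
}

/* Remove a type from the list. Returns 0, or -1 if it was not registered. */
int demo_unregister_fs(struct demo_fs_type *fs)
{
    struct demo_fs_type **p;

    for (p = &demo_fs_list; *p; p = &(*p)->next)
        if (*p == fs) {
            *p = fs->next;
            fs->next = NULL;
            return 0;
        }
    return -1;
}

/* First step of a mount: map the type name given by the user to a type. */
struct demo_fs_type *demo_find_fs(const char *name)
{
    struct demo_fs_type *fs;

    for (fs = demo_fs_list; fs; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;
}
```

A loadable filesystem module would perform the registration when it is loaded and the removal when it is unloaded; once a type is found, its read_super function is the only entry point the mount code needs.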
    

Typical Implementation Problems

The mechanisms to access filesystem data described above are detached from the physical layout of data and are designed to account for all the Unix semantics as far as filesystems are concerned.

Unfortunately, however, not all the filesystem types support all of the functions just described -- in particular, not all the types have the concept of ``inode'', even though the kernel identifies every file by means of its unsigned long inode number. If the physical data being accessed by a filesystem type has no physical inodes, the code implementing readdir and read_inode must invent an inode number for each file in the storage medium.

A typical technique to choose an inode number is using the offset of the control block for the file within the filesystem data area, assuming the files are identified by something that can be called a `control block'. The iso9660 type, for example, uses this technique to create an inode number for each file in the device.
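
As a rough sketch of the offset technique, assume the control block of a file is located by a block number plus a byte offset inside that block, and that the block size is a power of two. The two helpers below are hypothetical (they are not taken from any filesystem's source); they only show that such a position packs into a single number and can be recovered from it later, which is what read_inode needs in order to find the control block again.

```c
/*
 * Build an inode number from the position of a file's control block.
 * `blkbits` is log2 of the block size (e.g. 11 for 2048-byte blocks).
 * Hypothetical helper, for illustration only.
 */
unsigned long demo_make_ino(unsigned long block, unsigned int offset,
                            unsigned int blkbits)
{
    return (block << blkbits) + offset;   /* byte offset within the data area */
}

/* The reverse mapping: where is the control block for this inode number? */
void demo_split_ino(unsigned long ino, unsigned int blkbits,
                    unsigned long *block, unsigned int *offset)
{
    *block = ino >> blkbits;
    *offset = (unsigned int)(ino & ((1UL << blkbits) - 1));
}
```

The scheme yields unique numbers for free, since two control blocks cannot start at the same byte, at the price of inode numbers that look sparse and arbitrary to the user.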

The /proc filesystem, on the other hand, has no physical device to extract its data from, and therefore uses hardwired numbers for files that always exist, like /proc/interrupts, and dynamically allocated inode numbers for other files. The inode numbers are stored in the data structure associated with each dynamic file.

Another typical problem to face when implementing a filesystem type is dealing with limitations in the actual storage capabilities. For example, how to react when the user tries to rename a file to a name longer than the maximum allowed length for the particular filesystem, or when she tries to modify the access time of a file within a filesystem that doesn't have the concept of access time.

In these cases, the standard is to return -EPERM, which means ``Operation not permitted''. Most VFS functions, like all the system calls and a number of other kernel functions, return 0 or a positive number in case of success, and a negative number in case of errors. Error codes returned by kernel functions are always one of the integer values defined in <asm/errno.h>.
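
The return-value convention can be sketched in user space as follows. The demo_ names and the attribute flags are invented for this example; EPERM is the errno name whose message is ``Operation not permitted''. The first function plays the role of a filesystem operation that refuses a change it cannot store, the second shows how a negative kernel-style result is usually turned into the -1 plus errno seen by C programs.

```c
#include <errno.h>

#define DEMO_ATTR_ATIME 1        /* hypothetical "change access time" request */
#define DEMO_ATTR_MTIME 2        /* hypothetical "change modification time" request */

/*
 * Kernel-style operation for a filesystem that stores no access time:
 * 0 on success, a negative errno value on failure.
 */
int demo_notify_change(unsigned int valid)
{
    if (valid & DEMO_ATTR_ATIME)
        return -EPERM;           /* nowhere to store it: not permitted */
    return 0;
}

/* Library-style wrapper: negative results become -1 with errno set. */
int demo_syscall_wrapper(unsigned int valid)
{
    int ret = demo_notify_change(valid);

    if (ret < 0) {
        errno = -ret;
        return -1;
    }
    return ret;
}
```

Keeping the error in the return value itself means kernel code needs no global error variable, and a caller can simply propagate a negative result up the call chain.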

Dynamic /proc Files

I'd like to show now a little code to play with the VFS idea, but it's quite hard to conceive a small enough filesystem type to fit in the article. Writing a new filesystem type is surely an interesting task, but a complete implementation includes 39 ``operation'' functions. In practice, is there the need to build yet another filesystem type just for the sake of it?

Fortunately enough, the /proc filesystem as defined in the Linux kernel lets modules play with the VFS internals without the need to register a whole new filesystem type. Each file within /proc can define its own inode operations and file operations, and is therefore able to exploit all the features of the VFS. The interface to creating /proc files is easy enough to be introduced here, although not in too much detail. `Dynamic /proc files' are called that way because their inode number is dynamically allocated at file creation (instead of being extracted from an inode table or generated by a block number).

In this section we'll build a module called burp, for ``Beautiful and Understandable Resource for Playing''. Not all of the module will be shown because the innards of each dynamic file are not related to the VFS idea.

The main structure used in building up the file tree of /proc is struct proc_dir_entry: one such structure is associated with each node within /proc and it is used to keep track of the file tree. The default readdir and lookup inode operations for the filesystem access a tree of struct proc_dir_entry to return information to the user process.
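
How a lookup can walk such a tree is sketched below with a cut-down, user-space copy of the structure. Everything named demo_ is hypothetical and the real kernel code does more (inode allocation, special entries, per-process directories); the sketch only shows the role of the next, parent and subdir pointers and of the stored name length.

```c
#include <stddef.h>
#include <string.h>

struct demo_entry {              /* hypothetical subset of struct proc_dir_entry */
    unsigned short namelen;      /* length of filename */
    const char *name;            /* the filename itself */
    struct demo_entry *next, *parent, *subdir;
};

/* Link `child` into directory `dir`, at the head of its list of entries. */
void demo_register(struct demo_entry *dir, struct demo_entry *child)
{
    child->parent = dir;
    child->next = dir->subdir;
    dir->subdir = child;
}

/*
 * Find the entry called `name` (of length `len`) inside `dir`.
 * The name is passed with its length, as the VFS lookup operation does,
 * so it need not be NUL-terminated. Returns NULL if there is no match.
 */
struct demo_entry *demo_lookup(struct demo_entry *dir,
                               const char *name, int len)
{
    struct demo_entry *de;

    for (de = dir->subdir; de; de = de->next)
        if (de->namelen == len && memcmp(de->name, name, len) == 0)
            return de;
    return NULL;
}
```

Comparing the length first is cheap and also avoids matching a prefix: looking up ``jiff'' must not find ``jiffies''.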

The burp module, once equipped with the needed structures, will create three files: /proc/root is the block device associated with the current root partition; /proc/insmod is an interface to load/unload modules without the need to become root; /proc/jiffies reads as the current value of the jiffy counter (i.e., the number of clock ticks since system boot). These three files have no real value and are just meant to show how the inode and file operations are used. As you see, burp is really a ``Boring Utility Relying on Proc''. To avoid making the utility too boring I won't go into the details of module loading and unloading here: they have been described in previous Kernel Korner articles which are now accessible through the web. The whole burp.c file is available as well.

Creation and destruction of /proc files is performed by calling the following functions:

int proc_register_dynamic(struct proc_dir_entry *where,
                          struct proc_dir_entry *self);
int proc_unregister(struct proc_dir_entry *where, int inode);

In both functions, where is the directory where the new file belongs, and we'll use &proc_root to refer to the root directory of the filesystem. The self structure, on the other hand, is declared inside burp.c for each of the three files. The definition of the structure is reported in panel 5 for your reference; I'll show the three burp incarnations of the structure in a while, after discussing their role in the game.


Panel 5

struct proc_dir_entry {
        unsigned short low_ino;  /* inode number for the file */
        unsigned short namelen;  /* length of filename */
        const char *name;        /* the filename itself */
        mode_t mode;             /* mode (and type) of file */
        nlink_t nlink;           /* number of links (1 for files) */
        uid_t uid;               /* owner */
        gid_t gid;               /* group */
        unsigned long size;      /* size, can be 0 if not relevant */
        struct inode_operations * ops; /* inode ops for this file */
        int (*get_info)(char *, char **, off_t, int, int);  /* read data */
        void (*fill_inode)(struct inode *);  /* fill missing inode info */
        struct proc_dir_entry *next, *parent, *subdir; /* internal use */
        void *data;              /* used in sysctl */
};

The `synchronous' part of burp therefore reduces to three lines within init_module() and three within cleanup_module(). Everything else is dispatched by the VFS interface and is `event-driven', as far as a process accessing a file can be considered an event (yes, this way of seeing things is heterodox, and you should never use it with professional people).

The three lines in init_module() look like:

proc_register_dynamic(&proc_root, &burp_proc_root);

and the ones in cleanup_module() look like:

proc_unregister(&proc_root, burp_proc_root.low_ino);

The low_ino field here is the inode number for the file being unregistered, and has been dynamically assigned at load time.

But how will these three files respond to user access? Let's look at each of them independently.

  • /proc/root is meant to be a block device. Its `mode' should therefore have the S_IFBLK bit set, its inode operations should be those of block devices and its device number should be the same as the root device currently mounted. Since the device number associated with the inode is not part of the proc_dir_entry structure, the fill_inode field must be used. The device number of the root device will be extracted from the table of mounted filesystems.
  • /proc/insmod is a writable file: it needs its own file_operations to declare its own ``write'' method. Therefore it declares its own inode_operations that point to its file operations. Whenever its write() implementation is called, the file asks kerneld to load or unload the module whose name has been written. The file is writable by anybody: this is not a big problem as loading a module doesn't mean accessing its resources; and what is loadable is still controlled by root via /etc/modules.conf.
  • /proc/jiffies is much easier: the file is only read from. Kernel version 2.0 and later ones offer a simplified interface for read-only files: the get_info function pointer, if set, will be asked to fill a page of data each time the file is read. Therefore /proc/jiffies doesn't need its own file operations nor inode operations: it just uses get_info. The function uses sprintf() to convert the integer jiffies value to a string.
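
The body of the get_info function for /proc/jiffies is not reproduced in this article, so what follows is only a guess at its shape, written so that it compiles in user space: demo_jiffies is a made-up stand-in for the kernel's jiffies counter, and the argument list copies the get_info prototype of panel 5. The format string is an assumption chosen to produce ten zero-padded digits and a newline, i.e. 11 bytes.

```c
#include <stdio.h>
#include <sys/types.h>

/* Stand-in for the kernel's tick counter (hypothetical, fixed value). */
static unsigned long demo_jiffies = 2679216UL;

/*
 * get_info-style function: fill `buf` with the text of the file and
 * return the number of bytes written. The remaining arguments are part
 * of the prototype but are not needed for such a short file.
 */
int demo_read_jiffies(char *buf, char **start, off_t offset,
                      int len, int unused)
{
    (void)start;
    (void)offset;
    (void)len;
    (void)unused;

    /* ten zero-padded digits plus a newline: always 11 bytes */
    return sprintf(buf, "%010lu\n", demo_jiffies);
}
```

A fixed-width format keeps the length of the output constant, so a file size can be declared once and stays correct whatever the counter value is.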

The snapshot of a tty session in panel 6 shows how the files appear and how two of them work. Panel 7, finally, shows the three structures used to declare the file entries in /proc. The structures have not been completely defined, because the C compiler fills with zeroes any partially-defined structure without issuing any warning (feature, not bug).
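
The zero-filling rule is standard C and is easy to check in isolation. In the sketch below, demo_proc_entry is a shortened, hypothetical look-alike of struct proc_dir_entry; the initializer names only the first three members, and the language guarantees that the remaining ones start out as 0 or NULL.

```c
#include <stddef.h>

struct demo_proc_entry {         /* shortened, hypothetical look-alike */
    unsigned short low_ino;
    unsigned short namelen;
    const char *name;
    unsigned int mode;
    int (*get_info)(char *, char **, long, int, int);
    void (*fill_inode)(void *);
    void *data;
};

/*
 * Only the first three members are given. Members without an
 * initializer are set to zero (null for pointers) by the C rules
 * for aggregate initialization.
 */
struct demo_proc_entry demo_partial = {
    0,                  /* low_ino: the inode -- dynamic */
    7, "jiffies",       /* len of name and name */
    /* nothing more */
};
```

This is why an entry can stop right after the last field it cares about: code that tests a function pointer against NULL before calling it will simply skip the hooks that were never mentioned.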

The module has been compiled and run on a PC, an Alpha and a Sparc, all of them running Linux version 2.0.x.


Panel 6

morgana% ls -l /proc/root /proc/insmod /proc/jiffies
--w--w--w-   1 root     root            0 Feb  4 23:02 /proc/insmod
-r--r--r--   1 root     root           11 Feb  4 23:02 /proc/jiffies
brw-------   1 root     root       3,   1 Feb  4 23:02 /proc/root
morgana% cat /proc/jiffies
0002679216
morgana% cat /proc/modules
burp               1            0
morgana% echo isofs > /proc/insmod
morgana% cat /proc/modules
isofs              5            0 (autoclean)
burp               1            0
morgana% echo -isofs > /proc/insmod
morgana% cat /proc/jiffies
0002682697
morgana%


Panel 7

struct proc_dir_entry burp_proc_root = {
    0,                  /* low_ino: the inode -- dynamic */
    4, "root",          /* len of name and name */
    S_IFBLK | 0600,     /* mode: block device, r/w by owner */
    1, 0, 0,            /* nlinks, owner (root), group (root) */
    0, &blkdev_inode_operations,  /* size (unused), inode ops */
    NULL,               /* get_info: unused */
    burp_root_fill_ino, /* fill_inode: tell your major/minor */
    /* nothing more */
};

struct proc_dir_entry burp_proc_insmod = {
    0,                  /* low_ino: the inode -- dynamic */
    6, "insmod",        /* len of name and name */
    S_IFREG | S_IWUGO,  /* mode: REGular, Write UserGroupOther */
    1, 0, 0,            /* nlinks, owner (root), group (root) */
    0, &burp_insmod_iops, /* size - unused; inode ops */
};

struct proc_dir_entry burp_proc_jiffies = {
    0,                  /* low_ino: the inode -- dynamic */
    7, "jiffies",       /* len of name and name */
    S_IFREG | S_IRUGO,  /* mode: regular, read by anyone */
    1, 0, 0,            /* nlinks, owner (root), group (root) */
    11, NULL,           /* size is 11; inode ops unused */
    burp_read_jiffies,  /* use "get_info" instead */
};

The /proc implementation has other interesting features to offer, the most interesting being the sysctl implementation. The idea is so interesting that it doesn't fit here, and the kernel-korner article of September 1997 will fill the gap.

Interesting Examples to Look at

My discussion is over now, but there are many places where interesting source code is on show. Interesting implementations of filesystem types are:

  • Obviously, the ``/proc'' filesystem: it is quite easy to look at because it is neither performance-critical nor particularly full-featured (except the sysctl idea). Enough said.
  • The ``umsdos'' filesystem: it is part of the mainstream kernel and runs piggy-back on the ``msdos'' filesystem. It implements only a few of the operations of the VFS to add new capabilities to an old-fashioned filesystem format.
  • The ``userfs'' module: it is available from both tsx-11 and sunsite under ALPHA/userfs; version 0.9.3 will load into Linux 2.0. The module defines a new filesystem type which uses external programs to retrieve data; interesting applications are the ftp filesystem and a read-only filesystem to mount compressed tar files. Even though reverting to user programs to get filesystem data is dangerous and might lead to unexpected deadlocks, the idea is quite interesting.
  • ``supermount'': the filesystem is available on sunsite and mirrors. This filesystem type is able to mount removable devices like floppies or cdroms and handle device removal without forcing the user to umount/mount the device. The module works by controlling another filesystem type while arranging to keep the device unmounted when it is not used; the operation is transparent to the user.
  • ``ext2'': the extended-2 filesystem has been the standard Linux filesystem for a few years now. It is difficult code, but really worth reading for anyone interested in looking at how a real filesystem is implemented. It also has hooks for interesting security features like the immutable-flag and the append-only-flag. Files marked as immutable or append-only can only be deleted when the system is in single-user mode, and are therefore secured from network intruders.

Alessandro is a wild soul with an attraction for source code. He's the author of "Writing Linux Device Drivers": an O'Reilly book due out in summer. He is a fan of Linus Torvalds and Baden Powell and enjoys the two communities of volunteer workers they happened to build. He can be reached as rubini@linux.it.

Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.