The VFS File System

天地合，乃敢与君绝

Category: Linux内核模块分析 (Linux kernel module analysis). Tags: driver development, system architecture

Article link: https://blog.csdn.net/qq_49668258/article/details/136683856

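You can think of the Linux kernel as designed around the file system. The file system leads to the concept of device files, and device files lead to character devices and block devices, which takes us from the file system to device management. Device management includes device drivers, and drivers rely on interrupts; block devices in turn bring in the generic block layer and I/O scheduling. In the other direction, the file system connects to networking through sockets. Starting from the file system and working outward layer by layer therefore covers most of the important concepts and architecture of the kernel's upper layers.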

Basic concepts of the file system

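Before analysing the file system in depth, it helps to introduce a few basic concepts. They explain what the file system design is for at the architectural level, which makes the kernel's file system code easier to understand as a whole and greatly reduces the effort of reading it.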

What is VFS

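The kernel manages all file systems through VFS. VFS gives every file system a uniform interface, and Linux reaches each concrete file system only through the interfaces VFS defines. VFS is therefore also a standard: every file system implementation has to conform to it.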

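VFS itself exists only in memory. To represent a physical disk in memory it needs data structures that abstract the disk, and these are dentry, inode and super_block. Reads and writes on a file system go through dentry and inode, but the data they operate on is still in memory; it is written back to the disk or block device only at a suitable time.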

super_block

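The superblock is the control block of a file system, much like a process control block (PCB) or the task control block (TCB) of an RTOS. It holds the block size, the superblock operations, and a number of lists; for example, every inode of the file system is linked into the superblock's inode list (s_inodes). For a disk-based file system the in-memory superblock is filled in by reading the superblock stored on disk, so it is the in-memory abstraction of one concrete file system. The structure is large and there is no need to study every member; a simplified version follows.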

struct super_block {
    unsigned char        s_blocksize_bits;    // log2 of the block size
    unsigned long        s_blocksize;    // block size in bytes
    struct file_system_type    *s_type;
    const struct super_operations    *s_op;// superblock operations, the most important member, e.g. alloc_inode, write_inode (read_inode in kernels before 2.6.25)

    unsigned long        s_magic;
    struct dentry        *s_root;// dentry of the root directory
    struct list_head    s_inodes;    /* all inodes */
    struct list_head    s_files;

    struct list_head    s_mounts;    /* list of mounts; _not_ for fs use */
    struct block_device    *s_bdev;// block device that holds the file system
    void             *s_fs_info;    /* Filesystem private info */
    const struct dentry_operations *s_d_op; /* default d_op for dentries */
};

dentry

Linux stores files in a tree that converges, level by level, on the root directory, so VFS needs a data structure that reflects this tree: the dentry.

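In VFS every file (a directory is also a kind of file) has a dentry, and each dentry links to the dentry of its parent directory. The root directory has its own dentry, called the root dentry. Every directory and file directly under the root links to this root dentry, the dentry of a second-level directory links to the dentry of its first-level directory, and so on, which forms the tree.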

To find dentries quickly, the kernel caches them in the dentry cache, so a directory lookup starts there. The dentry structure is also large; here is a trimmed version.

struct dentry {
    struct hlist_bl_node d_hash;    /* lookup hash list */
    struct dentry *d_parent;    /* parent directory */
    struct qstr d_name;// name of the file or directory; on open it is compared with the name the user passed in
    struct inode *d_inode;        /* Where the name belongs to - NULL is
                     * negative */
    const struct dentry_operations *d_op;
    struct super_block *d_sb;    /* The root of the dentry tree */

    struct list_head d_subdirs;    /* our children */
    struct hlist_node d_alias;    /* inode alias list */
};

inode

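An inode represents one file. It stores the file's metadata, such as its size and its timestamps (access, modification and status change), and, most importantly, the functions that read and write the file and the file's page cache information. A file can have several dentries but only one inode: different paths (hard links) can lead to the same file, and each path has its own dentry. The inode structure is large as well, so only a trimmed version is shown.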

struct inode {
    const struct inode_operations    *i_op;
    struct super_block    *i_sb;
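    struct address_space    *i_mapping;// the file's cache in memory; reads and writes hit memory first and are written back to disk later

    dev_t            i_rdev;// device number
    loff_t            i_size;// file length in bytes

    unsigned int        i_blkbits;// log2 of the block size
    blkcnt_t        i_blocks;// number of blocks the file occupies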

    struct hlist_node    i_hash;
    struct list_head    i_wb_list;    /* backing dev IO list */
    struct hlist_head    i_dentry;
    struct list_head    i_sb_list;
    const struct file_operations    *i_fop;    /* former ->i_op->default_file_ops */
};

file

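A file object describes the interaction between a process and a file. The structure does not exist on disk: each time a process opens a file, the kernel creates a file object in memory. Opening the same file from different processes, or twice in one process, yields distinct file objects.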

struct file {
    struct list_head    fu_list;

    struct path        f_path;

    struct inode        *f_inode;    /* cached value */
    const struct file_operations    *f_op;

    fmode_t            f_mode;
    loff_t            f_pos;// current position (offset) for operations on the file

    /* needed for tty driver, and maybe others */
    void            *private_data;

    struct address_space    *f_mapping; // the file's page cache used for reads and writes
};

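VFS is an abstraction of the concrete file systems, and it does its work through the four data structures described above. Operating on a file system (reading and writing) therefore really means operating on these structures. The four structures are also linked to one another; as mentioned earlier, every inode of a file system is linked into a list in the superblock.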

Code analysis

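File systems are abstract, and concepts alone do not give a real understanding, so the source code has to be read. However complex the lower layers are, they exist to make things convenient for the user, and for the user it comes down to opening files, creating files (and directories), and reading and writing files. A disk-based file system also involves block device operations: from the file system layer to the generic block layer, then the I/O scheduler, and finally the block device driver, which is too much to cover at once. To keep the analysis manageable, only the file system layer is analysed here.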

The entry point in the code

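Analysing file system code needs a starting point. The code below (file system registration inside the kernel looks almost the same) is given only to outline the approach taken in the rest of this article.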

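static int __init aufs_init(void)
{
    int retval;
    struct dentry *pslot;

    /* step 1: register the file system type with the kernel */
    retval = register_filesystem(&au_fs_type);

    /* step 2: mount it inside the kernel (aufs_mount is a file-scope struct vfsmount *, declaration not shown) */
    aufs_mount = kern_mount(&au_fs_type);

    /* step 3: create directories and files; error handling omitted */
    pslot = aufs_create_dir("woman star", NULL);
    aufs_create_file("1bb", S_IFREG | S_IRUGO, pslot, NULL, NULL);
    aufs_create_file("fbb", S_IFREG | S_IRUGO, pslot, NULL, NULL);
    aufs_create_file("lll", S_IFREG | S_IRUGO, pslot, NULL, NULL);

    pslot = aufs_create_dir("man star", NULL);
    aufs_create_file("ldh", S_IFREG | S_IRUGO, pslot, NULL, NULL);
    aufs_create_file("lcw", S_IFREG | S_IRUGO, pslot, NULL, NULL);
    aufs_create_file("jw", S_IFREG | S_IRUGO, pslot, NULL, NULL);

    return retval;
}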

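The code above is simple. It does three things: first, it calls register_filesystem to register a file system with the kernel; second, it calls kern_mount to mount the file system; third, it creates directories and files in that file system. The analysis follows these three steps.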

register_filesystem

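The implementation is shown below. The function only registers a file system with the kernel. The kernel keeps a global list, file_systems, that holds every registered file system, so the function is simple: it walks the list looking for the file system being registered, returns -EBUSY if it is already there, and otherwise appends it to the end of the list. It clearly does not touch the four important data structures; that happens in the code analysed further down.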

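int register_filesystem(struct file_system_type * fs)
{
    int res = 0;
    struct file_system_type ** p;

    write_lock(&file_systems_lock);
    /* p points at the matching list link, or at the NULL link at the tail */
    p = find_filesystem(fs->name, strlen(fs->name));
    if (*p)
        res = -EBUSY;    /* a file system with this name is already registered */
    else
        *p = fs;         /* append to the end of the list */
    write_unlock(&file_systems_lock);
    return res;
}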

kern_mount

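kern_mount is a macro that expands to kern_mount_data, which wraps one more layer; the function that does the real work is shown below. Its arguments are the file system type to mount, the mount flags, the file system name, and data, which is NULL here. It returns a vfsmount object. This object matters because it records the mount relationships between file systems.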

struct vfsmount *vfs_kern_mount(struct file_system_type *type, int flags,
                                const char *name, void *data)
{
    struct mount *mnt;
    struct dentry *root;

    mnt = alloc_vfsmnt(name);

    root = mount_fs(type, flags, name, data);

    mnt->mnt.mnt_root = root;
    mnt->mnt.mnt_sb = root->d_sb;
    mnt->mnt_mountpoint = mnt->mnt.mnt_root;
    mnt->mnt_parent = mnt;
    br_write_lock(&vfsmount_lock);
    list_add_tail(&mnt->mnt_instance, &root->d_sb->s_mounts);
    br_write_unlock(&vfsmount_lock);
    return &mnt->mnt;
}

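The code above also does three things: it allocates a mount object (struct mount, which embeds the vfsmount that is returned), calls mount_fs, and then fills in the members of the mount object. The second step is the core, so let's expand it.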

struct dentry *mount_fs(struct file_system_type *type, int flags,
                        const char *name, void *data)
{
    struct dentry *root;
    struct super_block *sb;
    int error = -ENOMEM;

    /* call the concrete file system's mount callback */
    root = type->mount(type, flags, name, data);
    sb = root->d_sb;

    sb->s_flags |= MS_BORN;

    up_write(&sb->s_umount);

    return root;
}

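Much of the code is omitted above and only the core is kept. The most important line is type->mount, a callback that points to the mount function of the concrete file system underneath, for example ext4_mount or sysfs_mount. sysfs_mount is used for the analysis here, since the other file systems are broadly similar.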

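static struct dentry *sysfs_mount(struct file_system_type *fs_type,
                                int flags, const char *dev_name, void *data)
{
    struct sysfs_super_info *info;
    struct super_block *sb;
    int error;

    info = kzalloc(sizeof(*info), GFP_KERNEL);

    /* find an existing superblock for this file system or allocate a new one */
    sb = sget(fs_type, sysfs_test_super, sysfs_set_super, flags, info);

    if (!sb->s_root) {
        /* new superblock: fill it in (handling of error omitted) */
        error = sysfs_fill_super(sb, data, flags & MS_SILENT ? 1 : 0);
    }

    return dget(sb->s_root);
}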

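Again the many checks are removed and only the code of interest is kept. The core consists of two functions, sget and sysfs_fill_super. sget is not expanded here: it obtains a superblock by first looking for an existing superblock of this file system and, if there is none, creating one and initialising some of its members. The superblock is then handed to sysfs_fill_super, which fills in the remaining members. The code follows.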

static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
{
    struct inode *inode;
    struct dentry *root;

    sb->s_blocksize = PAGE_CACHE_SIZE;
    sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
    sb->s_magic = SYSFS_MAGIC;
    sb->s_op = &sysfs_ops;
    sb->s_time_gran = 1;

    /* create the root inode, then the root dentry bound to it */
    inode = sysfs_get_inode(sb, &sysfs_root);
    root = d_make_root(inode);
    root->d_fsdata = &sysfs_root;
    sb->s_root = root;
    sb->s_d_op = &sysfs_dentry_ops;
    return 0;
}

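sysfs_fill_super therefore fills in some members of the superblock, including the superblock operations sysfs_ops. It uses sysfs_get_inode to create an inode and to set that inode's members and operations, then uses d_make_root to create the root directory dentry and bind it to the new inode, which makes it the "/" directory of this superblock. Finally it points the superblock's root member (s_root) at the new dentry and sets the superblock's default dentry operations (s_d_op).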

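That completes the analysis of kern_mount. In summary, kern_mount creates the superblock, the root dentry and the root inode, and sets the members and operation tables of these three objects. Back in vfs_kern_mount, members of the mount object are pointed at the new root dentry and superblock, the mount is added to the superblock's s_mounts list, and the function returns.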

Creating directories and files

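With the first two steps analysed, what remains is creating directories and files. Recall that a file is represented by a dentry and an inode, so creating a directory or a file means creating a dentry object and an inode object and then connecting the two. The process involves many details. Does the new file have a parent directory? If not, it is created under the root. Is there already a file with the same name? The name is looked up in dentry_hashtable, and the entry is created only if it does not exist. Once the dentry is ready the inode is set up, and the dentries are linked level by level to form the tree.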

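Note that these functions change between kernel versions, so different versions handle this differently; check the details against the source of the version you use. Here is an example from 2.6.18.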

static int create_dir(struct kobject * k, struct dentry * p,
                      const char * n, struct dentry ** d)
{
    int error;
    umode_t mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

    /* locking of the parent directory is omitted in this excerpt */
    *d = lookup_one_len(n, p, strlen(n));    /* look up name n under parent p */
    if (!IS_ERR(*d)) {
        if (sysfs_dirent_exist(p->d_fsdata, n))
            error = -EEXIST;    /* an entry with this name already exists */
        else
            error = sysfs_make_dirent(p->d_fsdata, *d, k, mode, SYSFS_DIR);
        if (!error) {
            error = sysfs_create(*d, mode, init_dir);    /* create the inode for the dentry */
            if (!error) {
                p->d_inode->i_nlink++;
                (*d)->d_op = &sysfs_dentry_ops;
                d_rehash(*d);    /* insert the dentry into the hash table */
            }
        }
        if (error && (error != -EEXIST)) {
            /* undo the partial work on failure */
            struct sysfs_dirent *sd = (*d)->d_fsdata;
            if (sd) {
                list_del_init(&sd->s_sibling);
                sysfs_put(sd);
            }
            d_drop(*d);
        }
        dput(*d);
    } else
        error = PTR_ERR(*d);
    return error;
}

The mount process

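After the steps above, the kernel holds such a tree, but users cannot use it yet, because the file system still has to be mounted. In principle the system already has a "/" file system; once the file system above is mounted somewhere inside it, it becomes visible and usable.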

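A mount starts in the sys_mount system call, which calls do_mount. do_mount looks up the dentry and vfsmount of the mount point and stores them in a path object (a nameidata object in older kernels), then calls different functions depending on the options. For a file system being mounted for the first time it calls do_new_mount, which is the function to focus on. Trimmed code follows.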

static int do_new_mount(struct path *path, const char *fstype, int flags,
            int mnt_flags, const char *name, void *data)
{
    struct file_system_type *type;
    struct vfsmount *mnt;
    int err;

    type = get_fs_type(fstype);    /* look up the registered file system type by name */

    mnt = vfs_kern_mount(type, flags, name, data);

    err = do_add_mount(real_mount(mnt), path, mnt_flags);

    return err;
}

The code above keeps only the main steps: look up the file system type to mount, obtain a vfsmount object for it, and finally call do_add_mount to perform the mount. Next comes do_add_mount.

static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags)
{
    struct mountpoint *mp;
    struct mount *parent;
    int err;

    mp = lock_mount(path);

    parent = real_mount(path->mnt);

    /* refuse the same file system on the same mount point */
    err = -EBUSY;
    if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb &&
        path->mnt->mnt_root == path->dentry)
        goto unlock;

    /* the root of the new mount must not be a symbolic link */
    err = -EINVAL;
    if (S_ISLNK(newmnt->mnt.mnt_root->d_inode->i_mode))
        goto unlock;

    err = graft_tree(newmnt, parent, mp);

unlock:
    unlock_mount(mp);
    return err;
}

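This code first obtains the mountpoint object mp, which contains the mountpoint's hash list linkage and the mountpoint's dentry. It then uses real_mount (a small helper built on container_of) to get the mount object of the mount that contains the mountpoint. A series of checks follows: the first refuses to mount the same file system on the same mountpoint again, and the second rejects a new mount whose root is a symbolic link. Finally it calls graft_tree to graft the file system being mounted onto the mountpoint. Here is that function.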

static int graft_tree(struct mount *mnt, struct mount *p, struct mountpoint *mp)
{
    if (mnt->mnt.mnt_sb->s_flags & MS_NOUSER)
        return -EINVAL;

    if (S_ISDIR(mp->m_dentry->d_inode->i_mode) !=
        S_ISDIR(mnt->mnt.mnt_root->d_inode->i_mode))
        return -ENOTDIR;

    return attach_recursive_mnt(mnt, p, mp, NULL);
}

Next, the attach_recursive_mnt function:
static int attach_recursive_mnt(struct mount *source_mnt,
            struct mount *dest_mnt,
            struct mountpoint *dest_mp,
            struct path *parent_path)
{
    LIST_HEAD(tree_list);    /* mounts created by propagation (propagation code omitted) */
    struct mount *child, *p;

    if (parent_path) {
        detach_mnt(source_mnt, parent_path);
        attach_mnt(source_mnt, dest_mnt, dest_mp);
        touch_mnt_namespace(source_mnt->mnt_ns);
    } else {
        mnt_set_mountpoint(dest_mnt, dest_mp, source_mnt);
        commit_tree(source_mnt);
    }

    list_for_each_entry_safe(child, p, &tree_list, mnt_hash) {
        list_del_init(&child->mnt_hash);
        commit_tree(child);
    }

    return 0;
}

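Because parent_path is passed as NULL, only the else branch runs. mnt_set_mountpoint points members of the mount being attached (call it the source file system) at the parent mount and at the mountpoint dest_mp, which ties the source file system to the mountpoint. commit_tree then adds the source file system to the global hash table, so that every process can see and use it. commit_tree is shown below.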

static void commit_tree(struct mount *mnt)
{
    struct mount *parent = mnt->mnt_parent;
    struct mount *m;
    LIST_HEAD(head);
    struct mnt_namespace *n = parent->mnt_ns;

    list_add_tail(&head, &mnt->mnt_list);
    list_for_each_entry(m, &head, mnt_list)
        m->mnt_ns = n;

    list_splice(&head, n->list.prev);

    list_add_tail(&mnt->mnt_hash, mount_hashtable +
                hash(&parent->mnt, mnt->mnt_mountpoint));
    list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
    touch_mnt_namespace(n);
}

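The function above links the mount object into its various lists. The most important one is the global hash table mount_hashtable: through it, any process can find the source file system. That completes the analysis of the mount process.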

The open process

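The open system call is very complex. It involves a long chain of calls and has to handle many cases: first, path resolution; second, the "." and ".." entries, where ".." in particular may require switching to another file system; third, checking whether each dentry along the path can be found. Only at the very end is the open interface of the concrete file system called. Because there is so much code on the way, the code is not reproduced here; only the core call sequence is listed, with explanations.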

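1. do_sys_open calls get_unused_fd_flags to get a free fd, calls do_filp_open to get a file structure, and associates the two.
2. do_filp_open calls path_openat, first in RCU mode and then in normal mode, to return a file, which carries the inode and dentry with it.
3. path_openat first calls path_init to determine the starting point of the lookup from its arguments, then calls link_path_walk to walk each path component, and finally calls do_last, which ends up invoking the file system's own open hook (the open method of the file operations obtained from the inode).
4. link_path_walk calls walk_component to drive the loop.
5. walk_component calls a different handler depending on the component type: "." and ".." go to handle_dots, and an ordinary component goes to do_lookup to find the dentry and inode.
6. do_lookup searches the dcache first; on a miss it calls the file system's hook inode->i_op->lookup to find the inode and dentry.
7. handle_dots may have to cross from one file system to another when handling "..", which needs some extra handling.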

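Steps 3 to 6 above are the most complex part, and they show how much skill went into the kernel. In the end the process's fdtable is connected to the inode in the file system, and the index into the fdtable array is returned; that index is the file descriptor the process sees. Each element of the array is a pointer to a struct file, and the file carries f_op, the file operations, which are copied from the inode's i_fop during open. That is why reads and writes issued from user space end up in the methods supplied through the inode.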