razvan@valhalla:~/school/so2/wiki$ ls -i
1277956 lab10.wiki
1277962 lab9.wikibak
1277964 replace_lxr.sh
1277954 lab9.wiki
1277958 link.txt
1277955 homework.wiki
file
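The file is the component of the file system model that is closest to the user. The structure exists only in memory, as a VFS entity, and has no physical counterpart on disk. While the inode abstracts a file on disk, the file structure abstracts an open file. From the point of view of a process, the file entity abstracts the file; from the point of view of the file system implementation, the inode is the entity that abstracts the file.
The file structure maintains the following information: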
*file cursor position;
*file opening rights;
*pointer to the associated inode (possibly its index).
Localization:
struct file is the associated VFS entity, and struct file_operations describes the operations associated with a file entity.
dentry
A dentry associates an inode with a file name. A dentry structure generally contains two fields:
*an integer that identifies the inode;
*a string representing its name.
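A dentry is a specific part of a path, and can be either a directory or a file. For example, for the path /bin/vi, three dentry objects are created: /, bin, and vi.
*The dentry has a counterpart on disk, but the correspondence is not direct, because each file system stores its dentry entities in its own way.
*In VFS, the dentry entity is represented by struct dentry, and the associated operations are defined in struct dentry_operations.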
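The /bin/vi example above can be sketched in user space. The helper below is a hypothetical illustration (split_path and NAME_MAX_LEN are names made up for this note, not kernel APIs); it only shows how one path breaks down into the components that would each get a dentry, and does not reproduce the kernel's path lookup.

```c
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 255   /* assumed per-component limit for this sketch */

/*
 * Split an absolute path into components the way the text counts
 * dentries: the root "/" first, then one entry per name.
 * Stores at most `max` names and returns the total number of
 * components found, or -1 if the path is not absolute.
 */
int split_path(const char *path, char names[][NAME_MAX_LEN + 1], int max)
{
    int count = 0;
    const char *p = path;

    if (path == NULL || path[0] != '/')
        return -1;

    if (count < max)
        strcpy(names[count], "/");      /* the root component */
    count++;

    while (*p != '\0') {
        size_t len;

        while (*p == '/')               /* skip separators */
            p++;
        if (*p == '\0')
            break;

        len = strcspn(p, "/");          /* length of this component */
        if (count < max) {
            size_t n = len > NAME_MAX_LEN ? NAME_MAX_LEN : len;

            memcpy(names[count], p, n);
            names[count][n] = '\0';
        }
        count++;
        p += len;
    }
    return count;
}
```

Calling split_path("/bin/vi", names, 4) fills names with "/", "bin" and "vi", matching the three dentry objects mentioned above.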
Register and unregister filesystems
Current Linux kernel versions support about 50 file systems, including:
*ext2/ ext4
*reiserfs
*xfs
*fat
*ntfs
*iso9660
*udf for CDs and DVDs
*hpfs
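A single system is unlikely to use more than 5 or 6 file systems. For this reason, file systems (more precisely, file system types) are implemented as modules that can be loaded and unloaded at any time.
To load and unload file system modules dynamically, the kernel provides a registration/deregistration API. The structure that describes a file system type is struct file_system_type: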
---------------------------------------
#include <linux/fs.h>

struct file_system_type {
    const char *name;
    int fs_flags;
    struct dentry *(*mount) (struct file_system_type *, int,
                             const char *, void *);
    void (*kill_sb) (struct super_block *);
    struct module *owner;
    struct file_system_type *next;
    struct hlist_head fs_supers;
    struct lock_class_key s_lock_key;
    struct lock_class_key s_umount_key;
    //...
};
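--name: identifies the file system; it is the argument passed to mount -t, for example yaffs2 or ubifs.
--owner: THIS_MODULE when the file system is built as a module, NULL when it is built into the kernel.
--mount: reads the superblock from disk into memory when the file system is mounted; this function is specific to each file system.
--kill_sb: releases the superblock when the file system is unmounted.
--fs_flags: the flags used when the file system is mounted; for example, FS_REQUIRES_DEV indicates that the file system needs a physical disk.
--fs_supers: the list of superblocks for this file system type. Because the same file system can be mounted several times, each mount has its own superblock.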
----------------------------------------
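A file system is usually registered with the kernel in the module initialization function. To register, the module must:
*.initialize a struct file_system_type with the file system name, the flags, the function that reads the file system superblock, and a reference to the structure that identifies the current module;
*.call register_filesystem().
When the module is unloaded, it must unregister the file system by calling unregister_filesystem().
ramfs is an example of registering a virtual file system: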
==================================
static struct file_system_type ramfs_fs_type = {
    .name     = "ramfs",
    .mount    = ramfs_mount,
    .kill_sb  = ramfs_kill_sb,
    .fs_flags = FS_USERNS_MOUNT,
};

static int __init init_ramfs_fs(void)
{
    /* guards against registering the file system twice */
    static unsigned long once;

    if (test_and_set_bit(0, &once))
        return 0;
    return register_filesystem(&ramfs_fs_type);
}
==================================
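To make the register/unregister life cycle concrete, here is a small user-space model of a registry of file system types. It is only a sketch under stated assumptions: struct fs_type, fs_register() and fs_unregister() are hypothetical names invented for this note, and the list handling is loosely inspired by, not copied from, what register_filesystem() and unregister_filesystem() do. The error codes chosen (-EBUSY for a duplicate name, -EINVAL for an unknown type) are part of the model.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced stand-in for struct file_system_type. */
struct fs_type {
    const char *name;
    struct fs_type *next;
};

static struct fs_type *fs_list;    /* head of the registered types */

/* Register a type; a second type with the same name is refused. */
int fs_register(struct fs_type *fs)
{
    struct fs_type **p;

    for (p = &fs_list; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -EBUSY;

    fs->next = NULL;
    *p = fs;                       /* append at the tail */
    return 0;
}

/* Unregister a type; only the exact object that was registered matches. */
int fs_unregister(struct fs_type *fs)
{
    struct fs_type **p;

    for (p = &fs_list; *p != NULL; p = &(*p)->next) {
        if (*p == fs) {
            *p = fs->next;         /* unlink from the list */
            fs->next = NULL;
            return 0;
        }
    }
    return -EINVAL;
}
```

In a real module the two calls would sit in the init and exit functions, so the type is visible to mount -t only while the module is loaded.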
Functions mount, kill_sb
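When a file system is mounted, the kernel calls the mount function defined in struct file_system_type. This function performs a series of initializations and returns a struct dentry that represents the mount point directory. Usually mount() is a simple function that calls one of the following:
*mount_bdev(), which mounts a file system stored on a block device;
*mount_single(), which mounts a file system that shares one instance across all mount operations;
*mount_nodev(), which mounts a file system that is not on a physical device;
*mount_pseudo(), a helper function for pseudo file systems (sockfs, pipefs, and in general file systems that cannot be mounted).
Each of these functions takes as a parameter a pointer to a fill_super() function, which the driver provides and which is called after the superblock has been initialized, in order to finish the initialization. An example of such a function appears in the fill_super section.
When a file system is unmounted, the kernel calls kill_sb(), which performs cleanup and calls one of the following functions:
*kill_block_super(), which unmounts a file system on a block device;
*kill_anon_super(), which unmounts a virtual file system;
*kill_litter_super(), which unmounts a file system that is not on disk (the information is kept in memory).
An example for a file system without disk support is the ramfs_mount() function of ramfs: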
==================
struct dentry *ramfs_mount(struct file_system_type *fs_type,
                           int flags, const char *dev_name, void *data)
{
    return mount_nodev(fs_type, flags, data, ramfs_fill_super);
}
====================
An example for a disk file system is the minix_mount() function of the minix file system:
-------------------------------
struct dentry *minix_mount(struct file_system_type *fs_type,
                           int flags, const char *dev_name,
                           void *data)
{
    return mount_bdev(fs_type, flags, dev_name, data, minix_fill_super);
}
------------------------------
Superblock in VFS
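The superblock exists both as a physical entity (on disk) and as a VFS entity (in the struct super_block structure). The superblock contains the meta-information used to read and write the metadata (inodes, directory entries) on disk. A superblock (and implicitly a struct super_block) contains information about the block device used, the list of inodes, a pointer to the inode of the file system root directory, and a pointer to the superblock operations.
The struct super_block structure
Part of the struct super_block definition is shown below: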
----------------------------------------
struct super_block {
    //...
    dev_t                     s_dev;
    unsigned char             s_blocksize_bits;
    unsigned long             s_blocksize;
    unsigned char             s_dirt;
    loff_t                    s_maxbytes;
    struct file_system_type  *s_type;
    struct super_operations  *s_op;
    //...
    unsigned long             s_flags;
    unsigned long             s_magic;
    struct dentry            *s_root;
    //...
    char                      s_id[32];
    void                     *s_fs_info;
};
----------------------------------------
The superblock stores global information about a file system instance, including:
*. the physical device on which it resides
*. block size
*. the maximum size of a file
*. file system type
*. the operations it supports
*. magic number (identifies the file system)
*. the root directory dentry
Superblock operations
The superblock operations are described by the struct super_operations structure:
=====================================
struct super_operations {
    //...
    int (*write_inode) (struct inode *, struct writeback_control *wbc);
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);

    void (*put_super) (struct super_block *);
    int (*statfs) (struct dentry *dentry, struct kstatfs *kstatfs);
    int (*remount_fs) (struct super_block *, int *, char *);
    //...
};
====================================
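The fields of the structure are function pointers:
^. write_inode, alloc_inode, destroy_inode: write, allocate, and release the resources associated with an inode.
^. put_super: called when the superblock is released at umount; within this function, any resources associated with the file system's private data must be released.
^. remount_fs: called when the kernel detects a remount operation (mount with the MS_REMOUNT flag).
^. statfs: called when a statfs system call is made (try stat -f or df); it takes a struct kstatfs as a parameter. See ext4_statfs() for an example.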
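From user space, the statfs operation is what ends up answering calls such as statvfs(). The short sketch below, which assumes a POSIX system, wraps that call in a hypothetical helper (fs_block_size is a name invented here) to show the kind of information the file system reports back.

```c
#define _XOPEN_SOURCE 700   /* expose the POSIX declarations */
#include <sys/statvfs.h>

/*
 * Return the block size reported by the file system that contains
 * `path`, or 0 if the query fails.
 */
unsigned long fs_block_size(const char *path)
{
    struct statvfs st;

    if (statvfs(path, &st) != 0)
        return 0;
    return st.f_bsize;
}
```

The other struct statvfs fields (f_blocks, f_bfree, f_files and so on) are the numbers that df prints.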
The fill_super() function
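fill_super() is used to finish the superblock initialization. This initialization consists of filling in the struct super_block fields and initializing the root directory inode.
An example implementation is the ramfs_fill_super() function, which is called to fill in the remaining fields of the superblock: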
=======================
#include
#define RAMFS_MAGIC     0x858458f6

static const struct super_operations ramfs_ops = {
    .statfs         = simple_statfs,
    .drop_inode     = generic_delete_inode,
    .show_options   = ramfs_show_options,
};

static int ramfs_fill_super(struct super_block *sb, void *data, int silent)
{
    struct ramfs_fs_info *fsi;
    struct inode *inode;
    int err;

    save_mount_options(sb, data);

    /* private, per-superblock data */
    fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
    sb->s_fs_info = fsi;
    if (!fsi)
        return -ENOMEM;

    err = ramfs_parse_options(data, &fsi->mount_opts);
    if (err)
        return err;

    sb->s_maxbytes          = MAX_LFS_FILESIZE;
    sb->s_blocksize         = PAGE_SIZE;
    sb->s_blocksize_bits    = PAGE_SHIFT;
    sb->s_magic             = RAMFS_MAGIC;
    sb->s_op                = &ramfs_ops;
    sb->s_time_gran         = 1;

    /* root inode and root dentry */
    inode = ramfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);
    sb->s_root = d_make_root(inode);
    if (!sb->s_root)
        return -ENOMEM;

    return 0;
}
===========================
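The kernel provides generic functions for operating on file system structures. The generic_delete_inode() and simple_statfs() functions used above are such functions, and can be used to implement a driver if their functionality is sufficient.
The ramfs_fill_super() function fills in some superblock fields, then reads the root inode and allocates the root dentry. Reading the root inode is done by ramfs_get_inode(), which uses new_inode() to allocate a new inode and initializes it. To release the inode, iput() is called, and d_make_root() is used to allocate the root dentry.
An example for a disk file system is the minix_fill_super() function of the minix file system. Its functionality is similar to that of a virtual file system, except that it uses the buffer cache. The minix file system keeps its private data in a struct minix_sb_info structure, and a large part of this function deals with initializing that private data. The private data is allocated with kzalloc() and stored in the s_fs_info field of the superblock structure.
VFS functions usually receive a superblock, an inode, or a dentry as an argument, and the latter two contain a pointer to the superblock, so this private data can be accessed easily.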
Buffer cache
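The buffer cache is the kernel subsystem that handles caching (for both reads and writes) of data blocks from block devices. The basic data structure for buffer cache entities is struct buffer_head. The most important fields of this structure are:
#b_data, pointer to a memory area where the data was read from or where the data must be written to
#b_size, buffer size
#b_bdev, the block device
#b_blocknr, the number of the block on the device that has been loaded or needs to be saved on the disk
#b_state, the status of the buffer
The functions that work with these structures are:
__bread(): reads the block with the given number and size into a buffer_head structure; returns NULL on failure.
sb_bread(): same as above, but the block size is taken from the superblock.
mark_buffer_dirty(): marks the buffer as dirty (sets the BH_Dirty bit); the buffer will be written to disk later (whenever the bdflush kernel thread runs and writes the buffers to disk).
brelse(): releases the memory used by the buffer, after its contents have been written to disk.
map_bh(): associates the buffer_head with the corresponding sector.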
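The block-number arithmetic behind a read such as sb_bread() can be illustrated in user space. The helper below is a rough analogy only, assuming a POSIX system: read_block is a hypothetical name, it reads straight from a file descriptor with pread(), and it performs no caching at all, which is precisely the part the real buffer cache adds.

```c
#define _XOPEN_SOURCE 700   /* expose the POSIX declaration of pread() */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read block number `blocknr` of `blocksize` bytes from `fd` into a
 * newly allocated buffer. Returns NULL on error or on a short read,
 * mirroring the "NULL on failure" convention described above.
 * The caller releases the buffer with free().
 */
char *read_block(int fd, unsigned long blocknr, size_t blocksize)
{
    char *buf = malloc(blocksize);
    ssize_t n;

    if (buf == NULL)
        return NULL;

    /* block N starts at byte offset N * blocksize */
    n = pread(fd, buf, blocksize, (off_t)blocknr * (off_t)blocksize);
    if (n != (ssize_t)blocksize) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

In the kernel, the matching release step is brelse() on the returned buffer_head rather than free().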
Functions and useful macros
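The superblock typically contains a map of the occupied blocks (by inodes, dentries, data), in the form of a bitmap. The following functions are recommended for working with such bitmaps:
find_first_zero_bit(): finds the first zero bit in a memory area; the size parameter is the number of bits in the search area;
test_and_set_bit(): sets the bit and returns its previous value;
test_and_clear_bit(): clears the bit and returns its previous value;
test_and_change_bit(): inverts the bit and returns its previous value.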
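The semantics of these helpers can be tried out in user space with a few lines of C. The functions below are simplified stand-ins written for this note (the bm_ names are hypothetical); unlike the kernel versions they are not atomic, and returning size when no zero bit exists is the convention chosen for this sketch.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Index of the first zero bit among the first `size` bits, or `size` if none. */
size_t bm_find_first_zero(const unsigned long *map, size_t size)
{
    size_t i;

    for (i = 0; i < size; i++)
        if (!(map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD))))
            return i;
    return size;
}

/* Set bit `nr` and return its previous value (0 or 1). */
int bm_test_and_set(size_t nr, unsigned long *map)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long *word = &map[nr / BITS_PER_WORD];
    int old = (*word & mask) != 0;

    *word |= mask;
    return old;
}

/* Clear bit `nr` and return its previous value (0 or 1). */
int bm_test_and_clear(size_t nr, unsigned long *map)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long *word = &map[nr / BITS_PER_WORD];
    int old = (*word & mask) != 0;

    *word &= ~mask;
    return old;
}
```

A block allocator would typically call the find function to pick a free block and then the test-and-set function to claim it, treating a previous value of 1 as "someone else took it first".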
====================part 2======================
Inode
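The inode is a basic component of a UNIX file system and, at the same time, an important component of the VFS. An inode is metadata (information about information). An inode uniquely identifies a file on disk and holds information about it (uid, gid, access rights, access times, pointers to data blocks, etc.). The inode does not contain the file name; the name is obtained through the associated struct dentry.
The inode refers to a file on disk. To refer to an open file (one associated with a file descriptor within a process), the struct file structure is used. An inode can have any number of file structures associated with it (several processes can open the same file, and one process can open the same file several times).
The inode exists both as a VFS entity (in memory) and as a disk entity (for UNIX, HFS, NTFS, etc.). The inode in VFS is represented by struct inode. Like the other VFS structures, struct inode is a generic structure that covers the options for all supported file types, even for file systems that have no associated disk entity (such as FAT).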
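The split between one inode and many open files is visible from user space. The sketch below, assuming a POSIX system, opens the same path twice and checks that both descriptors report the same inode number while keeping separate cursor positions; same_inode_independent_offsets is a hypothetical helper name chosen for this note.

```c
#define _XOPEN_SOURCE 700   /* expose the POSIX declarations */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open `path` twice and move only the first descriptor's cursor.
 * Returns 1 if both descriptors refer to the same inode while the
 * second cursor is still at 0, 0 if not, and -1 on error.
 */
int same_inode_independent_offsets(const char *path)
{
    struct stat s1, s2;
    int fd1, fd2;
    int result = -1;

    fd1 = open(path, O_RDONLY);
    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDONLY);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    if (fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0 &&
        lseek(fd1, 1, SEEK_SET) == 1) {
        /* the second open file keeps its own position */
        off_t pos2 = lseek(fd2, 0, SEEK_CUR);

        result = (s1.st_dev == s2.st_dev &&
                  s1.st_ino == s2.st_ino &&
                  pos2 == 0);
    }

    close(fd1);
    close(fd2);
    return result;
}
```

The inode number compared here is the same value that ls -i prints in the listing at the top of these notes.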
The inode structure
The inode structure is the same for all file systems. File systems generally also have private information, which is referenced through the i_private field. By convention, the structure that keeps this information is called <fsname>_inode_info, where fsname is the file system name. For example, the minix and ext4 file systems store their specific information in struct minix_inode_info and struct ext4_inode_info.
Some of the important fields of struct inode are:
i_sb: the superblock structure of the file system the inode belongs to
i_rdev: the device on which this file system is mounted
i_ino: the number of the inode (uniquely identifies the inode within the file system)
i_blkbits: number of bits used for the block size == log2(block size)
i_mode, i_uid, i_gid: access rights, uid, gid
i_size: file/directory/etc. size in bytes
i_mtime, i_atime, i_ctime: modification, access, and inode change time
i_nlink: the number of name entries (dentries) that use this inode; for file systems without links (either hard or symbolic) this is always set to 1
i_blocks: the number of blocks used by the file (all blocks, not just data); this is only used by the quota subsystem
i_op, i_fop: pointers to operations structures: struct inode_operations and struct file_operations; i_mapping->a_ops contains a pointer to struct address_space_operations
i_count: the inode counter indicating how many kernel components use it
The inode-related functions include:
new_inode(): creates a new inode, sets the i_nlink field to 1 and initializes i_blkbits, i_sb and i_dev;
insert_inode_hash(): adds the inode to the hash table of inodes; an interesting effect of this call is that the inode will be written to the disk if it is marked as dirty;
mark_inode_dirty(): marks the inode as dirty; at a later moment, it will be written to the disk;
iget_locked(): loads the inode with the given number from the disk, if it is not already loaded;
unlock_new_inode(): used in conjunction with iget_locked(), releases the lock on the inode;
iput(): tells the kernel that the work on the inode is finished; if no one else uses it, it will be destroyed (after being written to the disk if it is marked as dirty);
make_bad_inode(): tells the kernel that the inode cannot be used; it is generally called from the function that reads the inode, when the inode could not be read from the disk because it is invalid.
Inode operations
Getting an inode
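One of the main inode operations is obtaining an inode (the struct inode in VFS). Up to kernel version 2.6.24, a function named read_inode was defined for this. Starting with version 2.6.25, the developer must define a function named <fsname>_iget, where <fsname> is the file system name. This function is responsible for finding the VFS inode if it exists, or creating a new one and filling it with information read from disk.
Generally, this function calls iget_locked() to get the inode structure from VFS. If the inode is newly created, it is read from disk (using sb_bread()) and filled in with the useful information.
An example of such a function is ubifs_iget():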
==================================
struct inode *ubifs_iget(struct super_block *sb, unsigned long inum)
{
    int err;
    union ubifs_key key;
    struct ubifs_ino_node *ino;
    struct ubifs_info *c = sb->s_fs_info;
    struct inode *inode;
    struct ubifs_inode *ui;

    dbg_gen("inode %lu", inum);

    inode = iget_locked(sb, inum);
    if (!inode)
        return ERR_PTR(-ENOMEM);
    /* already in the inode cache: nothing to read from disk */
    if (!(inode->i_state & I_NEW))
        return inode;

    ui = ubifs_inode(inode);
    ino = kmalloc(UBIFS_MAX_INO_NODE_SZ, GFP_NOFS);
    if (!ino) {
        err = -ENOMEM;
        goto out;
    }
    ...
}
=================================
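The ubifs_iget() function calls iget_locked() to obtain the VFS inode. If the inode already exists (it is not new), the function returns it; otherwise it reads the information from disk and fills in the VFS inode.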
Superoperations
Many of the superoperations (the components of the struct super_operations structure used by the superblock) are used when working with inodes. These operations include:
alloc_inode: allocates an inode. Usually, this function allocates a struct <fsname>_inode_info structure and performs the basic VFS inode initialization (by calling inode_init_once()); minix uses the kmem_cache_alloc() function for allocation, which interacts with the SLAB subsystem. For each allocation, the cache constructor is called, which in the case of minix is the init_once() function. Alternatively, kmalloc() can be used, in which case the inode_init_once() function should be called. The alloc_inode() function is called by the new_inode() and iget_locked() functions.
write_inode: saves/updates the inode received as a parameter on disk. Although inefficient, the following sequence of operations is recommended for beginners:
    *load the inode from the disk using the sb_bread() function;
    *modify the buffer according to the saved inode;
    *mark the buffer as dirty using mark_buffer_dirty(); the kernel will then handle writing it to the disk;
    *an example is the minix_write_inode() function in the minix file system.
evict_inode: removes any information about the inode with the number received in the i_ino field from the disk and memory (both the inode on the disk and the associated data blocks). This involves performing the following operations:
    *delete the inode from the disk;
    *update the disk bitmaps (if any);
    *delete the inode from the page cache by calling truncate_inode_pages();
    *delete the inode from memory by calling clear_inode();
    *an example is the minix_evict_inode() function from the minix file system.
destroy_inode: releases the memory occupied by the inode.
inode_operations
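The inode operations are described by the struct inode_operations structure. Inodes are of several types: file, directory, special file (pipe, fifo), block device, character device, link, etc. For this reason, the operations an inode needs to implement differ for each type of inode. Below are the detailed operations for a file type inode and a directory type inode. The operations of an inode are initialized and accessed through the i_op field of struct inode.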
The file structure
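The file structure corresponds to a file opened by a process and exists only in memory, associated with an inode. It is the VFS entity closest to user space; the structure fields contain information familiar from user-space files (access mode, file position, etc.), and the operations on it are performed through well-known system calls (read, write, etc.).
File operations are described by the struct file_operations structure. The file operations for a file system are initialized using the i_fop field of struct inode. When a file is opened, VFS initializes the f_op field of struct file with the address of inode->i_fop, so that subsequent system calls use the value stored in file->f_op.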
Regular files inodes
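To work with an inode, the i_op and i_fop fields of the inode structure must be filled in. The type of the inode determines the operations it needs to implement.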
Regular files inode operations
In the ubifs file system, the ubifs_file_inode_operations structure defines the inode operations, and ubifs_file_operations defines the operations of the file structure:
-----------------------------------------------------------
const struct inode_operations ubifs_file_inode_operations = {
    .setattr     = ubifs_setattr,
    .getattr     = ubifs_getattr,
    .setxattr    = ubifs_setxattr,
    .getxattr    = ubifs_getxattr,
    .listxattr   = ubifs_listxattr,
    .removexattr = ubifs_removexattr,
};

const struct file_operations ubifs_file_operations = {
    .llseek         = generic_file_llseek,
    .read           = do_sync_read,
    .write          = do_sync_write,
    .aio_read       = generic_file_aio_read,
    .aio_write      = ubifs_aio_write,
    .mmap           = ubifs_file_mmap,
    .fsync          = ubifs_fsync,
    .unlocked_ioctl = ubifs_ioctl,
    .splice_read    = generic_file_splice_read,
    .splice_write   = generic_file_splice_write,
#ifdef CONFIG_COMPAT
    .compat_ioctl   = ubifs_compat_ioctl,
#endif
};
-------------------------------------------------
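The functions generic_file_llseek(), generic_file_mmap(), generic_file_read_iter() and generic_file_write_iter() are implemented in the kernel.
For simple file systems, only the truncation operation (the truncate system call) needs to be implemented. Although initially there was a dedicated operation for it, starting with kernel 3.14 the operation is embedded in setattr: if the requested size differs from the current size of the inode, a truncate operation must be performed. See the implementation of ubifs_setattr():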
============================
int ubifs_setattr(struct dentry *dentry, struct iattr *attr)
{
    int err;
    struct inode *inode = dentry->d_inode;
    struct ubifs_info *c = inode->i_sb->s_fs_info;

    dbg_gen("ino %lu, mode %#x, ia_valid %#x",
            inode->i_ino, inode->i_mode, attr->ia_valid);
    err = inode_change_ok(inode, attr);
    if (err)
        return err;

    err = dbg_check_synced_i_size(c, inode);
    if (err)
        return err;

    /* shrinking the file is handled as a truncation */
    if ((attr->ia_valid & ATTR_SIZE) && attr->ia_size < inode->i_size)
        err = do_truncation(c, inode, attr);
    else
        err = do_setattr(c, inode, attr);

    return err;
}
==============================
The truncate operation involves:
*freeing blocks of data on the disk that are now extra (if the new size is smaller than the old one) or allocating new blocks (if the new size is larger);
*updating disk bitmaps (if used);
*updating the inode;
*filling with zeros the space left unused in the last block, using the block_truncate_page() function.
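The zero-filling step is the one whose effect is easiest to observe from user space. The sketch below, assuming a POSIX system, shrinks a file and then grows it again with ftruncate(); the bytes beyond the shrunken size read back as zeros rather than as the old contents. shrink_then_grow is a hypothetical helper name used only in this note.

```c
#define _XOPEN_SOURCE 700   /* expose ftruncate() and pread() */
#include <sys/types.h>
#include <unistd.h>

/*
 * Truncate `fd` down to `small` bytes, extend it to `large` bytes,
 * then read up to `buflen` bytes from the start of the file into `buf`.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t shrink_then_grow(int fd, off_t small, off_t large,
                         char *buf, size_t buflen)
{
    if (ftruncate(fd, small) != 0)      /* drops the data past `small` */
        return -1;
    if (ftruncate(fd, large) != 0)      /* new range reads as zeros */
        return -1;
    return pread(fd, buf, buflen, 0);
}
```

If a file system left stale bytes in the tail of the last block when shrinking, a later extension like this one would expose them, which is why the truncate path clears that tail.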
An example of a truncate implementation is the minix_truncate() function of the minix file system.
==============================
Address space operations
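There is a close link between the address space of a process and files: programs are executed almost exclusively by mapping the file into the process address space. Because this approach works very well and is quite general, it is also used for regular system calls such as read and write.
The struct address_space structure describes an address space, and the operations associated with it are described by struct address_space_operations. To initialize the address space operations, fill in inode->i_mapping->a_ops of the file type inode.
An example is ubifs_file_address_operations in the ubifs file system: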
====================
const struct address_space_operations ubifs_file_address_operations = {
    .readpage       = ubifs_readpage,
    .writepage      = ubifs_writepage,
    .write_begin    = ubifs_write_begin,
    .write_end      = ubifs_write_end,
    .invalidatepage = ubifs_invalidatepage,
    .set_page_dirty = ubifs_set_page_dirty,
#ifdef CONFIG_MIGRATION
    .migratepage    = ubifs_migrate_page,
#endif
    .releasepage    = ubifs_releasepage,
};
===================
Most of these functions are easy to implement, as the following minix functions show:
-------------------------------
static int minix_writepage(struct page *page, struct writeback_control *wbc)
{
    return block_write_full_page(page, minix_get_block, wbc);
}

static int minix_readpage(struct file *file, struct page *page)
{
    return block_read_full_page(page, minix_get_block);
}

static void minix_write_failed(struct address_space *mapping, loff_t to)
{
    struct inode *inode = mapping->host;

    /* undo a partial write that went past the current file size */
    if (to > inode->i_size) {
        truncate_pagecache(inode, inode->i_size);
        minix_truncate(inode);
    }
}

static int minix_write_begin(struct file *file, struct address_space *mapping,
                             loff_t pos, unsigned len, unsigned flags,
                             struct page **pagep, void **fsdata)
{
    int ret;

    ret = block_write_begin(mapping, pos, len, flags, pagep,
                            minix_get_block);
    if (unlikely(ret))
        minix_write_failed(mapping, pos + len);

    return ret;
}

static sector_t minix_bmap(struct address_space *mapping, sector_t block)
{
    return generic_block_bmap(mapping, block, minix_get_block);
}
----------------------------------
Dentry structure