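1. Block devices and character devices
Character devices and block devices are parallel concepts, but their driver structures in Linux differ considerably. In general, a block device driver is far more complex than a character device driver, and the two also behave very differently in how they perform I/O. Buffering, I/O scheduling, and request queues are all concepts tied to block device drivers.
2. I/O characteristics of block devices
The I/O behavior of character devices and block devices differs as follows:
(1) A block device accepts input and returns output only in units of blocks, whereas a character device works in units of bytes. Most devices are character devices, because they need no buffering and do not operate on fixed-size blocks.
(2) A block device has buffers for its I/O requests, so it can choose the order in which it responds to them; a character device needs no buffering and is read and written directly. For storage devices, reordering reads and writes matters a great deal, because accessing contiguous sectors is faster than accessing scattered ones.
(3) A character device can only be read and written sequentially, whereas a block device supports random access.
In Linux, users normally access a disk through a file system such as EXT4, but a disk can also be accessed raw, for example by reading and writing /dev/sdb1 directly.
The basic purpose of the I/O scheduling layer is to order requests by the sector numbers they target on the block device, which reduces head movement and improves efficiency.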
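The ordering idea can be illustrated with a small userspace sketch. This is not kernel code: struct io_req, sort_by_sector() and head_travel() are hypothetical names invented for the illustration, and a real I/O scheduler also considers deadlines, fairness, and merging.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued I/O request; not a kernel type. */
struct io_req {
    unsigned long long sector;  /* starting sector on the device */
    unsigned int nr_sectors;    /* transfer length in sectors */
};

/* qsort() comparator: ascending by starting sector. */
static int cmp_sector(const void *a, const void *b)
{
    const struct io_req *ra = a;
    const struct io_req *rb = b;

    if (ra->sector < rb->sector)
        return -1;
    return ra->sector > rb->sector;
}

/* Reorder pending requests so they are served in ascending sector order. */
static void sort_by_sector(struct io_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof(*reqs), cmp_sector);
}

/* Total head travel (in sectors) when requests are served in array order. */
static unsigned long long head_travel(const struct io_req *reqs, size_t n,
                                      unsigned long long start)
{
    unsigned long long pos = start, total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += reqs[i].sector > pos ? reqs[i].sector - pos
                                      : pos - reqs[i].sector;
        pos = reqs[i].sector + reqs[i].nr_sectors;
    }
    return total;
}
```

Comparing head_travel() before and after sort_by_sector() on the same set of requests shows how much seeking the reordering saves on a mechanical disk.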
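3. Linux block device driver structure
3.1 The block_device_operations structure
The set of operations for a character device driver is the file_operations structure; the corresponding set for a block device is the block_device_operations structure: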
struct block_device_operations {
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
int (*rw_page)(struct block_device *, sector_t, struct page *, bool);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
int (*media_changed) (struct gendisk *);
void (*unlock_native_capacity) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
struct module *owner;
const struct pr_ops *pr_ops;
};
Analysis of the main operations:
1. Open and release
As in a character device driver, these are called when the device is opened and closed:
int (*open) (struct block_device *, fmode_t);
void (*release) (struct gendisk *, fmode_t);
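2. I/O control
The following are the implementations of the ioctl() system call. Block devices have a large number of standard requests, and these are handled by the generic Linux block layer.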
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
// When a 32-bit process on a 64-bit system calls ioctl(), compat_ioctl is called instead
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
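3. Media change
The deprecated media_changed() is called by the kernel to check whether the media in the drive has changed; it returns a nonzero value if so and 0 otherwise. It applies only to drives that support removable media, and the driver usually needs a flag variable that records whether the media state has changed. Drivers for non-removable storage need not implement it. check_events() is its replacement:
// Check for pending events; if DISK_EVENT_MEDIA_CHANGE or DISK_EVENT_EJECT_REQUEST events are pending, return them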
unsigned int (*check_events)(struct gendisk *disk,unsigned int clearing);
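4. Revalidating the media
revalidate_disk() is called in response to a media change. It gives the driver a chance to do whatever is needed to make the new media ready for use: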
int (*revalidate_disk) (struct gendisk *);
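5. Getting drive information
This function fills in an hd_geometry structure from the drive's geometry. hd_geometry holds the number of heads, sectors, and cylinders, and is defined in the include/linux/hdreg.h header: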
int (*getgeo)(struct block_device *, struct hd_geometry *);
6. Module pointer
A pointer to the module that owns this structure; it is usually initialized to THIS_MODULE:
struct module *owner;
3.2 The gendisk structure
The kernel uses the gendisk (generic disk) structure to represent an individual disk device (or partition). It is defined as follows:
struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
*/
int major; /* major number of driver */
int first_minor;
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
char disk_name[DISK_NAME_LEN]; /* name of major driver */
char *(*devnode)(struct gendisk *gd, umode_t *mode);
unsigned int events; /* supported events */
unsigned int async_events; /* async events, subset of all */
/* Array of pointers to partitions indexed by partno.
* Protected with matching bdev lock but stat and other
* non-critical accesses use RCU. Always access through
* helpers.
*/
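struct disk_part_tbl __rcu *part_tbl; // holds the partition table
struct hd_struct part0; // represents one partition
const struct block_device_operations *fops; // set of block device operations
struct request_queue *queue; // the I/O request queue the kernel uses to manage this device
void *private_data; // may point to any private data for the disk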
int flags;
struct kobject *slave_dir;
struct timer_rand_state *random;
atomic_t sync_io; /* RAID */
struct disk_events *ev;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct kobject integrity_kobj;
#endif /* CONFIG_BLK_DEV_INTEGRITY */
int node_id;
struct badblocks *bb;
};
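major, first_minor, and minors together describe the disk's major and minor device numbers. All partitions of one disk share the same major number but have different minor numbers.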
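As a rough userspace illustration of how these three fields relate to the device number of a partition, the sketch below builds a dev_t with glibc's makedev(). partition_devt() is a hypothetical helper written for this example, and it assumes the simple layout in which partition N takes minor first_minor + N (partition 0 being the whole disk); kernels can also allocate extended device numbers, which this sketch ignores.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>

/*
 * Hypothetical helper: compute the device number of partition `partno`
 * on a disk that owns `minors` consecutive minor numbers starting at
 * `first_minor`. Returns 0 on success, -1 if the partition does not fit.
 */
static int partition_devt(unsigned int major_nr, unsigned int first_minor,
                          unsigned int minors, unsigned int partno,
                          dev_t *out)
{
    if (partno >= minors)
        return -1;          /* outside the range reserved for this disk */
    *out = makedev(major_nr, first_minor + partno);
    return 0;
}
```

For a disk with major 252, first_minor 16 and 16 minors, partition 1 would get device number 252:17 under this assumption.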
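Linux provides a set of functions for working with gendisk:
1. Allocating a gendisk
gendisk is a dynamically allocated structure that needs special kernel handling to initialize. A driver must not allocate it on its own; it should use the following function instead: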
struct gendisk *alloc_disk(int minors);
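The minors argument is the number of minor device numbers the disk uses, which is normally also the number of partitions it can have. minors cannot be changed afterwards.
2. Adding a gendisk
Once the gendisk has been allocated, the system still cannot use the disk. The disk device must be registered with the following function: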
void add_disk(struct gendisk *disk);
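add_disk() must only be called after the driver has finished initializing and is able to respond to requests for the disk.
3. Releasing a gendisk
When the disk is no longer needed, the gendisk should be released: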
void del_gendisk(struct gendisk *gp);
4. gendisk reference counting
get_disk() and put_disk() manipulate the reference count of a gendisk:
struct kobject *get_disk(struct gendisk *disk); // ends up calling kobject_get(&disk_to_dev(disk)->kobj)
void put_disk(struct gendisk *disk); // ends up calling kobject_put(&disk_to_dev(disk)->kobj)
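3.3 bio, request, and request_queue
A bio usually corresponds to one I/O request passed down to the block layer from the layers above. Each bio instance, together with the bvec_iter and bio_vec instances it contains, describes the request's starting sector, its data direction (read or write), and the pages that hold the data. The bio-related data types are as follows: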
// bvec.h
struct bvec_iter {
sector_t bi_sector; /* device address in 512 byte
sectors */
unsigned int bi_size; /* residual I/O count */
unsigned int bi_idx; /* current index into bvl_vec */
unsigned int bi_done; /* number of bytes completed */
unsigned int bi_bvec_done; /* number of bytes completed in
current bvec */
};
// blk_types.h
/*
* main unit of I/O for the block layer and lower layers (ie drivers and
* stacking drivers)
*/
struct bio {
struct bio *bi_next; /* request queue link */
struct gendisk *bi_disk;
unsigned int bi_opf; /* bottom bits req flags,
* top bits REQ_OP. Use
* accessors.
*/
unsigned short bi_flags; /* status, etc and bvec pool number */
unsigned short bi_ioprio;
unsigned short bi_write_hint;
blk_status_t bi_status;
u8 bi_partno;
/* Number of segments in this BIO after
* physical address coalescing is performed.
*/
unsigned int bi_phys_segments;
/*
* To keep track of the max segment size, we account for the
* sizes of the first and last mergeable segments in this bio.
*/
unsigned int bi_seg_front_size;
unsigned int bi_seg_back_size;
struct bvec_iter bi_iter;
atomic_t __bi_remaining;
bio_end_io_t *bi_end_io;
void *bi_private;
#ifdef CONFIG_BLK_CGROUP
/*
* Optional ioc and css associated with this bio. Put on bio
* release. Read comment on top of bio_associate_current().
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
void *bi_cg_private;
struct blk_issue_stat bi_issue_stat;
#endif
#endif
union {
#if defined(CONFIG_BLK_DEV_INTEGRITY)
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
};
unsigned short bi_vcnt; /* how many bio_vec's */
/*
* Everything starting with bi_max_vecs will be preserved by bio_reset()
*/
unsigned short bi_max_vecs; /* max bvl_vecs we can hold */
atomic_t __bi_cnt; /* pin count */
struct bio_vec *bi_io_vec; /* the actual vec list */
struct bio_set *bi_pool;
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
* MUST obviously be kept at the very end of the bio.
*/
struct bio_vec bi_inline_vecs[0];
};
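The memory that holds a bio's data is not necessarily contiguous. The bio_vec structure describes all the memory that belongs to the bio request. Because that memory may not lie within a single page, a vector is needed; each element of the vector is a [page, offset, len] triple, usually called a segment: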
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
struct bio_vec {
struct page *bv_page;
unsigned int bv_len;
unsigned int bv_offset;
};
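To make the [page, offset, len] idea concrete, here is a minimal userspace sketch that gathers scattered segments into one contiguous buffer. struct seg and gather_segments() are hypothetical names chosen for this example; struct seg only mirrors the field order of bio_vec and uses a plain byte pointer in place of struct page.

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for bio_vec: a byte pointer replaces struct page. */
struct seg {
    const unsigned char *page;  /* start of the "page" */
    unsigned int len;           /* number of valid bytes in this segment */
    unsigned int offset;        /* where the data starts inside the page */
};

/*
 * Copy every segment, in order, into dst. Stops before the first segment
 * that would overflow dst. Returns the number of bytes copied.
 */
static size_t gather_segments(const struct seg *vec, size_t cnt,
                              unsigned char *dst, size_t dst_size)
{
    size_t copied = 0, i;

    for (i = 0; i < cnt; i++) {
        if (vec[i].len > dst_size - copied)
            break;
        memcpy(dst + copied, vec[i].page + vec[i].offset, vec[i].len);
        copied += vec[i].len;
    }
    return copied;
}
```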
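The difference between a request and a bio: the I/O scheduling algorithm can merge contiguous bios into a single request. A request is what results after bios have been adjusted by the I/O scheduler, so one request can contain several bios. When a bio is submitted to the I/O scheduler, the scheduler may insert it into an existing request or create a new request for it.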
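The merge decision can be sketched in userspace as follows. This is a simplification under stated assumptions, not the kernel's algorithm: struct mini_bio, struct mini_req and add_bio() are hypothetical, only exact adjacency is checked, and limits such as maximum request size are ignored.

```c
#include <stddef.h>

/* Hypothetical simplified bio: a run of sectors. */
struct mini_bio {
    unsigned long long sector;
    unsigned int nr_sectors;
};

/* Hypothetical simplified request: a merged run plus a bio count. */
struct mini_req {
    unsigned long long sector;
    unsigned int nr_sectors;
    unsigned int nr_bios;
};

/*
 * Add a bio to the request table: merge it at the back or front of an
 * adjacent request, otherwise start a new request. Returns the index of
 * the request that now holds the bio, or -1 if the table is full.
 */
static int add_bio(struct mini_req *reqs, size_t *nr_reqs, size_t max_reqs,
                   struct mini_bio bio)
{
    size_t i;

    for (i = 0; i < *nr_reqs; i++) {
        struct mini_req *rq = &reqs[i];

        if (rq->sector + rq->nr_sectors == bio.sector) {
            /* back merge: bio starts where the request ends */
            rq->nr_sectors += bio.nr_sectors;
            rq->nr_bios++;
            return (int)i;
        }
        if (bio.sector + bio.nr_sectors == rq->sector) {
            /* front merge: bio ends where the request starts */
            rq->sector = bio.sector;
            rq->nr_sectors += bio.nr_sectors;
            rq->nr_bios++;
            return (int)i;
        }
    }
    if (*nr_reqs == max_reqs)
        return -1;
    i = *nr_reqs;
    reqs[i].sector = bio.sector;
    reqs[i].nr_sectors = bio.nr_sectors;
    reqs[i].nr_bios = 1;
    *nr_reqs = i + 1;
    return (int)i;
}
```

Submitting bios for sectors 100, 108 and 92 (8 sectors each) yields a single request covering sectors 92 to 115, while a bio at sector 500 starts a second request.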
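Every block device, or partition of a block device, has its own request_queue. Requests that the I/O scheduler has merged and sorted are dispatched to this device-level request_queue. The following figure shows the relationship among request_queue, request, bio, and bio_vec.
The main APIs a driver uses to handle bio, request, and request_queue are as follows:
(1) Initializing a request queue
struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock);
rfn: pointer to the request handling function
lock: spinlock that controls access to the queue
This function allocates memory, so its return value must be checked. It is normally called during block device initialization.
(2) Cleaning up a request queue
void blk_cleanup_queue(struct request_queue *q);
This returns the request queue to the system. It is normally called when the block device driver is unloaded.
(3) Allocating a request queue
struct request_queue *blk_alloc_queue(gfp_t gfp_mask);
A fully random-access, non-mechanical device such as a RAM disk does not need elaborate I/O scheduling. In that case the driver can bypass the I/O scheduler and use the following function to bind a request queue to a "make request" function (make_request_fn):
void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn);
The two are typically used together like this: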
xxx_queue = blk_alloc_queue(GFP_KERNEL);
blk_queue_make_request(xxx_queue,xxx_make_request);
(4) Peeking at a request
struct request *blk_peek_request(struct request_queue *q);
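This function returns the next request to be processed (as decided by the I/O scheduler), or NULL if there is none. It does not remove the request; the request stays on the queue.
(5) Starting a request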
void blk_start_request(struct request *req);
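This removes the request from the request queue.
In practice, consider using blk_fetch_request(), which does the work of both blk_peek_request() and blk_start_request(). It is implemented as follows:
struct request *blk_fetch_request(struct request_queue *q)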
{
struct request *rq;
rq = blk_peek_request(q);
if(rq)
blk_start_request(rq);
return rq;
}
(6) Iterating over bios and segments
__rq_for_each_bio() iterates over all the bios of a request.
#define __rq_for_each_bio(_bio,rq) \
if((rq->bio)) \
for(_bio = (rq)->bio; _bio ; _bio = _bio->bi_next)
__bio_for_each_segment() iterates over all the bio_vecs of a bio.
#define __bio_for_each_segment(bvl, bio, iter, start) \
for (iter = (start); \
(iter).bi_size && \
((bvl = bio_iter_iovec((bio), (iter))), 1); \
bio_advance_iter((bio), &(iter), (bvl).bv_len))
#define bio_for_each_segment(bvl, bio, iter) \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
rq_for_each_segment() iterates over every segment of every bio in a request.
#define rq_for_each_segment(bvl, _rq, _iter) \
__rq_for_each_bio(_iter.bio, _rq) \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
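(7) Reporting completion
The following two functions report that a request has completed. An error value of 0 means success and a negative value means failure. __blk_end_request_all() must be called with the queue lock held.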
void __blk_end_request_all(struct request *rq,int error);
void blk_end_request_all(struct request *rq,int error);
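Similar functions include blk_end_request_cur(), blk_end_request_err(), and __blk_end_request(). The xxx_end_request_cur() variants only indicate that the current chunk of the request has completed, that is, that the transfer of bio_cur_bytes(rq->bio) bytes has finished.
If blk_queue_make_request() was used to bypass the I/O scheduler, the driver should call bio_endio() once a bio has been processed to signal that it is done: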
void bio_endio(struct bio *bio,int error);
If the I/O operation failed, the shortcut bio_io_error() can be called. It is defined as:
#define bio_io_error(bio) bio_endio((bio), -EIO)
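3.4 I/O schedulers
The Linux kernel mainly includes the following I/O schedulers: the Noop I/O scheduler, the Deadline I/O scheduler, and the CFQ I/O scheduler.
Noop I/O scheduler: a simplified scheduler that implements a plain FIFO queue and performs only the most basic merging. It is well suited to flash-based storage.
Deadline I/O scheduler: keeps the latency of each request as low as possible and reorders requests to improve performance. It uses round-robin servicing, is small and simple, and offers minimal read latency with good throughput, which makes it a good fit for read-heavy workloads such as databases.
CFQ I/O scheduler: distributes I/O bandwidth evenly among all tasks in the system to give them a fair environment. In multimedia applications it ensures that audio and video data can be read from disk in time.
In the block directory of the kernel source, noop-iosched.c, deadline-iosched.c, and cfq-iosched.c implement the IOSCHED_NOOP, IOSCHED_DEADLINE, and IOSCHED_CFQ scheduling algorithms respectively. In the kernels discussed here, the default scheduler is CFQ.
The I/O scheduling algorithm can be selected with a kernel boot parameter, for example: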
kernel elevator=deadline
The scheduler of a single device can also be changed at run time with the following command:
echo SCHEDULER > /sys/block/DEVICE/queue/scheduler
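Reading the same sysfs file lists the available schedulers, with the active one shown in square brackets (for example "noop deadline [cfq]"). The following userspace sketch extracts the active name from such a line; active_scheduler() is a hypothetical helper written for this illustration, not part of any library.

```c
#include <stddef.h>
#include <string.h>

/*
 * Extract the scheduler name enclosed in [brackets] from one line of
 * /sys/block/DEVICE/queue/scheduler. Returns 0 on success, or -1 if no
 * bracketed name is present or it does not fit into `out`.
 */
static int active_scheduler(const char *line, char *out, size_t out_size)
{
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t len;

    if (!lb || !rb)
        return -1;
    len = (size_t)(rb - lb - 1);
    if (len + 1 > out_size)
        return -1;          /* no room for the name plus terminator */
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```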
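4. Initializing a Linux block device driver
A block device driver has to register itself with the kernel and obtain a device number. This is done with register_blkdev(), whose prototype is: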
int register_blkdev(unsigned int major,const char *name);
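major: the major device number the block device will use
name: the device name
Once registered, the device appears in /proc/devices. If major is 0, the kernel allocates a new major number automatically,
and register_blkdev() returns that major number. A negative return value from register_blkdev() means that an error occurred.
The function that undoes register_blkdev() is unregister_blkdev(), whose prototype is: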
int unregister_blkdev(unsigned int major,const char *name);
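The arguments passed to unregister_blkdev() must match those passed to register_blkdev(); otherwise the function returns -EINVAL.
Besides registering the device, block device initialization usually has to allocate and initialize the request queue and bind the request queue to the request handling function. It may also allocate and initialize a gendisk, assign the gendisk's major, fops, queue, and other members, and finally add the gendisk.
The following code shows a typical block device driver initialization, which includes the calls to register_blkdev(), blk_init_queue(), and add_disk().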
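// Block device driver initialization
// (xxx_disks is assumed to have been allocated earlier with alloc_disk())
static int xxx_init(void)
{
int err;
// Register the block device driver
if(register_blkdev(XXX_MAJOR,"xxx"))
{
err = -EIO;
goto out;
}
// Initialize the request queue
xxx_queue = blk_init_queue(xxx_request,&xxx_lock);
if(!xxx_queue)
{
err = -ENOMEM;
goto out_queue;
}
// Tell the generic block layer and the I/O scheduler the maximum number of sectors a single request on this queue may contain
blk_queue_max_hw_sectors(xxx_queue,255);
// Set the logical block size of this request queue
blk_queue_logical_block_size(xxx_queue,512);
// gendisk initialization
xxx_disks->major = XXX_MAJOR;
xxx_disks->first_minor = 0;
xxx_disks->fops = &xxx_op;
xxx_disks->queue = xxx_queue;
sprintf(xxx_disks->disk_name, "xxx%d", 0);
set_capacity(xxx_disks, xxx_size *2); /* capacity in 512-byte sectors, assuming xxx_size is in KB */
add_disk(xxx_disks); /* add the gendisk */
return 0;
out_queue: unregister_blkdev(XXX_MAJOR,"xxx");
out:put_disk(xxx_disks);
return err;
}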
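Unloading a block device driver does the reverse of what the module load function did:
(1) Clean up the request queue with blk_cleanup_queue().
(2) Drop the reference to the gendisk with put_disk().
(3) Drop the reference to the block device and unregister the driver with unregister_blkdev().
5. Opening and releasing a block device
The open() function of a block device driver is not quite like its character device counterpart. In open(), private_data is reached through the block_device argument bdev; in release(), it is reached through the gendisk argument disk. The code is as follows: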
static int xxx_open(struct block_device *bdev,fmode_t mode)
{
struct xxx_dev *dev = bdev->bd_disk->private_data;
...
return 0;
}
static void xxx_release(struct gendisk *disk,fmode_t mode)
{
struct xxx_dev *dev = disk->private_data;
...
}
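6. The ioctl function of a block device driver
The higher-level block layer code handles the vast majority of I/O controls, such as BLKFLSBUF, BLKROSET, BLKDISCARD, HDIO_GETGEO, BLKROGET, and BLKSECTGET.
A specific block device driver therefore usually only needs to implement the ioctl commands particular to its device. For example, drivers/block/floppy.c implements commands related to the floppy drive (such as FDEJECT, FDSETPRM, and FDFMTTRK).
The Linux MMC subsystem supports one ioctl command, MMC_IOC_CMD, which is handled in drivers/mmc/card/block.c. The ioctl function of the Linux MMC block device is: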
static int mmc_blk_ioctl(struct block_device *bdev,fmode_t mode,unsigned int cmd,unsigned long arg)
{
int ret = -EINVAL;
if(cmd == MMC_IOC_CMD)
ret = mmc_blk_ioctl_cmd(bdev,(struct mmc_ioc_cmd __user *)arg);
return ret;
}
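7. I/O request handling in a block device driver
7.1 Using a request queue
When a block device driver uses a request queue, it initializes the request_queue with blk_init_queue(), whose first argument is a pointer to the request handling function. The request_queue is then passed as the argument to that handler. The prototype of a block device driver's request handling function is:
static void xxx_req(struct request_queue *q);
The driver must not call this function itself. The kernel calls it only when it decides it is time for the driver to process reads, writes, and other operations on the device. The function's main job is to start the block device I/O that corresponds to each request. An example of a request handling function:
// drivers/memstick/core/ms_block.c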
static void msb_submit_req(struct request_queue *q)
{
struct memstick_dev *card = q->queuedata;
struct msb_data *msb = memstick_get_drvdata(card);
struct request *req = NULL;
dbg_verbose("Submit request");
if(msb->card_dead)
{
dbg("Refusing requests on removed card");
WARN_ON(!msb->io_queue_stopped);
while ((req = blk_fetch_request(q)) != NULL) // fetch the first outstanding request on the queue
__blk_end_request_all(req, -ENODEV);
return;
}
if(msb->req)
return;
if(!msb->io_queue_stopped)
queue_work(msb->io_queue,&msb->io_work);
}
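When msb->card_dead is set, the request cannot actually be serviced, so the handler simply fails it with __blk_end_request_all(req, -ENODEV).
In the normal case, queue_work(msb->io_queue, &msb->io_work) starts the work queue, which runs msb_io_work(). The code of msb_io_work() is as follows: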
static void msb_io_work(struct work_struct *work)
{
struct msb_data *msb = container_of(work,struct msb_data,io_work);
int page,error,len;
sector_t lba;
unsigned long flags;
struct scatterlist *sg = msb->prealloc_sg;
dbg_verbose("IO:work started");
while(1)
{
spin_lock_irqsave(&msb->q_lock,flags);
if(msb->need_flush_cache)
{
msb->need_flush_cache = false;
spin_unlock_irqrestore(&msb->q_lock,flags);
msb_cache_flush(msb);
continue;
}
if (!msb->req) {
msb->req = blk_fetch_request(msb->queue);
if (!msb->req) {
dbg_verbose("IO: no more requests exiting");
spin_unlock_irqrestore(&msb->q_lock, flags);
return;
}
}
spin_unlock_irqrestore(&msb->q_lock, flags);
/* If card was removed meanwhile */
if (!msb->req)
return;
/* process the request */
dbg_verbose("IO: processing new request");
// Walk every bio and every segment of the request and build a scatter/gather list
// from all the pages that belong to it; blk_rq_map_sg() is implemented in block/blk-merge.c
blk_rq_map_sg(msb->queue, msb->req, sg);
lba = blk_rq_pos(msb->req);
sector_div(lba, msb->page_size / 512);
page = do_div(lba, msb->pages_in_block);
if (rq_data_dir(msb->req) == READ)
error = msb_do_read_request(msb, lba, page, sg,
blk_rq_bytes(msb->req), &len);
else
error = msb_do_write_request(msb, lba, page, sg,
blk_rq_bytes(msb->req), &len);
spin_lock_irqsave(&msb->q_lock, flags);
if (len)
{
// Tell the upper layer that this part of the request has completed
if (!__blk_end_request(msb->req, 0, len))
msb->req = NULL;
}
// On failure, pass the error code to the upper layer as the second argument
if (error && msb->req) {
dbg_verbose("IO: ending one sector of the request with error");
if (!__blk_end_request(msb->req, error, msb->page_size))
msb->req = NULL;
}
if (msb->req)
dbg_verbose("IO: request still pending");
spin_unlock_irqrestore(&msb->q_lock, flags);
}
}
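7.2 Without a request queue
For a mechanical disk, a request queue helps system performance. Devices such as RAM disks and ZRAM, which support true random access, gain nothing from the sophisticated request queue logic. For these devices the block layer supports a "no queue" mode of operation. To use it, the driver must supply a "make request" function instead of a request handling function. The prototype of a "make request" function is:
static void xxx_make_request(struct request_queue *queue,struct bio *bio);
During initialization the driver no longer calls blk_init_queue(); it calls blk_alloc_queue() and blk_queue_make_request() instead, passing xxx_make_request as the second argument of blk_queue_make_request().
The first argument of xxx_make_request() is still a "request queue", but this queue does not actually contain any requests, because the block layer has no need to turn bios into requests. The main argument of a "make request" function is therefore the bio structure. The example below is taken from a newer kernel, in which this role is played by a submit_bio-style handler that receives only the bio: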
/*
* Handler function for all zram I/O requests.
*/
static blk_qc_t zram_submit_bio(struct bio *bio)
{
struct zram *zram = bio->bi_bdev->bd_disk->private_data;
if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
atomic64_inc(&zram->stats.invalid_io);
goto error;
}
__zram_make_request(zram, bio);
return BLK_QC_T_NONE;
error:
bio_io_error(bio);
return BLK_QC_T_NONE;
}
static void __zram_make_request(struct zram *zram, struct bio *bio)
{
int offset;
u32 index;
struct bio_vec bvec;
struct bvec_iter iter;
unsigned long start_time;
index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
offset = (bio->bi_iter.bi_sector &
(SECTORS_PER_PAGE - 1)) << SECTOR_SHIFT;
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_WRITE_ZEROES:
zram_bio_discard(zram, index, offset, bio);
bio_endio(bio);
return;
default:
break;
}
start_time = bio_start_io_acct(bio);
// Iterate over each segment of the bio with bio_for_each_segment(); zram_bvec_rw() then does the compression, decompression, reading, and writing
bio_for_each_segment(bvec, bio, iter) {
struct bio_vec bv = bvec;
unsigned int unwritten = bvec.bv_len;
do {
bv.bv_len = min_t(unsigned int, PAGE_SIZE - offset,
unwritten);
if (zram_bvec_rw(zram, &bv, index, offset,
bio_op(bio), bio) < 0) {
bio->bi_status = BLK_STS_IOERR;
break;
}
bv.bv_offset += bv.bv_len;
unwritten -= bv.bv_len;
update_position(&index, &offset, &bv);
} while (unwritten);
}
bio_end_io_acct(bio, start_time);
bio_endio(bio);
}
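ZRAM is a Linux memory optimization technique. It sets aside a region of memory to act as a swap partition, and because it compresses the data it stores, it helps with swapping out anonymous pages and in effect "enlarges" memory.
8. The vmem_disk driver
8.1 Hardware principle of vmem_disk
vmem_disk is a virtual disk whose data actually lives in RAM. It uses memory allocated with vmalloc() to simulate a disk and accesses that memory as a block device.
After vmem_disk.ko is loaded with the default module parameters, four block device nodes are added to the system: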
# ls -l /dev/vmem_disk*
brw-rw---- 1 root disk 252, 0 Feb 25 14:00 /dev/vmem_diska
brw-rw---- 1 root disk 252, 16 Feb 25 14:00 /dev/vmem_diskb
brw-rw---- 1 root disk 252, 32 Feb 25 14:00 /dev/vmem_diskc
brw-rw---- 1 root disk 252, 48 Feb 25 14:00 /dev/vmem_diskd
Running mkfs.ext2 on /dev/vmem_diska produces the following output:
$ sudo mkfs.ext2 /dev/vmem_diska
mke2fs 1.42.9 (4-Feb-2014)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
64 inodes, 512 blocks
25 blocks (4.88%) reserved for the super user
First data block=1
Maximum filesystem blocks=524288
1 block group
8192 blocks per group, 8192 fragments per group
64 inodes per group
Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done
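This formats /dev/vmem_diska as an EXT2 file system. The device can then be mounted, and files can be read and written on it.
8.2 Loading and unloading the vmem_disk driver module
The vmem_disk driver supports both the make-request mode and the request-queue mode, and the request-queue mode in turn has a simple and a full variant. The module parameter request_mode selects among them.
The module load function of the vmem_disk driver and its device setup helper: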
static void setup_device(struct vmem_disk_dev *dev,int which)
{
memset(dev,0,sizeof(struct vmem_disk_dev));
dev->size = NSECTORS*HARDSECT_SIZE;
dev->data = vmalloc(dev->size);
if(dev->data == NULL)
{
printk(KERN_NOTICE "vmalloc failure.\n");
return;
}
spin_lock_init(&dev->lock);
switch(request_mode) {
case VMEMD_NOQUEUE:
dev->queue = blk_alloc_queue(GFP_KERNEL);
if(dev->queue == NULL)
goto out_vfree;
blk_queue_make_request(dev->queue,vmem_disk_make_request);
break;
default:
printk(KERN_NOTICE "Bad request mode %d, using simple\n", request_mode);
/* fall through so that an unknown mode still gets a request queue */
case VMEMD_QUEUE:
dev->queue = blk_init_queue(vmem_disk_request, &dev->lock);
if (dev->queue == NULL)
goto out_vfree;
break;
}
blk_queue_logical_block_size(dev->queue,HARDSECT_SIZE);
dev->queue->queuedata = dev;
dev->gd = alloc_disk(VMEM_DISK_MINORS);
if(!dev->gd)
{
printk(KERN_NOTICE "alloc_disk failure\n");
goto out_vfree;
}
dev->gd->major = vmem_disk_major;
dev->gd->first_minor = which*VMEM_DISK_MINORS;
dev->gd->fops = &vmem_disk_ops;
dev->gd->queue = dev->queue;
dev->gd->private_data = dev;
snprintf (dev->gd->disk_name, 32, "vmem_disk%c", which + 'a');
set_capacity(dev->gd, NSECTORS*(HARDSECT_SIZE/KERNEL_SECTOR_SIZE));
add_disk(dev->gd);
return;
out_vfree:
if(dev->data)
vfree(dev->data);
}
static int __init vmem_disk_init(void)
{
int i;
vmem_disk_major = register_blkdev(vmem_disk_major, "vmem_disk");
if (vmem_disk_major <= 0) {
printk(KERN_WARNING "vmem_disk: unable to get major number\n");
return -EBUSY;
}
devices = kmalloc(NDEVICES*sizeof (struct vmem_disk_dev), GFP_KERNEL);
if (!devices)
goto out_unregister;
for (i = 0; i < NDEVICES; i++)
setup_device(devices + i, i);
return 0;
out_unregister:
unregister_blkdev(vmem_disk_major, "vmem_disk");
return -ENOMEM;
}
module_init(vmem_disk_init);
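The code above supports two I/O request modes: make_request and request_queue.
The make_request version handles bios directly in vmem_disk_make_request(), whereas the request_queue version uses vmem_disk_request() to process the request queue.
8.3 block_device_operations of the vmem_disk driver
vmem_disk provides the getgeo() member of the block_device_operations structure. The following code gives the definition of the driver's block_device_operations structure and the implementation of that member function: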
static int vmem_disk_getgeo(struct block_device *bdev,struct hd_geometry *geo)
{
long size;
struct vmem_disk_dev *dev = bdev->bd_disk->private_data;
size = dev->size*(HARDSECT_SIZE/KERNEL_SECTOR_SIZE);
geo->cylinders = (size & ~0x3f) >> 6;
geo->head = 4;
geo->sectors = 16;
geo->start = 4;
return 0;
}
static struct block_device_operations vmem_disk_ops =
{
.getgeo = vmem_disk_getgeo,
};
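The geometry reported by getgeo() here is made up: with 4 heads and 16 sectors per track, one cylinder covers 4 * 16 = 64 sectors, so the cylinder count is the capacity rounded down to a multiple of 64 and divided by 64. A userspace sketch of the same arithmetic, assuming the capacity is given in 512-byte sectors (struct chs and fake_geometry() are hypothetical names for this illustration):

```c
/* Hypothetical container for a cylinder/head/sector geometry. */
struct chs {
    unsigned long cylinders;
    unsigned char heads;
    unsigned char sectors;   /* sectors per track */
};

/*
 * Derive a fake geometry from a capacity in sectors, using the same
 * constants as the listing above: 4 heads and 16 sectors per track.
 */
static struct chs fake_geometry(unsigned long nr_sectors)
{
    struct chs g;

    g.heads = 4;
    g.sectors = 16;
    /* 4 * 16 = 64 sectors per cylinder: round down, then divide by 64 */
    g.cylinders = (nr_sectors & ~0x3fUL) >> 6;
    return g;
}
```

A 1024-sector disk reports 16 cylinders, which multiplies back to exactly 1024 sectors; a capacity that is not a multiple of 64 loses its tail in the reported geometry.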
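8.4 I/O request handling in vmem_disk
The vmem_disk driver selects its request handling mode through the module parameter request_mode. The following listing shows the driver's request handling code:
// Handle an I/O request. This performs the actual hardware I/O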
static void vmem_disk_transfer(struct vmem_disk_dev *dev,unsigned long sector,unsigned long nsect,char *buffer,int write)
{
unsigned long offset = sector*KERNEL_SECTOR_SIZE;
unsigned long nbytes = nsect*KERNEL_SECTOR_SIZE;
if((offset + nbytes) > dev->size)
{
printk (KERN_NOTICE "Beyond-end write (%ld %ld)\n", offset, nbytes);
return;
}
if (write)
memcpy(dev->data + offset, buffer, nbytes);
else
memcpy(buffer,dev->data + offset,nbytes);
}
// Transfer a single bio. Calls vmem_disk_transfer() to perform the hardware operation for each part of the bio
static int vmem_disk_xfer_bio(struct vmem_disk_dev *dev,struct bio *bio)
{
struct bio_vec bvec;
struct bvec_iter iter;
sector_t sector = bio->bi_iter.bi_sector;
// Walk each segment of the bio
bio_for_each_segment(bvec, bio, iter) {
char *buffer = __bio_kmap_atomic(bio, iter);
vmem_disk_transfer(dev, sector, bio_cur_bytes(bio) >> 9,
buffer, bio_data_dir(bio) == WRITE);
sector += bio_cur_bytes(bio) >> 9;
__bio_kunmap_atomic(buffer);
}
return 0;
}
// the request_queue version
static void vmem_disk_request(struct request_queue *q)
{
struct request *req;
struct bio *bio;
// blk_peek_request() first takes a request from the request_queue
while ((req = blk_peek_request(q)) != NULL)
{
struct vmem_disk_dev *dev = req->rq_disk->private_data;
if (req->cmd_type != REQ_TYPE_FS)
{
printk (KERN_NOTICE "Skip non-fs request\n");
blk_start_request(req);
__blk_end_request_all(req, -EIO);
continue;
}
blk_start_request(req);
__rq_for_each_bio(bio, req) // iterate over every bio in the request
vmem_disk_xfer_bio(dev, bio); // perform the I/O for this bio
__blk_end_request_all(req, 0);
}
}
// The make-request version. Calls vmem_disk_xfer_bio() to complete one bio
static void vmem_disk_make_request(struct request_queue *q,struct bio *bio)
{
struct vmem_disk_dev *dev = q->queuedata;
int status;
status = vmem_disk_xfer_bio(dev,bio);
bio_endio(bio,status);
}
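The core of vmem_disk_transfer() is a bounds-checked memcpy() between the caller's buffer and the backing memory. The following userspace sketch shows that idea in isolation. It is an illustration under stated assumptions rather than the driver's code: struct ram_disk and ram_disk_transfer() are hypothetical names, the sector size is fixed at 512 bytes, and unlike the listing above it reports an out-of-range access to the caller instead of only logging it.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512u

/* Hypothetical userspace model of a RAM-backed disk. */
struct ram_disk {
    unsigned char *data;  /* backing memory */
    size_t size;          /* size of the backing memory in bytes */
};

/*
 * Copy nsect sectors starting at `sector` between the disk and `buffer`.
 * write != 0 copies buffer -> disk, write == 0 copies disk -> buffer.
 * Returns 0 on success, -1 if the range lies beyond the end of the disk.
 */
static int ram_disk_transfer(struct ram_disk *dev, unsigned long sector,
                             unsigned long nsect, unsigned char *buffer,
                             int write)
{
    size_t offset = (size_t)sector * SECTOR_SIZE;
    size_t nbytes = (size_t)nsect * SECTOR_SIZE;

    if (offset > dev->size || nbytes > dev->size - offset)
        return -1;
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
    return 0;
}
```

Returning an error code lets a caller fail the corresponding bio or request instead of completing it as if the data had been transferred.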
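9. The Linux MMC subsystem
The Linux MMC/SD memory card is a typical block device. Its implementation lives in drivers/mmc, which is divided into three subdirectories: card, core, and host.
card interfaces with the Linux block device subsystem, implementing the block device driver and completing requests. The protocol itself goes through the interfaces of the core layer, and the transfer is finally carried out by host.
The mmc_init_queue() function in drivers/mmc/card/queue.c binds the request handling function mmc_request_fn(). Older versions do this with blk_init_queue(mmc_request_fn, lock); the version shown below sets request_fn on an allocated queue and then calls blk_init_allocated_queue():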
/**
* mmc_init_queue - initialise a queue structure.
* @mq: mmc queue
* @card: mmc card to attach this queue
* @lock: queue lock
* @subname: partition subname
*
* Initialise a MMC card request queue.
*/
int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card,
spinlock_t *lock, const char *subname)
{
struct mmc_host *host = card->host;
int ret = -ENOMEM;
mq->card = card;
mq->use_cqe = host->cqe_enabled;
if (mq->use_cqe || mmc_host_use_blk_mq(host))
return mmc_mq_init(mq, card, lock);
mq->queue = blk_alloc_queue(GFP_KERNEL);
if (!mq->queue)
return -ENOMEM;
mq->queue->queue_lock = lock;
mq->queue->request_fn = mmc_request_fn;
mq->queue->init_rq_fn = mmc_init_request;
mq->queue->exit_rq_fn = mmc_exit_request;
mq->queue->cmd_size = sizeof(struct mmc_queue_req);
mq->queue->queuedata = mq;
mq->qcnt = 0;
ret = blk_init_allocated_queue(mq->queue);
if (ret) {
blk_cleanup_queue(mq->queue);
return ret;
}
blk_queue_prep_rq(mq->queue, mmc_prep_request);
mmc_setup_queue(mq, card);
mq->thread = kthread_run(mmc_queue_thread, mq, "mmcqd/%d%s",
host->index, subname ? subname : "");
if (IS_ERR(mq->thread)) {
ret = PTR_ERR(mq->thread);
goto cleanup_queue;
}
return 0;
cleanup_queue:
blk_cleanup_queue(mq->queue);
return ret;
}
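mmc_request_fn() wakes up the kernel thread that belongs to the MMC queue to process requests. The thread's function, mmc_queue_thread(), issues each request for the card; older versions call mq->issue_fn(mq, req), while the version shown below calls mmc_blk_issue_rq() directly: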
static int mmc_queue_thread(void *d)
{
struct mmc_queue *mq = d;
struct request_queue *q = mq->queue;
struct mmc_context_info *cntx = &mq->card->host->context_info;
current->flags |= PF_MEMALLOC;
down(&mq->thread_sem);
do {
struct request *req;
spin_lock_irq(q->queue_lock);
set_current_state(TASK_INTERRUPTIBLE);
req = blk_fetch_request(q);
mq->asleep = false;
cntx->is_waiting_last_req = false;
cntx->is_new_req = false;
if (!req) {
/*
* Dispatch queue is empty so set flags for
* mmc_request_fn() to wake us up.
*/
if (mq->qcnt)
cntx->is_waiting_last_req = true;
else
mq->asleep = true;
}
spin_unlock_irq(q->queue_lock);
if (req || mq->qcnt) {
set_current_state(TASK_RUNNING);
mmc_blk_issue_rq(mq, req);
cond_resched();
} else {
if (kthread_should_stop()) {
set_current_state(TASK_RUNNING);
break;
}
up(&mq->thread_sem);
schedule();
down(&mq->thread_sem);
}
} while (1);
up(&mq->thread_sem);
return 0;
}
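For storage devices, mq->issue_fn() points to mmc_blk_issue_rq() in drivers/mmc/card/block.c. The same file contains mmc_blk_alloc_req(), shown below, which allocates the gendisk, initializes the queue through mmc_init_queue(), and fills in the gendisk members: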
static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
struct device *parent,
sector_t size,
bool default_ro,
const char *subname,
int area_type)
{
struct mmc_blk_data *md;
int devidx, ret;
devidx = ida_simple_get(&mmc_blk_ida, 0, max_devices, GFP_KERNEL);
if (devidx < 0) {
/*
* We get -ENOSPC because there are no more any available
* devidx. The reason may be that, either userspace haven't yet
* unmounted the partitions, which postpones mmc_blk_release()
* from being called, or the device has more partitions than
* what we support.
*/
if (devidx == -ENOSPC)
dev_err(mmc_dev(card->host),
"no more device IDs available\n");
return ERR_PTR(devidx);
}
md = kzalloc(sizeof(struct mmc_blk_data), GFP_KERNEL);
if (!md) {
ret = -ENOMEM;
goto out;
}
md->area_type = area_type;
/*
* Set the read-only status based on the supported commands
* and the write protect switch.
*/
md->read_only = mmc_blk_readonly(card);
md->disk = alloc_disk(perdev_minors);
if (md->disk == NULL) {
ret = -ENOMEM;
goto err_kfree;
}
spin_lock_init(&md->lock);
INIT_LIST_HEAD(&md->part);
md->usage = 1;
ret = mmc_init_queue(&md->queue, card, &md->lock, subname);
if (ret)
goto err_putdisk;
md->queue.blkdata = md;
/*
* Keep an extra reference to the queue so that we can shutdown the
* queue (i.e. call blk_cleanup_queue()) while there are still
* references to the 'md'. The corresponding blk_put_queue() is in
* mmc_blk_put().
*/
if (!blk_get_queue(md->queue.queue)) {
mmc_cleanup_queue(&md->queue);
goto err_putdisk;
}
md->disk->major = MMC_BLOCK_MAJOR;
md->disk->first_minor = devidx * perdev_minors;
md->disk->fops = &mmc_bdops;
md->disk->private_data = md;
md->disk->queue = md->queue.queue;
md->parent = parent;
set_disk_ro(md->disk, md->read_only || default_ro);
md->disk->flags = GENHD_FL_EXT_DEVT;
if (area_type & (MMC_BLK_DATA_AREA_RPMB | MMC_BLK_DATA_AREA_BOOT))
md->disk->flags |= GENHD_FL_NO_PART_SCAN;
/*
* As discussed on lkml, GENHD_FL_REMOVABLE should:
*
* - be set for removable media with permanent block devices
* - be unset for removable block devices with permanent media
*
* Since MMC block devices clearly fall under the second
* case, we do not set GENHD_FL_REMOVABLE. Userspace
* should use the block device creation/destruction hotplug
* messages to tell when the card is present.
*/
snprintf(md->disk->disk_name, sizeof(md->disk->disk_name),
"mmcblk%u%s", card->host->index, subname ? subname : "");
if (mmc_card_mmc(card))
blk_queue_logical_block_size(md->queue.queue,
card->ext_csd.data_sector_size);
else
blk_queue_logical_block_size(md->queue.queue, 512);
set_capacity(md->disk, size);
if (mmc_host_cmd23(card->host)) {
if ((mmc_card_mmc(card) &&
card->csd.mmca_vsn >= CSD_SPEC_VER_3) ||
(mmc_card_sd(card) &&
card->scr.cmds & SD_SCR_CMD23_SUPPORT))
md->flags |= MMC_BLK_CMD23;
}
if (mmc_card_mmc(card) &&
md->flags & MMC_BLK_CMD23 &&
((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) ||
card->ext_csd.rel_sectors)) {
md->flags |= MMC_BLK_REL_WR;
blk_queue_write_cache(md->queue.queue, true, true);
}
return md;
err_putdisk:
put_disk(md->disk);
err_kfree:
kfree(md);
out:
ida_simple_remove(&mmc_blk_ida, devidx);
return ERR_PTR(ret);
}
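The set of MMC host operations is the mmc_host_ops structure. The bulk of an MMC host driver's work is implementing the member functions of this structure, as in drivers/mmc/host/mmc_spi.c and drivers/mmc/host/bfin_sdh.c.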
struct mmc_host_ops {
/*
* It is optional for the host to implement pre_req and post_req in
* order to support double buffering of requests (prepare one
* request while another request is active).
* pre_req() must always be followed by a post_req().
* To undo a call made to pre_req(), call post_req() with
* a nonzero err condition.
*/
void (*post_req)(struct mmc_host *host, struct mmc_request *req,
int err);
void (*pre_req)(struct mmc_host *host, struct mmc_request *req);
void (*request)(struct mmc_host *host, struct mmc_request *req);
/*
* Avoid calling the next three functions too often or in a "fast
* path", since underlaying controller might implement them in an
* expensive and/or slow way. Also note that these functions might
* sleep, so don't call them in the atomic contexts!
*/
/*
* Notes to the set_ios callback:
* ios->clock might be 0. For some controllers, setting 0Hz
* as any other frequency works. However, some controllers
* explicitly need to disable the clock. Otherwise e.g. voltage
* switching might fail because the SDCLK is not really quiet.
*/
void (*set_ios)(struct mmc_host *host, struct mmc_ios *ios);
/*
* Return values for the get_ro callback should be:
* 0 for a read/write card
* 1 for a read-only card
* -ENOSYS when not supported (equal to NULL callback)
* or a negative errno value when something bad happened
*/
int (*get_ro)(struct mmc_host *host);
/*
* Return values for the get_cd callback should be:
* 0 for a absent card
* 1 for a present card
* -ENOSYS when not supported (equal to NULL callback)
* or a negative errno value when something bad happened
*/
int (*get_cd)(struct mmc_host *host);
void (*enable_sdio_irq)(struct mmc_host *host, int enable);
void (*ack_sdio_irq)(struct mmc_host *host);
/* optional callback for HC quirks */
void (*init_card)(struct mmc_host *host, struct mmc_card *card);
int (*start_signal_voltage_switch)(struct mmc_host *host, struct mmc_ios *ios);
/* Check if the card is pulling dat[0:3] low */
int (*card_busy)(struct mmc_host *host);
/* The tuning command opcode value is different for SD and eMMC cards */
int (*execute_tuning)(struct mmc_host *host, u32 opcode);
/* Prepare HS400 target operating frequency depending host driver */
int (*prepare_hs400_tuning)(struct mmc_host *host, struct mmc_ios *ios);
/* Prepare enhanced strobe depending host driver */
void (*hs400_enhanced_strobe)(struct mmc_host *host,
struct mmc_ios *ios);
int (*select_drive_strength)(struct mmc_card *card,
unsigned int max_dtr, int host_drv,
int card_drv, int *drv_type);
void (*hw_reset)(struct mmc_host *host);
void (*card_event)(struct mmc_host *host);
/*
* Optional callback to support controllers with HW issues for multiple
* I/O. Returns the number of supported blocks for the request.
*/
int (*multi_io_quirk)(struct mmc_card *card,
unsigned int direction, int blk_size);
};
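10. Summary
Block device I/O differs considerably from character device I/O, which is why data structures such as request_queue, request, and bio were introduced.
The "request" runs through the whole of block device I/O. Block device I/O operations are queued and merged, whereas character device I/O operations are carried out directly, with no detours.
The driver's task is to process requests, while the queuing and merging of requests is taken care of by the I/O scheduling algorithm. The core of a block device driver is therefore its request handling function or its "make request" function.
Although a block device driver still has a block_device_operations structure with member functions, that structure no longer contains read and write members. It only contains functions unrelated to the actual data transfer, such as open, release, and I/O control.
The structure of a block device driver is relatively complex, but block devices are not as all-encompassing as character devices. A block device is usually a storage device, and the main body of the driver is already provided by the Linux kernel, so very little needs to be modified.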