Block Drivers
Global architecture
Block devices are storage media capable of random access. Unlike character devices, block devices can hold file system data.
The storage medium holds files that reside in a filesystem, such as ext3 or ReiserFS. User applications invoke I/O system calls to access these files. The resulting filesystem operations pass through the generic Virtual File System (VFS) layer before entering the individual filesystem driver. The buffer cache speeds up filesystem access to block devices by caching disk blocks: if a block is found in the buffer cache, the disk access needed to read it is avoided. Requests destined for each block device are lined up in a request queue. The filesystem driver populates the request queue belonging to the desired block device, whereas the block driver receives and consumes requests from the corresponding queue. In between, I/O schedulers reorder and merge the queued requests to minimize disk access latency and maximize throughput.
Linux I/O schedulers
- Linus elevator
- Deadline: tries to guarantee that an I/O will be served within a deadline
- Anticipatory: tries to anticipate what could be the next accesses
- Completely Fair Queuing (CFQ): the default scheduler, tries to guarantee fairness between users of a block device
- Noop: for non-disk-based block devices
The current scheduler for a device can be read and set through
/sys/block/<dev>/queue/scheduler
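Reading that file returns the available schedulers on one line, with the active one in square brackets (for example `noop deadline [cfq]`); writing a scheduler name to the file selects it. The following is a minimal user-space sketch of how a tool might extract the active scheduler from such a line. The helper name `active_scheduler` is hypothetical and not part of any kernel or libc API.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the scheduler name enclosed in [brackets] from `line` into `out`.
 * Returns 0 on success, -1 if no bracketed name is found or it does not fit.
 */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
    const char *start = strchr(line, '[');
    const char *end = start ? strchr(start, ']') : NULL;
    size_t len;

    if (!start || !end)
        return -1;
    len = (size_t)(end - start - 1);
    if (len == 0 || len >= outlen)
        return -1;
    memcpy(out, start + 1, len);
    out[len] = '\0';
    return 0;
}
```

A caller would read one line from the sysfs file with `fgets()` and pass it to the helper.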
Ramdisk Example
This example is based on kernel version 4.15.0-46-generic.
Register a block I/O device
#define MY_BLOCK_MAJOR 240
#define MY_BLKDEV_NAME "sun_block"
static int my_block_init(void)
{
int status;
status = register_blkdev(MY_BLOCK_MAJOR, MY_BLKDEV_NAME);
if (status < 0) {
printk(KERN_ERR "unable to register %s block device\n", MY_BLKDEV_NAME);
return status; /* propagate the real error code */
}
printk(KERN_INFO "register a test block driver...\n");
return 0;
}
static void my_block_exit(void)
{
printk(KERN_INFO "exit my block driver..\n");
unregister_blkdev(MY_BLOCK_MAJOR, MY_BLKDEV_NAME);
}
module_init(my_block_init);
module_exit(my_block_exit);
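Once the module is loaded, the registered major number and name appear in the "Block devices:" section of /proc/devices. As a rough user-space sketch, the check could be automated as below. The function name `find_block_major` is made up for illustration, and the parser assumes the usual "major name" line layout of that file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Search the "Block devices:" section of /proc/devices-style text for `name`.
 * Returns the major number, or -1 if the name is not listed there.
 */
static int find_block_major(const char *text, const char *name)
{
    const char *p = strstr(text, "Block devices:");
    int major;
    char entry[64];

    if (!p)
        return -1;
    p = strchr(p, '\n');            /* skip the section header line */
    while (p && *++p) {
        if (sscanf(p, "%d %63s", &major, entry) != 2)
            break;                  /* unexpected format: stop scanning */
        if (strcmp(entry, name) == 0)
            return major;
        p = strchr(p, '\n');        /* advance to the next line */
    }
    return -1;
}
```

With the driver above loaded, passing the file contents and "sun_block" should yield 240, the value of MY_BLOCK_MAJOR.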
Register a disk
struct gendisk is an abstraction of a real disk, and it differs from struct block_device. gendisk is defined in "include/linux/genhd.h", which indicates that it is mainly used by block device drivers, whereas block_device is defined in "include/linux/fs.h", which indicates that it is closely tied to the filesystem layer.
For each partition of a block device that has already been opened, there is an instance of struct block_device. The objects for partitions are connected with the object for the complete device via bd_contains. All block_device instances contain a link to their generic disk data structure gendisk via bd_disk. Note that while there are multiple block_device instances for a partitioned disk, one gendisk instance is sufficient.
The gendisk instance points to an array with pointers to hd_structs. Each represents one partition. If a block_device represents a partition, then it contains a pointer to the hd_struct in question; the hd_struct instances are shared between struct gendisk and struct block_device.
Additionally, generic hard disks are integrated into the kobject framework as shown in Figure 6-11. The block subsystem is represented by the kset instance block_subsystem. The kset contains a linked list on which the embedded kobjects of each gendisk instance are collected.
Partitions represented by struct hd_struct also contain an embedded kobject. Conceptually, partitions are subelements of a hard disk, and this is also captured in the data structures: The parent pointer of the kobject embedded in every hd_struct points to the kobject of the generic hard disk.
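The pointer relationships described above can be pictured with a small user-space model. This is only a sketch: the `model_*` types are simplified stand-ins invented for illustration, not the kernel definitions, and the fixed-size partition array replaces the kernel's real partition table. Only the field names bd_contains, bd_part and bd_disk mirror the kernel's struct block_device.

```c
#include <stddef.h>

#define MODEL_MAX_PARTS 4

/* Stand-in for struct hd_struct: one partition */
struct model_hd_struct {
    unsigned long start_sect;
    unsigned long nr_sects;
    int partno;
};

/* Stand-in for struct gendisk: one per physical disk */
struct model_gendisk {
    const char *disk_name;
    struct model_hd_struct *part[MODEL_MAX_PARTS];
};

/* Stand-in for struct block_device: one per opened disk or partition */
struct model_block_device {
    struct model_block_device *bd_contains; /* block_device of the whole disk */
    struct model_hd_struct *bd_part;        /* partition, shared with the gendisk */
    struct model_gendisk *bd_disk;          /* the single gendisk instance */
};

/* Wire a partition's block_device to the whole disk and its gendisk */
static void model_open_partition(struct model_block_device *bdev,
                                 struct model_block_device *whole,
                                 struct model_gendisk *disk, int partno)
{
    bdev->bd_contains = whole;
    bdev->bd_disk = disk;
    bdev->bd_part = disk->part[partno];
}
```

However many partitions are opened, every block_device in the model ends up pointing at the same gendisk, which is the point the text makes.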
#define SECTOR_SIZE 512
#define MY_SECTORS 16
#define MY_HEADS 4
#define MY_CYLINDERS 1024
#define MY_SECTOR_TOTAL (MY_SECTORS*MY_HEADS*MY_CYLINDERS)
#define MY_SIZE (MY_SECTOR_TOTAL*SECTOR_SIZE)
#define MY_BLOCK_MINORS 1 /* assumed value: a single minor, so no partitions */
static struct my_block_dev {
spinlock_t lock; /* For mutual exclusion */
struct request_queue *queue; /* The device request queue */
struct gendisk *gd; /* The gendisk structure */
unsigned char * data;
} dev;
struct block_device_operations my_block_ops = {
.owner = THIS_MODULE,
.open = my_block_open,
.release = my_block_release,
.ioctl = my_block_ioctl,
};
static int create_block_device(struct my_block_dev *dev)
{
dev->gd = alloc_disk(MY_BLOCK_MINORS);
if (!dev->gd) {
printk (KERN_NOTICE "alloc_disk failure\n");
return -ENOMEM;
}
dev->gd->major = MY_BLOCK_MAJOR;
dev->gd->first_minor = 0;
dev->gd->fops = &my_block_ops;
dev->gd->queue = dev->queue;
dev->gd->private_data = dev;
snprintf(dev->gd->disk_name, DISK_NAME_LEN, "%s", MY_BLKDEV_NAME);
set_capacity(dev->gd, MY_SECTOR_TOTAL);
add_disk(dev->gd);
return 0;
}
static int my_block_init(void)
{
...
dev.data = vmalloc(MY_SIZE); /* allocate 32 MiB of RAM to back this ramdisk */
if (dev.data == NULL) {
printk(KERN_ERR "unable to vmalloc\n");
return -ENOMEM;
}
/* Only touch the buffer after the NULL check */
memset(dev.data, 0, MY_SIZE);
printk(KERN_INFO "data range %p---%p, size 0x%x\n", dev.data, dev.data + MY_SIZE, MY_SIZE);
status = create_block_device(&dev);
if (status < 0) {
vfree(dev.data);
return status;
}
...
}
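static void delete_block_device(struct my_block_dev *dev)
{
if (dev->gd) {
del_gendisk(dev->gd);
put_disk(dev->gd); /* drop the reference taken by alloc_disk() */
}
if (dev->queue)
blk_cleanup_queue(dev->queue); /* release the request queue */
if (dev->data)
vfree(dev->data);
}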
static void my_block_exit(void)
{
...
delete_block_device(&dev);
...
}
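The capacity passed to set_capacity() is counted in 512-byte sectors, so the geometry macros above determine the size of the ramdisk. The arithmetic can be checked with a short user-space sketch. The macro values are copied from the listing; the two helper functions are hypothetical names introduced only for this example.

```c
#include <stdint.h>

#define SECTOR_SIZE   512
#define MY_SECTORS    16
#define MY_HEADS      4
#define MY_CYLINDERS  1024

/* Number of 512-byte sectors: sectors per track * heads * cylinders */
static uint64_t total_sectors(void)
{
    return (uint64_t)MY_SECTORS * MY_HEADS * MY_CYLINDERS;
}

/* Size of the backing buffer in bytes */
static uint64_t total_bytes(void)
{
    return total_sectors() * SECTOR_SIZE;
}
```

This gives 65536 sectors, or 33554432 bytes (32 MiB), which is 32768 blocks of 1 KiB.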
Request queues
Block device drivers use queues to hold the block I/O requests that still have to be processed. A request queue is represented by struct request_queue, which contains a doubly linked list of requests and their associated control information. There are two ways to handle the queue: the request version and the make_request version.
The relationship between request and bio is as follows: a request holds a chain of one or more bios, and each bio describes its data as an array of bio_vec segments (page, offset, length).
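As a rough illustration of that nesting, the user-space model below walks a request, its bios, and their segments to compute the total transfer size. The `model_*` structures are simplified assumptions for this sketch and leave out almost everything the kernel's struct request, struct bio, and struct bio_vec contain.

```c
#include <stddef.h>

/* Stand-in for struct bio_vec: one contiguous piece of memory */
struct model_bio_vec {
    void *page;
    unsigned int len;
    unsigned int offset;
};

/* Stand-in for struct bio: a run of sectors backed by several segments */
struct model_bio {
    unsigned long sector;
    struct model_bio_vec *vecs;
    size_t nr_vecs;
    struct model_bio *next;     /* next bio in the same request */
};

/* Stand-in for struct request: a chain of bios */
struct model_request {
    struct model_bio *bio;
};

/* Sum the lengths of every segment of every bio in the request */
static size_t model_request_bytes(const struct model_request *rq)
{
    const struct model_bio *b;
    size_t i, total = 0;

    for (b = rq->bio; b; b = b->next)
        for (i = 0; i < b->nr_vecs; i++)
            total += b->vecs[i].len;
    return total;
}
```

The nested loop has the same shape as the rq_for_each_segment iteration used later in the request version.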
1. request version.
static void my_block_request(struct request_queue *q)
{
struct request *rq;
while ((rq = blk_fetch_request(q)) != NULL) {
__process_request(rq);
__blk_end_request_all(rq, 0);
}
}
static int create_block_device(struct my_block_dev *dev)
{
/* Initialize the I/O queue */
spin_lock_init(&dev->lock);
dev->queue = blk_init_queue(my_block_request, &dev->lock);
if (dev->queue == NULL)
return -ENOMEM;
...
}
2. make_request version
blk_queue_make_request does not allocate a queue itself, so the driver must first allocate one with blk_alloc_queue.
static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
struct my_block_dev *pdev = bio->bi_disk->private_data;
struct bio_vec bvec;
sector_t sector;
struct bvec_iter iter;
char *pData, *pBuffer;
sector = bio->bi_iter.bi_sector;
if (bio_end_sector(bio) > get_capacity(bio->bi_disk))
goto io_error;
pData = pdev->data + (sector * SECTOR_SIZE);
bio_for_each_segment(bvec, bio, iter) {
__process_bio_vector();
pData += bvec.bv_len;
}
bio_endio(bio);
return BLK_QC_T_NONE;
io_error:
bio_io_error(bio);
return BLK_QC_T_NONE;
}
static int create_block_device(struct my_block_dev *dev)
{
/* Initialize the I/O queue */
spin_lock_init(&dev->lock);
dev->queue = blk_alloc_queue(GFP_KERNEL);
if (dev->queue == NULL)
return -ENOMEM;
blk_queue_make_request(dev->queue, my_make_request);
blk_queue_logical_block_size(dev->queue, SECTOR_SIZE);
...
}
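The range check at the top of my_make_request and the sector-to-offset conversion can be tried out in user space. The sketch below is an assumption-laden stand-in: `io_in_range` and `sector_to_offset` are hypothetical helpers, not kernel functions, and the check is written so that adding the start sector and the length cannot overflow.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512

/*
 * True if [start_sector, start_sector + nr_sectors) lies inside a device
 * of `capacity` sectors. Written as a subtraction so that the sum of
 * start_sector and nr_sectors is never computed and cannot overflow.
 */
static bool io_in_range(uint64_t start_sector, uint64_t nr_sectors,
                        uint64_t capacity)
{
    return start_sector <= capacity &&
           nr_sectors <= capacity - start_sector;
}

/* Byte offset into the ramdisk buffer for a given sector */
static uint64_t sector_to_offset(uint64_t sector)
{
    return sector * SECTOR_SIZE;
}
```

For the 65536-sector ramdisk, an 8-sector transfer starting at sector 65528 is the last one that fits.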
When an upper layer has built a bio, it calls submit_bio(bio), which does some initialization and accounting and then calls generic_make_request(bio). That function first checks whether a make_request_fn is already running on the current task, which happens when a stacked driver resubmits a bio from inside its own make_request_fn. If so, the bio is appended to current->bio_list and the call returns immediately, so the recursion is flattened into a loop. If not, generic_make_request calls q->make_request_fn for the bio and then keeps popping bios from current->bio_list until the list is empty. In the request version, make_request_fn is hooked to blk_queue_bio(q, bio); in the make_request version, it is the driver's own function. blk_queue_bio tries to merge the bio into an existing request to optimize the operation, and otherwise initializes a new request from it via init_request_from_bio(req, bio). The queue is then run through __blk_run_queue, which eventually calls my_block_request.
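The way generic_make_request turns recursive submission into iteration can be sketched in user space. Everything below is a simplified simulation under stated assumptions: the `sim_*` names are invented, the global list stands in for the per-task current->bio_list, and the two callbacks play the role of a stacked driver and a leaf driver.

```c
#include <stddef.h>

struct sim_bio {
    struct sim_bio *next;
    void (*make_request_fn)(struct sim_bio *bio);
    int id;
};

static struct sim_bio *pending_head, *pending_tail; /* models current->bio_list */
static int list_active;        /* nonzero while a make_request_fn is running */
static int depth, max_depth;   /* nesting of make_request_fn calls */
static int order[8], order_len;/* ids in the order they were processed */

static void sim_generic_make_request(struct sim_bio *bio)
{
    if (list_active) {
        /* Called from inside a make_request_fn: queue the bio and return */
        bio->next = NULL;
        if (pending_tail)
            pending_tail->next = bio;
        else
            pending_head = bio;
        pending_tail = bio;
        return;
    }
    list_active = 1;
    do {
        depth++;
        if (depth > max_depth)
            max_depth = depth;
        if (order_len < 8)
            order[order_len++] = bio->id;
        bio->make_request_fn(bio);
        depth--;
        bio = pending_head;     /* pop the next queued bio, if any */
        if (bio) {
            pending_head = bio->next;
            if (!pending_head)
                pending_tail = NULL;
        }
    } while (bio);
    list_active = 0;
}

/* Leaf driver: completes the bio without resubmitting anything */
static void sim_leaf_fn(struct sim_bio *bio) { (void)bio; }

/* Stacked driver: remaps the bio and submits a child to a lower device */
static void sim_stacked_fn(struct sim_bio *bio)
{
    static struct sim_bio child;

    child.id = bio->id + 100;
    child.make_request_fn = sim_leaf_fn;
    sim_generic_make_request(&child);
}
```

Submitting a bio whose make_request_fn is sim_stacked_fn processes the parent and then the child, while the nesting depth of make_request_fn calls never exceeds one.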

Process the request
1. request version.
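static void my_block_request(struct request_queue *q)
{
...
/* The block layer calls this function with the queue lock held, so it must not sleep */
while ((rq = blk_fetch_request(q)) != NULL) {
blk_status_t status = BLK_STS_OK;
pdev = rq->rq_disk->private_data;
rq_for_each_segment(bvec, rq, iter) {
/* iter.iter.bi_sector is the first sector of the current segment */
start = iter.iter.bi_sector;
pData = pdev->data + start * SECTOR_SIZE;
/* kmap_atomic() does not sleep, unlike kmap() */
pBuffer = kmap_atomic(bvec.bv_page) + bvec.bv_offset;
switch(rq_data_dir(rq))
{
case READ:
memcpy(pBuffer, pData, bvec.bv_len);
flush_dcache_page(bvec.bv_page);
break;
case WRITE:
flush_dcache_page(bvec.bv_page);
memcpy(pData, pBuffer, bvec.bv_len);
break;
default:
kunmap_atomic(pBuffer - bvec.bv_offset);
status = BLK_STS_IOERR;
goto end_request;
}
kunmap_atomic(pBuffer - bvec.bv_offset);
}
end_request:
/* Always complete the request, otherwise the submitter waits forever */
__blk_end_request_all(rq, status);
}
}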
2. make_request version
static blk_qc_t my_make_request(struct request_queue *q, struct bio *bio)
{
...
bio_for_each_segment(bvec, bio, iter) {
pBuffer = kmap(bvec.bv_page) + bvec.bv_offset;
switch(bio_data_dir(bio))
{
case READ:
memcpy(pBuffer, pData, bvec.bv_len);
flush_dcache_page(bvec.bv_page);
break;
case WRITE:
flush_dcache_page(bvec.bv_page);
memcpy(pData, pBuffer, bvec.bv_len);
break;
default:
kunmap(bvec.bv_page);
goto io_error;
}
kunmap(bvec.bv_page);
pData += bvec.bv_len;
}
...
}
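The copy loop itself does not depend on kernel facilities and can be exercised in user space. The following sketch assumes a flat buffer as the "disk" and a plain array of segments in place of bio_vec. The names `sim_transfer`, `sim_segment`, and `sim_dir` are hypothetical, and page mapping and cache flushing are left out.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512

enum sim_dir { SIM_READ, SIM_WRITE };

/* Stand-in for one bio_vec: a buffer and its length */
struct sim_segment {
    unsigned char *buf;
    size_t len;
};

/*
 * Transfer all segments to or from `disk`, starting at `sector`.
 * Returns the number of bytes moved, or -1 if the range is out of bounds.
 */
static long sim_transfer(unsigned char *disk, size_t disk_size, size_t sector,
                         struct sim_segment *segs, size_t nr_segs,
                         enum sim_dir dir)
{
    unsigned char *pos;
    size_t i, total = 0;

    for (i = 0; i < nr_segs; i++)
        total += segs[i].len;
    if (sector > disk_size / SECTOR_SIZE ||
        total > disk_size - sector * SECTOR_SIZE)
        return -1;

    pos = disk + sector * SECTOR_SIZE;
    for (i = 0; i < nr_segs; i++) {
        if (dir == SIM_READ)
            memcpy(segs[i].buf, pos, segs[i].len);  /* disk -> caller */
        else
            memcpy(pos, segs[i].buf, segs[i].len);  /* caller -> disk */
        pos += segs[i].len;     /* next segment continues where this one ended */
    }
    return (long)total;
}
```

As in the driver, the position in the backing buffer advances by the length of each segment, so consecutive segments land in consecutive bytes of the disk.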
3. Makefile
obj-m += block_driver_make_request.o
obj-m += block_driver_request.o
all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
Demo
After compiling the modules, you get block_driver_make_request.ko and block_driver_request.ko. The two kernel modules behave identically.
#/dev/sun_block is created after insmod
my-machine$ sudo insmod block_driver_make_request.ko
#you can use mkfs to make a filesystem
my-machine$ sudo mkfs.ext4 /dev/sun_block
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 32768 1k blocks and 8192 inodes
Filesystem UUID: 5f64e6ee-3ed1-4807-a81f-d00063f0466b
Superblock backups stored on blocks:
8193, 24577
Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done
#you can then mount the ramdisk
my-machine$ sudo mount /dev/sun_block /mnt
my-machine$ ls /mnt
lost+found
References
- Block device drivers, Thomas Petazzoni, Free Electrons
- Professional Linux Kernel Architecture
- Essential Linux Device Drivers
- Internals of Linux Device Driver
- https://linux-kernel-labs.github.io/master/labs/block_device_drivers.html
Revision
- Make a draft version of block device driver - 2019.4.3