The Device Driver Model
Based on Linux 3.13
The sysfs Filesystem
A filesystem that allows user-space applications to access kernel-internal data structures. It is mounted at /sys; the top-level directory structure is as follows:
block
Block devices, independent of the bus they are connected to
devices
All hardware devices recognized by the kernel, organized by the bus that connects them
bus
The buses used to connect devices in the system
dev
Devices registered with the kernel, in two classes, block and char; holds the device numbers (major:minor), which link to the corresponding entries under the devices directory
class
Device classes in the system (sound cards, network cards, and so on); one class may contain devices attached to different buses and therefore handled by different drivers
power
Files that manage the power state of some hardware devices
firmware
Files that manage the firmware of some hardware devices
module
Information about all compiled modules
fs
Some special filesystems, such as cgroup and fuse
hypervisor
Related to virtualization technology such as Xen
kobject
include/linux/kobject.h
The core data structure of the device driver model is an ordinary structure called kobject. Multiple kobjects are grouped into a kset; a kset contains kobjects, and kobjects can in turn be grouped into a higher-level kset. The result is a tree, which maps onto the directory tree under /sys:
kset(kobject)
/ / ... \
kset(kobject) kset(kobject) kset(kobject)
/ \ / \ / \
kobject kobject kobject kobject kobject kobject
struct kobject {
const char *name; // the name of this container
struct list_head entry; // links the kobject into its kset's list
struct kobject *parent; // pointer to the parent kobject
struct kset *kset; // pointer to the kset containing this kobject
struct kobj_type *ktype; // type descriptor
struct sysfs_dirent *sd; // the sysfs_dirent of the corresponding sysfs entry
struct kref kref; // reference counter
#ifdef CONFIG_DEBUG_KOBJECT_RELEASE
struct delayed_work release;
#endif
unsigned int state_initialized:1;
unsigned int state_in_sysfs:1;
unsigned int state_add_uevent_sent:1;
unsigned int state_remove_uevent_sent:1;
unsigned int uevent_suppress:1;
};
struct kobj_type {
void (*release)(struct kobject *kobj); // releases the kobject
const struct sysfs_ops *sysfs_ops; // sysfs operation table: a show and a store method, for reads and writes
struct attribute **default_attrs; // default attributes for the sysfs entry
const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);
const void *(*namespace)(struct kobject *kobj);
};
struct kset {
struct list_head list; // head of the list of kobjects contained in this kset
spinlock_t list_lock; // lock for traversing the kobject list
struct kobject kobj; // kobject embedded in the kset
const struct kset_uevent_ops *uevent_ops; // common methods applied to the kobjects in this set
};
To make a kobject or kset appear in the sysfs tree, it must first be registered. The directory of a kobject always appears inside the directory of its parent kobject; the structure of the sysfs tree therefore reflects the hierarchy among the registered kobjects and among their container objects (ksets).
kset_register() and kset_unregister() register and remove a kset, respectively.
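As a sketch of how registration is typically used — all foo_* names below are invented for illustration, assuming the Linux 3.13 kobject API:

```c
/* Hypothetical example: create /sys/foo_set and /sys/foo_set/foo_obj. */
static struct kset *foo_kset;
static struct kobject foo_obj;

static void foo_release(struct kobject *kobj)
{
	/* called when the last reference to foo_obj is dropped */
}

static struct kobj_type foo_ktype = {
	.release = foo_release,
};

static int __init foo_init(void)
{
	int ret;

	/* creates the kset and registers it directly under /sys */
	foo_kset = kset_create_and_add("foo_set", NULL, NULL);
	if (!foo_kset)
		return -ENOMEM;

	foo_obj.kset = foo_kset;	/* parent defaults to the kset's kobject */
	ret = kobject_init_and_add(&foo_obj, &foo_ktype, NULL, "foo_obj");
	if (ret)
		kobject_put(&foo_obj);	/* drops the reference, triggers release */
	return ret;
}
```

kset_create_and_add() both creates the kset and registers it, so the corresponding directory appears under /sys as soon as the call returns.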
Components of the Device Driver Model
include/linux/device.h
The device driver model is built on a few basic data structures.
/**
* struct device - The basic device structure
* @parent: The device's "parent" device, the device to which it is attached.
* In most cases, a parent device is some sort of bus or host
* controller. If parent is NULL, the device, is a top-level device,
* which is not usually what you want.
* @p: Holds the private data of the driver core portions of the device.
* See the comment of the struct device_private for detail.
* @kobj: A top-level, abstract class from which other classes are derived.
* @init_name: Initial name of the device.
* @type: The type of device.
* This identifies the device type and carries type-specific
* information.
* @mutex: Mutex to synchronize calls to its driver.
* @bus: Type of bus device is on.
* @driver: Which driver has allocated this
* @platform_data: Platform data specific to the device.
* Example: For devices on custom boards, as typical of embedded
* and SOC based hardware, Linux often uses platform_data to point
* to board-specific structures describing devices and how they
* are wired. That can include what ports are available, chip
* variants, which GPIO pins act in what additional roles, and so
* on. This shrinks the "Board Support Packages" (BSPs) and
* minimizes board-specific #ifdefs in drivers.
* @power: For device power management.
* See Documentation/power/devices.txt for details.
* @pm_domain: Provide callbacks that are executed during system suspend,
* hibernation, system resume and during runtime PM transitions
* along with subsystem-level and driver-level callbacks.
* @pins: For device pin management.
* See Documentation/pinctrl.txt for details.
* @numa_node: NUMA node this device is close to.
* @dma_mask: Dma mask (if dma'ble device).
* @coherent_dma_mask: Like dma_mask, but for alloc_coherent mapping as not all
* hardware supports 64-bit addresses for consistent allocations
* such descriptors.
* @dma_parms: A low level driver may set these to teach IOMMU code about
* segment limitations.
* @dma_pools: Dma pools (if dma'ble device).
* @dma_mem: Internal for coherent mem override.
* @cma_area: Contiguous memory area for dma allocations
* @archdata: For arch-specific additions.
* @of_node: Associated device tree node.
* @acpi_node: Associated ACPI device node.
* @devt: For creating the sysfs "dev".
* @id: device instance
* @devres_lock: Spinlock to protect the resource of the device.
* @devres_head: The resources list of the device.
* @knode_class: The node used to add the device to the class list.
* @class: The class of the device.
* @groups: Optional attribute groups.
* @release: Callback to free the device after all references have
* gone away. This should be set by the allocator of the
* device (i.e. the bus driver that discovered the device).
* @iommu_group: IOMMU group the device belongs to.
*
* @offline_disabled: If set, the device is permanently online.
* @offline: Set after successful invocation of bus type's .offline().
*
* At the lowest level, every device in a Linux system is represented by an
* instance of struct device. The device structure contains the information
* that the device model core needs to model the system. Most subsystems,
* however, track additional information about the devices they host. As a
* result, it is rare for devices to be represented by bare device structures;
* instead, that structure, like kobject structures, is usually embedded within
* a higher-level representation of the device.
*/
struct device {
struct device *parent;
struct device_private *p; // private data of the driver core
struct kobject kobj; // embedded kobject
const char *init_name; /* initial name of the device */
const struct device_type *type; // device type: carries type-specific information and common operations
struct mutex mutex; /* mutex to synchronize calls to its driver. */
struct bus_type *bus; /* type of bus device is on */
struct device_driver *driver; /* which driver has allocated this device */
void *platform_data; /* Platform specific data, device
core doesn't touch it */
struct dev_pm_info power;
struct dev_pm_domain *pm_domain;
#ifdef CONFIG_PINCTRL
struct dev_pin_info *pins;
#endif
#ifdef CONFIG_NUMA
int numa_node; /* NUMA node this device is close to */
#endif
u64 *dma_mask; /* dma mask (if dma'able device) */
u64 coherent_dma_mask;/* Like dma_mask, but for
alloc_coherent mappings as
not all hardware supports
64 bit addresses for consistent
allocations such descriptors. */
struct device_dma_parameters *dma_parms;
struct list_head dma_pools; /* dma pools (if dma'ble) */
struct dma_coherent_mem *dma_mem; /* internal for coherent mem override */
#ifdef CONFIG_DMA_CMA
struct cma *cma_area; /* contiguous memory area for dma allocations */
#endif
/* arch specific additions */
struct dev_archdata archdata;
struct device_node *of_node; /* associated device tree node */
struct acpi_dev_node acpi_node; /* associated ACPI device node */
dev_t devt; /* dev_t, creates the sysfs "dev" */
u32 id; /* device instance */
spinlock_t devres_lock;
struct list_head devres_head; // the device's resource list
struct klist_node knode_class;
struct class *class;
const struct attribute_group **groups; /* optional groups */
void (*release)(struct device *dev);
struct iommu_group *iommu_group;
bool offline_disabled:1;
bool offline:1;
};
device_register() inserts a new device object into the device driver model: through its embedded kobject, the device is linked into the global kobject hierarchy, and then into the other subsystems such as bus and class.
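A hedged sketch of this sequence (the foo_* names are invented; device_register() is equivalent to device_initialize() followed by device_add()):

```c
/* Hypothetical example of embedding and registering a struct device. */
struct foo_device {
	struct device dev;
};

static void foo_dev_release(struct device *dev)
{
	/* called once every reference to the device is gone */
	kfree(container_of(dev, struct foo_device, dev));
}

static int foo_add_device(struct device *parent)
{
	struct foo_device *fd = kzalloc(sizeof(*fd), GFP_KERNEL);
	if (!fd)
		return -ENOMEM;

	device_initialize(&fd->dev);
	fd->dev.parent = parent;		/* places the device under its parent in /sys */
	fd->dev.release = foo_dev_release;
	dev_set_name(&fd->dev, "foo0");
	return device_add(&fd->dev);
}
```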
/**
* struct device_driver - The basic device driver structure
* @name: Name of the device driver.
* @bus: The bus which the device of this driver belongs to.
* @owner: The module owner.
* @mod_name: Used for built-in modules.
* @suppress_bind_attrs: Disables bind/unbind via sysfs.
* @of_match_table: The open firmware table.
* @acpi_match_table: The ACPI match table.
* @probe: Called to query the existence of a specific device,
* whether this driver can work with it, and bind the driver
* to a specific device.
* @remove: Called when the device is removed from the system to
* unbind a device from this driver.
* @shutdown: Called at shut-down time to quiesce the device.
* @suspend: Called to put the device to sleep mode. Usually to a
* low power state.
* @resume: Called to bring a device from sleep mode.
* @groups: Default attributes that get created by the driver core
* automatically.
* @pm: Power management operations of the device which matched
* this driver.
* @p: Driver core's private data, no one other than the driver
* core can touch this.
*
* The device driver-model tracks all of the drivers known to the system.
* The main reason for this tracking is to enable the driver core to match
* up drivers with new devices. Once drivers are known objects within the
* system, however, a number of other things become possible. Device drivers
* can export information and configuration variables that are independent
* of any specific device.
*/
struct device_driver {
const char *name;
struct bus_type *bus;
struct module *owner;
const char *mod_name; /* used for built-in modules */
bool suppress_bind_attrs; /* disables bind/unbind via sysfs; in the kernel, bind/unbind is the mechanism by which user space manually binds or unbinds a given device to or from a driver */
const struct of_device_id *of_match_table; // used to match devices via the device tree
const struct acpi_device_id *acpi_match_table; // used to match ACPI devices
int (*probe) (struct device *dev);
int (*remove) (struct device *dev);
void (*shutdown) (struct device *dev);
int (*suspend) (struct device *dev, pm_message_t state);
int (*resume) (struct device *dev);
const struct attribute_group **groups; // default attributes created by the driver core
const struct dev_pm_ops *pm; // power management operations
struct driver_private *p; // private data of the driver core
};
A driver's probe method is invoked when the driver finds a device it may be able to handle; the corresponding function probes the hardware in order to examine the device further.
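For illustration, a minimal platform driver (hypothetical names) shows where probe fits: the driver core invokes foo_probe() when a device matching the driver's name appears on the platform bus.

```c
/* Sketch of a platform driver; all "foo" identifiers are invented. */
static int foo_probe(struct platform_device *pdev)
{
	/* examine the hardware further; return 0 to bind, a negative errno to refuse */
	dev_info(&pdev->dev, "foo device found\n");
	return 0;
}

static int foo_remove(struct platform_device *pdev)
{
	return 0;
}

static struct platform_driver foo_driver = {
	.probe  = foo_probe,
	.remove = foo_remove,
	.driver = {
		.name  = "foo",		/* matched against the device name */
		.owner = THIS_MODULE,
	},
};
module_platform_driver(foo_driver);	/* ends up in driver_register() */
```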
/**
* struct bus_type - The bus type of the device
*
* @name: The name of the bus.
* @dev_name: Used for subsystems to enumerate devices like ("foo%u", dev->id).
* @dev_root: Default device to use as the parent.
* @dev_attrs: Default attributes of the devices on the bus.
* @bus_groups: Default attributes of the bus.
* @dev_groups: Default attributes of the devices on the bus.
* @drv_groups: Default attributes of the device drivers on the bus.
* @match: Called, perhaps multiple times, whenever a new device or driver
* is added for this bus. It should return a nonzero value if the
* given device can be handled by the given driver.
* @uevent: Called when a device is added, removed, or a few other things
* that generate uevents to add the environment variables.
* @probe: Called when a new device or driver add to this bus, and callback
* the specific driver's probe to initial the matched device.
* @remove: Called when a device removed from this bus.
* @shutdown: Called at shut-down time to quiesce the device.
*
* @online: Called to put the device back online (after offlining it).
* @offline: Called to put the device offline for hot-removal. May fail.
*
* @suspend: Called when a device on this bus wants to go to sleep mode.
* @resume: Called to bring a device on this bus out of sleep mode.
* @pm: Power management operations of this bus, callback the specific
* device driver's pm-ops.
* @iommu_ops: IOMMU specific operations for this bus, used to attach IOMMU
* driver implementations to a bus and allow the driver to do
* bus-specific setup
* @p: The private data of the driver core, only the driver core can
* touch this.
* @lock_key: Lock class key for use by the lock validator
*
* A bus is a channel between the processor and one or more devices. For the
* purposes of the device model, all devices are connected via a bus, even if
* it is an internal, virtual, "platform" bus. Buses can plug into each other.
* A USB controller is usually a PCI device, for example. The device model
* represents the actual connections between buses and the devices they control.
* A bus is represented by the bus_type structure. It contains the name, the
* default attributes, the bus' methods, PM operations, and the driver core's
* private data.
*/
struct bus_type {
const char *name;
const char *dev_name;
struct device *dev_root;
struct device_attribute *dev_attrs; /* use dev_groups instead */
const struct attribute_group **bus_groups;
const struct attribute_group **dev_groups;
const struct attribute_group **drv_groups;
int (*match)(struct device *dev, struct device_driver *drv);
int (*uevent)(struct device *dev, struct kobj_uevent_env *env);
int (*probe)(struct device *dev);
int (*remove)(struct device *dev);
void (*shutdown)(struct device *dev);
int (*online)(struct device *dev);
int (*offline)(struct device *dev);
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
const struct dev_pm_ops *pm;
struct iommu_ops *iommu_ops;
struct subsys_private *p;
struct lock_class_key lock_key;
};
bus_type corresponds to the /sys/bus directory, which contains one subdirectory per bus type, such as pci. Each bus subdirectory in turn contains a devices directory listing the devices attached to that bus. Since those devices already appear under /sys/devices, the entries under /sys/bus/pci/devices are symbolic links to the corresponding entries under /sys/devices.
/**
* struct class - device classes
* @name: Name of the class.
* @owner: The module owner.
* @class_attrs: Default attributes of this class.
* @dev_groups: Default attributes of the devices that belong to the class.
* @dev_kobj: The kobject that represents this class and links it into the hierarchy.
* @dev_uevent: Called when a device is added, removed from this class, or a
* few other things that generate uevents to add the environment
* variables.
* @devnode: Callback to provide the devtmpfs.
* @class_release: Called to release this class.
* @dev_release: Called to release the device.
* @suspend: Used to put the device to sleep mode, usually to a low power
* state.
* @resume: Used to bring the device from the sleep mode.
* @ns_type: Callbacks so sysfs can detemine namespaces.
* @namespace: Namespace of the device belongs to this class.
* @pm: The default device power management operations of this class.
* @p: The private data of the driver core, no one other than the
* driver core can touch this.
*
* A class is a higher-level view of a device that abstracts out low-level
* implementation details. Drivers may see a SCSI disk or an ATA disk, but,
* at the class level, they are all simply disks. Classes allow user space
* to work with devices based on what they do, rather than how they are
* connected or how they work.
*/
struct class {
const char *name;
struct module *owner;
struct class_attribute *class_attrs;
const struct attribute_group **dev_groups;
struct kobject *dev_kobj;
int (*dev_uevent)(struct device *dev, struct kobj_uevent_env *env);
char *(*devnode)(struct device *dev, umode_t *mode);
void (*class_release)(struct class *class);
void (*dev_release)(struct device *dev);
int (*suspend)(struct device *dev, pm_message_t state);
int (*resume)(struct device *dev);
const struct kobj_ns_type_operations *ns_type;
const void *(*namespace)(struct device *dev);
const struct dev_pm_ops *pm;
struct subsys_private *p;
};
A class is a high-level abstraction of a device type: SCSI disks and ATA disks, for example, are both simply classified as disks. This abstraction lets user space deal only with the generic properties of a disk, without caring about the underlying connection, seeking, and so on.
Device Files
Unix-like systems are based on the notion of a file: an information container structured as a sequence of bytes. Accordingly, I/O devices can be treated as special files called device files, so the same system calls used to interact with regular files on disk can be used directly with I/O devices.
Usually a device identifier consists of the type of the device file (char or block) and a pair of parameters. The first parameter, the major number, identifies the device type; the second, the minor number, identifies a specific device within the group of devices sharing the same major number. All device files with the same type and major number share the same set of file operations, because they are handled by the same device driver.
The mknod() system call creates a device file; its parameters are the device file name, the device type, and the major and minor numbers.
User-Space Handling of Device Files
The major and minor numbers are packed into a single dev_t value. They are best extracted with the MAJOR and MINOR macros, so that no code has to change if dev_t is ever widened to 64 bits.
Dynamic Allocation of Device Numbers
Because device numbers can collide, they may instead be allocated dynamically. In that case a driver cannot permanently claim a fixed device number, so a standard way is needed to export the numbers each driver uses to user-space applications: usually the dev attribute in the relevant /sys/class subdirectory.
Dynamic Creation of Device Files
The Linux kernel can create device files dynamically, so /dev does not need to be pre-populated with a device file for every conceivable piece of hardware; device files are created on demand. Thanks to the device driver model, Linux 2.6 introduced a toolset called udev, a general-purpose kernel device manager. It runs as a daemon and listens for the uevents the kernel emits (over a netlink socket) when a new device is initialized or a device is removed from the system, and then performs the appropriate actions.
VFS Handling of Device Files
When the VFS finds that the inode being accessed refers to a device file, it initializes the inode's i_rdev field with the device file's major and minor numbers and sets the inode's i_fop field to the address of the def_blk_fops or def_chr_fops file operation table. This hides the difference between device files and regular files: the operations eventually invoked are the device-specific ones.
Device Drivers
- Registering a device driver
- Allocate a device_driver descriptor
- Call driver_register() to insert it into the data structures of the device driver model
- Initializing a device driver
- Allocate resources
Character Device Drivers
A character device driver is described by a cdev structure.
struct cdev {
struct kobject kobj;
struct module *owner;
const struct file_operations *ops;
struct list_head list; // head of the list of inodes of the character device files; several device files may share the same device number and refer to the same character device
dev_t dev;
unsigned int count;
};
void cdev_init(struct cdev *, const struct file_operations *);
struct cdev *cdev_alloc(void);
void cdev_put(struct cdev *p);
int cdev_add(struct cdev *, dev_t, unsigned);
void cdev_del(struct cdev *);
void cd_forget(struct inode *);
cdev_alloc() dynamically allocates a cdev descriptor and initializes its embedded kobject; the descriptor is released automatically when its reference count drops to zero.
cdev_add() registers a cdev descriptor in the device driver model: it initializes the dev and count fields of the cdev, then calls kobj_map(), which sets up the device driver model's data structures mapping the device number range onto the driver's descriptor.
Allocating Device Numbers
- register_chrdev_region() and alloc_chrdev_region() allocate a range of device numbers for a driver. Neither calls cdev_add(), so the driver must still call cdev_add() afterwards. The latter allocates the major number dynamically; the former checks whether the requested range spans more than one major number and, if so, determines the majors involved and the sub-ranges covering the whole interval, then allocates each sub-range.
- register_chrdev() allocates a fixed range of device numbers and calls cdev_add() internally, so the driver does not have to call it again.
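Putting these calls together, the init path of a char driver typically looks like the following sketch (foo_* names invented; error handling abbreviated):

```c
/* Sketch of char device registration; all "foo" identifiers are invented. */
static dev_t foo_devt;
static struct cdev foo_cdev;

static const struct file_operations foo_fops = {
	.owner = THIS_MODULE,
	/* .open, .read, .write, ... */
};

static int __init foo_init(void)
{
	int ret;

	/* dynamically allocate one major with 4 minors, starting at minor 0 */
	ret = alloc_chrdev_region(&foo_devt, 0, 4, "foo");
	if (ret)
		return ret;

	cdev_init(&foo_cdev, &foo_fops);
	foo_cdev.owner = THIS_MODULE;

	ret = cdev_add(&foo_cdev, foo_devt, 4);
	if (ret)
		unregister_chrdev_region(foo_devt, 4);
	return ret;
}
```

Because cdev_add() makes the device live immediately, it should come last, after the file_operations and any internal state are ready.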
Block Device Drivers
Handling of Block Devices
Name | Meaning | Size |
---|---|---|
Sector | The smallest unit of a disk transfer | Usually 512 bytes; larger sizes also exist |
Block | The basic unit of data transfer for the VFS and filesystems | A power of two, no larger than a page frame, and a multiple of the sector size |
Segment | Used for scatter-gather DMA transfers: a segment is a memory page, or a portion of one, containing data from adjacent disk sectors | A multiple of the sector size, no larger than a page |
Page | The unit into which memory management divides memory | Usually 4096 bytes |
The Generic Block Layer
include/linux/blk_types.h
/*
* main unit of I/O for the block layer and lower layers (ie drivers and
* stacking drivers)
*/
struct bio {
sector_t bi_sector; /* device address in 512 byte
sectors */
struct bio *bi_next; /* request queue link */
struct block_device *bi_bdev;
unsigned long bi_flags; /* status, command, etc */
unsigned long bi_rw; /* bottom bits READ/WRITE,
* top bits priority
*/
unsigned short bi_vcnt; /* how many bio_vec's */
unsigned short bi_idx; /* current index into bvl_vec */
/* Number of segments in this BIO after
* physical address coalescing is performed.
*/
unsigned int bi_phys_segments;
unsigned int bi_size; /* residual I/O count */
/*
* To keep track of the max segment size, we account for the
* sizes of the first and last mergeable segments in this bio.
*/
unsigned int bi_seg_front_size;
unsigned int bi_seg_back_size;
bio_end_io_t *bi_end_io;
void *bi_private;
#ifdef CONFIG_BLK_CGROUP
/*
* Optional ioc and css associated with this bio. Put on bio
* release. Read comment on top of bio_associate_current().
*/
struct io_context *bi_ioc;
struct cgroup_subsys_state *bi_css;
#endif
#if defined(CONFIG_BLK_DEV_INTEGRITY)
struct bio_integrity_payload *bi_integrity; /* data integrity */
#endif
/*
* Everything starting with bi_max_vecs will be preserved by bio_reset()
*/
unsigned int bi_max_vecs; /* max bvl_vecs we can hold */
atomic_t bi_cnt; /* pin count */
struct bio_vec *bi_io_vec; /* the actual vec list */
struct bio_set *bi_pool;
/*
* We can inline a number of vecs at the end of the bio, to avoid
* double allocations for a small number of bio_vecs. This member
* MUST obviously be kept at the very end of the bio.
*/
struct bio_vec bi_inline_vecs[0];
};
/*
* was unsigned short, but we might as well be ready for > 64kB I/O pages
*/
struct bio_vec {
struct page *bv_page; // descriptor of the page holding the segment
unsigned int bv_len; // length of the segment in bytes
unsigned int bv_offset; // offset of the segment within the page
};
bi_io_vec points to an array of bio_vec structures holding the segments of this bio; bi_vcnt is the number of segments in bi_io_vec, and bi_idx is the index of the segment currently being served.
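With this pre-3.14 layout, a driver walks the segments of a bio with the bio_for_each_segment() macro. A sketch against the 3.13 API (foo_count_bytes is an invented name):

```c
/* Sketch: walk every segment of a bio (Linux 3.13 API). */
static void foo_count_bytes(struct bio *bio)
{
	struct bio_vec *bvec;
	unsigned int bytes = 0;
	int i;

	/* iterates from bio->bi_idx to bio->bi_vcnt - 1 */
	bio_for_each_segment(bvec, bio, i)
		bytes += bvec->bv_len;	/* each segment: page + offset + length */

	pr_info("bio covers %u bytes starting at sector %llu\n",
		bytes, (unsigned long long)bio->bi_sector);
}
```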
include/linux/genhd.h
struct gendisk {
/* major, first_minor and minors are input parameters only,
* don't use directly. Use disk_devt() and disk_max_parts().
*/
int major; /* major number of driver */
int first_minor;
int minors; /* maximum number of minors, =1 for
* disks that can't be partitioned. */
char disk_name[DISK_NAME_LEN]; /* name of major driver */
char *(*devnode)(struct gendisk *gd, umode_t *mode);
unsigned int events; /* supported events */
unsigned int async_events; /* async events, subset of all */
/* Array of pointers to partitions indexed by partno.
* Protected with matching bdev lock but stat and other
* non-critical accesses use RCU. Always access through
* helpers.
*/
struct disk_part_tbl __rcu *part_tbl; // partition table: records all logical partitions of this disk
struct hd_struct part0; // partition 0, representing the whole disk
const struct block_device_operations *fops;
struct request_queue *queue;
void *private_data;
int flags;
struct device *driverfs_dev; // FIXME: remove
struct kobject *slave_dir;
struct timer_rand_state *random;
atomic_t sync_io; /* RAID */
struct disk_events *ev;
#ifdef CONFIG_BLK_DEV_INTEGRITY
struct blk_integrity *integrity;
#endif
int node_id;
};
The flags field describes the state of the gendisk: GENHD_FL_UP means it has been initialized and is operational; GENHD_FL_REMOVABLE means the medium is removable, as with floppy disks and CD-ROMs. The fops field points to a block_device_operations table containing the disk's generic operations; the most important are open, release, and ioctl.
include/linux/blkdev.h
struct block_device_operations {
int (*open) (struct block_device *, fmode_t); // opens the device file
void (*release) (struct gendisk *, fmode_t); // releases the device file
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); // services an ioctl call (historically invoked with the big kernel lock held)
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); // services an ioctl call without taking the big kernel lock
int (*direct_access) (struct block_device *, sector_t,
void **, unsigned long *);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
int (*media_changed) (struct gendisk *);
void (*unlock_native_capacity) (struct gendisk *);
int (*revalidate_disk) (struct gendisk *);
int (*getgeo)(struct block_device *, struct hd_geometry *);
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
struct module *owner;
};
A disk may have several logical partitions, each represented by an hd_struct data structure:
include/linux/genhd.h
struct hd_struct {
sector_t start_sect; // first sector of the partition
/*
* nr_sects is protected by sequence counter. One might extend a
* partition while IO is happening to it and update of nr_sects
* can be non-atomic on 32bit machines with 64bit sector_t.
*/
sector_t nr_sects; // number of sectors in the partition
seqcount_t nr_sects_seq;
sector_t alignment_offset;
unsigned int discard_alignment;
struct device __dev;
struct kobject *holder_dir;
int policy, partno;
struct partition_meta_info *info;
#ifdef CONFIG_FAIL_MAKE_REQUEST
int make_it_fail;
#endif
unsigned long stamp;
atomic_t in_flight[2];
#ifdef CONFIG_SMP
struct disk_stats __percpu *dkstats;
#else
struct disk_stats dkstats;
#endif
atomic_t ref;
struct rcu_head rcu_head;
};
When the kernel detects a new disk, it calls alloc_disk() to allocate and initialize a gendisk object; if the disk is divided into partitions, an hd_struct is allocated for each partition. Finally it calls add_disk() to insert the gendisk into the data structures of the generic block layer.
The steps for submitting an I/O request are as follows.
1. Call bio_alloc() to allocate a bio data structure and initialize its fields.
2. Call generic_make_request():
- generic_make_request_checks() verifies that bio->bi_sector does not exceed the number of sectors of the block device:
1. If it does, it sets BIO_EOF in bio->bi_flags, prints a kernel error message, calls bio_endio(), and terminates.
2. Otherwise, it calls blk_partition_remap(): if the device is a disk partition, the request is remapped to sectors of the whole disk and bio->bi_bdev is pointed at the disk's block device descriptor. From this point on, the I/O scheduler and the disk driver operate only on whole disks; the notion of a partition disappears.
- It then obtains the disk's request_queue and calls its make_request_fn to insert the bio into the request queue.
Every device has an associated request queue, described in Linux by the following data structure.
include/linux/blkdev.h
struct request_queue {
/*
* Together with queue_head for cacheline sharing
*/
struct list_head queue_head; // head of the list of pending requests
struct request *last_merge;
struct elevator_queue *elevator; // the I/O scheduler instance
int nr_rqs[2]; /* # allocated [a]sync rqs */
int nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */
/*
* If blkcg is not used, @q->root_rl serves all requests. If blkcg
* is used, root blkg allocates from @q->root_rl and all other
* blkgs from their own blkg->rl. Which one to use should be
* determined using bio_request_list().
*/
struct request_list root_rl;
request_fn_proc *request_fn; // entry point of the driver's strategy routine
make_request_fn *make_request_fn; // invoked when a new request is to be inserted into the queue
prep_rq_fn *prep_rq_fn;
unprep_rq_fn *unprep_rq_fn;
merge_bvec_fn *merge_bvec_fn;
softirq_done_fn *softirq_done_fn;
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
struct blk_mq_ops *mq_ops;
unsigned int *mq_map;
/* sw queues */
struct blk_mq_ctx *queue_ctx;
unsigned int nr_queues;
/* hw dispatch queues */
struct blk_mq_hw_ctx **queue_hw_ctx;
unsigned int nr_hw_queues;
/*
* Dispatch queue sorting
*/
sector_t end_sector;
struct request *boundary_rq;
/*
* Delayed queue handling
*/
struct delayed_work delay_work;
struct backing_dev_info backing_dev_info;
/*
* The queue owner gets to use this for whatever they like.
* ll_rw_blk doesn't touch it.
*/
void *queuedata;
/*
* various queue flags, see QUEUE_* below
*/
unsigned long queue_flags;
/*
* ida allocated id for this queue. Used to index queues from
* ioctx.
*/
int id;
/*
* queue needs bounce pages for pages above this limit
*/
gfp_t bounce_gfp;
/*
* protects queue structures from reentrancy. ->__queue_lock should
* _never_ be used directly, it is queue private. always use
* ->queue_lock.
*/
spinlock_t __queue_lock;
spinlock_t *queue_lock;
/*
* queue kobject
*/
struct kobject kobj;
/*
* mq queue kobject
*/
struct kobject mq_kobj;
#ifdef CONFIG_PM_RUNTIME
struct device *dev;
int rpm_status;
unsigned int nr_pending;
#endif
/*
* queue settings
*/
unsigned long nr_requests; /* Max # of requests */
unsigned int nr_congestion_on;
unsigned int nr_congestion_off;
unsigned int nr_batching;
unsigned int dma_drain_size;
void *dma_drain_buffer;
unsigned int dma_pad_mask;
unsigned int dma_alignment;
struct blk_queue_tag *queue_tags;
struct list_head tag_busy_list;
unsigned int nr_sorted;
unsigned int in_flight[2];
/*
* Number of active block driver functions for which blk_drain_queue()
* must wait. Must be incremented around functions that unlock the
* queue_lock internally, e.g. scsi_request_fn().
*/
unsigned int request_fn_active;
unsigned int rq_timeout;
struct timer_list timeout;
struct list_head timeout_list;
struct list_head icq_list;
#ifdef CONFIG_BLK_CGROUP
DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS);
struct blkcg_gq *root_blkg;
struct list_head blkg_list;
#endif
struct queue_limits limits;
/*
* sg stuff
*/
unsigned int sg_timeout;
unsigned int sg_reserved_size;
int node;
#ifdef CONFIG_BLK_DEV_IO_TRACE
struct blk_trace *blk_trace;
#endif
/*
* for flush operations
*/
unsigned int flush_flags;
unsigned int flush_not_queueable:1;
unsigned int flush_queue_delayed:1;
unsigned int flush_pending_idx:1;
unsigned int flush_running_idx:1;
unsigned long flush_pending_since;
struct list_head flush_queue[2];
struct list_head flush_data_in_flight;
union {
struct request flush_rq;
struct {
spinlock_t mq_flush_lock;
struct work_struct mq_flush_work;
};
};
struct mutex sysfs_lock;
int bypass_depth;
#if defined(CONFIG_BLK_DEV_BSG)
bsg_job_fn *bsg_job_fn;
int bsg_job_size;
struct bsg_class_device bsg_dev;
#endif
#ifdef CONFIG_BLK_DEV_THROTTLING
/* Throttle data */
struct throtl_data *td;
#endif
struct rcu_head rcu_head;
wait_queue_head_t mq_freeze_wq;
struct percpu_counter mq_usage_counter;
struct list_head all_q_node;
};
backing_dev_info holds information about I/O to the underlying hardware block device, such as read-ahead settings and request-queue congestion state.
Each I/O request is described by the following data structure:
include/linux/blkdev.h
/*
* try to put the fields that are referenced together in the same cacheline.
* if you modify this structure, be sure to check block/blk-core.c:blk_rq_init()
* as well!
*/
struct request {
union {
struct list_head queuelist; // links the request into the request_queue
struct llist_node ll_list;
};
union {
struct call_single_data csd;
struct work_struct mq_flush_data;
};
struct request_queue *q;
struct blk_mq_ctx *mq_ctx;
u64 cmd_flags;
enum rq_cmd_type_bits cmd_type;
unsigned long atomic_flags;
int cpu;
/* the following two fields are internal, NEVER access directly */
unsigned int __data_len; /* total data len */
sector_t __sector; /* sector cursor */
struct bio *bio;
struct bio *biotail;
struct hlist_node hash; /* merge hash */
/*
* The rb_node is only used inside the io scheduler, requests
* are pruned when moved to the dispatch queue. So let the
* completion_data share space with the rb_node.
*/
union {
struct rb_node rb_node; /* sort/lookup */
void *completion_data;
};
/*
* Three pointers are available for the IO schedulers, if they need
* more they have to dynamically allocate it. Flush requests are
* never put on the IO scheduler. So let the flush fields share
* space with the elevator data.
*/
union {
struct {
struct io_cq *icq;
void *priv[2];
} elv;
struct {
unsigned int seq;
struct list_head list;
rq_end_io_fn *saved_end_io;
} flush;
};
struct gendisk *rq_disk;
struct hd_struct *part;
unsigned long start_time;
#ifdef CONFIG_BLK_CGROUP
struct request_list *rl; /* rl this rq is alloced from */
unsigned long long start_time_ns;
unsigned long long io_start_time_ns; /* when passed to hardware */
#endif
/* Number of scatter-gather DMA addr+len pairs after
* physical address coalescing is performed.
*/
unsigned short nr_phys_segments;
#if defined(CONFIG_BLK_DEV_INTEGRITY)
unsigned short nr_integrity_segments;
#endif
unsigned short ioprio;
void *special; /* opaque pointer available for LLD use */
char *buffer; /* kaddr of the current segment if available */
int tag;
int errors;
/*
* when request is used as a packet command carrier
*/
unsigned char __cmd[BLK_MAX_CDB];
unsigned char *cmd;
unsigned short cmd_len;
unsigned int extra_len; /* length of alignment and padding */
unsigned int sense_len;
unsigned int resid_len; /* residual count */
void *sense;
unsigned long deadline;
struct list_head timeout_list;
unsigned int timeout;
int retries;
/*
* completion callback.
*/
rq_end_io_fn *end_io;
void *end_io_data;
/* for bidi */
struct request *next_rq;
};
A request contains one or more bios. Initially the generic block layer creates a request holding a single bio; later the I/O scheduler may extend the request, either by merging segments into that bio or by linking further bios into the request. The bio and biotail fields point to the first and last bio in the request, and __sector holds the sector currently being transferred.
When memory is tight, the allocation of request descriptors can become a bottleneck for a process. To cope with this, each request_queue contains a request_list data structure:
include/linux/blkdev.h
struct request_list {
struct request_queue *q; /* the queue this rl belongs to */
#ifdef CONFIG_BLK_CGROUP
struct blkcg_gq *blkg; /* blkg this request pool belongs to */
#endif
/*
* count[], starved[], and wait[] are indexed by
* BLK_RW_SYNC/BLK_RW_ASYNC
*/
int count[2];
int starved[2];
mempool_t *rq_pool;
wait_queue_head_t wait[2];
unsigned int flags;
};
rq_pool points to a pool of request descriptors; the two counters count the READ and WRITE requests allocated; the two wait queues hold processes sleeping until a READ or WRITE request becomes available; flags records whether the most recent READ or WRITE allocation succeeded.
blk_get_request() tries to allocate a request for the process; if it fails, the process is put on the wait queue. If the allocation succeeds, the request's rl field is set to the address of the request_list. blk_put_request() releases a request.
To avoid request queue congestion, each request queue has a maximum number of pending requests: the nr_requests field of request_queue records the maximum pending requests per transfer direction. By default a request_queue holds at most 128 READ and 128 WRITE requests. Once the number of pending requests reaches 113, the kernel considers the queue congested and throttles the creation of new requests; the congested state is cleared when the count drops below 111.
Delaying the activation of the block device improves transfer efficiency. The technique is called plugging and unplugging: as long as the device is plugged, the driver is not activated, regardless of whether requests sit in the request_queue. blk_delay_work() plugs the block device: it sets QUEUE_FLAG_PLUGGED in queue_flags and restarts the timer embedded in the queue; when the timer expires, the queue's request_fn is invoked again to process the pending requests.
The I/O Scheduler
The I/O scheduler decides how newly arrived requests are inserted into the request_queue. The kernel provides three scheduling algorithms by default: noop, deadline, and cfq.
Issuing a Request to the I/O Scheduler
generic_make_request() invokes make_request_fn to hand the request to the I/O scheduler. A block device normally sets the request_queue's make_request_fn via blk_queue_make_request(); when a request_queue is initialized for a block device with blk_init_allocated_queue(), the call chain is blk_init_allocated_queue -> blk_queue_make_request -> blk_queue_bio. blk_queue_bio() calls blk_queue_bounce() to obtain a bounce bio when one is needed.
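As a usage illustration (sda is a stand-in device name; the scheduler currently in use is shown in brackets), the scheduler of a queue can be inspected and switched through sysfs:

```
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler
# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
```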
Block Devices
A block device is represented by the block_device data structure.
include/linux/fs.h
struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
int bd_openers; // number of times this block device has been opened
struct inode * bd_inode; /* will die */ // points to the inode of the associated device file
struct super_block * bd_super; // superblock of the block device
struct mutex bd_mutex; /* open/close mutex */
struct list_head bd_inodes; // head of the list of inodes of opened device files for this block device
void * bd_claiming;
void * bd_holder;
int bd_holders;
bool bd_write_holder;
#ifdef CONFIG_SYSFS
struct list_head bd_holder_disks;
#endif
struct block_device * bd_contains; // for a partition, points to the block device descriptor of the whole disk; for a whole-disk descriptor, points to itself
unsigned bd_block_size;
struct hd_struct * bd_part; // points to the partition descriptor
/* number of times partitions within this device have been opened. */
unsigned bd_part_count;
int bd_invalidated;
struct gendisk * bd_disk;
struct request_queue * bd_queue;
struct list_head bd_list; // links into the global list of block device descriptors
/*
* Private data. You must have bd_claim'ed the block_device
* to use this. NOTE: bd_claim allows an owner to claim
* the same device multiple times, the owner must take special
* care to not mess up bd_private for that case.
*/
unsigned long bd_private;
/* The counter of freeze processes */
int bd_fsfreeze_count;
/* Mutex for freeze */
struct mutex bd_fsfreeze_mutex;
};
所有的块设备描述符都被插入到一个全局的连接表中,头为all_bdevs这个变量。
bd_holder字段存的是块设备的占有者。占有者不是驱动,是独占使用该块设备的内核组件。通常块设备的占有者是其在文件系统的挂载点。使用bd_link_disk_holder来给块设备设置holder。
一个block_device会和多个设备文件联系,这些文件共用块设备的主次设备号。所以只是查看内存中单个设备文件的inode是否存在不能决定块设备是否已经没有被使用了。bdev这个特殊的文件系统管理块设备文件inode和块设备主次设备号之间的关系。block_device描述符的bd_inode字段指向设备文件的inode,设备文件的inode记录块设备的主次设备号,并由block_device的描述符指针。
When the kernel needs to access a block device, it walks the filesystem structures to find the block_device descriptor; if its bd_openers count is greater than zero, the device is already open and can be used for I/O.
Registering and initializing a block device driver
- Define a custom driver descriptor
The descriptor usually records the I/O port information, the IRQ line of the device, the internal device state, and so on. The lock field protects the descriptor fields against concurrent access; gd represents the whole disk handled by this driver.
struct foo_dev_t {
[...]
spinlock_t lock;
struct gendisk *gd;
[...]
} foo;
- Reserve a major number
Call register_blkdev() to allocate a major number.
- Initialize the custom driver descriptor
spin_lock_init(&foo.lock);
foo.gd = alloc_disk(16);
if (!foo.gd) goto error_no_gendisk;
- Initialize the gendisk descriptor
foo.gd->private_data = &foo;
foo.gd->major = FOO_MAJOR;
foo.gd->first_minor = 0;
foo.gd->minors = 16;
set_capacity(foo.gd, foo_disk_capacity_in_sectors);
strcpy(foo.gd->disk_name, "foo");
foo.gd->fops = &foo_ops;
- Initialize the table of block device methods
Set up the corresponding block device operations, such as open, ioctl, and so on.
- Allocate and initialize the request_queue
foo.gd->rq = blk_init_queue(foo_strategy, &foo.lock);
if (!foo.gd->rq) goto error_no_request_queue;
blk_queue_hardsect_size(foo.gd->rq, foo_hard_sector_size);
blk_queue_max_sectors(foo.gd->rq, foo_max_sectors);
blk_queue_max_hw_segments(foo.gd->rq, foo_max_hw_segments);
blk_queue_max_phys_segments(foo.gd->rq, foo_max_phys_segments);
(This fragment uses the old 2.6 helper names; in 3.13 the gendisk field is called queue, and the equivalents are blk_queue_logical_block_size(), blk_queue_max_hw_sectors(), and blk_queue_max_segments().)
- Install the interrupt handler
request_irq(foo_irq, foo_interrupt, IRQF_SHARED, "foo", &foo);
(The old SA_INTERRUPT/SA_SHIRQ flags no longer exist; in 3.13 a shared interrupt line uses IRQF_SHARED, and sharing requires a non-NULL dev_id, here the driver descriptor.)
- Register the disk
add_disk(foo.gd);
add_disk() performs these steps:
1. Set gd->flags to GENHD_FL_UP
2. Establish the link between the driver and its major number
3. Register the bdi (backing_dev_info)
4. Register the minor numbers
5. Register the disk
6. Register the request_queue
7. Register the kobjects under /sys and create the symlinks
8. Emit the uevent announcing the new disk
The strategy routine
The strategy routine is a function (or a group of functions) of the block device driver that interacts with the hardware to carry out the requests collected in the request_queue.
Usually the strategy routine is started when a new request arrives in an empty request_queue; once activated, it services all requests in the queue until the queue is empty. An implementation in which the strategy routine suspends itself after issuing each data transfer and waits for it to complete would be inefficient, and could not exploit modern disk controllers that can process several I/O transfers at a time.
Many device drivers therefore adopt the following scheme:
- The strategy routine handles the first request in the queue and programs the disk controller to raise an interrupt when the data transfer completes; then the strategy routine terminates.
- When the disk controller raises the interrupt, the interrupt handler invokes the strategy routine again; the routine either starts another data transfer for the current request or, if the request is fully serviced, removes it from the request_queue and begins servicing the next one.
The interrupt handler
The interrupt handler is activated when a DMA transfer terminates. It checks whether all data blocks of the current request have been transferred. If so, it invokes the strategy routine to process the next request in the queue; otherwise, it updates the relevant fields of the request descriptor and invokes the strategy routine to transfer the remaining data blocks. A typical handler looks like this (note that since 2.6.19 interrupt handlers no longer receive a pt_regs argument):
irqreturn_t foo_interrupt(int irq, void *dev_id)
{
    struct foo_dev_t *p = (struct foo_dev_t *) dev_id;
    struct request *req = p->current_req; /* request being served, tracked by the driver */
    [...]
    if (!end_that_request_first(req, uptodate, nr_sectors)) {
        blkdev_dequeue_request(req);
        end_that_request_last(req);
    }
    p->gd->rq->request_fn(p->gd->rq);
    [...]
    return IRQ_HANDLED;
}
end_that_request_first() performs the completion check described above; end_that_request_last() updates the disk usage statistics, wakes up any process sleeping on the request, and frees the request descriptor (the request itself is removed from the request_queue by blkdev_dequeue_request()). These helpers come from older kernels; in 3.13 their role is played by blk_end_request() and __blk_end_request().
Opening a block device file
A block device file is opened when the kernel mounts a disk or a partition, when it activates a swap partition, and when a user-mode process issues an open() system call on the device file. In every case the VFS does essentially the same thing: it looks up the block device descriptor and sets up the file operation methods for the coming data transfers. The methods for block device files are collected in the def_blk_fops table.
const struct file_operations def_blk_fops = {
.open = blkdev_open,
.release = blkdev_close,
.llseek = block_llseek,
.read = do_sync_read,
.write = do_sync_write,
.aio_read = blkdev_aio_read,
.aio_write = blkdev_aio_write,
.mmap = generic_file_mmap,
.fsync = blkdev_fsync,
.unlocked_ioctl = block_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = compat_blkdev_ioctl,
#endif
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
};
Let us take the open method as an example of the whole path.
blkdev_open() is triggered from dentry_open() and performs the following main steps:
- Invoke bd_acquire() to obtain the address of the block device descriptor bdev. The function receives the inode object and performs these main steps:
  - Check whether inode->i_bdev is non-NULL. If so, the block device file has already been opened and this field holds the address of the descriptor: increase the reference count of the inode->i_bdev->bd_inode inode of the bdev special filesystem associated with the block device, and return the address of inode->i_bdev.
  - Otherwise the block device file has not been opened yet: from the major and minor numbers of the device file, invoke bdget() to get the address of the block device descriptor, allocating a new one if none exists. (The descriptor may already exist, since other device files may refer to the same device.)
  - Store the descriptor address in inode->i_bdev.
  - Set inode->i_mapping to the corresponding field of the bdev inode.
  - Insert the inode into the descriptor's list of opened inodes, headed by bdev->bd_inodes.
  - Return the address of bdev.
- Set filp->f_mapping to bdev->bd_inode->i_mapping.
- Invoke blkdev_get():
  - If exclusive access was requested, call bd_start_claiming() to mark the device as being claimed, taking the lock and raising its reference count.
  - Invoke __blkdev_get():
    - Obtain the gendisk object.
    - Check bdev->bd_openers: if it is non-zero, the device is already open. If bdev->bd_contains points to bdev itself, this is a whole disk, so call bdev->bd_disk->fops->open(); otherwise it is a partition, so increase bdev->bd_part_count and skip ahead to the final bookkeeping step (incrementing bd_openers).
    - If bd_openers is zero, this is the first open: set bdev->bd_disk to the gendisk descriptor and proceed as follows:
      - If part is zero (a whole disk):
        - If disk->fops->open is defined, invoke it.
        - Call bd_set_size() to set the inode fields describing the size of the disk and of its sectors.
        - If bdev->bd_invalidated is set, call rescan_partitions() to scan the partition table and update the partition descriptors.
      - Otherwise (a partition):
        - Obtain the gendisk of the whole disk.
        - Invoke __blkdev_get() again for the whole disk, repeating the steps above, and increase bd_part_count to record the new open of the partition.
        - Set bdev->bd_contains to the descriptor of the whole disk.
        - Set bdev->bd_part to the value of disk->part[partno - 1].
        - Set the inode fields describing the size of the partition and of its sectors.
    - In all cases, increase the bdev->bd_openers counter.
  - If the device was claimed for exclusive use, complete the claim: record the holder, set the bd_claiming marker back to NULL, drop the extra reference, and release the lock, so that other processes may open the device again.
- Return successfully.
References:
- Understanding the Linux Kernel (《深入理解Linux内核》, translated by 陈莉君 and 张琼声)
- http://lxr.free-electrons.com/
- https://www.ibm.com/developerworks/cn/linux/l-cn-udev/
- http://oliveryang.net/2016/04/linux-block-driver-basic-1/
- http://blog.csdn.net/walkingman321/article/details/5917737
[…]
Many other generously shared blog posts helped as well, though I did not record them all. The spirit of sharing on the internet is wonderful; many thanks to everyone who contributed. My own understanding is limited, so these notes surely contain mistakes and omissions; corrections and criticism are welcome!