linux deadline I/O调度算法分析笔记

最新推荐文章于 2023-12-19 18:52:31 发布

iteye_4515

最新推荐文章于 2023-12-19 18:52:31 发布

阅读量264

点赞数

文章标签：数据结构与算法

deadline算法的核心就是在传统的电梯算法中加入了请求超时的机制，该机制主要体现在两点：(1)请求超时时，对超时请求的选择。(2)没有请求超时时，当扫描完电梯最后一个request后，准备返回时，对第一个request的选择。基于以上两点，平衡了系统i/o吞吐量和响应时间。此外，该算法还考虑到了读操作对写操作造成的饥饿。
算法核心数据结构：
struct deadline_data {
struct rb_root sort_list[2];
struct list_head fifo_list[2];

/*
* next in sort order. read, write or both are NULL
*/
struct request *next_rq[2];
unsigned int batching;/* number of sequential requests made */
sector_t last_sector;/* head position */
unsigned int starved;/* times reads have starved writes */

/*
* settings that change how the i/o scheduler behaves
*/
int fifo_expire[2];
int fifo_batch;
int writes_starved;
int front_merges;
};
sort_list：按照request中的sector号大小，把每个request组织在以sort_list为根的红黑树中。这样方便快速查找。总共有读写两棵树。
fifo_list：按照超时先后顺寻，把request链入filo_list，同样也是分为读写两个队列。
因此，任何一个request，在未提交给设备的请求队列之前，都会同时存在于以上两个结构中。
next_rq:指向sort_list中的下一个请求。
batching：调度算法可能连续提交多个请求，batching用于记录当前连续提交的request数目，只要batching < fifo_batch，都可以继续进行连续提交。
starved:提交读request而造成写饥饿的次数。如果starved超过writes_starved，则需要提交写request，从而避免写饥饿。

deadline I/O调度器中最重要的两个方法是：deadline_add_request和deadline_dispatch_requests。它们分别向调度器队列（非请求队列）添加request和把调度器队列中的请求分发到块设备的请求队列。

static void
deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int data_dir = rq_data_dir(rq);

deadline_add_rq_rb(dd, rq);

/*
* set expire time (only used for reads) and add to fifo list
*/
rq_set_fifo_time(rq, jiffies + dd->fifo_expire[data_dir]);
list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
}
deadline_add_request把请求分别加入红黑树和fifo。并记录请求的超时时间。这儿借用了request的donelist的next字段。因为在deadline调度队列的请求，绝不可能被调入请求完成队列：
#define rq_set_fifo_time(rq,exp)((rq)->donelist.next = (void *) (exp))

static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
const int reads = !list_empty(&dd->fifo_list[READ]);
const int writes = !list_empty(&dd->fifo_list[WRITE]);
struct request *rq;
int data_dir;

/*
* batches are currently reads XOR writes
*/
if (dd->next_rq[WRITE])
rq = dd->next_rq[WRITE];
else
rq = dd->next_rq[READ];

以上红色标记的部分感觉有一点问题，下了一个2.6.32.16版本的内核看了一下，里面的代码如下：

if (rq && dd->batching < dd->fifo_batch)
/* we have a next request are still entitled to batch */
goto dispatch_request;

/*
* at this point we are not running a batch. select the appropriate
* data direction (read / write)
*/

if (reads) {
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));

if (writes && (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;

data_dir = READ;

goto dispatch_find_request;
}

/*
* there are either no reads or writes have been starved
*/

if (writes) {
dispatch_writes:
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));

dd->starved = 0;

data_dir = WRITE;

goto dispatch_find_request;
}

return 0;

dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
rq = dd->next_rq[data_dir];
}

dd->batching = 0;

dispatch_request:
/*
* rq is the selected appropriate request.
*/
dd->batching++;
deadline_move_request(dd, rq);

return 1;
}

deadline_dispatch_requests是deadline算法的核心。主要分为三个部分：（1）确定是要分发读还是写request。影响处理分发类型因素主要包括是否处于batching阶段以及是否发生写饥饿。（2）确定方向以后，根据读写方向找到该方向上下一个请求进行分发。影响定位下一个请求因素主要包括请求是否超时。（3）找到合适的request后，调用deadline_move_request分发给块设备的请求队列。
1.确定读写方向.
a.首先，根据是否处于batching来确定当前处理读写的方向。因为如果处在batching过程中，就意味着调度程序需要连续处理同一方向的请求。这样可以有效增加系统吞吐量。因此，根据batching的方向，可以确定当前处理请求的方向。而读写batching是互斥的：
/*
* batches are currently reads XOR writes
*/
if (dd->next_rq[WRITE])
rq = dd->next_rq[WRITE];
else
rq = dd->next_rq[READ];

通过这个判断，保证了batching只会在某一方向进行，而不会交错。因为在deadline_move_request中有：
dd->next_rq[READ] = NULL;
dd->next_rq[WRITE] = NULL;
dd->next_rq[data_dir] = deadline_latter_request(rq);
也就是在batching方向的next_rq才可能指向下一个request。

if (rq) {
/* we have a "next request" */

if (dd->last_sector != rq->sector)
/* end the batch on a non sequential request */
dd->batching += dd->fifo_batch;

if (dd->batching < dd->fifo_batch)
/* we are still entitled to batch */
goto dispatch_request;
}
如果存在下一个request，也就是没有扫描到电梯的末尾。则判断该request是否和上一个request相连。如果相连，并且batching的request数没有超过fifo_batch，则当前这个request就是我们要分发的request。因此直接跳到最后，把request分发到设备请求队列。此时将忽略写饥饿和超时的处理。如果不连续，则要结束batching。
b.如果此时并不处于batching过程中，则根据是否造成写饥饿“超标”来确定读写方向。
if (reads) {
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));

if (writes && (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;

data_dir = READ;

goto dispatch_find_request;
}

/*
* there are either no reads or writes have been starved
*/

if (writes) {
dispatch_writes:
BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));

dd->starved = 0;

data_dir = WRITE;

goto dispatch_find_request;
}
调度器先处理read，也就是read请求优先。但在处理过程中考虑到了写饥饿。如果此时还有写请求，则写饥饿计数+1，如果写饥饿次数大于了writes_starved，则写饥饿已经“超标”了，因此直接跳到dispath_writes去处理写请求。如果写饥饿没有“超标”，则继续处理读请求。

2.根据读写方向，找到当前要处理的请求：
dispatch_find_request:
/*
* we are not running a batch, find best request for selected data_dir
*/
if (deadline_check_fifo(dd, data_dir) || !dd->next_rq[data_dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
rq = dd->next_rq[data_dir];
}

dd->batching = 0;
在寻找指定方向上的请求时，考虑了请求的超时时间。这就是deadline的算法核心所在。调度器首先调用deadline_check_fifo来检查队列中队首，也就是最老的一个请求是否超时。如果超时则指定当前处理的请求为该超时的请求。但如果没有超时，但已经扫描了电梯的末尾：!dd->next_rq[data_dir]。此时需要返回到电梯首部。但与传统的电梯算法不同，deadline调度器不是返回到sector最小的request开始继续扫描。而是返回到等待时间最久的那个requst，从那个request开始，沿sector递增方向继续扫描。也就是说，如果超时，或者扫描到电梯尾，都会返回来处理等待最久的request，并从这个request开始继续进行电梯扫描。当然，如果既没有发生超时，也没有扫描到电梯末尾，则沿sector递增方向上的下一个request就是当前要处理的request。

3.找到要处理的request后，把它分发到块设备的请求队列。

整个deadline调度器比较简洁，总共只有400多行。它充分考虑了batching，写饥饿，请求超时这三大因素。在保证吞吐量的基础上，有考虑到了响应延时。