Linux 内核中RAID5源码详解之守护进程raid5d
对于一个人,大脑支配着他的一举一动;对于一支部队,指挥中心控制着它的所有活动;同样,对于内核中的RAID5,也需要一个像大脑一样的东西来支配着它的正确运转,那就是RAID5的守护进程raid5d。今天,我们就好好来看看raid5d到底是怎么一回事~
进程的注册
前面的博文中贴出的源码经常会出现这样一条语句md_wakeup_thread(mddev->thread);
,这条语句是干嘛的呢?就是唤醒守护进程的。那么,如何让内核感知这个进程的存在呢?这时需要对进程进行注册。
在raid5.c中的setup_conf()
中(这是配置RAID5的全局信息的函数),有这样一句conf->thread = md_register_thread(raid5d, mddev, pers_name);
,这就是将RAID5的守护进程raid5d注册到内核中,让内核识别这个进程。追踪md_register_thread()
:
struct md_thread *md_register_thread(void (*run) (struct md_thread *),struct mddev *mddev, const char *name)
{
struct md_thread *thread;
thread = kzalloc(sizeof(struct md_thread), GFP_KERNEL);/*为进程开辟内存空间*/
if (!thread)
return NULL;
init_waitqueue_head(&thread->wqueue);//初始化进程等待队列
thread->run = run;//设置进程的执行函数
thread->mddev = mddev;
thread->timeout = MAX_SCHEDULE_TIMEOUT;//这时进程的超时机制
thread->tsk = kthread_run(md_thread, thread,
"%s_%s",
mdname(thread->mddev),
name);//设置运行信息
if (IS_ERR(thread->tsk)) {//运行出错时的反应
kfree(thread);
return NULL;
}
return thread;
}
EXPORT_SYMBOL(md_register_thread);
结合上述注释,可以清楚的发现当唤醒这个进程时,执行的是raid5d这个函数,真正的主体在raid5d里,好吧,现在是时候揭开它神秘的面纱了,gogogo!!!
进程执行函数—raid5d
在raid5.c中搜索raid5d的代码:
/*
* This is our raid5 kernel thread.
*
* We scan the hash table for stripes which can be handled now.
* During the scan, completed stripes are saved for us by the interrupt
* handler, so that they will not have to wait for our next wakeup.
*/
static void raid5d(struct md_thread *thread)
{
struct mddev *mddev = thread->mddev;
struct r5conf *conf = mddev->private;
int handled;
struct blk_plug plug;
pr_debug("+++ raid5d active\n");
md_check_recovery(mddev);//检查RAID5同步
blk_start_plug(&plug);
handled = 0;
spin_lock_irq(&conf->device_lock);
while (1) {//^_^死循环哦
struct bio *bio;
int batch_size, released;
released = release_stripe_list(conf, conf->temp_inactive_list);
if (
!list_empty(&conf->bitmap_list)) {//激活bitmap处理
/* Now is a good time to flush some bitmap updates */
conf->seq_flush++;
spin_unlock_irq(&conf->device_lock);
bitmap_unplug(mddev->bitmap);
spin_lock_irq(&conf->device_lock);
conf->seq_write = conf->seq_flush;
activate_bit_delay(conf, conf->temp_inactive_list);
}
raid5_activate_delayed(conf);//激活延迟处理装置
while ((bio = remove_bio_from_retry(conf))) {//有关重试读的操作
int ok;
spin_unlock_irq(&conf->device_lock);
ok = retry_aligned_read(conf, bio);
spin_lock_irq(&conf->device_lock);
if (!ok)
break;
handled++;
}
batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,conf->temp_inactive_list);//处理stripe_head的主战场,返回处理的个数
if (!batch_size && !released)
break;
handled += batch_size;
if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
spin_unlock_irq(&conf->device_lock);
md_check_recovery(mddev);
spin_lock_irq(&conf->device_lock);
}
}
pr_debug("%d stripes handled\n", handled);
spin_unlock_irq(&conf->device_lock);
async_tx_issue_pending_all();
blk_finish_plug(&plug);
pr_debug("--- raid5d inactive\n");
}
这里我们只介绍一些流程,并不对某些操作具体讲解,因为像同步或者处理条带这些操作很复杂,后面会详细的介绍,今天,我们只做一个流程规划。
其实raid5d只做了这几件事:检查同步、处理temp_inactive_list、激活bitmap处理、激活延迟处理、重试读和处理条带。下面一一介绍这些功能:
- 检查同步:由于下层的磁盘设备会发生故障或者产生损坏致使数据错误,而RAID5独有的parity机制则对数据的安全性有了很可靠的保障。检查同步就是按stripe_head为单元,一条条读取磁盘上的数据,然后计算新parity与原先的parity进行比较,如果相同则数据正确,否则数据产生错误,需要修复。
- 处理temp_inactive_list:跟进该函数,发现调用的是
__release_stripe()
,我的前一篇博文里面有提及这个函数,具体点击这里,在此就不赘述了。 - 激活bitmap处理:bitmap,前面没接触过的话,会觉得这个名词很陌生,其实它的存在是为了保证RAID5甚至整个MD模块运行的可靠性。打个比方说:RAID5在写1MB的数据,然而此时断电了,那么内存数据丢失,而且也不知道此时磁盘写了多大的数据,那么怎么恢复这个缺陷呢?于是bitmap出现了,它将每次写请求做一次类似log的形式保存在bitmap_list中,直到写请求已经全部完成后才将bitmap_list中的请求删去,比如说再遇到断电的情况,bitmap中保存着这次请求,下次可以直接恢复。
所以说bitmap的存在是为了系统的可靠性考虑。 - 激活延迟处理:该性能由
raid5_activate_delayed()
实现,跟进函数:
static void raid5_activate_delayed(struct r5conf *conf)
{
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;//延迟链表
struct stripe_head *sh;
sh = list_entry(l, struct stripe_head, lru);
list_del_init(l);//从list中删除
clear_bit(STRIPE_DELAYED, &sh->state);//清楚延迟处理标志
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
atomic_inc(&conf->preread_active_stripes);
list_add_tail(&sh->lru, &conf->hold_list);//加入到hold_list
raid5_wakeup_stripe_thread(sh);
}
}
}
还记得Linux 内核中RAID5源码详解之stripe_head的管理 中提到的delayed_list和hold_list之间的转化吗?就是在这里实现的。
- 重试读:这部分是为了提高读的性能,后面会有介绍。
- 处理条带:这是我们最关心的功能,也是stripe_head处理的入口,跟进
handle_active_stripes()
:
static int handle_active_stripes(struct r5conf *conf, int group,
struct r5worker *worker,
struct list_head *temp_inactive_list)
{
struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
int i, batch_size = 0, hash;
bool release_inactive = false;
while (batch_size < MAX_STRIPE_BATCH &&
(sh = __get_priority_stripe(conf, group)) != NULL)/*默认的MAX_STRIPE_BATCH为8,即一次最多取8个stripe_head处理*/
batch[batch_size++] = sh;
if (batch_size == 0) {
for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
if (!list_empty(temp_inactive_list + i))
break;
if (i == NR_STRIPE_HASH_LOCKS)
return batch_size;
release_inactive = true;
}
spin_unlock_irq(&conf->device_lock);
release_inactive_stripe_list(conf, temp_inactive_list,
NR_STRIPE_HASH_LOCKS);
if (release_inactive) {
spin_lock_irq(&conf->device_lock);
return 0;
}
for (i = 0; i < batch_size; i++)
handle_stripe(batch[i]);//处理条带的主战场
cond_resched();
spin_lock_irq(&conf->device_lock);
for (i = 0; i < batch_size; i++) {
hash = batch[i]->hash_lock_index;
__release_stripe(conf, batch[i], &temp_inactive_list[hash]);//回收条带
}
return batch_size;
}
首先,RAID5是进行批量处理条带的,每次处理MAX_STRIPE_BATCH个条带,默认值为8,在取条带时也有规则,跟进__get_priority_stripe()
:
/* __get_priority_stripe - get the next stripe to process
*
* Full stripe writes are allowed to pass preread active stripes up until
* the bypass_threshold is exceeded. In general the bypass_count
* increments when the handle_list is handled before the hold_list; however, it
* will not be incremented when STRIPE_IO_STARTED is sampled set signifying a
* stripe with in flight i/o. The bypass_count will be reset when the
* head of the hold_list has changed, i.e. the head was promoted to the
* handle_list.
*/
static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
{
struct stripe_head *sh = NULL, *tmp;
struct list_head *handle_list = NULL;
struct r5worker_group *wg = NULL;
if (conf->worker_cnt_per_group == 0) {//确定handle_list
handle_list = &conf->handle_list;
} else if (group != ANY_GROUP) {
handle_list = &conf->worker_groups[group].handle_list;
wg = &conf->worker_groups[group];
} else {
int i;
for (i = 0; i < conf->group_cnt; i++) {
handle_list = &conf->worker_groups[i].handle_list;
wg = &conf->worker_groups[i];
if (!list_empty(handle_list))
break;
}
}
pr_debug("%s: handle: %s hold: %s full_writes: %d bypass_count: %d\n",
__func__,
list_empty(handle_list) ? "empty" : "busy",
list_empty(&conf->hold_list) ? "empty" : "busy",
atomic_read(&conf->pending_full_writes), conf->bypass_count);
if (!list_empty(handle_list)) {//handle_list不为空,则从中取条带
sh = list_entry(handle_list->next, typeof(*sh), lru);
if (list_empty(&conf->hold_list))
conf->bypass_count = 0;
else if (!test_bit(STRIPE_IO_STARTED, &sh->state)) {
if (conf->hold_list.next == conf->last_hold)
conf->bypass_count++;
else {
conf->last_hold = conf->hold_list.next;
conf->bypass_count -= conf->bypass_threshold;
if (conf->bypass_count < 0)
conf->bypass_count = 0;
}
}
} else if (!list_empty(&conf->hold_list) &&
((conf->bypass_threshold &&
conf->bypass_count > conf->bypass_threshold) ||
atomic_read(&conf->pending_full_writes) == 0)) {//否则从hold_list中取
list_for_each_entry(tmp, &conf->hold_list, lru) {
if (conf->worker_cnt_per_group == 0 ||
group == ANY_GROUP ||
!cpu_online(tmp->cpu) ||
cpu_to_group(tmp->cpu) == group) {
sh = tmp;
break;
}
}
if (sh) {
conf->bypass_count -= conf->bypass_threshold;
if (conf->bypass_count < 0)
conf->bypass_count = 0;
}
wg = NULL;
}
if (!sh)
return NULL;
if (wg) {
wg->stripes_cnt--;
sh->group = NULL;
}
list_del_init(&sh->lru);
BUG_ON(atomic_inc_return(&sh->count) != 1);
return sh;
}
根据代码我们可以看出,handle_list中条带的优先级高于hold_list中的条带优先级,而且函数的注释中已经明确表明了bypass_count的赋值情况。
回到handle_active_stripes()
中,取到条带后,调用handle_stripe()
进行处理,这个函数可厉害了,而且情况也特别复杂,我们这里不做讨论,后面会专门来讲解这个函数。处理完成后,调用__release_stripe()
对条带进行回收,我的前一篇博文里面有提及这个函数,具体点击这里,在此就不赘述了。
至此,raid5d所干的事已经昭告于天下了,也不是很复杂的哦,毕竟它只是个指挥官,真正的提枪上阵还得真正的士兵。后面会对某些操作进行具体的讲解,别急哦~
进程的注销
RAID5的守护进程raid5d的注销是通过调用md_unregister_thread()
函数来实现的,跟进md_unregister_thread()
:
void md_unregister_thread(struct md_thread **threadp)
{
struct md_thread *thread = *threadp;
if (!thread)
return;
pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
/* Locking ensures that mddev_unlock does not wake_up a
* non-existent thread
*/
spin_lock(&pers_lock);
*threadp = NULL;
spin_unlock(&pers_lock);
kthread_stop(thread->tsk);//停止这个进程
kfree(thread);
}
EXPORT_SYMBOL(md_unregister_thread);
很简单,只是调用一下kthread_stop()
来停止下raid5d。
有关RAID5的守护进程raid5d的一些基本功能讲得差不多了,从注册到注销,正应了那句话,善始善终,good luck~