A Simple Kernel Implementation of SRCU

Latest recommended article published on 2024-06-02 17:56:58

享乐主

Categories: Kernel, RCU

Article link: https://blog.csdn.net/huang987246510/article/details/103039355

Contents

Preface
Initialization
Grace-Period Detection
- Data Structures
- How Grace-Period Detection Works
Readers
Writers

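Preface

The early kernel implementation of SRCU is fairly simple. This article analyzes SRCU based on kernel 3.10.0. Many parts of the kernel use SRCU for data synchronization, for example debugfs, kvm, fsnotify, and blk-mq. This article uses kvm as the example.

Initialization

SRCU is organized into subsystems: if synchronization in one SRCU subsystem goes wrong, the impact is confined to the processes that use that subsystem. A kernel module that wants to use SRCU to protect read-mostly data must initialize its own SRCU subsystem during setup. One srcu_struct represents one logical SRCU subsystem and is defined as follows: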

struct srcu_struct_array {
    unsigned long c[2];		/* per-CPU read-side lock counts, one per index */
    unsigned long seq[2];	/* per-CPU count of read-lock operations; its role is explained in the grace-period check below */
};

struct srcu_struct {
    unsigned completed;		/* grace-period counter */
    struct srcu_struct_array __percpu *per_cpu_ref;	/* per-CPU reader counts for the critical section */
    spinlock_t queue_lock; /* protect ->batch_queue, ->  */
    bool running;			/* whether the grace-period work item is currently active */
    /* callbacks just queued */
    struct rcu_batch batch_queue;
    /* callbacks try to do the first check_zero */
    struct rcu_batch batch_check0;
    /* callbacks done with the first check_zero and the flip */
    struct rcu_batch batch_check1;
    struct rcu_batch batch_done;
    struct delayed_work work;		/* work item that performs grace-period detection */
};

kvm uses SRCU to synchronize some of the fields in struct kvm, so the structure embeds an srcu_struct field named srcu.

struct kvm {
    spinlock_t mmu_lock;
    struct mutex slots_lock;
    struct mm_struct *mm; /* userspace tied to this vm */
    /* protected by srcu */
    struct kvm_memslots *memslots[KVM_ADDRESS_SPACE_NUM];
    struct srcu_struct srcu;		/* SRCU subsystem protecting some read-mostly kvm fields */
    struct srcu_struct irq_srcu;	/* irq SRCU subsystem */
	......
	struct mmu_notifier mmu_notifier;
	/* protected by srcu */
    unsigned long mmu_notifier_seq;
    long mmu_notifier_count;
	......

kvm.srcu is initialized along the following call path:

kvm_dev_ioctl
	kvm_dev_ioctl_create_vm
		kvm_create_vm
			init_srcu_struct(&kvm->srcu)
				init_srcu_struct_fields
					
static int init_srcu_struct_fields(struct srcu_struct *sp)
{   
    sp->completed = 0;
    spin_lock_init(&sp->queue_lock);
    sp->running = false;
    rcu_batch_init(&sp->batch_queue);
    rcu_batch_init(&sp->batch_check0);
    rcu_batch_init(&sp->batch_check1);
    rcu_batch_init(&sp->batch_done);
    INIT_DELAYED_WORK(&sp->work, process_srcu);	/* register the work function */
    sp->per_cpu_ref = alloc_percpu(struct srcu_struct_array);
    return sp->per_cpu_ref ? 0 : -ENOMEM;
}

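Grace-Period Detection

Data Structures

Grace-period detection is driven by a state machine that tracks two grace periods. The state machine has four states: queued, next grace period, current grace period, and grace period done. srcu_struct has four fields related to the state machine: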

struct srcu_struct {
	......
    /* callbacks just queued
     * If a grace period is still being processed,
     * newly arriving callbacks can only be added to batch_queue
     */
    struct rcu_batch batch_queue;
    /* callbacks try to do the first check_zero
     * the next grace period
     */
    struct rcu_batch batch_check0;
    /* callbacks done with the first check_zero and the flip
     * the current grace period
     */
    struct rcu_batch batch_check1;
    /* once the current grace period ends, callbacks move to done
     */
    struct rcu_batch batch_done;
	......
};

[Figure: state transitions of the SRCU grace-period state machine]

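The state transitions are shown in the upper-right corner of the figure. A writer that starts waiting for a grace period faces one of two situations:

- No grace period is currently being processed. The callback does not need to queue. It goes straight into the check0 state and then waits for two grace periods, moving check0 -> check1 -> done. Reaching done means the grace period has ended, so the callback wakeme_after_rcu can be invoked to wake the writer, which then continues with its update.
- A grace period is already being processed. The callback must queue and enters the queued state. Once the two grace periods ahead of it have ended, it moves queued -> check0 -> check1 -> done, after which wakeme_after_rcu is invoked to wake the writer.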
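The two cases above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions: each batch is reduced to a callback count, `struct srcu_model` and the `model_*` functions are hypothetical names that do not exist in the kernel, and a single boolean stands in for the result of try_check_zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: each batch is just a count of pending callbacks. */
struct srcu_model {
    int queued;    /* models batch_queue  */
    int check0;    /* models batch_check0 */
    int check1;    /* models batch_check1 */
    int done;      /* models batch_done   */
    bool running;  /* models srcu_struct.running */
};

/* A writer arrives: enter check0 directly if nothing is in flight, else queue. */
void model_enqueue(struct srcu_model *m)
{
    if (!m->running) {
        m->running = true;
        m->check0 += 1;
    } else {
        m->queued += 1;
    }
}

/*
 * One successful check advances every batch by one state:
 * check1->done, check0->check1, queued->check0.
 * readers_drained stands in for the return value of try_check_zero().
 */
void model_step(struct srcu_model *m, bool readers_drained)
{
    if (!readers_drained)
        return;               /* nothing moves; retry later */
    m->done += m->check1;
    m->check1 = m->check0;
    m->check0 = m->queued;
    m->queued = 0;
}

/* Invoke (here: count) finished callbacks and go idle when nothing is left. */
int model_invoke_callbacks(struct srcu_model *m)
{
    int invoked = m->done;

    m->done = 0;
    if (m->queued == 0 && m->check0 == 0 && m->check1 == 0)
        m->running = false;
    return invoked;
}
```

With two writers arriving back to back, the first lands in check0 and the second in queued. The first callback completes after two successful steps and the second after three, which matches the check0 -> check1 -> done and queued -> check0 -> check1 -> done paths described above.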

How Grace-Period Detection Works

RCU detects grace periods in softirq context, whereas SRCU uses a workqueue. The detection logic is implemented in process_srcu:

void process_srcu(struct work_struct *work)
{
    struct srcu_struct *sp;

    sp = container_of(work, struct srcu_struct, work.work);

    srcu_collect_new(sp);
    srcu_advance_batches(sp, 1);
    srcu_invoke_callbacks(sp);
    srcu_reschedule(sp);
}

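The core of process_srcu is srcu_advance_batches. It checks whether any reader is still accessing the protected resource by summing srcu_struct.per_cpu_ref->c across all CPUs: a sum of 0 is taken to mean that every CPU has reached a quiescent state. The flow is as follows: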

/*
 * Core SRCU state machine.  Advance callbacks from ->batch_check0 to
 * ->batch_check1 and then to ->batch_done as readers drain.
 */
static void srcu_advance_batches(struct srcu_struct *sp, int trycount)
{
    int idx = 1 ^ (sp->completed & 1);

    if (rcu_batch_empty(&sp->batch_check0) &&
        rcu_batch_empty(&sp->batch_check1))
        return; /* no callbacks need to be advanced */
        
	/* try_check_zero returning true means the current grace period has
	 * ended, so every batch advances by one state:
	 * check1->done
	 * check0->check1
	 * queued->check0
	 */
    if (!try_check_zero(sp, idx, trycount))
        return; /* failed to advance, will try after SRCU_INTERVAL */
	/* check1->done */
    rcu_batch_move(&sp->batch_done, &sp->batch_check1);
	/* If the state machine holds only one pending grace period, i.e.
	 * check1 was non-empty and check0 is empty, return: it has ended */
    if (rcu_batch_empty(&sp->batch_check0))
        return; /* no callbacks need to be advanced */
    srcu_flip(sp);
    
	/* If the state machine holds two pending grace periods, i.e. both
	 * check1 and check0 need to wait, handle both in one pass
	 * check0->check1 
	 * */
    rcu_batch_move(&sp->batch_check1, &sp->batch_check0);

    trycount = trycount < 2 ? 2 : trycount;
    if (!try_check_zero(sp, idx^1, trycount))
        return; /* failed to advance, will try after SRCU_INTERVAL */
        
    rcu_batch_move(&sp->batch_done, &sp->batch_check1);
}

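The function first derives the grace-period index from the lowest bit of completed. Only two values exist, 0 and 1, where 1 is the grace period after 0; they merely distinguish the current grace period from the next one. Because idx is computed as 1 ^ (sp->completed & 1), the first check targets the index that new readers are not using (readers pick sp->completed & 1 in __srcu_read_lock). try_check_zero sums the counts of readers in the critical section across all CPUs. A sum of 0 means no reader is in the critical section and the grace period has ended. The flow is as follows: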

try_check_zero
	srcu_readers_active_idx_check(sp, idx)
	
static bool srcu_readers_active_idx_check(struct srcu_struct *sp, int idx)
{
    unsigned long seq;

    seq = srcu_readers_seq_idx(sp, idx);
    smp_mb(); /* A */
    if (srcu_readers_active_idx(sp, idx) != 0)
        return false;
    smp_mb(); /* D */
    return srcu_readers_seq_idx(sp, idx) == seq;
}

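One would like to say that a total of 0 across all CPUs' read-side lock counts means no reader is in the critical section, but that holds only in the ideal case. On a multi-core system a reader can migrate: it enters the critical section, sleeps, is switched out, and resumes on a different CPU. As a result, two situations can produce a sum of 0. The first is the normal one, where no CPU has a reader in the critical section. The second arises when, while the counts are being summed, another reader passes through the critical section and migrates along the way; the sum can then also come out as 0. The scenario is as follows.

The summing function srcu_readers_active_idx walks all CPUs and adds up each CPU's c[0]. At time t0 the scan starts with cpu0. Before that, reader process A on cpu0 entered the critical section to access data1, so the count on cpu0 is 1. Next comes cpu1, which has no readers, so its count is 0. In the same interval, after cpu0's count has already been read, process B on cpu0 also starts accessing data1. During the access B is scheduled out, and when it runs again it has been migrated to cpu2. B leaves the critical section and decrements the count before the scan reads cpu2, so c[0] on cpu2 is -1. The scan adds up the counts and gets exactly 0, which satisfies the end-of-grace-period condition, so the grace period is declared over. In reality process A on cpu0 is still inside the critical section, and the grace period has not ended.

To prevent this miscount, the structure below was designed. The c[] array counts the readers in the critical section on a CPU, and seq is a sequence number that is incremented each time a reader on that CPU enters the critical section; seq only ever increases. By summing seq before and after an interval, we learn whether any reader entered: if the sum grew, some reader entered the critical section during the interval; if it is unchanged, none did. The root cause of the miscount is that a new reader entered the critical section while the counts were being summed and left again before the summing finished. So when the lock counts sum to 0, one more condition is required: no reader entered the critical section during the summing. If that holds, the grace period has definitely ended; otherwise it may not have. This is why srcu_readers_active_idx_check first sums seq over all CPUs, then sums the lock counts, and, if they are 0, sums seq again. If the two seq sums are equal, no new reader entered the critical section and the grace period can be declared over.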

struct srcu_struct_array {
    unsigned long c[2];
    unsigned long seq[2];
};
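The interleaving described above can be replayed in a single-threaded userspace sketch. It is an illustration only: `struct cpu_counters`, `reader_enter`, `reader_exit`, `sum_seq`, and `racy_scan` are hypothetical names, only one index is modeled, and the memory barriers of the real code are omitted because nothing runs concurrently here.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 3

/* Userspace stand-in for one index of the per-CPU srcu_struct_array. */
struct cpu_counters {
    unsigned long c;    /* lock count as seen on this CPU (may wrap below 0) */
    unsigned long seq;  /* number of read-lock operations done on this CPU */
};

void reader_enter(struct cpu_counters *cpu, int on)
{
    cpu[on].c += 1;
    cpu[on].seq += 1;
}

void reader_exit(struct cpu_counters *cpu, int on)
{
    cpu[on].c -= 1;     /* unsigned wrap-around plays the role of -1 */
}

unsigned long sum_seq(const struct cpu_counters *cpu)
{
    unsigned long sum = 0;

    for (int i = 0; i < NR_CPUS; i++)
        sum += cpu[i].seq;
    return sum;
}

/*
 * Replays the race: the scan reads cpu0 and cpu1, then reader B enters on
 * cpu0, migrates, and exits on cpu2 before the scan reads cpu2.
 * Returns the sum of c the scan computes; *seq_changed reports whether the
 * before/after seq comparison would have rejected that result.
 */
unsigned long racy_scan(bool *seq_changed)
{
    struct cpu_counters cpu[NR_CPUS] = {{0, 0}};
    unsigned long seq_before, sum = 0;

    reader_enter(cpu, 0);        /* reader A enters on cpu0 and stays inside */
    seq_before = sum_seq(cpu);   /* first seq snapshot, taken before summing c */

    sum += cpu[0].c;             /* scan reads cpu0: 1 */
    sum += cpu[1].c;             /* scan reads cpu1: 0 */
    reader_enter(cpu, 0);        /* reader B enters on cpu0, already scanned */
    reader_exit(cpu, 2);         /* B has migrated and exits on cpu2 */
    sum += cpu[2].c;             /* scan reads cpu2: the wrapped -1 */

    *seq_changed = (sum_seq(cpu) != seq_before);
    return sum;                  /* wraps to 0 although A is still inside */
}
```

In this replay the lock counts sum to 0 even though reader A never left, and the second seq snapshot differs from the first because B's entry incremented seq on cpu0. The comparison therefore rejects the false zero, which is the purpose of the extra field.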

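Readers

Before entering the critical section, a reader uses the lowest bit of completed to tell the current grace period from the next one, then calls read_lock to increment the count for that grace period. On exit it calls read_unlock to decrement the count. On entry it also increments seq, for the reason given in the grace-period detection section above.

1) Increment the lock count before entering the critical section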
/*
 * Counts the new reader in the appropriate per-CPU element of the
 * srcu_struct.  Must be called from process context.
 * Returns an index that must be passed to the matching srcu_read_unlock().
 */
int __srcu_read_lock(struct srcu_struct *sp)
{
    int idx;
	/* fetch the grace-period idx, which distinguishes the current grace period from the next one */
    idx = ACCESS_ONCE(sp->completed) & 0x1;
    preempt_disable();
    ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->c[idx]) += 1;
    smp_mb(); /* B */  /* Avoid leaking the critical section. */
    ACCESS_ONCE(this_cpu_ptr(sp->per_cpu_ref)->seq[idx]) += 1;
    preempt_enable();
    return idx;
}
2) Decrement the lock count after leaving the critical section
/*
 * Removes the count for the old reader from the appropriate per-CPU
 * element of the srcu_struct.  Note that this may well be a different
 * CPU than that which was incremented by the corresponding srcu_read_lock().
 * Must be called from process context.
 */
void __srcu_read_unlock(struct srcu_struct *sp, int idx)
{
    smp_mb(); /* C */  /* Avoid leaking the critical section. */
    this_cpu_dec(sp->per_cpu_ref->c[idx]);
}
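The interplay between the index returned by the lock function and the per-CPU counters can be tried out in a small userspace model. This is a sketch under stated assumptions: `struct srcu_reader_model` and the `model_*` functions are hypothetical, the CPU is passed explicitly instead of using this_cpu_ptr, and preemption control, memory barriers, and the seq counter are left out.

```c
#include <assert.h>

#define NR_CPUS 2

/* Hypothetical model of the reader-side state of one srcu_struct. */
struct srcu_reader_model {
    unsigned completed;           /* low bit selects the index for new readers */
    unsigned long c[NR_CPUS][2];  /* per-CPU, per-index lock counts */
};

/* Models __srcu_read_lock() running on `cpu`; returns the index for unlock. */
int model_read_lock(struct srcu_reader_model *m, int cpu)
{
    int idx = m->completed & 0x1;

    m->c[cpu][idx] += 1;
    return idx;
}

/* Models __srcu_read_unlock(); `cpu` may differ from the CPU used at lock time. */
void model_read_unlock(struct srcu_reader_model *m, int cpu, int idx)
{
    m->c[cpu][idx] -= 1;
}

/* Sums one index over all CPUs; unsigned wrap-around cancels migrated exits. */
unsigned long model_readers_active(const struct srcu_reader_model *m, int idx)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += m->c[cpu][idx];
    return sum;
}
```

A reader that locks on cpu0 with index 0 and unlocks on cpu1 leaves c[0][0] at 1 and c[1][0] at the wrapped value of -1, yet the sum for index 0 is 0. Incrementing completed in between (the effect of srcu_flip) sends later readers to index 1, so they no longer affect the sum that the writer is waiting on for index 0.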

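Writers

After modifying the protected data, the writer calls the sync function to wait for the grace period to end. What it actually does is enqueue itself onto the grace-period detection work queue.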

static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
{
    struct rcu_synchronize rcu;
    struct rcu_head *head = &rcu.head;
    bool done = false;
               
    rcu_lockdep_assert(!lock_is_held(&sp->dep_map) &&
               !lock_is_held(&rcu_bh_lock_map) &&
               !lock_is_held(&rcu_lock_map) &&
               !lock_is_held(&rcu_sched_lock_map),
               "Illegal synchronize_srcu() in same-type SRCU (or RCU) read-side critical section");

    might_sleep();
    init_completion(&rcu.completion);

    head->next = NULL;
    /* register the wakeup callback */
    head->func = wakeme_after_rcu;
    spin_lock_irq(&sp->queue_lock);
    if (!sp->running) {
    	/* the work item is not running: change its state */
        /* steal the processing owner */
        sp->running = true;
       /* Hook ourselves onto the grace-period detection queue;
        * we are woken once detection completes
        * */
        rcu_batch_queue(&sp->batch_check0, head);
        spin_unlock_irq(&sp->queue_lock);
		/* run grace-period detection */
        srcu_advance_batches(sp, trycount);
        if (!rcu_batch_empty(&sp->batch_done)) {
        	/* batch_done is non-empty: one grace period has completed */
            BUG_ON(sp->batch_done.head != head);
            rcu_batch_dequeue(&sp->batch_done);
            done = true;
        }
        /* give the processing owner to work_struct */
        srcu_reschedule(sp);
    } else {
        rcu_batch_queue(&sp->batch_queue, head);
        spin_unlock_irq(&sp->queue_lock);
    }

    if (!done)
        wait_for_completion(&rcu.completion);
}