This analysis is based on Linux kernel 4.19.195.
Software implementation principles
At its core, implementing RCU in software is one large state machine: recognize the start of a grace period (GP), detect quiescent states (QS), determine that the grace period has ended, handle the end-of-GP work, and start the next grace period. The Linux kernel mainly relies on whether a CPU can schedule at tick time to decide whether that CPU has left its read-side critical section (that is, has passed through a quiescent state), and it decides that a grace period has elapsed once all CPUs have passed through a quiescent state. After a grace period ends, the deferred memory can be freed and the next grace period can be started. For the general principles, other blogs already cover them well.
Hardware prerequisites
I did not notice this at first, but the kernel's RCU implementation rests on a hardware assumption. To borrow a sentence from Wikipedia: "RCU-based updaters typically take advantage of the fact that writes to single aligned pointers are atomic on modern CPUs, allowing atomic insertion, removal, and replacement of data in a linked structure without disrupting readers." Why is this assumption needed? Look at the RCU version of the kernel's doubly linked list and it becomes obvious: if the write side were not atomic, a reader, which takes no lock, could observe a value that is neither the old version nor the new one, and the read side would be broken. On modern 64-bit CPUs, writes to aligned variables no larger than 8 bytes are essentially atomic, so in principle any such variable could be protected by RCU. Yet we rarely (in fact, as far as I know, never; corrections welcome) see non-pointer variables protected by RCU. Why? I see two reasons. First, a non-pointer variable is too small to carry much information. Second, in principle whether the data is still valid needs to be indicated by a reference count, which is why so many RCU-protected structures embed one: without it we cannot tell valid data from stale data, and when we obtain the data we also need to take a reference to keep its memory from being freed. A reference count alone is already 4 bytes or more, so it is entirely natural that RCU obtains data through protected pointers.
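As a concrete illustration of what the atomic-pointer-write assumption buys, here is a minimal, hypothetical sketch of the canonical publish/read pattern (the struct and function names are invented for this example; kernel context and the usual headers are assumed):
struct config {
	int threshold;
	int timeout;
};

static struct config __rcu *cur_config;

/* Reader: no lock taken; relies on the atomic, ordered pointer load. */
static int config_threshold(void)
{
	struct config *c;
	int ret = -1;

	rcu_read_lock();
	c = rcu_dereference(cur_config);	/* paired with rcu_assign_pointer() */
	if (c)
		ret = c->threshold;
	rcu_read_unlock();
	return ret;
}

/* Updater: fully initialize the new object, then publish it with one atomic
 * pointer write; readers see either the old or the new version, never a mix.
 * Concurrent updaters must be serialized by the caller, e.g. with a lock. */
static void config_update(struct config *newc)
{
	struct config *oldc;

	oldc = rcu_dereference_protected(cur_config, 1);
	rcu_assign_pointer(cur_config, newc);
	synchronize_rcu();	/* wait out pre-existing readers */
	kfree(oldc);
}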
This hardware assumption also underlies some of the kernel's statistics-display code, which reads another CPU's per-cpu variables from, say, CPU 1 and prints them. Strictly speaking, using CPU 2's per-cpu variable on CPU 1 is improper, but if it is a read-only access in display-only code where precision does not matter, the worst case is reading a slightly stale value, with no serious consequences.
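A hedged sketch of that display-only pattern (the counter name is hypothetical):
static DEFINE_PER_CPU(unsigned long, foo_events);

/* Sum other CPUs' counters without any synchronization. Each load of an
 * aligned long is atomic, so the worst case is a slightly stale total,
 * which is acceptable for statistics output. */
static unsigned long foo_events_total(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += READ_ONCE(per_cpu(foo_events, cpu));
	return sum;
}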
A brief look at the RCU implementation
The following walks through the Linux kernel's RCU implementation using Tree RCU, specifically the rcu_sched flavor, as the example.
Basic data structures
rcu_state describes the global state of one RCU flavor; the kernel defines one rcu_state instance for each flavor.
rcu_node describes the RCU state of one group of processors; it is the organizing node of the Tree RCU hierarchy.
rcu_data describes the RCU state of one processor, with one rcu_data instance per CPU. Each CPU has its own view of the grace period it is currently in, which is not necessarily globally consistent: a CPU may stop its tick, idle for a long time, or even go offline/online, all of which make it diverge from other CPUs. In addition, some CPUs may sit in rather long grace periods, leaving their grace-period numbers behind those of other CPUs.
rcu_segcblist is the data structure behind RCU's callback lists. It divides a callback list into four segments; the meaning of each segment is spelled out in the code comments below.
How these data structures relate to one another can be found in other references.
struct rcu_state { /* fields omitted */ };
struct rcu_node { /* fields omitted */ };
struct rcu_data { /* fields omitted */ };
/* Complicated segmented callback lists. ;-) */
/*
* Index values for segments in rcu_segcblist structure.
*
* The segments are as follows:
*
* [head, *tails[RCU_DONE_TAIL]):
* Callbacks whose grace period has elapsed, and thus can be invoked.
* [*tails[RCU_DONE_TAIL], *tails[RCU_WAIT_TAIL]):
* Callbacks waiting for the current GP from the current CPU's viewpoint.
* [*tails[RCU_WAIT_TAIL], *tails[RCU_NEXT_READY_TAIL]):
* Callbacks that arrived before the next GP started, again from
* the current CPU's viewpoint. These can be handled by the next GP.
* [*tails[RCU_NEXT_READY_TAIL], *tails[RCU_NEXT_TAIL]):
* Callbacks that might have arrived after the next GP started.
* There is some uncertainty as to when a given GP starts and
* ends, but a CPU knows the exact times if it is the one starting
* or ending the GP. Other CPUs know that the previous GP ends
* before the next one starts.
*
* Note that RCU_WAIT_TAIL cannot be empty unless RCU_NEXT_READY_TAIL is also
* empty.
*
* The ->gp_seq[] array contains the grace-period number at which the
* corresponding segment of callbacks will be ready to invoke. A given
* element of this array is meaningful only when the corresponding segment
* is non-empty, and it is never valid for RCU_DONE_TAIL (whose callbacks
* are already ready to invoke) or for RCU_NEXT_TAIL (whose callbacks have
* not yet been assigned a grace-period number).
*/
#define RCU_DONE_TAIL 0 /* Also RCU_WAIT head. */
#define RCU_WAIT_TAIL 1 /* Also RCU_NEXT_READY head. */
#define RCU_NEXT_READY_TAIL 2 /* Also RCU_NEXT head. */
#define RCU_NEXT_TAIL 3
#define RCU_CBLIST_NSEGS 4
struct rcu_segcblist {
struct rcu_head *head;
struct rcu_head **tails[RCU_CBLIST_NSEGS];
unsigned long gp_seq[RCU_CBLIST_NSEGS];
long len;
long len_lazy;
};
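To make the segment bookkeeping concrete, here is a hedged sketch of how a callback enters this structure, simplified from the kernel's rcu_segcblist_enqueue() (length/lazy accounting details and memory barriers elided):
/* *tails[RCU_NEXT_TAIL] always points at the list's terminating NULL slot,
 * so enqueueing is: link the callback there, then advance the tail. New
 * callbacks therefore always land in the RCU_NEXT segment. */
static void segcblist_enqueue_sketch(struct rcu_segcblist *rsclp,
				     struct rcu_head *rhp)
{
	rsclp->len++;			/* the kernel uses WRITE_ONCE() plus barriers */
	rhp->next = NULL;
	*rsclp->tails[RCU_NEXT_TAIL] = rhp;
	rsclp->tails[RCU_NEXT_TAIL] = &rhp->next;
}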
The grace-period kthread
Each RCU flavor has a kthread dedicated to starting and ending grace periods, implemented by rcu_gp_kthread(). It does three things, corresponding to the three parts of the function: 1) wait for and create a new grace period (GP); 2) wait for forced quiescent states, with a timeout; being woken early means all processors have passed through a quiescent state; 3) handle the end of the grace period.
The first part waits for the RCU_GP_FLAG_INIT bit in rsp->gp_flags. Once this flag is set, a new grace period must be started; it is set in rcu_gp_cleanup() and rcu_start_this_gp().
The second part waits for all CPUs to pass through a quiescent state. If they all do so before the timeout, the thread moves smoothly to the third part; otherwise, once the deadline arrives, the CPUs that have not yet passed a quiescent state are forced to produce one.
The third part performs the cleanup after the grace period ends.
/*
* Body of kthread that handles grace periods.
*/
static int __noreturn rcu_gp_kthread(void *arg)
{
bool first_gp_fqs;
int gf;
unsigned long j;
int ret;
struct rcu_state *rsp = arg;
struct rcu_node *rnp = rcu_get_root(rsp);
rcu_bind_gp_kthread();
for (;;) {
/* Handle grace-period start. */
for (;;) { // this inner loop waits for and starts a new grace period
trace_rcu_grace_period(rsp->name,
READ_ONCE(rsp->gp_seq),
TPS("reqwait"));
rsp->gp_state = RCU_GP_WAIT_GPS;
swait_event_idle_exclusive(rsp->gp_wq, READ_ONCE(rsp->gp_flags) &
RCU_GP_FLAG_INIT); // wait until a new grace period is requested
rsp->gp_state = RCU_GP_DONE_GPS;
/* Locking provides needed memory barrier. */
if (rcu_gp_init(rsp)) // start and initialize the new grace period
break;
cond_resched_tasks_rcu_qs();
WRITE_ONCE(rsp->gp_activity, jiffies);
WARN_ON(signal_pending(current));
trace_rcu_grace_period(rsp->name,
READ_ONCE(rsp->gp_seq),
TPS("reqwaitsig"));
}
/* Handle quiescent-state forcing. */
first_gp_fqs = true;
j = jiffies_till_first_fqs;
ret = 0;
for (;;) {
if (!ret) {
rsp->jiffies_force_qs = jiffies + j;
WRITE_ONCE(rsp->jiffies_kick_kthreads,
jiffies + 3 * j);
}
trace_rcu_grace_period(rsp->name,
READ_ONCE(rsp->gp_seq),
TPS("fqswait"));
rsp->gp_state = RCU_GP_WAIT_FQS; // waiting for forced quiescent states
ret = swait_event_idle_timeout_exclusive(rsp->gp_wq, // wait for a timeout or GP completion
rcu_gp_fqs_check_wake(rsp, &gf), j);
rsp->gp_state = RCU_GP_DOING_FQS;
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
if (!READ_ONCE(rnp->qsmask) &&
!rcu_preempt_blocked_readers_cgp(rnp))
break;
/* If time for quiescent-state forcing, do it. */
if (ULONG_CMP_GE(jiffies, rsp->jiffies_force_qs) ||
(gf & RCU_GP_FLAG_FQS)) {
trace_rcu_grace_period(rsp->name,
READ_ONCE(rsp->gp_seq),
TPS("fqsstart"));
rcu_gp_fqs(rsp, first_gp_fqs);
first_gp_fqs = false;
trace_rcu_grace_period(rsp->name,
READ_ONCE(rsp->gp_seq),
TPS("fqsend"));
cond_resched_tasks_rcu_qs();
WRITE_ONCE(rsp->gp_activity, jiffies);
ret = 0; /* Force full wait till next FQS. */
j = jiffies_till_next_fqs;
} else {
/* Deal with stray signal. */
cond_resched_tasks_rcu_qs();
WRITE_ONCE(rsp->gp_activity, jiffies);
WARN_ON(signal_pending(current));
trace_rcu_grace_period(rsp->name,
READ_ONCE(rsp->gp_seq),
TPS("fqswaitsig"));
ret = 1; /* Keep old FQS timing. */
j = jiffies;
if (time_after(jiffies, rsp->jiffies_force_qs))
j = 1;
else
j = rsp->jiffies_force_qs - j;
}
}
/* Handle grace-period end. */
rsp->gp_state = RCU_GP_CLEANUP;
rcu_gp_cleanup(rsp);
rsp->gp_state = RCU_GP_CLEANED;
}
}
The first part is straightforward and needs no further analysis.
The second part is not complicated either. The thread sleeps until rcu_gp_fqs_check_wake() evaluates true, which happens in two cases: a quiescent state needs to be forced, or the current grace period has completed. If no forcing is needed, the grace period can simply end; otherwise quiescent states must be forced. This repeats in the big loop of part two: with a period of jiffies_till_first_fqs (then jiffies_till_next_fqs), it calls rcu_gp_fqs() to force a quiescent state on processors that have not passed one. On the first forcing pass of a grace period, rcu_gp_fqs() calls force_qs_rnp(rsp, dyntick_save_progress_counter); on subsequent passes it calls force_qs_rnp(rsp, rcu_implicit_dynticks_qs). Let's look at this part of the implementation.
/*
* Do one round of quiescent-state forcing.
*/
static void rcu_gp_fqs(struct rcu_state *rsp, bool first_time)
{
struct rcu_node *rnp = rcu_get_root(rsp);
WRITE_ONCE(rsp->gp_activity, jiffies);
rsp->n_force_qs++;
if (first_time) {
/* Collect dyntick-idle snapshots. */
force_qs_rnp(rsp, dyntick_save_progress_counter);
} else {
/* Handle dyntick-idle and offline CPUs. */
force_qs_rnp(rsp, rcu_implicit_dynticks_qs);
}
/* Clear flag to prevent immediate re-entry. */
if (READ_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
raw_spin_lock_irq_rcu_node(rnp);
WRITE_ONCE(rsp->gp_flags,
READ_ONCE(rsp->gp_flags) & ~RCU_GP_FLAG_FQS);
raw_spin_unlock_irq_rcu_node(rnp);
}
}
/*
* Scan the leaf rcu_node structures, processing dyntick state for any that
* have not yet encountered a quiescent state, using the function specified.
* Also initiate boosting for any threads blocked on the root rcu_node.
*
* The caller must have suppressed start of new grace periods.
*/
static void force_qs_rnp(struct rcu_state *rsp, int (*f)(struct rcu_data *rsp))
{
int cpu;
unsigned long flags;
unsigned long mask;
struct rcu_node *rnp;
rcu_for_each_leaf_node(rsp, rnp) {
cond_resched_tasks_rcu_qs();
mask = 0;
raw_spin_lock_irqsave_rcu_node(rnp, flags);
if (rnp->qsmask == 0) {
if (rcu_state_p == &rcu_sched_state ||
rsp != rcu_state_p ||
rcu_preempt_blocked_readers_cgp(rnp)) {
/*
* No point in scanning bits because they
* are all zero. But we might need to
* priority-boost blocked readers.
*/
rcu_initiate_boost(rnp, flags);
/* rcu_initiate_boost() releases rnp->lock */
continue;
}
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
continue;
}
for_each_leaf_node_possible_cpu(rnp, cpu) {
unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
if ((rnp->qsmask & bit) != 0) {
if (f(per_cpu_ptr(rsp->rda, cpu)))
mask |= bit;
}
}
if (mask != 0) {
/* Idle/offline CPUs, report (releases rnp->lock). */
rcu_report_qs_rnp(mask, rsp, rnp, rnp->gp_seq, flags);
} else {
/* Nothing to do here, so just drop the lock. */
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}
}
}
/*
* Return true if the specified CPU has passed through a quiescent
* state by virtue of being in or having passed through an dynticks
* idle state since the last call to dyntick_save_progress_counter()
* for this same CPU, or by virtue of having been offline.
*/
static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
unsigned long jtsq;
bool *rnhqp;
bool *ruqp;
struct rcu_node *rnp = rdp->mynode;
/*
* If the CPU passed through or entered a dynticks idle phase with
* no active irq/NMI handlers, then we can safely pretend that the CPU
* already acknowledged the request to pass through a quiescent
* state. Either way, that CPU cannot possibly be in an RCU
* read-side critical section that started before the beginning
* of the current RCU grace period.
*/
if (rcu_dynticks_in_eqs_since(rdp->dynticks, rdp->dynticks_snap)) {
trace_rcu_fqs(rdp->rsp->name, rdp->gp_seq, rdp->cpu, TPS("dti"));
rdp->dynticks_fqs++;
rcu_gpnum_ovf(rnp, rdp);
return 1;
}
/*
* Has this CPU encountered a cond_resched() since the beginning
* of the grace period? For this to be the case, the CPU has to
* have noticed the current grace period. This might not be the
* case for nohz_full CPUs looping in the kernel.
*/
jtsq = jiffies_till_sched_qs;
ruqp = per_cpu_ptr(&rcu_dynticks.rcu_urgent_qs, rdp->cpu);
if (time_after(jiffies, rdp->rsp->gp_start + jtsq) &&
READ_ONCE(rdp->rcu_qs_ctr_snap) != per_cpu(rcu_dynticks.rcu_qs_ctr, rdp->cpu) &&
rcu_seq_current(&rdp->gp_seq) == rnp->gp_seq && !rdp->gpwrap) {
trace_rcu_fqs(rdp->rsp->name, rdp->gp_seq, rdp->cpu, TPS("rqc"));
rcu_gpnum_ovf(rnp, rdp);
return 1;
} else if (time_after(jiffies, rdp->rsp->gp_start + jtsq)) {
/* Load rcu_qs_ctr before store to rcu_urgent_qs. */
smp_store_release(ruqp, true);
}
/* If waiting too long on an offline CPU, complain. */
if (!(rdp->grpmask & rcu_rnp_online_cpus(rnp)) &&
time_after(jiffies, rdp->rsp->gp_start + HZ)) {
bool onl;
struct rcu_node *rnp1;
WARN_ON(1); /* Offline CPUs are supposed to report QS! */
pr_info("%s: grp: %d-%d level: %d ->gp_seq %ld ->completedqs %ld\n",
__func__, rnp->grplo, rnp->grphi, rnp->level,
(long)rnp->gp_seq, (long)rnp->completedqs);
for (rnp1 = rnp; rnp1; rnp1 = rnp1->parent)
pr_info("%s: %d:%d ->qsmask %#lx ->qsmaskinit %#lx ->qsmaskinitnext %#lx ->rcu_gp_init_mask %#lx\n",
__func__, rnp1->grplo, rnp1->grphi, rnp1->qsmask, rnp1->qsmaskinit, rnp1->qsmaskinitnext, rnp1->rcu_gp_init_mask);
onl = !!(rdp->grpmask & rcu_rnp_online_cpus(rnp));
pr_info("%s %d: %c online: %ld(%d) offline: %ld(%d)\n",
__func__, rdp->cpu, ".o"[onl],
(long)rdp->rcu_onl_gp_seq, rdp->rcu_onl_gp_flags,
(long)rdp->rcu_ofl_gp_seq, rdp->rcu_ofl_gp_flags);
return 1; /* Break things loose after complaining. */
}
/*
* A CPU running for an extended time within the kernel can
* delay RCU grace periods. When the CPU is in NO_HZ_FULL mode,
* even context-switching back and forth between a pair of
* in-kernel CPU-bound tasks cannot advance grace periods.
* So if the grace period is old enough, make the CPU pay attention.
* Note that the unsynchronized assignments to the per-CPU
* rcu_need_heavy_qs variable are safe. Yes, setting of
* bits can be lost, but they will be set again on the next
* force-quiescent-state pass. So lost bit sets do not result
* in incorrect behavior, merely in a grace period lasting
* a few jiffies longer than it might otherwise. Because
* there are at most four threads involved, and because the
* updates are only once every few jiffies, the probability of
* lossage (and thus of slight grace-period extension) is
* quite low.
*/
rnhqp = &per_cpu(rcu_dynticks.rcu_need_heavy_qs, rdp->cpu);
if (!READ_ONCE(*rnhqp) &&
(time_after(jiffies, rdp->rsp->gp_start + jtsq) ||
time_after(jiffies, rdp->rsp->jiffies_resched))) {
WRITE_ONCE(*rnhqp, true);
/* Store rcu_need_heavy_qs before rcu_urgent_qs. */
smp_store_release(ruqp, true);
rdp->rsp->jiffies_resched += jtsq; /* Re-enable beating. */
}
/*
* If more than halfway to RCU CPU stall-warning time, do a
* resched_cpu() to try to loosen things up a bit. Also check to
* see if the CPU is getting hammered with interrupts, but only
* once per grace period, just to keep the IPIs down to a dull roar.
*/
if (jiffies - rdp->rsp->gp_start > rcu_jiffies_till_stall_check() / 2) {
resched_cpu(rdp->cpu);
if (IS_ENABLED(CONFIG_IRQ_WORK) &&
!rdp->rcu_iw_pending && rdp->rcu_iw_gp_seq != rnp->gp_seq &&
(rnp->ffmask & rdp->grpmask)) {
init_irq_work(&rdp->rcu_iw, rcu_iw_handler);
rdp->rcu_iw_pending = true;
rdp->rcu_iw_gp_seq = rnp->gp_seq;
irq_work_queue_on(&rdp->rcu_iw, rdp->cpu);
}
}
return 0;
}
On forcing passes other than the first one of a grace period, force_qs_rnp() walks every leaf node. If all CPUs represented by a node have already passed a quiescent state, the node is skipped; otherwise rcu_implicit_dynticks_qs() is invoked for each CPU in the leaf that still owes one. If some CPUs turn out to have passed a quiescent state, the corresponding rcu_node fields are updated. rcu_implicit_dynticks_qs() runs a series of checks to decide whether the CPU has gone through a quiescent state since the grace period started; if not, it calls resched_cpu() to force a reschedule on that CPU.
The third part runs once a grace period is over: it sets the relevant state and calls rcu_gp_cleanup() to finish off the current grace period.
/*
* Clean up after the old grace period.
*/
static void rcu_gp_cleanup(struct rcu_state *rsp)
{
unsigned long gp_duration;
bool needgp = false;
unsigned long new_gp_seq;
struct rcu_data *rdp;
struct rcu_node *rnp = rcu_get_root(rsp);
struct swait_queue_head *sq;
WRITE_ONCE(rsp->gp_activity, jiffies);
raw_spin_lock_irq_rcu_node(rnp);
gp_duration = jiffies - rsp->gp_start;
if (gp_duration > rsp->gp_max)
rsp->gp_max = gp_duration; // record the longest GP duration so far
/*
* We know the grace period is complete, but to everyone else
* it appears to still be ongoing. But it is also the case
* that to everyone else it looks like there is nothing that
* they can do to advance the grace period. It is therefore
* safe for us to drop the lock in order to mark the grace
* period as completed in all of the rcu_node structures.
*/
raw_spin_unlock_irq_rcu_node(rnp);
/*
* Propagate new ->gp_seq value to rcu_node structures so that
* other CPUs don't have to wait until the start of the next grace
* period to process their callbacks. This also avoids some nasty
* RCU grace-period initialization races by forcing the end of
* the current grace period to be completely recorded in all of
* the rcu_node structures before the beginning of the next grace
* period is recorded in any of the rcu_node structures.
*/
new_gp_seq = rsp->gp_seq;
rcu_seq_end(&new_gp_seq); // new gp_seq: clear the old state bits and advance the count
rcu_for_each_node_breadth_first(rsp, rnp) { // walk all nodes breadth-first from the root
raw_spin_lock_irq_rcu_node(rnp);
if (WARN_ON_ONCE(rcu_preempt_blocked_readers_cgp(rnp)))
dump_blkd_tasks(rsp, rnp, 10);
WARN_ON_ONCE(rnp->qsmask);
WRITE_ONCE(rnp->gp_seq, new_gp_seq); // update this rcu_node's gp_seq
rdp = this_cpu_ptr(rsp->rda);
if (rnp == rdp->mynode) // true only at this CPU's leaf node
needgp = __note_gp_changes(rsp, rnp, rdp) || needgp;
/* smp_mb() provided by prior unlock-lock pair. */
needgp = rcu_future_gp_cleanup(rsp, rnp) || needgp;
sq = rcu_nocb_gp_get(rnp);
raw_spin_unlock_irq_rcu_node(rnp);
rcu_nocb_gp_cleanup(sq);
cond_resched_tasks_rcu_qs();
WRITE_ONCE(rsp->gp_activity, jiffies); // record GP activity timestamp
rcu_gp_slow(rsp, gp_cleanup_delay);
}
rnp = rcu_get_root(rsp);
raw_spin_lock_irq_rcu_node(rnp); /* GP before rsp->gp_seq update. */
/* Declare grace period done. */
rcu_seq_end(&rsp->gp_seq); // update rcu_state's gp_seq
trace_rcu_grace_period(rsp->name, rsp->gp_seq, TPS("end"));
rsp->gp_state = RCU_GP_IDLE; // the GP machinery is now idle
/* Check for GP requests since above loop. */
rdp = this_cpu_ptr(rsp->rda);
if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) {
trace_rcu_this_gp(rnp, rdp, rnp->gp_seq_needed,
TPS("CleanupMore"));
needgp = true;
}
/* Advance CBs to reduce false positives below. */
if (!rcu_accelerate_cbs(rsp, rnp, rdp) && needgp) {
WRITE_ONCE(rsp->gp_flags, RCU_GP_FLAG_INIT); // request that a new GP be started
rsp->gp_req_activity = jiffies;
trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gp_seq),
TPS("newreq"));
} else {
WRITE_ONCE(rsp->gp_flags, rsp->gp_flags & RCU_GP_FLAG_INIT);
}
raw_spin_unlock_irq_rcu_node(rnp);
}
/*
* If there is room, assign a ->gp_seq number to any callbacks on this
* CPU that have not already been assigned. Also accelerate any callbacks
* that were previously assigned a ->gp_seq number that has since proven
* to be too conservative, which can happen if callbacks get assigned a
* ->gp_seq number while RCU is idle, but with reference to a non-root
* rcu_node structure. This function is idempotent, so it does not hurt
* to call it repeatedly. Returns an flag saying that we should awaken
* the RCU grace-period kthread.
*
* The caller must hold rnp->lock with interrupts disabled.
*/
static bool rcu_accelerate_cbs(struct rcu_state *rsp, struct rcu_node *rnp,
struct rcu_data *rdp)
{
unsigned long gp_seq_req;
bool ret = false;
raw_lockdep_assert_held_rcu_node(rnp);
/* If no pending (not yet ready to invoke) callbacks, nothing to do. */
if (!rcu_segcblist_pend_cbs(&rdp->cblist))
return false;
/*
* Callbacks are often registered with incomplete grace-period
* information. Something about the fact that getting exact
* information requires acquiring a global lock... RCU therefore
* makes a conservative estimate of the grace period number at which
* a given callback will become ready to invoke. The following
* code checks this estimate and improves it when possible, thus
* accelerating callback invocation to an earlier grace-period
* number.
*/
gp_seq_req = rcu_seq_snap(&rsp->gp_seq);
if (rcu_segcblist_accelerate(&rdp->cblist, gp_seq_req))
ret = rcu_start_this_gp(rnp, rdp, gp_seq_req);
/* Trace depending on how much we were able to accelerate. */
if (rcu_segcblist_restempty(&rdp->cblist, RCU_WAIT_TAIL))
trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("AccWaitCB"));
else
trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("AccReadyCB"));
return ret;
}
This function's grace-period-end processing proceeds as follows:
1. Starting from the root, traverse every node of the RCU tree level by level, updating each node's completed grace-period number to the current grace-period number.
2. Assuming the GP kthread runs on processor n, handle the grace-period end for processor n's rcu_data instance: move the callbacks registered during or before the current grace period toward the RCU_DONE_TAIL segment, and update the current grace-period number.
Here rcu_accelerate_cbs() is responsible for "accelerating" callbacks, moving callbacks from the last segment, RCU_NEXT_TAIL, into the earlier segments; it mainly shuffles the callbacks within the rcu_segcblist and determines whether a new grace period needs to be started.
Code in the tick path
Note that every CPU has its own tick interrupt. The RCU work done in the tick mainly handles this CPU's own bookkeeping; the global work is left to the RCU grace-period kthread.
From the tick, the kernel reaches rcu_check_callbacks() for the RCU checks.
/*
* Check to see if this CPU is in a non-context-switch quiescent state
* (user mode or idle loop for rcu, non-softirq execution for rcu_bh).
* Also schedule RCU core processing.
*
* This function must be called from hardirq context. It is normally
* invoked from the scheduling-clock interrupt.
*/
void rcu_check_callbacks(int user)
{
trace_rcu_utilization(TPS("Start scheduler-tick"));
increment_cpu_stall_ticks();
if (user || rcu_is_cpu_rrupt_from_idle()) {
/*
* Get here if this CPU took its interrupt from user
* mode or from the idle loop, and if this is not a
* nested interrupt. In this case, the CPU is in
* a quiescent state, so note it.
*
* No memory barrier is required here because both
* rcu_sched_qs() and rcu_bh_qs() reference only CPU-local
* variables that other CPUs neither access nor modify,
* at least not while the corresponding CPU is online.
*/
rcu_sched_qs();
rcu_bh_qs();
rcu_note_voluntary_context_switch(current);
} else if (!in_softirq()) {
/*
* Get here if this CPU did not take its interrupt from
* softirq, in other words, if it is not interrupting
* a rcu_bh read-side critical section. This is an _bh
* critical section, so note it.
*/
rcu_bh_qs();
}
rcu_preempt_check_callbacks();
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_dynticks.rcu_urgent_qs))) {
/* Idle and userspace execution already are quiescent states. */
if (!rcu_is_cpu_rrupt_from_idle() && !user) {
set_tsk_need_resched(current);
set_preempt_need_resched();
}
__this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
}
if (rcu_pending())
invoke_rcu_core();
trace_rcu_utilization(TPS("End scheduler-tick"));
}
The code is simple: if the tick arrived while the CPU was in user mode or in the idle loop, the CPU can be considered to have passed through a quiescent state, and rcu_sched_qs() records that fact. Note that this only records the quiescent state; reporting it on behalf of this CPU happens later in softirq context.
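Recording is purely CPU-local; roughly (a simplified sketch, not verbatim kernel code) it amounts to:
/* cpu_no_qs.b.norm == true means this CPU still owes the current grace
 * period a quiescent state; recording a QS just clears the flag. The RCU
 * softirq later notices the cleared flag and reports it up the tree. */
static void record_this_cpu_qs(struct rcu_data *rdp)
{
	if (rdp->cpu_no_qs.b.norm)
		rdp->cpu_no_qs.b.norm = false;
}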
rcu_init()->open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
rcu_process_callbacks()
->__rcu_process_callbacks() // checks whether a new grace period must be started and whether callbacks are ready to run; starting a new GP also requires waking the GP kthread
->rcu_check_quiescent_state() /* check whether a quiescent state needs reporting */
->rcu_report_qs_rdp() // report the quiescent state
->rcu_report_qs_rnp()
->rcu_report_qs_rsp(rsp, flags); // once every member under the root has reported a QS, tell the GP kthread to end the current grace period
rcu_report_qs_rnp() propagates the quiescent state from the leaf up toward the root, one level at a time.
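Conceptually (a hedged sketch that ignores locking and the preemptible-RCU cases), the upward propagation looks like this:
/* Clear our bit at each level; only the context that clears the last bit
 * of a node climbs to the parent, so most reports stop at the leaf. */
static void report_qs_up_sketch(struct rcu_node *rnp, unsigned long mask)
{
	for (; rnp; rnp = rnp->parent) {
		rnp->qsmask &= ~mask;	/* this CPU/group has passed a QS */
		if (rnp->qsmask)
			return;		/* siblings still pending at this level */
		mask = rnp->grpmask;	/* our bit inside the parent's qsmask */
	}
	/* Root fully clear: every CPU passed a QS; wake the GP kthread. */
}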
Returning to rcu_check_callbacks(): at the end it calls rcu_pending() to see whether this processor has RCU work pending; if so, invoke_rcu_core() raises the RCU_SOFTIRQ softirq, so the work is done in softirq context rather than in the tick itself. So what exactly does rcu_pending() check?
/*
 * Check to see if there is any immediate RCU-related work to be done
* by the current CPU, returning 1 if so. This function is part of the
* RCU implementation; it is -not- an exported member of the RCU API.
*/
static int rcu_pending(void)
{
struct rcu_state *rsp;
for_each_rcu_flavor(rsp)
if (__rcu_pending(rsp, this_cpu_ptr(rsp->rda)))
return 1;
return 0;
}
/*
* Check to see if there is any immediate RCU-related work to be done
* by the current CPU, for the specified type of RCU, returning 1 if so.
* The checks are in order of increasing expense: checks that can be
* carried out against CPU-local state are performed first. However,
* we must check for CPU stalls first, else we might not get a chance.
*/
static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
{
struct rcu_node *rnp = rdp->mynode;
/* Check for CPU stalls, if enabled. */
check_cpu_stall(rsp, rdp);
/* Is this CPU a NO_HZ_FULL CPU that should ignore RCU? */
if (rcu_nohz_full_cpu(rsp))
return 0;
/* Is the RCU core waiting for a quiescent state from this CPU? */
if (rdp->core_needs_qs && !rdp->cpu_no_qs.b.norm) // core_needs_qs set: RCU is waiting for this CPU to report a QS; cpu_no_qs.b.norm false: this CPU has passed through a normal-GP quiescent state
return 1; // a quiescent state needs to be reported
/* Does this CPU have callbacks ready to invoke? */
if (rcu_segcblist_ready_cbs(&rdp->cblist))
return 1;
/* Has RCU gone idle with this CPU needing another grace period? */
// a new grace period is needed
if (!rcu_gp_in_progress(rsp) &&
rcu_segcblist_is_enabled(&rdp->cblist) &&
!rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL))
return 1;
/* Have RCU grace period completed or started? */
if (rcu_seq_current(&rnp->gp_seq) != rdp->gp_seq ||
unlikely(READ_ONCE(rdp->gpwrap))) /* outside lock */
return 1;
/* Does this CPU need a deferred NOCB wakeup? */
if (rcu_nocb_need_deferred_wakeup(rdp))
return 1;
/* nothing to do */
return 0;
}
The rdp->cpu_no_qs.b.norm flag is written to false in rcu_sched_qs() right after a quiescent state is passed, so the "need to report a QS" test succeeds and 1 is returned. If callbacks are ready to be invoked, 1 is returned as well, meaning there is work to do; the same goes if inspecting the callback list shows that a new grace period is needed. A few other cases also return 1, which I have not fully untangled; if there is nothing to do, 0 is returned.
Implementation in the idle path
When a CPU enters idle, it eventually reaches rcu_idle_enter() to take care of the RCU side of things.
/**
* rcu_idle_enter - inform RCU that current CPU is entering idle
*
* Enter idle mode, in other words, -leave- the mode in which RCU
* read-side critical sections can occur. (Though RCU read-side
* critical sections can occur in irq handlers in idle, a possibility
* handled by irq_enter() and irq_exit().)
*
* If you add or remove a call to rcu_idle_enter(), be sure to test with
* CONFIG_RCU_EQS_DEBUG=y.
*/
void rcu_idle_enter(void)
{
lockdep_assert_irqs_disabled();
rcu_eqs_enter(false);
}
rcu_idle_enter() simply calls rcu_eqs_enter(), which marks this CPU as having entered an extended quiescent state (EQS). In this state the CPU can always be considered to have passed a quiescent state, so it effectively drops out of the global "have all CPUs passed a QS" bookkeeping. The kernel still performs a light check: once a CPU is in an extended quiescent state, the other CPUs report quiescent states on its behalf, so the CPU can stay idle (or run a guest, among other things) without ever being disturbed.
/*
* Enter an RCU extended quiescent state, which can be either the
* idle loop or adaptive-tickless usermode execution.
*
* We crowbar the ->dynticks_nmi_nesting field to zero to allow for
* the possibility of usermode upcalls having messed up our count
* of interrupt nesting level during the prior busy period.
*/
static void rcu_eqs_enter(bool user)
{
struct rcu_state *rsp;
struct rcu_data *rdp;
struct rcu_dynticks *rdtp;
rdtp = this_cpu_ptr(&rcu_dynticks);
WRITE_ONCE(rdtp->dynticks_nmi_nesting, 0);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
rdtp->dynticks_nesting == 0);
if (rdtp->dynticks_nesting != 1) {
rdtp->dynticks_nesting--;
return;
}
lockdep_assert_irqs_disabled();
trace_rcu_dyntick(TPS("Start"), rdtp->dynticks_nesting, 0, rdtp->dynticks);
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
for_each_rcu_flavor(rsp) {
rdp = this_cpu_ptr(rsp->rda);
do_nocb_deferred_wakeup(rdp);
}
rcu_prepare_for_idle(); // prepare for the EQS: 1) note whether nohz was changed via sysfs and record it in rdp; 2) if callbacks remain on this CPU, raise the softirq to process them
WRITE_ONCE(rdtp->dynticks_nesting, 0); /* Avoid irq-access tearing. */
rcu_dynticks_eqs_enter(); // key step: add RCU_DYNTICK_CTRL_CTR (0x2) to rdtp->dynticks; the GP kthread's FQS pass tests bit 1 to detect an EQS; bit 1 is used because bit 0 is reserved for other purposes
rcu_dynticks_task_enter();
}
The key step is rcu_dynticks_eqs_enter(), which adds RCU_DYNTICK_CTRL_CTR (0x2, i.e. bit 1) to this CPU's rcu_dynticks counter.
/*
* Record entry into an extended quiescent state. This is only to be
* called when not already in an extended quiescent state.
*/
static void rcu_dynticks_eqs_enter(void)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
int seq;
/*
* CPUs seeing atomic_add_return() must see prior RCU read-side
* critical sections, and we also must force ordering with the
* next idle sojourn.
*/
seq = atomic_add_return(RCU_DYNTICK_CTRL_CTR, &rdtp->dynticks);
/* Better be in an extended quiescent state! */
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
(seq & RCU_DYNTICK_CTRL_CTR));
/* Better not have special action (TLB flush) pending! */
WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
(seq & RCU_DYNTICK_CTRL_MASK));
}
So where is this rcu_dynticks value consumed? In the GP kthread:
rcu_gp_kthread() part two
->rcu_gp_fqs()
->force_qs_rnp(rsp, dyntick_save_progress_counter) // first FQS pass of this grace period
->dyntick_save_progress_counter() // returns 1 for a CPU sitting in an extended quiescent state
->rcu_report_qs_rnp() // force_qs_rnp() then reports the QS on that CPU's behalf
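For reference, the snapshot helper (slightly abridged here; check the 4.19 source for the exact form) is where the parity test described above shows up:
static int dyntick_save_progress_counter(struct rcu_data *rdp)
{
	rdp->dynticks_snap = rcu_dynticks_snap(rdp->dynticks);
	if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) {
		/* Bit 1 clear: the CPU is in an EQS, count the QS right away. */
		trace_rcu_fqs(rdp->rsp->name, rdp->gp_seq, rdp->cpu, TPS("dti"));
		rcu_gpnum_ovf(rdp->mynode, rdp);
		return 1;
	}
	return 0;
}

/* In an EQS iff the RCU_DYNTICK_CTRL_CTR bit of the snapshot is clear. */
static bool rcu_dynticks_in_eqs(int snap)
{
	return !(snap & RCU_DYNTICK_CTRL_CTR);
}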
Implementation at context switch
TODO
Reader- and writer-side implementation
The read and write sides come down to a handful of core functions; the RCU versions of the doubly-linked-list operations below serve as the example.
/*
* return the ->next pointer of a list_head in an rcu safe
* way, we must not access it directly
*/
#define list_next_rcu(list) (*((struct list_head __rcu **)(&(list)->next)))
/*
* Insert a new entry between two known consecutive entries.
*
* This is only for internal list manipulation where we know
* the prev/next entries already!
*/
static inline void __list_add_rcu(struct list_head *new,
struct list_head *prev, struct list_head *next)
{
if (!__list_add_valid(new, prev, next))
return;
new->next = next;
new->prev = prev;
rcu_assign_pointer(list_next_rcu(prev), new);
next->prev = new;
}
/**
* list_add_rcu - add a new entry to rcu-protected list
* @new: new entry to be added
* @head: list head to add it after
*
* Insert a new entry after the specified head.
* This is good for implementing stacks.
*
* The caller must take whatever precautions are necessary
* (such as holding appropriate locks) to avoid racing
* with another list-mutation primitive, such as list_add_rcu()
* or list_del_rcu(), running on this same list.
* However, it is perfectly legal to run concurrently with
* the _rcu list-traversal primitives, such as
* list_for_each_entry_rcu().
*/
static inline void list_add_rcu(struct list_head *new, struct list_head *head)
{
__list_add_rcu(new, head, head->next);
}
/**
* list_entry_rcu - get the struct for this entry
* @ptr: the &struct list_head pointer.
* @type: the type of the struct this is embedded in.
* @member: the name of the list_head within the struct.
*
* This primitive may safely run concurrently with the _rcu list-mutation
* primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
*/
#define list_entry_rcu(ptr, type, member) \
container_of(READ_ONCE(ptr), type, member)
/**
* list_for_each_entry_rcu - iterate over rcu list of given type
* @pos: the type * to use as a loop cursor.
* @head: the head for your list.
* @member: the name of the list_head within the struct.
*
* This list-traversal primitive may safely run concurrently with
* the _rcu list-mutation primitives such as list_add_rcu()
* as long as the traversal is guarded by rcu_read_lock().
*/
#define list_for_each_entry_rcu(pos, head, member) \
for (pos = list_entry_rcu((head)->next, typeof(*pos), member); \
&pos->member != (head); \
pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
Look closely at __list_add_rcu(): it differs from the non-RCU version. It first initializes the new node completely, then uses rcu_assign_pointer() to update the previous node's next pointer, and only afterwards updates the next node's prev pointer. The rcu_assign_pointer()/rcu_dereference() pair enforces the required ordering. As a result, by the time a reader can see the new node (whether traversing forward or backward), its prev and next pointers are already valid, which keeps the read side correct.
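Putting the read and write sides together, here is a hedged usage sketch (all names are hypothetical; a writer-side spinlock is assumed to serialize updaters):
struct foo {
	int key;
	struct list_head list;
	struct rcu_head rcu;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_lock);

/* Writers serialize against each other, never against readers. */
static void foo_add(struct foo *new)
{
	spin_lock(&foo_lock);
	list_add_rcu(&new->list, &foo_list);
	spin_unlock(&foo_lock);
}

static void foo_del(struct foo *victim)
{
	spin_lock(&foo_lock);
	list_del_rcu(&victim->list);
	spin_unlock(&foo_lock);
	kfree_rcu(victim, rcu);	/* freed only after a grace period elapses */
}

/* Reader: lockless traversal; entries must not be used after rcu_read_unlock(). */
static bool foo_contains(int key)
{
	struct foo *p;
	bool ret = false;

	rcu_read_lock();
	list_for_each_entry_rcu(p, &foo_list, list) {
		if (p->key == key) {
			ret = true;
			break;
		}
	}
	rcu_read_unlock();
	return ret;
}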
After the writer has finished updating, it must call synchronize_rcu() or call_rcu() to reclaim the old data. The two functions are implemented along the same lines.
/**
* synchronize_rcu - wait until a grace period has elapsed.
*
* Control will return to the caller some time after a full grace
* period has elapsed, in other words after all currently executing RCU
* read-side critical sections have completed. Note, however, that
* upon return from synchronize_rcu(), the caller might well be executing
* concurrently with new RCU read-side critical sections that began while
* synchronize_rcu() was waiting. RCU read-side critical sections are
* delimited by rcu_read_lock() and rcu_read_unlock(), and may be nested.
*
* See the description of synchronize_sched() for more detailed
* information on memory-ordering guarantees. However, please note
* that -only- the memory-ordering guarantees apply. For example,
* synchronize_rcu() is -not- guaranteed to wait on things like code
* protected by preempt_disable(), instead, synchronize_rcu() is -only-
* guaranteed to wait on RCU read-side critical sections, that is, sections
* of code protected by rcu_read_lock().
*/
void synchronize_rcu(void)
{
RCU_LOCKDEP_WARN(lock_is_held(&rcu_bh_lock_map) ||
lock_is_held(&rcu_lock_map) ||
lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_rcu() in RCU read-side critical section");
if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
return;
if (rcu_gp_is_expedited()) // expedited grace period
synchronize_rcu_expedited();
else
wait_rcu_gp(call_rcu);
}
EXPORT_SYMBOL_GPL(synchronize_rcu);
In the common case the else branch is taken.
#define _wait_rcu_gp(checktiny, ...) \
do { \
call_rcu_func_t __crcu_array[] = { __VA_ARGS__ }; \
struct rcu_synchronize __rs_array[ARRAY_SIZE(__crcu_array)]; \
__wait_rcu_gp(checktiny, ARRAY_SIZE(__crcu_array), \
__crcu_array, __rs_array); \
} while (0)
void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
struct rcu_synchronize *rs_array)
{
int i;
int j;
/* Initialize and register callbacks for each flavor specified. */
for (i = 0; i < n; i++) {
if (checktiny &&
(crcu_array[i] == call_rcu ||
crcu_array[i] == call_rcu_bh)) {
might_sleep();
continue;
}
init_rcu_head_on_stack(&rs_array[i].head);
init_completion(&rs_array[i].completion);
for (j = 0; j < i; j++)
if (crcu_array[j] == crcu_array[i])
break;
if (j == i)
(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
}
/* Wait for all callbacks to be invoked. */
for (i = 0; i < n; i++) {
if (checktiny &&
(crcu_array[i] == call_rcu ||
crcu_array[i] == call_rcu_bh))
continue;
for (j = 0; j < i; j++)
if (crcu_array[j] == crcu_array[i])
break;
if (j == i)
wait_for_completion(&rs_array[i].completion);
destroy_rcu_head_on_stack(&rs_array[i].head);
}
}
EXPORT_SYMBOL_GPL(__wait_rcu_gp);
/**
* wakeme_after_rcu() - Callback function to awaken a task after grace period
* @head: Pointer to rcu_head member within rcu_synchronize structure
*
* Awaken the corresponding task now that a grace period has elapsed.
*/
void wakeme_after_rcu(struct rcu_head *head)
{
struct rcu_synchronize *rcu;
rcu = container_of(head, struct rcu_synchronize, head);
complete(&rcu->completion);
}
EXPORT_SYMBOL_GPL(wakeme_after_rcu);
As the code shows, synchronize_rcu() is implemented on top of call_rcu(), with wakeme_after_rcu() as the callback: once the current grace period has elapsed, wakeme_after_rcu() wakes the process that called synchronize_rcu(), which can then free the associated resources safely. So how is call_rcu() itself implemented?
/**
* call_rcu() - Queue an RCU callback for invocation after a grace period.
* @head: structure to be used for queueing the RCU updates.
* @func: actual callback function to be invoked after the grace period
*
* The callback function will be invoked some time after a full grace
* period elapses, in other words after all pre-existing RCU read-side
* critical sections have completed. However, the callback function
* might well execute concurrently with RCU read-side critical sections
* that started after call_rcu() was invoked. RCU read-side critical
* sections are delimited by rcu_read_lock() and rcu_read_unlock(),
* and may be nested.
*
* Note that all CPUs must agree that the grace period extended beyond
* all pre-existing RCU read-side critical section. On systems with more
* than one CPU, this means that when "func()" is invoked, each CPU is
* guaranteed to have executed a full memory barrier since the end of its
* last RCU read-side critical section whose beginning preceded the call
* to call_rcu(). It also means that each CPU executing an RCU read-side
* critical section that continues beyond the start of "func()" must have
* executed a memory barrier after the call_rcu() but before the beginning
* of that RCU read-side critical section. Note that these guarantees
* include CPUs that are offline, idle, or executing in user mode, as
* well as CPUs that are executing in the kernel.
*
* Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
* resulting RCU callback function "func()", then both CPU A and CPU B are
* guaranteed to execute a full memory barrier during the time interval
* between the call to call_rcu() and the invocation of "func()" -- even
* if CPU A and CPU B are the same CPU (but again only if the system has
* more than one CPU).
*/
void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
__call_rcu(head, func, rcu_state_p, -1, 0);
}
EXPORT_SYMBOL_GPL(call_rcu);
/*
* Helper function for call_rcu() and friends. The cpu argument will
* normally be -1, indicating "currently running CPU". It may specify
* a CPU only if that CPU is a no-CBs CPU. Currently, only _rcu_barrier()
* is expected to specify a CPU.
*/
static void
__call_rcu(struct rcu_head *head, rcu_callback_t func,
struct rcu_state *rsp, int cpu, bool lazy)
{
unsigned long flags;
struct rcu_data *rdp;
/* Misaligned rcu_head! */
WARN_ON_ONCE((unsigned long)head & (sizeof(void *) - 1));
if (debug_rcu_head_queue(head)) {
/*
* Probable double call_rcu(), so leak the callback.
* Use rcu:rcu_callback trace event to find the previous
* time callback was passed to __call_rcu().
*/
WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pF()!!!\n",
head, head->func);
WRITE_ONCE(head->func, rcu_leak_callback);
return;
}
head->func = func;
head->next = NULL;
local_irq_save(flags);
rdp = this_cpu_ptr(rsp->rda);
/* Add the callback to our list. */
if (unlikely(!rcu_segcblist_is_enabled(&rdp->cblist)) || cpu != -1) {
int offline;
if (cpu != -1)
rdp = per_cpu_ptr(rsp->rda, cpu);
if (likely(rdp->mynode)) {
/* Post-boot, so this should be for a no-CBs CPU. */
offline = !__call_rcu_nocb(rdp, head, lazy, flags);
WARN_ON_ONCE(offline);
/* Offline CPU, _call_rcu() illegal, leak callback. */
local_irq_restore(flags);
return;
}
/*
* Very early boot, before rcu_init(). Initialize if needed
* and then drop through to queue the callback.
*/
BUG_ON(cpu != -1);
WARN_ON_ONCE(!rcu_is_watching());
if (rcu_segcblist_empty(&rdp->cblist))
rcu_segcblist_init(&rdp->cblist);
}
rcu_segcblist_enqueue(&rdp->cblist, head, lazy); // append at the very end of the list
if (!lazy)
rcu_idle_count_callbacks_posted();
if (__is_kfree_rcu_offset((unsigned long)func))
trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func,
rcu_segcblist_n_lazy_cbs(&rdp->cblist),
rcu_segcblist_n_cbs(&rdp->cblist));
else
trace_rcu_callback(rsp->name, head,
rcu_segcblist_n_lazy_cbs(&rdp->cblist),
rcu_segcblist_n_cbs(&rdp->cblist));
/* Go handle any RCU core processing required. */
__call_rcu_core(rsp, rdp, head, flags);
local_irq_restore(flags);
}
/*
* Handle any core-RCU processing required by a call_rcu() invocation.
*/
static void __call_rcu_core(struct rcu_state *rsp, struct rcu_data *rdp,
struct rcu_head *head, unsigned long flags)
{
/*
* If called from an extended quiescent state, invoke the RCU
* core in order to force a re-evaluation of RCU's idleness.
*/
if (!rcu_is_watching())
invoke_rcu_core();
/* If interrupts were disabled or CPU offline, don't invoke RCU core. */
if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id()))
return;
/*
* Force the grace period if too many callbacks or too long waiting.
* Enforce hysteresis, and don't invoke force_quiescent_state()
* if some other CPU has recently done so. Also, don't bother
* invoking force_quiescent_state() if the newly enqueued callback
* is the only one waiting for a grace period to complete.
*/
if (unlikely(rcu_segcblist_n_cbs(&rdp->cblist) >
rdp->qlen_last_fqs_check + qhimark)) { // too many callbacks queued: give the grace-period machinery a push
/* Are we ignoring a completed grace period? */
note_gp_changes(rsp, rdp);
/* Start a new grace period if one not already started. */
if (!rcu_gp_in_progress(rsp)) {
rcu_accelerate_cbs_unlocked(rsp, rdp->mynode, rdp); // no GP in progress: wake the GP kthread to start one
} else {
/* Give the grace period a kick. */
rdp->blimit = LONG_MAX;
if (rsp->n_force_qs == rdp->n_force_qs_snap &&
rcu_segcblist_first_pend_cb(&rdp->cblist) != head)
force_quiescent_state(rsp); // GP in progress: tell the GP kthread to force quiescent states
rdp->n_force_qs_snap = rsp->n_force_qs;
rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
}
}
}
call_rcu() does two things. First, it registers the callback; callbacks are kept in the struct rcu_segcblist cblist field of rcu_data. Second, it decides whether a new grace period needs to be started (by notifying the GP kthread). As analyzed above, the callbacks themselves are invoked from softirq context.
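For completeness, a minimal hedged sketch of using call_rcu() directly (reusing the hypothetical struct foo from the list example above, which embeds a struct rcu_head):
static void foo_reclaim(struct rcu_head *rcu)
{
	struct foo *p = container_of(rcu, struct foo, rcu);

	kfree(p);	/* invoked from softirq after a full grace period */
}

static void foo_retire(struct foo *victim)
{
	/* victim is already unlinked, so no new reader can find it. */
	call_rcu(&victim->rcu, foo_reclaim);
}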
Summary
RCU is a very reader-friendly mechanism: readers incur almost no locking overhead. The trade-off is that readers must verify that the data they see is still valid, and writers pay a higher cost.
For efficiency, RCU builds a set of elaborate, finely tuned data structures, and then keeps detecting and reporting quiescent states from the tick and a few other paths, driving grace periods to completion. Calls that must wait for a grace period, such as call_rcu() and synchronize_rcu(), in turn prompt the system to start detecting a new one. The code is complex, but once the overall flow is clear, much of it falls into place.