Linux Spinlocks (2): Queued Spinlocks

wulaladamowang

Last modified 2024-01-01 22:06:19

First published 2024-01-01 22:03:19

Article link: https://blog.csdn.net/wulaladamowang/article/details/135329036

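This article describes the queued spinlock in the Linux kernel, a refinement of the MCS lock that improves efficiency under contention by reducing memory allocation, compacting the lock structure (for example, the pending bit), and streamlining the spin logic. It walks through how the lock is acquired and released, and how it relates to raw_spin_lock.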

Queued Spinlocks

Preface

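Part one, [Linux Spinlocks (1)], outlined how spinlocks work at the lowest level: by controlling the memory controller or the shared bus, a test-and-set on a memory variable is carried out atomically. It also introduced the MCS lock, which was built on that foundation to address unfairness and cache-line bouncing. This part builds on that to describe the queued spinlock used in current kernel versions.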

Queued Spinlocks

Relation to the MCS Lock

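The queued spinlock evolved from the MCS lock. With an MCS lock, a CPU that wants the lock has to create its own local node (a per-waiter copy of the lock). However, the number of lock acquisitions that can be in flight on a single CPU is bounded: there are only four contexts (task, softirq, hardirq, and NMI), and each can be preempted by the ones after it, which is how nested acquisitions arise. The nodes can therefore be pre-allocated, which avoids some allocation overhead. The queued spinlock's struct qnode contains a struct mcs_spinlock: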

struct qnode {
	struct mcs_spinlock mcs;
#ifdef CONFIG_PARAVIRT_SPINLOCKS
	long reserved[2];
#endif
};

/*
 * Per-CPU queue node structures; we can never have more than 4 nested
 * contexts: task, softirq, hardirq, nmi.
 *
 * Exactly fits one 64-byte cacheline on a 64-bit architecture.
 *
 * PV doubles the storage and uses the second cacheline for PV state.
 */
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

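With an MCS lock, every CPU first sets up a local node and then spins on the spin variable inside that node. The queued spinlock instead lets the first waiter spin directly on the global lock word, and falls back to the per-CPU nodes only when several CPUs contend and a queue has to form, which shortens the common path.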
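For comparison, the classic MCS scheme described above can be sketched in user-space C11. This is only an illustration, not the kernel's code: struct mcs_node, struct mcs_lock, and both function names are hypothetical, and the caller supplies the node that the kernel would take from its per-CPU qnodes array.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space node; the kernel's struct mcs_spinlock differs. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_int locked;               /* set to 1 when the lock is handed to us */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail; /* NULL when the lock is free */
};

static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *me)
{
    atomic_store_explicit(&me->next, (struct mcs_node *)NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, 0, memory_order_relaxed);

    /* Swap ourselves in as the new tail; the old tail is our predecessor. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                      /* queue was empty: lock acquired */

    atomic_store_explicit(&prev->next, me, memory_order_release);
    /* Spin on our own node only, not on the shared lock. */
    while (!atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *me)
{
    struct mcs_node *next =
        atomic_load_explicit(&me->next, memory_order_acquire);

    if (next == NULL) {
        struct mcs_node *expected = me;
        /* No visible successor: try to mark the lock free. */
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, (struct mcs_node *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor swapped the tail but has not linked itself yet. */
        while ((next = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&next->locked, 1, memory_order_release);
}
```

Every acquisition here needs a node, even when the lock is uncontended. That is the cost the queued spinlock avoids by letting the first waiter spin on the lock word itself.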

Relation to raw_spin_lock

Before looking at the code, consider how the commonly used raw_spin_lock relates to the queued spinlock:

int a = 1;
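static DEFINE_RAW_SPINLOCK(sf_lock);	/* defines and initializes the lock; a bare raw_spinlock_t would start uninitialized */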
raw_spin_lock(&sf_lock);
a = a + 1;
raw_spin_unlock(&sf_lock);

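The snippet above is the typical usage. As the structure definition below shows, raw_spinlock_t is really a wrapper around qspinlock with some added debug information: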

typedef struct raw_spinlock {
	arch_spinlock_t raw_lock; // arch_spinlock_t is a typedef of struct qspinlock
#ifdef CONFIG_DEBUG_SPINLOCK
	unsigned int magic, owner_cpu;
	void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;
#endif
} raw_spinlock_t;

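In asm-generic, raw_spin_lock eventually calls queued_spin_lock, the central function of the queued spinlock. At the arch_spin_lock step, the object being operated on changes from the raw_spinlock_t to its member of type struct qspinlock.
[Figure: call chain from raw_spin_lock to queued_spin_lock; image not preserved]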

Code Logic

Lock Structures

Before starting, look at the two structures behind the queued spinlock: qspinlock and qnode.

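qnode simply wraps an mcs_spinlock. On one CPU, spinlock acquisitions can nest across task context, softirq, hardirq, and NMI, so a per-CPU variable qnodes is defined that pre-allocates four nodes for each CPU, which avoids the cost of allocating memory.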

struct qnode {
	struct mcs_spinlock mcs;
#ifdef CONFIG_PARAVIRT_SPINLOCKS
	long reserved[2];
#endif
};

#define MAX_NODES 4
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

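Compared with MCS, the global lock word of qspinlock is compressed. Within the union, locked indicates whether some CPU currently holds the lock, tail implements the queue, and the pending field serves as an optimization: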

typedef struct qspinlock{
	union{
		atomic_t val;

		/*
		 * By using the whole 2nd least significant byte for the
		 * pending bit, we can allow better optimization of the lock
		 * acquisition for the pending bit holder.
		 */
#ifdef __LITTLE_ENDIAN
		struct{
			u8	locked;
			u8	pending;
		};
		struct{
			u16	locked_pending;
			u16	tail;
		};
#else
		struct{
			u16	tail;
			u16	locked_pending;
		};
		struct{
			u8	reserved[2];
			u8	pending;
			u8	locked;
		};
#endif
	};
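} arch_spinlock_t;

The qspinlock structure above is a union, and a diagram can be drawn from its definition: the structures inside the dashed box share the same 32 bits, so writing through different members conveniently updates different bit ranges.
[Figure: layout of the qspinlock union; image not preserved]
The code below lists the macro definitions; the lock state is manipulated through these masks:

/*
 * Bitfields in the atomic value: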
 *
 * When NR_CPUS < 16K
 *  0- 7: locked byte
 *     8: pending
 *  9-15: not used
 * 16-17: tail index
 * 18-31: tail cpu (+1)
 *
 * When NR_CPUS >= 16K
 *  0- 7: locked byte
 *     8: pending
 *  9-10: tail index
 * 11-31: tail cpu (+1)
 */
#define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
				      << _Q_ ## type ## _OFFSET)
#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		8
#define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)

#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
#if CONFIG_NR_CPUS < (1U << 14)
#define _Q_PENDING_BITS		8
#else
#define _Q_PENDING_BITS		1
#endif
#define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)

#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
#define _Q_TAIL_IDX_BITS	2
#define _Q_TAIL_IDX_MASK	_Q_SET_MASK(TAIL_IDX)

#define _Q_TAIL_CPU_OFFSET	(_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
#define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)

#define _Q_TAIL_OFFSET		_Q_TAIL_IDX_OFFSET
#define _Q_TAIL_MASK		(_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)

#define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)

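The comment above gives the layout of the lock word, which the following diagram illustrates:
[Figure: bit layout of the lock word; image not preserved]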
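The layout can also be checked by decoding a raw lock word by hand. The sketch below assumes the NR_CPUS < 16K layout from the comment; struct qword_fields and qword_decode are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

/* Masks for the NR_CPUS < 16K layout described in the comment above. */
#define Q_LOCKED_MASK    0x000000ffu   /* bits  0-7           */
#define Q_PENDING_MASK   0x0000ff00u   /* bits  8-15 (bit 8 is the flag) */
#define Q_TAIL_IDX_MASK  0x00030000u   /* bits 16-17          */
#define Q_TAIL_CPU_MASK  0xfffc0000u   /* bits 18-31, cpu + 1 */

/* Hypothetical decoded view of the 32-bit lock word. */
struct qword_fields {
    unsigned locked;
    unsigned pending;
    unsigned tail_idx;
    int      tail_cpu;   /* -1 means "no tail", i.e. the queue is empty */
};

static struct qword_fields qword_decode(uint32_t val)
{
    struct qword_fields f;

    f.locked   = val & Q_LOCKED_MASK;
    f.pending  = (val & Q_PENDING_MASK) >> 8;
    f.tail_idx = (val & Q_TAIL_IDX_MASK) >> 16;
    /* The stored value is cpu + 1 so that 0 can mean "nobody queued". */
    f.tail_cpu = (int)((val & Q_TAIL_CPU_MASK) >> 18) - 1;
    return f;
}
```

For example, 0x00000101 decodes to locked and pending set with no tail, while 0x00050001 decodes to a held lock whose queue tail is node 1 of cpu 0.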

Acquiring and Releasing the Lock

When several CPUs contend for the lock, the global qspinlock and the per-CPU nodes change as follows:

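Initially, no CPU holds the lock.
cpu3 acquires the lock and sets locked to 1 directly, showing that a CPU holds it.
cpu2 tries to acquire the lock. The global locked is 1, so some CPU holds it; pending and tail are both 0, so no CPU is waiting on the pending bit and none is queued. cpu2 sets pending to 1 and spins on locked.
cpu0 now contends for the lock. Both locked and pending are 1, meaning the lock is held and someone is already waiting, so the queueing mechanism kicks in. cpu0 first picks one of its qnodes that is not in use, say node 1, then builds a tail (the tail cpu field is cpu_idx + 1, to distinguish it from 0) and swaps it into the tail of the global lock word, recording that cpu0 is now the queue tail and that node 1 is the node associated with this lock. The count in node 0 acts as a counter of nodes in use; if more than 4 would be needed (which most likely means something has already gone wrong), further acquisitions on that CPU do not queue and contend directly on the global lock word. locked in cpu0's node 1 is initialized to 0; since cpu0 is the first CPU in the queue, it spins waiting for pending and locked to both become 0.
When cpu3 releases the lock, it sets locked in the global lock word to 0. cpu2, which was spinning on the global lock word, then takes the lock, setting locked and clearing pending to 0. Note that tail is not 0 at this point, meaning CPUs are queued, so later arrivals may neither spin on the global locked nor set pending. In other words, pending can only be set when locked is 1 and both pending and tail are 0, that is, when no CPU is waiting in the queue.
cpu1 tries to acquire the lock. Because tail is non-zero it has to queue: it finds an unused local node and builds a tail, swaps that into the tail of the global lock word and gets old_tail back, uses old_tail to locate the previous queue tail, and links itself behind it. In this example the previous tail is cpu0, node 1.
When cpu2 releases the lock, it sets locked to 0. Now pending and locked are both 0, so cpu0 takes the lock and decrements count. It checks whether the tail in the global lock word still points at itself; it does not, so another CPU is waiting behind it. cpu0 finds cpu1 and sets locked in cpu1's node to 1, which makes cpu1 the queue head. cpu1 stops spinning on its node and spins waiting for locked and pending to become 0.
When cpu0 releases the lock, cpu1 takes it. cpu1 sees that tail still points at itself and sets tail to 0, so later arrivals can again use the pending bit to spin.
When cpu1 releases the lock, the global lock returns to the unlocked state.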
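The sequence above can be replayed as plain arithmetic on the 32-bit lock word. The sketch below is a single-threaded replay under the NR_CPUS < 16K layout, not a concurrent implementation; tail_of and replay_walkthrough are hypothetical names, and cpu1 is assumed to use its node 0 because the text does not say which node it picks.

```c
#include <stdint.h>

#define Q_LOCKED_VAL   0x00000001u
#define Q_PENDING_VAL  0x00000100u
#define Q_LOCKED_MASK  0x000000ffu
#define Q_TAIL_MASK    0xffff0000u

/* Tail encoding: (cpu + 1) in bits 18-31, node index in bits 16-17. */
static uint32_t tail_of(int cpu, int idx)
{
    return ((uint32_t)(cpu + 1) << 18) | ((uint32_t)idx << 16);
}

/* Records the lock word after each step of the walk-through in t[0..8]. */
static void replay_walkthrough(uint32_t t[9])
{
    uint32_t val = 0;

    t[0] = val;                                  /* free: (0,0,0)              */
    val |= Q_LOCKED_VAL;
    t[1] = val;                                  /* cpu3 takes the lock        */
    val |= Q_PENDING_VAL;
    t[2] = val;                                  /* cpu2 sets pending          */
    val = (val & ~Q_TAIL_MASK) | tail_of(0, 1);
    t[3] = val;                                  /* cpu0 queues with node 1    */
    val &= ~Q_LOCKED_MASK;                       /* cpu3 unlocks               */
    val = val - Q_PENDING_VAL + Q_LOCKED_VAL;
    t[4] = val;                                  /* cpu2: pending -> locked    */
    val = (val & ~Q_TAIL_MASK) | tail_of(1, 0);
    t[5] = val;                                  /* cpu1 queues behind cpu0    */
    val &= ~Q_LOCKED_MASK;                       /* cpu2 unlocks               */
    val |= Q_LOCKED_VAL;
    t[6] = val;                                  /* cpu0 locks; tail is kept   */
    val &= ~Q_LOCKED_MASK;                       /* cpu0 unlocks               */
    val = Q_LOCKED_VAL;
    t[7] = val;                                  /* cpu1 is tail: clears it    */
    val &= ~Q_LOCKED_MASK;
    t[8] = val;                                  /* cpu1 unlocks: free again   */
}
```

The recorded values are 0x0, 0x1, 0x101, 0x50101, 0x50001, 0x80001, 0x80001, 0x1, and 0x0. The tail only returns to 0 when the last queued CPU takes the lock.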

Walking Through the Details

In the code comments, triples such as (0, 0, *) are the (tail, pending, locked) fields described above; * stands for any state.

/**
 * queued_spin_lock - acquire a queued spinlock
 * @lock: Pointer to queued spinlock structure
 */
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	int val = 0;

	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}

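atomic_try_cmpxchg_acquire is an atomic operation that compares the value of its first argument with the value pointed to by its second. If they are equal, it stores the third argument into the first and returns true. If they differ, it copies the current value of the first argument into the second, leaves the first argument unchanged, and returns false. For the underlying implementation, see the spinlock implementation covered in part one.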
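In queued_spin_lock, the temporary val is 0. If lock->val is also 0, nobody holds the lock, so the current CPU takes it directly by storing _Q_LOCKED_VAL, which marks the lock as held. If the lock is already held, val now contains the lock's current value and is passed on to queued_spin_lock_slowpath.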
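The try-cmpxchg behaviour can be reproduced with C11 atomics, whose atomic_compare_exchange_strong follows the same convention of updating the expected value on failure. The two functions below are hypothetical user-space stand-ins, not the kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for atomic_try_cmpxchg_acquire(), built on C11. */
static bool try_cmpxchg_acquire(atomic_int *v, int *old, int new_val)
{
    /* On failure *old receives the current value and *v is left untouched. */
    return atomic_compare_exchange_strong_explicit(v, old, new_val,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/*
 * Fast path modelled on queued_spin_lock(): returns true when the lock was
 * free and has been taken. Otherwise *seen holds the observed lock word,
 * which the real code would hand to the slow path.
 */
static bool fast_path_lock(atomic_int *lock_val, int *seen)
{
    *seen = 0;
    return try_cmpxchg_acquire(lock_val, seen, 1 /* _Q_LOCKED_VAL */);
}
```

Because the failed compare already returns the current lock word, the slow path can start from that value without reading the lock again.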

	/*
	 * Wait for in-progress pending->locked hand-overs with a bounded
	 * number of spins so that we guarantee forward progress.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;
		val = atomic_cond_read_relaxed(&lock->val,
					       (VAL != _Q_PENDING_VAL) || !cnt--);
	}

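Optimization: if the lock word is exactly pending-only (0,1,0), the CPU that owns pending is in the middle of taking over the lock and nobody is queued. Rather than queue immediately, re-read the lock word a bounded number of times and wait for the hand-over to 0,0,1 to complete.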
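A bounded wait of this kind might look as follows in user-space C11. wait_pending_handover and its max_spins parameter are hypothetical; the kernel expresses the same idea with atomic_cond_read_relaxed and _Q_PENDING_LOOPS.

```c
#include <stdatomic.h>
#include <stdint.h>

#define Q_PENDING_VAL 0x00000100u

/*
 * Re-read the lock word while it is exactly "pending only", giving up
 * after max_spins reads so that forward progress is guaranteed.
 * Returns the last value observed.
 */
static uint32_t wait_pending_handover(atomic_uint *lock_val, int max_spins)
{
    uint32_t val = atomic_load_explicit(lock_val, memory_order_relaxed);

    while (val == Q_PENDING_VAL && max_spins-- > 0)
        val = atomic_load_explicit(lock_val, memory_order_relaxed);
    return val;
}
```

If the budget runs out while the word still reads pending-only, the caller sees contention in the returned value and goes on to queue.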

	/*
	 * If we observe any contention; queue.
	 */
	if (val & ~_Q_LOCKED_MASK)  // is pending or tail set?
		goto queue;

	/*
	 * trylock || pending
	 *
	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
	 */
	val = queued_fetch_set_pending_acquire(lock);

	/*
	 * If we observe contention, there is a concurrent locker.
	 *
	 * Undo and queue; our setting of PENDING might have made the
	 * n,0,0 -> 0,0,0 transition fail and it will now be waiting
	 * on @next to become !NULL.
	 */
	if (unlikely(val & ~_Q_LOCKED_MASK)) {

		/* Undo PENDING if we set it. */
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);

		goto queue;
	}

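First check whether pending or tail is already set; if so, go straight to the queue. Otherwise, try to grab the pending bit. The unlikely check afterwards exists because several CPUs may be racing at the same moment: by the time this CPU sets pending, another CPU may already have set pending or tail. queued_fetch_set_pending_acquire returns the old value and sets the pending bit. If the old value shows contention, this CPU must queue. If pending was not set in that old value, the pending bit now in the lock word is the one this CPU just set, so it must be undone with clear_pending before queueing. If pending was already set, it belongs to another CPU and must be left alone.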
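The set-then-undo logic can be sketched with a C11 fetch-or. try_set_pending is a hypothetical helper; the masks assume the NR_CPUS < 16K layout, and the fetch-or and fetch-and stand in for queued_fetch_set_pending_acquire and clear_pending.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_MASK   0x000000ffu
#define Q_PENDING_VAL   0x00000100u
#define Q_PENDING_MASK  0x0000ff00u

/*
 * Returns true if the caller now owns the pending bit.
 * Returns false if it observed contention and has to queue instead.
 */
static bool try_set_pending(atomic_uint *lock_val)
{
    /* Set pending and look at what the word contained just before. */
    uint32_t old = atomic_fetch_or_explicit(lock_val, Q_PENDING_VAL,
                                            memory_order_acquire);

    if (old & ~Q_LOCKED_MASK) {
        /* Pending was clear before, so the bit in the word is ours: undo. */
        if (!(old & Q_PENDING_MASK))
            atomic_fetch_and_explicit(lock_val, ~Q_PENDING_VAL,
                                      memory_order_relaxed);
        return false;
    }
    return true;
}
```

Starting from 0x1 the call succeeds and leaves 0x101. Starting from 0x101 or 0x50001 it fails and leaves the word exactly as it found it.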

	/*
	 * We're pending, wait for the owner to go away.
	 *
	 * 0,1,1 -> *,1,0
	 *
	 * this wait loop must be a load-acquire such that we match the
	 * store-release that clears the locked bit and create lock
	 * sequentiality; this is because not all
	 * clear_pending_set_locked() implementations imply full
	 * barriers.
	 */
	if (val & _Q_LOCKED_MASK)
		smp_cond_load_acquire(&lock->locked, !VAL);// spin until the global locked byte is 0

	/*
	 * take ownership and clear the pending bit.
	 *
	 * 0,1,0 -> 0,0,1
	 */
	clear_pending_set_locked(lock);
	lockevent_inc(lock_pending);
	return;

At this point the current CPU owns the pending bit and spins until locked becomes 0. Once it can take the lock, it clears pending and sets locked.
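The hand-over from pending to locked can be done in a single atomic add, which also leaves any tail bits untouched. pending_to_locked is a hypothetical user-space sketch of that step, assuming the caller already owns the pending bit.

```c
#include <stdatomic.h>

#define Q_LOCKED_VAL   0x00000001u
#define Q_PENDING_VAL  0x00000100u
#define Q_LOCKED_MASK  0x000000ffu

/* Caller owns pending: wait for the owner to leave, then take the lock. */
static void pending_to_locked(atomic_uint *lock_val)
{
    /* Acquire load pairs with the release store done by the unlocker. */
    while (atomic_load_explicit(lock_val, memory_order_acquire) & Q_LOCKED_MASK)
        ;

    /*
     * One add performs (*,1,0) -> (*,0,1): it subtracts the pending bit
     * and adds the locked bit, relying on unsigned wrap-around.
     */
    atomic_fetch_add_explicit(lock_val, Q_LOCKED_VAL - Q_PENDING_VAL,
                              memory_order_relaxed);
}
```

Only the owner of pending may run this, so nobody else can set locked between the wait loop and the add.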
The next part shows how queueing works:

queue:
	lockevent_inc(lock_slowpath);
pv_queue:
	node = this_cpu_ptr(&qnodes[0].mcs);
	idx = node->count++;
	tail = encode_tail(smp_processor_id(), idx);

	trace_contention_begin(lock, LCB_F_SPIN);

	/*
	 * 4 nodes are allocated based on the assumption that there will
	 * not be nested NMIs taking spinlocks. That may not be true in
	 * some architectures even though the chance of needing more than
	 * 4 nodes will still be extremely unlikely. When that happens,
	 * we fall back to spinning on the lock directly without using
	 * any MCS node. This is not the most elegant solution, but is
	 * simple enough.
	 */
	if (unlikely(idx >= MAX_NODES)) { // has this CPU used up all of its nodes?
		lockevent_inc(lock_no_node);
		while (!queued_spin_trylock(lock)) // contend directly on the global lock word
			cpu_relax();
		goto release;
	}

	node = grab_mcs_node(node, idx);

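First find a node on the local CPU that is not in use, and encode it into a tail with encode_tail. If too many acquisitions are nested on this CPU and all four nodes are taken, skip the queue and contend for the lock directly on the global lock word.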
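Node selection and tail encoding can be sketched as below. struct cpu_nodes, grab_node, and release_node are hypothetical and model a single CPU's storage in user space; in the kernel the counter lives in node 0 of the per-CPU qnodes array. The offsets assume the NR_CPUS < 16K layout.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_NODES          4
#define Q_TAIL_IDX_OFFSET  16
#define Q_TAIL_CPU_OFFSET  18

struct node { int locked; struct node *next; };

/* Hypothetical model of one CPU's pre-allocated nodes. */
struct cpu_nodes {
    int count;                       /* nesting depth on this CPU */
    struct node nodes[MAX_NODES];
};

/* cpu + 1 is stored so that a tail of 0 can mean "queue empty". */
static uint32_t encode_tail(int cpu, int idx)
{
    return ((uint32_t)(cpu + 1) << Q_TAIL_CPU_OFFSET) |
           ((uint32_t)idx << Q_TAIL_IDX_OFFSET);
}

/*
 * Claim the next free node and compute its tail code. Returns NULL when
 * all four are in use; the caller then spins on the lock word directly.
 * count is incremented in both cases, so release_node() is always needed.
 */
static struct node *grab_node(struct cpu_nodes *pc, int cpu, uint32_t *tail)
{
    int idx = pc->count++;

    if (idx >= MAX_NODES)
        return NULL;
    *tail = encode_tail(cpu, idx);
    return &pc->nodes[idx];
}

static void release_node(struct cpu_nodes *pc)
{
    pc->count--;
}
```

Since a context is only ever interrupted by a higher one, nodes are handed out and returned in stack order, which is why a simple counter is enough.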

	/*
	 * Publish the updated tail.
	 * We have already touched the queueing cacheline; don't bother with
	 * pending stuff.
	 *
	 * p,*,* -> n,*,*
	 */
	old = xchg_tail(lock, tail);// swap this CPU's node into the global tail
	next = NULL;

	/*
	 * if there was a previous node; link it and wait until reaching the
	 * head of the waitqueue.
	 */
	if (old & _Q_TAIL_MASK) { // other CPUs are already queued
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);// spin on the local node's locked

		/*
		 * While waiting for the MCS lock, the next pointer may have
		 * been set by another lock waiter. We optimistically load
		 * the next pointer & prefetch the cacheline for writing
		 * to reduce latency in the upcoming MCS unlock operation.
		 */
		next = READ_ONCE(node->next);
		if (next)
			prefetchw(next);
	}
	val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));// queue head spins until global locked and pending are both 0

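The initialization of node is omitted here. The tail in the global lock word is set to the encoded tail of this CPU's node, and old is used to place this CPU at the end of the queue: decode_tail resolves the previous tail, and prev->next is pointed at this CPU's node. arch_mcs_spin_lock_contended then spins on the local node, waiting for the predecessor in the queue to set that node's locked. Once the local spin ends, prefetchw warms the cache line of the next waiter, if one is already linked, as an optimization for the later hand-over. By the last line of the code above, this CPU has reached the head of the queue and waits for locked and pending in the global lock word to become 0.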
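Publishing the tail must replace only the tail bits and keep locked and pending as they are. One way to sketch this in user-space C11 is a compare-exchange loop; xchg_tail_sketch is a hypothetical name, and the acq_rel ordering is a conservative choice for the sketch rather than what the kernel uses.

```c
#include <stdatomic.h>
#include <stdint.h>

#define Q_LOCKED_PENDING_MASK  0x0000ffffu
#define Q_TAIL_MASK            0xffff0000u

/*
 * Install `tail` as the new queue tail, preserving locked and pending.
 * Returns the previous lock word; its tail bits name our predecessor,
 * or are 0 if we are the first CPU in the queue.
 */
static uint32_t xchg_tail_sketch(atomic_uint *lock_val, uint32_t tail)
{
    unsigned int old = atomic_load_explicit(lock_val, memory_order_relaxed);
    unsigned int new_val;

    do {
        new_val = (old & Q_LOCKED_PENDING_MASK) | tail;
        /* On failure `old` is refreshed with the current lock word. */
    } while (!atomic_compare_exchange_weak_explicit(lock_val, &old, new_val,
                                                    memory_order_acq_rel,
                                                    memory_order_relaxed));
    return old;
}
```

A caller would test `old & Q_TAIL_MASK`: zero means it is already the queue head, non-zero means it must link behind the node that the old tail encodes.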

locked:
	/*
	 * claim the lock:
	 *
	 * n,0,0 -> 0,0,1 : lock, uncontended
	 * *,*,0 -> *,*,1 : lock, contended
	 *
	 * If the queue head is the only one in the queue (lock value == tail)
	 * and nobody is pending, clear the tail code and grab the lock.
	 * Otherwise, we only need to grab the lock.
	 */

	/*
	 * In the PV case we might already have _Q_LOCKED_VAL set, because
	 * of lock stealing; therefore we must also allow:
	 *
	 * n,0,1 -> 0,0,1
	 *
	 * Note: at this point: (val & _Q_PENDING_MASK) == 0, because of the
	 *       above wait condition, therefore any concurrent setting of
	 *       PENDING will make the uncontended transition fail.
	 */
	if ((val & _Q_TAIL_MASK) == tail) { // this CPU is still the queue tail
		if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
			goto release; /* No contention */
	}

	/*
	 * Either somebody is queued behind us or _Q_PENDING_VAL got set
	 * which will then detect the remaining tail and queue behind us
	 * ensuring we'll see a @next.
	 */
	set_locked(lock);

	/*
	 * contended path; wait for next if not observed yet, release.
	 */
	if (!next) // a successor has swapped the tail but not yet linked itself through prev->next
		next = smp_cond_load_relaxed(&node->next, (VAL));

	arch_mcs_spin_unlock_contended(&next->locked);
	pv_kick_node(lock, next);

release:
	trace_contention_end(lock, 0);

	/*
	 * release the node
	 */
	__this_cpu_dec(qnodes[0].mcs.count);

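The current CPU can now take the lock. It first checks the tail in the global lock word to see whether any CPU is queued behind it. If not, it takes the lock and clears the tail in one step. Otherwise it sets locked in the lock word and then unlocks the locked field in next's node (sets it to 1), so that the next CPU becomes the queue head and can start contending on the global lock word. Because swapping the tail in the global lock word and linking into prev->next are not a single atomic operation, a CPU may already be queueing without having linked itself yet; in that case the current CPU waits briefly until the successor appears. In both cases it finally decrements mcs.count, the count of nodes in use.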
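The decision made by the queue head can be sketched as follows. claim_lock is a hypothetical helper; the fetch-or stands in for the kernel's set_locked, which is a plain store to the locked byte.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define Q_LOCKED_VAL  0x00000001u
#define Q_TAIL_MASK   0xffff0000u

/*
 * Called by the queue head once locked and pending are both clear.
 * `val` is the lock word it last observed, `my_tail` its own tail code.
 * Returns true if the queue is now empty, false if a successor exists
 * and still has to be woken through its node.
 */
static bool claim_lock(atomic_uint *lock_val, unsigned int val, uint32_t my_tail)
{
    if ((val & Q_TAIL_MASK) == my_tail) {
        /* We look like the last waiter: take the lock and clear the tail. */
        if (atomic_compare_exchange_strong_explicit(lock_val, &val,
                                                    Q_LOCKED_VAL,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed))
            return true;
    }
    /* Someone queued behind us (or raced in): set locked, keep the tail. */
    atomic_fetch_or_explicit(lock_val, Q_LOCKED_VAL, memory_order_relaxed);
    return false;
}
```

The compare-exchange can fail even when the observed tail matched, because another CPU may have queued or set pending in the meantime. The fallback handles that case by leaving the rest of the word intact.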
That completes the whole acquisition path. Releasing the lock simply stores 0 to the locked byte, so the logic is straightforward:

static __always_inline void queued_spin_unlock(struct qspinlock *lock)
{
	/* set the locked field of the spinlock to 0 */
	smp_store_release(&lock->locked, 0);
}

Summary

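The CPU that owns the pending bit spins on the locked field.
The CPU at the head of the queue spins until locked and pending are both 0.
CPUs further back in the queue spin on locked in their own node; once a CPU reaches the head of the queue, it spins on the locked and pending bits.
Taken together, CPUs waiting for one lock can be stuck at three different places in the code. Under heavy contention, though, the queue is usually non-empty, so waiters are mostly spinning either on their local node or on the locked and pending bits.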