3.1 - Concurrency Control: Mutual Exclusion

绿洲213

Published 2023-02-08 17:59:38

Column: jyy操作系统2022 (jyy Operating Systems 2022). Tags: java, linux, algorithms

Article link: https://blog.csdn.net/weixin_46227276/article/details/128940231

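This article explores how to implement mutual exclusion among threads on multiprocessor systems, covering spin locks and mutex locks. Spin locks rely on atomic operations such as those enabled by the x86 LOCK prefix, but under heavy contention they can degrade performance and waste resources. Mutex locks are implemented through system calls and let a thread suspend when it cannot acquire the lock, which improves resource utilization. Futex combines user space and kernel space, providing a fast path and a slow path that balance performance against waiting efficiency. The article emphasizes the importance of performance optimization, in particular finding the right balance in concurrent programming.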

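Review

State machines, state machines, state machines

Question answered in this lecture

Q: How do we implement mutual exclusion among threads on a multiprocessor?

Main topics of this lecture

Implementing spin locks
Implementing mutex locks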

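1. Mutual Exclusion on Shared Memory

Implementing mutual exclusion on shared memory

A failed attempt

mutex-bad.py

A (partially) successful attempt

peterson-barrier.c

The fundamental difficulty of implementing mutual exclusion: a thread cannot read and write shared memory at the same time

A load (looking around) cannot write; you can only "take one glance and then close your eyes"
- What you saw is immediately out of date
A store (changing the state of the physical world) cannot read; you can only "act with your eyes closed"
- You do not even know what you changed into what
This is a ~~simple, brute-force (stable), and effective~~ Operating Systems course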
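
The difficulty described above can be made concrete with a small Python sketch. This is not the course's mutex-bad.py; `naive_thread` and `run` are hypothetical names chosen for illustration, and each `yield` marks a point where the other thread is allowed to run. A thread loads the flag and only stores to it in a later step, so two threads can both see "free" before either one writes.

```python
def naive_thread(shared, name):
    """Naive lock attempt; every `yield` is a possible context switch."""
    seen = shared["locked"]            # load: take one glance...
    yield "load"
    if not seen:                       # ...and decide on a possibly stale value
        shared["locked"] = True        # store: act with eyes closed
        yield "store"
        shared["in_cs"].append(name)   # enter the critical section
        yield "cs"


def run(schedule):
    """Run two naive threads in the given order; return who entered the critical section."""
    shared = {"locked": False, "in_cs": []}
    threads = {n: naive_thread(shared, n) for n in ("t1", "t2")}
    for name in schedule:
        next(threads[name], None)      # advance that thread by one step
    return shared["in_cs"]


both = run(["t1", "t2", "t1", "t2", "t1", "t2"])    # alternate after every step
serial = run(["t1", "t1", "t1", "t2", "t2", "t2"])  # t1 runs to completion first
```

Under the alternating schedule both threads end up in the critical section. Under the serial schedule only t1 does (t1 never unlocks in this sketch, so t2 rightly stays out).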

2. Spin Locks

x86 atomic operations: the `LOCK` instruction prefix

#include "thread.h"

#define N 100000000

long sum = 0;

void Tsum() {
  for (int i = 0; i < N; i++) {
    asm volatile("lock addq $1, %0": "+m"(sum));
  }
}

int main() {
  create(Tsum);
  create(Tsum);
  join();
  printf("sum = %ld\n", sum);
}

Compile with optimization

gcc -O2 sum-atomic.c -lpthread && ./a.out

sum = 200000000
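
Why the `lock` prefix matters can be sketched with a deterministic Python model. This is an illustration under simplifying assumptions, not a measurement of real hardware: `add_split` models a plain `sum++` as a separate load and store, `add_atomic` models `lock addq $1` as one indivisible step, and `interleave` (a hypothetical helper) strictly alternates two threads.

```python
def add_split(mem):
    """Non-atomic sum++: the load and the store are separate steps."""
    t = mem["sum"]          # load
    yield                   # the other thread may run here
    mem["sum"] = t + 1      # store, possibly overwriting the other thread's update


def add_atomic(mem):
    """Models `lock addq $1`: load, add and store happen as one step."""
    mem["sum"] += 1
    yield


def interleave(make_thread, rounds):
    """Each round, two threads each perform one increment, strictly alternating."""
    mem = {"sum": 0}
    for _ in range(rounds):
        a, b = make_thread(mem), make_thread(mem)
        for _ in range(2):  # order of steps: a, b, a, b
            next(a, None)
            next(b, None)
    return mem["sum"]


lost = interleave(add_split, 100)    # half of the 200 increments are lost
exact = interleave(add_atomic, 100)  # all 200 increments survive
```

With the split version both threads load the same old value in every round, so one increment per round is overwritten. With the atomic version no interleaving point exists inside the increment.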

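Implementing mutual exclusion: spin locks

/* xchg(addr, v) atomically stores v into *addr and returns the old value
   (x86 xchg). It and the constants YES/NOPE are assumed to be defined elsewhere. */
int table = YES;

void lock() {
  int got;  // declared before the label: a label cannot precede a declaration before C23
retry:
  got = xchg(&table, NOPE);
  if (got == NOPE)
    goto retry;
  assert(got == YES);
}

void unlock() {
  xchg(&table, YES);
}

/* The same lock, written more compactly with 0/1 */
int locked = 0;
void lock() { while (xchg(&locked, 1)) ; }
void unlock() { xchg(&locked, 0); }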
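
The same lock can be sketched in Python with real threads. Python has no `xchg`, so this sketch assumes a `threading.Lock` (`_hw`) as a stand-in for the hardware's atomicity; `SpinLock` and `run` are hypothetical names, and the point is the retry loop, not performance.

```python
import threading


class SpinLock:
    """Spin lock built on a modeled atomic exchange."""

    def __init__(self):
        self.locked = 0
        self._hw = threading.Lock()    # stands in for the hardware's atomicity

    def _xchg(self, new):
        # Atomically store `new` and return the previous value.
        with self._hw:
            old, self.locked = self.locked, new
            return old

    def lock(self):
        while self._xchg(1) == 1:      # someone else holds the lock: retry
            pass

    def unlock(self):
        self._xchg(0)


def run(nthread=2, per_thread=2000):
    """Increment a shared counter under the spin lock from several threads."""
    lk, total = SpinLock(), [0]

    def worker():
        for _ in range(per_thread):
            lk.lock()
            total[0] += 1              # critical section
            lk.unlock()

    threads = [threading.Thread(target=worker) for _ in range(nthread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Calling `run(2, 2000)` should return 4000: every increment happens between a successful exchange and the matching release.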

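Implementing mutual exclusion: spin locks (cont'd)

Concurrent programming: be extremely careful

Test exhaustively (omitted here; you will find out when you do the Labs)
Prove as much as you can (model-checker.py and spinlock.py)

A model of atomic instructions

All earlier stores are guaranteed to have been written to memory
Loads and stores are guaranteed not to be reordered across the atomic instruction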

class Spinlock:
    locked = ''

    @thread
    def t1(self):
        while True:
            while True:
                self.locked, seen = '🔒', self.locked
                if seen != '🔒': break
            cs = True
            del cs
            self.locked = ''

    @thread
    def t2(self):
        while True:
            while True:
                self.locked, seen = '🔒', self.locked
                if seen != '🔒': break
            cs = True
            del cs
            self.locked = ''

    @marker
    def mark_t1(self, state):
        if localvar(state, 't1', 'cs'): return 'blue'

    @marker
    def mark_t2(self, state):
        if localvar(state, 't2', 'cs'): return 'green'

    @marker
    def mark_both(self, state):
        if localvar(state, 't1', 'cs') and localvar(state, 't2', 'cs'):
            return 'red'

python model-checker.py spinlock.py | python visualize.py -t > a.html
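
What such a checker does can be sketched in a few lines. This is not the course's model-checker.py; it is a minimal breadth-first search over all interleavings of two threads, with hypothetical names (`explore`, `spin_step`, `naive_step`, `violates`) and a hand-written state encoding `(locked, pc_0, pc_1)`.

```python
from collections import deque


def explore(step, init, nthread=2):
    """Return every state reachable from `init` under any thread interleaving."""
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        for tid in range(nthread):
            nxt = step(state, tid)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def spin_step(state, tid):
    """pc 0 = trying to lock, pc 1 = inside the critical section."""
    locked, pcs = state[0], list(state[1:])
    if pcs[tid] == 0:
        got, locked = locked, 1        # atomic xchg(&locked, 1)
        if got == 0:
            pcs[tid] = 1
    else:
        locked, pcs[tid] = 0, 0        # unlock and leave
    return (locked, *pcs)


def naive_step(state, tid):
    """Same lock, but the test (pc 0) and the set (pc 1) are separate steps."""
    locked, pcs = state[0], list(state[1:])
    if pcs[tid] == 0:
        if locked == 0:
            pcs[tid] = 1               # saw "free" but has not written yet
    elif pcs[tid] == 1:
        locked, pcs[tid] = 1, 2        # blind store, then enter
    else:
        locked, pcs[tid] = 0, 0
    return (locked, *pcs)


def violates(states, cs_pc):
    """True if some reachable state has both threads in the critical section."""
    return any(s[1] == cs_pc and s[2] == cs_pc for s in states)
```

Exploring `spin_step` from `(0, 0, 0)` reaches only three states and none of them violates mutual exclusion, while `naive_step` reaches a state with both threads inside.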

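Modern implementation of the LOCK prefix

Coherence is maintained at the L1 cache level (ring/mesh bus)

In effect, each cache line has its own lock
Once store(x) reaches the L1 cache, it is guaranteed to be visible to other processors
- But beware of the store buffer and out-of-order execution

L1 cache lines coordinate according to their state

M (Modified): dirty value
E (Exclusive): exclusive access
S (Shared): read-only sharing
I (Invalid): the cache line is not owned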
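
The four states can be exercised with a small sketch of the textbook MESI protocol for a single cache line. It is a simplification under stated assumptions (no bus timing, no store buffer, write-back only mentioned in a comment) and not a description of any particular processor; `mesi_step` is a hypothetical helper.

```python
def mesi_step(states, core, op):
    """Apply one access to ONE cache line.

    states: tuple with the MESI state of the line in each core's L1 cache
    op:     "read" or "write", issued by `core`
    """
    states = list(states)
    others = [i for i in range(len(states)) if i != core]
    if op == "read":
        if states[core] == "I":                  # read miss
            shared = any(states[i] != "I" for i in others)
            for i in others:
                if states[i] in ("M", "E"):
                    states[i] = "S"              # an M copy is written back first
            states[core] = "S" if shared else "E"
    else:                                        # write: gain exclusive ownership
        for i in others:
            states[i] = "I"                      # invalidate every other copy
        states[core] = "M"
    return tuple(states)


line = ("I", "I")
line = mesi_step(line, 0, "read")     # only core 0 has the line: E
line = mesi_step(line, 1, "read")     # both cores read it: S, S
line = mesi_step(line, 0, "write")    # core 0 owns it, core 1 is invalidated
```

Holding a line in M or E is what makes "each cache line has its own lock" a reasonable intuition: while one core owns the line, no other core holds a valid copy.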

RISC-V: another design for atomic operations

Consider the common atomic operations:

atomic test-and-set
- reg = load(x); if (reg == XX) { store(x, YY); }
lock xchg
- reg = load(x); store(x, XX);
lock add
- t = load(x); t++; store(x, t);

In essence, they are all:

load
exec (computation on processor-local registers)
store

Load-Reserved/Store-Conditional (LR/SC)

LR: marks the memory location as reserved ("I've got my eye on you"); an interrupt or a write by another processor clears the mark

lr.w rd, (rs1)
  rd = M[rs1]
  reserve M[rs1]

SC: performs the write only if the reservation is still in place

sc.w rd, rs2, (rs1)
  if still reserved:
    M[rs1] = rs2
    rd = 0
  else:
    rd = nonzero
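
The LR/SC semantics above can be modeled in a few lines of Python. This is a sketch under simplifying assumptions (one reservation per hart, interrupts not modeled); `Memory`, `hart` and `fetch_add` are hypothetical names for illustration.

```python
class Memory:
    """Toy memory with per-hart reservations."""

    def __init__(self):
        self.cells = {}
        self.reserved = {}                   # hart id -> reserved address

    def lr(self, hart, addr):
        """Load-reserved: read the value and remember the reservation."""
        self.reserved[hart] = addr
        return self.cells.get(addr, 0)

    def store(self, hart, addr, value):
        """Any write to addr clears the other harts' reservations on it."""
        self.cells[addr] = value
        for h, a in list(self.reserved.items()):
            if a == addr and h != hart:
                del self.reserved[h]

    def sc(self, hart, addr, value):
        """Store-conditional: returns 0 on success, nonzero on failure."""
        if self.reserved.pop(hart, None) != addr:   # reservation gone or never made
            return 1
        self.store(hart, addr, value)
        return 0


def fetch_add(mem, hart, addr, delta):
    """Atomic add built as an LR/SC retry loop; returns the old value."""
    while True:
        old = mem.lr(hart, addr)
        if mem.sc(hart, addr, old + delta) == 0:
            return old
```

If another hart writes the location between `lr` and `sc`, the `sc` fails and the loop in `fetch_add` simply tries again with a fresh value.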

Implementing Compare-and-Swap with LR/SC

int cas(int *addr, int cmp_val, int new_val) {
  int old_val = *addr;
  if (old_val == cmp_val) {
    *addr = new_val; return 0;
  } else { return 1; }
}

# RISC-V implementation of the C function above
cas:
  lr.w  t0, (a0)       # Load original value.
  bne   t0, a1, fail   # Doesn’t match, so fail.
  sc.w  t0, a2, (a0)   # Try to update.
  bnez  t0, cas        # Retry if store-conditional failed.
  li a0, 0             # Set return to success.
  jr ra                # Return.
fail:
  li a0, 1             # Set return to failure.
  jr ra                # Return
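
Once compare-and-swap exists, other atomic operations are usually built as retry loops around it. The sketch below keeps the return convention of the C version above (0 on success, 1 on failure) and assumes a `threading.Lock` as a stand-in for hardware atomicity; `AtomicInt`, `fetch_add` and `run` are hypothetical names.

```python
import threading


class AtomicInt:
    def __init__(self, value=0):
        self.value = value
        self._hw = threading.Lock()      # models the atomicity of cas

    def cas(self, cmp_val, new_val):
        """Returns 0 if the swap happened, 1 otherwise."""
        with self._hw:
            if self.value == cmp_val:
                self.value = new_val
                return 0
            return 1


def fetch_add(cell, delta):
    """Retry until no other thread changed the value between load and cas."""
    while True:
        old = cell.value                 # plain load; may be stale by the time of cas
        if cell.cas(old, old + delta) == 0:
            return old


def run(nthread=4, per_thread=1000):
    cell = AtomicInt(0)

    def worker():
        for _ in range(per_thread):
            fetch_add(cell, 1)

    threads = [threading.Thread(target=worker) for _ in range(nthread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value
```

A failed `cas` means some other thread updated the value first, so the loop reloads and tries again; no increment is lost.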

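3. Mutex Locks

Drawbacks of spin locks

Performance problem (0)

Spinning (on a shared variable) triggers cache synchronization between processors, which increases latency

Performance problem (1)

Apart from the thread inside the critical section, the threads on all other processors are spinning idly
The more processors contend for the lock, the lower the utilization

Performance problem (2)

The thread holding the spin lock

may be switched out by the operating system
- The operating system is not "aware" of what the thread is doing
- (But why couldn't it be?)
The result: 100% resource waste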

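Scalability: a new dimension of performance

For the same computational task, time (CPU cycles) and space (mapped memory) change as the number of processors grows.

sum-scalability.c
thread-sync.h
- Rigorous measurement is hard
  - CPU dynamic power
  - Other processes in the system
  - ...
- Benchmarking crimes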

#include "thread.h"
#include "thread-sync.h"

#define N 10000000
spinlock_t lock = SPIN_INIT();

long n, sum = 0;

void Tsum() {
  for (int i = 0; i < n; i++) {
    spin_lock(&lock);
    sum++;
    spin_unlock(&lock);
  }
}

int main(int argc, char *argv[]) {
  assert(argc == 2);
  int nthread = atoi(argv[1]);
  n = N / nthread;
  for (int i = 0; i < nthread; i++) {
    create(Tsum);
  }
  join();
  assert(sum == n * nthread);
}

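Compile with optimization

The drawback of the spin lock is that, for the same amount of work, more threads means more time and lower efficiency.

gcc -O2 sum-scalability.c -lpthread && time ./a.out 1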
# ./a.out 1  0.09s user 0.00s system 99% cpu 0.095 total

time ./a.out 10
# ./a.out 10  10.22s user 0.03s system 584% cpu 1.756 total
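
One way to see why the user time explodes is to count failed `xchg` attempts in a toy, tick-based model. This is only an illustration under strong assumptions (one thread per CPU, every waiting thread attempts one `xchg` per tick, an unfair lock that always favors the lowest thread id, no cache effects); `spin_waste` and `cs_len` are hypothetical names.

```python
def spin_waste(nthread, total_ops, cs_len=2):
    """Count failed xchg attempts while `total_ops` increments are performed."""
    todo = [total_ops // nthread] * nthread     # increments left for each thread
    holder, busy, failed = None, 0, 0
    while any(todo):
        active = [t for t in range(nthread) if todo[t] > 0]
        failed += len(active) - 1               # all but one thread spin this tick
        if holder is None:
            holder, busy = active[0], cs_len    # exactly one xchg succeeds
        else:
            busy -= 1
            if busy == 0:                       # critical section finished: unlock
                todo[holder] -= 1
                holder = None
    return failed


waste = [spin_waste(n, 120) for n in (1, 2, 4, 8)]
```

The total number of increments is the same in every run, yet the number of wasted attempts grows with the number of contending threads, which matches the trend of the measurements above.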

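When to use spin locks

The critical section is almost never "congested"
Execution-flow switches are disabled while the spin lock is held

Use case: concurrent data structures in the operating system kernel (short critical sections)

The operating system can disable interrupts and preemption
- This guarantees that the lock holder can release the lock within a short time
(What about inside a virtual machine... 😂)
- The PAUSE instruction can trigger a VM Exit
But it is still hard to get right
- An analysis of Linux scalability to many cores (OSDI'10)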

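Mutual exclusion for threads + long critical sections

With so many assignments, rather than idly waiting for the Online Judge results, why not hand yourself (the CPU) over to another assignment (thread)?

"Yielding" is not something C code can do (C code can only compute)

So just put the lock implementation inside the operating system!
- `syscall(SYSCALL_lock, &lk);`
  - Try to acquire lk; on failure, switch to another thread
- `syscall(SYSCALL_unlock, &lk);`
  - Release lk; if a thread is waiting for the lock, wake it up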

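Mutual exclusion for threads + long critical sections (cont'd)

Operating system = locker room attendant

The first person to arrive (thread)
- Gets a wristband and enters the swimming pool
- *lk = 🔒; the system call returns immediately
People who arrive later (threads)
- Cannot enter the pool and wait in line
- The thread is put on the wait queue and a thread switch (yield) is performed
The person coming out after showering (thread)
- Returns the wristband to the attendant; the attendant hands it to someone in line
- If the wait queue is not empty, one thread is taken off the queue and allowed to run
- If the wait queue is empty, *lk = ✅
The attendant (OS) uses a spin lock to make sure its own handling of wristbands is atomic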
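
The attendant's bookkeeping can be sketched as a small state model. It is not a real kernel interface: `SleepLock`, `sys_lock` and `sys_unlock` are hypothetical names, thread switching is reduced to a return value, and the spin lock the kernel would hold around these operations is only mentioned in comments.

```python
from collections import deque


class SleepLock:
    """Kernel-side state for one lock: the wristband plus the waiting line."""

    def __init__(self):
        self.held = False
        self.waiters = deque()

    def sys_lock(self, tid):
        """Return True if the caller may continue, False if it was put to sleep."""
        # A real kernel would hold a spin lock around this whole function.
        if not self.held:
            self.held = True               # first to arrive gets the wristband
            return True
        self.waiters.append(tid)           # later arrivals wait in line
        return False

    def sys_unlock(self):
        """Return the tid that was woken (it now owns the lock), or None."""
        if self.waiters:
            return self.waiters.popleft()  # hand the wristband over directly
        self.held = False                  # nobody is waiting: the lock becomes free
        return None


lk = SleepLock()
first = lk.sys_lock(1)     # thread 1 gets the lock
second = lk.sys_lock(2)    # thread 2 is put to sleep
woken = lk.sys_unlock()    # thread 1 releases; thread 2 becomes the owner
```

Note that the lock stays held across the hand-off: the woken thread owns it immediately, so a newly arriving thread cannot slip in ahead of the queue.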

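4. Futex = Spin + Mutex

Some analysis of mutual exclusion

Spin lock (threads share locked directly)

Faster fast path
- xchg succeeds → enter the critical section immediately, with very little overhead
Slower slow path
- xchg fails → CPU time is wasted spinning while waiting

Sleep lock (locked is accessed through system calls)

Faster slow path
- A thread that fails to lock no longer occupies a CPU
Slower fast path
- Even a successful lock requires entering and leaving the kernel (syscall)

Futex: Fast Userspace muTexes

Only kids make choices. I, of course, want it all!

Fast path: one atomic instruction; return immediately if locking succeeds
Slow path: locking failed; make a system call to sleep
- The most common trick in performance optimization
  - Look at the average (frequent) case rather than the worst case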
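
The fast/slow path split can be sketched with Python threads. This is a model, not the Linux futex API: a `threading.Lock` is assumed to stand in for atomic instructions, a `threading.Condition` for the kernel wait queue, and `syscalls` counts how often the "kernel" is entered. The three-state word (0 free, 1 locked, 2 locked with possible waiters) loosely follows the idea discussed in "Futexes are tricky"; all names are hypothetical.

```python
import threading


class FutexWord:
    """One futex word: 0 = free, 1 = locked, 2 = locked with possible waiters."""

    def __init__(self):
        self.val = 0
        self.syscalls = 0
        self._hw = threading.Lock()            # models atomic instructions
        self._kernel = threading.Condition()   # models the kernel wait queue

    def cmpxchg(self, expected, new):
        with self._hw:
            old = self.val
            if old == expected:
                self.val = new
            return old

    def xchg(self, new):
        with self._hw:
            old, self.val = self.val, new
            return old

    def futex_wait(self, expected):
        with self._kernel:
            self.syscalls += 1
            if self.val == expected:           # re-check inside the kernel, then sleep
                self._kernel.wait()

    def futex_wake(self):
        with self._kernel:
            self.syscalls += 1
            self._kernel.notify(1)


def lock(f):
    c = f.cmpxchg(0, 1)                        # fast path: one atomic instruction
    if c == 0:
        return
    if c != 2:
        c = f.xchg(2)                          # announce that there is a waiter
    while c != 0:                              # slow path: sleep in the "kernel"
        f.futex_wait(2)
        c = f.xchg(2)


def unlock(f):
    if f.xchg(0) == 2:                         # wake only if someone may be waiting
        f.futex_wake()


def run(nthread=4, per_thread=500):
    f, total = FutexWord(), [0]

    def worker():
        for _ in range(per_thread):
            lock(f)
            total[0] += 1                      # critical section
            unlock(f)

    threads = [threading.Thread(target=worker) for _ in range(nthread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

An uncontended `lock`/`unlock` pair never touches the modeled kernel, so `syscalls` stays at 0; only contended acquisitions and the matching releases pay for a system call.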

The mutex in the POSIX threads library (pthread_mutex)

sum-scalability.c, switched to a mutex
- Observe system calls (strace)
- Debug with gdb
  - set scheduler-locking on, info threads, thread X

#include "thread.h"
#include "thread-sync.h"

#define N 10000000
mutex_t lock = MUTEX_INIT();

long n, sum = 0;

void Tsum() {
  for (int i = 0; i < n; i++) {
    mutex_lock(&lock);
    sum++;
    mutex_unlock(&lock);
  }
}

int main(int argc, char *argv[]) {
  assert(argc == 2);
  int nthread = atoi(argv[1]);
  n = N / nthread;
  for (int i = 0; i < nthread; i++) {
    create(Tsum);
  }
  join();
  assert(sum == n * nthread);
}

Compile with optimization

gcc -O2 sum-scalability.c -lpthread && time ./a.out 1
# ./a.out 1  0.09s user 0.00s system 99% cpu 0.098 total

time ./a.out 10                                      
# ./a.out 10  0.19s user 2.42s system 569% cpu 0.459 total

time ./a.out 32 
# ./a.out 32  0.54s user 2.09s system 577% cpu 0.457 total

strace -f ./a.out 64
# [pid 55433] futex(0x4040a0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
# [pid 55432] futex(0x4040a0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
# [pid 55433] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
# [pid 55433] futex(0x4040a0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
# [pid 55432] <... futex resumed> )       = 0
# [pid 55433] <... futex resumed> )       = 0

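Futex: Fast Userspace muTexes (cont'd)

Spin in user space first

If the lock is acquired, enter directly
If the lock cannot be acquired, make a system call
Unlocking also requires a system call
- futex.py
- A better design can avoid the system call on the fast path

RTFM (may scare you off)

futex (7), futex (2)
A futex overview and update (LWN)
Futexes are tricky (on the importance of a model checker)
(We do not cover concurrent algorithms)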

class Futex:
    locked, waits = '', ''

    def tryacquire(self):
        if not self.locked:
            # Test-and-set (cmpxchg)
            # Same effect, but more efficient than xchg
            self.locked = '🔒'
            return ''
        else:
            return '🔒'

    def release(self):
        if self.waits:
            self.waits = self.waits[1:]
        else:
            self.locked = ''

    @thread
    def t1(self):
        while True:
            if self.tryacquire() == '🔒':     # User
                self.waits = self.waits + '1' # Kernel
                while '1' in self.waits:      # Kernel
                    pass
            cs = True                         # User
            del cs                            # User
            self.release()                    # Kernel

    @thread
    def t2(self):
        while True:
            if self.tryacquire() == '🔒':
                self.waits = self.waits + '2'
                while '2' in self.waits:
                    pass
            cs = True
            del cs
            self.release()

    @thread
    def t3(self):
        while True:
            if self.tryacquire() == '🔒':
                self.waits = self.waits + '3'
                while '3' in self.waits:
                    pass
            cs = True
            del cs
            self.release()

    @marker
    def mark_t1(self, state):
        if localvar(state, 't1', 'cs'): return 'blue'

    @marker
    def mark_t2(self, state):
        if localvar(state, 't2', 'cs'): return 'green'

    @marker
    def mark_t3(self, state):
        if localvar(state, 't3', 'cs'): return 'yellow'

    @marker
    def mark_both(self, state):
        count = 0
        for t in ['t1', 't2', 't3']:
            if localvar(state, t, 'cs'):
                count += 1
        if count > 1:
            return 'red'

Run the model checker

python model-checker.py futex.py | python visualize.py -t > a.html

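Summary

Question answered in this lecture

Q: How do we implement mutual exclusion on a multiprocessor system?

Take-away message

When software is not enough, hardware makes up for it (spin locks)
When user space is not enough, the kernel makes up for it (mutex locks)
- Find the assumption you depend on, and boldly break it
Fast/slow paths: an important route to performance optimization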
