About Kernel Preemption

Johnson_JloveJ

Last modified: 2024-04-03 16:26:05

Tags: linux, operations, server

First published: 2024-04-02 15:18:34

Article link: https://blog.csdn.net/weixin_40889165/article/details/137257995

Let's look at Linux kernel preemption from the 5W1H angle.

What (what is kernel preemption?)

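First consider user-mode preemption: when a task is running in user mode and a syscall/tick/irq occurs and has been handled, the scheduler can pick a more suitable task to run;
when a task is running in kernel mode, CONFIG_PREEMPT allows the scheduler to pick a more suitable task to run once a tick/irq has occurred and been handled.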

Why (why is kernel preemption needed?)

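Early Linux kernels (before the 2.6 series) did not support CONFIG_PREEMPT, because Linux assumed that a task spends very little time in kernel mode, so the cost of the scheduler switching back and forth on preemption would outweigh the benefit. In the same spirit, Linux to this day does not by default support irq preemption (nested irqs, where a high-priority irq can interrupt a low-priority one), and softirqs cannot be preempted by tasks either.
But as Linux evolved, many tasks started doing all kinds of processing and computation in kernel mode, often for 100 ms or more. If no high-priority task can run during those 100 ms, the real-time requirements of embedded platforms cannot be met.
Here is the Kconfig description of CONFIG_PREEMPT: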

config PREEMPT
        bool "Preemptible Kernel (Low-Latency Desktop)"
        depends on !ARCH_NO_PREEMPT
        select PREEMPTION
        select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
        help
          This option reduces the latency of the kernel by making
          all kernel code (that is not executing in a critical section)
          preemptible.  This allows reaction to interactive events by
          permitting a low priority process to be preempted involuntarily
          even if it is in kernel mode executing a system call and would
          otherwise not be about to reach a natural preemption point.
          This allows applications to run more 'smoothly' even when the
          system is under load, at the cost of slightly lower throughput
          and a slight runtime overhead to kernel code.

          Select this if you are building a kernel for a desktop or
          embedded system with latency requirements in the milliseconds
          range.

When (when does kernel preemption happen?)

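Look at the figure first.
Kernel preemption is what the name suggests: while Task-A is running in kernel mode, an irq occurs (tick timer, peripheral interrupts, PPI interrupts); once the irq has been handled, the scheduler finds that there is a more suitable Task-B, so Task-B preempts Task-A.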

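The main points at which a task in user mode can be preempted are:

syscall: open/read/write/ioctl, i.e. the SVC case below;
page fault: data_abort, DABT;
irq.
arm64 classifies syscalls, page faults and the like as sync (synchronous exceptions): they are caused by an instruction, so the system knows exactly where execution was, and it has to synchronize on the result of that instruction. An irq is classified as asynchronous, because it can arrive at any time.
Here is the arm64 code that handles sync exceptions: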

asmlinkage void noinstr el0_sync_handler(struct pt_regs *regs)
{
        unsigned long esr = read_sysreg(esr_el1);

        switch (ESR_ELx_EC(esr)) {
        case ESR_ELx_EC_SVC64:
                el0_svc(regs);
                break;
        case ESR_ELx_EC_DABT_LOW:
                el0_da(regs, esr);
                break;
        case ESR_ELx_EC_IABT_LOW:
                el0_ia(regs, esr);
                break;
        /* ...... (remaining exception classes elided) */
        }
}

Main code path for user-mode preemption on arm64:

el0_sync/el0_irq(kernel 5.10: arch/arm64/kernel/entry.S)
  -->ret_to_user
    -->check(tsk->ti_flag & (_TIF_NEED_RESCHED|_TIF_SIGPENDING|***))
      -->work_pending
        -->do_notify_resume
          -->schedule()
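
To make the return-to-user check more concrete, here is a minimal user-space sketch of the loop; it is a model, not the kernel's code. The `MODEL_*` flag values, `struct model_stats` and `model_ret_to_user()` are hypothetical names invented for illustration, and the counters only stand in for `schedule()` and signal delivery.

```c
#include <assert.h>

/* Hypothetical bit positions, chosen only for this model. */
#define MODEL_TIF_NEED_RESCHED (1u << 0)
#define MODEL_TIF_SIGPENDING   (1u << 1)
#define MODEL_TIF_WORK_MASK    (MODEL_TIF_NEED_RESCHED | MODEL_TIF_SIGPENDING)

struct model_stats {
	int schedules;  /* how often the model "called schedule()" */
	int signals;    /* how often the model "delivered a signal" */
};

/* Keep handling pending work until no work flag is left, then "return to user". */
static void model_ret_to_user(unsigned int *flags, struct model_stats *st)
{
	assert(flags && st);

	while (*flags & MODEL_TIF_WORK_MASK) {
		if (*flags & MODEL_TIF_NEED_RESCHED) {
			st->schedules++;                    /* stands in for schedule() */
			*flags &= ~MODEL_TIF_NEED_RESCHED;
		} else {
			st->signals++;                      /* stands in for signal handling */
			*flags &= ~MODEL_TIF_SIGPENDING;
		}
	}
}
```

The point of the loop shape is that new work flags may be set while earlier work is being handled, so the flags are re-checked before going back to user mode.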

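The main point at which a task in kernel mode can be preempted is an irq:
- arm64 also has el1_sync_handler, but ending up there almost always means the system hit an error;
- kernel mode does not issue syscalls (smc, hvc and other EL3/EL2-related calls are not covered here);
- kernel mode should not normally take a data abort, since kernel addresses are mapped to physical addresses; if one does occur, it usually means an invalid address was accessed;
- so preemption in kernel mode is mainly triggered by irqs;
- main code path for kernel preemption on arm64: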
```
el1_irq(arch/arm64/kernel/entry.S)
  -->el1_interrupt_handler(handle_arch_irq)
  -->check(task->thread_info->preempt.count == 0)
    -->arm64_preempt_schedule_irq
      -->preempt_schedule_irq
        -->__schedule
```
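
As a rough illustration of that check, the sketch below models the arm64 `thread_info` preempt word in user space. It assumes the 5.10 arm64 layout, in which `preempt.need_resched` is stored inverted (0 means a reschedule is pending), so the whole 64-bit `preempt_count` is 0 only when the nesting count is 0 and a reschedule is wanted. `union preempt_word` and `irq_exit_should_preempt()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space model of the arm64 preempt word (not kernel code). */
union preempt_word {
	uint64_t preempt_count;         /* whole word: 0 means "preempt now" */
	struct {
		uint32_t count;         /* preempt/softirq/hardirq nesting */
		uint32_t need_resched;  /* assumed inverted: 0 means resched pending */
	} preempt;
};

/* Hypothetical helper: the decision taken on the irq exit path in kernel mode. */
static bool irq_exit_should_preempt(const union preempt_word *w)
{
	assert(w);
	/* One 64-bit compare covers both "count == 0" and "resched pending". */
	return w->preempt_count == 0;
}
```

Folding both conditions into one word is what lets the exit path decide with a single load and compare.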

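How (how is preempt_disable implemented?)

Many critical places in the kernel need to disable preemption temporarily; if preemption were allowed there, more locks would have to be designed to protect the critical data. kmap_atomic() is one example. For the curious, let's first see how preempt_disable() is implemented.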

#define preempt_disable() \
do { \
        preempt_count_inc(); \
        barrier(); \
} while (0)

static inline void *kmap_atomic(struct page *page)
{
        preempt_disable();
        pagefault_disable();
        return page_address(page);
}

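That's right: preempt_disable() merely increments current_thread_info()->preempt.count by 1;
after that, the kernel only has to check, on the way out of the preemption point (once irq handling is complete), whether the current task may be preempted: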

el1_irq(arch/arm64/kernel/entry.S)
  -->el1_interrupt_handler(handle_arch_irq)
  -->check(task->thread_info->preempt.count == 0)
    -->arm64_preempt_schedule_irq
      -->preempt_schedule_irq
        -->__schedule

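If the task called preempt_disable() before it was interrupted by the irq, no scheduling happens when the irq exits.
preempt_disable() can be nested, and every call must be paired with a preempt_enable(); if the calls are not paired, the task can no longer be preempted by other tasks while in kernel mode.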
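
The nesting rule can be sketched with a plain counter. This is a user-space model under the assumption that only the nesting depth matters; `preempt_count_model` and the `model_*` functions are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

static unsigned int preempt_count_model;  /* stands in for preempt.count */

static void model_preempt_disable(void)
{
	preempt_count_model++;
}

static void model_preempt_enable(void)
{
	assert(preempt_count_model > 0);  /* an unbalanced enable is a bug */
	preempt_count_model--;
}

/* Preemption is possible again only when every disable has been undone. */
static bool model_preemptible(void)
{
	return preempt_count_model == 0;
}
```

With two nested disables, a single enable leaves the counter at 1, so the task stays non-preemptible until the second enable.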
Now look at the related structure: the low 8 bits of thread_info->preempt_count hold the number of nested preempt-disable calls.

arch/arm64/include/asm/thread_info.h
struct thread_info {
        unsigned long           flags;          /* low level flags */
        mm_segment_t            addr_limit;     /* address limit */
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
        u64                     ttbr0;          /* saved TTBR0_EL1 */
#endif
        union {
                u64             preempt_count;  /* 0 => preemptible, <0 => bug */
                struct {
                        u32     count;
                        u32     need_resched;
                } preempt;
        };
};

/*
 * We put the hardirq and softirq counter into the preemption
 * counter. The bitmask has the following meaning:
 *
 * - bits 0-7 are the preemption count (max preemption depth: 256)
 * - bits 8-15 are the softirq count (max # of softirqs: 256)
 *
 * The hardirq count could in theory be the same as the number of
 * interrupts in the system, but we run all interrupt handlers with
 * interrupts disabled, so we cannot have nesting interrupts. Though
 * there are a few palaeontologic drivers which reenable interrupts in
 * the handler, so we need more than one bit here.
 *
 *         PREEMPT_MASK:        0x000000ff
 *         SOFTIRQ_MASK:        0x0000ff00
 *         HARDIRQ_MASK:        0x000f0000
 *             NMI_MASK:        0x00f00000
 * PREEMPT_NEED_RESCHED:        0x80000000
 */
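
To see how these fields share one word, here is a small user-space sketch that packs and unpacks them. The mask values are taken from the comment above; the shifts are derived from those masks, and every `MODEL_*` / `model_*` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Masks as listed in the comment; shifts follow from the masks. */
#define MODEL_PREEMPT_MASK  0x000000ffu
#define MODEL_SOFTIRQ_MASK  0x0000ff00u
#define MODEL_HARDIRQ_MASK  0x000f0000u
#define MODEL_NMI_MASK      0x00f00000u
#define MODEL_SOFTIRQ_SHIFT 8
#define MODEL_HARDIRQ_SHIFT 16
#define MODEL_NMI_SHIFT     20

struct model_preempt_fields {
	unsigned int preempt;  /* preempt_disable() nesting depth */
	unsigned int softirq;
	unsigned int hardirq;
	unsigned int nmi;
};

/* Pack three counters into one word; each value must fit its field. */
static uint32_t model_pack(unsigned int preempt, unsigned int softirq,
			   unsigned int hardirq)
{
	assert(preempt <= 0xff && softirq <= 0xff && hardirq <= 0xf);
	return preempt | (softirq << MODEL_SOFTIRQ_SHIFT) |
	       (hardirq << MODEL_HARDIRQ_SHIFT);
}

/* Split a packed word back into its individual counters. */
static struct model_preempt_fields model_unpack(uint32_t pc)
{
	struct model_preempt_fields f;

	f.preempt = pc & MODEL_PREEMPT_MASK;
	f.softirq = (pc & MODEL_SOFTIRQ_MASK) >> MODEL_SOFTIRQ_SHIFT;
	f.hardirq = (pc & MODEL_HARDIRQ_MASK) >> MODEL_HARDIRQ_SHIFT;
	f.nmi     = (pc & MODEL_NMI_MASK) >> MODEL_NMI_SHIFT;
	return f;
}
```

Because all counters live in one word, a single "is the word zero?" test answers whether the code is in any kind of atomic context.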

Where (where should preempt_disable be used?)

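"preempt_disable should be used in atomic context" is not a very helpful statement, so let's instead look at when to use preempt_disable() and when to use a spinlock.
As you surely know, a spinlock protects critical data from interference by other tasks or irqs: in a multi-CPU, multi-task system, only one task or irq may access that data at any given time.
If the critical data is a percpu variable, however, it can be accessed by several CPUs in parallel, because each CPU works on its own copy: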
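- that is, taskA on cpu0 and taskB on cpu1 can read and write this percpu variable at the same time;
- but if the scheduler switches cpu0 from taskA to taskC, how do taskA and taskC synchronize their accesses to the percpu variable?
- if a spinlock protects the percpu variable, taskB on cpu1 can no longer access it in parallel either, which is not what we want;
- a mutex has the same drawback as a spinlock; the only difference is that the holder can be scheduled out;
- with preempt_disable(), taskA disables preemption before it accesses the percpu variable in kernel mode and calls preempt_enable() when it is done, which keeps taskC on cpu0 from racing with taskA on that variable;
- so percpu variables are more sensibly protected with preempt_disable()/preempt_enable();
- look at the implementation of get_cpu_var, which starts with preempt_disable():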
```
#define get_cpu_var(var)						\
(*({									\
	preempt_disable();						\
	this_cpu_ptr(&var);						\
}))
```
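
In the kernel, `get_cpu_var()` is paired with `put_cpu_var()`, which re-enables preemption. The sketch below is a user-space model of that pairing, not kernel code: the "scheduler" is reduced to a function that refuses to move the task while the depth counter is non-zero, and all `model_*` / `MODEL_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 2

static int model_counter[MODEL_NR_CPUS];  /* one copy of the variable per CPU */
static int model_this_cpu;                /* CPU the modelled task runs on */
static unsigned int model_preempt_depth;  /* stands in for preempt.count */

/* Models get_cpu_var(): disable preemption, then hand out this CPU's copy. */
static int *model_get_cpu_counter(void)
{
	model_preempt_depth++;
	return &model_counter[model_this_cpu];
}

/* Models put_cpu_var(): re-enable preemption after the access. */
static void model_put_cpu_counter(void)
{
	assert(model_preempt_depth > 0);
	model_preempt_depth--;
}

/* The modelled scheduler may only move the task when preemption is enabled. */
static bool model_try_migrate(int cpu)
{
	if (model_preempt_depth != 0)
		return false;
	model_this_cpu = cpu;
	return true;
}
```

Between get and put the task cannot be switched out or moved, so the pointer it holds keeps referring to the copy of the CPU it is actually running on, and no lock is shared between CPUs.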