Preface
I have debugged kernel real-time latency problems many times at work: communication systems require precise millisecond-level latency, and in past products I have used the real-time patch (preempt_rt), Xenomai, and similar approaches. This series records that work, both as a backup and for discussion. The plan is to work through the differences between the non-preemptible 2.4 kernel and the preemptible 2.6 kernel, what exactly the preempt_rt patch changes, and common ways of debugging real-time behavior. This first article analyzes scheduling in the 2.4 kernel.
Scheduling Policy Concepts
The kernel supports three scheduling policies, SCHED_OTHER, SCHED_FIFO, and SCHED_RR (a small user-space example of selecting a policy follows this list):
- SCHED_OTHER: the time-slice-based policy for normal-priority threads;
- SCHED_FIFO: the first-in-first-out real-time policy, for threads with strict latency requirements. A SCHED_FIFO thread has no notion of a time slice; it keeps the CPU until it gives it up or a higher-priority thread preempts it;
- SCHED_RR: the round-robin real-time policy. Unlike SCHED_FIFO, when a SCHED_RR thread exhausts its time slice the kernel assigns it a fresh slice and moves it to the tail of the runqueue; putting it at the tail keeps scheduling fair among RR threads of the same priority;
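As a quick illustration of how a thread picks one of these policies, here is a minimal user-space sketch using the standard sched_setscheduler(2) interface; the priority value 50 is an arbitrary choice of mine:

#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct sched_param param;

	memset(&param, 0, sizeof(param));
	param.sched_priority = 50;	/* real-time priority; valid range is 1..99 */

	/* pid 0 = the calling process; switch it to the FIFO real-time policy */
	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
		perror("sched_setscheduler");	/* typically requires root */
		return 1;
	}
	printf("running with SCHED_FIFO, priority %d\n", param.sched_priority);
	return 0;
}

Run it as root; replacing SCHED_FIFO with SCHED_RR selects the round-robin policy instead.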
Priority Definitions
- Real-time priority: the static priority of SCHED_FIFO/SCHED_RR threads, stored in the task's rt_priority field;
- Normal priority: the static priority of SCHED_OTHER threads, expressed through the nice value, which NICE_TO_TICKS converts into a time slice;
Process Descriptor
The process descriptor is a process's in-kernel representation; you run into it constantly when reading the code:
struct task_struct {
	volatile long need_resched;	/* set when a reschedule is pending */
	long counter;			/* remaining time slice, in ticks */
	long nice;			/* static priority of a normal thread */
	unsigned long policy;		/* SCHED_OTHER / SCHED_FIFO / SCHED_RR */
	struct list_head run_list;	/* node in the global runqueue */
	unsigned long sleep_time;	/* jiffies value when the task last slept */
	struct task_struct *next_task, *prev_task;	/* global task list links */
	unsigned long rt_priority;	/* static priority of a real-time thread */
	...
};
policy: the scheduling policy introduced above, SCHED_OTHER, SCHED_FIFO, or SCHED_RR;
nice: the static priority of a normal thread; NICE_TO_TICKS converts nice into a time slice stored in counter;
rt_priority: the static priority of a real-time thread;
How Process Descriptors Are Organized
- All runnable processes are linked through their run_list fields into a doubly linked list whose head is runqueue_head;
- Whether a process is a normal thread or a real-time thread, it goes onto this list as soon as it becomes runnable;
- As the system runs, tasks are constantly inserted into and removed from this list; the CPU only ever has to look at this one list;
- On every scheduling decision the kernel walks this list and picks the thread with the largest weight (a sketch of the insert/remove helpers follows this list);
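For reference, the helpers that maintain this list are tiny. The following is abridged from 2.4's kernel/sched.c, quoted from memory, so treat it as a sketch rather than an exact copy:

static inline void add_to_runqueue(struct task_struct * p)
{
	list_add(&p->run_list, &runqueue_head);
	nr_running++;
}

static inline void del_from_runqueue(struct task_struct * p)
{
	nr_running--;
	p->sleep_time = jiffies;	/* remember when the task left the runqueue */
	list_del(&p->run_list);
	p->run_list.next = NULL;
}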
Multi-core Scheduling
- When CPUx hits a scheduling point, it picks a thread from the runqueue according to the scheduling policies;
- All CPUs share a single runqueue, so every access must be mutually exclusive, which hurts efficiency; we will come back to this drawback later;
- On every hardware timer tick, the time slice of the currently running thread is updated (see the excerpt after this list);
- Thread state changes, policy changes, and thread creation and destruction all update the runqueue;
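The per-tick slice update lives in update_process_times(). The following is abridged from 2.4's kernel/timer.c, from memory, so minor details may differ:

void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id(), system = user_tick ^ 1;

	update_one_process(p, user_tick, system, cpu);
	if (p->pid) {
		/* burn one tick of the running thread's slice; when it
		 * reaches zero, request a reschedule */
		if (--p->counter <= 0) {
			p->counter = 0;
			p->need_resched = 1;
		}
		...
	}
}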
This leaves one key question: where exactly are the scheduling points? Read on.
Scheduling Code Analysis
A rough walkthrough of how the scheduling function is implemented:
asmlinkage void schedule(void)
{
	struct schedule_data * sched_data;
	struct task_struct *prev, *next, *p;
	struct list_head *tmp;
	int this_cpu, c;

	spin_lock_prefetch(&runqueue_lock);

	if (!current->active_mm) BUG();
need_resched_back:
	prev = current;
	this_cpu = prev->processor;

	if (unlikely(in_interrupt())) {
		printk("Scheduling in interrupt\n");
		BUG();
	}

	release_kernel_lock(prev, this_cpu);

	/*
	 * 'sched_data' is protected by the fact that we can run
	 * only one process per CPU.
	 */
	sched_data = & aligned_data[this_cpu].schedule_data;

	/* take the global runqueue spinlock for exclusive access */
	spin_lock_irq(&runqueue_lock);

	/* a SCHED_RR thread that has used up its slice gets a fresh
	 * slice and goes to the tail of the runqueue */
	/* move an exhausted RR process to be last.. */
	if (unlikely(prev->policy == SCHED_RR))
		if (!prev->counter) {
			prev->counter = NICE_TO_TICKS(prev->nice);	/* recompute the slice */
			move_last_runqueue(prev);	/* move to the tail */
		}

	switch (prev->state) {
		case TASK_INTERRUPTIBLE:
			if (signal_pending(prev)) {
				prev->state = TASK_RUNNING;
				break;
			}
		default:
			del_from_runqueue(prev);
		case TASK_RUNNING:;
	}
	prev->need_resched = 0;

	/*
	 * this is the scheduler proper:
	 */

repeat_schedule:
	/*
	 * Default process to select..
	 */
	/* scan the runqueue for the thread to run next; among several
	 * candidates the policies above decide through goodness() */
	next = idle_task(this_cpu);	/* start from the idle task */
	c = -1000;
	list_for_each(tmp, &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		if (can_schedule(p, this_cpu)) {
			/* keep the thread with the largest weight in next */
			int weight = goodness(p, this_cpu, prev->active_mm);
			if (weight > c)
				c = weight, next = p;
		}
	}

	/* Do we need to re-calculate counters? */
	/* all runnable threads have exhausted their slices:
	 * recompute every task's time slice */
	if (unlikely(!c)) {
		struct task_struct *p;

		spin_unlock_irq(&runqueue_lock);
		read_lock(&tasklist_lock);
		for_each_task(p)
			p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
		read_unlock(&tasklist_lock);
		spin_lock_irq(&runqueue_lock);
		/* go back and pick a thread again */
		goto repeat_schedule;
	}

	/*
	 * from this point on nothing can prevent us from
	 * switching to the next task, save this fact in
	 * sched_data.
	 */
	sched_data->curr = next;
	task_set_cpu(next, this_cpu);
	spin_unlock_irq(&runqueue_lock);

	if (unlikely(prev == next)) {
		/* We won't go through the normal tail, so do this by hand */
		prev->policy &= ~SCHED_YIELD;
		goto same_process;
	}

#ifdef CONFIG_SMP
	/*
	 * maintain the per-process 'last schedule' value.
	 * (this has to be recalculated even if we reschedule to
	 * the same process) Currently this is only used on SMP,
	 * and it's approximate, so we do not have to maintain
	 * it while holding the runqueue spinlock.
	 */
	sched_data->last_schedule = get_cycles();

	/*
	 * We drop the scheduler lock early (it's a global spinlock),
	 * thus we have to lock the previous process from getting
	 * rescheduled during switch_to().
	 */
#endif /* CONFIG_SMP */

	/* switch to the new thread */
	kstat.context_swtch++;
	/*
	 * there are 3 processes which are affected by a context switch:
	 *
	 * prev == .... ==> (last => next)
	 *
	 * It's the 'much more previous' 'prev' that is on next's stack,
	 * but prev is set to (the just run) 'last' process by switch_to().
	 * This might sound slightly confusing but makes tons of sense.
	 */
	prepare_to_switch();
	{
		struct mm_struct *mm = next->mm;
		struct mm_struct *oldmm = prev->active_mm;
		if (!mm) {
			if (next->active_mm) BUG();
			next->active_mm = oldmm;
			atomic_inc(&oldmm->mm_count);
			enter_lazy_tlb(oldmm, next, this_cpu);
		} else {
			if (next->active_mm != mm) BUG();
			switch_mm(oldmm, mm, next, this_cpu);
		}

		if (!prev->mm) {
			prev->active_mm = NULL;
			mmdrop(oldmm);
		}
	}

	/*
	 * This just switches the register state and the
	 * stack.
	 */
	switch_to(prev, next, prev);
	__schedule_tail(prev);

same_process:
	reacquire_kernel_lock(current);
	if (current->need_resched)
		goto need_resched_back;
	return;
}
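The weight used above comes from goodness(). For reference, this is 2.4's goodness() with its comments abridged (quoted from memory, so minor details may differ):

static inline int goodness(struct task_struct * p, int this_cpu,
			   struct mm_struct *this_mm)
{
	int weight;

	/* a yielding process is selected after everything else */
	weight = -1;
	if (p->policy & SCHED_YIELD)
		goto out;

	if (p->policy == SCHED_OTHER) {
		/* normal thread: base weight is the remaining time slice;
		 * an exhausted slice means weight 0 */
		weight = p->counter;
		if (!weight)
			goto out;
#ifdef CONFIG_SMP
		/* prefer the CPU the task last ran on */
		if (p->processor == this_cpu)
			weight += PROC_CHANGE_PENALTY;
#endif
		/* small bonus for sharing the current address space */
		if (p->mm == this_mm || !p->mm)
			weight += 1;
		weight += 20 - p->nice;
		goto out;
	}

	/* real-time thread: always beats every normal thread */
	weight = 1000 + p->rt_priority;
out:
	return weight;
}

Two things fall out of this: a real-time thread scores at least 1000 and therefore always beats any normal thread, and for normal threads the weight shrinks as the slice is consumed, which is where the dynamic weight behavior comes from.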
Scheduling Points
Scheduling falls into two categories: voluntarily giving up the CPU, and being forced to give it up;
- Voluntary: for example, a program calls sleep; once it traps into the kernel the thread releases the CPU itself and then calls the scheduling function schedule();
- Involuntary: a scheduling decision is made when returning from kernel mode to user space (this is the only place it can happen), in the following cases:
  (1) returning to user space after an interrupt handler completes;
  (2) returning to user space after an exception handler completes;
  (3) returning to user space when a system call returns from the kernel;
This exposes a key limitation: the kernel itself is not preemptible. Suppose thread A (normal priority) traps into the kernel through a system call; while it runs in kernel mode an interrupt fires, and the handler wakes thread B (a real-time thread). B is not scheduled immediately; it has to wait until A's system call finishes and the kernel returns to user space before it can run. This is the single biggest drawback of the 2.4 kernel: no preemption in kernel mode.
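The check happens on the return path to user space. The real code is assembly (ret_from_sys_call in arch/i386/kernel/entry.S); the following C rendering is my own sketch of its logic, not kernel source:

/* Sketch only: the 2.4 return-to-user-space path, in C pseudocode.
 * The real implementation is x86 assembly in arch/i386/kernel/entry.S. */
void ret_from_sys_call_sketch(void)
{
	if (current->need_resched)
		schedule();	/* the only point where a wakened RT thread
				 * can finally preempt the syscall's caller */
	/* pending signals are also delivered here, then user registers
	 * are restored and the CPU drops back to user mode */
}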
Time Slice Calculation
#if HZ < 200
#define TICK_SCALE(x) ((x) >> 2)
#elif HZ < 400
#define TICK_SCALE(x) ((x) >> 1)
#elif HZ < 800
#define TICK_SCALE(x) (x)
#elif HZ < 1600
#define TICK_SCALE(x) ((x) << 1)
#else
#define TICK_SCALE(x) ((x) << 2)
#endif
#define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)//nice [-20,19]
For normal threads the priority range is 100-139, mapping to nice values in [-20, +19]; the smaller the number, the higher the priority. Worked examples for different HZ values (one tick is 1000/HZ ms):
//HZ=100  (TICK_SCALE(x) = x >> 2, one tick = 10ms)
//nice=-20: x=40 (0b0010 1000) -> scaled 10, (10+1)*10ms = 110ms
//nice=19:  x=1  (0b0000 0001) -> scaled 0,  (0+1)*10ms  = 10ms
//HZ=250  (TICK_SCALE(x) = x >> 1, one tick = 4ms)
//nice=-20: x=40 (0b0010 1000) -> scaled 20, (20+1)*4ms = 84ms
//nice=19:  x=1  (0b0000 0001) -> scaled 0,  (0+1)*4ms  = 4ms
//HZ=1000 (TICK_SCALE(x) = x << 1, one tick = 1ms)
//nice=-20: x=40 (0b0010 1000) -> scaled 80, (80+1)*1ms = 81ms
//nice=19:  x=1  (0b0000 0001) -> scaled 2,  (2+1)*1ms  = 3ms
As these examples show, the time slice a thread is given is tightly coupled to HZ (the small program below reproduces the numbers).
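If you want to check the arithmetic yourself, here is a stand-alone user-space program that re-implements the two macros; the hz parameter stands in for the kernel's compile-time HZ and is purely illustrative:

#include <stdio.h>

/* user-space re-implementation of the kernel macros, for checking the math */
#define TICK_SCALE(x, hz) \
	((hz) < 200 ? (x) >> 2 : \
	 (hz) < 400 ? (x) >> 1 : \
	 (hz) < 800 ? (x) : \
	 (hz) < 1600 ? (x) << 1 : (x) << 2)
#define NICE_TO_TICKS(nice, hz) (TICK_SCALE(20 - (nice), hz) + 1)

int main(void)
{
	int hzs[] = { 100, 250, 1000 };
	int nices[] = { -20, 0, 19 };

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 3; j++) {
			int hz = hzs[i], nice = nices[j];
			int ticks = NICE_TO_TICKS(nice, hz);
			printf("HZ=%4d nice=%3d -> %2d ticks = %d ms\n",
			       hz, nice, ticks, ticks * 1000 / hz);
		}
	return 0;
}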
For real-time threads there are two cases:
SCHED_FIFO: no time slice; once the thread gets the CPU it keeps running until it voluntarily yields or a higher-priority thread preempts it;
SCHED_RR: here is an odd design choice; as the code below shows, an RR thread's time slice is, surprisingly, derived from its nice value...
/* move an exhausted RR process to be last.. */
if (unlikely(prev->policy == SCHED_RR))
	if (!prev->counter) {
		prev->counter = NICE_TO_TICKS(prev->nice);	/* recompute the slice */
		move_last_runqueue(prev);	/* move to the tail */
	}
nice usually defaults to 0, which gives a time slice of:
//HZ=1000
//nice=0: x=20 (0b0001 0100) -> scaled 40, (40+1)*1ms = 41ms
One question: if I change a thread's nice value by hand (e.g., with the renice command), does that also change the round-robin time slice it gets? I will verify this later.
Problems with This Scheduler
- The kernel is not preemptible in kernel mode, so real-time latency cannot be guaranteed;
- All CPUs on an SMP system share one runqueue, making access inefficient;
- Every scheduling decision walks the entire runqueue, O(n) time complexity;
Open Questions
- What exactly is the relationship between the RR time slice and nice, and does changing nice from user space affect the kernel's slice allocation?
- How is mutually exclusive access to the shared runqueue guaranteed across cores? This ties into how the preempt_rt patch reworks spinlocks;
- How does a normal process's weight change dynamically while it runs?
And so on; I will work through these step by step in later posts.