Preface
I have debugged kernel real-time latency problems many times at work: communication systems require precise millisecond-level latency, and in past products I have used the real-time patch (preempt_rt), Xenomai, and similar approaches. This series records that work, both as a backup and for discussion. The plan is to work through the differences between the non-preemptible 2.4 kernel and the preemptible 2.6 kernel, what exactly the preempt_rt patch changes, and common ways of debugging real-time behavior. This first article analyzes scheduling in the 2.4 kernel.
Scheduling Policy Concepts
The kernel supports three scheduling policies, SCHED_OTHER, SCHED_FIFO, and SCHED_RR (a small user-space example of selecting a policy follows this list):
- SCHED_OTHER: the time-slice-based policy for normal-priority threads;
- SCHED_FIFO: the first-in-first-out real-time policy, for threads with strict latency requirements. A SCHED_FIFO thread has no notion of a time slice; it keeps the CPU until it gives it up or a higher-priority thread preempts it;
- SCHED_RR: the round-robin real-time policy. Unlike SCHED_FIFO, when a SCHED_RR thread exhausts its time slice the kernel assigns it a fresh slice and moves it to the tail of the runqueue; putting it at the tail keeps scheduling fair among RR threads of the same priority;
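As a quick illustration of how a thread picks one of these policies, here is a minimal user-space sketch using the standard sched_setscheduler(2) interface; the priority value 50 is an arbitrary choice of mine:

#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct sched_param param;

	memset(&param, 0, sizeof(param));
	param.sched_priority = 50;	/* real-time priority; valid range is 1..99 */

	/* pid 0 = the calling process; switch it to the FIFO real-time policy */
	if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
		perror("sched_setscheduler");	/* typically requires root */
		return 1;
	}
	printf("running with SCHED_FIFO, priority %d\n", param.sched_priority);
	return 0;
}

Run it as root; replacing SCHED_FIFO with SCHED_RR selects the round-robin policy instead.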
Priority Definitions
- Real-time priority: the static priority of SCHED_FIFO/SCHED_RR threads, stored in the task's rt_priority field;
- Normal priority: the static priority of SCHED_OTHER threads, expressed through the nice value, which NICE_TO_TICKS converts into a time slice;
Process Descriptor
The process descriptor is a process's in-kernel representation; you run into it constantly when reading the code:
struct task_struct {
	volatile long need_resched;	/* set when a reschedule is pending */
	long counter;			/* remaining time slice, in ticks */
	long nice;			/* static priority of a normal thread */
	unsigned long policy;		/* SCHED_OTHER / SCHED_FIFO / SCHED_RR */
	struct list_head run_list;	/* node in the global runqueue */
	unsigned long sleep_time;	/* jiffies value when the task last slept */
	struct task_struct *next_task, *prev_task;	/* global task list links */
	unsigned long rt_priority;	/* static priority of a real-time thread */
	...
};
policy: the scheduling policy introduced above, SCHED_OTHER, SCHED_FIFO, or SCHED_RR;
nice: the static priority of a normal thread; NICE_TO_TICKS converts nice into a time slice stored in counter;
rt_priority: the static priority of a real-time thread;
How Process Descriptors Are Organized
- All runnable processes are linked through their run_list fields into a doubly linked list whose head is runqueue_head;
- Whether a process is a normal thread or a real-time thread, it goes onto this list as soon as it becomes runnable;
- As the system runs, tasks are constantly inserted into and removed from this list; the CPU only ever has to look at this one list;
- On every scheduling decision the kernel walks this list and picks the thread with the largest weight (a sketch of the insert/remove helpers follows this list);
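For reference, the helpers that maintain this list are tiny. The following is abridged from 2.4's kernel/sched.c, quoted from memory, so treat it as a sketch rather than an exact copy:

static inline void add_to_runqueue(struct task_struct * p)
{
	list_add(&p->run_list, &runqueue_head);
	nr_running++;
}

static inline void del_from_runqueue(struct task_struct * p)
{
	nr_running--;
	p->sleep_time = jiffies;	/* remember when the task left the runqueue */
	list_del(&p->run_list);
	p->run_list.next = NULL;
}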
Multi-core Scheduling
- When CPUx hits a scheduling point, it picks a thread from the runqueue according to the scheduling policies;
- All CPUs share a single runqueue, so every access must be mutually exclusive, which hurts efficiency; we will come back to this drawback later;
- On every hardware timer tick, the time slice of the currently running thread is updated (see the excerpt after this list);
- Thread state changes, policy changes, and thread creation and destruction all update the runqueue;
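The per-tick slice update lives in update_process_times(). The following is abridged from 2.4's kernel/timer.c, from memory, so minor details may differ:

void update_process_times(int user_tick)
{
	struct task_struct *p = current;
	int cpu = smp_processor_id(), system = user_tick ^ 1;

	update_one_process(p, user_tick, system, cpu);
	if (p->pid) {
		/* burn one tick of the running thread's slice; when it
		 * reaches zero, request a reschedule */
		if (--p->counter <= 0) {
			p->counter = 0;
			p->need_resched = 1;
		}
		...
	}
}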
This leaves one key question: where exactly are the scheduling points? Read on.
Scheduling Code Analysis
A rough walkthrough of how the scheduling function is implemented:
asmlinkage void schedule(void)
{
	struct schedule_data * sched_data;
	struct task_struct *prev, *next, *p;
	struct list_head *tmp;
	int this_cpu, c;

	spin_lock_prefetch(&runqueue_lock);

	if (!current->active_mm) BUG();
need_resched_back:
	prev = current;
	this_cpu = prev->processor;

	if (unlikely(in_interrupt())) {
		printk("Scheduling in interrupt\n");
		BUG();
	}

	release_kernel_lock(prev, this_cpu);

	/*
	 * 'sched_data' is protected by the fact that we can run
	 * only one process per CPU.
	 */
	sched_data = & aligned_data[this_cpu].schedule_data;

	/* take the global runqueue spinlock for exclusive access */
	spin_lock_irq(&runqueue_lock);

	/* a SCHED_RR thread that has used up its slice gets a fresh
	 * slice and goes to the tail of the runqueue */
	/* move an exhausted RR process to be last.. */
	if (unlikely(prev->policy == SCHED_RR))
		if (!prev->counter) {
			prev->counter = NICE_TO_TICKS(prev->nice);	/* recompute the slice */
			move_last_runqueue(prev);	/* move to the tail */
		}

	switch (prev->state) {
		case TASK_INTERRUPTIBLE:
			if (signal_pending(prev)) {
				prev->state = TASK_RUNNING;
				break;
			}
		default:
			del_from_runqueue(prev);
		case TASK_RUNNING:;
	}
	prev->need_resched = 0;

	/*
	 * this is the scheduler proper:
	 */

repeat_schedule:
	/*
	 * Default process to select..
	 */
	/* scan the runqueue for the thread to run next; among several
	 * candidates the policies above decide through goodness() */
	next = idle_task(this_cpu);	/* start from the idle task */
	c = -1000;
	list_for_each(tmp, &runqueue_head) {
		p = list_entry(tmp, struct task_struct, run_list);
		if (can_schedule(p, this_cpu)) {
			/* keep the thread with the largest weight in next */
			int weight = goodness(p, this_cpu, prev->active_mm);
			if (weight > c)
				c = weight, next = p;
		}
	}

	/* Do we need to re-calculate counters? */
	/* all runnable threads have exhausted their slices:
	 * recompute every task's time slice */
	if (unlikely(!c)) {
		struct task_struct *p;

		spin_unlock_irq(&runqueue_lock);
		read_lock(&tasklist_lock);
		for_each_task(p)
			p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
		read_unlock(&tasklist_lock);
		spin_lock_irq(&runqueue_lock);
		/* go back and pick a thread again */
		goto repeat_schedule;
	}

	/*
	 * from this point on nothing can prevent us from
	 * switching to the next task, save this fact in
	 * sched_data.
	 */
	sched_data->curr = next;
	task_set_cpu(next, this_cpu);
	spin_unlock_irq(&runqueue_lock);

	if (unlikely(prev == next)) {
		/* We won't go through the normal tail, so do this by hand */
		prev->policy &= ~SCHED_YIELD;
		goto same_process;
	}

#ifdef CONFIG_SMP
	/*
	 * maintain the per-process 'last schedule' value.
	 * (this has to be recalculated even if we reschedule to
	 * the same process) Currently this is only used on SMP,
	 * and it's approximate, so we do not have to maintain
	 * it while holding the runqueue spinlock.
	 */
	sched_data->last_schedule = get_cycles();

	/*
	 * We drop the scheduler lock early (it's a global spinlock),
	 * thus we have to lock the previous process from getting
	 * rescheduled during switch_to().
	 */
#endif /* CONFIG_SMP */

	/* switch to the new thread */
	kstat.context_swtch++;
	/*
	 * there are 3 processes which are affected by a context switch:
	 *
	 * prev == .... ==> (last => next)
	 *
	 * It's the 'much more previous' 'prev' that is on next's stack,
	 * but prev is set to (the just run) 'last' process by switch_to().
	 * This might sound slightly confusing but makes tons of sense.
	 */
	prepare_to_switch();
	{
		struct mm_struct *mm = next->mm;
		struct mm_struct *oldmm = prev->active_mm;
		if (!mm) {
			if (next->active_mm) BUG();
			next->active_mm = oldmm;
			atomic_inc(&oldmm->mm_count);
			enter_lazy_tlb(oldmm, next, this_cpu);
		} else {
			if (next->active_mm != mm) BUG();
			switch_mm(oldmm, mm, next, this_cpu);
		}

		if (!prev->mm) {
			prev->active_mm = NULL;
			mmdrop(oldmm);
		}
	}

	/*
	 * This just switches the register state and the
	 * stack.
	 */
	switch_to(prev, next, prev);
	__schedule_tail(prev);

same_process:
	reacquire_kernel_lock(current);
	if (current->need_resched)
		goto need_resched_back;
	return;
}
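The weight used above comes from goodness(). For reference, this is 2.4's goodness() with its comments abridged (quoted from memory, so minor details may differ):

static inline int goodness(struct task_struct * p, int this_cpu,
			   struct mm_struct *this_mm)
{
	int weight;

	/* a yielding process is selected after everything else */
	weight = -1;
	if (p->policy & SCHED_YIELD)
		goto out;

	if (p->policy == SCHED_OTHER) {
		/* normal thread: base weight is the remaining time slice;
		 * an exhausted slice means weight 0 */
		weight = p->counter;
		if (!weight)
			goto out;
#ifdef CONFIG_SMP
		/* prefer the CPU the task last ran on */
		if (p->processor == this_cpu)
			weight += PROC_CHANGE_PENALTY;
#endif
		/* small bonus for sharing the current address space */
		if (p->mm == this_mm || !p->mm)
			weight += 1;
		weight += 20 - p->nice;
		goto out;
	}

	/* real-time thread: always beats every normal thread */
	weight = 1000 + p->rt_priority;
out:
	return weight;
}

Two things fall out of this: a real-time thread scores at least 1000 and therefore always beats any normal thread, and for normal threads the weight shrinks as the slice is consumed, which is where the dynamic weight behavior comes from.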
Scheduling Points
Scheduling falls into two categories: voluntarily giving up the CPU, and being forced to give it up;
- Voluntary: for example, a program calls sleep; once it traps into the kernel the thread releases the CPU itself and then calls the scheduling function schedule();
- Involuntary: a scheduling decision is made when returning from kernel mode to user space (this is the only place it can happen), in the following cases:
  (1) returning to user space after an interrupt handler completes;
  (2) returning to user space after an exception handler completes;
  (3) returning to user space when a system call returns from the kernel;
This exposes a key limitation: the kernel itself is not preemptible. Suppose thread A (normal priority) traps into the kernel through a system call; while it runs in kernel mode an interrupt fires, and the handler wakes thread B (a real-time thread). B is not scheduled immediately; it has to wait until A's system call finishes and the kernel returns to user space before it can run. This is the single biggest drawback of the 2.4 kernel: no preemption in kernel mode.
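The check happens on the return path to user space. The real code is assembly (ret_from_sys_call in arch/i386/kernel/entry.S); the following C rendering is my own sketch of its logic, not kernel source:

/* Sketch only: the 2.4 return-to-user-space path, in C pseudocode.
 * The real implementation is x86 assembly in arch/i386/kernel/entry.S. */
void ret_from_sys_call_sketch(void)
{
	if (current->need_resched)
		schedule();	/* the only point where a wakened RT thread
				 * can finally preempt the syscall's caller */
	/* pending signals are also delivered here, then user registers
	 * are restored and the CPU drops back to user mode */
}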
Time Slice Calculation
#if HZ < 200
#define TICK_SCALE(x) ((x) >> 2)
#elif HZ < 400
#define TICK_SCALE(x) ((x) >> 1)
#elif HZ < 800
#define TICK_SCALE(x) (x)
#elif HZ < 1600
#define TICK_SCALE(x) ((x) << 1)
#else
#define TICK_SCALE(x) ((x) << 2)
#endif
#define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)//nice [-20,19]
For normal threads the priority range is 100-139, mapping to nice values in [-20, +19]; the smaller the number, the higher the priority. Worked examples for different HZ values (one tick is 1000/HZ ms):
//HZ=100  (TICK_SCALE(x) = x >> 2, one tick = 10ms)
//nice=-20: x=40 (0b0010 1000) -> scaled 10, (10+1)*10ms = 110ms
//nice=19:  x=1  (0b0000 0001) -> scaled 0,  (0+1)*10ms  = 10ms
//HZ=250  (TICK_SCALE(x) = x >> 1, one tick = 4ms)
//nice=-20: x=40 (0b0010 1000) -> scaled 20, (20+1)*4ms = 84ms
//nice=19:  x=1  (0b0000 0001) -> scaled 0,  (0+1)*4ms  = 4ms
//HZ=1000 (TICK_SCALE(x) = x << 1, one tick = 1ms)
//nice=-20: x=40 (0b0010 1000) -> scaled 80, (80+1)*1ms = 81ms
//nice=19:  x=1  (0b0000 0001) -> scaled 2,  (2+1)*1ms  = 3ms
As these examples show, the time slice a thread is given is tightly coupled to HZ (the small program below reproduces the numbers).
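If you want to check the arithmetic yourself, here is a stand-alone user-space program that re-implements the two macros; the hz parameter stands in for the kernel's compile-time HZ and is purely illustrative:

#include <stdio.h>

/* user-space re-implementation of the kernel macros, for checking the math */
#define TICK_SCALE(x, hz) \
	((hz) < 200 ? (x) >> 2 : \
	 (hz) < 400 ? (x) >> 1 : \
	 (hz) < 800 ? (x) : \
	 (hz) < 1600 ? (x) << 1 : (x) << 2)
#define NICE_TO_TICKS(nice, hz) (TICK_SCALE(20 - (nice), hz) + 1)

int main(void)
{
	int hzs[] = { 100, 250, 1000 };
	int nices[] = { -20, 0, 19 };

	for (int i = 0; i < 3; i++)
		for (int j = 0; j < 3; j++) {
			int hz = hzs[i], nice = nices[j];
			int ticks = NICE_TO_TICKS(nice, hz);
			printf("HZ=%4d nice=%3d -> %2d ticks = %d ms\n",
			       hz, nice, ticks, ticks * 1000 / hz);
		}
	return 0;
}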
For real-time threads there are two cases:
SCHED_FIFO: no time slice; once the thread gets the CPU it keeps running until it voluntarily yields or a higher-priority thread preempts it;
SCHED_RR: here is an odd design choice; as the code below shows, an RR thread's time slice is, surprisingly, derived from its nice value...
/* move an exhausted RR process to be last.. */
if (unlikely(prev->policy == SCHED_RR))
	if (!prev->counter) {
		prev->counter = NICE_TO_TICKS(prev->nice);	/* recompute the slice */
		move_last_runqueue(prev);	/* move to the tail */
	}
nice usually defaults to 0, which gives a time slice of:
//HZ=1000
//nice=0: x=20 (0b0001 0100) -> scaled 40, (40+1)*1ms = 41ms
One question: if I change a thread's nice value by hand (e.g., with the renice command), does that also change the round-robin time slice it gets? I will verify this later.
Problems with This Scheduler
- The kernel is not preemptible in kernel mode, so real-time latency cannot be guaranteed;
- All CPUs on an SMP system share one runqueue, making access inefficient;
- Every scheduling decision walks the entire runqueue, O(n) time complexity;
Open Questions
- What exactly is the relationship between the RR time slice and nice, and does changing nice from user space affect the kernel's slice allocation?
- How is mutually exclusive access to the shared runqueue guaranteed across cores? This ties into how the preempt_rt patch reworks spinlocks;
- How does a normal process's weight change dynamically while it runs?
And so on; I will work through these step by step in later posts.