Linux Scheduler (Part 2): Analyzing the 2.6 Kernel

Problems with the 2.4 Kernel Scheduler

As introduced in the previous article, Linux Scheduler (Part 1): Analyzing the 2.4 Kernel, the 2.4 kernel has the following problems:

  • On SMP systems, all CPUs share a single runqueue, so access is inefficient;
  • Picking the next thread to run requires walking the runqueue's linked list on every schedule, O(n) time;
  • When all tasks have exhausted their timeslices, the list must be walked again to recompute every timeslice, O(n) time;
  • The kernel is not preemptible while running in kernel mode;

The 2.6 scheduler addresses each of these problems in turn; let's walk through the 2.6 kernel below.

2.6 Kernel Scheduler Architecture

[Figure: each CPU has its own runqueue; a load-balancing module distributes tasks among them]
As the figure shows, each CPU now has its own runqueue, so CPUs no longer contend for a single shared queue, which improves access efficiency. A load-balancing module decides which CPU's queue each task is placed on; we will study the load-balancing code separately later. First, let's see how each CPU's runqueue is organized:

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	spinlock_t lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned long nr_running;
	unsigned long raw_weighted_load;
#ifdef CONFIG_SMP
	unsigned long cpu_load[3];
#endif
	unsigned long long nr_switches;

	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long nr_uninterruptible;

	unsigned long expired_timestamp;
	unsigned long long timestamp_last_tick;
	struct task_struct *curr, *idle;
	struct mm_struct *prev_mm;
	struct prio_array *active, *expired, arrays[2];	/* how runnable tasks are organized */
	int best_expired_prio;
	atomic_t nr_iowait;

#ifdef CONFIG_SMP
	struct sched_domain *sd;

	/* For active balancing */
	int active_balance;
	int push_cpu;

	struct task_struct *migration_thread;
	struct list_head migration_queue;
#endif

#ifdef CONFIG_SCHEDSTATS
	/* latency stats */
	struct sched_info rq_sched_info;

	/* sys_sched_yield() stats */
	unsigned long yld_exp_empty;
	unsigned long yld_act_empty;
	unsigned long yld_both_empty;
	unsigned long yld_cnt;

	/* schedule() stats */
	unsigned long sched_switch;
	unsigned long sched_cnt;
	unsigned long sched_goidle;

	/* try_to_wake_up() stats */
	unsigned long ttwu_cnt;
	unsigned long ttwu_local;
#endif
	struct lock_class_key rq_lock_key;
};

The organization looks like this:
[Figure: the active and expired prio_array structures, each holding per-priority task queues plus a priority bitmap]
active: the array of tasks currently eligible to run, with timeslice remaining
expired: the array for the next round, holding tasks whose timeslice is used up
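
Each of the two arrays is a struct prio_array. In the 2.6 series (kernel/sched.c) it is defined essentially as follows: one list head per priority level, plus a bitmap with a bit set for every non-empty queue:

struct prio_array {
	unsigned int nr_active;			/* total runnable tasks in this array */
	DECLARE_BITMAP(bitmap, MAX_PRIO+1);	/* one bit per priority, +1 delimiter bit */
	struct list_head queue[MAX_PRIO];	/* one FIFO list per priority level */
};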

This addresses the 2.4 problem that recomputing all timeslices required an O(n) walk of the list.
The solution: when a task in active uses up its timeslice, it is moved to expired (receiving its new timeslice right away); once every task in active has expired, the active and expired pointers are simply swapped, in O(1) time.
The swap code is as follows:

	array = rq->active;
	if (unlikely(!array->nr_active)) {
		/*
		 * Switch the active and expired arrays.
		 */
		schedstat_inc(rq, sched_switch);
		rq->active = rq->expired;
		rq->expired = array;
		array = rq->active;
		rq->expired_timestamp = 0;
		rq->best_expired_prio = MAX_PRIO;
	}
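
The move to the expired array happens tick by tick in scheduler_tick(); abridged from the 2.6 source, the expiry path looks roughly like this. Note that the task's new timeslice is computed immediately, so no later O(n) pass over all tasks is ever needed:

	if (!--p->time_slice) {
		dequeue_task(p, rq->active);
		set_tsk_need_resched(p);
		p->prio = effective_prio(p);
		p->time_slice = task_timeslice(p);	/* refill the timeslice now */
		p->first_time_slice = 0;
		if (!rq->expired_timestamp)
			rq->expired_timestamp = jiffies;
		if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
			enqueue_task(p, rq->expired);
			if (p->static_prio < rq->best_expired_prio)
				rq->best_expired_prio = p->static_prio;
		} else
			enqueue_task(p, rq->active);	/* interactive tasks get requeued */
	}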

This also addresses the other 2.4 problem: picking the next thread to run required an O(n) walk of the runqueue list on every schedule.
The solution: consult the bitmap to find the highest non-empty priority, then take the first task from that priority's queue, in O(1) time.
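
In schedule() this lookup is just a find-first-bit followed by taking the head of that priority's list (from the 2.6 source):

	idx = sched_find_first_bit(array->bitmap);
	queue = array->queue + idx;
	next = list_entry(queue->next, struct task_struct, run_list);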

Kernel Preemption

From the previous article we know that the 2.4 kernel has only two scheduling points: returning from kernel mode to user mode, and a task voluntarily yielding the CPU. The 2.6 kernel adds kernel preemption.
The main difference:
In 2.4, on return from an interrupt or exception, a reschedule can happen only when returning from kernel mode to user mode;
In 2.6, on return from an interrupt or exception, a reschedule can happen both when returning to user mode and when returning to kernel mode, i.e. the kernel itself is preemptible.

The 2.4 kernel code is as follows:

#define RESTORE_ALL	\
	popl %ebx;	\
	popl %ecx;	\
	popl %edx;	\
	popl %esi;	\
	popl %edi;	\
	popl %ebp;	\
	popl %eax;	\
1:	popl %ds;	\
2:	popl %es;	\
	addl $4,%esp;	\
3:	iret;		\
.section .fixup,"ax";	\
4:	movl $0,(%esp);	\
	jmp 1b;		\
5:	movl $0,(%esp);	\
	jmp 2b;		\
6:	pushl %ss;	\
	popl %ds;	\
	pushl %ss;	\
	popl %es;	\
	pushl $11;	\
	call do_exit;	\
.previous;		\
.section __ex_table,"a";\
	.align 4;	\
	.long 1b,4b;	\
	.long 2b,5b;	\
	.long 3b,6b;	\
.previous

ENTRY(ret_from_sys_call)
	cli				# need_resched and signals atomic test
	cmpl $0,need_resched(%ebx)
	jne reschedule
	cmpl $0,sigpending(%ebx)
	jne signal_return
restore_all:
	RESTORE_ALL

	ALIGN
ENTRY(ret_from_intr)
	GET_CURRENT(%ebx)
ret_from_exception:
	movl EFLAGS(%esp),%eax		# mix EFLAGS and CS
	movb CS(%esp),%al
	testl $(VM_MASK | 3),%eax	# return to VM86 mode or non-supervisor?
	jne ret_from_sys_call		# returning to user mode: ret_from_sys_call tests need_resched
	jmp restore_all			# returning to kernel mode: no scheduling point

The 2.6 kernel code is as follows:

	# userspace resumption stub bypassing syscall exit tracing
	ALIGN
	RING0_PTREGS_FRAME
ret_from_exception:
	preempt_stop
ret_from_intr:
	GET_THREAD_INFO(%ebp)
check_userspace:
	movl EFLAGS(%esp), %eax		# mix EFLAGS and CS
	movb CS(%esp), %al
	testl $(VM_MASK | 3), %eax
	jz resume_kernel		# returning to kernel mode
ENTRY(resume_userspace)
 	cli				# make sure we don't miss an interrupt
					# setting need_resched or sigpending
					# between sampling and the iret
	movl TI_flags(%ebp), %ecx
	andl $_TIF_WORK_MASK, %ecx	# is there any work to be done on
					# int/exception return?
	jne work_pending		# scheduling point
	jmp restore_all

#ifdef CONFIG_PREEMPT
ENTRY(resume_kernel)			# scheduling point: kernel preemption
	cli
	cmpl $0,TI_preempt_count(%ebp)	# non-zero preempt_count ?
	jnz restore_nocheck
need_resched:
	movl TI_flags(%ebp), %ecx	# need_resched set ?
	testb $_TIF_NEED_RESCHED, %cl
	jz restore_all
	testl $IF_MASK,EFLAGS(%esp)     # interrupts off (exception path) ?
	jz restore_all
	call preempt_schedule_irq
	jmp need_resched
#endif
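
The call preempt_schedule_irq above is what actually performs the kernel-mode reschedule. Abridged from the 2.6 source, it looks roughly like this: PREEMPT_ACTIVE is held across schedule(), and the need-resched flag is rechecked afterwards in case another preemption request arrived in the meantime:

asmlinkage void __sched preempt_schedule_irq(void)
{
	struct thread_info *ti = current_thread_info();

	/* entered from the IRQ-return path with interrupts disabled
	   and a zero preempt_count */
	BUG_ON(ti->preempt_count || !irqs_disabled());

need_resched:
	add_preempt_count(PREEMPT_ACTIVE);	/* mark this as an involuntary switch */
	local_irq_enable();
	schedule();
	local_irq_disable();
	sub_preempt_count(PREEMPT_ACTIVE);

	/* we could miss a preemption opportunity between schedule and now */
	barrier();
	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
		goto need_resched;
}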

Load Balancing

Load balancing is relatively complex and will get a dedicated article later. The basic idea: when the current CPU's runqueue runs empty, some tasks are pulled over from other CPUs' runqueues to even out the load.
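
For a first taste, this is roughly (abridged) how schedule() triggers it in the 2.6 source: when the runqueue is empty, idle_balance() tries to pull tasks from other CPUs before falling back to the idle task:

	cpu = smp_processor_id();
	if (unlikely(!rq->nr_running)) {
		idle_balance(cpu, rq);		/* try to pull tasks from other CPUs */
		if (!rq->nr_running) {
			next = rq->idle;	/* nothing to pull: run the idle task */
			rq->expired_timestamp = 0;
			goto switch_tasks;
		}
	}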

Problems with the 2.6 Scheduler

  • Priority inversion;
  • Lock granularity is too coarse;
  • Interrupt handling is not threaded.

These problems are addressed in preempt-rt, which we will analyze in the next article.