linux实时线程调度算法,Linux调度原理介绍和应用（前篇）

最新推荐文章于 2024-08-26 19:39:48 发布

happy最紧要

最新推荐文章于 2024-08-26 19:39:48 发布

阅读量1k

点赞数 1

文章标签： linux实时线程调度算法

摘要

本文(有码慎入)主要介绍Linux任务调度相关的发展历史和基本原理。多年以来，内核界的***们一直着力于寻找既能满足高负载后台任务资源充分利用，又能满足桌面系统良好交互性的调度方法，尽管截至到目前为止仍然没有一个完美的解决方案。本文希望通过介绍调度算法的发展历程，因为任务调度本身不是一个局限于操作系统的话题，包括数据库，程序语言实现等，都会与调度相关。本文在介绍过程中，会引用Linux的代码实现作为说明，同时阐述其中的一些趣闻轶事。

调度实体

进程任务通常包含一个或者多个线程任务，不同于用户态线程，系统线程没有用户态的地址空间，这些系统线程单独生存于内核空间。系统任务通常做的事情，就是维护系统的正常运行，只有系统自己可以起停内核线程，不同于pthread用户接口，在中我们可以找到对应的内核线程管理接口：/* Simple interface for creating and stopping kernel threads withoumess. */

…

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), void *data, int node, const char namefmt[], ...);

#define kthread_create(threadfn, data, namefmt, arg...) \

kthread_create_on_node(threadfn, data, -1, namefmt, ##arg)

实际上，同于其他unix-like调度的单位是内核线程，Linux的***在系统内部通常也会混用“线程”(process)和“进程”(thread)两个概念。

在Linux当中，每一个任务都是由一个c语言的struct描述，中，我们可以找到相关的信息。实际上，由于历史原因，这个结构非常庞大和复杂(或者也可以看一下MySQL的THD定义…)，几乎包含了所有任务相关的要素，包括进程id、父任务、子任务、任务寄存器、任务包含资源描述符等等，Linux使用一个循环双向链表作为任务管理的基础结构。struct task_struct {

volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */

void *stack;

…

unsigned int flags; /* per process flags */

…

struct mm_struct *mm;

…

pid_t pid;

pid_t tgid;

…

struct task_struct __rcu *real_parent; /* real parent process */

struct list_head children; /* list of my children */

struct list_head sibling; /* linkage in my parent's children list */

...

}

系统线程栈空间大小有限，所以我们不可以在栈上来回拷贝这个结构，需要一层额外的封装，所以实际上使用的结构体是：struct thread_info {

struct task_struct *task; /* main task structure */ 这里存一个指针吧

struct exec_domain *exec_domain; /* execution domain */

__u32 flags; /* low level flags */

__u32 status; /* thread synchronous flags */

__u32 cpu; /* current CPU */

int saved_preempt_count;

mm_segment_t addr_limit;

struct restart_block restart_block;

void __user *sysenter_return;

unsigned int uaccess_err:1; /* uaccess failed */

};

遍历一个包含成千上万任务的链表代价巨大，Linux使用另外一个宏叫做current来指向当前执行的任务，注意这个结构与架构和实现无关(对于有的平台或许是直接指向任务的指针，或许是需要根据特定寄存器每次进行计算的方法)。对于X86而言，通过下面的方法(3.19版本实现)获取当前内核栈thread_info信息(注意thread_info在内核栈的栈底)，static inline struct thread_info *current_thread_info(void)

{

struct thread_info *ti; ti = (void *)(this_cpu_read_stable(kernel_stack) +

KERNEL_STACK_OFFSET - THREAD_SIZE);

return ti；

｝

找到thread_info之后，即可找到对应的任务

current = current_thread_info()->task;

4.1之后的版本，对于内核栈进行了调整，4.9之后版本，不再通过thread_info获取task指针。

既然内存空间不在内核栈，对于任务task结构所使用的内存，Linux使用SLAB内存分配器进行管理，对于不同的任务内容(task_structs, inodes, mm_structs, fs_caches)，都进行空间预分配，避免碎化。这些不同的目标作为不同的slab，所以名叫slab allocator。一个slab可以处于空闲、部分使用、全部使用三种状态，通常分配会从部分使用的slab中获取对象，如果没有slab可用，会重新分配一个。这一部分一直以来都是Linux内存管理的核心，***们看起来不太愿意去动这一部分代码。但随着新硬件的发展，系统的复杂，老版本slab的优势逐渐丧失，开始成为瓶颈，这里不进行展开，有兴趣可以阅读更多Linux内存管理相关的资料。

在介绍如何调度之前，还是有先温习一下Linux对于链表的经典实现方式(一直被模仿)。

Linux实现中，通常把链表结点“嵌入”到需要关联的对象当中，struct list_head {

struct list_head *next, *prev;

};

struct task {

int num; /* number of the task to do */

bool is_awesome; /* is it something awesome? */

struct list_head mylist; /* the list of all tasks */ 注意这里

};

初始化相关的宏#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \

struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *mylist)

{

list->next = mylist;

list->prev = mylist;

}

例子struct task *some_task;

/* … allocate some_task… */

some_task->num = 1;

some_task->is_awesome = false;

INIT_LIST_HEAD(&some_task->mylist);

在中，有所有常用链表操作相关的实现O(1)复杂度。static inline void list_add(struct list_head *new,

struct list_head *head);

static inline void list_add_tail(struct list_head *new,

struct list_head *head);

static inline void list_del(struct list_head *entry);

遍历操作struct list_head *h;

struct task *t;

list_for_each(h, &task_list) {

t = list_entry(h, struct task_list, mylist);

if (t->num == 1)

// do something

｝

更简单的方式

list_for_each_entry(pos, head, member) pos是通过链表成员反向找到的被嵌套对象，head是需要被遍历的链表

调度算法-概念

Linux是一个经典的多任务系统，多任务意味着系统可以并发执行多个用户任务，但不一定是真正意义上的“并行”。调度器的目的在于，评估出下一个时间周期，哪一个任务应该被实际“执行”。Linux实现的是“可抢占”式任务调度，时钟中断是实现抢占的基础，一般来说，每秒钟会有1000次时钟中断，时钟中断发生的时候，调度器决定当前哪一个任务可以可以在哪一个cpu上继续执行。任务获得执行的时间我们称之为“时间片”。

任务通常分为两种类型，交互式和非交互式。调度器的目的，需要保证非交互式任务的资源使用，又不能使得系统交互体验太差。简单来说，Linux的做法是提供给非交互式任务更长的“不可中断”的执行时间片，但更低的被调度频率。对于交互式任务，更多的被调度频率，但更短的执行时间片。

回顾一下上一部分提到的任务调度实体：struct task_struct {

int prio, static_prio, normal_prio;

unsigned int rt_priority;

const struct sched_class *sched_class;

struct sched_entity se;

struct sched_rt_entity rt;

…

unsigned int policy;

cpumask_t cpus_allowed;

…

};

重要字段：prio -> 优先级，重要程度

policy -> 调度策略(通常为SCHED_NORMAL普通任务)

cpus_allowed -> 允许被运行到哪些逻辑核心，可通过亲和性进行配置

优先级意味着更多被调度执行的可能型，Linux任务默认nice值为0，可以通过接口进行修改(比如nice命令)，范围取值为

-20_ _ _ _ _ 0 _ _ _ _ _+19

用户自身只能增加自己的nice值。这里主要讨论普通任务调度，硬实时调度(指定时间内必须完成指定任务)Linux并不支持，可以参考(RTLinux以及其他一些实时任务操作系统)。对于软实时任务调度，Linux通过real-time异常来进行支持，这里也不做过多的描述。

调度算法-实现

Linux中与调度实现相关的部分目录下core.c 包括调度器的核心部分

fair.c CFS(completely fair scheduler)算法的实现

rt.c 实时调度算法的实现

idle_task.c idle任务的调度实现

中，这些统一的调度策略包含相同的接口，通过函数指针的方式，可以获得其实现：struct sched_class {

const struct sched_class *next;

void (*enqueue_task) (struct rq *rq,

struct task_struct *p,

int flags);

void (*dequeue_task) (struct rq *rq,

struct task_struct *p,

int flags);

void (*yield_task) (struct rq *rq);

…

void (*check_preempt_curr) (struct rq *rq,

struct task_struct *p,

int flags);

struct task_struct * (*pick_next_task) (struct rq *rq,

struct task_struct *prev);

…

void (*set_curr_task) (struct rq *rq);

void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);

void (*task_fork) (struct task_struct *p);

void (*task_dead) (struct task_struct *p);

…

};

任务状态变化核心部分- enqueue_task 任务进入可调度状态的时候被调用时候，把任务放入可执行队列，同时增加

可执行任务个数变量

- dequeue_task 上面操作的反向操作

- yield_task 任务自愿希望放弃cpu

- pick_next_task 获取下一个可以执行的任务

- task_tick 任务被调度

系统根据任务的调度策略(policy)决定哪一个调度类型被使用

调度器的核心组件在于“可执行任务队列runqueue”，顾名思义，runqueue里面包含当前可以执行的所有任务。注意，调度器保证，每一个逻辑核心包含自己的runqueue，每一个任务至多可以被放到一个runqueue里面。当然，多核处理器的情境下，任务也可能由于负载原因从一个runqueue放到另外一个runqueue。最终，调度器的任务是实现从runqueue中选出一个任务，使其在对应的逻辑核心上执行。

在中runqueue的大致实现/*

* This is the main, per-CPU runqueue data structure.

struct rq {

…

unsigned int nr_running;

…

#define CPU_LOAD_IDX_MAX 5

unsigned long cpu_load[CPU_LOAD_IDX_MAX];

…

/* capture load from *all* tasks on this cpu: */

struct load_weight load;

…

struct cfs_rq cfs;

struct rt_rq rt;

…

struct task_struct *curr, *idle, …;

…

u64 clock;

…

/* cpu of this runqueue: */

int cpu;

…

}nr_running 当前可执行任务个数

load cpu的负载相关的一些信息

curr 当前执行任务的指针

cfs 比较重要，CFS调度相关的信息

那么，对于一个runqueue，谁来唤醒调度任务的执行，实现任务切换的呢？

实际上，调度的核心方法schedule()是上文所述的“时钟中断”唤醒调用执行，其中断目标函数为scheduler_tick()，这个方法实际做了以下事情：/*

* This function gets called by the timer code, with HZ frequency.

* We call it with interrupts disabled.

void scheduler_tick(void)

{

int cpu = smp_processor_id();

struct rq *rq = cpu_rq(cpu);

struct task_struct *curr = rq->curr;

…

update_rq_clock(rq);

curr->sched_class->task_tick(rq, curr, 0);

update_cpu_load_active(rq);

…

}

调用更新run queue clock相关信息，然后根据调度策略(policy)执行对应的调度方法，例如CFS策略最终会调用task_tick_fair()，这个方法在中实现。当任务需要等待其他条件(例如IO)情况下，任务会首先将自己置为TASK_INTERRUPTIBLE 或者 TASK_UNINTERRUPTIBLE 不同的状态，后者表明当前任务无法被中断，当任务进入sleep状态之前，schedule函数会再一次被调用选出下一次需要执行的任务。当任务条件满足，被唤醒加入到runqueue的时候，如果该任务优先级较高，schedule函数会再一次被调用。

终于我们走到了最核心的scheduler函数，调度算法的具体实现，在下一篇的介绍当中，我们会继续结合代码，介绍Linux调度算法的前世今生，发展由来以及其中的恩怨纠葛。

总结

本文主要介绍了与任务调度的一些基本概念，并结合Linux的实现进行了相关代码介绍。调度本身是一个庞大而复杂的话题，在后续的文章中，我们会继续就Linux的相关实现进行介绍，同时结合数据库场景的实际应用作为例子，分享一些测试数据和结论。