# 内核初始化流程
读者在这章可以了解到整个内核初始化的完整周期,从内核解压之后的第一步到内核自身运行的第一个进程。
注意:这里不是对所有内核初始化步骤的介绍,只涉及通用的内核内容,不会涉及中断控制、ACPI 以及其它部分。此处没有详述的部分,会在其它章节中描述。
- 内核解压之后的首要步骤 - 描述内核中的首要步骤。
- 早期的中断和异常控制 - 描述了早期的中断初始化和早期的缺页处理函数。
- 在到达内核入口之前最后的准备 - 描述了在调用 start_kernel 之前最后的准备工作。
- 内核入口 - start_kernel - 描述了内核通用代码中初始化的第一步。
- 体系架构初始化 - 描述了特定架构的初始化。
- 进一步初始化指定体系架构 - 描述了再一次的指定架构初始化流程。
- 最后对指定体系架构初始化 - 描述了指定架构初始化流程的结尾。
- 调度器初始化 - 描述了调度初始化之前的准备工作,以及调度初始化。
- RCU 初始化 - 描述了 RCU 的初始化。
- 初始化结束 - Linux内核初始化的最后部分。
# 内核初始化 第一部分

## 踏入内核代码的第一步
上一章是引导过程的最后一部分。从现在开始,我们将深入探究 Linux 内核的初始化过程。在解压缩完 Linux 内核镜像、并把它妥善地放入内存后,内核就开始工作了。我们在第一章中介绍了 Linux 内核引导程序,它的任务就是为执行内核代码做准备。而在本章中,我们将探究内核代码,看一看内核的初始化过程——即在启动 PID 为 1
的 init
进程前,内核所做的大量工作。
本章的内容很多,介绍了在内核启动前的所有准备工作。arch/x86/kernel/head_64.S 文件中定义了内核入口点,我们会从这里开始,逐步地深入下去。在 start_kernel
函数(定义在 init/main.c) 执行之前,我们会看到很多的初期的初始化过程,例如初期页表初始化、切换到一个新的内核空间描述符等等。
在上一章的最后一节中,我们跟踪到了 arch/x86/boot/compressed/head_64.S 文件中的 jmp 指令:
jmp *%rax
此时 rax
寄存器中保存的就是 Linux 内核入口点,通过调用 decompress_kernel
(arch/x86/boot/compressed/misc.c) 函数后获得。由此可见,内核引导程序的最后一行代码是一句指向内核入口点的跳转指令。既然已经知道了内核入口点定义在哪,我们就可以继续探究 Linux 内核在引导结束后做了些什么。
## 内核执行的第一步
OK,在调用了 decompress_kernel
函数后,rax
寄存器中保存了解压缩后的内核镜像的地址,并且跳转了过去。解压缩后的内核镜像的入口点定义在 arch/x86/kernel/head_64.S,这个文件的开头几行如下:
__HEAD
.code64
.globl startup_64
startup_64:
...
...
...
我们可以看到 startup_64
过程定义在了 __HEAD
区段下。 __HEAD
只是一个宏,它将展开为可执行的 .head.text
区段:
#define __HEAD .section ".head.text","ax"
我们可以在 arch/x86/kernel/vmlinux.lds.S 链接器脚本文件中看到这个区段的定义:
.text : AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
...
...
...
} :text = 0x9090
除了对 .text
区段的定义,我们还能从这个脚本文件中得知内核的默认物理地址与虚拟地址。_text
是一个地址计数器,对于 x86_64 来说,它定义为:
. = __START_KERNEL;
__START_KERNEL
宏的定义在 arch/x86/include/asm/page_types.h 头文件中,它由内核映射的虚拟基址与基物理起始点相加得到:
#define __START_KERNEL (__START_KERNEL_map + __PHYSICAL_START)
#define __PHYSICAL_START ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
换句话说:
- Linux 内核的物理基址 -
0x1000000
; - Linux 内核的虚拟基址 -
0xffffffff81000000
.
现在我们知道了 startup_64
过程的默认物理地址与虚拟地址,但是真正的地址必须要通过下面的代码计算得到:
leaq _text(%rip), %rbp
subq $_text - __START_KERNEL_map, %rbp
没错,虽然定义为 0x1000000
,但是仍然有可能变化,例如启用 kASLR 的时候。所以我们当前的目标是计算 0x1000000
与实际加载地址的差。这里我们首先将RIP相对地址(rip-relative
)放入 rbp
寄存器,并且从中减去 $_text - __START_KERNEL_map
。我们已经知道, _text
在编译后的默认虚拟地址为 0xffffffff81000000
, 物理地址为 0x1000000
。__START_KERNEL_map
宏将展开为 0xffffffff80000000
,因此对于对于第二行汇编代码,我们将得到如下的表达式:
rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)
在计算过后,rbp
的值将为 0
,代表了实际加载地址与编译后的默认地址之间的差值。在我们这个例子中,0
代表了 Linux 内核被加载到了默认地址,并且没有启用 kASLR 。
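上面这个偏移量的计算可以用一小段 C 代码来验证(假设性的示意代码,常量取自文中,并非内核源码):

```c
#include <assert.h>
#include <stdint.h>

/* 假设性的示意:模拟 head_64.S 中计算加载偏移的两条指令。
 * load_addr 对应 leaq _text(%rip), %rbp 得到的实际加载地址。 */
#define DEFAULT_TEXT_VIRT 0xffffffff81000000UL /* _text 的默认虚拟地址 */
#define START_KERNEL_MAP  0xffffffff80000000UL /* __START_KERNEL_map */

uint64_t load_delta(uint64_t load_addr)
{
    /* 对应 subq $_text - __START_KERNEL_map, %rbp */
    return load_addr - (DEFAULT_TEXT_VIRT - START_KERNEL_MAP);
}
```

内核加载到默认物理地址 0x1000000 时,load_delta 返回 0;若加载地址被(例如 kASLR)移动了 0x200000,则返回 0x200000。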
在得到了 startup_64
的地址后,我们需要检查这个地址是否已经正确对齐。下面的代码将进行这项工作:
testl $~PMD_PAGE_MASK, %ebp
jnz bad_address
在这里我们用 testl
指令检测 rbp
寄存器的低 32 位中是否有落在 ~PMD_PAGE_MASK
范围内的位被置位。PMD_PAGE_MASK
代表中层页目录(Page middle directory
)的屏蔽位(相关信息请阅读 paging 一节),它的定义如下:
#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))
#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT 21
可以很容易得出 PMD_PAGE_SIZE
为 2MB
。在这里我们使用标准公式来检查对齐问题,如果 text
的地址没有对齐到 2MB
,则跳转到 bad_address
。
在此之后,我们通过检查高 18
位来防止这个地址过大:
leaq _text(%rip), %rax
shrq $MAX_PHYSMEM_BITS, %rax
jnz bad_address
这个地址必须不超过 46
个比特,即小于2的46次方:
#define MAX_PHYSMEM_BITS 46
OK,至此我们完成了一些初步的检查,可以继续进行后续的工作了。
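这两项检查也可以用一小段 C 代码来模拟(假设性的示意代码,常量取自文中的宏定义):

```c
#include <assert.h>
#include <stdint.h>

/* 假设性的示意:2MB 对齐检查与物理地址位数检查 */
#define PMD_SHIFT        21
#define PMD_PAGE_SIZE    (1UL << PMD_SHIFT)      /* 2MB */
#define PMD_PAGE_MASK    (~(PMD_PAGE_SIZE - 1))
#define MAX_PHYSMEM_BITS 46

/* 对应 testl $~PMD_PAGE_MASK, %ebp:低于 2MB 粒度的位必须全为 0 */
int aligned_to_2mb(uint64_t addr)
{
    return (addr & ~PMD_PAGE_MASK) == 0;
}

/* 对应 shrq $MAX_PHYSMEM_BITS, %rax; jnz bad_address */
int within_physmem(uint64_t addr)
{
    return (addr >> MAX_PHYSMEM_BITS) == 0;
}
```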
## 修正页表基地址
在开始设置 Identity 分页之前,我们需要首先修正下面的地址:
addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
如果 startup_64
的值不为默认的 0x1000000
的话, 则包括 early_level4_pgt
、level3_kernel_pgt
在内的很多地址都会不正确。rbp
寄存器中包含的是相对地址,因此我们把它与 early_level4_pgt
、level3_kernel_pgt
以及 level2_fixmap_pgt
中特定的项相加。首先我们来看一下它们的定义:
NEXT_PAGE(early_level4_pgt)
.fill 511,8,0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level3_kernel_pgt)
.fill L3_START_KERNEL,8,0
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
NEXT_PAGE(level2_kernel_pgt)
PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
KERNEL_IMAGE_SIZE/PMD_SIZE)
NEXT_PAGE(level2_fixmap_pgt)
.fill 506,8,0
.quad level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
.fill 5,8,0
NEXT_PAGE(level1_fixmap_pgt)
.fill 512,8,0
看起来很难理解,实则不然。首先我们来看一下 early_level4_pgt
。它的前 (4096 - 8) 个字节全为 0
,即它的前 511
个项均不使用,之后的一项是 level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
。我们知道 __START_KERNEL_map
是内核的虚拟基地址,因此减去 __START_KERNEL_map
后就得到了 level3_kernel_pgt
的物理地址。现在我们来看一下 _PAGE_TABLE
,它是页表项的访问权限:
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
更多信息请阅读 分页 部分.
level3_kernel_pgt
中保存的两项用来映射内核空间,在它的前 510
(即 L3_START_KERNEL
)项均为 0
。这里的 L3_START_KERNEL
保存的是在上层页目录(Page Upper Directory)中包含__START_KERNEL_map
地址的那一条索引,它等于 510
。后面一项 level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
中的 level2_kernel_pgt
比较容易理解,它是一条页表项,包含了指向中层页目录的指针,它用来映射内核空间,并且具有如下的访问权限:
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
_PAGE_DIRTY)
level2_fixmap_pgt
保存一组固定映射(fixmap)用的虚拟地址,它们即使在内核空间中也可以指向任意的物理地址。其中一项指向 level1_fixmap_pgt
,并留有 10
MB 大小的空间用来为 vsyscalls 做映射。level2_kernel_pgt
则调用了 PMDS
宏,在 __START_KERNEL_map
地址处为内核的 .text
创建了 512
MB 大小的映射(这 512
MB 空间的后面是模块内存空间)。
现在,在看过了这些符号的定义之后,让我们回到本节开始时介绍的那几行代码。rbp
寄存器包含了实际地址与 startup_64
地址之差,其中 startup_64
的地址是在内核链接时获得的。因此我们只需要把它与各个页表项的基地址相加,就能够得到正确的地址了。在这里这些操作如下:
addq %rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_kernel_pgt + (510*8)(%rip)
addq %rbp, level3_kernel_pgt + (511*8)(%rip)
addq %rbp, level2_fixmap_pgt + (506*8)(%rip)
换句话说,early_level4_pgt
的最后一项就是 level3_kernel_pgt
,level3_kernel_pgt
的最后两项分别是 level2_kernel_pgt
和 level2_fixmap_pgt
, level2_fixmap_pgt
的第507项就是 level1_fixmap_pgt
页目录。
在这之后我们就得到了:
early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0] -> 512 MB kernel mapping
level2_fixmap_pgt[506] -> level1_fixmap_pgt
需要注意的是,我们并不修正 early_level4_pgt
以及其他页目录的基地址,我们会在构造、填充这些页目录结构的时候修正。我们修正了页表基地址后,就可以开始构造这些页目录了。
## Identity Map Paging
现在我们可以进入到对初期页表进行 Identity 映射的初始化过程了。在 Identity 映射分页中,虚拟地址会被映射到地址相同的物理地址上,即 1 : 1
。下面我们来看一下细节。首先我们找到 _text
与 _early_level4_pgt
的 RIP 相对地址,并把他们放入 rdi
与 rbx
寄存器中。
leaq _text(%rip), %rdi
leaq early_level4_pgt(%rip), %rbx
在此之后我们把 _text
的地址保存到 rax
中。全局页目录表中有一项对应着 _text
所在的地址,为了得到这一项的索引,我们把 _text
的地址右移 PGDIR_SHIFT
位:
movq %rdi, %rax
shrq $PGDIR_SHIFT, %rax
leaq (4096 + _KERNPG_TABLE)(%rbx), %rdx
movq %rdx, 0(%rbx,%rax,8)
movq %rdx, 8(%rbx,%rax,8)
其中 PGDIR_SHIFT
为 39
。PGDIR_SHIFT
表示的是全局页目录索引在虚拟地址中的位偏移。下面的宏定义了各级页目录索引对应的位偏移:
#define PGDIR_SHIFT 39
#define PUD_SHIFT 30
#define PMD_SHIFT 21
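用这几个常量可以验证前文提到的索引值——__START_KERNEL_map 附近的内核地址在顶层页目录中对应第 511 项、在上层页目录中对应第 510 项(假设性的示意代码,并非内核源码):

```c
#include <assert.h>
#include <stdint.h>

/* 假设性的示意:按文中的位偏移从虚拟地址中取出各级页目录索引,
 * 每级页目录均有 512 项,因此用 511 作掩码。 */
#define PGDIR_SHIFT 39
#define PUD_SHIFT   30
#define PMD_SHIFT   21

unsigned pgd_index(uint64_t va) { return (va >> PGDIR_SHIFT) & 511; }
unsigned pud_index(uint64_t va) { return (va >> PUD_SHIFT) & 511; }
unsigned pmd_index(uint64_t va) { return (va >> PMD_SHIFT) & 511; }
```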
此后我们就将 level3_kernel_pgt
的地址放进 rdx
中,并将它的访问权限设置为 _KERNPG_TABLE
(见上),然后将 level3_kernel_pgt
填入 early_level4_pgt
的两项中。
然后我们给 rdx
寄存器加上 4096
(即 early_level4_pgt
的大小),并把 rdi
寄存器的值(即 _text
的物理地址)赋值给 rax
寄存器。之后我们把上层页目录中的两个项写入 level3_kernel_pgt
:
addq $4096, %rdx
movq %rdi, %rax
shrq $PUD_SHIFT, %rax
andl $(PTRS_PER_PUD-1), %eax
movq %rdx, 4096(%rbx,%rax,8)
incl %eax
andl $(PTRS_PER_PUD-1), %eax
movq %rdx, 4096(%rbx,%rax,8)
下一步我们把中层页目录表项的地址写入 level2_kernel_pgt
,然后修正内核的 text 和 data 的虚拟地址:
leaq level2_kernel_pgt(%rip), %rdi
leaq 4096(%rdi), %r8
1: testq $1, 0(%rdi)
jz 2f
addq %rbp, 0(%rdi)
2: addq $8, %rdi
cmp %r8, %rdi
jne 1b
这里首先把 level2_kernel_pgt
的地址赋值给 rdi
,并把页表项的地址赋值给 r8
寄存器。下一步我们来检查 level2_kernel_pgt
中的存在位,如果其为0,就把 rdi
加上8以便指向下一个页。然后我们将其与 r8
(即页表项的地址)作比较,不相等的话就跳转回前面的标签 1
,反之则继续运行。
接下来我们使用 rbp
(即 _text
的物理地址)来修正 phys_base
物理地址。将 early_level4_pgt
的物理地址与 rbp
相加,然后跳转至标签 1
:
addq %rbp, phys_base(%rip)
movq $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f
其中 phys_base
与 level2_kernel_pgt
第一项相同,为 512
MB的内核映射。
## 跳转至内核入口点之前的最后准备
此后我们就跳转至标签 1
,开启 PAE
和 PGE
(Paging Global Extension),并把 rax
中的值(early_level4_pgt - __START_KERNEL_map,见上)加上 phys_base
,得到 early_level4_pgt
的物理地址,写入 cr3
寄存器:
1:
movl $(X86_CR4_PAE | X86_CR4_PGE), %ecx
movq %rcx, %cr4
addq phys_base(%rip), %rax
movq %rax, %cr3
接下来我们检查CPU是否支持 NX 位:
movl $0x80000001, %eax
cpuid
movl %edx,%edi
首先将 0x80000001
放入 eax
中,然后执行 cpuid
指令来得到处理器信息。这条指令的结果会存放在 edx
中,我们把他再放到 edi
里。
现在我们把 MSR_EFER
(即 0xc0000080
)放入 ecx
,然后执行 rdmsr
指令来读取CPU中的Model Specific Register (MSR)。
movl $MSR_EFER, %ecx
rdmsr
返回结果将存放于 edx:eax
。下面展示了 EFER
各个位的含义:
63 32
--------------------------------------------------------------------------------
| |
| Reserved MBZ |
| |
--------------------------------------------------------------------------------
31 16 15 14 13 12 11 10 9 8 7 1 0
--------------------------------------------------------------------------------
| | T | | | | | | | | | |
| Reserved MBZ | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
| | E | | | | | | | | | |
--------------------------------------------------------------------------------
在这里我们不会介绍每一个位的含义,没有涉及到的位和其他的 MSR 将会在专门的部分介绍。在我们将 EFER
读入 edx:eax
之后,通过 btsl
来将 _EFER_SCE
(即第0位)置1,设置 SCE
位将会启用 SYSCALL
以及 SYSRET
指令。下一步我们检查 edi
(即 cpuid
的结果(见上)) 中的第20位。如果第 20
位(即 NX
位)置位,我们就只把 EFER_SCE
写入MSR。
btsl $_EFER_SCE, %eax
btl $20,%edi
jnc 1f
btsl $_EFER_NX, %eax
btsq $_PAGE_BIT_NX,early_pmd_flags(%rip)
1: wrmsr
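上面几条位操作指令的效果可以用 C 改写如下(假设性的示意代码,位号取自文中;efer 对应 rdmsr 读入 eax 的低 32 位,cpuid_edx 对应保存在 edi 中的 cpuid 结果):

```c
#include <assert.h>
#include <stdint.h>

#define EFER_SCE_BIT 0   /* SCE:启用 SYSCALL/SYSRET */
#define EFER_NX_BIT  11  /* NXE:启用 NX 位 */
#define CPUID_NX_BIT 20  /* cpuid 0x80000001 结果 edx 中的 NX 支持位 */

uint32_t fixup_efer(uint32_t efer, uint32_t cpuid_edx)
{
    efer |= 1u << EFER_SCE_BIT;            /* btsl $_EFER_SCE, %eax */
    if (cpuid_edx & (1u << CPUID_NX_BIT))  /* btl $20, %edi; jnc 1f */
        efer |= 1u << EFER_NX_BIT;         /* btsl $_EFER_NX, %eax */
    return efer;                           /* 随后由 wrmsr 写回 */
}
```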
如果支持 NX 那么我们就把 _EFER_NX
也写入MSR。在设置了 NX 后,还要对 cr0
(control register) 中的一些位进行设置:
- X86_CR0_PE - 系统处于保护模式;
- X86_CR0_MP - 与 CR0 的 TS 标志位一同控制 WAIT/FWAIT 指令的功能;
- X86_CR0_ET - 在 386 上用于指定外部数学协处理器是 80287 还是 80387;
- X86_CR0_NE - 如果置位,则启用内置的 x87 浮点错误报告,否则启用 PC 风格的 x87 错误检测;
- X86_CR0_WP - 如果置位,则 CPU 在特权等级为 0 时也无法写入只读内存页;
- X86_CR0_AM - 当 AM 位置位、EFLAGS 中的 AC 位置位、特权等级为 3 时,进行对齐检查;
- X86_CR0_PG - 启用分页。
#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
X86_CR0_PG)
movl $CR0_STATE, %eax
movq %rax, %cr0
为了从汇编执行 C 语言代码,我们需要建立一个栈。首先将栈指针指向内存中一块合适的区域,然后重置 FLAGS 寄存器:
movq stack_start(%rip), %rsp
pushq $0
popfq
在这里最有意思的地方在于 stack_start
。它也定义在当前的源文件中:
GLOBAL(stack_start)
.quad init_thread_union+THREAD_SIZE-8
对于 GLOBAL
宏我们应该很熟悉了。它在 arch/x86/include/asm/linkage.h 头文件中定义如下:
#define GLOBAL(name) \
.globl name; \
name:
THREAD_SIZE
定义在 arch/x86/include/asm/page_64_types.h,它依赖于 KASAN_STACK_ORDER
的值:
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)
首先来考虑当禁用了 kasan 并且 PAGE_SIZE
大小为4096时的情况。此时 THREAD_SIZE
将为 16
KB,代表了一个线程的栈的大小。为什么是线程
?我们知道每一个进程可能会有父进程和子进程。事实上,父进程和子进程使用不同的栈空间,每一个新进程都会拥有一个新的内核栈。在Linux内核中,这个栈由 thread_info
结构中的一个union表示:
union thread_union {
struct thread_info thread_info;
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
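按文中的公式,可以直接算出禁用 kasan(KASAN_STACK_ORDER 为 0)时的内核栈大小(假设性的示意代码):

```c
#include <assert.h>

#define PAGE_SIZE         4096UL
#define KASAN_STACK_ORDER 0
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE       (PAGE_SIZE << THREAD_SIZE_ORDER)

unsigned long thread_stack_size(void)
{
    /* 4096 << 2 = 16384 字节,即文中所说的 16KB 内核栈 */
    return THREAD_SIZE;
}
```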
例如,init_thread_union
定义如下:
union thread_union init_thread_union __init_task_data =
{ INIT_THREAD_INFO(init_task) };
其中 INIT_THREAD_INFO
接受 task_struct
结构类型的参数,并进行一些初始化操作:
#define INIT_THREAD_INFO(tsk) \
{ \
.task = &tsk, \
.flags = 0, \
.cpu = 0, \
.addr_limit = KERNEL_DS, \
}
task_struct
结构在内核中代表了对进程的描述。因此,thread_union
包含了关于一个进程的低级信息,并且其位于进程栈底:
+-----------------------+
| |
| |
| |
| Kernel stack |
| |
| |
| |
|-----------------------|
| |
| struct thread_info |
| |
+-----------------------+
需要注意的是我们在栈顶保留了 8
个字节的空间,用来保护对下一个内存页的非法访问。
在初期启动栈设置好之后,使用 lgdt
指令来更新全局描述符表:
lgdt early_gdt_descr(%rip)
其中 early_gdt_descr
定义如下:
early_gdt_descr:
.word GDT_ENTRIES*8-1
early_gdt_descr_base:
.quad INIT_PER_CPU_VAR(gdt_page)
需要重新加载 全局描述符表
的原因是,虽然目前内核工作在低地址的用户空间中,但很快内核将会在它自己的内存地址空间中运行。下面让我们来看一下 early_gdt_descr
的定义。全局描述符表包含了 32 项,用于内核代码段、数据段、线程局部存储段等:
#define GDT_ENTRIES 32
现在来看一下 early_gdt_descr_base
。首先,gdt_page
的定义在 arch/x86/include/asm/desc.h 中:
struct gdt_page {
struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));
它只包含一个成员:由 desc_struct
构成的数组 gdt
。desc_struct
定义如下:
struct desc_struct {
union {
struct {
unsigned int a;
unsigned int b;
};
struct {
u16 limit0;
u16 base0;
unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
};
};
} __attribute__((packed));
它跟 GDT
描述符的定义很像。同时需要注意的是,gdt_page
结构是 PAGE_SIZE
(4096
) 对齐的,即 gdt
将会占用一页内存。
下面我们来看一下 INIT_PER_CPU_VAR
,它定义在 arch/x86/include/asm/percpu.h,只是将给定的参数与 init_per_cpu__
连接起来:
#define INIT_PER_CPU_VAR(var) init_per_cpu__##var
所以在宏展开之后,我们会得到 init_per_cpu__gdt_page
。而在 linker script 中可以发现:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
INIT_PER_CPU
扩展后也将得到 init_per_cpu__gdt_page
并将它的值设置为相对于 __per_cpu_load
的偏移量。这样,我们就得到了新GDT的正确的基地址。
per-CPU 变量是 2.6 内核中引入的特性。顾名思义,当我们创建一个 per-CPU
变量时,每个 CPU 都会拥有一份它自己的拷贝,在这里我们创建的就是 gdt_page
这个 per-CPU 变量。这种类型的变量有很多优点,比如每个 CPU 都只访问自己的那份数据,因而不需要加锁。在多处理器的情况下,每一个处理器核心都将拥有一份自己的 GDT
表,其中的每一项都代表了一块内存,这块内存可以由在这个核心上运行的线程访问。Theory/per-cpu 一节有关于 per-CPU
变量的更详细的介绍。
在加载好了新的全局描述符表之后,跟之前一样我们重新加载一下各个段寄存器:
xorl %eax,%eax
movl %eax,%ds
movl %eax,%ss
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
在所有这些步骤都结束后,我们需要设置一下 gs
寄存器,令它指向一个特殊的栈 irqstack
,用于处理中断:
movl $MSR_GS_BASE,%ecx
movl initial_gs(%rip),%eax
movl initial_gs+4(%rip),%edx
wrmsr
其中, MSR_GS_BASE
为:
#define MSR_GS_BASE 0xc0000101
我们把 MSR_GS_BASE
放入 ecx
寄存器,把 initial_gs
的低、高 32 位分别装入 eax
和 edx
,然后执行 wrmsr
指令,将这两个寄存器组成的 64 位值写入该 MSR。cs
, fs
, ds
和 ss
段寄存器在64位模式下不用来寻址,但 fs
和 gs
可以使用。 fs
和 gs
有一个隐含的部分(与实模式下的 cs
段寄存器类似),这个隐含部分存储了一个描述符,其指向 Model Specific Registers。因此上面的 0xc0000101
是一个 gs.base
MSR 地址。当发生系统调用 或者 中断时,入口点处并没有内核栈,因此 MSR_GS_BASE
将会用来存放中断栈。
接下来我们把实模式中的 bootparam 结构的地址放入 rdi
(要记得 rsi
从一开始就保存了这个结构体的指针),然后跳转到C语言代码:
movq initial_code(%rip),%rax
pushq $0
pushq $__KERNEL_CS
pushq %rax
lretq
这里我们把 initial_code
放入 rax
中,并且向栈里分别压入一个无用的地址、__KERNEL_CS
和 initial_code
的地址。随后的 lretq
指令会从栈上弹出返回地址并跳转过去。initial_code
同样定义在这个文件里:
.balign 8
GLOBAL(initial_code)
.quad x86_64_start_kernel
...
...
...
可以看到 initial_code
包含了 x86_64_start_kernel
的地址,其定义在 arch/x86/kernel/head64.c:
asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
...
...
...
}
这个函数接受一个参数 real_mode_data
(刚才我们把实模式下数据的地址保存到了 rdi
寄存器中)。
这个函数是内核中第一个执行的C语言代码!
## 走进 start_kernel

在我们真正到达“内核入口点”(init/main.c 中的 start_kernel 函数)之前,还需要做最后的准备工作。
首先在 x86_64_start_kernel
函数中可以看到一些检查工作:
BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);
这些检查包括:模块的虚拟地址不能低于内核 text 段基地址 __START_KERNEL_map
,包含模块的内核 text 段的空间大小不能小于内核镜像大小等等。BUILD_BUG_ON
宏定义如下:
#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))
我们来理解一下这些巧妙的设计是怎么工作的。首先以第一个条件 MODULES_VADDR < __START_KERNEL_map
为例:!!conditions
等价于 condition != 0
,这代表如果 MODULES_VADDR < __START_KERNEL_map
为真,则 !!(condition)
为1,否则为0。执行2*!!(condition)
之后数值变为 2
或 0
。因此,这个宏执行完后可能产生两种不同的行为:
- 编译错误。因为我们尝试获取一个长度为负数的字符数组的大小。
- 没有编译错误。
就是这么简单,通过C语言中某些常量导致编译错误的技巧实现了这一设计。
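下面是这一技巧的一个最小示意(假设性的演示代码):条件为假时字符数组长度为 1,正常编译;条件为真时长度为 -1,编译器会直接报错。

```c
#include <assert.h>

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

int build_checks_pass(void)
{
    BUILD_BUG_ON(sizeof(char) != 1);  /* 条件为假:正常编译 */
    /* BUILD_BUG_ON(1); 取消注释会在编译期报错 */
    return 1;
}
```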
接下来 x86_64_start_kernel 调用了 cr4_init_shadow
函数,它保存了每个 CPU 上 cr4
的影子拷贝(shadow copy)。上下文切换可能会修改 cr4
中的位,因此需要为每个 CPU 保存一份 cr4
的内容。在这之后将会调用 reset_early_page_tables
函数,它重置了所有的全局页目录项,同时向 cr3
中重新写入全局页目录表的地址:
for (i = 0; i < PTRS_PER_PGD-1; i++)
early_level4_pgt[i].pgd = 0;
next_early_pgt = 0;
write_cr3(__pa_nodebug(early_level4_pgt));
很快我们就会设置新的页表。在这里我们遍历了所有的全局页目录项(其中 PTRS_PER_PGD
为 512
),将其设置为0。之后将 next_early_pgt
设置为0(会在下一篇文章中介绍细节),同时把 early_level4_pgt
的物理地址写入 cr3
。__pa_nodebug
是一个宏,将被扩展为:
((unsigned long)(x) - __START_KERNEL_map + phys_base)
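__pa_nodebug 的这条换算可以用 C 验证(假设性的示意代码,phys_base 在默认加载时为 0):

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000UL /* __START_KERNEL_map */

/* 假设性的示意:内核映像区虚拟地址到物理地址的换算 */
uint64_t pa_nodebug(uint64_t va, uint64_t phys_base)
{
    return va - START_KERNEL_MAP + phys_base;
}
```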
此后我们清空了从 __bss_start
到 __bss_stop
的 .bss
段。下一步将是建立初期 IDT(中断描述符表)
的处理程序,内容很多,我们将会留到下一个部分再来探究。
## 总结
第一部分关于Linux内核的初始化过程到这里就结束了。
如果你有任何问题或建议,请在twitter上联系我 0xAX,或者通过邮件与我沟通,还可以新开issue。
下一部分我们会看到初期中断处理程序的初始化过程、内核空间的内存映射等。
# 内核初始化 第二部分

## 初期中断和异常处理
在上一个 部分 我们谈到了初期中断初始化。目前我们已经处于解压缩后的Linux内核中了,还有了用于初期启动的基本的 分页 机制。我们的目标是在内核的主体代码执行前做好准备工作。
我们已经在 本章 的 第一部分 做了一些工作,在这一部分中我们会继续分析关于中断和异常处理部分的代码。
我们在上一部分谈到了下面这个循环:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handler_array[i]);
这段代码位于 arch/x86/kernel/head64.c。在分析这段代码之前,我们先来了解一些关于中断和中断处理程序的知识。
## 理论
中断是一种由软件或硬件产生的、向 CPU 发出的事件。例如,用户按下键盘上的一个按键时就会产生中断。此时 CPU 将会暂停当前的任务,把控制流转到一段特殊的程序中——中断处理程序(Interrupt Handler)。中断处理程序对中断进行处理,然后把控制权交还给之前暂停的任务。中断分为三类:
- 软件中断 - 当一个软件可以向CPU发出信号,表明它需要系统内核的相关功能时产生。这些中断通常用于系统调用;
- 硬件中断 - 当一个硬件有任何事件发生时产生,例如键盘的按键被按下;
- 异常 - 当CPU检测到错误时产生,例如发生了除零错误或者访问了一个不存在的内存页。
每一个中断和异常都可以由一个数来表示,这个数叫做 向量号
,它可以取从 0
到 255
中的任何一个数。通常在实践中前 32
个向量号用来表示异常,32
到 255
用来表示用户定义的中断。可以看到在上面的代码中,NUM_EXCEPTION_VECTORS
就定义为:
#define NUM_EXCEPTION_VECTORS 32
CPU会从 APIC 或者 CPU 引脚接收中断,并使用中断向量号作为 中断描述符表
的索引。下面的表中列出了 0-31
号异常:
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description |Type |Error Code|Source |
----------------------------------------------------------------------------------------------
|0 | #DE |Divide Error |Fault|NO |DIV and IDIV |
|---------------------------------------------------------------------------------------------
|1 | #DB |Reserved |F/T |NO | |
|---------------------------------------------------------------------------------------------
|2 | --- |NMI |INT |NO |external NMI |
|---------------------------------------------------------------------------------------------
|3 | #BP |Breakpoint |Trap |NO |INT 3 |
|---------------------------------------------------------------------------------------------
|4 | #OF |Overflow |Trap |NO |INTO instruction |
|---------------------------------------------------------------------------------------------
|5 | #BR |Bound Range Exceeded|Fault|NO |BOUND instruction |
|---------------------------------------------------------------------------------------------
|6 | #UD |Invalid Opcode |Fault|NO |UD2 instruction |
|---------------------------------------------------------------------------------------------
|7 | #NM |Device Not Available|Fault|NO |Floating point or [F]WAIT |
|---------------------------------------------------------------------------------------------
|8     |  #DF   |Double Fault        |Abort|YES       |Any instruction which can generate NMI|
|---------------------------------------------------------------------------------------------
|9 | --- |Reserved |Fault|NO | |
|---------------------------------------------------------------------------------------------
|10 | #TS |Invalid TSS |Fault|YES |Task switch or TSS access |
|---------------------------------------------------------------------------------------------
|11 | #NP |Segment Not Present |Fault|NO |Accessing segment register |
|---------------------------------------------------------------------------------------------
|12 | #SS |Stack-Segment Fault |Fault|YES |Stack operations |
|---------------------------------------------------------------------------------------------
|13 | #GP |General Protection |Fault|YES |Memory reference |
|---------------------------------------------------------------------------------------------
|14 | #PF |Page fault |Fault|YES |Memory reference |
|---------------------------------------------------------------------------------------------
|15 | --- |Reserved | |NO | |
|---------------------------------------------------------------------------------------------
|16 | #MF |x87 FPU fp error |Fault|NO |Floating point or [F]Wait |
|---------------------------------------------------------------------------------------------
|17 | #AC |Alignment Check |Fault|YES |Data reference |
|---------------------------------------------------------------------------------------------
|18 | #MC |Machine Check |Abort|NO | |
|---------------------------------------------------------------------------------------------
|19 | #XM |SIMD fp exception |Fault|NO |SSE[2,3] instructions |
|---------------------------------------------------------------------------------------------
|20 | #VE |Virtualization exc. |Fault|NO |EPT violations |
|---------------------------------------------------------------------------------------------
|21-31 | --- |Reserved |INT |NO |External interrupts |
----------------------------------------------------------------------------------------------
为了能够对中断进行处理,CPU使用了一种特殊的结构 - 中断描述符表(IDT)。IDT 是一个由描述符组成的数组,我们把其中的每一项叫做 门(gate)
。与全局描述符表中 8 字节的描述符不同,64 位模式下 IDT 的每一项为 16 个字节:为了获得某一项描述符的起始地址,CPU 会把向量号乘以 16(32 位模式下则乘以 8)。在前面我们已经见过,CPU使用一个特殊的 GDTR
寄存器来存放全局描述符表的地址,中断描述符表也有一个类似的寄存器 IDTR
,同时还有用于将基地址加载入这个寄存器的指令 lidt
。
64位模式下 IDT 的每一项的结构如下:
127 96
--------------------------------------------------------------------------------
| |
| Reserved |
| |
--------------------------------------------------------------------------------
95 64
--------------------------------------------------------------------------------
| |
| Offset 63..32 |
| |
--------------------------------------------------------------------------------
63 48 47 46 44 42 39 34 32
--------------------------------------------------------------------------------
| | | D | | | | | | |
| Offset 31..16 | P | P | 0 |Type |0 0 0 | 0 | 0 | IST |
| | | L | | | | | | |
--------------------------------------------------------------------------------
31                              16 15                                             0
--------------------------------------------------------------------------------
| | |
| Segment Selector | Offset 15..0 |
| | |
--------------------------------------------------------------------------------
其中:
- Offset - 代表了到中断处理程序入口点的偏移;
- DPL - 描述符特权级别;
- P - Segment Present 标志;
- Segment selector - 在 GDT 或 LDT 中的代码段选择子;
- IST - 用来为中断处理提供一个新的栈。
最后的 Type
域描述了这一项的类型,IDT 中的门共分为三种:
- 任务描述符
- 中断描述符
- 陷阱描述符
中断和陷阱描述符包含了一个指向中断处理程序的远 (far) 指针,二者唯一的不同在于CPU处理 IF
标志的方式。如果是由中断门进入中断处理程序的,CPU 会清除 IF
标志位,这样当当前中断处理程序执行时,CPU 不会对其他的中断进行处理;只有当当前的中断处理程序返回时,CPU 才在 iret
指令执行时重新设置 IF
标志位。
中断门的其他位为保留位,必须为0。下面我们来看一下 CPU 是如何处理中断的:
- CPU 会在栈上保存标志寄存器、cs 段寄存器和程序计数器 IP;
- 如果该中断会产生错误码(比如 #PF),CPU 会在栈上保存错误码;
- 在中断处理程序执行完毕后,由 iret 指令返回。
OK,接下来我们继续分析代码。
## 设置并加载 IDT
我们分析到了如下代码:
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
set_intr_gate(i, early_idt_handler_array[i]);
这里循环内部调用了 set_intr_gate
,它接受两个参数:
- 中断号,即向量号;
- 中断处理程序的地址。
这个函数会将中断门写入 IDT
表(即代码中的 idt_table
数组,idt_descr
中保存了它的地址和长度)。首先让我们来看一下 early_idt_handler_array
数组,它定义在 arch/x86/include/asm/segment.h 头文件中,包含了前32个异常处理程序的地址:
#define EARLY_IDT_HANDLER_SIZE 9
#define NUM_EXCEPTION_VECTORS 32
extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];
early_idt_handler_array
是一个大小为 288
字节的数组,每一项为 9
个字节,其中 2 个字节的可选指令用于向栈中压入默认错误码(如果异常本身不产生错误码的话),2 个字节的指令用于向栈中压入向量号,剩余 5 个字节用于跳转到通用的异常处理程序。
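这个布局可以用几个常量核对一下(假设性的示意代码):

```c
#include <assert.h>

#define EARLY_IDT_HANDLER_SIZE 9
#define NUM_EXCEPTION_VECTORS  32

/* 假设性的示意:每项 = 2 字节(压入默认错误码)
 *             + 2 字节(压入向量号)+ 5 字节(近跳转) */
int entry_size(void) { return 2 + 2 + 5; }
int array_size(void) { return NUM_EXCEPTION_VECTORS * EARLY_IDT_HANDLER_SIZE; }
```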
在上面的代码中,我们只通过一个循环向 IDT
中填入了前32项内容,这是因为在整个初期设置阶段,中断是禁用的。early_idt_handler_array
数组中的每一项指向的都是同一个通用中断处理程序,定义在 arch/x86/kernel/head_64.S 。我们先暂时跳过这个数组的内容,看一下 set_intr_gate
的定义。
set_intr_gate
宏定义在 arch/x86/include/asm/desc.h:
#define set_intr_gate(n, addr) \
do { \
BUG_ON((unsigned)n > 0xFF); \
_set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0, \
__KERNEL_CS); \
_trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
0, 0, __KERNEL_CS); \
} while (0)
首先 BUG_ON
宏确保了传入的中断向量号不会大于255,因为我们最多只有 256
个中断。然后它调用了 _set_gate
函数,它会将中断门写入 IDT
:
static inline void _set_gate(int gate, unsigned type, void *addr,
unsigned dpl, unsigned ist, unsigned seg)
{
gate_desc s;
pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
write_idt_entry(idt_table, gate, &s);
write_trace_idt_entry(gate, &s);
}
在 _set_gate
函数的开始,它调用了 pack_gate
函数。这个函数会使用给定的参数填充 gate_desc
结构:
static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
unsigned dpl, unsigned ist, unsigned seg)
{
gate->offset_low = PTR_LOW(func);
gate->segment = __KERNEL_CS;
gate->ist = ist;
gate->p = 1;
gate->dpl = dpl;
gate->zero0 = 0;
gate->zero1 = 0;
gate->type = type;
gate->offset_middle = PTR_MIDDLE(func);
gate->offset_high = PTR_HIGH(func);
}
在这个函数里,我们把从主循环中得到的中断处理程序入口点地址拆成三个部分,填入门描述符中。下面的三个宏就用来做这个拆分工作:
#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)
调用 PTR_LOW
可以得到 x 的低 2
个字节,调用 PTR_MIDDLE
可以得到 x 的中间 2
个字节,调用 PTR_HIGH
则能够得到 x 的高 4
个字节。接下来我们为中断处理程序设置段选择子,即内核代码段 __KERNEL_CS
。然后将 Interrupt Stack Table
和 描述符特权等级
(最高特权等级)设置为 0,最后设置 GATE_INTERRUPT
类型。
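这三个宏的拆分是无损的——把三段按门描述符中 offset 的位置重新拼起来,就能还原原地址(假设性的示意代码):

```c
#include <assert.h>
#include <stdint.h>

#define PTR_LOW(x)    ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x)   ((unsigned long long)(x) >> 32)

/* 假设性的示意:把 offset 的低、中、高三段拼回完整地址 */
uint64_t reassemble(uint64_t handler)
{
    return PTR_LOW(handler) | (PTR_MIDDLE(handler) << 16) |
           (PTR_HIGH(handler) << 32);
}
```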
现在我们已经设置好了 IDT 中的一项,接下来通过调用 native_write_idt_entry
函数把它复制到 IDT
中:
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
memcpy(&idt[entry], gate, sizeof(*gate));
}
主循环结束后,idt_table
就已经设置完毕了,其为一个 gate_desc
数组。然后我们就可以通过下面的代码加载 中断描述符表
:
load_idt((const struct desc_ptr *)&idt_descr);
其中,idt_descr
为:
struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };
load_idt
函数只是执行了一下 lidt
指令:
asm volatile("lidt %0"::"m" (*dtr));
你可能已经注意到了,在代码中还有对 _trace_*
函数的调用。这些函数会用跟 _set_gate
同样的方法对 IDT
门进行设置,但仅有一处不同:这些函数并不设置 idt_table
,而是 trace_idt_table
,用于设置追踪点(tracepoint,我们将会在其他章节介绍这一部分)。
好了,至此我们已经了解到,通过设置并加载 中断描述符表
,能够让CPU在发生中断时做出相应的动作。下面让我们来看一下如何编写中断处理程序。
## 初期中断处理程序
在上面的代码中,我们用 early_idt_handler_array
的地址来填充了 IDT
,这个 early_idt_handler_array
定义在 arch/x86/kernel/head_64.S:
.globl early_idt_handler_array
early_idt_handler_array:
i = 0
.rept NUM_EXCEPTION_VECTORS
.if (EXCEPTION_ERRCODE_MASK >> i) & 1
pushq $0
.endif
pushq $i
jmp early_idt_handler_common
i = i + 1
.fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
.endr
这段代码自动为前 32
个异常生成了中断处理程序。首先,为了统一栈的布局,如果一个异常不产生错误码,那么我们就手动在栈中压入一个 0
。然后再在栈中压入中断向量号,最后跳转至通用的中断处理程序 early_idt_handler_common
。我们可以通过 objdump
命令的输出一探究竟:
$ objdump -D vmlinux
...
...
...
ffffffff81fe5000 <early_idt_handler_array>:
ffffffff81fe5000: 6a 00 pushq $0x0
ffffffff81fe5002: 6a 00 pushq $0x0
ffffffff81fe5004: e9 17 01 00 00 jmpq ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5009: 6a 00 pushq $0x0
ffffffff81fe500b: 6a 01 pushq $0x1
ffffffff81fe500d: e9 0e 01 00 00 jmpq ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5012: 6a 00 pushq $0x0
ffffffff81fe5014: 6a 02 pushq $0x2
...
...
...
在中断发生时,CPU 会先在栈上压入标志寄存器、CS
段寄存器和 RIP
寄存器的内容,之后才进入处理程序。因此在 early_idt_handler_common
执行前,栈的布局如下:
|--------------------|
| %rflags |
| %cs |
| %rip |
| rsp --> error code |
|--------------------|
下面我们来看一下 early_idt_handler_common
的实现。它也定义在 arch/x86/kernel/head_64.S 文件中。首先它会检查当前中断是否为 不可屏蔽中断(NMI),如果是则简单地忽略它们:
cmpl $2,(%rsp)
je .Lis_nmi
其中 .Lis_nmi
为:
.Lis_nmi:
addq $16,%rsp
INTERRUPT_RETURN
这段程序首先从栈顶弹出错误码和中断向量号,然后通过调用 INTERRUPT_RETURN
,即 iretq
指令直接返回。
如果当前中断不是 NMI
,则首先检查 early_recursion_flag
以避免在 early_idt_handler_common
程序中递归地产生中断。如果一切正常,就先在栈上保存通用寄存器,以免中断返回时它们的内容被破坏:
pushq %rax
pushq %rcx
pushq %rdx
pushq %rsi
pushq %rdi
pushq %r8
pushq %r9
pushq %r10
pushq %r11
然后我们检查栈上的段选择子:
cmpl $__KERNEL_CS,96(%rsp)
jne 11f
段选择子必须为内核代码段,如果不是则跳转到标签 11
,输出 PANIC
信息并打印栈的内容。然后我们来检查向量号,如果是 #PF
即 缺页中断(Page Fault),那么就把 cr2
寄存器中的值赋值给 rdi
,然后调用 early_make_pgtable
(详见后文):
cmpl $14,72(%rsp)
jnz 10f
GET_CR2_INTO(%rdi)
call early_make_pgtable
andl %eax,%eax
jz 20f
如果向量号不是 #PF
,那么就恢复通用寄存器:
popq %r11
popq %r10
popq %r9
popq %r8
popq %rdi
popq %rsi
popq %rdx
popq %rcx
popq %rax
并调用 iret
从中断处理程序返回。
第一个中断处理程序到这里就结束了。由于它只是一个初期中断处理程序,因此只处理缺页中断。下面让我们先来看一下缺页中断处理程序,其他中断的处理程序我们之后再进行分析。
## 缺页中断处理程序
在上一节中我们第一次见到了初期中断处理程序,它检查了缺页中断的中断号,并调用了 early_make_pgtable
来建立新的页表。在这里我们需要提供 #PF
中断处理程序,以便为之后将内核加载至 4G
地址以上,并且能访问位于4G以上的 boot_params
结构体。
early_make_pgtable
的实现在 arch/x86/kernel/head64.c,它接受一个参数:从 cr2
寄存器得到的地址,即引发缺页中断的那个地址。下面让我们来看一下:
int __init early_make_pgtable(unsigned long address)
{
unsigned long physaddr = address - __PAGE_OFFSET;
unsigned long i;
pgdval_t pgd, *pgd_p;
pudval_t pud, *pud_p;
pmdval_t pmd, *pmd_p;
...
...
...
}
首先它定义了一些 *val_t
类型的变量。这些类型均为:
typedef unsigned long pgdval_t;
此外,我们还会遇见 *_t
(不带val)的类型,比如 pgd_t
……这些类型都定义在 arch/x86/include/asm/pgtable_types.h,形式如下:
typedef struct { pgdval_t pgd; } pgd_t;
例如,
extern pgd_t early_level4_pgt[PTRS_PER_PGD];
在这里 early_level4_pgt
代表了初期顶层页表目录,它是一个 pgd_t
类型的数组,其中的 pgd
指向了下一级页表。
在确认不是非法地址后,我们取得页表中包含引起 #PF
中断的地址的那一项,将其赋值给 pgd
变量:
pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;
接下来我们检查一下 pgd
,如果它包含了正确的全局页表项的话,我们就把这一项的物理地址处理后赋值给 pud_p
:
pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);
其中 PTE_PFN_MASK
是一个宏:
#define PTE_PFN_MASK ((pteval_t)PHYSICAL_PAGE_MASK)
展开后将为:
(~(PAGE_SIZE-1)) & ((1 << 46) - 1)
或者写为:
0b1111111111111111111111111111111111111111111111
它是一个46bit大小的页帧屏蔽值。
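用这个屏蔽值可以从一条页表项中剥掉低位的标志位和高位的保留位,只留下页帧地址(假设性的示意代码;示例中的 0x63 对应 present|rw|accessed|dirty 标志组合):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE    4096UL
#define PTE_PFN_MASK ((~(PAGE_SIZE - 1)) & ((1UL << 46) - 1))

/* 假设性的示意:取出页表项中 46 位的物理页帧地址 */
uint64_t pte_frame(uint64_t pte)
{
    return pte & PTE_PFN_MASK;
}
```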
如果 pgd
没有包含有效的地址,我们就检查 next_early_pgt
是否已经达到 EARLY_DYNAMIC_PAGE_TABLES
(即 64
)。EARLY_DYNAMIC_PAGE_TABLES
是一个固定大小的缓冲区,用来在需要的时候建立新的页表。如果 next_early_pgt
已经达到上限,就重置初期页表后重试;否则从缓冲区中取出一页作为新的上层页目录,清零后将它的物理地址与 _KERNPG_TABLE
访问权限一起写入全局页目录表:
if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
reset_early_page_tables();
goto again;
}
pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
for (i = 0; i < PTRS_PER_PUD; i++)
pud_p[i] = 0;
*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;
然后我们来修正上层页目录的地址:
pud_p += pud_index(address);
pud = *pud_p;
下面我们对中层页目录重复上面同样的操作。最后,我们修正中层页目录中的表项,它映射了内核 text+data 的虚拟地址:
pmd = (physaddr & PMD_MASK) + early_pmd_flags;
pmd_p[pmd_index(address)] = pmd;
到此缺页中断处理程序就完成了它所有的工作,此时 early_level4_pgt
就包含了指向合法地址的项。
## 小结

本章的第二部分到此结束了。
如果你有任何问题或建议,请在twitter上联系我 0xAX,或者通过邮件与我沟通,还可以新开issue。
接下来我们将会看到进入内核入口点 start_kernel
函数之前剩下所有的准备工作。
# 内核初始化 第三部分

## 进入内核入口点之前最后的准备工作
这是 Linux 内核初始化过程的第三部分。在上一个部分 中我们接触到了初期中断和异常处理,而在这个部分中我们要继续看一看 Linux 内核的初始化过程。在之后的章节我们将会关注“内核入口点”—— init/main.c 文件中的start_kernel
函数。没错,从技术上说这并不是内核的入口点,只是不依赖于特定架构的通用内核代码的开始。不过,在我们调用 start_kernel
之前,有些准备必须要做。下面我们就来看一看。
## boot_params again
在上一个部分中我们讲到了设置中断描述符表,并将其加载进 IDTR
寄存器。下一步是调用 copy_bootdata
函数:
copy_bootdata(__va(real_mode_data));
这个函数接受一个参数——real_mode_data
的虚拟地址。boot_params
结构体定义在 arch/x86/include/uapi/asm/bootparam.h,它的地址在 arch/x86/kernel/head_64.S 中被作为第一个参数传递给 x86_64_start_kernel
函数:
/* rsi is pointer to real mode structure with interesting info.
pass it to C */
movq %rsi, %rdi
下面我们来看一看 __va
宏。 这个宏定义在 init/main.c:
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
其中 PAGE_OFFSET
就是 __PAGE_OFFSET
(即 0xffff880000000000
),也是对物理地址进行直接映射后的虚拟基地址。因此我们就得到了 boot_params
结构体的虚拟地址,并把它传入 copy_bootdata
函数中。在这个函数里我们把 real_mode_data
拷贝进 boot_params
(其声明在 arch/x86/kernel/setup.h):
extern struct boot_params boot_params;
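上面 __va 的换算本身很简单,就是物理地址加上直接映射区的基址(假设性的示意代码,PAGE_OFFSET 取文中给出的 0xffff880000000000):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xffff880000000000UL

/* 假设性的示意:物理地址到直接映射区虚拟地址的换算 */
uint64_t to_virt(uint64_t pa)
{
    return pa + PAGE_OFFSET;
}
```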
copy_bootdata
的实现如下:
的实现如下:
static void __init copy_bootdata(char *real_mode_data)
{
char * command_line;
unsigned long cmd_line_ptr;
memcpy(&boot_params, real_mode_data, sizeof boot_params);
sanitize_boot_params(&boot_params);
cmd_line_ptr = get_cmd_line_ptr();
if (cmd_line_ptr) {
command_line = __va(cmd_line_ptr);
memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
}
}
首先,这个函数的声明中有一个 __init
前缀,这表示这个函数只在初始化阶段使用,初始化结束后它所占用的内存将会被释放。
在这个函数中首先声明了两个用于解析内核命令行的变量,然后使用memcpy
函数将 real_mode_data
拷贝进 boot_params
。如果系统引导工具(bootloader)没能正确初始化 boot_params
中的某些成员的话,那么在接下来调用的 sanitize_boot_params
函数中将会对这些成员进行清零,比如 ext_ramdisk_image
等。此后我们通过调用 get_cmd_line_ptr
函数来得到命令行的地址:
unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
return cmd_line_ptr;
get_cmd_line_ptr
函数将会从 boot_params
中获得命令行的64位地址并返回。最后,我们检查一下是否正确获得了 cmd_line_ptr
,并把它的虚拟地址拷贝到一个字节数组 boot_command_line
中:
extern char __initdata boot_command_line[];
这一步完成之后,我们就得到了内核命令行和 boot_params
结构体。之后,内核通过调用 load_ucode_bsp
函数来加载处理器微代码(microcode),不过我们目前先暂时忽略这一步。
微代码加载之后,内核会对 console_loglevel
进行检查,同时通过 early_printk
函数来打印出字符串 Kernel Alive
。不过这个输出不会真的被显示出来,因为这个时候 early_printk
还没有被初始化。这是目前内核中的一个小bug,作者已经提交了补丁 commit,补丁很快就能应用在主分支中了。所以你可以先跳过这段代码。
## 初始化内存页
至此,我们已经拷贝了 boot_params
结构体,接下来将对初期页表进行一些设置,以便在初始化内核的过程中使用。我们之前已经初始化了初期页表以支持分页,这在之前的部分中讨论过。现在则通过调用 reset_early_page_tables
函数将初期页表中大部分项清零(在之前的部分也有介绍),只保留内核高地址的映射。然后我们调用:
clear_page(init_level4_pgt);
init_level4_pgt
同样定义在 arch/x86/kernel/head_64.S:
NEXT_PAGE(init_level4_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
.org init_level4_pgt + L4_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
.org init_level4_pgt + L4_START_KERNEL*8, 0
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
这段代码为内核的代码段、数据段和 bss 段映射了前 2.5 GB 的空间。clear_page
函数定义在 arch/x86/lib/clear_page_64.S:
ENTRY(clear_page)
CFI_STARTPROC
xorl %eax,%eax
movl $4096/64,%ecx
.p2align 4
.Lloop:
decl %ecx
#define PUT(x) movq %rax,x*8(%rdi)
movq %rax,(%rdi)
PUT(1)
PUT(2)
PUT(3)
PUT(4)
PUT(5)
PUT(6)
PUT(7)
leaq 64(%rdi),%rdi
jnz .Lloop
nop
ret
CFI_ENDPROC
.Lclear_page_end:
ENDPROC(clear_page)
顾名思义,这个函数会将页表清零。这个函数的开始和结束部分有两个宏 CFI_STARTPROC
和 CFI_ENDPROC
,他们会展开成 GNU 汇编指令,用于调试:
#define CFI_STARTPROC .cfi_startproc
#define CFI_ENDPROC .cfi_endproc
在 CFI_STARTPROC
之后我们将 eax
寄存器清零,并将 ecx
赋值为 64(用作计数器)。接下来从 .Lloop
标签开始循环,首先就是将 ecx
减一。然后将 rax
中的值(目前为0)写入 rdi
指向的地址,rdi
中保存的是 init_level4_pgt
的基地址。接下来重复7次这个步骤,但是每次都相对 rdi
多偏移8个字节。之后 init_level4_pgt
的前64个字节就都被填充为0了。接下来我们将 rdi
中的值加上64,重复这个步骤,直到 ecx
减至0。最后就完成了将 init_level4_pgt
填零。
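上面 clear_page 的汇编逻辑可以用一段等价的 C 代码来示意(clear_page_c 是为演示而假设的名字,并非内核实现):

```c
#include <stdint.h>
#include <stddef.h>

/* 示意:用 C 重写 clear_page 的展开循环 —— 每次迭代写入
 * 8 个 64 位零值(共 64 字节),循环 4096/64 = 64 次。 */
static void clear_page_c(uint64_t *page)
{
    for (size_t i = 0; i < 4096 / 64; i++) {
        for (int j = 0; j < 8; j++)   /* 对应汇编中的 movq 与 PUT(1..7) */
            page[i * 8 + j] = 0;
    }
}
```

汇编版本把内层循环完全展开成 8 条 movq,以减少循环开销,这正是 PUT(x) 宏的作用。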
在将 init_level4_pgt
填0之后,再把它的最后一项设置为内核高地址的映射:
init_level4_pgt[511] = early_level4_pgt[511];
在前面我们已经使用 reset_early_page_tables
函数清除 early_level4_pgt
中的大部分项,而只保留内核高地址的映射。
x86_64_start_kernel
函数的最后一步是调用:
x86_64_start_reservations(real_mode_data);
并传入 real_mode_data
参数。 x86_64_start_reservations
函数与 x86_64_start_kernel
函数定义在同一个文件中:
void __init x86_64_start_reservations(char *real_mode_data)
{
if (!boot_params.hdr.version)
copy_bootdata(__va(real_mode_data));
reserve_ebda_region();
start_kernel();
}
这就是进入内核入口点之前的最后一个函数了。下面我们就来介绍一下这个函数。
内核入口点前的最后一步
在 x86_64_start_reservations
函数中首先检查了 boot_params.hdr.version
:
if (!boot_params.hdr.version)
copy_bootdata(__va(real_mode_data));
如果它为0,则再次调用 copy_bootdata
,并传入 real_mode_data
的虚拟地址。
接下来则调用了 reserve_ebda_region
函数,它定义在 arch/x86/kernel/head.c。这个函数为 EBDA
(即Extended BIOS Data Area,扩展BIOS数据区域)预留空间。扩展BIOS数据区域位于常规内存顶部(译注:常规内存(Conventional Memory)是指前640KB内存),包含了端口、磁盘参数等数据。
接下来我们来看一下 reserve_ebda_region
函数。它首先会检查是否启用了半虚拟化:
if (paravirt_enabled())
return;
如果开启了半虚拟化,那么就退出 reserve_ebda_region
函数,因为此时没有扩展BIOS数据区域。下面我们首先得到低地址内存的末尾地址:
lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;
首先我们得到了BIOS低地址内存的虚拟地址,以KB为单位,然后将其左移10位(即乘以1024)转换为以字节为单位。然后我们需要获得扩展BIOS数据区域的地址:
ebda_addr = get_bios_ebda();
其中, get_bios_ebda
函数定义在 arch/x86/include/asm/bios_ebda.h:
static inline unsigned int get_bios_ebda(void)
{
unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
address <<= 4;
return address;
}
下面我们来尝试理解一下这段代码。首先我们将物理地址 0x40E
转换为虚拟地址,0x0040:0x000e
这个段:偏移地址处保存着扩展BIOS数据区域的基地址。这里我们使用了 phys_to_virt
函数进行地址转换,而不是之前使用的 __va
宏。不过事实上,它们两个基本是一样的:
static inline void *phys_to_virt(phys_addr_t address)
{
return __va(address);
}
而不同之处在于,phys_to_virt
函数的参数类型 phys_addr_t
的定义依赖于 CONFIG_PHYS_ADDR_T_64BIT
:
#ifdef CONFIG_PHYS_ADDR_T_64BIT
typedef u64 phys_addr_t;
#else
typedef u32 phys_addr_t;
#endif
具体的类型是由 CONFIG_PHYS_ADDR_T_64BIT
设置选项控制的。此后我们得到了包含扩展BIOS数据区域虚拟基地址的段,把它左移4位后返回。这样,ebda_addr
变量就包含了扩展BIOS数据区域的基地址。
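段值左移 4 位得到线性地址的换算,可以用一个小示例验证(segment_to_linear 为演示用的假设函数):

```c
#include <stdint.h>

/* 示意:get_bios_ebda 从 0x40E 处读出的是一个实模式段值,
 * 实模式下线性地址 = 段值 << 4。 */
static uint32_t segment_to_linear(uint16_t segment)
{
    return (uint32_t)segment << 4;
}
```

例如段值 0x9FC0 对应的 EBDA 基地址就是 0x9FC00,正好位于 640KB 常规内存的顶部附近。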
下一步我们来检查扩展BIOS数据区域与低地址内存的地址,看一看它们是否小于 INSANE_CUTOFF
宏:
if (ebda_addr < INSANE_CUTOFF)
ebda_addr = LOWMEM_CAP;
if (lowmem < INSANE_CUTOFF)
lowmem = LOWMEM_CAP;
INSANE_CUTOFF
为:
#define INSANE_CUTOFF 0x20000U
即 128 KB. 上一步我们得到了低地址内存中的低地址部分以及扩展BIOS数据区域,然后调用 memblock_reserve
函数来在低内存地址与1MB之间为扩展BIOS数据预留内存区域。
lowmem = min(lowmem, ebda_addr);
lowmem = min(lowmem, LOWMEM_CAP);
memblock_reserve(lowmem, 0x100000 - lowmem);
memblock_reserve
函数定义在 mm/memblock.c,它接受两个参数:
- 基物理地址
- 区域大小
然后在给定的基地址处预留指定大小的内存。memblock_reserve
是在这本书中我们接触到的第一个Linux内核内存管理框架中的函数。我们很快会详细地介绍内存管理,不过现在还是先来看一看这个函数的实现。
Linux内核管理框架初探
在上一段中我们遇到了对 memblock_reserve
函数的调用。现在我们来尝试理解一下这个函数是如何工作的。 memblock_reserve
函数只是调用了:
memblock_reserve_region(base, size, MAX_NUMNODES, 0);
memblock_reserve_region
接受四个参数:
- 内存区域的物理基地址
- 内存区域的大小
- 最大 NUMA 节点数
- 标志参数 flags
在 memblock_reserve_region
函数一开始,就是一个 memblock_type
结构体类型的变量:
struct memblock_type *_rgn = &memblock.reserved;
memblock_type
类型代表了一块内存,定义如下:
struct memblock_type {
unsigned long cnt;
unsigned long max;
phys_addr_t total_size;
struct memblock_region *regions;
};
因为我们要为扩展BIOS数据区域预留内存块,所以当前内存区域的类型就是预留。memblock
结构体的定义为:
struct memblock {
bool bottom_up;
phys_addr_t current_limit;
struct memblock_type memory;
struct memblock_type reserved;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
struct memblock_type physmem;
#endif
};
它描述了一块通用的数据块。我们用 memblock.reserved
的值来初始化 _rgn
。memblock
全局变量定义如下:
struct memblock memblock __initdata_memblock = {
.memory.regions = memblock_memory_init_regions,
.memory.cnt = 1,
.memory.max = INIT_MEMBLOCK_REGIONS,
.reserved.regions = memblock_reserved_init_regions,
.reserved.cnt = 1,
.reserved.max = INIT_MEMBLOCK_REGIONS,
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
.physmem.regions = memblock_physmem_init_regions,
.physmem.cnt = 1,
.physmem.max = INIT_PHYSMEM_REGIONS,
#endif
.bottom_up = false,
.current_limit = MEMBLOCK_ALLOC_ANYWHERE,
};
我们现在不会继续深究这个变量,但在内存管理部分中我们会详细地对它进行介绍。需要注意的是,这个变量的声明中使用了 __initdata_memblock
:
#define __initdata_memblock __meminitdata
而 __meminitdata
为:
#define __meminitdata __section(.meminit.data)
自此我们得出这样的结论:所有的内存块都将定义在 .meminit.data
区段中。在我们定义了 _rgn
之后,使用了 memblock_dbg
宏来输出相关的信息。你可以在从内核命令行传入参数 memblock=debug
来开启这些输出。
在输出了这些调试信息后,是对下面这个函数的调用:
memblock_add_range(_rgn, base, size, nid, flags);
它向 .meminit.data
区段添加了一个新的内存块区域。由于 _rgn
的值是 &memblock.reserved
,下面的代码就直接将扩展BIOS数据区域的基地址、大小和标志填入 _rgn
中:
if (type->regions[0].size == 0) {
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
}
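上面这条“第一个区域直接填充”的快速路径,可以用一个极简的 C 示意来表达(fake_ 前缀的类型和函数均为演示用的简化假设,省略了真实 memblock 的合并与扩容逻辑):

```c
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* 示意:简化版的区域描述,仿照 memblock_region / memblock_type,
 * 仅保留正文讨论到的字段 */
struct fake_region { phys_addr_t base, size; };
struct fake_type {
    unsigned long cnt;
    phys_addr_t total_size;
    struct fake_region regions[8];
};

/* 对应正文中 type->regions[0].size == 0 的快速路径:
 * 该类型还没有任何区域时,直接填充第一项 */
static int fake_reserve_first(struct fake_type *type,
                              phys_addr_t base, phys_addr_t size)
{
    if (type->regions[0].size == 0) {
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->total_size = size;
        return 0;
    }
    return -1; /* 已有区域时,真实内核会走合并/插入路径 */
}
```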
在填充好了区域后,接着是对 memblock_set_region_node
函数的调用。它接受两个参数:
- 填充好的内存区域的地址
- NUMA节点ID
其中我们的区域由 memblock_region
结构体来表示:
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};
NUMA节点ID依赖于 MAX_NUMNODES
宏,定义在 include/linux/numa.h
#define MAX_NUMNODES (1 << NODES_SHIFT)
其中 NODES_SHIFT
依赖于 CONFIG_NODES_SHIFT
配置参数,定义如下:
#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT 0
#endif
memblock_set_region_node
函数只是填充了 memblock_region
中的 nid
成员:
static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
r->nid = nid;
}
在这之后我们就在 .meminit.data
区段拥有了为扩展BIOS数据区域预留的第一个 memblock
。reserve_ebda_region
已经完成了它该做的任务,我们回到 arch/x86/kernel/head64.c 继续。
至此我们已经结束了进入内核之前所有的准备工作。x86_64_start_reservations
的最后一步是调用 init/main.c 中的:
start_kernel()
这一部分到此结束。
小结
本书的第三部分到这里就结束了。在下一部分中,我们将会见到内核入口点处的初始化工作 —— 位于 start_kernel
函数中。这些工作是在启动第一个进程 init
之前首先要完成的工作。
如果你有任何问题或建议,请在twitter上联系我 0xAX,或者通过邮件与我沟通,还可以新开issue。
相关链接
内核初始化. Part 4.
Kernel entry point
还记得上一章的内容吗?在跳转到内核入口点之前的最后准备中,我们已经完成了一系列初始化操作,并停在了调用位于 init/main.c 中的 start_kernel 函数之前。start_kernel 函数是与体系架构无关的通用处理入口,尽管在初始化过程中我们还会无数次地回到 arch/ 目录。如果仔细看看 start_kernel 函数的内容,你会发现它涉及的内容非常广泛:大约包含了 86 个函数调用。是的,它非常庞大,但它并不是全部的初始化过程,在当前阶段我们只看这些就可以了。本章以及后续内核初始化章节的内容将逐一详述它们。
start_kernel
函数的主要目的是完成内核初始化并启动祖先进程(1号进程)。在祖先进程启动之前start_kernel
函数做了很多事情,比如初始化锁验证器(lock validator),根据处理器标识 ID 初始化处理器,开启 cgroups 子系统,设置每 CPU(per-cpu)区域环境,初始化 VFS 缓存机制、内存管理、RCU、vmalloc、调度器(scheduler)、中断(IRQs)、ACPI(高级配置与电源接口)以及其它很多子系统。只有经过这些步骤,我们才能在本章的最后一部分看到祖先进程启动的过程。如此之多的内核子系统和内核代码等着我们去探索,让我们开始吧。
注意:在此大章节的所有内容 Linux Kernel initialization process
,并不涉及内核调试相关,关于内核调试部分会有一个单独的章节来进行描述
关于 __attribute__
正如上文所写,start_kernel 函数定义在 init/main.c。从代码中可以看到此函数使用了 __init 属性,你也许从其它地方了解过 GCC 的 __attribute__ 机制。__init 属性表明被标记的函数只在内核初始化阶段使用:
#define __init __section(.init.text) __cold notrace
在初始化过程完成后,内核将通过调用free_initmem
释放这些sections(段)。注意__init
属性是通过__cold
和notrace
两个属性来定义的。第一个属性cold
的目的是标记此函数很少使用所以编译器必须优化此函数的大小,第二个属性notrace
定义如下:
#define notrace __attribute__((no_instrument_function))
no_instrument_function 的意思是告诉编译器不要为该函数生成剖析(profiling)插桩调用。
在start_kernel
函数的定义中,你也可以看到__visible
属性的扩展:
#define __visible __attribute__((externally_visible))
externally_visible 的意思是告诉编译器可能有外部代码在使用该函数或变量,防止编译器将其标记为不可用(unusable)。你可以在 include/linux/init.h 中查到这些属性的含义。
start_kernel 初始化
在start_kernel的初始之初你可以看到这两个变量:
char *command_line;
char *after_dashes;
第一个变量是指向内核命令行的全局指针,第二个变量将保存 parse_args 函数的解析结果:parse_args 会解析形如 'name=value' 的输入字符串,寻找特定的关键字并调用对应的处理函数。我们现在不深入这两个变量的细节,在接下来的章节中会看到它们。我们接着往下走,下一步看到的是此函数:
lockdep_init();
lockdep_init
初始化锁验证器(lock validator)。其实现相当简单,它只是初始化了两个由 list_head 组成的哈希表,并把全局变量 lockdep_initialized 设置为 1。自旋锁 spinlock 和互斥锁 mutex 获取锁时会使用这些哈希表来做跟踪和校验。
下一个函数是 set_task_stack_end_magic,它以 init_task 为参数,在其栈底写入 STACK_END_MAGIC(0x57AC6E9D)。init_task 代表初始进程(任务)的数据结构:
struct task_struct init_task = INIT_TASK(init_task);
task_struct 存储了进程的所有相关信息。因为它很庞大,本书不会逐一介绍,详细信息可以查看调度相关数据结构的定义头文件 include/linux/sched.h。目前 task_struct 包含了超过 100 个字段!虽然本书不会完整解释 task_struct,但我们会经常使用它,因为它是描述 Linux 内核中进程的基本结构。我会在后面的实践中遇到相应字段时,再描述它们的含义。
你也可以查看 init_task 的定义,它通过宏 INIT_TASK 完成初始化。这个宏来自 include/linux/init_task.h,此刻它只是为第一个进程(0 号进程)设置了初始值。例如:
- 初始化进程状态为 0,即 runnable(可运行)。可运行进程即等待 CPU 去运行的进程;
- 初始化进程标志位为 PF_KTHREAD,意思是内核线程;
- 一个可运行的任务列表;
- 进程地址空间;
- 初始化进程堆栈为 &init_thread_info,即 init_thread_union.thread_info。init_thread_union 是一个 thread_union 类型的共用体,它包含了 thread_info 进程信息以及进程栈:
union thread_union {
struct thread_info thread_info;
unsigned long stack[THREAD_SIZE/sizeof(long)];
};
每个进程都有自己的栈,在 x86_64 架构上栈的大小是 16KB,即 4 个页框。注意 stack 变量被定义为 unsigned long 类型的数组。thread_union 的另一个成员是 thread_info,其定义如下:
struct thread_info {
struct task_struct *task;
struct exec_domain *exec_domain;
__u32 flags;
__u32 status;
__u32 cpu;
int saved_preempt_count;
mm_segment_t addr_limit;
struct restart_block restart_block;
void __user *sysenter_return;
unsigned int sig_on_uaccess_error:1;
unsigned int uaccess_err:1;
};
此结构占用 52 个字节。thread_info 结构包含了与特定体系架构相关的线程信息。在 x86_64 架构上,内核栈从这块内存的高地址向下生长,而 thread_info 位于这块内存的最低地址处。所以进程的内核栈共 16KB,thread_info 在栈底,留给栈使用的空间为 16 kilobytes - 52 bytes = 16332 bytes。注意 thread_union 是一个联合体 union 而不是结构体,用一张图来描述这块内存空间,如下图所示:
+-----------------------+
| |
| |
| stack |
| |
|_______________________|
| | |
| | |
| | |
|__________↓____________| +--------------------+
| | | |
| thread_info |<----------->| task_struct |
| | | |
+-----------------------+ +--------------------+
http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct
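上面描述的 thread_union 布局可以用一个可编译的小例子来验证(FAKE_THREAD_SIZE 等名字是演示用假设,这里取 16KB):

```c
#include <stddef.h>

/* 示意:仿照 thread_union 的布局,THREAD_SIZE 取 16KB。
 * thread_info 位于这块内存的最低地址,栈从高地址向下生长。 */
#define FAKE_THREAD_SIZE (16 * 1024)

struct fake_thread_info { void *task; unsigned int flags; };

union fake_thread_union {
    struct fake_thread_info thread_info;
    unsigned long stack[FAKE_THREAD_SIZE / sizeof(long)];
};
```

由于是联合体,thread_info 与 stack 数组共享同一块 16KB 内存的起始地址,这正是“thread_info 在栈底”的含义。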
所以 INIT_TASK 宏所做的就是填充这个 task_struct 结构。正如上文所写,我不会逐一描述这些字段的含义和值,在后面遇到它们时我们很快就能看到。
现在让我们回到 set_task_stack_end_magic 函数。这个函数定义在 kernel/fork.c,功能是为 init 进程的堆栈设置金丝雀值(canary),以检测堆栈溢出。
void set_task_stack_end_magic(struct task_struct *tsk)
{
unsigned long *stackend;
stackend = end_of_stack(tsk);
*stackend = STACK_END_MAGIC; /* for overflow detection */
}
上述函数比较简单:它先通过 end_of_stack 函数获取给定 task_struct 的栈末端地址。栈末端的位置取决于内核配置宏 CONFIG_STACK_GROWSUP;因为我们学习的是 x86 架构,其栈向下生长,所以栈的末端为:
(unsigned long *)(task_thread_info(p) + 1);
task_thread_info 的定义如下,它返回任务的 thread_info,即栈所在内存块的起始处:
#define task_thread_info(task) ((struct thread_info *)(task)->stack)
我们在进程栈的末端写入 STACK_END_MAGIC 这个值。设置好这个 canary 之后,就可以像这样检测堆栈溢出:
if (*end_of_stack(task) != STACK_END_MAGIC) {
//
// handle stack overflow here
//
}
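这种“写入魔数、之后检测是否被覆盖”的做法可以用下面的示意代码表达(fake_ 前缀均为演示用假设):

```c
#define FAKE_STACK_END_MAGIC 0x57AC6E9DUL

/* 示意:在栈底(x86 栈向下生长,栈底是这块内存中紧随
 * thread_info 之后的第一个 long)写入魔数 */
static void fake_set_magic(unsigned long *stack_end)
{
    *stack_end = FAKE_STACK_END_MAGIC;
}

/* 示意:若栈一路向下生长越过了边界,魔数就会被覆盖 */
static int fake_stack_overflowed(const unsigned long *stack_end)
{
    return *stack_end != FAKE_STACK_END_MAGIC;
}
```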
set_task_stack_end_magic
初始化完毕后的下一个函数是 smp_setup_processor_id
.此函数在x86_64
架构上是空函数:
void __init __weak smp_setup_processor_id(void)
{
}
在此架构上没有实现此函数,但在别的体系架构的实现可以参考s390 and arm64.
我们接着往下走,下一个函数是debug_objects_early_init
。此函数的执行几乎和lockdep_init
是一样的,但是填充的哈希对象是调试相关。上述我已经表明,关于内核调试部分会在后续专门有一个章节来完成。
debug_objects_early_init 函数之后我们看到调用了 boot_init_stack_canary 函数,它负责填充 task_struct->stack_canary 的值。这个值利用了 GCC 的栈保护特性,需要先使能内核配置宏 CONFIG_CC_STACKPROTECTOR 才可以使用。如果没有开启该选项,boot_init_stack_canary 什么也不做;否则它基于随机池和 TSC(时间戳计数器)生成 canary:
get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);
得到随机数之后,我们把它赋给 task_struct 的 stack_canary 字段:
current->stack_canary = canary;
然后将此值写入IRQ堆栈的顶部:
this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write
关于 IRQ 的章节这里不会详细剖析,相关介绍参见 IRQs。设置好 canary 之后,我们需要关闭本地中断(当前 CPU 的中断),并把引导 CPU(bootstrap CPU)注册到各个 CPU 位图(CPU maps)中。关闭本地中断使用 local_irq_disable 函数,它展开后是 include/linux/percpu-defs.h 中的 arch_local_irq_disable 函数:
static inline notrace void arch_local_irq_disable(void)
{
        native_irq_disable();
}
在 x86_64 架构上,native_irq_disable 就是 cli 指令,用来关闭(屏蔽)中断。中断关闭之后,我们接着把当前 CPU 的 ID 注册到 CPU 位图中。
激活第一个CPU
当前已经走到 start_kernel 函数中的 boot_cpu_init 函数,此函数的作用是把引导 CPU 标记到各个 CPU 掩码中。首先我们通过下面的调用获取当前处理器的 ID:
int cpu = smp_processor_id();
现在它是 0。如果没有配置 CONFIG_DEBUG_PREEMPT 宏,smp_processor_id 就直接展开为 raw_smp_processor_id,原型如下:
#define raw_smp_processor_id() (this_cpu_read(cpu_number))
this_cpu_read
函数与其它很多函数一样如(this_cpu_write
, this_cpu_add
等等...) 被定义在include/linux/percpu-defs.h 此部分函数主要为对 this_cpu
进行操作. 这些操作提供不同的对每cpuper-cpu 变量相关访问方式. 譬如让我们来看看这个函数 this_cpu_read
:
__pcpu_size_call_return(this_cpu_read_, pcp)
还记得上面我们所写,每cpu变量cpu_number
的值是this_cpu_read
通过raw_smp_processor_id
来得到,现在让我们看看 __pcpu_size_call_return
的执行:
#define __pcpu_size_call_return(stem, variable) \
({ \
typeof(variable) pscr_ret__; \
__verify_pcpu_ptr(&(variable)); \
switch(sizeof(variable)) { \
case 1: pscr_ret__ = stem##1(variable); break; \
case 2: pscr_ret__ = stem##2(variable); break; \
case 4: pscr_ret__ = stem##4(variable); break; \
case 8: pscr_ret__ = stem##8(variable); break; \
default: \
__bad_size_call_parameter(); break; \
} \
pscr_ret__; \
})
是的,此宏虽然看起来奇怪,但它的实现是简单的。我们看到 pscr_ret__ 变量被定义为 typeof(variable) 类型,对 cpu_number 而言就是 int,因为它是这样声明的每 CPU(per-cpu)变量:
DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
在下一步中,__verify_pcpu_ptr 会检查 cpu_number 的地址是不是一个有效的每 CPU 变量指针。之后根据变量的大小(sizeof)选择分支:cpu_number 是 int,大小为 4 字节,所以会走到 this_cpu_read_4(cpu_number)。在 __pcpu_size_call_return 的最后,调用的就是 this_cpu_read_4:
#define this_cpu_read_4(pcp) percpu_from_op("mov", pcp)
它会调用 percpu_from_op,并传入 mov 指令和要读取的每 CPU 变量。percpu_from_op 的内联汇编展开如下:
asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (common_cpu))
让我们尝试理解它是如何工作的:gs 段寄存器保存着每 CPU 区域的基地址,这里通过 mov 指令把 common_cpu 从内存复制到 pfo_ret__ 中。这个调用还可以写成等价形式:
this_cpu_read(common_cpu)
等价于:
movl %gs:$common_cpu, $pfo_ret__
由于当前还没有设置多个每 CPU 区域,我们只有一个 CPU,它的编号是 0,这个值最终由 smp_processor_id 函数返回。
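这种“按 sizeof 分派到不同宽度读函数”的技巧,可以用普通内存模拟出来(真实实现经由 %gs 段寄存器访问每 CPU 区域;fake_ 前缀均为演示用假设):

```c
#include <stdint.h>
#include <string.h>

static uint64_t fake_read_4(const void *p)
{
    uint32_t v; memcpy(&v, p, 4); return v;
}

static uint64_t fake_read_8(const void *p)
{
    uint64_t v; memcpy(&v, p, 8); return v;
}

/* 示意:仿照 __pcpu_size_call_return 的思路,按 sizeof(var)
 * 在编译期选择对应宽度的读函数 */
#define fake_cpu_read(var)                      \
    (sizeof(var) == 4 ? fake_read_4(&(var))     \
                      : fake_read_8(&(var)))
```

真实的宏还覆盖了 1、2 字节的情况,并在不支持的大小上触发编译错误(__bad_size_call_parameter)。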
得到当前 CPU 的 ID 之后,boot_cpu_init 函数把这个 CPU 设置为在线(online)、激活(active)、存在(present)、可用(possible):
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
上面用到的这些 CPU 集合称为 CPU 掩码(cpumask)。cpu_possible 是系统中可能被插入(包括 CPU 热插拔时)的 CPU ID 集合;cpu_present 是当前已插入的 CPU 集合;cpu_online 是 cpu_present 的子集,表示可以被调度的在线 CPU;cpu_active 表示可以进行任务迁移的 CPU。如果内核没有打开配置宏 CONFIG_HOTPLUG_CPU(即不支持 CPU 热插拔),则 possible == present 且 active == online。这些设置函数都非常相似:每个函数都检查第二个参数,如果为 true 就调用 cpumask_set_cpu,否则调用 cpumask_clear_cpu 来改变对应掩码。
例如,当第二个参数为 true 时,调用展开为:
cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));
让我们继续尝试理解 to_cpumask 宏,此宏把一个位图转换为 struct cpumask * 类型。CPU 掩码提供的位图代表了当前系统中所有的 CPU,每个 CPU 占用 1 bit。CPU 掩码通过 cpumask 结构定义:
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
再来看定义位图的 DECLARE_BITMAP 宏:
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]
正如定义所示,DECLARE_BITMAP 宏展开为一个 unsigned long 数组。现在来看 to_cpumask 是如何实现的:
#define to_cpumask(bitmap) \
((struct cpumask *)(1 ? (bitmap) \
: (void *)sizeof(__check_is_bitmap(bitmap))))
我不知道你怎么想,反正我第一眼看到时很疑惑:它就是一个条件始终为真的三元表达式,那为什么还要有 __check_is_bitmap 这个分支?让我们看看 __check_is_bitmap 的定义:
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
return 1;
}
原来此函数始终返回 1。事实上我们需要它只为一个目的:在编译时检查给定的 bitmap 确实是位图,换句话说就是检查 bitmap 的类型是否为 unsigned long *。因此 to_cpumask 宏只是把 unsigned long 数组安全地转换为 struct cpumask *。现在我们可以调用 cpumask_set_cpu 函数了,它只是对 CPU 掩码做一次 set_bit 操作。所有的 set_cpu_* 函数的原理都是一样的。
如果你还不确定set_cpu_*
这些函数的操作并且不能理解 cpumask
的概念,不要担心。你可以通过读取这些章节cpumask or documentation.来继续了解和学习这些函数的原理。
现在我们已经激活了第一个 CPU,接着沿 start_kernel 函数往下走。下一个函数是 page_address_init,此函数在这里不执行任何操作:只有当内存不能全部被直接映射时它才有实际工作。
Linux 内核的第一条打印信息
下面调用了pr_notice函数。
#define pr_notice(fmt, ...) \
printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
pr_notice其实是printk的扩展,这里我们使用它打印了Linux 的banner。
pr_notice("%s", linux_banner);
打印的是内核的版本号以及编译环境信息:
Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #319 SMP
依赖于体系结构的初始化部分
下一步我们就要进入体系架构相关的初始化。Linux 内核通过调用 setup_arch 函数完成体系架构相关的初始化,这又是一个类似于 start_kernel 的庞大函数,这里我们仅简单描述,在后面的章节中再继续深入。setup_arch 函数定义在 arch/x86/kernel/setup.c 文件中,此函数只有一个参数,即内核命令行。
此函数首先为内核的 _text 和 _data 段预留内存,范围从 _text 符号到 __bss_stop(你应该还记得它们定义在 arch/x86/kernel/head_64.S 中)。我们使用 memblock 来预留这块内存:
memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);
你可以阅读关于memblock
的相关内容在Linux kernel memory management Part 1.,你应该还记得memblock_reserve
函数的两个参数:
- 内存块的基物理地址;
- 内存块的大小。
我们可以通过 __pa_symbol 宏来获取 _text 符号的物理地址:
#define __pa_symbol(x) \
__phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
上述宏先用 __phys_reloc_hide 宏处理参数,在 x86_64 上 __phys_reloc_hide 原样返回其参数。__phys_addr_symbol 宏的实现很简单:用给定地址减去内核符号映射基址 __START_KERNEL_map,再加上物理基地址 phys_base:
#define __phys_addr_symbol(x) \
((unsigned long)(x) - __START_KERNEL_map + phys_base)
这样 memblock_reserve 就为内核的代码段和数据段预留了内存块。
为 initrd 保留可用内存
在为内核的 text 和 data 段预留内存之后,下一步是为 initrd 预留内存。我们暂时不去了解 initrd 的细节,你只需要知道初始根文件系统就是通过这种方式加载的,这正是 early_reserve_initrd 函数的工作。此函数获取 RAM DISK 的基地址、大小,以及基地址加大小得到的结束地址:
u64 ramdisk_image = get_ramdisk_image();
u64 ramdisk_size = get_ramdisk_size();
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
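其中 PAGE_ALIGN 把地址向上对齐到页边界,其算法可以这样示意(fake_page_align 为演示用假设宏):

```c
#include <stdint.h>

/* 示意:向上对齐到 4KB 页边界 —— 先加上 (页大小 - 1),
 * 再把低 12 位清零 */
#define FAKE_PAGE_SIZE 4096ULL
#define fake_page_align(x) \
    (((x) + FAKE_PAGE_SIZE - 1) & ~(FAKE_PAGE_SIZE - 1))
```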
如果你阅读过 Linux Kernel Booting Process 的相关章节,你就知道所有这些参数都来自 boot_params,它在 boot 期间已经被填充。内核启动头包含以下几个字段用来描述 RAM DISK:
Field name: ramdisk_image
Type: write (obligatory)
Offset/size: 0x218/4
Protocol: 2.00+
The 32-bit linear address of the initial ramdisk or ramfs. Leave at
zero if there is no initial ramdisk/ramfs.
我们可以得到关于 boot_params
的一些信息. 具体查看get_ramdisk_image
:
static u64 __init get_ramdisk_image(void)
{
u64 ramdisk_image = boot_params.hdr.ramdisk_image;
ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
return ramdisk_image;
}
关于 ramdisk 地址的高 32 位,我们可以阅读 Documentation/x86/zero-page.txt 来获取:
0C0/004 ALL ext_ramdisk_image ramdisk_image high 32bits
拼接出高低 32 位之后,我们就得到了 64 位的 ramdisk 地址,获取 ramdisk 大小的原理也一样。接下来检查 bootloader 是否提供了 ramdisk 信息:
if (!boot_params.hdr.type_of_loader ||
!ramdisk_image || !ramdisk_size)
return;
如果提供了,就预留相应的内存块,以便稍后把 ramdisk 搬移到最终的内存地址并进行初始化:
memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);
结束语
以上就是第四部分关于内核初始化的部分内容,我们从start_kernel
函数开始一直到指定体系架构初始化setup_arch
的过程中停止,那么在下一个章节我们将继续研究体系架构相关的初始化内容。
如果你有任何的问题或者建议,你可以留言,也可以直接发消息给我twitter。
很抱歉,英语并不是我的母语,非常抱歉给您阅读带来不便。如果你发现文中描述有任何问题,请提交一个 PR 到 linux-insides。
链接
- GCC function attributes
- this_cpu operations
- cpumask
- lock validator
- cgroups
- stack buffer overflow
- IRQs
- initrd
- Previous part
内核初始化 第五部分
与系统架构有关的初始化后续分析
在之前的章节中, 我们讲到了与系统架构有关的 setup_arch 函数部分,本文会继续从这里开始。 我们为 initrd 预留了内存之后,下一步是执行 olpc_ofw_detect
函数检测系统是否支持 One Laptop Per Child support。 我们不会考虑与平台有关的东西,且会忽略与平台有关的函数。所以我们继续往下看。 下一步是执行 early_trap_init
函数。这个函数会初始化 #DB(调试异常,当 rflags 中的 TF 标志位被设置时触发)和 #BP(断点,由 int3 指令触发)的中断门。如果你不了解中断,可以从 初期中断和异常处理 中学习有关中断的内容。在 x86 架构中,INT、INTO 和 INT3 是支持任务显式调用中断处理函数的特殊指令,其中 INT3 指令会触发断点(#BP)处理函数。你如果还记得,我们在这部分看到过中断和异常的概念:
----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description |Type |Error Code|Source |
----------------------------------------------------------------------------------------------
|3 | #BP |Breakpoint |Trap |NO |INT 3 |
----------------------------------------------------------------------------------------------
调试中断 #DB
是激活调试器的重要方法。early_trap_init
函数的定义在 arch/x86/kernel/traps.c 中。这个函数用来设置 #DB
和 #BP
处理函数,并且实现重新加载 IDT。
void __init early_trap_init(void)
{
set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
load_idt(&idt_descr);
}
我们之前中断相关章节中看到过 set_intr_gate
的实现。这里的 set_intr_gate_ist
和 set_system_intr_gate_ist
也是类似的实现。 这两个函数都需要三个参数:
- 中断号
- 中断/异常处理函数的基地址
- 第三个参数是
Interrupt Stack Table
。IST
是 TSS 的部分内容,是x86_64
引入的新机制。 在内核态处于活跃状态的线程拥有16kb
的内核栈空间。但是在用户空间的线程的内核栈是空的。 除了线程栈,还有一些与每个CPU
有关的特殊栈。你可以查阅 linux 内核文档 - Kernel stacks 部分了解这些栈信息。x86_64
提供了像在非屏蔽中断等类似事件中切换新的特殊栈的特性支持。这个特性的名字是Interrupt Stack Table
。 每个CPU最多可以有 7 个IST
条目,每个条目有自己特定的栈。在我们的案例中使用的是DEBUG_STACK
。
set_intr_gate_ist
和 set_system_intr_gate_ist
与 set_intr_gate
的工作原理几乎一样,只有一个区别。 这些函数检查中断号并在内部调用 _set_gate
:
BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);
其中, set_intr_gate
把 dpl 和 ist
置为 0 来调用 _set_gate
。 但是 set_intr_gate_ist
和 set_system_intr_gate_ist
把 ist
设置为 DEBUG_STACK
,并且 set_system_intr_gate_ist
把 dpl
设置为优先级最低的 0x3
。 当中断发生时,硬件加载这个描述符,然后硬件根据 IST
的值自动设置新的栈指针。 之后激活对应的中断处理函数。所有的特殊内核栈会在 cpu_init
函数中设置好(我们会在后文中提到)。
在把 #DB 和 #BP 的门描述符写入 idt_descr 之后,我们调用 load_idt 函数,它执行 lidt 指令来重新加载 IDT 表。现在我们来了解下中断处理函数并尝试理解它的工作原理。当然,我们不可能在这本书中讲解所有的中断处理函数。深入学习 linux 内核源码是很有意思的事情,我们会在这里讲解 debug
处理函数的实现。请自行学习其他的中断处理函数实现。
#DB
处理函数
像上文中提到的,我们在 set_intr_gate_ist
中通过 &debug
的地址传送 #DB
处理函数。lxr.free-electorns.com 是很好的用来搜索 linux 源代码中标识符的资源。 遗憾的是,你在其中找不到 debug
处理函数。你只能在 arch/x86/include/asm/traps.h 中找到 debug
的定义:
asmlinkage void debug(void);
从 asmlinkage
属性我们可以知道 debug
是由 assembly 语言实现的函数。是的,又是汇编语言 :)。 和其他处理函数一样,#DB
处理函数的实现可以在 arch/x86/kernel/entry_64.S 文件中找到。 都是由 idtentry
汇编宏定义的:
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry
是一个定义中断/异常指令入口点的宏。它需要五个参数:
- 中断条目点的名字
- 中断处理函数的名字
- 是否有中断错误码
- paranoid - 如果这个参数置为 1,则切换到特殊栈
- shift_ist - 支持中断期间切换栈
现在我们来看下 idtentry
宏的实现。这个宏的定义也在相同的汇编文件中,并且定义了有 ENTRY
宏属性的 debug
函数。 首先,idtentry
宏检查所有的参数是否正确,是否需要切换到特殊栈。接下来检查中断返回的错误码。例如本案例中的 #DB
不会返回错误码。 如果有错误码返回,它会调用 INTR_FRAME
或者 XCPT_FRAM
宏。其实 XCPT_FRAME
和 INTR_FRAME
宏什么也不会做,只是对中断初始状态编译的时候有用。 它们使用 CFI
指令用来调试。你可以查阅更多有关 CFI
指令的信息 CFI。 就像 arch/x86/kernel/entry_64.S 中解释:CFI
宏是用来产生更好的回溯的 dwarf2
的解开信息。 它们不会改变任何代码。因此我们可以忽略它们。
.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
/* Sanity check */
.if \shift_ist != -1 && \paranoid == 0
.error "using shift_ist requires paranoid=1"
.endif
.if \has_error_code
XCPT_FRAME
.else
INTR_FRAME
.endif
...
...
...
当中断发生后经过初期的中断/异常处理,我们可以知道栈内的格式是这样的:
+-----------------------+
| |
+40 | SS |
+32 | RSP |
+24 | RFLAGS |
+16 | CS |
+8 | RIP |
0 | Error Code | <---- rsp
| |
+-----------------------+
idtentry
实现中的另外两个宏分别是
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
第一个 ASM_CLAC
宏依赖于 CONFIG_X86_SMAP
这个配置项,与安全防护有关,你可以从这里了解更多内容。 第二个 PARAVIRT_ADJUST_EXCEPTION_FRAME
宏是用来处理 Xen
类型异常(这章只讲解内核初始化,不会考虑虚拟化的内容)。 下一段代码会检查中断是否有错误码。如果没有则会把 $-1
(在 x86_64
架构下值为 0xffffffffffffffff
)压入栈:
.ifeq \has_error_code
pushq_cfi $-1
.endif
为了保证对于所有中断的栈的一致性,我们会把它处理为 dummy
错误码。下一步我们从栈指针中减去 $ORIG_RAX-R15
:
subq $ORIG_RAX-R15, %rsp
其中,ORIG_RAX
,R15
和其他宏都定义在 arch/x86/include/asm/calling.h 中。ORIG_RAX-R15
是 120 字节。 我们在中断处理过程中需要把所有的寄存器信息存储在栈中,所有通用寄存器会占用这个 120 字节。 为通用寄存器设置完栈之后,下一步是检查从用户空间产生的中断:
testl $3, CS(%rsp)
jnz 1f
我们查看段寄存器 CS
的前两个比特位。你应该记得 CS
寄存器包含段选择器,它的前两个比特是 RPL
。所有的权限等级是0-3范围内的整数。 数字越小代表权限越高。因此当中断来自内核空间,我们会调用 save_paranoid
,如果不来自内核空间,我们会跳转到标签 1
处处理。 在 save_paranoid
函数中,我们会把所有的通用寄存器存储到栈中,如果需要的话还会把用户态 gs 切换到内核态 gs:
movl $1,%ebx
movl $MSR_GS_BASE,%ecx
rdmsr
testl %edx,%edx
js 1f
SWAPGS
xorl %ebx,%ebx
1: ret
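正文中 testl $3, CS(%rsp) 检查的正是段选择子低两位的 RPL(请求特权级),可以用一个小示例表达这一判断(fake_ 前缀均为演示用假设):

```c
#include <stdint.h>

/* 示意:段选择子的低两位是 RPL,内核态为 0,用户态为 3 */
static int fake_selector_rpl(uint16_t selector)
{
    return selector & 3;
}

static int fake_from_userspace(uint16_t cs)
{
    return fake_selector_rpl(cs) != 0;
}
```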
下一步我们把 pt_regs
指针存在 rdi
中,如果存在错误码就把它存储到 rsi
中,然后调用中断处理函数,例如就像 arch/x86/kernel/trap.c中的 do_debug
。 do_debug
像其他处理函数一样需要两个参数:
- pt_regs - 指向保存在内核栈上的一组 CPU 寄存器的结构体
- error code - 中断错误码
中断处理函数完成工作后会调用 paranoid_exit
还原栈区。如果中断来自用户空间则切换回用户态并调用 iret
。我们会在不同的章节继续深入分析中断。 这是用在 #DB
中断中的 idtentry
宏的基本介绍。所有的中断都和这个实现类似,都定义在 idtentry
中。early_trap_init
执行完后,下一个函数是 early_cpu_init
。 这个函数定义在 arch/x86/kernel/cpu/common.c 中,负责收集 CPU
和其供应商的信息。
早期ioremap初始化
下一步是初始化早期的 ioremap
。通常有两种实现与设备通信的方式:
- I/O端口
- 设备内存
我们在 linux 内核启动过程中见过第一种方法(通过 outb/inb
指令实现)。 第二种方法是把 I/O
的物理地址映射到虚拟地址。当 CPU
读取一段物理地址时,它可以读取到映射了 I/O
设备的物理 RAM
区域。 ioremap
就是用来把设备内存映射到内核地址空间的。
像我上面提到的下一个函数时 early_ioremap_init
,它可以在正常的像 ioremap
这样的映射函数可用之前,把 I/O
内存映射到内核地址空间以方便读取。 我们需要在初期的初始化代码中初始化临时的 ioremap
来映射 I/O
设备到内存区域。初期的 ioremap
实现在 arch/x86/mm/ioremap.c 中可以找到。 在 early_ioremap_init
的一开始我们可以看到 pmd_t
类型的 pmd
指针定义(代表页中间目录条目 typedef struct {pmdval_t pmd; } pmd_t;
其中 pmdval_t
是无符号长整型)。 然后检查 fixmap
是正确对齐的:
pmd_t *pmd;
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));
fixmap
- 是一段从 FIXADDR_START
到 FIXADDR_TOP
的固定虚拟地址映射区域。它在子系统需要知道虚拟地址的编译过程中会被使用。 之后 early_ioremap_init
函数会调用 mm/early_ioremap.c 中的 early_ioremap_setup
函数。 early_ioremap_setup 会用 512 个临时启动时固定映射(boot-time fixmap)槽位的虚拟地址,填充 unsigned long 数组 slot_virt:
for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);
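“槽位索引换算成固定虚拟地址”的思路可以这样示意(FAKE_FIXADDR_TOP 是演示用的假设值,并非真实的 FIXADDR_TOP):

```c
#include <stdint.h>

/* 示意:fixmap 的做法 —— 每个固定映射槽位对应一个从顶端
 * 地址向下按页递减的虚拟地址(__fix_to_virt 的思路)。 */
#define FAKE_FIXADDR_TOP 0xffffffffff5ff000ULL
#define FAKE_PAGE_SHIFT  12

static uint64_t fake_fix_to_virt(unsigned int idx)
{
    return FAKE_FIXADDR_TOP - ((uint64_t)idx << FAKE_PAGE_SHIFT);
}
```

因为换算只依赖索引和常量,这些虚拟地址在编译期就已经确定,这正是 fixmap 的价值所在。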
之后我们就获得了 FIX_BTMAP_BEGIN
的页中间目录条目,并把它赋值给了 pmd
变量,把启动时间页表 bm_pte
写满 0。然后调用 pmd_populate_kernel
函数设置给定的页中间目录的页表条目:
pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);
这就是所有过程。如果你仍然觉得困惑,不要担心。在 内核内存管理,第二部分 章节会有单独一部分讲解 ioremap
和 fixmaps
。
获取根设备的主次设备号
ioremap
初始化完成后,紧接着是执行下面的代码:
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
这段代码用来获取根设备的主次设备号。后面 initrd
会通过 do_mount_root
函数挂载到这个根设备上。其中主设备号用来识别和这个设备有关的驱动。 次设备号用来表示使用该驱动的各设备。注意 old_decode_dev
函数是从 boot_params 结构
中获取了一个参数。我们可以从 x86 linux 内核启动协议中查到:
Field name: root_dev
Type: modify (optional)
Offset/size: 0x1fc/2
Protocol: ALL
The default root device device number. The use of this field is
deprecated, use the "root=" option on the command line instead
现在我们来看看 old_decode_dev
如何实现的。实际上它只是根据主次设备号调用了 MKDEV
来生成一个 dev_t
类型的设备。它的实现很简单:
static inline dev_t old_decode_dev(u16 val)
{
return MKDEV((val >> 8) & 255, val & 255);
}
其中 dev_t
是用来表示主/次设备号对的一个内核数据类型。但是这个奇怪的 old
前缀代表了什么呢?出于历史原因,有两种管理主次设备号的方法。 第一种方法主次设备号占用 2 字节。你可以在以前的代码中发现:主设备号占用 8 bit,次设备号占用 8 bit。但是这会引入一个问题:最多只能支持 256 个主设备号和 256 个次设备号。 因此后来引入了 32 bit 来表示主次设备号,其中 12 位用来表示主设备号,20 位用来表示次设备号。你可以在 new_decode_dev
的实现中找到:
static inline dev_t new_decode_dev(u32 dev)
{
unsigned major = (dev & 0xfff00) >> 8;
unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
return MKDEV(major, minor);
}
如果 dev
的值是 0xffffffff
,经过计算我们可以得到用来表示主设备号的 12 位值 0xfff
,表示次设备号的20位值 0xfffff
。因此经过 old_decode_dev
我们最终可以得到在 ROOT_DEV
中根设备的主次设备号。
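新旧两种主次设备号编码可以用下面的示意代码对照(fake_ 前缀均为演示用假设,fake_mkdev 仿照 MKDEV 的 12+20 位编码):

```c
#include <stdint.h>

typedef uint32_t fake_dev_t;

/* 示意:仿照 MKDEV,高 12 位为主设备号,低 20 位为次设备号 */
#define FAKE_MINORBITS 20
#define fake_mkdev(ma, mi) (((fake_dev_t)(ma) << FAKE_MINORBITS) | (mi))

/* 旧编码:16 位值,主/次设备号各占 8 位 */
static fake_dev_t fake_old_decode_dev(uint16_t val)
{
    return fake_mkdev((val >> 8) & 255u, val & 255u);
}

/* 新编码:32 位值,12 位主设备号 + 20 位次设备号 */
static fake_dev_t fake_new_decode_dev(uint32_t dev)
{
    unsigned major = (dev & 0xfff00) >> 8;
    unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
    return fake_mkdev(major, minor);
}
```

对于旧编码也能表示的设备号(如主 8 次 1,即 0x0801),两种解码得到的结果一致。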
Memory Map设置
下一步是调用 setup_memory_map
函数设置内存映射。但是在这之前我们需要设置与显示屏有关的参数(目前有行、列,视频页等,你可以在 显示模式初始化和进入保护模式 中了解), 与拓展显示识别数据,视频模式,引导启动器类型等参数:
screen_info = boot_params.screen_info;
edid_info = boot_params.edid_info;
saved_video_mode = boot_params.hdr.vid_mode;
bootloader_type = boot_params.hdr.type_of_loader;
if ((bootloader_type >> 4) == 0xe) {
bootloader_type &= 0xf;
bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
}
bootloader_version = bootloader_type & 0xf;
bootloader_version |= boot_params.hdr.ext_loader_ver << 4;
我们可以从启动时候存储在 boot_params
结构中获取这些参数信息。之后我们需要设置 I/O
内存。众所周知,内核主要做的工作就是资源管理。其中一个资源就是内存。 我们也知道目前有通过 I/O
口和设备内存两种方法实现设备通信。所有有关注册资源的信息可以通过 /proc/ioports
和 /proc/iomem
获得:
- /proc/ioports - 提供用于设备输入输出通信的一组已注册端口区域
- /proc/iomem - 提供每个物理设备的系统内存映射地址

我们先来看下 /proc/iomem:
cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
000e0000-000e3fff : PCI Bus 0000:00
000e4000-000e7fff : PCI Bus 0000:00
000f0000-000fffff : System ROM
可以看到,根据不同属性划分为以十六进制符号表示的一段地址范围。linux 内核提供了用来管理所有资源的一种通用 API。全局资源(比如 PICs 或者 I/O 端口)可以划分为与硬件总线插槽有关的子集。 resource
的主要结构是:
struct resource {
resource_size_t start;
resource_size_t end;
const char *name;
unsigned long flags;
struct resource *parent, *sibling, *child;
};
例如下图中的树形系统资源子集示例。这个结构提供了资源占用的从 start
到 end
的地址范围(resource_size_t
是 phys_addr_t
类型,在 x86_64
架构上是 u64
)。 资源名(你可以在 /proc/iomem
输出中看到),资源标记(所有的资源标记定义在 include/linux/ioport.h 文件中)。最后三个是资源结构体指针,如下图所示:
+-------------+ +-------------+
| | | |
| parent |------| sibling |
| | | |
+-------------+ +-------------+
|
|
+-------------+
| |
| child |
| |
+-------------+
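父/兄弟/子指针构成的资源树,可以用一个简化的查找示例来体会(fake_ 前缀均为演示用假设,真实内核的查找逻辑更复杂):

```c
#include <stdint.h>
#include <stddef.h>

/* 示意:仿照 struct resource 的父/兄弟/子指针,
 * 在某个资源的子节点中查找包含给定地址的子资源 */
struct fake_resource {
    uint64_t start, end;           /* 闭区间 [start, end] */
    const char *name;
    struct fake_resource *parent, *sibling, *child;
};

static struct fake_resource *
fake_find_child(struct fake_resource *root, uint64_t addr)
{
    for (struct fake_resource *r = root->child; r; r = r->sibling)
        if (addr >= r->start && addr <= r->end)
            return r;
    return NULL;
}
```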
每个资源子集有自己的根范围资源。iomem
的资源 iomem_resource
的定义是:
struct resource iomem_resource = {
.name = "PCI mem",
.start = 0,
.end = -1,
.flags = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);
iomem_resource
利用 PCI mem
名字和 IORESOURCE_MEM (0x00000200)
标记定义了 io
内存的根地址范围。就像上文提到的,我们目前的目的是设置 iomem
的结束地址,我们需要这样做:
iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
我们对1左移 boot_cpu_data.x86_phys_bits
。boot_cpu_data
是我们在执行 early_cpu_init
的时候初始化的 cpuinfo_x86
结构。从字面理解,x86_phys_bits
代表系统可达到的最大内存地址时需要的比特数。 另外,iomem_resource
是通过 EXPORT_SYMBOL
宏传递的。这个宏可以把指定的符号(例如 iomem_resource
)做动态链接。换句话说,它可以支持动态加载模块的时候访问对应符号。 设置完根 iomem
的资源地址范围的结束地址后,下一步就是设置内存映射。它通过调用 setup_memory_map
函数实现:
void __init setup_memory_map(void)
{
char *who;
who = x86_init.resources.memory_setup();
memcpy(&e820_saved, &e820, sizeof(struct e820map));
printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
e820_print_map(who);
}
首先,我们来看下 x86_init.resources.memory_setup
。x86_init
是一种 x86_init_ops
类型的结构体,用来表示项资源初始化,pci
初始化平台特定的一些设置函数。 x86_init
的初始化实现在 arch/x86/kernel/x86_init.c 文件中。我不会全部解释这个初始化过程,因为我们只关心一个地方:
struct x86_init_ops x86_init __initdata = {
.resources = {
.probe_roms = probe_roms,
.reserve_resources = reserve_standard_io_resources,
.memory_setup = default_machine_specific_memory_setup,
},
...
...
...
}
我们可以看到,这里的 memory_setup 被赋值为 default_machine_specific_memory_setup。这个函数整理了我们在内核启动过程中收集的所有 e820 条目,并把内存分区填入 e820map 结构体中。所有收集到的内存分区会用 printk
打印出来。你可以通过运行 dmesg
命令找到类似于下面的信息:
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
...
...
复制 BIOS
增强磁盘设备信息
下面两步是通过 parse_setup_data
函数解析 setup_data
,并且把 BIOS
的 EDD
信息复制到安全的地方。 setup_data
是内核启动头中包含的字段,我们可以在 x86
的启动协议中了解:
Field name: setup_data
Type: write (special)
Offset/size: 0x250/8
Protocol: 2.09+
The 64-bit physical pointer to NULL terminated single linked list of
struct setup_data. This is used to define a more extensible boot
parameters passing mechanism.
它用来存储不同类型的设置信息,例如设备树 blob
,EFI
设置数据等等。第二步是从 boot_params
结构中复制我们在 arch/x86/boot/edd.c 中 BIOS
的 EDD
信息到 edd
结构中。
static inline void __init copy_edd(void)
{
memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
sizeof(edd.mbr_signature));
memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
edd.edd_info_nr = boot_params.eddbuf_entries;
}
内存描述符初始化
下一步是在初始化阶段完成内存描述符的初始化。我们知道每个进程都有自己的运行内存地址空间,它由一种叫做内存描述符(memory descriptor)的特殊数据结构来表示。在 linux 内核源码中内存描述符是用 mm_struct
结构体表示的。mm_struct
包含许多不同的与进程地址空间有关的字段,像内核代码/数据段的起始和结束地址, brk
的起始和结束,内存区域的数量,内存区域列表等。这些结构定义在 include/linux/mm_types.h 中。task_struct
结构的 mm
和 active_mm
字段包含了每个进程自己的内存描述符。 我们的第一个 init
进程也有自己的内存描述符。在之前的章节我们看到过通过 INIT_TASK
宏实现 task_struct
的部分初始化信息:
#define INIT_TASK(tsk) \
{
...
...
...
.mm = NULL, \
.active_mm = &init_mm, \
...
}
mm
指向进程地址空间,active_mm
则用于内核线程这类没有自己地址空间的情况,指向其借用的有效地址空间(你可以在这个文档 中了解更多内容)。 接下来我们在初始化阶段完成内存描述符中内核代码段,数据段和 brk
段的初始化:
init_mm.start_code = (unsigned long) _text;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
init_mm.brk = _brk_end;
init_mm
是初始化阶段的内存描述符定义:
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
INIT_MM_CONTEXT(init_mm)
};
其中 mm_rb
是虚拟内存区域的红黑树结构,pgd
是全局页目录的指针,mm_user
是使用该内存空间的进程数目,mm_count
是主引用计数,mmap_sem
是内存区域信号量。 在初始化阶段完成内存描述符的设置后,下一步是通过 mpx_mm_init
完成 Intel
内存保护扩展的初始化。下一步是代码/数据/bss
资源的初始化:
code_resource.start = __pa_symbol(_text);
code_resource.end = __pa_symbol(_etext)-1;
data_resource.start = __pa_symbol(_etext);
data_resource.end = __pa_symbol(_edata)-1;
bss_resource.start = __pa_symbol(__bss_start);
bss_resource.end = __pa_symbol(__bss_stop)-1;
通过上面我们已经知道了一小部分关于 resource
结构体的样子。在这里,我们把物理地址段赋值给代码/数据/bss
段。你可以在 /proc/iomem
中看到:
00100000-be825fff : System RAM
01000000-015bb392 : Kernel code
015bb393-01930c3f : Kernel data
01a11000-01ac3fff : Kernel bss
在 arch/x86/kernel/setup.c 中有所有这些结构体的定义:
static struct resource code_resource = {
.name = "Kernel code",
.start = 0,
.end = 0,
.flags = IORESOURCE_BUSY | IORESOURCE_MEM
};
The last part of this section is the `NX` configuration. The `NX-bit`, or no-execute bit, is bit 63 of a page directory entry. It controls whether code can be executed from the mapped physical page. The bit can only be used/set when the no-execute page-protection mechanism is enabled by setting `EFER.NXE` to 1. The `x86_configure_nx` function checks whether the CPU supports the `NX-bit` and whether it has been disabled, and sets `__supported_pte_mask` according to the result:
```C
void x86_configure_nx(void)
{
	if (cpu_has_nx && !disable_nx)
		__supported_pte_mask |= _PAGE_NX;
	else
		__supported_pte_mask &= ~_PAGE_NX;
}
```
Conclusion

This is the end of the fifth part about the Linux kernel initialization process. In this part we looked at the architecture-specific `setup_arch` function. It was a long part, but we are not finished with it: `setup_arch` is a very big function, and I am not sure we will cover all of it even in the coming chapters. There were some interesting concepts in this part, such as fix-mapped addresses, `ioremap` and so on. Don't worry if something was unclear; it is explained in more detail in kernel memory management, Part 2. In the next part we will continue with architecture-specific initialization and look at parsing of early kernel parameters, the early dump of PCI devices, Desktop Management Interface scanning and much more.
If you have any questions or suggestions, write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
链接
- mm vs active_mm
- e820
- Supervisor mode access prevention
- Kernel stacks
- TSS
- IDT
- Memory mapped I/O
- CFI directives
- PDF. dwarf4 specification
- Call stack
- Kernel initialization. Part 4.
Kernel initialization. Part 6.
Architecture-specific initialization, again...
In the previous part we saw architecture-specific (`x86_64` in our case) initialization stuff from arch/x86/kernel/setup.c and finished with the `x86_configure_nx` function, which sets the `_PAGE_NX` flag depending on support of the NX bit. As I wrote before, the `setup_arch` and `start_kernel` functions are very big, so in this and in the next part we will continue to learn about the architecture-specific initialization process. The next function after `x86_configure_nx` is `parse_early_param`. This function is defined in init/main.c and, as you can understand from its name, it parses the kernel command line and sets up different services depending on the given parameters (you can find all kernel command line parameters in Documentation/kernel-parameters.txt). You may remember how we set up `earlyprintk` in the earliest part. At that early stage we looked for kernel parameters and their values with the `cmdline_find_option` function and the `__cmdline_find_option`, `__cmdline_find_option_bool` helpers from arch/x86/boot/cmdline.c. Here we are in the generic, architecture-independent part of the kernel, and a different approach is used. If you have been reading the Linux kernel source code, you have probably already noticed calls like this:
```C
early_param("gbpages", parse_direct_gbpages_on);
```
The `early_param` macro takes two parameters:

- the command line parameter name;
- the function which will be called if the given parameter is passed,

and is defined as:
```C
#define early_param(str, fn) \
        __setup_param(str, fn, fn, 1)
```
in include/linux/init.h. As you can see, the `early_param` macro just makes a call to the `__setup_param` macro:
```C
#define __setup_param(str, unique_id, fn, early)                \
        static const char __setup_str_##unique_id[] __initconst \
                __aligned(1) = str; \
        static struct obs_kernel_param __setup_##unique_id      \
                __used __section(.init.setup)                   \
                __attribute__((aligned((sizeof(long)))))        \
                = { __setup_str_##unique_id, fn, early }
```
This macro defines a `__setup_str_*` variable (where `*` depends on the given function name) and initializes it with the given command line parameter name. On the following lines we can see the definition of the `__setup_*` variable of type `obs_kernel_param` and its initialization. The `obs_kernel_param` structure is defined as:
```C
struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *);
        int early;
};
```
and contains three fields:

- the name of the kernel parameter;
- the function which sets something up depending on the parameter;
- a field which determines whether the parameter is early (1) or not (0).
Note that the `__setup_param` macro is defined with the `__section(.init.setup)` attribute. It means that all `__setup_*` structures will be placed in the `.init.setup` section; moreover, as we can see in include/asm-generic/vmlinux.lds.h, they will be placed between `__setup_start` and `__setup_end`:
```C
#define INIT_SETUP(initsetup_align)                \
                . = ALIGN(initsetup_align);        \
                VMLINUX_SYMBOL(__setup_start) = .; \
                *(.init.setup)                     \
                VMLINUX_SYMBOL(__setup_end) = .;
```
Now that we know how parameters are defined, let's get back to the `parse_early_param` implementation:
```C
void __init parse_early_param(void)
{
	static int done __initdata;
	static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;

	if (done)
		return;

	/* All fall through to do_early_param. */
	strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
	parse_early_options(tmp_cmdline);
	done = 1;
}
```
The `parse_early_param` function defines two static variables. The first, `done`, checks whether `parse_early_param` has already been called; the second is temporary storage for the kernel command line. After this we copy `boot_command_line` to the temporary command line which we just defined and call the `parse_early_options` function from the same source code file, main.c. `parse_early_options` calls the `parse_args` function from kernel/params.c, where `parse_args` parses the given command line and calls the `do_early_param` function. This function goes from `__setup_start` to `__setup_end` and calls the function from the matching `obs_kernel_param` entry if the parameter is early. After this, all services which depend on early command line parameters have been set up, and the next call after `parse_early_param` is `x86_report_nx`. As I wrote in the beginning of this part, we already set the `NX-bit` with `x86_configure_nx`. The `x86_report_nx` function from arch/x86/mm/setup_nx.c just prints information about the `NX` bit. Note that we call `x86_report_nx` not right after `x86_configure_nx`, but after the call of `parse_early_param`. The answer is simple: we call it after `parse_early_param` because the kernel supports the `noexec` parameter:
```
noexec		[X86]
		On X86-32 available only on PAE configured kernels.
		noexec=on: enable non-executable mappings (default)
		noexec=off: disable non-executable mappings
```
We can see its output at boot time. After this we can see the call of the:
```C
memblock_x86_reserve_range_setup_data();
```
function. This function is defined in the same arch/x86/kernel/setup.c source code file; it remaps memory for `setup_data` and reserves the memory block for `setup_data` (you can read more about `setup_data` in the previous part, and about `ioremap` and `memblock` in the Linux kernel memory management part).
In the next step we can see following conditional statement:
```C
if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
	disable_apic = 1;
#endif
	setup_clear_cpu_cap(X86_FEATURE_APIC);
}
```
The first function, `acpi_mps_check` from arch/x86/kernel/acpi/boot.c, depends on the `CONFIG_X86_LOCAL_APIC` and `CONFIG_X86_MPPARSE` configuration options:
```C
int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
	/* mptable code is not built-in*/
	if (acpi_disabled || acpi_noirq) {
		printk(KERN_WARNING "MPS support code is not built-in.\n"
		       "Using acpi=off or acpi=noirq or pci=noacpi "
		       "may have problem\n");
		return 1;
	}
#endif
	return 0;
}
```
It checks the built-in `MPS`, or MultiProcessor Specification, table. If `CONFIG_X86_LOCAL_APIC` is set and `CONFIG_X86_MPPARSE` is not set, `acpi_mps_check` prints a warning message if one of the command line options `acpi=off`, `acpi=noirq` or `pci=noacpi` was passed to the kernel. If `acpi_mps_check` returns `1`, it means that we disable the local APIC and clear the `X86_FEATURE_APIC` bit in the capabilities of the current CPU with the `setup_clear_cpu_cap` macro (you can read more about CPU masks in the CPU masks part).
Early PCI dump
In the next step we make a dump of the PCI devices with the following code:
```C
#ifdef CONFIG_PCI
	if (pci_early_dump_regs)
		early_dump_pci_devices();
#endif
```
The `pci_early_dump_regs` variable is defined in arch/x86/pci/common.c and its value depends on the `pci=earlydump` kernel command line parameter. We can find the definition of this parameter in drivers/pci/pci.c:
```C
early_param("pci", pci_setup);
```
The `pci_setup` function gets the string after `pci=` and analyzes it. This function calls `pcibios_setup`, which is defined as `__weak` in drivers/pci/pci.c, and every architecture defines its own function which overrides the `__weak` analog. For example, the `x86_64` architecture-dependent version is in arch/x86/pci/common.c:
```C
char *__init pcibios_setup(char *str) {
	...
	...
	...
	} else if (!strcmp(str, "earlydump")) {
		pci_early_dump_regs = 1;
		return NULL;
	}
	...
	...
	...
}
```
So, if the `CONFIG_PCI` option is set and we passed the `pci=earlydump` option on the kernel command line, the next function which will be called is `early_dump_pci_devices` from arch/x86/pci/early.c. This function checks the `noearly` pci parameter with:

```C
if (!early_pci_allowed())
        return;
```

and returns if it was passed. Each PCI domain can host up to `256` buses, each bus can host up to 32 devices and each device can have up to 8 functions. So we go in a loop:
```C
for (bus = 0; bus < 256; bus++) {
	for (slot = 0; slot < 32; slot++) {
		for (func = 0; func < 8; func++) {
			...
			...
			...
		}
	}
}
```
and read the `pci` config with the `read_pci_config` function.

That's all. We will not go deep into `pci` details here, but will see more in the special `Drivers/PCI` part.
Finish with memory parsing

After `early_dump_pci_devices`, there are a couple of functions related to available memory and e820 which we collected in the First steps in the kernel setup part:
```C
/* update the e820_saved too */
e820_reserve_setup_data();
finish_e820_parsing();
...
...
...
e820_add_kernel_range();
trim_bios_range(void);
max_pfn = e820_end_of_ram_pfn();
early_reserve_e820_mpc_new();
```
Let's look at them. As you can see, the first function is `e820_reserve_setup_data`. This function does almost the same as the `memblock_x86_reserve_range_setup_data` which we saw above, but it also calls `e820_update_range`, which adds new regions to the `e820map` with the given type, `E820_RESERVED_KERN` in our case. The next function is `finish_e820_parsing`, which sanitizes the `e820map` with the `sanitize_e820_map` function. Besides these two functions we can see a couple more e820-related functions in the listing above. The `e820_add_kernel_range` function takes the physical addresses of the kernel start and end:
```C
u64 start = __pa_symbol(_text);
u64 size = __pa_symbol(_end) - start;
```
checks that the `.text`, `.data` and `.bss` sections are marked as `E820_RAM` in the `e820map` and prints a warning message if not. The next function, `trim_bios_range`, updates the first 4096 bytes in the `e820map` as `E820_RESERVED` and sanitizes it again with a call of `sanitize_e820_map`. After this we get the last page frame number with a call of the `e820_end_of_ram_pfn` function. Every memory page has a unique number, the page frame number, and the `e820_end_of_ram_pfn` function returns the maximum one with a call of `e820_end_pfn`:
```C
unsigned long __init e820_end_of_ram_pfn(void)
{
	return e820_end_pfn(MAX_ARCH_PFN);
}
```
where `e820_end_pfn` takes the maximum page frame number for the given architecture (`MAX_ARCH_PFN` is `0x400000000` for `x86_64`). In `e820_end_pfn` we go through all `e820` slots and check that each `e820` entry has type `E820_RAM` or `E820_PRAM`, because we calculate page frame numbers only for these types. Then we get the base and end page frame numbers of the current `e820` entry and make some checks on them:
```C
for (i = 0; i < e820.nr_map; i++) {
	struct e820entry *ei = &e820.map[i];
	unsigned long start_pfn;
	unsigned long end_pfn;

	if (ei->type != E820_RAM && ei->type != E820_PRAM)
		continue;

	start_pfn = ei->addr >> PAGE_SHIFT;
	end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

	if (start_pfn >= limit_pfn)
		continue;
	if (end_pfn > limit_pfn) {
		last_pfn = limit_pfn;
		break;
	}
	if (end_pfn > last_pfn)
		last_pfn = end_pfn;
}

if (last_pfn > max_arch_pfn)
	last_pfn = max_arch_pfn;

printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
		 last_pfn, max_arch_pfn);
return last_pfn;
```
After the loop we check that the `last_pfn` which we got is not greater than the maximum page frame number for the given architecture (`x86_64` in our case), print information about the last page frame number and return it. We can see `last_pfn` in the `dmesg` output:
```
...
[    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000
...
```
After this, as we have calculated the biggest page frame number, we calculate `max_low_pfn`, the biggest page frame number in `low memory`, i.e. below the first `4` gigabytes. If more than 4 gigabytes of RAM are installed, `max_low_pfn` will be the result of the `e820_end_of_low_ram_pfn` function, which does the same as `e820_end_of_ram_pfn` but with a 4 gigabyte limit; otherwise `max_low_pfn` will be the same as `max_pfn`:
```C
if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
	max_low_pfn = e820_end_of_low_ram_pfn();
else
	max_low_pfn = max_pfn;

high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;
```
Next we calculate `high_memory` (the upper bound of direct-mapped memory) with the `__va` macro, which returns the virtual address for a given physical address.
DMI scanning

The next step after the manipulations with the different memory regions and `e820` slots is collecting information about the computer. We get all this information via the Desktop Management Interface with the following functions:
```C
dmi_scan_machine();
dmi_memdev_walk();
```
The first is `dmi_scan_machine`, defined in drivers/firmware/dmi_scan.c. This function goes through the System Management BIOS structures and extracts information. There are two specified ways to gain access to the `SMBIOS` table: getting the pointer to the `SMBIOS` table from the EFI configuration table, or scanning the physical memory in the `0xF0000`-`0xFFFFF` range. Let's look at the second approach. The `dmi_scan_machine` function remaps `0x10000` bytes starting at physical address `0xF0000` with `dmi_early_remap`, which just expands to `early_ioremap`:
```C
void __init dmi_scan_machine(void)
{
	char __iomem *p, *q;
	char buf[32];
	...
	...
	...
	p = dmi_early_remap(0xF0000, 0x10000);
	if (p == NULL)
		goto error;
```
and iterates over the whole range in 16-byte steps searching for the `_SM_` string:
```C
memset(buf, 0, 16);
for (q = p; q < p + 0x10000; q += 16) {
	memcpy_fromio(buf + 16, q, 16);
	if (!dmi_smbios3_present(buf) || !dmi_present(buf)) {
		dmi_available = 1;
		dmi_early_unmap(p, 0x10000);
		goto out;
	}
	memcpy(buf, buf + 16, 16);
}
```
The `_SM_` string must be between `0x000F0000` and `0x000FFFFF`. Here we copy 16 bytes into `buf` with `memcpy_fromio`, which is the same as `memcpy` for I/O memory, and execute `dmi_smbios3_present` and `dmi_present` on the buffer. These functions check that the first 4 bytes are the `_SM_` string, get the `SMBIOS` version and read the `_DMI_` attributes such as the `DMI` structure table length, table address and so on. After one of these functions finishes, you will see the result in the `dmesg` output:
```
[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014
```
At the end of `dmi_scan_machine`, we unmap the previously remapped memory:

```C
dmi_early_unmap(p, 0x10000);
```
The second function is `dmi_memdev_walk`. As you can understand from its name, it goes over memory devices. Let's look at it:
```C
void __init dmi_memdev_walk(void)
{
	if (!dmi_available)
		return;

	if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) {
		dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr);
		if (dmi_memdev)
			dmi_walk_early(save_mem_devices);
	}
}
```
It checks that `DMI` is available (we detected this in the previous function, `dmi_scan_machine`) and collects information about memory devices with `dmi_walk_early` and `dmi_alloc`, which is defined as:
```C
#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif
```
`RESERVE_BRK` is defined in arch/x86/include/asm/setup.h and reserves space of the given size in the `brk` section.

After this, `setup_arch` makes the following calls:

```C
init_hypervisor_platform();
x86_init.resources.probe_roms();
insert_resource(&iomem_resource, &code_resource);
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);
early_gart_iommu_check();
```
SMP config

The next step is parsing of the SMP configuration. We do it with a call of the `find_smp_config` function, which just calls:

```C
static inline void find_smp_config(void)
{
        x86_init.mpparse.find_smp_config();
}
```
`x86_init.mpparse.find_smp_config` is the `default_find_smp_config` function from arch/x86/kernel/mpparse.c. In the `default_find_smp_config` function we scan a couple of memory regions for the `SMP` config and return if it is found:
```C
if (smp_scan_config(0x0, 0x400) ||
    smp_scan_config(639 * 0x400, 0x400) ||
    smp_scan_config(0xF0000, 0x10000))
	return;
```
First of all, the `smp_scan_config` function defines a couple of variables:

```C
unsigned int *bp = phys_to_virt(base);
struct mpf_intel *mpf;
```
The first is the virtual address of the memory region where we will scan for the `SMP` config; the second is a pointer to the `mpf_intel` structure. Let's try to understand what `mpf_intel` is. All the information is stored in the multiprocessor configuration data structure, and `mpf_intel` represents the MP floating pointer structure:
```C
struct mpf_intel {
	char signature[4];
	unsigned int physptr;
	unsigned char length;
	unsigned char specification;
	unsigned char checksum;
	unsigned char feature1;
	unsigned char feature2;
	unsigned char feature3;
	unsigned char feature4;
	unsigned char feature5;
};
```
As we can read in the documentation, one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. The operating system must have access to this information about the multiprocessor configuration, and `mpf_intel` stores the physical address (look at the `physptr` field) of the multiprocessor configuration table. So `smp_scan_config` goes in a loop through the given memory range and tries to find the `MP floating pointer structure` there. In the loop it checks that the current bytes contain the `SMP` signature, checks the checksum, and checks that `mpf->specification` is 1 or 4 (it must be `1` or `4` by the specification):
```C
while (length > 0) {
	if ((*bp == SMP_MAGIC_IDENT) &&
	    (mpf->length == 1) &&
	    !mpf_checksum((unsigned char *)bp, 16) &&
	    ((mpf->specification == 1)
	     || (mpf->specification == 4))) {
		mem = virt_to_phys(mpf);
		memblock_reserve(mem, sizeof(*mpf));
		if (mpf->physptr)
			smp_reserve_memory(mpf);
	}
}
```
If the search is successful, we reserve the found memory block with `memblock_reserve` and also reserve the physical address of the multiprocessor configuration table. You can find documentation about all of this in the MultiProcessor Specification. You can read more details in the special part about `SMP`.
Additional early memory initialization routines

In the next step of `setup_arch` we can see the call of the `early_alloc_pgt_buf` function, which allocates the page table buffer for the early stage. The page table buffer will be placed in the `brk` area. Let's look at its implementation:
```C
void __init early_alloc_pgt_buf(void)
{
	unsigned long tables = INIT_PGT_BUF_SIZE;
	phys_addr_t base;

	base = __pa(extend_brk(tables, PAGE_SIZE));

	pgt_buf_start = base >> PAGE_SHIFT;
	pgt_buf_end = pgt_buf_start;
	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}
```
First of all it gets the size of the page table buffer; it is `INIT_PGT_BUF_SIZE`, which is `(6 * PAGE_SIZE)` in the current Linux kernel 4.0. Once we have the size of the page table buffer, we call the `extend_brk` function with two parameters: size and align. As you can understand from its name, this function extends the `brk` area. As we can see in the Linux kernel linker script, `brk` is in memory right after the BSS:
```
. = ALIGN(PAGE_SIZE);
.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
	__brk_base = .;
	. += 64 * 1024;		/* 64k alignment slop space */
	*(.brk_reservation)	/* areas brk users have reserved */
	__brk_limit = .;
}
```
Or we can find it with the `readelf` util.
After we get the physical address of the new `brk` with the `__pa` macro, we calculate the base address and the top of the page table buffer. In the next step, as we have the page table buffer, we reserve the memory block for the brk area with the `reserve_brk` function:
```C
static void __init reserve_brk(void)
{
	if (_brk_end > _brk_start)
		memblock_reserve(__pa_symbol(_brk_start),
				 _brk_end - _brk_start);

	_brk_start = 0;
}
```
Note that at the end of `reserve_brk` we set `_brk_start` to zero, because after this we will not allocate from it anymore. The next step after reserving the memory block for `brk` is to unmap out-of-range memory areas in the kernel mapping with the `cleanup_highmap` function. Remember that the kernel mapping is `__START_KERNEL_map` to `_end - _text`, and that `level2_kernel_pgt` maps the kernel's `_text`, `data` and `bss`. At the start of `cleanup_highmap` we define these parameters:
```C
unsigned long vaddr = __START_KERNEL_map;
unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
```
Now, as we have defined the start and end of the kernel mapping, we go in a loop through all kernel page middle directory entries and clear entries which are not between `_text` and `end`:
```C
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
	if (pmd_none(*pmd))
		continue;
	if (vaddr < (unsigned long) _text || vaddr > end)
		set_pmd(pmd, __pmd(0));
}
```
After this we set the limit for `memblock` allocation with the `memblock_set_current_limit` function (you can read more about `memblock` in Linux kernel memory management Part 2); it will be `ISA_END_ADDRESS`, or `0x100000`. Then we fill the `memblock` information according to `e820` with a call of the `memblock_x86_fill` function. You can see the result of this function at kernel initialization time:
```
MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]	[0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]	[0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]	[0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]	[0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]	[0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]	[0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0
```
The rest of the functions after `memblock_x86_fill` are: `early_reserve_e820_mpc_new`, which allocates additional slots in the `e820map` for the MultiProcessor Specification table; `reserve_real_mode`, which reserves low memory from `0x0` to 1 megabyte for the trampoline to real mode (for rebooting and so on); `trim_platform_memory_ranges`, which trims certain memory regions starting at `0x20050000`, `0x20110000`, etc. (these regions must be excluded because Sandy Bridge has problems with them); `trim_low_memory_range`, which reserves the first 4 kilobyte page in `memblock`; `init_mem_mapping`, which reconstructs the direct memory mapping and sets up the direct mapping of physical memory at `PAGE_OFFSET`; `early_trap_pf_init`, which sets up the `#PF` handler (we will look at it in the chapter about interrupts); and the `setup_real_mode` function, which sets up the trampoline to the real mode code.
That's all for now. Note that this part does not cover every function in `setup_arch` (like `early_gart_iommu_check`, mtrr initialization and so on). As I have already written many times, `setup_arch` is big, and the Linux kernel is big. That's why I can't cover every line of the Linux kernel. I don't think we missed anything important, but you may say that each line of code is important. That's true, but it is simply not realistic to cover the full Linux kernel. Anyway, we will often return to ideas we have already seen, and if something is unfamiliar, we will cover it then.
Conclusion

This is the end of the sixth part about the Linux kernel initialization process. In this part we continued to dive into the `setup_arch` function, and it was a long part, but we are not finished with it. Yes, `setup_arch` is big; I hope that the next part will be the last about this function.
If you have any questions or suggestions write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
- MultiProcessor Specification
- NX bit
- Documentation/kernel-parameters.txt
- APIC
- CPU masks
- Linux kernel memory management
- PCI
- e820
- System Management BIOS
- EFI
- SMP
- BSS
- SMBIOS specification
- Previous part
Kernel initialization. Part 7.

The End of the architecture-specific initialization, almost...

This is the seventh part of the Linux kernel initialization process, which covers the insides of the `setup_arch` function from arch/x86/kernel/setup.c. As you know from the previous parts, the `setup_arch` function does architecture-specific (in our case `x86_64`) initialization stuff like reserving memory for the kernel code/data/bss, early scanning of the Desktop Management Interface, the early dump of PCI devices and many, many more things. If you have read the previous part, you may remember that we finished it at the `setup_real_mode` function. In the next step, as we have set the memblock limit to all mapped pages, we can see the call of the `setup_log_buf` function from kernel/printk/printk.c.
The `setup_log_buf` function sets up the kernel cyclic buffer, whose length depends on the `CONFIG_LOG_BUF_SHIFT` configuration option. As we can read from the documentation of `CONFIG_LOG_BUF_SHIFT`, it can be between `12` and `21`. Internally, the buffer is defined as an array of chars:
```C
#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;
```
Now let's look at the implementation of the `setup_log_buf` function. It starts with a check that the current buffer is empty (it must be empty, because we have just set it up) and another check that this is an early setup. If setup of the kernel log buffer is not early, we call the `log_buf_add_cpu` function, which increases the size of the buffer for every CPU:
```C
if (log_buf != __log_buf)
	return;

if (!early && !new_log_buf_len)
	log_buf_add_cpu();
```
We will not research the `log_buf_add_cpu` function here, because, as you can see in `setup_arch`, we call `setup_log_buf` as:

```C
setup_log_buf(1);
```

where `1` means that it is an early setup. In the next step we check the `new_log_buf_len` variable, which holds the updated length of the kernel log buffer, and either allocate new space for the buffer with the `memblock_virt_alloc` function or just return.
As the kernel log buffer is ready, the next function is `reserve_initrd`. You may remember that we already called the `early_reserve_initrd` function in the fourth part of the kernel initialization. Now, as we have reconstructed the direct memory mapping in the `init_mem_mapping` function, we need to move the initrd into the directly mapped memory. The `reserve_initrd` function starts from the definition of the base and end addresses of the `initrd` and a check that the `initrd` was provided by the bootloader. This is all the same as what we saw in `early_reserve_initrd`. But instead of reserving a place in the `memblock` area with a call of the `memblock_reserve` function, we get the mapped size of the direct memory area and check that the size of the `initrd` is not greater than this area with:
```C
mapped_size = memblock_mem_size(max_pfn_mapped);
if (ramdisk_size >= (mapped_size>>1))
	panic("initrd too large to handle, "
	      "disabling initrd (%lld needed, %lld available)\n",
	      ramdisk_size, mapped_size>>1);
```
You can see here that we call the `memblock_mem_size` function and pass `max_pfn_mapped` to it, where `max_pfn_mapped` contains the highest directly mapped page frame number. If you do not remember what a `page frame number` is, the explanation is simple: the low `12` bits of an address represent the offset within a physical page, or page frame; if we right-shift a physical address by `12` bits, we discard the offset part and get the `page frame number`. In `memblock_mem_size` we go through all memblock `mem` (not reserved) regions, calculate the size of the mapped pages and return it in the `mapped_size` variable (see the code above). As we have the amount of directly mapped memory, we check that the size of the `initrd` is not greater than the mapped pages. If it is greater, we just call `panic`, which halts the system and prints the famous kernel panic message. In the next step we print information about the `initrd` size. We can see the result of this in the `dmesg` output:
```
[    0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]
```
and relocate the `initrd` to the direct mapping area with the `relocate_initrd` function. At the start of the `relocate_initrd` function we try to find a free area with the `memblock_find_in_range` function:
```C
relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE);

if (!relocated_ramdisk)
	panic("Cannot find place for new RAMDISK of size %lld\n",
	      ramdisk_size);
```
The `memblock_find_in_range` function tries to find a free area in a given range, in our case from `0` to the maximum mapped physical address, and the size must equal the aligned size of the `initrd`. If we don't find an area of the given size, we call `panic` again. If all is good, we relocate the RAM disk image to the bottom of the directly mapped memory in the next step.
At the end of the `reserve_initrd` function, we free the memblock memory occupied by the ramdisk with the call of:

```C
memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
```
After we have relocated the `initrd` ramdisk image, the next function is `vsmp_init` from arch/x86/kernel/vsmp_64.c. This function initializes support for ScaleMP `vSMP`. As I wrote in the previous parts, this chapter will not cover initialization parts unrelated to `x86_64` (such as `ACPI`, etc.). So we will skip its implementation for now and come back to it in the part covering techniques of parallel computing.
The next function is `io_delay_init` from arch/x86/kernel/io_delay.c. This function allows overriding the default I/O delay port `0x80`. We already saw I/O delays in the Last preparation before transition into protected mode; now let's look at the `io_delay_init` implementation:
```C
void __init io_delay_init(void)
{
	if (!io_delay_override)
		dmi_check_system(io_delay_0xed_port_dmi_table);
}
```
This function checks the `io_delay_override` variable and overrides the I/O delay port if `io_delay_override` is set. We can set the `io_delay_override` variable by passing the `io_delay` option on the kernel command line. As we can read in Documentation/kernel-parameters.txt, the `io_delay` option is:
```
io_delay=	[X86] I/O delay method
	0x80
		Standard port 0x80 based delay
	0xed
		Alternate port 0xed based delay (needed on some systems)
	udelay
		Simple two microseconds delay
	none
		No delay
```
We can see the `io_delay` command line parameter set up with the `early_param` macro in arch/x86/kernel/io_delay.c:

```C
early_param("io_delay", io_delay_param);
```
You can read more about `early_param` in the previous part. The `io_delay_param` function, which sets the `io_delay_override` variable, will be called by the do_early_param function. `io_delay_param` gets the argument of the `io_delay` kernel command line parameter and sets `io_delay_type` depending on it:
```C
static int __init io_delay_param(char *s)
{
	if (!s)
		return -EINVAL;

	if (!strcmp(s, "0x80"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
	else if (!strcmp(s, "0xed"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
	else if (!strcmp(s, "udelay"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
	else if (!strcmp(s, "none"))
		io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
	else
		return -EINVAL;

	io_delay_override = 1;
	return 0;
}
```
The next functions after `io_delay_init` are `acpi_boot_table_init`, `early_acpi_boot_init` and `initmem_init`, but as I wrote above, we will not cover ACPI-related stuff in this `Linux kernel initialization process` chapter.
Allocate area for DMA

In the next step we need to allocate an area for Direct Memory Access with the `dma_contiguous_reserve` function, which is defined in drivers/base/dma-contiguous.c. `DMA` is a special mode in which devices communicate with memory without the CPU. Note that we pass one parameter, `max_pfn_mapped << PAGE_SHIFT`, to the `dma_contiguous_reserve` function; as you can understand from this expression, it is the limit of the reserved memory. Let's look at the implementation of this function. It starts with the definition of the following variables:
```C
phys_addr_t selected_size = 0;
phys_addr_t selected_base = 0;
phys_addr_t selected_limit = limit;
bool fixed = false;
```
where the first represents the size in bytes of the reserved area, the second is the base address of the reserved area, the third is the end address of the reserved area and the last, `fixed`, determines where to place the reserved area. If `fixed` is `1`, we just reserve the area with `memblock_reserve`; if it is `0`, we allocate space with `kmemleak_alloc`. In the next step we check the `size_cmdline` variable, and if it is not equal to `-1` we fill all the variables above with the values from the `cma` kernel command line parameter:
if (size_cmdline != -1) {
...
...
...
}
In this source code file you can find the definition of the early parameter:
early_param("cma", early_cma);
where cma
is:
cma=nn[MG]@[start[MG][-end[MG]]]
[ARM,X86,KNL]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
memory allocations. A value of 0 disables CMA
altogether. For more information, see
include/linux/dma-contiguous.h
If we do not pass the cma option to the kernel command line, size_cmdline will be equal to -1. In this case we need to calculate the size of the reserved area, which depends on the following kernel configuration options:
- CONFIG_CMA_SIZE_SEL_MBYTES - size in megabytes; the default global CMA area, which is equal to CMA_SIZE_MBYTES * SZ_1M or CONFIG_CMA_SIZE_MBYTES * 1M;
- CONFIG_CMA_SIZE_SEL_PERCENTAGE - percentage of total memory;
- CONFIG_CMA_SIZE_SEL_MIN - use the lower value;
- CONFIG_CMA_SIZE_SEL_MAX - use the higher value.
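The min/max selection between these options can be sketched like this (a hypothetical helper, not the kernel's code; cma_default_size and the sel enum are invented names):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M (1024ULL * 1024)

/* Invented enum mirroring the CONFIG_CMA_SIZE_SEL_* choices above. */
enum cma_size_sel { SEL_MBYTES, SEL_PERCENTAGE, SEL_MIN, SEL_MAX };

/* Pick the default CMA size: a fixed megabyte value, a percentage of
 * total memory, or the min/max of the two. */
static uint64_t cma_default_size(enum cma_size_sel sel, uint64_t total_mem,
                                 uint64_t size_mbytes, uint64_t percent)
{
    uint64_t by_mbytes = size_mbytes * SZ_1M;
    uint64_t by_percent = total_mem * percent / 100;

    switch (sel) {
    case SEL_MBYTES:     return by_mbytes;
    case SEL_PERCENTAGE: return by_percent;
    case SEL_MIN:        return by_mbytes < by_percent ? by_mbytes : by_percent;
    case SEL_MAX:        return by_mbytes > by_percent ? by_mbytes : by_percent;
    }
    return 0;
}
```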
Having calculated the size of the reserved area, we reserve it with a call to the dma_contiguous_reserve_area function, which first of all calls:
ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);
function. The cma_declare_contiguous reserves a contiguous area from the given base address with the given size. After we have reserved the area for DMA, the next function is memblock_find_dma_reserve. As you can understand from its name, this function counts the reserved pages in the DMA area. This part does not cover all the details of CMA and DMA, because they are big topics. We will see many more details in the special part of the Linux Kernel Memory management chapter which covers contiguous memory allocators and areas.
Initialization of the sparse memory
The next step is the call of the function x86_init.paging.pagetable_init. If you try to find this function in the Linux kernel source code, at the end of your search you will see the following macro:
#define native_pagetable_init paging_init
which, as you can see, expands to a call of the paging_init function from arch/x86/mm/init_64.c. The paging_init function initializes sparse memory and zone sizes. First of all, what are zones and what is Sparsemem? Sparsemem is a special mechanism in the Linux kernel memory manager which is used to split a memory area into different memory banks on NUMA systems. Let's look at the implementation of the paging_init function:
void __init paging_init(void)
{
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
node_clear_state(0, N_MEMORY);
if (N_MEMORY != N_NORMAL_MEMORY)
node_clear_state(0, N_NORMAL_MEMORY);
zone_sizes_init();
}
As you can see, there is a call of the sparse_memory_present_with_active_regions function, which records a memory area for every NUMA node in the array of mem_section structures, each of which contains a pointer to an array of struct page. The next function, sparse_init, allocates the non-linear mem_section and mem_map. In the next step we clear the state of the movable memory nodes and initialize the sizes of the zones. Every NUMA node is divided into a number of pieces which are called zones. So the zone_sizes_init function from arch/x86/mm/init.c initializes the sizes of the zones.
Again, this part and the next parts do not cover this topic in full detail. There will be a special part about NUMA.
vsyscall mapping
The next step after SparseMem initialization is setting the trampoline_cr4_features, which must contain the content of the cr4 Control register. First of all we need to check that the current CPU has support for the cr4 register, and if it does, we save its content to trampoline_cr4_features, which is the storage for cr4 in real mode:
if (boot_cpu_data.cpuid_level >= 0) {
mmu_cr4_features = __read_cr4();
if (trampoline_cr4_features)
*trampoline_cr4_features = mmu_cr4_features;
}
The next function is map_vsyscall from arch/x86/kernel/vsyscall_64.c. This function maps the memory space for vsyscalls and depends on the CONFIG_X86_VSYSCALL_EMULATION kernel configuration option. Actually, vsyscall is a special segment which provides fast access to certain system calls like getcpu, etc. Let's look at the implementation of this function:
void __init map_vsyscall(void)
{
extern char __vsyscall_page;
unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
if (vsyscall_mode != NONE)
__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
vsyscall_mode == NATIVE
? PAGE_KERNEL_VSYSCALL
: PAGE_KERNEL_VVAR);
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
}
At the beginning of map_vsyscall we can see the definitions of two variables. The first is the extern variable __vsyscall_page. As an extern variable, it is defined in another source code file. Actually, we can see the definition of __vsyscall_page in arch/x86/kernel/vsyscall_emu_64.S. The __vsyscall_page symbol points to the aligned calls of the vsyscalls such as gettimeofday, etc.:
.globl __vsyscall_page
.balign PAGE_SIZE, 0xcc
.type __vsyscall_page, @object
__vsyscall_page:
mov $__NR_gettimeofday, %rax
syscall
ret
.balign 1024, 0xcc
mov $__NR_time, %rax
syscall
ret
...
...
...
The second variable is physaddr_vsyscall, which just stores the physical address of the __vsyscall_page symbol. In the next step we check the vsyscall_mode variable, which is EMULATE by default, and test whether it is not equal to NONE:
static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
And after this check we can see the call of the __set_fixmap
function which calls native_set_fixmap
with the same parameters:
void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
__native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
}
void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
{
unsigned long address = __fix_to_virt(idx);
if (idx >= __end_of_fixed_addresses) {
BUG();
return;
}
set_pte_vaddr(address, pte);
fixmaps_set++;
}
Here we can see that native_set_fixmap builds a Page Table Entry value from the given physical address (the physical address of the __vsyscall_page symbol in our case) and calls the internal function __native_set_fixmap. The internal function gets the virtual address for the given fixed_addresses index (VSYSCALL_PAGE in our case) and checks that the given index is not greater than the end of the fix-mapped addresses. After this we set the page table entry with a call of the set_pte_vaddr function and increase the count of fix-mapped addresses. At the end of map_vsyscall we check, with the BUILD_BUG_ON macro, that the virtual address of VSYSCALL_PAGE (which is the first index in fixed_addresses) is equal to VSYSCALL_ADDR, which is -10UL << 20 or ffffffffff600000:
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
(unsigned long)VSYSCALL_ADDR);
Now the vsyscall area is in the fix-mapped area. That's all about map_vsyscall. If you do not know anything about fix-mapped addresses, you can read Fix-Mapped Addresses and ioremap. We will see more about vsyscalls in the vsyscalls and vdso part.
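The VSYSCALL_ADDR arithmetic above is easy to check in isolation; a tiny sketch (using uint64_t to model the kernel's 64-bit unsigned long on x86_64):

```c
#include <assert.h>
#include <stdint.h>

/* (-10UL << 20): -10 in two's complement is 0xfffffffffffffff6; shifting
 * left by 20 bits yields the fixed vsyscall virtual address. */
static uint64_t vsyscall_addr(void)
{
    return (uint64_t)-10 << 20;
}
```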
Getting the SMP configuration
You may remember how we searched for the SMP configuration in the previous part. Now we need to get the SMP configuration if we found one. For this we check the smp_found_config variable, which we set in the smp_scan_config function (read about it in the previous part), and call the get_smp_config function:
if (smp_found_config)
get_smp_config();
The get_smp_config expands to the x86_init.mpparse.default_get_smp_config function, which is defined in arch/x86/kernel/mpparse.c. This function defines a pointer to the multiprocessor floating pointer structure - mpf_intel (you can read about it in the previous part) and does some checks:
struct mpf_intel *mpf = mpf_found;
if (!mpf)
return;
if (acpi_lapic && early)
return;
Here we check that the multiprocessor configuration was found by the smp_scan_config function, and just return from the function if it was not. The next checks are acpi_lapic and early. Having done these checks, we start to read the SMP configuration. When we have finished reading it, the next step is the prefill_possible_map function, which does a preliminary filling of the cpumask of possible CPUs (you can read more about it in the Introduction to the cpumasks).
The rest of the setup_arch
Here we are getting to the end of the setup_arch function. The rest of the function is of course important, but the details will not be included in this part. We will just take a short look at these functions, because although they are important, they cover non-generic kernel features related to NUMA, SMP, ACPI, APICs, etc. First of all, there is the call of the init_apic_mappings function. As we can understand, this function sets the address of the local APIC. The next is x86_io_apic_ops.init, and this function initializes the I/O APIC. Please note that we will see all the details related to APIC in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like DMA, TIMER, FPU, etc., with the call of the x86_init.resources.reserve_resources function. Following that, the mcheck_init function initializes Machine check Exception handling, and the last, register_refined_jiffies, registers the jiffy (there will be a separate chapter about timers in the kernel).
So that's all. We have finally finished with the big setup_arch function in this part. Of course, as I already wrote many times, we did not see the full details of this function, but do not worry about it. We will come back to this function more than once from different chapters to understand how different platform-dependent parts are initialized.
That's all, and now we can return to start_kernel from setup_arch.
Back to the main.c
As I wrote above, we have finished with the setup_arch function and now we can return to the start_kernel function from init/main.c. As you may remember or have seen yourself, the start_kernel function is as big as setup_arch. So the next couple of parts will be dedicated to learning about this function. Let's continue with it. After setup_arch we can see the call of the mm_init_cpumask function. This function sets the cpumask pointer in the memory descriptor. We can look at its implementation:
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
cpumask_clear(mm->cpu_vm_mask_var);
}
As you can see in init/main.c, we pass the memory descriptor of the init process to mm_init_cpumask and, depending on the CONFIG_CPUMASK_OFFSTACK configuration option, we clear the TLB switch cpumask.
In the next step we can see the call of the following function:
setup_command_line(command_line);
This function takes a pointer to the kernel command line and allocates a couple of buffers to store the command line. We need a couple of buffers because one buffer is used for future reference and access to the command line and one for parameter parsing. We will allocate space for the following buffers:
- saved_command_line - will contain the boot command line;
- initcall_command_line - will contain the boot command line; it will be used in do_initcall_level;
- static_command_line - will contain the command line for parameter parsing.
We will allocate space with the memblock_virt_alloc function. This function calls memblock_virt_alloc_try_nid, which allocates a boot memory block with memblock_reserve if slab is not available, or uses kzalloc_node otherwise (more about it will be in the Linux memory management chapter). The memblock_virt_alloc call uses BOOTMEM_LOW_LIMIT (the physical address of the (PAGE_OFFSET + 0x1000000) value) and BOOTMEM_ALLOC_ACCESSIBLE (equal to the current value of memblock.current_limit) as the minimum and maximum addresses of the memory region.
Let's look at the implementation of setup_command_line:
static void __init setup_command_line(char *command_line)
{
saved_command_line =
memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
initcall_command_line =
memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0);
strcpy(saved_command_line, boot_command_line);
strcpy(static_command_line, command_line);
}
Here we can see that we allocate space for three buffers which will contain the kernel command line for different purposes (read above). Having allocated the space, we store boot_command_line in saved_command_line and command_line (the kernel command line from setup_arch) in static_command_line.
The next function after setup_command_line is setup_nr_cpu_ids. This function sets nr_cpu_ids (the number of CPUs) according to the last bit in cpu_possible_mask (you can read more about it in the chapter describing the cpumasks concept). Let's look at its implementation:
void __init setup_nr_cpu_ids(void)
{
nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}
Here nr_cpu_ids represents the number of CPUs and NR_CPUS represents the maximum number of CPUs which we can set at configuration time.
We actually need to call this function because NR_CPUS can be greater than the actual number of CPUs in your computer. Here we can see that we call the find_last_bit function and pass two parameters to it:
- cpu_possible_mask bits;
- maximum number of CPUs.
In setup_arch we can find the call of the prefill_possible_map function, which calculates and writes the actual number of CPUs to cpu_possible_mask. We call the find_last_bit function, which takes an address and a maximum size to search, and returns the bit number of the last set bit. We passed the cpu_possible_mask bits and the maximum number of CPUs. First of all, the find_last_bit function splits the given unsigned long address into words:
words = size / BITS_PER_LONG;
where BITS_PER_LONG is 64 on x86_64. Having got the number of words in the given search size, we need to check whether the given size contains a partial word, with the following check:
if (size & (BITS_PER_LONG-1)) {
tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
- (size & (BITS_PER_LONG-1)))));
if (tmp)
goto found;
}
If it contains a partial word, we mask the last word and check it. If the masked word is not zero, it means that the current word contains at least one set bit, and we go to the found label:
found:
return words * BITS_PER_LONG + __fls(tmp);
Here you can see the __fls function, which returns the last set bit in a given word with the help of the bsr instruction:
static inline unsigned long __fls(unsigned long word)
{
asm("bsr %1,%0"
: "=r" (word)
: "rm" (word));
return word;
}
The bsr instruction scans the given operand for its most significant set bit. If the last word is not partial, we go through all the words at the given address, from the last to the first, looking for a set bit:
while (words) {
tmp = addr[--words];
if (tmp) {
found:
return words * BITS_PER_LONG + __fls(tmp);
}
}
Here we put the current word into the tmp variable and check whether tmp contains at least one set bit. If a set bit is found, we return the number of this bit. If no word contains a set bit, we just return the given size:
return size;
After this, nr_cpu_ids will contain the correct number of available CPUs.
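The whole find_last_bit walk can be reproduced as a portable user-space sketch; here __fls is written as a loop instead of the bsr instruction, so this is illustrative rather than the kernel's implementation:

```c
#include <assert.h>
#include <stddef.h>

#define BITS_PER_LONG 64

/* Index of the most significant set bit; word must be nonzero
 * (the same precondition as the bsr instruction). */
static unsigned long __fls(unsigned long word)
{
    unsigned long bit = BITS_PER_LONG - 1;

    while (!(word & (1UL << bit)))
        bit--;
    return bit;
}

static unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
{
    unsigned long words = size / BITS_PER_LONG;
    unsigned long tmp;

    /* Mask and check a trailing partial word first. */
    if (size & (BITS_PER_LONG - 1)) {
        tmp = addr[words] &
              (~0UL >> (BITS_PER_LONG - (size & (BITS_PER_LONG - 1))));
        if (tmp)
            goto found;
    }

    /* Then walk the full words from the end towards the start. */
    while (words) {
        tmp = addr[--words];
        if (tmp)
            goto found;
    }
    return size;            /* no set bit found */
found:
    return words * BITS_PER_LONG + __fls(tmp);
}
```

With this helper, nr_cpu_ids would then be find_last_bit(bits, NR_CPUS) + 1, exactly as in setup_nr_cpu_ids.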
That's all.
Conclusion
This is the end of the seventh part about the Linux kernel initialization process. In this part we have finally finished with the setup_arch function and returned to the start_kernel function. In the next part we will continue to learn the generic kernel code from start_kernel and will continue our way to the first init process.
If you have any questions or suggestions, write me a comment or ping me on twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
- Desktop Management Interface
- x86_64
- initrd
- Kernel panic
- Documentation/kernel-parameters.txt
- ACPI
- Direct memory access
- NUMA
- Control register
- vsyscalls
- SMP
- jiffy
- Previous part
Kernel initialization. Part 8.
Scheduler initialization
This is the eighth part of the Linux kernel initialization process, and we stopped at the setup_nr_cpu_ids function in the previous part. The main point of the current part is scheduler initialization. But before we start to learn the initialization process of the scheduler, we need to do some other things. The next step in init/main.c is the setup_per_cpu_areas function. This function sets up areas for the percpu variables; you can read more about it in the special part about the Per-CPU variables. After the percpu areas are up and running, the next step is the smp_prepare_boot_cpu function. This function does some preparations for SMP:
static inline void smp_prepare_boot_cpu(void)
{
smp_ops.smp_prepare_boot_cpu();
}
where smp_prepare_boot_cpu expands to the call of the native_smp_prepare_boot_cpu function (more about smp_ops will be in the special parts about SMP):
void __init native_smp_prepare_boot_cpu(void)
{
int me = smp_processor_id();
switch_to_new_gdt(me);
cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;
}
The native_smp_prepare_boot_cpu function gets the id of the current CPU (which is the Bootstrap processor, whose id is zero) with the smp_processor_id function. I will not explain how smp_processor_id works, because we already saw it in the Kernel entry point part. Having got the processor id number, we reload the Global Descriptor Table for the given CPU with the switch_to_new_gdt function:
void switch_to_new_gdt(int cpu)
{
struct desc_ptr gdt_descr;
gdt_descr.address = (long)get_cpu_gdt_table(cpu);
gdt_descr.size = GDT_SIZE - 1;
load_gdt(&gdt_descr);
load_percpu_segment(cpu);
}
The gdt_descr variable represents a pointer to the GDT descriptor here (we already saw desc_ptr in the Early interrupt and exception handling part). We get the address and the size of the GDT descriptor, where GDT_SIZE is 256, or:
#define GDT_SIZE (GDT_ENTRIES * 8)
and the address of the descriptor we get with get_cpu_gdt_table:
static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
return per_cpu(gdt_page, cpu).gdt;
}
The get_cpu_gdt_table uses the per_cpu macro to get the gdt_page percpu variable for the given CPU number (the bootstrap processor with id 0 in our case). You may ask the following question: if we can access the gdt_page percpu variable, where was it defined? Actually, we already saw it in this book. If you have read the first part of this chapter, you may remember that we saw the definition of gdt_page in arch/x86/kernel/head_64.S:
early_gdt_descr:
.word GDT_ENTRIES*8-1
early_gdt_descr_base:
.quad INIT_PER_CPU_VAR(gdt_page)
and if we look at the linker file we can see that it is located after the __per_cpu_load symbol:
#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);
and gdt_page is filled in arch/x86/kernel/cpu/common.c:
DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
[GDT_ENTRY_KERNEL32_CS] = GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER32_CS] = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
...
...
...
You can read more about percpu variables in the Per-CPU variables part. Having got the address and the size of the GDT descriptor, we reload the GDT with load_gdt, which just executes the lgdt instruction, and load the percpu segment with the following function:
void load_percpu_segment(int cpu) {
loadsegment(gs, 0);
wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
load_stack_canary_segment();
}
The gs register must contain the base address of the percpu area (or the fs register for x86), so we use the loadsegment macro and pass gs. In the next step we write the base address of the IRQ stack and set up the stack canary (the latter only for x86_32). After we load the new GDT, we fill the cpu_callout_mask bitmap with the current cpu and set the cpu state as online by setting the cpu_state percpu variable for the current processor to CPU_ONLINE:
cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;
So, what is the cpu_callout_mask bitmap? While the bootstrap processor (the processor which boots first on x86) is initialized, the other processors in a multiprocessor system, known as secondary processors, wait. The Linux kernel uses the following two bitmasks:
- cpu_callout_mask
- cpu_callin_mask
After the bootstrap processor is initialized, it updates the cpu_callout_mask to indicate which secondary processor may be initialized next. A secondary processor can do some initialization work beforehand, but it must then check for its bit in the cpu_callout_mask set by the bootstrap processor. Only after the bootstrap processor has filled cpu_callout_mask with a given secondary processor will that processor continue the rest of its initialization. After a given processor finishes its initialization, it sets its bit in the cpu_callin_mask. Once the bootstrap processor finds that bit in the cpu_callin_mask, it repeats the same procedure for one of the remaining secondary processors. In short, it works as I described, but we will see more details in the chapter about SMP.
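The handshake described above can be modeled with a toy single-threaded sketch (real code runs on separate CPUs and spins on these masks; the function names here are invented):

```c
#include <assert.h>

/* Toy bitmasks mirroring the kernel's cpu_callout_mask / cpu_callin_mask. */
static unsigned long cpu_callout_mask;
static unsigned long cpu_callin_mask;

/* Bootstrap processor: allow the given secondary CPU to proceed. */
static void bsp_call_out(int cpu)
{
    cpu_callout_mask |= 1UL << cpu;
}

/* Secondary CPU: proceed only once called out, then report completion. */
static int secondary_try_boot(int cpu)
{
    if (!(cpu_callout_mask & (1UL << cpu)))
        return 0;                       /* not our turn yet */
    /* ... per-cpu initialization would happen here ... */
    cpu_callin_mask |= 1UL << cpu;      /* "call in" to the BSP */
    return 1;
}
```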
That's all. We did all SMP
boot preparation.
Build zonelists
In the next step we can see the call of the build_all_zonelists function. This function sets up the order of zones that allocations are preferred from. What zones are and what this order is we will understand soon. For a start, let's see how the Linux kernel views physical memory. Physical memory is split into banks which are called nodes. If you have no hardware support for NUMA, you will see only one node:
$ cat /sys/devices/system/node/node0/numastat
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0
Every node is represented by the struct pglist_data in the Linux kernel. Each node is divided into a number of special blocks which are called zones. Every zone is represented by the zone struct in the Linux kernel and has one of the following types:
- ZONE_DMA - 0-16M;
- ZONE_DMA32 - used for 32-bit devices that can only do DMA in areas below 4G;
- ZONE_NORMAL - all RAM from 4GB upwards on x86_64;
- ZONE_HIGHMEM - absent on x86_64;
- ZONE_MOVABLE - zone which contains movable pages.
These types are represented by the zone_type enum. We can get information about the zones with:
$ cat /proc/zoneinfo
Node 0, zone DMA
pages free 3975
min 3
low 3
...
...
Node 0, zone DMA32
pages free 694163
min 875
low 1093
...
...
Node 0, zone Normal
pages free 2529995
min 3146
low 3932
...
...
As I wrote above, all nodes are described with the pglist_data or pg_data_t structure in memory. This structure is defined in include/linux/mmzone.h. The build_all_zonelists function from mm/page_alloc.c constructs an ordered zonelist (of the different zones DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE) which specifies the zones/nodes to visit when the selected zone or node cannot satisfy the allocation request. That's all. More about NUMA and multiprocessor systems will be in the special part.
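The fallback idea behind the zonelist can be sketched as a simple ordered search (zone names follow the text; the allocator logic and page counts are purely illustrative):

```c
#include <assert.h>

/* A cut-down zone_type enum for the sketch. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, MAX_NR_ZONES };

static long free_pages[MAX_NR_ZONES];

/* Try the preferred zone first, then fall back towards ZONE_DMA,
 * mimicking the ordered zonelist built by build_all_zonelists. */
static int alloc_from(enum zone_type preferred, long pages)
{
    int z;

    for (z = preferred; z >= ZONE_DMA; z--) {
        if (free_pages[z] >= pages) {
            free_pages[z] -= pages;
            return z;                   /* zone that satisfied the request */
        }
    }
    return -1;                          /* no zone could satisfy it */
}
```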
The rest of the stuff before scheduler initialization
Before we start to dive into the Linux kernel scheduler initialization process, we must do a couple of things. The first thing is the page_alloc_init function from mm/page_alloc.c. This function looks pretty easy:
void __init page_alloc_init(void)
{
hotcpu_notifier(page_alloc_cpu_notify, 0);
}
and initializes the handler for CPU hotplug. Of course, hotcpu_notifier depends on the CONFIG_HOTPLUG_CPU configuration option, and if this option is set, it just calls the cpu_notifier macro, which expands to the call of register_cpu_notifier, which adds the hotplug cpu handler (page_alloc_cpu_notify in our case).
After this the kernel command line appears in the initialization output, handled by a couple of functions, parse_early_param and parse_args, which parse the Linux kernel command line. You may remember that we already saw the call of the parse_early_param function in the sixth part of the kernel initialization chapter, so why do we call it again? The answer is simple: we called this function in the architecture-specific code (x86_64 in our case), but not all architectures call it. And we need to call the second function, parse_args, to parse and handle non-early command line arguments.
In the next step we can see the call of jump_label_init from kernel/jump_label.c, which initializes jump labels.
After this we can see the call of the setup_log_buf function, which sets up the printk log buffer. We already saw this function in the seventh part of the Linux kernel initialization process chapter.
PID hash initialization
The next is the pidhash_init function. As you know, each process is assigned a unique number called the process identification number, or PID. Each process generated with fork or clone is automatically assigned a new unique PID value by the kernel. The management of PIDs is centered around two special data structures: struct pid and struct upid. The first structure represents information about a PID in the kernel. The second structure represents the information that is visible in a specific namespace. All PID instances are stored in a special hash table:
static struct hlist_head *pid_hash;
This hash table is used to find the pid instance that belongs to a numeric PID value. So, pidhash_init initializes this hash table. At the start of the pidhash_init function we can see the call of alloc_large_system_hash:
pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
HASH_EARLY | HASH_SMALL,
&pidhash_shift, NULL,
0, 4096);
The number of elements in pid_hash depends on the RAM configuration, but it can be between 2^4 and 2^12. pidhash_init computes the size and allocates the required storage (which is an hlist in our case - similar to a doubly linked list, but whose head, struct hlist_head, contains a single pointer). The alloc_large_system_hash function allocates a large system hash table with memblock_virt_alloc_nopanic if we pass the HASH_EARLY flag (as in our case), or with __vmalloc if we do not pass this flag.
The result we can see in the dmesg
output:
$ dmesg | grep hash
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
...
...
...
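The numbers in this dmesg line are easy to verify: 4096 hash heads of pointer size (8 bytes on x86_64) occupy 32768 bytes, which is 2^3 pages of 4096 bytes, hence order 3. A sketch of that arithmetic (hash_bytes and hash_order are invented helpers):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Total bytes needed for `entries` hash heads of `head_size` bytes each. */
static unsigned long hash_bytes(unsigned long entries, size_t head_size)
{
    return entries * head_size;
}

/* Smallest page order whose 2^order pages cover `bytes`. */
static unsigned long hash_order(unsigned long bytes)
{
    unsigned long order = 0;

    while ((PAGE_SIZE << order) < bytes)
        order++;
    return order;
}
```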
That's all. The rest of the stuff before the scheduler initialization is the following functions: vfs_caches_init_early does early initialization of the virtual file system (more about it will be in the chapter which describes the virtual file system), sort_main_extable sorts the kernel's built-in exception table entries which are between __start___ex_table and __stop___ex_table, and trap_init initializes trap handlers (we will learn more about the last two functions in the separate chapter about interrupts).
The last step before the scheduler initialization is initialization of the memory manager with the mm_init
function from the init/main.c. As we can see, the mm_init
function initializes different parts of the linux kernel memory manager:
page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
pgtable_init();
vmalloc_init();
The first is page_ext_init_flatmem, which depends on the CONFIG_SPARSEMEM kernel configuration option and initializes extended per-page data handling. mem_init releases all of bootmem, kmem_cache_init initializes the kernel caches, percpu_init_late replaces percpu chunks with those allocated by slub, pgtable_init initializes the page->ptl kernel cache, and vmalloc_init initializes vmalloc. Please NOTE that we will not dive into the details of all of these functions and concepts, but we will see all of them in the Linux kernel memory manager chapter.
That's all. Now we can look at the scheduler.
Scheduler initialization
And now we come to the main purpose of this part - initialization of the task scheduler. I want to say again, as I already have many times, that you will not see the full explanation of the scheduler here; there will be a special chapter about it. Ok, the next point is the sched_init function from kernel/sched/core.c, and as we can understand from the function's name, it initializes the scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the sched_init function we can see the following code:
#ifdef CONFIG_FAIR_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
First of all we can see two configuration options here:
CONFIG_FAIR_GROUP_SCHED
CONFIG_RT_GROUP_SCHED
Both of these options provide two different scheduling models. As we can read from the documentation, the current scheduler - CFS or Completely Fair Scheduler - uses a simple concept. It models process scheduling as if the system had an ideal multitasking processor where each process would receive 1/n of the processor time, where n is the number of runnable processes. The scheduler uses a special set of rules. These rules determine when and how to select a new process to run, and they are called the scheduling policy. The Completely Fair Scheduler supports the following normal or non-real-time scheduling policies: SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE. SCHED_NORMAL is used for most normal applications; the amount of cpu each process consumes is mostly determined by the nice value. SCHED_BATCH is used for 100% non-interactive tasks, and SCHED_IDLE runs tasks only when the processor has nothing to run besides them. The real-time policies are also supported for time-critical applications: SCHED_FIFO and SCHED_RR. If you've read anything about the Linux kernel scheduler, you will know that it is modular. That means it supports different algorithms to schedule different types of processes. Usually this modularity is called scheduler classes. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core knowing too much about them.
Now let's get back to our code and look at the two configuration options CONFIG_FAIR_GROUP_SCHED and CONFIG_RT_GROUP_SCHED. The smallest unit that the scheduler operates on is an individual task. These options allow scheduling of group tasks (you can read more about it in the CFS group scheduling documentation). We can see that we set the alloc_size variable, which represents the size needed for the sched_entity and cfs_rq pointer arrays based on the number of processors, to the 2 * nr_cpu_ids * sizeof(void **) expression, and then allocate it with kzalloc:
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
#ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.se = (struct sched_entity **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
#endif
The sched_entity is a structure which is defined in include/linux/sched.h and is used by the scheduler to keep track of process accounting. The cfs_rq represents a run queue. So, you can see that we allocated space of size alloc_size for the run queue and scheduler entity of the root_task_group. The root_task_group is an instance of the task_group structure from kernel/sched/sched.h, which contains task-group-related information:
struct task_group {
...
...
struct sched_entity **se;
struct cfs_rq **cfs_rq;
...
...
}
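The allocation pattern above - one zeroed buffer carved into the se and cfs_rq pointer arrays - can be sketched in user space (all names and the fixed NR_CPU_IDS are illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Fixed CPU count for the sketch; the kernel uses nr_cpu_ids. */
#define NR_CPU_IDS 4

struct task_group_sketch {
    void **se;      /* stands in for struct sched_entity ** */
    void **cfs_rq;  /* stands in for struct cfs_rq ** */
};

/* One calloc (kzalloc analogue) sized for both arrays, carved in place. */
static int task_group_alloc(struct task_group_sketch *tg)
{
    size_t alloc_size = 2 * NR_CPU_IDS * sizeof(void **);
    char *ptr = calloc(1, alloc_size);

    if (!ptr)
        return -1;
    tg->se = (void **)ptr;
    ptr += NR_CPU_IDS * sizeof(void **);
    tg->cfs_rq = (void **)ptr;
    return 0;
}
```

A single allocation avoids a second failure path and keeps both per-CPU pointer arrays adjacent in memory.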
The root task group is the task group to which every task in the system belongs. As we have allocated space for the root task group scheduler entities and run queues, we go over all possible CPUs (the cpu_possible_mask bitmap) and allocate zeroed memory from a particular memory node with the kzalloc_node function for the load_balance_mask percpu variable:
DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
Here cpumask_var_t is the same as cpumask_t with one difference: cpumask_var_t is allocated with only nr_cpu_ids bits, while cpumask_t always has NR_CPUS bits (you can read more about cpumask in the CPU masks part). As you can see:
#ifdef CONFIG_CPUMASK_OFFSTACK
for_each_possible_cpu(i) {
per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
}
#endif
this code depends on the `CONFIG_CPUMASK_OFFSTACK` configuration option, which tells the kernel to allocate cpumasks dynamically instead of putting them on the stack. All groups have to be able to rely on a certain amount of CPU time. With the two following calls:
init_rt_bandwidth(&def_rt_bandwidth,
global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
global_rt_period(), global_rt_runtime());
we initialize bandwidth management for real-time and `SCHED_DEADLINE` tasks. These functions initialize the `rt_bandwidth` and `dl_bandwidth` structures, which store information about the maximum real-time and `deadline` bandwidth of the system. For example, let's look at the implementation of the `init_rt_bandwidth` function:
void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
rt_b->rt_period = ns_to_ktime(period);
rt_b->rt_runtime = runtime;
raw_spin_lock_init(&rt_b->rt_runtime_lock);
hrtimer_init(&rt_b->rt_period_timer,
CLOCK_MONOTONIC, HRTIMER_MODE_REL);
rt_b->rt_period_timer.function = sched_rt_period_timer;
}
It takes three parameters:
- the address of the `rt_bandwidth` structure, which contains information about the allocated and consumed quota within a period;
- `period` - the period over which real-time task bandwidth enforcement is measured, in `us`;
- `runtime` - the part of the period during which tasks are allowed to run, in `us`.
As `period` and `runtime` we pass the results of the `global_rt_period` and `global_rt_runtime` functions, which are `1s` and `0.95s` by default. The `rt_bandwidth` structure is defined in kernel/sched/sched.h and looks like this:
struct rt_bandwidth {
raw_spinlock_t rt_runtime_lock;
ktime_t rt_period;
u64 rt_runtime;
struct hrtimer rt_period_timer;
};
As you can see, it contains `runtime` and `period` and also the two following fields:
- `rt_runtime_lock` - spinlock protecting `rt_runtime`;
- `rt_period_timer` - high-resolution kernel timer for unthrottling real-time tasks.
So, in `init_rt_bandwidth` we initialize the `rt_bandwidth` period and runtime with the given parameters and initialize the spinlock and the high-resolution timer. In the next step, depending on whether SMP is enabled, we initialize the root domain:
#ifdef CONFIG_SMP
init_defrootdomain();
#endif
The real-time scheduler requires global resources to make scheduling decisions, but unfortunately scalability bottlenecks appear as the number of CPUs increases. The concept of root domains was introduced to improve scalability. The Linux kernel provides a special mechanism, called a `cpuset`, for assigning a set of CPUs and memory nodes to a set of tasks. If a `cpuset` does not overlap with the CPUs of any other `cpuset`, it is an exclusive `cpuset`. Each exclusive cpuset defines an isolated domain, or `root domain`, of CPUs partitioned from other cpusets or CPUs. A `root domain` is represented by `struct root_domain` from kernel/sched/sched.h; its main purpose is to narrow the scope of global variables to per-domain variables so that all real-time scheduling decisions are made only within the scope of a root domain. That's all about it for now, but we will see more details in the chapter about the real-time scheduler.
After the `root domain` initialization, we initialize the bandwidth for the real-time tasks of the root task group, just as we did above:
#ifdef CONFIG_RT_GROUP_SCHED
init_rt_bandwidth(&root_task_group.rt_bandwidth,
global_rt_period(), global_rt_runtime());
#endif
In the next step, depending on the `CONFIG_CGROUP_SCHED` kernel configuration option, we initialize the `siblings` and `children` lists of the root task group. As we can read in the documentation, `CONFIG_CGROUP_SCHED`:
This option allows you to create arbitrary task groups using the "cgroup" pseudo
filesystem and control the cpu bandwidth allocated to each such task group.
Having finished with the list initialization, we can see the call of the `autogroup_init` function:
#ifdef CONFIG_CGROUP_SCHED
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
autogroup_init(&init_task);
#endif
which initializes automatic process group scheduling.
After this we go through all `possible` CPUs (remember that the `possible` CPUs — every CPU that could ever be available in the system — are stored in the `cpu_possible_mask` bitmap) and initialize a `runqueue` for each of them:
for_each_possible_cpu(i) {
struct rq *rq;
...
...
...
Each processor has its own lock and its own individual runqueue. All runnable tasks are stored in an active array and indexed according to their priority. When a process consumes its time slice, it is moved to an expired array. All of these arrays are stored in a special structure named `runqueue`. As there is no global lock or global runqueue, we go through all possible CPUs and initialize a runqueue for every one of them. The `runqueue` is represented in the Linux kernel by the `rq` structure, which is defined in kernel/sched/sched.h.
rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;
Here we get the runqueue for every CPU with the `cpu_rq` macro, which returns the `runqueues` percpu variable, and start to initialize it with the runqueue lock, the number of running tasks, the `calc_load`-related fields (`calc_load_active` and `calc_load_update`) used in the calculation of CPU load, and the completely fair, real-time and deadline related fields of the runqueue. After this we initialize the `cpu_load` array with zeros and set the last load update tick to the `jiffies` variable, which holds the number of timer ticks since system boot:
for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
rq->cpu_load[j] = 0;
rq->last_load_update_tick = jiffies;
where `cpu_load` keeps a history of runqueue loads from the past; for now `CPU_LOAD_IDX_MAX` is 5. In the next step we fill the `runqueue` fields which are related to SMP, but we will not cover them in this part. And at the end of the loop we initialize the high-resolution timer for the given `runqueue` and set the `iowait` counter (more about this in the separate part about the scheduler):
init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);
Now we come out of the `for_each_possible_cpu` loop, and next we need to set the load weight for the `init` task with the `set_load_weight` function. The weight of a process is derived from its priority, which depends on its static priority and its scheduling class. After this we increase the reference counter of the memory descriptor of the `init` process and set the scheduler class for the current process:
atomic_inc(&init_mm.mm_count);
current->sched_class = &fair_sched_class;
Then we make the current process (which will become the first `init` process) `idle` and update the value of `calc_load_update` with a 5-second interval:
init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ;
So, the `init` process will be run when there are no other candidates (as it is the first process in the system). At the end we just set the `scheduler_running` variable:
scheduler_running = 1;
That's all; the Linux kernel scheduler is initialized. Of course, we have skipped many details and explanations here, because we first need to understand how different concepts (like processes and process groups, runqueues, RCU, etc.) work in the Linux kernel, but we have taken a short look at the scheduler initialization process. We will look at all the other details in a separate part fully dedicated to the scheduler.
Conclusion
This is the end of the eighth part about the Linux kernel initialization process. In this part, we looked at the initialization of the scheduler. In the next part we will continue to dive into the Linux kernel initialization process and will see the initialization of RCU and many other things.
If you have any questions or suggestions write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
- CPU masks
- high-resolution kernel timer
- spinlock
- Run queue
- Linux kernel memory manager
- slub
- virtual file system
- Linux kernel hotplug documentation
- IRQ
- Global Descriptor Table
- Per-CPU variables
- SMP
- RCU
- CFS Scheduler documentation
- Real-Time group scheduling
- Previous part
Kernel initialization. Part 9.
RCU initialization
This is the ninth part of the Linux kernel initialization process, and in the previous part we stopped at the scheduler initialization. In this part we will continue to dive into the Linux kernel initialization process, and the main purpose of this part is to learn about the initialization of RCU. The next step in init/main.c after `sched_init` is the call of `preempt_disable`. There are two macros:
- `preempt_disable`
- `preempt_enable`
for disabling and enabling preemption. First of all, let's try to understand what `preempt` means in the context of an operating system kernel. In simple words, preemption is the ability of the operating system kernel to suspend the current task in order to run a task with a higher priority. Here we need to disable preemption because we will have only one `init` process during early boot, and we don't want it stopped before we call the `cpu_idle` function. The `preempt_disable` macro is defined in include/linux/preempt.h and depends on the `CONFIG_PREEMPT_COUNT` kernel configuration option. This macro is implemented as:
#define preempt_disable() \
do { \
preempt_count_inc(); \
barrier(); \
} while (0)
and if CONFIG_PREEMPT_COUNT
is not set just:
#define preempt_disable() barrier()
Let's look at these. First of all we can see one difference between the two macro implementations: `preempt_disable` with `CONFIG_PREEMPT_COUNT` set contains a call of `preempt_count_inc`. There is a special percpu variable which stores the number of held locks and `preempt_disable` calls:
DECLARE_PER_CPU(int, __preempt_count);
In the first implementation of `preempt_disable` we increment this `__preempt_count`. There is an API for returning the value of `__preempt_count`: the `preempt_count` function. When we call `preempt_disable`, first of all we increment the preemption counter with the `preempt_count_inc` macro, which expands to:
#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val) __preempt_count_add(val)
where `preempt_count_add` calls the `raw_cpu_add_4` macro, which adds `1` to the given percpu variable (`__preempt_count` in our case; more about percpu variables you can read in the part about Per-CPU variables). Ok, we increased `__preempt_count`, and the next step is the call of the `barrier` macro, which appears in both implementations. The `barrier` macro inserts an optimization barrier. On processors with the `x86_64` architecture, independent memory access operations can be performed in any order. That's why we need a way to tell both the compiler and the processor about ordering constraints. This mechanism is the memory barrier. Let's consider a simple example:
preempt_disable();
foo();
preempt_enable();
Compiler can rearrange it as:
preempt_disable();
preempt_enable();
foo();
In this case the non-preemptible function `foo` could be preempted. Because we put the `barrier` macro in the `preempt_disable` and `preempt_enable` macros, the compiler is prevented from swapping `preempt_count_inc` with other statements. More about barriers you can read here and here.
In the next step we can see the following statement:
if (WARN(!irqs_disabled(),
"Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
which checks the IRQ state and disables interrupts (with the `cli` instruction on `x86_64`) if they are enabled.
That's all. Preemption is disabled and we can go ahead.
Initialization of the integer ID management
In the next step we can see the call of the `idr_init_cache` function, which is defined in lib/idr.c. The `idr` library is used in various places in the Linux kernel to manage the assignment of integer `IDs` to objects and the lookup of objects by ID.
Let's look at the implementation of the `idr_init_cache` function:
void __init idr_init_cache(void)
{
idr_layer_cache = kmem_cache_create("idr_layer_cache",
sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}
Here we can see the call of `kmem_cache_create`. We already called `kmem_cache_init` in init/main.c, which set up the slab allocator itself (more about caches we will see in the Linux kernel memory management chapter). Here `kmem_cache_create` creates a dedicated slab cache from which `idr_layer` objects will be allocated. As you can see, we pass five parameters to `kmem_cache_create`:
- name of the cache;
- size of the object to store in cache;
- offset of the first object in the page;
- flags;
- constructor for the objects.
and it will create a `kmem_cache` for the integer IDs. Integer `IDs` are a commonly used pattern for mapping a set of integer identifiers to a set of pointers. We can see the usage of integer IDs in the i2c driver subsystem. For example, drivers/i2c/i2c-core.c, which represents the core of the `i2c` subsystem, defines the `ID` space for `i2c` adapters with the `DEFINE_IDR` macro:
static DEFINE_IDR(i2c_adapter_idr);
and then uses it to allocate the number of an `i2c` adapter:
static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
int id;
...
...
...
id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
...
...
...
}
where `i2c_adapter_idr` holds the dynamically allocated bus numbers.
More about integer ID management you can read here.
RCU initialization
The next step is RCU initialization with the `rcu_init` function, whose implementation depends on two kernel configuration options:
- `CONFIG_TINY_RCU`
- `CONFIG_TREE_RCU`
In the first case `rcu_init` will be in kernel/rcu/tiny.c and in the second case it will be defined in kernel/rcu/tree.c. We will see the implementation of the tree `rcu`, but first a few words about `RCU` in general.
`RCU`, or read-copy update, is a scalable, high-performance synchronization mechanism implemented in the Linux kernel. Early on, the Linux kernel provided support and an environment for concurrently running applications, but all execution was serialized in the kernel using a single global lock. Nowadays the Linux kernel has no single global lock, but provides different mechanisms, including lock-free data structures, percpu data structures and others. One of these mechanisms is `read-copy update`. The `RCU` technique is designed for rarely-modified data structures. The idea of `RCU` is simple: if somebody wants to change such a data structure, we make a copy of it and make all changes in the copy. In the meantime, all other users of the data structure keep using the old version. Then we need to choose a safe moment, when the original version has no remaining users, to replace it with the modified copy.
Of course this description of `RCU` is very simplified. To understand some details of `RCU`, first of all we need to learn some terminology. Data readers in `RCU` execute inside a critical section. Every time a data reader enters the critical section it calls `rcu_read_lock`, and it calls `rcu_read_unlock` on exit from the critical section. When a thread is not in a critical section, it is in a state called the `quiescent state`. A period during which every thread has passed through a `quiescent state` is called a `grace period`. If a thread wants to remove an element from the data structure, this occurs in two steps. The first step is `removal`: the element is atomically removed from the data structure, but its physical memory is not released. After this, the writer announces a grace period and waits until it finishes. From this moment, new readers can no longer see the removed element, but readers that entered their critical sections earlier may still reference it. After the `grace period` has finished, the second step of the element removal starts: the element's physical memory is released.
There are a couple of implementations of `RCU`. The old `RCU` is called classic, and the newer implementation is called `tree` RCU. As you may already understand, the `CONFIG_TREE_RCU` kernel configuration option enables tree `RCU`. Another one is the `tiny` RCU, which depends on `CONFIG_TINY_RCU` and `CONFIG_SMP=n`. We will see more details about `RCU` in general in the separate chapter about synchronization primitives, but for now let's look at the `rcu_init` implementation from kernel/rcu/tree.c:
implementation from the kernel/rcu/tree.c:
void __init rcu_init(void)
{
int cpu;
rcu_bootup_announce();
rcu_init_geometry();
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
__rcu_init_preempt();
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
/*
* We don't need protection against CPU-hotplug here because
* this is called early in boot, before either interrupts
* or the scheduler are operational.
*/
cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
rcu_early_boot_tests();
}
At the beginning of the `rcu_init` function we define the `cpu` variable and call `rcu_bootup_announce`. The `rcu_bootup_announce` function is pretty simple:
static void __init rcu_bootup_announce(void)
{
pr_info("Hierarchical RCU implementation.\n");
rcu_bootup_announce_oddness();
}
It just prints information about `RCU` with the `pr_info` function and calls `rcu_bootup_announce_oddness`, which uses `pr_info` too, to print information about the current `RCU` configuration, depending on kernel configuration options like `CONFIG_RCU_TRACE`, `CONFIG_PROVE_RCU`, `CONFIG_RCU_FANOUT_EXACT`, etc. In the next step, we can see the call of the `rcu_init_geometry` function. This function is defined in the same source code file and computes the node tree geometry depending on the number of CPUs. `RCU` provides scalability with extremely low internal lock contention. What if a data structure is read from different CPUs? The `RCU` API provides the `rcu_state` structure, which represents the global RCU state, including the node hierarchy. The hierarchy is represented by:
struct rcu_node node[NUM_RCU_NODES];
array of structures. As we can read in the comment above this definition:
The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]). The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.
Small systems will have a "hierarchy" consisting of a single rcu_node.
The `rcu_node` structure is defined in kernel/rcu/tree.h and contains information about the current grace period, whether the grace period is completed or not, which CPUs or groups need to switch in order for the current grace period to proceed, etc. Every `rcu_node` contains a lock for a couple of CPUs. These `rcu_node` structures are embedded into a linear array in the `rcu_state` structure and represent a tree whose root is the first element and which covers all CPUs. As you can see, the number of rcu nodes is determined by `NUM_RCU_NODES`, which depends on the number of available CPUs:
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
where the level values depend on the `CONFIG_RCU_FANOUT_LEAF` configuration option. For example, in the simplest case, each `rcu_node` will cover two CPUs on a machine with eight CPUs:
+-----------------------------------------------------------------+
| rcu_state |
| +----------------------+ |
| | root | |
| | rcu_node | |
| +----------------------+ |
| | | |
| +----v-----+ +--v-------+ |
| | | | | |
| | rcu_node | | rcu_node | |
| | | | | |
| +------------------+ +----------------+ |
| | | | | |
| | | | | |
| +----v-----+ +-------v--+ +-v--------+ +-v--------+ |
| | | | | | | | | |
| | rcu_node | | rcu_node | | rcu_node | | rcu_node | |
| | | | | | | | | |
| +----------+ +----------+ +----------+ +----------+ |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
+---------|-----------------|-------------|---------------|-------+
| | | |
+---------v-----------------v-------------v---------------v--------+
| | | | |
| CPU1 | CPU3 | CPU5 | CPU7 |
| | | | |
| CPU2 | CPU4 | CPU6 | CPU8 |
| | | | |
+------------------------------------------------------------------+
So, in the `rcu_init_geometry` function we just need to calculate the total number of `rcu_node` structures. We start by calculating the number of `jiffies` until the first and the next `fqs`, which is the `force-quiescent-state` (read above about it):
d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
jiffies_till_next_fqs = d;
where:
#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV 256
Having calculated these jiffies, we check whether the previously defined `jiffies_till_first_fqs` and `jiffies_till_next_fqs` variables are still equal to `ULONG_MAX` (their default values) and, if so, set them to the calculated value. As we have not touched these variables before, they are equal to `ULONG_MAX`:
In the next step of `rcu_init_geometry`, we check whether `rcu_fanout_leaf` is unchanged (i.e., it still has its compile-time value `CONFIG_RCU_FANOUT_LEAF`) and whether `nr_cpu_ids` is equal to `NR_CPUS`; if so, we just return:
if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
nr_cpu_ids == NR_CPUS)
return;
After this we need to compute the number of nodes that an rcu_node
tree can handle with the given number of levels:
rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
And in the last step we calculate the number of `rcu_node` structures at each level of the tree in a loop.
Having calculated the geometry of the `rcu_node` tree, we go back to the `rcu_init` function; the next step is to initialize two `rcu_state` structures with the `rcu_init_one` function:
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
The `rcu_init_one` function takes two arguments:
- the global `RCU` state;
- the per-CPU data for `RCU`.
Both variables are defined in kernel/rcu/tree.h along with their `percpu` data:
extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
About these states you can read here. As I wrote above, we need to initialize the `rcu_state` structures, and the `rcu_init_one` function helps us with that. After the `rcu_state` initialization, we can see the call of `__rcu_init_preempt`, which depends on the `CONFIG_PREEMPT_RCU` kernel configuration option. It does the same as the previous calls: it initializes the `rcu_preempt_state` structure, which has the `rcu_state` type, with the `rcu_init_one` function. After this, in `rcu_init`, we can see the call of:
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
function. This function registers a handler for a `pending interrupt`. A pending interrupt, or `softirq`, allows part of the work to be deferred for later execution, when the system is less loaded. Pending interrupts are represented by the following structure:
struct softirq_action
{
void (*action)(struct softirq_action *);
};
which is defined in include/linux/interrupt.h and contains only one field: the handler of an interrupt. You can check the `softirqs` on your system with:
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
HI: 2 0 0 1 0 2 0 0
TIMER: 137779 108110 139573 107647 107408 114972 99653 98665
NET_TX: 1127 0 4 0 1 1 0 0
NET_RX: 334 221 132939 3076 451 361 292 303
BLOCK: 5253 5596 8 779 2016 37442 28 2855
BLOCK_IOPOLL: 0 0 0 0 0 0 0 0
TASKLET: 66 0 2916 113 0 24 26708 0
SCHED: 102350 75950 91705 75356 75323 82627 69279 69914
HRTIMER: 510 302 368 260 219 255 248 246
RCU: 81290 68062 82979 69015 68390 69385 63304 63473
The open_softirq
function takes two parameters:
- index of the interrupt;
- interrupt handler.
and adds interrupt handler to the array of the pending interrupts:
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
softirq_vec[nr].action = action;
}
In our case the interrupt handler is `rcu_process_callbacks`, which is defined in kernel/rcu/tree.c and does the `RCU` core processing for the current CPU. After we have registered the `softirq` for `RCU`, we can see the following code:
cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
Here we can see the registration of `cpu` notifiers, which are needed on systems that support CPU hotplug; we will not dive into the details of this topic. The last function in `rcu_init` is `rcu_early_boot_tests`:
void rcu_early_boot_tests(void)
{
pr_info("Running RCU self tests\n");
if (rcu_self_test)
early_boot_test_call_rcu();
if (rcu_self_test_bh)
early_boot_test_call_rcu_bh();
if (rcu_self_test_sched)
early_boot_test_call_rcu_sched();
}
which runs self tests for `RCU`.
That's all. We have seen the initialization process of the `RCU` subsystem. As I wrote above, more about `RCU` will be in the separate chapter about synchronization primitives.
Rest of the initialization process
Ok, we have already passed the main theme of this part, which is `RCU` initialization, but it is not the end of the Linux kernel initialization process. In the last paragraph of this part we will see a couple of functions which run at initialization time, but we will not dive into deep details about them, for different reasons:
- they are not very important for the generic kernel initialization process and depend on the kernel configuration;
- they are of a debugging nature and not important for now;
- we will see many of them in separate parts/chapters.
After we initialized `RCU`, the next step, which you can see in init/main.c, is the `trace_init` function. As you can understand from its name, this function initializes the tracing subsystem. You can read more about the Linux kernel trace system here.
After `trace_init`, we can see the call of `radix_tree_init`. If you are familiar with different data structures, you can understand from the name of this function that it initializes the kernel implementation of the Radix tree. This function is defined in lib/radix-tree.c and you can read more about it in the part about Radix trees.
In the next step we can see the functions related to the `interrupt handling` subsystem:
- `early_irq_init`
- `init_IRQ`
- `softirq_init`
We will see explanations of these functions and their implementations in the special part about interrupt and exception handling. After this come many different functions (like `init_timers`, `hrtimers_init`, `time_init`, etc.) related to timing and timers. We will see more about these functions in the chapter about timers.
The next couple of functions are related to perf events - `perf_event_init` (there will be a separate chapter about perf) - and the initialization of `profiling` with `profile_init`. After this we enable `irq` with the call of:
local_irq_enable();
which expands to the `sti` instruction, and perform the late initialization of the SLAB with the call of the `kmem_cache_init_late` function (as I wrote above, we will learn about `SLAB` in the Linux memory management chapter).
After the late initialization of `SLAB`, the next point is the initialization of the console with the `console_init` function from drivers/tty/tty_io.c.
After the console initialization, we can see the `lockdep_info` function, which prints information about the lock dependency validator. After this, we can see the initialization of dynamic allocation of `debug objects` with `debug_objects_mem_init`, kernel memory leak detector initialization with `kmemleak_init`, `percpu` pageset setup with `setup_per_cpu_pageset`, setup of the NUMA policy with `numa_policy_init`, setting up time for the scheduler with `sched_clock_init`, `pidmap` initialization with the call of the `pidmap_init` function for the initial `PID` namespace, cache creation with `anon_vma_init` for private virtual memory areas, and early initialization of ACPI with `acpi_early_init`.
This is the end of the ninth part of the Linux kernel initialization process; here we saw the initialization of RCU. In the last paragraph of this part (`Rest of the initialization process`) we went through many functions without diving into the details of their implementations. Do not worry if you do not know or do not understand something about them yet. As I have already written many times, we will see the details of the implementations in other parts or chapters.
Conclusion
This is the end of the ninth part about the Linux kernel initialization process. In this part, we looked at the initialization of the `RCU` subsystem. In the next part we will continue to dive into the Linux kernel initialization process; I hope that we will finish with the `start_kernel` function, move on to the `rest_init` function from the same init/main.c source code file, and see the start of the first process.
If you have any questions or suggestions write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
- lock-free data structures
- kmemleak
- ACPI
- IRQs
- RCU
- RCU documentation
- integer ID management
- Documentation/memory-barriers.txt
- Runtime locking correctness validator
- Per-CPU variables
- Linux kernel memory management
- slab
- i2c
- Previous part
Kernel initialization. Part 10.
End of the linux kernel initialization process
This is the tenth part of the chapter about the Linux kernel initialization process. In the previous part we saw the initialization of RCU and stopped at the call of the `acpi_early_init` function. This part will be the last part of the Kernel initialization process chapter, so let's finish it.
After the call of the acpi_early_init
function from the init/main.c, we can see the following code:
#ifdef CONFIG_X86_ESPFIX64
init_espfix_bsp();
#endif
Here we can see the call of the `init_espfix_bsp` function, which depends on the `CONFIG_X86_ESPFIX64` kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in arch/x86/kernel/espfix_64.c and prevents leaking bits `31:16` of the `esp` register when returning to a 16-bit stack. First of all, in `init_espfix_bsp` we install the `espfix` page upper directory into the kernel page directory:
pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);
Where ESPFIX_BASE_ADDR
is:
#define PGDIR_SHIFT 39
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)
Also we can find it in the Documentation/x86/x86_64/mm:
... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...
After we have filled the page global directory with the `espfix` pud, the next step is the call of the `init_espfix_random` and `init_espfix_ap` functions. The first function returns a random location for the `espfix` page and the second enables `espfix` for the current CPU. After `init_espfix_bsp` has finished its work, we can see the call of the `thread_info_cache_init` function, which is defined in kernel/fork.c and allocates a cache for `thread_info` if `THREAD_SIZE` is less than `PAGE_SIZE`:
# if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
THREAD_SIZE, 0, NULL);
BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif
As we already know, `PAGE_SIZE` is `(_AC(1,UL) << PAGE_SHIFT)` or `4096` bytes, and `THREAD_SIZE` is `(PAGE_SIZE << THREAD_SIZE_ORDER)` or `16384` bytes for `x86_64`. The next function after `thread_info_cache_init` is `cred_init` from kernel/cred.c. This function just allocates a cache for the credentials (like `uid`, `gid`, etc.):
void __init cred_init(void)
{
cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
}
More about credentials you can read in Documentation/security/credentials.txt. The next step is the `fork_init` function from kernel/fork.c. The `fork_init` function allocates a cache for `task_struct`. Let's look at the implementation of `fork_init`. First of all we can see the definition of the `ARCH_MIN_TASKALIGN` macro and the creation of a slab where task_structs will be allocated:
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
#endif
task_struct_cachep =
kmem_cache_create("task_struct", sizeof(struct task_struct),
ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif
As we can see, this code depends on the `CONFIG_ARCH_TASK_STRUCT_ALLOCATOR` kernel configuration option. This configuration option indicates the presence of the `alloc_task_struct` function for the given architecture. As `x86_64` has no `alloc_task_struct` function, this code will not even be compiled on `x86_64`.
Allocating cache for init task
After this we can see the call of the arch_task_cache_init
function in the fork_init
:
void arch_task_cache_init(void)
{
task_xstate_cachep =
kmem_cache_create("task_xstate", xstate_size,
__alignof__(union thread_xstate),
SLAB_PANIC | SLAB_NOTRACK, NULL);
setup_xstate_comp();
}
The `arch_task_cache_init` does the initialization of the architecture-specific caches. In our case it is `x86_64`, so as we can see, `arch_task_cache_init` allocates a cache for `task_xstate`, which represents the FPU state, and sets up the offsets and sizes of all extended states in the xsave area with the call of the `setup_xstate_comp` function. After `arch_task_cache_init` we calculate the default maximum number of threads with:
set_max_threads(MAX_THREADS);
where the default maximum number of threads is:
#define FUTEX_TID_MASK 0x3fffffff
#define MAX_THREADS FUTEX_TID_MASK
At the end of the `fork_init` function we initialize the resource limits stored in the init task's signal structure:
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];
As we know, `init_task` is an instance of the `task_struct` structure, so it contains the `signal` field, which has the type `struct signal_struct` and represents the signal descriptor. In the first two lines we can see the setting of the current and maximum values of the resource limits. Every process has an associated set of resource limits, which specify the amount of resources the process can use. Here `rlim` is the resource control limit, represented by the:
struct rlimit {
__kernel_ulong_t rlim_cur;
__kernel_ulong_t rlim_max;
};
structure from include/uapi/linux/resource.h. In our case the resources are `RLIMIT_NPROC`, the maximum number of processes that a user can own, and `RLIMIT_SIGPENDING`, the maximum number of pending signals. We can see them in:
cat /proc/self/limits
Limit Soft Limit Hard Limit Units
...
...
...
Max processes 63815 63815 processes
Max pending signals 63815 63815 signals
...
...
...
Initialization of the caches
The next function after `fork_init` is `proc_caches_init` from kernel/fork.c. This function allocates caches for the memory descriptors (the `mm_struct` structure). At the beginning of `proc_caches_init` we can see the allocation of different SLAB caches with calls of `kmem_cache_create`:
sighand_cachep
- manage information about installed signal handlers;signal_cachep
- manage information about process signal descriptor;files_cachep
- manage information about opened files;fs_cachep
- manage filesystem information.
After this we allocate a `SLAB` cache for the `mm_struct` structures:
mm_cachep = kmem_cache_create("mm_struct",
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
After this we allocate a `SLAB` cache for the important `vm_area_struct` structure, which is used by the kernel to manage virtual memory space:
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
Note, that we use KMEM_CACHE
macro here instead of the kmem_cache_create
. This macro is defined in the include/linux/slab.h and just expands to the kmem_cache_create
call:
#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)
The `KMEM_CACHE` macro has one difference from `kmem_cache_create`. Take a look at the `__alignof__` operator: the `KMEM_CACHE` macro aligns the `SLAB` to the natural alignment of the given structure, while `kmem_cache_create` uses the value it is given. After this we can see the calls of the `mmap_init` and `nsproxy_cache_init` functions. The first function initializes the virtual memory area `SLAB` and the second initializes the `SLAB` for namespaces.
The next function after `proc_caches_init` is `buffer_init`. This function is defined in the fs/buffer.c source code file and allocates a cache for `buffer_head`. The `buffer_head` is a special structure which is defined in include/linux/buffer_head.h and used for managing buffers. At the start of the `buffer_init` function we allocate a cache for the `struct buffer_head` structures with a call of the `kmem_cache_create` function, as we did in the previous functions, and calculate the maximum number of buffer heads that may be held in memory:
nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
which will be equal to `10%` of `ZONE_NORMAL` (all RAM above 4GB on `x86_64`). The next function after `buffer_init` is `vfs_caches_init`. This function allocates `SLAB` caches and hashtables for different VFS caches. We already saw the `vfs_caches_init_early` function in the eighth part of the linux kernel initialization process, which initialized the caches for the `dcache` (or directory-cache) and the inode cache. The `vfs_caches_init` function makes post-early initialization of the `dcache` and `inode` caches, the private data cache, the hash tables for the mount points, etc. More details about the VFS will be described in a separate part. After this we can see the `signals_init` function. This function is defined in kernel/signal.c and allocates a cache for the `sigqueue` structures, which represent queues of real time signals. The next function is `page_writeback_init`. This function initializes the ratio for dirty pages. Every low-level page entry contains the `dirty` bit, which indicates whether a page has been written to after being loaded into memory.
Creation of the root for the procfs
After all of these preparations we need to create the root for the proc filesystem. We will do this with the call of the `proc_root_init` function from fs/proc/root.c. At the start of the `proc_root_init` function we allocate the cache for the inodes and register a new filesystem in the system with:
err = register_filesystem(&proc_fs_type);
if (err)
return;
As I wrote above, we will not dive into the details of the VFS and the different filesystems in this chapter; we will see them in the chapter about the `VFS`. After we have registered the new filesystem in our system, we call the `proc_self_init` function from fs/proc/self.c. This function allocates an inode number for `self` (the `/proc/self` directory refers to the process accessing the `/proc` filesystem). The next step after `proc_self_init` is `proc_setup_thread_self`, which sets up the `/proc/thread-self` directory containing information about the current thread. After this we create the `/proc/self/mounts` symlink, which will contain the mount points, with the call of
proc_symlink("mounts", NULL, "self/mounts");
and a couple of directories, depending on different configuration options:
#ifdef CONFIG_SYSVIPC
proc_mkdir("sysvipc", NULL);
#endif
proc_mkdir("fs", NULL);
proc_mkdir("driver", NULL);
proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
proc_mkdir("openprom", NULL);
#endif
proc_mkdir("bus", NULL);
...
...
...
if (!proc_mkdir("tty", NULL))
return;
proc_mkdir("tty/ldisc", NULL);
...
...
...
At the end of `proc_root_init` we call the `proc_sys_init` function, which creates the `/proc/sys` directory and initializes Sysctl.
This is the end of the `start_kernel` function. I did not describe all the functions which are called in `start_kernel`. I skipped them because they are not important for the generic kernel initialization and depend only on different kernel configurations. They are: `taskstats_init_early`, which exports per-task statistics to user-space; `delayacct_init`, which initializes per-task delay accounting; `key_init` and `security_init`, which initialize different security stuff; `check_bugs`, which fixes some architecture-dependent bugs; the `ftrace_init` function, which executes the initialization of ftrace; `cgroup_init`, which makes the initialization of the rest of the cgroup subsystem; etc. Many of these parts and subsystems will be described in other chapters.
That's all. Finally we have passed through the long `start_kernel` function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. At the end of `start_kernel` we can see the call of the last function, `rest_init`. Let's go ahead.
First steps after the start_kernel
The `rest_init` function is defined in the same source code file as the `start_kernel` function, init/main.c. In the beginning of `rest_init` we can see calls of the two following functions:
rcu_scheduler_starting();
smpboot_thread_init();
The first, `rcu_scheduler_starting`, makes the RCU scheduler active, and the second, `smpboot_thread_init`, registers the `smpboot_thread_notifier` CPU notifier (more about it you can read in the CPU hotplug documentation). After this we can see the following calls:
kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
Here the `kernel_thread` function (defined in kernel/fork.c) creates a new kernel thread. As we can see, the `kernel_thread` function takes three arguments:
- Function which will be executed in a new thread;
- Parameter for the `kernel_init` function;
- Flags.
We will not dive into the details of the `kernel_thread` implementation (we will see it in the chapter which describes the scheduler; for now we just need to say that `kernel_thread` invokes clone). Now we only need to know that we create a new kernel thread with the `kernel_thread` function, that the parent and child of the thread will share information about the filesystem, and that it will start to execute the `kernel_init` function. A kernel thread differs from a user thread in that it runs in kernel mode. So with these two `kernel_thread` calls we create two new kernel threads: one with `PID = 1` for the `init` process and one with `PID = 2` for `kthreadd`. We already know what the `init` process is. Let's look at `kthreadd`. It is a special kernel thread which manages and helps different parts of the kernel to create other kernel threads. We can see it in the output of the `ps` util:
$ ps -ef | grep kthread
root 2 0 0 Jan11 ? 00:00:00 [kthreadd]
Let's postpone `kernel_init` and `kthreadd` for now and go ahead in `rest_init`. In the next step, after we have created the two new kernel threads, we can see the following code:
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
The `rcu_read_lock` function marks the beginning of an RCU read-side critical section and `rcu_read_unlock` marks its end. We call these functions because we need to protect `find_task_by_pid_ns`, which returns a pointer to the `task_struct` for the given pid. So, here we are getting the pointer to the `task_struct` for `PID = 2` (we got this pid after the creation of `kthreadd` with `kernel_thread`). In the next step we call the `complete` function
complete(&kthreadd_done);
and pass the address of `kthreadd_done`. The `kthreadd_done` is defined as
static __initdata DECLARE_COMPLETION(kthreadd_done);
where DECLARE_COMPLETION
macro defined as:
#define DECLARE_COMPLETION(work) \
struct completion work = COMPLETION_INITIALIZER(work)
and expands to the definition of the `completion` structure. This structure is defined in include/linux/completion.h and represents the `completions` concept. Completions are a code synchronization mechanism which provides a race-free solution for threads that must wait for some other thread to have reached a given point or state. Using completions consists of three parts. The first is the definition of a `completion` structure, which we did with `DECLARE_COMPLETION`. The second is a call of `wait_for_completion`: after calling this function, a thread will block until some other thread calls the `complete` function. Note that we call `wait_for_completion` with `kthreadd_done` in the beginning of `kernel_init_freeable`:
wait_for_completion(&kthreadd_done);
And the last step is the call of the `complete` function, as we saw above. So the `kernel_init_freeable` function will not be executed until the `kthreadd` thread has been set up. After `kthreadd` has been set up, we can see the following three functions in `rest_init`:
init_idle_bootup_task(current);
schedule_preempt_disabled();
cpu_startup_entry(CPUHP_ONLINE);
The first, the `init_idle_bootup_task` function from kernel/sched/core.c, sets the scheduling class for the current process (the `idle` class in our case):
void init_idle_bootup_task(struct task_struct *idle)
{
idle->sched_class = &idle_sched_class;
}
where the `idle` class is for low priority tasks which run only when the processor has nothing else to run. The second function, `schedule_preempt_disabled`, disables preemption in `idle` tasks. And the third function, `cpu_startup_entry`, is defined in kernel/sched/idle.c and calls `cpu_idle_loop` from the same file. The `cpu_idle_loop` function runs as the process with `PID = 0` and works in the background. Its main purpose is to consume the idle CPU cycles: when there is no process to run, this process starts to work. We have one process with the `idle` scheduling class (we just set the `current` task to `idle` with the call of the `init_idle_bootup_task` function), so the `idle` thread does no useful work, but just checks whether there is an active task to switch to:
static void cpu_idle_loop(void)
{
...
...
...
while (1) {
while (!need_resched()) {
...
...
...
}
...
}
More about this will be in the chapter about the scheduler. So at this moment `start_kernel` calls the `rest_init` function, which spawns an `init` process (the `kernel_init` function) and becomes the `idle` process itself. Now it is time to look at `kernel_init`. Execution of the `kernel_init` function starts with the call of the `kernel_init_freeable` function, which first of all waits for the completion of the `kthreadd` setup, as I wrote above:
wait_for_completion(&kthreadd_done);
After this we set `gfp_allowed_mask` to `__GFP_BITS_MASK`, which means that the system is already running; set the allowed cpus/mems to all CPUs and NUMA nodes with the `set_mems_allowed` function; allow the `init` process to run on any CPU with `set_cpus_allowed_ptr`; set the pid for `cad` or `Ctrl-Alt-Delete`; prepare for booting the other CPUs with a call of `smp_prepare_cpus`; call the early initcalls with `do_pre_smp_initcalls`; initialize `SMP` with `smp_init`; initialize the lockup detector with the call of `lockup_detector_init`; and initialize the scheduler with `sched_init_smp`.
After this we can see the call of the `do_basic_setup` function. By the time we call `do_basic_setup`, our kernel is already initialized. As the comment says:
Now we can finally start doing some real work..
The `do_basic_setup` will reinitialize the cpuset to the active CPUs, initialize `khelper` (a kernel thread which is used for making calls out to userspace from within the kernel), initialize tmpfs, initialize the `drivers` subsystem, enable the user-mode helper `workqueue` and make post-early calls of the `initcalls`. After `do_basic_setup` we can see the opening of `/dev/console` and the duplication of file descriptor `0` twice (onto descriptors `1` and `2`):
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
pr_err("Warning: unable to open an initial console.\n");
(void) sys_dup(0);
(void) sys_dup(0);
We are using two system calls here: `sys_open` and `sys_dup`. In the next chapters we will see the explanation and implementation of different system calls. After we have opened the initial console, we check whether the `rdinit=` option was passed on the kernel command line, or set the default path of the ramdisk init:
if (!ramdisk_execute_command)
ramdisk_execute_command = "/init";
Then we check the user's permissions for the `ramdisk` and call the `prepare_namespace` function from init/do_mounts.c, which checks and mounts the initrd:
if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
ramdisk_execute_command = NULL;
prepare_namespace();
}
This is the end of the `kernel_init_freeable` function and we need to return to `kernel_init`. The next step after `kernel_init_freeable` has finished its execution is `async_synchronize_full`. This function waits until all asynchronous function calls have been done, and after it we call `free_initmem`, which releases all the memory occupied by the initialization stuff located between `__init_begin` and `__init_end`. After this we protect `.rodata` with `mark_rodata_ro` and update the state of the system from `SYSTEM_BOOTING` to
system_state = SYSTEM_RUNNING;
Then the kernel tries to run the `init` process:
if (ramdisk_execute_command) {
ret = run_init_process(ramdisk_execute_command);
if (!ret)
return 0;
pr_err("Failed to execute %s (error %d)\n",
ramdisk_execute_command, ret);
}
First of all it checks the `ramdisk_execute_command`, which we set in the `kernel_init_freeable` function: it is equal to the value of the `rdinit=` kernel command line parameter, or `/init` by default. The `run_init_process` function fills the first element of the `argv_init` array:
static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };
which represents the arguments of the `init` program, and calls the `do_execve` function:
argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
(const char __user *const __user *)argv_init,
(const char __user *const __user *)envp_init);
The `do_execve` function is declared in include/linux/sched.h and runs a program with the given file name and arguments. If we did not pass the `rdinit=` option on the kernel command line, the kernel checks the `execute_command` variable, which is equal to the value of the `init=` kernel command line parameter:
if (execute_command) {
ret = run_init_process(execute_command);
if (!ret)
return 0;
panic("Requested init %s failed (error %d).",
execute_command, ret);
}
If we did not pass the `init=` kernel command line parameter either, the kernel tries to run one of the following executable files:
if (!try_to_run_init_process("/sbin/init") ||
!try_to_run_init_process("/etc/init") ||
!try_to_run_init_process("/bin/init") ||
!try_to_run_init_process("/bin/sh"))
return 0;
Otherwise we finish with panic:
panic("No working init found. Try passing init= option to kernel. "
"See Linux Documentation/init.txt for guidance.");
That's all! Linux kernel initialization process is finished!
Conclusion
This is the end of the tenth part about the linux kernel initialization process. It is not only the tenth part, but also the last part which describes the initialization of the linux kernel. As I wrote in the first part of this chapter, we would go through all the steps of the kernel initialization, and we did. We started at the first architecture-independent function, `start_kernel`, and finished with the launch of the first `init` process in our system. I skipped details about the different subsystems of the kernel; for example, I almost did not cover the scheduler, interrupts, exception handling, etc. From the next part we will start to dive into the different kernel subsystems. Hope it will be interesting.
If you have any questions or suggestions write me a comment or ping me at twitter.
Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.
Links
- SLAB
- xsave
- FPU
- Documentation/security/credentials.txt
- Documentation/x86/x86_64/mm
- RCU
- VFS
- inode
- proc
- man proc
- Sysctl
- ftrace
- cgroup
- CPU hotplug documentation
- completions - wait for completion handling
- NUMA
- cpus/mems
- initcalls
- Tmpfs
- initrd
- panic
- Previous part