内核初始化流程

大吉机器人

已于 2022-10-08 11:51:03 修改

阅读量430

点赞数

分类专栏： Linux内核驱动文章标签： linux

于 2022-10-07 11:21:29 首次发布

原文链接：https://www.cntofu.com/book/104/Initialization/README.md

版权

Linux内核驱动专栏收录该内容

65 篇文章 22 订阅

订阅专栏

#内核初始化流程

读者在这章可以了解到整个内核初始化的完整周期，从内核解压之后的第一步到内核自身运行的第一个进程。

注意这里不是所有内核初始化步骤的介绍。这里只有通用的内核内容，不会涉及到中断控制、 ACPI 、以及其它部分。此处没有详述的部分，会在其它章节中描述。

内核解压之后的首要步骤 - 描述内核中的首要步骤。
早期的中断和异常控制 - 描述了早期的中断初始化和早期的缺页处理函数。
在到达内核入口之前最后的准备 - 描述了在调用 start_kernel 之前最后的准备工作。
内核入口 - start_kernel - 描述了内核通用代码中初始化的第一步。
体系架构初始化 - 描述了特定架构的初始化。
进一步初始化指定体系架构 - 描述了再一次的指定架构初始化流程。
最后对指定体系架构初始化 - 描述了指定架构初始化流程的结尾。
调度器初始化 - 描述了调度初始化之前的准备工作，以及调度初始化。
RCU 初始化 - 描述了 RCU 的初始化。
初始化结束 - Linux内核初始化的最后部分。

内核初始化第一部分

踏入内核代码的第一步（TODO: Need proofreading）

上一章是引导过程的最后一部分。从现在开始，我们将深入探究 Linux 内核的初始化过程。在解压缩完 Linux 内核镜像、并把它妥善地放入内存后，内核就开始工作了。我们在第一章中介绍了 Linux 内核引导程序，它的任务就是为执行内核代码做准备。而在本章中，我们将探究内核代码，看一看内核的初始化过程——即在启动 PID 为 1 的 init 进程前，内核所做的大量工作。

本章的内容很多，介绍了在内核启动前的所有准备工作。arch/x86/kernel/head_64.S 文件中定义了内核入口点，我们会从这里开始，逐步地深入下去。在 start_kernel 函数（定义在 init/main.c）执行之前，我们会看到很多的初期的初始化过程，例如初期页表初始化、切换到一个新的内核空间描述符等等。

在上一章的最后一节中，我们跟踪到了 arch/x86/boot/compressed/head_64.S 文件中的 jmp 指令：

jmp	*%rax

此时 rax 寄存器中保存的就是 Linux 内核入口点，通过调用 decompress_kernel （arch/x86/boot/compressed/misc.c）函数后获得。由此可见，内核引导程序的最后一行代码是一句指向内核入口点的跳转指令。既然已经知道了内核入口点定义在哪，我们就可以继续探究 Linux 内核在引导结束后做了些什么。

内核执行的第一步

OK，在调用了 decompress_kernel 函数后，rax 寄存器中保存了解压缩后的内核镜像的地址，并且跳转了过去。解压缩后的内核镜像的入口点定义在 arch/x86/kernel/head_64.S，这个文件的开头几行如下：

	__HEAD
	.code64
	.globl startup_64
startup_64:
	...
	...
	...

我们可以看到 startup_64 过程定义在了 __HEAD 区段下。 __HEAD 只是一个宏，它将展开为可执行的 .head.text 区段：

#define __HEAD		.section	".head.text","ax"

我们可以在 arch/x86/kernel/vmlinux.lds.S 链接器脚本文件中看到这个区段的定义：

.text : AT(ADDR(.text) - LOAD_OFFSET) {
	_text = .;
	...
	...
	...
} :text = 0x9090

除了对 .text 区段的定义，我们还能从这个脚本文件中得知内核的默认物理地址与虚拟地址。_text 是一个地址计数器，对于 x86_64 来说，它定义为：

. = __START_KERNEL;

__START_KERNEL 宏的定义在 arch/x86/include/asm/page_types.h 头文件中，它由内核映射的虚拟基址与基物理起始点相加得到：

#define _START_KERNEL	(__START_KERNEL_map + __PHYSICAL_START)

#define __PHYSICAL_START  ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)

换句话说：

Linux 内核的物理基址 - 0x1000000;
Linux 内核的虚拟基址 - 0xffffffff81000000.

现在我们知道了 startup_64 过程的默认物理地址与虚拟地址，但是真正的地址必须要通过下面的代码计算得到：

	leaq	_text(%rip), %rbp
	subq	$_text - __START_KERNEL_map, %rbp

没错，虽然定义为 0x1000000，但是仍然有可能变化，例如启用 kASLR 的时候。所以我们当前的目标是计算 0x1000000 与实际加载地址的差。这里我们首先将RIP相对地址（rip-relative）放入 rbp 寄存器，并且从中减去 $_text - __START_KERNEL_map 。我们已经知道， _text 在编译后的默认虚拟地址为 0xffffffff81000000，物理地址为 0x1000000。__START_KERNEL_map 宏将展开为 0xffffffff80000000，因此对于对于第二行汇编代码，我们将得到如下的表达式：

rbp = 0x1000000 - (0xffffffff81000000 - 0xffffffff80000000)

在计算过后，rbp 的值将为 0，代表了实际加载地址与编译后的默认地址之间的差值。在我们这个例子中，0 代表了 Linux 内核被加载到了默认地址，并且没有启用 kASLR 。

在得到了 startup_64 的地址后，我们需要检查这个地址是否已经正确对齐。下面的代码将进行这项工作：

	testl	$~PMD_PAGE_MASK, %ebp
	jnz	bad_address

在这里我们将 rbp 寄存器的低32位与 PMD_PAGE_MASK 进行比较。PMD_PAGE_MASK 代表中层页目录（Page middle directory）屏蔽位（相关信息请阅读 paging 一节），它的定义如下：

#define PMD_PAGE_MASK           (~(PMD_PAGE_SIZE-1))

#define PMD_PAGE_SIZE           (_AC(1, UL) << PMD_SHIFT)
#define PMD_SHIFT       21

可以很容易得出 PMD_PAGE_SIZE 为 2MB 。在这里我们使用标准公式来检查对齐问题，如果 text 的地址没有对齐到 2MB，则跳转到 bad_address。

在此之后，我们通过检查高 18 位来防止这个地址过大：

	leaq	_text(%rip), %rax
	shrq	$MAX_PHYSMEM_BITS, %rax
	jnz	bad_address

这个地址必须不超过 46 个比特，即小于2的46次方：

#define MAX_PHYSMEM_BITS       46

OK，至此我们完成了一些初步的检查，可以继续进行后续的工作了。

修正页表基地址

在开始设置 Identity 分页之前，我们需要首先修正下面的地址：

	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)

如果 startup_64 的值不为默认的 0x1000000 的话，则包括 early_level4_pgt、level3_kernel_pgt 在内的很多地址都会不正确。rbp寄存器中包含的是相对地址，因此我们把它与 early_level4_pgt、level3_kernel_pgt 以及 level2_fixmap_pgt 中特定的项相加。首先我们来看一下它们的定义：

NEXT_PAGE(early_level4_pgt)
	.fill	511,8,0
	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level3_kernel_pgt)
	.fill	L3_START_KERNEL,8,0
	.quad	level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.quad	level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE

NEXT_PAGE(level2_kernel_pgt)
	PMDS(0, __PAGE_KERNEL_LARGE_EXEC,
		KERNEL_IMAGE_SIZE/PMD_SIZE)

NEXT_PAGE(level2_fixmap_pgt)
	.fill	506,8,0
	.quad	level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
	.fill	5,8,0

NEXT_PAGE(level1_fixmap_pgt)
	.fill	512,8,0

看起来很难理解，实则不然。首先我们来看一下 early_level4_pgt。它的前 (4096 - 8) 个字节全为 0，即它的前 511 个项均不使用，之后的一项是 level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE。我们知道 __START_KERNEL_map 是内核的虚拟基地址，因此减去 __START_KERNEL_map 后就得到了 level3_kernel_pgt 的物理地址。现在我们来看一下 _PAGE_TABLE，它是页表项的访问权限：

#define _PAGE_TABLE     (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
                         _PAGE_ACCESSED | _PAGE_DIRTY)

更多信息请阅读分页部分.

level3_kernel_pgt 中保存的两项用来映射内核空间，在它的前 510（即 L3_START_KERNEL）项均为 0。这里的 L3_START_KERNEL 保存的是在上层页目录（Page Upper Directory）中包含__START_KERNEL_map 地址的那一条索引，它等于 510。后面一项 level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE 中的 level2_kernel_pgt 比较容易理解，它是一条页表项，包含了指向中层页目录的指针，它用来映射内核空间，并且具有如下的访问权限：

#define _KERNPG_TABLE   (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
                         _PAGE_DIRTY)

level2_fixmap_pgt 是一系列虚拟地址，它们可以在内核空间中指向任意的物理地址。它们由level2_fixmap_pgt作为入口点、10MB 大小的空间用来为 vsyscalls 做映射。level2_kernel_pgt 则调用了PDMS 宏，在 __START_KERNEL_map 地址处为内核的 .text 创建了 512MB 大小的空间（这 512 MB空间的后面是模块内存空间）。

现在，在看过了这些符号的定义之后，让我们回到本节开始时介绍的那几行代码。rbp 寄存器包含了实际地址与 startup_64 地址之差，其中 startup_64 的地址是在内核链接时获得的。因此我们只需要把它与各个页表项的基地址相加，就能够得到正确的地址了。在这里这些操作如下：

	addq	%rbp, early_level4_pgt + (L4_START_KERNEL*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (510*8)(%rip)
	addq	%rbp, level3_kernel_pgt + (511*8)(%rip)
	addq	%rbp, level2_fixmap_pgt + (506*8)(%rip)

换句话说，early_level4_pgt 的最后一项就是 level3_kernel_pgt，level3_kernel_pgt 的最后两项分别是 level2_kernel_pgt 和 level2_fixmap_pgt， level2_fixmap_pgt 的第507项就是 level1_fixmap_pgt 页目录。

在这之后我们就得到了：

early_level4_pgt[511] -> level3_kernel_pgt[0]
level3_kernel_pgt[510] -> level2_kernel_pgt[0]
level3_kernel_pgt[511] -> level2_fixmap_pgt[0]
level2_kernel_pgt[0]   -> 512 MB kernel mapping
level2_fixmap_pgt[507] -> level1_fixmap_pgt

需要注意的是，我们并不修正 early_level4_pgt 以及其他页目录的基地址，我们会在构造、填充这些页目录结构的时候修正。我们修正了页表基地址后，就可以开始构造这些页目录了。

Identity Map Paging

现在我们可以进入到对初期页表进行 Identity 映射的初始化过程了。在 Identity 映射分页中，虚拟地址会被映射到地址相同的物理地址上，即 1 : 1。下面我们来看一下细节。首先我们找到 _text 与 _early_level4_pgt 的 RIP 相对地址，并把他们放入 rdi 与 rbx 寄存器中。

	leaq	_text(%rip), %rdi
	leaq	early_level4_pgt(%rip), %rbx

在此之后我们使用 rax 保存 _text 的地址。同时，在全局页目录表中有一条记录中存放的是 _text 的地址。为了得到这条索引，我们把 _text 的地址右移 PGDIR_SHIFT 位。

	movq	%rdi, %rax
	shrq	$PGDIR_SHIFT, %rax

	leaq	(4096 + _KERNPG_TABLE)(%rbx), %rdx
	movq	%rdx, 0(%rbx,%rax,8)
	movq	%rdx, 8(%rbx,%rax,8)

其中 PGDIR_SHIFT 为 39。PGDIR_SHIFT表示的是在虚拟地址下的全局页目录位的屏蔽值（mask）。下面的宏定义了所有类型的页目录的屏蔽值：

#define PGDIR_SHIFT     39
#define PUD_SHIFT       30
#define PMD_SHIFT       21

此后我们就将 level3_kernel_pgt 的地址放进 rdx 中，并将它的访问权限设置为 _KERNPG_TABLE（见上），然后将 level3_kernel_pgt 填入 early_level4_pgt 的两项中。

然后我们给 rdx 寄存器加上 4096（即 early_level4_pgt 的大小），并把 rdi 寄存器的值（即 _text 的物理地址）赋值给 rax 寄存器。之后我们把上层页目录中的两个项写入 level3_kernel_pgt：

	addq	$4096, %rdx
	movq	%rdi, %rax
	shrq	$PUD_SHIFT, %rax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)
	incl	%eax
	andl	$(PTRS_PER_PUD-1), %eax
	movq	%rdx, 4096(%rbx,%rax,8)

下一步我们把中层页目录表项的地址写入 level2_kernel_pgt，然后修正内核的 text 和 data 的虚拟地址：

	leaq	level2_kernel_pgt(%rip), %rdi
	leaq	4096(%rdi), %r8
1:	testq	$1, 0(%rdi)
	jz	2f
	addq	%rbp, 0(%rdi)
2:	addq	$8, %rdi
	cmp	%r8, %rdi
	jne	1b

这里首先把 level2_kernel_pgt 的地址赋值给 rdi，并把页表项的地址赋值给 r8 寄存器。下一步我们来检查 level2_kernel_pgt 中的存在位，如果其为0，就把 rdi 加上8以便指向下一个页。然后我们将其与 r8（即页表项的地址）作比较，不相等的话就跳转回前面的标签 1 ，反之则继续运行。

接下来我们使用 rbp （即 _text 的物理地址）来修正 phys_base 物理地址。将 early_level4_pgt 的物理地址与 rbp 相加，然后跳转至标签 1：

	addq	%rbp, phys_base(%rip)
	movq	$(early_level4_pgt - __START_KERNEL_map), %rax
	jmp 1f

其中 phys_base 与 level2_kernel_pgt 第一项相同，为 512 MB的内核映射。

跳转至内核入口点之前的最后准备

此后我们就跳转至标签1来开启 PAE 和 PGE （Paging Global Extension），并且将phys_base的物理地址（见上）放入 rax 就寄存器，同时将其放入 cr3 寄存器：

1:
	movl	$(X86_CR4_PAE | X86_CR4_PGE), %ecx
	movq	%rcx, %cr4

	addq	phys_base(%rip), %rax
	movq	%rax, %cr3

接下来我们检查CPU是否支持 NX 位：

	movl	$0x80000001, %eax
	cpuid
	movl	%edx,%edi

首先将 0x80000001 放入 eax 中，然后执行 cpuid 指令来得到处理器信息。这条指令的结果会存放在 edx 中，我们把他再放到 edi 里。

现在我们把 MSR_EFER （即 0xc0000080）放入 ecx，然后执行 rdmsr 指令来读取CPU中的Model Specific Register (MSR)。

	movl	$MSR_EFER, %ecx
	rdmsr

返回结果将存放于 edx:eax 。下面展示了 EFER 各个位的含义：

63                                                                              32
 --------------------------------------------------------------------------------
|                                                                               |
|                                Reserved MBZ                                   |
|                                                                               |
 --------------------------------------------------------------------------------
31                            16  15      14      13   12  11   10  9  8 7  1   0
 --------------------------------------------------------------------------------
|                              | T |       |       |    |   |   |   |   |   |   |
| Reserved MBZ                 | C | FFXSR | LMSLE |SVME|NXE|LMA|MBZ|LME|RAZ|SCE|
|                              | E |       |       |    |   |   |   |   |   |   |
 --------------------------------------------------------------------------------

在这里我们不会介绍每一个位的含义，没有涉及到的位和其他的 MSR 将会在专门的部分介绍。在我们将 EFER 读入 edx:eax 之后，通过 btsl 来将 _EFER_SCE （即第0位）置1，设置 SCE 位将会启用 SYSCALL 以及 SYSRET 指令。下一步我们检查 edi（即 cpuid 的结果（见上））中的第20位。如果第 20 位（即 NX 位）置位，我们就只把 EFER_SCE写入MSR。

	btsl	$_EFER_SCE, %eax
	btl	    $20,%edi
	jnc     1f
	btsl	$_EFER_NX, %eax
	btsq	$_PAGE_BIT_NX,early_pmd_flags(%rip)
1:	wrmsr

如果支持 NX 那么我们就把 _EFER_NX 也写入MSR。在设置了 NX 后，还要对 cr0 （control register）中的一些位进行设置：

X86_CR0_PE - 系统处于保护模式;
X86_CR0_MP - 与CR0的TS标志位一同控制 WAIT/FWAIT 指令的功能；
X86_CR0_ET - 386允许指定外部数学协处理器为80287或80387;
X86_CR0_NE - 如果置位，则启用内置的x87浮点错误报告，否则启用PC风格的x87错误检测；
X86_CR0_WP - 如果置位，则CPU在特权等级为0时无法写入只读内存页;
X86_CR0_AM - 当AM位置位、EFLGS中的AC位置位、特权等级为3时，进行对齐检查;
X86_CR0_PG - 启用分页.

#define CR0_STATE	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM | \
			 X86_CR0_PG)
movl	$CR0_STATE, %eax
movq	%rax, %cr0

为了从汇编执行C语言代码，我们需要建立一个栈。首先将栈指针指向一个内存中合适的区域，然后重置FLAGS寄存器

movq stack_start(%rip), %rsp
pushq $0
popfq

在这里最有意思的地方在于 stack_start。它也定义在当前的源文件中：

GLOBAL(stack_start)
.quad  init_thread_union+THREAD_SIZE-8

对于 GLOABL 我们应该很熟悉了。它在 arch/x86/include/asm/linkage.h 头文件中定义如下：

#define GLOBAL(name)    \
         .globl name;           \
         name:

THREAD_SIZE 定义在 arch/x86/include/asm/page_64_types.h，它依赖于 KASAN_STACK_ORDER 的值:

#define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

首先来考虑当禁用了 kasan 并且 PAGE_SIZE 大小为4096时的情况。此时 THREAD_SIZE 将为 16 KB，代表了一个线程的栈的大小。为什么是线程？我们知道每一个进程可能会有父进程和子进程。事实上，父进程和子进程使用不同的栈空间，每一个新进程都会拥有一个新的内核栈。在Linux内核中，这个栈由 thread_info 结构中的一个union表示：

union thread_union {
         struct thread_info thread_info;
         unsigned long stack[THREAD_SIZE/sizeof(long)];
};

例如，init_thread_union定义如下：

union thread_union init_thread_union __init_task_data =
	{ INIT_THREAD_INFO(init_task) };

其中 INIT_THREAD_INFO 接受 task_struct 结构类型的参数，并进行一些初始化操作：

#define INIT_THREAD_INFO(tsk)		\
{                                               \
	.task		= &tsk,                         \
	.flags		= 0,                            \
	.cpu		= 0,                            \
	.addr_limit	= KERNEL_DS,                    \
}

task_struct 结构在内核中代表了对进程的描述。因此，thread_union 包含了关于一个进程的低级信息，并且其位于进程栈底：

+-----------------------+
|                       |
|                       |
|                       |
|     Kernel stack      |
|                       |
|                       |
|                       |
|-----------------------|
|                       |
|  struct thread_info   |
|                       |
+-----------------------+

需要注意的是我们在栈顶保留了 8 个字节的空间，用来保护对下一个内存页的非法访问。

在初期启动栈设置好之后，使用 lgdt 指令来更新全局描述符表：

lgdt	early_gdt_descr(%rip)

其中 early_gdt_descr 定义如下：

early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)

需要重新加载 全局描述附表 的原因是，虽然目前内核工作在用户空间的低地址中，但很快内核将会在它自己的内存地址空间中运行。下面让我们来看一下 early_gdt_descr 的定义。全局描述符表包含了32项，用于内核代码、数据、线程局部存储段等：

#define GDT_ENTRIES 32

现在来看一下 early_gdt_descr_base. 首先，gdt_page 的定义在arch/x86/include/asm/desc.h中:

struct gdt_page {
	struct desc_struct gdt[GDT_ENTRIES];
} __attribute__((aligned(PAGE_SIZE)));

它只包含了一项 desc_struct 的数组gdt。desc_struct定义如下:

struct desc_struct {
         union {
                 struct {
                         unsigned int a;
                         unsigned int b;
                 };
                 struct {
                         u16 limit0;
                         u16 base0;
                         unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
                         unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
                 };
         };
 } __attribute__((packed));

它跟 GDT 描述符的定义很像。同时需要注意的是，gdt_page结构是 PAGE_SIZE(4096) 对齐的，即 gdt 将会占用一页内存。

下面我们来看一下 INIT_PER_CPU_VAR，它定义在 arch/x86/include/asm/percpu.h，只是将给定的参数与 init_per_cpu__连接起来：

#define INIT_PER_CPU_VAR(var) init_per_cpu__##var

所以在宏展开之后，我们会得到 init_per_cpu__gdt_page。而在 linker script 中可以发现:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);

INIT_PER_CPU 扩展后也将得到 init_per_cpu__gdt_page 并将它的值设置为相对于 __per_cpu_load 的偏移量。这样，我们就得到了新GDT的正确的基地址。

per-CPU变量是2.6内核中的特性。顾名思义，当我们创建一个 per-CPU 变量时，每个CPU都会拥有一份它自己的拷贝，在这里我们创建的是 gdt_page per-CPU变量。这种类型的变量有很多有点，比如由于每个CPU都只访问自己的变量而不需要锁等。因此在多处理器的情况下，每一个处理器核心都将拥有一份自己的 GDT 表，其中的每一项都代表了一块内存，这块内存可以由在这个核心上运行的线程访问。这里 Theory/per-cpu 有关于 per-CPU 变量的更详细的介绍。

在加载好了新的全局描述附表之后，跟之前一样我们重新加载一下各个段：

	xorl %eax,%eax
	movl %eax,%ds
	movl %eax,%ss
	movl %eax,%es
	movl %eax,%fs
	movl %eax,%gs

在所有这些步骤都结束后，我们需要设置一下 gs 寄存器，令它指向一个特殊的栈 irqstack，用于处理中断：

	movl	$MSR_GS_BASE,%ecx
	movl	initial_gs(%rip),%eax
	movl	initial_gs+4(%rip),%edx
	wrmsr

其中， MSR_GS_BASE 为：

#define MSR_GS_BASE             0xc0000101

我们需要把 MSR_GS_BASE 放入 ecx 寄存器，同时利用 wrmsr 指令向 eax 和 edx 处的地址加载数据（即指向 initial_gs）。cs, fs, ds 和 ss 段寄存器在64位模式下不用来寻址，但 fs 和 gs 可以使用。 fs 和 gs 有一个隐含的部分（与实模式下的 cs 段寄存器类似），这个隐含部分存储了一个描述符，其指向 Model Specific Registers。因此上面的 0xc0000101 是一个 gs.base MSR 地址。当发生系统调用或者中断时，入口点处并没有内核栈，因此 MSR_GS_BASE 将会用来存放中断栈。

接下来我们把实模式中的 bootparam 结构的地址放入 rdi (要记得 rsi 从一开始就保存了这个结构体的指针)，然后跳转到C语言代码：

	movq	initial_code(%rip),%rax
	pushq	$0
	pushq	$__KERNEL_CS
	pushq	%rax
	lretq

这里我们把 initial_code 放入 rax 中，并且向栈里分别压入一个无用的地址、__KERNEL_CS 和 initial_code 的地址。随后的 lreq 指令表示从栈上弹出返回地址并跳转。initial_code 同样定义在这个文件里：

	.balign	8
	GLOBAL(initial_code)
	.quad	x86_64_start_kernel
	...
	...
	...

可以看到 initial_code 包含了 x86_64_start_kernel 的地址，其定义在 arch/x86/kerne/head64.c：

asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data) {
	...
	...
	...
}

这个函数接受一个参数 real_mode_data（刚才我们把实模式下数据的地址保存到了 rdi 寄存器中）。

这个函数是内核中第一个执行的C语言代码！

走进 start_kernel

在我们真正到达“内核入口点”-init/main.c中的start_kernel函数之前，我们还需要最后的准备工作：

首先在 x86_64_start_kernel 函数中可以看到一些检查工作：

BUILD_BUG_ON(MODULES_VADDR < __START_KERNEL_map);
BUILD_BUG_ON(MODULES_VADDR - __START_KERNEL_map < KERNEL_IMAGE_SIZE);
BUILD_BUG_ON(MODULES_LEN + KERNEL_IMAGE_SIZE > 2*PUD_SIZE);
BUILD_BUG_ON((__START_KERNEL_map & ~PMD_MASK) != 0);
BUILD_BUG_ON((MODULES_VADDR & ~PMD_MASK) != 0);
BUILD_BUG_ON(!(MODULES_VADDR > __START_KERNEL));
BUILD_BUG_ON(!(((MODULES_END - 1) & PGDIR_MASK) == (__START_KERNEL & PGDIR_MASK)));
BUILD_BUG_ON(__fix_to_virt(__end_of_fixed_addresses) <= MODULES_END);

这些检查包括：模块的虚拟地址不能低于内核 text 段基地址 __START_KERNEL_map ，包含模块的内核 text 段的空间大小不能小于内核镜像大小等等。BUILD_BUG_ON 宏定义如下：

#define BUILD_BUG_ON(condition) ((void)sizeof(char[1 - 2*!!(condition)]))

我们来理解一下这些巧妙的设计是怎么工作的。首先以第一个条件 MODULES_VADDR < __START_KERNEL_map 为例：!!conditions 等价于 condition != 0，这代表如果 MODULES_VADDR < __START_KERNEL_map 为真，则 !!(condition) 为1，否则为0。执行2*!!(condition)之后数值变为 2 或 0。因此，这个宏执行完后可能产生两种不同的行为：

编译错误。因为我们尝试取获取一个字符数组索引为负数的变量的大小。
没有编译错误。

就是这么简单，通过C语言中某些常量导致编译错误的技巧实现了这一设计。

接下来 start_kernel 调用了 cr4_init_shadow 函数，其中存储了每个CPU中 cr4 的Shadow Copy。上下文切换可能会修改 cr4 中的位，因此需要保存每个CPU中 cr4 的内容。在这之后将会调用 reset_early_page_tables 函数，它重置了所有的全局页目录项，同时向 cr3 中重新写入了的全局页目录表的地址：

for (i = 0; i < PTRS_PER_PGD-1; i++)
	early_level4_pgt[i].pgd = 0;

next_early_pgt = 0;

write_cr3(__pa_nodebug(early_level4_pgt));

很快我们就会设置新的页表。在这里我们遍历了所有的全局页目录项（其中 PTRS_PER_PGD 为 512），将其设置为0。之后将 next_early_pgt 设置为0（会在下一篇文章中介绍细节），同时把 early_level4_pgt 的物理地址写入 cr3。__pa_nodebug 是一个宏，将被扩展为：

((unsigned long)(x) - __START_KERNEL_map + phys_base)

此后我们清空了从 __bss_stop 到 __bss_start 的 _bss 段，下一步将是建立初期 IDT（中断描述符表） 的处理代码，内容很多，我们将会留到下一个部分再来探究。

总结

第一部分关于Linux内核的初始化过程到这里就结束了。

如果你有任何问题或建议，请在twitter上联系我 0xAX，或者通过邮件与我沟通，还可以新开issue。

下一部分我们会看到初期中断处理程序的初始化过程、内核空间的内存映射等。

内核初始化第二部分

初期中断和异常处理

在上一个部分我们谈到了初期中断初始化。目前我们已经处于解压缩后的Linux内核中了，还有了用于初期启动的基本的分页机制。我们的目标是在内核的主体代码执行前做好准备工作。

我们已经在本章的第一部分做了一些工作，在这一部分中我们会继续分析关于中断和异常处理部分的代码。

我们在上一部分谈到了下面这个循环：

for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);

这段代码位于 arch/x86/kernel/head64.c。在分析这段代码之前，我们先来了解一些关于中断和中断处理程序的知识。

理论

中断是一种由软件或硬件产生的、向CPU发出的事件。例如，如果用户按下了键盘上的一个按键时，就会产生中断。此时CPU将会暂停当前的任务，并且将控制流转到特殊的程序中—— 中断处理程序(Interrupt Handler)。一个中断处理程序会对中断进行处理，然后将控制权交还给之前暂停的任务中。中断分为三类：

软件中断 - 当一个软件可以向CPU发出信号，表明它需要系统内核的相关功能时产生。这些中断通常用于系统调用；
硬件中断 - 当一个硬件有任何事件发生时产生，例如键盘的按键被按下；
异常 - 当CPU检测到错误时产生，例如发生了除零错误或者访问了一个不存在的内存页。

每一个中断和异常都可以由一个数来表示，这个数叫做 向量号 ，它可以取从 0 到 255 中的任何一个数。通常在实践中前 32 个向量号用来表示异常，32 到 255 用来表示用户定义的中断。可以看到在上面的代码中，NUM_EXCEPTION_VECTORS 就定义为：

#define NUM_EXCEPTION_VECTORS 32

CPU会从 APIC 或者 CPU 引脚接收中断，并使用中断向量号作为 中断描述符表 的索引。下面的表中列出了 0-31 号异常：

----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                   |
----------------------------------------------------------------------------------------------
|0     | #DE    |Divide Error        |Fault|NO        |DIV and IDIV                          |
|---------------------------------------------------------------------------------------------
|1     | #DB    |Reserved            |F/T  |NO        |                                      |
|---------------------------------------------------------------------------------------------
|2     | ---    |NMI                 |INT  |NO        |external NMI                          |
|---------------------------------------------------------------------------------------------
|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                                 |
|---------------------------------------------------------------------------------------------
|4     | #OF    |Overflow            |Trap |NO        |INTO  instruction                     |
|---------------------------------------------------------------------------------------------
|5     | #BR    |Bound Range Exceeded|Fault|NO        |BOUND instruction                     |
|---------------------------------------------------------------------------------------------
|6     | #UD    |Invalid Opcode      |Fault|NO        |UD2 instruction                       |
|---------------------------------------------------------------------------------------------
|7     | #NM    |Device Not Available|Fault|NO        |Floating point or [F]WAIT             |
|---------------------------------------------------------------------------------------------
|8     | #DF    |Double Fault        |Abort|YES       |Ant instrctions which can generate NMI|
|---------------------------------------------------------------------------------------------
|9     | ---    |Reserved            |Fault|NO        |                                      |
|---------------------------------------------------------------------------------------------
|10    | #TS    |Invalid TSS         |Fault|YES       |Task switch or TSS access             |
|---------------------------------------------------------------------------------------------
|11    | #NP    |Segment Not Present |Fault|NO        |Accessing segment register            |
|---------------------------------------------------------------------------------------------
|12    | #SS    |Stack-Segment Fault |Fault|YES       |Stack operations                      |
|---------------------------------------------------------------------------------------------
|13    | #GP    |General Protection  |Fault|YES       |Memory reference                      |
|---------------------------------------------------------------------------------------------
|14    | #PF    |Page fault          |Fault|YES       |Memory reference                      |
|---------------------------------------------------------------------------------------------
|15    | ---    |Reserved            |     |NO        |                                      |
|---------------------------------------------------------------------------------------------
|16    | #MF    |x87 FPU fp error    |Fault|NO        |Floating point or [F]Wait             |
|---------------------------------------------------------------------------------------------
|17    | #AC    |Alignment Check     |Fault|YES       |Data reference                        |
|---------------------------------------------------------------------------------------------
|18    | #MC    |Machine Check       |Abort|NO        |                                      |
|---------------------------------------------------------------------------------------------
|19    | #XM    |SIMD fp exception   |Fault|NO        |SSE[2,3] instructions                 |
|---------------------------------------------------------------------------------------------
|20    | #VE    |Virtualization exc. |Fault|NO        |EPT violations                        |
|---------------------------------------------------------------------------------------------
|21-31 | ---    |Reserved            |INT  |NO        |External interrupts                   |
----------------------------------------------------------------------------------------------

为了能够对中断进行处理，CPU使用了一种特殊的结构 - 中断描述符表（IDT）。IDT 是一个由描述符组成的数组，其中每个描述符都为8个字节，与全局描述附表一致；不过不同的是，我们把IDT中的每一项叫做 门(gate) 。为了获得某一项描述符的起始地址，CPU 会把向量号乘以8，在64位模式中则会乘以16。在前面我们已经见过，CPU使用一个特殊的 GDTR 寄存器来存放全局描述符表的地址，中断描述符表也有一个类似的寄存器 IDTR ，同时还有用于将基地址加载入这个寄存器的指令 lidt 。

64位模式下 IDT 的每一项的结构如下：

127                                                                             96
 --------------------------------------------------------------------------------
|                                                                               |
|                                Reserved                                       |
|                                                                               |
 --------------------------------------------------------------------------------
95                                                                              64
 --------------------------------------------------------------------------------
|                                                                               |
|                               Offset 63..32                                   |
|                                                                               |
 --------------------------------------------------------------------------------
63                               48 47      46  44   42    39             34    32
 --------------------------------------------------------------------------------
|                                  |       |  D  |   |     |      |   |   |     |
|       Offset 31..16              |   P   |  P  | 0 |Type |0 0 0 | 0 | 0 | IST |
|                                  |       |  L  |   |     |      |   |   |     |
 --------------------------------------------------------------------------------
31                                   15 16                                      0
 --------------------------------------------------------------------------------
|                                      |                                        |
|          Segment Selector            |                 Offset 15..0           |
|                                      |                                        |
 --------------------------------------------------------------------------------

其中:

Offset - 代表了到中断处理程序入口点的偏移；
DPL - 描述符特权级别；
P - Segment Present 标志;
Segment selector - 在GDT或LDT中的代码段选择子；
IST - 用来为中断处理提供一个新的栈。

最后的 Type 域描述了这一项的类型，中断处理程序共分为三种：

任务描述符
中断描述符
陷阱描述符

中断和陷阱描述符包含了一个指向中断处理程序的远 (far) 指针，二者唯一的不同在于CPU处理 IF 标志的方式。如果是由中断门进入中断处理程序的，CPU 会清除 IF 标志位，这样当当前中断处理程序执行时，CPU 不会对其他的中断进行处理；只有当当前的中断处理程序返回时，CPU 才在 iret 指令执行时重新设置 IF 标志位。

中断门的其他位为保留位，必须为0。下面我们来看一下 CPU 是如何处理中断的：

CPU 会在栈上保存标志寄存器、cs段寄存器和程序计数器IP；
如果中断是由错误码引起的（比如 #PF）， CPU会在栈上保存错误码；
在中断处理程序执行完毕后，由iret指令返回。

OK，接下来我们继续分析代码。

设置并加载 IDT

我们分析到了如下代码：

for (i = 0; i < NUM_EXCEPTION_VECTORS; i++)
	set_intr_gate(i, early_idt_handler_array[i]);

这里循环内部调用了 set_intr_gate ，它接受两个参数：

中断号，即 向量号；
中断处理程序的地址。

同时，这个函数还会将中断门插入至 IDT 表中，代码中的 &idt_descr 数组即为 IDT。首先让我们来看一下 early_idt_handler_array 数组，它定义在 arch/x86/include/asm/segment.h 头文件中，包含了前32个异常处理程序的地址：

#define EARLY_IDT_HANDLER_SIZE   9
#define NUM_EXCEPTION_VECTORS	32

extern const char early_idt_handler_array[NUM_EXCEPTION_VECTORS][EARLY_IDT_HANDLER_SIZE];

early_idt_handler_array 是一个大小为 288 字节的数组，每一项为 9 个字节，其中2个字节的备用指令用于向栈中压入默认错误码（如果异常本身没有提供错误码的话），2个字节的指令用于向栈中压入向量号，剩余5个字节用于跳转到异常处理程序。

在上面的代码中，我们只通过一个循环向 IDT 中填入了前32项内容，这是因为在整个初期设置阶段，中断是禁用的。early_idt_handler_array 数组中的每一项指向的都是同一个通用中断处理程序，定义在 arch/x86/kernel/head_64.S 。我们先暂时跳过这个数组的内容，看一下 set_intr_gate 的定义。

set_intr_gate 宏定义在 arch/x86/include/asm/desc.h：

#define set_intr_gate(n, addr)                         \
         do {                                                            \
                 BUG_ON((unsigned)n > 0xFF);                             \
                 _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,        \
                           __KERNEL_CS);                                 \
                 _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,\
                                 0, 0, __KERNEL_CS);                     \
         } while (0)

首先 BUG_ON 宏确保了传入的中断向量号不会大于255，因为我们最多只有 256 个中断。然后它调用了 _set_gate 函数，它会将中断门写入 IDT：

static inline void _set_gate(int gate, unsigned type, void *addr,
	                         unsigned dpl, unsigned ist, unsigned seg)
{
         gate_desc s;
         pack_gate(&s, type, (unsigned long)addr, dpl, ist, seg);
         write_idt_entry(idt_table, gate, &s);
         write_trace_idt_entry(gate, &s);
}

在 _set_gate 函数的开始，它调用了 pack_gate 函数。这个函数会使用给定的参数填充 gate_desc 结构：

static inline void pack_gate(gate_desc *gate, unsigned type, unsigned long func,
                             unsigned dpl, unsigned ist, unsigned seg)
{
        gate->offset_low        = PTR_LOW(func);
        gate->segment           = __KERNEL_CS;
        gate->ist               = ist;
        gate->p                 = 1;
        gate->dpl               = dpl;
        gate->zero0             = 0;
        gate->zero1             = 0;
        gate->type              = type;
        gate->offset_middle     = PTR_MIDDLE(func);
        gate->offset_high       = PTR_HIGH(func);
}

在这个函数里，我们把从主循环中得到的中断处理程序入口点地址拆成三个部分，填入门描述符中。下面的三个宏就用来做这个拆分工作：

#define PTR_LOW(x) ((unsigned long long)(x) & 0xFFFF)
#define PTR_MIDDLE(x) (((unsigned long long)(x) >> 16) & 0xFFFF)
#define PTR_HIGH(x) ((unsigned long long)(x) >> 32)

调用 PTR_LOW 可以得到 x 的低 2 个字节，调用 PTR_MIDDLE 可以得到 x 的中间 2 个字节，调用 PTR_HIGH 则能够得到 x 的高 4 个字节。接下来我们来位中断处理程序设置段选择子，即内核代码段 __KERNEL_CS。然后将 Interrupt Stack Table 和 描述符特权等级 （最高特权等级）设置为0，以及在最后设置 GAT_INTERRUPT 类型。

现在我们已经设置好了IDT中的一项，那么通过调用 native_write_idt_entry 函数来把复制到 IDT：

static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
        memcpy(&idt[entry], gate, sizeof(*gate));
}

主循环结束后，idt_table 就已经设置完毕了，其为一个 gate_desc 数组。然后我们就可以通过下面的代码加载 中断描述符表：

load_idt((const struct desc_ptr *)&idt_descr);

其中，idt_descr 为：

struct desc_ptr idt_descr = { NR_VECTORS * 16 - 1, (unsigned long) idt_table };

load_idt 函数只是执行了一下 lidt 指令：

asm volatile("lidt %0"::"m" (*dtr));

你可能已经注意到了，在代码中还有对 _trace_* 函数的调用。这些函数会用跟 _set_gate 同样的方法对 IDT 门进行设置，但仅有一处不同：这些函数并不设置 idt_table ，而是 trace_idt_table ，用于设置追踪点（tracepoint，我们将会在其他章节介绍这一部分）。

好了，至此我们已经了解到，通过设置并加载 中断描述符表 ，能够让CPU在发生中断时做出相应的动作。下面让我们来看一下如何编写中断处理程序。

初期中断处理程序

在上面的代码中，我们用 early_idt_handler_array 的地址来填充了 IDT ，这个 early_idt_handler_array 定义在 arch/x86/kernel/head_64.S：

	.globl early_idt_handler_array
early_idt_handlers:
	i = 0
	.rept NUM_EXCEPTION_VECTORS
	.if (EXCEPTION_ERRCODE_MASK >> i) & 1
	pushq $0
	.endif
	pushq $i
	jmp early_idt_handler_common
	i = i + 1
	.fill early_idt_handler_array + i*EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
	.endr

这段代码自动生成为前 32 个异常生成了中断处理程序。首先，为了统一栈的布局，如果一个异常没有返回错误码，那么我们就手动在栈中压入一个 0。然后再在栈中压入中断向量号，最后跳转至通用的中断处理程序 early_idt_handler_common 。我们可以通过 objdump 命令的输出一探究竟：

$ objdump -D vmlinux
...
...
...
ffffffff81fe5000 <early_idt_handler_array>:
ffffffff81fe5000:       6a 00                   pushq  $0x0
ffffffff81fe5002:       6a 00                   pushq  $0x0
ffffffff81fe5004:       e9 17 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5009:       6a 00                   pushq  $0x0
ffffffff81fe500b:       6a 01                   pushq  $0x1
ffffffff81fe500d:       e9 0e 01 00 00          jmpq   ffffffff81fe5120 <early_idt_handler_common>
ffffffff81fe5012:       6a 00                   pushq  $0x0
ffffffff81fe5014:       6a 02                   pushq  $0x2
...
...
...

由于在中断发生时，CPU 会在栈上压入标志寄存器、CS 段寄存器和 RIP 寄存器的内容。因此在 early_idt_handler 执行前，栈的布局如下：

|--------------------|
| %rflags            |
| %cs                |
| %rip               |
| rsp --> error code |
|--------------------|

下面我们来看一下 early_idt_handler_common 的实现。它也定义在 arch/x86/kernel/head_64.S 文件中。首先它会检查当前中断是否为不可屏蔽中断(NMI)，如果是则简单地忽略它们：

	cmpl $2,(%rsp)
	je .Lis_nmi

其中 is_nmi 为:

is_nmi:
	addq $16,%rsp
	INTERRUPT_RETURN

这段程序首先从栈顶弹出错误码和中断向量号，然后通过调用 INTERRUPT_RETURN ，即 iretq 指令直接返回。

如果当前中断不是 NMI ，则首先检查 early_recursion_flag 以避免在 early_idt_handler_common 程序中递归地产生中断。如果一切都没问题，就先在栈上保存通用寄存器，为了防止中断返回时寄存器的内容错乱：

	pushq %rax
	pushq %rcx
	pushq %rdx
	pushq %rsi
	pushq %rdi
	pushq %r8
	pushq %r9
	pushq %r10
	pushq %r11

然后我们检查栈上的段选择子：

	cmpl $__KERNEL_CS,96(%rsp)
	jne 11f

段选择子必须为内核代码段，如果不是则跳转到标签 11 ，输出 PANIC 信息并打印栈的内容。然后我们来检查向量号，如果是 #PF 即缺页中断（Page Fault），那么就把 cr2 寄存器中的值赋值给 rdi ，然后调用 early_make_pgtable （详见后文）：

	cmpl $14,72(%rsp)
	jnz 10f
	GET_CR2_INTO(%rdi)
	call early_make_pgtable
	andl %eax,%eax
	jz 20f

如果向量号不是 #PF ，那么就恢复通用寄存器：

	popq %r11
	popq %r10
	popq %r9
	popq %r8
	popq %rdi
	popq %rsi
	popq %rdx
	popq %rcx
	popq %rax

并调用 iret 从中断处理程序返回。

第一个中断处理程序到这里就结束了。由于它只是一个初期中段处理程序，因此只处理缺页中断。下面让我们首先来看一下缺页中断处理程序，其他中断的处理程序我们之后再进行分析。

缺页中断处理程序

在上一节中我们第一次见到了初期中断处理程序，它检查了缺页中断的中断号，并调用了 early_make_pgtable 来建立新的页表。在这里我们需要提供 #PF 中断处理程序，以便为之后将内核加载至 4G 地址以上，并且能访问位于4G以上的 boot_params 结构体。

early_make_pgtable 的实现在 arch/x86/kernel/head64.c，它接受一个参数：从 cr2 寄存器得到的地址，这个地址引发了内存中断。下面让我们来看一下：

int __init early_make_pgtable(unsigned long address)
{
	unsigned long physaddr = address - __PAGE_OFFSET;
	unsigned long i;
	pgdval_t pgd, *pgd_p;
	pudval_t pud, *pud_p;
	pmdval_t pmd, *pmd_p;
	...
	...
	...
}

首先它定义了一些 *val_t 类型的变量。这些类型均为：

typedef unsigned long   pgdval_t;

此外，我们还会遇见 *_t (不带val)的类型，比如 pgd_t ……这些类型都定义在 arch/x86/include/asm/pgtable_types.h，形式如下：

typedef struct { pgdval_t pgd; } pgd_t;

例如，

extern pgd_t early_level4_pgt[PTRS_PER_PGD];

在这里 early_level4_pgt 代表了初期顶层页表目录，它是一个 pdg_t 类型的数组，其中的 pgd 指向了下一级页表。

在确认不是非法地址后，我们取得页表中包含引起 #PF 中断的地址的那一项，将其赋值给 pgd 变量：

pgd_p = &early_level4_pgt[pgd_index(address)].pgd;
pgd = *pgd_p;

接下来我们检查一下 pgd ，如果它包含了正确的全局页表项的话，我们就把这一项的物理地址处理后赋值给 pud_p ：

pud_p = (pudval_t *)((pgd & PTE_PFN_MASK) + __START_KERNEL_map - phys_base);

其中 PTE_PFN_MASK 是一个宏：

#define PTE_PFN_MASK            ((pteval_t)PHYSICAL_PAGE_MASK)

展开后将为：

(~(PAGE_SIZE-1)) & ((1 << 46) - 1)

或者写为：

0b1111111111111111111111111111111111111111111111

它是一个46bit大小的页帧屏蔽值。

如果 pgd 没有包含有效的地址，我们就检查 next_early_pgt 与 EARLY_DYNAMIC_PAGE_TABLES（即 64 ）的大小。EARLY_DYNAMIC_PAGE_TABLES 它是一个固定大小的缓冲区，用来在需要的时候建立新的页表。如果 next_early_pgt 比 EARLY_DYNAMIC_PAGE_TABLES 大，我们就用一个上层页目录指针指向当前的动态页表，并将它的物理地址与 _KERPG_TABLE 访问权限一起写入全局页目录表：

if (next_early_pgt >= EARLY_DYNAMIC_PAGE_TABLES) {
	reset_early_page_tables();
    goto again;
}
	
pud_p = (pudval_t *)early_dynamic_pgts[next_early_pgt++];
for (i = 0; i < PTRS_PER_PUD; i++)
	pud_p[i] = 0;
*pgd_p = (pgdval_t)pud_p - __START_KERNEL_map + phys_base + _KERNPG_TABLE;

然后我们来修正上层页目录的地址：

pud_p += pud_index(address);
pud = *pud_p;

下面我们对中层页目录重复上面同样的操作。最后我们利用 In the end we fix address of the page middle directory which contains maps kernel text+data virtual addresses:

pmd = (physaddr & PMD_MASK) + early_pmd_flags;
pmd_p[pmd_index(address)] = pmd;

到此缺页中断处理程序就完成了它所有的工作，此时 early_level4_pgt 就包含了指向合法地址的项。

小结

本书的第二部分到此结束了。

如果你有任何问题或建议，请在twitter上联系我 0xAX，或者通过邮件与我沟通，还可以新开issue。

接下来我们将会看到进入内核入口点 start_kernel 函数之前剩下所有的准备工作。

内核初始化第三部分

进入内核入口点之前最后的准备工作

这是 Linux 内核初始化过程的第三部分。在上一个部分中我们接触到了初期中断和异常处理，而在这个部分中我们要继续看一看 Linux 内核的初始化过程。在之后的章节我们将会关注“内核入口点”—— init/main.c 文件中的start_kernel 函数。没错，从技术上说这并不是内核的入口点，只是不依赖于特定架构的通用内核代码的开始。不过，在我们调用 start_kernel 之前，有些准备必须要做。下面我们就来看一看。

boot_params again

在上一个部分中我们讲到了设置中断描述符表，并将其加载进 IDTR 寄存器。下一步是调用 copy_bootdata 函数：

copy_bootdata(__va(real_mode_data));

这个函数接受一个参数—— read_mode_data 的虚拟地址。boot_params 结构体是在 arch/x86/include/uapi/asm/bootparam.h 作为第一个参数传递到 arch/x86/kernel/head_64.S 中的 x86_64_start_kernel 函数的：

	/* rsi is pointer to real mode structure with interesting info.
	   pass it to C */
	movq	%rsi, %rdi

下面我们来看一看 __va 宏。这个宏定义在 init/main.c：

#define __va(x)                 ((void *)((unsigned long)(x)+PAGE_OFFSET))

其中 PAGE_OFFSET 就是 __PAGE_OFFSET（即 0xffff880000000000），也是所有对物理地址进行直接映射后的虚拟基地址。因此我们就得到了 boot_params 结构体的虚拟地址，并把他传入 copy_bootdata 函数中。在这个函数里我们把 real_mod_data （定义在 arch/x86/kernel/setup.h）拷贝进 boot_params：

extern struct boot_params boot_params;

copy_boot_data 的实现如下:

static void __init copy_bootdata(char *real_mode_data)
{
	char * command_line;
	unsigned long cmd_line_ptr;

	memcpy(&boot_params, real_mode_data, sizeof boot_params);
	sanitize_boot_params(&boot_params);
	cmd_line_ptr = get_cmd_line_ptr();
	if (cmd_line_ptr) {
		command_line = __va(cmd_line_ptr);
		memcpy(boot_command_line, command_line, COMMAND_LINE_SIZE);
	}
}

首先，这个函数的声明中有一个 __init 前缀，这表示这个函数只在初始化阶段使用，并且它所使用的内存将会被释放。

在这个函数中首先声明了两个用于解析内核命令行的变量，然后使用memcpy 函数将 real_mode_data 拷贝进 boot_params。如果系统引导工具（bootloader）没能正确初始化 boot_params 中的某些成员的话，那么在接下来调用的 sanitize_boot_params 函数中将会对这些成员进行清零，比如 ext_ramdisk_image 等。此后我们通过调用 get_cmd_line_ptr 函数来得到命令行的地址：

unsigned long cmd_line_ptr = boot_params.hdr.cmd_line_ptr;
cmd_line_ptr |= (u64)boot_params.ext_cmd_line_ptr << 32;
return cmd_line_ptr;

get_cmd_line_ptr 函数将会从 boot_params 中获得命令行的64位地址并返回。最后，我们检查一下是否正确获得了 cmd_line_ptr，并把它的虚拟地址拷贝到一个字节数组 boot_command_line 中：

extern char __initdata boot_command_line[];

这一步完成之后，我们就得到了内核命令行和 boot_params 结构体。之后，内核通过调用 load_ucode_bsp 函数来加载处理器微代码（microcode），不过我们目前先暂时忽略这一步。

微代码加载之后，内核会对 console_loglevel 进行检查，同时通过 early_printk 函数来打印出字符串 Kernel Alive。不过这个输出不会真的被显示出来，因为这个时候 early_printk 还没有被初始化。这是目前内核中的一个小bug，作者已经提交了补丁 commit，补丁很快就能应用在主分支中了。所以你可以先跳过这段代码。

初始化内存页

至此，我们已经拷贝了 boot_params 结构体，接下来将对初期页表进行一些设置以便在初始化内核的过程中使用。我们之前已经对初始化了初期页表，以便支持换页，这在之前的部分中已经讨论过。现在则通过调用 reset_early_page_tables 函数将初期页表中大部分项清零（在之前的部分也有介绍），只保留内核高地址的映射。然后我们调用：

	clear_page(init_level4_pgt);

init_level4_pgt 同样定义在 arch/x86/kernel/head_64.S:

NEXT_PAGE(init_level4_pgt)
	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.org    init_level4_pgt + L4_PAGE_OFFSET*8, 0
	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
	.org    init_level4_pgt + L4_START_KERNEL*8, 0
	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE

这段代码为内核的代码段、数据段和 bss 段映射了前 2.5G 个字节。clear_page 函数定义在 arch/x86/lib/clear_page_64.S：

ENTRY(clear_page)
	CFI_STARTPROC
	xorl %eax,%eax
	movl $4096/64,%ecx
	.p2align 4
	.Lloop:
    decl	%ecx
#define PUT(x) movq %rax,x*8(%rdi)
	movq %rax,(%rdi)
	PUT(1)
	PUT(2)
	PUT(3)
	PUT(4)
	PUT(5)
	PUT(6)
	PUT(7)
	leaq 64(%rdi),%rdi
	jnz	.Lloop
	nop
	ret
	CFI_ENDPROC
	.Lclear_page_end:
	ENDPROC(clear_page)

顾名思义，这个函数会将页表清零。这个函数的开始和结束部分有两个宏 CFI_STARTPROC 和 CFI_ENDPROC，他们会展开成 GNU 汇编指令，用于调试：

#define CFI_STARTPROC           .cfi_startproc
#define CFI_ENDPROC             .cfi_endproc

在 CFI_STARTPROC 之后我们将 eax 寄存器清零，并将 ecx 赋值为 64（用作计数器）。接下来从 .Lloop 标签开始循环，首先就是将 ecx 减一。然后将 rax 中的值（目前为0）写入 rdi 指向的地址，rdi 中保存的是 init_level4_pgt 的基地址。接下来重复7次这个步骤，但是每次都相对 rdi 多偏移8个字节。之后 init_level4_pgt 的前64个字节就都被填充为0了。接下来我们将 rdi 中的值加上64，重复这个步骤，直到 ecx 减至0。最后就完成了将 init_level4_pgt 填零。

在将 init_level4_pgt 填0之后，再把它的最后一项设置为内核高地址的映射：

init_level4_pgt[511] = early_level4_pgt[511];

在前面我们已经使用 reset_early_page_table 函数清除 early_level4_pgt 中的大部分项，而只保留内核高地址的映射。

x86_64_start_kernel 函数的最后一步是调用：

x86_64_start_reservations(real_mode_data);

并传入 real_mode_data 参数。 x86_64_start_reservations 函数与 x86_64_start_kernel 函数定义在同一个文件中：

void __init x86_64_start_reservations(char *real_mode_data)
{
	if (!boot_params.hdr.version)
		copy_bootdata(__va(real_mode_data));

	reserve_ebda_region();

	start_kernel();
}

这就是进入内核入口点之前的最后一个函数了。下面我们就来介绍一下这个函数。

内核入口点前的最后一步

在 x86_64_start_reservations 函数中首先检查了 boot_params.hdr.version：

if (!boot_params.hdr.version)
	copy_bootdata(__va(real_mode_data));

如果它为0，则再次调用 copy_bootdata，并传入 real_mode_data 的虚拟地址。

接下来则调用了 reserve_ebda_region 函数，它定义在 arch/x86/kernel/head.c。这个函数为 EBDA（即Extended BIOS Data Area，扩展BIOS数据区域）预留空间。扩展BIOS预留区域位于常规内存顶部（译注：常规内存（Conventiional Memory）是指前640K字节内存），包含了端口、磁盘参数等数据。

接下来我们来看一下 reserve_ebda_region 函数。它首先会检查是否启用了半虚拟化：

if (paravirt_enabled())
	return;

如果开启了半虚拟化，那么就退出 reserve_ebda_region 函数，因为此时没有扩展BIOS数据区域。下面我们首先得到低地址内存的末尾地址：

lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
lowmem <<= 10;

首先我们得到了BIOS地地址内存的虚拟地址，以KB为单位，然后将其左移10位（即乘以1024）转换为以字节为单位。然后我们需要获得扩展BIOS数据区域的地址：

ebda_addr = get_bios_ebda();

其中， get_bios_ebda 函数定义在 arch/x86/include/asm/bios_ebda.h：

static inline unsigned int get_bios_ebda(void)
{
	unsigned int address = *(unsigned short *)phys_to_virt(0x40E);
	address <<= 4;
	return address;
}

下面我们来尝试理解一下这段代码。这段代码中，首先我们将物理地址 0x40E 转换为虚拟地址，0x0040:0x000e 就是包含有扩展BIOS数据区域基地址的代码段。这里我们使用了 phys_to_virt 函数进行地址转换，而不是之前使用的 __va 宏。不过，事实上他们两个基本上是一样的：

static inline void *phys_to_virt(phys_addr_t address)
{
         return __va(address);
}

而不同之处在于，phys_to_virt 函数的参数类型 phys_addr_t 的定义依赖于 CONFIG_PHYS_ADDR_T_64BIT：

#ifdef CONFIG_PHYS_ADDR_T_64BIT
	typedef u64 phys_addr_t;
#else
	typedef u32 phys_addr_t;
#endif

具体的类型是由 CONFIG_PHYS_ADDR_T_64BIT 设置选项控制的。此后我们得到了包含扩展BIOS数据区域虚拟基地址的段，把它左移4位后返回。这样，ebda_addr 变量就包含了扩展BIOS数据区域的基地址。

下一步我们来检查扩展BIOS数据区域与低地址内存的地址，看一看它们是否小于 INSANE_CUTOFF 宏：

if (ebda_addr < INSANE_CUTOFF)
	ebda_addr = LOWMEM_CAP;

if (lowmem < INSANE_CUTOFF)
	lowmem = LOWMEM_CAP;

INSANE_CUTOFF 为：

#define INSANE_CUTOFF		0x20000U

即 128 KB. 上一步我们得到了低地址内存中的低地址部分以及扩展BIOS数据区域，然后调用 memblock_reserve 函数来在低内存地址与1MB之间为扩展BIOS数据预留内存区域。

lowmem = min(lowmem, ebda_addr);
lowmem = min(lowmem, LOWMEM_CAP);
memblock_reserve(lowmem, 0x100000 - lowmem);

memblock_reserve 函数定义在 mm/block.c，它接受两个参数：

基物理地址
区域大小

然后在给定的基地址处预留指定大小的内存。memblock_reserve 是在这本书中我们接触到的第一个Linux内核内存管理框架中的函数。我们很快会详细地介绍内存管理，不过现在还是先来看一看这个函数的实现。

Linux内核管理框架初探

在上一段中我们遇到了对 memblock_reserve 函数的调用。现在我们来尝试理解一下这个函数是如何工作的。 memblock_reserve 函数只是调用了：

memblock_reserve_region(base, size, MAX_NUMNODES, 0);

memblock_reserve_region 接受四个参数：

内存区域的物理基地址
内存区域的大小
最大 NUMA 节点数
标志参数 flags

在 memblock_reserve_region 函数一开始，就是一个 memblock_type 结构体类型的变量：

struct memblock_type *_rgn = &memblock.reserved;

memblock_type 类型代表了一块内存，定义如下：

struct memblock_type {
         unsigned long cnt;
         unsigned long max;
         phys_addr_t total_size;
         struct memblock_region *regions;
};

因为我们要为扩展BIOS数据区域预留内存块，所以当前内存区域的类型就是预留。memblock 结构体的定义为：

struct memblock {
         bool bottom_up;
         phys_addr_t current_limit;
         struct memblock_type memory;
         struct memblock_type reserved;
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
         struct memblock_type physmem;
#endif
};

它描述了一块通用的数据块。我们用 memblock.reserved 的值来初始化 _rgn。memblock 全局变量定义如下：

struct memblock memblock __initdata_memblock = {
	.memory.regions		= memblock_memory_init_regions,
	.memory.cnt		= 1,
	.memory.max		= INIT_MEMBLOCK_REGIONS,
	.reserved.regions	= memblock_reserved_init_regions,
	.reserved.cnt		= 1,
	.reserved.max		= INIT_MEMBLOCK_REGIONS,
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
	.physmem.regions	= memblock_physmem_init_regions,
	.physmem.cnt		= 1,
	.physmem.max		= INIT_PHYSMEM_REGIONS,
#endif
	.bottom_up		= false,
	.current_limit		= MEMBLOCK_ALLOC_ANYWHERE,
};

我们现在不会继续深究这个变量，但在内存管理部分的中我们会详细地对它进行介绍。需要注意的是，这个变量的声明中使用了 __initdata_memblock：

#define __initdata_memblock __meminitdata

而 __meminit_data 为：

#define __meminitdata    __section(.meminit.data)

自此我们得出这样的结论：所有的内存块都将定义在 .meminit.data 区段中。在我们定义了 _rgn 之后，使用了 memblock_dbg 宏来输出相关的信息。你可以在从内核命令行传入参数 memblock=debug 来开启这些输出。

在输出了这些调试信息后，是对下面这个函数的调用：

memblock_add_range(_rgn, base, size, nid, flags);

它向 .meminit.data 区段添加了一个新的内存块区域。由于 _rgn 的值是 &memblock.reserved，下面的代码就直接将扩展BIOS数据区域的基地址、大小和标志填入 _rgn 中：

if (type->regions[0].size == 0) {
    WARN_ON(type->cnt != 1 || type->total_size);
    type->regions[0].base = base;
    type->regions[0].size = size;
    type->regions[0].flags = flags;
    memblock_set_region_node(&type->regions[0], nid);
    type->total_size = size;
    return 0;
}

在填充好了区域后，接着是对 memblock_set_region_node 函数的调用。它接受两个参数：

填充好的内存区域的地址
NUMA节点ID

其中我们的区域由 memblock_region 结构体来表示：

struct memblock_region {
    phys_addr_t base;
	phys_addr_t size;
	unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;
#endif
};

NUMA节点ID依赖于 MAX_NUMNODES 宏，定义在 include/linux/numa.h

#define MAX_NUMNODES    (1 << NODES_SHIFT)

其中 NODES_SHIFT 依赖于 CONFIG_NODES_SHIFT 配置参数，定义如下：

#ifdef CONFIG_NODES_SHIFT
  #define NODES_SHIFT     CONFIG_NODES_SHIFT
#else
  #define NODES_SHIFT     0
#endif

memblick_set_region_node 函数只是填充了 memblock_region 中的 nid 成员：

static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
         r->nid = nid;
}

在这之后我们就在 .meminit.data 区段拥有了为扩展BIOS数据区域预留的第一个 memblock。reserve_ebda_region 已经完成了它该做的任务，我们回到 arch/x86/kernel/head64.c 继续。

至此我们已经结束了进入内核之前所有的准备工作。x86_64_start_reservations 的最后一步是调用 init/main.c 中的：

start_kernel()

这一部分到此结束。

小结

本书的第三部分到这里就结束了。在下一部分中，我们将会见到内核入口点处的初始化工作 —— 位于 start_kernel 函数中。这些工作是在启动第一个进程 init 之前首先要完成的工作。

如果你有任何问题或建议，请在twitter上联系我 0xAX，或者通过邮件与我沟通，还可以新开issue。

内核初始化. Part 4.

Kernel entry point

还记得上一章的内容吗 - 跳转到内核入口之前的最后准备？你应该还记得我们已经完成一系列初始化操作，并停在了调用位于init/main.c中的start_kernel函数之前.start_kernel函数是与体系架构无关的通用处理入口函数，尽管我们在此初始化过程中要无数次的返回arch/ 文件夹。如果你仔细看看start_kernel函数的内容，你将发现此函数涉及内容非常广泛。在此过程中约包含了86个调用函数，是的，你发现它真的是非常庞大但是此部分并不是全部的初始化过程，在当前阶段我们只看这些就可以了。此章节以及后续所有在内核初始化过程章节的内容将涉及并详述它。

start_kernel函数的主要目的是完成内核初始化并启动祖先进程(1号进程)。在祖先进程启动之前start_kernel函数做了很多事情，如锁验证器,根据处理器标识ID初始化处理器，开启cgroups子系统，设置每CPU区域环境，初始化VFS Cache机制，初始化内存管理，rcu,vmalloc,scheduler(调度器),IRQs(中断向量表),ACPI(中断可编程控制器)以及其它很多子系统。只有经过这些步骤我们才看到本章最后一部分祖先进程启动的过程；同志们，如此复杂的内核子系统，有没有勾起你的学习欲望，有这么多的内核代码等着我们去征服，让我们开始吧。

注意:在此大章节的所有内容 Linux Kernel initialization process，并不涉及内核调试相关，关于内核调试部分会有一个单独的章节来进行描述

关于 `attribute`

正如我上述所写，start_kernel函数是定义在init/main.c.从已知代码中我们能看到此函数使用了__init特性，你也许从其它地方了解过关于GCC __attribute__相关的内容。在内核初始化阶段这个机制在所有的函数中都是有必要的。

	#define __init      __section(.init.text) __cold notrace

在初始化过程完成后，内核将通过调用free_initmem释放这些sections(段)。注意__init属性是通过__cold和notrace两个属性来定义的。第一个属性cold的目的是标记此函数很少使用所以编译器必须优化此函数的大小，第二个属性notrace定义如下：

	#define notrace __attribute__((no_instrument_function))

含有no_instrument_function意思就是告诉编译器函数调用不产生环境变量(堆栈空间)。

在start_kernel函数的定义中，你也可以看到__visible 属性的扩展：

	#define __visible __attribute__((externally_visible))

含有externally_visible意思就是告诉编译器有一些过程在使用该函数或者变量，为了放至标记这个函数/变量是unusable。你可以在此include/linux/init.h处查到这些属性表达式的含义。

start_kernel 初始化

在start_kernel的初始之初你可以看到这两个变量：

char *command_line;
char *after_dashes;

第一个变量表示内核命令行的全局指针，第二个变量将包含parse_args函数通过输入字符串中的参数'name=value'，寻找特定的关键字和调用正确的处理程序。我们不想在这个时候参与这两个变量的相关细节，但是会在接下来的章节看到。我们接着往下走，下一步我们看到了此函数:

lockdep_init();

lockdep_init 初始化 lock validator. 其实现是相当简单的，它只是初始化了两个哈希表 list_head并设置lockdep_initialized 全局变量为1。关于自旋锁 spinlock以及互斥锁mutex 如何获取请参考链接.

下一个函数是set_task_stack_end_magic，参数为init_task和设置STACK_END_MAGIC (0x57AC6E9D)。init_task代表初始化进程(任务)数据结构:

struct task_struct init_task = INIT_TASK(init_task);

task_struct 存储了进程的所有相关信息。因为它很庞大，我在这本书并不会去介绍，详细信息你可以查看调度相关数据结构定义头文件 include/linux/sched.h。在此刻task_sreuct包含了超过100个字段！虽然你不会看到task_struct是在这本书中的解释，但是我们会经常使用它，因为它是介绍在Linux内核进程的基本知识。我将描述这个结构中字段的一些含义，因为我们在后面的实践中见到它们。

你也可以查看init_task的相关定义以及宏指令INIT_TASK的初始化流程。这个宏指令来自于include/linux/init_task.h在此刻只是设置和初始化了第一个进程来(0号进程)的值。例如这么设置：

初始化进程状态为 zero 或者 runnable. 一个可运行进程即为等待CPU去运行;
初始化仅存的标志位 - PF_KTHREAD 意思为 - 内核线程;
一个可运行的任务列表;
进程地址空间;
初始化进程堆栈 &init_thread_info - init_thread_union.thread_info 和 initthread_union 使用共用体 - thread_union 包含了 thread_info进程信息以及进程栈:。

union thread_union {
	struct thread_info thread_info;
    unsigned long stack[THREAD_SIZE/sizeof(long)];
};

每个进程都有其自己的堆栈，x86_64架构的CPU一般支持的页表是16KB or 4个页框大小。我们注意stack变量被定义为数据并且类型是unsigned long。thread_union结构的下一个字段为thread_union 定义如下：

struct thread_info {
        struct task_struct      *task;
        struct exec_domain      *exec_domain;
        __u32                   flags; 
        __u32                   status;
        __u32                   cpu;
        int                     saved_preempt_count;
        mm_segment_t            addr_limit;
        struct restart_block    restart_block;
        void __user             *sysenter_return;
        unsigned int            sig_on_uaccess_error:1;
        unsigned int            uaccess_err:1;
};

此结构占用52个字节。thread_info结构包含了特定体系架构相关的线程信息，我们都知道在X86_64架构上内核栈是逆生成而thread_union.thread_info结构则是正生长。所以进程进程栈是16KB并且thread_info是在栈底。还需我们处理16 kilobytes - 62 bytes = 16332 bytes.注意 thread_union代表一个联合体union而不是结构体，用一张图来描述栈内存空间。如下图所示:

+-----------------------+
|                       |
|                       |
|        stack          |
|                       |
|_______________________|
|          |            |
|          |            |
|          |            |
|__________↓____________|             +--------------------+
|                       |             |                    |
|      thread_info      |<----------->|     task_struct    |
|                       |             |                    |
+-----------------------+             +--------------------+

http://www.quora.com/In-Linux-kernel-Why-thread_info-structure-and-the-kernel-stack-of-a-process-binds-in-union-construct

所以INIT_TASK宏指令就是task_struct's'结构。正如我上述所写，我并不会去描述这些字段的含义和值，在INIT_TASK赋值处理的时候我们很快能看到这些。

现在让我们回到set_task_stack_end_magic函数，这个函数被定义在kernel/fork.c功能为设置canary init 进程堆栈以检测堆栈溢出。

void set_task_stack_end_magic(struct task_struct *tsk)
{
	unsigned long *stackend;
	stackend = end_of_stack(tsk);
	*stackend = STACK_END_MAGIC; /* for overflow detection */
}

上述函数比较简单，set_task_stack_end_magic函数的作用是先通过end_of_stack函数获取堆栈并赋给 task_struct。关于检测配置需要打开内核配置宏CONFIG_STACK_GROWSUP。因为我们学习的是x86架构的初始化，堆栈是逆生成，所以堆栈底部为：

(unsigned long *)(task_thread_info(p) + 1);

task_thread_info的定义如下，返回一个当前的堆栈；

#define task_thread_info(task)  ((struct thread_info *)(task)->stack)

进程的栈底，我们写STACK_END_MAGIC这个值。如果设置canary，我们可以像这样子去检测堆栈：

if (*end_of_stack(task) != STACK_END_MAGIC) {
        //
        // handle stack overflow here
		//
}

set_task_stack_end_magic 初始化完毕后的下一个函数是 smp_setup_processor_id.此函数在x86_64架构上是空函数：

void __init __weak smp_setup_processor_id(void)
{
}

在此架构上没有实现此函数，但在别的体系架构的实现可以参考s390 and arm64.

我们接着往下走，下一个函数是debug_objects_early_init。此函数的执行几乎和lockdep_init是一样的，但是填充的哈希对象是调试相关。上述我已经表明，关于内核调试部分会在后续专门有一个章节来完成。

debug_object_early_init函数之后我们看到调用了boot_init_stack_canary函数。task_struct->canary 的值利用了GCC特性，但是此特性需要先使能内核CONFIG_CC_STACKPROTECTOR宏后才可以使用。 boot_init_stack_canary 什么也没有做，否则基于随机数和随机池产生 TSC:

get_random_bytes(&canary, sizeof(canary));
tsc = __native_read_tsc();
canary += tsc + (tsc << 32UL);

我们要获取随机数, 我们可以给stack_canary 字段 task_struct赋值：

current->stack_canary = canary;

然后将此值写入IRQ堆栈的顶部:

this_cpu_write(irq_stack_union.stack_canary, canary); // read below about this_cpu_write

关于IRQ的章节我们这里也不会详细刨析, 关于这部分介绍看这里IRQs.如果canary被设置, 关闭本地中断注册bootstrap CPU以及CPU maps. 我们关闭本地中断 (interrupts for current CPU) 使用 local_irq_disable 函数，展开后原型为 arch_local_irq_disable 函数include/linux/percpu-defs.h:

static inline notrace void arch_local_irq_enable(void)
{
        native_irq_enable();
}

如果native_irq_enable通过cli指令判断架构，这里是X86_64， Where native_irq_enable is cli instruction for x86_64.中断的关闭(屏蔽)我们可以通过注册当前CPU ID到CPU bitmap来实现。

激活第一个CPU

当前已经走到start_kernel函数中的boot_cpu_init函数，此函数主要为了通过掩码初始化每一个CPU。首先我们需要获取当前处理器的ID通过下面函数：

int cpu = smp_processor_id();

现在是0. 如果CONFIG_DEBUG_PREEMPT 宏配置了那么 smp_processor_id 的值就来自于 raw_smp_processor_id 函数，原型如下:

#define raw_smp_processor_id() (this_cpu_read(cpu_number))

this_cpu_read 函数与其它很多函数一样如(this_cpu_write, this_cpu_add 等等...) 被定义在include/linux/percpu-defs.h 此部分函数主要为对 this_cpu 进行操作. 这些操作提供不同的对每cpuper-cpu 变量相关访问方式. 譬如让我们来看看这个函数 this_cpu_read:

__pcpu_size_call_return(this_cpu_read_, pcp)

还记得上面我们所写，每cpu变量cpu_number 的值是this_cpu_read通过raw_smp_processor_id来得到，现在让我们看看 __pcpu_size_call_return的执行：

#define __pcpu_size_call_return(stem, variable)                         \
({                                                                      \
        typeof(variable) pscr_ret__;                                    \
        __verify_pcpu_ptr(&(variable));                                 \
        switch(sizeof(variable)) {                                      \
        case 1: pscr_ret__ = stem##1(variable); break;                  \
        case 2: pscr_ret__ = stem##2(variable); break;                  \
        case 4: pscr_ret__ = stem##4(variable); break;                  \
        case 8: pscr_ret__ = stem##8(variable); break;                  \
        default:                                                        \
                __bad_size_call_parameter(); break;                     \
        }                                                               \
        pscr_ret__;                                                     \
})

是的，此函数虽然看起起奇怪但是它的实现是简单的，我们看到pscr_ret__ 变量的定义是int类型，为什么是int类型呢？好吧，变量是common_cpu 它声明了每cpu(per-cpu)变量:

DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);

在下一个步骤中我们调用了__verify_pcpu_ptr通过使用一个有效的每cpu变量指针来取地址得到cpu_number。之后我们通过pscr_ret__ 函数设置变量的大小，common_cpu变量是int,所以它的大小是4字节。意思就是我们通过this_cpu_read_4(common_cpu)获取cpu变量其大小被pscr_ret__决定。在__pcpu_size_call_return的结束我们调用了__pcpu_size_call_return：

#define this_cpu_read_4(pcp)       percpu_from_op("mov", pcp)

需要调用percpu_from_op 并且通过mov指令来传递每cpu变量，percpu_from_op的内联扩展如下：

asm("movl %%gs:%1,%0" : "=r" (pfo_ret__) : "m" (common_cpu))

让我们尝试理解此函数是如果工作的，gs段寄存器包含每个CPU区域的初始值，这里我们通过mov指令copy common_cpu到内存中去，此函数还有另外的形式：

this_cpu_read(common_cpu)

等价于:

movl %gs:$common_cpu, $pfo_ret__

由于我们没有设置每个CPU的区域,我们只有一个 - 为当前CPU的值zero 通过此函数 smp_processor_id返回.

返回的ID表示我们处于哪一个CPU上, boot_cpu_init 函数设置了CPU的在线, 激活, 当前的设置为:

set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);

上述我们所有使用的这些CPU的配置我们称之为- CPU掩码cpumask. cpu_possible 则是设置支持CPU热插拔时候的CPU ID. cpu_present 表示当前热插拔的CPU. cpu_online表示当前所有在线的CPU以及通过 cpu_present 来决定被调度出去的CPU. CPU热插拔的操作需要打开内核配置宏CONFIG_HOTPLUG_CPU并且将 possible == present 以及active == online选项禁用。这些功能都非常相似，每个函数都需要检查第二个参数，如果设置为true，需要通过调用cpumask_set_cpu or cpumask_clear_cpu来改变状态。

譬如我们可以通过true或者第二个参数来这么调用：

cpumask_set_cpu(cpu, to_cpumask(cpu_possible_bits));

让我们继续尝试理解to_cpumask宏指令，此宏指令转化为一个位图通过struct cpumask *，CPU掩码提供了位图集代表了当前系统中所有的CPU's，每CPU都占用1bit，CPU掩码相关定义通过cpu_mask结构定义:

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

在来看下面一组函数定义了位图宏指令。

#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

正如我们看到的定义一样， DECLARE_BITMAP宏指令的原型是一个unsigned long的数组，现在让我们查看如何执行to_cpumask:

#define to_cpumask(bitmap)                                              \
        ((struct cpumask *)(1 ? (bitmap)                                \
                            : (void *)sizeof(__check_is_bitmap(bitmap))))

我不知道你是怎么想的, 但是我是这么想的，我看到此函数其实就是一个条件判断语句当条件为真的时候，但是为什么执行__check_is_bitmap？让我们看看__check_is_bitmap的定义：

static inline int __check_is_bitmap(const unsigned long *bitmap)
{
        return 1;
}

原来此函数始终返回1，事实上我们需要这样的函数才达到我们的目的：它在编译时给定一个bitmap，换句话将就是检查bitmap的类型是否是unsigned long *,因此我们仅仅通过to_cpumask宏指令将类型为unsigned long的数组转化为struct cpumask *。现在我们可以调用cpumask_set_cpu 函数，这个函数仅仅是一个 set_bit给CPU掩码的功能函数。所有的这些set_cpu_*函数的原理都是一样的。

如果你还不确定set_cpu_*这些函数的操作并且不能理解 cpumask的概念，不要担心。你可以通过读取这些章节cpumask or documentation.来继续了解和学习这些函数的原理。

现在我们已经激活第一个CPU，我们继续接着start_kernel函数往下走，下面的函数是page_address_init,但是此函数不执行任何操作，因为只有当所有内存不能直接映射的时候才会执行。

Linux 内核的第一条打印信息

下面调用了pr_notice函数。

#define pr_notice(fmt, ...) \
    printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)

pr_notice其实是printk的扩展，这里我们使用它打印了Linux 的banner。

pr_notice("%s", linux_banner);

打印的是内核的版本号以及编译环境信息:

Linux version 4.0.0-rc6+ (alex@localhost) (gcc version 4.9.1 (Ubuntu 4.9.1-16ubuntu6) ) #319 SMP

依赖于体系结构的初始化部分

下个步骤我们就要进入到指定的体系架构的初始函数，Linux 内核初始化体系架构相关调用setup_arch函数，这又是一个类型于start_kernel的庞大函数，这里我们仅仅简单描述，在下一个章节我们将继续深入。指定体系架构的内容，我们需要再一次阅读arch/目录，setup_arch函数定义在arch/x86/kernel/setup.c 文件中，此函数就一个参数-内核命令行。

此函数解析内核的段_text和_data来自于_text符号和_bss_stop(你应该还记得此文件arch/x86/kernel/head_64.S)。我们使用memblock来解析内存块。

memblock_reserve(__pa_symbol(_text), (unsigned long)__bss_stop - (unsigned long)_text);

你可以阅读关于memblock的相关内容在Linux kernel memory management Part 1.，你应该还记得memblock_reserve函数的两个参数：

base physical address of a memory block;
size of a memory block.

我们可以通过__pa_symbol宏指令来获取符号表_text段中的物理地址

#define __pa_symbol(x) \
	__phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))

上述宏指令调用 __phys_reloc_hide 宏指令来填充参数，__phys_reloc_hide宏指令在x86_64上返回的参数是给定的。宏指令 __phys_addr_symbol的执行是简单的，只是减去从_text符号表中读到的内核的符号映射地址并且加上物理地址的基地址。

#define __phys_addr_symbol(x) \
 ((unsigned long)(x) - __START_KERNEL_map + phys_base)

memblock_reserve函数对内存页进行分配。

保留可用内存初始化initrd

之后我们保留替换内核的text和data段用来初始化initrd,我们暂时不去了解initrd的详细信息，你仅仅只需要知道根文件系统就是通过这方式来进行初始化这就是early_reserve_initrd 函数的工作，此函数获取RAM DISK的基地址以及大小以及大小加偏移。

u64 ramdisk_image = get_ramdisk_image();
u64 ramdisk_size  = get_ramdisk_size();
u64 ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);

如果你阅读过这些章节Linux Kernel Booting Process，你就知道所有的这些啊参数都来自于boot_params，时刻谨记boot_params在boot期间已经被赋值，内核启动头包含了一下几个字段用来描述RAM DISK：

Field name:	ramdisk_image
Type:		write (obligatory)
Offset/size:	0x218/4
Protocol:	2.00+

  The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
  zero if there is no initial ramdisk/ramfs.

我们可以得到关于 boot_params的一些信息. 具体查看get_ramdisk_image:

static u64 __init get_ramdisk_image(void)
{
        u64 ramdisk_image = boot_params.hdr.ramdisk_image;

        ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;

        return ramdisk_image;
}

关于32位的ramdisk的地址，我们可以阅读此部分内容来获取Documentation/x86/zero-page.txt:

0C0/004	ALL	ext_ramdisk_image ramdisk_image high 32bits

32位变化后，我们获取64位的ramdisk原理一样，为此我们可以检查bootloader 提供的ramdisk信息：

if (!boot_params.hdr.type_of_loader ||
    !ramdisk_image || !ramdisk_size)
	return;

并保留内存块将ramdisk传输到最终的内存地址，然后进行初始化：

memblock_reserve(ramdisk_image, ramdisk_end - ramdisk_image);

结束语

以上就是第四部分关于内核初始化的部分内容，我们从start_kernel函数开始一直到指定体系架构初始化setup_arch的过程中停止，那么在下一个章节我们将继续研究体系架构相关的初始化内容。

如果你有任何的问题或者建议，你可以留言，也可以直接发消息给我twitter。

很抱歉，英语并不是我的母的，非常抱歉给您阅读带来不便，如果你发现文中描述有任何问题，请提交一个 PR 到 linux-insides.

链接

内核初始化第五部分

与系统架构有关的初始化后续分析

在之前的章节中，我们讲到了与系统架构有关的 setup_arch 函数部分，本文会继续从这里开始。我们为 initrd 预留了内存之后，下一步是执行 olpc_ofw_detect 函数检测系统是否支持 One Laptop Per Child support。我们不会考虑与平台有关的东西，且会忽略与平台有关的函数。所以我们继续往下看。下一步是执行 early_trap_init 函数。这个函数会初始化调试功能（#DB -当 TF 标志位和rflags被设置时会被使用）和 int3 （#BP）中断门。如果你不了解中断，你可以从初期中断和异常处理中学习有关中断的内容。在 x86 架构中，INT，INT0 和 INT3 是支持任务显式调用中断处理函数的特殊指令。INT3 指令调用断点（#BP）处理函数。你如果记得，我们在这部分看到过中断和异常概念：

----------------------------------------------------------------------------------------------
|Vector|Mnemonic|Description         |Type |Error Code|Source                                |
----------------------------------------------------------------------------------------------
|3     | #BP    |Breakpoint          |Trap |NO        |INT 3                                 |
----------------------------------------------------------------------------------------------

调试中断 #DB 是激活调试器的重要方法。early_trap_init 函数的定义在 arch/x86/kernel/traps.c 中。这个函数用来设置 #DB 和 #BP 处理函数，并且实现重新加载 IDT。

void __init early_trap_init(void)
{
        set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
        set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
        load_idt(&idt_descr);
}

我们之前中断相关章节中看到过 set_intr_gate 的实现。这里的 set_intr_gate_ist 和 set_system_intr_gate_ist 也是类似的实现。这两个函数都需要三个参数：

中断号
中断/异常处理函数的基地址
第三个参数是 Interrupt Stack Table。 IST 是 TSS 的部分内容，是 x86_64 引入的新机制。在内核态处于活跃状态的线程拥有 16kb 的内核栈空间。但是在用户空间的线程的内核栈是空的。除了线程栈，还有一些与每个 CPU 有关的特殊栈。你可以查阅 linux 内核文档 - Kernel stacks 部分了解这些栈信息。 x86_64 提供了像在非屏蔽中断等类似事件中切换新的特殊栈的特性支持。这个特性的名字是 Interrupt Stack Table。每个CPU最多可以有 7 个 IST 条目，每个条目有自己特定的栈。在我们的案例中使用的是 DEBUG_STACK。

set_intr_gate_ist 和 set_system_intr_gate_ist 与 set_intr_gate 的工作原理几乎一样，只有一个区别。这些函数检查中断号并在内部调用 _set_gate ：

BUG_ON((unsigned)n > 0xFF);
_set_gate(n, GATE_INTERRUPT, addr, 0, ist, __KERNEL_CS);

其中， set_intr_gate 把 dpl 和 ist 置为 0 来调用 _set_gate。但是 set_intr_gate_ist 和 set_system_intr_gate_ist 把 ist 设置为 DEBUG_STACK，并且 set_system_intr_gate_ist 把 dpl 设置为优先级最低的 0x3。当中断发生时，硬件加载这个描述符，然后硬件根据 IST 的值自动设置新的栈指针。之后激活对应的中断处理函数。所有的特殊内核栈会在 cpu_init 函数中设置好（我们会在后文中提到）。

当 #DB 和 #BP 门向 idt_descr 有写操作，我们会调用 load_idt 函数来执行 ldtr 指令来重新加载 IDT 表。现在我们来了解下中断处理函数并尝试理解它的工作原理。当然，我们不可能在这本书中讲解所有的中断处理函数。深入学习 linux 的内核源码是很有意思的事情，我们会在这里讲解 debug 处理函数的实现。请自行学习其他的中断处理函数实现。

`#DB` 处理函数

像上文中提到的，我们在 set_intr_gate_ist 中通过 &debug 的地址传送 #DB 处理函数。lxr.free-electorns.com 是很好的用来搜索 linux 源代码中标识符的资源。遗憾的是，你在其中找不到 debug 处理函数。你只能在 arch/x86/include/asm/traps.h 中找到 debug 的定义：

asmlinkage void debug(void);

从 asmlinkage 属性我们可以知道 debug 是由 assembly 语言实现的函数。是的，又是汇编语言 :)。和其他处理函数一样，#DB 处理函数的实现可以在 arch/x86/kernel/entry_64.S 文件中找到。都是由 idtentry 汇编宏定义的：

idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

idtentry 是一个定义中断/异常指令入口点的宏。它需要五个参数：

中断条目点的名字
中断处理函数的名字
是否有中断错误码
paranoid - 如果这个参数置为 1，则切换到特殊栈
shift_ist - 支持中断期间切换栈

现在我们来看下 idtentry 宏的实现。这个宏的定义也在相同的汇编文件中，并且定义了有 ENTRY 宏属性的 debug 函数。首先，idtentry 宏检查所有的参数是否正确，是否需要切换到特殊栈。接下来检查中断返回的错误码。例如本案例中的 #DB 不会返回错误码。如果有错误码返回，它会调用 INTR_FRAME 或者 XCPT_FRAM 宏。其实 XCPT_FRAME 和 INTR_FRAME 宏什么也不会做，只是对中断初始状态编译的时候有用。它们使用 CFI 指令用来调试。你可以查阅更多有关 CFI 指令的信息 CFI。就像 arch/x86/kernel/entry_64.S 中解释：CFI 宏是用来产生更好的回溯的 dwarf2 的解开信息。它们不会改变任何代码。因此我们可以忽略它们。

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
	/* Sanity check */
	.if \shift_ist != -1 && \paranoid == 0
	.error "using shift_ist requires paranoid=1"
	.endif

	.if \has_error_code
	XCPT_FRAME
	.else
	INTR_FRAME
	.endif
	...
	...
	...

当中断发生后经过初期的中断/异常处理，我们可以知道栈内的格式是这样的：

    +-----------------------+
    |                       |
+40 |         SS            |
+32 |         RSP           |
+24 |        RFLAGS         |
+16 |         CS            |
+8  |         RIP           |
 0  |       Error Code      | <---- rsp
    |                       |
    +-----------------------+

idtentry 实现中的另外两个宏分别是

	ASM_CLAC
	PARAVIRT_ADJUST_EXCEPTION_FRAME

第一个 ASM_CLAC 宏依赖于 CONFIG_X86_SMAP 这个配置项和考虑安全因素，你可以从这里了解更多内容。第二个 PARAVIRT_EXCEPTION_FRAME 宏是用来处理 Xen 类型异常（这章只讲解内核初始化，不会考虑虚拟化的内容）。下一段代码会检查中断是否有错误码。如果没有则会把 $-1(在 x86_64 架构下值为 0xffffffffffffffff)压入栈：

	.ifeq \has_error_code
	pushq_cfi $-1
	.endif

为了保证对于所有中断的栈的一致性，我们会把它处理为 dummy 错误码。下一步我们从栈指针中减去 $ORIG_RAX-R15：

	subq $ORIG_RAX-R15, %rsp

其中，ORIG_RAX，R15 和其他宏都定义在 arch/x86/include/asm/calling.h 中。ORIG_RAX-R15 是 120 字节。我们在中断处理过程中需要把所有的寄存器信息存储在栈中，所有通用寄存器会占用这个 120 字节。为通用寄存器设置完栈之后，下一步是检查从用户空间产生的中断：

testl $3, CS(%rsp)
jnz 1f

我们查看段寄存器 CS 的前两个比特位。你应该记得 CS 寄存器包含段选择器，它的前两个比特是 RPL。所有的权限等级是0-3范围内的整数。数字越小代表权限越高。因此当中断来自内核空间，我们会调用 save_paranoid，如果不来自内核空间，我们会跳转到标签 1 处处理。在 save_paranoid 函数中，我们会把所有的通用寄存器存储到栈中，如果需要的话会用户态 gs 切换到内核态 gs：

movl $1,%ebx
	movl $MSR_GS_BASE,%ecx
	rdmsr
	testl %edx,%edx
	js 1f
	SWAPGS
	xorl %ebx,%ebx
1:	ret

下一步我们把 pt_regs 指针存在 rdi 中，如果存在错误码就把它存储到 rsi 中，然后调用中断处理函数，例如就像 arch/x86/kernel/trap.c中的 do_debug。 do_debug 像其他处理函数一样需要两个参数：

pt_regs - 是一个存储在进程内存区域的一组CPU寄存器
error code - 中断错误码

中断处理函数完成工作后会调用 paranoid_exit 还原栈区。如果中断来自用户空间则切换回用户态并调用 iret。我们会在不同的章节继续深入分析中断。这是用在 #DB 中断中的 idtentry 宏的基本介绍。所有的中断都和这个实现类似，都定义在 idtentry中。early_trap_init 执行完后，下一个函数是 early_cpu_init。这个函数定义在 arch/x86/kernel/cpu/common.c 中，负责收集 CPU 和其供应商的信息。

早期ioremap初始化

下一步是初始化早期的 ioremap。通常有两种实现与设备通信的方式：

I/O端口
设备内存

我们在 linux 内核启动过程中见过第一种方法（通过 outb/inb 指令实现）。第二种方法是把 I/O 的物理地址映射到虚拟地址。当 CPU 读取一段物理地址时，它可以读取到映射了 I/O 设备的物理 RAM 区域。 ioremap 就是用来把设备内存映射到内核地址空间的。

像我上面提到的下一个函数时 early_ioremap_init，它可以在正常的像 ioremap 这样的映射函数可用之前，把 I/O 内存映射到内核地址空间以方便读取。我们需要在初期的初始化代码中初始化临时的 ioremap 来映射 I/O 设备到内存区域。初期的 ioremap 实现在 arch/x86/mm/ioremap.c 中可以找到。在 early_ioremap_init 的一开始我们可以看到 pmd_t 类型的 pmd 指针定义（代表页中间目录条目 typedef struct {pmdval_t pmd; } pmd_t; 其中 pmdval_t 是无符号长整型）。然后检查 fixmap 是正确对齐的：

pmd_t *pmd;
BUILD_BUG_ON((fix_to_virt(0) + PAGE_SIZE) & ((1 << PMD_SHIFT) - 1));

fixmap - 是一段从 FIXADDR_START 到 FIXADDR_TOP 的固定虚拟地址映射区域。它在子系统需要知道虚拟地址的编译过程中会被使用。之后 early_ioremap_init 函数会调用 mm/early_ioremap.c 中的 early_ioremap_setup 函数。 early_ioremap_setup 会填充512个临时的启动时固定映射表来完成无符号长整型矩阵 slot_virt 的初始化：

for (i = 0; i < FIX_BTMAPS_SLOTS; i++)
    slot_virt[i] = __fix_to_virt(FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*i);

之后我们就获得了 FIX_BTMAP_BEGIN 的页中间目录条目，并把它赋值给了 pmd 变量，把启动时间页表 bm_pte 写满 0。然后调用 pmd_populate_kernel 函数设置给定的页中间目录的页表条目：

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
pmd_populate_kernel(&init_mm, pmd, bm_pte);

这就是所有过程。如果你仍然觉得困惑，不要担心。在内核内存管理，第二部分章节会有单独一部分讲解 ioremap 和 fixmaps。

获取根设备的主次设备号

ioremap 初始化完成后，紧接着是执行下面的代码：

ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);

这段代码用来获取根设备的主次设备号。后面 initrd 会通过 do_mount_root 函数挂载到这个根设备上。其中主设备号用来识别和这个设备有关的驱动。次设备号用来表示使用该驱动的各设备。注意 old_decode_dev 函数是从 boot_params_structure 中获取了一个参数。我们可以从 x86 linux 内核启动协议中查到：

Field name:	root_dev
Type:		modify (optional)
Offset/size:	0x1fc/2
Protocol:	ALL

  The default root device device number.  The use of this field is
  deprecated, use the "root=" option on the command line instead

现在我们来看看 old_decode_dev 如何实现的。实际上它只是根据主次设备号调用了 MKDEV 来生成一个 dev_t 类型的设备。它的实现很简单：

static inline dev_t old_decode_dev(u16 val)
{
         return MKDEV((val >> 8) & 255, val & 255);
}

其中 dev_t 是用来表示主/次设备号对的一个内核数据类型。但是这个奇怪的 old 前缀代表了什么呢？出于历史原因，有两种管理主次设备号的方法。第一种方法主次设备号占用 2 字节。你可以在以前的代码中发现：主设备号占用 8 bit，次设备号占用 8 bit。但是这会引入一个问题：最多只能支持 256 个主设备号和 256 个次设备号。因此后来引入了 32 bit 来表示主次设备号，其中 12 位用来表示主设备号，20 位用来表示次设备号。你可以在 new_decode_dev 的实现中找到：

static inline dev_t new_decode_dev(u32 dev)
{
         unsigned major = (dev & 0xfff00) >> 8;
         unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
         return MKDEV(major, minor);
}

如果 dev 的值是 0xffffffff，经过计算我们可以得到用来表示主设备号的 12 位值 0xfff，表示次设备号的20位值 0xfffff。因此经过 old_decode_dev 我们最终可以得到在 ROOT_DEV 中根设备的主次设备号。

Memory Map设置

下一步是调用 setup_memory_map 函数设置内存映射。但是在这之前我们需要设置与显示屏有关的参数（目前有行、列，视频页等，你可以在显示模式初始化和进入保护模式中了解），与拓展显示识别数据，视频模式，引导启动器类型等参数：

screen_info = boot_params.screen_info;
	edid_info = boot_params.edid_info;
	saved_video_mode = boot_params.hdr.vid_mode;
	bootloader_type = boot_params.hdr.type_of_loader;
	if ((bootloader_type >> 4) == 0xe) {
		bootloader_type &= 0xf;
		bootloader_type |= (boot_params.hdr.ext_loader_type+0x10) << 4;
	}
	bootloader_version  = bootloader_type & 0xf;
	bootloader_version |= boot_params.hdr.ext_loader_ver << 4;

我们可以从启动时候存储在 boot_params 结构中获取这些参数信息。之后我们需要设置 I/O 内存。众所周知，内核主要做的工作就是资源管理。其中一个资源就是内存。我们也知道目前有通过 I/O 口和设备内存两种方法实现设备通信。所有有关注册资源的信息可以通过 /proc/ioports 和 /proc/iomem 获得：

/proc/ioports - 提供用于设备输入输出通信的一租注册端口区域
/proc/iomem - 提供每个物理设备的系统内存映射地址我们先来看下 /proc/iomem：

cat /proc/iomem
00000000-00000fff : reserved
00001000-0009d7ff : System RAM
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cffff : Video ROM
000d0000-000d3fff : PCI Bus 0000:00
000d4000-000d7fff : PCI Bus 0000:00
000d8000-000dbfff : PCI Bus 0000:00
000dc000-000dffff : PCI Bus 0000:00
000e0000-000fffff : reserved
  000e0000-000e3fff : PCI Bus 0000:00
  000e4000-000e7fff : PCI Bus 0000:00
  000f0000-000fffff : System ROM

可以看到，根据不同属性划分为以十六进制符号表示的一段地址范围。linux 内核提供了用来管理所有资源的一种通用 API。全局资源（比如 PICs 或者 I/O 端口）可以划分为与硬件总线插槽有关的子集。 resource 的主要结构是：

struct resource {
        resource_size_t start;
        resource_size_t end;
        const char *name;
        unsigned long flags;
        struct resource *parent, *sibling, *child;
};

例如下图中的树形系统资源子集示例。这个结构提供了资源占用的从 start 到 end 的地址范围（resource_size_t 是 phys_addr_t 类型，在 x86_64 架构上是 u64）。资源名（你可以在 /proc/iomem 输出中看到），资源标记（所有的资源标记定义在 include/linux/ioport.h 文件中）。最后三个是资源结构体指针，如下图所示：

+-------------+      +-------------+
|             |      |             |
|    parent   |------|    sibling  |
|             |      |             |
+-------------+      +-------------+
       |
       |
+-------------+
|             |
|    child    | 
|             |
+-------------+

每个资源子集有自己的根范围资源。iomem 的资源 iomem_resource 的定义是：

struct resource iomem_resource = {
        .name   = "PCI mem",
        .start  = 0,
        .end    = -1,
        .flags  = IORESOURCE_MEM,
};
EXPORT_SYMBOL(iomem_resource);
TODO EXPORT_SYMBOL

iomem_resource 利用 PCI mem 名字和 IORESOURCE_MEM (0x00000200) 标记定义了 io 内存的根地址范围。就像上文提到的，我们目前的目的是设置 iomem 的结束地址，我们需要这样做：

iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;

我们对1左移 boot_cpu_data.x86_phys_bits。boot_cpu_data 是我们在执行 early_cpu_init 的时候初始化的 cpuinfo_x86 结构。从字面理解，x86_phys_bits 代表系统可达到的最大内存地址时需要的比特数。另外，iomem_resource 是通过 EXPORT_SYMBOL 宏传递的。这个宏可以把指定的符号（例如 iomem_resource）做动态链接。换句话说，它可以支持动态加载模块的时候访问对应符号。设置完根 iomem 的资源地址范围的结束地址后，下一步就是设置内存映射。它通过调用 setup_memory_map 函数实现：

void __init setup_memory_map(void)
{
        char *who;

        who = x86_init.resources.memory_setup();
        memcpy(&e820_saved, &e820, sizeof(struct e820map));
        printk(KERN_INFO "e820: BIOS-provided physical RAM map:\n");
        e820_print_map(who);
}

首先，我们来看下 x86_init.resources.memory_setup。x86_init 是一种 x86_init_ops 类型的结构体，用来表示项资源初始化，pci 初始化平台特定的一些设置函数。 x86_init 的初始化实现在 arch/x86/kernel/x86_init.c 文件中。我不会全部解释这个初始化过程，因为我们只关心一个地方：

struct x86_init_ops x86_init __initdata = {
	.resources = {
            .probe_roms             = probe_roms,
            .reserve_resources      = reserve_standard_io_resources,
            .memory_setup           = default_machine_specific_memory_setup,
    },
    ...
    ...
    ...
}

我们可以看到，这里的 memory_setup 赋值为 default_machine_specific_memory_setup，它是我们在对内核启动过程中的所有 e820 条目经过整理和把内存分区填入 e820map 结构体中获得的。所有收集的内存分区会用 printk 打印出来。你可以通过运行 dmesg 命令找到类似于下面的信息：

[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009d7ff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009d800-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000be825fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000be826000-0x00000000be82cfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000be82d000-0x00000000bf744fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000bf745000-0x00000000bfff4fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000bfff5000-0x00000000dc041fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc042000-0x00000000dc0d2fff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000dc0d3000-0x00000000dc138fff] usable
[    0.000000] BIOS-e820: [mem 0x00000000dc139000-0x00000000dc27dfff] ACPI NVS
[    0.000000] BIOS-e820: [mem 0x00000000dc27e000-0x00000000deffefff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000defff000-0x00000000deffffff] usable
...
...
...

复制 `BIOS` 增强磁盘设备信息

下面两部是通过 parse_setup_data 函数解析 setup_data，并且把 BIOS 的 EDD 信息复制到安全的地方。 setup_data 是内核启动头中包含的字段，我们可以在 x86 的启动协议中了解：

Field name:	setup_data
Type:		write (special)
Offset/size:	0x250/8
Protocol:	2.09+

  The 64-bit physical pointer to NULL terminated single linked list of
  struct setup_data. This is used to define a more extensible boot  
  parameters passing mechanism.

它用来存储不同类型的设置信息，例如设备树 blob，EFI 设置数据等等。第二步是从 boot_params 结构中复制我们在 arch/x86/boot/edd.c 中 BIOS 的 EDD 信息到 edd 结构中。

static inline void __init copy_edd(void)
{
     memcpy(edd.mbr_signature, boot_params.edd_mbr_sig_buffer,
            sizeof(edd.mbr_signature));
     memcpy(edd.edd_info, boot_params.eddbuf, sizeof(edd.edd_info));
     edd.mbr_signature_nr = boot_params.edd_mbr_sig_buf_entries;
     edd.edd_info_nr = boot_params.eddbuf_entries;
}

内存描述符初始化

下一步是在初始化阶段完成内存描述符的初始化。我们知道每个进程都有自己的运行内存地址空间。通过调用 memory descriptor 可以看到这些特殊数据结构。在 linux 内核源码中内存描述符是用 mm_struct 结构体表示的。mm_struct 包含许多不同的与进程地址空间有关的字段，像内核代码/数据段的起始和结束地址， brk 的起始和结束，内存区域的数量，内存区域列表等。这些结构定义在 include/linux/mm_types.h 中。task_struct 结构的 mm 和 active_mm 字段包含了每个进程自己的内存描述符。我们的第一个 init 进程也有自己的内存描述符。在之前的章节我们看到过通过 INIT_TASK 宏实现 task_struct 的部分初始化信息：

#define INIT_TASK(tsk)  \
{
    ...
	...
	...
	.mm = NULL,         \
    .active_mm  = &init_mm, \
	...
}

mm 指向进程地址空间，active_mm 指向像内核线程这样子不存在地址空间的有效地址空间（你可以在这个文档中了解更多内容）。接下来我们在初始化阶段完成内存描述符中内核代码段，数据段和 brk 段的初始化：

    init_mm.start_code = (unsigned long) _text;
	init_mm.end_code = (unsigned long) _etext;
	init_mm.end_data = (unsigned long) _edata;
	init_mm.brk = _brk_end;

init_mm 是初始化阶段的内存描述符定义：

struct mm_struct init_mm = {
    .mm_rb          = RB_ROOT,
    .pgd            = swapper_pg_dir,
    .mm_users       = ATOMIC_INIT(2),
    .mm_count       = ATOMIC_INIT(1),
    .mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
    .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
    INIT_MM_CONTEXT(init_mm)
};

其中 mm_rb 是虚拟内存区域的红黑树结构，pgd 是全局页目录的指针，mm_user 是使用该内存空间的进程数目，mm_count 是主引用计数，mmap_sem 是内存区域信号量。在初始化阶段完成内存描述符的设置后，下一步是通过 mpx_mm_init 完成 Intel 内存保护扩展的初始化。下一步是代码/数据/bss 资源的初始化：

code_resource.start = __pa_symbol(_text);
	code_resource.end = __pa_symbol(_etext)-1;
	data_resource.start = __pa_symbol(_etext);
	data_resource.end = __pa_symbol(_edata)-1;
	bss_resource.start = __pa_symbol(__bss_start);
	bss_resource.end = __pa_symbol(__bss_stop)-1;

通过上面我们已经知道了一小部分关于 resource 结构体的样子。在这里，我们把物理地址段赋值给代码/数据/bss 段。你可以在 /proc/iomem 中看到：

00100000-be825fff : System RAM
  01000000-015bb392 : Kernel code
  015bb393-01930c3f : Kernel data
  01a11000-01ac3fff : Kernel bss
在 [arch/x86/kernel/setup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/kernel/setup.c) 中有所有这些结构体的定义：
static struct resource code_resource = {
	.name	= "Kernel code",
	.start	= 0,
	.end	= 0,
	.flags	= IORESOURCE_BUSY | IORESOURCE_MEM
};

本章节涉及的最后一部分就是 NX 配置。NX-bit 或者 no-execute 位是页目录条目的第 63 比特位。它的作用是控制被映射的物理页面是否具有执行代码的能力。这个比特位只会在通过把 EFER.NXE 置为1使能 no-execute 页保护机制的时候被使用/设置。在 x86_configure_nx 函数中会检查 CPU 是否支持 NX-bit，以及是否被禁用。经过检查后，我们会根据结果给 _supported_pte_mask 赋值：

void x86_configure_nx(void)
{
        if (cpu_has_nx && !disable_nx)
                __supported_pte_mask |= _PAGE_NX;
        else
                __supported_pte_mask &= ~_PAGE_NX;
}

结论

以上是 linux 内核初始化过程的第五部分。在这一章我们讲解了有关架构初始化的 setup_arch 函数。内容很多，但是我们还没有学习完。其中，setup_arch 是一个很复杂的函数，甚至我不确定我们能在以后的章节中讲完它的所有内容。在这一章节中有一些很有趣的概念像 Fix-mapped 地址，ioremap 等等。如果没听明白也不用担心，在内核内存管理，第二部分还会有更详细的解释。在下一章节我们会继续讲解有关结构初始化的东西，以及初期内核参数的解析，pci 设备的早期转存，直接媒体接口扫描等等。

如果你有任何问题或者建议，你可以留言，也可以直接发送消息给我twitter。

很抱歉，英语并不是我的母语，非常抱歉给您阅读带来不便，如果你发现文中描述有任何问题，清提交一个 PR 到 linux-insides。

链接

Kernel initialization. Part 6.

Architecture-specific initialization, again...

In the previous part we saw architecture-specific (x86_64 in our case) initialization stuff from the arch/x86/kernel/setup.c and finished on x86_configure_nx function which sets the _PAGE_NX flag depends on support of NX bit. As I wrote before setup_arch function and start_kernel are very big, so in this and in the next part we will continue to learn about architecture-specific initialization process. The next function after x86_configure_nx is parse_early_param. This function is defined in the init/main.c and as you can understand from its name, this function parses kernel command line and setups different services depends on the given parameters (all kernel command line parameters you can find are in the Documentation/kernel-parameters.txt). You may remember how we setup earlyprintk in the earliest part. On the early stage we looked for kernel parameters and their value with the cmdline_find_option function and __cmdline_find_option, __cmdline_find_option_bool helpers from the arch/x86/boot/cmdline.c. There we're in the generic kernel part which does not depend on architecture and here we use another approach. If you are reading linux kernel source code, you already note calls like this:

early_param("gbpages", parse_direct_gbpages_on);

early_param macro takes two parameters:

command line parameter name;
function which will be called if given parameter is passed.

and defined as:

#define early_param(str, fn) \
        __setup_param(str, fn, fn, 1)

in the include/linux/init.h. As you can see early_param macro just makes call of the __setup_param macro:

#define __setup_param(str, unique_id, fn, early)                \
        static const char __setup_str_##unique_id[] __initconst \
                __aligned(1) = str; \
        static struct obs_kernel_param __setup_##unique_id      \
                __used __section(.init.setup)                   \
                __attribute__((aligned((sizeof(long)))))        \
                = { __setup_str_##unique_id, fn, early }

This macro defines __setup_str_*_id variable (where * depends on given function name) and assigns it to the given command line parameter name. In the next line we can see definition of the __setup_* variable which type is obs_kernel_param and its initialization. obs_kernel_param structure defined as:

struct obs_kernel_param {
        const char *str;
        int (*setup_func)(char *);
        int early;
};

and contains three fields:

name of the kernel parameter;
function which setups something depend on parameter;
field determines is parameter early (1) or not (0).

Note that __set_param macro defines with __section(.init.setup) attribute. It means that all __setup_str_* will be placed in the .init.setup section, moreover, as we can see in the include/asm-generic/vmlinux.lds.h, they will be placed between __setup_start and __setup_end:

#define INIT_SETUP(initsetup_align)                \
                . = ALIGN(initsetup_align);        \
                VMLINUX_SYMBOL(__setup_start) = .; \
                *(.init.setup)                     \
                VMLINUX_SYMBOL(__setup_end) = .;

Now we know how parameters are defined, let's back to the parse_early_param implementation:

void __init parse_early_param(void)
{
        static int done __initdata;
        static char tmp_cmdline[COMMAND_LINE_SIZE] __initdata;

        if (done)
                return;

        /* All fall through to do_early_param. */
        strlcpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
        parse_early_options(tmp_cmdline);
        done = 1;
}

The parse_early_param function defines two static variables. First done check that parse_early_param already called and the second is temporary storage for kernel command line. After this we copy boot_command_line to the temporary command line which we just defined and call the parse_early_options function from the same source code main.c file. parse_early_options calls the parse_args function from the kernel/params.c where parse_args parses given command line and calls do_early_param function. This function goes from the __setup_start to __setup_end, and calls the function from the obs_kernel_param if a parameter is early. After this all services which are depend on early command line parameters were setup and the next call after the parse_early_param is x86_report_nx. As I wrote in the beginning of this part, we already set NX-bit with the x86_configure_nx. The next x86_report_nx function from the arch/x86/mm/setup_nx.c just prints information about the NX. Note that we call x86_report_nx not right after the x86_configure_nx, but after the call of the parse_early_param. The answer is simple: we call it after the parse_early_param because the kernel support noexec parameter:

noexec		[X86]
			On X86-32 available only on PAE configured kernels.
			noexec=on: enable non-executable mappings (default)
			noexec=off: disable non-executable mappings

We can see it in the booting time:

After this we can see call of the:

	memblock_x86_reserve_range_setup_data();

function. This function is defined in the same arch/x86/kernel/setup.c source code file and remaps memory for the setup_data and reserved memory block for the setup_data (more about setup_data you can read in the previous part and about ioremap and memblock you can read in the Linux kernel memory management).

In the next step we can see following conditional statement:

	if (acpi_mps_check()) {
#ifdef CONFIG_X86_LOCAL_APIC
		disable_apic = 1;
#endif
		setup_clear_cpu_cap(X86_FEATURE_APIC);
	}

The first acpi_mps_check function from the arch/x86/kernel/acpi/boot.c depends on CONFIG_X86_LOCAL_APIC and CONFIG_x86_MPPARSE configuration options:

int __init acpi_mps_check(void)
{
#if defined(CONFIG_X86_LOCAL_APIC) && !defined(CONFIG_X86_MPPARSE)
        /* mptable code is not built-in*/
        if (acpi_disabled || acpi_noirq) {
                printk(KERN_WARNING "MPS support code is not built-in.\n"
                       "Using acpi=off or acpi=noirq or pci=noacpi "
                       "may have problem\n");
                 return 1;
        }
#endif
        return 0;
}

It checks the built-in MPS or MultiProcessor Specification table. If CONFIG_X86_LOCAL_APIC is set and CONFIG_x86_MPPAARSE is not set, acpi_mps_check prints warning message if the one of the command line options: acpi=off, acpi=noirq or pci=noacpi passed to the kernel. If acpi_mps_check returns 1 it means that we disable local APIC and clear X86_FEATURE_APIC bit in the of the current CPU with the setup_clear_cpu_cap macro. (more about CPU mask you can read in the CPU masks).

Early PCI dump

In the next step we make a dump of the PCI devices with the following code:

#ifdef CONFIG_PCI
	if (pci_early_dump_regs)
		early_dump_pci_devices();
#endif

pci_early_dump_regs variable defined in the arch/x86/pci/common.c and its value depends on the kernel command line parameter: pci=earlydump. We can find definition of this parameter in the drivers/pci/pci.c:

early_param("pci", pci_setup);

pci_setup function gets the string after the pci= and analyzes it. This function calls pcibios_setup which defined as __weak in the drivers/pci/pci.c and every architecture defines the same function which overrides __weak analog. For example x86_64 architecture-dependent version is in the arch/x86/pci/common.c:

char *__init pcibios_setup(char *str) {
        ...
		...
		...
		} else if (!strcmp(str, "earlydump")) {
                pci_early_dump_regs = 1;
                return NULL;
        }
		...
		...
		...
}

So, if CONFIG_PCI option is set and we passed pci=earlydump option to the kernel command line, next function which will be called - early_dump_pci_devices from the arch/x86/pci/early.c. This function checks noearly pci parameter with:

if (!early_pci_allowed())
        return;

and returns if it was passed. Each PCI domain can host up to 256 buses and each bus hosts up to 32 devices. So, we goes in a loop:

for (bus = 0; bus < 256; bus++) {
                for (slot = 0; slot < 32; slot++) {
                        for (func = 0; func < 8; func++) {
						...
						...
						...
                        }
                }
}

and read the pci config with the read_pci_config function.

That's all. We will not go deep in the pci details, but will see more details in the special Drivers/PCI part.

Finish with memory parsing

After the early_dump_pci_devices, there are a couple of function related with available memory and e820 which we collected in the First steps in the kernel setup part:

	/* update the e820_saved too */
	e820_reserve_setup_data();
	finish_e820_parsing();
	...
	...
	...
	e820_add_kernel_range();
	trim_bios_range(void);
	max_pfn = e820_end_of_ram_pfn();
	early_reserve_e820_mpc_new();

Let's look on it. As you can see the first function is e820_reserve_setup_data. This function does almost the same as memblock_x86_reserve_range_setup_data which we saw above, but it also calls e820_update_range which adds new regions to the e820map with the given type which is E820_RESERVED_KERN in our case. The next function is finish_e820_parsing which sanitizes e820map with the sanitize_e820_map function. Besides this two functions we can see a couple of functions related to the e820. You can see it in the listing above. e820_add_kernel_range function takes the physical address of the kernel start and end:

u64 start = __pa_symbol(_text);
u64 size = __pa_symbol(_end) - start;

checks that .text .data and .bss marked as E820RAM in the e820map and prints the warning message if not. The next function trm_bios_range update first 4096 bytes in e820Map as E820_RESERVED and sanitizes it again with the call of the sanitize_e820_map. After this we get the last page frame number with the call of the e820_end_of_ram_pfn function. Every memory page has an unique number - Page frame number and e820_end_of_ram_pfn function returns the maximum with the call of the e820_end_pfn:

unsigned long __init e820_end_of_ram_pfn(void)
{
	return e820_end_pfn(MAX_ARCH_PFN);
}

where e820_end_pfn takes maximum page frame number on the certain architecture (MAX_ARCH_PFN is 0x400000000 for x86_64). In the e820_end_pfn we go through the all e820 slots and check that e820 entry has E820_RAM or E820_PRAM type because we calculate page frame numbers only for these types, gets the base address and end address of the page frame number for the current e820 entry and makes some checks for these addresses:

for (i = 0; i < e820.nr_map; i++) {
		struct e820entry *ei = &e820.map[i];
		unsigned long start_pfn;
		unsigned long end_pfn;

		if (ei->type != E820_RAM && ei->type != E820_PRAM)
			continue;

		start_pfn = ei->addr >> PAGE_SHIFT;
		end_pfn = (ei->addr + ei->size) >> PAGE_SHIFT;

        if (start_pfn >= limit_pfn)
			continue;
		if (end_pfn > limit_pfn) {
			last_pfn = limit_pfn;
			break;
		}
		if (end_pfn > last_pfn)
			last_pfn = end_pfn;
}

	if (last_pfn > max_arch_pfn)
		last_pfn = max_arch_pfn;

	printk(KERN_INFO "e820: last_pfn = %#lx max_arch_pfn = %#lx\n",
			 last_pfn, max_arch_pfn);
	return last_pfn;

After this we check that last_pfn which we got in the loop is not greater that maximum page frame number for the certain architecture (x86_64 in our case), print information about last page frame number and return it. We can see the last_pfn in the dmesg output:

...
[    0.000000] e820: last_pfn = 0x41f000 max_arch_pfn = 0x400000000
...

After this, as we have calculated the biggest page frame number, we calculate max_low_pfn which is the biggest page frame number in the low memory or bellow first 4 gigabytes. If installed more than 4 gigabytes of RAM, max_low_pfn will be result of the e820_end_of_low_ram_pfn function which does the same e820_end_of_ram_pfn but with 4 gigabytes limit, in other way max_low_pfn will be the same as max_pfn:

if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
	max_low_pfn = e820_end_of_low_ram_pfn();
else
	max_low_pfn = max_pfn;
		
high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1;

Next we calculate high_memory (defines the upper bound on direct map memory) with __va macro which returns a virtual address by the given physical memory.

DMI scanning

The next step after manipulations with different memory regions and e820 slots is collecting information about computer. We will get all information with the Desktop Management Interface and following functions:

dmi_scan_machine();
dmi_memdev_walk();

First is dmi_scan_machine defined in the drivers/firmware/dmi_scan.c. This function goes through the System Management BIOS structures and extracts information. There are two ways specified to gain access to the SMBIOS table: get the pointer to the SMBIOS table from the EFI's configuration table and scanning the physical memory between 0xF0000 and 0x10000 addresses. Let's look on the second approach. dmi_scan_machine function remaps memory between 0xf0000 and 0x10000 with the dmi_early_remap which just expands to the early_ioremap:

void __init dmi_scan_machine(void)
{
	char __iomem *p, *q;
	char buf[32];
	...
	...
	...
	p = dmi_early_remap(0xF0000, 0x10000);
	if (p == NULL)
			goto error;

and iterates over all DMI header address and find search _SM_ string:

memset(buf, 0, 16);
for (q = p; q < p + 0x10000; q += 16) {
		memcpy_fromio(buf + 16, q, 16);
		if (!dmi_smbios3_present(buf) || !dmi_present(buf)) {
			dmi_available = 1;
			dmi_early_unmap(p, 0x10000);
			goto out;
		}
		memcpy(buf, buf + 16, 16);
}

_SM_ string must be between 000F0000h and 0x000FFFFF. Here we copy 16 bytes to the buf with memcpy_fromio which is the same memcpy and execute dmi_smbios3_present and dmi_present on the buffer. These functions check that first 4 bytes is _SM_ string, get SMBIOS version and gets _DMI_ attributes as DMI structure table length, table address and etc... After one of these functions finish, you will see the result of it in the dmesg output:

[    0.000000] SMBIOS 2.7 present.
[    0.000000] DMI: Gigabyte Technology Co., Ltd. Z97X-UD5H-BK/Z97X-UD5H-BK, BIOS F6 06/17/2014

In the end of the dmi_scan_machine, we unmap the previously remapped memory:

dmi_early_unmap(p, 0x10000);

The second function is - dmi_memdev_walk. As you can understand it goes over memory devices. Let's look on it:

void __init dmi_memdev_walk(void)
{
	if (!dmi_available)
		return;

	if (dmi_walk_early(count_mem_devices) == 0 && dmi_memdev_nr) {
		dmi_memdev = dmi_alloc(sizeof(*dmi_memdev) * dmi_memdev_nr);
		if (dmi_memdev)
			dmi_walk_early(save_mem_devices);
	}
}

It checks that DMI available (we got it in the previous function - dmi_scan_machine) and collects information about memory devices with dmi_walk_early and dmi_alloc which defined as:

#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif

RESERVE_BRK defined in the arch/x86/include/asm/setup.h and reserves space with given size in the brk section.

init_hypervisor_platform();
x86_init.resources.probe_roms();
insert_resource(&iomem_resource, &code_resource);
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);
early_gart_iommu_check();

SMP config

The next step is parsing of the SMP configuration. We do it with the call of the find_smp_config function which just calls function:

static inline void find_smp_config(void)
{
        x86_init.mpparse.find_smp_config();
}

inside. x86_init.mpparse.find_smp_config is the default_find_smp_config function from the arch/x86/kernel/mpparse.c. In the default_find_smp_config function we are scanning a couple of memory regions for SMP config and return if they are found:

if (smp_scan_config(0x0, 0x400) ||
            smp_scan_config(639 * 0x400, 0x400) ||
            smp_scan_config(0xF0000, 0x10000))
            return;

First of all smp_scan_config function defines a couple of variables:

unsigned int *bp = phys_to_virt(base);
struct mpf_intel *mpf;

First is virtual address of the memory region where we will scan SMP config, second is the pointer to the mpf_intel structure. Let's try to understand what is it mpf_intel. All information stores in the multiprocessor configuration data structure. mpf_intel presents this structure and looks:

struct mpf_intel {
        char signature[4];
        unsigned int physptr;
        unsigned char length;
        unsigned char specification;
        unsigned char checksum;
        unsigned char feature1;
        unsigned char feature2;
        unsigned char feature3;
        unsigned char feature4;
        unsigned char feature5;
};

As we can read in the documentation - one of the main functions of the system BIOS is to construct the MP floating pointer structure and the MP configuration table. And operating system must have access to this information about the multiprocessor configuration and mpf_intel stores the physical address (look at second parameter) of the multiprocessor configuration table. So, smp_scan_config going in a loop through the given memory range and tries to find MP floating pointer structure there. It checks that current byte points to the SMP signature, checks checksum, checks if mpf->specification is 1 or 4(it must be 1 or 4 by specification) in the loop:

while (length > 0) {
if ((*bp == SMP_MAGIC_IDENT) &&
    (mpf->length == 1) &&
    !mpf_checksum((unsigned char *)bp, 16) &&
    ((mpf->specification == 1)
    || (mpf->specification == 4))) {

        mem = virt_to_phys(mpf);
        memblock_reserve(mem, sizeof(*mpf));
        if (mpf->physptr)
            smp_reserve_memory(mpf);
	}
}

reserves given memory block if search is successful with memblock_reserve and reserves physical address of the multiprocessor configuration table. You can find documentation about this in the - MultiProcessor Specification. You can read More details in the special part about SMP.

Additional early memory initialization routines

In the next step of the setup_arch we can see the call of the early_alloc_pgt_buf function which allocates the page table buffer for early stage. The page table buffer will be placed in the brk area. Let's look on its implementation:

void  __init early_alloc_pgt_buf(void)
{
        unsigned long tables = INIT_PGT_BUF_SIZE;
        phys_addr_t base;

        base = __pa(extend_brk(tables, PAGE_SIZE));

        pgt_buf_start = base >> PAGE_SHIFT;
        pgt_buf_end = pgt_buf_start;
        pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}

First of all it get the size of the page table buffer, it will be INIT_PGT_BUF_SIZE which is (6 * PAGE_SIZE) in the current linux kernel 4.0. As we got the size of the page table buffer, we call extend_brk function with two parameters: size and align. As you can understand from its name, this function extends the brk area. As we can see in the linux kernel linker script brk is in memory right after the BSS:

	. = ALIGN(PAGE_SIZE);
	.brk : AT(ADDR(.brk) - LOAD_OFFSET) {
		__brk_base = .;
		. += 64 * 1024;		/* 64k alignment slop space */
		*(.brk_reservation)	/* areas brk users have reserved */
		__brk_limit = .;
	}

Or we can find it with readelf util:

After that we got physical address of the new brk with the __pa macro, we calculate the base address and the end of the page table buffer. In the next step as we got page table buffer, we reserve memory block for the brk area with the reserve_brk function:

static void __init reserve_brk(void)
{
	if (_brk_end > _brk_start)
		memblock_reserve(__pa_symbol(_brk_start),
				 _brk_end - _brk_start);

	_brk_start = 0;
}

Note that in the end of the reserve_brk, we set brk_start to zero, because after this we will not allocate it anymore. The next step after reserving memory block for the brk, we need to unmap out-of-range memory areas in the kernel mapping with the cleanup_highmap function. Remember that kernel mapping is __START_KERNEL_map and _end - _text or level2_kernel_pgt maps the kernel _text, data and bss. In the start of the clean_high_map we define these parameters:

unsigned long vaddr = __START_KERNEL_map;
unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;

Now, as we defined start and end of the kernel mapping, we go in the loop through the all kernel page middle directory entries and clean entries which are not between _text and end:

for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
        if (pmd_none(*pmd))
            continue;
        if (vaddr < (unsigned long) _text || vaddr > end)
            set_pmd(pmd, __pmd(0));
}

After this we set the limit for the memblock allocation with the memblock_set_current_limit function (read more about memblock you can in the Linux kernel memory management Part 2), it will be ISA_END_ADDRESS or 0x100000 and fill the memblock information according to e820 with the call of the memblock_x86_fill function. You can see the result of this function in the kernel initialization time:

MEMBLOCK configuration:
 memory size = 0x1fff7ec00 reserved size = 0x1e30000
 memory.cnt  = 0x3
 memory[0x0]	[0x00000000001000-0x0000000009efff], 0x9e000 bytes flags: 0x0
 memory[0x1]	[0x00000000100000-0x000000bffdffff], 0xbfee0000 bytes flags: 0x0
 memory[0x2]	[0x00000100000000-0x0000023fffffff], 0x140000000 bytes flags: 0x0
 reserved.cnt  = 0x3
 reserved[0x0]	[0x0000000009f000-0x000000000fffff], 0x61000 bytes flags: 0x0
 reserved[0x1]	[0x00000001000000-0x00000001a57fff], 0xa58000 bytes flags: 0x0
 reserved[0x2]	[0x0000007ec89000-0x0000007fffffff], 0x1377000 bytes flags: 0x0

The rest functions after the memblock_x86_fill are: early_reserve_e820_mpc_new allocates additional slots in the e820map for MultiProcessor Specification table, reserve_real_mode - reserves low memory from 0x0 to 1 megabyte for the trampoline to the real mode (for rebooting, etc.), trim_platform_memory_ranges - trims certain memory regions started from 0x20050000, 0x20110000, etc. these regions must be excluded because Sandy Bridge has problems with these regions, trim_low_memory_range reserves the first 4 kilobyte page in memblock, init_mem_mapping function reconstructs direct memory mapping and setups the direct mapping of the physical memory at PAGE_OFFSET, early_trap_pf_init setups #PF handler (we will look on it in the chapter about interrupts) and setup_real_mode function setups trampoline to the real mode code.

That's all. You can note that this part will not cover all functions which are in the setup_arch (like early_gart_iommu_check, mtrr initialization, etc.). As I already wrote many times, setup_arch is big, and linux kernel is big. That's why I can't cover every line in the linux kernel. I don't think that we missed something important, but you can say something like: each line of code is important. Yes, it's true, but I missed them anyway, because I think that it is not realistic to cover full linux kernel. Anyway we will often return to the idea that we have already seen, and if something is unfamiliar, we will cover this theme.

Conclusion

It is the end of the sixth part about linux kernel initialization process. In this part we continued to dive in the setup_arch function again and it was long part, but we are not finished with it. Yes, setup_arch is big, hope that next part will be the last part about this function.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.

Kernel initialization. Part 7.

The End of the architecture-specific initialization, almost...

This is the seventh part of the Linux Kernel initialization process which covers insides of the setup_arch function from the arch/x86/kernel/setup.c. As you can know from the previous parts, the setup_arch function does some architecture-specific (in our case it is x86_64) initialization stuff like reserving memory for kernel code/data/bss, early scanning of the Desktop Management Interface, early dump of the PCI device and many many more. If you have read the previous part, you can remember that we've finished it at the setup_real_mode function. In the next step, as we set limit of the memblock to the all mapped pages, we can see the call of the setup_log_buf function from the kernel/printk/printk.c.

The setup_log_buf function setups kernel cyclic buffer and its length depends on the CONFIG_LOG_BUF_SHIFT configuration option. As we can read from the documentation of the CONFIG_LOG_BUF_SHIFT it can be between 12 and 21. In the insides, buffer defined as array of chars:

#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);
static char *log_buf = __log_buf;

Now let's look on the implementation of the setup_log_buf function. It starts with check that current buffer is empty (It must be empty, because we just setup it) and another check that it is early setup. If setup of the kernel log buffer is not early, we call the log_buf_add_cpu function which increase size of the buffer for every CPU:

if (log_buf != __log_buf)
    return;
 
if (!early && !new_log_buf_len)
    log_buf_add_cpu();

We will not research log_buf_add_cpu function, because as you can see in the setup_arch, we call setup_log_buf as:

setup_log_buf(1);

where 1 means that it is early setup. In the next step we check new_log_buf_len variable which is updated length of the kernel log buffer and allocate new space for the buffer with the memblock_virt_alloc function for it, or just return.

As kernel log buffer is ready, the next function is reserve_initrd. You can remember that we already called the early_reserve_initrd function in the fourth part of the Kernel initialization. Now, as we reconstructed direct memory mapping in the init_mem_mapping function, we need to move initrd into directly mapped memory. The reserve_initrd function starts from the definition of the base address and end address of the initrd and check that initrd is provided by a bootloader. All the same as what we saw in the early_reserve_initrd. But instead of the reserving place in the memblock area with the call of the memblock_reserve function, we get the mapped size of the direct memory area and check that the size of the initrd is not greater than this area with:

mapped_size = memblock_mem_size(max_pfn_mapped);
if (ramdisk_size >= (mapped_size>>1))
    panic("initrd too large to handle, "
	      "disabling initrd (%lld needed, %lld available)\n",
	      ramdisk_size, mapped_size>>1);

You can see here that we call memblock_mem_size function and pass the max_pfn_mapped to it, where max_pfn_mapped contains the highest direct mapped page frame number. If you do not remember what is page frame number, explanation is simple: First 12 bits of the virtual address represent offset in the physical page or page frame. If we right-shift out 12 bits of the virtual address, we'll discard offset part and will get Page Frame Number. In the memblock_mem_size we go through the all memblock mem (not reserved) regions and calculates size of the mapped pages and return it to the mapped_size variable (see code above). As we got amount of the direct mapped memory, we check that size of the initrd is not greater than mapped pages. If it is greater we just call panic which halts the system and prints famous Kernel panic message. In the next step we print information about the initrd size. We can see the result of this in the dmesg output:

[0.000000] RAMDISK: [mem 0x36d20000-0x37687fff]

and relocate initrd to the direct mapping area with the relocate_initrd function. In the start of the relocate_initrd function we try to find a free area with the memblock_find_in_range function:

relocated_ramdisk = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped), area_size, PAGE_SIZE);

if (!relocated_ramdisk)
    panic("Cannot find place for new RAMDISK of size %lld\n",
	       ramdisk_size);

The memblock_find_in_range function tries to find a free area in a given range, in our case from 0 to the maximum mapped physical address and size must equal to the aligned size of the initrd. If we didn't find a area with the given size, we call panic again. If all is good, we start to relocated RAM disk to the down of the directly mapped memory in the next step.

In the end of the reserve_initrd function, we free memblock memory which occupied by the ramdisk with the call of the:

memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);

After we relocated initrd ramdisk image, the next function is vsmp_init from the arch/x86/kernel/vsmp_64.c. This function initializes support of the ScaleMP vSMP. As I already wrote in the previous parts, this chapter will not cover non-related x86_64 initialization parts (for example as the current or ACPI, etc.). So we will skip implementation of this for now and will back to it in the part which cover techniques of parallel computing.

The next function is io_delay_init from the arch/x86/kernel/io_delay.c. This function allows to override default default I/O delay 0x80 port. We already saw I/O delay in the Last preparation before transition into protected mode, now let's look on the io_delay_init implementation:

void __init io_delay_init(void)
{
    if (!io_delay_override)
        dmi_check_system(io_delay_0xed_port_dmi_table);
}

This function check io_delay_override variable and overrides I/O delay port if io_delay_override is set. We can set io_delay_override variably by passing io_delay option to the kernel command line. As we can read from the Documentation/kernel-parameters.txt, io_delay option is:

io_delay=	[X86] I/O delay method
    0x80
        Standard port 0x80 based delay
    0xed
        Alternate port 0xed based delay (needed on some systems)
    udelay
        Simple two microseconds delay
    none
        No delay

We can see io_delay command line parameter setup with the early_param macro in the arch/x86/kernel/io_delay.c

early_param("io_delay", io_delay_param);

More about early_param you can read in the previous part. So the io_delay_param function which setups io_delay_override variable will be called in the do_early_param function. io_delay_param function gets the argument of the io_delay kernel command line parameter and sets io_delay_type depends on it:

static int __init io_delay_param(char *s)
{
        if (!s)
                return -EINVAL;

        if (!strcmp(s, "0x80"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_0X80;
        else if (!strcmp(s, "0xed"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_0XED;
        else if (!strcmp(s, "udelay"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_UDELAY;
        else if (!strcmp(s, "none"))
                io_delay_type = CONFIG_IO_DELAY_TYPE_NONE;
        else
                return -EINVAL;

        io_delay_override = 1;
        return 0;
}

The next functions are acpi_boot_table_init, early_acpi_boot_init and initmem_init after the io_delay_init, but as I wrote above we will not cover ACPI related stuff in this Linux Kernel initialization process chapter.

Allocate area for DMA

In the next step we need to allocate area for the Direct memory access with the dma_contiguous_reserve function which is defined in the drivers/base/dma-contiguous.c. DMA is a special mode when devices communicate with memory without CPU. Note that we pass one parameter - max_pfn_mapped << PAGE_SHIFT, to the dma_contiguous_reserve function and as you can understand from this expression, this is limit of the reserved memory. Let's look on the implementation of this function. It starts from the definition of the following variables:

phys_addr_t selected_size = 0;
phys_addr_t selected_base = 0;
phys_addr_t selected_limit = limit;
bool fixed = false;

where first represents size in bytes of the reserved area, second is base address of the reserved area, third is end address of the reserved area and the last fixed parameter shows where to place reserved area. If fixed is 1 we just reserve area with the memblock_reserve, if it is 0 we allocate space with the kmemleak_alloc. In the next step we check size_cmdline variable and if it is not equal to -1 we fill all variables which you can see above with the values from the cma kernel command line parameter:

if (size_cmdline != -1) {
   ...
   ...
   ...
}

You can find in this source code file definition of the early parameter:

early_param("cma", early_cma);

where cma is:

cma=nn[MG]@[start[MG][-end[MG]]]
		[ARM,X86,KNL]
		Sets the size of kernel global memory area for
		contiguous memory allocations and optionally the
		placement constraint by the physical address range of
		memory allocations. A value of 0 disables CMA
		altogether. For more information, see
		include/linux/dma-contiguous.h

If we will not pass cma option to the kernel command line, size_cmdline will be equal to -1. In this way we need to calculate size of the reserved area which depends on the following kernel configuration options:

CONFIG_CMA_SIZE_SEL_MBYTES - size in megabytes, default global CMA area, which is equal to CMA_SIZE_MBYTES * SZ_1M or CONFIG_CMA_SIZE_MBYTES * 1M;
CONFIG_CMA_SIZE_SEL_PERCENTAGE - percentage of total memory;
CONFIG_CMA_SIZE_SEL_MIN - use lower value;
CONFIG_CMA_SIZE_SEL_MAX - use higher value.

As we calculated the size of the reserved area, we reserve area with the call of the dma_contiguous_reserve_area function which first of all calls:

ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);

function. The cma_declare_contiguous reserves contiguous area from the given base address with given size. After we reserved area for the DMA, next function is the memblock_find_dma_reserve. As you can understand from its name, this function counts the reserved pages in the DMA area. This part will not cover all details of the CMA and DMA, because they are big. We will see much more details in the special part in the Linux Kernel Memory management which covers contiguous memory allocators and areas.

Initialization of the sparse memory

The next step is the call of the function - x86_init.paging.pagetable_init. If you try to find this function in the linux kernel source code, in the end of your search, you will see the following macro:

#define native_pagetable_init        paging_init

which expands as you can see to the call of the paging_init function from the arch/x86/mm/init_64.c. The paging_init function initializes sparse memory and zone sizes. First of all what's zones and what is it Sparsemem. The Sparsemem is a special foundation in the linux kernel memory manager which used to split memory area into different memory banks in the NUMA systems. Let's look on the implementation of the paginig_init function:

void __init paging_init(void)
{
        sparse_memory_present_with_active_regions(MAX_NUMNODES);
        sparse_init();

        node_clear_state(0, N_MEMORY);
        if (N_MEMORY != N_NORMAL_MEMORY)
                node_clear_state(0, N_NORMAL_MEMORY);

        zone_sizes_init();
}

As you can see there is call of the sparse_memory_present_with_active_regions function which records a memory area for every NUMA node to the array of the mem_section structure which contains a pointer to the structure of the array of struct page. The next sparse_init function allocates non-linear mem_section and mem_map. In the next step we clear state of the movable memory nodes and initialize sizes of zones. Every NUMA node is divided into a number of pieces which are called - zones. So, zone_sizes_init function from the arch/x86/mm/init.c initializes size of zones.

Again, this part and next parts do not cover this theme in full details. There will be special part about NUMA.

vsyscall mapping

The next step after SparseMem initialization is setting of the trampoline_cr4_features which must contain content of the cr4 Control register. First of all we need to check that current CPU has support of the cr4 register and if it has, we save its content to the trampoline_cr4_features which is storage for cr4 in the real mode:

if (boot_cpu_data.cpuid_level >= 0) {
    mmu_cr4_features = __read_cr4();
	if (trampoline_cr4_features)
	    *trampoline_cr4_features = mmu_cr4_features;
}

The next function which you can see is map_vsyscal from the arch/x86/kernel/vsyscall_64.c. This function maps memory space for vsyscalls and depends on CONFIG_X86_VSYSCALL_EMULATION kernel configuration option. Actually vsyscall is a special segment which provides fast access to the certain system calls like getcpu, etc. Let's look on implementation of this function:

void __init map_vsyscall(void)
{
        extern char __vsyscall_page;
        unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);

        if (vsyscall_mode != NONE)
                __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
                             vsyscall_mode == NATIVE
                             ? PAGE_KERNEL_VSYSCALL
                             : PAGE_KERNEL_VVAR);

        BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
                     (unsigned long)VSYSCALL_ADDR);
}

In the beginning of the map_vsyscall we can see definition of two variables. The first is extern variable __vsyscall_page. As a extern variable, it defined somewhere in other source code file. Actually we can see definition of the __vsyscall_page in the arch/x86/kernel/vsyscall_emu_64.S. The __vsyscall_page symbol points to the aligned calls of the vsyscalls as gettimeofday, etc.:

	.globl __vsyscall_page
	.balign PAGE_SIZE, 0xcc
	.type __vsyscall_page, @object
__vsyscall_page:

	mov $__NR_gettimeofday, %rax
	syscall
	ret

	.balign 1024, 0xcc
	mov $__NR_time, %rax
	syscall
	ret
    ...
    ...
    ...

The second variable is physaddr_vsyscall which just stores physical address of the __vsyscall_page symbol. In the next step we check the vsyscall_mode variable, and if it is not equal to NONE, it is EMULATE by default:

static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;

And after this check we can see the call of the __set_fixmap function which calls native_set_fixmap with the same parameters:

void native_set_fixmap(enum fixed_addresses idx, unsigned long phys, pgprot_t flags)
{
        __native_set_fixmap(idx, pfn_pte(phys >> PAGE_SHIFT, flags));
}

void __native_set_fixmap(enum fixed_addresses idx, pte_t pte)
{
        unsigned long address = __fix_to_virt(idx);

        if (idx >= __end_of_fixed_addresses) {
                BUG();
                return;
        }
        set_pte_vaddr(address, pte);
        fixmaps_set++;
}

Here we can see that native_set_fixmap makes value of Page Table Entry from the given physical address (physical address of the __vsyscall_page symbol in our case) and calls internal function - __native_set_fixmap. Internal function gets the virtual address of the given fixed_addresses index (VSYSCALL_PAGE in our case) and checks that given index is not greater than end of the fix-mapped addresses. After this we set page table entry with the call of the set_pte_vaddr function and increase count of the fix-mapped addresses. And in the end of the map_vsyscall we check that virtual address of the VSYSCALL_PAGE (which is first index in the fixed_addresses) is not greater than VSYSCALL_ADDR which is -10UL << 20 or ffffffffff600000 with the BUILD_BUG_ON macro:

BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
                     (unsigned long)VSYSCALL_ADDR);

Now vsyscall area is in the fix-mapped area. That's all about map_vsyscall, if you do not know anything about fix-mapped addresses, you can read Fix-Mapped Addresses and ioremap. We will see more about vsyscalls in the vsyscalls and vdso part.

Getting the SMP configuration

You may remember how we made a search of the SMP configuration in the previous part. Now we need to get the SMP configuration if we found it. For this we check smp_found_config variable which we set in the smp_scan_config function (read about it the previous part) and call the get_smp_config function:

if (smp_found_config)
	get_smp_config();

The get_smp_config expands to the x86_init.mpparse.default_get_smp_config function which is defined in the arch/x86/kernel/mpparse.c. This function defines a pointer to the multiprocessor floating pointer structure - mpf_intel (you can read about it in the previous part) and does some checks:

struct mpf_intel *mpf = mpf_found;

if (!mpf)
    return;
 
if (acpi_lapic && early)
   return;

Here we can see that multiprocessor configuration was found in the smp_scan_config function or just return from the function if not. The next check is acpi_lapic and early. And as we did this checks, we start to read the SMP configuration. As we finished reading it, the next step is - prefill_possible_map function which makes preliminary filling of the possible CPU's cpumask (more about it you can read in the Introduction to the cpumasks).

The rest of the setup_arch

Here we are getting to the end of the setup_arch function. The rest of function of course is important, but details about these stuff will not will not be included in this part. We will just take a short look on these functions, because although they are important as I wrote above, but they cover non-generic kernel features related with the NUMA, SMP, ACPI and APICs, etc. First of all, the next call of the init_apic_mappings function. As we can understand this function sets the address of the local APIC. The next is x86_io_apic_ops.init and this function initializes I/O APIC. Please note that we will see all details related with APIC in the chapter about interrupts and exceptions handling. In the next step we reserve standard I/O resources like DMA, TIMER, FPU, etc., with the call of the x86_init.resources.reserve_resources function. Following is mcheck_init function initializes Machine check Exception and the last is register_refined_jiffies which registers jiffy (There will be separate chapter about timers in the kernel).

So that's all. Finally we have finished with the big setup_arch function in this part. Of course as I already wrote many times, we did not see full details about this function, but do not worry about it. We will be back more than once to this function from different chapters for understanding how different platform-dependent parts are initialized.

That's all, and now we can back to the start_kernel from the setup_arch.

Back to the main.c

As I wrote above, we have finished with the setup_arch function and now we can back to the start_kernel function from the init/main.c. As you may remember or saw yourself, start_kernel function as big as the setup_arch. So the couple of the next part will be dedicated to learning of this function. So, let's continue with it. After the setup_arch we can see the call of the mm_init_cpumask function. This function sets the cpumask pointer to the memory descriptor cpumask. We can look on its implementation:

static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
        mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
        cpumask_clear(mm->cpu_vm_mask_var);
}

As you can see in the init/main.c, we pass memory descriptor of the init process to the mm_init_cpumask and depends on CONFIG_CPUMASK_OFFSTACK configuration option we clear TLB switch cpumask.

In the next step we can see the call of the following function:

setup_command_line(command_line);

This function takes pointer to the kernel command line allocates a couple of buffers to store command line. We need a couple of buffers, because one buffer used for future reference and accessing to command line and one for parameter parsing. We will allocate space for the following buffers:

saved_command_line - will contain boot command line;
initcall_command_line - will contain boot command line. will be used in the do_initcall_level;
static_command_line - will contain command line for parameters parsing.

We will allocate space with the memblock_virt_alloc function. This function calls memblock_virt_alloc_try_nid which allocates boot memory block with memblock_reserve if slab is not available or uses kzalloc_node (more about it will be in the linux memory management chapter). The memblock_virt_alloc uses BOOTMEM_LOW_LIMIT (physical address of the (PAGE_OFFSET + 0x1000000) value) and BOOTMEM_ALLOC_ACCESSIBLE (equal to the current value of the memblock.current_limit) as minimum address of the memory region and maximum address of the memory region.

Let's look on the implementation of the setup_command_line:

static void __init setup_command_line(char *command_line)
{
        saved_command_line =
                memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
        initcall_command_line =
                memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
        static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0);
        strcpy(saved_command_line, boot_command_line);
        strcpy(static_command_line, command_line);
 }

Here we can see that we allocate space for the three buffers which will contain kernel command line for the different purposes (read above). And as we allocated space, we store boot_command_line in the saved_command_line and command_line (kernel command line from the setup_arch) to the static_command_line.

The next function after the setup_command_line is the setup_nr_cpu_ids. This function setting nr_cpu_ids (number of CPUs) according to the last bit in the cpu_possible_mask (more about it you can read in the chapter describes cpumasks concept). Let's look on its implementation:

void __init setup_nr_cpu_ids(void)
{
        nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

Here nr_cpu_ids represents number of CPUs, NR_CPUS represents the maximum number of CPUs which we can set in configuration time:

Actually we need to call this function, because NR_CPUS can be greater than actual amount of the CPUs in the your computer. Here we can see that we call find_last_bit function and pass two parameters to it:

cpu_possible_mask bits;
maximum number of CPUS.

In the setup_arch we can find the call of the prefill_possible_map function which calculates and writes to the cpu_possible_mask actual number of the CPUs. We call the find_last_bit function which takes the address and maximum size to search and returns bit number of the first set bit. We passed cpu_possible_mask bits and maximum number of the CPUs. First of all the find_last_bit function splits given unsigned long address to the words:

words = size / BITS_PER_LONG;

where BITS_PER_LONG is 64 on the x86_64. As we got amount of words in the given size of the search data, we need to check is given size does not contain partial words with the following check:

if (size & (BITS_PER_LONG-1)) {
         tmp = (addr[words] & (~0UL >> (BITS_PER_LONG
                                 - (size & (BITS_PER_LONG-1)))));
         if (tmp)
                 goto found;
}

if it contains partial word, we mask the last word and check it. If the last word is not zero, it means that current word contains at least one set bit. We go to the found label:

found:
    return words * BITS_PER_LONG + __fls(tmp);

Here you can see __fls function which returns last set bit in a given word with help of the bsr instruction:

static inline unsigned long __fls(unsigned long word)
{
        asm("bsr %1,%0"
            : "=r" (word)
            : "rm" (word));
        return word;
}

The bsr instruction which scans the given operand for first bit set. If the last word is not partial we going through the all words in the given address and trying to find first set bit:

while (words) {
    tmp = addr[--words];
    if (tmp) {
found:
        return words * BITS_PER_LONG + __fls(tmp);
    }
}

Here we put the last word to the tmp variable and check that tmp contains at least one set bit. If a set bit found, we return the number of this bit. If no one words do not contains set bit we just return given size:

return size;

After this nr_cpu_ids will contain the correct amount of the available CPUs.

That's all.

Conclusion

It is the end of the seventh part about the linux kernel initialization process. In this part, finally we have finished with the setup_arch function and returned to the start_kernel function. In the next part we will continue to learn generic kernel code from the start_kernel and will continue our way to the first init process.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.

Kernel initialization. Part 8.

Scheduler initialization

This is the eighth part of the Linux kernel initialization process and we stopped on the setup_nr_cpu_ids function in the previous part. The main point of the current part is scheduler initialization. But before we will start to learn initialization process of the scheduler, we need to do some stuff. The next step in the init/main.c is the setup_per_cpu_areas function. This function setups areas for the percpu variables, more about it you can read in the special part about the Per-CPU variables. After percpu areas is up and running, the next step is the smp_prepare_boot_cpu function. This function does some preparations for the SMP:

static inline void smp_prepare_boot_cpu(void)
{
         smp_ops.smp_prepare_boot_cpu();
}

where the smp_prepare_boot_cpu expands to the call of the native_smp_prepare_boot_cpu function (more about smp_ops will be in the special parts about SMP):

void __init native_smp_prepare_boot_cpu(void)
{
        int me = smp_processor_id();
        switch_to_new_gdt(me);
        cpumask_set_cpu(me, cpu_callout_mask);
        per_cpu(cpu_state, me) = CPU_ONLINE;
}

The native_smp_prepare_boot_cpu function gets the id of the current CPU (which is Bootstrap processor and its id is zero) with the smp_processor_id function. I will not explain how the smp_processor_id works, because we already saw it in the Kernel entry point part. As we got processor id number we reload Global Descriptor Table for the given CPU with the switch_to_new_gdt function:

void switch_to_new_gdt(int cpu)
{
        struct desc_ptr gdt_descr;

        gdt_descr.address = (long)get_cpu_gdt_table(cpu);
        gdt_descr.size = GDT_SIZE - 1;
        load_gdt(&gdt_descr);
        load_percpu_segment(cpu);
}

The gdt_descr variable represents pointer to the GDT descriptor here (we already saw desc_ptr in the Early interrupt and exception handling). We get the address and the size of the GDT descriptor where GDT_SIZE is 256 or:

#define GDT_SIZE (GDT_ENTRIES * 8)

and the address of the descriptor we will get with the get_cpu_gdt_table:

static inline struct desc_struct *get_cpu_gdt_table(unsigned int cpu)
{
        return per_cpu(gdt_page, cpu).gdt;
}

The get_cpu_gdt_table uses per_cpu macro for getting gdt_page percpu variable for the given CPU number (bootstrap processor with id - 0 in our case). You may ask the following question: so, if we can access gdt_page percpu variable, where it was defined? Actually we already saw it in this book. If you have read the first part of this chapter, you can remember that we saw definition of the gdt_page in the arch/x86/kernel/head_64.S:

early_gdt_descr:
	.word	GDT_ENTRIES*8-1
early_gdt_descr_base:
	.quad	INIT_PER_CPU_VAR(gdt_page)

and if we will look on the linker file we can see that it locates after the __per_cpu_load symbol:

#define INIT_PER_CPU(x) init_per_cpu__##x = x + __per_cpu_load
INIT_PER_CPU(gdt_page);

and filled gdt_page in the arch/x86/kernel/cpu/common.c:

DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
#ifdef CONFIG_X86_64
	[GDT_ENTRY_KERNEL32_CS]		= GDT_ENTRY_INIT(0xc09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_CS]		= GDT_ENTRY_INIT(0xa09b, 0, 0xfffff),
	[GDT_ENTRY_KERNEL_DS]		= GDT_ENTRY_INIT(0xc093, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER32_CS]	= GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_DS]	= GDT_ENTRY_INIT(0xc0f3, 0, 0xfffff),
	[GDT_ENTRY_DEFAULT_USER_CS]	= GDT_ENTRY_INIT(0xa0fb, 0, 0xfffff),
    ...
    ...
    ...

more about percpu variables you can read in the Per-CPU variables part. As we got address and size of the GDT descriptor we reload GDT with the load_gdt which just execute lgdt instruct and load percpu_segment with the following function:

void load_percpu_segment(int cpu) {
    loadsegment(gs, 0);
    wrmsrl(MSR_GS_BASE, (unsigned long)per_cpu(irq_stack_union.gs_base, cpu));
    load_stack_canary_segment();
}

The base address of the percpu area must contain gs register (or fs register for x86), so we are using loadsegment macro and pass gs. In the next step we writes the base address if the IRQ stack and setup stack canary (this is only for x86_32). After we load new GDT, we fill cpu_callout_mask bitmap with the current cpu and set cpu state as online with the setting cpu_state percpu variable for the current processor - CPU_ONLINE:

cpumask_set_cpu(me, cpu_callout_mask);
per_cpu(cpu_state, me) = CPU_ONLINE;

So, what is cpu_callout_mask bitmap... As we initialized bootstrap processor (processor which is booted the first on x86) the other processors in a multiprocessor system are known as secondary processors. Linux kernel uses following two bitmasks:

cpu_callout_mask
cpu_callin_mask

After bootstrap processor initialized, it updates the cpu_callout_mask to indicate which secondary processor can be initialized next. All other or secondary processors can do some initialization stuff before and check the cpu_callout_mask on the boostrap processor bit. Only after the bootstrap processor filled the cpu_callout_mask with this secondary processor, it will continue the rest of its initialization. After that the certain processor finish its initialization process, the processor sets bit in the cpu_callin_mask. Once the bootstrap processor finds the bit in the cpu_callin_mask for the current secondary processor, this processor repeats the same procedure for initialization of one of the remaining secondary processors. In a short words it works as i described, but we will see more details in the chapter about SMP.

That's all. We did all SMP boot preparation.

Build zonelists

In the next step we can see the call of the build_all_zonelists function. This function sets up the order of zones that allocations are preferred from. What are zones and what's order we will understand soon. For the start let's see how linux kernel considers physical memory. Physical memory is split into banks which are called - nodes. If you has no hardware support for NUMA, you will see only one node:

$ cat /sys/devices/system/node/node0/numastat 
numa_hit 72452442
numa_miss 0
numa_foreign 0
interleave_hit 12925
local_node 72452442
other_node 0

Every node is presented by the struct pglist_data in the linux kernel. Each node is divided into a number of special blocks which are called - zones. Every zone is presented by the zone struct in the linux kernel and has one of the type:

ZONE_DMA - 0-16M;
ZONE_DMA32 - used for 32 bit devices that can only do DMA areas below 4G;
ZONE_NORMAL - all RAM from the 4GB on the x86_64;
ZONE_HIGHMEM - absent on the x86_64;
ZONE_MOVABLE - zone which contains movable pages.

which are presented by the zone_type enum. We can get information about zones with the:

$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3975
        min      3
        low      3
        ...
        ...
Node 0, zone    DMA32
  pages free     694163
        min      875
        low      1093
        ...
        ...
Node 0, zone   Normal
  pages free     2529995
        min      3146
        low      3932
        ...
        ...

As I wrote above all nodes are described with the pglist_data or pg_data_t structure in memory. This structure is defined in the include/linux/mmzone.h. The build_all_zonelists function from the mm/page_alloc.c constructs an ordered zonelist (of different zones DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE) which specifies the zones/nodes to visit when a selected zone or node cannot satisfy the allocation request. That's all. More about NUMA and multiprocessor systems will be in the special part.

The rest of the stuff before scheduler initialization

Before we will start to dive into linux kernel scheduler initialization process we must do a couple of things. The first thing is the page_alloc_init function from the mm/page_alloc.c. This function looks pretty easy:

void __init page_alloc_init(void)
{
        hotcpu_notifier(page_alloc_cpu_notify, 0);
}

and initializes handler for the CPU hotplug. Of course the hotcpu_notifier depends on the CONFIG_HOTPLUG_CPU configuration option and if this option is set, it just calls cpu_notifier macro which expands to the call of the register_cpu_notifier which adds hotplug cpu handler (page_alloc_cpu_notify in our case).

After this we can see the kernel command line in the initialization output:

And a couple of functions such as parse_early_param and parse_args which handles linux kernel command line. You may remember that we already saw the call of the parse_early_param function in the sixth part of the kernel initialization chapter, so why we call it again? Answer is simple: we call this function in the architecture-specific code (x86_64 in our case), but not all architecture calls this function. And we need to call the second function parse_args to parse and handle non-early command line arguments.

In the next step we can see the call of the jump_label_init from the kernel/jump_label.c. and initializes jump label.

After this we can see the call of the setup_log_buf function which setups the printk log buffer. We already saw this function in the seventh part of the linux kernel initialization process chapter.

PID hash initialization

The next is pidhash_init function. As you know each process has assigned a unique number which called - process identification number or PID. Each process generated with fork or clone is automatically assigned a new unique PID value by the kernel. The management of PIDs centered around the two special data structures: struct pid and struct upid. First structure represents information about a PID in the kernel. The second structure represents the information that is visible in a specific namespace. All PID instances stored in the special hash table:

static struct hlist_head *pid_hash;

This hash table is used to find the pid instance that belongs to a numeric PID value. So, pidhash_init initializes this hash table. In the start of the pidhash_init function we can see the call of the alloc_large_system_hash:

pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
                                   HASH_EARLY | HASH_SMALL,
                                   &pidhash_shift, NULL,
                                   0, 4096);

The number of elements of the pid_hash depends on the RAM configuration, but it can be between 2^4 and 2^12. The pidhash_init computes the size and allocates the required storage (which is hlist in our case - the same as doubly linked list, but contains one pointer instead on the struct hlist_head]. The alloc_large_system_hash function allocates a large system hash table with memblock_virt_alloc_nopanic if we pass HASH_EARLY flag (as it in our case) or with __vmalloc if we did no pass this flag.

The result we can see in the dmesg output:

$ dmesg | grep hash
[    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
...
...
...

That's all. The rest of the stuff before scheduler initialization is the following functions: vfs_caches_init_early does early initialization of the virtual file system (more about it will be in the chapter which will describe virtual file system), sort_main_extable sorts the kernel's built-in exception table entries which are between __start___ex_table and __stop___ex_table, and trap_init initializes trap handlers (more about last two function we will know in the separate chapter about interrupts).

The last step before the scheduler initialization is initialization of the memory manager with the mm_init function from the init/main.c. As we can see, the mm_init function initializes different parts of the linux kernel memory manager:

page_ext_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
pgtable_init();
vmalloc_init();

The first is page_ext_init_flatmem which depends on the CONFIG_SPARSEMEM kernel configuration option and initializes extended data per page handling. The mem_init releases all bootmem, the kmem_cache_init initializes kernel cache, the percpu_init_late - replaces percpu chunks with those allocated by slub, the pgtable_init - initializes the page->ptl kernel cache, the vmalloc_init - initializes vmalloc. Please, NOTE that we will not dive into details about all of these functions and concepts, but we will see all of they it in the Linux kernel memory manager chapter.

That's all. Now we can look on the scheduler.

Scheduler initialization

And now we come to the main purpose of this part - initialization of the task scheduler. I want to say again as I already did it many times, you will not see the full explanation of the scheduler here, there will be special chapter about this. Ok, next point is the sched_init function from the kernel/sched/core.c and as we can understand from the function's name, it initializes scheduler. Let's start to dive into this function and try to understand how the scheduler is initialized. At the start of the sched_init function we can see the following code:

#ifdef CONFIG_FAIR_GROUP_SCHED
         alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
         alloc_size += 2 * nr_cpu_ids * sizeof(void **);
#endif

First of all we can see two configuration options here:

CONFIG_FAIR_GROUP_SCHED
CONFIG_RT_GROUP_SCHED

Both of this options provide two different planning models. As we can read from the documentation, the current scheduler - CFS or Completely Fair Scheduler use a simple concept. It models process scheduling as if the system has an ideal multitasking processor where each process would receive 1/n processor time, where n is the number of the runnable processes. The scheduler uses the special set of rules. These rules determine when and how to select a new process to run and they are called scheduling policy. The Completely Fair Scheduler supports following normal or non-real-time scheduling policies: SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE. The SCHED_NORMAL is used for the most normal applications, the amount of cpu each process consumes is mostly determined by the nice value, the SCHED_BATCH used for the 100% non-interactive tasks and the SCHED_IDLE runs tasks only when the processor has no task to run besides this task. The real-time policies are also supported for the time-critical applications: SCHED_FIFO and SCHED_RR. If you've read something about the Linux kernel scheduler, you can know that it is modular. It means that it supports different algorithms to schedule different types of processes. Usually this modularity is called scheduler classes. These modules encapsulate scheduling policy details and are handled by the scheduler core without knowing too much about them.

Now let's back to the our code and look on the two configuration options CONFIG_FAIR_GROUP_SCHED and CONFIG_RT_GROUP_SCHED. The scheduler operates on an individual task. These options allows to schedule group tasks (more about it you can read in the CFS group scheduling). We can see that we assign the alloc_size variables which represent size based on amount of the processors to allocate for the sched_entity and cfs_rq to the 2 * nr_cpu_ids * sizeof(void **) expression with kzalloc:

ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
 
#ifdef CONFIG_FAIR_GROUP_SCHED
        root_task_group.se = (struct sched_entity **)ptr;
        ptr += nr_cpu_ids * sizeof(void **);

        root_task_group.cfs_rq = (struct cfs_rq **)ptr;
        ptr += nr_cpu_ids * sizeof(void **);
#endif

The sched_entity is a structure which is defined in the include/linux/sched.h and used by the scheduler to keep track of process accounting. The cfs_rq presents run queue. So, you can see that we allocated space with size alloc_size for the run queue and scheduler entity of the root_task_group. The root_task_group is an instance of the task_group structure from the kernel/sched/sched.h which contains task group related information:

struct task_group {
    ...
    ...
    struct sched_entity **se;
    struct cfs_rq **cfs_rq;
    ...
    ...
}

The root task group is the task group which belongs to every task in system. As we allocated space for the root task group scheduler entity and runqueue, we go over all possible CPUs (cpu_possible_mask bitmap) and allocate zeroed memory from a particular memory node with the kzalloc_node function for the load_balance_mask percpu variable:

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);

Here cpumask_var_t is the cpumask_t with one difference: cpumask_var_t is allocated only nr_cpu_ids bits when the cpumask_t always has NR_CPUS bits (more about cpumask you can read in the CPU masks part). As you can see:

#ifdef CONFIG_CPUMASK_OFFSTACK
    for_each_possible_cpu(i) {
        per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
                cpumask_size(), GFP_KERNEL, cpu_to_node(i));
    }
#endif

this code depends on the CONFIG_CPUMASK_OFFSTACK configuration option. This configuration options says to use dynamic allocation for cpumask, instead of putting it on the stack. All groups have to be able to rely on the amount of CPU time. With the call of the two following functions:

init_rt_bandwidth(&def_rt_bandwidth,
                  global_rt_period(), global_rt_runtime());
init_dl_bandwidth(&def_dl_bandwidth,
                  global_rt_period(), global_rt_runtime());

we initialize bandwidth management for the SCHED_DEADLINE real-time tasks. These functions initializes rt_bandwidth and dl_bandwidth structures which store information about maximum deadline bandwidth of the system. For example, let's look on the implementation of the init_rt_bandwidth function:

void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime)
{
        rt_b->rt_period = ns_to_ktime(period);
        rt_b->rt_runtime = runtime;

        raw_spin_lock_init(&rt_b->rt_runtime_lock);

        hrtimer_init(&rt_b->rt_period_timer,
                     CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        rt_b->rt_period_timer.function = sched_rt_period_timer;
}

It takes three parameters:

address of the rt_bandwidth structure which contains information about the allocated and consumed quota within a period;
period - period over which real-time task bandwidth enforcement is measured in us;
runtime - part of the period that we allow tasks to run in us.

As period and runtime we pass result of the global_rt_period and global_rt_runtime functions. Which are 1s second and and 0.95s by default. The rt_bandwidth structure is defined in the kernel/sched/sched.h and looks:

struct rt_bandwidth {
        raw_spinlock_t          rt_runtime_lock;
        ktime_t                 rt_period;
        u64                     rt_runtime;
        struct hrtimer          rt_period_timer;
};

As you can see, it contains runtime and period and also two following fields:

rt_runtime_lock - spinlock for the rt_time protection;
rt_period_timer - high-resolution kernel timer for unthrottled of real-time tasks.

So, in the init_rt_bandwidth we initialize rt_bandwidth period and runtime with the given parameters, initialize the spinlock and high-resolution time. In the next step, depends on enable of SMP, we make initialization of the root domain:

#ifdef CONFIG_SMP
	init_defrootdomain();
#endif

The real-time scheduler requires global resources to make scheduling decision. But unfortunately scalability bottlenecks appear as the number of CPUs increase. The concept of root domains was introduced for improving scalability. The linux kernel provides a special mechanism for assigning a set of CPUs and memory nodes to a set of tasks and it is called - cpuset. If a cpuset contains non-overlapping with other cpuset CPUs, it is exclusive cpuset. Each exclusive cpuset defines an isolated domain or root domain of CPUs partitioned from other cpusets or CPUs. A root domain is presented by the struct root_domain from the kernel/sched/sched.h in the linux kernel and its main purpose is to narrow the scope of the global variables to per-domain variables and all real-time scheduling decisions are made only within the scope of a root domain. That's all about it, but we will see more details about it in the chapter about real-time scheduler.

After root domain initialization, we make initialization of the bandwidth for the real-time tasks of the root task group as we did it above:

#ifdef CONFIG_RT_GROUP_SCHED
	init_rt_bandwidth(&root_task_group.rt_bandwidth,
			global_rt_period(), global_rt_runtime());
#endif

In the next step, depends on the CONFIG_CGROUP_SCHED kernel configuration option we initialize the siblings and children lists of the root task group. As we can read from the documentation, the CONFIG_CGROUP_SCHED is:

This option allows you to create arbitrary task groups using the "cgroup" pseudo
filesystem and control the cpu bandwidth allocated to each such task group.

As we finished with the lists initialization, we can see the call of the autogroup_init function:

#ifdef CONFIG_CGROUP_SCHED
         list_add(&root_task_group.list, &task_groups);
         INIT_LIST_HEAD(&root_task_group.children);
         INIT_LIST_HEAD(&root_task_group.siblings);
         autogroup_init(&init_task);
#endif

which initializes automatic process group scheduling.

After this we are going through the all possible cpu (you can remember that possible CPUs store in the cpu_possible_mask bitmap that can ever be available in the system) and initialize a runqueue for each possible cpu:

for_each_possible_cpu(i) {
    struct rq *rq;
    ...
    ...
    ...

Each processor has its own locking and individual runqueue. All runnable tasks are stored in an active array and indexed according to its priority. When a process consumes its time slice, it is moved to an expired array. All of these arras are stored in the special structure which names is runqueue. As there are no global lock and runqueue, we are going through the all possible CPUs and initialize runqueue for the every cpu. The runqueue is presented by the rq structure in the linux kernel which is defined in the kernel/sched/sched.h.

rq = cpu_rq(i);
raw_spin_lock_init(&rq->lock);
rq->nr_running = 0;
rq->calc_load_active = 0;
rq->calc_load_update = jiffies + LOAD_FREQ;
init_cfs_rq(&rq->cfs);
init_rt_rq(&rq->rt);
init_dl_rq(&rq->dl);
rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime;

Here we get the runqueue for the every CPU with the cpu_rq macro which returns runqueues percpu variable and start to initialize it with runqueue lock, number of running tasks, calc_load relative fields (calc_load_active and calc_load_update) which are used in the reckoning of a CPU load and initialization of the completely fair, real-time and deadline related fields in a runqueue. After this we initialize cpu_load array with zeros and set the last load update tick to the jiffies variable which determines the number of time ticks (cycles), since the system boot:

for (j = 0; j < CPU_LOAD_IDX_MAX; j++)
    rq->cpu_load[j] = 0;

rq->last_load_update_tick = jiffies;

where cpu_load keeps history of runqueue loads in the past, for now CPU_LOAD_IDX_MAX is 5. In the next step we fill runqueue fields which are related to the SMP, but we will not cover them in this part. And in the end of the loop we initialize high-resolution timer for the give runqueue and set the iowait (more about it in the separate part about scheduler) number:

init_rq_hrtick(rq);
atomic_set(&rq->nr_iowait, 0);

Now we come out from the for_each_possible_cpu loop and the next we need to set load weight for the init task with the set_load_weight function. Weight of process is calculated through its dynamic priority which is static priority + scheduling class of the process. After this we increase memory usage counter of the memory descriptor of the init process and set scheduler class for the current process:

atomic_inc(&init_mm.mm_count);
current->sched_class = &fair_sched_class;

And make current process (it will be the first init process) idle and update the value of the calc_load_update with the 5 seconds interval:

init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ;

So, the init process will be run, when there will be no other candidates (as it is the first process in the system). In the end we just set scheduler_running variable:

scheduler_running = 1;

That's all. Linux kernel scheduler is initialized. Of course, we have skipped many different details and explanations here, because we need to know and understand how different concepts (like process and process groups, runqueue, rcu, etc.) works in the linux kernel , but we took a short look on the scheduler initialization process. We will look all other details in the separate part which will be fully dedicated to the scheduler.

Conclusion

It is the end of the eighth part about the linux kernel initialization process. In this part, we looked on the initialization process of the scheduler and we will continue in the next part to dive in the linux kernel initialization process and will see initialization of the RCU and many other initialization stuff in the next part.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.

Kernel initialization. Part 9.

RCU initialization

This is ninth part of the Linux Kernel initialization process and in the previous part we stopped at the scheduler initialization. In this part we will continue to dive to the linux kernel initialization process and the main purpose of this part will be to learn about initialization of the RCU. We can see that the next step in the init/main.c after the sched_init is the call of the preempt_disable. There are two macros:

preempt_disable
preempt_enable

for preemption disabling and enabling. First of all let's try to understand what is preempt in the context of an operating system kernel. In simple words, preemption is ability of the operating system kernel to preempt current task to run task with higher priority. Here we need to disable preemption because we will have only one init process for the early boot time and we don't need to stop it before we call cpu_idle function. The preempt_disable macro is defined in the include/linux/preempt.h and depends on the CONFIG_PREEMPT_COUNT kernel configuration option. This macro is implemented as:

#define preempt_disable() \
do { \
        preempt_count_inc(); \
        barrier(); \
} while (0)

and if CONFIG_PREEMPT_COUNT is not set just:

#define preempt_disable()                       barrier()

Let's look on it. First of all we can see one difference between these macro implementations. The preempt_disable with CONFIG_PREEMPT_COUNT set contains the call of the preempt_count_inc. There is special percpu variable which stores the number of held locks and preempt_disable calls:

DECLARE_PER_CPU(int, __preempt_count);

In the first implementation of the preempt_disable we increment this __preempt_count. There is API for returning value of the __preempt_count, it is the preempt_count function. As we called preempt_disable, first of all we increment preemption counter with the preempt_count_inc macro which expands to the:

#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val)  __preempt_count_add(val)

where preempt_count_add calls the raw_cpu_add_4 macro which adds 1 to the given percpu variable (__preempt_count) in our case (more about precpu variables you can read in the part about Per-CPU variables). Ok, we increased __preempt_count and the next step we can see the call of the barrier macro in the both macros. The barrier macro inserts an optimization barrier. In the processors with x86_64 architecture independent memory access operations can be performed in any order. That's why we need the opportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider a simple example:

preempt_disable();
foo();
preempt_enable();

Compiler can rearrange it as:

preempt_disable();
preempt_enable();
foo();

In this case non-preemptible function foo can be preempted. As we put barrier macro in the preempt_disable and preempt_enable macros, it prevents the compiler from swapping preempt_count_inc with other statements. More about barriers you can read here and here.

In the next step we can see following statement:

if (WARN(!irqs_disabled(),
	 "Interrupts were enabled *very* early, fixing it\n"))
	local_irq_disable();

which check IRQs state, and disabling (with cli instruction for x86_64) if they are enabled.

That's all. Preemption is disabled and we can go ahead.

Initialization of the integer ID management

In the next step we can see the call of the idr_init_cache function which defined in the lib/idr.c. The idr library is used in a various places in the linux kernel to manage assigning integer IDs to objects and looking up objects by id.

Let's look on the implementation of the idr_init_cache function:

void __init idr_init_cache(void)
{
        idr_layer_cache = kmem_cache_create("idr_layer_cache",
                                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}

Here we can see the call of the kmem_cache_create. We already called the kmem_cache_init in the init/main.c. This function create generalized caches again using the kmem_cache_alloc (more about caches we will see in the Linux kernel memory management chapter). In our case, as we are using kmem_cache_t which will be used by the slab allocator and kmem_cache_create creates it. As you can see we pass five parameters to the kmem_cache_create:

name of the cache;
size of the object to store in cache;
offset of the first object in the page;
flags;
constructor for the objects.

and it will create kmem_cache for the integer IDs. Integer IDs is commonly used pattern to map set of integer IDs to the set of pointers. We can see usage of the integer IDs in the i2c drivers subsystem. For example drivers/i2c/i2c-core.c which represents the core of the i2c subsystem defines ID for the i2c adapter with the DEFINE_IDR macro:

static DEFINE_IDR(i2c_adapter_idr);

and then uses it for the declaration of the i2c adapter:

static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
  int     id;
  ...
  ...
  ...
  id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
  ...
  ...
  ...
}

and id2_adapter_idr presents dynamically calculated bus number.

More about integer ID management you can read here.

RCU initialization

The next step is RCU initialization with the rcu_init function and it's implementation depends on two kernel configuration options:

CONFIG_TINY_RCU
CONFIG_TREE_RCU

In the first case rcu_init will be in the kernel/rcu/tiny.c and in the second case it will be defined in the kernel/rcu/tree.c. We will see the implementation of the tree rcu, but first of all about the RCU in general.

RCU or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the linux kernel provided support and environment for the concurrently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including lock-free data structures, percpu data structures and other. One of these mechanisms is - the read-copy update. The RCU technique is designed for rarely-modified data structures. The idea of the RCU is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy.

Of course this description of the RCU is very simplified. To understand some details about RCU, first of all we need to learn some terminology. Data readers in the RCU executed in the critical section. Every time when data reader get to the critical section, it calls the rcu_read_lock, and rcu_read_unlock on exit from the critical section. If the thread is not in the critical section, it will be in state which called - quiescent state. The moment when every thread is in the quiescent state called - grace period. If a thread wants to remove an element from the data structure, this occurs in two steps. First step is removal - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits until it is finished. From this moment, the removed element is available to the thread-readers. After the grace period finished, the second step of the element removal will be started, it just removes the element from the physical memory.

There a couple of implementations of the RCU. Old RCU called classic, the new implementation called tree RCU. As you may already understand, the CONFIG_TREE_RCU kernel configuration option enables tree RCU. Another is the tiny RCU which depends on CONFIG_TINY_RCU and CONFIG_SMP=n. We will see more details about the RCU in general in the separate chapter about synchronization primitives, but now let's look on the rcu_init implementation from the kernel/rcu/tree.c:

void __init rcu_init(void)
{
         int cpu;

         rcu_bootup_announce();
         rcu_init_geometry();
         rcu_init_one(&rcu_bh_state, &rcu_bh_data);
         rcu_init_one(&rcu_sched_state, &rcu_sched_data);
         __rcu_init_preempt();
         open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

         /*
          * We don't need protection against CPU-hotplug here because
          * this is called early in boot, before either interrupts
          * or the scheduler are operational.
          */
         cpu_notifier(rcu_cpu_notify, 0);
         pm_notifier(rcu_pm_notify, 0);
         for_each_online_cpu(cpu)
                 rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

         rcu_early_boot_tests();
}

In the beginning of the rcu_init function we define cpu variable and call rcu_bootup_announce. The rcu_bootup_announce function is pretty simple:

static void __init rcu_bootup_announce(void)
{
        pr_info("Hierarchical RCU implementation.\n");
        rcu_bootup_announce_oddness();
}

It just prints information about the RCU with the pr_info function and rcu_bootup_announce_oddness which uses pr_info too, for printing different information about the current RCU configuration which depends on different kernel configuration options like CONFIG_RCU_TRACE, CONFIG_PROVE_RCU, CONFIG_RCU_FANOUT_EXACT, etc. In the next step, we can see the call of the rcu_init_geometry function. This function is defined in the same source code file and computes the node tree geometry depends on the amount of CPUs. Actually RCU provides scalability with extremely low internal RCU lock contention. What if a data structure will be read from the different CPUs? RCU API provides the rcu_state structure which presents RCU global state including node hierarchy. Hierarchy is presented by the:

struct rcu_node node[NUM_RCU_NODES];

array of structures. As we can read in the comment of above definition:

The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]).  The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.

Small systems will have a "hierarchy" consisting of a single rcu_node.

The rcu_node structure is defined in the kernel/rcu/tree.h and contains information about current grace period, is grace period completed or not, CPUs or groups that need to switch in order for current grace period to proceed, etc. Every rcu_node contains a lock for a couple of CPUs. These rcu_node structures are embedded into a linear array in the rcu_state structure and represented as a tree with the root as the first element and covers all CPUs. As you can see the number of the rcu nodes determined by the NUM_RCU_NODES which depends on number of available CPUs:

#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)

where levels values depend on the CONFIG_RCU_FANOUT_LEAF configuration option. For example for the simplest case, one rcu_node will cover two CPU on machine with the eight CPUs:

+-----------------------------------------------------------------+
|  rcu_state                                                      |
|                 +----------------------+                        |
|                 |         root         |                        |
|                 |       rcu_node       |                        |
|                 +----------------------+                        |
|                    |                |                           |
|               +----v-----+       +--v-------+                   |
|               |          |       |          |                   |
|               | rcu_node |       | rcu_node |                   |
|               |          |       |          |                   |
|         +------------------+     +----------------+             |
|         |                  |        |             |             |
|         |                  |        |             |             |
|    +----v-----+    +-------v--+   +-v--------+  +-v--------+    |
|    |          |    |          |   |          |  |          |    |
|    | rcu_node |    | rcu_node |   | rcu_node |  | rcu_node |    |
|    |          |    |          |   |          |  |          |    |
|    +----------+    +----------+   +----------+  +----------+    |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
+---------|-----------------|-------------|---------------|-------+
          |                 |             |               |
+---------v-----------------v-------------v---------------v--------+
|                 |                |               |               |
|     CPU1        |      CPU3      |      CPU5     |     CPU7      |
|                 |                |               |               |
|     CPU2        |      CPU4      |      CPU6     |     CPU8      |
|                 |                |               |               |
+------------------------------------------------------------------+

So, in the rcu_init_geometry function we just need to calculate the total number of rcu_node structures. We start to do it with the calculation of the jiffies till to the first and next fqs which is force-quiescent-state (read above about it):

d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
        jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
        jiffies_till_next_fqs = d;

where:

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV     256

As we calculated these jiffies, we check that previous defined jiffies_till_first_fqs and jiffies_till_next_fqs variables are equal to the ULONG_MAX (their default values) and set they equal to the calculated value. As we did not touch these variables before, they are equal to the ULONG_MAX:

static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;

In the next step of the rcu_init_geometry, we check that rcu_fanout_leaf didn't change (it has the same value as CONFIG_RCU_FANOUT_LEAF in compile-time) and equal to the value of the CONFIG_RCU_FANOUT_LEAF configuration option, we just return:

if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
    nr_cpu_ids == NR_CPUS)
    return;

After this we need to compute the number of nodes that an rcu_node tree can handle with the given number of levels:

rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
    rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;

And in the last step we calculate the number of rcu_nodes at each level of the tree in the loop.

As we calculated geometry of the rcu_node tree, we need to go back to the rcu_init function and next step we need to initialize two rcu_state structures with the rcu_init_one function:

rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);

The rcu_init_one function takes two arguments:

Global RCU state;
Per-CPU data for RCU.

Both variables defined in the kernel/rcu/tree.h with its percpu data:

extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);

About this states you can read here. As I wrote above we need to initialize rcu_state structures and rcu_init_one function will help us with it. After the rcu_state initialization, we can see the call of the __rcu_init_preempt which depends on the CONFIG_PREEMPT_RCU kernel configuration option. It does the same as previous functions - initialization of the rcu_preempt_state structure with the rcu_init_one function which has rcu_state type. After this, in the rcu_init, we can see the call of the:

open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

function. This function registers a handler of the pending interrupt. Pending interrupt or softirq supposes that part of actions can be delayed for later execution when the system is less loaded. Pending interrupts is represented by the following structure:

struct softirq_action
{
        void    (*action)(struct softirq_action *);
};

which is defined in the include/linux/interrupt.h and contains only one field - handler of an interrupt. You can check about softirqs in the your system with the:

$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
          HI:          2          0          0          1          0          2          0          0
       TIMER:     137779     108110     139573     107647     107408     114972      99653      98665
      NET_TX:       1127          0          4          0          1          1          0          0
      NET_RX:        334        221     132939       3076        451        361        292        303
       BLOCK:       5253       5596          8        779       2016      37442         28       2855
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
     TASKLET:         66          0       2916        113          0         24      26708          0
       SCHED:     102350      75950      91705      75356      75323      82627      69279      69914
     HRTIMER:        510        302        368        260        219        255        248        246
         RCU:      81290      68062      82979      69015      68390      69385      63304      63473

The open_softirq function takes two parameters:

index of the interrupt;
interrupt handler.

and adds interrupt handler to the array of the pending interrupts:

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
        softirq_vec[nr].action = action;
}

In our case the interrupt handler is - rcu_process_callbacks which is defined in the kernel/rcu/tree.c and does the RCU core processing for the current CPU. After we registered softirq interrupt for the RCU, we can see the following code:

cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
    rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

Here we can see registration of the cpu notifier which needs in systems which supports CPU hotplug and we will not dive into details about this theme. The last function in the rcu_init is the rcu_early_boot_tests:

void rcu_early_boot_tests(void)
{
        pr_info("Running RCU self tests\n");

        if (rcu_self_test)
                 early_boot_test_call_rcu();
         if (rcu_self_test_bh)
                 early_boot_test_call_rcu_bh();
         if (rcu_self_test_sched)
                early_boot_test_call_rcu_sched();
}

which runs self tests for the RCU.

That's all. We saw initialization process of the RCU subsystem. As I wrote above, more about the RCU will be in the separate chapter about synchronization primitives.

Rest of the initialization process

Ok, we already passed the main theme of this part which is RCU initialization, but it is not the end of the linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work in the initialization time, but we will not dive into deep details around this function for different reasons. Some reasons not to dive into details are following:

They are not very important for the generic kernel initialization process and depend on the different kernel configuration;
They have the character of debugging and not important for now;
We will see many of this stuff in the separate parts/chapters.

After we initialized RCU, the next step which you can see in the init/main.c is the - trace_init function. As you can understand from its name, this function initialize tracing subsystem. You can read more about linux kernel trace system - here.

After the trace_init, we can see the call of the radix_tree_init. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the Radix tree. This function is defined in the lib/radix-tree.c and you can read more about it in the part about Radix tree.

In the next step we can see the functions which are related to the interrupts handling subsystem, they are:

early_irq_init
init_IRQ
softirq_init

We will see explanation about this functions and their implementation in the special part about interrupts and exceptions handling. After this many different functions (like init_timers, hrtimers_init, time_init, etc.) which are related to different timing and timers stuff. We will see more about these function in the chapter about timers.

The next couple of functions are related with the perf events - perf_event-init (there will be separate chapter about perf), initialization of the profiling with the profile_init. After this we enable irq with the call of the:

local_irq_enable();

which expands to the sti instruction and making post initialization of the SLAB with the call of the kmem_cache_init_late function (As I wrote above we will know about the SLAB in the Linux memory management chapter).

After the post initialization of the SLAB, next point is initialization of the console with the console_init function from the drivers/tty/tty_io.c.

After the console initialization, we can see the lockdep_info function which prints information about the Lock dependency validator. After this, we can see the initialization of the dynamic allocation of the debug objects with the debug_objects_mem_init, kernel memory leak detector initialization with the kmemleak_init, percpu pageset setup with the setup_per_cpu_pageset, setup of the NUMA policy with the numa_policy_init, setting time for the scheduler with the sched_clock_init, pidmap initialization with the call of the pidmap_init function for the initial PID namespace, cache creation with the anon_vma_init for the private virtual memory areas and early initialization of the ACPI with the acpi_early_init.

This is the end of the ninth part of the linux kernel initialization process and here we saw initialization of the RCU. In the last paragraph of this part (Rest of the initialization process) we will go through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I already wrote many times, we will see details of implementations in other parts or other chapters.

Conclusion

It is the end of the ninth part about the linux kernel initialization process. In this part, we looked on the initialization process of the RCU subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the start_kernel function and will go to the rest_init function from the same init/main.c source code file and will see the start of the first process.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.

Kernel initialization. Part 10.

End of the linux kernel initialization process

This is tenth part of the chapter about linux kernel initialization process and in the previous part we saw the initialization of the RCU and stopped on the call of the acpi_early_init function. This part will be the last part of the Kernel initialization process chapter, so let's finish it.

After the call of the acpi_early_init function from the init/main.c, we can see the following code:

#ifdef CONFIG_X86_ESPFIX64
	init_espfix_bsp();
#endif

Here we can see the call of the init_espfix_bsp function which depends on the CONFIG_X86_ESPFIX64 kernel configuration option. As we can understand from the function name, it does something with the stack. This function is defined in the arch/x86/kernel/espfix_64.c and prevents leaking of 31:16 bits of the esp register during returning to 16-bit stack. First of all we install espfix page upper directory into the kernel page directory in the init_espfix_bs:

pgd_p = &init_level4_pgt[pgd_index(ESPFIX_BASE_ADDR)];
pgd_populate(&init_mm, pgd_p, (pud_t *)espfix_pud_page);

Where ESPFIX_BASE_ADDR is:

#define PGDIR_SHIFT     39
#define ESPFIX_PGD_ENTRY _AC(-2, UL)
#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << PGDIR_SHIFT)

Also we can find it in the Documentation/x86/x86_64/mm:

... unused hole ...
ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
... unused hole ...

After we've filled page global directory with the espfix pud, the next step is call of the init_espfix_random and init_espfix_ap functions. The first function returns random locations for the espfix page and the second enables the espfix for the current CPU. After the init_espfix_bsp finished the work, we can see the call of the thread_info_cache_init function which defined in the kernel/fork.c and allocates cache for the thread_info if THREAD_SIZE is less than PAGE_SIZE:

# if THREAD_SIZE >= PAGE_SIZE
...
...
...
void thread_info_cache_init(void)
{
        thread_info_cache = kmem_cache_create("thread_info", THREAD_SIZE,
                                              THREAD_SIZE, 0, NULL);
        BUG_ON(thread_info_cache == NULL);
}
...
...
...
#endif

As we already know the PAGE_SIZE is (_AC(1,UL) << PAGE_SHIFT) or 4096 bytes and THREAD_SIZE is (PAGE_SIZE << THREAD_SIZE_ORDER) or 16384 bytes for the x86_64. The next function after the thread_info_cache_init is the cred_init from the kernel/cred.c. This function just allocates cache for the credentials (like uid, gid, etc.):

void __init cred_init(void)
{
         cred_jar = kmem_cache_create("cred_jar", sizeof(struct cred),
                                     0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
}

more about credentials you can read in the Documentation/security/credentials.txt. Next step is the fork_init function from the kernel/fork.c. The fork_init function allocates cache for the task_struct. Let's look on the implementation of the fork_init. First of all we can see definitions of the ARCH_MIN_TASKALIGN macro and creation of a slab where task_structs will be allocated:

#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN      L1_CACHE_BYTES
#endif
        task_struct_cachep =
                kmem_cache_create("task_struct", sizeof(struct task_struct),
                        ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif

As we can see this code depends on the CONFIG_ARCH_TASK_STRUCT_ACLLOCATOR kernel configuration option. This configuration option shows the presence of the alloc_task_struct for the given architecture. As x86_64 has no alloc_task_struct function, this code will not work and even will not be compiled on the x86_64.

Allocating cache for init task

After this we can see the call of the arch_task_cache_init function in the fork_init:

void arch_task_cache_init(void)
{
        task_xstate_cachep =
                kmem_cache_create("task_xstate", xstate_size,
                                  __alignof__(union thread_xstate),
                                  SLAB_PANIC | SLAB_NOTRACK, NULL);
        setup_xstate_comp();
}

The arch_task_cache_init does initialization of the architecture-specific caches. In our case it is x86_64, so as we can see, the arch_task_cache_init allocates cache for the task_xstate which represents FPU state and sets up offsets and sizes of all extended states in xsave area with the call of the setup_xstate_comp function. After the arch_task_cache_init we calculate default maximum number of threads with the:

set_max_threads(MAX_THREADS);

where default maximum number of threads is:

#define FUTEX_TID_MASK  0x3fffffff
#define MAX_THREADS     FUTEX_TID_MASK

In the end of the fork_init function we initialize signal handler:

init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
		init_task.signal->rlim[RLIMIT_NPROC];

As we know the init_task is an instance of the task_struct structure, so it contains signal field which represents signal handler. It has following type struct signal_struct. On the first two lines we can see setting of the current and maximum limit of the resource limits. Every process has an associated set of resource limits. These limits specify amount of resources which current process can use. Here rlim is resource control limit and presented by the:

struct rlimit {
        __kernel_ulong_t        rlim_cur;
        __kernel_ulong_t        rlim_max;
};

structure from the include/uapi/linux/resource.h. In our case the resource is the RLIMIT_NPROC which is the maximum number of processes that user can own and RLIMIT_SIGPENDING - the maximum number of pending signals. We can see it in the:

cat /proc/self/limits
Limit                     Soft Limit           Hard Limit           Units     
...
...
...
Max processes             63815                63815                processes 
Max pending signals       63815                63815                signals   
...
...
...

Initialization of the caches

The next function after the fork_init is the proc_caches_init from the kernel/fork.c. This function allocates caches for the memory descriptors (or mm_struct structure). At the beginning of the proc_caches_init we can see allocation of the different SLAB caches with the call of the kmem_cache_create:

sighand_cachep - manage information about installed signal handlers;
signal_cachep - manage information about process signal descriptor;
files_cachep - manage information about opened files;
fs_cachep - manage filesystem information.

After this we allocate SLAB cache for the mm_struct structures:

mm_cachep = kmem_cache_create("mm_struct",
                         sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
                         SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);

After this we allocate SLAB cache for the important vm_area_struct which used by the kernel to manage virtual memory space:

vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);

Note, that we use KMEM_CACHE macro here instead of the kmem_cache_create. This macro is defined in the include/linux/slab.h and just expands to the kmem_cache_create call:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
                sizeof(struct __struct), __alignof__(struct __struct),\
                (__flags), NULL)

The KMEM_CACHE has one difference from kmem_cache_create. Take a look on __alignof__ operator. The KMEM_CACHE macro aligns SLAB to the size of the given structure, but kmem_cache_create uses given value to align space. After this we can see the call of the mmap_init and nsproxy_cache_init functions. The first function initializes virtual memory area SLAB and the second function initializes SLAB for namespaces.

The next function after the proc_caches_init is buffer_init. This function is defined in the fs/buffer.c source code file and allocate cache for the buffer_head. The buffer_head is a special structure which defined in the include/linux/buffer_head.h and used for managing buffers. In the start of the buffer_init function we allocate cache for the struct buffer_head structures with the call of the kmem_cache_create function as we did in the previous functions. And calculate the maximum size of the buffers in memory with:

nrpages = (nr_free_buffer_pages() * 10) / 100;
max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));

which will be equal to the 10% of the ZONE_NORMAL (all RAM from the 4GB on the x86_64). The next function after the buffer_init is - vfs_caches_init. This function allocates SLAB caches and hashtable for different VFS caches. We already saw the vfs_caches_init_early function in the eighth part of the linux kernel initialization process which initialized caches for dcache (or directory-cache) and inode cache. The vfs_caches_init function makes post-early initialization of the dcache and inode caches, private data cache, hash tables for the mount points, etc. More details about VFS will be described in the separate part. After this we can see signals_init function. This function is defined in the kernel/signal.c and allocates a cache for the sigqueue structures which represents queue of the real time signals. The next function is page_writeback_init. This function initializes the ratio for the dirty pages. Every low-level page entry contains the dirty bit which indicates whether a page has been written to after been loaded into memory.

Creation of the root for the procfs

After all of this preparations we need to create the root for the proc filesystem. We will do it with the call of the proc_root_init function from the fs/proc/root.c. At the start of the proc_root_init function we allocate the cache for the inodes and register a new filesystem in the system with the:

err = register_filesystem(&proc_fs_type);
      if (err)
                return;

As I wrote above we will not dive into details about VFS and different filesystems in this chapter, but will see it in the chapter about the VFS. After we've registered a new filesystem in our system, we call the proc_self_init function from the fs/proc/self.c and this function allocates inode number for the self (/proc/self directory refers to the process accessing the /proc filesystem). The next step after the proc_self_init is proc_setup_thread_self which setups the /proc/thread-self directory which contains information about current thread. After this we create /proc/self/mounts symlink which will contains mount points with the call of the

proc_symlink("mounts", NULL, "self/mounts");

and a couple of directories depends on the different configuration options:

#ifdef CONFIG_SYSVIPC
        proc_mkdir("sysvipc", NULL);
#endif
        proc_mkdir("fs", NULL);
        proc_mkdir("driver", NULL);
        proc_mkdir("fs/nfsd", NULL);
#if defined(CONFIG_SUN_OPENPROMFS) || defined(CONFIG_SUN_OPENPROMFS_MODULE)
        proc_mkdir("openprom", NULL);
#endif
        proc_mkdir("bus", NULL);
        ...
        ...
        ...
        if (!proc_mkdir("tty", NULL))
                 return;
        proc_mkdir("tty/ldisc", NULL);
        ...
        ...
        ...

In the end of the proc_root_init we call the proc_sys_init function which creates /proc/sys directory and initializes the Sysctl.

It is the end of start_kernel function. I did not describe all functions which are called in the start_kernel. I skipped them, because they are not important for the generic kernel initialization stuff and depend on only different kernel configurations. They are taskstats_init_early which exports per-task statistic to the user-space, delayacct_init - initializes per-task delay accounting, key_init and security_init initialize different security stuff, check_bugs - fix some architecture-dependent bugs, ftrace_init function executes initialization of the ftrace, cgroup_init makes initialization of the rest of the cgroup subsystem,etc. Many of these parts and subsystems will be described in the other chapters.

That's all. Finally we have passed through the long-long start_kernel function. But it is not the end of the linux kernel initialization process. We haven't run the first process yet. In the end of the start_kernel we can see the last call of the - rest_init function. Let's go ahead.

First steps after the start_kernel

The rest_init function is defined in the same source code file as start_kernel function, and this file is init/main.c. In the beginning of the rest_init we can see call of the two following functions:

	rcu_scheduler_starting();
	smpboot_thread_init();

The first rcu_scheduler_starting makes RCU scheduler active and the second smpboot_thread_init registers the smpboot_thread_notifier CPU notifier (more about it you can read in the CPU hotplug documentation. After this we can see the following calls:

kernel_thread(kernel_init, NULL, CLONE_FS);
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);

Here the kernel_thread function (defined in the kernel/fork.c) creates new kernel thread.As we can see the kernel_thread function takes three arguments:

Function which will be executed in a new thread;
Parameter for the kernel_init function;
Flags.

We will not dive into details about kernel_thread implementation (we will see it in the chapter which describe scheduler, just need to say that kernel_thread invokes clone). Now we only need to know that we create new kernel thread with kernel_thread function, parent and child of the thread will use shared information about filesystem and it will start to execute kernel_init function. A kernel thread differs from a user thread that it runs in kernel mode. So with these two kernel_thread calls we create two new kernel threads with the PID = 1 for init process and PID = 2 for kthreadd. We already know what is init process. Let's look on the kthreadd. It is a special kernel thread which manages and helps different parts of the kernel to create another kernel thread. We can see it in the output of the ps util:

$ ps -ef | grep kthread
root         2     0  0 Jan11 ?        00:00:00 [kthreadd]

Let's postpone kernel_init and kthreadd for now and go ahead in the rest_init. In the next step after we have created two new kernel threads we can see the following code:

	rcu_read_lock();
	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
	rcu_read_unlock();

The first rcu_read_lock function marks the beginning of an RCU read-side critical section and the rcu_read_unlock marks the end of an RCU read-side critical section. We call these functions because we need to protect the find_task_by_pid_ns. The find_task_by_pid_ns returns pointer to the task_struct by the given pid. So, here we are getting the pointer to the task_struct for PID = 2 (we got it after kthreadd creation with the kernel_thread). In the next step we call complete function

complete(&kthreadd_done);

and pass address of the kthreadd_done. The kthreadd_done defined as

static __initdata DECLARE_COMPLETION(kthreadd_done);

where DECLARE_COMPLETION macro defined as:

#define DECLARE_COMPLETION(work) \
         struct completion work = COMPLETION_INITIALIZER(work)

and expands to the definition of the completion structure. This structure is defined in the include/linux/completion.h and presents completions concept. Completions is a code synchronization mechanism which provides race-free solution for the threads that must wait for some process to have reached a point or a specific state. Using completions consists of three parts: The first is definition of the complete structure and we did it with the DECLARE_COMPLETION. The second is call of the wait_for_completion. After the call of this function, a thread which called it will not continue to execute and will wait while other thread did not call complete function. Note that we call wait_for_completion with the kthreadd_done in the beginning of the kernel_init_freeable:

wait_for_completion(&kthreadd_done);

And the last step is to call complete function as we saw it above. After this the kernel_init_freeable function will not be executed while kthreadd thread will not be set. After the kthreadd was set, we can see three following functions in the rest_init:

	init_idle_bootup_task(current);
	schedule_preempt_disabled();
    cpu_startup_entry(CPUHP_ONLINE);

The first init_idle_bootup_task function from the kernel/sched/core.c sets the Scheduling class for the current process (idle class in our case):

void init_idle_bootup_task(struct task_struct *idle)
{
         idle->sched_class = &idle_sched_class;
}

where idle class is a low task priority and tasks can be run only when the processor doesn't have anything to run besides this tasks. The second function schedule_preempt_disabled disables preempt in idle tasks. And the third function cpu_startup_entry is defined in the kernel/sched/idle.c and calls cpu_idle_loop from the kernel/sched/idle.c. The cpu_idle_loop function works as process with PID = 0 and works in the background. Main purpose of the cpu_idle_loop is to consume the idle CPU cycles. When there is no process to run, this process starts to work. We have one process with idle scheduling class (we just set the current task to the idle with the call of the init_idle_bootup_task function), so the idle thread does not do useful work but just checks if there is an active task to switch to:

static void cpu_idle_loop(void)
{
        ...
        ...
        ...
        while (1) {
                while (!need_resched()) {
                ...
                ...
                ...
                }
        ...
        }

More about it will be in the chapter about scheduler. So for this moment the start_kernel calls the rest_init function which spawns an init (kernel_init function) process and become idle process itself. Now is time to look on the kernel_init. Execution of the kernel_init function starts from the call of the kernel_init_freeable function. The kernel_init_freeable function first of all waits for the completion of the kthreadd setup. I already wrote about it above:

wait_for_completion(&kthreadd_done);

After this we set gfp_allowed_mask to __GFP_BITS_MASK which means that system is already running, set allowed cpus/mems to all CPUs and NUMA nodes with the set_mems_allowed function, allow init process to run on any CPU with the set_cpus_allowed_ptr, set pid for the cad or Ctrl-Alt-Delete, do preparation for booting of the other CPUs with the call of the smp_prepare_cpus, call early initcalls with the do_pre_smp_initcalls, initialize SMP with the smp_init and initialize lockup_detector with the call of the lockup_detector_init and initialize scheduler with the sched_init_smp.

After this we can see the call of the following functions - do_basic_setup. Before we will call the do_basic_setup function, our kernel already initialized for this moment. As comment says:

Now we can finally start doing some real work..

The do_basic_setup will reinitialize cpuset to the active CPUs, initialize the khelper - which is a kernel thread which used for making calls out to userspace from within the kernel, initialize tmpfs, initialize drivers subsystem, enable the user-mode helper workqueue and make post-early call of the initcalls. We can see opening of the dev/console and dup twice file descriptors from 0 to 2 after the do_basic_setup:

if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
	pr_err("Warning: unable to open an initial console.\n");

(void) sys_dup(0);
(void) sys_dup(0);

We are using two system calls here sys_open and sys_dup. In the next chapters we will see explanation and implementation of the different system calls. After we opened initial console, we check that rdinit= option was passed to the kernel command line or set default path of the ramdisk:

if (!ramdisk_execute_command)
	ramdisk_execute_command = "/init";

Check user's permissions for the ramdisk and call the prepare_namespace function from the init/do_mounts.c which checks and mounts the initrd:

if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
	ramdisk_execute_command = NULL;
	prepare_namespace();
}

This is the end of the kernel_init_freeable function and we need return to the kernel_init. The next step after the kernel_init_freeable finished its execution is the async_synchronize_full. This function waits until all asynchronous function calls have been done and after it we will call the free_initmem which will release all memory occupied by the initialization stuff which located between __init_begin and __init_end. After this we protect .rodata with the mark_rodata_ro and update state of the system from the SYSTEM_BOOTING to the

system_state = SYSTEM_RUNNING;

And tries to run the init process:

if (ramdisk_execute_command) {
	ret = run_init_process(ramdisk_execute_command);
	if (!ret)
		return 0;
	pr_err("Failed to execute %s (error %d)\n",
	       ramdisk_execute_command, ret);
}

First of all it checks the ramdisk_execute_command which we set in the kernel_init_freeable function and it will be equal to the value of the rdinit= kernel command line parameters or /init by default. The run_init_process function fills the first element of the argv_init array:

static const char *argv_init[MAX_INIT_ARGS+2] = { "init", NULL, };

which represents arguments of the init program and call do_execve function:

argv_init[0] = init_filename;
return do_execve(getname_kernel(init_filename),
	(const char __user *const __user *)argv_init,
	(const char __user *const __user *)envp_init);

The do_execve function is defined in the include/linux/sched.h and runs program with the given file name and arguments. If we did not pass rdinit= option to the kernel command line, kernel starts to check the execute_command which is equal to value of the init= kernel command line parameter:

	if (execute_command) {
		ret = run_init_process(execute_command);
		if (!ret)
			return 0;
		panic("Requested init %s failed (error %d).",
		      execute_command, ret);
	}

If we did not pass init= kernel command line parameter either, kernel tries to run one of the following executable files:

if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
	return 0;

Otherwise we finish with panic:

panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/init.txt for guidance.");

That's all! Linux kernel initialization process is finished!

Conclusion

It is the end of the tenth part about the linux kernel initialization process. It is not only the tenth part, but also is the last part which describes initialization of the linux kernel. As I wrote in the first part of this chapter, we will go through all steps of the kernel initialization and we did it. We started at the first architecture-independent function - start_kernel and finished with the launch of the first init process in the our system. I skipped details about different subsystem of the kernel, for example I almost did not cover scheduler, interrupts, exception handling, etc. From the next part we will start to dive to the different kernel subsystems. Hope it will be interesting.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to linux-insides.