(5) 6.828 Operating System lab4: Preemptive Multitasking

Introduction


In this lab you will implement preemptive multitasking among multiple simultaneously active user-mode environments.

In part A you will add multiprocessor support to JOS, implement round-robin scheduling, and add basic environment management system calls (calls that create and destroy environments, and allocate/map memory).

In part B, you will implement a Unix-like fork(), which allows a user-mode environment to create copies of itself.

Finally, in part C you will add support for inter-process communication (IPC), allowing different user-mode environments to communicate and synchronize with each other explicitly. You will also add support for hardware clock interrupts and preemption.

  In this lab you implement preemptive multitasking among user environments.

  Part A adds multiprocessor support to JOS, implements round-robin scheduling, and adds basic environment-management system calls (creating and destroying environments, allocating and mapping memory).

  Part B implements fork(), which lets a user environment create a copy of itself.

  Part C adds inter-process communication (IPC) so that user environments can communicate and synchronize explicitly, plus support for hardware clock interrupts and preemption.

 

Part A: Multiprocessor Support and Cooperative Multitasking


In the first part of this lab, you will first extend JOS to run on a multiprocessor system, and then implement some new JOS kernel system calls to allow user-level environments to create additional new environments. You will also implement cooperative round-robin scheduling, allowing the kernel to switch from one environment to another when the current environment voluntarily relinquishes the CPU (or exits). Later in part C you will implement preemptive scheduling, which allows the kernel to re-take control of the CPU from an environment after a certain time has passed even if the environment does not cooperate.

  In the first part of this lab you extend JOS to run on a multiprocessor, implement new kernel system calls that let user environments create additional environments, and implement cooperative round-robin scheduling, where the kernel switches environments only when the current one voluntarily yields the CPU (or exits); part C later adds preemptive scheduling.

 

Topic: Multiprocessor Support

We are going to make JOS support "symmetric multiprocessing" (SMP), a multiprocessor model in which all CPUs have equivalent access to system resources such as memory and I/O buses. While all CPUs are functionally identical in SMP, during the boot process they can be classified into two types: the bootstrap processor (BSP) is responsible for initializing the system and for booting the operating system; and the application processors (APs) are activated by the BSP only after the operating system is up and running. Which processor is the BSP is determined by the hardware and the BIOS. Up to this point, all your existing JOS code has been running on the BSP.

In an SMP system, each CPU has an accompanying local APIC (LAPIC) unit. The LAPIC units are responsible for delivering interrupts throughout the system. The LAPIC also provides its connected CPU with a unique identifier. In this lab, we make use of the following basic functionality of the LAPIC unit (in kern/lapic.c):

  • Reading the LAPIC identifier (APIC ID) to tell which CPU our code is currently running on (see cpunum()).
  • Sending the STARTUP interprocessor interrupt (IPI) from the BSP to the APs to bring up other CPUs (see lapic_startap()).
  • In part C, we program LAPIC's built-in timer to trigger clock interrupts to support preemptive multitasking (see apic_init()).

  We first make JOS support symmetric multiprocessing (SMP), a model in which all CPUs have equal access to system resources such as memory and the I/O buses. Although all CPUs are functionally identical under SMP, during boot they fall into two types: the bootstrap processor (BSP) initializes the system and boots the operating system, while the application processors (APs) are activated by the BSP only after the OS is up and running. Which processor is the BSP is decided by the hardware and the BIOS; up to now, all our JOS code has been running on the BSP.

  In an SMP system, each CPU has an accompanying local APIC (LAPIC) unit. The LAPIC units deliver interrupts throughout the system and give each connected CPU a unique ID. In this lab we use the following basic LAPIC functionality (in kern/lapic.c):

  •   Reading the LAPIC ID to tell which CPU our code is running on.
  •   Sending the STARTUP interprocessor interrupt (IPI) from the BSP to the APs to bring up the other CPUs.
  •   In part C, programming the LAPIC's built-in timer to trigger clock interrupts for preemptive multitasking.

A processor accesses its LAPIC using memory-mapped I/O (MMIO). In MMIO, a portion of physical memory is hardwired to the registers of some I/O devices, so the same load/store instructions typically used to access memory can be used to access device registers. You've already seen one IO hole at physical address 0xA0000 (we use this to write to the VGA display buffer). The LAPIC lives in a hole starting at physical address 0xFE000000 (32MB short of 4GB), so it's too high for us to access using our usual direct map at KERNBASE. The JOS virtual memory map leaves a 4MB gap at MMIOBASE so we have a place to map devices like this. Since later labs introduce more MMIO regions, you'll write a simple function to allocate space from this region and map device memory to it.

  A processor accesses its LAPIC through memory-mapped I/O (MMIO): a portion of physical memory is hardwired to the registers of some I/O devices, so the same load/store instructions used for memory can access device registers. We have already seen one I/O hole at physical address 0xA0000 (the VGA display buffer). The LAPIC lives in a hole starting at physical address 0xFE000000, but the direct map at KERNBASE only reaches physical address 0x0FFFFFFF, so it cannot reach the LAPIC. The JOS virtual memory map therefore leaves a 4MB gap at MMIOBASE for mapping devices like this; we will write the code that allocates from this region next.

 

Exercise 1

Exercise 1. Implement mmio_map_region in kern/pmap.c. To see how this is used, look at the beginning of lapic_init in kern/lapic.c. You'll have to do the next exercise, too, before the tests for mmio_map_region will run.

  First look at lapic_init() in kern/lapic.c:

	// lapicaddr is the physical address of the LAPIC's 4K MMIO
	// region.  Map it in to virtual memory so we can access it.
	lapic = mmio_map_region(lapicaddr, 4096);

  It maps the 4KB of physical memory starting at lapicaddr into virtual memory.

//
// Reserve size bytes in the MMIO region and map [pa,pa+size) at this
// location.  Return the base of the reserved region.  size does *not*
// have to be multiple of PGSIZE.
//
void *
mmio_map_region(physaddr_t pa, size_t size)
{
	// Where to start the next region.  Initially, this is the
	// beginning of the MMIO region.  Because this is static, its
	// value will be preserved between calls to mmio_map_region
	// (just like nextfree in boot_alloc).
	static uintptr_t base = MMIOBASE;

	// Reserve size bytes of virtual memory starting at base and
	// map physical pages [pa,pa+size) to virtual addresses
	// [base,base+size).  Since this is device memory and not
	// regular DRAM, you'll have to tell the CPU that it isn't
	// safe to cache access to this memory.  Luckily, the page
	// tables provide bits for this purpose; simply create the
	// mapping with PTE_PCD|PTE_PWT (cache-disable and
	// write-through) in addition to PTE_W.  (If you're interested
	// in more details on this, see section 10.5 of IA32 volume
	// 3A.)
	//
	// Be sure to round size up to a multiple of PGSIZE and to
	// handle if this reservation would overflow MMIOLIM (it's
	// okay to simply panic if this happens).
	//
	// Hint: The staff solution uses boot_map_region.
	//
	// Your code here:

	// Round [pa, pa+size) out to whole pages.
	size = ROUNDUP(pa + size, PGSIZE) - ROUNDDOWN(pa, PGSIZE);
	pa = ROUNDDOWN(pa, PGSIZE);
	if (base + size > MMIOLIM)
		panic("mmio_map_region: reservation overflows MMIOLIM");
	boot_map_region(kern_pgdir, base, size, pa, PTE_W|PTE_PCD|PTE_PWT);
	uintptr_t res = base;
	base +=size;
	return (void *)res;
}

Application Processor Bootstrap

Before booting up APs, the BSP should first collect information about the multiprocessor system, such as the total number of CPUs, their APIC IDs and the MMIO address of the LAPIC unit. The mp_init() function in kern/mpconfig.c retrieves this information by reading the MP configuration table that resides in the BIOS's region of memory.

The boot_aps() function (in kern/init.c) drives the AP bootstrap process. APs start in real mode, much like how the bootloader started in boot/boot.S, so boot_aps() copies the AP entry code (kern/mpentry.S) to a memory location that is addressable in the real mode. Unlike with the bootloader, we have some control over where the AP will start executing code; we copy the entry code to 0x7000 (MPENTRY_PADDR), but any unused, page-aligned physical address below 640KB would work.

After that, boot_aps() activates APs one after another, by sending STARTUP IPIs to the LAPIC unit of the corresponding AP, along with an initial CS:IP address at which the AP should start running its entry code (MPENTRY_PADDR in our case). The entry code in kern/mpentry.S is quite similar to that of boot/boot.S. After some brief setup, it puts the AP into protected mode with paging enabled, and then calls the C setup routine mp_main() (also in kern/init.c). boot_aps() waits for the AP to signal a CPU_STARTED flag in cpu_status field of its struct CpuInfo before going on to wake up the next one.

  Before booting the APs, the BSP must first collect information about the multiprocessor system, such as the total number of CPUs, their APIC IDs, and the MMIO address of the LAPIC unit. The mp_init() function in kern/mpconfig.c reads the MP configuration table in the BIOS's region of memory to get this information.

  The boot_aps() function in kern/init.c drives the AP bootstrap process. APs start in real mode, much like the bootloader in boot/boot.S, so boot_aps() copies the AP entry code (kern/mpentry.S) to a memory location addressable in real mode: 0x7000 (MPENTRY_PADDR), though any unused, page-aligned physical address below 640KB would work.

  After that, boot_aps() activates the APs one by one by sending a STARTUP IPI to each AP's LAPIC unit, together with the initial CS:IP at which the AP should start running its entry code (MPENTRY_PADDR here). The entry code in kern/mpentry.S is quite similar to boot/boot.S: after some brief setup it puts the AP into protected mode with paging enabled, then calls mp_main() in kern/init.c. boot_aps() waits for the AP to signal CPU_STARTED in the cpu_status field of its struct CpuInfo before waking up the next one.

Exercise 2

Exercise 2. Read boot_aps() and mp_main() in kern/init.c, and the assembly code in kern/mpentry.S. Make sure you understand the control flow transfer during the bootstrap of APs. Then modify your implementation of page_init() in kern/pmap.c to avoid adding the page at MPENTRY_PADDR to the free list, so that we can safely copy and run AP bootstrap code at that physical address. Your code should pass the updated check_page_free_list() test (but might fail the updated check_kern_pgdir() test, which we will fix soon).

  Modify page_init() in kern/pmap.c so that the page at MPENTRY_PADDR is not added to the free list; that physical page will hold the copied AP bootstrap code. MPENTRY_PADDR is 0x7000, which lies inside [PGSIZE, npages_basemem * PGSIZE), so we add a special case for it:

		// 2.the rest of base memory,[PGSIZE, npages_basemem * PGSIZE) is free
		// mark the rest of base memory as free
		// address range 0x1000 ~ 0xA0000
		else if(i>=1 && i<npages_basemem){
			// lab4
			if(i == MPENTRY_PADDR/PGSIZE){
				pages[i].pp_ref = 1;
				pages[i].pp_link = NULL;
				continue;
			}
			pages[i].pp_ref = 0;
			pages[i].pp_link = page_free_list;
			page_free_list = &pages[i];
		}

  Result:

HemingbeardeMacBook-Pro:lab hemingbear$ make qemu
qemu-system-i386 -drive file=obj/kern/kernel.img,index=0,media=disk,format=raw -serial mon:stdio -gdb tcp::25501 -D qemu.log -smp 1 
6828 decimal is 15254 octal!
Physical memory: 131072K available, base = 640K, extended = 130432K
check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
kernel panic on CPU 0 at kern/pmap.c:912: assertion failed: check_va2pa(pgdir, base + KSTKGAP + i) == PADDR(percpu_kstacks[n]) + i

Question

  1. Compare kern/mpentry.S side by side with boot/boot.S. Bearing in mind that kern/mpentry.S is compiled and linked to run above KERNBASE just like everything else in the kernel, what is the purpose of macro MPBOOTPHYS? Why is it necessary in kern/mpentry.S but not in boot/boot.S? In other words, what could go wrong if it were omitted in kern/mpentry.S?
    Hint: recall the differences between the link address and the load address that we have discussed in Lab 1.

  Before an AP enables protected mode and paging, it runs in real mode and cannot address KERNBASE (0xf0000000), yet symbols such as start32 are linked at addresses above KERNBASE. The MPBOOTPHYS macro converts such a link address into the physical address the code was actually copied to (relative to MPENTRY_PADDR), which real mode can reach.

  boot.S does not need this because it is linked at the same address the BIOS loads it to, so its link addresses and load addresses coincide. If MPBOOTPHYS were omitted in mpentry.S, the AP would reference high link addresses that are meaningless before paging is enabled, and crash.

 

Per-CPU State and Initialization

When writing a multiprocessor OS, it is important to distinguish between per-CPU state that is private to each processor, and global state that the whole system shares. kern/cpu.h defines most of the per-CPU state, including struct CpuInfo, which stores per-CPU variables. cpunum() always returns the ID of the CPU that calls it, which can be used as an index into arrays like cpus. Alternatively, the macro thiscpu is shorthand for the current CPU's struct CpuInfo.

  When writing a multiprocessor OS we must distinguish per-CPU state, private to each processor, from global state shared by the whole system. kern/cpu.h defines most of the per-CPU state, including struct CpuInfo, which stores per-CPU variables. cpunum() returns the ID of the calling CPU, which can be used as an index into arrays like cpus; the macro thiscpu is shorthand for the current CPU's struct CpuInfo.

Here is the per-CPU state you should be aware of:

  • Per-CPU kernel stack.
    Because multiple CPUs can trap into the kernel simultaneously, we need a separate kernel stack for each processor to prevent them from interfering with each other's execution. The array percpu_kstacks[NCPU][KSTKSIZE] reserves space for NCPU's worth of kernel stacks.

    In Lab 2, you mapped the physical memory that bootstack refers to as the BSP's kernel stack just below KSTACKTOP. Similarly, in this lab, you will map each CPU's kernel stack into this region with guard pages acting as a buffer between them. CPU 0's stack will still grow down from KSTACKTOP; CPU 1's stack will start KSTKGAP bytes below the bottom of CPU 0's stack, and so on. inc/memlayout.h shows the mapping layout.

  • Per-CPU TSS and TSS descriptor.
    A per-CPU task state segment (TSS) is also needed in order to specify where each CPU's kernel stack lives. The TSS for CPU i is stored in cpus[i].cpu_ts, and the corresponding TSS descriptor is defined in the GDT entry gdt[(GD_TSS0 >> 3) + i]. The global ts variable defined in kern/trap.c will no longer be useful.

  • Per-CPU current environment pointer.
    Since each CPU can run different user process simultaneously, we redefined the symbol curenv to refer to cpus[cpunum()].cpu_env (or thiscpu->cpu_env), which points to the environment currently executing on the current CPU (the CPU on which the code is running).

  • Per-CPU system registers.
    All registers, including system registers, are private to a CPU. Therefore, instructions that initialize these registers, such as lcr3(), ltr(), lgdt(), lidt(), etc., must be executed once on each CPU. Functions env_init_percpu() and trap_init_percpu() are defined for this purpose.

  Here is the per-CPU state to be aware of:

  Per-CPU kernel stack: because multiple CPUs can trap into the kernel simultaneously, each processor needs its own kernel stack so that they do not interfere with each other. The array percpu_kstacks[NCPU][KSTKSIZE] reserves space for NCPU kernel stacks. In lab 2 we mapped the BSP's kernel stack just below KSTACKTOP; in this lab every CPU's kernel stack is mapped into that region, with a guard gap separating the stacks. CPU 0's (the BSP's) stack grows down from KSTACKTOP, CPU 1's stack starts KSTKGAP bytes below the bottom of CPU 0's stack, and so on; inc/memlayout.h shows the layout.

 *    KERNBASE, ---->  +------------------------------+ 0xf0000000      --+
 *    KSTACKTOP        |     CPU0's Kernel Stack      | RW/--  KSTKSIZE   |
 *                     | - - - - - - - - - - - - - - -|                   |
 *                     |      Invalid Memory (*)      | --/--  KSTKGAP    |
 *                     +------------------------------+                   |
 *                     |     CPU1's Kernel Stack      | RW/--  KSTKSIZE   |
 *                     | - - - - - - - - - - - - - - -|                 PTSIZE
 *                     |      Invalid Memory (*)      | --/--  KSTKGAP    |
 *                     +------------------------------+                   |
 *                     :              .               :                   |
 *                     :              .               :                   |
 *    MMIOLIM ------>  +------------------------------+ 0xefc00000      --+

  Per-CPU TSS and TSS descriptor: a per-CPU task state segment (TSS) is needed to specify where each CPU's kernel stack lives; the TSS records the segment descriptor and address of that stack. CPU i's TSS is stored in cpus[i].cpu_ts, and the corresponding TSS descriptor is the GDT entry gdt[(GD_TSS0 >> 3) + i]. The global ts variable in kern/trap.c is no longer used.

  Per-CPU current environment pointer: since each CPU can run a different user environment at the same time, curenv is redefined to refer to cpus[cpunum()].cpu_env (or thiscpu->cpu_env), the environment running on the current CPU.

  Per-CPU system registers: all registers, including system registers, are private to a CPU, so instructions that initialize them, such as lcr3(), ltr(), lgdt(), lidt(), must be executed once on each CPU; env_init_percpu() and trap_init_percpu() exist for this purpose.

 

Exercise 3

Exercise 3. Modify mem_init_mp() (in kern/pmap.c) to map per-CPU stacks starting at KSTACKTOP, as shown in inc/memlayout.h. The size of each stack is KSTKSIZE bytes plus KSTKGAP bytes of unmapped guard pages. Your code should pass the new check in check_kern_pgdir().

  Modify mem_init_mp() in kern/pmap.c to map the per-CPU kernel stacks starting at KSTACKTOP, following the layout in inc/memlayout.h. Each stack occupies KSTKSIZE bytes plus KSTKGAP bytes of unmapped guard pages.

  First comment out the BSP stack mapping written earlier in mem_init():

	//
	// Use the physical memory that 'bootstack' refers to as the kernel
	// stack.  The kernel stack grows down from virtual address KSTACKTOP.
	// We consider the entire range from [KSTACKTOP-PTSIZE, KSTACKTOP)
	// to be the kernel stack, but break this into two pieces:
	//     * [KSTACKTOP-KSTKSIZE, KSTACKTOP) -- backed by physical memory
	//     * [KSTACKTOP-PTSIZE, KSTACKTOP-KSTKSIZE) -- not backed; so if
	//       the kernel overflows its stack, it will fault rather than
	//       overwrite memory.  Known as a "guard page".
	//     Permissions: kernel RW, user NONE
	// Your code goes here:
	// args: page directory (kern_pgdir); virtual address; size of the
	// region; physical address to map to; page-table entry flags
	// boot_map_region(kern_pgdir,KSTACKTOP-KSTKSIZE,KSTKSIZE,PADDR(bootstack),PTE_W | PTE_P);

  Then add the new BSP and AP stack mappings in mem_init_mp():

// Modify mappings in kern_pgdir to support SMP
//   - Map the per-CPU stacks in the region [KSTACKTOP-PTSIZE, KSTACKTOP)
//
static void
mem_init_mp(void)
{
	// Map per-CPU stacks starting at KSTACKTOP, for up to 'NCPU' CPUs.
	//
	// For CPU i, use the physical memory that 'percpu_kstacks[i]' refers
	// to as its kernel stack. CPU i's kernel stack grows down from virtual
	// address kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP), and is
	// divided into two pieces, just like the single stack you set up in
	// mem_init:
	//     * [kstacktop_i - KSTKSIZE, kstacktop_i)
	//          -- backed by physical memory
	//     * [kstacktop_i - (KSTKSIZE + KSTKGAP), kstacktop_i - KSTKSIZE)
	//          -- not backed; so if the kernel overflows its stack,
	//             it will fault rather than overwrite another CPU's stack.
	//             Known as a "guard page".
	//     Permissions: kernel RW, user NONE
	//
	// LAB 4: Your code here:
	uintptr_t start = KSTACKTOP - KSTKSIZE;
	for (size_t i = 0; i < NCPU; ++i) {
		// Map KSTKSIZE bytes of percpu_kstacks[i] at the top of CPU i's
		// stack region; the KSTKGAP below stays unmapped as a guard.
		// Permissions: kernel RW, user NONE (PTE_W, not PTE_U).
		boot_map_region(kern_pgdir, start, KSTKSIZE,
		                PADDR(percpu_kstacks[i]), PTE_W | PTE_P);
		start -= (KSTKSIZE + KSTKGAP);
	}

}

Exercise 4

Exercise 4. The code in trap_init_percpu() (kern/trap.c) initializes the TSS and TSS descriptor for the BSP. It worked in Lab 3, but is incorrect when running on other CPUs. Change the code so that it can work on all CPUs. (Note: your new code should not use the global ts variable any more.)  

  The code in trap_init_percpu() in kern/trap.c initializes the TSS and TSS descriptor for the BSP, but it is incorrect on the other CPUs; change it so that it works on all CPUs:

// Initialize and load the per-CPU TSS and IDT
void
trap_init_percpu(void)
{
	// The example code here sets up the Task State Segment (TSS) and
	// the TSS descriptor for CPU 0. But it is incorrect if we are
	// running on other CPUs because each CPU has its own kernel stack.
	// Fix the code so that it works for all CPUs.
	//
	// Hints:
	//   - The macro "thiscpu" always refers to the current CPU's
	//     struct CpuInfo;
	//   - The ID of the current CPU is given by cpunum() or
	//     thiscpu->cpu_id;
	//   - Use "thiscpu->cpu_ts" as the TSS for the current CPU,
	//     rather than the global "ts" variable;
	//   - Use gdt[(GD_TSS0 >> 3) + i] for CPU i's TSS descriptor;
	//   - You mapped the per-CPU kernel stacks in mem_init_mp()
	//   - Initialize cpu_ts.ts_iomb to prevent unauthorized environments
	//     from doing IO (0 is not the correct value!)
	//
	// ltr sets a 'busy' flag in the TSS selector, so if you
	// accidentally load the same TSS on more than one CPU, you'll
	// get a triple fault.  If you set up an individual CPU's TSS
	// wrong, you may not get a fault until you try to return from
	// user space on that CPU.
	//
	// LAB 4: Your code here:
	struct Taskstate* this_ts = &thiscpu->cpu_ts;

	// Setup a TSS so that we get the right stack
	// when we trap to the kernel.
	this_ts->ts_esp0 = KSTACKTOP - thiscpu->cpu_id*(KSTKSIZE + KSTKGAP);
	this_ts->ts_ss0 = GD_KD;
	this_ts->ts_iomb = sizeof(struct Taskstate);

	// Initialize the TSS slot of the gdt.
	gdt[(GD_TSS0 >> 3 )+ thiscpu->cpu_id] = SEG16(STS_T32A, (uint32_t) (this_ts),
					sizeof(struct Taskstate) - 1, 0);
	gdt[(GD_TSS0 >> 3 )+ thiscpu->cpu_id].sd_s = 0;

	// Load the TSS selector (like other segment selectors, the
	// bottom three bits are special; we leave them 0)
	ltr(GD_TSS0 + (thiscpu->cpu_id << 3));

	// Load the IDT
	lidt(&idt_pd);
}

  Result of make qemu CPUS=4:

SMP: CPU 0 found 4 CPU(s)
enabled interrupts: 1 2
SMP: CPU 1 starting
SMP: CPU 2 starting
SMP: CPU 3 starting

Locking

Our current code spins after initializing the AP in mp_main(). Before letting the AP get any further, we need to first address race conditions when multiple CPUs run kernel code simultaneously. The simplest way to achieve this is to use a big kernel lock. The big kernel lock is a single global lock that is held whenever an environment enters kernel mode, and is released when the environment returns to user mode. In this model, environments in user mode can run concurrently on any available CPUs, but no more than one environment can run in kernel mode; any other environments that try to enter kernel mode are forced to wait.

kern/spinlock.h declares the big kernel lock, namely kernel_lock. It also provides lock_kernel() and unlock_kernel(), shortcuts to acquire and release the lock. You should apply the big kernel lock at four locations:

  • In i386_init(), acquire the lock before the BSP wakes up the other CPUs.
  • In mp_main(), acquire the lock after initializing the AP, and then call sched_yield() to start running environments on this AP.
  • In trap(), acquire the lock when trapped from user mode. To determine whether a trap happened in user mode or in kernel mode, check the low bits of the tf_cs.
  • In env_run(), release the lock right before switching to user mode. Do not do that too early or too late, otherwise you will experience races or deadlocks.

  Our current code spins after initializing the AP in mp_main(). Before letting the APs go any further, we must address the race conditions that arise when multiple CPUs run kernel code simultaneously. The simplest way is a big kernel lock: a single global lock acquired whenever an environment enters kernel mode and released when it returns to user mode. In this model user environments can run concurrently on any available CPUs, but at most one environment can be in kernel mode at a time; any other environment trying to enter kernel mode must wait.

  kern/spinlock.h declares the big kernel lock, kernel_lock, and provides lock_kernel() and unlock_kernel() as shortcuts to acquire and release it. Apply the big kernel lock in four places:

  •   In i386_init(), acquire the lock before the BSP wakes up the other CPUs.
  •   In mp_main(), acquire the lock after initializing the AP, then call sched_yield() to start running environments on this AP.
  •   In trap(), acquire the lock when trapped from user mode; check the low bits of tf_cs to tell whether the trap came from user or kernel mode.
  •   In env_run(), release the lock just before switching to user mode; doing it too early or too late causes races or deadlocks.

Exercise 5

Exercise 5. Apply the big kernel lock as described above, by calling lock_kernel() and unlock_kernel() at the proper locations.

  In i386_init() in kern/init.c, acquire the lock before the BSP starts the other CPUs:


	// Acquire the big kernel lock before waking up APs
	// Your code here:
	lock_kernel();

	// Starting non-boot CPUs
	boot_aps();

  In mp_main() in kern/init.c, acquire the lock after initializing the AP, then call sched_yield() to run environments on this AP:

	// Now that we have finished some basic setup, call sched_yield()
	// to start running processes on this CPU.  But make sure that
	// only one CPU can enter the scheduler at a time!
	//
	// Your code here:
	lock_kernel();
	sched_yield();

  In trap() in kern/trap.c, acquire the lock when the trap came from user mode, checking the low bits of tf_cs:

	if ((tf->tf_cs & 3) == 3) {
		// Trapped from user mode.
		// Acquire the big kernel lock before doing any
		// serious kernel work.
		// LAB 4: Your code here.
		lock_kernel();
		assert(curenv);

		// Garbage collect if current enviroment is a zombie
		if (curenv->env_status == ENV_DYING) {
			env_free(curenv);
			curenv = NULL;
			sched_yield();
		}

		// Copy trap frame (which is currently on the stack)
		// into 'curenv->env_tf', so that running the environment
		// will restart at the trap point.
		curenv->env_tf = *tf;
		// The trapframe on the stack should be ignored from here on.
		tf = &curenv->env_tf;
	}

  In env_run() in kern/env.c, release the lock just before switching to user mode; neither too early nor too late, or we get races or deadlocks:

	// An environment is currently running; this is a context switch
	if(curenv && curenv->env_status == ENV_RUNNING){
		curenv->env_status = ENV_RUNNABLE;
	}
	// cprintf("[%08x] new env %08x and pgdir address %x\n", curenv ? curenv->env_id : 0, e->env_id,e->env_pgdir);
	// Set up the new environment
	curenv = e;
	e->env_status = ENV_RUNNING;
	e->env_runs++;


	lcr3(PADDR(e->env_pgdir));

	unlock_kernel();
	// switch back to user mode
	env_pop_tf(&(e->env_tf));

Question

  1. It seems that using the big kernel lock guarantees that only one CPU can run the kernel code at a time. Why do we still need separate kernel stacks for each CPU? Describe a scenario in which using a shared kernel stack will go wrong, even with the protection of the big kernel lock.

  When an interrupt occurs, the hardware and the trapentry.S stubs push the trap frame (including the trap number) onto the kernel stack before trap() ever acquires the lock. If all CPUs shared one kernel stack, two CPUs trapping at the same time would push their trap frames on top of each other and corrupt them, even though the big kernel lock serializes trap() itself.

 

Topic: Round-Robin Scheduling

Your next task in this lab is to change the JOS kernel so that it can alternate between multiple environments in "round-robin" fashion. Round-robin scheduling in JOS works as follows:

  • The function sched_yield() in the new kern/sched.c is responsible for selecting a new environment to run. It searches sequentially through the envs[] array in circular fashion, starting just after the previously running environment (or at the beginning of the array if there was no previously running environment), picks the first environment it finds with a status of ENV_RUNNABLE (see inc/env.h), and calls env_run() to jump into that environment.
  • sched_yield() must never run the same environment on two CPUs at the same time. It can tell that an environment is currently running on some CPU (possibly the current CPU) because that environment's status will be ENV_RUNNING.
  • We have implemented a new system call for you, sys_yield(), which user environments can call to invoke the kernel's sched_yield() function and thereby voluntarily give up the CPU to a different environment.

  The next task is to implement round-robin (RR) scheduling, which works in JOS as follows:

  •   The function sched_yield() in the new kern/sched.c selects the next environment to run. It scans the envs[] array circularly, starting just after the previously running environment, picks the first environment it finds with status ENV_RUNNABLE, and calls env_run() to jump into it.
  •   sched_yield() must never run the same environment on two CPUs at once; an environment already running on some CPU has status ENV_RUNNING and is skipped.
  •   A new system call, sys_yield(), is already implemented; a user environment can call it to invoke the kernel's sched_yield() and voluntarily give up the CPU to another environment.

Exercise 6

Exercise 6. Implement round-robin scheduling in sched_yield() as described above. Don't forget to modify syscall() to dispatch sys_yield().

Make sure to invoke sched_yield() in mp_main.

Modify kern/init.c to create three (or more!) environments that all run the program user/yield.c.

Run make qemu. You should see the environments switch back and forth between each other five times before terminating, like below.

Test also with several CPUS: make qemu CPUS=2.

...
Hello, I am environment 00001000.
Hello, I am environment 00001001.
Hello, I am environment 00001002.
Back in environment 00001000, iteration 0.
Back in environment 00001001, iteration 0.
Back in environment 00001002, iteration 0.
Back in environment 00001000, iteration 1.
Back in environment 00001001, iteration 1.
Back in environment 00001002, iteration 1.
...

After the yield programs exit, there will be no runnable environment in the system, the scheduler should invoke the JOS kernel monitor. If any of this does not happen, then fix your code before proceeding.

If you use CPUS=1 at this point, all environments should successfully run. Setting CPUS larger than 1 at this time may result in a general protection or kernel page fault once there are no more runnable environments due to unhandled timer interrupts (which we will fix below!).

  Implement RR scheduling in sched_yield(), modify syscall() to dispatch sys_yield(), make sure mp_main() calls sched_yield(), and modify kern/init.c to create three environments that all run user/yield.c.

  The RR scheduling code in sched_yield():

  

// Choose a user environment to run and run it.
void
sched_yield(void)
{
	struct Env *idle;

	// Implement simple round-robin scheduling.
	//
	// Search through 'envs' for an ENV_RUNNABLE environment in
	// circular fashion starting just after the env this CPU was
	// last running.  Switch to the first such environment found.
	//
	// If no envs are runnable, but the environment previously
	// running on this CPU is still ENV_RUNNING, it's okay to
	// choose that environment.
	//
	// Never choose an environment that's currently running on
	// another CPU (env_status == ENV_RUNNING). If there are
	// no runnable environments, simply drop through to the code
	// below to halt the cpu.

	// LAB 4: Your code here.
	idle = curenv;
	// start scanning just after the current environment (at envs[0] if none)
	size_t index = idle ? ENVX(idle->env_id) : NENV - 1;
	for(size_t i=0;i<NENV;++i){
		index = (index + 1)%NENV;
		if(envs[index].env_status == ENV_RUNNABLE){
			env_run(&envs[index]);
			return;
		}
	}
	if(idle && idle->env_status == ENV_RUNNING){	// nothing else runnable: keep running the current env
		env_run(idle);
		return;
	}

	// sched_halt never returns
	sched_halt();
}

    Dispatch sys_yield() in syscall():

	case SYS_yield:
		sys_yield();
		break;

    Make sure mp_main() calls sched_yield():

	// Now that we have finished some basic setup, call sched_yield()
	// to start running processes on this CPU.  But make sure that
	// only one CPU can enter the scheduler at a time!
	//
	// Your code here:
	lock_kernel();
	sched_yield();

  Modify i386_init() in kern/init.c to create three environments:

#if defined(TEST)
	// Don't touch -- used by grading script!
	ENV_CREATE(TEST, ENV_TYPE_USER);
#else
	// Touch all you want.
	// ENV_CREATE(user_primes, ENV_TYPE_USER);
	// create 3 environments
	ENV_CREATE(user_yield, ENV_TYPE_USER);
	ENV_CREATE(user_yield, ENV_TYPE_USER);
	ENV_CREATE(user_yield, ENV_TYPE_USER);
#endif // TEST*

  Result of make qemu CPUS=2:

Hello, I am environment 00001000.
Hello, I am environment 00001001.
Back in environment 00001000, iteration 0.
Hello, I am environment 00001002.
Back in environment 00001001, iteration 0.
Back in environment 00001000, iteration 1.
Back in environment 00001002, iteration 0.
Back in environment 00001001, iteration 1.
Back in environment 00001000, iteration 2.
Back in environment 00001002, iteration 1.
Back in environment 00001001, iteration 2.
Back in environment 00001000, iteration 3.
Back in environment 00001002, iteration 2.
Back in environment 00001001, iteration 3.
Back in environment 00001000, iteration 4.
Back in environment 00001002, iteration 3.
All done in environment 00001000.
[00001000] exiting gracefully
[00001000] free env 00001000
Back in environment 00001001, iteration 4.
Back in environment 00001002, iteration 4.
All done in environment 00001001.
All done in environment 00001002.
[00001001] exiting gracefully
[00001001] free env 00001001
[00001002] exiting gracefully
[00001002] free env 00001002
No runnable environments in the system!

Question

1.In your implementation of env_run() you should have called lcr3(). Before and after the call to lcr3(), your code makes references (at least it should) to the variable e, the argument to env_run. Upon loading the %cr3 register, the addressing context used by the MMU is instantly changed. But a virtual address (namely e) has meaning relative to a given address context--the address context specifies the physical address to which the virtual address maps. Why can the pointer e be dereferenced both before and after the addressing switch?

    Question: why can the pointer e still be dereferenced after lcr3() switches to a different page directory?

    Because every environment's page directory is built from kern_pgdir, the kernel portion of the address space is mapped identically in all address spaces; e points into envs[], which lives in that kernel region, so it maps to the same physical memory before and after the switch.

2.Whenever the kernel switches from one environment to another, it must ensure the old environment's registers are saved so they can be restored properly later. Why? Where does this happen?

  Because restoring the old environment later depends on the saved register values.

  The registers are saved in _alltraps in kern/trapentry.S, onto the kernel stack, and handed between kernel functions via the pointer tf; they are restored in env_pop_tf() in kern/env.c.

  Saving:

#define TRAPHANDLER_NOEC(name, num)					\
	.globl name;							\
	.type name, @function;						\
	.align 2;							\
	name:								\
	pushl $0;							\
	pushl $(num);							\
	jmp _alltraps
...

_alltraps:
pushl %ds    // save the segment registers
pushl %es
pushal       // save the general-purpose registers

movw $GD_KD, %ax
movw %ax, %ds
movw %ax, %es
pushl %esp   // push the stack pointer: the struct Trapframe * argument to trap()
call trap

  Restoring:

void
env_pop_tf(struct Trapframe *tf)
{
    // Record the CPU we are running on for user-space debugging
    curenv->env_cpunum = cpunum();

    asm volatile(
        "\tmovl %0,%%esp\n"    // point esp at the saved Trapframe
        "\tpopal\n"            // restore the general-purpose registers
        "\tpopl %%es\n"        // restore the segment registers
        "\tpopl %%ds\n"
        "\taddl $0x8,%%esp\n" /* skip tf_trapno and tf_errcode */
        "\tiret\n"
        : : "g" (tf) : "memory");
    panic("iret failed");  /* mostly to placate the compiler */
}

Topic: System Calls for Environment Creation

Although your kernel is now capable of running and switching between multiple user-level environments, it is still limited to running environments that the kernel initially set up. You will now implement the necessary JOS system calls to allow user environments to create and start other new user environments.

Unix provides the fork() system call as its process creation primitive. Unix fork() copies the entire address space of calling process (the parent) to create a new process (the child). The only differences between the two observable from user space are their process IDs and parent process IDs (as returned by getpid and getppid). In the parent, fork() returns the child's process ID, while in the child, fork() returns 0. By default, each process gets its own private address space, and neither process's modifications to memory are visible to the other.

  Although the kernel can now run and switch between multiple user environments, it can only run the environments it set up itself; we now add JOS system calls that let user environments create and start new environments.

  Unix provides the fork() system call as its process-creation primitive: fork() copies the entire address space of the calling process (the parent) to create a new process (the child). The only differences observable from user space are the process IDs (returned by getpid) and parent process IDs (returned by getppid). In the parent, fork() returns the child's process ID; in the child it returns 0. By default each process's address space is private, and neither process's writes to memory are visible to the other.

You will provide a different, more primitive set of JOS system calls for creating new user-mode environments. With these system calls you will be able to implement a Unix-like fork() entirely in user space, in addition to other styles of environment creation. The new system calls you will write for JOS are as follows:

sys_exofork:

This system call creates a new environment with an almost blank slate: nothing is mapped in the user portion of its address space, and it is not runnable. The new environment will have the same register state as the parent environment at the time of the sys_exofork call. In the parent, sys_exofork will return the envid_t of the newly created environment (or a negative error code if the environment allocation failed). In the child, however, it will return 0. (Since the child starts out marked as not runnable, sys_exofork will not actually return in the child until the parent has explicitly allowed this by marking the child runnable using....)

sys_env_set_status:

Sets the status of a specified environment to ENV_RUNNABLE or ENV_NOT_RUNNABLE. This system call is typically used to mark a new environment ready to run, once its address space and register state has been fully initialized.

sys_page_alloc:

Allocates a page of physical memory and maps it at a given virtual address in a given environment's address space.

sys_page_map:

Copy a page mapping (not the contents of a page!) from one environment's address space to another, leaving a memory sharing arrangement in place so that the new and the old mappings both refer to the same page of physical memory.

sys_page_unmap:

Unmap a page mapped at a given virtual address in a given environment.

  We will provide a different, more primitive set of JOS system calls for creating user environments; with them, a Unix-like fork() can be implemented entirely in user space. The calls to write are:

  • sys_exofork: creates an almost blank environment: nothing is mapped in its user address space and it is not runnable. The child's registers are copied from the parent at the time of the call. In the parent, sys_exofork returns the new environment's envid_t; in the child it returns 0 (the child only observes that return once the parent marks it runnable).
  • sys_env_set_status: sets an environment's status to ENV_RUNNABLE or ENV_NOT_RUNNABLE; typically used to mark a new environment ready to run once its address space and registers are fully initialized.
  • sys_page_alloc: allocates a page of physical memory and maps it at the given virtual address in the given environment's address space.
  • sys_page_map: copies a page mapping (not the page contents) from one environment to another, so both mappings refer to the same physical page, i.e. memory sharing.
  • sys_page_unmap: unmaps the page at the given virtual address in the given environment.

For all of the system calls above that accept environment IDs, the JOS kernel supports the convention that a value of 0 means "the current environment." This convention is implemented by envid2env() in kern/env.c.

We have provided a very primitive implementation of a Unix-like fork() in the test program user/dumbfork.c. This test program uses the above system calls to create and run a child environment with a copy of its own address space. The two environments then switch back and forth using sys_yield as in the previous exercise. The parent exits after 10 iterations, whereas the child exits after 20.

  For every system call above that takes an environment ID, JOS treats an argument of 0 as "the current environment"; this convention is implemented by envid2env() in kern/env.c.

  user/dumbfork.c provides a very primitive fork()-like test program that uses these system calls to create and run a child environment with a copy of its own address space. The two environments then switch back and forth with sys_yield under the round-robin scheduler from the previous exercise; the parent exits after 10 iterations, the child after 20.
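The "0 means the current environment" convention, together with the stale-id check that envid2env() performs, can be sketched on the host. This is a minimal simulation: NENV, the struct layout, and the return values are stand-ins mirroring kern/env.c, not the real JOS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NENV 8                      /* tiny table for the sketch */
#define ENVX(envid) ((envid) & (NENV - 1))

typedef int32_t envid_t;
enum { ENV_FREE, ENV_RUNNABLE };

struct Env { envid_t env_id; int env_status; };

static struct Env envs[NENV];
static struct Env *curenv;

/* 0 means "the current environment"; otherwise the slot must hold a
 * live environment whose full id (including generation bits) matches. */
static int envid2env(envid_t envid, struct Env **env_store)
{
    if (envid == 0) {
        *env_store = curenv;
        return 0;
    }
    struct Env *e = &envs[ENVX(envid)];
    if (e->env_status == ENV_FREE || e->env_id != envid) {
        *env_store = NULL;
        return -1;                  /* -E_BAD_ENV in JOS */
    }
    *env_store = e;
    return 0;
}
```

The high bits of an envid act as a generation number, so an id that indexes a live slot but belongs to an earlier incarnation of that slot is still rejected.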

 

Exercise 7

 

Exercise 7. Implement the system calls described above in kern/syscall.c. You will need to use various functions in kern/pmap.c and kern/env.c, particularly envid2env(). For now, whenever you call envid2env(), pass 1 in the checkperm parameter. Be sure you check for any invalid system call arguments, returning -E_INVAL in that case. Test your JOS kernel with user/dumbfork and make sure it works before proceeding.

  Implement the system calls above in kern/syscall.c, using functions from kern/pmap.c and kern/env.c, especially envid2env(). For now, always pass 1 as the checkperm parameter of envid2env(). Return -E_INVAL for any invalid system call argument, and make sure user/dumbfork passes before proceeding.

  The first function is sys_exofork() in kern/syscall.c. It creates an almost blank environment: no address mappings, not runnable, with registers copied from the parent at the time of the call. In the parent it returns the child's envid_t; in the child it returns 0 (the child only observes that return once the parent marks it runnable).

// Allocate a new environment.
// Returns envid of new environment, or < 0 on error.  Errors are:
//	-E_NO_FREE_ENV if no free environment is available.
//	-E_NO_MEM on memory exhaustion.
static envid_t
sys_exofork(void)
{
	// Create the new environment with env_alloc(), from kern/env.c.
	// It should be left as env_alloc created it, except that
	// status is set to ENV_NOT_RUNNABLE, and the register set is copied
	// from the current environment -- but tweaked so sys_exofork
	// will appear to return 0.

	// LAB 4: Your code here.
	// panic("sys_exofork not implemented");
	struct Env *e;
	int ret = env_alloc(&e,curenv->env_id);
	if(ret < 0) return ret;
	e->env_status = ENV_NOT_RUNNABLE;
	e->env_tf = curenv->env_tf;
	// tweaked so sys_exofork will appear to return 0.
	e->env_tf.tf_regs.reg_eax = 0;
	return e->env_id;
}

  The second function is sys_env_set_status() in kern/syscall.c: set the given environment's status to ENV_RUNNABLE or ENV_NOT_RUNNABLE. It is typically used to mark a new environment runnable once its address space and registers have been fully initialized.

// Set envid's env_status to status, which must be ENV_RUNNABLE
// or ENV_NOT_RUNNABLE.
//
// Returns 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
//	-E_INVAL if status is not a valid status for an environment.
static int
sys_env_set_status(envid_t envid, int status)
{
	// Hint: Use the 'envid2env' function from kern/env.c to translate an
	// envid to a struct Env.
	// You should set envid2env's third argument to 1, which will
	// check whether the current environment has permission to set
	// envid's status.

	// LAB 4: Your code here.
	// panic("sys_env_set_status not implemented");
	// -E_INVAL if status is not a valid status for an environment.
	if(status != ENV_RUNNABLE && status != ENV_NOT_RUNNABLE) 
		return -E_INVAL;
	struct Env *e;
	// -E_BAD_ENV if environment envid doesn't currently exist,
	if(envid2env(envid,&e,1)<0) 
		return -E_BAD_ENV;
	e->env_status = status;
	// Returns 0 on success
	return 0;
}

  The third function is sys_page_alloc() in kern/syscall.c: allocate a page of physical memory and map it at the given virtual address in the given environment's address space.

// Allocate a page of memory and map it at 'va' with permission
// 'perm' in the address space of 'envid'.
// The page's contents are set to 0.
// If a page is already mapped at 'va', that page is unmapped as a
// side effect.
//
// perm -- PTE_U | PTE_P must be set, PTE_AVAIL | PTE_W may or may not be set,
//         but no other bits may be set.  See PTE_SYSCALL in inc/mmu.h.
//
// Return 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
//	-E_INVAL if va >= UTOP, or va is not page-aligned.
//	-E_INVAL if perm is inappropriate (see above).
//	-E_NO_MEM if there's no memory to allocate the new page,
//		or to allocate any necessary page tables.
static int
sys_page_alloc(envid_t envid, void *va, int perm)
{
	// Hint: This function is a wrapper around page_alloc() and
	//   page_insert() from kern/pmap.c.
	//   Most of the new code you write should be to check the
	//   parameters for correctness.
	//   If page_insert() fails, remember to free the page you
	//   allocated!

	// LAB 4: Your code here.
	// panic("sys_page_alloc not implemented");
	if ((perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P)) return -E_INVAL;
	if ((perm & ~(PTE_U | PTE_P | PTE_AVAIL | PTE_W)) != 0) return -E_INVAL;
	if ((uintptr_t)va >= UTOP || PGOFF(va) != 0) return -E_INVAL;

	struct Env *e;
	if (envid2env(envid, &e, 1) < 0)	// look up the Env first, so a bad envid cannot leak a page
		return -E_BAD_ENV;
	struct PageInfo *pg = page_alloc(ALLOC_ZERO);	// allocate a zeroed physical page
	if (!pg) return -E_NO_MEM;
	if (page_insert(e->env_pgdir, pg, va, perm) < 0) {	// establish the mapping
		page_free(pg);
		return -E_NO_MEM;
	}
	return 0;
}
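The perm checks above can be exercised in isolation. Below is a host-side sketch of the validation rule ("PTE_U | PTE_P must be set, nothing outside PTE_SYSCALL may be"), with the PTE bit values copied in from inc/mmu.h so the sketch is self-contained:

```c
#include <assert.h>

/* Bit values from inc/mmu.h, copied here for a self-contained sketch. */
#define PTE_P     0x001
#define PTE_W     0x002
#define PTE_U     0x004
#define PTE_PS    0x080           /* example of a bit the syscall must reject */
#define PTE_AVAIL 0xE00
#define PTE_SYSCALL (PTE_AVAIL | PTE_P | PTE_W | PTE_U)

/* Valid iff PTE_U and PTE_P are both set and no bit outside PTE_SYSCALL is. */
static int perm_ok(int perm)
{
    if ((perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P))
        return 0;
    if ((perm & ~PTE_SYSCALL) != 0)
        return 0;
    return 1;
}
```

Note that a check like `(~perm & (PTE_U | PTE_P)) != 0` is equivalent to the first test; the common mistake is returning `E_INVAL` instead of `-E_INVAL`, which looks like success to callers that only test for negative values.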

  The fourth function is sys_page_map() in kern/syscall.c: copy a page mapping (not the page contents) from one environment to another, so that both map the same physical page.

// Map the page of memory at 'srcva' in srcenvid's address space
// at 'dstva' in dstenvid's address space with permission 'perm'.
// Perm has the same restrictions as in sys_page_alloc, except
// that it also must not grant write access to a read-only
// page.
//
// Return 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if srcenvid and/or dstenvid doesn't currently exist,
//		or the caller doesn't have permission to change one of them.
//	-E_INVAL if srcva >= UTOP or srcva is not page-aligned,
//		or dstva >= UTOP or dstva is not page-aligned.
//	-E_INVAL is srcva is not mapped in srcenvid's address space.
//	-E_INVAL if perm is inappropriate (see sys_page_alloc).
//	-E_INVAL if (perm & PTE_W), but srcva is read-only in srcenvid's
//		address space.
//	-E_NO_MEM if there's no memory to allocate any necessary page tables.
static int
sys_page_map(envid_t srcenvid, void *srcva,
	     envid_t dstenvid, void *dstva, int perm)
{
	// Hint: This function is a wrapper around page_lookup() and
	//   page_insert() from kern/pmap.c.
	//   Again, most of the new code you write should be to check the
	//   parameters for correctness.
	//   Use the third argument to page_lookup() to
	//   check the current permissions on the page.

	// LAB 4: Your code here.
	// panic("sys_page_map not implemented");
	if ((uintptr_t)srcva >= UTOP || PGOFF(srcva) != 0 ||
		(uintptr_t)dstva >= UTOP || PGOFF(dstva) != 0) return -E_INVAL;

	if ((perm & PTE_U) == 0 || (perm & PTE_P) == 0 || (perm & ~PTE_SYSCALL) != 0) return -E_INVAL;
	struct Env *srcenv, *dstenv;
	if (envid2env(srcenvid, &srcenv, 1) < 0 || envid2env(dstenvid, &dstenv, 1) < 0) return -E_BAD_ENV;
	pte_t *srcpte;
	struct PageInfo *pp = page_lookup(srcenv->env_pgdir, srcva, &srcpte);
	if (!pp) return -E_INVAL;	// srcva is not mapped in srcenvid's address space
	if ((perm & PTE_W) && (*srcpte & PTE_W) == 0) return -E_INVAL;	// may not grant write access to a read-only page
	if (page_insert(dstenv->env_pgdir, pp, dstva, perm) < 0) return -E_NO_MEM;
	return 0;
}

  The fifth function is sys_page_unmap() in kern/syscall.c: unmap the page at the given virtual address in the given environment.

// Unmap the page of memory at 'va' in the address space of 'envid'.
// If no page is mapped, the function silently succeeds.
//
// Return 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
//	-E_INVAL if va >= UTOP, or va is not page-aligned.
static int
sys_page_unmap(envid_t envid, void *va)
{
	// Hint: This function is a wrapper around page_remove().

	// LAB 4: Your code here.
	// panic("sys_page_unmap not implemented");
	if ((uintptr_t)va >= UTOP || PGOFF(va) != 0) return -E_INVAL;
	struct Env *e;
	if (envid2env(envid, &e, 1) < 0) return -E_BAD_ENV;
	page_remove(e->env_pgdir, va);
    return 0;
}

  Add dispatch cases for the new system calls in syscall():

// Dispatches to the correct kernel function, passing the arguments.
int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
	// Call the function corresponding to the 'syscallno' parameter.
	// Return any appropriate return value.
	// LAB 3: Your code here.

	// panic("syscall not implemented");
	int32_t ret = 0;

	switch (syscallno) {
	case SYS_cputs:
		sys_cputs((const char*)a1,a2);
		break;
	case SYS_cgetc:
		ret = sys_cgetc();
		break;
	case SYS_env_destroy:
		ret = sys_env_destroy(a1);
		break;
	case SYS_getenvid:
		ret = sys_getenvid();
		break;
	case SYS_yield:
		sys_yield();
		break;
	case SYS_exofork:
        ret = (int32_t)sys_exofork();
        break;
    case SYS_env_set_status:
        ret = sys_env_set_status(a1, a2);
        break;
    case SYS_page_alloc:
        ret = sys_page_alloc(a1,(void *)a2, (int)a3);
        break;
    case SYS_page_map:
        ret = sys_page_map(a1, (void *)a2, a3, (void*)a4, (int)a5);
        break;
    case SYS_page_unmap:
        ret = sys_page_unmap(a1, (void *)a2);
        break;

	default:
		ret = -E_INVAL;
	}
	return ret;
}

Part B: Copy-on-Write Fork


As mentioned earlier, Unix provides the fork() system call as its primary process creation primitive. The fork() system call copies the address space of the calling process (the parent) to create a new process (the child).

xv6 Unix implements fork() by copying all data from the parent's pages into new pages allocated for the child. This is essentially the same approach that dumbfork() takes. The copying of the parent's address space into the child is the most expensive part of the fork() operation.

However, a call to fork() is frequently followed almost immediately by a call to exec() in the child process, which replaces the child's memory with a new program. This is what the shell typically does, for example. In this case, the time spent copying the parent's address space is largely wasted, because the child process will use very little of its memory before calling exec().

  As mentioned earlier, Unix provides the fork() system call to create processes; fork() copies the calling process's (the parent's) address space to create the new process (the child).

  xv6 copies all the data in the parent's pages into fresh pages allocated for the child, which is also what dumbfork() does; this copying is the most expensive part of fork().

  However, fork() is typically followed almost immediately by an exec() in the child, which replaces the child's memory with a new program; this is what a shell typically does. In that case, copying the parent's address space is largely wasted work, since the copied data is discarded almost immediately.

For this reason, later versions of Unix took advantage of virtual memory hardware to allow the parent and child to share the memory mapped into their respective address spaces until one of the processes actually modifies it. This technique is known as copy-on-write. To do this, on fork() the kernel would copy the address space mappings from the parent to the child instead of the contents of the mapped pages, and at the same time mark the now-shared pages read-only. When one of the two processes tries to write to one of these shared pages, the process takes a page fault. At this point, the Unix kernel realizes that the page was really a "virtual" or "copy-on-write" copy, and so it makes a new, private, writable copy of the page for the faulting process. In this way, the contents of individual pages aren't actually copied until they are actually written to. This optimization makes a fork() followed by an exec() in the child much cheaper: the child will probably only need to copy one page (the current page of its stack) before it calls exec().

In the next piece of this lab, you will implement a "proper" Unix-like fork() with copy-on-write, as a user space library routine. Implementing fork() and copy-on-write support in user space has the benefit that the kernel remains much simpler and thus more likely to be correct. It also lets individual user-mode programs define their own semantics for fork(). A program that wants a slightly different implementation (for example, the expensive always-copy version like dumbfork(), or one in which the parent and child actually share memory afterward) can easily provide its own.

  Later Unix systems therefore optimized this: parent and child share the same physical memory, i.e. both virtual address spaces map to the same pages, until one of them actually modifies memory. This technique is called copy-on-write. On fork(), the kernel copies only the page mappings, not the page contents, and marks the shared pages read-only. When either process tries to write one of these pages, it takes a page fault; the kernel then recognizes the page as a "copy-on-write" copy and makes a new, private, writable copy for the faulting process. A page is thus only really copied when it is first written. This makes fork() followed by exec() in the child much cheaper: the child probably only needs to copy one page (its current stack page) before calling exec().

Why copy-on-write at all?
Giving each process a fully independent address space up front would be simpler and faster, but memory is a scarce and expensive resource, so Linux and its relatives try hard to conserve it. If the parent and child never modify a page, i.e. only ever read it, the OS has no reason to allocate a second physical page for the child: the child's page table can map exactly the same physical pages as the parent's, which is page sharing. The interesting question is what happens once one side writes to a shared page, because at that point the page can no longer be shared.

Concretely: fork duplicates the parent's page tables (so corresponding virtual addresses map to the same physical addresses) but not the physical pages themselves, and marks the shared pages read-only (similar to a private mmap). When either process performs a write to a shared page, the kernel allocates and copies one physical page for that process and updates its page table, while the other process keeps the original page, which is marked writable again. The key point is that the kernel is lazy about memory allocation: it only allocates a new page, and flips the relevant page-table bits, at the moment a write actually happens.

That is "copy-on-write". Because fork uses this mechanism, there is also the question of which of parent and child to schedule first. Kernels often run the child first: in many cases the child immediately calls exec(), clearing the stack, heap, and other regions shared with the parent and loading a new program, which avoids triggering copy-on-write at all. If the parent ran first and wrote to shared pages, those copies would be wasted work. (Which of the two actually runs first depends on many factors and is beyond the scope of this note.)
Source: https://blog.csdn.net/holy_666/article/details/85336387
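The sharing-then-copying behavior described above can be simulated entirely in user space. The following is a minimal sketch with a toy page type carrying a reference count; all the names here (page_new, cow_share, cow_write) are illustrative, not JOS or Unix APIs:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A "physical page" with a reference count. */
struct page { int ref; char data[16]; };

/* One process's mapping of a page. */
struct mapping { struct page *pg; };

static struct page *page_new(const char *init)
{
    struct page *p = malloc(sizeof *p);
    p->ref = 1;
    strncpy(p->data, init, sizeof p->data - 1);
    p->data[sizeof p->data - 1] = '\0';
    return p;
}

/* "fork": share the physical page; no copying happens yet. */
static void cow_share(struct mapping *child, struct mapping *parent)
{
    child->pg = parent->pg;
    child->pg->ref++;
}

/* write: copy only if somebody else still maps the page. */
static void cow_write(struct mapping *m, const char *s)
{
    if (m->pg->ref > 1) {
        struct page *copy = page_new(m->pg->data);
        m->pg->ref--;
        m->pg = copy;
    }
    strncpy(m->pg->data, s, sizeof m->pg->data - 1);
}
```

After cow_share, both mappings point at one page; the first cow_write through either mapping allocates a private copy, which is exactly the laziness described above.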

知识点:User-level page fault handling

A user-level copy-on-write fork() needs to know about page faults on write-protected pages, so that's what you'll implement first. Copy-on-write is only one of many possible uses for user-level page fault handling.

It's common to set up an address space so that page faults indicate when some action needs to take place. For example, most Unix kernels initially map only a single page in a new process's stack region, and allocate and map additional stack pages later "on demand" as the process's stack consumption increases and causes page faults on stack addresses that are not yet mapped. A typical Unix kernel must keep track of what action to take when a page fault occurs in each region of a process's space. For example, a fault in the stack region will typically allocate and map new page of physical memory. A fault in the program's BSS region will typically allocate a new page, fill it with zeroes, and map it. In systems with demand-paged executables, a fault in the text region will read the corresponding page of the binary off of disk and then map it.

This is a lot of information for the kernel to keep track of. Instead of taking the traditional Unix approach, you will decide what to do about each page fault in user space, where bugs are less damaging. This design has the added benefit of allowing programs great flexibility in defining their memory regions; you'll use user-level page fault handling later for mapping and accessing files on a disk-based file system.

  A user-level copy-on-write fork() needs to be notified of page faults on write-protected pages, so that is what we implement first. Copy-on-write is only one of many possible uses of user-level page fault handling.

Setting the Page Fault Handler

In order to handle its own page faults, a user environment will need to register a page fault handler entrypoint with the JOS kernel. The user environment registers its page fault entrypoint via the new sys_env_set_pgfault_upcall system call. We have added a new member to the Env structure, env_pgfault_upcall, to record this information.

  To handle its own page faults, a user environment needs to register a page fault handler entrypoint with the JOS kernel. It does so via the new sys_env_set_pgfault_upcall system call; a new field in the Env structure, env_pgfault_upcall, records this information.

Exercise 8

Exercise 8. Implement the sys_env_set_pgfault_upcall system call. Be sure to enable permission checking when looking up the environment ID of the target environment, since this is a "dangerous" system call.

  Implement the sys_env_set_pgfault_upcall system call, with permission checking enabled (checkperm = 1) when looking up the target environment, since this is a "dangerous" system call.

// Set the page fault upcall for 'envid' by modifying the corresponding struct
// Env's 'env_pgfault_upcall' field.  When 'envid' causes a page fault, the
// kernel will push a fault record onto the exception stack, then branch to
// 'func'.
//
// Returns 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	// LAB 4: Your code here.
	// panic("sys_env_set_pgfault_upcall not implemented");
	struct Env *e;
	if(envid2env(envid,&e,1)<0) return -E_BAD_ENV;
	e->env_pgfault_upcall = func;
	return 0;
}

  The call records func in the Env structure's env_pgfault_upcall field; when envid later causes a page fault, the kernel pushes a fault record onto the exception stack and branches to func.

Normal and Exception Stacks in User Environments

During normal execution, a user environment in JOS will run on the normal user stack: its ESP register starts out pointing at USTACKTOP, and the stack data it pushes resides on the page between USTACKTOP-PGSIZE and USTACKTOP-1 inclusive. When a page fault occurs in user mode, however, the kernel will restart the user environment running a designated user-level page fault handler on a different stack, namely the user exception stack. In essence, we will make the JOS kernel implement automatic "stack switching" on behalf of the user environment, in much the same way that the x86 processor already implements stack switching on behalf of JOS when transferring from user mode to kernel mode!

The JOS user exception stack is also one page in size, and its top is defined to be at virtual address UXSTACKTOP, so the valid bytes of the user exception stack are from UXSTACKTOP-PGSIZE through UXSTACKTOP-1 inclusive. While running on this exception stack, the user-level page fault handler can use JOS's regular system calls to map new pages or adjust mappings so as to fix whatever problem originally caused the page fault. Then the user-level page fault handler returns, via an assembly language stub, to the faulting code on the original stack.

Each user environment that wants to support user-level page fault handling will need to allocate memory for its own exception stack, using the sys_page_alloc() system call introduced in part A.

  During normal execution, a JOS user environment runs on its normal user stack: ESP starts at USTACKTOP, and the pushed data lives between USTACKTOP-PGSIZE and USTACKTOP-1 inclusive. When a page fault occurs in user mode, however, the kernel restarts the environment running the designated user-level page fault handler on a different stack, the user exception stack. The JOS kernel thus performs an automatic "stack switch" on the environment's behalf, much as the x86 processor switches stacks on JOS's behalf when transferring from user mode to kernel mode.

 The JOS exception stack is one page in size, with its top at virtual address UXSTACKTOP, so its valid bytes run from UXSTACKTOP-PGSIZE through UXSTACKTOP-1 inclusive. While running on it, the user-level page fault handler can use JOS's regular system calls to map new pages or adjust mappings so as to fix whatever caused the fault, and then return, via an assembly stub, to the faulting code on the original stack.

 Each environment that wants user-level page fault handling must allocate memory for its own exception stack, using the sys_page_alloc() system call completed in part A.

 

Invoking the User Page Fault Handler

You will now need to change the page fault handling code in kern/trap.c to handle page faults from user mode as follows. We will call the state of the user environment at the time of the fault the trap-time state.

If there is no page fault handler registered, the JOS kernel destroys the user environment with a message as before. Otherwise, the kernel sets up a trap frame on the exception stack that looks like a struct UTrapframe from inc/trap.h:

                    <-- UXSTACKTOP
trap-time esp
trap-time eflags
trap-time eip
trap-time eax       start of struct PushRegs
trap-time ecx
trap-time edx
trap-time ebx
trap-time esp
trap-time ebp
trap-time esi
trap-time edi       end of struct PushRegs
tf_err (error code)
fault_va            <-- %esp when handler is run 

  We now change the page fault handling code in kern/trap.c to handle faults from user mode. If no page fault handler is registered, the JOS kernel destroys the environment with a message as before; otherwise, it sets up a trap frame on the exception stack shaped like a struct UTrapframe.

The kernel then arranges for the user environment to resume execution with the page fault handler running on the exception stack with this stack frame; you must figure out how to make this happen. The fault_va is the virtual address that caused the page fault.

If the user environment is already running on the user exception stack when an exception occurs, then the page fault handler itself has faulted. In this case, you should start the new stack frame just under the current tf->tf_esp rather than at UXSTACKTOP. You should first push an empty 32-bit word, then a struct UTrapframe.

To test whether tf->tf_esp is already on the user exception stack, check whether it is in the range between UXSTACKTOP-PGSIZE and UXSTACKTOP-1, inclusive.

  The kernel then arranges for the environment to resume execution in the page fault handler, on the exception stack, with this stack frame; figuring out how to make that happen is the point of the exercise. The fault_va field is the virtual address that caused the fault.

  If the environment is already running on the user exception stack when the fault occurs, the page fault handler itself has faulted. In that case, the new frame starts just under the current tf->tf_esp rather than at UXSTACKTOP: first push an empty 32-bit word, then the struct UTrapframe.

 To test whether tf->tf_esp is already on the user exception stack, check whether it lies in the range UXSTACKTOP-PGSIZE through UXSTACKTOP-1, inclusive.
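The placement rule above is easy to check on the host. A sketch of the address computation, assuming the UXSTACKTOP and PGSIZE values from inc/memlayout.h (the concrete constant is an assumption of this sketch) and a 52-byte UTrapframe (13 words: fault_va, err, 8 PushRegs, eip, eflags, esp):

```c
#include <assert.h>
#include <stdint.h>

#define UXSTACKTOP 0xeec00000u   /* assumed value, per inc/memlayout.h */
#define PGSIZE     4096u
#define UTF_SIZE   52u           /* sizeof(struct UTrapframe): 13 words */

/* Where page_fault_handler() should place the UTrapframe, given the
 * trap-time esp. */
static uintptr_t utf_addr(uintptr_t trap_esp)
{
    if (trap_esp >= UXSTACKTOP - PGSIZE && trap_esp < UXSTACKTOP)
        /* recursive fault: leave one 32-bit scratch word below old esp */
        return trap_esp - 4 - UTF_SIZE;
    return UXSTACKTOP - UTF_SIZE;
}
```

The extra scratch word in the recursive case is what lets the return stub in lib/pfentry.S push the trap-time eip without clobbering the frame.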

 

Exercise 9

Exercise 9. Implement the code in page_fault_handler in kern/trap.c required to dispatch page faults to the user-mode handler. Be sure to take appropriate precautions when writing into the exception stack. (What happens if the user environment runs out of space on the exception stack?)

  First, the UTrapframe structure:

struct UTrapframe {
	/* information about the fault */
	uint32_t utf_fault_va;	/* va for T_PGFLT, 0 otherwise */
	uint32_t utf_err;
	/* trap-time return state */
	struct PushRegs utf_regs;
	uintptr_t utf_eip;
	uint32_t utf_eflags;
	/* the trap-time stack to return to */
	uintptr_t utf_esp;
} __attribute__((packed));

  Implement page_fault_handler() in kern/trap.c to dispatch page faults to the user-mode handler. Check permissions before writing to the exception stack, so an environment that overruns its exception stack is destroyed rather than allowed to corrupt memory.

void
page_fault_handler(struct Trapframe *tf)
{
	uint32_t fault_va;

	// Read processor's CR2 register to find the faulting address
	fault_va = rcr2();

	// Handle kernel-mode page faults.

	// LAB 3: Your code here.
	// check the low bits of the tf_cs


	if((tf->tf_cs & 3)==0) panic("kernel-mode page faults\n");

	// We've already handled kernel-mode exceptions, so if we get here,
	// the page fault happened in user mode.

	// Call the environment's page fault upcall, if one exists.  Set up a
	// page fault stack frame on the user exception stack (below
	// UXSTACKTOP), then branch to curenv->env_pgfault_upcall.
	//
	// The page fault upcall might cause another page fault, in which case
	// we branch to the page fault upcall recursively, pushing another
	// page fault stack frame on top of the user exception stack.
	//
	// It is convenient for our code which returns from a page fault
	// (lib/pfentry.S) to have one word of scratch space at the top of the
	// trap-time stack; it allows us to more easily restore the eip/esp. In
	// the non-recursive case, we don't have to worry about this because
	// the top of the regular user stack is free.  In the recursive case,
	// this means we have to leave an extra word between the current top of
	// the exception stack and the new stack frame because the exception
	// stack _is_ the trap-time stack.
	//
	// If there's no page fault upcall, the environment didn't allocate a
	// page for its exception stack or can't write to it, or the exception
	// stack overflows, then destroy the environment that caused the fault.
	// Note that the grade script assumes you will first check for the page
	// fault upcall and print the "user fault va" message below if there is
	// none.  The remaining three checks can be combined into a single test.
	//
	// Hints:
	//   user_mem_assert() and env_run() are useful here.
	//   To change what the user environment runs, modify 'curenv->env_tf'
	//   (the 'tf' variable points at 'curenv->env_tf').

	// LAB 4: Your code here.
	if (curenv->env_pgfault_upcall) {	// is a page fault upcall registered?
		struct UTrapframe *utf;
		if (tf->tf_esp >= UXSTACKTOP-PGSIZE && tf->tf_esp < UXSTACKTOP) {
			// recursive fault: leave one 32-bit scratch word below tf_esp
			utf = (struct UTrapframe*)(tf->tf_esp - 4 - sizeof(struct UTrapframe));
		} else {
			utf = (struct UTrapframe*)(UXSTACKTOP - sizeof(struct UTrapframe));
		}
		// check that the environment may write this range of the exception stack
		user_mem_assert(curenv, (void*)utf, sizeof(struct UTrapframe), PTE_U|PTE_W|PTE_P);
		// fill in the fault record
		utf->utf_fault_va = fault_va;
		utf->utf_err = tf->tf_err;	// the error code, not the trap number
		utf->utf_regs = tf->tf_regs;
		utf->utf_eip = tf->tf_eip;
		utf->utf_eflags = tf->tf_eflags;
		utf->utf_esp = tf->tf_esp;
		// redirect eip to the upcall and switch to the exception stack
		tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;
		tf->tf_esp = (uintptr_t)utf;
		env_run(curenv);
	}

	// Destroy the environment that caused the fault.
	cprintf("[%08x] user fault va %08x ip %08x\n",
		curenv->env_id, fault_va, tf->tf_eip);
	print_trapframe(tf);
	env_destroy(curenv);
}

User-mode Page Fault Entrypoint

Next, you need to implement the assembly routine that will take care of calling the C page fault handler and resume execution at the original faulting instruction. This assembly routine is the handler that will be registered with the kernel using sys_env_set_pgfault_upcall().

  Next, we implement the assembly routine that calls the C page fault handler and then resumes execution at the faulting instruction; this routine is what gets registered with the kernel via sys_env_set_pgfault_upcall().

Exercise 10

Exercise 10. Implement the _pgfault_upcall routine in lib/pfentry.S. The interesting part is returning to the original point in the user code that caused the page fault. You'll return directly there, without going back through the kernel. The hard part is simultaneously switching stacks and re-loading the EIP.

  Implement _pgfault_upcall in lib/pfentry.S. The interesting part is returning directly to the point in user code that faulted, without going back through the kernel (the flow is: user stack -> kernel stack -> user exception stack -> user stack); the hard part is switching stacks and reloading EIP at the same time.

 The assembly here is subtle; see exercise 10 of https://blog.csdn.net/cinmyheart/article/details/43875961 for a walkthrough.

	// Throughout the remaining code, think carefully about what
	// registers are available for intermediate calculations.  You
	// may find that you have to rearrange your code in non-obvious
	// ways as registers become unavailable as scratch space.
	//
	// LAB 4: Your code here.
	movl %esp,%ebx		// remember the exception-stack pointer
	movl 40(%esp),%eax	// eax = trap-time eip (offset 40 in the UTrapframe)
	movl 48(%esp),%esp	// switch to the trap-time stack (utf_esp)
	pushl %eax		// push the trap-time eip onto the trap-time stack


	// Restore the trap-time registers.  After you do this, you
	// can no longer modify any general-purpose registers.
	// LAB 4: Your code here.
	movl %ebx,%esp		// back to the exception stack
	subl $4,48(%esp)	// make utf_esp point at the eip we just pushed
	popl %eax		// skip utf_fault_va
	popl %eax		// skip utf_err
	popal
	// Restore eflags from the stack.  After you do this, you can
	// no longer use arithmetic operations or anything else that
	// modifies eflags.
	// LAB 4: Your code here.
	add $4,%esp	// skip utf_eip; a copy is already on the trap-time stack
	popfl
	// Switch back to the adjusted trap-time stack.
	// LAB 4: Your code here.
	popl %esp	// esp = utf_esp - 4, pointing at the pushed eip

	// Return to re-execute the instruction that faulted.
	// LAB 4: Your code here.
	ret		// pops the trap-time eip and resumes at the faulting instruction

Exercise 11

Finally, you need to implement the C user library side of the user-level page fault handling mechanism. Finish set_pgfault_handler() in lib/pgfault.c.

  Finally, finish set_pgfault_handler() in lib/pgfault.c to register the user-level page fault handler.

//
// Set the page fault handler function.
// If there isn't one yet, _pgfault_handler will be 0.
// The first time we register a handler, we need to
// allocate an exception stack (one page of memory with its top
// at UXSTACKTOP), and tell the kernel to call the assembly-language
// _pgfault_upcall routine when a page fault occurs.
//
void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
	int r;

	if (_pgfault_handler == 0) {
		// First time through!
		// LAB 4: Your code here.
		// panic("set_pgfault_handler not implemented");
		void *va = (void *)(UXSTACKTOP - PGSIZE);
		envid_t eid = sys_getenvid();
		r = sys_page_alloc(eid, va, PTE_P | PTE_U | PTE_W);
		if (r < 0) panic("set_pgfault_handler: sys_page_alloc: %e", r);
		r = sys_env_set_pgfault_upcall(eid, _pgfault_upcall);
		if (r < 0) panic("set_pgfault_handler: sys_env_set_pgfault_upcall: %e", r);

	}

	// Save handler pointer for assembly to call.
	_pgfault_handler = handler;
}

  Without a dispatch case for SYS_env_set_pgfault_upcall in syscall(), the faultdie test fails; adding the case makes it pass.

    case SYS_env_set_pgfault_upcall:
		ret = sys_env_set_pgfault_upcall(a1, (void *)a2);
		break;

知识点:Implementing Copy-on-Write Fork

You now have the kernel facilities to implement copy-on-write fork() entirely in user space.

We have provided a skeleton for your fork() in lib/fork.c. Like dumbfork(), fork() should create a new environment, then scan through the parent environment's entire address space and set up corresponding page mappings in the child. The key difference is that, while dumbfork() copied pages, fork() will initially only copy page mappings; fork() will copy each page only when one of the environments tries to write it.

  The kernel now provides everything needed to implement copy-on-write fork() entirely in user space. lib/fork.c contains a skeleton: like dumbfork(), fork() creates a new environment and walks the parent's entire address space, setting up corresponding mappings in the child. The key difference is that dumbfork() copied whole pages, while fork() copies only the page mappings; a page is copied only when one of the environments writes to the shared memory.

The basic control flow for fork() is as follows:

  1. The parent installs pgfault() as the C-level page fault handler, using the set_pgfault_handler() function you implemented above.
  2. The parent calls sys_exofork() to create a child environment.
  3. For each writable or copy-on-write page in its address space below UTOP, the parent calls duppage, which should map the page copy-on-write into the address space of the child and then remap the page copy-on-write in its own address space. duppage sets both PTEs so that the page is not writeable, and to contain PTE_COW in the "avail" field to distinguish copy-on-write pages from genuine read-only pages.

    The exception stack is not remapped this way, however. Instead you need to allocate a fresh page in the child for the exception stack. Since the page fault handler will be doing the actual copying and the page fault handler runs on the exception stack, the exception stack cannot be made copy-on-write: who would copy it?

    fork() also needs to handle pages that are present, but not writable or copy-on-write.

  4. The parent sets the user page fault entrypoint for the child to look like its own.
  5. The child is now ready to run, so the parent marks it runnable.

  The rough flow of fork() is:

  1. The parent installs pgfault() as the page fault handler using set_pgfault_handler().

  2. The parent calls sys_exofork() to create the child environment.

  3. For each writable or copy-on-write page below UTOP, the parent calls duppage, which maps the page copy-on-write into the child's address space and then remaps it copy-on-write in its own. duppage sets both PTEs so that the page is not writable and carries PTE_COW in the "avail" bits, which distinguishes copy-on-write pages from genuine read-only pages. The exception stack, however, is not remapped this way: a fresh page must be allocated for the child's exception stack, because the page fault handler does the actual copying and runs on the exception stack, so the exception stack itself cannot be copy-on-write.

  4. The parent sets the child's user page fault entrypoint to look like its own.

  5. The child is now ready to run, so the parent marks it runnable.

Each time one of the environments writes a copy-on-write page that it hasn't yet written, it will take a page fault. Here's the control flow for the user page fault handler:

  1. The kernel propagates the page fault to _pgfault_upcall, which calls fork()'s pgfault() handler.
  2. pgfault() checks that the fault is a write (check for FEC_WR in the error code) and that the PTE for the page is marked PTE_COW. If not, panic.
  3. pgfault() allocates a new page mapped at a temporary location and copies the contents of the faulting page into it. Then the fault handler maps the new page at the appropriate address with read/write permissions, in place of the old read-only mapping.

The user-level lib/fork.c code must consult the environment's page tables for several of the operations above (e.g., to check that the PTE for a page is marked PTE_COW). The kernel maps the environment's page tables at UVPT exactly for this purpose. It uses a clever mapping trick to make it easy to look up PTEs for user code. lib/entry.S sets up uvpt and uvpd so that you can easily look up page-table information in lib/fork.c.

  Every time an environment tries to write to a copy-on-write page it has not yet written, it takes a page fault. The control flow of the user page fault handler is:

  1. The kernel propagates the page fault to _pgfault_upcall, which calls fork()'s pgfault() handler.

  2. pgfault() checks that the fault was caused by a write (check FEC_WR in the error code) and that the page's PTE is marked PTE_COW; if not, panic.

  3. pgfault() allocates a new page, copies the contents of the faulting page into it, and then replaces the old read-only mapping with a read/write mapping of the new page.

 

Exercise 12

Exercise 12. Implement fork, duppage and pgfault in lib/fork.c.

Test your code with the forktree program. It should produce the following messages, with interspersed 'new env', 'free env', and 'exiting gracefully' messages. The messages may not appear in this order, and the environment IDs may be different.

	1000: I am ''
	1001: I am '0'
	2000: I am '00'
	2001: I am '000'
	1002: I am '1'
	3000: I am '11'
	3001: I am '10'
	4000: I am '100'
	1003: I am '01'
	5000: I am '010'
	4001: I am '011'
	2002: I am '110'
	1004: I am '001'
	1005: I am '111'
	1006: I am '101'

    Complete the fork, duppage and pgfault functions in lib/fork.c.

  The overall flow: when an environment tries to write to a shared page, a user-mode page fault is triggered; trap_dispatch() routes it, according to the fault type, to the pgfault() function in lib/fork.c.

  The first function to implement is fork(), modeled on user/dumbfork.c as the hints suggest.

//
// User-level fork with copy-on-write.
// Set up our page fault handler appropriately.
// Create a child.
// Copy our address space and page fault handler setup to the child.
// Then mark the child as runnable and return.
//
// Returns: child's envid to the parent, 0 to the child, < 0 on error.
// It is also OK to panic on error.
//
// Hint:
//   Use uvpd, uvpt, and duppage.
//   Remember to fix "thisenv" in the child process.
//   Neither user exception stack should ever be marked copy-on-write,
//   so you must allocate a new page for the child's user exception stack.
//
envid_t
fork(void)
{
	// LAB 4: Your code here.
	// panic("fork not implemented");
	set_pgfault_handler(pgfault);
	envid_t eid = sys_exofork();
	if(eid < 0) panic("fork fault 1\n");
	if(eid == 0){
		thisenv = &envs[ENVX(sys_getenvid())];
		return eid;
	}
	// parent process copy address space to child
	for (uintptr_t addr = UTEXT; addr < USTACKTOP; addr += PGSIZE) {
        if ( (uvpd[PDX(addr)] & PTE_P) && (uvpt[PGNUM(addr)] & PTE_P) ) {
            // dup page to child
            duppage(eid, PGNUM(addr));
        }
    }
    // alloc page for exception stack
    int r = sys_page_alloc(eid, (void *)(UXSTACKTOP-PGSIZE), PTE_U | PTE_W | PTE_P);
    if (r < 0) panic("fork fault 2\n");

    extern void _pgfault_upcall();
    r = sys_env_set_pgfault_upcall(eid, _pgfault_upcall);
    if (r < 0) panic("fork fault 3\n");

    // mark the child environment runnable
    if ((r = sys_env_set_status(eid, ENV_RUNNABLE)) < 0)
        panic("fork fault 4\n");
    return eid;

}

  The second function to implement is duppage(), which copies the parent's page mappings into the child.

//
// Map our virtual page pn (address pn*PGSIZE) into the target envid
// at the same virtual address.  If the page is writable or copy-on-write,
// the new mapping must be created copy-on-write, and then our mapping must be
// marked copy-on-write as well.  (Exercise: Why do we need to mark ours
// copy-on-write again if it was already copy-on-write at the beginning of
// this function?)
//
// Returns: 0 on success, < 0 on error.
// It is also OK to panic on error.
//
static int
duppage(envid_t envid, unsigned pn)
{
	int r;

	// LAB 4: Your code here.
	// panic("duppage not implemented");
	envid_t myenvid = sys_getenvid();
	pte_t pte = uvpt[pn];
	int perm = PTE_U | PTE_P;
	if(pte & PTE_W || pte & PTE_COW)
		perm |= PTE_COW;

	if((r = sys_page_map(myenvid,(void*)(pn*PGSIZE),envid,(void*)(pn*PGSIZE),perm))<0){
		panic("duppage fault :%e\n",r);
	}

	//if COW remap to self
	if(perm & PTE_COW){
		if((r = sys_page_map(myenvid,(void*)(pn*PGSIZE),myenvid,(void*)(pn*PGSIZE),perm))<0)
			panic("duppage fault :%e\n",r);
	}
	return 0;
}

  The third function to implement is pgfault(), the page fault handler; when invoked, it allocates a private physical page for the faulting environment.

//
// Custom page fault handler - if faulting page is copy-on-write,
// map in our own private writable copy.
//
static void
pgfault(struct UTrapframe *utf)
{
	void *addr = (void *) utf->utf_fault_va;
	uint32_t err = utf->utf_err;
	int r;

	// Check that the faulting access was (1) a write, and (2) to a
	// copy-on-write page.  If not, panic.
	// Hint:
	//   Use the read-only page table mappings at uvpt
	//   (see <inc/memlayout.h>).

	// LAB 4: Your code here.

	// Allocate a new page, map it at a temporary location (PFTEMP),
	// copy the data from the old page to the new page, then move the new
	// page to the old page's address.
	// Hint:
	//   You should make three system calls.

	// LAB 4: Your code here.

	// panic("pgfault not implemented");

	// Check that the fault was caused by a write to a copy-on-write page.
	if ((err & FEC_WR) == 0)
		panic("pgfault: fault is not a write\n");

	if ((uvpt[PGNUM(addr)] & PTE_COW) == 0)
		panic("pgfault: fault is not on a copy-on-write page\n");

	envid_t envid = sys_getenvid();

	// Allocate a new page at the temporary location PFTEMP.
	if ((r = sys_page_alloc(envid, (void *)PFTEMP, PTE_P | PTE_W | PTE_U)) < 0)
		panic("pgfault: page allocation failed %e\n", r);

	addr = ROUNDDOWN(addr, PGSIZE);

	// Copy one page of data from addr to PFTEMP.
	memmove(PFTEMP, addr, PGSIZE);

	// Remap addr to the new physical page at PFTEMP, now writable.
	if ((r = sys_page_map(envid, PFTEMP, envid, addr, PTE_P | PTE_W | PTE_U)) < 0)
		panic("pgfault: page map failed %e\n", r);

	// Remove the temporary mapping at PFTEMP.
	if ((r = sys_page_unmap(envid, PFTEMP)) < 0)
		panic("pgfault: page unmap failed %e\n", r);
}

Results:

faultread: OK (1.7s) 
faultwrite: OK (2.9s) 
faultdie: OK (1.2s) 
faultregs: OK (2.0s) 
faultalloc: OK (1.9s) 
faultallocbad: OK (2.0s) 
faultnostack: OK (2.0s) 
faultbadhandler: OK (2.1s) 
faultevilhandler: OK (1.9s) 
forktree: OK (2.5s) 
Part B score: 50/50

Part C: Preemptive Multitasking and Inter-Process communication (IPC)


In the final part of lab 4 you will modify the kernel to preempt uncooperative environments and to allow environments to pass messages to each other explicitly.

  In part C, modify the kernel so that it can preempt uncooperative environments, and implement message passing between environments.

 

Exercise 13

Exercise 13. Modify kern/trapentry.S and kern/trap.c to initialize the appropriate entries in the IDT and provide handlers for IRQs 0 through 15. Then modify the code in env_alloc() in kern/env.c to ensure that user environments are always run with interrupts enabled.

The processor never pushes an error code or checks the Descriptor Privilege Level (DPL) of the IDT entry when invoking a hardware interrupt handler. You might want to re-read section 9.2 of the 80386 Reference Manual, or section 5.8 of the IA-32 Intel Architecture Software Developer's Manual, Volume 3, at this time.

After doing this exercise, if you run your kernel with any test program that runs for a non-trivial length of time (e.g., spin), you should see the kernel print trap frames for hardware interrupts. While interrupts are now enabled in the processor, JOS isn't yet handling them, so you should see it misattribute each interrupt to the currently running user environment and destroy it. Eventually it should run out of environments to destroy and drop into the monitor.

  Modify kern/trapentry.S and kern/trap.c to add handlers for the external interrupt requests (IRQs) 0 through 15.

// IRQs
TRAPHANDLER_NOEC(irqtimer_handler, IRQ_OFFSET + IRQ_TIMER)
TRAPHANDLER_NOEC(irqkbd_handler, IRQ_OFFSET + IRQ_KBD)
TRAPHANDLER_NOEC(irqserial_handler, IRQ_OFFSET + IRQ_SERIAL)
TRAPHANDLER_NOEC(irqspurious_handler, IRQ_OFFSET + IRQ_SPURIOUS)
TRAPHANDLER_NOEC(irqide_handler, IRQ_OFFSET + IRQ_IDE)
TRAPHANDLER_NOEC(irqerror_handler, IRQ_OFFSET + IRQ_ERROR)
	// IRQS

	void irqtimer_handler();
	void irqkbd_handler();
	void irqserial_handler();
	void irqspurious_handler();
	void irqide_handler();
	void irqerror_handler();

  // IRQS
    SETGATE(idt[IRQ_OFFSET + IRQ_TIMER], 0, GD_KT, irqtimer_handler, 0);
    SETGATE(idt[IRQ_OFFSET + IRQ_KBD], 0, GD_KT, irqkbd_handler, 0);
    SETGATE(idt[IRQ_OFFSET + IRQ_SERIAL], 0, GD_KT, irqserial_handler, 0);
    SETGATE(idt[IRQ_OFFSET + IRQ_SPURIOUS], 0, GD_KT, irqspurious_handler, 0);
    SETGATE(idt[IRQ_OFFSET + IRQ_IDE], 0, GD_KT, irqide_handler, 0);
    SETGATE(idt[IRQ_OFFSET + IRQ_ERROR], 0, GD_KT, irqerror_handler, 0);

Exercise 14

Exercise 14. Modify the kernel's trap_dispatch() function so that it calls sched_yield() to find and run a different environment whenever a clock interrupt takes place.

You should now be able to get the user/spin test to work: the parent environment should fork off the child, sys_yield() to it a couple times but in each case regain control of the CPU after one time slice, and finally kill the child environment and terminate gracefully.

  Modify trap_dispatch() so that when a clock interrupt occurs, the kernel switches to another environment.

	if (tf->tf_trapno == IRQ_OFFSET + IRQ_TIMER) {
        lapic_eoi();
        sched_yield();
        return;
    }

  If stresssched fails, uncomment the sti instruction in sched_halt() in kern/sched.c:

        // Uncomment the following line after completing exercise 13
        "sti\n"

Exercise 15

Exercise 15. Implement sys_ipc_recv and sys_ipc_try_send in kern/syscall.c. Read the comments on both before implementing them, since they have to work together. When you call envid2env in these routines, you should set the checkperm flag to 0, meaning that any environment is allowed to send IPC messages to any other environment, and the kernel does no special permission checking other than verifying that the target envid is valid.

Then implement the ipc_recv and ipc_send functions in lib/ipc.c.

Use the user/pingpong and user/primes functions to test your IPC mechanism. user/primes will generate for each prime number a new environment until JOS runs out of environments. You might find it interesting to read user/primes.c to see all the forking and IPC going on behind the scenes.

  Implement the system calls sys_ipc_recv and sys_ipc_try_send, then wrap them in the library functions ipc_recv and ipc_send to support communication.

  sys_ipc_recv() and sys_ipc_try_send() cooperate as follows:

  1. When an environment calls sys_ipc_recv(), it blocks (its status is set to ENV_NOT_RUNNABLE) until another environment sends it a "message". If the caller passes a dstva below UTOP, it is also prepared to receive a page mapping.

  2. An environment calls sys_ipc_try_send() to send a "message" to a target environment. If the target has already called sys_ipc_recv(), the data is delivered and the call returns 0; otherwise it returns -E_IPC_NOT_RECV, meaning the target is not waiting for data. If srcva is below UTOP, the sender wants to share the physical page mapped at srcva with the receiver; on success, the sender's srcva and the receiver's dstva map the same physical page.

  First implement sys_ipc_recv() and sys_ipc_try_send() in kern/syscall.c, then wrap them as library functions.

// Try to send 'value' to the target env 'envid'.
// If srcva < UTOP, then also send page currently mapped at 'srcva',
// so that receiver gets a duplicate mapping of the same page.
//
// The send fails with a return value of -E_IPC_NOT_RECV if the
// target is not blocked, waiting for an IPC.
//
// The send also can fail for the other reasons listed below.
//
// Otherwise, the send succeeds, and the target's ipc fields are
// updated as follows:
//    env_ipc_recving is set to 0 to block future sends;
//    env_ipc_from is set to the sending envid;
//    env_ipc_value is set to the 'value' parameter;
//    env_ipc_perm is set to 'perm' if a page was transferred, 0 otherwise.
// The target environment is marked runnable again, returning 0
// from the paused sys_ipc_recv system call.  (Hint: does the
// sys_ipc_recv function ever actually return?)
//
// If the sender wants to send a page but the receiver isn't asking for one,
// then no page mapping is transferred, but no error occurs.
// The ipc only happens when no errors occur.
//
// Returns 0 on success, < 0 on error.
// Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist.
//		(No need to check permissions.)
//	-E_IPC_NOT_RECV if envid is not currently blocked in sys_ipc_recv,
//		or another environment managed to send first.
//	-E_INVAL if srcva < UTOP but srcva is not page-aligned.
//	-E_INVAL if srcva < UTOP and perm is inappropriate
//		(see sys_page_alloc).
//	-E_INVAL if srcva < UTOP but srcva is not mapped in the caller's
//		address space.
//	-E_INVAL if (perm & PTE_W), but srcva is read-only in the
//		current environment's address space.
//	-E_NO_MEM if there's not enough memory to map srcva in envid's
//		address space.
static int
sys_ipc_try_send(envid_t envid, uint32_t value, void *srcva, unsigned perm)
{
	// LAB 4: Your code here.
	// panic("sys_ipc_try_send not implemented");
	envid_t src_envid = sys_getenvid(); 
    struct Env *dst_e;
    if (envid2env(envid, &dst_e, 0) < 0) {
        return -E_BAD_ENV;
    }

    if (dst_e->env_ipc_recving == false) 
        return -E_IPC_NOT_RECV;
    
    // pass the value
    dst_e->env_ipc_value = value;
    dst_e->env_ipc_perm = 0;

    // pass the page
    if ((uintptr_t)srcva < UTOP) {
        // customize 0x200 as PTE_NO_CHECK
        unsigned tmp_perm = perm | 0x200;
        int r = sys_page_map(src_envid, srcva, envid, (void *)dst_e->env_ipc_dstva, tmp_perm);
        if (r < 0) return r;
        dst_e->env_ipc_perm = perm;
    }

    dst_e->env_ipc_from = src_envid;
    dst_e->env_status = ENV_RUNNABLE;
    // return from the syscall, set %eax
    dst_e->env_tf.tf_regs.reg_eax = 0;
    dst_e->env_ipc_recving = false;
    return 0;

}

// Block until a value is ready.  Record that you want to receive
// using the env_ipc_recving and env_ipc_dstva fields of struct Env,
// mark yourself not runnable, and then give up the CPU.
//
// If 'dstva' is < UTOP, then you are willing to receive a page of data.
// 'dstva' is the virtual address at which the sent page should be mapped.
//
// This function only returns on error, but the system call will eventually
// return 0 on success.
// Return < 0 on error.  Errors are:
//	-E_INVAL if dstva < UTOP but dstva is not page-aligned.
static int
sys_ipc_recv(void *dstva)
{
	// LAB 4: Your code here.
	// panic("sys_ipc_recv not implemented");
	if ((uintptr_t) dstva < UTOP && PGOFF(dstva) != 0) return -E_INVAL;

    envid_t envid = sys_getenvid();
    struct Env *e;
    // do not check permission
    if (envid2env(envid, &e, 0) < 0) return -E_BAD_ENV;
    
    e->env_ipc_recving = true;
    e->env_ipc_dstva = dstva;
    e->env_status = ENV_NOT_RUNNABLE;
    sys_yield();

    return 0;
}

// Receive a value via IPC and return it.
// If 'pg' is nonnull, then any page sent by the sender will be mapped at
//	that address.
// If 'from_env_store' is nonnull, then store the IPC sender's envid in
//	*from_env_store.
// If 'perm_store' is nonnull, then store the IPC sender's page permission
//	in *perm_store (this is nonzero iff a page was successfully
//	transferred to 'pg').
// If the system call fails, then store 0 in *fromenv and *perm (if
//	they're nonnull) and return the error.
// Otherwise, return the value sent by the sender
//
// Hint:
//   Use 'thisenv' to discover the value and who sent it.
//   If 'pg' is null, pass sys_ipc_recv a value that it will understand
//   as meaning "no page".  (Zero is not the right value, since that's
//   a perfectly valid place to map a page.)
int32_t
ipc_recv(envid_t *from_env_store, void *pg, int *perm_store)
{
	// LAB 4: Your code here.
	// panic("ipc_recv not implemented");
	int r;
    if (pg != NULL) {
        r = sys_ipc_recv(pg);
    } else {
        r = sys_ipc_recv((void *) UTOP);
    }
    if (r < 0) {
        // failed
        if (from_env_store != NULL) *from_env_store = 0;
        if (perm_store != NULL) *perm_store = 0;
        return r;
    } else {
        if (from_env_store != NULL) *from_env_store = thisenv->env_ipc_from;
        if (perm_store != NULL) *perm_store = thisenv->env_ipc_perm;
        return thisenv->env_ipc_value;
    }

	return 0;
}

// Send 'val' (and 'pg' with 'perm', if 'pg' is nonnull) to 'toenv'.
// This function keeps trying until it succeeds.
// It should panic() on any error other than -E_IPC_NOT_RECV.
//
// Hint:
//   Use sys_yield() to be CPU-friendly.
//   If 'pg' is null, pass sys_ipc_try_send a value that it will understand
//   as meaning "no page".  (Zero is not the right value.)
void
ipc_send(envid_t to_env, uint32_t val, void *pg, int perm)
{
	// LAB 4: Your code here.
	// panic("ipc_send not implemented");
	int r;
    if (pg == NULL) pg = (void *)UTOP;
    while ((r = sys_ipc_try_send(to_env, val, pg, perm)) < 0) {
        if (r != -E_IPC_NOT_RECV) panic("ipc send failed: %e", r);
        sys_yield();    // be CPU-friendly while the receiver isn't ready
    }
}

  Add the new system call types to the dispatcher:

	case SYS_ipc_try_send:
        ret = sys_ipc_try_send(a1, a2, (void *)a3, a4);
        break;
    case SYS_ipc_recv:
        ret = sys_ipc_recv((void *)a1);
        break;

  Results:

+ mk obj/kern/kernel.img
dumbfork: OK (1.5s) 
Part A score: 5/5

faultread: OK (1.5s) 
faultwrite: OK (2.3s) 
faultdie: OK (1.7s) 
faultregs: OK (2.1s) 
faultalloc: OK (2.0s) 
faultallocbad: OK (2.0s) 
faultnostack: OK (2.0s) 
faultbadhandler: OK (2.2s) 
faultevilhandler: OK (2.0s) 
forktree: OK (2.3s) 
Part B score: 50/50

spin: OK (1.7s) 
stresssched: OK (3.1s) 
sendpage: OK (1.0s) 
pingpong: OK (1.7s) 
primes: OK (4.9s) 
Part C score: 25/25

Score: 80/80

 

 

 
