Lab 2
@[6.828|OS kernel|MIT2014]
Course page: http://pdos.csail.mit.edu/6.828/2014/schedule.html
Lec 5 note
Isolation mechanisms
OS design driven by isolation, multiplexing, and sharing
What is isolation
the process is the unit of isolation
prevent process X from wrecking or spying on process Y
memory, cpu, FDs, resource exhaustion
prevent a process from wrecking the operating system itself
i.e. from preventing kernel from enforcing isolation
in the face of bugs or malice
e.g. a bad process may try to trick the h/w (hardware) or kernel
what are all the mechanisms that keep processes isolated?
user/kernel mode flag
address spaces
timeslicing
system call interface
the foundation of xv6’s isolation: user/kernel mode flag
controls whether instructions can access privileged h/w
called CPL on the x86, bottom two bits of %cs
CPL=0 – kernel mode – privileged
CPL=3 – user mode – no privilege
x86 CPL protects everything relevant to isolation
writes to %cs (to defend CPL)
every memory read/write
I/O port accesses
control register accesses (eflags, %cr3, %cr4, …)
every serious microprocessor has something similar
user/kernel mode flag is not enough
protects only against direct attacks on the hardware
kernel must configure control regs, page tables, &c to protect other stuff
e.g. kernel memory
how to do a system call – switching CPL
Q: would this be an OK design for user programs to make a system call:
set CPL=0
jmp sys_open
bad: user-specified instructions with CPL=0
Q: how about a combined instruction that sets CPL=0,
but requires an immediate jump to someplace in the kernel?
bad: user might jump somewhere awkward in the kernel
the x86 answer:
there are only a few permissible kernel entry points
INT instruction sets CPL=0 and jumps to an entry point
but user code can’t otherwise modify CPL or jump anywhere else in kernel
system call return sets CPL=3 before returning to user code
also a combined instruction (can’t separately set CPL and jmp)
but kernel is allowed to jump anywhere in user code
the result: well-defined notion of user vs kernel
either CPL=3 and executing user code
or CPL=0 and executing from entry point in kernel code
not:
CPL=0 and executing user
CPL=0 and executing anywhere in kernel the user pleases
how to isolate process memory?
idea: “address space”
give each process some memory it can access
for its code, variables, heap, stack
prevent it from accessing other memory (kernel or other processes)
how to create isolated address spaces?
xv6 uses x86 “paging hardware”
MMU translates (or “maps”) every address issued by program
VA -> PA
instruction fetch, data load/store
for kernel and user
there’s no way for any instruction to directly use a PA
MMU array w/ entry for each 4k range of “virtual” address space
refers to phy address for that “page”
this is the page table
o/s tells h/w to switch page table when switching process
why isolated?
each page table entry (PTE) has a bit saying if user-mode instructions can use
kernel only sets the bit for the memory in current process’s address space
paging h/w used in many ways, not just isolation
e.g. copy-on-write fork(), see Lab 4
note: you don’t need paging to isolate memory
type safety, JVM, Singularity
but paging is the most popular plan
how to isolate CPU?
prevent a process from hogging the CPU, e.g. buggy infinite loop
how to force uncooperative process to yield
h/w provides a periodic “clock interrupt”
forcefully suspends current process
jumps into kernel
which can switch to a different process
kernel must save/restore process state (registers)
totally transparent, even to cooperative processes
called “pre-emptive context switch”
note: traditional, but maybe not perfect; see exokernel paper
back to system calls
i’ve talked a lot about how o/s isolates processes
but need user/kernel to cooperate! user needs kernel services.
what should user/kernel interaction look like?
can’t let user r/w kernel mem (well, you can, later…)
kernel can r/w user mem
but don’t want to do this too much!
so style of system call interface is pretty simple
integers, strings (copying only), user-allocated buffers
no objects, data structures, &c
never any doubt about who owns memory
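The “copying only” rule above can be sketched in isolation: the kernel copies a bounded, NUL-terminated string out of the caller’s buffer instead of keeping a pointer into user memory. This is a hypothetical sketch — the name fetch_str and the validation shown are made up for illustration, not JOS’s or xv6’s actual copy-in code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical sketch: copy a user-supplied string into a kernel-owned
// buffer, so the kernel never holds a pointer into user memory.
// Returns 0 on success, -1 if the string doesn't fit.
static int fetch_str(char *kbuf, size_t kbuf_len, const char *ubuf)
{
    size_t i;
    for (i = 0; i < kbuf_len; i++) {
        kbuf[i] = ubuf[i];      // a real kernel would also validate ubuf
        if (kbuf[i] == '\0')
            return 0;           // whole string copied; kernel owns kbuf
    }
    return -1;                  // too long: reject rather than truncate
}
```

After the copy returns, there is never any doubt about who owns the bytes: the kernel buffer is the kernel’s, and the user page can be unmapped or reused freely.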
Some additional notes
- Explanation of .bss: “There is another section called the .bss. This section is like the data section, except that it doesn’t take up space in the executable.” The .text and .data sections both live in the executable (in embedded systems they are typically burned into the image file) and are loaded from it by the system; the .bss section is not stored in the executable and is zero-initialized by the system at load time.
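A quick way to see the difference: an uninitialized global lands in .bss and is guaranteed to read as zero at startup, while an initialized one lands in .data and occupies bytes in the file. A minimal check:

```c
#include <assert.h>

int bss_var;            // uninitialized: placed in .bss, zeroed at load
int data_var = 42;      // initialized: placed in .data, stored in the file

int check_sections(void)
{
    // C guarantees that static-storage objects without an initializer
    // start out as zero -- which is exactly what loading .bss provides.
    return bss_var == 0 && data_var == 42;
}
```

Running `size` on the compiled object would show data_var counted under the data segment and bss_var under bss.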
LAB part
Part 1: Physical Page Management
Exercise 1
In the file kern/pmap.c, you must implement code for the following functions (probably in the order given).
boot_alloc()
mem_init() (only up to the call to check_page_free_list(1))
page_init()
page_alloc()
page_free()
check_page_free_list() and check_page_alloc() test your physical page allocator. You should boot JOS and see whether check_page_alloc() reports success. Fix your code so that it passes. You may find it helpful to add your own assert()s to verify that your assumptions are correct.
page granularity: memory is allocated and tracked in units of whole pages
Before writing any code, read memlayout.h, mmu.h, and pmap.c carefully.
Write the code following the comments in the source:
- 1
    static void *
    boot_alloc(uint32_t n)
    {
        static char *nextfree;  // virtual address of next byte of free memory
        char *result;

        // Initialize nextfree if this is the first time.
        // 'end' is a magic symbol automatically generated by the linker,
        // which points to the end of the kernel's bss segment:
        // the first virtual address that the linker did *not* assign
        // to any kernel code or global variables.
        if (!nextfree) {
            extern char end[];
            nextfree = ROUNDUP((char *) end, PGSIZE);
        }

        // Allocate a chunk large enough to hold 'n' bytes, then update
        // nextfree.  Make sure nextfree is kept aligned to a multiple
        // of PGSIZE.
        result = nextfree;
        nextfree += ROUNDUP(n, PGSIZE);
        return result;
    }
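boot_alloc is a “bump” allocator: it only moves a cursor forward, rounded up to page granularity, and never frees. The same arithmetic can be checked standalone — PGSIZE, ROUNDUP, and the arena buffer here are stand-ins for the JOS definitions, not the real ones:

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096
// Same rounding trick as JOS's ROUNDUP macro.
#define ROUNDUP(a, n) ((((uintptr_t)(a) + (n) - 1) / (n)) * (n))

static char arena[16 * PGSIZE];       // stand-in for free physical memory
static char *nextfree;                // cursor: next free byte

static char *bump_alloc(uint32_t n)
{
    if (!nextfree)
        nextfree = (char *) ROUNDUP(arena, PGSIZE);
    char *result = nextfree;
    nextfree += ROUNDUP(n, PGSIZE);   // always advance by whole pages
    return result;
}
```

Note that even a 1-byte request consumes a full page, and a request of 0 bytes (as page_init uses with boot_alloc(0)) just reports the current cursor without moving it.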
- 2
    void
    mem_init(void)
    {
        // Added code: allocate and zero the PageInfo array.
        pages = (struct PageInfo *) boot_alloc(npages * sizeof(struct PageInfo));
        memset(pages, 0, npages * sizeof(struct PageInfo));
    }
- 3
    void
    page_init(void)
    {
        // 1) Mark physical page 0 as in use (preserves the real-mode
        //    IDT and BIOS structures, in case we ever need them).
        // 2) The rest of base memory, [PGSIZE, npages_basemem * PGSIZE),
        //    is free.
        // 3) The IO hole [IOPHYSMEM, EXTPHYSMEM) must never be allocated.
        // 4) Extended memory [EXTPHYSMEM, ...): the kernel image and the
        //    data structures allocated so far by boot_alloc() are in use;
        //    the rest is free.
        size_t i;
        // boot_alloc(0) returns the first free virtual address; subtract
        // KERNBASE to get the corresponding physical address.
        uint32_t nextfree = (uint32_t) boot_alloc(0) - KERNBASE;
        int lower_p = PGNUM(IOPHYSMEM);
        int upper_p = PGNUM(ROUNDUP(nextfree, PGSIZE));

        // pages[] was just memset to 0, so pages[1].pp_link is NULL and
        // page 1 can serve as the tail of the free list.
        page_free_list = &pages[1];
        for (i = 0; i < npages; i++) {
            if (i == 0 || i == 1) {
                // page 0 stays reserved; page 1 is already the list head
                pages[i].pp_ref = 0;
                continue;
            }
            if (i >= 2 && i < npages_basemem) {
                pages[i].pp_ref = 0;
                pages[i].pp_link = page_free_list;
                page_free_list = &pages[i];
                continue;
            }
            if (lower_p <= i && i < upper_p) {
                // IO hole plus kernel image and boot_alloc'd data: in use
                pages[i].pp_ref = 1;
            } else {
                pages[i].pp_ref = 0;
                pages[i].pp_link = page_free_list;
                page_free_list = &pages[i];
            }
        }
    }
- 4
    struct PageInfo *
    page_alloc(int alloc_flags)
    {
        if (page_free_list == NULL)
            return NULL;
        struct PageInfo *res = page_free_list;
        page_free_list = page_free_list->pp_link;
        res->pp_link = NULL;    // detach from the free list, as the lab requires
        if (alloc_flags & ALLOC_ZERO) {
            // page2kva converts the physical page to a kernel virtual
            // address; memset needs a virtual address.
            memset(page2kva(res), '\0', PGSIZE);
        }
        return res;
    }
- 5
    void
    page_free(struct PageInfo *pp)
    {
        // Hint: panic if pp->pp_ref is nonzero or pp->pp_link is not NULL.
        assert(pp->pp_ref == 0);
        assert(pp->pp_link == NULL);
        pp->pp_link = page_free_list;
        page_free_list = pp;
    }
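page_alloc and page_free together implement a singly linked free list threaded through the PageInfo array via pp_link. The push/pop pattern in miniature — the struct and function names below are illustrative stand-ins, not the JOS definitions:

```c
#include <assert.h>
#include <stddef.h>

// Illustrative miniature of JOS's free list: each PageInfo describes one
// physical page; free pages are chained through pp_link.
struct PageInfo {
    struct PageInfo *pp_link;
    unsigned short pp_ref;
};

static struct PageInfo pages[4];
static struct PageInfo *page_free_list;

static void list_free(struct PageInfo *pp)      // like page_free()
{
    assert(pp->pp_ref == 0 && pp->pp_link == NULL);
    pp->pp_link = page_free_list;
    page_free_list = pp;                        // push onto the head
}

static struct PageInfo *list_alloc(void)        // like page_alloc()
{
    struct PageInfo *res = page_free_list;
    if (res) {
        page_free_list = res->pp_link;          // pop the head
        res->pp_link = NULL;                    // detach from the list
    }
    return res;
}
```

Because pushes and pops both happen at the head, the list is LIFO: the most recently freed page is the next one allocated.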
Part 1 result screenshot:
Part 2: Virtual Memory
Exercise 2
Look at chapters 5 and 6 of the Intel 80386 Reference Manual, if you haven’t done so already. Read the sections about page translation and page-based protection closely (5.2 and 6.4). We recommend that you also skim the sections about segmentation; while JOS uses paging for virtual memory and protection, segment translation and segment-based protection cannot be disabled on the x86, so you will need a basic understanding of it.
Virtual, Linear, and Physical Addresses
Exercise 3
While GDB can only access QEMU’s memory by virtual address, it’s often useful to be able to inspect physical memory while setting up virtual memory. Review the QEMU monitor commands from the lab tools guide, especially the xp command, which lets you inspect physical memory. To access the QEMU monitor, press Ctrl-a c in the terminal (the same binding returns to the serial console).
Use the xp command in the QEMU monitor and the x command in GDB to inspect memory at corresponding physical and virtual addresses and make sure you see the same data.
Our patched version of QEMU provides an info pg command that may also prove useful: it shows a compact but detailed representation of the current page tables, including all mapped memory ranges, permissions, and flags. Stock QEMU also provides an info mem command that shows an overview of which ranges of virtual memory are mapped and with what permissions.
- All pointer dereferences use virtual addresses.
pp_ref
(struct PageInfo) In general, this count (pp_ref) should equal the number of times the physical page appears below UTOP (0xeec00000) in all page tables (the mappings above UTOP are mostly set up at boot time by the kernel and should never be freed, so there’s no need to reference count them).
Page Table Management
Exercise 4
In the file kern/pmap.c, you must implement code for the following functions.
pgdir_walk()
boot_map_region()
page_lookup()
page_remove()
page_insert()
check_page(), called from mem_init(), tests your page table management routines. You should make sure it reports success before proceeding.
- 1 pgdir_walk()
Note what the following helper macros do: page2kva(), page2pa(), pa2page(), PTX(), PDX(), PGNUM(), KADDR(), PADDR(), etc.
- 2 The low 12 bits of a PDE/PTE are flag bits (http://leanote.com/file/outputImage?fileId=55e3128838f4115bf2007abb)
P stands for “present”. Background on x86 address translation, needed before this exercise:
http://pdos.csail.mit.edu/6.828/2014/lec/x86_translation_and_registers.pdf
- 3 Code
pgdir_walk():
    pte_t *
    pgdir_walk(pde_t *pgdir, const void *va, int create)
    {
        pte_t *entry;

        if (!(pgdir[PDX(va)] & PTE_P)) {
            if (create == 0)
                return NULL;
            // Allocate a zeroed page to hold the new page table.
            struct PageInfo *page = page_alloc(ALLOC_ZERO);
            if (page == NULL)
                return NULL;
            page->pp_ref++;
            // The directory entry stores the *physical* address of the
            // page table, because the hardware walks the tables using
            // physical addresses.
            pgdir[PDX(va)] = page2pa(page) | PTE_P | PTE_W | PTE_U;
        }
        // PTE_ADDR strips the flag bits, leaving the page table's
        // physical address; KADDR converts it to a virtual address so
        // the kernel can dereference it.
        entry = (pte_t *) KADDR(PTE_ADDR(pgdir[PDX(va)]));
        return &entry[PTX(va)];
    }
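The index arithmetic pgdir_walk relies on can be checked in isolation: a 32-bit virtual address splits into a 10-bit directory index, a 10-bit table index, and a 12-bit page offset. The definitions below reproduce the x86 split JOS uses in mmu.h:

```c
#include <assert.h>
#include <stdint.h>

// 32-bit x86 linear address: | 10-bit PDX | 10-bit PTX | 12-bit offset |
#define PDX(va)    ((((uint32_t)(va)) >> 22) & 0x3FF)
#define PTX(va)    ((((uint32_t)(va)) >> 12) & 0x3FF)
#define PGOFF(va)  (((uint32_t)(va)) & 0xFFF)
```

So for va = 0xF0123456: PDX selects directory slot 0x3C0, PTX selects table slot 0x123, and the final physical address is the page's base plus offset 0x456.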
boot_map_region():
    static void
    boot_map_region(pde_t *pgdir, uintptr_t va, size_t size, physaddr_t pa, int perm)
    {
        size_t offset;
        pte_t *pte;

        for (offset = 0; offset < size; offset += PGSIZE) {
            pte = pgdir_walk(pgdir, (void *) va, 1);
            *pte = pa | perm | PTE_P;
            va += PGSIZE;
            pa += PGSIZE;
        }
    }
page_lookup():
    struct PageInfo *
    page_lookup(pde_t *pgdir, void *va, pte_t **pte_store)
    {
        pte_t *pte = pgdir_walk(pgdir, va, 0);
        if (pte == NULL)
            return NULL;
        if (pte_store != NULL)
            *pte_store = pte;
        if (*pte & PTE_P)
            return pa2page(PTE_ADDR(*pte));
        return NULL;
    }
page_remove():
    void
    page_remove(pde_t *pgdir, void *va)
    {
        pte_t *pte = NULL;
        struct PageInfo *phypage = page_lookup(pgdir, va, &pte);
        if (phypage)
            page_decref(phypage);   // frees the page if pp_ref drops to 0
        if (pte)
            *pte = 0;               // clear the mapping
        // tlb_invalidate() flushes the cached translation for va from the
        // TLB; without it the CPU could keep using the stale mapping.
        tlb_invalidate(pgdir, va);
    }
page_insert():
    int
    page_insert(pde_t *pgdir, struct PageInfo *pp, void *va, int perm)
    {
        pte_t *pte = pgdir_walk(pgdir, va, 1);
        if (!pte)
            return -E_NO_MEM;
        if (PTE_ADDR(*pte) != page2pa(pp)) {
            // va maps some other page (or nothing): drop the old mapping.
            page_remove(pgdir, va);
        } else {
            // Re-inserting the same page at the same va: undo the
            // increment below so pp_ref stays balanced.
            pp->pp_ref--;
        }
        *pte = page2pa(pp) | perm | PTE_P;
        pp->pp_ref++;
        tlb_invalidate(pgdir, va);
        return 0;
    }
Call relationships among these functions: (http://leanote.com/file/outputImage?fileId=55e5c4eb38f41128cb000202)
Part 3: Kernel Address Space
Permissions and Fault Isolation
Initializing the Kernel Address Space
Exercise 5
Fill in the missing code in mem_init() after the call to check_page().
Your code should now pass the check_kern_pgdir() and check_page_installed_pgdir() checks.
Code added:
//////////////////////////////////////////////////////////////////////
// Map 'pages' read-only by the user at linear address UPAGES
// Permissions:
// - the new image at UPAGES -- kernel R, user R
// (ie. perm = PTE_U | PTE_P)
// - pages itself -- kernel RW, user NONE
// Your code goes here:
boot_map_region(kern_pgdir, UPAGES, PTSIZE, PADDR((uintptr_t *) pages), PTE_U);
//////////////////////////////////////////////////////////////////////
// Use the physical memory that 'bootstack' refers to as the kernel
// stack. The kernel stack grows down from virtual address KSTACKTOP.
// We consider the entire range from [KSTACKTOP-PTSIZE, KSTACKTOP)
// to be the kernel stack, but break this into two pieces:
// * [KSTACKTOP-KSTKSIZE, KSTACKTOP) -- backed by physical memory
// * [KSTACKTOP-PTSIZE, KSTACKTOP-KSTKSIZE) -- not backed; so if
// the kernel overflows its stack, it will fault rather than
// overwrite memory. Known as a "guard page".
// Permissions: kernel RW, user NONE
// Your code goes here:
boot_map_region(kern_pgdir, KSTACKTOP - KSTKSIZE, KSTKSIZE, PADDR((uintptr_t *) bootstack), PTE_W);
//////////////////////////////////////////////////////////////////////
// Map all of physical memory at KERNBASE.
// Ie. the VA range [KERNBASE, 2^32) should map to
// the PA range [0, 2^32 - KERNBASE)
// We might not have 2^32 - KERNBASE bytes of physical memory, but
// we just set up the mapping anyway.
// Permissions: kernel RW, user NONE
// Your code goes here:
boot_map_region(kern_pgdir, KERNBASE, 0xffffffff - KERNBASE, (physaddr_t) 0, PTE_W);
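A sanity check on the size argument: in 32-bit arithmetic, 2^32 - KERNBASE is just the unsigned negation -KERNBASE; using 0xffffffff - KERNBASE is one byte short of that, but since the loop maps whole pages it still covers every page of [KERNBASE, 2^32). The arithmetic, checked on the host (pages_mapped is a stand-in that just counts the loop's iterations):

```c
#include <assert.h>
#include <stdint.h>

#define KERNBASE 0xF0000000u
#define PGSIZE   4096u

// Counts how many PGSIZE steps the boot_map_region loop takes for a
// given size argument.
static uint32_t pages_mapped(uint32_t size)
{
    uint32_t n = 0, off;
    for (off = 0; off < size; off += PGSIZE)
        n++;
    return n;
}
```

Both size values map the same 0x10000 pages, so the off-by-one in 0xffffffff - KERNBASE is harmless here.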
- Results after completing all the functions:
(http://leanote.com/file/outputImage?fileId=55e5c57138f41128cb000206)
- And the final score:
(http://leanote.com/file/outputImage?fileId=55e5c5b738f41128cb000208)
Question&Answer:
1 What entries (rows) in the page directory have been filled in at this point? What addresses do they map and where do they point? In other words, fill out this table as much as possible:
Entry | Base Virtual Address | Points to (logically)
1023  | ?                    | Page table for top 4MB of phys memory
1022  | ?                    | ?
.     | ?                    | ?
.     | ?                    | ?
.     | ?                    | ?
2     | 0x00800000           | ?
1     | 0x00400000           | ?
0     | 0x00000000           | [see next question]
2 We have placed the kernel and user environment in the same address space. Why will user programs not be able to read or write the kernel’s memory? What specific mechanisms protect the kernel memory?
3 What is the maximum amount of physical memory that this operating system can support? Why?
4 How much space overhead is there for managing memory, if we actually had the maximum amount of physical memory? How is this overhead broken down?
5 Revisit the page table setup in kern/entry.S and kern/entrypgdir.c. Immediately after we turn on paging, EIP is still a low number (a little over 1MB). At what point do we transition to running at an EIP above KERNBASE? What makes it possible for us to continue executing at a low EIP between when we enable paging and when we begin running at an EIP above KERNBASE? Why is this transition necessary?
DPL & CPL & RPL (x86 protection mechanism):
http://pdosnew.csail.mit.edu/6.828/2014/readings/i386/s06_03.htm (http://leanote.com/file/outputImage?fileId=55e621ec38f4116b9300023f) (http://leanote.com/file/outputImage?fileId=55e6220638f4116b93000241)
- From the x86 manual’s section COMBINING PAGE AND SEGMENT PROTECTION (Volume 3A: System Programming Guide, Part 1, in the References): when the x86 MMU handles a memory access, it first checks the segment’s permission bits; if those pass, it checks the permission bits of the corresponding page directory entry; if those also pass, it checks the page table entry’s permission bits; only when all checks pass is the access allowed. When we create a new page table we cannot know which of its 1024 pages should be readable or writable, so the page directory entry for that page table is best set user-readable and writable, i.e. PTE_U | PTE_W | PTE_P.
- See also Intel 80386 Reference Manual 6.3 Segment-Level Protection
1 In the “Points to” column, the logical addresses pointed to run upward from UPAGES (0xef000000), each entry 4 bytes higher than the last; the Base Virtual Address values increase by one PTSIZE (0x00400000) per entry.
2 For safety: user programs obviously must not touch kernel data. The mechanism is the DPL/CPL/RPL protection described above.
3 A PDT (page directory table) contains 1024 entries and can thus address 4GB, but the PTSIZE region above UPAGES can hold only about 1/3 M pages (struct PageInfo entries), each covering 4KB, so only about 1.3GB of physical memory can be supported. (The exact figure depends on sizeof(struct PageInfo): with the 8-byte PageInfo JOS actually uses, 4MB / 8B = 512K entries, i.e. 2GB.)
4 One page directory plus 1024 page tables. Although the page directory itself is only 4KB, the whole PTSIZE (4MB) region above UVPT (0xef400000) holds just the page directory; in total 1025 * 4KB, roughly 4MB.
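The arithmetic behind answers 3 and 4 can be worked out explicitly, assuming an 8-byte struct PageInfo (a 4-byte pp_link pointer plus a uint16_t pp_ref, padded; the figure in the answer above assumed a larger struct):

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE  4096u
#define PTSIZE  (4u * 1024 * 1024)   // 4MB: span mapped by one page table

// The UPAGES window holds PTSIZE bytes of PageInfo structs; each struct
// tracks one PGSIZE page, bounding the physical memory JOS can manage.
static uint64_t max_phys_bytes(uint32_t sizeof_pageinfo)
{
    uint64_t npages = PTSIZE / sizeof_pageinfo;
    return npages * PGSIZE;
}
```

With an 8-byte PageInfo this gives 512K entries * 4KB = 2GB; the worst-case page-table overhead is one page directory plus 1024 page tables, (1 + 1024) * 4KB, a little over 4MB.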
5 entry.S contains the following assembly:
    .globl entry
    entry:
        movw    $0x1234,0x472       # warm boot

        # We haven't set up virtual memory yet, so we're running from
        # the physical address the boot loader loaded the kernel at: 1MB
        # (plus a few bytes).  However, the C code is linked to run at
        # KERNBASE+1MB.  Hence, we set up a trivial page directory that
        # translates virtual addresses [KERNBASE, KERNBASE+4MB) to
        # physical addresses [0, 4MB).  This 4MB region will be
        # sufficient until we set up our real page table in mem_init
        # in lab 2.

        # Load the physical address of entry_pgdir into cr3.  entry_pgdir
        # is defined in entrypgdir.c.
        movl    $(RELOC(entry_pgdir)), %eax
        movl    %eax, %cr3
        # Turn on paging.
        movl    %cr0, %eax
        orl     $(CR0_PE|CR0_PG|CR0_WP), %eax
        movl    %eax, %cr0

        # Now paging is enabled, but we're still running at a low EIP
        # (why is this okay?).  Jump up above KERNBASE before entering
        # C code.
        mov     $relocated, %eax
        jmp     *%eax
    relocated:
- The physical address of the page directory is loaded into the cr3 register, the relevant flag bits in cr0 are turned on, and finally the jmp moves eip to the new (high) address. As for the question “# (why is this okay?)”: I first thought it might be because the physical addresses that these virtual addresses map to happen to be this very region, so there is no conflict, though even I found that answer unconvincing. After consulting some references: entrypgdir.c predefines two arrays.
entry_pgtable[]: predefines the contents of a second-level page table; as shown in the figure, each element holds the start address of one physical page plus its flag bits.
entry_pgdir[]: predefines the page directory, as follows:

    __attribute__((__aligned__(PGSIZE)))
    pde_t entry_pgdir[NPDENTRIES] = {
        // Map VA's [0, 4MB) to PA's [0, 4MB)
        [0]
            = ((uintptr_t)entry_pgtable - KERNBASE) + PTE_P,
        // Map VA's [KERNBASE, KERNBASE+4MB) to PA's [0, 4MB)
        [KERNBASE>>PDXSHIFT]
            = ((uintptr_t)entry_pgtable - KERNBASE) + PTE_P + PTE_W
    };
That is, element 0 of entry_pgdir[] maps virtual addresses [0, 0x400000) to physical addresses [0, 0x400000); likewise, directory entry 960 maps virtual addresses [KERNBASE, KERNBASE+0x400000) to physical addresses [0, 0x400000). This entry is used after the code in entry.S finishes: it maps the high virtual addresses where the kernel’s code and data are expected to appear to the low physical addresses where the boot loader actually loaded them (0x100000). This mapping limits the kernel’s code plus data to 4MB (in practice, 3MB).
- Because placing the kernel region at high virtual addresses is the agreed-upon convention.