How Do Windows NT System Calls REALLY Work?

Source of the original article: http://www.codeguru.com/Cpp/W-P/system/devicedriverdevelopment/article.php/c8035/

Most texts that describe Windows NT system calls keep many of the important details in the dark. This leads to confusion when trying to understand exactly what is going on when a user-mode application "calls into" kernel mode. The following article will shed light on the exact mechanism that Windows NT uses when switching to kernel-mode to execute a system service. The description is for an x86 compatible CPU running in protected mode. Other platforms supported by Windows NT will have a similar mechanism for switching to kernel-mode.

By John Gulbrandsen 8/19/2004
John.Gulbrandsen@SummitSoftConsulting.com

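Translator's summary: Many documents that describe Windows NT system calls are unclear about important details, so many readers cannot work out exactly how ring 3 code enters ring 0. This article makes that process clear. It is based on x86-compatible CPUs in protected mode; other platforms switch into the kernel in a similar way.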

What is kernel-mode?

Contrary to what most developers believe (even kernel-mode developers), there is no mode of the x86 CPU called "kernel-mode". Other CPUs, such as the Motorola 68000, have two processor modes "built into" the CPU, i.e. a flag in a status register tells the CPU whether it is currently executing in user-mode or supervisor-mode. Intel x86 CPUs do not have such a flag. Instead, the privilege level of the code segment that is currently executing determines the privilege level of the executing program. Each code segment in an application that runs in protected mode on an x86 CPU is described by an 8 byte data structure called a Segment Descriptor. A segment descriptor contains (among other information) the start address of the code segment, the length of the code segment and the privilege level that the code in the segment will execute at. Code that executes in a code segment with a privilege level of 3 is said to run in user mode, and code that executes in a code segment with a privilege level of 0 is said to run in kernel mode. In other words, kernel-mode (privilege level 0) and user-mode (privilege level 3) are attributes of the code and not of the CPU. Intel calls privilege level 0 "Ring 0" and privilege level 3 "Ring 3". The x86 CPU has two more privilege levels (rings 1 and 2) that Windows NT does not use. They are unused because Windows NT was designed to run on several other hardware platforms that may or may not have four privilege levels like the Intel x86 CPU.

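Translator's summary: What is kernel mode? The privilege level of the code segment that is currently executing determines the privilege of the running program. A segment's attributes, including its privilege level, base address and length, are described by an 8 byte descriptor. If the descriptor gives the segment privilege level 0 (ring 0), the code is kernel code; if it gives the segment privilege level 3 (ring 3), the code is user-mode code.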
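As a rough illustration of the 8 byte segment descriptor described above, the following sketch unpacks the base address, limit and privilege level from a raw descriptor value using the standard x86 bit layout. The type and function names (SegmentInfo, decode_segment_descriptor) are made up for this example, and the sample values in the usage note are the commonly seen "flat" 4 GB code segments, not values read from a real Windows NT system.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of interest from an 8 byte x86 segment descriptor (hypothetical type). */
typedef struct {
    uint32_t base;     /* linear start address of the segment           */
    uint32_t limit;    /* 20-bit limit field (units depend on G bit)    */
    unsigned dpl;      /* descriptor privilege level: 0 = ring 0, 3 = ring 3 */
    int      present;  /* P bit: descriptor is valid                    */
    int      gran_4k;  /* G bit: limit is counted in 4 KB pages         */
} SegmentInfo;

/* Unpack a raw descriptor, given as a little-endian 64-bit value. */
SegmentInfo decode_segment_descriptor(uint64_t raw)
{
    SegmentInfo s;
    s.base    = (uint32_t)((raw >> 16) & 0xFFFFFF)          /* base bits 0..23  */
              | ((uint32_t)((raw >> 56) & 0xFF) << 24);     /* base bits 24..31 */
    s.limit   = (uint32_t)(raw & 0xFFFF)                    /* limit bits 0..15  */
              | ((uint32_t)((raw >> 48) & 0xF) << 16);      /* limit bits 16..19 */
    s.dpl     = (unsigned)((raw >> 45) & 3);
    s.present = (int)((raw >> 47) & 1);
    s.gran_4k = (int)((raw >> 55) & 1);
    assert(s.limit <= 0xFFFFF);  /* the limit field is only 20 bits wide */
    return s;
}

/* "Kernel mode" in the article's sense: code in a privilege level 0 segment. */
int is_kernel_segment(const SegmentInfo *s)
{
    return s->present && s->dpl == 0;
}
```

For example, decoding 0x00CF9A000000FFFF yields base 0, limit 0xFFFFF and privilege level 0, while 0x00CFFA000000FFFF differs only in the privilege level field, which is 3.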

 

The x86 CPU will not allow code that is running at a lower privilege level (numerically higher) to call directly into code that is running at a higher privilege level (numerically lower). If this is attempted, the CPU automatically generates a general protection (GP) exception. A general protection exception handler in the operating system is then called and can take the appropriate action (warn the user, terminate the application etc). Note that all memory protection discussed above, including the privilege levels, is a feature of the x86 CPU and not of Windows NT. Without this support from the CPU, Windows NT could not implement memory protection as described above.

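Translator's summary: The x86 CPU does not allow a low-privilege segment to call high-privilege code. If code tries to do so anyway, the CPU raises a GP exception.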

Where do the Segment Descriptors reside?

Since each code segment that exists in the system is described by a segment descriptor, and since there are potentially many, many code segments in a system (each program may have many), the segment descriptors must be stored somewhere so that the CPU can read them in order to accept or deny access to a program that wishes to execute code in a segment. Intel chose not to store all this information on the CPU chip itself but in main memory instead. Two tables in main memory store segment descriptors: the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Two registers in the CPU hold the addresses and sizes of these descriptor tables so that the CPU can find the segment descriptors. These registers are the Global Descriptor Table Register (GDTR) and the Local Descriptor Table Register (LDTR). It is the operating system's responsibility to set up the descriptor tables and to load the GDTR and LDTR registers with the addresses of the GDT and LDT respectively. This has to be done very early in the boot process, even before the CPU is switched into protected mode, because without the descriptor tables no memory segments can be accessed in protected mode. Figure 1 below illustrates the relationship between the GDTR, LDTR, GDT and the LDT.

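Translator's summary: Where do the segment descriptors reside? Two descriptor tables, the GDT and the LDT, are stored in main memory (not on the Intel chip). The CPU has two registers that hold the addresses and sizes of these tables so that it can find the segment descriptors: the GDTR and the LDTR. The operating system is responsible for building the descriptor tables and loading the addresses of the GDT and LDT into the GDTR and LDTR. This must happen early in the boot process, before the CPU switches to protected mode, because without the descriptor tables no memory segment can be accessed in protected mode. Figure 1 below shows the relationship between the GDTR, LDTR, GDT and LDT.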

Since there are two segment descriptor tables it is not enough to use an index to uniquely select a segment descriptor. A bit that identifies in which of the two tables the segment descriptor resides is necessary. The index combined with the table indicator bit is called a segment selector. The segment selector format is displayed below.

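Translator's summary: One bit that identifies which table the segment descriptor is in, together with a number of bits that index into that table, make up the segment selector. Its format is as follows: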

As can be seen in figure 2 above, the segment selector also contains a two-bit field called the Requestor Privilege Level (RPL). These bits are used to determine whether a certain piece of code may access the code segment descriptor that the selector points to. For instance, suppose a piece of code that runs at privilege level 3 (user mode) tries to jump to or call code in the code segment described by that descriptor. If the RPL in the selector indicates that only code running at privilege level 0 can read the code segment, a general protection exception occurs. This is how the x86 CPU makes sure that no ring 3 (user mode) code can get access to ring 0 (kernel-mode) code. In fact, the truth is slightly more complicated than this. For the information-eager, see "Protected Mode Software Architecture" in the further reading list for the details of the RPL field. For our purposes it is enough to know that the RPL field is used for privilege checks of the code trying to use the segment selector to read a segment descriptor.

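Translator's summary: In figure 2 above, the lowest 2 bits of the segment selector hold the RPL, i.e. the minimum privilege level a piece of code needs in order to access the code described by this segment descriptor. If ring 3 code tries to access (call or jump to) a code segment indicated by a selector whose RPL is 0, the CPU raises a GP exception.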
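The selector format described above (a table index, a table indicator bit and a two-bit RPL) can be sketched in a few lines of C. The field positions follow the standard x86 layout: RPL in bits 0..1, the table indicator in bit 2 and the index in bits 3..15. The names Selector, decode_selector, encode_selector and descriptor_offset are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields packed into a 16-bit segment selector (hypothetical type). */
typedef struct {
    unsigned index;    /* which descriptor in the table (bits 3..15)  */
    unsigned use_ldt;  /* table indicator: 0 = GDT, 1 = LDT (bit 2)   */
    unsigned rpl;      /* requestor privilege level (bits 0..1)       */
} Selector;

Selector decode_selector(uint16_t raw)
{
    Selector s;
    s.index   = (unsigned)raw >> 3;
    s.use_ldt = ((unsigned)raw >> 2) & 1u;
    s.rpl     = (unsigned)raw & 3u;
    return s;
}

uint16_t encode_selector(unsigned index, unsigned use_ldt, unsigned rpl)
{
    assert(index < 8192 && use_ldt <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (use_ldt << 2) | rpl);
}

/* Byte offset of the selected descriptor from the start of its table:
 * every descriptor is 8 bytes long. */
uint32_t descriptor_offset(uint16_t raw)
{
    return ((uint32_t)raw >> 3) * 8u;
}
```

With this layout, the selector value 0x1B decodes to index 3 in the GDT with RPL 3, and 0x08 decodes to index 1 in the GDT with RPL 0. These two values are used here only as sample inputs.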

Interrupt gates

So if application code running in user-mode (at privilege level 3) cannot call code running in kernel-mode (at privilege level 0) how do system calls in Windows NT work? The answer again is that they use features of the CPU. In order to control transitions between code executing at different privilege levels, Windows NT uses a feature of the x86 CPU called an interrupt gate. In order to understand interrupt gates we must first understand how interrupts are used in an x86 CPU executing in protected mode.

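Translator's summary: If ring 3 cannot access ring 0, how does NT implement system calls? The answer is the CPU's interrupt mechanism. To switch modes, NT uses the x86 CPU's interrupt gates.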

Like most other CPUs, the x86 CPU has an interrupt vector table that contains information about how each interrupt should be handled. In real-mode, the x86 CPU's interrupt vector table simply contains pointers (4 byte values) to the Interrupt Service Routines that will handle the interrupts. In protected-mode, however, the interrupt vector table contains Interrupt Gate Descriptors which are 8 byte data structures that describe how the interrupt should be handled. An Interrupt Gate Descriptor contains information about what code segment the Interrupt Service Routine resides in and where in that code segment the ISR starts. The reason for having an Interrupt Gate Descriptor instead of a simple pointer in the interrupt vector table is the requirement that code executing in user-mode cannot directly call into kernel-mode. By checking the privilege level in the Interrupt Gate Descriptor the CPU can verify that the calling application is allowed to call the protected code at well defined locations (this is the reason for the name "Interrupt Gate", i.e. it is a well defined gate through which user-mode code can transfer control to kernel-mode code).

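Translator's summary: Like most CPUs, the x86 CPU has an interrupt vector table that contains information about each interrupt to be handled. In real mode, the x86 interrupt vector table contains only 4 byte addresses of the interrupt handler functions. In protected mode, the interrupt vector table contains 8 byte interrupt gate descriptors. An interrupt gate descriptor contains a selector and the offset of the ISR within the segment that the selector describes. See the figure below for the details; I will not translate the rest.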
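To make the Interrupt Gate Descriptor more concrete, here is a minimal sketch that unpacks the code segment selector, the ISR offset and the privilege level from a raw 8 byte gate, using the standard x86 layout for a 32-bit interrupt gate. It also models one rule the CPU applies to software interrupts: an 'int' instruction is only allowed through a gate whose privilege level is numerically greater than or equal to that of the caller. The names InterruptGate, decode_interrupt_gate and software_interrupt_allowed, and the sample offset, are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of interest from an 8 byte interrupt gate descriptor (hypothetical type). */
typedef struct {
    uint16_t selector;  /* code segment that contains the ISR        */
    uint32_t offset;    /* where in that segment the ISR starts      */
    unsigned dpl;       /* least privileged level allowed to "int"   */
    int      present;   /* P bit: gate is valid                      */
} InterruptGate;

/* Unpack a raw gate, given as a little-endian 64-bit value. */
InterruptGate decode_interrupt_gate(uint64_t raw)
{
    InterruptGate g;
    g.offset   = (uint32_t)(raw & 0xFFFF)                     /* offset bits 0..15  */
               | ((uint32_t)((raw >> 48) & 0xFFFF) << 16);    /* offset bits 16..31 */
    g.selector = (uint16_t)((raw >> 16) & 0xFFFF);
    g.dpl      = (unsigned)((raw >> 45) & 3);
    g.present  = (int)((raw >> 47) & 1);
    return g;
}

/* A software "int n" issued at privilege level cpl may pass through the gate
 * only if the gate is present and cpl <= gate DPL (numerically). */
int software_interrupt_allowed(unsigned cpl, const InterruptGate *gate)
{
    assert(cpl <= 3);
    return gate->present && cpl <= gate->dpl;
}
```

A gate intended for system calls therefore needs privilege level 3 so that ring 3 code can use it, while a gate with privilege level 0 rejects an 'int' issued from ring 3. In the sample value 0x8046EE0000081234 the gate points at offset 0x80461234 in the segment selected by 0x0008 and has privilege level 3.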

 

Back to the NT system call

Now that we have covered the background material, we are ready to describe exactly how a Windows NT system call finds its way from user-mode into kernel-mode. System calls in Windows NT are initiated by executing an "int 2e" instruction. The 'int' instruction causes the CPU to execute a software interrupt, i.e. it goes into the Interrupt Descriptor Table at index 0x2e and reads the Interrupt Gate Descriptor at that location. The Interrupt Gate Descriptor contains the Segment Selector of the Code Segment that contains the Interrupt Service Routine (the ISR). It also contains the offset to the ISR within the target code segment. The CPU uses the Segment Selector in the Interrupt Gate Descriptor to index into the GDT or LDT (depending on the TI-bit in the segment selector). Once the CPU has located the target segment descriptor, it loads the information from the segment descriptor into the CPU. It also loads the EIP register from the Offset in the Interrupt Gate Descriptor. At this point the CPU is almost set up to start executing the ISR code in the kernel-mode code segment.

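Translator's summary: Back to the NT system call. In NT, a system call starts with int 2E. The int instruction makes the CPU execute a software interrupt and then read the interrupt gate descriptor for 2E from the IDT.

Using the selector supplied by the interrupt gate descriptor, the CPU locates the base address of the segment that contains the ISR; it then uses the offset supplied by the interrupt gate descriptor to locate the ISR itself.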
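The lookup chain just described (IDT entry, then selector, then GDT or LDT entry, then entry point) can be sketched as a small simulation. This is a simplified model under stated assumptions, not how the hardware is implemented: the tables hold already-decoded entries, every failure is reported as -1 where a real CPU would raise a specific fault, and all type and function names (Segment, Gate, Tables, resolve_software_interrupt) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t base; unsigned dpl; } Segment;   /* decoded segment descriptor */
typedef struct {
    uint16_t selector;   /* code segment that holds the ISR */
    uint32_t offset;     /* ISR offset inside that segment  */
    unsigned dpl;        /* gate privilege level            */
    int      present;
} Gate;                                                    /* decoded interrupt gate */

typedef struct {
    const Gate    *idt; size_t idt_len;
    const Segment *gdt; size_t gdt_len;
    const Segment *ldt; size_t ldt_len;
} Tables;

/* Model of "int <vector>" executed at privilege level cpl.
 * On success returns 0 and reports the linear address of the ISR and the
 * privilege level it will run at. Returns -1 where a real CPU would fault. */
int resolve_software_interrupt(const Tables *t, unsigned vector, unsigned cpl,
                               uint32_t *entry_linear, unsigned *new_cpl)
{
    assert(t != NULL && entry_linear != NULL && new_cpl != NULL);

    if (vector >= t->idt_len)
        return -1;                        /* vector outside the IDT */
    const Gate *g = &t->idt[vector];
    if (!g->present || cpl > g->dpl)
        return -1;                        /* gate missing or caller not privileged enough */

    size_t index   = (size_t)(g->selector >> 3);   /* descriptor index     */
    int    use_ldt = (g->selector >> 2) & 1;       /* TI bit: 0 GDT, 1 LDT */
    const Segment *table = use_ldt ? t->ldt : t->gdt;
    size_t         len   = use_ldt ? t->ldt_len : t->gdt_len;

    if ((!use_ldt && index == 0) || index >= len)
        return -1;                        /* null selector or index past the table */

    *entry_linear = table[index].base + g->offset;
    *new_cpl      = table[index].dpl;
    return 0;
}
```

In this model, a ring 3 caller that executes vector 0x2e through a gate with privilege level 3 and selector 0x08 ends up at the gate's offset inside GDT entry 1, running at that segment's privilege level.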

The CPU switches automatically to the kernel-mode stack

Before the CPU starts to execute the ISR in the kernel-mode code segment, it needs to switch to the kernel-mode stack. The reason for this is that the kernel-mode code cannot trust the user-mode stack to have enough room to execute the kernel-mode code. For instance, malicious user-mode code could modify its stack pointer to point to invalid memory, execute an 'int 2e' instruction and thereby crash the system when the kernel-mode functions use the invalid stack pointer. Each privilege level in the x86 Protected Mode environment therefore has its own stack. When a call is made to a higher-privileged level through an interrupt gate descriptor as described above, the CPU automatically saves the user-mode program's SS, ESP, EFLAGS, CS and EIP registers on the kernel-mode stack. The Windows NT system service dispatcher function (KiSystemService) needs access to the parameters that the user-mode code pushed onto its stack before it executed 'int 2e'. By convention, the user-mode code must set up the EDX register to contain a pointer to the user-mode stack's parameters before executing the 'int 2e' instruction. KiSystemService can then simply copy as many arguments as the called system function needs from the user-mode stack to the kernel-mode stack before calling the system function. See figure 4 below for an illustration of this.

Translator's summary: The CPU automatically switches to the kernel-mode stack.
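The automatic register save described above can be pictured with a small simulation of a downward-growing stack. The sketch pushes SS, ESP, EFLAGS, CS and EIP in the order the x86 CPU uses for a privilege-level change, so EIP ends up at the lowest address. The names SavedContext, push32 and push_interrupt_frame are invented for the illustration, and the register values in the usage note are arbitrary sample numbers.

```c
#include <assert.h>
#include <stdint.h>

/* User-mode register values that the CPU saves on the kernel-mode stack. */
typedef struct {
    uint32_t ss, esp, eflags, cs, eip;
} SavedContext;

/* Push one 32-bit value onto a stack that grows toward lower addresses. */
static uint32_t *push32(uint32_t *sp, uint32_t value)
{
    *--sp = value;
    return sp;
}

/* Model of what the CPU does after switching to the kernel-mode stack:
 * save the interrupted program's SS, ESP, EFLAGS, CS and EIP, in that order.
 * Returns the new kernel stack pointer, which points at the saved EIP. */
uint32_t *push_interrupt_frame(uint32_t *kernel_sp, const SavedContext *user)
{
    assert(kernel_sp != 0 && user != 0);
    kernel_sp = push32(kernel_sp, user->ss);
    kernel_sp = push32(kernel_sp, user->esp);
    kernel_sp = push32(kernel_sp, user->eflags);
    kernel_sp = push32(kernel_sp, user->cs);
    kernel_sp = push32(kernel_sp, user->eip);
    return kernel_sp;
}
```

If the kernel stack is modeled as an array of eight 32-bit slots with the initial stack pointer just past the last slot, the call consumes five slots and leaves the stack pointer at slot 3, with EIP, CS, EFLAGS, ESP and SS at increasing addresses from there.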

 

What system call are we calling?

Since all Windows NT system calls use the same 'int 2e' software interrupt to switch into kernel-mode, how does the user-mode code tell the kernel-mode code what system function to execute? The answer is that an index is placed in the EAX register before the 'int 2e' instruction is executed. The kernel-mode ISR looks in the EAX register and calls the specified kernel-mode function if all parameters passed from user-mode appear to be correct. The call parameters (for instance those passed to our OpenFile function) are passed to the kernel-mode function by the ISR.
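The dispatch step can be sketched as a table of function pointers indexed by the value found in EAX. This is only a toy model under stated assumptions: the two services, the table layout and every name in it (ServiceEntry, service_table, dispatch_system_call) are made up, and the real KiSystemService performs considerably more validation than the bounds check shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ARGS 4

typedef int32_t (*ServiceFn)(const uint32_t *args);

/* One entry per system service: the handler and how many arguments it takes. */
typedef struct {
    ServiceFn handler;
    size_t    arg_count;
} ServiceEntry;

/* Two toy "system services", purely for illustration. */
static int32_t svc_add(const uint32_t *args)    { return (int32_t)(args[0] + args[1]); }
static int32_t svc_negate(const uint32_t *args) { return -(int32_t)args[0]; }

static const ServiceEntry service_table[] = {
    { svc_add,    2 },   /* index 0 */
    { svc_negate, 1 },   /* index 1 */
};
#define SERVICE_COUNT (sizeof service_table / sizeof service_table[0])

/* Model of the ISR: eax selects the service, user_args points at the caller's
 * arguments. Returns 0 on success and stores the service's result in *result;
 * returns -1 if the index or the pointers are not acceptable. */
int dispatch_system_call(uint32_t eax, const uint32_t *user_args, int32_t *result)
{
    uint32_t kernel_args[MAX_ARGS];

    if (eax >= SERVICE_COUNT || user_args == NULL || result == NULL)
        return -1;

    const ServiceEntry *entry = &service_table[eax];
    assert(entry->arg_count <= MAX_ARGS);

    /* Copy only as many arguments as this service needs onto "kernel" storage. */
    memcpy(kernel_args, user_args, entry->arg_count * sizeof kernel_args[0]);
    *result = entry->handler(kernel_args);
    return 0;
}
```

Calling dispatch_system_call with index 0 and the arguments 2 and 3 stores 5 in the result, while an index past the end of the table is rejected before any handler runs.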

Returning from the system call

Once the system call has completed, the kernel-mode ISR executes an IRET instruction. IRET pops the saved register values (EIP, CS, EFLAGS, ESP and SS) from the kernel-mode stack, which restores the running program's original state and causes the CPU to continue execution at the instruction immediately after the 'int 2e' in the user-mode code.
