Sysenter Based System Call Mechanism in Linux 2.6

maimang09

http://articles.manugarg.com/systemcallinlinux2_6.html

Starting with version 2.5, the Linux kernel introduced a new system call entry mechanism on Pentium II+ processors. Because the existing software interrupt method performed poorly on Pentium IV processors, an alternative system call entry mechanism was implemented using the SYSENTER/SYSEXIT instructions available on Pentium II+ processors. This article explores this new mechanism. The discussion is limited to the x86 architecture, and all source code listings are based on Linux kernel 2.6.15.6.

1. What are system calls?

System calls give userland processes a way to request services from the kernel. What kind of services? Services managed by the operating system, such as storage, memory, network and process management. For example, if a user process wants to read a file, it has to make the 'open' and 'read' system calls. Generally, processes do not invoke system calls directly; the C library provides an interface to all system calls.

2. What happens in a system call?

A piece of kernel code is run at the request of a user process. This code runs in ring 0 (with current privilege level, CPL, 0), which is the highest level of privilege in the x86 architecture. All user processes run in ring 3 (CPL 3). So, to implement a system call mechanism, we need 1) a way to call ring 0 code from ring 3 and 2) some kernel code to service the request.

3. Good old way of doing it

Until some time back, Linux implemented system calls on all x86 platforms using software interrupts. To execute a system call, the user process copies the desired system call number to %eax and executes 'int 0x80'. This generates interrupt 0x80 and an interrupt service routine is called. For interrupt 0x80, this routine is an "all system calls handling" routine that executes in ring 0. As defined in the file /usr/src/linux/arch/i386/kernel/entry.S, it saves the current state and calls the appropriate system call handler based on the value in %eax.

4. New shiny way of doing it

It was found that this software interrupt method was much slower on Pentium IV processors. To solve this issue, Linus implemented an alternative system call mechanism to take advantage of the SYSENTER/SYSEXIT instructions provided by all Pentium II+ processors. Before going further with this new way of doing it, let's make ourselves more familiar with these instructions.

4.1. SYSENTER/SYSEXIT instructions:

Let's look at the authoritative source, the Intel manual itself. The Intel manual says:

The SYSENTER instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSENTER instruction is optimized to provide the maximum performance for transitions to protection ring 0 (CPL = 0). The SYSENTER instruction sets the following registers according to values specified by the operating system in certain model-specific registers.

CS register set to the value of (SYSENTER_CS_MSR)

EIP register set to the value of (SYSENTER_EIP_MSR)

SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)

ESP register set to the value of (SYSENTER_ESP_MSR)

Looks like the processor is trying to help us. Let's also take a quick look at SYSEXIT:

The SYSEXIT instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSEXIT instruction is optimized to provide the maximum performance for transitions to protection ring 3 (CPL = 3) from protection ring 0 (CPL = 0). The SYSEXIT instruction sets the following registers according to values specified by the operating system in certain model-specific or general purpose registers.

CS register set to the sum of (16 plus the value in SYSENTER_CS_MSR)

EIP register set to the value contained in the EDX register

SS register set to the sum of (24 plus the value in SYSENTER_CS_MSR)

ESP register set to the value contained in the ECX register

SYSENTER_CS_MSR, SYSENTER_ESP_MSR, and SYSENTER_EIP_MSR are not really the names of the registers. Intel just defines the addresses of these registers as:

SYSENTER_CS_MSR   174h
SYSENTER_ESP_MSR  175h
SYSENTER_EIP_MSR  176h

In linux these registers are named as:

/usr/src/linux/include/asm/msr.h:
    #define MSR_IA32_SYSENTER_CS            0x174
    #define MSR_IA32_SYSENTER_ESP           0x175
    #define MSR_IA32_SYSENTER_EIP           0x176

4.2. How does Linux 2.6 use these instructions?

Linux sets up these registers during initialization itself.
```
/usr/src/linux/arch/i386/kernel/sysenter.c:
        wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
        wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
        wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
```
Please note that 'tss' refers to the Task State Segment (TSS), and tss->esp1 thus points to the kernel mode stack. [4] explains the use of the TSS in Linux as:

The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

- When an 80x86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS.

- When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

So during initialization the kernel sets up these registers such that after the SYSENTER instruction, ESP is set to the kernel mode stack and EIP is set to sysenter_entry.

The kernel also sets up system call entry/exit points for user processes. It creates a single page in memory and attaches it to every process's address space when the process is loaded into memory. This page contains the actual implementation of the system call entry/exit mechanism. The definition of this page can be found in the file /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S. The kernel calls this page the virtual dynamic shared object (vdso). The existence of this page can be confirmed by looking at /proc/`pid`/maps:

slax ~ # cat /proc/self/maps
08048000-0804c000 r-xp 00000000 07:00 13         /bin/cat
0804c000-0804d000 rwxp 00003000 07:00 13         /bin/cat
0804d000-0806e000 rwxp 0804d000 00:00 0          [heap]
b7ea0000-b7ea1000 rwxp b7ea0000 00:00 0
b7ea1000-b7fca000 r-xp 00000000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fca000-b7fcb000 r-xp 00128000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fcb000-b7fce000 rwxp 00129000 07:03 1840       /lib/tls/libc-2.3.6.so
b7fce000-b7fd1000 rwxp b7fce000 00:00 0
b7fe7000-b7ffd000 r-xp 00000000 07:03 1730       /lib/ld-2.3.6.so
b7ffd000-b7fff000 rwxp 00015000 07:03 1730       /lib/ld-2.3.6.so
bffe7000-bfffd000 rwxp bffe7000 00:00 0          [stack]
ffffe000-fffff000 ---p 00000000 00:00 0          [vdso]

For binaries using shared libraries, this page can be seen using ldd also:

slax ~ # ldd /bin/ls
        linux-gate.so.1 =>  (0xffffe000)
        librt.so.1 => /lib/tls/librt.so.1 (0xb7f5f000)
        ...

Observe linux-gate.so.1. This is not a physical file. The content of this vdso can be seen as follows:

==> dd if=/proc/self/mem of=linux-gate.dso bs=4096 skip=1048574 count=1
1+0 records in
1+0 records out

==> objdump -d --start-address=0xffffe400 --stop-address=0xffffe414 linux-gate.dso
ffffe400 <__kernel_vsyscall>:
ffffe400:       51                      push   %ecx
ffffe401:       52                      push   %edx
ffffe402:       55                      push   %ebp
ffffe403:       89 e5                   mov    %esp,%ebp
ffffe405:       0f 34                   sysenter 
...
ffffe40d:       90                      nop    
ffffe40e:       eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
ffffe410:       5d                      pop    %ebp
ffffe411:       5a                      pop    %edx
ffffe412:       59                      pop    %ecx
ffffe413:       c3                      ret

In all listings, ... stands for omitted irrelevant code.

Initiation: Userland processes (or the C library on their behalf) call __kernel_vsyscall to execute system calls. The address of __kernel_vsyscall is not fixed. The kernel passes this address to userland processes using the AT_SYSINFO ELF parameter. AT_ ELF parameters, a.k.a. ELF auxiliary vectors, are loaded on the process stack at startup, along with the process arguments and the environment variables. See [1] for more information on ELF auxiliary vectors.

After control moves to this address, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before sysenter is executed. This %ebp later helps the kernel restore the userland stack. After executing the sysenter instruction, the processor starts execution at sysenter_entry. sysenter_entry is defined in /usr/src/linux/arch/i386/kernel/entry.S as: (See my comments in [ ])

    179 ENTRY(sysenter_entry)
    180         movl TSS_sysenter_esp0(%esp),%esp
    181 sysenter_past_esp:
    182         sti
    183         pushl $(__USER_DS)
    184         pushl %ebp			[%ebp contains userland %esp]
    185         pushfl
    186         pushl $(__USER_CS)
    187         pushl $SYSENTER_RETURN		[%userland return addr]
    188
		....
    201         pushl %eax			
    202         SAVE_ALL			[pushes registers on to stack]
    203         GET_THREAD_INFO(%ebp)
    204
    205         /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
    206         testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),
                                                                             TI_flags(%ebp)
    207         jnz syscall_trace_entry
    208         cmpl $(nr_syscalls), %eax
    209         jae syscall_badsys
    210         call *sys_call_table(,%eax,4)
    211         movl %eax,EAX(%esp)
		......

Inside sysenter_entry: between lines 183 and 202, the kernel saves the current state by pushing register values on to the stack.

Observe that $SYSENTER_RETURN is the userland return address as defined inside /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S and %ebp contains the userland ESP, as %esp was copied to %ebp before calling sysenter.
After saving the state, the kernel validates the system call number stored in %eax. Finally, the appropriate system call is called using the instruction:
```
    210 call *sys_call_table(,%eax,4)
```
This is very similar to the old way.

After the system call is complete, the processor resumes execution at line 211. Looking further in the sysenter_entry definition:

    210         call *sys_call_table(,%eax,4)
    211         movl %eax,EAX(%esp)
    212         cli
    213         movl TI_flags(%ebp), %ecx
    214         testw $_TIF_ALLWORK_MASK, %cx
    215         jne syscall_exit_work
    216 /* if something modifies registers it must also disable sysexit */
    217         movl EIP(%esp), %edx			(EIP is 0x28)
    218         movl OLDESP(%esp), %ecx			(OLD ESP is 0x34)
    219         xorl %ebp,%ebp
    220         sti
    221         sysexit

The value in %eax is copied to the stack. The userland return address (the to-be EIP) and the userland ESP are copied from the kernel stack to %edx and %ecx respectively. Observe that the userland return address, $SYSENTER_RETURN, was pushed on to the stack in line 187. After that, 0x28 bytes were pushed on to the stack. That's why 0x28(%esp) points to $SYSENTER_RETURN.
After that, the SYSEXIT instruction is executed. As we know from the previous section, sysexit copies the value in %edx to EIP and the value in %ecx to ESP. sysexit transfers the processor back to ring 3 and the processor resumes execution in userland.

5. Some Code

#include <stdio.h>

int pid;

int main() {
        __asm__(
                "movl $20, %eax    \n"
                "call *%gs:0x10    \n"   /* offset 0x10 is not fixed across the systems */
                "movl %eax, pid    \n"
        );
        printf("pid is %d\n", pid);
        return 0;
}

This does the getpid() system call (__NR_getpid is 20) using __kernel_vsyscall instead of int 0x80. Why %gs:0x10? Parsing the process stack to find AT_SYSINFO's value can be a cumbersome task. So, when libc.so (the C library) is loaded, it copies the value of AT_SYSINFO from the process stack to the TCB (Thread Control Block). The segment register %gs refers to the TCB.

Please note that the offset 0x10 is not fixed across systems. I found it for my system using GDB. A system-independent way to find AT_SYSINFO is given in [1].

Note: This example is taken from http://www.win.tue.nl/~aeb/linux/lk/lk-4.html after a little modification to make it work on my system.

6. References

Here are some references that helped me understand this.

[1] About Elf auxiliary vectors, by Manu Garg
[2] What is linux-gate.so.1?, by Johan Peterson
[3] The Linux kernel: System Calls, by Andries Brouwer
[4] Understanding the Linux Kernel, by Daniel P. Bovet and Marco Cesati
[5] Linux kernel source code

Original article: Sysenter Based System Call Mechanism in Linux 2.6, by Manu Garg (www.manugarg.com), http://articles.manugarg.com/systemcallinlinux2_6.html