Sysenter Based System Call Mechanism in Linux 2.6

http://articles.manugarg.com/systemcallinlinux2_6.html



Starting with version 2.5, the Linux kernel introduced a new system call entry mechanism on Pentium II+ processors. Because the existing software interrupt method performed poorly on Pentium IV processors, an alternative system call entry mechanism was implemented using the SYSENTER/SYSEXIT instructions available on Pentium II+ processors. This article explores this new mechanism. The discussion is limited to the x86 architecture, and all source code listings are based on Linux kernel 2.6.15.6.

1. What are system calls?

System calls give userland processes a way to request services from the kernel. What kind of services? Services managed by the operating system, such as storage, memory, network and process management. For example, if a user process wants to read a file, it has to make the 'open' and 'read' system calls. Generally, processes do not invoke system calls directly; the C library provides an interface to all system calls.

2. What happens in a system call? 

A piece of kernel code is run at the request of a user process. This code runs in ring 0 (with current privilege level, CPL, 0), which is the highest level of privilege in the x86 architecture. All user processes run in ring 3 (CPL 3). So, to implement a system call mechanism, we need 1) a way to call ring 0 code from ring 3 and 2) some kernel code to service the request.

3. Good old way of doing it

Until some time back, Linux implemented system calls on all x86 platforms using software interrupts. To execute a system call, a user process copies the desired system call number to %eax and executes 'int 0x80'. This generates interrupt 0x80, and an interrupt service routine is called. For interrupt 0x80, this routine is an "all system calls handling" routine, and it executes in ring 0. As defined in the file /usr/src/linux/arch/i386/kernel/entry.S, it saves the current state and calls the appropriate system call handler based on the value in %eax.

4. New shiny way of doing it

It was found that this software interrupt method was much slower on Pentium IV processors. To solve this issue, Linus implemented an alternative system call mechanism to take advantage of the SYSENTER/SYSEXIT instructions provided by all Pentium II+ processors. Before going further with this new way of doing it, let's make ourselves more familiar with these instructions.

4.1. SYSENTER/SYSEXIT instructions:

Let's look at the authoritative source, the Intel manual itself. The Intel manual says:

The SYSENTER instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSENTER instruction is optimized to provide the maximum performance for transitions to protection ring 0 (CPL = 0). The SYSENTER instruction sets the following registers according to values specified by the operating system in certain model-specific registers.

  • CS register set to the value of (SYSENTER_CS_MSR)

  • EIP register set to the value of (SYSENTER_EIP_MSR)

  • SS register set to the sum of (8 plus the value in SYSENTER_CS_MSR)

  • ESP register set to the value of (SYSENTER_ESP_MSR)

Looks like the processor is trying to help us. Let's also take a quick look at SYSEXIT:

The SYSEXIT instruction is part of the "Fast System Call" facility introduced on the Pentium® II processor. The SYSEXIT instruction is optimized to provide the maximum performance for transitions to protection ring 3 (CPL = 3) from protection ring 0 (CPL = 0). The SYSEXIT instruction sets the following registers according to values specified by the operating system in certain model-specific or general purpose registers.

  • CS register set to the sum of (16 plus the value in SYSENTER_CS_MSR)

  • EIP register set to the value contained in the EDX register

  • SS register set to the sum of (24 plus the value in SYSENTER_CS_MSR)

  • ESP register set to the value contained in the ECX register


SYSENTER_CS_MSR, SYSENTER_ESP_MSR, and SYSENTER_EIP_MSR are not really the names of the registers. Intel just defines the addresses of these registers as:

SYSENTER_CS_MSR   174h
SYSENTER_ESP_MSR  175h
SYSENTER_EIP_MSR  176h

In linux these registers are named as:

/usr/src/linux/include/asm/msr.h:
    #define MSR_IA32_SYSENTER_CS            0x174
    #define MSR_IA32_SYSENTER_ESP           0x175
    #define MSR_IA32_SYSENTER_EIP           0x176

4.2. How does Linux 2.6 use these instructions?

  1. Linux sets up these registers during initialization itself.

    /usr/src/linux/arch/i386/kernel/sysenter.c:
        wrmsr(MSR_IA32_SYSENTER_CS, __KERNEL_CS, 0);
        wrmsr(MSR_IA32_SYSENTER_ESP, tss->esp1, 0);
        wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long) sysenter_entry, 0);
    

    Please note that 'tss' refers to the Task State Segment (TSS), and tss->esp1 thus points to the kernel mode stack. [4] explains the use of the TSS in Linux as follows:

    The x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. Although Linux doesn't use hardware context switches, it is nonetheless forced to set up a TSS for each distinct CPU in the system. This is done for two main reasons:

    - When an 80 x 86 CPU switches from User Mode to Kernel Mode, it fetches the address of the Kernel Mode stack from the TSS.

    - When a User Mode process attempts to access an I/O port by means of an in or out instruction, the CPU may need to access an I/O Permission Bitmap stored in the TSS to verify whether the process is allowed to address the port.

    So during initialization the kernel sets up these registers such that after the SYSENTER instruction, ESP is set to the kernel mode stack and EIP is set to sysenter_entry.

  2. The kernel also sets up system call entry/exit points for user processes. The kernel creates a single page in memory and attaches it to the address space of every process when it is loaded into memory. This page contains the actual implementation of the system call entry/exit mechanism. The definition of this page can be found in the file /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S. The kernel calls this page the virtual dynamic shared object (vdso). The existence of this page can be confirmed by looking at /proc/`pid`/maps:

    slax ~ # cat /proc/self/maps
    08048000-0804c000 r-xp 00000000 07:00 13         /bin/cat
    0804c000-0804d000 rwxp 00003000 07:00 13         /bin/cat
    0804d000-0806e000 rwxp 0804d000 00:00 0          [heap]
    b7ea0000-b7ea1000 rwxp b7ea0000 00:00 0
    b7ea1000-b7fca000 r-xp 00000000 07:03 1840       /lib/tls/libc-2.3.6.so
    b7fca000-b7fcb000 r-xp 00128000 07:03 1840       /lib/tls/libc-2.3.6.so
    b7fcb000-b7fce000 rwxp 00129000 07:03 1840       /lib/tls/libc-2.3.6.so
    b7fce000-b7fd1000 rwxp b7fce000 00:00 0
    b7fe7000-b7ffd000 r-xp 00000000 07:03 1730       /lib/ld-2.3.6.so
    b7ffd000-b7fff000 rwxp 00015000 07:03 1730       /lib/ld-2.3.6.so
    bffe7000-bfffd000 rwxp bffe7000 00:00 0          [stack]
    ffffe000-fffff000 ---p 00000000 00:00 0          [vdso]
    

    For binaries using shared libraries, this page can be seen using ldd also:

    slax ~ # ldd /bin/ls
            linux-gate.so.1 =>  (0xffffe000)
            librt.so.1 => /lib/tls/librt.so.1 (0xb7f5f000)
            ...
    

    Observe linux-gate.so.1. This is not a physical file. The content of this vdso can be seen as follows:

    ==> dd if=/proc/self/mem of=linux-gate.dso bs=4096 skip=1048574 count=1
    1+0 records in
    1+0 records out
    
    ==> objdump -d --start-address=0xffffe400 --stop-address=0xffffe414 linux-gate.dso
    ffffe400 <__kernel_vsyscall>:
    ffffe400:       51                      push   %ecx
    ffffe401:       52                      push   %edx
    ffffe402:       55                      push   %ebp
    ffffe403:       89 e5                   mov    %esp,%ebp
    ffffe405:       0f 34                   sysenter 
    ...
    ffffe40d:       90                      nop    
    ffffe40e:       eb f3                   jmp    ffffe403 <__kernel_vsyscall+0x3>
    ffffe410:       5d                      pop    %ebp
    ffffe411:       5a                      pop    %edx
    ffffe412:       59                      pop    %ecx
    ffffe413:       c3                      ret 
    

    In all listings, ... stands for omitted irrelevant code.

  3. Initiation: Userland processes (or the C library on their behalf) call __kernel_vsyscall to execute system calls. The address of __kernel_vsyscall is not fixed. The kernel passes this address to userland processes using the AT_SYSINFO ELF parameter. AT_ ELF parameters, a.k.a. ELF auxiliary vectors, are loaded on the process stack at startup, along with the process arguments and the environment variables. See [1] for more information on ELF auxiliary vectors.

    After moving to this address, registers %ecx, %edx and %ebp are saved on the user stack and %esp is copied to %ebp before executing sysenter. This %ebp later helps the kernel restore the userland stack. After executing the sysenter instruction, the processor starts execution at sysenter_entry. sysenter_entry is defined in /usr/src/linux/arch/i386/kernel/entry.S as: (See my comments in [ ])

        179 ENTRY(sysenter_entry)
        180         movl TSS_sysenter_esp0(%esp),%esp
        181 sysenter_past_esp:
        182         sti
        183         pushl $(__USER_DS)
        184         pushl %ebp			[%ebp contains userland %esp]
        185         pushfl
        186         pushl $(__USER_CS)
        187         pushl $SYSENTER_RETURN		[%userland return addr]
        188
    		....
        201         pushl %eax			
        202         SAVE_ALL			[pushes registers on to stack]
        203         GET_THREAD_INFO(%ebp)
        204
        205         /* Note, _TIF_SECCOMP is bit number 8, and so it needs testw and not testb */
        206         testw $(_TIF_SYSCALL_EMU|_TIF_SYSCALL_TRACE|_TIF_SECCOMP|_TIF_SYSCALL_AUDIT),
                                                                                 TI_flags(%ebp)
        207         jnz syscall_trace_entry
        208         cmpl $(nr_syscalls), %eax
        209         jae syscall_badsys
        210         call *sys_call_table(,%eax,4)
        211         movl %eax,EAX(%esp)
    		......
    
  4. Inside sysenter_entry: between lines 183 and 202, the kernel saves the current state by pushing register values onto the stack.

    Observe that $SYSENTER_RETURN is the userland return address as defined inside /usr/src/linux/arch/i386/kernel/vsyscall-sysenter.S and %ebp contains userland ESP as %esp was copied to %ebp before calling sysenter.

  5. After saving the state, the kernel validates the system call number stored in %eax. Finally, the appropriate system call is invoked using the instruction:

        210 call *sys_call_table(,%eax,4)
    

    This is very similar to the old way.

  6. After the system call is complete, the processor resumes execution at line 211. Looking further into the sysenter_entry definition:

        210         call *sys_call_table(,%eax,4)
        211         movl %eax,EAX(%esp)
        212         cli
        213         movl TI_flags(%ebp), %ecx
        214         testw $_TIF_ALLWORK_MASK, %cx
        215         jne syscall_exit_work
        216 /* if something modifies registers it must also disable sysexit */
        217         movl EIP(%esp), %edx			(EIP is 0x28)
        218         movl OLDESP(%esp), %ecx			(OLD ESP is 0x34)
        219         xorl %ebp,%ebp
        220         sti
        221         sysexit
    
  7. The value in %eax is copied to the stack. The userland return address (the to-be EIP) and the userland ESP are then copied from the kernel stack to %edx and %ecx respectively. Observe that the userland return address, $SYSENTER_RETURN, was pushed onto the stack in line 187. After that, 0x28 bytes have been pushed onto the stack. That's why 0x28(%esp) points to $SYSENTER_RETURN.

  8. After that, the SYSEXIT instruction is executed. As we know from the previous section, sysexit copies the value in %edx to EIP and the value in %ecx to ESP. sysexit transfers the processor back to ring 3, and the processor resumes execution in userland.

5. Some Code

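#include <stdio.h>

int pid;

int main(void) {
        /* 32-bit x86 only; basic asm, so registers take a single '%' */
        __asm__(
                "movl $20, %eax    \n"   /* __NR_getpid */
                "call *%gs:0x10    \n"   /* __kernel_vsyscall; offset 0x10 is not fixed across systems */
                "movl %eax, pid    \n"   /* store the return value in the global pid */
        );
        printf("pid is %d\n", pid);
        return 0;
}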

This does the getpid() system call (__NR_getpid is 20) using __kernel_vsyscall instead of int 0x80. Why %gs:0x10? Parsing the process stack to find AT_SYSINFO's value can be a cumbersome task. So, when libc.so (the C library) is loaded, it copies the value of AT_SYSINFO from the process stack to the TCB (Thread Control Block). The segment register %gs refers to the TCB.

Please note that the offset 0x10 is not fixed across systems. I found it for my system using GDB. A system-independent way to find AT_SYSINFO is given in [1].

Note: This example is taken from http://www.win.tue.nl/~aeb/linux/lk/lk-4.html with a little modification to make it work on my system.

6. References

Here are some references that helped me understand this.

  1. About Elf auxiliary vectors By Manu Garg

  2. What is linux-gate.so.1? By Johan Petersson

  3. The Linux kernel: System Calls By Andries Brouwer

  4. Understanding the Linux Kernel, By Daniel P. Bovet, Marco Cesati

  5. Linux Kernel source code  

