glibc Notes: System Calls

Note: the code discussed in this article comes from the master branch of glibc (later than version 2.33).

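1. Introduction

A system call is a function interface with predefined behavior that the operating system kernel exposes to applications. A system call passes the application's request to the kernel; the kernel runs the corresponding function to do the work and returns the result to the application.

Applications run in user mode, where many operations are restricted, whereas system calls execute in kernel mode. So how does a user-mode application run kernel-mode code? Traditionally, the operating system switches from user mode to kernel mode through an interrupt; newer CPUs also provide dedicated instructions for this, such as the x86_64 syscall instruction that appears in the disassembly later in this article.

Interrupts are divided into hardware interrupts and software interrupts. A software interrupt is usually a single instruction that lets the user trigger a particular interrupt manually. An interrupt has two attributes: an interrupt number and an interrupt handler. Each interrupt has its own number, and each number maps to one handler. Interrupt numbers are limited, so system calls do not each get their own interrupt. Instead, every system call has a system call number. Before triggering the interrupt, the caller places the system call number in a fixed register; the handler reads that register and decides which system call's code to run.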
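
As a minimal sketch of the idea that a system call is selected by its number, the C fragment below asks the kernel for the process ID through the generic syscall() function instead of the getpid() wrapper. It assumes Linux with glibc; pid_by_number is a hypothetical helper name used only for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the system call number */
#include <sys/types.h>     /* pid_t */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: obtain our PID by passing the system call number
   directly, bypassing the dedicated getpid() wrapper.  */
static pid_t pid_by_number(void)
{
    return (pid_t) syscall(SYS_getpid);
}
```

Calling pid_by_number() should give the same value as getpid(), since both end up in the same kernel routine.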

2. Wrappers

Wiki page: https://sourceware.org/glibc/wiki/SyscallWrappers

glibc uses three kinds of wrappers for operating system kernel system calls: assembly, macro, and bespoke.

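2.1. Assembly system calls

Simple kernel system calls in glibc are translated from a list of names into assembly wrappers, which are then compiled.

Disassembling the socket system call in the build directory shows the syscall-template.S wrapper: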

maminjie@fedora ~/w/g/build> objdump -ldr socket/socket.o

socket/socket.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__socket>:
__socket():
/mnt/hgfs/projects/linux/glibc/socket/../sysdeps/unix/syscall-template.S:120
   0:   b8 29 00 00 00          mov    $0x29,%eax
   5:   0f 05                   syscall
   7:   48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax
   d:   0f 83 00 00 00 00       jae    13 <__socket+0x13>
                        f: R_X86_64_PLT32       __syscall_error-0x4
/mnt/hgfs/projects/linux/glibc/socket/../sysdeps/unix/syscall-template.S:122
  13:   c3                      retq
maminjie@fedora ~/w/g/build>

The list of system calls that use wrappers is kept in the syscalls.list files:

maminjie@fedora /m/h/p/l/glibc (master)> find . -name syscalls.list
./sysdeps/unix/bsd/syscalls.list
./sysdeps/unix/syscalls.list
./sysdeps/unix/sysv/linux/alpha/syscalls.list
./sysdeps/unix/sysv/linux/arc/syscalls.list
./sysdeps/unix/sysv/linux/arm/syscalls.list
./sysdeps/unix/sysv/linux/csky/syscalls.list
./sysdeps/unix/sysv/linux/generic/syscalls.list
./sysdeps/unix/sysv/linux/generic/wordsize-32/syscalls.list
./sysdeps/unix/sysv/linux/hppa/syscalls.list
./sysdeps/unix/sysv/linux/i386/syscalls.list
./sysdeps/unix/sysv/linux/ia64/syscalls.list
./sysdeps/unix/sysv/linux/m68k/syscalls.list
./sysdeps/unix/sysv/linux/microblaze/syscalls.list
./sysdeps/unix/sysv/linux/mips/mips32/syscalls.list
./sysdeps/unix/sysv/linux/mips/mips64/n32/syscalls.list
./sysdeps/unix/sysv/linux/mips/mips64/n64/syscalls.list
./sysdeps/unix/sysv/linux/mips/syscalls.list
./sysdeps/unix/sysv/linux/powerpc/powerpc32/syscalls.list
./sysdeps/unix/sysv/linux/s390/s390-32/syscalls.list
./sysdeps/unix/sysv/linux/sh/syscalls.list
./sysdeps/unix/sysv/linux/sparc/sparc32/syscalls.list
./sysdeps/unix/sysv/linux/sparc/sparc64/syscalls.list
./sysdeps/unix/sysv/linux/syscalls.list
./sysdeps/unix/sysv/linux/wordsize-64/syscalls.list
./sysdeps/unix/sysv/linux/x86_64/syscalls.list
./sysdeps/unix/sysv/linux/x86_64/x32/syscalls.list

The sysdep directory ordering determines which syscalls.list files are applied. On x86_64, for example, the following files are applied:

./sysdeps/unix/sysv/linux/syscalls.list
./sysdeps/unix/sysv/linux/wordsize-64/syscalls.list
./sysdeps/unix/sysv/linux/x86_64/syscalls.list

The makefile rules that handle the system call wrappers are in sysdeps/unix/Makefile, for example:

...
ifndef avoid-generated
$(common-objpfx)sysd-syscalls: $(..)sysdeps/unix/make-syscalls.sh \
                   $(wildcard $(+sysdep_dirs:%=%/syscalls.list)) \
                   $(wildcard $(+sysdep_dirs:%=%/arch-syscall.h)) \
                   $(common-objpfx)libc-modules.stmp
    for dir in $(+sysdep_dirs); do \
      test -f $$dir/syscalls.list && \
      { sysdirs='$(sysdirs)' \
        asm_CPP='$(COMPILE.S) -E -x assembler-with-cpp' \
        $(SHELL) $(dir $<)$(notdir $<) $$dir || exit 1; }; \
      test $$dir = $(..)sysdeps/unix && break; \
    done > $@T
    mv -f $@T $@
endif
...

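The syscalls.list files are processed by a script named sysdeps/unix/make-syscalls.sh, whose comments describe the format of a syscalls.list file.

The script generates assembly files from a template named syscall-template.S, which uses machine-specific macros to build the system call wrapper. A machine can override syscall-template.S with its own copy, because the template is also selected according to the sysdep directory ordering.

Finally, the per-machine macros are provided by the sysdep.h header files: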

maminjie@fedora /m/h/p/l/glibc (master)> find . -name sysdep.h
./sysdeps/aarch64/sysdep.h
./sysdeps/arc/sysdep.h
./sysdeps/arm/sysdep.h
./sysdeps/csky/sysdep.h
./sysdeps/generic/sysdep.h
./sysdeps/hppa/sysdep.h
./sysdeps/i386/sysdep.h
./sysdeps/ia64/sysdep.h
./sysdeps/m68k/coldfire/sysdep.h
./sysdeps/m68k/m680x0/sysdep.h
./sysdeps/m68k/sysdep.h
./sysdeps/mach/i386/sysdep.h
./sysdeps/mach/sysdep.h
./sysdeps/microblaze/sysdep.h
./sysdeps/nios2/sysdep.h
./sysdeps/powerpc/powerpc32/sysdep.h
./sysdeps/powerpc/powerpc64/sysdep.h
./sysdeps/powerpc/sysdep.h
./sysdeps/s390/s390-32/sysdep.h
./sysdeps/s390/s390-64/sysdep.h
./sysdeps/sh/sysdep.h
./sysdeps/sparc/sysdep.h
./sysdeps/unix/arm/sysdep.h
./sysdeps/unix/i386/sysdep.h
./sysdeps/unix/mips/mips32/sysdep.h
./sysdeps/unix/mips/mips64/sysdep.h
./sysdeps/unix/mips/sysdep.h
./sysdeps/unix/powerpc/sysdep.h
./sysdeps/unix/sh/sysdep.h
./sysdeps/unix/sysdep.h
./sysdeps/unix/sysv/linux/aarch64/sysdep.h
./sysdeps/unix/sysv/linux/alpha/sysdep.h
./sysdeps/unix/sysv/linux/arc/sysdep.h
./sysdeps/unix/sysv/linux/arm/sysdep.h
./sysdeps/unix/sysv/linux/csky/sysdep.h
./sysdeps/unix/sysv/linux/generic/sysdep.h
./sysdeps/unix/sysv/linux/hppa/sysdep.h
./sysdeps/unix/sysv/linux/i386/sysdep.h
./sysdeps/unix/sysv/linux/ia64/sysdep.h
./sysdeps/unix/sysv/linux/m68k/coldfire/sysdep.h
./sysdeps/unix/sysv/linux/m68k/m680x0/sysdep.h
./sysdeps/unix/sysv/linux/m68k/sysdep.h
./sysdeps/unix/sysv/linux/microblaze/sysdep.h
./sysdeps/unix/sysv/linux/mips/mips32/sysdep.h
./sysdeps/unix/sysv/linux/mips/mips64/sysdep.h
./sysdeps/unix/sysv/linux/mips/sysdep.h
./sysdeps/unix/sysv/linux/nios2/sysdep.h
./sysdeps/unix/sysv/linux/powerpc/powerpc64/sysdep.h
./sysdeps/unix/sysv/linux/powerpc/sysdep.h
./sysdeps/unix/sysv/linux/riscv/sysdep.h
./sysdeps/unix/sysv/linux/s390/s390-32/sysdep.h
./sysdeps/unix/sysv/linux/s390/s390-64/sysdep.h
./sysdeps/unix/sysv/linux/s390/sysdep.h
./sysdeps/unix/sysv/linux/sh/sh4/sysdep.h
./sysdeps/unix/sysv/linux/sh/sysdep.h
./sysdeps/unix/sysv/linux/sparc/sparc32/sysdep.h
./sysdeps/unix/sysv/linux/sparc/sparc64/sysdep.h
./sysdeps/unix/sysv/linux/sparc/sysdep.h
./sysdeps/unix/sysv/linux/sysdep.h
./sysdeps/unix/sysv/linux/x86_64/sysdep.h
./sysdeps/unix/sysv/linux/x86_64/x32/sysdep.h
./sysdeps/unix/x86_64/sysdep.h
./sysdeps/x86/sysdep.h
./sysdeps/x86_64/sysdep.h
./sysdeps/x86_64/x32/sysdep.h
maminjie@fedora /m/h/p/l/glibc (master)>

All of these pieces together produce the compilation of a wrapper, like this:

(echo '#define SYSCALL_NAME socket';
echo '#define SYSCALL_NARGS 3';
echo '#define SYSCALL_SYMBOL __socket';
echo '#define SYSCALL_CANCELLABLE 0';
echo '#define SYSCALL_NOERRNO 0';
echo '#define SYSCALL_ERRVAL 0';
echo '#include <syscall-template.S>';
echo 'weak_alias (__socket, socket)';
echo 'hidden_weak (socket)';
) | /opt/cross/x86_64-linux-gnu/bin/x86_64-glibc-linux-gnu-gcc -c -I…/include -I/home/azanella/Projects/glibc/build/x86_64-linux-gnu/socket -I/home/azanella/Projects/glibc/build/x86_64-linux-gnu -I…/sysdeps/unix/sysv/linux/x86_64/64 -I…/sysdeps/unix/sysv/linux/x86_64 -I…/sysdeps/unix/sysv/linux/x86 -I…/sysdeps/x86/nptl -I…/sysdeps/unix/sysv/linux/wordsize-64 -I…/sysdeps/x86_64/nptl -I…/sysdeps/unix/sysv/linux/include -I…/sysdeps/unix/sysv/linux -I…/sysdeps/nptl -I…/sysdeps/pthread -I…/sysdeps/gnu -I…/sysdeps/unix/inet -I…/sysdeps/unix/sysv -I…/sysdeps/unix/x86_64 -I…/sysdeps/unix -I…/sysdeps/posix -I…/sysdeps/x86_64/64 -I…/sysdeps/x86_64/fpu/multiarch -I…/sysdeps/x86_64/fpu -I…/sysdeps/x86/fpu/include -I…/sysdeps/x86/fpu -I…/sysdeps/x86_64/multiarch -I…/sysdeps/x86_64 -I…/sysdeps/x86 -I…/sysdeps/ieee754/float128 -I…/sysdeps/ieee754/ldbl-96/include -I…/sysdeps/ieee754/ldbl-96 -I…/sysdeps/ieee754/dbl-64/wordsize-64 -I…/sysdeps/ieee754/dbl-64 -I…/sysdeps/ieee754/flt-32 -I…/sysdeps/wordsize-64 -I…/sysdeps/ieee754 -I…/sysdeps/generic -I… -I…/libio -I. -D_LIBC_REENTRANT -include /home/azanella/Projects/glibc/build/x86_64-linux-gnu/libc-modules.h -DMODULE_NAME=libc -include …/include/libc-symbols.h -DPIC -DSHARED -DTOP_NAMESPACE=glibc -DASSEMBLER -g -Werror=undef -Wa,--noexecstack -o /home/azanella/Projects/glibc/build/x86_64-linux-gnu/socket/socket.os -x assembler-with-cpp - -MD -MP -MF /home/azanella/Projects/glibc/build/x86_64-linux-gnu/socket/socket.os.dt -MT /home/azanella/Projects/glibc/build/x86_64-linux-gnu/socket/socket.os

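Note the use of -x assembler-with-cpp; these wrappers should therefore contain assembly only.

Note: glibc 2.26 and earlier defined cancellable system calls through an auxiliary header, sysdep-cancel.h, which contained macros with the required steps (calling the __{libc,pthread,librt}_{enable,disable}_asynccancel functions, in both nopic and pic mode). glibc 2.27 and later need only the default sysdep.h assembly macros, and all cancellable system calls are implemented in C files with the SYSCALL_CANCEL macro.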

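2.2. Macro system calls

Macro system calls are handled by *.c files, which are considerably more complex than the simple wrappers.

Some system calls may need to shuffle the kernel result into user-space structures, so glibc needs a way to make inline system calls from C code.

This is handled by macros defined in the sysdep.h files.

These macros are all named INTERNAL_* and INLINE_*, and come in several variants for use by the source code.

For example, you can see these macros at work in the wait4 implementation (sysdeps/unix/sysv/linux/wait4.c):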

...
pid_t
__wait4_time64 (pid_t pid, int *stat_loc, int options, struct __rusage64 *usage)
{
#ifdef __NR_wait4
# if __KERNEL_OLD_TIMEVAL_MATCHES_TIMEVAL64
  return SYSCALL_CANCEL (wait4, pid, stat_loc, options, usage);
# else
  pid_t ret;
  struct __rusage32 usage32;

  ret = SYSCALL_CANCEL (wait4, pid, stat_loc, options,
                        usage != NULL ? &usage32 : NULL);

  if (ret > 0 && usage != NULL)
    rusage32_to_rusage64 (&usage32, usage);

  return ret;
# endif
#elif defined (__ASSUME_WAITID_PID0_P_PGID)
  idtype_t idtype = P_PID;
...

The function __wait4_time64 invokes the SYSCALL_CANCEL macro, which is defined in sysdeps/unix/sysdep.h as follows:

#define SYSCALL_CANCEL(...) \
  ({                                                             \
    long int sc_ret;                                             \
    if (NO_SYSCALL_CANCEL_CHECKING)                              \
      sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__);                \
    else                                                         \
      {                                                          \
        int sc_cancel_oldtype = LIBC_CANCEL_ASYNC ();            \
        sc_ret = INLINE_SYSCALL_CALL (__VA_ARGS__);              \
        LIBC_CANCEL_RESET (sc_cancel_oldtype);                   \
      }                                                          \
    sc_ret;                                                      \
  })

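LIBC_CANCEL_ASYNC calls __{libc,pthread,librt}_enable_asynccancel to atomically enable asynchronous cancellation mode before the system call. LIBC_CANCEL_RESET, on the other hand, calls __{libc,pthread,librt}_disable_asynccancel to atomically disable asynchronous cancellation mode and take any further action that is required.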
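
The enable/call/restore shape of SYSCALL_CANCEL can be sketched with the public pthread API. This is only an analogy under stated assumptions: glibc itself uses internal functions rather than pthread_setcanceltype, and read_with_async_cancel is a hypothetical helper name.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical helper: run read() with asynchronous cancellation enabled
   only for the duration of the call, then restore the previous
   cancellation type.  The three steps correspond to LIBC_CANCEL_ASYNC,
   the inline system call, and LIBC_CANCEL_RESET.  */
static ssize_t read_with_async_cancel(int fd, void *buf, size_t len)
{
    int oldtype, ignored;

    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);
    ssize_t ret = read(fd, buf, len);
    pthread_setcanceltype(oldtype, &ignored);
    return ret;
}
```

The point of the pattern is that a thread blocked inside the call can be cancelled, while code before and after the call keeps the caller's original cancellation type.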

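2.3. Bespoke system calls

The British term "bespoke" means made to order to the buyer's specification. There are a few places in glibc that make system calls without using either the standard assembly wrappers or the C macros.

The best examples are the fork and vfork implementations, which on Linux need specific calling conventions that depend on the architecture. For example, for x86_64 (sysdeps/unix/sysv/linux/x86_64/vfork.S):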

/* Clone the calling process, but without copying the whole address space.
   The calling process is suspended until the new process exits or is
   replaced by a call to `execve'.  Return -1 for errors, 0 to the new process,
   and the process ID of the new process to the old process.  */

ENTRY (__vfork)

    /* Pop the return PC value into RDI.  We need a register that
       is preserved by the syscall and that we're allowed to destroy. */
    popq    %rdi
    cfi_adjust_cfa_offset(-8)
    cfi_register(%rip, %rdi)

    /* Stuff the syscall number in RAX and enter into the kernel.  */
    movl    $SYS_ify (vfork), %eax
    syscall

    /* Push back the return PC.  */
    pushq   %rdi
    cfi_adjust_cfa_offset(8)

    cmpl    $-4095, %eax
    jae SYSCALL_ERROR_LABEL     /* Branch forward if it failed.  */

#if SHSTK_ENABLED
    /* Check if shadow stack is in use.  */
    xorl    %esi, %esi
    rdsspq  %rsi
    testq   %rsi, %rsi
    /* Normal return if shadow stack isn't in use.  */
    je  L(no_shstk)

    testl   %eax, %eax
    /* In parent, normal return.  */
    jnz L(no_shstk)

    /* NB: In child, jump back to caller via indirect branch without
       popping shadow stack which is shared with parent.  Keep shadow
       stack mismatched so that child returns in the vfork-calling
       function will trigger SIGSEGV.  */
    popq    %rdi
    cfi_adjust_cfa_offset(-8)
    jmp *%rdi

L(no_shstk):
#endif

    /* Normal return.  */
    ret

PSEUDO_END (__vfork)

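It uses a sysdep.h macro for the function return (SYSCALL_ERROR_LABEL), but because of specific ABI and semantic constraints it needs a dedicated assembly implementation.

In truth, most of the bespoke cases should probably be cleaned up to use the macros.

3. Assembly system calls in detail

From the section "Wrappers -> Assembly system calls" we know that an assembly system call is built from three main pieces: make-syscalls.sh, syscall-template.S, and syscalls.list. make-syscalls.sh is a shell script that reads the syscalls.list files and parses each line. syscall-template.S is the template for the system call wrapper and contains the wrapper code.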

3.1. syscalls.list

Let's use sysdeps/unix/syscalls.list as an example to understand the content of a syscalls.list file:

# File name Caller  Syscall name    Args    Strong name Weak names

accept      -   accept      Ci:iBN  __libc_accept   accept
access      -   access      i:si    __access    access
acct        -   acct        i:S acct
adjtime     -   adjtime     i:pp    __adjtime   adjtime
bind        -   bind        i:ipi   __bind      bind
chdir       -   chdir       i:s __chdir     chdir
...

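A syscalls.list file consists of many lines, one per system call. Each line has six columns:

  • File name: the name of the object file generated for the system call
  • Caller: the caller ("-" in all of the entries above)
  • Syscall name: the name of the system call
  • Args: the types and number of the system call's arguments, and the type of its return value
    The part before the colon (:) is the return type; the part after it gives the argument types and count.
    Syscall signature prefixes:
    E: errno and return value are not set by the call
    V: errno is not set, but errno or zero (success) is returned from the call
    C: not described in the script comments; in glibc 2.26 and earlier it appears to have marked a cancellable system call
    Syscall signature key letters:
    a: unchecked address (e.g., 1st argument of mmap)
    b: non-NULL buffer (e.g., 2nd argument of read; return value of mmap)
    B: optionally-NULL buffer (e.g., 4th argument of getsockopt)
    f: buffer of 2 ints (e.g., 4th argument of socketpair)
    F: 3rd argument of fcntl
    i: scalar (any signedness and size: int, long, long long, enum, and so on)
    I: 3rd argument of ioctl
    n: scalar buffer length (e.g., 3rd argument of read)
    N: pointer to a value/return scalar buffer length (e.g., 6th argument of recvfrom)
    p: non-NULL pointer to a typed object (e.g., any non-void* argument)
    P: optionally-NULL pointer to a typed object (e.g., 3rd argument of sigaction)
    s: non-NULL string (e.g., 1st argument of open)
    S: optionally-NULL string (e.g., 1st argument of acct)
    U: unsigned long int (32-bit types are zero-extended to 64-bit types)
    v: vararg scalar (e.g., optional 3rd argument of open)
    V: byte-per-page vector (3rd argument of mincore)
    W: wait status, optionally-NULL pointer to int (e.g., 2nd argument of wait4)
    (Note: the descriptions above come from the comments in sysdeps/unix/make-syscalls.sh)
  • Strong name: the name of the function that implements the system call
  • Weak names: aliases for that function; the function can also be called through an alias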
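
To make the Args column concrete, here is a small C sketch that extracts the argument count, the E/V prefix flags, and the positions of U arguments from a signature string. It is a simplified illustration, not the logic of make-syscalls.sh (which is a shell script); struct sig_info and parse_sig are hypothetical names, and the sketch assumes that a prefix is present exactly when two characters precede the colon.

```c
#include <string.h>

/* Hypothetical result type for this sketch.  */
struct sig_info {
    int nargs;        /* number of argument letters after the colon */
    int noerrno;      /* 1 if the signature has the E prefix */
    int errval;       /* 1 if the signature has the V prefix */
    int ulong_arg_1;  /* 1-based position of the first U argument, 0 if none */
    int ulong_arg_2;  /* 1-based position of the second U argument, 0 if none */
};

/* Parse an Args field such as "i:s" or "Ci:iBN".
   Returns 0 on success, -1 if the field contains no colon.  */
static int parse_sig(const char *sig, struct sig_info *out)
{
    memset(out, 0, sizeof *out);

    const char *colon = strchr(sig, ':');
    if (colon == NULL)
        return -1;

    /* Two characters before the colon: prefix letter + return type.  */
    if (colon - sig == 2) {
        if (sig[0] == 'E')
            out->noerrno = 1;
        else if (sig[0] == 'V')
            out->errval = 1;
    }

    /* Every letter after the colon is one argument.  */
    for (const char *p = colon + 1; *p != '\0'; p++) {
        out->nargs++;
        if (*p == 'U') {
            if (out->ulong_arg_1 == 0)
                out->ulong_arg_1 = out->nargs;
            else if (out->ulong_arg_2 == 0)
                out->ulong_arg_2 = out->nargs;
        }
    }
    return 0;
}
```

With this sketch, the chdir entry "i:s" yields one argument and the accept entry "Ci:iBN" yields three.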

3.2. assembly syscall wrappers

Let's look again at the rule in sysdeps/unix/Makefile:

...
ifndef avoid-generated
$(common-objpfx)sysd-syscalls: $(..)sysdeps/unix/make-syscalls.sh \
                   $(wildcard $(+sysdep_dirs:%=%/syscalls.list)) \
                   $(wildcard $(+sysdep_dirs:%=%/arch-syscall.h)) \
                   $(common-objpfx)libc-modules.stmp
    for dir in $(+sysdep_dirs); do \
      test -f $$dir/syscalls.list && \
      { sysdirs='$(sysdirs)' \
        asm_CPP='$(COMPILE.S) -E -x assembler-with-cpp' \
        $(SHELL) $(dir $<)$(notdir $<) $$dir || exit 1; }; \
      test $$dir = $(..)sysdeps/unix && break; \
    done > $@T
    mv -f $@T $@
endif
...

During the build, this rule expands to the following:

touch /home/maminjie/work/glibc/tmp-build/libc-modules.stmp
for dir in /home/maminjie/work/glibc/tmp-build sysdeps/unix/sysv/linux/x86_64/64 sysdeps/unix/sysv/linux/x86_64 sysdeps/unix/sysv/linux/x86 sysdeps/x86/nptl sysdeps/unix/sysv/linux/wordsize-64 sysdeps/x86_64/nptl sysdeps/unix/sysv/linux sysdeps/nptl sysdeps/pthread sysdeps/gnu sysdeps/unix/inet sysdeps/unix/sysv sysdeps/unix/x86_64 sysdeps/unix sysdeps/posix sysdeps/x86_64/64 sysdeps/x86_64/fpu/multiarch sysdeps/x86_64/fpu sysdeps/x86/fpu sysdeps/x86_64/multiarch sysdeps/x86_64 sysdeps/x86 sysdeps/ieee754/float128 sysdeps/ieee754/ldbl-96 sysdeps/ieee754/dbl-64 sysdeps/ieee754/flt-32 sysdeps/wordsize-64 sysdeps/ieee754 sysdeps/generic; do \
  test -f $dir/syscalls.list && \
  { sysdirs='sysdeps/unix/sysv/linux/x86_64/64 sysdeps/unix/sysv/linux/x86_64 sysdeps/unix/sysv/linux/x86 sysdeps/x86/nptl sysdeps/unix/sysv/linux/wordsize-64 sysdeps/x86_64/nptl sysdeps/unix/sysv/linux sysdeps/nptl sysdeps/pthread sysdeps/gnu sysdeps/unix/inet sysdeps/unix/sysv sysdeps/unix/x86_64 sysdeps/unix sysdeps/posix sysdeps/x86_64/64 sysdeps/x86_64/fpu/multiarch sysdeps/x86_64/fpu sysdeps/x86/fpu sysdeps/x86_64/multiarch sysdeps/x86_64 sysdeps/x86 sysdeps/ieee754/float128 sysdeps/ieee754/ldbl-96 sysdeps/ieee754/dbl-64 sysdeps/ieee754/flt-32 sysdeps/wordsize-64 sysdeps/ieee754 sysdeps/generic' \
    asm_CPP='gcc -c     -Iinclude   -I/home/maminjie/work/glibc/tmp-build  -Isysdeps/unix/sysv/linux/x86_64/64  -Isysdeps/unix/sysv/linux/x86_64  -Isysdeps/unix/sysv/linux/x86/include -Isysdeps/unix/sysv/linux/x86  -Isysdeps/x86/nptl  -Isysdeps/unix/sysv/linux/wordsize-64  -Isysdeps/x86_64/nptl  -Isysdeps/unix/sysv/linux/include -Isysdeps/unix/sysv/linux  -Isysdeps/nptl  -Isysdeps/pthread  -Isysdeps/gnu  -Isysdeps/unix/inet  -Isysdeps/unix/sysv  -Isysdeps/unix/x86_64  -Isysdeps/unix  -Isysdeps/posix  -Isysdeps/x86_64/64  -Isysdeps/x86_64/fpu/multiarch  -Isysdeps/x86_64/fpu  -Isysdeps/x86/fpu  -Isysdeps/x86_64/multiarch  -Isysdeps/x86_64  -Isysdeps/x86/include -Isysdeps/x86  -Isysdeps/ieee754/float128  -Isysdeps/ieee754/ldbl-96/include -Isysdeps/ieee754/ldbl-96  -Isysdeps/ieee754/dbl-64  -Isysdeps/ieee754/flt-32  -Isysdeps/wordsize-64  -Isysdeps/ieee754  -Isysdeps/generic   -Ilibio -I.  -D_LIBC_REENTRANT -include /home/maminjie/work/glibc/tmp-build/libc-modules.h -DMODULE_NAME=libc -include include/libc-symbols.h       -DTOP_NAMESPACE=glibc -DASSEMBLER  -g -Werror=undef -Wa,--noexecstack   -E -x assembler-with-cpp' \
    /bin/sh sysdeps/unix/make-syscalls.sh $dir || exit 1; }; \
  test $dir = sysdeps/unix && break; \
done > /home/maminjie/work/glibc/tmp-build/sysd-syscallsT
mv -f /home/maminjie/work/glibc/tmp-build/sysd-syscallsT /home/maminjie/work/glibc/tmp-build/sysd-syscalls

sysdeps/unix/make-syscalls.sh walks the syscalls.list file in each relevant directory and writes the result to the sysd-syscalls file.
The content of sysd-syscalls looks like this:

#### DIRECTORY = sysdeps/unix/sysv/linux/x86_64
#### SYSDIRS = sysdeps/unix/sysv/linux/x86_64/64

#### CALL=arch_prctl NUMBER=158 ARGS=i:ii SOURCE=-
ifeq (,$(filter arch_prctl,$(unix-syscalls)))
unix-syscalls += arch_prctl
unix-extra-syscalls += arch_prctl
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,arch_prctl)$o)): \
        $(..)sysdeps/unix/make-syscalls.sh
    $(make-target-directory)
    (echo '#define SYSCALL_NAME arch_prctl'; \
     echo '#define SYSCALL_NARGS 2'; \
     echo '#define SYSCALL_ULONG_ARG_1 0'; \
     echo '#define SYSCALL_ULONG_ARG_2 0'; \
     echo '#define SYSCALL_SYMBOL __arch_prctl'; \
     echo '#define SYSCALL_NOERRNO 0'; \
     echo '#define SYSCALL_ERRVAL 0'; \
     echo '#include <syscall-template.S>'; \
     echo 'weak_alias (__arch_prctl, arch_prctl)'; \
     echo 'hidden_weak (arch_prctl)'; \
    ) | $(compile-syscall) $(foreach p,$(patsubst %arch_prctl,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
...
#### DIRECTORY = sysdeps/unix/sysv/linux/wordsize-64
#### SYSDIRS = sysdeps/unix/sysv/linux/x86_64/64 sysdeps/unix/sysv/linux/x86_64 sysdeps/unix/sysv/linux/x86 sysdeps/x86/nptl

#### CALL=sendfile NUMBER=40 ARGS=i:iipi SOURCE=-
ifeq (,$(filter sendfile,$(unix-syscalls)))
unix-syscalls += sendfile
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,sendfile)$o)): \
        $(..)sysdeps/unix/make-syscalls.sh
    $(make-target-directory)
    (echo '#define SYSCALL_NAME sendfile'; \
     echo '#define SYSCALL_NARGS 4'; \
     echo '#define SYSCALL_ULONG_ARG_1 0'; \
     echo '#define SYSCALL_ULONG_ARG_2 0'; \
     echo '#define SYSCALL_SYMBOL sendfile'; \
     echo '#define SYSCALL_NOERRNO 0'; \
     echo '#define SYSCALL_ERRVAL 0'; \
     echo '#include <syscall-template.S>'; \
     echo 'weak_alias (sendfile, sendfile64)'; \
     echo 'hidden_weak (sendfile64)'; \
    ) | $(compile-syscall) $(foreach p,$(patsubst %sendfile,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
...
#### DIRECTORY = sysdeps/unix/sysv/linux
#### SYSDIRS = sysdeps/unix/sysv/linux/x86_64/64 sysdeps/unix/sysv/linux/x86_64 sysdeps/unix/sysv/linux/x86 sysdeps/x86/nptl sysdeps/unix/sysv/linux/wordsize-64 sysdeps/x86_64/nptl

#### CALL=alarm NUMBER=37 ARGS=i:i SOURCE=-
ifeq (,$(filter alarm,$(unix-syscalls)))
unix-syscalls += alarm
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,alarm)$o)): \
        $(..)sysdeps/unix/make-syscalls.sh
    $(make-target-directory)
    (echo '#define SYSCALL_NAME alarm'; \
     echo '#define SYSCALL_NARGS 1'; \
     echo '#define SYSCALL_ULONG_ARG_1 0'; \
     echo '#define SYSCALL_ULONG_ARG_2 0'; \
     echo '#define SYSCALL_SYMBOL alarm'; \
     echo '#define SYSCALL_NOERRNO 0'; \
     echo '#define SYSCALL_ERRVAL 0'; \
     echo '#include <syscall-template.S>'; \
    ) | $(compile-syscall) $(foreach p,$(patsubst %alarm,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif
...

Taking chdir as an example:

#### CALL=chdir NUMBER=80 ARGS=i:s SOURCE=-
ifeq (,$(filter chdir,$(unix-syscalls)))
unix-syscalls += chdir
$(foreach p,$(sysd-rules-targets),$(foreach o,$(object-suffixes),$(objpfx)$(patsubst %,$p,chdir)$o)): \
        $(..)sysdeps/unix/make-syscalls.sh
    $(make-target-directory)
    (echo '#define SYSCALL_NAME chdir'; \
     echo '#define SYSCALL_NARGS 1'; \
     echo '#define SYSCALL_ULONG_ARG_1 0'; \
     echo '#define SYSCALL_ULONG_ARG_2 0'; \
     echo '#define SYSCALL_SYMBOL __chdir'; \
     echo '#define SYSCALL_NOERRNO 0'; \
     echo '#define SYSCALL_ERRVAL 0'; \
     echo '#include <syscall-template.S>'; \
     echo 'weak_alias (__chdir, chdir)'; \
     echo 'hidden_weak (chdir)'; \
    ) | $(compile-syscall) $(foreach p,$(patsubst %chdir,%,$(basename $(@F))),$($(p)CPPFLAGS))
endif

In effect, the assembly syscall wrapper for chdir is finally expanded into the following temporary content, which is then compiled:

#define SYSCALL_NAME chdir
#define SYSCALL_NARGS 1
#define SYSCALL_ULONG_ARG_1 0
#define SYSCALL_ULONG_ARG_2 0
#define SYSCALL_SYMBOL __chdir
#define SYSCALL_NOERRNO 0
#define SYSCALL_ERRVAL 0
#include <syscall-template.S>
weak_alias (__chdir, chdir)
hidden_weak (chdir)

The chdir system call is defined in sysdeps/unix/syscalls.list as follows:

# File name Caller  Syscall name    Args    Strong name Weak names
...
chdir       -   chdir       i:s __chdir     chdir
...

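Each system call's object file is built by a rule in sysd-syscalls, which is generated by make-syscalls.sh. The rule defines several macros and then does #include <syscall-template.S>:

  • SYSCALL_NAME: the system call name. Taken from the Syscall name column.
  • SYSCALL_NARGS: the number of arguments this call takes. Derived by parsing the Args column.
  • SYSCALL_ULONG_ARG_1: the first unsigned long int argument this call takes.
    0 means that there are no unsigned long int arguments. Derived by parsing the Args column.
  • SYSCALL_ULONG_ARG_2: the second unsigned long int argument this call takes.
    0 means that there is at most one unsigned long int argument. Derived by parsing the Args column.
  • SYSCALL_SYMBOL: the primary symbol name. Taken from the Strong name column.
  • SYSCALL_NOERRNO: 1 to define a no-errno version, that is, a call that never reports an error. Set by parsing the Args column.
  • SYSCALL_ERRVAL: 1 to define an error-value version, which returns the error number directly instead of returning -1 and storing the error number in errno. Set by parsing the Args column.
    (Note: the descriptions above are based on the comments in sysdeps/unix/syscall-template.S)

weak_alias (__chdir, chdir): defines chdir as a weak alias of the __chdir function, so calling chdir invokes __chdir. The name chdir comes from the Weak names column.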
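
The effect of weak_alias can be sketched in plain C with the GCC/Clang alias attribute. This is an illustration that assumes an ELF target such as Linux; demo_impl and demo_weak are hypothetical names standing in for __chdir and chdir, and the sketch does not reproduce glibc's actual weak_alias macro.

```c
/* Strong definition, playing the role of __chdir.  */
int demo_impl(int x)
{
    return x + 1;
}

/* Weak alias, playing the role of chdir: a second symbol name that
   refers to the same code as demo_impl.  */
extern int demo_weak(int x) __attribute__((weak, alias("demo_impl")));
```

Both names resolve to the same function, so demo_weak(41) and demo_impl(41) run the same code. Because the alias is weak, another object file may supply its own strong demo_weak without a link error.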

3.3. syscall-template.S

#include <sysdep.h>

/* This indirection is needed so that SYMBOL gets macro-expanded.  */
#define syscall_hidden_def(SYMBOL)      hidden_def (SYMBOL)

/* If PSEUDOS_HAVE_ULONG_INDICES is defined, PSEUDO and T_PSEUDO macros
   have 2 extra arguments for unsigned long int arguments:
     Extra argument 1: Position of the first unsigned long int argument.
     Extra argument 2: Position of the second unsigned long int argument.
 */
#ifndef PSEUDOS_HAVE_ULONG_INDICES
# undef SYSCALL_ULONG_ARG_1
# define SYSCALL_ULONG_ARG_1 0
#endif

#if SYSCALL_ULONG_ARG_1
# define T_PSEUDO(SYMBOL, NAME, N, U1, U2) \
  PSEUDO (SYMBOL, NAME, N, U1, U2)
# define T_PSEUDO_NOERRNO(SYMBOL, NAME, N, U1, U2) \
  PSEUDO_NOERRNO (SYMBOL, NAME, N, U1, U2)
# define T_PSEUDO_ERRVAL(SYMBOL, NAME, N, U1, U2) \
  PSEUDO_ERRVAL (SYMBOL, NAME, N, U1, U2)
#else
# define T_PSEUDO(SYMBOL, NAME, N) \
  PSEUDO (SYMBOL, NAME, N)
# define T_PSEUDO_NOERRNO(SYMBOL, NAME, N) \
  PSEUDO_NOERRNO (SYMBOL, NAME, N)
# define T_PSEUDO_ERRVAL(SYMBOL, NAME, N) \
  PSEUDO_ERRVAL (SYMBOL, NAME, N)
#endif
#define T_PSEUDO_END(SYMBOL)            PSEUDO_END (SYMBOL)
#define T_PSEUDO_END_NOERRNO(SYMBOL)        PSEUDO_END_NOERRNO (SYMBOL)
#define T_PSEUDO_END_ERRVAL(SYMBOL)     PSEUDO_END_ERRVAL (SYMBOL)

#if SYSCALL_NOERRNO

/* This kind of system call stub never returns an error.
   We return the return value register to the caller unexamined.  */

# if SYSCALL_ULONG_ARG_1
T_PSEUDO_NOERRNO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS,
          SYSCALL_ULONG_ARG_1, SYSCALL_ULONG_ARG_2)
# else
T_PSEUDO_NOERRNO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
# endif
    ret_NOERRNO
T_PSEUDO_END_NOERRNO (SYSCALL_SYMBOL)

#elif SYSCALL_ERRVAL

/* This kind of system call stub returns the errno code as its return
   value, or zero for success.  We may massage the kernel's return value
   to meet that ABI, but we never set errno here.  */

# if SYSCALL_ULONG_ARG_1
T_PSEUDO_ERRVAL (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS,
         SYSCALL_ULONG_ARG_1, SYSCALL_ULONG_ARG_2)
# else
T_PSEUDO_ERRVAL (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
# endif
    ret_ERRVAL
T_PSEUDO_END_ERRVAL (SYSCALL_SYMBOL)

#else

/* This is a "normal" system call stub: if there is an error,
   it returns -1 and sets errno.  */

# if SYSCALL_ULONG_ARG_1
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS,
      SYSCALL_ULONG_ARG_1, SYSCALL_ULONG_ARG_2)
# else
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
# endif
    ret
T_PSEUDO_END (SYSCALL_SYMBOL)

#endif

syscall_hidden_def (SYSCALL_SYMBOL)

On x86_64, the sysdep.h included here is sysdeps/unix/sysv/linux/x86_64/sysdep.h.
The include chain of sysdep.h is shown below:

sysdeps/unix/sysv/linux/x86_64/sysdep.h
	\-> sysdeps/unix/sysv/linux/sysdep.h
	\-> sysdeps/unix/x86_64/sysdep.h
			\->sysdeps/unix/sysdep.h
					\-> sysdeps/generic/sysdep.h
					\-> sys/syscall.h
			\-> sysdeps/x86_64/sysdep.h
					\-> sysdeps/x86/sysdep.h
							\-> sysdeps/generic/sysdep.h

For the chdir system call, both SYSCALL_NOERRNO and SYSCALL_ERRVAL are defined as 0, so the following branch is used:

# if SYSCALL_ULONG_ARG_1
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS,
      SYSCALL_ULONG_ARG_1, SYSCALL_ULONG_ARG_2)
# else
T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
# endif
    ret
T_PSEUDO_END (SYSCALL_SYMBOL)

Because SYSCALL_ULONG_ARG_1 is defined as 0, the result is T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS), and T_PSEUDO in turn expands to PSEUDO:

# define T_PSEUDO(SYMBOL, NAME, N) \
  PSEUDO (SYMBOL, NAME, N)

3.3.1. PSEUDO

The PSEUDO macro is defined in sysdeps/unix/sysv/linux/x86_64/sysdep.h as follows:

#ifdef __ASSEMBLER__
...
# undef	PSEUDO
...
#  define PSEUDO(name, syscall_name, args) \
  .text;                                  \
  ENTRY (name)                                \
    DO_CALL (syscall_name, args, 0, 0);                   \
    cmpq $-4095, %rax;                            \
    jae SYSCALL_ERROR_LABEL
...

The PSEUDO macro is only available when __ASSEMBLER__ is defined.

3.3.1.1. ENTRY

The ENTRY macro is defined in sysdeps/x86/sysdep.h as follows:

#define ALIGNARG(log2) 1<<log2
...
/* Define an entry point visible from C.  */
#define ENTRY(name)                               \
  .globl C_SYMBOL_NAME(name);                             \
  .type C_SYMBOL_NAME(name),@function;                        \
  .align ALIGNARG(4);                                 \
  C_LABEL(name)                                   \
  cfi_startproc;                                  \
  _CET_ENDBR;                                     \
  CALL_MCOUNT

The other macros it uses are scattered across other header files; they are collected here:

// include/libc-symbols.h
#ifndef C_SYMBOL_NAME
# define C_SYMBOL_NAME(name) name
#endif

// sysdeps/generic/sysdep.h
#ifndef C_LABEL
/* Define a macro we can use to construct the asm name for a C symbol.  */
# define C_LABEL(name)  name##:
#endif

// sysdeps/generic/sysdep.h
# define cfi_startproc          .cfi_startproc

// sysdeps/x86_64/sysdep.h
#define CALL_MCOUNT     /* Do nothing.  */

ENTRY (name) defines the function's label and declares it as a global symbol.

3.3.1.2. DO_CALL

The DO_CALL macro is defined in sysdeps/unix/sysv/linux/x86_64/sysdep.h as follows:

/* For Linux we can use the system call table in the header file
	/usr/include/asm/unistd.h
   of the kernel.  But these symbols do not follow the SYS_* syntax
   so we have to redefine the `SYS_ify' macro here.  */
#undef SYS_ify
#define SYS_ify(syscall_name)	__NR_##syscall_name
...
/* The Linux/x86-64 kernel expects the system call parameters in
   registers according to the following table:

    syscall number	rax
    arg 1		rdi
    arg 2		rsi
    arg 3		rdx
    arg 4		r10
    arg 5		r8
    arg 6		r9

    The Linux kernel uses and destroys internally these registers:
    return address from
    syscall		rcx
    eflags from syscall	r11

    Normal function call, including calls to the system call stub
    functions in the libc, get the first six parameters passed in
    registers and the seventh parameter and later on the stack.  The
    register use is as follows:

     system call number	in the DO_CALL macro
     arg 1		rdi
     arg 2		rsi
     arg 3		rdx
     arg 4		rcx
     arg 5		r8
     arg 6		r9

    We have to take care that the stack is aligned to 16 bytes.  When
    called the stack is not aligned since the return address has just
    been pushed.


    Syscalls of more than 6 arguments are not supported.  */

# undef	DO_CALL
# define DO_CALL(syscall_name, args, ulong_arg_1, ulong_arg_2) \
    DOARGS_##args				\
    ZERO_EXTEND_##ulong_arg_1			\
    ZERO_EXTEND_##ulong_arg_2			\
    movl $SYS_ify (syscall_name), %eax;		\
    syscall;

# define DOARGS_0 /* nothing */
# define DOARGS_1 /* nothing */
# define DOARGS_2 /* nothing */
# define DOARGS_3 /* nothing */
# define DOARGS_4 movq %rcx, %r10;
# define DOARGS_5 DOARGS_4
# define DOARGS_6 DOARGS_5

# define ZERO_EXTEND_0 /* nothing */
# define ZERO_EXTEND_1 /* nothing */
# define ZERO_EXTEND_2 /* nothing */
# define ZERO_EXTEND_3 /* nothing */
# define ZERO_EXTEND_4 /* nothing */
# define ZERO_EXTEND_5 /* nothing */
# define ZERO_EXTEND_6 /* nothing */

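The Linux/x86-64 kernel expects the fourth system call argument in register r10, whereas a normal function call passes the fourth argument in rcx (which the kernel destroys, as the comment above notes). A move is therefore needed: movq %rcx, %r10.

movl $SYS_ify (syscall_name), %eax places the system call number (__NR_syscall_name) in register eax.

Finally, the syscall instruction performs the system call.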
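
The register convention described above can be tried from C with GCC/Clang inline assembly. The following is a sketch under stated assumptions, not glibc code: raw_syscall4 is a hypothetical helper, the inline assembly path applies only to x86_64 Linux, and other platforms fall back to libc's syscall(), whose error convention differs (it returns -1 and sets errno).

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_* system call numbers */
#include <unistd.h>        /* syscall() for the fallback path */

/* Hypothetical helper: issue a system call with up to four arguments.  */
static long raw_syscall4(long nr, long a1, long a2, long a3, long a4)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* The fourth argument goes in r10 rather than rcx, because the
       syscall instruction overwrites rcx and r11.  */
    register long r10 __asm__("r10") = a4;

    __asm__ volatile ("syscall"
                      : "=a" (ret)                       /* result in rax */
                      : "a" (nr), "D" (a1), "S" (a2),    /* rax, rdi, rsi */
                        "d" (a3), "r" (r10)              /* rdx, r10 */
                      : "rcx", "r11", "memory");
    return ret;   /* raw kernel value: a negated errno on failure */
#else
    /* Fallback for other platforms: libc's generic entry point.  */
    return syscall(nr, a1, a2, a3, a4);
#endif
}
```

For example, raw_syscall4(SYS_getpid, 0, 0, 0, 0) should return the same value as getpid(); unused argument slots are simply ignored by the kernel.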

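3.3.1.3. SYSCALL_ERROR_LABEL

cmpq $-4095, %rax;                            \
jae SYSCALL_ERROR_LABEL

After the system call returns, its result is in the rax register. The code compares rax with -4095 as an unsigned value (jae is the unsigned "above or equal" branch): if rax lies in the range [-4095, -1], the system call failed and control jumps to SYSCALL_ERROR_LABEL. (Why -4095? By Linux convention a failed system call returns the negated error number, and error numbers never exceed 4095.)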

//  sysdeps/unix/sysv/linux/x86_64/sysdep.h
# undef SYSCALL_ERROR_LABEL
# ifdef PIC
#  undef SYSCALL_ERROR_LABEL
#  define SYSCALL_ERROR_LABEL 0f
# else
#  undef SYSCALL_ERROR_LABEL
#  define SYSCALL_ERROR_LABEL syscall_error
# endif

// sysdeps/x86/sysdep.h
#define syscall_error   __syscall_error

// sysdeps/unix/x86_64/sysdep.S
__syscall_error:
#if defined (EWOULDBLOCK_sys) && EWOULDBLOCK_sys != EAGAIN
    /* We translate the system's EWOULDBLOCK error into EAGAIN.
       The GNU C library always defines EWOULDBLOCK==EAGAIN.
       EWOULDBLOCK_sys is the original number.  */
    cmp $EWOULDBLOCK_sys, %RAX_LP /* Is it the old EWOULDBLOCK?  */
    jne notb        /* Branch if not.  */
    movl $EAGAIN, %eax  /* Yes; translate it to EAGAIN.  */
notb:
#endif
#ifdef PIC
    movq C_SYMBOL_NAME(errno@GOTTPOFF)(%rip), %rcx
    movl %eax, %fs:0(%rcx)
#else
    movl %eax, %fs:C_SYMBOL_NAME(errno@TPOFF)
#endif
    or $-1, %RAX_LP
    ret
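
The error check and the errno update can be expressed as a short C sketch. It assumes the raw kernel convention described above (a value in [-4095, -1] is a negated error number); finish_syscall is a hypothetical helper and not a glibc function.

```c
#include <errno.h>

/* Hypothetical helper: convert a raw kernel return value into the libc
   convention, i.e. -1 with errno set on failure.  */
static long finish_syscall(long raw)
{
    /* Unsigned comparison, like "cmpq $-4095, %rax; jae ...": true
       exactly when raw is in the range [-4095, -1].  */
    if ((unsigned long) raw >= (unsigned long) -4095L) {
        errno = (int) -raw;   /* store the positive error number */
        return -1;            /* like "or $-1, %RAX_LP" */
    }
    return raw;               /* success: pass the value through */
}
```

A single unsigned comparison covers the whole error range, which is why the assembly needs only one cmp and one conditional jump.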

3.3.2. PSEUDO_END

T_PSEUDO_END expands to PSEUDO_END, which is defined in sysdeps/unix/sysv/linux/x86_64/sysdep.h as shown below:

# undef	PSEUDO_END
# define PSEUDO_END(name)						      \
  SYSCALL_ERROR_HANDLER							      \
  END (name)
...
...
# ifndef PIC
#  define SYSCALL_ERROR_HANDLER	/* Nothing here; code in sysdep.S is used.  */
# else
#  define SYSCALL_ERROR_HANDLER			\
0:						\
  SYSCALL_SET_ERRNO;				\
  or $-1, %RAX_LP;				\
  ret;
# endif	/* PIC */

3.3.2.1. END

The END macro is defined in sysdeps/x86/sysdep.h as follows:

#define ASM_SIZE_DIRECTIVE(name) .size name,.-name;
...
#undef	END
#define END(name)							      \
  cfi_endproc;								      \
  ASM_SIZE_DIRECTIVE(name)

The other macros it uses are scattered across other header files; they are collected here:

// sysdeps/generic/sysdep.h
# define cfi_endproc			.cfi_endproc

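PSEUDO_END closes the assembly function.

That completes the analysis of the chdir system call on x86_64. Readers can examine chdir on other platforms, and other system calls, in the same way.

4. Macro system calls in detail

From the section "Wrappers -> Macro system calls" we know that macro system calls are handled by *.c files. This section walks through the clock_gettime system call on x86_64.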

4.1. clock_gettime

clock_gettime is declared in time/time.h as follows:

#ifdef __USE_POSIX199309
# ifndef __USE_TIME_BITS64
/* Pause execution for a number of nanoseconds.

   This function is a cancellation point and therefore not marked with
   __THROW.  */
extern int nanosleep (const struct timespec *__requested_time,
		      struct timespec *__remaining);

/* Get resolution of clock CLOCK_ID.  */
extern int clock_getres (clockid_t __clock_id, struct timespec *__res) __THROW;

/* Get current value of clock CLOCK_ID and store it in TP.  */
extern int clock_gettime (clockid_t __clock_id, struct timespec *__tp) __THROW;

/* Set clock CLOCK_ID to value TP.  */
extern int clock_settime (clockid_t __clock_id, const struct timespec *__tp)
     __THROW;
# else
#  ifdef __REDIRECT
extern int __REDIRECT (nanosleep, (const struct timespec *__requested_time,
                                   struct timespec *__remaining),
                       __nanosleep64);
extern int __REDIRECT_NTH (clock_getres, (clockid_t __clock_id,
                                          struct timespec *__res),
                           __clock_getres64);
extern int __REDIRECT_NTH (clock_gettime, (clockid_t __clock_id, struct
                                           timespec *__tp), __clock_gettime64);
extern int __REDIRECT_NTH (clock_settime, (clockid_t __clock_id, const struct
                                           timespec *__tp), __clock_settime64);
#  else
#   define nanosleep __nanosleep64
#   define clock_getres __clock_getres64
#   define clock_gettime __clock_gettime64
#   define clock_settime __clock_settime64
#  endif
# endif

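1) If the macro __USE_TIME_BITS64 is defined, the 64-bit interface is used. The line "# define clock_gettime __clock_gettime64" is self-explanatory. The __REDIRECT macro used for the redirection is defined in misc/sys/cdefs.h as follows: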

# define __REDIRECT(name, proto, alias) name proto __asm__ (__ASMNAME (#alias))

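It declares name with the given prototype but binds it to the assembler-level symbol alias, so the effect is similar to the #define.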
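
The redirection can be tried in isolation. The following is a minimal sketch, assuming GCC or Clang on an ELF target where C symbols carry no prefix; the names demo_now and demo_now64 are hypothetical and only stand in for clock_gettime and __clock_gettime64.

```c
/* The real implementation; demo_now64 is a hypothetical name.  */
int
demo_now64 (void)
{
  return 64;
}

/* Same idea as __REDIRECT (demo_now, (void), demo_now64): the C name is
   demo_now, but every reference is emitted against the symbol
   demo_now64.  No function named demo_now is ever defined.  */
extern int demo_now (void) __asm__ ("demo_now64");
```

A call to demo_now () therefore links against demo_now64, which is how code compiled with __USE_TIME_BITS64 reaches __clock_gettime64 without any source change.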

2) If __USE_TIME_BITS64 is not defined, where is clock_gettime implemented?
a) In time/clock_gettime.c:

// time/clock_gettime.c
#include <errno.h>
#include <time.h>
#include <shlib-compat.h>

/* Get current value of CLOCK and store it in TP.  */
int
__clock_gettime (clockid_t clock_id, struct timespec *tp)
{
  __set_errno (ENOSYS);
  return -1;
}
libc_hidden_def (__clock_gettime)

versioned_symbol (libc, __clock_gettime, clock_gettime, GLIBC_2_17);
/* clock_gettime moved to libc in version 2.17;
   old binaries may expect the symbol version it had in librt.  */
#if SHLIB_COMPAT (libc, GLIBC_2_2, GLIBC_2_17)
compat_symbol (libc, __clock_gettime, clock_gettime, GLIBC_2_2);
#endif

stub_warning (clock_gettime)

b) In sysdeps/unix/sysv/linux/clock_gettime.c:

// sysdeps/unix/sysv/linux/clock_gettime.c
int
__clock_gettime64 (clockid_t clock_id, struct __timespec64 *tp)
{
  int r;

#ifndef __NR_clock_gettime64
# define __NR_clock_gettime64 __NR_clock_gettime
#endif

#ifdef HAVE_CLOCK_GETTIME64_VSYSCALL
  int (*vdso_time64) (clockid_t clock_id, struct __timespec64 *tp)
    = GLRO(dl_vdso_clock_gettime64);
  if (vdso_time64 != NULL)
    {
      r = INTERNAL_VSYSCALL_CALL (vdso_time64, 2, clock_id, tp);
      if (r == 0)
	return 0;
      return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
    }
#endif

#ifdef HAVE_CLOCK_GETTIME_VSYSCALL
  int (*vdso_time) (clockid_t clock_id, struct timespec *tp)
    = GLRO(dl_vdso_clock_gettime);
  if (vdso_time != NULL)
    {
      struct timespec tp32;
      r = INTERNAL_VSYSCALL_CALL (vdso_time, 2, clock_id, &tp32);
      if (r == 0 && tp32.tv_sec > 0)
	{
	  *tp = valid_timespec_to_timespec64 (tp32);
	  return 0;
	}
      else if (r != 0)
	return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);

      /* Fallback to syscall if the 32-bit time_t vDSO returns overflows.  */
    }
#endif

  r = INTERNAL_SYSCALL_CALL (clock_gettime64, clock_id, tp);
  if (r == 0)
    return 0;
  if (r != -ENOSYS)
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);

#ifndef __ASSUME_TIME64_SYSCALLS
  /* Fallback code that uses 32-bit support.  */
  struct timespec tp32;
  r = INTERNAL_SYSCALL_CALL (clock_gettime, clock_id, &tp32);
  if (r == 0)
    {
      *tp = valid_timespec_to_timespec64 (tp32);
      return 0;
    }
#endif

  return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
}

#if __TIMESIZE != 64
libc_hidden_def (__clock_gettime64)

int
__clock_gettime (clockid_t clock_id, struct timespec *tp)
{
  int ret;
  struct __timespec64 tp64;

  ret = __clock_gettime64 (clock_id, &tp64);

  if (ret == 0)
    {
      if (! in_time_t_range (tp64.tv_sec))
        {
          __set_errno (EOVERFLOW);
          return -1;
        }

      *tp = valid_timespec64_to_timespec (tp64);
    }

  return ret;
}
#endif
libc_hidden_def (__clock_gettime)

versioned_symbol (libc, __clock_gettime, clock_gettime, GLIBC_2_17);
/* clock_gettime moved to libc in version 2.17;
   old binaries may expect the symbol version it had in librt.  */
#if SHLIB_COMPAT (libc, GLIBC_2_2, GLIBC_2_17)
strong_alias (__clock_gettime, __clock_gettime_2);
compat_symbol (libc, __clock_gettime_2, clock_gettime, GLIBC_2_2);
#endif

As shown above, two clock_gettime.c files (time/clock_gettime.c and sysdeps/unix/sysv/linux/clock_gettime.c) define clock_gettime, and both do so as an alias of __clock_gettime. In brief:

versioned_symbol is defined as follows:

// include/shlib-compat.h
#ifdef SHARED
...
# define versioned_symbol(lib, local, symbol, version) \
  versioned_symbol_1 (lib, local, symbol, version)
# define versioned_symbol_1(lib, local, symbol, version) \
  versioned_symbol_2 (local, symbol, VERSION_##lib##_##version)
# define versioned_symbol_2(local, symbol, name) \
  default_symbol_version (local, symbol, name)
...
#else
...
# define versioned_symbol(lib, local, symbol, version) \
  weak_alias (local, symbol)
...

// include/libc-symbols.h
#ifdef SHARED
...
# define default_symbol_version(real, name, version) \
     _default_symbol_version(real, name, version)
/* See <libc-symver.h>.  */
# ifdef __ASSEMBLER__
#  define _default_symbol_version(real, name, version) \
  _set_symbol_version (real, name@@version)
# else
#  define _default_symbol_version(real, name, version) \
  _set_symbol_version (real, #name "@@" #version)
# endif
...
#else /* !SHARED */
...
# define default_symbol_version(real, name, version) \
  strong_alias(real, name)
#endif
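
In the non-shared branch above, versioned_symbol reduces to weak_alias and default_symbol_version to strong_alias. A minimal sketch of what such an alias amounts to with GCC on ELF is shown below; demo_impl and demo_public are hypothetical names standing in for __clock_gettime and clock_gettime, and the attribute spelling is an assumption about how the alias macros are realized rather than a copy of the glibc definitions.

```c
/* Internal name that carries the implementation (hypothetical).  */
int
demo_impl (int x)
{
  return x + 1;
}

/* Roughly what weak_alias (demo_impl, demo_public) yields: a second,
   weak symbol that is bound to the same code as demo_impl.  */
extern __typeof__ (demo_impl) demo_public
  __attribute__ ((weak, alias ("demo_impl")));
```

This matches the readelf output later in this chapter, where clock_gettime appears as a WEAK symbol with the same value and size as __clock_gettime.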

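versioned_symbol and compat_symbol both make clock_gettime an alias of __clock_gettime, so calling clock_gettime is equivalent to calling __clock_gettime.
1) In time/clock_gettime.c, __clock_gettime is a stub with no real implementation; it only sets errno to ENOSYS and returns -1.
2) In sysdeps/unix/sysv/linux/clock_gettime.c, __clock_gettime is defined only when __TIMESIZE != 64, and it simply calls __clock_gettime64. So where is __clock_gettime defined when __TIMESIZE == 64? The following definitions give the answer: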

#if __TIMESIZE == 64
# define __clock_nanosleep_time64 __clock_nanosleep
# define __clock_gettime64 __clock_gettime
# define __timespec_get64 __timespec_get
# define __timespec_getres64 __timespec_getres
#else
extern int __clock_nanosleep_time64 (clockid_t clock_id,
                                     int flags, const struct __timespec64 *req,
                                     struct __timespec64 *rem);
libc_hidden_proto (__clock_nanosleep_time64)
extern int __clock_gettime64 (clockid_t clock_id, struct __timespec64 *tp);
libc_hidden_proto (__clock_gettime64)
extern int __timespec_get64 (struct __timespec64 *ts, int base);
libc_hidden_proto (__timespec_get64)
extern int __timespec_getres64 (struct __timespec64 *ts, int base);
libc_hidden_proto (__timespec_getres64)
#endif

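When __TIMESIZE == 64, the macro __clock_gettime64 expands to __clock_gettime. The function written as __clock_gettime64 in sysdeps/unix/sysv/linux/clock_gettime.c is therefore compiled under the name __clock_gettime, which provides the implementation of __clock_gettime: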

// sysdeps/unix/sysv/linux/clock_gettime.c
int
__clock_gettime64 (clockid_t clock_id, struct __timespec64 *tp)
{
  int r;

#ifndef __NR_clock_gettime64
# define __NR_clock_gettime64 __NR_clock_gettime
#endif
...
}

#if __TIMESIZE != 64
...
int
__clock_gettime (clockid_t clock_id, struct timespec *tp)
{
  int ret;
  struct __timespec64 tp64;

  ret = __clock_gettime64 (clock_id, &tp64);
...
}
#endif
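
The renaming trick can be reproduced in a few lines. This is only a sketch: DEMO_TIMESIZE, demo_gettime64 and demo_gettime are hypothetical stand-ins for __TIMESIZE, __clock_gettime64 and __clock_gettime.

```c
/* Hypothetical stand-in for __TIMESIZE.  */
#define DEMO_TIMESIZE 64

#if DEMO_TIMESIZE == 64
/* With a 64-bit time type the two names denote the same function.  */
# define demo_gettime64 demo_gettime
#endif

/* Written as demo_gettime64 in the source; after preprocessing with
   DEMO_TIMESIZE == 64 the function is emitted as demo_gettime.  */
int
demo_gettime64 (long *out)
{
  *out = 42;
  return 0;
}

#if DEMO_TIMESIZE != 64
/* Only needed when the names differ: a thin wrapper around the
   64-bit implementation.  */
int
demo_gettime (long *out)
{
  return demo_gettime64 (out);
}
#endif
```

With DEMO_TIMESIZE set to 64 only one function is compiled, and it is reachable as demo_gettime, which mirrors the single __clock_gettime symbol seen in the object file.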

Which of the two clock_gettime.c files is actually used? That is hard to tell from the source alone, but the build output can confirm it:

maminjie@fedora ~/w/g/tmp-build> find -name clock_gettime.o
./time/clock_gettime.o
maminjie@fedora ~/w/g/tmp-build>

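The path of the .o file suggests that time/clock_gettime.c was used, but that needs confirmation. There are several ways to check:
1) Check the build log
If the build output was saved to a file, the compile command for clock_gettime.o can be looked up there.
In this build, the log shows that sysdeps/unix/sysv/linux/clock_gettime.c was compiled.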

2) Inspect the object file with readelf
2.1) readelf -s xxx shows the symbol table

maminjie@fedora ~/w/g/tmp-build> readelf -s ./time/clock_gettime.o

Symbol table '.symtab' contains 21 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS clock_gettime.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    8
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    9
     9: 0000000000000000     0 SECTION LOCAL  DEFAULT   11
    10: 0000000000000000     0 SECTION LOCAL  DEFAULT   12
    11: 0000000000000000     0 SECTION LOCAL  DEFAULT   14
    12: 0000000000000000     0 SECTION LOCAL  DEFAULT   15
    13: 0000000000000000     0 SECTION LOCAL  DEFAULT   17
    14: 0000000000000000     0 SECTION LOCAL  DEFAULT   18
    15: 0000000000000000     0 SECTION LOCAL  DEFAULT   16
    16: 0000000000000000   123 FUNC    GLOBAL HIDDEN     1 __clock_gettime
    17: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND _dl_vdso_clock_g[...]
    18: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND _GLOBAL_OFFSET_TABLE_
    19: 0000000000000000     0 TLS     GLOBAL DEFAULT  UND __libc_errno
    20: 0000000000000000   123 FUNC    WEAK   DEFAULT    1 clock_gettime
maminjie@fedora ~/w/g/tmp-build>

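The symbol table does not name the source directory, so it cannot settle the question directly. The undefined reference to _dl_vdso_clock_g[...] is a hint, though, because the stub in time/clock_gettime.c references no vDSO symbol.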

2.2) readelf --debug-dump=info xxx shows the debug information (this requires a build with debug info)

maminjie@fedora ~/w/g/tmp-build> readelf --debug-dump=info ./time/clock_gettime.o | grep -A10 DW_TAG_compile_unit
 <0><c>: Abbrev Number: 21 (DW_TAG_compile_unit)
    <d>   DW_AT_producer    : (indirect string, offset: 0x8): GNU C11 11.1.1 20210428 (Red Hat 11.1.1-1) -mtune=generic -march=x86-64 -g -O2 -std=gnu11 -fgnu89-inline -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -fmath-errno -ftls-model=initial-exec
    <11>   DW_AT_language    : 29       (C11)
    <12>   DW_AT_name        : (indirect line string, offset: 0x0): ../sysdeps/unix/sysv/linux/clock_gettime.c
    <16>   DW_AT_comp_dir    : (indirect line string, offset: 0x2b): /mnt/hgfs/projects/linux/glibc/time
    <1a>   DW_AT_low_pc      : 0x0
    <22>   DW_AT_high_pc     : 0x7b
    <2a>   DW_AT_stmt_list   : 0x0
 <1><2e>: Abbrev Number: 3 (DW_TAG_base_type)
    <2f>   DW_AT_byte_size   : 1
    <30>   DW_AT_encoding    : 8        (unsigned char)
maminjie@fedora ~/w/g/tmp-build>

The DW_AT_name entry in the .debug_info section also shows that sysdeps/unix/sysv/linux/clock_gettime.c was used.

3) objdump -S xxx shows the disassembly interleaved with the source (this also requires a build with debug info)

maminjie@fedora ~/w/g/tmp-build> objdump -S ./time/clock_gettime.o

./time/clock_gettime.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <__clock_gettime>:
#ifndef __NR_clock_gettime64
# define __NR_clock_gettime64 __NR_clock_gettime
#endif

#ifdef HAVE_CLOCK_GETTIME64_VSYSCALL
  int (*vdso_time64) (clockid_t clock_id, struct __timespec64 *tp)
   0:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 7 <__clock_gettime+0x7>
    = GLRO(dl_vdso_clock_gettime64);
  if (vdso_time64 != NULL)
   7:   48 85 c0                test   %rax,%rax
   a:   74 14                   je     20 <__clock_gettime+0x20>
{
   c:   48 83 ec 08             sub    $0x8,%rsp
    {
      r = INTERNAL_VSYSCALL_CALL (vdso_time64, 2, clock_id, tp);
  10:   ff d0                   callq  *%rax
      if (r == 0)
  12:   85 c0                   test   %eax,%eax
  14:   75 52                   jne    68 <__clock_gettime+0x68>
        return 0;
  16:   31 c0                   xor    %eax,%eax
      return 0;
    }
#endif

  return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
}
  18:   48 83 c4 08             add    $0x8,%rsp
  1c:   c3                      retq
  1d:   0f 1f 00                nopl   (%rax)
  r = INTERNAL_SYSCALL_CALL (clock_gettime64, clock_id, tp);
  20:   b8 e4 00 00 00          mov    $0xe4,%eax
  25:   0f 05                   syscall
  if (r == 0)
  27:   85 c0                   test   %eax,%eax
  29:   74 1d                   je     48 <__clock_gettime+0x48>
  if (r != -ENOSYS)
  2b:   83 f8 da                cmp    $0xffffffda,%eax
  2e:   74 20                   je     50 <__clock_gettime+0x50>
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
  30:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 37 <__clock_gettime+0x37>

  37:   f7 d8                   neg    %eax
  39:   64 89 02                mov    %eax,%fs:(%rdx)
  3c:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  41:   c3                      retq
  42:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
        return 0;
  48:   31 c0                   xor    %eax,%eax
}
  4a:   c3                      retq
  4b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
  50:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 57 <__clock_gettime+0x57>
  57:   64 c7 00 26 00 00 00    movl   $0x26,%fs:(%rax)
  5e:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  63:   c3                      retq
  64:   0f 1f 40 00             nopl   0x0(%rax)
      return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
  68:   48 8b 15 00 00 00 00    mov    0x0(%rip),%rdx        # 6f <__clock_gettime+0x6f>
  6f:   f7 d8                   neg    %eax
  71:   64 89 02                mov    %eax,%fs:(%rdx)
  74:   b8 ff ff ff ff          mov    $0xffffffff,%eax
  79:   eb 9d                   jmp    18 <__clock_gettime+0x18>

The disassembly likewise shows that sysdeps/unix/sysv/linux/clock_gettime.c was used, and that __clock_gettime64 in the source has indeed been replaced by __clock_gettime. From this we can also infer that __TIMESIZE == 64.

4.2. __clock_gettime64

As the analysis of clock_gettime above shows, every path ends up in __clock_gettime64 as defined in sysdeps/unix/sysv/linux/clock_gettime.c. Let us go straight to its definition:

// sysdeps/unix/sysv/linux/clock_gettime.c
int
__clock_gettime64 (clockid_t clock_id, struct __timespec64 *tp)
{
  int r;

#ifndef __NR_clock_gettime64
# define __NR_clock_gettime64 __NR_clock_gettime
#endif

#ifdef HAVE_CLOCK_GETTIME64_VSYSCALL
  int (*vdso_time64) (clockid_t clock_id, struct __timespec64 *tp)
    = GLRO(dl_vdso_clock_gettime64);
  if (vdso_time64 != NULL)
    {
      r = INTERNAL_VSYSCALL_CALL (vdso_time64, 2, clock_id, tp);
      if (r == 0)
	return 0;
      return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
    }
#endif

#ifdef HAVE_CLOCK_GETTIME_VSYSCALL
  int (*vdso_time) (clockid_t clock_id, struct timespec *tp)
    = GLRO(dl_vdso_clock_gettime);
  if (vdso_time != NULL)
    {
      struct timespec tp32;
      r = INTERNAL_VSYSCALL_CALL (vdso_time, 2, clock_id, &tp32);
      if (r == 0 && tp32.tv_sec > 0)
	{
	  *tp = valid_timespec_to_timespec64 (tp32);
	  return 0;
	}
      else if (r != 0)
	return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);

      /* Fallback to syscall if the 32-bit time_t vDSO returns overflows.  */
    }
#endif

  r = INTERNAL_SYSCALL_CALL (clock_gettime64, clock_id, tp);
  if (r == 0)
    return 0;
  if (r != -ENOSYS)
    return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);

#ifndef __ASSUME_TIME64_SYSCALLS
  /* Fallback code that uses 32-bit support.  */
  struct timespec tp32;
  r = INTERNAL_SYSCALL_CALL (clock_gettime, clock_id, &tp32);
  if (r == 0)
    {
      *tp = valid_timespec_to_timespec64 (tp32);
      return 0;
    }
#endif

  return INLINE_SYSCALL_ERROR_RETURN_VALUE (-r);
}

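The function distinguishes 32-bit and 64-bit vDSO calls (VSYSCALL) and 32-bit and 64-bit system calls (SYSCALL), four call paths in total. The rest of this section focuses on INTERNAL_SYSCALL_CALL (clock_gettime64, clock_id, tp); the other paths are left for the reader to study.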
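
The system call part of that logic (try the 64-bit call, fall back to the 32-bit call only on ENOSYS, and convert the result) can be sketched independently of glibc. Everything below is hypothetical: the demo_ types and functions are illustrative stand-ins, and the "kernel" is simulated by plain C stubs so that the control flow can be exercised.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical 32-bit and 64-bit time structures.  */
struct demo_ts32 { int32_t tv_sec; int32_t tv_nsec; };
struct demo_ts64 { int64_t tv_sec; int64_t tv_nsec; };

/* Raw calls return 0 on success or a negative errno value.  */
typedef int (*demo_get64_fn) (struct demo_ts64 *);
typedef int (*demo_get32_fn) (struct demo_ts32 *);

static int
demo_gettime_fallback (demo_get64_fn get64, demo_get32_fn get32,
                       struct demo_ts64 *tp)
{
  int r = get64 (tp);
  if (r == 0)
    return 0;
  if (r != -ENOSYS)
    {
      errno = -r;             /* a real error: report it */
      return -1;
    }

  /* The 64-bit call does not exist: use the 32-bit one and widen.  */
  struct demo_ts32 tp32;
  r = get32 (&tp32);
  if (r == 0)
    {
      tp->tv_sec = tp32.tv_sec;
      tp->tv_nsec = tp32.tv_nsec;
      return 0;
    }
  errno = -r;
  return -1;
}

/* Simulated kernels used to exercise the paths.  */
static int
demo_get64_missing (struct demo_ts64 *tp)
{
  (void) tp;
  return -ENOSYS;
}

static int
demo_get64_invalid (struct demo_ts64 *tp)
{
  (void) tp;
  return -EINVAL;
}

static int
demo_get32_ok (struct demo_ts32 *tp)
{
  tp->tv_sec = 100;
  tp->tv_nsec = 5;
  return 0;
}
```

Calling demo_gettime_fallback with demo_get64_missing reaches the 32-bit stub, whereas demo_get64_invalid fails immediately with EINVAL and never tries the fallback.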

4.2.1. INTERNAL_SYSCALL_CALL

The function-like macro INTERNAL_SYSCALL_CALL is defined in sysdeps/unix/sysdep.h as follows:

...
#define __SYSCALL_CONCAT_X(a,b)     a##b
#define __SYSCALL_CONCAT(a,b)       __SYSCALL_CONCAT_X (a, b)


#define __INTERNAL_SYSCALL0(name) \
  INTERNAL_SYSCALL (name, 0)
#define __INTERNAL_SYSCALL1(name, a1) \
  INTERNAL_SYSCALL (name, 1, a1)
#define __INTERNAL_SYSCALL2(name, a1, a2) \
  INTERNAL_SYSCALL (name, 2, a1, a2)
#define __INTERNAL_SYSCALL3(name, a1, a2, a3) \
  INTERNAL_SYSCALL (name, 3, a1, a2, a3)
#define __INTERNAL_SYSCALL4(name, a1, a2, a3, a4) \
  INTERNAL_SYSCALL (name, 4, a1, a2, a3, a4)
#define __INTERNAL_SYSCALL5(name, a1, a2, a3, a4, a5) \
  INTERNAL_SYSCALL (name, 5, a1, a2, a3, a4, a5)
#define __INTERNAL_SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
  INTERNAL_SYSCALL (name, 6, a1, a2, a3, a4, a5, a6)
#define __INTERNAL_SYSCALL7(name, a1, a2, a3, a4, a5, a6, a7) \
  INTERNAL_SYSCALL (name, 7, a1, a2, a3, a4, a5, a6, a7)

#define __INTERNAL_SYSCALL_NARGS_X(a,b,c,d,e,f,g,h,n,...) n
#define __INTERNAL_SYSCALL_NARGS(...) \
  __INTERNAL_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,)
#define __INTERNAL_SYSCALL_DISP(b,...) \
  __SYSCALL_CONCAT (b,__INTERNAL_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)

/* Issue a syscall defined by syscall number plus any other argument required.
   It is similar to INTERNAL_SYSCALL macro, but without the need to pass the
   expected argument number as second parameter.  */
#define INTERNAL_SYSCALL_CALL(...) \
  __INTERNAL_SYSCALL_DISP (__INTERNAL_SYSCALL, __VA_ARGS__)
...

The following steps expand INTERNAL_SYSCALL_CALL to its final form:

#define INTERNAL_SYSCALL_CALL(...) \
  __INTERNAL_SYSCALL_DISP (__INTERNAL_SYSCALL, __VA_ARGS__)

#define INTERNAL_SYSCALL_CALL(...) \
  __SYSCALL_CONCAT (__INTERNAL_SYSCALL,__INTERNAL_SYSCALL_NARGS(__VA_ARGS__))(__VA_ARGS__)

#define INTERNAL_SYSCALL_CALL(...) \
  __SYSCALL_CONCAT (__INTERNAL_SYSCALL,__INTERNAL_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,))(__VA_ARGS__)

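//
// The first argument of INTERNAL_SYSCALL_CALL is the system call name and the
// remaining arguments are the system call's own arguments, so __VA_ARGS__
// always contains at least one argument.
//
// __INTERNAL_SYSCALL_NARGS_X (__VA_ARGS__,7,6,5,4,3,2,1,0,) evaluates to a value that depends on the number of arguments in __VA_ARGS__:
// name + 0 arguments: the value is 0
// name + 1 argument:  the value is 1
// name + 2 arguments: the value is 2
// ...
// In other words, the value is the number of arguments the system call takes.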

#define INTERNAL_SYSCALL_CALL(...) \
  __SYSCALL_CONCAT (__INTERNAL_SYSCALL, n)(__VA_ARGS__)

#define INTERNAL_SYSCALL_CALL(...) \
  __INTERNAL_SYSCALLn(__VA_ARGS__)		// n is the number of system call arguments

#define INTERNAL_SYSCALL_CALL(...) \
  INTERNAL_SYSCALL(syscall name, number of syscall arguments, syscall arguments)
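
The argument-counting dispatch can be verified with a small standalone program. This is a sketch that reuses the same preprocessor pattern under hypothetical DEMO_ names; the demo_callN functions merely encode which variant was selected and make no system call.

```c
/* Same pattern as __SYSCALL_CONCAT and __INTERNAL_SYSCALL_NARGS, under
   hypothetical names.  */
#define DEMO_CONCAT_X(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT_X (a, b)

#define DEMO_NARGS_X(a, b, c, d, e, f, g, h, n, ...) n
#define DEMO_NARGS(...) \
  DEMO_NARGS_X (__VA_ARGS__, 7, 6, 5, 4, 3, 2, 1, 0, )

/* Selects demo_call0, demo_call1 or demo_call2 from the argument count
   (the first argument, the call number, is not counted).  */
#define DEMO_CALL(...) \
  DEMO_CONCAT (demo_call, DEMO_NARGS (__VA_ARGS__)) (__VA_ARGS__)

/* Each variant returns nr * 10 + its own arity so the choice is visible.  */
static long
demo_call0 (long nr)
{
  return nr * 10 + 0;
}

static long
demo_call1 (long nr, long a1)
{
  (void) a1;
  return nr * 10 + 1;
}

static long
demo_call2 (long nr, long a1, long a2)
{
  (void) a1;
  (void) a2;
  return nr * 10 + 2;
}
```

DEMO_CALL (5) expands to demo_call0 (5), while DEMO_CALL (5, 1, 2) expands to demo_call2 (5, 1, 2), which is the same selection that INTERNAL_SYSCALL_CALL performs among the __INTERNAL_SYSCALLn macros.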

4.2.2. INTERNAL_SYSCALL

The function-like macro INTERNAL_SYSCALL is defined in sysdeps/unix/sysv/linux/x86_64/sysdep.h as follows:

#undef SYS_ify
#define SYS_ify(syscall_name)	__NR_##syscall_name

#ifdef __ASSEMBLER__
...

#else	/* !__ASSEMBLER__ */

/* Registers clobbered by syscall.  */
# define REGISTERS_CLOBBERED_BY_SYSCALL "cc", "r11", "cx"

/* NB: This also works when X is an array.  For an array X,  type of
   (X) - (X) is ptrdiff_t, which is signed, since size of ptrdiff_t
   == size of pointer, cast is a NOP.   */
#define TYPEFY1(X) __typeof__ ((X) - (X))
/* Explicit cast the argument.  */
#define ARGIFY(X) ((TYPEFY1 (X)) (X))
/* Create a variable 'name' based on type of variable 'X' to avoid
   explicit types.  */
#define TYPEFY(X, name) __typeof__ (ARGIFY (X)) name

#undef INTERNAL_SYSCALL
#define INTERNAL_SYSCALL(name, nr, args...)				\
	internal_syscall##nr (SYS_ify (name), args)
...

#undef internal_syscall0
#define internal_syscall0(number, dummy...)				\
({									\
    unsigned long int resultvar;					\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number)							\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

#undef internal_syscall1
#define internal_syscall1(number, arg1)					\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1)						\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

#undef internal_syscall2
#define internal_syscall2(number, arg1, arg2)				\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2)				\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

#undef internal_syscall3
#define internal_syscall3(number, arg1, arg2, arg3)			\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3)			\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

#undef internal_syscall4
#define internal_syscall4(number, arg1, arg2, arg3, arg4)		\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg4, __arg4) = ARGIFY (arg4);			 	\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg4, _a4) asm ("r10") = __arg4;			\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4)		\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

#undef internal_syscall5
#define internal_syscall5(number, arg1, arg2, arg3, arg4, arg5)	\
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg5, __arg5) = ARGIFY (arg5);			 	\
    TYPEFY (arg4, __arg4) = ARGIFY (arg4);			 	\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg5, _a5) asm ("r8") = __arg5;			\
    register TYPEFY (arg4, _a4) asm ("r10") = __arg4;			\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4),		\
      "r" (_a5)								\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})

#undef internal_syscall6
#define internal_syscall6(number, arg1, arg2, arg3, arg4, arg5, arg6) \
({									\
    unsigned long int resultvar;					\
    TYPEFY (arg6, __arg6) = ARGIFY (arg6);			 	\
    TYPEFY (arg5, __arg5) = ARGIFY (arg5);			 	\
    TYPEFY (arg4, __arg4) = ARGIFY (arg4);			 	\
    TYPEFY (arg3, __arg3) = ARGIFY (arg3);			 	\
    TYPEFY (arg2, __arg2) = ARGIFY (arg2);			 	\
    TYPEFY (arg1, __arg1) = ARGIFY (arg1);			 	\
    register TYPEFY (arg6, _a6) asm ("r9") = __arg6;			\
    register TYPEFY (arg5, _a5) asm ("r8") = __arg5;			\
    register TYPEFY (arg4, _a4) asm ("r10") = __arg4;			\
    register TYPEFY (arg3, _a3) asm ("rdx") = __arg3;			\
    register TYPEFY (arg2, _a2) asm ("rsi") = __arg2;			\
    register TYPEFY (arg1, _a1) asm ("rdi") = __arg1;			\
    asm volatile (							\
    "syscall\n\t"							\
    : "=a" (resultvar)							\
    : "0" (number), "r" (_a1), "r" (_a2), "r" (_a3), "r" (_a4),		\
      "r" (_a5), "r" (_a6)						\
    : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);			\
    (long int) resultvar;						\
})
...
#endif	/* __ASSEMBLER__ */

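INTERNAL_SYSCALL (name, nr, args...) ultimately expands to internal_syscall{0,1,2,3,4,5,6} (__NR_name, args), where __NR_name is the system call number. Each internal_syscallN macro places the arguments in the registers rdi, rsi, rdx, r10, r8 and r9 (as many as are needed) and executes the syscall instruction.

__ASSEMBLER__ should look familiar. The previous chapter, "Assembly system calls in detail", explained that the assembly definitions apply when this macro is defined. The compiler driver predefines it only while preprocessing assembly sources, so C files take the #else branch shown here and make their system calls through these inline-asm macros.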

5. Which of several same-named C files is used

The previous chapter showed that there are several files named clock_gettime.c:

maminjie@fedora /m/h/p/l/glibc (master)> find -name clock_gettime.c
./sysdeps/mach/clock_gettime.c
./sysdeps/unix/sysv/linux/clock_gettime.c
./time/clock_gettime.c

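In the end we confirmed, through the build log or the debug information in the binary, that sysdeps/unix/sysv/linux/clock_gettime.c is the one used. Do we have to check the build log for every other set of same-named C files, or is there a pattern?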

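Conjecture
glibc contains many same-named C files (implementation files) for system calls. The author found no direct evidence for which one is chosen and offers a guess at the order of precedence: architecture-specific files first, then the Linux ones, then those under generic, and finally glibc's own fallback implementation (often a stub). This appears consistent with glibc's sysdeps mechanism, in which configure produces an ordered list of sysdeps directories (recorded as config-sysdirs in the build directory's config.make) and the first directory that provides the file wins.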

The following examples (on x86_64), based on the build log and the source tree, support this conjecture:
1) time.c

maminjie@fedora /m/h/p/l/glibc (master)> find -name time.c
./sysdeps/unix/sysv/linux/powerpc/time.c
./sysdeps/unix/sysv/linux/time.c
./sysdeps/unix/sysv/linux/x86/time.c
./time/time.c

For time.c, the file used is sysdeps/unix/sysv/linux/x86/time.c.
2) times.c

maminjie@fedora /m/h/p/l/glibc (master)> find -name times.c
./posix/times.c
./sysdeps/mach/hurd/times.c
./sysdeps/unix/sysv/linux/times.c
./sysdeps/unix/sysv/linux/x86_64/x32/times.c

For times.c, the file used is sysdeps/unix/sysv/linux/times.c.
3) clock.c

maminjie@fedora /m/h/p/l/glibc (master)> find -name clock.c
./sysdeps/mach/hurd/clock.c
./sysdeps/posix/clock.c
./sysdeps/unix/sysv/linux/clock.c
./time/clock.c

For clock.c, the file used is sysdeps/unix/sysv/linux/clock.c.
4) unwind-resume.c

maminjie@fedora /m/h/p/l/glibc (master)> find -name unwind-resume.c
./sysdeps/arm/unwind-resume.c
./sysdeps/generic/unwind-resume.c
./sysdeps/ia64/unwind-resume.c

For unwind-resume.c, the file used is sysdeps/generic/unwind-resume.c.
5) wait.c

maminjie@fedora /m/h/p/l/glibc (master)> find -name wait.c
./posix/wait.c

Only posix/wait.c exists, so it is the file used.
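
The conjectured rule is a first-match search over an ordered directory list. The sketch below models it in C so that the five examples can be checked mechanically. The directory order in demo_search_order is an assumption chosen to match the conjecture for an x86_64 Linux build; it is not read from a real glibc configuration, and all demo_ names are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* ASSUMED precedence, most specific first.  The real list is produced
   by glibc's configure and is longer.  */
static const char *const demo_search_order[] = {
  "sysdeps/unix/sysv/linux/x86_64",
  "sysdeps/unix/sysv/linux/x86",
  "sysdeps/unix/sysv/linux",
  "sysdeps/posix",
  "sysdeps/generic",
};

/* Return 1 if PATH is one of the N paths in EXISTING.  */
static int
demo_exists (const char *path, const char *const *existing, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (path, existing[i]) == 0)
      return 1;
  return 0;
}

/* Pick the first existing DIR/FILE in search order, trying the
   component's own directory SUBDIR last.  Returns BUF or NULL.  */
static const char *
demo_pick (const char *subdir, const char *file,
           const char *const *existing, size_t n,
           char *buf, size_t buflen)
{
  size_t n_dirs = sizeof demo_search_order / sizeof demo_search_order[0];
  for (size_t i = 0; i <= n_dirs; i++)
    {
      const char *dir = i < n_dirs ? demo_search_order[i] : subdir;
      snprintf (buf, buflen, "%s/%s", dir, file);
      if (demo_exists (buf, existing, n))
        return buf;
    }
  return NULL;
}
```

Fed with the find output for times.c and the subdirectory posix, demo_pick selects sysdeps/unix/sysv/linux/times.c and skips the x32 variant, because sysdeps/unix/sysv/linux/x86_64/x32 is not in the assumed list for a plain x86_64 build.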

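6. System calls in the kernel

System calls made through glibc are ultimately implemented by the Linux kernel. How do we find the implementation of a particular system call in the kernel source?

This chapter again uses clock_gettime as the example. As shown above, clock_gettime ends up issuing clock_gettime64, so let us see how the kernel defines clock_gettime64.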

6.1. syscall.tbl

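Much like the syscalls.list files used for glibc's assembly system calls, the Linux kernel keeps its system calls in table files. The format looks like this:
arch/x86/entry/syscalls/syscall_32.tbl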

#
# 32-bit system call numbers and entry vectors
#
# The format is:
# <number> <abi> <name> <entry point> <compat entry point>
#
# The __ia32_sys and __ia32_compat_sys stubs are created on-the-fly for
# sys_*() system calls and compat_sys_*() compat system calls if
# IA32_EMULATION is defined, and expect struct pt_regs *regs as their only
# parameter.
#
# The abi is always "i386" for this file.
#
0   i386    restart_syscall     sys_restart_syscall
1   i386    exit            sys_exit
2   i386    fork            sys_fork
3   i386    read            sys_read
4   i386    write           sys_write
5   i386    open            sys_open            compat_sys_open
6   i386    close           sys_close
7   i386    waitpid         sys_waitpid
8   i386    creat           sys_creat
9   i386    link            sys_link
10  i386    unlink          sys_unlink
11  i386    execve          sys_execve          compat_sys_execve
...
402 i386    msgctl          sys_msgctl              compat_sys_msgctl
403 i386    clock_gettime64     sys_clock_gettime
404 i386    clock_settime64     sys_clock_settime
405 i386    clock_adjtime64     sys_clock_adjtime
...

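Column 1: system call number
Column 2: ABI type
Column 3: system call name
Column 4: system call entry point (where the call is finally implemented)
Column 5: compat system call entry point

In this 32-bit table, clock_gettime64 has system call number 403 and the entry point sys_clock_gettime. On x86_64, by contrast, glibc defines __NR_clock_gettime64 as __NR_clock_gettime (the #ifndef block in section 4.2), which is the 0xe4 (228) loaded into %eax in the disassembly in section 4.1.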

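6.2. syscalls header files

The Linux kernel build automatically generates header files from syscall.tbl. In the generated header each table row appears as a __SYSCALL_I386 entry. __SYSCALL_I386 is defined twice: the first definition declares the system call functions, and the second initializes the system call table array.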
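
Expanding one list under two different macro definitions is the classic X-macro technique. The sketch below shows the idea in standalone C. It is not kernel code: the DEMO_SYSCALLS list, the demo_ functions and the -ENOSYS handling of empty slots are all hypothetical simplifications.

```c
#include <errno.h>
#include <stddef.h>

/* One list of (number, entry point) pairs, playing the role of the
   header generated from the table file.  */
#define DEMO_SYSCALLS(X)          \
  X (3, demo_sys_read)            \
  X (4, demo_sys_write)           \
  X (403, demo_sys_clock_gettime)

/* First expansion: declare every entry point.  */
#define DEMO_DECLARE(nr, sym) static long sym (void);
DEMO_SYSCALLS (DEMO_DECLARE)
#undef DEMO_DECLARE

/* Second expansion: initialize the dispatch table by number.  */
typedef long (*demo_syscall_fn) (void);
#define DEMO_ENTRY(nr, sym) [nr] = sym,
static const demo_syscall_fn demo_table[] = {
  DEMO_SYSCALLS (DEMO_ENTRY)
};
#undef DEMO_ENTRY

/* Trivial bodies that return their own number.  */
static long demo_sys_read (void) { return 3; }
static long demo_sys_write (void) { return 4; }
static long demo_sys_clock_gettime (void) { return 403; }

/* Look up NR in the table; slots without an entry yield -ENOSYS.  */
static long
demo_dispatch (size_t nr)
{
  if (nr >= sizeof demo_table / sizeof demo_table[0]
      || demo_table[nr] == NULL)
    return -ENOSYS;
  return demo_table[nr] ();
}
```

demo_dispatch (403) reaches demo_sys_clock_gettime, and any number without an entry returns -ENOSYS.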

6.3. The SYSCALL_DEFINE macros

In the kernel, sys_clock_gettime is defined through the wrapper macro SYSCALL_DEFINE2(clock_gettime, ...). SYSCALL_DEFINE2 is defined as follows:

include/linux/syscalls.h

#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
/*
 * It may be useful for an architecture to override the definitions of the
 * SYSCALL_DEFINE0() and __SYSCALL_DEFINEx() macros, in particular to use a
 * different calling convention for syscalls. To allow for that, the prototypes
 * for the sys_*() functions below will *not* be included if
 * CONFIG_ARCH_HAS_SYSCALL_WRAPPER is enabled.
 */
#include <asm/syscall_wrapper.h>
#endif /* CONFIG_ARCH_HAS_SYSCALL_WRAPPER */
...
#else
#define SYSCALL_METADATA(sname, nb, ...)

static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
{
    return 0;
}
#endif

#ifndef SYSCALL_DEFINE0
#define SYSCALL_DEFINE0(sname)                  \
    SYSCALL_METADATA(_##sname, 0);              \
    asmlinkage long sys_##sname(void);          \
    ALLOW_ERROR_INJECTION(sys_##sname, ERRNO);      \
    asmlinkage long sys_##sname(void)
#endif /* SYSCALL_DEFINE0 */

#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

#define SYSCALL_DEFINE_MAXARGS  6

#define SYSCALL_DEFINEx(x, sname, ...)              \
    SYSCALL_METADATA(sname, x, __VA_ARGS__)         \
    __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)

/*
 * The asmlinkage stub is aliased to a function named __se_sys_*() which
 * sign-extends 32-bit ints to longs whenever needed. The actual work is
 * done within __do_sys_*().
 */
#ifndef __SYSCALL_DEFINEx
#define __SYSCALL_DEFINEx(x, name, ...)                 \
    __diag_push();                          \
    __diag_ignore(GCC, 8, "-Wattribute-alias",          \
              "Type aliasing is used to sanitize syscall arguments");\
    asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))   \
        __attribute__((alias(__stringify(__se_sys##name))));    \
    ALLOW_ERROR_INJECTION(sys##name, ERRNO);            \
    static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\
    asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
    asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__))  \
    {                               \
        long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));\
        __MAP(x,__SC_TEST,__VA_ARGS__);             \
        __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));   \
        return ret;                     \
    }                               \
    __diag_pop();                           \
    static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
#endif /* __SYSCALL_DEFINEx */
...

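SYSCALL_DEFINE2 hides the prefix of the function name: the caller writes only clock_gettime. If CONFIG_ARCH_HAS_SYSCALL_WRAPPER is defined, SYSCALL_DEFINE2 uses the __SYSCALL_DEFINEx from asm/syscall_wrapper.h; otherwise it uses the __SYSCALL_DEFINEx in this file, which expands to sys_clock_gettime with no additional prefix.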
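
The layering that the generic __SYSCALL_DEFINEx sets up (a public sys_ symbol aliased to an __se_sys_ function that takes longs and forwards to a typed __do_sys_ body) can be imitated in user space. This is a heavily simplified sketch, assuming GCC or Clang on ELF: DEMO_SYSCALL_DEFINE2 and all demo_ names are hypothetical, and the metadata, error-injection and diagnostic parts of the kernel macro are left out.

```c
/* Simplified two-argument variant of the pattern.  */
#define DEMO_SYSCALL_DEFINE2(name, t1, a1, t2, a2)                    \
  static inline long demo_do_sys_##name (t1 a1, t2 a2);               \
  /* Receives every argument as long and casts to the real types.  */ \
  long demo_se_sys_##name (long a1, long a2)                          \
  {                                                                   \
    return demo_do_sys_##name ((t1) a1, (t2) a2);                     \
  }                                                                   \
  /* The public entry point is an alias of the wrapper above.  */     \
  long demo_sys_##name (long, long)                                   \
    __attribute__ ((alias ("demo_se_sys_" #name)));                   \
  /* The braces that follow the macro become this function's body.  */ \
  static inline long demo_do_sys_##name (t1 a1, t2 a2)

/* Defines demo_sys_add, demo_se_sys_add and demo_do_sys_add.  */
DEMO_SYSCALL_DEFINE2 (add, int, x, int, y)
{
  return (long) x + y;
}
```

The author of the body only writes the name add and typed parameters, yet the symbol that a dispatch table would reference is demo_sys_add, which corresponds to sys_clock_gettime in the table from section 6.1.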

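The header arch/x86/include/asm/syscall_wrapper.h adds architecture-specific prefixes in front of sys_*, for example __ia32_sys_*, consistent with the definitions in syscalls.h.

This is only a brief reading without a rigorous derivation, but the general pattern should be as described.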
