The Lost strace
In the book BPF Performance Tools, the author criticizes strace as a tool for monitoring process creation through execve, writing:
The current implementation of strace(1) uses breakpoints that can greatly slow the target (over 100x), making it dangerous for production use. ...
Also, this example traces the exec() syscall system-wide, which strace(1) currently cannot do.
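The passage makes two points: strace relies on the ptrace system call, which makes tracing process creation far slower than perf or eBPF-based tools, and it cannot trace system-wide. Separately, I noticed long ago that strace often fails to capture time-related system calls such as clock_gettime, which gradually made me doubt the tool. As an example, I wrote the following program: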
/* timestamp.c: gcc -Wall -O0 -ggdb -o timestamp timestamp.c */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
static void dump_time(clockid_t clkid, int isdate)
{
struct timespec spec;
clock_gettime(clkid, &spec);
if (isdate) {
struct tm * ptm, stm;
memset(&stm, 0, sizeof(stm));
ptm = localtime_r(&spec.tv_sec, &stm);
if (ptm == NULL)
ptm = &stm;
fprintf(stdout, "Date & time: %d-%02d-%02d %02d:%02d:%02d\n",
ptm->tm_year + 1900, ptm->tm_mon + 1, ptm->tm_mday,
ptm->tm_hour, ptm->tm_min, ptm->tm_sec);
} else {
fprintf(stdout, "clock: %#x, tv_sec: %ld, tv_nsec: %ld\n",
(unsigned int) clkid, (long) spec.tv_sec, spec.tv_nsec);
}
fflush(stdout);
}
int main(int argc, char *argv[])
{
fprintf(stdout, "argc: %d, argv[0]: %s\n",
argc, argv[0]);
fflush(stdout);
dump_time(CLOCK_REALTIME, 1);
dump_time(CLOCK_BOOTTIME, 0);
return 0;
}
Tracing its system calls with strace gives the following (partial) output:
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}) = 0
brk(NULL) = 0x555555559000
brk(0x55555557a000) = 0x55555557a000
write(1, "argc: 1, argv[0]: ./timestamp\n", 30argc: 1, argv[0]: ./timestamp
) = 30
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 573
lseek(3, -348, SEEK_CUR) = 225
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 348
close(3) = 0
write(1, "Date & time: 2022-04-03 09:13:59"..., 33Date & time: 2022-04-03 09:13:59
) = 33
write(1, "clock: 0x7, tv_sec: 46060, tv_ns"..., 46clock: 0x7, tv_sec: 46060, tv_nsec: 739873096
) = 46
exit_group(0) = ?
+++ exited with 0 +++
As the output shows, strace did not capture either of the two clock_gettime calls. This raises two questions: does strace miss other system calls as well, and why does clock_gettime go missing?
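The virtual shared library provided by the Linux kernel: vDSO
Loading the executable built above into gdb shows that the clock_gettime symbol lies within the vdso region of the process address space: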
(gdb) break main
Breakpoint 1 at 0x10e0: file timestamp.c, line 17.
(gdb) run
Starting program: /home/yejq/program/blogs/20220322/timestamp
Breakpoint 1, main (argc=1, argv=0x7fffffffde68) at timestamp.c:27
32 {
(gdb) info address clock_gettime
Symbol "clock_gettime" is at 0x7ffff7fcda10 in a file compiled without debugging.
(gdb) info proc mappings
process 14679
Mapped address spaces:
Start Addr End Addr Size Offset objfile
0x555555554000 0x555555555000 0x1000 0x0 /home/yejq/program/timestamp
0x555555555000 0x555555556000 0x1000 0x1000 /home/yejq/program/timestamp
0x555555556000 0x555555557000 0x1000 0x2000 /home/yejq/program/timestamp
0x555555557000 0x555555558000 0x1000 0x2000 /home/yejq/program/timestamp
0x555555558000 0x555555559000 0x1000 0x3000 /home/yejq/program/timestamp
0x7ffff7db9000 0x7ffff7ddb000 0x22000 0x0 /usr/lib/x86_64-linux-gnu/libc-2.31.so
0x7ffff7ddb000 0x7ffff7f53000 0x178000 0x22000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
0x7ffff7f53000 0x7ffff7fa1000 0x4e000 0x19a000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
0x7ffff7fa1000 0x7ffff7fa5000 0x4000 0x1e7000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
0x7ffff7fa5000 0x7ffff7fa7000 0x2000 0x1eb000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
0x7ffff7fa7000 0x7ffff7fad000 0x6000 0x0
0x7ffff7fc9000 0x7ffff7fcd000 0x4000 0x0 [vvar]
0x7ffff7fcd000 0x7ffff7fcf000 0x2000 0x0 [vdso]
0x7ffff7fcf000 0x7ffff7fd0000 0x1000 0x0 /usr/lib/x86_64-linux-gnu/ld-2.31.so
0x7ffff7fd0000 0x7ffff7ff3000 0x23000 0x1000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
0x7ffff7ff3000 0x7ffff7ffb000 0x8000 0x24000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
0x7ffff7ffc000 0x7ffff7ffd000 0x1000 0x2c000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
0x7ffff7ffd000 0x7ffff7ffe000 0x1000 0x2d000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
0x7ffff7ffe000 0x7ffff7fff000 0x1000 0x0
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 [vsyscall]
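The address of the clock_gettime symbol, 0x7ffff7fcda10, falls within the [vdso] mapping (0x7ffff7fcd000 to 0x7ffff7fcf000). The application does not call into the vDSO directly, however; the call goes through glibc: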
(gdb) break *0x7ffff7fcda10
Breakpoint 2 at 0x7ffff7fcda10
(gdb) c
Continuing.
argc: 1, argv[0]: /home/yejq/program/blogs/20220322/timestamp
Breakpoint 2, 0x00007ffff7fcda10 in clock_gettime ()
(gdb) bt
#0 0x00007ffff7fcda10 in clock_gettime ()
#1 0x00007ffff7e960e5 in __GI___clock_gettime (clock_id=0, tp=0x7fffffffdd10) at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#2 0x0000555555555147 in dump_time (isdate=1, clkid=0) at timestamp.c:10
#3 main (argc=<optimized out>, argv=<optimized out>) at timestamp.c:27
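According to the vdso(7) manual page, the vDSO is a virtual shared library that the kernel maps into an application at run time, and its symbols are resolved by the C library. The manual lists the functions the library provides as replacements for system calls, which settles the first question: apart from the calls served by the vDSO, strace should capture all other system calls. The manual also explains that these replacements exist for performance; the main motivation is to make applications run faster. The remaining question is how a function such as clock_gettime can read the system time without making a system call.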
How real is the virtual shared library vDSO?
The manual page notes that the vDSO is a complete ELF image:
Since the vDSO is a fully formed ELF image, you can do symbol lookups on it.
The source of the image is in the Linux kernel tree; for x86, the implementation is under arch/x86/entry/vdso. After a kernel build, that directory contains the following files:
yejq@ubuntu:~/program/focal-linux/arch/x86/entry/vdso$ ls -lh *.so*
-rwxrwxr-x 1 yejq yejq 8.0K 3月 22 22:29 vdso32.so
-rwxrwxr-x 1 yejq yejq 121K 3月 22 22:29 vdso32.so.dbg
-rwxrwxr-x 1 yejq yejq 5.5K 3月 22 22:29 vdso64.so
-rwxrwxr-x 1 yejq yejq 152K 3月 22 22:29 vdso64.so.dbg
-rwxrwxr-x 1 yejq yejq 4.6K 3月 22 22:29 vdsox32.so
-rwxrwxr-x 1 yejq yejq 150K 3月 22 22:29 vdsox32.so.dbg
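Presumably these shared objects are built with user-space compiler options rather than the options used for the rest of the kernel's C code, and they do not depend on any other C library functions. More importantly, the functions in them must not make system calls on their common paths, or the library would serve no purpose. Once built, the shared object is converted by the vdso2c command-line tool into C source (vdso-image-64.c for the 64-bit image), which is then compiled and linked into the Linux kernel: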
arch/x86/entry/vdso/vdso2c arch/x86/entry/vdso/vdso64.so.dbg \
arch/x86/entry/vdso/vdso64.so arch/x86/entry/vdso/vdso-image-64.c
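Seen this way, the virtual shared library vDSO really does exist as a compiled ELF object embedded in the kernel; it is virtual only in the sense that no file on disk backs it.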
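How the Linux kernel loads the vDSO
Analyzing how the kernel loads the vDSO requires kernel debugging information. One option is to rebuild the kernel (which takes more than three hours on my machine); another is to download a kernel image with debugging symbols. My system runs Ubuntu 20.04, and to save time I chose the second option. The commands to install the specific kernel version are: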
sudo apt-get install --install-recommends linux-generic-hwe-20.04
sudo apt install --reinstall linux-headers-5.13.0-35-generic linux-hwe-5.13-headers-5.13.0-35 linux-image-5.13.0-35-generic linux-modules-5.13.0-35-generic linux-modules-extra-5.13.0-35-generic
wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-unsigned-5.13.0-35-generic-dbgsym_5.13.0-35.40_amd64.ddeb
sudo dpkg -i linux-image-unsigned-5.13.0-35-generic-dbgsym_5.13.0-35.40_amd64.ddeb
sudo reboot
The kernel image with debugging information is installed at /usr/lib/debug/boot/vmlinux-5.13.0-35-generic; it is used again below. After reading the code, I wrote the bpftrace script 0-probe-arch_setup_additional_pages.bt:
#!/usr/bin/bpftrace
kprobe:arch_setup_additional_pages,
kprobe:compat_arch_setup_additional_pages
{
$vdso_enabled = kaddr("vdso64_enabled");
printf("PID: %d, comm: %s, vdso64_enabled: %d (%p), stack backtrace for %s:",
pid, comm, *$vdso_enabled, $vdso_enabled, probe);
print(kstack(perf));
}
Running the script with bpftrace (I used version v0.13.0-241-g275f5) prints the kernel stack at the point where the kernel loads the vDSO for a newly created process:
# bpftrace ./0-probe-arch_setup_additional_pages.bt
Attaching 2 probes...
PID: 16305, comm: timestamp, vdso64_enabled: 1 (0xffffffffb8d5b84c), stack backtrace for kprobe:arch_setup_additional_pages:
ffffffffb6c04d81 arch_setup_additional_pages+1
ffffffffb6f2549a exec_binprm+314
ffffffffb6f26b1d bprm_execve+365
ffffffffb6f270b9 do_execveat_common.isra.0+393
ffffffffb6f272d7 __x64_sys_execve+55
ffffffffb77f3b21 do_syscall_64+97
ffffffffb7a0007c entry_SYSCALL_64_after_hwframe+68
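The kernel variable vdso64_enabled is set, and the kernel maps the vDSO when execve loads a new ELF image. We can go one step further and obtain the address at which the vDSO is mapped for a new process. This requires loading the vmlinux-5.13.0-35-generic file mentioned above into gdb and disassembling the relevant function. Based on the disassembly, I wrote the script 1-probe-map_vdso.bt; the instruction offsets it probes (+0x65 and +0x88) and the registers it reads are specific to this kernel build: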
#!/usr/bin/bpftrace
#include <asm/vdso.h>
kprobe:map_vdso
{
$vimg = (struct vdso_image *) arg0;
printf("map_vdso[%s, %d] at address: %p\n",
comm, pid, arg1);
printf("sym_vvar_start: %lx (%ld), image size: %lx",
$vimg->sym_vvar_start, $vimg->sym_vvar_start, $vimg->size);
print(kstack(perf));
}
/*
static int map_vdso(const struct vdso_image *image, unsigned long addr)
{
...
addr = get_unmapped_area(NULL, addr,
image->size - image->sym_vvar_start, 0, 0);
*/
kprobe:map_vdso+0x65
{
printf("get_unmapped_area has returned: %p\n", reg("ax"));
}
/*
text_start = addr - image->sym_vvar_start;
// MAYWRITE to allow gdb to COW and set breakpoints
vma = _install_special_mapping(mm,
text_start,
*/
kprobe:map_vdso+0x88
{
printf("text_start for VDSO: %p\n", reg("r15"));
}
/*
Dump of assembler code for function map_vdso:
0xffffffff81004710 <+0>: callq 0xffffffff810777e0 <__fentry__>
0xffffffff81004715 <+5>: push %rbp
0xffffffff81004716 <+6>: mov %gs:0x1fbc0,%rax
0xffffffff8100471f <+15>: mov %rsp,%rbp
0xffffffff81004722 <+18>: push %r15
0xffffffff81004724 <+20>: mov %rsi,%r15
0xffffffff81004727 <+23>: push %r14
0xffffffff81004729 <+25>: push %r13
0xffffffff8100472b <+27>: push %r12
0xffffffff8100472d <+29>: push %rbx
0xffffffff8100472e <+30>: mov %rdi,%rbx
0xffffffff81004731 <+33>: sub $0x8,%rsp
0xffffffff81004735 <+37>: mov 0x868(%rax),%r13
0xffffffff8100473c <+44>: nopl 0x0(%rax,%rax,1)
0xffffffff81004741 <+49>: lea 0x78(%r13),%r14
0xffffffff81004745 <+53>: mov %r14,%rdi
0xffffffff81004748 <+56>: callq 0xffffffff81c3a8c0 <down_write_killable>
0xffffffff8100474d <+61>: mov %eax,%r12d
0xffffffff81004750 <+64>: nopl 0x0(%rax,%rax,1)
0xffffffff81004755 <+69>: test %r12d,%r12d
0xffffffff81004758 <+72>: jne 0xffffffff8100487f <map_vdso+367>
0xffffffff8100475e <+78>: mov 0x8(%rbx),%rdx
0xffffffff81004762 <+82>: xor %r8d,%r8d
0xffffffff81004765 <+85>: sub 0x38(%rbx),%rdx
0xffffffff81004769 <+89>: xor %ecx,%ecx
0xffffffff8100476b <+91>: xor %edi,%edi
0xffffffff8100476d <+93>: mov %r15,%rsi
0xffffffff81004770 <+96>: callq 0xffffffff812a9d30 <get_unmapped_area>
0xffffffff81004775 <+101>: cmp $0xfffffffffffff000,%rax
0xffffffff8100477b <+107>: ja 0xffffffff8100485f <map_vdso+335>
0xffffffff81004781 <+113>: mov %rax,%r15
0xffffffff81004784 <+116>: mov 0x8(%rbx),%rdx
0xffffffff81004788 <+120>: sub 0x38(%rbx),%r15
0xffffffff8100478c <+124>: mov $0x75,%ecx
0xffffffff81004791 <+129>: mov $0xffffffff822010e0,%r8
0xffffffff81004798 <+136>: mov %r15,%rsi
0xffffffff8100479b <+139>: mov %r13,%rdi
0xffffffff8100479e <+142>: mov %rax,-0x30(%rbp)
0xffffffff810047a2 <+146>: callq 0xffffffff812ae560 <_install_special_mapping>
*/
I also wrote the following test program, dump-maps.c, which prints its own address-space mappings:
/* dump-maps.c */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <time.h>
typedef int (* clockp_gettime)(clockid_t, struct timespec *);
int main(int argc, char *argv[])
{
int fd;
ssize_t rl1;
char * pbuf = NULL;
struct timespec tspec;
clockp_gettime clk_gtm;
clk_gtm = clock_gettime;
fprintf(stdout, "PID of %s: %ld\n", argv[0], (long) getpid());
fprintf(stdout, "address of clock_gettime: %p, vdso: %p\n",
(void *) clk_gtm, (void *) getauxval(AT_SYSINFO_EHDR));
fflush(stdout);
fd = open("/proc/self/maps", O_RDONLY);
if (fd == -1) {
fprintf(stderr, "Error, failed to open maps file: %s\n",
strerror(errno));
fflush(stderr);
return 1;
}
pbuf = (char *) malloc(8192);
if (pbuf == NULL) {
close(fd);
fputs("Error, system out of memory!\n", stderr);
fflush(stderr);
return 2;
}
for (;;) {
rl1 = read(fd, pbuf, 8192);
if (rl1 <= 0)
break;
fwrite(pbuf, 0x1, (size_t) rl1, stdout);
}
tspec.tv_sec = 0;
tspec.tv_nsec = 0;
clk_gtm(CLOCK_REALTIME, &tspec);
fprintf(stdout, "REALTIME: %ld, %ld\n",
(long) tspec.tv_sec, tspec.tv_nsec);
fflush(stdout);
close(fd);
free(pbuf);
return 0;
}
Running bpftrace in one terminal and then dump-maps in another gives:
# bpftrace ./1-probe-map_vdso.bt
Attaching 3 probes...
map_vdso[dump-maps, 16684] at address: 0x7ffffffff000
sym_vvar_start: ffffffffffffc000 (-16384), image size: 2000
ffffffffb6c04951 map_vdso+1
ffffffffb6c04da2 arch_setup_additional_pages+34
ffffffffb6fb13ab load_elf_binary+3323
ffffffffb6f2549a exec_binprm+314
ffffffffb6f26b1d bprm_execve+365
ffffffffb6f270b9 do_execveat_common.isra.0+393
ffffffffb6f272d7 __x64_sys_execve+55
ffffffffb77f3b21 do_syscall_64+97
ffffffffb7a0007c entry_SYSCALL_64_after_hwframe+68
get_unmapped_area has returned: 0x7ffff7fc9000
text_start for VDSO: 0x7ffff7fcd000
$ ./dump-maps
PID of ./dump-maps: 16684
address of clock_gettime: 0x7ffff7e960c0, vdso: 0x7ffff7fcd000
555555554000-555555555000 r--p 00000000 08:08 3278887 /home/yejq/program/blogs/20220322/dump-maps
555555555000-555555556000 r-xp 00001000 08:08 3278887 /home/yejq/program/blogs/20220322/dump-maps
555555556000-555555557000 r--p 00002000 08:08 3278887 /home/yejq/program/blogs/20220322/dump-maps
555555557000-555555558000 r--p 00002000 08:08 3278887 /home/yejq/program/blogs/20220322/dump-maps
555555558000-555555559000 rw-p 00003000 08:08 3278887 /home/yejq/program/blogs/20220322/dump-maps
555555559000-55555557a000 rw-p 00000000 00:00 0 [heap]
...
7ffff7fa7000-7ffff7fad000 rw-p 00000000 00:00 0
7ffff7fc9000-7ffff7fcd000 r--p 00000000 00:00 0 [vvar]
7ffff7fcd000-7ffff7fcf000 r-xp 00000000 00:00 0 [vdso]
...
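The vvar/vdso load addresses reported by bpftrace match those read by dump-maps: get_unmapped_area returned 0x7ffff7fc9000, the start of [vvar], and text_start is 0x7ffff7fcd000, the start of [vdso]. Note that the clock_gettime address printed by dump-maps (0x7ffff7e960c0) lies outside the [vdso] range; it is presumably the glibc wrapper, consistent with the gdb backtrace shown earlier.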
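Applications read vvar to get the system time
The results above show that a read-only memory segment named vvar is mapped alongside the executable vDSO code. This segment presumably holds the system time and related data. For all applications running on the system, the vdso/vvar mappings set up by the kernel refer to the same physical memory; in other words, processes do not each own a private copy of vdso and vvar. The pages are shared and read-only to applications, so when the kernel updates the system time it only needs to update one region of memory. (Time namespaces are the exception, as the kernel comment below explains: they install special vvar pages.) The vvar data has a specific layout, defined in include/vdso/datapage.h: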
/**
* struct vdso_data - vdso datapage representation
* @seq: timebase sequence counter
* @clock_mode: clock mode
* @cycle_last: timebase at clocksource init
* @mask: clocksource mask
* @mult: clocksource multiplier
* @shift: clocksource shift
* @basetime[clock_id]: basetime per clock_id
* @offset[clock_id]: time namespace offset per clock_id
* @tz_minuteswest: minutes west of Greenwich
* @tz_dsttime: type of DST correction
* @hrtimer_res: hrtimer resolution
* @__unused: unused
* @arch_data: architecture specific data (optional, defaults
* to an empty struct)
*
* vdso_data will be accessed by 64 bit and compat code at the same time
* so we should be careful before modifying this structure.
*
* @basetime is used to store the base time for the system wide time getter
* VVAR page.
*
* @offset is used by the special time namespace VVAR pages which are
* installed instead of the real VVAR page. These namespace pages must set
* @seq to 1 and @clock_mode to VDSO_CLOCKMODE_TIMENS to force the code into
* the time namespace slow path. The namespace aware functions retrieve the
* real system wide VVAR page, read host time and add the per clock offset.
* For clocks which are not affected by time namespace adjustment the
* offset must be zero.
*/
struct vdso_data {
u32 seq;
s32 clock_mode;
u64 cycle_last;
u64 mask;
u32 mult;
u32 shift;
union {
struct vdso_timestamp basetime[VDSO_BASES];
struct timens_offset offset[VDSO_BASES];
};
s32 tz_minuteswest;
s32 tz_dsttime;
u32 hrtimer_res;
u32 __unused;
struct arch_vdso_data arch_data;
};
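The definition of this structure depends on the kernel version. glibc does not parse it; that is done by the vdso64.so that is compiled and linked into the kernel. The time-reading part of the clock_gettime function provided by the vDSO is implemented mainly in the kernel source file lib/vdso/gettimeofday.c: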
/* linux-5.13.0/lib/vdso/gettimeofday.c */
static __always_inline int
__cvdso_clock_gettime_common(const struct vdso_data *vd, clockid_t clock,
struct __kernel_timespec *ts)
{
u32 msk;
/* Check for negative values or invalid clocks */
if (unlikely((u32) clock >= MAX_CLOCKS))
return -1;
/*
* Convert the clockid to a bitmask and use it to check which
* clocks are handled in the VDSO directly.
*/
msk = 1U << clock;
if (likely(msk & VDSO_HRES))
vd = &vd[CS_HRES_COARSE];
else if (msk & VDSO_COARSE)
return do_coarse(&vd[CS_HRES_COARSE], clock, ts);
else if (msk & VDSO_RAW)
vd = &vd[CS_RAW];
else
return -1;
return do_hres(vd, clock, ts);
}
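It should be noted, however, that the vDSO can only read kernel data that is, to some degree, independent of the calling process. When clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) is called to get the CPU time of the current process, the function above returns -1, and the vDSO then issues a real system call through clock_gettime_fallback(...):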
static __maybe_unused int
__cvdso_clock_gettime_data(const struct vdso_data *vd, clockid_t clock,
struct __kernel_timespec *ts)
{
int ret = __cvdso_clock_gettime_common(vd, clock, ts);
if (unlikely(ret))
return clock_gettime_fallback(clock, ts);
return 0;
}
This fallback can be verified with strace. Add one call to main in timestamp.c above:
diff --git a/20220322/timestamp.c b/20220322/timestamp.c
index 0559e7e..727d194 100644
--- a/timestamp.c
+++ b/timestamp.c
@@ -36,5 +36,6 @@ int main(int argc, char *argv[])
dump_time(CLOCK_REALTIME, 1);
dump_time(CLOCK_BOOTTIME, 0);
+ dump_time(CLOCK_PROCESS_CPUTIME_ID, 0);
return 0;
}
After rebuilding and tracing with strace, clock_gettime now appears in the output, but only once:
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0
brk(NULL) = 0x555555559000
brk(0x55555557a000) = 0x55555557a000
write(1, "argc: 1, argv[0]: ./timestamp\n", 30argc: 1, argv[0]: ./timestamp
) = 30
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 573
lseek(3, -348, SEEK_CUR) = 225
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 348
close(3) = 0
write(1, "Date & time: 2022-04-03 10:45:58"..., 33Date & time: 2022-04-03 10:45:58
) = 33
write(1, "clock: 0x7, tv_sec: 51579, tv_ns"..., 46clock: 0x7, tv_sec: 51579, tv_nsec: 433065897
) = 46
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=2112517}) = 0
write(1, "clock: 0x2, tv_sec: 0, tv_nsec: "..., 40clock: 0x2, tv_sec: 0, tv_nsec: 2112517
) = 40
exit_group(0) = ?
+++ exited with 0 +++
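We can now answer the questions raised earlier. strace does not see certain calls such as clock_gettime because they never enter the kernel: the corresponding vDSO functions serve them in user space, avoiding the overhead of a system call. On x86, the vDSO can be disabled by adding the kernel boot parameter vdso=0, which sets vdso64_enabled to 0; interested readers may want to try it.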