How the vDSO (Virtual Dynamic Shared Object) Is Implemented

What strace misses

In the book "BPF Performance Tools", the author criticizes the use of strace for monitoring process creation (execve) on a system, writing:

The current implementation of strace(1) uses breakpoints that can greatly slow the target (over 100x), making it dangerous for production use. ...
Also, this example traces the exec() syscall system-wide, which strace(1) currently cannot do.

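The book points out that strace relies on the ptrace system call, that this makes it slow at tracing process creation compared with perf and eBPF-based tools, and that it cannot trace system-wide. Separately, I noticed long ago that when strace traces an application's system calls it often fails to capture time-related calls such as clock_gettime, which gradually made me doubt the tool. For example, I wrote the following code: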

/* timestamp.c: gcc -Wall -O0 -ggdb -o timestamp timestamp.c */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>

static void dump_time(clockid_t clkid, int isdate)
{
	struct timespec spec;
	clock_gettime(clkid, &spec);
	if (isdate) {
		struct tm * ptm, stm;
		memset(&stm, 0, sizeof(stm));
		ptm = localtime_r(&spec.tv_sec, &stm);
		if (ptm == NULL)
			ptm = &stm;
		fprintf(stdout, "Date & time: %d-%02d-%02d %02d:%02d:%02d\n",
			ptm->tm_year + 1900, ptm->tm_mon + 1, ptm->tm_mday,
			ptm->tm_hour, ptm->tm_min, ptm->tm_sec);
	} else {
		fprintf(stdout, "clock: %#x, tv_sec: %ld, tv_nsec: %ld\n",
			(unsigned int) clkid, (long) spec.tv_sec, spec.tv_nsec);
	}
	fflush(stdout);
}

int main(int argc, char *argv[])
{
	fprintf(stdout, "argc: %d, argv[0]: %s\n",
		argc, argv[0]);
	fflush(stdout);
	dump_time(CLOCK_REALTIME, 1);
	dump_time(CLOCK_BOOTTIME, 0);
	return 0;
}

Tracing its system calls with strace gives, in part:

fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}) = 0
brk(NULL)                               = 0x555555559000
brk(0x55555557a000)                     = 0x55555557a000
write(1, "argc: 1, argv[0]: ./timestamp\n", 30argc: 1, argv[0]: ./timestamp
) = 30
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 573
lseek(3, -348, SEEK_CUR)                = 225
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 348
close(3)                                = 0
write(1, "Date & time: 2022-04-03 09:13:59"..., 33Date & time: 2022-04-03 09:13:59
) = 33
write(1, "clock: 0x7, tv_sec: 46060, tv_ns"..., 46clock: 0x7, tv_sec: 46060, tv_nsec: 739873096
) = 46
exit_group(0)                           = ?
+++ exited with 0 +++

As shown, strace captured neither of the two clock_gettime calls. This raises two questions: does strace miss other system calls as well, and why is clock_gettime missing?

The virtual shared library provided by the Linux kernel: vDSO

Loading the executable built above in gdb, I observed that the clock_gettime symbol lives in the process's vdso address range:

(gdb) break main
Breakpoint 1 at 0x10e0: file timestamp.c, line 17.
(gdb) run
Starting program: /home/yejq/program/blogs/20220322/timestamp 

Breakpoint 1, main (argc=1, argv=0x7fffffffde68) at timestamp.c:27
32	{
(gdb) info address clock_gettime
Symbol "clock_gettime" is at 0x7ffff7fcda10 in a file compiled without debugging.
(gdb) info proc mappings
process 14679
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
      0x555555554000     0x555555555000     0x1000        0x0 /home/yejq/program/timestamp
      0x555555555000     0x555555556000     0x1000     0x1000 /home/yejq/program/timestamp
      0x555555556000     0x555555557000     0x1000     0x2000 /home/yejq/program/timestamp
      0x555555557000     0x555555558000     0x1000     0x2000 /home/yejq/program/timestamp
      0x555555558000     0x555555559000     0x1000     0x3000 /home/yejq/program/timestamp
      0x7ffff7db9000     0x7ffff7ddb000    0x22000        0x0 /usr/lib/x86_64-linux-gnu/libc-2.31.so
      0x7ffff7ddb000     0x7ffff7f53000   0x178000    0x22000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
      0x7ffff7f53000     0x7ffff7fa1000    0x4e000   0x19a000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
      0x7ffff7fa1000     0x7ffff7fa5000     0x4000   0x1e7000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
      0x7ffff7fa5000     0x7ffff7fa7000     0x2000   0x1eb000 /usr/lib/x86_64-linux-gnu/libc-2.31.so
      0x7ffff7fa7000     0x7ffff7fad000     0x6000        0x0 
      0x7ffff7fc9000     0x7ffff7fcd000     0x4000        0x0 [vvar]
      0x7ffff7fcd000     0x7ffff7fcf000     0x2000        0x0 [vdso]
      0x7ffff7fcf000     0x7ffff7fd0000     0x1000        0x0 /usr/lib/x86_64-linux-gnu/ld-2.31.so
      0x7ffff7fd0000     0x7ffff7ff3000    0x23000     0x1000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
      0x7ffff7ff3000     0x7ffff7ffb000     0x8000    0x24000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
      0x7ffff7ffc000     0x7ffff7ffd000     0x1000    0x2c000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
      0x7ffff7ffd000     0x7ffff7ffe000     0x1000    0x2d000 /usr/lib/x86_64-linux-gnu/ld-2.31.so
      0x7ffff7ffe000     0x7ffff7fff000     0x1000        0x0 
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0 [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0 [vsyscall]

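The address of the symbol clock_gettime is 0x7ffff7fcda10, which falls within the [vdso] range (0x7ffff7fcd000 to 0x7ffff7fcf000). However, our application does not call into the vdso directly; it gets there through glibc: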

(gdb) break *0x7ffff7fcda10
Breakpoint 2 at 0x7ffff7fcda10
(gdb) c
Continuing.
argc: 1, argv[0]: /home/yejq/program/blogs/20220322/timestamp

Breakpoint 2, 0x00007ffff7fcda10 in clock_gettime ()
(gdb) bt
#0  0x00007ffff7fcda10 in clock_gettime ()
#1  0x00007ffff7e960e5 in __GI___clock_gettime (clock_id=0, tp=0x7fffffffdd10) at ../sysdeps/unix/sysv/linux/clock_gettime.c:38
#2  0x0000555555555147 in dump_time (isdate=1, clkid=0) at timestamp.c:10
#3  main (argc=<optimized out>, argv=<optimized out>) at timestamp.c:27

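According to the vdso manual page, vdso(7), this is a virtual shared library that the kernel maps into the application at run time, and its symbols are resolved by the C library. The manual lists the system calls for which the library provides replacement functions. That settles one of the questions: apart from the calls served by the vdso, strace should capture every other system call. The manual also explains that the replacements are fast and that the main motivation is to make applications run more efficiently. The remaining question is how a function such as clock_gettime can read the system time without making a system call.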
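To make the contrast concrete, here is a minimal sketch (not from the original investigation) that times the libc clock_gettime, which on a typical x86-64 glibc system is served by the vdso, against a call forced through syscall(2). The helper names raw_clock_gettime, avg_call_ns and compare_clock_paths are made up for this illustration, and the sketch assumes a 64-bit Linux target where SYS_clock_gettime takes a struct timespec.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Hypothetical helper: always enter the kernel, bypassing the vDSO. */
static int raw_clock_gettime(clockid_t clkid, struct timespec *ts)
{
	return (int) syscall(SYS_clock_gettime, clkid, ts);
}

/* Hypothetical helper: average wall-clock cost, in ns, of one call to fn. */
static double avg_call_ns(int (*fn)(clockid_t, struct timespec *), int rounds)
{
	struct timespec begin, end, tmp;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &begin);
	for (i = 0; i < rounds; i++)
		fn(CLOCK_MONOTONIC, &tmp);
	clock_gettime(CLOCK_MONOTONIC, &end);
	return ((double) (end.tv_sec - begin.tv_sec) * 1e9 +
		(double) (end.tv_nsec - begin.tv_nsec)) / rounds;
}

/* Print the average cost of both paths; returns 0 if both calls work. */
static int compare_clock_paths(int rounds)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0 ||
	    raw_clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
		return -1;
	printf("libc clock_gettime: %.1f ns/call\n",
		avg_call_ns(clock_gettime, rounds));
	printf("raw system call   : %.1f ns/call\n",
		avg_call_ns(raw_clock_gettime, rounds));
	return 0;
}
```

Calling compare_clock_paths(100000) from main and running the program under strace should show only the calls made through raw_clock_gettime; the absolute numbers depend on the machine and are not predicted here.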

How real is the virtual shared library?

The manual page notes that the vdso is a complete ELF image:

Since the vDSO is a fully formed ELF image, you can do symbol lookups on it.

Its source can be found in the Linux kernel tree; for x86 it lives under arch/x86/entry/vdso. After the kernel has been built, that directory contains the following files:

yejq@ubuntu:~/program/focal-linux/arch/x86/entry/vdso$ ls -lh *.so*
-rwxrwxr-x 1 yejq yejq 8.0K 3月  22 22:29 vdso32.so
-rwxrwxr-x 1 yejq yejq 121K 3月  22 22:29 vdso32.so.dbg
-rwxrwxr-x 1 yejq yejq 5.5K 3月  22 22:29 vdso64.so
-rwxrwxr-x 1 yejq yejq 152K 3月  22 22:29 vdso64.so.dbg
-rwxrwxr-x 1 yejq yejq 4.6K 3月  22 22:29 vdsox32.so
-rwxrwxr-x 1 yejq yejq 150K 3月  22 22:29 vdsox32.so.dbg

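Evidently these shared objects are compiled with userspace-style options that differ from those used for the rest of the kernel's C code, and they do not depend on any other C library functions. More importantly, the functions in them must not make system calls frequently when invoked, or the library would be pointless. Once built, the shared object is converted by the vdso2c command-line tool into the C source file vdso-image-64.c, which is then compiled and linked into the Linux kernel: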

arch/x86/entry/vdso/vdso2c arch/x86/entry/vdso/vdso64.so.dbg \
  arch/x86/entry/vdso/vdso64.so arch/x86/entry/vdso/vdso-image-64.c

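Seen this way, the "virtual" shared library is real after all: it is an ordinary ELF shared object that is built alongside the kernel and embedded in it.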

How the Linux kernel loads the vdso

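To debug and analyze how the kernel loads the vdso, we need the kernel's debug information. One option is to rebuild the kernel (which takes more than three hours on my machine); the other is to download a kernel image that ships with debug information. My system is Ubuntu 20.04, and to save time I chose the second option. The steps to install the specific kernel version are: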

sudo apt-get install --install-recommends linux-generic-hwe-20.04
sudo apt install --reinstall linux-headers-5.13.0-35-generic linux-hwe-5.13-headers-5.13.0-35 linux-image-5.13.0-35-generic linux-modules-5.13.0-35-generic linux-modules-extra-5.13.0-35-generic
wget http://ddebs.ubuntu.com/pool/main/l/linux/linux-image-unsigned-5.13.0-35-generic-dbgsym_5.13.0-35.40_amd64.ddeb
sudo dpkg -i linux-image-unsigned-5.13.0-35-generic-dbgsym_5.13.0-35.40_amd64.ddeb
sudo reboot

The kernel image with debug information is installed at /usr/lib/debug/boot/vmlinux-5.13.0-35-generic; it will be used later. After reading the code, I wrote the bpftrace script 0-probe-arch_setup_additional_pages.bt:

#!/usr/bin/bpftrace
kprobe:arch_setup_additional_pages,
kprobe:compat_arch_setup_additional_pages
{
	$vdso_enabled = kaddr("vdso64_enabled");
	printf("PID: %d, comm: %s, vdso64_enabled: %d (%p), stack backtrace for %s:",
		pid, comm, *$vdso_enabled, $vdso_enabled, probe);
	print(kstack(perf));
}

Loading the script with bpftrace (I used version v0.13.0-241-g275f5) yields the kernel stack backtrace at the point where the kernel sets up the vdso for a newly created application:

# bpftrace ./0-probe-arch_setup_additional_pages.bt
Attaching 2 probes...
PID: 16305, comm: timestamp, vdso64_enabled: 1 (0xffffffffb8d5b84c), stack backtrace for kprobe:arch_setup_additional_pages:
	ffffffffb6c04d81 arch_setup_additional_pages+1
	ffffffffb6f2549a exec_binprm+314
	ffffffffb6f26b1d bprm_execve+365
	ffffffffb6f270b9 do_execveat_common.isra.0+393
	ffffffffb6f272d7 __x64_sys_execve+55
	ffffffffb77f3b21 do_syscall_64+97
	ffffffffb7a0007c entry_SYSCALL_64_after_hwframe+68

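So the kernel variable vdso64_enabled is set, and the vdso is mapped by the kernel when execve loads a new ELF image. We can go one step further and obtain the address at which the vdso is mapped for a new process. This requires loading the vmlinux-5.13.0-35-generic file mentioned above in gdb and disassembling the relevant functions. The probe offsets in the script below come from that disassembly, which is quoted in the script's last comment: map_vdso+0x65 (+101) is the instruction right after the call to get_unmapped_area, where rax holds the return value, and map_vdso+0x88 (+136) is the point where r15 already holds text_start. With that, the script 1-probe-map_vdso.bt can be written: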

#!/usr/bin/bpftrace
#include <asm/vdso.h>

kprobe:map_vdso
{
	$vimg = (struct vdso_image *) arg0;
	printf("map_vdso[%s, %d] at address: %p\n",
		comm, pid, arg1);
	printf("sym_vvar_start: %lx (%ld), image size: %lx",
		$vimg->sym_vvar_start, $vimg->sym_vvar_start, $vimg->size);
	print(kstack(perf));
}

/*
static int map_vdso(const struct vdso_image *image, unsigned long addr)
{
    ...
	addr = get_unmapped_area(NULL, addr,
				 image->size - image->sym_vvar_start, 0, 0);
*/
kprobe:map_vdso+0x65
{
	printf("get_unmapped_area has returned: %p\n", reg("ax"));
}


/*
	text_start = addr - image->sym_vvar_start;

	// MAYWRITE to allow gdb to COW and set breakpoints
	vma = _install_special_mapping(mm,
				       text_start,
*/
kprobe:map_vdso+0x88
{
	printf("text_start for VDSO: %p\n", reg("r15"));
}

/*
Dump of assembler code for function map_vdso:
   0xffffffff81004710 <+0>:	callq  0xffffffff810777e0 <__fentry__>
   0xffffffff81004715 <+5>:	push   %rbp
   0xffffffff81004716 <+6>:	mov    %gs:0x1fbc0,%rax
   0xffffffff8100471f <+15>:	mov    %rsp,%rbp
   0xffffffff81004722 <+18>:	push   %r15
   0xffffffff81004724 <+20>:	mov    %rsi,%r15
   0xffffffff81004727 <+23>:	push   %r14
   0xffffffff81004729 <+25>:	push   %r13
   0xffffffff8100472b <+27>:	push   %r12
   0xffffffff8100472d <+29>:	push   %rbx
   0xffffffff8100472e <+30>:	mov    %rdi,%rbx
   0xffffffff81004731 <+33>:	sub    $0x8,%rsp
   0xffffffff81004735 <+37>:	mov    0x868(%rax),%r13
   0xffffffff8100473c <+44>:	nopl   0x0(%rax,%rax,1)
   0xffffffff81004741 <+49>:	lea    0x78(%r13),%r14
   0xffffffff81004745 <+53>:	mov    %r14,%rdi
   0xffffffff81004748 <+56>:	callq  0xffffffff81c3a8c0 <down_write_killable>
   0xffffffff8100474d <+61>:	mov    %eax,%r12d
   0xffffffff81004750 <+64>:	nopl   0x0(%rax,%rax,1)
   0xffffffff81004755 <+69>:	test   %r12d,%r12d
   0xffffffff81004758 <+72>:	jne    0xffffffff8100487f <map_vdso+367>
   0xffffffff8100475e <+78>:	mov    0x8(%rbx),%rdx
   0xffffffff81004762 <+82>:	xor    %r8d,%r8d
   0xffffffff81004765 <+85>:	sub    0x38(%rbx),%rdx
   0xffffffff81004769 <+89>:	xor    %ecx,%ecx
   0xffffffff8100476b <+91>:	xor    %edi,%edi
   0xffffffff8100476d <+93>:	mov    %r15,%rsi
   0xffffffff81004770 <+96>:	callq  0xffffffff812a9d30 <get_unmapped_area>
   0xffffffff81004775 <+101>:	cmp    $0xfffffffffffff000,%rax
   0xffffffff8100477b <+107>:	ja     0xffffffff8100485f <map_vdso+335>
   0xffffffff81004781 <+113>:	mov    %rax,%r15
   0xffffffff81004784 <+116>:	mov    0x8(%rbx),%rdx
   0xffffffff81004788 <+120>:	sub    0x38(%rbx),%r15
   0xffffffff8100478c <+124>:	mov    $0x75,%ecx
   0xffffffff81004791 <+129>:	mov    $0xffffffff822010e0,%r8
   0xffffffff81004798 <+136>:	mov    %r15,%rsi
   0xffffffff8100479b <+139>:	mov    %r13,%rdi
   0xffffffff8100479e <+142>:	mov    %rax,-0x30(%rbp)
   0xffffffff810047a2 <+146>:	callq  0xffffffff812ae560 <_install_special_mapping>
*/

I also wrote the following debugging program, dump-maps.c, which reads its own address-space mapping file:

/* dump-maps.c */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <time.h>

typedef int (* clockp_gettime)(clockid_t, struct timespec *);

int main(int argc, char *argv[])
{
	int fd;
	ssize_t rl1;
	char * pbuf = NULL;
	struct timespec tspec;
	clockp_gettime clk_gtm;

	clk_gtm = clock_gettime;
	fprintf(stdout, "PID of %s: %ld\n", argv[0], (long) getpid());
	fprintf(stdout, "address of clock_gettime: %p, vdso: %p\n",
		(void *) clk_gtm, (void *) getauxval(AT_SYSINFO_EHDR));
	fflush(stdout);

	fd = open("/proc/self/maps", O_RDONLY);
	if (fd == -1) {
		fprintf(stderr, "Error, failed to open maps file: %s\n",
			strerror(errno));
		fflush(stderr);
		return 1;
	}

	pbuf = (char *) malloc(8192);
	if (pbuf == NULL) {
		close(fd);
		fputs("Error, system out of memory!\n", stderr);
		fflush(stderr);
		return 2;
	}

	for (;;) {
		rl1 = read(fd, pbuf, 8192);
		if (rl1 <= 0)
			break;
		fwrite(pbuf, 0x1, (size_t) rl1, stdout);
	}

	tspec.tv_sec = 0;
	tspec.tv_nsec = 0;
	clk_gtm(CLOCK_REALTIME, &tspec);
	fprintf(stdout, "REALTIME: %ld, %ld\n",
		(long) tspec.tv_sec, tspec.tv_nsec);
	fflush(stdout);

	close(fd);
	free(pbuf);
	return 0;
}

Then run bpftrace and dump-maps, in that order, in two separate terminals. The output is:

# bpftrace ./1-probe-map_vdso.bt
Attaching 3 probes...
map_vdso[dump-maps, 16684] at address: 0x7ffffffff000
sym_vvar_start: ffffffffffffc000 (-16384), image size: 2000
	ffffffffb6c04951 map_vdso+1
	ffffffffb6c04da2 arch_setup_additional_pages+34
	ffffffffb6fb13ab load_elf_binary+3323
	ffffffffb6f2549a exec_binprm+314
	ffffffffb6f26b1d bprm_execve+365
	ffffffffb6f270b9 do_execveat_common.isra.0+393
	ffffffffb6f272d7 __x64_sys_execve+55
	ffffffffb77f3b21 do_syscall_64+97
	ffffffffb7a0007c entry_SYSCALL_64_after_hwframe+68

get_unmapped_area has returned: 0x7ffff7fc9000
text_start for VDSO: 0x7ffff7fcd000

$ ./dump-maps 
PID of ./dump-maps: 16684
address of clock_gettime: 0x7ffff7e960c0, vdso: 0x7ffff7fcd000
555555554000-555555555000 r--p 00000000 08:08 3278887                    /home/yejq/program/blogs/20220322/dump-maps
555555555000-555555556000 r-xp 00001000 08:08 3278887                    /home/yejq/program/blogs/20220322/dump-maps
555555556000-555555557000 r--p 00002000 08:08 3278887                    /home/yejq/program/blogs/20220322/dump-maps
555555557000-555555558000 r--p 00002000 08:08 3278887                    /home/yejq/program/blogs/20220322/dump-maps
555555558000-555555559000 rw-p 00003000 08:08 3278887                    /home/yejq/program/blogs/20220322/dump-maps
555555559000-55555557a000 rw-p 00000000 00:00 0                          [heap]
...
7ffff7fa7000-7ffff7fad000 rw-p 00000000 00:00 0 
7ffff7fc9000-7ffff7fcd000 r--p 00000000 00:00 0                          [vvar]
7ffff7fcd000-7ffff7fcf000 r-xp 00000000 00:00 0                          [vdso]
... 

The vvar/vdso load addresses obtained through bpftrace match what dump-maps reads from its own mappings.
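The same cross-check can be done from inside a program. The sketch below is an illustration with made-up helper names (find_mapping, vdso_text_start, vdso_matches_auxv): it locates a named mapping in /proc/self/maps, compares the start of [vdso] with getauxval(AT_SYSINFO_EHDR), and reproduces the text_start = addr - sym_vvar_start arithmetic from map_vdso().

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/auxv.h>

/* Find the mapping whose line contains tag (e.g. "[vdso]"); 0 on success. */
static int find_mapping(const char *tag, unsigned long *start, unsigned long *end)
{
	char line[512];
	int found = -1;
	FILE *fp = fopen("/proc/self/maps", "r");

	if (fp == NULL)
		return -1;
	while (fgets(line, sizeof(line), fp) != NULL) {
		unsigned long s, e;

		if (strstr(line, tag) == NULL)
			continue;
		if (sscanf(line, "%lx-%lx", &s, &e) == 2) {
			*start = s;
			*end = e;
			found = 0;
			break;
		}
	}
	fclose(fp);
	return found;
}

/* text_start = addr - sym_vvar_start; sym_vvar_start is negative. */
static uint64_t vdso_text_start(uint64_t area, int64_t sym_vvar_start)
{
	return area - (uint64_t) sym_vvar_start;
}

/* 1 if [vdso] starts at AT_SYSINFO_EHDR, 0 if not, -1 if there is no [vdso]. */
static int vdso_matches_auxv(void)
{
	unsigned long s, e;

	if (find_mapping("[vdso]", &s, &e) != 0)
		return -1;
	return s == getauxval(AT_SYSINFO_EHDR);
}
```

With the values from the trace above, vdso_text_start(0x7ffff7fc9000, -16384) gives 0x7ffff7fcd000: get_unmapped_area returned the start of [vvar], and the vdso text begins 16 KiB (four pages) after it.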

Reading vvar to obtain the system time

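The results above show that a read-only segment named vvar is mapped together with the vdso's executable code. Presumably this is where the system time and related data are kept. For the applications running on the system, the kernel maps the same physical pages for vdso/vvar. In other words, processes do not own private copies of the vdso/vvar data: the pages are shared and are read-only for applications, so when the kernel updates the system time it only has to update one memory region. (The exception, described in the structure's own comment below, is a time namespace, where a special vvar page is installed instead of the real one.) vvar has a specific data layout, defined in include/vdso/datapage.h: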

/**    
 * struct vdso_data - vdso datapage representation    
 * @seq:        timebase sequence counter    
 * @clock_mode:     clock mode    
 * @cycle_last:     timebase at clocksource init    
 * @mask:       clocksource mask    
 * @mult:       clocksource multiplier    
 * @shift:      clocksource shift    
 * @basetime[clock_id]: basetime per clock_id    
 * @offset[clock_id]:   time namespace offset per clock_id    
 * @tz_minuteswest: minutes west of Greenwich    
 * @tz_dsttime:     type of DST correction    
 * @hrtimer_res:    hrtimer resolution    
 * @__unused:       unused    
 * @arch_data:      architecture specific data (optional, defaults    
 *          to an empty struct)    
 *    
 * vdso_data will be accessed by 64 bit and compat code at the same time    
 * so we should be careful before modifying this structure.    
 *    
 * @basetime is used to store the base time for the system wide time getter    
 * VVAR page.    
 *    
 * @offset is used by the special time namespace VVAR pages which are    
 * installed instead of the real VVAR page. These namespace pages must set    
 * @seq to 1 and @clock_mode to VDSO_CLOCKMODE_TIMENS to force the code into    
 * the time namespace slow path. The namespace aware functions retrieve the    
 * real system wide VVAR page, read host time and add the per clock offset.    
 * For clocks which are not affected by time namespace adjustment the    
 * offset must be zero.    
 */ 
 struct vdso_data {    
    u32         seq;    
    
    s32         clock_mode;    
    u64         cycle_last;    
    u64         mask;    
    u32         mult;    
    u32         shift;    
    
    union {    
        struct vdso_timestamp   basetime[VDSO_BASES];    
        struct timens_offset    offset[VDSO_BASES];    
    };    
    
    s32         tz_minuteswest;    
    s32         tz_dsttime;    
    u32         hrtimer_res;
    u32         __unused;

    struct arch_vdso_data   arch_data;
};
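To show how a reader might use fields such as seq, cycle_last, mask, mult and shift, here is a toy userspace model. It is only a sketch under assumptions: struct toy_vdso_data and the functions toy_update and toy_read are invented for this illustration, the conversion formula follows the generic ((cycles - cycle_last) & mask) * mult >> shift pattern as I recall it from the kernel's vdso code (the real do_hres() reads a hardware counter and has architecture-specific variants), and the plain-field accesses are only safe here because the example runs single-threaded.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000ULL

/* Toy stand-in for one clock's share of struct vdso_data (hypothetical). */
struct toy_vdso_data {
	atomic_uint seq;      /* odd while the writer is in the middle of an update */
	uint64_t cycle_last;  /* counter value at the last update */
	uint64_t mask;        /* counter width mask */
	uint32_t mult;        /* ns per cycle = mult / 2^shift */
	uint32_t shift;
	uint64_t base_sec;    /* base time: seconds */
	uint64_t base_nsec;   /* base time: nanoseconds, pre-shifted left by shift */
};

/* Writer side (the kernel's role): publish a new base time. */
static void toy_update(struct toy_vdso_data *vd, uint64_t cycles,
		       uint64_t sec, uint64_t nsec)
{
	atomic_fetch_add(&vd->seq, 1);        /* seq becomes odd */
	vd->cycle_last = cycles;
	vd->base_sec = sec;
	vd->base_nsec = nsec << vd->shift;
	atomic_fetch_add(&vd->seq, 1);        /* seq becomes even again */
}

/* Reader side (the vdso's role): retry until a consistent snapshot is seen. */
static void toy_read(struct toy_vdso_data *vd, uint64_t cycles,
		     uint64_t *sec, uint64_t *nsec)
{
	unsigned int seq;
	uint64_t ns, s;

	do {
		do {
			seq = atomic_load(&vd->seq);
		} while (seq & 1);            /* writer active, wait */

		ns = vd->base_nsec +
		     ((cycles - vd->cycle_last) & vd->mask) * vd->mult;
		ns >>= vd->shift;
		s = vd->base_sec;

		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load(&vd->seq) != seq); /* changed underneath us: retry */

	*sec = s + ns / TOY_NSEC_PER_SEC;
	*nsec = ns % TOY_NSEC_PER_SEC;
}
```

For example, with mult = 512 and shift = 10 one counter cycle is worth 0.5 ns. After toy_update(&vd, 1000, 5, 999999900), a read at cycle 1400 adds 200 ns and carries into the seconds field, giving 6 s and 100 ns.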

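The definition of this structure depends on the kernel version. glibc does not interpret it; that job belongs to the vdso64.so that is compiled and linked into the kernel. The time-reading logic behind the vdso's clock_gettime is implemented mainly in the kernel source file lib/vdso/gettimeofday.c: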

/* linux-5.13.0/lib/vdso/gettimeofday.c */
static __always_inline int    
__cvdso_clock_gettime_common(const struct vdso_data *vd, clockid_t clock,    
                 struct __kernel_timespec *ts)    
{    
    u32 msk;    
    
    /* Check for negative values or invalid clocks */    
    if (unlikely((u32) clock >= MAX_CLOCKS))    
        return -1;    
    
    /*    
     * Convert the clockid to a bitmask and use it to check which    
     * clocks are handled in the VDSO directly.    
     */    
    msk = 1U << clock;    
    if (likely(msk & VDSO_HRES))    
        vd = &vd[CS_HRES_COARSE];    
    else if (msk & VDSO_COARSE)    
        return do_coarse(&vd[CS_HRES_COARSE], clock, ts);    
    else if (msk & VDSO_RAW)    
        vd = &vd[CS_RAW];    
    else    
        return -1;    
    
    return do_hres(vd, clock, ts);    
} 
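The bitmask dispatch above can be replayed in userspace to see which clock IDs stay inside the vdso. The sketch below is an illustration only: the SKETCH_* macros mirror the kernel's MAX_CLOCKS, VDSO_HRES, VDSO_COARSE and VDSO_RAW as I recall them from include/vdso/datapage.h in 5.x kernels (please verify against your own tree), and the enum and the function classify_clock are made up for this example.

```c
#define _GNU_SOURCE
#include <time.h>

#define SKETCH_BIT(n)      (1U << (n))
#define SKETCH_MAX_CLOCKS  16
#define SKETCH_VDSO_HRES   (SKETCH_BIT(CLOCK_REALTIME) | \
			    SKETCH_BIT(CLOCK_MONOTONIC) | \
			    SKETCH_BIT(CLOCK_BOOTTIME) | \
			    SKETCH_BIT(CLOCK_TAI))
#define SKETCH_VDSO_COARSE (SKETCH_BIT(CLOCK_REALTIME_COARSE) | \
			    SKETCH_BIT(CLOCK_MONOTONIC_COARSE))
#define SKETCH_VDSO_RAW    SKETCH_BIT(CLOCK_MONOTONIC_RAW)

enum sketch_path {
	PATH_HRES,     /* handled by do_hres() inside the vDSO */
	PATH_COARSE,   /* handled by do_coarse() inside the vDSO */
	PATH_RAW,      /* handled by do_hres() on the raw clock data */
	PATH_SYSCALL   /* -1 from the common helper: real system call */
};

/* Mirror the branch structure of __cvdso_clock_gettime_common(). */
static enum sketch_path classify_clock(clockid_t clock)
{
	unsigned int msk;

	/* Negative or out-of-range IDs are rejected before the shift. */
	if ((unsigned int) clock >= SKETCH_MAX_CLOCKS)
		return PATH_SYSCALL;

	msk = 1U << clock;
	if (msk & SKETCH_VDSO_HRES)
		return PATH_HRES;
	if (msk & SKETCH_VDSO_COARSE)
		return PATH_COARSE;
	if (msk & SKETCH_VDSO_RAW)
		return PATH_RAW;
	return PATH_SYSCALL;
}
```

Under these assumptions CLOCK_REALTIME and CLOCK_BOOTTIME, the two clocks used by timestamp.c, both classify as PATH_HRES, while CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID fall through to PATH_SYSCALL.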

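It should be noted, though, that the vdso can only read kernel data that is more or less independent of the calling process. When clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) is called to obtain the CPU time of the current process, the function above returns -1, and the vdso then issues a real system call through clock_gettime_fallback(...):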

static __maybe_unused int
__cvdso_clock_gettime_data(const struct vdso_data *vd, clockid_t clock,
               struct __kernel_timespec *ts)
{                                                                                                      
    int ret = __cvdso_clock_gettime_common(vd, clock, ts);
                                                                                                      
    if (unlikely(ret))
        return clock_gettime_fallback(clock, ts);
    return 0;
} 

This can be verified with strace. Add one more call to timestamp.c above:

diff --git a/20220322/timestamp.c b/20220322/timestamp.c
index 0559e7e..727d194 100644
--- a/timestamp.c
+++ b/timestamp.c
@@ -36,5 +36,6 @@ int main(int argc, char *argv[])
 
        dump_time(CLOCK_REALTIME, 1);
        dump_time(CLOCK_BOOTTIME, 0);
+       dump_time(CLOCK_PROCESS_CPUTIME_ID, 0);
        return 0;
 }

After rebuilding and tracing with strace, a clock_gettime call now shows up, but only one, for CLOCK_PROCESS_CPUTIME_ID:

fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x1), ...}) = 0
brk(NULL)                               = 0x555555559000
brk(0x55555557a000)                     = 0x55555557a000
write(1, "argc: 1, argv[0]: ./timestamp\n", 30argc: 1, argv[0]: ./timestamp
) = 30
openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=573, ...}) = 0
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 573
lseek(3, -348, SEEK_CUR)                = 225
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\3\0\0\0\3\0\0\0\0"..., 4096) = 348
close(3)                                = 0
write(1, "Date & time: 2022-04-03 10:45:58"..., 33Date & time: 2022-04-03 10:45:58
) = 33
write(1, "clock: 0x7, tv_sec: 51579, tv_ns"..., 46clock: 0x7, tv_sec: 51579, tv_nsec: 433065897
) = 46
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=0, tv_nsec=2112517}) = 0
write(1, "clock: 0x2, tv_sec: 0, tv_nsec: "..., 40clock: 0x2, tv_sec: 0, tv_nsec: 2112517
) = 40
exit_group(0)                           = ?
+++ exited with 0 +++

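We can now answer the questions raised earlier. strace may appear to miss some system calls, but in those cases no system call is made at all: the call is served by a replacement function in the vdso, which avoids the cost of entering the kernel. On x86, the vdso can be disabled by adding vdso=0 to the kernel command line, which sets vdso64_enabled to 0; interested readers may want to try it.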
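For anyone trying that experiment, a small check like the following can confirm whether the running process was given a vdso at all. It is a sketch with made-up function names (vdso_present, report_vdso), and it relies on getauxval(AT_SYSINFO_EHDR) returning 0 when the kernel did not pass that auxiliary vector entry, which is my understanding of what happens on x86-64 when vdso64_enabled is 0.

```c
#include <stdio.h>
#include <sys/auxv.h>

/* Returns 1 if the kernel advertised a vDSO to this process, else 0. */
static int vdso_present(void)
{
	return getauxval(AT_SYSINFO_EHDR) != 0;
}

/* Print a one-line summary to the given stream. */
static void report_vdso(FILE *out)
{
	if (vdso_present())
		fprintf(out, "vdso mapped at %p\n",
			(void *) getauxval(AT_SYSINFO_EHDR));
	else
		fprintf(out, "no vdso: time calls go through real system calls\n");
}
```

On a system booted with vdso=0, the original timestamp program traced with strace should then show its clock_gettime calls as ordinary system calls.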
