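The key to understanding system calls is that the system call number is the link between user mode and kernel mode. On Solaris x64, the system call number is passed from user mode to kernel mode in register RAX. Once in kernel mode, the kernel entry routine sys_syscall uses the number in RAX to look up the corresponding handler in the kernel's system call table (sysent) and then calls it. Up to six arguments are passed in registers. The C calling convention uses RDI, RSI, RDX, RCX, R8 and R9, in that order; because the syscall instruction overwrites RCX, the libc stub first copies the fourth argument to R10, and the kernel hands it to the handler in RCX again. The switch from user mode into kernel mode is made with the syscall instruction, and the return from kernel mode to user mode with the sysret instruction.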
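1 System Call Overview
1.1 What Is a System Call
In a modern operating system, a system call (syscall) is the mechanism through which user applications access the services provided by the kernel.
1.2 Why System Calls Are Needed
First, system calls give user space a uniform interface to hardware resources, so user programs need not deal with hardware-specific operations. When reading or writing a file, for example, the user does not need to care what kind of disk holds the file or which file system it lives on.
Second, system calls protect the operating system and keep it stable and secure. They define exactly how a user process may enter the kernel: the entry paths are fixed in advance, and a process can enter only at those points rather than jump to arbitrary kernel code. This restriction to a controlled entry path is what makes it possible to keep the kernel secure.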
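1.3 System Calls and C Library Functions
Each system call the kernel provides has a corresponding wrapper function in the C library, and the wrapper usually has the same name as the system call. For example, the modctl system call is wrapped by the C library function modctl(), which is implemented in the assembly file modctl.s.
1.4 System Calls and System Commands
System commands sit one layer above the C library: they are executables built on C library functions. For example, the modinfo command calls the C library function modctl() to query information about kernel modules. The C library function in turn wraps the entry into the kernel: modctl() uses the syscall instruction (a fast system call instruction, as opposed to a software interrupt such as int 0x80) to enter the kernel.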
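The layering described here (command, C library wrapper, system call) can be seen from C by issuing the same system call twice, once through its usual wrapper and once through the generic syscall() entry point. This is a minimal sketch, not taken from the Solaris sources: it uses getpid rather than modctl because getpid takes no arguments and is harmless to call, and the helper name raw_getpid_matches_wrapper is made up for this illustration.

```c
#define _GNU_SOURCE        /* exposes syscall() on glibc; harmless elsewhere */
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/*
 * Hypothetical helper: returns 1 when the raw system call and the
 * libc wrapper report the same process ID, 0 otherwise.
 */
int raw_getpid_matches_wrapper(void)
{
    long raw = syscall(SYS_getpid);   /* generic entry: number plus arguments */
    pid_t wrapped = getpid();         /* conventional libc wrapper */

    assert(raw > 0);                  /* getpid cannot fail */
    return raw == (long)wrapped;
}
```

Both paths end in the same kernel service routine, so the helper is expected to return 1; the wrapper only saves the caller from knowing the system call number.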
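1.5 System Calls and Kernel Functions
Kernel functions differ from C library functions only in that they are implemented inside the kernel and must therefore follow the rules of kernel programming. Every system call must ultimately perform a well-defined operation. After a user application enters the kernel through a system call, the kernel runs the kernel function that corresponds to it, known as the system call service routine. For example, the service routine of the modctl system call is the kernel function modctl().
The system call flow is shown in the figure below: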
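2 How Solaris x64 Implements System Calls
Solaris supports two platforms, x64 and SPARC. The kernel is 64-bit on both but runs 32-bit as well as 64-bit applications, so both 32-bit and 64-bit system calls are supported. For simplicity, the rest of this discussion covers only 64-bit system calls on x64.
2.1 AMD64 ABI Basics
Understanding Solaris x64 system calls requires some familiarity with the basic AMD64 ABI. The ABI document that Solaris x64 follows is:
System V Application Binary Interface, AMD64 Architecture Processor Supplement
A condensed version of the ABI document is used here: http://www.x86-64.org/documentation/abi.pdf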
A.2 AMD64 Linux Kernel Conventions
...
A.2.1 Calling Conventions
The Linux AMD64 kernel uses internally the same calling conventions as user-level applications (see section 3.2.3 for details). User-level applications that like to call system calls should use the functions from the C library. The interface between the C library and the Linux kernel is the same as for the user-level applications with the following differences:
1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
3. The number of the syscall has to be passed in register %rax.
4. System-calls are limited to six arguments, no argument is passed directly on the stack.
5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
6. Only values of class INTEGER or class MEMORY are passed to the kernel.
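Item 5 of the quoted appendix is easy to misread, so here is a small sketch of how a raw return value would be decoded under that Linux convention. The function names are hypothetical. Note that this is the Linux rule as quoted; the Solaris libc stub disassembled later in this article follows the syscall instruction with jb __cerror, which suggests that Solaris reports errors through the carry flag rather than through a negative range.

```c
#include <assert.h>
#include <stddef.h>

/* Linux AMD64 convention quoted above: raw values in [-4095, -1] mean -errno. */
int linux_ret_is_error(long raw)
{
    return raw >= -4095L && raw <= -1L;
}

/*
 * Hypothetical decoder: splits a raw %rax value into a result and an
 * errno value. Returns 0 on success and -1 when raw encodes an error.
 */
int decode_linux_ret(long raw, long *result, int *err)
{
    assert(result != NULL && err != NULL);

    if (linux_ret_is_error(raw)) {
        *result = -1;
        *err = (int)-raw;     /* e.g. raw == -2 gives errno 2 */
        return -1;
    }
    *result = raw;
    *err = 0;
    return 0;
}
```

A value such as -4096 falls outside the error range and is treated as an ordinary result, which is why the range check is written out explicitly.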
The slides from d3s.mff.cuni.cz/teaching/crash_dump_analysis are also a useful reference. [Placeholder: two key screenshots from the slides]
The following disassembly of a kernel function helps illustrate the ABI.
Function prototype:
    ibt_status_t ibt_suggest_alt_path(ibt_channel_hdl_t channel,
        ibt_execution_mode_t mode, ibt_suggest_alt_path_info_t *alt_path,
        void *priv_data, ibt_priv_data_len_t priv_data_len,
        ibt_spr_returns_t *ret_args);
Disassemble it in the running kernel with mdb -k:
    root# mdb -k
    > ibt_suggest_alt_path::dis
    ibt_suggest_alt_path:           pushq  %rbp              ; save rbp
    ibt_suggest_alt_path+1:         movq   %rsp,%rbp         ;
    ibt_suggest_alt_path+4:         subq   $0x30,%rsp        ;
    ibt_suggest_alt_path+8:         movq   %rdi,-0x8(%rbp)   ; arg1 : rdi
    ibt_suggest_alt_path+0xc:       movq   %rsi,-0x10(%rbp)  ; arg2 : rsi
    ibt_suggest_alt_path+0x10:      movq   %rdx,-0x18(%rbp)  ; arg3 : rdx
    ibt_suggest_alt_path+0x14:      movq   %rcx,-0x20(%rbp)  ; arg4 : rcx
    ibt_suggest_alt_path+0x18:      movq   %r8,-0x28(%rbp)   ; arg5 : r8
    ibt_suggest_alt_path+0x1c:      movq   %r9,-0x30(%rbp)   ; arg6 : r9
    ibt_suggest_alt_path+0x20:      pushq  %rbx              ; save rbx
    ibt_suggest_alt_path+0x21:      pushq  %r12              ; save r12
    ibt_suggest_alt_path+0x23:      pushq  %r13              ; save r13
    ibt_suggest_alt_path+0x25:      pushq  %r14              ; save r14
    ibt_suggest_alt_path+0x27:      pushq  %r15              ; save r15
    ...
    ibt_suggest_alt_path+0xa11:     popq   %r15              ; restore r15
    ibt_suggest_alt_path+0xa13:     popq   %r14              ; restore r14
    ibt_suggest_alt_path+0xa15:     popq   %r13              ; restore r13
    ibt_suggest_alt_path+0xa17:     popq   %r12              ; restore r12
    ibt_suggest_alt_path+0xa19:     popq   %rbx              ; restore rbx
    ibt_suggest_alt_path+0xa1a:     leave                    ; restore rsp, rbp
    ibt_suggest_alt_path+0xa1b:     ret                      ;
    > $q
Note: leave is equivalent to movq %rbp, %rsp followed by popq %rbp.
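2.2 System Call Numbers
Every system call has a unique system call number. The operating system supports at most 256 system calls, which is the value of NSYSCALL that sys_syscall checks the number against. If a system call is retired, its number stays reserved and cannot be assigned to a new system call. All system call numbers are listed in the file /etc/name_to_sysnum. The ABI specifies that the system call number is passed to the kernel in register rax. For example, the system call number of modctl is 152 (0x98), and the modctl::dis output shows that %eax holds 0x98 just before the syscall instruction executes.
    root# egrep "modctl" /etc/name_to_sysnum
    modctl 152
    root# echo "modctl::dis" | mdb /lib/64/libc.so.1
    modctl:         movq   %rcx,%r10
    modctl+3:       movl   $0x98,%eax
    modctl+8:       syscall
    modctl+0xa:     jb     -0x126d30 <__cerror>
    modctl+0x10:    xorq   %rax,%rax
    modctl+0x13:    ret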
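- For the implementation of modctl(), see usr/src/lib/libc/common/sys/modctl.s
System call numbers are defined in the source file usr/src/uts/common/sys/syscall.h,
for example:
    #define SYS_modctl 152
After entering sys_syscall(), the kernel uses the system call number stored in register rax to find the corresponding kernel function.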
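The name-to-number mapping in /etc/name_to_sysnum can also be read programmatically. The sketch below parses a single line of that file, assuming the whitespace-separated "name number" layout seen in the egrep output above; the function name sysnum_from_line is hypothetical, and handling of comments or other line formats is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parses one "name number" line and returns the
 * number when the name equals `wanted`. Returns -1 when the line is
 * malformed or names a different system call.
 */
int sysnum_from_line(const char *line, const char *wanted)
{
    char name[64];
    int num;

    assert(line != NULL && wanted != NULL);

    if (sscanf(line, "%63s %d", name, &num) != 2)
        return -1;                    /* not a "name number" pair */
    return strcmp(name, wanted) == 0 ? num : -1;
}
```

Feeding it the line "modctl 152" with the wanted name "modctl" yields 152, the same value as SYS_modctl in syscall.h.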
2.3 The System Call Table
The Solaris kernel maintains a system call table in which every element is a struct sysent.
2.3.1 The struct sysent Structure
    /*
     * Structure of the system-entry table.
     *
     * Changes to struct sysent should maintain binary compatibility with
     * loadable system calls, although the interface is currently private.
     *
     * This means it should only be expanded on the end, and flag values
     * should not be reused.
     *
     * It is desirable to keep the size of this struct a power of 2 for quick
     * indexing.
     */
    struct sysent {
        char            sy_narg;        /* total number of arguments */
    #ifdef _LP64
        unsigned short  sy_flags;       /* various flags as defined below */
    #else
        unsigned char   sy_flags;       /* various flags as defined below */
    #endif
        int             (*sy_call)();   /* argp, rvalp-style handler */
        krwlock_t       *sy_lock;       /* lock for loadable system calls */
        int64_t         (*sy_callc)();  /* C-style call hander or wrapper */
    };

    root# mdb -k
    > ::sizeof struct sysent
    sizeof (struct sysent) = 0x20
    > ::offsetof sysent sy_callc
    offsetof (sysent, sy_callc) = 0x18, sizeof (...->sy_callc) = 8
Note: the size of struct sysent is 0x20 (32), and the system call service routine pointer sy_callc sits at offset 0x18 within it. Both numbers, 0x20 and 0x18, come up again in the analysis of the sys_syscall() assembly code below.
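The two numbers reported by mdb follow directly from LP64 alignment rules and can be reproduced in user space. The sketch below declares a stand-in structure with the same member order as the _LP64 variant of struct sysent; the type name sysent_model and the use of void * in place of krwlock_t * are assumptions made so that the example compiles outside the kernel.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>   /* int64_t */

/* User-space stand-in for the kernel's struct sysent on an LP64 target. */
struct sysent_model {
    char            sy_narg;            /* offset 0x00 */
    unsigned short  sy_flags;           /* offset 0x02, after 1 byte of padding */
    int           (*sy_call)(void);     /* offset 0x08, pointer-aligned */
    void           *sy_lock;            /* offset 0x10, stands in for krwlock_t * */
    int64_t       (*sy_callc)(void);    /* offset 0x18 */
};

size_t sysent_model_size(void)
{
    return sizeof(struct sysent_model);
}

size_t sysent_model_callc_offset(void)
{
    return offsetof(struct sysent_model, sy_callc);
}
```

On a 64-bit build the two functions return 0x20 and 0x18. A size of 32 is a power of two, which is what lets the kernel turn a system call number into a table offset with a single shift.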
2.3.2 The System Call Table struct sysent sysent[NSYSCALL]
o The macro NSYSCALL is defined in the header usr/src/uts/common/sys/systm.h,
    #define NSYSCALL 256 /* number of system calls */
o sysent[NSYSCALL] is defined in the source file usr/src/uts/common/os/sysent.c
    /*
     * Native sysent table.
     */
    struct sysent sysent[NSYSCALL] =
    {
        /*   0 */ IF_LP64(
                      SYSENT_NOSYS(),
                      SYSENT_C("indir", indir, 1)),
        /*   1 */ SYSENT_CI("exit", rexit, 1),
        ...
        /* 152 */ SYSENT_CI("modctl", modctl, 6),
        ...
        /* 255 */ SYSENT_CI("umount2", umount2, 2)
        ...
    };
o The macro SYSENT_CI is defined in the source file usr/src/uts/common/os/sysent.c,
    #define SYSENT_CI(name, call, narg) \
        { (narg), SE_32RVAL1, NULL, NULL, (llfcn_t)(call) }
o The macro SE_32RVAL1 is defined in the header usr/src/uts/common/sys/systm.h,
    #define SE_32RVAL1 0x0 /* handler returns int32_t in rval1 */
o Taking modctl as an example, its entry in the sysent table expands to:
    {6, 0x0, NULL, NULL, (llfcn_t)modctl}
o Check it with mdb,
    > (sysent + 0x20 * 0t152)::print -Ta struct sysent
    fffffffffc243480 struct sysent {
        fffffffffc243480 char sy_narg = '\006'
        fffffffffc243482 unsigned short sy_flags = 0
        fffffffffc243488 int (*)() sy_call = 0
        fffffffffc243490 krwlock_t *sy_lock = 0
        fffffffffc243498 int64_t (*)() sy_callc = modctl
    }
    >
As expected, sy_narg = 6 and sy_callc = modctl; that is, the modctl kernel function takes six arguments.
o modctl is declared in usr/src/uts/common/os/sysent.c as follows,
    int modctl(int, uintptr_t, uintptr_t, uintptr_t, uintptr_t, uintptr_t);
2.4 The System Call Entry Point sys_syscall
o After the user-mode C library function executes the syscall instruction, control enters the kernel, which starts executing at sys_syscall(). Note that sys_syscall() is implemented in assembly; the source file is:
usr/src/uts/i86pc/ml/syscall_asm_amd64.s
    525 _syscall_invoke:
    526         movq    REGOFF_RDI(%rbp), %rdi
    527         movq    REGOFF_RSI(%rbp), %rsi
    528         movq    REGOFF_RDX(%rbp), %rdx
    529         movq    REGOFF_RCX(%rbp), %rcx
    530         movq    REGOFF_R8(%rbp), %r8
    531         movq    REGOFF_R9(%rbp), %r9
    532
    533         cmpl    $NSYSCALL, %eax
    534         jae     _syscall_ill
    535         shll    $SYSENT_SIZE_SHIFT, %eax
    536         leaq    sysent(%rax), %rbx
    537
    538         call    *SY_CALLC(%rbx)
    539
    540         movq    %rax, %r12
    541         movq    %rdx, %r13
o Inspect sys_syscall() with mdb
     1 > sys_syscall::dis
     2 sys_syscall:                    swapgs
     3 ...
     4 sys_syscall+0x21d:              movq   0x10(%rbp),%rdi
     5 sys_syscall+0x221:              movq   0x18(%rbp),%rsi
     6 sys_syscall+0x225:              movq   0x20(%rbp),%rdx
     7 sys_syscall+0x229:              movq   0x28(%rbp),%rcx
     8 sys_syscall+0x22d:              movq   0x30(%rbp),%r8
     9 sys_syscall+0x231:              movq   0x38(%rbp),%r9
    10 sys_syscall+0x235:              cmpl   $0x100,%eax
    11 sys_syscall+0x23a:              jae    +0x11a <0xfffffffffb8014bb>
    12 sys_syscall+0x240:              shll   $0x5,%eax
    13 sys_syscall+0x243:              leaq   0xfffffffffc242180(%rax),%rbx <sysent>
    14 sys_syscall+0x24a:              call   *0x18(%rbx)
    15 sys_syscall+0x24d:              movq   %rax,%r12
    16 sys_syscall+0x250:              movq   %rdx,%r13
    17 ...
    18 nopop_sys_syscall_swapgs_sysretq:       swapgs
    19 nopop_sys_syscall_swapgs_sysretq+3:     sysret
    20 ...
o These three lines of syscall_asm_amd64.s,
    535         shll    $SYSENT_SIZE_SHIFT, %eax
    536         leaq    sysent(%rax), %rbx
    538         call    *SY_CALLC(%rbx)
correspond to these three lines of the sys_syscall disassembly:
    12 sys_syscall+0x240:              shll   $0x5,%eax
    13 sys_syscall+0x243:              leaq   0xfffffffffc242180(%rax),%rbx <sysent>
    14 sys_syscall+0x24a:              call   *0x18(%rbx)
12: shifts eax, which holds the system call number, left by 5 bits: eax = eax << 5 = eax * 32, where 32 is the size of struct sysent;
13: adds the base address of the system call table sysent to rax and stores the result in rbx, which now points at the table entry;
14: adds 0x18 to rbx; the value stored at that memory address is the start address of the system call service routine, so call [rbx+0x18] invokes the corresponding service routine.
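The three instructions explained above amount to a bounds check, a scaled table index and an indirect call. Here is a user-space sketch of that logic in C. Everything in it is a model rather than kernel code: the names sysent_entry, sysent_offset, dispatch and toy_sum are hypothetical, the table is an ordinary array, and returning -1 for an illegal number merely stands in for the jump to _syscall_ill.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSYSCALL_MODEL 256   /* mirrors NSYSCALL (0x100 in the disassembly) */
#define SYSENT_SHIFT   5     /* mirrors shll $0x5: entries are 32 bytes */

typedef int64_t (*callc_fn)(long, long, long, long, long, long);

/* Same member order as the LP64 struct sysent; void * replaces krwlock_t *. */
struct sysent_entry {
    char            sy_narg;
    unsigned short  sy_flags;
    int           (*sy_call)(void);
    void           *sy_lock;
    callc_fn        sy_callc;
};

/* Line 12: byte offset of an entry, i.e. number << 5. */
size_t sysent_offset(unsigned int sysnum)
{
    return (size_t)sysnum << SYSENT_SHIFT;
}

/* Lines 10-14: check the number, locate the entry, call through sy_callc. */
int64_t dispatch(const struct sysent_entry *table, unsigned int sysnum,
    const long a[6])
{
    const struct sysent_entry *ent;

    assert(table != NULL && a != NULL);
    assert(sizeof(struct sysent_entry) == ((size_t)1 << SYSENT_SHIFT));

    if (sysnum >= NSYSCALL_MODEL)       /* cmpl $NSYSCALL / jae */
        return -1;
    /* leaq sysent(%rax), %rbx: table base plus shifted number */
    ent = (const struct sysent_entry *)
        ((const char *)table + sysent_offset(sysnum));
    if (ent->sy_callc == NULL)          /* empty slot in this toy table */
        return -1;
    return ent->sy_callc(a[0], a[1], a[2], a[3], a[4], a[5]);
}

/* Toy service routine used only to exercise the dispatcher. */
int64_t toy_sum(long a, long b, long c, long d, long e, long f)
{
    return a + b + c + d + e + f;
}
```

For system call number 152 the offset is 152 << 5 = 0x1300, so the entry lives at sysent+0x1300 and its sy_callc pointer at sysent+0x1318.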
o For example (using modctl):
    > ::sizeof struct sysent
    sizeof (struct sysent) = 0x20
    > ::offsetof struct sysent sy_callc
    offsetof (struct sysent, sy_callc) = 0x18, sizeof (...->sy_callc) = 8
    > sysent + 0x20 * 0t152 = J
                    fffffffffc243480
    > fffffffffc243480 + 0x18 = J
                    fffffffffc243498
    > fffffffffc243498/J
    sysent+0x1318:  fffffffffbcad4e0
    > fffffffffbcad4e0::whatis
    fffffffffbcad4e0 is modctl, in genunix's text segment
    > fffffffffbcad4e0/i
    modctl:
    modctl:         pushq  %rbp
    > sysent + 0x20 * 0t152 ::print -Ta struct sysent
    fffffffffc243480 struct sysent {
        fffffffffc243480 char sy_narg = '\006'
        fffffffffc243482 unsigned short sy_flags = 0
        fffffffffc243488 int (*)() sy_call = 0
        fffffffffc243490 krwlock_t *sy_lock = 0
        fffffffffc243498 int64_t (*)() sy_callc = modctl
    }
    >
Once sys_syscall() has located the service routine (computed, of course, from the system call number), it transfers control to that routine. The system call arguments are set up in registers rdi, rsi, rdx, rcx, r8 and r9.
Lines 4-9 of the sys_syscall disassembly load the argument values that were saved on the stack into the corresponding registers:
     4 sys_syscall+0x21d:              movq   0x10(%rbp),%rdi
     5 sys_syscall+0x221:              movq   0x18(%rbp),%rsi
     6 sys_syscall+0x225:              movq   0x20(%rbp),%rdx
     7 sys_syscall+0x229:              movq   0x28(%rbp),%rcx
     8 sys_syscall+0x22d:              movq   0x30(%rbp),%r8
     9 sys_syscall+0x231:              movq   0x38(%rbp),%r9
    10 sys_syscall+0x235:              cmpl   $0x100,%eax
    11 sys_syscall+0x23a:              jae    +0x11a <0xfffffffffb8014bb>
    12 sys_syscall+0x240:              shll   $0x5,%eax
    13 sys_syscall+0x243:              leaq   0xfffffffffc242180(%rax),%rbx <sysent>
    14 sys_syscall+0x24a:              call   *0x18(%rbx)
The system call service routine then takes its arguments from the registers, for example:
    > fffffffffbcad4e0::dis
    modctl:         pushq  %rbp
    modctl+1:       movq   %rsp,%rbp
    modctl+4:       subq   $0x30,%rsp
    modctl+8:       movq   %rdi,-0x8(%rbp)
    modctl+0xc:     movq   %rsi,-0x10(%rbp)
    modctl+0x10:    movq   %rdx,-0x18(%rbp)
    modctl+0x14:    movq   %rcx,-0x20(%rbp)
    modctl+0x18:    movq   %r8,-0x28(%rbp)
    modctl+0x1c:    movq   %r9,-0x30(%rbp)
    ...
At this point the arguments that the user program passed to the C library function modctl() have travelled from user space into kernel space, ready for the kernel's modctl() to run. Along the way, sys_syscall() saved them on the stack and then loaded them back. Once the kernel's modctl() finishes, sys_syscall() returns to the user-space modctl() with the sysret instruction.
3 Observing a System Call in Practice
3.1 The Simplest Direct Observation
o On terminal A, start mdb to debug the modinfo command
    root# mdb /usr/sbin/modinfo
    > main:b
    > :r -i 16
    mdb: target stopped at:
    ld.so.1`rtld_db_postinit:       pushq  %rbp
    > :c
    mdb: target stopped at:
    ld.so.1`rtld_db_dlactivity:     pushq  %rbp
    mdb: You've got symbols!
    Loading modules: [ ld.so.1 libc.so.1 libuutil.so.1 ]
    > modctl:b
    > modctl::dis
    libc.so.1`modctl:               movq   %rcx,%r10
    libc.so.1`modctl+3:             movl   $0x98,%eax
    libc.so.1`modctl+8:             syscall
    libc.so.1`modctl+0xa:           jb     -0x126d30 <libc.so.1`__cerror>
    libc.so.1`modctl+0x10:          xorq   %rax,%rax
    libc.so.1`modctl+0x13:          ret
    > :s
    mdb: target stopped at:
    ld.so.1`rtld_db_dlactivity+1:   movq   %rsp,%rbp
    > :s
    mdb: target stopped at:
    libc.so.1`modctl+3:             movl   $0x98,%eax
    > :s
    mdb: target stopped at:
    libc.so.1`modctl+8:             syscall
    > // Before entering kernel mode, look at the registers
    > $r
    %rax = 0x0000000000000098       %r8  = 0x0000000000000000
    %rbx = 0xffff80fdae44e2d0       %r9  = 0x0000000000000000
    %rcx = 0x000000061bf989a0       %r10 = 0x000000061bf989a0
    %rdx = 0xffff80fdae44e2d0       %r11 = 0x00007ff91dcbfe90
    %rsi = 0x0000000000000010       %r12 = 0xffff80fdae44e2d8
    %rdi = 0x0000000000000002       %r13 = 0x0000000000000000
    ...
    > // rax = 0x98 = 152 // system call number of modctl
    > // rdi = 0x2 // cmd MODINFO = 2
    > // rsi = 0x10 = 16 // mod ID
    > // rdx = 0xffff80fdae44e2d0 // struct modinfo *
    > 0xffff80fdae44e2d0::print struct modinfo mi_id mi_name mi_msinfo[0]
    mi_id = 0x10
    mi_name = [ '\0', ... ]
    mi_msinfo[0] = {
        mi_msinfo[0].msi_linkinfo = [ '\001', ... ]
        mi_msinfo[0].msi_p0 = 0xae44e438
    }
    >
    > :s // Do not press Enter yet: pressing Enter executes syscall and enters kernel mode
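o On terminal B (the console), start mdb -K and set the breakpoint modctl:b. Then, as soon as Enter is pressed after > :s on terminal A, execution enters kernel mode and can be observed on terminal B
    root# mdb -K
    kmdb: target stopped at:
    kmdb_enter+0xb: movq   %rax,%rdi
    [5]> modctl:b
    [5]> :c
    root#
o Press Enter on terminal A
    > :s
[the cursor blinks here]; the user program is suspended!
At the same time, terminal B enters kernel mode
    root# mdb -K
    kmdb: target stopped at:
    kmdb_enter+0xb: movq   %rax,%rdi
    [5]> modctl:b
    [5]> :c
    root#
    kmdb: stop at modctl
    kmdb: target stopped at:
    modctl:         pushq  %rbp
    [22]>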
o Inspect the call arguments on terminal B
    [22]> $r
    %rax = 0x0000000000001300                 %r9  = 0x0000000000000000
    %rbx = 0xfffffffffc243480 sysent+0x1300   %r10 = 0xffff80fdae44e5d8
    %rcx = 0x000000061bf989a0                 %r11 = 0xffff80fdae44e268
    %rdx = 0xffff80fdae44e2d0                 %r12 = 0x0000000000000000
    %rsi = 0x0000000000000010                 %r13 = 0x0000000000000000
    %rdi = 0x0000000000000002                 %r14 = 0xffffa1c009e87000
    %r8  = 0x0000000000000000                 %r15 = 0xffffa1c00458a4c0
    %rip = 0xfffffffffbcad4e0 modctl
    ...
    > // rdi = 0x2
    > // rsi = 0x10
    > // rdx = 0xffff80fdae44e2d0
    [22]> 0xffff80fdae44e2d0::print struct modinfo mi_id mi_name
    mi_id = 0x10
    mi_name = [ '\0', ...]
o On terminal B, enter :z and then :c to resume
    [22]>
    [22]> :z
    [22]> :c
At the same time, the suspended :s on terminal A resumes,
    > :s
    mdb: target stopped at:
    libc.so.1`modctl+0xa:           jb     -0x126d30 <libc.so.1`__cerror>
    >
which shows that the system call has completed and control has returned to user mode.
o On terminal A, inspect the contents at address 0xffff80fdae44e2d0; the data we expect should by now have been filled in by the system call
    > 0xffff80fdae44e2d0::print struct modinfo
    {
        mi_info = 5
        mi_state = 3
        mi_id = 0x10
        mi_nextid = 0x10
        mi_base = 0xfffffffffbdbc2d8
        mi_size = 0x5d198
        mi_rev = 1
        mi_loadcnt = 1
        mi_name = [ "pcie" ]
        mi_msinfo = [
            {
                msi_linkinfo = [ "PCI Express Framework Module" ]
                msi_p0 = 0xffffffff
            },
            ...
Note: mi_base, mi_size, mi_name and mi_msinfo[0] have been filled in as expected.
    ID LOADADDR SIZE INFO REV NAMEDESC
    16 fffffffffbdbc2d8 5d198 -- 1 pcie (PCI Express Framework Module)
    mdb: target has terminated
    >
We have now observed the whole round trip from user mode into kernel mode and back to user mode. The mystery surrounding system calls has been lifted. Next, DTrace is used to look more closely at what the kernel service routine does.
3.2 Observing Kernel Behavior with DTrace
o On terminal A, start mdb to debug the modinfo command
    root# mdb /usr/sbin/modinfo
    > main:b
    > :r -i 16
    mdb: target stopped at:
    ld.so.1`rtld_db_postinit:       pushq  %rbp
    > :c
    mdb: target stopped at:
    ld.so.1`rtld_db_dlactivity:     pushq  %rbp
    mdb: You've got symbols!
    Loading modules: [ ld.so.1 libc.so.1 libuutil.so.1 ]
    > modctl:b
    > :c
    mdb: stop at libc.so.1`modctl
    mdb: target stopped at:
    libc.so.1`modctl:               movq   %rcx,%r10
    > :s
    mdb: target stopped at:
    libc.so.1`modctl+3:             movl   $0x98,%eax
    > :s
    mdb: target stopped at:
    libc.so.1`modctl+8:             syscall
    >
    > $r
    %rax = 0x0000000000000098       %r8  = 0x0000000000000000
    %rbx = 0xffff80dbdcb43120       %r9  = 0x0000000000000000
    %rcx = 0x0000000f25c48f90       %r10 = 0x0000000f25c48f90
    %rdx = 0xffff80dbdcb43120       %r11 = 0x00007ffdc84bfe90
    %rsi = 0x0000000000000010       %r12 = 0xffff80dbdcb43128
    %rdi = 0x0000000000000002       %r13 = 0x0000000000000000
    ...
    > 0xffff80dbdcb43120::print struct modinfo mi_id mi_name
    mi_id = 0x10
    mi_name = [ '\0', ...]
o On terminal B, start the dtrace script
    root# ./fook.d
The code of fook.d is as follows:
    #!/usr/sbin/dtrace -qs

    syscall::modctl:entry
    /execname == "modinfo"/
    {
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X, 0x%X, 0x%X, 0x%X\n",
            arg0, arg1, arg2, arg3, arg4, arg5);
        stack();
        printf("\n---------------------------------------------------------\n");
        self->n = 1;
    }

    syscall::modctl:return
    /execname == "modinfo"/
    {
        self->n = 0;
    }

    fbt::modctl:entry
    /self->n == 1/
    {
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X, 0x%X, 0x%X, 0x%X\n",
            arg0, arg1, arg2, arg3, arg4, arg5);
        stack();
        printf("\n---------------------------------------------------------\n");
    }

    fbt::modctl_modinfo:entry
    /self->n == 1/
    {
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X\n", arg0, arg1);
        stack();

        self->mip = (struct modinfo *)arg1;

        /*
         * usr/src/uts/common/sys/modctl.h#421
         * struct modinfo {
         *     int mi_info;                  // Flags for info wanted
         *     int mi_state;                 // Flags for module state
         *     int mi_id;                    // id of this loaded module
         *     int mi_nextid;                // id of next module or -1
         *     caddr_t mi_base;              // virtual addr of text
         *     size_t mi_size;               // size of module in bytes
         *     int mi_rev;                   // loadable modules rev
         *     int mi_loadcnt;               // # of times loaded
         *     char mi_name[MODMAXNAMELEN];  // name of module
         *     struct modspecific_info mi_msinfo[MODMAXLINK];
         *                                   // mod specific info
         * };
         *
         * struct modspecific_info {
         *     char msi_linkinfo[MODMAXLINKINFOLEN]; // name in linkage struct
         *     int msi_p0;                   // module specific information
         * };
         *
         * usr/src/cmd/modload/modinfo.c#248
         * static boolean_t
         * print_mod_cb(ofmt_arg_t *ofarg, char *buf, uint_t bufsize)
         *
         * XXX: Here we cannot use self->mip->mi_id,... directly, so copyin !
         */
        self->mi = (struct modinfo *)(copyin((uintptr_t)(self->mip),
            sizeof(struct modinfo)));
        printf("\n");
        printf("ENT: ID mi->mi_id = %d\n",
            self->mi->mi_id);
        printf("ENT: LOADADDR mi->mi_base = %p\n",
            self->mi->mi_base);
        printf("ENT: SIZE mi->mi_size = %x\n",
            self->mi->mi_size);
        printf("ENT: INFO mi->mi_mi_msinfo[0].msi_p0 = %d\n",
            self->mi->mi_msinfo[0].msi_p0);
        printf("ENT: REV mi->mi_rev = %x\n",
            self->mi->mi_rev);
        printf("ENT: NAME mi->mi_name = %s\n",
            stringof(self->mi->mi_name));
        printf("ENT: DESC mi->mi_msinfo[0].msi_linkinfo = %s\n",
            stringof(self->mi->mi_msinfo[0].msi_linkinfo));
        printf("\n---------------------------------------------------------\n");
    }

    fbt::modctl_modinfo:return
    /self->n == 1/
    {
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);

        stack();

        /*
         * XXX: Here we cannot use self->mip->mi_id,... directly, so copyin !
         */
        self->mi = (struct modinfo *)(copyin((uintptr_t)(self->mip),
            sizeof(struct modinfo)));
        printf("\n");
        printf("RET: ID mi->mi_id = %d\n",
            self->mi->mi_id);
        printf("RET: LOADADDR mi->mi_base = %p\n",
            self->mi->mi_base);
        printf("RET: SIZE mi->mi_size = %x\n",
            self->mi->mi_size);
        printf("RET: INFO mi->mi_mi_msinfo[0].msi_p0 = %d\n",
            self->mi->mi_msinfo[0].msi_p0);
        printf("RET: REV mi->mi_rev = %x\n",
            self->mi->mi_rev);
        printf("RET: NAME mi->mi_name = %s\n",
            stringof(self->mi->mi_name));
        printf("RET: DESC mi->mi_msinfo[0].msi_linkinfo = %s\n",
            stringof(self->mi->mi_msinfo[0].msi_linkinfo));
        printf("\n---------------------------------------------------------\n");

        self->mip = 0;
    }

    fbt::copyin:entry,
    fbt::copyout:entry
    /self->n == 1/
    {
        printf("%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X\n", arg0, arg1, arg2);
        stack();
        printf("\n---------------------------------------------------------\n");
    }
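Note: kernel memory and user memory are strictly separated. When the kernel needs to read user memory it must use copyin(); conversely, when it needs to pass data back to user space it must use copyout().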
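The copyin/copyout pattern can be modelled in user space to make the data flow concrete: copy the caller's structure into a private buffer, fill it in, and copy it back. The sketch below is only a simulation. The model_ functions are hypothetical stand-ins built on memcpy (the real kernel routines validate the user address and can fail), and the three array-size constants are assumed values, chosen because they make the structure 0x1B0 bytes, the length that appears in the copyin and copyout arguments traced in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed values; see the lead-in. */
#define MODMAXNAMELEN      32
#define MODMAXLINK         10
#define MODMAXLINKINFOLEN  32

struct modspecific_info_model {
    char msi_linkinfo[MODMAXLINKINFOLEN];
    int  msi_p0;
};

/* Same member order as the struct modinfo quoted in fook.d. */
struct modinfo_model {
    int    mi_info, mi_state, mi_id, mi_nextid;
    char  *mi_base;                       /* caddr_t in the kernel header */
    size_t mi_size;
    int    mi_rev, mi_loadcnt;
    char   mi_name[MODMAXNAMELEN];
    struct modspecific_info_model mi_msinfo[MODMAXLINK];
};

/* Hypothetical stand-ins for copyin()/copyout(): plain memcpy, never fail. */
static int model_copyin(const void *uaddr, void *kaddr, size_t n)
{
    memcpy(kaddr, uaddr, n);
    return 0;
}

static int model_copyout(const void *kaddr, void *uaddr, size_t n)
{
    memcpy(uaddr, kaddr, n);
    return 0;
}

/* Copy the request in, fill in illustrative values, copy the result out. */
int model_modinfo(struct modinfo_model *user_buf)
{
    struct modinfo_model kbuf;

    assert(user_buf != NULL);
    if (model_copyin(user_buf, &kbuf, sizeof(kbuf)) != 0)
        return -1;
    kbuf.mi_size = 0x5d198;               /* values taken from the trace */
    kbuf.mi_rev = 1;
    strncpy(kbuf.mi_name, "pcie", sizeof(kbuf.mi_name) - 1);
    kbuf.mi_name[sizeof(kbuf.mi_name) - 1] = '\0';
    return model_copyout(&kbuf, user_buf, sizeof(kbuf));
}
```

Fields the caller set before the call, such as mi_id, survive the round trip because the whole structure is copied in before anything is modified.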
o On terminal A, run :s (which executes the syscall instruction)
    > :s
    mdb: target stopped at:
    libc.so.1`modctl+0xa:           jb     -0x126d30 <libc.so.1`__cerror>
    >
At the same time, terminal B prints the following
    root# ./fook.d
    syscall::modctl :entry
    args: 0x2, 0x10, 0xFFFF80DBDCB43120, 0xF25C48F90, 0x0, 0x0

                  unix`sys_syscall+0x24d

    ---------------------------------------------------------
    fbt:genunix:modctl :entry
    args: 0x2, 0x10, 0xFFFF80DBDCB43120, 0xF25C48F90, 0x0, 0x0

                  genunix`dtrace_systrace_syscall+0x14d
                  unix`sys_syscall+0x24d

    ---------------------------------------------------------
    fbt:genunix:modctl_modinfo :entry
    args: 0x10, 0xFFFF80DBDCB43120

                  genunix`modctl+0x4e7
                  genunix`dtrace_systrace_syscall+0x14d
                  unix`sys_syscall+0x24d

    ENT: ID mi->mi_id = 16
    ENT: LOADADDR mi->mi_base = 0
    ENT: SIZE mi->mi_size = 0
    ENT: INFO mi->mi_mi_msinfo[0].msi_p0 = -592170360
    ENT: REV mi->mi_rev = 0
    ENT: NAME mi->mi_name =
    ENT: DESC mi->mi_msinfo[0].msi_linkinfo = ##

    ---------------------------------------------------------
    fbt:unix:copyin :entry
    args: 0xFFFF80DBDCB43120, 0xFFFFFFFC81AEDA30, 0x1B0

                  genunix`modctl_modinfo+0xa0
                  genunix`modctl+0x4e7
                  genunix`dtrace_systrace_syscall+0x14d
                  unix`sys_syscall+0x24d

    ---------------------------------------------------------
    fbt:unix:copyout :entry
    args: 0xFFFFFFFC81AEDA30, 0xFFFF80DBDCB43120, 0x1B0

                  genunix`modctl_modinfo+0x1e6
                  genunix`modctl+0x4e7
                  genunix`dtrace_systrace_syscall+0x14d
                  unix`sys_syscall+0x24d

    ---------------------------------------------------------
    fbt:genunix:modctl_modinfo :return

                  genunix`modctl+0x4e7
                  genunix`dtrace_systrace_syscall+0x14d
                  unix`sys_syscall+0x24d

    RET: ID mi->mi_id = 16
    RET: LOADADDR mi->mi_base = fffffffffbdbc2d8
    RET: SIZE mi->mi_size = 5d198
    RET: INFO mi->mi_mi_msinfo[0].msi_p0 = -1
    RET: REV mi->mi_rev = 1
    RET: NAME mi->mi_name = pcie
    RET: DESC mi->mi_msinfo[0].msi_linkinfo = PCI Express Framework Module

    ---------------------------------------------------------
o On terminal A, inspect the contents at address 0xffff80dbdcb43120
    > 0xffff80dbdcb43120::print struct modinfo
    {
        mi_info = 5
        mi_state = 3
        mi_id = 0x10
        mi_nextid = 0x10
        mi_base = 0xfffffffffbdbc2d8
        mi_size = 0x5d198
        mi_rev = 1
        mi_loadcnt = 1
        mi_name = [ "pcie" ]
        mi_msinfo = [
            {
                msi_linkinfo = [ "PCI Express Framework Module" ]
                msi_p0 = 0xffffffff
            },
            ...
This output matches the data observed with DTrace.
o On terminal A, run a dtrace script to observe the user-mode call stack
    root# ./foou.d -c "modinfo -i 16"
    ID LOADADDR SIZE INFO REV NAMEDESC
    16 fffffffffbdbc2d8 5d198 -- 1 pcie (PCI Express Framework Module)

    pid22782:libc.so.1:modctl :entry
    args: 0x2, 0x10, 0xFFFF80F02D9A1730, 0x927D90DF0, 0x0

                  libc.so.1`modctl
                  modinfo`main+0x3b6
                  modinfo`0x7ffe48701b34
The code of foou.d is as follows:
    #!/usr/sbin/dtrace -qs

    pid$target::modctl:entry
    /execname == "modinfo"/
    {
        printf("\n%s:%s:%s :%s\n", probeprov, probemod, probefunc, probename);
        printf("args: 0x%X, 0x%X, 0x%X, 0x%X, 0x%X\n",
            arg0, arg1, arg2, arg3, arg4);
        ustack();
    }
o On terminal B, observe the kernel-mode call stack
(1) Start DTrace on terminal B,
    root# dtrace -n "fbt::mod_infonull:entry {stack();}"
    dtrace: description 'fbt::mod_infonull:entry ' matched 1 probe
(2) Run the command modinfo -i 16 on terminal A
    root# modinfo -i 16
    ID LOADADDR SIZE INFO REV NAMEDESC
    16 fffffffffbdbc2d8 5d198 -- 1 pcie (PCI Express Framework Module)
At the same time, the output on terminal B is:
    root# dtrace -n "fbt::mod_infonull:entry {stack();}"
    dtrace: description 'fbt::mod_infonull:entry ' matched 1 probe
    CPU     ID                    FUNCTION:NAME
      3  54642               mod_infonull:entry
                  genunix`mod_info+0x66
                  pcie`_info+0x1f
                  genunix`mod_getinfo+0x5a
                  genunix`modinfo+0x125
                  genunix`modctl_modinfo+0xd5
                  genunix`modctl+0x4e7
                  unix`sys_syscall+0x24d

    ^C
4 Programming Directly with System Calls
The following simple example shows that, once the system call number and its arguments are in place, the syscall instruction alone is enough to make a system call.
    BITS 64

    SECTION .data

    Hello:      db "Hello world!", 10
    len_Hello:  equ $-Hello

    SECTION .text

    global _start

    _start:
        mov rdi, 1          ; fd = stdout
        mov rsi, Hello      ; *buf = Hello
        mov rdx, len_Hello  ; count = len_Hello
        mov rax, 4          ; write system call number on Solaris
        syscall

        mov rdi, 0          ; status = 0 (exit normally)
        mov rax, 1          ; exit system call number on Solaris
        syscall
Assemble, link and run it as follows:
    root# yasm -f elf64 foo.asm
    root# ld -o foo foo.o
    root# ./foo
    Hello world!
    root# echo $?
    0
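The same direct call can be written in C through the generic syscall() entry point of the C library, which avoids hard-coding the number: SYS_write comes from <sys/syscall.h> and resolves to whatever value the target platform uses. This is a sketch with a made-up helper name (raw_write), not a replacement for write(2).

```c
#define _GNU_SOURCE        /* exposes syscall() on glibc; harmless elsewhere */
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>   /* SYS_write */
#include <unistd.h>        /* syscall() */

/*
 * Hypothetical helper: writes the NUL-terminated string msg to fd by
 * number rather than through the write() wrapper. Returns the number
 * of bytes written, or -1 on error.
 */
long raw_write(int fd, const char *msg)
{
    assert(msg != NULL);
    return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Calling raw_write(1, "Hello world!\n") prints the same line as the assembly program and returns 13, the length of the string.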
Finally, for how to add a system call to Solaris, see Appendix B, "Adding a System Call to Solaris", of the book Solaris Internals (2nd edition).