Trace series 4 - kprobe study notes

0. Preface

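This article follows the 阅码场 video course 《Linux内核tracers的实现原理与应用》 ("Implementation Principles and Applications of Linux Kernel Tracers"), practiced on aarch64. By observing how hook functions are created and how instructions are replaced, we can understand how tracing works. As in the earlier notes, blk_update_request is used as the example to explain how kprobe works. The kprobe discussed here is implemented on top of trace events and also uses the ftrace framework.
First, a rough look at how tracepoints, trace events and kprobes relate to each other: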

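  1. A tracepoint is placed in the code and needs a probe function associated with it. When the tracepoint is enabled and execution reaches it, the associated probe function is called; when the probe function returns, the original function continues. Using a tracepoint takes three steps:
    (1) Declare the tracepoint with the DECLARE_TRACE macro in the header include/trace/events/subsys.h
    (2) Create the tracepoint with DEFINE_TRACE in the source file subsys/file.c
    (3) Associate the tracepoint with a probe through register_trace_subsys_eventname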

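  2. A trace event is built on top of a tracepoint. A single macro implements the first two steps above and defines the register and unregister functions; registration is triggered with echo 1 > events/subsys_event. From kernel 4.0 onward trace events are encouraged and direct use of tracepoints is discouraged, which can be inferred from the tracepoint samples having been removed in 3.9. With trace events you do not have to define the probe function yourself, as you do with raw tracepoints, where the probe functions usually have to be defined in a module and then loaded. Instead, the TRACE_EVENT macro generates probe functions of a uniform format through a set of fairly complex macros. In exchange, the user has to specify the format in which trace data is stored in the ring buffer and the format in which it is printed.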

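  3. A kprobe can be thought of as a dynamic trace event: a trace event can be set on any function except those annotated with __kprobes or nokprobe_inline and those marked with NOKPROBE_SYMBOL. The kernel option CONFIG_KPROBE_EVENTS=y must be enabled first.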

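kprobe can be used in two main ways: through a loadable module, or through the debugfs interface.
(1) Module: take the kernel's kprobe_example as an example. Declare a kprobe structure, fill in its key members (symbol_name, pre_handler, post_handler), and register it with register_kprobe. After kprobe_example.ko has been loaded with insmod, the pre_handler and post_handler callbacks run every time the system starts a new process, for example when running ls or cat.
(2) debugfs: add a kprobe trace point by writing to /sys/kernel/debug/tracing/kprobe_events, then enable it by writing to /sys/kernel/debug/tracing/events/kprobes//enable, where the empty path component stands for the event name (blk_update in this article).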

Kernel version: 5.10
Platform: arm64

1. Overall principle of kprobe

Note: the following is adapted from kprobe原理解析(二) ("kprobe Principles Explained, Part 2")

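A kprobe works roughly as follows:

  1. Register the kprobe. Each registered kprobe corresponds to one kprobe structure, which records the probe point (its location) and original_opcode, the instruction originally at that location;
  2. Replace the original instruction. When the kprobe is enabled, the instruction at the probe point is replaced with an exception-generating (BRK) instruction, so the CPU traps into exception state when it reaches the probe point;
  3. Run pre_handler. After entering exception state, pre_handler runs first; then, using the single-step facility provided by the CPU, the relevant registers are set up so that the next instruction to execute is the original instruction of the probe point, and the exception returns;
  4. Trap again. Because the previous step set the single-step registers, the CPU traps a second time as soon as original_opcode has executed. Single-step is then cleared, post_handler runs, and the exception returns safely.
    Steps 2, 3 and 4 make up one kprobe hit. The basic idea is to expand the execution of a single instruction into three stages: kprobe->pre_handler => instruction => kprobe->post_handler.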
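The three-stage expansion described above can be illustrated with a small user-space toy. This is only a sketch of the ordering, not kernel code: toy_kprobe, toy_probe_hit and the string tags are hypothetical names, and the "original instruction" is modeled as an ordinary function call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of one probe hit: each stage appends a tag to a buffer
 * so the order of execution can be inspected afterwards. */
struct toy_kprobe {
    void (*pre_handler)(char *out, size_t cap);
    void (*post_handler)(char *out, size_t cap);
};

/* Append tag to out without writing past cap bytes. */
static void log_stage(char *out, size_t cap, const char *tag)
{
    size_t used = strlen(out);

    if (used + strlen(tag) < cap)
        strcat(out, tag);
}

static void toy_pre(char *out, size_t cap)  { log_stage(out, cap, "pre;"); }
static void toy_post(char *out, size_t cap) { log_stage(out, cap, "post;"); }

/* Stands in for the original instruction saved at registration. */
static void toy_original_insn(char *out, size_t cap)
{
    log_stage(out, cap, "insn;");
}

/* One hit: first trap -> pre_handler, then the original instruction,
 * then second trap -> post_handler. Missing handlers are skipped. */
static void toy_probe_hit(const struct toy_kprobe *kp, char *out, size_t cap)
{
    if (kp->pre_handler)
        kp->pre_handler(out, cap);
    toy_original_insn(out, cap);
    if (kp->post_handler)
        kp->post_handler(out, cap);
}
```

Calling toy_probe_hit with both handlers set and an empty buffer leaves the three tags in the buffer in the order pre, insn, post.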

2. kprobe domain model

[Figure: kprobe domain model]

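Once a kprobe has been created, the domain model shown above is in place:

  • trace_kprobe: contains both a kretprobe and a trace_probe

  • kretprobe: describes a kretprobe

  • trace_probe: describes the trace side of the kprobe; it shares the kprobe structure with kretprobe, and its args array holds the arguments echoed into the control file

  • kprobe: the core structure describing a kprobe; it is linked into the global hash table kprobe_table

  • trace_probe_event: describes the trace event of the kprobe; it contains the core structures trace_event_class and trace_event_call

  • trace_event_class: describes the class of a trace event

  • trace_event_call: a wrapper around trace_event; it is linked into the global ftrace_events list

  • trace_event: mainly references the trace_event_functions structure, which defines the trace_event callbacks; a trace_event is linked into the global ftrace_event_list

  • trace_event_functions: defines the trace_event callbacks

  • trace_array: the top-level structure describing a trace instance. Currently ftrace_trace_arrays contains only one global trace_array, global_trace, so each trace_event_call corresponds to one trace_array; trace_array->event_dir points to the /sys/kernel/debug/tracing/events directory

  • trace_event_file: manages all files under the kprobe trace event. It points to trace_event_call through event_call, to trace_subsystem_dir through system, and to trace_array through tr, so trace_event_file, trace_event_call and trace_array correspond one to one; trace_event_file is linked into the events list of trace_array through list

  • trace_subsystem_dir: manages the directory of the kprobe trace events. It points to the managed directory node (/sys/kernel/debug/tracing/events/kprobes) through entry, to trace_array through tr, and is linked into the systems list of trace_array through list. In this example trace_subsystem_dir represents the events/kprobes directory

trace_event_file, trace_array, trace_event_call and trace_subsystem_dir correspond one to one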
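The embedding of kretprobe and trace_probe inside trace_kprobe is what lets the kernel walk from an inner structure back to its owner with container_of. The following user-space sketch shows that pattern. Only the nesting mirrors the list above; all fields other than the embedded structs are placeholders, and tk_from_tp and tk_from_kp are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: only the nesting follows the notes above;
 * the scalar fields are placeholders. */
struct kprobe { unsigned long addr; };
struct kretprobe { struct kprobe kp; int maxactive; };
struct trace_probe { int nr_args; };
struct trace_kprobe {
    struct kretprobe rp;    /* its kp member is used for plain kprobes too */
    struct trace_probe tp;
};

/* User-space equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the owning trace_kprobe from its embedded trace_probe. */
static struct trace_kprobe *tk_from_tp(struct trace_probe *tp)
{
    return container_of(tp, struct trace_kprobe, tp);
}

/* Recover the owning trace_kprobe from the innermost kprobe. */
static struct trace_kprobe *tk_from_kp(struct kprobe *kp)
{
    return container_of(kp, struct trace_kprobe, rp.kp);
}
```

Given a struct trace_kprobe tk, both tk_from_tp(&tk.tp) and tk_from_kp(&tk.rp.kp) return &tk, which is how a handler that only receives the inner pointer can reach the trace-side data.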

3. Creating a kprobe

kprobe works differently from the function tracer and function graph tracer described earlier, but it still reuses the ftrace framework. The following command ends up in probes_write:

ubuntu@VM-0-9-ubuntu:~$ echo 'p:blk_update blk_update_request request=$arg1 status=$arg2:u8 bytes=$arg3:u32' > /sys/kernel/debug/tracing/kprobe_events

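Here p stands for kprobe; for a kretprobe use r instead;
blk_update is the name of this trace event and can be chosen freely;
$arg1 is the first function argument to trace; together with the other arguments it is saved in the trace_probe.args array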

static ssize_t probes_write(struct file *file, const char __user *buffer,size_t count, loff_t *ppos)
    \--trace_parse_run_command(file, buffer, count, ppos,create_or_delete_trace_kprobe);
           |--char *kbuf, *buf, *tmp
           |  size_t done = 0
           |  //allocate a buffer for the event command, in this example:
           |  //'p:blk_update blk_update_request request=$arg1 status=$arg2:u8 bytes=$arg3:u32'
           |--kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL)
           |  //copy the event command from user space to kernel space
           |--copy_from_user(kbuf, buffer + done, size)
           |--buf = kbuf;
           |  //parse the event command
           \--trace_run_command(buf, create_or_delete_trace_kprobe)
                 |  //split the command on whitespace into argv; each argv[i] holds one token
                 |  //argv[0]:"p:blk_update",    argv[1]:"blk_update_request", argv[2]:"request=$arg1",
                 |  //argv[3]: "status=$arg2:u8",argv[4]:"bytes=$arg3:u32"
                 |--argv = argv_split(GFP_KERNEL, buf, &argc)
                 \--create_or_delete_trace_kprobe(argc, argv)

trace_parse_run_command mainly parses the kprobe command. The resulting argv as seen in gdb:

(gdb) p *argv@20
$6 = {0xffff000007467700 "p:blk_update", 0xffff00000746770d "blk_update_request", 
      0xffff000007467720 "request=$arg1", 0xffff00000746772e "status=$arg2:u8",
      0xffff00000746773e "bytes=$arg3:u32", 0x0, 0
.....
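The whitespace split that produces this argv can be sketched in user space as follows. This is a minimal illustration, assuming an in-place split is acceptable; split_command and MAX_ARGS are hypothetical names, and the kernel's argv_split works on an allocated copy instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 16

/*
 * Split buf in place on whitespace and store up to MAX_ARGS token
 * pointers in argv. Returns the number of tokens found.
 * (Hypothetical helper, not the kernel implementation.)
 */
static int split_command(char *buf, char *argv[MAX_ARGS])
{
    int argc = 0;
    char *p = buf;

    while (*p && argc < MAX_ARGS) {
        while (*p && isspace((unsigned char)*p))
            p++;                    /* skip leading whitespace */
        if (!*p)
            break;
        argv[argc++] = p;           /* start of a token */
        while (*p && !isspace((unsigned char)*p))
            p++;
        if (*p)
            *p++ = '\0';            /* terminate the token */
    }
    return argc;
}
```

Applied to the command from this article, the function yields five tokens, matching the five non-NULL argv entries in the gdb output.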

3.1 create_or_delete_trace_kprobe

create_or_delete_trace_kprobe(argc, argv)
   |--trace_kprobe_create(argc, (const char **)argv)
          |--struct trace_kprobe *tk = NULL;
          |  const char *event = NULL
          |  //initialize the global trace_probe_log structure: subsystem name, argument strings and argument count
          |--trace_probe_log_init("trace_kprobe", argc, argv)
          |  //locate the event name, blk_update in this example
          |--event = strchr(&argv[0][1], ':');
          |  //try to parse argv[1] as a numeric address; "blk_update_request" is not a number,
          |  //so kstrtoul fails and argv[1] is treated as a symbol name
          |--if (kstrtoul(argv[1], 0, (unsigned long *)&addr))
          |      trace_probe_log_set_index(1)
          |      //the symbol in this example is blk_update_request
          |      symbol = kstrdup(argv[1], GFP_KERNEL)
          |      //check that symbol+offset is a function entry in the symbol table; offset is 0 here, so this checks that blk_update_request exists
          |      if (kprobe_on_func_entry(NULL, symbol, offset))
          |          flags |= TPARG_FL_FENTRY
          |--trace_probe_log_set_index(0);
          |  //parse the event name
          |--traceprobe_parse_event_name(&event, &group, buf,event - argv[0])
          |  //allocate the trace_kprobe structure
          |--tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive,
          |             argc - 2, is_return);
          |      |--if (is_return)
          |             tk->rp.handler = kretprobe_dispatcher;
          |         else
          |             tk->rp.kp.pre_handler = kprobe_dispatcher
          |  //parse the event arguments; results go into the tk->tp.args array, where
          |  //args[i].name is the part left of "=" and args[i].comm is the part right of "="
          |--for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++)
          |      traceprobe_parse_probe_arg(&tk->tp, i, tmp, flags)
          |  //set the trace event print format; afterwards tk->tp.event->call.print_fmt is:
          |  //"(%lx) request=0x%Lx status=%u bytes=%u", REC->__probe_ip, REC->request, REC->status, REC->bytes
          |--traceprobe_set_print_fmt(&tk->tp, is_return);
          |  //register the kprobe event
          \--register_trace_kprobe(tk)       

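In create_or_delete_trace_kprobe, gdb shows that argv is exactly what was echoed above. The function registers the kprobe event, the trace_event_call and the kprobe; for how the three relate, see section 2, kprobe domain model. The key pre_handler is kprobe_dispatcher; the print format is also set and the trace_kprobe is registered.

  • alloc_trace_kprobe: allocates the trace_kprobe and sets its handler: pre_handler is set to kprobe_dispatcher for a kprobe, or rp.handler to kretprobe_dispatcher for a kretprobe

  • traceprobe_set_print_fmt: sets the print format of the kprobe event

  • register_trace_kprobe: initializes the trace_event_call nested in the trace_kprobe (trace_kprobe -> trace_probe -> trace_probe_event -> trace_event_call), and registers the trace_event, the trace_event_call and the kprobe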
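To make the parsing steps concrete, here is a user-space sketch that splits argv[0] into probe type and event name, and splits one argument at "=" into the name and comm parts mentioned in the call tree. It is a simplification under stated assumptions: parse_head, parse_arg and both struct names are hypothetical, group prefixes and offsets are not handled, and the text after ':' is taken as the event name unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct probe_head { bool is_return; char event[64]; };
struct probe_arg_sketch { char name[32]; char comm[64]; };

/* Parse "p:EVENT" or "r:EVENT". Returns 0 on success, -1 on bad input. */
static int parse_head(const char *s, struct probe_head *h)
{
    const char *colon;

    if (s[0] != 'p' && s[0] != 'r')
        return -1;
    colon = strchr(s + 1, ':');
    if (!colon || colon[1] == '\0')
        return -1;
    h->is_return = (s[0] == 'r');
    snprintf(h->event, sizeof(h->event), "%s", colon + 1);
    return 0;
}

/* Split "NAME=FETCHARG" at the first '='. Returns 0 on success. */
static int parse_arg(const char *s, struct probe_arg_sketch *a)
{
    const char *eq = strchr(s, '=');
    size_t n;

    if (!eq || eq == s || eq[1] == '\0')
        return -1;
    n = (size_t)(eq - s);
    if (n >= sizeof(a->name))
        return -1;
    memcpy(a->name, s, n);
    a->name[n] = '\0';
    /* Everything right of '=' is kept verbatim, including ":u8". */
    snprintf(a->comm, sizeof(a->comm), "%s", eq + 1);
    return 0;
}
```

For the command used in this article, parse_head("p:blk_update", ...) gives a non-return probe named blk_update, and parse_arg("status=$arg2:u8", ...) gives name "status" and comm "$arg2:u8".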

3.1.1 register_trace_kprobe

register_trace_kprobe(tk)
     |  //register the kprobe event
     |--register_kprobe_event(tk)
     |      |  //initialize the trace_event_call
     |      |--init_trace_event_call(tk)
     |      |      |--if (trace_kprobe_is_return(tk))
     |      |      |      call->event.funcs = &kretprobe_funcs
     |      |      |      call->class->fields_array = kretprobe_fields_array;
     |      |      |  else
     |      |      |      call->event.funcs = &kprobe_funcs;
     |      |      |      call->class->fields_array = kprobe_fields_array
     |      |      |--call->flags = TRACE_EVENT_FL_KPROBE
     |      |      |  //this function is the callback for enabling/disabling the kprobe trace event
     |      |      \--call->class->reg = kprobe_register
     |      \--trace_probe_register_event_call(&tk->tp)
     |            |--struct trace_event_call *call = trace_probe_event_call(tp)
     |            |  //register the trace_event
     |            |--register_trace_event(&call->event)
     |            |     |--INIT_LIST_HEAD(&event->list)
     |            |     |--event->funcs->raw = trace_nop_print;
     |            |     |  event->funcs->hex = trace_nop_print
     |            |     |  event->funcs->binary = trace_nop_print
     |            |     \--hlist_add_head(&event->node, &event_hash[key]);
     |            |  //register the trace_event_call
     |            \--trace_add_event_call(call)
     |  //register the kprobe
     |-- __register_trace_kprobe(tk)
     |        |--if (trace_kprobe_is_return(tk))
     |               register_kretprobe(&tk->rp)
     |           else
     |               register_kprobe(&tk->rp.kp)
     |                   |  //look up the kprobe's symbol name (blk_update_request) in the symbol table to get its address; p is the kprobe
     |                   |--addr = kprobe_addr(p)
     |                   |  //addr is the address of the blk_update_request function
     |                   |--p->addr = addr;
     |                   |  p->flags &= KPROBE_FLAG_DISABLED;
     |                   |  //enabling the kprobe event replaces the entry instruction of the probed function with brk, so the original instruction must be saved
     |                   |  //no kprobe has been registered at this address yet, so old_p is NULL
     |                   |--old_p = get_kprobe(p->addr)
     |                   |--prepare_kprobe(p)
     |                   |--INIT_HLIST_NODE(&p->hlist)
     |                   |--hlist_add_head_rcu(&p->hlist, &kprobe_table[hash_ptr(p->addr, KPROBE_HASH_BITS)]);
     \--dyn_event_add(&tk->devent)
           \--list_add_tail(&ev->list, &dyn_event_list)       

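register_trace_kprobe: initializes the trace_event_call nested in the trace_kprobe (trace_kprobe -> trace_probe -> trace_probe_event -> trace_event_call), and registers the trace_event, the trace_event_call and the kprobe

  • init_trace_event_call: initializes the trace_event_call reached through trace_kprobe -> trace_probe -> trace_probe_event

  • trace_probe_register_event_call: calls register_trace_event and trace_add_event_call.
    (1) register_trace_event adds the trace_event to the global event_hash hash table;
    (2) trace_add_event_call adds the trace_event_call to the global ftrace_events list, creates the trace_event_file, creates the blk_update directory under /sys/kernel/debug/tracing/events/kprobes and creates the remaining file nodes inside it

  • __register_trace_kprobe: registers the kprobe, distinguishing kretprobe from kprobe; register_kprobe adds the kprobe to the global kprobe_table hash table
    (1) register_kprobe: adds a hash node for the kprobe to the global kprobe_table, which completes registration; the probe address addr is used as the hash key.
    (1-1) prepare_kprobe: prepares for returning to the original instructions of blk_update_request after the probe fires. p->opcode holds the original entry instruction of blk_update_request; p->ainsn.api.insn points to the slot holding a copy of that entry instruction; p->ainsn.api.restore holds the address of the instruction following the original entry instruction, so that execution resumes there after the breakpoint returns and continues along the original path.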
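The address-keyed registration in kprobe_table can be sketched with a small chained hash table. This is a user-space illustration under assumptions: kprobe_sketch, hash_addr, add_kprobe and the simplified get_kprobe are stand-ins, the bucket count of 64 and the multiplicative constant are modeled on my recollection of the kernel's KPROBE_HASH_BITS and hash_64, and no RCU or locking is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KPROBE_HASH_BITS  6                       /* assumed: 64 buckets */
#define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
#define GOLDEN_RATIO_64   0x61C8864680B583EBull   /* assumed constant */

struct kprobe_sketch {
    uint64_t addr;                  /* probe address, used as the key */
    struct kprobe_sketch *next;     /* bucket chain */
};

static struct kprobe_sketch *kprobe_table[KPROBE_TABLE_SIZE];

/* Multiplicative hash: keep the top KPROBE_HASH_BITS bits of the product. */
static unsigned int hash_addr(uint64_t addr)
{
    return (unsigned int)((addr * GOLDEN_RATIO_64) >> (64 - KPROBE_HASH_BITS));
}

/* Insert at the head of the bucket selected by the probe address. */
static void add_kprobe(struct kprobe_sketch *p)
{
    unsigned int b = hash_addr(p->addr);

    p->next = kprobe_table[b];
    kprobe_table[b] = p;
}

/* Return the probe registered at addr, or NULL if there is none. */
static struct kprobe_sketch *get_kprobe(uint64_t addr)
{
    struct kprobe_sketch *p;

    for (p = kprobe_table[hash_addr(addr)]; p; p = p->next)
        if (p->addr == addr)
            return p;
    return NULL;
}
```

Before the first add_kprobe call, get_kprobe on the entry address of blk_update_request returns NULL, which corresponds to old_p being NULL in the call tree; after the insert the same lookup returns the registered probe.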

3.1.1.1 trace_add_event_call
trace_add_event_call(call)
    |--__register_event(call, NULL)
    |     |--event_init(call)
    |     \--list_add(&call->list, &ftrace_events)
    \--__add_event_to_tracers(call)
           |  //in this example the ftrace_trace_arrays list contains only the global global_trace, so tr is global_trace
           |--list_for_each_entry(tr, &ftrace_trace_arrays, list)
                  __trace_add_new_event(call, tr);
                      |--struct trace_event_file *file
                      |  //create the trace_event_file, initialize it and link it into the tr->events list
                      |--file = trace_create_new_event(call, tr)
                      |      |--file->event_call = call;
                      |      |  file->tr = tr
                      |      |  atomic_set(&file->sm_ref, 0)
                      |      |  atomic_set(&file->tm_ref, 0)
                      |      |  INIT_LIST_HEAD(&file->triggers)
                      |      |--list_add(&file->list, &tr->events)
                      \--event_create_dir(tr->event_dir, file)

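trace_add_event_call creates the trace_event_file, creates the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and creates the remaining file nodes inside it

  • trace_create_new_event: creates the trace_event_file, initializes it and links it into the tr->events list

  • event_create_dir: creates the "kprobes" directory under "/sys/kernel/debug/tracing/events", then the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and then the remaining file nodes inside it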

event_create_dir(tr->event_dir, file)
   |--struct trace_event_call *call = file->event_call
   |  struct trace_array *tr = file->tr
   |  struct dentry *d_events
   |--d_events = event_subsystem_dir(tr, call->class->system, file, parent)
   |     |--struct trace_subsystem_dir *dir;
   |     |  struct event_subsystem *system;
   |     |  struct dentry *entry
   |     |  //allocate the trace_subsystem_dir
   |     |--dir = kmalloc(sizeof(*dir), GFP_KERNEL)
   |     |  //create the "kprobes" directory under "/sys/kernel/debug/tracing/events"
   |     |--dir->entry = tracefs_create_dir(name, parent)
   |     |--dir->tr = tr;
   |     |  dir->ref_count = 1;
   |     |  dir->nr_events = 1;
   |     |  dir->subsystem = system;
   |     |  file->system = dir;
   |     |--tracefs_create_file("filter"...)
   |     |--trace_create_file("enable"...)
   |     |--list_add(&dir->list, &tr->systems)
   |  //name is blk_update
   |--name = trace_event_name(call)
   |  //create the blk_update directory under /sys/kernel/debug/tracing/events/kprobes
   |--file->dir = tracefs_create_dir(name, d_events)
   |--trace_create_file("enable", 0644, file->dir, file,&ftrace_enable_fops);
   |--trace_create_file("id", 0444, file->dir,   ...)
   |--event_define_fields(call)   
   |--trace_create_file("filter", 0644, file->dir, file,&ftrace_event_filter_fops)
   |--trace_create_file("trigger", 0644, file->dir, file,&event_trigger_fops);

event_create_dir creates the "kprobes" directory under "/sys/kernel/debug/tracing/events", then the blk_update directory under /sys/kernel/debug/tracing/events/kprobes, and then the remaining file nodes inside it

trace_create_file("enable", 0644, file->dir, file,&ftrace_enable_fops);
    |--tracefs_create_file(name, mode, parent, data, fops)
           |--struct dentry *dentry
           |  struct inode *inode;
           |  //look up the dentry for the new file under parent
           |--dentry = start_creating(name, parent)
           |--inode = tracefs_get_inode(dentry->d_sb)
           |--inode->i_mode = mode;
           |  inode->i_fop = fops ? fops : &tracefs_file_operations;
           |  //data is the trace_event_file
           |  inode->i_private = data
           |-- d_instantiate(dentry, inode)
           |--fsnotify_create(dentry->d_parent->d_inode, dentry);

trace_create_file creates the enable file under /sys/kernel/debug/tracing/events/kprobes/blk_update. The key point is inode->i_private: it is set to the trace_event_file created during trace_add_event_call and is used later in event_enable_write

3.1.1.2 prepare_kprobe
|--prepare_kprobe(p)
      \--arch_prepare_kprobe(p)
             |  //save the original entry instruction of blk_update_request: sub     sp, sp, #0x60
             |--unsigned long probe_addr = (unsigned long)p->addr;
             |--p->opcode = le32_to_cpu(*p->addr)
             |--search_exception_tables(probe_addr)
             |  //p->ainsn.api.insn is the out-of-line slot; arch_prepare_ss_slot fills it with
             |  //the original entry instruction (sub sp, sp, #0x60) followed by brk     #0x6
             |--p->ainsn.api.insn = get_insn_slot()
             |     |--__get_insn_slot(struct kprobe_insn_cache *c)
             \--arch_prepare_ss_slot(p)
                    |--kprobe_opcode_t *addr = p->ainsn.api.insn;
                    | //addr:   receives p->opcode (the original entry instruction)
                    | //addr+1: receives brk     #0x6
                    |--void *addrs[] = {addr, addr + 1};
                    | //p->opcode:sub     sp, sp, #0x60
                    |--u32 insns[] = {p->opcode, BRK64_OPCODE_KPROBES_SS};
                    |--aarch64_insn_patch_text(addrs, insns, 2);
                    |--flush_icache_range((uintptr_t)addr, (uintptr_t)(addr + MAX_INSN_SIZE))
                    |  //Needs restoring of return address after stepping xol
                    |  //p->addr   is the address of sub     sp, sp, #0x60: 0xffff8000104ec1f0
                    |  //p->addr+4 is the address of stp     x29, x30, [sp,#16]: 0xffff8000104ec1f4
                    \--p->ainsn.api.restore = (unsigned long) p->addr +sizeof(kprobe_opcode_t);

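prepare_kprobe: prepares for returning to the original instructions of blk_update_request after the probe fires. p->opcode holds the original entry instruction of blk_update_request; p->ainsn.api.insn points to the slot holding a copy of that entry instruction; p->ainsn.api.restore holds the address of the instruction following the original entry instruction.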
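The slot layout built by arch_prepare_ss_slot can be modeled in user space as two instruction words plus a resume address. This is a sketch only: ss_slot_sketch, encode_brk and prepare_ss_slot are hypothetical names, 0xd10183ff is the A64 encoding of sub sp, sp, #0x60 as I computed it by hand, and the BRK encoding (0xd4200000 with the 16-bit immediate at bit 5) is stated from the A64 instruction format rather than from this article.

```c
#include <assert.h>
#include <stdint.h>

#define BRK_BASE            0xd4200000u  /* A64 BRK with immediate 0 */
#define KPROBES_BRK_SS_IMM  0x006u       /* brk #0x6, as seen in the slot */
#define AARCH64_INSN_SIZE   4u

/* Encode "brk #imm16": the immediate sits in bits [20:5]. */
static uint32_t encode_brk(uint32_t imm16)
{
    return BRK_BASE | ((imm16 & 0xffffu) << 5);
}

struct ss_slot_sketch {
    uint32_t insn[2];   /* [0] copy of the original opcode, [1] brk #0x6 */
    uint64_t restore;   /* where to resume in the probed function */
};

/* Fill the out-of-line slot for a probe at probe_addr. */
static void prepare_ss_slot(struct ss_slot_sketch *s,
                            uint64_t probe_addr, uint32_t opcode)
{
    s->insn[0] = opcode;
    s->insn[1] = encode_brk(KPROBES_BRK_SS_IMM);
    s->restore = probe_addr + AARCH64_INSN_SIZE;
}
```

With the addresses from this article, a probe at 0xffff8000104ec1f0 gets a restore address of 0xffff8000104ec1f4, the stp instruction that follows the entry instruction.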

4. Replacing the instruction with brk

First, the disassembly of blk_update_request before the instruction is replaced:

Dump of assembler code for function blk_update_request:
   0xffff8000104ec1f0 <+0>:     sub     sp, sp, #0x60
   0xffff8000104ec1f4 <+4>:     stp     x29, x30, [sp,#16]
   0xffff8000104ec1f8 <+8>:     add     x29, sp, #0x10
   0xffff8000104ec1fc <+12>:    stp     x19, x20, [sp,#32]
   0xffff8000104ec200 <+16>:    stp     x21, x22, [sp,#48]
   0xffff8000104ec204 <+20>:    stp     x23, x24, [sp,#64]
   0xffff8000104ec208 <+24>:    str     x25, [sp,#80]
   0xffff8000104ec20c <+28>:    mov     x22, x0
   0xffff8000104ec210 <+32>:    uxtb    w24, w1
   0xffff8000104ec214 <+36>:    mov     w21, w2
   0xffff8000104ec218 <+40>:    mov     x0, x30
   0xffff8000104ec21c <+44>:    nop
   ......

After running the following command

ubuntu@VM-0-9-ubuntu:~$ echo 1 > /sys/kernel/debug/tracing/events/kprobes/blk_update/enable

we can see that the instruction at the entry of blk_update_request

sub     sp, sp, #0x60

has been replaced with:

0xffff8000104ec1f0 <+0>:     brk     #0x4

How is this replacement carried out? The process can be traced with gdb:

event_enable_write(struct file *filp, const char __user *ubuf, size_t cnt,loff_t *ppos)
    |--struct trace_event_file *file
    |--kstrtoul_from_user(ubuf, cnt, 10, &val)
    |--tracing_update_buffers()
    |  //trace_create_file stored the trace_event_file in inode->i_private earlier
    |--file = event_file_data(filp)
    |--ftrace_event_enable_disable(file, val)
           |  //enable == 1 in this example
           |--__ftrace_event_enable_disable(file, enable, 0)
                  |--clear_bit(EVENT_FILE_FL_SOFT_DISABLED_BIT, &file->flags)
                  |--trace_buffered_event_enable()
                         |--struct trace_event_call *call = file->event_call
                         |  //call->class->reg(call, TRACE_REG_REGISTER, file); reg was set in register_kprobe_event
                         |--kprobe_register(struct trace_event_call *event,enum trace_reg type, void *data)
                                |--enable_trace_kprobe(event, file)
                                       |--struct trace_probe *pos, *tp
                                       |--tp = trace_probe_primary_from_call(call)
                                       |-- trace_probe_add_file(tp, file);
                                       |       |--struct event_file_link *link
                                       |       |--link = kmalloc(sizeof(*link), GFP_KERNEL)  
                                       |       |--list_add_tail_rcu(&link->list, &tp->event->files)
                                       |--list_for_each_entry(pos, trace_probe_probe_list(tp), list)
                                              tk = container_of(pos, struct trace_kprobe, tp)
                                              if (trace_kprobe_has_gone(tk))
                                                   continue;
                                              ret = __enable_trace_kprobe(tk)
                                              if (ret)  break;
                                              enabled = true;
static inline int __enable_trace_kprobe(struct trace_kprobe *tk) 
{
        int ret = 0; 

        if (trace_kprobe_is_registered(tk) && !trace_kprobe_has_gone(tk)) {
                if (trace_kprobe_is_return(tk))
                        ret = enable_kretprobe(&tk->rp);
                else 
                        ret = enable_kprobe(&tk->rp.kp);
        }    

        return ret; 
}
int enable_kprobe(struct kprobe *kp)
    |--struct kprobe *p
    |--p = __get_valid_kprobe(kp)
    |--arm_kprobe(p)
           |  //Put a breakpoint for a probe
           |--__arm_kprobe(kp);
                 |  //arm kprobe: install breakpoint in text
                 |--arch_arm_kprobe(p)
                        |--void *addr = p->addr
                        |--u32 insn = BRK64_OPCODE_KPROBES;
                        |--aarch64_insn_patch_text(&addr, &insn, 1);
           
int __kprobes aarch64_insn_patch_text(void *addrs[], u32 insns[], int cnt)                                                                       
{       
        struct aarch64_insn_patch patch = {                                                                                                      
                .text_addrs = addrs,                                                                                                             
                .new_insns = insns,                                                                                                              
                .insn_cnt = cnt,
                .cpu_count = ATOMIC_INIT(0),                                                                                                     
        };                                                                                                                                       
        
        if (cnt <= 0)
                return -EINVAL;                                                                                                                  
        //stop_machine_cpuslocked invokes the aarch64_insn_patch_text_cb callback with &patch as its argument
        return stop_machine_cpuslocked(aarch64_insn_patch_text_cb, &patch,                                                                       
                                       cpu_online_mask);                                                                                         
}
static int __kprobes aarch64_insn_patch_text_cb(void *arg)                                                                                       
{       
        int i, ret = 0;
        struct aarch64_insn_patch *pp = arg;                                                                                                     
        
        /* The first CPU becomes master */
        if (atomic_inc_return(&pp->cpu_count) == 1) {
                for (i = 0; ret == 0 && i < pp->insn_cnt; i++)
                        ret = aarch64_insn_patch_text_nosync(pp->text_addrs[i],                                                                  
                                                             pp->new_insns[i]);                                                                  
                /* Notify other processors with an additional increment. */                                                                      
                atomic_inc(&pp->cpu_count);                                                                                                      
        } else {
                while (atomic_read(&pp->cpu_count) <= num_online_cpus())                                                                         
                        cpu_relax();                                                                                                             
                isb();                                                                                                                           
        }                                                                                                                                        
        
        return ret;                                                                                                                              
}
int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
        u32 *tp = addr;
        int ret;

        /* A64 instructions must be word aligned */
        if ((uintptr_t)tp & 0x3)
                return -EINVAL;

        ret = aarch64_insn_write(tp, insn);
        if (ret == 0)
                __flush_icache_range((uintptr_t)tp,
                                     (uintptr_t)tp + AARCH64_INSN_SIZE);

        return ret;
}
int __kprobes aarch64_insn_write(void *addr, u32 insn)
{
        return __aarch64_insn_write(addr, cpu_to_le32(insn));
}
static int __kprobes __aarch64_insn_write(void *addr, __le32 insn)
{
        void *waddr = addr;
        unsigned long flags = 0;
        int ret;

        raw_spin_lock_irqsave(&patch_lock, flags);
        waddr = patch_map(addr, FIX_TEXT_POKE0);

        ret = copy_to_kernel_nofault(waddr, &insn, AARCH64_INSN_SIZE);

        patch_unmap(FIX_TEXT_POKE0);
        raw_spin_unlock_irqrestore(&patch_lock, flags);

        return ret;
}
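The effect of the arming path above can be reproduced on an ordinary array standing in for kernel text. This is a user-space sketch under assumptions: probe_sketch, prepare_probe, patch_text and arm_probe are hypothetical names, 0xd4200080 is the encoding of brk #0x4 derived from the A64 BRK format, and none of the stop_machine, fixmap or cache maintenance steps are modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRK64_OPCODE_KPROBES_SKETCH 0xd4200080u  /* brk #0x4 */

struct probe_sketch {
    uint32_t *addr;     /* probe point inside the "text" array */
    uint32_t opcode;    /* original instruction, saved before arming */
};

/* Write one instruction word; A64 instructions must be 4-byte aligned. */
static int patch_text(void *addr, uint32_t insn)
{
    if ((uintptr_t)addr & 0x3)
        return -1;
    *(uint32_t *)addr = insn;
    return 0;
}

/* Registration time: remember where the probe is and what was there. */
static void prepare_probe(struct probe_sketch *p, uint32_t *addr)
{
    p->addr = addr;
    p->opcode = *addr;
}

/* Enable time: overwrite the probe point with brk #0x4. */
static int arm_probe(struct probe_sketch *p)
{
    return patch_text(p->addr, BRK64_OPCODE_KPROBES_SKETCH);
}
```

After prepare_probe and arm_probe on the first word of a two-word array, that word reads back as the brk encoding, the saved opcode still holds the original value, and the second word is untouched.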

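5. Executing the kprobe hook functions

Note: sections 5.2 to 5.8 below form a chain of nested calls; they are not sibling functions

5.1 Initializing the breakpoint exception callbacks =>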

# arch/arm64/include/asm/debug-monitors.h 
/* AArch64 */
#define DBG_ESR_EVT_HWBP        0x0
#define DBG_ESR_EVT_HWSS        0x1
#define DBG_ESR_EVT_HWWP        0x2
#define DBG_ESR_EVT_BRK         0x6
# arch/arm64/mm/fault.c
/*
 * __refdata because early_brk64 is __init, but the reference to it is
 * clobbered at arch_initcall time.
 * See traps.c and debug-monitors.c:debug_traps_init().
 */
static struct fault_info __refdata debug_fault_info[] = { 
        { do_bad,       SIGTRAP,        TRAP_HWBKPT,    "hardware breakpoint"   },  
        { do_bad,       SIGTRAP,        TRAP_HWBKPT,    "hardware single-step"  },  
        { do_bad,       SIGTRAP,        TRAP_HWBKPT,    "hardware watchpoint"   },  
        { do_bad,       SIGKILL,        SI_KERNEL,      "unknown 3"             },  
        { do_bad,       SIGTRAP,        TRAP_BRKPT,     "aarch32 BKPT"          },  
        { do_bad,       SIGKILL,        SI_KERNEL,      "aarch32 vector catch"  },  
        { early_brk64,  SIGTRAP,        TRAP_BRKPT,     "aarch64 BRK"           },  
        { do_bad,       SIGKILL,        SI_KERNEL,      "unknown 7"             },  
};

void __init hook_debug_fault_code(int nr, 
                                  int (*fn)(unsigned long, unsigned int, struct pt_regs *), 
                                  int sig, int code, const char *name)
{
        BUG_ON(nr < 0 || nr >= ARRAY_SIZE(debug_fault_info));

        debug_fault_info[nr].fn         = fn; 
        debug_fault_info[nr].sig        = sig;
        debug_fault_info[nr].code       = code;
        debug_fault_info[nr].name       = name;
}
#arch/arm64/kernel/debug-monitors.c
void __init debug_traps_init(void)
{
        hook_debug_fault_code(DBG_ESR_EVT_HWSS, single_step_handler, SIGTRAP,
                              TRAP_TRACE, "single-step handler");
        hook_debug_fault_code(DBG_ESR_EVT_BRK, brk_handler, SIGTRAP,
                              TRAP_BRKPT, "BRK handler");
}

hook_debug_fault_code installs brk_handler as the exception hook at run time; it is called from the breakpoint exception handler
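The table-and-hook mechanism can be exercised in user space with a reduced table. This is a sketch with stated assumptions: the _sketch names and the one-argument handler signature are simplifications, and the index computation (bits [29:27] of the syndrome value) reflects my reading of the DBG_ESR_EVT macro rather than anything quoted in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DBG_ESR_EVT_HWSS 0x1
#define DBG_ESR_EVT_BRK  0x6
/* Assumed index computation: bits [29:27] of the syndrome value. */
#define DBG_ESR_EVT(esr) (((esr) >> 27) & 0x7)

typedef int (*debug_fn)(uint32_t esr);

struct fault_info_sketch { debug_fn fn; const char *name; };

static int do_bad(uint32_t esr)                     { (void)esr; return 1; }
static int brk_handler_sketch(uint32_t esr)         { (void)esr; return 0; }
static int single_step_handler_sketch(uint32_t esr) { (void)esr; return 0; }

/* Every slot starts out as do_bad, as in the table above. */
static struct fault_info_sketch debug_fault_info[8] = {
    { do_bad, "hardware breakpoint" },  { do_bad, "hardware single-step" },
    { do_bad, "hardware watchpoint" },  { do_bad, "unknown 3" },
    { do_bad, "aarch32 BKPT" },         { do_bad, "aarch32 vector catch" },
    { do_bad, "aarch64 BRK" },          { do_bad, "unknown 7" },
};

/* Replace one slot of the table at run time. */
static void hook_debug_fault_code(int nr, debug_fn fn, const char *name)
{
    assert(nr >= 0 && nr < 8);
    debug_fault_info[nr].fn = fn;
    debug_fault_info[nr].name = name;
}

static void debug_traps_init_sketch(void)
{
    hook_debug_fault_code(DBG_ESR_EVT_HWSS, single_step_handler_sketch,
                          "single-step handler");
    hook_debug_fault_code(DBG_ESR_EVT_BRK, brk_handler_sketch, "BRK handler");
}

static const struct fault_info_sketch *esr_to_debug_fault_info(uint32_t esr)
{
    return &debug_fault_info[DBG_ESR_EVT(esr)];
}
```

After debug_traps_init_sketch has run, a syndrome value whose bits [29:27] equal 6 selects the BRK slot, and slots that were never hooked still point at do_bad.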

Now let us follow the kprobe execution flow. brk #0x4 jumps to the sync exception handling in arch/arm64/kernel/entry.S

5.2 brk #0x4 =>

    //grow the stack by 0x150 bytes; sp now holds the address of the top of the new frame
    0xffff800010010a00 <vectors+512>        sub    sp, sp, #0x150
    0xffff800010010a04 <vectors+516>        add    sp, sp, x0
    0xffff800010010a08 <vectors+520>        sub    x0, sp, x0
    0xffff800010010a0c <vectors+524>        tbnz   w0, #14, 0xffff800010010a1c <vectors+540>
    0xffff800010010a10 <vectors+528>        sub    x0, sp, x0
    0xffff800010010a14 <vectors+532>        sub    sp, sp, x0
    0xffff800010010a18 <vectors+536>        b      0xffff800010011940 <el1_sync>

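A brk exception can dispatch to different callbacks; the immediate after brk (#0x4 here) decides which callback of the breakpoint exception handler is called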
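Selecting a callback by BRK immediate can be sketched as a lookup over a small hook table. The immediates 0x4 and 0x6 are the two values that appear in this article; everything else is an assumption: the enum, the table, find_break_hook, and the idea that the immediate is available in the low 16 bits of the syndrome value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRK_COMMENT_MASK 0xffffu   /* assumed: immediate in the low 16 bits */

enum hook_id { HOOK_NONE, HOOK_KPROBE, HOOK_KPROBE_SS };

struct break_hook_sketch {
    uint16_t imm;       /* BRK immediate this hook answers to */
    enum hook_id id;    /* stands in for the callback pointer */
};

static const struct break_hook_sketch hooks[] = {
    { 0x004, HOOK_KPROBE },     /* brk #0x4: probe point hit */
    { 0x006, HOOK_KPROBE_SS },  /* brk #0x6: end of the out-of-line slot */
};

/* Return the hook registered for the immediate encoded in esr. */
static enum hook_id find_break_hook(uint32_t esr)
{
    uint32_t comment = esr & BRK_COMMENT_MASK;
    size_t i;

    for (i = 0; i < sizeof(hooks) / sizeof(hooks[0]); i++)
        if (hooks[i].imm == comment)
            return hooks[i].id;
    return HOOK_NONE;
}
```

A syndrome value ending in 0x0004 maps to the probe-hit hook, one ending in 0x0006 to the single-step hook, and any other immediate finds no hook in this table.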

5.3 el1_sync =>

# arch/arm64/kernel/entry.S
SYM_CODE_START(vectors)
        kernel_ventry   1, sync_invalid                 // Synchronous EL1t
        kernel_ventry   1, irq_invalid                  // IRQ EL1t
        kernel_ventry   1, fiq_invalid                  // FIQ EL1t
        kernel_ventry   1, error_invalid                // Error EL1t
        
        //the kprobe breakpoint instruction enters here
        kernel_ventry   1, sync                         // Synchronous EL1h
        kernel_ventry   1, irq                          // IRQ EL1h
        kernel_ventry   1, fiq_invalid                  // FIQ EL1h
        kernel_ventry   1, error                        // Error EL1h

        kernel_ventry   0, sync                         // Synchronous 64-bit EL0
        kernel_ventry   0, irq                          // IRQ 64-bit EL0
        kernel_ventry   0, fiq_invalid                  // FIQ 64-bit EL0
        kernel_ventry   0, error                        // Error 64-bit EL0

#ifdef CONFIG_COMPAT
        kernel_ventry   0, sync_compat, 32              // Synchronous 32-bit EL0
        kernel_ventry   0, irq_compat, 32               // IRQ 32-bit EL0
        kernel_ventry   0, fiq_invalid_compat, 32       // FIQ 32-bit EL0
        kernel_ventry   0, error_compat, 32             // Error 32-bit EL0
#else
        kernel_ventry   0, sync_invalid, 32             // Synchronous 32-bit EL0
        kernel_ventry   0, irq_invalid, 32              // IRQ 32-bit EL0
        kernel_ventry   0, fiq_invalid, 32              // FIQ 32-bit EL0
        kernel_ventry   0, error_invalid, 32            // Error 32-bit EL0
#endif
SYM_CODE_END(vectors)

brk #0x4 triggers arm64 exception handling; on entry, execution jumps to the sync exception handling in arch/arm64/kernel/entry.S, which branches to el1_sync

SYM_CODE_START_LOCAL_NOALIGN(el1_sync)
        kernel_entry 1
        //as kernel_entry shows, x0 points to the saved general-purpose registers
        mov     x0, sp
        bl      el1_sync_handler
        kernel_exit 1
SYM_CODE_END(el1_sync)

As for kernel_entry, here is its disassembly:

   //save general-purpose registers x0~x29
   0xffff800010011940 <el1_sync>:       stp     x0, x1, [sp]
   0xffff800010011944 <el1_sync+4>:     stp     x2, x3, [sp,#16]
   0xffff800010011948 <el1_sync+8>:     stp     x4, x5, [sp,#32]
   0xffff80001001194c <el1_sync+12>:    stp     x6, x7, [sp,#48]
   0xffff800010011950 <el1_sync+16>:    stp     x8, x9, [sp,#64]
   0xffff800010011954 <el1_sync+20>:    stp     x10, x11, [sp,#80]
   0xffff800010011958 <el1_sync+24>:    stp     x12, x13, [sp,#96]
   0xffff80001001195c <el1_sync+28>:    stp     x14, x15, [sp,#112]
   0xffff800010011960 <el1_sync+32>:    stp     x16, x17, [sp,#128]
   0xffff800010011964 <el1_sync+36>:    stp     x18, x19, [sp,#144]
   0xffff800010011968 <el1_sync+40>:    stp     x20, x21, [sp,#160]
   0xffff80001001196c <el1_sync+44>:    stp     x22, x23, [sp,#176]
   0xffff800010011970 <el1_sync+48>:    stp     x24, x25, [sp,#192]
   0xffff800010011974 <el1_sync+52>:    stp     x26, x27, [sp,#208]
   0xffff800010011978 <el1_sync+56>:    stp     x28, x29, [sp,#224]
   0xffff80001001197c <el1_sync+60>:    add     x21, sp, #0x150
   //x28 holds the pointer to the current task descriptor
   0xffff800010011980 <el1_sync+64>:    mrs     x28, sp_el0   
   0xffff800010011984 <el1_sync+68>:    ldr     x20, [x28,#8]
   0xffff800010011988 <el1_sync+72>:    str     x20, [sp,#288]
   0xffff80001001198c <el1_sync+76>:    mov     x20, #0xffffffffffff            // #281474976710655
   0xffff800010011990 <el1_sync+80>:    str     x20, [x28,#8]
   0xffff800010011994 <el1_sync+84>:    mrs     x22, elr_el1
   0xffff800010011998 <el1_sync+88>:    mrs     x23, spsr_el1
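   //save lr (x30) together with the pre-exception sp (x21)
   0xffff80001001199c <el1_sync+92>:    stp     x30, x21, [sp,#240]
   //save x29 and elr as a frame record
   0xffff8000100119a0 <el1_sync+96>:    stp     x29, x22, [sp,#304]
   //x29 points to that frame record
   0xffff8000100119a4 <el1_sync+100>:   add     x29, sp, #0x130
   //save elr and spsr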
   0xffff8000100119a8 <el1_sync+104>:   stp     x22, x23, [sp,#256]
   0xffff8000100119ac <el1_sync+108>:   nop
   0xffff8000100119b0 <el1_sync+112>:   nop

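A kernel_entry argument of 1 means the context of an exception taken from EL1 is saved; 0 means an exception taken from EL0. From the analysis of kernel_entry above, the context of the interrupted task is saved on that task's kernel stack. It consists mainly of the stack frame, PSTATE, LR, SP and the general-purpose registers X0~X29. Execution then continues in el1_sync_handler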
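The offsets used by the store instructions in the disassembly can be cross-checked against a user-space mirror of the saved frame. The struct below is a sketch: its field names and order follow my reading of the 5.10 arm64 struct pt_regs and should be treated as assumptions; what the article itself provides are the numeric offsets (240, 256, 288, 304) and the total size 0x150.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the frame that kernel_entry fills in. */
struct pt_regs_sketch {
    uint64_t regs[31];          /* x0..x30, x30 at offset 240 */
    uint64_t sp;                /* offset 248 */
    uint64_t pc;                /* offset 256, receives elr_el1 */
    uint64_t pstate;            /* offset 264, receives spsr_el1 */
    uint64_t orig_x0;
    int32_t  syscallno;
    uint32_t unused2;
    uint64_t orig_addr_limit;   /* offset 288 */
    uint64_t pmr_save;
    uint64_t stackframe[2];     /* offset 304 (0x130): saved x29 and elr */
    uint64_t lockdep_hardirqs;
    uint64_t exit_rcu;
};

/* Accessors for the values stored by the stp instructions above. */
static uint64_t saved_lr(const struct pt_regs_sketch *r)  { return r->regs[30]; }
static uint64_t saved_elr(const struct pt_regs_sketch *r) { return r->pc; }

/* Offset of the frame record that x29 is pointed at (add x29, sp, #0x130). */
static size_t frame_record_offset(void)
{
    return offsetof(struct pt_regs_sketch, stackframe);
}
```

Under this layout the frame is 0x150 bytes, matching the sub sp, sp, #0x150 at the vector entry, and frame_record_offset() equals 0x130, matching the add x29, sp, #0x130 instruction.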

5.4 el1_sync_handler =>

asmlinkage void noinstr el1_sync_handler(struct pt_regs *regs)
{
        unsigned long esr = read_sysreg(esr_el1);
        
        //the exception type can be determined from esr
        switch (ESR_ELx_EC(esr)) {
        case ESR_ELx_EC_DABT_CUR:
        case ESR_ELx_EC_IABT_CUR:
                el1_abort(regs, esr);
                break;
        /*  
         * We don't handle ESR_ELx_EC_SP_ALIGN, since we will have hit a
         * recursive exception when trying to push the initial pt_regs.
         */
        case ESR_ELx_EC_PC_ALIGN:
                el1_pc(regs, esr);
                break;
        case ESR_ELx_EC_SYS64:
        case ESR_ELx_EC_UNKNOWN:
                el1_undef(regs);
                break;
        case ESR_ELx_EC_BREAKPT_CUR:
        case ESR_ELx_EC_SOFTSTP_CUR:
        case ESR_ELx_EC_WATCHPT_CUR:
        case ESR_ELx_EC_BRK64:
                el1_dbg(regs, esr);
                break;
        case ESR_ELx_EC_FPAC:
                el1_fpac(regs, esr);
                break;
        default:
                el1_inv(regs, esr);
        }   
}

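esr_el1 is the arm64 Exception Syndrome Register. Bits 31-26 hold the exception class (EC) and bits 24-0 hold the instruction specific syndrome (ISS); the layout of the ISS depends on the EC. Here the brk instruction inserted by kprobe raises an exception with ESR_EL1 = 0xf2000004. Its EC is 0x3c (ESR_ELx_EC_BRK64), so execution goes to el1_dbg.

For the register definition of ESR_EL1, see the ARMv8 ARM.
Each EC has its own ISS layout. For a breakpoint debug exception the ISS carries a Comment field, and different Comment values further distinguish different debug exceptions.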

5.5 el1_dbg=>

//esr Holds syndrome information for an exception taken to EL1
static void noinstr el1_dbg(struct pt_regs *regs, unsigned long esr)
{
        //far is the Fault Address Register; its value only matters for watchpoints (see addr_if_watchpoint below), and brk_handler ignores it
        unsigned long far = read_sysreg(far_el1);

        /*
         * The CPU masked interrupts, and we are leaving them masked during
         * do_debug_exception(). Update PMR as if we had called
         * local_daif_mask().
         */
        if (system_uses_irq_prio_masking())
                gic_write_pmr(GIC_PRIO_IRQON | GIC_PRIO_PSR_I_SET);

        arm64_enter_el1_dbg(regs);
        do_debug_exception(far, esr, regs);
        arm64_exit_el1_dbg(regs);
}

el1_dbg calls do_debug_exception to handle the debug exception.

void do_debug_exception(unsigned long addr_if_watchpoint, unsigned int esr,
                        struct pt_regs *regs)
{
        const struct fault_info *inf = esr_to_debug_fault_info(esr);
        unsigned long pc = instruction_pointer(regs);

        if (cortex_a76_erratum_1463225_debug_handler(regs))
                return;

        debug_exception_enter(regs);

        if (user_mode(regs) && !is_ttbr0_addr(pc))
                arm64_apply_bp_hardening();
        //given debug_traps_init at boot and the value of esr_el1, inf->fn is brk_handler
        if (inf->fn(addr_if_watchpoint, esr, regs)) {
                arm64_notify_die(inf->name, regs,
                                 inf->sig, inf->code, (void __user *)pc, esr);
        }   

        debug_exception_exit(regs);
}
NOKPROBE_SYMBOL(do_debug_exception);

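Bits 27~29 of esr_el1 give the debug exception type, which is the index into the debug_fault_info array. Here the type is 0x6, i.e. DBG_ESR_EVT_BRK, and the initialization function debug_traps_init shows that inf->fn is brk_handler. The first argument is named addr_if_watchpoint because it only matters for watchpoints; brk_handler receives it as unused. The address of the breakpoint instruction comes from regs instead: inspecting the pt_regs structure with gdb shows that the pc variable in this function is the address of the brk 0x4 instruction inserted into blk_update_request.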

static int brk_handler(unsigned long unused, unsigned int esr,
                       struct pt_regs *regs)
{
        if (call_break_hook(regs, esr) == DBG_HOOK_HANDLED)
                return 0;

        if (user_mode(regs)) {
                send_user_sigtrap(TRAP_BRKPT);
        } else {
                pr_warn("Unexpected kernel BRK exception at EL1\n");
                return -EFAULT;
        }   

        return 0;
}
NOKPROBE_SYMBOL(brk_handler);

brk_handler calls call_break_hook, which dispatches to a hook according to the specific kind of breakpoint exception. The kinds are told apart by ESR_EL1.ISS.Comment: each Comment value corresponds to its own hook.

5.6 call_break_hook=>

static int call_break_hook(struct pt_regs *regs, unsigned int esr)
{
        struct break_hook *hook;
        struct list_head *list;
        int (*fn)(struct pt_regs *regs, unsigned int esr) = NULL;

        //user_mode(regs) is false because the exception was taken from EL1, so list is kernel_break_hook
        list = user_mode(regs) ? &user_break_hook : &kernel_break_hook;

        /*  
         * Since brk exception disables interrupt, this function is
         * entirely not preemptible, and we can use rcu list safely here.
         */
        list_for_each_entry_rcu(hook, list, node) {
                //#define ESR_ELx_BRK64_ISS_COMMENT_MASK  0xffff
                unsigned int comment = esr & ESR_ELx_BRK64_ISS_COMMENT_MASK;

                if ((comment & ~hook->mask) == hook->imm)
                        fn = hook->fn;
        }   

        return fn ? fn(regs, esr) : DBG_HOOK_ERROR;
}
NOKPROBE_SYMBOL(call_break_hook);

Because the breakpoint exception was taken from EL1, list is set to kernel_break_hook. The initialization code below shows how kernel_break_hook is populated: at init time register_kernel_break_hook adds hooks to the kernel_break_hook list, including kprobes_break_hook and kprobes_break_ss_hook. list_for_each_entry_rcu(hook, list, node) walks the kernel_break_hook list and picks the hook that matches the breakpoint type, which is determined from esr_el1.ISS.Comment. Here esr_el1 is 0xf2000004, so esr_el1.ISS is 0x004 and Comment is 0x004, and the call lands on kprobes_break_hook.fn, which is kprobe_breakpoint_handler.

# arch/arm64/include/asm/brk-imm.h
/*                                                                                                                                               
 * #imm16 values used for BRK instruction generation                                                                                             
 * 0x004: for installing kprobes                                                                                                                 
 * 0x005: for installing uprobes                                                                                                                 
 * 0x006: for kprobe software single-step                                                                                                        
 * Allowed values for kgdb are 0x400 - 0x7ff                                                                                                     
 * 0x100: for triggering a fault on purpose (reserved)                                                                                           
 * 0x400: for dynamic BRK instruction                                                                                                            
 * 0x401: for compile time BRK instruction                                                                                                       
 * 0x800: kernel-mode BUG() and WARN() traps                                                                                                     
 * 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff)                                                                                    
 */                                                                                                                                              
#define KPROBES_BRK_IMM                 0x004                                                                                                    
#define UPROBES_BRK_IMM                 0x005                                                                                                    
#define KPROBES_BRK_SS_IMM              0x006                                                                                                    
#define FAULT_BRK_IMM                   0x100                                                                                                    
#define KGDB_DYN_DBG_BRK_IMM            0x400                                                                                                    
#define KGDB_COMPILED_DBG_BRK_IMM       0x401                                                                                                    
#define BUG_BRK_IMM                     0x800                                                                                                    
#define KASAN_BRK_IMM                   0x900                                                                                                    
#define KASAN_BRK_MASK 
# arch/arm64/kernel/debug-monitors.c
static LIST_HEAD(kernel_break_hook);

static struct break_hook kprobes_break_hook = {
        .imm = KPROBES_BRK_IMM,
        .fn = kprobe_breakpoint_handler,
};

void register_kernel_break_hook(struct break_hook *hook)
{                                                                                                                                                
        register_debug_hook(&hook->node, &kernel_break_hook);                                                                                    
}

int __init arch_init_kprobes(void)
{
        register_kernel_break_hook(&kprobes_break_hook);
        register_kernel_break_hook(&kprobes_break_ss_hook);

        return 0;
}

A brief summary so far:

  1. ESR_EL1.EC gives the exception class. Here ESR_EL1.EC is 0x3c, the ESR_ELx_EC_BRK64 breakpoint class;
  2. Bits 27~29 of ESR_EL1 further give the debug exception type, one of:
    DBG_ESR_EVT_HWBP
    DBG_ESR_EVT_HWSS
    DBG_ESR_EVT_HWWP
    DBG_ESR_EVT_BRK
    The debug_fault_info array holds one entry per debug exception type, indexed by bits 27~29 of ESR_EL1. Here the type is 0x6, i.e. DBG_ESR_EVT_BRK, and the initialization function debug_traps_init shows that inf->fn is brk_handler
  3. ESR_EL1.ISS.Comment selects the specific breakpoint hook. Here esr_el1.ISS is 0x004 and Comment is 0x004, so the kprobes_break_hook.fn callback is invoked, which is kprobe_breakpoint_handler, shown below:

static int __kprobes
kprobe_breakpoint_handler(struct pt_regs *regs, unsigned int esr)
{
        kprobe_handler(regs);
        return DBG_HOOK_HANDLED;
}

5.7 kprobe_handler

static void __kprobes kprobe_handler(struct pt_regs *regs)
    |--struct kprobe *p, *cur_kprobe;
    |  struct kprobe_ctlblk *kcb;
    |  //pc of the probed location
    |--unsigned long addr = instruction_pointer(regs);
    |  //addr is the entry address of blk_update_request; it is the key for the hash table lookup
    |--p = get_kprobe((kprobe_opcode_t *) addr);
    \--if (p)
           if (!p->pre_handler || !p->pre_handler(p, regs))
               setup_singlestep(p, regs, kcb, 0);

On entry to kprobe_handler, get_kprobe looks up the kprobe by hashing the probe address. The kprobe it returns is shown below; it is the one registered by this command:

ubuntu@VM-0-9-ubuntu:~$ echo 'p:blk_update blk_update_request request=$arg1 status=$arg2:u8 bytes=$arg3:u32' > /sys/kernel/debug/tracing/kprobe_events

As the dump shows, p->pre_handler is kprobe_dispatcher:

(gdb) p *(struct kprobe *)0xffff0000070b5618
$7 = {
  hlist = {
    next = 0x0,
    pprev = 0xffff80001203a410 <kprobe_table+208>
  },
  list = {
    next = 0xffff0000070b5628,
    prev = 0xffff0000070b5628
  },
  nmissed = 0,
  addr = 0xffff8000104ec1f0 <blk_update_request>,
  symbol_name = 0xffff00000758b200 "blk_update_request",
  offset = 0,
  pre_handler = 0xffff8000101b1354 <kprobe_dispatcher>,
  post_handler = 0x0,
  fault_handler = 0x0,
  opcode = 3506537471,
  ainsn = {
    api = {
      insn = 0xffff800012533000,
      pstate_cc = 0x0,
      handler = 0x0,
      restore = 18446603336494793204
    }
  },
  flags = 0
}

5.7.1 kprobe_dispatcher

kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs)
     |  //recover the trace_kprobe from the current kprobe
     |--struct trace_kprobe *tk = container_of(kp, struct trace_kprobe, rp.kp);
     |--if (trace_probe_test_flag(&tk->tp, TP_FLAG_TRACE))
     |      kprobe_trace_func(tk, regs)
#ifdef CONFIG_PERF_EVENTS
     \--if (trace_probe_test_flag(&tk->tp, TP_FLAG_PROFILE))
           ret = kprobe_perf_func(tk, regs);
#endif
 kprobe_trace_func(tk, regs)
    |--struct event_file_link *link;
    \--trace_probe_for_each_link_rcu(link, &tk->tp)
           __kprobe_trace_func(tk, regs, link->file)
              |--struct kprobe_trace_entry_head *entry; 
              |  struct trace_event_call *call = trace_probe_event_call(&tk->tp);
              |  struct trace_event_buffer fbuffer
              |--if (trace_trigger_soft_disabled(trace_file))
              |      return
              |--fbuffer.pc = preempt_count();
              |--fbuffer.trace_file = trace_file;
              |  fbuffer.event = trace_event_buffer_lock_reserve(&fbuffer.buffer, trace_file,
              |                          call->event.type,
              |                          sizeof(*entry) + tk->tp.size + dsize,
              |                          fbuffer.flags, fbuffer.pc);
              |  fbuffer.regs = regs;
              |  entry = fbuffer.entry = ring_buffer_event_data(fbuffer.event);
              |  entry->ip = (unsigned long)tk->rp.kp.addr;
              |--store_trace_args(&entry[1], &tk->tp, regs, sizeof(*entry), dsize);
              |  //commit the kprobe record to the trace buffer
              \--trace_event_buffer_commit(&fbuffer);
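  • trace_probe_for_each_link_rcu: as described earlier for enable_trace_kprobe, enabling the event creates an event_file_link for the enable file node and links it into the files list of trace_probe_event. trace_probe_for_each_link_rcu walks that list and calls __kprobe_trace_func(tk, regs, link->file) for each entry. What __kprobe_trace_func does is exactly what a trace event probe does. This shows where kprobe and trace event differ, namely in how the probe callback is triggered: a kprobe fires its trace event probe callback from a breakpoint exception, whereas a trace event fires it from a fixed location compiled into the function. In addition, the output format of kprobe arguments is set and parsed dynamically, while the format of a trace event is fixed statically.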

5.7.2 setup_singlestep

setup_singlestep(p, regs, kcb, 0)
    |--unsigned long slot;
    |--kcb->kprobe_status = KPROBE_HIT_SS;
    |--if (p->ainsn.api.insn)
           //slot holds the entry instruction of blk_update_request: sub     sp, sp, #0x60
           slot = (unsigned long)p->ainsn.api.insn;
           set_ss_context(kcb, slot);
               |--kcb->ss_ctx.ss_pending = true;
               |  //the slot also holds brk     #0x6; match_addr is the address of that instruction
               |--kcb->ss_ctx.match_addr = addr + sizeof(kprobe_opcode_t);
           kprobes_save_local_irqflag(kcb, regs);
           instruction_pointer_set(regs, slot);
               |  //set regs->pc to val; here val is slot, whose instruction is sub     sp, sp, #0x60
               |--regs->pc = val

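instruction_pointer_set sets the pc at which execution resumes when the breakpoint exception returns. That pc points at the original entry instruction of blk_update_request, so after the exception returns the original entry instruction runs (note: it runs from a different address, p->ainsn.api.insn, not from its original location). The slot also contains a breakpoint instruction, brk #0x6, right after it, so execution continues into brk #0x6: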

0xffff800012533000      sub    sp, sp, #0x60 
0xffff800012533004      brk    #0x6
0xffff800012533008      stp    x29, x30, [sp,#-32]!
0xffff80001253300c      brk    #0x6 
0xffff800012533010      .inst  0x00000000 ; undefined
......

5.8 brk #0x6=>

#arch/arm64/kernel/probes/kprobes.c

static struct break_hook kprobes_break_ss_hook = { 
        .imm = KPROBES_BRK_SS_IMM,
        .fn = kprobe_breakpoint_ss_handler,
};

int __init arch_init_kprobes(void)
{       
        register_kernel_break_hook(&kprobes_break_hook);
        register_kernel_break_hook(&kprobes_break_ss_hook);

        return 0;
}

As with brk #0x4, arch_init_kprobes registers kprobes_break_ss_hook, which supplies the single-step callback for the breakpoint, kprobe_breakpoint_ss_handler:

static int __kprobes
kprobe_breakpoint_ss_handler(struct pt_regs *regs, unsigned int esr)
{
        struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
        int retval;

        /* return error if this is not our step */
        retval = kprobe_ss_hit(kcb, instruction_pointer(regs));

        if (retval == DBG_HOOK_HANDLED) {
                kprobes_restore_local_irqflag(kcb, regs);
                post_kprobe_handler(kcb, regs);
        }   

        return retval;
}

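Once the brk #0x6 breakpoint exception is raised, it follows this handling path:
brk #0x6 => el1_sync => el1_sync_handler => el1_dbg => do_debug_exception => brk_handler => call_break_hook
call_break_hook walks all breakpoint callbacks and matches them against the imm code. Here the imm code is 0x6 (KPROBES_BRK_SS_IMM), so the single-step callback kprobe_breakpoint_ss_handler runs, and it in turn calls post_kprobe_handler: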

post_kprobe_handler(struct kprobe_ctlblk *kcb, struct pt_regs *regs)
    |--struct kprobe *cur = kprobe_running();
    |--if (cur->ainsn.api.restore != 0)
    |      //restore the pc from cur->ainsn.api.restore
    |      instruction_pointer_set(regs, cur->ainsn.api.restore)
    |          |--regs->pc = val;
    |--if (kcb->kprobe_status == KPROBE_REENTER)
           restore_previous_kprobe(kcb);
    

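instruction_pointer_set restores the pc from cur->ainsn.api.restore. That value was initialized in advance during register_kprobe and is the address of the instruction after the entry instruction of blk_update_request: stp x29, x30, [sp,#16]. When the brk exception returns, execution continues from the second instruction of blk_update_request.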

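6. Summary

To recap the kprobe workflow:

  1. Register the kprobe
    This is done by writing a command to the /sys/kernel/debug/tracing/kprobe_events node. The write will:
    (1) register the kprobe. The key step is setting the pre_handler callback, which is later called from the brk #0x4 breakpoint handler and does the real work of the kprobe;
    (2) save the original instruction at the probe point, followed by a brk #0x6 breakpoint instruction, into the slot. After the brk #0x4 exception returns, execution resumes at the code in this slot;
    (3) record the address of the instruction that follows the probe point. Execution resumes there when the brk #0x6 exception returns, which puts the function back on its original path;

  2. Insert the breakpoint instruction
    This is done with echo 1 > /sys/kernel/debug/tracing/events/kprobes/blk_update/enable, which replaces the instruction at the probe point with brk #0x4.
    Note: brk #0x4 and brk #0x6 are handled by different breakpoint callbacks

  3. Run the kprobe callback
    When execution reaches the probe point, the brk instruction raises a breakpoint exception. The immediate 0x4 selects the kprobe breakpoint callback, which ends up calling pre_handler and so carries out the kprobe's job. Execution then continues at the slot prepared in step 1. The first instruction in the slot is the original instruction of the probed function; the next one is brk #0x6, which traps again. This time the immediate 0x6 selects the single-step handler, which sets the PC to the address recorded in step 1 (3). When the brk #0x6 exception returns, execution continues with the instructions after the probe point, back on the normal path.

As the analysis shows, kprobe is built on trace event. The difference is that a kprobe fires its trace event probe callback from a breakpoint exception, whereas a trace event fires it from a fixed location compiled into the function. In addition, the output format of kprobe arguments is set and parsed dynamically, while the format of a trace event is fixed statically.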

The trace output of the probe is as follows:

/ # cat /sys//kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 6/6   #P:2
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
     ksoftirqd/0-9       [000] d.s1    37.846957: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006ce2140 status=0 bytes=1024
     ksoftirqd/0-9       [000] d.s1    41.417047: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006c44280 status=0 bytes=3072
          <idle>-0       [000] d.s2    41.419396: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006ad8000 status=0 bytes=0
          <idle>-0       [000] d.s3    41.419896: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006c463c0 status=0 bytes=1024
          <idle>-0       [000] d.s2    41.421701: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006ad8000 status=0 bytes=0
          <idle>-0       [000] d.s3    41.421729: blk_update: (blk_update_request+0x0/0x530) request=0xffff000006c463c0 status=0 bytes=0

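References

Kernel调试追踪技术之 Kprobe on ARM64 (Kernel debugging and tracing techniques: Kprobe on ARM64)
https://blog.csdn.net/whatday/article/details/100511447
Linux TraceEvent - 我见过的史上最长宏定义 (Linux TraceEvent: the longest macro definition I have ever seen)
kprobe原理解析(一) (How kprobe works, part 1)
kprobe原理解析(二) (How kprobe works, part 2)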

Appendix

Main data structures:

struct kprobe {
        struct hlist_node hlist;

        /* list of kprobes for multi-handler support */
        struct list_head list;

        /*count the number of times this probe was temporarily disarmed */
        unsigned long nmissed;

        /* location of the probe point */
        kprobe_opcode_t *addr;

        /* Allow user to indicate symbol name of the probe point */
        const char *symbol_name;

        /* Offset into the symbol */
        unsigned int offset;

        /* Called before addr is executed. */
        kprobe_pre_handler_t pre_handler;

        /* Called after addr is executed, unless... */
        kprobe_post_handler_t post_handler;

        /*  
         * ... called if executing addr causes a fault (eg. page fault).
         * Return 1 if it handled fault, otherwise kernel will see it.
         */
        kprobe_fault_handler_t fault_handler;

        /* Saved opcode (which has been replaced with breakpoint) */
        kprobe_opcode_t opcode;

        /* copy of the original instruction */
        struct arch_specific_insn ainsn;

        /*  
         * Indicates various status flags.
         * Protected by kprobe_mutex after this kprobe is registered.
         */
        u32 flags;
};