Principle Overview
Ftrace is a module created to trace and record kernel function calls. The most obvious way to trace a function's invocation is to add a log line printing the function name at its entry; grepping the log for function names then reveals every call as it happens. But this would require adding a print to every function entry and would produce far too much output, so it is impractical. Ftrace follows a similar idea while relying on the compiler options -pg -mrecord-mcount -mfentry: at compile time, a jump to a function probe is inserted at the entry of every function, so that before a function actually runs, the probe runs first, records the call, and then returns to the original function. Scanning the probe's record file yields the complete kernel call history over a period of time. As shown in the figure below, assume the probe is a function named function_trace, which records each call:
This raises two questions:
- How can Ftrace be enabled dynamically?
If Ftrace were always on, every kernel function entry would effectively carry N extra instructions (the assembly of the function_trace function above), degrading system performance. So when Ftrace is not in use, it is normally kept off. Given the description above, can Ftrace only be enabled by specifying compiler options at build time? And what if only a subset of kernel functions needs to be traced?
When the probe-insertion compiler options are enabled, the Ftrace framework replaces the inserted probe instructions with nop instructions at kernel boot, minimizing the impact on system performance. When tracing needs to be enabled, the nop is patched back into the probe's jump instruction, giving Ftrace a dynamic on/off switch, as shown in the figure below.
Since instructions must be patched, how are their locations found?
With the probe-insertion options enabled, the compiler records the address of every inserted probe jump instruction in a dedicated section of the executable, the _mcount_loc section (from __start_mcount_loc to __stop_mcount_loc). At boot, the kernel scans this section and can then modify the contents at every recorded address.
What if only certain functions should be traced, rather than every kernel function?
The compiler applies these options per compilation unit; it cannot be told to instrument only specific functions, so the problem cannot be solved at build time. But following the reasoning above: since the kernel can dynamically patch the probe jumps to nop at boot, and the _mcount_loc section provides every patch address, each address can be resolved to its symbol via kallsyms_lookup. Matching on the function name then lets the kernel patch only the jump instructions at selected addresses, achieving per-function tracing.
- How does the function probe record calls?
The probe is itself a function. By passing it the PC of the function it was called from, the probe records the PC and thereby records the call indirectly: combined with the function address table, the PC identifies the enclosing function. Moreover, using the PC and LR, probes can be specialized for different tracing needs, for example the function_tracer probe and the function_graph_tracer probe. At compile time, no single fixed probe is hard-wired; instead the marker _mcount stands in for the probe, and during kernel initialization each bl _mcount is replaced with nop.
When the probe needs to be switched dynamically, must the contents at every address in _mcount_loc be rewritten? That is, when the function tracer is in use, every nop is first replaced with bl function_tracer; then, when switching to the function graph tracer, must every bl function_tracer be replaced with bl function_graph_tracer? Under such a scheme, if one instruction patch costs time N, then M probe switches cost N * M. To reduce the cost of switching probes, Ftrace uses the notion of a two-level pointer: when Ftrace is enabled, the nop is replaced with bl tracer, and inside the tracer function there is a single bl specific_tracer instruction. Switching probes then only requires patching that one bl specific_tracer to jump to the chosen probe, as shown in the figure below.
Ftrace Source Code Analysis
Initialization
As described in the section on Ftrace's principle, during kernel initialization the _mcount_loc section of the executable is scanned and the contents at every recorded address are changed to nop. In the disassembly of __do_softirq below, this means the bl _mcount is replaced with nop:
ffffffc010081d98 <__do_softirq>:
__do_softirq():
ffffffc010081d98: d10243ff sub sp, sp, #0x90
ffffffc010081d9c: f800865e str x30, [x18],#8
ffffffc010081da0: a9037bfd stp x29, x30, [sp,#48]
ffffffc010081da4: a9046ffc stp x28, x27, [sp,#64]
ffffffc010081da8: a90567fa stp x26, x25, [sp,#80]
ffffffc010081dac: a9065ff8 stp x24, x23, [sp,#96]
ffffffc010081db0: a90757f6 stp x22, x21, [sp,#112]
ffffffc010081db4: a9084ff4 stp x20, x19, [sp,#128]
ffffffc010081db8: 9100c3fd add x29, sp, #0x30
ffffffc010081dbc: aa1e03f8 mov x24, x30
ffffffc010081dc0: 94000d58 bl ffffffc010085320 <_mcount>
The call chain that replaces bl _mcount with nop is as follows:
ftrace_init
  ftrace_process_locs(NULL, __start_mcount_loc, __stop_mcount_loc);
    ftrace_allocate_pages(count);       // allocate the space occupied by the dyn_ftrace table
    ftrace_update_code(mod, start_pg);  // patch the code at every address recorded in the table
      ftrace_code_disable
        ftrace_make_nop(mod, rec, MCOUNT_ADDR);     // replace the instruction encoding of bl _mcount
          ftrace_check_current_call(rec->ip, call); // validate the contents at the address
          ftrace_modify_code(pc, old, new, validate); // new is the encoding of nop; old is the encoding of bl addr computed from MCOUNT_ADDR
            aarch64_insn_patch_text_nosync
              aarch64_insn_write(tp, insn);         // rewrite the instruction
As its arguments __start_mcount_loc and __stop_mcount_loc suggest, ftrace_process_locs processes _mcount_loc, the section that records the address of every bl _mcount:
- ftrace_allocate_pages
This function allocates one table entry, a struct dyn_ftrace, for every address in the _mcount_loc section. Together the entries form a table used later for iterating over and looking up these addresses. The table consists of a series of contiguous pages, related as shown in the figure below:
- ftrace_update_code
This function takes each table entry in turn, validates the contents at the recorded address and, if they check out, rewrites them, i.e. replaces bl _mcount with nop. The function that performs the rewrite to nop is implemented as follows:
int ftrace_make_nop(struct module *mod, struct dyn_ftrace *rec, unsigned long addr)
{
unsigned int call[2];
int ret;
【1】 make_call(rec->ip, addr, call);
【2】 ret = ftrace_check_current_call(rec->ip, call);
if (ret)
return ret;
return __ftrace_modify_call(rec->ip, addr, false);
}
【1】make_call, whose definition is:
#define make_call(caller, callee, call) \
do { \
call[0] = to_auipc_insn((unsigned int)((unsigned long)callee - \
(unsigned long)caller)); \
call[1] = to_jalr_insn((unsigned int)((unsigned long)callee - \
(unsigned long)caller)); \
} while (0)
#define to_jalr_insn(offset) \
(((offset & JALR_OFFSET_MASK) << JALR_SHIFT) | JALR_BASIC)
#define to_auipc_insn(offset) \
((offset & JALR_SIGN_MASK) ? \
(((offset & AUIPC_OFFSET_MASK) + AUIPC_PAD) | AUIPC_BASIC) : \
((offset & AUIPC_OFFSET_MASK) | AUIPC_BASIC))
make_call takes caller (the calling site) and callee (the call target), and computes call, the jump instruction encoding, from the two addresses. (Note that ftrace_make_nop and make_call as quoted here appear to be the RISC-V implementation; auipc and jalr are RISC-V instructions. On arm64 the logic is analogous, but a single bl instruction is generated.) Mapping this onto the __do_softirq disassembly above: caller is ffffffc010081dc0, callee is ffffffc010085320 (the address of _mcount), and the computed call should be 94000d58.
ffffffc010081dc0: 94000d58 bl ffffffc010085320 <_mcount>
【2】ftrace_check_current_call, whose definition is:
static int ftrace_check_current_call(unsigned long hook_pos,
unsigned int *expected)
{
unsigned int replaced[2];
unsigned int nops[2] = {NOP4, NOP4};
/* we expect nops at the hook position */
if (!expected)
expected = nops;
/*
* Read the text we want to modify;
* return must be -EFAULT on read error
*/
if (probe_kernel_read(replaced, (void *)hook_pos, MCOUNT_INSN_SIZE))
return -EFAULT;
/*
* Make sure it is what we expect it to be;
* return must be -EINVAL on failed comparison
*/
if (memcmp(expected, replaced, sizeof(replaced))) {
pr_err("%p: expected (%08x %08x) but got (%08x %08x)\n",
(void *)hook_pos, expected[0], expected[1], replaced[0],
replaced[1]);
return -EINVAL;
}
return 0;
}
That is, probe_kernel_read reads the contents at the address recorded in the table entry, which are then compared with the instruction encoding previously computed by make_call. If the two match, execution continues; otherwise initialization fails. In the __do_softirq example, this means reading the contents at ffffffc010081dc0 and comparing them with 94000d58. The purpose of the comparison is to guarantee that every modification really targets an address from _mcount_loc: these addresses were loaded into an in-memory data structure, and if that structure were corrupted, the kernel could patch the contents at some other address, corrupting other code and crashing the system.
Dynamic Probe Configuration
Once system initialization completes, the contents at every address in _mcount_loc have been changed to nop. To trace functions, the nop must be further patched into the probe's jump instruction. The Ftrace framework provides multiple function probes (hereafter tracers), registered via register_tracer and selected through the /sys/kernel/debug/tracing/current_tracer node. From the node's file operations it can be seen that the tracer is configured by tracing_set_trace_write:
trace_create_file("current_tracer", 0644, d_tracer, tr, &set_tracer_fops);
static const struct file_operations set_tracer_fops = {
.open = tracing_open_generic,
.read = tracing_set_trace_read,
.write = tracing_set_trace_write,
.llseek = generic_file_llseek,
};
tracing_set_trace_write is implemented as follows:
static ssize_t
tracing_set_trace_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
{
struct trace_array *tr = filp->private_data;
char buf[MAX_TRACER_SIZE+1];
int i;
size_t ret;
int err;
ret = cnt;
if (cnt > MAX_TRACER_SIZE)
cnt = MAX_TRACER_SIZE;
if (copy_from_user(buf, ubuf, cnt))
return -EFAULT;
buf[cnt] = 0;
/* strip ending whitespace. */
for (i = cnt - 1; i > 0 && isspace(buf[i]); i--)
buf[i] = 0;
err = tracing_set_tracer(tr, buf); // buf holds the tracer's name
if (err)
return err;
*ppos += ret;
return ret;
}
static int tracing_set_tracer(struct trace_array *tr, const char *buf)
{
struct tracer *t;
#ifdef CONFIG_TRACER_MAX_TRACE
bool had_max_tr;
#endif
int ret = 0;
mutex_lock(&trace_types_lock);
if (!ring_buffer_expanded) {
ret = __tracing_resize_ring_buffer(tr, trace_buf_size,
RING_BUFFER_ALL_CPUS);
if (ret < 0)
goto out;
ret = 0;
}
【1】 for (t = trace_types; t; t = t->next) {
if (strcmp(t->name, buf) == 0)
break;
}
if (!t) {
ret = -EINVAL;
goto out;
}
if (t == tr->current_trace)
goto out;
#ifdef CONFIG_TRACER_SNAPSHOT
if (t->use_max_tr) {
arch_spin_lock(&tr->max_lock);
if (tr->cond_snapshot)
ret = -EBUSY;
arch_spin_unlock(&tr->max_lock);
if (ret)
goto out;
}
#endif
/* Some tracers won't work on kernel command line */
if (system_state < SYSTEM_RUNNING && t->noboot) {
pr_warn("Tracer '%s' is not allowed on command line, ignored\n",
t->name);
goto out;
}
/* Some tracers are only allowed for the top level buffer */
if (!trace_ok_for_array(t, tr)) {
ret = -EINVAL;
goto out;
}
/* If trace pipe files are being read, we can't change the tracer */
if (tr->current_trace->ref) {
ret = -EBUSY;
goto out;
}
trace_branch_disable();
tr->current_trace->enabled--;
【2】 if (tr->current_trace->reset)
tr->current_trace->reset(tr);
/* Current trace needs to be nop_trace before synchronize_rcu */
tr->current_trace = &nop_trace;
#ifdef CONFIG_TRACER_MAX_TRACE
had_max_tr = tr->allocated_snapshot;
if (had_max_tr && !t->use_max_tr) {
/*
* We need to make sure that the update_max_tr sees that
* current_trace changed to nop_trace to keep it from
* swapping the buffers after we resize it.
* The update_max_tr is called from interrupts disabled
* so a synchronized_sched() is sufficient.
*/
synchronize_rcu();
free_snapshot(tr);
}
#endif
#ifdef CONFIG_TRACER_MAX_TRACE
if (t->use_max_tr && !had_max_tr) {
ret = tracing_alloc_snapshot_instance(tr);
if (ret < 0)
goto out;
}
#endif
【3】 if (t->init) {
ret = tracer_init(t, tr); // if the tracer provides an init function, call it
if (ret)
goto out;
}
【4】 tr->current_trace = t;
tr->current_trace->enabled++;
trace_branch_enable(tr);
out:
mutex_unlock(&trace_types_lock);
return ret;
}
【1】Check whether the requested tracer has been registered (trace_types is a global variable; when a tracer is enabled, register_tracer links it into the trace_types list)
【2】When the requested tracer differs from the one currently in use, the current tracer's reset function must be called first
【3】Call the requested tracer's init function
int tracer_init(struct tracer *t, struct trace_array *tr)
{
tracing_reset_online_cpus(&tr->trace_buffer);
return t->init(tr);
}
【4】Record the requested tracer as the one currently in use
Taking the function tracer as an example, the tracer init process is analyzed below:
static int function_trace_init(struct trace_array *tr)
{
ftrace_func_t func;
/*
* Instance trace_arrays get their ops allocated
* at instance creation. Unless it failed
* the allocation.
*/
if (!tr->ops)
return -ENOMEM;
/* Currently only the global instance can do stack tracing */
【1】 if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
func_flags.val & TRACE_FUNC_OPT_STACK)
func = function_stack_trace_call;
else
func = function_trace_call;
【2】 ftrace_init_array_ops(tr, func);
tr->trace_buffer.cpu = get_cpu();
put_cpu();
tracing_start_cmdline_record();
【3】 tracing_start_function_trace(tr);
return 0;
}
【1】Choose the tracer function according to the flags (with or without stack backtrace)
【2】Set tr->ops->func to the func chosen above; tr is the data structure threaded through the entire process
【3】Start the function tracer
static void tracing_start_function_trace(struct trace_array *tr)
{
tr->function_enabled = 0;
register_ftrace_function(tr->ops);
tr->function_enabled = 1;
}
int register_ftrace_function(struct ftrace_ops *ops)
{
int ret = -1;
ftrace_ops_init(ops);
mutex_lock(&ftrace_lock);
ret = ftrace_startup(ops, 0);
mutex_unlock(&ftrace_lock);
return ret;
}
int ftrace_startup(struct ftrace_ops *ops, int command)
{
int ret;
if (unlikely(ftrace_disabled))
return -ENODEV;
【1】 ret = __register_ftrace_function(ops);
if (ret)
return ret;
ftrace_start_up++;
/*
* Note that ftrace probes uses this to start up
* and modify functions it will probe. But we still
* set the ADDING flag for modification, as probes
* do not have trampolines. If they add them in the
* future, then the probes will need to distinguish
* between adding and updating probes.
*/
ops->flags |= FTRACE_OPS_FL_ENABLED | FTRACE_OPS_FL_ADDING;
【2】 ret = ftrace_hash_ipmodify_enable(ops);
if (ret < 0) {
/* Rollback registration process */
__unregister_ftrace_function(ops);
ftrace_start_up--;
ops->flags &= ~FTRACE_OPS_FL_ENABLED;
return ret;
}
if (ftrace_hash_rec_enable(ops, 1))
command |= FTRACE_UPDATE_CALLS;
【3】 ftrace_startup_enable(command);
ops->flags &= ~FTRACE_OPS_FL_ADDING;
return 0;
}
【1】Register the current function tracer; registration here means linking it into the list of enabled tracers:
int __register_ftrace_function(struct ftrace_ops *ops)
{
if (ops->flags & FTRACE_OPS_FL_DELETED)
return -EINVAL;
if (WARN_ON(ops->flags & FTRACE_OPS_FL_ENABLED))
return -EBUSY;
#ifndef CONFIG_DYNAMIC_FTRACE_WITH_REGS
/*
* If the ftrace_ops specifies SAVE_REGS, then it only can be used
* if the arch supports it, or SAVE_REGS_IF_SUPPORTED is also set.
* Setting SAVE_REGS_IF_SUPPORTED makes SAVE_REGS irrelevant.
*/
if (ops->flags & FTRACE_OPS_FL_SAVE_REGS &&
!(ops->flags & FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED))
return -EINVAL;
if (ops->flags & FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED)
ops->flags |= FTRACE_OPS_FL_SAVE_REGS;
#endif
if (!core_kernel_data((unsigned long)ops))
ops->flags |= FTRACE_OPS_FL_DYNAMIC;
【1.1】 add_ftrace_ops(&ftrace_ops_list, ops);
/* Always save the function, and reset at unregistering */
ops->saved_func = ops->func;
if (ftrace_pids_enabled(ops))
ops->func = ftrace_pid_func;
ftrace_update_trampoline(ops);
if (ftrace_enabled)
【1.2】 update_ftrace_function();
return 0;
}
【1.1】add_ftrace_ops adds the tracer's ops to the global list ftrace_ops_list.
【1.2】update_ftrace_function updates the tracer list and installs the tracer
static void update_ftrace_function(void)
{
ftrace_func_t func;
/*
* Prepare the ftrace_ops that the arch callback will use.
* If there's only one ftrace_ops registered, the ftrace_ops_list
* will point to the ops we want.
*/
【1.2.1】 set_function_trace_op = rcu_dereference_protected(ftrace_ops_list,
lockdep_is_held(&ftrace_lock));
/* If there's no ftrace_ops registered, just call the stub function */
if (set_function_trace_op == &ftrace_list_end) {
func = ftrace_stub;
/*
* If we are at the end of the list and this ops is
* recursion safe and not dynamic and the arch supports passing ops,
* then have the mcount trampoline call the function directly.
*/
} else if (rcu_dereference_protected(ftrace_ops_list->next,
lockdep_is_held(&ftrace_lock)) == &ftrace_list_end) {
func = ftrace_ops_get_list_func(ftrace_ops_list);
} else {
/* Just use the default ftrace_ops */
set_function_trace_op = &ftrace_list_end;
func = ftrace_ops_list_func;
}
update_function_graph_func();
/* If there's no change, then do nothing more here */
if (ftrace_trace_function == func)
return;
/*
* If we are using the list function, it doesn't care
* about the function_trace_ops.
*/
if (func == ftrace_ops_list_func) {
ftrace_trace_function = func;
/*
* Don't even bother setting function_trace_ops,
* it would be racy to do so anyway.
*/
return;
}
#ifndef CONFIG_DYNAMIC_FTRACE
/*
* For static tracing, we need to be a bit more careful.
* The function change takes affect immediately. Thus,
* we need to coorditate the setting of the function_trace_ops
* with the setting of the ftrace_trace_function.
*
* Set the function to the list ops, which will call the
* function we want, albeit indirectly, but it handles the
* ftrace_ops and doesn't depend on function_trace_op.
*/
ftrace_trace_function = ftrace_ops_list_func;
/*
* Make sure all CPUs see this. Yes this is slow, but static
* tracing is slow and nasty to have enabled.
*/
schedule_on_each_cpu(ftrace_sync);
/* Now all cpus are using the list ops. */
function_trace_op = set_function_trace_op;
/* Make sure the function_trace_op is visible on all CPUs */
smp_wmb();
/* Nasty way to force a rmb on all cpus */
smp_call_function(ftrace_sync_ipi, NULL, 1);
/* OK, we are all set to update the ftrace_trace_function now! */
#endif /* !CONFIG_DYNAMIC_FTRACE */
【1.2.2】 ftrace_trace_function = func; // install ftrace_trace_function
}
【1.2.1】Read the head of the registered tracer list ftrace_ops_list and examine it: if the head equals the list tail, no tracer is registered, and ftrace_trace_function (【1.2.2】) is set to the default ftrace_stub; if the element after the head is the tail, exactly one tracer is registered, and ftrace_trace_function is set to that tracer; if multiple tracers are registered, ftrace_trace_function is set to ftrace_ops_list_func, which runs every registered tracer in turn:
static void ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *op, struct pt_regs *regs)
{
__ftrace_ops_list_func(ip, parent_ip, NULL, regs);
}
static nokprobe_inline void
__ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
struct ftrace_ops *ignored, struct pt_regs *regs)
{
struct ftrace_ops *op;
int bit;
bit = trace_test_and_set_recursion(TRACE_LIST_START, TRACE_LIST_MAX);
if (bit < 0)
return;
/*
* Some of the ops may be dynamically allocated,
* they must be freed after a synchronize_rcu().
*/
preempt_disable_notrace();
do_for_each_ftrace_op(op, ftrace_ops_list) {
/* Stub functions don't need to be called nor tested */
if (op->flags & FTRACE_OPS_FL_STUB)
continue;
/*
* Check the following for each ops before calling their func:
* if RCU flag is set, then rcu_is_watching() must be true
* if PER_CPU is set, then ftrace_function_local_disable()
* must be false
* Otherwise test if the ip matches the ops filter
*
* If any of the above fails then the op->func() is not executed.
*/
if ((!(op->flags & FTRACE_OPS_FL_RCU) || rcu_is_watching()) &&
ftrace_ops_test(op, ip, regs)) {
if (FTRACE_WARN_ON(!op->func)) {
pr_warn("op=%p %pS\n", op, op);
goto out;
}
op->func(ip, parent_ip, op, regs);
}
} while_for_each_ftrace_op(op);
out:
preempt_enable_notrace();
trace_clear_recursion(bit);
}
That is, do_for_each_ftrace_op iterates over the probe list and calls each op->func in turn. In the function of 【1】, since 【1.1】 registered the function tracer ops, the list walked in 【1.2】 necessarily contains the function tracer ops.
【2】ftrace_hash_ipmodify_enable updates the hash list. As described earlier, a tracer can trace only a subset of kernel functions; the functions to be traced are linked into a hash list. When the tracer is changed, the hash of functions to trace changes with it, and this function updates the corresponding flags.
static int ftrace_hash_ipmodify_enable(struct ftrace_ops *ops)
{
struct ftrace_hash *hash = ops->func_hash->filter_hash;
if (ftrace_hash_empty(hash))
hash = NULL;
return __ftrace_hash_update_ipmodify(ops, EMPTY_HASH, hash);
}
static int __ftrace_hash_update_ipmodify(struct ftrace_ops *ops,
struct ftrace_hash *old_hash,
struct ftrace_hash *new_hash)
{
struct ftrace_page *pg;
struct dyn_ftrace *rec, *end = NULL;
int in_old, in_new;
/* Only update if the ops has been registered */
if (!(ops->flags & FTRACE_OPS_FL_ENABLED))
return 0;
if (!(ops->flags & FTRACE_OPS_FL_IPMODIFY))
return 0;
/*
* Since the IPMODIFY is a very address sensitive action, we do not
* allow ftrace_ops to set all functions to new hash.
*/
if (!new_hash || !old_hash)
return -EINVAL;
/* Update rec->flags */
do_for_each_ftrace_rec(pg, rec) {
if (rec->flags & FTRACE_FL_DISABLED)
continue;
/* We need to update only differences of filter_hash */
in_old = !!ftrace_lookup_ip(old_hash, rec->ip);
in_new = !!ftrace_lookup_ip(new_hash, rec->ip);
if (in_old == in_new)
continue;
if (in_new) {
/* New entries must ensure no others are using it */
if (rec->flags & FTRACE_FL_IPMODIFY)
goto rollback;
rec->flags |= FTRACE_FL_IPMODIFY;
} else /* Removed entry */
rec->flags &= ~FTRACE_FL_IPMODIFY;
} while_for_each_ftrace_rec();
return 0;
rollback:
end = rec;
/* Roll back what we did above */
do_for_each_ftrace_rec(pg, rec) {
if (rec->flags & FTRACE_FL_DISABLED)
continue;
if (rec == end)
goto err_out;
in_old = !!ftrace_lookup_ip(old_hash, rec->ip);
in_new = !!ftrace_lookup_ip(new_hash, rec->ip);
if (in_old == in_new)
continue;
if (in_new)
rec->flags &= ~FTRACE_FL_IPMODIFY;
else
rec->flags |= FTRACE_FL_IPMODIFY;
} while_for_each_ftrace_rec();
err_out:
return -EBUSY;
}
【3】Begin patching the contents at the addresses recorded in _mcount_loc so that they point at the tracer (the analysis below assumes no tracer was previously enabled and the function tracer is now being enabled). The call chain is:
ftrace_startup_enable
  ftrace_run_update_code
    arch_ftrace_update_code
      ftrace_run_stop_machine
        __ftrace_modify_code
          ftrace_modify_all_code
void ftrace_modify_all_code(int command)
{
int update = command & FTRACE_UPDATE_TRACE_FUNC;
int mod_flags = 0;
int err = 0;
if (command & FTRACE_MAY_SLEEP)
mod_flags = FTRACE_MODIFY_MAY_SLEEP_FL;
/*
* If the ftrace_caller calls a ftrace_ops func directly,
* we need to make sure that it only traces functions it
* expects to trace. When doing the switch of functions,
* we need to update to the ftrace_ops_list_func first
* before the transition between old and new calls are set,
* as the ftrace_ops_list_func will check the ops hashes
* to make sure the ops are having the right functions
* traced.
*/
if (update) {
【1】 err = ftrace_update_ftrace_func(ftrace_ops_list_func);
if (FTRACE_WARN_ON(err))
return;
}
if (command & FTRACE_UPDATE_CALLS)
【2】 ftrace_replace_code(mod_flags | FTRACE_MODIFY_ENABLE_FL);
else if (command & FTRACE_DISABLE_CALLS)
ftrace_replace_code(mod_flags);
if (update && ftrace_trace_function != ftrace_ops_list_func) {
function_trace_op = set_function_trace_op;
smp_wmb();
/* If irqs are disabled, we are in stop machine */
if (!irqs_disabled())
smp_call_function(ftrace_sync_ipi, NULL, 1);
err = ftrace_update_ftrace_func(ftrace_trace_function);
if (FTRACE_WARN_ON(err))
return;
}
if (command & FTRACE_START_FUNC_RET)
err = ftrace_enable_ftrace_graph_caller();
else if (command & FTRACE_STOP_FUNC_RET)
err = ftrace_disable_ftrace_graph_caller();
FTRACE_WARN_ON(err);
}
【1】Update the tracer function
int ftrace_update_ftrace_func(ftrace_func_t func)
{
unsigned long pc;
u32 new;
pc = (unsigned long)&ftrace_call;
new = aarch64_insn_gen_branch_imm(pc, (unsigned long)func,
AARCH64_INSN_BRANCH_LINK);
// patch the ftrace_call site to branch to the tracer func
return ftrace_modify_code(pc, 0, new, false);
}
As mentioned earlier, switching tracers only requires updating what the two-level pointer points at, and ftrace_call is that two-level pointer. aarch64_insn_gen_branch_imm computes the jump instruction from the address of ftrace_call to the tracer func, and the contents at ftrace_call are patched to that jump; that is, the nop below is replaced with a bl tracer instruction:
GLOBAL(ftrace_call) // tracer(pc, lr);
nop // This will be replaced with "bl xxx"
// where xxx can be any kind of tracer.
【2】ftrace_replace_code patches the contents at every address recorded in _mcount_loc so that they jump to the two-level pointer described above
void __weak ftrace_replace_code(int mod_flags)
{
struct dyn_ftrace *rec;
struct ftrace_page *pg;
bool enable = mod_flags & FTRACE_MODIFY_ENABLE_FL;
int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
int failed;
if (unlikely(ftrace_disabled))
return;
do_for_each_ftrace_rec(pg, rec) {
if (rec->flags & FTRACE_FL_DISABLED)
continue;
failed = __ftrace_replace_code(rec, enable);
if (failed) {
ftrace_bug(failed, rec);
/* Stop processing */
return;
}
if (schedulable)
cond_resched();
} while_for_each_ftrace_rec();
}
Each table entry is examined in turn. For functions that should be traced, __ftrace_replace_code replaces the nop with a jump to ftrace_caller; for functions that should not be traced, nothing is done. Records marked FTRACE_FL_DISABLED are skipped outright, and which functions are actually patched is governed by the filter hash, which is populated through set_ftrace_filter.
static int
__ftrace_replace_code(struct dyn_ftrace *rec, bool enable)
{
unsigned long ftrace_old_addr;
unsigned long ftrace_addr;
int ret;
【1】 ftrace_addr = ftrace_get_addr_new(rec);
/* This needs to be done before we call ftrace_update_record */
ftrace_old_addr = ftrace_get_addr_curr(rec);
【2】 ret = ftrace_update_record(rec, enable);
ftrace_bug_type = FTRACE_BUG_UNKNOWN;
switch (ret) {
case FTRACE_UPDATE_IGNORE:
return 0;
case FTRACE_UPDATE_MAKE_CALL:
ftrace_bug_type = FTRACE_BUG_CALL;
return ftrace_make_call(rec, ftrace_addr);
case FTRACE_UPDATE_MAKE_NOP:
ftrace_bug_type = FTRACE_BUG_NOP;
return ftrace_make_nop(NULL, rec, ftrace_old_addr);
case FTRACE_UPDATE_MODIFY_CALL:
ftrace_bug_type = FTRACE_BUG_UPDATE;
return ftrace_modify_call(rec, ftrace_old_addr, ftrace_addr);
}
return -1; /* unknown ftrace bug */
}
【1】Look up the address of the two-level pointer
unsigned long ftrace_get_addr_new(struct dyn_ftrace *rec)
{
struct ftrace_ops *ops;
/* Trampolines take precedence over regs */
if (rec->flags & FTRACE_FL_TRAMP) {
ops = ftrace_find_tramp_ops_new(rec);
if (FTRACE_WARN_ON(!ops || !ops->trampoline)) {
pr_warn("Bad trampoline accounting at: %p (%pS) (%lx)\n",
(void *)rec->ip, (void *)rec->ip, rec->flags);
/* Ftrace is shutting down, return anything */
return (unsigned long)FTRACE_ADDR;
}
return ops->trampoline;
}
if (rec->flags & FTRACE_FL_REGS)
return (unsigned long)FTRACE_REGS_ADDR;
else
return (unsigned long)FTRACE_ADDR;
}
The two-level pointer returned here is not the address of ftrace_call mentioned earlier; ftrace_call is one part of the ftrace_caller function:
#ifndef FTRACE_ADDR
#define FTRACE_ADDR ((unsigned long)ftrace_caller)
#endif
#ifndef FTRACE_GRAPH_ADDR
#define FTRACE_GRAPH_ADDR ((unsigned long)ftrace_graph_caller)
#endif
#ifndef FTRACE_REGS_ADDR
#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
# define FTRACE_REGS_ADDR ((unsigned long)ftrace_regs_caller)
#else
# define FTRACE_REGS_ADDR FTRACE_ADDR
#endif
#endif
.macro mcount_get_pc0 reg
mcount_adjust_addr \reg, x30
.endm
.macro mcount_get_pc reg
ldr \reg, [x29, #8]
mcount_adjust_addr \reg, \reg
.endm
.macro mcount_get_lr reg
ldr \reg, [x29]
ldr \reg, [\reg, #8]
.endm
.macro mcount_get_lr_addr reg
ldr \reg, [x29]
add \reg, \reg, #8
.endm
ENTRY(ftrace_caller)
mcount_enter
mcount_get_pc0 x0 // function's pc
mcount_get_lr x1 // function's lr
GLOBAL(ftrace_call) // tracer(pc, lr);
nop // This will be replaced with "bl xxx"
// where xxx can be any kind of tracer.
#ifdef CONFIG_FUNCTION_GRAPH_TRACER
GLOBAL(ftrace_graph_call) // ftrace_graph_caller();
nop // If enabled, this will be replaced
// "b ftrace_graph_caller"
#endif
mcount_exit
ENDPROC(ftrace_caller)
【2】Check whether the function is already being traced, i.e. whether the original contents at the address are nop. If a nop is being replaced with bl tracer, FTRACE_UPDATE_MAKE_CALL is returned; if the address was already patched by another tracer, FTRACE_UPDATE_MODIFY_CALL is returned
int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
{
unsigned long pc = rec->ip;
u32 old, new;
long offset = (long)pc - (long)addr;
.........
old = aarch64_insn_gen_nop(); // the old encoding is the nop instruction
// the new encoding, computed from the address of the function to jump to
new = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK);
// replace the nop encoding with the new one
return ftrace_modify_code(pc, old, new, true);
}