Linux:ftrace & kprobetrace

277 篇文章 25 订阅 ¥69.90 ¥99.00

参考

在/sys/kernel/debug/tracing/README 这个是一个简易的使用说明
详细的可以在这里找到
https://docs.kernel.org/trace/kprobes.html
https://lwn.net/Articles/370423/
https://www.brendangregg.com/blog/2014-08-30/ftrace-the-hidden-light-switch.html
https://lwn.net/Articles/132196/ ;An introduction to KProbes
https://lwn.net/Articles/852112/ ; systemtap与bpftrace的对比
https://lwn.net/Articles/505465/
这个上面说需要关闭这个配置项:STRICT_MEMORY_RWX;不知真假,先放着这里作为一个问题调试的参考。

https://docs.kernel.org/trace/kprobetrace.html
https://www.kernel.org/doc/html/v4.18/trace/ftrace.html

gcc 8 产生的cold 函数

这个cold函数没有添加ftrace相关的代码。所以也不能做trace。

blacklist

Kprobes can probe most of the kernel except itself. This means that there are some functions where kprobes cannot probe. Probing (trapping) such functions can cause a recursive trap (e.g. double fault) or the nested probe handler may never be called. Kprobes manages such functions as a blacklist. If you want to add a function into the blacklist, you just need
to (1) include linux/kprobes.h and (2) use NOKPROBE_SYMBOL() macro to specify a blacklisted function.
Kprobes checks the given probe address against the blacklist and rejects registering it, if the given address is in the blacklist.

相关Linux配置

CONFIG_TRACING

config TRACING
bool
select DEBUG_FS
select RING_BUFFER
select STACKTRACE if STACKTRACE_SUPPORT
select TRACEPOINTS
select NOP_TRACER
select BINARY_PRINTF
select EVENT_TRACING
select TRACE_CLOCK

实现

notrace

这个的意思是说,不要将function放到ftrace里去。

#define notrace __attribute__((no_instrument_function))
/*
 * We must have this function blacklisted from kprobes, tagged with notrace
 * and call read_cr2() before calling anything else. To avoid calling any
 * kind of tracing machinery before we've observed the CR2 value.
 *
 * exception_{enter,exit}() contains all sorts of tracepoints.
 */
dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)

注意事项

根据pid做trace

Optionally, to trace only the kernel functions executed in the context of a particular process, set the pseudo-variable ‘set-ftrace-pid’ to the process id (pid) of the process:
echo pid > /sys/kernel/tracing/set_ftrace_pid

trace 文件里的pid

是一个父进程的id,比如如果在ksh界面执行了一个操作,这个操作下调用的函数,其实记录的是ksh进程的pid。而不是新命令clone出来的pid。

mount

mount -t tracefs nodev /sys/kernel/tracing
redhat8, 默认的目录是在 /sys/kernel/debug/tracing/

ftrace

相关的sysctl配置

ftrace_enabled
sysctl kernel.ftrace_enabled=0
sysctl kernel.ftrace_enabled=1
or
echo 0 > /proc/sys/kernel/ftrace_enabled
echo 1 > /proc/sys/kernel/ftrace_enabled

function_graph

Similar to the function tracer except that the function tracer probes the functions on their entry whereas the function graph tracer traces on both entry and exit of the functions. It then provides the ability to draw a graph of function calls similar to C code source.
函数调用的流程图,厉害了这个。可以使用一下,体验体验。

$ echo 0 > tracing_on
$ echo function_graph > current_tracer
$ echo ubi* > set_graph_function
$ cat enabled_functions | grep ubi_attach
ubi_attach_mtd_dev
ubi_attach
$ echo 1 > tracing_on
$ ubiattach -m 3 -d 3
UBI device number 3, total 144 LEBs (18855936 bytes, 18.0 MiB), available 0 LEBs (0 bytes), LEB size 130944 bytes (127.9 KiB)
$ head trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
  6)   0.059 us    |      } /* dput */
  6)   0.055 us    |      mntput();
  6)   1.238 us    |    } /* path_put */
  6)   0.072 us    |    unroll_tree_refs();
  6)   0.056 us    |    kfree();
  6)   2.833 us    |  } /* __audit_syscall_exit */

需要注意的一点是,这个依赖:这个文件里的函数,必须是所有的函数,如果是单一的函数,可能就显示不出来所有的。
[root@epsilon-sim0 tracing]# more set_ftrace_filter
####all functions enabled ####

function

[root@mylinux1 tracing]# pwd
/sys/kernel/debug/tracing
[root@mylinux1 tracing]# echo pci_bus_write_config_byte > set_ftrace_filter
[root@mylinux1 tracing]# echo function > current_tracer    、、、、 
这里的“function”,是一个关键字,是要输入到current_tracer 文件的一个选项,意思是,当前追踪的是函数。
[root@mylinux1 tracing]# echo 1 > tracing_on 

如果set_ftrace_filter 设置为空,就是所有的function 都做trace

#more set_ftrace_filter
####all functions enabled ####

实现机制

通过反汇编发现,每个函数都一开始就会调用函数 <fentry>

(gdb) disass __hrtimer_run_queues
Dump of assembler code for function __hrtimer_run_queues:
   0xffffffff8117d200 <+0>:     callq  0xffffffff81a01900 <__fentry__>
   0xffffffff8117d205 <+5>:     push   %r15

这个函数的定义在x86下,arch/x86/kernel/ftrace_32/64.S

# define function_hook	__fentry__
EXPORT_SYMBOL(__fentry__)

#ifdef CONFIG_DYNAMIC_FTRACE

SYM_FUNC_START(function_hook)
	retq
SYM_FUNC_END(function_hook)


#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_FUNC_START(ftrace_graph_caller)
	/* Saves rbp into %rdx and fills first parameter  */
	save_mcount_regs

	leaq MCOUNT_REG_SIZE+8(%rsp), %rsi
	movq $0, %rdx	/* No framepointers needed */
	call	prepare_ftrace_return

	restore_mcount_regs

	retq
SYM_FUNC_END(ftrace_graph_caller)

/include/linux/linkage.h
/*
 * The same as SYM_FUNC_START_ALIAS, but we will need to distinguish these two later.
 */
#define SYM_FUNC_START(name)				\
	SYM_START(name, SYM_L_GLOBAL, SYM_A_ALIGN)
#endif

/* SYM_START -- use only if you have to */
#ifndef SYM_START
#define SYM_START(name, linkage, align...)		\
	SYM_ENTRY(name, linkage, align)
#endif

/* SYM_ENTRY -- use only if you have to for non-paired symbols */
#ifndef SYM_ENTRY
#define SYM_ENTRY(name, linkage, align...)		\
	linkage(name) ASM_NL				\
	align ASM_NL					\
	name:
#endif

最终会使用
#define MCOUNT_ADDR ((unsigned long)(fentry))
这个MCOUNT_ADDR 表示fentry的函数地址
#define MCOUNT_INSN_SIZE 5 /* sizeof mcount call */

ftrace_make_nop/ftrace_code_disable,在系统启动或者是加载module时,将这个地址调用指令,改成nop;在disable ftrace时也会改成nop指令

dump stack

这里需要注意;要将函数和stacktrace同时写进到set_ftrace_filter文件,不然会dump所有的函数
[root@vmtca-2101 tracing]# more set_ftrace_filter
page_add_new_anon_rmap
page_add_new_anon_rmap:stacktrace:unlimited

echo :stacktrace > set_ftrace_filter

            triggers: a command to perform when function is hit
              Format: <function>:<trigger>[:count]
             trigger: traceon, traceoff
                      enable_event:<system>:<event>
                      disable_event:<system>:<event>
                      stacktrace
                      snapshot
                      dump
                      cpudump
             example: echo do_fault:traceoff > set_ftrace_filter
                      echo do_trap:traceoff:3 > set_ftrace_filter

event

有些trace event,例如
下面的 trace_kfree_skb 就是一个trace点;可以通过
/sys/kernel/debug/tracing/events/skb/kfree_skb

[root@rms01-host01 kfree_skb]# echo 1 > enable
[root@rms01-host01 kfree_skb]# more format
name: kfree_skb
ID: 1278
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;

    field:void * skbaddr;   offset:8;       size:8; signed:0;
    field:void * location;  offset:16;      size:8; signed:0;
    field:unsigned short protocol;  offset:24;      size:2; signed:0;

print fmt: “skbaddr=%p protocol=%u location=%p”, REC->skbaddr, REC->protocol, REC->location

void kfree_skb(struct sk_buff *skb)
{
	if (!skb_unref(skb))
		return;

	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}

问题

设置 tracing_on 失败:

sec01-fw-a:root/sys/kernel/debug/tracing:
#echo 1 > tracing_on
sec01-fw-a:root/sys/kernel/debug/tracing:
#more tracing_on
0

如果使用tail -f trace 命令,打开trace文件时,怎么设置都不成功。

set_ftrace_filter里的内容不能删除

在这个文件里设置:icmp_reply.constprop.35:stacktrace 这个之后,怎么也删不掉这一行。
解决办法就是:echo ‘!skb_release_data:stacktrace’ > set_ftrace_filter
在 前面加叹号,然后需要使用 单引号引起来这个内容,echo到这个文件,就好了。

开放问题

是否支持kernel module里的函数的跟踪

其实可以使用kprobe 或者user probe的功能更,查看kernel module里的行为;还没实践过。

kprobetrace

简单的步骤

如果执行命令出现错误,可用使用dmesg查看错误原因。
echo ‘r:myretprobe do_sys_open $retval’ >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
echo 0 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace

echo ‘p:myretprobe1 unix_dgram_sendmsg msg=%ax’ >> /sys/kernel/debug/tracing/kprobe_events
echo ‘p:myretprobe1 unix_dgram_sendmsg arg3=%ax’ >> /sys/kernel/debug/tracing/kprobe_events
这里可以将msg变量打印出来,以及第三个参数。
static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)

错误write to 1 failed [No such file or directory]

#echo ‘r:myretprobe acl_permission_check $retval’ >> /sys/kernel/debug/tracing/kprobe_events
-ksh: echo: write to 1 failed [No such file or directory]
这个错误的原因是,acl_permission_check 这个函数不可以做kprobe;
需要查看 available_filter_functions 看函数是否可用kprobe

错误实例

这个错误的原因是,当从这个网页复制命令到Console时,单引号其实是中文格式,需要改一改。
[443041.985255] Probe definition must be started with ‘p’, ‘r’ or ‘-’.
[443078.589547] Probe definition must be started with ‘p’, ‘r’ or ‘-’.

错误

#echo ‘r:myretprobe __sched_setscheduler $retval’ >> /sys/kernel/debug/tracing/kprobe_events
-ksh: echo: write to 1 failed [Device or resource busy]
需要将真正tracing的 事件停掉; 才能修改这个probe

简介

这个kprobetrace类似基于事件的tracepoint(跟踪点)。不过这个是基于kprobes(kprobe and kretprobe)。所有kprobe可以探测的地方,都可以使用这个kprobetrace功能。也就意味着,除了带有__kprobes/nokprobe_inline 关键字的函数,还有带有标记NOKPROBE_SYMBOL的函数外,所有的其他函数都可以做kprobetrace。而且是可以动态做探测,不像tracepoint,是静态编译在内核模块里。
需要打开设置:CONFIG_KPROBE_EVENTS=y.
像事件跟踪器,kprobetrace,也不是通过current_tracer激活。而是通过向 /sys/kernel/debug/tracing/kprobe_events文件添加跟踪点。然后通过文件/sys/kernel/debug/tracing/events/kprobes//enable激活跟踪。
除了kprobe_events,也可以使用文件:/sys/kernel/debug/tracing/dynamic_events。T这个统一的接口可以访问其他的动态事件。
语法:

 p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]  : 设置一个探测
 r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]  : 设置一个返回值探测
 p:[GRP/]EVENT] [MOD:]SYM[+0]%return [FETCHARGS]       : 设置一个返回值探测
 -:[GRP/]EVENT                                         : 清楚探测

GRP            : 组名称. 如果省略,默认使用"kprobes"。
EVENT          : 事件名,如果省略,通过SYM+offs或者MEMADDR生成。
MOD            : 提供SYM的模块名称。
SYM[+offs]     : Symbol+offset要插入的探测点。
SYM%return     : 符合的返回地址
MEMADDR        : 探测插入的地址。
MAXACTIVE      : Maximum number of instances of the specified function that
                 can be probed simultaneously, or 0 for the default value
                 as defined in Documentation/trace/kprobes.rst section 1.3.1.
FETCHARGS      : Arguments. Each probe can have up to 128 args.
 %REG          : Fetch register REG
 @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
 @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
 $stackN       : Fetch Nth entry of stack (N >= 0)
 $stack        : Fetch stack address.
 $argN         : Fetch the Nth function argument. (N >= 1) (\*1)
 $retval       : Fetch return value.(\*2)
 $comm         : Fetch current task comm.
 +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
 \IMM          : Store an immediate value to the argument.
 NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
 FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
                 (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                 (x8/x16/x32/x64), "string", "ustring" and bitfield
                 are supported.

 (\*1) only for the probe on function entry (offs == 0).
 (\*2) only for return probe.
 (\*3) this is useful for fetching a field of data structures.
 (\*4) "u" means user-space dereference. See :ref:`user_mem_access`.

类型

Several types are supported for fetch-args. Kprobe tracer will access memory by given type. Prefix ‘s’ and ‘u’ means those types are signed and unsigned respectively. ‘x’ prefix implies it is unsigned. Traced arguments are shown in decimal (‘s’ and ‘u’) or hexadecimal (‘x’). Without type casting, ‘x32’ or ‘x64’ is used depends on the architecture (e.g. x86-32 uses x32, and x86-64 uses x64). These value types can be an array. To record array data, you can add ‘[N]’ (where N is a fixed number, less than 64) to the base type. E.g. ‘x16[4]’ means an array of x16 (2bytes hex) with 4 elements. Note that the array can be applied to memory type fetchargs, you can not apply it to registers/stack-entries etc. (for example, ‘ s t a c k 1 : x 8 [ 8 ] ’ i s w r o n g , b u t ‘ + 8 ( stack1:x8[8]’ is wrong, but ‘+8( stack1:x8[8]iswrong,but+8(stack):x8[8]’ is OK.) String type is a special type, which fetches a “null-terminated” string from kernel space. This means it will fail and store NULL if the string container has been paged out. “ustring” type is an alternative of string for user-space. See User Memory Access for more info… The string array type is a bit different from other types. For other base types, [1] is equal to (e.g. +0(%di):x32[1] is same as +0(%di):x32.) But string[1] is not equal to string. The string type itself represents “char array”, but string array type represents “char * array”. So, for example, +0(%di):string[1] is equal to +0(+0(%di)):string. Bitfield is another special type, which takes 3 parameters, bit-width, bit- offset, and container-size (usually 32). The syntax is:

b@/

Symbol type(‘symbol’) is an alias of u32 or u64 type (depends on BITS_PER_LONG) which shows given pointer in “symbol+offset” style. For $comm, the default type is “string”; any other type is invalid.
User Memory Access

Kprobe events supports user-space memory access. For that purpose, you can use either user-space dereference syntax or ‘ustring’ type.

The user-space dereference syntax allows you to access a field of a data structure in user-space. This is done by adding the “u” prefix to the dereference syntax. For example, +u4(%si) means it will read memory from the address in the register %si offset by 4, and the memory is expected to be in user-space. You can use this for strings too, e.g. +u0(%si):string will read a string from the address in the register %si that is expected to be in user- space. ‘ustring’ is a shortcut way of performing the same task. That is, +0(%si):ustring is equivalent to +u0(%si):string.

Note that kprobe-event provides the user-memory access syntax but it doesn’t use it transparently. This means if you use normal dereference or string type for user memory, it might fail, and may always fail on some archs. The user has to carefully check if the target data is in kernel or user space.
Per-Probe Event Filtering

Per-probe event filtering feature allows you to set different filter on each probe and gives you what arguments will be shown in trace buffer. If an event name is specified right after ‘p:’ or ‘r:’ in kprobe_events, it adds an event under tracing/events/kprobes/, at the directory you can see ‘id’, ‘enable’, ‘format’, ‘filter’ and ‘trigger’.

enable:

You can enable/disable the probe by writing 1 or 0 on it.

format:

This shows the format of this probe event.

filter:

You can write filtering rules of this event.

id:

This shows the id of this probe event.

trigger:

This allows to install trigger commands which are executed when the event is hit (for details, see Event Tracing, section 6).

Event Profiling

You can check the total number of probe hits and probe miss-hits via /sys/kernel/debug/tracing/kprobe_profile. The first column is event name, the second is the number of probe hits, the third is the number of probe miss-hits.
Kernel Boot Parameter

You can add and enable new kprobe events when booting up the kernel by “kprobe_event=” parameter. The parameter accepts a semicolon-delimited kprobe events, which format is similar to the kprobe_events. The difference is that the probe definition parameters are comma-delimited instead of space. For example, adding myprobe event on do_sys_open like below

p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)

should be below for kernel boot parameter (just replace spaces with comma)

p:myprobe,do_sys_open,dfd=%ax,filename=%dx,flags=%cx,mode=+4($stack)

Usage examples

To add a probe as a new event, write a new definition to kprobe_events as below:

echo ‘p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)’ > /sys/kernel/debug/tracing/kprobe_events

This sets a kprobe on the top of do_sys_open() function with recording 1st to 4th arguments as “myprobe” event. Note, which register/stack entry is assigned to each function argument depends on arch-specific ABI. If you unsure the ABI, please try to use probe subcommand of perf-tools (you can find it under tools/perf/). As this example shows, users can choose more familiar names for each arguments.

echo ‘r:myretprobe do_sys_open $retval’ >> /sys/kernel/debug/tracing/kprobe_events

This sets a kretprobe on the return point of do_sys_open() function with recording return value as “myretprobe” event. You can see the format of these events via /sys/kernel/debug/tracing/events/kprobes//format.

cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
name: myprobe
ID: 780
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1;signed:0;
field:int common_pid; offset:4; size:4; signed:1;

    field:unsigned long __probe_ip; offset:12;      size:4; signed:0;
    field:int __probe_nargs;        offset:16;      size:4; signed:1;
    field:unsigned long dfd;        offset:20;      size:4; signed:0;
    field:unsigned long filename;   offset:24;      size:4; signed:0;
    field:unsigned long flags;      offset:28;      size:4; signed:0;
    field:unsigned long mode;       offset:32;      size:4; signed:0;

print fmt: “(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx”, REC->__probe_ip,
REC->dfd, REC->filename, REC->flags, REC->mode

You can see that the event has 4 arguments as in the expressions you specified.

下面的命令可以清除所有的探测点:
echo > /sys/kernel/debug/tracing/kprobe_events

或者,使用下面的命令定点清除:
清除kprobe 跟踪事件,下面这个命令可以清除选定的探测点。
echo -:myprobe >> kprobe_events

Right after definition, each event is disabled by default. For tracing these events, you need to enable it.
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
开启跟踪:
#echo 1 > tracing_on
关闭:
#echo 0 > tracing_on
通过文件 /sys/kernel/debug/tracing/trace,查看收集到的日志。

cat /sys/kernel/debug/tracing/trace
#tracer: nop
#					TASK-PID    CPU#    TIMESTAMP  FUNCTION
#						| |       |          |         |
           <...>-1447  [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe
           <...>-1447  [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6
           <...>-1447  [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3
           <...>-1447  [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3

每一行显示了什么时候内核遇到了一个事件。“<- SYMBOL” 表示内核从SYMBOL退出。例如上面的例子,
“sys_open+0x1b/0x1d <- do_sys_open”就是说内核从do_sys_open退出,回到 sys_open+0x1b。

mellanox firmware 相关的trace

3.5.1.8Firmware Tracer
This mechanism allows for the device’s FW/HW to log important events into the event tracing system (/sys/kernel/debug/tracing) without requiring any Mellanox tool.
This feature is enabled by default, and can be controlled using sysfs commands.
To disable the feature:
To enable the feature:
To view FW traces using vim text editor:
[257480.719070] mlx4_core 0000:00:05.0: Internal error detected:
[257480.726019] mlx4_core 0000:00:05.0: buf[00]: 0fffffff
[257480.732082] mlx4_core 0000:00:05.0: buf[01]: 00000000

[257480.806531] mlx4_core 0000:00:05.0: buf[0f]: 00000000
[257480.811534] mlx4_core 0000:00:05.0: device is going to be reset
[257482.781154] mlx4_core 0000:00:05.0: crdump: Crash snapshot collected to /proc/driver/mlx4_core/crdump/0000:00:05.0
[257483.789230] mlx4_core 0000:00:05.0: device was reset successfully
In mlx4 driver, CRDUMP will not be collected if internal_err_reset module parameter is set to 0.
To be able to use this feature, trace points must be enabled in the kernel.
echo 0 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable
echo 1 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable
vim /sys/kernel/debug/tracing/trace

性能影响

如果使用iperf发包测试,cpu使用率上升一倍;
速率减小:9-10倍
perf的占用情况。

Samples: 50K of event 'cpu-clock', Event count (approx.): 12703250000
  Overhead  Command  Shared Object      Symbol        
-   10.75%  iperf3   [kernel.kallsyms]  [k] pvclock_clocksource_reada
   - pvclock_clocksource_read    
      - 97.88% kvm_sched_clock_read   
         - sched_clock           
            - 99.36% trace_clock_local
               - 99.25% rb_reserve_next_event               
                    ring_buffer_lock_reserve                
                    trace_buffer_lock_reserve               
                    trace_function    
                    function_trace_call  a
                    ftrace_ops_list_func a
                  + ftrace_call  
               + 0.75% rb_move_tail.isra.38                 
            + 0.64% local_clock  
      + 2.01% sched_clock        
-    9.48%  iperf3   [kernel.kallsyms]  [k] function_trace_call  
   - function_trace_call         
      - 58.23% ftrace_ops_list_func   
         - ftrace_call                      

调查函数执行时间

commit 8e1e1df29d837c589c8b4d7b49864481ff7795b8
Author: Byungchul Park <byungchul.park@lge.com>
Date:   Mon Nov 24 09:34:19 2014 +0900

    tracing: Add additional marks to signal very large time deltas

    Currently, function graph tracer prints "!" or "+" just before
    function execution time to signal a function overhead, depending
    on the time. And some tracers tracing latency also print "!" or
    "+" just after time to signal overhead, depending on the interval
    between events. Even it is usually enough to do that, we sometimes
    need to signal for bigger execution time than 100 micro seconds.

    For example, I used function graph tracer to detect if there is
    any case that exit_mm() takes too much time. I did following steps
    in /sys/kernel/debug/tracing. It was easier to detect very large
    excution time with patched kernel than with original kernel.

    $ echo exit_mm > set_graph_function
    $ echo function_graph > current_tracer
    $ echo > trace
    $ cat trace_pipe > $LOGFILE
     ... (do something and terminate logging)
    $ grep "\\$" $LOGFILE
     3) $ 22082032 us |                      } /* kernel_map_pages */
     3) $ 22082040 us |                    } /* free_pages_prepare */
     3) $ 22082113 us |                  } /* free_hot_cold_page */
     3) $ 22083455 us |                } /* free_hot_cold_page_list */
     3) $ 22083895 us |              } /* release_pages */
     3) $ 22177873 us |            } /* free_pages_and_swap_cache */
     3) $ 22178929 us |          } /* unmap_single_vma */
     3) $ 22198885 us |        } /* unmap_vmas */
     3) $ 22206949 us |      } /* exit_mmap */
     3) $ 22207659 us |    } /* mmput */
     3) $ 22207793 us |  } /* exit_mm */

    And then, it was easy to find out that a schedule-out occured by
    sub_preempt_count() within kernel_map_pages().

    To detect very large function exection time caused by either problematic
    function implementation or scheduling issues, this patch can be useful.
#undef MARK
#define MARK(v, s) {.val = v, .sym = s}
/* trace overhead mark */
static const struct trace_mark {
	unsigned long long	val; /* unit: nsec */
	char			sym;
} mark[] = {
	MARK(1000000000ULL	, '$'), /* 1 sec */
	MARK(100000000ULL	, '@'), /* 100 msec */
	MARK(10000000ULL	, '*'), /* 10 msec */
	MARK(1000000ULL		, '#'), /* 1000 usecs */
	MARK(100000ULL		, '!'), /* 100 usecs */
	MARK(10000ULL		, '+'), /* 10 usecs */
};
#undef MARK

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 1
    评论
Linux中有两个常用的跟踪工具,分别是strace和ftrace。 strace是一个用户空间的跟踪工具,它可以追踪和记录应用程序的系统调用和信号。通过strace,我们可以了解应用程序在执行过程中与操作系统之间的交互,包括文件操作、网络通信、进程管理等。strace可以帮助开发人员调试和分析应用程序的问题,定位性能瓶颈和错误。 ftraceLinux内核提供的跟踪工具,用于追踪内核中的函数调用和事件。它是内建于Linux内核中的,可以通过配置内核选项来启用和配置。ftrace提供了多个跟踪器,用于不同的场景,比如跟踪函数调用、上下文切换、中断关闭时长等。通过ftrace,我们可以深入了解内核的执行流程和性能瓶颈,帮助开发人员进行内核调试和性能优化。 总结来说,strace用于跟踪用户空间应用程序的系统调用和信号,而ftrace用于跟踪内核中的函数调用和事件。它们都是在Linux系统中进行高级性能分析和调试的重要工具。\[1\]\[2\]\[3\] #### 引用[.reference_title] - *1* *2* [Linux 追踪技术 ftrace 简介(一)](https://blog.csdn.net/weixin_45030965/article/details/125738911)[target=&quot;_blank&quot; data-report-click={&quot;spm&quot;:&quot;1018.2226.3001.9630&quot;,&quot;extra&quot;:{&quot;utm_source&quot;:&quot;vip_chatgpt_common_search_pc_result&quot;,&quot;utm_medium&quot;:&quot;distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v91^insertT0,239^v4^insert_chatgpt&quot;}} ] [.reference_item] - *3* [Linux:内核调试方法之strace ,ltrace, ptrace, ftrace, sysrq](https://blog.csdn.net/hhd1988/article/details/129562459)[target=&quot;_blank&quot; data-report-click={&quot;spm&quot;:&quot;1018.2226.3001.9630&quot;,&quot;extra&quot;:{&quot;utm_source&quot;:&quot;vip_chatgpt_common_search_pc_result&quot;,&quot;utm_medium&quot;:&quot;distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v91^insertT0,239^v4^insert_chatgpt&quot;}} ] [.reference_item] [ .reference_list ]

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

mzhan017

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值