Debugging the kernel using Ftrace - part 1



Debugging the kernel using Ftrace - part 1
Ftrace调试内核 第一部分

Ftrace is a tracing utility built directly into the Linux kernel. Many distributions already have various configurations of Ftrace enabled in their most recent releases. One of the benefits that Ftrace brings to Linux is the ability to see what is happening inside the kernel. As such, this makes finding problem areas or simply tracking down that strange bug more manageable.

ftrace是嵌入到linux内核的内核调试工具。许多linux的发行版本已经集成了不同版本的ftrace。ftrace的最大好处是可以实时查看linux内核运行。因此,ftrace可以发现问题出现的区域。

Ftrace's ability to show the events that lead up to a crash gives a better chance of finding exactly what caused it and can help the developer in creating the correct solution. This article is a two part series that will cover various methods of using Ftrace for debugging the Linux kernel. This first part will talk briefly about setting up Ftrace, using the function tracer, writing to the Ftrace buffer from within the kernel, and various ways to stop the tracer when a problem is detected.

ftrace可以帮助开发者准确发现导致crash的原因,从而正确处理这个问题。本文包括两个部分,涵盖了利用ftrace调试kernel的各种方法。第一部分主要讨论配置ftrace, 使用function tracer,从内核写入ftrace的缓存,和发现问题后停止tracer的各种方法。

Ftrace was derived from two tools. One was the "latency tracer" by Ingo Molnar used in the -rt tree. The other was my own "logdev" utility that had its primary use on debugging the Linux kernel. This article will mostly describe features that came out of logdev, but will also look at the function tracer that originated in the latency tracer.

ftrace由两种工具发展而来。 一是Ingo Molnar的"latency tracer",主要用在实时linux中,另外一个是我自己的"logdev"工具。本文主要讨论"logdev"发展而来的各种特性。同时也简单看一下"latency tracer"发展而来的function tracer.

Setting up Ftrace
配置ftrace

Currently the API to interface with Ftrace is located in the Debugfs file system. Typically, that is mounted at /sys/kernel/debug. For easier accessibility, I usually create a /debug directory and mount it there. Feel free to choose your own location for Debugfs.

目前ftrace的控制接口在Debugfs文件系统中。通常Debugfs文件系统的挂接在/sys/kernel/debug目录中。为了方便访问,我通常创建/debug目录,把Debugfs文件系统挂接在这个目录下。你可以自己选择目录来挂接Debugfs文件系统。

When Ftrace is configured, it will create its own directory called tracing within the Debugfs file system. This article will reference those files in that directory as though the user first changed directory to the Debugfs tracing directory to avoid any confusion as to where the Debugfs file system has been mounted.

配置好ftrace后,ftrace将在Debugfs文件系统中创建tracing目录。本文将使用tracing目录下的文件。读者最好先到这个路径下。

    [~]# cd /sys/kernel/debug/tracing
    [tracing]#

This article is focusing on using Ftrace as a debugging tool. Some configurations for Ftrace are used for other purposes, like finding latency or analyzing the system. For the purpose of debugging, the kernel configuration parameters that should be enabled are:

本文主要把ftrace用作调试工具。ftrace的其它功能,例如发现延迟和分析kernel,将不做重点讲述。为了调试系统,kernel应该打开如下的选项:

    CONFIG_FUNCTION_TRACER
    CONFIG_FUNCTION_GRAPH_TRACER
    CONFIG_STACK_TRACER
    CONFIG_DYNAMIC_FTRACE

Function tracing - no modification necessary
跟踪函数-无须修改代码

One of the most powerful tracers of Ftrace is the function tracer. It uses the -pg option of gcc to have every function in the kernel call a special function "mcount()". That function must be implemented in assembly because the call does not follow the normal C ABI.

ftrace的最强大的功能是function tracer.ftrace在gcc中使用-pg选项,让kernel中的每个函数都调用一个特殊的函数"mcount()". 这个函数必须用汇编来写,因为这个函数不遵循C语言的规则。

When CONFIG_DYNAMIC_FTRACE is configured the call is converted to a NOP at boot time to keep the system running at 100% performance. During compilation the mcount() call-sites are recorded. That list is used at boot time to convert those sites to NOPs. Since NOPs are pretty useless for tracing, the list is saved to convert the call-sites back into trace calls when the function (or function graph) tracer is enabled.

当配置了CONFIG_DYNAMIC_FTRACE,在系统启动的时候,mcount被转换为NOP来保持系统以最高的性能运行。在编译的时候mcount的调用点被记录下来。在启动的时候,这个记录转换调用点为NOP.由于NOP对跟踪无用,因此这个记录被保存下来,并把调用点转换为跟踪点,当function tracer 被启动。

It is highly recommended to enable CONFIG_DYNAMIC_FTRACE because of this performance enhancement. In addition, CONFIG_DYNAMIC_FTRACE gives the ability to filter which function should be traced. Note, even though the NOPs do not show any impact in benchmarks, the addition of frame pointers that come with the -pg option has been known to cause a slight overhead.

为了提高性能,推荐启动CONFIG_DYNAMIC_FTRACE。而且这个选项可以使我们过滤函数。注意,即使NOP不会影响性能,但是我们都知道-pg选项带来的帧指针的叠加,将带来一些系统开销。

To find out which tracers are available, simply cat the available_tracers file in the tracing directory:

为了发现哪些tracer可用, 用cat显示tracing目录中的available_tracers文件的内容。

    [tracing]# cat available_tracers
    function_graph function sched_switch nop

To enable the function tracer, just echo "function" into the current_tracer file.

为了启动function tracer,把“function”添加到current_tracer文件中。

    [tracing]# echo function > current_tracer
    [tracing]# cat current_tracer
    function

    [tracing]# cat trace | head -10
    # tracer: function
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
                bash-16939 [000]  6075.461561: mutex_unlock <-tracing_set_tracer
              <idle>-0     [001]  6075.461561: _spin_unlock_irqrestore <-hrtimer_get_next_event
              <idle>-0     [001]  6075.461562: rcu_needs_cpu <-tick_nohz_stop_sched_tick
                bash-16939 [000]  6075.461563: inotify_inode_queue_event <-vfs_write
              <idle>-0     [001]  6075.461563: mwait_idle <-cpu_idle
                bash-16939 [000]  6075.461563: __fsnotify_parent <-vfs_write

The header explains the format of the output pretty well. The first two items are the traced task name and PID. The CPU that the trace was executed on is within the brackets. The timestamp is the time since boot, followed by the function name. The function in this case is the function being traced with its parent following the "<-" symbol.

输出的标题已经很好的解释了输出。前两项是被追踪的任务名和PID. 方括号中是追踪执行的CPU。时间戳是从系统启动算起的,后面是函数名。这种跟踪中,显示的是被跟踪的函数和调用它的函数。

This information is quite powerful and shows the flow of functions nicely. But it can be a bit hard to follow. The function graph tracer, created by Frederic Weisbecker, traces both the entry and exit of a function, which gives the tracer the ability to know the depth of functions that are called. The function graph tracer can make following the flow of execution within the kernel much easier to follow with the human eye:

这个信息很有用,并且很好地显示了函数调用的过程。但是跟踪比较困难。Frederic Weisbecker创建的function graph tracer可以跟踪函数的入口和出口,可以方便的知道函数调用的深度。function graph tracer可以用如下的方式显示,这样比较容易阅读。

    [tracing]# echo function_graph > current_tracer
    [tracing]# cat trace | head -20
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     1)   1.015 us    |        _spin_lock_irqsave();
     1)   0.476 us    |        internal_add_timer();
     1)   0.423 us    |        wake_up_idle_cpu();
     1)   0.461 us    |        _spin_unlock_irqrestore();
     1)   4.770 us    |      }
     1)   5.725 us    |    }
     1)   0.450 us    |    mutex_unlock();
     1) + 24.243 us   |  }
     1)   0.483 us    |  _spin_lock_irq();
     1)   0.517 us    |  _spin_unlock_irq();
     1)               |  prepare_to_wait() {
     1)   0.468 us    |    _spin_lock_irqsave();
     1)   0.502 us    |    _spin_unlock_irqrestore();
     1)   2.411 us    |  }
     1)   0.449 us    |  kthread_should_stop();
     1)               |  schedule() {

This gives the start and end of a function denoted with the C like annotation of "{" to start a function and "}" at the end. Leaf functions, which do not call other functions, simply end with a ";". The DURATION column shows the time spent in the corresponding function. The function graph tracer records the time the function was entered and exited and reports the difference as the duration. These numbers only appear with the leaf functions and the "}" symbol. Note that this time also includes the overhead of all functions within a nested function as well as the overhead of the function graph tracer itself. The function graph tracer hijacks the return address of the function in order to insert a trace callback for the function exit. This breaks the CPU's branch prediction and causes a bit more overhead than the function tracer. The closest true timings only occur for the leaf functions.

上面的输出,用类似于C语言的标记"{"来表示函数开始,“}”来表示函数结束。没有调用其他函数的叶子函数,直接用";"表示结束。“DURATION”栏表示函数执行所需的时间。function graph tracer记录函数进入和退出的时间,用它们的差作为函数执行的时间。这些数只显示在叶子函数和"}"标记旁。注意,这个数包括函数执行的时间和function graph tracer执行的时间。function graph tracer劫持了函数的返回地址,插入trace回调函数来记录退出时间。这给CPU带来了额外的开销。最准确的值是叶子函数的值。

The lonely "+" that is there is an annotation marker. When the duration is greater than 10 microseconds, a "+" is shown. If the duration is greater than 100 microseconds a "!" will be displayed.

当执行时间超过10毫秒,用+标记,当执行时间超过100毫秒,用!标记。

Using trace_printk()
使用trace_printk

printk() is the king of all debuggers, but it has a problem. If you are debugging a high volume area such as the timer interrupt, the scheduler, or the network, printk() can lead to bogging down the system or can even create a live lock. It is also quite common to see a bug "disappear" when adding a few printk()s. This is due to the sheer overhead that printk() introduces.
printk是非常好用的工具。但它有个问题。如果你调试一个调用频繁的区域,例如定时器,中断,调度器,或者网络。printk将瘫痪系统,或者引起live lock。有时加入一些printk后,你会发现bug消失了。这是因为printk引入的开销造成的。

Ftrace introduces a new form of printk() called trace_printk(). It can be used just like printk(), and can also be used in any context (interrupt code, NMI code, and scheduler code). What is nice about trace_printk() is that it does not output to the console. Instead it writes to the Ftrace ring buffer and can be read via the trace file.

ftrace引入了一个类似于printk的函数trace_printk. 它的用法与printk一样,可以在任何上下文中使用(中断,NMI,调度器)。 该函数的精彩之处在于,它不把内容输出到终端,而是输出到ftrace的缓存中,用trace文件来查看。

Writing into the ring buffer with trace_printk() only takes around a tenth of a microsecond or so. But using printk(), especially when writing to the serial console, may take several milliseconds per write. The performance advantage of trace_printk() lets you record the most sensitive areas of the kernel with very little impact.

trace_printk只用大约十分之一毫秒,使用printk大约几毫秒。trace_printk的性能优势可以让你在对kernel最小影响的情况下记录敏感区域的数据。

For example you can add something like this to the kernel or module:

例如,你可以在kernel或模块中使用下面的函数

    trace_printk("read foo %d out of bar %p\n", bar->foo, bar);

Then by looking at the trace file, you can see your output.
然后在trace文件中,你可以看到如下输出

    [tracing]# cat trace
    # tracer: nop
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
               <...>-10690 [003] 17279.332920: : read foo 10 out of bar ffff880013a5bef8

The above example was done by adding a module that actually had a foo and bar construct.
上面的例子需要bar结构体和foo成员。

trace_printk() output will appear in any tracer, even the function and function graph tracers.
trace_printk在任何tracer中显示,包括function tracer和function graph tracer.

    [tracing]# echo function_graph > current_tracer
    [tracing]# insmod ~/modules/foo.ko
    [tracing]# cat trace
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     3) + 16.283 us   |      }
     3) + 17.364 us   |    }
     3)               |    do_one_initcall() {
     3)               |      /* read foo 10 out of bar ffff88001191bef8 */
     3)   4.221 us    |    }
     3)               |    __wake_up() {
     3)   0.633 us    |      _spin_lock_irqsave();
     3)   0.538 us    |      __wake_up_common();
     3)   0.563 us    |      _spin_unlock_irqrestore();

Yes, the trace_printk() output looks like a comment in the function graph tracer.
在上面的function graph tracer例子中, trace_printk的输出象一个注释。

Starting and stopping the trace
开关trace

Obviously there are times where you only want to trace a particular code path. Perhaps you only want to trace what is happening when you run a specific test. The file tracing_on is used to disable the ring buffer from recording data:

或许你只想跟踪某段代码,或者某个测试。文件tracing_on可以用来开关trace.

    [tracing]# echo 0 > tracing_on

This will disable the Ftrace ring buffer from recording. Everything else still happens with the tracers and they will still incur most of their overhead. They do notice that the ring buffer is not recording and will not attempt to write any data, but the calls that the tracers make are still performed.

上面的命令可以关闭trace.跟踪和跟踪的开销继续进行。这些函数注意到缓存不再记录,就不再向缓存中写入任何数据。但是tracer和它跟踪的函数照样执行。

To re-enable the ring buffer, simply write a '1' into that file:
开启缓存,向文件中写入 "1"

    [tracing]# echo 1 > tracing_on

Note, it is very important that you have a space between the number and the greater than sign ">". Otherwise you may be writing standard input or output into that file.
注意,在数和">"之间有一个空格。否则,你要把标准输入输出导入到那个文件中了。

    [tracing]# echo 0> tracing_on   /* this will not work! */

A common run might be:
通常命令如下:

    [tracing]# echo 0 > tracing_on
    [tracing]# echo function_graph > current_tracer
    [tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on

The first line disables the ring buffer from recording any data. The next enables the function graph tracer. The overhead of the function graph tracer is still present but nothing will be recorded into the trace buffer. The last line enables the ring buffer, runs the test program, then disables the ring buffer. This narrows the data stored by the function graph tracer to include mostly just the data accumulated by the run_test program.


第一行关闭记录数据的缓存。第二行启动function graph tracer. function graph tracer的开销仍然存在,但是缓存中什么也不记录。最后一行启动缓存,运行测试程序,然后闭缓存。这缩小了function graph tracer记录的数据的范围。

What's next?
下章简介

The next article will continue the discussion on debugging the kernel with Ftrace. The method above to disable the tracing may not be fast enough. The latency between the end of the program run_test and echoing the 0 into the tracing_on file may cause the ring buffer to overflow and lose the relevant data. I will discuss other methods to stop tracing a bit more efficiently, how to debug a crash, and looking at what functions in the kernel are stack hogs. The best way to find out more is to enable Ftrace and just play with it. You can learn a lot about how the kernel works by just following the function graph tracer.

下章将继续讨论使用ftrace调试kernel. 本章使用的关闭trace的方法不够快。run_test程序结束和关闭trace的时延将导致缓存溢出,丢失一些数据。下面我将介绍其它比较有效关闭trace的方法,调试crash, 哪个函数可以显示调用栈。最好的学习方法是使用ftrace。通过ftrace,你可以学习kernel是如何工作的。

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值