Debugging the kernel using Ftrace - part 2

Debugging the kernel using Ftrace - part 2
使用ftrace调试kernel-第二部分

The Ftrace tracing utility has many different features that will assist in tracking down Linux kernel problems. The previous article discussed setting up Ftrace, using the function and function graph tracers, using trace_printk(), and a simple way to stop the recording of a trace from user space. This installment will touch on how user space can interact with Ftrace, faster ways of stopping the trace, debugging a crash, and finding what kernel functions are the biggest stack hogs.
ftrace有许多特性可以帮助查找kernel中的问题。前面的文章讨论了配置ftrace, 使用function trace和function graph trace，使用trace_printk，和从用户空间关闭ftrace的简单方法。本文将讨论如何在用户空间与ftrace交互，关闭ftrace的较快方法，调试crash等。

Trace Markers
trace标记

Seeing what happens inside the kernel gives the user a better understanding of how their system works. But sometimes there needs to be coordination between what is happening in user space and what is happening inside the kernel. The timestamps that are shown in the traces are all relative to what is happening within the trace, but they do not correspond well with wall time.

查看kernel的运行过程可以使用户对kernel有更好的理解。但是有时需要协调kernel和用户空间。trace中显示的时间戳，与平时所用的时间有所不同。

To help synchronize between the actions in user space and kernel space, the trace_marker file was created. It provides a way to write into the Ftrace ring buffer from user space. This marker will then appear in the trace to give a location in the trace of where a specific event occurred.

为了同步kernel和用户空间，trace_marker文件就被创建。该文件提供了一种从用户空间写入信息到缓存的方法。这个标记将在trace中显示，表示某个特定事件发生时的位置。

    [tracing]# echo hello world > trace_marker
    [tracing]# cat trace
    # tracer: nop
    #
    #           TASK-PID    CPU#    TIMESTAMP FUNCTION
    #              | |       |          |         |
               <...>-3718 [001] 5546.183420: 0: hello world

The <...> indicates that the name of the task that wrote the marker was not recorded. Future releases may fix this.
<...>表示写入这个标记的任务没有被记录。以后的版本会修复这个问题。

Starting, Stopping and Recording in a Program
在程序中开关和记录

The tracing_on and trace_marker files work very well to trace the activities of an application if the source of the application is available. If there is a problem within the application and you need to find out what is happening inside the kernel at a particular location of the application, these two files come in handy.

如果有程序的源代码，tracing_on和trace_marker可以工作得非常好。如果在用户空间的程序中出现问题，你需要了解此时kernel中发生了什么。这两个文件可以帮助我们。

At the start of the application, you can open these files to have the file descriptors ready:

在用户程序的开始，你打开这两个文件的描述符，以便向文件中写入信息。

int trace_fd = -1;
int marker_fd = -1;

    int main(int argc, char *argv)
    {
     char *debugfs;
     char path[256];
     [...]

     debugfs = find_debugfs();
     if (debugfs) {
      strcpy(path, debugfs);
      strcat(path,"/tracing/tracing_on");
      trace_fd = open(path, O_WRONLY);
      if (trace_fd >= 0)
       write(trace_fd, "1", 1);

      strcpy(path, debugfs);
      strcat(path,"/tracing/trace_marker");
      marker_fd = open(path, O_WRONLY);

Then, at some critical location in the code, markers can be placed to show where the application currently is:

然后，在代码的关键位置，需要写入标记来表示程序运行到了什么地方。

if (marker_fd >= 0)
write(marker_fd, "In critical area\n", 17);

    if (critical_function() < 0) {
     /* we failed! */
     if (trace_fd >= 0)
      write(trace_fd, "0", 1);
    }

In looking at the example, first you see a function called "find_debugfs()". The proper location to mount the debug file system is at /sys/kernel/debug but a robust tool should not depend on the debug file system being mounted there. An example of find_debugfs() is located here. The file descriptors are initialized to -1 to allow this code to work both with and without a tracing enabled kernel.

在上面的例子中，首先你会看到一个find_debugfs()的函数。通常debugfs的挂接点是/sys/kernel/debug。为了增强kernel的健壮性，开发了find_debugfs()函数来确定debugfs的挂接点。为了适应kernel打开或者未打开trace，文件描述符被初始化为-1.

When the problem is detected, writing the ASCII character "0" into the trace_fd file descriptor stops tracing. As discussed in part 1, this only disables the recording into the Ftrace ring buffer, but the tracers are still incurring overhead.

程序运行后，写入"0"来关闭trace。从第一部分，我们知道这仅仅关闭了缓存的记录功能。trace仍然在kernel中运行。

When using the initialization code above, tracing will be enabled at the beginning of the application because the tracer runs in overwrite mode. That is, when the trace buffer fills up, it will remove the old data and replace it with the new. Since only the most recent trace information is relevant when the problem occurs there is no need to stop and start the tracing during the normal running of the application. The tracer only needs to be disabled when the problem is detected so the trace will have the history of what led up to the error. If interval tracing is needed within the application, it can write an ASCII "1" into the trace_fd to enable the tracing.

使用上面例子中的初始化代码，程序一运行，运行在overwrite模式的trace就启动了。即，trace的缓存被写满了，它就用新的数据覆盖旧的数据。因为问题发生时，只有最近的信息是有用的。因此在程序正常运行时，没有必要开启trace.只有在问题即将发生时，打开trace来记录什么导致了这个问题的发生。程序内部想打开trace，只需要向trace_fd中写入1即可。

Here is an example of a simple program called simple_trace.c that uses the initialization process described above:

这里有个simple_trace.c的例子，这个例子很好的实现了上面所说的内容。

    req.tv_sec = 0;
    req.tv_nsec = 1000;
    write(marker_fd, "before nano\n", 12);
    nanosleep(&req, NULL);
    write(marker_fd, "after nano\n", 11);
    write(trace_fd, "0", 1);

(No error checking was added due to this being a simple program for example purposes only.)
(没有出错检查，是因为这只是一个简单的例子程序)

Here is the process to trace this simple program:
下面是例子的操作过程。

    [tracing]# echo 0 > tracing_on
    [tracing]# echo function_graph > current_tracer
    [tracing]# ~/simple_trace
    [tracing]# cat trace

The first line disables tracing because the program will enable it at start up. Next the function graph tracer is selected. The program is executed, which results in the following trace. Note that the output can be a little verbose so much of it has been cut and replaced with [...]:

上面第一行关闭trace，因为程序在执行的时候会打开trace.接下来使用function graph tracer。程序执行，缓存记录。注意程序输出太多了，因此用[...]省略了一部分无关内容。

    [...]
     0)               |      __kmalloc() {
     0)   0.528 us    |        get_slab();
     0)   2.271 us    |      }
     0)               |      /* before nano */
     0)               |      kfree() {
     0)   0.475 us    |        __phys_addr();
     0)   2.062 us    |      }
     0)   0.608 us    |      inotify_inode_queue_event();
     0)   0.485 us    |      __fsnotify_parent();
    [...]
     1)   0.523 us    |          _spin_unlock();
     0)   0.495 us    |    current_kernel_time();
     1)               |          it_real_fn() {
     0)   1.602 us    | }
     1)   0.728 us    |            __rcu_read_lock();
     0)               | sys_nanosleep() {
     0)               |    hrtimer_nanosleep() {
     0)   0.526 us    |      hrtimer_init();
     1)   0.418 us    |            __rcu_read_lock();
     0)               |      do_nanosleep() {
     1)   1.114 us    |            _spin_lock_irqsave();
    [...]
     0)               |      __kmalloc() {
     1)   2.760 us    | }
     0)   0.556 us    |        get_slab();
     1)               | mwait_idle() {
     0)   1.851 us    |      }
     0)               |      /* after nano */
     0)               |      kfree() {
     0)   0.486 us    |        __phys_addr();

Notice that the writes to trace_marker show up as comments in the function graph tracer.

注意，trace_marker在function graph tracer中以注释的形式出现。

The first column here represents the CPU. When we have the CPU traces interleaved like this, it may become hard to read the trace. The tool grep can easily filter this, or the per_cpu trace files may be used. The per_cpu trace files are located in the debugfs tracing directory under per_cpu.

第一列是CPU。当在输出中有不同的CPU交换出现的时候，比较难读。grep可以很容易地过滤，或者使用per_cpu trace文件。per_cpu trace文件在debugfs的tracing目录下的per_cpu目录中。

[tracing]# ls per_cpu
cpu0 cpu1 cpu2 cpu3 cpu4 cpu5 cpu6 cpu7

There exists a trace file in each one of these CPU directories that only show the trace for that CPU.

在这些CPU目录中存在trace文件。这些文件只显示这个CPU的trace.

To get a nice view of the function graph tracer without the interference of other CPUs just look at per_cpu/cpu0/trace.
只看per_cpu/cpu0/trace文件，就可以避免其它CPU的干扰。

    [tracing]# cat per_cpu/cpu0/trace
     0)               |      __kmalloc() {
     0)   0.528 us    |        get_slab();
     0)   2.271 us    |      }
     0)               |      /* before nano */
     0)               |      kfree() {
     0)   0.475 us    |        __phys_addr();
     0)   2.062 us    |      }
     0)   0.608 us    |      inotify_inode_queue_event();
     0)   0.485 us    |      __fsnotify_parent();
     0)   0.488 us    |      inotify_dentry_parent_queue_event();
     0)   1.106 us    |      fsnotify();
    [...]
     0)   0.721 us    |    _spin_unlock_irqrestore();
     0)   3.380 us    | }
     0)               | audit_syscall_entry() {
     0)   0.495 us    |    current_kernel_time();
     0)   1.602 us    | }
     0)               | sys_nanosleep() {
     0)               |    hrtimer_nanosleep() {
     0)   0.526 us    |      hrtimer_init();
     0)               |      do_nanosleep() {
     0)               |        hrtimer_start_range_ns() {
     0)               |          __hrtimer_start_range_ns() {
     0)               |            lock_hrtimer_base() {
     0)   0.866 us    |              _spin_lock_irqsave();
    [...]
     0)               |      __kmalloc() {
     0)               |        get_slab() {
     0)   1.851 us    |      }
     0)               |      /* after nano */
     0)               |      kfree() {
     0)   0.486 us    |        __phys_addr();

Disabling the Tracer Within the Kernel
在kernel中关闭trace

During the development of a kernel driver there may exist strange errors that occur during testing. Perhaps the driver gets stuck in a sleep state and never wakes up. Trying to disable the tracer from user space when a kernel event occurs is difficult and usually results in a buffer overflow and loss of the relevant information before the user can stop the trace.

在开发过程中，经常会遇到各种各样的错误。或许driver停在休眠状态。当kernel遇到这种情况，在用户空间关闭trace是非常困难的。并且关闭trace会导致缓存溢出，丢失一些数据。

There are two functions that work well inside the kernel: tracing_on() and tracing_off(). These two act just like echoing "1" or "0" respectively into the tracing_on file. If there is some condition that can be checked for inside the kernel, then the tracer may be stopped by adding something like the following:

在kernel中有两个函数可以避免上面的情况:tracing_on和tracing_off.这两个函数就象把“0”或"1"分别写入tracing_on文件中。如果在kernel中需要关闭trace，那么可以象下面的例子一样关闭trace.

if (test_for_error())
tracing_off();

Next, add several trace_printk()s (see part 1), recompile, and boot the kernel. You can then enable the function or function graph tracer and just wait for the error condition to happen. Examining the tracing_on file will let you know when the error condition occurred. It will switch from "1" to "0" when the kernel calls tracing_off().

然后，添加几个trace_printk()（见第一部分），重新编译，启动kernel.你可以启动function tracer 或者function graph tracer，等着错误发生。检查tracing_on文件，你就知道错误是否发生。当kernel调用tracing_off()，这个文件中的值会从1变为0.

After examining the trace, or saving it off in another file with:

检查trace输出后，你可以用下面的命令保存到另外一个文件中。

cat trace > ~/trace.sav

you can continue the trace to examine another hit. To do so, just echo "1" into tracing_on, and the trace will continue. This is also useful if the condition that triggers the tracing_off() call can be triggered legitimately. If the condition was triggered by normal operation, just restart the trace by echoing a "1" back into tracing_on and hopefully the next time the condition is hit will be because of the abnormality.

你用echo "1"到tracing_on文件中，你就可以继续调试。这样做是很有用的，如果发生的错误关闭了trace. 如果错误发生后，trace被关闭，只要想tracing_on中写入"1"，就可以继续进行调试。

ftrace_dump_on_oops
调试oops

There are times that the kernel will crash and examining the memory and state of the crash is more of a CSI science than a program debugging science. Using kdump/kexec with the crash utility is a valuable way to examine the state of the system at the point of the crash, but it does not let you see what has happened prior to the event that caused the crash.

有时kernel发生了崩溃，调试crash文件是一件很痛苦的事情。使用kdump/kexec等工具可以检查崩溃时kernel的状态，但是这些工具不能让你看到kernel崩溃前所发生的事情。

Having Ftrace configured and enabling ftrace_dump_on_oops in the kernel boot parameters, or by echoing a "1" into /proc/sys/kernel/ftrace_dump_on_oops, will enable Ftrace to dump to the console the entire trace buffer in ASCII format on oops or panic. Having the console output to a serial log makes debugging crashes much easier. You can now trace back the events that led up to the crash.

配置好ftrace,在kernel的启动参数中启动ftrace_dump_on_oops，或者向/proc/sys/kernel/ftrace_dump_on_oops文件中写入"1"，将启动ftrace向控制台输出所有的oops信息。将控制台的输出转向串口后，你将很容易地看到错误信息。现在你可以查看导致系统崩溃的事件了。

Dumping to the console may take a long time since the default Ftrace ring buffer is over a megabyte per CPU. To shrink the size of the ring buffer, write the number of kilobytes you want the ring buffer to be to buffer_size_kb. Note that the value is per CPU, not the total size of the ring buffer.

由于ftrace在每个cpu上配置了超过1M的缓存，因此向控制台输出错误信息所需要的时间比较长。为了缩小缓存的大小，用buffer_size_kb控制缓存的大小。注意这个值是per-cpu的，不是缓存的总的大小。

[tracing]# echo 50 > buffer_size_kb

The above will shrink the Ftrace ring buffer down to 50 kilobytes per CPU.

上面的命令将把缓存大小减小为每个CPU 50K.

You can also trigger a dump of the Ftrace buffer to the console with sysrq-z.

你也可以用sysrq-z来输出ftrace的缓存。

To choose a particular location for the kernel dump, the kernel may call ftrace_dump() directly. Note, this may permanently disable Ftrace and a reboot may be necessary to enable it again. This is because ftrace_dump() reads the buffer. The buffer is made to be written to in all contexts (interrupt, NMI, scheduling) but the reading of the buffer requires locking. To be able to perform ftrace_dump() the locking is disabled and the buffer may end up being corrupted after the output.

为了给kernel输出选择一个特定的位置，kernel将调用ftrace_dump函数。注意，这将永久地关闭ftrace.想重新启动ftrace，需要重启kernel。这是因为ftrace_dump需要读缓存，但是缓存被置成所有上下文可写(中断，NMI，调度),然而读缓存需要锁。为了能执行ftrace_dump函数，读锁被关闭。在输出后，缓存也就corrected

    /*
     * The following code will lock up the box, so we dump out the
     * trace before we hit that location.
     */
    ftrace_dump();

/* code that locks up */

Stack Tracing
栈的跟踪

The final topic to discuss is the ability to examine the size of the kernel stack and how much stack space each function is using. Enabling the stack tracer (CONFIG_STACK_TRACER) will show where the biggest use of the stack takes place.

最后一个主题是讨论检查kernel中栈的大小和每个函数占用的栈空间。启动stack tracer（CONFIG_STACK_TRACER）将显示哪个函数占用的栈空间最多。

The stack tracer is built from the function tracer infrastructure. It does not use the Ftrace ring buffer, but it does use the function tracer to hook into every function call. Because it uses the function tracer infrastructure, it does not add overhead when not enabled. To enable the stack tracer, echo 1 into /proc/sys/kernel/stack_tracer_enabled. To see the max stack size during boot up, add "stacktrace" to the kernel boot parameters.

stack tracer从function tracer发展而来。它没有使用ftrace的缓存，但是它使用function tracer来获得每个函数的调用。因为它使用function tracer的框架，所以没有增加系统开销。向/proc/sys/kernel/stack_tracer_enabled文件中写入"1"，启动stack tracer。在kernel的启动参数中加入"stacktrace",就可以查看占用的最大栈空间。

The stack tracer checks the size of the stack at every function call. If it is greater than the last recorded maximum, it records the stack trace and updates the maximum with the new size. To see the current maximum, look at the stack_max_size file.

stack tracer在每个函数调用时检查占用的栈空间。如果大于上次记录的最大值，记录下stack trace，然后更新最大值。查看stack_max_size文件就可以看到当前栈空间的最大值。

    [tracing]# echo 1 > /proc/sys/kernel/stack_tracer_enabled
    [tracing]# cat stack_max_size
    2928
    [tracing]# cat stack_trace
            Depth    Size   Location    (34 entries)
            -----    ----   --------
      0)     2952      16   mempool_alloc_slab+0x15/0x17
      1)     2936     144   mempool_alloc+0x52/0x104
      2)     2792      16   scsi_sg_alloc+0x4a/0x4c [scsi_mod]
      3)     2776     112   __sg_alloc_table+0x62/0x103
    [...]
     13)     2072      48   __elv_add_request+0x98/0x9f
     14)     2024     112   __make_request+0x43e/0x4bb
     15)     1912     224   generic_make_request+0x424/0x471
     16)     1688      80   submit_bio+0x108/0x115
     17)     1608      48   submit_bh+0xfc/0x11e
     18)     1560     112   __block_write_full_page+0x1ee/0x2e8
     19)     1448      80   block_write_full_page_endio+0xff/0x10e
     20)     1368      16   block_write_full_page+0x15/0x17
     21)     1352      16   blkdev_writepage+0x18/0x1a
     22)     1336      32   __writepage+0x1a/0x40
     23)     1304     304   write_cache_pages+0x241/0x3c1
     24)     1000      16   generic_writepages+0x27/0x29
    [...]
     30)      424      64   bdi_writeback_task+0x3f/0xb0
     31)      360      48   bdi_start_fn+0x76/0xd7
     32)      312     128   kthread+0x7f/0x87
     33)      184     184   child_rip+0xa/0x20

Not only does this give you the size of the maximum stack found, it also shows the breakdown of the stack sizes used by each function. Notice that write_cache_pages had the biggest stack with 304 bytes being used, followed by generic_make_request with 224 bytes of stack.

这不仅让你查看每个函数占用的最大栈空间，还可以显示函数的一些问题。注意write_cache_pages使用了304字节的栈空间，名列第一，generic_make_request占用了224字节的栈空间，名列第二。

To reset the maximum, echo "0" into the stack_max_size file.

向stack_max_size文件中写入"0"，可以重置.

[tracing]# echo 0 > stack_max_size

Keeping this running for a while will show where the kernel is using a bit too much stack. But remember that the stack tracer only has no overhead when it is not enabled. When it is running you may notice a bit of a performance degradation.

运行一段时间，你就可以看到哪个函数用得栈空间太多。但是记住，stack tracer如果没有启动，对系统一点影响都没有。一旦被启动，你会看到系统性能有点降低。

Note that the stack tracer will not trace the max stack size when the kernel is using a separate stack. Because interrupts have their own stack, it will not trace the stack usage there. The reason is that currently there is no easy way to quickly see what the top of the stack is when the stack is something other than the current task's stack. When using split stacks, a process stack may be two pages but the interrupt stack may only be one. This may be fixed in the future, but keep this in mind when using the stack tracer.

当kernel使用单独的栈的时候，stack tracer不起作用。因为中断有自己的调用栈，stack tracer不起作用。原因是目前没有一种比较快的方法来查看栈顶，当栈不是当前任务的栈的时候。当时有split栈的时候，一个进程的栈是2页，但是中断的栈上1页。以后可能会做调整，现在使用stack tracer时，请记住这一点。

Conclusion
结论

Ftrace is a very powerful tool and easy to configure. No extra tools are necessary. Everything that was shown it this tutorial can be used on embedded devices that only have Busybox installed. Taking advantage of the Ftrace infrastructure should cut the time needed to debug that hard-to-find race condition. I seldom use printk() any more because using the function and function graph tracers along with trace_printk() and tracing_off() have become my main tools for debugging the Linux kernel.

ftrace是非常强大的工具，简单易用。不需要特殊的工具。ftrace可以用在只安装了busybox的嵌入式设备上。利用ftrace可以缩短调试的时间。使用ftrace中的functin tracer, function graph tracer, trace_printk 和tracing_off后，我就很少用printk了。