From High Ceph Latency to a Kernel Patch with eBPF/BCC

There are a lot of tools for debugging kernel and userspace programs in Linux. Most of them have a performance impact and cannot easily be run in production environments. A few years ago, eBPF was developed, which provides the ability to trace the kernel and userspace with low overhead, without needing to recompile programs or load kernel modules.

There are now plenty of tools that use eBPF, and in this article we’ll explain how to write your own profiling tool using the Python BCC library. This article is based on a real issue from the production environment. We’ll walk you through solving the problem and show how existing BCC tools could be used in some cases.

Ceph Is Slow

A new platform was added to a ceph cluster. After migrating some data to the platform, the latency for write requests was higher than on the other servers.

This platform has a new caching virtual device — bcache, which we haven’t used in this cluster before — and a new kernel — 4.15, which still isn’t used anywhere else in this cluster. The root of the problem could be anywhere, so let’s take a deeper look.

Investigating the Host

Let’s look at what’s going on inside the ceph-osd process. We use the tracing tool perf and flamescope to build flamegraphs:

As we can see from the flamegraph, fdatasync() spent a lot of time submitting bio in the generic_make_request() function. Thus, the root of our problem is somewhere outside the ceph daemon. It might be a kernel, bcache, or disk issue. The iostat output showed high latency for bcache devices.
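
For reference, per-device latency can be watched live with extended iostat statistics; the await/w_await columns show the average request latency in milliseconds (the exact invocation from our session isn’t reproduced here, but it was along these lines):

~# iostat -x 1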

Another suspicious finding is that the systemd-udevd daemon is consuming CPU: about 20% on multiple CPUs. This is odd behavior, so we have to find out what’s going on. Since systemd-udevd works with uevents, we have to use udevadm monitor to find out if there are any uevents in the system. After checking, we saw that a lot of «change» uevents were being generated for every block device in the system.
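
A plain invocation is enough for that check; udevadm monitor prints kernel uevents and udev events as they are processed:

~# udevadm monitor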

This is unusual, so we’re going to find out what’s causing all of these uevents to be sent.

Using the BCC Toolkit

As we already know, the kernel (and the ceph daemon) is spending a lot of time in the generic_make_request() function. Let’s measure its latency using funclatency from the BCC toolkit, just to make sure that we’re on the right path. We’ll trace the ceph daemon’s PID (the -p argument) in 1-second intervals (-i) and print the latency in milliseconds (-m).
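
The invocation looks roughly like this (getting the PID via pidof is just an illustration; any way of finding the ceph-osd PID will do):

~# /usr/share/bcc/tools/funclatency -p $(pidof ceph-osd) -i 1 -m generic_make_request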

This function usually works very fast. All it does is submit the bio structure to the device driver’s queue.

Bcache is a complex device; in fact, it consists of 3 devices: a backing device, which is a slow HDD in our case; a caching device, which is the NVMe drive’s partition; and a bcache virtual device, which is used by the application. We know submission is slow, but for which device? This is something we’ll look at a bit later.

For now, we know that uevents cause problems in ceph daemons, and we have to find the software triggering them. It’s not easy to find what causes uevents to be generated. We assume it’s software that only runs periodically. To see what is being executed on the system, we use execsnoop from the BCC toolkit. We can run it and redirect stdout to a file.

For example:

/usr/share/bcc/tools/execsnoop  | tee ./execdump

We won’t give the full execsnoop output here, but one interesting string we found there was:

sh 1764905 5802 0 sudo arcconf getconfig 1 AD | grep Temperature | awk -F '[:/]' '{print $2}' | sed 's/^ \([0-9]*\) C.*/\1/'

The third column is the process’s PPID. We checked what 5802 was and saw that it’s one of our monitoring daemon threads. Looking further at the monitoring system configuration, we found a faulty parameter. The HBA temperature was being retrieved every 30 seconds, which is too often. After changing the check interval to a more appropriate value, we saw that our ceph latency matched the other platforms.

But we still don’t know why the bcache latency was high. We set up a testing platform with the same configuration and tried to reproduce the problem with fio on the bcache device while simultaneously triggering udev with the udevadm trigger command.

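A sketch of that reproduction (the fio job parameters and the bcache device name are illustrative, not the exact ones we used):

~# fio --name=repro --filename=/dev/bcache0 --rw=randwrite --direct=1 --ioengine=libaio --bs=4k --iodepth=32 --time_based --runtime=300 &
~# while true; do udevadm trigger; sleep 1; done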

Writing BCC-Based Tools

What we’re going to do here is write a simple tool that traces the slowest generic_make_request() calls and prints the name of the disk that the function was called for.

The plan is simple:

  • Register a kprobe on generic_make_request():
    • Save the disk name available from the function’s argument
    • Save the current timestamp
  • Register a kretprobe on the generic_make_request() return:
    • Retrieve the current timestamp
    • Look up the previously saved timestamp and compare it with the current one
    • If the result is higher than the threshold, look up the previously saved disk name and print it to the terminal with additional information

Kprobes and kretprobes use breakpoints to change a function’s code at runtime. You can find documentation as well as a good article on this. If you take a look at the code of different BCC tools, you’ll see that they all have an identical structure. We’ll skip argument parsing and focus on the BPF program itself.

Our program’s text will be defined in Python as follows:

bpf_text = """ # Here will be the bpf program code """

BPF programs use hashmaps to share data between different functions. We’ll use the PID as a key and a self-defined structure as a value.

struct data_t {
	u64 pid;
	u64 ts;
	char comm[TASK_COMM_LEN];
	u64 lat;
	char disk[DISK_NAME_LEN];
};

BPF_HASH(p, u64, struct data_t);
BPF_PERF_OUTPUT(events);

Here we register a hashmap called p with a u64 key type and a struct data_t value type. This map is accessible from our BPF program context. The BPF_PERF_OUTPUT macro registers another map called events, which is used to push data to userspace.

When measuring latency between a function call and its return, or between one function call and another, you have to be sure that the data you save and access later relates to the same context. In other words, you have to be aware of any other parallel executions of the same function. It’s possible to trace the latency between a function call in one process and the same function’s return in another process, but this doesn’t help us. A good example is the biolatency tool, where a pointer to struct request is used as the hashmap key.
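
A minimal sketch of that idea, keyed on the request pointer instead of a PID (the probe names and the histogram here are simplified illustrations, not the real biolatency source):

// requires #include <linux/blkdev.h> in the program text
BPF_HISTOGRAM(dist);
BPF_HASH(start, struct request *);

void trace_start(struct pt_regs *ctx, struct request *req) {
	u64 ts = bpf_ktime_get_ns();
	start.update(&req, &ts);        // the request pointer, not the PID, identifies this io
}

void trace_done(struct pt_regs *ctx, struct request *req) {
	u64 *tsp = start.lookup(&req);
	if (tsp != 0) {
		dist.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));  // usecs, log2 buckets
		start.delete(&req);
	}
}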

Next, we have to write the code that will be executed on function calls via the kprobe mechanism:

void start(struct pt_regs *ctx, struct bio *bio) {
	u64 pid = bpf_get_current_pid_tgid();
	struct data_t data = {};
	u64 ts = bpf_ktime_get_ns();
	data.pid = pid;
	data.ts = ts;
	bpf_probe_read_str(&data.disk, sizeof(data.disk), (void*)bio->bi_disk->disk_name);
	p.update(&pid, &data);
}

Here we have the first generic_make_request() argument as our function’s second argument. Then we get the PID and the current timestamp in nanoseconds and write them to the newly allocated struct data_t data. We get the disk name from the bio structure, which is passed to generic_make_request(), and save it to our data. The last step is to add an entry to the hashmap we described earlier.

This function will be executed when generic_make_request() returns:

void stop(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    struct data_t* data = p.lookup(&pid);
    if (data != 0 && data->ts > 0) {
        bpf_get_current_comm(&data->comm, sizeof(data->comm));
        data->lat = (ts - data->ts)/1000;
        if (data->lat > MIN_US) {
            FACTOR
            data->pid >>= 32;
            events.perf_submit(ctx, data, sizeof(struct data_t));
        }
        p.delete(&pid);
    }
}

We get the PID and the current timestamp, as before, and look up the hashmap for the value where key == current PID. If it’s found, we get the name of the running process and add it to the data structure. What we do with data->pid here gives us the thread group id. The previously called bpf_get_current_pid_tgid() function returns the thread group id and the thread’s PID packed into the same 64-bit value. We’re not interested in every thread’s ID, but we want to know the PID of the main thread. After checking that the latency is above the threshold, we submit our data structure to userspace via the events map, then delete the hashmap entry at the end.
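
For clarity, the packing looks like this (a standalone illustration, not a fragment of our tool):

u64 id = bpf_get_current_pid_tgid();
u32 tgid = id >> 32;    // thread group id: the PID of the whole process
u32 tid  = (u32)id;     // id of the current thread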

In our Python script, we have to replace MIN_US and FACTOR according to the threshold we want and the unit of time we want to see in our result:

bpf_text = bpf_text.replace('MIN_US',str(min_usec))
if args.milliseconds:
	bpf_text = bpf_text.replace('FACTOR','data->lat /= 1000;')
	label = "msec"
else:
	bpf_text = bpf_text.replace('FACTOR','')
	label = "usec"

Then we have to prepare the BPF program with a BPF() macro and register probes:

from bcc import BPF

b = BPF(text=bpf_text)
b.attach_kprobe(event="generic_make_request",fn_name="start")
b.attach_kretprobe(event="generic_make_request",fn_name="stop")

We also need to define the same structure as struct data_t in our script to read the data from the BPF program:

import ctypes as ct

TASK_COMM_LEN = 16	# linux/sched.h
DISK_NAME_LEN = 32	# linux/genhd.h
class Data(ct.Structure):
	_fields_ = [("pid", ct.c_ulonglong),
            	("ts", ct.c_ulonglong),
            	("comm", ct.c_char * TASK_COMM_LEN),
            	("lat", ct.c_ulonglong),
            	("disk", ct.c_char * DISK_NAME_LEN)]

The last step is to print the data we want:

def print_event(cpu, data, size):
    global start
    event = ct.cast(data, ct.POINTER(Data)).contents
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d   %-1s %s   %s" % (time_s, event.comm, event.pid, event.lat, label, event.disk))

b["events"].open_perf_buffer(print_event)
# format output
start = 0
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

The full script is available on GitHub. Let’s run the script and trigger udev events while fio writes to a bcache device:

Success! Now we see that what looked like high latency for bcache is really generic_make_request() latency for its backing device.

Dig into the Kernel

What drags when submitting requests? We can see that the latency spike occurred before request accounting even started. This can be verified by running iostat during the problem or with the biolatency BCC script: both are based on the start of request accounting, so neither tool shows the disk problem.

If we take a look at generic_make_request(), we see that there are two functions running before accounting starts. The first is generic_make_request_checks(), which is lightweight and checks bio according to the device settings, etc. The second is blk_queue_enter(), which has a wait_event_interruptible() call:
ret = wait_event_interruptible(q->mq_freeze_wq,
	(atomic_read(&q->mq_freeze_depth) == 0 &&
	(preempt || !blk_queue_preempt_only(q))) ||
	blk_queue_dying(q));

Here the kernel waits until the queue is unfrozen. Let’s measure the latency of blk_queue_enter():

~# /usr/share/bcc/tools/funclatency  blk_queue_enter -i 1 -m               	 
Tracing 1 functions for "blk_queue_enter"... Hit Ctrl-C to end.

 	msecs           	: count 	distribution
     	0 -> 1      	: 341  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 316  	|****************************************|

 	msecs           	: count 	distribution
     	0 -> 1      	: 255  	|****************************************|
     	2 -> 3      	: 0    	|                                    	|
     	4 -> 7      	: 0    	|                                    	|
     	8 -> 15     	: 1    	|                                    	|

It looks like we’re close. The functions used to freeze/unfreeze the queue are blk_mq_freeze_queue and blk_mq_unfreeze_queue. They’re used to change queue settings, which could affect io requests currently in flight. When blk_mq_freeze_queue() is called, q->mq_freeze_depth is incremented in blk_freeze_queue_start(). After that, the kernel waits for the queue to be empty in blk_mq_freeze_queue_wait().

This waiting time is equal to the disk latency, because the kernel has to wait for all io operations to finish. When the queue is empty, the changes can be made. The final step is to call blk_mq_unfreeze_queue(), which decreases the freeze_depth counter.
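
If you want to watch the freeze wait itself on a live system, the same funclatency approach should work on that function as well (we didn’t need this step; it’s shown only as a possible extra check):

~# /usr/share/bcc/tools/funclatency blk_mq_freeze_queue_wait -i 1 -m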

Now we know enough to fix the issue. The udevadm trigger command changes the settings for block devices. Those settings are described in udev rules. We can find out what settings freeze the queue by changing them via sysfs or by looking at the kernel source code. Alternatively, we can call trace from the BCC toolkit to print the kernel and user stacks for every blk_freeze_queue call:

~# /usr/share/bcc/tools/trace blk_freeze_queue -K -U
PID 	TID 	COMM        	FUNC        	 
3809642 3809642 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	elevator_switch+0x29 [kernel]
    	elv_iosched_store+0x197 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

3809631 3809631 systemd-udevd   blk_freeze_queue
    	blk_freeze_queue+0x1 [kernel]
    	queue_requests_store+0xb6 [kernel]
    	queue_attr_store+0x5c [kernel]
    	sysfs_kf_write+0x3c [kernel]
    	kernfs_fop_write+0x125 [kernel]
    	__vfs_write+0x1b [kernel]
    	vfs_write+0xb8 [kernel]
    	sys_write+0x55 [kernel]
    	do_syscall_64+0x73 [kernel]
    	entry_SYSCALL_64_after_hwframe+0x3d [kernel]
    	__write_nocancel+0x7 [libc-2.23.so]
    	[unknown]

Udev rules don’t change often, so most of the time they just re-assign values that are already set; even this causes a spike in submit latency for the application. Of course, generating udev events when there aren’t any changes in a device’s configuration (no device is attached or detached) is not good practice. Still, we can prevent the kernel from freezing the queue when there’s no reason to do so. Three small commits fix the issue.

Conclusion

eBPF is a highly flexible and powerful instrument. In this article, we looked at only one case and demonstrated a little bit of what it’s capable of. If you’re interested in developing BCC-based tools, you should take a look at the official tutorial, which describes its fundamental concepts.

There are also other interesting eBPF-based tools available for profiling and debugging. One of them is bpftrace, which lets you write powerful one-liners and small programs in an awk-like language. Another is ebpf_exporter, which can collect low-level, high-resolution metrics for your Prometheus server, with its great visualization and alerting abilities.

Translated from: https://habr.com/en/company/selectel/blog/450818/
