Linux perf event 的使用(二)

前言

perf的简单使用如下所示:Linux perf的使用(一)

Linux perf event事件如下图所示:
在这里插入图片描述
图片来自于:https://www.brendangregg.com/perf.html

事件类型包括:
Hardware Events:CPU性能监视计数器(CPU performance monitoring counters)。
Software Events:这些是基于内核计数器的低级事件。例如,CPU migrations, minor faults, major faults等。
Kernel Tracepoint Events:这是静态内核级的插入点,它们被硬编码在内核中比较重要的逻辑位置。
User Statically-Defined Tracing (USDT):这些是用户级程序和应用程序的静态跟踪点。
Dynamic Tracing:软件可以动态检测,在任何位置创建事件。对于内核软件,这使用kprobes框架。对于用户级软件,是uprobes框架。
Timed Profiling:可以使用perf-record-FHz以任意频率收集快照。这通常用于CPU使用情况分析,并通过创建自定义定时中断事件来工作。

通过perf可以收集有关上述事件的详细信息,包括时间戳、导致事件的代码路径以及其他特定的详细信息。perf_events的功能非常强大。

event是perf工作的基础,主要有两种:有使用硬件的PMU里的event,也有在内核代码中注册的event。

perf_event框架图如下所示:
在这里插入图片描述
图片来自于:https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html
备注:Perf 通过系统调用perf_event_open()来完成对perf event的计数或者采样。

一、perf Events

可以使用list子命令列出当前可用的事件:

[root@localhost perf]# perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  node-load-misses                                   [Hardware cache event]
  node-loads                                         [Hardware cache event]
  node-store-misses                                  [Hardware cache event]
  node-stores                                        [Hardware cache event]

  branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
  branch-misses OR cpu/branch-misses/                [Kernel PMU event]
  bus-cycles OR cpu/bus-cycles/                      [Kernel PMU event]
  cache-misses OR cpu/cache-misses/                  [Kernel PMU event]
  cache-references OR cpu/cache-references/          [Kernel PMU event]
  cpu-cycles OR cpu/cpu-cycles/                      [Kernel PMU event]
  instructions OR cpu/instructions/                  [Kernel PMU event]
  intel_bts//                                        [Kernel PMU event]
  intel_pt//                                         [Kernel PMU event]
  mem-loads OR cpu/mem-loads/                        [Kernel PMU event]
  mem-stores OR cpu/mem-stores/                      [Kernel PMU event]
  
  ......
  
  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
  
  mem:<addr>[/len][:access]                          [Hardware breakpoint]
  
  block:block_bio_backmerge                          [Tracepoint event]
  block:block_bio_bounce                             [Tracepoint event]
  block:block_bio_complete                           [Tracepoint event]
  block:block_bio_frontmerge                         [Tracepoint event]
  block:block_bio_queue                              [Tracepoint event]
  block:block_bio_remap                              [Tracepoint event]
  block:block_dirty_buffer                           [Tracepoint event]
  block:block_getrq                                  [Tracepoint event]
  block:block_plug                                   [Tracepoint event]
  block:block_rq_abort                               [Tracepoint event]
  block:block_rq_complete                            [Tracepoint event]
  block:block_rq_insert                              [Tracepoint event]
  block:block_rq_issue                               [Tracepoint event]
  block:block_rq_remap                               [Tracepoint event]
  block:block_rq_requeue                             [Tracepoint event]
  block:block_sleeprq                                [Tracepoint event]
  block:block_split                                  [Tracepoint event]
  block:block_touch_buffer                           [Tracepoint event]
  block:block_unplug                                 [Tracepoint event]
  bridge:br_fdb_add                                  [Tracepoint event]
  bridge:br_fdb_external_learn_add                   [Tracepoint event]
  bridge:br_fdb_update                               [Tracepoint event]
  bridge:fdb_delete                                  [Tracepoint event]
  compaction:mm_compaction_isolate_freepages         [Tracepoint event]
  compaction:mm_compaction_isolate_migratepages      [Tracepoint event]
  compaction:mm_compaction_migratepages              [Tracepoint event]
  context_tracking:user_enter                        [Tracepoint event]
  context_tracking:user_exit                         [Tracepoint event]
  devlink:devlink_hwmsg                              [Tracepoint event]
  ......

事件类型包括:

■ Hardware event: Mostly processor events (implemented using PMCs)
■ Software event: A kernel counter event
■ Hardware cache event: Processor cache events (PMCs)
■ Kernel PMU event: Performance Monitoring Unit (PMU) events (PMCs)
■ cache, floating point...: Processor vendor events (PMCs) and brief descriptions
■ Raw hardware event descriptor: PMCs specified using raw codes
■ Hardware breakpoint: Processor breakpoint event
■ Tracepoint event: Kernel static instrumentation events
■ SDT event: User-level static instrumentation events (USDT)
■ pfm-events: libpfm events (added in Linux 5.8)

跟踪点和SDT事件主要列出静态检测点,但如果您创建了一些动态检测探测,那么也会列出这些检测点。比如示例:probe:do_nanosleep被描述为基于kprobe的“跟踪点事件”。

从这里我们可以了解到event都有哪些类型, perf list 列出的每个event后面都有一个"[]“,里面写了这个event属于什么类型,比如"Hardware event”、"Software event"等。完整的event类型,我们在内核代码枚举结构perf_type_id里可以看到:

// include/uapi/linux/perf_event.h

*
 * attr.type
 */
enum perf_type_id {
	PERF_TYPE_HARDWARE			= 0,
	PERF_TYPE_SOFTWARE			= 1,
	PERF_TYPE_TRACEPOINT			= 2,
	PERF_TYPE_HW_CACHE			= 3,
	PERF_TYPE_RAW				= 4,
	PERF_TYPE_BREAKPOINT			= 5,

	PERF_TYPE_MAX,				/* non-ABI */
};

uapi 文件夹:内核版本从3.5开始,Linux Kernel 里多了一个 uapi 文件夹。uapi 只是把内核用到的头文件和用户态用到的头文件分开。
请参考:https://tinylab.org/linux-kernel-uapi/

perf-list命令接受搜索子字符串作为参数。例如,列出包含“mem_load_l3”的事件:

[root@localhost perf]# perf list mem_load_l3

List of pre-defined events (to be used in -e):


cache:
  mem_load_l3_hit_retired.xsnp_hit
       [Retired load instructions which data sources were L3 and cross-core snoop hits in on-pkg core cache Supports address when precise (Precise event)]
  mem_load_l3_hit_retired.xsnp_hitm
       [Retired load instructions which data sources were HitM responses from shared L3 Supports address when precise (Precise event)]
  mem_load_l3_hit_retired.xsnp_miss
       [Retired load instructions which data sources were L3 hit and cross-core snoop missed in on-pkg core cache Supports address when precise (Precise event)]
  mem_load_l3_hit_retired.xsnp_none
       [Retired load instructions which data sources were hits in L3 without snoops required Supports address when precise (Precise event)]

这些是硬件事件(基于PMC),输出包括简要描述。(精确事件)是指基于事件的精确采样(PEBS)功能事件。

二、Software Events

2.1 Software event简介

Software event是定义在Linux内核代码中的几个特定的事件,比较典型的有进程上下文切换(内核态到用户态的转换)事件context-switches、发生缺页中断的事件page-faults、cpu-migrations等。
软件事件通常映射到硬件事件,但在软件中进行检测。与硬件事件一样,它们可能具有默认的采样频率,通常为4000,因此在使用record子命令时只捕获一个子集。

perf提供了少量固定的软件事件:

[root@localhost perf]# perf list

List of pre-defined events (to be used in -e):
  ......
  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]
  ......
// include/uapi/linux/perf_event.h

/*
 * Special "software" events provided by the kernel, even if the hardware
 * does not support performance events. These events measure various
 * physical and sw events of the kernel (and allow the profiling of them as
 * well):
 */
enum perf_sw_ids {
	PERF_COUNT_SW_CPU_CLOCK			= 0,
	PERF_COUNT_SW_TASK_CLOCK		= 1,
	PERF_COUNT_SW_PAGE_FAULTS		= 2,
	PERF_COUNT_SW_CONTEXT_SWITCHES		= 3,
	PERF_COUNT_SW_CPU_MIGRATIONS		= 4,
	PERF_COUNT_SW_PAGE_FAULTS_MIN		= 5,
	PERF_COUNT_SW_PAGE_FAULTS_MAJ		= 6,
	PERF_COUNT_SW_ALIGNMENT_FAULTS		= 7,
	PERF_COUNT_SW_EMULATION_FAULTS		= 8,

	PERF_COUNT_SW_MAX,			/* non-ABI */
};

手册页 perf_event_open 中也记录了这些内容:

The perf_event_attr structure provides detailed configuration information for the event being created.

           struct perf_event_attr {
               __u32 type;                 /* Type of event */
               ......   
           }

type   This field specifies the overall event type.  It has one of the following values:

PERF_TYPE_SOFTWARE
                     This indicates one of the software-defined events provided by the kernel (even if no hardware support is available).

If type is PERF_TYPE_SOFTWARE, we are measuring software events provided by the kernel.  Set config to one of the following:

                   PERF_COUNT_SW_CPU_CLOCK
                          This reports the CPU clock, a high-resolution per-CPU timer.

                   PERF_COUNT_SW_TASK_CLOCK
                          This reports a clock count specific to the task that is running.

                   PERF_COUNT_SW_PAGE_FAULTS
                          This reports the number of page faults.

                   PERF_COUNT_SW_CONTEXT_SWITCHES
                          This counts context switches.  Until Linux 2.6.34, these were all reported as user-space events, after that they are reported as happening in the kernel.

                   PERF_COUNT_SW_CPU_MIGRATIONS
                          This reports the number of times the process has migrated to a new CPU.

                   PERF_COUNT_SW_PAGE_FAULTS_MIN
                          This counts the number of minor page faults.  These did not require disk I/O to handle.

                   PERF_COUNT_SW_PAGE_FAULTS_MAJ
                          This counts the number of major page faults.  These required disk I/O to handle.

                   PERF_COUNT_SW_ALIGNMENT_FAULTS (since Linux 2.6.33)
                          This  counts  the number of alignment faults.  These happen when unaligned memory accesses happen; the kernel can handle these but it reduces performance.  This hap‐
                          pens only on some architectures (never on x86).

                   PERF_COUNT_SW_EMULATION_FAULTS (since Linux 2.6.33)
                          This counts the number of emulation faults.  The kernel sometimes traps on unimplemented instructions and emulates them for user space.  This can  negatively  impact
                          performance.

                   PERF_COUNT_SW_DUMMY (since Linux 3.12)
                          This  is  a placeholder event that counts nothing.  Informational sample record types such as mmap or comm must be associated with an active event.  This dummy event
                          allows gathering such records without requiring a counting event.

2.2 内核源码示例

内核源码中进程上下文切换事件context-switches:

/*
 * context_switch - switch to the new MM and the new
 * thread's register state.
 */
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	......
	prepare_task_switch(rq, prev, next);
	......
	finish_task_switch(this_rq(), prev);
	......
}
prepare_task_switch()
	-->perf_event_task_sched_out()
		{
		perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, NULL, 0);
	
		if (static_key_false(&perf_sched_events.key))
			__perf_event_task_sched_out(prev, next);
		}
}
finish_task_switch()
	-->perf_event_task_sched_in()
		{
		if (static_key_false(&perf_sched_events.key))
			__perf_event_task_sched_in(prev, task);
		}

缺页中断的事件page-faults:

do_page_fault()
	-->__do_page_fault()
		{
			/* Get the faulting address: */
			address = read_cr2();
			
			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
			
		}
		

2.3 Software event的使用

软件事件可能有默认周期。这意味着当您使用它们进行采样时,您是在采样事件的子集,而不是跟踪每个事件。您可以使用perf record-vv进行检查:

[root@localhost perf]# perf record -vv -e context-switches /bin/true
Using CPUID GenuineIntel-6-9E
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x3
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|PERIOD
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 11021  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid 11021  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid 11021  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid 11021  cpu 3  group_fd -1  flags 0x8 = 8
mmap size 528384B
perf event ring buffer mmapped per cpu
Synthesizing TSC conversion information
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data ]

有关这些字段的说明,请参阅perf_event_open手册页。默认情况下,内核调整采样率,以便每秒捕获大约4000个上下文切换事件。如果您真的想全部记录,请使用-c 1:

 -c, --count=
           Event period to sample.
[root@localhost perf]# perf record -vv -e context-switches -c 1 /bin/true
Using CPUID GenuineIntel-6-9E
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             112
  config                           0x3
  { sample_period, sample_freq }   1
  sample_type                      IP|TID|TIME
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid 14388  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid 14388  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid 14388  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid 14388  cpu 3  group_fd -1  flags 0x8 = 8
mmap size 528384B
perf event ring buffer mmapped per cpu
Synthesizing TSC conversion information
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.017 MB perf.data ]

注意记录每一个事件的数量和所涉及的开销,尤其是频繁的上下文切换。您可以使用perf-stat检查事件的速率,以便您可以估计将要捕获的数据量。

三、Hardware Events (PMCs)

[root@localhost perf]# perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  ......
[root@localhost perf]# perf stat
^C
 Performance counter stats for 'system wide':

      10641.642600      cpu-clock (msec)          #    3.998 CPUs utilized
               475      context-switches          #    0.045 K/sec
                34      cpu-migrations            #    0.003 K/sec
             3,505      page-faults               #    0.329 K/sec
        94,260,640      cycles                    #    0.009 GHz
        54,033,265      instructions              #    0.57  insn per cycle
        10,901,207      branches                  #    1.024 M/sec
           364,597      branch-misses             #    3.34% of all branches

       2.661520522 seconds time elapsed

处理器和其他设备通常支持用于观察活动的硬件计数器。主要来源是处理器,它们通常称为性能监视计数器(PMC)。它们也有其他名称:CPU性能计数器(CP)、性能检测计数器(PIC)和性能监视单元事件(PMU事件)。这些都指的是同一件事:处理器上的可编程硬件寄存器,它们在CPU周期级别提供低级别的性能信息。

PMC是性能分析的重要资源。只有通过PMC,您才能测量CPU指令的效率、CPU缓存的命中率、内存和设备总线的利用率、互连利用率、暂停周期等。使用这些来分析性能可能会导致各种性能优化。

在这里插入图片描述
虽然有很多PMC,但Intel选择了上表中的事件作为“架构集”,提供了一些核心功能的高级概述。可以使用cpuid指令检查这些体系结构集PMC的存在。

支持架构性能监视的处理器可能不支持所有预定义的架构性能事件(表中所列举的事件),架构事件的数量通过CPUID.0AH:EAX[31:24].

在Linux上,PMC通过perf_event_open系统调用进行访问,并由包括perf在内的工具使用。

虽然有数百个PMC可用,但在CPU中只有固定数量的寄存器可用于同时测量它们,可能只有6个。您需要选择要在这六个寄存器上测量哪些PMC,或者循环使用不同的PMC集作为对它们进行采样的一种方式(Linux perf自动支持此功能)。其他软件计数器不会受到这些限制的影响。

PMC可以在不同的模式下使用:计数,在这种模式下,它们以几乎为零的开销来计数事件;以及溢出采样,在每一个可配置的事件数中就有一个事件发生中断,这样就可以捕获状态。计数可用于量化问题;溢出采样可用于显示负责的代码路径。

为了易于使用,perf提供了可替代使用的可读映射。例如,事件“branch-instructions”希望映射到系统上的分支指令PMC。其中一些人可读的名称在前面的列表中可见(硬件和PMU事件)。

将perf记录与PMC一起使用时,将使用默认采样频率,这样就不会记录所有事件。例如,记录周期事件:

root@localhost perf]# perf record -vve cycles -a sleep 1
Using CPUID GenuineIntel-6-9E
intel_pt default config: tsc,mtc,mtc_period=3,psb_period=3,pt,branch
------------------------------------------------------------
perf_event_attr:
  size                             112
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|CPU|PERIOD
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
------------------------------------------------------------
sys_perf_event_open: pid -1  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid -1  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid -1  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid -1  cpu 3  group_fd -1  flags 0x8 = 8
mmap size 528384B
perf event ring buffer mmapped per cpu
Synthesizing TSC conversion information
[ perf record: Woken up 1 times to write data ]

输出显示频率采样已启用(频率1),采样频率为4000。这告诉内核调整采样率,以便每台CPU每秒捕获大约4000个事件。这是可取的,因为某些PMC检测每秒可能发生数十亿次的事件(例如,CPU周期),并且记录每个事件的开销将令人望而却步。但这也是一个错误:perf的默认输出(没有非常详细的选项:-vv)并不表示正在使用频率采样,您可能希望记录所有事件。此事件频率仅影响记录子命令;stat统计所有事件。
可以使用-F选项修改事件频率,或者使用-c将事件频率更改为一个周期,该周期在每个周期中捕获一个事件(也称为溢出采样)。作为使用-F的示例:

perf record -F 99 -e cycles -a sleep 1

这将以99HZ(每秒事件数)的目标速率进行采样。

请注意,频率有限制,perf的CPU利用率也有限制,可以使用sysctl查看和设置:

[root@localhost perf]# sysctl kernel.perf_event_max_sample_rate
kernel.perf_event_max_sample_rate = 50000
[root@localhost perf]# cat /proc/sys/kernel/perf_event_max_sample_rate
50000

这表明该系统的最大采样率为50000Hertz。

[root@localhost perf]# sysctl kernel.perf_cpu_time_max_percent
kernel.perf_cpu_time_max_percent = 25
[root@localhost perf]# cat /proc/sys/kernel/perf_cpu_time_max_percent
25

perf(特别是PMU中断)允许的最大CPU利用率为25%。

四、Tracepoint Events

4.1 Tracepoint Events简介

这些跟踪点被硬编码在内核源码中较为重要,基于观测意义的位置,以便可以容易地跟踪更高级别的行为。例如,系统调用、TCP事件、文件系统I/O、磁盘I/O等。这些被分为跟踪点库;例如,“sock:”表示套接字事件,“sched:”表示CPU调度程序事件。跟踪点的一个关键价值是它们应该有一个稳定的API,所以如果您编写的工具在一个内核版本上使用它们,那么它们也应该在以后的版本上运行。

跟踪点通常通过放置include/trace/events/*中的宏添加到内核代码中。

关于Tracepoint简介请参考:Linux tracepoint 简介

4.2 Tracepoint Events的使用

[root@localhost perf]# perf list tracepoint

List of pre-defined events (to be used in -e):

  block:block_bio_backmerge                          [Tracepoint event]
  block:block_bio_bounce                             [Tracepoint event]
  block:block_bio_complete                           [Tracepoint event]
  block:block_bio_frontmerge                         [Tracepoint event]
  block:block_bio_queue                              [Tracepoint event]
  block:block_bio_remap                              [Tracepoint event]
  block:block_dirty_buffer                           [Tracepoint event]
  block:block_getrq                                  [Tracepoint event]
  block:block_plug                                   [Tracepoint event]
  block:block_rq_abort                               [Tracepoint event]
  block:block_rq_complete                            [Tracepoint event]
  block:block_rq_insert                              [Tracepoint event]
  block:block_rq_issue                               [Tracepoint event]
  block:block_rq_remap                               [Tracepoint event]
  block:block_rq_requeue                             [Tracepoint event]
  block:block_sleeprq                                [Tracepoint event]
  block:block_split                                  [Tracepoint event]
  block:block_touch_buffer                           [Tracepoint event]
  block:block_unplug                                 [Tracepoint event]
  bridge:br_fdb_add                                  [Tracepoint event]
  bridge:br_fdb_external_learn_add                   [Tracepoint event]
  bridge:br_fdb_update                               [Tracepoint event]
  bridge:fdb_delete                                  [Tracepoint event]
  compaction:mm_compaction_isolate_freepages         [Tracepoint event]
  compaction:mm_compaction_isolate_migratepages      [Tracepoint event]
  compaction:mm_compaction_migratepages              [Tracepoint event]
  ......

在centos 7.6 发行版中,总结跟踪点库名称和跟踪点数量:

[root@localhost perf]# perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END {
>     for (l in lib) { printf "  %-16.16s %d\n", l, lib[l] } }' | sort | column
    block          19       iommu          7        rcu            1
    bridge         4        irq            5        regmap         15
    compaction     3        irq_vectors    22       rpm            4
    context_tracki 2        kmem           12       sched          23
    devlink        1        kvm            55       scsi           5
    dma_fence      8        kvmmmu         14       signal         2
    drm            3        libata         6        skb            3
    exceptions     2        mce            1        sock           2
    filelock       6        migrate        2        sunrpc         31
    filemap        2        module         5        syscalls       612
    fs_dax         14       mpx            5        task           2
    ftrace         1        napi           1        timer          13
    gpio           2        net            9        ucsi           7
    gvt            14       oom            1        udp            1
    hda            5        pagemap        2        vmscan         15
    hda_controller 6        power          20       vsyscall       1
    hda_intel      4        printk         1        workqueue      4
    hyperv         1        qdisc          1        writeback      25
    i2c            8        random         15       xen            35
    i915           39       ras            3        xfs            376
    intel_ish      1        raw_syscalls   2        xhci-hcd       48

以下简单的一行计数系统调用执行的命令,并打印摘要(非零计数):

[root@localhost perf]# perf stat -e 'syscalls:sys_enter_*' gzip file 2>&1 | awk '$1 != 0'

 Performance counter stats for 'gzip file':

                 1      syscalls:sys_enter_utimensat
                 1      syscalls:sys_enter_unlink
                 3      syscalls:sys_enter_newfstat
                 1      syscalls:sys_enter_lseek
                 2      syscalls:sys_enter_read
                 1      syscalls:sys_enter_write
                 1      syscalls:sys_enter_access
                 1      syscalls:sys_enter_fchmod
                 1      syscalls:sys_enter_fchown
                 4      syscalls:sys_enter_open
                 7      syscalls:sys_enter_close
                 4      syscalls:sys_enter_mprotect
                 1      syscalls:sys_enter_brk
                 1      syscalls:sys_enter_munmap
                 4      syscalls:sys_enter_rt_sigprocmask
                12      syscalls:sys_enter_rt_sigaction
                 1      syscalls:sys_enter_exit_group
                 7      syscalls:sys_enter_mmap

       0.000933241 seconds time elapsed

在系统范围内跟踪10秒并打印事件:

oot@ubuntu:/home/yl/study/perf# perf record -e block:block_rq_issue -a sleep 10; perf script
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.931 MB perf.data (153 samples) ]
 kworker/u256:0-  5045 [002]  1264.855601: block:block_rq_issue: 0,0 N 8 () 0 + 0 [kworker/u256:0]
 kworker/2:1H-ev   106 [002]  1266.331217: block:block_rq_issue: 8,0 WM 8192 () 1052672 + 16 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331300: block:block_rq_issue: 8,0 WM 4096 () 1052696 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331353: block:block_rq_issue: 8,0 WM 4096 () 1052720 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331409: block:block_rq_issue: 8,0 WM 4096 () 5246992 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331470: block:block_rq_issue: 8,0 WM 4096 () 22024200 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331521: block:block_rq_issue: 8,0 WM 4096 () 22024672 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331564: block:block_rq_issue: 8,0 WM 4096 () 22024712 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331600: block:block_rq_issue: 8,0 WM 4096 () 22026384 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331639: block:block_rq_issue: 8,0 WM 4096 () 22026568 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331672: block:block_rq_issue: 8,0 WM 4096 () 22295408 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331720: block:block_rq_issue: 8,0 WM 4096 () 42995712 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331768: block:block_rq_issue: 8,0 WM 12288 () 42995840 + 24 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331834: block:block_rq_issue: 8,0 WM 4096 () 42995968 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.331913: block:block_rq_issue: 8,0 WM 4096 () 42996096 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332091: block:block_rq_issue: 8,0 WM 4096 () 42996584 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332136: block:block_rq_issue: 8,0 WM 4096 () 42996744 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332180: block:block_rq_issue: 8,0 WM 4096 () 42996784 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332226: block:block_rq_issue: 8,0 WM 4096 () 42996904 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332260: block:block_rq_issue: 8,0 WM 4096 () 42997088 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332312: block:block_rq_issue: 8,0 WM 4096 () 42999216 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332353: block:block_rq_issue: 8,0 WM 4096 () 43000792 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332389: block:block_rq_issue: 8,0 WM 8192 () 43000872 + 16 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332423: block:block_rq_issue: 8,0 WM 4096 () 43000976 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332464: block:block_rq_issue: 8,0 WM 4096 () 43001328 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332493: block:block_rq_issue: 8,0 WM 12288 () 43001632 + 24 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332530: block:block_rq_issue: 8,0 WM 4096 () 43003072 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332580: block:block_rq_issue: 8,0 WM 32768 () 43003088 + 64 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332683: block:block_rq_issue: 8,0 WM 77824 () 43004512 + 152 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332819: block:block_rq_issue: 8,0 WM 4096 () 43061504 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.332926: block:block_rq_issue: 8,0 WM 4096 () 43062520 + 8 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.333026: block:block_rq_issue: 8,0 WM 8192 () 43130624 + 16 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.333089: block:block_rq_issue: 8,0 WM 16384 () 43130648 + 32 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.333136: block:block_rq_issue: 8,0 WM 24576 () 43130688 + 48 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.333239: block:block_rq_issue: 8,0 WM 12288 () 43130744 + 24 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.333310: block:block_rq_issue: 8,0 WM 12288 () 43130776 + 24 [kworker/2:1H]
 kworker/2:1H-ev   106 [002]  1266.333352: block:block_rq_issue: 8,0 WM 16384 () 43130808 + 32 [kworker/2:1H]
 ......

打印此跟踪点的参数及其格式字符串(元数据摘要):

oot@ubuntu:/home/yl/study/perf# cat /sys/kernel/debug/tracing/events/block/block_rq_issue/format
name: block_rq_issue
ID: 1193
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:dev_t dev;	offset:8;	size:4;	signed:0;
	field:sector_t sector;	offset:16;	size:8;	signed:0;
	field:unsigned int nr_sector;	offset:24;	size:4;	signed:0;
	field:unsigned int bytes;	offset:28;	size:4;	signed:0;
	field:char rwbs[8];	offset:32;	size:8;	signed:1;
	field:char comm[16];	offset:40;	size:16;	signed:1;
	field:__data_loc char[] cmd;	offset:56;	size:4;	signed:1;

print fmt: "%d,%d %s %u (%s) %llu + %u [%s]", ((unsigned int) ((REC->dev) >> 20)), ((unsigned int) ((REC->dev) & ((1U << 20) - 1))), REC->rwbs, REC->bytes, __get_str(cmd), (unsigned long long)REC->sector, REC->nr_sector, REC->comm

仅过滤大于65536字节的块I/O:

perf record -e block:block_rq_issue --filter 'bytes > 65536' -a sleep 10

跟踪由 “man ls” 命令触发的新进程:

[root@localhost perf]# perf list sched:sched_process_exec

List of pre-defined events (to be used in -e):

  sched:sched_process_exec                           [Tracepoint event]
[root@localhost perf]# perf record -e sched:sched_process_exec -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.661 MB perf.data (59 samples) ]

[root@localhost perf]# perf report -n --sort comm --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 59  of event 'sched:sched_process_exec'
# Event count (approx.): 59
#
# Overhead       Samples  Command
# ........  ............  ...........
#
    16.95%            10  grep
    16.95%            10  head
     8.47%             5  sleep
     8.47%             5  tail
     5.08%             3  df
     5.08%             3  who
     3.39%             2  bash
     3.39%             2  consoletype
     3.39%             2  grepconf.sh
     3.39%             2  id
     3.39%             2  lspci
     3.39%             2  tty
     3.39%             2  uname
     1.69%             1  groff
     1.69%             1  grotty
     1.69%             1  less
     1.69%             1  locale
     1.69%             1  man
     1.69%             1  nroff
     1.69%             1  preconv
     1.69%             1  tbl
     1.69%             1  troff


#
# (Tip: Use parent filter to see specific call path: perf report -p <regex>)
#

4.3 Tracepoints Interface

跟踪工具可以通过tracefs中的跟踪事件文件(通常安装在/sys/kernel/debug/tracing)或perf_event_open系统调用来使用跟踪点。例如,基于Ftrace的iosnoop工具使用tracefs文件:

git clone --depth 1 https://github.com/brendangregg/perf-tools
root@localhost ~]# strace -e open,chdir perf-tools/bin/iosnoop
......
chdir("/sys/kernel/debug/tracing")      = 0
open("/var/tmp/.ftrace-lock", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
open("current_tracer", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
open("buffer_size_kb", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
open("events/block/block_rq_issue/enable", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
open("events/block/block_rq_complete/enable", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
......

对于较高版本的内核open系统调用发生了变化:

strace -e openat,chdir perf-tools/bin/iosnoop

输出包括指向tracefs目录的chdir和打开块跟踪点的“enable”文件。它还包括一个/var/tmp/.ftrace锁:这是编写的阻止并发工具用户的预防措施,tracefs接口不容易支持。

perf_event_open接口支持并发用户,并且在可能的情况下首选。同一工具的更新BCC版本使用该工具:

root@yl-virtual-machine:# strace -e perf_event_open /usr/share/bcc/tools/biolatency 
perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0, sample_period=1, sample_type=0, read_format=0, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 7
perf_event_open({type=0x6 /* PERF_TYPE_??? */, size=PERF_ATTR_SIZE_VER7, config=0, sample_period=1, sample_type=0, read_format=0, precise_ip=0 /* arbitrary skid */, ...}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 9
......

perf_event_open是内核perf_events子系统的接口,它提供各种评测和跟踪功能。

4.4 Tracepoints Overhead

当跟踪点被激活时,它们会为每个事件增加少量的CPU开销。跟踪工具还可以增加后处理事件的CPU开销,以及记录事件的文件系统开销。开销是否高到足以干扰生产应用程序取决于事件的速率和CPU的数量,这是使用跟踪点时需要考虑的问题。

在当今的典型系统(4到128个CPU)上,发现低于每秒10000个的事件速率的开销可以忽略不计,只有超过100000个的开销才开始变得可测量。作为事件示例,您可能会发现磁盘事件通常小于每秒10000个,但调度程序事件可能远远超过每秒100000个,因此跟踪成本可能很高。

2018年在Linux 4.7中添加了一种新的跟踪点,称为原始跟踪点(raw tracepoints),它避免了创建稳定跟踪点参数的成本,从而减少了开销。

除了在使用跟踪点时启用的开销外,还有使跟踪点可用的禁用开销。禁用的跟踪点变为少量指令:对于x86_64,它是一个5字节无操作(nop)指令。在函数的末尾还添加了一个跟踪点处理程序,这会稍微增加它的文本大小。虽然这些开销很小,但在向内核添加跟踪点时,应该分析和理解这些开销。

4.5 User-Level Statically Defined Tracing (USDT)

与内核跟踪点类似,用户级静态定义跟踪(USDT)是 tracepoints 的用户空间版本。
USDT 与 uprobes 的关系类似于 tracepoints 与 kprobes 之间的关系。

USDT 和 tracepoints 是硬编码在源码中,是稳定的API,开销较小,跟踪点数量较少。
而uprobes和kprobes是修改运行代码的指令,以便在需要时插入指令,使用时有一定的开销,可以跟踪的数量很多。

一些应用程序和库在其代码中添加了USDT探针,为跟踪应用程序级事件提供了一个稳定的(和文档化的)API。这些跟踪点在应用程序源代码中的重要逻辑位置进行硬编码(通常通过放置宏),并作为一个稳定的API呈现(事件名称和参数)。许多应用程序已经包含了跟踪点,这些跟踪点是为了支持DTrace而添加的。例如,Java JDK、PostgreSQL数据库和libc中都有USDT探针。

然而,许多这些应用程序在默认情况下不会在Linux上编译它们。通常需要使用–withdtrace标志自己编译应用程序。

例如,使用此版本的Node.js编译USDT事件:

ubuntun下:

$ sudo apt-get install systemtap-sdt-dev       # adds "dtrace", used by node build

由于我已经在centos 7上安装了systemtap,编译node.js即可(–with-dtrace)。

$ wget https://nodejs.org/dist/v4.4.1/node-v4.4.1.tar.gz
$ tar xvf node-v4.4.1.tar.gz 
$ cd node-v4.4.1
$ ./configure --with-dtrace
$ make -j 8

要检查生成的节点二进制文件是否包含探测,请执行以下操作:

[root@localhost node-v4.4.1]# readelf -n node

Displaying notes found at file offset 0x00000254 with length 0x00000020:
  Owner                 Data size       Description
  GNU                  0x00000010       NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.32

Displaying notes found at file offset 0x00000274 with length 0x00000024:
  Owner                 Data size       Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 7cb25e0287bec225698aa0e38de31fdad07b0abf

Displaying notes found at file offset 0x00e67bc8 with length 0x000003c4:
  Owner                 Data size       Description
  stapsdt              0x0000003c       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: gc__start
    Location: 0x0000000000d98f74, Base: 0x000000000111133c, Semaphore: 0x0000000001467b94
    Arguments: 4@%esi 4@%edx 8@%rdi
  stapsdt              0x0000003b       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: gc__done
    Location: 0x0000000000d98f84, Base: 0x000000000111133c, Semaphore: 0x0000000001467b96
    Arguments: 4@%esi 4@%edx 8@%rdi
  stapsdt              0x00000068       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: net__server__connection
    Location: 0x0000000000d99251, Base: 0x000000000111133c, Semaphore: 0x0000000001467b88
    Arguments: 8@%rax 8@-1096(%rbp) -4@-1100(%rbp) -4@-1104(%rbp)
  stapsdt              0x00000061       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: net__stream__end
    Location: 0x0000000000d994d1, Base: 0x000000000111133c, Semaphore: 0x0000000001467b8a
    Arguments: 8@%rax 8@-1096(%rbp) -4@-1100(%rbp) -4@-1104(%rbp)
  stapsdt              0x00000089       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: http__server__request
    Location: 0x0000000000d999c0, Base: 0x000000000111133c, Semaphore: 0x0000000001467b8c
    Arguments: 8@%r14 8@%rax 8@-4328(%rbp) -4@-4332(%rbp) 8@-4288(%rbp) 8@-4296(%rbp) -4@-4336(%rbp)
  stapsdt              0x00000067       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: http__server__response
    Location: 0x0000000000d99c41, Base: 0x000000000111133c, Semaphore: 0x0000000001467b8e
    Arguments: 8@%rax 8@-1096(%rbp) -4@-1100(%rbp) -4@-1104(%rbp)
  stapsdt              0x00000089       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: http__client__request
    Location: 0x0000000000d99f7b, Base: 0x000000000111133c, Semaphore: 0x0000000001467b90
    Arguments: 8@%rax 8@%rdx 8@-2136(%rbp) -4@-2140(%rbp) 8@-2184(%rbp) 8@-2192(%rbp) -4@-2144(%rbp)
  stapsdt              0x00000060       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: http__client__response
    Location: 0x0000000000d9a20e, Base: 0x000000000111133c, Semaphore: 0x0000000001467b92
    Arguments: 8@%rdx 8@-1096(%rbp) -4@%eax -4@-1104(%rbp)

五、perf vs strace

进一步解释一下区别:strace的当前实现使用ptrace()附加到目标进程,并在系统调用期间停止它,就像调试器一样,这可能会造成严重的开销。为了证明这一点,下面的系统调用重载程序使用perf和strace自行运行。我只列出了显示其性能的输出行:

[root@localhost perf]# dd if=/dev/zero of=/dev/null bs=512 count=10000k
10240000+0 records in
10240000+0 records out
5242880000 bytes (5.2 GB) copied, 9.13118 s, 574 MB/s

[root@localhost perf]# perf stat -e 'syscalls:sys_enter_*' dd if=/dev/zero of=/dev/null bs=512 count=10000k
10240000+0 records in
10240000+0 records out
5242880000 bytes (5.2 GB) copied, 12.1365 s, 432 MB/s

[root@localhost perf]# strace -c dd if=/dev/zero of=/dev/null bs=512 count=10000k
10240000+0 records in
10240000+0 records out
5242880000 bytes (5.2 GB) copied, 283.28 s, 18.5 MB/s

使用perf时,程序运行速度慢1.3倍。但有了strace,它跑得慢了32倍。这很可能是最坏的结果:如果系统调用不那么频繁,两种工具之间的差异就不会那么大。

最新版本的perf包含了trace子命令,以提供与strace类似的功能,但开销要低得多。

perf trace 于Linux 3.7内核版本主线引入:
perf trace是perf性能分析基础设施中新增的一个工具。该工具的设计灵感来自于广受欢迎的’strace’工具,但它不使用ptrace()系统调用,而是使用Linux的追踪基础设施。它的目的是使追踪对更广泛的Linux用户群体更加容易。

perf trace将显示与目标相关的事件,最初是系统调用,但也包括其他系统事件,如页面错误、任务生命周期事件、调度事件等。这个工具目前还处于早期版本,因此只支持实时模式,并且还有很多细节需要改进,但最终将与其他perf工具一样,能够处理perf.data文件,允许从分析阶段分离出来的“记录”进行后续分析。

参考资料

https://www.brendangregg.com/perf.html
perf trace, an alternative to strace
Systems.Performance.Enterprise.and.the.Cloud.2nd.Edition

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值