Diagnosing CPU Softirq Problems on Linux

shen1936

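An Nginx server running in a Xen virtual machine has a problem: softirq load is too high and almost all of it lands on a single CPU. Once the system gets busy, that CPU becomes the bottleneck, like the shortest stave of a barrel.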

Running "top" on the affected server makes the anomaly in "si" obvious: most of the softirq load lands on CPU 1, while the other CPUs contribute almost nothing:

After starting top, press 1 to show per-CPU statistics:

shell> top
Cpu0: 11.3%us,  4.7%sy,  0.0%ni, 82.5%id,  ...  0.8%si,  0.8%st
Cpu1: 21.3%us,  7.4%sy,  0.0%ni, 51.5%id,  ... 17.8%si,  2.0%st
Cpu2: 16.6%us,  4.5%sy,  0.0%ni, 77.7%id,  ...  0.8%si,  0.4%st
Cpu3: 15.9%us,  3.6%sy,  0.0%ni, 79.3%id,  ...  0.8%si,  0.4%st
Cpu4: 17.7%us,  4.9%sy,  0.0%ni, 75.3%id,  ...  1.2%si,  0.8%st
Cpu5: 23.6%us,  6.6%sy,  0.0%ni, 68.1%id,  ...  0.9%si,  0.9%st
Cpu6: 18.1%us,  4.9%sy,  0.0%ni, 75.7%id,  ...  0.4%si,  0.8%st
Cpu7: 21.1%us,  5.8%sy,  0.0%ni, 71.4%id,  ...  1.2%si,  0.4%st

A high %si means the CPU is spending a large share of its time servicing software interrupts (softirqs).
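
To make the same check scriptable, here is a minimal sketch that reads a file in /proc/stat format and prints each CPU's softirq share. The function name softirq_share is my own, and note the assumption that differs from top: the figures are cumulative since boot, not sampled over an interval.

```shell
# softirq_share FILE: print "cpuN percent" for every per-CPU line in FILE.
# FILE must be in /proc/stat format, where the fields after the CPU name are
# user nice system idle iowait irq softirq steal (fields 2..9).
# Hypothetical helper; the percentages are cumulative since boot.
softirq_share() {
    awk '/^cpu[0-9]+ / {
        total = 0
        for (i = 2; i <= 9; i++) total += $i   # user .. steal
        if (total > 0) printf "%s %.1f\n", $1, 100 * $8 / total
    }' "$1"
}
```

Typical usage would be `softirq_share /proc/stat`. A CPU whose share is far above the others is the one to investigate.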

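Looking at the interrupt and softirq counters shows that the load is concentrated in NET_RX, which suggests the network card is involved: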

[root@localhost ~]# cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0:        123          0          0          0          0          0          0          0   IO-APIC-edge      timer
  1:          2          0          0          0          0          0          0          0   IO-APIC-edge      i8042
  8:          1          0          0          0          0          0          0          0   IO-APIC-edge      rtc0
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          4          0          0          0          0          0          0          0   IO-APIC-edge      i8042
 16:         29         23         13          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 23:        593          3          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
 24:        513          0          0          0          0          0          0          0  HPET_MSI-edge      hpet2
 25:          0          0          0          0          0          0          0          0  HPET_MSI-edge      hpet3
 26:          0          0          0          0          0          0          0          0  HPET_MSI-edge      hpet4
 27:          0          0          0          0          0          0          0          0  HPET_MSI-edge      hpet5
 28:          0          0          0          0          0          0          0          0  HPET_MSI-edge      hpet6
 32:          1          0          0          0          0          0          0          0   PCI-MSI-edge      radeon
 33:       5162    1163101       4076          0          0          0          0          0   PCI-MSI-edge      ahci
 34:        488          0          0          0          0          0          0          0   PCI-MSI-edge      snd_hda_intel
 35:  199742018          0          0          0          0          0          0          0   PCI-MSI-edge      p3p1
NMI:      16881      14322       2532       5796       7100       2561       2663       1527   Non-maskable interrupts
LOC:   55274327   26593870   20414558   17905601   37594073   20023723   18890990   16268588   Local timer interrupts
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
PMI:      16881      14322       2532       5796       7100       2561       2663       1527   Performance monitoring interrupts
IWI:          0          0          0          0          0          0          0          0   IRQ work interrupts
RES:     797221     421698     366626     368397     446879     370276     403842     408027   Rescheduling interrupts
CAL:        619        823        848        817        822        848        786        821   Function call interrupts
TLB:     460130     729375     366044     454289     337530     391140     345373     280296   TLB shootdowns
TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
MCP:        515        515        515        515        515        515        515        515   Machine check polls
ERR:          0
MIS:          0

watch -d -n 1 cat /proc/softirqs

Every 10.0s: cat /proc/softirqs                                                                                                                                                                                                                       Wed Sep 24 13:49:46 2014

                CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
      HI:          0          0          0          0          0          0          0          0
   TIMER:   49000789   16240383    9295008    6670902   24386999    5254307    4493437    3460921
  NET_TX:        165         84         81         41        144      30811         59         46
  NET_RX:  200668368     435663     305644     221200     675570     169646     136810     109890
   BLOCK:       5154    1162289       4076          1         37         19          6         25
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
 TASKLET:     245677          1          0          1          4          0          2          2
   SCHED:   10168410    5682917    5131262    2813832   12061600    3116420    2773199    2444974
 HRTIMER:      27794      46129      42233      29640      47991      35655      29513      25951
     RCU:   49166364   16990678    8614455    6761719   23939281    5569768    4625969    3569736
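
Picking the busiest CPU out of that table by eye gets tedious on machines with many cores. The sketch below does it with awk; busiest_cpu is a name I made up, and it assumes the file's first line is the CPU header, as in /proc/softirqs itself (not the output of watch, which adds its own header line).

```shell
# busiest_cpu NAME FILE: for softirq type NAME (e.g. NET_RX), print the CPU
# with the highest counter and that counter, from a /proc/softirqs-format FILE.
# Hypothetical helper; assumes line 1 of FILE is the "CPU0 CPU1 ..." header.
busiest_cpu() {
    awk -v name="$1:" '
        NR == 1 { for (i = 1; i <= NF; i++) cpu[i] = $i; next }
        $1 == name {
            best = 2                            # field 2 holds the first CPU
            for (i = 3; i <= NF; i++)
                if ($i + 0 > $best + 0) best = i
            print cpu[best - 1], $best          # header is shifted by one field
        }' "$2"
}
```

For example, `busiest_cpu NET_RX /proc/softirqs` on the machine above would point at CPU0.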

Check the NIC information on the host:

[root@localhost ~]# grep -A 10 -i network /var/log/dmesg
Initalizing network drop monitor service
Freeing unused kernel memory: 1276k freed
Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 800k freed
Freeing unused kernel memory: 1588k freed
dracut: dracut-004-335.el6
dracut: rd_NO_LUKS: removing cryptoluks activation
dracut: rd_NO_LVM: removing LVM activation
device-mapper: uevent: version 1.0.3
device-mapper: ioctl: 4.24.6-ioctl (2013-01-15) initialised: dm-devel@redhat.com
udev: starting version 147
--
udev: renamed network interface eth0 to p3p1
EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: 
SELinux: initialized (dev sdb1, type ext4), uses xattr
EXT4-fs (sda1): warning: maximal mount count reached, running e2fsck is recommended
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: 
SELinux: initialized (dev sda1, type ext4), uses xattr
Adding 16383992k swap on /dev/sdb2.  Priority:-1 extents:1 across:16383992k 
SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts

Next, confirm the NIC's interrupt number. Because this is a single-queue NIC, there is only one: IRQ 35.

[root@localhost ~]# grep p3p1 /proc/interrupts |awk '{print $1,$NF}'
35: p3p1

With the IRQ number known, we can look up its interrupt affinity setting, "smp_affinity":

[root@localhost ~]# cat /proc/irq/35/smp_affinity
01

The 01 here is a hexadecimal bitmask that selects CPU 0. It is computed as follows (see the reference material):

         Binary       Hex 
  CPU 0    0001         1 
  CPU 1    0010         2
  CPU 2    0100         4
+ CPU 3    1000         8
  -----------------------
  all      1111         f

Note: to let all 4 CPUs handle the interrupt, set the mask to f; likewise, with 8 CPUs set it to ff:

echo ff > /proc/irq/35/smp_affinity
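
The mask arithmetic in the table is easy to get wrong by hand, so here is a small sketch that derives the hex mask from a CPU list. The function name cpus_to_mask is hypothetical, and it assumes CPU numbers below 32 (larger systems use comma-separated mask groups, which this does not handle).

```shell
# cpus_to_mask LIST: convert a CPU list such as "0-3" or "3,5" into the
# hexadecimal bitmask format used by smp_affinity.
# Hypothetical helper; assumes every CPU number is below 32.
# The body runs in a subshell so the IFS change does not leak to the caller.
cpus_to_mask() (
    mask=0
    IFS=,
    for part in $1; do
        lo=${part%-*}          # "3" -> 3, "0-7" -> 0
        hi=${part#*-}          # "3" -> 3, "0-7" -> 7
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
)
```

With this, `cpus_to_mask 0-3` prints f and `cpus_to_mask 0-7` prints ff, matching the table above.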

There is also a similar setting, "smp_affinity_list":

[root@localhost ~]# cat /proc/irq/35/smp_affinity_list 
0

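The two settings are linked: change one and the other follows. "smp_affinity_list" takes decimal CPU numbers, which is more readable than the hexadecimal mask used by "smp_affinity".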
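
To illustrate how the two representations relate, the following sketch decodes a hex mask into the comma-separated CPU list form. mask_to_cpus is a name I invented, and it assumes a single mask word without commas; unlike the kernel, it does not collapse consecutive CPUs into ranges.

```shell
# mask_to_cpus HEX: convert a hex affinity mask (e.g. "01" or "ff") into a
# comma-separated list of CPU numbers.
# Hypothetical helper; assumes a single mask word with no commas.
mask_to_cpus() (
    mask=$(( 0x$1 ))
    cpu=0
    list=
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            list="${list:+$list,}$cpu"      # append, adding a comma if needed
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    printf '%s\n' "$list"
)
```

`mask_to_cpus 01` prints 0, which agrees with the smp_affinity_list value shown above.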

With these basics in place, we can try moving the interrupt to a different CPU and see what happens:

echo 7 > /proc/irq/35/smp_affinity_list

Watching "top" again shows that the CPU handling softirqs is now CPU 7.

Note: to let several CPUs take part in interrupt handling, use syntax like the following:

echo 3,5 > /proc/irq/35/smp_affinity_list
echo 0-7 > /proc/irq/35/smp_affinity_list

The bad news is that for a single-queue NIC, listing multiple CPUs in "smp_affinity" or "smp_affinity_list" has no effect.

The good news is that Linux supports RPS (Receive Packet Steering), which, put simply, emulates a hardware multi-queue NIC in software.

First, configure RPS. With 8 CPUs the mask can be set to ff (substitute your own interface name for eth0):

shell> echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

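Next, set the kernel parameter rps_sock_flow_entries, which sizes the global flow table used by RFS (Receive Flow Steering); the kernel documentation suggests 32768: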

shell> sysctl net.core.rps_sock_flow_entries=32768

Finally, set rps_flow_cnt. For a single-queue NIC, use the same value as rps_sock_flow_entries:

echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

Note: for a multi-queue NIC, set each queue's rps_flow_cnt to rps_sock_flow_entries / N, where N is the number of receive queues.
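
The three steps above can be sketched as a dry-run generator that only prints the commands instead of running them. The function rps_plan and its argument order are my own invention, and it assumes the CPU mask should cover CPUs 0 through NCPUS-1 with fewer than 32 CPUs.

```shell
# rps_plan IFACE NCPUS NQUEUES ENTRIES: print (do not execute) the commands
# that would enable RPS on every receive queue of IFACE.
# Hypothetical helper; assumes the mask covers CPUs 0..NCPUS-1 and NCPUS < 32.
rps_plan() (
    iface=$1 ncpus=$2 nqueues=$3 entries=$4
    mask=$(printf '%x' $(( (1 << ncpus) - 1 )))   # e.g. 8 CPUs -> ff
    per_queue=$(( entries / nqueues ))            # rps_sock_flow_entries / N
    echo "sysctl net.core.rps_sock_flow_entries=$entries"
    q=0
    while [ "$q" -lt "$nqueues" ]; do
        echo "echo $mask > /sys/class/net/$iface/queues/rx-$q/rps_cpus"
        echo "echo $per_queue > /sys/class/net/$iface/queues/rx-$q/rps_flow_cnt"
        q=$(( q + 1 ))
    done
)
```

Running `rps_plan eth0 8 1 32768` prints the same three commands used above. Review the output before piping it to a root shell.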

After these changes, "top" shows that the softirq load is now spread across two CPUs:

shell> top
Cpu0: 24.8%us,  9.7%sy,  0.0%ni, 52.2%id,  ... 11.5%si,  1.8%st
Cpu1:  8.8%us,  5.1%sy,  0.0%ni, 76.5%id,  ...  7.4%si,  2.2%st
Cpu2: 17.6%us,  5.1%sy,  0.0%ni, 75.7%id,  ...  0.7%si,  0.7%st
Cpu3: 11.9%us,  7.0%sy,  0.0%ni, 80.4%id,  ...  0.7%si,  0.0%st
Cpu4: 15.4%us,  6.6%sy,  0.0%ni, 75.7%id,  ...  1.5%si,  0.7%st
Cpu5: 20.6%us,  6.9%sy,  0.0%ni, 70.2%id,  ...  1.5%si,  0.8%st
Cpu6: 12.9%us,  5.7%sy,  0.0%ni, 80.0%id,  ...  0.7%si,  0.7%st
Cpu7: 15.9%us,  5.1%sy,  0.0%ni, 77.5%id,  ...  0.7%si,  0.7%st

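Open question: in theory, with RPS set to ff, all 8 CPUs should share the softirq load, yet in practice only two do. If you know why, please let me know. Either way, two is better than one.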
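In addition, since this is an Nginx server, the "worker_cpu_affinity" directive can control which CPUs Nginx uses, so the workers can steer clear of the heavily loaded CPU, which should help performance somewhat.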
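
As a rough illustration of that idea, the sketch below prints a worker_cpu_affinity line that gives each worker one CPU and leaves out the busy one. The helper nginx_affinity is hypothetical; it relies on Nginx's binary mask syntax, where the rightmost bit stands for CPU 0, and assumes one worker per remaining CPU is what you want.

```shell
# nginx_affinity NCPUS SKIP: print nginx directives that pin one worker to
# each CPU in 0..NCPUS-1 except CPU number SKIP.
# Hypothetical helper; assumes one worker per remaining CPU.
nginx_affinity() (
    ncpus=$1 skip=$2
    masks=
    workers=0
    cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        if [ "$cpu" -ne "$skip" ]; then
            bits=
            i=$(( ncpus - 1 ))
            while [ "$i" -ge 0 ]; do        # build the mask, highest CPU first
                if [ "$i" -eq "$cpu" ]; then bits="${bits}1"; else bits="${bits}0"; fi
                i=$(( i - 1 ))
            done
            masks="$masks $bits"
            workers=$(( workers + 1 ))
        fi
        cpu=$(( cpu + 1 ))
    done
    printf 'worker_processes %d;\nworker_cpu_affinity%s;\n' "$workers" "$masks"
)
```

For instance, `nginx_affinity 4 0` yields three workers with the masks 0010 0100 1000, keeping Nginx off CPU 0.
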
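Addendum: if the server has a NUMA architecture, "numactl --cpubind" may also be useful (newer numactl releases call this option --cpunodebind).
Finally, I recommend the materials and tools on softirqs collected by 香草, which are very thorough.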