1. Background
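When an interrupt arrives from a device, the operating system suspends whatever it is doing and starts servicing that interrupt.
In some cases IRQs arrive in such rapid succession that the operating system cannot finish servicing one before the next arrives. This can happen when a high-speed network card receives a large number of packets in a short time.
Because the operating system cannot finish the work as the interrupts arrive, it queues the pending work (softirqs) to be processed later by a kernel thread named ksoftirqd/n, where n is the logical CPU number.
Each ksoftirqd/n kernel thread runs the ksoftirqd() function, which essentially executes the following loop: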
for (;;)
{
    set_current_state(TASK_INTERRUPTIBLE);
    schedule();                      // sleep until woken up
    while (local_softirq_pending())
    {
        preempt_disable();
        do_softirq();                // process the pending softirqs
        preempt_enable();
        cond_resched();              // give other tasks a chance to run
    }
}
If ksoftirqd takes up more than a small percentage of CPU time, the machine is under heavy interrupt load.
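A related signal can be read from /proc/stat by comparing the softirq counter between two samples. The following is only a minimal sketch and not part of the original troubleshooting: it assumes the usual field order of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, ...), and the helper names `parse_cpu_line`, `softirq_share` and `sample_softirq_share` are hypothetical. The softirq counter covers all softirq processing, not only the time spent inside the ksoftirqd/n threads, so treat the result as an approximation.

```python
import time


def parse_cpu_line(stat_text):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def softirq_share(before, after):
    """Fraction of CPU time spent in softirq between two samples.

    Assumption: softirq is the 7th counter (index 6). Only the first eight
    counters (user..steal) are summed, because guest time is already
    included in user/nice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return deltas[6] / total if total else 0.0


def sample_softirq_share(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return the share."""
    with open(path) as f:
        first = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.read())
    return softirq_share(first, second)
```

On a Linux machine, `print(f"{sample_softirq_share():.1%}")` prints the share over one second; a value that stays well above a few percent would be consistent with the heavy interrupt load described above.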
2. Solution
Use cat /proc/interrupts to see how many interrupts each device has raised.
For example, one day mine looked like this:
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
0: 44 0 0 0 0 0 0 0 IO-APIC 2-edge timer
1: 0 1 1 0 0 0 0 1 IO-APIC 1-edge i8042
8: 1 0 0 0 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 0 0 0 0 2 1 1 0 IO-APIC 12-edge i8042
14: 16 35619 2171878 33951 2172032 35710 2172366 35859 IO-APIC 14-edge ata_piix
15: 0 0 0 0 0 0 0 0 IO-APIC 15-edge ata_piix
16: 1492 1492 1492 38096398 1489 1492 1496 1491 IO-APIC 16-fasteoi ioc0
19: 38 38 37 38 37 39 38 38 IO-APIC 19-fasteoi radeon
20: 3 3 4 4 5 3 4 4 IO-APIC 20-fasteoi uhci_hcd:usb3, uhci_hcd:usb5
21: 4 3 4 3 5 4 4 4 IO-APIC 21-fasteoi ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb4
24: 0 0 0 0 0 0 0 0 PCI-MSI 32768-edge PCIe PME
25: 0 0 0 0 0 0 0 0 PCI-MSI 49152-edge PCIe PME
26: 0 0 0 0 0 0 0 0 PCI-MSI 65536-edge PCIe PME
27: 0 0 0 0 0 0 0 0 PCI-MSI 81920-edge PCIe PME
28: 0 0 0 0 0 0 0 0 PCI-MSI 98304-edge PCIe PME
29: 0 0 0 0 0 0 0 0 PCI-MSI 114688-edge PCIe PME
30: 0 0 0 0 0 0 0 0 PCI-MSI 458752-edge PCIe PME
31: 1380 1346 1353 1339 1339 1357 1329 1360 PCI-MSI 3670016-edge eno2
32: 85511967 85513711 85083641 83994544 85084225 85501922 85093691 85515746 PCI-MSI 1572864-edge eno1
NMI: 234694 241328 233682 235747 227371 232679 239850 230941 Non-maskable interrupts
LOC: 809615451 844980307 827224481 872760389 798336595 856256586 824716208 858800687 Local timer interrupts
SPU: 0 0 0 0 0 0 0 0 Spurious interrupts
PMI: 234694 241328 233682 235747 227371 232679 239850 230941 Performance monitoring interrupts
IWI: 234691 241323 233680 235741 227368 232677 239846 230941 IRQ work interrupts
RTR: 0 0 0 0 0 0 0 0 APIC ICR read retries
RES: 50636771 36959166 32328971 32915351 29242823 29889193 29928230 31502838 Rescheduling interrupts
CAL: 4443659 7479033 5592654 1507128 4465404 8053047 5214542 1501877 Function call interrupts
TLB: 6476601 6460251 6517362 6225272 6466684 6345076 6491077 6193616 TLB shootdowns
TRM: 0 0 0 0 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 22724 22724 22724 22724 22724 22724 22724 22724 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 0 0 0 0 Posted-interrupt notification event
PIW: 0 0 0 0 0 0 0 0 Posted-interrupt wakeup event
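Note the explosive interrupt count on IRQ 32 (eno1): about 85 million on every CPU, far more than any other device. From this we can conclude that the network card has gotten into a bad state. Restarting the interface with ifdown eno1 && ifup eno1 is enough to fix it.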
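To spot such an outlier without reading the whole table by eye, the rows can be summed and sorted. This is a minimal sketch, not part of the original fix: the function names `parse_interrupts` and `top_irqs` are hypothetical, and the parser assumes the layout shown above (a header row of CPU columns, then `label: counts... description`). The counters are cumulative since boot, so comparing two snapshots taken a few seconds apart gives a better picture of the current rate than a single reading.

```python
def parse_interrupts(text):
    """Map each IRQ label to (total count across CPUs, description)."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Rows such as ERR/MIS carry a single counter, so stop at the
        # first non-numeric field instead of assuming ncpu columns.
        for field in fields[:ncpu]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table


def top_irqs(table, n=3):
    """Return the n rows with the highest total count, largest first."""
    return sorted(table.items(), key=lambda kv: kv[1][0], reverse=True)[:n]


def read_top_irqs(n=3, path="/proc/interrupts"):
    """Convenience wrapper that reads the live file (Linux only)."""
    with open(path) as f:
        return top_irqs(parse_interrupts(f.read()), n)
```

Calling `read_top_irqs()` returns a list of `(label, (total, description))` pairs. Rows such as LOC (local timer interrupts) are normally large on any busy machine, so the device rows are the ones worth comparing.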