Preface
The kernel prints a flood of "AMD-Vi: Completion-Wait loop timed out" messages, accompanied by soft lockups or RCU CPU stalls, as shown below:
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:17 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:18 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:19 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:20 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:21 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: AMD-Vi: Completion-Wait loop timed out
Dec 8 10:02:22 kernel: watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [swapper/6:0]
Dec 8 10:02:22 kernel: CPU: 46 PID: 0 Comm: swapper/46 Tainted: G L 5.10.128 2
Dec 8 10:02:22 kernel: RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
Dec 8 10:02:22 kernel: Call Trace:
Dec 8 10:02:22 kernel: <IRQ>
Dec 8 10:02:22 kernel: amd_iommu_flush_iotlb_all+0x4e/0x60
Dec 8 10:02:22 kernel: iommu_dma_flush_iotlb_all+0x1d/0x20
Dec 8 10:02:22 kernel: iova_domain_flush+0x1e/0x30
Dec 8 10:02:22 kernel: fq_flush_timeout+0x39/0xb0
Dec 8 10:02:22 kernel: ? fq_ring_free+0x110/0x110
Dec 8 10:02:22 kernel: call_timer_fn+0x2e/0x100
Dec 8 10:02:22 kernel: __run_timers.part.0+0x1de/0x260
Dec 8 10:02:22 kernel: ? clockevents_program_event+0x8f/0xe0
Dec 8 10:02:22 kernel: ? tick_program_event+0x41/0x80
Dec 8 10:02:22 kernel: run_timer_softirq+0x2a/0x50
Dec 8 10:02:22 kernel: __do_softirq+0xce/0x281
Dec 8 10:02:22 kernel: asm_call_irq_on_stack+0x12/0x20
Dec 8 10:02:22 kernel: </IRQ>
Dec 8 10:02:22 kernel: do_softirq_own_stack+0x3d/0x50
Dec 8 10:02:22 kernel: irq_exit_rcu+0xc5/0x100
Dec 8 10:02:22 kernel: sysvec_apic_timer_interrupt+0x3d/0x90
Dec 8 10:02:22 kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Dec 8 10:02:22 kernel: RIP: 0010:native_safe_halt+0xe/0x10
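Diligent readers will probably find the following link quickly with a Google search:
It does not explain, however, why the machine hits a soft lockup, or why the soft lockup keeps occurring on the same CPU.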
Where the "timed out" log comes from
COMPLETION_WAIT is a command defined by the AMD IOMMU architecture. See section 2.4.1 COMPLETION_WAIT of its spec:
The COMPLETION_WAIT command allows software to serialize itself with IOMMU command processing. The COMPLETION_WAIT command does not finish until all older commands issued since a prior COMPLETION_WAIT have completely executed.
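The command's description explains how software can tell whether the command has completed:
When the command completes, the IOMMU writes cmd.store_data to cmd.store_addr. See the code: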
5.10.128
iommu_completion_wait()
---
data = ++iommu->cmd_sem_val;
build_completion_wait(&cmd, iommu, data);
ret = __iommu_queue_command_sync(iommu, &cmd, false);
if (ret)
goto out_unlock;
ret = wait_on_sem(iommu, data);