Hard lockup occurs due to an infinite loop encountered in distribute_cfs_runtime()


SOLUTION VERIFIED - Updated January 29, 2018 08:14 - English
Environment
Red Hat Enterprise Linux 7.3 (kernel-3.10.0-514.el7.x86_64)
Issue
Hard lockup occurs due to an infinite loop encountered in distribute_cfs_runtime()


[ 1432.242810] Kernel panic - not syncing: Hard LOCKUP
[ 1432.242829] CPU: 25 PID: 0 Comm: swapper/25 Not tainted 3.10.0-514.el7.x86_64 #1
[ 1432.242855] Hardware name: Cisco Systems Inc UCSC-C220-M4S/UCSC-C220-M4S, BIOS C220M4.2.0.3d.0.111120141447 11/11/2014
[ 1432.242891]  ffffffff818d9764 b7b1a7a2bef0fc23 ffff88105f445b18 ffffffff81685eac
[ 1432.242921]  ffff88105f445b98 ffffffff8167f2b3 0000000000000010 ffff88105f445ba8
[ 1432.242951]  ffff88105f445b48 b7b1a7a2bef0fc23 ffff88105f445ba8 ffffffff818d946a
[ 1432.242980] Call Trace:
[ 1432.242991]  <NMI>  [<ffffffff81685eac>] dump_stack+0x19/0x1b
[ 1432.243018]  [<ffffffff8167f2b3>] panic+0xe3/0x1f2
[ 1432.243039]  [<ffffffff8108562f>] nmi_panic+0x3f/0x40
[ 1432.243059]  [<ffffffff8112f0e6>] watchdog_overflow_callback+0xf6/0x100
[ 1432.243085]  [<ffffffff8117465e>] __perf_event_overflow+0x8e/0x1f0
[ 1432.243108]  [<ffffffff811752a4>] perf_event_overflow+0x14/0x20
[ 1432.243132]  [<ffffffff81009d88>] intel_pmu_handle_irq+0x1f8/0x4e0
[ 1432.243156]  [<ffffffff81319d7c>] ? ioremap_page_range+0x27c/0x3e0
[ 1432.243179]  [<ffffffff811bedf4>] ? vunmap_page_range+0x1c4/0x310
[ 1432.243202]  [<ffffffff811bef51>] ? unmap_kernel_range_noflush+0x11/0x20
[ 1432.243227]  [<ffffffff813c93d4>] ? ghes_copy_tofrom_phys+0x124/0x210
[ 1432.243252]  [<ffffffff813c9560>] ? ghes_read_estatus+0xa0/0x190
[ 1432.243275]  [<ffffffff8168daeb>] perf_event_nmi_handler+0x2b/0x50
[ 1432.243298]  [<ffffffff8168ef19>] nmi_handle.isra.0+0x69/0xb0
[ 1432.243320]  [<ffffffff8168f093>] do_nmi+0x133/0x410
[ 1432.243339]  [<ffffffff8168e353>] end_repeat_nmi+0x1e/0x2e
[ 1432.243360]  [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50
[ 1432.243381]  [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50
[ 1432.243402]  [<ffffffff8168d812>] ? _raw_spin_lock+0x32/0x50
[ 1432.243422]  <<EOE>>  <IRQ>  [<ffffffff810d100b>] unthrottle_cfs_rq+0x4b/0x170
[ 1432.243453]  [<ffffffff810d12e2>] distribute_cfs_runtime+0xf2/0x100
[ 1432.243476]  [<ffffffff810d147f>] sched_cfs_period_timer+0xcf/0x160
[ 1432.243499]  [<ffffffff810d13b0>] ? sched_cfs_slack_timer+0xc0/0xc0
[ 1432.243523]  [<ffffffff810b4862>] __hrtimer_run_queues+0xd2/0x260
[ 1432.243546]  [<ffffffff810b4e00>] hrtimer_interrupt+0xb0/0x1e0
[ 1432.243569]  [<ffffffff810510d7>] local_apic_timer_interrupt+0x37/0x60
[ 1432.243594]  [<ffffffff81698bcf>] smp_apic_timer_interrupt+0x3f/0x60
[ 1432.243617]  [<ffffffff8169711d>] apic_timer_interrupt+0x6d/0x80
[ 1432.243638]  <EOI>  [<ffffffff81513f52>] ? cpuidle_enter_state+0x52/0xc0
[ 1432.243665]  [<ffffffff81514099>] cpuidle_idle_call+0xd9/0x210
[ 1432.243688]  [<ffffffff8103516e>] arch_cpu_idle+0xe/0x30
[ 1432.243709]  [<ffffffff810e7c95>] cpu_startup_entry+0x245/0x290
[ 1432.243732]  [<ffffffff8104f12a>] start_secondary+0x1ba/0x230




Resolution
The issue has been fixed in kernel-3.10.0-693.el7.x86_64 via https://access.redhat.com/errata/RHSA-2017:1842.
Root Cause
There is a bug, fixed by a patch from upstream commit c06f04c70489b9deea3212af8375e2f0c2f0b184 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop").

distribute_cfs_runtime() intentionally only hands out enough runtime to bring each cfs_rq to 1 ns of runtime, expecting the cfs_rqs to then take the runtime they need only once they actually get to run. However, if they get to run sufficiently quickly, the period timer is still in distribute_cfs_runtime() and no runtime is available, causing them to throttle. Then distribute has to handle them again, and this can go on until distribute has handed out all of the runtime 1ns at a time, which takes far too long.
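To make the failure mode concrete, below is a minimal userspace simulation of that feedback loop. This is an illustration only, not kernel code; the initial per-runqueue deficit and the assumption that every cfs_rq runs and re-throttles 1 ns short before the next distribution pass are hypothetical worst-case inputs modeled on the numbers seen in this vmcore.

#include <stdio.h>

#define NR_CFS_RQ 28                  /* throttled runqueues, as in this vmcore */

int main(void)
{
        long long remaining = 338097418;   /* runtime left to distribute (ns), from the vmcore */
        long long deficit[NR_CFS_RQ];
        long long passes = 0, handouts = 0;
        int i;

        for (i = 0; i < NR_CFS_RQ; i++)
                deficit[i] = 1690;         /* representative initial deficit (ns) */

        /* Mirrors the do_sched_cfs_period_timer() loop: keep distributing
         * while runqueues are throttled and runtime remains. */
        while (remaining > 0) {
                for (i = 0; i < NR_CFS_RQ && remaining > 0; i++) {
                        long long runtime = deficit[i] + 1;   /* top up to 1 ns */
                        if (runtime > remaining)
                                runtime = remaining;
                        remaining -= runtime;
                        handouts++;
                        /* worst case: the task runs immediately, finds no
                         * global runtime, and re-throttles 1 ns in deficit */
                        deficit[i] = 1;
                }
                passes++;
        }
        printf("%lld passes, %lld individual hand-outs to drain the pool\n",
               passes, handouts);
        return 0;
}

With these inputs the pool drains only about 56 ns per pass after the first iteration, i.e. millions of passes inside a single hrtimer callback with interrupts disabled, which is what trips the hard-lockup watchdog.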

Diagnostic Steps
System Information:

crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux
    DUMPFILE: /cores/retrace/tasks/702709706/crash/vmcore  [PARTIAL DUMP]
        CPUS: 32
        DATE: Mon Nov 14 22:38:09 2016
      UPTIME: 00:23:52
LOAD AVERAGE: 4.69, 1.36, 0.50
       TASKS: 397
    NODENAME: somehost
     RELEASE: 3.10.0-514.el7.x86_64
     VERSION: #1 SMP Wed Oct 19 11:24:13 EDT 2016
     MACHINE: x86_64  (2400 Mhz)
      MEMORY: 63.9 GB
       PANIC: "Kernel panic - not syncing: Hard LOCKUP"
Backtrace of panic task:
crash> bt
PID: 0      TASK: ffff8808fce38fb0  CPU: 25  COMMAND: "swapper/25"
 #0 [ffff88105f4459f0] machine_kexec at ffffffff81059cdb
 #1 [ffff88105f445a50] __crash_kexec at ffffffff81105182
 #2 [ffff88105f445b20] panic at ffffffff8167f2ba
 #3 [ffff88105f445ba0] nmi_panic at ffffffff8108562f
 #4 [ffff88105f445bb0] watchdog_overflow_callback at ffffffff8112f0e6
 #5 [ffff88105f445bc8] __perf_event_overflow at ffffffff8117465e
 #6 [ffff88105f445c00] perf_event_overflow at ffffffff811752a4
 #7 [ffff88105f445c10] intel_pmu_handle_irq at ffffffff81009d88
 #8 [ffff88105f445e48] perf_event_nmi_handler at ffffffff8168daeb
 #9 [ffff88105f445e68] nmi_handle at ffffffff8168ef19
#10 [ffff88105f445eb0] do_nmi at ffffffff8168f093
#11 [ffff88105f445ef0] end_repeat_nmi at ffffffff8168e353
    [exception RIP: _raw_spin_lock+50]
    RIP: ffffffff8168d812  RSP: ffff88105f443e18  RFLAGS: 00000012
    RAX: 0000000000007f72  RBX: ffff88105aab1b00  RCX: 0000000000003f34
    RDX: 0000000000003f42  RSI: 0000000000003f42  RDI: ffff881058a79d48
    RBP: ffff88105f443e18   R8: 0000000000800008   R9: 0000000000000001
    R10: 0000000000018695  R11: 0000000000000000  R12: ffff8810537cd400
    R13: ffff88105f2d6c40  R14: ffff881058a79c00  R15: ffff881058a79d48
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#12 [ffff88105f443e18] _raw_spin_lock at ffffffff8168d812
#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
#15 [ffff88105f443ea0] sched_cfs_period_timer at ffffffff810d147f
#16 [ffff88105f443ed8] __hrtimer_run_queues at ffffffff810b4862
#17 [ffff88105f443f30] hrtimer_interrupt at ffffffff810b4e00
#18 [ffff88105f443f80] local_apic_timer_interrupt at ffffffff810510d7
#19 [ffff88105f443f98] smp_apic_timer_interrupt at ffffffff81698bcf
#20 [ffff88105f443fb0] apic_timer_interrupt at ffffffff8169711d
--- <IRQ stack> ---
#21 [ffff8808fce47da8] apic_timer_interrupt at ffffffff8169711d
    [exception RIP: cpuidle_enter_state+82]
    RIP: ffffffff81513f52  RSP: ffff8808fce47e50  RFLAGS: 00000206
    RAX: 0000014aa48e9400  RBX: 000000000000f8a0  RCX: 0000000000000018
    RDX: 0000000225c17d03  RSI: ffff8808fce47fd8  RDI: 0000014aa48e9400
    RBP: ffff8808fce47e78   R8: 000000000000608f   R9: 0000000000000018
    R10: 0000000000018695  R11: 0000000000000000  R12: ffff8808fce47e20
    R13: ffff88105f44f8e0  R14: 0000000000000082  R15: ffff88105f44f8e0
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#22 [ffff8808fce47e80] cpuidle_idle_call at ffffffff81514099
#23 [ffff8808fce47ec0] arch_cpu_idle at ffffffff8103516e
#24 [ffff8808fce47ed0] cpu_startup_entry at ffffffff810e7c95
#25 [ffff8808fce47f28] start_secondary at ffffffff8104f12a


Analysis:
This unthrottle_cfs_rq() call was initiated from sched_cfs_period_timer(), which was spinning in the following while loop (do_sched_cfs_period_timer() is inlined into sched_cfs_period_timer() in this kernel, which is why the disassembly below resolves to that symbol):

kernel/sched/fair.c 

   3467 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
   3468 {

   [...]

   3519         while (throttled && runtime > 0) {
   3520                 raw_spin_unlock(&cfs_b->lock);
   3521                 /* we can't nest cfs_b->lock while distributing bandwidth */
   3522                 runtime = distribute_cfs_runtime(cfs_b, runtime,
   3523                                                  runtime_expires);
   3524                 raw_spin_lock(&cfs_b->lock);
   3525 
   3526                 throttled = !list_empty(&cfs_b->throttled_cfs_rq);
   3527         }

   [...]

crash> dis -lr ffffffff810d147f | tail
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3520
0xffffffff810d1469 <sched_cfs_period_timer+185>:    mov    %r12,%rdi
0xffffffff810d146c <sched_cfs_period_timer+188>:    callq  0xffffffff8168d710 <_raw_spin_unlock>
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3522
0xffffffff810d1471 <sched_cfs_period_timer+193>:    mov    %r13,%rsi        <<<<<<<< runtime
0xffffffff810d1474 <sched_cfs_period_timer+196>:    mov    %r15,%rdx        <<<<<<<< runtime_expires
0xffffffff810d1477 <sched_cfs_period_timer+199>:    mov    %r12,%rdi        <<<<<<<< cfs_b
0xffffffff810d147a <sched_cfs_period_timer+202>:    callq  0xffffffff810d11f0 <distribute_cfs_runtime>
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3524

   3423 static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
   3424                 u64 remaining, u64 expires)
   3425 {

crash> dis -lr ffffffff810d12e2 | head
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3425
0xffffffff810d11f0 <distribute_cfs_runtime>:    nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d11f5 <distribute_cfs_runtime+5>:  push   %rbp
0xffffffff810d11f6 <distribute_cfs_runtime+6>:  mov    %rsp,%rbp
0xffffffff810d11f9 <distribute_cfs_runtime+9>:  push   %r15        <<<<<<<< runtime_expires
0xffffffff810d11fb <distribute_cfs_runtime+11>: push   %r14
0xffffffff810d11fd <distribute_cfs_runtime+13>: mov    %rdx,%r14
0xffffffff810d1200 <distribute_cfs_runtime+16>: push   %r13        <<<<<<<< runtime
0xffffffff810d1202 <distribute_cfs_runtime+18>: push   %r12        <<<<<<<< cfs_b
0xffffffff810d1204 <distribute_cfs_runtime+20>: mov    %rsi,%r12

#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2
    ffff88105f443e60: ffff8810537cd500 b7b1a7a2bef0fc23 
    ffff88105f443e70: ffff881058a79d80 ffff881058a79d48 
                                          5th push
    ffff88105f443e80: 00000000bebc2000 ffff881058a79e40
                         4th push
    ffff88105f443e90: 0000014d2c58e112 ffff88105f443ed0 
                         2nd push
    ffff88105f443ea0: ffffffff810d147f 
#15 [ffff88105f443ea0] sched_cfs_period_timer at ffffffff810d147f


cfs_b             is ffff881058a79d48
runtime           is 00000000bebc2000 (= 3200000000 ns)
runtime_expires   is 0000014d2c58e112 (= 1430968131858 ns)
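As a cross-check, the pool this runtime was snapshotted from can be read directly out of the cfs_bandwidth structure; period, quota and runtime are fields of struct cfs_bandwidth in this kernel, so a command of the following form (output omitted here) would show the group's configured bandwidth alongside the remaining pool:

crash> cfs_bandwidth.period,quota,runtime ffff881058a79d48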

The loop keeps calling distribute_cfs_runtime() as long as runtime remains and at least one cfs_rq is still on the throttled list.

Which cfs_rq was being unthrottled when the NMI fired? The call site and the stack identify it:

crash> dis -lr ffffffff810d12e2 | tail
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3447
0xffffffff810d12ce <distribute_cfs_runtime+222>:    test   %rdx,%rdx
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3443
0xffffffff810d12d1 <distribute_cfs_runtime+225>:    mov    %rdx,0xd8(%r15)
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3447
0xffffffff810d12d8 <distribute_cfs_runtime+232>:    jle    0xffffffff810d126f <distribute_cfs_runtime+127>
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3448
0xffffffff810d12da <distribute_cfs_runtime+234>:    mov    %r15,%rdi
0xffffffff810d12dd <distribute_cfs_runtime+237>:    callq  0xffffffff810d0fc0 <unthrottle_cfs_rq>
0xffffffff810d12e2 <distribute_cfs_runtime+242>:    jmp    0xffffffff810d126f <distribute_cfs_runtime+127>

   3443                 cfs_rq->runtime_remaining += runtime;
   3444                 cfs_rq->runtime_expires = expires;
   3445 
   3446                 /* we check whether we're throttled above */
   3447                 if (cfs_rq->runtime_remaining > 0)
   3448                         unthrottle_cfs_rq(cfs_rq);

crash> dis -lr ffffffff810d100b | head
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3379
0xffffffff810d0fc0 <unthrottle_cfs_rq>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff810d0fc5 <unthrottle_cfs_rq+5>:   push   %rbp
0xffffffff810d0fc6 <unthrottle_cfs_rq+6>:   mov    %rsp,%rbp
0xffffffff810d0fc9 <unthrottle_cfs_rq+9>:   push   %r15
0xffffffff810d0fcb <unthrottle_cfs_rq+11>:  push   %r14
0xffffffff810d0fcd <unthrottle_cfs_rq+13>:  push   %r13
0xffffffff810d0fcf <unthrottle_cfs_rq+15>:  push   %r12
0xffffffff810d0fd1 <unthrottle_cfs_rq+17>:  mov    %rdi,%r12
0xffffffff810d0fd4 <unthrottle_cfs_rq+20>:  push   %rbx

   3378 void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
   3379 {
   3380         struct rq *rq = rq_of(cfs_rq);
   3381         struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
   3382         struct sched_entity *se;
   3383         int enqueue = 1;
   3384         long task_delta;

#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
    ffff88105f443e28: ffff88105f2d6c40 000000001426f50a 
    ffff88105f443e38: ffff881058a79e40 0000014d2c58e112 
    ffff88105f443e48: ffff8810537cd400 ffff88105f443e98 
                         2nd push
    ffff88105f443e58: ffffffff810d12e2 
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2

cfs_rq is ffff8810537cd400

How much runtime_remaining did this cfs_rq have left, and how many runqueues were throttled?

crash> cfs_rq.runtime_remaining,runtime_expires ffff8810537cd400
  runtime_remaining = 1
  runtime_expires = 1430968131858
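Note that runtime_remaining is exactly 1: distribute_cfs_runtime() tops each throttled cfs_rq up to just 1 ns (fair.c:3438, shown below), so this cfs_rq had just been replenished when the NMI fired.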

crash> cfs_rq.throttled_list ffff8810537cd400 -ox
struct cfs_rq {
  [ffff8810537cd500] struct list_head throttled_list;
}

crash> list -H ffff8810537cd500 -o cfs_rq.throttled_list -s cfs_rq.runtime_remaining |grep -c runtime
28

crash> list -H ffff8810537cd500 -o cfs_rq.throttled_list -s cfs_rq.throttled,runtime_remaining
ffff8810537cd600
  throttled = 1
  runtime_remaining = -1690
ffff8810537cdc00
  throttled = 1
  runtime_remaining = -1722
ffff8810537ce000
  throttled = 1
  runtime_remaining = -1616
ffff8810537ce600
  throttled = 1
  runtime_remaining = -1688
ffff8810537cc800
  throttled = 1
  runtime_remaining = -1820
ffff8810537cde00
  throttled = 1
  runtime_remaining = -1757
ffff8810537cf600
  throttled = 1
  runtime_remaining = -1848
ffff8800369ed400
  throttled = 1
  runtime_remaining = -2394
ffff8810537cc600
  throttled = 1
  runtime_remaining = -1758
ffff8810537cd200
  throttled = 1
  runtime_remaining = -1814
ffff8800369ee800
  throttled = 1
  runtime_remaining = -2395
ffff8810537cc200
  throttled = 1
  runtime_remaining = -1759
ffff8810537cf400
  throttled = 1
  runtime_remaining = -1682
ffff8810537cfa00
  throttled = 1
  runtime_remaining = -1912
ffff8810537cca00
  throttled = 1
  runtime_remaining = -1597
ffff8800369ec000
  throttled = 1
  runtime_remaining = -15057
ffff8800369eee00
  throttled = 1
  runtime_remaining = -14083
ffff8800369ed000
  throttled = 1
  runtime_remaining = -4009
ffff8800369ede00
  throttled = 1
  runtime_remaining = -5090
ffff8800369ed800
  throttled = 1
  runtime_remaining = -14091
ffff8800369ee000
  throttled = 1
  runtime_remaining = -3746
ffff8800369ee600
  throttled = 1
  runtime_remaining = -15879
ffff8800369efe00
  throttled = 1
  runtime_remaining = -3457
ffff8800369eda00
  throttled = 1
  runtime_remaining = -3742
ffff8800369eec00
  throttled = 1
  runtime_remaining = -4032
ffff8800369ee400
  throttled = 1
  runtime_remaining = -3386
ffff8800369ed600
  throttled = 1
  runtime_remaining = -3530
ffff881058a79d40
  throttled = 48
  runtime_remaining = 0

crash> pd 1690 + 1722 + 1616 + 1688 + 1820 + 1757 + 1848 + 2394 + 1758 + 1814 + 2395 + 1759 + 1682 + 1912 + 1597 + 15057 + 14083 + 4009 + 5090 + 14091 + 3746 + 15879 + 3457 + 3742 + 4032 + 3386 + 3530
$1 = 117554

As shown above, the total runtime deficit across the 28 throttled runqueues was only 117554 ns. (The final list entry, ffff881058a79d40, is the cfs_bandwidth's throttled_cfs_rq list head itself rather than a real cfs_rq, which is why its fields look bogus.)

However, the remaining runtime still held in the local variable inside distribute_cfs_runtime() was 338097418 ns:

/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3438
0xffffffff810d12be <distribute_cfs_runtime+206>:        sub    %rcx,%rdx
0xffffffff810d12c1 <distribute_cfs_runtime+209>:        cmp    %rdx,%r12
0xffffffff810d12c4 <distribute_cfs_runtime+212>:        cmovbe %r12,%rdx    <<<<<<<< if (runtime > remaining) runtime = remaining
/usr/src/debug/kernel-3.10.0-514.el7/linux-3.10.0-514.el7.x86_64/kernel/sched/fair.c: 3441
0xffffffff810d12c8 <distribute_cfs_runtime+216>:        sub    %rdx,%r12    <<<<<<<< remaining -= runtime (%r12 = remaining)

   3438                 runtime = -cfs_rq->runtime_remaining + 1;
   3439                 if (runtime > remaining)
   3440                         runtime = remaining;
   3441                 remaining -= runtime;
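Plugging one of the observed deficits into these three lines shows how little a single hand-out takes from the pool. This is a standalone arithmetic check using values from this vmcore, not kernel code:

#include <stdio.h>

int main(void)
{
        long long runtime_remaining = -1690;         /* one observed deficit (ns)         */
        long long remaining = 338097418;             /* distribute's local remaining (ns) */

        long long runtime = -runtime_remaining + 1;  /* fair.c:3438 - top up to exactly 1 ns */
        if (runtime > remaining)                     /* fair.c:3439-3440 */
                runtime = remaining;
        remaining -= runtime;                        /* fair.c:3441 */

        /* prints: handed out 1691 ns, 338095727 ns left to distribute */
        printf("handed out %lld ns, %lld ns left to distribute\n",
               runtime, remaining);
        return 0;
}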

#13 [ffff88105f443e20] unthrottle_cfs_rq at ffffffff810d100b
    ffff88105f443e28: ffff88105f2d6c40 000000001426f50a <<<<<<<< %r12 (remaining runtime)
    ffff88105f443e38: ffff881058a79e40 0000014d2c58e112 
    ffff88105f443e48: ffff8810537cd400 ffff88105f443e98 
    ffff88105f443e58: ffffffff810d12e2 
#14 [ffff88105f443e58] distribute_cfs_runtime at ffffffff810d12e2

crash> pd 0x000000001426f50a
$2 = 338097418

This value is far larger than the total deficit across the throttled cfs_rqs, and the throttled_cfs_rq list is still not empty:

crash> cfs_bandwidth.throttled_cfs_rq ffff881058a79d48
  throttled_cfs_rq = {
    next = 0xffff8810537cd500, 
    prev = 0xffff8800369ed700
  }

crash> list 0xffff8810537cd500 | wc -l
29

Consequently, the kernel was stuck in the while loop and unable to break out: the throttled list never emptied because cfs_rqs kept re-throttling as fast as they were topped up, and the local runtime stayed positive because it was far larger than the tiny amounts (deficit + 1 ns) handed out on each pass.

This matches the bug fixed by the patch from upstream commit c06f04c70489b9deea3212af8375e2f0c2f0b184.
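For reference, the shape of the upstream fix: distribute_cfs_runtime() is changed to return the amount it actually handed out, and do_sched_cfs_period_timer() re-reads and decrements the shared cfs_b->runtime on each pass, so cfs_rqs that wake up during distribution can draw on the same pool instead of instantly re-throttling. A paraphrased sketch of the fixed loop follows (not the literal diff; see the commit for the exact change):

        /* do_sched_cfs_period_timer(), after commit c06f04c7 (paraphrased) */
        while (throttled && cfs_b->runtime > 0) {
                runtime = cfs_b->runtime;
                raw_spin_unlock(&cfs_b->lock);
                /* we can't nest cfs_b->lock while distributing bandwidth */
                runtime = distribute_cfs_runtime(cfs_b, runtime,
                                                 runtime_expires);
                raw_spin_lock(&cfs_b->lock);

                throttled = !list_empty(&cfs_b->throttled_cfs_rq);

                /* runtime now holds the amount actually distributed */
                cfs_b->runtime -= min(runtime, cfs_b->runtime);
        }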

Source: https://access.redhat.com/solutions/2786591
