转自: https://bugs.centos.org/view.php?id=14073


RHEL7: kernel crash in xfs_vm_writepage - kernel BUG at fs/xfs/xfs_aops.c:1062!

 SOLUTION VERIFIED - Updated August 17 2017 at 3:52 PM - 

English 

Environment

  • Red Hat Enterprise Linux (RHEL) 7.3

    • kernel prior to kernel-3.10.0-514.13.1.el7

  • Seen with

    • Java JVM's 'hsperf' memory mapped writeable 'hsperfdata' files on xfs

    • Samba while accessing its profile database (/var/lib/samba/smbprofile.tdb)

Issue

  • The systems (docker hosts) are crashing in xfs_vm_writepage()

  • spontaneous restart while running docker/kubernetes

  • kernel paniced due to BUG at fs/xfs/xfs_aops.c:1062!

Raw

[1004630.854317] kernel BUG at fs/xfs/xfs_aops.c:1062!
[1004630.854894] invalid opcode: 0000 [#1] SMP 
[1004630.855450] Modules linked in: veth br_netfilter bridge stp llc vxlan ip6_udp_tunnel udp_tunnel mptctl mptbase xt_statistic binfmt_misc bonding dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ipt_REJECT nf_reject_ipv4 xt_conntrack iptable_filter xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack vfat fat intel_powerclamp coretemp intel_rapl iosf_mbi ipmi_devintf iTCO_wdt kvm_intel iTCO_vendor_support kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr sb_edac edac_core i2c_i801 lpc_ich sg hpilo hpwdt ioatdma shpchp dca wmi ipmi_si ipmi_msghandler pcc_cpufreq acpi_power_meter ip_tables xfs sd_mod mgag200 i2c_algo_bit drm_kms_helper
[1004630.859266]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i2c_core lpfc crc32c_intel bnx2x crc_t10dif crct10dif_generic hpsa crct10dif_pclmul scsi_transport_fc mdio scsi_tgt ptp scsi_transport_sas crct10dif_common pps_core libcrc32c fjes dm_mirror dm_region_hash dm_log dm_mod
[1004630.861333] CPU: 6 PID: 56715 Comm: kworker/u48:4 Tainted: G        W      ------------   3.10.0-514.el7.x86_64 #1
[1004630.862046] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 12/28/2015
[1004630.862703] Workqueue: writeback bdi_writeback_workfn (flush-253:28)
[1004630.863414] task: ffff881f8436de20 ti: ffff881f23a4c000 task.ti: ffff881f23a4c000
[1004630.864117] RIP: 0010:[<ffffffffa083f2fb>]  [<ffffffffa083f2fb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
[1004630.864860] RSP: 0018:ffff881f23a4f948  EFLAGS: 00010246
[1004630.865749] RAX: 002fffff00040029 RBX: ffff881bedd50308 RCX: 000000000000000c
[1004630.866466] RDX: 0000000000000008 RSI: ffff881f23a4fc40 RDI: ffffea00296b7800
[1004630.867218] RBP: ffff881f23a4f9f0 R08: fffffffffffffffe R09: 000000000001a098
[1004630.867941] R10: ffff88207ffd6000 R11: 0000000000000000 R12: ffff881bedd50308
[1004630.868656] R13: ffff881f23a4fc40 R14: ffff881bedd501b8 R15: ffffea00296b7800
[1004630.869399] FS:  0000000000000000(0000) GS:ffff881fff180000(0000) knlGS:0000000000000000
[1004630.870147] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1004630.870868] CR2: 0000000000eb3d30 CR3: 0000001ff79dc000 CR4: 00000000001407e0
[1004630.871610] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1004630.872349] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[1004630.873072] Stack:
[1004630.873749]  0000000000008000 ffff880070b03644 ffff881f23a4fc40 ffff881f23a4fa68
[1004630.874480]  ffff881f23a4fa80 ffffea00296b7800 0000000000001000 000000000000000e
[1004630.875223]  0000000000001000 ffffffff81180981 0000000000000000 ffff881bedd50310
[1004630.875957] Call Trace:
[1004630.876665]  [<ffffffff81180981>] ? find_get_pages_tag+0xe1/0x1a0
[1004630.877417]  [<ffffffff8118b3b3>] __writepage+0x13/0x50
[1004630.878173]  [<ffffffff8118bed1>] write_cache_pages+0x251/0x4d0
[1004630.878915]  [<ffffffffa00c170a>] ? enqueue_cmd_and_start_io+0x3a/0x40 [hpsa]
[1004630.879626]  [<ffffffff8118b3a0>] ? global_dirtyable_memory+0x70/0x70
[1004630.880368]  [<ffffffff8118c19d>] generic_writepages+0x4d/0x80
[1004630.881157]  [<ffffffffa083e063>] xfs_vm_writepages+0x53/0x90 [xfs]
[1004630.881907]  [<ffffffff8118d24e>] do_writepages+0x1e/0x40
[1004630.882643]  [<ffffffff81228730>] __writeback_single_inode+0x40/0x210
[1004630.883403]  [<ffffffff8122941e>] writeback_sb_inodes+0x25e/0x420
[1004630.884141]  [<ffffffff8122967f>] __writeback_inodes_wb+0x9f/0xd0
[1004630.884863]  [<ffffffff81229ec3>] wb_writeback+0x263/0x2f0   
[1004630.885610]  [<ffffffff810ab776>] ? set_worker_desc+0x86/0xb0
[1004630.886378]  [<ffffffff8122bd05>] bdi_writeback_workfn+0x115/0x460
[1004630.887142]  [<ffffffff810c4cf8>] ? try_to_wake_up+0x1c8/0x330
[1004630.887875]  [<ffffffff810a7f3b>] process_one_work+0x17b/0x470
[1004630.888638]  [<ffffffff810a8d76>] worker_thread+0x126/0x410   
[1004630.889389]  [<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
[1004630.890126]  [<ffffffff810b052f>] kthread+0xcf/0xe0
[1004630.890816]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[1004630.891521]  [<ffffffff81696418>] ret_from_fork+0x58/0x90
[1004630.892229]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[1004630.892877] Code: e0 80 3d 4d b4 06 00 00 0f 85 a4 fe ff ff be d7 03 00 00 48 c7 c7 4a e0 88 a0 e8 61 66 84 e0 c6 05 2f b4 06 00 01 e9 87 fe ff ff <0f> 0b 8b 4d a4 e9 e8 fb ff ff 41 b9 01 00 00 00 e9 69 fd ff ff 
[1004630.894245] RIP  [<ffffffffa083f2fb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
[1004630.894890]  RSP <ffff881f23a4f948>

Resolution

Red Hat Enterprise Linux 7

Red Hat Enterprise Linux 7.3.z (Z-STREAM)

Workaround

  • If using 'hsperf' in Java JVMs, disable this feature

Root Cause

The sequence starts with a typical dirty page with dirty buffers. The following race occurs between writeback and a userspace page fault to
redirty the page after writeback cleans it, but before XFS cleans the page buffers.

Raw

writeback                   task w/ mmap()                  "shrink_active_list()"
---------                   --------------                  ----------------------

write_cache_pages()
  lock_page()
  clear_page_dirty_for_io() (page cleaned)
  xfs_vm_writepage()
    xfs_start_page_writeback()
      unlock_page()
                        ... PTE for mapping dirtied ...
                        do_fault()
                          xfs_filemap_page_mkwrite()
                            ...
                            xfs_vm_set_page_dirty() (page+buffers redirtied)
    ...
    xfs_submit_ioend()
      xfs_start_buffer_writeback()
        clear_buffer_dirty() (page dirty, buffers clean)
 ...

At this point, writeback can complete on the page. The page is still dirty from the racing fault, and thus writeback will occur again for
this page, but the buffers have been marked clean by XFS. It is important to note that the PTE for the userspace mapping is also dirty,
as the fault occurred after clear_page_dirty_for_io() would have cleaned the PTE (via page_mkclean()).

The following 3-way race can now occur to remove the buffers, redirty the page and ultimately cause writeback to explode:

Raw


                                               lock_page()                                                ...                                                try_to_release_page()                                                  try_to_free_buffers()                                                    drop_buffers()                                                    cancel_dirty_page()                                                unlock_page()                                                (page cleaned, buffers removed)                        do_munmap()                          ...                          zap_pte_range()                            xfs_vm_set_page_dirty()                        (page dirtied, no buffers) write_cache_pages()  lock_page()  clear_page_dirty_for_io()  xfs_vm_writepage()    page_buffers()      BUG_ON(!PagePrivate())

Essentially, shrink_active_list() can clean out the buffers at any time because they are not locked or dirty. In doing so, it also cleans the
page. This would normally prevent writeback from proceeding with the page the next time around. If the mapping that dirtied the page in the first place is unmapped before writeback, however, it redirties the page even though the buffers have been removed due to the dirty PTE. writepage() now proceeds with a dirty page without buffers and explodes.

Diagnostic Steps

Simple analysis (from kernel oops)

  1. Check for an affected kernel version from the Environment section

  2. Look for a line in the oops kernel BUG at fs/xfs/xfs_aops.c:1062!. Note that the specific code should be as follows

Raw

[1004630.854317] kernel BUG at fs/xfs/xfs_aops.c:1062!

include/linux/buffer_head.h
    137 /* If we *know* page->private refers to buffer_heads */
    138 #define page_buffers(page)                                      \
    139         ({                                                      \
    140                 BUG_ON(!PagePrivate(page));                     \
    141                 ((struct buffer_head *)page_private(page));     \
    142         })

fs/xfs/xfs_aops.c 
1046        /*
1047         * The page straddles i_size.  It must be zeroed out on each
1048         * and every writepage invocation because it may be mmapped.
1049         * "A file is mapped in multiples of the page size.  For a file
1050         * that is not a multiple of the page size, the remaining
1051         * memory is zeroed when mapped, and writes to that region are
1052         * not written out to the file."
1053         */
1054        zero_user_segment(page, offset_into_page, PAGE_CACHE_SIZE);
1055 
1056        /* Adjust the end_offset to the end of file */
1057        end_offset = offset;
1058    }
1059 
1060    len = 1 << inode->i_blkbits;
1061 
1062--> bh = head = page_buffers(page);
1063    offset = page_offset(page);
1064    type = XFS_IO_OVERWRITE;
1065 
1066    if (wbc->sync_mode == WB_SYNC_NONE)
1067        nonblocking = 1;
1068