I have recently been involved in diagnosing why the OOM killer was being invoked and killing the MySQL server process. These servers were primarily running MySQL, so the MySQL server process was naturally the one with the largest memory allocation.
But the strange thing was that in all the cases there was no swapping activity and there were plenty of pages in the page cache. All of these servers were running CentOS 6.4 with kernel version 2.6.32-358. Another commonality was that vm.swappiness was set to 0, which is pretty much standard practice, applied on nearly every server that runs MySQL.
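As a quick aside, the effective value on a given host is easy to inspect; this minimal check assumes nothing beyond a mounted /proc:

```shell
# Print the current swappiness value (an integer from 0 to 100).
cat /proc/sys/vm/swappiness
```

The same value is reported by sysctl vm.swappiness where the sysctl utility is installed.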
Looking into this further, I realized that a change introduced in kernel 3.5-rc1 altered the swapping behavior when vm.swappiness=0.
Below is the description of the commit that changed the vm.swappiness=0 behavior, together with the diff:
$ git show fe35004fbf9eaf67482b074a2e032abb9c89b1dd
commit fe35004fbf9eaf67482b074a2e032abb9c89b1dd
Author: Satoru Moriya <satoru.moriya@hds.com>
Date:   Tue May 29 15:06:47 2012 -0700

    mm: avoid swapping out with swappiness==0

    Sometimes we'd like to avoid swapping out anonymous memory.  In
    particular, avoid swapping out pages of important process or process
    groups while there is a reasonable amount of pagecache on RAM so that we
    can satisfy our customers' requirements.

    OTOH, we can control how aggressive the kernel will swap memory pages with
    /proc/sys/vm/swappiness for global and
    /sys/fs/cgroup/memory/memory.swappiness for each memcg.

    But with current reclaim implementation, the kernel may swap out even if
    we set swappiness=0 and there is pagecache in RAM.

    This patch changes the behavior with swappiness==0.  If we set
    swappiness==0, the kernel does not swap out completely (for global
    reclaim until the amount of free pages and filebacked pages in a zone
    has been reduced to something very very small (nr_free + nr_filebacked
    < high watermark)).

    Signed-off-by: Satoru Moriya <satoru.moriya@hds.com>
    Acked-by: Minchan Kim <minchan@kernel.org>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Jerome Marchand <jmarchan@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67a4fd4..ee97530 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1761,10 +1761,10 @@ static void get_scan_count(struct mem_cgroup_zone *mz, struct scan_control *sc,
 	 * proportional to the fraction of recently scanned pages on
 	 * each list that were recently referenced and in active use.
 	 */
-	ap = (anon_prio + 1) * (reclaim_stat->recent_scanned[0] + 1);
+	ap = anon_prio * (reclaim_stat->recent_scanned[0] + 1);
 	ap /= reclaim_stat->recent_rotated[0] + 1;

-	fp = (file_prio + 1) * (reclaim_stat->recent_scanned[1] + 1);
+	fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
 	fp /= reclaim_stat->recent_rotated[1] + 1;

 	spin_unlock_irq(&mz->zone->lru_lock);
@@ -1777,7 +1777,7 @@ out:
 		unsigned long scan;

 		scan = zone_nr_lru_pages(mz, lru);
-		if (priority || noswap) {
+		if (priority || noswap || !vmscan_swappiness(mz, sc)) {
 			scan >>= priority;
 			if (!scan && force_scan)
 				scan = SWAP_CLUSTER_MAX;
This change was merged into the RHEL kernel 2.6.32-303:
* Mon Aug 27 2012 Jarod Wilson <jarod@redhat.com> [2.6.32-303.el6]
...
- [mm] avoid swapping out with swappiness == 0 (Satoru Moriya) [787885]
This obviously changed the way we think about vm.swappiness=0. Previously, setting it to 0 was thought to reduce the tendency to swap out userland processes, not to disable swapping completely. As such, a little swapping was expected instead of an OOM kill.
This applies to all RHEL/CentOS kernels >= 2.6.32-303, to other distributions that ship newer kernels, such as Debian and Ubuntu, and to any other distribution where this change has been backported as in RHEL.
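As a rough sketch, one can test whether a given RHEL/CentOS 6 kernel release string is new enough to carry the backport; the parsing below is a heuristic of mine for 2.6.32-based kernels only, not an official check:

```shell
# Print "yes" if a 2.6.32-based RHEL/CentOS kernel release string is
# build 303 or later, i.e. includes the backported swappiness change.
carries_swappiness_change() {
    build=$(printf '%s\n' "$1" | sed -n 's/^2\.6\.32-\([0-9][0-9]*\).*/\1/p')
    if [ -n "$build" ] && [ "$build" -ge 303 ]; then
        echo yes
    else
        echo no
    fi
}

carries_swappiness_change "2.6.32-358.el6.x86_64"   # the kernel from this post
carries_swappiness_change "2.6.32-279.el6.x86_64"
```

On a live host you would pass "$(uname -r)". Mainline 3.x kernels carry the change natively, so this check only makes sense on RHEL/CentOS 6.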
Let me share the memory-zone statistics that were logged to the system log during one of the OOM events:
Mar 11 11:01:45 db01 kernel: Node 0 DMA: 4*4kB 2*8kB 2*16kB 0*32kB 2*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15680kB
Mar 11 11:01:45 db01 kernel: Node 0 DMA32: 6*4kB 22*8kB 444*16kB 374*32kB 129*64kB 26*128kB 15*256kB 17*512kB 2*1024kB 0*2048kB 0*4096kB = 45448kB
Mar 11 11:01:45 db01 kernel: Node 0 Normal: 825*4kB 1012*8kB 382*16kB 169*32kB 69*64kB 74*128kB 14*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 42436kB
Mar 11 11:01:45 db01 kernel: 452844 total pagecache pages
Mar 11 11:01:45 db01 kernel: 0 pages in swap cache
Mar 11 11:01:45 db01 kernel: Swap cache stats: add 0, delete 0, find 0/0
Mar 11 11:01:45 db01 kernel: Free swap  = 4128760kB
Mar 11 11:01:45 db01 kernel: Total swap = 4128760kB
...
Mar 11 11:01:45 db01 kernel: Node 0 DMA free:15680kB min:124kB low:152kB high:184kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15284kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 11 11:01:45 db01 kernel: Node 0 DMA32 free:45448kB min:25140kB low:31424kB high:37708kB active_anon:1741812kB inactive_anon:520348kB active_file:4792kB inactive_file:462576kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3072160kB mlocked:0kB dirty:386328kB writeback:76268kB mapped:936kB shmem:0kB slab_reclaimable:20420kB slab_unreclaimable:6964kB kernel_stack:0kB pagetables:572kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:142592 all_unreclaimable? no
Mar 11 11:01:45 db01 kernel: Node 0 Normal free:42436kB min:42316kB low:52892kB high:63472kB active_anon:3041852kB inactive_anon:643624kB active_file:340156kB inactive_file:1003512kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5171200kB mlocked:0kB dirty:979444kB writeback:22040kB mapped:15616kB shmem:180kB slab_reclaimable:41052kB slab_unreclaimable:35996kB kernel_stack:2720kB pagetables:19912kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:31552 all_unreclaimable? no
As can be seen, the amount of free memory plus the memory in the page cache was greater than the high watermark, which prevented any swapping activity. Yet memory pressure unnecessarily caused the OOM killer to be invoked, which killed the MySQL server process.
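Plugging the Normal-zone numbers from the log above into the nr_free + nr_filebacked < high-watermark condition from the commit makes this concrete; a simple back-of-the-envelope check:

```shell
# Values taken from the "Node 0 Normal" line of the OOM report above.
free_kb=42436
active_file_kb=340156
inactive_file_kb=1003512
high_watermark_kb=63472

# With swappiness=0, anon pages are only scanned once
# free + file-backed pages fall below the high watermark.
total_kb=$((free_kb + active_file_kb + inactive_file_kb))
echo "nr_free + nr_filebacked = ${total_kb}kB, high watermark = ${high_watermark_kb}kB"
if [ "$total_kb" -ge "$high_watermark_kb" ]; then
    echo "condition for scanning anon pages not met: no swapping with swappiness=0"
fi
```

The zone is more than 20x above the threshold, which is why no swapping was ever initiated.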
MySQL getting OOM-killed is bad for many reasons and can have an undesirable impact, such as the loss of uncommitted transactions, or of transactions not yet flushed to the log because of innodb_flush_log_at_trx_commit=0, or the much heavier impact of cold caches after the restart.
I prefer the old behavior of vm.swappiness, so I now set it to 1. Leaving vm.swappiness=0 means you have to be much more precise in sizing the various global and session buffers.
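For completeness, a minimal sketch of persisting the setting; here SYSCTL_D stands in for /etc/sysctl.d so the snippet can be exercised without root (on a real host, write the drop-in under /etc/sysctl.d, a hypothetical file name is used below):

```shell
# Stand-in for /etc/sysctl.d; on a real host drop the override.
SYSCTL_D="${SYSCTL_D:-/tmp/sysctl-demo}"
mkdir -p "$SYSCTL_D"

# Persist the author's recommended value across reboots.
printf 'vm.swappiness = 1\n' > "$SYSCTL_D/90-mysql.conf"
cat "$SYSCTL_D/90-mysql.conf"

# On a real host, apply it immediately with:
#   sysctl -w vm.swappiness=1
# or load the drop-in with:
#   sysctl -p /etc/sysctl.d/90-mysql.conf
```

On CentOS 6 the traditional location is /etc/sysctl.conf; the sysctl.d drop-in shown here is the newer convention.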
APRIL 28, 2014 AT 2:48 PM
Thank you for the heads-up, Ovais. With vm.swappiness=1, does MySQL perform worse under certain loads (with roughly optimal buffer sizes)?
In the init script I set echo -17 > /proc/$(pidof -s mysqld)/oom_adj to lower mysqld's priority with the OOM killer. With vm.swappiness=0, does this setting prevent mysqld from getting OOMed on the newer kernels?
APRIL 28, 2014 AT 6:07 PM
I witnessed similar invocations of the OOM killer a while ago when servers were under heavy load. Our entire Hadoop clusters got killed because of it. My observation is that setting it to something small, from 1 to 10, is a safe bet with basically any service/app running under heavy load.
APRIL 28, 2014 AT 6:44 PM
Adam,
With swappiness=1 there can be some swapping activity, for example when the session buffers are not optimally sized, but that was also true of the older swappiness=0 implementation.
Adjusting oom_adj is a recommended practice; however, even though it makes it less likely for MySQL to be OOM-killed, it does not completely prevent that under all possible conditions.
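For reference, the oom_adj knob discussed above is deprecated in favor of oom_score_adj; the kernel scales one into the other by 1000/17, so the commenter's -17 corresponds to -1000 (fully exempt from the OOM killer). A small sketch of the conversion:

```shell
# Convert a legacy oom_adj value (-17..15) to the equivalent
# oom_score_adj value (-1000..1000), mirroring the kernel's scaling.
oom_adj_to_score_adj() {
    echo $(( $1 * 1000 / 17 ))
}

oom_adj_to_score_adj -17    # the value used in the comment above

# On a live server one would then write the result, e.g.:
#   echo -1000 > /proc/$(pidof -s mysqld)/oom_score_adj
```

Writing a negative value requires root (CAP_SYS_RESOURCE); unprivileged processes may only raise their own score.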
APRIL 28, 2014 AT 11:47 PM
I think there may be some confusion here.
The original specification of swappiness in the kernel documentation indicated that it is a preference for LRU reclaim of ANON versus FILE-backed pages.
Setting it to zero implied that the user/admin didn't want any ANON pages to be reclaimed at the cost of file-backed pages.
Now,
“”
But with current reclaim implementation, the kernel may swap out even if
we set swappiness=0 and there is pagecache in RAM.
“”
This was likely the bug that they fixed, i.e. when set to zero, the kernel should always try to write out dirty / reclaim file-backed pages (unless pinned etc.) whenever possible, and only go after ANON once those are exhausted.
Now, about OOM, does it make sense to reclaim pages from ANON when there
is nothing in page cache and system is running out of memory? Of course
it does (your system is asphyxiating from lack of memory, you will want
to do anything to recover from that). That is why the following exists: http://lxr.linux.no/#linux+v3.13.5/mm/vmscan.c#L1859
1859 /*
1860 * Global reclaim will swap to prevent OOM even with no
1861 * swappiness, but memcg users want to use this knob to
1862 * disable swapping for individual groups completely when
1863 * using the memory controller's swap limit feature would be
1864 * too expensive.
1865 */
1866 if (!global_reclaim(sc) && !vmscan_swappiness(sc)) {
1867 scan_balance = SCAN_FILE;
1868 goto out;
1869 }
1870
1871 /*
1872 * Do not apply any pressure balancing cleverness when the
1873 * system is close to OOM, scan both anon and file equally
1874 * (unless the swappiness setting disagrees with swapping).
1875 */
1876 if (!sc->priority && vmscan_swappiness(sc)) {
1877 scan_balance = SCAN_EQUAL;
1878 goto out;
1879 }
1880
(Please check whether they backported these bits (or their equivalent) to CentOS/RHEL or not.)
=================================================================
So, the original specification of swappiness has remained unchanged; they fixed what seems to be an off-by-one miscalculation (otherwise, even with 0, it scanned some percentage of ANON rather than none).
The OOM/reclaim behavior shouldn't be affected by this unless there was another bug in between which has since been fixed.
Setting it to non-zero (>=1) means asking for ANON to be scanned to that extent.
=================================
Now, for MySQL the ratio of ANON to FILE-backed memory is quite high (due to InnoDB, unless there is heavy MyISAM usage and/or very large InnoDB log files), so tuning swappiness is not going to help much (YMMV); it may only defer the inevitable.
If the memory usage of mysqld is very close to the system limits, it has to die unless the kernel itself wants to crash. That is why it is recommended to handle this either at the application level or at the level of cgroups.
MySQL needs to implement per-thread or global memory limits, since unbounded allocations such as per-connection buffers (unlike the buffer pool) scale with the number of connections and can potentially explode the system's memory. This could also be done with rlimit, similar to how open-files-limit is balanced against max_connections (there is a formula for this lurking in the MySQL sources).
Regarding cgroups, it may help to corral the mysqld threads into a memcg with tight bounds on memory. cgroups also allow notifications at various stages of memory-exhaustion events, which MySQL could use to throttle connections (there is an unsubmitted talk by me on cgroups and MySQL :)).
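A minimal sketch of that memcg approach; here CG stands in for a mounted cgroup v1 memory controller (e.g. /sys/fs/cgroup/memory on RHEL 6-era systems) so the snippet can be run without one, and the group name "mysql" is illustrative:

```shell
# Stand-in for /sys/fs/cgroup/memory; on a real host drop the override.
CG="${CG:-/tmp/memcg-demo}"
mkdir -p "$CG/mysql"

# Hard memory bound (8 GiB here) and per-group swappiness.
echo $((8 * 1024 * 1024 * 1024)) > "$CG/mysql/memory.limit_in_bytes"
echo 0 > "$CG/mysql/memory.swappiness"
cat "$CG/mysql/memory.swappiness"

# On a real host you would then move mysqld into the group:
#   echo "$(pidof -s mysqld)" > /sys/fs/cgroup/memory/mysql/tasks
```

Note that memory.swappiness in a memcg really does disable swapping for that group completely, per the vmscan.c comment quoted above, unlike the global knob.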
Regarding OOM otherwise, it can be due to zone reclaim with unbalanced zones and NUMA; there are whole blog posts on this (though the numactl fix is not an optimal one, it 'amortizes' the cost of foreign accesses).
===================================
There is also vm.vfs_cache_pressure, which can be interesting for people with hundreds of thousands of tables who are running out of memory.
APRIL 29, 2014 AT 10:49 AM
Raghu,
The purpose of this blog post was to share the fact that there is a difference now in how swappiness=0 is handled by the kernel versus how it had worked for a long time (and what people had been used to). Of course the current implementation prefers not to swap anonymous pages. However, there is a regression somewhere in the new implementation that causes the kernel to prefer invoking the OOM killer even though it could evict pages from the page cache. This is shown in the blog post, taken from one of several cases where the kernel preferred to invoke the OOM killer with swap usage at 0 and a good amount of memory used by pages in the page cache.
Swapping is only initiated when nr_free + nr_filebacked < high watermark; hence there was no swapping, yet the kernel invoked the OOM killer. That is exactly what I have tried to portray in the blog post: in practice, vm.swappiness=0 is not working as expected.
I am well aware of the swapping issue related to NUMA, but that is a wholly different topic. In that particular case the default NUMA memory-allocation policy causes unnecessary swapping because of an imbalance in how memory is allocated from the nodes. However, the case this blog post is talking about is when you have *zero swap* usage and OOM getting invoked, with vm.swappiness=0. Comparing this to the NUMA-swappiness issue is like comparing oranges to apples.
APRIL 29, 2014 AT 9:11 PM
> The purpose of this blog post was to share the fact that there is a difference now in how swappiness=0 is handled by the kernel versus how it had worked for a long time (and what people had been used to).
That is fine, but I wanted to point out that the behavior of swappiness now correctly conforms to the documentation (though even before this it wouldn't have caused much of a change, since setting it to 0 still meant some ANON scanning under low-memory conditions).
> Of course the current implementation prefers to not swap anonymous pages.
It should not, if you willingly set it to 0, unless there is an OOM condition or global reclaim.
> However, there is a regression somewhere in the new implementation which causes the kernel to prefer to OOM even though it could remove pages from the page cache. This is shown in the blog post, taken from
> one of the couple of cases where the kernel preferred to invoke the OOM with the swap usage at 0 and good amount of memory used up by pages in the page cache.
The only cases of OOM with swappiness=0 that I have seen are when there was no swap available, i.e. with swap full or swap disabled (or possibly when the swap device couldn't handle the writes).
Otherwise, swappiness shouldn't affect OOM in any way; if you really feel there is a regression here, please report it to the Red Hat bugzilla (the mainline kernel looks OK). I have not seen any other reports of this.
A possible edge case that I can surmise here is a sudden large allocation (or writing a large file to tmpfs, which is still swap-backed) that swapping is unable to keep up with; that may OOM, but swappiness will certainly not help there either.
> Swapping is only initiated when nr_free + nr_filebacked < high watermark, and hence there was no swapping, yet kernel invoked OOM.
> That’s exactly what I have tried to portray in the blog post that practicality shows that vm.swappiness=0 is not working as expected.
The nr_free for the Normal zone also looks way below the low watermark, which can wake up kswapd.
> I am pretty aware about the swapping issue and its related to NUMA, but that is a wholly different topic. In that particular case default NUMA memory allocation policy causes unnecessary swapping because of imbalance on how the memory allocation is done from the nodes. However, the case that this blog post is talking about is when you have *zero swap* usage and OOM getting invoked, when vm.swappiness=0.
> Comparing this to the NUMA-swappiness issue is like comparing oranges to apples.
Zone imbalance can happen without NUMA too. I mentioned NUMA to point out that there is a multitude of other issues (bugs) which may have caused this odd behavior, THP/compaction for instance.
Also, the zone information provided for that OOM is insufficient; for instance, there isn't any swap info at all! Did that box even have swap enabled? The full diagnostic printed to dmesg should show that (if still available).
APRIL 29, 2014 AT 9:40 PM
> Only cases of OOM with swappiness=0 that I have seen are when there was no swap, ie. with full swap or swap disabled. (or probably when swap device couldn’t handle the writes)
In the cases that I have worked on, I have seen OOM (but no swapping) with swap enabled and swappiness=0.
> Possible edge case that I can surmise here is if suddenly a large allocation is done (or write a large file to tmpfs) and swapping is unable to handle it (tmpfs is still swap backed), it may OOM but swappiness will certainly not help there either.
No, all such cases happened during normal operation, after the MySQL server had been in service for a specific period of time.
> The nr_free for Normal zone looks way below the low watermark as well, that can wakeup kswapd.
Wouldn’t that factor in page in page cache as well, and waking up kswapd does not mean its simply going to OOM.
> Also, the zone information provided for that OOM is insufficient, for instance, there isn’t any swap info at all! Did that box even have swap enabled? Full diagnostic printed to dmesg should show that (if still available).
I have updated the relevant section with swap-related information. In all the cases where this blog post applies, swap was present but no swapping activity was seen.