Clearing the page cache with posix_fadvise: a misconception and how to fix it

origin_lee

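Original article. If you repost it, please credit: reposted from 系统技术非业余研究

Link to this article: Clearing the page cache with posix_fadvise: a misconception and how to fix it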

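On a typical IO-intensive database server such as MySQL there is a great deal of file reading and writing. These files are usually accessed through buffered IO, so as to make full use of the Linux page cache.

Buffered IO works like this. On a read, the kernel first checks whether the page cache already holds the requested data; if not, it reads from the device, returns the data to the user and keeps a copy in the cache. On a write, the data goes straight into the cache, and background threads flush it to disk periodically. The mechanism looks very good, and it works well in practice.

But when your IO is very heavy, problems appear. First, the page size is 4K, so memory is used rather inefficiently. Second, the cache eviction algorithm is simple and is run by the operating system on its own; the user has little say in it. When you write a lot and cross some system memory limit, the background thread (kswapd) has to come out and reclaim pages, and once reclaim is slower than the incoming writes, behavior becomes unpredictable.

The biggest problem is this: as long as the memory you use, cache included, stays under the limit set by the operating system, the operating system does nothing and lets you use the cache freely, because from its point of view that is the most efficient choice. Yet it is exactly this policy that causes trouble in practice.

Take a MySQL server. We can send the data files through direct IO, but the logs still go through buffered IO, because direct IO requires the offset and size of every write to be sector-aligned, which is too cumbersome for a logging system. MySQL is transactional, so there is a lot of log activity: frequent writes, each followed by fsync. Once a log record is on disk its buffer pages are useless, yet they stay in memory until the memory limit is reached. At that point the operating system suddenly reclaims a large number of pages, with side effects such as IO stalls or swapping.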

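Now that we know where the difficulty lies, we can avoid it proactively. There are two ways:
1. Send the logs through direct IO as well. This needs large-scale changes to the MySQL code; Percona, for example, did this and provides a patch.
2. Keep the logs on buffered IO, but periodically drop the useless page cache.

The first method is not our topic here; we focus on how to do the second: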

Inside the program we know the file descriptor, so can't we simply use:

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.

to solve the problem?
For example, we could write something like posix_fadvise(fd, 0, len_of_file, POSIX_FADV_DONTNEED); to drop the cache that belongs to the file.

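vmtouch, introduced in an earlier post, has exactly this feature: it drops the cache of a given file.
You can try it with vmtouch -ve logfile, but you will find that memory usage does not go down at all. Why?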
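
To check a single file rather than the system-wide memory figures, residency can be queried with mincore(2), which is the same system call vmtouch relies on. The following is a minimal sketch, not vmtouch's actual code; the helper name resident_pages is hypothetical, and the descriptor is assumed to be open for reading.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the file behind fd are
 * currently resident in the page cache. Returns -1 on error.
 * fd must be open for reading, otherwise mmap fails.
 */
long resident_pages(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size == 0)
        return 0;

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)st.st_size;
    size_t npages = (len + (size_t)page - 1) / (size_t)page;

    /* Mapping does not read the file in; it only gives mincore an address range. */
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long count = -1;
    if (vec != NULL && mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;    /* bit 0 set: page is resident */
    }
    free(vec);
    munmap(addr, len);
    return count;
}
```

Calling resident_pages(fd) before and after posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) shows whether any pages were actually dropped. Note that recent kernels are believed to report true residency only for files the caller owns or may write to, so the check is best run as the file's owner.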

Let's read the code to see how posix_fadvise works.
See mm/fadvise.c:

    /*
     * POSIX_FADV_WILLNEED could set PG_Referenced, and POSIX_FADV_NOREUSE could
     * deactivate the pages and clear PG_Referenced.
     */
    SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
    {
    ...
        case POSIX_FADV_DONTNEED:
            if (!bdi_write_congested(mapping->backing_dev_info))
                filemap_flush(mapping);

            /* First and last FULL page! */
            start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
            end_index = (endbyte >> PAGE_CACHE_SHIFT);

            if (end_index >= start_index)
                invalidate_mapping_pages(mapping, start_index,
                            end_index);
            break;
    ...
    }
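
One detail of the excerpt above is worth working through before following the two calls: the byte range is converted to page indexes, and the start is rounded up to the next page boundary. The sketch below mirrors that arithmetic in user space. It assumes 4 KiB pages and assumes that endbyte is the inclusive last byte, offset + len - 1, since the excerpt elides how the kernel derives it; the function name is hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT_ASSUMED 12                        /* assumption: 4 KiB pages */
#define PAGE_SIZE_ASSUMED  (1LL << PAGE_SHIFT_ASSUMED)

/*
 * Mirror of the index arithmetic in the POSIX_FADV_DONTNEED case.
 * Assumption: endbyte is the inclusive last byte, offset + len - 1.
 * Returns 1 and fills *start / *end if at least one page index is
 * selected, 0 if the range selects no page.
 */
int dontneed_page_range(long long offset, long long len,
                        long long *start, long long *end)
{
    assert(offset >= 0 && len > 0);
    long long endbyte = offset + len - 1;

    /* Round the start up, so a partially covered first page is skipped. */
    *start = (offset + (PAGE_SIZE_ASSUMED - 1)) >> PAGE_SHIFT_ASSUMED;
    *end = endbyte >> PAGE_SHIFT_ASSUMED;
    return *end >= *start;
}
```

With these assumptions, offset 0 and length 4096 select exactly page 0, while offset 1 and length 4095 select nothing, because the only page touched is covered partially at its start. Passing whole-file or page-aligned ranges avoids such surprises.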

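The fadvise code shows that if the backing device is not congested, it first calls filemap_flush(mapping) to flush the dirty pages, and then calls invalidate_mapping_pages to drop the pages. Let's first see how the flush is done:
mm/filemap.c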

    /**
     * filemap_flush - mostly a non-blocking flush
     * @mapping:    target address_space
     *
     * This is a mostly non-blocking flush.  Not suitable for data-integrity
     * purposes - I/O may not be started against all dirty pages.
     */
    int filemap_flush(struct address_space *mapping)
    {
        return __filemap_fdatawrite(mapping, WB_SYNC_NONE);
    }

    /**
     * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
     * @mapping:    address space structure to write
     * @start:      offset in bytes where the range starts
     * @end:        offset in bytes where the range ends (inclusive)
     * @sync_mode:  enable synchronous operation
     *
     * Start writeback against all of a mapping's dirty pages that lie
     * within the byte offsets <start, end> inclusive.
     *
     * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
     * opposed to a regular memory cleansing writeback.  The difference between
     * these two operations is that if a dirty page/buffer is encountered, it must
     * be waited upon, and not just skipped over.
     */
    int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                                   loff_t end, int sync_mode)
    {
        int ret;
        struct writeback_control wbc = {
            .sync_mode = sync_mode,
            .nr_to_write = LONG_MAX,
            .range_start = start,
            .range_end = end,
        };

        if (!mapping_cap_writeback_dirty(mapping))
            return 0;

        ret = do_writepages(mapping, &wbc);
        return ret;
    }

    int filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
                                 loff_t end)
    {
        return __filemap_fdatawrite_range(mapping, start, end, WB_SYNC_ALL);
    }

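We see that the flush is issued with WB_SYNC_NONE, which means it does not wait synchronously for the writeback to finish.
fsync and fdatasync, by contrast, eventually call filemap_fdatawrite_range with WB_SYNC_ALL and return only after the writeback has completed.
Let's look at mm/page-writeback.c to confirm: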

    int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
    {
    ...
        if (mapping->a_ops->writepages)
            ret = mapping->a_ops->writepages(mapping, wbc);
        else
            ret = generic_writepages(mapping, wbc);
        return ret;
    }

    int generic_writepages(struct address_space *mapping,
                           struct writeback_control *wbc)
    {
        /* deal with chardevs and other special file */
        if (!mapping->a_ops->writepage)
            return 0;

        return write_cache_pages(mapping, wbc, __writepage, mapping);
    }

    int write_cache_pages(struct address_space *mapping,
                          struct writeback_control *wbc, writepage_t writepage,
                          void *data)
    {
    ...
                /*
                 * We stop writing back only if we are not doing
                 * integrity sync. In case of integrity sync we have to
                 * keep going until we have written all the pages
                 * we tagged for writeback prior to entering this loop.
                 */
                if (--wbc->nr_to_write <= 0 &&
                    wbc->sync_mode == WB_SYNC_NONE) {
                    done = 1;
                    break;
                }
            }
            pagevec_release(&pvec);
            cond_resched();
    ...
    }

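The code and comments show that in WB_SYNC_NONE mode the kernel submits the dirty pages for writing and then returns; it really does not wait for the writeback to complete.
That settles how dirty pages are flushed. Now for the second step, dropping the pages from memory.
See the implementation in mm/truncate.c: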

    /**
     * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
     * @mapping: the address_space which holds the pages to invalidate
     * @start: the offset 'from' which to invalidate
     * @end: the offset 'to' which to invalidate (inclusive)
     *
     * This function only removes the unlocked pages, if you want to
     * remove all the pages of one inode, you must call truncate_inode_pages.
     *
     * invalidate_mapping_pages() will not block on IO activity. It will not
     * invalidate pages which are dirty, locked, under writeback or mapped into
     * pagetables.
     */
    unsigned long invalidate_mapping_pages(struct address_space *mapping,
                           pgoff_t start, pgoff_t end)
    {
        ...
        pagevec_init(&pvec, 0);
        while (next <= end &&
                pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
            mem_cgroup_uncharge_start();
            for (i = 0; i < pagevec_count(&pvec); i++) {
                struct page *page = pvec.pages[i];
                ...
                ret += invalidate_inode_page(page);
                ...
            }
            pagevec_release(&pvec);
            mem_cgroup_uncharge_end();
            cond_resched();
        }
        return ret;
    }

    /*
     * Safely invalidate one page from its pagecache mapping.
     * It only drops clean, unused pages. The page must be locked.
     *
     * Returns 1 if the page is successfully invalidated, otherwise 0.
     */
    int invalidate_inode_page(struct page *page)
    {
        struct address_space *mapping = page_mapping(page);
        if (!mapping)
            return 0;
        if (PageDirty(page) || PageWriteback(page))
            return 0;
        if (page_mapped(page))
            return 0;
        return invalidate_complete_page(mapping, page);
    }

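The comments above tell us that a page must satisfy two conditions before it can be dropped: 1. it is not dirty; 2. it is not in use.
If both conditions hold, invalidate_complete_page is called to continue: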

    /*
     * This is for invalidate_mapping_pages().  That function can be called at
     * any time, and is not supposed to throw away dirty pages.  But pages can
     * be marked dirty at any time too, so use remove_mapping which safely
     * discards clean, unused pages.
     *
     * Returns non-zero if the page was successfully invalidated.
     */
    static int
    invalidate_complete_page(struct address_space *mapping, struct page *page)
    {
        int ret;

        if (page->mapping != mapping)
            return 0;

        if (page_has_private(page) && !try_to_release_page(page, 0))
            return 0;

        clear_page_mlock(page);
        ret = remove_mapping(mapping, page);

        return ret;
    }

We see that invalidate_complete_page, once some further conditions are met, goes on to call remove_mapping:

    /*
     * Attempt to detach a locked page from its ->mapping.  If it is dirty or if
     * someone else has a ref on the page, abort and return 0.  If it was
     * successfully detached, return 1.  Assumes the caller has a single ref on
     * this page.
     */
    int remove_mapping(struct address_space *mapping, struct page *page)
    {
        if (__remove_mapping(mapping, page)) {
            /*
             * Unfreezing the refcount with 1 rather than 2 effectively
             * drops the pagecache ref for us without requiring another
             * atomic operation.
             */
            page_unfreeze_refs(page, 1);
            return 1;
        }
        return 0;
    }

    /*
     * Same as remove_mapping, but if the page is removed from the mapping, it
     * gets returned with a refcount of 0.
     */
    static int __remove_mapping(struct address_space *mapping, struct page *page)
    {
        BUG_ON(!PageLocked(page));
        BUG_ON(mapping != page_mapping(page));

        spin_lock_irq(&mapping->tree_lock);
        /*
         * The non racy check for a busy page.
         *
         * Must be careful with the order of the tests. When someone has
         * a ref to the page, it may be possible that they dirty it then
         * drop the reference. So if PageDirty is tested before page_count
         * here, then the following race may occur:
         *
         * get_user_pages(&page);
         * [user mapping goes away]
         * write_to(page);
         *                !PageDirty(page)    [good]
         * SetPageDirty(page);
         * put_page(page);
         *                !page_count(page)   [good, discard it]
         *
         * [oops, our write_to data is lost]
         *
         * Reversing the order of the tests ensures such a situation cannot
         * escape unnoticed. The smp_rmb is needed to ensure the page->flags
         * load is not satisfied before that of page->_count.
         *
         * Note that if SetPageDirty is always performed via set_page_dirty,
         * and thus under tree_lock, then this ordering is not required.
         */
        if (!page_freeze_refs(page, 2))
            goto cannot_free;
        /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
        if (unlikely(PageDirty(page))) {
            page_unfreeze_refs(page, 2);
            goto cannot_free;
        }

        if (PageSwapCache(page)) {
            swp_entry_t swap = { .val = page_private(page) };
            __delete_from_swap_cache(page);
            spin_unlock_irq(&mapping->tree_lock);
            swapcache_free(swap, page);
        } else {
            void (*freepage)(struct page *);

            freepage = mapping->a_ops->freepage;

            __remove_from_page_cache(page);
            spin_unlock_irq(&mapping->tree_lock);
            mem_cgroup_uncharge_cache_page(page);

            if (freepage != NULL)
                freepage(page);
        }

        return 1;

    cannot_free:
        spin_unlock_irq(&mapping->tree_lock);
        return 0;
    }

Having read this far, we understand why the memory was not released: the pages still being dirty is the most important factor.
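
The checks spread across invalidate_inode_page, invalidate_complete_page and __remove_mapping can be condensed into a single predicate. The following is only a toy user-space model for reasoning about the outcome: the struct and its fields are hypothetical stand-ins for kernel page state, and locking and races are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified snapshot of one page-cache page. */
struct page_state {
    bool dirty;         /* PageDirty */
    bool writeback;     /* PageWriteback */
    bool mapped;        /* page_mapped: mapped into some page table */
    bool private_busy;  /* private buffers that try_to_release_page could not free */
    int  extra_refs;    /* references beyond the page cache's and the caller's */
};

/* Returns true if POSIX_FADV_DONTNEED would be able to drop this page. */
bool can_invalidate(const struct page_state *p)
{
    assert(p != NULL);
    if (p->dirty || p->writeback)   /* checked in invalidate_inode_page */
        return false;
    if (p->mapped)                  /* checked in invalidate_inode_page */
        return false;
    if (p->private_busy)            /* checked in invalidate_complete_page */
        return false;
    if (p->extra_refs != 0)         /* page_freeze_refs(page, 2) would fail */
        return false;
    return true;
}
```

In this model a freshly written log page starts out with dirty set. filemap_flush only starts writeback without waiting, so when invalidate_mapping_pages runs immediately afterwards the page is typically still dirty or under writeback, and the predicate is false.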

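But how do we make sure that none of the pages are dirty? fdatasync and fsync are both options, and the newer Linux system call sync_file_range (called with its wait flags) works too. All of these use WB_SYNC_ALL mode and force the writeback to complete before returning.
For example: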

    fdatasync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
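
In real code both calls should be checked, and they report errors differently. The helper below is a minimal sketch of that; the name drop_file_cache is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the file's dirty pages back synchronously,
 * then ask the kernel to drop its now-clean pages from the page cache.
 * Returns 0 on success, otherwise an errno value.
 */
int drop_file_cache(int fd)
{
    assert(fd >= 0);

    /* fdatasync returns -1 and sets errno on failure. */
    if (fdatasync(fd) != 0)
        return errno;

    /*
     * len == 0 means "up to the end of the file".
     * posix_fadvise returns the error number directly and does not set errno.
     */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

A log writer could call drop_file_cache(fd) after each batch of commits, or less often if the extra fdatasync is too costly. Pages that are mapped or otherwise referenced at that moment are still left in place.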

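One question remains: how can operations staff clear these caches? After all, they cannot write code to change the program's behavior. My preliminary suggestions are:
1. Ideally, write a systemtap script that writes back the pages owned by these files with sync_file_range, and then drop them with vmtouch.
2. Simply run sudo sysctl vm.drop_caches=1 periodically to force out the unused page cache. This is fairly dangerous, though, and not recommended.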
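
A third possibility, sketched here as an assumption rather than something tested in this article, is a small standalone tool. The page cache belongs to the file, not to the process that dirtied it, and Linux allows fdatasync on a descriptor opened read-only, so an external process can write back and drop another program's log pages. Note that this swaps in fdatasync for the systemtap plus sync_file_range combination suggested above; the function name is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical operator-side helper: flush and drop the page cache of the
 * file at path, without any cooperation from the process writing it.
 * Returns 0 on success, otherwise an errno value.
 */
int drop_cache_by_path(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;

    int err;
    if (fdatasync(fd) != 0)     /* writes back the file's dirty pages and waits */
        err = errno;
    else                        /* returns the error number, does not set errno */
        err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return err;
}
```

Wrapped in a main that loops over its arguments, this could be run from cron against the log files. It needs read permission on each file, and the forced fdatasync competes with the database's own IO, so the interval should be chosen with care.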

For vm.drop_caches, see the following:

Writing to this will cause the kernel to drop clean caches, dentries and inodes from memory, causing that memory to become free.
To free pagecache:
* echo 1 > /proc/sys/vm/drop_caches

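The code above is from Linux 2.6.37, which differs somewhat from the 2.6.18 running on our production machines, but after checking, the overall logic is the same.
To verify the code analysis and conclusions above, we set up a test environment, construct the scenario, and let the data speak: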

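1. Our physical machine has 24G of memory. We create a 4G file and keep all of its pages in memory, dirty, without letting the operating system write them to disk on its own.
2. The filesystem is ext3. Its default is ordered mode, in which the filesystem has kjournald write our pages to disk; on the advice of 伯瑜, we use writeback mode instead.
3. Raise the vm settings dirty_ratio and dirty_background_ratio to 90, and dirty_expire_centisecs and dirty_writeback_centisecs to one hour, so that pdflush does not interfere.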

Here are the key parameters:

    $uname -r
    2.6.18-164.el5
    $free -m
                 total       used       free     shared    buffers     cached
    Mem:         24098       5207      18890          0        119        467
    -/+ buffers/cache:       4620      19477
    Swap:         8189        582       7606
    $sysctl -a|grep vm.dirty
    vm.dirty_expire_centisecs = 359945
    vm.dirty_writeback_centisecs = 359945
    vm.dirty_ratio = 90
    vm.dirty_background_ratio = 90
    $mount
    ...
    /dev/sda12 on /u02 type ext3 (rw,data=writeback)
    $pwd
    /u02

Next, drop all of the cache and buffers, then create a 4G data file, and watch how system memory changes in the meantime:

    $sudo sysctl vm.drop_caches=3
    vm.drop_caches = 3
    $sudo dd if=/dev/zero of=large bs=4M count=1024
    1024+0 records in
    1024+0 records out
    4294967296 bytes (4.3 GB) copied, 6.68751 seconds, 642 MB/s

Watch the memory in another terminal:

    $ watch -n 1 'cat /proc/meminfo'
    Every 1.0s: cat /proc/meminfo                    Tue Dec 13 18:09:06 2011
    MemTotal:     24676836 kB
    MemFree:      15294416 kB
    Buffers:         59880 kB
    Cached:        4466468 kB
    SwapCached:        152 kB
    Active:        4849772 kB
    Inactive:      4246320 kB
    HighTotal:           0 kB
    HighFree:            0 kB
    LowTotal:     24676836 kB
    LowFree:      15294416 kB
    SwapTotal:     8385760 kB
    SwapFree:      7789424 kB
    Dirty:         4382608 kB
    Writeback:           0 kB
    AnonPages:     4569712 kB
    Mapped:          88080 kB
    Slab:           196736 kB
    PageTables:      33444 kB
    NFS_Unstable:        0 kB
    Bounce:              0 kB
    CommitLimit:  20724176 kB
    Committed_AS: 26063060 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed:    268664 kB
    VmallocChunk: 34359469543 kB
    HugePages_Total:     0
    HugePages_Free:      0
    HugePages_Rsvd:      0
    Hugepagesize:     2048 kB
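
The fields that matter for this experiment are Dirty, Writeback and Cached. For scripted measurements they can be pulled out of /proc/meminfo-style text with a few lines of C. This is a minimal sketch; the function name meminfo_kb is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find the line starting with "<key>:" in
 * /proc/meminfo-style text and return its value in kB, or -1 if the key
 * is absent or has no number.
 */
long meminfo_kb(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Skip leading indentation so pasted transcripts also work. */
        while (*line == ' ' || *line == '\t')
            line++;
        /* Require the colon so that "Cached" does not match "SwapCached". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long kb;
            if (sscanf(line + klen + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        const char *nl = strchr(line, '\n');
        line = (nl != NULL) ? nl + 1 : NULL;
    }
    return -1;
}
```

Applied to the output above, meminfo_kb(text, "Dirty") gives 4382608, roughly the size of the 4G file, while Writeback is 0: the pages of the file are dirty and nothing is being written out yet.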

In yet another terminal, watch the IO:

    $iostat -dx 1
    Device:         rrqm/s   wrqm/s   r/s   w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
    sda               0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda1              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda2              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda3              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda4              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda5              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda6              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda7              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda8              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda9              0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda10             0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda11             0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sda12             0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
    sdb               0.00    48.00  0.00  4.00     0.00   416.00   104.00     0.00    0.25   0.25   0.10
    sdb1              0.00    48.00  0.00  4.00     0.00   416.00   104.00     0.00    0.25   0.25   0.10