ZFS ARC & L2ARC zfs-$ver/module/zfs/arc.c

Tunable parameters, their meanings, and default values can be found in arc.c or under /sys/module/zfs/parameters/$parm_name; an example of reading and setting them follows the parameter list below.
On FreeBSD or other systems with native ZFS support, adjust sysctl.conf instead.
parm:           zfs_arc_min:Min arc size (ulong)
parm:           zfs_arc_max:Max arc size (ulong)
parm:           zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm:           zfs_arc_meta_prune:Bytes of meta data to prune (int)
parm:           zfs_arc_grow_retry:Seconds before growing arc size (int)
parm:           zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm:           zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
parm:           zfs_disable_dup_eviction:disable duplicate buffer eviction (int)
parm:           zfs_arc_memory_throttle_disable:disable memory throttle (int)
parm:           zfs_arc_min_prefetch_lifespan:Min life of prefetch block (int)
parm:           l2arc_write_max:Max write bytes per interval (ulong)
parm:           l2arc_write_boost:Extra write bytes during device warmup (ulong)
parm:           l2arc_headroom:Number of max device writes to precache (ulong)
parm:           l2arc_headroom_boost:Compressed l2arc_headroom multiplier (ulong)
parm:           l2arc_feed_secs:Seconds between L2ARC writing (ulong)
parm:           l2arc_feed_min_ms:Min feed interval in milliseconds (ulong)
parm:           l2arc_noprefetch:Skip caching prefetched buffers (int)
parm:           l2arc_nocompress:Skip compressing L2ARC buffers (int)
parm:           l2arc_feed_again:Turbo L2ARC warmup (int)
parm:           l2arc_norw:No reads during writes (int)
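
For example, on Linux the current value of a parameter can be read from sysfs, most writable parameters can be changed at runtime, and settings can be made persistent through modprobe options (a hedged sketch; the 4 GB figure is purely illustrative):

# read the current ARC size cap
cat /sys/module/zfs/parameters/zfs_arc_max

# change it at runtime (illustrative: cap the ARC at 4 GB)
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max

# make it persistent across reboots, in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=4294967296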


A few points worth noting about the L2ARC:
1. L2ARC content is pulled from the ARC periodically by the l2arc_feed_thread function, so anything that is not in the ARC can never be in the L2ARC.
2. The L2ARC never stores dirty data, so nothing in it ever needs to be written back to disk. Given this, the L2ARC is a poor fit for frequently changing data (such as the heavy-update portions of an OLTP workload).
3. If a block cached in the L2ARC becomes dirty in the ARC, that data is simply dropped from the L2ARC.
4. The L2ARC tuning parameters (set persistently in /etc/modprobe.d/zfs.conf, or changed dynamically via /sys/module/zfs/parameters/$PARM_NAME; see the example after this list):
 *      l2arc_write_max         max write bytes per interval (the maximum amount moved per L2ARC feed)
 *      l2arc_write_boost       extra write bytes during device warmup
 *      l2arc_noprefetch        skip caching prefetched buffers
 *      l2arc_nocompress        skip compressing buffers
 *      l2arc_headroom          number of max device writes to precache
 *      l2arc_headroom_boost    when we find compressed buffers during ARC
 *                              scanning, we multiply headroom by this
 *                              percentage factor for the next scan cycle,
 *                              since more compressed buffers are likely to
 *                              be present
 *      l2arc_feed_secs         seconds between L2ARC writing (to feed the L2ARC from the ARC faster, shorten this interval)
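
For example, to warm the L2ARC faster one might raise the per-feed write cap in addition to (or instead of) shortening the feed interval (illustrative values, not tuning recommendations):

# runtime: allow 64 MB per feed cycle instead of the 8 MB default
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max

# persistent, in /etc/modprobe.d/zfs.conf:
options zfs l2arc_write_max=67108864 l2arc_feed_secs=1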


See zfs-0.6.2/module/zfs/arc.c

ARC
/*
 * DVA-based Adjustable Replacement Cache
 *
 * While much of the theory of operation used here is
 * based on the self-tuning, low overhead replacement cache
 * presented by Megiddo and Modha at FAST 2003, there are some
 * significant differences:
 *
 * 1. The Megiddo and Modha model assumes any page is evictable.
 * Pages in its cache cannot be "locked" into memory.  This makes
 * the eviction algorithm simple: evict the last page in the list.
 * This also makes the performance characteristics easy to reason
 * about.  Our cache is not so simple.  At any given moment, some
 * subset of the blocks in the cache are un-evictable because we
 * have handed out a reference to them.  Blocks are only evictable
 * when there are no external references active.  This makes
 * eviction far more problematic:  we choose to evict the evictable
 * blocks that are the "lowest" in the list.
 *
 * There are times when it is not possible to evict the requested
 * space.  In these circumstances we are unable to adjust the cache
 * size.  To prevent the cache growing unbounded at these times we
 * implement a "cache throttle" that slows the flow of new data
 * into the cache until we can make space available.
 *
 * 2. The Megiddo and Modha model assumes a fixed cache size.
 * Pages are evicted when the cache is full and there is a cache
 * miss.  Our model has a variable sized cache.  It grows with
 * high use, but also tries to react to memory pressure from the
 * operating system: decreasing its size when system memory is
 * tight.
 *
 * 3. The Megiddo and Modha model assumes a fixed page size. All
 * elements of the cache are therefore exactly the same size.  So
 * when adjusting the cache size following a cache miss, it's simply
 * a matter of choosing a single page to evict.  In our model, we
 * have variable sized cache blocks (ranging from 512 bytes to
 * 128K bytes).  We therefore choose a set of blocks to evict to make
 * space for a cache miss that approximates as closely as possible
 * the space used by the new block.
 *
 * See also:  "ARC: A Self-Tuning, Low Overhead Replacement Cache"
 * by N. Megiddo & D. Modha, FAST 2003
 */
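
Difference 3 is the one that most changes the eviction path: instead of evicting a single page, the cache must pick a set of evictable blocks whose combined size covers the incoming block. A minimal, self-contained C sketch of that idea (hypothetical types and names, not the actual arc.c implementation):

#include <stddef.h>
#include <stdio.h>

/* A cache buffer in a linked eviction list (tail = coldest). */
typedef struct buf {
        struct buf *prev;       /* toward the list head */
        size_t      size;       /* variable: 512 bytes .. 128K */
        int         refs;       /* >0: handed out, un-evictable */
} buf_t;

/*
 * Walk from the eviction end, skipping referenced buffers, until
 * `needed` bytes are covered.  Returns the bytes found; if it is
 * less than `needed`, the caller must throttle new data instead
 * of letting the cache grow unbounded.
 */
static size_t
choose_evictions(buf_t *tail, size_t needed)
{
        size_t freed = 0;
        for (buf_t *b = tail; b != NULL && freed < needed; b = b->prev) {
                if (b->refs > 0)
                        continue;       /* locked in memory: skip */
                freed += b->size;       /* a real cache unlinks+frees here */
        }
        return (freed);
}

int
main(void)
{
        buf_t a = { NULL, 131072, 0 };
        buf_t b = { &a, 512, 1 };       /* referenced: will be skipped */
        buf_t c = { &b, 4096, 0 };      /* list tail */
        printf("can free %zu of 65536 bytes wanted\n",
            choose_evictions(&c, 65536));
        return (0);
}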

/*
 * The locking model:
 *
 * A new reference to a cache buffer can be obtained in two
 * ways: 1) via a hash table lookup using the DVA as a key,
 * or 2) via one of the ARC lists.  The arc_read() interface
 * uses method 1, while the internal arc algorithms for
 * adjusting the cache use method 2.  We therefore provide two
 * types of locks: 1) the hash table lock array, and 2) the
 * arc list locks.
 *
 * Buffers do not have their own mutexes, rather they rely on the
 * hash table mutexes for the bulk of their protection (i.e. most
 * fields in the arc_buf_hdr_t are protected by these mutexes).
 *
 * buf_hash_find() returns the appropriate mutex (held) when it
 * locates the requested buffer in the hash table.  It returns
 * NULL for the mutex if the buffer was not in the table.
 *
 * buf_hash_remove() expects the appropriate hash mutex to be
 * already held before it is invoked.
 *
 * Each arc state also has a mutex which is used to protect the
 * buffer list associated with the state.  When attempting to
 * obtain a hash table lock while holding an arc list lock you
 * must use mutex_tryenter() to avoid deadlock.  Also note that
 * the active state mutex must be held before the ghost state mutex.
 *
 * Arc buffers may have an associated eviction callback function.
 * This function will be invoked prior to removing the buffer (e.g.
 * in arc_do_user_evicts()).  Note however that the data associated
 * with the buffer may be evicted prior to the callback.  The callback
 * must be made with *no locks held* (to prevent deadlock).  Additionally,
 * the users of callbacks must ensure that their private data is
 * protected from simultaneous callbacks from arc_buf_evict()
 * and arc_do_user_evicts().
 *
 * It is also possible to register a callback which is run when the
 * arc_meta_limit is reached and no buffers can be safely evicted.  In
 * this case the arc user should drop a reference on some arc buffers so
 * they can be reclaimed and the arc_meta_limit honored.  For example,
 * when using the ZPL each dentry holds a reference on a znode.  These
 * dentries must be pruned before the arc buffer holding the znode can
 * be safely evicted.
 *
 * Note that the majority of the performance stats are manipulated
 * with atomic operations.
 *
 * The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
 *
 *      - L2ARC buflist creation
 *      - L2ARC buflist eviction
 *      - L2ARC write completion, which walks L2ARC buflists
 *      - ARC header destruction, as it removes from L2ARC buflists
 *      - ARC header release, as it removes from L2ARC buflists
 */
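
The mutex_tryenter() rule exists because the two lock orders meet: the read path takes the hash lock first, while the eviction path already holds a list lock. A minimal user-space sketch of the same pattern using pthreads, with pthread_mutex_trylock() standing in for mutex_tryenter() (hypothetical structure, not the actual arc.c code; build with cc -pthread):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Eviction-side order: list lock first, then only *try* the hash
 * lock.  Blocking here could deadlock against a reader that holds
 * the hash lock and wants the list lock.
 */
static int
scan_one_buffer(void)
{
        pthread_mutex_lock(&list_lock);
        if (pthread_mutex_trylock(&hash_lock) != 0) {
                /* contended: back off and let the caller skip this buffer */
                pthread_mutex_unlock(&list_lock);
                return (0);
        }
        /* both locks held: safe to examine and move the buffer */
        pthread_mutex_unlock(&hash_lock);
        pthread_mutex_unlock(&list_lock);
        return (1);
}

int
main(void)
{
        printf("buffer processed: %d\n", scan_one_buffer());
        return (0);
}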


L2ARC
/*
 * Level 2 ARC
 *
 * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
 * It uses dedicated storage devices to hold cached data, which are populated
 * using large infrequent writes.  The main role of this cache is to boost
 * the performance of random read workloads.  The intended L2ARC devices
 * include short-stroked disks, solid state disks, and other media with
 * substantially faster read latency than disk.
 *
 *                 +-----------------------+
 *                 |         ARC           |
 *                 +-----------------------+
 *                    |         ^     ^
 *                    |         |     |
 *      l2arc_feed_thread()    arc_read()
 *                    |         |     |
 *                    |  l2arc read   |
 *                    V         |     |
 *               +---------------+    |
 *               |     L2ARC     |    |
 *               +---------------+    |
 *                   |    ^           |
 *          l2arc_write() |           |
 *                   |    |           |
 *                   V    |           |
 *                 +-------+      +-------+
 *                 | vdev  |      | vdev  |
 *                 | cache |      | cache |
 *                 +-------+      +-------+
 *                 +=========+     .-----.
 *                 :  L2ARC  :    |-_____-|
 *                 : devices :    | Disks |
 *                 +=========+    `-_____-'
 *
 * Read requests are satisfied from the following sources, in order:
 *
 *      1) ARC
 *      2) vdev cache of L2ARC devices
 *      3) L2ARC devices
 *      4) vdev cache of disks
 *      5) disks
 *
 * Some L2ARC device types exhibit extremely slow write performance.
 * To accommodate this there are some significant differences between
 * the L2ARC and traditional cache design:
 *
 * 1. There is no eviction path from the ARC to the L2ARC.  Evictions from
 * the ARC behave as usual, freeing buffers and placing headers on ghost
 * lists.  The ARC does not send buffers to the L2ARC during eviction as
 * this would add inflated write latencies for all ARC memory pressure.
 *
 * 2. The L2ARC attempts to cache data from the ARC before it is evicted.
 * It does this by periodically scanning buffers from the eviction-end of
 * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
 * not already there. It scans until a headroom of buffers is satisfied,
 * which itself is a buffer for ARC eviction. If a compressible buffer is
 * found during scanning and selected for writing to an L2ARC device, we
 * temporarily boost scanning headroom during the next scan cycle to make
 * sure we adapt to compression effects (which might significantly reduce
 * the data volume we write to L2ARC). The thread that does this is
 * l2arc_feed_thread(), illustrated below; example sizes are included to
 * provide a better sense of ratio than this diagram:
 *
 *             head -->                        tail
 *              +---------------------+----------+
 *      ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
 *              +---------------------+----------+   |   o L2ARC eligible
 *      ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
 *              +---------------------+----------+   |
 *                   15.9 Gbytes      ^ 32 Mbytes    |
 *                                 headroom          |
 *                                            l2arc_feed_thread()
 *                                                   |
 *                       l2arc write hand <--[oooo]--'
 *                               |           8 Mbyte
 *                               |          write max
 *                               V
 *                +==============================+
 *      L2ARC dev |####|#|###|###|    |####| ... |
 *                +==============================+
 *                           32 Gbytes
 *
 * 3. If an ARC buffer is copied to the L2ARC but then hit instead of
 * evicted, then the L2ARC has cached a buffer much sooner than it probably
 * needed to, potentially wasting L2ARC device bandwidth and storage.  It is
 * safe to say that this is an uncommon case, since buffers at the end of
 * the ARC lists have moved there due to inactivity.
 *
 * 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
 * then the L2ARC simply misses copying some buffers.  This serves as a
 * pressure valve to prevent heavy read workloads from both stalling the ARC
 * with waits and clogging the L2ARC with writes.  This also helps prevent
 * the potential for the L2ARC to churn if it attempts to cache content too
 * quickly, such as during backups of the entire pool.
 *
 * 5. After system boot and before the ARC has filled main memory, there are
 * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
 * lists can remain mostly static.  Instead of searching from the tail of these
 * lists as pictured, the l2arc_feed_thread() will search from the list heads
 * for eligible buffers, greatly increasing its chance of finding them.
 *
 * The L2ARC device write speed is also boosted during this time so that
 * the L2ARC warms up faster.  Since there have been no ARC evictions yet,
 * there are no L2ARC reads, and no fear of degrading read performance
 * through increased writes.
 *
 * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
 * the vdev queue can aggregate them into larger and fewer writes.  Each
 * device is written to in a rotor fashion, sweeping writes through
 * available space then repeating.
 *
 * 7. The L2ARC does not store dirty content.  It never needs to flush
 * write buffers back to disk based storage.
 *
 * 8. If an ARC buffer is written (and dirtied) which also exists in the
 * L2ARC, the now stale L2ARC buffer is immediately dropped.
 *
 * The performance of the L2ARC can be tweaked by a number of tunables, which
 * may be necessary for different workloads:
 *
 *      l2arc_write_max         max write bytes per interval
 *      l2arc_write_boost       extra write bytes during device warmup
 *      l2arc_noprefetch        skip caching prefetched buffers
 *      l2arc_nocompress        skip compressing buffers
 *      l2arc_headroom          number of max device writes to precache
 *      l2arc_headroom_boost    when we find compressed buffers during ARC
 *                              scanning, we multiply headroom by this
 *                              percentage factor for the next scan cycle,
 *                              since more compressed buffers are likely to
 *                              be present
 *      l2arc_feed_secs         seconds between L2ARC writing
 *
 * Tunables may be removed or added as future performance improvements are
 * integrated, and also may become zpool properties.
 *
 * There are three key functions that control how the L2ARC warms up:
 *
 *      l2arc_write_eligible()  check if a buffer is eligible to cache
 *      l2arc_write_size()      calculate how much to write
 *      l2arc_write_interval()  calculate sleep delay between writes
 *
 * These three functions determine what to write, how much, and how quickly
 * to send writes.
 */
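
To make the last two functions concrete, here is a simplified user-space sketch consistent with the description above; the real l2arc_write_size()/l2arc_write_interval() in arc.c differ in detail, and the constants below are illustrative:

#include <stdint.h>
#include <stdio.h>

static uint64_t l2arc_write_max   = 8 * 1024 * 1024;    /* bytes per feed */
static uint64_t l2arc_write_boost = 8 * 1024 * 1024;    /* warmup extra */
static uint64_t l2arc_feed_secs   = 1;                  /* base interval */
static int      arc_warm          = 0;                  /* no evictions yet */
static int      l2arc_feed_again  = 1;                  /* turbo warmup */

/* Write cap for this cycle: the base cap, plus the boost for as long
 * as the ARC has not filled memory (no L2ARC reads to degrade yet). */
static uint64_t
write_size(void)
{
        uint64_t size = l2arc_write_max;
        if (!arc_warm)
                size += l2arc_write_boost;
        return (size);
}

/* Sleep before the next feed: if the last cycle wrote more than half
 * of what it wanted, the ARC lists are busy, so feed again sooner
 * (the l2arc_feed_again / l2arc_feed_min_ms behaviour). */
static uint64_t
write_interval_secs(uint64_t wanted, uint64_t wrote)
{
        if (l2arc_feed_again && wrote > wanted / 2)
                return (0);     /* i.e. l2arc_feed_min_ms, not a full second */
        return (l2arc_feed_secs);
}

int
main(void)
{
        uint64_t want = write_size();
        printf("write up to %llu bytes, next feed in %llu s\n",
            (unsigned long long)want,
            (unsigned long long)write_interval_secs(want, want));
        return (0);
}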
