Flushing out pdflush

The kernel page cache contains in-memory copies of data blocks belonging to files kept in persistent storage. Pages which are written to by a processor, but not yet written to disk, are accumulated in cache and are known as "dirty" pages. The amount of dirty memory is listed in /proc/meminfo. Pages in the cache are flushed to disk after an interval of 30 seconds. Pdflush is a set of kernel threads which are responsible for writing the dirty pages to disk, either explicitly in response to a sync() call, or implicitly in cases when the page cache runs out of pages, if the pages have been in memory for too long, or there are too many dirty pages in the page cache (as specified by /proc/sys/vm/dirty_ratio).
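For a quick look at these values from user space, both the dirty page count and the dirty_ratio threshold can be read directly from procfs; a minimal sketch (standard Linux /proc paths, error handling kept short):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[128];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return 1;
        /* "Dirty:" reports the amount of dirty page cache, in kB */
        while (fgets(line, sizeof(line), f))
            if (!strncmp(line, "Dirty:", 6))
                fputs(line, stdout);
        fclose(f);

        /* dirty_ratio: percentage of memory that may be dirty before
         * processes doing writes must write out data themselves */
        f = fopen("/proc/sys/vm/dirty_ratio", "r");
        if (f) {
            if (fgets(line, sizeof(line), f))
                printf("dirty_ratio: %s", line);
            fclose(f);
        }
        return 0;
    }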

At a given point of time, there are between two and eight pdflush threads running in the system. The number of pdflush threads is determined by the load on the page cache; new pdflush threads are spawned if none of the existing pdflush threads have been idle for more than one second and there is more work in the pdflush work queue. On the other hand, if the last active pdflush thread has been asleep for more than one second, one thread is terminated. Termination of threads happens until only a minimum number of pdflush threads remain. The current number of running pdflush threads is reflected by /proc/sys/vm/nr_pdflush_threads.
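The sizing policy amounts to a small decision function. The sketch below is an illustrative restatement of the rules just described, not the kernel's actual code; the function and parameter names are made up for the example:

    #define MIN_PDFLUSH_THREADS 2
    #define MAX_PDFLUSH_THREADS 8

    /*
     * Illustrative policy: returns +1 to spawn a pdflush thread,
     * -1 to let one exit, 0 to leave the pool as it is.
     */
    static int pdflush_resize(int nr_threads, int busy_secs,
                              int idle_secs, int work_pending)
    {
        /* No thread has been idle for over a second and more work
         * is queued: grow the pool, up to the maximum. */
        if (work_pending && busy_secs > 1 &&
            nr_threads < MAX_PDFLUSH_THREADS)
            return 1;
        /* The last active thread has slept for over a second:
         * shrink the pool, down to the minimum. */
        if (idle_secs > 1 && nr_threads > MIN_PDFLUSH_THREADS)
            return -1;
        return 0;
    }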

A number of pdflush-related issues have come to light over time. Pdflush threads are common to all block devices, but it is thought that they would perform better if they concentrated on a single disk spindle. Contention between pdflush threads is avoided through the use of the BDI_pdflush flag on the backing_dev_info structure, but this interlock can also limit writeback performance. Another issue with pdflush is request starvation. There is a fixed number of I/O requests available for each queue in the system. If the limit is exceeded, any application requesting I/O will block waiting for a new slot. Since pdflush works on several queues, it cannot block on a single queue; instead, it sets the wbc->nonblocking writeback information flag. If other applications continue to write on the device, pdflush will not succeed in allocating request slots. This may lead to starvation of access to the queue, if pdflush repeatedly finds the queue congested.
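The pattern that leads to starvation is easy to see in outline. The sketch below uses pared-down stand-ins for the kernel's writeback_control and queue state; only the nonblocking logic is the point:

    /* Pared-down stand-in for the kernel's writeback_control. */
    struct writeback_control {
        int nonblocking;            /* do not block on a congested queue */
        int encountered_congestion; /* set when a queue had to be skipped */
    };

    /*
     * A blocking writer simply waits for a request slot.  Pdflush,
     * which services many queues, must not block on one of them, so
     * it skips the congested queue and notes the congestion.  If other
     * writers keep the queue full, every later visit finds it congested
     * again: the queue is effectively starved of pdflush service.
     */
    static int try_writeback(struct writeback_control *wbc, int congested)
    {
        if (congested && wbc->nonblocking) {
            wbc->encountered_congestion = 1;
            return -1;    /* give up; retry on a later pass */
        }
        /* ... allocate a request slot (possibly blocking), issue I/O ... */
        return 0;
    }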

Jens Axboe in his patch set proposes a new idea of using flusher threads per backing device info (BDI), as a replacement for pdflush threads. Unlike pdflush threads, per-BDI flusher threads focus on a single disk spindle. With per-BDI flushing, when the request_queue is congested, blocking happens on request allocation, avoiding request starvation and providing better fairness.

With pdflush, the dirty inode list is stored by the super block of the filesystem. Since the per-BDI flusher needs to be aware of the dirty pages to be written by its assigned device, this list is now stored by the BDI. Calls to flush dirty inodes on the superblock result in flushing the inodes from the list of dirty inodes on the backing device, for all devices listed for the filesystem.
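In terms of data structures, the change simply moves the list's home; roughly (pared-down fragments, with list_head declared locally so the sketch stands alone):

    struct list_head { struct list_head *next, *prev; };

    /* pdflush era: dirty inodes tracked per filesystem */
    struct super_block {
        struct list_head s_dirty;      /* dirty inodes on this sb */
    };

    /* per-BDI flushing: dirty inodes tracked per backing device,
     * inside the bdi_writeback structure shown below */
    struct backing_dev_info {
        struct bdi_writeback *wb;      /* holds the b_dirty list */
    };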

As with pdflush, per-BDI writeback is controlled through the writeback_control data structure, which instructs the writeback code what to do and how to perform the writeback. The important fields of this structure are listed here; a short usage sketch follows the list:

  • sync_mode: defines the way synchronization should be performed with respect to inode locking. If set to WB_SYNC_NONE, the writeback will skip locked inodes, whereas if set to WB_SYNC_ALL it will wait for locked inodes to be unlocked before performing the writeback.

  • nr_to_write: the number of pages to write. This value is decremented as the pages are written.

  • older_than_this: If not NULL, all inodes older than the jiffies recorded in this field are flushed. This field takes precedence over nr_to_write.
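As an illustration, a sync-style pass over at most 1024 pages might fill the structure along these lines (a fragment, not a complete kernel call chain):

    struct writeback_control wbc = {
        .sync_mode       = WB_SYNC_ALL, /* wait on locked inodes: a real sync */
        .nr_to_write     = 1024,        /* budget, decremented per page written */
        .older_than_this = NULL,        /* no age cutoff: flush regardless of age */
    };
    /* The writeback code is then invoked with &wbc; the pass ends when
     * nr_to_write reaches zero or the dirty list is exhausted. */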

The struct bdi_writeback keeps all information required for flushing the dirty pages:

    struct bdi_writeback {
        struct backing_dev_info *bdi;
        unsigned int            nr;
        struct task_struct      *task;
        wait_queue_head_t       wait;
        struct list_head        b_dirty;
        struct list_head        b_io;
        struct list_head        b_more_io;

        unsigned long           nr_pages;
        struct super_block      *sb;
    };

The bdi_writeback structure is initialized when the device is registered through bdi_register(). The fields of the bdi_writeback are:

  • bdi: the backing_dev_info associated with this bdi_writeback,

  • task: contains the pointer to the default flusher thread, which is responsible for spawning threads to perform the flushing work,

  • wait: a wait queue for synchronizing with the flusher threads,

  • b_dirty: list of all the dirty inodes on this BDI to be flushed,

  • b_io: inodes parked for I/O,

  • b_more_io: more inodes parked for I/O; all inodes queued for flushing are inserted in this list before being moved to b_io,

  • nr_pages: total number of pages to be flushed, and

  • sb: the pointer to the superblock of the filesystem which resides on this BDI.

nr_pages and sb are parameters passed asynchronously to the BDI flush thread, and are not fixed through the life of the bdi_writeback. This is done to accommodate devices with multiple filesystems, and hence multiple super_blocks. With multiple super_blocks on a single device, a sync can be requested for a single filesystem on the device.

The bdi_writeback_task() function waits for the dirty_writeback_interval, which by default is five seconds, and initiates wb_do_writeback(wb) periodically. If no pages are written for five minutes, the flusher thread exits (with a grace period of dirty_writeback_interval). If writeback work is required after the thread has exited, new flusher threads are spawned by the default writeback thread.
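Schematically, the flusher thread's main loop has the following shape (a simplification of bdi_writeback_task(); the interval is written as a literal jiffies value and helper details are elided):

    /* Simplified shape of the per-BDI flusher thread's main loop. */
    static int bdi_flusher_thread(struct bdi_writeback *wb)
    {
        unsigned long last_active = jiffies;

        for (;;) {
            if (wb_do_writeback(wb) > 0)    /* pages were written */
                last_active = jiffies;

            /* Nothing written for five minutes: exit.  The default
             * writeback thread respawns us if work shows up again. */
            if (time_after(jiffies, last_active + 5 * 60 * HZ))
                break;

            /* Sleep for dirty_writeback_interval (five seconds by
             * default) before the next periodic pass. */
            schedule_timeout_interruptible(5 * HZ);
        }
        return 0;
    }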

Writeback flushes are done in two ways:

  • pdflush style: This is initiated in response to an explicit writeback request, for example when syncing inode pages of a super_block. wb_start_writeback() is called with the superblock information and the number of pages to be flushed. The function tries to acquire the bdi_writeback structure associated with the BDI. If successful, it stores the superblock pointer and the number of pages to be flushed in the bdi_writeback structure and wakes up the flusher thread to perform the actual writeout for the superblock. This is different from how pdflush performs writeouts: pdflush attempts to grab the device from the writeout path, blocking the writeouts from other processes.

  • kupdated style: If there are no explicit writeback requests, the thread wakes up periodically to flush dirty data. The first time one of the inode's pages stored in the BDI is dirtied, the dirtying time is recorded in the inode's address space. The periodic writeback code walks through the superblock's inode list, writing back dirty pages of the inodes older than a specified point in time. This is run once per dirty_writeback_interval, which defaults to five seconds. A simplified sketch of this age-based scan follows below.
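This is where the older_than_this cutoff described earlier comes into play. Schematically, the periodic pass reduces to an age check while walking the dirty list (a sketch; the flush helper is hypothetical, and the real code funnels through writeback_control):

    unsigned long cutoff = jiffies - 30 * HZ;    /* expiry age, 30s here */
    struct inode *inode, *next;

    list_for_each_entry_safe(inode, next, &wb->b_dirty, i_list) {
        if (time_after(inode->dirtied_when, cutoff))
            continue;                 /* dirtied too recently; leave it */
        flush_inode_pages(inode);     /* hypothetical flush helper */
    }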

After review of the first attempt, Jens added the functionality of having multiple flusher threads per device, based on the suggestions of Andrew Morton. Dave Chinner suggested that filesystems would like to have a flusher thread per allocation group. In the patch set (second iteration) which followed, Jens added a new interface in the superblock to return the bdi_writeback structure associated with the inode:

    struct bdi_writeback *(*inode_get_wb) (struct inode *);

If inode_get_wb is NULL, the default bdi_writeback of the BDI is returned, which means there is only one bdi_writeback thread for the BDI. The maximum number of threads that can be started per BDI is 32.
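A filesystem could use this hook to spread inodes across several flusher threads, say one per allocation group as Dave Chinner suggested. A hypothetical implementation (the wb_tbl and wb_cnt fields are invented for the example, not taken from the patch set):

    static struct bdi_writeback *myfs_inode_get_wb(struct inode *inode)
    {
        struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;

        /* Hash the inode number onto one of the BDI's flusher threads. */
        return &bdi->wb_tbl[inode->i_ino % bdi->wb_cnt];
    }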

Initial experiments conducted by Jens found an 8% increase in performance on a simple SATA drive running the Flexible File System Benchmark (ffsb). File layout was smoother as compared to the vanilla kernel, as reported by vmstat, with a uniform distribution of buffers written out. With a ten-disk btrfs filesystem, per-BDI flushing performed 25% faster. The writeback work is tracked in Jens's block layer git tree (git://git.kernel.dk/linux-2.6-block.git) under the "writeback" branch. There have been no comments on the second iteration so far, but per-BDI flusher threads are still not ready to go into the 2.6.30 tree.

Acknowledgments: Thanks to Jens Axboe for reviewing and explaining certain aspects of the patch set.
