Linux DirectIO Mechanism Analysis

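DirectIO is an option that applies to writes (strictly speaking, it is the O_DIRECT flag passed to open(2), which then governs every write on that file descriptor). It asks for data to go straight to the disk rather than into the cache, with the aim that critical data still reaches the disk even if the system fails. For the general write path, see my earlier post on writing files; the DirectIO flow picks up where that write flow leaves off.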
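To make the flag concrete from the user-space side, here is a minimal sketch that is not part of the original analysis. It assumes Linux with glibc, a 4096-byte alignment (the real requirement depends on the filesystem and the device), and a hypothetical helper name, `direct_write_block`. The point it shows is that O_DIRECT goes to open(2), and that the buffer, the offset, and the length all need to be aligned.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one zero-filled block of `block` bytes at
 * offset 0 of `path` through O_DIRECT. Returns the number of bytes
 * written, or -errno on failure (for example -EINVAL on a filesystem
 * that does not support O_DIRECT).
 */
static ssize_t direct_write_block(const char *path, size_t block)
{
    void *buf = NULL;
    ssize_t n;
    int fd, rc;

    /* Assumption: the alignment is a power of two such as 512 or 4096. */
    assert(block != 0 && (block & (block - 1)) == 0);

    /* O_DIRECT needs an aligned user buffer, not a plain malloc() one. */
    rc = posix_memalign(&buf, block, block);
    if (rc != 0)
        return -rc;
    memset(buf, 0, block);

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) {
        n = -errno;
        free(buf);
        return n;
    }

    n = write(fd, buf, block);   /* offset 0 and length are both aligned */
    if (n < 0)
        n = -errno;

    close(fd);
    free(buf);
    return n;
}
```

Calling `direct_write_block("/tmp/dio_demo.bin", 4096)` should return 4096 where the filesystem accepts O_DIRECT, and a negative errno value otherwise. A successful return only says the page cache was bypassed; asking for durability is a separate step (fdatasync(2) or O_DSYNC).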

When the kernel reaches __generic_file_aio_write, it tests file->f_flags & O_DIRECT to decide whether to enter the DirectIO branch:

    if (unlikely(file->f_flags & O_DIRECT)) {
        loff_t endbyte;
        ssize_t written_buffered;

        written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                            ppos, count, ocount);
        if (written < 0 || written == count)
            goto out;
        /*
         * direct-io write to a hole: fall through to buffered I/O
         * for completing the rest of the request.
         */
        pos += written;
        count -= written;
        written_buffered = generic_file_buffered_write(iocb, iov,
                                                       nr_segs, pos, ppos, count,
                                                       written);
        /*
         * If generic_file_buffered_write() returned a synchronous error
         * then we want to return the number of bytes which were
         * direct-written, or the error code if that was zero. Note
         * that this differs from normal direct-io semantics, which
         * will return -EFOO even if some bytes were written.
         */
        if (written_buffered < 0) {
            err = written_buffered;
            goto out;
        }

        /*
         * We need to ensure that the page cache pages are written to
         * disk and invalidated to preserve the expected O_DIRECT
         * semantics.
         */
        endbyte = pos + written_buffered - written - 1;
        err = filemap_write_and_wait_range(file->f_mapping, pos, endbyte);
        if (err == 0) {
            written = written_buffered;
            invalidate_mapping_pages(mapping,
                                     pos >> PAGE_CACHE_SHIFT,
                                     endbyte >> PAGE_CACHE_SHIFT);
        } else {
            /*
             * We don't know how much we wrote, so just return
             * the number of bytes which were direct-written
             */
        }
    }
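The byte and page arithmetic at the end of this branch is easy to misread, so here is a small user-space model of it. It is a sketch, not kernel code. It assumes 4 KiB pages, and it assumes (as I read the source of this era) that generic_file_buffered_write returns the running total, direct bytes plus buffered bytes, because `written` is passed in as its last argument. The names `fallback_flush_range`, `struct flush_range`, and `MODEL_PAGE_SHIFT` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12   /* assumption: 4 KiB pages (PAGE_CACHE_SHIFT) */

struct flush_range {
    int64_t first_byte;   /* `pos` after "pos += written" */
    int64_t last_byte;    /* `endbyte` in the kernel code */
    int64_t first_page;   /* first_byte >> PAGE_CACHE_SHIFT */
    int64_t last_page;    /* last_byte  >> PAGE_CACHE_SHIFT */
};

/*
 * orig_pos:     file offset at which the whole write started
 * direct_bytes: bytes completed by the direct write (`written`)
 * total_bytes:  value returned by the buffered fallback
 *               (`written_buffered`, assumed to include direct_bytes)
 */
static struct flush_range fallback_flush_range(int64_t orig_pos,
                                               int64_t direct_bytes,
                                               int64_t total_bytes)
{
    struct flush_range r;

    /* The range is only meaningful if the fallback wrote something. */
    assert(orig_pos >= 0 && direct_bytes >= 0);
    assert(total_bytes > direct_bytes);

    r.first_byte = orig_pos + direct_bytes;                  /* pos += written */
    r.last_byte  = r.first_byte + total_bytes - direct_bytes - 1;
    r.first_page = r.first_byte >> MODEL_PAGE_SHIFT;
    r.last_page  = r.last_byte >> MODEL_PAGE_SHIFT;
    return r;
}
```

For example, a 10000-byte request at offset 0 in which the direct part completed 4096 bytes leaves bytes 4096 through 9999 to the buffered path, which covers page indexes 1 and 2. Those are the pages that get written back and then invalidated.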

Taking things in order, look at generic_file_direct_write first. Three calls do the real work in it: filemap_write_and_wait_range, invalidate_inode_pages2_range, and mapping->a_ops->direct_IO.

filemap_write_and_wait_range mainly flushes the dirty pages under the mapping. Under __filemap_fdatawrite_range it ends up calling do_writepages:

    int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
    {
        int ret;

        if (wbc->nr_to_write <= 0)
            return 0;
        if (mapping->a_ops->writepages)
            ret = mapping->a_ops->writepages(mapping, wbc);
        else
            ret = generic_writepages(mapping, wbc);
        return ret;
    }

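filemap_write_and_wait_range returns 0 on success and a negative error code on failure. If it returns nonzero, generic_file_direct_write bails out and the two later calls are not executed. My understanding of why the flush comes first: once a direct write happens, the related data cached earlier should be flushed to disk along with it. Otherwise the direct_IO data would already be on disk while the earlier cached data is not, and after a system failure the file system would be left in a broken state.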

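If the flush succeeds and mapping->nrpages is nonzero, invalidate_inode_pages2_range runs. It checks whether memory currently holds cached pages covering the range that direct_IO is about to write and, if so, marks them invalid. The reason is that data written by direct_IO is not cached. If a cached copy existed before the direct_IO write, even a clean one, then after direct_IO completes the cache and the disk disagree, and a read served from the cache without protection would not return what is on disk. If a cached page cannot be invalidated (the call returns -EBUSY), generic_file_direct_write returns 0 without calling direct_IO, and the caller falls back to a buffered write.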

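Only then do we reach the real subject, mapping->a_ops->direct_IO. For ext3 it is defined in struct address_space_operations ext3_ordered_aops as ext3_direct_IO, whose core is __blockdev_direct_IO. direct_io_worker assembles the dio structure, and dio_bio_submit, which boils down to submit_bio(dio->rw, bio), hands the request to the I/O layer. Compared with other reads and writes, direct_io skips the buffer layer and does not wait for the intermediate threads pdflush and kjournald to flush to the I/O layer periodically. Even at this point the data is not necessarily on disk; direct_IO simply assumes that the device driver adds no significant delay.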

Once mapping->a_ops->direct_IO has finished, invalidate_inode_pages2_range is run a second time. The kernel comment gives the reason:

    /*
     * Finally, try again to invalidate clean pages which might have been
     * cached by non-direct readahead, or faulted in by get_user_pages()
     * if the source of the write was an mmap'ed region of the file
     * we're writing. Either one is a pretty crazy thing to do,
     * so we don't support it 100%. If this invalidation
     * fails, tough, the write still worked...
     */
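The order of the three steps and their early exits can be summarised in a small user-space model. This is only a sketch of the control flow as I read it in kernels of this era, not the kernel function itself. `struct step_results` and `model_direct_write` are hypothetical names, and each step is reduced to the value it would return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed outcome of each step, supplied by the caller of the model. */
struct step_results {
    int  flush;       /* filemap_write_and_wait_range: 0 or -errno */
    int  invalidate;  /* first invalidate_inode_pages2_range: 0, -EBUSY, ... */
    long direct_io;   /* mapping->a_ops->direct_IO: bytes written or -errno */
};

/*
 * Returns what generic_file_direct_write would return in this model.
 * *reached_direct_io is set to 1 if the direct_IO step was reached,
 * and to 0 if the function left before it.
 */
static long model_direct_write(const struct step_results *r, long nrpages,
                               int *reached_direct_io)
{
    assert(r != NULL && reached_direct_io != NULL);
    *reached_direct_io = 0;

    /* Step 1: flush dirty pages in the range; any error ends the write. */
    if (r->flush)
        return r->flush;

    /* Step 2: drop cached pages, but only if the mapping has any. */
    if (nrpages) {
        if (r->invalidate == -EBUSY)
            return 0;              /* caller falls back to buffered write */
        if (r->invalidate)
            return r->invalidate;
    }

    /* Step 3: the direct write itself. */
    *reached_direct_io = 1;

    /*
     * Step 4: the second invalidation pass runs here in the kernel when
     * nrpages is nonzero; its result is ignored, so the model has
     * nothing to do for it.
     */
    return r->direct_io;
}
```

In this model a return value of 0 from the -EBUSY path is what makes the caller see `written != count` and continue with the buffered write.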

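When a system becomes very complex, it is hard to find a fully rigorous guarantee for the whole procedure. Sometimes a crude, brute-force approach is simple and effective.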

Now step back to __generic_file_aio_write once more:

    written = generic_file_direct_write(iocb, iov, &nr_segs, pos,
                                        ppos, count, ocount);
    if (written < 0 || written == count)
        goto out;
    /*
     * direct-io write to a hole: fall through to buffered I/O
     * for completing the rest of the request.
     */
    pos += written;
    count -= written;
    written_buffered = generic_file_buffered_write(iocb, iov,
                                                   nr_segs, pos, ppos, count,
                                                   written);

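If generic_file_direct_write returns a non-negative value other than count, the rest of the request goes through the buffered write, generic_file_buffered_write. From the analysis above, that happens when cached pages covering the range cannot be invalidated, so the function returns 0, or, as the kernel comment says, when the direct write lands on a hole and completes fewer bytes than the expected count.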

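So the so-called direct_IO does not fully guarantee that the buffer layer is bypassed; under certain conditions the write is buffered after all. Where DirectIO is strictly required, you have to avoid these cases and keep the file's cache mappings under control.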

The small tool vmtouch is a simple and effective way to control the cache.
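The same kind of cache control can be attempted from a program. The sketch below is an assumption-laden illustration rather than a description of how vmtouch works: it assumes Linux, uses a hypothetical helper name, `flush_and_evict`, and relies on posix_fadvise(2) with POSIX_FADV_DONTNEED, which is only a hint that the kernel may honour partially. It writes dirty pages back first, because dirty pages cannot be dropped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write back a file's dirty pages, then ask the
 * kernel to drop its cached pages. Returns 0 on success or a negative
 * errno value.
 */
static int flush_and_evict(const char *path)
{
    int fd, rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    /* Dirty pages are not dropped by the hint, so flush them first. */
    if (fdatasync(fd) != 0) {
        rc = -errno;
        close(fd);
        return rc;
    }

    /*
     * offset 0 with len 0 covers the whole file. posix_fadvise returns
     * an errno value directly instead of setting errno.
     */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return -rc;
}
```

Running `flush_and_evict()` on a file before opening it with O_DIRECT reduces the chance that leftover cached pages get in the way. It does nothing about holes, so a file that must always take the direct path would also need its blocks allocated in advance.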

Linux DirectIO Mechanism Analysis comes from OenHan.

Link: https://oenhan.com/ext3-fs-directio
