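When we write database software or other I/O-intensive programs, we usually want updates to be durable. A database, for example, calls fsync or fdatasync at the end of each update to flush the data to persistent storage, typically one page at a time: 16K or 4K per call. That buys safety, but at a heavy cost in performance. Moreover, an update usually overwrites an existing page of the file. Because it is an in-place overwrite, the file system's metadata does not need to change, so we generally do not care whether the metadata gets written. When updates are very frequent, is there some other way to reduce the performance loss? Enter sync_file_range.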
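On Linux, the sync_file_range system call is available only in kernel 2.6.17 and later; the RHEL 5U4 we commonly use supports it.
You can also run man sync_file_range to see exactly what it does, but note that sync_file_range is not portable.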
sync_file_range – sync a file segment with disk
sync_file_range() permits fine control when synchronising the open file referred to by the file descriptor fd with disk.
offset is the starting byte of the file range to be synchronised. nbytes specifies the length of the range to be synchronised, in bytes; if nbytes is zero, then all bytes from offset through to the end of file are synchronised. Synchronisation is in units of the system page size: offset is rounded down to a page boundary; (offset+nbytes-1) is rounded up to a page boundary.
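The rounding rule quoted above can be sketched as a small piece of arithmetic. This is only an illustration of how the requested byte range expands to whole pages; the names page_range and covered_pages are hypothetical, and the sketch assumes the page size is a power of two and nbytes is greater than zero.

```c
#include <stdint.h>

/* Byte range [start, end) that a request expands to after page rounding. */
struct page_range {
    uint64_t start;  /* offset rounded down to a page boundary */
    uint64_t end;    /* first byte past the last covered page */
};

/* Hypothetical helper: page_size must be a power of two, nbytes must be > 0. */
static struct page_range covered_pages(uint64_t offset, uint64_t nbytes,
                                       uint64_t page_size)
{
    struct page_range r;
    uint64_t last = offset + nbytes - 1;     /* last requested byte */
    r.start = offset & ~(page_size - 1);     /* round down */
    r.end = (last | (page_size - 1)) + 1;    /* round up to the end of its page */
    return r;
}
```

With 4096-byte pages, a 100-byte update at offset 5000 expands to the single page [4096, 8192), while a 200-byte update at offset 4000 straddles a boundary and expands to two pages, [0, 8192).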
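sync_file_range lets us make several updates and then flush the data in one go, which can greatly improve I/O performance. The implementation lives in fs/sync.c, for anyone interested in taking a look.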
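A minimal sketch of the "several updates, one flush" idea follows. The helper name update_pages_batched and the fixed PAGE_SZ of 4096 are assumptions for illustration, and the sketch assumes the pages being overwritten are contiguous and already allocated in the file.

```c
#define _GNU_SOURCE           /* exposes sync_file_range() in <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 4096          /* assumed page size for this sketch */

/* Hypothetical helper: overwrite `count` consecutive pages starting at page
 * index `first` with the contents of `page`, then flush the whole run with
 * a single call. Returns 0 on success, -1 on failure. */
static int update_pages_batched(int fd, off_t first, int count, const char *page)
{
    for (int i = 0; i < count; i++) {
        off_t off = (first + i) * (off_t)PAGE_SZ;
        if (pwrite(fd, page, PAGE_SZ, off) != PAGE_SZ)
            return -1;
    }
    /* One synchronous flush for the whole run instead of one per page. */
    return sync_file_range(fd, first * (off_t)PAGE_SZ, (off_t)count * PAGE_SZ,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

Compared with calling fdatasync after every page, this issues one flush request per batch, which is where the saving is expected to come from.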
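The well-known fio benchmarking tool can use sync_file_range for its sync operations, so before deciding to use this syscall in our application it is worth running a fio test first.
sync_file_range can target a sub-range of a file and write back only the dirty pages in that range, rather than the whole file.
The benefit is this: when we modify a large file and dirty a large number of data blocks, the final fsync can be very slow. Even fdatasync has a problem. If the length of the file changed while we were modifying it, fdatasync will write metadata as well, and metadata writes within a single file system are serialized, which inevitably affects other users' metadata operations (such as creating files).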
(When the file size changes, fdatasync must write metadata; see http://blog.csdn.net/xiaofei0859/article/details/51144313)
(The area that stores a file's meta-information, its metadata, is called the inode; see http://blog.csdn.net/xiaofei0859/article/details/50981511)
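sync_file_range never writes metadata, which makes it a good fit here: each time we make a small modification to the file, we immediately call sync_file_range to flush the corresponding dirty data to disk. Then, once we have finished modifying the file, the final fdatasync (flush dirty data pages) or fsync (flush dirty data plus metadata pages) is fast. Because it skips metadata, sync_file_range on its own is only safe for overwrites of already-allocated blocks, as its man page warns, so the final fdatasync or fsync is still needed.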
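The pattern just described might look like the sketch below. The function names overwrite_and_flush and finish_updates are hypothetical, and the sketch assumes every write is an overwrite of existing file contents with len greater than zero (an nbytes of zero would mean "through end of file").

```c
#define _GNU_SOURCE           /* exposes sync_file_range() in <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite [off, off+len) and push just that range to
 * disk right away, waiting for the write-out to complete. */
static int overwrite_and_flush(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return sync_file_range(fd, off, (off_t)len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Hypothetical helper: call once after the last modification.
 * sync_file_range() wrote no metadata, so this call is still required;
 * it should find few or no dirty data pages left to write. */
static int finish_updates(int fd)
{
    return fdatasync(fd);
}
```

A caller would invoke overwrite_and_flush for each small change and finish_updates once at the end, replacing fdatasync with fsync if the metadata must be flushed too.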
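A note on the flags of sync_file_range: SYNC_FILE_RANGE_WRITE is asynchronous. So to achieve the goal above, it is best not to use the asynchronous mode, or at least, before calling fdatasync or fsync, to issue one whole-file sync_file_range with SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER. This ensures that all of the file's dirty pages have been flushed to disk before fdatasync or fsync is called.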
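If the asynchronous mode is used anyway, the whole-file barrier described above could be sketched as follows. The names write_async and barrier_then_sync are hypothetical; the barrier passes offset 0 and nbytes 0, which the man page defines as "from offset through to the end of file".

```c
#define _GNU_SOURCE           /* exposes sync_file_range() in <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite a range and only start its write-out.
 * The call returns without waiting for the data to reach the disk. */
static int write_async(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return sync_file_range(fd, off, (off_t)len, SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: wait for every page of the file, then sync. */
static int barrier_then_sync(int fd)
{
    /* offset 0, nbytes 0: the whole file, waited on before and after. */
    if (sync_file_range(fd, 0, 0,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;
    return fdatasync(fd);
}
```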
SYNC_FILE_RANGE_WAIT_BEFORE
    Wait upon write-out of all pages in the specified range that
    have already been submitted to the device driver for write-out
    before performing any write.
SYNC_FILE_RANGE_WRITE
    Initiate write-out of all dirty pages in the specified range
    which are not presently submitted for write-out. Note that even
    this may block if you attempt to write more than request queue
    size.
SYNC_FILE_RANGE_WAIT_AFTER
    Wait upon write-out of all pages in the range after performing
    any write.
Useful combinations of the flags bits are:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE
    Ensures that all pages in the specified range which were dirty
    when sync_file_range() was called are placed under write-out.
    This is a start-write-for-data-integrity operation.
SYNC_FILE_RANGE_WRITE
    Start write-out of all dirty pages in the specified range
    which are not presently under write-out. This is an
    asynchronous flush-to-disk operation. This is not suitable
    for data integrity operations.
SYNC_FILE_RANGE_WAIT_BEFORE (or SYNC_FILE_RANGE_WAIT_AFTER)
    Wait for completion of write-out of all pages in the specified
    range. This can be used after an earlier
    SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE operation
    to wait for completion of that operation, and obtain its
    result.
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER
    This is a write-for-data-integrity operation that will ensure
    that all pages in the specified range which were dirty when
    sync_file_range() was called are committed to disk.
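The first and third combinations in the list above can be paired to split a flush into a "start" step and a later "wait" step, so that other work can overlap with the write-out. The sketch below is one way to express that; start_writeout and wait_writeout are hypothetical names, not part of any library.

```c
#define _GNU_SOURCE           /* exposes sync_file_range() in <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: place every page that is dirty in [off, off+len)
 * under write-out (the start-write-for-data-integrity combination). */
static int start_writeout(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE);
}

/* Hypothetical helper: wait for the write-out started earlier on the same
 * range and collect its result; returns -1 with errno set on failure. */
static int wait_writeout(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE);
}
```

A caller would run start_writeout right after modifying a range, carry on with unrelated work, and call wait_writeout on the same range before reporting the update as durable, still followed by fdatasync or fsync if metadata could have changed.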
[References]
1. http://man7.org/linux/man-pages/man2/sync_file_range.2.html