ext4 and data loss

By Jonathan Corbet
March 11, 2009
The ext4 filesystem offers a number of useful features. It has been stabilizing quickly, but that does not mean that it will work perfectly for everybody. Consider this example: Ubuntu's bug tracker contains an entry titled "ext4 data loss", wherein a luckless ext4 user reports:

Today, I was experimenting with some BIOS settings that made the system crash right after loading the desktop. After a clean reboot pretty much any file written to by any application (during the previous boot) was 0 bytes.

Your editor had not intended to write (yet) about this issue, but quite a few readers have suggested that we take a look at it. Since there is clearly interest, here is a quick look at what is going on.

Early Unix (and Linux) systems were known for losing data on a system crash. The buffering of filesystem writes within the kernel, while being very good for performance, causes the buffered data to be lost should the system go down unexpectedly. Users of Unix systems used to be quite aware of this possibility; they worried about it, but the performance loss associated with synchronous writes was generally not seen to be worth it. So application writers took great pains to ensure that any data which really needed to be on the physical media got there quickly.

More recent Linux users may be forgiven for thinking that this problem has been entirely solved; with the ext3 filesystem, system crashes are far less likely to result in lost data. This outcome is almost an accident resulting from some decisions made in the design of ext3. What's happening is this:

  • By default, ext3 will commit changes to its journal every five seconds. What that means is that any filesystem metadata changes will be saved, and will persist even if the system subsequently crashes.

  • Ext3 does not (by default) save data written to files in the journal. But, in the (default) data=ordered mode, any modified data blocks are forced out to disk before the metadata changes are committed to the journal. This forcing of data is done to ensure that, should the system crash, a user will not be able to read the previous contents of the affected blocks - it's a security feature.

  • The end result is that data=ordered pretty much guarantees that data written to files will actually be on disk five seconds later. So, in general, only five seconds worth of writes might be lost as the result of a crash.

In other words, ext3 provides a relatively high level of crash resistance, even though the filesystem's authors never guaranteed that behavior, and POSIX certainly does not require it. As Ted Ts'o put it in his excruciatingly clear and understandable explanation of the situation:

Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data --- even though POSIX never really made any such guarantee.

Accidental or not, the avoidance of data loss in a crash seems like a nice feature for a filesystem to have. So one might well wonder just what would have inspired the ext4 developers to take it away. The answer, of course, is performance - and delayed allocation in particular.

"Delayed allocation" means that the filesystem tries to delay the allocation of physical disk blocks for written data for as long as possible. This policy brings some important performance benefits. Many files are short-lived; delayed allocation can keep the system from writing fleeting temporary files to disk at all. And, for longer-lived files, delayed allocation allows the kernel to accumulate more data and to allocate the blocks for data contiguously, speeding up both the write and any subsequent reads of that data. It's an important optimization which is found in most contemporary filesystems.

But, if blocks have not been allocated for a file, there is no need to write them quickly as a security measure. Since the blocks do not yet exist, it is not possible to read somebody else's data from them. So ext4 will not (cannot) write out unallocated blocks as part of the next journal commit cycle. Those blocks will, instead, wait until the kernel decides to flush them out; at that point, physical blocks will be allocated on disk and the data will be made persistent. The kernel doesn't like to let file data sit unwritten for too long, but it can still take a minute or so (with the default settings) for that data to be flushed - far longer than the five seconds normally seen with ext3. And that is why a crash can cause the loss of quite a bit more data when ext4 is being used.

The real solution to this problem is to fix the applications which are expecting the filesystem to provide more guarantees than it really is. Applications which frequently rewrite numerous small files seem to be especially vulnerable to this kind of problem; they should use a smarter on-disk format. Applications which want to be sure that their files have been committed to the media can use the fsync() or fdatasync() system calls; indeed, that's exactly what those system calls are for. Bringing the applications back into line with what the system is really providing is a better solution than trying to fix things up at other levels.
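What that advice might look like in practice can be sketched in Python, assuming a small file that is rewritten in full each time and a POSIX system. The helper name durable_replace() and the ".tmp-" prefix are hypothetical, chosen only for illustration; os.fsync() and os.replace() are the standard-library wrappers around fsync() and rename().

```python
import os
import tempfile


def durable_replace(path, data):
    """Replace the contents of `path` with `data` (bytes).

    Intended outcome: after a crash, the file holds either the complete
    old contents or the complete new contents, not an empty file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the final
    # rename stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # move Python's buffer into the kernel
            os.fsync(tmp.fileno())  # ask the kernel to commit the data
        os.replace(tmp_path, path)  # rename over the old file
    except BaseException:
        # Best-effort cleanup of the temporary file on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the containing directory so the rename itself is persistent.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp() creates the file with owner-only permissions; an application that cares about the mode of the original file would need to restore it before the rename. The fsync() call is also where the performance cost discussed above is paid, so this pattern suits files that matter, not every scratch file.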

That said, it would be nice to improve the robustness of the system while we're waiting for application developers to notice that they have some work to do. One possible solution is, of course, to just run ext3. Another is to shorten the system's writeback time, which is stored in a couple of sysctl variables:

    /proc/sys/vm/dirty_expire_centisecs
    /proc/sys/vm/dirty_writeback_centisecs

The first of these variables (dirty_expire_centisecs) controls how long written data can sit in the page cache before it's considered "expired" and queued to be written to disk; it defaults to 30 seconds. The value of dirty_writeback_centisecs (5 seconds, default) controls how often the pdflush process wakes up to actually flush expired data to disk. Lowering these values will cause the system to flush data to disk more aggressively, with a cost in the form of reduced performance.
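As a quick way to see what a given system is using, the following Python sketch reads both files and converts the values from hundredths of a second to seconds. The function names are hypothetical; the only assumption about the system is that the two files listed above exist, and the code returns None for any that cannot be read.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")
NAMES = ("dirty_expire_centisecs", "dirty_writeback_centisecs")


def centisecs_to_seconds(raw):
    """Convert a sysctl value in hundredths of a second to seconds."""
    return int(raw.strip()) / 100.0


def read_writeback_settings(vm_dir=VM_DIR):
    """Return both writeback intervals in seconds (None if unreadable)."""
    settings = {}
    for name in NAMES:
        try:
            settings[name] = centisecs_to_seconds((Path(vm_dir) / name).read_text())
        except (OSError, ValueError):
            settings[name] = None
    return settings


if __name__ == "__main__":
    for name, seconds in read_writeback_settings().items():
        print(name, seconds)
```

Lowering a value amounts to writing a smaller number to the same file as root; with the defaults described above, the files would contain 3000 and 500.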

A third, partial solution exists in a set of patches queued for 2.6.30; they add a set of heuristics which attempt to protect users from being badly burned in certain situations. They are:

  • A patch adding a new EXT4_IOC_ALLOC_DA_BLKS ioctl() command. When issued on a file, it will force ext4 to allocate any delayed-allocation blocks for that file. That will have the effect of getting the file's data to disk relatively quickly while avoiding the full cost of the (heavyweight) fsync() call.

  • The second patch sets a special flag on any file which has been truncated; when that file is closed, any delayed allocations will be forced. That should help to prevent the "zero-length files" problem reported at the beginning.

  • Finally, this patch forces block allocation when one file is renamed on top of another. This, too, is aimed at the problem of frequently-rewritten small files.
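For illustration, the two application habits that the last two heuristics look for can be sketched in Python as follows. The function names and the ".new" suffix are hypothetical. Neither function calls fsync(), which is what leaves them exposed when block allocation is delayed.

```python
import os


def rewrite_in_place(path, data):
    """Pattern 1: truncate the existing file, then write new contents.

    If the system crashes after the truncation is committed but before
    the new data is written back, a zero-length file can remain.
    """
    with open(path, "wb") as f:  # mode "wb" truncates the file on open
        f.write(data)
    # No fsync(): the data sits in the page cache until writeback.


def rewrite_by_rename(path, data):
    """Pattern 2: write a new file, then rename it over the old one.

    Without fsync(), the rename may become persistent before the new
    file's data blocks have been allocated and written.
    """
    tmp_path = path + ".new"  # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.rename(tmp_path, path)
```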

Together, these patches should mitigate the worst of the data loss problems while preserving the performance benefits that come with delayed allocation. They have not been proposed for merging at this late stage in the 2.6.29 release cycle, though; they are big enough that they will have to wait for 2.6.30. Distributors shipping earlier kernels can, of course, backport the patches, and some may do so. But they should also note the lesson from this whole episode: ext4, despite its apparent stability, remains a very young filesystem. There may yet be a surprise or two waiting to be discovered by its early users.
