UBIFS file system, part 1: overview

UBIFS is a new flash file system developed by Nokia engineers with the help of the University of Szeged. In a way, UBIFS may be considered the next generation of the JFFS2 file system.

The JFFS2 file system works on top of MTD devices, but UBIFS works on top of UBI volumes and cannot operate directly on top of MTD devices. In other words, there are 3 subsystems involved:

  • MTD subsystem, which provides a uniform interface to access flash chips. MTD provides a notion of MTD devices (e.g., /dev/mtd0), which basically represent raw flash;
  • UBI subsystem, which is a wear-leveling and volume management system for flash devices; UBI works on top of MTD devices and provides a notion of UBI volumes; UBI volumes are higher-level entities than MTD devices and they are devoid of many unpleasant issues MTD devices have (e.g., wearing and bad blocks); see the UBI documentation for more information;
  • UBIFS file system, which works on top of UBI volumes.


Write-back support

UBIFS supports write-back, which means that file changes do not go to the flash media straight away, but they are cached and go to the flash later, when it is absolutely necessary. This helps to greatly reduce the amount of I/O, which results in better performance. Write-back caching is a standard technique which is used by most file systems, such as ext3 or XFS.

In contrast, JFFS2 does not have write-back support and all JFFS2 file system changes go to the flash synchronously. Well, this is not completely true: JFFS2 does have a small buffer of NAND page size (if the underlying flash is NAND). This buffer contains the last written data and is flushed once it is full. However, because the amount of cached data is very small, JFFS2 is very close to a synchronous file system.

Write-back support requires application programmers to take extra care to synchronize important files in time. Otherwise the files may become corrupted or disappear in case of power cuts, which happen very often in many embedded devices. Let's take a glimpse at the Linux manual pages:

$ man 2 write
....
NOTES
       A  successful return from write() does not make any guarantee that data
       has been committed to disk.  In fact, on some buggy implementations, it
       does  not  even guarantee that space has successfully been reserved for
       the data.  The only way to be sure is to call fsync(2)  after  you  are
       done writing all your data.
...

This is true for UBIFS (except for the "some buggy implementations" part, because UBIFS does reserve space for cached dirty data). This is also true for JFFS2, as well as for any other Linux file system.

However, some (perhaps not very good) user-space programmers do not take write-back into account. They do not read manual pages carefully. When such applications are used in embedded systems which run JFFS2, they work fine, because JFFS2 is almost synchronous. Of course, the applications are buggy, but they appear to work well enough with JFFS2. The bugs show up when UBIFS is used instead. Please be careful and check/test your applications with respect to power-cut tolerance if you switch from JFFS2 to UBIFS. The following is a list of useful hints and advice.

  • If you want to switch into synchronous mode, use the -o sync option when mounting UBIFS; however, the file system performance will drop, so be careful. Also remember that UBIFS mounted in synchronous mode provides fewer guarantees than JFFS2; refer to this section for details.
  • Always keep in mind the above statement from the manual pages and run fsync() for all important files you change; of course, there is no need to synchronize "throw-away" temporary files. Just think how important the file data is and decide; do not use fsync() unnecessarily, because this will hurt performance.
  • If you want to be more accurate, you may use fdatasync(), in which case only data changes will be flushed, but not inode meta-data changes (e.g., "mtime" or permissions); this might be more optimal than using fsync() if the synchronization is done often, e.g., in a loop; otherwise just stick with fsync().
  • In shell, the sync command may be used, but it synchronizes the whole file system, which might not be optimal; there is also a similar libc sync() function.
  • You may use the O_SYNC flag of the open() call; this will make sure all the data (but not meta-data) changes go to the media before the write() operation returns; but in general, it is better to use fsync(), because O_SYNC makes each write synchronous, while fsync() allows you to accumulate many writes and synchronize them at once.
  • It is possible to make certain inodes synchronous by default by setting the "sync" inode flag; in a shell, the chattr +S command may be used; in C programs, use the FS_IOC_SETFLAGS ioctl command. Note, the mkfs.ubifs tool checks for the "sync" flag in the original FS tree, so the synchronous files in the original FS tree will be synchronous in the resulting UBIFS image.

Let us stress that the above items are true for any Linux file system, including JFFS2.
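
The fsync()/fdatasync() hints above can be sketched in a few lines. This is only an illustration, written in Python because its os module exposes the same system calls; the helper name write_durably and its data_only parameter are hypothetical and not part of UBIFS or any library.

```python
import os


def write_durably(path, data, data_only=False):
    """Write data to path and ask the kernel to push it to the medium.

    Hypothetical helper for illustration. data_only=True uses fdatasync(),
    which may skip non-essential inode meta-data such as mtime.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # write() may accept fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)
```

Calling the helper once after all the data has been written, rather than opening the file with O_SYNC, follows the advice above: many writes are accumulated and synchronized at once.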

fsync() may be called for directories - it synchronizes the directory inode meta-data. The "sync" flag may also be set for directories to make the directory inode synchronous. Note that the flag is inherited, which means all new children of this directory will also have this flag. New files and sub-directories of this directory will also be synchronous, and their children, and so forth. This feature is very useful if one needs to create a whole sub-tree of synchronous files and directories, or to make all new children of some directory synchronous by default (e.g., /etc).

The fdatasync() call for directories is a "no-op" in UBIFS, and all UBIFS operations which change directory entries are synchronous. However, you should not assume this if you care about portability (e.g., this is not true for ext2). Similarly, the "dirsync" inode flag has no effect in UBIFS.

The functions mentioned above work on file descriptors, not on streams (FILE *). To synchronize a stream, you should first get its file descriptor using the fileno() libc function, then flush the stream using fflush(), and then synchronize the file using fsync() or fdatasync(). You may use other synchronization methods, but remember to flush the stream before synchronizing the file. The fflush() function flushes the libc-level buffers, while sync(), fsync(), etc. flush kernel-level buffers.
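
The two-level flush described above has a direct analogue for Python's buffered file objects, which play the role of FILE * streams here. A minimal sketch, where flush_and_sync is a hypothetical helper name:

```python
import os


def flush_and_sync(stream):
    """Flush a buffered stream, then synchronize its file descriptor."""
    # Step 1: user-space buffer -> kernel page cache (like fflush()).
    stream.flush()
    # Step 2: kernel page cache -> storage medium (like fsync(fileno(fp))).
    os.fsync(stream.fileno())
```

Skipping the first step would synchronize only what had already reached the kernel, leaving the most recent writes in the user-space buffer.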

Please refer to this FAQ entry for information about how to atomically update the contents of a file. Theodore Tso's article is also good reading.



Write-back knobs in Linux

Linux has several knobs in "/proc/sys/vm" which you may use to tune write-back. The knobs are global, so they affect all file-systems. Please refer to the "Documentation/sysctl/vm.txt" file for more information. The file may be found in the Linux kernel source tree. Below, the interesting knobs are described in the UBIFS context and in a simplified form.

  • dirty_writeback_centisecs - how often the Linux periodic write-back thread wakes up and writes out dirty data. This is a mechanism which makes sure all dirty data hits the media at some point.
  • dirty_expire_centisecs - dirty data expire period. This is the maximum time data may stay dirty. After this period of time it will be written back by the Linux periodic write-back thread. IOW, the periodic write-back thread wakes up every "dirty_writeback_centisecs" centi-seconds and synchronizes data which was dirtied "dirty_expire_centisecs" centi-seconds ago.
  • dirty_background_ratio - maximum amount of dirty data in percent of total memory. When the amount of dirty data becomes larger, the periodic write-back thread starts synchronizing it until it becomes smaller. Even non-expired data will be synchronized. This may be used to set a "soft" limit for the amount of dirty data in the system.
  • dirty_ratio - maximum amount of dirty data at which writers will first synchronize the existing dirty data before adding more. IOW, this is a "hard" limit of the amount of dirty data in the system.

Note, UBIFS additionally has small write-buffers which are synchronized every 3-5 seconds. This means that most of the dirty data is delayed by dirty_expire_centisecs centi-seconds, but the last few KiB are additionally delayed by 3-5 seconds.
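
To get a feeling for the numbers, the sketch below reads the two timing knobs and adds them up. It is a rough estimate that follows from the simplified description above, not a guarantee: it ignores the ratio limits and system load, and it assumes the write-buffer timer takes the full 5 seconds. The helper names are hypothetical.

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def read_knob(name, default):
    """Return the integer value of a /proc/sys/vm knob, or default if unreadable."""
    try:
        return int((VM_DIR / name).read_text().strip())
    except (OSError, ValueError):
        return default


def worst_case_delay_seconds(expire_cs, writeback_cs, wbuf_timer_s=5.0):
    """Rough upper estimate of how long dirty data may stay off the flash.

    Data may expire just after the write-back thread went back to sleep, so
    it waits up to one more wake-up period, plus the write-buffer timer.
    """
    return (expire_cs + writeback_cs) / 100.0 + wbuf_timer_s


if __name__ == "__main__":
    expire = read_knob("dirty_expire_centisecs", 3000)
    writeback = read_knob("dirty_writeback_centisecs", 500)
    print(worst_case_delay_seconds(expire, writeback), "seconds")
```

With an expire period of 3000 and a wake-up period of 500 centi-seconds, which are commonly seen values, the estimate is 40 seconds.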



UBIFS write-buffer

UBIFS is an asynchronous file-system (see the write-back support section above for more information). Like other Linux file-systems, it utilizes the page cache. The page cache is a generic Linux memory-management mechanism. It may be very large and cache a lot of data. When you write to a file, the data is written to the page cache, marked as dirty, and the write returns (unless the file is synchronous). Later the data is written back.

The write-buffer is an additional UBIFS buffer, which is implemented inside UBIFS and sits between the page cache and the flash. This means that write-back actually writes to the write-buffer, not directly to the flash.

The write-buffer is designed to speed up UBIFS on NAND flashes. NAND flashes consist of NAND pages, which are usually 512 bytes, 2KiB or 4KiB in size. A NAND page is the minimal read/write unit of NAND flash (see this section).

The write-buffer size is equal to the NAND page size (so it is tiny compared to the page cache). Its purpose is to accumulate small writes and to write full NAND pages instead of partially filled ones. Indeed, imagine we have to write four 512-byte nodes at half-second intervals, and the NAND page size is 2KiB. Without the write-buffer we would have to write 4 NAND pages and waste 6KiB of flash space, while the write-buffer allows us to write only once and waste nothing. This means we write less, we create less dirty space so the UBIFS garbage collector has less work to do, and we save power.
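
The arithmetic of this example can be checked with a small sketch. The function names are hypothetical, and the buffered case assumes the ideal situation where all nodes arrive before the buffer is flushed.

```python
def pages_without_buffer(node_sizes, page_size):
    """Each node is written on its own and padded up to whole NAND pages."""
    return sum(-(-size // page_size) for size in node_sizes)


def pages_with_buffer(node_sizes, page_size):
    """Nodes are accumulated first, so only the total is rounded up."""
    return -(-sum(node_sizes) // page_size)


def wasted_bytes(pages, node_sizes, page_size):
    """Flash space consumed beyond the actual node data."""
    return pages * page_size - sum(node_sizes)


nodes = [512] * 4      # four 512-byte nodes
page = 2048            # 2KiB NAND page

unbuffered = pages_without_buffer(nodes, page)
buffered = pages_with_buffer(nodes, page)
print(unbuffered, wasted_bytes(unbuffered, nodes, page))  # 4 pages, 6144 bytes wasted
print(buffered, wasted_bytes(buffered, nodes, page))      # 1 page, 0 bytes wasted
```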

Well, the example shows an ideal situation, and even with the write-buffer we may waste space, for example in case of synchronous I/O, or if the data arrives at long time intervals. This is because the write-buffer has an associated timer, which flushes it every 3-5 seconds, even if it isn't full. We do this for data integrity reasons.

Of course, when UBIFS has to write a lot of data, it does not use the write-buffer. Only the last part of the data, which is smaller than the NAND page, ends up in the write-buffer and waits for more data until it is flushed by the timer.

The write-buffer implementation is a little more complex, and we actually have several of them - one for each journal head. But this does not change the basic idea behind the write-buffer.

A few notes with regard to synchronization:

  • "sync()" also synchronizes all write-buffers;
  • "fsync(fd)" also synchronizes all write-buffers which contain pieces of "fd";
  • synchronous files, as well as files opened with "O_SYNC", bypass write-buffers, so the I/O is indeed synchronous for these files;
  • write-buffers are also bypassed if the file-system is mounted with the "-o sync" mount option.

Take into account that write-buffers extend the data synchronization timeout defined by "dirty_expire_centisecs" (see the write-back knobs section above) by 3-5 seconds. However, since write-buffers are small, only a small amount of data is delayed.


Compression

UBIFS supports on-the-fly compression, which means it compresses data before writing it to the flash media and decompresses it when reading it back, and this is absolutely transparent to the users. UBIFS compresses only the data of regular files. Directories, device nodes and so on are not compressed. Meta-data and the indexing information are not compressed either.

At the moment UBIFS supports LZO and zlib compressors. Zlib provides a better compression ratio, but LZO is faster in both compression and decompression. LZO is the default compressor for UBIFS and for the mkfs.ubifs utility. And of course you may disable UBIFS compression altogether using the "-x none" mkfs.ubifs option.

UBIFS splits all data into 4KiB chunks and compresses each chunk independently. This is not optimal, because larger chunks of data would compress better, but this still provides noticeable flash space savings. For example, a real-life root file-system image for an ARM platform becomes ~40% smaller with LZO compression and ~50% smaller with zlib compression. This means that you may fit a 300MiB rootfs image into a 256MiB UBI volume and still have about 100MiB of free space. However, the figures may be different depending on the contents of the file-system. For example, if your file-system mostly contains mp3 files, UBIFS will be unable to compress them efficiently, just because mp3 files are already compressed.
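
The per-chunk scheme can be illustrated with Python's zlib module. This is only a sketch of the idea of independent 4KiB chunks; it does not reproduce the UBIFS on-media format, and the function names are hypothetical.

```python
import zlib

CHUNK_SIZE = 4096  # UBIFS compresses data in 4KiB chunks


def compress_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress every chunk on its own, so each one can be read back alone."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_chunk(chunks, index):
    """Decompress a single chunk without touching its neighbours."""
    return zlib.decompress(chunks[index])


def decompress_chunks(chunks):
    """Rebuild the whole file contents from all chunks."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)
```

Independent chunks cost some compression ratio, but a read of one 4KiB block only needs to decompress that block, which is what makes random access cheap.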

In UBIFS it is possible to enable or disable compression individually for each inode by setting or clearing its compression flag. Note, the compression flag of directories is inherited, which means that when files and sub-directories are created, they inherit the compression flag of the parent directory. Please refer to this section for instructions on how the compression flag may be toggled.

It is also possible to somewhat combine LZO and zlib compressors, see this FAQ section.

It is also worth noting that JFFS2 zlib compression is a little bit different from UBIFS zlib compression. UBIFS uses the crypto-API deflate method, while JFFS2 uses the zlib library directly. As a result, UBIFS and JFFS2 use different zlib compression options. Namely, JFFS2 uses deflate level 3 and window bits 15, while UBIFS uses deflate level 6 and window bits -11 (the minus sign makes zlib avoid putting a header into the output data stream). Experiments with compressing ARM code showed that the JFFS2 compression ratio is slightly smaller, decompression speed is also slightly slower, but compression speed is a bit faster.
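
The two parameter sets quoted above can be tried out with Python's zlib module, which wraps the same zlib library. This is a user-space sketch only: it does not go through the kernel crypto API, so ratios and speeds measured this way need not match the kernel's. The helper names are hypothetical.

```python
import zlib


def deflate(data, level, wbits):
    """Compress data with an explicit deflate level and window-bits value."""
    compressor = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


def inflate(blob, wbits):
    """Decompress a stream produced with the same window-bits value."""
    decompressor = zlib.decompressobj(wbits)
    return decompressor.decompress(blob) + decompressor.flush()


sample = b"mov r0, r1; add r0, r0, #4; bx lr; " * 200

jffs2_like = deflate(sample, level=3, wbits=15)    # zlib header and trailer
ubifs_like = deflate(sample, level=6, wbits=-11)   # raw deflate, 2KiB window

assert inflate(jffs2_like, 15) == sample
assert inflate(ubifs_like, -11) == sample
print(len(sample), len(jffs2_like), len(ubifs_like))
```

A negative window-bits value produces a raw deflate stream, so the reader has to know the parameters in advance because there is no header to describe them.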


Checksumming

Every piece of information UBIFS writes to the media has a CRC-32 checksum. UBIFS protects both data and meta-data with CRC. Every time the meta-data is read, the CRC checksum is verified. CRC-32 is quite a strong function and any data corruption will most probably be noticed. The same is true for UBI, by the way: it verifies every piece of meta-data.

The data CRC is not verified by default. We do this to improve the default file-system read speed. But UBIFS allows you to switch data verification on using the chk_data_crc mount option. This decreases UBIFS read speed a little, but provides better integrity protection. With this option on you may be sure that any piece of information UBIFS reads from the media will be verified and any corruption will most probably be noticed.

Note, currently UBIFS cannot disable CRC-32 calculations on write, because the UBIFS recovery process depends on it. When recovering from an unclean reboot and replaying the journal, UBIFS has to be able to detect broken and half-written UBIFS nodes and drop them, and UBIFS depends on the CRC-32 checksum here.
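
How a checksum lets recovery drop broken or half-written nodes can be shown with a toy node format. The layout below (a CRC-32 followed by a length and the payload) is purely hypothetical and is not the real UBIFS node header; only the principle is the same.

```python
import struct
import zlib

# Hypothetical toy layout: 4-byte CRC-32, 4-byte payload length, payload.
HEADER = struct.Struct("<II")


def make_node(payload):
    """Build a node whose CRC covers the length field and the payload."""
    length = struct.pack("<I", len(payload))
    crc = zlib.crc32(length + payload)
    return struct.pack("<I", crc) + length + payload


def node_is_valid(raw):
    """Return True only for a complete node with a matching checksum."""
    if len(raw) < HEADER.size:
        return False
    crc, length = HEADER.unpack_from(raw)
    body = raw[4:HEADER.size + length]
    if len(body) != 4 + length:
        return False  # half-written: the payload is shorter than announced
    return zlib.crc32(body) == crc


node = make_node(b"data" * 100)
corrupted = bytearray(node)
corrupted[20] ^= 0x01  # flip one bit inside the payload

print(node_is_valid(node))              # True
print(node_is_valid(node[:50]))         # False, truncated write
print(node_is_valid(bytes(corrupted)))  # False, bit flip
```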

In other words, if you use UBIFS with data CRC-32 checking disabled, you still have the CRC-32 checksum attached to each piece of data, and you may mount UBIFS with the chk_data_crc option to enable CRC-32 checking at any time (e.g., when you suspect the file-system might be corrupted because you visited the Large Hadron Collider and exposed your flash to proton beams).

NOTE: before 2.6.39 the default UBIFS behavior was the opposite - it did check data CRC-32 by default and the no_chk_data_crc option had to be used to disable the check.


Read-ahead

Read-ahead is an optimization technique which makes the file system read a little more data than users actually ask for. The idea is that files are often read sequentially from the beginning to the end, so the file system tries to make the next data available before the user actually asks for it.

Linux VFS is capable of doing read-ahead and this does not require any support from the file system. This probably works well for traditional block-based file systems, however it does not work well for UBIFS. UBIFS works with the UBI API, which works with the MTD API, which is synchronous. The MTD API is pretty trivial and does not have any request queues. This means that VFS blocks UBIFS readers and makes them wait for the read-ahead process. In contrast, the block-device API is asynchronous and readers do not wait for read-ahead.

VFS read-ahead was designed for hard drives, and it was benchmarked with hard drives. But the nature of raw flash devices is very different from that of hard drives. Raw flash devices do not have such a huge seek time as hard drives do, so the techniques which work for HDDs do not necessarily work well for flash.

That said, VFS read-ahead only slows UBIFS down instead of improving it, so UBIFS disables VFS read-ahead. But UBIFS has its own internal read-ahead, which we call "bulk-read". You may enable bulk-read using the "bulk_read" UBIFS mount option.

Some flashes may read faster if the data is read in one go rather than in several read requests. For example, OneNAND can do "read-while-load" if it reads more than one NAND page. So UBIFS may benefit from reading large data chunks in one go, and this is exactly what bulk-read does.

If UBIFS notices that a file is being read sequentially (at least 3 sequential 4KiB blocks have been read), and if UBIFS sees that the following file data resides sequentially in the same eraseblock, it starts reading data ahead using large read requests, which makes it possible to read at higher rates. So UBIFS reads more than it is asked to, and it pushes the read-ahead data to the file cache, so the data becomes instantly available for further user read requests.

Here is an example. Suppose the user is reading a file sequentially. We are lucky and the file is not fragmented on the media. Suppose LEB 25 contains data nodes belonging to this file, and the data nodes are logically (in terms of logical file offset) and physically (in terms of LEB/offset addresses) sequential. Suppose the user requests to read the data node at LEB 25 offset 0. In this case UBIFS will actually read the whole LEB 25 in one go, then populate the file cache with all the read data. And when the user asks for the next piece of data, it will already be in the cache.
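
The "at least 3 sequential 4KiB blocks" trigger can be modelled with a tiny detector. This is a hypothetical simplification for illustration, not the UBIFS implementation: it looks only at logical block numbers and ignores the second condition, that the following data must also sit sequentially in the same eraseblock.

```python
class SequentialReadDetector:
    """Track consecutive 4KiB block reads of one file (illustrative model)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_block = None
        self.run_length = 0

    def on_read(self, block):
        """Record a block read; return True once bulk-read would be worthwhile."""
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 1  # a jump restarts the sequential run
        self.last_block = block
        return self.run_length >= self.threshold


detector = SequentialReadDetector()
print([detector.on_read(b) for b in (0, 1, 2, 3)])  # [False, False, True, True]
print(detector.on_read(10))                         # False, the run was broken
```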

Obviously, the bulk-read feature may slow UBIFS down in some work-loads, so you should be careful. It is also worth noting that the bulk-read feature cannot help on highly fragmented file-systems. Although UBIFS does not fragment file-systems itself (e.g., the Garbage-Collector does not re-order data nodes), it does not try to defragment them either. For example, if you write a file sequentially, it won't be fragmented. But if you write more than one file at a time, they may become fragmented (well, this also depends on how write-back flushes the changes), and UBIFS won't automatically defragment them. However, it is possible to implement a background defragmenter. It is also possible to have a per-inode journal head and avoid mixing data nodes belonging to different inodes in the same LEB. So there is room for improvement.



Mount options

The following are UBIFS-specific mount options.

  • chk_data_crc (default) - check data CRC-32 checksums;
  • no_chk_data_crc - do not check data CRC-32 checksums, see the Checksumming section for more details;
  • bulk_read - enable bulk-read, see the Read-ahead section;
  • no_bulk_read (default) - do not bulk-read.

Example:

$ mount -o no_chk_data_crc /dev/ubi0_0 /mnt/ubifs

mounts UBIFS file-system to /mnt/ubifs and disables data CRC checking.

Besides, UBIFS supports the standard sync mount option which may be used to disable UBIFS write-back and write-buffer caching and make it fully synchronous. Note, UBIFS does not support "atime", so the atime mount option has no effect.


Dirty space

Dirty space is the flash space occupied by UBIFS nodes which were invalidated because they were changed or removed. For example, if the contents of a file are re-written, then the corresponding data nodes are invalidated and new data nodes are written to the flash media. The invalidated nodes comprise dirty space. There are other mechanisms by which dirty space appears as well.

UBIFS cannot re-use dirty space straight away, because the corresponding flash areas do not contain all 0xFF bytes. Before dirty space can be re-used, UBIFS has to garbage-collect the corresponding LEBs. The idea of the garbage collector which reclaims dirty space is the same as in JFFS2. Please refer to the JFFS2 design document for more information.

Roughly, the UBIFS garbage collector picks a victim LEB which has some dirty space and moves valid UBIFS nodes from the victim LEB to the LEB which was reserved for GC. This produces some amount of free space at the end of the reserved LEB. Then GC picks a new victim LEB and moves the data to the reserved LEB. When the reserved LEB is full, UBIFS picks another empty LEB (e.g., the old victim which was freed a step ago), and continues moving nodes from the victim LEB to the new reserved LEB. The process continues until a full empty LEB is produced.

UBIFS has a notion of minimum I/O unit size, which characterizes the minimum amount of data which may be written to the flash (see here for more information). Typically, UBIFS works on large-page NAND flashes and the min. I/O size is 2KiB.

Consider a situation when GC picks eraseblocks with less dirty space than the min. I/O unit size. When all nodes from the victim LEB have been moved to the reserved LEB, the last min. I/O unit of the reserved LEB has to be written to the flash media, which means no space would be reclaimed. The reason why the last min. I/O unit of the reserved LEB has to be written immediately is that the victim LEB cannot be erased before all the moved nodes have reached the media. Indeed, otherwise an unclean reboot would result in lost data.
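
The effect can be put into numbers with a simplified model. The sketch assumes a victim LEB that is completely filled with valid and dirty data, an empty reserved LEB, and a final write padded up to a whole min. I/O unit; the function name and the 128KiB LEB size are hypothetical.

```python
def reclaimed_bytes(leb_size, dirty, min_io):
    """Space gained by moving one victim LEB's valid data to an empty LEB.

    Simplified model: the valid data is rounded up to whole min. I/O units
    because the last unit must hit the flash before the victim is erased.
    """
    valid = leb_size - dirty
    written = -(-valid // min_io) * min_io  # round up to a min. I/O multiple
    return leb_size - written


LEB_SIZE = 128 * 1024  # hypothetical LEB size
MIN_IO = 2048          # 2KiB min. I/O unit

print(reclaimed_bytes(LEB_SIZE, 1000, MIN_IO))  # 0: dirty space below one unit
print(reclaimed_bytes(LEB_SIZE, 5000, MIN_IO))  # 4096: only whole units come back
```

In this model 5000 bytes of dirty space yield only 4096 bytes of free space, and anything below 2048 bytes yields nothing, which is why dirty space cannot be counted as free space one-to-one.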

Well, things are actually not that simple and UBIFS GC actually tries not to waste space, but it is not always possible and UBIFS GC is far from being ideal. Anyway, what matters is that UBIFS cannot always reclaim dirty space if the amount of it is less than the min. I/O unit size.

When UBIFS reports free space to the users, it treats dirty space as available for new data, because after garbage collection dirty space becomes free space. But as we have just shown, UBIFS cannot reclaim all dirty space and turn it into free space. Worse, UBIFS does not precisely know how much dirty space it can reclaim. So it again uses pessimistic calculations.

Thus, the less dirty space the FS has, and the smaller the dirty space fragmentation, the more precise UBIFS free space reporting is. In practice this means that a file system which is close to being full has less accurate free space reporting compared to a less full file system, because it presumably has more dirty space.

Note, to fix this issue, UBIFS would need to run GC in statfs(), which would turn as much dirty space as possible into free space and result in more precise free space reporting. However, this would make statfs() very slow. Another possibility would be to implement background GC in UBIFS (just like in JFFS2), which would lessen the effect of dirty space over time.


Documentation

If flash file systems are a completely new area for you, it is recommended to start by learning JFFS2, because many basic ideas are the same in UBIFS. Read the JFFS2 design document.

You may find the description of the main JFFS2 issues, as well as very basic UBIFS ideas, in the JFFS3 design document. Remember, the document in general is old and out-of-date. We do not use the "JFFS3" name anymore; JFFS3 was renamed to UBIFS. The document was written when UBI did not exist, and it assumes that JFFS3 talks directly to the MTD device, just like JFFS2. However, the JFFS2 overview, JFFS3 Requirements, and Introduction to JFFS3 chapters are still mostly valid and give a good introduction to basic UBIFS ideas like the wandering tree and the journal. Please note, though, that the superblock description is irrelevant for UBIFS. UBIFS is based on UBI and does not need that trick. However, the superblock location idea may be used to create a new scalable UBI2 layer.

This web-page, as well as the UBIFS FAQ, contains plenty of UBIFS information. You have to study UBI as well, because UBIFS depends on the services provided by the UBI layer. See the UBI documentation and UBI FAQ sections.

Look at the UBIFS presentation slides (ubifs.odp), which give another UBI/UBIFS overview. The slides were prepared in OpenOffice.org Impress 2.4, so you need OpenOffice to see them. The slides contain animation, so you have to watch them in "slide show" mode (use the F5 key). If you have no way to get OpenOffice, here is a pdf version, but it is very ugly because it does not store the animation and draws all animation steps at once.

There is a UBIFS white-paper document available as well. However, it might be rather difficult for newbies, so we recommend starting with the JFFS3 design document. The UBIFS white-paper gives a complete UBIFS design picture and describes the UBIFS internals. The white-paper does not contain some details which you may find on this web-page or in the UBIFS FAQ, and vice versa.

And finally, there is the UBIFS source code. The code has a great deal of comments, so we recommend looking there if you need all the details. And of course, you are welcome to ask questions on the UBIFS mailing list.

