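This ijournal paper by Korean researchers is very well written; it is arguably one of the better papers of recent years at explaining the design thinking behind file-system journaling.
Below are my reading notes: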
The ext4 file system uses a physical logging scheme that records the modified blocks [14], rather than logical logs, which record operations.
Because several of the metadata structures of ext4, such as block bitmap and inode table, are shared among multiple file operations, it is easier and more efficient to commit multiple transactions at once rather than commit each file operation-level transaction individually.
ext4's non-copy-on-write design means it has to use physical logging.
For the purpose, ext4 groups concurrent unrelated transactions into a single compound transaction [26], which is periodically
flushed into a reserved storage area, called a journal area. The compound transactions are maintained in the journal transaction buffer of the main memory until being committed to the journal area. The compound transaction scheme provides a better performance, particularly when the same metadata structure is frequently updated within a short period of time.
Advantage of physical logging
Ext4 supports three journaling modes: writeback, ordered, and data modes. Ordered mode, which is the default option, journals only the metadata. However, it enforces an ordering constraint to guarantee file-system consistency, in which the transaction-related data writes must be completed before the journal writes of the metadata. Therefore, the transaction commit latency will be lengthy if the size of the associated data is large.
A long latency for committing a compound journal transaction will increase the latency of an fsync system call [12].
Disadvantage of physical logging:
the problem of entangled dependencies
For a short fsync latency, a more fine-grained journaling scheme such as file-level transaction committing is required. However, under a physical logging scheme, fine-grained journaling is difficult to implement because several metadata blocks are shared by multiple file operations.
Disadvantage of logical logging:
It consumes far too much memory.
Another solution is the use of a logical logging scheme. For example, XFS [28] and ZFS [7] log the file operations rather than the modified blocks for a synchronous request.
Characteristics of logical logging
All file system operations are logically-logged as transactions, which accumulate in memory until they are committed to the journal area for an fsync call. The logical logs are replayed during a crash recovery. However, logical logging requires a large sized transaction buffer in the memory compared with physical logging, particularly when the same metadata structure is frequently updated. For example, ZFS generates 256 bytes of logical log in memory for each write operation.
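To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch. The 256 bytes per write operation is the ZFS figure quoted above; the 4 KB block size and the scenario where every write dirties the same single metadata block are illustrative assumptions of mine, not numbers from the paper.

```python
BLOCK_SIZE = 4096        # assumed metadata block size in bytes
LOGICAL_LOG_ENTRY = 256  # bytes per write operation (ZFS figure quoted above)


def physical_buffer_bytes(touched_blocks):
    """Physical logging keeps one copy per distinct modified block,
    no matter how many times that block was updated."""
    return len(set(touched_blocks)) * BLOCK_SIZE


def logical_buffer_bytes(num_operations):
    """Logical logging keeps one log entry per operation."""
    return num_operations * LOGICAL_LOG_ENTRY


# 1000 writes that all dirty the same metadata block (hypothetical block 7).
writes = [7] * 1000
print(physical_buffer_bytes(writes))      # 4096
print(logical_buffer_bytes(len(writes)))  # 256000
```

Under these assumptions the physical buffer stays at one block while the logical log grows linearly with the number of operations, which matches the paper's point about frequently updated metadata.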
Design principle of ijournal:
To address this issue, we propose a hybrid approach that uses both the normal journaling by JBD2 and the file-level transaction journaling of our proposed ijournaling technique. Under a normal periodic journaling operation, the proposed scheme uses a legacy journaling scheme that flushes the compound transaction.
However, if on-demand journaling is invoked by an fsync call, ijournaling commits only the transactions related to the fsynced file without flushing the compound transaction in the transaction buffer. The file-level transactions include only the minimum metadata, through which all relevant file-system metadata blocks can be recovered after a system crash.
Characteristics of ijournal
The ijournaling technique can eliminate the compound transaction problem for an fsync call without requiring an additional large
amount of memory space for transaction management, unlike ZFS.
Characteristics of the JBD2 journal:
The journal space is treated as a circular buffer. Once the necessary information has been propagated to its fixed location in the ext4 structures, the corresponding journal logs are identified as checkpointed, and the space can be reclaimed. All modified metadata blocks are recorded in a block unit at the journal area even though only a portion of the metadata blocks is modified. This feature makes it difficult to implement file-level journaling because a metadata block is shared by multiple files. One transaction log in the journal contains a journal header (JH), several journal descriptor blocks (JDs) to describe its contents, and a journal commit block (JC) to denote the end of the transaction.
Characteristics of a JBD2 transaction:
Each transaction has a metadata list and an inode list, which
have the metadata blocks and pointers to the inodes modified by the transaction, respectively.
Drawbacks of the JBD2 thread:
Drawback 1:
Based on its operations, we can find several reasons for adverse effect on the latency of an fsync system call.
The first reason is the inter-transaction (IT) dependency.
Because ext4 uses a single JBD2 thread, only one transaction (i.e., a committing transaction) can be committed at a time. Protecting concurrent journal commits is important for preventing multiple journals from being interleaved in the journal area. Furthermore, multiple transactions cannot be committed concurrently because they share several metadata blocks. Therefore, if the JBD2 thread is committing transaction Tx_{n-1}, the next transaction Tx_n relevant to the fsynced file cannot be changed
into a committing transaction immediately. Such cases will occur frequently when multiple threads invoke fsync calls simultaneously. To solve this IT dependency problem, our ijournaling technique handles an fsync call at system call service rather than the journaling thread, and uses separated journal areas.
Drawback 2:
The second reason is the compound transaction (CTX) dependency, shown in Figure 1(c). When the JBD2
thread commits the transaction of an fsynced file, the inode list of the committing transaction includes irrelevant inodes. The JBD2 thread must wait for the completion of the data block write operations owing to the ordering constraint of ordered-mode journaling. The CTX dependency is severe when there are many processes generating file-system write operations.
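A toy model of the CTX dependency, as I understand it: under ordered mode, committing a compound transaction has to wait for the data of every inode on its inode list, whereas a file-level commit would only wait for the fsynced file. The function names and the byte counts below are hypothetical, chosen only to show the gap.

```python
def compound_commit_wait_bytes(dirty_bytes_by_file):
    """Ordered-mode commit of a compound transaction: data of every inode
    on the transaction's inode list must reach storage first."""
    return sum(dirty_bytes_by_file.values())


def file_level_commit_wait_bytes(dirty_bytes_by_file, fsynced_file):
    """File-level commit: only the fsynced file's data must be written."""
    return dirty_bytes_by_file[fsynced_file]


# Hypothetical dirty data sitting in one running transaction.
dirty = {"A": 512 * 1024 * 1024, "B": 80 * 1024, "C": 256 * 1024 * 1024}

print(compound_commit_wait_bytes(dirty))         # 805388288
print(file_level_commit_wait_bytes(dirty, "B"))  # 81920
```

An fsync on B only cares about 80 KB, but in the compound case it ends up waiting behind the data of A and C as well.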
Severity of the CTX dependency
Even when only one process generates write operations, a CTX dependency problem can occur if the process updates multiple files. In some cases, a transaction can include discard commands [1], which have considerably long latencies.
The delayed block allocation technique of ext4 aggravates the CTX problem. The delayed block allocation has many advantages because it postpones block allocations until the page flush time, rather than during a write() operation [23]. Therefore, the overall performance of the file system is higher when delayed allocation is enabled. However, if an fsync is called just after the flush kernel thread invocation, as shown in the example in Figure 1(a), the flush thread will allocate data
blocks for dirty pages, and register several modified inodes in the running transaction during the delayed block
allocation. Then, the commit operation of the journal transaction will generate many write requests into storage. If an fsync is called before the flush thread is invoked, the fsync latency will be short because there are few modifications to the file system. Therefore, fsync latencies will fluctuate in a delayed allocation scheme.
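As mentioned above, delayed allocation increases the burden on an individual fsync, because the system's flush thread is triggered every 30 seconds.
If the flush thread runs first and the fsync comes after it, the fsync tends to get stuck for a long time at the point where ordered mode waits for the file data to be flushed to disk.
If the fsync comes first and the flush thread runs after it, this problem does not occur.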
On the contrary, if the delayed allocation is disabled, the modified inodes will be distributed to different transactions, and the fsync latency will be unrelated with the flush thread invocation.
If delayed allocation is disabled, it does not add to the burden of an individual fsync.
Nevertheless, a delayed allocation can demonstrate a better performance and shorter average fsync latency, as described later in Section 6.
What delayed allocation shortens is the average fsync latency.
Because our ijournaling scheme commits a file-level transaction rather than a compound transaction, it can always demonstrate a short fsync latency irrespective of the block allocation policy. Throughout our study, we used delayed allocation as the default scheme.
Drawback 3:
The last reason is the quasi-async request (QA) dependency revealed in [15]. In Figure 1(a), the writeback flush thread has sent a write request on data block 1 of file B before an fsync is called. Whereas the write requests generated by an fsync system call are sent along with a SYNC flag, the write requests generated by the flush thread are sent without the flag. The CFQ I/O scheduler in Linux gives lower priorities to requests without a SYNC flag. Although data block 1 is written by an async request, the request is latency-sensitive. Such a request is called a quasi-async request. A long latency will occur for completion of the quasi-async request, particularly when there are many competing async requests in the I/O queue. The QA dependency problem can be solved through the boosting technique proposed in [15], which changes a quasi-async request into a sync request. However, owing to the CTX dependency, the asynchronous write requests on A and C in Figure 1 must also be
changed to sync requests in the boosting technique. The ijournaling can mitigate the QA dependency problem by removing unrelated dependencies. For example, the fsync call on B does not need to wait for the completion of write requests on A and C.
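This is priority inversion. It happens because the handles for the changes of different files are tightly coupled inside the same transaction, and because ordered mode waits for the file data to be written to disk.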
Design ideas of ijournal:
Only when a process calls an fsync() system call, ijournaling is invoked.
This matches Ted's view: limit how often fast commit is triggered.
The ijournaling scheme generates ijournal transactions (i-transactions) and flushes them into a reserved ijournal area without committing the normal running transaction of an fsynced file.
The i-transaction includes metadata modification logs, which are the minimum required information through which a crash recovery operation can recover the file-system metadata blocks modified through an fsync operation. Only file-level metadata such as an inode entry and the external extent structures of the target file, and any related directory entries (DEs), are recorded.
ijournal records only a small amount of metadata related to the fsynced file. Does this mean that ijournal is also a form of physical logging?
Other modified metadata blocks shared by other files, such as GDT, block bitmap, inode bitmap, or inode
table, are not flushed into the ijournal area. They can be recovered during the crash recovery time using committed i-transactions.
The other metadata that ijournal does not record can still be recovered from the ijournal.
The ijournaling scheme does not change the normal running transaction used by the JBD2 thread. Therefore, the metadata blocks committed by ijournaling are again committed into the normal journal area through the following periodic JBD2 thread,
which simplifies the crash recovery.
Note this: the metadata recorded by ijournal is committed again to the normal journal area.
Figure 2 shows an example of a metadata recovery operation of ijournaling. When the file-system recovery module finds a committed i-transaction in the ijournal area, it can modify the old block bitmap in the filesystem using the extent allocation information, which can be found from the inode entry or the external extent structures in the i-transaction. Because two blocks from block number 30 are allocated for an extent, the 30-th and 31-st bits in the block bitmap must be set. The inode table and inode bitmap can also be easily recovered through a recorded inode entry.
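The bitmap reconstruction in the Figure 2 example can be sketched as follows. This is only my illustration of the idea: the `(start_block, length)` tuple format, the function names, and the bit layout (bit i stored in byte i // 8) are assumptions, not the paper's recovery code.

```python
def set_bits_for_extents(bitmap, extents):
    """Mark every block covered by (start_block, length) extents as allocated.

    bitmap is a bytearray with one bit per block (bit i lives in byte i // 8).
    """
    for start, length in extents:
        for block in range(start, start + length):
            bitmap[block // 8] |= 1 << (block % 8)
    return bitmap


def is_set(bitmap, block):
    """Return True if the bit for this block number is set."""
    return bool(bitmap[block // 8] & (1 << (block % 8)))


# Stale on-disk bitmap (all free) and the extent recorded in the i-transaction:
# two blocks starting at block number 30, as in the Figure 2 example.
bitmap = bytearray(8)  # covers 64 blocks
set_bits_for_extents(bitmap, [(30, 2)])
print([b for b in range(64) if is_set(bitmap, b)])  # [30, 31]
```

The point is that the shared block bitmap never has to be logged in the ijournal area, because the extent information in the file-level log is enough to rebuild the relevant bits.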
To implement ijournaling, no changes are required to the current JBD2 journaling scheme. Whereas a normal journaling thread flushes the transaction buffer periodically, ijournaling is performed in the fsync() system call service.
The execution context in which ijournal runs.
Therefore, an ijournaling and a normal journaling can be performed simultaneously, and the inter-transaction dependency is removed. The file-system recovery module must be modified to handle ijournal.
This explains in detail how the other metadata mentioned above can be recovered through ijournal.
The ijournaling will show a slight difference on crash recovery compared with the normal journaling scheme. While the normal journaling can recover all the other contemporary file operations as well as the fsynced file operation, the proposed ijournaling can recover only the files and directories related to fsync operation. However, the file system consistency is guaranteed.
The difference between ijournal and the normal journal in terms of crash recovery.
To simplify the ijournaling implementation, our scheme uses the normal journaling for some cases. For the fsync call for a directory itself, a normal transaction is committed instead of an ijournal to record all file-system changes in the subdirectories, as well as in the fsynced directory entry. This simplifies the journaling by removing the traversing of the subdirectories.
To simplify the ijournal implementation, an fsync on a directory does not use ijournal.
When an inode is shared by multiple files using hard link and an fsync() is called for only one file, the file-system consistency can be broken if ijournaling records the parent directories of only the fsynced file. To eliminate the traversing of directories connected by hard links, a normal transaction is committed instead of an ijournal for the case. To track such a case, we added the uncommitted HL flag in the inode structure. The flag of a file is marked if the i_link_count of its inode is incremented by a hard link operation. The flag is cleared when a running transaction is committed by the JBD2 thread. The fsync system call service checks the flag of the target inode, and calls normal journaling if the flag has been marked.
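The two fallback cases (directory fsync and an uncommitted hard link) amount to a small decision at the start of the fsync path. A minimal sketch, assuming a hypothetical `Inode` class with an `uncommitted_hl` field standing in for the flag described above; none of these names come from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    is_directory: bool = False
    uncommitted_hl: bool = False  # set when a hard link bumps the link count


def choose_commit_path(inode):
    """Return "normal" when the fsync should fall back to a full JBD2 commit,
    and "ijournal" when a file-level i-transaction is enough."""
    if inode.is_directory:
        return "normal"
    if inode.uncommitted_hl:
        return "normal"
    return "ijournal"


def on_normal_commit(inodes):
    """A committed running transaction clears the hard-link flags."""
    for inode in inodes:
        inode.uncommitted_hl = False


f = Inode(uncommitted_hl=True)
print(choose_commit_path(f))  # normal
on_normal_commit([f])
print(choose_commit_path(f))  # ijournal
```

Once the JBD2 thread has committed the running transaction, the hard link is already durable in the normal journal, so later fsyncs on that file can take the ijournal path again.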
ijournal's crash recovery strategy:
The ijournal crash recovery module replays only valid i-transactions. It first scans the normal journal area,
replays the committed but not-yet-checkpointed journal transactions, and finds the last committed journal transaction ID (Max TxID).
Recovery of the normal journal area
Because valid i-transactions have the information on file-system changes after a valid normal journal transaction is committed, the normal journal transaction must be replayed before i-transactions.
Replay the normal journal area first, then replay the ijournal areas.
Then, the recovery module scans the ijournal areas. If an i-transaction has a transaction ID larger than Max TxID,
it is valid. Otherwise, the i-transaction is ignored since a normal committed journal transaction includes all the
metadata modifications of the i-transaction. If there are multiple i-transactions on an inode, only the last i-transaction with the largest sub-transaction ID is valid since the last one includes all the metadata modifications of the previous i-transactions.
Figure 4(a) shows an example of journal commit. At a time of 30, the normal transaction with the transaction ID (TxID) n is committed and the TxID is incremented to n + 1. Before the next periodic transaction with TxID = n + 1 is committed, the files B, C, and D are modified, and fsync() calls are invoked for the files C and D by different processor cores. In Figure 4(b), the i-transactions with (TxID, sub-TxID) = (n + 1, 0) and (n + 1, 1) have the committed file information of the files C and D, respectively. The system is crashed before the periodic transaction commit (TxID = n + 1). In Figure 4(b), the i-transaction with TxID = n is invalid because the normal transaction with TxID = n has been committed. Therefore, the recovery operation uses only the i-transactions with TxID = n + 1. In Figure 4(a), there is a file operation on file B before a system crash, but the operation cannot be recovered by ijournaling. However, there is no problem in file-system consistency.
This example is well explained. File B is not recovered, because it was never fsynced.
After all, doing it this way causes no problem for file-system consistency.
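The validity rule can be sketched as a small filter over the scanned i-transactions: drop anything whose TxID is not larger than Max TxID, then keep only the newest entry per inode. The tuple layout and the inode numbers below are hypothetical; the scenario mirrors the Figure 4 example with files C and D.

```python
def valid_i_transactions(i_transactions, max_txid):
    """Pick the i-transactions that recovery should replay.

    Each i-transaction is a tuple (txid, sub_txid, inode_number).
    """
    latest = {}
    for txid, sub, ino in i_transactions:
        if txid <= max_txid:
            continue  # already covered by a committed normal transaction
        if ino not in latest or (txid, sub) > latest[ino][:2]:
            latest[ino] = (txid, sub, ino)
    return sorted(latest.values())


n = 7  # hypothetical TxID of the last committed normal transaction
logs = [
    (n, 0, 101),      # stale: normal transaction n was committed
    (n + 1, 0, 102),  # file C
    (n + 1, 1, 103),  # file D
]
print(valid_i_transactions(logs, max_txid=n))
# [(8, 0, 102), (8, 1, 103)]
```

Keeping only the largest sub-TxID per inode relies on the paper's statement that the last i-transaction of an inode includes all the modifications of the earlier ones.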
Figure 5 shows an example of a file-system recovery under the ijournaling scheme. Initially, the file with inode number 3 has three external extents, which are used to access 24 blocks.
The state already persisted on disk.
Through some file operations,
ten blocks (block numbers 50-59) and the corresponding
external extent structure in block number 12 are freed.
Then, six blocks (block numbers 74-79) are appended,
and the external extent in block number 13 is modified.
After the file operations, an fsync is called.
These are in-memory modifications; after the fsync, ijournal must already have been invoked to log them.
Assume that there is a system crash before a normal journal is committed.
The point at which the crash happens.
The recovery module builds the inode structure including the external extent tree with the recorded i-transactions. By comparing the built inode with the corresponding inode in storage, the recovery module can identify the file-system changes by the logged fsync call, and can replay these changes.
The passage above says that recovery after the crash uses ijournal for the replay.
When the external extent block in block number 12 is freed, the original ext4 journaling records a revocation block at the journal area to prevent an incorrect replay of the journal, which will cause a data corruption.
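This says that after some blocks are freed, normal journaling produces a revocation block. What is the downside of that?
I need to look into this revocation block: what it is and in which scenarios it gets generated.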
The ijournaling scheme skips the writing of the revocation block because the following normal journaling will write it.
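An advantage of ijournal.
From the figure for this point, you can see that ijournal is used to recover the single file that was fsynced, not the whole file system; still, once recovery through ijournal finishes, file-system consistency is guaranteed.
I will reread this part more carefully when I have time; it helps with understanding crash recovery and is well explained.
I now roughly understand ijournal: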
---------------------
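The normal journal records the changes of the whole file system; ijournal records only the changes of the single file that was fsynced.
How ijournal IDs work:
While the normal journal's ID is n, ijournals 1, 2, 3 have the IDs (n, 0), (n, 1), (n, 2).
While the normal journal's ID is n+1, ijournals 1, 2, 3 have the IDs (n+1, 0), (n+1, 1), (n+1, 2).
How ijournal commits work:
The normal journal is committed periodically, whereas ijournal is committed only in the context of an fsync.
Usually, between two consecutive normal journal commits there are several ijournal commits (one for each fsync triggered in that interval).
After an ijournal commit, the same changes still go through the normal journal commit path and are committed to the normal journal again.
How ijournal crash recovery works:
The normal journal is used to recover the whole file system.
ijournal is used to recover only the single file that was fsynced, not the whole file system; still, once recovery through ijournal finishes, file-system consistency is guaranteed.
The recovery module first finishes replaying the normal journal area, and only then starts replaying the ijournal area.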
----------------------
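The ID scheme summarized above can be sketched as a tiny allocator. That the sub-TxID restarts at 0 after each normal commit is my reading of the notes, and the class and method names are hypothetical; I have not checked this against the authors' implementation (in particular, how per-core ijournal areas share the counter).

```python
class JournalIds:
    """Hands out TxIDs for normal commits and (TxID, sub-TxID) pairs for
    i-transactions. Resetting sub-TxID on a normal commit is an assumption."""

    def __init__(self, txid=0):
        self.txid = txid    # TxID of the current running transaction
        self.next_sub = 0   # next sub-TxID within this running transaction

    def new_i_transaction(self):
        ident = (self.txid, self.next_sub)
        self.next_sub += 1
        return ident

    def normal_commit(self):
        """Commit the running transaction and open the next one."""
        self.txid += 1
        self.next_sub = 0
        return self.txid


ids = JournalIds(txid=5)
print([ids.new_i_transaction() for _ in range(3)])  # [(5, 0), (5, 1), (5, 2)]
ids.normal_commit()
print(ids.new_i_transaction())                      # (6, 0)
```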
Email sent to Ted:
Many thanks for your kind reply.
Fast commit for ext4 could be designed and implemented along the lines of the ijournal paper;
in my opinion, the ijournal idea is the best way to resolve the problem of an fsync having to wait in the jbd2 thread, under ordered mode, for a large amount of file data to be written.
To explain the reasoning in more detail, I append the design and advantages of the ijournal idea from my viewpoint.
-------------------
1: The normal journal records and commits the changes of the whole ext4 file system to the journal area, but ijournal only records and commits the changes of the fsync'ed file to its own ijournal area.
2: While the normal journal ID is n, the subsequently created ijournals 1, 2, 3 have the IDs (n, 0), (n, 1), (n, 2).
While the normal journal ID is n+1, the subsequently created ijournals 1, 2, 3 have the IDs (n+1, 0), (n+1, 1), (n+1, 2).
3: The normal journal is committed in the jbd2 thread periodically, every 5s, but ijournal can be committed in the thread that is doing the fsync work, because it only commits the changes of the single fsync'ed file to its ijournal area.
4:
Experiments comparing ijournal and the normal journal:
Experimental setup:
The delayed allocation and ordered-mode journaling were used by default. The JBD2 thread conducts a journal commit operation at periodic 5-second intervals.
Ordered mode has to be enabled; delayed allocation adds to the latency burden of an individual fsync.
We ran two programs for the experiments. One is an fsync-generating thread (fsync tester), which writes 80 KB of data in a file and calls an fsync repeatedly. We gave a delay of 0.1 second between write() and fsync() in order to generate many quasi-async requests. The other is the fio program [6], which generates 4 KB of sequential write requests for a file with a configurable write bandwidth of BG_bw. The fio program was used as a background process, which generated many data blocks to be flushed during the transaction commit operation. We determined the value of BG_bw at each experiment considering the storage bandwidth and the target foreground workload.
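This looks just like Xiong Ping's fsync performance test plan. Still, starting a background fio thread to write and vary the background I/O bandwidth is quite a creative touch.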
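The foreground half of this workload is easy to reproduce. Below is a minimal sketch of an fsync tester that follows the description (80 KB write, 0.1 second delay, then a timed fsync); appending to the file, the round count, and the file name are my assumptions, and the background fio load is not included.

```python
import os
import tempfile
import time


def fsync_tester(path, rounds=3, write_size=80 * 1024, delay=0.1):
    """Append write_size bytes, sleep, then time fsync(); repeat."""
    latencies = []
    payload = b"\0" * write_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            time.sleep(delay)  # gap between write() and fsync()
            start = time.perf_counter()
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "fsync_test.dat")
    lat = fsync_tester(path, rounds=3)
    print(len(lat), os.path.getsize(path))  # 3 245760
```

To see the effect the paper measures, the test file would need to live on the ext4 file system under test while a separate process generates the background writes; on an idle temporary directory the latencies will simply be small.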
Figure 6(a) shows the results for the desktop when BG_bw = 400 MB/s. In the normal journaling scheme, the tail fsync latency at the 95th percentile is longer than 3.5 seconds. This is because the fsync must wait until a large number of dirty pages are flushed. In our measurement, 1.5 GB of data blocks at maximum were flushed during an fsync handling. However, ijournaling showed less than 0.2 seconds of fsync latency.
Figure 6(b) shows the results for the smartphone when BG_bw = 50 MB/s. The ijournaling scheme also improved the fsync latency in the smartphone.
Comparing (a) and (b) shows that ijournal really does reduce fsync latency dramatically.
To demonstrate the CTX dependency problem in legacy journaling, we measured the fsync latencies of fsync tester while varying the write bandwidth of the background process, i.e., BG_bw of fio. Figure 7 shows the average fsync latencies under four different journaling schemes. As the background write bandwidth increased, the fsync latency increased for the normal journaling scheme because more transactions were merged into a compound transaction. In particular, when BG_bw = 500 MB/s during the desktop experiment, the fsync system call was not completed until the background fio program was terminated.
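This is a good experiment: by steadily raising the background write bandwidth (with fio), it exposes the weakness of the normal journal, namely how severe the CTX problem is.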
However, the ijournaling scheme showed short latencies even when BG_bw was high. The boosting scheme was effective only when ijournaling is enabled.
ijournal, on the other hand, does not have this problem.
Figure 8 compares the fsync latencies in legacy journaling under different block allocation policies. The experiment scenario is the same as the scenario of Figure 6(a). When an fsync() was called while the flush thread was flushing dirty pages, the fsync latency became significantly high for the delayed allocation scheme. Otherwise, the latency was short. This is because data blocks are allocated when the flush thread is invoked. However, when the delayed allocation is disabled, there are no significant changes in the fsync latency. The average fsync latency is shorter when the delayed allocation is enabled. Because ijournaling can solve the CTX dependency problem, it can mitigate the fluctuating fsync latency problem of delayed allocation, and thus showed less than 0.2 seconds latencies as shown in Figure 6(a).
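This experiment explains why, on our Xiaomi phones, enabling delayed allocation makes the latency of an individual fsync much larger than with delayed allocation disabled.
Multicore scalability experiment: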
A critical hurdle in implementing a manycore-scalable file system is the journaling contention, as reported in [24]. In particular, a single JBD2 thread handles all file-system transactions in ext4. Because ijournaling commits an fsync-related transaction in the system call service without calling the JBD2 thread, it improves the manycore scalability. In addition, each core has its own ijournal area, and thus, multiple fsync calls can be handled simultaneously at multiple processor cores.
A drawback of the normal journal.
Final summary:
We rely on the journaling of data updates for file-system consistency, and synchronous writes for data durability.
Some key points of file-system design.
However, latency-sensitive synchronous operations such as an fsync() system call can be delayed under the compound transaction scheme of the current journaling technique. Because a compound transaction includes irrelevant data and metadata, as well as those of fsynced file, the fsync latency can be unexpectedly long.
In this paper, we first analyzed the affecting factors that may delay an fsync operation, and proposed a novel hybrid journaling technique, called ijournaling, which journals only the related file-level transactions of an fsync call and recovers the file-system consistency through file-level journals upon a crash recovery.
After a power loss and reboot, ijournal is what recovers the file-system consistency.
Experiments using real devices showed that there are significant improvements to the fsync latencies when using ijournaling, and that many synchronous applications can benefit from the proposed ijournaling technique.
Ted's idea for implementing fast commit (based on the paper above):
The trick is that we track whether the inode has changes which we
can't represent in the fast commit "logical journal". In the logical
journal, we record changes since the last full commit, not as the full
physical metadata block, but just bits of the logical metadata that
have changed. If that inode has changed in ways that we can't
represent in the fast commit journal, then we do a normal full commit.
So we avoid entangled dependencies in two ways.  First of all, we
only journal the logical change. Hence, if there is a change in
another part of the metadata block (say, another inode in the inode
table) there won't be an issue, since we only update that one inode.
Secondly, if the inode has some entanglements either (a) with other
inodes, or (b) changes in the inode which we can't reflect in the fast
commit log, then fall back to doing a full commit.
So basically, we only deal with the simple, common cases, where it's
easy to log changes to the fast commit log. Now, those changes are
also logged in the normal physical commit, so once we do a full
commit, all of the entries in the fast commit log are no longer needed
--- the fast commit just contains the small, simple changes since the
last full commit.
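Ted's description boils down to a log of small logical deltas plus a fallback to a full commit. Here is a toy model of that rule as I read it; the class, its fields, and the `representable` argument are hypothetical and have nothing to do with the real ext4 fast-commit code.

```python
class FastCommitLog:
    """Toy model: small logical deltas accumulate between full commits."""

    def __init__(self):
        self.entries = []
        self.needs_full_commit = set()  # inodes with unrepresentable changes

    def record_change(self, ino, delta, representable=True):
        if representable:
            self.entries.append((ino, delta))
        else:
            self.needs_full_commit.add(ino)

    def fsync(self, ino):
        """Return which commit path the fsync takes."""
        if ino in self.needs_full_commit:
            self.full_commit()
            return "full"
        return "fast"

    def full_commit(self):
        # The physical commit covers everything, so the deltas are obsolete.
        self.entries.clear()
        self.needs_full_commit.clear()


log = FastCommitLog()
log.record_change(12, "size=4096")
print(log.fsync(12), len(log.entries))  # fast 1
log.record_change(13, "complex_change", representable=False)
print(log.fsync(13), len(log.entries))  # full 0
```

The second fsync falls back to a full commit, after which the fast commit entries are discarded, matching the statement that the fast commit log only holds changes since the last full commit.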