Fine-grained Metadata Journaling on NVM
Paper link:
http://storageconference.us/2016/Papers/Fine-GrainedJournaling.pdf
Below are some key points that are worth mentioning:
- Updating data larger than 8 bytes must adopt logging or copy-on-write (CoW) approaches, which make data recoverable by writing a copy elsewhere before updating the data itself. Implementing these approaches on NVM requires memory writes to occur in a certain order; e.g., the log or backup copy of the data must be completely written to NVM before the data itself is updated. Unfortunately, ordering memory writes involves expensive CPU instructions such as memory fences (e.g., MFENCE) and cache-line flushes (e.g., CLFLUSH). For example, prior work showed that using them directly in a B+-tree reduces performance by up to 16x.
- Although data consistency can be guaranteed by journaling, a large portion of the metadata committed to the journal is clean and does not need to be written. This is because a typical journaling file system assumes the persistent storage is a block device such as an HDD, so the basic unit of the journal is the disk block. However, since an individual metadata structure is much smaller than a block, metadata is usually stored on HDD in batches. For example, the metadata structure in Ext4, the inode, is 256 bytes, while a typical disk block is 4 KB, which means one metadata block stores 16 inodes. Consequently, even when only one of them needs to be written back to disk, all of them are written to the journal. The paper calls this journal write amplification, and it is the main focus of the work.
- Another drawback of the traditional journal format is the Descriptor Block and the Commit Block (or Revoke Block), which mark the beginning and end of a transaction, respectively. These two blocks occupy 8 KB in total, while a typical inode is only 256 B.
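The write amplification and fixed per-transaction overhead described above can be made concrete with a little arithmetic, using the Ext4 sizes from the text (the helper names here are illustrative, not from the paper):

```c
#include <stddef.h>

enum {
    INODE_SIZE      = 256,    /* Ext4 inode */
    BLOCK_SIZE      = 4096,   /* disk block = basic journal unit */
    DESCRIPTOR_SIZE = 4096,   /* descriptor block framing a transaction */
    COMMIT_SIZE     = 4096    /* commit (or revoke) block closing it */
};

/* Inodes packed into one metadata block: 4096 / 256 = 16, so journaling
 * one dirty inode drags 15 clean ones along with it. */
static size_t inodes_per_block(void) { return BLOCK_SIZE / INODE_SIZE; }

/* Total journal traffic for a traditional transaction carrying
 * `dirty_blocks` metadata blocks, including the descriptor and
 * commit blocks that frame the transaction. */
static size_t txn_journal_bytes(size_t dirty_blocks) {
    return DESCRIPTOR_SIZE + dirty_blocks * BLOCK_SIZE + COMMIT_SIZE;
}
```

With a single dirty inode, `txn_journal_bytes(1)` comes to 12 KB of journal writes for only 256 B of dirty data, a 48x blow-up; the paper's fine-grained format instead writes just the dirty inodes plus a small TxnInfo record.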
Technique Details
1. Cache-friendly Data Structure TxnInfo — the boundary between two adjacent transactions
TxnInfo describes the inodes in a transaction and is used to locate the boundary of each transaction, which facilitates the recovery process and guarantees its correctness. Its size is an integral multiple of the inode size, so the journal for each transaction is naturally aligned to CPU cache lines, which improves the performance of cache-line flushing. Since the maximum number of inodes in each transaction is determined by the length of TxnInfo, that length also controls the default commit frequency and can be tuned to optimize overall performance for a given workload.
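A minimal sketch of the size constraint, assuming a 64 B cache line and a 256 B inode; the field names and layout below are illustrative (the paper defines the actual TxnInfo contents), but they show how a TxnInfo sized to exactly one inode keeps every transaction cache-line aligned and caps the number of inodes per transaction:

```c
#include <stdint.h>

enum { CACHE_LINE = 64, INODE_SIZE = 256, MAX_TXN_INODES = 62 };

/* Illustrative TxnInfo layout (field names are assumptions): a list of
 * the inode numbers in this transaction plus a count and a boundary
 * marker, sized so the whole structure is exactly one inode -- and
 * therefore an integral multiple of the cache line. */
typedef struct {
    uint32_t inode_nums[MAX_TXN_INODES]; /* inodes in this transaction */
    uint32_t count;                      /* how many slots are used    */
    uint32_t magic;                      /* marks the txn boundary     */
} txn_info_t;

_Static_assert(sizeof(txn_info_t) == INODE_SIZE,
               "TxnInfo must be an integral multiple of the inode size");
_Static_assert(sizeof(txn_info_t) % CACHE_LINE == 0,
               "each transaction stays cache-line aligned");
```

The `MAX_TXN_INODES` capacity is what gives the knob mentioned above: a larger TxnInfo admits more inodes per transaction, lowering the commit frequency.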
The NVDIMM space used as the journal area is guaranteed to be mapped to a contiguous (virtual) memory region. The reserved space stores (1) the start address and total size, which locate the boundary of the journal space, and (2) the head and tail addresses, which track the current status of the journal. The head and tail are actually stored as memory offsets from the start address of the mapped region. Therefore, even if the mapping changes after a reboot, the head and tail can always be located via their offsets, so the whole file system remains recoverable after a power failure.
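The offset trick can be sketched as follows (structure and function names are illustrative, not from the paper): head and tail are kept relative to the journal's start, so a different virtual mapping after reboot changes only the base pointer, never the stored control data:

```c
#include <stdint.h>

/* Journal control area (would live in the reserved NVM space).
 * Head and tail are byte offsets from the start of the journal,
 * not absolute pointers. Names are illustrative. */
typedef struct {
    uint64_t size;  /* total journal size                 */
    uint64_t head;  /* offset of the oldest live txn      */
    uint64_t tail;  /* offset where the next commit goes  */
} journal_ctrl_t;

/* Resolve an offset against whatever virtual address the NVDIMM
 * range was mapped to in this boot; remapping only changes `base`. */
static char *journal_ptr(char *base, uint64_t offset) {
    return base + offset;
}
```

Because only offsets are persisted, recovery after a power cycle simply remaps the NVDIMM range and resolves the saved head and tail against the new base.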
2. Details about committing, checkpointing, and recovery
The paper also details how dirty inodes are committed to the NVM journal area, how checkpointing is performed, and how recovery works. The main technique is to use cache-line flushes (CLFLUSH) and memory fences (MFENCE) to ensure the journal is always consistent.
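A sketch of the commit ordering under those primitives. The `persist` helper below is a hypothetical stand-in for the CLFLUSH + MFENCE sequence (on real x86 hardware it would flush each touched cache line and fence); here it is modeled as a compiler barrier so the sketch compiles anywhere. The layout and function names are assumptions for illustration:

```c
#include <stdint.h>
#include <string.h>

enum { INODE_SIZE = 256, JOURNAL_SIZE = 1 << 16 };

/* Hypothetical stand-in for CLFLUSH + MFENCE: real code would flush
 * every cache line in [p, p+n) and then fence; modeled here as a
 * compiler barrier only, so the sketch stays portable. */
static void persist(const void *p, size_t n) {
    (void)p; (void)n;
    __asm__ __volatile__("" ::: "memory");
}

static char     journal[JOURNAL_SIZE]; /* would be the mapped NVM area   */
static uint64_t tail;                  /* commit offset (would be in NVM) */

/* Commit `n` dirty inodes followed by a TxnInfo record. The persist
 * points enforce the required ordering: the inode payload becomes
 * durable before the TxnInfo boundary, and the TxnInfo before the
 * tail pointer advances. */
static void commit(const void *inodes, size_t n, const void *txninfo) {
    size_t bytes = n * INODE_SIZE;
    memcpy(journal + tail, inodes, bytes);
    persist(journal + tail, bytes);               /* 1: inodes durable   */
    memcpy(journal + tail + bytes, txninfo, INODE_SIZE);
    persist(journal + tail + bytes, INODE_SIZE);  /* 2: boundary durable */
    tail += bytes + INODE_SIZE;
    persist(&tail, sizeof tail);                  /* 3: advance tail     */
}
```

If a crash hits between steps, recovery sees either a complete transaction (tail already advanced) or none of it (tail still pointing before the partial write), which is exactly the guarantee the paper relies on.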