Summary of Fine-grained Metadata Journaling on NVM

Fine-grained Metadata Journaling on NVM

Paper link:

http://storageconference.us/2016/Papers/Fine-GrainedJournaling.pdf

Below are some key points that are worth mentioning:

  1. Updating data larger than 8 bytes must adopt logging or copy-on-write (CoW) approaches, which make data recoverable by writing a copy elsewhere before updating the data itself. Implementing these approaches on NVM requires the memory writes to follow a certain order, e.g., the log or backup copy of the data must be completely written to NVM before the data itself is updated. Unfortunately, ordering memory writes involves expensive CPU instructions such as memory fences (e.g., MFENCE) and cache-line flushes (e.g., CLFLUSH). For example, prior work showed that directly using them in a B+-tree reduces performance by up to 16x.

  2. Although data consistency can be guaranteed by journaling, a large portion of the metadata committed to the journal is clean and does not need to be written. This is because a typical journaling file system assumes the persistent storage is a block device such as an HDD, so the basic unit in the journal is the disk block. However, an individual metadata structure is much smaller than a block, so metadata is usually stored on HDD in batches. For example, the metadata structure in Ext4, the inode, is 256 bytes, while a typical disk block is 4KB, which means one metadata block stores 16 inodes. Consequently, even when only one of them needs to be written back to disk, all of them have to be written to the journal. The paper calls this journal write amplification, and it is the paper's main focus.

  3. Another drawback of the traditional journal format is that it uses a Descriptor Block and a Commit Block (or Revoke Block) to mark the beginning and end of a transaction, respectively. These two blocks occupy 8KB in total, while a typical inode is only 256B.
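The write-ordering discipline in point 1 can be sketched in C with x86 intrinsics. This is a minimal illustration, not the paper's code; `persist` and `logged_update` are hypothetical helper names, and the undo-log layout is only an example of the "write the backup copy first" rule.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

/* Flush every cache line covering [addr, addr+len), then fence so the
 * flushes complete before any later store becomes visible. */
static void persist(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_mfence();
}

/* Undo-log style update: the backup copy must be durable in NVM
 * before the data itself is overwritten in place. */
static void logged_update(uint64_t *dst, uint64_t *log_slot, uint64_t new_val) {
    *log_slot = *dst;                    /* 1. write backup copy         */
    persist(log_slot, sizeof *log_slot); /* 2. CLFLUSH + MFENCE it first */
    *dst = new_val;                      /* 3. only now update the data  */
    persist(dst, sizeof *dst);
}
```

The two `persist` calls are exactly the expensive instructions the paper refers to; dropping the first one would let the in-place store reach NVM before its backup, making the update unrecoverable after a crash.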

Technique Details

1. Cache-friendly Data Structure TxnInfo: the boundary between two adjacent transactions

TxnInfo describes the inodes in a transaction and is used to locate the boundary of each transaction, which facilitates the recovery process and guarantees its correctness. Its size is an integral multiple of the inode size, so the journal for each transaction is naturally aligned to CPU cache lines, which improves the performance of cache-line flushing. Since the maximum number of inodes in each transaction is determined by the length of TxnInfo, it can also be used to control the default commit frequency and thereby tune overall performance for a given workload.
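The alignment property can be made concrete with a struct sketch. The 256B inode size is from the paper's Ext4 example, but the field layout and the `MAX_TXN_INODES` cap are guesses used only to show how padding TxnInfo to one inode slot keeps every transaction cache-line aligned.

```c
#include <stdint.h>

#define INODE_SIZE     256  /* Ext4 inode size, from the paper      */
#define CACHELINE      64
#define MAX_TXN_INODES 28   /* hypothetical cap implied by the size */

/* TxnInfo trails the committed inodes of a transaction. Padding it to
 * a multiple of the inode size means transaction boundaries always
 * fall on inode (and hence cache-line) boundaries in the journal. */
struct txn_info {
    uint32_t n_inodes;                  /* inodes in this transaction */
    uint32_t inode_nos[MAX_TXN_INODES]; /* which inodes they are      */
    uint8_t  pad[INODE_SIZE - sizeof(uint32_t) * (MAX_TXN_INODES + 1)];
};
```

Because `sizeof(struct txn_info)` equals one inode slot and 256 is a multiple of 64, flushing a transaction never straddles a partially owned cache line.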


The NVDIMM space used as the journal area is guaranteed to be mapped to a contiguous (virtual) memory space. The reserved space stores (1) the start address and total size, which locate the boundary of the journal space, and (2) the head and tail addresses, which record the current status of the journal. The head and tail addresses are actually offsets from the start address of the mapped memory space. Therefore, even if the mapping changes after a reboot, the head and tail can always be located via these offsets, so the whole file system is recoverable after a power failure.
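The offset-based design can be sketched as follows. The field names are illustrative, not from the paper; the point is that rebuilding the head pointer needs only the new mapping base plus the stored offset.

```c
#include <stdint.h>

/* Reserved area at the start of the journal (illustrative layout). */
struct journal_super {
    uint64_t start; /* start of the journal region               */
    uint64_t size;  /* total journal size in bytes               */
    uint64_t head;  /* offset of the oldest live transaction     */
    uint64_t tail;  /* offset where the next commit will go      */
};

/* After a reboot the region may be mapped at a different virtual
 * address, but since head/tail are offsets, the pointers can always
 * be rebuilt from the new mapping base. */
static inline void *journal_head(void *mapped_base,
                                 const struct journal_super *js) {
    return (uint8_t *)mapped_base + js->head;
}
```

Storing absolute virtual addresses instead would break recovery whenever the kernel remaps the NVDIMM region to a different address.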

2. Details about committing, checkpointing and recovery

The paper also details how dirty inodes are committed to the NVM journal area, how checkpointing is performed, and how recovery works. The main technique is to consistently use cache-line flushes (CLFLUSH) and memory fences (MFENCE) to ensure the journal stays consistent.
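The commit ordering this implies can be sketched as below. This is an assumption-laden simplification (`persist` and `commit_txn` are hypothetical names, and wraparound and TxnInfo layout are omitted): the transaction body must be durable before the tail offset advances, so a crash mid-commit leaves the old tail and the half-written transaction is simply ignored by recovery.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CACHELINE 64

/* CLFLUSH each covered cache line, then MFENCE (as in the paper). */
static void persist(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_mfence();
}

/* Simplified commit: dirty inodes (followed by TxnInfo) go into the
 * journal and are flushed before the tail offset is advanced. */
static void commit_txn(uint8_t *journal, uint64_t *tail,
                       const void *body, size_t body_len) {
    memcpy(journal + *tail, body, body_len); /* 1. write transaction    */
    persist(journal + *tail, body_len);      /* 2. make it durable      */
    *tail += body_len;                       /* 3. only then move tail  */
    persist(tail, sizeof *tail);             /* 4. persist the new tail */
}
```

Recovery then scans from head to tail; since the tail was only advanced after the body was flushed, everything inside that range is guaranteed complete.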
