Introduction to the EnhanceIO README


styshoo


STEC EnhanceIO SSD Caching Software

25th December, 2012

1. WHAT IS ENHANCEIO?

The EnhanceIO driver is based on the EnhanceIO SSD caching software
product developed by STEC Inc. EnhanceIO was derived from Facebook's
open source Flashcache project. EnhanceIO uses SSDs as cache devices
for traditional rotating hard disk drives (referred to as source
volumes throughout this document).

EnhanceIO can work with any block device, be it an entire physical
disk, an individual disk partition, a RAIDed DAS device, a SAN volume,
a device mapper volume or a software RAID (md) device.

The source volume to SSD mapping is a set-associative mapping based on
the source volume sector number, with a default set size
(aka associativity) of 512 blocks and a default block size of 4 KB.
Partial cache blocks are not used. The default of 4 KB was chosen
because it is the common I/O block size of most storage systems. With
these defaults, each cache set is 2 MB (512 * 4 KB). A 400 GB SSD will
therefore have a little less than 200,000 cache sets, because a little
space is used for storing the metadata on the SSD.
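The arithmetic above can be sketched as follows. This is a minimal
illustration, not EnhanceIO's code: the 512-byte sector size and the
simple modulo mapping in set_for_sector are assumptions made for the
example, and the set count ignores the space taken by on-SSD metadata.

```python
SECTOR_SIZE = 512        # bytes per sector (assumed for this sketch)
BLOCK_SIZE = 4 * 1024    # default cache block size from the text
ASSOCIATIVITY = 512      # default number of blocks per cache set


def set_size_bytes(block_size=BLOCK_SIZE, assoc=ASSOCIATIVITY):
    """Size of one cache set in bytes."""
    return block_size * assoc


def num_cache_sets(ssd_bytes, block_size=BLOCK_SIZE, assoc=ASSOCIATIVITY):
    """Upper bound on the number of sets; metadata overhead is ignored."""
    return ssd_bytes // set_size_bytes(block_size, assoc)


def set_for_sector(sector, num_sets, block_size=BLOCK_SIZE, assoc=ASSOCIATIVITY):
    """Hypothetical mapping: consecutive set-sized regions of the source
    volume go to consecutive cache sets, wrapping around at num_sets."""
    sectors_per_set = set_size_bytes(block_size, assoc) // SECTOR_SIZE
    return (sector // sectors_per_set) % num_sets


if __name__ == "__main__":
    sets = num_cache_sets(400 * 10**9)
    print(set_size_bytes(), sets, set_for_sector(4096, sets))
```

With a decimal 400 GB and a binary 4 KB block this gives a set count in
the neighbourhood of the figure quoted above; the exact number depends
on the unit convention and on how much space the metadata takes.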

EnhanceIO supports three caching modes (read-only, write-through, and
write-back) and three cache replacement policies (random, FIFO, and
LRU).

Read-only caching mode causes EnhanceIO to direct write I/O requests
only to the HDD. Read I/O requests are issued to the HDD and the data
read from the HDD is stored on the SSD. Subsequent read requests for
the same blocks are served from the SSD, which substantially reduces
their latency.

In write-through mode, reads are handled as in read-only mode.
Write-through mode causes EnhanceIO to write application data to both
the HDD and the SSD. Subsequent reads of the same data benefit because
they can be served from the SSD.

Write-back mode improves write latency by writing application data
only to the SSD. This data, referred to as dirty data, is copied to
the HDD later, asynchronously. Reads are handled as in read-only and
write-through modes.
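The three caching modes can be contrasted with a toy model. This is
only a sketch under stated assumptions: ToyCache, its mode strings and
its dictionaries are hypothetical names, the devices are plain Python
dicts, and dropping a cached copy on a read-only-mode write is an
assumption the text does not spell out.

```python
class ToyCache:
    """Toy model of the three caching modes; not EnhanceIO's code."""

    def __init__(self, mode):
        if mode not in ("ro", "wt", "wb"):
            raise ValueError("mode must be 'ro', 'wt' or 'wb'")
        self.mode = mode
        self.hdd = {}       # block number -> data on the source volume
        self.ssd = {}       # block number -> data in the cache
        self.dirty = set()  # blocks written to the SSD but not yet to the HDD

    def write(self, block, data):
        if self.mode == "ro":
            # Read-only: writes go only to the HDD.
            self.hdd[block] = data
            self.ssd.pop(block, None)  # assumed: drop a now-stale cached copy
        elif self.mode == "wt":
            # Write-through: write to both devices.
            self.hdd[block] = data
            self.ssd[block] = data
        else:
            # Write-back: write only to the SSD and mark the block dirty.
            self.ssd[block] = data
            self.dirty.add(block)

    def read(self, block):
        """Return (data, device_that_served_the_read)."""
        if block in self.ssd:
            return self.ssd[block], "ssd"
        data = self.hdd.get(block)
        if data is not None:
            self.ssd[block] = data  # populate the cache on a read miss
        return data, "hdd"

    def clean(self):
        """Copy dirty blocks to the HDD (asynchronous in the real driver)."""
        for block in sorted(self.dirty):
            self.hdd[block] = self.ssd[block]
        self.dirty.clear()
```

In this model a write-back cache holds the only copy of a block until
clean() runs, which is why losing the SSD in that mode loses data.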

2. WHAT HAS ENHANCEIO CHANGED FROM FLASHCACHE?

2.1. A new write-back engine

The write-back engine in EnhanceIO has been designed from scratch.
Several optimizations have been made. I/O completion guarantees have
been improved. We have defined limits that let a user control the
amount of dirty data in a cache. Clean-up of dirty data is stopped by
default under high load; this can be overridden if required. A user
can control the extent to which a single cache set can be filled with
dirty data. A background thread cleans up dirty data at regular
intervals. Clean-up is also done at regular intervals by identifying
the cache sets which have been written least recently.

2.2. Transparent cache

EnhanceIO does not use device mapper. This enables creation and
deletion of caches while a source volume is being used. It is possible
to create or delete a cache while a partition is mounted.

EnhanceIO also supports creation of a cache for a device which contains
partitions. With this feature it is possible to create a single cache
without having to create several SSD partitions and many separate
caches.

2.3. Large I/O Support

Unlike Flashcache, EnhanceIO does not cause source volume I/O requests
to be split into cache-block-sized pieces. For the typical SSD cache
block size of 4 KB, this means that a write I/O request of, say, 64 KB
to the source volume is not split into 16 individual requests of 4 KB
each. This is a performance improvement over Flashcache. The I/O code
paths have been substantially modified for this improvement.
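To make the request count concrete, the sketch below enumerates the
pieces a splitting cache would issue for one request. The function
name and its (offset, length) representation are hypothetical; it only
illustrates the arithmetic, not either project's I/O path.

```python
def split_into_cache_blocks(offset, length, block_size=4096):
    """Return the (offset, length) pieces produced when a request is cut
    at every cache block boundary, as a splitting cache would do."""
    pieces = []
    end = offset + length
    while offset < end:
        # Stop at the next block boundary or at the end of the request.
        boundary = (offset // block_size + 1) * block_size
        piece_end = min(boundary, end)
        pieces.append((offset, piece_end - offset))
        offset = piece_end
    return pieces


if __name__ == "__main__":
    pieces = split_into_cache_blocks(0, 64 * 1024)
    print(len(pieces), "pieces when split, 1 request when not split")
```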

2.4. Small Memory Footprint

Through a special compression algorithm, the metadata RAM usage has
been reduced to only 4 bytes for each SSD cache block (versus 16 bytes
in Flashcache). Since the most typical SSD cache block size is 4 KB,
this means that RAM usage is about 0.1% (1/1000) of SSD capacity.
For example, for a 400 GB SSD, EnhanceIO needs only 400 MB to keep
all metadata in RAM.

For an SSD cache block size of 8 KB, RAM usage is about 0.05% (1/2000)
of SSD capacity.

The compression algorithm needs at least 32,768 cache sets
(i.e., 16 bits to encode the set number). If the SSD capacity is small
and there are fewer than 32,768 cache sets, EnhanceIO uses 8 bytes of
RAM for each SSD cache block. In this case, RAM usage is about 0.2%
(2/1000) of SSD capacity for a cache block size of 4 KB.
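The figures in this section follow from a short calculation, sketched
below. The function name and defaults are hypothetical, the capacities
are treated as binary units, and on-SSD metadata overhead is ignored.

```python
def metadata_ram_bytes(ssd_bytes, block_size=4096, assoc=512):
    """Estimate the RAM needed for per-block metadata.

    Uses 4 bytes per cache block when there are at least 32,768 cache
    sets and 8 bytes per block otherwise, as described in the text.
    """
    blocks = ssd_bytes // block_size
    sets = blocks // assoc
    bytes_per_block = 4 if sets >= 32768 else 8
    return blocks * bytes_per_block


if __name__ == "__main__":
    GIB = 2**30
    MIB = 2**20
    for size_gib, block in ((400, 4096), (400, 8192), (32, 4096)):
        ram = metadata_ram_bytes(size_gib * GIB, block_size=block)
        print(size_gib, "GiB SSD,", block, "byte blocks:", ram // MIB, "MiB")
```

A 32 GiB SSD with the default geometry has only 16,384 sets, so it
falls into the 8-bytes-per-block case.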

2.5. Loadable Replacement Policies

Since the SSD cache size is typically 10%-20% of the source volume
size, the set-associative nature of EnhanceIO necessitates cache
block replacement.

The main EnhanceIO kernel module that implements the caching engine
uses a random (actually, almost round-robin) replacement policy that
does not require any additional RAM and has the least CPU overhead.
In addition, there are two kernel modules that implement the FIFO and
LRU replacement policies. FIFO is the default cache replacement policy
because it uses less RAM than LRU. The FIFO and LRU kernel modules are
independent of each other and do not have to be loaded if they are not
needed.

Since the replacement policy modules do not consume much RAM when not
in use, both modules are typically loaded after the main caching engine
is loaded. RAM is used only after a cache has been instantiated with
either the FIFO or the LRU replacement policy.
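The difference between FIFO and LRU replacement within one cache set
can be sketched as below. CacheSet is a hypothetical illustration
built on Python's OrderedDict; the kernel modules are implemented
differently and far more compactly.

```python
from collections import OrderedDict


class CacheSet:
    """One cache set with a fixed number of ways and a FIFO or LRU policy."""

    def __init__(self, ways, policy):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.ways = ways
        self.policy = policy
        self.blocks = OrderedDict()  # oldest entry first

    def access(self, tag):
        """Touch a block; return the evicted tag, or None if nothing was evicted."""
        if tag in self.blocks:
            if self.policy == "lru":
                # LRU refreshes a block's position on every hit; FIFO does not.
                self.blocks.move_to_end(tag)
            return None
        evicted = None
        if len(self.blocks) >= self.ways:
            evicted, _ = self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return evicted
```

For the access sequence a, b, a, c on a two-way set, FIFO evicts a
(the oldest insertion) while LRU evicts b (the least recently used).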

Please note that the RAM used for replacement policies is in addition
to the RAM used for metadata (mentioned in Section 2.4). The table
below shows how much RAM each cache replacement policy uses:

POLICY    RAM USAGE
------    ---------
Random    0
FIFO      4 bytes per cache set
LRU       4 bytes per cache set + 4 bytes per cache block
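Applied to a concrete cache, the table works out as in the sketch
below. The function name and the policy strings are hypothetical; the
per-set and per-block costs are the ones listed in the table.

```python
def policy_ram_bytes(policy, num_sets, assoc=512):
    """RAM used by a replacement policy, on top of the metadata RAM."""
    if policy == "random":
        return 0
    if policy == "fifo":
        return 4 * num_sets
    if policy == "lru":
        # 4 bytes per set plus 4 bytes for each block in every set.
        return 4 * num_sets + 4 * num_sets * assoc
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    for policy in ("random", "fifo", "lru"):
        print(policy, policy_ram_bytes(policy, num_sets=204800))
```

For 204,800 sets of 512 blocks, FIFO costs under 1 MB while LRU costs
roughly as much again as the 4-bytes-per-block metadata.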

2.6. Optimal Alignment of Data Blocks on SSD

EnhanceIO writes all metadata and data blocks on 4 KB-aligned blocks
on the SSD. This minimizes write amplification and flash wear.
It also improves performance.

2.7. Improved device failure handling

Failure of an SSD device in read-only and write-through modes is
handled gracefully by allowing I/O to continue to and from the
source volume. An application may notice a drop in performance but it
will not receive any I/O errors.

Failure of an SSD device in write-back mode results in the loss of the
dirty blocks in the cache. To guard against this data loss, two SSD
devices can be mirrored via RAID 1.

EnhanceIO identifies device failures based on error codes. Depending on
whether the failure is likely to be intermittent or permanent, it takes
the best suited action.

2.8. Coding optimizations

Several coding optimizations have been made to reduce CPU usage. These
include removing queues which are not required for write-through and
read-only cache modes, splitting a single large spinlock, and more.
Most of the code paths in Flashcache have been substantially
restructured.

2.9. Sequential I/O bypass

EnhanceIO has removed the sequential I/O bypass available in Flashcache.
The sequential detection logic has a limited use case, especially in a
reasonably multithreaded scenario.

3. ENHANCEIO USAGE

3.1. Cache creation, deletion and editing properties

The eio_cli utility is used for creating and deleting caches and
editing their properties. The manpage for this utility, eio_cli(8),
provides more information.

3.2. Making a cache configuration persistent

It is essential that a cache be resumed before any application or
filesystem uses the source volume during bootup. If a cache is enabled
after the source volume has been written to, stale data may be present
in the cache, which may cause data corruption. The document
Persistent.txt describes how to enable a cache during bootup using
udev scripts.

If an SSD does not come up during bootup, allowing read and write
access to the HDD alone is acceptable only for a write-through or a
read-only cache. The cache should be created again when the SSD becomes
available. If a previous cache configuration is resumed, it may cause
stale data to be read.

3.3. Using a Write-back cache

It is absolutely necessary to make a write-back cache configuration
persistent. This is required particularly in the case of an OS crash
or a power failure. A write-back cache may contain dirty blocks which
have not been written to the HDD yet. Reading the source volume without
enabling the cache will cause incorrect data to be read.

If an SSD does not come up during bootup, access to the HDD should be
stopped. It should be enabled only after the SSD comes up and the cache
is enabled.

A write-back cache needs to perform a clean operation in order to flush
the dirty data to the source device (HDD). The clean can be either
triggered by the user or initiated automatically, based on
preconfigured thresholds. These thresholds are described below. They
can be set using sysctl calls.

a) Dirty high threshold (%): The upper limit on the percentage of
   dirty blocks in the entire cache.

b) Dirty low threshold (%): The lower limit on the percentage of dirty
   blocks in the entire cache.

c) Dirty set high threshold (%): The upper limit on the percentage of
   dirty blocks in a set.

d) Dirty set low threshold (%): The lower limit on the percentage of
   dirty blocks in a set.

e) Automatic clean-up threshold: An automatic clean-up of the cache
   will occur only if the number of outstanding I/O requests from the
   HDD is below the threshold.

f) Time based clean-up interval (minutes): This option allows you to
   specify an interval between each clean-up process.

Cleaning is triggered when one of the upper thresholds or the time
based clean-up interval is met, and stops when all the lower thresholds
are met.
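The start and stop rule above amounts to a hysteresis check, sketched
below. All names here are hypothetical and the numeric values in the
usage lines are made up; the sketch also assumes that the per-set
thresholds are evaluated against the dirtiest set and that "met" means
greater than or equal (or less than or equal), neither of which the
text states.

```python
from dataclasses import dataclass


@dataclass
class CleanThresholds:
    dirty_high: int  # % of dirty blocks in the whole cache that starts a clean
    dirty_low: int   # % of dirty blocks in the whole cache that allows a stop
    set_high: int    # % of dirty blocks in a set that starts a clean
    set_low: int     # % of dirty blocks in a set that allows a stop
    interval_min: int = 0  # time based clean-up interval; 0 disables it here


def should_start_clean(cache_dirty_pct, max_set_dirty_pct, minutes_since_clean, t):
    """Start when any upper threshold or the time interval is met."""
    timer_due = t.interval_min > 0 and minutes_since_clean >= t.interval_min
    return (cache_dirty_pct >= t.dirty_high
            or max_set_dirty_pct >= t.set_high
            or timer_due)


def should_stop_clean(cache_dirty_pct, max_set_dirty_pct, t):
    """Stop only when all lower thresholds are met."""
    return cache_dirty_pct <= t.dirty_low and max_set_dirty_pct <= t.set_low


if __name__ == "__main__":
    t = CleanThresholds(dirty_high=40, dirty_low=10, set_high=60, set_low=20,
                        interval_min=30)
    print(should_start_clean(45, 30, 5, t), should_stop_clean(8, 25, t))
```

Keeping the low thresholds well below the high ones prevents the
cleaner from starting and stopping in rapid succession.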

4. ACKNOWLEDGEMENTS

STEC acknowledges Facebook and in particular Mohan Srinivasan
for the design, development, and release of Flashcache as an
open source project.
