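RAIDs - Redundant Disk Arrays

These are my study notes from reading Operating Systems: Three Easy Pieces. The original was written in Obsidian; because it relies on plugins and some extended syntax, the formatting may be off in places.

RAID Overview

What is a RAID

Literally "Redundant Array of Inexpensive Disks"; the book introduces it as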

a technique to use multiple disks in concert to build a faster, bigger, and more reliable disk system.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=457&selection=44,7,45,66|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 457]]

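Put plainly: take many small disks that are neither very fast nor very reliable, and combine them into one large disk that is fast, big, and reliable.

Why a disk array beats a single disk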

RAIDs offer a number of advantages over a single disk. One advantage is performance. Using multiple disks in parallel can greatly speed up I/O times. Another benefit is capacity. Large data sets demand large disks. Finally, RAIDs can improve reliability; spreading data across multiple disks (without RAID techniques) makes the data vulnerable to the loss of a single disk; with some form of redundancy, RAIDs can tolerate the loss of a disk and keep operating as if nothing were wrong.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=457&selection=58,0,78,63|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 457]]

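  • Performance: using multiple disks in parallel speeds up I/O
  • Capacity: more disks naturally means more capacity
  • Reliability: with some disks devoted to redundancy, the array can tolerate the loss of some disks without losing data, as long as the failures stay within the tolerated limit (if too many disks fail, a RAID cannot help either)

Transparency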

Amazingly, RAIDs provide these advantages transparently to systems that use them, i.e., a RAID just looks like a big disk to the host system.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=458&selection=45,0,50,74|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 458]]

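By transparency, I understand that this component can be swapped freely without changing anything else. For example, I can replace the memory modules in my computer, and as long as the new one works and is compatible, the machine still boots. A RAID can likewise replace a single disk and behave as if it had always been a single disk.

The fail-stop fault model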

In this model, a disk can be in exactly one of two states: working or failed. With a working disk, all blocks can be read or written. In contrast, when a disk has failed, we assume it is permanently lost.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=459&selection=40,0,47,30|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 459]]

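In this model, a disk has only two states:

  • Working: every block can be read or written
  • Failed: once a disk has failed, we assume it is permanently lost

Evaluating RAIDs

How to evaluate a RAID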

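Original text: [[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=459&selection=63,0,63,22|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 459]]

Three criteria: capacity, reliability, and performance

  • Capacity: given N disks, how much usable capacity does the resulting RAID offer
  • Reliability: how many disk failures can the design survive
  • Performance:
    • Single-request latency: it reflects how far a single logical operation can be parallelized (I have not fully worked out how to interpret this yet)
    • Steady-state throughput: RAIDs are usually deployed in high-performance settings, so sustained bandwidth is a primary concern
    • Sequential reads and writes: requests to contiguous blocks
    • Random reads and writes: many small requests to scattered locations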

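Performance evaluation

A concrete example:
[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=463&selection=18,0,97,59|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 463]]
Since this is an example, I have tried to translate it as faithfully as I can.

Assumptions

The book assumes:

  • Transfer rate under a sequential workload: S MB/s
  • Transfer rate under a random workload: R MB/s
  • Average size of a sequential transfer: 10 MB
  • Average size of a random transfer: 10 KB

Q: Why is the assumed sequential transfer so much larger than the random one?
A: Because sequential I/O is by nature large and contiguous, while random I/O consists mostly of small, scattered requests.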

| Parameter | Value |
|---|---|
| Average seek time | 7 ms |
| Average rotational delay | 3 ms |
| Transfer rate of disk | 50 MB/s |

Calculation

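Note: "10MB @ 50MB/s" (10 MB at 50 MB/s) means the time needed to transfer 10 MB at 50 MB/s, i.e. 10 MB / (50 MB/s).

To compute S, we need a typical figure: how long does a 10 MB transfer take?
First come 7 ms of seek and 3 ms of rotation, and then the transfer begins.
10MB @ 50MB/s is 1/5 of a second, or 200 ms, so Time to access = 210 ms.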

$$S=\frac{Amount\ of\ Data}{Time\ to\ access}=\frac{10MB}{210ms}=47.62MB/s$$
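As we can see, because the transfer itself dominates the total time, S is very close to the disk's peak bandwidth (the seek and rotation costs are amortized).


R is computed the same way. Seek and rotation take as long as before; we plug in the transfer time, 10KB @ 50MB/s, or 0.195 ms, for a total access time of 10.195 ms.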

$$R=\frac{Amount\ of\ Data}{Time\ to\ access}=\frac{10KB}{10.195ms}=0.98MB/s$$

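Summary
  • For sequential I/O the transfer dominates the time, so S is very close to the peak bandwidth
  • For random I/O seek and rotation dominate, so it is very slow
  • The two differ by a factor of nearly 50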
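The arithmetic above can be checked with a short Python sketch. The constant and function names are my own, and it assumes decimal units (1 MB = 1000 KB), so the 10 KB transfer comes out as 0.2 ms rather than the 0.195 ms quoted above; the result still rounds to the same R.

```python
SEEK_MS = 7.0          # average seek time
ROTATE_MS = 3.0        # average rotational delay
PEAK_MB_PER_S = 50.0   # disk transfer rate


def effective_rate(size_mb: float) -> float:
    """Effective bandwidth in MB/s for one request of size_mb megabytes."""
    transfer_ms = size_mb / PEAK_MB_PER_S * 1000.0
    total_ms = SEEK_MS + ROTATE_MS + transfer_ms
    return size_mb / (total_ms / 1000.0)


S = effective_rate(10.0)   # 10 MB sequential request
R = effective_rate(0.01)   # 10 KB random request (assuming 1 MB = 1000 KB)
print(f"S = {S:.2f} MB/s, R = {R:.2f} MB/s, S/R = {S / R:.1f}")
```

Changing the request size shows how quickly the fixed 10 ms of positioning stops being amortized.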

The RAID mapping problem

The question is:

Given a logical block, which disk and which block on that disk should it go to?

Disk = A % number_of_disks
Offset = A / number_of_disks

The division rounds down (integer division), because the underlying relationship is
A = Offset * number_of_disks + Disk
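As a minimal sketch of the mapping for simple striping (one block per chunk), the two formulas and their inverse look like this in Python. The function names `map_block` and `unmap_block` are mine, not the book's.

```python
def map_block(a: int, num_disks: int) -> tuple[int, int]:
    """Map logical block address a to (disk, offset) under simple striping."""
    return a % num_disks, a // num_disks


def unmap_block(disk: int, offset: int, num_disks: int) -> int:
    """Inverse mapping: recover the logical address from (disk, offset)."""
    return offset * num_disks + disk


for a in (0, 5, 14):
    disk, offset = map_block(a, 4)
    print(f"block {a} -> disk {disk}, offset {offset}")
```

With 4 disks, logical block 14 lands on disk 2 at offset 3, and `unmap_block(2, 3, 4)` gives 14 back.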

Chunk Sizes

a small chunk size implies that many files will get striped across many disks, thus increasing the parallelism of reads and writes to a single file; however, the positioning time to access blocks across multiple disks increases, because the positioning time for the entire request is determined by the maximum of the positioning times of the requests across all drives.

A big chunk size, on the other hand, reduces such intra-file parallelism, and thus relies on multiple concurrent requests to achieve high throughput. However, large chunk sizes reduce positioning time; if, for example, a single file fits within a chunk and thus is placed on a single disk, the positioning time incurred while accessing it will just be the positioning time of a single disk.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=461&selection=74,65,85,32|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 461]]

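  • A chunk that is too small parallelizes well but increases positioning time
  • A chunk that is too large hurts parallelism but, in exchange, keeps positioning time low

A classic trade-off: choosing "the best chunk size".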


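RAID-0: Striping

Spread consecutive data across multiple disks.
[Figure: raid_0]

Just like this. I find the book's tables less intuitive than this diagram.

A1 and A2 each correspond to what the book calls a chunk. The chunk size is not fixed. Of the book's two tables, the first uses a chunk of 1 block, so the layout is 0 1 2 3 on the first row and 4 5 6 7 on the second.
The second uses a chunk of 2 blocks, so the chunks are (0 1) (2 3) (4 5) (6 7).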

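  • chunk size = 1 block

| Disk 0 | Disk 1 | Disk 2 | Disk 3 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 4 | 5 | 6 | 7 |
| 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 |

  • chunk size = 2 blocks

| Disk 0 | Disk 1 | Disk 2 | Disk 3 |
|---|---|---|---|
| 0 | 2 | 4 | 6 |
| 1 | 3 | 5 | 7 |
| 8 | 10 | 12 | 14 |
| 9 | 11 | 13 | 15 |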
Reliability

any disk failure will lead to data loss.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=462&selection=23,46,24,10|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 462]]

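The failure of any single disk leads to data loss.

Clearly, with no redundancy at all, RAID-0 offers the best theoretical performance but no reliability whatsoever.

Performance

With N disks reading and writing in parallel, throughput scales by a factor of N.
Throughput: $N*S$ for sequential workloads (and $N*R$ for random ones)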


RAID-1: Mirroring

With a mirrored system, we simply make more than one copy of each block in the system; each copy should be placed on a separate disk

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=463&selection=128,12,130,4|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 463]]

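Simply keep a copy of every block, with each copy on a separate disk.

That is, every piece of data is stored on two different disks.

[Figure: RAID-1 layout]

Capacity

With a mirroring level of 2, i.e. two copies of every block, RAID-1 has only half of its raw capacity available (N/2), which is rather expensive.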

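Reliability

At best, up to half of the disks (N/2) can fail: as long as every block still has one surviving copy, nothing is lost. Only a single failure is guaranteed to be tolerated, since losing both copies of the same block loses data. Even so, this is very safe.

Performance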

we analyze performance. From the perspective of the latency of a single read request, we can see it is the same as the latency on a single disk; all the RAID-1 does is direct the read to one of its copies.
A write is a little different : it requires two physical writes to complete before it is done. These two writes happen in parallel, and thus the time will be roughly equivalent to the time of a single write; however, because the logical write must wait for both physical writes to complete, it suffers the worst-case seek and rotational delay of the two requests, and thus (on average) will be slightly higher than a write to a single disk.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=464&selection=109,9,117,63|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 464]]

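  • Read latency: the same as a single disk, because reading one copy is enough
  • Write latency: every logical write needs two physical writes. They can run in parallel, which keeps the time close to a single-disk write. However, the logical write has to wait for both disks to finish, so its latency is the worse (maximum) of the two physical writes, which on average is slightly higher than a single-disk write
  • Steady-state throughput
    • Sequential write: two copies must be written, so it is half the peak bandwidth: (N/2)*S
    • Sequential read: also only half, (N/2)*S. Why? Each disk receives a request for only every other block (this copy serves one block, the other copy serves the next), so each disk delivers only half of its peak bandwidth
    • Random write: (N/2)*R, for the same reason as sequential write
    • Random read: reads can be spread over every disk, so it is N*R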

The consistent-update problem

Original text: [[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=465&selection=18,0,31,6|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 465]]

The problem occurs on a write to any RAID that has to update multiple disks during a single logical operation.

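This problem arises whenever a single logical operation has to update more than one disk.

Scenario:
The RAID needs to update disk0 and disk1. It issues the request to disk0, which completes it, but before the request reaches disk1 something outside its control happens (a power loss). Now disk0 holds the new data while disk1 still holds the old data, and the two disks are inconsistent.

What we actually want is for the operation to be atomic: either both disk0 and disk1 complete their requests, or neither does. How can we get that?

This is where write-ahead logging comes in: as long as the log record has been written, then even if something goes wrong we can replay the log to recover.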
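The idea can be illustrated with a toy in-memory mirror. This is only a sketch under loud assumptions: the class `MirroredPair`, its `crash_after` parameter, and the dictionary-based "disks" are all hypothetical, and the log is assumed to survive a power loss (for example because it lives in non-volatile memory).

```python
class MirroredPair:
    """Toy two-disk mirror with an intent log that is assumed to survive crashes."""

    def __init__(self):
        self.disks = [{}, {}]   # block number -> data, one dict per disk
        self.log = []           # pending (block, data) records

    def write(self, block, data, crash_after=None):
        """Write to both copies; crash_after=i simulates power loss before disk i."""
        self.log.append((block, data))           # 1. record the intent first
        for i, disk in enumerate(self.disks):    # 2. then update each copy
            if crash_after is not None and i == crash_after:
                return False                     # simulated power loss
            disk[block] = data
        self.log.pop()                           # 3. both copies updated: drop the record
        return True

    def recover(self):
        """Replay every pending record to both disks, then clear the log."""
        for block, data in self.log:
            for disk in self.disks:
                disk[block] = data
        self.log.clear()


mirror = MirroredPair()
mirror.write(7, "old")
mirror.write(7, "new", crash_after=1)   # disk0 is updated, disk1 is not
print(mirror.disks)                      # the two copies disagree
mirror.recover()
print(mirror.disks)                      # both copies hold "new" again
```

Replaying is safe here because writing the same block twice gives the same result, so it does not matter how far the interrupted write got.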


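RAID-4: Saving space with parity

Set aside one disk as the parity disk.
[Figure: raid_4]

For the stripe (A1, A2, A3, Ap), A1, A2, and A3 are data and Ap is the parity block:
Ap=XOR(A1, A2, A3)

The principle is simple and parity is familiar territory, so I only note it briefly.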

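Capacity

One disk is used for parity, so the usable capacity is (N-1) disks.

Reliability

Thanks to the properties of parity, the array can tolerate the loss of exactly one disk.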
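To see why one lost disk can be tolerated, here is a small Python sketch of Ap = XOR(A1, A2, A3) and of rebuilding a missing block. The helper name `parity` and the sample byte values are my own.

```python
from functools import reduce


def parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


a1, a2, a3 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
ap = parity([a1, a2, a3])

# Suppose the disk holding a2 fails: XOR the surviving blocks with the parity block.
rebuilt = parity([a1, a3, ap])
print(rebuilt == a2)
```

XOR is its own inverse, so XORing the survivors with the parity cancels everything except the missing block. With two blocks missing there is one equation and two unknowns, which is why only one failure is tolerated.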

Performance

  • Sequential read: (N-1)*S MB/s
  • Sequential write: (N-1)*S MB/s
  • Random read: (N-1)*R MB/s
  • Random write: (R/2) MB/s

Sequential write

the RAID can simply calculate the new value of P0 (by performing an XOR across the blocks 0, 1, 2, and 3) and then write all of the blocks (including the parity block) to the five disks above in parallel

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=468&selection=79,13,81,74|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 468]]
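For sequential writes the RAID can perform a full-stripe write: the new parity value is obtained by XORing the whole stripe once, which costs very little, so the throughput is (N-1)*S MB/s.

Random write

This is RAID-4's weakest point: because the parity has to stay correct, every modification of a single block also requires an update of the corresponding parity block.

There are two methods:

Additive parity

Additive parity: read all the other data blocks in the stripe in parallel, XOR them with the new block, and write the result to the parity block.
This requires a large number of reads.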

Subtractive parity

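Original text: [[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=469&selection=18,9,30,6|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 469]]

Subtractive parity: look only at the parity block and the data block being modified.
The idea behind the algorithm is:
if a bit of the modified data changes, the corresponding parity bit must flip as well;
conversely, if the data bit keeps its value, the parity bit keeps its value too.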

$$P_{new} = (C_{old} \ \oplus \ C_{new}) \ \oplus \ P_{old}$$

In the formula above this shows up, bit by bit, as:

  • If $C_{old} = C_{new}$, then $P_{new} = 0 \oplus P_{old} = P_{old}$
  • If $C_{old} \neq C_{new}$, then $P_{new} = 1 \oplus P_{old} = \neg P_{old}$
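A quick way to convince oneself that the subtractive formula is right is to compare it with a full recomputation. The sketch below does that on a made-up three-block stripe; `xor_bytes`, `additive_parity`, and `subtractive_update` are names I chose for illustration.

```python
from functools import reduce


def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(a ^ b for a, b in zip(x, y))


def additive_parity(blocks: list[bytes]) -> bytes:
    """Recompute the parity from every data block in the stripe."""
    return reduce(xor_bytes, blocks)


def subtractive_update(c_old: bytes, c_new: bytes, p_old: bytes) -> bytes:
    """P_new = (C_old xor C_new) xor P_old, touching only one data block."""
    return xor_bytes(xor_bytes(c_old, c_new), p_old)


stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
p_old = additive_parity(stripe)

c_new = b"\xff\x00"                       # overwrite the second data block
p_sub = subtractive_update(stripe[1], c_new, p_old)
stripe[1] = c_new
p_add = additive_parity(stripe)

print(p_sub == p_add)
```

The subtractive path always needs two reads and two writes (old data, old parity, new data, new parity), regardless of how many disks the stripe spans, whereas the additive path reads more blocks as the array grows.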

Hopefully, the issue is now clear: the parity disk is a bottleneck under this type of workload; we sometimes thus call this the small-write problem for parity-based RAIDs. Thus, even though the data disks could be accessed in parallel, the parity disk prevents any parallelism from materializing

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=469&selection=118,3,128,69|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 469]]

small-write problem

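The parity disk is the bottleneck: even though the data disks could work in parallel, the parity disk prevents that parallelism from materializing. Even with subtractive parity, every logical write costs the parity disk two physical I/Os (one read and one write), so the random-write throughput ends up at (R/2) MB/s.

Latency

A single read maps to exactly one disk, so its latency is the same as a single disk's. A single write takes roughly twice as long, because the old data and old parity are read first and the new data and new parity are written afterwards.

RAID-5: Rotating parity

It differs from RAID-4 only in that RAID-4 keeps all the parity on one disk, whereas RAID-5 rotates the parity blocks across the drives. Apart from random writes (and random reads, which can now use all N disks), its figures should be the same.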

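See the figure:

[Figure: RAID-5 layout with rotated parity]

Performance

With the parity blocks rotated across all of the disks, RAID-5 can process small writes in parallel. Each logical write still generates 4 I/O operations (read the old data, read the old parity, write the new data, write the new parity), so the small-write bandwidth is (N/4)*R MB/s.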
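Since the figure is missing here, the following sketch shows one possible way to rotate parity. The placement rule (parity moves one disk to the left on each stripe, and data continues on the disk after the parity) is an assumption of mine about a common convention, not something these notes specify, and the function names are hypothetical.

```python
def raid5_parity_disk(stripe: int, num_disks: int) -> int:
    """Disk holding the parity block of a stripe (assumed rotation rule)."""
    return (num_disks - 1 - stripe) % num_disks


def raid5_data_disk(stripe: int, k: int, num_disks: int) -> int:
    """Disk holding the k-th data block (0-based) of a stripe."""
    return (raid5_parity_disk(stripe, num_disks) + 1 + k) % num_disks


def raid5_layout(num_stripes: int, num_disks: int) -> list[list]:
    """One row per stripe; entries are logical block numbers or 'P<stripe>'."""
    rows = []
    for s in range(num_stripes):
        row = [None] * num_disks
        row[raid5_parity_disk(s, num_disks)] = f"P{s}"
        for k in range(num_disks - 1):
            row[raid5_data_disk(s, k, num_disks)] = s * (num_disks - 1) + k
        rows.append(row)
    return rows


for row in raid5_layout(5, 5):
    print(row)
```

Because consecutive stripes keep their parity on different disks, two small writes to different stripes usually hit different parity disks and can proceed at the same time.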

Comparison of RAID levels

| | RAID-0 | RAID-1 | RAID-4 | RAID-5 |
|---|---|---|---|---|
| Capacity | N | N/2 | N − 1 | N − 1 |
| Reliability | 0 | 1 (for sure), N/2 (if lucky) | 1 | 1 |
| Throughput | | | | |
| Sequential Read | N · S | (N/2) · S | (N − 1) · S | (N − 1) · S |
| Sequential Write | N · S | (N/2) · S | (N − 1) · S | (N − 1) · S |
| Random Read | N · R | N · R | (N − 1) · R | N · R |
| Random Write | N · R | (N/2) · R | (1/2) · R | (N/4) · R |
| Latency | | | | |
| Read | D | D | D | D |
| Write | D | D | 2D | 2D |
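To get a feel for the throughput rows, the formulas from these notes can be put into a small Python helper and evaluated for concrete values. The function name `raid_throughput` and the dictionary keys are mine. The inputs below reuse S = 47.62 MB/s and R = 0.98 MB/s from the earlier calculation, with N = 4 disks chosen arbitrarily.

```python
def raid_throughput(level: int, n: int, s: float, r: float) -> dict[str, float]:
    """Steady-state throughput in MB/s for RAID level 0, 1, 4, or 5 with n disks."""
    table = {
        0: dict(seq_read=n * s, seq_write=n * s, rand_read=n * r, rand_write=n * r),
        1: dict(seq_read=n / 2 * s, seq_write=n / 2 * s, rand_read=n * r, rand_write=n / 2 * r),
        4: dict(seq_read=(n - 1) * s, seq_write=(n - 1) * s, rand_read=(n - 1) * r, rand_write=r / 2),
        5: dict(seq_read=(n - 1) * s, seq_write=(n - 1) * s, rand_read=n * r, rand_write=n / 4 * r),
    }
    return table[level]


for level in (0, 1, 4, 5):
    result = raid_throughput(level, n=4, s=47.62, r=0.98)
    print(f"RAID-{level}:", {name: round(value, 2) for name, value in result.items()})
```

Note that RAID-4's random-write figure does not depend on N at all, which is the small-write problem in one number.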

Summary

We have discussed RAID. RAID transforms a number of independent disks into a large, more capacious, and more reliable single entity; importantly, it does so transparently, and thus hardware and software above is relatively oblivious to the change. There are many possible RAID levels to choose from, and the exact RAID level to use depends heavily on what is important to the end-user. For example, mirrored RAID is simple, reliable, and generally provides good performance but at a high capacity cost. RAID-5, in contrast, is reliable and better from a capacity standpoint, but performs quite poorly when there are small writes in the workload. Picking a RAID and setting its parameters (chunk size, number of disks, etc.) properly for a particular workload is challenging, and remains more of an art than a science.

[[Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.).pdf#page=472&selection=59,0,72,67|Operating Systems Three Easy Pieces (Remzi H. Arpaci-Dusseau etc.), page 472]]

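In my own words: RAID turns several independent disks into one larger, more reliable logical disk, and it does so transparently, so the hardware and software above it barely notice the change. Which level to use depends on what matters to the end user. Mirroring is simple, reliable, and generally fast, but it costs half the capacity. RAID-5 is also reliable and makes better use of capacity, but it performs poorly when the workload contains small writes. Choosing a RAID level and tuning its parameters (chunk size, number of disks, and so on) for a particular workload is hard, and is more of an art than a science.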
