- Reference: Computer Architecture (6th Edition)
- Memory (memory system): main memory
- Storage Systems: secondary (external) storage, which is persistent and non-volatile
Bus
- I/O buses tap into the processor-memory bus via bus adaptors: the adaptors match speeds (by buffering) and provide the interface
Main components of Intel Chipset: Pentium 4
- Northbridge (the adaptor for high-speed devices): handles memory and graphics
- Southbridge (the adaptor for low-speed devices): I/O, PCI bus, disk controllers, USB controllers, audio, serial I/O, interrupt controller, timers
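IMC (Integrated Memory Controller)
- CPU integration keeps increasing: the memory controller has moved inside the CPU and the northbridge has disappeared. The L1 and L2 caches are integrated into each core, and the L3 cache, shared by the four cores, is also integrated into the CPU
- QPI (QuickPath Interconnect): supports multiple system bus connections and replaces the front-side bus (FSB)
The next step is to integrate memory into the CPU as well…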
The move from Parallel to Serial I/O
- Parallel I/O (ISA bus, PCI, SCSI, IDE)
- Parallel bus clock rate limited by clock skew across long bus (~100MHz)
- High power to drive large number of loaded bus lines
- Central bus arbiter adds latency to each transaction, sharing limits throughput
- Expensive parallel connectors and backplanes/cables (all devices pay costs)
- Dedicated Point-to-point Serial Links (Ethernet, Infiniband, PCI Express, SATA, USB, Firewire)
- Point-to-point links run at multi-gigabit speed using advanced clock/signal encoding (requires lots of circuitry at each end)
- Lower power since only one well-behaved load
- Multiple simultaneous transfers
- Cheap cables and connectors (trade greater endpoint transistor cost for lower physical wiring cost), customize bandwidth per device using multiple links in parallel
- Example: hard-disk interfaces moved from IDE (parallel) $\rightarrow$ SATA (serial)
Disk Storage
- Storage emphasizes reliability and scalability as well as cost-performance
- What is the “software king” that determines which hardware features are actually used?
- Compiler for processor
- Operating System for storage
Flash: The future of disks? (solid-state drives)
- Flash drive advantages: lower power, much faster seek time, 100X I/Os per second, greater reliability, lower noise (all of these follow from having no moving parts)
- Flash disadvantages: cost (20-100x disk cost/GB), slow writes with current design (competitive with disks), and limited write endurance (a location wears out once it has been written too many times). Endurance is not an issue for most applications, since wear leveling spreads writes across the blocks on the chip (the problem is handled in software)
Disk Figure of Merit: Areal Density
- Bits recorded along a track; Metric is Bits Per Inch (BPI)
- Number of tracks per surface; Metric is Tracks Per Inch (TPI)
- Bit density per unit area; Metric is Bits Per Square Inch: Areal Density $= \textrm{BPI} \times \textrm{TPI}$
Disk Drive Performance
- Disk Service Time: the time a disk takes to complete an I/O request is the sum of
- Seek time, rotational latency, and data transfer time (set by the data transfer rate in MB/s)
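The service-time sum can be sketched in a few lines. This is a minimal illustration, not a model of any particular drive: the function name and the sample parameters are made up, and it assumes the average rotational latency is half a revolution.

```python
def disk_service_time_ms(seek_ms, rpm, request_kb, transfer_mb_per_s):
    """Estimate the time in milliseconds to serve one disk I/O request."""
    # Assumption: on average the target sector is half a revolution away.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    # Transfer time = request size / transfer rate (1 MB = 1024 KB here).
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_latency_ms + transfer_ms


# Hypothetical drive: 5 ms average seek, 10,000 RPM, 100 MB/s, 4 KB request.
example_ms = disk_service_time_ms(seek_ms=5.0, rpm=10_000,
                                  request_kb=4, transfer_mb_per_s=100.0)
print(f"{example_ms:.3f} ms")
```

For small requests the seek and rotation terms dominate; the transfer term only matters for large sequential transfers.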
Utilization vs. Response time
- The higher the utilization (the rate of I/O requests), the longer the response time
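One way to see this trend is a simple single-server queueing model. The notes do not say which model is intended, so the M/M/1 formula below (response time = service time / (1 - utilization)) is an assumption used only for illustration.

```python
def response_time_ms(service_time_ms, utilization):
    """Mean response time under an assumed M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


# Response time grows slowly at first, then sharply as utilization nears 1.
for u in (0.1, 0.5, 0.9):
    print(u, response_time_ms(8.0, u))
```

Under this assumed model, a disk with an 8 ms service time responds in 16 ms at 50% utilization, and the response time keeps climbing without bound as utilization approaches 100%.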
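Parameters that describe the reliability of storage devices
- Reliability: the ability of the system to deliver service continuously from its initial state
- Measured by the mean time to failure (MTTF)
- Availability: the fraction of time, within one failure-to-failure interval, during which the system is working normally
- Measured by $\frac{\textrm{MTTF}}{\textrm{MTTF} + \textrm{MTTR}}$, where MTTR is the mean time to repair (for storage, repair means data recovery)
- MTTF + MTTR = MTBF (Mean Time Between Failures)
- Dependability: the degree to which the delivered service can justifiably be relied upon
- Dependability cannot be measured quantitatively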
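The availability and MTBF definitions translate directly into code. The function names and sample numbers below are illustrative only.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf_hours(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


# Hypothetical device: fails on average every 99 hours, takes 1 hour to repair.
print(availability(99.0, 1.0))   # fraction of time the service is up
print(mtbf_hours(99.0, 1.0))     # hours between consecutive failures
```

Shortening the repair time raises availability even when MTTF stays the same, which is why fast data recovery matters for storage.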
Use Arrays of Small Disks?
Replace Small Number of Large Disks with Large Number of Small Disks!
- Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
Array Reliability
- Reliability of $N$ disks = Reliability of 1 disk $\div\ N$
- Arrays (without redundancy) too unreliable to be useful!
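A quick calculation shows how fast reliability drops. This sketch assumes disk failures are independent with a constant failure rate, so the MTTF of the array is the single-disk MTTF divided by the number of disks; the figures are hypothetical.

```python
def array_mttf_hours(disk_mttf_hours, num_disks):
    """MTTF of a non-redundant array: any single disk failure fails the array."""
    return disk_mttf_hours / num_disks


# Hypothetical numbers: 50,000-hour disks, 70 disks in the array.
hours = array_mttf_hours(50_000, 70)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

A disk that would last years on its own gives an array that fails roughly once a month, which motivates adding redundancy.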
RAID
Redundant Arrays of (Inexpensive) Disks
- Files are “striped” across multiple disks
- Redundancy yields high data availability (disks will still fail)
- Availability: service still provided to user, even if some components failed
- Contents reconstructed from data redundantly stored in the array
- Capacity penalty to store redundant info
- Bandwidth penalty to update redundant info
RAID 0: Striping
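- RAID 0: a non-redundant disk array that stores no redundant information
- Data is split into stripes, which are interleaved across the disks stripe by stripe, forming one larger logical disk whose members work in parallel (Stripe0, Stripe1, … in the figure are consecutive stripes; their size is called the stripe width)
- All disks can be read in parallel, so performance is high; but there is no redundancy, so the failure of any one disk brings down the whole system
- Suited to workloads that need high-bandwidth disk access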
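The round-robin placement of stripes can be sketched as a mapping from a logical stripe number to a disk and a row on that disk. The function name and the 4-disk example are illustrative assumptions.

```python
def locate_stripe(stripe_index, num_disks):
    """Return (disk, row) for a logical stripe under round-robin striping."""
    disk = stripe_index % num_disks   # consecutive stripes go to consecutive disks
    row = stripe_index // num_disks   # position of the stripe on that disk
    return disk, row


# Stripe0..Stripe7 on a 4-disk array: each row of 4 stripes can be read in parallel.
for i in range(8):
    print(f"Stripe{i} -> disk {locate_stripe(i, 4)[0]}, row {locate_stripe(i, 4)[1]}")
```

Because consecutive stripes land on different disks, a large sequential request keeps all disks busy at once.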
RAID 1: Disk Mirroring/Shadowing
- Each disk is fully duplicated onto its “mirror”: Very high availability can be achieved
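- Bandwidth sacrifice on write: Logical write = two physical writes (the disk and its mirror are written in parallel and no parity has to be computed, so writes are faster than in the higher RAID levels)
- Reads may be optimized: on a read, the disk and its mirror can work independently and simultaneously, and whichever returns the data first supplies it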
- Most expensive solution: 100% capacity overhead
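RAID 2: Bit-interleaved Hamming-coded Array
- Each data disk stores one bit of every data word (bit interleaving): Disk0 holds bit 0 of every word, Disk1 holds bit 1, and so on. A Hamming code is computed over the corresponding bits of the data disks, and the check bits are stored at the corresponding positions on several check (ECC) disks
- Reading from the data disks also reads the Hamming code, which is used to detect and correct errors (the Hamming code can correct 1-bit errors and detect 2-bit errors)
- Several disks are needed to hold the Hamming check bits; the number of redundant disks is proportional to the logarithm of the number of data disks ($\log_2 m$, where $m$ is the number of data disks)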
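The logarithmic growth can be checked numerically. The sketch below assumes a plain single-error-correcting Hamming code, for which the number of check bits $r$ is the smallest value with $2^r \ge m + r + 1$; detecting double errors as well would need one more check disk, and the function name is made up.

```python
def hamming_check_disks(data_disks):
    """Smallest r with 2**r >= data_disks + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_disks + r + 1:
        r += 1
    return r


# The check-disk count grows roughly with log2 of the data-disk count.
for m in (4, 10, 32):
    print(m, "data disks ->", hamming_check_disks(m), "check disks")
```

The relative overhead shrinks as the array widens, but it stays well above the single check disk that parity-based levels need.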
RAID 3: Bit-interleaved Parity Disk
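- When a disk fails, the disk controller itself can tell which disk failed, so the complex Hamming code is unnecessary and simple parity suffices
- Logically, a single high-capacity, high-transfer-rate disk: good for large transfers, with single-disk fault tolerance and parallel transfer (a fine-grained array, i.e. the stripe unit is small (1 byte or 1 bit), so almost every I/O request is served by all disks in the array, which yields a very high data transfer rate)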
- $1/N$ capacity cost for parity if $N$ data disks and $1$ parity disk
- Wider arrays reduce capacity costs, but decrease reliability/availability
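Parity-based recovery relies on XOR: the parity block is the XOR of all data blocks, so any one missing block equals the XOR of the survivors and the parity. A minimal sketch with made-up 2-byte blocks and a hypothetical helper name:

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data disks and one parity disk (contents are arbitrary sample bytes).
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01", b"\x10\x20"]
parity = xor_blocks(data)

# Disk 2 fails and the controller knows which one: rebuild it from the rest.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])
```

This only works because the failed disk's position is known; plain parity cannot locate an error by itself.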
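RAID 3 read/write characteristics
- Assume 4 data disks and one redundant disk
- Reading data takes 5 disk reads in total (the 4 data disks and the redundant disk are read at the same time)
- Writing data takes 3 disk reads and 2 disk writes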
RAID 4: Block-interleaved Parity Disk
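Inspiration for RAID 4
- In RAID 3, every disk access operates on all disks in the array. RAID 4 aims to involve fewer disks per access so that the array can serve several disk operations in parallel
- RAID 4 stores data block-interleaved across the disks and keeps the parity on one dedicated parity disk; the redundancy cost is the same as in RAID 3 (it is a coarse-grained array: interleaving and parity are computed over relatively large stripe units, i.e. blocks); data is accessed differently from RAID 3
- Small read: every block has an error detection field, so each disk can serve reads independently; this allows independent reads to different disks simultaneously (the parity disk is read only when a disk fails and data must be reconstructed)
- To catch errors on read, rely on error detection field vs. the parity disk
- Large write: the parity has to be recomputed on a write, so almost all disks are accessed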
RAID 5: Block-interleaved Distributed Parity
Inspiration for RAID 5
- Small writes (write to one disk): since P has old sum, compare old data to new data, add the difference to P
Small Write Algorithm
- 1 Logical Write = 2 Physical Reads + 2 Physical Writes
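The small-write rule can be sketched with XOR parity: read the old data and the old parity (2 reads), compute new parity = old parity XOR old data XOR new data, then write the new data and the new parity (2 writes). The helper names and sample bytes are illustrative.

```python
def parity_of(blocks):
    """Full-stripe parity: bytewise XOR of all data blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def small_write_parity(old_data, old_parity, new_data):
    """New parity from only the old data block and the old parity block."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


stripe = [b"\x01\x02", b"\x10\x20", b"\xa0\x0b"]
old_parity = parity_of(stripe)

# Update block 1 without touching blocks 0 and 2.
new_block = b"\xff\x00"
new_parity = small_write_parity(stripe[1], old_parity, new_block)
stripe[1] = new_block
print(new_parity == parity_of(stripe))
```

The shortcut gives the same parity as recomputing over the whole stripe, but it touches only two disks regardless of how wide the array is.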
Problems of Disk Arrays: Small Writes
- Small writes are limited by Parity Disk:
- Writes to $D_0$ and $D_5$ both also write to the P disk (so $D_0$ and $D_5$ still cannot be written at the same time)
RAID 5: High I/O Rate Interleaved Parity
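- To solve the problem above, the parity is distributed over all disks in the array and there is no dedicated redundant disk: the parity block of each row of data blocks is staggered and rotated across different disks, so the parity is spread evenly over all disks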
- Independent writes possible because of interleaved parity
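The effect of rotating the parity can be sketched as follows. The rotation formula is one hypothetical layout (real implementations differ in the exact order), and the helper names are made up; the caller is assumed to pass a data disk that is not the parity disk of that row.

```python
def parity_disk(row, num_disks):
    """Disk holding the parity block of a stripe row (one possible rotation)."""
    return (num_disks - 1 - row) % num_disks


def can_write_in_parallel(write_a, write_b, num_disks):
    """Each write is (row, data_disk); parallel if the disk sets are disjoint."""
    disks_a = {write_a[1], parity_disk(write_a[0], num_disks)}
    disks_b = {write_b[1], parity_disk(write_b[0], num_disks)}
    return disks_a.isdisjoint(disks_b)


# 5 disks: row 0 keeps parity on disk 4, row 1 on disk 3.
print(can_write_in_parallel((0, 0), (1, 1), 5))  # different rows, different parity disks
print(can_write_in_parallel((0, 0), (0, 1), 5))  # same row shares one parity block
```

Small writes to different rows usually hit different parity disks and can proceed together, whereas with a dedicated parity disk every small write would queue on that one disk.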
RAID 6: Two-dimensional Parity Array with Independent Access
Inspiration:
- Recovering from 2 failures
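RAID 6 characteristics
- Builds on RAID 5 by adding a second, independent set of check information stored on another check disk; a write accesses 1 data disk and 2 redundant disks, and the array tolerates two simultaneous disk failures
- Data is block-interleaved across the disks, and the error-detection and error-correction information is spread evenly over all disks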
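Implementing RAID
- Software: the array management software runs on the host
- Advantage: low cost
- Disadvantage: it takes up too much host time, and bandwidth stays low
- RAID card: the RAID management software is built into an I/O controller card, so it takes no host time; typically used in workstations and PCs
- Subsystem: an open platform based on a general-purpose interface bus that can be used with a variety of host platforms and network systems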
Storage Environment
Direct Attached Storage (DAS)
- Servers connect directly to the disk array, typically via a SCSI interface.
Network Attached Storage (NAS)
A file system on the network
- Servers provide the services, while a separate dedicated system handles storage
- NAS Devices access the disks in an array via direct connection or through external connectivity
Storage Area Network (SAN)
Disks on the network
- Servers access the disk array through a dedicated network designated as the SAN, which consists of Fibre Channel switches (a network built specifically for traffic between the storage media and the servers)