WHEN TO (AND NOT TO) USE RAID-Z

RAID-Z is the technology used by ZFS to implement a data-protection scheme that is less costly than mirroring in terms of block overhead. Here, I'd like to go over, from a theoretical standpoint, the performance implications of using RAID-Z. The goal of this technology is to allow a storage subsystem to deliver the stored data in the face of one or more disk failures. This is accomplished by joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z groups can be dynamically striped to form a larger storage pool.

To store file data onto a RAID-Z group, ZFS spreads a filesystem (FS) block across the N devices that make up the group. So for each FS block, (N - 1) devices hold file data and 1 device holds parity information. This information is eventually used to reconstruct (or resilver) data in the face of any device failure. We thus have 1 / N of the available disk blocks used to store parity information. A 10-disk RAID-Z group has 9/10th of its blocks effectively available to applications.

A common alternative for data protection is mirroring. In this technology, a filesystem block is stored onto 2 (or more) mirror copies. Here again, the system will survive a single disk failure (or more with N-way mirroring). So a 2-way mirror actually delivers similar data protection, at the expense of giving applications access to only one half of the disk blocks.

Now let's look at this from the performance angle, in particular that of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z group achieves its protection by spreading a ZFS block across the N underlying devices. That means that a single ZFS block I/O must be converted to N device I/Os. To be more precise, in order to access a ZFS block, we need N device I/Os for output and (N - 1) device I/Os for input, as the parity data need not generally be read in.
Now, after a request for a ZFS block has been spread this way, the I/O scheduling code takes control of all the device I/Os that need to be issued. At this stage, the ZFS code is capable of aggregating adjacent physical I/Os into fewer ones. Because of the ZFS Copy-On-Write (COW) design, we actually do expect this reduction in the number of device-level I/Os to work extremely well for just about any write-intensive workload. We also expect it to help streaming input loads significantly.

The situation of random inputs is one that needs special attention when considering RAID-Z. Effectively, as a first approximation, an N-disk RAID-Z group will behave as a single device in terms of delivered random input IOPS. Thus a 10-disk group of devices, each capable of 200 IOPS, will globally act as a 200-IOPS capable RAID-Z group. This is the price to pay to achieve proper data protection without the 2X block overhead associated with mirroring.

With 2-way mirroring, each FS block output must be sent to 2 devices. Half of the available IOPS are thus lost to mirroring. However, for inputs, each side of a mirror can service read calls independently of the other, since each side holds the full information. Given a proper software implementation that balances the inputs between the sides of a mirror, the number of FS blocks delivered by a mirrored group is actually no less than what a simple non-protected RAID-0 stripe would give.

So, looking at random-access input load in terms of FS blocks per second (FSBPS): given N devices to be grouped either in RAID-Z, 2-way mirrored or simply striped (a.k.a. RAID-0, no data protection!), the equations would be as follows (where dev represents the capacity of a single device, in blocks or in IOPS):

                                    Random
                Blocks Available    FS Blocks / sec
                ----------------    ---------------
    RAID-Z      (N - 1) * dev       1 * dev
    Mirror      (N / 2) * dev       N * dev
    Stripe      N * dev             N * dev

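As a rough illustration, the three rows above can be written as small Python functions. This is only a sketch of the first-approximation model in this article, not a measurement of ZFS. The function names and the split of `dev` into `dev_blocks` and `dev_iops` are my own choices for the example.

```python
def raidz(n, dev_blocks, dev_iops):
    """One N-way RAID-Z group: one device's worth of blocks goes to parity,
    and random reads behave roughly like a single device."""
    return (n - 1) * dev_blocks, 1 * dev_iops


def mirror(n, dev_blocks, dev_iops):
    """N devices arranged as 2-way mirrors: half the blocks are usable,
    but every device can serve random reads independently."""
    return (n // 2) * dev_blocks, n * dev_iops


def stripe(n, dev_blocks, dev_iops):
    """Plain dynamic stripe (RAID-0): no protection, nothing lost."""
    return n * dev_blocks, n * dev_iops


# Illustrative inputs: 10 devices, 100 blocks and 200 random IOPS each.
for name, layout in (("RAID-Z", raidz), ("Mirror", mirror), ("Stripe", stripe)):
    blocks, fsbps = layout(10, 100, 200)
    print(f"{name:<8} blocks available: {blocks:>5}   random FSBPS: {fsbps:>5}")
```

With these inputs the RAID-Z group exposes 900 blocks but only 200 random FSBPS, while the mirror exposes 500 blocks and 2000 random FSBPS.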
Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and look at different possible configurations. In the table below, the configuration labeled "Z 5 x (19+1)" refers to a dynamic striping of 5 RAID-Z groups, each group made of 20 disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a simple dynamic stripe.

                                        Random
    Config          Blocks Available    FS Blocks / sec
    ------------    ----------------    ---------------
    Z 1 x (99+1)     9900 GB              200
    Z 2 x (49+1)     9800 GB              400
    Z 5 x (19+1)     9500 GB             1000
    Z 10 x (9+1)     9000 GB             2000
    Z 20 x (4+1)     8000 GB             4000
    Z 33 x (2+1)     6600 GB             6600
    M 2 x (50)       5000 GB            20000
    S 1 x (100)     10000 GB            20000

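The figures in this table follow directly from the model, and a short Python sketch can regenerate them. It assumes, as the article does, 100 GB disks at 200 IOPS each and one device's worth of random-read IOPS per RAID-Z group. The helper names are hypothetical, introduced only for this example.

```python
DISK_GB = 100     # capacity of one disk
DISK_IOPS = 200   # random IOPS of one disk


def raidz_pool(groups, data_disks):
    """Dynamic stripe of `groups` RAID-Z groups, each (data_disks + 1) wide."""
    usable_gb = groups * data_disks * DISK_GB
    random_fsbps = groups * DISK_IOPS  # each group acts like a single device
    return usable_gb, random_fsbps


def mirror_pool(disks):
    """All disks paired into 2-way mirrors; every disk serves random reads."""
    return (disks // 2) * DISK_GB, disks * DISK_IOPS


def stripe_pool(disks):
    """Simple dynamic stripe, no protection."""
    return disks * DISK_GB, disks * DISK_IOPS


for groups, data in [(1, 99), (2, 49), (5, 19), (10, 9), (20, 4), (33, 2)]:
    gb, fsbps = raidz_pool(groups, data)
    print(f"Z {groups} x ({data}+1): {gb} GB, {fsbps} FSBPS")

print("M 2 x (50): {} GB, {} FSBPS".format(*mirror_pool(100)))
print("S 1 x (100): {} GB, {} FSBPS".format(*stripe_pool(100)))
```

Note that the 33 x (2+1) layout uses only 99 of the 100 disks, which is why it shows 6600 GB rather than a round fraction of the total.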
So RAID-Z gives you at most 2X the number of blocks that mirroring provides, but hits you with far fewer delivered IOPS. That means that, as the number of devices N in a group increases, the expected gain over mirroring (disk blocks) is bounded (to at most 2X) but the expected cost in IOPS is not bounded (cost in the range of [N/2, N] fewer IOPS).

Note that for wide RAID-Z configurations, ZFS takes into account the sector size of the devices (typically 512 bytes) and dynamically adjusts the effective number of columns in a stripe. So even if you request a 99+1 configuration, the actual data will probably be stored on far fewer data columns than that. Hopefully this article will contribute to steering deployments away from those types of configuration.

In conclusion, when preserving IOPS capacity is important, RAID-Z groups should be kept small and one must accept some level of disk block overhead. When performance matters most, mirroring should be highly favored. If mirroring is considered too costly but performance is nevertheless required, one could proceed like this:

    Given N devices each capable of X IOPS.
    Given a target of delivered Y FS blocks per second
    for the storage pool.
    Build your storage using (Y / X) dynamically striped
    RAID-Z groups, each made of N / (Y / X) devices.

For instance:

    Given 50 devices each capable of 200 IOPS.
    Given a target of delivered 1000 FS blocks per second
    for the storage pool.
    Build your storage using (1000 / 200) = 5 dynamically striped
    RAID-Z groups, each made of 50 / 5 = 10 devices.

In that system we would then have 10% block overhead lost to maintaining RAID-Z level parity. Narrower groups (say, 5 devices each) would exceed the target, at the cost of 20% block overhead.

RAID-Z is a great technology not only when disk blocks are your most precious resource but also when your available IOPS far exceed your expected needs. But beware that if you get your hands on fewer, very large disks, the IOPS capacity can easily become your most precious resource. Under those conditions, mirroring should be strongly favored, or alternatively a dynamic stripe of RAID-Z groups each made up of a small number of devices.
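
The sizing exercise above can be sketched in Python. It rests on the same first-approximation assumption used throughout this article: each single-parity RAID-Z group delivers one device's worth of random-read IOPS. The function names are hypothetical, and real pools will deviate from this model.

```python
def evaluate_width(n_devices, width, dev_iops):
    """For RAID-Z groups of `width` devices, return
    (number of groups, pool random-read FSBPS, parity overhead fraction)."""
    groups = n_devices // width          # leftover devices stay unused
    return groups, groups * dev_iops, 1 / width


def widest_group(n_devices, dev_iops, target_fsbps):
    """Widest group (least parity overhead) that still meets the target,
    or None if even 2-device groups fall short."""
    for width in range(n_devices, 1, -1):
        _, fsbps, _ = evaluate_width(n_devices, width, dev_iops)
        if fsbps >= target_fsbps:
            return width
    return None


# The example from the text: 50 devices, 200 IOPS each, target 1000 FSBPS.
for width in (5, 10):
    groups, fsbps, overhead = evaluate_width(50, width, 200)
    print(f"width {width}: {groups} groups, {fsbps} FSBPS, "
          f"{overhead:.0%} parity overhead")

print("widest group meeting the target:", widest_group(50, 200, 1000))
```

Under this model, 5-device groups give 10 groups, 2000 FSBPS and 20% parity overhead, while 10-device groups give 5 groups, exactly 1000 FSBPS and 10% overhead. Narrower groups buy IOPS headroom at the cost of blocks.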