This article covers techniques for optimizing ZFS read/write IOPS and throughput (read and write operations each fall into synchronous and asynchronous cases).
Factors that affect performance
1. The performance of the underlying devices directly affects synchronous read/write IOPS and throughput. Asynchronous reads and writes depend on the cache (ARC, L2ARC) devices and their configuration.
2. The redundancy layout chosen for each vdev affects IOPS and throughput.
Because ZPOOL I/O is striped evenly across all vdevs, the more vdevs a pool has, the better its IOPS and throughput.
Within a single vdev, write performance ranks mirror > raidz1 > raidz2 > raidz3,
while read performance is determined by the number of disks actually holding data. (raidz1(3) = raidz2(4) = raidz3(5) > mirror(n), i.e. a 3-disk raidz1, a 4-disk raidz2, and a 5-disk raidz3 each have two data disks, and all read faster than an n-way mirror.)
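As a sketch of the striping effect described above (the device names are hypothetical, and these are admin commands that require root and real disks), a pool built from two mirror vdevs spreads I/O across both of them:

```shell
# Hypothetical devices sda..sdd arranged as two mirror vdevs in one pool.
# ZPOOL stripes I/O across both vdevs, so this layout offers roughly
# twice the write IOPS of a single two-disk mirror.
zpool create zp1 mirror sda sdb mirror sdc sdd
zpool status zp1    # should list both mirror vdevs under zp1
```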
3. I/O alignment with the underlying devices affects IOPS.
ashift must be specified when the zpool (more precisely, the vdev) is created, and can never be changed afterwards.
It is recommended that all underlying devices in a vdev share the same sector size. If they differ, either take the largest sector size as the ashift, or place the mismatched block devices in separate vdevs.
For example, if sda and sdb have sector=512 while sdc and sdd have sector=4K:
zpool create -o ashift=9 zp1 mirror sda sdb
zpool add -o ashift=12 zp1 mirror sdc sdd
ashift
Pool sector size exponent, to the power of 2 (internally referred to as "ashift"). I/O operations will be
aligned to the specified size boundaries. Additionally, the minimum (disk) write size will be set to the
specified size, so this represents a space vs. performance trade-off. The typical case for setting this
property is when performance is important and the underlying disks use 4KiB sectors but report 512B sectors
to the OS (for compatibility reasons); in that case, set ashift=12 (which is 1<<12 = 4096).
For optimal performance, the pool sector size should be greater than or equal to the sector size of the
underlying disks. Since the property cannot be changed after pool creation, if in a given pool, you ever
want to use drives that report 4KiB sectors, you must set ashift=12 at pool creation time.
Keep in mind that the ashift is vdev specific and is not a pool global. This means that when adding new
vdevs to an existing pool you may need to specify the ashift.
There is a tool that catalogs the sector sizes of common devices.
If you are unsure of the underlying device's sector size, you can set ashift=13 (8K) to stay aligned.
For example:
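ashift is simply log2 of the sector size (512 -> 9, 4096 -> 12, 8192 -> 13). A minimal shell sketch for deriving it; on Linux, the physical sector size itself can be read with `lsblk -o NAME,PHY-SEC,LOG-SEC`:

```shell
# Derive ashift = log2(sector size) by repeated halving.
# Query the sector size first, e.g.: lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda
sector=4096
ashift=0
s=$sector
while [ "$s" -gt 1 ]; do
    s=$((s / 2))
    ashift=$((ashift + 1))
done
echo "ashift=$ashift"    # 4096-byte sectors -> ashift=12
```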
# zpool create -o ashift=13 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zpool create -o ashift=9 zp2 scsi-36c81f660eb17fb001b2c5ff465cff3ed
# zfs create -o mountpoint=/data01 zp1/data01
# zfs create -o mountpoint=/data02 zp2/data02
# date +%F%T; dd if=/dev/zero of=/data01/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:57:35
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 46.4277 s, 185 MB/s
2014-06-2609:58:22
# date +%F%T; dd if=/dev/zero of=/data02/test.img bs=1024K count=8192 oflag=sync,noatime,nonblock; date +%F%T;
2014-06-2609:58:32
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 43.9984 s, 195 MB/s
2014-06-2609:59:16
# zpool list
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
zp1   3.62T  8.01G  3.62T   0%  1.00x  ONLINE  -
zp2   3.62T  8.00G  3.62T   0%  1.00x  ONLINE  -
No difference shows up with large files. With small files, however, any file smaller than the block size implied by ashift wastes the remainder of its block, lowers small-file write efficiency, increases cache usage, and so on.
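To put a rough number on that waste (a back-of-the-envelope sketch ignoring compression and metadata, not a ZFS measurement): a 1K file on a vdev with ashift=13 still occupies one full 8K block:

```shell
# Rough space-waste estimate for a small file: anything smaller
# than 1<<ashift still consumes a whole block on disk.
ashift=13
file_size=1024                # a 1K file
block=$((1 << ashift))        # 8192 bytes per block
waste=$((block - file_size))  # 7168 bytes lost to padding
echo "block=$block waste=$waste"
```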
4. Underlying device mode: JBOD or passthrough is recommended, bypassing the RAID card's control logic.
5. ZFS parameters directly affect IOPS and throughput.
5.1
For database-type workloads (large files accessed as scattered small data sets), choosing a recordsize greater than or equal to the database block size works well. For example, PostgreSQL uses an 8K block_size, so a ZFS recordsize of at least 8KB is recommended. In general there is no need to adjust recordsize; the default 128K satisfies most workloads.
5.2
recordsize=size
Specifies a suggested block size for files in the file system. This property is designed solely for use
with database workloads that access files in fixed-size records. ZFS automatically tunes block sizes
according to internal algorithms optimized for typical access patterns.
For databases that create very large files but access them in small random chunks, these algorithms may be
suboptimal. Specifying a recordsize greater than or equal to the record size of the database can result in
significant performance gains. Use of this property for general purpose file systems is strongly discouraged, and may adversely affect performance.
The size specified must be a power of two greater than or equal to 512 and less than or equal to 128
Kbytes.
Changing the file system’s recordsize affects only files created afterward; existing files are unaffected.
This property can also be referred to by its shortened column name, recsize.
Test:
# zpool create -o ashift=12 zp1 scsi-36c81f660eb17fb001b2c5fec6553ff5e
# zfs create -o mountpoint=/data01 -o recordsize=8K -o atime=off zp1/data01
# zfs create -o mountpoint=/data02 -o recordsize=128K -o atime=off zp1/data02
# zfs create -o mountpoint=/data03 -o recordsize=512 -o atime=off zp1/data03
Disable data caching (cache metadata only); this does not change the outcome.
# zfs set primarycache=metadata zp1/data01
# zfs set primarycache=metadata zp1/data02
# zfs set primarycache=metadata zp1/data03
# mkdir -p /data01/pgdata
# mkdir -p /data02/pgdata
# mkdir -p /data03/pgdata
# chown postgres:postgres /data0*/pgdata
pg_test_fsync results: 512 is the worst; 8K and 128K are about the same.
512
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
fdatasync 252.052 ops/sec 3967 usecs/op
fsync 248.701 ops/sec 4021 usecs/op
Non-Sync'ed 8kB writes:
write 7615.510 ops/sec 131 usecs/op
8K
fdatasync 329.874 ops/sec 3031 usecs/op
fsync 329.008 ops/sec 3039 usecs/op
Non-Sync'ed 8kB writes:
write 83849.214 ops/sec 12 usecs/op
128K
fdatasync 329.207 ops/sec 3038 usecs/op
fsync 328.739 ops/sec 3042 usecs/op
Non-Sync'ed 8kB writes:
write 76100.311 ops/sec 13 usecs/op
5.3
Compression speed and compression ratio cannot both be maximized. LZ4 is generally recommended, as its compression speed and comp…