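WRITE SAME is an optional SCSI command (a device is not required to implement it). Its main use is resetting the contents of a device.
A typical scenario is ESXi creating a "thick provision eager zeroed" volume, which zeroes the entire volume.
In cloud environments a VM usually has several volumes attached, and each volume is gigabytes to terabytes in size.
For stable performance, many distributed storage systems need the volume to be written once in full before running workloads or benchmarks.
The point of the full write is to let the backend storage allocate its metadata ahead of time and warm up.
If the upper layer issues the full-volume write directly, for example with write, then writing a terabyte-scale volume takes a very long time.
For example, writing a 1T volume sequentially at 500M/s takes about 2000s, a little over 30 minutes.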
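The estimate above can be reproduced with a few lines of shell arithmetic. This is a minimal sketch: the function name `full_write_seconds` is made up for illustration, and it assumes binary units (1 TiB at 500 MiB/s), which gives a slightly larger figure than the round 2000 s you get with decimal units.

```shell
#!/usr/bin/env bash
# Estimate the duration of a full sequential write using integer arithmetic.
# Usage: full_write_seconds <size_in_GiB> <throughput_in_MiB_per_s>
# (hypothetical helper, not part of any tool used in this article)
full_write_seconds() {
    local size_gib=$1 mib_per_s=$2
    echo $(( size_gib * 1024 / mib_per_s ))
}

secs=$(full_write_seconds 1024 500)                  # 1 TiB at 500 MiB/s
echo "${secs} s, about $(( (secs + 30) / 60 )) min"  # rounded to the nearest minute
# prints "2097 s, about 35 min"
```

Either way the result is on the order of half an hour per terabyte, and every byte of it has to cross the transport.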
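Write Same is designed for exactly this scenario.
By issuing a Write Same command at the block device layer, the upper layer no longer has to transfer all that data. It sends only a small payload (512B),
and the backend writes that payload repeatedly, which offloads the write to the backend.
In short, the goals of Write Same are:
greatly reduce the amount of data transferred;
offload the full-volume write to the backend;
when combined with UNMAP, avoid as many writes as possible and improve performance even further.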
The agenda of this article:
a short introduction to SCSI provisioning;
how to inspect SCSI provisioning and the Write Same/Unmap capabilities a device reports;
how to test WRITE SAME.
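Provisioning
Provisioning determines how logical blocks map to physical blocks.
According to the SBC-3 specification, provisioning falls into the following categories:
full provisioning: logical blocks and physical blocks map one to one
logical block provisioning, which comes in two forms:
resource provisioning: there are enough resources for every logical block to map to a physical block, but some logical blocks are currently unmapped or anchored;
thin provisioning: overcommit is allowed, that is, the number of logical blocks (LBs) can exceed the number of physical blocks
Terminology
- anchored: reserved; the LB has a physical block assigned to it, but the block is not in use
- unmapped: the LB has no physical block assigned to it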
Inspecting SCSI features with sg3_utils
Check whether logical block provisioning is supported:
$ sg_readcap /dev/sda -l
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=1, lbprz=0
Last logical block address=629145599 (0x257fffff), Number of logical blocks=629145600
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned logical block address=0
Hence:
Device size: 322122547200 bytes, 307200.0 MiB, 322.12 GB
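As a quick sanity check on the output above, the reported device size is simply the number of logical blocks multiplied by the logical block length. The sketch below only redoes that arithmetic with the values copied from the sg_readcap output; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Device size = number of logical blocks x logical block length.
blocks=629145600     # "Number of logical blocks" from the sg_readcap output
block_len=512        # "Logical block length" in bytes

bytes=$(( blocks * block_len ))
echo "last LBA: $(( blocks - 1 ))"
echo "$bytes bytes, $(( bytes / 1024 / 1024 )) MiB"
# prints "last LBA: 629145599" and "322122547200 bytes, 307200 MiB"
```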
Check the block limits:
$ sg_vpd -p bl /dev/sdb
Block limits VPD page (SBC):
Write same no zero (WSNZ): 0
Maximum compare and write length: 0 blocks
Optimal transfer length granularity: 1 blocks
Maximum transfer length: 2097152 blocks
Optimal transfer length: 64 blocks
Maximum prefetch length: 0 blocks
Maximum unmap LBA count: 4294967295
Maximum unmap block descriptor count: 256
Optimal unmap granularity: 1
Unmap granularity alignment valid: 0
Unmap granularity alignment: 0
Maximum write same length: 0xffff blocks
Check the logical block provisioning (LBP) information of a disk:
$ sg_vpd -p lbpv /dev/sda
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 0
Write same (16) with unmap bit supported (LBWS): 1
Write same (10) with unmap bit supported (LBWS10): 0
Logical block provisioning read zeros (LBPRZ): 0
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0
Descriptor present (DP): 0
Provisioning type: 0
These commands are used all the time when checking SCSI provisioning, Write Same and Unmap support. A SCSI device exposes its capabilities through these inquiries.
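When these checks are scripted, the flags can be pulled out of the `sg_vpd -p lbpv` text with sed. The helper below is a minimal sketch: `lbp_field` is a hypothetical name, and it assumes the output keeps the "(NAME): value" layout shown above, which may differ between sg3_utils versions. The sample lines are copied from the /dev/sda output so the snippet runs without a device.

```shell
#!/usr/bin/env bash
# Print the value of one flag from `sg_vpd -p lbpv` output read on stdin.
# Usage: sg_vpd -p lbpv /dev/sdX | lbp_field LBWS
# (hypothetical helper; relies on the "(NAME): value" text layout)
lbp_field() {
    # Match "(NAME): <digits>" and print only the digits.
    sed -n "s/.*($1): *\([0-9][0-9]*\).*/\1/p"
}

# Sample lines copied from the /dev/sda output above.
sample='Unmap command supported (LBPU): 0
Write same (16) with unmap bit supported (LBWS): 1
Write same (10) with unmap bit supported (LBWS10): 0'

for flag in LBPU LBWS LBWS10; do
    echo "$flag=$(printf '%s\n' "$sample" | lbp_field "$flag")"
done
# prints LBPU=0, LBWS=1, LBWS10=0 (one per line)
```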
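The key abbreviations in the output above, and what they mean:
LBPU: Logical block provisioning unmap; the UNMAP command is supported
LBWS: Logical block provisioning write same; WRITE SAME (16) with the unmap bit is supported
LBWS10: Logical block provisioning write same (10); WRITE SAME (10) with the unmap bit is supported
LBPRZ: Logical block provisioning read zeros
lbpme: logical block provisioning management enabled; 1 means logical block provisioning is supported;
lbprz: logical block provisioning read zeros; 1 means reading an unmapped block returns zeros;
Maximum write same length: 0xffff blocks is the largest number of blocks a single Write Same command can write;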
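The Maximum write same length limit determines how many commands an initiator needs to cover a whole volume. The sketch below works that out for a 1 TiB volume with 512-byte blocks and the 0xffff limit shown above. The function name `write_same_cmds` is hypothetical, and the payload figure assumes one 512-byte block per command and ignores CDB and transport overhead.

```shell
#!/usr/bin/env bash
# Number of WRITE SAME commands needed to cover a volume, given the
# per-command limit reported as "Maximum write same length".
# Usage: write_same_cmds <volume_bytes> <block_len> <max_blocks_per_cmd>
# (hypothetical helper for illustration)
write_same_cmds() {
    local total_blocks=$(( $1 / $2 )) max=$3
    echo $(( (total_blocks + max - 1) / max ))   # ceiling division
}

tib=$(( 1024 * 1024 * 1024 * 1024 ))
cmds=$(write_same_cmds "$tib" 512 $(( 0xffff )))
echo "$cmds commands, $(( cmds * 512 )) payload bytes"
# prints "32769 commands, 16777728 payload bytes"
```

Under these assumptions roughly 16 MiB of payload crosses the wire instead of 1 TiB, which is where the reduction in data transfer comes from.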
Check the mapping status:
$ sg_get_lba_status /dev/sda
descriptor LBA: 0x0000000000000000 blocks: 838860800 mapped
Testing with scsi_debug
$ modprobe scsi_debug lbprz=1 lbpu=1 lbpws=1 dev_size_mb=1024
$ sg_readcap /dev/sdb -l
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=1, lbprz=1
Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned logical block address=0
Hence:
Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
$ sg_vpd -p lbpv /dev/sdb
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 1
Write same (16) with unmap bit supported (LBWS): 1
Write same (10) with unmap bit supported (LBWS10): 0
Logical block provisioning read zeros (LBPRZ): 1
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0
Descriptor present (DP): 0
Provisioning type: 0
$ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=0 lbpws=0 dev_size_mb=1024
$ sg_readcap /dev/sdb -l && sg_vpd -p lbpv /dev/sdb
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=0, lbprz=0
Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned logical block address=0
Hence:
Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 0
Write same (16) with unmap bit supported (LBWS): 0
Write same (10) with unmap bit supported (LBWS10): 0
Logical block provisioning read zeros (LBPRZ): 1
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0
Descriptor present (DP): 0
Provisioning type: 0
$ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=1 lbpws=0 dev_size_mb=1024
$ sg_readcap /dev/sdb -l && sg_vpd -p lbpv /dev/sdb
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=1, lbprz=1
Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned logical block address=0
Hence:
Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 1
Write same (16) with unmap bit supported (LBWS): 0
Write same (10) with unmap bit supported (LBWS10): 0
Logical block provisioning read zeros (LBPRZ): 1
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0
Descriptor present (DP): 0
Provisioning type: 0
Disable logical block provisioning
$ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=0 lbpws=0 dev_size_mb=1024
$ sg_readcap /dev/sdb -l && sg_vpd -p lbpv /dev/sdb
Read Capacity results:
Protection: prot_en=0, p_type=0, p_i_exponent=0
Logical block provisioning: lbpme=0, lbprz=0
Last logical block address=2097151 (0x1fffff), Number of logical blocks=2097152
Logical block length=512 bytes
Logical blocks per physical block exponent=0
Lowest aligned logical block address=0
Hence:
Device size: 1073741824 bytes, 1024.0 MiB, 1.07 GB
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 0
Write same (16) with unmap bit supported (LBWS): 0
Write same (10) with unmap bit supported (LBWS10): 0
Logical block provisioning read zeros (LBPRZ): 1
Anchored LBAs supported (ANC_SUP): 0
Threshold exponent: 0
Descriptor present (DP): 0
Provisioning type: 0
$ sg_get_lba_status -l 1024 /dev/sdb
Get LBA Status command not supported
Testing unmap
$ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=1 lbpws=1 dev_size_mb=1024
$ dd if=/dev/zero of=/dev/sdb bs=512 seek=1024 count=138 && sg_get_lba_status -l 1024 /dev/sdb
138+0 records in
138+0 records out
70656 bytes (71 kB) copied, 0.0192505 s, 3.7 MB/s
descriptor LBA: 0x0000000000000400 blocks: 144 mapped
$ sg_unmap -v -l 1024 -n 16 /dev/sdb
unmap cdb: 42 00 00 00 00 00 00 00 18 00
$ sg_get_lba_status -l 1024 /dev/sdb
descriptor LBA: 0x0000000000000400 blocks: 16 deallocated
$ sg_get_lba_status -l 1040 /dev/sdb
descriptor LBA: 0x0000000000000410 blocks: 128 mapped
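The `unmap cdb` line printed by `sg_unmap -v` can be read as follows: 0x42 is the UNMAP opcode, and the two bytes before the final control byte (00 18) are the parameter list length. In SBC-3 the UNMAP parameter list is an 8-byte header followed by one 16-byte block descriptor per LBA range, so a single range gives 24 bytes, i.e. 0x18. The sketch below just redoes that arithmetic; `unmap_param_len` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# UNMAP parameter list = 8-byte header + one 16-byte descriptor per LBA range.
# Usage: unmap_param_len <number_of_block_descriptors>
# (hypothetical helper for illustration)
unmap_param_len() {
    echo $(( 8 + 16 * $1 ))
}

# One range (LBA 1024, 16 blocks), as in the sg_unmap call above.
printf 'parameter list length: 0x%04x\n' "$(unmap_param_len 1)"
# prints "parameter list length: 0x0018"
```

By the same formula, a device that reports a maximum unmap block descriptor count of 256 accepts a parameter list of at most 8 + 16 * 256 = 4104 bytes.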
write same
See the man page for details: https://www.systutorials.com/docs/linux/man/8-sg_write_same/
$ rmmod scsi_debug && modprobe scsi_debug lbprz=1 lbpu=1 lbpws=1 dev_size_mb=1024
$ dd if=/dev/zero of=/dev/sdb bs=512 seek=1024 count=138 && sg_get_lba_status -l 1024 /dev/sdb
138+0 records in
138+0 records out
70656 bytes (71 kB) copied, 0.0187926 s, 3.8 MB/s
descriptor LBA: 0x0000000000000400 blocks: 144 mapped
$ sg_write_same -U --in /dev/zero --num=128 --lba=1024 /dev/sdb
$ sg_get_lba_status -l 1024 /dev/sdb
descriptor LBA: 0x0000000000000400 blocks: 128 deallocated
$ dd if=/dev/zero of=/dev/sdb bs=512 seek=1024 count=138 && sg_get_lba_status -l 1024 /dev/sdb
138+0 records in
138+0 records out
70656 bytes (71 kB) copied, 0.0186514 s, 3.8 MB/s
descriptor LBA: 0x0000000000000400 blocks: 144 mapped
$ sg_write_same --in /dev/zero --num=128 --lba=1024 /dev/sdb
$ sg_get_lba_status -l 1024 /dev/sdb
descriptor LBA: 0x0000000000000400 blocks: 144 mapped
$ cat /sys/bus/pseudo/drivers/scsi_debug/map
1152-1167
WRITE SAME test cases run after the SPDK implementation was finished
1. Write 512 bytes of zeros
dd if=/dev/urandom of=/dev/sdf bs=512 seek=0 count=2 && sg_get_lba_status -l 0 /dev/sdf
hexdump -C -n1024 /dev/sdf
sg_write_same --num=1 --lba=0 /dev/sdf -vvv
hexdump -C -n1024 /dev/sdf
2. unmap
# not supported
sg_write_same -U --in buf --num=1 --lba=0 /dev/sdf -vvv
3. Write a payload smaller than 512 bytes
perl -e 'print("-" x 504, "+" x 4);' >buf
time sg_write_same -U --in buf --num=4 --lba=0 /dev/sdf -vvv
4. Write a payload larger than 512 bytes
Not aligned to 512 bytes
perl -e 'print("-" x 512, "+" x 4);' >buf
time sg_write_same -U --in buf --num=4 --lba=0 /dev/sdf -vvv
Aligned to 512 bytes
perl -e 'print("-" x 510, "+" x 514);' >buf
time sg_write_same --in buf --num=4 --lba=0 /dev/sdf -vvv
hexdump -C -n1024 /dev/sdf
5. Performance test
Write 256M of data
time sg_write_same --num=$((256*1024*1024/512)) --lba=0 /dev/sdf -vvv
no-data-out
time sg_write_same -N --num=$((256*1024*1024/512)) --lba=0 /dev/sdf -vvv
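ESXi test case
Create a VM
Create a volume of type "thick provision eager zeroed" and open wireshark; you can see the Write Same commands issued by ESXi
Summary
Implementing SCSI write same is fairly simple: just follow the spec. I will clean up the patch and send it out when I have time.
With write same implemented, ESXi thick provision eager zeroed became more than 5 times faster on my test machine.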
Original article: https://zhuanlan.zhihu.com/p/44606912