Sending a SCSI READ_10 command to an NVMe device

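###0 The READ_10 command
READ_6 carries only a 21-bit LBA, so on a device with 512 B blocks it can address just 1 GiB. For that reason the official recommendation is to migrate from READ_6 to READ_10.
READ_10 carries a 32-bit LBA and can address 2 TiB with 512 B blocks, which is plenty for an 800 GB NVMe device. READ_10 also has a number of additional fields, but the current nvme-scsi.c ignores most of them, so they are left out here.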
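The addressing limits follow directly from the width of the LBA field in each CDB: 21 bits for READ(6) and 32 bits for READ(10). A minimal sketch of that arithmetic, where `addressable_bytes` is a hypothetical helper name used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes reachable by a CDB whose LBA field is lba_bits wide,
 * for a logical block size of block_size bytes. */
static uint64_t addressable_bytes(unsigned lba_bits, uint32_t block_size)
{
    assert(lba_bits < 55);  /* keeps the product below 2^64 for 512 B blocks */
    return ((uint64_t)1 << lba_bits) * block_size;
}
```

With 512 B blocks, `addressable_bytes(21, 512)` is 2^30 bytes (1 GiB) and `addressable_bytes(32, 512)` is 2^41 bytes (2 TiB).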

Sending a READ_10 SCSI command through ioctl takes at least the four steps described below.

###1 Build a READ(10) command
The READ(10) command is defined as follows:
![READ (10) command](https://img-blog.csdn.net/20150508152959980?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcHl5YW9lcg==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center "READ (10) command")

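As the figure shows, the command is 10 bytes long. The fields to fill in are the opcode (byte 0), the LBA (bytes 2 to 5, most significant byte first) and the Transfer Length (bytes 7 and 8, most significant byte first):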

	unsigned char rdCmd[10] = {READ_10, 0, 0, 0, 0, 0, 0, 0, 0, 0};
	rdCmd[2] = (unsigned char)((start_lba >> 24) & 0xff);
	rdCmd[3] = (unsigned char)((start_lba >> 16) & 0xff);
	rdCmd[4] = (unsigned char)((start_lba >> 8) & 0xff);
	rdCmd[5] = (unsigned char)(start_lba & 0xff);
	rdCmd[7] = (unsigned char)((lba_num >> 8) & 0xff);
	rdCmd[8] = (unsigned char)(lba_num & 0xff);

In other words, read lba_num logical blocks starting at start_lba. A logical block is typically 512 B.
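The same byte packing can be wrapped in a small function so that it is easy to check in isolation. This is only a sketch: `build_read10` and the two macros are names made up for this example, and the opcode is written out as 0x28, the value `READ_10` has in `<scsi/scsi.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define READ_10_OPCODE  0x28   /* same value as READ_10 in <scsi/scsi.h> */
#define READ_10_CDB_LEN 10

/* Fill a 10-byte READ(10) CDB; LBA and transfer length are big-endian. */
static void build_read10(unsigned char cdb[READ_10_CDB_LEN],
                         uint32_t start_lba, uint16_t lba_num)
{
    assert(cdb != NULL);
    memset(cdb, 0, READ_10_CDB_LEN);   /* flags, group number, control = 0 */
    cdb[0] = READ_10_OPCODE;
    cdb[2] = (unsigned char)(start_lba >> 24);
    cdb[3] = (unsigned char)(start_lba >> 16);
    cdb[4] = (unsigned char)(start_lba >> 8);
    cdb[5] = (unsigned char)start_lba;
    cdb[7] = (unsigned char)(lba_num >> 8);
    cdb[8] = (unsigned char)lba_num;
}
```

Taking `lba_num` as a `uint16_t` makes the 16-bit limit of the Transfer Length field visible in the signature: one READ(10) can request at most 65535 blocks.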

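###2 Split the read/write request
The NVMe driver accepts two ways of passing the data buffer. The difference lies in how iovec_count of sg_io_hdr is set in the next step: when that field is non-zero, scatter gather is used and the device is handed a vector of requests.
The non-sg method can pass a block of any size, and the driver splits the request automatically, which keeps things safe and reliable. However, passing one large buffer this way easily ends in an ENOMEM error.
The following passage is quoted from sg.danny.cz: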

Scatter gather allows large buffers (previously limited to 128 KB on i386) to be used. Scatter gather is also a lot more “kernel friendly”. The original driver used a single large buffer which made it impossible to run 2 or more sg-based applications at the same time. With the new driver a buffer is reserved for each file descriptor guaranteeing that at least that buffer size will be available for each request on the file descriptor. A user may request a larger buffer size on any particular request but runs the (usually remote) risk of an out of memory (ENOMEM) error.

So we prefer the scatter gather method and manually split a large request into many small chunks. The process is described in detail below.

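nvme-scsi.c in the NVMe driver contains a function named nvme_trans_io, which does the actual work of translating a standard SCSI read/write command into an NVMe command.
This function places some restrictions on the SCSI request. It contains the following statements: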

/* IO vector sizes should be multiples of block size */
if (sgl.iov_len % (1 << ns->lba_shift) != 0) {
	res = nvme_trans_completion(hdr,
			SAM_STAT_CHECK_CONDITION,
			ILLEGAL_REQUEST,
			SCSI_ASC_INVALID_PARAMETER,
			SCSI_ASCQ_CAUSE_NOT_REPORTABLE);	
	goto out;
}

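So the request must be split in units of whole LBAs. These translation functions are not yet complete; eventually every SCSI operation should be carried directly by NVMe commands.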
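The kernel check quoted above can be mirrored in user space, so that a badly sized entry is caught before the request is submitted. A minimal sketch, assuming `lba_shift` is log2 of the logical block size (9 for 512 B blocks) as in the kernel snippet; `iov_len_ok` is a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when iov_len is a whole number of logical blocks. */
static bool iov_len_ok(size_t iov_len, unsigned lba_shift)
{
    assert(lba_shift < sizeof(size_t) * 8);
    size_t block_size = (size_t)1 << lba_shift;
    return (iov_len % block_size) == 0;
}
```

For 512 B blocks, `iov_len_ok(4096, 9)` holds while `iov_len_ok(1000, 9)` does not.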

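Also note that nvme_trans_do_nvme_io, the function nvme_trans_io calls to perform the actual read or write, puts the request size from the user-supplied sg_iovec into the NVMe command without checking it: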

if (hdr->iovec_count > 0) {
	unit_len = sgl.iov_len;
	unit_num_blocks = unit_len >> ns->lba_shift;
}
c.rw.length = cpu_to_le16(unit_num_blocks - 1);
nvme_sc = nvme_submit_io_cmd(dev, ns, &c, NULL);

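However, every NVMe command has an upper limit on its transfer size, so each chunk must also stay within that limit when splitting.
On the 800 GB NVMe P3700 the limit is 256 LBAs, i.e. 128 KB. How to obtain this value for a particular NVMe device was covered in one of my earlier posts.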

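Below is the actual splitting. IOVEC_ELEMS caps the number of chunks in one request, len is the number of logical blocks to read, and max_blocks is the per-command limit in LBAs.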

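// split into iovecs: each entry is a whole number of LBAs, at most max_size bytes
rem = len * lba_size;              // bytes still to be described
max_size = max_blocks * lba_size;  // per-command transfer limit in bytes
for (i = 0; i < IOVEC_ELEMS; ++i){
	iovec[i].iov_base = (char*)(base + i * max_size);
	iovec[i].iov_len = (rem > max_size) ? max_size : rem;
	if (rem <= max_size)
		break;                     // last chunk has been filled
	rem -= max_size;
}
if (i >= IOVEC_ELEMS){             // ran out of entries before covering rem
	printf("Too much data!\n");
	goto exit;
}
num_of_iovec = i + 1;              // number of entries actually used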
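For reference, the same splitting can be written as a standalone function that reports how many entries it filled. This is a sketch under my own naming (`split_into_iovecs` and its parameters are not from the original code); it uses `sg_iovec_t` from `<scsi/sg.h>` and expects the caller to pass a `total_bytes` and `max_bytes` that are both multiples of the block size.

```c
#include <assert.h>
#include <stddef.h>
#include <scsi/sg.h>

/* Describe total_bytes starting at base as chunks of at most max_bytes.
 * Returns the number of entries filled, or -1 if max_elems is too few. */
static int split_into_iovecs(sg_iovec_t *iov, int max_elems,
                             unsigned char *base, size_t total_bytes,
                             size_t max_bytes)
{
    assert(iov != NULL && base != NULL && max_bytes > 0);
    int n = 0;
    size_t off = 0;
    while (off < total_bytes) {
        if (n == max_elems)
            return -1;               /* request does not fit */
        size_t chunk = total_bytes - off;
        if (chunk > max_bytes)
            chunk = max_bytes;
        iov[n].iov_base = base + off;
        iov[n].iov_len = chunk;
        off += chunk;
        n++;
    }
    return n;
}
```

With 512 B blocks and a 256-LBA limit, a 600-block read becomes three entries of 256, 256 and 88 blocks.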

###3 Build an sg_io_hdr
sg_io_hdr is defined in sg.h and holds all the information for one SCSI command:

typedef struct sg_io_hdr
{
	int interface_id;           /* [i] 'S' for SCSI generic (required) */
	int dxfer_direction;        /* [i] data transfer direction  */
	unsigned char cmd_len;      /* [i] SCSI command length ( <= 16 bytes) */
	unsigned char mx_sb_len;    /* [i] max length to write to sbp */
	unsigned short iovec_count; /* [i] 0 implies no scatter gather */
	unsigned int dxfer_len;     /* [i] byte count of data transfer */
	void * dxferp;              /* [i], [*io] points to data transfer memory
											  or scatter gather list */
	unsigned char * cmdp;       /* [i], [*i] points to command to perform */
	unsigned char * sbp;        /* [i], [*o] points to sense_buffer memory */
	unsigned int timeout;       /* [i] MAX_UINT->no timeout (unit: millisec) */
	unsigned int flags;         /* [i] 0 -> default, see SG_FLAG... */
	int pack_id;                /* [i->o] unused internally (normally) */
	void * usr_ptr;             /* [i->o] unused internally */
	unsigned char status;       /* [o] scsi status */
	unsigned char masked_status;/* [o] shifted, masked scsi status */
	unsigned char msg_status;   /* [o] messaging level data (optional) */
	unsigned char sb_len_wr;    /* [o] byte count actually written to sbp */
	unsigned short host_status; /* [o] errors from host adapter */
	unsigned short driver_status;/* [o] errors from software driver */
	int resid;                  /* [o] dxfer_len - actual_transferred */
	unsigned int duration;      /* [o] time taken by cmd (unit: millisec) */
	unsigned int info;          /* [o] auxiliary information */
} sg_io_hdr_t;  /* 64 bytes long (on i386) */

These are essentially the fields that need to be filled in:

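memset(&io_hdr, 0, sizeof(io_hdr));            /* clear every field not set below */
io_hdr.interface_id = 'S';
io_hdr.cmd_len = sizeof(rdCmd);
io_hdr.cmdp = rdCmd;
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;    /* read: device to host */
io_hdr.dxfer_len = lba_size * len;             /* total bytes across all iovecs */
io_hdr.iovec_count = num_of_iovec;
io_hdr.dxferp = iovec;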

rdCmd and iovec are the ones filled in above.
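Put together as a function, filling the header might look like the sketch below. `fill_read_hdr` and its parameter names are hypothetical. Besides the fields listed above it also sets `sbp`, `mx_sb_len` and `timeout`, which are my additions so that sense data has somewhere to go and the command has an explicit timeout in milliseconds.

```c
#include <assert.h>
#include <string.h>
#include <scsi/sg.h>

/* Prepare hdr for a scatter gather read described by iov[0..iov_count-1]. */
static void fill_read_hdr(sg_io_hdr_t *hdr,
                          unsigned char *cdb, unsigned char cdb_len,
                          sg_iovec_t *iov, unsigned short iov_count,
                          unsigned int total_bytes,
                          unsigned char *sense, unsigned char sense_len,
                          unsigned int timeout_ms)
{
    assert(hdr != NULL && cdb != NULL && iov != NULL && iov_count > 0);
    memset(hdr, 0, sizeof(*hdr));       /* output fields start out as zero */
    hdr->interface_id = 'S';            /* required by the sg interface */
    hdr->dxfer_direction = SG_DXFER_FROM_DEV;
    hdr->cmdp = cdb;
    hdr->cmd_len = cdb_len;
    hdr->iovec_count = iov_count;       /* non-zero selects scatter gather */
    hdr->dxferp = iov;                  /* points at the sg_iovec array */
    hdr->dxfer_len = total_bytes;
    hdr->sbp = sense;                   /* addition: sense buffer */
    hdr->mx_sb_len = sense_len;
    hdr->timeout = timeout_ms;          /* addition: timeout in ms */
}
```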

###4 Send the request!
In short:

err = ioctl(fd, SG_IO, &io_hdr);
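A zero return from ioctl only says the request was accepted; the outcome of the SCSI command is reported in the output fields of sg_io_hdr. Below is a minimal sketch of a check on those fields. `sg_io_check` and its return codes are made up for this example, and how completely the NVMe translation layer fills `host_status`, `driver_status` and `resid` is an assumption I have not verified.

```c
#include <assert.h>
#include <stddef.h>
#include <scsi/sg.h>

/* Returns 0 on a clean completion, a negative code otherwise:
 *  -1  ioctl itself failed (errno holds the reason)
 *  -2  SCSI, host or driver status is non-zero (sense data may be in sbp)
 *  -3  fewer bytes were transferred than requested */
static int sg_io_check(int ioctl_ret, const sg_io_hdr_t *hdr)
{
    assert(hdr != NULL);
    if (ioctl_ret < 0)
        return -1;
    if (hdr->status != 0 || hdr->host_status != 0 || hdr->driver_status != 0)
        return -2;
    if (hdr->resid != 0)
        return -3;
    return 0;
}
```

It would be called right after the ioctl, for example `if (sg_io_check(err, &io_hdr) != 0) { /* handle the error */ }`.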