1. IO submission flow
The SCSI IO path sits directly below the block layer.
1.1 SQ
SCSI SQ (single-queue) registers its IO hook function with the block layer:
scsi_old_alloc_queue()
|
+---> q->request_fn = scsi_request_fn;
SQ IO submission path:
scsi_request_fn()
|
+---> blk_peek_request()
| |
| +---> scsi_prep_fn()
| (Allocate and setup scsi_cmnd)
|
+---> cmd->scsi_done = scsi_done; // set the completion callback that the driver calls when the IO finishes
|
+---> scsi_dispatch_cmd()
|
+---> hostt->queue_command()
1.2 MQ
SCSI MQ (multi-queue) registers its IO hook function with the block layer:
static const struct blk_mq_ops scsi_mq_ops = {
.queue_rq = scsi_queue_rq,
...
};
MQ IO submission path:
scsi_queue_rq()
|
+---> cmd->scsi_done = scsi_mq_done; // set the completion callback that the driver calls when the IO finishes
|
+---> scsi_dispatch_cmd()
|
+---> hostt->queue_command()
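The hook-and-callback pattern shared by both submission paths can be sketched in plain user-space C. This is only an illustration, not kernel code: struct toy_cmd, struct toy_host_template, struct toy_queue_ops and every toy_* function are hypothetical stand-ins for scsi_cmnd, scsi_host_template, blk_mq_ops and the real handlers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for scsi_cmnd. */
struct toy_cmd {
    int result;
    int completed;
    void (*done)(struct toy_cmd *);          /* completion callback set by the mid layer */
};

/* Hypothetical stand-in for scsi_host_template (the LLDD entry point). */
struct toy_host_template {
    int (*queue_command)(struct toy_cmd *);
};

/* Hypothetical stand-in for blk_mq_ops (the hook table given to the block layer). */
struct toy_queue_ops {
    int (*queue_rq)(struct toy_cmd *, const struct toy_host_template *);
};

/* Mid-layer completion callback: here it only records that the IO finished. */
static void toy_mid_done(struct toy_cmd *c)
{
    c->completed = 1;
}

/* Toy LLDD: "completes" the command immediately and calls the callback. */
static int toy_queue_command(struct toy_cmd *c)
{
    c->result = 0;
    c->done(c);
    return 0;
}

/* Mid-layer hook: install the callback first, then dispatch to the LLDD. */
static int toy_queue_rq(struct toy_cmd *c, const struct toy_host_template *ht)
{
    assert(c != NULL && ht != NULL && ht->queue_command != NULL);
    c->done = toy_mid_done;
    return ht->queue_command(c);
}

static const struct toy_queue_ops toy_mq_ops = { .queue_rq = toy_queue_rq };
static const struct toy_host_template toy_hostt = { .queue_command = toy_queue_command };
```

The block layer only knows the ops table and the driver only knows the callback stored in the command, which is why the SQ and MQ paths can share the same driver entry point.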
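2. IO completion
When the low-level device driver (LLDD) finishes an IO, it calls the completion callback installed by the SCSI layer: scsi_done (SQ) or scsi_mq_done (MQ).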
2.1 SQ
scsi_done()
|
+---> blk_complete_request()
|
+---> causes softirq
|
+---> blk_done_softirq()
|
+---> scsi_softirq_done()
2.2 MQ
scsi_mq_done()
|
+---> blk_mq_complete_request()
|
+---> scsi_softirq_done()
2.3 公共处理部分
scsi_softirq_done()
|
+---> scsi_decide_disposition() // decide the next step from the IO status returned by the driver
| Takes a look at the scsi_cmnd->result and sense data to determine
| what is the best course of action to take. While reading this
| function code, one should not confuse SUCCESS as meaning the command
| was successful, or FAILED to mean the command failed etc. The return
| value of this function merely indicates the course of action to take
|
+---> case SUCCESS: // processing is complete; hand the IO back to the block layer
| (Finish off the command to block layer. For e.g, the device may be
| offline, and hence complete the command - the block layer may retry
| on its own later, but that doesn't concern the SCSI ML)
| |
| +---> scsi_finish_command()
| |
| +---> scsi_io_completion() (*see note below)
| |
| +---> blk_finish_request()
|
+---> case NEEDS_RETRY/ADD_TO_MLQUEUE: // retry or requeue requested; the IO goes back on the queue to be dispatched again
| (Requeue the command to request queue. For e.g. the device HW was
| busy, and thus SCSI ML knows that retrying may help)
| |
| +---> scsi_queue_insert()
| |
| +---> blk_requeue_request()
|
+---> case FAILED/default: // the IO failed; add it to the error-handling list for error recovery
(Schedule the scsi_cmnd for EH. For e.g. there was a bus error that
might need bus reset. Or we got CHECK_CONDITION and we need to issue
REQ_SENSE to get more info about the failure. etc)
|
+---> scsi_eh_scmd_add()
Add scsi_cmnd to the host EH queue
scsi_eh_wakeup()
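IO completion therefore takes one of three paths:
a. The IO is finished and handed back to the block layer.
b. The driver reports that the IO must be retried or requeued; it goes back on the queue to be dispatched again.
c. The IO failed; it is added to the error-handling list for error recovery.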
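The three paths amount to a switch on the disposition value. A minimal sketch, assuming a local enum in place of the kernel's SUCCESS / NEEDS_RETRY / ADD_TO_MLQUEUE / FAILED constants (the DISP_* and ACT_* names are hypothetical and exist only for this illustration):

```c
#include <assert.h>

/* Hypothetical mirror of the values returned by scsi_decide_disposition(). */
enum disposition {
    DISP_SUCCESS,
    DISP_NEEDS_RETRY,
    DISP_ADD_TO_MLQUEUE,
    DISP_FAILED
};

/* The three completion paths (a, b, c) listed above. */
enum completion_action {
    ACT_FINISH,          /* a. hand back to the block layer       */
    ACT_REQUEUE,         /* b. put back on the queue for dispatch  */
    ACT_ERROR_HANDLER    /* c. add to the error-handling list      */
};

/* Map a disposition to the path taken; unknown values go to error handling. */
static enum completion_action completion_action_for(enum disposition d)
{
    switch (d) {
    case DISP_SUCCESS:
        return ACT_FINISH;
    case DISP_NEEDS_RETRY:
    case DISP_ADD_TO_MLQUEUE:
        return ACT_REQUEUE;
    case DISP_FAILED:
    default:
        return ACT_ERROR_HANDLER;
    }
}

/* Convenience check used when only "does this IO go back on the queue?" matters. */
static int goes_back_on_queue(enum disposition d)
{
    int requeue = completion_action_for(d) == ACT_REQUEUE;
    assert(requeue == 0 || requeue == 1);
    return requeue;
}
```

As the note in the tree warns, the "success" path only means that the mid layer is done with the command, not that the IO itself succeeded.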
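3. IO timeout
The block layer provides the timeout mechanism for block devices. A timeout handler is registered when the device's request_queue is allocated (registration is optional; DM devices, for example, do not register one). The default timeout is normally 30 seconds and can be changed through sysfs: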
/sys/class/scsi_device/<#:#:#:#>/device/timeout
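A small user-space sketch of reading this attribute. It assumes that the four `#` fields are host:channel:target:lun and that the file holds a single decimal number of seconds; the helper names scsi_timeout_path() and read_timeout_seconds() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Build the sysfs path of the timeout attribute for one SCSI device.
 * Returns 0 on success, -1 if the buffer is too small. */
static int scsi_timeout_path(char *buf, size_t len, unsigned host,
                             unsigned channel, unsigned target, unsigned lun)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len,
                     "/sys/class/scsi_device/%u:%u:%u:%u/device/timeout",
                     host, channel, target, lun);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Read the timeout in seconds; returns -1 if the file is missing or malformed. */
static long read_timeout_seconds(const char *path)
{
    long secs = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &secs) != 1)
        secs = -1;
    fclose(f);
    return secs;
}
```

Changing the value works the same way in the other direction: a privileged process writes a decimal number of seconds to the same file.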
3.1 SQ
scsi_old_alloc_queue
|
+---> blk_init_allocated_queue()
| |
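| +---> INIT_WORK(&q->timeout_work, blk_timeout_work); // timeout work item
|
+---> blk_queue_rq_timed_out(q, scsi_times_out); // register the timeout handler
For SQ devices, every request the block layer dispatches to the SCSI layer is added to the timeout_list; blk_timeout_work checks the IOs on this list for expiry.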
3.2 MQ
Registering the timeout handler:
static const struct blk_mq_ops scsi_mq_ops = {
.timeout = scsi_timeout, // register the timeout handler
};
blk_mq_init_allocated_queue()
|
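+---> INIT_WORK(&q->timeout_work, blk_mq_timeout_work); // timeout work item
For MQ devices, requests are preallocated when the request_queue is initialized and are managed by tag, so blk_mq_timeout_work walks the tags and uses the tag bitmap to find the in-flight requests: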
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
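The idea of walking a tag bitmap to find expired requests can be sketched as follows. This is a toy model, not the blk-mq implementation: struct toy_tag_set, the 64-tag limit and count_expired() are hypothetical, and the wrap-around of the kernel's jiffies counter is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_TAGS 64

/* Hypothetical preallocated request: only its deadline matters here. */
struct toy_req {
    unsigned long deadline;
};

/* Hypothetical tag set: bit N of 'busy' set means reqs[N] is in flight. */
struct toy_tag_set {
    uint64_t busy;
    struct toy_req reqs[TOY_NR_TAGS];
};

/* Count in-flight requests whose deadline has passed, and report the
 * earliest deadline still pending in *next (0 if there is none). */
static int count_expired(const struct toy_tag_set *ts, unsigned long now,
                         unsigned long *next)
{
    int expired = 0;

    assert(ts != NULL && next != NULL);
    *next = 0;
    for (unsigned tag = 0; tag < TOY_NR_TAGS; tag++) {
        if (!(ts->busy & (UINT64_C(1) << tag)))
            continue;                       /* tag not in use: skip the slot */
        const struct toy_req *rq = &ts->reqs[tag];
        if (rq->deadline <= now)
            expired++;
        else if (*next == 0 || rq->deadline < *next)
            *next = rq->deadline;
    }
    return expired;
}
```

Because the requests live in a fixed array indexed by tag, no separate timeout list is needed; the bitmap alone says which slots to inspect.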
3.3 Common timeout handling
scsi_times_out()
|
| // if the transport or the driver registered a timeout hook, run it first
+---> scsi_transport_template->eh_timed_out() - Successful? If not...
| (Gives transportt a chance to deal with it)
|
+---> scsi_host_template->eh_timed_out() - Successful? If not...
| (Gives hostt a chance to deal with it)
| // if neither hook handled the timeout, abort the command
+---> scsi_abort_command() - Successful? If not... // perform the abort
| (Schedule an ABORT of the scsi_cmnd. The abort handler will also
| requeue it if needed)
|
| // if the abort fails, add the command to the SCSI error list for the SCSI error-handling thread
+---> scsi_eh_scmd_add()
(Schedule the scsi_cmnd for EH. This'll definitely work. Because if it
doesn't work, the EH handler will mark the device as offline, which
counts as a good fix :-))
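The "try each handler in turn, fall through on failure" shape of this chain can be sketched in user-space C. Everything here is hypothetical illustration: struct timeout_hooks, the enum values and the toy hook functions are not kernel definitions, and the real return codes of the kernel hooks differ between kernel versions.

```c
#include <assert.h>
#include <stddef.h>

enum hook_ret { HOOK_NOT_HANDLED, HOOK_HANDLED };

/* Which stage ended up dealing with the timed-out command. */
enum timeout_outcome { BY_TRANSPORT, BY_HOST, ABORT_SCHEDULED, SENT_TO_EH };

/* Hypothetical bundle of the three optional/required stages. */
struct timeout_hooks {
    enum hook_ret (*transport_timed_out)(void *cmd);   /* may be NULL */
    enum hook_ret (*host_timed_out)(void *cmd);        /* may be NULL */
    int (*abort_command)(void *cmd);                   /* 0 if the abort was scheduled */
};

/* Walk the chain: transport hook, host hook, abort, then the EH list. */
static enum timeout_outcome handle_timeout(const struct timeout_hooks *h, void *cmd)
{
    assert(h != NULL && h->abort_command != NULL);
    if (h->transport_timed_out && h->transport_timed_out(cmd) == HOOK_HANDLED)
        return BY_TRANSPORT;
    if (h->host_timed_out && h->host_timed_out(cmd) == HOOK_HANDLED)
        return BY_HOST;
    if (h->abort_command(cmd) == 0)
        return ABORT_SCHEDULED;
    return SENT_TO_EH;      /* last resort: queue the command for the EH thread */
}

/* Toy hooks used to exercise the chain. */
static enum hook_ret hook_handles(void *cmd)  { (void)cmd; return HOOK_HANDLED; }
static enum hook_ret hook_declines(void *cmd) { (void)cmd; return HOOK_NOT_HANDLED; }
static int abort_ok(void *cmd)    { (void)cmd; return 0; }
static int abort_fails(void *cmd) { (void)cmd; return -1; }
```

Each stage is cheaper and less disruptive than the next, which is why the error-handling thread is only involved when everything before it has declined or failed.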
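4. IO error handling
4.1 Registering the SCSI error-handling thread
During host initialization, each host starts a kernel thread named scsi_eh_#, where # is the host_no.
The error-handling threads on the running system can be listed with ps aux | grep scsi_eh.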
shost->ehandler = kthread_run(scsi_error_handler, shost, "scsi_eh_%d", shost->host_no);
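4.2 Waking the error-handling thread
The thread is woken through one of two paths:
a. An IO is added to error handling (it completed as FAILED or it timed out)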
scsi_eh_scmd_add()
|
+---> scsi_host_set_state(shost, SHOST_RECOVERY) // put the host into the RECOVERY state
|
+---> scsi_eh_wakeup()
b. A caller wakes it explicitly through the scsi_schedule_eh() interface
scsi_schedule_eh()
|
+---> scsi_host_set_state(shost, SHOST_RECOVERY) // put the host into the RECOVERY state
|
+---> scsi_eh_wakeup()
4.3 Error handling
scsi_error_handler()
|
+---> shost->transportt->eh_strategy_handler(shost) // used when the driver registers its own error handler (e.g. libsas)
|
+--> scsi_eh_get_sense() - Are we done? if not.. // otherwise the error handling provided by the SCSI layer runs
| (For the commands that have CHECK_CONDITION, get sense_info)
| |
| +--> scsi_request_sense()
| | (Use scsi_send_eh_cmnd() to send a "hijacked" REQ_SENSE cmnd)
| |
| +--> scsi_decide_disposition()
| |
| +--> Arrange to finish the scsi_cmnd if SUCCESS (by setting
| retries=allowed)
|
+--> scsi_eh_abort_cmds() - Are we done? If not...
| (Abort the commands that had timed out)
| |
| +--> scsi_try_to_abort_cmd()
| | (Results in call to hostt->eh_abort_handler() which is responsible
| | making the LLD and the HW forget about the scsi_cmnd)
| |
| +--> scsi_eh_test_devices()
| (Test if the device is responding now by sending appropriate EH
| commands (STU / TEST_UNIT_READY). Again, sending these EH
| commands involves hijacking the original scsi_cmnd, and later
| restoring the context)
|
+--> scsi_eh_ready_devs() - Are we done? if not... // recover by reset
| (Take increasing order of higher severity actions in order to recover)
| |
| +--> scsi_eh_bus_device_reset() // device reset (LUN reset)
| | (Reset the scsi_device. Results in call to
| | hostt->eh_device_reset_handler())
| |
| +--> scsi_eh_target_reset() // target reset
| | (Reset the scsi_target. Results in call to
| | hostt->eh_target_reset_handler())
| |
| +--> scsi_eh_bus_reset() // bus reset
| | (Reset the SCSI bus. Results in call to
| | hostt->eh_bus_reset_handler())
| |
| +--> scsi_eh_host_reset() // host reset
| | (Reset the Scsi_Host. Results in call to
| | hostt->eh_host_reset_handler())
| | // if all of the resets above fail, the disk is set offline
| +--> If nothing has worked - scsi_eh_offline_sdevs()
| (The device is not recoverable, put it offline)
| // after the steps above, the IOs on the error list have been moved to the done list; the commands on the done list are processed here
+--> scsi_eh_flush_done_q()
(For all the EH commands on the done_q, either requeue them (via
scsi_queue_insert()) if eligible, or finish them up to block layer
(via scsi_finish_command()))
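Almost every step above checks the host's eh_deadline field. If a deadline is set and has expired, the step returns immediately without performing its operation. eh_deadline defaults to off, i.e. no deadline.
It can be changed through /sys/class/scsi_host/host#/eh_deadline.
When an IO is added to the error-handling list, the host is put into the RECOVERY state. In this state no disk under the host can be issued new IO, so IO on those disks drops to zero. Once error handling finishes, the RECOVERY state is cleared and new IO can be issued again.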
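The escalation ladder and the effect of eh_deadline can be modeled with a short sketch. It rests on one assumption that the text above does not state: that host reset is the one step which does not check the deadline, so an expired deadline skips straight to it. The enum eh_level names and escalate() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Recovery levels in increasing severity; EH_OFFLINE means nothing worked. */
enum eh_level {
    EH_ABORT,
    EH_DEVICE_RESET,
    EH_TARGET_RESET,
    EH_BUS_RESET,
    EH_HOST_RESET,
    EH_NR_LEVELS,
    EH_OFFLINE = EH_NR_LEVELS
};

/* works[lvl] says whether that recovery step would succeed.
 * Returns the first level that recovers the device, or EH_OFFLINE.
 * ASSUMPTION: once the deadline has passed, every level except host reset
 * returns immediately without doing anything. */
static enum eh_level escalate(const bool works[EH_NR_LEVELS], bool past_deadline)
{
    assert(works != NULL);
    for (int lvl = EH_ABORT; lvl < EH_NR_LEVELS; lvl++) {
        if (past_deadline && lvl != EH_HOST_RESET)
            continue;                   /* step skipped: deadline expired */
        if (works[lvl])
            return (enum eh_level)lvl;
    }
    return EH_OFFLINE;                  /* nothing helped: take the device offline */
}
```

With the deadline off, a device that answers to a LUN reset never reaches the heavier resets. With an expired deadline, the same device is recovered by a host reset instead, which shortens the time the host spends in RECOVERY at the cost of a more disruptive action.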
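4.3.1 libsas error handling
libsas registers its own error-handling function instead of using the default error-handling flow provided by the SCSI layer.
The SCSI layer handles errors per scsi_cmnd. libsas works one level lower: each scsi_cmnd corresponds to a sas_task, and libsas handles errors per sas_task.
a. Registration:
stt->eh_strategy_handler = sas_scsi_recover_host;
b. Error handling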
sas_scsi_recover_host
|
+---> sas_eh_handle_sas_errors()
| |
| +---> sas_scsi_find_task()
| | |
| | +---> lldd_abort_task(task) // issue the abort
| | |
| | +---> lldd_query_task() // query the task state
| |
| +---> case TASK_IS_DONE: sas_eh_finish_cmd(cmd) // the command completed before the abort, i.e. it completed normally
| |
| +---> case TASK_IS_ABORTED: sas_eh_finish_cmd(cmd) // the command was aborted successfully
| |
| +---> case TASK_IS_AT_LU: // enter LUN recovery, similar to the SCSI device reset
| | |
| | +---> sas_recover_lu()
| |
| +---> case TASK_IS_NOT_AT_LU/TASK_ABORT_FAILED // enter I_T recovery
| | |
| | +---> sas_recover_I_T(task->dev) // perform a phy reset
| |
| +---> try_to_reset_cmd_device(cmd) // other cases
| | |
| | +---> eh_device_reset_handler() // device reset, if the driver registers it
| | |
| | +---> eh_target_reset_handler() // target reset, if the driver registers it
| |
| +----> i->dft->lldd_clear_nexus_port() // if the driver registers it
| | // if the driver registers it: HA-level recovery, similar to the SCSI host reset
| +----> i->dft->lldd_clear_nexus_ha()
|
+---> sas_ata_eh(shost, &eh_work_q, &ha->eh_done_q) // SATA-specific error recovery
| |
| +---> ata_scsi_cmd_error_handler()
| // if failed IOs still remain after the recovery above, run the error handling provided by the SCSI layer
+---> scsi_eh_ready_devs()
|
+---> sas_ata_strategy_handler() // SATA-specific error recovery
|
+---> ata_scsi_port_error_handler
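1) sas_recover_lu() performs a LUN reset.
2) sas_recover_I_T() performs a phy reset: a hard reset for SAS disks, a link reset for SATA disks.
3) lldd_clear_nexus_ha() performs the _CLEAR_ACA recovery from the SAM TMF definitions.
4) sas_ata_eh()/sas_ata_strategy_handler() are error handling specific to SATA disks.
SATA disks go through this extra error handling that SAS disks do not, so their error handling takes longer.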
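How the state reported after the abort selects the first recovery step can be sketched as a small mapping. The enum names below are hypothetical mirrors of the TASK_IS_* cases in the tree, not the libsas definitions, and the later escalation (device/target reset, clear nexus) is left out.

```c
#include <assert.h>

/* Hypothetical mirror of the task states checked in sas_eh_handle_sas_errors(). */
enum toy_task_state {
    TOY_TASK_IS_DONE,
    TOY_TASK_IS_ABORTED,
    TOY_TASK_IS_AT_LU,
    TOY_TASK_IS_NOT_AT_LU,
    TOY_TASK_ABORT_FAILED
};

/* First recovery step chosen for a task in that state. */
enum toy_sas_step {
    STEP_FINISH_CMD,     /* sas_eh_finish_cmd(): nothing left to recover */
    STEP_RECOVER_LU,     /* sas_recover_lu(): LUN reset                  */
    STEP_RECOVER_I_T     /* sas_recover_I_T(): phy reset                 */
};

static enum toy_sas_step first_recovery_step(enum toy_task_state s)
{
    switch (s) {
    case TOY_TASK_IS_DONE:
    case TOY_TASK_IS_ABORTED:
        return STEP_FINISH_CMD;
    case TOY_TASK_IS_AT_LU:
        return STEP_RECOVER_LU;
    case TOY_TASK_IS_NOT_AT_LU:
    case TOY_TASK_ABORT_FAILED:
    default:
        return STEP_RECOVER_I_T;
    }
}

/* True when the task needs no reset at all. */
static int needs_no_reset(enum toy_task_state s)
{
    int done = first_recovery_step(s) == STEP_FINISH_CMD;
    assert(done == 0 || done == 1);
    return done;
}
```

The point of querying the task first is to pick the narrowest reset that can still reach it: a task known to be at the logical unit only needs LUN recovery, while an unknown location calls for recovery of the whole I_T nexus.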
4.3.2 libata error handling
(To be added)
5. IO retry
An IO is retried in the following five cases.
5.1 blk_timeout_work detects an IO timeout and runs timeout handling; once the abort succeeds, the IO is requeued and retried
scsi_times_out()
-> scsi_abort_command()
-> schedules scmd_eh_abort_handler()
-> scsi_queue_insert()
-> blk_requeue_request()
5.2 The SCSI error-handling thread requeues the IO after it finishes recovering the host
scsi_error_handler()
-> scsi_unjam_host()
-> scsi_eh_flush_done_q()
-> scsi_queue_insert()
-> blk_requeue_request()
5.3 The driver completes the IO and explicitly reports that it needs a retry (e.g. the driver is temporarily busy)
scsi_softirq_done()
-> scsi_decide_disposition() returns NEEDS_RETRY
-> scsi_queue_insert()
-> blk_requeue_request()
5.4 The block layer dispatches an IO to the SCSI layer while the SCSI device or the host is busy, so the IO is requeued
scsi_request_fn()
-> not_ready: // device busy or host busy
-> blk_requeue_request()
5.5 scsi_finish_command() is called from several places in SCSI; when the result returned by the driver is nonzero, the IO may be requeued
scsi_finish_command()
-> scsi_io_completion()
-> scsi_io_completion_action()
-> blk_requeue_request()
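Requeued IO is not retried forever. A minimal sketch of a bounded retry decision, assuming a per-command pair of counters like the retries/allowed fields mentioned in the scsi_eh_get_sense() note in section 4.3 (struct toy_retry_cmd and retry_or_finish() are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-command retry bookkeeping. */
struct toy_retry_cmd {
    int retries;    /* retries already consumed        */
    int allowed;    /* maximum retries for the command */
};

enum retry_action { RETRY_REQUEUE, RETRY_FINISH };

/* Consume one retry; requeue while the budget lasts, otherwise finish
 * the command up to the block layer. */
static enum retry_action retry_or_finish(struct toy_retry_cmd *c)
{
    assert(c != NULL && c->allowed >= 0);
    if (++c->retries <= c->allowed)
        return RETRY_REQUEUE;
    return RETRY_FINISH;
}
```

This also shows why setting retries equal to allowed arranges for a command to be finished: the next check finds the budget already exhausted.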
References:
[1]: Documentation/scsi: Documentation about scsi_cmnd lifecycle