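In https://blog.csdn.net/u014104588/article/details/87277341 we analyzed how BlueStore handles big writes and small writes differently, which ultimately comes down to two methods: asynchronous writes and deferred writes. The _txc_state_proc function handles both cases, referred to as simple write and deferred write respectively.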
simple write
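This case occurs when writing to a new blob or to a reusable blob.
_txc_state_proc implements the state machine: it performs a different stage of processing depending on the transaction's current state. The initial state is STATE_PREPARE. In this state it checks whether any aios are still pending. If so, it sets the state to STATE_AIO_WAIT and calls _txc_aio_submit to submit them; otherwise it falls straight through to the handling of the next state, STATE_AIO_WAIT. Every call to aio_write executes num_pending++, which is why _txc_aio_submit has to be called here. The call stack for the STATE_PREPARE state is as follows: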
case TransContext::STATE_PREPARE:
  if (txc->ioc.has_pending_aios())
    _txc_aio_submit(txc);
      bdev->aio_submit(&txc->ioc);
        list<aio_t>::iterator e = ioc->running_aios.begin();
        ioc->running_aios.splice(e, ioc->pending_aios);
        int pending = ioc->num_pending.load();
        ioc->num_running += pending;
        ioc->num_pending -= pending;
        aio_queue.submit_batch(ioc->running_aios.begin(), e, pending, priv, &retries);
          io_submit(ctx, std::min(left, max_iodepth), piocb + done);
    return;
io_submit is eventually called to hand the asynchronous write requests to the kernel. Note that once _txc_aio_submit has run, the return statement exits _txc_state_proc.
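The pending-to-running handoff in the trace above can be illustrated in isolation. The following is a minimal sketch, not BlueStore's code: Aio, IoContext and submit_pending are hypothetical names, the counters are plain ints rather than atomics, and no request is actually submitted to the kernel.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-in for a single queued AIO request.
struct Aio {
  int id;
};

// Hypothetical, simplified IO context: only the two lists and their counters.
struct IoContext {
  std::list<Aio> pending_aios;
  std::list<Aio> running_aios;
  int num_pending = 0;
  int num_running = 0;

  // Mirrors what the article says aio_write does: queue a request and count it.
  void queue(int id) {
    pending_aios.push_back(Aio{id});
    ++num_pending;
  }
  bool has_pending_aios() const { return num_pending > 0; }
};

// Moves every pending request to the front of the running list and returns
// the size of the batch that would be handed to io_submit.
int submit_pending(IoContext& ioc) {
  auto e = ioc.running_aios.begin();
  // splice() relinks the nodes in front of e without copying them;
  // e stays valid and now marks the end of the new batch.
  ioc.running_aios.splice(e, ioc.pending_aios);
  int pending = ioc.num_pending;
  ioc.num_running += pending;
  ioc.num_pending -= pending;
  // [running_aios.begin(), e) is exactly the batch that was just moved.
  return static_cast<int>(std::distance(ioc.running_aios.begin(), e));
}
```

Because splice inserts in front of e, the range from begin() to e always covers only the newly moved requests, even when older requests are still in the running list.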
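In the _aio_thread thread, once an aio completes, the corresponding callback aio_callback(aio_callback_priv, ioc->priv); is executed. It eventually calls TransContext::aio_finish, which in turn calls _txc_state_proc to advance the transaction according to its state. By now the aios have completed, so has_pending_aios returns false, and because the state was set to STATE_AIO_WAIT before submission, processing goes directly to the STATE_AIO_WAIT case.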
case TransContext::STATE_AIO_WAIT:
  _txc_finish_io(txc); // may call _txc_state_proc again
    txc->state = TransContext::STATE_IO_DONE;
    do {
      _txc_state_proc(&*p++); // enters STATE_IO_DONE
    } while (p != osr->q.end() && p->state == TransContext::STATE_IO_DONE);
In the STATE_AIO_WAIT state, the state is set to STATE_IO_DONE and _txc_state_proc is then called recursively.
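The do/while loop above advances a run of consecutive transactions that have all reached STATE_IO_DONE. A minimal sketch of that idea follows. The types and the finish_io function are hypothetical, and the check that stops when an earlier transaction in the queue is still waiting on I/O is an assumption that does not appear in the excerpt.

```cpp
#include <cassert>
#include <list>
#include <vector>

enum class State { AioWait, IoDone, KvQueued };

struct Txc {
  int id;
  State state;
};

// Hypothetical sequencer: transactions kept in submission order.
struct Sequencer {
  std::list<Txc> q;
};

// Marks transaction `id` as IO-done, then advances it and every directly
// following IO-done transaction. Returns the ids that advanced, in order.
std::vector<int> finish_io(Sequencer& osr, int id) {
  std::vector<int> advanced;
  auto p = osr.q.begin();
  while (p != osr.q.end() && p->id != id) ++p;
  if (p == osr.q.end()) return advanced;
  p->state = State::IoDone;
  // Assumption: an earlier transaction still waiting on I/O blocks this one.
  for (auto it = osr.q.begin(); it != p; ++it) {
    if (it->state == State::AioWait) return advanced;
  }
  do {
    p->state = State::KvQueued;  // stands in for _txc_state_proc(&*p++)
    advanced.push_back(p->id);
    ++p;
  } while (p != osr.q.end() && p->state == State::IoDone);
  return advanced;
}
```

With three queued transactions, finishing the second one first advances nothing; finishing the first one afterwards advances both in submission order.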
case TransContext::STATE_IO_DONE:
  txc->state = TransContext::STATE_KV_QUEUED;
  kv_queue.push_back(txc);
  kv_cond.notify_one();
  if (txc->state != TransContext::STATE_KV_SUBMITTED)
    kv_queue_unsubmitted.push_back(txc);
  return;
In the STATE_IO_DONE state, the transaction is pushed onto kv_queue and the _kv_sync_thread thread is woken up. That thread processes kv_queue as follows:
kv_committing.swap(kv_queue);
for (auto txc : kv_committing)
  if (txc->state == TransContext::STATE_KV_QUEUED)
    db->submit_transaction(txc->t);
    _txc_applied_kv(txc);
    txc->state = TransContext::STATE_KV_SUBMITTED;
db->submit_transaction_sync(synct);
if (kv_committing_to_finalize.empty()) {
  kv_committing_to_finalize.swap(kv_committing); // handed over to _kv_finalize_thread
} else {
  kv_committing_to_finalize.insert(
    kv_committing_to_finalize.end(),
    kv_committing.begin(),
    kv_committing.end());
  kv_committing.clear();
}
kv_finalize_cond.notify_one();
The transactions in kv_queue are swapped into kv_committing and written to disk through the RocksDB database; kv_committing is then moved into kv_committing_to_finalize and the _kv_finalize_thread thread is woken up.
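The swap-then-hand-off pattern described here can be sketched on its own. This is a simplified, single-threaded illustration under stated assumptions: KvQueues and sync_once are hypothetical names, an int stands in for a TransContext pointer, and the locking, condition variables and database calls of the real threads are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical, simplified queues; an int stands in for a TransContext*.
struct KvQueues {
  std::deque<int> kv_queue;
  std::deque<int> kv_committing_to_finalize;

  // Producer side: what the STATE_IO_DONE handling does with a transaction.
  void enqueue(int txc) { kv_queue.push_back(txc); }

  // One iteration of the sync loop. Returns how many transactions it took.
  std::size_t sync_once() {
    // Take the whole queue in one step so producers can keep appending.
    std::deque<int> kv_committing;
    kv_committing.swap(kv_queue);
    // (the database submit and the synchronous commit would happen here)
    const std::size_t n = kv_committing.size();
    // Hand the committed batch to the finalize stage, appending if the
    // finalizer has not yet drained an earlier batch.
    if (kv_committing_to_finalize.empty()) {
      kv_committing_to_finalize.swap(kv_committing);
    } else {
      kv_committing_to_finalize.insert(kv_committing_to_finalize.end(),
                                       kv_committing.begin(),
                                       kv_committing.end());
      kv_committing.clear();
    }
    return n;
  }
};
```

Swapping the whole container keeps the time spent on the shared queue short, and the append branch preserves ordering when two batches pile up before the finalizer runs.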
The _kv_finalize_thread thread processes kv_committing_to_finalize as follows:
kv_committed.swap(kv_committing_to_finalize);
while (!kv_committed.empty())
  TransContext *txc = kv_committed.front();
  _txc_state_proc(txc);
  kv_committed.pop_front();
At this point the state is STATE_KV_SUBMITTED. _txc_state_proc is entered again, and STATE_KV_SUBMITTED is handled as follows:
case TransContext::STATE_KV_SUBMITTED:
  txc->state = TransContext::STATE_KV_DONE;
  _txc_committed_kv(txc); // runs the completion callbacks; for a simple write this is the last step, since the aio write came first and the kv write followed
    finishers[txc->osr->shard]->queue(txc->oncommits); // calls C_OSD_OnOpCommit::finish
      if (op->op)
        op->op->mark_event("op_commit");
        op->op->pg_trace.event("op commit");
      op->waiting_for_commit.erase(get_parent()->whoami_shard());
      if (op->waiting_for_commit.empty())
        op->on_commit->complete(0);
        op->on_commit = 0;
        in_progress_ops.erase(op->tid);
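For the simple write case, the work is essentially finished here. This step only confirms that the op has completed on the local OSD: it invokes the callback C_OSD_OnOpCommit::finish and logs the event. If waiting_for_commit is empty, the op has completed on all OSD replicas, and the callback C_OSD_RepopCommit::finish is invoked.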
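The waiting_for_commit bookkeeping can be pictured as a set of outstanding shards. The sketch below is illustrative only: InProgressOp here is a stripped-down hypothetical struct, shard ids are plain ints, and the client_notified flag stands in for clearing on_commit after it fires.

```cpp
#include <cassert>
#include <set>

// Hypothetical, reduced version of an in-progress replicated op.
struct InProgressOp {
  std::set<int> waiting_for_commit;  // shards that have not committed yet
  bool client_notified = false;      // stands in for on_commit being cleared
};

// Records a commit from `shard`. Returns true exactly once: on the call that
// empties the set, which is when the final commit callback would run.
bool shard_committed(InProgressOp& op, int shard) {
  op.waiting_for_commit.erase(shard);
  if (op.waiting_for_commit.empty() && !op.client_notified) {
    op.client_notified = true;
    return true;
  }
  return false;
}
```

With three shards outstanding, only the third commit reports completion, which matches the article's point that a local commit alone is not enough to finish the op.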
Once the STATE_KV_SUBMITTED handling finishes, the STATE_KV_DONE state is processed immediately:
case TransContext::STATE_KV_DONE:
  if (txc->deferred_txn)
    txc->state = TransContext::STATE_DEFERRED_QUEUED;
    _deferred_queue(txc);
    return;
  txc->state = TransContext::STATE_FINISHING;
  break;
For a simple write, deferred_txn is null, so _deferred_queue is not run. Processing moves straight on to STATE_FINISHING, where _txc_finish(txc); is called to do the final cleanup.
deferred write
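A deferred write occurs in the overwrite case.
A deferred write writes the journal first, that is, it commits the k/v operation. After the journal write completes, the data write is wrapped in a separate dbh transaction and executed.
When _txc_state_proc is entered for a deferred write, there are no pending aios, so processing goes directly to the STATE_AIO_WAIT state. There the state is set to STATE_IO_DONE and _txc_state_proc is called recursively. As with a simple write, in the STATE_IO_DONE state the transaction is pushed onto kv_queue and the _kv_sync_thread thread is woken up. For a deferred write, once the journal has been written (that is, in the STATE_KV_SUBMITTED state) the write can already be acknowledged to the client; if the later data write to disk fails, the data can still be recovered from the journal. The subsequent processing is similar to a simple write until the STATE_KV_DONE state: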
case TransContext::STATE_KV_DONE:
  if (txc->deferred_txn)
    txc->state = TransContext::STATE_DEFERRED_QUEUED;
    _deferred_queue(txc);
      bluestore_deferred_transaction_t& wt = *txc->deferred_txn;
      for (auto opi = wt.ops.begin(); opi != wt.ops.end(); ++opi)
        const auto& op = *opi;
        bufferlist::const_iterator p = op.data.begin();
        for (auto e : op.extents)
          txc->osr->deferred_pending->prepare_write(cct, wt.seq, e.offset, e.length, p);
            auto i = iomap.insert(make_pair(offset, deferred_io()));
      _deferred_submit_unlock(txc->osr.get());
        auto i = b->iomap.begin();
        while (true)
          bdev->aio_write(start, bl, &b->ioc, false); // b->ioc determines that the callback is DeferredBatch::aio_finish
          ++i;
        bdev->aio_submit(&b->ioc);
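The trace collects deferred writes in an offset-keyed iomap and later walks that map to issue aio_write calls. One plausible reason for keying by offset is that adjacent extents can be merged into a single write. The sketch below illustrates that idea only: IoMap, this prepare_write signature and coalesce are hypothetical, strings stand in for buffer lists, and partially overlapping extents are assumed not to occur.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical offset-keyed map of deferred writes (offset -> bytes).
using IoMap = std::map<std::uint64_t, std::string>;

// Queues one deferred write. A later write at the same offset replaces the
// earlier one; partial overlaps are assumed not to occur in this sketch.
void prepare_write(IoMap& iomap, std::uint64_t offset, const std::string& data) {
  iomap[offset] = data;
}

// Walks the map in offset order and merges extents that touch, yielding the
// (start, bytes) pairs that would each become one aio_write call.
std::vector<std::pair<std::uint64_t, std::string>> coalesce(const IoMap& iomap) {
  std::vector<std::pair<std::uint64_t, std::string>> out;
  for (const auto& kv : iomap) {
    if (!out.empty() &&
        out.back().first + out.back().second.size() == kv.first) {
      out.back().second += kv.second;  // contiguous: extend the last write
    } else {
      out.emplace_back(kv.first, kv.second);  // gap: start a new write
    }
  }
  return out;
}
```

Because std::map iterates in key order, the writes can be queued in any order and still come out merged by disk offset.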
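In the deferred write case, the data is written to disk in the STATE_KV_DONE state. In the _aio_thread thread, once the aio completes, the corresponding callback aio_callback(aio_callback_priv, ioc->priv); is executed. For a deferred write it eventually calls DeferredBatch::aio_finish, which calls _deferred_aio_finish: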
DeferredBatch *b = osr->deferred_running;
for (auto& i : b->txcs)
  TransContext *txc = &i;
  txc->state = TransContext::STATE_DEFERRED_CLEANUP;
  costs += txc->cost;
throttle_deferred_bytes.put(costs);
deferred_done_queue.emplace_back(b);
kv_cond.notify_one();
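The _kv_sync_thread thread, woken through kv_cond, swaps deferred_done_queue into deferred_done, then moves deferred_done into deferred_stable, and finally moves deferred_stable into deferred_stable_to_finalize, after which it wakes the _kv_finalize_thread thread to process deferred_stable_to_finalize.
_kv_finalize_thread processes deferred_stable_to_finalize as follows: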
deferred_stable.swap(deferred_stable_to_finalize);
for (auto b : deferred_stable)
  auto p = b->txcs.begin();
  while (p != b->txcs.end())
    TransContext *txc = &*p;
    p = b->txcs.erase(p);
    _txc_state_proc(txc);
  delete b;
deferred_stable.clear();
In the _kv_finalize_thread thread, _txc_state_proc is entered once more, this time handling the STATE_DEFERRED_CLEANUP state, which simply sets the state to STATE_FINISHING and then continues on to handle STATE_FINISHING.
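To summarize the two paths, the state sequence described in this article can be written as a small transition function. This is a sketch of the walkthrough above, not the real _txc_state_proc: the enum and function names are hypothetical, the threads and callbacks that trigger each transition are omitted, and has_deferred_txn stands in for the txc->deferred_txn check.

```cpp
#include <cassert>
#include <vector>

// States named after the TransContext states discussed in the article.
enum class State {
  Prepare, AioWait, IoDone, KvQueued, KvSubmitted, KvDone,
  DeferredQueued, DeferredCleanup, Finishing
};

// Returns the state that follows `s`. The only branch is at KvDone, where a
// transaction carrying a deferred txn takes the deferred detour.
State next_state(State s, bool has_deferred_txn) {
  switch (s) {
    case State::Prepare:         return State::AioWait;
    case State::AioWait:         return State::IoDone;
    case State::IoDone:          return State::KvQueued;
    case State::KvQueued:        return State::KvSubmitted;
    case State::KvSubmitted:     return State::KvDone;
    case State::KvDone:
      return has_deferred_txn ? State::DeferredQueued : State::Finishing;
    case State::DeferredQueued:  return State::DeferredCleanup;
    case State::DeferredCleanup: return State::Finishing;
    case State::Finishing:       return State::Finishing;
  }
  return State::Finishing;  // not reached; keeps compilers quiet
}

// Collects the full path from Prepare to Finishing.
std::vector<State> walk(bool has_deferred_txn) {
  std::vector<State> path{State::Prepare};
  while (path.back() != State::Finishing) {
    path.push_back(next_state(path.back(), has_deferred_txn));
  }
  return path;
}
```

Calling walk(false) yields the seven-state simple write path, while walk(true) adds the two deferred states between KvDone and Finishing.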