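Rollback scenario
The old primary may hold data that the new primary never received, which leaves the members of the replica set inconsistent. For example: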
Node | t0 | t1 | t2 | t3 | t4 |
---|---|---|---|---|---|
Primary | 1 | 2 | 3 | 4 | 5 |
Secondary1 | 1 | 2 | 3 | 4 | |
Secondary2 | 1 | 2 | 3 | | |
Suppose the primary fails at t4. Secondary1 becomes the new primary, the value 6 is written at t4, and Secondary2 syncs from the new primary. The result is:
Node | t0 | t1 | t2 | t3 | t4 |
---|---|---|---|---|---|
Primary | 1 | 2 | 3 | 4 | 6 |
Secondary2 | 1 | 2 | 3 | 4 | 6 |
The old primary then rejoins the replica set as Secondary1, which gives:
Node | t0 | t1 | t2 | t3 | t4 |
---|---|---|---|---|---|
Primary | 1 | 2 | 3 | 4 | 6 |
Secondary1 | 1 | 2 | 3 | 4 | 5 |
Secondary2 | 1 | 2 | 3 | 4 | 6 |
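
To make the divergence in the last table concrete, here is a minimal sketch that models each node's history as a vector with one value per time slot and finds the first slot where a rejoining node disagrees with the primary. `History` and `firstDivergence` are hypothetical names used only for illustration; they are not part of MongoDB.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: one value per time slot t0, t1, ...
using History = std::vector<int>;

// Returns the index of the first slot where `node` holds a value the primary
// does not agree with, or -1 if `node` is a prefix of (or equal to) the
// primary's history and therefore only needs to catch up, not roll back.
int firstDivergence(const History& primary, const History& node) {
    for (std::size_t i = 0; i < node.size(); ++i) {
        if (i >= primary.size() || primary[i] != node[i]) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With the values from the tables, the old primary `{1, 2, 3, 4, 5}` diverges from the new primary `{1, 2, 3, 4, 6}` at slot t4, while a lagging secondary `{1, 2, 3}` does not diverge at all.
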
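Structure of replication rollback
The primary and Secondary1 disagree at t4, so the system has to roll back Secondary1's value at t4 and then sync the primary's value for that point in time. The whole process is as follows:
As the figure above shows, the rollback process consists of three phases:
Phase 1: obtain the rollback ID (rbid);
Phase 2: RollBackLocalOperations walks the local oplog backwards. For each entry, RollBackLocalOperations::onRemoteOperation checks whether the local entry and the source's entry agree in timestamp and content. If they differ, the information needed to undo the local entry is saved into FixUpInfo for the later rollback. This repeats until an entry is found that is identical locally and on the source; if no common oplog point is found, the rollback fails.
This phase ultimately produces a FixUpInfo structure that records the differences between the local oplog and the source's oplog: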
struct FixUpInfo {
    // note this is a set -- if there are many $inc's on a single document we need to rollback,
    // we only need to refetch it once.
    set<DocID> toRefetch;
    // collections to drop
    set<string> toDrop;
    // Indexes to drop.
    // Key is collection namespace. Value is name of index to drop.
    multimap<string, string> indexesToDrop;
    set<string> collectionsToResyncData;
    set<string> collectionsToResyncMetadata;
    Timestamp commonPoint;
    RecordId commonPointOurDiskloc;
    int rbid;  // remote server's current rollback sequence #
};
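// rollbackOperation is the refetch function
StatusWith<RollBackLocalOperations::RollbackCommonPoint> syncRollBackLocalOperations(
    const OplogInterface& localOplog,
    const OplogInterface& remoteOplog,
    const RollBackLocalOperations::RollbackOperationFn& rollbackOperation) {
    // an iterator that walks the source oplog backwards, newest entry first
    auto remoteIterator = remoteOplog.makeIterator();
    auto remoteResult = remoteIterator->next();
    RollBackLocalOperations finder(localOplog, rollbackOperation);
    Timestamp theirTime;
    while (remoteResult.isOK()) {
        theirTime = remoteResult.getValue().first["ts"].timestamp();
        BSONObj theirObj = remoteResult.getValue().first;
        auto result = finder.onRemoteOperation(theirObj);
        if (result.isOK()) {
            // found the common oplog point; return it
            return result.getValue();
        } else if (result.getStatus().code() != ErrorCodes::NoSuchKey) {
            // any error other than NoSuchKey ends the search: no matching entry can be
            // found, for example because it was lost or has already been overwritten
            return result;
        }
        // NoSuchKey: the local and source entries still differ, so move on to the
        // next (older) source entry
        remoteResult = remoteIterator->next();
    }
    // the source oplog is exhausted and no common point was found: the rollback fails
    return StatusWith<RollBackLocalOperations::RollbackCommonPoint>(
        ErrorCodes::NoMatchingDocument,
        "reached beginning of remote oplog and could not find common point");
}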
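
The backward scan for a common point can be illustrated with a small self-contained sketch. It assumes each oplog entry is reduced to a timestamp plus a content hash, and that both oplogs are plain vectors ordered oldest first; `Entry` and `findCommonPoint` are hypothetical names, and the comparison rules are a simplification of what onRemoteOperation does rather than the actual implementation.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical, simplified oplog entry: a timestamp plus a content hash.
struct Entry {
    std::uint64_t ts;
    std::uint64_t hash;
};

// Walks both oplogs from newest to oldest (the vectors are ordered oldest first).
// Every local entry newer than the common point is passed to `rollbackOp`.
// Returns the index of the common point in `local`, or -1 if there is none.
int findCommonPoint(const std::vector<Entry>& local,
                    const std::vector<Entry>& remote,
                    const std::function<void(const Entry&)>& rollbackOp) {
    int li = static_cast<int>(local.size()) - 1;
    int ri = static_cast<int>(remote.size()) - 1;
    while (li >= 0 && ri >= 0) {
        const Entry& mine = local[li];
        const Entry& theirs = remote[ri];
        if (mine.ts > theirs.ts) {
            rollbackOp(mine);  // only the local node has this entry: undo it
            --li;
        } else if (mine.ts < theirs.ts) {
            --ri;  // the source is ahead: skip its newer entry
        } else if (mine.hash == theirs.hash) {
            return li;  // same timestamp and same content: common point
        } else {
            rollbackOp(mine);  // same timestamp, different content: undo ours
            --li;
            --ri;
        }
    }
    return -1;  // one side ran out of entries before a match was found
}
```

Applied to the scenario from the tables (local values 1, 2, 3, 4, 5 and source values 1, 2, 3, 4, 6 at t0 to t4), the scan hands the entry with value 5 to the callback and stops at t3.
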
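Each inconsistent oplog entry is recorded, according to its type, in one of the FixUpInfo collections, and the later rollback steps work from the contents of those collections. The code below has its error handling removed: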
Status refetch(FixUpInfo& fixUpInfo, const BSONObj& ourObj) {
    const char* op = ourObj.getStringField("op");
    // doc identifies the affected document; obj is the operation payload
    // ("o2" for updates, "o" for everything else)
    DocID doc;
    doc.ownedObj = ourObj.getOwned();
    doc.ns = doc.ownedObj.getStringField("ns");
    BSONObj obj = doc.ownedObj.getObjectField(*op == 'u' ? "o2" : "o");
    if (*op == 'c') {
        BSONElement first = obj.firstElement();
        NamespaceString nss(doc.ns);  // foo.$cmd
        string cmdname = first.fieldName();
        Command* cmd = Command::findCommand(cmdname.c_str());
        if (cmdname == "create") {
            // undo a create by dropping the collection
            string ns = nss.db().toString() + '.' + obj["create"].String();  // -> foo.abc
            fixUpInfo.toDrop.insert(ns);
            return Status::OK();
        } else if (cmdname == "drop") {
            // undo a drop by copying the whole collection again
            string ns = nss.db().toString() + '.' + first.valuestr();
            fixUpInfo.collectionsToResyncData.insert(ns);
            return Status::OK();
        } else if (cmdname == "dropIndexes" || cmdname == "deleteIndexes") {
            string ns = nss.db().toString() + '.' + first.valuestr();
            fixUpInfo.collectionsToResyncData.insert(ns);
            return Status::OK();
        } else if (cmdname == "renameCollection") {
            string from = first.valuestr();
            string to = obj["to"].String();
            fixUpInfo.collectionsToResyncData.insert(from);
            fixUpInfo.collectionsToResyncData.insert(to);
            return Status::OK();
        } else if (cmdname == "dropDatabase") {
            severe() << "rollback : can't rollback drop database full resync "
                     << "will be required";
            log() << obj.toString();
            throw RSFatalException();
        } else if (cmdname == "collMod") {
            const auto ns = NamespaceString(cmd->parseNs(nss.db().toString(), obj));
            for (auto field : obj) {
                const auto modification = field.fieldNameStringData();
                if (modification == "validator" || modification == "validationAction" ||
                    modification == "validationLevel" || modification == "usePowerOf2Sizes" ||
                    modification == "noPadding") {
                    fixUpInfo.collectionsToResyncMetadata.insert(ns.ns());
                    continue;
                }
            }
            return Status::OK();
        } else if (cmdname == "applyOps") {
            // classify each sub-operation recursively
            for (const auto& subopElement : first.Array()) {
                auto subStatus = refetch(fixUpInfo, subopElement.Obj());
            }
            return Status::OK();
        } else {
            // any other command cannot be rolled back
            throw RSFatalException();
        }
    }
    NamespaceString nss(doc.ns);
    if (nss.isSystemDotIndexes()) {
        // an index creation: remember {collection namespace, index name}
        // (simplified; field validation omitted)
        auto pairToInsert = std::make_pair(obj["ns"].String(), obj["name"].String());
        fixUpInfo.indexesToDrop.insert(pairToInsert);
        return Status::OK();
    }
    // an ordinary document operation: refetch the document by _id
    doc._id = obj["_id"];
    fixUpInfo.toRefetch.insert(doc);
    return Status::OK();
}
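
The classification performed by refetch can be illustrated on a flattened model. This is a sketch under stated assumptions: `Op`, `Fixups`, and `classify` are hypothetical stand-ins for the oplog entry, FixUpInfo, and refetch, and only a few command types are covered.

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical, flattened view of an oplog entry.
struct Op {
    char type;           // 'i', 'u', 'd' for document operations, 'c' for commands
    std::string db;      // database name
    std::string cmd;     // command name when type == 'c'
    std::string target;  // collection name
    int id;              // _id of the document for non-command entries
};

// Simplified counterpart of FixUpInfo.
struct Fixups {
    std::set<std::pair<std::string, int>> toRefetch;  // (namespace, _id)
    std::set<std::string> toDrop;
    std::set<std::string> collectionsToResyncData;
};

// Records what must be done to undo `op`.
// Returns false for commands this sketch cannot undo.
bool classify(Fixups& fix, const Op& op) {
    const std::string ns = op.db + '.' + op.target;
    if (op.type == 'c') {
        if (op.cmd == "create") {
            fix.toDrop.insert(ns);  // undo a create by dropping
        } else if (op.cmd == "drop" || op.cmd == "dropIndexes") {
            fix.collectionsToResyncData.insert(ns);  // undo a drop by recopying
        } else {
            return false;  // e.g. dropDatabase requires a full resync
        }
        return true;
    }
    // Document operation: refetch the source's version. Because toRefetch is a
    // set, repeated operations on one document are recorded only once.
    fix.toRefetch.insert({ns, op.id});
    return true;
}
```

Using sets here mirrors the comment in FixUpInfo: many updates to one document still lead to a single refetch.
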
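Phase 3: roll back using the oplog differences:
Using the FixUpInfo produced in phase 2, each category is handled in turn.
- Iterate over fixUpInfo.toRefetch, fetch each document from the source, and build a two-level map map(ns, map(id, doc));
- Iterate over fixUpInfo.collectionsToResyncData, drop each collection, and copy it again from the source;
- Iterate over fixUpInfo.collectionsToResyncMetadata and update each collection's metadata;
- Iterate over fixUpInfo.toDrop; for each collection, write every document to removeSaver, then drop the collection;
- Iterate over fixUpInfo.indexesToDrop and drop the indexes;
- Iterate over map(ns, map(id, doc)), write each document that is about to be deleted or updated to removeSaver, then delete or update it;
- Truncate the oplog after fixUpInfo.commonPoint;
- Update lastAppliedOpTime;
The code is shown below in simplified form; each block corresponds to one item in the list above.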
void syncFixUp(OperationContext* txn,
               FixUpInfo& fixUpInfo,
               const RollbackSource& rollbackSource,
               ReplicationCoordinator* replCoord) {
    unsigned long long totalSize = 0;
    // namespace -> doc id -> doc
    map<string, map<DocID, BSONObj>> goodVersions;
    BSONObj newMinValid;
    // fetch all the goodVersions of each document from current primary
    DocID doc;
    unsigned long long numFetched = 0;
    try {
        for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
             it++) {
            doc = *it;
            {
                // TODO : slow. lots of round trips.
                numFetched++;
                BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
                totalSize += good.objsize();
                // note good might be eoo, indicating we should delete it
                goodVersions[doc.ns][doc] = good;
            }
        }
        newMinValid = rollbackSource.getLastOperation();
    } catch (const DBException& e) {
        // logging omitted; a failed fetch aborts the rollback
        throw;
    }
    bool warn = false;
    // the documents written below do not come from a single point in time, so the node
    // must not come online until it has caught up to minValid
    OpTime minValid = fassertStatusOK(28774, OpTime::parseFromOplogEntry(newMinValid));
    log() << "minvalid=" << minValid;
    setMinValid(txn, {OpTime{}, minValid});
    // any full collection resyncs required?
    if (!fixUpInfo.collectionsToResyncData.empty() ||
        !fixUpInfo.collectionsToResyncMetadata.empty()) {
        for (const string& ns : fixUpInfo.collectionsToResyncData) {
            log() << "rollback 4.1.1 coll resync " << ns;
            fixUpInfo.indexesToDrop.erase(ns);
            fixUpInfo.collectionsToResyncMetadata.erase(ns);
            const NamespaceString nss(ns);
            {
                ScopedTransaction transaction(txn, MODE_IX);
                Lock::DBLock dbLock(txn->lockState(), nss.db(), MODE_X);
                Database* db = dbHolder().openDb(txn, nss.db().toString());
                invariant(db);
                WriteUnitOfWork wunit(txn);
                db->dropCollection(txn, ns);
                wunit.commit();
            }
            rollbackSource.copyCollectionFromRemote(txn, nss);
        }
        for (const string& ns : fixUpInfo.collectionsToResyncMetadata) {
            log() << "rollback 4.1.2 coll metadata resync " << ns;
            const NamespaceString nss(ns);
            ScopedTransaction transaction(txn, MODE_IX);
            Lock::DBLock dbLock(txn->lockState(), nss.db(), MODE_X);
            auto db = dbHolder().openDb(txn, nss.db().toString());
            invariant(db);
            auto collection = db->getCollection(ns);
            invariant(collection);
            auto cce = collection->getCatalogEntry();
            auto infoResult = rollbackSource.getCollectionInfo(nss);
            // update collection metadata from infoResult (details omitted)
        }
        // more data was read from the source, so refresh minValid
        newMinValid = rollbackSource.getLastOperation();
        OpTime minValid = fassertStatusOK(28775, OpTime::parseFromOplogEntry(newMinValid));
        log() << "minvalid=" << minValid;
        const OpTime start{fixUpInfo.commonPoint, OpTime::kUninitializedTerm};
        setMinValid(txn, {start, minValid});
    }
    // drop collections to drop before doing individual fixups - that might make things faster
    // below actually if there were subsequent inserts to rollback
    for (set<string>::iterator it = fixUpInfo.toDrop.begin(); it != fixUpInfo.toDrop.end(); it++) {
        log() << "rollback drop: " << *it;
        fixUpInfo.indexesToDrop.erase(*it);
        ScopedTransaction transaction(txn, MODE_IX);
        const NamespaceString nss(*it);
        Lock::DBLock dbLock(txn->lockState(), nss.db(), MODE_X);
        Database* db = dbHolder().get(txn, nsToDatabaseSubstring(*it));
        if (db) {
            WriteUnitOfWork wunit(txn);
            Helpers::RemoveSaver removeSaver("rollback", "", *it);
            // perform a collection scan and write all documents in the collection to disk
            std::unique_ptr<PlanExecutor> exec(InternalPlanner::collectionScan(
                txn, *it, db->getCollection(*it), PlanExecutor::YIELD_MANUAL));
            BSONObj curObj;
            PlanExecutor::ExecState execState;
            while (PlanExecutor::ADVANCED == (execState = exec->getNext(&curObj, NULL))) {
                auto status = removeSaver.goingToDelete(curObj);
                if (!status.isOK()) {
                    throw RSFatalException();
                }
            }
            if (execState != PlanExecutor::IS_EOF) {
                throw RSFatalException();
            }
            db->dropCollection(txn, *it);
            wunit.commit();
        }
    }
    // Drop indexes.
    for (auto it = fixUpInfo.indexesToDrop.begin(); it != fixUpInfo.indexesToDrop.end(); it++) {
        const NamespaceString nss(it->first);
        const string& indexName = it->second;
        log() << "rollback drop index: collection: " << nss.toString() << ". index: " << indexName;
        // (elided in this excerpt) lock the database, look up the collection, and obtain
        // its index catalog as indexCatalog
        bool includeUnfinishedIndexes = false;
        auto indexDescriptor =
            indexCatalog->findIndexByName(txn, indexName, includeUnfinishedIndexes);
        auto status = indexCatalog->dropIndex(txn, indexDescriptor);
    }
    unsigned deletes = 0, updates = 0;
    time_t lastProgressUpdate = time(0);
    time_t progressUpdateGap = 10;
    for (const auto& nsAndGoodVersionsByDocID : goodVersions) {
        // Keep an archive of items rolled back if the collection has not been dropped
        // while rolling back createCollection operations.
        const auto& ns = nsAndGoodVersionsByDocID.first;
        unique_ptr<Helpers::RemoveSaver> removeSaver;
        if (!fixUpInfo.toDrop.count(ns)) {
            removeSaver.reset(new Helpers::RemoveSaver("rollback", "", ns));
        }
        const auto& goodVersionsByDocID = nsAndGoodVersionsByDocID.second;
        for (const auto& idAndDoc : goodVersionsByDocID) {
            time_t now = time(0);
            if (now - lastProgressUpdate > progressUpdateGap) {
                log() << deletes << " delete and " << updates
                      << " update operations processed out of " << goodVersions.size()
                      << " total operations";
                lastProgressUpdate = now;
            }
            const DocID& doc = idAndDoc.first;
            BSONObj pattern = doc._id.wrap();  // { _id : ... }
            try {
                // take the database lock and open a context for this namespace
                const NamespaceString docNss(doc.ns);
                ScopedTransaction transaction(txn, MODE_IX);
                Lock::DBLock docDbLock(txn->lockState(), docNss.db(), MODE_X);
                OldClientContext ctx(txn, doc.ns);
                Collection* collection = ctx.db()->getCollection(doc.ns);
                // archive the local version before it is changed
                if (collection && removeSaver) {
                    BSONObj obj;
                    bool found = Helpers::findOne(txn, collection, pattern, obj, false);
                    if (found) {
                        auto status = removeSaver->goingToDelete(obj);
                    }
                }
                if (idAndDoc.second.isEmpty()) {
                    // wasn't on the primary; delete.
                    // TODO 1.6 : can't delete from a capped collection. need to handle that
                    // here.
                    deletes++;
                    if (collection) {
                        if (collection->isCapped()) {
                            // documents cannot be deleted from a capped collection, so
                            // truncate from this document onward instead
                            RecordId loc = Helpers::findOne(txn, collection, pattern, false);
                            if (!loc.isNull()) {
                                collection->temp_cappedTruncateAfter(txn, loc, true);
                            }
                        } else {
                            deleteObjects(txn,
                                          collection,
                                          doc.ns,
                                          pattern,
                                          PlanExecutor::YIELD_MANUAL,
                                          true,   // justone
                                          true);  // god
                        }
                        // did we just empty the collection? if so let's check if it even
                        // exists on the source.
                        if (collection->numRecords(txn) == 0) {
                            try {
                                NamespaceString nss(doc.ns);
                                auto infoResult = rollbackSource.getCollectionInfo(nss);
                                if (!infoResult.isOK()) {
                                    // not on the source either: drop it
                                    WriteUnitOfWork wunit(txn);
                                    ctx.db()->dropCollection(txn, doc.ns);
                                    wunit.commit();
                                }
                            } catch (const DBException& ex) {
                                // not fatal; logging omitted
                            }
                        }
                    }
                } else {
                    // TODO faster...
                    OpDebug debug;
                    updates++;
                    const NamespaceString requestNs(doc.ns);
                    UpdateRequest request(requestNs);
                    request.setQuery(pattern);
                    request.setUpdates(idAndDoc.second);
                    request.setGod();
                    request.setUpsert();
                    UpdateLifecycleImpl updateLifecycle(true, requestNs);
                    request.setLifecycle(&updateLifecycle);
                    update(txn, ctx.db(), request, &debug);
                }
            } catch (const DBException& e) {
                log() << "exception in rollback ns:" << doc.ns << ' ' << pattern.toString() << ' '
                      << e.toString() << " ndeletes:" << deletes;
                warn = true;
            }
        }
    }
    log() << "rollback 5 d:" << deletes << " u:" << updates;
    log() << "rollback 6";
    // clean up oplog
    LOG(2) << "rollback truncate oplog after " << fixUpInfo.commonPoint.toStringPretty();
    {
        const NamespaceString oplogNss(rsOplogName);
        ScopedTransaction transaction(txn, MODE_IX);
        Lock::DBLock oplogDbLock(txn->lockState(), oplogNss.db(), MODE_IX);
        Lock::CollectionLock oplogCollectionLoc(txn->lockState(), oplogNss.ns(), MODE_X);
        OldClientContext ctx(txn, rsOplogName);
        Collection* oplogCollection = ctx.db()->getCollection(rsOplogName);
        if (!oplogCollection) {
            fassertFailedWithStatusNoTrace(13423,
                                           Status(ErrorCodes::UnrecoverableRollbackError,
                                                  str::stream() << "Can't find " << rsOplogName));
        }
        // TODO: fatal error if this throws?
        oplogCollection->temp_cappedTruncateAfter(txn, fixUpInfo.commonPointOurDiskloc, false);
    }
    Status status = getGlobalAuthorizationManager()->initialize(txn);
    if (!status.isOK()) {
        warning() << "Failed to reinitialize auth data after rollback: " << status;
        warn = true;
    }
    // Reload the lastAppliedOpTime and lastDurableOpTime value in the replcoord and the
    // lastAppliedHash value in bgsync to reflect our new last op.
    replCoord->resetLastOpTimesFromOplog(txn);
    // done
    if (warn)
        warning() << "issues during syncRollback, see log";
    else
        log() << "rollback done";
}
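
The core of phase 3 can be sketched on an in-memory model: archive each local document before touching it, delete it when the source has no such document, otherwise overwrite it with the source's version, and finally cut the oplog back to the common point. All names below (`Collection`, `Database`, `GoodVersion`, `applyGoodVersions`, `truncateAfter`) are hypothetical and stand in for the MongoDB types used above; locking, capped collections, and index handling are left out.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Collection = std::map<int, std::string>;       // _id -> document body
using Database = std::map<std::string, Collection>;  // namespace -> collection

// A version fetched from the sync source. `exists == false` means the source
// has no such document, so the local copy must be deleted.
struct GoodVersion {
    bool exists;
    std::string body;
};

// Applies the fetched versions to the local data. Returns the local documents
// that were overwritten or deleted, playing the role of the RemoveSaver archive.
std::vector<std::string> applyGoodVersions(
    Database& db,
    const std::map<std::string, std::map<int, GoodVersion>>& goodVersions) {
    std::vector<std::string> archived;
    for (const auto& nsAndDocs : goodVersions) {
        Collection& coll = db[nsAndDocs.first];
        for (const auto& idAndDoc : nsAndDocs.second) {
            auto it = coll.find(idAndDoc.first);
            if (it != coll.end()) {
                archived.push_back(it->second);  // save the local version first
            }
            if (!idAndDoc.second.exists) {
                coll.erase(idAndDoc.first);  // not on the source: delete
            } else {
                coll[idAndDoc.first] = idAndDoc.second.body;  // upsert
            }
        }
    }
    return archived;
}

// Removes every oplog entry after index `commonPoint` (entries are ordered
// oldest first), keeping the common point itself.
void truncateAfter(std::vector<int>& oplog, std::size_t commonPoint) {
    if (commonPoint + 1 < oplog.size()) {
        oplog.resize(commonPoint + 1);
    }
}
```

Archiving before modifying is what allows rolled-back writes to be inspected afterwards, which is the reason the original code calls removeSaver before every delete or update.
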