Replica Set Replication Rollback

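Rollback scenario

The old primary may hold data that the new primary does not have, which leaves the primary and the secondaries inconsistent. For example: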

Node        t0  t1  t2  t3  t4
Primary     1   2   3   4   5
Secondary1  1   2   3   4
Secondary2  1   2   3

Suppose the primary fails at t4 and Secondary1 is promoted to primary. The value 6 is then written at t4 and Secondary2 syncs from the new primary, giving:

Node        t0  t1  t2  t3  t4
Primary     1   2   3   4   6
Secondary2  1   2   3   4   6

Now the old primary rejoins the replica set as Secondary1, giving:

Node        t0  t1  t2  t3  t4
Primary     1   2   3   4   6
Secondary1  1   2   3   4   5
Secondary2  1   2   3   4   6
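
The divergence in the tables above can be located mechanically. The following is a minimal sketch, assuming each oplog is modelled as nothing more than the list of values written at t0, t1, and so on; the `Oplog` alias and `firstDivergence` are hypothetical names used only for illustration and are not part of MongoDB.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: an oplog is just the sequence of values written at t0, t1, ...
using Oplog = std::vector<int>;

// Returns the index of the first position where the two oplogs differ,
// or the length of the shorter oplog if one is a prefix of the other.
std::size_t firstDivergence(const Oplog& a, const Oplog& b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        ++i;
    }
    return i;
}
```

With the new primary's oplog `{1, 2, 3, 4, 6}` and the rejoined node's oplog `{1, 2, 3, 4, 5}`, the function returns 4, that is, the two nodes agree through t3 and diverge at t4.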

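Structure of replication rollback

The primary and Secondary1 clearly disagree at t4. The system must roll back Secondary1's value at t4 and then sync the primary's value for that point in time. The whole process is shown below:
[Figure: flow of the rollback process]
As the figure shows, the rollback consists of three phases:
Phase 1: obtain the rollback ID.
Phase 2: RollBackLocalOperations walks the local oplog backwards. For each entry, RollBackLocalOperations::onRemoteOperation compares the local oplog entry with the source's entry, checking both the timestamp and the content. If they differ, the information needed to undo the local entry is saved in FixUpInfo for the later rollback. This repeats until an entry is found that is identical locally and remotely; if no such common point exists, the rollback fails.
This phase ultimately produces a FixUpInfo structure that captures the differences between the local and source oplogs: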

struct FixUpInfo {
    // note this is a set -- if there are many $inc's on a single document we need to rollback,
    // we only need to refetch it once.
    set<DocID> toRefetch;

    // collections to drop
    set<string> toDrop;

    // Indexes to drop.
    // Key is collection namespace. Value is name of index to drop.
    multimap<string, string> indexesToDrop;

    set<string> collectionsToResyncData;
    set<string> collectionsToResyncMetadata;

    Timestamp commonPoint;
    RecordId commonPointOurDiskloc;

    int rbid;  // remote server's current rollback sequence #
};
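
The comment on toRefetch explains why it is a set: many operations on one document need only a single refetch. A minimal sketch of that behaviour follows, assuming a simplified stand-in for DocID that holds only a namespace and an integer _id; `SimpleDocID` and `queueRefetch` are hypothetical names, not MongoDB types.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <tuple>

// Hypothetical, simplified stand-in for DocID: a namespace plus an _id value.
struct SimpleDocID {
    std::string ns;
    int id;
    // Ordering by (ns, id) lets std::set treat repeated documents as duplicates.
    bool operator<(const SimpleDocID& other) const {
        return std::tie(ns, id) < std::tie(other.ns, other.id);
    }
};

// Queues a document for refetching and returns how many distinct documents are queued.
std::size_t queueRefetch(std::set<SimpleDocID>& toRefetch, const SimpleDocID& doc) {
    toRefetch.insert(doc);
    return toRefetch.size();
}
```

Queueing the same (namespace, _id) pair twice leaves the set size unchanged, so the later fetch phase makes one round trip per document rather than one per oplog entry.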
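
// rollbackOperation is the refetch function shown further below.
StatusWith<RollBackLocalOperations::RollbackCommonPoint> syncRollBackLocalOperations(
    const OplogInterface& localOplog,
    const OplogInterface& remoteOplog,
    const RollBackLocalOperations::RollbackOperationFn& rollbackOperation) {
    // Iterator that walks the remote oplog from newest to oldest.
    auto remoteIterator = remoteOplog.makeIterator();
    auto remoteResult = remoteIterator->next();

    RollBackLocalOperations finder(localOplog, rollbackOperation);
    Timestamp theirTime;
    while (remoteResult.isOK()) {
        theirTime = remoteResult.getValue().first["ts"].timestamp();
        BSONObj theirObj = remoteResult.getValue().first;
        auto result = finder.onRemoteOperation(theirObj);
        if (result.isOK()) {
            // Found the common oplog point; return it.
            return result.getValue();
        } else if (result.getStatus().code() != ErrorCodes::NoSuchKey) {
            // Any other error ends the search, e.g. the matching entry is missing
            // from the oplog or has already been overwritten.
            return result;
        }
        // NoSuchKey: the local and remote oplogs still differ at this entry;
        // move on to the next (older) remote entry.
        remoteResult = remoteIterator->next();
    }

    // The remote oplog was exhausted without finding a common point.
    return StatusWith<RollBackLocalOperations::RollbackCommonPoint>(
        ErrorCodes::NoMatchingDocument, "reached beginning of remote oplog");
}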
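
The backward search for a common point can be illustrated without any MongoDB types. This is a minimal sketch under stated assumptions: each oplog entry is reduced to a timestamp and a hash, both oplogs are held in memory oldest first, and `Entry` and `findCommonPoint` are hypothetical names. It is not the RollBackLocalOperations implementation, only the same idea in miniature.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical simplified oplog entry: a timestamp plus a hash of the operation.
struct Entry {
    std::uint64_t ts;
    std::uint64_t hash;
};

// Walks both oplogs from newest to oldest. Every local entry that is newer than the
// common point is passed to rollbackOp. On success stores the common timestamp in
// commonTs and returns true; returns false if the oplogs share no entry.
bool findCommonPoint(const std::vector<Entry>& local,
                     const std::vector<Entry>& remote,
                     const std::function<void(const Entry&)>& rollbackOp,
                     std::uint64_t& commonTs) {
    auto l = local.rbegin();
    auto r = remote.rbegin();
    while (l != local.rend() && r != remote.rend()) {
        if (l->ts == r->ts && l->hash == r->hash) {
            commonTs = l->ts;  // identical entry on both sides
            return true;
        }
        if (l->ts >= r->ts) {
            rollbackOp(*l);  // local entry is not on the remote: it must be undone
            ++l;
        } else {
            ++r;  // remote entry is newer than the current local entry: skip it
        }
    }
    return false;
}
```

Applied to the scenario above (local hashes 1, 2, 3, 4, 5 and remote hashes 1, 2, 3, 4, 6 at t0 to t4), the callback is invoked once for the local entry at t4 and the common point is t3.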

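Each inconsistent oplog entry is recorded in a different FixUpInfo collection according to its type, and the later rollback steps act on the contents of these collections. The code below omits error handling: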

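Status refetch(FixUpInfo& fixUpInfo, const BSONObj& ourObj) {
    const char* op = ourObj.getStringField("op");

    // doc identifies the affected document; obj is the operation payload
    // ("o2" for updates, "o" otherwise).
    DocID doc;
    doc.ownedObj = ourObj.getOwned();
    doc.ns = doc.ownedObj.getStringField("ns");
    BSONObj obj = doc.ownedObj.getObjectField(*op == 'u' ? "o2" : "o");

    if (*op == 'c') {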
        BSONElement first = obj.firstElement();
        NamespaceString nss(doc.ns);  // foo.$cmd
        string cmdname = first.fieldName();
        Command* cmd = Command::findCommand(cmdname.c_str());

        if (cmdname == "create") {
            string ns = nss.db().toString() + '.' + obj["create"].String();  // -> foo.abc
            fixUpInfo.toDrop.insert(ns);
            return Status::OK();
        } else if (cmdname == "drop") {
            string ns = nss.db().toString() + '.' + first.valuestr();
            fixUpInfo.collectionsToResyncData.insert(ns);
            return Status::OK();
        } else if (cmdname == "dropIndexes" || cmdname == "deleteIndexes") {
            string ns = nss.db().toString() + '.' + first.valuestr();
            fixUpInfo.collectionsToResyncData.insert(ns);
            return Status::OK();
        } else if (cmdname == "renameCollection") {
            string from = first.valuestr();
            string to = obj["to"].String();
            fixUpInfo.collectionsToResyncData.insert(from);
            fixUpInfo.collectionsToResyncData.insert(to);
            return Status::OK();
        } else if (cmdname == "dropDatabase") {
            severe() << "rollback : can't rollback drop database full resync "
                     << "will be required";
            log() << obj.toString();
            throw RSFatalException();
        } else if (cmdname == "collMod") {
            const auto ns = NamespaceString(cmd->parseNs(nss.db().toString(), obj));
            for (auto field : obj) {
                const auto modification = field.fieldNameStringData();
                if (modification == "validator" || modification == "validationAction" ||
                    modification == "validationLevel" || modification == "usePowerOf2Sizes" ||
                    modification == "noPadding") {
                    fixUpInfo.collectionsToResyncMetadata.insert(ns.ns());
                    continue;
                }
            }
            return Status::OK();
        } else if (cmdname == "applyOps") {
            for (const auto& subopElement : first.Array()) {
                auto subStatus = refetch(fixUpInfo, subopElement.Obj());
            }
            return Status::OK();
        } else {
            // Any other command cannot be rolled back; a full resync is required.
            throw RSFatalException();
        }
    }

    NamespaceString nss(doc.ns);
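    if (nss.isSystemDotIndexes()) {
        // pairToInsert is (collection namespace, index name), taken from the inserted
        // index spec; the extraction is omitted in this excerpt.
        fixUpInfo.indexesToDrop.insert(pairToInsert);
        return Status::OK();
    }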

    doc._id = obj["_id"];
    fixUpInfo.toRefetch.insert(doc);
    return Status::OK();
}
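
The classification performed by refetch can be summarised in a much smaller form. The sketch below is a simplification under stated assumptions: `SimpleOp`, `SimpleFixUp` and `classify` are hypothetical names, only a few command types are covered, and namespaces and ids are plain strings. It shows the mapping from operation type to fix-up set, not the real parsing of BSON oplog entries.

```cpp
#include <set>
#include <string>

// Hypothetical simplified fix-up record; the real FixUpInfo holds more fields.
struct SimpleFixUp {
    std::set<std::string> toDrop;        // collections created locally
    std::set<std::string> toResyncData;  // collections to copy again from the source
    std::set<std::string> toRefetch;     // "ns:_id" keys of documents to refetch
};

// Hypothetical simplified oplog entry.
struct SimpleOp {
    char op;          // 'i', 'u', 'd' for documents, 'c' for commands
    std::string ns;   // full namespace, e.g. "foo.abc"
    std::string cmd;  // command name when op == 'c'
    std::string id;   // document _id when op is 'i', 'u' or 'd'
};

// Records how to undo one local operation. Returns false when the operation
// cannot be rolled back in this sketch (for example dropDatabase).
bool classify(SimpleFixUp& fix, const SimpleOp& entry) {
    if (entry.op == 'c') {
        if (entry.cmd == "create") {
            fix.toDrop.insert(entry.ns);  // undo a create by dropping the collection
        } else if (entry.cmd == "drop" || entry.cmd == "dropIndexes") {
            fix.toResyncData.insert(entry.ns);  // lost data or indexes must be copied again
        } else {
            return false;
        }
        return true;
    }
    // Document operations: remember the document so the source's version can be fetched.
    fix.toRefetch.insert(entry.ns + ":" + entry.id);
    return true;
}
```

Two updates to the same document produce a single toRefetch key, and an unsupported command makes `classify` return false, mirroring the RSFatalException path in the excerpt above.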

Phase 3 is the oplog rollback itself:
Using the FixUpInfo produced in phase 2, each kind of difference is handled in turn.

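  • Iterate over fixUpInfo.toRefetch and build a two-level map, map(ns, map(id, doc)), holding the source's version of each document;
  • Iterate over fixUpInfo.collectionsToResyncData: drop each collection and copy it again from the source;
  • Iterate over fixUpInfo.collectionsToResyncMetadata: update each collection's metadata;
  • Iterate over fixUpInfo.toDrop: for each collection, write every document to removeSaver, then drop the collection;
  • Iterate over fixUpInfo.indexesToDrop: drop each index;
  • Iterate over map(ns, map(id, doc)): write each document about to be deleted or updated to removeSaver, then delete or update it;
  • Truncate the oplog after fixUpInfo.commonPoint;
  • Update lastAppliedOpTime.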
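
The per-document step in the list above (archive the local version, then delete or upsert) can be sketched on in-memory data. This is an illustration under stated assumptions: a collection is a map from an integer _id to a string body, an empty string means the document does not exist on the source, and `applyGoodVersions` and `archive` are hypothetical names, with `archive` playing the role of removeSaver.

```cpp
#include <map>
#include <string>
#include <vector>

using Collection = std::map<int, std::string>;  // _id -> document body

// Applies the versions fetched from the sync source to the local collection.
// Every local version that is about to change is first appended to archive.
void applyGoodVersions(Collection& local,
                       const std::map<int, std::string>& goodVersions,
                       std::vector<std::string>& archive) {
    for (const auto& idAndDoc : goodVersions) {
        auto it = local.find(idAndDoc.first);
        if (it != local.end()) {
            archive.push_back(it->second);  // keep a copy of what is rolled back
        }
        if (idAndDoc.second.empty()) {
            if (it != local.end()) {
                local.erase(it);  // not on the source: delete locally
            }
        } else {
            local[idAndDoc.first] = idAndDoc.second;  // upsert the source's version
        }
    }
}
```

Archiving before modifying matters because the rolled-back writes were acknowledged on the old primary; keeping a copy lets an operator inspect or re-apply them later.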

The code is as follows, simplified; each block corresponds to one of the steps above.

void syncFixUp(OperationContext* txn,
               FixUpInfo& fixUpInfo,
               const RollbackSource& rollbackSource,
               ReplicationCoordinator* replCoord) {

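    // fetch all the goodVersions of each document from current primary
    DocID doc;
    unsigned long long totalSize = 0;
    unsigned long long numFetched = 0;
    map<string, map<DocID, BSONObj>> goodVersions;  // namespace -> doc id -> doc
    BSONObj newMinValid;
    bool warn = false;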
    try {
        for (set<DocID>::iterator it = fixUpInfo.toRefetch.begin(); it != fixUpInfo.toRefetch.end();
             it++) {
            doc = *it;
            {
                // TODO : slow.  lots of round trips.
                numFetched++;
                BSONObj good = rollbackSource.findOne(NamespaceString(doc.ns), doc._id.wrap());
                totalSize += good.objsize();


                // note good might be eoo, indicating we should delete it
                goodVersions[doc.ns][doc] = good;
            }
        }
        newMinValid = rollbackSource.getLastOperation();
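    } catch (const DBException& e) {
        // Error handling omitted.
    }

    // The refetched documents are not from a single point in time, so this node
    // should not come online until it has caught up to minValid.
    OpTime minValid = OpTime::parseFromOplogEntry(newMinValid).getValue();
    log() << "minvalid=" << minValid;
    setMinValid(txn, {OpTime{}, minValid});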

    // any full collection resyncs required?
    if (!fixUpInfo.collectionsToResyncData.empty() ||
        !fixUpInfo.collectionsToResyncMetadata.empty()) {
        for (const string& ns : fixUpInfo.collectionsToResyncData) {
            log() << "rollback 4.1.1 coll resync " << ns;

            fixUpInfo.indexesToDrop.erase(ns);
            fixUpInfo.collectionsToResyncMetadata.erase(ns);

            const NamespaceString nss(ns);


            {
                ScopedTransaction transaction(txn, MODE_IX);
                Lock::DBLock dbLock(txn->lockState(), nss.db(), MODE_X);
                Database* db = dbHolder().openDb(txn, nss.db().toString());
                invariant(db);
                WriteUnitOfWork wunit(txn);
                db->dropCollection(txn, ns);
                wunit.commit();
            }

            rollbackSource.copyCollectionFromRemote(txn, nss);
        }

        for (const string& ns : fixUpInfo.collectionsToResyncMetadata) {
            log() << "rollback 4.1.2 coll metadata resync " << ns;

            const NamespaceString nss(ns);
            ScopedTransaction transaction(txn, MODE_IX);
            Lock::DBLock dbLock(txn->lockState(), nss.db(), MODE_X);
            auto db = dbHolder().openDb(txn, nss.db().toString());
            invariant(db);
            auto collection = db->getCollection(ns);
            invariant(collection);
            auto cce = collection->getCatalogEntry();

            auto infoResult = rollbackSource.getCollectionInfo(nss);

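            // update collection metadata from infoResult (omitted)
        }

        // More data was read from the source, so refresh minValid to keep this node
        // from becoming primary prematurely.
        newMinValid = rollbackSource.getLastOperation();
        OpTime minValid = fassertStatusOK(28775, OpTime::parseFromOplogEntry(newMinValid));
        log() << "minvalid=" << minValid;
        const OpTime start{fixUpInfo.commonPoint, OpTime::kUninitializedTerm};
        setMinValid(txn, {start, minValid});
    }
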
    // drop collections to drop before doing individual fixups - that might make things faster
    // below actually if there were subsequent inserts to rollback
    for (set<string>::iterator it = fixUpInfo.toDrop.begin(); it != fixUpInfo.toDrop.end(); it++) {
        log() << "rollback drop: " << *it;

        fixUpInfo.indexesToDrop.erase(*it);

        ScopedTransaction transaction(txn, MODE_IX);
        const NamespaceString nss(*it);
        Lock::DBLock dbLock(txn->lockState(), nss.db(), MODE_X);
        Database* db = dbHolder().get(txn, nsToDatabaseSubstring(*it));
        if (db) {
            WriteUnitOfWork wunit(txn);

            Helpers::RemoveSaver removeSaver("rollback", "", *it);

            // perform a collection scan and write all documents in the collection to disk
            std::unique_ptr<PlanExecutor> exec(InternalPlanner::collectionScan(
                txn, *it, db->getCollection(*it), PlanExecutor::YIELD_MANUAL));
            BSONObj curObj;
            PlanExecutor::ExecState execState;
            while (PlanExecutor::ADVANCED == (execState = exec->getNext(&curObj, NULL))) {
                auto status = removeSaver.goingToDelete(curObj);
                if (!status.isOK()) {
                    throw RSFatalException();
                }
            }
            if (execState != PlanExecutor::IS_EOF) {
                throw RSFatalException();
            }

            db->dropCollection(txn, *it);
            wunit.commit();
        }
    }

    // Drop indexes.
    for (auto it = fixUpInfo.indexesToDrop.begin(); it != fixUpInfo.indexesToDrop.end(); it++) {
        const NamespaceString nss(it->first);
        const string& indexName = it->second;
        log() << "rollback drop index: collection: " << nss.toString() << ". index: " << indexName;

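        // Look up the collection's index catalog (locking omitted in this excerpt).
        auto db = dbHolder().get(txn, nss.db());
        if (!db) {
            continue;
        }
        auto collection = db->getCollection(nss.toString());
        if (!collection) {
            continue;
        }
        auto indexCatalog = collection->getIndexCatalog();

        bool includeUnfinishedIndexes = false;
        auto indexDescriptor =
            indexCatalog->findIndexByName(txn, indexName, includeUnfinishedIndexes);
        if (!indexDescriptor) {
            continue;  // the index no longer exists locally
        }

        auto status = indexCatalog->dropIndex(txn, indexDescriptor);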

    }

    unsigned deletes = 0, updates = 0;
    time_t lastProgressUpdate = time(0);
    time_t progressUpdateGap = 10;
    for (const auto& nsAndGoodVersionsByDocID : goodVersions) {
        // Keep an archive of items rolled back if the collection has not been dropped
        // while rolling back createCollection operations.
        const auto& ns = nsAndGoodVersionsByDocID.first;
        unique_ptr<Helpers::RemoveSaver> removeSaver;
        if (!fixUpInfo.toDrop.count(ns)) {
            removeSaver.reset(new Helpers::RemoveSaver("rollback", "", ns));
        }

        const auto& goodVersionsByDocID = nsAndGoodVersionsByDocID.second;
        for (const auto& idAndDoc : goodVersionsByDocID) {
            time_t now = time(0);
            if (now - lastProgressUpdate > progressUpdateGap) {
                log() << deletes << " delete and " << updates
                      << " update operations processed out of " << goodVersions.size()
                      << " total operations";
                lastProgressUpdate = now;
            }
            const DocID& doc = idAndDoc.first;
            BSONObj pattern = doc._id.wrap();  // { _id : ... }
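            try {
                // TODO: lock granularity
                ScopedTransaction transaction(txn, MODE_IX);
                Lock::DBLock dbLock(txn->lockState(), nsToDatabaseSubstring(doc.ns), MODE_X);
                OldClientContext ctx(txn, doc.ns);

                Collection* collection = ctx.db()->getCollection(doc.ns);

                // Archive the local version of the document before changing it.
                if (collection && removeSaver) {
                    BSONObj obj;
                    bool found = Helpers::findOne(txn, collection, pattern, obj, false);
                    if (found) {
                        auto status = removeSaver->goingToDelete(obj);
                    }
                }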

                if (idAndDoc.second.isEmpty()) {
                    // wasn't on the primary; delete.
                    // TODO 1.6 : can't delete from a capped collection.  need to handle that
                    // here.
                    deletes++;

                    if (collection) {
                        if (collection->isCapped()) {

                            RecordId loc = Helpers::findOne(txn, collection, pattern, false);
                            collection->temp_cappedTruncateAfter(txn, loc, true);

                        } else {
                            deleteObjects(txn,
                                          collection,
                                          doc.ns,
                                          pattern,
                                          PlanExecutor::YIELD_MANUAL,
                                          true,   // justone
                                          true);  // god
                        }
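                        // did we just empty the collection?  if so let's check if it even
                        // exists on the source.
                        if (collection->numRecords(txn) == 0) {
                            try {
                                NamespaceString nss(doc.ns);
                                auto infoResult = rollbackSource.getCollectionInfo(nss);
                                if (!infoResult.isOK()) {
                                    // not on the source either: drop it locally
                                    WriteUnitOfWork wunit(txn);
                                    ctx.db()->dropCollection(txn, doc.ns);
                                    wunit.commit();
                                }
                            } catch (const DBException&) {
                                // not critical; ignore and continue
                            }
                        }
                    }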
                } else {
                    // TODO faster...
                    OpDebug debug;
                    updates++;

                    const NamespaceString requestNs(doc.ns);
                    UpdateRequest request(requestNs);

                    request.setQuery(pattern);
                    request.setUpdates(idAndDoc.second);
                    request.setGod();
                    request.setUpsert();
                    UpdateLifecycleImpl updateLifecycle(true, requestNs);
                    request.setLifecycle(&updateLifecycle);

                    update(txn, ctx.db(), request, &debug);
                }
            } catch (const DBException& e) {
                log() << "exception in rollback ns:" << doc.ns << ' ' << pattern.toString() << ' '
                      << e.toString() << " ndeletes:" << deletes;
                warn = true;
            }
        }
    }

    log() << "rollback 5 d:" << deletes << " u:" << updates;
    log() << "rollback 6";

    // clean up oplog
    LOG(2) << "rollback truncate oplog after " << fixUpInfo.commonPoint.toStringPretty();
    {
        const NamespaceString oplogNss(rsOplogName);
        ScopedTransaction transaction(txn, MODE_IX);
        Lock::DBLock oplogDbLock(txn->lockState(), oplogNss.db(), MODE_IX);
        Lock::CollectionLock oplogCollectionLoc(txn->lockState(), oplogNss.ns(), MODE_X);
        OldClientContext ctx(txn, rsOplogName);
        Collection* oplogCollection = ctx.db()->getCollection(rsOplogName);
        if (!oplogCollection) {
            fassertFailedWithStatusNoTrace(13423,
                                           Status(ErrorCodes::UnrecoverableRollbackError,
                                                  str::stream() << "Can't find " << rsOplogName));
        }
        // TODO: fatal error if this throws?
        oplogCollection->temp_cappedTruncateAfter(txn, fixUpInfo.commonPointOurDiskloc, false);
    }

    Status status = getGlobalAuthorizationManager()->initialize(txn);
    if (!status.isOK()) {
        warning() << "Failed to reinitialize auth data after rollback: " << status;
        warn = true;
    }

    // Reload the lastAppliedOpTime and lastDurableOpTime value in the replcoord and the
    // lastAppliedHash value in bgsync to reflect our new last op.
    replCoord->resetLastOpTimesFromOplog(txn);

    // done
    if (warn)
        warning() << "issues during syncRollback, see log";
    else
        log() << "rollback done";
}
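
The final oplog cleanup keeps the common point itself and discards everything after it. Below is a minimal sketch of that truncation, assuming the oplog is reduced to a sorted vector of timestamps; `truncateAfter` is a hypothetical helper, not the `temp_cappedTruncateAfter` method used in the excerpt.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Removes every entry with a timestamp greater than commonPoint from an oplog kept in
// ascending timestamp order. The common point itself is kept.
// Returns the number of entries removed.
std::size_t truncateAfter(std::vector<std::uint64_t>& oplogTs, std::uint64_t commonPoint) {
    auto firstNewer = std::upper_bound(oplogTs.begin(), oplogTs.end(), commonPoint);
    const std::size_t removed = static_cast<std::size_t>(oplogTs.end() - firstNewer);
    oplogTs.erase(firstNewer, oplogTs.end());
    return removed;
}
```

For the scenario at the start of the article, truncating the rejoined node's oplog (timestamps 0 to 4) after common point 3 removes only the entry written at t4, after which normal replication can re-apply the primary's entry for t4.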