MongoDB Source Code Analysis (5)

Latest recommended article published on 2017-11-20 17:08:07

happylife1527


MongoDB Source Code Analysis (13): Durability

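Let's first walk through the durability (journaling) flow. Durability is enabled by default; to turn it off, start mongod with --nodur or --nojournal. With journaling enabled, MongoDB keeps two memory mappings of every data file, whose base addresses are _view_write and _view_private. _view_private exists solely for durability. This is why mongostat reports a vsize of more than twice mapped: every data file is mapped twice.

_view_private is initially mapped read-only. Because it is a copy-on-write mapping, it consumes no extra physical memory at first, even though it maps the same data as _view_write. When data is about to be modified, MongoDB first makes the affected pages of _view_private writable and then performs the modification. The modification touches neither the data file itself nor the _view_write mapping; because of copy-on-write it lands on a private copy of the affected pages. From that point on the same data may exist twice in memory: the unmodified version in _view_write and the modified version in _view_private.

When the amount of modified data reaches a limit (50 MB on 32-bit builds, 100 MB on 64-bit builds), or when a commit is requested explicitly (by the durability thread or by an ordinary caller), a real commit takes place: the changes recorded against _view_private are compressed and stored in the journal file, and then the same changes are applied to the _view_write mapping. Finally a dedicated thread, DataFileSync, flushes the data in _view_write to disk.

One problem remains: as writes continue, the physical memory actually used by _view_private keeps growing. So after a while the mapping has to be released and a fresh _view_private mapping created, which frees the private pages and brings the memory used by _view_private back to zero. Ordinary operations such as queries, deletes, and updates all go through the _view_private mapping, so they always see the latest data.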
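The two-view idea can be tried out in isolation. The following is a minimal POSIX sketch, not MongoDB code: it maps one temporary file twice, once shared (playing the role of _view_write) and once private and read-only (playing the role of _view_private). The function name demoDualViews, the struct DualViewResult, and the /tmp path are made up for this illustration, and the sketch assumes a Linux-like system with mmap and mprotect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical result type for this demo only.
struct DualViewResult {
    bool sharedUnchangedAfterPrivateWrite;  // copy-on-write kept the change private
    bool privateSeesCommitAfterRemap;       // a fresh private view shows committed data
};

DualViewResult demoDualViews() {
    DualViewResult r = { false, false };
    const std::size_t len = 4096;
    assert(len > 3);

    char path[] = "/tmp/dualviewXXXXXX";
    int fd = mkstemp(path);                 // temporary backing file
    if (fd < 0) return r;
    unlink(path);                           // file disappears once fd is closed
    if (ftruncate(fd, len) != 0) { close(fd); return r; }

    void* w = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (w == MAP_FAILED) { close(fd); return r; }
    void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { munmap(w, len); close(fd); return r; }
    char* viewWrite = static_cast<char*>(w);
    char* viewPrivate = static_cast<char*>(p);

    // "Declare write intent": make the private page writable, then modify it.
    if (mprotect(viewPrivate, len, PROT_READ | PROT_WRITE) == 0) {
        std::memcpy(viewPrivate, "new", 3);
        // The shared view still shows the original zero bytes.
        r.sharedUnchangedAfterPrivateWrite = (viewWrite[0] == 0);

        // "Commit": copy the change into the shared view.
        std::memcpy(viewWrite, viewPrivate, 3);
    }

    // "Remap": drop the private pages and map the file again.
    munmap(viewPrivate, len);
    p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED) {
        r.privateSeesCommitAfterRemap = (std::memcmp(p, "new", 3) == 0);
        munmap(p, len);
    }
    munmap(viewWrite, len);
    close(fd);
    return r;
}
```

Calling demoDualViews() should report that the shared view was untouched by the private write and that the remapped private view sees the committed bytes, which mirrors the write, commit, and remap sequence described above.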

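The journal directory contains the following files:

j._0 j._1 ... j._n: the journal files.

prealloc.0 prealloc.1: preallocated journal files.

lsn: a file that records the timestamp of the last flush to disk, together with its bitwise complement as a check value.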
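The "value plus bitwise complement" check used by the lsn file is easy to sketch. The record layout below is an assumption made for illustration (the names LsnRecord, makeLsnRecord, and readLsn are hypothetical); the real file format may contain more fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: the timestamp and its bitwise complement.
struct LsnRecord {
    std::uint64_t lsn;     // timestamp of the last flush to the data files
    std::uint64_t invLsn;  // ~lsn, stored as a cheap integrity check
};

LsnRecord makeLsnRecord(std::uint64_t lsn) {
    LsnRecord r = { lsn, ~lsn };
    assert(r.invLsn == ~r.lsn);
    return r;
}

// Returns the stored timestamp if the check value matches,
// otherwise 0, meaning "trust nothing and replay every journal section".
std::uint64_t readLsn(const LsnRecord& r) {
    if (r.invLsn != ~r.lsn) return 0;
    return r.lsn;
}
```

A torn or corrupted write is very unlikely to leave both fields consistent, so a mismatch can safely be treated as "no usable timestamp".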

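Note that if the system crashes after data has been written but before the durability code has produced the journal entry for it, that data is lost; given the commit limit above, roughly 100 MB at most can be lost at a time. Now let's look at the actual code, starting with the write path. A typical write looks like this: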

[cpp]view plaincopy 
   
 recNew = (Record *) getDur().writingPtr(recNew, lenWHdr);  

Here is the implementation of writingPtr:

[cpp]view plaincopy 
   
 void* DurableImpl::writingPtr(void *x, unsigned len) {  
     void *p = x;  
     declareWriteIntent(p, len);  
     return p;  
 }  

writingPtr->declareWriteIntent

[cpp]view plaincopy 
   
 void DurableImpl::declareWriteIntent(void *p, unsigned len) {  
     cc().writeHappened();  
     MemoryMappedFile::makeWritable(p, len); // _view_private is initially mapped read-only, so make the region about to be written writable  
     ThreadLocalIntents *t = tlIntents.getMake(); // get the thread-local (TLS) ThreadLocalIntents pointer  
     t->push(WriteIntent(p,len));  
 }  

[cpp]view plaincopy 
   
 void ThreadLocalIntents::push(const WriteIntent& x) { // unspool once every 21 writes  
      if( !commitJob._hasWritten )  
          commitJob._hasWritten = true;  
      if( n == 21 ) // after 21 buffered writes, hand the intents over to the global commitJob and clear the thread-local buffer  
          unspool();  
      i[n++] = x;  
  }  
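The buffering pattern in push can be sketched on its own: each thread collects intents in a small fixed array and only touches the shared structure when the array is full. The types below (Intent, GlobalJob, LocalSpool) are simplified stand-ins invented for this sketch, not the MongoDB classes, and the sketch leaves out locking entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Intent { void* p; unsigned len; };  // stand-in for WriteIntent

// Stand-in for the global commit job: it just collects intents.
struct GlobalJob {
    std::vector<Intent> intents;
    void note(void* p, unsigned len) { intents.push_back(Intent{p, len}); }
};

// Stand-in for the thread-local buffer.
class LocalSpool {
public:
    explicit LocalSpool(GlobalJob& job) : job_(job), n_(0) {}

    void push(const Intent& x) {
        if (n_ == kCapacity) unspool();  // buffer full: hand everything over first
        assert(n_ < kCapacity);
        buf_[n_++] = x;
    }

    // Move all buffered intents to the global job and empty the buffer.
    void unspool() {
        for (std::size_t k = 0; k < n_; ++k) job_.note(buf_[k].p, buf_[k].len);
        n_ = 0;
    }

    std::size_t buffered() const { return n_; }

private:
    static constexpr std::size_t kCapacity = 21;
    GlobalJob& job_;
    Intent buf_[kCapacity];
    std::size_t n_;
};
```

The point of the design is that 21 writes cost only one visit to the shared structure. The price is that a commit must first drain every pending local buffer, which is what unspoolWriteIntents does at the start of _groupCommit.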

writingPtr->declareWriteIntent->push->unspool->_unspool->CommitJob::note

[cpp]view plaincopy 
   
 void CommitJob::note(void* p, int len) {  
     // from the point of view of the dur module, it would be fine (i think) to only  
     // be read locked here.  but must be at least read locked to avoid race with  
     // remapprivateview  
     if( !_intentsAndDurOps._alreadyNoted.checkAndSet(p, len) ) { // false means the intent is new, or its length has changed  
         // remember intent. we will journal it in a bit  
         _intentsAndDurOps.insertWriteIntent(p, len); // the actual recording: the address and length of the change are stored in a vector  
         {  
             // a bit over conservative in counting pagebytes used  
             static size_t lastPos; // note this doesn't reset with each commit, but that is ok we aren't being that precise  
             size_t x = ((size_t) p) & ~0xfff; // round off to page address (4KB)  
             if( x != lastPos ) {   
                 lastPos = x;  
                 unsigned b = (len+4095) & ~0xfff;  
                 _bytes += b; // number of bytes that need to be committed  
                 if (_bytes > UncommittedBytesLimit * 3) { // more than 3x the uncommitted-bytes limit is pending  
                     static time_t lastComplain;  
                     static unsigned nComplains;  
                     // throttle logging  
                     if( ++nComplains < 100 || time(0) - lastComplain >= 60 ) {  
                         lastComplain = time(0);  
                         warning() << "DR102 too much data written uncommitted " << _bytes/1000000.0 << "MB" << endl;  
                         if( nComplains < 10 || nComplains % 10 == 0 ) {  
                             // wassert makes getLastError show an error, so we just print stack trace  
                             printStackTrace();  
                         }  
                     }  
                 }  
             }  
         }  
     }  
 }  
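The byte accounting in note rests on two bit tricks: rounding an address down to its 4 KB page and rounding a length up to whole pages. A minimal sketch of just that arithmetic follows; the names pageStart, roundUpToPage, and ByteCounter are chosen for this illustration, and like the original the counter is deliberately imprecise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kPageSize = 4096;  // 4 KB pages, as in the excerpt

// Round an address down to the start of its page (same effect as p & ~0xfff).
inline std::uintptr_t pageStart(std::uintptr_t addr) {
    return addr & ~static_cast<std::uintptr_t>(kPageSize - 1);
}

// Round a length up to a whole number of pages (same effect as (len+4095) & ~0xfff).
inline std::size_t roundUpToPage(std::size_t len) {
    return (len + kPageSize - 1) & ~(kPageSize - 1);
}

// Counts dirty bytes page-wise; consecutive writes to the same page count once.
class ByteCounter {
public:
    ByteCounter() : lastPage_(0), bytes_(0) {}

    void note(std::uintptr_t addr, std::size_t len) {
        assert(len > 0);
        std::uintptr_t page = pageStart(addr);
        if (page != lastPage_) {
            lastPage_ = page;
            bytes_ += roundUpToPage(len);
        }
    }

    std::size_t bytes() const { return bytes_; }

private:
    std::uintptr_t lastPage_;
    std::size_t bytes_;
};
```

For example, two small writes at 0x1000 and 0x1008 count as a single 4096-byte page, while a 5000-byte write starting at 0x3000 counts as 8192 bytes.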

Next, the file creation and deletion operations.

[cpp]view plaincopy 
   
 void DurableImpl::createdFile(string filename, unsigned long long len) {  
     shared_ptr<DurOp> op( new FileCreatedOp(filename, len) ); // record the name of the created file and its length  
     commitJob.noteOp(op);  
 }  

[cpp]view plaincopy 
   
 void CommitJob::noteOp(shared_ptr<DurOp> p) {  
     dassert( cmdLine.dur );  
     // DurOp's are rare so it is ok to have the lock cost here  
     SimpleMutex::scoped_lock lk(groupCommitMutex);  
     cc().writeHappened();  
     _hasWritten = true;  
     _intentsAndDurOps._durOps.push_back(p); // append the operation to the list of durable operations  
 }  

The flow above only records operations. Now let's look at the commit side.

[cpp]view plaincopy 
   
 bool NOINLINE_DECL DurableImpl::_aCommitIsNeeded() {  
     if( !Lock::isLocked() ) { // committing requires at least the global read lock  
         Lock::GlobalRead r;  
         if( commitJob.bytes() < UncommittedBytesLimit ) {  
             // someone else beat us to it  
             return false;  
         }  
         commitNow();  
     }  
     else {   
         // 'W'  
         commitNow();  
     }  
     return true;  
 }  

[cpp]view plaincopy 
   
 bool DurableImpl::commitNow() {  
     stats.curr->_earlyCommits++; // statistics  
     groupCommit(0);  
     return true;  
 }  

groupCommit simply calls _groupCommit to perform the commit and handles any exception that may be thrown.

[cpp]view plaincopy 
   
 static void _groupCommit(Lock::GlobalWrite *lgw) {  
     // We are 'R' or 'W'  
     assertLockedForCommitting();  
     // push any write intents still buffered in the thread-local spool onto the global queue  
     unspoolWriteIntents(); // in case we were doing some writing ourself  
     {  
         AlignedBuilder &ab = __theBuilder;  
         // we need to make sure two group commits aren't running at the same time  
         // (and we are only read locked in the dbMutex, so it could happen)  
         SimpleMutex::scoped_lock lk(commitJob.groupCommitMutex);  
         commitJob.commitingBegin();  
         if( !commitJob.hasWritten() ) { // nothing to commit  
             // getlasterror request could have came after the data was already committed  
             commitJob.committingNotifyCommitted();  
         }  
         else {  
             JSectHeader h;  
             PREPLOGBUFFER(h,ab); // serialize all pending operations into the buffer ab  
             // todo : write to the journal outside locks, as this write can be slow.  
             //        however, be careful then about remapprivateview as that cannot be done   
             //        if new writes are then pending in the private maps.  
             WRITETOJOURNAL(h, ab); // at this point the entries have been written to the journal file  
             // data is now in the journal, which is sufficient for acknowledging getLastError.  
             // (ok to crash after that)  
             commitJob.committingNotifyCommitted();  
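             WRITETODATAFILES(h, ab); // apply the changes to the data files (through the _view_write mapping)  
             debugValidateAllMapsMatch(); // debug check that the private (dur) mapping matches the mapping of the file itself  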
             commitJob.committingReset();  
             ab.reset();  
         }  
     }  
     // REMAPPRIVATEVIEW  
     //  
     // remapping private views must occur after WRITETODATAFILES otherwise  
     // we wouldn't see newly written data on reads.  
     //  
     if( !Lock::isW() ) { // the _view_private remap described earlier: release the memory, then map again  
         // REMAPPRIVATEVIEW needs done in a write lock (as there is a short window during remapping when each view   
         // might not exist) thus we do it later.  
         //   
         // if commitIfNeeded() operations are not in a W lock, you could get too big of a private map   
         // on a giant operation.  for now they will all be W.  
         //   
         // If desired, perhaps this can be eliminated on posix as it may be that the remap is race-free there.  
         //  
         // For durthread, lgw is set, and we can upgrade to a W lock for the remap. we do this way as we don't want   
         // to be in W the entire time we were committing about (in particular for WRITETOJOURNAL() which takes time).  
         if( lgw ) {   
             LOG(4) << "_groupCommit upgrade" << endl;  
             lgw->upgrade();  
             REMAPPRIVATEVIEW();  
         }  
     }  
     else {  
         stats.curr->_commitsInWriteLock++;  
         // however, if we are already write locked, we must do it now -- up the call tree someone  
         // may do a write without a new lock acquisition.  this can happen when MongoMMF::close() calls  
         // this method when a file (and its views) is about to go away.  
         //  
         REMAPPRIVATEVIEW();  
     }  
 }  

_groupCommit->PREPLOGBUFFER

[cpp]view plaincopy 
   
 void PREPLOGBUFFER(/*out*/ JSectHeader& h, AlignedBuilder& ab) {  
     assertLockedForCommitting();  
     Timer t;  
     // if no journal file is open, open one and write the JHeader file header to it  
     j.assureLogFileOpen(); // so fileId is set  
     _PREPLOGBUFFER(h, ab); // serialize the actual operations and data  
     stats.curr->_prepLogBufferMicros += t.micros();  
 }  

_groupCommit->PREPLOGBUFFER->_PREPLOGBUFFER

[cpp]view plaincopy 
   
 static void _PREPLOGBUFFER(JSectHeader& h, AlignedBuilder& bb) {  
     resetLogBuffer(h, bb); // adds JSectHeader (fills in the section header)  
     // ops other than basic writes (DurOp's)  
     { // record operations such as file creation and deletion  
         for( vector< shared_ptr<DurOp> >::iterator i = commitJob.ops().begin(); i != commitJob.ops().end(); ++i ) {  
             (*i)->serialize(bb);  
         }  
     }  
     prepBasicWrites(bb);  
     return;  
 }  

_groupCommit->PREPLOGBUFFER->_PREPLOGBUFFER->prepBasicWrites

[cpp]view plaincopy 
   
 static void prepBasicWrites(AlignedBuilder& bb) {  
     scoped_lock lk(privateViews._mutex());  
     // each time events switch to a different database we journal a JDbContext  
     // switches will be rare as we sort by memory location first and we batch commit.  
     RelativePath lastDbPath;  
     const vector<WriteIntent>& _intents = commitJob.getIntentsSorted();  
     WriteIntent last;  
     for( vector<WriteIntent>::const_iterator i = _intents.begin(); i != _intents.end(); i++ ) {   
         if( i->start() < last.end() ) { // the two address ranges overlap, so merge them  
             // overlaps  
             last.absorb(*i); // merge into last  
         }  
         else {   
             // discontinuous  
             if( i != _intents.begin() )  
                 prepBasicWrite_inlock(bb, &last, lastDbPath);  
             last = *i;  
         }  
     }  
     prepBasicWrite_inlock(bb, &last, lastDbPath);  
 }  
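The coalescing loop in prepBasicWrites is the classic "merge overlapping intervals over a sorted list" pattern. Here is a self-contained sketch of it; the Range type and mergeRanges function are invented for this illustration, and the sketch sorts by start address, which is an assumption about what getIntentsSorted provides. As in the excerpt, the comparison is strict, so ranges that merely touch are not merged.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Range {            // half-open byte range [start, end)
    std::size_t start;
    std::size_t end;
};

// Sort the ranges and fold every range that overlaps its predecessor into it.
std::vector<Range> mergeRanges(std::vector<Range> v) {
    std::sort(v.begin(), v.end(),
              [](const Range& a, const Range& b) { return a.start < b.start; });
    std::vector<Range> out;
    for (const Range& r : v) {
        assert(r.start <= r.end);
        if (!out.empty() && r.start < out.back().end) {
            // overlaps the previous range: absorb it
            out.back().end = std::max(out.back().end, r.end);
        } else {
            // discontinuous: start a new output range
            out.push_back(r);
        }
    }
    return out;
}
```

Merging before serializing means each modified byte is journaled once per commit, even if many write intents covered it.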

_groupCommit->PREPLOGBUFFER->_PREPLOGBUFFER->prepBasicWrites->prepBasicWrite_inlock

[cpp]view plaincopy 
   
 /** put the basic write operation into the buffer (bb) to be journaled */  
 static void prepBasicWrite_inlock(AlignedBuilder&bb, const WriteIntent *i, RelativePath& lastDbPath) {  
     size_t ofs = 1;  
     MongoMMF *mmf = findMMF_inlock(i->start(), /*out*/ofs); // ofs is the offset within the file at which the write starts  
     if( unlikely(!mmf->willNeedRemap()) ) {  
         // tag this mmf as needed a remap of its private view later.  
         // usually it will already be dirty/already set, so we do the if above first  
         // to avoid possibility of cpu cache line contention  
         mmf->willNeedRemap() = true;  
     }  
     // since we have already looked up the mmf, we go ahead and remember the write view location  
     // so we don't have to find the MongoMMF again later in WRITETODATAFILES()  
     //   
     // this was for WRITETODATAFILES_Impl2 so commented out now  
     //  
     /* 
     i->w_ptr = ((char*)mmf->view_write()) + ofs; 
     */  
     JEntry e; // each modified range of data becomes one JEntry  
     e.len = min(i->length(), (unsigned)(mmf->length() - ofs)); //dont write past end of file  
     verify( ofs <= 0x80000000 );  
     e.ofs = (unsigned) ofs;  
     e.setFileNo( mmf->fileSuffixNo() );  
     if( mmf->relativePath() == local ) {  
         e.setLocalDbContextBit();  
     }  
     else if( mmf->relativePath() != lastDbPath ) {  
         lastDbPath = mmf->relativePath();  
         JDbContext c;  
         bb.appendStruct(c);  
         bb.appendStr(lastDbPath.toString());  
     }  
     bb.appendStruct(e);  
     bb.appendBuf(i->start(), e.len); // append the actual bytes that were written  
     if (unlikely(e.len != (unsigned)i->length())) {  
         log() << "journal info splitting prepBasicWrite at boundary" << endl;  
         // This only happens if we write to the last byte in a file and  
         // the first byte in another file that is mapped adjacently. I  
         // think most OSs leave at least a one page gap between  
         // mappings, but better to be safe.  
         WriteIntent next ((char*)i->start() + e.len, i->length() - e.len);  
         prepBasicWrite_inlock(bb, &next, lastDbPath);  
     }  
 }  

We have now covered how the journal buffer is built. Next comes WRITETOJOURNAL, which writes that buffer to the journal file.

[cpp]view plaincopy 
   
 void WRITETOJOURNAL(JSectHeader h, AlignedBuilder& uncompressed) {  
     Timer t;  
     j.journal(h, uncompressed);  
     stats.curr->_writeToJournalMicros += t.micros();  
 }  

[cpp]view plaincopy 
   
 void Journal::journal(const JSectHeader& h, const AlignedBuilder& uncompressed) {  
     static AlignedBuilder b(32*1024*1024);  
     /* buffer to journal will be 
        JSectHeader 
        compressed operations 
        JSectFooter 
     */  
     const unsigned headTailSize = sizeof(JSectHeader) + sizeof(JSectFooter);  
     const unsigned max = maxCompressedLength(uncompressed.len()) + headTailSize;  
     b.reset(max);  
     {  
         dassert( h.sectionLen() == (unsigned) 0xffffffff ); // we will backfill later  
         b.appendStruct(h); // append the JSectHeader  
     }  
     size_t compressedLength = 0; // compress the journal data; snappy compresses very quickly, although its ratio is lower than zip-style compressors  
     rawCompress(uncompressed.buf(), uncompressed.len(), b.cur(), &compressedLength);  
     b.skip(compressedLength);  
     // footer  
     unsigned L = 0xffffffff;  
     {  
         // pad to alignment, and set the total section length in the JSectHeader  
         unsigned lenUnpadded = b.len() + sizeof(JSectFooter);  
         L = (lenUnpadded + Alignment-1) & (~(Alignment-1));  
         ((JSectHeader*)b.atOfs(0))->setSectionLen(lenUnpadded);  
         JSectFooter f(b.buf(), b.len()); // computes checksum  
         b.appendStruct(f); // append the JSectFooter  
         b.skip(L - lenUnpadded);  
     }  
      {  
         SimpleMutex::scoped_lock lk(_curLogFileMutex);  
         // must already be open -- so that _curFileId is correct for previous buffer building  
         verify( _curLogFile );  
         stats.curr->_uncompressedBytes += uncompressed.len();  
         unsigned w = b.len();  
         _written += w;  
         verify( w <= L );  
         stats.curr->_journaledBytes += L;  
         _curLogFile->synchronousAppend((const void *) b.buf(), L); // write the section to the journal file  
         _rotate(); // switch to a new journal file if the current one is full  
     }  
 }  
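The padding step in Journal::journal rounds the section length up to a multiple of Alignment using the usual power-of-two mask. A small sketch of that arithmetic is shown below; alignUp, SectionSizes, and sectionSizes are names made up for this example, and the header and footer sizes are passed in as parameters because the real struct sizes are not shown in the excerpt.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of alignment (alignment must be a power of two).
inline std::size_t alignUp(std::size_t n, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

struct SectionSizes {
    std::size_t unpadded;  // header + compressed payload + footer
    std::size_t padded;    // what is actually appended to the journal file
    std::size_t padding;   // filler bytes after the footer
};

SectionSizes sectionSizes(std::size_t headerLen, std::size_t compressedLen,
                          std::size_t footerLen, std::size_t alignment) {
    SectionSizes s;
    s.unpadded = headerLen + compressedLen + footerLen;
    s.padded = alignUp(s.unpadded, alignment);
    s.padding = s.padded - s.unpadded;
    return s;
}
```

With an assumed alignment of 8192 bytes, a 20-byte header, 100 bytes of compressed data, and a 32-byte footer give 152 unpadded bytes, which are padded to 8192. The unpadded length is what gets stored in the section header, so a reader knows where the footer ends and the padding begins.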

WRITETOJOURNAL->_rotate

[cpp]view plaincopy 
   
 void Journal::_rotate() {  
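       j.updateLSNFile(); // record the time of the last DataFileSync flush; after a crash only journal sections newer than this timestamp need to be replayed  
       // below the per-file size limit (default 256 MB on 32-bit, 1 GB on 64-bit, 128 MB with --smallfiles) keep using the current file;  
       // once the limit is reached, a new journal file is opened  
       if( _curLogFile && _written < DataLimitPerJournalFile )  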
           return;  
       if( _curLogFile ) {  
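           _curLogFile->truncate(); // truncate the journal file, then start a new one  
           closeCurrentJournalFile();  
           removeUnneededJournalFiles(); // remove journal files that are no longer needed, possibly recycling them as preallocated files  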
       }  
           Timer t;  
           _open();  
           int ms = t.millis();  
           if( ms >= 200 ) {  
               log() << "DR101 latency warning on journal file open " << ms << "ms" << endl;  
           }  
   }  

That completes the journal write. Back to _groupCommit.

[cpp]view plaincopy 
   
 void WRITETODATAFILES(const JSectHeader& h, AlignedBuilder& uncompressed) {  
     Timer t;  
     WRITETODATAFILES_Impl1(h, uncompressed); // write into the _view_write mapping, which DataFileSync later flushes to disk  
     unsigned long long m = t.micros();  
     stats.curr->_writeToDataFilesMicros += m;  
 }  

[cpp]view plaincopy 
   
 static void WRITETODATAFILES_Impl1(const JSectHeader& h, AlignedBuilder& uncompressed) {  
     LockMongoFilesShared lk; // parse the journal buffer built above and apply its entries to the _view_write mapping  
     RecoveryJob::get().processSection(&h, uncompressed.buf(), uncompressed.len(), 0);  
 }  

[cpp]view plaincopy 
   
 void RecoveryJob::processSection(const JSectHeader *h, const void *p, unsigned len, const JSectFooter *f) {  
     scoped_lock lk(_mx);  
     /** todo: we should really verify the checksum to see that seqNumber is ok? 
               that is expensive maybe there is some sort of checksum of just the header  
               within the header itself  */  
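     // only relevant during startup recovery, so we skip it for now: _lastDataSyncedFromLastRun is the latest  
     // flush timestamp written earlier, and only journal sections later than it need to be replayed  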
     if( _recovering && _lastDataSyncedFromLastRun > h->seqNumber + ExtraKeepTimeMs ) {  
         if( h->seqNumber != _lastSeqMentionedInConsoleLog ) {  
             static int n;  
             if( ++n < 10 ) {  
                 log() << "recover skipping application of section seq:" << h->seqNumber << " < lsn:" << _lastDataSyncedFromLastRun << endl;  
             }  
             else if( n == 10 ) {   
                 log() << "recover skipping application of section more..." << endl;  
             }  
             _lastSeqMentionedInConsoleLog = h->seqNumber;  
         }  
         return;  
     }  
     auto_ptr<JournalSectionIterator> i;  
     if( _recovering ) {  
         i = auto_ptr<JournalSectionIterator>(new JournalSectionIterator(*h, p, len, _recovering));  
     }  
     else {   
         i = auto_ptr<JournalSectionIterator>(new JournalSectionIterator(*h, /*after header*/p, /*w/out header*/len));  
     }  
     // we use a static so that we don't have to reallocate every time through.  occasionally we   
     // go back to a small allocation so that if there were a spiky growth it won't stick forever.  
     static vector<ParsedJournalEntry> entries;  
     entries.clear();  
     // first read all entries to make sure this section is valid  
     ParsedJournalEntry e;  
     while( !i->atEof() ) { // read out all entries; as seen above, one entry corresponds to one modified region  
         i->next(e);  
         entries.push_back(e);  
     }  
     // after the entries check the footer checksum  
     if( _recovering ) { // checksum verification, recovery only  
         verify( ((const char *)h) + sizeof(JSectHeader) == p );  
         if( !f->checkHash(h, len + sizeof(JSectHeader)) ) {   
             msgasserted(13594, "journal checksum doesn't match");  
         }  
     }  
     // got all the entries for one group commit.  apply them:  
     applyEntries(entries);  
 }  

_groupCommit->WRITETODATAFILES->WRITETODATAFILES_Impl1->processSection->applyEntries

[cpp]view plaincopy 
   
 void RecoveryJob::applyEntries(const vector<ParsedJournalEntry> &entries) {  
     bool apply = (cmdLine.durOptions & CmdLine::DurScanOnly) == 0;  
     bool dump = cmdLine.durOptions & CmdLine::DurDumpJournal; // below: apply each entry in turn  
     for( vector<ParsedJournalEntry>::const_iterator i = entries.begin(); i != entries.end(); ++i ) {  
         applyEntry(*i, apply, dump);  
     }  
 }  

_groupCommit->WRITETODATAFILES->WRITETODATAFILES_Impl1->processSection->applyEntries->applyEntry

[cpp]view plaincopy 
   
 void RecoveryJob::applyEntry(const ParsedJournalEntry& entry, bool apply, bool dump) {  
     if( entry.e ) {  
         if( apply ) { // a data modification  
             write(entry);  
         }  
     }  
     else if(entry.op) {  
         // a DurOp subclass operation  
         if( apply ) { // a file creation or deletion  
             if( entry.op->needFilesClosed() ) {  
                 _close(); // locked in processSection  
             }  
             entry.op->replay(); // re-execute the recorded operation; not analyzed further here  
         }  
     }  
 }  

_groupCommit->WRITETODATAFILES->WRITETODATAFILES_Impl1->processSection->applyEntries->applyEntry->write

[cpp]view plaincopy 
   
 void RecoveryJob::write(const ParsedJournalEntry& entry) {  
     const string fn = fileName(entry.dbName, entry.e->getFileNo());  
     MongoFile* file;  
     {  
         MongoFileFinder finder; // must release lock before creating new MongoMMF  
         file = finder.findByPath(fn); // look up the MongoMMF for the file named in the entry  
     }  
     MongoMMF* mmf;  
     if (file) {  
         verify(file->isMongoMMF());  
         mmf = (MongoMMF*)file;  
     }  
     else {  
         boost::shared_ptr<MongoMMF> sp (new MongoMMF);  
         verify(sp->open(fn, false));  
         _mmfs.push_back(sp);  
         mmf = sp.get();  
     }  
     if ((entry.e->ofs + entry.e->len) <= mmf->length()) {  
         verify(mmf->view_write());  
         verify(entry.e->srcData()); // this is where the _view_write mapping is actually written; DataFileSync later flushes the dirty data in this mapping to disk  
         void* dest = (char*)mmf->view_write() + entry.e->ofs;  
         memcpy(dest, entry.e->srcData(), entry.e->len);  
         stats.curr->_writeToDataFilesBytes += entry.e->len;  
     }  
     else {  
         massert(13622, "Trying to write past end of file in WRITETODATAFILES", _recovering);  
     }  
 }  
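The core of RecoveryJob::write is a bounds-checked memcpy into the write view. The sketch below shows that step with plain standard-library types: Entry and applyEntry are simplified stand-ins invented here, a std::vector plays the role of the mapped file, and a false return stands in for the massert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for a parsed journal entry: target offset plus the bytes to write.
struct Entry {
    std::size_t ofs;
    std::vector<char> data;
};

// Copy one entry into the view. Writes that would run past the end are refused
// and leave the view untouched.
bool applyEntry(std::vector<char>& view, const Entry& e) {
    if (e.ofs > view.size() || e.data.size() > view.size() - e.ofs)
        return false;
    if (!e.data.empty()) {
        assert(e.ofs + e.data.size() <= view.size());
        std::memcpy(view.data() + e.ofs, e.data.data(), e.data.size());
    }
    return true;
}
```

The check is written as a subtraction rather than ofs + len so that a huge offset cannot overflow and slip past the comparison.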

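This completes the write into the real mapping; at the end of the commit, _view_private may be remapped. Now let's look at the structure of the journal file. In it, a JEntry may be replaced by a file creation or deletion operation, and the sequence from JSectHeader to JSectFooter that follows the JHeader repeats over and over, which the diagram does not show.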

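With the analysis above we now understand in detail how the journal is created and how data is committed. Next, let's look at how the journal is checked and replayed at startup. Recovery is performed in the startup function in dur.cpp.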

[cpp]view plaincopy 
   
 void startup() {  
     if( !cmdLine.dur )  
         return;  
     DurableInterface::enableDurability();  
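     journalMakeDir(); // create the journal directory  
     recover(); // calls _recover() under the global lock to replay the journal  
     preallocateFiles(); // preallocate journal files if allowed (on by default; disable with --nopreallocj)  
     // start the durability thread, which commits every journalCommitInterval ms; 0 selects the default:  
     // 100 ms when the journal is on the same partition as the data files, 30 ms otherwise  
     boost::thread t(durThread);  
 }  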

[cpp]view plaincopy 
   
 void _recover() {  
     boost::filesystem::path p = getJournalDir();  
     vector<boost::filesystem::path> journalFiles;  
     getFiles(p, journalFiles); // collect all journal files  
     RecoveryJob::get().go(journalFiles); // do the actual recovery  
 }  

[cpp]view plaincopy 
   
 void RecoveryJob::go(vector<boost::filesystem::path>& files) {  
     _recovering = true;  
     // load the last sequence number synced to the datafiles on disk before the last crash  
     _lastDataSyncedFromLastRun = journalReadLSN(); // read the timestamp from which recovery has to start  
     for( unsigned i = 0; i != files.size(); ++i ) {  
         bool abruptEnd = processFile(files[i]); // process each journal file  
     }  
     close();  
     removeJournalFiles(); // remove the journal files, possibly recycling them as preallocated journal files  
     okToCleanUp = true;  
     _recovering = false;  
 }  

[cpp]view plaincopy 
   
 bool RecoveryJob::processFile(boost::filesystem::path journalfile) {  
     MemoryMappedFile f; // first map the journal file into memory  
     void *p = f.mapWithOptions(journalfile.string().c_str(), MongoFile::READONLY | MongoFile::SEQUENTIAL);  
     return processFileBuffer(p, (unsigned) f.length());  
 }  

[cpp]view plaincopy 
   
 bool RecoveryJob::processFileBuffer(const void *p, unsigned len) {  
     try {  
         unsigned long long fileId;  
         BufReader br(p,len);  
         {  
             // read file header  
             JHeader h;  
             br.read(h); // journal file header; yields the file's unique fileId  
             /* [dm] not automatically handled.  we should eventually handle this automatically.  i think: 
                (1) if this is the final journal file 
                (2) and the file size is just the file header in length (or less) -- this is a bit tricky to determine if prealloced 
                then can just assume recovery ended cleanly and not error out (still should log). 
             */  
             fileId = h.fileId;  
         }  
         // read sections  
         while ( !br.atEof() ) { // as noted above, each commit produces one section starting with a JSectHeader; process the sections one by one  
             JSectHeader h;  
             br.peek(h);  
             if( h.fileId != fileId ) { // a fileId mismatch indicates a problem with the journal file  
                 return true;  
             }  
             unsigned slen = h.sectionLen();  
             unsigned dataLen = slen - sizeof(JSectHeader) - sizeof(JSectFooter);  
             const char *hdr = (const char *) br.skip(h.sectionLenWithPadding());  
             const char *data = hdr + sizeof(JSectHeader);  
             const char *footer = data + dataLen; // processSection was analyzed above, so it is not repeated here  
             processSection((const JSectHeader*) hdr, data, dataLen, (const JSectFooter*) footer);  
             // ctrl c check  
             killCurrentOp.checkForInterrupt(false);  
         }  
     }  
     // (the catch clause that handles an abrupt end of the file is omitted from this excerpt)  
     return false; // non-abrupt end  
 }  

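That concludes the analysis of MongoDB's durability code. Note what the code shows: in a sudden failure MongoDB can still lose some data. Also, in the dual-mapping scheme, the durability-only mapping _view_private is remapped frequently so that it does not grow too large, which releases physical memory.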

Original link: MongoDB Source Code Analysis (13): Durability

Author: yhjj0108 (杨浩)

MongoDB Source Code Analysis (14): Replication in Master/Slave Mode

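MongoDB provides data replication in two forms: the old master/slave mode and the newer replica set (replset) mode. This article analyzes the old master/slave mechanism; replset is covered in the next article. In master/slave mode there is one master server and any number of slave servers. The slaves read the operation records from the master and replay them locally, so that they end up consistent with the master. The master and slave roles are fixed at startup and cannot be switched dynamically; the mechanism serves as a data backup. By default a slave accepts neither reads nor writes. To allow reads, call rs.slaveOk(); every query sent to the slave then carries the QueryOption_SlaveOk flag, indicating that reading from a slave is acceptable.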

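The flow in master/slave mode is as follows. The master records every operation in local.oplog.$main. This collection is capped, so its size is fixed; the size can be set with --oplogSize, in megabytes. By default it is 50 MB on 32-bit systems; on 64-bit systems it is 5% of the free space on the disk holding the database, but never less than 990 MB.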
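The default sizing rule can be written down as a small function. This is a sketch under a stated assumption: it reads the 64-bit rule as "5% of the free disk space, with a floor of 990 MB", and the function name defaultOplogSizeMB is invented for this example. An explicit --oplogSize would override the result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// All sizes are in megabytes.
// Assumed rule: 50 MB on 32-bit; on 64-bit, 5% of free disk space but at least 990 MB.
std::uint64_t defaultOplogSizeMB(bool is64Bit, std::uint64_t freeDiskMB) {
    if (!is64Bit) return 50;
    const std::uint64_t fivePercent = freeDiskMB / 20;
    const std::uint64_t result = std::max<std::uint64_t>(990, fivePercent);
    assert(result >= 990);
    return result;
}
```

Under this reading, a 64-bit server with 10 GB of free disk gets the 990 MB floor, while one with 100 GB free gets a 5000 MB oplog.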

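A slave first copies the database data from the master in full; after that it only reads operation records from the master's local.oplog.$main collection and replays them. If the oldest timestamp still present in local.oplog.$main is later than the point the slave has synced to, the records the slave needs were overwritten by newer ones before the slave could read and replay them, and the slave has no choice but to copy the entire database from the master again. The collections touched in this article are: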

local.sources: records the address of the master that a slave syncs from.

local.oplog.$main: the master's operation log (its binlog).

Now for the code. A master is started with --master, and the entry point is startReplication in repl.cpp. Unrelated code has been removed.

[cpp] view plain copy

void startReplication() {
    oldRepl(); // set the function pointer used to write the oplog
    {
        Lock::GlobalWrite lk;
        replLocalAuth(); // give the _repl account of the local database write permission
    }
    if ( replSettings.slave ) { // the slave thread reads local.oplog.$main and replays it
        boost::thread repl_thread(replSlaveThread);
    }
    if ( replSettings.master ) {
        replSettings.master = true;
        createOplog(); // create the local.oplog.$main collection here if it does not exist yet
        boost::thread t(replMasterThread); // this thread does very little
    }
    while( replSettings.fastsync ) // don't allow writes until we've set up from log
        sleepmillis( 50 );
}

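Now let's look at how the master records operations. There are several kinds of oplog entries:

i: insert.

u: update.

c: database command.

d: delete.

n: no-op, merely a heartbeat that tells the slaves the master is still running.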
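On the consuming side, these one-letter codes have to be decoded before an entry can be replayed. A minimal sketch of such a decoder follows; the OpKind enum and the parseOp and changesData functions are hypothetical helpers written for this illustration, and they cover only the five codes listed here.

```cpp
#include <cassert>

enum class OpKind { Insert, Update, Command, Delete, Noop, Unknown };

// Decode the "op" field of an oplog entry (a one-character C string).
OpKind parseOp(const char* op) {
    assert(op != nullptr);
    if (op[0] == '\0' || op[1] != '\0') return OpKind::Unknown;  // must be exactly one char
    switch (op[0]) {
        case 'i': return OpKind::Insert;
        case 'u': return OpKind::Update;
        case 'c': return OpKind::Command;
        case 'd': return OpKind::Delete;
        case 'n': return OpKind::Noop;
        default:  return OpKind::Unknown;
    }
}

// A no-op is only a heartbeat, so there is nothing to replay for it.
bool changesData(OpKind k) {
    return k != OpKind::Noop && k != OpKind::Unknown;
}
```

This distinction matters later: when a slave looks for its last sync point it skips n entries, because they do not represent real changes.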

Next, the logOp function:

[cpp] view plain copy

void logOp(const char *opstr, const char *ns, const BSONObj& obj, BSONObj *patt, bool *b, bool fromMigrate) {
    if ( replSettings.master ) { // on the master, log to local.oplog.$main
        _logOp(opstr, ns, 0, obj, patt, b, fromMigrate);
    }

    logOpForSharding( opstr , ns , obj , patt );
}

_logOp is a function pointer that was set to _logOpOld during initialization.

[cpp] view plain copy

static void _logOpOld(const char *opstr, const char *ns, const char *logNS, const BSONObj& obj, BSONObj *o2, bool *bb, bool fromMigrate ) {
    Lock::DBWrite lk("local");
    static BufBuilder bufbuilder(8*1024); // todo there is likely a mutex on this constructor
    mutex::scoped_lock lk2(OpTime::m);
    const OpTime ts = OpTime::now(lk2);
    Client::Context context("",0,false);
    /* we jump through a bunch of hoops here to avoid copying the obj buffer twice --
       instead we do a single copy to the destination position in the memory mapped file.
    */
    bufbuilder.reset();
    BSONObjBuilder b(bufbuilder);
    b.appendTimestamp("ts", ts.asDate()); // record the log time, used for syncing
    b.append("op", opstr);
    b.append("ns", ns);
    if (fromMigrate)
        b.appendBool("fromMigrate", true);
    if ( bb )
        b.appendBool("b", *bb);
    if ( o2 ) // present only for update operations: the query object
        b.append("o2", *o2);
    BSONObj partial = b.done(); // partial is everything except the o:... part.
    int po_sz = partial.objsize();
    int len = po_sz + obj.objsize() + 1 + 2 /*o:*/;
    Record *r; // the space for the record is allocated below
    if( logNS == 0 ) {
        logNS = "local.oplog.$main";
        if ( localOplogMainDetails == 0 ) {
            Client::Context ctx( logNS , dbpath, false);
            localDB = ctx.db();
            verify( localDB );
            localOplogMainDetails = nsdetails(logNS);
            verify( localOplogMainDetails );
        }
        Client::Context ctx( logNS , localDB, false );
        r = theDataFileMgr.fast_oplog_insert(localOplogMainDetails, logNS, len);
    }
    else {
        Client::Context ctx( logNS, dbpath, false );
        verify( nsdetails( logNS ) );
        // first we allocate the space, then we fill it below.
        r = theDataFileMgr.fast_oplog_insert( nsdetails( logNS ), logNS, len);
    }
    append_O_Obj(r->data(), partial, obj); // the actual data insertion
    context.getClient()->setLastOp( ts ); // update the time of the last logged op
}

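Now let's look at the slave's sync work, which boils down to: load the sync sources, read the oplog, replay. The slave's sync entry point is replSlaveThread, which calls replMain internally to do the work. We start the analysis directly at replMain.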

[cpp] view plain copy

void replMain() {
    ReplSource::SourceVector sources;
    while ( 1 ) {
        int s = 0;
        {
            Lock::GlobalWrite lk;
            if ( replAllDead ) { // sync hit an error; a resync drops the databases and syncs again from scratch
                // throttledForceResyncDead can throw
                if ( !replSettings.autoresync || !ReplSource::throttledForceResyncDead( "auto" ) ) {
                    break;
                }
            }
            verify( syncing == 0 ); // i.e., there is only one sync thread running. we will want to change/fix this.
            syncing++;
        }
        try {
            int nApplied = 0;
            s = _replMain(sources, nApplied);
            if( s == 1 ) {
                if( nApplied == 0 ) s = 2;
                else if( nApplied > 100 ) {
                    // sleep very little - just enough that we aren't truly hammering master
                    sleepmillis(75);
                    s = 0;
                }
            }
        }
        catch (...) {
            out() << "caught exception in _replMain" << endl;
            s = 4;
        }
        {
            Lock::GlobalWrite lk;
            verify( syncing == 1 );
            syncing--;
        }
        if( relinquishSyncingSome )  {
            relinquishSyncingSome = 0;
            s = 1; // sleep before going back in to syncing=1
        }
        if ( s ) {
            stringstream ss;
            ss << "repl: sleep " << s << " sec before next pass";
            string msg = ss.str();
            if ( ! cmdLine.quiet )
                log() << msg << endl;
            ReplInfo r(msg.c_str());
            sleepsecs(s);
        }
    }
}
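The pacing logic in replMain can be isolated into a pure function, which makes the three cases easier to see. The sketch below covers only the normal path (no exception, no relinquishSyncingSome); the Pause struct and nextPause function are names invented for this illustration.

```cpp
#include <cassert>

// How long the sync loop should wait before its next pass.
struct Pause {
    int seconds;  // passed to the long sleep at the end of the loop
    int millis;   // short extra nap taken right after a busy pass
};

// s is the sleep advice returned by _replMain, nApplied the number of ops applied.
Pause nextPause(int s, int nApplied) {
    assert(nApplied >= 0);
    Pause p = { s, 0 };
    if (s == 1) {
        if (nApplied == 0) {
            p.seconds = 2;       // nothing to do: back off a little more
        } else if (nApplied > 100) {
            p.seconds = 0;       // busy: go around again almost immediately,
            p.millis = 75;       // just not so fast that the master is hammered
        }
    }
    return p;
}
```

So an idle slave polls every 2 seconds, a moderately busy one every second, and a slave that is far behind loops with only a 75 ms pause. In the excerpt, an exception in _replMain sets s to 4, which this sketch does not model.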

replMain->_replMain

[cpp] view plain copy

int _replMain(ReplSource::SourceVector& sources, int& nApplied) {
    {
        Lock::GlobalWrite lk;
        ReplSource::loadAll(sources); // load the sources that need to be synced
        replSettings.fastsync = false; // only need this param for initial reset
    }
    int sleepAdvice = 1;
    for ( ReplSource::SourceVector::iterator i = sources.begin(); i != sources.end(); i++ ) {
        ReplSource *s = i->get();
        int res = -1;
        try {
            res = s->sync(nApplied); // read operation records from this particular master and apply them
            bool moreToSync = s->haveMoreDbsToSync();
            if( res < 0 ) {
                sleepAdvice = 3;
            }
            else if( moreToSync ) {
                sleepAdvice = 0;
            }
            else if ( s->sleepAdvice() ) {
                sleepAdvice = s->sleepAdvice();
            }
            else
                sleepAdvice = res;
        }
        // (the catch clauses are omitted from this excerpt)
        if ( res < 0 )
            s->oplogReader.resetConnection();
    }
    return sleepAdvice;
}

replMain->_replMain->loadAll

[cpp] view plain copy

void ReplSource::loadAll(SourceVector &v) {
    Client::Context ctx("local.sources");
    SourceVector old = v;
    v.clear();
    if ( !cmdLine.source.empty() ) {
        // --source <host> specified.
        // check that no items are in sources other than that
        // add if missing
        shared_ptr<Cursor> c = findTableScan("local.sources", BSONObj());
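        // n is the number of entries found in local.sources (the counting loop is omitted from this excerpt)
        if ( n == 0 ) { // local.sources has no sync source yet, so add one here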
            // source missing.  add.
            ReplSource s;
            s.hostName = cmdLine.source;
            s.only = cmdLine.only;
            s.save(); // store the source in the local.sources collection
        }
    }
    // the cursor loaded here is a reverse one; it loads the last source that needs to be synced
    shared_ptr<Cursor> c = findTableScan("local.sources", BSONObj());
    while ( c->ok() ) {
        ReplSource tmp(c->current());
        if ( tmp.syncedTo.isNull() ) {
            DBDirectClient c; // take the point synced so far from the local local.oplog.$main
            if ( c.exists( "local.oplog.$main" ) ) { // search backwards for the latest entry whose op is not n and take the sync time from it
                BSONObj op = c.findOne( "local.oplog.$main", QUERY( "op" << NE << "n" ).sort( BSON( "$natural" << -1 ) ) );
                if ( !op.isEmpty() ) {
                    tmp.syncedTo = op[ "ts" ].date();
                }
            }
        }
        addSourceToList(v, tmp, old);// add each sync source to the list
        c->advance();
    }
}

replMain->_replMain->sync

[cpp] view plain copy

int ReplSource::sync(int& nApplied) {
    if ( !oplogReader.connect(hostName) ) {// connect to the master and authenticate
        log(4) << "repl:  can't connect to sync source" << endl;
        return -1;
    }
    return sync_pullOpLog(nApplied);// fetch the oplog and replay it
}

replMain->_replMain->sync->sync_pullOpLog

[cpp] view plain copy

  int ReplSource::sync_pullOpLog(int& nApplied) {
      int okResultCode = 1;
      string ns = string("local.oplog.$") + sourceName();
      bool tailing = true;
      oplogReader.tailCheck();
      bool initial = syncedTo.isNull();
      if ( !oplogReader.haveCursor() || initial ) {// no cursor yet, or first-time sync
          if ( initial ) {
              // Important to grab last oplog timestamp before listing databases.
              syncToTailOfRemoteLog();// read the newest useful entry of the master's local.oplog.$main to fix the timestamp this sync starts from
              BSONObj info;
              bool ok = oplogReader.conn()->runCommand( "admin", BSON( "listDatabases" << 1 ), info );
              BSONObjIterator i( info.getField( "databases" ).embeddedObject() );
              while( i.moreWithEOO() ) {// queue every non-empty database except local;
                  BSONElement e = i.next();// if --only was given, queue only that database
                  string name = e.embeddedObject().getField( "name" ).valuestr();
                  if ( !e.embeddedObject().getBoolField( "empty" ) ) {
                      if ( name != "local" ) {
                          if ( only.empty() || only == name ) {
                              addDbNextPass.insert( name );
                          }
                      }
                  }
              }
              Lock::GlobalWrite lk;
              save();
          }
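// Initialize the cursor. The query selects entries with ts >= syncedTo.
// When a slave starts for the first time and reaches this point for the
// first time, syncedTo is the timestamp of the master's latest oplog entry.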
          BSONObjBuilder q;
          q.appendDate("$gte", syncedTo.asDate());
          BSONObjBuilder query;
          query.append("ts", q.done());
          if ( !only.empty() ) {
              // note we may here skip a LOT of data table scanning, a lot of work for the master.
              // maybe append "\\." here?
              query.appendRegex("ns", string("^") + pcrecpp::RE::QuoteMeta( only ));
          }
          BSONObj queryObj = query.done();
          // e.g. queryObj = { ts: { $gte: syncedTo } }

          oplogReader.tailingQuery(ns.c_str(), queryObj);
          tailing = false;
      }
      else {
          log(2) << "repl: tailing=true\n";
      }
      { // show any deferred database creates from a previous pass
          set<string>::iterator i = addDbNextPass.begin();
          if ( i != addDbNextPass.end() ) {// handle the databases waiting to be added, one per pass
              BSONObjBuilder b;
              b.append("ns", *i + '.');
              b.append("op", "db");
              BSONObj op = b.done();
              sync_pullOpLog_applyOperation(op, false);
          }
      }
      OpTime nextOpTime;
      {
          BSONObj op = oplogReader.next();
          BSONElement ts = op.getField("ts");
          nextOpTime = OpTime( ts.date() );
          if( tailing ) {
              oplogReader.putBack( op ); // op will be processed in the loop below
              nextOpTime = OpTime(); // will reread the op below
          }
      }
      {   // apply operations
          int n = 0;
          time_t saveLast = time(0);
          while ( 1 ) {
              bool moreInitialSyncsPending = !addDbNextPass.empty() && n; // we need "&& n" to assure we actually process at least one op to get a sync point recorded in the first place.
              if ( moreInitialSyncsPending || !oplogReader.more() ) {// databases are still waiting to be added, or the oplog is drained: just record how far we have synced
                  Lock::GlobalWrite lk;
                  if( oplogReader.awaitCapable() && tailing )
                      okResultCode = 0; // don't sleep
                  syncedTo = nextOpTime;
                  save(); // note how far we are synced up to now
                  nApplied = n;
                  break;
              }
              BSONObj op = oplogReader.next();
              unsigned b = replApplyBatchSize;
              bool justOne = b == 1;
              scoped_ptr<Lock::GlobalWrite> lk( justOne ? 0 : new Lock::GlobalWrite() );
              while( 1 ) {
                  BSONElement ts = op.getField("ts");
                  OpTime last = nextOpTime;
                  nextOpTime = OpTime( ts.date() );
                  // the slavedelay for this op has not elapsed yet: stop here and wait for the next sync pass
                  if ( replSettings.slavedelay && ( unsigned( time( 0 ) ) < nextOpTime.getSecs() + replSettings.slavedelay ) ) {
                      oplogReader.putBack( op );
                      _sleepAdviceTime = nextOpTime.getSecs() + replSettings.slavedelay + 1;
                      Lock::GlobalWrite lk;
                      if ( n > 0 ) {
                          syncedTo = last;
                          save();
                      }
                      return okResultCode;
                  }
                  // actually apply the oplog entry
                  sync_pullOpLog_applyOperation(op, !justOne);
                  n++;
                  if( --b == 0 )
                      break;
                  // if to here, we are doing multiple applications in a single write lock acquisition
                  if( !oplogReader.moreInCurrentBatch() ) {
                      // break if no more in batch so we release lock while reading from the master
                      break;
                  }
                  op = oplogReader.next();
                  getDur().commitIfNeeded();
              }
          }
      }
      return okResultCode;
  }
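
The slavedelay test inside the apply loop is easy to miss. The sketch below isolates it under stated assumptions: `mustWaitForSlaveDelay` and `wakeUpTime` are hypothetical helpers, and all times are plain seconds since the epoch instead of `OpTime` values.

```cpp
// Hypothetical stand-in for the slavedelay check in the apply loop.
// now and opSecs are seconds since the epoch; slaveDelay is in seconds,
// with 0 meaning that no delay is configured.
bool mustWaitForSlaveDelay(unsigned now, unsigned opSecs, unsigned slaveDelay) {
    return slaveDelay != 0 && now < opSecs + slaveDelay;
}

// The moment at which the source suggests trying again; this is the value
// the excerpt stores in _sleepAdviceTime.
unsigned wakeUpTime(unsigned opSecs, unsigned slaveDelay) {
    return opSecs + slaveDelay + 1;
}
```

When the check fires, the op is put back into the reader, `syncedTo` is saved as the timestamp of the previous op (only if at least one op was applied), and the function returns early.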

replMain->_replMain->sync->sync_pullOpLog->sync_pullOpLog_applyOperation

[cpp] view plain copy

void ReplSource::sync_pullOpLog_applyOperation(BSONObj& op, bool alreadyLocked) {
    if( op.getStringField("op")[0] == 'n' )
        return;
    char clientName[MaxDatabaseNameLen];
    const char *ns = op.getStringField("ns");
    nsToDatabase(ns, clientName);
    if ( !only.empty() && only != clientName )// the slave was started with --only: sync just that one database
        return;
    // pretouch: load the data that is about to be updated into memory ahead of time
    if( cmdLine.pretouch && !alreadyLocked/*doesn't make sense if in write lock already*/ ) {
        if( cmdLine.pretouch > 1 ) {
            /* note: this is bad - should be put in ReplSource.  but this is first test... */
            static int countdown;
            if( countdown > 0 ) {
                countdown--; // was pretouched on a prev pass
            }
            else {
                const int m = 4;
                if( tp.get() == 0 ) {
                    int nthr = min(8, cmdLine.pretouch);
                    nthr = max(nthr, 1);
                    tp.reset( new ThreadPool(nthr) );
                }
                vector<BSONObj> v;
                oplogReader.peek(v, cmdLine.pretouch);
                unsigned a = 0;
                while( 1 ) {
                    if( a >= v.size() ) break;
                    unsigned b = a + m - 1; // v[a..b]
                    if( b >= v.size() ) b = v.size() - 1;
                    tp->schedule(pretouchN, v, a, b);
                    a += m;
                }
                // we do one too...
                pretouchOperation(op);
                tp->join();
                countdown = v.size();
            }
        }
        else {
            pretouchOperation(op);
        }
    }
    scoped_ptr<Lock::GlobalWrite> lk( alreadyLocked ? 0 : new Lock::GlobalWrite() );
    // if the database being added clashes with the name of a local database, drop the local one
    if ( !handleDuplicateDbName( op, ns, clientName ) ) {
        return;
    }
    Client::Context ctx( ns );
    ctx.getClient()->curop()->reset();
    bool empty = ctx.db()->isEmpty();
    bool incompleteClone = incompleteCloneDbs.count( clientName ) != 0;
    // always apply admin command command
    // this is a bit hacky -- the semantics of replication/commands aren't well specified
    if ( strcmp( clientName, "admin" ) == 0 && op.getStringField( "op" )[0] == 'c' ) {
        applyOperation( op );// a command on admin: execute it directly
        return;
    }
    // the database has only just been created locally (on the slave): clone it from the master
    if ( ctx.justCreated() || empty || incompleteClone ) {
        // we must add to incomplete list now that setClient has been called
        incompleteCloneDbs.insert( clientName );
        if ( nClonedThisPass ) {// one database has already been cloned in this pass; clone the next one in a later pass
            /* we only clone one database per pass, even if a lot need done.  This helps us
             avoid overflowing the master's transaction log by doing too much work before going
             back to read more transactions. (Imagine a scenario of slave startup where we try to
             clone 100 databases in one pass.)
             */
            addDbNextPass.insert( clientName );
        }
        else {
            save();
            Client::Context ctx(ns);
            nClonedThisPass++;
            resync(ctx.db()->name);// copy the whole database, collection by collection; this can be slow
            addDbNextPass.erase(clientName);
            incompleteCloneDbs.erase( clientName );
        }
        save();
    }
    else {
        applyOperation( op );// replay the insert, update, delete, etc. locally; the flow is simple and is not analyzed further
        addDbNextPass.erase( clientName );
    }
}

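That completes the walkthrough of master/slave mode. The two points to keep in mind are the cloning of whole databases and the timestamp that marks how far the slave has synced. Every pass queries the master's oplog from the last synced timestamp up to the newest entry, so the first entry returned must carry exactly that last synced timestamp. If it does not, the master has processed so many operations that local.oplog.$main has already discarded the older entries, and the only remedy is to copy the entire database again.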
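
The check described above can be sketched as follows. This is only an illustration of the idea: `oplogStillCoversSyncPoint` is a hypothetical name, and timestamps are reduced to plain 64-bit integers instead of `OpTime` values.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical check. oplogTs holds the "ts" values returned by the query
// { ts: { $gte: syncedTo } }, in natural (ascending) order.
bool oplogStillCoversSyncPoint(const std::vector<std::uint64_t>& oplogTs,
                               std::uint64_t syncedTo) {
    // If the first entry returned is exactly the last one we applied, nothing
    // was lost in between. Otherwise the master's oplog has rolled over and
    // the slave has to copy the database again.
    return !oplogTs.empty() && oplogTs.front() == syncedTo;
}
```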

Original article: MongoDB source code analysis (14): replication, master/slave mode

Author: yhjj0108, 杨浩

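MongoDB source code analysis (15): initialization of replication in replset mode

Replset mode is far more complex than master/slave mode. Master and slave correspond here to primary and secondary, and the two roles can be swapped: when the primary goes offline, one of the secondaries is elected automatically as the new primary. There are restrictions on this, which this article covers. First, the collections that replset mode uses.

local.oplog.rs: the operation log in replset mode (in master/slave mode it is local.oplog.$main).

local.system.replset: the replset configuration, that is, the information set through rs.initiate, rs.add and so on.

First, a typical replset configuration.

When we write a piece of data, for example: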

[cpp] view plain copy

db.foo.insert({x:1})
db.runCommand({getLastError:1,w:"veryImportant"})

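getLastError returns success only after this write has reached the three places that veryImportant requires, for example ny, sf and cloud; until then it keeps waiting. This guarantees that the data has been written to different servers. Now look at another replset configuration.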

[cpp] view plain copy

{_id:'myset', members:[{_id:0,host:'192.168.136.1:27040'},{_id:1,host:'192.168.136.1:27050',votes:0}]}

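There are only two servers here. If the server on port 27050 is shut down, the server on port 27040 remains primary; it does not drop to secondary and stop serving. With the following configuration, however: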

[cpp] view plain copy

{_id:'myset', members:[{_id:0,host:'192.168.136.1:27040'},{_id:1,host:'192.168.136.1:27050'}]}

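once 27050 shuts down, 27040 steps down from primary to secondary and the whole replset stops working. The difference lies in votes. A replset works normally only while twice the sum of the votes of the reachable members exceeds the sum of the votes of all configured members, that is, 2 * online_votes > all_replset_config_votes. A member whose votes is not set defaults to 1. Conversely, if 27040 goes offline, 27050 cannot become primary either, and the system stops working.

If the initial configuration is the following and 27040 has become primary, then 27050 takes over as primary when 27040 goes offline. But when 27050 goes offline the set becomes unavailable, because the reachable votes add up to 0.

[cpp] view plain copy

{_id:'myset', members:[{_id:0,host:'192.168.136.1:27040',votes:0},{_id:1,host:'192.168.136.1:27050'}]}

The better fix is to add an arbiter. An arbiter only votes and never becomes primary or secondary, but when some servers go offline its vote keeps the total high enough for the system to keep running. 10gen gives the same advice: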

Deploy an arbiter to ensure that a replica set will have a sufficient number of members to elect a primary. While having replica sets with 2 members is not recommended for production environments, if you have just two members, deploy an arbiter. Also, for any replica set with an even number of members, deploy an arbiter.
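
The voting rule can be checked against the three configurations above with a few lines of code. This is a minimal sketch under stated assumptions: `VoteMember` and `hasVotingMajority` are hypothetical names and model only the votes and reachability of each member.

```cpp
#include <vector>

struct VoteMember {   // hypothetical, not the real MemberCfg
    int votes;        // the member's votes setting (default 1)
    bool online;      // whether the member is currently reachable
};

// True while the reachable members hold a strict majority of all votes,
// i.e. 2 * online_votes > all_replset_config_votes.
bool hasVotingMajority(const std::vector<VoteMember>& members) {
    int total = 0;
    int online = 0;
    for (const VoteMember& m : members) {
        total += m.votes;
        if (m.online) online += m.votes;
    }
    return 2 * online > total;
}
```

With votes 1 and 0, losing the zero-vote member leaves 2 * 1 > 1, so the set keeps working. With votes 1 and 1, losing either member leaves 2 * 1 > 2, which is false, so no primary can exist.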

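Now for the overall flow of a replset.

1. During initialization, a node that has no replset configuration yet keeps trying to load one. The config can come from three places: the data saved in the local local.system.replset collection, a config set by calling rs.initiate, or a config delivered through the heartbeat protocol by other members of the replset.

2. Once the config is obtained, the node initializes itself and sets up heartbeat connections to the other servers.

3. The sync thread is started. Every member of the replset starts one, but only a secondary actually syncs data from the primary.

4. The producer thread is started. It requests data from the primary; the sync thread takes the oplog entries from it and replays them locally.

5. At startup, and later in the heartbeat handling, msgCheckNewState is called to change the server's state, from secondary to primary or the reverse.

Now for the code. We start with rs.initiate(cfg), which runs the replSetInitiate command, and go straight to the execution of that command.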

[cpp] view plain copy

virtual bool run(const string& , BSONObj& cmdObj, int, string& errmsg, BSONObjBuilder& result, bool fromRepl) {
    BSONObj configObj;
    if( cmdObj["replSetInitiate"].type() != Object ) {// no config object was passed: build one from the startup command line
        string name;
        vector<HostAndPort> seeds;
        set<HostAndPort> seedSet;
        parseReplsetCmdLine(cmdLine._replSet, name, seeds, seedSet); // may throw...
        bob b;
        b.append("_id", name);
        bob members;
        members.append("0", BSON( "_id" << 0 << "host" << HostAndPort::me().toString() ));
        result.append("me", HostAndPort::me().toString());
        for( unsigned i = 0; i < seeds.size(); i++ )
            members.append(bob::numStr(i+1), BSON( "_id" << i+1 << "host" << seeds[i].toString()));
        b.appendArray("members", members.obj());
        configObj = b.obj();
    }
    else {// use the config passed with the command
        configObj = cmdObj["replSetInitiate"].Obj();
    }
    bool parsed = false;
    ReplSetConfig newConfig(configObj);// build the config structure from the config data
    parsed = true;
    checkMembersUpForConfigChange(newConfig, result, true);// check that the configured servers can be reached
    createOplog();// create the oplog collection local.oplog.rs
    Lock::GlobalWrite lk;
    bo comment = BSON( "msg" << "initiating set");
    newConfig.saveConfigLocally(comment);// save the config to local.system.replset
    result.append("info", "Config now saved locally.  Should come online in about a minute.");
    ReplSet::startupStatus = ReplSet::SOON;
    ReplSet::startupStatusMsg.set("Received replSetInitiate - should come online shortly.");
    return true;
}
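
When no config object is passed, the command builds a default member list: the local host gets `_id` 0 and each seed from the command line gets the next id. A minimal sketch of that numbering, with `defaultMembers` as a hypothetical helper and hosts reduced to plain strings:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (_id, host) pairs in the order in which
// replSetInitiate appends them to the "members" array.
std::vector<std::pair<int, std::string> > defaultMembers(
        const std::string& self, const std::vector<std::string>& seeds) {
    std::vector<std::pair<int, std::string> > members;
    members.push_back(std::make_pair(0, self));  // this host is always _id 0
    for (std::size_t i = 0; i < seeds.size(); ++i) {
        // seed i becomes member i + 1
        members.push_back(std::make_pair(static_cast<int>(i) + 1, seeds[i]));
    }
    return members;
}
```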

run->ReplSetConfig

[cpp] view plain copy

ReplSetConfig::ReplSetConfig(BSONObj cfg, bool force) :
    _ok(false),_majority(-1)
{
    _constructed = false;
    clear();
    from(cfg);// the actual parsing: one MemberCfg per server, plus any settings that are present
    if( force ) {
        version += rand() % 100000 + 10000;
    }
    if( version < 1 )
        version = 1;
    _ok = true;
    _constructed = true;
}

run->checkMembersUpForConfigChange

[cpp] view plain copy

void checkMembersUpForConfigChange(const ReplSetConfig& cfg, BSONObjBuilder& result, bool initial) {
    int failures = 0, allVotes = 0, allowableFailures = 0;
    int me = 0;
    for( vector<ReplSetConfig::MemberCfg>::const_iterator i = cfg.members.begin(); i != cfg.members.end(); i++ ) {
        allVotes += i->votes;// total number of votes
    }
    allowableFailures = allVotes - (allVotes/2 + 1);// number of votes that may be lost
    vector<string> down;
    for( vector<ReplSetConfig::MemberCfg>::const_iterator i = cfg.members.begin(); i != cfg.members.end(); i++ ) {
        // we know we're up
        if (i->h.isSelf()) {
            continue;
        }
        BSONObj res;
        {
            bool ok = false;
             {
                int theirVersion = -1000;// use a heartbeat to check whether the configured server can be reached
                ok = requestHeartbeat(cfg._id, "", i->h.toString(), res, -1, theirVersion, initial/*check if empty*/);
                if( theirVersion >= cfg.version ) {
                    stringstream ss;
                    ss << "replSet member " << i->h.toString() << " has too new a config version (" << theirVersion << ") to reconfigure";
                    uasserted(13259, ss.str());
                }
            }
            if( !ok && !res["rs"].trueValue() ) {// cannot connect
                down.push_back(i->h.toString());
                bool allowFailure = false;
                failures += i->votes;
                if( !initial && failures <= allowableFailures ) {
                    const Member *m = theReplSet->findById( i->_id );
                    // it's okay if the down member isn't part of the config,
                    // we might be adding a new member that isn't up yet
                    allowFailure = true;
                }
                if( !allowFailure ) {// at initiation every configured server must be reachable
                    string msg = string("need all members up to initiate, not ok : ") + i->h.toString();
                    if( !initial )
                        msg = string("need most members up to reconfigure, not ok : ") + i->h.toString();
                    uasserted(13144, msg);
                }
            }
        }
        if( initial ) {
            bool hasData = res["hasData"].Bool();
            uassert(13311, "member " + i->h.toString() + " has data already, cannot initiate set.  All members except initiator must be empty.",
                    !hasData || i->h.isSelf());
        }
    }
    if (down.size() > 0) {
        result.append("down", down);
    }
}
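
The expression `allVotes - (allVotes/2 + 1)` is the number of votes that can be missing while a strict majority remains. A small sketch makes the integer arithmetic concrete; `allowableVoteFailures` is a hypothetical name for that one expression.

```cpp
// Hypothetical wrapper around the expression used in
// checkMembersUpForConfigChange. allVotes/2 + 1 is the smallest strict
// majority (integer division), so the remainder is what may be lost.
int allowableVoteFailures(int allVotes) {
    return allVotes - (allVotes / 2 + 1);
}
```

For 2 votes the result is 0, which is why a two-member set tolerates no failure during a reconfig, while 3 votes tolerate 1 and 5 votes tolerate 2.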

run->saveConfigLocally

[cpp] view plain copy

void ReplSetConfig::saveConfigLocally(bo comment) {
    checkRsConfig();
    {
        Lock::GlobalWrite lk; // TODO: does this really need to be a global lock?
        Client::Context cx( rsConfigNs );
        cx.db()->flushFiles(true);
        //theReplSet->lastOpTimeWritten = ??;
        //rather than above, do a logOp()? probably
        BSONObj o = asBson();// the actual config; putSingletonGod below stores it in local.system.replset
        Helpers::putSingletonGod(rsConfigNs.c_str(), o, false/*logOp=false; local db so would work regardless...*/);
        if( !comment.isEmpty() && (!theReplSet || theReplSet->isPrimary()) )
            logOpInitiate(comment);
        cx.db()->flushFiles(true);
    }
}

That completes the initial configuration. Next comes the initialization that runs when mongod starts. The startup code is startReplication in repl.cpp.

[cpp] view plain copy

void startReplication() {// as in master/slave mode, startup goes through this function; only the flow differs
    /* if we are going to be a replica set, we aren't doing other forms of replication. */
    if( !cmdLine._replSet.empty() ) {// --replSet xxx was given: a non-empty value means replSet mode is being started
        newRepl();
        replSet = true;
        // parse the command line; it may have the form <setname>/<seedhost1>,<seedhost2>, in which case the seed hosts are given at startup
        ReplSetCmdline *replSetCmdline = new ReplSetCmdline(cmdLine._replSet);
        boost::thread t( boost::bind( &startReplSets, replSetCmdline) );// start a thread that initializes the replSet
        return;
    }
}
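
The `<setname>/<seedhost1>,<seedhost2>` form mentioned in the comment can be split as shown below. This is a simplified sketch, not the real `ReplSetCmdline` parser (which also validates the hosts): `ParsedReplSetArg` and `parseReplSetArg` are hypothetical names.

```cpp
#include <string>
#include <vector>

struct ParsedReplSetArg {            // hypothetical result type
    std::string setName;             // text before the '/'
    std::vector<std::string> seeds;  // comma-separated hosts after the '/'
};

ParsedReplSetArg parseReplSetArg(const std::string& arg) {
    ParsedReplSetArg out;
    std::string::size_type slash = arg.find('/');
    out.setName = arg.substr(0, slash);           // whole string if there is no '/'
    if (slash == std::string::npos) return out;   // no seed list given
    std::string rest = arg.substr(slash + 1);
    std::string::size_type start = 0;
    while (start <= rest.size()) {
        std::string::size_type comma = rest.find(',', start);
        if (comma == std::string::npos) comma = rest.size();
        if (comma > start)                        // skip empty entries
            out.seeds.push_back(rest.substr(start, comma - start));
        start = comma + 1;
    }
    return out;
}
```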

[cpp] view plain copy

void startReplSets(ReplSetCmdline *replSetCmdline) {
    Client::initThread("rsStart");
    replLocalAuth();
    (theReplSet = new ReplSet(*replSetCmdline))->go();// the real initialization
    cc().shutdown();// shut down this thread's client
}

[cpp] view plain copy

ReplSet::ReplSet(ReplSetCmdline& replSetCmdline) : ReplSetImpl(replSetCmdline) {}

[cpp] view plain copy

ReplSetImpl::ReplSetImpl(ReplSetCmdline& replSetCmdline) :
    elect(this),
    _forceSyncTarget(0),
    _blockSync(false),
    _hbmsgTime(0),
    _self(0),
    _maintenanceMode(0),
    mgr( new Manager(this) ),
    ghost( new GhostSync(this) ),
    _writerPool(replWriterThreadCount),
    _prefetcherPool(replPrefetcherThreadCount),
    _indexPrefetchConfig(PREFETCH_ALL) {
    _cfg = 0;
    memset(_hbmsg, 0, sizeof(_hbmsg));
    strcpy( _hbmsg , "initial startup" );
    lastH = 0;
    changeState(MemberState::RS_STARTUP);
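    loadConfig();// load the replset config; while no config is found this call keeps looping until a real one turns up
    // Figure out indexPrefetch setting
    std::string& prefetch = cmdLine.rsIndexPrefetch;// set at startup with --replIndexPrefetch: indexes are prefetched before synced operations are applied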
    if (!prefetch.empty()) {
        IndexPrefetchConfig prefetchConfig = PREFETCH_ALL;
        if (prefetch == "none")
            prefetchConfig = PREFETCH_NONE;
        else if (prefetch == "_id_only")
            prefetchConfig = PREFETCH_ID_ONLY;
        else if (prefetch == "all")
            prefetchConfig = PREFETCH_ALL;
        else
            warning() << "unrecognized indexPrefetch setting: " << prefetch << endl;
        setIndexPrefetchConfig(prefetchConfig);
    }
}

继续来看loadConfig的加载配置部分.

[cpp] view plain copy

void ReplSetImpl::loadConfig() {
    startupStatus = LOADINGCONFIG;
    while( 1 ) {
         {
            vector<ReplSetConfig> configs;
            // look up the config in the local local.system.replset; it may have been
            // stored by an earlier run or saved by rs.initiate
            configs.push_back( ReplSetConfig(HostAndPort::me()) );
            for( vector<HostAndPort>::const_iterator i = _seeds->begin(); i != _seeds->end(); i++ )
                configs.push_back( ReplSetConfig(*i) );// ask the seed hosts given at startup for their config
            {
                scoped_lock lck( replSettings.discoveredSeeds_mx );
                // seeds learned from remote heartbeats: a heartbeat told us that a remote
                // server belongs to the same replset, so read the config from it
                if( replSettings.discoveredSeeds.size() > 0 ) {
                    for (set<string>::iterator i = replSettings.discoveredSeeds.begin();
                         i != replSettings.discoveredSeeds.end();
                         i++) {
                            configs.push_back( ReplSetConfig(HostAndPort(*i)) );
                    }
                }
            }
            if (!replSettings.reconfig.isEmpty())// a new config produced locally, e.g. by rs.add
                configs.push_back(ReplSetConfig(replSettings.reconfig, true));
            int nok = 0;
            int nempty = 0;
            for( vector<ReplSetConfig>::iterator i = configs.begin(); i != configs.end(); i++ ) {
                if( i->ok() )// count the usable configs
                    nok++;
                if( i->empty() )
                    nempty++;
            }
            if( nok == 0 ) {// no config is usable
                if( nempty == (int) configs.size() ) {
                    startupStatus = EMPTYCONFIG;
                    static unsigned once;
                    if( ++once == 1 ) {
                        log() << "replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done" << rsLog;
                    }
                }
                sleepsecs(10);
                continue;
            }
            if( !_loadConfigFinish(configs) ) {
                sleepsecs(20);
                continue;
            }
        }
        break;
    }
    startupStatus = STARTED;
}
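
The decision that `loadConfig` makes on every loop iteration can be summarized as a small classification. The sketch below is an illustration only: `ConfigProbe`, `LoadDecision` and `decideLoadStep` are hypothetical names, and the sleeping and the call to `_loadConfigFinish` are left out.

```cpp
#include <cstddef>
#include <vector>

struct ConfigProbe {   // hypothetical: what one config source told us
    bool ok;           // a usable config was read from it
    bool empty;        // the source answered but holds no config
};

enum LoadDecision {
    WAIT_EMPTY_CONFIG,      // every source is empty: rs.initiate() is probably still needed
    WAIT_NO_USABLE_CONFIG,  // nothing usable yet: sleep and retry
    PROCEED                 // at least one usable config: continue with it
};

LoadDecision decideLoadStep(const std::vector<ConfigProbe>& probes) {
    std::size_t nok = 0;
    std::size_t nempty = 0;
    for (std::size_t i = 0; i < probes.size(); ++i) {
        if (probes[i].ok) ++nok;
        if (probes[i].empty) ++nempty;
    }
    if (nok == 0)
        return nempty == probes.size() ? WAIT_EMPTY_CONFIG : WAIT_NO_USABLE_CONFIG;
    return PROCEED;
}
```

In the excerpt both waiting cases sleep for 10 seconds before retrying, and a failed `_loadConfigFinish` sleeps for 20 seconds.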

Next comes _loadConfigFinish. It picks the config with the highest version among the usable ones and initializes from it.

[cpp] view plain copy

bool ReplSetImpl::_loadConfigFinish(vector<ReplSetConfig>& cfgs) {
    int v = -1;
    ReplSetConfig *highest = 0;
    int myVersion = -2000;
    int n = 0;// pick the config with the highest version; every config change (rs.add, rs.remove, ...) increments version by one
    for( vector<ReplSetConfig>::iterator i = cfgs.begin(); i != cfgs.end(); i++ ) {
        ReplSetConfig& cfg = *i;
        if( ++n == 1 ) myVersion = cfg.version;
        if( cfg.ok() && cfg.version > v ) {
            highest = &cfg;
            v = cfg.version;
        }
    }
    if( !initFromConfig(*highest) )// initialize the replset from that config
        return false;
    if( highest->version > myVersion && highest->version >= 0 ) {// it is newer than our own copy
        highest->saveConfigLocally(BSONObj());// so save it locally
    }
    return true;
}
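
The selection loop above reduces to "take the usable config with the largest version". A minimal sketch under stated assumptions: `CandidateConfig` and `pickHighestVersion` are hypothetical names, and an index is returned instead of a pointer.

```cpp
#include <cstddef>
#include <vector>

struct CandidateConfig {   // hypothetical stand-in for ReplSetConfig
    int version;
    bool ok;               // whether the config was read successfully
};

// Returns the index of the usable config with the highest version,
// or -1 when no config is usable. On a tie the earlier entry wins,
// because the comparison is strict, as in the excerpt.
int pickHighestVersion(const std::vector<CandidateConfig>& cfgs) {
    int best = -1;
    int bestVersion = -1;
    for (std::size_t i = 0; i < cfgs.size(); ++i) {
        if (cfgs[i].ok && cfgs[i].version > bestVersion) {
            best = static_cast<int>(i);
            bestVersion = cfgs[i].version;
        }
    }
    return best;
}
```

Since the first entry of the vector is the locally stored config, the strict comparison means the local copy is kept when a remote config has the same version.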

_loadConfigFinish->initFromConfig. Its main job is to create a Member structure for each server in the config and to start the heartbeat protocol for it.

[cpp] view plain copy

bool ReplSetImpl::initFromConfig(ReplSetConfig& c, bool reconf/*=false*/) {
    lock lk(this);
    if( getLastErrorDefault || !c.getLastErrorDefaults.isEmpty() ) {
        // see comment in dbcommands.cpp for getlasterrordefault
        getLastErrorDefault = new BSONObj( c.getLastErrorDefaults );
    }
    list<ReplSetConfig::MemberCfg*> newOnes;
    // additive short-cuts the new config setup. If we are just adding a
    // node/nodes and nothing else is changing, this is additive. If it's
    // not a reconfig, we're not adding anything
    bool additive = reconf;
    {
        unsigned nfound = 0;
        int me = 0;
        for( vector<ReplSetConfig::MemberCfg>::iterator i = c.members.begin(); i != c.members.end(); i++ ) {
            ReplSetConfig::MemberCfg& m = *i;
            if( m.h.isSelf() ) {
                me++;
            }
            if( reconf ) {// this is a reconfiguration
                const Member *old = findById(m._id);
                if( old ) {
                    nfound++;
                    if( old->config() != m ) {// the settings of an existing server changed, e.g. votes or priority
                        additive = false;
                    }
                }
                else {
                    newOnes.push_back(&m);
                }
            }
        }// the config does not contain this host: enter RS_SHUNNED, close all connections, stop the heartbeats and go back to loading the config
        if( me == 0 ) { // we're not in the config -- we must have been removed
            if (state().shunned()) {
                // already took note of our ejection from the set
                // so just sit tight and poll again
                return false;
            }
            _members.orphanAll();
            // kill off rsHealthPoll threads (because they Know Too Much about our past)
            endOldHealthTasks();
            // close sockets to force clients to re-evaluate this member
            MessagingPort::closeAllSockets(0);
            // take note of our ejection
            changeState(MemberState::RS_SHUNNED);
            loadConfig();  // redo config from scratch
            return false;
        }
        // if we found different members that the original config, reload everything
        if( reconf && config().members.size() != nfound )
            additive = false;
    }
    _cfg = new ReplSetConfig(c);
    _name = config()._id;
    // this is a shortcut for simple changes
    if( additive ) {// the path taken by a purely additive reconfig
        for( list<ReplSetConfig::MemberCfg*>::const_iterator i = newOnes.begin(); i != newOnes.end(); i++ ) {
            ReplSetConfig::MemberCfg *m = *i;
            Member *mi = new Member(m->h, m->_id, m, false);
            /** we will indicate that new members are up() initially so that we don't relinquish our
                primary state because we can't (transiently) see a majority.  they should be up as we
                check that new members are up before getting here on reconfig anyway.
                */
            mi->get_hbinfo().health = 0.1;
            _members.push(mi);// a newly added member; its heartbeat is started on the next line
            startHealthTaskFor(mi);
        }
        // if we aren't creating new members, we may have to update the
        // groups for the current ones
        _cfg->updateMembers(_members);// update the members of the replset
        return true;
    }
    // start with no members.  if this is a reconfig, drop the old ones.
    _members.orphanAll();// we get here not only for the initial config but also when some member's settings changed,
    endOldHealthTasks();// so all existing heartbeats are stopped
    int oldPrimaryId = -1;
    {
        const Member *p = box.getPrimary();
        if( p )
            oldPrimaryId = p->id();
    }
    forgetPrimary();// reset the primary to none; it is set again below
    // not setting _self to 0 as other threads use _self w/o locking
    int me = 0;
    string members = "";
    for( vector<ReplSetConfig::MemberCfg>::const_iterator i = config().members.begin(); i != config().members.end(); i++ ) {
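        const ReplSetConfig::MemberCfg& m = *i;
        Member *mi;
        members += ( members == "" ? "" : ", " ) + m.h.toString();
        if( m.h.isSelf() ) {// this member is us; if we were primary before the reconfig we become primary again (at first initialization the primary is not decided here)
            me++;// count ourselves so that the me == 0 check below is meaningful
            mi = new Member(m.h, m._id, &m, true);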
            setSelfTo(mi);
            if( (int)mi->id() == oldPrimaryId )
                box.setSelfPrimary(mi);
        }
        else {
            mi = new Member(m.h, m._id, &m, false);
            _members.push(mi);
            if( (int)mi->id() == oldPrimaryId )
                box.setOtherPrimary(mi);
        }
    }
    if( me == 0 ){
        log() << "replSet warning did not detect own host in full reconfig, members " << members << " config: " << c << rsLog;
    }
    else {// start the heartbeats: each member gets its own thread, which contacts that member every 2 s
        // Do this after we've found ourselves, since _self needs
        // to be set before we can start the heartbeat tasks
        for( Member *mb = _members.head(); mb; mb=mb->next() ) {
            startHealthTaskFor( mb );
        }
    }
    return true;
}

_loadConfigFinish->initFromConfig->startHealthTaskFor

[cpp] view plain copy

void ReplSetImpl::startHealthTaskFor(Member *m) {
    ReplSetHealthPollTask *task = new ReplSetHealthPollTask(m->h(), m->hbinfo());
    healthTasks.insert(task);
    task::repeat(task, 2000);// starts a new thread that runs replSetHeartbeat against the server m every 2000 ms to check whether it is reachable
}

Next, the function in which ReplSetHealthPollTask does its work, ReplSetHealthPollTask::doWork.

[cpp] view plain copy

void doWork() {
    HeartbeatInfo mem = m;
    HeartbeatInfo old = mem;
    try {
        BSONObj info;
        int theirConfigVersion = -10000;// use the heartbeat to check whether the remote server can be reached
        bool ok = _requestHeartbeat(mem, info, theirConfigVersion);
        // weight new ping with old pings
        // on the first ping, just use the ping value
        if (old.ping != 0) {// smooth the measured ping time
            mem.ping = (unsigned int)((old.ping * .8) + (mem.ping * .2));
        }
        if( ok ) {// the remote server is reachable: try to add it to the electable set
            up(info, mem);
        }
        else if (!info["errmsg"].eoo() &&// the heartbeat reports an authentication problem: remove the member
                 info["errmsg"].str() == "need to login") {// from the electable set; it cannot become primary
            authIssue(mem);
        }
        else {// the machine cannot be reached
            down(mem, info.getStringField("errmsg"));
        }
    }
    catch(DBException& e) {
        down(mem, e.what());
    }
    catch(...) {
        down(mem, "replSet unexpected exception in ReplSetHealthPollTask");
    }
    m = mem;// update this member's information, including its state (RS_STARTUP, RS_SECONDARY, ...)
    theReplSet->mgr->send( boost::bind(&ReplSet::msgUpdateHBInfo, theReplSet, mem) );
    static time_t last = 0;
    time_t now = time(0);
    bool changed = mem.changed(old);
    if( changed ) {
        if( old.hbstate != mem.hbstate )
            log() << "replSet member " << h.toString() << " is now in state " << mem.hbstate.toString() << rsLog;
    }
    if( changed || now-last>4 ) {// a state check is due
        last = now;
        theReplSet->mgr->send( boost::bind(&Manager::msgCheckNewState, theReplSet->mgr) );
    }
}
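
The ping value kept per member is a weighted average: 80% of the previous value plus 20% of the newest measurement, except for the very first measurement, which is used as is. A minimal sketch, with `smoothPing` as a hypothetical name for that step:

```cpp
// Hypothetical helper reproducing the weighting used in doWork().
// oldPing == 0 means there is no earlier measurement yet.
unsigned smoothPing(unsigned oldPing, unsigned newPing) {
    if (oldPing == 0) return newPing;  // first ping: take it unchanged
    // The result is truncated towards zero by the conversion to unsigned.
    return static_cast<unsigned>(oldPing * .8 + newPing * .2);
}
```

The weighting keeps a single slow heartbeat from changing the reported ping time much, at the cost of reacting slowly to a lasting change.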

_loadConfigFinish->initFromConfig->startHealthTaskFor->up

[cpp] view plain copy

void up(const BSONObj& info, HeartbeatInfo& mem) {
    HeartbeatInfo::numPings++;
    mem.authIssue = false;
    if( mem.upSince == 0 ) {
        mem.upSince = mem.lastHeartbeat;
    }
    mem.health = 1.0;
    mem.lastHeartbeatMsg = info["hbmsg"].String();
    if( info.hasElement("opTime") )
        mem.opTime = info["opTime"].Date();
    // see if this member is in the electable set
    if( info["e"].eoo() ) {
        // for backwards compatibility
        const Member *member = theReplSet->findById(mem.id());
        if (member && member->config().potentiallyHot()) {// not an arbiter and priority is non-zero (default 1); with priority 0 it can never become primary
            theReplSet->addToElectable(mem.id());
        }
        else {
            theReplSet->rmFromElectable(mem.id());
        }
    }
    // add this server to the electable set if it is within 10
    // seconds of the latest optime we know of
    else if( info["e"].trueValue() &&
             mem.opTime >= theReplSet->lastOpTimeWritten.getSecs() - 10) {
        unsigned lastOp = theReplSet->lastOtherOpTime().getSecs();
        if (lastOp > 0 && mem.opTime >= lastOp - 10) {
            theReplSet->addToElectable(mem.id());
        }
    }
    else {
        theReplSet->rmFromElectable(mem.id());
    }
    be cfg = info["config"];
    if( cfg.ok() ) {// a new config has arrived: update ours
        // received a new config
        boost::function<void()> f =
            boost::bind(&Manager::msgReceivedNewConfig, theReplSet->mgr, cfg.Obj().copy());
        theReplSet->mgr->send(f);
    }
}
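
The "within 10 seconds of the latest optime" rule in `up` can be isolated as follows. This is a simplified sketch: the excerpt compares against both `lastOpTimeWritten` and `lastOtherOpTime()`, while the hypothetical `closeEnoughToElect` below takes a single reference value in seconds.

```cpp
// Hypothetical, simplified version of the freshness check in up().
// memberOpSecs:      seconds part of the member's last applied optime
// latestKnownOpSecs: seconds part of the newest optime known from other
//                    members, 0 when none is known yet
bool closeEnoughToElect(unsigned memberOpSecs, unsigned latestKnownOpSecs) {
    if (latestKnownOpSecs == 0) return false;  // nothing to compare against
    // Written as an addition so that a reference value below 10 cannot
    // wrap around, which the subtraction form could do with unsigned values.
    return memberOpSecs + 10 >= latestKnownOpSecs;
}
```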

[cpp] view plain copy

void down(HeartbeatInfo& mem, string msg) {
    mem.authIssue = false;// the server cannot be reached: mark it RS_DOWN; it cannot be a candidate for primary
    mem.health = 0.0;
    mem.ping = 0;
    if( mem.upSince || mem.downSince == 0 ) {
        mem.upSince = 0;
        mem.downSince = jsTime();
        mem.hbstate = MemberState::RS_DOWN;
        log() << "replSet info " << h.toString() << " is down (or slow to respond): " << msg << rsLog;
    }
    mem.lastHeartbeatMsg = msg;
    theReplSet->rmFromElectable(mem.id());
}

Back in initFromConfig, the function has now finished, and with it the construction of ReplSetImpl. Execution continues in startReplSets:

[cpp] view plain copy

(theReplSet = new ReplSet(*replSetCmdline))->go();

What runs here is ReplSetImpl::_go, so that is the next function to look at.

[cpp] view plain copy

void ReplSetImpl::_go() {
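    loadLastOpTimeWritten();// read the time of the most recent write to local.oplog.rs; at initiation the first write happens in saveConfigLocally
    changeState(MemberState::RS_STARTUP2);
    startThreads();// start the sync thread and the thread that reads the oplog
    newReplUp(); // oplog.cpp: install the new logOp function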
}

[cpp] view plain copy

void ReplSetImpl::startThreads() {
    task::fork(mgr);// start the manager service: mgr->send(f) hands it a function to run; internally it is a worker thread that receives tasks and executes them
    mgr->send( boost::bind(&Manager::msgCheckNewState, theReplSet->mgr) );
    if (myConfig().arbiterOnly) {// this server only acts as an arbiter
        return;
    }
    boost::thread t(startSyncThread);// besides syncing, this thread also moves the server to secondary; during initialization the state at this point is RS_STARTUP2
    replset::BackgroundSync* sync = replset::BackgroundSync::get();
    boost::thread producer(boost::bind(&replset::BackgroundSync::producerThread, sync));// fetches the data that the sync thread applies
    boost::thread notifier(boost::bind(&replset::BackgroundSync::notifierThread, sync));// used for tags; a later article covers replset tags
    task::fork(ghost);
    // member heartbeats are started in ReplSetImpl::initFromConfig
}

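This article stops here; the roles of these threads and the state transitions are left for the next article. Summary:

This article walked through initialization in replication replset mode. Initialization mainly consists of using the config to establish heartbeats between the servers, verifying that they can reach each other, and updating each server's state from what the heartbeats report, in preparation for the next step, electing a primary.

Link to this article: MongoDB source code analysis (15): initialization of replication in replset mode

Author: yhjj0108, 杨浩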
