Postgres2015全国用户大会将于11月20至21日在北京丽亭华苑酒店召开。本次大会嘉宾阵容强大,国内顶级PostgreSQL数据库专家将悉数到场,并特邀欧洲、俄罗斯、日本、美国等国家和地区的数据库方面专家助阵:
- Postgres-XC项目的发起人铃木市一(SUZUKI Koichi)
- Postgres-XL的项目发起人Mason Sharp
- pgpool的作者石井达夫(Tatsuo Ishii)
- PG-Strom的作者海外浩平(Kaigai Kohei)
- Greenplum研发总监姚延栋
- 周正中(德哥), PostgreSQL中国用户会创始人之一
- 汪洋,平安科技数据库技术部经理
- ……
|
接着上一篇,
这篇主要谈一下
CheckPointBuffers(flags)
.
CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c
/** CheckPointBuffers** Flush all dirty blocks in buffer pool to disk at checkpoint time.** Note: temporary relations do not participate in checkpoints, so they don't* need to be flushed.*/voidCheckPointBuffers(int flags){TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags); // buffer checkpoint开始探针CheckpointStats.ckpt_write_t = GetCurrentTimestamp();BufferSync(flags); // 这个是重量级操作, 需要全扫描1次BUFFER, 锁buffer头, 设置标记。 再扫描一次buffer,将前面标记过的脏块flush到磁盘。CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START(); // buffer checkpoint sync开始探针smgrsync(); // sync操作CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE(); // buffer checkpoint 结束探针}
BufferSync是一个比较重的操作。
第一次全扫描buffer区,将脏数据块头部设置为本次checkpoint需要flush的块。
第二次扫描,将前面设置为本次需要checkpoint的块FLUSH到磁盘。
但是需要注意,第一次设置为need checkpoint的块,有一个计数,第二次在刷数据块时,可能提前到达这个计数,所以第二次刷脏块的动作可能不需要扫全缓存区域。
但是,第一次被标记的脏块,也可能在这期间被其他进程如bgwriter写掉了,所以第二次扫描时无法达到计数,则还是需要全扫描整个缓存区。
(为什么不在第一次设置时同时记住脏块的内存位置,第二次直接去FLUSH这些位置的块呢?还需要重复再扫一次)
BufferSync@src/backend/storage/buffer/bufmgr.c
/** BufferSync -- Write out all dirty buffers in the pool.** This is called at checkpoint time to write out all dirty shared buffers.* The checkpoint request flags should be passed in. If CHECKPOINT_IMMEDIATE* is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,* CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even* unlogged buffers, which are otherwise skipped. The remaining flags* currently have no effect here.*/static voidBufferSync(int flags){int buf_id;int num_to_scan;int num_to_write;int num_written;int mask = BM_DIRTY; // 脏块掩码
/* Make sure we can handle the pin inside SyncOneBuffer */ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/** Unless this is a shutdown checkpoint or we have been explicitly told,* we write only permanent, dirty buffers. But at shutdown or end of* recovery, we write all dirty buffers.*/if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |CHECKPOINT_FLUSH_ALL))))mask |= BM_PERMANENT; // 持久对象掩码
/** Loop over all buffers, and mark the ones that need to be written with* BM_CHECKPOINT_NEEDED. Count them as we go (num_to_write), so that we* can estimate how much work needs to be done.** This allows us to write only those pages that were dirty when the* checkpoint began, and not those that get dirtied while it proceeds.* Whenever a page with BM_CHECKPOINT_NEEDED is written out, either by us* later in this function, or by normal backends or the bgwriter cleaning* scan, the flag is cleared. Any buffer dirtied after this point won't* have the flag set.** Note that if we fail to write some buffer, we may leave buffers with* BM_CHECKPOINT_NEEDED still set. This is OK since any such buffer would* certainly need to be written for the next checkpoint attempt, too.*/num_to_write = 0; // BM_CHECKPOINT_NEEDED计数for (buf_id = 0; buf_id < NBuffers; buf_id++) // 将当前数据库中的脏块标记为本次检查点需要flush的状态// 也就是说,flush过程中数据库产生的脏块不用理会。{volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/** Header spinlock is enough to examine BM_DIRTY, see comment in* SyncOneBuffer.*/LockBufHdr(bufHdr); // 锁缓存头
if ((bufHdr->flags & mask) == mask) // 将包含脏块掩码或者并且包含持久化掩码的缓存增加标记BM_CHECKPOINT_NEEDED{bufHdr->flags |= BM_CHECKPOINT_NEEDED;num_to_write++;}
UnlockBufHdr(bufHdr);}
if (num_to_write == 0)return; /* nothing to do */
TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write); // 刷缓存开始,探针
/** Loop over all buffers again, and write the ones (still) marked with* BM_CHECKPOINT_NEEDED. In this loop, we start at the clock sweep point* since we might as well dump soon-to-be-recycled buffers first.** Note that we don't read the buffer alloc count here --- that should be* left untouched till the next BgBufferSync() call.*/buf_id = StrategySyncStart(NULL, NULL);num_to_scan = NBuffers;num_written = 0;while (num_to_scan-- > 0) // 需要sync的buffer块计数递减{volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
/** We don't need to acquire the lock here, because we're only looking* at a single bit. It's possible that someone else writes the buffer* and clears the flag right after we check, but that doesn't matter* since SyncOneBuffer will then do nothing. However, there is a* further race condition: it's conceivable that between the time we* examine the bit here and the time SyncOneBuffer acquires lock,* someone else not only wrote the buffer but replaced it with another* page and dirtied it. In that improbable case, SyncOneBuffer will* write the buffer though we didn't need to. It doesn't seem worth* guarding against this, though.*/if (bufHdr->flags & BM_CHECKPOINT_NEEDED) // 判断掩码,如果包含BM_CHECKPOINT_NEEDED,则刷{if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN) // 调用SyncOneBuffer刷缓存{TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id); // 表示该数据块刷新成功BgWriterStats.m_buf_written_checkpoints++;num_written++;
/** We know there are at most num_to_write buffers with* BM_CHECKPOINT_NEEDED set; so we can stop scanning if* num_written reaches num_to_write.** Note that num_written doesn't include buffers written by* other backends, or by the bgwriter cleaning scan. That* means that the estimate of how much progress we've made is* conservative, and also that this test will often fail to* trigger. But it seems worth making anyway.*/if (num_written >= num_to_write) // 如果提前完成刷新,不需要扫全缓存区,退出break;
/** Sleep to throttle our I/O rate.*/CheckpointWriteDelay(flags, (double) num_written / num_to_write); // 将目前刷缓存完成比例传给CheckpointWriteDelay,如果达到休息点,则会触发一个100毫秒的等待。// 假设一共有1000个需要刷的块(num_to_write),目前已经刷了100个(num_written )。// CheckpointWriteDelay(flags, 0.1); , 假设CheckPointCompletionTarget为默认的0.5// IsCheckpointOnSchedule里, progress *= CheckPointCompletionTarget; = 0.1*0.5 = 0.05// elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments// 如果 progress < elapsed_xlogs 不休息// progress最大就是0.5, 因为num_written / num_to_write最大就是1, 1乘以0.5还是0.5// 因此CheckPointCompletionTarget越大,休息区间越大。}}
if (++buf_id >= NBuffers)buf_id = 0;}
/** Update checkpoint statistics. As noted above, this doesn't include* buffers written by other backends or bgwriter scan.*/CheckpointStats.ckpt_bufs_written += num_written;
TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write); // 标记为BM_CHECKPOINT_NEEDED的脏块已全部flush完}
刷单个BUFFER,返回bitmask,
BUF_WRITTEN表示已写入磁盘。
SyncOneBuffer@src/backend/storage/buffer/bufmgr.c
/** SyncOneBuffer -- process a single buffer during syncing.** If skip_recently_used is true, we don't write currently-pinned buffers, nor* buffers marked recently used, as these are not replacement candidates.** Returns a bitmask containing the following flag bits:* BUF_WRITTEN: we wrote the buffer.* BUF_REUSABLE: buffer is available for replacement, ie, it has* pin count 0 and usage count 0.** (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean* after locking it, but we don't care all that much.)** Note: caller must have done ResourceOwnerEnlargeBuffers.*/static intSyncOneBuffer(int buf_id, bool skip_recently_used){volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];int result = 0;
/** Check whether buffer needs writing.** We can make this check without taking the buffer content lock so long* as we mark pages dirty in access methods *before* logging changes with* XLogInsert(): if someone marks the buffer dirty just after our check we* don't worry because our checkpoint.redo points before log record for* upcoming changes and so we are not required to write such dirty buffer.*/LockBufHdr(bufHdr);
if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)result |= BUF_REUSABLE;else if (skip_recently_used){/* Caller told us not to write recently-used buffers */UnlockBufHdr(bufHdr);return result;}
if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY)){/* It's clean, so nothing to do */UnlockBufHdr(bufHdr);return result;}
/** Pin it, share-lock it, write it. (FlushBuffer will do nothing if the* buffer is clean by the time we've locked it.)*/PinBuffer_Locked(bufHdr);LWLockAcquire(bufHdr->content_lock, LW_SHARED);
FlushBuffer(bufHdr, NULL); // 调用FlushBuffer刷buffer
LWLockRelease(bufHdr->content_lock);UnpinBuffer(bufHdr, true);
return result | BUF_WRITTEN;}
调用FlushBuffer将BUFFER刷到内核,内核负责写如磁盘,在写checkpoint WAL前,必须写到磁盘。
FlushBuffer@src/backend/storage/buffer/bufmgr.c
/** FlushBuffer* Physically write out a shared buffer.** NOTE: this actually just passes the buffer contents to the kernel; the* real write to disk won't happen until the kernel feels like it. This* is okay from our point of view since we can redo the changes from WAL.* However, we will need to force the changes to disk via fsync before* we can checkpoint WAL. 在写checkpoint WAL前,buffer必须写到磁盘。** The caller must hold a pin on the buffer and have share-locked the* buffer contents. (Note: a share-lock does not prevent updates of* hint bits in the buffer, so the page could change while the write* is in progress, but we assume that that will not invalidate the data* written.)** If the caller has an smgr reference for the buffer's relation, pass it* as the second parameter. If not, pass NULL.*/static voidFlushBuffer(volatile BufferDesc *buf, SMgrRelation reln){XLogRecPtr recptr;ErrorContextCallback errcallback;instr_time io_start,io_time;Block bufBlock;char *bufToWrite;
/** Acquire the buffer's io_in_progress lock. If StartBufferIO returns* false, then someone else flushed the buffer before we could, so we need* not do anything.*/if (!StartBufferIO(buf, false))return;
/* Setup error traceback support for ereport() */errcallback.callback = shared_buffer_write_error_callback;errcallback.arg = (void *) buf;errcallback.previous = error_context_stack;error_context_stack = &errcallback;
/* Find smgr relation for buffer */if (reln == NULL)reln = smgropen(buf->tag.rnode, InvalidBackendId);
TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,buf->tag.blockNum,reln->smgr_rnode.node.spcNode,reln->smgr_rnode.node.dbNode,reln->smgr_rnode.node.relNode);
LockBufHdr(buf);
/** Run PageGetLSN while holding header lock, since we don't have the* buffer locked exclusively in all cases.*/recptr = BufferGetLSN(buf); // 这里又一个BUFFER头锁
/* To check if block content changes while flushing. - vadim 01/17/97 */buf->flags &= ~BM_JUST_DIRTIED;UnlockBufHdr(buf);
/** Force XLOG flush up to buffer's LSN. This implements the basic WAL // XLOG 强写到buffer lsn位置,* rule that log updates must hit disk before any of the data-file changes // 确保在此之前数据块改变产生的XLOG都写入磁盘了.* they describe do.** However, this rule does not apply to unlogged relations, which will be* lost after a crash anyway. Most unlogged relation pages do not bear* LSNs since we never emit WAL records for them, and therefore flushing* up through the buffer LSN would be useless, but harmless. However,* GiST indexes use LSNs internally to track page-splits, and therefore* unlogged GiST pages bear "fake" LSNs generated by* GetFakeLSNForUnloggedRel. It is unlikely but possible that the fake* LSN counter could advance past the WAL insertion point; and if it did* happen, attempting to flush WAL through that location would fail, with* disastrous system-wide consequences. To make sure that can't happen,* skip the flush if the buffer isn't permanent.*/if (buf->flags & BM_PERMANENT)XLogFlush(recptr);
/** Now it's safe to write buffer to disk. Note that no one else should* have been able to write it while we were busy with log flushing because* we have the io_in_progress lock.*/bufBlock = BufHdrGetBlock(buf);
/** Update page checksum if desired. Since we have only shared lock on the* buffer, other processes might be updating hint bits in it, so we must* copy the page to private storage if we do checksumming.*/bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
if (track_io_timing)INSTR_TIME_SET_CURRENT(io_start);
/** bufToWrite is either the shared buffer or a copy, as appropriate.*/smgrwrite(reln, // 将BUFFER写入磁盘buf->tag.forkNum,buf->tag.blockNum,bufToWrite,false);
if (track_io_timing){INSTR_TIME_SET_CURRENT(io_time);INSTR_TIME_SUBTRACT(io_time, io_start);pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);}
pgBufferUsage.shared_blks_written++;
/** Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and* end the io_in_progress state.*/TerminateBufferIO(buf, true, 0);
TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum, // 单个buffer块 flush结束buf->tag.blockNum,reln->smgr_rnode.node.spcNode,reln->smgr_rnode.node.dbNode,reln->smgr_rnode.node.relNode);
/* Pop the error context stack */error_context_stack = errcallback.previous;}
Write the supplied buffer out.
smgrwrite@src/backend/storage/smgr/smgr.c
/** smgrwrite() -- Write the supplied buffer out.** This is to be used only for updating already-existing blocks of a* relation (ie, those before the current EOF). To extend a relation,* use smgrextend().** This is not a synchronous write -- the block is not necessarily* on disk at return, only dumped out to the kernel. However,* provisions will be made to fsync the write before the next checkpoint.** skipFsync indicates that the caller will make other provisions to* fsync the relation, so we needn't bother. Temporary relations also* do not require fsync.*/voidsmgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,char *buffer, bool skipFsync){(*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,buffer, skipFsync);}
最后一步是将前面的write sync 到磁盘.
smgrsync@src/backend/storage/smgr/smgr.c
/** smgrsync() -- Sync files to disk during checkpoint.*/voidsmgrsync(void){int i;
for (i = 0; i < NSmgr; i++){if (smgrsw[i].smgr_sync)(*(smgrsw[i].smgr_sync)) ();}}
smgr_sync实际调用的是
mdsync@src/backend/storage/smgr/md.c
/** mdsync() -- Sync previous writes to stable storage.*/voidmdsync(void){static bool mdsync_in_progress = false;
HASH_SEQ_STATUS hstat;PendingOperationEntry *entry;int absorb_counter;
/* Statistics on sync times */int processed = 0;instr_time sync_start,sync_end,sync_diff;uint64 elapsed;uint64 longest = 0;uint64 total_elapsed = 0;/** This is only called during checkpoints, and checkpoints should only* occur in processes that have created a pendingOpsTable.*/if (!pendingOpsTable)elog(ERROR, "cannot sync without a pendingOpsTable");
/** If we are in the checkpointer, the sync had better include all fsync* requests that were queued by backends up to this point. The tightest* race condition that could occur is that a buffer that must be written* and fsync'd for the checkpoint could have been dumped by a backend just* before it was visited by BufferSync(). We know the backend will have* queued an fsync request before clearing the buffer's dirtybit, so we* are safe as long as we do an Absorb after completing BufferSync().*/AbsorbFsyncRequests();
/** To avoid excess fsync'ing (in the worst case, maybe a never-terminating* checkpoint), we want to ignore fsync requests that are entered into the* hashtable after this point --- they should be processed next time,* instead. We use mdsync_cycle_ctr to tell old entries apart from new* ones: new ones will have cycle_ctr equal to the incremented value of* mdsync_cycle_ctr.** In normal circumstances, all entries present in the table at this point* will have cycle_ctr exactly equal to the current (about to be old)* value of mdsync_cycle_ctr. However, if we fail partway through the* fsync'ing loop, then older values of cycle_ctr might remain when we* come back here to try again. Repeated checkpoint failures would* eventually wrap the counter around to the point where an old entry* might appear new, causing us to skip it, possibly allowing a checkpoint* to succeed that should not have. To forestall wraparound, any time the* previous mdsync() failed to complete, run through the table and* forcibly set cycle_ctr = mdsync_cycle_ctr.** Think not to merge this loop with the main loop, as the problem is* exactly that that loop may fail before having visited all the entries.* From a performance point of view it doesn't matter anyway, as this path* will never be taken in a system that's functioning normally.*/if (mdsync_in_progress){/* prior try failed, so update any stale cycle_ctr values */hash_seq_init(&hstat, pendingOpsTable);while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL){entry->cycle_ctr = mdsync_cycle_ctr;}}
/* Advance counter so that new hashtable entries are distinguishable */mdsync_cycle_ctr++;
/* Set flag to detect failure if we don't reach the end of the loop */mdsync_in_progress = true;
/* Now scan the hashtable for fsync requests to process */absorb_counter = FSYNCS_PER_ABSORB;hash_seq_init(&hstat, pendingOpsTable);while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL){ForkNumber forknum;
/** If the entry is new then don't process it this time; it might* contain multiple fsync-request bits, but they are all new. Note* "continue" bypasses the hash-remove call at the bottom of the loop.*/if (entry->cycle_ctr == mdsync_cycle_ctr)continue;
/* Else assert we haven't missed it */Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
/** Scan over the forks and segments represented by the entry.** The bitmap manipulations are slightly tricky, because we can call* AbsorbFsyncRequests() inside the loop and that could result in* bms_add_member() modifying and even re-palloc'ing the bitmapsets.* This is okay because we unlink each bitmapset from the hashtable* entry before scanning it. That means that any incoming fsync* requests will be processed now if they reach the table before we* begin to scan their fork.*/for (forknum = 0; forknum <= MAX_FORKNUM; forknum++){Bitmapset *requests = entry->requests[forknum];int segno;
entry->requests[forknum] = NULL;entry->canceled[forknum] = false;
while ((segno = bms_first_member(requests)) >= 0){int failures;
/** If fsync is off then we don't have to bother opening the* file at all. (We delay checking until this point so that* changing fsync on the fly behaves sensibly.)*/if (!enableFsync)continue;/** If in checkpointer, we want to absorb pending requests* every so often to prevent overflow of the fsync request* queue. It is unspecified whether newly-added entries will* be visited by hash_seq_search, but we don't care since we* don't need to process them anyway.*/if (--absorb_counter <= 0){AbsorbFsyncRequests();absorb_counter = FSYNCS_PER_ABSORB;}
/** The fsync table could contain requests to fsync segments* that have been deleted (unlinked) by the time we get to* them. Rather than just hoping an ENOENT (or EACCES on* Windows) error can be ignored, what we do on error is* absorb pending requests and then retry. Since mdunlink()* queues a "cancel" message before actually unlinking, the* fsync request is guaranteed to be marked canceled after the* absorb if it really was this case. DROP DATABASE likewise* has to tell us to forget fsync requests before it starts* deletions.*/for (failures = 0;; failures++) /* loop exits at "break" */{SMgrRelation reln;MdfdVec *seg;char *path;int save_errno;
/** Find or create an smgr hash entry for this relation.* This may seem a bit unclean -- md calling smgr? But* it's really the best solution. It ensures that the* open file reference isn't permanently leaked if we get* an error here. (You may say "but an unreferenced* SMgrRelation is still a leak!" Not really, because the* only case in which a checkpoint is done by a process* that isn't about to shut down is in the checkpointer,* and it will periodically do smgrcloseall(). This fact* justifies our not closing the reln in the success path* either, which is a good thing since in non-checkpointer* cases we couldn't safely do that.)*/reln = smgropen(entry->rnode, InvalidBackendId);
/* Attempt to open and fsync the target segment */seg = _mdfd_getseg(reln, forknum,(BlockNumber) segno * (BlockNumber) RELSEG_SIZE,false, EXTENSION_RETURN_NULL);
INSTR_TIME_SET_CURRENT(sync_start);
if (seg != NULL &&FileSync(seg->mdfd_vfd) >= 0){/* Success; update statistics about sync timing */INSTR_TIME_SET_CURRENT(sync_end);sync_diff = sync_end;INSTR_TIME_SUBTRACT(sync_diff, sync_start);elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);if (elapsed > longest)longest = elapsed;total_elapsed += elapsed;processed++;if (log_checkpoints)elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",processed,FilePathName(seg->mdfd_vfd),(double) elapsed / 1000);
break; /* out of retry loop */}/* Compute file name for use in message */save_errno = errno;path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);errno = save_errno;
/** It is possible that the relation has been dropped or* truncated since the fsync request was entered.* Therefore, allow ENOENT, but only if we didn't fail* already on this file. This applies both for* _mdfd_getseg() and for FileSync, since fd.c might have* closed the file behind our back.** XXX is there any point in allowing more than one retry?* Don't see one at the moment, but easy to change the* test here if so.*/if (!FILE_POSSIBLY_DELETED(errno) ||failures > 0)ereport(ERROR,(errcode_for_file_access(),errmsg("could not fsync file \"%s\": %m",path)));elseereport(DEBUG1,(errcode_for_file_access(),errmsg("could not fsync file \"%s\" but retrying: %m",path)));pfree(path);
/** Absorb incoming requests and check to see if a cancel* arrived for this relation fork.*/AbsorbFsyncRequests();absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */if (entry->canceled[forknum])break;} /* end retry loop */}bms_free(requests);}
/** We've finished everything that was requested before we started to* scan the entry. If no new requests have been inserted meanwhile,* remove the entry. Otherwise, update its cycle counter, as all the* requests now in it must have arrived during this cycle.*/for (forknum = 0; forknum <= MAX_FORKNUM; forknum++){if (entry->requests[forknum] != NULL)break;}if (forknum <= MAX_FORKNUM)entry->cycle_ctr = mdsync_cycle_ctr;else{/* Okay to remove it */if (hash_search(pendingOpsTable, &entry->rnode,HASH_REMOVE, NULL) == NULL)elog(ERROR, "pendingOpsTable corrupted");}} /* end loop over hashtable entries */
/* Return sync performance metrics for report at checkpoint end */CheckpointStats.ckpt_sync_rels = processed;CheckpointStats.ckpt_longest_sync = longest;CheckpointStats.ckpt_agg_sync_time = total_elapsed;
/* Flag successful completion of mdsync */mdsync_in_progress = false;}
[小结]
checkpointer刷缓存主要分几个步骤,
1. 遍历shared buffer区,将当前SHARED BUFFER中脏块新增FLAG need checkpoint,
2. 遍历shared buffer区,将上一步标记为need checkpoint的块write到磁盘,WRITE前需要确保该buffer lsn前的XLOG已经fsync到磁盘,
3. 将前面的write sync到持久化存储。
具体耗时可以参考期间的探针,或者检查点日志输出。
下一篇讲一下检查点的跟踪。