PostgreSQL: why does the checkpointer impact performance so much? - 3


Continuing from the previous article, this one focuses on CheckPointBuffers(flags).
CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c

/*
 * CheckPointBuffers
 *
 * Flush all dirty blocks in buffer pool to disk at checkpoint time.
 *
 * Note: temporary relations do not participate in checkpoints, so they don't
 * need to be flushed.
 */
void
CheckPointBuffers(int flags)
{
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);  // probe: buffer checkpoint begins
        CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
        BufferSync(flags);  //  the heavyweight part: one full scan of the buffer pool, taking each buffer header lock to set a flag, then a second scan that flushes the flagged dirty blocks to disk
        CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();   // probe: buffer checkpoint sync begins
        smgrsync();  // the sync (fsync) phase
        CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();   // probe: buffer checkpoint done
}

BufferSync is a fairly heavy operation.
The first full scan of the buffer pool marks the header of every dirty buffer as needing to be flushed by this checkpoint.
The second scan flushes the buffers marked in the first pass to disk.
Note that the first pass keeps a count of the buffers it marks as needing a checkpoint; if the second pass reaches that count early, it can stop without scanning the whole buffer pool.
However, buffers marked in the first pass may have been written out in the meantime by other processes such as the bgwriter, in which case the count is never reached and the second pass still has to scan the entire pool.
(One might ask why the first pass doesn't simply record the locations of the dirty buffers so that the second pass could flush exactly those, instead of scanning again.)
BufferSync@src/backend/storage/buffer/bufmgr.c

/*
 * BufferSync -- Write out all dirty buffers in the pool.
 *
 * This is called at checkpoint time to write out all dirty shared buffers.
 * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE
 * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
 * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
 * unlogged buffers, which are otherwise skipped.  The remaining flags
 * currently have no effect here.
 */
static void
BufferSync(int flags)
{
        int                     buf_id;
        int                     num_to_scan;
        int                     num_to_write;
        int                     num_written;
        int                     mask = BM_DIRTY;  // dirty-buffer mask

        /* Make sure we can handle the pin inside SyncOneBuffer */
        ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

        /*
         * Unless this is a shutdown checkpoint or we have been explicitly told,
         * we write only permanent, dirty buffers.  But at shutdown or end of
         * recovery, we write all dirty buffers.
         */
        if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                                        CHECKPOINT_FLUSH_ALL))))
                mask |= BM_PERMANENT;  // also require the permanent (logged) flag

        /*
         * Loop over all buffers, and mark the ones that need to be written with
         * BM_CHECKPOINT_NEEDED.  Count them as we go (num_to_write), so that we
         * can estimate how much work needs to be done.
         *
         * This allows us to write only those pages that were dirty when the
         * checkpoint began, and not those that get dirtied while it proceeds.
         * Whenever a page with BM_CHECKPOINT_NEEDED is written out, either by us
         * later in this function, or by normal backends or the bgwriter cleaning
         * scan, the flag is cleared.  Any buffer dirtied after this point won't
         * have the flag set.
         *
         * Note that if we fail to write some buffer, we may leave buffers with
         * BM_CHECKPOINT_NEEDED still set.  This is OK since any such buffer would
         * certainly need to be written for the next checkpoint attempt, too.
         */
        num_to_write = 0;  // count of buffers flagged BM_CHECKPOINT_NEEDED
        for (buf_id = 0; buf_id < NBuffers; buf_id++)   //  mark the buffers that are dirty right now as needing to be flushed by this checkpoint;
                                                        //  in other words, buffers dirtied while the flush is in progress can be ignored
        {
                volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

                /*
                 * Header spinlock is enough to examine BM_DIRTY, see comment in
                 * SyncOneBuffer.
                 */
                LockBufHdr(bufHdr);  // take the buffer header spinlock

                if ((bufHdr->flags & mask) == mask)   // buffers carrying the dirty flag (and, normally, the permanent flag) get BM_CHECKPOINT_NEEDED added
                {
                        bufHdr->flags |= BM_CHECKPOINT_NEEDED;  
                        num_to_write++;
                }

                UnlockBufHdr(bufHdr);
        }

        if (num_to_write == 0)
                return;                                 /* nothing to do */

        TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);  // probe: buffer write phase begins

        /*
         * Loop over all buffers again, and write the ones (still) marked with
         * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
         * since we might as well dump soon-to-be-recycled buffers first.
         *
         * Note that we don't read the buffer alloc count here --- that should be
         * left untouched till the next BgBufferSync() call.
         */
        buf_id = StrategySyncStart(NULL, NULL);
        num_to_scan = NBuffers;
        num_written = 0;
        while (num_to_scan-- > 0)  // count down the buffers still to scan
        {
                volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

                /*
                 * We don't need to acquire the lock here, because we're only looking
                 * at a single bit. It's possible that someone else writes the buffer
                 * and clears the flag right after we check, but that doesn't matter
                 * since SyncOneBuffer will then do nothing.  However, there is a
                 * further race condition: it's conceivable that between the time we
                 * examine the bit here and the time SyncOneBuffer acquires lock,
                 * someone else not only wrote the buffer but replaced it with another
                 * page and dirtied it.  In that improbable case, SyncOneBuffer will
                 * write the buffer though we didn't need to.  It doesn't seem worth
                 * guarding against this, though.
                 */
                if (bufHdr->flags & BM_CHECKPOINT_NEEDED)  // check the flag without the header lock; if BM_CHECKPOINT_NEEDED is still set, flush it
                {
                        if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)  // SyncOneBuffer writes this buffer out
                        {
                                TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);  //  probe: this buffer was written successfully
                                BgWriterStats.m_buf_written_checkpoints++;
                                num_written++;

                                /*
                                 * We know there are at most num_to_write buffers with
                                 * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
                                 * num_written reaches num_to_write.
                                 *
                                 * Note that num_written doesn't include buffers written by
                                 * other backends, or by the bgwriter cleaning scan. That
                                 * means that the estimate of how much progress we've made is
                                 * conservative, and also that this test will often fail to
                                 * trigger.  But it seems worth making anyway.
                                 */
                                if (num_written >= num_to_write)  // all flagged buffers have been written; stop early without scanning the rest of the pool
                                        break;

                                /*
                                 * Sleep to throttle our I/O rate.
                                 */
                                CheckpointWriteDelay(flags, (double) num_written / num_to_write);   // pass the current completion ratio to CheckpointWriteDelay; if we are ahead of schedule it sleeps for about 100 milliseconds
                                // Example: suppose 1000 buffers must be written (num_to_write) and 100 have been written so far (num_written).
                                //   Then CheckpointWriteDelay(flags, 0.1) is called; with CheckPointCompletionTarget at its default of 0.5,
                                //   IsCheckpointOnSchedule computes progress *= CheckPointCompletionTarget, i.e. 0.1 * 0.5 = 0.05, and
                                //   elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments.
                                //   If progress < elapsed_xlogs, the checkpoint is behind schedule and there is no sleep.
                                //   progress never exceeds 0.5 here, since num_written / num_to_write is at most 1 and 1 * 0.5 = 0.5,
                                //   so the larger CheckPointCompletionTarget is, the more of the scan is spent sleeping (see the standalone sketch after this function).
                        }
                }

                if (++buf_id >= NBuffers)
                        buf_id = 0;
        }

        /*
         * Update checkpoint statistics. As noted above, this doesn't include
         * buffers written by other backends or bgwriter scan.
         */
        CheckpointStats.ckpt_bufs_written += num_written;

        TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write);  // probe: every buffer flagged BM_CHECKPOINT_NEEDED has been flushed
}
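
To make the pacing arithmetic in the comments above concrete, the following is a minimal standalone sketch (not PostgreSQL source) of the schedule test that CheckpointWriteDelay performs through IsCheckpointOnSchedule in src/backend/postmaster/checkpointer.c. The function and parameter names below are simplified stand-ins, and the values passed in main() are made up for illustration.

/*
 * Minimal standalone sketch (not PostgreSQL source) of the schedule test
 * behind CheckpointWriteDelay: the checkpointer only sleeps while its write
 * progress, scaled by checkpoint_completion_target, is ahead of both the WAL
 * consumed and the time elapsed since the checkpoint started.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
is_checkpoint_on_schedule(double progress,          /* num_written / num_to_write */
                          double completion_target, /* checkpoint_completion_target */
                          double elapsed_xlogs,     /* WAL used / checkpoint_segments */
                          double elapsed_time)      /* time used / checkpoint_timeout */
{
        progress *= completion_target;   /* aim to finish at this fraction of the interval */

        if (progress < elapsed_xlogs)    /* behind on WAL volume: keep writing, no sleep */
                return false;

        if (progress < elapsed_time)     /* behind on wall-clock time: keep writing */
                return false;

        return true;                     /* ahead of schedule: caller sleeps ~100 ms */
}

int
main(void)
{
        /* 100 of 1000 buffers written, target 0.5, 8% of checkpoint_segments consumed */
        printf("sleep? %d\n", is_checkpoint_on_schedule(0.1, 0.5, 0.08, 0.02));   /* 0: keep writing */

        /* same write progress, but only 3% of the WAL budget and 2% of the timeout used */
        printf("sleep? %d\n", is_checkpoint_on_schedule(0.1, 0.5, 0.03, 0.02));   /* 1: sleep */

        return 0;
}
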


SyncOneBuffer writes out a single buffer and returns a bitmask; BUF_WRITTEN indicates that the buffer was written out.
SyncOneBuffer@src/backend/storage/buffer/bufmgr.c

/*
 * SyncOneBuffer -- process a single buffer during syncing.
 *
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor
 * buffers marked recently used, as these are not replacement candidates.
 *
 * Returns a bitmask containing the following flag bits:
 *      BUF_WRITTEN: we wrote the buffer.
 *      BUF_REUSABLE: buffer is available for replacement, ie, it has
 *              pin count 0 and usage count 0.
 *
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
 * after locking it, but we don't care all that much.)
 *
 * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used)
{
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
        int                     result = 0;

        /*
         * Check whether buffer needs writing.
         *
         * We can make this check without taking the buffer content lock so long
         * as we mark pages dirty in access methods *before* logging changes with
         * XLogInsert(): if someone marks the buffer dirty just after our check we
         * don't worry because our checkpoint.redo points before log record for
         * upcoming changes and so we are not required to write such dirty buffer.
         */
        LockBufHdr(bufHdr);

        if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)   
                result |= BUF_REUSABLE;
        else if (skip_recently_used)
        {
                /* Caller told us not to write recently-used buffers */
                UnlockBufHdr(bufHdr);
                return result;
        }

        if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
        {
                /* It's clean, so nothing to do */
                UnlockBufHdr(bufHdr);
                return result;
        }

        /*
         * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
         * buffer is clean by the time we've locked it.)
         */
        PinBuffer_Locked(bufHdr);
        LWLockAcquire(bufHdr->content_lock, LW_SHARED);

        FlushBuffer(bufHdr, NULL);  // FlushBuffer does the actual write of the buffer

        LWLockRelease(bufHdr->content_lock);
        UnpinBuffer(bufHdr, true);

        return result | BUF_WRITTEN;
}
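
The bitmask returned here is consumed directly by the caller. Below is a minimal standalone sketch (not PostgreSQL source) of how a BufferSync-style caller tests the flags; the numeric flag values are illustrative only, the real definitions live in bufmgr.c.

/*
 * Minimal standalone sketch (not PostgreSQL source) of consuming the bitmask
 * that SyncOneBuffer returns.  Flag values here are illustrative.
 */
#include <stdio.h>

#define BUF_WRITTEN   0x01              /* the buffer was written out */
#define BUF_REUSABLE  0x02              /* pin count 0 and usage count 0 */

int
main(void)
{
        int             sync_state = BUF_WRITTEN | BUF_REUSABLE;       /* pretend result */

        if (sync_state & BUF_WRITTEN)
                printf("counts toward num_written and m_buf_written_checkpoints\n");

        if (sync_state & BUF_REUSABLE)
                printf("buffer is a replacement candidate for the clock sweep\n");

        return 0;
}
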


FlushBuffer hands the buffer contents to the kernel; the kernel performs the actual write to disk. Before the checkpoint's WAL record can be written, however, those changes must be forced to disk.
FlushBuffer@src/backend/storage/buffer/bufmgr.c

/*
 * FlushBuffer
 *              Physically write out a shared buffer.
 *
 * NOTE: this actually just passes the buffer contents to the kernel; the
 * real write to disk won't happen until the kernel feels like it.  This
 * is okay from our point of view since we can redo the changes from WAL.
 * However, we will need to force the changes to disk via fsync before
 * we can checkpoint WAL.  (That is, the buffer must reach disk before the checkpoint WAL is written.)
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *
 * If the caller has an smgr reference for the buffer's relation, pass it
 * as the second parameter.  If not, pass NULL.
 */
static void
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
        XLogRecPtr      recptr;
        ErrorContextCallback errcallback;
        instr_time      io_start,
                                io_time;
        Block           bufBlock;
        char       *bufToWrite;

        /*
         * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns
         * false, then someone else flushed the buffer before we could, so we need
         * not do anything.
         */
        if (!StartBufferIO(buf, false))
                return;

        /* Setup error traceback support for ereport() */
        errcallback.callback = shared_buffer_write_error_callback;
        errcallback.arg = (void *) buf;
        errcallback.previous = error_context_stack;
        error_context_stack = &errcallback;

        /* Find smgr relation for buffer */
        if (reln == NULL)
                reln = smgropen(buf->tag.rnode, InvalidBackendId);

        TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
                                                                                buf->tag.blockNum,
                                                                                reln->smgr_rnode.node.spcNode,
                                                                                reln->smgr_rnode.node.dbNode,
                                                                                reln->smgr_rnode.node.relNode);

        LockBufHdr(buf);

        /*
         * Run PageGetLSN while holding header lock, since we don't have the
         * buffer locked exclusively in all cases.
         */
        recptr = BufferGetLSN(buf);  // another place that takes the buffer header lock: the page LSN is read while holding it

        /* To check if block content changes while flushing. - vadim 01/17/97 */
        buf->flags &= ~BM_JUST_DIRTIED;
        UnlockBufHdr(buf);

        /*
         * Force XLOG flush up to buffer's LSN.  This implements the basic WAL  //  force WAL out up to this buffer's LSN,
         * rule that log updates must hit disk before any of the data-file changes  //  ensuring the WAL describing earlier changes to this block is on disk first.
         * they describe do.
         *
         * However, this rule does not apply to unlogged relations, which will be
         * lost after a crash anyway.  Most unlogged relation pages do not bear
         * LSNs since we never emit WAL records for them, and therefore flushing
         * up through the buffer LSN would be useless, but harmless.  However,
         * GiST indexes use LSNs internally to track page-splits, and therefore
         * unlogged GiST pages bear "fake" LSNs generated by
         * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
         * LSN counter could advance past the WAL insertion point; and if it did
         * happen, attempting to flush WAL through that location would fail, with
         * disastrous system-wide consequences.  To make sure that can't happen,
         * skip the flush if the buffer isn't permanent.
         */
        if (buf->flags & BM_PERMANENT)
                XLogFlush(recptr);

        /*
         * Now it's safe to write buffer to disk. Note that no one else should
         * have been able to write it while we were busy with log flushing because
         * we have the io_in_progress lock.
         */
        bufBlock = BufHdrGetBlock(buf);  

        /*
         * Update page checksum if desired.  Since we have only shared lock on the
         * buffer, other processes might be updating hint bits in it, so we must
         * copy the page to private storage if we do checksumming.
         */
        bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

        if (track_io_timing)
                INSTR_TIME_SET_CURRENT(io_start);

        /*
         * bufToWrite is either the shared buffer or a copy, as appropriate.
         */
        smgrwrite(reln,               //  write the buffer out (handed to the kernel)
                          buf->tag.forkNum,
                          buf->tag.blockNum,
                          bufToWrite,
                          false);

        if (track_io_timing)
        {
                INSTR_TIME_SET_CURRENT(io_time);
                INSTR_TIME_SUBTRACT(io_time, io_start);
                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
                INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
        }

        pgBufferUsage.shared_blks_written++;

        /*
         * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
         * end the io_in_progress state.
         */
        TerminateBufferIO(buf, true, 0);

        TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,    // probe: flush of this single buffer is complete
                                                                           buf->tag.blockNum,
                                                                           reln->smgr_rnode.node.spcNode,
                                                                           reln->smgr_rnode.node.dbNode,
                                                                           reln->smgr_rnode.node.relNode);

        /* Pop the error context stack */
        error_context_stack = errcallback.previous;
}
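
The crucial ordering in FlushBuffer is the WAL-before-data rule enforced by XLogFlush(recptr) above. The following minimal standalone sketch (not PostgreSQL source) models just that invariant; xlog_flush and write_page are hypothetical stand-ins for XLogFlush and the smgrwrite call.

/*
 * Minimal standalone sketch (not PostgreSQL source) of the WAL-before-data
 * rule: WAL must be durable up to a page's LSN before the page itself may be
 * handed to the kernel.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr wal_durable_upto = 0;          /* how far WAL has been fsync'd */

static void
xlog_flush(XLogRecPtr upto)                      /* stand-in for XLogFlush() */
{
        if (upto > wal_durable_upto)
                wal_durable_upto = upto;         /* the real call flushes pg_xlog here */
}

static void
write_page(XLogRecPtr page_lsn)                  /* stand-in for FlushBuffer's write */
{
        xlog_flush(page_lsn);                    /* log first ... */
        printf("page LSN %llu written; WAL durable to %llu\n",  /* ... data after */
               (unsigned long long) page_lsn,
               (unsigned long long) wal_durable_upto);
}

int
main(void)
{
        write_page(1000);                        /* forces a WAL flush up to 1000 */
        write_page(500);                         /* already covered, no extra flush */
        return 0;
}
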


Write the supplied buffer out.
smgrwrite@src/backend/storage/smgr/smgr.c

/*
 *      smgrwrite() -- Write the supplied buffer out.
 *
 *              This is to be used only for updating already-existing blocks of a
 *              relation (ie, those before the current EOF).  To extend a relation,
 *              use smgrextend().
 *
 *              This is not a synchronous write -- the block is not necessarily
 *              on disk at return, only dumped out to the kernel.  However,
 *              provisions will be made to fsync the write before the next checkpoint.
 *
 *              skipFsync indicates that the caller will make other provisions to
 *              fsync the relation, so we needn't bother.  Temporary relations also
 *              do not require fsync.
 */
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
                  char *buffer, bool skipFsync)
{
        (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
                                                                                          buffer, skipFsync);
}


The final step is to sync the earlier writes down to disk.
smgrsync@src/backend/storage/smgr/smgr.c

/*
 *      smgrsync() -- Sync files to disk during checkpoint.
 */
void
smgrsync(void)
{
        int                     i;

        for (i = 0; i < NSmgr; i++)
        {
                if (smgrsw[i].smgr_sync)
                        (*(smgrsw[i].smgr_sync)) ();
        }
}


Here smgr_sync actually calls
mdsync@src/backend/storage/smgr/md.c

/*
 *      mdsync() -- Sync previous writes to stable storage.
 */
void
mdsync(void)
{
        static bool mdsync_in_progress = false;

        HASH_SEQ_STATUS hstat;
        PendingOperationEntry *entry;
        int                     absorb_counter;

        /* Statistics on sync times */
        int                     processed = 0;
        instr_time      sync_start,
                                sync_end,
                                sync_diff;
        uint64          elapsed;
        uint64          longest = 0;
        uint64          total_elapsed = 0;
        /*
         * This is only called during checkpoints, and checkpoints should only
         * occur in processes that have created a pendingOpsTable.
         */
        if (!pendingOpsTable)
                elog(ERROR, "cannot sync without a pendingOpsTable");

        /*
         * If we are in the checkpointer, the sync had better include all fsync
         * requests that were queued by backends up to this point.  The tightest
         * race condition that could occur is that a buffer that must be written
         * and fsync'd for the checkpoint could have been dumped by a backend just
         * before it was visited by BufferSync().  We know the backend will have
         * queued an fsync request before clearing the buffer's dirtybit, so we
         * are safe as long as we do an Absorb after completing BufferSync().
         */
        AbsorbFsyncRequests();

        /*
         * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
         * checkpoint), we want to ignore fsync requests that are entered into the
         * hashtable after this point --- they should be processed next time,
         * instead.  We use mdsync_cycle_ctr to tell old entries apart from new
         * ones: new ones will have cycle_ctr equal to the incremented value of
         * mdsync_cycle_ctr.
         *
         * In normal circumstances, all entries present in the table at this point
         * will have cycle_ctr exactly equal to the current (about to be old)
         * value of mdsync_cycle_ctr.  However, if we fail partway through the
         * fsync'ing loop, then older values of cycle_ctr might remain when we
         * come back here to try again.  Repeated checkpoint failures would
         * eventually wrap the counter around to the point where an old entry
         * might appear new, causing us to skip it, possibly allowing a checkpoint
         * to succeed that should not have.  To forestall wraparound, any time the
         * previous mdsync() failed to complete, run through the table and
         * forcibly set cycle_ctr = mdsync_cycle_ctr.
         *
         * Think not to merge this loop with the main loop, as the problem is
         * exactly that that loop may fail before having visited all the entries.
         * From a performance point of view it doesn't matter anyway, as this path
         * will never be taken in a system that's functioning normally.
         */
        if (mdsync_in_progress)
        {
                /* prior try failed, so update any stale cycle_ctr values */
                hash_seq_init(&hstat, pendingOpsTable);
                while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
                {
                        entry->cycle_ctr = mdsync_cycle_ctr;
                }
        }

        /* Advance counter so that new hashtable entries are distinguishable */
        mdsync_cycle_ctr++;

        /* Set flag to detect failure if we don't reach the end of the loop */
        mdsync_in_progress = true;

        /* Now scan the hashtable for fsync requests to process */
        absorb_counter = FSYNCS_PER_ABSORB;
        hash_seq_init(&hstat, pendingOpsTable);
        while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
        {
                ForkNumber      forknum;

                /*
                 * If the entry is new then don't process it this time; it might
                 * contain multiple fsync-request bits, but they are all new.  Note
                 * "continue" bypasses the hash-remove call at the bottom of the loop.
                 */
                if (entry->cycle_ctr == mdsync_cycle_ctr)
                        continue;

                /* Else assert we haven't missed it */
                Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);

                /*
                 * Scan over the forks and segments represented by the entry.
                 *
                 * The bitmap manipulations are slightly tricky, because we can call
                 * AbsorbFsyncRequests() inside the loop and that could result in
                 * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
                 * This is okay because we unlink each bitmapset from the hashtable
                 * entry before scanning it.  That means that any incoming fsync
                 * requests will be processed now if they reach the table before we
                 * begin to scan their fork.
                 */
                for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
                {
                        Bitmapset  *requests = entry->requests[forknum];
                        int                     segno;

                        entry->requests[forknum] = NULL;
                        entry->canceled[forknum] = false;

                        while ((segno = bms_first_member(requests)) >= 0)
                        {
                                int                     failures;

                                /*
                                 * If fsync is off then we don't have to bother opening the
                                 * file at all.  (We delay checking until this point so that
                                 * changing fsync on the fly behaves sensibly.)
                                 */
                                if (!enableFsync)
                                        continue;
                                /*
                                 * If in checkpointer, we want to absorb pending requests
                                 * every so often to prevent overflow of the fsync request
                                 * queue.  It is unspecified whether newly-added entries will
                                 * be visited by hash_seq_search, but we don't care since we
                                 * don't need to process them anyway.
                                 */
                                if (--absorb_counter <= 0)
                                {
                                        AbsorbFsyncRequests();
                                        absorb_counter = FSYNCS_PER_ABSORB;
                                }

                                /*
                                 * The fsync table could contain requests to fsync segments
                                 * that have been deleted (unlinked) by the time we get to
                                 * them. Rather than just hoping an ENOENT (or EACCES on
                                 * Windows) error can be ignored, what we do on error is
                                 * absorb pending requests and then retry.  Since mdunlink()
                                 * queues a "cancel" message before actually unlinking, the
                                 * fsync request is guaranteed to be marked canceled after the
                                 * absorb if it really was this case. DROP DATABASE likewise
                                 * has to tell us to forget fsync requests before it starts
                                 * deletions.
                                 */
                                for (failures = 0;; failures++) /* loop exits at "break" */
                                {
                                        SMgrRelation reln;
                                        MdfdVec    *seg;
                                        char       *path;
                                        int                     save_errno;

                                        /*
                                         * Find or create an smgr hash entry for this relation.
                                         * This may seem a bit unclean -- md calling smgr?      But
                                         * it's really the best solution.  It ensures that the
                                         * open file reference isn't permanently leaked if we get
                                         * an error here. (You may say "but an unreferenced
                                         * SMgrRelation is still a leak!" Not really, because the
                                         * only case in which a checkpoint is done by a process
                                         * that isn't about to shut down is in the checkpointer,
                                         * and it will periodically do smgrcloseall(). This fact
                                         * justifies our not closing the reln in the success path
                                         * either, which is a good thing since in non-checkpointer
                                         * cases we couldn't safely do that.)
                                         */
                                        reln = smgropen(entry->rnode, InvalidBackendId);

                                        /* Attempt to open and fsync the target segment */
                                        seg = _mdfd_getseg(reln, forknum,
                                                         (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
                                                                           false, EXTENSION_RETURN_NULL);

                                        INSTR_TIME_SET_CURRENT(sync_start);

                                        if (seg != NULL &&
                                                FileSync(seg->mdfd_vfd) >= 0)
                                        {
                                                /* Success; update statistics about sync timing */
                                                INSTR_TIME_SET_CURRENT(sync_end);
                                                sync_diff = sync_end;
                                                INSTR_TIME_SUBTRACT(sync_diff, sync_start);
                                                elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
                                                if (elapsed > longest)
                                                        longest = elapsed;
                                                total_elapsed += elapsed;
                                                processed++;
                                                if (log_checkpoints)
                                                        elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
                                                                 processed,
                                                                 FilePathName(seg->mdfd_vfd),
                                                                 (double) elapsed / 1000);

                                                break;  /* out of retry loop */
                                        }
                                        /* Compute file name for use in message */
                                        save_errno = errno;
                                        path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
                                        errno = save_errno;

                                        /*
                                         * It is possible that the relation has been dropped or
                                         * truncated since the fsync request was entered.
                                         * Therefore, allow ENOENT, but only if we didn't fail
                                         * already on this file.  This applies both for
                                         * _mdfd_getseg() and for FileSync, since fd.c might have
                                         * closed the file behind our back.
                                         *
                                         * XXX is there any point in allowing more than one retry?
                                         * Don't see one at the moment, but easy to change the
                                         * test here if so.
                                         */
                                        if (!FILE_POSSIBLY_DELETED(errno) ||
                                                failures > 0)
                                                ereport(ERROR,
                                                                (errcode_for_file_access(),
                                                                 errmsg("could not fsync file \"%s\": %m",
                                                                                path)));
                                        else
                                                ereport(DEBUG1,
                                                                (errcode_for_file_access(),
                                                errmsg("could not fsync file \"%s\" but retrying: %m",
                                                           path)));
                                        pfree(path);

                                        /*
                                         * Absorb incoming requests and check to see if a cancel
                                         * arrived for this relation fork.
                                         */
                                        AbsorbFsyncRequests();
                                        absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
                                        if (entry->canceled[forknum])
                                                break;
                                }                               /* end retry loop */
                        }
                        bms_free(requests);
                }

                /*
                 * We've finished everything that was requested before we started to
                 * scan the entry.  If no new requests have been inserted meanwhile,
                 * remove the entry.  Otherwise, update its cycle counter, as all the
                 * requests now in it must have arrived during this cycle.
                 */
                for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
                {
                        if (entry->requests[forknum] != NULL)
                                break;
                }
                if (forknum <= MAX_FORKNUM)
                        entry->cycle_ctr = mdsync_cycle_ctr;
                else
                {
                        /* Okay to remove it */
                        if (hash_search(pendingOpsTable, &entry->rnode,
                                                        HASH_REMOVE, NULL) == NULL)
                                elog(ERROR, "pendingOpsTable corrupted");
                }
        }                                                       /* end loop over hashtable entries */

        /* Return sync performance metrics for report at checkpoint end */
        CheckpointStats.ckpt_sync_rels = processed;
        CheckpointStats.ckpt_longest_sync = longest;
        CheckpointStats.ckpt_agg_sync_time = total_elapsed;

        /* Flag successful completion of mdsync */
        mdsync_in_progress = false;
}
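
One detail worth isolating is the cycle counter that keeps mdsync from chasing its own tail: requests queued after the counter is advanced belong to the next cycle and are skipped this time. Here is a minimal standalone sketch (not PostgreSQL source) of that comparison.

/*
 * Minimal standalone sketch (not PostgreSQL source) of the cycle-counter
 * trick in mdsync: entries tagged with the already-advanced counter were
 * queued after this checkpoint began and are left for the next cycle.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned short CycleCtr;                /* small counter that may wrap around */

static CycleCtr mdsync_cycle_ctr = 0;

static bool
process_this_cycle(CycleCtr entry_ctr)
{
        return entry_ctr != mdsync_cycle_ctr;   /* new entries carry the current value */
}

int
main(void)
{
        CycleCtr        old_entry = mdsync_cycle_ctr;   /* queued before the checkpoint */
        CycleCtr        new_entry;

        mdsync_cycle_ctr++;                     /* checkpoint begins a new fsync cycle */
        new_entry = mdsync_cycle_ctr;           /* request absorbed while fsyncing */

        printf("old entry fsync'd now: %d\n", process_this_cycle(old_entry));   /* 1 */
        printf("new entry fsync'd now: %d\n", process_this_cycle(new_entry));   /* 0 */
        return 0;
}
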


[Summary]
The checkpointer flushes the buffer cache in a few main steps:
1. Scan the shared buffer pool and add the BM_CHECKPOINT_NEEDED flag to the buffers that are currently dirty.
2. Scan the shared buffer pool again and write the buffers flagged in step 1 to disk; before each write, the WAL up to that buffer's LSN must already be flushed to disk.
3. Sync (fsync) the writes from step 2 to persistent storage.
The time spent in each phase can be observed through the probes shown above, or through the checkpoint log output.
The next article will cover tracing checkpoints.
