PostgreSQL: why does the checkpointer impact performance so much? - 3


Continuing from the previous article, this one focuses on CheckPointBuffers(flags).
CheckPointBuffers(flags)@src/backend/storage/buffer/bufmgr.c

/*
 * CheckPointBuffers
 *
 * Flush all dirty blocks in buffer pool to disk at checkpoint time.
 *
 * Note: temporary relations do not participate in checkpoints, so they don't
 * need to be flushed.
 */
void
CheckPointBuffers(int flags)
{
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);  // probe: buffer checkpoint begins
        CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
        BufferSync(flags);  //  the heavyweight part: one full scan of the buffer pool, taking each buffer header lock to set a flag, then a second scan that flushes the flagged dirty blocks to disk
        CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();   // probe: buffer checkpoint sync begins
        smgrsync();  // the sync (fsync) phase
        CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
        TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();   // probe: buffer checkpoint done
}

BufferSync is a fairly heavy operation.
The first full scan of the buffer pool marks the header of every dirty buffer as needing to be flushed by this checkpoint.
The second scan flushes the buffers marked in the first pass to disk.
Note that the first pass keeps a count of the buffers it marks as needing a checkpoint; if the second pass reaches that count early, it can stop without scanning the whole buffer pool.
However, buffers marked in the first pass may have been written out in the meantime by other processes such as the bgwriter, in which case the count is never reached and the second pass still has to scan the entire pool.
(One might ask why the first pass doesn't simply record the locations of the dirty buffers so that the second pass could flush exactly those, instead of scanning again.)
BufferSync@src/backend/storage/buffer/bufmgr.c

/*
 * BufferSync -- Write out all dirty buffers in the pool.
 *
 * This is called at checkpoint time to write out all dirty shared buffers.
 * The checkpoint request flags should be passed in.  If CHECKPOINT_IMMEDIATE
 * is set, we disable delays between writes; if CHECKPOINT_IS_SHUTDOWN,
 * CHECKPOINT_END_OF_RECOVERY or CHECKPOINT_FLUSH_ALL is set, we write even
 * unlogged buffers, which are otherwise skipped.  The remaining flags
 * currently have no effect here.
 */
static void
BufferSync(int flags)
{
        int                     buf_id;
        int                     num_to_scan;
        int                     num_to_write;
        int                     num_written;
        int                     mask = BM_DIRTY;  // dirty-buffer mask

        /* Make sure we can handle the pin inside SyncOneBuffer */
        ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

        /*
         * Unless this is a shutdown checkpoint or we have been explicitly told,
         * we write only permanent, dirty buffers.  But at shutdown or end of
         * recovery, we write all dirty buffers.
         */
        if (!((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                                        CHECKPOINT_FLUSH_ALL))))
                mask |= BM_PERMANENT;  // also require the permanent (logged) flag

        /*
         * Loop over all buffers, and mark the ones that need to be written with
         * BM_CHECKPOINT_NEEDED.  Count them as we go (num_to_write), so that we
         * can estimate how much work needs to be done.
         *
         * This allows us to write only those pages that were dirty when the
         * checkpoint began, and not those that get dirtied while it proceeds.
         * Whenever a page with BM_CHECKPOINT_NEEDED is written out, either by us
         * later in this function, or by normal backends or the bgwriter cleaning
         * scan, the flag is cleared.  Any buffer dirtied after this point won't
         * have the flag set.
         *
         * Note that if we fail to write some buffer, we may leave buffers with
         * BM_CHECKPOINT_NEEDED still set.  This is OK since any such buffer would
         * certainly need to be written for the next checkpoint attempt, too.
         */
        num_to_write = 0;  // count of buffers flagged BM_CHECKPOINT_NEEDED
        for (buf_id = 0; buf_id < NBuffers; buf_id++)   //  mark the buffers that are dirty right now as needing to be flushed by this checkpoint;
                                                        //  in other words, buffers dirtied while the flush is in progress can be ignored
        {
                volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

                /*
                 * Header spinlock is enough to examine BM_DIRTY, see comment in
                 * SyncOneBuffer.
                 */
                LockBufHdr(bufHdr);  // take the buffer header spinlock

                if ((bufHdr->flags & mask) == mask)   // buffers carrying the dirty flag (and, normally, the permanent flag) get BM_CHECKPOINT_NEEDED added
                {
                        bufHdr->flags |= BM_CHECKPOINT_NEEDED;  
                        num_to_write++;
                }

                UnlockBufHdr(bufHdr);
        }

        if (num_to_write == 0)
                return;                                 /* nothing to do */

        TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_write);  // probe: buffer write phase begins

        /*
         * Loop over all buffers again, and write the ones (still) marked with
         * BM_CHECKPOINT_NEEDED.  In this loop, we start at the clock sweep point
         * since we might as well dump soon-to-be-recycled buffers first.
         *
         * Note that we don't read the buffer alloc count here --- that should be
         * left untouched till the next BgBufferSync() call.
         */
        buf_id = StrategySyncStart(NULL, NULL);
        num_to_scan = NBuffers;
        num_written = 0;
        while (num_to_scan-- > 0)  // count down the buffers still to scan
        {
                volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];

                /*
                 * We don't need to acquire the lock here, because we're only looking
                 * at a single bit. It's possible that someone else writes the buffer
                 * and clears the flag right after we check, but that doesn't matter
                 * since SyncOneBuffer will then do nothing.  However, there is a
                 * further race condition: it's conceivable that between the time we
                 * examine the bit here and the time SyncOneBuffer acquires lock,
                 * someone else not only wrote the buffer but replaced it with another
                 * page and dirtied it.  In that improbable case, SyncOneBuffer will
                 * write the buffer though we didn't need to.  It doesn't seem worth
                 * guarding against this, though.
                 */
                if (bufHdr->flags & BM_CHECKPOINT_NEEDED)  // check the flag without the header lock; if BM_CHECKPOINT_NEEDED is still set, flush it
                {
                        if (SyncOneBuffer(buf_id, false) & BUF_WRITTEN)  // SyncOneBuffer writes this buffer out
                        {
                                TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);  //  probe: this buffer was written successfully
                                BgWriterStats.m_buf_written_checkpoints++;
                                num_written++;

                                /*
                                 * We know there are at most num_to_write buffers with
                                 * BM_CHECKPOINT_NEEDED set; so we can stop scanning if
                                 * num_written reaches num_to_write.
                                 *
                                 * Note that num_written doesn't include buffers written by
                                 * other backends, or by the bgwriter cleaning scan. That
                                 * means that the estimate of how much progress we've made is
                                 * conservative, and also that this test will often fail to
                                 * trigger.  But it seems worth making anyway.
                                 */
                                if (num_written >= num_to_write)  // all flagged buffers have been written; stop early without scanning the rest of the pool
                                        break;

                                /*
                                 * Sleep to throttle our I/O rate.
                                 */
                                CheckpointWriteDelay(flags, (double) num_written / num_to_write);   // pass the current completion ratio to CheckpointWriteDelay; if we are ahead of schedule it sleeps for about 100 milliseconds
                                // Example: suppose 1000 buffers must be written (num_to_write) and 100 have been written so far (num_written).
                                //   Then CheckpointWriteDelay(flags, 0.1) is called; with CheckPointCompletionTarget at its default of 0.5,
                                //   IsCheckpointOnSchedule computes progress *= CheckPointCompletionTarget, i.e. 0.1 * 0.5 = 0.05, and
                                //   elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments.
                                //   If progress < elapsed_xlogs, the checkpoint is behind schedule and there is no sleep.
                                //   progress never exceeds 0.5 here, since num_written / num_to_write is at most 1 and 1 * 0.5 = 0.5,
                                //   so the larger CheckPointCompletionTarget is, the more of the scan is spent sleeping (see the standalone sketch after this function).
                        }
                }

                if (++buf_id >= NBuffers)
                        buf_id = 0;
        }

        /*
         * Update checkpoint statistics. As noted above, this doesn't include
         * buffers written by other backends or bgwriter scan.
         */
        CheckpointStats.ckpt_bufs_written += num_written;

        TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_write);  // probe: every buffer flagged BM_CHECKPOINT_NEEDED has been flushed
}
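
To make the pacing arithmetic in the comments above concrete, the following is a minimal standalone sketch (not PostgreSQL source) of the schedule test that CheckpointWriteDelay performs through IsCheckpointOnSchedule in src/backend/postmaster/checkpointer.c. The function and parameter names below are simplified stand-ins, and the values passed in main() are made up for illustration.

/*
 * Minimal standalone sketch (not PostgreSQL source) of the schedule test
 * behind CheckpointWriteDelay: the checkpointer only sleeps while its write
 * progress, scaled by checkpoint_completion_target, is ahead of both the WAL
 * consumed and the time elapsed since the checkpoint started.
 */
#include <stdbool.h>
#include <stdio.h>

static bool
is_checkpoint_on_schedule(double progress,          /* num_written / num_to_write */
                          double completion_target, /* checkpoint_completion_target */
                          double elapsed_xlogs,     /* WAL used / checkpoint_segments */
                          double elapsed_time)      /* time used / checkpoint_timeout */
{
        progress *= completion_target;   /* aim to finish at this fraction of the interval */

        if (progress < elapsed_xlogs)    /* behind on WAL volume: keep writing, no sleep */
                return false;

        if (progress < elapsed_time)     /* behind on wall-clock time: keep writing */
                return false;

        return true;                     /* ahead of schedule: caller sleeps ~100 ms */
}

int
main(void)
{
        /* 100 of 1000 buffers written, target 0.5, 8% of checkpoint_segments consumed */
        printf("sleep? %d\n", is_checkpoint_on_schedule(0.1, 0.5, 0.08, 0.02));   /* 0: keep writing */

        /* same write progress, but only 3% of the WAL budget and 2% of the timeout used */
        printf("sleep? %d\n", is_checkpoint_on_schedule(0.1, 0.5, 0.03, 0.02));   /* 1: sleep */

        return 0;
}
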


SyncOneBuffer writes out a single buffer and returns a bitmask; BUF_WRITTEN indicates that the buffer was written out.
SyncOneBuffer@src/backend/storage/buffer/bufmgr.c

/*
 * SyncOneBuffer -- process a single buffer during syncing.
 *
 * If skip_recently_used is true, we don't write currently-pinned buffers, nor
 * buffers marked recently used, as these are not replacement candidates.
 *
 * Returns a bitmask containing the following flag bits:
 *      BUF_WRITTEN: we wrote the buffer.
 *      BUF_REUSABLE: buffer is available for replacement, ie, it has
 *              pin count 0 and usage count 0.
 *
 * (BUF_WRITTEN could be set in error if FlushBuffers finds the buffer clean
 * after locking it, but we don't care all that much.)
 *
 * Note: caller must have done ResourceOwnerEnlargeBuffers.
 */
static int
SyncOneBuffer(int buf_id, bool skip_recently_used)
{
        volatile BufferDesc *bufHdr = &BufferDescriptors[buf_id];
        int                     result = 0;

        /*
         * Check whether buffer needs writing.
         *
         * We can make this check without taking the buffer content lock so long
         * as we mark pages dirty in access methods *before* logging changes with
         * XLogInsert(): if someone marks the buffer dirty just after our check we
         * don't worry because our checkpoint.redo points before log record for
         * upcoming changes and so we are not required to write such dirty buffer.
         */
        LockBufHdr(bufHdr);

        if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)   
                result |= BUF_REUSABLE;
        else if (skip_recently_used)
        {
                /* Caller told us not to write recently-used buffers */
                UnlockBufHdr(bufHdr);
                return result;
        }

        if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
        {
                /* It's clean, so nothing to do */
                UnlockBufHdr(bufHdr);
                return result;
        }

        /*
         * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
         * buffer is clean by the time we've locked it.)
         */
        PinBuffer_Locked(bufHdr);
        LWLockAcquire(bufHdr->content_lock, LW_SHARED);

        FlushBuffer(bufHdr, NULL);  // FlushBuffer does the actual write of the buffer

        LWLockRelease(bufHdr->content_lock);
        UnpinBuffer(bufHdr, true);

        return result | BUF_WRITTEN;
}
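
The bitmask returned here is consumed directly by the caller. Below is a minimal standalone sketch (not PostgreSQL source) of how a BufferSync-style caller tests the flags; the numeric flag values are illustrative only, the real definitions live in bufmgr.c.

/*
 * Minimal standalone sketch (not PostgreSQL source) of consuming the bitmask
 * that SyncOneBuffer returns.  Flag values here are illustrative.
 */
#include <stdio.h>

#define BUF_WRITTEN   0x01              /* the buffer was written out */
#define BUF_REUSABLE  0x02              /* pin count 0 and usage count 0 */

int
main(void)
{
        int             sync_state = BUF_WRITTEN | BUF_REUSABLE;       /* pretend result */

        if (sync_state & BUF_WRITTEN)
                printf("counts toward num_written and m_buf_written_checkpoints\n");

        if (sync_state & BUF_REUSABLE)
                printf("buffer is a replacement candidate for the clock sweep\n");

        return 0;
}
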


FlushBuffer hands the buffer contents to the kernel; the kernel performs the actual write to disk. Before the checkpoint's WAL record can be written, however, those changes must be forced to disk.
FlushBuffer@src/backend/storage/buffer/bufmgr.c

/*
 * FlushBuffer
 *              Physically write out a shared buffer.
 *
 * NOTE: this actually just passes the buffer contents to the kernel; the
 * real write to disk won't happen until the kernel feels like it.  This
 * is okay from our point of view since we can redo the changes from WAL.
 * However, we will need to force the changes to disk via fsync before
 * we can checkpoint WAL.  (That is, the buffer must reach disk before the checkpoint WAL is written.)
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *
 * If the caller has an smgr reference for the buffer's relation, pass it
 * as the second parameter.  If not, pass NULL.
 */
static void
FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
{
        XLogRecPtr      recptr;
        ErrorContextCallback errcallback;
        instr_time      io_start,
                                io_time;
        Block           bufBlock;
        char       *bufToWrite;

        /*
         * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns
         * false, then someone else flushed the buffer before we could, so we need
         * not do anything.
         */
        if (!StartBufferIO(buf, false))
                return;

        /* Setup error traceback support for ereport() */
        errcallback.callback = shared_buffer_write_error_callback;
        errcallback.arg = (void *) buf;
        errcallback.previous = error_context_stack;
        error_context_stack = &errcallback;

        /* Find smgr relation for buffer */
        if (reln == NULL)
                reln = smgropen(buf->tag.rnode, InvalidBackendId);

        TRACE_POSTGRESQL_BUFFER_FLUSH_START(buf->tag.forkNum,
                                                                                buf->tag.blockNum,
                                                                                reln->smgr_rnode.node.spcNode,
                                                                                reln->smgr_rnode.node.dbNode,
                                                                                reln->smgr_rnode.node.relNode);

        LockBufHdr(buf);

        /*
         * Run PageGetLSN while holding header lock, since we don't have the
         * buffer locked exclusively in all cases.
         */
        recptr = BufferGetLSN(buf);  // another place that takes the buffer header lock: the page LSN is read while holding it

        /* To check if block content changes while flushing. - vadim 01/17/97 */
        buf->flags &= ~BM_JUST_DIRTIED;
        UnlockBufHdr(buf);

        /*
         * Force XLOG flush up to buffer's LSN.  This implements the basic WAL  //  force WAL out up to this buffer's LSN,
         * rule that log updates must hit disk before any of the data-file changes  //  ensuring the WAL describing earlier changes to this block is on disk first.
         * they describe do.
         *
         * However, this rule does not apply to unlogged relations, which will be
         * lost after a crash anyway.  Most unlogged relation pages do not bear
         * LSNs since we never emit WAL records for them, and therefore flushing
         * up through the buffer LSN would be useless, but harmless.  However,
         * GiST indexes use LSNs internally to track page-splits, and therefore
         * unlogged GiST pages bear "fake" LSNs generated by
         * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
         * LSN counter could advance past the WAL insertion point; and if it did
         * happen, attempting to flush WAL through that location would fail, with
         * disastrous system-wide consequences.  To make sure that can't happen,
         * skip the flush if the buffer isn't permanent.
         */
        if (buf->flags & BM_PERMANENT)
                XLogFlush(recptr);

        /*
         * Now it's safe to write buffer to disk. Note that no one else should
         * have been able to write it while we were busy with log flushing because
         * we have the io_in_progress lock.
         */
        bufBlock = BufHdrGetBlock(buf);  

        /*
         * Update page checksum if desired.  Since we have only shared lock on the
         * buffer, other processes might be updating hint bits in it, so we must
         * copy the page to private storage if we do checksumming.
         */
        bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

        if (track_io_timing)
                INSTR_TIME_SET_CURRENT(io_start);

        /*
         * bufToWrite is either the shared buffer or a copy, as appropriate.
         */
        smgrwrite(reln,               //  write the buffer out (handed to the kernel)
                          buf->tag.forkNum,
                          buf->tag.blockNum,
                          bufToWrite,
                          false);

        if (track_io_timing)
        {
                INSTR_TIME_SET_CURRENT(io_time);
                INSTR_TIME_SUBTRACT(io_time, io_start);
                pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
                INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
        }

        pgBufferUsage.shared_blks_written++;

        /*
         * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
         * end the io_in_progress state.
         */
        TerminateBufferIO(buf, true, 0);

        TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(buf->tag.forkNum,    // probe: flush of this single buffer is complete
                                                                           buf->tag.blockNum,
                                                                           reln->smgr_rnode.node.spcNode,
                                                                           reln->smgr_rnode.node.dbNode,
                                                                           reln->smgr_rnode.node.relNode);

        /* Pop the error context stack */
        error_context_stack = errcallback.previous;
}
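
The crucial ordering in FlushBuffer is the WAL-before-data rule enforced by XLogFlush(recptr) above. The following minimal standalone sketch (not PostgreSQL source) models just that invariant; xlog_flush and write_page are hypothetical stand-ins for XLogFlush and the smgrwrite call.

/*
 * Minimal standalone sketch (not PostgreSQL source) of the WAL-before-data
 * rule: WAL must be durable up to a page's LSN before the page itself may be
 * handed to the kernel.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr wal_durable_upto = 0;          /* how far WAL has been fsync'd */

static void
xlog_flush(XLogRecPtr upto)                      /* stand-in for XLogFlush() */
{
        if (upto > wal_durable_upto)
                wal_durable_upto = upto;         /* the real call flushes pg_xlog here */
}

static void
write_page(XLogRecPtr page_lsn)                  /* stand-in for FlushBuffer's write */
{
        xlog_flush(page_lsn);                    /* log first ... */
        printf("page LSN %llu written; WAL durable to %llu\n",  /* ... data after */
               (unsigned long long) page_lsn,
               (unsigned long long) wal_durable_upto);
}

int
main(void)
{
        write_page(1000);                        /* forces a WAL flush up to 1000 */
        write_page(500);                         /* already covered, no extra flush */
        return 0;
}
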


Write the supplied buffer out.
smgrwrite@src/backend/storage/smgr/smgr.c

/*
 *      smgrwrite() -- Write the supplied buffer out.
 *
 *              This is to be used only for updating already-existing blocks of a
 *              relation (ie, those before the current EOF).  To extend a relation,
 *              use smgrextend().
 *
 *              This is not a synchronous write -- the block is not necessarily
 *              on disk at return, only dumped out to the kernel.  However,
 *              provisions will be made to fsync the write before the next checkpoint.
 *
 *              skipFsync indicates that the caller will make other provisions to
 *              fsync the relation, so we needn't bother.  Temporary relations also
 *              do not require fsync.
 */
void
smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
                  char *buffer, bool skipFsync)
{
        (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
                                                                                          buffer, skipFsync);
}


The final step is to sync the earlier writes down to disk.
smgrsync@src/backend/storage/smgr/smgr.c

/*
 *      smgrsync() -- Sync files to disk during checkpoint.
 */
void
smgrsync(void)
{
        int                     i;

        for (i = 0; i < NSmgr; i++)
        {
                if (smgrsw[i].smgr_sync)
                        (*(smgrsw[i].smgr_sync)) ();
        }
}


Here smgr_sync actually calls
mdsync@src/backend/storage/smgr/md.c

/*
 *      mdsync() -- Sync previous writes to stable storage.
 */
void
mdsync(void)
{
        static bool mdsync_in_progress = false;

        HASH_SEQ_STATUS hstat;
        PendingOperationEntry *entry;
        int                     absorb_counter;

        /* Statistics on sync times */
        int                     processed = 0;
        instr_time      sync_start,
                                sync_end,
                                sync_diff;
        uint64          elapsed;
        uint64          longest = 0;
        uint64          total_elapsed = 0;
        /*
         * This is only called during checkpoints, and checkpoints should only
         * occur in processes that have created a pendingOpsTable.
         */
        if (!pendingOpsTable)
                elog(ERROR, "cannot sync without a pendingOpsTable");

        /*
         * If we are in the checkpointer, the sync had better include all fsync
         * requests that were queued by backends up to this point.  The tightest
         * race condition that could occur is that a buffer that must be written
         * and fsync'd for the checkpoint could have been dumped by a backend just
         * before it was visited by BufferSync().  We know the backend will have
         * queued an fsync request before clearing the buffer's dirtybit, so we
         * are safe as long as we do an Absorb after completing BufferSync().
         */
        AbsorbFsyncRequests();

        /*
         * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
         * checkpoint), we want to ignore fsync requests that are entered into the
         * hashtable after this point --- they should be processed next time,
         * instead.  We use mdsync_cycle_ctr to tell old entries apart from new
         * ones: new ones will have cycle_ctr equal to the incremented value of
         * mdsync_cycle_ctr.
         *
         * In normal circumstances, all entries present in the table at this point
         * will have cycle_ctr exactly equal to the current (about to be old)
         * value of mdsync_cycle_ctr.  However, if we fail partway through the
         * fsync'ing loop, then older values of cycle_ctr might remain when we
         * come back here to try again.  Repeated checkpoint failures would
         * eventually wrap the counter around to the point where an old entry
         * might appear new, causing us to skip it, possibly allowing a checkpoint
         * to succeed that should not have.  To forestall wraparound, any time the
         * previous mdsync() failed to complete, run through the table and
         * forcibly set cycle_ctr = mdsync_cycle_ctr.
         *
         * Think not to merge this loop with the main loop, as the problem is
         * exactly that that loop may fail before having visited all the entries.
         * From a performance point of view it doesn't matter anyway, as this path
         * will never be taken in a system that's functioning normally.
         */
        if (mdsync_in_progress)
        {
                /* prior try failed, so update any stale cycle_ctr values */
                hash_seq_init(&hstat, pendingOpsTable);
                while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
                {
                        entry->cycle_ctr = mdsync_cycle_ctr;
                }
        }

        /* Advance counter so that new hashtable entries are distinguishable */
        mdsync_cycle_ctr++;

        /* Set flag to detect failure if we don't reach the end of the loop */
        mdsync_in_progress = true;

        /* Now scan the hashtable for fsync requests to process */
        absorb_counter = FSYNCS_PER_ABSORB;
        hash_seq_init(&hstat, pendingOpsTable);
        while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
        {
                ForkNumber      forknum;

                /*
                 * If the entry is new then don't process it this time; it might
                 * contain multiple fsync-request bits, but they are all new.  Note
                 * "continue" bypasses the hash-remove call at the bottom of the loop.
                 */
                if (entry->cycle_ctr == mdsync_cycle_ctr)
                        continue;

                /* Else assert we haven't missed it */
                Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);

                /*
                 * Scan over the forks and segments represented by the entry.
                 *
                 * The bitmap manipulations are slightly tricky, because we can call
                 * AbsorbFsyncRequests() inside the loop and that could result in
                 * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
                 * This is okay because we unlink each bitmapset from the hashtable
                 * entry before scanning it.  That means that any incoming fsync
                 * requests will be processed now if they reach the table before we
                 * begin to scan their fork.
                 */
                for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
                {
                        Bitmapset  *requests = entry->requests[forknum];
                        int                     segno;

                        entry->requests[forknum] = NULL;
                        entry->canceled[forknum] = false;

                        while ((segno = bms_first_member(requests)) >= 0)
                        {
                                int                     failures;

                                /*
                                 * If fsync is off then we don't have to bother opening the
                                 * file at all.  (We delay checking until this point so that
                                 * changing fsync on the fly behaves sensibly.)
                                 */
                                if (!enableFsync)
                                        continue;
                                /*
                                 * If in checkpointer, we want to absorb pending requests
                                 * every so often to prevent overflow of the fsync request
                                 * queue.  It is unspecified whether newly-added entries will
                                 * be visited by hash_seq_search, but we don't care since we
                                 * don't need to process them anyway.
                                 */
                                if (--absorb_counter <= 0)
                                {
                                        AbsorbFsyncRequests();
                                        absorb_counter = FSYNCS_PER_ABSORB;
                                }

                                /*
                                 * The fsync table could contain requests to fsync segments
                                 * that have been deleted (unlinked) by the time we get to
                                 * them. Rather than just hoping an ENOENT (or EACCES on
                                 * Windows) error can be ignored, what we do on error is
                                 * absorb pending requests and then retry.  Since mdunlink()
                                 * queues a "cancel" message before actually unlinking, the
                                 * fsync request is guaranteed to be marked canceled after the
                                 * absorb if it really was this case. DROP DATABASE likewise
                                 * has to tell us to forget fsync requests before it starts
                                 * deletions.
                                 */
                                for (failures = 0;; failures++) /* loop exits at "break" */
                                {
                                        SMgrRelation reln;
                                        MdfdVec    *seg;
                                        char       *path;
                                        int                     save_errno;

                                        /*
                                         * Find or create an smgr hash entry for this relation.
                                         * This may seem a bit unclean -- md calling smgr?      But
                                         * it's really the best solution.  It ensures that the
                                         * open file reference isn't permanently leaked if we get
                                         * an error here. (You may say "but an unreferenced
                                         * SMgrRelation is still a leak!" Not really, because the
                                         * only case in which a checkpoint is done by a process
                                         * that isn't about to shut down is in the checkpointer,
                                         * and it will periodically do smgrcloseall(). This fact
                                         * justifies our not closing the reln in the success path
                                         * either, which is a good thing since in non-checkpointer
                                         * cases we couldn't safely do that.)
                                         */
                                        reln = smgropen(entry->rnode, InvalidBackendId);

                                        /* Attempt to open and fsync the target segment */
                                        seg = _mdfd_getseg(reln, forknum,
                                                         (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
                                                                           false, EXTENSION_RETURN_NULL);

                                        INSTR_TIME_SET_CURRENT(sync_start);

                                        if (seg != NULL &&
                                                FileSync(seg->mdfd_vfd) >= 0)
                                        {
                                                /* Success; update statistics about sync timing */
                                                INSTR_TIME_SET_CURRENT(sync_end);
                                                sync_diff = sync_end;
                                                INSTR_TIME_SUBTRACT(sync_diff, sync_start);
                                                elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
                                                if (elapsed > longest)
                                                        longest = elapsed;
                                                total_elapsed += elapsed;
                                                processed++;
                                                if (log_checkpoints)
                                                        elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
                                                                 processed,
                                                                 FilePathName(seg->mdfd_vfd),
                                                                 (double) elapsed / 1000);

                                                break;  /* out of retry loop */
                                        }
                                        /* Compute file name for use in message */
                                        save_errno = errno;
                                        path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
                                        errno = save_errno;

                                        /*
                                         * It is possible that the relation has been dropped or
                                         * truncated since the fsync request was entered.
                                         * Therefore, allow ENOENT, but only if we didn't fail
                                         * already on this file.  This applies both for
                                         * _mdfd_getseg() and for FileSync, since fd.c might have
                                         * closed the file behind our back.
                                         *
                                         * XXX is there any point in allowing more than one retry?
                                         * Don't see one at the moment, but easy to change the
                                         * test here if so.
                                         */
                                        if (!FILE_POSSIBLY_DELETED(errno) ||
                                                failures > 0)
                                                ereport(ERROR,
                                                                (errcode_for_file_access(),
                                                                 errmsg("could not fsync file \"%s\": %m",
                                                                                path)));
                                        else
                                                ereport(DEBUG1,
                                                                (errcode_for_file_access(),
                                                errmsg("could not fsync file \"%s\" but retrying: %m",
                                                           path)));
                                        pfree(path);

                                        /*
                                         * Absorb incoming requests and check to see if a cancel
                                         * arrived for this relation fork.
                                         */
                                        AbsorbFsyncRequests();
                                        absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
                                        if (entry->canceled[forknum])
                                                break;
                                }                               /* end retry loop */
                        }
                        bms_free(requests);
                }

                /*
                 * We've finished everything that was requested before we started to
                 * scan the entry.  If no new requests have been inserted meanwhile,
                 * remove the entry.  Otherwise, update its cycle counter, as all the
                 * requests now in it must have arrived during this cycle.
                 */
                for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
                {
                        if (entry->requests[forknum] != NULL)
                                break;
                }
                if (forknum <= MAX_FORKNUM)
                        entry->cycle_ctr = mdsync_cycle_ctr;
                else
                {
                        /* Okay to remove it */
                        if (hash_search(pendingOpsTable, &entry->rnode,
                                                        HASH_REMOVE, NULL) == NULL)
                                elog(ERROR, "pendingOpsTable corrupted");
                }
        }                                                       /* end loop over hashtable entries */

        /* Return sync performance metrics for report at checkpoint end */
        CheckpointStats.ckpt_sync_rels = processed;
        CheckpointStats.ckpt_longest_sync = longest;
        CheckpointStats.ckpt_agg_sync_time = total_elapsed;

        /* Flag successful completion of mdsync */
        mdsync_in_progress = false;
}
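
One detail worth isolating is the cycle counter that keeps mdsync from chasing its own tail: requests queued after the counter is advanced belong to the next cycle and are skipped this time. Here is a minimal standalone sketch (not PostgreSQL source) of that comparison.

/*
 * Minimal standalone sketch (not PostgreSQL source) of the cycle-counter
 * trick in mdsync: entries tagged with the already-advanced counter were
 * queued after this checkpoint began and are left for the next cycle.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned short CycleCtr;                /* small counter that may wrap around */

static CycleCtr mdsync_cycle_ctr = 0;

static bool
process_this_cycle(CycleCtr entry_ctr)
{
        return entry_ctr != mdsync_cycle_ctr;   /* new entries carry the current value */
}

int
main(void)
{
        CycleCtr        old_entry = mdsync_cycle_ctr;   /* queued before the checkpoint */
        CycleCtr        new_entry;

        mdsync_cycle_ctr++;                     /* checkpoint begins a new fsync cycle */
        new_entry = mdsync_cycle_ctr;           /* request absorbed while fsyncing */

        printf("old entry fsync'd now: %d\n", process_this_cycle(old_entry));   /* 1 */
        printf("new entry fsync'd now: %d\n", process_this_cycle(new_entry));   /* 0 */
        return 0;
}
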


[Summary]
The checkpointer flushes the buffer cache in a few main steps:
1. Scan the shared buffer pool and add the BM_CHECKPOINT_NEEDED flag to the buffers that are currently dirty.
2. Scan the shared buffer pool again and write the buffers flagged in step 1 to disk; before each write, the WAL up to that buffer's LSN must already be flushed to disk.
3. Sync (fsync) the writes from step 2 to persistent storage.
The time spent in each phase can be observed through the probes shown above, or through the checkpoint log output.
The next article will cover tracing checkpoints.
