https://postgreshelp.com/postgresql-checkpoint/
PostgreSQL Checkpoint Demystified
BY POSTGRESHELP · FEBRUARY 11, 2019
PostgreSQL Checkpoint: Database blocks are temporarily stored in Database shared buffers. As blocks are read, they are stored in DB shared buffers so that if any user accesses them later, they are available in memory and need not be read from the disk.
When we update any row, the buffer in DB shared buffers corresponding to the block containing that row is updated in memory. Record of the change made is kept in wal buffer. On commit, the changes we made are written to the wal file on disk thereby making them permanent. If the instance crashed right then, the DB shared buffers will be wiped out but on restarting the database, PostgreSQL will apply the changes recorded in WAL files to the corresponding data files.
Why doesn’t PostgreSQL write the changes to data files right away when we commit the transaction?
The reason is simple. If it chose to write directly to the data files, it will have to physically locate the data block in the datafile first and then update it which means that after committing, a user has to wait until it finds and writes.
Moreover writing to datafile is not sequential whereas writing to WAL is sequential.
So wal writer process writes the changes to WAL file and later Checkpointer writes the corresponding dirty blocks to the respective data file.
wait, what?
checkpointer process?
Yes,
Dirty blocks are written into datafile by 3 different processes, Checkpointer process, writer process(转帖者注:应该是background writer) and user backend process.
We will limit our discussion here to CHECKPOINT only.
Table of Contents
What is a checkpoint?
A checkpoint is a synchronization event at a specific point in time which causes some or all dirty pages in the shared buffer to be written to disk.
But why?
In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the Write ahead log from which it should start the REDO operation.
Confused?
Let’s understand the life cycle of a database change.
As we all know PostgreSQL keeps a cache of pages in RAM(shared_buffers) to make it available to all backends for faster access.
How does a change happen in a database?
STEP 1: Whenever PostgreSQL needs to change anything on a give a table, it checks if it is in shared_buffers, if yes, it modifies them directly, and if not – it loads the required pages into shared_buffers and then modifies them.
STEP 2: The modified pages are termed as dirty pages.
STEP 3: Upon commit, the changes we made are written to the Write ahead log thereby making them permanent on disk.
STEP 4: The dirty pages are flushed to the respective file to disk periodically.
But how often dirty pages are written to disk?
This is where checkpoint comes in to picture.
At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. Any changes made to data files before that point are guaranteed to be already on disk.
If your system crashes, recovery from the crash will start from this new last point where the database was sure that everything on disk was consistent.
A checkpoint occurs in the following cases:
When the administrator runs the CHECKPOINT statement.
With the interval specified in parameter checkpoint_timeout(default 300 seconds)
Amount of data written to the WAL has reached a parameter max_wal_size (default: 1GB).
At the start of an online backup
At the execution of pg_start_backup function
At the execution of a pg_basebackup command
At the shutdown of an instance(Except for the pg_ctl stop -m immediate command execution)
At the time of database configuration such as CREATE DATABASE / DROP DATABASE statement
during a CHECKPOINT, the database needs to perform these three basic steps
1. Identify all the dirty pages in shared buffers.
2. Write the dirty pages to the respective files.
3. Issue fsync() on all the modified files to disk.
NOTE:The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk.