Tune Linux Kernel Parameters For PostgreSQL Optimization
https://www.percona.com/blog/2018/08/29/tune-linux-kernel-parameters-for-postgresql-optimization/
For optimum performance, a PostgreSQL database depends on the operating system parameters being defined correctly. Poorly configured OS kernel parameters can cause degradation in database server performance. Therefore, it is imperative that these parameters are configured according to the database server and its workload. In this post, we will discuss some important Linux kernel parameters that can affect database server performance and how these should be tuned.
SHMMAX / SHMALL
SHMMAX is a kernel parameter used to define the maximum size of a single shared memory segment a Linux process can allocate. Until version 9.2, PostgreSQL uses System V (SysV) that requires SHMMAX setting. After 9.2, PostgreSQL switched to POSIX shared memory. So now it requires fewer bytes of System V shared memory.
Prior to version 9.3 SHMMAX was the most important kernel parameter. The value of SHMMAX is in bytes.
Similarly, SHMALL is another kernel parameter used to define system-wide total amount of shared memory pages. To view the current values for SHMMAX, SHMALL or SHMMIN, use the ipcs command.
SHM* Details - Linux
Shell
1 2 3 4 5 6 7 | $ ipcs -lm
------ Shared Memory Limits -------- max number of segments = 4096 max seg size (kbytes) = 1073741824 max total shared memory (kbytes) = 17179869184 min seg size (bytes) = 1 |
SHM* Details - MacOS X
Shell
1 2 3 4 5 6 7 8 | $ ipcs -M IPC status from as of Thu Aug 16 22:20:35 PKT 2018 shminfo: shmmax: 16777216 (max shared memory segment size) shmmin: 1 (min shared memory segment size) shmmni: 32 (max number of shared memory identifiers) shmseg: 8 (max shared memory segments per process) shmall: 1024 (max amount of shared memory in pages) |
PostgreSQL uses System V IPC to allocate shared memory. This parameter is one of the most important kernel parameters. Whenever you get following error messages, it means that you have an older version PostgreSQL and you have a very low SHMMAX value. Users are expected to adjust and increase the value according to the shared memory they are going to use.
Possible misconfiguration errors
If SHMMAX is misconfigured, you can get an error when trying to initialize a PostgreSQL cluster using the initdbcommand.
initdb Failure
Shell
1 2 3 4 5 6 7 8 9 10 | DETAIL: Failed system call was shmget(key=1, size=2072576, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX. To reduce the request size (currently 2072576 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration. child process exited with exit code 1 |
Similarly, you can get an error when starting the PostgreSQL server using the pg_ctl command.
pg_ctl Failure
Shell
1 2 3 4 5 6 7 8 9 10 | DETAIL: Failed system call was shmget(key=5432001, size=14385152, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX.; To reduce the request size (currently 14385152 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
If the request size is already small, it's possible that it is less than your kernel's SHMMIN parameter, in which case raising the request size or reconfiguring SHMMIN is called for.
The PostgreSQL documentation contains more information about shared memory configuration. |
Be aware of differing definitions
The definition of the SHMMAX/SHMALL parameters is slightly different between Linux and MacOS X. These are the definitions:
- Linux: kernel.shmmax, kernel.shmall
- MacOS X: kern.sysv.shmmax, kern.sysv.shmall
The sysctl command can be used to change the value temporarily. To permanently set the value, add an entry into /etc/sysctl.conf. The details are given below.
Change Kernel Parameters On MacOS X
Shell
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | # Get the value of SHMMAX sudo sysctl kern.sysv.shmmax kern.sysv.shmmax: 4096
# Get the value of SHMALL sudo sysctl kern.sysv.shmall kern.sysv.shmall: 4096
# Set the value of SHMMAX sudo sysctl -w kern.sysv.shmmax=16777216 kern.sysv.shmmax: 4096 -> 16777216<br>
# Set the value of SHMALL sudo sysctl -w kern.sysv.shmall=16777216 kern.sysv.shmall: 4096 -> 16777216 |
Change Kernel Parameters on Linux
Shell
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | # Get the value of SHMMAX sudo sysctl kernel.shmmax kernel.shmmax: 4096
# Get the value of SHMALL sudo sysctl kernel.shmall kernel.shmall: 4096
# Set the value of SHMMAX sudo sysctl -w kernel.shmmax=16777216 kernel.shmmax: 4096 -> 16777216<br>
# Set the value of SHMALL sudo sysctl -w kernel.shmall=16777216 kernel.shmall: 4096 -> 16777216 |
Remember: to make the change permanent add these values in /etc/sysctl.conf
Huge Pages
Linux, by default, uses 4K memory pages, BSD has Super Pages, whereas Windows has Large Pages. A page is a chunk of RAM that is allocated to a process. A process may own more than one page depending on its memory requirements. The more memory a process needs the more pages that are allocated to it. The OS maintains a table of page allocation to processes. The smaller the page size, the bigger the table, the more time required to look up a page in that page table. Therefore, huge pages make it possible to use a large amount of memory with reduced overheads; fewer page lookups, fewer page faults, faster read/write operations through larger buffers. This results in improved performance.
PostgreSQL has support for bigger pages on Linux only. By default, Linux uses 4K of memory pages, so in cases where there are too many memory operations, there is a need to set bigger pages. Performance gains have been observed by using huge pages with sizes 2 MB and up to 1 GB. The size of Huge Page can be set boot time. You can easily check the huge page settings and utilization on your Linux box using cat /proc/meminfo | grep -i huge command.
Get HugePage Info - On Linux (only)
Shell
1 2 3 4 5 6 7 8 | Note: This is only for Linux, for other OS this operation is ignored$ cat /proc/meminfo | grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB |
In this example, although huge page size is set at 2,048 (2 MB), the total number of huge pages has a value of 0. which signifies that huge pages are disabled.
Script to quantify Huge Pages
This is a simple script which returns the number of Huge Pages required. Execute the script on your Linux box while your PostgreSQL is running. Ensure that $PGDATA environment variable is set to PostgreSQL’s data directory.
Get Number of Required HugePages
Shell
1 2 3 4 5 6 7 8 9 | #!/bin/bash pid=`head -1 $PGDATA/postmaster.pid` echo "Pid: $pid" peak=`grep ^VmPeak /proc/$pid/status | awk '{ print $2 }'` echo "VmPeak: $peak kB" hps=`grep ^Hugepagesize /proc/meminfo | awk '{ print $2 }'` echo "Hugepagesize: $hps kB" hp=$((peak/hps)) echo Set Huge Pages: $hp |
The output of the script looks like this:
Script Output
Shell
1 2 3 4 | Pid: 12737 VmPeak: 180932 kB Hugepagesize: 2048 kB Set Huge Pages: 88 |
The recommended huge pages are 88, therefore you should set the value to 88.
Set HugePages
Shell
1 | sysctl -w vm.nr_hugepages= 88 |
Check the huge pages now, you will see no huge page is in use (HugePages_Free = HugePages_Total).
Again Get HugePage Info - On Linux (only)
Shell
1 2 3 4 5 6 7 8 | $ cat /proc/meminfo | grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 88 HugePages_Free: 88 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB |
Now set the parameter huge_pages “on” in $PGDATA/postgresql.conf and restart the server.
And Again Get HugePage Info - On Linux (only)
Shell
1 2 3 4 5 6 7 8 | $ cat /proc/meminfo | grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 88 HugePages_Free: 81 HugePages_Rsvd: 64 HugePages_Surp: 0 Hugepagesize: 2048 kB |
Now you can see that a very few of the huge pages are used. Let’s now try to add some data into the database.
Some DB Operations to Utilise HugePages
Shell
1 2 3 4 | postgres=# CREATE TABLE foo(a INTEGER); CREATE TABLE postgres=# INSERT INTO foo VALUES(generate_Series(1,10000000)); INSERT 0 10000000 |
Let’s see if we are now using more huge pages than before.
Once More Get HugePage Info - On Linux (only)
Shell
1 2 3 4 5 6 7 8 | $ cat /proc/meminfo | grep -i huge AnonHugePages: 0 kB ShmemHugePages: 0 kB HugePages_Total: 88 HugePages_Free: 18 HugePages_Rsvd: 1 HugePages_Surp: 0 Hugepagesize: 2048 kB |
Now you can see that most of the huge pages are in use.
Note: The sample value for HugePages used here is very low, which is not a normal value for a big production machine. Please assess the required number of pages for your system and set those accordingly depending on your system’s workload and resources.
vm.swappiness
vm.swappiness is another kernel parameter that can affect the performance of the database. This parameter is used to control the swappiness (swapping pages to and from swap memory into RAM) behavior on a Linux system. The value ranges from 0 to 100. It controls how much memory will be swapped or paged out. Zero means disable swap and 100 means aggressive swapping.
You may get good performance by setting lower values.
Setting a value of 0 in newer kernels may cause the OOM Killer (out of memory killer process in Linux) to kill the process. Therefore, you can be on the safe side and set the value to 1 if you want to minimize swapping. The default value on a Linux system is 60. A higher value causes the MMU (memory management unit) to utilize more swap space than RAM, whereas a lower value preserves more data/code in memory.
A smaller value is a good bet to improve performance in PostgreSQL.
vm.overcommit_memory / vm.overcommit_ratio
Applications acquire memory and free that memory when it is no longer needed. But in some cases, an application acquires too much memory and does not release it. This can invoke the OOM killer. Here are the possible values for vm.overcommit_memory parameter with a description for each:
-
- Heuristic overcommit, Do it intelligently (default); based kernel heuristics
- Allow overcommit anyway
- Don’t over commit beyond the overcommit ratio.
Reference: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
vm.overcommit_ratio is the percentage of RAM that is available for overcommitment. A value of 50% on a system with 2 GB of RAM may commit up to 3 GB of RAM.
A value of 2 for vm.overcommit_memory yields better performance for PostgreSQL. This value maximizes RAM utilization by the server process without any significant risk of getting killed by the OOM killer process. An application will be able to overcommit, but only within the overcommit ratio, thus reducing the risk of having OOM killer kill the process. Hence a value to 2 gives better performance than the default 0 value. However, reliability can be improved by ensuring that memory beyond an allowable range is not overcommitted. It avoids the risk of the process being killed by OOM-killer.
On systems without swap, one may experience a problem when vm.overcommit_memory is 2.
https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
vm.dirty_background_ratio / vm.dirty_background_bytes
The vm.dirty_background_ratio is the percentage of memory filled with dirty pages that need to be flushed to disk. Flushing is done in the background. The value of this parameter ranges from 0 to 100; however, a value lower than 5 may not be effective and some kernels do not internally support it. The default value is 10 on most Linux systems. You can gain performance for write-intensive operations with a lower ratio, which means that Linux flushes dirty pages in the background.
You need to set a value of vm.dirty_background_bytes depending on your disk speed.
There are no “good” values for these two parameters since both depend on the hardware. However, setting vm.dirty_background_ratio to 5 and vm.dirty_background_bytes to 25% of your disk speed improves performance by up to ~25% in most cases.
vm.dirty_ratio / dirty_bytes
This is the same as vm.dirty_background_ratio / dirty_background_bytes except that the flushing is done in the foreground, blocking the application. So vm.dirty_ratio should be higher than vm.dirty_background_ratio. This will ensure that background processes kick in before the foreground processes to avoid blocking the application, as much as possible. You can tune the difference between the two ratios depending on your disk IO load.
Summing up
You can tune other parameters for performance, but the improvement gains are likely to be minimal. We must keep in mind that not all parameters are relevant to all applications types. Some applications perform better by tuning some parameters and some applications don’t. You need to find a good balance between these parameter configurations for the expected application workload and type, and OS behavior must also be kept in mind when making adjustments. Tuning kernel parameters are not as easy as tuning database parameters: it’s harder to be prescriptive.
In my next post, I’ll take a look at tuning PostgreSQL’s database parameters.
You May Also Like
Did you know sysbench-tpcc works with PostgreSQL now? Read our blog to learn how you can set up PostgreSQL to perform optimally for sysbench-tpcc.
Reviewing raw MySQL query logs can be more useful than working with tools that only have summary data. Percona CEO Peter Zaitsev details why this is the case and the benefits of examining this kind of information in his blog, Why Analyze Raw MySQL Query Logs?