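I. Background
A production database recently raised a memory alert: a single process was using more than 25% of the host's memory. The host turned out to be a remote disaster-recovery postgres instance, and the process caught by the alert was the wal receiver, which appeared to use nearly 85GB. In top, the 85GB showed up under SHR, and a check of the configuration confirmed that shared_buffers was indeed set to 85GB. pmap showed that almost all of the 85GB sat in a mapping named /dev/zero. So what does /dev/zero have to do with shared memory?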
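II. Shared memory
1. What is shared memory
Shared memory is arguably the most useful form of inter-process communication, and it is also the fastest IPC mechanism. When two processes A and B share memory, the same physical memory is mapped into the address space of each of them. Process A sees process B's updates to the shared data immediately, and vice versa.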
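2. How shared memory is implemented
Since the 2.2.x kernels, Linux has supported several shared memory mechanisms: System V shared memory, POSIX shared memory, and shared mappings created with mmap().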
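III. How PostgreSQL creates shared memory
PostgreSQL uses the mechanisms mentioned above. Since PostgreSQL 9.3 the main shared memory area is created with POSIX mmap(), while System V shared memory is kept only as an interlock that protects the data directory.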
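So why move to mmap?
PostgreSQL uses buffered I/O: besides its own shared_buffers it also relies on the kernel page cache. This speeds up writes and reduces the number of disk reads, which eases the I/O load, but data has to be copied between the process's memory and the page cache, and with a large cache that costs noticeable CPU and memory. The bottleneck is the copy between kernel space and user space.
A file-backed mmap maps a disk file into user memory. In effect, the pages of the page cache are mapped straight into the process's address space, so the process reaches them through its own virtual addresses, avoids the copy from kernel space to user space, and reads or writes the file as if it were memory. Note, however, that the mapping PostgreSQL creates here is anonymous (MAP_ANONYMOUS, no backing file), so it does not change how data files are read or written. The practical benefit of the 9.3 change is that the main segment no longer counts against the System V limits (SHMMAX, SHMALL), so those kernel parameters rarely need tuning.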
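Roughly, an mmap call proceeds as follows:
1) Check the arguments and set the vma flags according to the requested mapping type.
2) Search the process's virtual address space for a free range that satisfies the request.
3) Initialize a vma for that range.
4) Set vma->vm_file.
5) Call the mmap handler of the file's file system, which sets vma->vm_ops to the matching vm_operations_struct.
6) Insert the vma into the mm's list of mappings.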
Here is the prototype of mmap:
void *mmap(void *addr, size_t length, int prot, int flags,
int fd, off_t offset);
do_mmap() is the kernel function that does the main work of mmap:
unsigned long do_mmap(struct file *file, unsigned long addr,
                      unsigned long len, unsigned long prot,
                      unsigned long flags, vm_flags_t vm_flags,
                      unsigned long pgoff, unsigned long *populate,
                      struct list_head *uf)
{
    /*** some lines omitted ***/
    } else {
        /* This branch handles the case where no file is given */
        switch (flags & MAP_TYPE) {
        /* For MAP_SHARED, vm_flags |= VM_SHARED | VM_MAYSHARE */
        case MAP_SHARED:
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            /*
             * Ignore pgoff.
             */
            pgoff = 0;
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        }
    /*** some lines omitted ***/
}
Part of the code that sets up the mapping:
unsigned long mmap_region(struct file *file, unsigned long addr,
                          unsigned long len, vm_flags_t vm_flags,
                          unsigned long pgoff, struct list_head *uf)
{
    /*** some lines omitted ***/
    } else if (vm_flags & VM_SHARED) {
        /*
         * VM_SHARED is set but no file was given, so
         * shmem_zero_setup() is called. The file that
         * shmem_zero_setup() sets up is named dev/zero.
         */
        error = shmem_zero_setup(vma);
        if (error)
            goto free_vma;
    }
    /*** some lines omitted ***/
}
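From the above: when mmap() is called with MAP_SHARED but without a file to map, the kernel backs the mapping with an internal shared-memory (shmem) file named dev/zero. This is not the /dev/zero character device itself, but it is the name that tools such as pmap display for the mapping.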
How PostgreSQL uses mmap:
At startup, PostgreSQL reaches mmap through this call chain:
PostmasterMain->reset_shared->CreateSharedMemoryAndSemaphores->PGSharedMemoryCreate->CreateAnonymousSegment->mmap
The function to look at is CreateAnonymousSegment():
/*
 * Creates an anonymous mmap()ed shared memory segment.
 *
 * Pass the requested size in *size.  This function will modify *size to the
 * actual size of the allocation, if it ends up allocating a segment that is
 * larger than requested.
 */
static void *
CreateAnonymousSegment(Size *size)
{
    Size        allocsize = *size;
    void       *ptr = MAP_FAILED;
    int         mmap_errno = 0;

#ifndef MAP_HUGETLB
    /* PGSharedMemoryCreate should have dealt with this case */
    Assert(huge_pages != HUGE_PAGES_ON);
#else
    if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
    {
        /*
         * Note: the database is configured with huge_pages=try, but this
         * server has no huge pages set up, so the mmap request for huge
         * pages below fails.
         */
        /*
         * Round up the request size to a suitable large value.
         */
        Size        hugepagesize;
        int         mmap_flags;

        GetHugePageSize(&hugepagesize, &mmap_flags);
        if (allocsize % hugepagesize != 0)
            allocsize += hugepagesize - (allocsize % hugepagesize);
        ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
                   PG_MMAP_FLAGS | mmap_flags, -1, 0);
        mmap_errno = errno;
        if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
            elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
                 allocsize);
    }
#endif
    if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
    {
        /*
         * Note: mmap is called here for size bytes (the size is computed in
         * CreateSharedMemoryAndSemaphores from the database GUC parameters).
         * PG_MMAP_FLAGS includes MAP_SHARED and no file is given, so,
         * following the kernel code above, the mapping shows up as
         * /dev/zero.
         */
        /*
         * Use the original size, not the rounded-up value, when falling back
         * to non-huge pages.
         */
        allocsize = *size;
        /*
         * Note: PG_MMAP_FLAGS is defined as
         * #define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
         */
        ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
                   PG_MMAP_FLAGS, -1, 0);
        mmap_errno = errno;
    }
    if (ptr == MAP_FAILED)
    {
        errno = mmap_errno;
        ereport(FATAL,
                (errmsg("could not map anonymous shared memory: %m"),
                 (mmap_errno == ENOMEM) ?
                 errhint("This error usually means that PostgreSQL's request "
                         "for a shared memory segment exceeded available memory, "
                         "swap space, or huge pages. To reduce the request size "
                         "(currently %zu bytes), reduce PostgreSQL's shared "
                         "memory usage, perhaps by reducing shared_buffers or "
                         "max_connections.",
                         *size) : 0));
    }
    *size = allocsize;
    return ptr;
}
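IV. Verification
The analysis above covered only a small part of the source code, but it suggests that PostgreSQL creates its shared memory with mmap and that the mapping shows up as /dev/zero. Next we step through startup in a debugger to verify this.
[postgres@postgres_zabbix data]$ gdb --args pg_ctl start //start PostgreSQL under gdb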
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl...done.
(gdb) b do_start //breakpoint 1: the function in pg_ctl that implements start
Breakpoint 1 at 0x402949: file pg_ctl.c, line 868.
(gdb) b start_postmaster //breakpoint 2: this function launches the postmaster daemon
Breakpoint 2 at 0x402078: file pg_ctl.c, line 437.
(gdb) set detach-on-fork off //pg_ctl forks a child, and the child uses exec to run the postgres command, which brings up the postmaster. gdb has to keep hold of the forked process, so detach-on-fork is turned off
(gdb) set follow-fork-mode child //the postmaster we want to debug is a child of pg_ctl, so follow-fork-mode is set to child
(gdb) r //run the program
Starting program: /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl start
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Breakpoint 1, do_start () at pg_ctl.c:868 //hit breakpoint 1
868 pgpid_t old_pid = 0;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) n
883 if (ctl_command == RESTART_COMMAND || pgdata_opt == NULL)
(gdb)
884 pgdata_opt = "";
(gdb)
886 if (exec_path == NULL)
(gdb)
887 exec_path = find_other_exec_or_die(argv0, "postgres", PG_BACKEND_VERSIONSTR); //about to execute postgres
(gdb)
[New process 3837]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Thread 0x7ffff7fdf740 (LWP 3837) is executing new program: /usr/bin/bash
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
Thread 0x7ffff7fdf740 (LWP 3837) is executing new program: /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres //postgres is executed here
Missing separate debuginfos, use: debuginfo-install bash-4.2.46-31.el7.x86_64
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Inferior 2 (process 3837) exited normally]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64.
(gdb) info inferiors //list the inferiors: postgres is present but not actually running yet
Num Description Executable
* 2 <null> /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
1 process 3833 /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl
(gdb) b PostmasterMain //set breakpoints along the mmap call chain analyzed earlier; breakpoint 3 is the main entry point of postgres
Breakpoint 3 at 0x73c466: file postmaster.c, line 567.
(gdb) b CreateSharedMemoryAndSemaphores //breakpoint 4
Breakpoint 4 at 0x793152: file ipci.c, line 96.
(gdb) b PGSharedMemoryCreate //breakpoint 5
Breakpoint 5 at 0x72ce30: file pg_shmem.c, line 579.
(gdb) b CreateAnonymousSegment //breakpoint 6
Breakpoint 6 at 0x72cbab: file pg_shmem.c, line 458.
(gdb) r //run
Starting program: /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Breakpoint 3, PostmasterMain (argc=1, argv=0xdc8bd0) at postmaster.c:567 //entered PostmasterMain
567 char *userDoption = NULL;
(gdb) n
568 bool listen_addr_saved = false;
(gdb) n
570 char *output_config_variable = NULL;
(gdb) n
572 MyProcPid = PostmasterPid = getpid();
(gdb) info inferiors //list the inferiors: the pid of the postgres main process is now known
Num Description Executable
* 2 process 4712 /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
1 process 3833 /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl
At this point pmap shows the memory layout of postgres without any /dev/zero mapping, only libraries, the stack and anon regions:
[postgres@postgres_zabbix data]$ pmap 4712
4712: /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
0000000000400000 7588K r-x-- postgres
0000000000d69000 4K r---- postgres
0000000000d6a000 52K rw--- postgres
0000000000d77000 616K rw--- [ anon ]
00007ffff097d000 232K r-x-- pg_pathman.so
00007ffff09b7000 2048K ----- pg_pathman.so
00007ffff0bb7000 4K r---- pg_pathman.so
00007ffff0bb8000 8K rw--- pg_pathman.so
00007ffff0bba000 103592K r---- locale-archive
00007ffff70e4000 92K r-x-- libpthread-2.17.so
00007ffff70fb000 2044K ----- libpthread-2.17.so
00007ffff72fa000 4K r---- libpthread-2.17.so
00007ffff72fb000 4K rw--- libpthread-2.17.so
00007ffff72fc000 16K rw--- [ anon ]
00007ffff7300000 1800K r-x-- libc-2.17.so
00007ffff74c2000 2048K ----- libc-2.17.so
00007ffff76c2000 16K r---- libc-2.17.so
00007ffff76c6000 8K rw--- libc-2.17.so
00007ffff76c8000 20K rw--- [ anon ]
00007ffff76cd000 1028K r-x-- libm-2.17.so
00007ffff77ce000 2044K ----- libm-2.17.so
00007ffff79cd000 4K r---- libm-2.17.so
00007ffff79ce000 4K rw--- libm-2.17.so
00007ffff79cf000 8K r-x-- libdl-2.17.so
00007ffff79d1000 2048K ----- libdl-2.17.so
00007ffff7bd1000 4K r---- libdl-2.17.so
00007ffff7bd2000 4K rw--- libdl-2.17.so
00007ffff7bd3000 28K r-x-- librt-2.17.so
00007ffff7bda000 2044K ----- librt-2.17.so
00007ffff7dd9000 4K r---- librt-2.17.so
00007ffff7dda000 4K rw--- librt-2.17.so
00007ffff7ddb000 136K r-x-- ld-2.17.so
00007ffff7fde000 20K rw--- [ anon ]
00007ffff7ff9000 4K rw--- [ anon ]
00007ffff7ffa000 8K r-x-- [ anon ]
00007ffff7ffc000 4K r---- ld-2.17.so
00007ffff7ffd000 4K rw--- ld-2.17.so
00007ffff7ffe000 4K rw--- [ anon ]
00007ffffffde000 132K rw--- [ stack ]
ffffffffff600000 4K r-x-- [ anon ]
total 127736K
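Continue debugging:
(gdb) c //we single-step to see the whole process clearly, but there are many lines of code between breakpoint 3 (PostmasterMain) and breakpoint 4, so we continue here to skip the rest and stop at the next breakpoint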
Continuing.
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.
Breakpoint 4, CreateSharedMemoryAndSemaphores (makePrivate=0 '\000', port=6543) at ipci.c:96 //hit breakpoint 4; this function mainly computes how much memory to request, adding one term after another
96 PGShmemHeader *shim = NULL;
(gdb) n
98 if (!IsUnderPostmaster)
(gdb)
113 size = 100000;
(gdb)
114 size = add_size(size, SpinlockSemaSize());
(gdb) p size
$4 = 100000
(gdb) n
115 size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
(gdb) p size
$5 = 100000
(gdb) m
Ambiguous command "m": macro, maintenance, make, mem, monitor, mt.
(gdb) n
117 size = add_size(size, BufferShmemSize());
(gdb) n
118 size = add_size(size, LockShmemSize());
(gdb) n
119 size = add_size(size, PredicateLockShmemSize());
(gdb) n
120 size = add_size(size, ProcGlobalShmemSize());
(gdb)
121 size = add_size(size, XLOGShmemSize());
(gdb)
122 size = add_size(size, CLOGShmemSize());
(gdb)
123 size = add_size(size, CommitTsShmemSize());
(gdb)
124 size = add_size(size, SUBTRANSShmemSize());
(gdb)
125 size = add_size(size, TwoPhaseShmemSize());
(gdb)
126 size = add_size(size, BackgroundWorkerShmemSize());
(gdb)
127 size = add_size(size, MultiXactShmemSize());
(gdb)
128 size = add_size(size, LWLockShmemSize());
(gdb)
129 size = add_size(size, ProcArrayShmemSize());
(gdb)
130 size = add_size(size, BackendStatusShmemSize());
(gdb)
131 size = add_size(size, SInvalShmemSize());
(gdb)
132 size = add_size(size, PMSignalShmemSize());
(gdb)
133 size = add_size(size, ProcSignalShmemSize());
(gdb)
134 size = add_size(size, CheckpointerShmemSize());
(gdb)
135 size = add_size(size, AutoVacuumShmemSize());
(gdb)
136 size = add_size(size, ReplicationSlotsShmemSize());
(gdb)
137 size = add_size(size, ReplicationOriginShmemSize());
(gdb)
138 size = add_size(size, WalSndShmemSize());
(gdb)
139 size = add_size(size, WalRcvShmemSize());
(gdb)
140 size = add_size(size, SnapMgrShmemSize());
(gdb)
141 size = add_size(size, BTreeShmemSize());
(gdb)
142 size = add_size(size, SyncScanShmemSize());
(gdb)
143 size = add_size(size, AsyncShmemSize());
(gdb)
149 addin_request_allowed = false;
(gdb)
150 size = add_size(size, total_addin_request);
(gdb) n
153 size = add_size(size, 8192 - (size % 8192));
(gdb) n
155 elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
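(gdb) p size //the total shared memory to request comes to 148455424 bytes, i.e. 144976KB; with the database log level set to DEBUG3 this value is also printed by the elog call above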
$10 = 148455424
(gdb) n
160 seghdr = PGSharedMemoryCreate(size, makePrivate, port, &shim);
(gdb)
Breakpoint 5, PGSharedMemoryCreate (size=148455424, makePrivate=0 '\000', port=6543, shim=0x7fffffffe1e0) at pg_shmem.c:579 //hit breakpoint 5
579 AnonymousShmem = CreateAnonymousSegment(&size);
(gdb)
Breakpoint 6, CreateAnonymousSegment (size=0x7fffffffe0e8) at pg_shmem.c:458 //hit breakpoint 6; mmap is called in this function to request the shared memory
458 Size allocsize = *size;
(gdb) n
459 void *ptr = MAP_FAILED;
(gdb)
460 int mmap_errno = 0;
(gdb)
466 if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
(gdb) n //try to allocate huge pages
474 GetHugePageSize(&hugepagesize, &mmap_flags);
(gdb)
476 if (allocsize % hugepagesize != 0)
(gdb)
477 allocsize += hugepagesize - (allocsize % hugepagesize);
(gdb) n
479 ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
(gdb) n
481 mmap_errno = errno;
(gdb) n
482 if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
(gdb) n
483 elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
(gdb) n //the server has no huge pages configured, so the request fails
488 if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
(gdb) n //fall back to a regular mmap, which succeeds and shows up as /dev/zero
494 allocsize = *size;
(gdb) n
495 ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
(gdb) n
497 mmap_errno = errno;
(gdb) n
500 if (ptr == MAP_FAILED)
(gdb)
515 *size = allocsize;
(gdb) p allocsize //the requested size is 148455424 bytes, i.e. 144976KB
$22 = 148455424
pmap now shows a /dev/zero mapping (displayed here as zero) of 144976KB, which confirms the analysis above.
[postgres@postgres_zabbix data]$ pmap 4712
4712: /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
0000000000400000 7588K r-x-- postgres
0000000000d69000 4K r---- postgres
0000000000d6a000 52K rw--- postgres
0000000000d77000 616K rw--- [ anon ]
00007fffe7be9000 144976K rw-s- zero (deleted)
00007ffff097d000 232K r-x-- pg_pathman.so
00007ffff09b7000 2048K ----- pg_pathman.so
00007ffff0bb7000 4K r---- pg_pathman.so
00007ffff0bb8000 8K rw--- pg_pathman.so
00007ffff0bba000 103592K r---- locale-archive
00007ffff70e4000 92K r-x-- libpthread-2.17.so
00007ffff70fb000 2044K ----- libpthread-2.17.so
00007ffff72fa000 4K r---- libpthread-2.17.so
00007ffff72fb000 4K rw--- libpthread-2.17.so
00007ffff72fc000 16K rw--- [ anon ]
00007ffff7300000 1800K r-x-- libc-2.17.so
00007ffff74c2000 2048K ----- libc-2.17.so
00007ffff76c2000 16K r---- libc-2.17.so
00007ffff76c6000 8K rw--- libc-2.17.so
00007ffff76c8000 20K rw--- [ anon ]
00007ffff76cd000 1028K r-x-- libm-2.17.so
00007ffff77ce000 2044K ----- libm-2.17.so
00007ffff79cd000 4K r---- libm-2.17.so
00007ffff79ce000 4K rw--- libm-2.17.so
00007ffff79cf000 8K r-x-- libdl-2.17.so
00007ffff79d1000 2048K ----- libdl-2.17.so
00007ffff7bd1000 4K r---- libdl-2.17.so
00007ffff7bd2000 4K rw--- libdl-2.17.so
00007ffff7bd3000 28K r-x-- librt-2.17.so
00007ffff7bda000 2044K ----- librt-2.17.so
00007ffff7dd9000 4K r---- librt-2.17.so
00007ffff7dda000 4K rw--- librt-2.17.so
00007ffff7ddb000 136K r-x-- ld-2.17.so
00007ffff7fde000 20K rw--- [ anon ]
00007ffff7ff9000 4K rw--- [ anon ]
00007ffff7ffa000 8K r-x-- [ anon ]
00007ffff7ffc000 4K r---- ld-2.17.so
00007ffff7ffd000 4K rw--- ld-2.17.so
00007ffff7ffe000 4K rw--- [ anon ]
00007ffffffde000 132K rw--- [ stack ]
ffffffffff600000 4K r-x-- [ anon ]
total 272712K
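One more small point. Someone once asked why the shared memory actually requested does not match the configured shared_buffers and is slightly larger. As the gdb session showed, CreateSharedMemoryAndSemaphores adds up many terms to arrive at the size to request, so it is normal for the real shared memory to be somewhat larger than shared_buffers. In this test environment shared_buffers was set to 131072kB and the actual request was 144976KB.
References: https://blog.csdn.net/lggbxf/article/details/94012088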
https://blog.csdn.net/mw_nice/article/details/82888091