An Analysis of How PostgreSQL Creates Its Shared Memory

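I. Background
A production host recently raised a memory alert because a single process was using more than 25% of memory. The host turned out to run a remote disaster-recovery postgres instance. The process caught by the alert was the wal receiver, which appeared to use nearly 85GB. top showed that the 85GB was all in the SHR column, and checking the configuration confirmed that shared_buffers was indeed set to 85GB. pmap showed that almost all of the 85GB was attributed to /dev/zero. So what is the relationship between /dev/zero and shared memory?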

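II. Shared Memory

1. What is shared memory
Shared memory is arguably the most useful form of inter-process communication, and also the fastest. When two processes A and B share memory, the same physical memory is mapped into the address space of each of them. A immediately sees any update B makes to the data in shared memory, and vice versa.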
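
As a minimal sketch of this behavior (not taken from PostgreSQL), the helper below maps one int as shared anonymous memory, lets a forked child store a value into it, and returns what the parent reads afterwards. The function name value_seen_by_parent is made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Map one int as shared anonymous memory, let a forked child store `value`
 * into it, and return what the parent reads once the child has exited.
 * Returns -1 if mmap() or fork() fails.
 */
static int
value_seen_by_parent(int value)
{
	int		   *shared;
	int			seen;
	pid_t		pid;

	assert(value >= 0);		/* -1 is reserved for reporting failure */

	shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (shared == MAP_FAILED)
		return -1;
	*shared = 0;

	pid = fork();
	if (pid < 0)
	{
		munmap(shared, sizeof(int));
		return -1;
	}
	if (pid == 0)
	{
		/* Child: this store lands in the physical page shared with the parent. */
		*shared = value;
		_exit(0);
	}

	/* Parent: wait for the child, then read the same page. */
	waitpid(pid, NULL, 0);
	seen = *shared;
	munmap(shared, sizeof(int));
	return seen;
}
```

With MAP_PRIVATE instead of MAP_SHARED the child would write to its own copy-on-write page, and the parent would still read 0.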

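2. How shared memory is implemented
Linux has supported several shared memory mechanisms since the 2.2.x kernels: shared mappings created with mmap(), System V shared memory, and POSIX shared memory.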

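III. Shared Memory Creation in pg

PostgreSQL uses all of the mechanisms mentioned above. Since version 9.3 the main shared memory segment is created with an anonymous shared mmap(), and System V shared memory is kept only as a small segment that provides the interlock protecting the data directory.

So why switch to mmap?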

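PostgreSQL uses buffered I/O: besides its own shared_buffers it also relies on the kernel page cache. This improves write efficiency and reduces the number of disk reads, which eases the I/O load, but data has to be copied between the process's memory and the page cache, and with a large cache this costs noticeable CPU and memory. The bottleneck is the copy between kernel space and user space.

A file-backed mmap maps a file on disk into the process's address space. In effect it maps the pages of the page cache directly into the process, so the process reads and writes the file through ordinary virtual addresses and the kernel-to-user copy disappears. This is not, however, what PostgreSQL does for its shared memory: as the code further down shows, the segment is an anonymous mapping with no file behind it, and data files are still read and written through buffered I/O. The motivation stated in the 9.3 release notes is a different one: an mmap-based segment is not subject to the System V kernel limits (SHMMAX, SHMALL), which previously had to be raised before a large shared_buffers could be used.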
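
To illustrate the file-backed case described above, here is a small sketch, assuming a Linux system with a writable /tmp. It stores a string into a temporary file purely through a MAP_SHARED mapping and then reads it back through the file descriptor with pread(). The function name and the /tmp/mmap_demo_XXXXXX template are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Store `msg` in a temporary file through a MAP_SHARED file mapping, then
 * read it back with pread(). Returns 0 if the bytes read through the file
 * descriptor match what was stored through the mapping, -1 otherwise.
 */
static int
write_via_mmap_read_via_fd(const char *msg)
{
	char		path[] = "/tmp/mmap_demo_XXXXXX";
	char		buf[128];
	size_t		len = strlen(msg) + 1;
	char	   *map;
	int			fd;
	int			rc = -1;

	assert(len <= sizeof(buf));

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);				/* the file disappears once fd is closed */

	if (ftruncate(fd, (off_t) len) != 0)
		goto done;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto done;

	/* A plain memory store: no write() call is involved. */
	memcpy(map, msg, len);

	/* The file descriptor sees the same page cache pages as the mapping. */
	if (pread(fd, buf, len, 0) == (ssize_t) len && memcmp(buf, msg, len) == 0)
		rc = 0;

	munmap(map, len);
done:
	close(fd);
	return rc;
}
```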

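An mmap mapping is set up roughly as follows:

1). Check the arguments and set the vma flags according to the requested mapping type

2). Search the process's virtual address space for a free range that satisfies the request

3). Initialize a vma for the range that was found

4). Set vma->vm_file

5). Call the mmap handler of the file's file_operations, which sets vma->vm_ops according to the file system type

6). Insert the vma into the mm's list of mappings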

Here is the prototype of mmap():

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

In the kernel, do_mmap() is the function that does the main work of mmap:

unsigned long do_mmap(struct file *file, unsigned long addr,
                        unsigned long len, unsigned long prot,
                        unsigned long flags, vm_flags_t vm_flags,
                        unsigned long pgoff, unsigned long *populate,
                        struct list_head *uf)
{
	/* ... some lines omitted ... */
	 } else {
	                switch (flags & MAP_TYPE) {
	                /* when flags contains MAP_SHARED: vm_flags |= VM_SHARED | VM_MAYSHARE */
	                case MAP_SHARED:
	                        if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
	                                return -EINVAL;
	                        /*
	                         * Ignore pgoff.
	                         */
	                        pgoff = 0;
	                        vm_flags |= VM_SHARED | VM_MAYSHARE;
	                        break;
	          }              
	/* ... some lines omitted ... */
}

Part of the source code that establishes the mapping:

unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf)
{
	
	/* ... some lines omitted ... */
	} else if (vm_flags & VM_SHARED) {
			/*
			 * VM_SHARED is set but no file was given to map, so
			 * shmem_zero_setup() is called. It backs the mapping with an
			 * unlinked shmem file that the kernel names "dev/zero".
			 */
			error = shmem_zero_setup(vma);
			if (error)
				goto free_vma;
	}
	/* ... some lines omitted ... */
}

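From the above: when mmap() is called with MAP_SHARED but without a file to map, the kernel backs the mapping with a shmem file named dev/zero, which is why tools such as pmap attribute the memory to /dev/zero.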
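
This can be checked from user space without a debugger. The sketch below, which assumes a Linux system with /proc mounted, creates a shared anonymous mapping and copies the matching line of /proc/self/maps into a buffer. The helper name describe_shared_anon_mapping is made up for this example. On a typical Linux kernel the line ends in "/dev/zero (deleted)", but the exact name is the kernel's choice and is not guaranteed by any interface.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Create a shared anonymous mapping of `len` bytes and copy the line of
 * /proc/self/maps that starts at its address into `line`.
 * Returns 0 on success, -1 if the mapping or the lookup fails.
 */
static int
describe_shared_anon_mapping(size_t len, char *line, size_t linesz)
{
	char		buf[512];
	FILE	   *maps;
	void	   *ptr;
	int			rc = -1;

	assert(line != NULL && linesz > 0);

	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;

	maps = fopen("/proc/self/maps", "r");
	if (maps != NULL)
	{
		while (fgets(buf, sizeof(buf), maps) != NULL)
		{
			/* Each line starts with "start-end" in hexadecimal. */
			unsigned long start = strtoul(buf, NULL, 16);

			if (start == (unsigned long) ptr)
			{
				snprintf(line, linesz, "%s", buf);
				rc = 0;
				break;
			}
		}
		fclose(maps);
	}

	munmap(ptr, len);
	return rc;
}
```

Printing the returned line from a small main() shows the permission string rw-s, the same flags pmap reports for the PostgreSQL segment in the verification section.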

How PostgreSQL uses mmap:

When PostgreSQL starts, the call chain that leads to mmap is:
PostmasterMain->reset_shared->CreateSharedMemoryAndSemaphores->PGSharedMemoryCreate->CreateAnonymousSegment->mmap

The function to look at is CreateAnonymousSegment():

/*
 * Creates an anonymous mmap()ed shared memory segment.
 *
 * Pass the requested size in *size.  This function will modify *size to the
 * actual size of the allocation, if it ends up allocating a segment that is
 * larger than requested.
 */
static void *
CreateAnonymousSegment(Size *size)
{
	Size		allocsize = *size;
	void	   *ptr = MAP_FAILED;
	int			mmap_errno = 0;

#ifndef MAP_HUGETLB
	/* PGSharedMemoryCreate should have dealt with this case */
	Assert(huge_pages != HUGE_PAGES_ON);
#else
	if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
	{
		/* huge_pages=try is configured, but this server has no huge pages reserved, so the mmap() with MAP_HUGETLB below fails */
		/*
		 * Round up the request size to a suitable large value.
		 */
		Size		hugepagesize;
		int			mmap_flags;

		GetHugePageSize(&hugepagesize, &mmap_flags);

		if (allocsize % hugepagesize != 0)
			allocsize += hugepagesize - (allocsize % hugepagesize);

		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
		mmap_errno = errno;
		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
			elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
				 allocsize);
	}
#endif

	if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
	{
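		/*
		 * mmap() is called here for `size` bytes (the size is computed in
		 * CreateSharedMemoryAndSemaphores from the GUC settings).
		 * PG_MMAP_FLAGS includes MAP_SHARED and no file is given (fd = -1),
		 * so per the kernel code above the mapping is backed by /dev/zero.
		 */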
		/*
		 * Use the original size, not the rounded-up value, when falling back
		 * to non-huge pages.
		 */
		allocsize = *size;
		/* PG_MMAP_FLAGS is defined as: #define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE) */
		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
				   PG_MMAP_FLAGS, -1, 0);
		mmap_errno = errno;
	}

	if (ptr == MAP_FAILED)
	{
		errno = mmap_errno;
		ereport(FATAL,
				(errmsg("could not map anonymous shared memory: %m"),
				 (mmap_errno == ENOMEM) ?
				 errhint("This error usually means that PostgreSQL's request "
					"for a shared memory segment exceeded available memory, "
					 "swap space, or huge pages. To reduce the request size "
						 "(currently %zu bytes), reduce PostgreSQL's shared "
					   "memory usage, perhaps by reducing shared_buffers or "
						 "max_connections.",
						 *size) : 0));
	}

	*size = allocsize;
	return ptr;
}
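
The try-then-fall-back logic of this function can be condensed into a short standalone sketch. This is not PostgreSQL code: the names round_up_to and map_shared_try_huge are made up, and the huge page size is simply assumed to be 2MB, whereas PostgreSQL obtains it from GetHugePageSize().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Round `size` up to the next multiple of `pagesize` (unchanged if aligned). */
static size_t
round_up_to(size_t size, size_t pagesize)
{
	assert(pagesize > 0);
	if (size % pagesize != 0)
		size += pagesize - (size % pagesize);
	return size;
}

/*
 * Try a huge-page mapping first and fall back to regular pages.
 * On return *size holds the size that was actually mapped.
 * Returns NULL if both attempts fail.
 */
static void *
map_shared_try_huge(size_t *size)
{
	void	   *ptr = MAP_FAILED;

#ifdef MAP_HUGETLB
	{
		/* Assumption for this sketch: 2MB huge pages. */
		size_t		hugesize = round_up_to(*size, (size_t) 2 * 1024 * 1024);

		ptr = mmap(NULL, hugesize, PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (ptr != MAP_FAILED)
		{
			*size = hugesize;
			return ptr;
		}
	}
#endif

	/* Fall back to regular pages, using the original, unrounded size. */
	ptr = mmap(NULL, *size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	return (ptr == MAP_FAILED) ? NULL : ptr;
}
```

For the 148455424-byte request seen in the debugging session below, the rounded-up value for 2MB pages would be 148897792 bytes. Because the huge-page attempt fails on that server, the original size is what ends up mapped.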

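IV. Verification

The sections above walked through a small part of the source code and showed that pg creates its shared memory with mmap, backed by /dev/zero. Next we step through startup with breakpoints to verify this.

[postgres@postgres_zabbix data]$ gdb --args pg_ctl start    //start pg under gdb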
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl...done.
(gdb) b do_start      //breakpoint 1: the function in pg_ctl that implements start
Breakpoint 1 at 0x402949: file pg_ctl.c, line 868.
(gdb) b start_postmaster //breakpoint 2: this function launches the postmaster daemon
Breakpoint 2 at 0x402078: file pg_ctl.c, line 437.
(gdb) set detach-on-fork off  //pg_ctl forks a child, and the child exec()s the postgres command to bring up the postmaster. gdb has to keep hold of the forked process, so detach-on-fork is turned off
(gdb) set follow-fork-mode child  //the postmaster we want to debug is a child of pg_ctl, so follow-fork-mode is set to child
(gdb) r                         //run the program
Starting program: /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl start
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, do_start () at pg_ctl.c:868      //breakpoint 1 hit
868             pgpid_t         old_pid = 0;
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
(gdb) n
883             if (ctl_command == RESTART_COMMAND || pgdata_opt == NULL)
(gdb)
884                     pgdata_opt = "";
(gdb)
886             if (exec_path == NULL)
(gdb)
887                     exec_path = find_other_exec_or_die(argv0, "postgres", PG_BACKEND_VERSIONSTR);                //locating the postgres executable that is about to be run
(gdb)
[New process 3837]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Thread 0x7ffff7fdf740 (LWP 3837) is executing new program: /usr/bin/bash
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64
Thread 0x7ffff7fdf740 (LWP 3837) is executing new program: /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres   //postgres is exec'd here
Missing separate debuginfos, use: debuginfo-install bash-4.2.46-31.el7.x86_64
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Inferior 2 (process 3837) exited normally]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.6.x86_64.
(gdb) info inferiors      //list the inferiors: postgres is present but not actually running yet
  Num  Description       Executable
* 2    <null>            /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
  1    process 3833      /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl
(gdb) b PostmasterMain   //set breakpoints along the mmap call chain analyzed earlier; breakpoint 3 is the main entry point of postgres
Breakpoint 3 at 0x73c466: file postmaster.c, line 567.
(gdb) b CreateSharedMemoryAndSemaphores       //breakpoint 4
Breakpoint 4 at 0x793152: file ipci.c, line 96.
(gdb) b PGSharedMemoryCreate                  //breakpoint 5
Breakpoint 5 at 0x72ce30: file pg_shmem.c, line 579.
(gdb) b CreateAnonymousSegment                //breakpoint 6
Breakpoint 6 at 0x72cbab: file pg_shmem.c, line 458.
(gdb) r                                        //run
Starting program: /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 3, PostmasterMain (argc=1, argv=0xdc8bd0) at postmaster.c:567  //entering PostmasterMain
567             char       *userDoption = NULL;
(gdb) n
568             bool            listen_addr_saved = false;
(gdb) n
570             char       *output_config_variable = NULL;
(gdb) n
572             MyProcPid = PostmasterPid = getpid();
(gdb) info inferiors     //list the inferiors: the postgres main process now has a pid
  Num  Description       Executable
* 2    process 4712      /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
  1    process 3833      /home/postgres/postgresql-9.6.6/pg9debug/bin/pg_ctl

At this point pmap shows no /dev/zero mapping in the postgres process, only the binary, shared libraries, the stack and anonymous regions.

[postgres@postgres_zabbix data]$ pmap 4712
4712:   /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
0000000000400000   7588K r-x-- postgres
0000000000d69000      4K r---- postgres
0000000000d6a000     52K rw--- postgres
0000000000d77000    616K rw---   [ anon ]
00007ffff097d000    232K r-x-- pg_pathman.so
00007ffff09b7000   2048K ----- pg_pathman.so
00007ffff0bb7000      4K r---- pg_pathman.so
00007ffff0bb8000      8K rw--- pg_pathman.so
00007ffff0bba000 103592K r---- locale-archive
00007ffff70e4000     92K r-x-- libpthread-2.17.so
00007ffff70fb000   2044K ----- libpthread-2.17.so
00007ffff72fa000      4K r---- libpthread-2.17.so
00007ffff72fb000      4K rw--- libpthread-2.17.so
00007ffff72fc000     16K rw---   [ anon ]
00007ffff7300000   1800K r-x-- libc-2.17.so
00007ffff74c2000   2048K ----- libc-2.17.so
00007ffff76c2000     16K r---- libc-2.17.so
00007ffff76c6000      8K rw--- libc-2.17.so
00007ffff76c8000     20K rw---   [ anon ]
00007ffff76cd000   1028K r-x-- libm-2.17.so
00007ffff77ce000   2044K ----- libm-2.17.so
00007ffff79cd000      4K r---- libm-2.17.so
00007ffff79ce000      4K rw--- libm-2.17.so
00007ffff79cf000      8K r-x-- libdl-2.17.so
00007ffff79d1000   2048K ----- libdl-2.17.so
00007ffff7bd1000      4K r---- libdl-2.17.so
00007ffff7bd2000      4K rw--- libdl-2.17.so
00007ffff7bd3000     28K r-x-- librt-2.17.so
00007ffff7bda000   2044K ----- librt-2.17.so
00007ffff7dd9000      4K r---- librt-2.17.so
00007ffff7dda000      4K rw--- librt-2.17.so
00007ffff7ddb000    136K r-x-- ld-2.17.so
00007ffff7fde000     20K rw---   [ anon ]
00007ffff7ff9000      4K rw---   [ anon ]
00007ffff7ffa000      8K r-x--   [ anon ]
00007ffff7ffc000      4K r---- ld-2.17.so
00007ffff7ffd000      4K rw--- ld-2.17.so
00007ffff7ffe000      4K rw---   [ anon ]
00007ffffffde000    132K rw---   [ stack ]
ffffffffff600000      4K r-x--   [ anon ]
 total           127736K

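Continue debugging:

(gdb) c       //we single-step to see the whole process clearly, but there are many lines between breakpoint 3 (PostmasterMain) and breakpoint 4, so we use continue here to skip the rest and stop at the next breakpoint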
Continuing.
Reading symbols from /home/postgres/postgresql-9.6.6/pg9debug/lib/pg_pathman.so...done.

Breakpoint 4, CreateSharedMemoryAndSemaphores (makePrivate=0 '\000', port=6543) at ipci.c:96   //breakpoint 4 hit: this function computes the size to request by adding up one component after another
96              PGShmemHeader *shim = NULL;
(gdb) n
98              if (!IsUnderPostmaster)
(gdb)
113                     size = 100000;
(gdb)
114                     size = add_size(size, SpinlockSemaSize());
(gdb) p size
$4 = 100000
(gdb) n
115                     size = add_size(size, hash_estimate_size(SHMEM_INDEX_SIZE,
(gdb) p size
$5 = 100000
(gdb) m
Ambiguous command "m": macro, maintenance, make, mem, monitor, mt.
(gdb) n
117                     size = add_size(size, BufferShmemSize());
(gdb) n
118                     size = add_size(size, LockShmemSize());
(gdb) n
119                     size = add_size(size, PredicateLockShmemSize());
(gdb) n
120                     size = add_size(size, ProcGlobalShmemSize());
(gdb)
121                     size = add_size(size, XLOGShmemSize());
(gdb)
122                     size = add_size(size, CLOGShmemSize());
(gdb)
123                     size = add_size(size, CommitTsShmemSize());
(gdb)
124                     size = add_size(size, SUBTRANSShmemSize());
(gdb)
125                     size = add_size(size, TwoPhaseShmemSize());
(gdb)
126                     size = add_size(size, BackgroundWorkerShmemSize());
(gdb)
127                     size = add_size(size, MultiXactShmemSize());
(gdb)
128                     size = add_size(size, LWLockShmemSize());
(gdb)
129                     size = add_size(size, ProcArrayShmemSize());
(gdb)
130                     size = add_size(size, BackendStatusShmemSize());
(gdb)
131                     size = add_size(size, SInvalShmemSize());
(gdb)
132                     size = add_size(size, PMSignalShmemSize());
(gdb)
133                     size = add_size(size, ProcSignalShmemSize());
(gdb)
134                     size = add_size(size, CheckpointerShmemSize());
(gdb)
135                     size = add_size(size, AutoVacuumShmemSize());
(gdb)
136                     size = add_size(size, ReplicationSlotsShmemSize());
(gdb)
137                     size = add_size(size, ReplicationOriginShmemSize());
(gdb)
138                     size = add_size(size, WalSndShmemSize());
(gdb)
139                     size = add_size(size, WalRcvShmemSize());
(gdb)
140                     size = add_size(size, SnapMgrShmemSize());
(gdb)
141                     size = add_size(size, BTreeShmemSize());
(gdb)
142                     size = add_size(size, SyncScanShmemSize());
(gdb)
143                     size = add_size(size, AsyncShmemSize());
(gdb)
149                     addin_request_allowed = false;
(gdb)
150                     size = add_size(size, total_addin_request);
(gdb) n
153                     size = add_size(size, 8192 - (size % 8192));
(gdb) n
155                     elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
(gdb) p size            //the total shared memory to request comes to 148455424 bytes, i.e. 144976KB; with the log level set to DEBUG3 this value is also printed to the screen
$10 = 148455424
(gdb) n
160                     seghdr = PGSharedMemoryCreate(size, makePrivate, port, &shim);
(gdb)

Breakpoint 5, PGSharedMemoryCreate (size=148455424, makePrivate=0 '\000', port=6543, shim=0x7fffffffe1e0) at pg_shmem.c:579          //breakpoint 5 hit
579             AnonymousShmem = CreateAnonymousSegment(&size);
(gdb)

Breakpoint 6, CreateAnonymousSegment (size=0x7fffffffe0e8) at pg_shmem.c:458     //breakpoint 6 hit: mmap is called here to request the shared memory
458             Size            allocsize = *size;
(gdb) n
459             void       *ptr = MAP_FAILED;
(gdb)
460             int                     mmap_errno = 0;
(gdb)
466             if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
(gdb) n         //try to allocate huge pages
474                     GetHugePageSize(&hugepagesize, &mmap_flags);
(gdb)
476                     if (allocsize % hugepagesize != 0)
(gdb)
477                             allocsize += hugepagesize - (allocsize % hugepagesize);
(gdb) n
479                     ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
(gdb) n
481                     mmap_errno = errno;
(gdb) n
482                     if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
(gdb) n
483                             elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m",
(gdb) n         //no huge pages are configured on this server, so the request fails
488             if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
(gdb) n           //fall back to a regular mmap, which succeeds and is backed by /dev/zero
494                     allocsize = *size;
(gdb) n
495                     ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
(gdb) n
497                     mmap_errno = errno;
(gdb) n
500             if (ptr == MAP_FAILED)
(gdb)
515             *size = allocsize;
(gdb) p allocsize  //the size requested is 148455424 bytes, i.e. 144976KB
$22 = 148455424 

pmap now shows a 144976KB mapping backed by /dev/zero (displayed here as "zero (deleted)"), which confirms the analysis above.

[postgres@postgres_zabbix data]$ pmap 4712
4712:   /home/postgres/postgresql-9.6.6/pg9debug/bin/postgres
0000000000400000   7588K r-x-- postgres
0000000000d69000      4K r---- postgres
0000000000d6a000     52K rw--- postgres
0000000000d77000    616K rw---   [ anon ]
00007fffe7be9000 144976K rw-s- zero (deleted)
00007ffff097d000    232K r-x-- pg_pathman.so
00007ffff09b7000   2048K ----- pg_pathman.so
00007ffff0bb7000      4K r---- pg_pathman.so
00007ffff0bb8000      8K rw--- pg_pathman.so
00007ffff0bba000 103592K r---- locale-archive
00007ffff70e4000     92K r-x-- libpthread-2.17.so
00007ffff70fb000   2044K ----- libpthread-2.17.so
00007ffff72fa000      4K r---- libpthread-2.17.so
00007ffff72fb000      4K rw--- libpthread-2.17.so
00007ffff72fc000     16K rw---   [ anon ]
00007ffff7300000   1800K r-x-- libc-2.17.so
00007ffff74c2000   2048K ----- libc-2.17.so
00007ffff76c2000     16K r---- libc-2.17.so
00007ffff76c6000      8K rw--- libc-2.17.so
00007ffff76c8000     20K rw---   [ anon ]
00007ffff76cd000   1028K r-x-- libm-2.17.so
00007ffff77ce000   2044K ----- libm-2.17.so
00007ffff79cd000      4K r---- libm-2.17.so
00007ffff79ce000      4K rw--- libm-2.17.so
00007ffff79cf000      8K r-x-- libdl-2.17.so
00007ffff79d1000   2048K ----- libdl-2.17.so
00007ffff7bd1000      4K r---- libdl-2.17.so
00007ffff7bd2000      4K rw--- libdl-2.17.so
00007ffff7bd3000     28K r-x-- librt-2.17.so
00007ffff7bda000   2044K ----- librt-2.17.so
00007ffff7dd9000      4K r---- librt-2.17.so
00007ffff7dda000      4K rw--- librt-2.17.so
00007ffff7ddb000    136K r-x-- ld-2.17.so
00007ffff7fde000     20K rw---   [ anon ]
00007ffff7ff9000      4K rw---   [ anon ]
00007ffff7ffa000      8K r-x--   [ anon ]
00007ffff7ffc000      4K r---- ld-2.17.so
00007ffff7ffd000      4K rw--- ld-2.17.so
00007ffff7ffe000      4K rw---   [ anon ]
00007ffffffde000    132K rw---   [ stack ]
ffffffffff600000      4K r-x--   [ anon ]
 total           272712K

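One more small point. A colleague once asked why the shared memory actually requested is slightly larger than the configured shared_buffers. As the gdb session shows, CreateSharedMemoryAndSemaphores adds up many components (BufferShmemSize, LockShmemSize, XLOGShmemSize, CLOGShmemSize and so on), and the sum is the size that gets requested, so it is normal for the segment to be somewhat larger than shared_buffers. In this test environment shared_buffers is set to 131072kB and the actual request is 144976KB, a difference of 13904KB.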
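
The arithmetic can be reproduced with a few helper functions. This is only a sketch modeled on what the trace shows: add_size_checked stands in for the idea behind add_size() (PostgreSQL's version reports an error on overflow instead of returning a code), pad_to_8k mirrors the expression on line 153 of ipci.c in the trace, and all three names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Overflow-checked addition. Stores a + b in *sum and returns 0,
 * or returns -1 without touching *sum if the addition would overflow.
 */
static int
add_size_checked(size_t a, size_t b, size_t *sum)
{
	assert(sum != NULL);
	if (b > SIZE_MAX - a)
		return -1;
	*sum = a + b;
	return 0;
}

/*
 * Mirror of "size + (8192 - (size % 8192))" from the trace. Note that a
 * size that is already a multiple of 8192 still grows by a full 8192.
 */
static size_t
pad_to_8k(size_t size)
{
	return size + (8192 - (size % 8192));
}

/* Convert a byte count to whole kilobytes, the unit pmap prints. */
static size_t
bytes_to_kb(size_t bytes)
{
	return bytes / 1024;
}
```

Applied to the values from the session, bytes_to_kb(148455424) gives 144976, which matches the pmap line, and subtracting the 131072kB of shared_buffers leaves 13904KB for everything else in the segment.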

References: https://blog.csdn.net/lggbxf/article/details/94012088
https://blog.csdn.net/mw_nice/article/details/82888091
