Analyzing a pg_stat_statements LWLock Exclusive-Lock Performance Problem

1. Problem Description

An on-call engineer noticed that an instance had gone abnormal, and the user reported it was unusable. Monitoring showed deadlocks, and even trivial SQL such as select 1 would "hang". The instance recovered on its own after a while.

We captured the stacks of a few SQL backends stuck in the parse phase; they looked roughly like this:

[postgres@postgres ~]$  ps -ef|grep pgbase|head -10|grep -v grep
postgres  13622 89757 93 16:02 ?        00:00:28 postgres: pguser pgbase 192.138.23.26(44877) SELECT
postgres  20183 89757 21 16:02 ?        00:00:03 postgres: pguser pgbase 100.148.201.20(36986) ANALYZE
postgres  22170 89757  0 16:02 ?        00:00:00 postgres: pguser pgbase 192.13.3.238(31975) PARSE
postgres  24434 89757 44 16:02 ?        00:00:00 postgres: pguser pgbase 100.148.201.21(37262) ANALYZE
postgres  71518 89757  4 16:01 ?        00:00:04 postgres: pguser pgbase 100.148.201.20(34800) COPY
postgres  71520 89757 85 16:01 ?        00:01:27 postgres: pguser pgbase 100.148.201.20(34801) SELECT
postgres  71526 89757 93 16:01 ?        00:01:35 postgres: pguser pgbase 100.148.201.20(34804) SELECT
postgres  71538 89757 86 16:01 ?        00:01:27 postgres: pguser pgbase 100.148.201.20(34810) SELECT
postgres  71563 89757 86 16:01 ?        00:01:28 postgres: pguser pgbase 100.148.201.20(34825) PARSE  
postgres  86511 89757  0 16:01 ?        00:00:00 postgres: pguser pgbase 192.13.3.238(29980) PARSE
[postgres@postgres ~]$ 
[postgres@postgres ~]$ pstack 86511
#0  0x00007f575bea6a0b in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#1  0x00007f575bea6a9f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#2  0x00007f575bea6b3b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#3  0x00000000006c9672 in PGSemaphoreLock ()
#4  0x0000000000739bac in LWLockAcquire ()
#5  0x00007f575915155c in pgss_store () from /data/postgres/pgapp/54321/lib/postgresql/pg_stat_statements.so
#6  0x00007f575915245e in pgss_post_parse_analyze () from /data/postgres/pgapp/54321/lib/postgresql/pg_stat_statements.so
#7  0x00007f5758f478c5 in pgsl_post_parse_analyze () from /data/postgres/pgapp/54321/lib/postgresql/pg_stat_log.so
#8  0x00007f57583168ca in pg_hint_plan_post_parse_analyze () from /data/postgres/pgapp/54321/lib/postgresql/pg_hint_plan.so
#9  0x00007f575810567f in post_analyze_hook () from /data/postgres/pgapp/54321/lib/postgresql/timescaledb.so
#10 0x000000000054c93c in parse_analyze ()
#11 0x0000000000746e73 in pg_analyze_and_rewrite ()
#12 0x00000000007474db in exec_simple_query ()
#13 0x0000000000748602 in PostgresMain ()
#14 0x0000000000481d83 in ServerLoop ()
#15 0x00000000006db8f3 in PostmasterMain ()
#16 0x0000000000482756 in main ()
[postgres@postgres ~]$

The stack is clear: the backend is waiting for an LWLock in the parse phase. The LWLock is requested by pgss_store(), the function the pg_stat_statements extension (abbreviated pgss below) uses to store SQL statistics.

From this we can infer that heavy concurrent SQL produced concurrent LWLock requests that collided at the exclusive-mode (LW_EXCLUSIVE) acquisition, leaving a long LWLock wait queue.

But why does even select 1 "hang"?

The lock conflict arises because every backend locks the same shared-memory structure, i.e. accesses the same critical section.

When even select 1 hangs, it most likely means some backend held pgss's exclusive LWLock for too long, so every other lock request was blocked behind it.

2. Reproducing the Scenario

Let's simulate a process holding the exclusive lock for a long time and see whether other SQL backends get stuck waiting for it.

Session 1: start SQL backend 1703 and run a query

Nick postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
           1703
(1 row)

Nick postgres=# select id,context from table_test where id > 100 limit 10;



Session 2: attach gdb to 1703, set a breakpoint on pgss_store, and run to breakpoint 1

Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/usr/lib64/ld-2.17.so.debug...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /home/postgres/postgresql-12.4/pg12debug/lib/pg_stat_statements.so...done.
Loaded symbols for /home/postgres/postgresql-12.4/pg12debug/lib/pg_stat_statements.so
0x00007fba1697c913 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
/* set breakpoint 1 */
(gdb) b pgss_store
Breakpoint 1 at 0x7fba1667d044: file pg_stat_statements.c, line 1107.
/* print the current pid */
(gdb) p MyProc->pid
$1 = 1703
(gdb) c 
Continuing.
/* hit breakpoint 1 */
Breakpoint 1, pgss_store (query=0x1400f20 "select id,context from table_test where id > 100 limit 10;", queryId=12350619769349426705, query_location=0, query_len=57, total_time=0, rows=0, bufusage=0x0, jstate=0x7ffc98cb63f0)
    at pg_stat_statements.c:1107
1107            char       *norm_query = NULL;
(gdb) n

Next set breakpoint 2 on LWLockAcquire, run to it, then single-step until the LW_EXCLUSIVE acquisition succeeds

/* set breakpoint 2 */
(gdb) b LWLockAcquire
Breakpoint 2 at 0x867e77: file lwlock.c, line 1125.
(gdb) c 
Continuing.
/* hit breakpoint 2 */
Breakpoint 2, LWLockAcquire (lock=0x7fba0d741700, mode=LW_EXCLUSIVE) at lwlock.c:1125
1125            PGPROC     *proc = MyProc;
(gdb) 
1126            bool            result = true;
(gdb) 
1127            int                     extraWaits = 0;
(gdb) 
1154            if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
(gdb) 
1162            HOLD_INTERRUPTS();
(gdb) 
1188                    mustwait = LWLockAttemptLock(lock, mode);
/* the exclusive lock has been acquired successfully here */
(gdb) 
1190                    if (!mustwait)
(gdb) 
1193                            break; 
/* got the lock */


Print the stack:

(gdb) bt
#0  LWLockAcquire (lock=0x7fba0d741700, mode=LW_EXCLUSIVE) at lwlock.c:1193
#1  0x00007fba1667d249 in pgss_store (query=0x1400f20 "select id,context from table_test where id > 100 limit 10;", queryId=12350619769349426705, query_location=0, query_len=56, total_time=0, rows=0, bufusage=0x0, 
    jstate=0x7ffc98cb63f0) at pg_stat_statements.c:1212
#2  0x00007fba1667c591 in pgss_post_parse_analyze (pstate=0x1402020, query=0x1402130) at pg_stat_statements.c:836
#3  0x000000000058e7e9 in parse_analyze (parseTree=0x1401fa0, sourceText=0x1400f20 "select id,context from table_test where id > 100 limit 10;", paramTypes=0x0, numParams=0, queryEnv=0x0) at analyze.c:119
#4  0x00000000008770c1 in pg_analyze_and_rewrite (parsetree=0x1401fa0, query_string=0x1400f20 "select id,context from table_test where id > 100 limit 10;", paramTypes=0x0, numParams=0, queryEnv=0x0) at postgres.c:695
#5  0x000000000087769c in exec_simple_query (query_string=0x1400f20 "select id,context from table_test where id > 100 limit 10;") at postgres.c:1140
#6  0x000000000087b7c8 in PostgresMain (argc=1, argv=0x142ca48, dbname=0x142c908 "postgres", username=0x142c8e8 "postgres") at postgres.c:4247
#7  0x00000000007e311c in BackendRun (port=0x1424a80) at postmaster.c:4448
#8  0x00000000007e2903 in BackendStartup (port=0x1424a80) at postmaster.c:4139
#9  0x00000000007def73 in ServerLoop () at postmaster.c:1704
#10 0x00000000007de84c in PostmasterMain (argc=1, argv=0x13fcc80) at postmaster.c:1377
#11 0x000000000070c53e in main (argc=1, argv=0x13fcc80) at main.c:228
/* print the lock */
(gdb) p *pgss->lock
$2 = {tranche = 70, state = {value = 1627389952}, waiters = {head = 99, tail = 99}}
(gdb)

At this point process 1703 holds the exclusive LWLock on the pgss->lock object (i.e. lock=0x7fba0d741700, mode=LW_EXCLUSIVE).

Session 3: run select 1; sure enough, the SQL "hangs"

[postgres@postgres:pg12.4:5314 ~]$psql
psql (12.4)
PG干O,天天象上!

Type "help" for help.

Nick postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
          19732
(1 row)

Nick postgres=# select 1;

^CCancel request sent
^CCancel request sent
^CCancel request sent

Session 4: inspect the stack of the select 1 backend

(gdb) bt
/* select 1 is currently in do_futex_wait */
#0  0x00007fba17363a0b in futex_abstimed_wait (cancel=true, private=<optimized out>, abstime=0x0, expected=0, futex=0x7fba0d73a1b8) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:43
#1  do_futex_wait (sem=sem@entry=0x7fba0d73a1b8, abstime=0x0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:223
#2  0x00007fba17363a9f in __new_sem_wait_slow (sem=0x7fba0d73a1b8, abstime=0x0) at ../nptl/sysdeps/unix/sysv/linux/sem_waitcommon.c:292
#3  0x00007fba17363b3b in __new_sem_wait (sem=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/sem_wait.c:28
#4  0x00000000007cb776 in PGSemaphoreLock (sema=0x7fba0d73a1b8) at pg_sema.c:316
#5  0x0000000000867f6e in LWLockAcquire (lock=0x7fba0d741700, mode=LW_SHARED) at lwlock.c:1245
#6  0x00007fba1667d174 in pgss_store (query=0x1400f20 "select 1;", queryId=12083610478250330165, query_location=0, query_len=8, total_time=0, rows=0, bufusage=0x0, jstate=0x7ffc98cb63f0) at pg_stat_statements.c:1170
#7  0x00007fba1667c591 in pgss_post_parse_analyze (pstate=0x1401b38, query=0x1401c48) at pg_stat_statements.c:836
#8  0x000000000058e7e9 in parse_analyze (parseTree=0x1401ab8, sourceText=0x1400f20 "select 1;", paramTypes=0x0, numParams=0, queryEnv=0x0) at analyze.c:119
#9  0x00000000008770c1 in pg_analyze_and_rewrite (parsetree=0x1401ab8, query_string=0x1400f20 "select 1;", paramTypes=0x0, numParams=0, queryEnv=0x0) at postgres.c:695
#10 0x000000000087769c in exec_simple_query (query_string=0x1400f20 "select 1;") at postgres.c:1140
#11 0x000000000087b7c8 in PostgresMain (argc=1, argv=0x142ca48, dbname=0x142c908 "postgres", username=0x142c8e8 "postgres") at postgres.c:4247
#12 0x00000000007e311c in BackendRun (port=0x1424a80) at postmaster.c:4448
#13 0x00000000007e2903 in BackendStartup (port=0x1424a80) at postmaster.c:4139
#14 0x00000000007def73 in ServerLoop () at postmaster.c:1704
#15 0x00000000007de84c in PostmasterMain (argc=1, argv=0x13fcc80) at postmaster.c:1377
#16 0x000000000070c53e in main (argc=1, argv=0x13fcc80) at main.c:228
/* print the current pid */
(gdb) p MyProc->pid
$1 = 19732
/* print the lock object; it is the same one session 1 holds */
(gdb) p *pgss->lock
$2 = {tranche = 70, state = {value = 1627389952}, waiters = {head = 99, tail = 99}}
(gdb)

The stack shows that select 1 entered the hook in the parse phase and is requesting the LWLock in LW_SHARED mode inside pgss_store(); it is currently waiting for the lock (do_futex_wait). We have essentially reproduced the production scenario.

Session 1's query holds the exclusive lock, while session 3 requests a shared lock; both lock the same object, pgss->lock, hence the wait. We won't go into the LWLock implementation details here.

Every subsequent query has to wait until the exclusive lock is released.

3. Analysis

So holding an exclusive LWLock for a long time really does cause performance problems. When does pgss use an LWLock in LW_EXCLUSIVE mode, and why?

For this case, the LW_EXCLUSIVE mode is used when a SQL statement is executed for the first time.

Here is an excerpt of the pgss_store() source:
[figure: pgss_store source excerpt]

pgss computes a hash key for the SQL (queryId and related fields) and looks it up in the pgss_hash table. If the key already exists, entry is the row found in the hash table; if not, entry is NULL, the if branch is taken, and LWLockAcquire(pgss->lock, LW_EXCLUSIVE) is eventually called to take the exclusive lock. In other words, the first execution of a new SQL statement requests the exclusive lock.

Why exclusive mode? What is it used for?

Briefly, LWLocks control concurrent access to the longer critical sections in PostgreSQL. They mainly protect shared memory, e.g. the clog buffers (transaction commit status cache), shared buffers (data page cache), WAL buffers, and so on. Whenever some piece of shared memory has to be modified, exclusive mode is naturally required.

Here we can see that after taking the exclusive lock, pgss mainly does the following:

[figure: exclusive-lock section of pgss_store]

1. Write query_text, i.e. the SQL text, into the pgss_query_texts.stat file
2. Insert the entry for this SQL into the hash table
3. Garbage-collect stale statistics (infrequent: only on shutdown or a pgss reset)

Now let's reconstruct the production scenario; two plausible explanations:

1. A flood of new SQL ran concurrently, pushing instance CPU load high; pgss's exclusive-lock critical section executed more slowly, the exclusive lock was held longer, and the wait queue grew, so the database "stuttered".
2. A flood of new SQL ran concurrently, pushing I/O load high; with the exclusive lock held, writing to the pgss_query_texts.stat file became slow, lengthening the lock hold time and again "stalling" the database.

4. Summary

We use the pgss extension to collect SQL execution statistics, but it clearly comes with some pitfalls.

If a process hangs after successfully acquiring the exclusive LWLock, the whole database becomes essentially unusable.

Heavy exclusive-lock acquisition, and the work done under it, degrades overall performance.

When many exclusive-lock requests pile up, you will see many pg_stat_statements wait events; this is the wait event for pgss LWLock contention:

wait_event_type  | LWLock
wait_event       | pg_stat_statements
state            | active
backend_xid      | 
backend_xmin     | 2914
query            | select 1+1;
backend_type     | client backend

Possible optimizations:

1. On the application side, avoid rolling out large batches of new SQL simultaneously.
2. In the pgss extension, remove the logic that writes query_text to the pgss_query_texts.stat file and write to a table instead (the actual gain would likely be modest).
3. Drop the pgss extension and implement SQL statistics natively in the server.
