记录一些 PostgreSQL问题分析思路

140 篇文章 120 订阅

原文来自微信公众号文章 生产案例 | 损坏的索引

本文主要学习记录一些其中的问题分析思路和方法。

一、 问题现象

"有个SQL在数据库中执行1ms就能出结果,但是业务压测发起的SQL就会一直执行,查不出结果,进程也没有锁,CPU占用也很高"

二、 文中排查思路

1. pg_stat_activity查看

主要看是否处于活跃状态及等待事件,即state 和 wait_event两个字段,一般还会看看执行计划。

select * from pg_stat_activity where pid = '105430';

2. GDB抓运行堆栈

步骤是:

  • gdb 跟踪问题SQL进程 105430(问题SQL需要正在运行中)
  • gdb执行bt命令显示堆栈,这里 LWLockAcquire 是轻量锁的获取函数
#0 0x000000000071f5e3 in pg_atomic_read_u32_impl (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics/generic.h:48
#1 pg_atomic_read_u32 (ptr=0x2b7667e03e04) at ../../../../src/include/port/atomics.h:247
#2 LWLockAttemptLock (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:750
#3 0x00000000007201b7 in LWLockAcquire (lock=lock@antry=0x2b7667e03e00,mode=mode@entry=LW_SHARED) at lwlock.c:1187

根据这个步骤可以自己模拟抓取一下慢SQL运行堆栈

会话1

 会话2

gdb -p 13411

会话1执行慢sql

select count(*) from t_insert a,t_insert b;

会话2,执行c及bt。就可以看到该会话当前正在执行ExecNestLoop函数,以及整个调用栈。

3. strace分析

 常用命令参考 strace分析sqlplus登录慢问题_Hehuyi_In的博客-CSDN博客

strace -p 进程号

# 输出如下
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0 
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfa6b8,FUTEX_WAIT,0,NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2b7667dfbd38,FUTEX_WAKE,1) = 0 

       可以看到此时在不断地在唤醒等待,唤醒等待,如此重复,仿佛陷入到了一个死循环里面,自旋锁自死锁。而自旋锁本身又没有等待队列、死锁检测机制,需要手动释放,因此看现象貌似遇到了死锁。

//uaddr指向一个地址,val代表这个地址期待的值,当*uaddr==val时,才会进行wait
int futex_wait(int *uaddr, int val);

//唤醒n个在uaddr指向的锁变量上挂起等待的进程
int futex_wake(int *uaddr, int n);

4. perf抓取热点函数

perf record -g -p pid 抓取30s数据

       可以看到热点函数有PinBuffer、LWLockAttempLock、LWLockRelease等,这些函数都会涉及到spinlock。但是到目前为止,我们只发现了基本现象,但是不知道根因,即便知道了锁的特征,也无法知悉是哪一个环节出了问题。

5. 开启全日志 log_statement = all

观察发现,同一类SQL有些SQL可以正确执行,但是唯独有一个参数无法执行,只有BIND的过程,而没有EXECUTE的过程!

2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6102,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6103,"SELECT",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.234 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2648,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,""

# 就是下面这条
2022-08-05 17:36:29.235 
CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2649,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,",""

2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241835,"30.69.79.67:41914",62ece183.3b0ab,6104,"BIND",2022-08-05 17:23:15 CST,44/1713,0,LOG,00000,"""
2022-08-05 17:36:29.235 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2650,"UPDATE",",,,,,,,,""
2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2651,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.236 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2652,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"'
2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2653,"BIND",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""
2022-08-05 17:36:29.237 CST,"xxxx","xxxx",241840,"30.69.79.67:46722",62ece183.3b0b0,2654,"UPDATE",2022-08-05 17:23:15 CST,45/809,820459524,LOG,00000,"""

       可以看到,当最后一次17:36:29.235执行了这个SQL之后,241835进程经过BIND之后,就没有后续日志了!而其他同类SQL都可以正常执行,于是把这个参数抓取到数据库中去执行一下,可以看到其他参数都可以瞬间执行,唯独这一个SQL一跑就死,ctrl + c都停止不了!

可以看到第一次51.927ms就出来了,而第二次一直无法结束,执行计划显示是走的索引,根据我之前的经验,隐约感觉是索引出了问题。

6. amcheck插件检测索引

SELECT bt_index_check(index => c.oid, heapallindexed => i.indisunique),
               c.relname,
               c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'public'
-- Don't check temp tables, which may be from another session:
AND c.relpersistence != 't'
AND c.relname = '索引名'
-- Function may throw an error when this is omitted:
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 10;

果然。。索引又双叒叕出错了!

另外这里还显示了报错的函数  static void  bt_target_page_check(BtreeCheckState *state)

三、 小结

       这个问题的现象很费解,当传入了某个值之后,就会一直运行停不下来,现象就是一直在自旋,等待给数据页加锁。

       根据结果反推的话,查询读取到了这个错误的索引页之后,尝试加锁,但是数据异常,加锁失败,加之自旋锁显著的特点就是"死等",需要获取SpinLock的进程会一直主动检查能否获取得到锁,直到获取到锁后才会继续执行下面的逻辑,这把锁会一直被这个线程持有,直到线程自己主动释放,这也是这个进程停不下来的原因,以及传入相同参数的进程都在等待,越堆越多!当然解决方式也好解决,reindex重建索引即可。

参考

生产案例 | 损坏的索引

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Hehuyi_In

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值