以下是针对 PostgreSQL 数据库的性能瓶颈与高可用异常切换的排查步骤、常用命令和工具的完整清单,适用于生产环境下的问题诊断与优化。
🧱 一、PostgreSQL 性能瓶颈排查清单
1. 监控初步判断
关键指标:
TPS / QPS(transactions/queries per second) active connections / max_connections buffer hit ratio deadlocks long running queries checkpointer/write activity replication lag(主备架构)
工具推荐:
pg_stat_statements
pg_stat_activity
pg_stat_io
, pg_stat_bgwriter
Prometheus + Grafana(postgres_exporter
) 自带日志 + log_min_duration_statement
2. 系统资源分析
top -c
vmstat 1 5
iostat -x 1 3
free -m
sar -n DEV 1 3
3. 数据库内部排查
🔍 当前连接/会话
SELECT * FROM pg_stat_activity WHERE state <> 'idle' ;
🔍 慢查询分析
log_min_duration_statement = 1000
log_statement = 'none'
SELECT query, total_time, calls, mean_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10 ;
🔍 执行计划分析
EXPLAIN ( ANALYZE , BUFFERS) SELECT . . . ;
4. 等待事件与锁分析(PostgreSQL 12+)
🔍 查看等待事件
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL ;
🔍 查看阻塞与被阻塞会话
SELECT blocked. pid AS blocked_pid, blocked. query AS blocked_query,
blocking. pid AS blocking_pid, blocking. query AS blocking_query
FROM pg_locks blocked
JOIN pg_stat_activity blocked_activity ON blocked. pid = blocked_activity. pid
JOIN pg_locks blocking ON blocking. locktype = blocked. locktype
AND blocking. database IS NOT DISTINCT FROM blocked. database
AND blocking. relation IS NOT DISTINCT FROM blocked. relation
AND blocking. page IS NOT DISTINCT FROM blocked. page
AND blocking. tuple IS NOT DISTINCT FROM blocked. tuple
AND blocking. pid != blocked. pid
JOIN pg_stat_activity blocking_activity ON blocking. pid = blocking_activity. pid;
5. PostgreSQL I/O 层排查
SELECT relname, heap_blks_read, heap_blks_hit
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10 ;
SELECT * FROM pg_stat_bgwriter;
6. 内存与配置项建议
参数 说明 建议配置(需调整需重启) shared_buffers
内存缓存 25-40% 实体内存 work_mem
单个操作排序哈希内存 2-16MB,注意并发数量 maintenance_work_mem
VACUUM 重建索引内存 64-256MB effective_cache_size
操作系统文件缓存估计 60-75% 实体内存 max_connections
最大连接数 搭配连接池合理设置
🌐 二、PostgreSQL 高可用异常切换排查
主流 HA 架构:
主备复制(流复制) 基于 repmgr
, Patroni
, pg_auto_failover
, Stolon
的自动化切换 云厂商 RDS 高可用方案(封装)
1. 流复制状态检查
SELECT client_addr, state, sync_state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;
SELECT pg_last_wal_receive_lsn( ) , pg_last_wal_replay_lsn( ) ;
指标判断复制延迟
SELECT EXTRACT( EPOCH FROM now ( ) - pg_last_xact_replay_timestamp( ) ) AS replication_lag;
2. 常见复制/高可用问题与排查
问题 原因 排查方式 复制延迟大 主库 WAL 积压、网络慢 检查 LSN 差值与网络传输 无法接收WAL 连接中断、认证失败 查看主库 pg_hba.conf
与日志 复制中断 standby.signal
丢失检查 recovery.conf
或 standby.signal
是否存在 自动切换失败 无法选主、脑裂 检查 repmgrd
, etcd
, consul
状态
3. 高可用框架工具与排查方法
工具 查看状态命令 常见日志 repmgr repmgr cluster show
/var/log/repmgr/repmgrd.log
Patroni patronilctl ist
patroni.log
, etcd
Stolon stolonctl clusterdata
stolon.log
, sentinel.log
pg_auto_failover pg_autoctl show state
pg_autoctl.log
🔧 三、PostgreSQL 常用诊断工具/命令
类型 工具/命令 用途 性能分析 pg_stat_statements
, EXPLAIN ANALYZE
, auto_explain
SQL优化 会话管理 pg_stat_activity
, pg_terminate_backend(pid)
查看/终止连接 锁分析 pg_locks
, pg_blocking_pids()
排查阻塞 复制监控 pg_stat_replication
, pg_last_wal_replay_lsn()
复制延迟监测 I/O 检查 pg_statio_user_tables
, pg_stat_bgwriter
热点表和刷盘信息 自动化工具 pgBadger
, pg_stat_monitor
, Prometheus
日志分析、监控 高可用管理 repmgr
, patroni
, pg_autoctl
主从管理与切换
✅ PostgreSQL 故障排查抽象表格
维度 子类 检查点 命令/工具 判断标准 连接数 活跃连接是否超限 pg_stat_activity
active > max_connections SQL 慢 查询耗时高 pg_stat_statements
, EXPLAIN
mean_time > 500ms 锁 会话阻塞 pg_locks
+ pg_blocking_pids()
有 blocking pid 存在 复制延迟 LSN 差距大 pg_stat_replication
lag > 1s I/O瓶颈 热表、顺序读多 pg_statio_user_tables
heap_blks_read 多 Checkpoint 问题 写入突增、卡顿 pg_stat_bgwriter
buffers_checkpoint 多