innodb_buffer_pool_size过小+同步任务目标表过大,导致MySQL无法处理正常请求

背景

悲伤的一天从数据库阻塞开始。早上发现项目页面的正常数据无法加载,心里一紧,赶紧检查接口调用,但接口半天没有响应,只有少部分接口没问题,估计是数据库出问题了,别问为什么😂总之就是(经历过的我万只草泥马在内心狂奔)。

用数据库工具连接数据库却发现连不上,直接用命令连接却可以。使用 show processlist; 查看所有线程执行的语句,发现每个连接的语句都阻塞了(还好都是查询语句)。

第一反应是连接创建过多,之前出现过一个程序在失败后循环创建连接,导致连接数达到配置最大值。但 show processlist; 显示只有200个连接,还没有满,所以排除这个原因。

 

接下来用 SHOW ENGINE INNODB STATUS;,发现半天没反应,完了无从下手,只能分析 error 日志。将最近一天(昨天还是正常的)的 error 和慢 SQL 日志都下载下来(这两个日志都是增量的,没有清理,非常大,通过 tail -n xx.log > temp.log 获取最近的 n 行日志)。

 

由于是线上单机,虽然调用量不多,但还是赶紧重启恢复一下。安全关闭(/opt/mysql/bin/mysqladmin -u root -p shutdown)也阻塞了,无法关闭,只能 kill 了(因为正常业务流量进来都阻塞了,所以不会误伤正在进行中的流量,如果还有正常的业务更新操作,还是要衡量一下的,虽然 MySQL 有 crash 恢复机制,还是得慎重),重启后服务没啥问题。于是走上了分析日志的漫长道路😂。

分析日志找出元凶

分析mysql-error.log

日志时间里的时间,和当前比较需要加8个小时,默认是0时区

InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

# =====================================

2024-08-15 03:22:01 0x7f74f278e700 INNODB MONITOR OUTPUT

## Per second averages calculated from the last 28 seconds

## BACKGROUND THREAD

## srv_master_thread loops: 348143 srv_active, 0 srv_shutdown, 1570815 srv_idle

srv_master_thread log flush and writes: 1918958

## SEMAPHORES

OS WAIT ARRAY INFO: reservation count 112950472
--Thread 140140226561792 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 252  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

- Thread 140140118513408 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 30 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140224128768 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 117 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140110944000 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 61 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140117161728 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140107970304 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140225480448 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140221966080 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140110132992 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140098508544 has waited at [buf0buf.cc](http://buf0buf.cc/) line 5175 for 252 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f74f9baf5e8 created in file [buf0buf.cc](http://buf0buf.cc/) line 1474
a writer (thread id 140140098508544) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file [row0sel.cc](http://row0sel.cc/) line 3766
Last time write locked in file /var/lib/pb2/sb_1-12949965-1697025598.1/mysql-5.7.44/storage/innobase/buf/buf0buf.cc line 5175
--Thread 140140112836352 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 91 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140375672576 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140116350720 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 76 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140107699968 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140109592320 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 117 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140099860224 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140109051648 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 0 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140108240640 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140374050560 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140107159296 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140111755008 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140223047424 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 15 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140110403328 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140288374528 has waited at [buf0flu.cc](http://buf0flu.cc/) line 1919 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140117972736 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 252 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140112025344 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 241 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140226021120 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 121 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1
- Thread 140140225210112 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 45 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

## OS WAIT ARRAY INFO: signal count 290631174

RW-shared spins 0, rounds 1987319567, OS waits 43678928
RW-excl spins 0, rounds 954282804, OS waits 18020372
RW-sx spins 55746, rounds 1343649, OS waits 37909
Spin rounds per wait: 1987319567.00 RW-shared, 954282804.00 RW-excl, 24.10 RW-sx

## LATEST DETECTED DEADLOCK

2024-08-08 01:00:07 0x7f74e9ffb700
*** (1) TRANSACTION:
TRANSACTION 217309164, ACTIVE 1 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 192 row lock(s)
MySQL thread id 46308, OS thread handle 140140220614400, query id 78035010 10.61.1.172 normal_lp updating
UPDATE xxTable  SET xx_order=3,
xx_time='2024-08-08 01:00:01.294000'

WHERE  policy_id = 2509 AND credit_no = '91330106MA27W6MX8K'
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 10787 page no 6 n bits 184 index PRIMARY of table database`-test`.xxxtable trx id 217309164 lock_mode X waiting
Record lock, heap no 93 PHYSICAL RECORD: n_fields 15; compact format; info bits 0
0: len 4; hex 800006cc; asc     ;;
1: len 6; hex 00000cf3dfeb; asc       ;;
2: len 7; hex c9000080030110; asc        ;;
3: len 18; hex 33313333303030303036303537353235324c; asc 31330000060575252L;;
4: len 27; hex e6b599e6b19fe4b985e4b8b0e5be8be5b888e4ba8be58aa1e68980; asc                            ;;
5: SQL NULL;
6: len 4; hex 80000a5c; asc    \;;
7: len 8; hex 800000000000002d; asc        -;;
8: len 5; hex 99b4101006; asc      ;;
9: len 5; hex 99b4101006; asc      ;;
10: len 1; hex 80; asc  ;;
11: len 1; hex 80; asc  ;;
12: SQL NULL;
13: len 8; hex 95939710b8d80002; asc         ;;
14: len 4; hex 80000866; asc    f;;

- * (2) TRANSACTION:
TRANSACTION 217309163, ACTIVE 1 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 2
MySQL thread id 46316, OS thread handle 140140118783744, query id 78036175 10.61.1.172 normal_lp updating
UPDATE xxTable SET xx_order=2,
xx_time='2024-08-08 01:00:02.156000'

WHERE  id = 1739 AND xx_order = 0
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 10787 page no 6 n bits 184 index PRIMARY of table `xxDatabase-test`.`xxTable` trx id 217309163 lock_mode X locks rec but not gap
Record lock, heap no 93 PHYSICAL RECORD: n_fields 15; compact format; info bits 0
0: len 4; hex 800006cc; asc     ;;
1: len 6; hex 00000cf3dfeb; asc       ;;
2: len 7; hex c9000080030110; asc        ;;
3: len 18; hex 33313333303030303036303537353235324c; asc 31330000060575252L;;
4: len 27; hex e6b599e6b19fe4b985e4b8b0e5be8be5b888e4ba8be58aa1e68980; asc                            ;;
5: SQL NULL;
6: len 4; hex 80000a5c; asc    \;;
7: len 8; hex 800000000000002d; asc        -;;
8: len 5; hex 99b4101006; asc      ;;
9: len 5; hex 99b4101006; asc      ;;
10: len 1; hex 80; asc  ;;
11: len 1; hex 80; asc  ;;
12: SQL NULL;
13: len 8; hex 95939710b8d80002; asc         ;;
14: len 4; hex 80000866; asc    f;;

- * (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 10787 page no 6 n bits 184 index PRIMARY of table `xxDatabase-test`.`xxTable` trx id 217309163 lock_mode X locks rec but not gap waiting
Record lock, heap no 110 PHYSICAL RECORD: n_fields 15; compact format; info bits 0
0: len 4; hex 800006cb; asc ;;
1: len 6; hex 00000ce5f919; asc ;;
2: len 7; hex f50000c03a0122; asc : ";;
3: len 18; hex 393133333030303037353539333335393743; asc 91330000755933597C;;
4: len 30; hex e6b599e6b19fe58d8ee5a48fe4bc9ae8aea1e5b888e4ba8be58aa1e68980; asc ;;
5: SQL NULL;
6: len 4; hex 80000a5c; asc \;;
7: len 8; hex 8000000000000005; asc ;;
8: len 5; hex 99b3fe1007; asc ;;
9: len 5; hex 99b3fe1007; asc ;;
10: len 1; hex 80; asc ;;
11: len 1; hex 80; asc ;;
12: SQL NULL;
13: len 8; hex 95939710b8d80002; asc ;;
14: len 4; hex 80000866; asc f;;

## * WE ROLL BACK TRANSACTION (1)

## TRANSACTIONS

## Trx id counter 218129917

Purge done for trx's n:o < 218127641 undo n:o < 0 state: running but idle
History list length 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421615428877536, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428874800, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428852000, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428852912, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428869328, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428866592, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428876624, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428872064, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428871152, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428865680, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428863856, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428860208, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428859296, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421615428857472, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 218129916, ACTIVE 241 sec
0 lock struct(s), heap size 1136, 0 row lock(s)
Trx read view will not see trx with id >= 218129917, sees < 218129917

## FILE I/O

## I/O thread 0 state: waiting for completed aio requests (insert buffer thread)

I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 0
18137925483 OS file reads, 30578731 OS file writes, 2505503 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s

## INSERT BUFFER AND ADAPTIVE HASH INDEX

## Ibuf: size 1, free list len 1574, seg size 1576, 586323 merges

merged operations:
insert 17994732, delete mark 280, delete 1
discarded operations:
insert 0, delete mark 0, delete 0
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
Hash table size 17387, node heap has 1 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s

## LOG

## Log sequence number 8739889627572

Log flushed up to 8739889627572
Pages flushed up to 8739889627572
Last checkpoint at 8739889627563
0 pending log flushes, 0 pending chkp writes
1152048 log i/o's done, 0.00 log i/o's/second

## BUFFER POOL AND MEMORY

Total large memory allocated 68714496
Dictionary memory allocated 3134324
InnoDB: ###### Diagnostic info printed to the standard error stream
2024-08-14T19:22:20.878068Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140226561792 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878236Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140117161728 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878263Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140225480448 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878283Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140221966080 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878304Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140110132992 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878320Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140098508544 has waited at [buf0buf.cc](http://buf0buf.cc/) line 5175 for 272  seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f74f9baf5e8 created in file [buf0buf.cc](http://buf0buf.cc/) line 1474
a writer (thread id 140140098508544) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file [row0sel.cc](http://row0sel.cc/) line 3766
Last time write locked in file /var/lib/pb2/sb_1-12949965-1697025598.1/mysql-5.7.44/storage/innobase/buf/buf0buf.cc line 5175
2024-08-14T19:22:20.878361Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140375672576 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878376Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140099860224 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878391Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140110403328 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878405Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140288374528 has waited at [buf0flu.cc](http://buf0flu.cc/) line 1919 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878453Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140117972736 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 272  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:20.878468Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140112025344 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 261  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream
2024-08-14T19:22:51.879192Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140226561792 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879292Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140117161728 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879310Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140225480448 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879327Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140221966080 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879343Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140110132992 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879357Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140098508544 has waited at [buf0buf.cc](http://buf0buf.cc/) line 5175 for 303  seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f74f9baf5e8 created in file [buf0buf.cc](http://buf0buf.cc/) line 1474
a writer (thread id 140140098508544) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file [row0sel.cc](http://row0sel.cc/) line 3766
Last time write locked in file /var/lib/pb2/sb_1-12949965-1697025598.1/mysql-5.7.44/storage/innobase/buf/buf0buf.cc line 5175
2024-08-14T19:22:51.879384Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140375672576 has waited at [buf0buf.cc](http://buf0buf.cc/) line 3530 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879400Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140099860224 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879416Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140110403328 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879432Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140288374528 has waited at [buf0flu.cc](http://buf0flu.cc/) line 1919 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879447Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140117972736 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 303  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

2024-08-14T19:22:51.879462Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140140112025344 has waited at [buf0lru.cc](http://buf0lru.cc/) line 1328 for 292  seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created [buf0buf.cc:1747](http://buf0buf.cc:1747/), lock var 1

InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0
InnoDB: ###### Diagnostic info printed to the standard error stream
2024-08-14T19:41:22.510361Z 112546 [Note] Aborted connection 112546 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T19:41:38.894340Z 112548 [Note] Aborted connection 112548 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T19:41:38.894340Z 112547 [Note] Aborted connection 112547 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T20:11:24.750323Z 112602 [Note] Aborted connection 112602 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T20:11:24.750323Z 112603 [Note] Aborted connection 112603 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T20:11:24.750345Z 112601 [Note] Aborted connection 112601 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T20:40:54.222298Z 112672 [Note] Aborted connection 112672 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T20:40:54.222298Z 112668 [Note] Aborted connection 112668 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T20:40:54.222298Z 112671 [Note] Aborted connection 112671 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T21:10:23.694508Z 112737 [Note] Aborted connection 112737 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T21:10:56.462411Z 112742 [Note] Aborted connection 112742 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout reading communication packets)
2024-08-14T21:10:56.462669Z 112743 [Note] Aborted connection 112743 to db: 'xxDatabase' user: 'root' host: 'xx.xx.xx.244' (Got timeout re

在文件中看到有大量创建BUF_POOL锁竞争,应该buffer pool不足


2024-08-14T19:21:49.876992Z 0 [Warning] InnoDB: A long semaphore wait:
–Thread 140140117161728 has waited at buf0lru.cc line 1328 for 241 seconds the semaphore:
Mutex at 0x389f4c8, Mutex BUF_POOL created buf0buf.cc:1747, lock var 1

继续往下发现数据库的预警将SHOW ENGINE INNODB STATUS;的信息打印出来了

看到buffer pool 分配内存只有65M多,应该是配置比较少
BUFFER POOL AND MEMORY
Total large memory allocated 68714496
实用命令查看67108864(64M),查看实际配置确实是64M(my.cnf)
而且也没有后续日志了,估计的锁竞争一直存在


mysql> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
ERROR 2013 (HY000): Lost connection to MySQL server during query
No connection. Trying to reconnect...
Enter password:
Connection id:    2021
Current database: *** NONE ***

+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| innodb_buffer_pool_size | 67108864 |
+-------------------------+----------+
1 row in set (12.84 sec)

分析mysql-slow.log

[时间需要加8个小时]从上面mysql-error.log知道buffer pool内存不足导致锁竞争
因为数据使用的系统有好几个,不知道是哪里触发的所以看看慢SQL日志是否有线索
通过mysql-error.log触发锁的时间前看看有哪些异常的慢SQL,经过翻看果然有猫腻
这个表查询时间特别长且不是业务系统的表,从命名规则知道是大数据平台的(命名的重要性),查了一下这个表数据有1300+W,去数据平台找相关任务果真有一个周期任务每天跑,大致逻辑是从某个接口读取数据到MySQL暂存,再通过datax转存到hive中(一脸懵逼,不知道为什么不直接存到hive还专门存在业务数据库中,难到这样比较快吗?还是这个业务库以前是数据平台的,被薅过来当业务库了,主要是这个业务平台也没有用)。从这些慢SQL还无法给数据平台的任务定罪。因为这些慢SQL之所以能进到日志文件中也就是已经正常执行完了。第7感告诉我这个肯定和这个任务有关系。

# Time: 2024-08-14T19:03:47.013400Z
# User@Host: prod_xxDatabase[prod_xxDatabase] @  [xx.xx.xx.28]  Id: 112765
# Query_time: 107.714147  Lock_time: 0.000392 Rows_sent: 67794  Rows_examined: 13370000
use xxDatabase;
SET timestamp=1723662227;
select column.. from xx_table where (condition1 in ('val1', 'val2')) and  ('val4' <= condition2 AND condition2 < 'val3');
# Time: 2024-08-14T19:05:41.874913Z
# User@Host: prod_xxDatabase[prod_xxDatabase] @  [xx.xx.xx.28]  Id: 112770
# Query_time: 113.823608  Lock_time: 0.000592 Rows_sent: 157095  Rows_examined: 13370000
use xxDatabase;
SET timestamp=1723662341;
select column.. from xx_table where (condition1 in ('val1', 'val2')) and  ('val5' <= condition2 AND condition2 <= 'val6');
# Time: 2024-08-14T19:07:21.968329Z
# User@Host: prod_xxDatabase[prod_xxDatabase] @  [xx.xx.xx.28]  Id: 112774
# Query_time: 99.362173  Lock_time: 0.000580 Rows_sent: 0  Rows_examined: 13370000
use xxDatabase;
SET timestamp=1723662441;
select column.. from xx_table where (condition1 in ('val1', 'val2')) and  condition2 IS NULL;
# Time: 2024-08-14T19:10:34.862962Z
# User@Host: prod_xxDatabase[prod_xxDatabase] @  [xx.xx.xx.28]  Id: 112766
# Query_time: 515.489991  Lock_time: 0.000440 Rows_sent: 3859885  Rows_examined: 13370000
SET timestamp=1723662634;
select column.. from xx_table where (condition1 in ('val1', 'val2')) and  ('val7' <= condition2 AND condition2 < 'val5');

踏上定罪元凶的荆棘路

日志大致情况:开始查看数据平台任务的执行日志,发现这个任务被datax拆分了5个SQL,其中4个执行成功了1个卡住了,直到重启了MySQL才中断失败的。

分析日志:和mysql-slow.log比较少了select column… from ods_rsj_ac08_mi_sgj_api where (condition1 in (‘val1’, ‘val2’)) and (‘val3’ <= condition2 AND condition2 < ‘val7’) 因为这个任务卡住了,但还是无法确认它和谁在抢buffer pool,这个语句有快1千万的数据而这个表1300+w,这个表的idb文件占5.4G。

虽然这个表数据是有些大单,buffer pool只配置了64M确实有点少了。虽然这个表量大不是原罪,架构奇葩也不是原罪,原罪的是buffer pool给得有点少了还是配置里写了64M,16G内存的机器给64M,虽然机器还有redis(最后默默给了2G,大部分建议是60%-80%,考虑到redis在用,先给10%左右比原先多得多了,后面在观察进行调整)

2024-08-15 03:01:58.888 [job-0] INFO  SingleTableSplitUtil - After split(), allQuerySql=[
select column.. from ods_rsj_ac08_mi_sgj_api where (condition1 in ('val1', 'val2')) and  ('val4' <= condition2 AND condition2 < 'val3') 
select column.. from ods_rsj_ac08_mi_sgj_api where (condition1 in ('val1', 'val2')) and  ('val3' <= condition2 AND condition2 < 'val7') 
select column.. from ods_rsj_ac08_mi_sgj_api where (condition1 in ('val1', 'val2')) and  ('val7' <= condition2 AND condition2 < 'val5')
select column.. from ods_rsj_ac08_mi_sgj_api where (condition1 in ('val1', 'val2')) and  ('val5' <= condition2 AND condition2 <= 'val6')
select column.. from ods_rsj_ac08_mi_sgj_api where (condition1 in (condition1 in ('val1', 'val2')) and  condition2 IS NULL
].
2024-08-15 03:01:58.889 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [5] tasks.
2024-08-15 03:01:58.930 [job-0] INFO  HiveWriter$Job - end do split.
2024-08-15 03:01:58.930 [job-0] INFO  JobContainer - DataX Writer.Job [hivewriter] splits to [5] tasks.
2024-08-15 03:01:58.981 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2024-08-15 03:01:59.001 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2024-08-15 03:01:59.006 [job-0] INFO  JobContainer - Running by standalone Mode.
2024-08-15 03:01:59.026 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [3] channels for [5] tasks.
2024-08-15 03:01:59.035 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to 2097152.
2024-08-15 03:01:59.035 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2024-08-15 03:01:59.051 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2024-08-15 03:01:59.057 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2024-08-15 03:01:59.069 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
...
2024-08-15 03:03:47.723 [0-0-0-writer] INFO  HiveWriter$Task - end do write
2024-08-15 03:03:47.794 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[108744]ms
2024-08-15 03:03:47.798 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
...
2024-08-15 03:05:42.295 [0-0-3-writer] INFO  HiveWriter$Task - end do write
2024-08-15 03:05:42.363 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[114564]ms
2024-08-15 03:05:42.368 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
...
2024-08-15 03:07:22.078 [0-0-4-writer] INFO  HiveWriter$Task - end do write
2024-08-15 03:07:22.148 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[99781]ms
...
2024-08-15 03:10:36.476 [0-0-2-writer] INFO  HiveWriter$Task - end do write
2024-08-15 03:10:36.526 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[517465]ms
...
2024-08-15 03:17:29.303 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11179046 records, 3117258434 bytes | Speed 2.08MB/s, 7993 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 192.916s |  All Task WaitReaderTime 805.506s | Percentage 80.00%
2024-08-15 03:17:39.305 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11259206 records, 3139112292 bytes | Speed 2.08MB/s, 8016 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 193.999s |  All Task WaitReaderTime 810.068s | Percentage 80.00%
2024-08-15 03:17:49.306 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11339014 records, 3160909448 bytes | Speed 2.08MB/s, 7980 records/s | Error 0 records, 
-- 最后一个成功读取的日志
2024-08-15 03:17:49.306 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11339014 records, 3160909448 bytes | Speed 2.08MB/s, 7980 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 195.029s |  All Task WaitReaderTime 814.803s | Percentage 80.00%
-- 这里开始就读取有点问题了后面都是一样的,从Total 11395366 records看出来数据量没有增加,后面应该都是在重试
2024-08-15 03:17:59.308 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11395366 records, 3176282803 bytes | Speed 1.47MB/s, 5635 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 195.757s |  All Task WaitReaderTime 818.217s | Percentage 80.00%
2024-08-15 03:18:09.309 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11395366 records, 3176282803 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 195.757s |  All Task WaitReaderTime 818.217s | Percentage 80.00%
2024-08-15 03:18:19.311 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11395366 records, 3176282803 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 195.757s |  All Task WaitReaderTime 818.217s | Percentage 80.00%
2024-08-15 03:18:29.313 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11395366 records, 3176282803 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 195.757s |  All Task WaitReaderTime 818.217s | Percentage 80.00%
2024-08-15 03:18:39.314 [job-0] INFO  StandAloneJobContainerCommunicator - Total 11395366 records, 3176282803 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 195

总结归纳

此次的问题主要是因为buffer pool配置不当导致buffer pool资源抢占,应该是这个没有死锁预防,不像表的锁有超时机制。所以有些问题只有经历了才知道,不经历永远是书本的知识。
其次这同步数据的任务,数据是先过业务库再到数据平台和奇怪的架构。

最终结论:
项目过太多手,都无法完全理清整个系统,特别是一些奇怪逻辑都不敢轻易去掉,生怕有特殊用途,都是抱着只要不挂就不管(吃力不讨好😂)。只能在闲暇之时为了历练自己会琢磨一下。

 

  • 21
    点赞
  • 14
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值