Symptom
When SQL runs a parallel scan, it cannot start ESP processes and fails with the following errors:
2018-08-11 16:39:20,486, ERROR, SQL.EXE, Node Number: 0, CPU: 0, PIN: 9467, Process Name: $Z0007QH, SQLCODE: 2012, QID: MXID11000009467212400779772986156000000000306U3333308T150000000_11_STMT1, *** ERROR[2012] Server process tdm_arkesp could not be created on \NSK cpu 0 - Operating system error 4022, TPCError = 31, error detail = 0. (See variants of Seabed procedure msg_mon_start_process for details).
2018-08-11 16:39:20,487, ERROR, SQL.EXE, Node Number: 0, CPU: 0, PIN: 9467, Process Name: $Z0007QH, SQLCODE: 2013, QID: MXID11000009467212400779772986156000000000306U3333308T150000000_11_STMT1, *** ERROR[2013] Server process tdm_arkesp could not be created on \NSK cpu 0 - Operating system error 4022.
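Analysis
The error message points to cpu 0, that is, starting ESP processes on the first node fails. Checking ESP process status with sqps shows more than 2000 ESP processes on the system, and almost all of them have "NONE" as their parent process, as the output below shows.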
[trafodion@datanode-1 logs]$ sqps | grep esp | wc -l
2021
...
[$Z000G8K] 000,00014979 001 GEN ES--U-- $Z000C7Z NONE tdm_arkesp
[$Z000G8K] 000,00015025 001 GEN ES--U-- $Z000C9A NONE tdm_arkesp
[$Z000G8K] 000,00015467 001 GEN ES--U-- $Z000CLX NONE tdm_arkesp
[$Z000G8K] 000,00015905 001 GEN ES--U-- $Z000CZF NONE tdm_arkesp
[$Z000G8K] 000,00016080 001 GEN ES--U-- $Z000D4F NONE tdm_arkesp
[$Z000G8K] 000,00016241 001 GEN ES--U-- $Z000D91 NONE tdm_arkesp
[$Z000G8K] 000,00017532 001 GEN ES--U-- $Z000EAX NONE tdm_arkesp
[$Z000G8K] 000,00017601 001 GEN ES--U-- $Z000ECW NONE tdm_arkesp
[$Z000G8K] 000,00018405 001 GEN ES--U-- $Z000F0V NONE tdm_arkesp
...
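The count of parentless ESP processes can be taken directly from the sqps output. The following is a minimal sketch, assuming the column layout matches the excerpt above (program name in the last field, parent name in the second-to-last field). The function name count_orphan_esps is hypothetical, and the third demo row is made up to show a process that does have a parent.

```shell
# Count tdm_arkesp rows whose parent column reads NONE.
# Assumption: the layout seen in the excerpt above, where the parent is the
# second-to-last field and the program name is the last field.
count_orphan_esps() {
    awk '$NF == "tdm_arkesp" && $(NF-1) == "NONE" { n++ } END { print n + 0 }'
}

# Demo: two rows copied from the output above, plus one hypothetical row
# whose parent column is not NONE.
printf '%s\n' \
    '[$Z000G8K] 000,00014979 001 GEN ES--U-- $Z000C7Z NONE tdm_arkesp' \
    '[$Z000G8K] 000,00015025 001 GEN ES--U-- $Z000C9A NONE tdm_arkesp' \
    '[$Z000G8K] 000,00019999 001 GEN ES--A-- $Z000XYZ $Z0007QH tdm_arkesp' \
    | count_orphan_esps
```

On a live node the same function could be fed with `sqps | count_orphan_esps`.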
Pick one specific ESP process ID and inspect it with ps. These ESP processes are all in the defunct (zombie) state, and they all share the same parent process ID, 30037.
[trafodion@datanode-1 logs]$ ps -ef | grep 18405
trafodi+ 18405 30037 0 03:50 ? 00:00:27 [tdm_arkesp] <defunct>
trafodi+ 22279 19210 0 17:32 pts/0 00:00:00 grep --color=auto 18405
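Rather than checking processes one at a time, the zombies can be grouped by parent PID in a single pass. This is a rough sketch, assuming a procps-style ps whose STAT column starts with Z for defunct processes. The function name count_zombies_by_parent is hypothetical, and the demo rows are reduced from the ps output above.

```shell
# Count defunct (zombie) processes per parent PID.
# Input: rows of "PID PPID STAT COMMAND" on stdin.
# Output: "PPID COUNT" lines, highest count first.
count_zombies_by_parent() {
    awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -k2,2nr
}

# Demo: the two processes from the ps output above, reduced to four columns.
printf '%s\n' \
    '18405 30037 Z tdm_arkesp' \
    '22279 19210 S grep' \
    | count_zombies_by_parent
```

On a live node the input could come from `ps -eo pid=,ppid=,stat=,comm=`. A parent PID with a very large count, such as 30037 here, is the process that is failing to reap its children.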
Resolution
Manually kill the parent process 30037, then check the number of ESP processes again. The count drops to 0.
kill -9 30037
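Process 30037 is the monitor process. A likely cause is that monitor restarted abnormally and the original monitor process did not exit cleanly. Because that stale process was still alive but no longer reaping its exited children, the ESP processes it had started remained in the zombie state.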