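1 Fault Description
A memory fault and an XCF alarm brought the database host down.
2 Fault Recovery
2.1 REPORT Log Analysis
After the database host was brought back up, the database started normally and all OGG processes came up as well, but after a while the pump (data transfer) process abended.
The output of view report pump1 is as follows:
On the source side, the same warning is logged over and over: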
2023-02-17 15:00:21 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:31 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:41 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:51 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:21 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:31 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:41 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
After a while the pump abends.
The target MGR report shows the following.
Similar messages are logged continuously:
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
At this point the messages suggested a problem with connections to port 7809, so we checked the port:
netstat -a|grep 7809
The port was listening normally, yet the connection still could not be established, which was puzzling.
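A LISTEN entry in netstat only shows that some process has bound the port; it does not show whether a client on another host can actually complete a connection. A minimal Python sketch of an active connection test is given below. The function name can_connect and the 3-second default timeout are illustrative assumptions and are not part of OGG.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection performs the full TCP handshake, so a refused
        # or timed-out attempt raises OSError (or a subclass of it).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the source host, a call such as can_connect("192.168.248.92", 7809) would test the manager port named in the pump report. The same call can be repeated for each dynamic collector port that appears in the target logs.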
2.2 GGSERR.LOG Analysis
The target-side error log (ggserr.log) contained the following.
Target-side error log at the time of the failure:
2023-02-17 14:46:51 INFO OGG-01677 Oracle GoldenGate Collector for Oracle: Waiting for connection (started dynamically).
2023-02-17 14:46:51 ERROR OGG-00303 Oracle GoldenGate Collector for Oracle: TCP/IP bind error 125 (Address already in use).
(Note: we did not notice this error at first. The problem was resolved indirectly when all OGG processes were restarted.)
2023-02-17 14:46:51 ERROR OGG-01668 Oracle GoldenGate Collector for Oracle: PROCESS ABENDING.
2023-02-17 14:46:51 INFO OGG-01677 Oracle GoldenGate Collector for Oracle: Waiting for connection (started dynamically).
2023-02-17 14:46:51 ERROR OGG-00303 Oracle GoldenGate Collector for Oracle: TCP/IP bind error 125 (Address already in use).
2023-02-17 14:46:51 ERROR OGG-01668 Oracle GoldenGate Collector for Oracle: PROCESS ABENDING.
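The "Address already in use" bind error above is the generic operating-system error raised when a process tries to bind a TCP port that another socket already holds; the numeric code (125 here) varies by platform. The sketch below reproduces the same class of error locally with two plain sockets. It only illustrates the error condition and makes no claim about what held the port in this incident; the function name bind_conflict is hypothetical.

```python
import errno
import socket


def bind_conflict(host: str = "127.0.0.1") -> bool:
    """Bind two listeners to one port; True if the second fails with EADDRINUSE."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))          # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same port, still held by `first`
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        second.close()
        first.close()
```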
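2.3 Summary of Normal Logs
Source pump
In the normal case, the warnings repeat for a while and then a connection is established on another port: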
2023-02-17 15:00:51 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:21 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:31 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:41 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:00:51 WARNING OGG-01223 TCP/IP error 79 (Connection refused), endpoint: 192.168.248.92:7809.
2023-02-17 15:01:06 INFO OGG-01226 Socket buffer size set to 27985 (flush size 27985).
2023-02-17 15:01:06 INFO OGG-01230 Recovered from TCP error, host 192.168.248.92, port 7840.
2023-02-17 15:01:09 INFO OGG-01056 Recovery initialization completed for target file ./dirdat/mo041446, at RBA 145191066, CSN 13091982261579.
2023-02-17 15:01:09 INFO OGG-01478 Output file ./dirdat/mo is using format RELEASE 11.2.
Target-side error log (ggserr.log) in the normal case:
2023-02-17 15:01:01 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-01677 Oracle GoldenGate Collector for Oracle: Waiting for connection (started dynamically).
2023-02-17 15:01:01 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from SERVER on host [127.0.0.1]:33045 (REPORT 14418 7840).
2023-02-17 15:01:01 INFO OGG-00974 Oracle GoldenGate Manager for Oracle, mgr.prm: Manager started collector process (Port 7840).
2023-02-17 15:01:01 INFO OGG-01228 Oracle GoldenGate Collector for Oracle: Timeout in 300 seconds.
2023-02-17 15:01:01 INFO OGG-01677 Oracle GoldenGate Collector for Oracle: Waiting for connection (started dynamically).
2023-02-17 15:01:01 INFO OGG-00963 Oracle GoldenGate Manager for Oracle, mgr.prm: Command received from SERVER on host [127.0.0.1]:33046 (REPORT 14419 7841).
2023-02-17 15:01:01 INFO OGG-00974 Oracle GoldenGate Manager for Oracle, mgr.prm: Manager started collector process (Port 7841).
2023-02-17 15:01:01 INFO OGG-01228 Oracle GoldenGate Collector for Oracle: Timeout in 300 seconds.
2023-02-17 15:01:04 INFO OGG-00987 Oracle GoldenGate Command Interpreter for Oracle: GGSCI command (eoms): start er *.
MGR report in the normal case:
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51344 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from EXTRACT on host [192.168.243.29]:51345 (START SERVER CPU -1 PRI -1 TIMEOUT 300 PARAMS ).
2023-02-17 15:01:01 INFO OGG-00963 Command received from SERVER on host [127.0.0.1]:33045 (REPORT 14418 7840).
2023-02-17 15:01:01 INFO OGG-00974 Manager started collector process (Port 7840).
2023-02-17 15:01:01 INFO OGG-00963 Command received from SERVER on host [127.0.0.1]:33046 (REPORT 14419 7841).
2023-02-17 15:01:01 INFO OGG-00974 Manager started collector process (Port 7841).
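In the normal logs, each START SERVER request is followed by an OGG-00974 line naming the port of the collector that the manager started (7840 and 7841 here). In the failing MGR report shown earlier, no such line follows the requests. A small Python sketch for pulling those ports out of a log is shown below; the function name collector_ports is hypothetical, and the pattern is based only on the message text seen in the excerpts above.

```python
import re

# Matches the OGG-00974 message text as it appears in the excerpts.
PORT_RE = re.compile(r"Manager started collector process \(Port (\d+)\)")


def collector_ports(lines):
    """Return the collector ports announced in the given log lines, in order."""
    ports = []
    for line in lines:
        # finditer also handles several entries run together on one line.
        for match in PORT_RE.finditer(line):
            ports.append(int(match.group(1)))
    return ports
```

An empty result for a time window in which START SERVER requests keep arriving would be a hint that the collectors are failing to start, which is the point at which ggserr.log is worth reading.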
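3 Summary
When analyzing the report logs, ggserr.log must be checked as well. Looking only at the reports gave a one-sided picture and hid the bind error.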
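One way to avoid missing an error in either file is to summarize WARNING and ERROR entries by OGG code across all logs at once. The following is a minimal sketch, assuming entries start with a timestamp, a severity, and an OGG code as in the excerpts in this write-up; the function name count_by_code is hypothetical.

```python
import re
from collections import Counter

# Timestamp, severity and message code, as they appear in the excerpts.
ENTRY_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(INFO|WARNING|ERROR)\s+(OGG-\d{5})"
)


def count_by_code(lines, levels=("WARNING", "ERROR")):
    """Count log entries per (severity, OGG code) for the given severities."""
    counts = Counter()
    for line in lines:
        # findall copes with several entries run together on one line.
        for _timestamp, level, code in ENTRY_RE.findall(line):
            if level in levels:
                counts[(level, code)] += 1
    return counts
```

Feeding it the lines of both the pump report and ggserr.log would surface OGG-00303 and OGG-01668 next to the repeated OGG-01223 warnings, instead of leaving them unnoticed in a file that was not opened.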