奇迹mysql老是被误杀_误杀进程导致rac hang住 – 提供7*24专业数据库(Oracle,SQL Server,MySQL等)恢复和Oracle技术服务@Tel:+86 1342964878...

有客户反馈系统hang住,不能归档,需要我们紧急介入分析

节点1日志

出现redo不能归档,redo日志都已经被写满,人工执行了ALTER SYSTEM ARCHIVE LOG CURRENT,数据库就开始把redo全部归档,但是后面产生的redo又不能归档,当redo全部写满之后,数据库有出现大量log file switch (archiving needed)等待

Tue Sep 24 22:05:37 2013

Thread 1 advanced to log sequence 47282 (LGWR switch)

Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409

Tue Sep 24 22:07:31 2013

ORACLE Instance q9db1 - Can not allocate log, archival required

Thread 1 cannot allocate new log, sequence 47283

All online logs needed archiving

Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409

Tue Sep 24 22:28:17 2013

ALTER SYSTEM ARCHIVE LOG

Archived Log entry 259646 added for thread 1 sequence 47266 ID 0x354620c2 dest 1:

Tue Sep 24 22:28:18 2013

Thread 1 advanced to log sequence 47283 (LGWR switch)

Current log# 7 seq# 47283 mem# 0: +DATA/q9db/onlinelog/group_7.1243.818697415

Archived Log entry 259647 added for thread 1 sequence 47267 ID 0x354620c2 dest 1:

Archived Log entry 259648 added for thread 1 sequence 47268 ID 0x354620c2 dest 1:

Archived Log entry 259649 added for thread 1 sequence 47269 ID 0x354620c2 dest 1:

Archived Log entry 259650 added for thread 1 sequence 47270 ID 0x354620c2 dest 1:

Archived Log entry 259651 added for thread 1 sequence 47271 ID 0x354620c2 dest 1:

Archived Log entry 259652 added for thread 1 sequence 47272 ID 0x354620c2 dest 1:

Tue Sep 24 22:28:28 2013

Archived Log entry 259653 added for thread 1 sequence 47273 ID 0x354620c2 dest 1:

Archived Log entry 259654 added for thread 1 sequence 47274 ID 0x354620c2 dest 1:

Archived Log entry 259655 added for thread 1 sequence 47275 ID 0x354620c2 dest 1:

Archived Log entry 259656 added for thread 1 sequence 47276 ID 0x354620c2 dest 1:

Archived Log entry 259657 added for thread 1 sequence 47277 ID 0x354620c2 dest 1:

Archived Log entry 259658 added for thread 1 sequence 47278 ID 0x354620c2 dest 1:

Archived Log entry 259659 added for thread 1 sequence 47279 ID 0x354620c2 dest 1:

Tue Sep 24 22:28:39 2013

Archived Log entry 259660 added for thread 1 sequence 47280 ID 0x354620c2 dest 1:

Archived Log entry 259661 added for thread 1 sequence 47281 ID 0x354620c2 dest 1:

Archived Log entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1:

Tue Sep 24 22:29:39 2013

Thread 1 advanced to log sequence 47284 (LGWR switch)

Current log# 8 seq# 47284 mem# 0: +DATA/q9db/onlinelog/group_8.1242.818697417

Tue Sep 24 22:31:18 2013

Thread 1 advanced to log sequence 47285 (LGWR switch)

Current log# 16 seq# 47285 mem# 0: +DATA/q9db/onlinelog/group_16.1884.827003545

Thread 1 advanced to log sequence 47286 (LGWR switch)

Current log# 17 seq# 47286 mem# 0: +DATA/q9db/onlinelog/group_17.1885.827003587

节点2日志

节点2中出现大量的IPC Send timeout

Tue Sep 24 15:22:19 2013

IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

…………

Tue Sep 24 18:51:55 2013

IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

Tue Sep 24 18:57:54 2013

IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

Receiver: inst 1 binc 464003926 ospid 1566

Tue Sep 24 19:03:57 2013

IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

Receiver: inst 1 binc 464003926 ospid 1566

Tue Sep 24 19:09:53 2013

IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

…………

Tue Sep 24 20:22:00 2013

IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]

节点1因为不能归档hang住,节点2紧接着也就hang住。对节点1hang住之时对两个节点分别做systemstate dump,使用ass进行分析得到节点1和节点2的记录大体如下:

节点1

393:waiting for 'log file switch (archiving needed)'

394:waiting for 'log file switch (archiving needed)'

Cmd: Insert

395:waiting for 'log file switch (archiving needed)'

Cmd: Insert

397:waiting for 'log file switch (archiving needed)'

Cmd: Insert

398:waiting for 'log file switch (archiving needed)'

Cmd: Insert

451:waiting for 'SQL*Net message from client'

469:waiting for 'log file switch (archiving needed)'

Cmd: Insert

470:waiting for 'log file switch (archiving needed)'

Cmd: Insert

471:waiting for 'log file switch (archiving needed)'

Cmd: Insert

618:waiting for 'log file switch (archiving needed)'

Cmd: Insert

626:waiting for 'log file switch (archiving needed)'

Cmd: Insert

NO BLOCKING PROCESSES FOUND

节点2

515:waiting for 'gc buffer busy acquire'

Cmd: Insert

516:waiting for 'gc buffer busy acquire'

Cmd: Insert

517:waiting for 'gc buffer busy acquire'

Cmd: Insert

518:waiting for 'gc buffer busy acquire'

Cmd: Insert

519:waiting for 'gc buffer busy acquire'

Cmd: Insert

520:waiting for 'gc buffer busy acquire'

Cmd: Select

521:waiting for 'gc current request'

Cmd: Insert

522:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C]

Cmd: Select

523:waiting for 'gc buffer busy acquire'

Cmd: Insert

524:waiting for 'SQL*Net message from client'

525:waiting for 'gc buffer busy acquire'

Cmd: Insert

526:waiting for 'gc buffer busy acquire'

Cmd: Insert

527:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C]

Cmd: Select

528:waiting for 'SQL*Net message from client'

529:waiting for 'gc buffer busy acquire'

Cmd: Select

Resource Holder State

Enq TX-0005001E-0022374F 223: waiting for 'gc current request'

Enq TX-0047001B-002BCEB2 247: waiting for 'gc current request'

Enq TX-015B001E-000041FF 330: waiting for 'gc current request'

Enq TX-00010010-002EA7CD 179: waiting for 'gc current request'

Enq TX-00BA0020-001E3E3C ??? Blocker

Object Names

~~~~~~~~~~~~

Enq TX-0005001E-0022374F

Enq TX-0047001B-002BCEB2

Enq TX-015B001E-000041FF

Enq TX-00010010-002EA7CD

Enq TX-00BA0020-001E3E3C

通过这里,我们可以明白,节点2的很多事务hang住是因为请求gc current request,而该等待是因为节点1无法归档,有些block无法正常传输到节点2,导致节点2一直hang在这里,然后就出现IPC Send timeout;节点1上的事务阻塞甚至hang住是因为无法归档导致.到此需要定位的问题是为什么节点1不能归档

继续分析节点1 alert日志

Tue Sep 24 15:18:20 2013

opidrv aborting process O000 ospid (7332) as a result of ORA-28

Immediate Kill Session#: 1904, Serial#: 1065

Immediate Kill Session: sess: 0x24a2522a38 OS pid: 7338

Immediate Kill Session#: 3597, Serial#: 11107

Immediate Kill Session: sess: 0x24c27cf498 OS pid: 7320

Tue Sep 24 15:18:23 2013

opidrv aborting process W000 ospid (7980) as a result of ORA-28

Tue Sep 24 15:18:23 2013

opidrv aborting process W001 ospid (8560) as a result of ORA-28

Tue Sep 24 15:18:35 2013

LGWR: Detected ARCH process failure

LGWR: Detected ARCH process failure

LGWR: Detected ARCH process failure

LGWR: Detected ARCH process failure

LGWR: STARTING ARCH PROCESSES

Tue Sep 24 15:18:35 2013

ARC0 started with pid=66, OS id=10793

Tue Sep 24 15:18:35 2013

Errors in file /u01/app/oracle/diag/rdbms/q9db/q9db1/trace/q9db1_nsa2_12635.trc:

ORA-00028: your session has been killed

LNS: Failed to archive log 8 thread 1 sequence 47156 (28)

ARC0: Archival started

LGWR: STARTING ARCH PROCESSES COMPLETE

Thread 1 advanced to log sequence 47157 (LGWR switch)

ARC0: STARTING ARCH PROCESSES

Current log# 9 seq# 47157 mem# 0: +DATA/q9db/onlinelog/group_9.1241.818697421

Tue Sep 24 15:18:36 2013

ARC1 started with pid=81, OS id=10805

Tue Sep 24 15:18:36 2013

ARC2 started with pid=84, OS id=10807

Tue Sep 24 15:18:36 2013

ARC3 started with pid=87, OS id=10809

ARC1: Archival started

ARC2: Archival started

ARC2: Becoming the 'no FAL' ARCH

ARC2: Becoming the 'no SRL' ARCH

ARC1: Becoming the heartbeat ARCH

Error 1031 received logging on to the standby

PING[ARC1]: Heartbeat failed to connect to standby 'q9adgdg'. Error is 1031.

ARC3: Archival started

ARC0: STARTING ARCH PROCESSES COMPLETE

Archived Log entry 259135 added for thread 1 sequence 47156 ID 0x354620c2 dest 1:

Error 1031 received logging on to the standby

FAL[server, ARC3]: Error 1031 creating remote archivelog file 'q9adgdg'

FAL[server, ARC3]: FAL archive failed, see trace file.

ARCH: FAL archive failed. Archiver continuing

ORACLE Instance q9db1 - Archival Error. Archiver continuing.

Tue Sep 24 15:18:46 2013

opidrv aborting process O001 ospid (9605) as a result of ORA-28

Tue Sep 24 15:18:46 2013

opidrv aborting process O000 ospid (10813) as a result of ORA-28

Tue Sep 24 15:18:46 2013

Immediate Kill Session#: 2909, Serial#: 369

Immediate Kill Session: sess: 0x24226c7200 OS pid: 9091

Immediate Kill Session#: 3380, Serial#: 30271

Immediate Kill Session: sess: 0x2422782c58 OS pid: 10265

Immediate Kill Session#: 3597, Serial#: 11109

Immediate Kill Session: sess: 0x24c27cf498 OS pid: 10267

Tue Sep 24 15:20:14 2013

Restarting dead background process DIAG

Tue Sep 24 15:20:14 2013

DIAG started with pid=64, OS id=20568

Restarting dead background process PING

Tue Sep 24 15:20:14 2013

PING started with pid=68, OS id=20570

Restarting dead background process LMHB

Tue Sep 24 15:20:14 2013

LMHB started with pid=70, OS id=20572

Restarting dead background process SMCO

…………

Tue Sep 24 15:23:13 2013

ARC0: Detected ARCH process failure

Tue Sep 24 15:23:13 2013

Thread 1 advanced to log sequence 47158 (LGWR switch)

Current log# 10 seq# 47158 mem# 0: +DATA/q9db/onlinelog/group_10.1240.818697423

ARC0: STARTING ARCH PROCESSES

ARC0: STARTING ARCH PROCESSES COMPLETE

ARC0: Becoming the heartbeat ARCH

ARCH: Archival stopped, error occurred. Will continue retrying

ORACLE Instance q9db1 - Archival Error

ORA-00028: your session has been killed

查看ARCn进程

[oracle@q9db01 ~]$ ps -ef|grep ora_ar

oracle 20718 12870 0 22:07 pts/14 00:00:00 grep ora_ar

[oracle@q9db01 ~]$ ps -ef|grep ora_ar

oracle 25998 12870 0 22:07 pts/14 00:00:00 grep ora_ar

这里基本上明白了,因为客户的系统从15:15开始由于中间件程序异常,导致大量会话连接数据库,然后dba为了防止其他业务不受影响,然后开始大量通过alter system kill session,误杀了不少系统进程,包括ARCn(0,1,2,3)进程,在后面ARCn进程因为某种原因无法正常启动,导致redo无法归档,所有的redo组写满系统即hang住,该系统由于大量kill session已经导致了实例本身异常(正常情况ARCn进程kill之后会自动重启),处理方案:先增加redo组配合定时人工归档,等待业务低峰重启节点1,解决问题。温馨提醒:kill进程请小心

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值