SQL 复制环境,进程不能进程正常停止

  1. 环境
    DB2 9.5
    SQL Replication

背景信息:
DBA 团队维护人员发现数据库维护脚本长期执行,没有停止下来。
检查日志找到Apply进程没有停止下来。

  1. 问题现象:
    SQL复制的Apply端进程不能正常停止。
    Apply日志显示一直在运行过程中。 没有接收到STOP命令的信号,导致数据库周维护脚本不能继续向下进行。

尝试停止Apply进程的脚本日志:

Maintain check end: Sun Feb 3 11:27:21 BRST 2019
End 
Step 2: Stop_Apply
Start to force application
DB20000I The FORCE APPLICATION command completed successfully.
DB21024I This command is asynchronous and may not be effective immediately.

There are 0 applications connecting to PROM_DW currently
2019-02-03-11.27.22.663296 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.27.30.670859 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.27.30.698669 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.27.38.704735 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.27.38.730952 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.27.46.736826 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.27.46.765199 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.27.54.771442 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.27.54.800140 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.28.02.806075 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.28.02.834397 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.28.10.840373 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.28.20.929295 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.28.28.935275 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.28.28.963148 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.28.36.969152 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.28.36.997498 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.28.45.003476 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.28.45.031106 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.28.53.037173 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.28.53.065487 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.01.071580 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.01.099321 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.09.105558 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.19.192983 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.27.199061 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.27.226878 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.35.232951 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.35.261635 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.43.267691 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.43.295588 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.51.301679 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.51.329809 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.29.59.335915 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.29.59.363234 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.30.07.369353 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).
2019-02-03-11.30.17.458325 ASN0600I "AsnAcmd" : "" : "Initial" : Program "applycmd 9.5.10" is starting.
2019-02-03-11.30.25.464664 ASN0506E "AsnAcmd" : "TEST001" : "Initial" : The command was not processed because the target replication program "Apply" was not running or because you entered an incorrect parameter value (for example, a server name, schema, or qualifier might be misspelled).

Apply的日志。

2019-02-03-11.27.13.662269 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.28.13.706808 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.29.13.759718 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.30.13.804167 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.31.13.846738 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.32.13.892540 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.33.13.936337 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.34.13.992169 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.35.14.034834 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.36.14.080520 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "0" minutes and "59" seconds.
2019-02-03-11.37.13.128013 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.38.13.173299 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.39.13.229953 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-03-11.40.13.274607 <CPREST(01/00)> ASN1044I APPLY "TEST001" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds

3:在测试环境中重现,发现不能重现这个问题。 SQL复制都可以正常停止下来。
测试环境中显示的Apply端停止的日志。

测试环境中都会收到这个命令“ The program received the “STOP” command.”

2019-02-15-14.25.14.889290 <cmdsEngine> ASN0522I  "Apply" : "MYQUAL1" : "Initial" : The program received the "STOP" command.
	2019-02-15-14.25.16.891334 <asnThread::stop> ASN0590I  "Apply" : "MYQUAL1" : "Initial" The thread "Initial" received return code "2011" from the exiting thread "AdminThread".
	2019-02-15-14.25.22.895746 <asnThread::stop> ASN0590I  "Apply" : "MYQUAL1" : "Initial" The thread "Initial" received return code "2011" from the exiting thread "WorkerThread".
	2019-02-15-14.25.22.895926 <Asnenv:delEnvIpcQRcvHdl> ASN0595I  "Apply" : "MYQUAL1" : "Initial" The program removed an IPC queue with keys "(0x3000d6a9, 0x3100d6a9, 0x3200d6a9)".
	2019-02-15-14.25.22.895943 <asnThread::stop> ASN0590I  "Apply" : "MYQUAL1" : "Initial" The thread "Initial" received return code "0" from the exiting thread "HoldLThread".
	2019-02-15-14.25.22.896008 <erWhatSignal> ASN0591I  "Apply" : "MYQUAL1" : "HoldLThread" The thread "HoldLThread" received "Handled" signal "SIGUSR2".

4:问题的原因:
经过咨询IBM工程师,怀疑是临时文件夹中IPC文件被删除。有可能导致这个问题。
https://www.ibm.com/support/docview.wss?uid=swg21289248
其实自己也查到了这篇technote.但是没有仔细的研究这段话的含义。
看到technote中报错日志和我环境的日志没有类似的,就排除掉了。

个人怀疑这个文件记录了SQL复制启动的进程号的信息。
SQL复制正常启动时候会一直挂住发出启动命令的那个进程。 如果这个进程被关闭了。数据库也会停止的。所以这个IPC文件被删除了,当停止命令发出来以后,找不到对应的进程信息,所以日志中看到没有收到STOP命令。

The IPC files created by replication programs in either /tmp or TMPDIR are accidentally
deleted. These files identify IPC message queues and shared memory and are used by
asnccmd, asnqccmd, asnacmd, asnqacmd to communicate with the replication program

5:重现这个问题。

在测试环境中启动SQL复制。观察/tmp文件夹。 发现生成了一个后缀名为IPC的问题文件。 并且这些文件的是以SQL复制的一些配置进行命令的。(而且这些文件的大小是0KB)
如下:

-rw-r--r--. 1 e95q10a db2iadm1    0 Feb 16 23:16 e95q10a.SAMPLE.ASN.CAP.IPC
-rw-r--r--. 1 e95q10a db2iadm1    0 Feb 16 23:16 e95q10a.SAMPLE.MYQUAL1.APP.IPC

重现步骤:

  • 启动SQL复制 capture 和apply
  • 到tmp文件夹目录。删除这个0K大小的 IPC文件。
  • 尝试停止SQL复制的capture及apply(无论是否force掉apply及capture 对应的db2的application )
  • 问题重现。
  • 日志不断的报错。但是capture 及apply的日志运行正常。
    日志如下:
2019-02-16-23.16.48.991344 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "0" minutes and "33" seconds.
2019-02-16-23.17.21.998772 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.18.22.004724 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.19.22.011058 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.20.22.022288 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.21.22.031237 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.22.22.038483 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.23.22.046662 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.24.22.057401 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.25.22.065753 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
2019-02-16-23.26.22.073823 <CPREST(01/00)> ASN1044I  APPLY "MYQUAL1" : "WorkerThread". The Apply program will become inactive for "1" minutes and "0" seconds.
  1. 解决方式
    临时解决方案:
    如果遇到这种情况。其实是可以使用操作系统命令直接Kill掉的。
    kill掉以后也不会启动不起来。
    脚本中可以这样写:
kill -9 ` ps -ef|grep -v grep|grep -iE 'asncap|asnapply'  | awk '{print $2}'`

永久解决方案:
这个问题的触发原因是,SQL复制写了一个IPC文件到临时文件夹中去。
如果临时IPC文件夹被删除,就会遇到这个问题。我个人认为这个设计有缺陷,谁也不能保证临时文件不被删除。 但是DB2还是给出了一个解决方案。

设置TMPDIR环境变量。
https://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.swg.im.iis.repl.qrepl.doc/topics/iiyrqenvtempdir.html

apply
mkdir apply_ipc
export TMPDIR=/db2/prom_dba/PROM_DW/apply/apply_ipc
cd ~
vi .profile
#this env variable used for DB2 replication apply temp file#
export TMPDIR=/db2/prom_dba/PROM_DW/apply/apply_ipc
#this env variable used for DB2 replication apply temp file#



capture 
mkdir  capture_ipc
cd ~
vi .profiles
#this env variable used for DB2 replication capture temp file#
export TMPDIR=/db2/prom_dba/PROM/capture/capture_ipc
#this env variable used for DB2 replication capture temp file#

最后将那个technote内容附上。
Why am I getting asn0506e when trying to stop SQL/Q Apply or SQL/Q Capture?
https://www-01.ibm.com/support/docview.wss?uid=swg21289248

Technote (troubleshooting)

Problem(Abstract)
Getting ASN0506E when trying to stop SQL/Q Apply or SQL/Q Capture

Symptom
When trying to stop any of the replication programs using any of these commands -

asnccmd, asnqccmd, asnacmd, asnqacmd, ASN0506E is returned and the program is not stopped.

For example,

  1. asnqacmd apply_server=<apply_server> stop

ASN0600I “AsnQAcmd” : “” : “Initial” :Program “asnqacmd 9.1.0” is starting.
ASN0506E “AsnQAcmd” : “ASN” : “Initial” :
The command was not processed. The “Q Apply” program is presumed down.

  1. asnccmd Capture_server=<capture_server> stop

ASN0600I “AsnCcmd” : “” : “Initial” : Program “capcmd 9.1.0” is starting.
ASN0506E “AsnCcmd” : “ASN” : “Initial” : The command was not processed. The “Capture” program is presumed down.
Cause
The IPC files created by replication programs in either /tmp or TMPDIR are accidentally

deleted. These files identify IPC message queues and shared memory and are used by
asnccmd, asnqccmd, asnacmd, asnqacmd to communicate with the replication program

Environment
Linux, Unix

Diagnosing the problem
asnqacmd apply_server=<apply_server> apply_schema=<apply_schema> stop logstdout

asnqccmd capture_server=<capture_server> capture_schema=<capture_schema> stop logstdout
asnacmd apply_server=<apply_server> apply_schema=<apply_schema> stop logstdout
asnccmd capture_server=<capture_server> capture_schema=<capture_schema> stop logstdout

TMPDIR environment variable setting

Resolving the problem
If the processes do not stop due to this error, a forceful stop maybe needed. To avoid this problem, in future, please set TMPDIR environment variable to point to a directory where the SQL Apply/Q Apply program or SQL Capture/Q Capture program can write the files. Then start the SQL Apply/Q Apply or SQL Capture/Q Capture program.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值