本周进行了生产库由单实例到oracle rac 环境的迁移,现将迁移过程中的问题做个简单的记录:
环境说明:
原生产库环库:aix+ 单实例 oracle
临时生产环境:linux + 4节点 oracle rac
调整后的生产:aix + 2节点 oracle rac
周五做了一个60小时的plan,计划步骤:
1.将原生产库(aix +oracle) 数据expdp出dmp来,上传到linux rac上。
2.在linux rac 环境中用impdp 方式导入原生产库的数据。
3.搭建aix rac环境
4.将linux rac 环境中数据expdp导出,在impdp到aix rac中。
1,2步骤计划在三小时内完成。
3计划30小时内完成。包括划分存储,网络,打操作系统补丁,集群件,安装oracle等。
4.计划在3小时结束
遇到的问题:
1.在linux 上impdp 导入数据时有个40G的dmp,执行了一个多小时没反应,进程僵死在那。
原因:在导入数据的时候有个用户用PL/SQL连接进了数据库执行了一个sql,消耗掉了大量的系统资源不释放。导致IMPDP进程僵死。
执行stop_job。kill掉连接进来的那个用户。重新执行导入操作成功。
在表导入完成,导入索引的时候又出现问题了。进程又僵死了,一直处在等待中。差点疯掉。。。。
仔细分析数据库的资源使用情况后,发现这个时候数据库在进行自动统计分析。停掉这个job后问题impdp很快完成。
结果步骤1,2用了近6个小时结束。这个在测试的时候只用了2小时40分。
2.在rac安装完成后,oracle的工程师准备撤退。我灵感突发,要重启一个节点的操作系统看数据库能否自动启动。
结果不但重启那个节点的库没起来,反而导致另一个节点的监听也出问题了。
搞了两个小时,还没找到问题的原因,几乎要崩溃。这时候里plan的时间已经不多了。
又过了两个小时,终于搞定了。原来是aix的一个bug。oracle rac在aix上的问题在 meatlink上居然没有记录。
步骤3比计划多了3个小时。
感谢这篇文章的作者:http://www.ubtools.com/jira/browse/QA-48
在本文的最后转载了这篇文章的内容。
3.执行第四步的时候出现了更奇怪的问题。
1.在linux rac 上用impdp导出的dmp,在aix rac 上无法导入。
原因:bin ftp传输方式有问题。
2.改用ssh方式传输,搞定。高兴中。。。 。天有不测风云。
原来只能导入小于2个G的dmp,大于2Gdmp无法导入。又郁闷中。。。。
错误如下:
With the Partitioning, Real Application Clusters, Data Mining and Real Application Testing options
ORA-39002: invalid operation
ORA-31694: master table "CAICPROD"."MYJOB2" failed to load/unload
ORA-31640: unable to open dump file "/arch01/oracle/backup/dmp/caicprod820100125.dmp" for read
ORA-19505: failed to identify file "/arch01/oracle/backup/dmp/caicprod820100125.dmp"
ORA-27092: size of file exceeds file size limit of the process
Additional information: 262143
Additional information: 9294200
/home/oracle$ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) unlimited
stack(kbytes) unlimited
memory(kbytes) 32768
coredump(blocks) unlimited
nofiles(descriptors) unlimited
实在是无法解决了。在linux rac 上用exp 方式导出,在aix rac上用imp导入问题解决。
linux rac 上exp 导出花了2个小时
传输到aix rac 0.5 小时
aix rac 上 imp 导入2.5小时
搞定这一切已经到周一凌晨6:00了,幸亏没有耽误生产环境运行。要不就玩大了。
总结:1.计划一定要做,越细越好。
2.适当的给计划留一些富裕的时间,解决突发问题。
用一句老话来结束我当时的感受:如果坏事有可能发生,不管这种可能性多么小,它总会发生,并引起最大可能的损失
When starting a VIP on a node, it fails and started on the other node.
Starting the VIP:
# ./crs_start ora.akyorap2.vip Attempting to start `ora.akyorap2.vip` on member `akyorap2` Start of `ora.akyorap2.vip` on member `akyorap2` failed. Attempting to start `ora.akyorap2.vip` on member `akyorap1` Start of `ora.akyorap2.vip` on member `akyorap1` succeeded. #
The log level increased to get more detailed diagnostic data.
Setting Log Level:
#./crsctl debug log res "ora.akyorap2.vip:1" Set Resource Debug Module: ora.akyorap2.vip Level: 1 #
Errors from the Log:
(<ORA_CRS_HOME>/log/<nodeName>/racg/ora.akyorap2.vip.log)
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] checkIf: start for if=en1 Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] IsIfAlive: start for if=en1 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] defaultgw: started Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] defaultgw: completed with 10.46.1 80.1 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52 -c 1 -w 1 10.46.180.1 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52 -c 1 -w 1 10.46.180.1 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: RX packets checked if=en1 failed Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] Interface en1 checked failed (host =akyorap2) Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: end for if=en1 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] checkIf: end for if=en1 Invalid parameters, or failed to bring up VIP (host=akyorap2)
All | Comments | Change History | Sort Order: |
... # get RX packets numbers _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"` logx "--------------> by dunal: _O1: $_O1" x=$CHECK_TIMES while [ $x -gt 0 ] do if [ -n "$tmpIP" ] then logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW " $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1 else logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW" $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1 fi _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"` logx "--------------> by dunal: _O2: $_O2" ...
As seen above, logx "--------------> by dunal: ..." lines are added to the script. Don't do that if you're not sure about what you do.
After restarting the VIP, the values of _O1 and _O2 are dumped in the logs.
Failed Node:
... Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O1: - 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52 -c 1 -w 1 10.46.180.1 Wed Mar 18 20:58:50 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O2: - 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S 10.46.180.52 -c 1 -w 1 10.46.180.1 Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O2: - 2009-03-18 20:58:52.212: [ RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: RX packets checked if=en1 failed Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] Interface en1 checked failed (host =akyorap2) ...
As seen above, the values are '-'. It's wrong. But, they are same. So, RX packet number not changed.
Successful Node:
Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] --------------> by dunal: _O1: 17297 2009-03-18 20:58:55.793: [ RACG][1] [397546][1][ora.akyorap2.vip]: Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] About to execute command: /usr/sbin/ping -S 10.46.180.51 -c 1 -w 1 10.46.180.1 Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] --------------> by dunal: _O2: 17298 2009-03-18 20:58:55.793: [ RACG][1] [397546][1][ora.akyorap2.vip]: Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] IsIfAlive: RX packets checked if=en1 OK
_O1 and _O2 are different. That means RX packet number changed and the interface is up.
netstat Output on Failed Node:
/usr/bin/netstat -f inet -n -I en1 | /usr/bin/awk "{ if (/^en1/) {print $5; exit}}" en1 1500 link#3 0.21.5e.34.55.bc - 34601 0 16269 3 0
The column#5 is '-'. This is wrong and caused the problem.
netstat Output on Successful Node:
en1 1500 link#3 0.21.5e.34.57.fe 29223 0 10609 3 0
The column#5 is 29223. This is expected number.
Headers of netstat on Failed Node:
#/usr/bin/netstat -f inet -n -I en1 Name Mtu Network Address ZoneID Ipkts Ierrs Opkts Oerrs Coll en1 1500 link#3 0.21.5e.34.55.bc - 35645 0 16801 3 0 en1 1500 10.46.180 10.46.180.52 - 35645 0 16801 3 0
Headers of netstat on Successful Node:
#/usr/bin/netstat -f inet -n -I en1 Name Mtu Network Address ZoneID Ipkts Ierrs Opkts Oerrs Coll en1 1500 link#3 0.21.5e.34.57.fe 29743 0 10762 3 0 en1 1500 10.46.180 10.46.180.51 29743 0 10762 3 0 en1 1500 10.46.180 10.46.180.53 29743 0 10762 3 0 en1 1500 10.46.180 10.46.180.54 29743 0 10762 3 0
The difference is the ZoneID column.
Looks like a network configuration problem. This issue will be open for an update from Network Administrators.
But, this fix changes ZoneID from blank value to '-'. After this fix, no VIP could be started.
Workaround:
Capturing column number of netstat must be changed from 5 to 6.
Original lines for _O1:
... tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'` # get RX packets numbers _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"` x=$CHECK_TIMES while [ $x -gt 0 ] ...
Modified line for _O1:
... tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'` # get RX packets numbers _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$6; exit}}"` x=$CHECK_TIMES while [ $x -gt 0 ] ...
Original lines for _O2:
... fi _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"` if [ "$_O1" != "$_O2" ] then # RX packets numbers changed ...
Modified line for _O2:
... fi _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$6; exit}}"` if [ "$_O1" != "$_O2" ] then # RX packets numbers changed ...
Then, VIP could be started on the correct nodes:
./crs_stat -t Name Type Target State Host ------------------------------------------------------------ ora....ap1.gsd application ONLINE ONLINE akyorap1 ora....ap1.ons application ONLINE ONLINE akyorap1 ora....ap1.vip application ONLINE ONLINE akyorap1 ora....ap2.gsd application ONLINE ONLINE akyorap2 ora....ap2.ons application ONLINE ONLINE akyorap2 ora....ap2.vip application ONLINE ONLINE akyorap2
Note: Don't edit Oracle scripts unless you know what you're doing.
Here are the related excerpt from racgvip:
According to the the code above, it does the followings: