一次惊心动魄的生产数据库迁移

 

本周进行了生产库由单实例到oracle rac 环境的迁移,现将迁移过程中的问题做个简单的记录:

 

环境说明:

原生产库环库:aix+ 单实例  oracle

临时生产环境:linux + 4节点 oracle  rac

调整后的生产:aix + 2节点  oracle rac

 

周五做了一个60小时的plan,计划步骤:

1.将原生产库(aix +oracle) 数据expdp出dmp来,上传到linux rac上。

2.在linux rac 环境中用impdp 方式导入原生产库的数据。

3.搭建aix rac环境

4.将linux rac 环境中数据expdp导出,在impdp到aix rac中。

1,2步骤计划在三小时内完成。

3计划30小时内完成。包括划分存储,网络,打操作系统补丁,集群件,安装oracle等。

4.计划在3小时结束

 

遇到的问题:

1.在linux 上impdp 导入数据时有个40G的dmp,执行了一个多小时没反应,进程僵死在那。

原因:在导入数据的时候有个用户用PL/SQL连接进了数据库执行了一个sql,消耗掉了大量的系统资源不释放。导致IMPDP进程僵死。

         执行stop_job。kill掉连接进来的那个用户。重新执行导入操作成功。

         在表导入完成,导入索引的时候又出现问题了。进程又僵死了,一直处在等待中。差点疯掉。。。。

         仔细分析数据库的资源使用情况后,发现这个时候数据库在进行自动统计分析。停掉这个job后问题impdp很快完成。

结果步骤1,2用了近6个小时结束。这个在测试的时候只用了2小时40分。

 

2.在rac安装完成后,oracle的工程师准备撤退。我灵感突发,要重启一个节点的操作系统看数据库能否自动启动。

   结果不但重启那个节点的库没起来,反而导致另一个节点的监听也出问题了。

   搞了两个小时,还没找到问题的原因,几乎要崩溃。这时候里plan的时间已经不多了。

   又过了两个小时,终于搞定了。原来是aix的一个bug。oracle rac在aix上的问题在 meatlink上居然没有记录。

   步骤3比计划多了3个小时。

   感谢这篇文章的作者:http://www.ubtools.com/jira/browse/QA-48

   在本文的最后转载了这篇文章的内容。

 

 3.执行第四步的时候出现了更奇怪的问题。

    1.在linux rac 上用impdp导出的dmp,在aix rac  上无法导入。

      原因:bin  ftp传输方式有问题。

    2.改用ssh方式传输,搞定。高兴中。。。 。天有不测风云。

      原来只能导入小于2个G的dmp,大于2Gdmp无法导入。又郁闷中。。。。

 

错误如下:

Import: Release 10.2.0.4.0 - 64bit Production on Monday, 25 January, 2010 0:51:11
Copyright (c) 2003, 2007, Oracle.  All rights reserved.
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Data Mining and Real Application Testing options
ORA-39002: invalid operation
ORA-31694: master table "CAICPROD"."MYJOB2" failed to load/unload
ORA-31640: unable to open dump file "/arch01/oracle/backup/dmp/caicprod820100125.dmp" for read
ORA-19505: failed to identify file "/arch01/oracle/backup/dmp/caicprod820100125.dmp"
ORA-27092: size of file exceeds file size limit of the process
Additional information: 262143
Additional information: 9294200

 /home/oracle$ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        unlimited
memory(kbytes)       32768
coredump(blocks)     unlimited
nofiles(descriptors) unlimited

 

实在是无法解决了。在linux rac 上用exp 方式导出,在aix rac上用imp导入问题解决。

linux rac 上exp 导出花了2个小时

传输到aix rac  0.5  小时

aix rac 上 imp 导入2.5小时

 

搞定这一切已经到周一凌晨6:00了,幸亏没有耽误生产环境运行。要不就玩大了。

 

总结:1.计划一定要做,越细越好。

         2.适当的给计划留一些富裕的时间,解决突发问题。

 

用一句老话来结束我当时的感受:如果坏事有可能发生,不管这种可能性多么小,它总会发生,并引起最大可能的损失

 

 

 

When starting a VIP on a node, it fails and started on the other node.

Starting the VIP:

# ./crs_start ora.akyorap2.vip
Attempting to start `ora.akyorap2.vip` on member `akyorap2`
Start of `ora.akyorap2.vip` on member `akyorap2` failed.
Attempting to start `ora.akyorap2.vip` on member `akyorap1`
Start of `ora.akyorap2.vip` on member `akyorap1` succeeded.
#

The log level increased to get more detailed diagnostic data.

Setting Log Level:

#./crsctl debug log res "ora.akyorap2.vip:1" 
Set Resource Debug Module: ora.akyorap2.vip  Level: 1
#

Errors from the Log:
(<ORA_CRS_HOME>/log/<nodeName>/racg/ora.akyorap2.vip.log)

Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] checkIf: start for if=en1
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] IsIfAlive: start for if=en1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:49 GMT+02:00 2009 [ 413770 ] defaultgw:  started
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] defaultgw:  completed with 10.46.1
80.1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:49 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S
10.46.180.52  -c 1 -w 1 10.46.180.1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:51 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S
10.46.180.52  -c 1 -w 1 10.46.180.1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: RX packets checked if=en1 failed
Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] Interface en1 checked failed (host
=akyorap2)
Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: end for if=en1

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:52 GMT+02:00 2009 [ 413770 ] checkIf: end for if=en1
Invalid parameters, or failed to bring up VIP (host=akyorap2)

 


 All   Comments   Change History     Sort Order: Ascending order - Click to sort in descending order

ubTools Support - 18/Mar/09 08:10 PM
The problem raised from IsIfAlive() of $ORA_CRS_HOME/racgvip.

Here are the related excerpt from racgvip:

  # Check the status of the interface thro' pinging gateway
  if [ -n "$DEFAULTGW" ]
  then
    _RET=1
    # get base IP address of the interface
    tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"`
    x=$CHECK_TIMES
    while [ $x -gt 0 ]
    do
      if [ -n "$tmpIP" ]
      then
        logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW
"
        $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      else
        logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
        $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"`
      if [ "$_O1" != "$_O2" ]
      then
        # RX packets numbers changed
        _RET=0
        break
      fi
      $SLEEP 1
      x=`$EXPR $x - 1`
    done
    if [ $_RET -ne 0 ]
    then
      logx "IsIfAlive: RX packets checked if=$_IF failed"
    else
      logx "IsIfAlive: RX packets checked if=$_IF OK"
    fi
....

According to the the code above, it does the followings:

  • Assigns the current RX packet number to _O1 variable as the first RX packet number.
  • Loops $CHECK_TIMES times:
    • Pings default gateway.
    • Assigns the current RX packet number to _O2 variable as the next RX packet number.
    • If RX packet number changed(_O1!=_O2), break the loop.
    • Sleep 1 second.
  • If RX packet number is NOT changed(_O1==_O2) raise the error; else it's OK.

ubTools Support - 18/Mar/09 08:28 PM
racgvip was modified as below to dump the values of _ O1 and _ O2:
...
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"`
    logx "--------------> by dunal: _O1: $_O1"

    x=$CHECK_TIMES
    while [ $x -gt 0 ]
    do
      if [ -n "$tmpIP" ]
      then
        logx "About to execute command: $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW
"
        $PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      else
        logx "About to execute command: $PING $PING_TIMEOUT $DEFAULTGW"
        $PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"`
      logx "--------------> by dunal: _O2: $_O2"
...

As seen above, logx "--------------> by dunal: ..." lines are added to the script. Don't do that if you're not sure about what you do.

After restarting the VIP, the values of _O1 and _O2 are dumped in the logs.

Failed Node:

...
Wed Mar 18 20:58:49 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O1: -

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:49 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S
10.46.180.52  -c 1 -w 1 10.46.180.1
Wed Mar 18 20:58:50 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O2: -

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:51 GMT+02:00 2009 [ 413770 ] About to execute command: /usr/sbin/ping -S
10.46.180.52  -c 1 -w 1 10.46.180.1
Wed Mar 18 20:58:51 GMT+02:00 2009 [ 413770 ] --------------> by dunal: _O2: -

2009-03-18 20:58:52.212: [    RACG][1] [360462][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:52 GMT+02:00 2009 [ 413770 ] IsIfAlive: RX packets checked if=en1 failed
Wed Mar 18 20:58:52 GMT+02:00 2009 [ 413770 ] Interface en1 checked failed (host
=akyorap2)
...

As seen above, the values are '-'. It's wrong. But, they are same. So, RX packet number not changed.

Successful Node:

Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] --------------> by dunal: _O1: 17297

2009-03-18 20:58:55.793: [    RACG][1] [397546][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:55 GMT+02:00 2009 [ 405728 ] About to execute command: /usr/sbin/ping -S
10.46.180.51  -c 1 -w 1 10.46.180.1
Wed Mar 18 20:58:55 GMT+02:00 2009 [ 405728 ] --------------> by dunal: _O2: 17298

2009-03-18 20:58:55.793: [    RACG][1] [397546][1][ora.akyorap2.vip]: Wed Mar 18
 20:58:55 GMT+02:00 2009 [ 405728 ] IsIfAlive: RX packets checked if=en1 OK

_O1 and _O2 are different. That means RX packet number changed and the interface is up.


ubTools Support - 18/Mar/09 08:44 PM

netstat Output on Failed Node:

/usr/bin/netstat -f inet -n -I en1 | /usr/bin/awk "{ if (/^en1/) {print $5; exit}}"
en1   1500  link#3      0.21.5e.34.55.bc       -    34601     0    16269     3     0

The column#5 is '-'. This is wrong and caused the problem.

netstat Output on Successful Node:

en1   1500  link#3      0.21.5e.34.57.fe            29223     0    10609     3     0

The column#5 is 29223. This is expected number.

Headers of netstat on Failed Node:

#/usr/bin/netstat -f inet -n -I en1 
Name  Mtu   Network     Address           ZoneID    Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  link#3      0.21.5e.34.55.bc       -    35645     0    16801     3     0
en1   1500  10.46.180   10.46.180.52           -    35645     0    16801     3     0

Headers of netstat on Successful Node:

#/usr/bin/netstat -f inet -n -I en1 
Name  Mtu   Network     Address           ZoneID    Ipkts Ierrs    Opkts Oerrs  Coll
en1   1500  link#3      0.21.5e.34.57.fe            29743     0    10762     3     0
en1   1500  10.46.180   10.46.180.51                29743     0    10762     3     0
en1   1500  10.46.180   10.46.180.53                29743     0    10762     3     0
en1   1500  10.46.180   10.46.180.54                29743     0    10762     3     0

The difference is the ZoneID column.

Looks like a network configuration problem. This issue will be open for an update from Network Administrators.


ubTools Support - 19/Mar/09 12:54 PM
The Network Adminisitrator said it was an AIX Bug:

But, this fix changes ZoneID from blank value to '-'. After this fix, no VIP could be started.


ubTools Support - 19/Mar/09 01:11 PM - edited
No solution found from Metalink.

ubTools Support - 19/Mar/09 01:45 PM
Looks like an inconsistency of Oracle on AIX 6.1.

Workaround:

Capturing column number of netstat must be changed from 5 to 6.

Original lines for _O1:

...
    tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"`
    x=$CHECK_TIMES
    while [ $x -gt 0 ]
...

Modified line for _O1:

...
    tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK '{print $2}'`
    # get RX packets numbers
    _O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$6; exit}}"`
    x=$CHECK_TIMES
    while [ $x -gt 0 ]
...

Original lines for _O2:

...
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$5; exit}}"`
      if [ "$_O1" != "$_O2" ]
      then
        # RX packets numbers changed
...

Modified line for _O2:

...
      fi
      _O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print //$6; exit}}"`
      if [ "$_O1" != "$_O2" ]
      then
        # RX packets numbers changed
...

Then, VIP could be started on the correct nodes:

./crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....ap1.gsd application    ONLINE    ONLINE    akyorap1
ora....ap1.ons application    ONLINE    ONLINE    akyorap1
ora....ap1.vip application    ONLINE    ONLINE    akyorap1
ora....ap2.gsd application    ONLINE    ONLINE    akyorap2
ora....ap2.ons application    ONLINE    ONLINE    akyorap2
ora....ap2.vip application    ONLINE    ONLINE    akyorap2

Note: Don't edit Oracle scripts unless you know what you're doing.

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值