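During routine maintenance, some operations require restarting a storage cell. How can this be done without affecting the business? The procedure consists of the following steps.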
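1) First, one concept needs explaining: the DISK_REPAIR_TIME attribute. It defines how long ASM waits after a disk goes offline before dropping it. As long as the disk comes back online within that window, no ASM rebalance is triggered. The default is 3.6 hours, so estimate the maintenance time first; if the maintenance will take longer, set a longer period.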
(a) Log in to the ASM instance and check the value of DISK_REPAIR_TIME with the following statement:
- SQL> select dg.name, a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number = a.group_number and a.name = 'disk_repair_time';
(b) Change the period as needed, for example:
- SQL> ALTER DISKGROUP DATA SET ATTRIBUTE 'DISK_REPAIR_TIME'='8.5H';
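Before starting, it can help to compare the attribute value with the planned maintenance window. The following is a minimal sketch, not part of the original procedure: the function names and the one-hour safety margin are hypothetical, and it assumes the value is a number followed by an optional `h` or `m` unit (a bare number is treated as hours).

```python
import re


def repair_time_hours(value):
    """Convert a DISK_REPAIR_TIME value such as '3.6h' or '30M' to hours."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([hHmM]?)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised DISK_REPAIR_TIME value: {value!r}")
    amount = float(match.group(1))
    unit = match.group(2).lower()
    # 'm' means minutes; 'h' or no unit at all means hours.
    return amount / 60 if unit == "m" else amount


def window_is_covered(value, planned_hours, margin_hours=1.0):
    """True if the repair time covers the planned window plus a margin.

    margin_hours is a hypothetical safety buffer, not an Oracle setting.
    """
    return repair_time_hours(value) >= planned_hours + margin_hours
```

For example, `window_is_covered("3.6h", 3.0)` is false with the one-hour margin, which suggests raising the attribute before a three-hour maintenance.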
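2) Use the following command to check the grid disk status. A grid disk may only be taken offline when its mirror copies are healthy; otherwise data can be lost.
If asmdeactivationoutcome is 'Yes', the grid disk can be taken offline.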
- cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
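3) If even one grid disk reports asmdeactivationoutcome='No', wait a while and query again. Continue only once every grid disk reports a normal status.
Taking the disks offline while the status is abnormal will cause the ASM disk group to dismount and crash the database.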
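The check in steps 2 and 3 can be automated by parsing the command output. The sketch below is illustrative only: the function names are hypothetical, and it assumes each output row holds exactly three whitespace-separated fields (name, asmmodestatus, asmdeactivationoutcome), with multi-word outcomes enclosed in double quotes.

```python
import shlex


def parse_griddisk_rows(output):
    """Split each output row into (name, asmmodestatus, asmdeactivationoutcome)."""
    rows = []
    for line in output.splitlines():
        fields = shlex.split(line)  # keeps a quoted message as a single field
        if len(fields) == 3:
            rows.append(tuple(fields))
    return rows


def disks_blocking_offline(output):
    """Return the names of grid disks whose outcome is anything other than 'Yes'."""
    return [
        name
        for name, _mode, outcome in parse_griddisk_rows(output)
        if outcome.lower() != "yes"
    ]
```

An empty list means every grid disk reported 'Yes'; any name in the list is a reason to wait and query again.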
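4) Run the command below to inactivate all grid disks on this cell. It can take 10 minutes or longer.
This step is critical: before restarting the cell, make sure every grid disk has gone offline successfully. Inactivating a grid disk automatically takes the corresponding disk offline in the ASM instance.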
- cellcli -e alter griddisk all inactive
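5) Confirm that all grid disks are now offline by running the following checks.
(a) With the grid disks offline, the command below should report asmmodestatus=OFFLINE or asmmodestatus=UNUSED, together with asmdeactivationoutcome=Yes, for every grid disk. Only in that state is it safe to restart the cell.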
- cellcli -e list griddisk attributes name,asmmodestatus,asmdeactivationoutcome
(b) List the grid disks and confirm that they are inactive:
- cellcli -e list griddisk
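The condition in step 5 (a) can be expressed as a small pre-reboot gate. This is a sketch under the same assumptions about the output layout (three fields per row: name, asmmodestatus, asmdeactivationoutcome); `unsafe_disks` is a hypothetical helper name, not a CellCLI feature.

```python
import shlex


def unsafe_disks(output):
    """Return grid disks that are not yet in a state that allows a cell reboot.

    A disk is considered safe only if asmmodestatus is OFFLINE or UNUSED
    and asmdeactivationoutcome is Yes.
    """
    unsafe = []
    for line in output.splitlines():
        fields = shlex.split(line)
        if len(fields) != 3:
            continue  # skip blank lines and anything that is not a data row
        name, mode, outcome = fields
        if mode.upper() not in ("OFFLINE", "UNUSED") or outcome.lower() != "yes":
            unsafe.append(name)
    return unsafe
```

Restarting the cell would only be reasonable when the returned list is empty.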
6) You can now restart the cell with Linux commands.
(a) The following command, run as root, shuts down Oracle Exadata Storage Server immediately:
- # shutdown -h now
(When the cell is shut down, all related storage services stop automatically.)
(b) The following command reboots Oracle Exadata Storage Server immediately and forces fsck on reboot:
- # shutdown -F -r now
7) After the cell has restarted, reactivate the grid disks manually.
- cellcli -e alter griddisk all active
8) Check that the grid disks are 'active':
- cellcli -e list griddisk
9) Verify the grid disk status:
(a) Verify that all grid disks have come online successfully:
- cellcli -e list griddisk attributes name, asmmodestatus
(b) Each grid disk passes through a 'SYNCING' state before it becomes 'ONLINE'. Wait until asmmodestatus is ONLINE for all grid disks. The following is an example of the output while synchronization is still in progress:
- DATA_CD_00_dm01cel01 ONLINE
- DATA_CD_01_dm01cel01 SYNCING
- DATA_CD_02_dm01cel01 OFFLINE
- DATA_CD_03_dm01cel01 OFFLINE
- DATA_CD_04_dm01cel01 OFFLINE
- DATA_CD_05_dm01cel01 OFFLINE
- DATA_CD_06_dm01cel01 OFFLINE
- DATA_CD_07_dm01cel01 OFFLINE
- DATA_CD_08_dm01cel01 OFFLINE
- DATA_CD_09_dm01cel01 OFFLINE
- DATA_CD_10_dm01cel01 OFFLINE
- DATA_CD_11_dm01cel01 OFFLINE
(c) Oracle ASM synchronization is complete only when asmmodestatus is 'ONLINE' for every grid disk.
(Please note: this operation uses Fast Mirror Resync, which does not trigger an ASM rebalance. The resync restores only the extents that would have been written while the disk was offline.)
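The wait in step 9 can be sketched as a polling loop. The code below is an assumption-laden illustration rather than an Oracle tool: `fetch` is a hypothetical callable that returns the text of `cellcli -e list griddisk attributes name, asmmodestatus`, and the poll interval and retry limit are arbitrary defaults.

```python
import time
from collections import Counter


def status_counts(output):
    """Count grid disks per asmmodestatus from two-column 'name status' rows."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counts[fields[1].upper()] += 1
    return counts


def wait_until_online(fetch, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Poll until every grid disk is ONLINE; return False if the limit is reached."""
    for attempt in range(max_polls):
        counts = status_counts(fetch())
        # Require at least one row so that empty output is never treated as success.
        if counts and set(counts) == {"ONLINE"}:
            return True
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return False
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a real cell; in practice `fetch` might wrap a call to the command on the storage server.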
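10) Before taking another cell offline, make sure the previous cell has finished synchronizing. If it has not, the check on the next cell will fail. The following is an example of the failing output: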
- CellCLI> list griddisk attributes name where asmdeactivationoutcome != 'Yes'
- DATA_CD_00_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_01_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_02_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_03_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_04_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_05_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_06_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_07_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_08_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_09_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_10_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
- DATA_CD_11_dm01cel02 "Cannot de-activate due to other offline disks in the diskgroup"
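When rolling through several cells, output like the above can be summarised before deciding whether to proceed. The following sketch groups the blocked grid disks by the reported reason; `blocked_by_reason` is a hypothetical helper, and it assumes each data row consists of a disk name followed by one double-quoted message.

```python
import shlex
from collections import defaultdict


def blocked_by_reason(output):
    """Group grid disk names by the reason they cannot be deactivated."""
    reasons = defaultdict(list)
    for line in output.splitlines():
        fields = shlex.split(line)
        # Data rows have exactly two fields; the echoed command line has more.
        if len(fields) == 2:
            name, reason = fields
            reasons[reason].append(name)
    return dict(reasons)
```

An empty dictionary means the query returned no blocked grid disks, so the next cell can be handled; otherwise wait for the previous cell to finish synchronizing and run the check again.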