Secrets to Keeping RAC Running Safely and Stably 24*7, Part 1 (Fixing Intermittent I/O Interruptions)

Latest recommended article published 2021-11-03 10:58:39

cqrw65623


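In all my years in operations, intermittent interruptions (brief, transient disconnects) have been the problem that gives me the most headaches. Most of the time they can only be explained after the fact.
My approach to intermittent interruptions:
1. Use a suitable tool to collect useful information, then analyze the interruption from that data. For anything involving an Oracle database, I recommend OSWatcher (OSwatch).
2. If you cannot use a suitable tool to collect useful error information (at a bank, for example, installing a data collection tool is close to impossible), shift your focus to the symptoms you did observe and to the underlying technical principles. Present management with an analysis built on those few symptoms and a solid grasp of the principles. The reasoning must hold together. Problems can be solved this way too.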

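Today's case is an I/O interruption caused by a fibre cable, which led to the OCR disk group being dismounted on several RAC clusters at the same time.
The figure below gives a brief background to the case and also shows the link path from the storage array to the Fibre Channel switch and on to the servers.

[Figure: link path from storage to Fibre Channel switch to servers]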

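The fault in this case was on the fibre cable between storage port A0 and Fibre Channel switch A0. Interruptions on this one cable affected every LUN whose default owner is SPA, and as it happened most of the LUNs designed to default to SPA were in the OCR disk group. As a result the OCR disk group was dismounted repeatedly. The host, storage, and database teams worked together and finally confirmed that the fibre cable was the cause, and the cable was replaced.
Summary of the fix for the I/O interruptions:
Storage: as a result of this case, fibre cables that did not meet the standard were replaced.
Database: we analyzed how the PST works and tuned the "_asm_hbeatiowait" value accordingly, making the system more robust and stable.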

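PST heartbeat: usually an issue when I/O is interrupted or busy, or when the CPU is busy. When the PST heartbeat delay exceeds the "_asm_hbeatiowait" value, the Oracle ASM instance is told to dismount the disk group, and the GMON process carries out the dismount.
cssd voting heartbeat: usually an issue when the local node cannot access the OCR at all (I/O is completely cut off), which then leads to split-brain.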
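When no collection tool is available, the ASM alert log is often the only evidence of a delayed PST heartbeat. The sketch below scans log lines for the warning that is commonly reported in this situation. The exact message wording, and the helper name `find_pst_waits`, are assumptions for illustration; adjust the pattern to whatever your release actually writes.

```python
import re

# Assumed wording of the ASM alert-log warning; verify against your own log.
PST_WAIT = re.compile(
    r"Waited (?P<secs>\d+) secs for write IO to PST "
    r"disk (?P<disk>\d+) in group (?P<group>\d+)"
)


def find_pst_waits(lines):
    """Return a (secs, disk, group) tuple for every PST wait warning found."""
    hits = []
    for line in lines:
        match = PST_WAIT.search(line)
        if match:
            hits.append(
                (int(match["secs"]), int(match["disk"]), int(match["group"]))
            )
    return hits


# Hypothetical log excerpt used only to exercise the function.
sample = [
    "WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.",
    "NOTE: unrelated message",
    "WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.",
]
print(find_pst_waits(sample))  # [(15, 0, 1), (15, 1, 1)]
```

Grouping the hits by disk group and timestamp makes it easier to line them up with events recorded on the storage or switch side.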

1. What is PST?
PST is the Partner Status Table.
1.1 When ASM PST heartbeats to the disks of a normal or high redundancy disk group are delayed beyond the timeout, the ASM instance dismounts the disk group. By default, the timeout is 15 seconds.
1.2 ASM consults the PST to decide whether a disk can be taken OFFLINE, based on how many of its partners are already offline.
1.3 If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.
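The decision in 1.2 and 1.3 can be sketched as a small toy model. The threshold used here is an assumption made for illustration: with two copies per extent (normal redundancy) no partner may already be offline, and with three copies (high redundancy) one partner may be. The names `TOLERATED_OFFLINE_PARTNERS` and `on_disk_failure` are hypothetical and not part of any Oracle interface.

```python
# Assumed tolerance: how many partners of a failing disk may already be
# offline before the disk group has to be dismounted.
TOLERATED_OFFLINE_PARTNERS = {"normal": 0, "high": 1}


def on_disk_failure(disk, partners, offline, redundancy):
    """Return the action taken when `disk` has to go offline.

    partners: dict mapping each disk to the set of its partner disks
    offline:  set of disks that are already offline
    """
    offline_partners = partners[disk] & offline
    if len(offline_partners) > TOLERATED_OFFLINE_PARTNERS[redundancy]:
        return "dismount disk group"
    return "take disk offline"


# Hypothetical partnership layout for a four-disk group.
partners = {
    "D1": {"D2", "D3"},
    "D2": {"D1", "D4"},
    "D3": {"D1", "D4"},
    "D4": {"D2", "D3"},
}

print(on_disk_failure("D1", partners, set(), "normal"))   # take disk offline
print(on_disk_failure("D1", partners, {"D2"}, "normal"))  # dismount disk group
print(on_disk_failure("D1", partners, {"D2"}, "high"))    # take disk offline
```

In the case described above, one flapping cable affected several LUNs of the same disk group at once, which is the situation where the "too many partners offline" branch is reached.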

2. How does PST work?

This problem has a few distinctive symptoms, the most severe being a node crash.
Also:

Diskgroup outage
Very Slow IO Performance
Possible very high CPU
Timeouts for IO
Communications to ASM, CRS or CSS failures
More:

Only one node active, the other one hangs while starting ASM.
After an outage the Node restarts, but IO Waits are very high
Overall Very slow performance on one node, but no load or evidence of why IO stats are so high

How it works (quoted from http://www.askmaclean.com/archives/pst-partnership-status-table.html):

External Redundancy: normally one PST copy
Normal Redundancy: at most 3 PST copies
High Redundancy: at most 5 PST copies

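The PST may be relocated in the following situations:
An ASM disk holding a PST copy is unavailable (when ASM starts up)
An ASM disk goes OFFLINE
An I/O error occurs while reading or writing the PST
A disk is dropped normally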

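The PST is checked before any other ASM metadata is read.
When an ASM instance is asked to mount a disk group, the GMON process reads all disks in the disk group to find and verify the PST copies.
If it finds enough PST copies, it mounts the disk group.
After that, the PST is cached in the ASM cache and in GMON's PGA, protected by an exclusive PT.n.0 lock.
The other ASM instances in the cluster also cache the PST in their GMON's PGA, protected by a shared PT.n.0 lock.
Only the GMON that holds the exclusive lock can update the PST on disk.
AUN=1 (allocation unit number 1) on every ASM disk is reserved for the PST, but only a few disks actually hold PST data.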
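The mount check described above can be pictured with a short sketch. It assumes that "enough PST copies" means a strict majority of the copies the disk group is supposed to have; the source does not spell out the rule, so treat that as an assumption. `MAX_PST_COPIES` restates the limits listed earlier, and `can_mount` is a hypothetical helper.

```python
# Upper limits on PST copies per redundancy level, as listed in the text.
MAX_PST_COPIES = {"external": 1, "normal": 3, "high": 5}


def can_mount(redundancy, pst_disks, readable_disks):
    """Return True if enough PST copies are readable to mount the disk group.

    pst_disks:      disks that are supposed to hold a PST copy
    readable_disks: disks that GMON could actually read during discovery
    Assumption: "enough" means a strict majority of the expected copies.
    """
    expected = len(set(pst_disks))
    if expected == 0 or expected > MAX_PST_COPIES[redundancy]:
        raise ValueError("invalid number of PST copies for this redundancy")
    found = len(set(pst_disks) & set(readable_disks))
    return found > expected // 2


# Normal redundancy with PST copies on three disks.
print(can_mount("normal", ["D1", "D2", "D3"], ["D1", "D2", "D4"]))  # True
print(can_mount("normal", ["D1", "D2", "D3"], ["D3", "D4"]))        # False
print(can_mount("external", ["D1"], ["D1"]))                        # True
```

Under this model, a short I/O interruption that hides two of the three PST disks is enough to block the mount even though most of the data disks are still reachable.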

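3. Why use PST?
Redundancy is what keeps the data safe.
The PST ensures that the disks within a Normal Redundancy or High Redundancy disk group stay consistent with one another within a bounded time, so that the Normal Redundancy or High Redundancy policy can actually protect the data.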

Reference:
Best Practices for Corruption Detection, Prevention, and Automatic Repair - in a Data Guard Configuration (Doc ID 1302539.1)

Oracle Automatic Storage Management (ASM) - Background

Read errors can be the result of a loss of access to the entire disk or media corruptions on an otherwise healthy disk. Oracle ASM tries to recover from read errors on corrupted sectors on a disk. When a read error by the database or Oracle ASM triggers the Oracle ASM instance to attempt bad block remapping, Oracle ASM reads a good copy of the extent and copies it to the disk that had the read error.

If the write to the same location succeeds, then the underlying allocation unit (sector) is deemed healthy. This might be because the underlying disk did its own bad block reallocation.
If the write fails, Oracle ASM attempts to write the extent to a new allocation unit on the same disk. If this write succeeds, the original allocation unit is marked as unusable. If the write fails, the disk is taken offline.
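The remapping steps quoted above form a short decision chain, which can be sketched as follows. The two callables stand in for the real write attempts and are hypothetical; the function only models the order of the decisions, not how ASM performs the I/O.

```python
def remap_after_read_error(write_same_location, write_new_allocation_unit):
    """Model the outcome of bad block remapping after a read error.

    Both arguments are callables returning True if the write succeeded.
    The good copy of the extent is assumed to have been read already.
    """
    if write_same_location():
        # The disk may have done its own bad block reallocation.
        return "allocation unit deemed healthy"
    if write_new_allocation_unit():
        return "original allocation unit marked unusable"
    return "disk taken offline"


print(remap_after_read_error(lambda: True, lambda: True))
# allocation unit deemed healthy
print(remap_after_read_error(lambda: False, lambda: True))
# original allocation unit marked unusable
print(remap_after_read_error(lambda: False, lambda: False))
# disk taken offline
```

Note that the second write is only attempted when the first one fails, which the short-circuit structure of the function preserves.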
Another benefit with Oracle ASM based mirroring is that the database instance is aware of the mirroring. For many types of physical block corruptions such as a bad checksum, the database instance proceeds through the mirror side looking for valid content and proceeds without errors. If the process in the database that encountered the read can obtain the appropriate locks to ensure data consistency, it writes the correct data to all mirror sides.

When encountering a write error, a database instance sends the Oracle ASM instance a disk offline message.

If database can successfully complete a write to at least one extent copy and receive acknowledgment of the offline disk from Oracle ASM, the write is considered successful.
If the write to all mirror side fails, database takes the appropriate actions in response to a write error such as taking the tablespace offline.
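The write path described above can be sketched in the same spirit. The quoted text covers two outcomes (success and write error); the "waiting for offline acknowledgment" state below is an assumption added to make the model complete, and `database_write_outcome` is a hypothetical name.

```python
def database_write_outcome(copy_results, offline_acknowledged):
    """Classify a mirrored write.

    copy_results:         list of booleans, one per extent copy written
    offline_acknowledged: True once ASM has acknowledged taking the
                          failed disk(s) offline
    """
    if all(copy_results):
        return "success"
    if not any(copy_results):
        # Every mirror side failed, e.g. the tablespace is taken offline.
        return "write error"
    if offline_acknowledged:
        return "success"
    # Assumed intermediate state, not described in the quoted text.
    return "waiting for offline acknowledgment"


print(database_write_outcome([True, True], False))    # success
print(database_write_outcome([True, False], True))    # success
print(database_write_outcome([True, False], False))   # waiting for offline acknowledgment
print(database_write_outcome([False, False], True))   # write error
```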
When the Oracle ASM instance receives a write error message from a database instance or when an Oracle ASM instance encounters a write error itself, the Oracle ASM instance attempts to take the disk offline. Oracle ASM consults the Partner Status Table (PST) to see whether any of the disk's partners are offline. If too many partners are offline, Oracle ASM forces the dismounting of the disk group. Otherwise, Oracle ASM takes the disk offline.

The ASMCMD remap command was introduced to address situations where a range of bad sectors exists on a disk and must be corrected before Oracle ASM or database I/O. For information about the remap command, see "remap".

When ASM detects any block corruptions, ASM logs the error to the ASM alert.log file. The same corruption error may not appear in the database alert.log or application if ASM can correct the corruption automatically.

Starting Oracle 12c, Oracle ASM disk scrubbing checks logical data corruptions and repairs the corruptions automatically in normal and high redundancy disks groups. The feature is designed so that it does not have any impact to the regular input and output (I/O) operations in production systems. The scrubbing process repairs logical corruptions using the Oracle ASM mirror disks. Disk scrubbing uses Oracle ASM rebalancing to minimize I/O overhead.

The scrubbing process is visible in fields of the V$ASM_OPERATION view. Refer to Oracle® Automatic Storage Management Administrator's Guide 12c Release 1 (12.1).

These ASM benefits are available for all databases using ASM. Since every Exadata Database Machine uses ASM, all these benefits are always available for Exadata customers.

########################################################################################
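All rights reserved. This article may be reposted, but the source must be credited with a link; otherwise legal action will be pursued. [QQ discussion group: 53993419]
QQ: 14040928 E-mail: dbadoudou@163.com
Article link: http://blog.itpub.net/26442936/viewspace-2096971/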
########################################################################################
