1. Problem Description
After the storage array was fitted with new disks, the ASM disk group would not rebalance. The rebalance estimate information showed the following.

Check the ASM log:
SQL> show parameter dump

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
background_core_dump                 string      partial
background_dump_dest                 string      /u01/app/12.2.0.1/grid/rdbms/log
core_dump_dest                       string      /u01/app/grid/diag/asm/+asm/+ASM1/cdump
max_dump_file_size                   string      unlimited
shadow_core_dump                     string      partial
user_dump_dest                       string      /u01/app/12.2.0.1/grid/rdbms/log
$ tail -100f /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
2019-06-05T13:37:19.502603+08:00
SUCCESS: alter diskgroup FRA
add disk 'AFD:A700FRA'
drop disk 'FRA0001_','FRA0002'
REBALANCE POWER 10 >>>>>>>>> the process still exists, but is in a hung state
2019-06-05T13:37:19.503854+08:00
NOTE: starting rebalance of group 2/0x7c32f595 (FRA) at power 10
NOTE: starting process ARBA
Starting background process ARBA
2019-06-05T13:37:19.520989+08:00
ARBA started with pid=53, OS id=14496 >>>>>>>>> related process
NOTE: starting process ARB0
Starting background process ARB0
2019-06-05T13:37:19.531873+08:00
ARB0 started with pid=58, OS id=14498 >>>>>>>>> related process
NOTE: assigning ARBA to group 2/0x7c32f595 (FRA) to compute estimates
NOTE: assigning ARB0 to group 2/0x7c32f595 (FRA) with 10 parallel I/Os
Check the state of the two processes; pid 14498 returns no stack information:
pstack 14498
pstack 14496
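Besides pstack, the scheduler state of each process can be read from /proc. A small sketch, assuming a Linux /proc filesystem; `proc_state` is an ad-hoc helper name, not a standard tool:

```shell
# Print the scheduler state of a PID (D = uninterruptible I/O sleep,
# the state a hung ARB process typically shows; S = sleeping, R = running).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Intended use on the node, with the PIDs from the alert log:
#   proc_state 14496
#   proc_state 14498
```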
Tried to wake up the rebalance process; the attempt failed and the OS process was terminated:
SQL> alter diskgroup FRA rebalance power 11
2019-06-05T15:10:09.063564+08:00
SQL> alter diskgroup FRA rebalance power 16
2019-06-05T15:10:09.067600+08:00
NOTE: GroupBlock outside rolling migration privileged region
2019-06-05T15:11:08.053992+08:00
WARNING: process ARB0 terminated via OS
The database logs recorded nothing relevant, so check the system log to investigate OS-level problems:
$ tail -100f /var/log/messages
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit. >> requests exceed the size limit
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 128:32. >> multipath path failure
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 69:80.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 71:144.
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 69:224.
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 1:0:1:2: alua: port group 3e8 state N non-preferred supports tolUsNA >>> appears to be ALUA-related
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 3:0:1:2: alua: port group 3e8 state N non-preferred supports tolUsNA
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 1:0:2:2: alua: port group 3e8 state N non-preferred supports tolUsNA
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 3:0:2:2: alua: port group 3e8 state N non-preferred supports tolUsNA
Jun 5 13:37:19 lxmsrdrac1 multipathd: sdch: mark as failed
Jun 5 13:37:19 lxmsrdrac1 multipathd: asma700-fra1: remaining active paths: 7
Jun 5 13:37:19 lxmsrdrac1 multipathd: sddr: mark as failed
Jun 5 13:37:19 lxmsrdrac1 multipathd: asma700-fra1: remaining active paths: 6
Jun 5 13:37:19 lxmsrdrac1 multipathd: sdcq: mark as failed
Jun 5 13:37:19 lxmsrdrac1 multipathd: asma700-fral: remaining active paths: 5
Jun 5 13:37:19 lxmsrdrac1 multipathd: sdea: mark as failed
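With a log this noisy it helps to tally how many times each path device was marked failed. A small awk sketch, assuming the syslog format shown above; `failed_paths` is a name chosen here, not a standard command:

```shell
# Tally multipathd "mark as failed" events per path device.
# Reads the files given as arguments, or stdin when none are given.
failed_paths() {
    awk '/multipathd.*: mark as failed/ {
             dev = $6            # e.g. "sdch:"
             sub(/:$/, "", dev)  # strip the trailing colon
             print dev
         }' "$@" | sort | uniq -c | sort -rn
}

# Intended use on the node:
#   failed_paths /var/log/messages
```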
Check the multipath status:
$ multipath -ll
asma700-fra1 (3600a0980383047515a2b4d3259355573) dm-46 NETAPP  ,LUN C-Mode
size=1.0T features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=enabled
| |- 1:0:3:2 sdch 69:80  failed ready running
| |- 3:0:3:2 sddr 71:144 failed ready running
| |- 1:0:4:2 sdcq 69:224 failed ready running
| `- 3:0:4:2 sdea 128:32 failed ready running
`-+- policy='round-robin 0' prio=10 status=active
  |- 1:0:1:2 sdbp 68:48  failed ready running
  |- 3:0:1:2 sdcz 70:112 failed ready running
  |- 1:0:2:2 sdby 68:192 failed ready running
  `- 3:0:2:2 sddi 71:0   failed ready running
Multiple links in the multipath map are in the failed state. Caught it!
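The same check can be scripted: count the path lines in `multipath -ll` output whose dm state is "failed". A sketch assuming the layout shown above; `count_failed` is an ad-hoc name:

```shell
# Count path lines in `multipath -ll` output whose dm state is "failed".
# Path lines carry an H:C:T:L tuple such as 1:0:3:2 before the state fields.
count_failed() {
    grep -c '[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*.* failed ' "$@"
}

# Intended use on the node:
#   multipath -ll | count_failed
```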
2. Problem Resolution
A matching solution was found on My Oracle Support:
multipath failure with 4.1.12-94.5.9.el7uek.x86_64 kernel and EMC Xtremio storage (Doc ID 2524139.1)
Modify the multipath.conf file as follows:
devices {
    device {
        vendor                  "3PARdata"
        product                 "VV"
        path_grouping_policy    group_by_prio
        path_selector           "round-robin 0"
        path_checker            tur
        features                "0"
        hardware_handler        "1 alua"
        prio                    alua
        failback                immediate
        rr_weight               uniform
        no_path_retry           18
        rr_min_io_rq            1
        detect_prio             yes
        fast_io_fail_tmo        30
        dev_loss_tmo            "infinity"
    }
    device {
        vendor                  "NETAPP"
        product                 "LUN"
        path_grouping_policy    group_by_prio
        features                "1 queue_if_no_path"
        # prio_callout          "/sbin/mpath_prio_alua /dev/%n"
        path_checker            directio
        path_selector           "round-robin 0"
        failback                immediate
        hardware_handler        "1 alua"
        max_sectors_kb          4096
    }
}
Restarting multipath with multipath -v2 reported an error, so the corresponding parameter was set manually for this disk's paths on both nodes:
cat /sys/block/dm-46/queue/max_sectors_kb
echo 4096 > /sys/block/dm-46/queue/max_sectors_kb
cat /sys/block/sdcn/queue/max_sectors_kb
cat /sys/block/sddt/queue/max_sectors_kb
cat /sys/block/sdcv/queue/max_sectors_kb
cat /sys/block/sdeb/queue/max_sectors_kb
cat /sys/block/sdbx/queue/max_sectors_kb
cat /sys/block/sddd/queue/max_sectors_kb
cat /sys/block/sdcf/queue/max_sectors_kb
cat /sys/block/sddl/queue/max_sectors_kb
echo 4096 > /sys/block/sdcn/queue/max_sectors_kb
echo 4096 > /sys/block/sddt/queue/max_sectors_kb
echo 4096 > /sys/block/sdcv/queue/max_sectors_kb
echo 4096 > /sys/block/sdeb/queue/max_sectors_kb
echo 4096 > /sys/block/sdbx/queue/max_sectors_kb
echo 4096 > /sys/block/sddd/queue/max_sectors_kb
echo 4096 > /sys/block/sdcf/queue/max_sectors_kb
echo 4096 > /sys/block/sddl/queue/max_sectors_kb
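Writing each path device by hand is error prone; sysfs lists a dm device's member paths under slaves/, so the per-path writes above can be scripted. A sketch: the function name and the sysfs-root override are ad hoc, and on a real node it must run as root against /sys/block:

```shell
# Set queue/max_sectors_kb on a dm device and on every slave path
# beneath it, so the whole multipath stack agrees on the limit.
# Usage: set_max_sectors_kb dm-46 4096 [/sys/block]
set_max_sectors_kb() {
    dm=$1 val=$2 sysfs=${3:-/sys/block}
    echo "$val" > "$sysfs/$dm/queue/max_sectors_kb"
    for slave in "$sysfs/$dm/slaves"/*; do
        [ -e "$slave" ] || continue
        echo "$val" > "$sysfs/$(basename "$slave")/queue/max_sectors_kb"
    done
}

# Intended use on each node:
#   set_max_sectors_kb dm-46 4096
```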
A further check shows the disk state is back to normal.
The ASM rebalance also restarted automatically and ran to completion, and the old disks were dropped. Further investigation confirmed this was a multipath bug: once the multipath patch was installed and multipath restarted, the parameter is applied correctly.
Note: TFA in Oracle ASM collects diagnostics automatically. Its orachk task invokes a parted job that resets the max_sectors_kb value configured for multipath back to 64, which can trigger another multipath failure. To avoid this, add the following configuration in a udev rules file:
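The udev configuration itself is not included in the source. As an illustration only, a rule of roughly this shape is commonly used to re-assert the value so that later device events cannot reset it; the file name, match keys, and the 4096 value here are assumptions based on this case, not the note's verbatim text:

```
# /etc/udev/rules.d/99-max-sectors.rules (hypothetical file name)
# Re-assert max_sectors_kb whenever a dm block device is added or changed.
ACTION=="add|change", KERNEL=="dm-*", ATTR{queue/max_sectors_kb}="4096"
```

After adding the rule, reload it with `udevadm control --reload` so it takes effect without a reboot.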