multipath paths fail unexpectedly [blk_cloned_rq_check_limits: over max size limit.]

1. Problem description

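After new disks were swapped in on the storage side, the ASM disk group did not rebalance. Checking the rebalance estimate gave the following result: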


Check the ASM log

show parameter dump

NAME                               TYPE        VALUE
---------------------------------- ----------- ------------------------------
background_core_dump               string      partial
background_dump_dest               string      /u01/app/12.2.0.1/grid/rdbms/log
core_dump_dest                     string      /u01/app/grid/diag/asm/+asm/+ASM1/cdump
max_dump_file_size                 string      unlimited
shadow_core_dump                   string      partial
user_dump_dest                     string      /u01/app/12.2.0.1/grid/rdbms/log

$ tail -100f /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log
2019-06-05T13:37:19.502603+08:00
SUCCESS: alter diskgroup FRA
add disk 'AFD:A700FRA'
drop disk 'FRA0001_','FRA0002'
REBALANCE POWER 10     >>>>>>>>> the process still exists but is hung
2019-06-05T13:37:19.503854+08:00
NOTE: starting rebalance of group 2/0x7c32f595 (FRA) at power 10
NOTE: starting process ARBA
Starting background process ARBA
2019-06-05T13:37:19.520989+08:00
ARBA started with pid=53, OS id=14496     >>>>>>>>> related process
NOTE: starting process ARB0
Starting background process ARB0
2019-06-05T13:37:19.531873+08:00
ARB0 started with pid=58, OS id=14498     >>>>>>>>> related process
NOTE: assigning ARBA to group 2/0x7c32f595 (FRA) to compute estimates
NOTE: assigning ARB0 to group 2/0x7c32f595 (FRA) with 10 parallel I/Os

Check the state of the processes; 14498 returns no stack information

pstack 14498
pstack 14496

An attempt to wake the rebalance process failed, and the OS process was terminated

SQL> alter diskgroup FRA rebalance power 11
2019-06-05T15:10:09.063564+08:00

SQL> alter diskgroup FRA rebalance power 16
2019-06-05T15:10:09.067600+08:00
NOTE: GroupBlock outside rolling migration privileged region
2019-06-05T15:11:08.053992+08:00
WARNING: process ARB0 terminated via OS

The database logs recorded nothing relevant, so the next step was the system log, to look for an OS-level problem:

$ tail -100f /var/log/messages
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.  >> request exceeds the size limit
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 128:32.  >> multipath path failed
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 69:80.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: blk_cloned_rq_check_limits: over max size limit.
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 71:144.
Jun 5 13:37:19 lxmsrdrac1 kernel: device-mapper: multipath: Failing path 69:224.
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 1:0:1:2: alua: port group 3e8 state N non-preferred supports TolUsNA  >>> appears to involve alua
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 3:0:1:2: alua: port group 3e8 state N non-preferred supports TolUsNA
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 1:0:2:2: alua: port group 3e8 state N non-preferred supports TolUsNA
Jun 5 13:37:19 lxmsrdrac1 kernel: sd 3:0:2:2: alua: port group 3e8 state N non-preferred supports TolUsNA
Jun 5 13:37:19 lxmsrdrac1 multipathd: sdch: mark as failed
Jun 5 13:37:19 lxmsrdrac1 multipathd: asma700-fral: remaining active paths: 7
Jun 5 13:37:19 lxmsrdrac1 multipathd: sddr: mark as failed
Jun 5 13:37:19 lxmsrdrac1 multipathd: asma700-fral: remaining active paths: 6
Jun 5 13:37:19 lxmsrdrac1 multipathd: sdcq: mark as failed
Jun 5 13:37:19 lxmsrdrac1 multipathd: asma700-fral: remaining active paths: 5
Jun 5 13:37:19 lxmsrdrac1 multipathd: sdea: mark as failed
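
When the log is long, it can help to reduce it to the set of paths the kernel gave up on. The following is only a minimal sketch: the function name failing_paths is made up for illustration, and it assumes the kernel message format shown in the excerpt above.

```shell
#!/bin/sh
# failing_paths (hypothetical helper): read syslog lines on stdin and print
# each distinct major:minor that device-mapper reported as "Failing path".
failing_paths() {
    # Keep only the major:minor pair from lines such as
    #   kernel: device-mapper: multipath: Failing path 128:32.
    sed -n 's/.*multipath: Failing path \([0-9][0-9]*:[0-9][0-9]*\)\..*/\1/p' |
        sort -u
}

# Example (not executed here):
#   failing_paths < /var/log/messages
```

Each major:minor pair printed this way corresponds to one of the sd devices listed by multipath -ll.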

Check the multipath status

$ multipath -ll

asma700-fral  (3600a0980383047515a2b4d3259355573)  dm-46 NETAPP   ,LUN C-Mode
size=1.0T features='2 queue_if_no_path retain_attached_hw_handler' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=enabled
|  |- 1:0:3:2   sdch                    69:80    failed ready running
|  |- 3:0:3:2   sddr                    71:144   failed ready running
|  |- 1:0:4:2   sdcq                    69:224   failed ready running
|  `- 3:0:4:2   sdea                    128:32   failed ready running
`-+- policy='round-robin 0' prio=10 status=active
   |- 1:0:1:2   sdbp                    68:48    failed ready running
   |- 3:0:1:2   sdcz                    70:112   failed ready running
   |- 1:0:2:2   sdby                    68:192   failed ready running
   `- 3:0:2:2   sddi                    71:0     failed ready running

Every path of this multipath device shows failed. Found it!
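
To make the same check repeatable on both nodes, the multipath -ll output can be summarised instead of read by eye. This is a rough sketch under the assumption that path lines look like the ones above (an H:C:T:L address followed by an sd device name); count_failed_paths is a hypothetical name, not part of multipath-tools.

```shell
#!/bin/sh
# count_failed_paths (hypothetical helper): read `multipath -ll` output on
# stdin and print "<failed> <total>" counted over the path lines.
count_failed_paths() {
    awk '
        # A path line carries a SCSI address (e.g. 1:0:3:2) and an sd device name.
        /[0-9]+:[0-9]+:[0-9]+:[0-9]+ +sd[a-z]+/ {
            total++
            if ($0 ~ / failed /) failed++
        }
        END { printf "%d %d\n", failed, total }
    '
}

# Example (not executed here):
#   multipath -ll | count_failed_paths
```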

2. Resolution

A fix for this error was found on the Oracle support site:

multipath failure with 4.1.12-94.5.9.el7uek.x86_64 kernel and EMC Xtremio storage (Doc ID 2524139.1)

Edit multipath.conf so that it contains the following:

devices {
        device {
                vendor "3PARdata"
                product "VV"
                path_grouping_policy group_by_prio
                path_selector "round-robin 0"
                path_checker tur
                features "0"
                hardware_handler "1 alua"
                prio alua
                failback immediate
                rr_weight uniform
                no_path_retry 18
                rr_min_io_rq 1
                detect_prio yes
                fast_io_fail_tmo 30
                dev_loss_tmo "infinity"
        }
        device {
                vendor                  "NETAPP"
                product                 "LUN"
                path_grouping_policy    group_by_prio
                features                "1 queue_if_no_path"
#                prio_callout            "/sbin/mpath_prio_alua /dev/%n"
                path_checker            directio
                path_selector           "round-robin 0"
                failback                immediate
                hardware_handler        "1 alua"
                max_sectors_kb          4096
        }
}

Reloading multipath with multipath -v2 reported the following error:

Manually change the corresponding parameter for this disk's devices on both nodes

cat /sys/block/dm-46/queue/max_sectors_kb
echo 4096 > /sys/block/dm-46/queue/max_sectors_kb
  
cat /sys/block/sdcn/queue/max_sectors_kb
cat /sys/block/sddt/queue/max_sectors_kb
cat /sys/block/sdcv/queue/max_sectors_kb
cat /sys/block/sdeb/queue/max_sectors_kb

cat /sys/block/sdbx/queue/max_sectors_kb
cat /sys/block/sddd/queue/max_sectors_kb
cat /sys/block/sdcf/queue/max_sectors_kb
cat /sys/block/sddl/queue/max_sectors_kb

echo 4096 > /sys/block/sdcn/queue/max_sectors_kb
echo 4096 > /sys/block/sddt/queue/max_sectors_kb
echo 4096 > /sys/block/sdcv/queue/max_sectors_kb
echo 4096 > /sys/block/sdeb/queue/max_sectors_kb

echo 4096 > /sys/block/sdbx/queue/max_sectors_kb
echo 4096 > /sys/block/sddd/queue/max_sectors_kb
echo 4096 > /sys/block/sdcf/queue/max_sectors_kb
echo 4096 > /sys/block/sddl/queue/max_sectors_kb
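
The per-device commands above can also be written as a loop over the path devices that sysfs lists under the dm device's slaves directory. What follows is a sketch, not the procedure that was actually run: set_max_sectors_kb and its optional sysfs-root argument are hypothetical, and it must be run as root. Unlike the commands above, it writes the path devices before the dm device, on the assumption that the dm device should never be allowed larger requests than its paths, since that mismatch is what blk_cloned_rq_check_limits rejects.

```shell
#!/bin/sh
# set_max_sectors_kb DM_NAME VALUE [SYSFS_ROOT]   (hypothetical helper)
# Write VALUE to queue/max_sectors_kb of every path under DM_NAME, then to
# the dm device itself. SYSFS_ROOT defaults to /sys and exists only so the
# function can be tried against a scratch directory.
set_max_sectors_kb() {
    dm=$1
    value=$2
    root=${3:-/sys}
    for q in "$root"/block/"$dm"/slaves/*/queue/max_sectors_kb \
             "$root"/block/"$dm"/queue/max_sectors_kb; do
        # Skip an unmatched glob or a missing attribute file.
        [ -e "$q" ] || continue
        echo "$value" > "$q" || return 1
    done
}

# Example (not executed here):
#   set_max_sectors_kb dm-46 4096
```

The kernel rejects a value larger than the device's max_hw_sectors_kb, so it is worth reading that file first if the write fails.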

Checking again showed that the disk status was back to normal

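The ASM rebalance also restarted on its own and ran to completion, and the old disks were dropped. Further investigation showed this to be a multipath bug: once the multipath patch was installed and multipath restarted, the parameter could be applied.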

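Note: TFA in Oracle ASM collects diagnostic information automatically. One of its orachk tasks invokes a parted job, which resets an already configured multipath max_sectors_kb back to 64 and can cause a multipath failure. To avoid this, the following must be configured in the udev rules file: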

 
