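Linux mdraid RAID 6 on Debian: a random disk drops out every few days

I have several servers running Debian 8, each with 8x800GB SSDs configured as RAID 6. All disks are attached to an LSI-3008 flashed to IT mode. Each server also has a 2-disk pair in RAID 1 for the operating system.

Current state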

# dpkg -l|grep mdad

ii mdadm 3.3.2-5+deb8u1 amd64 tool to administer Linux MD arrays (software RAID)

# uname -a

Linux R5U32-B 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08) x86_64 GNU/Linux

# more /proc/mdstat

Personalities : [raid1] [raid6] [raid5] [raid4]

md2 : active raid6 sde1[1](F) sdg1[3] sdf1[2] sdd1[0] sdh1[7] sdb1[6] sdj1[5] sdi1[4]

4687678464 blocks super 1.2 level 6, 512k chunk, algorithm 2 [8/7] [U_UUUUUU]

bitmap: 3/6 pages [12KB], 65536KB chunk

md1 : active (auto-read-only) raid1 sda5[0] sdc5[1]

62467072 blocks super 1.2 [2/2] [UU]

resync=PENDING

md0 : active raid1 sda2[0] sdc2[1]

1890881536 blocks super 1.2 [2/2] [UU]

bitmap: 2/15 pages [8KB], 65536KB chunk

unused devices: <none>

# mdadm --detail /dev/md2

/dev/md2:

Version : 1.2

Creation Time : Fri Jun 24 04:35:18 2016

Raid Level : raid6

Array Size : 4687678464 (4470.52 GiB 4800.18 GB)

Used Dev Size : 781279744 (745.09 GiB 800.03 GB)

Raid Devices : 8

Total Devices : 8

Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Tue Jul 19 17:36:15 2016

State : active, degraded

Active Devices : 7

Working Devices : 7

Failed Devices : 1

Spare Devices : 0

Layout : left-symmetric

Chunk Size : 512K

Name : R5U32-B:2 (local to host R5U32-B)

UUID : 24299038:57327536:4db96d98:d6e914e2

Events : 2514191

Number Major Minor RaidDevice State

0 8 49 0 active sync /dev/sdd1

2 0 0 2 removed

2 8 81 2 active sync /dev/sdf1

3 8 97 3 active sync /dev/sdg1

4 8 129 4 active sync /dev/sdi1

5 8 145 5 active sync /dev/sdj1

6 8 17 6 active sync /dev/sdb1

7 8 113 7 active sync /dev/sdh1

1 8 65 - faulty /dev/sde1

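Problem

Every 1-3 days or so, the RAID 6 array degrades. Each time, one of the disks (it can be any of them) is marked as failed with the following errors:

# dmesg -T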

[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: attempting task abort! scmd(ffff8810350cbe00)

[Sat Jul 16 05:38:45 2016] sd 0:0:3:0: [sde] CDB:

[Sat Jul 16 05:38:45 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Sat Jul 16 05:38:45 2016] scsi target0:0:3: handle(0x000d), sas_address(0x500304801707a443), phy(3)

[Sat Jul 16 05:38:45 2016] scsi target0:0:3: enclosure_logical_id(0x500304801707a47f), slot(3)

[Sat Jul 16 05:38:46 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff8810350cbe00)

[Sat Jul 16 05:38:46 2016] end_request: I/O error, dev sde, sector 2064

[Sat Jul 16 05:38:46 2016] md: super_written gets error=-5, uptodate=0

[Sat Jul 16 05:38:46 2016] md/raid:md2: Disk failure on sde1, disabling device. md/raid:md2: Operation continuing on 7 devices.

[Sat Jul 16 05:38:46 2016] RAID conf printout:

[Sat Jul 16 05:38:46 2016] --- level:6 rd:8 wd:7

[Sat Jul 16 05:38:46 2016] disk 0, o:1, dev:sdd1

[Sat Jul 16 05:38:46 2016] disk 1, o:0, dev:sde1

[Sat Jul 16 05:38:46 2016] disk 2, o:1, dev:sdf1

[Sat Jul 16 05:38:46 2016] disk 3, o:1, dev:sdg1

[Sat Jul 16 05:38:46 2016] disk 4, o:1, dev:sdi1

[Sat Jul 16 05:38:46 2016] disk 5, o:1, dev:sdj1

[Sat Jul 16 05:38:46 2016] disk 6, o:1, dev:sdb1

[Sat Jul 16 05:38:46 2016] disk 7, o:1, dev:sdh1

[Sat Jul 16 05:38:46 2016] RAID conf printout:

[Sat Jul 16 05:38:46 2016] --- level:6 rd:8 wd:7

[Sat Jul 16 05:38:46 2016] disk 0, o:1, dev:sdd1

[Sat Jul 16 05:38:46 2016] disk 2, o:1, dev:sdf1

[Sat Jul 16 05:38:46 2016] disk 3, o:1, dev:sdg1

[Sat Jul 16 05:38:46 2016] disk 4, o:1, dev:sdi1

[Sat Jul 16 05:38:46 2016] disk 5, o:1, dev:sdj1

[Sat Jul 16 05:38:46 2016] disk 6, o:1, dev:sdb1

[Sat Jul 16 05:38:46 2016] disk 7, o:1, dev:sdh1

[Sat Jul 16 12:40:00 2016] sd 0:0:7:0: attempting task abort! scmd(ffff88000d76eb00)
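
Since the failing disk changes from one incident to the next, it can help to tally the aborts per SCSI address over a longer log. The following is only a sketch: the function name count_task_aborts is made up for illustration, and it assumes the kernel prints the message in the same "sd H:C:T:L: attempting task abort!" form shown above.

```shell
# Count "attempting task abort!" events per SCSI address (host:channel:target:lun).
# Reads kernel log text on stdin and prints "<address> <count>", one per line.
count_task_aborts() {
    grep 'attempting task abort!' \
        | sed -E 's/.*sd ([0-9]+:[0-9]+:[0-9]+:[0-9]+): attempting task abort!.*/\1/' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}
```

Usage would be something like "dmesg -T | count_task_aborts". If the counts are spread evenly over all targets, that points away from a single bad drive and towards something shared, such as the controller, backplane, or cabling.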

What I have tried

I have tried the following, with no improvement:

> Increased /sys/block/md2/md/stripe_cache_size from 256 to 16384

> Increased dev.raid.speed_limit_min from 1000 to 50000
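
For reference, these two settings can be changed at runtime roughly as follows. This is a sketch that assumes the array is md2, as in this question, and that the commands are run as root; neither change persists across a reboot.

```shell
# Show, then raise, the RAID 5/6 stripe cache for md2 (value is in entries per device).
cat /sys/block/md2/md/stripe_cache_size
echo 16384 > /sys/block/md2/md/stripe_cache_size

# Show, then raise, the minimum resync speed (KiB/s per device).
sysctl dev.raid.speed_limit_min
sysctl -w dev.raid.speed_limit_min=50000
```

Both knobs affect md throughput and resync behaviour only. They do not change how the SCSI layer handles a command timeout, which may be why they made no difference here.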

I need your help

Are these errors caused by the mdadm configuration, by the kernel, or by the controller?

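Update 20160802

Following the advice of ppetraki and others:

> Use raw disks instead of partitions

This did not solve the problem.

> Reduce the chunk size

The chunk size was changed to 128KB and then to 64KB, but the RAID volume still degraded within a few days. dmesg showed errors similar to the earlier ones. I forgot to try reducing the chunk size to 32KB.

> Reduce the number of disks in the RAID to 6

I destroyed the existing RAID, zeroed the superblock on every disk, and created a RAID 6 with 6 disks (raw disks) and a 64KB chunk. Reducing the number of disks seems to make the array last longer, around 4-7 days before it degrades.

> Update the driver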

[Tue Aug 2 17:57:48 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880fc0dd1980)

[Tue Aug 2 17:57:48 2016] sd 0:0:6:0: [sdg] CDB:

[Tue Aug 2 17:57:48 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Tue Aug 2 17:57:48 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)

[Tue Aug 2 17:57:48 2016] scsi target0:0:6: enclosure_logical_id(0x50030480173ee97f), slot(6)

[Tue Aug 2 17:57:49 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880fc0dd1980)

[Tue Aug 2 17:57:49 2016] end_request: I/O error, dev sdg, sector 0

Just a short while ago, my array degraded again. This time both /dev/sdf and /dev/sdg showed the error "attempting task abort! scmd".

[Tue Aug 2 21:26:02 2016]

[Tue Aug 2 21:26:02 2016] sd 0:0:5:0: [sdf] CDB:

[Tue Aug 2 21:26:02 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Tue Aug 2 21:26:02 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)

[Tue Aug 2 21:26:02 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)

[Tue Aug 2 21:26:02 2016] scsi target0:0:5: enclosure level(0x0000), connector name( ^A)

[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88103beb5240)

[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88107934e080)

[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: [sdf] CDB:

[Tue Aug 2 21:26:03 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00

[Tue Aug 2 21:26:03 2016] scsi target0:0:5: handle(0x000f), sas_address(0x50030480173ee945), phy(5)

[Tue Aug 2 21:26:03 2016] scsi target0:0:5: enclosure logical id(0x50030480173ee97f), slot(5)

[Tue Aug 2 21:26:03 2016] scsi target0:0:5: enclosure level(0x0000), connector name( ^A)

[Tue Aug 2 21:26:03 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88107934e080)

[Tue Aug 2 21:26:04 2016] sd 0:0:5:0: [sdf] CDB:

[Tue Aug 2 21:26:04 2016] Read(10): 28 00 04 75 3b f8 00 00 08 00

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: sas_address(0x50030480173ee945), phy(5)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: enclosure logical id(0x50030480173ee97f), slot(5)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: enclosure level(0x0000), connector name( ^A)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: handle(0x000f), ioc_status(success)(0x0000), smid(35)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: request_len(4096), underflow(4096), resid(-4096)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: tag(65535), transfer_count(8192), sc->result(0x00000000)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)

[Tue Aug 2 21:26:04 2016] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)

[Tue Aug 2 22:14:51 2016] sd 0:0:6:0: attempting task abort! scmd(ffff880931d8c840)

[Tue Aug 2 22:14:51 2016] sd 0:0:6:0: [sdg] CDB:

[Tue Aug 2 22:14:51 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Tue Aug 2 22:14:51 2016] scsi target0:0:6: handle(0x0010), sas_address(0x50030480173ee946), phy(6)

[Tue Aug 2 22:14:51 2016] scsi target0:0:6: enclosure logical id(0x50030480173ee97f), slot(6)

[Tue Aug 2 22:14:51 2016] scsi target0:0:6: enclosure level(0x0000), connector name( ^A)

[Tue Aug 2 22:14:51 2016] sd 0:0:6:0: task abort: SUCCESS scmd(ffff880931d8c840)

[Tue Aug 2 22:14:52 2016] sd 0:0:6:0: [sdg] CDB:

[Tue Aug 2 22:14:52 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: sas_address(0x50030480173ee946), phy(6)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: enclosure logical id(0x50030480173ee97f), slot(6)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: enclosure level(0x0000), connector name( ^A)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: handle(0x0010), ioc_status(success)(0x0000), smid(85)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: request_len(0), underflow(0), resid(-8192)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: tag(65535), transfer_count(8192), sc->result(0x00000000)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)

[Tue Aug 2 22:14:52 2016] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)

[Tue Aug 2 22:14:52 2016] end_request: I/O error, dev sdg, sector 16

[Tue Aug 2 22:14:52 2016] md: super_written gets error=-5, uptodate=0

[Tue Aug 2 22:14:52 2016] md/raid:md2: Disk failure on sdg, disabling device. md/raid:md2: Operation continuing on 5 devices.

[Tue Aug 2 22:14:52 2016] RAID conf printout:

[Tue Aug 2 22:14:52 2016] --- level:6 rd:6 wd:5

[Tue Aug 2 22:14:52 2016] disk 0, o:1, dev:sdc

[Tue Aug 2 22:14:52 2016] disk 1, o:1, dev:sdd

[Tue Aug 2 22:14:52 2016] disk 2, o:1, dev:sde

[Tue Aug 2 22:14:52 2016] disk 3, o:1, dev:sdf

[Tue Aug 2 22:14:52 2016] disk 4, o:0, dev:sdg

[Tue Aug 2 22:14:52 2016] disk 5, o:1, dev:sdh

[Tue Aug 2 22:14:52 2016] RAID conf printout:

[Tue Aug 2 22:14:52 2016] --- level:6 rd:6 wd:5

[Tue Aug 2 22:14:52 2016] disk 0, o:1, dev:sdc

[Tue Aug 2 22:14:52 2016] disk 1, o:1, dev:sdd

[Tue Aug 2 22:14:52 2016] disk 2, o:1, dev:sde

[Tue Aug 2 22:14:52 2016] disk 3, o:1, dev:sdf

[Tue Aug 2 22:14:52 2016] disk 5, o:1, dev:sdh

I assume the "attempting task abort! scmd" error is what causes the array to degrade, but I don't know what causes the error itself.
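
One detail in the mpt3sas lines can be decoded: [sense_key,asc,ascq]: [0x06,0x29,0x00]. In the standard SCSI tables, sense key 0x06 is UNIT ATTENTION and ASC/ASCQ 0x29/0x00 means "power on, reset, or bus device reset occurred". In other words, the drive is reporting that it was reset, not that it hit a media error. A small lookup sketch follows; the function name describe_sense is hypothetical and the table deliberately covers only a few codes.

```shell
# Translate a [sense_key, asc, ascq] triple (as printed by mpt3sas) into text.
# Only a handful of sense keys and one ASC/ASCQ pair are included.
describe_sense() {
    local key
    case "$1" in
        0x00) key="NO SENSE" ;;
        0x01) key="RECOVERED ERROR" ;;
        0x02) key="NOT READY" ;;
        0x03) key="MEDIUM ERROR" ;;
        0x04) key="HARDWARE ERROR" ;;
        0x05) key="ILLEGAL REQUEST" ;;
        0x06) key="UNIT ATTENTION" ;;
        0x0b) key="ABORTED COMMAND" ;;
        *)    key="unknown sense key $1" ;;
    esac
    case "$2/$3" in
        0x29/0x00) echo "$key: power on, reset, or bus device reset occurred" ;;
        *)         echo "$key: asc=$2 ascq=$3 (not in this table)" ;;
    esac
}
```

For example, "describe_sense 0x06 0x29 0x00" reports UNIT ATTENTION with the reset description, which is consistent with the abort and reset sequence that precedes it in the log.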

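Update 20160806

I set up another server with the same specification. This one has no mdadm RAID; each disk is mounted directly with an ext4 filesystem. After a while, the kernel log shows "attempting task abort! scmd" on some of the disks. This led to an I/O error on /dev/sdd1, which was then remounted read-only.

$ dmesg -T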

[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:

[Sat Aug 6 05:21:09 2016] Read(10): 28 00 2d 29 21 00 00 00 20 00

[Sat Aug 6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)

[Sat Aug 6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)

[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88006b206800)

[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: attempting task abort! scmd(ffff88019a3a07c0)

[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: [sdd] CDB:

[Sat Aug 6 05:21:09 2016] Read(10): 28 00 08 46 8f 80 00 00 20 00

[Sat Aug 6 05:21:09 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)

[Sat Aug 6 05:21:09 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)

[Sat Aug 6 05:21:09 2016] sd 0:0:3:0: task abort: SUCCESS scmd(ffff88019a3a07c0)

[Sat Aug 6 05:21:10 2016] sd 0:0:3:0: attempting device reset! scmd(ffff880f9a49ac80)

[Sat Aug 6 05:21:10 2016] sd 0:0:3:0: [sdd] CDB:

[Sat Aug 6 05:21:10 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Sat Aug 6 05:21:10 2016] scsi target0:0:3: handle(0x000a), sas_address(0x4433221103000000), phy(3)

[Sat Aug 6 05:21:10 2016] scsi target0:0:3: enclosure_logical_id(0x500304801a5d3f01), slot(3)

[Sat Aug 6 05:21:10 2016] sd 0:0:3:0: device reset: SUCCESS scmd(ffff880f9a49ac80)

[Sat Aug 6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)

[Sat Aug 6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)

[Sat Aug 6 05:21:10 2016] mpt3sas0: log_info(0x31110e03): originator(PL), code(0x11), sub_code(0x0e03)

[Sat Aug 6 05:21:11 2016] end_request: I/O error, dev sdd, sector 780443696

[Sat Aug 6 05:21:11 2016] Aborting journal on device sdd1-8.

[Sat Aug 6 05:21:11 2016] EXT4-fs error (device sdd1): ext4_journal_check_start:56: Detected aborted journal

[Sat Aug 6 05:21:11 2016] EXT4-fs (sdd1): Remounting filesystem read-only

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88024fc08340)

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:

[Sat Aug 6 05:40:35 2016] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00

[Sat Aug 6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)

[Sat Aug 6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: task abort: FAILED scmd(ffff88024fc08340)

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88019a12ee00)

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: [sdf] CDB:

[Sat Aug 6 05:40:35 2016] Read(10): 28 00 27 c8 b4 e0 00 00 20 00

[Sat Aug 6 05:40:35 2016] scsi target0:0:5: handle(0x000c), sas_address(0x4433221105000000), phy(5)

[Sat Aug 6 05:40:35 2016] scsi target0:0:5: enclosure_logical_id(0x500304801a5d3f01), slot(5)

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: task abort: SUCCESS scmd(ffff88019a12ee00)

[Sat Aug 6 05:40:35 2016] sd 0:0:5:0: attempting task abort! scmd(ffff88203eaddac0)
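
The CDB lines in these logs can be decoded by hand to see what was being accessed when a command timed out. In a Read(10) CDB, byte 0 is the opcode (0x28), bytes 2-5 are the big-endian logical block address, and bytes 7-8 are the transfer length in blocks. A minimal sketch, with a hypothetical function name decode_read10:

```shell
# Decode a Read(10) CDB given as ten space-separated hex bytes,
# for example: decode_read10 '28 00 04 75 3b f8 00 00 08 00'
decode_read10() {
    local lba blocks
    # Split the argument into positional parameters $1..$10 (one per byte).
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a Read(10) CDB" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))    # bytes 2-5: logical block address
    blocks=$(( 0x$8$9 ))     # bytes 7-8: transfer length in blocks
    echo "lba=$lba blocks=$blocks"
}
```

For the CDB "28 00 04 75 3b f8 00 00 08 00" this gives lba=74791928 blocks=8. Assuming 512-byte logical blocks, 8 blocks is 4096 bytes, which matches the request_len(4096) that mpt3sas printed for that command. The aborted reads are scattered across the disk, so they do not point at one bad region.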

Update 20160930

After upgrading the controller firmware to the latest version at the time, 12.00.02, the problem disappeared.
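
Because the fix was a firmware upgrade, it is worth recording which firmware each controller runs before and after. The sketch below assumes the HBA driver exposes version_fw and proc_name attributes under /sys/class/scsi_host, which appears to be the case for mpt3sas but should be verified on your kernel. The function name and its optional directory argument are made up for illustration.

```shell
# Print "<host> <driver> <firmware version>" for every SCSI host that
# exposes a version_fw attribute. The sysfs root can be overridden for testing.
list_hba_firmware() {
    local root host
    root="${1:-/sys/class/scsi_host}"
    for host in "$root"/host*; do
        # Skip hosts (e.g. AHCI ports) that have no firmware version attribute.
        [ -r "$host/version_fw" ] || continue
        printf '%s %s %s\n' \
            "${host##*/}" \
            "$(cat "$host/proc_name" 2>/dev/null)" \
            "$(cat "$host/version_fw")"
    done
}
```

Running "list_hba_firmware" with no argument on an affected server should show one line per LSI controller; comparing that output across servers makes it easy to spot any that still run older firmware.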

Conclusion

The problem is solved.
