linux – Why did this SSD drive fail with bad sectors, and could it have been predicted?

Note: this question was previously closed as off-topic. You can read the discussion. My reasons for asking here are:

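> This drive is in an offline content-caching server at a rural school in Zambia.

> The server is built from a disk image, and all of its content is replaceable.

> It has to be cheap, because Zambian schools have limited budgets and there will be many of these servers.

> It also has to be reliable, because replacing it can take 8 hours over bad roads.

> I am not allowed to ask here which drives are not "ultra-cheap crap".

> So we are doing our own research and experiments on drives that meet these criteria.

> My inability to fix the bad sectors by overwriting them (auto-reallocation) defied my assumptions, and I want to know why.

> I thought a secure erase might fix the bad sectors, but I wanted other people's opinions before scrapping the drive.

> I thought I might have missed some SMART data that could have predicted the failure.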

This is a Kingston 240 GB SSD that ran on site for about 3 months before suddenly developing bad sectors:

    smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.20-net6501-121115-1cw] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF INFORMATION SECTION ===
    Device Model:     KINGSTON SVP200S3240G
    Serial Number:    50026B7228010E5C
    LU WWN Device Id: 5 0026b7 228010e5c
    Firmware Version: 502ABBF0
    User Capacity:    240,057,409,536 bytes [240 GB]
    Sector Size:      512 bytes logical/physical
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ACS-2 revision 3
    Local Time is:    Tue Mar 5 17:10:24 2013 CAT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x02) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                 (   0) seconds.
    Offline data collection
    capabilities:                    (0x7b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        (  48) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x0021) SCT Status supported.
                                            SCT Data Table supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   084   084   050    Pre-fail  Always       -       10965286670575
      5 Reallocated_Sector_Ct   0x0033   100   100   003    Pre-fail  Always       -       16
      9 Power_On_Hours          0x0032   000   000   000    Old_age   Always       -       46823733462185
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       127
    171 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
    172 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       0
    174 Unknown_Attribute       0x0030   000   000   000    Old_age   Offline      -       131
    177 Wear_Leveling_Count     0x0000   000   000   000    Old_age   Offline      -       1
    181 Program_Fail_Cnt_Total  0x0032   000   000   000    Old_age   Always       -       0
    182 Erase_Fail_Count_Total  0x0032   000   000   000    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   000   000   000    Old_age   Always       -       49900
    194 Temperature_Celsius     0x0022   033   078   000    Old_age   Always       -       33 (Min/Max 21/78)
    195 Hardware_ECC_Recovered  0x001c   120   120   000    Old_age   Offline      -       235163887
    196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       16
    201 Soft_Read_Error_Rate    0x001c   120   120   000    Old_age   Offline      -       235163887
    204 Soft_ECC_Correction     0x001c   120   120   000    Old_age   Offline      -       235163887
    230 Head_Amplitude          0x0013   100   100   000    Pre-fail  Always       -       100
    231 Temperature_Celsius     0x0013   100   100   010    Pre-fail  Always       -       0
    233 Media_Wearout_Indicator 0x0000   000   000   000    Old_age   Offline      -       363
    234 Unknown_Attribute       0x0032   000   000   000    Old_age   Always       -       208
    241 Total_LBAs_Written      0x0032   000   000   000    Old_age   Always       -       208
    242 Total_LBAs_Read         0x0032   000   000   000    Old_age   Always       -       1001

    SMART Error Log not supported
    SMART Self-test Log not supported
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
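
Whether SMART could have warned about this starts with reading the attribute table programmatically. Below is a minimal Python sketch, assuming the column layout printed by this smartctl 5.41 build; `Attribute`, `parse_attributes` and `failing` are hypothetical names, and the sample rows are copied from the output above with whitespace collapsed. The `failing` check approximates the usual SMART rule (a normalized VALUE at or below a non-zero THRESH); run on these rows it finds nothing, which is consistent with the PASSED verdict even though the drive is returning uncorrectable reads.

```python
import re
from typing import Dict, List, NamedTuple


class Attribute(NamedTuple):
    """One row of smartctl's 'Vendor Specific SMART Attributes' table."""
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # worst normalized value recorded
    thresh: int  # vendor failure threshold (0 = never fails)
    raw: str     # raw value kept as text; its format is vendor-specific


# ID, NAME, FLAG, VALUE, WORST, THRESH, then TYPE/UPDATED/WHEN_FAILED, then RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(.+?)\s*$"
)


def parse_attributes(text: str) -> Dict[int, Attribute]:
    """Parse every attribute row in smartctl output, keyed by attribute ID."""
    attrs: Dict[int, Attribute] = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            aid, name, value, worst, thresh, raw = m.groups()
            attrs[int(aid)] = Attribute(int(aid), name, int(value), int(worst), int(thresh), raw)
    return attrs


def failing(attrs: Dict[int, Attribute]) -> List[int]:
    """IDs whose normalized value has reached a non-zero threshold."""
    return sorted(a.attr_id for a in attrs.values() if a.thresh > 0 and a.value <= a.thresh)


SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 084 084 050 Pre-fail Always - 10965286670575
5 Reallocated_Sector_Ct 0x0033 100 100 003 Pre-fail Always - 16
187 Reported_Uncorrect 0x0032 000 000 000 Old_age Always - 49900
194 Temperature_Celsius 0x0022 033 078 000 Old_age Always - 33 (Min/Max 21/78)
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 16
"""

attrs = parse_attributes(SAMPLE)
print(attrs[187].raw)   # → 49900
print(failing(attrs))   # → []
```

Raw values are kept as strings because vendors pack them differently (attribute 194 carries a Min/Max suffix, and the Power_On_Hours raw value here is clearly not a plain hour count).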

Now I am getting bad blocks in several places on the disk:

    root@iPad2:~# badblocks /dev/sda -v
    Checking blocks 0 to 234431063
    Checking for bad blocks (read-only test): 8394752 done, 1:15 elapsed
    8394756 done, 1:21 elapsed
    8394757 done, 1:23 elapsed
    8394758 done, 1:24 elapsed
    8394759 done, 1:27 elapsed
    ...
    190882871one, 29:49 elapsed
    190882888one, 29:53 elapsed
    190882889one, 29:54 elapsed
    190882890one, 29:56 elapsed
    190882891one, 29:58 elapsed
    done
    Pass completed, 80 bad blocks found.
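
The numbers in this post describe the same place on the disk in different units: badblocks counts blocks of its `-b` size (1024 bytes by default), the kernel's end_request line counts 512-byte sectors, and the "logical block" in the Buffer I/O error line is consistent with 4096-byte blocks. A small Python sketch of the conversions; the 4096-byte figure is inferred from the numbers rather than stated in the log, and the function names are hypothetical.

```python
BADBLOCKS_BLOCK = 1024           # badblocks' default -b block size, in bytes
SECTOR = 512                     # logical sector size reported by smartctl
BUFFER_BLOCK = 4096              # inferred size behind the kernel's "logical block" number
CAPACITY_BYTES = 240_057_409_536  # "User Capacity" from smartctl


def badblock_to_sector(block: int, block_size: int = BADBLOCKS_BLOCK) -> int:
    """First 512-byte sector (LBA) covered by a badblocks block number."""
    return block * block_size // SECTOR


def sector_to_buffer_block(sector: int) -> int:
    """4096-byte block that contains a given 512-byte sector."""
    return sector * SECTOR // BUFFER_BLOCK


lba = badblock_to_sector(8394756)
print(lba)                                    # → 16789512
print(sector_to_buffer_block(lba))            # → 2098689
# Last block badblocks checks when the whole disk is split into 1024-byte blocks:
print(CAPACITY_BYTES // BADBLOCKS_BLOCK - 1)  # → 234431063
```

The last line matches the "Checking blocks 0 to 234431063" range, which supports the 1024-byte block size assumption.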

They seem to be repeatable, and auto-reallocation fails, so I cannot fix them by writing to them:

    root@iPad2:~# badblocks /dev/sda -wvf 8394756 8394756
    /dev/sda is apparently in use by the system; badblocks forced anyway.
    Checking for bad blocks in read-write mode
    From block 8394756 to 8394756
    Testing with pattern 0xaa: 8394756
    done
    Reading and comparing: done
    Testing with pattern 0x55: done
    Reading and comparing: done
    Testing with pattern 0xff: done
    Reading and comparing: done
    Testing with pattern 0x00: done
    Reading and comparing: done
    Pass completed, 1 bad blocks found.
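
The read-write check above overwrites the block with each of four patterns (0xaa, 0x55, 0xff, 0x00), reads it back, and compares. A minimal Python sketch of that loop, assuming the 1024-byte block size badblocks uses by default and demonstrated on a scratch file rather than /dev/sda; `pattern_test` is a hypothetical helper, not badblocks' actual implementation. It is destructive on a real device, and because the read-back may be served from the page cache, a faithful media test would also need to bypass the cache (for example with O_DIRECT), which this sketch does not do.

```python
import os
import tempfile

PATTERNS = (0xAA, 0x55, 0xFF, 0x00)  # the four patterns badblocks -w used above


def pattern_test(path: str, block: int, block_size: int = 1024) -> bool:
    """Write each pattern over one block, read it back, and compare.

    Destructive: the block's previous contents are lost.
    Returns True if every pattern read back intact.
    """
    offset = block * block_size
    fd = os.open(path, os.O_RDWR)
    try:
        for byte in PATTERNS:
            data = bytes([byte]) * block_size
            os.pwrite(fd, data, offset)
            os.fsync(fd)  # flush the write; the read below may still hit the page cache
            if os.pread(fd, block_size, offset) != data:
                return False
        return True
    finally:
        os.close(fd)


# Demo on an ordinary scratch file, never on a disk holding data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(4 * 1024))
scratch = tmp.name
print(pattern_test(scratch, block=2))  # → True
os.remove(scratch)
```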

I get errors like this in the system log:

    ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
    ata1.00: irq_stat 0x40000000
    ata1.00: failed command: READ FPDMA QUEUED
    ata1.00: cmd 60/08:00:08:30:00/00:00:01:00:00/40 tag 0 ncq 4096 in
             res 51/40:08:08:30:00/00:00:01:00:00/40 Emask 0x409 (media error)
    ata1.00: status: { DRDY ERR }
    ata1.00: error: { UNC }
    ata1.00: configured for UDMA/133
    sd 0:0:0:0: [sda] Unhandled sense code
    sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
    Descriptor sense data with sense descriptors (in hex):
            72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
            01 00 30 08
    sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
    sd 0:0:0:0: [sda] CDB: Read(10): 28 00 01 00 30 08 00 00 08 00
    end_request: I/O error, dev sda, sector 16789512
    Buffer I/O error on device sda, logical block 2098689
    ata1: EH complete
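
The Read(10) CDB and the descriptor sense bytes in this log can be decoded by hand to confirm they refer to the same failing sector. A minimal Python sketch following the SCSI field layouts (READ(10): opcode 0x28, LBA in bytes 2 to 5, transfer length in bytes 7 and 8; descriptor-format sense: response code 0x72, sense key in byte 1, ASC/ASCQ in bytes 2 and 3, Information descriptor type 0x00); the function names are hypothetical. Sense key 0x3 is MEDIUM ERROR, and ASC/ASCQ 0x11/0x04 is the "Unrecovered read error - auto reallocate failed" text the kernel printed.

```python
def decode_read10(cdb_hex: str):
    """Decode a SCSI READ(10) CDB into (LBA, transfer length in blocks)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    length = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, length


def decode_descriptor_sense(sense_hex: str):
    """Decode descriptor-format sense data (response code 0x72 or 0x73).

    Returns (sense_key, asc, ascq, information); `information` comes from
    an Information descriptor (type 0x00) if one is present, else None.
    """
    s = bytes.fromhex(sense_hex)
    if len(s) < 8 or (s[0] & 0x7F) not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    sense_key, asc, ascq = s[1] & 0x0F, s[2], s[3]
    information = None
    pos, end = 8, min(len(s), 8 + s[7])      # byte 7: additional sense length
    while pos + 2 <= end:
        dtype, dlen = s[pos], s[pos + 1]
        if dtype == 0x00:                     # Information descriptor
            information = int.from_bytes(s[pos + 4:pos + 12], "big")
        pos += 2 + dlen
    return sense_key, asc, ascq, information


print(decode_read10("28 00 01 00 30 08 00 00 08 00"))  # → (16789512, 8)
key, asc, ascq, info = decode_descriptor_sense(
    "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 01 00 30 08")
print(hex(key), hex(asc), hex(ascq), info)  # → 0x3 0x11 0x4 16789512
```

Both fields point at LBA 16789512, the sector named in the end_request line, and the 8-sector transfer length matches the 4096-byte read (`ncq 4096`).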

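Now I don't understand why auto-reallocation fails on this disk. The smartctl output all looks fine to me. Only 16 sectors have been reallocated, which is not many. I can't see any legitimate reason for this drive to refuse to reallocate sectors. Is this SSD model simply broken, or badly designed?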

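Notes:

> According to Kingston's documentation, attribute 174 is "Unexpected Power Loss".

> 131 unexpected power losses is very bad.

> Attribute 187 (Reported_Uncorrect) is at 49900 out of a maximum possible value of 65535.

> The maximum temperature of 78 °C is very high.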
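
The arithmetic behind the attribute 187 and 194 notes can be checked directly from the raw strings. A small sketch, assuming (as the note does) that attribute 187 is capped at 65535 and that the temperature raw value keeps the `current (Min/Max lo/hi)` format shown above; the helper names are hypothetical.

```python
import re

# Raw values copied from the smartctl output above.
REPORTED_UNCORRECT_RAW = "49900"        # attribute 187
TEMPERATURE_RAW = "33 (Min/Max 21/78)"  # attribute 194

U16_MAX = 65535  # assumption from the note: 187 behaves like a 16-bit counter


def percent_of_u16(raw: str) -> float:
    """How far a raw counter has climbed toward the 16-bit ceiling, in percent."""
    return 100 * int(raw) / U16_MAX


def max_temperature(raw: str) -> int:
    """Extract the lifetime maximum from a 'current (Min/Max lo/hi)' raw value."""
    m = re.search(r"Min/Max\s+(-?\d+)/(-?\d+)", raw)
    if m is None:
        raise ValueError(f"no Min/Max field in {raw!r}")
    return int(m.group(2))


print(round(percent_of_u16(REPORTED_UNCORRECT_RAW), 1))  # → 76.1
print(max_temperature(TEMPERATURE_RAW))                   # → 78
```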

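Kingston hides most of the interesting SMART counters on this drive. However, we can infer the number of spare sectors from attribute 196, Reallocated_Event_Count, whose normalized value is given by this formula: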

normalized value = 100 - (100 * RBC / MRE)

RBC = Retired Block Count (Grown)

MRE = Maximum reallocation count

Since the normalized value is 100, RBC << MRE, so the drive cannot have run out of spare sectors for reallocation.
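
The inference can be made concrete. Taking RBC = 16 (the raw value of attributes 5 and 196) and assuming, which the source does not state, that the firmware truncates 100 * RBC / MRE to an integer before subtracting it from 100, a normalized value of 100 requires 100 * RBC / MRE < 1, that is MRE > 1600. A brute-force check in Python; the function names are hypothetical.

```python
def normalized_196(rbc: int, mre: int) -> int:
    """Normalized value of attribute 196 under the formula above,
    assuming (hypothetically) integer truncation of 100*RBC/MRE."""
    return 100 - (100 * rbc) // mre


def min_mre_for_value(rbc: int, value: int = 100) -> int:
    """Smallest MRE that still yields the observed normalized value."""
    if not 0 <= value <= 100:
        raise ValueError("normalized values here range from 0 to 100")
    mre = 1
    while normalized_196(rbc, mre) < value:  # value only rises as MRE grows
        mre += 1
    return mre


print(min_mre_for_value(16))  # → 1601
```

If the firmware rounded to nearest instead of truncating, the same reasoning would put the lower bound at roughly twice that figure.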
