Using and Analyzing smartctl, the Linux Disk Inspection Tool


1 Purpose

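In today's big-data environments, disk performance and stability are important business factors. On Linux, smartctl is one of the most commonly used disk inspection tools.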

This article examines smartctl on Linux. It explains how to use the tool and analyzes SMART (Self-Monitoring, Analysis and Reporting Technology).

2 Terms, Definitions, and Abbreviations

2.1 Terms and Definitions

The terms and definitions used in this article are listed in Table 2.1.

Table 2.1

Term/Definition: Meaning

SMART: Self-Monitoring, Analysis and Reporting Technology

2.2 Abbreviations

The abbreviations used in this document are listed in Table 2.2.

Table 2.2

Abbreviation: SMART

Full form: Self-Monitoring, Analysis and Reporting Technology

Chinese term: 自监察分析及报告技术

3 smartctl

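smartctl is a command-line utility shipped in the smartmontools package (here, the smartmontools-5.38-2.el5 RPM). It performs SMART tasks: printing the SMART self-test and error logs, enabling or disabling SMART automatic testing, and starting disk self-tests.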

Syntax:

smartctl [options] device

device:

"/dev/hd[a-t]"  IDE/ATA disks

"/dev/sd[a-z]"  SCSI disks. Note that SATA disks are accessed through the libata library, so add the option "-d ata" for them.
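For scripting, the syntax above maps directly onto an argument list. The following Python sketch is one way to build it; the helper name `smartctl_command` and its `sata` flag are hypothetical, and the flag simply inserts `-d ata` as described for SATA disks. The device-name check accepts only the two patterns listed above.

```python
import re
from typing import List

# Device names listed in the text: /dev/hd[a-t] (IDE/ATA) and /dev/sd[a-z] (SCSI/SATA).
_DEVICE_RE = re.compile(r"^/dev/(hd[a-t]|sd[a-z])$")


def smartctl_command(device: str, *options: str, sata: bool = False) -> List[str]:
    """Build the argv for: smartctl [options] device (hypothetical helper)."""
    if not _DEVICE_RE.match(device):
        raise ValueError(f"unexpected device name: {device}")
    argv = ["smartctl"]
    if sata:
        # SATA disks behind libata: add "-d ata", as described above.
        argv += ["-d", "ata"]
    argv += list(options)
    argv.append(device)
    return argv


if __name__ == "__main__":
    print(smartctl_command("/dev/sda", "-H", sata=True))
    # ['smartctl', '-d', 'ata', '-H', '/dev/sda']
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`; smartctl usually needs root privileges to query a disk.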

3.1 [options]:

The options are grouped by type below.

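3.1.1 Information options:

-h: Show help information.

-V: Show version information.

-i: Print basic information about the disk (device model, serial number, firmware version, and so on).

-a: Print all SMART information about the disk.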

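3.1.2 Run-time behavior options:

-q TYPE: Select a quiet output mode.

TYPE has 3 values:

errorsonly: print only error information.

silent: print nothing.

noserial: do not print the serial number.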

-d TYPE: Specify the device type. If it is not given, smartctl guesses the type from the device name.

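-T TYPE: Specify how tolerant smartctl is of command failures, that is, whether it keeps running after an error.

TYPE has 4 values:

conservative: exit on any failure, even of an optional SMART command.

normal: exit if a mandatory SMART command fails.

permissive: ignore one failure of a mandatory SMART command.

verypermissive: ignore all failures of mandatory SMART commands.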

-b TYPE: Specify what smartctl does when it detects a checksum error.

TYPE has 3 values:

warn: print a warning and continue.

exit: exit smartctl.

ignore: continue without a warning.

-r TYPE: Debug reporting, relevant mainly to smartmontools developers.

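-n POWERMODE: Specify whether smartctl skips the check when the disk is in a power-saving mode. The default is never, i.e. the disk is always checked.

POWERMODE has 4 values:

never: always check.

sleep: check unless the disk is in sleep mode.

standby: check unless the disk is in sleep or standby mode.

idle: check unless the disk is in sleep, standby, or idle mode.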
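The -n rule can be read as a threshold on how deeply the disk is sleeping. The sketch below models only that decision and is not smartctl's own code; the function name `would_check` and the label `active` for a disk that is not saving power are assumptions made for illustration.

```python
# ATA power modes ordered from deepest power saving to fully awake.
# "active" is an illustrative label for a disk that is not saving power.
POWER_MODES = ("sleep", "standby", "idle", "active")
NOCHECK_VALUES = ("never", "sleep", "standby", "idle")


def would_check(current_mode: str, nocheck: str = "never") -> bool:
    """Return True if a check would proceed for the given -n POWERMODE."""
    if current_mode not in POWER_MODES:
        raise ValueError(f"unknown power mode: {current_mode}")
    if nocheck not in NOCHECK_VALUES:
        raise ValueError(f"unknown -n value: {nocheck}")
    if nocheck == "never":
        return True  # default: always check
    # Skip when the disk is in the named mode or any deeper one.
    return POWER_MODES.index(current_mode) > POWER_MODES.index(nocheck)


if __name__ == "__main__":
    for mode in POWER_MODES:
        print(mode, [would_check(mode, n) for n in NOCHECK_VALUES])
```

Ordering the modes once and comparing positions keeps the four rules above in a single comparison.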

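3.1.3 SMART feature switches:

-s on/off: Enable or disable SMART on the disk.

-o on/off: Enable or disable SMART automatic offline testing, which scans the disk for defects every 4 hours.

-S on/off: Enable or disable the "attribute autosave" feature (automatic saving of vendor-specific attributes).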

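3.1.4 SMART read and display options:

-H: Report whether the disk is healthy. If the disk reports a failing status, it has either already failed or is predicted to fail within 24 hours.

-c: Show the generic SMART capabilities the disk supports and their current status.

-A: Show the vendor-specific SMART attributes the disk supports. The attributes are numbered from 1 to 253 and have specific names.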

-l TYPE: Specify which log to display.

TYPE has 4 values:

error: show only the error log.

selftest: show only the self-test log.

selective: show only the selective self-test log.

directory: show only the Log Directory.

-v N,OPTION: Display vendor-specific SMART attribute N using a vendor-specific format.

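-F TYPE: Specify how smartctl behaves when the disk is affected by certain known but still unfixed hardware or software bugs.

-P TYPE: Specify whether smartctl applies the settings stored in its drive database for this disk.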

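3.1.5 SMART offline test and self-test options:

-t TEST: Start the test immediately. Can be combined with -C.

TEST can be one of the following:

offline: offline test. Can be used on a disk with mounted file systems.

short: short self-test. Can be used on a disk with mounted file systems.

long: long (extended) self-test. Can be used on a disk with mounted file systems.

conveyance [ATA only]: conveyance self-test, which checks for damage incurred during transport. Can be used on a disk with mounted file systems.

select,N-M or select,N+SIZE [ATA only]: selective self-test, which tests only part of the disk's LBAs. N is the starting LBA, M is the ending LBA, and SIZE is the number of LBAs to test.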

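-C: Run the test in captive mode.

Notes: (1) -C must be used together with -t, and it has no effect with -t offline.

(2) -C keeps the disk very busy, so it is best used on a disk with no mounted file systems.

-X: Abort a self-test running in non-captive mode.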
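The constraints in this section (valid TEST values, the two select span forms, and the -C notes) can be checked before invoking smartctl. This is a hypothetical pre-flight check, not part of smartctl. It accepts only decimal LBAs and assumes that select,N+SIZE covers LBAs N through N+SIZE-1, which is how the smartctl manual describes it; verify this against your version.

```python
import re
from typing import List, Optional, Tuple

SIMPLE_TESTS = {"offline", "short", "long", "conveyance"}
# select,N-M or select,N+SIZE (decimal LBAs only in this sketch)
_SELECT_RE = re.compile(r"^select,(\d+)([-+])(\d+)$")


def parse_select(test: str) -> Tuple[int, int]:
    """Return the inclusive (first_lba, last_lba) span of a selective test."""
    m = _SELECT_RE.match(test)
    if m is None:
        raise ValueError(f"not a select span: {test}")
    start, op, value = int(m.group(1)), m.group(2), int(m.group(3))
    if op == "-":
        end = value  # N-M: M is the last LBA
    else:
        if value == 0:
            raise ValueError("SIZE must be at least 1")
        end = start + value - 1  # N+SIZE: SIZE LBAs starting at N (assumed)
    if end < start:
        raise ValueError(f"empty span: {test}")
    return start, end


def check_test_options(test: Optional[str], captive: bool = False) -> List[str]:
    """Validate a -t/-C combination against the notes above; return warnings."""
    if test is not None and test not in SIMPLE_TESTS:
        parse_select(test)  # raises ValueError for anything unrecognized
    if captive and test is None:
        raise ValueError("-C must be used together with -t")
    warnings: List[str] = []
    if captive and test == "offline":
        warnings.append("-C has no effect with -t offline")
    elif captive:
        warnings.append("-C keeps the disk busy; prefer a disk with no mounted file system")
    return warnings
```

For example, `check_test_options("select,0-99", captive=True)` returns one warning about captive mode, while `check_test_options("short")` returns an empty list.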

3.2 Common examples

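3.2.1 Checking the current overall health status

Check the current overall health status of /dev/sda. PASSED means the disk is healthy; any other result means the disk has already failed or will fail soon.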

smartctl -H /dev/sda
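When the health check runs from a script, the verdict can be taken from the output text. This sketch assumes the ATA-style result line `SMART overall-health self-assessment test result: PASSED`. SCSI disks report health differently, and the wording may vary between smartctl versions. The function names are hypothetical.

```python
import re
import subprocess
from typing import Optional

# ATA-style result line printed by `smartctl -H` (assumed wording).
_HEALTH_RE = re.compile(r"SMART overall-health self-assessment test result:\s*(\S+)")


def overall_health(output: str) -> Optional[str]:
    """Return the verdict word (e.g. "PASSED"), or None if the line is absent."""
    match = _HEALTH_RE.search(output)
    return match.group(1) if match else None


def is_healthy(output: str) -> bool:
    """PASSED means healthy; anything else (or no verdict) is treated as not healthy."""
    return overall_health(output) == "PASSED"


def check_device(device: str) -> bool:
    """Run `smartctl -H` on a device (needs smartctl installed, usually as root)."""
    result = subprocess.run(["smartctl", "-H", device], capture_output=True, text=True)
    return is_healthy(result.stdout)
```

Treating a missing verdict as "not healthy" is a conservative choice: it flags disks whose output could not be interpreted instead of silently passing them.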

3.2.2 Viewing all information

Print all SMART information for /dev/sda.

smartctl -a /dev/sda

For ATA disks, this is equivalent to running the following commands in order:

smartctl -H /dev/sda

smartctl -i /dev/sda

smartctl -c /dev/sda

smartctl -A /dev/sda

smartctl -l error /dev/sda

smartctl -l selftest /dev/sda

smartctl -l selective /dev/sda
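The expansion of -a can be generated programmatically, for example to run the pieces one at a time. The list below follows the smartctl manual's description of -a for ATA disks (-H -i -c -A -l error -l selftest -l selective). `expanded_commands` is a hypothetical helper name.

```python
import shlex
from typing import List

# Option groups that `smartctl -a` combines for ATA disks (per the smartctl manual).
ALL_PARTS: List[List[str]] = [
    ["-H"],
    ["-i"],
    ["-c"],
    ["-A"],
    ["-l", "error"],
    ["-l", "selftest"],
    ["-l", "selective"],
]


def expanded_commands(device: str) -> List[List[str]]:
    """Return one smartctl argv per part of -a, in order."""
    return [["smartctl", *part, device] for part in ALL_PARTS]


if __name__ == "__main__":
    for argv in expanded_commands("/dev/sda"):
        print(shlex.join(argv))  # e.g. "smartctl -l error /dev/sda"
```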

3.2.3 Enabling/disabling SMART

Enable or disable SMART on /dev/sda:

smartctl -s on /dev/sda

smartctl -s off /dev/sda

To check whether SMART is currently enabled, use the -i option:

smartctl -i /dev/sda
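In the -i output, the SMART status is reported on lines that begin with `SMART support is:`. A minimal parser might look like the following. It assumes the `Enabled`/`Disabled` wording used by current smartctl releases, and the function name is hypothetical.

```python
from typing import Optional


def smart_enabled(info_output: str) -> Optional[bool]:
    """Return True/False from `smartctl -i` output, or None if no status line is found."""
    for line in info_output.splitlines():
        if line.startswith("SMART support is:"):
            value = line.split(":", 1)[1].strip()
            # The "Available - ..." line only reports capability, so keep looking.
            if value.startswith("Enabled"):
                return True
            if value.startswith("Disabled"):
                return False
    return None
```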

3.2.4 Offline test

Run an offline test on /dev/sda. Its results are mainly used to update the SMART attributes.

smartctl -t offline /dev/sda

3.2.5 Short test

Run a short self-test on /dev/sda.

smartctl -t short /dev/sda

3.2.5.1 Monitoring test progress

The -c option shows the progress of a running test:

# smartctl -c /dev/sda

Self-test execution status:      ( 242) Self-test routine in progress...
20% of test remaining.
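The number in parentheses is the raw self-test execution status byte. Its upper four bits hold a status code (15 means a self-test is in progress), and its lower four bits hold the remaining work in 10% steps. So 242 = 0xF2 decodes to "in progress, 20% remaining", which matches the output above. A small decoder, with a hypothetical function name:

```python
from typing import Tuple

IN_PROGRESS = 15  # status code: self-test routine in progress


def decode_selftest_status(value: int) -> Tuple[int, int]:
    """Split the status byte into (status code, percent of test remaining)."""
    if not 0 <= value <= 0xFF:
        raise ValueError(f"not a byte: {value}")
    status = value >> 4              # upper 4 bits: status code
    remaining = (value & 0x0F) * 10  # lower 4 bits: remaining work in 10% steps
    return status, remaining


if __name__ == "__main__":
    status, remaining = decode_selftest_status(242)
    print(status == IN_PROGRESS, remaining)  # True 20
```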

3.2.5.2 Viewing test results

The -l selftest option shows the test log of /dev/sda.

Entry #1: Completed without error means the test finished with no errors.

Entry #2: Aborted by host means the test was aborted by the user, with 90% of it still remaining.

# smartctl -l selftest /dev/sda
...
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       9535         -
# 2  Extended offline    Aborted by host               90%       9534         -
...
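To process the self-test log in a script, each entry line can be split into its columns. The regular expression below assumes the column layout shown above, in particular that the description is separated from the status by at least two spaces. `SelfTestEntry` and `parse_selftest_log` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# e.g. "# 1  Short offline       Completed without error       00%       9535         -"
_ENTRY_RE = re.compile(
    r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


@dataclass
class SelfTestEntry:
    num: int
    description: str
    status: str
    remaining_pct: int
    lifetime_hours: int
    lba_of_first_error: Optional[int]  # None when the log shows "-"


def parse_selftest_log(text: str) -> List[SelfTestEntry]:
    """Parse the entry lines of `smartctl -l selftest` output; other lines are ignored."""
    entries = []
    for line in text.splitlines():
        m = _ENTRY_RE.match(line.strip())
        if m is None:
            continue
        lba = m.group(6)
        entries.append(
            SelfTestEntry(
                num=int(m.group(1)),
                description=m.group(2),
                status=m.group(3),
                remaining_pct=int(m.group(4)),
                lifetime_hours=int(m.group(5)),
                lba_of_first_error=int(lba) if lba.isdigit() else None,
            )
        )
    return entries
```

The header line starts with `Num` rather than `#`, so it is skipped along with any other non-entry lines.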

3.2.6 Viewing SMART attribute values

The -A option shows the SMART attribute values of /dev/sda:

smartctl -A /dev/sda
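The -A output is a table with the columns ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED and RAW_VALUE. A minimal parser, assuming that layout and that the raw value is the last column (it may contain spaces, e.g. a temperature followed by min/max values in parentheses), could look like this. The names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attribute:
    attr_id: int
    name: str
    flag: int
    value: int
    worst: int
    thresh: int
    attr_type: str
    updated: str
    when_failed: str
    raw: str  # kept as text: its format is often vendor-specific


def parse_attributes(text: str) -> List[Attribute]:
    """Parse attribute rows from `smartctl -A` output; header and other lines are skipped."""
    attrs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        try:
            attrs.append(
                Attribute(
                    attr_id=int(parts[0]),
                    name=parts[1],
                    flag=int(parts[2], 16),  # e.g. "0x0033"
                    value=int(parts[3]),
                    worst=int(parts[4]),
                    thresh=int(parts[5]),
                    attr_type=parts[6],
                    updated=parts[7],
                    when_failed=parts[8],
                    raw=" ".join(parts[9:]),  # raw value may contain spaces
                )
            )
        except ValueError:
            continue  # row did not match the assumed numeric columns
    return attrs
```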

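3.4 SMART attributes

Running smartctl -A /dev/sda displays many SMART attributes of the disk, which help you judge whether it is healthy.

The list below explains the meaning of each attribute: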

ID (hex) Attribute name: Description

01 (0x01) Read Error Rate: (Vendor-specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has a different structure for different vendors and is often not meaningful as a decimal number.

02 (0x02) Throughput Performance: Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing, there is a high probability that there is a problem with the disk.

03 (0x03) Spin-Up Time: Average time of spindle spin-up (from zero RPM to fully operational, in milliseconds).

04 (0x04) Start/Stop Count: A tally of spindle start/stop cycles. The spindle turns on, and hence the count is increased, both when the hard disk is turned on after having previously been turned entirely off (disconnected from the power source) and when the hard disk returns from having previously been put into sleep mode.

05 (0x05) Reallocated Sectors Count: Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks that sector as "reallocated" and transfers data to a special reserved area (spare area). This process is also known as remapping, and reallocated sectors are called "remaps". The raw value normally represents a count of the bad sectors that have been found and remapped. Thus, the higher the attribute value, the more sectors the drive has had to reallocate. This allows a drive with bad sectors to continue operation; however, a drive which has had any reallocations at all is significantly more likely to fail in the near future. Reallocations also affect performance, because the drive head is forced to seek to the reserved area whenever a remap is accessed. A workaround which will preserve drive speed at the expense of capacity is to create a disk partition over the region which contains remaps and instruct the operating system to not use that partition.

06 (0x06) Read Channel Margin: Margin of a channel while reading data. The function of this attribute is not specified.

07 (0x07) Seek Error Rate: (Vendor-specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has a different structure for different vendors and is often not meaningful as a decimal number.

08 (0x08) Seek Time Performance: Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.

09 (0x09) Power-On Hours: Count of hours in the power-on state. The raw value of this attribute shows the total count of hours (or minutes, or seconds, depending on the manufacturer) in the power-on state.

10 (0x0A) Spin Retry Count: Count of retries of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.

11 (0x0B) Recalibration Retries or Calibration Retry Count: This attribute indicates the number of times recalibration was requested (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.

12 (0x0C) Power Cycle Count: This attribute indicates the count of full hard disk power on/off cycles.

13 (0x0D) Soft Read Error Rate: Uncorrected read errors reported to the operating system.

180 (0xB4) Unused Reserved Block Count Total: "Pre-Fail" attribute used at least in HP devices.

183 (0xB7) SATA Downshift Error Count: Western Digital and Samsung attribute.

184 (0xB8) End-to-End error / IOEDC: This attribute is a part of Hewlett-Packard's SMART IV technology, as well as part of other vendors' IO Error Detection and Correction schemas, and it contains a count of parity errors which occur in the data path to the media via the drive's cache RAM.

185 (0xB9) Head Stability: Western Digital attribute.

186 (0xBA) Induced Op-Vibration Detection: Western Digital attribute.

187 (0xBB) Reported Uncorrectable Errors: The count of errors that could not be recovered using hardware ECC.

188 (0xBC) Command Timeout: The count of aborted operations due to HDD timeout. Normally this attribute value should be zero; if the value is far above zero, there will most likely be serious problems with the power supply or an oxidized data cable.

189 (0xBD) High Fly Writes: HDD producers implement a Fly Height Monitor that attempts to provide additional protection for write operations by detecting when a recording head is flying outside its normal operating range. If an unsafe fly height condition is encountered, the write process is stopped, and the information is rewritten or reallocated to a safe region of the hard drive. This attribute indicates the count of these errors detected over the lifetime of the drive. This feature is implemented in most modern Seagate drives and some of Western Digital's drives, beginning with the WD Enterprise WDE18300 and WDE9180 Ultra2 SCSI hard drives, and will be included on all future WD Enterprise products.

190 (0xBE) Airflow Temperature (WDC) / Airflow Temperature Celsius (HP): Airflow temperature on Western Digital HDs (same as temperature [C2], but the current value is 50 less for some models; marked as obsolete).

191 (0xBF) G-sense Error Rate: The count of errors resulting from externally induced shock and vibration.

192 (0xC0) Power-off Retract Count or Emergency Retract Cycle Count (Fujitsu): Count of times the heads are loaded off the media. Heads can be unloaded without actually powering off.

193 (0xC1) Load Cycle Count or Load/Unload Cycle Count (Fujitsu): Count of load/unload cycles into the head landing zone position. The typical lifetime rating for laptop (2.5-in) hard drives is 300,000 to 600,000 load cycles. Some laptop drives are programmed to unload the heads whenever there has not been any activity for about five seconds. Many Linux installations write to the file system a few times a minute in the background. As a result, there may be 100 or more load cycles per hour, and the load cycle rating may be exceeded in less than a year.

194 (0xC2) Temperature / Temperature Celsius: Current internal temperature.

195 (0xC3) Hardware ECC Recovered: (Vendor-specific raw value.) The raw value has a different structure for different vendors and is often not meaningful as a decimal number.

196 (0xC4) Reallocation Event Count: Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful and unsuccessful attempts are counted.

197 (0xC5) Current Pending Sector Count: Count of "unstable" sectors (waiting to be remapped because of read errors). If an unstable sector is subsequently read successfully, this value is decreased and the sector is not remapped. Read errors on a sector will not remap the sector (since it might be readable later); instead, the drive firmware remembers that the sector needs to be remapped, and remaps it the next time it is written.

198 (0xC6) Uncorrectable Sector Count or Offline Uncorrectable or Off-Line Scan Uncorrectable Sector Count: The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

199 (0xC7) UltraDMA CRC Error Count: The count of errors in data transfer via the interface cable, as determined by ICRC (Interface Cyclic Redundancy Check).

200 (0xC8) Multi-Zone Error Rate: The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is.

200 (0xC8) Write Error Rate (Fujitsu): The total count of errors when writing a sector.

201 (0xC9) Soft Read Error Rate or TA Counter Detected: Count of off-track errors.

202 (0xCA) Data Address Mark errors or TA Counter Increased: Count of Data Address Mark errors (or vendor-specific).

203 (0xCB) Run Out Cancel: Count of ECC errors.

204 (0xCC) Soft ECC Correction: Count of errors corrected by software ECC.

205 (0xCD) Thermal Asperity Rate (TAR): Count of errors due to high temperature.

206 (0xCE) Flying Height: Height of heads above the disk surface. A flying height that is too low increases the chances of a head crash, while a flying height that is too high increases the chances of a read/write error.

207 (0xCF) Spin High Current: Amount of surge current used to spin up the drive.

208 (0xD0) Spin Buzz: Count of buzz routines needed to spin up the drive due to insufficient power.

209 (0xD1) Offline Seek Performance: Drive's seek performance during its internal tests.

210 (0xD2) Unknown: Found in Maxtor 6B200M0 200 GB and Maxtor 2R015H1 15 GB disks.

211 (0xD3) Vibration During Write: Vibration during write.

212 (0xD4) Shock During Write: Shock during write.

220 (0xDC) Disk Shift: Distance the disk has shifted relative to the spindle (usually due to shock or temperature). The unit of measure is unknown.

222 (0xDE) Loaded Hours: Time spent operating under data load (movement of the magnetic head armature).

223 (0xDF) Load/Unload Retry Count: Count of times the head changes position.

224 (0xE0) Load Friction: Resistance caused by friction in mechanical parts while operating.

225 (0xE1) Load/Unload Cycle Count: Total count of load cycles.

226 (0xE2) Load 'In'-time: Total time of loading on the magnetic heads actuator (time not spent in the parking area).

227 (0xE3) Torque Amplification Count: Count of attempts to compensate for platter speed variations.

228 (0xE4) Power-Off Retract Cycle: The count of times the magnetic armature was retracted automatically as a result of cutting power.

230 (0xE6) GMR Head Amplitude: Amplitude of "thrashing" (distance of repetitive forward/reverse head motion).

231 (0xE7) Temperature: Drive temperature.

232 (0xE8) Endurance Remaining: Number of physical erase cycles completed on the drive as a percentage of the maximum physical erase cycles the drive is designed to endure.

232 (0xE8) Available Reserved Space: Intel SSDs report the available reserved space as a percentage of the reserved space in a brand-new SSD.

233 (0xE9) Power-On Hours: Number of hours elapsed in the power-on state.

233 (0xE9) Media Wearout Indicator: Intel SSDs report a normalized value of 100 (when the SSD is new) that declines to a minimum value of 1. It decreases while the NAND erase cycles increase from 0 to the maximum-rated cycles.

240 (0xF0) Head Flying Hours: Time while the head is positioning.

240 (0xF0) Transfer Error Rate (Fujitsu): Count of times the link is reset during a data transfer.

241 (0xF1) Total LBAs Written: Total count of LBAs written.

242 (0xF2) Total LBAs Read: Total count of LBAs read. Some S.M.A.R.T. utilities will report a negative number for the raw value, since in reality it has 48 bits rather than 32.

250 (0xFA) Read Error Retry Rate: Count of errors while reading from a disk.

254 (0xFE) Free Fall Protection: Count of "Free Fall Events" detected.
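The descriptions of attributes 5, 197 and 198 above all treat a growing count as a sign of trouble. A simple watch-list check over raw values (for instance, values taken from a parsed -A table) might look like the following. The choice of attributes and the "non-zero means warn" rule are a heuristic assumption, not a smartctl feature, and `warning_signs` is a hypothetical name.

```python
from typing import Dict, List

# Attributes whose growing raw count the descriptions above treat as a warning sign.
WATCHED = {
    5: "Reallocated Sectors Count",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def warning_signs(raw_values: Dict[int, int]) -> List[str]:
    """Return one message per watched attribute with a non-zero raw value."""
    warnings = []
    for attr_id, name in sorted(WATCHED.items()):
        raw = raw_values.get(attr_id, 0)  # missing attributes count as 0
        if raw > 0:
            warnings.append(f"{attr_id} (0x{attr_id:02X}) {name}: raw value {raw}")
    return warnings
```

Attributes outside the watch list (such as 9, Power-On Hours) are ignored, since their raw values grow during normal operation.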

3.5 SMART self-test

Use smartctl -t offline/short/long to run a self-test on a given disk.

offline:

This is the default self-test.

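short:

The purpose of the short self-test is to quickly determine whether the disk is faulty.

The test consists of several items, all defined by the disk vendor, for example:

a) Electrical tests, which test the disk's internal circuitry. The details are specified by the vendor, for example:

A) Cache test.

B) Read/write circuitry test.

C) Read/write head test.

b) Seek/servo tests, which test the disk's ability to seek to and servo on data tracks.

c) Read/verify tests, which test the disk's ability to read part or all of the disk.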

long:

Also called the extended self-test. It covers the same kinds of test items as short, but takes much longer.
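Putting the pieces together, a short or long self-test is typically started with -t and then polled with -c until the status byte no longer reports "in progress" (upper four bits equal to 15). This sketch takes the command runner as a parameter so it can be exercised without a real disk; `run_selftest` and its parameters are hypothetical.

```python
import re
import time
from typing import Callable, List

# Status line from `smartctl -c`, e.g. "Self-test execution status:      ( 242) ..."
_STATUS_RE = re.compile(r"Self-test execution status:\s*\(\s*(\d+)\)")
IN_PROGRESS = 15  # upper 4 bits of the status byte while a test is running


def run_selftest(
    device: str,
    test: str,
    run: Callable[[List[str]], str],
    poll_seconds: float = 60.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Start a self-test, poll `smartctl -c` until it finishes, return the final status byte."""
    if test not in ("short", "long"):
        raise ValueError(f"unsupported test: {test}")
    run(["smartctl", "-t", test, device])
    for _ in range(max_polls):
        sleep(poll_seconds)  # give the drive time to start/advance the test
        match = _STATUS_RE.search(run(["smartctl", "-c", device]))
        if match is None:
            raise RuntimeError("self-test status line not found in -c output")
        status = int(match.group(1))
        if status >> 4 != IN_PROGRESS:
            return status
    raise TimeoutError(f"self-test still running after {max_polls} polls")
```

In real use, `run` could be `lambda argv: subprocess.run(argv, capture_output=True, text=True).stdout`. Afterwards, the outcome of the test can be read from the self-test log with `smartctl -l selftest`.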
