Handling a single disk failure on a DataNode

Today the DataNode service on one of our servers stopped by itself. The DataNode log showed the following error:

org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 1, volumes configured: 2, volumes failed: 1, volume failures tolerated: 0
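
To make the numbers in that message easier to read, here is a minimal Python sketch that pulls the four counters out of the log line and checks whether the failed count exceeds the tolerated count. The function names (`parse_volume_failure`, `exceeds_tolerance`) are made up for this illustration and are not part of Hadoop.

```python
import re

# The log line from this post, copied verbatim.
LOG_LINE = (
    "org.apache.hadoop.util.DiskChecker$DiskErrorException: "
    "Too many failed volumes - current valid volumes: 1, "
    "volumes configured: 2, volumes failed: 1, "
    "volume failures tolerated: 0"
)

PATTERN = re.compile(
    r"current valid volumes: (?P<valid>\d+), "
    r"volumes configured: (?P<configured>\d+), "
    r"volumes failed: (?P<failed>\d+), "
    r"volume failures tolerated: (?P<tolerated>\d+)"
)


def parse_volume_failure(line):
    """Return the four counters as ints, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def exceeds_tolerance(counters):
    """True when more volumes have failed than the configuration tolerates."""
    return counters["failed"] > counters["tolerated"]


counters = parse_volume_failure(LOG_LINE)
print(counters)
print("exceeds tolerance:", exceeds_tolerance(counters))
```

Here one volume has failed and zero failures are tolerated, so the check is true, which matches the DataNode shutting itself down.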

This means a data volume has failed. The relevant settings in hdfs-site.xml are:
<property>
<name>dfs.datanode.data.dir</name>
<value>/diskb/hadoop/hdfs/data,/diskc/hadoop/hdfs/data,/diskd/hadoop/hdfs/data</value>
</property>

<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>0</value>
</property>


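dfs.datanode.failed.volumes.tolerated is set to 0, which means that as soon as any one of the configured disks (diskb, diskc, diskd) fails, the DataNode service stops. If it is set to 1, the DataNode keeps running with one failed disk.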
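
As a rough way to check these two settings on a node, the sketch below reads the properties from an hdfs-site.xml document and reports how many data directories are configured and how many must stay healthy. It is only an illustration: the XML is embedded as a string wrapped in a `<configuration>` root element, and the helper names `read_properties` and `summarize` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two properties from this post, wrapped in a <configuration> root.
SAMPLE = """<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/diskb/hadoop/hdfs/data,/diskc/hadoop/hdfs/data,/diskd/hadoop/hdfs/data</value>
</property>
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>0</value>
</property>
</configuration>"""


def read_properties(xml_text):
    """Map property name -> value, stripping stray whitespace from both."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def summarize(props):
    """Count the data directories and how many of them must remain healthy."""
    raw_dirs = props.get("dfs.datanode.data.dir", "")
    data_dirs = [d.strip() for d in raw_dirs.split(",") if d.strip()]
    # When the property is absent, Hadoop's default is 0.
    tolerated = int(props.get("dfs.datanode.failed.volumes.tolerated", "0"))
    return {
        "data_dirs": data_dirs,
        "tolerated": tolerated,
        "min_healthy": len(data_dirs) - tolerated,
    }


print(summarize(read_properties(SAMPLE)))
```

To read the real file instead of the embedded sample, pass the contents of the node's hdfs-site.xml to `read_properties`.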


The kernel log shows a large number of I/O errors on sdd:
# dmesg
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 2a 00 00 00 30 00 00
__ratelimit: 8 callbacks suppressed
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 32 00 00 00 28 00 00
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2760773, block=706740260
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 2a 00 00 00 30 00 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 32 00 00 00 28 00 00
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2760802, block=706740262
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(16): 88 00 00 00 00 01 51 00 01 2a 00 00 00 30 00 00
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2760734, block=706740257
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 2a 00 00 08 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 3a 00 00 10 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 52 00 00 08 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 62 00 00 08 00
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sdd] CDB: Read(10): 28 00 11 80 01 42 00 00 08 00
end_request: I/O error, dev sdd, sector 293601570
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=143361, block=36700192
end_request: I/O error, dev sdd, sector 5750391074
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=2807809, block=718798880
end_request: I/O error, dev sdd, sector 5653922090
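
With output this long it helps to count errors per device rather than read every line. The following is a small sketch that does this for saved dmesg text; the sample is a few lines copied from the output above, and `summarize_dmesg` is a hypothetical helper name.

```python
import re
from collections import Counter

# A few lines copied from the dmesg output in this post.
SAMPLE = """\
sd 3:0:0:0: [sdd] Unhandled error code
sd 3:0:0:0: [sdd] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
end_request: I/O error, dev sdd, sector 293601570
EXT4-fs error (device sdd1): __ext4_get_inode_loc: unable to read inode block - inode=143361, block=36700192
end_request: I/O error, dev sdd, sector 5750391074
"""

IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")
FS_ERROR = re.compile(r"EXT4-fs error \(device (\w+)\)")


def summarize_dmesg(text):
    """Count block-layer I/O errors and EXT4 errors per device name."""
    io_errors = Counter()
    fs_errors = Counter()
    for line in text.splitlines():
        io_match = IO_ERROR.search(line)
        if io_match:
            io_errors[io_match.group(1)] += 1
        fs_match = FS_ERROR.search(line)
        if fs_match:
            fs_errors[fs_match.group(1)] += 1
    return io_errors, fs_errors


io_errors, fs_errors = summarize_dmesg(SAMPLE)
print("I/O errors per disk:", dict(io_errors))
print("EXT4 errors per partition:", dict(fs_errors))
```

If every error points at the same device, as it does here with sdd and sdd1, that disk is the one to look at.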

Trying to create a file on that disk fails with the following error:
# touch 111
touch: cannot touch `111': Read-only file system

Check the disk's SMART health status:
# smartctl -H /dev/sdd

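Note the value after "result": PASSED means SMART reports the disk as healthy.
If it reports a failure instead, it is best to replace the disk in the server right away.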
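
For scripted checks, the result line can be extracted from the command output. This is a sketch under assumptions: the sample text below is an illustrative result line in the format smartctl prints for ATA disks, not output captured from this server, and `smart_health` and `run_smartctl` are hypothetical helper names. SCSI/SAS disks print a different line, which this pattern does not cover.

```python
import re
import subprocess

# Illustrative sample only; not captured from the server in this post.
SAMPLE_OUTPUT = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""

RESULT = re.compile(r"test result:\s*(\S+)")


def smart_health(output):
    """Return the word after 'test result:' (e.g. 'PASSED'), or None."""
    match = RESULT.search(output)
    if match is None:
        return None
    return match.group(1).rstrip("!")


def run_smartctl(device):
    """Run `smartctl -H` on a device (needs root) and return its stdout.

    smartctl uses non-zero exit codes to flag problems, so the return
    code is deliberately not treated as an exception here.
    """
    completed = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return completed.stdout


print(smart_health(SAMPLE_OUTPUT))
```

On a real node, `smart_health(run_smartctl("/dev/sdd"))` would give the status for the disk discussed here.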


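It is clear that the sdd disk is faulty. Take this node out of the Hadoop cluster, umount the disk, replace it with a new one, format and mount the new disk, and then add the server back to the Hadoop cluster.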
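
The disk-level part of those steps can be written down as a dry-run plan. The sketch below only builds and prints the commands; it runs nothing. It assumes the failed partition is /dev/sdd1, that it is mounted at /diskd (the post does not state which mount point belongs to sdd), and that the new disk has already been partitioned so /dev/sdd1 exists again. `disk_swap_plan` is a hypothetical helper. Removing the node from the cluster and adding it back are not covered, since those commands depend on the Hadoop version.

```python
def disk_swap_plan(partition, mount_point, fs_type="ext4"):
    """Return (commands before the swap, commands after it) as argv lists.

    Nothing is executed; the caller decides whether and how to run them.
    """
    before_swap = [
        ["umount", mount_point],
    ]
    after_swap = [
        # Destroys any data on the partition: only for the NEW disk.
        [f"mkfs.{fs_type}", partition],
        ["mount", partition, mount_point],
    ]
    return before_swap, after_swap


# Assumed device and mount point, for illustration only.
before, after = disk_swap_plan("/dev/sdd1", "/diskd")
for argv in before + after:
    print(" ".join(argv))
```

Printing the plan first makes it easy to double-check the device name before running mkfs, which is the one step here that cannot be undone.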

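Some people online suggest booting into Linux rescue mode and running fsck on the disk, but to avoid the problem coming back I simply replaced it with a new one.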