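An emergency power cut at the office over the weekend took down the Hadoop test cluster in our machine room. When we came back on Monday and tried to restart the cluster, Cloudera Manager reported HDFS block corruption errors. This CDH test cluster stores the data of an HBase cluster and a Solr cluster on HDFS.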
1) Run the following on the shell command line:
hdfs fsck /
This revealed a large number of corrupt blocks.
The output looked like this:
FSCK started by root (auth:SIMPLE) from /172.16.8.165 for path / at Mon Jun 06 19:15:02 CST 2016
....................................................................................................
....................................................................................................
....................................................................................................
..........
/hbase/data/default/C_PICRECORD_IDX_COLLISION/147d35ece51f9930b851faaee205067c/info/a028095ab9474f3bb8edb9a98d3a0e53: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324588_4583931. Target Replicas is 3 but found 1 replica(s).
.....................
/hbase/data/default/C_PICRECORD_IDX_COLLISION/5181381e418e083a907116b3c4c76551/info/9c5aaf4c46e34422a00d041fd6d25b19: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324587_4583930. Target Replicas is 3 but found 1 replica(s).
.....
/hbase/data/default/C_PICRECORD_IDX_COLLISION/58b932aad968d950415d099e39b3dc5a/info/2e691116caba4872be64d64c115b1ca7: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324585_4583928. Target Replicas is 3 but found 2 replica(s).
.................
/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28/info/39f7d025b4504c38b514e80c43b721f7: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324602
/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28/info/39f7d025b4504c38b514e80c43b721f7: MISSING 1 blocks of total size 127604197 B...............................
/hbase/data/default/C_PICRECORD_IDX_COLLISION/eb89de3a0550e56e3625c0bd87f592c3/info/32027e5b58274abbb1733b9e9d62a4b3: Under replicated BP-1471870221-192.168.27.165-1461916959086:blk_1078324586_4583929. Target Replicas is 3 but found 1 replica(s).
....
.................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
.....................................................................
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdt: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324798
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdt: MISSING 1 blocks of total size 778472 B..
/solr/C_PICRECORD/core_node1/data/index/_3rq7.fdx: CORRUPT blockpool BP-1471870221-192.168.27.165-1461916959086 block blk_1078324784
...
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
....................................................................................................
..................................Status: CORRUPT
Total size: 111683665394 B (Total open files size: 326518049 B)
Total dirs: 1301
Total files: 3634
Total symlinks: 0 (Files currently being written: 29)
Total blocks (validated): 3925 (avg. block size 28454437 B) (Total open file blocks (not validated): 29)
********************************
CORRUPT FILES: 560
MISSING BLOCKS: 560
MISSING SIZE: 209925836 B
CORRUPT BLOCKS: 560
********************************
Minimally replicated blocks: 3365 (85.73248 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 4 (0.10191083 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.5640764
Corrupt blocks: 560
Missing replicas: 7 (0.0595694 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Mon Jun 06 19:15:02 CST 2016 in 207 milliseconds
The filesystem under path '/' is CORRUPT
Note this line in particular:
Corrupt blocks: 560
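Since the Corrupt blocks figure is the number to re-check after every restart, it can help to pull it out of the report automatically. The following is a minimal sketch, not part of the original procedure: corrupt_block_count is a hypothetical helper name, and it assumes the summary line has the form "Corrupt blocks: <number>" as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read an `hdfs fsck` report on stdin and print the
# value of the "Corrupt blocks:" summary line.
corrupt_block_count() {
    awk -F: '/Corrupt blocks:/ {
        gsub(/[ \t]/, "", $2)   # drop the padding around the number
        print $2
        exit                    # only the first matching line is needed
    }'
}

# Usage (assumes the hdfs CLI is on PATH):
#   hdfs fsck / | corrupt_block_count
```

The match is case-sensitive, so the upper-case "CORRUPT BLOCKS:" line in the starred section of the report is ignored and only the summary line is read.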
2) Tried starting the Solr cluster, then ran hdfs fsck / again after the restart.
Only the HBase-related corrupt blocks were left.
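To see at a glance which service the remaining corrupt files belong to, the fsck report can be grouped by top-level directory. This is a rough sketch under stated assumptions: corrupt_files_by_dir is a hypothetical helper name, and it relies on corrupt files being reported as "<path>: CORRUPT blockpool ..." lines, as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read an `hdfs fsck` report on stdin and print
# "<top-level directory> <number of corrupt files>" per line.
corrupt_files_by_dir() {
    # Keep only the path in front of ": CORRUPT blockpool", dropping any
    # progress dots that fsck printed before it.
    sed -n 's|^\.*\(/.*\): CORRUPT blockpool .*|\1|p' |
        sort -u |   # a file with several corrupt blocks is counted once
        awk -F/ '{ n[$2]++ } END { for (d in n) print "/" d, n[d] }' |
        sort
}

# Usage (assumes the hdfs CLI is on PATH):
#   hdfs fsck / | corrupt_files_by_dir
```

If only a /hbase line is printed, the remaining corruption is confined to HBase data. fsck also has a built-in -list-corruptfileblocks option that lists the affected blocks and files directly.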
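3) Tried restarting the HBase cluster and ran hdfs fsck / again afterwards; the HBase-related corrupt blocks were still there.
The HBase Master web UI showed every region of the C_PICRECORD table as offline, and one region of C_PICRECORD_IDX_COLLISION as offline.
Running the following in the shell: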
hbase hbck
produced this result:
HBaseFsck command line options:
Version: 1.0.0-cdh5.4.2
Number of live region servers: 3
Number of dead region servers: 0
Master: master,60000,1465219825898
Number of backup masters: 0
Average load: 96.66666666666667
Number of requests: 14
Number of regions: 290
Number of regions in transition: 16
Number of empty REGIONINFO_QUALIFIER rows in hbase:meta: 0
Number of Tables: 8
ERROR: Region { meta => C_PICRECORD,\x0B,1464994589166.002457068da67399d7b3e199f1c36cc1., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/002457068da67399d7b3e199f1c36cc1, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0A,1464994589166.0bac6481608a98607d30f510317e5a3f., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/0bac6481608a98607d30f510317e5a3f, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0C,1464994589166.0cbbb1a413145d9289c096f0fb3f0d5e., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/0cbbb1a413145d9289c096f0fb3f0d5e, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x08,1464994589166.16e1bd458911637b07e16e35e2a5d700., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/16e1bd458911637b07e16e35e2a5d700, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x04,1464994589166.300652a5c355b5be8ecf0f7f29fe6e23., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/300652a5c355b5be8ecf0f7f29fe6e23, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x01,1464994589166.5461981c1498b8c031a723e41f92899d., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/5461981c1498b8c031a723e41f92899d, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x05,1464994589166.6530d18ae3d09d0570bb98f56a25254c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/6530d18ae3d09d0570bb98f56a25254c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,,1464994589166.751a488c823dc5e0384273fc6cf9435c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/751a488c823dc5e0384273fc6cf9435c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD_IDX_COLLISION,\x06\x00\x00\x00\x00\x00,1464994612530.7867301490e544790671057d7e18ee28., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD_IDX_COLLISION/7867301490e544790671057d7e18ee28, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x02,1464994589166.8da6d9faebabeae4cc4f7591a0bddf1a., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/8da6d9faebabeae4cc4f7591a0bddf1a, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x06,1464994589166.a0016ea8564da9b848012860921bd612., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/a0016ea8564da9b848012860921bd612, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x09,1464994589166.a574bd7a2e6d49d25cd6fa51510155e1., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/a574bd7a2e6d49d25cd6fa51510155e1, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x03,1464994589166.b2e9309a7d0dca1218d3dcbef5511f1a., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/b2e9309a7d0dca1218d3dcbef5511f1a, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0E,1464994589166.b9a199b401d9e21383551e8ef4f0a090., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/b9a199b401d9e21383551e8ef4f0a090, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x07,1464994589166.bd0d12dbbdbe4a3c01cb6c818172b083., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/bd0d12dbbdbe4a3c01cb6c818172b083, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: Region { meta => C_PICRECORD,\x0D,1464994589166.ce84407fd5dfaad56552c04620a4745c., hdfs => hdfs://master:8020/hbase/data/default/C_PICRECORD/ce84407fd5dfaad56552c04620a4745c, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: There is a hole in the region chain between \x06\x00\x00\x00\x00\x00 and \x07\x00\x00\x00\x00\x00. You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table C_PICRECORD_IDX_COLLISION
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
ERROR: Found inconsistency in table C_PICRECORD
Summary:
hbase:meta is okay.
Number of regions: 1
Deployed on: slave2,60020,1465219825115
C_PICRECORD_IDX_COLLISION is okay.
Number of regions: 14
Deployed on: slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
SYSTEM.CATALOG is okay.
Number of regions: 1
Deployed on: slave3,60020,1465219824655
C_PICRECORD is okay.
Number of regions: 0
Deployed on:
hbase:namespace is okay.
Number of regions: 1
Deployed on: slave3,60020,1465219824655
SYSTEM.SEQUENCE is okay.
Number of regions: 256
Deployed on: slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
SYSTEM.FUNCTION is okay.
Number of regions: 1
Deployed on: slave3,60020,1465219824655
C_PICRECORD_IDX is okay.
Number of regions: 15
Deployed on: slave1,60020,1465219825663 slave2,60020,1465219825115 slave3,60020,1465219824655
SYSTEM.STATS is okay.
Number of regions: 1
Deployed on: slave3,60020,1465219824655
18 inconsistencies detected.
Status: INCONSISTENT
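With this many ERROR lines it is easier to see the damage per table. The sketch below is an illustration rather than part of the original procedure: undeployed_regions_by_table is a hypothetical helper name, and it assumes the "not deployed on any region server" lines have exactly the shape shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `hbase hbck` output on stdin and print
# "<table> <number of regions not deployed>" per line.
undeployed_regions_by_table() {
    # The table name is the text between "meta => " and the first comma.
    sed -n 's/^ERROR: Region { meta => \([^,]*\),.* not deployed on any region server\.$/\1/p' |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'   # put the table name first, then the count
}

# Usage (assumes the hbase CLI is on PATH):
#   hbase hbck | undeployed_regions_by_table
```

For the output above this gives 15 undeployed regions for C_PICRECORD and 1 for C_PICRECORD_IDX_COLLISION. Together with the two "hole in the region chain" errors, that accounts for the 18 inconsistencies hbck reports.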
Tried to repair this with hbase hbck -fix
and hbase hbck -repair,
but both attempts failed.
4) Ran hdfs fsck / -delete
to delete the HBase files containing corrupt blocks outright, then restarted the HBase cluster. All regions came online and the problem was resolved.
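[Note]
Removing the corrupt HDFS blocks with hdfs fsck / -delete
deletes the affected files and therefore causes data loss. We have not yet found a clean way to repair the corrupt blocks themselves, and would welcome a better approach!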
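Because -delete on / removes every corrupt file in the whole filesystem, one way to limit the damage is to review the affected paths first and delete only under a chosen prefix. The following is a minimal sketch, assuming a list of corrupt file paths (one per line) has already been collected from the fsck report. print_delete_commands and corrupt_paths.txt are hypothetical names, and the helper only prints commands without running them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: read corrupt file paths on stdin (one per
# line) and print an `hdfs fsck <path> -delete` command for each path under
# the given prefix. Nothing is executed; review the output before running it.
print_delete_commands() {
    prefix=$1
    while IFS= read -r path; do
        case $path in
            # Only paths strictly below the prefix directory are selected.
            "$prefix"/*) printf "hdfs fsck '%s' -delete\n" "$path" ;;
        esac
    done
}

# Usage: print_delete_commands /hbase < corrupt_paths.txt
```

Swapping -delete for -move in the printed commands would ask fsck to move the corrupt files to /lost+found instead of deleting them. The simple quoting here assumes the paths contain no single quotes.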