Scenario: a Spark job fails with the following error.
1. Error message:
17/05/09 14:30:58 WARN scheduler.TaskSetManager: Lost task 28162.1 in stage 0.0 (TID 30490, 127.0.0.1): java.io.IOException: Cannot obtain block length for LocatedBlock{BP-203532773-dfsfdf-1476004795661:blk_1080431162_6762963; getBlockSize()=411; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[127.0.0.1:1004,DS-e9905a06-4607-4113-b717-709a087b8b96,DISK], DatanodeInfoWithStorage[127.0.0.1:1004,DS-a5046b43-4416-45d9-8ff6-44891bcdf3b8,DISK], DatanodeInfoWithStorage[127.0.0.1:1004,DS-f6b04bbe-9555-4ac8-b06a-3317eb229511,DISK]]}
2. Reference for the fix:
https://community.hortonworks.com/questions/37412/cannot-obtain-block-length-for-locatedblock.html
3. Check the files with fsck
hdfs fsck /user/admin/data/cdn/20170509 -locations -blocks -files
Status: HEALTHY
Total size: 2115443944 B (Total open files size: 7684855 B)
Total dirs: 1
Total files: 67353
Total symlinks: 0 (Files currently being written: 367)
Total blocks (validated): 67339 (avg. block size 31414 B) (Total open file blocks (not validated): 357)
Minimally replicated blocks: 67339 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 1
---------------------
Finding: the directory still contains files that are open for write (367 files currently being written, 357 open file blocks not validated).
4. Next, list the problematic files with -openforwrite
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite
Total size: 2123128799 B
Total dirs: 1
Total files: 67720
Total symlinks: 0
Total blocks (validated): 67696 (avg. block size 31362 B)
************************
CORRUPT FILES: 253
MISSING BLOCKS: 253
MISSING SIZE: 7473074 B
************************
Minimally replicated blocks: 67443 (99.626274 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.9887881
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 1
FSCK ended at Wed May 10 10:01:56 CST 2017 in 1357 milliseconds
The filesystem under path '/user/admin/data/cdn/20170509' is CORRUPT
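As a side note, fsck also has a -list-corruptfileblocks flag that prints affected files directly; whether it covers blocks that are merely open for write depends on the cluster and version, so treat it only as an optional shortcut to the manual steps below:
hdfs fsck /user/admin/data/cdn/20170509 -list-corruptfileblocks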
(1) Find the problematic files
cat tmp.txt |tr '/' '\n' |grep ngaahcs-acc |tr ':' ' '|awk '{print $1}' |sort |uniq |grep -v "2017112318"
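Here tmp.txt is assumed to be the -openforwrite output from step 4 saved to a local file, for example:
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite > tmp.txt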
(2) The recommended fix: delete the .tmp files
hdfs dfs -rmr /user/admin/data/cdn/20170509/*.tmp
However, this did not solve the problem!
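A useful sanity check (a sketch, not part of the original notes) is to see whether any .tmp files exist at all; the listing in step (3) shows that the files left open here are actually .gz files, so removing .tmp files alone cannot clear them:
hdfs dfs -ls /user/admin/data/cdn/20170509/*.tmp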
(3) After deleting the .tmp files, run the check again:
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite
Or locate those files this way:
[root@eeeee spark]# hdfs fsck /user/admin/data/cdn/20170509 -openforwrite |grep "/user/admin/data/cdn//20170509"
Connecting to namenode via http://rrrrrr:50070
/user/admin/data/cdn//20170509/ngaahcs-access.log..201705090002.1494259322790.gz 250 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log..201705090002.1494259322790.gz: MISSING 1 blocks of total size 250 B.......
/user/admin/data/cdn//20170509/ngaahcs-access.log.705090000.1494259200039.gz 1222 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.l4.201705090000.1494259200039.gz: MISSING 1 blocks of total size 1222
/user/admin/data/cdn//20170509/ngaahcs-access.log.C2-3l4.201705090245.1494269103909.gz 211 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CTSX2-3l4.201705090750.1494287404133.gz 1504 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-3l4.201705090820.1494289204450.gz 308 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.C2-3l4.201705091545.1494315903839.gz 437 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.SX3-3l3.201705090002.1494259321230.gz 1075 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CX3-3l4.201705090001.1494259260581.gz 521 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-X3-3l4.201705090001.1494259260581.gz: MISSING 1 blocks of total size
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-SX3-3l4.201705090002.1494259320807.gz 729 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-GX-GD-SX4-3l4.201705090001.1494259260236.gz 1138 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-3l4.201705090001.1494259260236.gz: MISSING 1 blocks of total size 1138 B.........................
/user/admin/data/cdn//20170509/ngaahcs-access.log.CTX9-3n3.201705090001.1494259260495.gz 2379 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CXq-3k1.201705090002.1494259320204.gz: MISSING 1 blocks of total size 10153
/user/admin/data/cdn//20170509/ngaahcs-access.log.CTXq-3k2.201705090001.1494259260772.gz 539 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-GXq-3n1.201705090002.1494259320328.gz 1278 bytes, 1 block(s), OPENFORWRITE:
/user/admin/data/cdn//20170509/ngaahcs-access.log.CT-G-3n2.201705090001.1494259260696.gz 2183 bytes, 1 block(s), OPENFORWRITE:
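To handle these files in bulk, the OPENFORWRITE paths can be pulled out of the fsck output into a list; a minimal sketch (the file name openfiles.txt is an assumption, not from the original notes):
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite 2>/dev/null | grep OPENFORWRITE | awk '{print $1}' > openfiles.txt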
If the files are not important, simply delete them; a batch-delete sketch follows.
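A hedged batch delete over that list (double-check openfiles.txt before running it):
xargs -n 50 hdfs dfs -rm < openfiles.txt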
Then check again:
hdfs fsck /user/admin/data/cdn/20170509 -openforwrite
Total size: 2115004402 B
Total dirs: 1
Total files: 67337
Total symlinks: 0
Total blocks (validated): 67337 (avg. block size 31409 B)
Minimally replicated blocks: 67337 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 6
Number of racks: 1
FSCK ended at Wed May 10 10:16:52 CST 2017 in 1329 milliseconds
The filesystem under path '/user/admin/data/cdn//20170509' is HEALTHY
Then rerun the Spark job.
Note: this is not the ultimate fix, so the root cause still needs to be tracked down.
If the files are important, they must be recovered instead of deleted.
Check and recover the files one by one.
Take this file as an example: /user/admin/data/cdn//20170508/ngaahcs-access.log.3k3.201705081700.1494234003128.gz
Run the lease-recovery command:
hdfs debug recoverLease -path <path-of-the-file> -retries <retry times>
hdfs debug recoverLease -path /user/admin/data/cdn//20170508/ngaahcs-access.log.C00.1494234003128.gz -retries 10
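When many files are affected, the recoverLease calls can be looped over the list collected earlier; a minimal sketch (assuming openfiles.txt from step (3) still holds the open paths):
while read -r f; do
  hdfs debug recoverLease -path "$f" -retries 10
done < openfiles.txt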
Hadoop command reference:
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#fsck