HDFS troubleshooting notes

1. HDFS command usage

hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks | -replicaDetails | -upgradedomains]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]

<path> (start checking from this path)

-move (move corrupted files to /lost+found)

-delete (delete corrupted files)

-files (print out files being checked)

-openforwrite (print out files opened for write)

-includeSnapshots (include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it)

-list-corruptfileblocks (print out a list of missing blocks and the files they belong to)

-files -blocks (print out the block report)

-files -blocks -locations (print out locations for every block)

-files -blocks -racks (print out network topology for DataNode locations)

-files -blocks -replicaDetails (print out details of each replica)

-files -blocks -upgradedomains (print out upgrade domains for every block)

-storagepolicies (print out a storage policy summary for the blocks)

-blockId (print out which file this blockId belongs to, the locations (nodes, racks) of the block, and other diagnostic info such as whether it is under-replicated or corrupted)

Examples:

Access another cluster by addressing its active NameNode: hadoop fs -ls hdfs://<active-namenode-ip>:8020
Within the cluster, use the nameservice: hadoop fs -ls hdfs://emr-cluster
NameNode metadata sizing: as a rule of thumb, plan about 1 GB of NameNode heap per 1 million files (e.g. roughly 20 GB of heap for 20 million files).

1. Check whether a NameNode is active or standby:
hdfs haadmin -getServiceState nn1

2. Manual active/standby failover
Not supported when dfs.ha.automatic-failover.enabled=true.
hdfs haadmin -failover -forcefence -forceactive nn2 nn1    (fail over from nn2 to nn1)
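
A minimal sketch of checking both NameNodes and then failing over (assuming the service IDs nn1 and nn2 used above; without the force flags, haadmin refuses the failover unless both nodes are healthy):

hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn2 nn1       # graceful failover: hand the active role from nn2 to nn1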

Refresh HDFS configuration (superuser proxy group mappings) without a restart:
hdfs dfsadmin -refreshSuperUserGroupsConfiguration

Check replication factors (second column of the listing):
hdfs dfs -lsr /    (deprecated; hdfs dfs -ls -R / is the current form)

Check the replication of every file (block report):
hdfs fsck / -files -blocks
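
A quick sketch for summarizing per-file replication; it relies on the repl=N field printed on each block line of the fsck report, and /data is just a placeholder path:

hdfs fsck /data -files -blocks | grep -o 'repl=[0-9]*' | sort | uniq -c    # count blocks per replication factor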

Check disk usage:
hdfs dfs -du -h /data/

Show block info and the DataNode locations of each block:
hdfs fsck /path -files -blocks -locations

Cluster status report: hdfs dfsadmin -report
Block health check: hdfs fsck /
Find missing blocks: hdfs fsck / | egrep -v '^\.+$' | grep -v eplica    (drops the progress dots and all replica lines)
Block recovery (lease recovery): hdfs debug recoverLease -path <file> -retries <retries>
If the missing blocks cannot be recovered, the only remaining option is to delete the affected files: hdfs fsck / -delete
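
A sketch tying the triage steps above together (the /tmp path is just an example and recoverLease only helps for blocks stuck in an open-for-write state; it cannot bring back data that is physically gone):

# 1. List the affected files (one block id and path per line; the exact format varies slightly between releases)
hdfs fsck / -list-corruptfileblocks
# 2. Try lease recovery on each affected path collected from the listing above into /tmp/corrupt_files
while read f; do hdfs debug recoverLease -path "$f" -retries 3; done < /tmp/corrupt_files
# 3. Re-check, then remove anything still unrecoverable
hdfs fsck / -delete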

Fixing under-replicated blocks:
First collect the paths of the under-replicated files:
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
Then restore the target replication factor for each path:
for hdfsfile in $(cat /tmp/under_replicated_files); do echo "Fixing $hdfsfile :"; hadoop fs -setrep 3 $hdfsfile; done

When writing, an HDFS DataNode picks a volume (disk) with one of two policies:
1. Round-robin (RoundRobinVolumeChoosingPolicy);
2. Available space (AvailableSpaceVolumeChoosingPolicy).

The policy is set by dfs.datanode.fsdataset.volume.choosing.policy in hdfs-site.xml; valid values are org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy and org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy. When the available-space policy is chosen, two further properties matter.

dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
Default 10737418240 (10 GB). If the gap between the largest and smallest available space across volumes is below this threshold, the volumes are considered balanced and selection falls back to round-robin.
dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction
Default 0.75. The probability that a new block goes to a volume with more free space; anything at or below 0.5 defeats the purpose of the policy, so the default is usually kept.
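
A small sketch for checking which policy and thresholds are in the configuration (hdfs getconf reads the local config files and may report a key as missing if it is not set explicitly; the values themselves belong in hdfs-site.xml on the DataNodes and need a DataNode restart to take effect):

hdfs getconf -confKey dfs.datanode.fsdataset.volume.choosing.policy
hdfs getconf -confKey dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold
hdfs getconf -confKey dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction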

Problem 1: intermittent HDFS read/write failures

2021-06-09 16:11:06,737 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-315556420-10.0.114.19-1595576268534:blk_2885184325_1811621149, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[xxxxx:50010, xxxxx:50010]
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1552)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1489)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1402)
	at java.lang.Thread.run(Thread.java:748)
2021-06-09 16:11:06,737 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-315556420-xxx-1595576268534:blk_2885184325_1811621149, type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=2:[xxxx:50010, xxxx:50010] terminating
2021-06-09 16:11:06,738 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-315556420-xxxx-1595576268534:blk_2885184325_1811621149 received exception java.io.IOException: Premature EOF from inputStream
2021-06-09 16:11:06,738 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: m89e09206.cloud.e11.am97:50010:DataXceiver error processing WRITE_BLOCK operation  src: /xxxx:50913 dst: /10.0.114.22:50010
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:208)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:211)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:521)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:923)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:854)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:166)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:288)
	at java.lang.Thread.run(Thread.java:748)
2021-06-09 16:11:06,740 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: m89e09206.cloud.e11.am97:50010:DataXceiverServer: 
java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
	at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:150)
	at java.lang.Thread.run(Thread.java:748)
2021-06-09 16:11:06,741 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-315556420-10.0.114.19-1595576268534:blk_2885184329_1811621153 src: /xxx:50915 dest: /xxxx:50010
2021-06-09 16:11:06,746 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: m89e09206.cloud.e11.am97:50010:DataXceiverServer: 
java.io.IOException: Xceiver count 4098 exceeds the limit of concurrent xcievers: 4096
	at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:150)
	at java.lang.Thread.run(Thread.java:748)

Fix: raise dfs.datanode.max.transfer.threads to 8192 (the log above shows the default limit of 4096 concurrent xceivers being exceeded).
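
A sketch for confirming the current limit; the new value goes into hdfs-site.xml on every DataNode and takes effect after a DataNode restart (dfs.datanode.max.transfer.threads is the current property name; dfs.datanode.max.xcievers is the legacy alias):

hdfs getconf -confKey dfs.datanode.max.transfer.threads    # 4096 by default, matching the limit in the log
# then set the property to 8192 in hdfs-site.xml and restart the DataNodes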

Problem 2: handling corrupt blocks (disk damage, interrupted writes, or block data deleted directly)

1. hdfs fsck / -list-corruptfileblocks    (list the files with corrupt or missing blocks)
2. hdfs fsck / -delete    (delete the corrupt files; a narrower path can be given if the damage is concentrated in one directory)
3. Re-import the lost data

Problem 3: MR job fails with OOM before the task logic even runs

AM log:
21/05/10 15:15:13 INFO mapreduce.Job: Task Id : attempt_1617064346277_101596_m_000000_1, Status : FAILED
Error: Java heap space
21/05/10 15:15:16 INFO mapreduce.Job: Task Id : attempt_1617064346277_101596_m_000000_2, Status : FAILED
Error: Java heap space
Map task log:
2021-05-10 17:02:41,893 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://hdfs-cluster/tmp/gc/wordcount.txt:0+52
2021-05-10 17:02:41,988 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:1000)
at org.apache.hadoop.mapred.MapTask.createSortingCollector(MapTask.java:408)
at org.apache.hadoop.mapred.MapTask.access$100(MapTask.java:82)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:710)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:782)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

The likely cause is that mapreduce.task.io.sort.mb was set too large: the sort buffer is allocated up front in MapOutputBuffer.init (see the stack trace), so it has to fit inside the map task's -Xmx heap.
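
A hedged sketch of keeping the sort buffer and map heap consistent (the jar name, class and output path are placeholders; the -D options are only picked up when the driver goes through ToolRunner/GenericOptionsParser):

hadoop jar wordcount.jar WordCount \
  -Dmapreduce.task.io.sort.mb=256 \
  -Dmapreduce.map.memory.mb=2048 \
  -Dmapreduce.map.java.opts=-Xmx1638m \
  /tmp/gc/wordcount.txt /tmp/gc/out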

Problem 4: HDFS disk balancing

1) Rebalancing data across DataNodes
(1) Start the balancer:
start-balancer.sh -threshold 10    (disk utilization between nodes should differ by no more than 10%)
(2) Stop the balancer:
stop-balancer.sh

2) Balancing across disks within a DataNode (Hadoop 3.x); the full sequence is sketched below
(1) Generate a balancing plan:
hdfs diskbalancer -plan hostname
(2) Execute the plan:
hdfs diskbalancer -execute xxxx.json
(3) Query the progress of the running plan:
hdfs diskbalancer -query hostname
(4) Cancel the plan:
hdfs diskbalancer -cancel xxx.json
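
A sketch of one full disk-balancer pass on a single DataNode (dn01.example.com is a placeholder hostname; -plan prints the location of the generated <hostname>.plan.json, and the feature requires dfs.disk.balancer.enabled=true):

hdfs diskbalancer -plan dn01.example.com                 # writes dn01.example.com.plan.json and prints its path
hdfs diskbalancer -execute dn01.example.com.plan.json    # run the data moves described in the plan
hdfs diskbalancer -query dn01.example.com                # check progress of the plan
hdfs diskbalancer -cancel dn01.example.com.plan.json     # abort if needed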
