Error during job execution
java.io.IOException: Unable to close file because the last block BP-1820686335-10.201.48.27-1448169181587:blk_1850383542_781036567 does not have enough number of replicas.
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2705)
at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2667)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2621)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:380)
at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1060)
at org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:67)
at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:83)
at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:937)
at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2299)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2388)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2119)
at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2081)
at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1972)
at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1898)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:514)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)
at java.lang.Thread.run(Thread.java:745)
Reference: [HDFS] A Hive task reported the HDFS exception "last block does not have enough number of replicas". This is known to be caused by heavy load on the Hadoop servers; as a workaround, simply re-run the Hive SQL script. For a lasting fix, reduce task concurrency or cap CPU usage to ease network pressure, so that the DataNodes can report their block status to the NameNode in time.
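Besides re-running, some deployments also widen the HDFS client's retry budget for exactly this wait: `dfs.client.block.write.locateFollowingBlock.retries` controls how many times the client retries before `close()` fails with "does not have enough number of replicas". Verify the property name and default against your HDFS version; the Hive invocation and script name below are hypothetical.

```shell
# Hypothetical one-off invocation: raise the client-side retry count
# (dfs.client.block.write.locateFollowingBlock.retries, default 5 in many
# versions) so close() waits longer for the last block to reach minimal
# replication. my_script.sql is a placeholder.
hive --hiveconf dfs.client.block.write.locateFollowingBlock.retries=10 -f my_script.sql
```

This only buys time under load; it does not replace reducing concurrency as recommended above.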
Conclusion:
- Reduce system load. When the incident occurred the cluster was under heavy load: all 32 CPU cores were fully (100%) allocated to running MR tasks. At least 20% of CPU should be kept free.
- The deeper cause is that there are too many blocks. Consider a full directory scan to identify the directories holding too many small files, then clean them up.
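The directory scan above can be sketched with `hdfs dfs -count`, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME: sorting on the second column surfaces the small-file hotspots. The `/data` parent path is an assumption for illustration.

```shell
# Rank paths by file count, highest first, so directories stuffed with
# small files appear at the top. Input lines are `hdfs dfs -count` output:
#   DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
rank_by_file_count() { sort -k2,2nr | head -20; }

# Intended cluster usage (/data is an assumed path):
#   hdfs dfs -count /data/* | rank_by_file_count
```

Directories that top this list with huge FILE_COUNT but modest CONTENT_SIZE are the small-file candidates to compact or archive.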
java.lang.IllegalArgumentException: java.net.UnknownHostException
Resolution path: inspecting the ResourceManager revealed a node whose hostname could not be resolved; after removing that node, the problem went away.
Some things remain unexplained, though: in YARN we sometimes only saw "hostname not found" and no container was allocated, yet in other cases a container was allocated to that node and the corresponding application still finished successfully.
2017-12-21 13:34:36,732 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hive OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1513834407876_0012 CONTAINERID=container_e91_1513834407876_0012_01_000086
2017-12-21 13:34:36,732 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_e91_1513834407876_0012_01_000086 of capacity <memory:4096, vCores:1> on host slave19.bl.bigdata:8041, which has 6 containers, <memory:27648, vCores:12> used and <memory:54272, vCores:36> available after allocation
2017-12-21 13:34:36,748 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_e91_1513694506641_4872_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: BGhadoop08
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:406)
at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:256)
at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:220)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:455)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:823)
at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:532)
at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
Caused by: java.net.UnknownHostException: BGhadoop08
**Analysis**
- Are the stuck tasks all on the server that has no hostname configured?
- How is Hadoop's speculative execution triggered?
- Why can some tasks be allocated to that host without a hostname, while others cannot?
In fact, the exception already states the cause very clearly: UnknownHostException.
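Since the message already names the unresolvable host, a first check is whether the ResourceManager host can resolve it at all. `BGhadoop08` is taken from the log above; the helper below is just a small convenience wrapper around `getent`.

```shell
# Check whether this node can resolve a hostname that appears in an
# UnknownHostException; prints a hint if resolution fails.
check_host() {
  if getent hosts "$1" >/dev/null; then
    echo "resolvable: $1"
  else
    echo "unresolvable: $1 (fix /etc/hosts or DNS, then restart the NodeManager)"
  fi
}

check_host BGhadoop08
```

Run it on the ResourceManager host (and on the NodeManagers) to confirm every cluster member resolves every other member's hostname consistently.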
Cluster services
"The host's NTP service cannot be found, or the service did not respond to a clock-offset request"
Scenario
The CDH cluster started successfully, but some hosts report "The host's NTP service cannot be found, or the service did not respond to a clock-offset request".
Troubleshooting ideas
- The NTP service did not start properly
- A CDH background daemon is misbehaving
Fix steps
1. First stop the cluster services from the CDH web UI
2. Start the NTP service on every host
systemctl restart ntpd
3. Restart cloudera-scm-agent on every host
systemctl restart cloudera-scm-agent
Wait about 5 minutes, then check the CDH console; the warning should be gone.
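Steps 2 and 3 can be looped over every host instead of run by hand. The node names below are placeholders and passwordless ssh is assumed; `DRY_RUN=1` only prints what would be executed.

```shell
# Restart ntpd and cloudera-scm-agent on each given host.
# With DRY_RUN=1, only print the planned action (no ssh is made).
restart_time_sync() {
  for h in "$@"; do
    echo "ssh $h: restart ntpd + cloudera-scm-agent"
    [ "${DRY_RUN:-0}" = 1 ] || ssh "$h" "systemctl restart ntpd && systemctl restart cloudera-scm-agent"
  done
}

# Dry run with placeholder node names:
DRY_RUN=1 restart_time_sync node01 node02 node03
```

Drop `DRY_RUN=1` to actually run the restarts once the host list is correct.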
View all the YARN job-history logs for a given day (they are archived by date under /user/history/done):
hdfs dfs -get /user/history/done/2018/02/09/