Hadoop Cluster Operations (updating)

Errors While Jobs Are Running

Unable to close file because the last block BP-1820686335-10.201.48.27-144816918

java.io.IOException: Unable to close file because the last block BP-1820686335-10.201.48.27-1448169181587:blk_1850383542_781036567 does not have enough number of replicas.
        at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2705)
        at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2667)
        at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2621)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:380)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1060)
        at org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:67)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:83)
        at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:937)
        at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2299)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2388)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2119)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2081)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1972)
        at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1898)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:514)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)
        at java.lang.Thread.run(Thread.java:745)

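Reference: "[HDFS] Hive job reports an HDFS exception: last block does not have enough number of replicas". According to that article, the error is caused by excessive load on the Hadoop servers, and re-running the Hive SQL script is enough to get past it. To fix the problem for good, lower the job concurrency or cap CPU usage to relieve network transfer pressure, so that the DataNodes (DN) can report block status to the NameNode (NN) without delay.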

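Conclusion:
Reduce the system load. The cluster was heavily loaded when the error occurred: all 32 CPU cores (100%) had been allocated to MR tasks. At least 20% of the CPU should be left free.
The main cause is still that there are too many blocks. Consider scanning the directory tree, listing the directories that hold too many small files, and then cleaning those up.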
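
The directory scan suggested above could be sketched as follows. This is a minimal sketch, not the script actually used on this cluster. It assumes the standard `hdfs dfs -count` column layout (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME). The function name `flag_small_file_dirs`, the 1 MiB and 1000-file thresholds, and the warehouse path in the usage comment are hypothetical.

```shell
# Reads `hdfs dfs -count` output from stdin and prints, tab-separated:
#   path, file count, average file size in bytes
# for every directory with many files whose average size is small.
# Paths containing spaces are not handled by this sketch.
flag_small_file_dirs() {
  local max_avg_bytes="${1:-1048576}"   # hypothetical threshold: 1 MiB
  local min_files="${2:-1000}"          # hypothetical threshold: 1000 files
  awk -v max_avg="$max_avg_bytes" -v min_files="$min_files" '
    NF >= 4 && $2 > 0 && $2 >= min_files && ($3 / $2) < max_avg {
      printf "%s\t%d\t%d\n", $4, $2, $3 / $2
    }'
}

# Usage on a real cluster (the path is illustrative only):
#   hdfs dfs -count /user/hive/warehouse/* | flag_small_file_dirs 1048576 1000
```

Directories that show up here are candidates for merging or archiving; how to compact them depends on the table format and is outside this sketch.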

 java.lang.IllegalArgumentException: java.net.UnknownHostException

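Resolution: checking the ResourceManager showed that the hostname of one node could not be resolved. After that node was removed, the problem went away.

Some things are still unexplained, though. In some cases YARN only logged that the hostname could not be found and no container was allocated. In other cases a container was allocated on that node, yet the corresponding application still finished successfully.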

2017-12-21 13:34:36,732 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hive     OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1513834407876_0012    CONTAINERID=container_e91_1513834407876_0012_01_000086
2017-12-21 13:34:36,732 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_e91_1513834407876_0012_01_000086 of capacity <memory:4096, vCores:1> on host slave19.bl.bigdata:8041, which has 6 containers, <memory:27648, vCores:12> used and <memory:54272, vCores:36> available after allocation
2017-12-21 13:34:36,748 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_e91_1513694506641_4872_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: BGhadoop08
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:406)
        at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:256)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:220)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:455)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:823)
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:532)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
        at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
Caused by: java.net.UnknownHostException: BGhadoop08

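**Analysis**

  1. Were all the stuck tasks running on the server whose hostname was not configured?
  2. How is Hadoop's speculative execution triggered?
  3. Why could some tasks be assigned to the node without a resolvable hostname while others could not?

In fact the log states the cause very clearly: UnknownHostException.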
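
Since the exception message names the host, a quick way to find every affected node is to pull the hostnames out of the ResourceManager log. The sketch below is one possible approach, not the procedure used here. The function name `list_unknown_hosts` and the log path in the usage comment are hypothetical; the only thing taken from the log above is the `UnknownHostException: <hostname>` message shape.

```shell
# Reads a ResourceManager log from stdin and prints each distinct hostname
# that appears after "UnknownHostException: ", one per line.
list_unknown_hosts() {
  grep -o 'UnknownHostException: [A-Za-z0-9._-]\{1,\}' \
    | awk '{ print $2 }' \
    | sort -u
}

# Usage (the log path is illustrative only). `getent hosts` then shows whether
# the ResourceManager machine can resolve each name now:
#   list_unknown_hosts < /var/log/hadoop-yarn/resourcemanager.log \
#     | while read -r h; do getent hosts "$h" >/dev/null || echo "cannot resolve: $h"; done
```

Any name still reported as unresolvable would need an entry in DNS or in `/etc/hosts` on the ResourceManager host.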

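Cluster Services

The host's NTP service could not be located or did not respond to a request for the clock offset

Scenario

The CDH cluster started successfully, but some hosts report "The host's NTP service could not be located or did not respond to a request for the clock offset".

Possible Causes

  1. The NTP service did not start properly
  2. The CDH background process is in an abnormal state

Fix Steps

1. Stop the CDH services first: stop the cluster services from the web UI
2. Start the NTP service on every host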

systemctl restart ntpd

3. Restart cloudera-scm-agent on every host

systemctl restart cloudera-scm-agent

Wait 5 minutes, then check the result in the CDH console; the warning should be gone.
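
Before restarting the agent, it can help to confirm on each host that ntpd has actually picked a time source. The following is a small sketch under the assumption that the host runs ntpd (as the `systemctl restart ntpd` step implies) and that `ntpq` is installed; the function name `has_ntp_sync_peer` is hypothetical. It relies on the `ntpq` peer listing convention that a leading `*` marks the peer ntpd is currently synchronized to.

```shell
# Succeeds (exit status 0) when the `ntpq -pn` output on stdin contains a
# line starting with "*", i.e. ntpd has selected a synchronization peer.
has_ntp_sync_peer() {
  grep -q '^\*'
}

# Usage on a host (synchronization can take a few minutes after a restart):
#   if ntpq -pn | has_ntp_sync_peer; then
#     echo "ntpd is synchronized"
#   else
#     echo "ntpd has no sync peer yet"
#   fi
```

If no peer is selected after several minutes, the NTP server configuration itself is worth checking before looking at the Cloudera agent.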

Viewing All YARN Logs

 hdfs dfs -get /user/history/done/2018/02/09/
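
The path in the command above follows a year/month/day layout, so fetching another day only means changing the date part. A minimal sketch of a helper for that, assuming the same `/user/history/done` base directory as above; the function name `history_done_path` and the local target directory in the usage comment are hypothetical.

```shell
# Prints the job history directory for a given day (YYYY-MM-DD).
# The optional second argument overrides the base directory.
history_done_path() {
  local day="$1"                         # expected form: YYYY-MM-DD
  local base="${2:-/user/history/done}"  # default taken from the command above
  local y="${day%%-*}"                   # text before the first "-"
  local rest="${day#*-}"                 # text after the first "-"
  local m="${rest%%-*}"
  local d="${rest#*-}"
  printf '%s/%s/%s/%s/\n' "$base" "$y" "$m" "$d"
}

# Usage (the local directory name is illustrative only):
#   hdfs dfs -get "$(history_done_path 2018-02-09)" ./history-2018-02-09
# When log aggregation is enabled, container logs for a single application
# can also be fetched with:
#   yarn logs -applicationId application_1513834407876_0012
```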
