Project scenario:
ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs have errors: [ /opt/ha/hadoop/data/nm-local-dir : Cannot create directory : /opt/ha/hadoop/data/nm-local-dir, error mkdir of /opt/ha/hadoop/data/nm-local-dir failed ]
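Problem description:
While configuring Kerberos for Hadoop, the NodeManager kept logging the error below, and for a long time I could not resolve it: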
2023-06-20 16:00:01,301 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2023-06-20 16:00:01,350 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2023-06-20 16:00:01,350 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system started
2023-06-20 16:00:01,385 INFO org.apache.hadoop.security.UserGroupInformation: Login successful for user nm/hadoop103@EXAMPLE.COM using keytab file /etc/security/keytab/nm.service.keytab
2023-06-20 16:00:01,402 INFO org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-06-20 16:00:01,412 INFO org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-06-20 16:00:01,462 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Unable to create directory /opt/ha/hadoop/data/nm-local-dir error mkdir of /opt/ha/hadoop/data/nm-local-dir failed, removing from the list of valid directories.
2023-06-20 16:00:01,464 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs have errors: [ /opt/ha/hadoop/data/nm-local-dir : Cannot create directory : /opt/ha/hadoop/data/nm-local-dir, error mkdir of /opt/ha/hadoop/data/nm-local-dir failed ]
2023-06-20 16:00:01,485 INFO org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@62656be4
2023-06-20 16:00:01,487 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.event.LogHandlerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
2023-06-20 16:00:01,488 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadService
2023-06-20 16:00:01,489 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: AMRMProxyService is disabled
2023-06-20 16:00:01,489 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: per directory file limit = 8192
2023-06-20 16:00:01,492 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2023-06-20 16:00:01,497 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
2023-06-20 16:00:01,530 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service mapreduce_shuffle, "mapreduce_shuffle"
2023-06-20 16:00:01,571 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@68f1b
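Root cause analysis:
The NodeManager keeps reporting that it cannot create this directory, even though I had already created it and granted permissions on it. I still do not know the exact cause. It may be that the parent directory, data, is owned by hdfs: the directory listing in the solution shows it as drwx------ hdfs hadoop, which would stop the yarn user from traversing into it.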
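A quick way to test the parent-directory hypothesis is to print the owner and mode of every ancestor of the failing path. The following is only a sketch: `walk_perms` is a hypothetical helper name, and it assumes GNU `stat` (the `-c` format option), as found on typical Linux hosts.

```shell
# Print mode, owner:group and name for a path and each of its ancestors.
# walk_perms is a hypothetical helper; it relies on GNU stat's -c option.
walk_perms() {
  p="$1"
  while [ "$p" != "/" ] && [ "$p" != "." ] && [ -n "$p" ]; do
    if [ -e "$p" ]; then
      stat -c '%A %U:%G %n' "$p"
    else
      echo "missing $p"
    fi
    p=$(dirname "$p")
  done
}

# The path from the error message above.
walk_perms /opt/ha/hadoop/data/nm-local-dir
```

Any ancestor that shows no execute bit for the NodeManager's user or group would explain the mkdir failure. On systems with util-linux, `namei -l <path>` gives a similar per-component listing.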
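Solution:
On every node in the Hadoop cluster, I recreated the nm-local-dir directory in a new location, directly under /opt/ha/hadoop, and granted permissions on it: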
[root@hadoop104 logs]# mkdir /opt/ha/hadoop/nm-local-dir/
[root@hadoop104 logs]# chown -R yarn:hadoop /opt/ha/hadoop/nm-local-dir/
[root@hadoop104 logs]# chmod -R 775 /opt/ha/hadoop/nm-local-dir/
[root@hadoop104 logs]# cd ..
[root@hadoop104 hadoop]# ll
total 212
drwxr-xr-x 2 sarah sarah 4096 Sep 12 2019 bin
drwx------ 6 hdfs hadoop 4096 Jun 20 14:51 data
drwxr-xr-x 3 root hadoop 4096 Sep 12 2019 etc
drwxr-xr-x 2 sarah sarah 4096 Sep 12 2019 include
drwxr-xr-x 3 sarah sarah 4096 Sep 12 2019 lib
drwxr-xr-x 4 sarah sarah 4096 Sep 12 2019 libexec
-rw-rw-r-- 1 sarah sarah 147145 Sep 4 2019 LICENSE.txt
drwxrwxr-x 3 hdfs hadoop 4096 Jun 21 08:20 logs
drwxrwxr-x 2 yarn hadoop 4096 Jun 21 08:59 nm-local-dir
-rw-rw-r-- 1 sarah sarah 21867 Sep 4 2019 NOTICE.txt
-rw-rw-r-- 1 sarah sarah 1366 Sep 4 2019 README.txt
drwxr-xr-x 3 sarah sarah 4096 Jun 20 16:08 sbin
drwxr-xr-x 4 sarah sarah 4096 Sep 12 2019 share
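Before restarting anything, it can help to confirm that the new directory is actually usable. This is a minimal sketch: `check_dir` is a hypothetical helper name, and it only tests access for whichever user runs it.

```shell
# Report whether the current user can enter and write to a directory.
# check_dir is a hypothetical helper, not part of Hadoop.
check_dir() {
  if [ -d "$1" ] && [ -x "$1" ] && [ -w "$1" ]; then
    echo "ok: $1"
  else
    echo "not usable: $1"
  fi
}

check_dir /opt/ha/hadoop/nm-local-dir
```

Run it as the account that starts the NodeManager rather than as root, because root bypasses ordinary permission checks and would report "ok" either way.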
Edit yarn-site.xml and change yarn.nodemanager.local-dirs to /opt/ha/hadoop/nm-local-dir:
` <property>
<description>List of directories to store localized files in. An
application's localized file directory will be found in:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
Individual containers' work directories, called container_${contid}, will
be subdirectories of this.
</description>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/ha/hadoop/nm-local-dir</value>
</property>`
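To double-check what a node's yarn-site.xml actually contains after the edit, the value can be read back from the file. The sketch below uses a hypothetical helper, `get_prop`. It assumes the `<name>` and `<value>` elements each sit on their own line, as in the snippet above, and that the file lives at /opt/ha/hadoop/etc/hadoop/yarn-site.xml, which is a guess based on the standard Hadoop layout.

```shell
# Print the <value> that follows a given <name> in a Hadoop *-site.xml file.
# get_prop is a hypothetical helper; it is a line-based match, not an XML parser.
get_prop() {
  awk -v key="$2" '
    index($0, "<name>" key "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, "")
      sub(/<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Assumed config location; adjust to the actual installation.
CONF=/opt/ha/hadoop/etc/hadoop/yarn-site.xml
if [ -f "$CONF" ]; then
  get_prop "$CONF" yarn.nodemanager.local-dirs
fi
```

If the printed path is not /opt/ha/hadoop/nm-local-dir, the edited file has not reached that node yet.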
Distribute yarn-site.xml to all nodes and restart the YARN cluster.
All nodes changed from the unhealthy state to active. Done.
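The distribute-and-restart step can be sketched roughly as follows. The host list is an assumption (only hadoop103 and hadoop104 appear in the logs above), as is the config path. The function prints the copy commands instead of running them, so they can be reviewed first.

```shell
# Hypothetical host list and config path; adjust both to the actual cluster.
HOSTS="hadoop102 hadoop103 hadoop104"
CONF=/opt/ha/hadoop/etc/hadoop/yarn-site.xml

# Print (rather than run) one scp command per host.
plan_distribution() {
  for h in $HOSTS; do
    echo "scp $CONF $h:$CONF"
  done
}

plan_distribution

# After copying, restart YARN and inspect the node states, for example:
#   stop-yarn.sh && start-yarn.sh
#   yarn node -list -all
```

Once the NodeManagers have re-registered, `yarn node -list -all` should no longer list any of them as UNHEALTHY.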